# Run ADMIXTURE

ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.

### Convert VCF to PED/MAP Plink format

ADMIXTURE takes PLINK-format input; this notebook uses PED/MAP files. Requires:
    - PLINK 1.9: https://www.cog-genomics.org/plink2
    - The following PLINK arguments are required:
        - --recode 12 : writes PED/MAP output with alleles coded as 1 (A1, usually the minor allele) and 2 (A2); missing genotypes are coded as 0
        - --geno 0.1 : removes variants with a missing-genotype rate above 10% (missing genotypes will cause an ADMIXTURE error)
    - The following PLINK arguments are optional:
        - --no-fid : no family ID provided
        - --no-parents : no parent info provided
        - --no-sex : no sex info provided
        - --no-pheno : no phenotype info provided
        - --double-id : use the subject ID as the family ID, since none is provided
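
As a rough sketch, the same PLINK invocation can be assembled in Python, which keeps the flag list explicit and reusable. The function name `build_plink_command`, the bare `plink` executable name, and the memory and thread defaults are assumptions for illustration. The output prefix imitates the shell expansion `${f%%.*}` by cutting the file name at its first dot.

```python
import os


def build_plink_command(vcf_path, plink="plink", memory_mb=30000, threads=8):
    """Return a PLINK 1.9 argument list that converts one VCF to PED/MAP.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    # Imitate the shell expansion ${f%%.*}: keep everything before the first dot.
    prefix = os.path.basename(vcf_path).split(".", 1)[0]
    return [
        plink,
        "--vcf", vcf_path,
        "--recode", "12",   # PED/MAP output with alleles coded 1/2
        "--geno", "0.1",    # drop variants with more than 10% missing calls
        "--double-id",      # use the sample ID as both family and individual ID
        "--out", prefix,
        "--memory", str(memory_mb),
        "--threads", str(threads),
    ]


cmd = build_plink_command("LP6005599-DNA_A02.genome.vcf.gz")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with unusual file names.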
    

In [212]:
# set working directory
import os
os.chdir('/Users/summerrae/impala_scripts/add_annots/testing')

### Run Plink

In [None]:
!for f in *.vcf.gz; do /Users/summerrae/tools/plink --vcf "$f" --recode 12 --out "${f%%.*}" --memory 30000 --threads 8 --no-fid --no-parents --no-sex --no-pheno --double-id --geno 0.1; done

PLINK v1.90b3w 64-bit (3 Sep 2015)         https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to LP6005599-DNA_A02_1to15.log.
Options in effect:
  --chr 1-15
  --double-id
  --geno 0.1
  --memory 15000
  --no-fid
  --no-parents
  --no-pheno
  --no-sex
  --out LP6005599-DNA_A02_1to15
  --recode 12
  --threads 4
  --vcf LP6005599-DNA_A02.genome.vcf.gz

8192 MB RAM detected; reserving 15000 MB for main workspace.
--vcf: LP6005599-DNA_A02_1to15-temporary.bed +
LP6005599-DNA_A02_1to15-temporary.bim + LP6005599-DNA_A02_1to15-temporary.fam
written.
(48296973 variants skipped.)


### Running Admixture

Requires: 
    - ADMIXTURE program: https://www.genetics.ucla.edu/software/admixture/
    - 14 = number of ancestral populations to use as the 'K' value
    - -jN = number of threads to use

In [23]:
!for f in *.ped; do ~/tools/admixture_macosx-1.23/admixture -j5 $f 14; done

Parallel execution requested.  Will use 5 threads.
Random seed: 43
Point estimation method: Block relaxation algorithm
Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
Point estimation will terminate when objective function delta < 0.0001
Estimation of standard errors disabled; will compute point estimates only.
****                   ADMIXTURE Version 1.23                   ****
****                    Copyright 2008-2013                     ****
****          David Alexander, John  Novembre, Ken Lange        ****
****                   Please cite our paper!                   ****
****   Information at www.genetics.ucla.edu/software/admixture  ****

Size of G: 1x3598517
Performing five EM steps to prime main algorithm
1 (EM) 	Elapsed: 1.908	Loglikelihood: -58826.4	(delta): 5.08467e+06
2 (EM) 	Elapsed: 2.193	Loglikelihood: -58659.3	(delta): 167.068
3 (EM) 	Elapsed: 1.767	Loglikelihood: -58635.3	(delta): 23.9763
4 (EM) 	Elapsed: 1.804	Loglikelihood: -58630.9	(delta)
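
ADMIXTURE writes its ancestry estimates to a `.Q` file with one row per individual and K whitespace-separated fractions that should sum to roughly 1. The sketch below reads such a file and checks that shape before any downstream step. The helper name `read_q_file`, the tolerance, and the demo values are assumptions made for illustration; the numbers are invented, not real ADMIXTURE output.

```python
import os
import tempfile


def read_q_file(path, k, tol=1e-3):
    """Read an ADMIXTURE .Q file and sanity-check it (hypothetical helper)."""
    rows = []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            values = [float(x) for x in line.split()]
            if len(values) != k:
                raise ValueError(
                    f"line {line_no}: expected {k} columns, got {len(values)}"
                )
            if abs(sum(values) - 1.0) > tol:
                raise ValueError(
                    f"line {line_no}: fractions sum to {sum(values)}"
                )
            rows.append(values)
    return rows


# Demonstration with made-up numbers (not real ADMIXTURE output).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.3.Q")
    with open(demo_path, "w") as handle:
        handle.write("0.700000 0.200000 0.100000\n")
    q_rows = read_q_file(demo_path, k=3)
```

For the run in this notebook, `k` would be 14.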

### Add Vendor_ID column to each Q file

Append a column containing the vendor ID to each .Q file. Note:
    - -i.q2 : BSD sed on macOS requires a backup suffix after -i. The file is still edited in place, and the unmodified original is saved next to it with a .q2 extension. With GNU sed (Linux), a plain -i works and no backup files are created.

In [138]:
!for f in *.Q; do sed -i.q2 "s/$/\ ${f%%.*}/" $f; done
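
A portable alternative that behaves the same on macOS and Linux is to do the edit in Python. This is a minimal sketch, not the notebook's original method: the function name `append_vendor_id` is hypothetical, and the vendor ID is taken to be the file name up to its first dot, matching `${f%%.*}`. The demo file and its values are invented.

```python
import os
import tempfile


def append_vendor_id(q_path):
    """Append the vendor ID (file name up to its first dot) to every line."""
    vendor_id = os.path.basename(q_path).split(".", 1)[0]
    with open(q_path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    # Rewrite the file in place with the ID as an extra space-separated column.
    with open(q_path, "w") as handle:
        for line in lines:
            handle.write(f"{line} {vendor_id}\n")
    return vendor_id


# Demonstration on a throwaway file with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "SAMPLE1.3.Q")
    with open(demo_path, "w") as handle:
        handle.write("0.7 0.2 0.1\n")
    append_vendor_id(demo_path)
    with open(demo_path) as handle:
        tagged = handle.read()
```

To process a whole directory, loop over `glob.glob("*.Q")`. Like the sed command, running it twice appends the ID twice.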

To Do: 
    - Find out which population corresponds to which q value 
    (emailed admixture author for help, waiting for response)

### Create merged Q score file 

In [146]:
!cat *.Q | sed 's/ /,/g' > master_admix.csv
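
The merge can also be sketched with Python's `csv` module. The helper name `merge_q_files` is hypothetical, and the demo files contain invented values. The sketch opens the output in write mode, so re-running it replaces the file instead of adding duplicate rows.

```python
import csv
import glob
import os
import tempfile


def merge_q_files(pattern, out_csv):
    """Merge whitespace-delimited .Q files into one CSV (hypothetical helper)."""
    n_rows = 0
    # Mode "w" overwrites any existing output file.
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out, lineterminator="\n")
        for q_path in sorted(glob.glob(pattern)):
            with open(q_path) as handle:
                for line in handle:
                    fields = line.split()
                    if fields:  # ignore blank lines
                        writer.writerow(fields)
                        n_rows += 1
    return n_rows


# Demonstration with two made-up, already-tagged .Q files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("A.2.Q", "0.7 0.3 A\n"), ("B.2.Q", "0.4 0.6 B\n")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(body)
    out_path = os.path.join(tmp, "master_admix.csv")
    merged_count = merge_q_files(os.path.join(tmp, "*.Q"), out_path)
    with open(out_path) as handle:
        merged_text = handle.read()
```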

## Upload Admixture Table to Impala

In [147]:
# read in master_admix.csv as a pandas table (the file has no header row)
import pandas as pd
master_admix = pd.read_csv('./master_admix.csv', header=None)

### Connect to Impala 

In [183]:
import ibis
import os

# connect to impala with ibis
hdfs_port = os.environ.get('glados20', 50070)
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True

### Upload admix csv file to HDFS

In [199]:
path = '/user/selasady/'
file_name = 'master_admix.csv'
admix_file = path + file_name  # path already ends with '/'

# upload admix file
hdfs.put(path, file_name, verbose=True)



'/user/selasady/master_admix.csv'

### Load the csv as an ibis table to add the sample ID

In [204]:
# define table schema for csv file
schema = ibis.schema([
    ('pop1', 'float'), 
    ('pop2', 'float'),
    ('pop3', 'float'),
    ('pop4', 'float'),
    ('pop5', 'float'),
    ('pop6', 'float'),
    ('pop7', 'float'),
    ('pop8', 'float'),
    ('pop9', 'float'),
    ('pop10', 'float'),
    ('pop11', 'float'),
    ('pop12', 'float'),
    ('pop13', 'float'),
    ('pop14', 'float'),
    ('vendor_id', 'string')
])

# create ibis object from admix csv
admix = con.delimited_file(path, schema)

# create ibis object from mapping table
map_tbl = con.table('gms_metadata', database='p7_itmi')
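
The 15-entry schema above follows a regular pattern, so it can be generated from K instead of being typed out. This is a small sketch under the assumption that the columns stay named `pop1` to `popK` followed by `vendor_id`; the variable names are illustrative.

```python
K = 14  # number of ancestral populations used in the ADMIXTURE run

# One float column per population, followed by the vendor ID column.
schema_pairs = [(f"pop{i}", "float") for i in range(1, K + 1)]
schema_pairs.append(("vendor_id", "string"))
```

The resulting list of (name, type) pairs has the same form as the literal list passed to `ibis.schema` in the cell above, so changing K only requires editing one number.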

In [207]:
# join tables to get sample_id
joined = admix.left_join(map_tbl, admix.vendor_id == map_tbl.genome)[admix,map_tbl.subject_id]
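
To make the join semantics concrete, here is a plain-Python illustration of what a left join on `vendor_id == genome` produces: every admix row is kept, and rows without a match in the mapping table get a null `subject_id`. The function name and the sample IDs are invented, and the sketch assumes each `genome` value appears at most once in the mapping table.

```python
def left_join_subject_id(admix_rows, mapping_rows):
    """Attach subject_id to each admix row; unmatched rows get None."""
    # Assumes each genome appears at most once in the mapping table.
    genome_to_subject = {m["genome"]: m["subject_id"] for m in mapping_rows}
    return [
        dict(row, subject_id=genome_to_subject.get(row["vendor_id"]))
        for row in admix_rows
    ]


# Made-up rows for demonstration only.
admix_demo = [
    {"pop1": 0.7, "pop2": 0.3, "vendor_id": "GENOME_A"},
    {"pop1": 0.4, "pop2": 0.6, "vendor_id": "GENOME_B"},
]
mapping_demo = [{"genome": "GENOME_A", "subject_id": "S1"}]

joined_demo = left_join_subject_id(admix_demo, mapping_demo)
```

Checking for null `subject_id` values after the real join is a quick way to spot vendor IDs that are missing from the mapping table.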

### Upload admix table + sample IDs as an Impala table

In [209]:
con.create_table('admix_test', joined, database='users_selasady')