# Run ADMIXTURE

ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multi-locus SNP genotype datasets. It uses the same statistical model as STRUCTURE, but computes estimates much more rapidly using a fast numerical optimization algorithm.

In [1]:
###################
## set variables ##
###################

# specify file name prefix
in_name = 'admix'

# specify path to 1000g vcf files
ref_dir = '/users/selasady/my_titan_itmi/impala_scripts/annotation/admix/1000g_vcf/'

# specify desired output directory
out_dir = '/users/selasady/my_titan_itmi/impala_scripts/annotation/admix/'

# set file paths to executables
plink_path = '/users/selasady/my_titan_itmi/tools/plink/plink --noweb'
tabix_path = '/users/selasady/my_titan_itmi/tools/tabix-0.2.6/tabix'
vcftools_path = '/titan/ITMI1/workspaces/users/selasady/tools/vcftools_0.1.13/bin/vcftools'
java_path = '/tools/java/jdk1.7/bin/java'
gatk_path = '/users/selasady/my_titan_itmi/tools/GenomeAnalysisTK.jar'

# import required modules
import os
import subprocess as sp
import pandas as pd

## 1000 genomes phase 3 Marker Set

The 1000 genomes project created a marker file for running admixture on their phase 3 genomes using the following pipeline: 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/admixture_files/README.admixture_20141217

The PED file created in this pipeline was used for the creation of a marker set for running admixture on our own variant files, as follows: 

- VCFtools was used to filter the 1000 Genomes call set, retaining only biallelic, non-singleton SNV sites that are a minimum of 2 kb apart, with the following command:   

      ./vcftools --gzvcf
      ALL.chr1.phase3_shapeit2_mvncall_integrated_v4.20130502.genotypes.vcf.gz
      --thin 2000 --min-alleles 2 --max-alleles 2 --non-ref-ac 2 --plink --chr 1 --out ALL.chr1.phase3_shapeit2_filtered.20141217  


- The resulting Plink PED and MAP files were used for the creation of a marker set for running admixture on INOVA genomes. 

- Files downloaded from: 

    - PED file: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/admixture_files/ALL.wgs.phase3_shapeit2_filtered.20141217.maf0.05.ped
    - MAP file: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/admixture_files/ALL.wgs.phase3_shapeit2_filtered.20141217.maf0.05.map
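
The per-chromosome VCFtools filter described above could be scripted roughly as follows. This is a minimal sketch that only builds the argument lists and does not run anything. The helper name `vcftools_filter_cmd` is hypothetical, and it assumes that the other autosomes follow the same file-naming pattern as the chr1 example.

```python
# Sketch only: builds (does not run) one VCFtools command per autosome.
# Assumption: every chromosome's VCF follows the chr1 naming pattern shown above.
def vcftools_filter_cmd(chrom, vcftools='./vcftools'):
    """Return the argument list for filtering a single chromosome."""
    in_vcf = ('ALL.chr{}.phase3_shapeit2_mvncall_integrated_v4'
              '.20130502.genotypes.vcf.gz').format(chrom)
    out_prefix = 'ALL.chr{}.phase3_shapeit2_filtered.20141217'.format(chrom)
    return [
        vcftools, '--gzvcf', in_vcf,
        '--thin', '2000',        # keep sites at least 2 kb apart
        '--min-alleles', '2',    # biallelic sites only
        '--max-alleles', '2',
        '--non-ref-ac', '2',     # non-reference allele count of at least 2
        '--plink',               # write PLINK PED/MAP output
        '--chr', str(chrom),
        '--out', out_prefix,
    ]

# one command per autosome, chromosomes 1 through 22
cmds = [vcftools_filter_cmd(c) for c in range(1, 23)]
print(' '.join(cmds[0]))
```

Each list can be handed to `subprocess.check_call` once the paths have been confirmed on the local system.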

### Filtering on Minor Allele Frequency (MAF) and Linkage Disequilibrium (LD)

The 1000 genomes PED file was filtered as follows: 
- Variants with MAF between 0.05 and 0.5 were retained.  
- LD pruning was applied in windows of 50 markers, shifted 5 markers at a time, removing one marker of each pair with a pairwise r^2 value > 0.5, according to the documentation at http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#prune
- A pruned set was output using the following command: 

      "{} --file {} --out all_chroms_100g_maf_ld --maf 0.05 --max-maf .5 --indep-pairwise 50 5 0.5".format(plink_path, plink_file)
      
- Variants matching the pruned set were extracted from the PED file and the output was converted to a binary BED/BIM/FAM file set using the following command: 

       ped2bed_cmd = '{} --file {} --extract {} --make-bed --out all_chroms_1000g_pruned'.format(plink_path, plink_file, pruned_file)
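
The two PLINK steps above can be chained in a small helper. The sketch below is an illustration under stated assumptions: the function name `plink_prune_cmds` is hypothetical, the output prefixes are copied from the commands above, and it relies on PLINK writing the retained markers from `--indep-pairwise` to `<out>.prune.in`.

```python
import shlex

def plink_prune_cmds(plink_path, plink_file,
                     ld_prefix='all_chroms_100g_maf_ld',
                     out_prefix='all_chroms_1000g_pruned'):
    """Build the MAF/LD pruning command and the extract-to-BED command."""
    # plink_path may carry extra flags (e.g. '--noweb'), so split it into tokens
    base = shlex.split(plink_path)
    prune_cmd = base + ['--file', plink_file, '--out', ld_prefix,
                        '--maf', '0.05', '--max-maf', '.5',
                        '--indep-pairwise', '50', '5', '0.5']
    # --indep-pairwise writes the markers to keep to <out>.prune.in
    pruned_file = ld_prefix + '.prune.in'
    extract_cmd = base + ['--file', plink_file, '--extract', pruned_file,
                          '--make-bed', '--out', out_prefix]
    return prune_cmd, extract_cmd

# the PED/MAP prefix below is taken from the download URLs; adjust to the local path
prune_cmd, extract_cmd = plink_prune_cmds(
    'plink --noweb', 'ALL.wgs.phase3_shapeit2_filtered.20141217.maf0.05')
print(' '.join(prune_cmd))
print(' '.join(extract_cmd))
```

Running the two lists in order with `subprocess.check_call` stops the pipeline if the pruning step fails, so the extract step never reads a missing `.prune.in` file.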

### Create .POP File Needed for running ADMIXTURE in supervised mode

ADMIXTURE requires a .pop file to run in supervised mode, which allows the algorithm to use reference samples of known ancestry to better estimate admixture in the input samples. 

The .pop file must contain exactly one line per subject, in the same order as the .fam file. Each line gives the population code for a subject of known ancestry, or a dash for a subject of unknown ancestry: 

CEU  
YRI  
\-   
CEU  
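
To illustrate how the .pop file is consumed, here is a minimal sketch that counts the reference populations and assembles a supervised ADMIXTURE call. The executable name `admixture`, the helper `supervised_cmd`, and the toy labels are assumptions for illustration. It also assumes that the .pop file shares its prefix with the .bed file and that K should equal the number of distinct population codes in the .pop file.

```python
def supervised_cmd(bed_file, pop_labels, admixture='admixture'):
    """Build a supervised ADMIXTURE command from the .pop labels."""
    # distinct population codes, ignoring the dash used for unknown ancestry
    known = sorted(set(label for label in pop_labels if label != '-'))
    k = len(known)
    return [admixture, '--supervised', bed_file, str(k)]

# toy labels matching the four-line example above
labels = ['CEU', 'YRI', '-', 'CEU']
cmd = supervised_cmd('all_chroms_1000g_pruned.bed', labels)
print(' '.join(cmd))
```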

In [None]:
# read in .fam file and original 1000g ped file specifying ancestry
map_file = pd.read_csv('/users/selasady/my_titan_itmi/impala_scripts/annotation/admix/1000g_vcf/integrated_call_samples_v2.20130502.ALL.ped', sep='\t')
fam_cols = ['family', 'subject', 'paternal', 'maternal', 'sex','affection']
fam_file = pd.read_csv('/users/selasady/my_titan_itmi/impala_scripts/annotation/admix/1000g_vcf/all_chroms_1000g_pruned.fam', sep=' ', names = fam_cols)

# match .fam file with 1000g  integrated_call_samples_v2.20130502.ALL.ped by subject id
pop_df = fam_file.merge(map_file, left_on='subject', right_on='Individual ID', how = 'left')

# keep just the population column and save to file,
# writing a dash for any subject without a known population
pop_out = pop_df['Population']
pop_out.to_csv('/users/selasady/my_titan_itmi/impala_scripts/annotation/admix/1000g_vcf/all_chroms_1000g_pruned.pop', header=False, index=False, na_rep='-')
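
Because ADMIXTURE matches .pop lines to .fam lines by position, it can help to check that the merge kept one row per subject in the original order. The sketch below uses made-up subject IDs in place of the real files.

```python
import io
import pandas as pd

# Toy stand-ins for the .fam file and the 1000 Genomes sample table (made-up IDs)
fam = pd.DataFrame({'subject': ['S1', 'S2', 'S3', 'S4']})
samples = pd.DataFrame({'Individual ID': ['S1', 'S2', 'S4'],
                        'Population': ['CEU', 'YRI', 'CEU']})

# a left merge keeps every .fam subject, in .fam order
merged = fam.merge(samples, left_on='subject', right_on='Individual ID', how='left')

# subjects with no match get a dash, as supervised mode expects
pop = merged['Population'].fillna('-')

# duplicate IDs in the sample table would add rows and break the alignment
assert len(pop) == len(fam)

buf = io.StringIO()
pop.to_csv(buf, header=False, index=False)
print(buf.getvalue())
```

The length check is the important part: if the sample table listed a subject twice, the merge would emit extra rows and every later .pop line would be shifted against the .fam file.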

## Extract Marker Regions from VCF Files

### Convert the binary BED file to a readable map file for extracting matching regions from INOVA VCF files

    /users/selasady/my_titan_itmi/tools/plink/plink --noweb --bfile all_chroms_1000g_pruned --recode --out all_chroms_1000g_pruned_exported

##### head all_chroms_1000g_pruned_exported.map

1	1:76838	0	76838  
1	1:83084	0	83084  
1	1:91581	0	91581  
1	1:158006	0	158006  
1	1:173052	0	173052  
1	1:521578	0	521578  
1	1:636285	0	636285  
1	1:662414	0	662414  
1	1:712762	0	712762  
1	1:746189	0	746189  
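
The exported map file can be turned into single-base region strings for pulling the same positions out of other VCF files. This is a sketch under assumptions: the column names are hypothetical labels for the four standard MAP columns, the three rows are copied from the listing above, and the `chrom:start-end` format assumes the target VCFs name chromosomes the same way (for example `1` rather than `chr1`).

```python
import io
import pandas as pd

# first three rows of the exported map file, inlined for illustration
map_text = "1\t1:76838\t0\t76838\n1\t1:83084\t0\t83084\n1\t1:91581\t0\t91581\n"

# MAP columns: chromosome, marker id, genetic distance, base-pair position
map_cols = ['chrom', 'marker_id', 'genetic_dist', 'pos']
markers = pd.read_csv(io.StringIO(map_text), sep=r'\s+', names=map_cols,
                      dtype={'chrom': str})

# one region string per marker, covering a single base
regions = ['{}:{}-{}'.format(c, p, p)
           for c, p in zip(markers['chrom'], markers['pos'])]
print(regions)
```

For the real file, replace `io.StringIO(map_text)` with the path to `all_chroms_1000g_pruned_exported.map`.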

### Connect to Impala 

In [183]:
import ibis
import os

# connect to impala with ibis
hdfs_port = os.environ.get('glados20', 50070)
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True

### Upload admix csv file to HDFS

In [199]:
path = '/user/selasady/'
file_name = 'master_admix.csv'
# path already ends with a slash, so join without adding another
admix_file = path + file_name

# upload admix file
hdfs.put(path, file_name, verbose=True)



'/user/selasady/master_admix.csv'

### Download csv as ibis object to add sample id

In [204]:
# define table schema for the admix csv file
schema = ibis.schema([
    ('pop1', 'float'), 
    ('pop2', 'float'),
    ('pop3', 'float'),
    ('pop4', 'float'),
    ('pop5', 'float'),
    ('pop6', 'float'),
    ('pop7', 'float'),
    ('pop8', 'float'),
    ('pop9', 'float'),
    ('pop10', 'float'),
    ('pop11', 'float'),
    ('pop12', 'float'),
    ('pop13', 'float'),
    ('pop14', 'float'),
    ('vendor_id', 'string')
])

# create ibis object from admix csv
admix = con.delimited_file(path, schema)

# create ibis object from mapping table
map_tbl = con.table('gms_metadata', database='p7_itmi')

In [207]:
# join tables to get sample_id
joined = admix.left_join(map_tbl, admix.vendor_id == map_tbl.genome)[admix,map_tbl.subject_id]
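
The effect of this left join can be previewed locally with pandas before writing anything to Impala. The sketch below uses made-up vendor and subject IDs and only two of the population columns. It is a pandas analogue of the ibis expression, not the ibis call itself.

```python
import pandas as pd

# Toy stand-ins for the admix table and the metadata table (made-up IDs)
admix_df = pd.DataFrame({'pop1': [0.9, 0.2],
                         'pop2': [0.1, 0.8],
                         'vendor_id': ['V1', 'V2']})
meta_df = pd.DataFrame({'genome': ['V1'], 'subject_id': ['SUBJ-1']})

# keep every admix row and attach subject_id where vendor_id matches genome
joined_df = admix_df.merge(meta_df[['genome', 'subject_id']],
                           left_on='vendor_id', right_on='genome',
                           how='left').drop(columns='genome')

# rows with no metadata match keep their admix values but lack a subject_id
unmatched = joined_df[joined_df['subject_id'].isnull()]
print(joined_df)
print(unmatched['vendor_id'].tolist())
```

Listing the unmatched vendor IDs before creating the table makes it easier to spot samples that are missing from the mapping table.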

### Upload admix table + sample IDs as an Impala table

In [209]:
con.create_table('admix_test', joined, database='users_selasady')