# Merging genotypes

This script merges several genotype files:
- Merging between participants' genotype platforms
- Merging reference samples (1000G and HGDP)
- Merging reference with participants' samples

## Preliminaries

First let's import modules and set up paths

In [1]:
import glob, os, shutil, subprocess, csv, time
import pandas as pd
import numpy as np

In [2]:
os.getcwd()
projpath = os.path.realpath('..')
outhouse = os.path.join(projpath, 'Results', 'MergeGeno', 'MergeSamples')
outhgdp = os.path.join(projpath, 'Results', 'MergeGeno', 'temp', 'HGDP')
out1000 = os.path.join(projpath, 'Results', 'MergeGeno', 'temp', '1000G')
os.chdir(outhouse)
os.getcwd()

'/home/tomas/Downloads/FacialSD/Results/MergeGeno/MergeSamples'

## Merging in house genotypes

The first step will be to merge our participants' genotypes. 
The steps will be as follows:
- Clean each dataset by removing SNPs with missing rates greater than 0.1
- Merge the datasets, and remove all problematic SNPs
- Finally, LD prune SNPs in the merged dataset

First, we will clean each dataset by removing SNPs with missing call rates greater than 0.1 (the default threshold of plink's `--geno` flag).

In [63]:
#Clean datasets (--geno without a value uses plink's default threshold of 0.1)
for file in glob.glob("01_Originals/*.bed"):
    #Drop the .bed extension to get the plink file prefix
    prefix = os.path.splitext(file)[0]
    outname = "Clean_" + os.path.basename(prefix)
    print("Creating..." + outname)
    subprocess.run(["plink", "--bfile", prefix, "--geno", "--make-bed", "--out", "02_Cleaning/" + outname])
    
print("Finished")

Creating...Clean_UIUC2013_116ppl_959Ksnps_hg19_ATGC
Creating...Clean_Euro180_176ppl_317K_hg19_ATGC
Creating...Clean_CHP_1022ppl_114K_hg19_ATGC
Creating...Clean_SA_231ppl_599K_hg19_ATGC
Creating...Clean_TD_198ppl_1M_hg19_ATGC
Creating...Clean_ADAPT_2784ppl_1Msnps_hg19_ATGC
Creating...Clean_GHPAFF_3ppl_907K_hg19_ATGC
Creating...Clean_CV_697ppl_964K_hg19_ATGC
Creating...Clean_UIUC2014_168ppl_703K_hg19_ATGC
Finished
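
To make the filter concrete, here is a minimal pandas sketch of what a missing call rate threshold does. The toy genotype table, the SNP names, and the helper `filter_by_missing_rate` are invented for illustration; plink works on the binary files directly, so this only mirrors the logic, not the implementation.

```python
import numpy as np
import pandas as pd

# Toy genotype table: rows are samples, columns are SNPs, NaN marks a missing call.
# The SNP names and values are made up for illustration.
genotypes = pd.DataFrame(
    {
        "rs_a": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],
        "rs_b": [0, np.nan, 2, 1, np.nan, 1, 2, 0, 1, 1],
        "rs_c": [0, 1, 2, 1, 0, 1, np.nan, 0, 1, 1],
    }
)

def filter_by_missing_rate(geno, max_missing=0.1):
    """Return (kept SNP columns, per-SNP missing rate) for a samples x SNPs table."""
    missing_rate = geno.isna().mean(axis=0)
    # Only SNPs whose missing rate exceeds the threshold are dropped
    kept = geno.loc[:, missing_rate <= max_missing]
    return kept, missing_rate

kept, missing_rate = filter_by_missing_rate(genotypes)
print(missing_rate)
print(list(kept.columns))
```

Here `rs_b` is missing in 2 of 10 samples and is dropped, while `rs_c` sits exactly at the threshold and is kept.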


Here you can take a look at the loaded and removed SNPs in each dataset

In [64]:
for file in glob.glob("02_Cleaning/*.log"):
    with open(file) as myfile:
        print("In file: " + file.split(".")[0][12:])
        for num, line in enumerate(myfile, 1):
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: Clean_CHP_1022ppl_114K_hg19_ATGC
114495 variants loaded from .bim file.
1132 variants removed due to missing genotype data (--geno).
113363 variants and 1022 people pass filters and QC.
Finished file... 

In file: Clean_UIUC2013_116ppl_959Ksnps_hg19_ATGC
959382 variants loaded from .bim file.
30151 variants removed due to missing genotype data (--geno).
929231 variants and 116 people pass filters and QC.
Finished file... 

In file: Clean_GHPAFF_3ppl_907K_hg19_ATGC
907494 variants loaded from .bim file.
52823 variants removed due to missing genotype data (--geno).
854671 variants and 3 people pass filters and QC.
Finished file... 

In file: Clean_TD_198ppl_1M_hg19_ATGC
1032848 variants loaded from .bim file.
454245 variants removed due to missing genotype data (--geno).
578603 variants and 198 people pass filters and QC.
Finished file... 

In file: Clean_CV_697ppl_964K_hg19_ATGC
964041 variants loaded from .bim file.
0 variants removed due to missing genotype data (--geno).
964
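
If you prefer a table to raw log lines, the counts can be pulled out with regular expressions. This is only a sketch: `summarize_plink_log` is a hypothetical helper, and the patterns assume the log wording shown in the output above.

```python
import re

# Sample lines copied from the CHP log shown above
sample_log = """114495 variants loaded from .bim file.
1132 variants removed due to missing genotype data (--geno).
113363 variants and 1022 people pass filters and QC.
"""

def summarize_plink_log(text):
    """Extract loaded / removed / remaining variant counts from plink log text."""
    patterns = {
        "loaded": r"(\d+) variants loaded from \.bim file",
        "removed_geno": r"(\d+) variants removed due to missing genotype data",
        "remaining": r"(\d+) variants and \d+ people pass filters",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        # Leave the entry empty when the log does not contain that line
        summary[key] = int(match.group(1)) if match else None
    return summary

summary = summarize_plink_log(sample_log)
print(summary)
```

Collecting one such dictionary per log file gives a quick way to check that loaded minus removed equals remaining in every dataset.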

Now we will run a first merge to identify the possibly triallelic SNPs (plink lists them in a `.missnp` file), and then exclude those SNPs from each dataset.

In [65]:
subprocess.run(["plink", "--merge-list", "FirstMergeList.txt", "--out", "03_Merging/TriallelicSnps"])
for file in glob.glob("02_Cleaning/*.bed"):
    inputname = file.split(".")
    outname = "CleanTriallelic_" + inputname[0][18:]
    print("Creating..." + outname)
    subprocess.run(["plink", "--bfile", inputname[0], "--exclude", "03_Merging/TriallelicSnps.missnp", "--make-bed", "--out", "02_Cleaning/" + outname])
    
print("Finished")

Creating...CleanTriallelic_TD_198ppl_1M_hg19_ATGC
Creating...CleanTriallelic_GHPAFF_3ppl_907K_hg19_ATGC
Creating...CleanTriallelic_Euro180_176ppl_317K_hg19_ATGC
Creating...CleanTriallelic_UIUC2014_168ppl_703K_hg19_ATGC
Creating...CleanTriallelic_CV_697ppl_964K_hg19_ATGC
Creating...CleanTriallelic_SA_231ppl_599K_hg19_ATGC
Creating...CleanTriallelic_ADAPT_2784ppl_1Msnps_hg19_ATGC
Creating...CleanTriallelic_CHP_1022ppl_114K_hg19_ATGC
Creating...CleanTriallelic_UIUC2013_116ppl_959Ksnps_hg19_ATGC
Finished
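
The idea behind the triallelic check can be sketched in pandas: pool the alleles reported for each SNP across datasets and flag any SNP with more than two distinct alleles. The tables, SNP IDs, and the helper `find_multiallelic` are invented for illustration, and this is not how plink implements the check internally.

```python
import pandas as pd

# Two toy tables with the .bim allele columns (variant ID, allele 1, allele 2).
# IDs and alleles are invented for illustration.
bim_a = pd.DataFrame({"snp": ["rs1", "rs2", "rs3"], "a1": ["A", "C", "G"], "a2": ["G", "T", "T"]})
bim_b = pd.DataFrame({"snp": ["rs1", "rs2", "rs3"], "a1": ["G", "C", "A"], "a2": ["A", "T", "G"]})

def find_multiallelic(*bims):
    """Return IDs of variants whose pooled allele set has more than two alleles."""
    pooled = pd.concat(bims, ignore_index=True)
    long = pooled.melt(id_vars="snp", value_vars=["a1", "a2"], value_name="allele")
    long = long[long["allele"] != "0"]  # "0" is the plink code for a missing allele
    n_alleles = long.groupby("snp")["allele"].nunique()
    return sorted(n_alleles[n_alleles > 2].index)

print(find_multiallelic(bim_a, bim_b))
```

In this toy case only `rs3` is flagged, because the two tables together report A, G, and T for it. Such conflicts can come from true triallelic sites or from strand differences between platforms.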


Note: Remember that many variants still have different positions across datasets.
Now we will run a second merge, this time using the datasets from which the triallelic SNPs were excluded.

In [66]:
subprocess.run(["plink", "--merge-list", "FinalMergeList.txt", "--out", "03_Merging/Merged"])

CompletedProcess(args=['plink', '--merge-list', 'FinalMergeList.txt', '--out', '03_Merging/Merged'], returncode=0)

Finally, we will remove all SNPs with missing call rates greater than 0.1 and LD prune the remaining SNPs with `--indep 50 5 2`: a window of 50 SNPs, shifted 5 SNPs at a time, with a variance inflation factor (VIF) threshold of 2.

In [67]:
subprocess.run(["plink", "--bfile", "03_Merging/Merged", "--geno", "--indep", "50", "5", "2", "--out", "04_CleanMerged/ExtractSNPs"])
with open("04_CleanMerged/ExtractSNPs.log") as myfile:
    for num, line in enumerate(myfile, 1):
        if "variants" in line:
            print(line, end='')
    print("Finished file... \n")

1278792 variants loaded from .bim file.
1250083 variants removed due to missing genotype data (--geno).
28709 variants and 5387 people pass filters and QC.
Pruned 1005 variants from chromosome 1, leaving 1856.
Pruned 633 variants from chromosome 2, leaving 1344.
Pruned 612 variants from chromosome 3, leaving 1216.
Pruned 257 variants from chromosome 4, leaving 602.
Pruned 522 variants from chromosome 5, leaving 1156.
Pruned 699 variants from chromosome 6, leaving 1147.
Pruned 441 variants from chromosome 7, leaving 866.
Pruned 640 variants from chromosome 8, leaving 1077.
Pruned 416 variants from chromosome 9, leaving 762.
Pruned 465 variants from chromosome 10, leaving 831.
Pruned 738 variants from chromosome 11, leaving 1408.
Pruned 378 variants from chromosome 12, leaving 818.
Pruned 237 variants from chromosome 13, leaving 554.
Pruned 625 variants from chromosome 14, leaving 981.
Pruned 405 variants from chromosome 15, leaving 771.
Pruned 281 variants from chromosome 16, leaving 55
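
For intuition on the pruning threshold: the VIF of a SNP is 1 / (1 - R²), where R² comes from regressing that SNP on the other SNPs in the window, so a threshold of 2 corresponds to R² = 0.5. The numpy sketch below computes VIFs from the diagonal of the inverse correlation matrix for three simulated SNPs. The data and the helper `variance_inflation_factors` are assumptions made for illustration, not plink's actual algorithm.

```python
import numpy as np

def variance_inflation_factors(genotypes):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(genotypes, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n_samples = 500
snp1 = rng.integers(0, 3, size=n_samples).astype(float)
snp2 = rng.integers(0, 3, size=n_samples).astype(float)
# snp3 copies snp1 except in about 5% of samples, so the two are in strong LD
snp3 = snp1.copy()
changed = rng.random(n_samples) < 0.05
snp3[changed] = rng.integers(0, 3, size=changed.sum())

vif = variance_inflation_factors(np.column_stack([snp1, snp2, snp3]))
print(np.round(vif, 2))
```

The independent SNP should have a VIF close to 1, while the two correlated SNPs should land well above the threshold of 2, so one of them would be a candidate for pruning.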

In [70]:
#Keep only the pruned-in SNPs and remove the samples with ambiguous sex listed in the .nosex file
subprocess.run(["plink", "--bfile", "03_Merging/Merged", "--extract", "04_CleanMerged/ExtractSNPs.prune.in", "--remove", "04_CleanMerged/ExtractSNPs.nosex", 
                "--make-bed", "--out", "04_CleanMerged/CleanMerged"])
with open("04_CleanMerged/CleanMerged.log") as myfile:
    for num, line in enumerate(myfile, 1):
        if "variants" in line:
            print(line, end='')
    print("Finished file... \n")

1278792 variants loaded from .bim file.
--extract: 18953 variants remaining.
18953 variants and 5290 people pass filters and QC.
Finished file... 



We'll remove the intermediate datasets to free up space, and copy the final dataset so that it can later be merged with the reference samples.

In [71]:
#Remove source files
for f in glob.glob("02_Cleaning/*.*"):
    os.remove(f)
    
for f in glob.glob("03_Merging/*.*"):
    os.remove(f)

In [4]:
#Copy the final files to 05_ReferenceSamples to be merged with the reference samples
dest_dir = os.path.join(projpath, 'Results', 'MergeGeno', 'MergeSamples', '05_ReferenceSamples')
for filename in glob.glob("04_CleanMerged/CleanMerged.*"):
    shutil.copy(filename, dest_dir)

## Merging reference samples

In the second step, we will merge the reference samples from 1000G and HGDP.
We will follow these steps:
- Download HGDP files, and transform them into a plink file
- Download 1000G files and keep only SNPs found in the HGDP files
- Merge the HGDP and 1000G files

### HGDP

The HGDP files (Stanford) were downloaded from [here](http://hagsc.org/hgdp/files.html), and the sample list file from [here](http://www.stanford.edu/group/rosenberglab/data/rosenberg2006ahg/SampleInformation.txt).
The script to transform the HGDP data to plink format is called HGDPtoPlink.sh and was modified from [here](http://www.harappadna.org/2011/02/hgdp-to-ped-conversion/).
The HGDP data uses coordinates from build 36.1 (a list of assemblies can be found [here](https://genome.ucsc.edu/FAQ/FAQreleases.html))

In [5]:
#Run the HGDPtoPlink script
os.chdir(os.path.join(projpath, 'Code'))
subprocess.run(["bash", "HGDPtoPlink.sh"])

CompletedProcess(args=['bash', 'HGDPtoPlink.sh'], returncode=0)

Because the 1000G data uses the GRCh37 (hg19) assembly (the fasta file can be found [here](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/) as human_g1k_v37.fasta.gz), we'll need to lift the HGDP coordinates over from build 36.1 (hg18).
To do that we'll use UCSC [liftOver](http://genome.ucsc.edu/cgi-bin/hgLiftOver) already installed using bioconda, and [liftOverPlink](https://github.com/sritchie73/liftOverPlink) as a wrapper to work with plink files (`ped` and `map` formats).
The chain file that tells liftOver how to convert between hg18 and hg19 can be downloaded [here](http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz).

In [3]:
os.chdir(outhgdp)
#Using liftover
%run liftOverPlink.py --map hgdp940.map --out lifted --chain hg18ToHg19.over.chain.gz
%run rmBadLifts.py --map lifted.map --out good_lifted.map --log bad_lifted.dat
#Creating a list of snps to include in lifted version
snps = pd.read_csv("good_lifted.map", sep = "\t", header = None)
#header = False keeps a column label from being written as the first line of the list
snps.iloc[:,1].to_csv("snplist.txt", index = False, header = False)
#Excluding snps and creating binary file
subprocess.run(["plink", "--file", "hgdp940", "--recode", "--out", "lifted", "--extract", "snplist.txt" ])
#--ped and --map name the two input files explicitly, so --file is not needed here
subprocess.run(["plink", "--ped", "lifted.ped", "--map", "good_lifted.map", "--make-bed", "--out", "hgdp940hg19"])

#Removing some files
for file in glob.glob("*.ped"):
    os.remove(file)
    
for file in glob.glob("*.map"):
    os.remove(file)

Converting MAP file to UCSC BED file...
SUCC:  map->bed succ
Lifting BED file...
SUCC:  liftBed succ
Converting lifted BED file back to MAP...
SUCC:  bed->map succ
cleaning up BED files...
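
The first step in the log above, converting the MAP file to a UCSC BED file, is needed because liftOver works on BED intervals. A plink `.map` row holds a 1-based position, while BED uses 0-based, half-open intervals, so each SNP becomes the interval from `pos - 1` to `pos`. The sketch below shows that conversion under the assumption that the wrapper does something equivalent; `map_to_ucsc_bed` and the toy rows are hypothetical, and special chromosome codes (for example 23 for X) are not handled.

```python
import pandas as pd

def map_to_ucsc_bed(map_df):
    """Convert plink .map rows (chrom, snp, cM, 1-based pos) to UCSC BED intervals."""
    return pd.DataFrame(
        {
            "chrom": "chr" + map_df["chrom"].astype(str),
            "start": map_df["pos"] - 1,  # BED starts are 0-based
            "end": map_df["pos"],        # BED ends are exclusive
            "name": map_df["snp"],
        }
    )

# Invented example rows, not taken from hgdp940.map
toy_map = pd.DataFrame(
    {"chrom": [1, 2], "snp": ["rs_x", "rs_y"], "cm": [0, 0], "pos": [1000, 2500]}
)
bed = map_to_ucsc_bed(toy_map)
print(bed)
```

After liftOver, the `end` column of the lifted BED file gives the new 1-based position for the MAP file.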


In [4]:
#Read hgdp940hg19 log file
with open("hgdp940hg19.log") as myfile:
    for num, line in enumerate(myfile, 1):
        if "variants" in line:
            print(line, end='')
    print("Finished file... \n")

Performing single-pass .bed write (644054 variants, 940 people).
644054 variants loaded from .bim file.
644054 variants and 940 people pass filters and QC.
Finished file... 



In [47]:
dest_dir = os.path.join(projpath, 'Results', 'MergeGeno', 'temp', 'Merge')
for filename in glob.glob("hgdp940hg19.*"):
    shutil.copy(filename, dest_dir)

### 1000G

The 1000G Phase 3 files were downloaded from [here](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
First, for each chromosome of the 1000G samples, we will keep only the SNPs listed in `snplist.txt` (taken from the HGDP dataset) using vcftools.
Then we will concatenate the autosomal chromosomes into one file with bcftools and convert it into a plink binary file with plink.

In [16]:
os.chdir(out1000)
shutil.copy(os.path.join(outhgdp, "snplist.txt"), out1000)
for file in glob.glob("*chr[0-9]*.gz"):
    outname = file.split(".")[1] + "_extracted"
    subprocess.run(["vcftools", "--gzvcf", file, "--snps", "snplist.txt", "--recode", "--out", outname])

In [17]:
concatfiles = glob.glob("chr[0-9]*.recode.vcf")
#Build the bcftools command and append the per-chromosome files to it
command = ["bcftools", "concat", "-o", "1000g.vcf.gz", "-Oz"]
command.extend(concatfiles)
subprocess.run(command)

CompletedProcess(args=['bcftools', 'concat', '-o', '1000g.vcf.gz', '-Oz', 'chr21_extracted.recode.vcf', 'chr13_extracted.recode.vcf', 'chr4_extracted.recode.vcf', 'chr19_extracted.recode.vcf', 'chr8_extracted.recode.vcf', 'chr18_extracted.recode.vcf', 'chr14_extracted.recode.vcf', 'chr3_extracted.recode.vcf', 'chr11_extracted.recode.vcf', 'chr7_extracted.recode.vcf', 'chr16_extracted.recode.vcf', 'chr10_extracted.recode.vcf', 'chr17_extracted.recode.vcf', 'chr5_extracted.recode.vcf', 'chr2_extracted.recode.vcf', 'chr6_extracted.recode.vcf', 'chr9_extracted.recode.vcf', 'chr1_extracted.recode.vcf', 'chr15_extracted.recode.vcf', 'chr12_extracted.recode.vcf', 'chr22_extracted.recode.vcf', 'chr20_extracted.recode.vcf'], returncode=0)

In [41]:
#Convert the concatenated VCF to binary plink
subprocess.run(["plink", "--vcf", "1000g.vcf.gz", "--make-bed", "--out", "1000Ghg19" ])
#Updating fam file
allfam = pd.read_csv("integrated_call_samples_v2.20130502.ALL.ped", header = None, skiprows = 1, sep = "\t")
oldfam = pd.read_csv("1000Ghg19.fam", header = None, sep = " ")
updatedfam = pd.merge(oldfam, allfam, how = "inner", left_on = 1, right_on = 1)
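#Columns 6-9 of the merged table come from the ped file; columns 1 and 5 (sample ID and phenotype) come from the old fam
updatedfam.iloc[:,[6,1,7,8,9,5]].to_csv("1000Ghg19.fam", sep = " ", header = False, index = False)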
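
The column positions used above are easier to follow on a small example. The sketch below rebuilds the merge with invented stand-ins for the `.fam` and pedigree tables, assuming the pedigree file starts with family ID, individual ID, paternal ID, maternal ID, and sex. It also adds a row-count check, since a `.fam` with fewer rows than the original would no longer match the `.bed` file.

```python
import pandas as pd

# Toy stand-ins for the plink .fam file and the 1000G pedigree file (values invented)
oldfam = pd.DataFrame([["S1", "S1", 0, 0, 0, -9], ["S2", "S2", 0, 0, 0, -9]])
allfam = pd.DataFrame(
    [["FAM1", "S1", "0", "0", 1, 0], ["FAM2", "S2", "0", "0", 2, 0]]
)

merged = pd.merge(oldfam, allfam, how="inner", on=1, validate="one_to_one")
# Positions 0-5 come from oldfam; positions 6-10 are allfam columns 0, 2, 3, 4, 5
newfam = merged.iloc[:, [6, 1, 7, 8, 9, 5]].copy()
newfam.columns = ["fid", "iid", "pat", "mat", "sex", "pheno"]

# Every sample must survive the merge, or the .fam no longer matches the .bed
assert len(newfam) == len(oldfam)
print(newfam)
```

The `validate="one_to_one"` argument makes pandas raise an error if a sample ID is duplicated in either table, which would otherwise silently add rows.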


In [42]:
for file in glob.glob("chr*.recode.vcf"):
    os.remove(file)
    
os.remove("1000g.vcf.gz")

dest_dir = os.path.join(projpath, 'Results', 'MergeGeno', 'temp', 'Merge')
for filename in glob.glob("1000Ghg19.*"):
    shutil.copy(filename, dest_dir)

### Merge reference samples

Now we will merge the 1000G and HGDP databases, both using the hg19 coordinates and with related people removed

In [45]:
os.chdir(os.path.join(projpath, 'Results', 'MergeGeno', 'temp', 'Merge'))
subprocess.run(["plink", "--bfile", "1000Ghg19", "--bmerge", "hgdp940hg19", "--make-bed", "--out", "hgdp1000ghg19"])
for file in glob.glob("*.bed"):
    outname = file.split(".")[0] + "_temp"
    subprocess.run(["plink", "--bfile", file.split(".")[0], "--exclude", "hgdp1000ghg19-merge.missnp", "--make-bed", "--out", outname])

subprocess.run(["plink", "--bfile", "hgdp940hg19_temp", "--bmerge", "1000Ghg19_temp", "--make-bed", "--out", "hgdp1000ghg19"])
for file in glob.glob("*_temp*"):
    os.remove(file)
    
dest_dir = os.path.join(projpath, 'Results', 'MergeGeno', 'MergeSamples', '05_ReferenceSamples')
for filename in glob.glob("hgdp1000ghg19.*"):
    shutil.copy(filename, dest_dir)

## Merging reference and in-house samples

Finally, we will merge the in-house samples with the reference samples from HGDP and 1000G. 
To do that, we will keep only the SNPs shared between the reference samples and the in-house samples, which are already cleaned and pruned.

In [14]:
os.chdir(os.path.join(outhouse, "05_ReferenceSamples"))
subprocess.run(["plink", "--bfile", "hgdp1000ghg19", "--extract", "CleanMerged.bim", "--make-bed", "--out", "hgdp1000ghg19_subset"])
subprocess.run(["plink", "--bfile", "CleanMerged", "--extract", "hgdp1000ghg19_subset.bim", "--make-bed", "--out", "CleanMerged_subset"])
subprocess.run(["plink", "--bfile", "CleanMerged_subset", "--bmerge", "hgdp1000ghg19_subset", "--make-bed", "--out", "HouseHGDP1000Ghg19"])
#Flipping strand and merging
subprocess.run(["plink", "--bfile", "CleanMerged_subset", "--flip", "HouseHGDP1000Ghg19-merge.missnp", "--make-bed", "--out", "CleanMerged_subset_flip"])
subprocess.run(["plink", "--bfile", "CleanMerged_subset_flip", "--bmerge", "hgdp1000ghg19_subset", "--make-bed", "--out", "HouseHGDP1000Ghg19"])

with open("HouseHGDP1000Ghg19.log", 'r') as fin:
    file_contents = fin.read()
    print(file_contents)

PLINK v1.90b4 64-bit (20 Mar 2017)
Options in effect:
  --bfile CleanMerged_subset_flip
  --bmerge hgdp1000ghg19_subset
  --make-bed
  --out HouseHGDP1000Ghg19

Hostname: tomasgazelle
Working directory: /home/tomas/Downloads/FacialSD/Results/MergeGeno/MergeSamples/05_ReferenceSamples
Start time: Mon May 14 11:57:17 2018

Random number seed: 1526313437
3865 MB RAM detected; reserving 1932 MB for main workspace.
5290 people loaded from CleanMerged_subset_flip.fam.
3444 people to be merged from hgdp1000ghg19_subset.fam.
Of these, 3444 are new, while 0 are present in the base dataset.
12890 markers loaded from CleanMerged_subset_flip.bim.
12890 markers to be merged from hgdp1000ghg19_subset.bim.
Of these, 0 are new, while 12890 are present in the base dataset.
Performing single-pass merge (8734 people, 12890 variants).
Merged fileset written to HouseHGDP1000Ghg19-merge.bed +
HouseHGDP1000Ghg19-merge.bim + HouseHGDP1000Ghg19-merge.fam .
12890 variants loaded from .bim file.
8734 people (372
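
The strand flip step relies on allele complementarity: a SNP typed as A/G on one strand appears as T/C on the other. The sketch below classifies a pair of allele sets for the same SNP. The function `classify_alleles` and its category names are illustrative assumptions, not part of plink.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_alleles(alleles_a, alleles_b):
    """Compare two unordered allele pairs reported for the same SNP."""
    set_a, set_b = set(alleles_a), set(alleles_b)
    flipped_a = {COMPLEMENT[allele] for allele in set_a}
    if set_a == flipped_a:
        return "ambiguous"  # A/T or C/G: strand cannot be told from the alleles alone
    if set_a == set_b:
        return "match"
    if flipped_a == set_b:
        return "flip"
    return "mismatch"

print(classify_alleles(("A", "G"), ("G", "A")))  # match
print(classify_alleles(("A", "G"), ("T", "C")))  # flip
print(classify_alleles(("A", "T"), ("T", "A")))  # ambiguous
print(classify_alleles(("A", "G"), ("A", "C")))  # mismatch
```

SNPs in the "flip" category are the ones that a strand flip resolves. SNPs that still mismatch after flipping would need to be excluded before merging.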

## Population stratification

Now, we will load the final dataset created before and run some population stratification analyses (`PCA`, `MDS` and `ADMIXTURE`).

### PCA

In [16]:
os.chdir(os.path.join(outhouse, "05_ReferenceSamples"))
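#--within reads the cluster name from the third column of the file (here the paternal ID column of the reference .fam);
#--pca-cluster-names 0 computes the PCs on the samples in cluster "0" and projects the remaining samples onto them
subprocess.run(["plink", "--bfile", "HouseHGDP1000Ghg19", "--pca", "50", "--pca-cluster-names", "0", "--within", "hgdp1000ghg19.fam", "--out", "PCA"])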
for file in glob.glob("PCA.*"):
    shutil.move(file, os.path.join(projpath, "Results", "GenPCA", file))
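
To inspect the results, the eigenvector file can be loaded with pandas. This sketch assumes the plink 1.9 default layout for `.eigenvec` files (no header, whitespace-delimited, family ID and individual ID followed by one column per PC). `read_eigenvec` is a hypothetical helper, and the two rows are invented.

```python
import io
import pandas as pd

def read_eigenvec(handle, n_pcs):
    """Read a headerless, whitespace-delimited plink 1.9 .eigenvec table."""
    names = ["FID", "IID"] + [f"PC{i}" for i in range(1, n_pcs + 1)]
    return pd.read_csv(handle, sep=r"\s+", header=None, names=names)

# Two invented rows with 3 PCs each
toy = io.StringIO("F1 I1 0.01 -0.02 0.03\nF2 I2 -0.04 0.05 0.06\n")
pcs = read_eigenvec(toy, n_pcs=3)
print(pcs)
```

For the files produced above, the call would be along the lines of `read_eigenvec(os.path.join(projpath, "Results", "GenPCA", "PCA.eigenvec"), n_pcs=50)`.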

### Admixture

In [None]:
mergesamples = os.path.join(outhouse, "05_ReferenceSamples", "HouseHGDP1000Ghg19.bed")
os.chdir(os.path.join(projpath, "Results", "Admixture"))
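for i in range(2,11):
    #Run admixture with K = i and write its output to its own log file, closing the file afterwards
    with open("log_" + str(i) + ".txt", "w") as f:
        subprocess.run(["./admixture", "--cv", mergesamples, str(i)], stdout=f)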
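
Because the runs use `--cv`, each log should contain a cross-validation error line, and the K with the lowest error is a common choice. The sketch below assumes the line has the form `CV error (K=3): 0.55`. The helper `best_k` and the toy log lines are invented for illustration.

```python
import re

CV_PATTERN = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def best_k(log_texts):
    """Return (K with the lowest CV error, {K: error}) from ADMIXTURE log texts."""
    errors = {}
    for text in log_texts:
        for k, err in CV_PATTERN.findall(text):
            errors[int(k)] = float(err)
    return min(errors, key=errors.get), errors

# Invented log lines, one per value of K
toy_logs = ["CV error (K=2): 0.60", "CV error (K=3): 0.55", "CV error (K=4): 0.57"]
k, errors = best_k(toy_logs)
print(k, errors)
```

With the real runs, `log_texts` would be the contents of the `log_2.txt` to `log_10.txt` files written above.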