# Merging genotypes

In this script we take the harmonized genotypes and merge them.
The output is three different files: a "dense" file with the samples typed on ~500k SNPs, a "medium" one with the samples typed on ~300k SNPs, and a "sparse" one with all the files (~30k SNPs).
The datasets that go into each file are as follows:

- Dense:
    - UIUC2013
    - UIUC2014
    - TD2015
    - TD2016
    - SA
    - ADAPT
    - PSU_FEMMES

- Medium:
    - All of the above
    - GHPAFF_Euro
    - GHPAFF_CV

- Sparse:
    - All of the above
    - Axiom Array
    - UC_FEMMES (note that adding UC_FEMMES decreases the number of SNPs from ~30k to ~10k)

## Preliminaries

Let's import modules and set paths

In [1]:
import subprocess, os, glob
import pandas as pd
import numpy as np

In [2]:
#Setting paths
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Merging each file

Let's first merge the per-chromosome files of each dataset into a single file.

In [82]:
#Move directory
os.chdir(pathgenos)

In [83]:
#Entering every directory in the 03_Harmonized folder
harmonized = os.path.join(pathgenos, "03_Harmonized")
filenames  = os.listdir(harmonized)

for filename in filenames: # loop through all the files and folders
    if os.path.isdir(os.path.join(harmonized, filename)):
        os.chdir(os.path.join(harmonized, filename))
        #Write the prefixes of the per-chromosome files to use for merging
        with open(filename + ".txt", "w") as f:
            for file in glob.glob("*_harmonized.bed"):
                f.write(file.split(".")[0] + "\n")
        #Merge using the file just created
        merged = os.path.join(harmonized, filename + "_all_harmonized")
        subprocess.run(["plink", "--merge-list", filename + ".txt", "--allow-no-sex", "--make-bed", "--out", merged])
        #Remove no-sex in CHP (those are plate controls)
        if filename.startswith("CHP"):
            subprocess.run(["plink", "--bfile", merged, 
                            "--remove", os.path.join(harmonized, "CHP_1022ppl_114K_hg19_ATGC_geno01_mind01_all_harmonized.nosex"), 
                            "--make-bed", "--out", merged])
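The calls above do not check whether PLINK finished without errors. A minimal sketch of a wrapper that stops on a failed command is shown here; `run_checked` is a hypothetical helper (not part of the original notebook), and the Python interpreter stands in for `plink` so the example runs anywhere.

```python
import subprocess
import sys

def run_checked(cmd):
    """Run a command and raise if it exits with a non-zero status."""
    # check=True makes subprocess.run raise CalledProcessError on failure
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# The Python interpreter stands in for plink in this demonstration
ok = run_checked([sys.executable, "-c", "print('merged')"])

try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed = False
except subprocess.CalledProcessError as err:
    failed = True
    exit_code = err.returncode
```

With a wrapper like this, a failed merge would stop the loop instead of silently moving on to the next dataset.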

## Merging and cleaning all files

Now we will merge the files into the dense, medium, and sparse datasets, following the lists given at the beginning.

In [3]:
#Move directory
os.chdir(os.path.join(pathgenos, "03_Harmonized"))

In [91]:
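#Creating the merge list for the 30k set, which takes every harmonized file
#(500k_mergefile.txt and 300k_mergefile.txt, used below, are not created in this cell)
with open("30k_mergefile.txt", "w") as f:
    for file in glob.glob("*harmonized.bed"):
        f.write(file.split(".")[0] + "\n")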
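The merge lists for the dense and medium sets could be built in a similar way by keeping only the files that belong to each list. The sketch below assumes that each file name starts with its dataset label, which may not hold for the real files in `03_Harmonized`; `stems_for` and the example file names are hypothetical.

```python
def stems_for(datasets, bed_files):
    """Return the PLINK prefixes whose name starts with one of the dataset labels."""
    stems = []
    for name in sorted(bed_files):
        stem = name.split(".")[0]  # drop the .bed extension
        if any(stem.startswith(d) for d in datasets):
            stems.append(stem)
    return stems

# Hypothetical file names; the real ones may differ
bed_files = [
    "UIUC2013_all_harmonized.bed",
    "TD2015_all_harmonized.bed",
    "GHPAFF_Euro_all_harmonized.bed",
]
dense  = ["UIUC2013", "UIUC2014", "TD2015", "TD2016", "SA", "ADAPT", "PSU_FEMMES"]
medium = dense + ["GHPAFF_Euro", "GHPAFF_CV"]

dense_stems  = stems_for(dense, bed_files)
medium_stems = stems_for(medium, bed_files)
```

Each list of prefixes can then be written one per line to a text file, as is done for `30k_mergefile.txt`.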

In [8]:
#Merging 500k files, then filtering (--geno and --mind without a value use the PLINK default of 0.1)
subprocess.run(["plink", "--merge-list", "500k_mergefile.txt", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", "Merge_500k_3910pp") ])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_3910pp"), "--geno", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_3910pp_geno01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_3910pp_geno01"), "--mind", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_3729pp_geno01_mind01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_3729pp_geno01_mind01"), "--filter-founders", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_3612pp_geno01_mind01_founders")])

CompletedProcess(args=['plink', '--bfile', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_500k_3729pp_geno01_mind01', '--filter-founders', '--make-bed', '--out', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_500k_3612pp_geno01_mind01_founders'], returncode=0)

In [9]:
#Merging 300k files
subprocess.run(["plink", "--merge-list", "300k_mergefile.txt", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", "Merge_300k_4783pp") ])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_300k_4783pp"), "--geno", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_300k_4783pp_geno01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_300k_4783pp_geno01"), "--mind", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_300k_4426pp_geno01_mind01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_300k_4426pp_geno01_mind01"), "--filter-founders", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_300k_4309pp_geno01_mind01_founders")])

CompletedProcess(args=['plink', '--bfile', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_300k_4426pp_geno01_mind01', '--filter-founders', '--make-bed', '--out', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_300k_4309pp_geno01_mind01_founders'], returncode=0)

In [10]:
#Merging 30k files
subprocess.run(["plink", "--merge-list", "30k_mergefile.txt", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", "Merge_30k_5702pp") ])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5702pp"), "--geno", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_30k_5702pp_geno01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5702pp_geno01"), "--mind", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_30k_5345pp_geno01_mind01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5345pp_geno01_mind01"), "--filter-founders", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_30k_5228pp_geno01_mind01_founders")])

CompletedProcess(args=['plink', '--bfile', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_30k_5345pp_geno01_mind01', '--filter-founders', '--make-bed', '--out', '/home/tomas/Documents/PopStruct/DataBases/Genotypes/04_Merge/Merge_30k_5228pp_geno01_mind01_founders'], returncode=0)
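The three cells above repeat the same four steps (merge, `--geno`, `--mind`, `--filter-founders`) for each set. A sketch of how the commands could be generated from one function is shown here. `qc_commands` is a hypothetical helper, and its output names leave out the sample counts (e.g. `3910pp`) used above, since those are only known after each step runs.

```python
import os

def qc_commands(merge_list, out_dir, label):
    """Build the four PLINK calls for one set, without running them."""
    merged   = os.path.join(out_dir, "Merge_" + label)
    geno     = merged + "_geno01"
    mind     = geno + "_mind01"
    founders = mind + "_founders"
    return [
        ["plink", "--merge-list", merge_list, "--make-bed", "--out", merged],
        ["plink", "--bfile", merged, "--geno", "--make-bed", "--out", geno],
        ["plink", "--bfile", geno, "--mind", "--make-bed", "--out", mind],
        ["plink", "--bfile", mind, "--filter-founders", "--make-bed", "--out", founders],
    ]

# Each command reads the output of the previous one
cmds = qc_commands("500k_mergefile.txt", "04_Merge", "500k")
```

Each list can be passed to `subprocess.run` in order. Because the names differ from the ones above, any clean-up pattern that relies on the `pp` suffix would need to be adjusted.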

In [11]:
#Remove some intermediary files in 04_Merge folder
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob("*pp.*"):
    os.remove(file)
for file in glob.glob("*geno01.*"):
    os.remove(file)
for file in glob.glob("*mind01.*"):
    os.remove(file)
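Before deleting, it can help to preview which files the three patterns select. The sketch below applies the same wildcard patterns to a list of names with `fnmatch`, which follows the same wildcard rules as `glob` apart from the handling of hidden files; `would_remove` is a hypothetical helper, and the `.bed` names are built from the output prefixes of the 500k cell.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*pp.*", "*geno01.*", "*mind01.*"]

def would_remove(names, patterns=PATTERNS):
    """Return the names matching any clean-up pattern, in their original order."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]

names = [
    "Merge_500k_3910pp.bed",
    "Merge_500k_3910pp_geno01.bed",
    "Merge_500k_3729pp_geno01_mind01.bed",
    "Merge_500k_3612pp_geno01_mind01_founders.bed",
]
doomed = would_remove(names)
```

Only the final `_founders` file survives, because in its name none of `pp`, `geno01`, or `mind01` is directly followed by a dot.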

Now let's see how many SNPs and samples per file there are

In [12]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob("*.log"):
    with open(file) as myfile:
        print("In file: " + file.split(".")[0])
        for line in myfile:
            if "people" in line or "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: Merge_300k_4309pp_geno01_mind01_founders
290001 variants loaded from .bim file.
4426 people (1251 males, 2478 females, 697 ambiguous) loaded from .fam.
117 people removed due to founder status (--filter-founders).
290001 variants and 4309 people pass filters and QC.
Finished file... 

In file: Merge_500k_3612pp_geno01_mind01_founders
497305 variants loaded from .bim file.
3729 people (1251 males, 2478 females) loaded from .fam.
117 people removed due to founder status (--filter-founders).
497305 variants and 3612 people pass filters and QC.
Finished file... 

In file: Merge_30k_5228pp_geno01_mind01_founders
29597 variants loaded from .bim file.
5345 people (1583 males, 3065 females, 697 ambiguous) loaded from .fam.
117 people removed due to founder status (--filter-founders).
29597 variants and 5228 people pass filters and QC.
Finished file... 
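If the counts are needed as numbers rather than printed text, the summary line of each log can be parsed with a regular expression. This is a sketch that assumes the line format shown in the output above; `parse_pass_line` is a hypothetical helper.

```python
import re

# Matches e.g. "290001 variants and 4309 people pass filters and QC."
PASS_LINE = re.compile(r"(\d+) variants and (\d+) people pass filters and QC")

def parse_pass_line(line):
    """Return (variants, people) from a PLINK summary line, or None if it does not match."""
    match = PASS_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

counts = parse_pass_line("290001 variants and 4309 people pass filters and QC.")
other  = parse_pass_line("290001 variants loaded from .bim file.")
```

Here `counts` holds the pair `(290001, 4309)` and `other` is `None`.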



## Removing repeated IIDs

Some IIDs are repeated under different FIDs, and it is not clear whether they are the same individuals.
For now we keep the first occurrence of each IID and remove the rest.

In [13]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob("*.fam"):
    fam    = pd.read_csv(file, sep = " ", header = None).drop_duplicates(subset = 1)
    fam.iloc[:,0:2].to_csv(file.split(".")[0] + "_keep.fam", sep = " ", header = False, index = False)
    subprocess.run(["plink", "--bfile", file.split(".")[0], "--keep", file.split(".")[0] + "_keep.fam", 
                    "--make-bed", "--out", file.split(".")[0] + "_unique"])

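To look into whether the repeated IIDs are the same individuals, it helps to first list which IIDs appear under more than one FID. The sketch below does this on the lines of a `.fam` file using only the standard library; `iids_with_multiple_fids` is a hypothetical helper and the sample lines are made up.

```python
from collections import defaultdict

def iids_with_multiple_fids(fam_lines):
    """Map each IID that appears under more than one FID to its sorted FIDs."""
    fids = defaultdict(set)
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2:
            # In a .fam file, column 1 is the FID and column 2 is the IID
            fids[fields[1]].add(fields[0])
    return {iid: sorted(f) for iid, f in fids.items() if len(f) > 1}

# Made-up example lines in .fam format
fam_lines = [
    "FAM1 S001 0 0 2 -9",
    "FAM2 S001 0 0 2 -9",
    "FAM1 S002 0 0 1 -9",
]
repeated = iids_with_multiple_fids(fam_lines)
```

The resulting FID and IID pairs could then be compared, for example by genotype concordance, before deciding which copy to keep.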
## Split by chromosome

Now we will split the three files by chromosome to use them as input files for phasing.
Also, for each split, we will remove variants and samples with high missing call rates.

In [14]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob("*_unique.bed"):
    filename = file.split(".")[0]
    #Split by chromosome
    for i in range(1,23):
        subprocess.run(["plink", "--bfile", filename, "--chr", str(i), "--geno", "--mind", 
                        "--make-bed", "--out", "Split/" + filename + "_chr_" + str(i)])
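The per-chromosome calls can also be built as a list first, which makes it easy to inspect them before running. This is a sketch under the same settings as the loop above (autosomes 1 to 22, output in a `Split` folder); `split_commands` and the example prefix are hypothetical.

```python
import os

def split_commands(prefix, out_dir="Split", chromosomes=range(1, 23)):
    """Build one PLINK call per chromosome, without running them."""
    commands = []
    for c in chromosomes:
        out = os.path.join(out_dir, prefix + "_chr_" + str(c))
        commands.append(["plink", "--bfile", prefix, "--chr", str(c),
                         "--geno", "--mind", "--make-bed", "--out", out])
    return commands

# Hypothetical prefix for illustration
cmds = split_commands("Merge_500k_unique")
```

Since the `Split` folder may not exist yet, calling `os.makedirs("Split", exist_ok=True)` before running the commands is a reasonable precaution.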