# Merging genotypes

In this script we will take the harmonized genotypes and merge them.
Because [fineStructure](https://people.maths.bris.ac.uk/~madjl/finestructure/) requires a lot of computational resources, we previously kept individuals with 3D face data, using the _pheno suffix. 
Note that UIUC2014 only contain duplicates from the ADAPT file, so we will remove it from further analysis.
For each dataset (the pheno and the total) we'll split our data into two, one retaining most of our samples, and a second one only incorporating samples with denser genotype data.
Therefore, we will have a "dense" (~400k) genotype file including only samples from:

- ADAPT
- SA
- UIUC2013
- CV
- TD2016
- TD2015
- PSU_FEMMES

The "sparse" (~20k) file will contain all of the previous datasets plus:

- GHPAFF_Euro
- Axiom Array (CHP)
- UC_FEMMES

## Preliminaries

Let's import modules and set paths

In [1]:
import subprocess, os, glob
import pandas as pd
import numpy as np

In [2]:
#Setting paths
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Concatenating each file by chromosome

Let's first merge the different files divided by chromosomes into a single one.
In the CV file we'll need to remove some duplicated SNPs.

In [3]:
#Move directory
os.chdir(pathgenos)

In [4]:
#Entering every directory in harmonize folder
filenames = os.listdir(os.path.join(pathgenos, "03_Harmonized"))

for filename in filenames: # loop through all the files and folders
    if os.path.isdir(os.path.join(pathgenos, "03_Harmonized", filename)):
        os.chdir(os.path.join(pathgenos, "03_Harmonized", filename))
        #Opening new file and pasting the file names to use for merging
        f = open(filename + ".txt", "w+")
        for file in glob.glob("*_harmonized.bed"):
            f.write(file.split(".")[0] + "\n")
        f.close()
        if "CV_" in filename: #in CV we need to remove some SNPs before merging
            subprocess.run(["plink", "--merge-list", filename + ".txt", "--allow-no-sex", "--make-bed", "--out", 
                            os.path.join(pathgenos, "03_Harmonized", filename + "_excludedtemp" )])
            for file in glob.glob("*_harmonized.bed"):
                subprocess.run(["plink", "--bfile", file.split(".")[0], "--exclude", 
                                os.path.join(pathgenos, "03_Harmonized", filename + "_excludedtemp-merge.missnp" ), 
                                "--make-bed", "--out", file.split(".")[0] + "_temp"])
                
            f = open(filename + ".txt", "w+")
            for file in glob.glob("*_temp.bed"):
                f.write(file.split(".")[0] + "\n")
            f.close()
            subprocess.run(["plink", "--merge-list", filename + ".txt", "--allow-no-sex", "--make-bed", "--out", 
                            os.path.join(pathgenos, "03_Harmonized", filename + "_all_harmonized" )])
            for file in glob.glob("*_temp.*"):
                os.remove(file)   
                
        else:
            subprocess.run(["plink", "--merge-list", filename + ".txt", "--allow-no-sex", "--make-bed", "--out", 
                            os.path.join(pathgenos, "03_Harmonized", filename + "_all_harmonized" )])

## Merging and cleaning all files

Now we will merge the harmonized files into four different dataset as explained below.

In [5]:
#Move directory
os.chdir(os.path.join(pathgenos, "03_Harmonized"))

In [6]:
#Creating initial merging files without UIUC2014
f = open("merge_dense_phenos.txt", "w+")
exclude = ["UIUC2014", "Euro", "CHP", "UC_FEMMES"]
for file in glob.glob("*phenos*harmonized.bed"):
    if any(x in file for x in exclude):
        pass
    else:
        f.write(file.split(".")[0] + "\n")
f.close()

f = open("merge_dense_all.txt", "w+")
exclude = ["phenos", "UIUC2014", "Euro", "CHP", "UC_FEMMES"]
for file in glob.glob("*harmonized.bed"):
    if any(x in file for x in exclude):
        pass
    else:
        f.write(file.split(".")[0] + "\n")
f.close()


f = open("merge_sparse_phenos.txt", "w+")
for file in glob.glob("*phenos*harmonized.bed"):
    if "UIUC2014" in file:
        pass
    else:
        f.write(file.split(".")[0] + "\n")
f.close()

f = open("merge_sparse_all.txt", "w+")
for file in glob.glob("*harmonized.bed"):
    if "phenos" in file:
        pass
    else:
        if "UIUC2014" in file:
            pass
        else:
            f.write(file.split(".")[0] + "\n")
f.close()

In [7]:
#Merging all listed files, create SNP list to exclude and exclude them from all files to merge

for setfile in glob.glob("merge*.txt"): #For each file with the list of datasets create a file of excluded SNPs
    setname   = setfile.split(".")[0]
    filenames = pd.read_csv(setfile, header = None) #reading file to get the database names
    subprocess.run(["plink", "--merge-list", setfile, "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, "ExcludeSnps") ])
    for i in range(1, (len(filenames.index)) ): #excluding SNPs from each database on the list
        subprocess.run(["plink", "--bfile", filenames.iloc[i,0], "--exclude", os.path.join(pathgenos, "04_Merge", setname, "ExcludeSnps-merge.missnp"),
                        "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, filenames.iloc[i,0] + "_excludedtemp") ])
    f = open(os.path.join(pathgenos, "04_Merge", setname, "mergefile.txt"), "w+") #creating a new mergefile with the databases with excluded SNPs
    for bedfile in glob.glob(os.path.join(pathgenos, "04_Merge", setname, "*.bed") ):
        f.write(bedfile.split(".")[0] + "\n")
    f.close()
    #Second merging for all phenos file and QC process
    subprocess.run(["plink", "--merge-list", os.path.join(pathgenos, "04_Merge", setname, "mergefile.txt"), 
                    "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, setname) ])

    subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", setname, setname), 
                    "--geno", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno") ])

    subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno"), 
                    "--maf", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno" + "_maf")])

    subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno" + "_maf"), 
                    "--hwe", "0.001", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno" + "_maf" + "_hwe") ])

    subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno" + "_maf" + "_hwe"), 
                    "--mind", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", setname, setname + "_geno" + "_maf" + "_hwe" + "_mind01")])
    
    for file in glob.glob(os.path.join(pathgenos, "04_Merge", setname, "*") ): #Removing intermediary files
        exclude = ["mind01.", "split"]
        if any(x in file for x in exclude):
            pass
        else:
            os.remove(file)

Now let's see how many SNPs and samples per file there are.
There are no duplicated IIDs left.

In [8]:
for file in glob.glob(os.path.join(pathgenos, "04_Merge", "**", "*.log") ):
    with open(file) as myfile:
        print("In file: " + file.split(".")[0])
        for num, line in enumerate(myfile, 1):
            if "people" in line or "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: /home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/merge_dense_all/merge_dense_all_geno_maf_hwe_mind01
201736 variants loaded from .bim file.
3533 people (1312 males, 2221 females) loaded from .fam.
90 people removed due to missing genotype data (--mind).
201736 variants and 3443 people pass filters and QC.
Finished file... 

In file: /home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/merge_sparse_all/merge_sparse_all_geno_maf_hwe_mind01
18114 variants loaded from .bim file.
4194 people (1463 males, 2502 females, 229 ambiguous) loaded from .fam.
319 people removed due to missing genotype data (--mind).
18114 variants and 3875 people pass filters and QC.
Finished file... 

In file: /home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/merge_sparse_phenos/merge_sparse_phenos_geno_maf_hwe_mind01
21400 variants loaded from .bim file.
3692 people (1163 males, 2310 females, 219 ambiguous) loaded from .fam.
309 people removed due 

## Split by chromosome

Now we will split the merged files by chromosome to use it as input for phasing.

In [9]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob(os.path.join("**", "*.bed") ):
    filename = file.split(".")[0]
    #Split by chromosome
    for i in range(1,23):
        subprocess.run(["plink", "--bfile", filename, "--chr", str(i), "--make-bed", "--out", 
        filename.split("/")[0] + "/split/" + filename.split("/")[1] + "_chr_" + str(i)])