# Merging genotypes

In this script we take the harmonized genotypes and merge them.
We remove possible duplicated IIDs, as well as the control plates from the Axiom Array, which can be identified because they have no sex information.
Because [fineStructure](https://people.maths.bris.ac.uk/~madjl/finestructure/) requires a lot of computational resources, we split our data in two: one dataset retaining most of our samples, and a second one including only the samples for which we have phenotype data (facial morphology), written with the `_phenos` suffix.
The files that have phenotypes (from the 2016 batch) are:

- ADAPT
- SA
- UIUC2013
- UIUC2014
- GHPAFF_Euro
- Axiom Array

Note that UIUC2014 only contains duplicates from the ADAPT file, so we will remove it from further analysis.
Therefore, we will have a "dense" (500k) genotype file including only samples with phenotypes:

- ADAPT
- SA
- UIUC2013

And a "sparse" (30k) genotype file with most of our samples. It contains the files already included in the dense file, plus:

- Dense:
    - TD2015
    - TD2016
    - PSU_FEMMES
- Medium:
    - GHPAFF_Euro
    - GHPAFF_CV
- Sparse:
    - Axiom Array
    - UC_FEMMES (Note that adding UC_FEMMES, the number of SNPs decreases from ~30k to ~10k)
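
The composition described above can be written down as plain Python lists, which makes it easy to check that the sparse set contains the dense one and that UIUC2014 is left out. This is only a sketch: the entries are the dataset labels from the lists above, not the actual file prefixes on disk, and the variable names `DENSE_SETS` and `SPARSE_SETS` are hypothetical.

```python
# Dataset labels taken from the lists above (not the real file prefixes).
DENSE_SETS = ["ADAPT", "SA", "UIUC2013"]

# The sparse merge reuses the dense datasets and adds the remaining ones.
SPARSE_SETS = DENSE_SETS + [
    "TD2015", "TD2016", "PSU_FEMMES",   # dense arrays
    "GHPAFF_Euro", "GHPAFF_CV",         # medium-density arrays
    "Axiom Array", "UC_FEMMES",         # sparse arrays
]

# UIUC2014 only duplicates ADAPT samples, so it appears in neither list.
EXCLUDED = ["UIUC2014"]

print(len(DENSE_SETS), "dense datasets;", len(SPARSE_SETS), "sparse datasets")
```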

## Preliminaries

Let's import the modules and set the paths.

In [1]:
import subprocess, os, glob
import pandas as pd
import numpy as np

In [2]:
#Setting paths
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Merging each file

Let's first merge the per-chromosome files of each dataset into a single file.

In [3]:
#Move directory
os.chdir(pathgenos)

In [4]:
#Entering every directory in the 03_Harmonized folder
harmonized = os.path.join(pathgenos, "03_Harmonized")

for filename in os.listdir(harmonized): # loop through all the files and folders
    if os.path.isdir(os.path.join(harmonized, filename)):
        os.chdir(os.path.join(harmonized, filename))
        #Write the per-chromosome file prefixes to use for merging
        with open(filename + ".txt", "w") as f:
            for file in glob.glob("*_harmonized.bed"):
                f.write(os.path.splitext(file)[0] + "\n")
        #Merge using the list just created
        outprefix = os.path.join(harmonized, filename + "_all_harmonized")
        subprocess.run(["plink", "--merge-list", filename + ".txt", "--allow-no-sex", "--make-bed", "--out", outprefix])
        #Remove no-sex samples in CHP (those are plate controls)
        if filename.startswith("CHP"):
            subprocess.run(["plink", "--bfile", outprefix, 
                            "--remove", os.path.join(harmonized, "CHP_1022ppl_114K_hg19_ATGC_geno01_maf_hwe_mind01_all_harmonized.nosex"), 
                            "--make-bed", "--out", outprefix])
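
The calls above ignore PLINK's exit status, so a failed merge would go unnoticed until a later step. A small wrapper along the following lines could make failures visible. It is a sketch, not part of the original notebook: the function name `run_plink` and its `executable` parameter are hypothetical, and it assumes the binary is reachable through `PATH`.

```python
import shutil
import subprocess

def run_plink(args, executable="plink"):
    """Run PLINK with the given argument list and fail loudly on errors.

    `run_plink` and `executable` are illustrative names, not PLINK options.
    """
    # Stop early with a clear message if the binary cannot be found
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"'{executable}' was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run([executable, *args], check=True)
```

A call such as `run_plink(["--merge-list", "list.txt", "--make-bed", "--out", "merged"])` would then raise instead of silently continuing if the merge fails.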

## Extracting IIDs with phenos

Now we will extract only the subset of samples with phenotypes.

In [7]:
#Move directory
os.chdir(pathgenos)

In [8]:
#Create the output folder (no error if it already exists)
os.makedirs("03_Harmonized/Phenos", exist_ok = True)
#IDs of the samples with phenotypes (matched on the first column)
idsphenos = pd.read_csv("../IDsRemap2016.txt", header = None)
for file in glob.glob("03_Harmonized/*harmonized.bed"):
    filename = file.split(".")[0]
    #Keep only the FID and IID columns of the fam file
    fam  = pd.read_csv(filename + ".fam", header = None, sep = " ").iloc[:,[0,1]]
    #Inner merge on IID keeps only the samples that have phenotypes
    keep = pd.merge(fam.astype({1:"str"}), idsphenos.drop_duplicates(subset = 0), how='inner', left_on = 1, right_on = 0).iloc[:,[0,1]]
    keepfilename = filename.split("/")[0] + "/Phenos/KEEP_" + filename.split("/")[1]
    plinkoutfilename = filename.split("/")[0] + "/Phenos/" + filename.split("/")[1] + "_phenos"
    keep.to_csv(keepfilename, header = None, index = False, sep = " ")
    subprocess.run(["plink", "--bfile", filename, "--keep", keepfilename, "--make-bed", "--out",  plinkoutfilename])
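
To make the filtering logic easier to follow, here is the same inner-merge idea on toy data. The values are made up, and named columns (`FID`, `IID`) are used instead of the positional ones in the cell above, purely for readability.

```python
import pandas as pd

# Toy .fam-like table (made-up values): family ID and individual ID
fam = pd.DataFrame({"FID": ["F1", "F2", "F3"], "IID": [101, 102, 103]})

# Toy phenotype ID list (made-up values) with one duplicate and one
# ID that is absent from the genotype file
idsphenos = pd.DataFrame({"IID": ["101", "103", "103", "999"]})

# Cast IIDs to strings so both key columns share a dtype, drop duplicated
# phenotype IDs, and keep only the samples present in both tables
keep = pd.merge(
    fam.astype({"IID": "str"}),
    idsphenos.drop_duplicates(subset="IID"),
    how="inner",
    on="IID",
)

print(keep)
```

Only the samples with IIDs 101 and 103 survive, each once, which is the table that is then written out for `--keep`.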

## Merging and cleaning all files

Now we will merge all files into a single dataset, following the plan described above.

In [17]:
#Move directory
os.chdir(os.path.join(pathgenos, "03_Harmonized"))

In [6]:
#Creating merging files
#Skip this if already done
#Remember to delete UIUC2014 from the list
with open("30k_mergefile.txt", "w") as f:
    for file in glob.glob("*harmonized.bed"):
        f.write(file.split(".")[0] + "\n")
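
The manual step of deleting UIUC2014 from the list could also be done while the list is written. The helper below is a sketch under that assumption: `write_merge_list` and its `exclude` parameter are hypothetical names, and it writes bare file prefixes, so PLINK would have to be run from the same directory, as in the cell above.

```python
import glob
import os

def write_merge_list(directory, outfile, pattern="*harmonized.bed", exclude=()):
    """Write one binary-fileset prefix per line, skipping excluded datasets.

    `exclude` holds substrings; any prefix containing one of them is skipped.
    Returns the list of prefixes that were written.
    """
    prefixes = []
    for bed in sorted(glob.glob(os.path.join(directory, pattern))):
        # Strip the directory and the .bed extension to get the prefix
        prefix = os.path.splitext(os.path.basename(bed))[0]
        if any(tag in prefix for tag in exclude):
            continue
        prefixes.append(prefix)
    with open(outfile, "w") as handle:
        for prefix in prefixes:
            handle.write(prefix + "\n")
    return prefixes
```

For example, `write_merge_list(".", "30k_mergefile.txt", exclude=("UIUC2014",))` would leave out any fileset whose name contains `UIUC2014`.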

In [18]:
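#Merging 500k files from Phenos folder
#--geno, --maf and --mind are given without values, so PLINK's defaults apply
#(0.1 for --geno and --mind, 0.01 for --maf in PLINK 1.9)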
subprocess.run(["plink", "--merge-list", "500k_mergefile.txt", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp") ])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp"), "--geno", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01"), "--maf", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01_maf")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01_maf"), "--hwe", "0.001", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01_maf_hwe")])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01_maf_hwe"), "--mind", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_500k_2269pp_geno01_maf_hwe_mind01")])

#Founders already removed

CompletedProcess(args=['plink', '--bfile', '/home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/Merge_500k_2269pp_geno01_maf_hwe', '--mind', '--make-bed', '--out', '/home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/Merge_500k_2269pp_geno01_maf_hwe_mind01'], returncode=0)
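
The chain of filters above follows a fixed pattern: each step reads the previous output and appends a suffix to the file name. The sketch below builds the same sequence of commands with the thresholds written out explicitly. It only constructs the argument lists and does not run PLINK. The function name `qc_commands` is hypothetical, and the `--geno`, `--maf` and `--mind` values are the PLINK 1.9 defaults as far as I know, so they should be checked against the installed version.

```python
def qc_commands(prefix, steps):
    """Build one PLINK command per QC step, feeding each output to the next."""
    commands = []
    current = prefix
    for flag, value, suffix in steps:
        out = current + suffix
        commands.append(
            ["plink", "--bfile", current, flag, value, "--make-bed", "--out", out]
        )
        current = out
    return commands

# (flag, threshold, file-name suffix); thresholds assumed to match the defaults
steps = [
    ("--geno", "0.1", "_geno01"),
    ("--maf", "0.01", "_maf"),
    ("--hwe", "0.001", "_hwe"),
    ("--mind", "0.1", "_mind01"),
]

for command in qc_commands("Merge_500k_2269pp", steps):
    print(" ".join(command))
```

Writing the thresholds out keeps the file-name suffixes and the actual filters in sync if a default ever changes.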

In [22]:
#Merging 30k files
#There is something wrong, Euro output from harmonize step removes too many SNPs
subprocess.run(["plink", "--merge-list", "30k_mergefile.txt", "--make-bed", "--out", os.path.join(pathgenos, "04_Merge", "Merge_30k_5795pp") ])

subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5795pp"), "--geno", "--make-bed", "--out", 
                os.path.join(pathgenos, "04_Merge", "Merge_30k_5795pp_geno01")])

#subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5543pp_geno01"), "--mind", "--make-bed", "--out", 
#                os.path.join(pathgenos, "04_Merge", "Merge_30k_5186pp_geno01_mind01")])

#subprocess.run(["plink", "--bfile", os.path.join(pathgenos, "04_Merge", "Merge_30k_5186pp_geno01_mind01"), "--filter-founders", "--make-bed", "--out", 
#                os.path.join(pathgenos, "04_Merge", "Merge_30k_5069pp_geno01_mind01_founders")])

CompletedProcess(args=['plink', '--bfile', '/home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/Merge_30k_5795pp', '--geno', '--make-bed', '--out', '/home/tomas/Documents/Research/PopStruct/DataBases/Genotypes/04_Merge/Merge_30k_5795pp_geno01'], returncode=0)

In [19]:
#Remove some intermediary files in 04_Merge folder
os.chdir(os.path.join(pathgenos, "04_Merge"))
removefiles = glob.glob("*pp.*") + glob.glob("*geno01.*") + glob.glob("*maf.*") + glob.glob("*hwe.*")
for file in removefiles:
    os.remove(file)

Now let's see how many SNPs and samples per file there are.
There are no duplicated IIDs left.

In [20]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
for file in glob.glob("*.log"):
    with open(file) as myfile:
        print("In file: " + file.split(".")[0])
        for line in myfile:
            if "people" in line or "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: Merge_500k_2269pp_geno01_maf_hwe_mind01
422161 variants loaded from .bim file.
2269 people (851 males, 1418 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
422161 variants and 2269 people pass filters and QC.
Finished file... 
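
If the counts are needed programmatically, for instance to name output files, the final summary line of the log can be parsed with a regular expression. This is a sketch that assumes the log wording shown above; `parse_pass_counts` is a hypothetical helper name.

```python
import re

# Matches the summary line printed at the end of a PLINK log
PASS_LINE = re.compile(r"(\d+) variants and (\d+) people pass filters and QC\.")

def parse_pass_counts(log_text):
    """Return (n_variants, n_people) from a PLINK log, or None if absent."""
    match = PASS_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Lines copied from the log output above
example_log = """422161 variants loaded from .bim file.
2269 people (851 males, 1418 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
422161 variants and 2269 people pass filters and QC.
"""

print(parse_pass_counts(example_log))
```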



## Split by chromosome

Now we will split the three files by chromosome to use them as input files for phasing.
For each split, we will also remove variants and samples with high missing call rates (`--geno` and `--mind`).

In [21]:
os.chdir(os.path.join(pathgenos, "04_Merge"))
#Make sure the output folder exists before PLINK writes to it
os.makedirs("Split", exist_ok = True)
for file in glob.glob("*.bed"):
    filename = file.split(".")[0]
    #Split by chromosome (autosomes 1 to 22)
    for i in range(1,23):
        subprocess.run(["plink", "--bfile", filename, "--chr", str(i), "--geno", "--mind", "--make-bed", "--out", "Split/" + filename + "_chr_" + str(i)])
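
Before moving on to phasing, it may be worth confirming that every chromosome produced a complete fileset. The helper below is a sketch that follows the `<prefix>_chr_<n>` naming used in the loop above; `missing_chromosomes` is a hypothetical name.

```python
import os

def missing_chromosomes(split_dir, prefix, chromosomes=range(1, 23)):
    """Return the chromosomes lacking a complete .bed/.bim/.fam fileset."""
    missing = []
    for chrom in chromosomes:
        base = os.path.join(split_dir, f"{prefix}_chr_{chrom}")
        # A usable binary fileset needs all three files
        if not all(os.path.exists(base + ext) for ext in (".bed", ".bim", ".fam")):
            missing.append(chrom)
    return missing
```

Calling `missing_chromosomes("Split", filename)` after the loop should return an empty list if all 22 splits were written.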