# Quality Control

This notebook performs quality control (QC) on the genotype files.
Specifically, we will remove individuals and SNPs with low call rates.

## Preliminaries

First, let's import the modules and set the paths.

In [1]:
import glob, os, shutil, subprocess

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## QC procedure

First, we will remove all nonfounders from our datasets (samples with a parental ID).
Then, we will clean the datasets by removing SNPs and individuals with high missing call rates (higher than 0.1), using --geno and --mind respectively.
We will also remove SNPs with a MAF below 0.01 or an HWE p-value lower than 0.001.
Finally, we will LD-prune the data to run an IBD analysis and remove one individual from any pair showing 3rd-degree kinship or closer (a pihat score ≥0.09).
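
As a minimal sketch of the chain described above, the thresholds can be passed to PLINK explicitly instead of being left to its defaults. The helper names `qc_commands` and `ibd_commands`, the `example` prefix, and the pruning window, step and r² values (50, 5, 0.2) are assumptions made for illustration; only the filtering thresholds come from the text. The functions only build the argument lists, so nothing is run until a list is handed to `subprocess.run`.

```python
# Thresholds taken from the text above; all names here are hypothetical.
GENO_MAX_MISSING = "0.1"   # --geno: maximum per-SNP missing call rate
MIND_MAX_MISSING = "0.1"   # --mind: maximum per-sample missing call rate
MAF_MIN = "0.01"           # --maf: minimum minor allele frequency
HWE_MIN_P = "0.001"        # --hwe: minimum HWE p-value
PIHAT_MIN = "0.09"         # --min: smallest PI_HAT kept in the --genome report


def qc_commands(in_prefix, out_prefix):
    """Return the PLINK argument lists for the filtering chain, in order."""
    steps = [
        ("_founders", ["--filter-founders"]),
        ("_geno01", ["--geno", GENO_MAX_MISSING]),
        ("_maf", ["--maf", MAF_MIN]),
        ("_hwe", ["--hwe", HWE_MIN_P]),
        ("_mind01", ["--mind", MIND_MAX_MISSING]),
    ]
    commands = []
    current_in, suffix = in_prefix, ""
    for step_suffix, flags in steps:
        suffix += step_suffix
        current_out = out_prefix + suffix
        commands.append(["plink", "--bfile", current_in, *flags,
                         "--make-bed", "--out", current_out])
        current_in = current_out  # each step reads the previous output
    return commands


def ibd_commands(prefix, window="50", step="5", r2="0.2"):
    """LD-prune, then estimate pairwise IBD on the pruned SNPs.

    The window, step and r2 defaults are assumed values, not from the text.
    """
    return [
        ["plink", "--bfile", prefix, "--indep-pairwise", window, step, r2,
         "--out", prefix],
        ["plink", "--bfile", prefix, "--extract", prefix + ".prune.in",
         "--genome", "--min", PIHAT_MIN, "--out", prefix + "_ibd"],
    ]


# Print the commands for a hypothetical dataset prefix instead of running them.
for cmd in qc_commands("01_Original/example", "02_Clean/example"):
    print(" ".join(cmd))
```

Writing the values out makes the log files self-documenting and keeps the run independent of whatever defaults the installed PLINK version applies.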

In [3]:
# Change to the genotypes directory
os.chdir(pathgenos)

In [9]:
for file in glob.glob("01_Original/*.bed"):
    # PLINK takes the file prefix (the path without the .bed extension)
    filename1 = os.path.splitext(file)[0]
    print(filename1)
    filename2 = os.path.join("02_Clean", os.path.basename(filename1))
    # Keep founders only
    subprocess.run(["plink", "--bfile", filename1, "--filter-founders", "--make-bed", "--out", filename2 + "_founders"])
    # SNP missing rate (--geno without a value uses PLINK's default of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders", "--geno", "--make-bed", "--out", filename2 + "_founders_geno01"])
    # SNP MAF (--maf without a value uses PLINK's default of 0.01)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01", "--maf", "--make-bed", "--out", filename2 + "_founders_geno01_maf"])
    # SNP HWE
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf", "--hwe", "0.001", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe"])
    # Sample missing rate (--mind without a value uses PLINK's default of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe", "--mind", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    # Remove samples without sex information in CHP (listed in the .nosex file)
    if "CHP" in filename1:
        subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_founders_geno01_maf_hwe_mind01.nosex",
                        "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])

print("Finished")

01_Original/TD2016_1M_181ppl
01_Original/Euro180_176ppl_317K_hg19_ATGC
01_Original/SA_231ppl_599K_hg19_ATGC
01_Original/UIUC2013_116ppl_959K_hg19_ATGC
01_Original/ADAPT_2784ppl_567K_hg19
01_Original/CV_697ppl_964K_hg19_ATGC
01_Original/TD2015_199ppl_1M_hg19_ATGC
01_Original/CHP_1022ppl_114K_hg19_ATGC
01_Original/FEMMES_20170425
01_Original/UIUC2014_168ppl_703K_hg19_ATGC
01_Original/UC_FEMMES_IDUpdated
Finished


In [10]:
#Remove intermediary files
removefiles =  glob.glob("02_Clean/*_founders.*") + glob.glob("02_Clean/*_geno01.*") + glob.glob("02_Clean/*_maf.*") + glob.glob("02_Clean/*_hwe.*")
for file in removefiles:
    os.remove(file)
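
The cleanup above depends on the final `_mind01` files not matching any of the four patterns. A quick way to convince ourselves of that, sketched here with `fnmatch` and without touching the disk, is to test the patterns against file names that follow the same naming scheme. The `example_...` names and the `is_intermediate` helper are hypothetical.

```python
from fnmatch import fnmatch

# Same patterns as the cleanup cell, without the directory part.
INTERMEDIATE_PATTERNS = ["*_founders.*", "*_geno01.*", "*_maf.*", "*_hwe.*"]


def is_intermediate(filename):
    """True if the file name matches one of the cleanup patterns."""
    return any(fnmatch(filename, pattern) for pattern in INTERMEDIATE_PATTERNS)


# Hypothetical file names following the naming scheme used above.
names = [
    "example_founders.bed",
    "example_founders_geno01.bim",
    "example_founders_geno01_maf.fam",
    "example_founders_geno01_maf_hwe.log",
    "example_founders_geno01_maf_hwe_mind01.bed",
]
for name in names:
    print(name, "->", "remove" if is_intermediate(name) else "keep")
```

Each pattern requires the step suffix to sit right before the extension dot, so only the file whose name ends in `_mind01` is kept.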

Let's look at how many SNPs remained in each dataset.

In [11]:
for file in glob.glob("02_Clean/*_mind01.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: SA_231ppl_599K_hg19_ATGC_founders_geno01_maf_hwe_mind01
524899 variants loaded from .bim file.
524899 variants and 222 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01
514038 variants loaded from .bim file.
514038 variants and 166 people pass filters and QC.
Finished file... 

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01
868241 variants loaded from .bim file.
868241 variants and 115 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_founders_geno01_maf_hwe_mind01
611225 variants loaded from .bim file.
611225 variants and 181 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01
514194 variants loaded from .bim file.
514194 variants and 182 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_founders_geno01_maf_hwe_mind01
313230 variants loaded from .bim file.
313230 var
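
The log lines printed in this notebook follow a fixed wording, so the counts can also be pulled out as numbers rather than read by eye. The sketch below is one possible way to do that with regular expressions. The `summarize_log` helper and the keys of the returned dictionary are hypothetical, and the patterns assume the log wording stays exactly as printed here.

```python
import re

# Patterns follow the wording of the PLINK log lines printed in this notebook.
PATTERNS = {
    "variants_loaded": re.compile(r"^(\d+) variants loaded from \.bim file\."),
    "people_loaded": re.compile(r"^(\d+) people \(.*\) loaded from \.fam\."),
    "removed_mind": re.compile(r"^(\d+) people removed due to missing genotype data"),
    "pass_qc": re.compile(r"^(\d+) variants and (\d+) people pass filters and QC\."),
}


def summarize_log(text):
    """Return a dict with the counts found in the text of one PLINK .log file."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        for key, pattern in PATTERNS.items():
            match = pattern.match(line)
            if not match:
                continue
            if key == "pass_qc":
                # This line carries two counts: variants and people.
                summary["variants_pass"] = int(match.group(1))
                summary["people_pass"] = int(match.group(2))
            else:
                summary[key] = int(match.group(1))
    return summary


# Two lines copied from the SA dataset output above.
example = """524899 variants loaded from .bim file.
524899 variants and 222 people pass filters and QC.
"""
print(summarize_log(example))
```

To use it on the real files, read each `02_Clean/*_mind01.log` with `open(...).read()` and pass the text to `summarize_log`; collecting the dictionaries in a list makes it easy to build a per-dataset table afterwards.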

Let's look at how many samples were removed from each dataset.

In [12]:
for file in glob.glob("02_Clean/*_mind01.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "people" in line:
                print(line, end='')
        print("Finished file... \n")

In file: SA_231ppl_599K_hg19_ATGC_founders_geno01_maf_hwe_mind01
222 people (80 males, 142 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
524899 variants and 222 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01
166 people (75 males, 91 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
514038 variants and 166 people pass filters and QC.
Finished file... 

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01
115 people (34 males, 81 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
868241 variants and 115 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_founders_geno01_maf_hwe_mind01
181 people (40 males, 141 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
611225 variants and 181 people pass filters and QC.
Finished file... 

In file: TD2015_19