# Quality Control

This script runs a quality control (QC) pipeline on the genotype files.

## Preliminaries

First, let's import the modules and set the paths.

In [2]:
import glob, os, shutil, subprocess

In [3]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## QC procedure

First, we'll remove all nonfounders from our datasets (samples with a parental ID) and keep only autosomal chromosomes.
Then, we'll clean the datasets by removing SNPs and individuals with high missing call rates (higher than 0.1), using --geno and --mind respectively.
We'll also remove SNPs with a MAF below 0.01 or a HWE p-value lower than 0.001.
Finally, we'll LD-prune the SNPs, run an IBD analysis on the pruned set, and remove one individual from any pair that shows 3rd-degree kinship or closer (a pihat score ≥ 0.125).
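
The 0.125 cutoff can be motivated by the expected proportion of the genome shared identical by descent, which halves with each degree of relatedness. A minimal sketch of that arithmetic follows; the helper name `expected_pihat` is ours for illustration and is not part of PLINK.

```python
def expected_pihat(degree):
    """Expected proportion of the genome shared IBD by relatives of the given degree."""
    return 0.5 ** degree

# Print the expectation for the three closest degrees of relatedness
for degree, label in [(1, "first"), (2, "second"), (3, "third")]:
    print(f"{label}-degree relatives: expected pihat = {expected_pihat(degree)}")
```

Estimated pihat values scatter around these expectations, so pairs close to the cutoff may fall on either side of it.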

In [4]:
#Change to the genotypes directory
os.chdir(pathgenos)

In [9]:
for file in glob.glob("01_Original/*.bed"):
    filename1 = file.split(".")[0]
    print(filename1)
    filename2 = "02_Clean/" + filename1.split("/")[1]
    #Keep founders and autosomal chromosomes only
    subprocess.run(["plink", "--bfile", filename1, "--filter-founders", "--autosome", "--make-bed", "--out", filename2 + "_founders"])
    #SNP missing call rate (0.1 threshold stated explicitly)
    subprocess.run(["plink", "--bfile", filename2 + "_founders", "--geno", "0.1", "--make-bed", "--out", filename2 + "_founders_geno01"])
    #SNP MAF (0.01 threshold stated explicitly)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01", "--maf", "0.01", "--make-bed", "--out", filename2 + "_founders_geno01_maf"])
    #SNP HWE
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf", "--hwe", "0.001", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe"])
    #Sample missing call rate (0.1 threshold stated explicitly)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe", "--mind", "0.1", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    #Remove samples with no sex information in CHP (listed in the .nosex file)
    if "CHP" in filename1:
        subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_founders_geno01_maf_hwe_mind01.nosex", 
                        "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    #LD prune before the IBS/IBD analysis
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--indep", "50", "5", "2", "--out", filename2 + "_prune_temp"])
    #List pairs of relatives with pihat >= 0.125 (written to the .genome file)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--exclude", filename2 + "_prune_temp" + ".prune.out", 
                    "--genome", "--min", "0.125", "--out", filename2 + "_rel_temp"])
    #Remove the first sample (FID1, IID1) of each related pair listed in the .genome file
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_rel_temp.genome", "--make-bed", "--out", 
                    filename2 + "_founders_geno01_maf_hwe_mind01_rel"])
        
print("Finished")

01_Original/TD2016_1M_181ppl
01_Original/Euro180_176ppl_317K_hg19_ATGC
01_Original/SA_231ppl_599K_hg19_ATGC
01_Original/UIUC2013_116ppl_959K_hg19_ATGC
01_Original/ADAPT_2784ppl_567K_hg19
01_Original/CV_697ppl_964K_hg19_ATGC
01_Original/TD2015_199ppl_1M_hg19_ATGC
01_Original/CHP_1022ppl_114K_hg19_ATGC
01_Original/FEMMES_20170425
01_Original/UIUC2014_168ppl_703K_hg19_ATGC
01_Original/UC_FEMMES_IDUpdated
Finished
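
As we understand it, `--remove` treats the first two columns of its input file as family and individual IDs, so passing the `.genome` file drops the first member of every listed pair. The sketch below mirrors that selection in plain Python so the removal list can be inspected. The function name `first_members` and all sample IDs and values are made up for illustration; only the column names follow the `.genome` header.

```python
def first_members(genome_text):
    """Return the sorted, unique (FID1, IID1) pairs listed in .genome-style text."""
    rows = [line.split() for line in genome_text.strip().splitlines()]
    header, data = rows[0], rows[1:]
    fid_col = header.index("FID1")
    iid_col = header.index("IID1")
    return sorted({(row[fid_col], row[iid_col]) for row in data})

# Hypothetical .genome content: three related pairs, two of which share sample A
example = """
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
F1 A F2 B UN NA 0.50 0.50 0.00 0.2500 -1 0.80 1.0 5.0
F1 A F3 C UN NA 0.75 0.25 0.00 0.1250 -1 0.78 1.0 4.0
F4 D F5 E UN NA 0.00 1.00 0.00 0.5000 -1 0.85 1.0 9.0
"""

print(first_members(example))
```

Because the choice is made per pair, this approach may drop more samples than strictly necessary when one individual is related to several others.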


In [11]:
#Remove intermediary files
patterns = ["*_founders.*", "*_geno01.*", "*_maf.*", "*_hwe.*", "*_temp*", "*_mind01.*"]
removefiles = [f for pattern in patterns for f in glob.glob(os.path.join("02_Clean", pattern))]
for file in removefiles:
    os.remove(file)

Let's look at how many SNPs remain in each dataset after QC.

In [13]:
for file in glob.glob("02_Clean/*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
846249 variants loaded from .bim file.
846249 variants and 91 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
500446 variants loaded from .bim file.
500446 variants and 113 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
500509 variants loaded from .bim file.
500509 variants and 153 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_founders_geno01_maf_hwe_mind01_rel
429925 variants loaded from .bim file.
429925 variants and 2374 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
304153 variants loaded from .bim file.
304153 variants and 176 people pass filters and QC.
Finished file... 

In file: CHP_1022ppl_114K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
102859 variants 
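
If the counts are needed as numbers rather than printed text, the summary line can be parsed with a regular expression. This is a minimal sketch that assumes the log wording shown in the output above; the names `PASS_LINE` and `parse_pass_line` are ours.

```python
import re

# Matches the summary line PLINK writes at the end of a run, as seen in the logs above
PASS_LINE = re.compile(r"(\d+) variants and (\d+) people pass filters and QC\.")

def parse_pass_line(line):
    """Return (variants, people) from a summary line, or None if the line does not match."""
    match = PASS_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_pass_line("846249 variants and 91 people pass filters and QC."))
```

Returning `None` for non-matching lines lets the same function be applied to every line of a log without a separate filter.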

Let's look at how many samples were removed from each dataset in the final relatedness step.

In [14]:
for file in glob.glob("02_Clean/*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "people" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
115 people (34 males, 81 females) loaded from .fam.
--remove: 91 people remaining.
846249 variants and 91 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
182 people (28 males, 154 females) loaded from .fam.
--remove: 113 people remaining.
500446 variants and 113 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
166 people (75 males, 91 females) loaded from .fam.
--remove: 153 people remaining.
500509 variants and 153 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_founders_geno01_maf_hwe_mind01_rel
2702 people (1007 males, 1695 females) loaded from .fam.
--remove: 2374 people remaining.
429925 variants and 2374 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
176 people 
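
The number of samples dropped in this step is the difference between the people loaded and the people remaining after `--remove`. A minimal sketch of that subtraction follows, assuming the log wording shown above; the function name `samples_removed` is ours.

```python
import re

def samples_removed(log_lines):
    """Return people loaded minus people remaining after --remove, or None if either is missing."""
    loaded = remaining = None
    for line in log_lines:
        match = re.match(r"(\d+) people \(.*\) loaded from \.fam\.", line)
        if match:
            loaded = int(match.group(1))
        match = re.match(r"--remove: (\d+) people remaining\.", line)
        if match:
            remaining = int(match.group(1))
    if loaded is None or remaining is None:
        return None
    return loaded - remaining

# Lines taken from the UIUC2013 log output above
uiuc2013 = [
    "115 people (34 males, 81 females) loaded from .fam.",
    "--remove: 91 people remaining.",
    "846249 variants and 91 people pass filters and QC.",
]
print(samples_removed(uiuc2013))
```

These logs only cover the last PLINK call, so the difference reflects the relatedness filter alone and not the earlier founder or missingness filters.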