# Quality Control

This script runs a quality control (QC) pipeline on the genotype files.

## Preliminaries

First, let's import the required modules and set the paths.

In [1]:
import glob, os, shutil, subprocess
import pandas as pd

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Dataset division

In this section we split each dataset, retaining only the individuals from the 2016 batch who have 3D facial morphology information (output as _phenos files).

In [3]:
#Move directory
os.chdir(pathgenos)

In [4]:
idsphenos = pd.read_csv("../IDsRemap2016.txt", header = None)
for file in glob.glob("01_Original/*.bed"):
    filename = file.split(".")[0]
    #First create a file with merge between phenos and fam file
    if "UC_FEMMES_IDUpdated" in filename:
        fam  = pd.read_csv(filename + ".fam", header = None, sep = "\t").iloc[:,[0,1]]
    else:
        fam  = pd.read_csv(filename + ".fam", header = None, sep = " ").iloc[:,[0,1]]
    keep = pd.merge(fam.astype({1:"str"}), idsphenos.drop_duplicates(subset = 0), how='inner', left_on = 1, right_on = 0).iloc[:,[0,1]]
    keepfilename = "02_Clean/" + filename.split("/")[1] + "_KEEP"
    plinkoutfilename = "02_Clean/" + filename.split("/")[1] + "_phenos"
    keep.to_csv(keepfilename, header = None, index = False, sep = " ")
    subprocess.run(["plink", "--bfile", filename, "--keep", keepfilename, "--make-bed", "--out",  plinkoutfilename])

for file in glob.glob("02_Clean/*KEEP"):
    os.remove(file)
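
As a dependency-free illustration of the keep-list step above, the sketch below filters `.fam` rows by individual ID. It splits on any run of whitespace, so tab-delimited and space-delimited `.fam` files go through the same code path. The helper name `build_keep_list` and the toy IDs are hypothetical; the notebook itself does this with `pd.merge`.

```python
import io

def build_keep_list(fam_lines, pheno_ids):
    """Return (FID, IID) pairs whose IID appears in pheno_ids.

    fam_lines: iterable of .fam lines (tab- or space-delimited).
    pheno_ids: iterable of individual IDs that have phenotype data.
    """
    wanted = {str(i).strip() for i in pheno_ids}  # a set also drops duplicate IDs
    keep = []
    for line in fam_lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) >= 2 and fields[1] in wanted:
            keep.append((fields[0], fields[1]))
    return keep

# Toy data, made up for illustration only
fam_tab = io.StringIO("F1\t101\t0\t0\t1\t-9\nF2\t102\t0\t0\t2\t-9\n")
fam_space = io.StringIO("F3 103 0 0 1 -9\nF4 104 0 0 2 -9\n")
pheno_ids = ["101", "104", "104"]

keep = build_keep_list(fam_tab, pheno_ids) + build_keep_list(fam_space, pheno_ids)
for fid, iid in keep:
    print(fid, iid)  # one "FID IID" row per line, the layout PLINK's --keep expects
```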

## QC procedure

The QC procedure runs as follows:

1. Retained only founders, that is, removed individuals with at least one parent in the dataset, and retained only autosomal chromosomes
2. Removed SNPs with missing call rates higher than 0.1
3. Removed SNPs with minor allele frequencies below 0.05
4. Removed SNPs with Hardy-Weinberg equilibrium p-values less than 0.001
5. Removed samples with missing call rates higher than 0.1
6. Removed one arbitrary individual from each pair with a pihat >= 0.125, estimated from an IBD analysis on LD-pruned SNPs
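
Steps 1 to 5 above form a chain in which each PLINK call reads the output of the previous one. A minimal sketch of that chaining is shown below, with the thresholds written out explicitly. The helper names `build_qc_commands` and `run_chain` and the `example` file prefix are hypothetical, and `check=True` is an assumption about how failures should be handled, not something the notebook does.

```python
import subprocess

def build_qc_commands(infile, outprefix, geno="0.1", maf="0.05", hwe="0.001", mind="0.1"):
    """Return the PLINK argument lists for QC steps 1 to 5, in order."""
    steps = [
        ("_founders", ["--filter-founders", "--autosome"]),
        ("_geno01", ["--geno", geno]),
        ("_maf", ["--maf", maf]),
        ("_hwe", ["--hwe", hwe]),
        ("_mind01", ["--mind", mind]),
    ]
    commands = []
    current_in, current_out = infile, outprefix
    for suffix, flags in steps:
        current_out += suffix  # output names accumulate one suffix per step
        commands.append(["plink", "--bfile", current_in, *flags,
                         "--make-bed", "--out", current_out])
        current_in = current_out  # the next step reads this step's output
    return commands

def run_chain(commands):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

# Only build and print the commands here; call run_chain(commands) to execute them
commands = build_qc_commands("01_Original/example", "02_Clean/example")
for cmd in commands:
    print(" ".join(cmd))
```

Building the commands separately from running them makes it easy to inspect the chain before anything touches the data.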


In [5]:
#Move directory
os.chdir(pathgenos)

QC in all samples

In [6]:
for file in glob.glob("01_Original/*.bed"):
    filename1 = file.split(".")[0]
    print(filename1)
    filename2 = "02_Clean/" + filename1.split("/")[1]
    subprocess.run(["plink", "--bfile", filename1, "--filter-founders", "--autosome", "--make-bed", "--out", filename2 + "_founders"])
    #SNP missing rate (--geno without a value uses PLINK's default threshold of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders", "--geno", "--make-bed", "--out", filename2 + "_founders_geno01"])
    #SNP MAF
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01", "--maf", "0.05", "--make-bed", "--out", filename2 + "_founders_geno01_maf"])
    #SNP HWE
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf", "--hwe", "0.001", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe"])
    #Sample missing rate (--mind without a value uses PLINK's default threshold of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe", "--mind", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    #Remove no sex in CHP
    if "CHP" in filename1:
        subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_founders_geno01_maf_hwe_mind01.nosex", 
                        "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    #LD prune for IBD estimation
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--indep", "50", "5", "2", "--out", filename2 + "_prune_temp"])
    #Print relatives
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--exclude", filename2 + "_prune_temp" + ".prune.out", 
                    "--genome", "--min", "0.125", "--out", filename2 + "_rel_temp"])
    #Remove the first individual (FID1, IID1) of each related pair listed in the .genome file
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_rel_temp.genome", "--make-bed", "--out", 
                    filename2 + "_founders_geno01_maf_hwe_mind01_rel"])
        
print("Finished")

01_Original/TD2016_1M_181ppl
01_Original/Euro180_176ppl_317K_hg19_ATGC
01_Original/SA_231ppl_599K_hg19_ATGC
01_Original/UIUC2013_116ppl_959K_hg19_ATGC
01_Original/ADAPT_2784ppl_567K_hg19
01_Original/CV_697ppl_964K_hg19_ATGC
01_Original/TD2015_199ppl_1M_hg19_ATGC
01_Original/CHP_1022ppl_114K_hg19_ATGC
01_Original/FEMMES_20170425
01_Original/UIUC2014_168ppl_703K_hg19_ATGC
01_Original/UC_FEMMES_IDUpdated
Finished
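
The last step above passes the `.genome` file straight to `--remove`, which drops the first individual of every related pair. The sketch below reproduces that selection on toy data and compares it with a greedy alternative that repeatedly drops the individual involved in the most remaining pairs. The greedy variant is not what this notebook does; it is shown only as an assumption-laden comparison, and all function names and toy rows are hypothetical.

```python
from collections import Counter

def parse_genome_pairs(lines):
    """Return ((FID1, IID1), (FID2, IID2)) for each data row of a .genome file."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] == "FID1":  # skip blank lines and the header
            continue
        pairs.append(((fields[0], fields[1]), (fields[2], fields[3])))
    return pairs

def first_column_removal(pairs):
    """Individuals dropped when removing the first member of every pair."""
    return sorted({first for first, _ in pairs})

def greedy_removal(pairs):
    """Repeatedly drop the individual involved in the most remaining pairs."""
    remaining = list(pairs)
    removed = []
    while remaining:
        counts = Counter(ind for pair in remaining for ind in pair)
        # Highest count first; ties broken by ID order for reproducibility
        worst = min(counts, key=lambda ind: (-counts[ind], ind))
        removed.append(worst)
        remaining = [p for p in remaining if worst not in p]
    return sorted(removed)

# Toy .genome content, made up for illustration only
toy = """ FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 A F2 B UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 NA
 F2 B F3 C UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 NA
"""
pairs = parse_genome_pairs(toy.splitlines())
print(first_column_removal(pairs))
print(greedy_removal(pairs))
```

In this toy case individual B is related to both A and C, so removing B alone resolves both pairs, whereas the first-column rule removes A and B.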


QC in _phenos subset

In [7]:
for file in glob.glob("02_Clean/*_phenos.bed"):
    filename1 = file.split(".")[0]
    print(filename1)
    filename2 = "02_Clean/" + filename1.split("/")[1]
    subprocess.run(["plink", "--bfile", filename1, "--filter-founders", "--autosome", "--make-bed", "--out", filename2 + "_founders"])
    #SNP missing rate (--geno without a value uses PLINK's default threshold of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders", "--geno", "--make-bed", "--out", filename2 + "_founders_geno01"])
    #SNP MAF
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01", "--maf", "0.05", "--make-bed", "--out", filename2 + "_founders_geno01_maf"])
    #SNP HWE
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf", "--hwe", "0.001", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe"])
    #Sample missing rate (--mind without a value uses PLINK's default threshold of 0.1)
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe", "--mind", "--make-bed", "--out", filename2 + "_founders_geno01_maf_hwe_mind01"])
    #LD prune for IBD estimation
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--indep", "50", "5", "2", "--out", filename2 + "_prune_temp"])
    #Print relatives
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--exclude", filename2 + "_prune_temp" + ".prune.out", 
                    "--genome", "--min", "0.125", "--out", filename2 + "_rel_temp"])
    #Remove the first individual (FID1, IID1) of each related pair listed in the .genome file
    subprocess.run(["plink", "--bfile", filename2 + "_founders_geno01_maf_hwe_mind01", "--remove", filename2 + "_rel_temp.genome", "--make-bed", "--out", 
                    filename2 + "_founders_geno01_maf_hwe_mind01_rel"])
        
print("Finished")

02_Clean/ADAPT_2784ppl_567K_hg19_phenos
02_Clean/UIUC2013_116ppl_959K_hg19_ATGC_phenos
02_Clean/SA_231ppl_599K_hg19_ATGC_phenos
02_Clean/UIUC2014_168ppl_703K_hg19_ATGC_phenos
02_Clean/Euro180_176ppl_317K_hg19_ATGC_phenos
02_Clean/CV_697ppl_964K_hg19_ATGC_phenos
02_Clean/CHP_1022ppl_114K_hg19_ATGC_phenos
Finished


In [9]:
#Remove intermediary files
patterns = ["*_founders.*", "*_geno01.*", "*_maf.*", "*_hwe.*", "*_temp*", "*_mind01.*", "*_phenos.*"]
removefiles = [f for pattern in patterns for f in glob.glob("02_Clean/" + pattern)]
for file in removefiles:
    os.remove(file)

Let's look at how many SNPs remained in each dataset.

In [10]:
for file in glob.glob("02_Clean/*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
770894 variants loaded from .bim file.
770894 variants and 93 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
452138 variants loaded from .bim file.
452138 variants and 113 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
456478 variants loaded from .bim file.
456478 variants and 153 people pass filters and QC.
Finished file... 

In file: CV_697ppl_964K_hg19_ATGC_phenos_founders_geno01_maf_hwe_mind01_rel
791206 variants loaded from .bim file.
791206 variants and 155 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_founders_geno01_maf_hwe_mind01_rel
393227 variants loaded from .bim file.
393227 variants and 2367 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
296316 vari
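
If the counts are needed as numbers rather than printed text, the summary line of each log can be parsed with a regular expression. This is a minimal sketch that assumes the line format shown in the output above; the names `SUMMARY` and `parse_summary` are hypothetical.

```python
import re

# Matches lines such as "770894 variants and 93 people pass filters and QC."
SUMMARY = re.compile(r"(\d+) variants and (\d+) people pass filters and QC\.")

def parse_summary(log_text):
    """Return (n_variants, n_people) from a PLINK log, or None if not found."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Two lines copied from the UIUC2013 output shown above
log_text = """770894 variants loaded from .bim file.
770894 variants and 93 people pass filters and QC.
"""
print(parse_summary(log_text))
```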

Let's look at how many samples were removed in each dataset.

In [11]:
for file in glob.glob("02_Clean/*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file.split("/")[1].split(".")[0])
        for line in myfile:
            if "people" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2013_116ppl_959K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
115 people (34 males, 81 females) loaded from .fam.
--remove: 93 people remaining.
770894 variants and 93 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
182 people (28 males, 154 females) loaded from .fam.
--remove: 113 people remaining.
452138 variants and 113 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel
166 people (75 males, 91 females) loaded from .fam.
--remove: 153 people remaining.
456478 variants and 153 people pass filters and QC.
Finished file... 

In file: CV_697ppl_964K_hg19_ATGC_phenos_founders_geno01_maf_hwe_mind01_rel
160 people (62 males, 98 females) loaded from .fam.
--remove: 155 people remaining.
791206 variants and 155 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_founders_geno01_maf_hwe_mind01_rel
2702 people (100
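
The number of samples removed can be computed directly from these log lines. The sketch below assumes the line formats shown in the output above; `LOADED`, `REMAINING`, and `samples_removed` are hypothetical names. Because each `_rel` log covers only the final `--remove` call, the result counts only samples dropped in the relatedness step, not in earlier filters.

```python
import re

# Matches e.g. "115 people (34 males, 81 females) loaded from .fam."
LOADED = re.compile(r"(\d+) people \(.*\) loaded from \.fam\.")
# Matches e.g. "--remove: 93 people remaining."
REMAINING = re.compile(r"--remove: (\d+) people remaining\.")

def samples_removed(log_text):
    """Return loaded minus remaining samples, or None if either line is missing."""
    loaded = LOADED.search(log_text)
    remaining = REMAINING.search(log_text)
    if loaded is None or remaining is None:
        return None
    return int(loaded.group(1)) - int(remaining.group(1))

# Lines copied from the UIUC2013 output shown above
log_text = """115 people (34 males, 81 females) loaded from .fam.
--remove: 93 people remaining.
770894 variants and 93 people pass filters and QC.
"""
print(samples_removed(log_text))
```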