# Quality Control

This script runs a quality control (QC) pipeline on the genotype files.

## Preliminaries

First, let's import the required modules and set the paths.

In [1]:
import glob, os, shutil, subprocess
import pandas as pd
from GenotypeQC import QC_procedure

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Dataset division

In this section we split the datasets, retaining only individuals with 3D facial morphology information from the 2016 batch (output as _phenos files).

In [3]:
#Move directory
os.chdir(pathgenos)

In [4]:
#Create the phenos dataset
idsphenos = pd.read_csv("../IDSMap1705.txt", header = None)
for file in glob.glob("01_Original/*.bed"):
    filename = os.path.splitext(file)[0]
    basename = os.path.basename(filename)
    #The UC_FEMMES_IDUpdated .fam file is tab-delimited; the others are space-delimited
    sep = "\t" if "UC_FEMMES_IDUpdated" in filename else " "
    fam = pd.read_csv(filename + ".fam", header = None, sep = sep).iloc[:,[0,1]]
    #Keep the family and individual IDs of samples that appear in the phenotype ID list
    keep = pd.merge(fam.astype({1:"str"}), idsphenos.drop_duplicates(subset = 0), how='inner', left_on = 1, right_on = 0).iloc[:,[0,1]]
    keepfilename = "02_Clean/" + basename + "_KEEP"
    plinkoutfilename = "02_Clean/" + basename + "_phenos"
    keep.to_csv(keepfilename, header = False, index = False, sep = " ")
    subprocess.run(["plink", "--bfile", filename, "--keep", keepfilename, "--make-bed", "--out",  plinkoutfilename])

for file in glob.glob("02_Clean/*KEEP"):
    os.remove(file)
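
The two `.fam` branches above differ only in the delimiter. As a minimal sketch of the same keep-list logic without pandas, assuming a `.fam` file whose first two whitespace-separated columns are the family ID and the individual ID, the hypothetical helper `build_keep_list` below splits on any whitespace, so tab- and space-delimited files are handled the same way. It is an illustration only, not the code used in this notebook.

```python
def build_keep_list(fam_lines, pheno_ids):
    """Return (FID, IID) pairs whose IID appears in pheno_ids.

    fam_lines: iterable of lines from a PLINK .fam file.
    pheno_ids: iterable of individual IDs with phenotype data
               (duplicates are harmless because a set is used).
    """
    wanted = {str(i).strip() for i in pheno_ids}
    keep = []
    for line in fam_lines:
        fields = line.split()  # splits on tabs and spaces alike
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]
        if iid in wanted:
            keep.append((fid, iid))
    return keep


# Toy example with one tab-delimited and two space-delimited lines
fam = ["F1\tA1\t0\t0\t1\t-9", "F2 A2 0 0 2 -9", "F3 A3 0 0 1 -9"]
for fid, iid in build_keep_list(fam, ["A1", "A3", "A3"]):
    print(fid, iid)
```

Each printed line has the two-column, space-separated layout that PLINK's `--keep` expects.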

## QC procedure

The QC procedure runs as follows:

1. Removed non-founders, that is, individuals with at least one parent in the dataset, and retained only autosomal chromosomes
2. Removed SNPs with missing call rates higher than 0.1
3. Removed SNPs with minor allele frequencies below 0.05
4. Removed SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
5. Removed samples with missing call rates higher than 0.1
6. Removed one arbitrary individual from any pair with a pihat >= 0.25, as estimated by IBD after LD pruning
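
The implementation of `QC_procedure` is not shown in this notebook. As a rough sketch of how steps 1 to 5 could map onto PLINK 1.9 flags, the hypothetical helper `qc_commands` below builds one command per step and chains the output names using the suffixes seen in the final log file names (`_founders_geno01_maf_hwe_mind01`). The step-to-flag mapping is an assumption. For step 6, the hypothetical `related_to_remove` shows one possible greedy way to pick an individual from each related pair; the selection rule and LD-pruning parameters actually used are not documented here.

```python
def qc_commands(prefix, geno=0.1, maf=0.05, hwe=1e-50, mind=0.1):
    """Build one PLINK command per filter (steps 1-5), chaining outputs.

    Assumed step-to-flag mapping; QC_procedure itself may differ.
    """
    steps = [
        ("founders", ["--filter-founders", "--autosome"]),  # step 1
        ("geno01", ["--geno", str(geno)]),                  # step 2
        ("maf", ["--maf", str(maf)]),                       # step 3
        ("hwe", ["--hwe", str(hwe)]),                       # step 4
        ("mind01", ["--mind", str(mind)]),                  # step 5
    ]
    commands, current = [], prefix
    for suffix, flags in steps:
        out = f"{current}_{suffix}"
        commands.append(
            ["plink", "--bfile", current, *flags, "--make-bed", "--out", out]
        )
        current = out  # the next step reads the previous step's output
    return commands


def related_to_remove(pairs, threshold=0.25):
    """Greedily pick one individual from each pair with pihat >= threshold.

    pairs: iterable of (id1, id2, pihat) tuples, e.g. rows taken from a
    PLINK .genome table computed on LD-pruned SNPs. A pair that was already
    broken by an earlier removal is skipped.
    """
    removed = set()
    for id1, id2, pihat in pairs:
        if pihat >= threshold and id1 not in removed and id2 not in removed:
            removed.add(id2)  # arbitrary choice: drop the second member
    return removed


# Print the commands instead of running them
for cmd in qc_commands("CHP_1022ppl_114K_hg19_ATGC"):
    print(" ".join(cmd))
```

Building the commands as lists keeps them ready to pass to `subprocess.run` and makes the chain of intermediate file names easy to inspect before anything is executed.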

QC on all samples

In [7]:
#Move directory and run QC procedure
os.chdir(os.path.join(pathgenos, "01_Original"))
QC_procedure()

#Remove individuals with no sex information in the CHP (plates) files
os.chdir(os.path.join(pathgenos, "01_Original", "QC"))

for file in glob.glob("CHP*.bed"):
    filename1 = os.path.splitext(file)[0]
    subprocess.run(["plink", "--bfile", filename1, "--remove", filename1 + ".nosex",
                    "--make-bed", "--out", filename1])

Running CHP_1022ppl_114K_hg19_ATGC file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running Euro180_176ppl_317K_hg19_ATGC file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running FEMMES_20170425 file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNP

QC on the _phenos subset

In [8]:
#Move directory and run QC procedure
os.chdir(os.path.join(pathgenos, "02_Clean"))
QC_procedure()

Running ADAPT_2784ppl_567K_hg19_phenos file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running UIUC2013_116ppl_959K_hg19_ATGC_phenos file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running TD2016_1M_181ppl_phenos file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 

Move the final files to the 02_Clean folder and remove some intermediary files.

In [9]:
#Removing original phenos files in 02_Clean
outpath = os.path.join(pathgenos, "02_Clean")
os.chdir(outpath)

for file in glob.glob("*phenos*"):
    os.remove(file)

os.chdir(os.path.join(outpath, "QC"))
for file in glob.glob("*_rel*"):
    shutil.move(file, os.path.join(outpath, file))
    
os.chdir(os.path.join(pathgenos, "01_Original", "QC"))
for file in glob.glob("*_rel*"):
    shutil.move(file, os.path.join(outpath, file))
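
The cell above relies on changing the working directory before each `glob` call. As a minimal alternative sketch, the hypothetical helper `move_matching` below moves files by explicit path with `pathlib`, which avoids depending on the current directory. The function name and its arguments are illustrative and not part of the original pipeline.

```python
import shutil
from pathlib import Path


def move_matching(src_dir, dst_dir, pattern):
    """Move files in src_dir whose names match pattern into dst_dir.

    Returns the sorted list of moved file names. Subdirectories are ignored.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    moved = []
    for path in sorted(src_dir.glob(pattern)):
        if path.is_file():
            shutil.move(str(path), str(dst_dir / path.name))
            moved.append(path.name)
    return moved
```

With the variables defined in this notebook, a call such as `move_matching(os.path.join(outpath, "QC"), outpath, "*_rel*")` would correspond to the first move loop.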


Let us look at how many SNPs and individuals ended up in each dataset.
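
The cell below prints the matching log lines as text. If the counts are needed as numbers, for example to build a summary table, they could be parsed along these lines. This is a sketch that assumes the PLINK log contains a line of the form "N variants and M people pass filters and QC.", as in the output shown in this notebook; `parse_pass_counts` is a hypothetical helper.

```python
import re

# Matches e.g. "463587 variants and 212 people pass filters and QC."
PASS_RE = re.compile(r"(\d+) variants and (\d+) people pass filters and QC")


def parse_pass_counts(log_text):
    """Return (n_variants, n_people) from a PLINK log, or None if absent."""
    match = PASS_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


example = (
    "463587 variants loaded from .bim file.\n"
    "463587 variants and 212 people pass filters and QC.\n"
)
print(parse_pass_counts(example))  # → (463587, 212)
```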

In [12]:
os.chdir(outpath)
for file in glob.glob("*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file)
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: SA_231ppl_599K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
463587 variants loaded from .bim file.
463587 variants and 212 people pass filters and QC.
Finished file... 

In file: FEMMES_20170425_phenos_founders_geno01_maf_hwe_mind01_rel.log
465082 variants loaded from .bim file.
465082 variants and 233 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_founders_geno01_maf_hwe_mind01_rel.log
432012 variants loaded from .bim file.
432012 variants and 92 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_phenos_founders_geno01_maf_hwe_mind01_rel.log
465696 variants loaded from .bim file.
465696 variants and 2635 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_phenos_founders_geno01_maf_hwe_mind01_rel.log
296829 variants loaded from .bim file.
296829 variants and 176 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
45