# Quality Control

This script runs a quality control (QC) pipeline on the genotype files.

## Preliminaries

First, let's import modules and set paths.

In [1]:
import glob, os, shutil, subprocess
import pandas as pd
from GenotypeQC import QC_procedure

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## QC procedure

The QC procedure runs as follows:

1. Remove non-founders, that is, individuals with at least one parent in the dataset, and retain only autosomal chromosomes
2. Remove SNPs with missing call rates higher than 0.1
3. Remove SNPs with minor allele frequencies below 0.05
4. Remove SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
5. Remove samples with missing call rates higher than 0.1
6. Remove one arbitrary individual from any pairwise comparison with a pihat >= 0.25 from an IBD estimation after LD pruning
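
The source of `GenotypeQC.QC_procedure` is not shown in this notebook, so the following is only a minimal sketch of how steps 1 to 5 could map onto PLINK 1.9 filters. The helper name `build_qc_commands` is hypothetical, and the chained output suffixes are an assumption inferred from the log file names printed later (for example `..._founders_geno01_maf_hwe_mind01_rel.log`). The function only builds the argument lists; it does not run PLINK.

```python
# Hypothetical sketch: the real QC_procedure may use different flags or ordering.
# The suffix chain is an assumption based on the log file names in this notebook.
def build_qc_commands(prefix):
    """Return one PLINK argument list per filtering step (steps 1 to 5)."""
    steps = [
        ("_founders", ["--filter-founders", "--autosome"]),  # step 1
        ("_geno01", ["--geno", "0.1"]),                      # step 2
        ("_maf", ["--maf", "0.05"]),                         # step 3
        ("_hwe", ["--hwe", "1e-50"]),                        # step 4
        ("_mind01", ["--mind", "0.1"]),                      # step 5
    ]
    commands = []
    current = prefix
    for suffix, flags in steps:
        out = current + suffix
        commands.append(["plink", "--bfile", current, *flags,
                         "--make-bed", "--out", out])
        current = out  # the next step reads the output of this one
    return commands


# Print the commands for an example (hypothetical) file prefix
for cmd in build_qc_commands("example"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run`, as done elsewhere in this notebook. Step 6 is left out of the sketch because it depends on intermediate files (the LD-pruned SNP list and the pairwise IBD table) whose handling is not described here.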

### QC in all samples

In [9]:
# Change to the original genotypes directory and run the QC procedure
os.chdir(os.path.join(pathgenos, "01_Original"))
QC_procedure()

# Remove individuals with no sex information from the CHP (plates) files
os.chdir(os.path.join(pathgenos, "01_Original", "QC"))

for file in glob.glob("CHP*.bed"):
    filename1 = file.split(".")[0]
    # The .nosex file lists the samples with ambiguous sex codes
    subprocess.run(["plink", "--bfile", filename1, "--remove", filename1 + ".nosex",
                    "--make-bed", "--out", filename1])

Running CHP_1022ppl_114K_hg19_ATGC file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running Euro180_176ppl_317K_hg19_ATGC file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running FEMMES_20170425 file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNP

### QC in _phenos subset

Move the files to the 02_Clean folder and remove some intermediate files.

In [11]:
#Removing original phenos files in 02_Clean
outpath = os.path.join(pathgenos, "02_Clean")
os.chdir(outpath)

for file in glob.glob("*phenos*"):
    os.remove(file)
    
os.chdir(os.path.join(pathgenos, "01_Original", "QC"))
for file in glob.glob("*_rel*"):
    shutil.move(file, os.path.join(outpath, file))


Let's look at how many SNPs and individuals ended up in each dataset.

In [12]:
os.chdir(outpath)
for file in glob.glob("*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file)
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: SA_231ppl_599K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
463587 variants loaded from .bim file.
463587 variants and 212 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_founders_geno01_maf_hwe_mind01_rel.log
432012 variants loaded from .bim file.
432012 variants and 92 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
458109 variants loaded from .bim file.
458109 variants and 113 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
458402 variants loaded from .bim file.
458402 variants and 164 people pass filters and QC.
Finished file... 

In file: CV_697ppl_964K_hg19_ATGC_founders_geno01_maf_hwe_mind01_rel.log
791872 variants loaded from .bim file.
791872 variants and 684 people pass filters and QC.
Finished file... 

In file: UC_FEMMES_IDUpdated_founders_geno01_maf_hwe_mind01_rel.log
92995 variants loa