# Quality Control

This notebook runs quality control on the genotype files.
Specifically, we will remove SNPs and individuals with low call rates, along with SNPs that fail the minor allele frequency and Hardy-Weinberg filters.

## Preliminaries

First, let's import the modules we need and set the paths.

In [2]:
import glob, os, shutil, subprocess

In [3]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Missing call rates

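First, we will clean our datasets by removing SNPs and individuals with high missing call rates (higher than 0.1), using --geno and --mind. Between those two steps we also filter SNPs by minor allele frequency (--maf) and by Hardy-Weinberg equilibrium (--hwe, p-value below 0.001). The flags --geno, --maf and --mind are passed without a value, so PLINK's defaults apply (0.1, 0.01 and 0.1 respectively in PLINK 1.9).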
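To make the two call-rate filters concrete, here is a toy illustration of how per-SNP and per-sample missing call rates can be computed on a small genotype matrix. This is only a sketch: the matrix, the use of `None` for a missing call, and the helper `missing_rate` are assumptions made for illustration, and PLINK itself works on binary filesets rather than Python lists. In the pipeline below the sample filter runs after the SNP filters, whereas here both rates are computed on the same unfiltered matrix for simplicity.

```python
# Toy genotype matrix: rows are samples, columns are SNPs.
# Values are allele counts (0, 1, 2); None marks a missing call.
# Both the data and the encoding are made up for illustration.
genotypes = [
    [0, 1, None, 2],
    [1, None, None, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
]

def missing_rate(calls):
    """Fraction of missing calls in a sequence of genotype calls."""
    calls = list(calls)
    return sum(call is None for call in calls) / len(calls)

# Per-SNP missing rate (the quantity --geno thresholds): one value per column.
snp_missing = [missing_rate(column) for column in zip(*genotypes)]

# Per-sample missing rate (the quantity --mind thresholds): one value per row.
sample_missing = [missing_rate(row) for row in genotypes]

# Keep anything whose missing rate does not exceed the threshold.
threshold = 0.1
keep_snps = [j for j, rate in enumerate(snp_missing) if rate <= threshold]
keep_samples = [i for i, rate in enumerate(sample_missing) if rate <= threshold]

print(snp_missing)      # [0.0, 0.25, 0.5, 0.0]
print(sample_missing)   # [0.25, 0.5, 0.0, 0.0]
print(keep_snps, keep_samples)
```

With a threshold of 0.1, only the SNPs and samples with no missing calls survive in this small example.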

In [6]:
#Change the working directory to the genotypes folder
os.chdir(pathgenos)

In [7]:
for file in glob.glob("01_Original/*.bed"):
    #Strip the .bed extension to get the PLINK fileset prefix
    filename1 = os.path.splitext(file)[0]
    print(filename1)
    #Output prefix: same dataset name, written to 02_Clean
    filename2 = "02_Clean/" + os.path.basename(filename1)
    #SNP missing rate (--geno without a value uses PLINK's default threshold)
    subprocess.run(["plink", "--bfile", filename1, "--geno", "--make-bed", "--out", filename2 + "_geno01"])
    #SNP MAF (--maf without a value uses PLINK's default threshold)
    subprocess.run(["plink", "--bfile", filename2 + "_geno01", "--maf", "--make-bed", "--out", filename2 + "_geno01_maf"])
    #SNP HWE (remove SNPs with HWE p-value below 0.001)
    subprocess.run(["plink", "--bfile", filename2 + "_geno01_maf", "--hwe", "0.001", "--make-bed", "--out", filename2 + "_geno01_maf_hwe"])
    #Sample missing rate (--mind without a value uses PLINK's default threshold)
    subprocess.run(["plink", "--bfile", filename2 + "_geno01_maf_hwe", "--mind", "--make-bed", "--out", filename2 + "_geno01_maf_hwe_mind01"])

print("Finished")

01_Original/TD2016_1M_181ppl
01_Original/Euro180_176ppl_317K_hg19_ATGC
01_Original/SA_231ppl_599K_hg19_ATGC
01_Original/UIUC2013_116ppl_959K_hg19_ATGC
01_Original/ADAPT_2784ppl_567K_hg19
01_Original/CV_697ppl_964K_hg19_ATGC
01_Original/TD2015_199ppl_1M_hg19_ATGC
01_Original/CHP_1022ppl_114K_hg19_ATGC
01_Original/FEMMES_20170425
01_Original/UIUC2014_168ppl_703K_hg19_ATGC
01_Original/UC_FEMMES_IDUpdated
Finished
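The calls above rely on PLINK's default thresholds whenever a flag is given without a value. As a sketch, the same four-step sequence could be built with the thresholds spelled out, which also makes the commands easy to inspect before running them. The helper `build_qc_commands` and its parameter names are hypothetical, and the default values assume PLINK 1.9 (0.1 for --geno and --mind, 0.01 for --maf).

```python
def build_qc_commands(prefix_in, prefix_out, geno=0.1, maf=0.01, hwe=0.001, mind=0.1):
    """Return the four PLINK QC commands as argument lists, without running them."""
    # (flag, threshold, suffix appended to the output prefix), in pipeline order
    steps = [
        ("--geno", geno, "_geno01"),
        ("--maf", maf, "_maf"),
        ("--hwe", hwe, "_hwe"),
        ("--mind", mind, "_mind01"),
    ]
    commands = []
    current = prefix_in   # input fileset for the next step
    suffix = ""
    for flag, value, tag in steps:
        suffix += tag
        target = prefix_out + suffix
        commands.append(["plink", "--bfile", current, flag, str(value),
                         "--make-bed", "--out", target])
        current = target  # each step reads the previous step's output
    return commands

cmds = build_qc_commands("01_Original/TD2016_1M_181ppl", "02_Clean/TD2016_1M_181ppl")
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, so that a failing PLINK step raises an error instead of being silently skipped.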


In [17]:
#Remove intermediary files
removefiles = glob.glob("02_Clean/*_geno01.*") + glob.glob("02_Clean/*_maf.*") + glob.glob("02_Clean/*_hwe.*")  
for file in removefiles:
    os.remove(file)
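The three patterns above only match files whose prefix ends in `_geno01`, `_maf` or `_hwe`, so the final `_mind01` files are left in place. One way to check this without touching the disk is to test the patterns against a few file names with `fnmatch`, which applies the same shell-style wildcards to plain strings. The file names below are built from one of the dataset prefixes and are only illustrative, and `is_intermediate` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Same wildcard patterns as in the cleanup cell, without the folder part
patterns = ["*_geno01.*", "*_maf.*", "*_hwe.*"]

def is_intermediate(name):
    """True if the file name matches any of the cleanup patterns."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)

names = [
    "SA_231ppl_599K_hg19_ATGC_geno01.bed",
    "SA_231ppl_599K_hg19_ATGC_geno01_maf.bim",
    "SA_231ppl_599K_hg19_ATGC_geno01_maf_hwe.fam",
    "SA_231ppl_599K_hg19_ATGC_geno01_maf_hwe_mind01.bed",
    "SA_231ppl_599K_hg19_ATGC_geno01_maf_hwe_mind01.log",
]

flags = [is_intermediate(name) for name in names]
for name, flag in zip(names, flags):
    print(name, "-> removed" if flag else "-> kept")
```

The first three names are flagged for removal, while the two `_mind01` files are kept.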

Let's look at how many SNPs remain in each dataset.

In [18]:
for file in glob.glob("02_Clean/*_mind01.log"):
    with open(file) as myfile:
        print("In file: " + os.path.splitext(os.path.basename(file))[0])
        #Print only the log lines that report variant counts
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2014_168ppl_703K_hg19_ATGC_geno01_maf_hwe_mind01
513789 variants loaded from .bim file.
513789 variants and 167 people pass filters and QC.
Finished file... 

In file: CV_697ppl_964K_hg19_ATGC_geno01_maf_hwe_mind01
883530 variants loaded from .bim file.
883530 variants and 697 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_geno01_maf_hwe_mind01
513938 variants loaded from .bim file.
513938 variants and 199 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_geno01_maf_hwe_mind01
440599 variants loaded from .bim file.
440599 variants and 2782 people pass filters and QC.
Finished file... 

In file: SA_231ppl_599K_hg19_ATGC_geno01_maf_hwe_mind01
524911 variants loaded from .bim file.
524911 variants and 231 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_geno01_maf_hwe_mind01
611225 variants loaded from .bim file.
611225 variants and 181 people pass filters and QC.
Finished file... 

In f
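Rather than reading the raw log lines, the counts can also be collected into a small dictionary per dataset. The sketch below assumes the log wording printed in this notebook; the helper `parse_plink_log` is hypothetical, and the regular expressions would need adjusting if another PLINK version phrases these lines differently.

```python
import re

# One capture group per pattern: the count at the start of the line
SINGLE_COUNT = {
    "variants_loaded": re.compile(r"^(\d+) variants loaded from \.bim file\."),
    "people_loaded": re.compile(r"^(\d+) people \(.*\) loaded from \.fam\."),
    "people_removed": re.compile(
        r"^(\d+) people removed due to missing genotype data \(--mind\)\."),
}
# The summary line carries two counts
PASSED = re.compile(r"^(\d+) variants and (\d+) people pass filters and QC\.")

def parse_plink_log(lines):
    """Collect variant and sample counts from an iterable of log lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        for key, pattern in SINGLE_COUNT.items():
            match = pattern.match(line)
            if match:
                counts[key] = int(match.group(1))
        match = PASSED.match(line)
        if match:
            counts["variants_passed"] = int(match.group(1))
            counts["people_passed"] = int(match.group(2))
    return counts

# Lines copied from the ADAPT log output shown in this notebook
example = [
    "2784 people (1030 males, 1754 females) loaded from .fam.",
    "2 people removed due to missing genotype data (--mind).",
    "440599 variants loaded from .bim file.",
    "440599 variants and 2782 people pass filters and QC.",
]
counts = parse_plink_log(example)
print(counts)
```

With a real log, the open file handle can be passed directly, for example `with open(file) as myfile: parse_plink_log(myfile)`.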

Let's look at how many samples were removed from each dataset.

In [19]:
for file in glob.glob("02_Clean/*_mind01.log"):
    with open(file) as myfile:
        print("In file: " + os.path.splitext(os.path.basename(file))[0])
        #Print only the log lines that report sample counts
        for line in myfile:
            if "people" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2014_168ppl_703K_hg19_ATGC_geno01_maf_hwe_mind01
168 people (76 males, 92 females) loaded from .fam.
513789 variants and 167 people pass filters and QC.
Finished file... 

In file: CV_697ppl_964K_hg19_ATGC_geno01_maf_hwe_mind01
697 people (285 males, 412 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
883530 variants and 697 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_geno01_maf_hwe_mind01
199 people (32 males, 167 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
513938 variants and 199 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_geno01_maf_hwe_mind01
2784 people (1030 males, 1754 females) loaded from .fam.
2 people removed due to missing genotype data (--mind).
440599 variants and 2782 people pass filters and QC.
Finished file... 

In file: SA_231ppl_599K_hg19_ATGC_geno01_maf_hwe_mind01
231 people (85 males, 146 females) loaded fr