# Quality Control

This script performs quality control on the genotypes and related information.
Specifically, we will check sex information, remove individuals and genotypes with low call rates, and update any information on the presence of relatives.

## Preliminaries

First, let's import modules and set paths.

In [1]:
import glob, os, shutil, subprocess, csv, time
import pandas as pd
import numpy as np

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")

## Missing call rates

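First, we will clean our datasets by removing SNPs and individuals with high missing call rates (higher than 0.1), using --geno and --mind. Both flags are passed without a value, so PLINK's default threshold applies, which is 0.1 for both in PLINK 1.9. SNPs are filtered first, and the sample filter is then run on the SNP-filtered files.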
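
To make the two filters concrete, here is a minimal pure-Python sketch of the same logic on a toy genotype table. It is not how PLINK is implemented; the data, the `missing_rate` helper, and all variable names are made up for illustration. The only behavior assumed from PLINK is that a SNP or sample is dropped when its missing call rate exceeds the threshold.

```python
# Toy data: 20 samples x 12 SNPs; None marks a missing call (hypothetical values)
n_samples, n_snps = 20, 12
genos = [[1] * n_snps for _ in range(n_samples)]
for i in range(5):
    genos[i][0] = None   # SNP 0 is missing in 5 of 20 samples
genos[3][1] = None
genos[3][2] = None       # sample 3 is also missing SNPs 1 and 2

threshold = 0.1


def missing_rate(calls):
    """Fraction of missing (None) calls in a list."""
    return sum(c is None for c in calls) / len(calls)


# Step 1 (the role of --geno): drop SNPs whose missing rate exceeds the threshold
snp_rates = [missing_rate([row[j] for row in genos]) for j in range(n_snps)]
keep_snps = [j for j, rate in enumerate(snp_rates) if rate <= threshold]
genos_geno = [[row[j] for j in keep_snps] for row in genos]

# Step 2 (the role of --mind): drop samples, judged on the remaining SNPs only
sample_rates = [missing_rate(row) for row in genos_geno]
keep_samples = [i for i, rate in enumerate(sample_rates) if rate <= threshold]
genos_clean = [genos_geno[i] for i in keep_samples]

print(len(keep_snps), "SNPs and", len(keep_samples), "samples kept")
```

In this toy example SNP 0 (missing in 25% of samples) is dropped first, and sample 3 is then dropped because it lacks 2 of the 11 remaining SNPs. Running the SNP filter first means that samples are judged only on SNPs that are worth keeping.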

In [None]:
# Move to the genotypes directory so the relative paths below resolve
os.chdir(pathgenos)

In [33]:
for file in glob.glob("01_Original/*.bed"):
    # PLINK takes the file prefix, so strip only the .bed extension
    filename1 = os.path.splitext(file)[0]
    print(filename1)
    filename2 = os.path.join("02_Clean", os.path.basename(filename1))
    # SNP missing rate (--geno without a value uses PLINK's default threshold)
    subprocess.run(["plink", "--bfile", filename1, "--geno", "--make-bed", "--out", filename2 + "_geno01"])
    # Sample missing rate (--mind without a value uses PLINK's default threshold)
    subprocess.run(["plink", "--bfile", filename2 + "_geno01", "--mind", "--make-bed", "--out", filename2 + "_geno01_mind01"])
    
print("Finished")

01_Original/TD2016_1M_181ppl
01_Original/Euro180_176ppl_317K_hg19_ATGC
01_Original/SA_231ppl_599K_hg19_ATGC
01_Original/UIUC2013_116ppl_959K_hg19_ATGC
01_Original/CV_697ppl_954K_hg19_ATGC
01_Original/ADAPT_2784ppl_567K_hg19
01_Original/TD2015_199ppl_1M_hg19_ATGC
01_Original/CHP_1022ppl_114K_hg19_ATGC
01_Original/FEMMES_20170425
01_Original/UIUC2014_168ppl_703K_hg19_ATGC
01_Original/UC_FEMMES_IDUpdated
Finished
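
The two PLINK calls can also be built with the thresholds written out explicitly, which makes the `_geno01` and `_mind01` suffixes self-documenting. The sketch below only constructs and prints the commands; `plink_filter_commands` and its parameters are hypothetical names, and the `02_Clean` output folder is assumed to exist already.

```python
import os


def plink_filter_commands(bed_path, out_dir="02_Clean", geno=0.1, mind=0.1):
    """Build the SNP-filter and sample-filter commands for one dataset (hypothetical helper)."""
    prefix_in = os.path.splitext(bed_path)[0]   # PLINK expects the prefix, not the .bed file
    base = os.path.basename(prefix_in)
    # Suffixes follow the naming used in this notebook and match the default 0.1 thresholds
    out_geno = os.path.join(out_dir, base + "_geno01")
    out_mind = out_geno + "_mind01"
    cmd_geno = ["plink", "--bfile", prefix_in, "--geno", str(geno),
                "--make-bed", "--out", out_geno]
    cmd_mind = ["plink", "--bfile", out_geno, "--mind", str(mind),
                "--make-bed", "--out", out_mind]
    return cmd_geno, cmd_mind


cmd_geno, cmd_mind = plink_filter_commands("01_Original/TD2016_1M_181ppl.bed")
print(" ".join(cmd_geno))
print(" ".join(cmd_mind))
```

To execute them, each list can be passed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` if PLINK exits with a non-zero status instead of silently continuing to the next dataset.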


Let's look at how many SNPs were removed in each dataset.

In [35]:
for file in glob.glob("02_Clean/*geno01.log"):
    with open(file) as myfile:
        print("In file: " + os.path.splitext(os.path.basename(file))[0])
        # Print only the log lines that report variant counts
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: Euro180_176ppl_317K_hg19_ATGC_geno01
317503 variants loaded from .bim file.
3416 variants removed due to missing genotype data (--geno).
314087 variants and 176 people pass filters and QC.
Finished file... 

In file: TD2016_1M_181ppl_geno01
1121300 variants loaded from .bim file.
0 variants removed due to missing genotype data (--geno).
1121300 variants and 181 people pass filters and QC.
Finished file... 

In file: UIUC2013_116ppl_959K_hg19_ATGC_geno01
959801 variants loaded from .bim file.
30164 variants removed due to missing genotype data (--geno).
929637 variants and 116 people pass filters and QC.
Finished file... 

In file: CHP_1022ppl_114K_hg19_ATGC_geno01
114495 variants loaded from .bim file.
1132 variants removed due to missing genotype data (--geno).
113363 variants and 1022 people pass filters and QC.
Finished file... 

In file: SA_231ppl_599K_hg19_ATGC_geno01
599830 variants loaded from .bim file.
25248 variants removed due to missing genotype data (--geno).
5745
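
The printed counts are easier to compare across datasets once they are collected into one table. The following is a minimal sketch that parses log text with regular expressions and reports the percentage of SNPs removed. The example text is copied from the output above; `summarize_variants` is a hypothetical helper, and the patterns assume the log wording shown in this notebook.

```python
import re

# Log lines copied from the output above, keyed by file name
logs = {
    "Euro180_176ppl_317K_hg19_ATGC_geno01": (
        "317503 variants loaded from .bim file.\n"
        "3416 variants removed due to missing genotype data (--geno).\n"
        "314087 variants and 176 people pass filters and QC.\n"
    ),
    "TD2016_1M_181ppl_geno01": (
        "1121300 variants loaded from .bim file.\n"
        "0 variants removed due to missing genotype data (--geno).\n"
        "1121300 variants and 181 people pass filters and QC.\n"
    ),
    "UIUC2013_116ppl_959K_hg19_ATGC_geno01": (
        "959801 variants loaded from .bim file.\n"
        "30164 variants removed due to missing genotype data (--geno).\n"
        "929637 variants and 116 people pass filters and QC.\n"
    ),
}

PATTERNS = {
    "loaded": r"(\d+) variants? loaded from \.bim",
    "removed": r"(\d+) variants? removed due to missing genotype data",
    "passing": r"(\d+) variants? and \d+ \w+ pass filters and QC",
}


def summarize_variants(text):
    """Extract variant counts from one log; a count is None if its line is absent."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        counts[key] = int(match.group(1)) if match else None
    return counts


rows = []
for name, text in logs.items():
    counts = summarize_variants(text)
    # The example logs contain all three lines, so the division is safe here
    counts["pct_removed"] = 100 * counts["removed"] / counts["loaded"]
    rows.append({"dataset": name, **counts})

for row in rows:
    print(f'{row["dataset"]:<40} {row["removed"]:>7} removed ({row["pct_removed"]:.2f}%)')
```

In the notebook itself, `logs` could be filled by reading each file matched by `glob.glob("02_Clean/*geno01.log")` instead of using hard-coded text.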

Let's look at how many samples were removed in each dataset.
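
The number of samples removed does not have to be read from the removal message itself. It can also be derived as the number of people loaded minus the number passing filters, which avoids depending on the exact wording of that message (the wording may differ when a single sample is removed; that is an assumption about the log format, not something verified here). A minimal sketch, with `samples_removed` as a hypothetical helper and example lines in the format PLINK writes for the TD2015 dataset:

```python
import re


def samples_removed(log_text):
    """Return (loaded, passing, removed) sample counts, or None if a line is absent."""
    loaded = re.search(r"(\d+) (?:people|person) .*loaded from \.fam", log_text)
    passing = re.search(r"and (\d+) (?:people|person) pass filters and QC", log_text)
    if loaded is None or passing is None:
        return None
    n_loaded = int(loaded.group(1))
    n_passing = int(passing.group(1))
    return n_loaded, n_passing, n_loaded - n_passing


# Two lines from the TD2015 --mind log
td2015_log = (
    "199 people (32 males, 167 females) loaded from .fam.\n"
    "578542 variants and 198 people pass filters and QC.\n"
)

print(samples_removed(td2015_log))
```

Because the `--mind` step is the only sample filter applied in that run, the difference is the number of samples it removed.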

In [38]:
for file in glob.glob("02_Clean/*geno01_mind01.log"):
    with open(file) as myfile:
        print("In file: " + os.path.splitext(os.path.basename(file))[0])
        # Print only the log lines that report sample counts
        for line in myfile:
            if "people" in line:
                print(line, end='')
        print("Finished file... \n")

In file: UIUC2013_116ppl_959K_hg19_ATGC_geno01_mind01
116 people (34 males, 82 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
929637 variants and 116 people pass filters and QC.
Finished file... 

In file: TD2015_199ppl_1M_hg19_ATGC_geno01_mind01
199 people (32 males, 167 females) loaded from .fam.
578542 variants and 198 people pass filters and QC.
Finished file... 

In file: SA_231ppl_599K_hg19_ATGC_geno01_mind01
231 people (85 males, 146 females) loaded from .fam.
0 people removed due to missing genotype data (--mind).
574582 variants and 231 people pass filters and QC.
Finished file... 

In file: UC_FEMMES_IDUpdated_geno01_mind01
248 people (0 males, 0 females, 248 ambiguous) loaded from .fam.
0 people removed due to missing genotype data (--mind).
110926 variants and 248 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_geno01_mind01
2784 people (1030 males, 1754 females) loaded from .fam.
5 people removed due to mi