# Quality Control

This script runs a quality control (QC) pipeline on the genotype files.

## Preliminaries

First, let's import the modules and set the paths.

In [1]:
import glob, os, shutil, subprocess
import pandas as pd
from GenotypeQC import QC_procedure

In [2]:
projpath  = os.path.realpath("..")
pathgenos = os.path.join(projpath, "DataBases", "Genotypes")
pathdat   = os.path.join(projpath, "DataBases")
outpath   = os.path.join(pathgenos, "02_Clean")

## Retain common IDs

For each dataset, we will retain only the common IDs.
First, we need to extract the FID for each IID from the fam files.

In [121]:
#Reading common IDS
os.chdir(pathdat)
common_ids = pd.read_csv("common_ids.txt", header=None, names = ["ID"])

#Creating empty dataframe
common_FID_ID = pd.DataFrame(columns=["FID", "ID"])

#Reading fam files and adding to empty dataframe
os.chdir(os.path.join(pathgenos, "01_Original"))
for fam in glob.glob("*.fam"):
    #UIUC2014 entries are repeated, so this file is skipped
    if fam == "UIUC2014_168ppl_703K_hg19_ATGC.fam":
        continue
    #UC_FEMMES is tab separated; the other fam files are space separated
    sep = "\t" if fam == "UC_FEMMES_IDUpdated.fam" else " "
    famfile = pd.read_csv(fam, header=None, sep=sep, names=["FID", "ID"], usecols=["FID", "ID"], dtype=object)

    #Keep the FID of every common ID found in this fam file
    inter = pd.merge(common_ids, famfile, on="ID")[["FID", "ID"]]
    common_FID_ID = pd.concat([common_FID_ID, inter], ignore_index=True)
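
To make the merge step concrete, here is a minimal sketch on toy data. The IDs and family IDs are made up for illustration, and the `indicator` check is an addition that is not part of the original loop; it flags common IDs that have no match in a fam file.

```python
import pandas as pd

# Hypothetical toy data; these IDs do not come from the real datasets
common_ids = pd.DataFrame({"ID": ["A1", "A2", "A3"]})
famfile = pd.DataFrame({"FID": ["F1", "F2", "F9"], "ID": ["A1", "A2", "Z9"]}, dtype=object)

# Inner merge keeps only the IDs present in both tables, as in the loop above
inter = pd.merge(common_ids, famfile, on="ID")[["FID", "ID"]]

# Left merge with indicator=True shows which common IDs had no match
check = pd.merge(common_ids, famfile, on="ID", how="left", indicator=True)
missing = check.loc[check["_merge"] == "left_only", "ID"].tolist()

print(inter)
print("Common IDs without a FID:", missing)
```

A check like this could be run once per fam file, or once on the concatenated table, to confirm that every common ID was assigned a FID.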


Now, we will save the final dataset

In [122]:
os.chdir(pathdat)
common_FID_ID.to_csv("common_fid_id.txt", sep=" ", header=False, index=False)

Finally, for each dataset, we will extract only the individuals with common IDs.

In [127]:
os.chdir(os.path.join(pathgenos, "01_Original"))
pathsave_commonid = os.path.join(pathgenos, "01_Original", "Extract_Common_ID")
keepfile = os.path.join(pathdat, "common_fid_id.txt")

for bedfile in glob.glob("*.bed"):
    #Strip only the .bed extension, so prefixes containing dots stay intact
    filename = os.path.splitext(bedfile)[0]
    outfile  = filename + "_CID"
    subprocess.run(["plink", "--bfile", filename, "--keep", keepfile, "--make-bed", "--out", os.path.join(pathsave_commonid, outfile)])

## QC procedure

The QC procedure runs as follows:

1. Remove non-founders, that is, individuals with at least one parent in the dataset, and retain only autosomal chromosomes
2. Remove SNPs with missing call rates higher than 0.1
3. Remove SNPs with minor allele frequencies below 0.05
4. Remove SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
5. Remove samples with missing call rates higher than 0.1
6. Remove one arbitrary individual from any pair with a pihat >= 0.25, estimated by IBD after LD pruning
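
The steps above map onto standard PLINK 1.9 filters. The sketch below only builds the command lines and does not run them. It is an assumption about how such a pipeline could be chained, not the actual implementation of `QC_procedure` in `GenotypeQC`. The helper name `build_qc_commands` and the LD-pruning settings (`50 5 0.2`) are illustrative, while the suffixes mirror the output file names seen in this notebook.

```python
def build_qc_commands(prefix):
    """Return PLINK 1.9 command lines for the QC steps (hypothetical helper)."""
    steps = [
        ("founders", ["--filter-founders", "--autosome"]),  # step 1
        ("geno01", ["--geno", "0.1"]),                      # step 2
        ("maf", ["--maf", "0.05"]),                         # step 3
        ("hwe", ["--hwe", "1e-50"]),                        # step 4
        ("mind01", ["--mind", "0.1"]),                      # step 5
    ]
    commands = []
    current = prefix
    for suffix, flags in steps:
        out = current + "_" + suffix
        commands.append(["plink", "--bfile", current] + flags + ["--make-bed", "--out", out])
        current = out
    # Step 6: LD prune, then estimate IBD and keep only pairs with PI_HAT >= 0.25
    # (the window settings 50 5 0.2 are an assumption, not taken from the source)
    commands.append(["plink", "--bfile", current, "--indep-pairwise", "50", "5", "0.2", "--out", current])
    commands.append(["plink", "--bfile", current, "--extract", current + ".prune.in",
                     "--genome", "--min", "0.25", "--out", current])
    return commands

for cmd in build_qc_commands("SA_231ppl_599K_hg19_ATGC_CID"):
    print(" ".join(cmd))
```

In such a setup, one individual from each pair reported in the `.genome` file would then be written to a list and dropped with `--remove` to produce the final `_rel` files.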

QC in all samples

In [128]:
#Move to directory and run QC procedure
os.chdir(os.path.join(pathgenos, "01_Original", "Extract_Common_ID"))
QC_procedure()

#Remove no sex individuals in CHP (plates)
#os.chdir(os.path.join(pathgenos, "01_Original", "QC"))
#for file in glob.glob("CHP*.bed"):
#    filename1 = file.split(".")[0]
#    subprocess.run(["plink", "--bfile", filename1, "--remove", filename1 + ".nosex", 
#                        "--make-bed", "--out", filename1])

Running SA_231ppl_599K_hg19_ATGC_CID file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running Euro180_176ppl_317K_hg19_ATGC_CID file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Running ADAPT_2784ppl_567K_hg19_CID file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.

QC in _phenos subset

Next, we move the files to the 02_Clean folder and remove some intermediary files.

In [129]:
#Moving files to 02_Clean folder
os.chdir(os.path.join(pathgenos, "01_Original", "Extract_Common_ID", "QC"))
for file in glob.glob("*_rel*"):
    shutil.move(file, os.path.join(outpath, file))


Let's look at how many SNPs and individuals remain in each dataset.

In [130]:
os.chdir(outpath)
for file in glob.glob("*_rel.log"):
    with open(file) as myfile:
        print("In file: " + file)
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

In file: CV_697ppl_964K_hg19_ATGC_CID_founders_geno01_maf_hwe_mind01_rel.log
790985 variants loaded from .bim file.
790985 variants and 154 people pass filters and QC.
Finished file... 

In file: UIUC2013_116ppl_959K_hg19_ATGC_CID_founders_geno01_maf_hwe_mind01_rel.log
774977 variants loaded from .bim file.
774977 variants and 86 people pass filters and QC.
Finished file... 

In file: UIUC2014_168ppl_703K_hg19_ATGC_CID_founders_geno01_maf_hwe_mind01_rel.log
399406 variants loaded from .bim file.
399406 variants and 4 people pass filters and QC.
Finished file... 

In file: ADAPT_2784ppl_567K_hg19_CID_founders_geno01_maf_hwe_mind01_rel.log
463723 variants loaded from .bim file.
463723 variants and 2289 people pass filters and QC.
Finished file... 

In file: Euro180_176ppl_317K_hg19_ATGC_CID_founders_geno01_maf_hwe_mind01_rel.log
296579 variants loaded from .bim file.
296579 variants and 171 people pass filters and QC.
Finished file... 

In file: SA_231ppl_599K_hg19_ATGC_CID_founders_geno