# QC procedure

This script will run a basic QC procedure on the merged HGDP-1000G reference files.
Specifically, we will run the following:

1. Keep only founders, that is, remove individuals with at least one parent in the dataset, and retain only autosomal chromosomes
2. Remove SNPs with missing call rates higher than 0.1
3. Remove SNPs with minor allele frequencies below 0.05
4. Remove SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
5. Remove samples with missing call rates higher than 0.1
6. Remove one arbitrary individual from any pair with a pihat >= 0.25, as estimated by an IBD analysis run after LD pruning
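
The notebook does not show the body of `QC_procedure`, so the following is only a sketch of how the steps listed above could map onto PLINK 1.9 flags. The function name `build_qc_commands`, the `temp1` to `temp5` output prefixes, and the one-command-per-step layout are assumptions made for illustration, not the contents of `GenotypeQC`. The commands are built but not executed, and the LD pruning and the removal of one individual per related pair are left out.

```python
def build_qc_commands(prefix, geno=0.1, maf=0.05, hwe=1e-50, mind=0.1, pihat=0.25):
    """Return one PLINK argument list per QC step (hypothetical layout)."""
    steps = [
        # 1. Keep founders only and restrict to autosomes
        ["--filter-founders", "--autosome"],
        # 2. Drop SNPs with a missing call rate above `geno`
        ["--geno", str(geno)],
        # 3. Drop SNPs with a minor allele frequency below `maf`
        ["--maf", str(maf)],
        # 4. Drop SNPs with a Hardy-Weinberg p-value below `hwe`
        ["--hwe", str(hwe)],
        # 5. Drop samples with a missing call rate above `mind`
        ["--mind", str(mind)],
    ]
    commands = []
    current = prefix
    for n, flags in enumerate(steps, 1):
        out = "temp" + str(n)  # assumed naming of intermediate files
        commands.append(["plink", "--bfile", current] + flags + ["--make-bed", "--out", out])
        current = out
    # 6. Pairwise IBD estimates; --min reports only pairs with PI_HAT of at least `pihat`
    commands.append(["plink", "--bfile", current, "--genome", "--min", str(pihat),
                     "--out", current + "_ibd"])
    return commands


for cmd in build_qc_commands("hgdp1000ghg19"):
    print(" ".join(cmd))
```

Chaining one command per step is a design choice of this sketch: it applies the filters in the order listed rather than leaving the order to PLINK when several filters share one command.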

After splitting it by chromosome, we will use the file produced by the QC procedure to continue the phasing-IBD pipeline.
We will also generate a thinned file by LD pruning the genotypes, for the PCA and ADMIXTURE analyses.

In [1]:
#Importing modules
import subprocess, os, glob, shutil
import pandas as pd
from GenotypeQC import QC_procedure

In [2]:
#Setting paths
projpath   = os.path.realpath("..")
pathgeno   = os.path.join(projpath, "DataBases", "Genotypes")
pathclean  = os.path.join(projpath, "Results", "CleanGenos")
pathcounts = os.path.join(projpath, "Results", "Counts")
pathadmix  = os.path.join(projpath, "Results", "ADMIXTURE")
pathpca    = os.path.join(projpath, "Results", "PCA")
pathibd    = os.path.join(projpath, "Results", "RefinedIBD")
pathinfo   = os.path.join(projpath, "DataBases", "PopInfo")
os.chdir(pathgeno)

In [3]:
#Cleaning genotypes
QC_procedure()

Running hgdp1000ghg19 file...
Removing SNPs with missing call rates higher than 0.1...
Removing SNPs with minor allele frequencies below 0.05...
Removing SNPs with hardy-weinberg equilibrium p-values less than 1e-50...
Removing samples with missing call rates higher than 0.1...
Removing one arbitrary individual from any pairwise comparison with a pihat higher than 0.25...
Generating final plink file...
Finished


In [4]:
#Generating LD prune file with values 50 5 2
filename = os.path.splitext(glob.glob(os.path.join("QC", "*.bed") )[0])[0]
subprocess.run(["plink", "--bfile", filename, "--indep", "50", "5", "2", "--out", filename + "_pruned"])
subprocess.run(["plink", "--bfile", filename, "--exclude", filename + "_pruned" + ".prune.out", "--make-bed", "--out", filename + "_pruned"])

CompletedProcess(args=['plink', '--bfile', 'QC/hgdp1000ghg19_founders_geno01_maf_hwe_mind01_rel', '--exclude', 'QC/hgdp1000ghg19_founders_geno01_maf_hwe_mind01_rel_pruned.prune.out', '--make-bed', '--out', 'QC/hgdp1000ghg19_founders_geno01_maf_hwe_mind01_rel_pruned'], returncode=0)
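
In the `--indep 50 5 2` call above, the three numbers are the window size in variants, the number of variants by which the window is shifted at each step, and the variance inflation factor (VIF) threshold. PLINK defines the VIF as 1/(1-R^2), where R^2 is the multiple correlation coefficient of a variant regressed on the other variants in the window. As a small worked example (the helper name `vif_to_r2` is made up here), the threshold can be converted to the equivalent R^2:

```python
def vif_to_r2(vif):
    """Invert VIF = 1 / (1 - R^2) to get the R^2 at a given VIF threshold."""
    if vif < 1:
        raise ValueError("VIF is at least 1 by definition")
    return 1 - 1 / vif


print(vif_to_r2(2))  # 0.5
```

A threshold of 2 therefore prunes variants whose multiple R^2 with the rest of the window exceeds 0.5.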

In [5]:
#Printing the lines that mention variants from each log file
for file in glob.glob(os.path.join("QC", "*.log") ):
    with open(file) as myfile:
        print("Reading " + str(myfile.name) + "...")
        for line in myfile:
            if "variants" in line:
                print(line, end='')
        print("Finished file... \n")

Reading QC/temp6.log...
458912 variants loaded from .bim file.
458912 variants and 3432 people pass filters and QC.
Pruned 22086 variants from chromosome 1, leaving 11986.
Pruned 25705 variants from chromosome 2, leaving 12062.
Pruned 21353 variants from chromosome 3, leaving 10439.
Pruned 19228 variants from chromosome 4, leaving 9570.
Pruned 17918 variants from chromosome 5, leaving 9112.
Pruned 20768 variants from chromosome 6, leaving 9434.
Pruned 17946 variants from chromosome 7, leaving 8532.
Pruned 18665 variants from chromosome 8, leaving 8156.
Pruned 15544 variants from chromosome 9, leaving 7459.
Pruned 16792 variants from chromosome 10, leaving 8128.
Pruned 14546 variants from chromosome 11, leaving 7293.
Pruned 14558 variants from chromosome 12, leaving 7422.
Pruned 12292 variants from chromosome 13, leaving 5905.
Pruned 10066 variants from chromosome 14, leaving 5252.
Pruned 9140 variants from chromosome 15, leaving 5043.
Pruned 9097 variants from chromosome 16, leaving 53
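
The per-chromosome lines printed above can also be collected into a table rather than read by eye. The following is a minimal sketch that assumes the log wording shown in the output (`Pruned N variants from chromosome C, leaving M.`); the function name `parse_pruned_lines` is hypothetical.

```python
import re

# Pattern for the per-chromosome summary lines written by the pruning step
PRUNED_RE = re.compile(r"Pruned (\d+) variants from chromosome (\d+), leaving (\d+)\.")


def parse_pruned_lines(lines):
    """Return (chromosome, pruned, kept) tuples for the matching log lines."""
    rows = []
    for line in lines:
        match = PRUNED_RE.search(line)
        if match:
            pruned, chrom, kept = (int(g) for g in match.groups())
            rows.append((chrom, pruned, kept))
    return rows


# Two lines copied from the output above
example = [
    "Pruned 22086 variants from chromosome 1, leaving 11986.",
    "Pruned 25705 variants from chromosome 2, leaving 12062.",
]
for chrom, pruned, kept in parse_pruned_lines(example):
    # Fraction of variants on this chromosome that survive pruning
    print(chrom, pruned, kept, round(kept / (pruned + kept), 3))
```

Lines that do not follow this wording, including incomplete ones, are skipped.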

In [6]:
#Running PCA from pruned file
filename = os.path.splitext(glob.glob(os.path.join("QC", "*pruned.bed") )[0])[0]
outname  = os.path.join(pathpca, os.path.basename(filename) + "_PCA")
subprocess.run(["plink", "--bfile", filename, "--pca", "--out", outname])


CompletedProcess(args=['plink', '--bfile', 'QC/hgdp1000ghg19_founders_geno01_maf_hwe_mind01_rel_pruned', '--pca', '--out', '/home/tomas/Documents/Research/HGDP_1000G_PopStruct/Results/PCA/hgdp1000ghg19_founders_geno01_maf_hwe_mind01_rel_pruned_PCA'], returncode=0)

In [7]:
#Separate files by chromosome
filename = os.path.splitext(glob.glob(os.path.join("QC", "*rel.bed") )[0])[0]

for i in range(1,23): #for all chromosomes
    subprocess.run(["plink", "--bfile", filename, "--chr", str(i), "--make-bed", "--out", filename + "_chr_" + str(i) ])
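
The loop above discards the `CompletedProcess` objects, so a failed PLINK call for one chromosome would go unnoticed. A minimal sketch of one way to stop at the first failure uses `subprocess.run(..., check=True)`. The wrapper name `run_checked` is made up for illustration, and the demonstration calls the Python interpreter instead of `plink` so that it runs anywhere.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Stand-in commands: one that succeeds and one that exits with status 3
ok = run_checked([sys.executable, "-c", "print('done')"])
print(ok.returncode, ok.stdout.strip())

try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as err:
    print("command failed with exit status", err.returncode)
```

In the notebook this would amount to passing `check=True` to the existing `subprocess.run` calls.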


In [None]:
#Moving files?

In [None]:
#Not running the code below

In [10]:
#Creating population file, generating the allele counts

os.chdir(pathinfo)
#Creating an allele frequency file from CleanGenos_strict
#Create a within file for plink containing FID IID and cluster name
info1000g = pd.read_table("integrated_call_samples_v3.20130502.ALL.panel", header = None, skiprows = 1)
infohgdp  = pd.read_table("SampleInformation.txt", header = None, skiprows = 1, dtype = str)
#Fixing hgdp info file
infohgdp = infohgdp.drop(index = 1066)
#Zero-padding the numeric IDs to five digits and adding the HGDP prefix
infohgdp[0] = "HGDP" + infohgdp[0].str.zfill(5)

#Loading fam file
os.chdir(pathgeno)
fam = pd.read_table("hgdp1000ghg19.fam", sep = " ", header = None)
#Creating within file and writing
temp = pd.merge(fam, info1000g, how="left", left_on=1, right_on=0)
temp = pd.merge(temp, infohgdp, how="left", left_on=1, right_on=0)
pops = pd.concat([fam.iloc[:,0], fam.iloc[:,1], temp.iloc[:,9].fillna(temp.iloc[:,14])], axis = 1)
pops.to_csv("pops.txt", sep = " ", header=False, index=False)

#Running pop allele frequencies
subprocess.run(["plink", "--bfile", "CleanGenos_strict", "--within", "pops.txt", "--freq", "--out", "CleanGenos_strict"])

#PCA analysis
subprocess.run(["plink", "--bfile", "CleanGenos_relaxed", "--pca", "50", "--out", "CleanGenos_strict_PCA"])

CompletedProcess(args=['plink', '--bfile', 'CleanGenos_relaxed', '--pca', '50', '--out', 'CleanGenos_strict_PCA'], returncode=0)
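
The cell above pads the numeric HGDP identifiers to an `HGDP` prefix plus five digits, apparently to match the IDs in the `.fam` file, then takes each sample's label from the first merged table and falls back to the second where the first has no match. The following pandas-free sketch mirrors that logic on toy data. The sample IDs, the labels `POP_A` and `POP_B`, and the function names are all invented for illustration.

```python
def pad_hgdp_id(raw_id):
    """Turn a bare numeric ID such as '7' into 'HGDP00007'."""
    return "HGDP" + raw_id.zfill(5)


def build_within_rows(fam_rows, labels_first, labels_second):
    """Return (FID, IID, label) rows, preferring labels_first over labels_second."""
    rows = []
    for fid, iid in fam_rows:
        # Same idea as fillna: use the second table only when the first has no entry
        label = labels_first.get(iid, labels_second.get(iid))
        rows.append((fid, iid, label))
    return rows


# Toy inputs (not real samples or populations)
labels_first = {"TOY0001": "POP_A"}
labels_second = {pad_hgdp_id("7"): "POP_B"}
fam_rows = [("TOY0001", "TOY0001"), ("HGDP00007", "HGDP00007")]

for row in build_within_rows(fam_rows, labels_first, labels_second):
    print(" ".join(row))
```

Samples found in neither table get `None` here, which corresponds to a missing value in the merged data frame.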

In [11]:
#Changing from long to wide dataframe
os.chdir(os.path.join(projpath, "Code"))
subprocess.run(["bash", "LongToWide.sh"])

CompletedProcess(args=['bash', 'LongToWide.sh'], returncode=0)

In [12]:
os.chdir(pathgeno)
#Removing temp files
for file in glob.glob("*_temp*"):
    os.remove(file)
    
#Moving PCA files
for filename in glob.glob("*_PCA.*"):
    shutil.move(filename, pathpca)
    
#Moving count files
for filename in glob.glob("*.frq.*"):
    shutil.move(filename, pathcounts)

#Moving strict cleangenos files
for filename in glob.glob("CleanGenos_strict.*"):
    shutil.move(filename, pathclean)

#Moving relaxed cleangenos files
for filename in glob.glob("CleanGenos_relaxed.*"):
    shutil.move(filename, pathclean)
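
When the destination does not exist as a directory, `shutil.move` treats it as the new file name, so the moves above rely on the `Results` subfolders already being present. A minimal sketch of a helper that creates the folder first and handles one glob pattern per call follows. The name `move_matching` is hypothetical, and the demonstration runs in a temporary directory with empty placeholder files.

```python
import glob
import os
import shutil
import tempfile


def move_matching(pattern, dest_dir):
    """Move every file matching `pattern` into `dest_dir`, creating it if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for path in sorted(glob.glob(pattern)):
        # shutil.move returns the final path of the moved file
        moved.append(shutil.move(path, dest_dir))
    return moved


# Demonstration with placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    for name in ("demo_PCA.eigenvec", "demo_PCA.eigenval", "demo.log"):
        open(os.path.join(workdir, name), "w").close()
    dest = os.path.join(workdir, "PCA")
    moved = move_matching(os.path.join(workdir, "*_PCA.*"), dest)
    print(sorted(os.path.basename(p) for p in moved))
    remaining = sorted(os.listdir(workdir))
    print(remaining)
```

Each of the moving loops above could then be written as a single call, for example `move_matching("*_PCA.*", pathpca)`.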