# Clean SNPs

This notebook runs a basic SNP clean on the merged HGDP-1000G reference files.
Specifically, we remove SNPs with a minor allele frequency (MAF) below 0.01, a missing call rate above 0.1, or a Hardy-Weinberg equilibrium exact test p-value below 1e-50, applied in that order.
We then run two LD prunes with PLINK's `--indep`: a relaxed one (window of 50 SNPs, step of 5 SNPs, variance inflation factor (VIF) threshold of 2) and a strict one (window of 200 SNPs, step of 5 SNPs, VIF threshold of 1.3).
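
To give a feel for what the VIF thresholds mean, the sketch below converts a VIF into the corresponding multiple correlation coefficient, using the definition VIF = 1 / (1 - R^2) from the PLINK documentation. The function name `vif_to_r2` is only illustrative and is not part of this notebook; the value 1.3 is the threshold used by the strict prune in the code cell further down.

```python
def vif_to_r2(vif):
    """Return the R^2 implied by a VIF threshold, given VIF = 1 / (1 - R^2)."""
    if vif < 1:
        raise ValueError("VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif

# Thresholds used by the relaxed and strict prunes in this notebook
for label, vif in [("relaxed", 2.0), ("strict", 1.3)]:
    print(f"{label}: VIF {vif} -> R^2 {vif_to_r2(vif):.3f}")
```

Under this definition, a VIF of 2 corresponds to an R^2 of 0.5 and a VIF of 1.3 to an R^2 of about 0.23, which is why the second prune keeps fewer SNPs.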

In [1]:
import subprocess, os, glob, shutil

In [2]:
projpath = os.path.realpath("..")
pathgeno = os.path.join(projpath, "DataBases", "Genotypes")
pathout  = os.path.join(projpath, "Results", "CleanGenos")
pathinfo = os.path.join(projpath, "DataBases", "PopInfo")
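
The next cell calls PLINK once per filter, passing each intermediate fileset to the following step. As a minimal sketch of that chaining (not the code this notebook actually runs), the hypothetical helper `plink_filter_steps` below only builds the argument lists, so the commands can be inspected before anything is executed. The default thresholds and the `clean_temp` prefix mirror the values used in this notebook.

```python
def plink_filter_steps(bfile, maf=0.01, geno=0.1, hwe=1e-50, prefix="clean_temp"):
    """Build one PLINK argument list per filter, chained through temporary filesets."""
    filters = [("--maf", maf), ("--geno", geno), ("--hwe", hwe)]
    commands = []
    current = bfile
    for i, (flag, threshold) in enumerate(filters, 1):
        out = f"{prefix}{i}"
        commands.append(["plink", "--bfile", current, flag, str(threshold),
                         "--make-bed", "--out", out])
        current = out  # the next filter reads the fileset just written
    return commands

# Print the commands instead of running them
for cmd in plink_filter_steps("hgdp1000ghg19"):
    print(" ".join(cmd))
```

Each list could then be handed to `subprocess.run`, which is what the following cell does with the commands written out by hand.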

In [9]:
os.chdir(pathgeno)
#Remove SNPs with MAF below 0.01 (check=True stops the notebook if PLINK fails)
subprocess.run(["plink", "--bfile", "hgdp1000ghg19", "--maf", "0.01", "--make-bed", "--out", "clean_temp1"], check=True)
#Remove SNPs with a missing call rate above 0.1
subprocess.run(["plink", "--bfile", "clean_temp1", "--geno", "0.1", "--make-bed", "--out", "clean_temp2"], check=True)
#Remove SNPs with a Hardy-Weinberg exact test p-value below 1e-50
subprocess.run(["plink", "--bfile", "clean_temp2", "--hwe", "1e-50", "--make-bed", "--out", "clean_temp3"], check=True)
#Relaxed LD prune: window of 50 SNPs, step of 5 SNPs, VIF threshold of 2
subprocess.run(["plink", "--bfile", "clean_temp3", "--indep", "50", "5", "2", "--out", "relaxed"], check=True)
subprocess.run(["plink", "--bfile", "clean_temp3", "--extract", "relaxed.prune.in", "--make-bed", "--out", "CleanGenos_relaxed"], check=True)
#Strict LD prune: window of 200 SNPs, step of 5 SNPs, VIF threshold of 1.3
subprocess.run(["plink", "--bfile", "clean_temp3", "--indep", "200", "5", "1.3", "--out", "strict"], check=True)
subprocess.run(["plink", "--bfile", "clean_temp3", "--extract", "strict.prune.in", "--make-bed", "--out", "CleanGenos_strict"], check=True)

#Print the variant-count lines from each PLINK log
for logfile in glob.glob("*.log"):
    with open(logfile) as myfile:
        for line in myfile:
            if "variants" in line:
                print(line, end='')
    print("Finished file... \n")

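#Remove the intermediate filesets and their logs
for file in glob.glob("*_temp*"):
    os.remove(file)

#Move the pruned filesets to the results folder, creating it if needed
os.makedirs(pathout, exist_ok=True)
for filename in glob.glob("CleanGenos_*.*"):
    shutil.move(filename, pathout)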

641330 variants loaded from .bim file.
3589 variants removed due to missing genotype data (--geno).
637741 variants and 3444 people pass filters and QC.
Finished file... 

622661 variants loaded from .bim file.
--extract: 108583 variants remaining.
108583 variants and 3444 people pass filters and QC.
Finished file... 

622661 variants loaded from .bim file.
622661 variants and 3444 people pass filters and QC.
Pruned 39403 variants from chromosome 1, leaving 8683.
Pruned 43296 variants from chromosome 2, leaving 8515.
Pruned 35924 variants from chromosome 3, leaving 7240.
Pruned 32059 variants from chromosome 4, leaving 6697.
Pruned 32349 variants from chromosome 5, leaving 6630.
Pruned 35184 variants from chromosome 6, leaving 6646.
Pruned 28519 variants from chromosome 7, leaving 5828.
Pruned 30428 variants from chromosome 8, leaving 5623.
Pruned 25263 variants from chromosome 9, leaving 5063.
Pruned 27768 variants from chromosome 10, leaving 5708.
Pruned 25846 variants from chromosom