# Clean SNPs

This script runs a basic quality-control pass over the SNPs in the merged HGDP-1000G reference files.
Specifically, we remove SNPs with a minor allele frequency below 0.01, a missing call rate above 0.1, or a Hardy-Weinberg equilibrium exact test p-value below 1e-50, in that order.
After that we LD-prune the remaining SNPs, using a window size of 50 SNPs, a step size of 5 SNPs, and a variance inflation factor (VIF) threshold of 2.
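
To give a feel for the pruning threshold: PLINK documents the VIF as 1 / (1 - R²), where R² is the multiple correlation coefficient of a SNP regressed on the other SNPs in the window. The short sketch below only illustrates that relationship; the helper name `vif_to_r2` is hypothetical and not part of PLINK or of this pipeline.

```python
def vif_to_r2(vif):
    """Return the multiple-correlation R^2 implied by a VIF threshold.

    Uses VIF = 1 / (1 - R^2), rearranged to R^2 = 1 - 1 / VIF.
    """
    if vif < 1:
        raise ValueError("VIF is defined only for values >= 1")
    return 1.0 - 1.0 / vif


# The threshold used in this notebook (VIF = 2) corresponds to R^2 = 0.5.
print(vif_to_r2(2))  # 0.5
```

In other words, a VIF threshold of 2 should prune SNPs for which at least half of the variance is explained by the other SNPs in the 50-SNP window.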

In [9]:
import subprocess, os, glob, shutil

In [4]:
projpath = os.path.realpath("..")
pathgeno = os.path.join(projpath, "DataBases", "Genotypes")
pathout  = os.path.join(projpath, "Results", "CleanGenos")
pathinfo = os.path.join(projpath, "DataBases", "PopInfo")

In [10]:
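os.chdir(pathgeno)
# Each step reads the previous step's output; check=True stops the cell if PLINK exits with an error
# 1. Minor allele frequency filter
subprocess.run(["plink", "--bfile", "hgdp1000ghg19", "--maf", "0.01", "--make-bed", "--out", "clean_temp1"], check=True)
# 2. Missing call rate filter
subprocess.run(["plink", "--bfile", "clean_temp1", "--geno", "0.1", "--make-bed", "--out", "clean_temp2"], check=True)
# 3. Hardy-Weinberg equilibrium exact test filter
subprocess.run(["plink", "--bfile", "clean_temp2", "--hwe", "1e-50", "--make-bed", "--out", "clean_temp3"], check=True)
# 4. LD pruning (window 50 SNPs, step 5 SNPs, VIF 2); writes plink.prune.in and plink.prune.out
subprocess.run(["plink", "--bfile", "clean_temp3", "--indep", "50", "5", "2"], check=True)
# 5. Keep only the SNPs retained by the pruning step
subprocess.run(["plink", "--bfile", "clean_temp3", "--extract", "plink.prune.in", "--make-bed", "--out", "CleanGenos"], check=True)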

# Print the lines of each temporary log file that report variant counts
for file in glob.glob("*_temp*.log"):
    with open(file) as myfile:
        for line in myfile:
            if "variants" in line:
                print(line, end='')
    print("Finished file... \n")
        
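# Remove the intermediate files
for file in glob.glob("*_temp*"):
    os.remove(file)

# Move the cleaned files to the output folder, creating it first if needed
os.makedirs(pathout, exist_ok=True)
for filename in glob.glob("CleanGenos.*"):
    shutil.move(filename, pathout)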

641330 variants loaded from .bim file.
3589 variants removed due to missing genotype data (--geno).
637741 variants and 3444 people pass filters and QC.
Finished file... 

637741 variants loaded from .bim file.
--hwe: 15080 variants removed due to Hardy-Weinberg exact test.
622661 variants and 3444 people pass filters and QC.
Finished file... 

644003 variants loaded from .bim file.
2673 variants removed due to minor allele threshold(s)
641330 variants and 3444 people pass filters and QC.
Finished file... 
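
The log excerpts above are printed in whatever order `glob` returns the files, so they do not follow the pipeline order (`--maf`, then `--geno`, then `--hwe`). As a rough sketch of how the counts could be pulled out and sanity-checked, the hypothetical helper below (`parse_variant_counts` is not part of the original notebook) extracts the loaded, removed, and retained variant counts from one log excerpt. It assumes the line wording shown in the output above.

```python
import re


def parse_variant_counts(log_text):
    """Return (loaded, removed, kept) variant counts from a PLINK log excerpt.

    Any count whose line is missing from the text is returned as None.
    """
    patterns = (
        r"(\d+) variants loaded from \.bim file",
        r"(\d+) variants removed",
        r"(\d+) variants and \d+ people pass filters and QC",
    )
    matches = (re.search(pattern, log_text) for pattern in patterns)
    return tuple(int(m.group(1)) if m else None for m in matches)


# Excerpt copied from the --hwe step in the output above.
sample = """637741 variants loaded from .bim file.
--hwe: 15080 variants removed due to Hardy-Weinberg exact test.
622661 variants and 3444 people pass filters and QC.
"""

loaded, removed, kept = parse_variant_counts(sample)
print(loaded, removed, kept)     # 637741 15080 622661
print(loaded - removed == kept)  # True
```

Applied to the three excerpts in pipeline order, the counts chain as 644003 → 641330 → 637741 → 622661 variants, which is the number of SNPs that enter the LD pruning step.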

