# SNP filtering using vcftools   


Use your preferred conda environment.  
**Packages needed**: vcftools, bgzip, tabix

### Things we filter on:  

#### Individuals:  
- **Coverage**: Also called **depth**. See 3mapping. Calculated on the bam files as the average read count per locus per individual. 
- **Missing**: Proportion of missing data allowed across all loci for an individual. Missing data are common, and often high, in GBS/RADseq data, and they cause problems throughout. Many methods, including PCA (and all ordination methods), require a complete matrix with no missing data. PCA will also cluster by missing data: individuals with more missing data cluster closer to the center, which produces a "fan" effect. Coverage can do the same. This (among other reasons) is why people run ordinations on a variance-covariance matrix of the genetic data. Other methods involve imputation. This can be sophisticated and use phased haplotype data, or it can be simple: when you z-score your genotype data at each locus, (g - mean(g))/sd(g), you set all missing values to 0, which is the locus mean (i.e., the global allele frequency). There is more to this standardization; see *Patterson et al. 2006* (https://dx.plos.org/10.1371/journal.pgen.0020190) for details. See PCAsim_ex in the examples directory for a demonstration of all these issues.
    - (additional) This is another reason to use entropy. Entropy is a hierarchical Bayesian model, so it updates the genotype estimate for each missing value based on genotype likelihoods across loci and individuals, and on the allele frequency of the cluster/deme that the individual is assigned to.   
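
The simple mean imputation described above can be sketched in a few lines. This is a minimal illustration, not the lab's PCA code: the function name `standardize_genotypes` is hypothetical, and it assumes a 0/1/2 genotype matrix (individuals in rows, loci in columns) with missing genotypes coded as NaN.

```python
import numpy as np

def standardize_genotypes(geno):
    """Z-score each locus (column) and set missing genotypes to 0.

    Assumes `geno` is individuals x loci, coded 0/1/2, with NaN for missing.
    """
    geno = np.asarray(geno, dtype=float)
    mean = np.nanmean(geno, axis=0)   # per-locus mean, ignoring missing values
    sd = np.nanstd(geno, axis=0)      # per-locus standard deviation
    sd = np.where(sd == 0, 1.0, sd)   # monomorphic loci: avoid dividing by zero
    z = (geno - mean) / sd
    z[np.isnan(z)] = 0.0              # after centering, 0 is the locus mean
    return z
```

Because the data are centered first, replacing a missing value with 0 is the same as imputing the locus mean. Patterson et al. scale by a function of the allele frequency rather than the sample standard deviation, so treat this as an illustration of the idea only.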
    
#### Loci:  
- **Biallelic**: Only keep biallelic SNPs. Multiallelic SNPs are rare at the time scale we work on (Citation??), they are a mathematical nightmare, and we have enough data, so we just ignore them. Nearly everyone does, except in deep-time phylogenetics. 
- **thin**: Keeps one locus within a specified distance (in bp). As far as the vcftools source indicates, it keeps the first site it encounters and drops any later site closer than that distance, rather than choosing on quality or depth. This step is necessary because loci in close physical proximity are prone to sequencing error, and linkage disequilibrium (LD) confounds many population genetic parameters. For *de novo* reference assemblies, we thin to 100 because contigs/reads are ~92 bp in length. This keeps one locus per contig to control for LD and sequencing error, which is very common practice in pop gen and necessary for many analyses.   
- **max-missing**: Controls missing data per locus. In vcftools the value runs from 0 (sites that are completely missing are allowed) to 1 (no missing data allowed), so `--max-missing 0.7` keeps loci genotyped in at least 70% of individuals.  
- **MAF**: Minor allele frequency. The minimum frequency the minor allele must reach for a locus to be kept as a SNP. The frequency is counted over allele copies, so maf = 0.02 for 250 diploid individuals (500 allele copies, ignoring missing data) means the minor allele must be present in at least 10 copies, which could be as few as 5 homozygous individuals. Many papers have shown this is a big issue in clustering and demography (Citation). We do this a second time near the end if we removed individuals during missing-data filtering.   
- **Mean Depth**: Average allelic depth or read depth per locus. Too low could be sequencing error; too high could be a PCR duplication artifact (Citation).    
- **Qual**: Locus quality (the Phred-scaled QUAL field of the VCF). You can look up the math. Usually above 20-30 is good, but given our coverage and number of individuals, we can usually go much higher.     
- **Fis**: Inbreeding coefficient. This is a contentious topic. We filter on it because of paralogs (paralogous loci), which are loci that map to multiple regions of the genome. They are an issue in highly repetitive genomes and usually lead to an excess of heterozygotes. Filtering on negative Fis can help. See these two McKinney papers (https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12763, https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12613). Katie and others in the lab use his package, HDPlot, to deal with this.   
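
To make the MAF arithmetic above concrete, here is a small sketch of how many copies of the minor allele a threshold implies. The helper `min_minor_allele_copies` is hypothetical (it is not part of vcftools), and it assumes diploid individuals with no missing genotypes.

```python
import math

def min_minor_allele_copies(maf, n_individuals, ploidy=2):
    """Smallest minor allele count that reaches the MAF threshold.

    Assumes every individual is genotyped (no missing data).
    """
    n_copies = n_individuals * ploidy
    # the small tolerance guards against floating-point noise in maf * n_copies
    return math.ceil(maf * n_copies - 1e-9)

# 250 diploid individuals at maf = 0.02
print(min_minor_allele_copies(0.02, 250))
```

With missing data the denominator shrinks locus by locus, so the real number of required copies is usually a little lower than this upper bound.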


**See methods_ex.Rmd in examples directory for text** 


In [1]:
import sys
import ipyparallel as ipp
import os
from os import environ
import gzip
import warnings
import pandas as pd
import numpy as np
import scipy as sp
import glob
import re
import random

In [2]:
vcftools = "vcftools"
bgzip = "bgzip"
tabix = "tabix"

In [3]:
root = "/data/gpfs/assoc/denovo/tfaske/SPCR/ddocent"

In [4]:
cd $root

/data/gpfs/assoc/denovo/tfaske/SPCR/ddocent


In [6]:
analysis_dir = os.path.join(root,"filtering")

In [7]:
cd $analysis_dir

/data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering


#### count snps in zipped vcf 

In [8]:
vcf_file = os.path.join(analysis_dir, "SPCR_concat.vcf.gz")
assert os.path.exists(vcf_file)
vcf_file

'/data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/SPCR_concat.vcf.gz'

In [14]:
#!zcat $vcf_file | grep -v '#' | wc -l 
#2727778

1613057
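
If `zcat` is not available, the same count can be done in Python. This is a minimal sketch; `count_vcf_records` is a hypothetical helper, and it counts lines that do not start with `#`, which differs slightly from `grep -v '#'` (that drops any line containing `#`).

```python
import gzip

def count_vcf_records(vcf_gz_path):
    """Count non-header lines (variant records) in a gzipped or bgzipped VCF."""
    n_records = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                n_records += 1
    return n_records
```

In the notebook this would be called as `count_vcf_records(vcf_file)`.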


## keep only biallelic loci as first step

In [9]:
!$vcftools --remove-indels \
--min-alleles 2 \
--max-alleles 2 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_file \
--out $'SPCR.biallelic'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/SPCR_concat.vcf.gz
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out SPCR.biallelic
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
After filtering, kept 628 out of 628 Individuals
Outputting VCF file...
After filtering, kept 2126042 out of a possible 2727778 Sites
Run Time = 3503.00 seconds


In [10]:
vcf_biallelic = os.path.join(analysis_dir, "SPCR.biallelic.recode.vcf")
vcf_biallelic_gz = vcf_biallelic + '.gz'
!$bgzip -c $vcf_biallelic > {vcf_biallelic_gz}
!$tabix {vcf_biallelic_gz}

## Remove by MAF, missing data, and thin   


In [11]:
!$vcftools \
--max-missing 0.6 \
--maf 0.02 \
--thin 100 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_biallelic_gz \
--out $'SPCR_miss60_thin100_MAF2'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/SPCR.biallelic.recode.vcf.gz
	--recode-INFO-all
	--maf 0.02
	--thin 100
	--max-missing 0.6
	--out SPCR_miss60_thin100_MAF2
	--recode
	--remove-filtered-all

Using zlib version: 1.2.11
After filtering, kept 628 out of 628 Individuals
Outputting VCF file...
After filtering, kept 58148 out of a possible 2126042 Sites
Run Time = 431.00 seconds


In [12]:
!$vcftools \
--max-missing 0.7 \
--maf 0.02 \
--thin 100 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_biallelic_gz \
--out $'SPCR_miss70_thin100_MAF2'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/SPCR.biallelic.recode.vcf.gz
	--recode-INFO-all
	--maf 0.02
	--thin 100
	--max-missing 0.7
	--out SPCR_miss70_thin100_MAF2
	--recode
	--remove-filtered-all

Using zlib version: 1.2.11
After filtering, kept 628 out of 628 Individuals
Outputting VCF file...
After filtering, kept 44651 out of a possible 2126042 Sites
Run Time = 397.00 seconds
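
The `--thin` step in the commands above can be pictured as a single greedy pass over sorted sites. The sketch below reflects my understanding of that behavior (keep the first site, skip later sites that are too close); `thin_sites` is a hypothetical function and not the vcftools implementation.

```python
def thin_sites(sites, min_distance):
    """Greedy thinning of (chrom, pos) tuples sorted by chrom, then pos.

    A site is kept unless it lies closer than `min_distance` bp to the
    previously kept site on the same chromosome/contig.
    """
    kept = []
    last_chrom, last_pos = None, None
    for chrom, pos in sites:
        if chrom == last_chrom and pos - last_pos < min_distance:
            continue  # too close to the last kept site, drop it
        kept.append((chrom, pos))
        last_chrom, last_pos = chrom, pos
    return kept

sites = [("c1", 5), ("c1", 40), ("c1", 110), ("c2", 8), ("c2", 90)]
print(thin_sites(sites, 100))
```

With contigs of ~92 bp and a distance of 100, no two sites on the same contig can both survive, which is why this setting keeps one locus per contig.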


In [13]:
vcf_filtered = "SPCR_miss70_thin100_MAF2.recode.vcf"
vcf_filtered_gz = "%s.gz" % vcf_filtered

In [14]:
!$bgzip -c $vcf_filtered > {vcf_filtered_gz}
!$tabix {vcf_filtered_gz}

# Remove individuals with too much missing data (bad_indv)


In [15]:
!$vcftools --gzvcf $vcf_filtered_gz --out $vcf_filtered_gz --missing-indv


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf SPCR_miss70_thin100_MAF2.recode.vcf.gz
	--missing-indv
	--out SPCR_miss70_thin100_MAF2.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 628 out of 628 Individuals
Outputting Individual Missingness
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 10.00 seconds


In [16]:
def get_imiss(vcf_file):
    imiss_file = !ls {vcf_file}.imiss
    imiss_df = pd.read_csv(imiss_file[0], sep="\t")
    imiss_df.index = imiss_df.INDV
    return imiss_df

In [17]:
imiss_df = get_imiss(vcf_filtered_gz)
imiss_df.head()

Unnamed: 0_level_0,INDV,N_DATA,N_GENOTYPES_FILTERED,N_MISS,F_MISS
INDV,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
SPCR_100,SPCR_100,44651,0,6943,0.155495
SPCR_101,SPCR_101,44651,0,6692,0.149873
SPCR_102,SPCR_102,44651,0,6901,0.154554
SPCR_103,SPCR_103,44651,0,7062,0.15816
SPCR_104,SPCR_104,44651,0,6856,0.153546


In [18]:
imiss_df.F_MISS.describe()

count    628.000000
mean       0.244826
std        0.165853
min        0.057378
25%        0.123989
50%        0.163580
75%        0.359404
max        0.928871
Name: F_MISS, dtype: float64

#### The cell below lets you try different cutoffs and see how many individuals you would lose; here we remove individuals with 60% or more missing data

In [20]:
len(imiss_df),len(imiss_df[imiss_df.F_MISS >= .4]),len(imiss_df[imiss_df.F_MISS >= .5]), len(imiss_df[imiss_df.F_MISS >= .6])


(628, 64, 44, 26)

In [21]:
bad_indv = imiss_df.INDV[imiss_df.F_MISS >= .6]

In [22]:
with open(os.path.join(analysis_dir, "bad_indv.txt"), "w") as o:
    o.write("INDV\n")
    for elem in bad_indv.index:
        o.write("%s\n" % elem)

In [23]:
!$vcftools --gzvcf $vcf_filtered_gz \
--remove-indels  \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--remove {os.path.join(analysis_dir, "bad_indv.txt")} \
--out {os.path.join(analysis_dir, "snps_indv_removed")}


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf SPCR_miss70_thin100_MAF2.recode.vcf.gz
	--remove /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/bad_indv.txt
	--recode-INFO-all
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
Excluding individuals in 'exclude' list
After filtering, kept 602 out of 628 Individuals
Outputting VCF file...
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 98.00 seconds


In [24]:
vcf_bad_remove = os.path.join(analysis_dir, "snps_indv_removed.recode.vcf")
vcf_bad_remove_gz = vcf_bad_remove + ".gz"
!$bgzip -c {vcf_bad_remove} > {vcf_bad_remove_gz}
!$tabix {vcf_bad_remove_gz}

# Filter SNPs further 
This needs to be done after removing individuals.   

### This uses vcftools to get some stats to summarize and make decisions with later

In [25]:
def get_vcf_stats(vcf_gz):
    
    stats = ['depth',
            'site-depth',
            'site-mean-depth',
            'site-quality',
            'missing-site',
            'freq',
            'counts',
            'hardy',
            'het']
    
    for stat in stats:
        !$vcftools --gzvcf $vcf_gz \
        --out $vcf_gz \
        {"--%s" % stat} 

In [26]:
get_vcf_stats(vcf_bad_remove_gz)


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--depth
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Mean Depth by Individual
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 10.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--site-depth

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Depth for Each Site


After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 10.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--site-mean-depth

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Depth for Each Site
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 12.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--site-quality

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Quality for Each Site


After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 7.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--missing-site

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Site Missingness
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 10.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--freq
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Frequency Statistics...


After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 11.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--counts
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Frequency Statistics...
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 9.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--hardy
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting HWE statistics (but only for biallelic loci)


	HWE: Only using fully diploid SNPs.
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 10.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--het
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Individual Heterozygosity
	Individual Heterozygosity: Only using fully diploid SNPs.
After filtering, kept 44651 out of a possible 44651 Sites
Run Time = 11.00 seconds


### Functions that will calculate various metrics we use to filter   

In [27]:
#pd.set_option('display.max_columns', 100)

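def get_MAF(row):
    # minor allele frequency = the smaller of the two allele frequencies
    try:
        return np.min([row.A1_freq, row.A2_freq])
    except Exception:
        # print the offending row for inspection; the function then returns None
        print(row)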
        
def get_correction(n):
    #for finite sample size
    return (2*n)/(2*n-1)

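def calculate_Fis(vals):
    # vals is the observed genotype count string "HOM1/HET/HOM2" from vcftools --hardy
    try:
        data = [float(x) for x in vals.split("/")]
        assert len(data) == 3
        num_individuals = np.sum(data)
        total_alleles = 2*num_individuals
        # each homozygote carries two copies of its allele, each heterozygote one of each
        a1_count = 2*data[0]
        a2_count = 2*data[2]
        het_count = data[1]
        a1_count += het_count
        a2_count += het_count
        a1_freq = a1_count/total_alleles
        a2_freq = a2_count/total_alleles
        # compare with a tolerance; an exact == 1.0 test can fail on floating-point rounding
        assert np.isclose(a1_freq + a2_freq, 1.0)
        He = 2 * a1_freq * a2_freq * get_correction(num_individuals)
        Ho = het_count/num_individuals
        Fis = 1 - (Ho/He)
        return Fis
    except Exception:
        # -9 is a sentinel for loci where Fis could not be calculated
        return -9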

def combine_vcf_stats(filedir, prefix):
    
    hardy_files = !ls {filedir}/{prefix}.hwe
    hardy = pd.read_csv(hardy_files[0], sep="\t")

    hardy.columns = ['CHROM', 'POS', 'OBS(HOM1/HET/HOM2)', 'E(HOM1/HET/HOM2)', 'ChiSq_HWE',
       'P_HWE', 'P_HET_DEFICIT', 'P_HET_EXCESS']
    hardy.index = hardy.apply(lambda x: "%s-%d" % (x.CHROM, x.POS), axis=1)
    
    loci_files = !ls {filedir}/{prefix}.l* | grep -v log
    loci_df = pd.concat([pd.read_csv(x, sep="\t", skiprows=0) for x in loci_files], axis=1)
    chrom_pos = loci_df.iloc[:,0:2]
    
    frq_files = !ls {filedir}/{prefix}.frq* | grep -v count
    frq_data = []
    h = open(frq_files[0])
    header = h.readline().strip().split()
    for line in h:
        frq_data.append(line.strip().split('\t'))

    header = ['CHROM', 'POS', 'N_ALLELES', 'N_CHR', 'A1_FREQ', "A2_FREQ"]
    frq_df = pd.DataFrame(frq_data)
    print(frq_df.columns)
    #frq_df = frq_df.drop([6,7],axis=1)
    frq_df.columns = header
    frq_df.index = frq_df.apply(lambda x: "%s-%s" % (x.CHROM, x.POS), axis=1)
    
    loci_df = loci_df.drop(['CHROM','CHR','POS'], axis=1)
    loci_df = pd.concat([chrom_pos, loci_df], axis=1)
    loci_df.index = loci_df.apply(lambda x: "%s-%d" % (x.CHROM, x.POS), axis=1)
    
    loci_df = pd.concat([loci_df, frq_df, hardy], axis=1)
    loci_df["A1_allele"] = loci_df.apply(lambda row: row.A1_FREQ.split(":")[0], axis=1)
    loci_df["A2_allele"] = loci_df.apply(lambda row: row.A2_FREQ.split(":")[0], axis=1)
    
    loci_df["A1_freq"] = loci_df.apply(lambda row: float(row.A1_FREQ.split(":")[1]), axis=1)
    loci_df["A2_freq"] = loci_df.apply(lambda row: float(row.A2_FREQ.split(":")[1]), axis=1)
    
    loci_df['MAF'] = loci_df.apply(get_MAF, axis=1)
    loci_df = loci_df.drop(['CHROM', 'POS'], axis=1)
    
    loci_df['Fis'] = loci_df['OBS(HOM1/HET/HOM2)'].apply(calculate_Fis)
    
    return loci_df, frq_df, hardy
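
As a sanity check on the Fis calculation above, here is a stand-alone version that works directly on genotype counts. The function `fis_from_counts` is a hypothetical helper written for illustration; it uses the same finite-sample correction, 2n/(2n-1), as `get_correction`.

```python
def fis_from_counts(hom1, het, hom2):
    """Fis = 1 - Ho/He from observed genotype counts at one biallelic locus."""
    n = hom1 + het + hom2
    p = (2 * hom1 + het) / (2 * n)          # frequency of allele 1
    q = 1 - p                               # frequency of allele 2
    he = 2 * p * q * (2 * n) / (2 * n - 1)  # expected heterozygosity, corrected
    ho = het / n                            # observed heterozygosity
    return 1 - ho / he

print(fis_from_counts(25, 50, 25))   # Hardy-Weinberg proportions: close to 0
print(fis_from_counts(0, 100, 0))    # all heterozygotes: strongly negative
print(fis_from_counts(50, 0, 50))    # no heterozygotes: 1
```

The all-heterozygote case is the pattern expected for collapsed paralogs, which is why a strongly negative Fis is used as a filter.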

In [28]:
vcf_bad_remove_gz

'/data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz'

In [29]:
loci_df, frq_df, hardy = combine_vcf_stats(analysis_dir,'snps_indv_removed.recode.vcf.gz')

RangeIndex(start=0, stop=6, step=1)


### Look at the summary stats and make decisions


In [30]:
loci_df.MEAN_DEPTH.describe()

count    44651.000000
mean        15.973388
std         16.391832
min          2.126340
25%          7.107770
50%         11.161200
75%         18.686050
max        426.970000
Name: MEAN_DEPTH, dtype: float64

In [31]:
loci_df.QUAL.describe()

count    4.465100e+04
mean     1.498650e+04
std      3.571597e+04
min      3.510270e-04
25%      1.944535e+03
50%      5.344410e+03
75%      1.455885e+04
max      2.090010e+06
Name: QUAL, dtype: float64

In [32]:
# calculate_Fis returns -9 when the calculation fails, so this count should be 0;
# a nonzero count means Fis is unusable for those loci
len(loci_df[loci_df.Fis == -9])

44062

In [33]:
len(loci_df[loci_df.MEAN_DEPTH > 25]),len(loci_df[loci_df.MEAN_DEPTH >= 3])

(6837, 44588)

In [34]:
len(loci_df[loci_df.QUAL >= 100]) - len(loci_df[loci_df.QUAL >= 200])

450

In [35]:
len(loci_df[loci_df.QUAL <  500]), len(loci_df[loci_df.QUAL < 750]), len(loci_df[loci_df.QUAL < 999])

(3084, 4578, 6089)

In [36]:
 len(loci_df[loci_df.Fis <= -0.5]), len(loci_df[loci_df.MAF < 0.02])

(44089, 461)

In [37]:
def filter_snps(df, imputed=False):
    if imputed:
        return df[(df.MAF >= 0.01) &  
                  (df.Fis > -0.5)]
    else:
        return df[(df.MEAN_DEPTH >= 3) & 
                  (df.MEAN_DEPTH < 25) & 
                  (df.QUAL >= 750) & 
                  (df.MAF >= 0.02)]

In [38]:
loci_stage1 = filter_snps(loci_df)
loci_stage1.shape

(33072, 25)

In [39]:
with open(os.path.join(analysis_dir, "stage1_positions.txt"), "w") as o:
    for elem in loci_stage1.index:
        o.write("%s\n" % "\t".join(elem.split("-")))

In [40]:
!$vcftools --gzvcf $vcf_bad_remove_gz \
--remove-indels  \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--positions {os.path.join(analysis_dir, "stage1_positions.txt")} \
--out {os.path.join(analysis_dir, "good_snps")}


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/snps_indv_removed.recode.vcf.gz
	--recode-INFO-all
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/good_snps
	--positions /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/stage1_positions.txt
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting VCF file...
After filtering, kept 33072 out of a possible 44651 Sites
Run Time = 70.00 seconds


In [41]:
snps = os.path.join(analysis_dir, "good_snps.recode.vcf")
snps_gz = snps + ".gz"
!$bgzip -c {snps} > {snps_gz}
!$tabix {snps_gz}

# Make 012, see directory PCA_012 for results

In [42]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--012


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/good_snps.recode.vcf.gz
	--012
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/good_snps.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Writing 012 matrix files ... Done.
After filtering, kept 33072 out of a possible 33072 Sites
Run Time = 13.00 seconds
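
To inspect the 012 output in Python rather than R, the three files can be combined into one table. This is a sketch under assumptions: `read_012` is a hypothetical helper, and it relies on the vcftools 012 layout as I understand it (`.012` has one row per individual with a leading index column, missing genotypes are coded as -1, and `.012.indv` / `.012.pos` hold the row and column labels).

```python
import numpy as np
import pandas as pd

def read_012(prefix):
    """Load <prefix>.012, <prefix>.012.indv and <prefix>.012.pos into one DataFrame."""
    indv = pd.read_csv(prefix + ".012.indv", header=None)[0].astype(str)
    pos = pd.read_csv(prefix + ".012.pos", sep="\t", header=None,
                      names=["CHROM", "POS"])
    # the first column of the .012 file is the individual's row number
    geno = pd.read_csv(prefix + ".012", sep="\t", header=None, index_col=0)
    geno.index = indv.tolist()
    geno.columns = (pos.CHROM.astype(str) + "-" + pos.POS.astype(str)).tolist()
    return geno.replace(-1, np.nan)  # mark missing genotypes as NaN
```

In this notebook the prefix would be `snps_gz`, since that is what was passed to `--out`.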


# Get coverage per individual

In [43]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--depth


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/good_snps.recode.vcf.gz
	--depth
	--out /data/gpfs/assoc/denovo/tfaske/SPCR/ddocent/filtering/good_snps.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 602 out of 602 Individuals
Outputting Mean Depth by Individual
After filtering, kept 33072 out of a possible 33072 Sites
Run Time = 8.00 seconds


In [44]:
depth_file = os.path.join(analysis_dir, "good_snps.recode.vcf.gz.idepth")
depth_df = pd.read_csv(depth_file, sep="\t")
depth_df.head()

Unnamed: 0,INDV,N_SITES,MEAN_DEPTH
0,SPCR_100,27606,9.31406
1,SPCR_101,27976,13.9943
2,SPCR_102,27825,12.5613
3,SPCR_103,27682,12.7719
4,SPCR_104,27877,12.8541


In [45]:
depth_df.MEAN_DEPTH.describe()

count    602.000000
mean      12.070903
std        4.041770
min        4.039260
25%        9.396430
50%       11.445400
75%       13.816000
max       48.032300
Name: MEAN_DEPTH, dtype: float64

## Estimate relatedness

#### Papers:
KING: 10.1093/bioinformatics/btq559  
Error in KING: 10.3389/fgene.2022.882268  

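#### Relationship thresholds from the papers (values are KING kinship coefficients, the RELATEDNESS_PHI column of the output):   
Duplicate/clone/twin: phi > 0.354    
Parent-offspring/full-sib (1st degree): 0.177 < phi < 0.354    
2nd degree (e.g., half-sib): 0.0884 < phi < 0.177  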

In [97]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--relatedness2


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--relatedness2

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Individual Relatedness
After filtering, kept 22917 out of a possible 22917 Sites
Run Time = 52.00 seconds
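
The thresholds listed above can be turned into a small classifier for the pairwise output. This is a sketch: `classify_kinship` is a hypothetical helper, and the catch-all label for values at or below 0.0884 is my own addition, not a category from the papers.

```python
def classify_kinship(phi):
    """Map a KING kinship coefficient to a relationship class."""
    if phi > 0.354:
        return "duplicate/clone/twin"
    if phi > 0.177:
        return "1st degree"
    if phi > 0.0884:
        return "2nd degree"
    return "more distant or unrelated"  # assumption: catch-all label

for phi in (0.5, 0.25, 0.1, 0.01):
    print(phi, classify_kinship(phi))
```

Applied to the `.relatedness2` table read with pandas, this would look like `rel_df["RELATEDNESS_PHI"].apply(classify_kinship)`, where `rel_df` is a hypothetical name for that DataFrame.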


# Remove unnecessary files

In [46]:
!rm snps*

In [47]:
!rm *miss*

In [48]:
!rm *vcf

# Check 012 that PCA looks okay  

scp good_snps.recode.vcf.gz, good_snps.recode.vcf.gz.012, and good_snps.recode.vcf.gz.012.indv over to your local computer and run the same R Markdown as in the ddocent_output, PCA_012 dir.