# SNP filtering using vcftools   

## Filter ERNA_phylo using the same metrics as filtering without outgroup
### Do not remove individuals

Use your preferred conda environment.  
**Packages needed**: vcftools, bgzip, tabix

### Things we filter on:  

#### Individuals:  
- **Coverage**: Also called **depth**. See 3mapping. Calculated on the bam files as the average read count per locus per individual. 
- **Missing**: Proportion of missing data allowed across all loci for an individual. Missing data are common and often high in GBS/RADseq data, and they cause problems throughout downstream analysis. Many methods, including PCA (and all other ordination methods), require a complete matrix with no missing data. PCA will also cluster by missing data: individuals with more missing data cluster closer to the center, which produces a "fan" effect. The same can happen with coverage. This (among other reasons) is why people use a variance-covariance matrix of the genetic data for ordinations. Other approaches involve imputation. This can be sophisticated and use phased haplotype data, or simple: when you z-score your genotype data at each locus, (g - mean(g))/sd(g), you set all missing data to 0, i.e. the locus mean (the global allele frequency). There is more to this standardization; see *Patterson et al. 2006* (https://dx.plos.org/10.1371/journal.pgen.0020190). See PCAsim_ex in the examples directory for a demonstration of these issues.
    - (additional) This is another reason to use entropy. Entropy is a hierarchical Bayesian model, so it updates the genotype estimate for each missing value based on genotype likelihoods across loci and individuals and on the allele frequency of the cluster/deme that individual is assigned to.   
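
The simple mean-imputation shortcut described above can be sketched as follows. This is a minimal illustration, assuming a genotype matrix coded 0/1/2 with missing calls stored as NaN. The function name `zscore_mean_impute` and the toy matrix are hypothetical, and this is plain per-locus z-scoring, not the full standardization of Patterson et al.

```python
import numpy as np

def zscore_mean_impute(g):
    """Z-score each locus (column) and set missing genotypes to 0.

    g: 2-D array, individuals x loci, coded 0/1/2 with np.nan for missing.
    After centering, 0 corresponds to the per-locus mean genotype.
    """
    g = np.asarray(g, dtype=float)
    mean = np.nanmean(g, axis=0)
    sd = np.nanstd(g, axis=0)
    sd[sd == 0] = 1.0          # monomorphic locus: avoid dividing by zero
    z = (g - mean) / sd
    z[np.isnan(z)] = 0.0       # missing -> 0, i.e. the locus mean
    return z

# toy data (made up): 3 individuals x 3 loci, two missing calls
toy = np.array([[0, 2, np.nan],
                [1, np.nan, 2],
                [2, 0, 0]])
z = zscore_mean_impute(toy)
```

Because every missing call ends up at the locus mean, individuals with a lot of missing data are pulled toward the origin of the ordination, which is the "fan" effect noted above.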
    
#### Loci:  
- **Biallelic**: Keep only biallelic SNPs. Multiallelic SNPs are rare at the time scale we work at (citation needed), they are a mathematical nightmare, and we have enough data, so we ignore them. Nearly everyone does, except in deep-time phylogenetics. 
- **thin**: Keeps one locus within a specified distance (in bp). I am not 100% sure how it decides which one to keep. The vcftools manual says only that no two retained sites will be closer than the specified distance and does not mention quality or depth, so it most likely keeps the first site it encounters. This step is necessary because loci in close physical proximity are prone to shared sequencing error, and linkage disequilibrium (LD) confounds many population genetic parameters. For *de novo* reference assemblies, we thin to 100 because contigs/reads are ~92 bp in length. This keeps one locus per contig to control for LD and sequencing error, which is common practice in pop gen and necessary for many analyses.   
- **max-missing**: Controls missing data per locus. Despite its name, the vcftools `--max-missing` value is the minimum proportion of individuals that must be genotyped at a locus (0 allows completely missing sites, 1 allows no missing data), so `--max-missing 0.7` keeps loci with at most 30% missing data.  
- **MAF**: Minor allele frequency. The minimum frequency, measured in allele copies, that the minor allele must reach for the locus to be kept as a SNP (e.g. maf = 0.02 for 250 diploid individuals means the minor allele needs at least 10 copies out of 500, which could be as few as 5 homozygous individuals). Many papers have shown that the MAF cutoff has a large effect on clustering and demographic inference (citation needed). We do this a second time near the end if we removed individuals during missing data filtering.   
- **Mean Depth**: Average read depth per locus. Too low could be sequencing error; too high could be a PCR replication artifact (citation needed).    
- **Qual**: Locus quality. You can look up the math. Usually above 20-30 is good, but given our coverage and number of individuals, we can usually go much higher.     
- **Fis**: Inbreeding coefficient. Filtering on it is a contentious topic. The concern is paralogs (paralogous loci), where reads from multiple regions of the genome map to a single locus. This is an issue in highly repetitive genomes and usually leads to an excess of heterozygotes, so filtering on strongly negative Fis can help. See these two McKinney papers (https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12763, https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12613). Katie and others in the lab use his package, HDPlot, to deal with this.   
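
To make the MAF arithmetic concrete, here is a small sketch. It assumes diploid individuals and no missing genotypes, and relies on vcftools applying `--maf` to allele copies (chromosomes) rather than to individuals. The helper name `min_minor_allele_copies` is hypothetical.

```python
import math

def min_minor_allele_copies(maf, n_individuals, ploidy=2):
    """Smallest minor-allele count that meets a MAF threshold (no missing data)."""
    n_chromosomes = ploidy * n_individuals
    # round before ceil to guard against floating-point noise in the product
    return math.ceil(round(maf * n_chromosomes, 9))

copies_250 = min_minor_allele_copies(0.02, 250)   # 0.02 * 500 chromosomes
copies_586 = min_minor_allele_copies(0.02, 586)   # 0.02 * 1172 chromosomes
```

With 250 individuals the minor allele needs 10 copies, which could be 10 heterozygotes or as few as 5 homozygotes. With missing data the denominator shrinks, so the required count at a given locus can be lower.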


**See methods_ex.Rmd in examples directory for text** 


In [1]:
import sys
import ipyparallel as ipp
import os
from os import environ
import gzip
import warnings
import pandas as pd
import numpy as np
import scipy as sp
import glob
import re
import random

In [2]:
vcftools = "vcftools"
bgzip = "bgzip"
tabix = "tabix"

In [3]:
root = "/data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/phylo"

In [4]:
cd $root

/data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/phylo


#### cp vcf.gz from vcf_dir to filtering

In [5]:
!mkdir -p filtering

In [6]:
!cp vcf/ERNA_phylo.vcf.gz filtering/

In [7]:
analysis_dir = os.path.join(root,"filtering")

In [8]:
cd $analysis_dir

/data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/phylo/filtering


#### count snps in zipped vcf 

In [9]:
vcf_file = os.path.join(analysis_dir, "ERNA_phylo.vcf.gz")
assert os.path.exists(vcf_file)
vcf_file

'/data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/phylo/filtering/ERNA_phylo.vcf.gz'

In [None]:
!zcat $vcf_file | grep -vc '^#'
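
The same count can be obtained without shelling out. The sketch below assumes the file is gzip- or bgzip-compressed (Python's `gzip` module reads both); `count_vcf_records` is a hypothetical helper, not part of the original notebook.

```python
import gzip

def count_vcf_records(path):
    """Count non-header lines (one per site) in a compressed VCF."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # header and meta lines all start with '#'
            if not line.startswith("#"):
                n += 1
    return n
```

Usage would be `count_vcf_records(vcf_file)`; on a file with more than a million sites this is slower than `zcat | grep` but avoids depending on shell tools.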

## Keep only biallelic loci as the first step

In [15]:
!$vcftools --remove-indels \
--min-alleles 2 \
--max-alleles 2 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_file \
--out $'ERNA.biallelic'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/ERNA.vcf.gz
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out ERNA.biallelic
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
After filtering, kept 600 out of 600 Individuals
Outputting VCF file...
After filtering, kept 1317542 out of a possible 1613057 Sites
Run Time = 1542.00 seconds


In [16]:
vcf_biallelic = os.path.join(analysis_dir, "ERNA.biallelic.recode.vcf")
vcf_biallelic_gz = vcf_biallelic + '.gz'
!$bgzip -c $vcf_biallelic > {vcf_biallelic_gz}
!$tabix {vcf_biallelic_gz}

## Remove by MAF, missing data, and thin   


In [20]:
!$vcftools \
--max-missing 0.6 \
--maf 0.02 \
--thin 100 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_biallelic_gz \
--out $'ERNA_miss60_thin100_MAF2'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/ERNA.biallelic.recode.vcf.gz
	--recode-INFO-all
	--maf 0.02
	--thin 100
	--max-missing 0.6
	--out ERNA_miss60_thin100_MAF2
	--recode
	--remove-filtered-all

Using zlib version: 1.2.11
After filtering, kept 600 out of 600 Individuals
Outputting VCF file...
After filtering, kept 36948 out of a possible 1317542 Sites
Run Time = 173.00 seconds


In [21]:
!$vcftools \
--max-missing 0.7 \
--maf 0.02 \
--thin 100 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--gzvcf \
$vcf_biallelic_gz \
--out $'ERNA_miss70_thin100_MAF2'


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/ERNA.biallelic.recode.vcf.gz
	--recode-INFO-all
	--maf 0.02
	--thin 100
	--max-missing 0.7
	--out ERNA_miss70_thin100_MAF2
	--recode
	--remove-filtered-all

Using zlib version: 1.2.11
After filtering, kept 600 out of 600 Individuals
Outputting VCF file...
After filtering, kept 31474 out of a possible 1317542 Sites
Run Time = 167.00 seconds
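
The two runs above differ only in `--max-missing`. One way to avoid copy-paste drift is to build the argument list in one place. This is a sketch only: `vcftools_filter_cmd` is a hypothetical helper that is not in the notebook, and the output naming simply mirrors the `ERNA_miss70_thin100_MAF2` pattern used above.

```python
import shlex

def vcftools_filter_cmd(vcf_gz, max_missing, maf=0.02, thin=100, vcftools="vcftools"):
    """Build the vcftools filtering call for one --max-missing value."""
    out = "ERNA_miss%d_thin%d_MAF%d" % (round(max_missing * 100), thin, round(maf * 100))
    return [vcftools, "--gzvcf", vcf_gz,
            "--max-missing", str(max_missing),
            "--maf", str(maf),
            "--thin", str(thin),
            "--remove-filtered-all", "--recode", "--recode-INFO-all",
            "--out", out]

# print the commands for inspection; nothing is executed here
for mm in (0.6, 0.7):
    print(shlex.join(vcftools_filter_cmd("ERNA.biallelic.recode.vcf.gz", mm)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if you prefer that over the `!` shell escape.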


In [22]:
vcf_filtered = "ERNA_miss70_thin100_MAF2.recode.vcf"
vcf_filtered_gz = "%s.gz" % vcf_filtered

In [23]:
!$bgzip -c $vcf_filtered > {vcf_filtered_gz}
!$tabix {vcf_filtered_gz}

# Remove individuals with too much missing data (bad_indv)


In [24]:
!$vcftools --gzvcf $vcf_filtered_gz --out $vcf_filtered_gz --missing-indv


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf ERNA_miss70_thin100_MAF2.recode.vcf.gz
	--missing-indv
	--out ERNA_miss70_thin100_MAF2.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 600 out of 600 Individuals
Outputting Individual Missingness
After filtering, kept 31474 out of a possible 31474 Sites
Run Time = 4.00 seconds


In [25]:
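def get_imiss(vcf_file):
    # vcftools --missing-indv writes <out>.imiss; --out was set to the vcf.gz name
    imiss_file = "%s.imiss" % vcf_file
    imiss_df = pd.read_csv(imiss_file, sep="\t")
    # index by individual ID so rows can be looked up by sample name
    imiss_df.index = imiss_df.INDV
    return imiss_df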

In [26]:
imiss_df = get_imiss(vcf_filtered_gz)
imiss_df.head()

| INDV (index) | INDV | N_DATA | N_GENOTYPES_FILTERED | N_MISS | F_MISS |
|---|---|---|---|---|---|
| EN_AH_10 | EN_AH_10 | 31474 | 0 | 3153 | 0.100178 |
| EN_AH_11 | EN_AH_11 | 31474 | 0 | 2694 | 0.085595 |
| EN_AH_12 | EN_AH_12 | 31474 | 0 | 2672 | 0.084895 |
| EN_AH_13 | EN_AH_13 | 31474 | 0 | 2525 | 0.080225 |
| EN_AH_14 | EN_AH_14 | 31474 | 0 | 3522 | 0.111902 |


In [27]:
imiss_df.F_MISS.describe()

count    600.000000
mean       0.126786
std        0.115348
min        0.034060
25%        0.065435
50%        0.088200
75%        0.143904
max        0.985925
Name: F_MISS, dtype: float64

#### The cell below lets you try different cutoffs and see how many individuals you would lose. We keep individuals with more than 60% data (F_MISS < 0.4)

In [28]:
len(imiss_df),len(imiss_df[imiss_df.F_MISS >= .4]),len(imiss_df[imiss_df.F_MISS >= .5]), len(imiss_df[imiss_df.F_MISS >= .25])


(600, 14, 10, 52)
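
The same comparison can be written so that each count is labeled by its cutoff. This is a minimal sketch with made-up `F_MISS` values; `count_lost` is a hypothetical helper, and in the notebook you would pass `imiss_df.F_MISS` instead of the toy list.

```python
import pandas as pd

def count_lost(f_miss, cutoffs=(0.25, 0.4, 0.5)):
    """Number of individuals with F_MISS at or above each cutoff."""
    f_miss = pd.Series(f_miss, dtype=float)
    return pd.Series({c: int((f_miss >= c).sum()) for c in cutoffs}, name="n_removed")

# toy values (not real data): five individuals
toy_f_miss = [0.05, 0.10, 0.30, 0.45, 0.98]
lost = count_lost(toy_f_miss)
```

Labeling the counts avoids having to remember the order of the cutoffs in the tuple output above.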

In [29]:
bad_indv = imiss_df.INDV[imiss_df.F_MISS >= .4]

In [30]:
with open(os.path.join(analysis_dir, "bad_indv.txt"), "w") as o:
    o.write("INDV\n")
    for elem in bad_indv.index:
        o.write("%s\n" % elem)

In [31]:
!$vcftools --gzvcf $vcf_filtered_gz \
--remove-indels  \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--remove {os.path.join(analysis_dir, "bad_indv.txt")} \
--out {os.path.join(analysis_dir, "snps_indv_removed")}


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf ERNA_miss70_thin100_MAF2.recode.vcf.gz
	--remove /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/bad_indv.txt
	--recode-INFO-all
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
Excluding individuals in 'exclude' list
After filtering, kept 586 out of 600 Individuals
Outputting VCF file...
After filtering, kept 31474 out of a possible 31474 Sites
Run Time = 37.00 seconds


In [32]:
vcf_bad_remove = os.path.join(analysis_dir, "snps_indv_removed.recode.vcf")
vcf_bad_remove_gz = vcf_bad_remove + ".gz"
!$bgzip -c {vcf_bad_remove} > {vcf_bad_remove_gz}
!$tabix {vcf_bad_remove_gz}

# Filter SNPs further 
This needs to be done after removing individuals.   

### This uses vcftools to get some stats to summarize and make decisions with later

In [33]:
def get_vcf_stats(vcf_gz):
    
    stats = ['depth',
            'site-depth',
            'site-mean-depth',
            'site-quality',
            'missing-site',
            'freq',
            'counts',
            'hardy',
            'het']
    
    for stat in stats:
        !$vcftools --gzvcf $vcf_gz \
        --out $vcf_gz \
        {"--%s" % stat} 

In [34]:
get_vcf_stats(vcf_bad_remove_gz)


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz
	--depth
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Mean Depth by Individual
After filtering, kept 31474 out of a possible 31474 Sites
Run Time = 4.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz
	--site-depth

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Depth for Each Site
After filtering, kept 31474 out of a possible 31474 Sites
Run Time = 4.0


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz
	--het
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Individual Heterozygosity
After filtering, kept 31474 out of a possible 31474 Sites
Run Time = 4.00 seconds


### Functions that will calculate various metrics we use to filter   

In [38]:
#pd.set_option('display.max_columns', 100)

def get_MAF(row):
    # minor allele frequency = the smaller of the two allele frequencies
    try:
        return np.min([row.A1_freq, row.A2_freq])
    except Exception:
        print(row)

def get_correction(n):
    # finite-sample correction for expected heterozygosity: 2n/(2n-1)
    return (2*n)/(2*n-1)

def calculate_Fis(vals):
    # vals is the vcftools --hardy genotype count string "HOM1/HET/HOM2"
    try:
        data = [float(x) for x in vals.split("/")]
        assert len(data) == 3
        num_individuals = np.sum(data)
        total_alleles = 2*num_individuals
        het_count = data[1]
        a1_count = 2*data[0] + het_count
        a2_count = 2*data[2] + het_count
        a1_freq = a1_count/total_alleles
        a2_freq = a2_count/total_alleles
        # compare with a tolerance; exact float equality can fail on rounding
        assert np.isclose(a1_freq + a2_freq, 1.0)
        He = 2 * a1_freq * a2_freq * get_correction(num_individuals)
        Ho = het_count/num_individuals
        Fis = 1 - (Ho/He)
        return Fis
    except Exception:
        # -9 flags loci where Fis could not be calculated
        return -9

def combine_vcf_stats(filedir, prefix):
    
    hardy_files = !ls {filedir}/{prefix}.hwe
    hardy = pd.read_csv(hardy_files[0], sep="\t")

    hardy.columns = ['CHROM', 'POS', 'OBS(HOM1/HET/HOM2)', 'E(HOM1/HET/HOM2)', 'ChiSq_HWE',
       'P_HWE', 'P_HET_DEFICIT', 'P_HET_EXCESS']
    hardy.index = hardy.apply(lambda x: "%s-%d" % (x.CHROM, x.POS), axis=1)
    
    loci_files = !ls {filedir}/{prefix}.l* | grep -v log
    loci_df = pd.concat([pd.read_csv(x, sep="\t", skiprows=0) for x in loci_files], axis=1)
    chrom_pos = loci_df.iloc[:,0:2]
    
    frq_files = !ls {filedir}/{prefix}.frq* | grep -v count
    frq_data = []
    # read the .frq file by hand; the context manager closes the file when done
    with open(frq_files[0]) as h:
        header = h.readline().strip().split()
        for line in h:
            frq_data.append(line.strip().split('\t'))

    header = ['CHROM', 'POS', 'N_ALLELES', 'N_CHR', 'A1_FREQ', "A2_FREQ"]
    frq_df = pd.DataFrame(frq_data)
    print(frq_df.columns)
    #frq_df = frq_df.drop([6,7],axis=1)
    frq_df.columns = header
    frq_df.index = frq_df.apply(lambda x: "%s-%s" % (x.CHROM, x.POS), axis=1)
    
    loci_df = loci_df.drop(['CHROM','CHR','POS'], axis=1)
    loci_df = pd.concat([chrom_pos, loci_df], axis=1)
    loci_df.index = loci_df.apply(lambda x: "%s-%d" % (x.CHROM, x.POS), axis=1)
    
    loci_df = pd.concat([loci_df, frq_df, hardy], axis=1)
    loci_df["A1_allele"] = loci_df.apply(lambda row: row.A1_FREQ.split(":")[0], axis=1)
    loci_df["A2_allele"] = loci_df.apply(lambda row: row.A2_FREQ.split(":")[0], axis=1)
    
    loci_df["A1_freq"] = loci_df.apply(lambda row: float(row.A1_FREQ.split(":")[1]), axis=1)
    loci_df["A2_freq"] = loci_df.apply(lambda row: float(row.A2_FREQ.split(":")[1]), axis=1)
    
    loci_df['MAF'] = loci_df.apply(get_MAF, axis=1)
    loci_df = loci_df.drop(['CHROM', 'POS'], axis=1)
    
    loci_df['Fis'] = loci_df['OBS(HOM1/HET/HOM2)'].apply(calculate_Fis)
    
    return loci_df, frq_df, hardy

In [39]:
vcf_bad_remove_gz

'/data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz'

In [40]:
loci_df, frq_df, hardy = combine_vcf_stats(analysis_dir,'snps_indv_removed.recode.vcf.gz')

RangeIndex(start=0, stop=6, step=1)
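
As a sanity check on the Fis calculation, the sketch below recomputes it from raw genotype counts for two made-up loci. It is self-contained and assumes the same formula as `calculate_Fis` above: Fis = 1 - Ho/He, with He multiplied by the finite-sample correction 2n/(2n-1). The function name `fis_from_counts` and the counts are hypothetical.

```python
def fis_from_counts(hom1, het, hom2):
    """Fis = 1 - Ho/He, with He corrected by 2n/(2n-1)."""
    n = hom1 + het + hom2
    p = (2 * hom1 + het) / (2 * n)      # frequency of allele 1
    q = 1 - p
    he = 2 * p * q * (2 * n) / (2 * n - 1)
    ho = het / n
    return 1 - ho / he

excess_het = fis_from_counts(10, 80, 10)   # far more heterozygotes than expected
hwe_like = fis_from_counts(25, 50, 25)     # close to Hardy-Weinberg proportions
```

The first locus has a strongly negative Fis (below -0.5), the pattern expected when paralogs collapse onto one locus. The second sits near zero.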


### Look at the summary stats and make decisions


In [41]:
loci_df.MEAN_DEPTH.describe()

count    31474.000000
mean        13.750139
std         28.069152
min          1.605800
25%          3.824230
50%          5.619450
75%         10.572975
max        248.159000
Name: MEAN_DEPTH, dtype: float64

In [42]:
loci_df.QUAL.describe()

count    31474.000000
mean       948.087908
std        142.178279
min         15.167700
25%        999.000000
50%        999.000000
75%        999.000000
max        999.000000
Name: QUAL, dtype: float64

In [45]:
# this would not be zero if there was an error in the calculation
len(loci_df[loci_df.Fis == -9])

0

In [52]:
len(loci_df[loci_df.MEAN_DEPTH > 25]),len(loci_df[loci_df.MEAN_DEPTH >= 3])

(3170, 28547)

In [48]:
len(loci_df[loci_df.QUAL >= 100]) - len(loci_df[loci_df.QUAL >= 200])

62

In [51]:
len(loci_df[loci_df.QUAL <  500]), len(loci_df[loci_df.QUAL < 750]), len(loci_df[loci_df.QUAL < 999])

(1034, 2838, 5194)

In [None]:
len(loci_df[loci_df.Fis <= -0.5]), len(loci_df[loci_df.MAF < 0.02])

In [53]:
def filter_snps(df, imputed=False):
    if imputed:
        return df[(df.MAF >= 0.01) &  
                  (df.Fis > -0.5)]
    else:
        return df[(df.MEAN_DEPTH >= 3) & 
                  (df.MEAN_DEPTH < 25) & 
                  (df.QUAL >= 750) & 
                  (df.MAF >= 0.02) &  
                  (df.Fis > -0.5)]

In [54]:
loci_stage1 = filter_snps(loci_df)
loci_stage1.shape

(22917, 25)
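
Before committing to the thresholds, it can help to see how many loci each one removes on its own. The sketch below uses the same cutoffs as the non-imputed branch of `filter_snps`, applied to a made-up data frame. `count_filter_failures` is a hypothetical helper, and loci can fail more than one filter, so the counts do not sum to the total removed.

```python
import pandas as pd

def count_filter_failures(df):
    """Count loci failing each threshold used in filter_snps (non-imputed branch)."""
    checks = {
        "MEAN_DEPTH < 3": df.MEAN_DEPTH < 3,
        "MEAN_DEPTH >= 25": df.MEAN_DEPTH >= 25,
        "QUAL < 750": df.QUAL < 750,
        "MAF < 0.02": df.MAF < 0.02,
        "Fis <= -0.5": df.Fis <= -0.5,
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})

# toy loci (not real data)
toy = pd.DataFrame({
    "MEAN_DEPTH": [2.0, 10.0, 30.0, 8.0],
    "QUAL": [999.0, 999.0, 999.0, 500.0],
    "MAF": [0.10, 0.20, 0.05, 0.01],
    "Fis": [0.0, -0.6, 0.1, 0.0],
})
failures = count_filter_failures(toy)
```

In the notebook, `count_filter_failures(loci_df)` would give the per-filter breakdown for the real data.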

In [55]:
with open(os.path.join(analysis_dir, "stage1_positions.txt"), "w") as o:
    for elem in loci_stage1.index:
        # split on the last "-" only, in case a contig name itself contains "-"
        o.write("%s\n" % "\t".join(elem.rsplit("-", 1)))

In [56]:
!$vcftools --gzvcf $vcf_bad_remove_gz \
--remove-indels  \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--positions {os.path.join(analysis_dir, "stage1_positions.txt")} \
--out {os.path.join(analysis_dir, "good_snps")}


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/snps_indv_removed.recode.vcf.gz
	--recode-INFO-all
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps
	--positions /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/stage1_positions.txt
	--recode
	--remove-filtered-all
	--remove-indels

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting VCF file...
After filtering, kept 22917 out of a possible 31474 Sites
Run Time = 28.00 seconds


In [57]:
snps = os.path.join(analysis_dir, "good_snps.recode.vcf")
snps_gz = snps + ".gz"
!$bgzip -c {snps} > {snps_gz}
!$tabix {snps_gz}

# Make 012, see directory PCA_012 for results

In [58]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--012


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--012
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Writing 012 matrix files ... Done.
After filtering, kept 22917 out of a possible 22917 Sites
Run Time = 4.00 seconds


# Get coverage per individual

In [59]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--depth


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--depth
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Mean Depth by Individual
After filtering, kept 22917 out of a possible 22917 Sites
Run Time = 4.00 seconds


In [60]:
depth_file = os.path.join(analysis_dir, "good_snps.recode.vcf.gz.idepth")
depth_df = pd.read_csv(depth_file, sep="\t")
depth_df.head()

| | INDV | N_SITES | MEAN_DEPTH |
|---|---|---|---|
| 0 | EN_AH_10 | 22917 | 9.05965 |
| 1 | EN_AH_11 | 22917 | 8.50823 |
| 2 | EN_AH_12 | 22917 | 10.047 |
| 3 | EN_AH_13 | 22917 | 9.19121 |
| 4 | EN_AH_14 | 22917 | 6.10045 |


In [61]:
depth_df.MEAN_DEPTH.describe()

count    586.000000
mean       7.249223
std        1.962058
min        1.709740
25%        6.041997
50%        7.516385
75%        8.721273
max       11.961000
Name: MEAN_DEPTH, dtype: float64

## Estimate relatedness

#### Papers:
KING: 10.1093/bioinformatics/btq559  
Error in KING: 10.3389/fgene.2022.882268  

#### Relationship from papers:   
Here r is the KING kinship coefficient, which vcftools reports as RELATEDNESS_PHI.  
Duplicate/Clone/twin: r > 0.354    
Parent-offspring/full-sib: 0.177 < r < 0.354    
Half-sib and other 2nd-degree relatives: 0.0884 < r < 0.177  

In [62]:
!$vcftools --gzvcf {snps_gz} \
--out {snps_gz} \
--relatedness2


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--out /data/gpfs/assoc/denovo/tfaske/rabbit/full/REDO/SNPcall/filtering/good_snps.recode.vcf.gz
	--relatedness2

Using zlib version: 1.2.11
After filtering, kept 586 out of 586 Individuals
Outputting Individual Relatedness
After filtering, kept 22917 out of a possible 22917 Sites
Run Time = 52.00 seconds
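
The thresholds listed above can be applied to the `.relatedness2` output along these lines. This is a sketch on a made-up table: it assumes the output has `INDV1`, `INDV2` and `RELATEDNESS_PHI` columns, and `classify_kinship` is a hypothetical helper. On the real file you would first load it with `pd.read_csv(snps_gz + ".relatedness2", sep="\t")`.

```python
import numpy as np
import pandas as pd

def classify_kinship(phi):
    """Map KING kinship coefficients to the relationship classes listed above."""
    phi = np.asarray(phi, dtype=float)
    # conditions are checked in order, so each class is bounded above by the previous one
    conditions = [phi > 0.354, phi > 0.177, phi > 0.0884]
    labels = ["duplicate/clone/twin", "parent-offspring/full-sib", "2nd-degree"]
    return np.select(conditions, labels, default="more distant/unrelated")

# toy pairs (not real data)
toy = pd.DataFrame({
    "INDV1": ["a", "a", "a", "b"],
    "INDV2": ["a", "b", "c", "c"],
    "RELATEDNESS_PHI": [0.5, 0.25, 0.10, 0.01],
})
pairs = toy[toy.INDV1 != toy.INDV2].copy()   # drop self-comparisons if present
pairs["class"] = classify_kinship(pairs.RELATEDNESS_PHI)
```

Pairs that land in the duplicate/clone class are worth checking against the sample sheet before any are dropped.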


# Remove unnecessary files

In [63]:
!rm snps*

In [64]:
!rm *miss*

In [65]:
!rm *vcf

# Check that the PCA from the 012 file looks okay  

scp good_snps.recode.vcf.gz, good_snps.recode.vcf.gz.012, and good_snps.recode.vcf.gz.012.indv over to your local computer and run the same R markdown as in the ddocent_output, PCA_012 dir
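
For a quick look before moving files, a rough PCA can also be run in Python. This is a sketch and not the lab's R markdown: it assumes the vcftools `.012` layout (first column is the individual index, genotypes coded 0/1/2, missing coded -1) and fills missing values with the locus mean. `read_012` and `pca_scores` are hypothetical names, and the toy matrix is made up.

```python
import io
import numpy as np

def read_012(handle):
    """Parse a vcftools .012 matrix; drop the index column, turn -1 into NaN."""
    geno = np.loadtxt(handle, dtype=float, ndmin=2)[:, 1:]
    geno[geno == -1] = np.nan
    return geno

def pca_scores(geno, n_components=2):
    """PCA via SVD on per-locus centered genotypes, missing set to the locus mean."""
    centered = geno - np.nanmean(geno, axis=0)
    centered[np.isnan(centered)] = 0.0
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# toy file contents: 3 individuals x 3 loci, one missing genotype
toy_012 = io.StringIO("0\t0\t1\t2\n1\t2\t-1\t0\n2\t1\t1\t1\n")
geno = read_012(toy_012)
scores = pca_scores(geno)
```

On the real data, `read_012` accepts the path to `good_snps.recode.vcf.gz.012` directly, and the rows of `scores` follow the order of the `.012.indv` file.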