# Quality control pipeline for VCF files

## **Aim**

To generate QC'ed files from the UKBB pVCF data

## Filters applied

1. Genotype depth filters: DP>7 for SNPs and DP>10 for indels
    > Then only SNV variant sites that met at least one of the following two criteria were [retained](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwjPw5r_s5fvAhUVUzUKHe7GD-kQFjAEegQIDRAD&url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2Fbiorxiv%2Fearly%2F2019%2F03%2F09%2F572347%2FDC2%2Fembed%2Fmedia-2.pdf%3Fdownload%3Dtrue&usg=AOvVaw06fvt4jBTPq5VfepojT1mZ) according to filtering made on the ~50K exomes by the UKBB
    
    > 1) at least one heterozygous variant genotype with allele balance ratio greater than or equal to 15% (AB >= 0.15) 
    
    > 2) at least one homozygous variant genotype
    
2. At least one sample per site passed the allele balance threshold >= 0.15 for SNPs and >=0.20 for indels (heterozygous variants).
3. Genotype quality GQ>20

More recent reference [here](https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1.full-text)
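
The site-level retention rule described in the filters above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's implementation: `Genotype`, `alt_fraction` and `retain_site` are hypothetical names, sites are assumed bi-allelic, and allele balance is taken here as alternate reads over reference plus alternate reads.

```python
from typing import NamedTuple, Sequence


class Genotype(NamedTuple):
    """One sample's call at a bi-allelic site (hypothetical container)."""
    gt: str          # "0/0", "0/1", "1/1" or "./."
    ref_reads: int   # allelic depth of the reference allele
    alt_reads: int   # allelic depth of the alternate allele


def alt_fraction(g: Genotype) -> float:
    """Alternate reads over all reads; 0.0 when there are no reads."""
    total = g.ref_reads + g.alt_reads
    return g.alt_reads / total if total else 0.0


def retain_site(genotypes: Sequence[Genotype], ab_threshold: float) -> bool:
    """Keep the site if any sample is hom-alt, or any het meets the AB threshold."""
    for g in genotypes:
        alleles = g.gt.replace("|", "/").split("/")
        if alleles == ["1", "1"]:
            return True
        if sorted(alleles) == ["0", "1"] and alt_fraction(g) >= ab_threshold:
            return True
    return False


site = [Genotype("0/0", 30, 0), Genotype("0/1", 18, 2)]
print(retain_site(site, ab_threshold=0.15))                              # False: the only het has AB = 0.10
print(retain_site(site + [Genotype("0/1", 12, 8)], ab_threshold=0.15))   # True: one het has AB = 0.40
```

For indels the same function would be called with the indel threshold (0.20 in the list above).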

## Workflow

FIXME: what's the best order of operations?

### Step 1. Annotate rsid with dbSNP

I have decided that this is going to be the first step of the VCF processing given that left normalization introduces bias in SNP annotation and stats calculations. 

## Concepts

* Bi-allelic: a locus with exactly two observed alleles, counting the reference as one, so it carries a single variant allele.
* Multi-allelic: a locus with three or more observed alleles, again counting the reference as one, so it carries two or more variant alleles.
* Allele balance: for heterozygous genotypes, the number of bases supporting the least-represented allele divided by the total number of base observations.
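
The definitions above can be made concrete with a small Python sketch. The function names `classify_site` and `allele_balance` are hypothetical and only meant to illustrate the concepts; they are not part of the workflow.

```python
from typing import Sequence


def classify_site(ref: str, alts: Sequence[str]) -> str:
    """Count observed alleles (reference included) and name the site type."""
    n_alleles = 1 + len(alts)
    if n_alleles < 2:
        return "monomorphic"
    return "bi-allelic" if n_alleles == 2 else "multi-allelic"


def allele_balance(allele_depths: Sequence[int]) -> float:
    """Reads of the least-represented allele over all reads, for a heterozygous call."""
    total = sum(allele_depths)
    if total == 0:
        raise ValueError("no reads observed")
    return min(allele_depths) / total


print(classify_site("A", ["G"]))        # bi-allelic
print(classify_site("A", ["G", "T"]))   # multi-allelic
print(allele_balance([17, 3]))          # 0.15
```

Note that the filtering step later in this notebook computes the ratio as `AD[1] / (AD[0] + AD[1])`, that is, the alternate-allele fraction, which equals the value above only when the alternate allele is the less represented one.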

bcftools expression `bcftools filter -i '(FMT/DP)>10'`: includes sites for which at least one sample has DP>10
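
The "at least one sample" behaviour of a site-level include expression can be mimicked in Python as follows. This is a sketch of the semantics only; `site_included` is a hypothetical helper, and treating a missing DP as failing is an assumption of this example.

```python
from typing import Optional, Sequence


def site_included(sample_dp: Sequence[Optional[int]], min_dp: int = 10) -> bool:
    """True when at least one sample has DP strictly above min_dp."""
    return any(dp is not None and dp > min_dp for dp in sample_dp)


print(site_included([3, 4, 25]))     # True: one deep sample is enough
print(site_included([3, 4, None]))   # False: no sample exceeds the threshold
```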

### Annotation of known/novel variants

We used the dbSNP VCF available [here](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/); the copy on Columbia's cluster is in `/mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/dbSPN_hg38/`. The following code creates the tab file needed by `bcftools annotate`:

```
bcftools query  -f'%CHROM\t%POS\t%ID\t%REF\t%ALT\n' /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/dbSPN_hg38/00-All.renamechrs.vcf.gz | \
awk 'BEGIN{OFS="\t";} {if (length ($4) > length ($5)) {print $1,$2,$2+ (length ($4) - 1),$3} else {print $1,$2, $2 + (length ($4) -1 ),$3}}' | \
bgzip -c > /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/dbSPN_hg38/00-All.renamechrs.indels_snps.tab.gz
```
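
For readers less familiar with awk, the row layout of the tab file can be sketched in Python. This assumes every row should carry the four columns that the workflow later maps with `-c CHROM,FROM,TO,RS`; `annotation_row` is a hypothetical helper and the records below are made-up values, not real dbSNP entries.

```python
def annotation_row(chrom: str, pos: int, rsid: str, ref: str) -> str:
    """Return one tab-delimited CHROM, FROM, TO, RS line.

    TO is the last reference base covered by the record, so a SNP spans
    one position and a deletion spans the whole REF allele.
    """
    end = pos + len(ref) - 1
    return "\t".join([chrom, str(pos), str(end), rsid])


# Illustrative records only
print(annotation_row("chr1", 1000, "rsEXAMPLE1", "A"))     # SNP: FROM == TO
print(annotation_row("chr1", 2000, "rsEXAMPLE2", "ACGT"))  # deletion: TO = FROM + 3
```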

I followed [this](https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/03_annotation-snpeff.html) tutorial, which illustrates annotation with bcftools.

## To run this notebook

```
sos dryrun ~/project/UKBB_GWAS_dev/workflow/VCF_QC_pipeline.ipynb qc:3 \
    --cwd output \
    --vcfs `echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/ukb23156_c{1..22}_b*` \
    --DP_snp 7 \
    --DP_indel 10 \
    --GQ 20 \
    --AB_snp 0.15 \
    --AB_indel 0.2 \
    --geno_filter 0.1 \
    --mind_filter 0.1 \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif
```


In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# pVCF files
parameter: vcfs = paths
# reference genome hg38 path
parameter: ref_hg38 = path(f'/home/dmc2245/software/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa')
# Sample level QC - read depth (DP) to filter out SNPs below this value
parameter: DP_snp = 7
# Sample level QC - read depth (DP) to filter out indels below this value
parameter: DP_indel = 10
# Sample level QC - genotype quality (GQ) of specific sample. This measure tells you how confident we are that the genotype we assigned to a particular sample is correct
parameter: GQ = 20
# Allele balance for snps
parameter: AB_snp = 0.15
# Allele balance for indels
parameter: AB_indel = 0.2
# Variant missingness cut-off (default 10%)
parameter: geno_filter = 0.1
# Sample missingness cut-off (default 10%)
parameter: mind_filter = 0.1
# Minor allele frequency cut-off
parameter: maf_filter  = 0.0 
# Tab-delimited annotation file with CHROM, FROM, TO and RS columns
parameter: anno_file = path
# Header lines describing the annotation, added to the VCF header
parameter: anno_header = path
# Allele-frequency bins for the Ts/Tv stats, as comma-separated values
parameter: bins = str
# Container with bcftools
parameter: container_lmm = 'statisticalgenetics/lmm:2.3'
# Number of threads
parameter: numThreads = 4
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

In [None]:
# Split multiallelic sites and create unique variant annotation
[qc_1]
input: vcfs, group_by=1
output: vcf_leftnorm=f'{cwd}/cache/{_input:bnn}.leftnorm.vcf.gz',
        vcf_stats=f'{cwd}/stats/{_input:bnn}.stats_bcftools'
task: trunk_workers = 1, trunk_size=10, walltime = '12h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    
    bcftools_raw_stats() { 
        echo "Calculate bcftools stats from the raw vcf"
        bcftools stats -v -s- --af-bins ${bins} ${_input} > ${_output[1]}
        grep "TSTV" ${_output[1]} > ${_output[1]:n}.unfiltered.tstv
    }
    snpsift_raw_stats() {
        echo "Calculate SnpSift stats from the raw vcf"
        java -jar /usr/local/bin/SnpSift.jar tstv ${_input} > ${_output[1]:n}.stats_snpsift
    }
    left_norm() {
        echo "Left-normalize indels and split multiallelic variants in the vcf"
        bcftools norm -m-any ${_input} |\
        bcftools norm --check-ref w -f ${ref_hg38} -Ou |\
        bcftools +fill-tags -- -t all,F_MISSING,'VD=sum(DP)' |\
        bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |\
        bcftools annotate -a ${anno_file}  -h ${anno_header} -c CHROM,FROM,TO,RS -Oz > ${_output[0]}
    }

    bcftools_raw_stats &
    snpsift_raw_stats &
    left_norm &
    wait
    
    bcftools_leftnorm_maf_stats() {
        echo "Calculate stats after leftnorm for MAF bins"
        bcftools stats -v -s- --af-bins ${bins} ${_output[0]} > ${_output[1]}.leftnorm.maf
    }
    bcftools_lefnorm_novel_stats() {
        echo "Calculate stats after leftnorm including only novel variants as per dbSNP"
        bcftools stats -i 'RS="."' -v --af-bins ${bins} ${_output[0]} > ${_output[1]}.leftnorm.novel
    }
    bcftools_lefnorm_known_stats() {
        echo "Calculate stats after leftnorm including only known variants as per dbSNP"
        bcftools stats -i 'RS!="."' -v --af-bins ${bins} ${_output[0]} > ${_output[1]}.leftnorm.known
    }
    snpsift_lefnorm_stats() {
        echo "Calculate stats after leftnorm using SnpSift"
        java -jar /usr/local/bin/SnpSift.jar tstv ${_output[0]} > ${_output[1]:n}.leftnorm.stats_snpsift
    }
    bcftools_leftnorm_maf_stats &
    bcftools_lefnorm_novel_stats &
    bcftools_lefnorm_known_stats &
    snpsift_lefnorm_stats &
    wait
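
The effect of splitting multi-allelic records and assigning `CHROM:POS:REF:ALT` identifiers can be pictured with the following Python sketch. It only illustrates the naming scheme; `split_and_name` is a hypothetical helper, and the genotype recoding and indel left-alignment that `bcftools norm` performs are not reproduced here.

```python
from typing import Iterator, Tuple


def split_and_name(chrom: str, pos: int, ref: str, alt_field: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (variant_id, ref, alt) for each ALT allele of one VCF record."""
    for alt in alt_field.split(","):
        yield f"{chrom}:{pos}:{ref}:{alt}", ref, alt


# A made-up multi-allelic record with two ALT alleles
for variant_id, ref, alt in split_and_name("chr1", 12345, "A", "G,T"):
    print(variant_id)
# chr1:12345:A:G
# chr1:12345:A:T
```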

In [None]:
# Filter out variants based on DP and GQ for SNPs and indels
# Remove monomorphic sites: bcftools view -c1 keeps only sites with at least one non-reference allele
# F_MISSING = fraction of missing genotypes (all samples)
[qc_2]
input: output_from('qc_1')['vcf_leftnorm'], group_by=1
output: f'{cwd}/cache/{_input:bnn}.filtered.vcf.gz',
        f'{cwd}/stats/{_input:bnn}.filtered.stats_bcftools'
task: trunk_workers = 1, trunk_size=10, walltime = '12h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'

    bcftools filter -S . -e '(TYPE="SNP" & ((FMT/DP)<${DP_snp} | (FMT/GQ)<${GQ})) | (TYPE="INDEL" & ((FMT/DP)<${DP_indel} | (FMT/GQ)<${GQ}))' ${_input} | \
    bcftools view -c1  | \
    bcftools filter -i 'GT="AA" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_snp} | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_indel}' -Oz -o ${_output[0]} 
    
    bcftools_filter_maf_stats() {
        echo "Calculate stats after filtering for MAF bins"
        bcftools stats -v -s- --af-bins ${bins} ${_output[0]} > ${_output[1]}
        grep "TSTV" ${_output[1]}  > ${_output[1]:n}.tstv
    }
    bcftools_filter_novel_stats() {
        echo "Calculate stats after filtering including only novel variants as per dbSNP"
        bcftools stats -i 'RS="."' -v --af-bins ${bins} ${_output[0]} > ${_output[1]}.novel
    }
    bcftools_filter_known_stats() {
        echo "Calculate stats after filtering including only known variants as per dbSNP"
        bcftools stats -i 'RS!="."' -v --af-bins ${bins} ${_output[0]} > ${_output[1]}.known
    }
    snpsift_filter_stats() {
        echo "Calculate stats after filtering using SnpSift"
        java -jar /usr/local/bin/SnpSift.jar tstv ${_output[0]}  > ${_output[1]:n}.stats_snpsift
    }
    bcftools_filter_maf_stats &
    bcftools_filter_novel_stats &
    bcftools_filter_known_stats &
    snpsift_filter_stats &
    wait
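
Two quantities used in this step, the non-reference allele count behind `bcftools view -c1` and the fraction of missing genotypes (`F_MISSING`), can be sketched in Python. The helper names are hypothetical, and counting a genotype as missing when any of its alleles is `.` is an assumption of this example.

```python
from typing import Sequence


def nonref_allele_count(gts: Sequence[str]) -> int:
    """Count non-reference, non-missing alleles across all samples."""
    count = 0
    for gt in gts:
        for allele in gt.replace("|", "/").split("/"):
            if allele not in ("0", "."):
                count += 1
    return count


def fraction_missing(gts: Sequence[str]) -> float:
    """Fraction of samples whose genotype is missing (any allele is '.')."""
    if not gts:
        raise ValueError("no genotypes")
    missing = sum("." in gt.replace("|", "/").split("/") for gt in gts)
    return missing / len(gts)


site = ["0/0", "./.", "0/1", "./."]
print(nonref_allele_count(site))            # 1, so the site survives a ">= 1" check
print(fraction_missing(site))               # 0.5
print(nonref_allele_count(["0/0", "./."]))  # 0, a monomorphic site that would be dropped
```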


In [None]:
# Merge all the pVCF blocks for each chromosome
[qc_3]
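def group_chrom(vcfs):
  import os
  from itertools import groupby
  # Parse chromosome and block from the file name (ukb23156_c<chr>_b<blk>_v1...),
  # so the result does not depend on underscores elsewhere in the path
  def chrom_blk(x):
    return [int(y[1:]) for y in os.path.basename(str(x)).split('_')[1:3]]
  # First, order by chr and blk
  temp = sorted(vcfs, key = chrom_blk)
  # Then group by chrom
  return [list(ele) for i, ele in groupby(temp, lambda x: chrom_blk(x)[0])]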
from glob import glob
input: [f"{cwd}/cache/ukb23156_c{x+1}_b*_v1.leftnorm.filtered.vcf.gz" for x in range(22)], group_by = group_chrom
output: [f'{cwd}/cache/{_input[0].name.split("_b")[0]}.merged.vcf.gz']
task: trunk_workers = 1, trunk_size = job_size, walltime = '24h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
     bcftools concat -Oz ${_input} > ${_output}
     tabix -p vcf ${_output}
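
The grouping logic of this step can be tried outside SoS with plain Python. The sketch below assumes file names of the form `ukb23156_c<chrom>_b<block>_v1...`; `chrom_block` and `group_by_chrom` are hypothetical names used only for illustration.

```python
import re
from itertools import groupby
from pathlib import Path

# Assumed naming scheme: ukb23156_c<chrom>_b<block>_v1.<suffixes>
NAME_RE = re.compile(r"_c(\d+)_b(\d+)_")


def chrom_block(path):
    """Extract (chromosome, block) as integers from a pVCF block file name."""
    match = NAME_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1)), int(match.group(2))


def group_by_chrom(paths):
    """Sort by (chromosome, block), then return one list of paths per chromosome."""
    ordered = sorted(paths, key=chrom_block)
    return [list(group) for _, group in groupby(ordered, key=lambda p: chrom_block(p)[0])]


files = [
    "cache/ukb23156_c2_b0_v1.leftnorm.filtered.vcf.gz",
    "cache/ukb23156_c1_b10_v1.leftnorm.filtered.vcf.gz",
    "cache/ukb23156_c1_b2_v1.leftnorm.filtered.vcf.gz",
]
for group in group_by_chrom(files):
    print([chrom_block(p) for p in group])
# [(1, 2), (1, 10)]
# [(2, 0)]
```

Sorting on integers rather than strings places block 2 before block 10, so each chromosome's blocks reach the concatenation in block order.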

In [None]:
# Create plink files for analysis
#  individual and variant missingness <10% 
# --vcf-filter skip variants that failed one or more filters tracked by the FILTER field
# --vcf-require-gt keep only variants with genotypes
[qc_4]
input: group_by=1
output: f'{cwd}/{_input:bnn}.filtered.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink --vcf ${_input} \
    --keep-allele-order \
    --allow-extra-chr \
    --vcf-filter \
    --vcf-require-gt \
    --vcf-half-call m \
    ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
    --make-bed --out ${_output:n}
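
The conditional expressions in the command above add a plink flag only when its threshold is positive. A small Python sketch of that behaviour, with hypothetical helper names, is:

```python
def optional_flag(name: str, value: float) -> str:
    """Return '--name value' when value is positive, otherwise an empty string."""
    return f"--{name} {value}" if value > 0 else ""


def plink_filter_flags(maf_filter: float, geno_filter: float, mind_filter: float) -> str:
    """Join the flags that are switched on, skipping the disabled ones."""
    flags = [
        optional_flag("maf", maf_filter),
        optional_flag("geno", geno_filter),
        optional_flag("mind", mind_filter),
    ]
    return " ".join(flag for flag in flags if flag)


# With the notebook defaults (maf_filter = 0.0) no --maf flag is emitted
print(plink_filter_flags(0.0, 0.1, 0.1))   # --geno 0.1 --mind 0.1
```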

# Badri Script


In [None]:
#!/bin/bash
input_gvcf=$1
output_bcf=$2
output_plink=$3
source /mnt/mfs/hgrcgrid/homes/bnv2103/.bashrc
#input_gvcf="MAY_Alzheimers_GRM_WGS_2019-08-12_chrY.recalibrated_variants.vcf.gz"
#output_bcf="EFIGA_NIALOAD_chrY.bcf"
#output_plink="EFIGA_NIALOAD_chrY"
#bcftools norm -Ou -m -any ${input_gvcf} | bcftools norm -Ou -f /mnt/mfs/hgrcgrid/shared/GATK_Resources/b38/Homo_sapiens_assembly38.fasta | bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' >${output_bcf} 
/mnt/mfs/hgrcgrid/homes/bnv2103/softwares/plink --bcf ${output_bcf} \
	--keep-allele-order \
	--allow-extra-chr \
	--vcf-filter \
	--vcf-require-gt \
	--maf 0.00000001 \
	--make-bed --out ${output_plink}