# Quality control on VCF files pipeline

**Aim**

To generate quality-controlled (QC'ed) files from the UK Biobank (UKBB) pVCF data.

## Filters applied

1. Genotype depth filters: DP >= 7 for SNPs and DP >= 10 for indels (genotypes below these depths are set to missing)
    > Then only SNV variant sites that met at least one of the following two criteria were [retained](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwjPw5r_s5fvAhUVUzUKHe7GD-kQFjAEegQIDRAD&url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2Fbiorxiv%2Fearly%2F2019%2F03%2F09%2F572347%2FDC2%2Fembed%2Fmedia-2.pdf%3Fdownload%3Dtrue&usg=AOvVaw06fvt4jBTPq5VfepojT1mZ) according to filtering made on the ~50K exomes by the UKBB
    
    > 1) at least one heterozygous variant genotype with allele balance ratio greater than or equal to 15% (AB >= 0.15) 
    
    > 2) at least one homozygous variant genotype
    
2. At least one sample per site passes the allele balance threshold for heterozygous variants: AB >= 0.15 for SNPs and AB >= 0.20 for indels.
3. Genotype quality: GQ >= 20

More recent reference [here](https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1.full-text)
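
The filters above can be sketched in plain Python to make the logic explicit. This is only an illustration under stated assumptions, not the pipeline's implementation: the `Call` class and the `keep_site` function are hypothetical names, sites are assumed to be biallelic (as they are after multiallelic splitting), and DP and GQ are treated as independent thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Call:
    """One sample's genotype at a biallelic site (hypothetical container)."""
    gt: Optional[Tuple[int, int]]  # allele indices, None when missing
    ad_ref: int                    # reads supporting the reference allele
    ad_alt: int                    # reads supporting the alternate allele
    dp: int                        # genotype read depth
    gq: int                        # genotype quality


def keep_site(calls: List[Call], is_snp: bool,
              dp_snp: int = 7, dp_indel: int = 10, gq: int = 20,
              ab_snp: float = 0.15, ab_indel: float = 0.2) -> bool:
    """Return True if at least one genotype supports retaining the site."""
    min_dp = dp_snp if is_snp else dp_indel
    min_ab = ab_snp if is_snp else ab_indel
    for call in calls:
        # Genotypes failing DP or GQ are treated as missing (assumption: either failure masks)
        if call.gt is None or call.dp < min_dp or call.gq < gq:
            continue
        a, b = call.gt
        if a == b and a > 0:
            return True  # homozygous variant genotype
        if a != b:
            total = call.ad_ref + call.ad_alt
            if total > 0 and call.ad_alt / total >= min_ab:
                return True  # heterozygote passing the allele balance threshold
    return False
```

For example, a single heterozygous SNP call with 17 reference and 3 alternate reads (AB = 0.15, DP = 20, GQ = 40) is enough to retain a SNP site, but the same read counts would not retain an indel site, where the threshold is 0.2.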

## Concepts

Allele balance (AB): for a heterozygous genotype, the number of bases supporting the least-represented allele divided by the total number of base observations.
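
As a minimal sketch of this definition (the function names are hypothetical and the read counts are illustrative), allele balance can be computed from the reference and alternate read counts. Note that the `qc_2` step in this notebook uses the alternate-allele fraction `AD[1] / (AD[0] + AD[1])`, which equals the definition above only when the alternate allele is the least-represented one; both forms are shown for comparison.

```python
def allele_balance(ad_ref: int, ad_alt: int) -> float:
    """Fraction of observations supporting the least-represented allele."""
    total = ad_ref + ad_alt
    if total == 0:
        raise ValueError("no observations for this genotype")
    return min(ad_ref, ad_alt) / total


def alt_fraction(ad_ref: int, ad_alt: int) -> float:
    """Fraction of observations supporting the alternate allele."""
    total = ad_ref + ad_alt
    if total == 0:
        raise ValueError("no observations for this genotype")
    return ad_alt / total


# Illustrative counts: 17 reference reads, 3 alternate reads
print(allele_balance(17, 3))  # 3 / 20
print(alt_fraction(17, 3))    # same value, because the alternate allele is the minor one
print(alt_fraction(3, 17))    # differs from allele_balance(3, 17)
```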

## To run this notebook

```
sos run VCF_QC_pipeline.ipynb qc \
    --cwd output \
    --vcf_path /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/ \
    --DP_snp 7 \
    --DP_indel 10 \
    --GQ 20 \
    --AB_snp 0.15 \
    --AB_indel 0.2 \
    --geno 0.1 \
    --mind 0.1 \
    --maf 0.00000001
```


In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# pVCF files
parameter: vcf_path = paths
# reference genome hg38 path
parameter: ref_hg38 = path(f'/home/dmc2245/software/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa')
# Sample level QC - read depth (DP) to filter out SNPs below this value
parameter: DP_snp = 7
# Sample level QC - read depth (DP) to filter out indels below this value
parameter: DP_indel = 10
# Sample level QC - genotype quality (GQ) of specific sample. This measure tells you how confident we are that the genotype we assigned to a particular sample is correct
parameter: GQ = 20
# Allele balance for SNPs
parameter: AB_snp = 0.15
# Allele balance for indels
parameter: AB_indel = 0.2
# Variant missingness cut-off, 10% by default
parameter: geno = 0.1
# Sample missingness cut-off, 10% by default
parameter: mind = 0.1
# Minor allele frequency cut-off
parameter: maf  = 0.00000001

In [None]:
# Split multiallelic sites into biallelic records and left-normalize indels against the reference
# Exclude all sites where FILTER=="."
[qc_1]
input: vcf_path
output: f'{cwd}/{_input:bn}.leftnorm.vcf.gz', f'{cwd}/{_input:bn}.nofiltered.tstv'
task: trunk_workers = 1, walltime = '24h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'

    # Intermediate steps stream uncompressed BCF (-Ou); only the final output is bgzipped (-Oz)
    bcftools norm -m-any -Ou ${_input} | \
    bcftools norm --check-ref w -f ${ref_hg38} -Ou | \
    bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' -Ou | \
    bcftools filter -e 'FILTER=="."' -Oz > ${_output[0]}
    
    ## Calculate ts/tv ratio before filtering
    bcftools stats ${_output[0]} | grep "TSTV" > ${_output[1]}

In [None]:
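# Filter out genotypes based on DP and GQ for SNPs and indels, then remove monomorphic sites
# AC==0: sites at which no alternative alleles are called
# AC==AN: sites at which only alternative alleles are called
# Keep sites with at least one homozygous variant genotype or at least one heterozygote passing the allele balance threshold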
[qc_2]
input: f'{cwd}/{_input:bn}.leftnorm.vcf.gz'
output: f'{cwd}/{_input:bn}.filtered.vcf.gz', f'{cwd}/{_input:bn}.filtered.tstv'
task: trunk_workers = 1, walltime = '24h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'

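    # Set genotypes to missing when they fail either the depth (DP) or the genotype quality (GQ) threshold
    bcftools filter -S . -e '(TYPE="snp" & (FMT/DP<${DP_snp} | FMT/GQ<${GQ})) | (TYPE="indel" & (FMT/DP<${DP_indel} | FMT/GQ<${GQ}))' -Ou ${_input} |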
    bcftools filter -e 'AC==0 || AC==AN' -Ou |
    bcftools filter -i 'GT="AA" | (TYPE="snp" & GT="het" & (FMT/AD[*:1])/(FMT/AD[*:0] + FMT/AD[*:1]) >= ${AB_snp}) | (TYPE="indel" & GT="het" & (FMT/AD[*:1])/(FMT/AD[*:0] + FMT/AD[*:1]) >= ${AB_indel})' -Oz > ${_output[0]}
    
    ## Calculate ts/tv ratio after filtering
    bcftools stats ${_output[0]} | grep "TSTV" > ${_output[1]}
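
The `.tstv` files written before and after filtering can then be compared. The sketch below is a hypothetical helper, not part of the pipeline: it assumes the usual tab-separated `bcftools stats` layout, in which `grep "TSTV"` keeps commented header lines (starting with `#`) plus one data row whose columns are `TSTV`, id, ts, tv, ts/tv. The numbers in the example are made up for illustration.

```python
def parse_tstv(text: str) -> dict:
    """Extract ts, tv and ts/tv from the TSTV data row of `bcftools stats` output."""
    for line in text.splitlines():
        # Skip blank lines and the commented header lines that grep also matches
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] != "TSTV":
            continue
        return {"ts": int(fields[2]), "tv": int(fields[3]), "ts_tv": float(fields[4])}
    raise ValueError("no TSTV data row found")


# Illustrative content of a .tstv file (values are made up)
example = (
    "# TSTV, transitions/transversions:\n"
    "# TSTV\t[2]id\t[3]ts\t[4]tv\t[5]ts/tv\t[6]ts (1st ALT)\t[7]tv (1st ALT)\t[8]ts/tv (1st ALT)\n"
    "TSTV\t0\t2000\t1000\t2.00\t2000\t1000\t2.00\n"
)
stats = parse_tstv(example)
print(stats["ts"], stats["tv"], stats["ts_tv"])
```

Calling `parse_tstv` on the contents of the `.nofiltered.tstv` and `.filtered.tstv` files gives the two ratios to compare side by side.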

In [None]:
# Create PLINK files for analysis
# --geno / --mind: remove variants and individuals with missingness above the cut-off (10% by default)
# --vcf-filter: skip variants that failed one or more filters tracked by the FILTER field
# --vcf-require-gt: skip variants that have no GT field
[qc_3]
input: f'{cwd}/{_input:bn}.filtered.vcf.gz'
output: f'{cwd}/{_input:bn}.filtered.bed'
task: trunk_workers = 1, walltime = '24h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink --vcf ${_input} \
    --keep-allele-order \
    --allow-extra-chr \
    --vcf-filter \
    --vcf-require-gt \
    --maf ${maf} \
    --geno ${geno} \
    --mind ${mind} \
    --make-bed --out ${_output:n}
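
To illustrate what the `--mind` and `--geno` cut-offs mean, here is a small sketch on a toy genotype matrix. It is not how PLINK is implemented: the function name is hypothetical, missing genotypes are represented as `None`, and the sketch assumes that samples are filtered first (`--mind`) and variant missingness is then computed on the remaining samples (`--geno`).

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # alternate-allele count, None when the call is missing


def filter_missingness(matrix: List[List[Genotype]],
                       mind: float = 0.1,
                       geno: float = 0.1) -> Tuple[List[int], List[int]]:
    """Return indices of kept samples and kept variants (rows = variants, columns = samples)."""
    n_variants = len(matrix)
    n_samples = len(matrix[0]) if matrix else 0
    # Sample filter: drop samples whose missing rate exceeds `mind`
    kept_samples = [
        j for j in range(n_samples)
        if sum(row[j] is None for row in matrix) / n_variants <= mind
    ]
    # Variant filter: drop variants whose missing rate among kept samples exceeds `geno`
    kept_variants = [
        i for i, row in enumerate(matrix)
        if kept_samples
        and sum(row[j] is None for j in kept_samples) / len(kept_samples) <= geno
    ]
    return kept_samples, kept_variants


# Toy data: 3 variants x 4 samples; sample 2 is entirely missing
toy = [
    [0, 1, None, 0],
    [0, None, None, 2],
    [1, 0, None, 0],
]
samples, variants = filter_missingness(toy, mind=0.4, geno=0.2)
print(samples, variants)
```

With these toy thresholds, sample 2 is removed for being fully missing, and the second variant is then removed because one of the three remaining samples lacks a call.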

# Badri Script


In [None]:
#!/bin/bash
input_gvcf=$1
output_bcf=$2
output_plink=$3
source /mnt/mfs/hgrcgrid/homes/bnv2103/.bashrc
#input_gvcf="MAY_Alzheimers_GRM_WGS_2019-08-12_chrY.recalibrated_variants.vcf.gz"
#output_bcf="EFIGA_NIALOAD_chrY.bcf"
#outputplink="EFIGA_NIALOAD_chrY"
#bcftools norm -Ou -m -any ${input_gvcf} | bcftools norm -Ou -f /mnt/mfs/hgrcgrid/shared/GATK_Resources/b38/Homo_sapiens_assembly38.fasta | bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' >${output_bcf} 
# The bcftools normalization step above is commented out, so ${output_bcf} must already exist
/mnt/mfs/hgrcgrid/homes/bnv2103/softwares/plink --bcf "${output_bcf}" \
	--keep-allele-order \
	--allow-extra-chr \
	--vcf-filter \
	--vcf-require-gt \
	--maf 0.00000001 \
	--make-bed --out "${output_plink}"