# Quality control exomes UKBB Q4/2020

The data consist of 200,000 exomes:

* 50K exomes were made available in March 2019
* 150K exomes were made available in October 2020

The 50K set includes the following family relationships:

* 194 parent-offspring pairs
* 613 full-sibling pairs
* 26 trios
* 1 monozygotic twin pair
* 195 genetically determined second-degree relationships

**Quality control published for the 50K set**

FASTQ files were aligned to GRCh38 with BWA-MEM to generate BAM files.

Duplicates in the BAM files were identified and marked using Picard.

gVCF files with called variants were produced using WeCall.

Samples were excluded for any of the following reasons:
* Differences between genetic and reported sex
* High rates of heterozygosity/contamination (Dstat>0.4)
* Low sequence coverage (<85% of bases with 20X coverage)
* Sample duplicates 
* WES variants discordant with genotyping chip

A project-level VCF (pVCF) was then created.

Goldilocks:
* SNV with DP<7 changed to no-call
* SNV heterozygotes retained if the allele balance ratio was AB>=0.15
* Multiallelic variants left-normalized and represented as bi-allelic
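The two SNV genotype rules above can be sketched in Python as follows. This is a simplified illustration, not the Goldilocks implementation: the `Genotype` record, its field names, and the definition of allele balance as alternate-allele reads over total depth are assumptions made for the example, and what happens to heterozygotes that fail the allele balance check is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DEPTH = 7               # SNV genotypes with DP < 7 are changed to no-call
MIN_ALLELE_BALANCE = 0.15   # heterozygotes need AB >= 0.15


@dataclass
class Genotype:
    """Hypothetical per-sample SNV call; field names are illustrative."""
    gt: Optional[str]   # "0/0", "0/1", "1/1", or None for a no-call
    dp: int             # total read depth at the site
    alt_reads: int      # reads supporting the alternate allele


def apply_depth_filter(call: Genotype) -> Genotype:
    """Return a no-call when the read depth is below MIN_DEPTH."""
    if call.gt is not None and call.dp < MIN_DEPTH:
        return Genotype(None, call.dp, call.alt_reads)
    return call


def het_passes_allele_balance(call: Genotype) -> bool:
    """True if a heterozygous call has AB >= MIN_ALLELE_BALANCE.

    AB is assumed to be alt_reads / dp; non-heterozygous calls return False.
    """
    if call.gt not in ("0/1", "1/0") or call.dp == 0:
        return False
    return call.alt_reads / call.dp >= MIN_ALLELE_BALANCE


if __name__ == "__main__":
    low_depth = Genotype("0/1", dp=6, alt_reads=3)
    balanced = Genotype("0/1", dp=20, alt_reads=5)
    print(apply_depth_filter(low_depth).gt)
    print(het_passes_allele_balance(balanced))
```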

## Parameters for downstream QC

1. Individual and variant missingness <10%
2. Hardy-Weinberg p-value > 1e-15
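As a rough sketch, the two downstream thresholds can be written as simple predicates. The function names are hypothetical, and the Hardy-Weinberg cutoff is assumed to be 10^-15, the value used in the HWE check later in this notebook (`P <= 1e-15`).

```python
MAX_MISSINGNESS = 0.10   # individual and variant missingness must be below 10%
MIN_HWE_P = 1e-15        # assumed Hardy-Weinberg cutoff (see lead-in)


def sample_passes(f_miss: float) -> bool:
    """Keep a sample if its fraction of missing genotypes is below 10%."""
    return f_miss < MAX_MISSINGNESS


def variant_passes(f_miss: float, hwe_p: float) -> bool:
    """Keep a variant if missingness is below 10% and the HWE p-value exceeds the cutoff."""
    return f_miss < MAX_MISSINGNESS and hwe_p > MIN_HWE_P


if __name__ == "__main__":
    print(sample_passes(0.02))
    print(variant_passes(0.02, 0.3))
    print(variant_passes(0.02, 1e-20))
```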

# Quality control of pVCF/PLINK files

This pipeline is intended to be used as an extra quality-control step after obtaining the joint-call files provided by the UKBB (PLINK and pVCF formats).

To download the PLINK files, generate and run the following script:

```
tpl_file=../farnam.yml
jobid=23155
cwd=/home/dc2325/scratch60/exomes_UKBB
job_size=1
numThreads=22
exome_UKBB=/home/dc2325/project/UKBB_GWAS_dev/workflow/exome_UKBB.ipynb
exome_sbatch=../output/$(date +"%Y-%m-%d")_exome_download.sbatch

cmd_args="""default
    --cwd $cwd
    --jobid $jobid
    --job_size $job_size
    --numThreads $numThreads
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $exome_UKBB \
    --to-script $exome_sbatch \
    --args "$cmd_args"
```

## Basic summary statistics using PLINK v1.9

### Calculate MAF for chr1 using PLINK

```
module load PLINK/1.90-beta5.3

plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --freq gz --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
```

```
# R: read the gzipped PLINK frequency report
frq <- read.table(gzfile('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz'), header=TRUE)
head(frq, 10)
dim(frq)
```

### Calculate number of variants above/below threshold with R

```
options(scipen = 999)
# Exclude NA MAF values: they would otherwise be returned as NA rows and inflate the row counts
rare_var <- frq[frq$MAF <= 0.005 & !is.na(frq$MAF),]
common_var <- frq[frq$MAF > 0.005 & !is.na(frq$MAF),]
# Total number of variants 1783906
dim(rare_var)
dim(common_var)
```

### Calculate number of variants using awk

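```
# NR > 1 skips the header line; $5 != "NA" skips variants without a MAF estimate
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && $5 != "NA" && ($5 + 0) < 0.005' | wc -l
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && $5 != "NA" && ($5 + 0) > 0.005' | wc -l
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && $5 != "NA" && ($5 + 0) == 0.005' | wc -l
```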
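The same counts can be cross-checked in Python. The sketch below assumes the whitespace-delimited `.frq` layout with a header line containing a `MAF` column, and that missing frequencies are written as `NA`; the function names are hypothetical.

```python
import gzip
from collections import Counter


def count_by_maf(lines, threshold=0.005):
    """Count rare (MAF <= threshold), common (MAF > threshold) and missing-MAF variants."""
    counts = Counter(rare=0, common=0, missing=0)
    maf_col = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if maf_col is None:
            # The first non-empty line is the header; locate the MAF column by name
            maf_col = fields.index("MAF")
            continue
        value = fields[maf_col]
        if value.upper() in ("NA", "NAN"):
            counts["missing"] += 1
        elif float(value) <= threshold:
            counts["rare"] += 1
        else:
            counts["common"] += 1
    return counts


def count_file(path, threshold=0.005):
    """Apply count_by_maf to a gzipped frequency report."""
    with gzip.open(path, "rt") as handle:
        return count_by_maf(handle, threshold)


if __name__ == "__main__":
    example = [
        "CHR SNP A1 A2 MAF NCHROBS",
        "1 var1 A G 0.001 100",
        "1 var2 A G 0.2 100",
        "1 var3 A G NA 0",
    ]
    print(dict(count_by_maf(example)))
```

Keeping the missing-MAF variants in a separate bucket makes it easy to confirm that rare, common and missing counts add up to the total number of variants.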

## Evaluate the missingness per individual/SNP

```
plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --missing --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
```

```
imiss <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.imiss', header=TRUE)
lmiss <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.lmiss', header=TRUE)

# F_MISS: frequency of missing genotypes per sample
head(imiss)

min(imiss$F_MISS)
max(imiss$F_MISS)
miss_ten <- imiss[imiss[,'F_MISS']>0.1 ,] # No individuals missing more than 10% of genotypes
dim(miss_ten)

head(lmiss)

# F_MISS: proportion of samples missing this SNP
lmiss_ten <- lmiss[lmiss[,'F_MISS']>=0.1 ,] # 26989 SNPs are missing in >=10% of the samples
dim(lmiss_ten)
head(lmiss_ten)
```

## Check missingness between cases and controls

## Check HWE 

```
plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --hardy --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
```

```
hardy <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.hwe', header=TRUE)

head(hardy)

hwe <- hardy[hardy[,'P']<=1e-15 ,] # Variants that are not in HWE
dim(hwe)

head(hwe)
```

## Pipeline for exome annotation and downstream analysis

```
[global]
# The output directory for generated files
parameter: cwd = path
# Genotype files in PLINK binary format (used for computing the GRM)
parameter: bfile = path
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Human genome build
parameter: build = 'hg38'
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_option = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:1.4'
```

```
# Select the SNPs and samples to be used based on maf, geno, hwe and mind options
[qc: provides = [f'{cwd}/cache/{bfile:bn}.qc_pass.vcf.gz']]
parameter: maf_filter = 0.0
parameter: geno_filter = 0.1
parameter: hwe_filter = 5e-8
parameter: mind_filter = 0.1
input: bfile
output: f'{cwd}/cache/{bfile:bn}.qc_pass.vcf.gz'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    plink2 \
      --bfile ${bfile:n} ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --recode vcf bgz \
      --threads ${numThreads} \
      --out ${_output:nn}
```
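The conditional expressions in the `plink2` command mean that a filter set to 0 is left off the command line entirely, so the default `maf_filter = 0.0` applies no MAF filter. A minimal Python sketch of the same logic follows; the function name `build_plink2_qc_args` and the example prefixes are hypothetical and used only for illustration.

```python
def build_plink2_qc_args(bfile_prefix, out_prefix, maf_filter=0.0, geno_filter=0.1,
                         hwe_filter=5e-8, mind_filter=0.1, threads=2):
    """Assemble the plink2 argument list, adding each filter flag only if its value is > 0."""
    args = ["plink2", "--bfile", bfile_prefix]
    filters = (
        ("--maf", maf_filter),
        ("--geno", geno_filter),
        ("--hwe", hwe_filter),
        ("--mind", mind_filter),
    )
    for flag, value in filters:
        if value > 0:
            args += [flag, str(value)]
    args += ["--recode", "vcf", "bgz", "--threads", str(threads), "--out", out_prefix]
    return args


if __name__ == "__main__":
    # Hypothetical prefixes; with the defaults, --maf does not appear in the command
    print(" ".join(build_plink2_qc_args("ukb23155_c1_b0_v1", "cache/ukb23155_c1_b0_v1.qc_pass")))
```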

## Annovar details

For a list of available databases, see the [ANNOVAR download documentation](https://doc-openbio.readthedocs.io/projects/annovar/en/latest/user-guide/download/#-for-gene-based-annotation).

On Yale's Farnam HPC cluster there is a shared folder for ANNOVAR databases:
`/gpfs/ysm/datasets/db/annovar/humandb`

```
# Convert vcf to annovar input format
[annovar_1]
#parameter: vcf_prefix = path('joint_call_output')
input: group_by=1
output: f'{cwd}/{_input:bnn}.avinput'

bash: container=container_option,expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'

    convert2annovar.pl \
        -includeinfo \
        -allsample \
        -withfreq \
        -format vcf4 ${_input} > ${_output[0]}
```

```
# Annotate vcf file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{humandb}/mart_export_2019_LOFtools3.txt")
# Annovar protocol
parameter: protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron', 'clinvar_20200316', 'gene4denovo201907']
# Annovar operation
parameter: operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
# Annovar args
parameter: arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
#input: f'{vcf_prefix:a}.avinput'
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash: container=container_option, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (be careful using)
    #-out takes a prefix: table_annovar.pl appends .<build>_multianno.csv itself
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn} \
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)} \
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref}
```
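`table_annovar.pl` matches the comma-separated `-protocol`, `-operation` and `-arg` lists position by position, so the three parameters must have the same length. The Python sketch below checks this for the lists used in the `annovar_2` step; the helper name `pair_annovar_options` is hypothetical.

```python
protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way',
            'encRegTfbsClustered', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome',
            'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11',
            'regsnpintron', 'clinvar_20200316', 'gene4denovo201907']
operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r',
             'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"',
       '"-splicing 12 -exonicsplicing"', '"-splicing 12"',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '']


def pair_annovar_options(protocol, operation, arg):
    """Return (protocol, operation, arg) triples, or raise if the list lengths differ."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"length mismatch: {len(protocol)} protocols, "
            f"{len(operation)} operations, {len(arg)} args"
        )
    return list(zip(protocol, operation, arg))


if __name__ == "__main__":
    # Print each database next to the operation and extra arguments it will receive
    for name, op, extra in pair_annovar_options(protocol, operation, arg):
        print(f"{name:26s} {op:3s} {extra}")
```

Running a check like this before submitting the job catches a database added to `protocol` without a matching entry in `operation` and `arg`.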

## GARCOM: gene and region counting of mutations