# Quality control, annotation and rare variant coding of exomes UKBB Q4/2020

## Aim

Prepare the exome data for downstream rare-variant association analyses with `LMM.ipynb`.

## Data description

The data consist of 200,000 exomes: 50K exomes were made available in March 2019 and 150K exomes were made available in October 2020.

The 50K set includes the following family relationships:

* 194 parent-offspring pairs
* 613 full-sibling pairs
* 26 trios
* 1 monozygotic twin pair
* 195 second degree genetically determined relationships

**Quality control published for the 50K set**

FASTQ files were aligned to GRCh38 with BWA-MEM to generate BAM files.

Duplicates in the BAM files were identified and marked using Picard.

gVCF files with called variants were produced using WeCall.

Samples were excluded for any of the following reasons:
* Differences between genetic and reported sex
* High rates of heterozygosity/contamination (Dstat>0.4)
* Low sequence coverage (<85% of bases with 20X coverage)
* Sample duplicates 
* WES variants discordant with genotyping chip
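
The exclusion criteria above can be sketched as a simple predicate. This is only an illustration: the `SampleQC` record and all of its field names are hypothetical and are not taken from the UKBB release; only the two numeric thresholds (Dstat > 0.4 and < 85% of bases at 20X) come from the list above.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    # Hypothetical per-sample QC record; field names are assumptions.
    genetic_sex: str
    reported_sex: str
    dstat: float            # heterozygosity/contamination statistic
    frac_bases_20x: float   # fraction of bases covered at >= 20X
    is_duplicate: bool
    chip_discordant: bool


def exclusion_reasons(s):
    """Return the list of criteria a sample fails (empty list = keep)."""
    reasons = []
    if s.genetic_sex != s.reported_sex:
        reasons.append("sex_mismatch")
    if s.dstat > 0.4:
        reasons.append("high_dstat")
    if s.frac_bases_20x < 0.85:
        reasons.append("low_coverage")
    if s.is_duplicate:
        reasons.append("duplicate")
    if s.chip_discordant:
        reasons.append("chip_discordance")
    return reasons
```

A sample is retained only when `exclusion_reasons` returns an empty list.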

A project-level VCF (pVCF) was then created.

Goldilocks filters:
* SNV genotypes with DP<7 were changed to no-call
* SNV heterozygotes were retained if the allele balance ratio was AB>=0.15
* Multiallelic sites were left-normalized and represented as bi-allelic
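
As an illustration of the first two genotype-level rules, here is a minimal sketch. It assumes that allele balance is the fraction of reads supporting the alternate allele and that a heterozygote failing the AB threshold becomes a no-call; neither detail is stated above, and the function name and arguments are hypothetical.

```python
def goldilocks_snv_call(gt, dp, alt_depth, min_dp=7, min_ab=0.15):
    """Apply the depth and allele-balance rules to one SNV genotype.

    gt        : alternate-allele count (0, 1 or 2), or None for no-call
    dp        : total read depth at the site
    alt_depth : reads supporting the alternate allele
    Returns the genotype, or None when it is set to no-call.
    """
    if gt is None or dp < min_dp:
        return None
    if gt == 1:
        # Assumption: AB = alternate reads / total reads.
        allele_balance = alt_depth / dp
        if allele_balance < min_ab:
            # Assumption: failing heterozygotes become no-calls.
            return None
    return gt
```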

## Data analysis
### Understanding missingness patterns in pVCF/PLINK files

This pipeline (`UKBB_GWAS_dev/workflow/plink_missing.ipynb`) is intended to characterize the rates of missing data and to filter variants for later use in PCA analysis. Please refer to that notebook for the output format and the code to run it.

### PCA analysis

This pipeline (`UKBB_GWAS_dev/workflow/PCA.ipynb`) was used to generate the principal components.
Samples or variants were removed if:
* Sample missingness > 2%
* Variant missingness > 1%
* MAF < 1%
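
The three thresholds can be illustrated on a small in-memory genotype matrix. This is a sketch only: the actual filtering is done by the PCA notebook, and the order used here (samples first, then variants on the retained samples) is an assumption. Genotypes are alternate-allele counts (0/1/2) with `None` for missing calls.

```python
def sample_missingness(genotypes):
    """Fraction of missing calls per sample (one row per sample)."""
    return [sum(g is None for g in row) / len(row) for row in genotypes]


def variant_stats(genotypes):
    """Per-variant (missingness, MAF) computed over the given samples."""
    stats = []
    for j in range(len(genotypes[0])):
        calls = [row[j] for row in genotypes if row[j] is not None]
        missing = 1 - len(calls) / len(genotypes)
        if calls:
            alt_freq = sum(calls) / (2 * len(calls))
            maf = min(alt_freq, 1 - alt_freq)
        else:
            maf = 0.0
        stats.append((missing, maf))
    return stats


def pca_qc(genotypes, max_sample_miss=0.02, max_variant_miss=0.01, min_maf=0.01):
    """Return indices of samples and variants that pass the filters."""
    keep_samples = [i for i, m in enumerate(sample_missingness(genotypes))
                    if m <= max_sample_miss]
    kept = [genotypes[i] for i in keep_samples]
    if not kept:
        return keep_samples, []
    keep_variants = [j for j, (miss, maf) in enumerate(variant_stats(kept))
                     if miss <= max_variant_miss and maf >= min_maf]
    return keep_samples, keep_variants
```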

For a more comprehensive description of the pipeline, refer to the specific notebook.

### Analysis of common variants

After removing the outliers detected using the PCA and Mahalanobis distance, we can proceed to do GWAS using LMM.ipynb on the exome data:

* Retain variants with MAF>0.001 (same parameters used for imputed data)

**Step 0: filter out individuals from exome files and merge them**
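
For reference, Mahalanobis-distance outlier detection on principal components can be sketched as follows. The example is restricted to the first two PCs so the covariance matrix can be inverted in closed form, and the cutoff passed to `flag_outliers` is a hypothetical parameter; the number of PCs and the threshold actually used are defined in the PCA notebook, not here.

```python
import math


def mahalanobis_2d(points):
    """Mahalanobis distance of each (PC1, PC2) point from the sample mean."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = v^T S^-1 v, using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances


def flag_outliers(distances, cutoff):
    """Indices of samples whose distance exceeds a user-chosen cutoff."""
    return [i for i, d in enumerate(distances) if d > cutoff]
```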


### Analysis of rare variants: quality control of pVCF/PLINK files

Before annotating the variants and recoding them for downstream analysis, we filter the data to retain only the rare variants.

**Step 1: filter the data and output correct format**

To run:

```
famFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`

sos run ~/project/UKBB_GWAS_dev/workflow/QC_Exome_UKBB.ipynb qc \
    --cwd ~/scratch60 \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --maf_filter 0.001 \
    --geno_filter 0.1 \
    --mind_filter 0.2 \
    --build 'hg38' \
    --container_lmm /gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
```
 
### Annotation

**Step 2: annotate the variants**


### GARCOM 

**Step 3: count the rare variants within a gene boundary and output for LMM analysis**
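
The idea behind this step can be sketched in plain Python. This is not GARCOM's interface; the data structures and the function name are hypothetical, and treating missing genotypes as zero is an assumption made for the illustration.

```python
def count_alleles_per_gene(genotypes, variant_pos, gene_bounds):
    """Count alternate alleles per sample within each gene boundary.

    genotypes   : {sample: {variant_id: alt-allele count or None}}
    variant_pos : {variant_id: (chrom, position)}
    gene_bounds : {gene: (chrom, start, end)}, boundaries inclusive
    Returns {gene: {sample: count}}.
    """
    counts = {}
    for gene, (chrom, start, end) in gene_bounds.items():
        in_gene = [v for v, (c, pos) in variant_pos.items()
                   if c == chrom and start <= pos <= end]
        counts[gene] = {
            # Assumption: a missing genotype (None) contributes zero.
            sample: sum(calls.get(v) or 0 for v in in_gene)
            for sample, calls in genotypes.items()
        }
    return counts
```

The resulting gene-by-sample count table is the kind of input a gene-level association model would then consume.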

### Variant coding 1/0

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# bed files plink format (one per chromosome)
parameter: bedfiles = paths
# bim files plink format
parameter: bimfiles = paths
# fam file plink format
parameter: famFile = path 
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Human genome build
parameter: build = 'hg38'
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_option = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:1.4'

In [None]:
# Select the SNPs and samples to be used based on maf>=0.001
[common_variants]
# Filter based on minor allele frequency
parameter: maf_filter = 0.001
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg Equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bedfiles, paired_with=['bimfiles','samples'], group_by=1
output: f'{cwd}/cache/{_input:bn}.common_var.bed'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout' 
    # --make-bed writes ${_output:n}.bed/.bim/.fam, matching the declared output
    plink2 \
      --bed ${_input}  --bim ${_input._bimfiles} --fam ${famFile} \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --keep ${_input._samples} \
      --make-bed \
      --threads ${numThreads} \
      --out ${_output:n} 

In [None]:
# Merge all the .bed files into one bed file for input to eigensoft
[filter_3]
input: group_by = 'all'
output: bfile_merge = f'{cwd}/{famFile:bn}.filtered.merged.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template = '{cmd}' if executable('plink').target_exists() else plink_module
    echo -e ${' '.join([str(x)[:-4] for x in _input[1:]])} | sed 's/ /\n/g' > ${_output:n}.merge_list
    plink \
    --bfile ${_input[0]:n} \
    --merge-list ${_output:n}.merge_list \
    --make-bed \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 48000
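
The merge list built in the step above uses the first fileset as `--bfile` and writes the remaining prefixes, one per line, to the `--merge-list` file. A minimal Python sketch of the same logic follows; the function name is hypothetical and the paths in the usage note are placeholders.

```python
from pathlib import PurePosixPath


def plink_merge_inputs(bed_paths):
    """Split .bed paths into the --bfile prefix and the merge-list text."""
    # Drop the trailing ".bed" so PLINK receives fileset prefixes.
    prefixes = [str(PurePosixPath(p).with_suffix("")) for p in bed_paths]
    base, rest = prefixes[0], prefixes[1:]
    return base, "\n".join(rest)
```

For example, `plink_merge_inputs(["/d/c1.bed", "/d/c2.bed"])` yields the prefix `/d/c1` for `--bfile` and a merge list containing `/d/c2`.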

In [None]:
# Select the SNPs and samples to be used based on maf, geno, hwe and mind options
[rare_variants: provides = [f'{cwd}/cache/{bfile:bn}.qc_pass.vcf.gz']]
# Filter based on minor allele frequency
parameter: maf_filter = 0.0
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg Equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bedfiles, paired_with=['bimfiles'], group_by=1
output: f'{cwd}/cache/{bfile:bn}.qc_pass.vcf.gz'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout' 
    plink2 \
      --bed ${_input}  --bim ${_input._bimfiles} --fam ${famFile} \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --recode vcf bgz \
      --threads ${numThreads} \
      --out ${_output:nn} 
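
In both filtering steps, each PLINK filter flag is emitted only when its parameter is greater than zero, so a value of `0.0` disables that filter. The same logic can be sketched in Python as follows; the function is illustrative and is not part of the workflow.

```python
def plink_filter_flags(maf_filter=0.0, geno_filter=0.0, hwe_filter=0.0, mind_filter=0.0):
    """Build the optional PLINK filter arguments as a list of tokens."""
    pairs = [
        ("--maf", maf_filter),
        ("--geno", geno_filter),
        ("--hwe", hwe_filter),
        ("--mind", mind_filter),
    ]
    # A flag is included only when its threshold is strictly positive.
    return [token for flag, value in pairs if value > 0
            for token in (flag, str(value))]
```

With the values from the `sos run` example (`maf_filter=0.001`, `geno_filter=0.1`, `mind_filter=0.2`), the `--hwe` flag is omitted because `hwe_filter` keeps its default of `0.0`.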

## Annovar details

See the ANNOVAR documentation for a list of available [databases](https://doc-openbio.readthedocs.io/projects/annovar/en/latest/user-guide/download/#-for-gene-based-annotation).

On Yale's Farnam HPC cluster there is a folder for shared databases: `/gpfs/ysm/datasets/db/annovar/humandb`

In [2]:
# convert vcf to annovar input format
[annovar_1]
#parameter: vcf_prefix = path('joint_call_output')
input: group_by=1
output: f'{cwd}/{_input:bnn}.avinput'

bash: container=container_option,expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'

    convert2annovar.pl \
        -includeinfo \
        -allsample \
        -withfreq \
        -format vcf4 ${_input} > ${_output[0]} 

In [None]:
# Annotate vcf file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{humandb}/mart_export_2019_LOFtools3.txt")
# Annovar protocol
parameter: protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron', 'clinvar_20200316', 'gene4denovo201907']
# Annovar operation
parameter: operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
# Annovar args
parameter: arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
#input: f'{vcf_prefix:a}.avinput'
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash: container=container_option, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (use with care)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn} \
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)} \
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref}
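
Because `protocol`, `operation` and `arg` are joined with commas and passed positionally to `table_annovar.pl`, the three lists must have the same length and stay aligned. A small check along these lines can catch a misaligned entry before a long job is submitted. This helper is a hypothetical addition and is not part of the workflow above.

```python
VALID_OPERATIONS = {"g", "gx", "r", "f"}


def check_annovar_lists(protocol, operation, arg):
    """Validate the three ANNOVAR lists and return them zipped per database."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"length mismatch: protocol={len(protocol)}, "
            f"operation={len(operation)}, arg={len(arg)}"
        )
    unknown = [op for op in operation if op not in VALID_OPERATIONS]
    if unknown:
        raise ValueError(f"unknown operation(s): {unknown}")
    return list(zip(protocol, operation, arg))
```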

## GARCOM: gene and region counting of mutations