# Quality control, annotation and rare variant coding of exomes UKBB Q4/2020

## Aim

Prepare the exome data for downstream association analyses of rare variants with `LMM.ipynb`.

## Data description

The data consist of 200,000 exomes: 50K exomes were made available in March 2019 and 150K exomes were made available in October 2020.

The 50K set includes the following family relationships:

* 194 parent-offspring pairs
* 613 full-sibling pairs
* 26 trios
* 1 monozygotic twin pair
* 195 genetically determined second-degree relationships

**Quality control published for the 50K set**

FASTQ files were aligned to GRCh38 with BWA-MEM to generate BAM files.

Duplicates in the BAM files were identified and marked with Picard.

Variants were called with WeCall, producing gVCF files.

Samples were excluded for any of the following:
* Differences between genetic and reported sex
* High rates of heterozygosity/contamination (Dstat>0.4)
* Low sequence coverage (<85% of bases with 20X coverage)
* Sample duplicates 
* WES variants discordant with genotyping chip

A project-level VCF (pVCF) was then created.

Goldilocks filters:
* SNV genotypes with DP<7 changed to no-call
* SNV heterozygotes retained if the allele balance ratio was AB>=0.15
* Multiallelic variants left-normalized and represented as bi-allelic
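
The first two rules can be illustrated with a minimal Python sketch. It is not the published implementation: the genotype string encoding, the function name, the definition of allele balance as alternate reads over total depth, and the choice to turn a failing heterozygote into a no-call are all assumptions made for illustration. The published filter may also apply the allele balance criterion at the site level rather than per genotype.

```python
MIN_DP = 7      # genotypes with depth below this become no-calls
MIN_AB = 0.15   # minimum allele balance for a heterozygote


def goldilocks_snv_call(genotype, depth, alt_reads):
    """Apply the DP and AB rules to one SNV genotype (hypothetical helper).

    genotype  -- "0/0", "0/1" or "1/1" (assumed encoding)
    depth     -- total read depth (DP) at the site
    alt_reads -- reads supporting the alternate allele
    Returns the genotype unchanged, or None for a no-call.
    """
    if depth < MIN_DP:
        return None
    if genotype == "0/1":
        # Assumption: allele balance = alternate reads / total depth
        allele_balance = alt_reads / depth
        if allele_balance < MIN_AB:
            return None
    return genotype
```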

## Data analysis
### Understanding missingness patterns in pVCF/PLINK files

This pipeline (`UKBB_GWAS_dev/workflow/plink_missing.ipynb`) characterizes the rates of missing data and filters variants for later use in the PCA analysis. Refer to that notebook for the output format and the code to run it.

### Filter and merge exome plink files (remove duplicated variants, keep SNPs only)

To run the PCA analysis and `LMM.ipynb` on the exome data, we need a single bed file that contains only SNPs and from which duplicated variants have been removed.

Refer to `UKBB_GWAS_dev/workflow/plink_snps_only.ipynb` for more details.

In Yale's cluster the filtered bed files are located here:

`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed`

In Columbia's cluster they are here:

`/mnt/mfs/statgen/UKBiobank/data/exome_data`

### PCA analysis

This pipeline (`UKBB_GWAS_dev/workflow/PCA.ipynb`) was used to generate the principal components.
Samples or variants were removed if:
* Sample missingness > 2%
* Variant missingness > 1%
* MAF < 1%
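
As an illustration of how these three thresholds act on a genotype matrix, here is a minimal pure-Python sketch. It is not the code in the PCA notebook: the function name, the matrix encoding (alternate allele counts 0/1/2, `None` for a missing call) and the order of the filters (samples first, then variants) are assumptions, and the actual tool may apply the filters in a different order.

```python
def pca_qc(genotypes, max_sample_missing=0.02, max_variant_missing=0.01, min_maf=0.01):
    """Return (kept_sample_indices, kept_variant_indices).

    genotypes -- one row per sample, one column per variant;
                 values are 0/1/2 alternate allele counts or None if missing.
    """
    n_variants = len(genotypes[0])

    # 1. Drop samples whose fraction of missing calls exceeds the threshold.
    kept_samples = [
        i for i, row in enumerate(genotypes)
        if sum(g is None for g in row) / n_variants <= max_sample_missing
    ]

    # 2. Among the remaining samples, drop variants with too many missing
    #    calls or with a minor allele frequency below the threshold.
    kept_variants = []
    for j in range(n_variants):
        column = [genotypes[i][j] for i in kept_samples]
        called = [g for g in column if g is not None]
        if not called:
            continue
        missing_rate = 1 - len(called) / len(column)
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1 - alt_freq)
        if missing_rate <= max_variant_missing and maf >= min_maf:
            kept_variants.append(j)
    return kept_samples, kept_variants
```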

For a more comprehensive description of the pipeline, refer to the specific notebook.

### Analysis of common variants

After removing the outliers detected with PCA and the Mahalanobis distance, we can run the GWAS on the exome data with `LMM.ipynb`.

Input file: `/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed`

* Retain variants with MAF>0.001 (the same threshold used for the imputed data)

### Analysis of rare variants: quality control of pVCF/PLINK files

Before annotating the variants and recoding them for downstream analysis, we filter the data to retain only the rare variants.

To run:

```

bfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`


sos run ~/project/UKBB_GWAS_dev/workflow/QC_Exome_UKBB.ipynb qc \
    --cwd ~/scratch60 \
    --bfiles $bfiles \
    --maf_filter 0.001 \
    --geno_filter 0.1 \
    --mind_filter 0.2 \
    --build 'hg38' \
    --container_lmm /gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
```

### Annotation
FIXME

**Step 2: annotate the variants**

### GARCOM 
FIXME

**Step 3: count the rare variants within a gene boundary and output for LMM analysis**

### Output file for LMM.ipynb for rare-variants (variant coding 1/0)

FIXME

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# bed files plink format
parameter: bedfiles = paths
parameter: bimfiles = paths
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Human genome build
parameter: build = 'hg38'
# Name for the merged bimfiles
parameter: bim_name = path
# Prefix for the common variant file name
parameter: common_var_prefix = str
# Prefix for the rare variant file name
parameter: rare_var_prefix = str
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_option = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:1.6'

In [None]:
# Get a list of common SNPs with MAF >= 0.01
[common_1]
# Filter based on minor allele frequency
parameter: maf_filter = 0.01
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bedfiles, group_by=1
output: f'{cwd}/cache/{_input:bn}.common_var.snplist'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    plink2 \
      --bfile ${_input:n}\
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --no-id-header\
      --threads ${numThreads} \
      --out ${_output:n} 

In [1]:
# Get a list of rare SNPs with MAF < 0.01
[rare_1]
# Filter based on minor allele frequency
parameter: max_maf_filter = 0.01
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bedfiles, group_by=1
output: f'{cwd}/cache/{_input:bn}.rare_var.snplist', f'{cwd}/cache/{_input:bn}.rare_var.afreq'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' 
    plink2 \
      --bfile ${_input:n} \
      ${('--max-maf %s' % max_maf_filter) if max_maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --no-id-header \
      --freq \
      --threads ${numThreads} \
      --out ${_output[0]:n} 
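
In the `common_1` and `rare_1` steps, a filter is passed to `plink2` only when its threshold is greater than zero. The following Python sketch shows that logic in isolation. The function name and signature are hypothetical and not part of the workflow; the flag names are the ones used in the steps above.

```python
def plink2_filter_flags(maf=0.0, max_maf=0.0, geno=0.0, hwe=0.0, mind=0.0):
    """Return the plink2 filter arguments for every threshold greater than zero."""
    thresholds = [
        ("--maf", maf),          # minimum minor allele frequency
        ("--max-maf", max_maf),  # maximum minor allele frequency
        ("--geno", geno),        # maximum per-variant missing call rate
        ("--hwe", hwe),          # Hardy-Weinberg equilibrium p-value threshold
        ("--mind", mind),        # maximum per-sample missing call rate
    ]
    flags = []
    for flag, value in thresholds:
        if value > 0:
            flags += [flag, str(value)]
    return flags


# Example: the arguments that would be added for a rare-variant run
print(" ".join(plink2_filter_flags(max_maf=0.01, geno=0.1)))
```

A threshold of `0.0` therefore means "do not apply this filter" rather than "apply it with a threshold of zero".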

In [None]:
# Merge the per-chromosome common_var.snplist files into one file, and the rare_var.snplist files into another
[common_2, rare_2]
# 'common' or 'rare', taken from the name of the step being run
var_type = step_name.rsplit("_", 1)[0]
var_prefix = common_var_prefix if var_type == 'common' else rare_var_prefix
input: group_by = 'all'
output: f'{cwd}/cache/{var_prefix}.{var_type}_var.snplist'
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      # Concatenate only the .snplist inputs (rare_1 also writes .afreq files)
      cat ${' '.join(str(x) for x in _input if str(x).endswith('.snplist'))} > ${_output}

### Format file for plink .bim

A text file with no header line, and one line per variant with the following six fields:
1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (1-based; limited to $2^{31}-2$)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

In the bim file, the second column (e.g. `1:930232:C:T`) is the variant identifier, with the alleles listed in ref/alt order.
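
A minimal Python sketch of reading one such line is shown below. The names `BimRecord`, `parse_bim_line` and `split_variant_id` are hypothetical, and the sketch assumes identifiers of the form `chrom:pos:ref:alt` as in the example above.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str       # 1. chromosome code
    variant_id: str  # 2. variant identifier
    cm: float        # 3. position in morgans or centimorgans
    pos: int         # 4. base-pair coordinate
    allele1: str     # 5. allele 1 (usually minor)
    allele2: str     # 6. allele 2 (usually major)


def parse_bim_line(line):
    """Parse one whitespace-delimited .bim line into a BimRecord."""
    chrom, variant_id, cm, pos, allele1, allele2 = line.split()
    return BimRecord(chrom, variant_id, float(cm), int(pos), allele1, allele2)


def split_variant_id(variant_id):
    """Split an identifier of the assumed form chrom:pos:ref:alt."""
    chrom, pos, ref, alt = variant_id.split(":")
    return chrom, int(pos), ref, alt
```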

In [None]:
# Merge all the bimfiles into a single file to use with awk
[bim]
input: bimfiles
output: bim_name
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}

In [None]:
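# Look up the selected (common or rare) variants in the merged bim file and convert them to ANNOVAR input
[common_3, rare_3]
# 'common' or 'rare', taken from the name of the step being run
var_type = step_name.rsplit("_", 1)[0]
var_prefix = common_var_prefix if var_type == 'common' else rare_var_prefix
input: bim_name, f'{cwd}/cache/{var_prefix}.{var_type}_var.snplist'
output: f'{cwd}/{var_prefix}.{var_type}_var.avinput'
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    # Keep the bim rows whose variant ID (column 2) appears in the snplist
    awk -F" " 'FNR==NR {lines[$1]; next} $2 in lines ' ${_input[1]} ${_input[0]} > ${_output:n}.tmp
    # Print chromosome, start, end, allele 2, allele 1; for IDs containing "D" the end is shifted by the allele length difference
    awk '{if ($2 ~ /D/) {print $1, $4, $4 + (length ($6) - length ($5)), $6, $5 } else {print $1, $4, $4, $6, $5 }}'  ${_output:n}.tmp >  ${_output}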

## Annovar details

For a list of available databases, see the [ANNOVAR download documentation](https://doc-openbio.readthedocs.io/projects/annovar/en/latest/user-guide/download/#-for-gene-based-annotation).

On Yale's Farnam HPC cluster, the shared databases are stored in:
```/gpfs/ysm/datasets/db/annovar/humandb```

### Format file for annovar input

On each line, the first five space- or tab-delimited columns represent:

1. chromosome 
2. start position 
3. end position 
4. the reference nucleotides
5. the observed nucleotides
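
The conversion from a `.bim` row to these five columns can be sketched in Python as follows. This mirrors the `awk` command of the `rare_3` step and is not a general-purpose converter: the function name is hypothetical, and the sketch carries over that command's assumptions that allele 2 is the reference allele and that a `D` in the variant identifier marks a deletion.

```python
def bim_to_avinput(bim_line):
    """Convert one .bim line to a space-delimited ANNOVAR input line."""
    chrom, variant_id, _cm, pos, allele1, allele2 = bim_line.split()
    start = int(pos)
    if "D" in variant_id:
        # Assumed deletion: extend the end by the allele length difference
        end = start + (len(allele2) - len(allele1))
    else:
        end = start
    # Columns: chromosome, start, end, reference (allele 2), observed (allele 1)
    return f"{chrom} {start} {end} {allele2} {allele1}"


# Example with the SNV identifier quoted earlier in this notebook
print(bim_to_avinput("1 1:930232:C:T 0 930232 T C"))
```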

In [2]:
# convert vcf to annovar input format
[annovar_1]
#parameter: vcf_prefix = path('joint_call_output')
input: group_by=1
output: f'{cwd}/{_input:bnn}.avinput'

bash: container=container_option,expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'

    convert2annovar.pl \
        -includeinfo \
        -allsample \
        -withfreq \
        -format vcf4 ${_input} > ${_output[0]} 

In [None]:
# Annotate vcf file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{humandb}/mart_export_2019_LOFtools3.txt")
# Annovar protocol
parameter: protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron', 'clinvar_20200316', 'gene4denovo201907']
# Annovar operation
parameter: operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
# Annovar args
parameter: arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
#input: f'{vcf_prefix:a}.avinput'
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash: container=container_option, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    # Do not add -intronhgvs: it writes cDNA variants in HGVS notation but creates issues (only +2 splice sites are reported)
    # -nastring can only be "." for VCF files
    # regsnpintron might cause shifted lines (use with care)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn} \
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)} \
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref}
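
The `protocol`, `operation` and `arg` lists are joined with commas and matched by position, so they must all have the same length. A minimal Python sketch of that check is shown below. The function name is hypothetical and the check is not part of the workflow itself.

```python
def annovar_table_args(protocol, operation, arg):
    """Build the -protocol/-operation/-arg arguments after checking alignment."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"protocol ({len(protocol)}), operation ({len(operation)}) and "
            f"arg ({len(arg)}) must have the same number of entries"
        )
    return [
        "-protocol", ",".join(protocol),
        "-operation", ",".join(operation),
        # Empty strings keep the position of protocols without extra arguments
        "-arg", ",".join(arg),
    ]


# Example with two of the protocols listed above
print(annovar_table_args(["refGene", "gnomad211_exome"], ["g", "f"], ['"-splicing 12"', ""]))
```

When a protocol is added to or removed from the list, the matching entries in `operation` and `arg` have to be updated at the same position.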

## GARCOM: gene and region counting of mutations