# Definition of the benchmark CNVs
The CNVs used during performance evaluation were derived from the Tier 1 GIAB gold standard of structural variants (SVs)
available for HG002 on GRCh37<sup>1</sup>.
The `scripts` folder in this repository holds the code used to download the SV and exome sequencing data.

Our goal is to obtain, from the HG002 SV Tier 1 benchmark, a set of high-confidence CNVs
overlapping the intervals targeted during exome sequencing.
We therefore also need the BED file describing the exome targets.
In our case, for the Ion AmpliSeq™ Exome Panel, this file is `AmpliSeqExome.20141113.designed.bed`,
which can be obtained from the [Ion AmpliSeq Designer website](https://ampliseq.com/login/login.action).

## Write variants to a tabular format
After decompressing the file with `gunzip`, we extract the VCF fields needed to create a BED file describing
the regions spanned by insertions and deletions
using [GATK's VariantsToTable command](https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable):

In [8]:
%%capture
!gunzip --keep -c resources/GIAB-HG002-SV/HG002_SVs_Tier1_v0.6.vcf.gz > resources/GIAB-HG002-SV/HG002_SVs_Tier1_v0.6.vcf
!gatk VariantsToTable -V resources/GIAB-HG002-SV/HG002_SVs_Tier1_v0.6.vcf \
    -F CHROM -F POS -F TYPE -F SVTYPE -F REF -F ALT -F FILTER --show-filtered \
    -O results/HG002_SVs_Tier1_v0.6.vcf.table
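
VariantsToTable writes a tab-delimited file whose header row holds the requested field names. The sketch below parses a small table in that layout and tallies records by `SVTYPE` and `FILTER`. The two rows, the `ToyFilter` value, and the `toy_` names are hypothetical. Passing `dtype={'CHROM': str}` is an optional precaution, not something the notebook does: it keeps chromosome names as strings even when every value looks numeric.

```python
import io
import pandas as pd

# Hypothetical rows in the tab-delimited layout written by VariantsToTable.
toy_table_text = (
    "CHROM\tPOS\tTYPE\tSVTYPE\tREF\tALT\tFILTER\n"
    "1\t100\tINDEL\tDEL\tACGTA\tA\tPASS\n"
    "2\t250\tINDEL\tINS\tC\tCTTG\tToyFilter\n"
)

# Read CHROM as text so that a 'chr' prefix can be prepended later.
toy_table = pd.read_table(io.StringIO(toy_table_text), dtype={'CHROM': str})

# Count records per SV type and filter status.
print(toy_table.groupby(['SVTYPE', 'FILTER']).size())

# String concatenation works because CHROM was not parsed as an integer.
toy_chromosomes = ('chr' + toy_table['CHROM']).tolist()
print(toy_chromosomes)
```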

## Create a BED file for deletions

In [9]:
import pandas as pd
sv_table = pd.read_table('results/HG002_SVs_Tier1_v0.6.vcf.table')
# Derive 0-based position from 1-based POS field
sv_table['start'] = sv_table['POS'] - 1
# Compute variant size considering the max length between reference and alternative alleles
sv_table['len'] = sv_table.apply(lambda row: max(len(row['REF']), len(row['ALT'])), axis=1)
# Compute variant stop position
sv_table['stop'] = sv_table['start'] + sv_table['len']
# Add 'chr' to chromosome names
sv_table['chromosome'] = 'chr' + sv_table['CHROM']
# Select variants of high confidence (filter=PASS), longer than 1000 bp and of type deletion
selected = sv_table[(sv_table['FILTER'] == 'PASS') & (sv_table['len'] > 1000) & (sv_table['SVTYPE'] == 'DEL')]
# Save chromosome, start, stop, and SVTYPE to a tabular file
selected[['chromosome', 'start', 'stop', 'SVTYPE']].to_csv('results/HG002_SVs_Tier1_v0.6.vcf.del.bed',
                                                           sep='\t', index=False, header=False)
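
The coordinate arithmetic in the cell above can be checked on a single toy record. The sketch below assumes a hypothetical deletion at `POS` 100 whose `REF` allele is five bases long. The record and the helper name `vcf_to_bed_interval` are illustrative and not part of the benchmark.

```python
def vcf_to_bed_interval(pos, ref, alt):
    """Convert a 1-based VCF record to a 0-based, half-open BED interval.

    Mirrors the cell above: the interval length is the longer of REF and ALT.
    """
    start = pos - 1                       # BED starts are 0-based
    length = max(len(ref), len(alt))      # same rule as the 'len' column
    return start, start + length          # BED ends are exclusive


# Hypothetical deletion: REF is 5 bases long, ALT is the single remaining base.
toy_interval = vcf_to_bed_interval(100, "ACGTA", "A")
print(toy_interval)  # (99, 104)
```

Because the interval starts at `POS - 1`, it includes the leading base shared by `REF` and `ALT` whenever the record carries one, as is conventional for VCF deletions.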

## Retain exome targets that are overlapped by deletions

In [12]:
!bedtools intersect -a resources/AmpliseqExome/AmpliSeqExome.20141113.designed.bed \
    -b results/HG002_SVs_Tier1_v0.6.vcf.del.bed > results/AmpliSeqExome.20141113.intersected.giab.del.bed
!python CNV-detection-pipeline-1.0/workflow/scripts/create_intervals_with_genes_file.py \
    --input results/AmpliSeqExome.20141113.intersected.giab.del.bed \
    --output results/AmpliSeqExome.20141113.intersected.giab.del.genes.bed
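
With default options, `bedtools intersect` writes, for every pair of overlapping `-a` and `-b` features, only the portion of the `-a` feature that overlaps; the `-wa` option reports the original `-a` entry instead. The sketch below is a naive pure-Python analogue of that default behaviour on toy intervals, intended only to illustrate the semantics. It is not how bedtools is implemented, and the function name `intersect` and all coordinates are hypothetical.

```python
def intersect(targets, deletions):
    """Naive analogue of default `bedtools intersect -a targets -b deletions`.

    Intervals are (chrom, start, end) tuples in 0-based, half-open coordinates.
    """
    overlaps = []
    for chrom_a, start_a, end_a in targets:
        for chrom_b, start_b, end_b in deletions:
            if chrom_a != chrom_b:
                continue
            # Shared portion of the two intervals, if any.
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # at least one base in common
                overlaps.append((chrom_a, start, end))
    return overlaps


toy_targets = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
toy_deletions = [("chr1", 150, 550)]
toy_overlaps = intersect(toy_targets, toy_deletions)
print(toy_overlaps)  # [('chr1', 150, 200), ('chr1', 500, 550)]
```

In this toy case both `chr1` targets are only partially covered by the deletion, so the reported intervals are clipped to the overlap.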

## Group exome targets by gene

In [21]:
columns = ['chrom', 'start', 'end', 'gene']
target_table = pd.read_table('results/AmpliSeqExome.20141113.intersected.giab.del.genes.bed',
                             header=None, names=columns)
targets_by_genes = target_table.groupby('gene').apply(
    lambda group: pd.Series({'chrom': group['chrom'].iloc[0],
                             'start': group['start'].min(),
                             'end': group['end'].max(),
                             'gene': group['gene'].iloc[0]}))
targets_by_genes[columns].to_csv('results/AmpliSeqExome.20141113.intersected.giab.del.genes.grouped.bed',
                                 sep='\t', index=False, header=False)
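
The grouping step collapses all targets of a gene into one interval running from the smallest start to the largest end. The sketch below reproduces this on a toy table using pandas named aggregation, one possible alternative to `groupby(...).apply(...)`. The gene names `GENE_A` and `GENE_B` and all coordinates are hypothetical.

```python
import pandas as pd

# Hypothetical targets: two for GENE_A, one for GENE_B.
toy_targets = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr2'],
    'start': [100, 300, 50],
    'end':   [200, 400, 80],
    'gene':  ['GENE_A', 'GENE_A', 'GENE_B'],
})

# One row per gene: first chromosome seen, smallest start, largest end.
toy_grouped = (
    toy_targets
    .groupby('gene', as_index=False)
    .agg(chrom=('chrom', 'first'), start=('start', 'min'), end=('end', 'max'))
    [['chrom', 'start', 'end', 'gene']]
)
print(toy_grouped.to_csv(sep='\t', index=False, header=False))
```

Note that the merged interval for `GENE_A` spans 100 to 400 and therefore also covers the gap between its two targets.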

# References
1. Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, Sherry S, Koren S, Phillippy AM, Boutros PC, Sahraeian SME, Huang V, Rouette A, Alexander N, Mason CE, Hajirasouliha I, Ricketts C, Lee J, Tearle R, Fiddes IT, Barrio AM, Wala J, Carroll A, Ghaffari N, Rodriguez OL, Bashir A, Jackman S, Farrell JJ, Wenger AM, Alkan C, Soylev A, Schatz MC, Garg S, Church G, Marschall T, Chen K, Fan X, English AC, Rosenfeld JA, Zhou W, Mills RE, Sage JM, Davis JR, Kaiser MD, Oliver JS, Catalano AP, Chaisson MJP, Spies N, Sedlazeck FJ, Salit M. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol. 2020 Nov;38(11):1347-1355. doi: 10.1038/s41587-020-0538-8. Epub 2020 Jun 15. Erratum in: Nat Biotechnol. 2020 Jul 22. PMID: 32541955.

