# GRCh38 difficult, medically relevant gene overlaps and processing of HPRC assembly dipcall results

This notebook details two analyses:

1) Identifying the difficult regions excluded from the GIAB HG002 GRCh38 v4.1 benchmark, and the overlap of those regions with medically relevant genes from Mandelker and the COSMIC gene census
2) Processing HPRC assembly alignments to GRCh38 and the subsequent dipcall results, to identify the medically relevant genes that the diploid assemblies cover but the v4.1 benchmark does not

## Overlap of genes with v4.1 and excluded regions 

The excluded regions are available at: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/
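
The `find_overlap_per_gene.py` script used throughout this section is not reproduced in this notebook. As a rough illustration only, the sketch below shows one way a per-gene overlap could be computed, assuming the script reports, for each gene interval, the number and fraction of bases covered by the regions BED. The function names (`parse_bed`, `merge_intervals`, `overlap_per_gene`) and the output columns are hypothetical and may differ from the actual script.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: [(start, end, name), ...]}; name is None if absent."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[3] if len(fields) > 3 else None
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2]), name))
    return by_chrom


def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) pairs so each base is counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_per_gene(gene_lines, region_lines):
    """Return rows of (chrom, start, end, name, bases_overlapped, fraction_overlapped)."""
    genes = parse_bed(gene_lines)
    regions = parse_bed(region_lines)
    rows = []
    for chrom, gene_list in genes.items():
        merged = merge_intervals([(s, e) for s, e, _ in regions.get(chrom, [])])
        for g_start, g_end, name in gene_list:
            # Sum the clipped overlap of the gene with every merged region.
            covered = sum(
                max(0, min(g_end, r_end) - max(g_start, r_start))
                for r_start, r_end in merged
            )
            length = g_end - g_start
            fraction = covered / length if length else 0.0
            rows.append((chrom, g_start, g_end, name, covered, fraction))
    return rows
```

Merging the regions first avoids double-counting bases where the regions BED contains overlapping entries. To use the sketch on real files, pass open file handles for the genes BED and the regions BED as the two arguments.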

In [None]:
python python_scripts/find_overlap_per_gene.py --input_benchmark HG002_GRCh38_1_22_v4.1_draft_benchmark.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_benchmark_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark GRCh38_AllTandemRepeats_gt10000bp_slop5.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_AllTandemRepeats_gt10000_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_example_refn_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark expanded_150_GRCh38_remapped_HG002_SVs_Tier1plusTier2_v0.6.1.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_with_SV_0.6_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark GRCh38_mrcanavar_intersect_ccs_1000_window_size_cnv_threshold_intersect_ont_1000_window_size_cnv_threshold.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_CNV_intersection.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark GRCh38_union_HG002_CCS_15kb_20kb_merged_ONT_1000_window_size_combined_elliptical_outlier_threshold.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_CNV_union.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark HG2_SKor_TrioONTCanu_intersect_HG2_SKor_TrioONTFlye_intersect_HG2_SKor_CCS15_gt10kb_GRCh38.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_assembly_CNV.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark hg38.vdj.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_vdj_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark hg38.segdups_sorted_merged.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_segdups_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark SVMergeInversions.GRCh38.120519.clustered_slop150_chr1_22.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_inversions_overlap.bed

python python_scripts/find_overlap_per_gene.py --input_benchmark hg38.segdups_chr1-22_gte_10kb_identity_gte_990_segdups_counts_gt_5.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_segdup_counts_gt5_percentidentity_gte_990_overlap.bed


python python_scripts/convert_mosdepth_thresholds_to_percentage.py --input HG002_GRCh38.CCS_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.thresholds.bed --output HG002_GRCh38.CCS_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.percentages.tsv
python python_scripts/convert_mosdepth_thresholds_to_percentage.py --input HG002_GRCh38.ont_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.thresholds.bed --output HG002_GRCh38.ont_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.percentages.tsv
python python_scripts/convert_mosdepth_thresholds_to_percentage.py --input HG002_GRCh38.10X_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.thresholds.bed --output HG002_GRCh38.10X_Mandelker_COSMIC_ENSEMBLE_IDs_and_geneName_primary_assembly_only_slop20000_excluding_MT_w_thresholds.percentages.tsv
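
The `convert_mosdepth_thresholds_to_percentage.py` script is likewise not shown here. The sketch below is one plausible version of the conversion. It assumes the input follows the mosdepth `--thresholds` layout: a `#chrom  start  end  region` header followed by one column per depth threshold, each holding the number of bases in the region covered at that depth or higher. The function name `thresholds_to_percentages`, the two-decimal rounding, and the `NA` placeholder are illustrative choices, not taken from the actual script.

```python
def thresholds_to_percentages(lines):
    """Convert per-region base counts at each depth threshold into percentages."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            # Keep the header (chrom, start, end, region, 1X, 10X, ...) unchanged.
            out.append("\t".join(fields))
            continue
        chrom, start, end, region = fields[:4]
        length = int(end) - int(start)
        # Each count is the number of bases at or above the threshold depth.
        pct = [
            f"{100.0 * int(count) / length:.2f}" if length else "NA"
            for count in fields[4:]
        ]
        out.append("\t".join([chrom, start, end, region] + pct))
    return out
```

Dividing by the region length (`end - start`) expresses each threshold as the percentage of the padded gene interval that reaches that depth, which makes genes of different sizes comparable.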

## Removing partially covered segmental duplications from HPRC asm9 and r253 assemblies 

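A new dip.bed was generated with updated dipcall parameters, as run by Jennifer McDaniel with the assembly benchmarking pipeline. Partially covered segmental duplications were then subtracted from it: any segmental duplication that overlaps a region outside dip.bed is expanded by 20 kb on each side and removed from dip.bed. The segmental duplications file (decompressed to GRCh38_segdups.bed for the commands below) is available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/SegmentalDuplications/GRCh38_segdups.bed.gz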

In [None]:
complementBed -i asm9.dip.bed -g human.b38.genome | intersectBed -wa -a GRCh38_segdups.bed -b stdin | slopBed -i stdin -g human.b38.genome -b 20000 | subtractBed -a asm9ab.dip.bed -b stdin > asm9_dip_bed_remove_partial_segdups_slop_20kb.dip.bed

complementBed -i HG002r253ab-align2-GRCh38_dip.bed -g human.b38.genome | intersectBed -wa -a GRCh38_segdups.bed -b stdin | slopBed -i stdin -g human.b38.genome -b 20000 | subtractBed -a HG002r253ab-align2-GRCh38_dip.bed -b stdin > HG002r253ab-align2-GRCh38_dip_remove_partial_segdups_slop_20kb.dip.bed
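
To make the logic of the bedtools pipeline above easier to follow, the sketch below reproduces its four steps (complement, intersect with `-wa`, slop, subtract) on toy intervals for a single chromosome. It is an illustration rather than a replacement for bedtools: the function names and the toy coordinates are hypothetical, and it handles only one chromosome with half-open `(start, end)` intervals.

```python
def complement(intervals, chrom_len):
    """Return the parts of [0, chrom_len) not covered by any input interval."""
    gaps, cursor = [], 0
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_len:
        gaps.append((cursor, chrom_len))
    return gaps


def overlapping(a_intervals, b_intervals):
    """Keep whole A entries that overlap any B entry by at least 1 bp."""
    return [
        (a_s, a_e)
        for a_s, a_e in a_intervals
        if any(a_s < b_e and b_s < a_e for b_s, b_e in b_intervals)
    ]


def slop(intervals, pad, chrom_len):
    """Extend each interval by pad on both sides, clipped to the chromosome."""
    return [(max(0, s - pad), min(chrom_len, e + pad)) for s, e in intervals]


def subtract(a_intervals, b_intervals):
    """Remove from each A interval any bases covered by B."""
    result = []
    for a_s, a_e in a_intervals:
        pieces = [(a_s, a_e)]
        for b_s, b_e in b_intervals:
            next_pieces = []
            for p_s, p_e in pieces:
                if b_e <= p_s or b_s >= p_e:
                    next_pieces.append((p_s, p_e))  # no overlap, keep as is
                    continue
                if b_s > p_s:
                    next_pieces.append((p_s, b_s))  # left remainder
                if b_e < p_e:
                    next_pieces.append((b_e, p_e))  # right remainder
            pieces = next_pieces
        result.extend(pieces)
    return result


def remove_partial_segdups(dip, segdups, chrom_len, pad=20000):
    """Drop from dip any segdup touching an uncovered region, padded by pad bp."""
    uncovered = complement(dip, chrom_len)
    partial = overlapping(segdups, uncovered)
    return subtract(dip, slop(partial, pad, chrom_len))
```

For example, with a 1 Mb toy chromosome, `dip = [(0, 400000), (500000, 1000000)]` and `segdups = [(100000, 110000), (390000, 410000)]`, only the second segmental duplication extends into the uncovered gap, so the call returns `[(0, 370000), (500000, 1000000)]`. The fully covered duplication at 100000 to 110000 is left in place.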

## Overlap of genes with asm9 and newer assembly r253

In [None]:
python python_scripts/find_overlap_per_gene.py --input_benchmark asm9_dip_bed_remove_partial_segdups_slop_20kb.dip.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_asm9_dip_bed_remove_partial_segdups_slop_20kb_overlap.bed


python python_scripts/find_overlap_per_gene.py --input_benchmark HG002r253ab-align2-GRCh38.dip.bed --input_genes GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000.bed --output GRCh38_Mandelker_COSMIC_ENSEMBLE_coordinates_primary_assembly_slop20000_with_HG002r253ab-align2-GRCh38_dip.bed