# CMRG Coordinates and Gene List GRCh37

### v4.2.1 and ENSEMBL gene overlap



- `GRCh37_ENSEMBL_MRG_coordinates.bed` is the output of `scripts/GRCh37_lookup_MRG_symbol_coordinates_ENSEMBL.R`

- Correct for duplicate gene entries using `https://gitlab.nist.gov/gitlab/nolson/mrg-bench-manuscript/-/blob/master/data/gene_coords/ensembl_coords/selected_coordinates_for_duplicated_and_incorrect_gene_symbol_entries.tsv` to produce `data/manually_created_files/GRCh37_mrg_full_gene.bed`.

mkdir -p workflow/smallvar_benchmark/GRCh37
bedtools coverage \
    -a data/manually_created_files/GRCh37_mrg_full_gene.bed \
    -b data/v4.2.1_benchmark_regions/HG002_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed \
    | cut -f1,2,3,4,7,8  \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed
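
As a rough illustration of what the coverage step measures, the sketch below computes, for one gene, the number of bases overlapped by at least one benchmark interval and that number as a fraction of gene length. It is a toy re-implementation under stated assumptions (one chromosome, half-open BED coordinates); `covered_bases` and the example intervals are hypothetical and not part of this repository.

```python
def covered_bases(gene, regions):
    """Return (bp_covered, frac_covered) for one gene interval.

    gene:    (start, end), half-open, BED-style.
    regions: iterable of (start, end) on the same chromosome;
             regions may overlap each other.
    """
    g_start, g_end = gene
    # Clip every region to the gene and drop regions that miss it entirely.
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in regions
        if s < g_end and e > g_start
    )
    covered = 0
    cur_end = g_start
    for s, e in clipped:
        # Count only the part of this region that has not been counted yet.
        if e > cur_end:
            covered += e - max(s, cur_end)
            cur_end = e
    length = g_end - g_start
    frac = (covered / length) if length else 0.0
    return covered, frac


# Hypothetical gene at 100-200 with four toy benchmark intervals.
bp, frac = covered_bases((100, 200), [(50, 120), (110, 150), (180, 250), (300, 400)])
```

Overlapping benchmark intervals are counted once, which is why the function tracks the furthest end seen so far instead of summing interval lengths.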

## Create flanking sequence bed for MRG candidates

20 kb (20,000 bp) on either side of each gene, plus overlapping segmental duplications.

bedtools slop \
    -i data/manually_created_files/GRCh37_mrg_full_gene.bed \
    -g resources/human.b37.genome \
    -b 20000 \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp.bed
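
A minimal sketch of the arithmetic behind the flanking step: each interval grows by the flank on both sides and is clamped to the chromosome, whose length comes from the genome file passed with `-g`. The function name `slop` and the chromosome size used in the example are hypothetical.

```python
def slop(start, end, chrom_size, flank=20_000):
    """Extend a half-open interval by `flank` bp per side, clamped to [0, chrom_size]."""
    return max(0, start - flank), min(chrom_size, end + flank)


# Toy chromosome of 1,000,000 bp (hypothetical size).
near_start = slop(5_000, 30_000, 1_000_000)     # left flank is cut off at 0
near_end = slop(990_000, 995_000, 1_000_000)    # right flank is cut off at the chromosome end
```

Genes close to a chromosome end therefore get less than 20 kb of flank on that side.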


`HG002v11-align2-GRCh37.dip_check_for_breaks.bed` is generated from `HG002v11-align2-GRCh37.dip.bed` as chrom, start, start+1, i.e. a 1 bp interval at the start of each `dip.bed` region.

awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$2+1}' \
    data/hifiasm_dipcall_output/HG002v11-align2-GRCh37.dip.bed \
    > workflow/smallvar_benchmark/GRCh37/HG002v11-align2-GRCh37.dip_check_for_breaks.bed

python scripts/find_overlap_per_gene.py \
    --input_benchmark data/manually_created_files/HG002v11-align2-GRCh37.dip_check_for_breaks.bed \
    --input_genes workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp.bed \
    --output  workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed
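
The internals of `scripts/find_overlap_per_gene.py` are not shown here, so the following is only one plausible reading of the break check, not the script's actual code: count how many 1 bp start markers fall inside each slopped gene window. The function name and toy coordinates are hypothetical.

```python
from bisect import bisect_left


def count_starts_in_window(window, region_starts):
    """Count 1 bp start markers [s, s+1) that fall inside a half-open window.

    window:        (start, end) of the slopped gene, BED-style.
    region_starts: start coordinates of dip.bed regions on the same chromosome.
    """
    w_start, w_end = window
    starts = sorted(region_starts)
    # A marker [s, s+1) overlaps [w_start, w_end) exactly when w_start <= s < w_end.
    return bisect_left(starts, w_end) - bisect_left(starts, w_start)


# Toy window 1000-5000 with five hypothetical region starts.
n_breaks = count_starts_in_window((1000, 5000), [0, 1000, 2500, 5000, 7000])
```

Under this reading, a count above zero means that at least one `dip.bed` region begins inside the gene window.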

## Find coordinates of ENSEMBL gene annotations with flanking sequence and overlapping segdups

python scripts/expand_gene_coordinates_with_flank_and_overlapping_segdups_GRCh37.py \
    --input_genes workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp.bed \
    --output workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed

## Append gene names
Add the gene name column (column 4) of `GRCh37_mrg_full_gene_coordinates_slop20000bp.bed` to `GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed`. `paste` joins rows by position, so both files must list the genes in the same order.


cat workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp.bed \
    | cut -f4 \
    | paste workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed - \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names.bed 
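
Because `paste` pairs lines purely by line number, a dropped or reordered row would silently attach gene names to the wrong coordinates. The sketch below shows the same column append with an explicit row-count check; `append_column` and the gene names are hypothetical, for illustration only.

```python
def append_column(rows, column):
    """Append one value to each row, pairing rows and values by position."""
    rows, column = list(rows), list(column)
    if len(rows) != len(column):
        raise ValueError(f"row count mismatch: {len(rows)} vs {len(column)}")
    return [list(row) + [value] for row, value in zip(rows, column)]


# Toy coordinates and hypothetical gene names.
coords = [["1", "100", "200"], ["2", "300", "400"]]
names = ["GENE_A", "GENE_B"]
with_names = append_column(coords, names)
```

The length check catches missing rows but not reordering, so the two inputs still need to come from the same sorted source.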

## Sort HG002v11-align2-GRCh37.dip.bed

cat data/hifiasm_dipcall_output/HG002v11-align2-GRCh37.dip.bed \
    | sort -k1,1 -k2,2n \
    > workflow/smallvar_benchmark/GRCh37/HG002v11-align2-GRCh37.dip_sorted.bed
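
The sort key used above, `-k1,1 -k2,2n`, compares chromosome names as text and start positions as numbers. A small Python equivalent, assuming plain ASCII chromosome names (`sort_bed` and the records are hypothetical):

```python
def sort_bed(records):
    """Sort (chrom, start, end) records: chrom as text, then start as a number."""
    return sorted(records, key=lambda r: (r[0], int(r[1])))


# Toy records; note that "10" sorts before "2" because chrom is compared as text.
records = [("2", 500, 600), ("10", 50, 80), ("1", 1000, 2000), ("1", 200, 300)]
ordered = sort_bed(records)
```

This text-based chromosome order is what matters for downstream tools that expect both inputs to be sorted the same way.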

## Find overlap of HG002 GRCh37 hifiasm v0.11 with ENSEMBL gene annotations plus flanking sequence and overlapping segdups

bedtools coverage \
    -a workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names.bed \
    -b workflow/smallvar_benchmark/GRCh37/HG002v11-align2-GRCh37.dip_sorted.bed \
    | cut -f1,2,3,4,7,8 \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh37.dip_sorted.bed


## Create GRCh37_overlap_v4.2.1_and_hifiasm.tsv
Start from the six columns of `GRCh37_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed` (chrom, start, end, gene name, bp_covered, frac_covered), append columns 5 and 6 of `GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh37.dip_sorted.bed`, then append column 5 of `GRCh37_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed`.

`GRCh37_overlap_v4.2.1_and_hifiasm.tsv` column names are chrom, start, end, gene, bp_overlap_v4.2.1, percent_overlap_v4.2.1, bp_flanking_plus_segdups_overlap_hifiasm, percent_flanking_plus_segdups_overlap_hifiasm, flanking_breaks_in_dip_bed


cat workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh37.dip_sorted.bed \
    | cut -f5,6 \
    | paste workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed - \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark_overlap_hifiasm.bed

## Create header
printf 'chrom\tstart\tend\tgene\tbp_overlap_v4.2.1\tpercent_overlap_v4.2.1\tbp_flanking_plus_segdups_overlap_hifiasm\tpercent_flanking_plus_segdups_overlap_hifiasm\tflanking_breaks_in_dip_bed\n' \
    > workflow/smallvar_benchmark/GRCh37/GRCh37_overlap_v4.2.1_and_hifiasm.tsv

cat workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed \
    | cut -f5 \
    | paste workflow/smallvar_benchmark/GRCh37/GRCh37_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark_overlap_hifiasm.bed - \
    >> workflow/smallvar_benchmark/GRCh37/GRCh37_overlap_v4.2.1_and_hifiasm_temp.bed
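
For reference, the nine-column layout described above can be sketched in Python as a single pass that writes the header and the joined rows together. This is an illustrative alternative, not the method used in this notebook; `build_overlap_tsv` and all example values are hypothetical.

```python
import csv
import io

COLUMNS = [
    "chrom", "start", "end", "gene",
    "bp_overlap_v4.2.1", "percent_overlap_v4.2.1",
    "bp_flanking_plus_segdups_overlap_hifiasm",
    "percent_flanking_plus_segdups_overlap_hifiasm",
    "flanking_breaks_in_dip_bed",
]


def build_overlap_tsv(benchmark_rows, hifiasm_cols, break_counts):
    """Join three per-gene tables by row position and return TSV text.

    benchmark_rows: 6 fields per gene (chrom, start, end, gene, bp, fraction)
    hifiasm_cols:   2 fields per gene (bp, fraction)
    break_counts:   1 value per gene
    """
    if not (len(benchmark_rows) == len(hifiasm_cols) == len(break_counts)):
        raise ValueError("inputs must have one row per gene, in the same order")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for bench, hifi, breaks in zip(benchmark_rows, hifiasm_cols, break_counts):
        writer.writerow([*bench, *hifi, breaks])
    return out.getvalue()


# One toy gene with made-up values.
tsv_text = build_overlap_tsv(
    [["1", "100", "200", "GENE_A", "70", "0.7"]],
    [["150", "0.5"]],
    ["0"],
)
```

Writing the header with real tab characters keeps it aligned with the tab-separated rows when the file is read back.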



## Use `find_coordinates_of_MRG_GRCh37_GRCh38_union.R` to generate `HG002_GRCh37_CMRG_coordinates.bed`

NOTE: `find_coordinates_of_MRG_GRCh37_GRCh38_union.R` depends on the creation of `workflow/smallvar_benchmark/GRCh37/GRCh38_overlap_v4.2.1_and_hifiasm.tsv` as detailed in `analysis/GRCh38_HG002_medical_genes_benchmark_generation.ipynb`

## Run bedtools merge

bedtools merge \
    -i data/cmrg_coords/HG002_GRCh37_CMRG_coordinates.bed \
    > workflow/smallvar_benchmark/GRCh37/HG002_GRCh37_CMRG_coordinates_temp_bedtools_merge.bed
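
With default settings, `bedtools merge` combines overlapping and book-ended intervals into single records. A minimal single-chromosome sketch of that behaviour (`merge_intervals` and the toy coordinates are hypothetical):

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Toy input: the first three intervals chain together, the last stands alone.
merged = merge_intervals([(100, 200), (150, 300), (300, 350), (400, 500)])
```

The sketch sorts its input itself, whereas `bedtools merge` expects the BED file to be sorted by chromosome and start beforehand.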

# CMRG Coordinates and Gene List GRCh38

### v4.2.1 and ENSEMBL gene overlaps
 
- `GRCh38_ENSEMBL_MRG_coordinates.bed` is the output of `scripts/GRCh38_lookup_MRG_symbol_coordinates_ENSEMBL.R`

- Correct for duplicate gene entries using `https://gitlab.nist.gov/gitlab/nolson/mrg-bench-manuscript/-/blob/master/data/gene_coords/ensembl_coords/selected_coordinates_for_duplicated_and_incorrect_gene_symbol_entries.tsv` to produce `data/manually_created_files/GRCh38_mrg_full_gene.bed`


mkdir -p workflow/smallvar_benchmark/GRCh38
bedtools coverage \
    -a data/manually_created_files/GRCh38_mrg_full_gene.bed \
    -b data/v4.2.1_benchmark_regions/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
    | cut -f1,2,3,4,7,8  \
    > workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed

## Create flanking sequence bed for MRG candidates

20 kb (20,000 bp) on either side of each gene, plus overlapping segmental duplications.

bedtools slop \
    -i data/manually_created_files/GRCh38_mrg_full_gene.bed \
    -g resources/human.b38.genome \
    -b 20000 \
    > workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp.bed


`HG002v11-align2-GRCh38.dip_check_for_breaks.bed` is generated from `HG002v11-align2-GRCh38.dip.bed` as chrom, start, start+1, i.e. a 1 bp interval at the start of each `dip.bed` region.

awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$2+1}' \
    data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.bed \
    > workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_check_for_breaks.bed
    
python scripts/find_overlap_per_gene.py \
    --input_benchmark data/manually_created_files/HG002v11-align2-GRCh38.dip_check_for_breaks.bed \
    --input_genes workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp.bed \
    --output  workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed  
    
    
## Find coordinates of ENSEMBL gene annotations with flanking sequence and overlapping segdups

python scripts/expand_gene_coordinates_with_flank_and_overlapping_segdups_GRCh38.py \
    --input_genes workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp.bed \
    --output workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed  
   
   
## Append gene names
Add the gene name column (column 4) of `GRCh38_mrg_full_gene_coordinates_slop20000bp.bed` to `GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed`. `paste` joins rows by position, so both files must list the genes in the same order.

cat workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp.bed \
    | cut -f4 \
    | paste workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates.bed - \
    > workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names.bed   
    
    
## Sort HG002v11-align2-GRCh38.dip.bed

cat data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.bed \
    | sed 's/^chr//' \
    | sort -k1,1 -k2,2n \
    | sed 's/^/chr/' \
    > workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_sorted.bed  
    
## Find overlap of HG002 GRCh38 hifiasm v0.11 with ENSEMBL gene annotations plus flanking sequence and overlapping segdups

bedtools coverage \
    -a workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names.bed \
    -b workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_sorted.bed \
    | cut -f1,2,3,4,7,8 \
    > workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh38.dip_sorted.bed  
    
    
## Create GRCh38_overlap_v4.2.1_and_hifiasm.tsv
Start from the six columns of `GRCh38_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed` (chrom, start, end, gene name, bp_covered, frac_covered), append columns 5 and 6 of `GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh38.dip_sorted.bed`, then append column 5 of `GRCh38_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed`.

`GRCh38_overlap_v4.2.1_and_hifiasm.tsv` column names are chrom, start, end, gene, bp_overlap_v4.2.1, percent_overlap_v4.2.1, bp_flanking_plus_segdups_overlap_hifiasm, percent_flanking_plus_segdups_overlap_hifiasm, flanking_breaks_in_dip_bed

cat workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_flanking_and_segdups_coordinates_w_gene_names_overlap_with_HG002v11-align2-GRCh38.dip_sorted.bed \
    | cut -f5,6 \
    | paste workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark.bed - \
    > workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark_overlap_hifiasm.bed


## Create header
printf 'chrom\tstart\tend\tgene\tbp_overlap_v4.2.1\tpercent_overlap_v4.2.1\tbp_flanking_plus_segdups_overlap_hifiasm\tpercent_flanking_plus_segdups_overlap_hifiasm\tflanking_breaks_in_dip_bed\n' \
    > workflow/smallvar_benchmark/GRCh37/GRCh38_overlap_v4.2.1_and_hifiasm.tsv

cat workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_slop20000bp_check_for_breaks_in_dip.bed \
    | cut -f5 \
    | paste workflow/smallvar_benchmark/GRCh38/GRCh38_mrg_full_gene_coordinates_overlap_with_v4.2.1_benchmark_overlap_hifiasm.bed - \
    >> workflow/smallvar_benchmark/GRCh38/GRCh38_overlap_v4.2.1_and_hifiasm_temp.bed  
    
    
## Use `find_coordinates_of_MRG_GRCh37_GRCh38_union.R` to generate `HG002_GRCh38_CMRG_coordinates.bed`

NOTE: `find_coordinates_of_MRG_GRCh37_GRCh38_union.R` depends on the creation of `workflow/smallvar_benchmark/GRCh37/GRCh37_overlap_v4.2.1_and_hifiasm.tsv` as detailed in `analysis/GRCh37_HG002_medical_genes_benchmark_generation.ipynb`

## Run bedtools merge

bedtools merge \
    -i data/cmrg_coords/HG002_GRCh38_CMRG_coordinates.bed \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_bedtools_merge.bed