# Selecting contigs for alignment

## Problem
Read values are missing from the main contig.

It turned out that there are multiple alt contigs for the same region, which makes it impossible to decide locally which reads to use. Alt contigs should not be mixed and matched: if one is selected, it should be used in its entirety.

## Exploring solutions

### Calculate alignment metadata to select contig with highest read coverage

Select contigs based on read coverage.

In [7]:
import json
import pandas as pd
import pysam

from search_your_dna.util import get_file_header_line_number, CHROM_LIST

#### Read in the alignment and assembly metadata

In [8]:
bam_file = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam"
alignment_data = pysam.AlignmentFile(bam_file, "rb")

assembly_report_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_report.txt"
assembly_report_file_header_line_number = get_file_header_line_number(assembly_report_file, header_pattern="# Sequence-Name\t") # header starts with `# Sequence-Name`
assembly_report_df = pd.read_csv(assembly_report_file, sep="\t", skiprows=assembly_report_file_header_line_number, dtype=str)
assembly_report_df.columns = assembly_report_df.columns.str.lower()
assembly_report_df.columns = assembly_report_df.columns.str.replace("-", "_")
assembly_report_df.columns = assembly_report_df.columns.str.replace("# ", "")

assembly_regions_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_regions.txt"
assembly_regions_file_header_line_number = get_file_header_line_number(assembly_regions_file, header_pattern="# Region-Name\t") # header starts with `# Region-Name`
assembly_regions_df = pd.read_csv(assembly_regions_file, sep="\t", skiprows=assembly_regions_file_header_line_number, dtype=str)
assembly_regions_df.columns = assembly_regions_df.columns.str.lower()
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("-", "_")
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("# ", "")
assembly_regions_df["chromosome_start"] = assembly_regions_df["chromosome_start"].astype('float').astype("int64")
assembly_regions_df["chromosome_stop"] = assembly_regions_df["chromosome_stop"].astype('float').astype("int64")

assembly_metadata_df = pd.merge(assembly_regions_df, assembly_report_df, how="inner", left_on="scaffold_genbank_accn", right_on="genbank_accn") # inner join to exclude unlocalized contigs
assembly_metadata_df = assembly_metadata_df[assembly_metadata_df["ucsc_style_name"] != "na"]
assembly_metadata_df

Unnamed: 0,region_name,chromosome,chromosome_start,chromosome_stop,scaffold_role,scaffold_genbank_accn,scaffold_refseq_accn,assembly_unit_x,sequence_name,sequence_role,assigned_molecule,assigned_molecule_location/type,genbank_accn,relationship,refseq_accn,assembly_unit_y,sequence_length,ucsc_style_name
0,REGION108,1,2448811,2791270,alt-scaffold,KI270762.1,NT_187515.1,ALT_REF_LOCI_1,HSCHR1_1_CTG3,alt-scaffold,1,Chromosome,KI270762.1,=,NT_187515.1,ALT_REF_LOCI_1,354444,chr1_KI270762v1_alt
1,PRAME_REGION_1,1,12818488,13312803,alt-scaffold,KI270766.1,NT_187517.1,ALT_REF_LOCI_1,HSCHR1_2_CTG3,alt-scaffold,1,Chromosome,KI270766.1,=,NT_187517.1,ALT_REF_LOCI_1,256271,chr1_KI270766v1_alt
7,REGION109,1,30352191,30456601,alt-scaffold,KI270760.1,NT_187514.1,ALT_REF_LOCI_1,HSCHR1_1_CTG11,alt-scaffold,1,Chromosome,KI270760.1,=,NT_187514.1,ALT_REF_LOCI_1,109528,chr1_KI270760v1_alt
11,1Q21,1,144488706,144674781,alt-scaffold,KI270765.1,NT_187520.1,ALT_REF_LOCI_1,HSCHR1_4_CTG31,alt-scaffold,1,Chromosome,KI270765.1,=,NT_187520.1,ALT_REF_LOCI_1,185285,chr1_KI270765v1_alt
12,REGION2,1,153700531,153865738,alt-scaffold,GL383518.1,NW_003315905.1,ALT_REF_LOCI_1,HSCHR1_1_CTG31,alt-scaffold,1,Chromosome,GL383518.1,=,NW_003315905.1,ALT_REF_LOCI_1,182439,chr1_GL383518v1_alt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,CYP2D6,22,42077656,42253758,alt-scaffold,KB663609.1,NW_004504305.1,ALT_REF_LOCI_2,HSCHR22_2_CTG1,alt-scaffold,22,Chromosome,KB663609.1,=,NW_004504305.1,ALT_REF_LOCI_2,74013,chr22_KB663609v1_alt
319,CYP2D6,22,42077656,42253758,alt-scaffold,KI270928.1,NT_187682.1,ALT_REF_LOCI_3,HSCHR22_3_CTG1,alt-scaffold,22,Chromosome,KI270928.1,=,NT_187682.1,ALT_REF_LOCI_3,176103,chr22_KI270928v1_alt
326,REGION187,X,319338,601516,alt-scaffold,KI270880.1,NT_187634.1,ALT_REF_LOCI_1,HSCHRX_1_CTG3,alt-scaffold,X,Chromosome,KI270880.1,=,NT_187634.1,ALT_REF_LOCI_1,284869,chrX_KI270880v1_alt
327,REGION187,X,319338,601516,alt-scaffold,KI270913.1,NT_187667.1,ALT_REF_LOCI_2,HSCHRX_2_CTG3,alt-scaffold,X,Chromosome,KI270913.1,=,NT_187667.1,ALT_REF_LOCI_2,274009,chrX_KI270913v1_alt
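
The report and regions files are loaded with the same three steps: find the header line, read the table, normalize the column names. A minimal sketch of factoring that into helpers is shown below. `find_header_line` is a hypothetical stand-in that only mirrors how `get_file_header_line_number` is used above (its return value is passed to `skiprows`); it is not the actual implementation in `search_your_dna.util`. The report text is made up for illustration.

```python
import io

import pandas as pd


def find_header_line(lines, header_pattern):
    """Return the 0-based index of the first line starting with header_pattern.

    Hypothetical stand-in for get_file_header_line_number.
    """
    for line_number, line in enumerate(lines):
        if line.startswith(header_pattern):
            return line_number
    raise ValueError(f"No line starts with {header_pattern!r}")


def normalize_columns(df):
    """Drop the leading '# ', lower-case the names and replace '-' with '_'."""
    df = df.copy()
    df.columns = (
        df.columns.str.replace("# ", "", regex=False)
        .str.lower()
        .str.replace("-", "_", regex=False)
    )
    return df


# Made-up report: two comment lines, then the header, then one data row.
report_text = (
    "# Assembly name: example\n"
    "# Some other comment\n"
    "# Sequence-Name\tGenBank-Accn\n"
    "HSCHR1_1_CTG3\tKI270762.1\n"
)

header_line = find_header_line(report_text.splitlines(), "# Sequence-Name\t")
report_df = normalize_columns(
    pd.read_csv(io.StringIO(report_text), sep="\t", skiprows=header_line, dtype=str)
)
print(list(report_df.columns))
```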


In [9]:
contigs = set(alignment_data.header.references)
ucsc_style_names = set(assembly_metadata_df["ucsc_style_name"].to_list())

f"Total contigs: {len(contigs)}, of which {len(contigs - ucsc_style_names)} are excluded because no region is assigned to them (e.g. unlocalized contigs)."

"Total contigs: 456, of which 195 are excluded because no region is assigned to them (e.g. unlocalized contigs)."

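The set difference above contains every BAM contig that does not appear in the regions table, which presumably includes the primary chromosomes and unplaced scaffolds as well as unlocalized ones. A rough sketch of splitting contig names by the UCSC naming pattern (`_alt`, `_random`, `chrUn_`) is shown below. The example names and the list of primary chromosomes are assumptions for illustration and are not read from the BAM header.

```python
from collections import Counter

# Assumed primary chromosome names; the notebook takes them from CHROM_LIST.
PRIMARY_CHROMOSOMES = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def classify_contig(name):
    """Classify a UCSC-style contig name by its naming pattern."""
    if name in PRIMARY_CHROMOSOMES:
        return "main"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"


# Illustrative names only.
example_contigs = [
    "chr1",
    "chrX",
    "chrM",
    "chr1_KI270762v1_alt",
    "chr1_KI270706v1_random",
    "chrUn_KI270302v1",
]

contig_classes = Counter(classify_contig(name) for name in example_contigs)
print(dict(contig_classes))
```

Applied to `alignment_data.header.references`, a count like this would show how the excluded contigs split across the categories.
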
#### Get alt contig read values per region

In [10]:
main_contigs = {"chr" + chrom for chrom in CHROM_LIST}

all_alt_contig_reads = {}
for contig in sorted(ucsc_style_names):
    if contig in main_contigs:
        continue  # skip the primary chromosomes, only alt contigs are needed here
    all_alt_contig_reads[contig] = list(alignment_data.fetch(contig))

In [11]:
alt_contig_region_reads = {}
for region, contigs_df in assembly_metadata_df[["region_name", "ucsc_style_name"]].groupby(by="region_name"):
    alt_contig_region_reads[region] = {}
    for contig in contigs_df["ucsc_style_name"].to_list():
        alt_contig_region_reads[region][contig] = all_alt_contig_reads[contig]
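
Keeping every read object in a list can take a lot of memory. A possible alternative, sketched below, keeps only the read count and the set of read names per contig, which is all the later cells use. `FakeAlignment` is a hypothetical in-memory stand-in so the example runs without a BAM file; the assumption is that the real `alignment_data` object can be passed to the same function, since it is only iterated through `fetch(contig)` and only `query_name` is read.

```python
from collections import namedtuple

# Minimal read record exposing only the attribute used below.
Read = namedtuple("Read", ["query_name"])


class FakeAlignment:
    """Hypothetical stand-in for an alignment file, for illustration only."""

    def __init__(self, reads_by_contig):
        self._reads_by_contig = reads_by_contig

    def fetch(self, contig):
        return iter(self._reads_by_contig.get(contig, []))


def summarize_contig(alignment, contig):
    """Return (number of reads, set of read names) for one contig."""
    count = 0
    names = set()
    for read in alignment.fetch(contig):
        count += 1
        names.add(read.query_name)
    return count, names


# Made-up data: read "r1" appears twice, e.g. both mates of a pair.
alignment = FakeAlignment({"alt_1": [Read("r1"), Read("r1"), Read("r2")]})
read_count, read_names = summarize_contig(alignment, "alt_1")
print(read_count, sorted(read_names))
```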

#### Get main contig read values per region

In [12]:
region_based_chrom_start_stop_df = assembly_metadata_df[["region_name","chromosome","chromosome_start","chromosome_stop"]].groupby(by="region_name")

main_contig_region_reads = {}
for region, chrom_start_stop_df in region_based_chrom_start_stop_df:
    # Use the first row of the region for its chromosome coordinates.
    first_row = chrom_start_stop_df.iloc[0]
    contig = "chr" + first_row["chromosome"]
    start = first_row["chromosome_start"]
    stop = first_row["chromosome_stop"]
    reads = alignment_data.fetch(contig=contig, start=start, stop=stop)
    main_contig_region_reads[region] = list(reads)
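
One detail worth checking in the cell above: `fetch` in pysam takes 0-based, half-open coordinates. If the assembly regions file uses 1-based, inclusive coordinates (an assumption here, not verified against the file), the start passed to `fetch` is off by one base. A small sketch of the conversion, with a hypothetical helper name:

```python
def to_fetch_interval(start_1_based, stop_1_based_inclusive):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    if start_1_based < 1 or stop_1_based_inclusive < start_1_based:
        raise ValueError("expected 1 <= start <= stop")
    # Only the start shifts: the inclusive 1-based stop equals the
    # exclusive 0-based stop.
    return start_1_based - 1, stop_1_based_inclusive


# Coordinates of REGION108 from the table above.
fetch_start, fetch_stop = to_fetch_interval(2448811, 2791270)
print(fetch_start, fetch_stop)
```

The effect on read counts is likely small, since only reads overlapping a single extra base are affected.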

In [13]:
aggregated_region_contig_reads = {}
for region in alt_contig_region_reads:
    aggregated_region_contig_reads[region] = {}
    aggregated_region_contig_reads[region]["main"] = len(main_contig_region_reads[region])
    for contig, reads in alt_contig_region_reads[region].items():
        aggregated_region_contig_reads[region][contig] = len(reads)

In [14]:
with open("data/region_contig_read_counts.json", "w") as f:
    json.dump(aggregated_region_contig_reads, f, indent=2)

*Exploration results: based on the per-region alt contig read counts, it seems that selecting any single contig will lose a lot of reads, which could introduce mistakes into my genotype calculation. I need to check with bioinformaticians. Maybe I need to run the alignment again against specifically selected alt contigs.*
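
A minimal sketch of the selection idea, using a dictionary shaped like `aggregated_region_contig_reads`: pick the contig with the highest read count in each region and report how many reads the other contigs hold. The region names and counts are made up, and the "lost" figure assumes each read is counted on exactly one contig, which may not hold.

```python
def pick_best_contig(read_counts):
    """Return (label, count, lost) for the contig with the most reads.

    read_counts maps "main" or an alt contig name to its read count.
    lost is the number of reads on all the other contigs combined.
    """
    best_label = max(read_counts, key=read_counts.get)
    best_count = read_counts[best_label]
    lost = sum(read_counts.values()) - best_count
    return best_label, best_count, lost


# Made-up counts for illustration.
example_counts = {
    "REGION_A": {"main": 1200, "alt_1": 900, "alt_2": 300},
    "REGION_B": {"main": 50, "alt_1": 700},
}

selection = {
    region: pick_best_contig(counts) for region, counts in example_counts.items()
}
for region, (label, count, lost) in selection.items():
    print(region, label, count, lost)
```

Raw read counts are not normalized by contig length, so this ranks contigs by count rather than by coverage depth.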

### Povilas suggested that reads can be mapped to multiple contigs

Selected contig `chr16_KI270856v1_alt` to check for reads.

In [15]:
chr16_KI270856v1_alt_reads = all_alt_contig_reads["chr16_KI270856v1_alt"]
chr16_KI270856v1_alt_main_equivalent_reads = main_contig_region_reads["REGION173"]

In [17]:
chr16_KI270856v1_alt_read_ids = {read.query_name for read in chr16_KI270856v1_alt_reads}
chr16_KI270856v1_alt_main_equivalent_read_ids = {read.query_name for read in chr16_KI270856v1_alt_main_equivalent_reads}

In [19]:
len(chr16_KI270856v1_alt_main_equivalent_read_ids - chr16_KI270856v1_alt_read_ids)

3282

In [20]:
len(chr16_KI270856v1_alt_read_ids - chr16_KI270856v1_alt_main_equivalent_read_ids)

3171

*Conclusion: Reads are mapped to multiple regions*
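
The two differences above count read names found on only one side. A sketch of also counting the names found on both sides is shown below, since that number measures the overlap directly. The read names are made up. Note that a shared name may also come from the two mates of a pair landing on different contigs, so this is a rough check rather than proof of multi-mapping.

```python
def compare_read_names(alt_names, main_names):
    """Count read names unique to each side and shared by both."""
    alt_names, main_names = set(alt_names), set(main_names)
    return {
        "only_alt": len(alt_names - main_names),
        "only_main": len(main_names - alt_names),
        "shared": len(alt_names & main_names),
    }


# Made-up read names for illustration.
overlap = compare_read_names(
    alt_names=["r1", "r2", "r3"],
    main_names=["r2", "r3", "r4", "r5"],
)
print(overlap)
```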