# Figuring out how to use reference genome alternate contigs

## Problem

After aligning against the GRCh38 reference genome, some reads ended up on alternate contigs. As a result the allele values disappeared, because my BAM file analysis process doesn't take alternate contigs into account.

Namely, I was searching the new alignment file for the value of SNP `rs9380142` (an arbitrary SNP that I found missing). Based on the GRCh38 build it should be visible at `chr6:29,831,013-29,831,027` (pos `29831017`), and based on GRCh37 it can be seen at `chr6:29,798,790-29,798,804`.

The purpose of this notebook is to understand how and when to use values from alternate contigs.

## Confirming the issue

Finding the reads in the alignment file produced with GRCh37:
```
$ samtools view /home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.bam | grep "A00925:63:HY3V3DSXX:1:1204:8314:25629"
A00925:63:HY3V3DSXX:1:1204:8314:25629	99	6	29798723	60	151M	=	29798778	206	CTCCGTCTCTGTCTCAAATTTGTGGTCCACTGAGCTATAACTTACTTCTGTATTAAAATTAGAATCTGAGTGTAAATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGACGATTCCTGAAATTTGAGAGACA	F:FFF:FF,F,FFFFFFFFFFF,FFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFF:::F:FFF,F:FFFFFFFFF:FFFFFF,FF,FF,:FF,F:FF:FF,:FFFFFFFFF,FFFF,FF:,,FFF:,F:FFFFF,:F,F,FFF	RG:Z:1	AS:i:141	XS:i:66	NM:i:2	XQ:i:250
A00925:63:HY3V3DSXX:1:1204:8314:25629	147	6	29798778	60	151M	=	29798723	-206	AAATTAGAATCTGAGTGTATATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGAAGATTCCTGAAATTTGAGAGACAAAATAAATGGAAGACATGAGAACTTTCCACAGTACACGTGTTTCTTGTGCTGATT	,F:FFFF,F:F::FFFFFF:F,FFF:F::F:FFFFFFFFFF,FF:,FFF:FFFFFF,,:FFFFFF,FFFFF,,FFFFFFF:FF:FFFF:FFFFF:FF:FFFFFFFFFFFFFFFFFFFFF,FF,F:,FFFFFFFF::FF:FFFFFFFFFF:F	RG:Z:1	AS:i:141	XS:i:91	NM:i:2	XQ:i:220
```

Finding the same reads in the alignment file produced with GRCh38.p7:
```
$ samtools view /home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam | grep "A00925:63:HY3V3DSXX:1:1204:8314:25629"
A00925:63:HY3V3DSXX:1:1204:8314:25629	99	chr6_GL000254v2_alt	1093417	1	151M	=	1093472	206	CTCCGTCTCTGTCTCAAATTTGTGGTCCACTGAGCTATAACTTACTTCTGTATTAAAATTAGAATCTGAGTGTAAATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGACGATTCCTGAAATTTGAGAGACA	F:FFF:FF,F,FFFFFFFFFFF,FFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFF:::F:FFF,F:FFFFFFFFF:FFFFFF,FF,FF,:FF,F:FF:FF,:FFFFFFFFF,FFFF,FF:,,FFF:,F:FFFFF,:F,F,FFF	AS:i:-3	XS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:128A22	YS:i:-4	YT:Z:CP
A00925:63:HY3V3DSXX:1:1204:8314:25629	147	chr6_GL000254v2_alt	1093472	1	151M	=	1093417	-206	AAATTAGAATCTGAGTGTATATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGAAGATTCCTGAAATTTGAGAGACAAAATAAATGGAAGACATGAGAACTTTCCACAGTACACGTGTTTCTTGTGCTGATT	,F:FFFF,F:F::FFFFFF:F,FFF:F::F:FFFFFFFFFF,FF:,FFF:FFFFFF,,:FFFFFF,FFFFF,,FFFFFFF:FF:FFFF:FFFFF:FF:FFFFFFFFFFFFFFFFFFFFF,FF,F:,FFFFFFFF::FF:FFFFFFFFFF:F	AS:i:-4	XS:i:-4	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:19A131	YS:i:-3	YT:Z:CP
```

Sure enough, the reads are on contig `chr6_GL000254v2_alt`, not on `chr6` as I had hoped.
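
To compare the two alignments without reading the raw SAM lines, the mandatory columns can be pulled out by position. The sketch below is a minimal, standard-library-only parser. `parse_sam_core` is a hypothetical helper name, and the two sample records are the first mate from each file above, with SEQ and QUAL replaced by `*` and the optional tags dropped for brevity.

```python
from typing import Dict, Union


def parse_sam_core(line: str) -> Dict[str, Union[str, int]]:
    """Return read name, flag, contig, position, MAPQ and CIGAR of one SAM record."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "contig": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapped position
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }


read_name = "A00925:63:HY3V3DSXX:1:1204:8314:25629"
# First mate of the pair, shortened: SEQ and QUAL are replaced by "*".
grch37_record = f"{read_name}\t99\t6\t29798723\t60\t151M\t=\t29798778\t206\t*\t*"
grch38_record = f"{read_name}\t99\tchr6_GL000254v2_alt\t1093417\t1\t151M\t=\t1093472\t206\t*\t*"

for record in (grch37_record, grch38_record):
    core = parse_sam_core(record)
    print(core["contig"], core["pos"], core["mapq"])
```

The same read drops from MAPQ 60 on `6` to MAPQ 1 on `chr6_GL000254v2_alt`. Together with the equal `AS` and `XS` scores in the GRCh38 records, this suggests the aligner found an equally good placement elsewhere.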

SNP `rs9380142` is visible at `chr6_GL000254v2_alt:1,093,484-1,093,498`, and because the reference has changed, I no longer see a variant at that position.
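
One way to check where the reads differ from the alt contig is to decode the `MD` tag of each record. The sketch below is a small parser written for this note (`md_mismatch_positions` is a hypothetical name, not part of pysam or `search_your_dna`). It assumes a read without insertions or soft clipping, which holds for the `151M` records above.

```python
import re
from typing import List, Tuple


def md_mismatch_positions(pos: int, md: str) -> List[Tuple[int, str]]:
    """Return (1-based reference position, reference base) for each mismatch in an MD tag.

    `pos` is the SAM POS field of the read. Deletions (`^ACGT`) advance the
    reference position but are not reported as mismatches.
    """
    mismatches = []
    ref_pos = pos
    # Tokens: a run of matching bases, a deletion, or one mismatched reference base.
    for token in re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md):
        if token.isdigit():
            ref_pos += int(token)
        elif token.startswith("^"):
            ref_pos += len(token) - 1
        else:
            mismatches.append((ref_pos, token))
            ref_pos += 1
    return mismatches


print(md_mismatch_positions(1093417, "128A22"))  # first mate  -> [(1093545, 'A')]
print(md_mismatch_positions(1093472, "19A131"))  # second mate -> [(1093491, 'A')]
```

For the second mate the mismatch is at 1093491, which falls inside the `chr6_GL000254v2_alt:1,093,484-1,093,498` window. The first mate's only mismatch is at 1093545, outside it.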

## Finding a solution

### First go: exploring reference genome assembly metadata

I found some metadata [here](https://www.gencodegenes.org/human/) and [here](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.22_GRCh38.p7/).

The following files seem to contain the information needed to map alt contigs to chromosomes.

#### Read in the metadata
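The cells below rely on `get_file_header_line_number` from `search_your_dna.util`, whose implementation is not shown in this notebook. As a rough idea of what such a helper has to do, here is a hypothetical stand-in. The name `find_header_line_number` and the sample file content are made up for illustration. It returns the 0-based index of the header line, which is also the number of lines to hand to `skiprows`.

```python
import os
import tempfile


def find_header_line_number(path: str, header_pattern: str) -> int:
    """Return the 0-based index of the first line that starts with `header_pattern`."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if line.startswith(header_pattern):
                return line_number
    raise ValueError(f"no line starting with {header_pattern!r} in {path}")


# Tiny made-up file: two comment lines, then the header, then one data row.
sample = (
    "# Assembly name: example\n"
    "# Description: example\n"
    "# Sequence-Name\tSequence-Role\n"
    "1\tassembled-molecule\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

header_line = find_header_line_number(tmp.name, header_pattern="# Sequence-Name\t")
os.remove(tmp.name)
print(header_line)  # 2: skipping two lines leaves the header as the first row
```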

In [None]:
from functools import reduce
from typing import List, Dict
import pandas as pd
import pysam
from search_your_dna.util import get_file_header_line_number, get_read_values_for_allele, \
    _get_contig, genotype_from_reads

In [None]:
assembly_report_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_report.txt"
assembly_report_file_header_line_number = get_file_header_line_number(assembly_report_file, header_pattern="# Sequence-Name\t") # header starts with `# Sequence-Name`
assembly_report_df = pd.read_csv(assembly_report_file, sep="\t", skiprows=assembly_report_file_header_line_number, dtype=str)
assembly_report_df.columns = assembly_report_df.columns.str.lower()
assembly_report_df.columns = assembly_report_df.columns.str.replace("-", "_")
assembly_report_df.columns = assembly_report_df.columns.str.replace("# ", "")
assembly_report_df

In [None]:
assembly_regions_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_regions.txt"
assembly_regions_file_header_line_number = get_file_header_line_number(assembly_regions_file, header_pattern="# Region-Name\t") # header starts with `# Region-Name`
assembly_regions_df = pd.read_csv(assembly_regions_file, sep="\t", skiprows=assembly_regions_file_header_line_number, dtype=str)
assembly_regions_df.columns = assembly_regions_df.columns.str.lower()
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("-", "_")
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("# ", "")
assembly_regions_df["chromosome_start"] = assembly_regions_df["chromosome_start"].astype('float').astype("int64")
assembly_regions_df["chromosome_stop"] = assembly_regions_df["chromosome_stop"].astype('float').astype("int64")
assembly_regions_df
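
Both cells above apply the same three column-name transformations, so a small helper could capture them in one place. This is only a sketch. `normalise_column_names` is a hypothetical name, and it works on a plain list of strings so that it can be tried without pandas.

```python
from typing import List


def normalise_column_names(columns: List[str]) -> List[str]:
    """Lower-case, swap '-' for '_' and drop the leading '# ' comment marker."""
    return [name.lower().replace("-", "_").replace("# ", "") for name in columns]


print(normalise_column_names(["# Sequence-Name", "GenBank-Accn", "UCSC-style-name"]))
# ['sequence_name', 'genbank_accn', 'ucsc_style_name']
```

With a data frame it could be used as `assembly_report_df.columns = normalise_column_names(list(assembly_report_df.columns))`.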

##### Aggregate assembly metadata

In [None]:
assembly_metadata_df = pd.merge(assembly_regions_df, assembly_report_df, left_on="scaffold_genbank_accn", right_on="genbank_accn")
assembly_metadata_df

In [None]:
bam_file = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam"
alignment_data = pysam.AlignmentFile(bam_file, "rb")

#### Find mapping from chrom+pos to contigs

In the end I want to be able to look up chrom+pos in the alignment file while taking alt contigs into account.
That means I need a function that takes chrom+pos and returns all contigs that cover this position.

In [None]:
chrom = "6"
pos = 29831017

assembly_regions_for_chr_df = assembly_regions_df[assembly_regions_df["chromosome"] == chrom]
# .copy() so that adding a column below does not trigger pandas' SettingWithCopyWarning
assembly_regions_for_chr_pos_df = assembly_regions_for_chr_df[(assembly_regions_for_chr_df["chromosome_start"] < pos) & (assembly_regions_for_chr_df["chromosome_stop"] >= pos)].copy()
# offset of pos from the start of the chromosome region that the alt contig covers
assembly_regions_for_chr_pos_df["pos_on_contig"] = pos - assembly_regions_for_chr_pos_df["chromosome_start"]
assembly_regions_for_chr_pos_df

Map these contigs to the names used in the alignment file:

In [None]:
alt_contig_metadata = pd.merge(assembly_regions_for_chr_pos_df, assembly_report_df, left_on="scaffold_genbank_accn", right_on="genbank_accn")
alt_contig_metadata.rename(columns={"ucsc_style_name": "contig"}, inplace=True)
alt_contig_metadata

In [None]:
res = alt_contig_metadata[["pos_on_contig", "contig"]]
res.attrs["chrom"] = chrom
res.attrs["pos"] = pos
res

#### Test getting alignment data for `rs9380142`


In [None]:
def get_all_allele_contig_pos_values(
        chrom: str,
        pos: int,
        alignment_data: pysam.AlignmentFile,
        assembly_metadata: pd.DataFrame
) -> pd.DataFrame:
    """
    Find every contig (main and alt) that covers a chromosome position.

    :param chrom: chromosome name as used in the assembly metadata, e.g. "6"
    :param pos: position on the main chromosome
    :param alignment_data: alignment file, used to resolve the name of the main contig
    :param assembly_metadata: data frame with columns: "chromosome" (str), "chromosome_start" (int),
        "chromosome_stop" (int) and "ucsc_style_name" (str as header name in the alignment file)
    :return: data frame with two columns: "pos_on_contig" (int), "contig" (str)
    """
    # alt contigs whose chromosome region covers pos; copy so the input frame is not modified
    covering_regions = assembly_metadata[
        (assembly_metadata["chromosome"] == chrom)
        & (assembly_metadata["chromosome_start"] < pos)
        & (assembly_metadata["chromosome_stop"] >= pos)
    ].copy()
    covering_regions["pos_on_contig"] = pos - covering_regions["chromosome_start"]
    covering_regions = covering_regions.rename(columns={"ucsc_style_name": "contig"})

    # the main contig keeps the original position
    main_contig = _get_contig(alignment_data=alignment_data, chrom=chrom)
    main_contig_position = pd.DataFrame({"pos_on_contig": [pos], "contig": [main_contig]})

    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    contig_positions = pd.concat(
        [covering_regions[["pos_on_contig", "contig"]], main_contig_position], ignore_index=True
    )
    contig_positions["pos_on_contig"] = contig_positions["pos_on_contig"].astype("int64")

    return contig_positions



pos = 29831017

def get_reads_for_allele(chrom, pos, alignment_data, assembly_metadata):
    contig_positions = get_all_allele_contig_pos_values(
        chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata
    )

    res = []
    for index, row in contig_positions.iterrows():
        pos_on_contig = row["pos_on_contig"]
        contig = row["contig"]
        pos_reads_per_contig = get_read_values_for_allele(alignment_data=alignment_data, pos=pos_on_contig, contig=contig)
        if len(pos_reads_per_contig) == 0:
            print(f"debug: no reads on contig '{contig}'")
            continue
        assert len(pos_reads_per_contig) == 1, f"should have reads for single allele. but got #{len(pos_reads_per_contig)}"
        reads = list(pos_reads_per_contig.values())[0]
        print("debug:", contig, pos_reads_per_contig)
        res.append(reads)

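    def return_longer_list(a: List, b: List) -> List:
        return b if len(b) > len(a) else a

    # keep the reads from the contig with the most reads; the [] initial value
    # avoids a TypeError from reduce() when no contig had reads at this position
    most_relevant_reads = reduce(return_longer_list, res, [])
    return most_relevant_reads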

reads = get_reads_for_allele(chrom="6", pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads)
f"chrom {chrom} pos {pos} genotype is {genotype}"
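
`get_reads_for_allele` keeps the longest read list but does not record which contig it came from. If that becomes relevant, the same "most reads wins" rule could be applied to a mapping from contig name to reads. The sketch below is only an illustration. `pick_contig_with_most_reads` is a hypothetical helper, and the base calls in `example` are made up to exercise the rule.

```python
from typing import Dict, List


def pick_contig_with_most_reads(reads_per_contig: Dict[str, List[str]]) -> str:
    """Return the contig whose read list is longest (the first one wins on ties)."""
    if not reads_per_contig:
        raise ValueError("no contig has reads at this position")
    return max(reads_per_contig, key=lambda contig: len(reads_per_contig[contig]))


# Hypothetical base calls per contig, only to exercise the selection rule.
example = {"chr6": [], "chr6_GL000254v2_alt": ["A", "A", "T"]}
print(pick_contig_with_most_reads(example))  # chr6_GL000254v2_alt
```

Picking by read count alone is a simple heuristic; it ignores mapping quality, so it may be worth revisiting for positions where several contigs have similar coverage.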

### Testing with another rsID, rs3131294, that I know has no reads on the main contig

In [None]:
chrom = "6"
pos = 32212369
reads = get_reads_for_allele(chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads)
f"chrom {chrom} pos {pos} genotype is {genotype}"

### Testing with another rsID, rs143334143

In [None]:
chrom = "6"
pos = 31153649
reads = get_reads_for_allele(chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads, chrom=chrom, sex="male")
f"chrom {chrom} pos {pos} genotype is {genotype}"