# Figuring out how to use reference genome alternate contigs

## Problem

After aligning against the GRCh38 reference genome, some reads were aligned to alternate contigs. This made the allele values disappear from my results, because my BAM file analysis process doesn't take alternate contigs into account.

Namely, I was searching for the value of SNP `rs9380142` in the new alignment file (an arbitrary SNP that I found to be missing). Based on the GRCh38 build it should be visible at `chr6:29,831,013-29,831,027` (pos `29831017`), and based on GRCh37 it can be seen at `chr6:29,798,790-29,798,804`.

The purpose of this notebook is to understand how and when to use values from alternate contigs.

## Confirming the issue

Finding the reads in the alignment file produced with GRCh37:
```
$ samtools view /home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.bam | grep "A00925:63:HY3V3DSXX:1:1204:8314:25629"
A00925:63:HY3V3DSXX:1:1204:8314:25629	99	6	29798723	60	151M	=	29798778	206	CTCCGTCTCTGTCTCAAATTTGTGGTCCACTGAGCTATAACTTACTTCTGTATTAAAATTAGAATCTGAGTGTAAATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGACGATTCCTGAAATTTGAGAGACA	F:FFF:FF,F,FFFFFFFFFFF,FFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFF:::F:FFF,F:FFFFFFFFF:FFFFFF,FF,FF,:FF,F:FF:FF,:FFFFFFFFF,FFFF,FF:,,FFF:,F:FFFFF,:F,F,FFF	RG:Z:1	AS:i:141	XS:i:66	NM:i:2	XQ:i:250
A00925:63:HY3V3DSXX:1:1204:8314:25629	147	6	29798778	60	151M	=	29798723	-206	AAATTAGAATCTGAGTGTATATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGAAGATTCCTGAAATTTGAGAGACAAAATAAATGGAAGACATGAGAACTTTCCACAGTACACGTGTTTCTTGTGCTGATT	,F:FFFF,F:F::FFFFFF:F,FFF:F::F:FFFFFFFFFF,FF:,FFF:FFFFFF,,:FFFFFF,FFFFF,,FFFFFFF:FF:FFFF:FFFFF:FF:FFFFFFFFFFFFFFFFFFFFF,FF,F:,FFFFFFFF::FF:FFFFFFFFFF:F	RG:Z:1	AS:i:141	XS:i:91	NM:i:2	XQ:i:220
```

Finding the same reads in the alignment file produced with GRCh38.p7:
```
$ samtools view /home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam | grep "A00925:63:HY3V3DSXX:1:1204:8314:25629"
A00925:63:HY3V3DSXX:1:1204:8314:25629	99	chr6_GL000254v2_alt	1093417	1	151M	=	1093472	206	CTCCGTCTCTGTCTCAAATTTGTGGTCCACTGAGCTATAACTTACTTCTGTATTAAAATTAGAATCTGAGTGTAAATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGACGATTCCTGAAATTTGAGAGACA	F:FFF:FF,F,FFFFFFFFFFF,FFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFF:::F:FFF,F:FFFFFFFFF:FFFFFF,FF,FF,:FF,F:FF:FF,:FFFFFFFFF,FFFF,FF:,,FFF:,F:FFFFF,:F,F,FFF	AS:i:-3	XS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:128A22	YS:i:-4	YT:Z:CP
A00925:63:HY3V3DSXX:1:1204:8314:25629	147	chr6_GL000254v2_alt	1093472	1	151M	=	1093417	-206	AAATTAGAATCTGAGTGTATATTTACTTTTTCAAATTATTTCCAAGAGAGATTGATGGGTTAATTAAAGGAGAAGATTCCTGAAATTTGAGAGACAAAATAAATGGAAGACATGAGAACTTTCCACAGTACACGTGTTTCTTGTGCTGATT	,F:FFFF,F:F::FFFFFF:F,FFF:F::F:FFFFFFFFFF,FF:,FFF:FFFFFF,,:FFFFFF,FFFFF,,FFFFFFF:FF:FFFF:FFFFF:FF:FFFFFFFFFFFFFFFFFFFFF,FF,F:,FFFFFFFF::FF:FFFFFFFFFF:F	AS:i:-4	XS:i:-4	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:19A131	YS:i:-3	YT:Z:CP
```

Sure enough, the reads sit on contig `chr6_GL000254v2_alt`, not on `chr6` as I had hoped.

SNP `rs9380142` is visible here at `chr6_GL000254v2_alt:1,093,484-1,093,498`, and because the reference has changed, the position no longer shows a variant.
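
The same check can be scripted. Below is a minimal sketch that parses the mandatory SAM columns of one alignment line and flags whether the read landed on an alternate contig. The helper name `summarise_sam_line`, the `_alt` suffix test and the shortened example record (sequence and quality replaced by `*`) are assumptions made for illustration; the contig, position and MAPQ are the ones in the output above.

```python
from typing import NamedTuple


class SamSummary(NamedTuple):
    read_name: str
    contig: str
    pos: int
    mapq: int
    on_alt_contig: bool


def summarise_sam_line(line: str) -> SamSummary:
    """Pick read name, contig (RNAME), position (POS) and MAPQ out of one SAM line."""
    fields = line.rstrip("\n").split("\t")
    contig = fields[2]
    return SamSummary(
        read_name=fields[0],
        contig=contig,
        pos=int(fields[3]),
        mapq=int(fields[4]),
        # assumption: alt contigs follow the UCSC naming seen above and end with "_alt"
        on_alt_contig=contig.endswith("_alt"),
    )


# shortened record: SEQ and QUAL are replaced by "*" to keep the example small
example = "\t".join([
    "A00925:63:HY3V3DSXX:1:1204:8314:25629", "99", "chr6_GL000254v2_alt",
    "1093417", "1", "151M", "=", "1093472", "206", "*", "*",
])
summary = summarise_sam_line(example)
print(summary.contig, summary.pos, summary.mapq, summary.on_alt_contig)
```

The MAPQ of 1 on the alt contig (versus 60 in the GRCh37 file), together with `AS` equal to `XS`, is consistent with a read that fits more than one sequence equally well.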

## Finding a solution

### First go: exploring reference genome assembly metadata

Found some metadata from [here](https://www.gencodegenes.org/human/) and [here](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.22_GRCh38.p7/)

The following files seem to contain the information needed to map alt contigs to chromosomes.

#### Read in the metadata

In [1]:
from functools import reduce
from typing import List, Dict
import pandas as pd
import pysam
from search_your_dna.util import get_file_header_line_number, get_read_values_for_allele, \
    _get_contig, genotype_from_reads

In [2]:
assembly_report_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_report.txt"
assembly_report_file_header_line_number = get_file_header_line_number(assembly_report_file, header_pattern="# Sequence-Name\t") # header starts with `# Sequence-Name`
assembly_report_df = pd.read_csv(assembly_report_file, sep="\t", skiprows=assembly_report_file_header_line_number, dtype=str)
assembly_report_df.columns = assembly_report_df.columns.str.lower()
assembly_report_df.columns = assembly_report_df.columns.str.replace("-", "_")
assembly_report_df.columns = assembly_report_df.columns.str.replace("# ", "")
assembly_report_df

Unnamed: 0,sequence_name,sequence_role,assigned_molecule,assigned_molecule_location/type,genbank_accn,relationship,refseq_accn,assembly_unit,sequence_length,ucsc_style_name
0,1,assembled-molecule,1,Chromosome,CM000663.2,=,NC_000001.11,Primary Assembly,248956422,chr1
1,2,assembled-molecule,2,Chromosome,CM000664.2,=,NC_000002.12,Primary Assembly,242193529,chr2
2,3,assembled-molecule,3,Chromosome,CM000665.2,=,NC_000003.12,Primary Assembly,198295559,chr3
3,4,assembled-molecule,4,Chromosome,CM000666.2,=,NC_000004.12,Primary Assembly,190214555,chr4
4,5,assembled-molecule,5,Chromosome,CM000667.2,=,NC_000005.10,Primary Assembly,181538259,chr5
...,...,...,...,...,...,...,...,...,...,...
520,HSCHR19KIR_FH13_A_HAP_CTG3_1,alt-scaffold,19,Chromosome,KI270931.1,=,NT_187685.1,ALT_REF_LOCI_32,170148,chr19_KI270931v1_alt
521,HSCHR19KIR_FH13_BA2_HAP_CTG3_1,alt-scaffold,19,Chromosome,KI270932.1,=,NT_187686.1,ALT_REF_LOCI_33,215732,chr19_KI270932v1_alt
522,HSCHR19KIR_FH15_A_HAP_CTG3_1,alt-scaffold,19,Chromosome,KI270933.1,=,NT_187687.1,ALT_REF_LOCI_34,170537,chr19_KI270933v1_alt
523,HSCHR19KIR_RP5_B_HAP_CTG3_1,alt-scaffold,19,Chromosome,GL000209.2,=,NT_113949.2,ALT_REF_LOCI_35,177381,chr19_GL000209v2_alt
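
The helper `get_file_header_line_number` is imported from `search_your_dna.util` and its body is not shown in this notebook. The sketch below is a guess at the idea, not that implementation: skip the leading comment lines by finding the first line that starts with the header pattern, then normalise the column names the same way as the cell above. The names `find_header_line_number`, `normalise_column_name` and `read_report` are hypothetical, the two comment lines in the sample are invented, and the sample keeps only a subset of the report's columns; the two data rows are copied from the table above.

```python
import csv
from typing import Dict, List


def find_header_line_number(lines: List[str], header_pattern: str) -> int:
    """Return the 0-based index of the first line that starts with header_pattern."""
    for line_number, line in enumerate(lines):
        if line.startswith(header_pattern):
            return line_number
    raise ValueError(f"no line starts with {header_pattern!r}")


def normalise_column_name(name: str) -> str:
    """Same clean-up as in the cell above: drop '# ', lower-case, '-' -> '_'."""
    return name.replace("# ", "").lower().replace("-", "_")


def read_report(text: str, header_pattern: str) -> List[Dict[str, str]]:
    """Parse a tab-separated report whose header line follows some comment lines."""
    lines = text.splitlines()
    header_line_number = find_header_line_number(lines, header_pattern)
    reader = csv.reader(lines[header_line_number:], delimiter="\t")
    columns = [normalise_column_name(name) for name in next(reader)]
    return [dict(zip(columns, row)) for row in reader]


sample_report = "\n".join([
    "# Assembly name:  GRCh38.p7",            # invented comment line
    "# (further comment lines omitted)",      # invented comment line
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tGenBank-Accn\tUCSC-style-name",
    "1\tassembled-molecule\t1\tCM000663.2\tchr1",
    "HSCHR19KIR_RP5_B_HAP_CTG3_1\talt-scaffold\t19\tGL000209.2\tchr19_GL000209v2_alt",
])
rows = read_report(sample_report, header_pattern="# Sequence-Name\t")
print(rows[1]["ucsc_style_name"])
```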


In [3]:
assembly_regions_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_regions.txt"
assembly_regions_file_header_line_number = get_file_header_line_number(assembly_regions_file, header_pattern="# Region-Name\t") # header starts with `# Region-Name`
assembly_regions_df = pd.read_csv(assembly_regions_file, sep="\t", skiprows=assembly_regions_file_header_line_number, dtype=str)
assembly_regions_df.columns = assembly_regions_df.columns.str.lower()
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("-", "_")
assembly_regions_df.columns = assembly_regions_df.columns.str.replace("# ", "")
assembly_regions_df["chromosome_start"] = assembly_regions_df["chromosome_start"].astype('float').astype("int64")
assembly_regions_df["chromosome_stop"] = assembly_regions_df["chromosome_stop"].astype('float').astype("int64")
assembly_regions_df

Unnamed: 0,region_name,chromosome,chromosome_start,chromosome_stop,scaffold_role,scaffold_genbank_accn,scaffold_refseq_accn,assembly_unit
0,REGION108,1,2448811,2791270,alt-scaffold,KI270762.1,NT_187515.1,ALT_REF_LOCI_1
1,PRAME_REGION_1,1,12818488,13312803,alt-scaffold,KI270766.1,NT_187517.1,ALT_REF_LOCI_1
2,PRAME_REGION_1,1,12818488,13312803,fix-patch,KQ031383.1,NW_012132914.1,PATCHES
3,PRAME_REGION_1,1,12818488,13312803,novel-patch,KQ983255.1,NW_015495298.1,PATCHES
4,REGION200,1,17157487,17460319,fix-patch,KN538361.1,NW_011332688.1,PATCHES
...,...,...,...,...,...,...,...,...
355,PAR#1,Y,10001,2781479,PAR,na,na,na
356,CENY,Y,10316945,10544039,CEN,na,na,na
357,REGION198,Y,56821510,56887902,fix-patch,KN196487.1,NW_009646209.1,PATCHES
358,PAR#2,Y,56887903,57217415,PAR,na,na,na


##### Aggregate assembly metadata

In [4]:
assembly_metadata_df = pd.merge(assembly_regions_df, assembly_report_df, left_on="scaffold_genbank_accn", right_on="genbank_accn")
assembly_metadata_df

Unnamed: 0,region_name,chromosome,chromosome_start,chromosome_stop,scaffold_role,scaffold_genbank_accn,scaffold_refseq_accn,assembly_unit_x,sequence_name,sequence_role,assigned_molecule,assigned_molecule_location/type,genbank_accn,relationship,refseq_accn,assembly_unit_y,sequence_length,ucsc_style_name
0,REGION108,1,2448811,2791270,alt-scaffold,KI270762.1,NT_187515.1,ALT_REF_LOCI_1,HSCHR1_1_CTG3,alt-scaffold,1,Chromosome,KI270762.1,=,NT_187515.1,ALT_REF_LOCI_1,354444,chr1_KI270762v1_alt
1,PRAME_REGION_1,1,12818488,13312803,alt-scaffold,KI270766.1,NT_187517.1,ALT_REF_LOCI_1,HSCHR1_2_CTG3,alt-scaffold,1,Chromosome,KI270766.1,=,NT_187517.1,ALT_REF_LOCI_1,256271,chr1_KI270766v1_alt
2,PRAME_REGION_1,1,12818488,13312803,fix-patch,KQ031383.1,NW_012132914.1,PATCHES,HG1342_HG2282_PATCH,fix-patch,1,Chromosome,KQ031383.1,=,NW_012132914.1,PATCHES,467143,na
3,PRAME_REGION_1,1,12818488,13312803,novel-patch,KQ983255.1,NW_015495298.1,PATCHES,HSCHR1_5_CTG3,novel-patch,1,Chromosome,KQ983255.1,=,NW_015495298.1,PATCHES,278659,na
4,REGION200,1,17157487,17460319,fix-patch,KN538361.1,NW_011332688.1,PATCHES,HG2095_PATCH,fix-patch,1,Chromosome,KN538361.1,=,NW_011332688.1,PATCHES,305542,na
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
326,REGION187,X,319338,601516,alt-scaffold,KI270880.1,NT_187634.1,ALT_REF_LOCI_1,HSCHRX_1_CTG3,alt-scaffold,X,Chromosome,KI270880.1,=,NT_187634.1,ALT_REF_LOCI_1,284869,chrX_KI270880v1_alt
327,REGION187,X,319338,601516,alt-scaffold,KI270913.1,NT_187667.1,ALT_REF_LOCI_2,HSCHRX_2_CTG3,alt-scaffold,X,Chromosome,KI270913.1,=,NT_187667.1,ALT_REF_LOCI_2,274009,chrX_KI270913v1_alt
328,REGION188,X,79965154,80097082,alt-scaffold,KI270881.1,NT_187635.1,ALT_REF_LOCI_1,HSCHRX_2_CTG12,alt-scaffold,X,Chromosome,KI270881.1,=,NT_187635.1,ALT_REF_LOCI_1,144206,chrX_KI270881v1_alt
329,REGION198,Y,56821510,56887902,fix-patch,KN196487.1,NW_009646209.1,PATCHES,HG2062_PATCH,fix-patch,Y,Chromosome,KN196487.1,=,NW_009646209.1,PATCHES,101150,na
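
In the merged table the `fix-patch` and `novel-patch` rows carry `na` as `ucsc_style_name`, so that column gives no contig name for them. A minimal sketch of the same accession join in plain Python, keeping only rows with a usable name, is below. The function name `contigs_by_region` and the rule of dropping `na` rows are assumptions for illustration; the rows are copied from the tables above.

```python
from typing import Dict, List, Tuple

# (region_name, scaffold_role, scaffold_genbank_accn) rows from the regions table
regions: List[Tuple[str, str, str]] = [
    ("REGION108", "alt-scaffold", "KI270762.1"),
    ("PRAME_REGION_1", "alt-scaffold", "KI270766.1"),
    ("PRAME_REGION_1", "fix-patch", "KQ031383.1"),
    ("PRAME_REGION_1", "novel-patch", "KQ983255.1"),
]

# genbank_accn -> ucsc_style_name from the assembly report
ucsc_name_by_accession: Dict[str, str] = {
    "KI270762.1": "chr1_KI270762v1_alt",
    "KI270766.1": "chr1_KI270766v1_alt",
    "KQ031383.1": "na",
    "KQ983255.1": "na",
}


def contigs_by_region(
    regions: List[Tuple[str, str, str]],
    ucsc_name_by_accession: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group usable contig names by region, skipping unknown accessions and 'na' names."""
    result: Dict[str, List[str]] = {}
    for region_name, _role, accession in regions:
        name = ucsc_name_by_accession.get(accession)  # like an inner join on the accession
        if name is None or name == "na":
            continue
        result.setdefault(region_name, []).append(name)
    return result


print(contigs_by_region(regions, ucsc_name_by_accession))
```

Whether patch scaffolds should be searched at all depends on which sequences the alignment reference actually contained, which this notebook does not establish.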


In [6]:
bam_file = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam"
alignment_data = pysam.AlignmentFile(bam_file, "rb")

#### Find mapping from chrom+pos to contigs

In the end I want to be able to look up a chrom+pos in the alignment file while taking alt contigs into account.
That means I need a function that takes a chrom+pos and returns every contig that covers it.
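
As a minimal sketch of that function's contract, independent of pandas: given region rows, return each scaffold whose region covers the position, together with the offset of the position from the region start. `Region`, `contigs_covering` and the two-row sample are illustrative assumptions; the coordinates are copied from the regions table in this notebook.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    chromosome: str
    start: int  # chromosome_start
    stop: int   # chromosome_stop
    scaffold_accession: str


def contigs_covering(chrom: str, pos: int, regions: List[Region]) -> List[Tuple[str, int]]:
    """Return (scaffold_accession, pos - region start) for every region covering chrom:pos."""
    return [
        (region.scaffold_accession, pos - region.start)
        for region in regions
        if region.chromosome == chrom and region.start <= pos <= region.stop
    ]


regions = [
    Region("1", 2448811, 2791270, "KI270762.1"),
    Region("6", 28510120, 33480577, "GL000254.2"),
]
print(contigs_covering("6", 29831017, regions))
```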

In [7]:
chrom = "6"
pos = 29831017

# Regions on this chromosome whose inclusive [start, stop] range covers pos.
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below.
assembly_regions_for_chr_df = assembly_regions_df[assembly_regions_df["chromosome"] == chrom]
assembly_regions_for_chr_pos_df = assembly_regions_for_chr_df[(assembly_regions_for_chr_df["chromosome_start"] <= pos) & (assembly_regions_for_chr_df["chromosome_stop"] >= pos)].copy()
# Offset of pos from the region start, used here as the position on each alt contig
assembly_regions_for_chr_pos_df["pos_on_contig"] = pos - assembly_regions_for_chr_pos_df["chromosome_start"]
assembly_regions_for_chr_pos_df


Unnamed: 0,region_name,chromosome,chromosome_start,chromosome_stop,scaffold_role,scaffold_genbank_accn,scaffold_refseq_accn,assembly_unit,pos_on_contig
98,MHC,6,28510120,33480577,alt-scaffold,GL000250.2,NT_167244.2,ALT_REF_LOCI_1,1320897
99,MHC,6,28510120,33480577,alt-scaffold,GL000251.2,NT_113891.3,ALT_REF_LOCI_2,1320897
100,MHC,6,28510120,33480577,alt-scaffold,GL000252.2,NT_167245.2,ALT_REF_LOCI_3,1320897
101,MHC,6,28510120,33480577,alt-scaffold,GL000253.2,NT_167246.2,ALT_REF_LOCI_4,1320897
102,MHC,6,28510120,33480577,alt-scaffold,GL000254.2,NT_167247.2,ALT_REF_LOCI_5,1320897
103,MHC,6,28510120,33480577,alt-scaffold,GL000255.2,NT_167248.2,ALT_REF_LOCI_6,1320897
104,MHC,6,28510120,33480577,alt-scaffold,GL000256.2,NT_167249.2,ALT_REF_LOCI_7,1320897
105,MHC,6,28510120,33480577,alt-scaffold,KI270758.1,NT_187692.1,ALT_REF_LOCI_8,1320897


Map the contigs to the names used in the alignment file:

In [8]:
alt_contig_metadata = pd.merge(assembly_regions_for_chr_pos_df, assembly_report_df, left_on="scaffold_genbank_accn", right_on="genbank_accn")
alt_contig_metadata.rename(columns={"ucsc_style_name": "contig"}, inplace=True)
alt_contig_metadata

Unnamed: 0,region_name,chromosome,chromosome_start,chromosome_stop,scaffold_role,scaffold_genbank_accn,scaffold_refseq_accn,assembly_unit_x,pos_on_contig,sequence_name,sequence_role,assigned_molecule,assigned_molecule_location/type,genbank_accn,relationship,refseq_accn,assembly_unit_y,sequence_length,contig
0,MHC,6,28510120,33480577,alt-scaffold,GL000250.2,NT_167244.2,ALT_REF_LOCI_1,1320897,HSCHR6_MHC_APD_CTG1,alt-scaffold,6,Chromosome,GL000250.2,=,NT_167244.2,ALT_REF_LOCI_1,4672374,chr6_GL000250v2_alt
1,MHC,6,28510120,33480577,alt-scaffold,GL000251.2,NT_113891.3,ALT_REF_LOCI_2,1320897,HSCHR6_MHC_COX_CTG1,alt-scaffold,6,Chromosome,GL000251.2,=,NT_113891.3,ALT_REF_LOCI_2,4795265,chr6_GL000251v2_alt
2,MHC,6,28510120,33480577,alt-scaffold,GL000252.2,NT_167245.2,ALT_REF_LOCI_3,1320897,HSCHR6_MHC_DBB_CTG1,alt-scaffold,6,Chromosome,GL000252.2,=,NT_167245.2,ALT_REF_LOCI_3,4604811,chr6_GL000252v2_alt
3,MHC,6,28510120,33480577,alt-scaffold,GL000253.2,NT_167246.2,ALT_REF_LOCI_4,1320897,HSCHR6_MHC_MANN_CTG1,alt-scaffold,6,Chromosome,GL000253.2,=,NT_167246.2,ALT_REF_LOCI_4,4677643,chr6_GL000253v2_alt
4,MHC,6,28510120,33480577,alt-scaffold,GL000254.2,NT_167247.2,ALT_REF_LOCI_5,1320897,HSCHR6_MHC_MCF_CTG1,alt-scaffold,6,Chromosome,GL000254.2,=,NT_167247.2,ALT_REF_LOCI_5,4827813,chr6_GL000254v2_alt
5,MHC,6,28510120,33480577,alt-scaffold,GL000255.2,NT_167248.2,ALT_REF_LOCI_6,1320897,HSCHR6_MHC_QBL_CTG1,alt-scaffold,6,Chromosome,GL000255.2,=,NT_167248.2,ALT_REF_LOCI_6,4606388,chr6_GL000255v2_alt
6,MHC,6,28510120,33480577,alt-scaffold,GL000256.2,NT_167249.2,ALT_REF_LOCI_7,1320897,HSCHR6_MHC_SSTO_CTG1,alt-scaffold,6,Chromosome,GL000256.2,=,NT_167249.2,ALT_REF_LOCI_7,4929269,chr6_GL000256v2_alt
7,MHC,6,28510120,33480577,alt-scaffold,KI270758.1,NT_187692.1,ALT_REF_LOCI_8,1320897,HSCHR6_8_CTG1,alt-scaffold,6,Chromosome,KI270758.1,=,NT_187692.1,ALT_REF_LOCI_8,76752,chr6_KI270758v1_alt
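
For the alt scaffolds shown in this table, the contig name looks like it can be derived directly from the chromosome and the GenBank accession (`GL000254.2` on chromosome 6 becomes `chr6_GL000254v2_alt`). A small sketch of that pattern follows. `ucsc_alt_name` is a hypothetical helper, and the rule is inferred from the rows above rather than taken from a specification, so the `ucsc_style_name` column remains the safer source.

```python
def ucsc_alt_name(chromosome: str, genbank_accession: str) -> str:
    """Build a UCSC-style alt contig name, e.g. ("6", "GL000254.2") -> "chr6_GL000254v2_alt".

    Assumption: "chr" + chromosome + "_" + accession with "." replaced by "v" + "_alt".
    """
    accession, version = genbank_accession.split(".")
    return f"chr{chromosome}_{accession}v{version}_alt"


# (chromosome, genbank_accn, ucsc_style_name) copied from the merged table
rows = [
    ("6", "GL000250.2", "chr6_GL000250v2_alt"),
    ("6", "GL000254.2", "chr6_GL000254v2_alt"),
    ("6", "KI270758.1", "chr6_KI270758v1_alt"),
]
for chromosome, accession, expected in rows:
    print(ucsc_alt_name(chromosome, accession) == expected, expected)
```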


In [9]:
res = alt_contig_metadata[["pos_on_contig", "contig"]]
res.attrs["chrom"] = chrom
res.attrs["pos"] = pos
res

Unnamed: 0,pos_on_contig,contig
0,1320897,chr6_GL000250v2_alt
1,1320897,chr6_GL000251v2_alt
2,1320897,chr6_GL000252v2_alt
3,1320897,chr6_GL000253v2_alt
4,1320897,chr6_GL000254v2_alt
5,1320897,chr6_GL000255v2_alt
6,1320897,chr6_GL000256v2_alt
7,1320897,chr6_KI270758v1_alt
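
One thing worth checking is whether `pos - chromosome_start` really is the coordinate on the alt contig. The arithmetic below compares the offset computed for `chr6_GL000254v2_alt` with the place where the read pair from the beginning of the notebook was aligned on that contig. All numbers are taken from the outputs above; the variable names are illustrative.

```python
pos = 29831017             # rs9380142 on chr6 (GRCh38)
region_start = 28510120    # chromosome_start of the MHC region
offset_on_contig = pos - region_start

# read pair A00925:63:HY3V3DSXX:1:1204:8314:25629 in the GRCh38.p7 alignment
first_mate_start = 1093417
second_mate_start = 1093472
read_length = 151          # CIGAR 151M
pair_end = second_mate_start + read_length - 1

offset_inside_pair = first_mate_start <= offset_on_contig <= pair_end
print("offset from region start:", offset_on_contig)
print("read pair covers:", first_mate_start, "-", pair_end)
print("offset falls inside the read pair:", offset_inside_pair)
```

The offset lies roughly 227 kb away from the read pair that covers the SNP in the GRCh37 alignment. That suggests the alt contig may not line up base for base with the start of the region, so the simple subtraction might only be an approximation, and an explicit alt-to-primary alignment could be needed for exact coordinates.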


#### Test getting alignment data for `rs9380142`


In [21]:
def get_all_allele_contig_pos_values(
        chrom: str,
        pos: int,
        alignment_data: pysam.AlignmentFile,
        assembly_metadata: pd.DataFrame
) -> pd.DataFrame:
    """
    :param assembly_metadata: data frame with columns: "chromosome" (str), "chromosome_start" (int),
        "chromosome_stop" (int) and "ucsc_style_name" (str as header name in the alignment file)
    :return: data frame with two columns: "pos_on_contig" (int), "contig" (str)
    """
    # keep the regions of this chromosome that cover pos; copy so the caller's frame is not modified
    covers_pos = (
        (assembly_metadata["chromosome"] == chrom)
        & (assembly_metadata["chromosome_start"] <= pos)
        & (assembly_metadata["chromosome_stop"] >= pos)
    )
    assembly_metadata = assembly_metadata[covers_pos].copy()
    assembly_metadata["pos_on_contig"] = pos - assembly_metadata["chromosome_start"]
    assembly_metadata = assembly_metadata.rename(columns={"ucsc_style_name": "contig"})

    main_contig = _get_contig(alignment_data=alignment_data, chrom=chrom)

    # DataFrame.append was removed in pandas 2.0, so add the primary contig with pd.concat
    main_contig_row = pd.DataFrame({"pos_on_contig": [pos], "contig": [main_contig]})
    contig_positions = pd.concat(
        [assembly_metadata[["pos_on_contig", "contig"]], main_contig_row], ignore_index=True)
    contig_positions["pos_on_contig"] = contig_positions["pos_on_contig"].astype("int64")

    return contig_positions



pos = 29831017

def get_reads_for_allele(chrom, pos, alignment_data, assembly_metadata):
    contig_positions = get_all_allele_contig_pos_values(
        chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata
    )

    res = []
    for index, row in contig_positions.iterrows():
        pos_on_contig = row["pos_on_contig"]
        contig = row["contig"]
        pos_reads_per_contig = get_read_values_for_allele(alignment_data=alignment_data, pos=pos_on_contig, contig=contig)
        if len(pos_reads_per_contig) == 0:
            print(f"debug: no reads on contig '{contig}'")
            continue
        assert len(pos_reads_per_contig) == 1, f"should have reads for a single allele, but got {len(pos_reads_per_contig)}"
        reads = list(pos_reads_per_contig.values())[0]
        print("debug:", contig, pos_reads_per_contig)
        res.append(reads)

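    def return_longer_list(a: List, b: List) -> List:
        return b if len(b) > len(a) else a

    # heuristic: keep the reads from the contig with the most reads at this position;
    # the [] initial value avoids a TypeError when no contig has any reads
    most_relevant_reads = reduce(return_longer_list, res, [])
    return most_relevant_reads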

reads = get_reads_for_allele(chrom="6", pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads)
f"chrom {chrom} pos {pos} genotype is {genotype}"

debug: no reads on contig 'chr6_GL000250v2_alt'
debug: chr6_GL000251v2_alt defaultdict(None, {1320897: ['G', 'G']})
debug: chr6_GL000252v2_alt defaultdict(None, {1320897: ['A', 'A']})
debug: chr6_GL000253v2_alt defaultdict(None, {1320897: ['C', 'C', 'C', 'C', 'C']})
debug: chr6_GL000254v2_alt defaultdict(None, {1320897: ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G']})
debug: chr6_GL000255v2_alt defaultdict(None, {1320897: ['T']})
debug: chr6_GL000256v2_alt defaultdict(None, {1320897: ['C']})
debug: no reads on contig 'chr6_KI270758v1_alt'
debug: no reads on contig 'chr6'


'chrom 6 pos 29831017 genotype is GG'
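
`genotype_from_reads` is imported from `search_your_dna.util` and is not shown here. To make the last two steps concrete, here is a hypothetical stand-in: pick the contig with the most reads (the same heuristic as `return_longer_list`) and call a naive genotype from the base counts. `pick_deepest_contig`, `naive_genotype` and the 20% minor-allele threshold are assumptions for illustration, not the behaviour of the real helper; the read lists are the ones printed in the debug output above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def pick_deepest_contig(reads_by_contig: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Return the (contig, reads) pair with the most reads; the first one wins a tie."""
    return max(reads_by_contig.items(), key=lambda item: len(item[1]))


def naive_genotype(reads: List[str], min_minor_fraction: float = 0.2) -> str:
    """Call a diploid genotype from observed bases.

    Heterozygous if the second most common base reaches min_minor_fraction
    of the reads, otherwise homozygous for the most common base.
    """
    if not reads:
        raise ValueError("no reads to genotype")
    counts = Counter(reads).most_common(2)
    major = counts[0][0]
    if len(counts) == 2 and counts[1][1] / len(reads) >= min_minor_fraction:
        return "".join(sorted([major, counts[1][0]]))
    return major * 2


reads_by_contig = {
    "chr6_GL000251v2_alt": ["G"] * 2,
    "chr6_GL000252v2_alt": ["A"] * 2,
    "chr6_GL000253v2_alt": ["C"] * 5,
    "chr6_GL000254v2_alt": ["G"] * 19,
    "chr6_GL000255v2_alt": ["T"],
    "chr6_GL000256v2_alt": ["C"],
}
contig, reads = pick_deepest_contig(reads_by_contig)
print(contig, naive_genotype(reads))
```

Picking the deepest contig discards the reads on every other contig, so a position where the two haplotypes aligned to different contigs could be reported as homozygous; pooling reads across contigs would be an alternative to consider.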

### Testing with another rsID, rs3131294, that I know has no reads on the primary contig

In [20]:
chrom = "6"
pos = 32212369
reads = get_reads_for_allele(chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads)
f"chrom {chrom} pos {pos} genotype is {genotype}"

debug: chr6_GL000250v2_alt defaultdict(None, {3702249: ['A', 'A', 'A']})
debug: no reads on contig 'chr6_GL000251v2_alt'
debug: no reads on contig 'chr6_GL000252v2_alt'
debug: chr6_GL000253v2_alt defaultdict(None, {3702249: ['T']})
debug: chr6_GL000254v2_alt defaultdict(None, {3702249: ['A', 'A', 'A', 'A', 'A', 'A', 'A']})
debug: chr6_GL000255v2_alt defaultdict(None, {3702249: ['T', 'T', 'T', 'T', 'T']})
debug: chr6_GL000256v2_alt defaultdict(None, {3702249: ['G', 'G', 'G']})
debug: no reads on contig 'chr6_KI270758v1_alt'
debug: no reads on contig 'chr6'


'chrom 6 pos 32212369 genotype is AA'

### Testing with another rsID, rs143334143

In [19]:
chrom = "6"
pos = 31153649
reads = get_reads_for_allele(chrom=chrom, pos=pos, alignment_data=alignment_data, assembly_metadata=assembly_metadata_df)
genotype = genotype_from_reads(reads)
f"chrom {chrom} pos {pos} genotype is {genotype}"

debug: no reads on contig 'chr6_GL000250v2_alt'
debug: chr6_GL000251v2_alt defaultdict(None, {2643529: ['C', 'C', 'C', 'C']})
debug: no reads on contig 'chr6_GL000252v2_alt'
debug: chr6_GL000253v2_alt defaultdict(None, {2643529: ['C', 'C', 'C']})
debug: chr6_GL000254v2_alt defaultdict(None, {2643529: ['G', 'G', 'G', 'G']})
debug: chr6_GL000255v2_alt defaultdict(None, {2643529: ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G']})
debug: chr6_GL000256v2_alt defaultdict(None, {2643529: ['C']})
debug: no reads on contig 'chr6_KI270758v1_alt'
debug: chr6 defaultdict(None, {31153649: ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G']})


'chrom 6 pos 31153649 genotype is GG'