# How to select alt contigs to use

## What is a contig?

> A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequence and to overlapping physical segments (fragments) contained in clones depending on the context. [source: wiki](https://en.wikipedia.org/wiki/Contig)

The available contigs come from the reference genome and are listed in the headers of the alignment file and the VCF file:

```python
# List the contigs named in an alignment file's header
import pysam
alignment = pysam.AlignmentFile("/path/to/alignment/file.bam", "rb")

print("contigs:", alignment.header.references)
```
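
The same list can be read from a VCF header, where each contig appears as a `##contig=<ID=...>` line. A minimal sketch using only the standard library; `list_vcf_contigs` is a hypothetical helper (not part of `search_your_dna`), and the header lines below are illustrative rather than taken from a real file:

```python
import re

# Matches header lines such as "##contig=<ID=chr1,length=248956422>"
CONTIG_LINE = re.compile(r"^##contig=<ID=([^,>]+)")


def list_vcf_contigs(lines):
    """Return contig IDs from VCF header lines, stopping at the first data line."""
    contigs = []
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends where the records start
        match = CONTIG_LINE.match(line)
        if match:
            contigs.append(match.group(1))
    return contigs


# Illustrative header; the alt contig length is a placeholder value
example_header = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1,length=248956422>\n",
    "##contig=<ID=chrX_KI270913v1_alt,length=1000>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",
]
print("contigs:", list_vcf_contigs(example_header))
```

For a real file, the lines could come from `open(path)` or, for a compressed VCF, from `gzip.open(path, "rt")`.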

Using only the main chromosome contigs (`chr1`, `chr2`, ..., `chr22`, `chrX`, `chrY`, `chrM`) is not enough: a region of a main contig may have lower read coverage than one of the alternative contigs for the same region, which can lead to incorrect genotype values at some chromosome positions. To know which alternative contig to use for an individual, select contigs from the alignment based on their read coverage.

A simple way to do this is to count the reads aligned to a given region in the main contig and in each of its alternative contigs. This works because the same read can be aligned to multiple contigs, so every contig gets its own count for the region.

This notebook includes code to generate a metadata file which lists read counts for each region and contig.
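
The selection rule described above can be sketched on a plain dictionary shaped like the metadata file: region name to contig name to read count, with the main contig stored under `"main"`. `pick_highest_coverage_contigs` is a hypothetical helper and not necessarily how `calc_alt_contigs_to_use` works; the region names, contig names, and counts are invented:

```python
def pick_highest_coverage_contigs(region_contig_read_counts):
    """For each region, return the alt contig that has more reads than the main contig.

    Regions where the main contig has at least as many reads as every
    alternative contig are left out of the result.
    """
    selected = {}
    for region, counts in region_contig_read_counts.items():
        main_count = counts.get("main", 0)
        alt_counts = {c: n for c, n in counts.items() if c != "main"}
        if not alt_counts:
            continue  # no alternative contigs for this region
        best_contig = max(alt_counts, key=alt_counts.get)
        if alt_counts[best_contig] > main_count:
            selected[region] = best_contig
    return selected


# Invented example data
example_counts = {
    "REGION_A": {"main": 120, "chr1_example1_alt": 300, "chr1_example2_alt": 80},
    "REGION_B": {"main": 500, "chr2_example1_alt": 10},
}
print(pick_highest_coverage_contigs(example_counts))
```

Raw counts ignore the difference in length between an alt contig and the main-contig region, so normalizing by length would be a natural refinement of this sketch.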

## Shortcomings

1. I haven't found good online tutorials that show how alternative contigs should be handled, and this approach has been validated with only one bioinformatician.
2. It is unclear what to do with unlocalized contigs.
3. The header of the output VCF file (the one using higher-coverage contigs) doesn't record which transformation was applied.

In [None]:
%reload_ext autoreload
%autoreload 2

import json
from pathlib import Path
import pandas as pd
import pysam
from search_your_dna.hg_util import get_assembly_metadata_df

from search_your_dna.util import read_raw_zipped_vcf_file, get_vcf_file_header_line_number, calc_alt_contigs_to_use

## Inputs

In [None]:
project_root_path = Path("/home/s/src/search_your_dna/")
project_root_path.mkdir(parents=True, exist_ok=True)

alignment_bam_file = "data/my_genome_data/GFX0237425.GRCh38.p7.bam"
alignment_data = pysam.AlignmentFile(alignment_bam_file, "rb")

vcf_uncompressed_file = "data/GFX0237425.GRCh38.p7.vcf"
vcf_file = "data/GFX0237425.GRCh38.p7.vcf.gz"
vcf_tabix_file = "data/GFX0237425.GRCh38.p7.vcf.gz"

# metadata files can be obtained from reference genome build's file server: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.22_GRCh38.p7/
assembly_report_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_report.txt"
assembly_regions_file = "data/grch38.p7/GCA_000001405.22_GRCh38.p7_assembly_regions.txt"

## Outputs

In [None]:
region_contig_read_counts_file = "data/region_contig_read_counts.json"
vcf_file_with_high_coverage_contigs_used = "data/GFX0237425.GRCh38.p7.using_high_coverage_alt_contigs.vcf"

## Calculate alignment metadata to select contig with highest read coverage

### Read in the alignment and assembly metadata

In [None]:
assembly_metadata_df = get_assembly_metadata_df(assembly_report_file=assembly_report_file, assembly_regions_file=assembly_regions_file)
assembly_metadata_df

In [None]:
contigs = set(alignment_data.header.references)
ucsc_style_names = set(assembly_metadata_df["ucsc_style_name"].to_list())

f"Total contigs: {len(contigs)}. Of these, {len(contigs - ucsc_style_names)} are missing from the assembly metadata (e.g. unlocalized contigs with no region assigned) and are excluded."
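
The contigs left over after this subtraction can be grouped by the UCSC naming convention, in which unlocalized sequences end in `_random`, unplaced ones start with `chrUn_`, alternate loci end in `_alt`, and fix patches end in `_fix`. A rough sketch; `classify_contig` and `MAIN_CONTIGS` are hypothetical names, and anything that matches none of the patterns is simply labelled `"other"`:

```python
MAIN_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def classify_contig(name):
    """Classify a UCSC-style GRCh38 contig name by its naming pattern."""
    if name in MAIN_CONTIGS:
        return "main"
    if name.startswith("chrUn_"):
        return "unplaced"
    if name.endswith("_random"):
        return "unlocalized"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    return "other"


for contig_name in ["chr1", "chrX_KI270913v1_alt", "chr1_KI270706v1_random", "chrUn_KI270302v1"]:
    print(contig_name, "->", classify_contig(contig_name))
```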

### Get alt contig read counts per region

In [None]:
%%time

alt_contig_region_read_lengths = {}
for region, contigs_df in assembly_metadata_df[["region_name", "ucsc_style_name"]].groupby(by="region_name"):
    alt_contig_region_read_lengths[region] = {}
    for contig in contigs_df["ucsc_style_name"].to_list():
        # Count every read aligned to this alt contig without building a list in memory
        reads = alignment_data.fetch(contig)
        alt_contig_region_read_lengths[region][contig] = sum(1 for _ in reads)

### Get main contig read values per region

In [None]:
region_based_chrom_start_stop_df = assembly_metadata_df[["region_name","chromosome","chromosome_start","chromosome_stop"]].groupby(by="region_name")

main_contig_region_reads = {}
for region, chrom_start_stop_df in region_based_chrom_start_stop_df:
    contig = "chr" + chrom_start_stop_df.iloc[0,:]["chromosome"]
    start = chrom_start_stop_df.iloc[0,:]["chromosome_start"]
    stop = chrom_start_stop_df.iloc[0,:]["chromosome_stop"]
    reads = alignment_data.fetch(contig=contig,start=start, stop=stop)
    main_contig_region_reads[region] = list(reads)
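
One detail worth checking here is the coordinate convention. `pysam`'s `fetch` takes 0-based, half-open intervals, while the assembly regions file is assumed in this sketch to use 1-based, inclusive coordinates (an assumption that should be verified against the NCBI file). Under that assumption the conversion would look like this; `to_pysam_interval` is a hypothetical helper:

```python
def to_pysam_interval(start_1based, stop_1based):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    if start_1based < 1 or stop_1based < start_1based:
        raise ValueError("expected 1 <= start <= stop")
    # Only the start shifts: the inclusive 1-based stop equals the exclusive 0-based stop
    return start_1based - 1, stop_1based


start, stop = to_pysam_interval(1001, 2000)
print(start, stop)
```

The difference only affects reads at the region boundaries, so it is unlikely to change which contig has the highest count.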


### Combine main contig and alternative contig read counts per region

In [None]:
aggregated_region_contig_reads = {}
for region in alt_contig_region_read_lengths:
    aggregated_region_contig_reads[region] = {}
    aggregated_region_contig_reads[region]["main"] = len(main_contig_region_reads[region])
    for contig, read_lengths in alt_contig_region_read_lengths[region].items():
        aggregated_region_contig_reads[region][contig] = read_lengths

### Store the per-region contig read counts in a file

In [None]:
with open(project_root_path / region_contig_read_counts_file, "w") as f:
    json.dump(aggregated_region_contig_reads, f, indent=2)

## Create a new vcf file using higher coverage alternative contigs

In [None]:
raw_vcf_df = read_raw_zipped_vcf_file(project_root_path / vcf_file)

In [None]:
raw_vcf_df[raw_vcf_df["#CHROM"] == "chrX_KI270913v1_alt"].head()

In [None]:
with open(project_root_path / region_contig_read_counts_file, "r") as f:
    region_contig_read_counts = json.load(f)

alt_contigs_to_use = calc_alt_contigs_to_use(region_contig_read_counts, assembly_metadata_df)
# df with columns: "chrom", "start", "stop", "contig", "region"
alt_contigs_to_use = alt_contigs_to_use.sort_values(by=["chrom", "start"])
alt_contigs_to_use.head()

### Remove main contig parts that shouldn't be used

In [None]:
%%time

for index, row in list(alt_contigs_to_use.iterrows()):
    length_before = raw_vcf_df.shape[0]
    # print(row["contig"], "vcf length before", length_before)
    raw_vcf_df = raw_vcf_df[~((raw_vcf_df["#CHROM"] == "chr" + row["chrom"]) & (row["start"] <= raw_vcf_df["POS"]) & (raw_vcf_df["POS"] <= row["stop"]))]
    # print("\t\tlength after", raw_vcf_df.shape[0])
    # print("\t\t\tremoved snps", length_before - raw_vcf_df.shape[0])

### Move variants from the selected alt contigs onto the main contig

In [None]:
%%time

def transform_row(contig_start):
    def _transform(row):
        row["#CHROM"] = row["#CHROM"].split("_")[0]
        row["POS"] = contig_start + row["POS"]
        return row
    return _transform

for index, row in list(alt_contigs_to_use.iterrows()):
    length_before = raw_vcf_df.shape[0]
    # print(row["contig"], "vcf length before", length_before)
    alt_contig_rows_in_raw_vcf_df = raw_vcf_df[raw_vcf_df["#CHROM"] == row["contig"]]
    raw_vcf_df = raw_vcf_df[raw_vcf_df["#CHROM"] != row["contig"]]
    updated_alt_contig_rows_in_raw_vcf_df = alt_contig_rows_in_raw_vcf_df.apply(transform_row(row["start"]), axis="columns")
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    raw_vcf_df = pd.concat([raw_vcf_df, updated_alt_contig_rows_in_raw_vcf_df], ignore_index=True)
    # print("\t\talt_contig length", updated_alt_contig_rows_in_raw_vcf_df.shape[0])
    # print("\t\tlength after", raw_vcf_df.shape[0])
    # print("\t\t\tlost snps", raw_vcf_df.shape[0] - length_before)
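
The transformation above amounts to an offset lift: the alt contig name is reduced to its chromosome prefix and the region start is added to the position. The same rule on plain values, as a sketch; `lift_to_main_contig` is a hypothetical name and the region start below is made up. It treats the alt contig as a gap-free stand-in for the main-contig region, which is an approximation:

```python
def lift_to_main_contig(contig, pos, region_start):
    """Map a position on an alt contig to its main chromosome using a fixed offset."""
    chrom = contig.split("_")[0]  # "chrX_KI270913v1_alt" -> "chrX"
    return chrom, region_start + pos


# Made-up region start, for illustration only
print(lift_to_main_contig("chrX_KI270913v1_alt", 250, 100_000))
```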


### Remove all other alt contigs

In [None]:
CHROM_TO_KEEP = [
    "chr1",
    "chr2",
    "chr3",
    "chr4",
    "chr5",
    "chr6",
    "chr7",
    "chr8",
    "chr9",
    "chr10",
    "chr11",
    "chr12",
    "chr13",
    "chr14",
    "chr15",
    "chr16",
    "chr17",
    "chr18",
    "chr19",
    "chr20",
    "chr21",
    "chr22",
    "chrX",
    "chrY",
    "chrM",
]

print("\t\tlength before", raw_vcf_df.shape[0])
raw_vcf_df = raw_vcf_df[raw_vcf_df["#CHROM"].isin(CHROM_TO_KEEP)]
print("\t\tlength after", raw_vcf_df.shape[0])

### Sort new vcf

In [None]:
raw_vcf_df = raw_vcf_df.sort_values(by=["#CHROM", "POS"], ignore_index=True)
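
Sorting `#CHROM` as a plain string places `chr10` before `chr2`. If the output should instead follow chromosome order (`chr1`, `chr2`, ..., `chrM`), one option is to sort on an explicit rank. A sketch on toy data; `sort_vcf_rows` and `CHROM_ORDER` are hypothetical names:

```python
import pandas as pd

CHROM_ORDER = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def sort_vcf_rows(df, chrom_order=CHROM_ORDER):
    """Sort rows by position in chrom_order, then by POS."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    return (
        df.assign(_chrom_rank=df["#CHROM"].map(rank))
        .sort_values(by=["_chrom_rank", "POS"], ignore_index=True)
        .drop(columns="_chrom_rank")
    )


# Toy rows, not real variants
example = pd.DataFrame({"#CHROM": ["chr10", "chr2", "chr1", "chr2"], "POS": [5, 300, 42, 7]})
sorted_example = sort_vcf_rows(example)
print(sorted_example)
```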

### Store new vcf

#### Create new vcf header

In [None]:
header_row_number = get_vcf_file_header_line_number(file_name=project_root_path / vcf_uncompressed_file)
row_counter = 0
header_text = []
with open(project_root_path / vcf_uncompressed_file, "r") as f:
    for line_text in f:
        if row_counter >= header_row_number:
            break
        row_counter += 1
        if line_text.startswith("##contig=<ID="):
            contig = line_text.replace("##contig=<ID=", "").split(",")[0]
            if contig not in CHROM_TO_KEEP:
                continue
        header_text.append(line_text)
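
One way to address the third shortcoming listed at the top would be to record the transformation in the header, since VCF allows free-form `##key=value` meta lines. A sketch under that assumption; `add_provenance_line`, the key `altContigSelection`, and its value are hypothetical choices, not an established convention:

```python
def add_provenance_line(header_lines, key, value):
    """Return a new header list with a '##key=value' meta line added.

    The line is placed after the existing '##' meta lines and before the
    '#CHROM' column line, if the header contains one.
    """
    if "=" in key or "\n" in key or "\n" in value:
        raise ValueError("key must not contain '=' and neither part may contain a newline")
    new_line = f"##{key}={value}\n"
    meta_lines = [line for line in header_lines if line.startswith("##")]
    other_lines = [line for line in header_lines if not line.startswith("##")]
    return meta_lines + [new_line] + other_lines


# Illustrative header lines
example_header = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1>\n",
]
updated_header = add_provenance_line(
    example_header,
    key="altContigSelection",
    value="main-contig regions replaced by higher-coverage alt contigs",
)
print("".join(updated_header), end="")
```

In the notebook this could be applied to `header_text` before it is written out.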

#### Write vcf file

In [None]:
with open(project_root_path / vcf_file_with_high_coverage_contigs_used, "w") as f:
    f.writelines(header_text)
    # index=False keeps the DataFrame index out of the output file
    raw_vcf_df.to_csv(f, sep="\t", index=False)