# Missing SNP values in my chromosome

Problem: even after using the highest-coverage alt contigs, many SNP values were still missing for hg38.

In [None]:
import re
from collections import defaultdict
from typing import Optional, Union, List, Dict
import pandas as pd
import numpy as np
import pysam
from search_your_dna.util import get_chrom_reads_in_range, _get_contig

# Missing counts per chromosome; column names added so the table is self-describing
missing_read_count_per_chrom = pd.DataFrame(
    [
        ["1", 101791],
        ["2", 67426],
        ["3", 43433],
        ["4", 52966],
        ["5", 57445],
        ["6", 182353],
        ["7", 45498],
        ["8", 50272],
        ["9", 45889],
        ["10", 48119],
        ["11", 32284],
        ["12", 55814],
        ["13", 63840],
        ["14", 30236],
        ["15", 56842],
        ["16", 27492],
        ["17", 75705],
        ["18", 32489],
        ["19", 64677],
        ["20", 38252],
        ["21", 13239],
        ["22", 15351],
        ["X", 57470],
        ["Y", 1078],
    ],
    columns=["chrom", "missing_read_count"],
)
missing_read_count_per_chrom


I had a look at 00-All.vcf.gz (from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/), and many records have more than one nucleotide in the REF (reference allele) column. My analysis so far has not taken this into account at all.

Example: chromosome 1, position 29659 has REF `TG` and ALT `T`.

My SNP database shows `T` at this position, which is ambiguous. Let's look at the surrounding bases in the IGV viewer.

The IGV viewer shows `T` at position 29659 and `G` at position 29660.

I think it is better to filter out records that have more than one nucleotide in the REF column. Currently they cause false positives.

I am still curious what these records are.

- NCBI reports that `rs1306082948` is a deletion of `G` at chr1:29660
- NCBI reports that `rs1424255478` is an indel of `CCTCCCGGAAGCTCCC(GCC)2GC` at chr1:29327-29355
- NCBI reports that `rs1228409428` is an indel of `GAGTA` at chr1:52457-52463
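
The first bullet matches how VCF encodes a deletion: REF starts with an anchor base that is kept, so for REF `TG`, ALT `T` at POS 29659 the deleted `G` sits at POS + 1 = 29660. Below is a minimal sketch of that arithmetic. The helper name `deleted_span` is hypothetical, and it assumes a simple deletion where ALT is a prefix of REF.

```python
from typing import Optional, Tuple


def deleted_span(pos: int, ref: str, alt: str) -> Optional[Tuple[int, int, str]]:
    """Return (first, last, bases) of the deleted stretch, 1-based inclusive.

    Hypothetical helper: only handles simple deletions where ALT is a prefix
    of REF (the shared prefix is the anchor). Returns None for anything else.
    """
    if len(alt) >= len(ref) or not ref.startswith(alt):
        return None
    first = pos + len(alt)      # first base after the shared anchor
    last = pos + len(ref) - 1   # last base covered by REF
    return first, last, ref[len(alt):]


print(deleted_span(29659, "TG", "T"))  # (29660, 29660, 'G')
```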


It seems that dbSNP contains indels in addition to SNPs (and maybe some other variant types). [Wikipedia says](https://en.wikipedia.org/wiki/DbSNP):

```
Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants.
```

I'd assume this file has all these different types.

My processing doesn't take anything except SNVs into account. Based on how single-nucleotide deletions are represented in the 00-All.vcf.gz file (from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/), I might have either missing rsid values for deletions or wrong position values for them (if there is a SNP at the preceding position).

I think the correct solution would be to parse the 00-All.vcf.gz file again (from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/), including only SNVs and short indels (excluding longer indels). The logic could be something like:

```python
if len(REF_col_value) == 1 and len(ALT_col_value) == 1:
    process_snv(row)
elif len(REF_col_value) == 1 and len(ALT_col_value) > 1:
    # ALT is longer than REF: bases were inserted
    process_insertion(row)
elif len(REF_col_value) > 1 and len(ALT_col_value) == 1:
    # REF is longer than ALT: bases were deleted.
    # Need to read the INFO field RSPOS value, as it gives the true rsid position. Should always read that.
    process_deletion(row)
else:
    raise Exception("Unknown rsid type")
```
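
Reading `RSPOS` means splitting the semicolon-separated INFO column into key/value pairs. Below is a minimal sketch, assuming the usual VCF INFO layout (`KEY=VALUE` entries and bare flags). The function names are hypothetical, and the sample INFO string is a shortened illustration built from the `rs1306082948` example above, not a line copied from the file.

```python
from typing import Dict, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO string into a dict; bare flags map to True."""
    parsed: Dict[str, Union[str, bool]] = {}
    for entry in info.strip().split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed


def rs_position(pos: int, info: str) -> int:
    """Prefer RSPOS from INFO; fall back to the POS column when it is absent."""
    rspos = parse_info(info).get("RSPOS")
    return int(rspos) if isinstance(rspos, str) else pos


# Illustrative INFO string (assumption, not taken verbatim from 00-All.vcf)
print(rs_position(29659, "RS=1306082948;RSPOS=29660;ASP"))  # 29660
```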

In [None]:
all_rsid_file = "/home/s/src/search_your_dna/data/00-All.vcf"
header_pattern = r"#CHROM\s+POS\s+ID\s+REF\s+ALT\s+QUAL\s+FILTER\s+INFO"  # raw string so \s is a regex escape
snvs = []
insertions = []
deletions = []
unknown = []
def process_snv(row):
    snvs.append(row)
def process_insertion(row):
    insertions.append(row)
def process_deletion(line_parts):
    deletions.append(line_parts)

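def all_strings_single_letters(strings) -> bool:
    # True when every allele is exactly one nucleotide long
    return all(len(s) == 1 for s in strings)
def all_strings_double_or_single_letters(strings) -> bool:
    # True when every allele is at most two nucleotides long
    return all(len(s) <= 2 for s in strings)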

with open(str(all_rsid_file), "r") as f:
    passed_header = False
    for line_text in f:
        if not passed_header:
            if re.search(header_pattern, line_text):
                passed_header = True
        else:
            line_parts = line_text.split("\t")
            if line_parts[0] != "1":
                break
            REF_col_value = line_parts[3]
            ref_nucleotides = REF_col_value.split(",")
            ALT_col_value = line_parts[4]
            alt_nucleotides = ALT_col_value.split(",")
            if all_strings_single_letters(ref_nucleotides) and all_strings_single_letters(alt_nucleotides):
                process_snv(line_parts)
            elif all_strings_single_letters(ref_nucleotides) and all_strings_double_or_single_letters(alt_nucleotides):
                process_insertion(line_parts)
            elif all_strings_double_or_single_letters(ref_nucleotides) and all_strings_single_letters(alt_nucleotides):
                process_deletion(line_parts)  # need to read the INFO field RSPOS value, as it gives the true rsid position. Should always read that.
            else:
                unknown.append(line_parts)

Create an index for the newly created VCF files; it is necessary for `fetch`.
```bash
cd /some/dir/path/to/vcf/files
for filename in *.vcf; do
    bgzip -c "$filename" > "${filename}.gz"
    tabix -p vcf "${filename}.gz"
done
```

In [None]:
vcf_1 = pysam.VariantFile("/home/s/src/search_your_dna/data/ncbi_snpdb_all_vcf/ncbi_snpdb_all_chr1.vcf.gz")
vcf_10 = pysam.VariantFile("/home/s/src/search_your_dna/data/ncbi_snpdb_all_vcf/ncbi_snpdb_all_chr10.vcf.gz")

recs = {}
# for rec in vcf_10.fetch("10", 110581962, 110581962 + 200):
# for rec in vcf_1.fetch("1", 94800, 94830): # deletion
for rec in vcf_1.fetch("1", 104150, 104170): # insertion
    print(rec)
    recs[rec.id] = rec

# res = vcf_10.fetch("10", 110581962)
# res = vcf_1.fetch("chr1", 52457, 52463)
# res

In [None]:
print(rec.info)
print(dict(zip(rec.info.keys(), rec.info.values())))
# print (rec.info["DP"])

In [None]:
bam_file_grch37 = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.bam"
bam_file_hg38_v1 = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7.bam"
bam_file_hg38_v2 = "/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam"
alignment_data_grch37 = pysam.AlignmentFile(bam_file_grch37, "rb")
alignment_data_grch38_v1 = pysam.AlignmentFile(bam_file_hg38_v1, "rb")
alignment_data_grch38_v2 = pysam.AlignmentFile(bam_file_hg38_v2, "rb")
pileupreads = []
def get_chrom_reads_in_range(
        alignment_data,
        start: int,
        stop: int,
        contig: Optional[str] = None,
        chrom: Optional[Union[int, str]] = None
) -> Dict[int, List[str]]:
    global pileupreads
    if contig is None:
        contig = _get_contig(alignment_data=alignment_data, chrom=str(chrom))
    sequence: Dict[int, List[str]] = {}  # defaultdict() without a factory behaves like a plain dict
    for pileup_column in alignment_data.pileup(contig, start, stop):
        # print ("\ncoverage at base %s = %s" % (pileup_column.pos, pileup_column.n), "pileups", len(pileup_column.pileups))
        pos = pileup_column.pos + 1  # NOTE: pileup is 0 based, thus +1 is needed
        if pos != 104160:
            continue
        try:
            if len(pileup_column.pileups) == 0:
                raise RuntimeWarning("No reads found")
            reads_at_current_position = []
            for pileupread in pileup_column.pileups:
                pileupreads.append(pileupread)
                if pileupread.is_del:
                    # deleted in this read: query_position is None, so there is no base to look up
                    reads_at_current_position.append("D")
                elif pileupread.indel:
                    # need to handle insertions (such reads are skipped for now)
                    ...
                else:
                    # print(pileupread.alignment.query_name, pileupread.alignment.query_sequence[pileupread.query_position])
                    reads_at_current_position.append(pileupread.alignment.query_sequence[pileupread.query_position])
            sequence[pos] = reads_at_current_position
        except RuntimeWarning:
            ...
            # print(f"Chromosome {chrom} position {pos} does not have any READS")
    return sequence


# reads = get_chrom_reads_in_range(alignment_data_grch38, 94800, 94830, chrom="1") # deletion
reads = get_chrom_reads_in_range(alignment_data_grch38_v2, 104158, 104162, chrom="1") # insertion
reads

In [None]:
a = list(filter(lambda r: r.indel, pileupreads))

In [None]:
list(map(lambda r: r.alignment.query_sequence[r.query_position], a))
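
One way to fill in the insertion branch: when `indel` is positive, the inserted bases follow the current base in the read, so they can be sliced out of `query_sequence`. The sketch below takes plain values that mirror the `PileupRead` attributes used above (`is_del`, `indel`, `query_position`, `alignment.query_sequence`), so it runs without a BAM file. The function name and the `D` / `+` labels are my own convention, not something pysam produces.

```python
from typing import Optional


def observed_allele(
    query_sequence: str,
    query_position: Optional[int],
    is_del: bool,
    indel: int,
) -> str:
    """Describe what one read shows at one pileup column (hypothetical helper)."""
    if is_del or query_position is None:
        # The read has no base here (deleted relative to the reference)
        return "D"
    base = query_sequence[query_position]
    if indel > 0:
        # An insertion of `indel` bases follows the current base in the read
        inserted = query_sequence[query_position + 1:query_position + 1 + indel]
        return f"{base}+{inserted}"
    # indel == 0, or indel < 0 (a deletion starts after this base): report the base itself
    return base


print(observed_allele("ACGTT", 1, False, 2))  # C+GT
```

Inside the loop it would be called as `observed_allele(pileupread.alignment.query_sequence, pileupread.query_position, bool(pileupread.is_del), pileupread.indel)`.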

# How to call my variants?

1. [X] Compute alignment.
2. [X] Have a list of CHR-POS values that I want to get variants for

    * Tutorial: https://wikis.utexas.edu/display/bioiteam/Variant+calling+tutorial
    * CHR-POS needs to be in FASTA format.
    * Can I get the variants in FASTA from here? https://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/ >> No
    * Using the reference genome FASTA file:
      ```
      bcftools mpileup -Ou -f ~/Downloads/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz ~/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7_v2.bam | bcftools call -mv -Ob -o ~/Downloads/GFX0237425.GRCh38.p7.bcf
      bcftools view ~/Downloads/GFX0237425.GRCh38.p7.bcf > ~/Downloads/GFX0237425.GRCh38.p7.vcf
      bgzip -c ~/Downloads/GFX0237425.GRCh38.p7.vcf > ~/Downloads/GFX0237425.GRCh38.p7.vcf.gz
      tabix -p vcf ~/Downloads/GFX0237425.GRCh38.p7.vcf.gz
      ```
3. [X] Use standard bioinformatics tools to compute a VCF file for the CHR-POS values (it would be a huge file)
4. [X] Create a tabix index file for the VCF file
5. [X] Check how to get the right genotype values out of it
6. [X] If querying variants from the VCF file is still slow, put them into an SQL database with an index
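
For step 5, the genotype in a VCF sample column is a list of allele indices (`0` is REF, `1` is the first ALT, and so on) separated by `/` or `|`. Below is a minimal sketch of decoding it into bases. The function name is hypothetical, and it only looks at the `GT` value, not at the other FORMAT fields.

```python
import re
from typing import List, Optional


def genotype_alleles(ref: str, alt: str, gt: str) -> List[Optional[str]]:
    """Translate a GT string such as '0/1' into the alleles it points at.

    `alt` is the raw ALT column (comma-separated). A missing call ('.')
    is returned as None.
    """
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    decoded: List[Optional[str]] = []
    for index in re.split(r"[/|]", gt):
        decoded.append(None if index == "." else alleles[int(index)])
    return decoded


print(genotype_alleles("A", "G", "0/1"))    # ['A', 'G']
print(genotype_alleles("A", "G,T", "1|2"))  # ['G', 'T']
```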

## Try to find genotype for missing SNV `rs9380142`

In [None]:
my_vcf = pysam.VariantFile("/home/s/Dropbox/Siim/health/genetest_2020/GFX0237425.GRCh38.p7.vcf.gz")
rs9380142_pos = 29831017

my_recs = {}
for rec in my_vcf.fetch("chr6", rs9380142_pos - 1, rs9380142_pos): # rs9380142 is visible here
    print(rec)
    my_recs[rec.id] = rec

## When no variant is found, the allele needs to be looked up in the reference genome

### First idea: look it up in the all-SNP VCF file

In [None]:
vcf_all = pysam.VariantFile("/home/s/src/search_your_dna/data/00-All.vcf.gz") # from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/
recs_all = {}
for rec in vcf_all.fetch("6", rs9380142_pos - 1, rs9380142_pos):
    print(rec)
    recs_all[rec.id] = rec

In [None]:
rs9380142_pos_on_contig = rs9380142_pos - 28510120 # 28510120 start of alt contig
for rec in my_vcf.fetch('chr6_GL000254v2_alt', rs9380142_pos_on_contig - 10):
    print(rec)

*Based on this quick experiment, it seems all the data can be pieced together from my VCF file and the 00-All VCF file.*

There will be some logic for querying variants from the right alt contig.
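
A minimal sketch of what that lookup might look like. It assumes, as in the cell above, that a chromosome position maps onto an alt contig by subtracting the contig's start, which ignores any indels between the alt contig and the primary assembly. The table and function names are hypothetical; the table only holds the one contig used above, and its end coordinate is left as `None` because I have not looked it up.

```python
from typing import Dict, Optional, Tuple

# contig name -> (chromosome, start on the chromosome, end on the chromosome or None if unknown)
ALT_CONTIGS: Dict[str, Tuple[str, int, Optional[int]]] = {
    "chr6_GL000254v2_alt": ("chr6", 28510120, None),
}


def to_alt_contig(chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    """Return (contig, position on contig) for the first alt contig covering pos."""
    for contig, (contig_chrom, start, end) in ALT_CONTIGS.items():
        if contig_chrom != chrom or pos < start:
            continue
        if end is not None and pos > end:
            continue
        return contig, pos - start  # constant-offset assumption
    return None


print(to_alt_contig("chr6", 29831017))  # ('chr6_GL000254v2_alt', 1320897)
```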


