# How to annotate a variant call format (VCF) file?

What are RS id numbers?

> The rs number is an accession number used by researchers and databases to refer to specific SNPs. It stands for Reference SNP cluster ID. [source: 23andme](https://customercare.23andme.com/hc/en-us/articles/212196908-What-are-rs-numbers-rsid-)

Raw VCF files don't have RS ids included. RS ids are good to use when they are available because they are agnostic to the reference genome build, which makes variant lookups across different genome builds easy.

This notebook shows how to use a tool called [annovar](https://www.openbioinformatics.org/annovar/) (free for personal and academic use) to annotate a VCF file with RS ids.

*Current shortcomings:*

1. The results rely on the correctness of the `annovar` annotations, which haven't been properly validated.

In [None]:
from pathlib import Path
import re
from typing import List

## Inputs

In [None]:
root_path = Path("/home/s/src/search_your_dna")
root_path.mkdir(parents=True, exist_ok=True)

raw_vcf_file = root_path / "data" / "GFX0237425.GRCh38.p7.vcf"

## Outputs

In [None]:
output_path = root_path / "data" / "output_4"
output_path.mkdir(parents=True, exist_ok=True)

annovar_hg38_metadata_dir = output_path / "hg38"
rsid_annotated_vcf_file = output_path / "GFX0237425.GRCh38.p7.rsid_annotated.vcf"
rsid_annotated_vcf_compressed_file = rsid_annotated_vcf_file.parent / f"{rsid_annotated_vcf_file.name}.gz"
rsid_annotated_vcf_tabix_file = rsid_annotated_vcf_compressed_file.parent / f"{rsid_annotated_vcf_compressed_file.name}.tbi"
rsid_annotated_vcf_rsidx_file = rsid_annotated_vcf_compressed_file.parent / f"{rsid_annotated_vcf_compressed_file.name}.rsidx"

# intermediate annovar output; not needed once the rsid-annotated file exists and can be deleted afterwards
annovar_annotated_vcf_file = output_path / "GFX0237425.GRCh38.p7.rsid_annotated.hg38_multianno.vcf"

## Using Annovar to annotate the variant call file

Example from: https://annovar.openbioinformatics.org/en/latest/user-guide/startup/

*NOTE: annovar doesn't handle compressed VCF files; make sure to uncompress the VCF before feeding it into annovar for annotation.*
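
If the raw file only exists in compressed form, the note above can be handled from Python. This is a minimal sketch that assumes the file is gzip- or bgzip-compressed (bgzip output is readable by the standard `gzip` module); the function name `decompress_vcf` is hypothetical and not part of annovar.

```python
import gzip
import shutil
from pathlib import Path


def decompress_vcf(compressed_path: Path, plain_path: Path) -> Path:
    """Write an uncompressed copy of a gzip/bgzip-compressed VCF file."""
    # Stream the data in chunks so large VCF files are not loaded into memory.
    with gzip.open(compressed_path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return plain_path
```

The returned path can then be used wherever `raw_vcf_file` is expected.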

In [None]:
# replace path values /path/to/annovar and /path/to/annovar_hg38_metadata_dir
!/path/to/annovar/annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp150 /path/to/annovar_hg38_metadata_dir

In [None]:
# replace path values /path/to/annovar, /path/to/raw_vcf_file, /path/to/annovar_hg38_metadata_dir, /path/to/annovar_annotated_vcf_file
!/path/to/annovar/table_annovar.pl /path/to/raw_vcf_file /path/to/annovar_hg38_metadata_dir -buildver hg38 -out /path/to/annovar_annotated_vcf_file -remove -protocol avsnp150 -operation f -nastring . -vcfinput -polish
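
Instead of replacing the placeholder paths by hand, the same `table_annovar.pl` call could be assembled from the `Path` variables defined earlier. The sketch below only reuses the flags shown in the cell above; the helper name `build_table_annovar_command` and the `annovar_dir` argument are assumptions made for illustration. Note that annovar's documentation describes `-out` as an output prefix, so the value passed there may need to be the path without the `.hg38_multianno.vcf` suffix; this is worth checking against your installation.

```python
import subprocess
from pathlib import Path
from typing import List


def build_table_annovar_command(
    annovar_dir: Path, vcf_file: Path, db_dir: Path, out_prefix: Path
) -> List[str]:
    """Assemble the table_annovar.pl call used above as an argument list."""
    return [
        str(annovar_dir / "table_annovar.pl"),
        str(vcf_file),
        str(db_dir),
        "-buildver", "hg38",
        "-out", str(out_prefix),
        "-remove",
        "-protocol", "avsnp150",
        "-operation", "f",
        "-nastring", ".",
        "-vcfinput",
        "-polish",
    ]


# Hypothetical usage inside the notebook (annovar_dir is an assumed variable):
# command = build_table_annovar_command(
#     annovar_dir, raw_vcf_file, annovar_hg38_metadata_dir, out_prefix
# )
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.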

### Move rsid to the ID column

Unfortunately, annovar doesn't populate the ID field in the VCF file; it adds RS ids to the `INFO` column instead. Having the ID column filled with RS ids is useful for creating indices that make lookups based on RS ids fast.

In [None]:
header_pattern = "#CHROM\tPOS\tID"
INFO_COLUMN_INDEX = 7  # INFO is the 8th fixed column of a VCF record

def get_rsid_from_info_column(row: List[str]) -> str:
    # annovar stores the rs id in the INFO column as "avsnp150=<rsid>" ("." when missing)
    info_fields = row[INFO_COLUMN_INDEX].split(";")
    return next(t for t in info_fields if t.startswith("avsnp150=")).split("=")[1]

with open(annovar_annotated_vcf_file, "r") as raw_f, open(rsid_annotated_vcf_file, "w") as new_f:
    passed_header = False
    for line_text in raw_f:
        if not passed_header:
            # copy meta-information and header lines through unchanged
            if re.search(header_pattern, line_text):
                passed_header = True
            new_f.write(line_text)
        else:
            # data record: overwrite the ID column (index 2) with the rs id
            line_parts = line_text.split("\t")
            line_parts[2] = get_rsid_from_info_column(line_parts)
            new_f.write("\t".join(line_parts))
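
As a light sanity check on the rewritten file, one could count how many records ended up with an RS id in the ID column and how many kept the `.` placeholder. This is only a rough sketch and does not validate the annotations themselves; the function name `count_rsid_coverage` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_rsid_coverage(vcf_lines: Iterable[str]) -> Counter:
    """Count data records whose ID column does or does not hold an rs id."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        record_id = line.split("\t")[2]
        key = "with_rsid" if record_id.startswith("rs") else "without_rsid"
        counts[key] += 1
    return counts


# Hypothetical usage inside the notebook:
# with open(rsid_annotated_vcf_file) as f:
#     print(count_rsid_coverage(f))
```

A very low share of records with an RS id may indicate a genome build mismatch between the VCF and the downloaded annovar database.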

### Compress and create tabix index

In [None]:
# replace path values /path/to/rsid_annotated_vcf_file and /path/to/rsid_annotated_vcf_compressed_file
!bgzip -c /path/to/rsid_annotated_vcf_file > /path/to/rsid_annotated_vcf_compressed_file
!tabix -p vcf /path/to/rsid_annotated_vcf_compressed_file

### Create index for rsid

In [None]:
# replace path values /path/to/rsid_annotated_vcf_compressed_file and /path/to/rsid_annotated_vcf_rsidx_file
!rsidx index /path/to/rsid_annotated_vcf_compressed_file /path/to/rsid_annotated_vcf_rsidx_file
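
Once the index exists, individual variants can be looked up by RS id. The sketch below assumes the `rsidx search <vcf> <index> <rsid...>` command form shown in the rsidx README, and the helper name `build_rsidx_search_command` is hypothetical; check `rsidx search --help` for your installed version before relying on it.

```python
import subprocess
from pathlib import Path
from typing import List


def build_rsidx_search_command(
    vcf_gz_file: Path, rsidx_file: Path, rsids: List[str]
) -> List[str]:
    """Assemble an `rsidx search` call for one or more rs ids (assumed CLI form)."""
    return ["rsidx", "search", str(vcf_gz_file), str(rsidx_file), *rsids]


# Hypothetical usage inside the notebook; "rs0000" is a placeholder id:
# command = build_rsidx_search_command(
#     rsid_annotated_vcf_compressed_file, rsid_annotated_vcf_rsidx_file, ["rs0000"]
# )
# result = subprocess.run(command, check=True, capture_output=True, text=True)
# print(result.stdout)
```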