# How to annotate a variant call format (VCF) file?

What are RS id numbers?

> The rs number is an accession number used by researchers and databases to refer to specific SNPs. It stands for Reference SNP cluster ID. [source: 23andme](https://customercare.23andme.com/hc/en-us/articles/212196908-What-are-rs-numbers-rsid-)

Raw VCF files don't have RS IDs included. RS IDs are good to use when they are available because they are agnostic to the reference genome build, which makes variant lookups across different genome builds easy.

This notebook shows how to use a tool called [annovar](https://www.openbioinformatics.org/annovar/) (free for personal and academic use) to annotate a VCF file with RS IDs.

### Shortcomings

1. The results rely on the correctness of the `annovar` annotations, which have not been properly validated here.

In [None]:
from pathlib import Path
import re

## Inputs

In [None]:
project_root_dir = Path("/home/s/src/search_your_dna/")
raw_vcf_file = "data/GFX0237425.GRCh38.p7.vcf"

## Outputs

In [None]:
annovar_hg38_metadata_dir = "data/hg38"
annotated_vcf_file = "data/GFX0237425.GRCh38.p7.annotated.vcf"
annotated_vcf_compressed_file = f"{annotated_vcf_file}.gz"
annotated_vcf_tabix_file = f"{annotated_vcf_compressed_file}.tbi"
annotated_vcf_rsidx_file = f"{annotated_vcf_compressed_file}.rsidx"

# intermediate annovar output; not needed once the final annotated file is written and can be deleted afterwards
annovar_annotated_vcf_file = "data/GFX0237425.GRCh38.p7.raw_annotated.vcf"

## Using Annovar to annotate the variant call file

Example from: https://annovar.openbioinformatics.org/en/latest/user-guide/startup/

In [None]:
# replace path values /path/to/annovar and /project_root_dir/annovar_hg38_metadata_dir
!/path/to/annovar/annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp150 /project_root_dir/annovar_hg38_metadata_dir

In [None]:
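# replace path values /path/to/annovar, /project_root_dir/raw_vcf_file.vcf, /project_root_dir/annovar_hg38_metadata_dir, /project_root_dir/annovar_annotated_vcf_file
# one -operation entry per -protocol entry: avsnp150 is a filter-based database, hence a single "f"
# note: with -vcfinput, table_annovar.pl names its VCF output <out>.hg38_multianno.vcf, so rename it or adjust annovar_annotated_vcf_file
!/path/to/annovar/table_annovar.pl /project_root_dir/raw_vcf_file.vcf /project_root_dir/annovar_hg38_metadata_dir -buildver hg38 -out /project_root_dir/annovar_annotated_vcf_file -remove -protocol avsnp150 -operation f -nastring . -vcfinput -polish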
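The shell cells above need their placeholder paths edited by hand. As an alternative, here is a minimal sketch that assembles the `table_annovar.pl` argument list from Python paths. The helper name `build_table_annovar_cmd` and the `annovar_dir` location are hypothetical. The flags are the ones used in this notebook, with a single `f` (filter) operation for the single `avsnp150` protocol.

```python
from pathlib import Path


def build_table_annovar_cmd(annovar_dir, vcf_in, db_dir, out_prefix, buildver="hg38"):
    """Return the table_annovar.pl argument list for an avsnp150 annotation."""
    return [
        str(Path(annovar_dir) / "table_annovar.pl"),
        str(vcf_in),
        str(db_dir),
        "-buildver", buildver,
        "-out", str(out_prefix),
        "-remove",
        "-protocol", "avsnp150",
        "-operation", "f",  # avsnp150 is a filter-based database
        "-nastring", ".",
        "-vcfinput",
        "-polish",
    ]


# Hypothetical locations; replace with your own.
project_root_dir = Path("/home/s/src/search_your_dna/")
cmd = build_table_annovar_cmd(
    annovar_dir="/path/to/annovar",
    vcf_in=project_root_dir / "data/GFX0237425.GRCh38.p7.vcf",
    db_dir=project_root_dir / "data/hg38",
    out_prefix=project_root_dir / "data/GFX0237425.GRCh38.p7.raw_annotated",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps paths with spaces intact and makes it easy to print the exact invocation before running it.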

### Move rsid to the ID columns

Unfortunately, annovar doesn't populate the ID field in the VCF file; it adds RS IDs to the `INFO` column instead. Having the ID column filled with RS IDs is useful for creating indices that make lookups by RS ID fast.

In [None]:
with open(project_root_dir / annovar_annotated_vcf_file, "r") as raw_f, \
        open(project_root_dir / annotated_vcf_file, "w") as new_f:
    for line_text in raw_f:
        if line_text.startswith("#"):
            # meta-information lines and the #CHROM column header are copied unchanged
            new_f.write(line_text)
            continue
        line_parts = line_text.rstrip("\n").split("\t")
        # INFO is the 8th VCF column (index 7); annovar adds avsnp150=<rsid> to it
        info_fields = dict(
            field.split("=", 1) for field in line_parts[7].split(";") if "=" in field
        )
        # keep the existing ID value when the record has no avsnp150 entry
        line_parts[2] = info_fields.get("avsnp150", line_parts[2])
        new_f.write("\t".join(line_parts) + "\n")
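Since the annotations are not otherwise validated in this notebook, a light sanity check after this step is to count how many records ended up with an RS ID in the ID column. The sketch below is only an illustration: the function name `summarize_rsid_coverage` is hypothetical, the example records are made up, and it assumes an RS ID always starts with `rs`.

```python
def summarize_rsid_coverage(vcf_lines):
    """Count data records and how many carry an rsID in the ID column."""
    total = 0
    with_rsid = 0
    for line in vcf_lines:
        # skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        total += 1
        record_id = line.split("\t")[2]  # ID is the 3rd VCF column
        if record_id.startswith("rs"):
            with_rsid += 1
    return total, with_rsid


# Made-up records for illustration only
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\trs123\tA\tG\t50\tPASS\tavsnp150=rs123\n",
    "chr1\t200\t.\tC\tT\t50\tPASS\tavsnp150=.\n",
]
total, with_rsid = summarize_rsid_coverage(example_lines)
print(f"{with_rsid} of {total} records have an rsID")
```

On the real data, the function can be given the open annotated file directly, for example `summarize_rsid_coverage(open(project_root_dir / annotated_vcf_file))`. A very low share of annotated records would suggest a mismatch between the VCF and the downloaded database, such as a different genome build or chromosome naming.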

### Compress and create tabix index

In [None]:
# replace path values /project_root_dir/annotated_vcf_file.vcf and /project_root_dir/annotated_vcf_file.vcf.gz
!bgzip -c /project_root_dir/annotated_vcf_file.vcf > /project_root_dir/annotated_vcf_file.vcf.gz
!tabix -p vcf /project_root_dir/annotated_vcf_file.vcf.gz

### Create index for rsid

In [None]:
# replace path values /project_root_dir/annotated_vcf_file.vcf.gz and /project_root_dir/annotated_vcf_file.vcf.gz.rsidx
!rsidx index /project_root_dir/annotated_vcf_file.vcf.gz /project_root_dir/annotated_vcf_file.vcf.gz.rsidx
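The index is the fast path for lookups. To cross-check what it returns, a slow but dependency-free alternative is a linear scan of the compressed file with the standard library, which works because bgzip output is readable by Python's `gzip` module. This sketch deliberately does not use rsidx; the function name `find_by_rsid` and the demo records are hypothetical.

```python
import gzip
import os
import tempfile


def find_by_rsid(vcf_gz_path, rsids):
    """Linear scan of a (b)gzipped VCF; returns {rsid: record line} for matches."""
    wanted = set(rsids)
    found = {}
    with gzip.open(vcf_gz_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            record_id = line.split("\t", 3)[2]  # ID is the 3rd VCF column
            if record_id in wanted:
                found[record_id] = line.rstrip("\n")
    return found


# Demo on a tiny made-up file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as fh:
        fh.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        fh.write("chr1\t100\trs123\tA\tG\t50\tPASS\t.\n")
        fh.write("chr1\t200\t.\tC\tT\t50\tPASS\t.\n")
    hits = find_by_rsid(demo_path, ["rs123", "rs999"])
print(hits)
```

The scan reads the whole file for every call, so it is only suitable for spot checks; the rsidx index exists precisely to avoid this cost.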