makevariantsdb

Victoria edited this page Apr 14, 2023 · 6 revisions

To accelerate the mapping process, we need to split variants or annotated positions files by individual transcript IDs.

What does makevariantsdb do?

It splits the variants or positions file by transcript ID into individual files, one per transcript, using awk and sed commands.
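
The idea of splitting by transcript ID can be illustrated with a short Python sketch. This is not how makevariantsdb is implemented (the tool uses awk and sed). The sketch assumes a tab-separated VEP-like file with a header row whose Feature column holds the transcript ID. The function name split_by_transcript and the choice to repeat the header in every output file are illustrative assumptions.

```python
import csv
from pathlib import Path


def split_by_transcript(variants_path, out_dir, id_column="Feature"):
    """Write one tab-separated file per transcript ID found in id_column.

    Returns a dict mapping each transcript ID to its number of rows.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}  # transcript ID -> rows written so far

    with open(variants_path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        for row in reader:
            transcript_id = row[id_column]
            is_new = transcript_id not in written
            target = out_dir / f"{transcript_id}.txt"
            # First row for a transcript creates the file; later rows append,
            # so unsorted input still ends up in the right file.
            with open(target, "w" if is_new else "a", newline="") as dst:
                writer = csv.DictWriter(
                    dst,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                if is_new:
                    writer.writeheader()
                writer.writerow(row)
            written[transcript_id] = written.get(transcript_id, 0) + 1
    return written
```

Reopening the output file for every row keeps the sketch short but is slow on large inputs. It is meant only to show what "one file per transcript ID" means.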

Input files

  • Variants file: The input annotated genomic variants file must be in VCF, VEP or MAF default format. A VEP-like format is also accepted, provided it contains at least the following columns:

    • Uploaded_variation
    • Gene
    • Feature
    • Consequence
    • Protein_position
    • Amino_acids
  • Positions file: To map a set of positions that are NOT mutations to protein structures, we have to create a "positions" file. Its format imitates that of the variants file. The easiest way to generate a positions file is to replicate the VEP-like format, keeping the same column names but changing what some of the columns contain.

For example, the columns of a positions file could be filled in as follows:

  • Uploaded_variation: contains the position ID
  • Gene: can be a protein ID
  • Feature: can be a protein ID
  • Consequence: description of this position, if any. Examples could be "conserved_position", "functional_position", etc.
  • Protein_position: position in the protein sequence
  • Amino_acids: corresponding amino acid
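
As a rough illustration, a positions file with these six columns could be written with a few lines of Python. The sketch assumes a tab-separated layout with a header row. The helper name write_positions_file, the protein ID "PROT1", the listed positions and the ID scheme used for Uploaded_variation are all made up for the example.

```python
import csv

# Column names of the VEP-like format, in the order listed above.
COLUMNS = [
    "Uploaded_variation",
    "Gene",
    "Feature",
    "Consequence",
    "Protein_position",
    "Amino_acids",
]


def write_positions_file(path, protein_id, positions):
    """Write a VEP-like positions file.

    positions is an iterable of (protein_position, amino_acid, description).
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        for position, amino_acid, description in positions:
            # Hypothetical ID scheme: protein ID plus position.
            position_id = f"{protein_id}_{position}"
            # Gene and Feature both carry the protein ID in this sketch.
            writer.writerow(
                [position_id, protein_id, protein_id, description, position, amino_acid]
            )


if __name__ == "__main__":
    # Placeholder data, not taken from any real protein.
    write_positions_file(
        "positions.txt",
        "PROT1",
        [(57, "H", "functional_position"), (102, "D", "conserved_position")],
    )
```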

This file is useful not only for mapping positions or variants to protein structures, but also for finding analogous positions of one protein in other protein structures across evolution. For example, if we have identified interesting positions in a protein, we can map these positions not only to the protein's own structure, but also to the structures of its homologs, avoiding the obstacle of manual alignment and mapping.

Input arguments

  • --varfile, -vf: input VCF, VEP or VEP-like file(s).
  • --maf_file, -maf: input MAF file(s).
  • --force, -f: force overwriting. Disabled by default.
  • --out, -o: output directory. Default is the current directory.
  • --sort, -s: sort the input file before splitting, which speeds up the splitting step.
  • --parallel, -p: run in parallel to reduce running time. Requires GNU Parallel: O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011: 42-47.
  • --jobs, -j: number of jobs to run in parallel. Default is 1.

Output

The following directories and files are generated:

|__DBs
     |_____makevariantsdb.log
     |_____makevariantsdb.report
     |_____varDB
            |_____transcript_ID1.txt
            |_____transcript_ID2.txt        
            |_____...        
            |_____transcript_IDn.txt        

By default, a directory called DBs is created. It contains the files makevariantsdb.log and makevariantsdb.report, which record information about the executed command, and the varDB directory, which contains the files split by individual transcript ID.
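
Given the layout above, a quick way to check which transcripts were split is to list the files under DBs/varDB. The following is a minimal sketch, assuming the output directory is the one passed to --out (or the current directory); the helper name list_split_transcripts is hypothetical.

```python
from pathlib import Path


def list_split_transcripts(out_dir="."):
    """Return the transcript IDs that have a file in <out_dir>/DBs/varDB."""
    var_db = Path(out_dir) / "DBs" / "varDB"
    # Each file is named <transcript_ID>.txt, so dropping the final suffix
    # recovers the ID, including any version number such as ".1".
    return sorted(path.stem for path in var_db.glob("*.txt"))


if __name__ == "__main__":
    for transcript_id in list_split_transcripts("."):
        print(transcript_id)
```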