makevariantsdb

To accelerate the mapping process, we need to split variants or annotated positions files by individual transcript IDs.

What does `makevariantsdb` do?

Splits the variants/positions file by transcript ID into individual files using awk and sed command lines.
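The splitting step can be pictured with a short Python sketch. It is not the tool's implementation, which relies on awk and sed command lines; it only illustrates the idea under stated assumptions: the input is a tab-separated VEP-like file with a header row, the transcript ID sits in the `Feature` column, and each output file repeats the header. The function name `split_by_transcript` and the sample IDs are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def split_by_transcript(variants_path, out_dir, id_column="Feature"):
    """Split a tab-separated VEP-like file into one file per transcript ID.

    Assumptions (not confirmed by the tool): the transcript ID lives in
    `id_column`, and every output file repeats the header row.
    Returns a dict mapping each transcript ID to its number of rows.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    with open(variants_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            transcript_id = row[id_column]
            first_time = transcript_id not in counts
            # Start a fresh file the first time an ID is seen, append afterwards.
            mode = "w" if first_time else "a"
            with open(out_dir / f"{transcript_id}.txt", mode, newline="") as out:
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                if first_time:
                    writer.writeheader()
                writer.writerow(row)
            counts[transcript_id] = counts.get(transcript_id, 0) + 1
    return counts


if __name__ == "__main__":
    # Tiny made-up input, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "variants.tsv"
        source.write_text(
            "Uploaded_variation\tGene\tFeature\tConsequence\t"
            "Protein_position\tAmino_acids\n"
            "var1\tgeneA\ttranscript_ID1\tmissense_variant\t10\tA/V\n"
            "var2\tgeneB\ttranscript_ID2\tmissense_variant\t20\tG/R\n"
            "var3\tgeneA\ttranscript_ID1\tmissense_variant\t30\tL/P\n"
        )
        print(split_by_transcript(source, Path(tmp) / "varDB"))
```

Reopening the output file for every row keeps the sketch short, but it would be slow on large inputs.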

Input files

Variants file: The input annotated genomic variants file must be in VCF, VEP or MAF default format. A VEP-like format is also accepted, provided it contains at least the following columns:
- Uploaded_variation
- Gene
- Feature
- Consequence
- Protein_position
- Amino_acids

Positions file: To map a set of positions that are NOT mutations to protein structures, we have to create a "positions" file. Its format imitates that of a variants file. The easiest way to generate a positions file is to replicate the VEP-like format, keeping the same column names but changing what each column holds.

For example, the columns of a positions file can be filled in as follows:

- Uploaded_variation: contains the position ID
- Gene: can be a protein ID
- Feature: can be a protein ID
- Consequence: description of this position, if any. Examples could be "conserved_position", "functional_position", etc.
- Protein_position: position in the protein sequence
- Amino_acids: corresponding amino acid
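As an illustration of the column layout above, the following Python sketch writes a small positions file. It assumes a tab-separated file with a header row, which the text does not state explicitly; the function name `write_positions_file` and all IDs and values in the example rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from the VEP-like format described above.
POSITION_COLUMNS = [
    "Uploaded_variation",
    "Gene",
    "Feature",
    "Consequence",
    "Protein_position",
    "Amino_acids",
]


def write_positions_file(path, positions):
    """Write position records (dicts keyed by POSITION_COLUMNS) to `path`.

    Assumption: the file is tab-separated with a header row.
    Keys missing from a record are written as empty fields.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=POSITION_COLUMNS,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        for record in positions:
            writer.writerow(record)


if __name__ == "__main__":
    # Made-up records; protein IDs stand in for Gene and Feature.
    example = [
        {
            "Uploaded_variation": "pos_1",
            "Gene": "PROT_A",
            "Feature": "PROT_A",
            "Consequence": "conserved_position",
            "Protein_position": 42,
            "Amino_acids": "K",
        },
        {
            "Uploaded_variation": "pos_2",
            "Gene": "PROT_A",
            "Feature": "PROT_A",
            "Consequence": "functional_position",
            "Protein_position": 97,
            "Amino_acids": "D",
        },
    ]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "positions.tsv"
        write_positions_file(target, example)
        print(target.read_text())
```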

This file is useful not only for mapping positions or variants to protein structures, but also for finding analogous positions of one protein in other protein structures across evolution. For example, if we have identified interesting positions in a protein, we can map these positions not only to its own structure but also to the structures of homologous proteins, avoiding the obstacle of manual alignment and mapping.

Input arguments

--varfile, -vf: input VCF, VEP or VEP-like file(s).
--maf_file, -maf: input MAF file(s).
--force, -f: force overwriting. Inactive by default.
--out, -o: output directory. Default is the current directory.
--sort, -s: sort the input file before splitting, which speeds up the splitting step.
--parallel, -p: run in parallel to reduce running time. Requires GNU Parallel. O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011: 42-47.
--jobs, -j: number of jobs to run in parallel. Default is 1.

Output

The following directories and files are generated:

|__DBs
     |_____makevariantsdb.log
     |_____makevariantsdb.report
     |_____varDB
            |_____transcript_ID1.txt
            |_____transcript_ID2.txt        
            |_____...        
            |_____transcript_IDn.txt

By default, a directory called DBs is created. It contains the files makevariantsdb.log and makevariantsdb.report, which report on the executed command, and the varDB directory, which contains the files split by individual transcript ID.
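Given this layout, the split file for a transcript can be located with a few lines of Python. This is a sketch that assumes the default `DBs/varDB` directory names and the `.txt` extension shown in the tree above; the helper names `transcript_file` and `list_transcripts` are hypothetical and not part of the tool.

```python
import tempfile
from pathlib import Path


def transcript_file(out_dir, transcript_id):
    """Return the expected path of the split file for one transcript ID."""
    return Path(out_dir) / "DBs" / "varDB" / f"{transcript_id}.txt"


def list_transcripts(out_dir):
    """Return the sorted transcript IDs that have a split file in varDB."""
    var_db = Path(out_dir) / "DBs" / "varDB"
    return sorted(path.stem for path in var_db.glob("*.txt"))


if __name__ == "__main__":
    # Recreate the directory layout in a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        var_db = Path(tmp) / "DBs" / "varDB"
        var_db.mkdir(parents=True)
        for name in ("transcript_ID1", "transcript_ID2"):
            (var_db / f"{name}.txt").write_text("")
        print(list_transcripts(tmp))
        print(transcript_file(tmp, "transcript_ID1").exists())
```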
