refhash
Files from different sources do not use consistent naming conventions for genetic references. The biograph refhash utility can uniquely identify the genetic reference used for a VCF, FASTA, SAM header, or BioGraph reference directory. This makes it possible to confirm that two files use the same genetic reference even when the nucleotide sequence of the reference is not available, and the reported reference filenames differ.
The refhash is used to ensure that genetic references match for samples imported into the BioGraph variant database regardless of the filename of the reference used for analysis.
The refhash command accepts a single file path as an argument. The default output is a SHA-256 hash of the contig names and lengths, sorted by name.
(bg7)$ biograph refhash mystery.vcf.gz
1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd
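As a rough illustration of the idea, the following Python sketch hashes a set of contig names and lengths after sorting by name. The function name sketch_refhash and the serialization (one name, tab, length line per contig) are assumptions made for illustration only. The exact byte layout used by biograph refhash is not described here, so this sketch is not expected to reproduce its output.

```python
import hashlib


def sketch_refhash(contigs):
    """Hash contig names and lengths, sorted by contig name.

    `contigs` maps contig name -> length. The tab/newline serialization
    is an assumption for illustration, not BioGraph's actual format.
    """
    digest = hashlib.sha256()
    for name in sorted(contigs):
        digest.update(f"{name}\t{contigs[name]}\n".encode("utf-8"))
    return digest.hexdigest()


# Input order does not matter because names are sorted before hashing.
a = sketch_refhash({"chr1": 248956422, "chr2": 242193529})
b = sketch_refhash({"chr2": 242193529, "chr1": 248956422})
```

Because only names and lengths are hashed, two files that describe the same contigs give the same digest regardless of their filenames or the order in which contigs are listed.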
The --common or -c option will output the "common" name of the reference, if known:
(bg7)$ biograph refhash -c mystery.vcf.gz
grch38
The --verbose or -v option will output both in a format similar to VCF comment lines:
(bg7)$ biograph refhash -v mystery.vcf.gz
refhash=1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd,name=grch38
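If the verbose output needs to be consumed by a script, it can be split into key and value pairs. The helper below is a minimal sketch: parse_verbose is a hypothetical name, and it assumes that neither the hash nor the reference name contains a comma.

```python
def parse_verbose(line):
    """Split a 'refhash=<hex>,name=<name>' line into a dict.

    Assumes values contain no commas (an assumption, not a documented
    guarantee of the output format).
    """
    fields = {}
    for part in line.strip().split(","):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


info = parse_verbose(
    "refhash=1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd,name=grch38"
)
```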
Input can also be streamed on STDIN, but it must be decompressed first.
(bg7)$ zcat another.vcf.gz | biograph refhash -c
hg19
VCF files may be optionally compressed.
(bg7)$ biograph refhash -c my.vcf.gz
hs37d5
(bg7)$ biograph refhash -c another.vcf
e_coli_k12_ASM584v1
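For a VCF, contig names and lengths are normally declared in ##contig header lines. The sketch below shows one way to collect them in Python. The function name vcf_contigs is hypothetical, and the parsing is deliberately simple (it does not handle quoted values that contain commas); it is not a description of how biograph refhash reads the header.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def vcf_contigs(lines):
    """Collect {contig ID: length} from VCF '##contig=<...>' header lines.

    A contig line without a length field is recorded with length None.
    """
    contigs = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = CONTIG_RE.match(line)
        if not match:
            continue
        body = match.group(1)
        id_match = re.search(r"(?:^|,)ID=([^,]+)", body)
        len_match = re.search(r"(?:^|,)length=(\d+)", body)
        if id_match:
            contigs[id_match.group(1)] = int(len_match.group(1)) if len_match else None
    return contigs


header = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1,length=248956422>\n",
    "##contig=<ID=chrM,length=16569,assembly=GRCh38>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
found = vcf_contigs(header)
```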
FASTA files may be optionally compressed.
(bg7)$ biograph refhash -c some.fasta.gz
grch38.p12
(bg7)$ biograph refhash -c another.fa
human_g1k_v37
You can use SAM files directly, or run samtools view -H for BAMs to identify the reference used for alignment:
(bg7)$ samtools view -H my.bam | biograph refhash -v
refhash=1e4ef0c15393ae133ad336a36a376bb62e564a43a65892966002e75713282aec,name=hs37d5
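In a SAM header, each reference sequence is described by an @SQ line whose SN and LN tags give the contig name and length. The following Python sketch extracts those pairs from header text such as the output of samtools view -H. The function name sam_contigs is hypothetical and the code is only an illustration of where the contig information lives.

```python
def sam_contigs(header_lines):
    """Collect {contig name: length} from SAM '@SQ' header lines."""
    contigs = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type are tab-separated TAG:VALUE pairs.
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if "SN" in tags and "LN" in tags:
            contigs[tags["SN"]] = int(tags["LN"])
    return contigs


header = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:1\tLN:249250621\n",
    "@SQ\tSN:MT\tLN:16569\tM5:c68f52674c9fb33aef52dcf399755519\n",
    "@PG\tID:bwa\tPN:bwa\n",
]
found = sam_contigs(header)
```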
Identify a BioGraph reference directory by specifying the path to the refdir:
(bg7)$ biograph refhash -v /reference/human/
refhash=9d1184b1f957da7a499793e838a6509626bf772c8f437c0972c25f30fbab9fd7,name=human_g1k_v37
If the common name of a reference is not known, it is reported as unknown- plus the first 8 hexadecimal characters of the refhash.
(bg7)$ biograph refhash -v new.fasta.gz
refhash=d997af2c333a5d699477a10f95980a3fd95aa0283fc5be707d4c8a6fd1a0cb6f,name=unknown-d997af2c
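The naming rule for unrecognized references can be written in one line of Python. The function name fallback_name is hypothetical; the rule itself is taken from the description above.

```python
def fallback_name(refhash):
    """Build the placeholder name for a reference with no known common name."""
    return "unknown-" + refhash[:8]


name = fallback_name(
    "d997af2c333a5d699477a10f95980a3fd95aa0283fc5be707d4c8a6fd1a0cb6f"
)
```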
If no contigs are found in the input (for example, a VCF with no contig lines is supplied), a warning is issued and the following refhash is shown:
(bg7)$ biograph refhash -v reference_unknown.vcf
Warning: no contigs present in reference_unknown.vcf
refhash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855,name=no_contigs_present
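The refhash shown for this case is the standard SHA-256 digest of zero bytes of input, which is what one would expect from hashing an empty list of contigs. This can be checked with Python's hashlib:

```python
import hashlib

# SHA-256 of empty input; matches the refhash reported when no contigs are present.
empty_digest = hashlib.sha256(b"").hexdigest()
```

A script that collects refhashes can therefore treat this value as a marker for "no contig information" rather than as a real reference.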
The refhash is not guaranteed to be unique in every case, since it is possible that two references may have identical contig names and lengths, but still have different nucleotide sequences. Practically speaking, this is extremely unlikely for human references. Please let us know if you find an example of two different human references with identical refhash identifiers.
Identifying large FASTA files may take some time since the entire file must be parsed for contigs. If contig lengths are not available, the nucleotides must also be counted.
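To show why a FASTA file must be read in full, here is a minimal Python sketch that derives contig names and lengths by counting the sequence characters under each header line. The function name fasta_contig_lengths is hypothetical, the contig name is assumed to be the header text up to the first whitespace, and gzip compression is detected by the .gz suffix only. It is an illustration, not the method used by biograph refhash.

```python
import gzip


def fasta_contig_lengths(path):
    """Return {contig name: sequence length} for a plain or gzipped FASTA."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumed convention: the name is the text up to the first whitespace.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Every sequence line has to be visited to obtain the lengths, so the running time grows with the size of the file.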
For a usage summary, run biograph refhash with no options, or with --help:
(bg7)$ biograph refhash --help
usage: biograph [-h] [-c] [-v] [-f] [-l] [input]

Identify the reference in a VCF, FASTA, SAM, or BioGraph refdir

positional arguments:
  input          Input filename or refdir (/dev/stdin)

optional arguments:
  -h, --help     show this help message and exit
  -c, --common   Print the common name if known, otherwise print the hash
  -v, --verbose  Print the hash and the common name
  -f, --full     Print all available information
  -l, --list     List all known hashes and names