refhash

Rob Flickenger edited this page Aug 9, 2021 · 1 revision

Files from different sources do not use consistent naming conventions for genetic references. The biograph refhash utility identifies the genetic reference used for a VCF, FASTA, SAM header, or BioGraph reference directory. This makes it possible to confirm that two files use the same genetic reference even when the nucleotide sequence of the reference is not available and the reported reference filenames differ.

The refhash is used to ensure that genetic references match for samples imported into the BioGraph variant database regardless of the filename of the reference used for analysis.
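
As a rough illustration of that check, the sketch below compares two inputs by running the documented `biograph refhash <path>` command on each and comparing what it prints. The helper names `refhash_of` and `same_reference` are hypothetical, and the sketch assumes the hash is the only thing written to standard output.

```python
import subprocess


def refhash_of(path, run=subprocess.run):
    """Return the text that `biograph refhash <path>` prints on stdout."""
    # Only the documented invocation is used here; `run` is injectable so the
    # function can be exercised without biograph installed.
    result = run(
        ["biograph", "refhash", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    # ASSUMPTION: stdout holds just the hash (warnings, if any, go elsewhere).
    return result.stdout.strip()


def same_reference(path_a, path_b, run=subprocess.run):
    """True when both inputs report the same refhash."""
    return refhash_of(path_a, run) == refhash_of(path_b, run)
```

Calling `same_reference("sample.vcf.gz", "/reference/human/")` would then compare a VCF against a refdir. Inputs with no contig lines all report the same refhash, so a match between two such inputs says nothing about the underlying reference.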

Basic usage

The refhash command accepts a single input path as an argument. The default output is a SHA-256 hash of the contig names and lengths, sorted by name.

(bg7)$ biograph refhash mystery.vcf.gz
1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd
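
To make the idea concrete, here is a minimal sketch of hashing contig names and lengths in name order. The exact byte layout that biograph feeds to SHA-256 is not documented on this page, so the record format below (one `name<TAB>length<NEWLINE>` line per contig) is an assumption, and the digests it produces should not be expected to match real refhash values. The function name `sketch_refhash` is hypothetical.

```python
import hashlib


def sketch_refhash(contigs):
    """Hash (name, length) pairs in name order; return a hex SHA-256 digest."""
    digest = hashlib.sha256()
    # Sort by contig name so the order of the input does not affect the result.
    for name, length in sorted(contigs, key=lambda c: c[0]):
        # ASSUMPTION: one "name<TAB>length<NEWLINE>" record per contig.
        digest.update(f"{name}\t{length}\n".encode("utf-8"))
    return digest.hexdigest()


a = sketch_refhash([("chr1", 248956422), ("chr2", 242193529)])
b = sketch_refhash([("chr2", 242193529), ("chr1", 248956422)])
print(a == b)  # True: the same contigs in a different order hash identically
```

Because only names and lengths are hashed, a file that merely lists its contigs (such as a VCF header) can be matched against a FASTA without reading any sequence.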

The --common or -c option will output the "common" name of the reference, if known:

(bg7)$ biograph refhash -c mystery.vcf.gz
grch38

The --verbose or -v option will output both in a format similar to VCF comment lines:

(bg7)$ biograph refhash -v mystery.vcf.gz
refhash=1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd,name=grch38
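
For scripts that consume the verbose output, the line can be split into its two fields. This is a small sketch based only on the example shown above; `parse_verbose` is a hypothetical helper, and it assumes neither value contains a comma.

```python
def parse_verbose(line):
    """Split 'refhash=<hex>,name=<name>' into a dict of its fields."""
    fields = {}
    # ASSUMPTION: fields are comma separated and values contain no commas.
    for item in line.strip().split(","):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


info = parse_verbose(
    "refhash=1f5faf40c2b1b8715e9df75375cb392117a9c5734fca790e6399d7a50e90ebdd,name=grch38"
)
print(info["name"])  # grch38
```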

Streaming on STDIN

Input can also be streamed on STDIN, but compressed input should be decompressed first.

(bg7)$ zcat another.vcf.gz | biograph refhash -c
hg19

VCF files

VCF files may be compressed.

(bg7)$ biograph refhash -c my.vcf.gz
hs37d5
(bg7)$ biograph refhash -c another.vcf
e_coli_k12_ASM584v1
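
In a VCF, contig names and lengths live in the `##contig=<ID=...,length=...>` header lines defined by the VCF specification. The sketch below collects them from an iterable of header lines. It is illustrative only: `vcf_contigs` is a hypothetical name, how biograph itself parses these lines is not documented here, and quoted values that contain commas are not handled.

```python
import re

# Matches a VCF contig header line and captures the text between < and >.
CONTIG_LINE = re.compile(r"^##contig=<(.*)>\s*$")


def vcf_contigs(lines):
    """Return [(name, length_or_None), ...] from VCF header lines."""
    contigs = []
    for line in lines:
        if not line.startswith("#"):
            break  # the header is over; variant records follow
        match = CONTIG_LINE.match(line)
        if not match:
            continue
        # Simple key=value split; quoted values with commas are not handled.
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "ID" in fields:
            length = fields.get("length")
            contigs.append((fields["ID"], int(length) if length else None))
    return contigs
```

A header with no `##contig` lines yields an empty list, which corresponds to the "no contigs" case described under Unknown references.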

FASTA files

FASTA files may be compressed.

(bg7)$ biograph refhash -c some.fasta.gz
grch38.p12
(bg7)$ biograph refhash -c another.fa
human_g1k_v37

SAM/BAM files

SAM files can be used directly. For BAM files, pipe the header from samtools view -H into refhash to identify the reference used for alignment:

(bg7)$ samtools view -H my.bam | biograph refhash -v
refhash=1e4ef0c15393ae133ad336a36a376bb62e564a43a65892966002e75713282aec,name=hs37d5
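
The header is sufficient because each `@SQ` line in a SAM header carries the contig name (`SN`) and length (`LN`), as defined by the SAM specification. A minimal sketch of pulling those pairs out of header text follows; `sam_contigs` is a hypothetical helper and is not taken from the biograph source.

```python
def sam_contigs(header_lines):
    """Return [(name, length), ...] from the @SQ lines of a SAM header."""
    contigs = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD, @RG, @PG, @CO and alignment records
        tags = {}
        # Header fields are tab separated TAG:VALUE pairs after the record type.
        for field in line.rstrip("\n").split("\t")[1:]:
            key, _, value = field.partition(":")
            tags[key] = value
        if "SN" in tags and "LN" in tags:
            contigs.append((tags["SN"], int(tags["LN"])))
    return contigs
```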

BioGraph reference directories

Identify a BioGraph reference directory by specifying the path to the refdir:

(bg7)$ biograph refhash -v /reference/human/
refhash=9d1184b1f957da7a499793e838a6509626bf772c8f437c0972c25f30fbab9fd7,name=human_g1k_v37

Unknown references

If the common name of a reference is not known, it is reported as unknown- plus the first 8 hexadecimal digits of the refhash.

(bg7)$ biograph refhash -v new.fasta.gz
refhash=d997af2c333a5d699477a10f95980a3fd95aa0283fc5be707d4c8a6fd1a0cb6f,name=unknown-d997af2c
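
The naming rule can be written down in a few lines. In this sketch, `display_name` and the `known_names` mapping are hypothetical stand-ins for whatever table of known references the tool keeps; only the `unknown-` plus eight digits rule comes from the text above.

```python
def display_name(refhash, known_names):
    """Return the common name, or 'unknown-' plus the first 8 hex digits."""
    # known_names is a hypothetical {refhash: common name} mapping.
    return known_names.get(refhash, f"unknown-{refhash[:8]}")


name = display_name(
    "d997af2c333a5d699477a10f95980a3fd95aa0283fc5be707d4c8a6fd1a0cb6f", {}
)
print(name)  # unknown-d997af2c
```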

If no contigs are found in the input (for example, a VCF with no contig lines), a warning is issued and the following refhash is shown:

(bg7)$ biograph refhash -v reference_unknown.vcf
Warning: no contigs present in reference_unknown.vcf
refhash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855,name=no_contigs_present
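
The refhash in this example is the SHA-256 digest of empty input, which is consistent with hashing an empty list of contigs (an inference from the value itself, not a statement about the tool's internals). A script can use that to detect the case without parsing the warning text. `is_no_contigs` is a hypothetical helper name.

```python
import hashlib

# SHA-256 of zero bytes; equal to the refhash reported when no contigs are found.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def is_no_contigs(refhash):
    """True when a refhash is the digest of empty input."""
    return refhash.strip().lower() == EMPTY_SHA256
```

A match between two files that both report this value should not be read as evidence that they share a reference.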

Limitations

The refhash is not guaranteed to be unique: two references may have identical contig names and lengths but different nucleotide sequences. In practice, this is extremely unlikely for human references. Please let us know if you find an example of two different human references with identical refhash identifiers.

Identifying large FASTA files may take some time since the entire file must be parsed for contigs. If contig lengths are not available, the nucleotides must also be counted.
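
The cost comes from having to read every sequence line. The sketch below shows the counting step on an iterable of FASTA lines, assuming the contig name is the first whitespace-delimited token after `>`. `fasta_contig_lengths` is a hypothetical name, and this is not the biograph implementation.

```python
def fasta_contig_lengths(lines):
    """Return [(name, length), ...] by counting sequence characters per contig."""
    contigs = []
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                contigs.append((name, length))  # close the previous contig
            # ASSUMPTION: the name is the first token of the header line.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)  # every sequence line must be read and counted
    if name is not None:
        contigs.append((name, length))
    return contigs
```

For a compressed file, the lines can be supplied with `gzip.open(path, "rt")` from the standard library, which reads the file as text one line at a time.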

Getting help

Run biograph refhash with no options, or with --help:

(bg7)$ biograph refhash --help
usage: biograph [-h] [-c] [-v] [-f] [-l] [input]

Identify the reference in a VCF, FASTA, SAM, or BioGraph refdir

positional arguments:
  input          Input filename or refdir (/dev/stdin)

optional arguments:
  -h, --help     show this help message and exit
  -c, --common   Print the common name if known, otherwise print the hash
  -v, --verbose  Print the hash and the common name
  -f, --full     Print all available information
  -l, --list     List all known hashes and names