UPDATE: JULY 2019
I no longer work for STFC. All versions of HULK prior to 1.0.0 have been renamed and archived on the STFC GitHub. The STFC Hartree Centre are building genomic solutions based on these and other tools - if you are interested, please contact them.
This repo now hosts HULK version 1.0.0 and later, which is a complete re-implementation of HULK based solely on the method described in the open-access paper.
I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.
Importantly, this project is now fully open source!
HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis.
HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
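As a rough illustration of the binning step described above, here is a minimal Go sketch. It is not HULK's code: the function names (hashKmer, minimizers, jumpHash, addSequence), the FNV-1a k-mer hash, the naive window scan and the values chosen for k, the window size and the number of bins are all assumptions made for this example. It also ignores reverse complements and leaves out the periodic histosketching step entirely.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKmer hashes a k-mer with 64-bit FNV-1a (an illustrative choice of hash).
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// minimizers returns, for each window of w consecutive k-mers in seq, the
// smallest k-mer hash. A minimizer that is shared by adjacent windows (same
// position in seq) is reported only once.
func minimizers(seq string, k, w int) []uint64 {
	if k <= 0 || w <= 0 {
		return nil
	}
	n := len(seq) - k + 1 // number of k-mers in seq
	if n < w {
		return nil
	}
	hashes := make([]uint64, n)
	for i := range hashes {
		hashes[i] = hashKmer(seq[i : i+k])
	}
	var out []uint64
	last := -1 // position of the most recently reported minimizer
	for start := 0; start+w <= n; start++ {
		best := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[best] {
				best = i
			}
		}
		if best != last {
			out = append(out, hashes[best])
			last = best
		}
	}
	return out
}

// jumpHash is the jump consistent hash of Lamping and Veach: it maps a 64-bit
// key to a bin in [0, numBins). numBins must be at least 1.
func jumpHash(key uint64, numBins int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBins) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// addSequence increments the histogram bin of every minimizer found in seq.
func addSequence(bins []uint32, seq string, k, w int) {
	if len(bins) == 0 {
		return
	}
	for _, m := range minimizers(seq, k, w) {
		bins[jumpHash(m, int32(len(bins)))]++
	}
}

func main() {
	bins := make([]uint32, 16) // hypothetical, tiny spectrum size
	reads := []string{
		"ACGTTGCATGCATCGATCGGATCGATTAGC",
		"TTGCATGCATCGATCGGATCGAACGTACGT",
	}
	for _, r := range reads {
		addSequence(bins, r, 7, 5) // k=7, window of 5 k-mers (toy values)
	}
	fmt.Println(bins)
}
```

The histogram has a fixed size no matter how many reads are streamed through it, which is what makes it possible to sketch at intervals without holding the whole data set in memory.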
The advantages of HULK include:
- it's fast and can run on a laptop
- hulk sketches are compact, fixed size and incorporate k-mer frequency information
- it works on data streams and does not require complete data instances
- it can use concept drift for histosketching
- you get to type hulk smash into the command line...
version 1.0.1 (dev branch)
- WASM interface
- run HULK locally and from a browser
- based on my baby-GROOT user interface
- HULK will output additional sketches
- KMV MinHash
- re-implementation of the LSH Forest index
version 1.0.0 (current release)
- fully re-written codebase
- I've aimed for it to be largely backwards compatible with previous releases
- fully open-sourced!
- MIT license (OSI approved)
- algorithm changes
- underlying histogram is now based on minimizer frequencies
- count-min sketch for k-mer frequencies is now replaced with a fixed-size array and a jump-hash for minimizer placement
- changes to the
- sketches saved to JSON by default (à la sourmash)
- histosketch count-min sketch is no longer configurable by the user (this was Epsilon and Delta)
- spectrum size is determined based on k-mer size
- minCount for k-mer frequencies is removed
- changes to the
- operates on JSON input
- outputs matrix as csv
- replaced some unnecessary features
- the functionality of the distance subcommands is available in the
- the functionality of the
pre version 1.0.0
- all versions of HULK (and BANNER) pre v1.0.0 have been moved to the UKRI GitHub and renamed. I can no longer work on these code bases.
Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.
For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.
conda install -c bioconda hulk
HULK is written in Go (v1.12) - to compile from source you will first need the Go toolchain. Once you have it, try something like this to compile:
# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help
HULK is called by typing hulk, followed by the subcommand you wish to run. The main subcommands are sketch and smash:
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA

# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
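For intuition about what a weighted Jaccard similarity matrix contains, the Go snippet below computes the exact weighted Jaccard similarity, sum(min) / sum(max), between raw histograms. This is only a sketch under stated assumptions: hulk smash works on histosketches, which presumably estimate a similarity of this kind rather than computing it from full histograms, and the function names (weightedJaccard, pairwise) and the toy counts are hypothetical, not part of HULK.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// weightedJaccard returns sum(min(a[i], b[i])) / sum(max(a[i], b[i])) for two
// histograms with the same number of bins and non-negative counts.
func weightedJaccard(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("histograms must have the same number of bins")
	}
	var minSum, maxSum float64
	for i := range a {
		minSum += math.Min(a[i], b[i])
		maxSum += math.Max(a[i], b[i])
	}
	if maxSum == 0 {
		return 0, errors.New("both histograms are empty")
	}
	return minSum / maxSum, nil
}

// pairwise builds a symmetric similarity matrix for a set of histograms.
func pairwise(samples [][]float64) ([][]float64, error) {
	m := make([][]float64, len(samples))
	for i := range m {
		m[i] = make([]float64, len(samples))
	}
	for i := range samples {
		for j := i; j < len(samples); j++ {
			s, err := weightedJaccard(samples[i], samples[j])
			if err != nil {
				return nil, err
			}
			m[i][j], m[j][i] = s, s
		}
	}
	return m, nil
}

func main() {
	// Toy bin counts for three hypothetical samples.
	samples := [][]float64{
		{2, 0, 1},
		{1, 1, 1},
		{0, 3, 0},
	}
	m, err := pairwise(samples)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// For the first two samples: (1+0+1) / (2+1+1) = 0.5.
	for _, row := range m {
		fmt.Println(row)
	}
}
```

Because the measure uses bin counts rather than presence or absence, two samples that share the same minimizers at very different abundances still score as dissimilar.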
Further Information & Citing
I'm working on some new documentation and this will be available on readthedocs soon.
A paper describing the HULK method is published in Microbiome: