Skip to content

Commit

Permalink
adding new readme
Browse files Browse the repository at this point in the history
  • Loading branch information
Will Rowe committed Jul 14, 2019
1 parent e96be79 commit a882ce8
Showing 1 changed file with 125 additions and 0 deletions.
125 changes: 125 additions & 0 deletions README.md
@@ -0,0 +1,125 @@
<div align="center">
<img src="paper/img/misc/hulk-logo-with-text.png?raw=true?" alt="hulk-logo" width="250">
<h3><a style="color:#9900FF">H</a>istosketching <a style="color:#9900FF">U</a>sing <a style="color:#9900FF">L</a>ittle <a style="color:#9900FF">K</a>mers</h3>
<hr/>
<a href="https://travis-ci.org/will-rowe/hulk"><img src="https://travis-ci.org/will-rowe/hulk.svg?branch=master" alt="travis"></a>
<a href='http://hulk.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/hulk/badge/?version=latest' alt='Documentation Status' /></a>
<a href="https://goreportcard.com/report/github.com/will-rowe/hulk"><img src="https://goreportcard.com/badge/github.com/will-rowe/hulk" alt="reportcard"></a>
<a href="https://zenodo.org/badge/latestdoi/143890875"><img src="https://zenodo.org/badge/143890875.svg" alt="DOI"></a>
<a href="https://github.com/will-rowe/hulk/blob/master/LICENSE"><img src="https://img.shields.io/badge/license-MIT-orange.svg" alt="License"></a>
<a href="https://bioconda.github.io/recipes/hulk/README.html"><img src="https://anaconda.org/bioconda/hulk/badges/downloads.svg" alt="bioconda"></a>
<a href="https://mybinder.org/v2/gh/will-rowe/hulk/master?filepath=paper%2Fanalysis-notebooks"><img src="https://mybinder.org/badge_logo.svg" alt="Binder"></a>
<hr/>
</div>

> UPDATE: JULY 2019
> I no longer worker for UKRI. As a result, all versions of HULK pre 1.0.0 have been renamed and archived to the [UKRI github](https://github.com/stfc/histogramSketcher).
> This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK and based solely off the method described in the [open-access paper](https://doi.org/10.1186/s40168-019-0653-2).
> I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses **minimizers frequencies** for representing the underling microbiome sample.
> Importantly, this project is now **fully open source** and I can develop freely on it!
## Overview

`HULK` is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling **rapid metagenomic dissimilarity analysis**. `HULK` approximates a [k-mer spectrum](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0875-7) from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.

`HULK` works by collecting **minimizers** from sequences. Minimizers are assigned to a finite number of histogram bins using a [consistent jump hash](https://arxiv.org/abs/1406.2294); these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by `HULK`. Similarly to [MinHash sketches](https://en.wikipedia.org/wiki/MinHash), histosketches can be used to estimate similarity between sequence data sets.

The advantages of `HULK` include:

* it's fast and can run on a laptop
* **hulk sketches** are compact, fixed size and incorporate k-mer frequency information
* it works on data streams and does not require complete data instances
* it can use [concept drift](https://en.wikipedia.org/wiki/Concept_drift) for histosketching
* you get to type `hulk smash` into the command line...

Finally, you can use **hulk sketches** to with a Machine Learning classifier to predict microbiome sample origin (see [the paper](https://doi.org/10.1186/s40168-019-0653-2) and [BANNER](https://github.com/will-rowe/banner)).

## Change log

### version 1.0.1 (dev branch)

* WASM interface
* run HULK locally and from a browser
* based on my [baby-GROOT](https://github.com/will-rowe/baby-groot) user interface
* HULK will output additional sketches
* KMV MinHash
* HyperMinHash
* Indexing
* re-implementation of the LSH Forest index

### version 1.0.0 (current release)

* fully re-written codebase
* I've aimed for it to be largely backwards compatible with previous releases
* fully open-sourced!
* MIT license ([OSI approved](https://opensource.org/licenses))
* changes to the `sketch` subcommand:
* underlying histogram is now based on minimizer frequencies
* sketches saved to JSON by default (ala [sourmash](https://github.com/dib-lab/sourmash))
* changes to the `smash` subcommand:
* operates on JSON input
* outputs matrix as csv
* replaced some unecessary features
* the functionality of the `print` and `distance` subcommands is available in the `smash` subcommand

### pre version 1.0.0

* all versions of HULK (and BANNER) pre v1.0.0 have been moved to the [UKRI github](https://github.com/stfc/histogramSketcher) and renamed. I can no longer work on these code bases.

## Installation

Check out the [releases](https://github.com/will-rowe/hulk/releases) to download a binary. Alternatively, install using Bioconda or compile the software from source.

### Bioconda

For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.

```bash
conda install -c bioconda hulk
```

### Source

`HULK` is written in Go (v1.12) - to compile from source you will first need the [Go tool chain](https://golang.org/doc/install). Once you have it, try something like this to compile:

```bash
# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help
```

## Quick Start

`HULK` is called by typing **hulk**, followed by the subcommand you wish to run. There main subcommands are **sketch** and **smash**:

```bash
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA

# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
```

## Further Information & Citing

I'm working on some new documentation and this will be available on [readthedocs](http://hulk.readthedocs.io/en/latest/?badge=latest) soon.

A paper describing the `HULK` method is published in Microbiome:

>[Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.](https://doi.org/10.1186/s40168-019-0653-2)

0 comments on commit a882ce8

Please sign in to comment.