ntHash2 AVX512

This is a bug-fixed version of the ntHash2 AVX2 and AVX512 ports, with expanded tests. It also adds a 32-bit scalar ntHash.

In terms of correctness, the 64-bit hash (the default hash type in ntHash 1 and 2) gives identical results in the scalar, AVX2, and AVX512 implementations. The 32-bit AVX2 and AVX512 ports agree with each other, since both implement an unusual 31-bit variant of ntHash2, but they do not agree with the 32-bit scalar version, which implements ntHash1.

This has not been merged into the original ntHash repository because it is based on an old ntHash codebase.

See bcgsc/ntHash#9 for the initial version.

ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
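
To make "recursive" concrete, the following is a minimal, self-contained sketch of a rolling hash in the ntHash style: the hash of each k-mer is derived from the hash of the previous k-mer with a rotation and two XORs, instead of being recomputed from scratch. The names toySeed, toyHash, toyRoll, and toyHashAll and the per-base seed constants are hypothetical and chosen only for illustration. They are not the constants or the API of nthash.hpp, so the values will not match NT64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative per-base seeds (assumed values, not the constants in nthash.hpp).
inline uint64_t toySeed(char c) {
    switch (c) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xbf58476d1ce4e5b9ULL;
        case 'G': return 0x94d049bb133111ebULL;
        case 'T': return 0x2545f4914f6cdd1dULL;
        default:  return 0; // other characters contribute nothing in this sketch
    }
}

// Rotate a 64-bit word left by r bits (r is reduced modulo 64).
inline uint64_t rol(uint64_t v, unsigned r) {
    r &= 63;
    return r ? (v << r) | (v >> (64 - r)) : v;
}

// Hash one k-mer from scratch: base i ends up rotated left by k-1-i bits.
inline uint64_t toyHash(const char* kmer, unsigned k) {
    assert(k > 0);
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h = rol(h, 1) ^ toySeed(kmer[i]);
    return h;
}

// Roll one base forward: cancel the outgoing base, then add the incoming one.
inline uint64_t toyRoll(uint64_t h, char charOut, char charIn, unsigned k) {
    return rol(h, 1) ^ rol(toySeed(charOut), k) ^ toySeed(charIn);
}

// Hash every k-mer of seq: one full computation, then one roll per position.
inline std::vector<uint64_t> toyHashAll(const std::string& seq, unsigned k) {
    std::vector<uint64_t> out;
    if (k == 0 || seq.size() < k) return out; // no k-mers to hash
    uint64_t h = toyHash(seq.data(), k);
    out.push_back(h);
    for (std::size_t i = 0; i + k < seq.size(); ++i) {
        h = toyRoll(h, seq[i], seq[i + k], k);
        out.push_back(h);
    }
    return out;
}
```

Calling toyHashAll(seq, k) returns seq.size() - k + 1 values, and each rolled value equals what toyHash would return for the k-mer starting at the same position.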

Build the test suite

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

To install nttest in a specified directory:

$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install

The nttest suite provides runtime and uniformity tests.

Runtime test

For the runtime test, the program has the following options:

nttest [OPTIONS] ... [FILE]

Parameters:

-k, --kmer=SIZE: the k-mer length used for hashing in the runtime test [50]
-h, --hash=SIZE: the number of hashes generated for each k-mer [1]
FILE: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder with k-mer length 50, run:

$ nttest -k50 reads.fa

Uniformity test

For the uniformity test, which uses the Bloom filter data structure, the program has the following options:

nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:

-q, --qnum=SIZE: number of queries in query file
-l, --qlen=SIZE: length of reads in query file
-t, --tnum=SIZE: number of sequences in reference file
-g, --tlen=SIZE: length of reference sequence
-i, --input: generate random query and reference files
-j, --threads=SIZE: number of threads to run uniformity test [1]
REF_FILE: the reference file name
QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:

100 genes of length 5,000,000bp as reference in file genes.fa
4,000,000 reads of length 250bp as query in file reads.fa
12 threads

run:

$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
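
As background on why a Bloom filter can serve as a uniformity check: if hash values are spread evenly over the filter, the observed false-positive rate should be close to the textbook estimate (1 - e^(-hn/m))^h for m bits, n inserted items, and h hashes per item. The sketch below illustrates that comparison for h = 1. It is an assumption about the general idea, not a description of how nttest measures uniformity, and expectedFpr, observedFpr, and the SplitMix64 generator standing in for a hash function are all hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64 generator, used here only as a stand-in for a well-mixed hash.
inline uint64_t splitmix64(uint64_t& state) {
    uint64_t z = (state += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Expected false-positive rate for m bits, n inserted items, h hashes per item.
inline double expectedFpr(double m, double n, double h) {
    return std::pow(1.0 - std::exp(-h * n / m), h);
}

// Insert n values with one hash each, then query n different values.
// The queried values were never inserted, so every hit is a false positive.
inline double observedFpr(std::size_t mBits, std::size_t n, uint64_t seed) {
    assert(mBits > 0 && n > 0);
    std::vector<bool> bits(mBits, false);
    uint64_t state = seed;
    for (std::size_t i = 0; i < n; ++i)
        bits[splitmix64(state) % mBits] = true;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i)
        hits += bits[splitmix64(state) % mBits] ? 1 : 0;
    return static_cast<double>(hits) / static_cast<double>(n);
}
```

A hash function with poor uniformity would set fewer distinct bits or cluster its queries, which shows up as an observed rate that drifts away from the expected one.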

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal = NT64(kmer.c_str(), k); // initial hash value
    ...
    // i + k < seq.length() avoids unsigned underflow when seq is shorter than k
    for (size_t i = 0; i + k < seq.length(); i++)
    {
        hVal = NT64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }

To compute canonical hash values for all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    // i + k < seq.length() avoids unsigned underflow when seq is shorter than k
    for (size_t i = 0; i + k < seq.length(); i++)
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }
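
To illustrate what the forward, reverse-strand, and canonical values represent, here is a self-contained sketch that keeps both strand hashes in a small state struct and rolls them together. It assumes the canonical value is the smaller of the two strand hashes. ToyStrandHashes, toyInit, toyRoll, toyComplement, toyReverseComplement, and the seed constants are hypothetical names and values, not the NTC64 implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative per-base seeds (assumed values, not the constants in nthash.hpp).
inline uint64_t toySeed(char c) {
    switch (c) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xbf58476d1ce4e5b9ULL;
        case 'G': return 0x94d049bb133111ebULL;
        case 'T': return 0x2545f4914f6cdd1dULL;
        default:  return 0; // other characters contribute nothing in this sketch
    }
}

inline char toyComplement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return c;
    }
}

inline uint64_t rol(uint64_t v, unsigned r) { r &= 63; return r ? (v << r) | (v >> (64 - r)) : v; }
inline uint64_t ror(uint64_t v, unsigned r) { r &= 63; return r ? (v >> r) | (v << (64 - r)) : v; }

struct ToyStrandHashes {
    uint64_t fh; // hash of the k-mer as read
    uint64_t rh; // hash of its reverse complement
    // Assumption of this sketch: the canonical value is the smaller strand hash.
    uint64_t canonical() const { return std::min(fh, rh); }
};

// Compute both strand hashes of one k-mer from scratch.
inline ToyStrandHashes toyInit(const char* kmer, unsigned k) {
    assert(k > 0);
    ToyStrandHashes s = {0, 0};
    for (unsigned i = 0; i < k; ++i) {
        s.fh = rol(s.fh, 1) ^ toySeed(kmer[i]);
        s.rh = rol(s.rh, 1) ^ toySeed(toyComplement(kmer[k - 1 - i]));
    }
    return s;
}

// Slide both strand hashes one base forward along the sequence.
inline void toyRoll(ToyStrandHashes& s, char charOut, char charIn, unsigned k) {
    s.fh = rol(s.fh, 1) ^ rol(toySeed(charOut), k) ^ toySeed(charIn);
    s.rh = ror(s.rh, 1) ^ ror(toySeed(toyComplement(charOut)), 1)
         ^ rol(toySeed(toyComplement(charIn)), k - 1);
}

inline std::string toyReverseComplement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) c = toyComplement(c);
    return rc;
}
```

With this construction a k-mer and its reverse complement swap their fh and rh values, so canonical() is the same for both, which is the property a canonical hash is meant to provide.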

To compute h hash values for each k-mer of length k in a given sequence seq:

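    string kmer = seq.substr(0, k);
    uint64_t hVec[h]; // h must be a compile-time constant in standard C++
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    // i + k < seq.length() avoids unsigned underflow when seq is shorter than k
    for (size_t i = 0; i + k < seq.length(); i++)
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }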
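
One common way to obtain several hash values per k-mer without rehashing the sequence is to derive the extra values from a single base hash with a cheap mixing step. The sketch below shows that idea with a multiply followed by an xor-shift. The function toyMultiHash, the mixing scheme, and the constants mixSeed and mixShift are assumptions made for illustration and may differ from what NTM64 does internally.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Derive h hash values from one base hash of a k-mer of length k.
// The mixing constants are illustrative assumptions, not taken from nthash.hpp.
inline std::vector<uint64_t> toyMultiHash(uint64_t baseHash, unsigned k, unsigned h) {
    assert(h > 0);
    const uint64_t mixSeed = 0x90b45d39fb6da1faULL;
    const unsigned mixShift = 27;
    std::vector<uint64_t> out(h);
    out[0] = baseHash; // the first value is the base hash itself
    for (unsigned i = 1; i < h; ++i) {
        // A different multiplier for each index i gives a different derived value.
        uint64_t t = baseHash * (i ^ (k * mixSeed));
        t ^= t >> mixShift; // fold high bits into low bits
        out[i] = t;
    }
    return out;
}
```

Using a std::vector sized at run time also sidesteps the variable-length array, which standard C++ does not provide.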

ntHashIterator

ntHashIterator provides an iterator interface for running ntHash over a sequence.

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

    ntHashIterator itr(seq, h, k);
    while (itr != itr.end())
    {
        ... use *itr ...
        ++itr;
    }
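
For readers curious how such an iterator can be built on top of a rolling hash, here is a minimal self-contained sketch that supports the same loop shape (operator*, operator++, operator!=, and end()). ToyHashIterator is a hypothetical class that yields a single hash per k-mer and uses illustrative seed constants. It is not the ntHashIterator from ntHashIterator.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative per-base seeds (assumed values, not the constants in nthash.hpp).
inline uint64_t toySeed(char c) {
    switch (c) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xbf58476d1ce4e5b9ULL;
        case 'G': return 0x94d049bb133111ebULL;
        case 'T': return 0x2545f4914f6cdd1dULL;
        default:  return 0; // other characters contribute nothing in this sketch
    }
}

inline uint64_t rol(uint64_t v, unsigned r) { r &= 63; return r ? (v << r) | (v >> (64 - r)) : v; }

class ToyHashIterator {
public:
    // End sentinel: position npos, no sequence attached.
    ToyHashIterator() : seq_(nullptr), k_(0), pos_(std::string::npos), h_(0) {}

    // The iterator keeps a pointer to seq, so seq must outlive the iterator.
    ToyHashIterator(const std::string& seq, unsigned k)
        : seq_(&seq), k_(k), pos_(0), h_(0) {
        assert(k > 0);
        if (seq.size() < k) { pos_ = std::string::npos; return; } // no k-mers
        for (unsigned i = 0; i < k; ++i)
            h_ = rol(h_, 1) ^ toySeed(seq[i]);
    }

    static ToyHashIterator end() { return ToyHashIterator(); }

    uint64_t operator*() const { return h_; }     // hash of the current k-mer
    std::size_t pos() const { return pos_; }      // start of the current k-mer

    ToyHashIterator& operator++() {
        if (pos_ == std::string::npos) return *this;       // already at the end
        if (pos_ + k_ >= seq_->size()) {                   // last k-mer consumed
            pos_ = std::string::npos;
            return *this;
        }
        // Roll: cancel the outgoing base, add the incoming base.
        h_ = rol(h_, 1) ^ rol(toySeed((*seq_)[pos_]), k_) ^ toySeed((*seq_)[pos_ + k_]);
        ++pos_;
        return *this;
    }

    bool operator!=(const ToyHashIterator& other) const { return pos_ != other.pos_; }

private:
    const std::string* seq_;
    unsigned k_;
    std::size_t pos_;
    uint64_t h_;
};
```

A loop of the form `for (ToyHashIterator it(seq, k); it != it.end(); ++it)` then visits seq.size() - k + 1 k-mers, or none when seq is shorter than k.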

Usage example (C++)

Outputting hash values of all k-mers in a sequence

    #include <iostream>
    #include <string>
    #include "ntHashIterator.hpp"

    int main(int argc, const char* argv[])
    {
        /* test sequence */
        std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

        /* k is the k-mer length */
        unsigned k = 70;

        /* h is the number of hashes for each k-mer */
        unsigned h = 1;

        /* init ntHash state and compute hash values for first k-mer */
        ntHashIterator itr(seq, h, k);
        while (itr != itr.end()) {
            std::cout << (*itr)[0] << std::endl;
            ++itr;
        }

        return 0;
    }

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397
