---
title: 'sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets'
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California, Davis
    index: 1
  - name: Woods Hole Oceanographic Institution
    index: 2
  - name: Chan-Zuckerberg Biohub
    index: 3
  - name: Pennsylvania State University
    index: 4
  - name: MPI for Evolutionary Biology
    index: 5
  - name: Swedish Defence Research Agency (FOI)
    index: 6
  - name: National Bioforensic Analysis Center
    index: 7
  - name: Washington University in St Louis
    index: 8
  - name: No affiliation
    index: 9
date: 27 Mar 2023
bibliography: paper.bib
---

Summary

sourmash is a command line tool and Python library for sketching collections of DNA, RNA, and amino acid k-mers for biological sequence search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between data sets of different sizes [@gather], including petabase-scale database search [@branchwater]. Since release 4.x, sourmash has been built on top of Rust and provides an experimental Rust interface.

FracMinHash sketching is a lossy compression approach that represents data sets using a "fractional" sketch containing roughly $1/S$ of the original k-mers, where $S$ is a user-chosen scaling factor. Like other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets and is accurate even for data sets of very different sizes.
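To make the idea concrete, the following is a minimal, illustrative sketch of fractional sketching in pure Python. It is not sourmash's implementation: the hash function (BLAKE2b truncated to 64 bits), the function names, and the parameter values are all assumptions chosen for this example, and details such as canonical (strand-independent) k-mers are omitted. A k-mer is retained only if its hash falls in the lowest $1/S$ of the hash space, so every sketch built with the same $S$ selects from the same subset of k-mers and the sketches can be compared directly.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used in this example


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash; an assumption of this sketch)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(sequence, ksize, scaled):
    """Return the set of k-mer hashes that fall in the lowest 1/scaled of hash space."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(sequence) - ksize + 1):
        h = hash_kmer(sequence[i:i + ksize])
        if h < threshold:
            sketch.add(h)
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sketches: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the hashes in sketch a that are also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    # Hypothetical data: a random "genome" and a fragment drawn from it.
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(50_000))
    fragment = genome[10_000:20_000]

    big = frac_minhash(genome, ksize=21, scaled=100)
    small = frac_minhash(fragment, ksize=21, scaled=100)

    print("sketch sizes:", len(big), len(small))
    print("Jaccard:", jaccard(small, big))
    print("containment of fragment in genome:", containment(small, big))
```

Because retention depends only on a k-mer's hash value, the sketch of a subsequence is always a subset of the sketch of the full sequence. This is why containment remains informative when the two data sets differ greatly in size, whereas Jaccard similarity shrinks as the larger set grows.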

Since sourmash v1 was released in 2016 [@Brown:2016], it has expanded to support new database types and many more command line functions. In particular, sourmash now has robust support for both Jaccard similarity and containment calculations, which enables analysis and comparison of data sets of different sizes, including large metagenomic samples. As of v4.4, sourmash can convert Jaccard and containment estimates to estimated Average Nucleotide Identity (ANI) values, which can provide improved biological context to sketch comparisons [@hera2022debiasing].
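As a rough illustration of how a containment value relates to ANI, the sketch below uses a common simplifying assumption: each base mutates independently, so a k-mer of length $k$ survives unchanged with probability $\mathrm{ANI}^k$, and containment $C$ gives the point estimate $\mathrm{ANI} \approx C^{1/k}$. The function names are hypothetical, and this simplified formula is not claimed to be the exact computation sourmash performs; see [@hera2022debiasing] for the method sourmash cites.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from k-mer containment, assuming independent point mutations."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Point estimate of ANI from Jaccard similarity.

    Assumes both data sets contain the same number of k-mers, in which case
    containment C and Jaccard J are related by C = 2J / (1 + J).
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    return containment_to_ani(2.0 * jaccard / (1.0 + jaccard), ksize)


if __name__ == "__main__":
    # Hypothetical containment value at k=31.
    print(containment_to_ani(0.5, ksize=31))
```

The Jaccard variant relies on the equal-size assumption, so it is less suitable when comparing, for example, a single genome against a large metagenome; containment does not need that assumption.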

Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data sets are readily available in biology, and the field needs lightweight computational methods for searching and summarizing the content of both public and private collections. sourmash provides a flexible set of programmatic functionality for this purpose, together with a robust and well-tested command-line interface. It has been used in well over 200 publications (based on citations of @Brown:2016 and @Pierce:2019), and it continues to expand in functionality.

Acknowledgements

This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative [GBMF4551 to CTB].

Notice: This manuscript has been authored by BNBI under Contract No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the publisher, by accepting the article for publication, acknowledges that the USG retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for USG purposes. Views and conclusions contained herein are those of the authors and should not be interpreted to represent policies, expressed or implied, of the DHS.
