---
title: "sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets"
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California, Davis
    index: 1
  - name: Woods Hole Oceanographic Institution
    index: 2
  - name: Chan-Zuckerberg Biohub
    index: 3
  - name: Pennsylvania State University
    index: 4
  - name: MPI for Evolutionary Biology
    index: 5
  - name: Swedish Defence Research Agency (FOI)
    index: 6
  - name: National Bioforensic Analysis Center
    index: 7
  - name: Washington University in St Louis
    index: 8
  - name: No affiliation
    index: 9
date: 27 Mar 2023
bibliography: paper.bib
---
sourmash is a command-line tool and Python library for sketching
collections of DNA, RNA, and amino acid k-mers for biological sequence
search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between data sets of different sizes [@gather], including petabase-scale database search [@branchwater]. As of release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.
FracMinHash sketching is a lossy compression approach that represents
data sets using a "fractional" sketch containing approximately $1/S$ of the original
k-mers, where $S$ is a chosen scaling factor. Like other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets, and it remains accurate even for data sets of very different sizes.
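
To make the idea concrete, the following is a minimal pure-Python sketch of FracMinHash-style sketching. It is an illustration under stated assumptions, not sourmash's implementation: the helper names (`frac_minhash`, `jaccard`, `containment`), the BLAKE2b stand-in hash, and the small `k` and `scaled` values are all chosen here for demonstration, and the real implementation differs in its hash function and k-mer handling. The sketch keeps only those k-mer hashes that fall in the lowest $1/S$ of a 64-bit hash space and then compares two sketches as plain sets.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep hashes below MAX_HASH / scaled: roughly 1/scaled of distinct k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical data: a random "genome" and a fragment cut from it.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
fragment = genome[1000:2000]

big = frac_minhash(genome)
small = frac_minhash(fragment)

# Every k-mer of the fragment occurs in the genome, so the fragment's
# sketch is a subset of the genome's sketch.
print("containment(fragment in genome):", containment(small, big))
print("jaccard(fragment, genome):", jaccard(small, big))
```

Because the same hash threshold is applied to every data set, the sketch of a subsequence is always a subset of the sketch of the full sequence. This is why containment stays meaningful when the two data sets differ greatly in size, whereas Jaccard similarity shrinks as the size difference grows.
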
Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command-line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data sets
of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these similarity and containment values to estimated
Average Nucleotide Identity (ANI) values, which give sketch comparisons
more biological context [@hera2022debiasing].
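
As a rough illustration of how a containment value relates to ANI, the snippet below uses a simplified point estimate. It assumes a simple model in which each base mutates independently, so a k-mer of length $k$ survives unchanged with probability $\mathrm{ANI}^k$ and containment $C \approx \mathrm{ANI}^k$, which gives $\mathrm{ANI} \approx C^{1/k}$. The function name `containment_to_ani` is hypothetical, and the estimator described in [@hera2022debiasing] is more careful than this sketch.

```python
def containment_to_ani(containment, ksize):
    """Simplified point estimate: ANI ~= containment ** (1 / ksize).

    Assumes independent point mutations; this is an illustrative
    approximation, not the estimator used by any particular tool.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


# Example: a containment of 0.5 at k=31 corresponds to a high ANI,
# because a single mismatch removes up to k overlapping k-mers.
print(containment_to_ani(0.5, 31))
```

The example shows why raw containment values can look deceptively low for closely related genomes: small per-base differences are amplified across every k-mer that overlaps them.
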
Large collections of genomes, transcriptomes, and raw sequencing data
sets are readily available in biology, and the field needs lightweight
computational methods for searching and summarizing the content of
both public and private collections. sourmash provides a flexible set
of programmatic functionality for this purpose, together with a robust
and well-tested command-line interface. It has been used in well over 200
publications (based on citations of @Brown:2016 and @Pierce:2019), and it continues
to expand in functionality.
This work is funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB].
Notice: This manuscript has been authored by BNBI under Contract
No. HSHQDC-15-C-00064 with the DHS. The US Government retains
and the publisher, by accepting the article for publication, acknowledges
that the USG retains a non-exclusive, paid-up, irrevocable, world-wide
license to publish or reproduce the published form of this manuscript,
or allow others to do so, for USG purposes. Views and conclusions
contained herein are those of the authors and should not be interpreted
to represent policies, expressed or implied, of the DHS.