---
title: "sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets"
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California, Davis
    index: 1
  - name: Woods Hole Oceanographic Institution
    index: 2
  - name: Chan-Zuckerberg Biohub
    index: 3
  - name: Pennsylvania State University
    index: 4
  - name: MPI for Evolutionary Biology
    index: 5
  - name: Swedish Defence Research Agency (FOI)
    index: 6
  - name: National Bioforensic Analysis Center
    index: 7
  - name: Washington University in St Louis
    index: 8
  - name: No affiliation
    index: 9
date: 27 Mar 2023
bibliography: paper.bib
---

Summary

sourmash is a command-line tool and Python library for sketching collections of DNA, RNA, and amino acid k-mers for biological sequence search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between datasets of different sizes [@gather], including petabase-scale database search [@branchwater]. From release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.

FracMinHash sketching is a lossy compression approach that represents data sets using a "fractional" sketch containing roughly $1/S$ of the original distinct k-mers, where $S$ is a user-chosen scaling factor. Like other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets, and it remains accurate even for data sets of very different sizes.
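To make the idea concrete, the following is a minimal, standard-library-only sketch of FracMinHash, not sourmash's implementation. It assumes a 64-bit hash taken from BLAKE2b (sourmash itself uses a different hash function), ignores reverse complements, and uses hypothetical helper names (`frac_minhash`, `jaccard`, `containment`). A k-mer is kept when its hash falls in the lowest $1/S$ of the hash space, so two sketches built with the same $S$ retain the same k-mers wherever they overlap.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep the hashes that fall below MAX_HASH / scaled (about 1/S of them)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity |A & B| / |A | B| between two sketches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of A's k-mers that are also present in B."""
    return len(a & b) / len(a) if a else 0.0


# Usage: a short fragment compared against the longer sequence it came from.
random.seed(42)
genome = "".join(random.choices("ACGT", k=5000))
fragment = genome[1000:2000]

big = frac_minhash(genome)
small = frac_minhash(fragment)
print("sketch sizes:", len(big), len(small))
print("Jaccard:", jaccard(small, big))
print("containment of fragment in genome:", containment(small, big))
```

Because the fragment's k-mers are a subset of the genome's, every hash retained for the fragment is also retained for the genome. Containment therefore stays high while Jaccard similarity is pulled down by the size difference, which is why containment is the more informative measure for data sets of very different sizes.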

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded to support new database types and many more command-line functions. In particular, sourmash now has robust support for both Jaccard similarity and containment calculations, which enables analysis and comparison of data sets of different sizes, including large metagenomic samples. As of v4.4, sourmash can convert these similarity and containment values to estimated Average Nucleotide Identity (ANI) values, which can provide improved biological context to sketch comparisons [@hera2022debiasing].
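As a rough illustration of how a containment value relates to ANI, the sketch below assumes a simple model in which each base mutates independently, so a k-mer survives unchanged with probability $\mathrm{ANI}^k$ and containment $C \approx \mathrm{ANI}^k$. Inverting this gives the point estimate $\mathrm{ANI} \approx C^{1/k}$. The function name `containment_to_ani` is hypothetical, and this is only a point estimate under that assumed model. It does not reproduce the confidence intervals or bias corrections described in [@hera2022debiasing].

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from containment, assuming C ~= ANI ** ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


# Usage: the same containment implies a different ANI at different k-mer sizes.
for ksize in (21, 31, 51):
    print(ksize, containment_to_ani(0.5, ksize))
```

The k-th root means that even modest containment corresponds to high nucleotide identity at typical k-mer sizes, since a single mismatch disrupts up to k overlapping k-mers.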

Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data sets are readily available in biology, and the field needs lightweight computational methods for searching and summarizing the content of both public and private collections. sourmash provides a flexible set of programmatic functions for this purpose, together with a robust and well-tested command-line interface. It has been used in well over 200 publications (based on citations of @Brown:2016 and @Pierce:2019), and it continues to expand in functionality.

Acknowledgements

This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative [GBMF4551 to CTB].

Notice: This manuscript has been authored by BNBI under Contract No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the publisher, by accepting the article for publication, acknowledges that the USG retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for USG purposes. Views and conclusions contained herein are those of the authors and should not be interpreted to represent policies, expressed or implied, of the DHS.

References