fast clustering of many large sketches - kspider #2271

ctb · 2022-09-08T13:25:22Z

@mr-eyes has been working steadily on using kSpider (docs and repo) to cluster many large collections of k-mers, and has achieved some impressive results.

This issue is b/c I wanted to link some of the kSpider work into this repo so that it was discoverable by sourmash aficionados!

@mr-eyes if you have a tutorial or some guidance for people wanting to try out kSpider with sourmash sketches, please point to it here!

mr-eyes · 2022-09-09T00:51:40Z

I will be working on updating the docs to include the latest updates of kSpider dev and will add some tutorials on how to run it on sourmash sigs. Will update this issue when I am done.

@mr-eyes

This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram: `cluster_size, count`

Context for some things I tried:

- try using `petgraph` directly and removing the `rustworkx` dependency
  > nope, `rustworkx-core` adds `connected_components`, which returns the connected components rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps.
- try using `extend_with_edges` instead of the `add_edge` logic
  > nope, only in `petgraph`

**Punted Issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
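To make the clustering idea in the PR description concrete, here is a minimal Python sketch of threshold-based connected-component clustering. It is not the plugin's implementation, which is written in Rust on top of `rustworkx-core`; this version swaps in a plain union-find instead of a graph library. The column names (`query_name`, `match_name`, `jaccard`), the function name `cluster_pairs`, the strict `>` comparison, and numbering components from 1 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cluster_pairs(rows, threshold, similarity_column="jaccard"):
    """Group names into connected components of a similarity graph.

    `rows` is an iterable of dicts; the keys "query_name" and "match_name"
    and the default similarity column are hypothetical, not the plugin's
    documented CSV schema.
    """
    parent = {}

    def find(name):
        # Register unseen names so that singletons survive in the output.
        parent.setdefault(name, name)
        root = name
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[name] != root:
            parent[name], name = root, parent[name]
        return root

    for row in rows:
        a = find(row["query_name"])
        b = find(row["match_name"])
        # Only pairs above the threshold create an edge (strict ">" is an
        # assumption), but both names were already registered by find().
        if float(row[similarity_column]) > threshold and a != b:
            parent[a] = b

    components = defaultdict(list)
    for name in list(parent):
        components[find(name)].append(name)

    # Largest clusters first; ties broken alphabetically for stable output.
    clusters = sorted(
        (sorted(members) for members in components.values()),
        key=lambda members: (-len(members), members),
    )
    histogram = Counter(len(members) for members in clusters)
    return clusters, histogram


# Made-up example rows, standing in for a pairwise/multisearch CSV.
rows = [
    {"query_name": "a", "match_name": "b", "jaccard": "0.9"},
    {"query_name": "b", "match_name": "c", "jaccard": "0.8"},
    {"query_name": "c", "match_name": "d", "jaccard": "0.1"},
    {"query_name": "e", "match_name": "e", "jaccard": "1.0"},
]
clusters, histogram = cluster_pairs(rows, threshold=0.5)

# Mirror the two outputs described above: identities, then size histogram.
for i, members in enumerate(clusters, start=1):
    print(f"Component_{i}, {';'.join(members)}")
for size, count in sorted(histogram.items()):
    print(f"{size}, {count}")
```

Because every name is registered before the threshold check, `d` and `e` appear as singleton clusters even though neither has an edge that passes the threshold. This corresponds to the "add all nodes to the graph" behaviour the description mentions.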

ctb mentioned this issue Sep 8, 2022

can we update clustering results with new signatures? #2272

Open

mr-eyes mentioned this issue Sep 9, 2022

Link kSpider updates with sourmash docs dib-lab/kSpider#23

Open

ctb mentioned this issue Sep 25, 2022

how much memory does sourmash compare need? #2299

Open

This was referenced Feb 26, 2024

Adapt the kSpider's algorithm in pairwise comparisons sourmash-bio/sourmash_plugin_branchwater#219

Open

MRG: Add graph-based clustering sourmash-bio/sourmash_plugin_branchwater#234

Merged
