Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] replace khmer.Nodegraph with rust nodegraph #799

Merged
merged 25 commits into from Apr 23, 2020
Merged

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Dec 17, 2019

Main benefits:

  • Read/write Nodegraph to/from buffers (no intermediate temporary files)
  • Remove dep on khmer (but can still convert this Nodegraph into a khmer.Nodegraph)

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber luizirber added the rust label Dec 17, 2019
@codecov
Copy link

@codecov codecov bot commented Dec 17, 2019

Codecov Report

Merging #799 into master will decrease coverage by 11.87%.
The diff coverage is 84.21%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #799       +/-   ##
===========================================
- Coverage   91.61%   79.74%   -11.88%     
===========================================
  Files          71       97       +26     
  Lines        5071     7706     +2635     
===========================================
+ Hits         4646     6145     +1499     
- Misses        425     1561     +1136     
Flag Coverage Δ
#rusttests 53.79% <77.20%> (?)
Impacted Files Coverage Δ
sourmash/commands.py 86.23% <ø> (+1.59%) ⬆️
src/core/src/ffi/nodegraph.rs 0.00% <0.00%> (ø)
src/core/src/ffi/signature.rs 0.00% <0.00%> (ø)
src/core/src/index/sbt/mod.rs 74.51% <ø> (ø)
sourmash/nodegraph.py 93.75% <93.75%> (ø)
src/core/src/sketch/nodegraph.rs 84.74% <93.78%> (ø)
sourmash/sbtmh.py 92.72% <95.00%> (-0.13%) ⬇️
sourmash/cli/info.py 100.00% <100.00%> (ø)
sourmash/sbt.py 87.80% <100.00%> (+1.97%) ⬆️
sourmash/sbt_storage.py 95.69% <100.00%> (+0.19%) ⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a42508f...eeed865. Read the comment docs.

@luizirber luizirber added this to the post-3.0 milestone Dec 30, 2019
@luizirber luizirber force-pushed the rust_nodegraph branch 2 times, most recently from c133ba7 to a81a466 Compare Jan 7, 2020
@luizirber luizirber removed this from the post-3.0 milestone Jan 14, 2020
@luizirber luizirber added this to the 3.2 milestone Jan 14, 2020
@luizirber luizirber removed this from the 3.2 milestone Jan 14, 2020
@luizirber luizirber added this to the 3.3 milestone Jan 14, 2020
@luizirber luizirber force-pushed the rust_nodegraph branch 2 times, most recently from dd7b391 to bfedd50 Compare Apr 15, 2020
@olgabot
Copy link
Collaborator

@olgabot olgabot commented Apr 15, 2020

So much force-pushing ... ;) exciting though! Any word on how this decreases the footprint of SBTs?

@luizirber
Copy link
Member Author

@luizirber luizirber commented Apr 15, 2020

So much force-pushing ... ;) exciting though! Any word on how this decreases the footprint of SBTs?

Hmm, this doesn't change much in that front... Main issue with SBTs is keeping internal nodes in memory, and sometimes you want to keep them, sometimes you don't. For search you can unload it once it's processed, and #784 does that.

For gather it is more complicated, because you will probably check the top nodes in the tree again, but probably not so much of the ones near the leaves. We don't have a good way to deal with that yet...

Note that calling node.unload() should be safe, because the node knows how to reload the data from the storage. It will take longer (because you need to read the data again into memory), but it is a way of saving memory.

@luizirber
Copy link
Member Author

@luizirber luizirber commented Apr 15, 2020

What this PR avoids is writing/reading the data to a temporary file, because the Rust Nodegraph can be read from a memory buffer, which khmer didn't support. It used to be

  • Read from storage into memory buffer
  • Write memory buffer to temp file
  • load nodegraph from temp file

now it is

  • Read from storage into memory buffer
  • load nodegraph from memory buffer

@ctb
Copy link
Contributor

@ctb ctb commented Apr 16, 2020

Side note, at one point David Koslicki was interested in reusing our Bloom filter implementation. Might see if this fits his needs.

@luizirber
Copy link
Member Author

@luizirber luizirber commented Apr 17, 2020

Side note, at one point David Koslicki was interested in reusing our Bloom filter implementation. Might see if this fits his needs.

This is not a full reimplementation of the khmer.Nodegraph, and I think he was using other features that sourmash doesn't use (so I didn't implement them).

And I think CMash is also using kmc nowadays.

@luizirber luizirber force-pushed the rust_nodegraph branch 9 times, most recently from 6e9f860 to 080f46b Compare Apr 19, 2020
@ctb
Copy link
Contributor

@ctb ctb commented Apr 21, 2020

@ctb
Copy link
Contributor

@ctb ctb commented Apr 23, 2020

How does writing compressed nodegraphs work, then? I'm guessing we would need to add that as a feature to SBTs?

@luizirber
Copy link
Member Author

@luizirber luizirber commented Apr 23, 2020

How does writing compressed nodegraphs work, then? I'm guessing we would need to add that as a feature to SBTs?

SBTs are currently v4; saving compressed nodes imply a v5 (because if we keep the same version number, older versions of sourmash can't load a v4 anymore if it is compressed...)

There is already experimental support for loading v5 in #694, but we still save as v4 by default. Since we don't save as v5 yet, there is no officially supported v5 SBTs in existence, so it's OK to update the format (probably also adding some field for the LocalizedSBT in #925?)

(Currently the difference from v4 to v5 is splitting leaves and internal nodes, which allows saving LinearIndex in the same format. Also, only the leaves are required, and maybe start calling them datasets or signatures is also good)

I suggest merging this PR with support for loading compressed nodegraphs, and implement v5 and other changes in #648

@ctb
Copy link
Contributor

@ctb ctb commented Apr 23, 2020

@luizirber luizirber merged commit bada50e into master Apr 23, 2020
22 of 24 checks passed
@luizirber luizirber deleted the rust_nodegraph branch May 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants