Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Move greyhound-core into sourmash #1238

Merged
merged 17 commits into from Feb 14, 2022
Merged

[MRG] Move greyhound-core into sourmash #1238

merged 17 commits into from Feb 14, 2022

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Nov 17, 2020

This PR moves the greyhound-core (RevIndex and gather) into sourmash. It doesn't bring the CLI, web server or browser frontend.
(The CLI should probably be exposed here at some point, the web server and frontend should go into wort).

I created a new feature on the Rust side, "experimental". The idea is to allow experimentation without making stability guarantees, including passing all checks required for merging (like wasm support). #1221 is another example of an "experimental" feature.
In order to avoid piling up experimental features, I also propose a requirement that sourmash-Python CAN'T use the "experimental" feature. This keeps us honest, and force stabilization in the Rust side =]
This is sort of equivalent to the nightly features in the Rust compiler.

Lots left to do, but a small list:

  • properly use the parallel feature to expose rayon
  • fix wasm compilation

@codecov
Copy link

codecov bot commented Nov 17, 2020

Codecov Report

Merging #1238 (f016123) into latest (f9a2e96) will increase coverage by 6.35%.
The diff coverage is 65.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1238      +/-   ##
==========================================
+ Coverage   83.58%   89.94%   +6.35%     
==========================================
  Files         113       88      -25     
  Lines       12185     8592    -3593     
  Branches     1684     1705      +21     
==========================================
- Hits        10185     7728    -2457     
+ Misses       1743      600    -1143     
- Partials      257      264       +7     
Flag Coverage Δ
python 89.94% <65.17%> (-0.32%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/utils.py 78.94% <0.00%> (ø)
src/sourmash/index/revindex.py 63.46% <63.46%> (ø)
src/sourmash/index/__init__.py 95.96% <100.00%> (ø)
src/core/src/encodings.rs
src/core/src/ffi/mod.rs
src/core/src/ffi/utils.rs
src/core/src/index/linear.rs
src/core/src/index/mod.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/index/sbt/mod.rs
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9a2e96...f016123. Read the comment docs.

@ctb
Copy link
Contributor

ctb commented Nov 17, 2020

In order to avoid piling up experimental features, I also propose a requirement that sourmash-Python CAN'T use the "experimental" feature. This keeps us honest, and force stabilization in the Rust side =]

Hmm. 👎

I instead propose that we add an "experimental" subcommand in sourmash CLI that is explicitly not stable.

Alternatively, as long as it's exposed by the Rust API to Python, I can use it in scripts. So we could keep it away from the CLI, at the cost of not being able to evolve a better CLI as we play with use cases.

Motivation: I do most of my experimentation in Python, as do others, and I think sourmash has benefited substantially from this kind of experimentation. So I want to keep this available!

@luizirber luizirber force-pushed the greyhound branch 2 times, most recently from ab74189 to b37b51f Compare November 17, 2020 17:49
@luizirber
Copy link
Member Author

In order to avoid piling up experimental features, I also propose a requirement that sourmash-Python CAN'T use the "experimental" feature. This keeps us honest, and force stabilization in the Rust side =]

Hmm. -1

I instead propose that we add an "experimental" subcommand in sourmash CLI that is explicitly not stable.

Alternatively, as long as it's exposed by the Rust API to Python, I can use it in scripts. So we could keep it away from the CLI, at the cost of not being able to evolve a better CLI as we play with use cases.

Motivation: I do most of my experimentation in Python, as do others, and I think sourmash has benefited substantially from this kind of experimentation. So I want to keep this available!

Fair point. Note that we can request "experimental" features from the Rust side on any sourmash PRs just fine (add "--features experimental" to the cargo invocation in setup.py), what I'm blocking is merging PRs that use the experimental feature without also "stabilizing" it on the Rust side.

@ctb
Copy link
Contributor

ctb commented Nov 17, 2020

is there value in combining this with #1131 at all? (I don't really know, am wondering.)

@luizirber
Copy link
Member Author

is there value in combining this with #1131 at all? (I don't really know, am wondering.)

Probably... I see the LcaDB as a superset of RevIndex, and could probably extend the latter with the taxinfo bits from the former, but this becomes a larger PR than this one.

@luizirber
Copy link
Member Author

as I suspected, shoehorning greyhound into how sourmash does gather is more
complicated than it looks.

First hurdle is that I did the greyhound demos using on-disk signatures (and paths),
but sourmash indices also work with in-memory signatures
(create an empty index, add in-memory sigs).
So, I've been modifying greyhound to support both (not hard, but does change
some of the internal logic).

Next, the gather issue discussed in #1263
I was thinking about making another method in Index for now, counter_gather,
for the greyhound behavior (returning a Counter for the matches in the index),
with a blanket implementation behaving like a LinearIndex (which... I think it
would also work by default for the SBT?), and figuring out later how to
integrate back into the regular Index.gather.
(side benefit: testing how well counter-gather works on SBTs, which I think
might not be terrible on runtime and actually stellar on memory consumption?)

@ctb
Copy link
Contributor

ctb commented Feb 11, 2022

Is fine by me!

add getset, wip parallel feature flag

wip colors

simpler impl first

parallel hash to color construction

wip

Revert "wip"

This reverts commit d65da76.

must insert small_color into large_color before setting it

trying out a small set impl

try compressing colors inside reduce

size test

use new released vec-collections

update cbindgen

make parallel/sequential more maintainable

some notes on partial serde

start revindex in py

start ffi

first test passing

second test passing

modify colors.update to accept an iter instead of slices

color count tracker

update sourmash.h

blanket implementation for counter_gather

start working on memstorage

niv update

avoid a mut ref in save by using lots of mutexes

fix codecov path fixes

expose InnerStorage

basic test passing in-memory sigs working

revert counter_gather to gather in search.py

lint cleanup

cbindgen fixes

moved MemStorage to #1463

implement signatures()

fix initialization
@luizirber luizirber changed the title [WIP] move greyhound-core into sourmash [MRG] Move greyhound-core into sourmash Feb 14, 2022
@luizirber luizirber merged commit 13bf0d5 into latest Feb 14, 2022
@luizirber luizirber deleted the greyhound branch February 14, 2022 05:59
@mr-eyes
Copy link
Member

mr-eyes commented Feb 15, 2022

🎉 🎉 🎉 🎉 🎉 🎉

@ctb
Copy link
Contributor

ctb commented Feb 17, 2022

After a frustrating 15 minutes of debugging, I feel like the reorganization of index.py into index/ should have received a mention in the PR description :)

ctb added a commit that referenced this pull request Dec 1, 2023
On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing
https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to
`0.12.0`.

This is mostly not exposed thru the FFI yet. Tests from the from the
in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is
not well supported (doesn't allow saving/loading from disk, for
example), and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future.
It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet.

#2727 is a better starting point for how `Index` abc/trait should work
acrosss Python/Rust, and I started moving the Rust indices to start from
a `LinearIndex` and later specialize into a `RevIndex`, which will make
easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the
ones for GTDB databases fit this easily (storage = `ZipStorage`,
manifest is read from the zipfile), but a file paths list does too
(manifest built from the file paths, storage = `FSStorage`). This goes
in a bit of different direction than #1901, which was extending
`Storage` to support more functionality. I think `Storage` should stay
pretty bare and mostly deal with loading/saving data, but not much
knowledge of **what** data is there (this is covered with `Manifest`).

- a `CollectionSet` is a consistent collection of signatures. Consistent
here means: same k-size, downsample-compatible for scaled, same moltype.
You can create a `CollectionSet` by running `.select()` on a
`Collection`. `CollectionSet` is required for building indices (because
we shouldn't be building indices mixing k-size/moltype), and greatly
simplifies the logic in many places because it is not necessary to check
for compatibility.

- `LinearIndex` was rewritten based on `Collection` (and also the
`greyhound`/`branchwater` parallelism), and this supports the "parallel
search without an index" use case. There is no index construction per se
here, pretty much just a thin layer on top of `Collection` implementing
functionality expected from indices.

- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
relevant function definitions are in place, need to barrage it with
tests (and potentially exposing to Python and reusing the ones there in
#2726)

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
following
  the Python API.
- `Collection` is a new abstraction for working with a set of
signatures. A
collection needs a `Storage` for holding the signatures (on-disk,
in-memory,
or remotely), and a `Manifest` to describe the metadata for each
signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for
higher-level data
  manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete
`Storage` from
  a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for
debugging and
  testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so
`Murmur64Protein` instead of `murmur64_protein`
- `LinearIndex` is now implemented as a thin layer on top of
`Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex`
(this is
the Greyhound version of revindex, in-memory only). It was also
refactored
internally to build a version of a `LinearIndex` that will be merged in
the
  future with `sourmash::index::LinearIndex`
- Move `select` method from `Index` trait into a separate `Select`
trait,
  and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the
generic. Now
  it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be
disabled by downstream projects to limit bringing heavy dependencies
like rocksdb
- Add new `rkyv` feature (disabled by default), making `MinHash`
serializable
  with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid
breaking changes)
- Reduce features combinations on Rust checks (takes much less time to
run)
- Disable `musllinux` wheels (need to figure out how to build rocksdb
for it)

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: C. Titus Brown <titus@idyll.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants