Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: use correct denominator in f_unique_to_query #3138

Merged
merged 6 commits into from
May 10, 2024

Conversation

ctb
Copy link
Contributor

@ctb ctb commented May 4, 2024

This PR fixes an issue introduced in #2943 where we introduced a subtly broken calculation that uses the current size of the query metagenome as the denominator for the f_unique_to_query calculation.

Fixes #3137

This PR also adds some commented-out test code that demonstrates #3139 / sourmash-bio/sourmash_plugin_branchwater#322. That's something I haven't been able to debug, so I'd suggest fixing that independently - I'd rather fix a problem now, rather than waiting until we can fix multiple problems at some later indeterminate time :).

Notes

Copy link

codecov bot commented May 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.65%. Comparing base (880b488) to head (14f9c3b).

Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3138      +/-   ##
==========================================
+ Coverage   86.62%   86.65%   +0.02%     
==========================================
  Files         136      136              
  Lines       15818    15818              
  Branches     2713     2713              
==========================================
+ Hits        13703    13707       +4     
+ Misses       1805     1801       -4     
  Partials      310      310              
Flag Coverage Δ
hypothesis-py 25.36% <ø> (ø)
python 92.32% <ø> (ø)
rust 62.09% <100.00%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bluegenes
Copy link
Contributor

this change looks good. There is a test for the calculate_gather_stats function in the same file, but it doesn't have a query subtraction (just the initial calculation), so it doesn't reveal the issue fixed here.

I didn't originally realize - the revindex tests are here:

that would be a good spot to add tests

@ctb
Copy link
Contributor Author

ctb commented May 10, 2024

I did not know this: you have to specifically run the tests with cargo test --features branchwater or maybe cargo test-all-features in order to run the revindex tests!

(Our github actions does the right thing, I think. I just was struggling to get cargo test to fail ;)

@bluegenes
Copy link
Contributor

I did not know this: you have to specifically run the tests with cargo test --feature branchwater or maybe cargo test-all-features in order to run the revindex tests!

This confused me at first too + I regularly need to remind myself of cargo testing syntax. Do we have any sourmash rust dev docs yet?

@ctb
Copy link
Contributor Author

ctb commented May 10, 2024

the code in linear.rs doesn't seem to get run at all by the tests, and in any case doesn't calculate most of the stats. So I think we can safely ignore it. Before merging this, I'll make sure there's an appropriate issue update elsewhere (either a new issue, or updating an existing issue, gotta go check to figure out which is appropriate.)

@ctb
Copy link
Contributor Author

ctb commented May 10, 2024

note that when I started adding tests, I got problematic results per #3139 / sourmash-bio/sourmash_plugin_branchwater#322.

The current tests do check the denominator issue in #3137 so I think we should merge it and leave #3139 to another PR. Let me know if you disagree @bluegenes !

@ctb ctb changed the title WIP: use correct denominator in f_unique_to_query MRG: use correct denominator in f_unique_to_query May 10, 2024
@ctb
Copy link
Contributor Author

ctb commented May 10, 2024

Ready for review & (assuming tests pass) merge!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@bluegenes bluegenes merged commit 777fa20 into latest May 10, 2024
40 checks passed
@bluegenes bluegenes deleted the fix_f_unique_to_query branch May 10, 2024 17:23
ctb added a commit that referenced this pull request Jun 9, 2024
…d by plugins (#3193)

This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where
RocksDB-based `gather` was
incorrectly subtracting hashes multiple times from the counter in
situations of high redundancy.

For example, consider this Venn diagram of the 3-way intersection
between three sketches:


![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a)

When a metagenome contains the union of all three of these sketches, the
broken implementation would subtract the `10` at the center multiple
times. This was caused by removing hashes from the matches, rather than
the intersection, each pass through the counter.

Of note, this made RocksDB-based `fastmultigather` return incorrect
results, ref
sourmash-bio/sourmash_plugin_branchwater#322;
first discovered in
#3138 (comment).

The PR fixes this, and adds a more complete pair of tests, based on
`test_gather_metagenome_num_results` in the Python tests for sourmash.

This PR also adjusts the hash function string for DNA sketches in Rust
to be uppercase `DNA` rather than lowercase `dna`, ref
sourmash-bio/sourmash_plugin_directsketch#49

And remember, it's not *just* the destination - it's the friends you
make along the way, like `env_logger`.

* Fixes #3139
* Fixes
sourmash-bio/sourmash_plugin_directsketch#49

For consideration:

Right now we are calculating the intersection twice, once in
`disk_revindex.rs` and once in `calculate_gather_stats` in
`revindex/mod.rs`. This is unnecessary, of course. But the function
signature for `calculate_gather_stats` would need to change to take the
intersection as an argument. We could:
* keep calculating it twice, just for simplicity;
* change `calculate_gather_stats` to take an _optional_ intersection,
and calculate it if not provided;
* change `calculate_gather_stats` to require an intersection.

TODO:
- [x] add at least one explicit test for the moltype fix

Other notes:

* there is a discrepancy between the Python (`sourmash gather`) and Rust
(`sourmash_plugin_branchwater` results) calculations for `remaining_bp`.
It seems to me like the Python one is definitely wrong; not yet sure
about Rust. Viz #3194.
ctb added a commit that referenced this pull request Jun 10, 2024
#3199)

## [0.14.0] - 2024-06-10

MSRV: 1.65

Changes/additions:

* fix cargo fmt for updated `disk_revindex.rs` code (#3197)
* fix RocksDB-based gather & other rust-based infelicities revealed by
plugins (#3193)
* use correct denominator in f_unique_to_query (#3138)
* fix clippy warnings about max_value (#3146)
* allow get/set record.filename (#3121)
Updates:

* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Bump histogram from 0.9.1 to 0.10.0 (#3109)
* Bump enum_dispatch from 0.3.12 to 0.3.13 (#3102)
* Bump serde_json from 1.0.114 to 1.0.115 (#3101)
* Bump rayon from 1.9.0 to 1.10.0 (#3098)
@ctb ctb mentioned this pull request Jun 10, 2024
ctb added a commit that referenced this pull request Jun 11, 2024
Minor new features:

* add `--set-name` to `sig intersect` and `sig subtract` (#3162)
* upgrade `sig overlap` and `sig subtract` to load more than JSON
signatures (#3153)
* force continue past `tax genome` classification errors (#3100)

Bug fixes:

* fix `remaining_bp` output from sourmash gather (#3195)
* fix RocksDB-based gather & other rust-based infelicities revealed by
plugins (#3193, #3197)
* use correct denominator in f_unique_to_query (#3138)

Cleanup and documentation updates:

* update JOSS for sourmash v4 (#3114, #3203, #3209)
* fix links to taxonomy spreadsheets (#3119)
* fix description of `f_unique_weighted` (#3164)

Developer updates:

* transition internal signature loading functions (#3161)
* allow get/set record.filename (#3121)
* round a number that is losing precision in 15th place in
`test_distance_utpy` (#3126)
* disable ppc64le wheel building (#3127)
* prepare to remove `sourmash compute` for sourmash v5.0 (#3103)
* add rustup target x86_64-apple-darwin (#3148)
* mv `.cargo/config` to `config.toml` (#3147)
* fix clippy warnings about max_value (#3146)
* bump to v4.8.9-dev (#3135)
* update src/core/CHANGELOG.md for sourmash-rs core release r0.14.0
(#3199)

Dependabot updates:

* Bump DeterminateSystems/nix-installer-action from 11 to 12 (#3184)
* Bump DeterminateSystems/magic-nix-cache-action from 6 to 7 (#3185)
* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump thiserror from 1.0.60 to 1.0.61 (#3167)
* Bump pypa/cibuildwheel from 2.18.0 to 2.18.1 (#3165)
* Bump DeterminateSystems/magic-nix-cache-action from 4 to 6 (#3157)
* Bump DeterminateSystems/nix-installer-action from 10 to 11 (#3156)
* Bump pypa/cibuildwheel from 2.17.0 to 2.18.0 (#3155)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump thiserror from 1.0.59 to 1.0.60 (#3158)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump getrandom from 0.2.14 to 0.2.15 (#3143)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump jinja2 from 3.1.3 to 3.1.4 (#3145)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump conda-incubator/setup-miniconda from 3.0.3 to 3.0.4 (#3131)
* Update pytest requirement from <8.2.0,>=6.2.4 to >=6.2.4,<8.3.0
(#3132)
* Bump myst-parser from 2.0.0 to 3.0.1 (#3133)
* Bump thiserror from 1.0.58 to 1.0.59 (#3123)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Update docutils requirement from <0.21,>=0.17.1 to >=0.17.1,<0.22
(#3116)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Erroneous f_unique_to_query calculation in Rust code.
2 participants