[WIP] Ertl estimators for scaled minhash #1270

Open
wants to merge 4 commits into
base: latest

Conversation

luizirber
Member

This PR estimates the cardinality of the original dataset using the scaled minhash as a proxy. An HLL built from the scaled minhash is a reduced-resolution equivalent of the full-resolution HLL built from the original dataset. Because the reduced HLL can be built directly from the scaled minhash, the approach stays backwards-compatible with existing sigs.
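To make the "reduced resolution" idea concrete, here is a minimal Python sketch. It is not the code in this PR and it does not implement Ertl's estimators. It assumes a hypothetical stand-in hash (`hash64`, BLAKE2b truncated to 64 bits), a `scaled` value that is a power of two, and an HLL layout that takes the register index from the low `p` bits and the rank from the leading zeros of the remaining bits. All function names are illustrative. Under those assumptions, the registers built from the scaled minhash match the full-dataset registers wherever the full rank exceeds log2(scaled), and are zero elsewhere.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def hash64(item: str) -> int:
    # Hypothetical stand-in 64-bit hash, used here only for illustration.
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_minhash(items, scaled):
    # Keep only the hashes that fall below MAX_HASH / scaled.
    threshold = MAX_HASH // scaled
    return {h for h in map(hash64, items) if h < threshold}


def naive_cardinality(mins, scaled):
    # Baseline estimate: each retained hash stands for ~scaled distinct items.
    return len(mins) * scaled


def hll_registers(hashes, p):
    # Assumed layout: low p bits select the register, and the rank is the
    # number of leading zeros in the remaining (64 - p) bits, plus one.
    m = 1 << p
    width = 64 - p
    regs = [0] * m
    for h in hashes:
        idx = h & (m - 1)
        rest = h >> p
        rank = width - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


items = [f"kmer-{i}" for i in range(20000)]
scaled = 16
mins = scaled_minhash(items, scaled)
print("retained hashes:", len(mins))
print("naive estimate:", naive_cardinality(mins, scaled))
```

Because only small hash values survive the `scaled` cutoff, every retained hash has a rank above log2(scaled), which is why the low-rank registers are lost but the high-rank ones carry over unchanged. An HLL-style estimator would then need to account for those truncated registers.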

(Surfacing this after talking with @bluegenes)

TODO

  • finish impl
  • validate with experiments

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov

codecov bot commented Jan 6, 2021

Codecov Report

Merging #1270 (234eab0) into latest (d57127b) will increase coverage by 5.05%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           latest    #1270      +/-   ##
==========================================
+ Coverage   89.88%   94.93%   +5.05%     
==========================================
  Files         124       97      -27     
  Lines       19889    16266    -3623     
  Branches     1515     1515              
==========================================
- Hits        17877    15442    -2435     
+ Misses       1783      595    -1188     
  Partials      229      229              
Flag Coverage Δ
python 94.93% <ø> (ø)
rust ?


Impacted Files Coverage Δ
src/core/src/ffi/hyperloglog.rs
src/core/src/index/mod.rs
src/core/src/index/linear.rs
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/ffi/utils.rs
src/core/src/cmd.rs
src/core/src/ffi/minhash.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/index/search.rs
src/core/src/index/bigsi.rs
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d57127b...234eab0.

luizirber force-pushed the minhash_ertl branch 7 times, most recently from e305b82 to 4c7e2e4 on May 6, 2021 22:53
1 participant