
[MRG] Force unload data from SBT searches by default (and fix ZipStorage deallocation along the way) #1513

Merged
merged 6 commits into from
May 12, 2021

Conversation

luizirber
Member

@luizirber luizirber commented May 10, 2021

Missing bit from #1370, re: #1370 (comment)

Changing the SBT code surfaced deeper bugs in ZipStorage, so I fixed those too. I changed the .save() method to only .flush() the storage, not .close() it. Because the .flush() call uses a temp file, I also avoid deleting that file (it becomes the new ZipStorage), and I handle some interesting cases during deallocation (flushing/closing the underlying Zip file properly).

TODO:

  • should this be exposed back to the command line? I don't think it actually makes sense for current use cases, so unload_data in SBT.find becomes available only at the Python API level
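The flush-without-close behavior described above can be sketched roughly like this. This is a minimal illustrative toy, not the actual sourmash ZipStorage: class and method names mirror the discussion, but the implementation details (temp file via tempfile.mkstemp, swap via shutil.move, best-effort flush in __del__) are assumptions for illustration only.

```python
import os
import shutil
import tempfile
import zipfile


class MiniZipStorage:
    """Toy zip-backed storage: flush() rewrites pending entries into a temp
    file and adopts it as the new backing zip, so the storage stays usable
    afterwards (unlike a close())."""

    def __init__(self, path):
        self.path = path
        self.pending = {}  # entries saved but not yet written to disk
        if not os.path.exists(path):
            zipfile.ZipFile(path, "w").close()  # start with an empty zip

    def save(self, name, data):
        self.pending[name] = data

    def flush(self):
        # Write old + pending entries into a temp zip, then move it over
        # the original path -- the temp file is NOT deleted, it *becomes*
        # the new backing zipfile.
        fd, tmp = tempfile.mkstemp(suffix=".zip")
        os.close(fd)
        with zipfile.ZipFile(self.path) as old, \
                zipfile.ZipFile(tmp, "w") as new:
            for item in old.infolist():
                if item.filename not in self.pending:
                    new.writestr(item, old.read(item.filename))
            for name, data in self.pending.items():
                new.writestr(name, data)
        shutil.move(tmp, self.path)
        self.pending = {}

    def load(self, name):
        if name in self.pending:
            return self.pending[name]
        with zipfile.ZipFile(self.path) as zf:
            return zf.read(name)

    def close(self):
        self.flush()

    def __del__(self):
        # deallocation: make a best effort to get pending data to disk
        try:
            self.close()
        except Exception:
            pass
```

The key point is that after flush() the object can still serve load() and save() calls, which is what lets the original tree keep working after tree.save() in the reordered test.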

@luizirber luizirber changed the title Force unload data from SBT searches by default [WIP] Force unload data from SBT searches by default May 10, 2021
@ctb
Contributor

ctb commented May 10, 2021

* [ ]  should this be exposed back to the command line? I don't think it actually makes sense for current use cases, so `unload_data` in `SBT.find` becomes available only at the Python API level

I do not think it should be exposed.

@luizirber luizirber changed the title [WIP] Force unload data from SBT searches by default [MRG] Force unload data from SBT searches by default May 10, 2021
@luizirber
Member Author

Ready for review and merge @ctb

@codecov

codecov bot commented May 10, 2021

Codecov Report

Merging #1513 (3621349) into latest (6922f38) will increase coverage by 5.01%.
The diff coverage is 78.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1513      +/-   ##
==========================================
+ Coverage   90.26%   95.28%   +5.01%     
==========================================
  Files         126       99      -27     
  Lines       21099    17417    -3682     
  Branches     1585     1591       +6     
==========================================
- Hits        19045    16595    -2450     
+ Misses       1827      593    -1234     
- Partials      227      229       +2     
Flag Coverage Δ
python 95.28% <78.26%> (-0.02%) ⬇️
rust ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/sourmash/sbt_storage.py 87.50% <72.22%> (-1.21%) ⬇️
src/sourmash/commands.py 86.29% <100.00%> (+0.03%) ⬆️
src/sourmash/sbt.py 80.87% <100.00%> (+0.02%) ⬆️
src/core/src/signature.rs
src/core/src/index/linear.rs
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/index/bigsi.rs
src/core/src/sketch/hyperloglog/estimators.rs
src/core/src/ffi/signature.rs
... and 21 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6922f38...3621349.

Comment on lines 473 to 487
to_search = load_one_signature(utils.get_test_data(utils.SIG_FILES[0]))
search_obj = make_jaccard_search_query(threshold=0.1)

tree = SBT.load(str(testsbt), leaf_loader=SigLeaf.load)
old_result = {str(s.signature) for s in tree.find(search_obj, to_search)}
tree.save(str(newsbt))

assert newsbt.exists()

new_tree = SBT.load(str(newsbt), leaf_loader=SigLeaf.load)
assert isinstance(new_tree.storage, ZipStorage)
assert new_tree.storage.list_sbts() == ['new.sbt.json']

to_search = load_one_signature(utils.get_test_data(utils.SIG_FILES[0]))
new_result = {str(s.signature) for s in new_tree.find(search_obj, to_search)}

print("*" * 60)
print("{}:".format(to_search))
search_obj = make_jaccard_search_query(threshold=0.1)
old_result = {str(s.signature) for s in tree.find(search_obj, to_search)}
new_result = {str(s.signature) for s in new_tree.find(search_obj, to_search)}
@luizirber
Member Author

OK, this is fun: I reordered the test because the call to tree.save(str(newsbt)) closes the original zipfile used as storage in testsbt. Why did it trigger only when defaulting to unload_data=True? Because with unload_data=False the data was still in memory, even with the storage zipfile closed.

The call to .close() inside .save() should probably be something like .flush() instead, which might fix the warnings about closed files too...
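The failure mode described above can be sketched with a toy lazy-loading leaf. This is a hypothetical illustration, not the actual sourmash classes: data is read from storage on first access and cached, so with unload_data=False the cached copy survives a storage close, while after unload() the next access must go back to the (now closed) storage and fails.

```python
class LazyLeaf:
    """Toy lazy-loading leaf (hypothetical names, not the sourmash API):
    data is fetched from storage on first access and cached in memory."""

    def __init__(self, storage, name):
        self.storage = storage
        self.name = name
        self._data = None  # not loaded yet

    @property
    def data(self):
        if self._data is None:
            # this read fails if the backing storage has been closed
            self._data = self.storage.load(self.name)
        return self._data

    def unload(self):
        # with unload_data=True, find() would drop the cache after each
        # match; the next .data access hits the storage again
        self._data = None
```

With unload_data=False the cached _data masks a closed storage zipfile, which is why the bug only surfaced once unloading became the default.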

@ctb
Contributor

ctb commented May 10, 2021

...did the scope creep on this one a bit? 😆

@luizirber
Member Author

...did the scope creep on this one a bit? 😆

Yes, but I'll argue it is for good reasons. This fixes a bunch of warnings that were real problems being ignored 😬

So, instead of reordering the test, I changed the .save() method to only .flush() the storage, but not .close() it. Because the .flush() call uses a temp file, I'm also avoiding deleting it (because it becomes the new ZipStorage).

But there are some warnings that can be fixed, so... I'll ping again for another review =]

@luizirber luizirber changed the title [MRG] Force unload data from SBT searches by default [MRG] Force unload data from SBT searches by default (and fix ZipStorage deallocation along the way) May 12, 2021
@luizirber
Member Author

So, this one is ready for review and merge @ctb @bluegenes...

But there is a new error during doc building that I think was triggered by some newly released version of the docs dependencies. Sigh.

@ctb
Contributor

ctb commented May 12, 2021 via email

@luizirber
Member Author

ok, post as issue I guess?

Fixing in #1516

@luizirber luizirber merged commit 91603d4 into latest May 12, 2021
@luizirber luizirber deleted the gather_unload branch May 12, 2021 20:32