
New bucket serialization format #18

Closed
zkat opened this issue Oct 20, 2019 · 2 comments

zkat (Owner) commented Oct 20, 2019

The current bucket format is copied directly from what the JavaScript version of cacache does.

I no longer think it's worth trying to preserve compatibility, and the performance of index-related operations is kind of horrendous right now, so I think it's time to explore a new on-disk format for the index buckets.

My current thinking is to use serde more directly, and come up with a better strategy for the generic metadata field, as well.

And of course, if there's no actual perf difference, this issue should just be closed, but this is worth exploring anyway.
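For context, the inherited JS-style index stores each entry as an appended line of `<checksum>\t<json>`, so readers can detect truncated or corrupt lines. A minimal, dependency-free sketch of that shape (the checksum and field names here are illustrative stand-ins, not cacache's actual internals):

```rust
/// Illustrative sketch of the line-appended bucket format inherited from the
/// JS cacache: each write appends `<checksum-of-json>\t<json>`.
fn bucket_line(key: &str, integrity: &str, time_ms: u64) -> String {
    // JSON is hand-rolled here to keep the sketch dependency-free;
    // the real code would go through serde_json.
    let json = format!(
        r#"{{"key":"{}","integrity":"{}","time":{}}}"#,
        key, integrity, time_ms
    );
    format!("{}\t{}", toy_checksum(&json), json)
}

fn toy_checksum(s: &str) -> u64 {
    // Stand-in for the SHA-based digest the real format uses.
    s.bytes()
        .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(u64::from(b)))
}
```

Replaying the file and skipping lines whose checksum doesn't match their JSON payload is what makes the format append-only and crash-tolerant, at the cost of parsing every line on read.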

zkat added the enhancement, help wanted, and semver-major labels on Oct 20, 2019
zkat added this to the 4.0.0 milestone on Oct 20, 2019
isaacs commented Oct 28, 2019

I'm curious if you plan to keep the whole "nested shasum parts" thing. I noticed that the JS cacache spends a fair bit of FS ops on that, and it seems like most file systems in use today can handle bazillions of files in a single dir.

On my machine, I'm seeing about a 2-5% performance boost in all the benchmarks that are sensitive to bucket loading efficiency when I make it just use a single layer of files in one big index folder. Not revolutionary, but not nothing. I haven't yet run it through a benchmark that creates millions of buckets, so it's possible that it's A Bad Idea, of course :)
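The two layouts being compared can be sketched as path builders (the `index-v5` root and the 2-char/2-char split mirror what the JS cacache does, but the names here are illustrative, not either crate's actual internals):

```rust
use std::path::PathBuf;

/// Nested layout (JS cacache style): the hex digest of the key is split into
/// two 2-character directory levels, e.g. `index-v5/ab/cd/ef01...`.
/// Assumes `hash_hex` is at least 5 characters long.
fn nested_bucket_path(root: &str, hash_hex: &str) -> PathBuf {
    [root, &hash_hex[0..2], &hash_hex[2..4], &hash_hex[4..]]
        .iter()
        .collect()
}

/// Flat layout: every bucket lives directly in one big index folder,
/// trading directory fan-out for fewer path components to stat and open.
fn flat_bucket_path(root: &str, hash_hex: &str) -> PathBuf {
    [root, hash_hex].iter().collect()
}
```

The flat variant saves the intermediate directory lookups (and `mkdir` calls on write), which is where the small benchmark win would come from; the nested fan-out only pays off on filesystems that degrade with very large directories.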

zkat (Owner, Author) commented Nov 7, 2019

I did a spike towards this and realized it's way more trouble than it's worth. Generally, if you want performance, `cacache::read_hash(_sync)` is the way to go, as it's just as fast as a regular filesystem read. Considering the general slowness is pretty much I/O-bound, with only a bit of JSON parsing delay (and serde_json is very fast), I think I'm just gonna go ahead and say this wouldn't be worth the trouble just to squeeze in another couple of percentage points.

zkat closed this as completed on Nov 7, 2019