
uses LruCache instead of InversionTree for caching data decode matrices #104

Merged: 2 commits into rust-rse:master from behzadnouri:rm-inversion-tree, Sep 23, 2022

Conversation

behzadnouri (Contributor)

The current implementation of the LRU eviction policy on InversionTree is wrong and inefficient.
This commit removes InversionTree and instead uses LruCache to cache data decode matrices.

@behzadnouri (Contributor, Author)

@nazar-pc can you please take a look? thank you.
I am seeing decent performance improvements on our codebase using LruCache instead of the InversionTree.

@nazar-pc (Member) commented Sep 3, 2022

Thanks, I'll try to take a look and bench it later this week.
Can you elaborate on what was wrong?

@behzadnouri (Contributor, Author) commented Sep 3, 2022

> Can you elaborate on what was wrong?

#74 has some context on why an eviction policy is needed; LRU eviction was added in #83.

But as implemented, incrementing the used field here:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L157
does not actually result in LRU behavior. Once used becomes large enough for a specific entry, that entry won't get evicted even if it has not been used recently; other, more recently used entries get evicted instead.

The eviction code implemented here:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L209-L238
is pretty inefficient. It requires repeated sorting and a full traversal of the entire tree:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L111-L126
All of this can instead be done in O(1), as implemented by lru::LruCache.
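
For reference, a minimal sketch of that O(1) pattern with lru::LruCache (the Matrix alias, key shape, and capacity here are illustrative, not the PR's exact code):

```rust
use lru::LruCache;

// Illustrative stand-in for the real data decode matrix type.
type Matrix = Vec<Vec<u8>>;

fn main() {
    // `put` and `get` are both O(1); once the capacity is reached,
    // `put` evicts the least recently used entry automatically.
    let mut cache: LruCache<Vec<usize>, Matrix> = LruCache::new(254);
    let invalid_indices = vec![1, 3]; // missing-shard indices as the key
    cache.put(invalid_indices.clone(), vec![vec![0u8; 4]; 4]);
    assert!(cache.get(&invalid_indices).is_some());
}
```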

The two atomic operations and the mutex lock here:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L87-L97
do not compose into a thread-safe whole; they are subject to race conditions in the window between them.
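
A self-contained illustration of this kind of check-then-act race (the Cache struct below is a simplified stand-in, not the library's actual code): each step is individually safe, but other threads can slip in between the atomic read and the locked write.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in: size tracked in an atomic, storage behind a mutex.
struct Cache {
    len: AtomicUsize,
    entries: Mutex<Vec<u64>>,
}

fn insert(cache: &Cache, value: u64, capacity: usize) {
    // Check: lock-free read of the current size.
    if cache.len.load(Ordering::SeqCst) < capacity {
        // Act: by the time the lock is acquired, other threads may have
        // passed the same check, so the capacity can be exceeded.
        cache.entries.lock().unwrap().push(value);
        cache.len.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let cache = Arc::new(Cache {
        len: AtomicUsize::new(0),
        entries: Mutex::new(Vec::new()),
    });
    let handles: Vec<_> = (0..8u64)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || insert(&cache, i, 4))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // May print more than 4: the check and the act are not one atomic step.
    println!("entries: {}", cache.entries.lock().unwrap().len());
}
```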

The InversionTree and the InversionNode:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L23-L36
have a lot of overhead compared to just using a hash map.

The combination of all of these means that when you have a persistent ReedSolomon encoder/decoder instance, keep calling reconstruct_data with large enough shard counts (say 32 data and 32 parity), and different indices are missing on each call, the code spends a lot of time just traversing the tree and evicting entries, which is very suboptimal.

@nazar-pc (Member) left a comment

I have done a simple benchmark and it is actually slower than the older code.

Before:

reconstruct :
    shards           : 10 / 2
    shard length     : 1048576
    time taken       : 0.12414031
    byte count       : 524288000
    MB/s             : 4027.700591371167
reconstruct :
    shards           : 10 / 10
    shard length     : 1048576
    time taken       : 0.133304495
    byte count       : 524288000
    MB/s             : 3750.811253589011

After:

reconstruct :
    shards           : 10 / 2
    shard length     : 1048576
    time taken       : 0.150660363
    byte count       : 524288000
    MB/s             : 3318.722921170713
reconstruct :
    shards           : 10 / 10
    shard length     : 1048576
    time taken       : 0.154363785
    byte count       : 524288000
    MB/s             : 3239.1017102878113

I'm now wondering if we should just have a small-ish hashmap (like 256 entries) and, when it gets full, simply stop adding more entries, since the application will likely have random inputs all the time anyway, rendering caching useless.

Then we don't need LRU at all. What do you think?

cc @darrenldl @mvines
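
A minimal sketch of that fill-once idea (the BoundedCache name and API are illustrative, not an actual proposal in code): once the map reaches capacity, new entries are simply dropped, so no eviction or LRU bookkeeping is needed at all.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Illustrative fill-once cache: no eviction, no LRU bookkeeping.
struct BoundedCache<K, V> {
    map: HashMap<K, V>,
    capacity: usize,
}

impl<K: Hash + Eq, V> BoundedCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self {
            map: HashMap::with_capacity(capacity),
            capacity,
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.map.get(key)
    }

    // Insert only while below capacity; once full, new entries are
    // silently dropped instead of evicting old ones.
    fn insert(&mut self, key: K, value: V) {
        if self.map.len() < self.capacity {
            self.map.insert(key, value);
        }
    }
}
```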

Cargo.toml Outdated
@@ -43,6 +43,7 @@ parking_lot = { version = "0.11.2", optional = true }
smallvec = "1.2"
# `Mutex` implementation for `no_std` environment with the same high-level API as `parking_lot`
spin = { version = "0.9.2", default-features = false, features = ["spin_mutex"] }
lru = "0.7.8"
Member:

Please keep dependencies sorted.

Contributor (Author):

done!

src/core.rs
use super::Field;
use super::ReconstructShard;

const DATA_DECODE_MATRIX_CACHE_CAPACITY: usize = 254;
Member:

This is an odd number. Can you elaborate on the rationale behind this number, maybe leaving a comment in the code too?

Contributor (Author):

I just preserved the cache capacity from the existing code, so as not to introduce any change in behavior or performance due to a cache-size change:
https://github.com/rust-rse/reed-solomon-erasure/blob/eb1f66f47/src/inversion_tree.rs#L15
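
One possible shape for the comment the reviewer asked for (wording is illustrative, not from the merged code):

```rust
/// Capacity carried over from the old InversionTree implementation, so
/// that the switch to LruCache does not change behavior or performance
/// due to a different cache size.
const DATA_DECODE_MATRIX_CACHE_CAPACITY: usize = 254;
```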


@behzadnouri (Contributor, Author)

> I have done a simple benchmark and it is actually slower than the older code.

It can depend on the setup and the actual benchmark; in particular, how many data/parity shards there are, and how often you hit or miss the cache or have to evict entries from it.
In our setup:

  • There are usually 32 data shards and 32 parity shards.
  • Missing shards are very random.

So invalid_indices (the key used for caching data decode matrices in the inversion-tree) is ~32 random indices, and so:

  • It takes a lot of time to traverse the inversion-tree in order to load and store matrices in the tree.
  • A lookup in the inversion-tree will almost never hit an already cached entry.

So effectively we pay the price of traversing the tree and caching decode matrices, yet not only is there no cache hit and therefore no gain, the code is also constantly evicting entries from the cache using the very inefficient eviction code I described earlier.

In our codebase I see a very significant gain (50%+) in some of our metrics just by switching from the inversion-tree to the lru-cache.

I am guessing the setup for your benchmark hits a sweet spot for the current implementation of inversion-tree. If you share the code for the benchmark I can look more into it.

Also, please note that besides performance there are other issues with the current implementation, as mentioned earlier here: #104 (comment)

> I'm now wondering if we should just have a small-ish hashmap (like 256 entries) and, when it gets full, simply stop adding more entries, since the application will likely have random inputs all the time anyway, rendering caching useless.
>
> Then we don't need LRU at all. What do you think?

That sounds fair to me, as long as the change in behavior is not a concern. My objective here was to keep the behavior as close to the current intended design as possible, hence preserving the LRU policy.

@nazar-pc (Member) commented Sep 9, 2022

https://github.com/rust-rse/rse-benchmark is the benchmark I was using.

If the change in behavior is positive, I see no reason not to do it. But I'd like to hear some feedback from other maintainers first.

@behzadnouri (Contributor, Author)

> https://github.com/rust-rse/rse-benchmark is the benchmark I was using.

Looking at that benchmark code, it only ever removes the first shard:
https://github.com/rust-rse/rse-benchmark/blob/e241b2c8c/src/main.rs#L133

With that setup:

  • The InversionTree navigation to find the cached entry is very short.
  • The cache never becomes full, so the cache eviction code is never invoked.

So that benchmark never really hits the performance bottlenecks of the current implementation.

I added benchmark code to this pull request as a separate commit. You can run the benchmark with

cargo bench bench_reconstruct

with and without the last commit to compare.
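
For readers without the nightly toolchain that `cargo bench` requires here, a stable-Rust sketch of the access pattern being measured (shard counts, lengths, and the rand usage are illustrative, not the PR's actual bench code): reconstruct with a different random set of missing shards on each iteration, so the decode-matrix cache sees constant misses and evictions.

```rust
use rand::seq::index::sample;
use reed_solomon_erasure::galois_8::ReedSolomon;
use std::time::Instant;

fn main() {
    // 32 data + 32 parity shards, matching the workload described above.
    let r = ReedSolomon::new(32, 32).unwrap();
    let mut shards = vec![vec![0u8; 64]; 64];
    for (i, shard) in shards.iter_mut().take(32).enumerate() {
        shard.fill(i as u8); // arbitrary data payload
    }
    r.encode(&mut shards).unwrap();

    let mut rng = rand::thread_rng();
    let start = Instant::now();
    for _ in 0..1_000 {
        let mut maybe: Vec<Option<Vec<u8>>> =
            shards.iter().cloned().map(Some).collect();
        // Erase 32 random shards so invalid_indices differs on each call.
        for i in sample(&mut rng, 64, 32) {
            maybe[i] = None;
        }
        r.reconstruct(&mut maybe).unwrap();
    }
    println!("{:?} per reconstruct", start.elapsed() / 1_000);
}
```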

On my machine, with the current InversionTree implementation:

test bench_reconstruct ... bench:   3,779,956 ns/iter (+/- 2,025,730)

whereas with the LruCache:

test bench_reconstruct ... bench:     501,518 ns/iter (+/- 26,591)

So there is a massive improvement by switching from InversionTree to LruCache.

@behzadnouri (Contributor, Author) commented Sep 19, 2022

@nazar-pc I updated the benchmark code to test different combinations of the number of data shards and the number of parity shards.
As shown below, as the number of parity shards grows, the current InversionTree code shows a massive slowdown, while the LruCache is significantly faster.
You may rerun these benchmarks using

cargo bench bench_reconstruct

With the current InversionTree implementation (the 1st number is the number of data shards, the 2nd the number of parity shards):

test bench_reconstruct_2_2   ... bench:       1,810 ns/iter (+/- 118)
test bench_reconstruct_4_2   ... bench:       3,797 ns/iter (+/- 120)
test bench_reconstruct_4_4   ... bench:       6,856 ns/iter (+/- 544)
test bench_reconstruct_8_2   ... bench:       7,873 ns/iter (+/- 388)
test bench_reconstruct_8_4   ... bench:      14,300 ns/iter (+/- 483)
test bench_reconstruct_8_8   ... bench:     295,623 ns/iter (+/- 63,375)
test bench_reconstruct_16_2  ... bench:      16,300 ns/iter (+/- 1,254)
test bench_reconstruct_16_4  ... bench:      30,483 ns/iter (+/- 1,584)
test bench_reconstruct_16_8  ... bench:   1,054,134 ns/iter (+/- 782,379)
test bench_reconstruct_16_16 ... bench:   1,888,967 ns/iter (+/- 1,247,035)
test bench_reconstruct_32_2  ... bench:      33,170 ns/iter (+/- 2,529)
test bench_reconstruct_32_4  ... bench:     811,539 ns/iter (+/- 493,022)
test bench_reconstruct_32_8  ... bench:   1,126,356 ns/iter (+/- 1,029,006)
test bench_reconstruct_32_16 ... bench:   2,063,159 ns/iter (+/- 1,576,207)
test bench_reconstruct_32_32 ... bench:   3,234,665 ns/iter (+/- 2,084,087)

Using LruCache instead:

test bench_reconstruct_2_2   ... bench:       1,789 ns/iter (+/- 149)
test bench_reconstruct_4_2   ... bench:       3,796 ns/iter (+/- 340)
test bench_reconstruct_4_4   ... bench:       6,793 ns/iter (+/- 745)
test bench_reconstruct_8_2   ... bench:       7,731 ns/iter (+/- 475)
test bench_reconstruct_8_4   ... bench:      15,823 ns/iter (+/- 791)
test bench_reconstruct_8_8   ... bench:      29,180 ns/iter (+/- 4,356)
test bench_reconstruct_16_2  ... bench:      16,180 ns/iter (+/- 1,063)
test bench_reconstruct_16_4  ... bench:      39,291 ns/iter (+/- 2,881)
test bench_reconstruct_16_8  ... bench:      66,470 ns/iter (+/- 2,748)
test bench_reconstruct_16_16 ... bench:     115,723 ns/iter (+/- 11,446)
test bench_reconstruct_32_2  ... bench:      46,704 ns/iter (+/- 2,409)
test bench_reconstruct_32_4  ... bench:      98,911 ns/iter (+/- 9,110)
test bench_reconstruct_32_8  ... bench:     169,096 ns/iter (+/- 10,051)
test bench_reconstruct_32_16 ... bench:     291,011 ns/iter (+/- 44,759)
test bench_reconstruct_32_32 ... bench:     502,477 ns/iter (+/- 44,652)

@mvines (Collaborator) commented Sep 23, 2022

fwiw, I'm fine with this change. I'm also heavily biased, as @behzadnouri is on my team.

@nazar-pc (Member) left a comment

Makes sense to me, and thanks for adding the reconstruction bench. Ideally we'd move all the benches in here, maybe with criterion, for easier testing in the future.

@behzadnouri (Contributor, Author)

@nazar-pc thanks for approving the PR.
What would be the process to merge this change and then publish a new release with it to crates.io?

nazar-pc merged commit 160bcff into rust-rse:master on Sep 23, 2022
@nazar-pc (Member)

@mvines can you update changelog, version and do another release? I'm lazy today 🙂

@mvines (Collaborator) commented Sep 23, 2022

heh, yep. that seems fair to me :)

@mvines (Collaborator) commented Sep 23, 2022

v6.0.0 is now on crates.io

behzadnouri deleted the rm-inversion-tree branch on September 23, 2022
@behzadnouri (Contributor, Author)

Thank you both

behzadnouri added a commit to behzadnouri/solana that referenced this pull request Sep 23, 2022
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Sep 23, 2022
behzadnouri added a commit to solana-labs/solana that referenced this pull request Sep 24, 2022
mergify bot pushed a commit to solana-labs/solana that referenced this pull request Sep 24, 2022
Need to pick up:
rust-rse/reed-solomon-erasure#104
in order to unblock:
#27510

(cherry picked from commit f02fe9c)

# Conflicts:
#	ledger/Cargo.toml
mergify bot added a commit to solana-labs/solana that referenced this pull request Sep 24, 2022
…#28048)

* updates reed-solomon-erasure crate version to 6.0.0 (#28033)

Need to pick up:
rust-rse/reed-solomon-erasure#104
in order to unblock:
#27510

(cherry picked from commit f02fe9c)

# Conflicts:
#	ledger/Cargo.toml

* removes mergify merge conflicts

Co-authored-by: behzad nouri <behzadnouri@gmail.com>