
accounts-db: Fix out-of-memory error caused by DashMaps in secondary indices#34955

Closed
dnut wants to merge 5 commits into solana-labs:master from dnut:dashmap-to-hashmap

Conversation


@dnut dnut commented Jan 25, 2024

Problem

The validator crashes when indexing the account snapshot on high core-count machines. It exhausts 1 TB of memory and the process is killed by the OOM killer. This occurs on every single startup, before the validator has a chance to catch up to the network.

This is a big problem for node operators like Syndica because it prevents us from upgrading our hardware.

To reproduce, run the validator with:

  • version 1.17.16
  • 64 cores (128 threads) and 1 TB of RAM
  • secondary indices enabled

Root cause: Secondary indices contain many DashMaps. Each DashMap is sharded proportionally to core count. The massive number of shards triggers an explosion of memory usage.

Summary of changes

Replace DashMap with RwLock<HashMap> in the program_id_index and spl_token_mint_index secondary indices in accounts-db. This reduces total validator memory usage from over 1 TB to under 200 GB.

While the memory usage could be improved by hardcoding a low number of shards, this change instead fully replaces DashMap with RwLock<HashMap> because DashMap appears to have little to no benefit in these indices, and it introduces unnecessary complexity and CPU load relative to HashMaps.
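For context, here is a minimal sketch of the shape of this change, with field and type names simplified from the accounts-db code (the real SecondaryIndexEntry types carry more methods and use Solana's Pubkey type):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::RwLock;

type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey

// Before: every index entry owns its own sharded DashMap.
struct DashMapSecondaryIndexEntry {
    account_keys: DashMap<Pubkey, ()>,
}

// After: one lock around a plain HashMap per entry.
struct RwLockSecondaryIndexEntry {
    account_keys: RwLock<HashMap<Pubkey, ()>>,
}
```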

Root Cause

Secondary indices are stored in a DashMap that contains another nested DashMap for every entry in the index. This means a massive number of DashMaps will be instantiated.

Each DashMap is sharded, containing several RwLock<HashMap>s within it. When using DashMap::new(), DashMaps are sharded based on the number of cores using this logic:

```rust
(std::thread::available_parallelism().map_or(1, usize::from) * 4).next_power_of_two()
```

This means every DashMap on a 128-thread machine will contain 512 HashMap instances.

The overhead from allocating so many empty maps explains the memory usage.
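To make the arithmetic concrete, here is a small program that evaluates that expression; the 512-shard figure assumes a machine reporting 128 available threads:

```rust
fn main() {
    // The default shard count DashMap::new() picks, per the expression above.
    let shards = (std::thread::available_parallelism().map_or(1, usize::from) * 4)
        .next_power_of_two();
    println!("each DashMap::new() gets {shards} shards on this machine");

    // On a 128-thread machine: (128 * 4).next_power_of_two() == 512, so one
    // million index entries allocate 512 million inner HashMaps.
    let entries: u64 = 1_000_000;
    println!("{} inner maps for {entries} index entries", entries * shards as u64);
}
```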

Simply swapping out the DashMaps with HashMaps scales down the memory overhead by a factor of 512. Running with this patch on the same machine with 128 threads, the validator was able to finish the indexing process without exceeding 200 GB of total memory. We have been using this code on our production nodes for the past few weeks without issue, and have observed that the validator catches up in around 30% of the time that it used to take with DashMaps.

Benchmarks

I ran some benchmarks to illustrate the memory problem and to determine if DashMaps are actually beneficial.

https://github.com/dnut/dashmap-benchmark

I ran these benchmarks on a MacBook Pro with a 12-core M3 and 36 GB of RAM.

```sh
cargo run --release -- --help
```

Memory

To illustrate the memory usage with a minimal example, I wrote a simple benchmark that initializes a large number of DashMaps or RwLock<HashMap>s. It measures the peak memory usage and the amount of time it takes to initialize all the maps (`test_init_many_maps`).

```sh
cargo run --release -- hashmap init --entries $ENTRIES
cargo run --release -- dashmap --shards $SHARDS init --entries $ENTRIES
```
| Type    | Shards | Entries    | Memory (GB) | Time (s) | Completion |
|---------|--------|------------|-------------|----------|------------|
| HashMap | 1      | 1,000,000  | 0.31        | 0.42     | 100%       |
| HashMap | 1      | 10,000,000 | 2.1         | 2.49     | 100%       |
| DashMap | 4      | 1,000,000  | 0.47        | 0.45     | 100%       |
| DashMap | 4      | 10,000,000 | 4.4         | 3.39     | 100%       |
| DashMap | 32     | 1,000,000  | 2.3         | 0.64     | 100%       |
| DashMap | 32     | 10,000,000 | 22.4        | 4.73     | 100%       |
| DashMap | 128    | 1,000,000  | 7.3         | 1.15     | 100%       |
| DashMap | 128    | 10,000,000 | OOM         | n/a      | 30%        |
| DashMap | 512    | 1,000,000  | OOM         | n/a      | 90%        |
| DashMap | 512    | 10,000,000 | OOM         | n/a      | 9%         |

This demonstrates the relationship between DashMap shard count and memory usage. The final test ran out of memory at only 9% completion, which implies the empty DashMaps alone would have used hundreds of GB. RwLock<HashMap> uses only 2.1 GB for the same amount of data.
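A simplified re-creation of what `test_init_many_maps` does (the real harness lives in the linked dnut/dashmap-benchmark repo, and its exact types and flags differ):

```rust
use dashmap::DashMap;
use std::time::Instant;

fn main() {
    let entries = 1_000_000;
    let start = Instant::now();
    // Mirror the secondary-index layout: one sharded map per index entry.
    // With 512 shards each, the empty shards alone dominate memory; on most
    // machines this will OOM, which is exactly the bug being demonstrated.
    let maps: Vec<DashMap<u64, ()>> = (0..entries)
        .map(|_| DashMap::with_shard_amount(512))
        .collect();
    println!("initialized {} maps in {:?}", maps.len(), start.elapsed());
}
```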

Contention

My assumption is that DashMap was used within each index entry because a select few of the index entries will be heavily used. For example, the token program in the program_id index will have many entries, and it may be written to or scanned concurrently at a high frequency. In this scenario, there may be some benefit to sharding the data to reduce lock contention.

As a basic benchmark for contention, I ran many concurrent read and write operations. For reads, I replicated the logic of SecondaryIndexEntry::keys, since that is the most expensive read operation on these indices. For writes, I did an insertion. These operations were scheduled across several threads to achieve a specified number of operations per second overall (`test_contention`).

Assuming there is no bottleneck, the tests should complete in about 1 second. If it takes more than 1 second, this implies that operations are being blocked by locks held by operations in other threads. The contention causes operations to pile up and take longer than their scheduled time.

There are three customizable variables for this benchmark:

  • prior_writes: number of items (roughly) that exist in the map before starting the benchmark
  • writes_per_second: number of writes to execute per second during the benchmark (spread across many threads)
  • reads_per_second: number of reads to execute per second during the benchmark (spread across many threads)
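For a rough sense of the harness's shape using those variables, here is a simplified sketch; the real benchmark paces operations against a one-second schedule, while this version just splits a fixed operation count across worker threads:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn main() {
    let map = Arc::new(RwLock::new(HashMap::<u64, u64>::new()));
    let threads = 8u64;
    let (writes, reads) = (100_000u64, 100_000u64);
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..writes / threads {
                    map.write().unwrap().insert(t * 1_000_000 + i, 0);
                }
                for _ in 0..reads / threads {
                    // Mimics SecondaryIndexEntry::keys: copy every key out.
                    let _keys: Vec<u64> = map.read().unwrap().keys().copied().collect();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // If this takes much longer than the scheduled second, locks are contended.
    println!("completed in {:?}", start.elapsed());
}
```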

With run.sh, I tested every combination of each of these three variables set to every power of ten ranging from 0 to 10 million, and ran those tests for each of these three data structures:

  • RwLock<HashMap>
  • DashMap with 4 shards
  • DashMap with 8 shards

This is a total of 3*8^3 or 1536 benchmarks. I parsed the rust stdout with parse.py into a CSV containing all 1536 duration data points, and uploaded it to this google sheet: https://docs.google.com/spreadsheets/d/10XNX-CSBmejQnK8_YTbir0Y_aFxf0cn5YBmNQAD9CaQ/edit#gid=0

Note: I also ran a few manual tests with higher shard counts, but they were always slower than the lower-sharded DashMaps, so only 4- and 8-shard DashMaps were included in the full battery of tests. The full battery took about 20 hours.

I aggregated this 6-dimensional data into a more digestible format by focusing on the individual relationships that each of the three variables have with respect to duration. I examined each relationship in three different load profiles (light, reasonable, and heavy). This generated a total of 9 tables and plots, which you can view in the other tabs of the spreadsheet. The aggregation logic is performed by aggregate() in parse.py. For a more readable illustration of the logic, here is a query that would generate the duration vs prior_writes table for a light load profile:

```sql
SELECT
    prior_writes,
    AVG(hashmap_duration),
    AVG(dashmap4_duration),
    AVG(dashmap8_duration)
FROM raw_data
WHERE
    writes_per_second >= 1
    AND writes_per_second <= 1000
    AND reads_per_second >= 1
    AND reads_per_second <= 1000
GROUP BY prior_writes
```

These results show that, for this code, DashMaps are only faster than RwLock<HashMap> in two situations:

  • There are at least 10,000 reads and writes per second to the map, and there are fewer than 10,000 entries in the map in total.
  • Writes per second exceeds 100,000 and reads per second exceeds 10,000.

In all other scenarios, RwLock<HashMap> had equal or better performance.

Another important point is that the concurrent DashMap operations were much more CPU intensive. Even if the same job finished in exactly the same amount of wallclock time using both DashMap and RwLock<HashMap>, the DashMap code used about 8x as much CPU time according to the time command. This means DashMap is actually more computationally expensive than it appears based on the durations alone.

Conclusion

The data from my benchmarks suggests that DashMap is not helpful within secondary index entries, and is actually counterproductive.

The secondary index implementation does not hold the lock for very long. It quickly transforms the data and returns a copy, releasing the lock immediately. It seems we're not likely to benefit from sharding unless index entries are being written to millions of times per second.

Furthermore, DashMaps come with significant memory and CPU costs. The memory cost is clear: each DashMap contains many more data structures, which consumes hundreds of times more memory in practice. The CPU cost is harder to pin down, but the working hypothesis is that DashMaps suffer from significant memory fragmentation and execute additional logic to deal with sharding. For example, common read operations such as SecondaryIndexEntry::keys and SecondaryIndexEntry::len acquire and release every single lock in the DashMap, because they read from every contained HashMap on every execution.
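To illustrate that last point, compare the two whole-map read shapes (a sketch, not the accounts-db code itself):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::RwLock;

// Iterating a DashMap locks each of its shards in turn, so a whole-map read
// on a 512-shard map acquires and releases 512 locks per call.
fn dashmap_keys(map: &DashMap<u64, ()>) -> Vec<u64> {
    map.iter().map(|entry| *entry.key()).collect()
}

// The RwLock<HashMap> version takes a single lock for the whole copy.
fn rwlock_keys(map: &RwLock<HashMap<u64, ()>>) -> Vec<u64> {
    map.read().unwrap().keys().copied().collect()
}

fn main() {
    let d = DashMap::new();
    d.insert(1u64, ());
    let h = RwLock::new(HashMap::from([(1u64, ())]));
    assert_eq!(dashmap_keys(&d), rwlock_keys(&h));
}
```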

Due to the high cost and negligible benefit of DashMaps for secondary index entries, I propose that we replace them all with RwLock<HashMap>.

Fixes OOM bug on high core machines where memory utilization can exceed 1 TB.

program_id_index and spl_token_mint_index previously used DashMapSecondaryIndexEntry for index entries. This changes them to RwLockSecondaryIndexEntry to reduce memory and cpu usage.
@mergify bot added the community (Community contribution) and need:merge-assist labels on Jan 25, 2024
@mergify bot requested a review from a team on January 25, 2024 19:40
@dnut changed the title from "accounts-db: Replace DashMap with RwLock<HashMap> in Secondary Index Entries" to "accounts-db: Fix out-of-memory error caused by DashMaps in secondary indices" on Jan 31, 2024
@dnut
Author

dnut commented Jan 31, 2024

Requesting review from @carllin and @jeffwashington

(I do not have permission to add reviewers to the pr)

@jeffwashington
Contributor

wow. this is thorough. I don't recall all the history. So we used to use RwLock<HashMap> maybe 3 years ago, @carllin?

@carllin
Contributor

carllin commented Jan 31, 2024

We always used DashMaps: https://github.com/solana-labs/solana/pull/14212/files#diff-95c60d63a9644e0efe4da37f0ccd4d7169a459e0c963170888b1fd56f340b3c6R22-R24 💀

I didn't do nearly as much benchmarking as @dnut here, so I'm inclined to merge this. I'm looking through the benchmarks out of curiosity, but otherwise this looks good!

@carllin
Contributor

carllin commented Feb 1, 2024

@dnut, for clarification, these tests were done with this expensive flag enabled to replicate the secondary index scan, right? https://github.com/dnut/dashmap-benchmark/blob/6465c044a67f3146a951d9b88ae681d285335f0c/src/lib.rs#L109-L110

@carllin
Contributor

carllin commented Feb 1, 2024

After reading through the benchmarks (very detailed by the way, great work!), a few things:

  1. My main concern is not slowing down writes to the secondary indexes, which comes in the critical path for state updates in AccountsIndex. If this is slowed down, validators running secondary indexes risk falling behind. Here, I would be interested in isolating in the benchmarks how much slower writes specifically are being affected, rather than grouping together the reads and writes into one end-to-end metric.
  2. > My assumption is that DashMap was used within each index entry because a select few of the index entries will be heavily used. For example, the token program in the program_id index will have many entries, and it may be written to or scanned concurrently at a high frequency. In this scenario, there may be some benefit to sharding the data to reduce lock contention.

Yup, I think this was the intention. Here, I would be interested in a benchmark that replicates this sort of hotspot for reading and compares the performance of writes in DashMap to RwLock as well.

A little worried the main account index updates might be blocked by this.

@dnut
Author

dnut commented Feb 2, 2024

@carllin, thanks for the thoughtful feedback.

> @dnut, for clarification, these tests were done with this expensive flag enabled to replicate the secondary index scan, right? https://github.com/dnut/dashmap-benchmark/blob/6465c044a67f3146a951d9b88ae681d285335f0c/src/lib.rs#L109-L110

Yes, all the data in the spreadsheet comes from benchmarks with the expensive reads. I used run.sh and these lines include the -e flag, which enables the expensive reads.

Just now, I committed a change to make the expensive reads the default. -e is no longer necessary.

Measuring reads and writes separately

> I would be interested in isolating in the benchmarks how much slower writes specifically are being affected, rather than grouping together the reads and writes into one end-to-end metric.

It seems that you're looking for a different approach to measurement. You want separate numbers to come out of the test: one that indicates the read performance and one that indicates the write performance.

A simple approach would be to time the read and write tasks separately without changing the actual logic being benchmarked. Currently the benchmarks work by scheduling a finite number of reads and writes to complete over one second, and testing how long the entire thing takes. If it takes over a second, there was contention somewhere. Instead of timing the entire thing, we can independently measure how long it takes for all the readers to complete, and how long it takes for all the writers to complete, and end up with two separate numbers.
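Here is a sketch of that first approach, with the reader and writer groups started together but timed independently (hypothetical structure; the actual implementation is in the benchmark repo):

```rust
use std::sync::{Arc, Barrier};
use std::thread;
use std::time::Instant;

fn main() {
    let group_size: usize = 4;
    let barrier = Arc::new(Barrier::new(group_size * 2));
    let spawn_group = |work: fn()| -> Vec<thread::JoinHandle<std::time::Duration>> {
        (0..group_size)
            .map(|_| {
                let b = Arc::clone(&barrier);
                thread::spawn(move || {
                    b.wait(); // start readers and writers at the same instant
                    let start = Instant::now();
                    work();
                    start.elapsed()
                })
            })
            .collect()
    };
    let writers = spawn_group(|| { /* scheduled writes go here */ });
    let readers = spawn_group(|| { /* scheduled reads go here */ });
    let slowest = |handles: Vec<thread::JoinHandle<std::time::Duration>>| {
        handles.into_iter().map(|h| h.join().unwrap()).max().unwrap()
    };
    // Each group reports its own duration, so a blocked side is visible
    // even when the other side finishes on schedule.
    println!("writers: {:?}", slowest(writers));
    println!("readers: {:?}", slowest(readers));
}
```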

Another approach would be to run the test with the focus on one type of operation. Let's say we select writes as the focus. We just keep running reads at the specified rate until the pre-determined number of writes finish. The output of the test tells us how long the writes took.

Specific scenarios

Regarding your other concerns, I did try to address them in my initial set of benchmarks by testing every combination of reads, writes, and prior entries. The goal was that any realistic situation you can imagine was approximated by at least one benchmark. I used this triply nested for loop to generate test cases, and you can see the data from each combination in the Raw Data tab of this google sheet.

The other tabs in that sheet are aggregations based on the raw data, but they don't actually include all of the information. The raw data can be sliced and aggregated however you please to focus on specific relationships.

> My main concern is not slowing down writes to the secondary indexes, which comes in the critical path for state updates in AccountsIndex. If this is slowed down, validators running secondary indexes risk falling behind.

This is partially addressed by my existing data. Even if we don't measure writes independently, we can understand write performance by looking at how variable numbers of writes influence the overall benchmark, especially when there are low numbers of reads.

The "light load" plot in the "Writes Per Second" tab illustrates write performance when there are less than 1000 reads per second. Under these conditions, DashMap and RwLock<HashMap> performed about the same from 0-100,000 writes per second. at 1,000,000 writes per second, RwLock<HashMap> performed slightly better. At 10,000,000 writes per second, DashMap performed slightly better.

Just now, I created another two tabs in the google sheet, one called "Isolated Writes" that is aggregated from the same raw data, and one called "Isolated Writes (new data)" from new benchmarks I just ran for more detail. Here we see that all the maps perform equivalently until 5,000,000 writes per second, which is where RwLock<HashMap> starts to suffer from a contention bottleneck. The more sharded the DashMap, the faster it can process writes before it begins to struggle with contention.

> Here, I would be interested in a benchmark that replicates this sort of hotspot for reading and compares the performance of writes in DashMap to RwLock as well.

My instinct here is that the "heavy load" or "reasonable load" plots in the "Writes Per Second" sheet should provide insight for this concern. While significant reads are occurring, RwLock<HashMap> performs better under 1,000,000 writes per second, and DashMap performs better over 1,000,000 writes per second.

But I'd also like to address your concern with individual measurements of reads and writes, using the approaches I described above. I didn't re-run all the benchmarks with the new measurement approaches because that would take several days to complete and I wanted to post a reply today, but I've selected a few to re-run with this more granular data. Please feel free to run my code to test any cases that are interesting to you.

Separate timers

Like the original benchmarks, these tests run a pre-determined number of read and write operations that are scheduled to complete in one second (or longer in the event of a bottleneck). The only change here is that reads and writes are now separately timed, instead of timing the entire test (effectively only measuring whichever takes longer).

Let's define 1 million reads per second as a hotspot. We can run the following command to schedule 100,000 writes per second under that read load with a 4-shard DashMap:

```sh
$ cargo run --release -- -s4 dashmap contention -r 1000000 -w 100000
Contention test (writers) duration: 1.33s
Contention test (readers) duration: 3.021s
```

Here's a summary from a few more test cases:

| Measured duration | Prior writes | Reads/s   | Writes/s | DashMap duration (s) | HashMap duration (s) |
|-------------------|--------------|-----------|----------|----------------------|----------------------|
| writers           | 0            | 1,000,000 | 100,000  | 1.3                  | 1.8                  |
| readers           | 0            | 1,000,000 | 100,000  | 3                    | 13.5                 |
| writers           | 100,000      | 1,000,000 | 10,000   | 2.6                  | 1.0                  |
| readers           | 100,000      | 1,000,000 | 10,000   | 46                   | 13.1                 |

With so many more reads than writes, the tests usually take a long time due to all of the reads. The writes actually finish in a reasonable amount of time for both DashMap and HashMap. Curiously, DashMap performance starts out better than HashMap's, but degrades to worse as the map's size increases. The same pattern can be seen in the "Prior Writes" tab of the google sheet.

Focus on individual operations

These tests differ from the original benchmarks in that they select one operation, either reads or writes, as the focus. The focused operation has a finite number of operations scheduled to complete within one second (unless bottlenecked). The other operation keeps running until the focused operation completes, then exits immediately.
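A sketch of this focus mechanism, using an atomic flag to stop the unfocused side (a hypothetical shape, not the repo's actual implementation):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let done = Arc::new(AtomicBool::new(false));
    let background = {
        let done = Arc::clone(&done);
        thread::spawn(move || {
            // The unfocused operation (say, writes) runs until told to stop.
            while !done.load(Ordering::Relaxed) {
                /* issue one write */
            }
        })
    };
    // Time only the focused operation (say, a fixed number of reads).
    let start = Instant::now();
    for _ in 0..1_000_000u32 {
        /* issue one read */
    }
    println!("focused side took {:?}", start.elapsed());
    done.store(true, Ordering::Relaxed);
    background.join().unwrap();
}
```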

Let's take the previous cases and run them again, with a focus on reads. Here's how you would run the read-focused version of the first test:

```sh
$ cargo run --release -- -s4 dashmap contention -r 1000000 -w 100000 --focus read
Contention test (readers) duration: 2.472s
```
| Focus | Prior writes | Reads/s   | Writes/s | DashMap duration (s) | HashMap duration (s) |
|-------|--------------|-----------|----------|----------------------|----------------------|
| read  | 0            | 1,000,000 | 100,000  | 2.5                  | 121                  |
| read  | 100,000      | 1,000,000 | 10,000   | 50                   | 29                   |
| write | 0            | 1,000,000 | 100,000  | 183                  | 1.8                  |
| write | 100,000      | 1,000,000 | 10,000   | 2.8                  | 1                    |

It seems that DashMap prioritizes reads, whereas RwLock<HashMap> prioritizes writes. Faced with a bombardment of reads, if your top priority is write performance, RwLock<HashMap> seems like the better option, at least with these numbers. Faced with a bombardment of writes, if your top priority is read performance, DashMap starts out faster, but its performance degrades as the size increases. It would be interesting to see how this plays out with other numbers, but I'll have to get around to that later.

@brooksprumo
Contributor

Firstly, thanks for the investigation and writeup work here @dnut!

Here's some general thoughts after reading through the PR description and code, the DashMap code, and the current Secondary Index code:

  • On mnb today, we write about 1,000 accounts per slot, so I think we should target the 1-10k writes per second numbers.

    • Note that of these 1,000 written accounts, most are not actually updated; we're currently doing a lot of wasted work. These unnecessary rewrites will be removed, but that's still a ways off. This will reduce the number of writes per second the secondary index needs to handle.
  • Looking at the RPC methods, it looks like we always read every entry of the inner map/set. This read is blasting through the inner map/set, copying all pubkeys out as fast as possible. I'm not sure how long this actually takes[1]. Any inserts to the same inner map/set (iow an insert with the same output pubkey) would need to wait; this would be the part of replay to look at.

    • An option to look at here is that we could handle secondary index requests asynchronously. Replay could push inserts/removes to a channel and then move on (see the sketch after this list). RPC methods would need to ensure all updates had completed for the slot of interest, or punt on the updates and say the state is transient when requesting information very close to the tip.
  • There is a fixed number of replay threads, and there is a fixed number of rpc threads (at least I think that's what --rpc-threads does). We could set the number of shards to the larger of these two. This could apply to both the outer and inner maps. [2]

  • I think we should keep the outer map as a DashMap. I know this PR is not changing that, just wanted to note it.
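To make the asynchronous-handling idea above concrete, here is a hedged sketch of replay pushing secondary-index updates through a channel to a single owner thread (all names here, including IndexUpdate, are illustrative; this is not the accounts-db API):

```rust
use std::sync::mpsc;
use std::thread;

type Pubkey = [u8; 32]; // stand-in for the real pubkey type

enum IndexUpdate {
    Insert { index_key: Pubkey, account_key: Pubkey },
    Remove { index_key: Pubkey, account_key: Pubkey },
}

fn main() {
    let (tx, rx) = mpsc::channel::<IndexUpdate>();
    // A single background thread owns the secondary index, so replay never
    // blocks on index locks; it just enqueues an update and moves on.
    let worker = thread::spawn(move || {
        for update in rx {
            match update {
                IndexUpdate::Insert { .. } => { /* apply insert to the index */ }
                IndexUpdate::Remove { .. } => { /* apply remove from the index */ }
            }
        }
    });
    tx.send(IndexUpdate::Insert { index_key: [0; 32], account_key: [1; 32] }).unwrap();
    drop(tx); // closing the channel lets the worker drain and exit
    worker.join().unwrap();
}
```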

Some requests:

  • Would it be possible to get benchmark results for outermap = Dash and innermap = Hash? Right now it looks like the results have both outer and inner the same.

  • It would be valuable to see how long [1] takes on a real RPC node on mnb. Ideally with all three: this PR's impl, the impl in master, and an impl setting both inner and outer DashMaps to 16 shards.

  • Seeing how [2] behaves (like above, on a real RPC node on mnb).

Are these reasonable? Let me know what you think. Thanks!


@willhickey
Contributor

This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave
