accounts-db: Fix out-of-memory error caused by DashMaps in secondary indices #34955
dnut wants to merge 5 commits into solana-labs:master
Conversation
Fixes an OOM bug on high-core machines where memory utilization can exceed 1 TB. `program_id_index` and `spl_token_mint_index` previously used `DashMapSecondaryIndexEntry` for index entries. This changes them to `RwLockSecondaryIndexEntry` to reduce memory and CPU usage.
Requesting review from @carllin and @jeffwashington (I do not have permission to add reviewers to the PR).
Wow, this is thorough. I don't recall all the history. So we used to use
We always used DashMaps: https://github.com/solana-labs/solana/pull/14212/files#diff-95c60d63a9644e0efe4da37f0ccd4d7169a459e0c963170888b1fd56f340b3c6R22-R24 💀 I didn't do nearly as much benchmarking as @dnut here, so I'm inclined to merge this. I'm looking through the benchmarks out of curiosity, but otherwise looks good!
@dnut, for clarification, these tests were done with this
After reading through the benchmarks (very detailed by the way, great work!), a few things:
Yup, I think this was the intention. Here, I would be interested in a benchmark that replicates this sort of hotspot for reading, and compares the performance of writes.
A little worried the main account index updates might be blocked by this.
@carllin, thanks for the thoughtful feedback.
Yes, all the data in the spreadsheet comes from benchmarks with the expensive reads. I used run.sh, and these lines include the expensive reads. Just now, I committed a change to make the expensive reads the default.
Measuring reads and writes separately
It seems that you're looking for a different approach to measurement. You want separate numbers to come out of the test: one that indicates the read performance and one that indicates the write performance.
A simple approach would be to time the read and write tasks separately without changing the actual logic being benchmarked. Currently the benchmarks work by scheduling a finite number of reads and writes to complete over one second, and testing how long the entire thing takes. If it takes over a second, there was contention somewhere. Instead of timing the entire thing, we can independently measure how long it takes for all the readers to complete, and how long it takes for all the writers to complete, and end up with two separate numbers.
Another approach would be to run the test with the focus on one type of operation. Let's say we select writes as the focus. We just keep running reads at the specified rate until the pre-determined number of writes finish. The output of the test tells us how long the writes took.
Specific scenarios
Regarding your other concerns, I did try to address them in my initial set of benchmarks by testing every combination of reads, writes, and prior entries. The goal was that any realistic situation you can imagine was approximated by at least one benchmark. I used this triply nested for loop to generate test cases, and you can see the data from each combination in the Raw Data tab of this Google sheet. The other tabs in that sheet are aggregations based on the raw data, but they don't actually include all of the information. The raw data can be sliced and aggregated however you please to focus on specific relationships.
This is partially addressed by my existing data. Even if we don't measure writes independently, we can understand write performance by looking at how variable numbers of writes influence the overall benchmark, especially when there are low numbers of reads. The "light load" plot in the "Writes Per Second" tab illustrates write performance when there are fewer than 1,000 reads per second. Under these conditions, DashMap and
Just now, I created another two tabs in the Google sheet: one called "Isolated Writes" that is aggregated from the same raw data, and one called "Isolated Writes (new data)" from new benchmarks I just ran for more detail. Here we see that all the maps perform equivalently until 5,000,000 writes per second, which is where
My instinct here is that the "heavy load" or "reasonable load" plots in the "Writes Per Second" sheet should provide insight for this concern. While significant reads are occurring,
But I'd also like to address your concern with individual measurements of reads and writes, using the approaches I described above. I didn't re-run all the benchmarks with all the new measurement approaches because this would take several days to complete, and I wanted to post a reply today, but I've selected a few to re-run with this more granular data. Please feel free to run my code to test any cases that are interesting to you.
Separate timers
Like the original benchmarks, these tests run a pre-determined number of read and write operations that are scheduled to complete in one second (or longer in the event of a bottleneck). The only change here is that reads and writes are now separately timed, instead of timing the entire test (which effectively only measures whichever takes longer). Let's define 1 million reads per second as a hotspot. We can run the following command to run 100,000 writes with a 4-shard DashMap:
`cargo run --release -- -s4 dashmap contention -r 1000000 -w 100000`
Contention test (writers) duration: 1.33s
Contention test (readers) duration: 3.021s
Here's a summary from a few more test cases:
With so many more reads than writes, the tests usually take a long time due to all of the reads. The writes actually finish in a reasonable amount of time for both DashMap and HashMap. Curiously, DashMap performance starts out better than HashMap, but degrades to be worse than HashMap as its size increases. The same pattern can be seen in the "Prior Writes" tab of the Google sheet.
Focus on individual operations
These tests differ from the original benchmarks in that they select one operation, either reads or writes, to be the focus. The focused operation has a finite number of operations that are scheduled to complete within one second (unless bottlenecked). The other operation keeps running indefinitely until the focused operation completes, then exits immediately. Let's take the previous cases and run them again with a focus on reads. Here's how you would run the read-focused version of the first test:
`cargo run --release -- -s4 dashmap contention -r 1000000 -w 100000 --focus read`
Contention test (readers) duration: 2.472s
It seems that DashMap prioritizes reads, whereas
Firstly, thanks for the investigation and writeup work here @dnut! Here are some general thoughts after reading through the PR description and code, the DashMap code, and the current Secondary Index code:
Some requests:
Are these reasonable? Let me know what you think. Thanks!
This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave |
Problem
The validator crashes when indexing the account snapshot on high core-count machines. It consumes 1 TB of memory and the process is killed by the OOM killer. This occurs on every single startup, before the validator has a chance to catch up to the network.
This is a big problem for node operators like Syndica because it prevents us from upgrading our hardware.
To reproduce, run the validator with:
Root cause: Secondary indices contain many DashMaps. Each DashMap is sharded proportionally to core count. The massive number of shards triggers an explosion of memory usage.
Summary of changes
Replace DashMap with `RwLock<HashMap>` in the `program_id_index` and `spl_token_mint_index` secondary indices in accounts-db. This reduces total validator memory usage from over 1 TB to under 200 GB.
While the memory usage could be improved by hardcoding a low number of shards, this change instead fully replaces DashMap with `RwLock<HashMap>`, because DashMap appears to have little to no benefit in these indices, and it introduces unnecessary complexity and CPU load relative to HashMaps.
Root Cause
Secondary indices are stored in a DashMap that contains another nested DashMap for every entry in the index. This means a massive number of DashMaps will be instantiated.
Each DashMap is sharded, containing several `RwLock<HashMap>`s within it. When using `DashMap::new()`, DashMaps are sharded based on the number of cores, using this logic:
This means every DashMap on a 128-thread machine will contain 512 HashMap instances within it.
The overhead from allocating so many empty maps explains the memory usage.
Simply swapping out the DashMaps with HashMaps scales down the memory overhead by a factor of 512. Running with this patch on the same machine with 128 threads, the validator was able to finish the indexing process without exceeding 200 GB of total memory. We have been using this code on our production nodes for the past few weeks without issue, and have observed that the validator catches up in around 30% of the time that it used to take with DashMaps.
Benchmarks
I ran some benchmarks to illustrate the memory problem and to determine if DashMaps are actually beneficial.
https://github.com/dnut/dashmap-benchmark
I ran these benchmarks on a MacBook Pro with a 12-core M3 and 36 GB of RAM.
Memory
To illustrate the memory usage with a minimal example, I wrote a simple benchmark that initializes a large number of DashMaps or `RwLock<HashMap>`s. It measures the peak memory usage and the amount of time it took to initialize all the maps: `test_init_many_maps`.
This demonstrates the relationship between DashMap shards and memory usage. The final test ran out of memory at 9%, which implies the empty DashMaps would have used hundreds of GB.
`RwLock<HashMap>` only uses 2.1 GB for the same amount of data.
Contention
My assumption is that DashMap was used within each index entry because a select few of the index entries will be heavily used. For example, the token program in the program_id index will have many entries, and it may be written to or scanned concurrently at a high frequency. In this scenario, there may be some benefit to sharding the data to reduce lock contention.
As a basic benchmark for contention, I ran many concurrent read and write operations. For reads, I replicated the logic of `SecondaryIndexEntry::keys`, since that is the most expensive read operation on these indices. For writes, I did an insertion. These operations were scheduled across several threads to achieve a specified number of operations per second overall: `test_contention`.
Assuming there is no bottleneck, the tests should complete in about 1 second. If a test takes more than 1 second, this implies that operations are being blocked by locks held by operations in other threads. The contention causes operations to pile up and take longer than their scheduled time.
There are three customizable variables for this benchmark:
- `prior_writes`: number of items (roughly) that exist in the map before starting the benchmark
- `writes_per_second`: number of writes to execute per second during the benchmark (spread across many threads)
- `reads_per_second`: number of reads to execute per second during the benchmark (spread across many threads)

With run.sh, I tested every combination of each of these three variables set to every power of ten ranging from 0 to 10 million, and ran those tests for each of these three data structures:
- `RwLock<HashMap>`

This is a total of `3*8^3`, or 1536 benchmarks. I parsed the Rust stdout with parse.py into a CSV containing all 1536 duration data points, and uploaded it to this Google sheet: https://docs.google.com/spreadsheets/d/10XNX-CSBmejQnK8_YTbir0Y_aFxf0cn5YBmNQAD9CaQ/edit#gid=0
I aggregated this 6-dimensional data into a more digestible format by focusing on the individual relationships that each of the three variables has with respect to duration. I examined each relationship in three different load profiles (light, reasonable, and heavy). This generated a total of 9 tables and plots, which you can view in the other tabs of the spreadsheet. The aggregation logic is performed by
`aggregate()` in parse.py. For a more readable illustration of the logic, here is a query that would generate the duration vs prior_writes table for a light load profile:
These results show that, for this code, DashMaps are only faster than
`RwLock<HashMap>` in two situations:
In all other scenarios, `RwLock<HashMap>` had equal or better performance.
Another important point is that the concurrent DashMap operations were much more CPU intensive. Even if the same job finished in exactly the same amount of wall-clock time using both DashMap and `RwLock<HashMap>`, the DashMap code used about 8x as much CPU time according to the `time` command. This means DashMap is actually more computationally expensive than it appears based on the durations alone.
Conclusion
The data from my benchmarks suggests that DashMap is not helpful within secondary index entries, and is actually counterproductive.
The secondary index implementation does not hold the lock for very long. It quickly transforms the data and returns a copy, releasing the lock immediately. It seems we're not likely to benefit from sharding unless index entries are being written to millions of times per second.
Furthermore, DashMaps come with significant memory and CPU costs. The memory cost is clear: there are many more data structures contained within them, and this consumes hundreds of times more memory in practice. The CPU cost is not as easily explained, but the working hypothesis is that DashMaps suffer from significant memory fragmentation, plus they execute additional logic to deal with sharding. For example, common read operations such as `SecondaryIndexEntry::keys` and `SecondaryIndexEntry::len` acquire and release every single lock in the DashMap, because they read from every contained HashMap on every execution.
Due to the high cost and negligible benefit of DashMaps for secondary index entries, I propose that we replace them all with `RwLock<HashMap>`.