Poor SaveVersion() performance when using PruningOptions #256
Hi @Lbird, thank you for the thorough report and analysis! Your proposed patch seems reasonable, we'll have a closer look at this.
Okay, this is exactly what I'm seeing on the Game of Zones nodes.
I want to take the opportunity to try this on a high-load network.
Running zmanian@5f042ac on the Game of Zones hub now. Monitoring for changes in behavior. It took several hours after a restart for RAM to really blow up.
zmanian@5f042ac appears to have slowed the memory growth but not stopped it.
Maybe we need to avoid using MemDB altogether. MemDB seems not to be adequate for long-term use.
I remember I saw some code snippet in cosmos-sdk (I forgot the exact location though) that sets a pruning option to be
And I think that is the intended use of the pruning option. That being said, there is no use in keeping multiple versions in recentDB. That means we can close and discard recentDB after we save the tree to snapshotDB, and open a fresh recentDB for the intermediate versions until the next snapshot version. That would solve the memory problem as well. Any objection to this idea? I can make a PR for this.
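A minimal sketch of the idea, assuming recentDB is a tm-db MemDB; the helper name and package are mine, and the real change would have to live inside nodeDB itself:

```go
package iavlutil

import (
	dbm "github.com/tendermint/tm-db"
)

// resetRecentDB illustrates the idea above: once a snapshot (checkpoint)
// version has been flushed to snapshotDB, the old in-memory recentDB is
// closed and replaced with a fresh, empty MemDB for the intermediate
// versions until the next snapshot.
func resetRecentDB(old dbm.DB) (dbm.DB, error) {
	// Everything the old MemDB held is already persisted in snapshotDB,
	// so it can simply be dropped.
	if err := old.Close(); err != nil {
		return nil, err
	}
	return dbm.NewMemDB(), nil
}
```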
The above idea requires a non-trivial interface change, since iavl.NewMutableTreeWithOpts() is public and receives recentDB as an argument, so we cannot guarantee that recentDB is always a MemDB. Anyway, I think there are two options to solve the issue:
Interesting, are you seeing an actual memory leak in btree? It's not just that we're accumulating garbage data in the database, or a Go GC issue?
I'm not 100% sure yet. And I'm not sure it is a leak. But I've seen real RES increase over time in a sample program. tm-db.MemDB.Delete() is quite simple, and I couldn't find any flaws in the MemDB code. So my guess is that the underlying btree is responsible for the memory increase. I've seen some sub-slicing in btree's code. I'll take a closer look.
One thing I've observed is that when the btree gets large, |
I'm interested in trying out a PR with this change.
Hi @Lbird, thank you for the detailed issue!! Your initial proposed fix makes sense, though perhaps an eventual fix should involve

```go
// Doesn't exist, load.
buf, err := ndb.recentDB.Get(ndb.nodeKey(hash))
if err != nil {
	panic(fmt.Sprintf("can't get node %X: %v", hash, err))
}
persisted := false
if buf == nil {
	// Doesn't exist, load from disk
	buf, err = ndb.snapshotDB.Get(ndb.nodeKey(hash))
	if err != nil {
		panic(err)
	}
	if buf == nil {
		panic(fmt.Sprintf("Value missing for hash %x corresponding to nodeKey %s", hash, ndb.nodeKey(hash)))
	}
	persisted = true
}
```

Purging and restarting the memDB seems like something worth trying, though I don't see why it would be an issue if
I thought about that also. But it would harm the general performance in the intermediate versions between snapshot versions. I thought about adding a new method like GetNodeToSave(), but I think it is ugly.
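For concreteness, a sketch of the variant being discussed, namely also checking snapshotDB when a node is found in recentDB so that it can be marked persisted. The names follow the snippets in this thread, but the function itself is my illustration, not an actual patch; the extra snapshotDB read on every in-memory hit is exactly the intermediate-version cost mentioned above:

```go
// getNodeBytes (hypothetical name): look the node up in recentDB first,
// fall back to snapshotDB, and report whether the node already exists on
// disk so SaveBranch can skip re-writing its subtree at the next snapshot.
func (ndb *nodeDB) getNodeBytes(hash []byte) (buf []byte, persisted bool) {
	key := ndb.nodeKey(hash)

	buf, err := ndb.recentDB.Get(key)
	if err != nil {
		panic(fmt.Sprintf("can't get node %X: %v", hash, err))
	}
	if buf != nil {
		// Found in memory, but it may already have been flushed at an
		// earlier snapshot version; this extra disk read is the cost
		// being discussed.
		diskBuf, err := ndb.snapshotDB.Get(key)
		if err != nil {
			panic(err)
		}
		return buf, diskBuf != nil
	}

	// Not in memory at all: load from disk as before.
	buf, err = ndb.snapshotDB.Get(key)
	if err != nil {
		panic(err)
	}
	if buf == nil {
		panic(fmt.Sprintf("Value missing for hash %x", hash))
	}
	return buf, true
}
```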
I see, if everything is as expected, there should be no problem. MemDB code instructs the btree code to delete obsolete nodes. But there is something wrong somewhere between MemDB and btree. Output from my sample program:
MemDB size is initially zero. But after saving at tree version 179000, the size becomes 1, not zero. I think that is a root node maybe?
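Not the sample program itself (that part of the comment did not survive), but a sketch of the kind of size check described, assuming the tm-db iterator API (signatures from memory, so they may differ slightly between tm-db versions):

```go
package iavlutil

import (
	dbm "github.com/tendermint/tm-db"
)

// countKeys returns how many entries a tm-db database currently holds. It
// can be called on recentDB right after SaveVersion() at a snapshot version
// to check whether the in-memory store has drained as expected.
func countKeys(db dbm.DB) (int, error) {
	itr, err := db.Iterator(nil, nil) // nil, nil covers the full key range
	if err != nil {
		return 0, err
	}
	defer itr.Close()

	n := 0
	for ; itr.Valid(); itr.Next() {
		n++
	}
	return n, nil
}
```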
The remaining 1 item turned out to be the version root information from saveRoot(). I applied the following patch:

```diff
diff --git a/nodedb.go b/nodedb.go
index b28b664..852f427 100644
--- a/nodedb.go
+++ b/nodedb.go
@@ -767,9 +767,11 @@ func (ndb *nodeDB) saveRoot(hash []byte, version int64, flushToDisk bool) error
 	key := ndb.rootKey(version)
 	ndb.updateLatestVersion(version)
-	ndb.recentBatch.Set(key, hash)
 	if flushToDisk {
 		ndb.snapshotBatch.Set(key, hash)
+		ndb.recentBatch.Delete(key)
+	} else {
+		ndb.recentBatch.Set(key, hash)
 	}
 	return nil
```

But the memory still increases over time. I'll look into the btree code.
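For readability, the saveRoot() body after this patch reads roughly as follows (reconstructed from the hunk above; anything outside the shown lines is omitted):

```go
func (ndb *nodeDB) saveRoot(hash []byte, version int64, flushToDisk bool) error {
	key := ndb.rootKey(version)
	ndb.updateLatestVersion(version)

	if flushToDisk {
		// Snapshot version: the root goes to disk, and any copy held in the
		// in-memory recentDB is deleted so recentDB can drain back to empty.
		ndb.snapshotBatch.Set(key, hash)
		ndb.recentBatch.Delete(key)
	} else {
		// Intermediate version: keep the root only in memory.
		ndb.recentBatch.Set(key, hash)
	}
	return nil
}
```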
Couldn't find any flaw in the btree code. And even after applying the purging logic, there is no notable difference in memory usage in one-hour tests. Strangely, the memory usage increases even with the default option. In order to tell whether there is an unrecoverable memory usage increase, I need some long-term testing. Maybe tens of hours. :(
Thanks again for looking into this @Lbird, it's much appreciated! I hope to have some time to dig into this myself later today. It's quite possible that this is a Go GC issue as well, it can have rather unintuitive behavior at times.
Oh, it's good to hear that! :) BTW, I ran long (roughly 7-8 hour) tests overnight, and the result is interesting. With an identical load, the default option gave 404MB max RES, while the pruning option and the pruning option with the purging procedure gave 451MB and 457MB max RES respectively. The purging procedure is indeed not very helpful; the pruning option yields ~50MB more max RES either way. I don't know what accounts for the ~400MB base RES, but ~50MB more RES is quite reasonable considering the usage of recentDB. Maybe there is no memory leak related to recentDB?
Yes, that sounds reasonable to me. The Go garbage collector rarely actually frees memory, instead keeping it allocated for future use. It may be more useful to look at
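One low-effort way to separate GC behavior from real growth is to log the Go heap statistics alongside the OS-reported RES; a small sketch using only the standard library (my illustration, not necessarily what the truncated sentence above referred to):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Periodically print Go heap statistics so that heap-in-use can be compared
// against the process RES reported by the OS; a large, stable gap usually
// means memory the Go runtime keeps around for reuse rather than a leak.
func main() {
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("heap in use: %d MiB, heap idle: %d MiB, sys: %d MiB\n",
			m.HeapInuse>>20, m.HeapIdle>>20, m.Sys>>20)
	}
}
```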
Having done quite a bit of testing with your patch, I don't see any evidence of memory leaks - running thousands of versions with thousands of random create/update/delete operations per version, biased towards retaining 4096 keys total, shows a stable heap and memDB size over time. This was with

However, there does appear to be a problem with the pruning retaining more memDB data than it should. With the same pruning settings (
I've been busy working on a different IAVL issue, but had a quick look at this again today. The memory growth we see even with your patch is due to inner node write amplification, and is inherent to IAVL (and Merkle trees in general): as the tree grows, the number of inner nodes affected by a single key change is O(log n), and the probability that an inner node is shared by multiple changed keys is similarly reduced. I don't see any way to avoid this with the current design. However, the memory cost per key change should level out over time as O(log n), and I've run some benchmarks that seem to confirm this.

However, I'm not sure if the suggested PR is necessarily the right fix. Let's say I configure IAVL with

I will go through your original issue comment and the IAVL code more closely once I finish work on a different bug - without your PR, the memory usage is far, far higher than it should be, and I wonder if there might be an alternative quick fix which would still retain all

Thank you again for bringing this up!
Is the pruning opt 'syncable' not working on cosmos-sdk v0.38.4? I can't get proof data from any version but the latest one. Any help, please.
Hi @zouxyan. With
No, all versions except the latest (and each version that is divisible by 10000) are deleted. In previous versions of Cosmos SDK, there was no pruning, so all versions were always retained.
@erikgrinaker Thanks.
There was still pruning, but it's just that each version was always flushed to disk. |
@zouxyan we've introduced granular pruning options in the CLI/config. I can't recall if we backported this to 0.38, but you can check. If so, you can tweak these values to whatever you like. IMHO the default strategy is too conservative. It should flush to disk more often.
Thank you! I have some comments I hope you will consider.
Well, this is true only when there is a small number of writes in a relatively large database, say 100~1000 writes in a tree with a million leaves. But as a blockchain node keeps running for a very long time, the recentDB size may not stay at O(log n). Suppose a node starts with a million-leaf tree and keeps running until the tree holds 2 million leaves. The size of recentDB would be roughly k times O(log n), where k is not negligible compared to n. This is obvious when thinking of a node which starts with an empty tree and keeps running until the tree grows to 10GB: the size of recentDB will be O(n log n), i.e. the whole tree in memory. Without periodic restarts or periodic purging, it is inevitable. I think something must be done anyway. Honestly, my suggested patch was for the case with keepRecent = 1, so it may not be a proper solution for the general case.
For read performance, you are right when there are massive read operations over all nodes in recentDB. But consider this: all nodes in recentDB are nodes that were touched (updated or inserted) at some time in the past. Without purging, that means all the nodes touched since the tree was loaded from disk. It is difficult for me to imagine a tx which requires reads of the majority of the touched nodes in the tree. For generic read performance, isn't that the responsibility of the cache anyway? And there is already a cache mechanism.
Right, it is a rather difficult issue if we keep the original design. If you are considering a design change, my suggestion is to change the role of recentDB to something like a write buffer which offers a limited read-performance boost between snapshots or checkpoints. BTW, I prefer the term checkpoint to snapshot; it sounds clearer.
Since this is a blocking issue for our application, I have great interest in it. I hope to hear from you soon. Thank you!
Yes, this is the core of the problem; it seems like the current design expects to eventually keep the entire tree in memory.
I have run some tests, and the behavior at least seems to be correct for keepRecent != 1. I'm currently writing additional test cases to make sure it doesn't have any adverse effects, but I think we should be able to merge your patch shortly as a stopgap solution.
I agree.
This is what I've been thinking as well, we're considering it for the medium term.
I'm aiming to do a patch release with this fix and a couple of other bug fixes next week, pending further testing.
@Lbird I'm afraid we will be removing the current pruning implementation. There are several problems with it that I don't think can be fully addressed with the current design, and we unfortunately don't have the resources to do a redesign right now. We will be making a new release within the week.
@erikgrinaker OK, I understand. It seems you may close this issue and the PR. We will stick to the forked iavl until there is some progress on the pruning feature.
Sure. Please be aware that it has multiple known data corruption bugs, see e.g. #261. If you decide to keep running it, you should avoid changing your pruning settings, deleting the latest persisted version, or using
The pruning option of IAVL introduced with #144 is great. It could save a huge portion of the disk I/O burden. Nice approach. But there is a problem when dealing with a large database. Suppose the following case:
On a fresh start, recentDB is empty and it grows over time. Between snapshot versions (or between checkpoints), when we call MutableTree.Set() it saves a node in recentDB. Depending on the tree structure, several internal nodes would be touched as well. When we reach a snapshot or checkpoint version, MutableTree.SaveVersion() needs to save those nodes in recentDB to snapshotDB. The current code "clears" root.leftNode and root.rightNode. After some updates, some of the nodes would be retrieved from recentDB or snapshotDB and mutated. Untouched branches will remain nil. And when we reach the next snapshot version, a problem arises.
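For reference, the setup described here would be created roughly like this (a sketch: the constructor and the options helper are named after what this thread mentions, but the exact signatures are from memory of the iavl version in question and may differ):

```go
package main

import (
	"github.com/cosmos/iavl"
	dbm "github.com/tendermint/tm-db"
)

func main() {
	// Disk-backed store for snapshot (checkpoint) versions.
	snapshotDB, err := dbm.NewGoLevelDB("application", "data")
	if err != nil {
		panic(err)
	}
	// In-memory store for the versions between snapshots.
	recentDB := dbm.NewMemDB()

	// keepEvery=10000, keepRecent=1: flush every 10000th version to disk and
	// keep only the most recent version in memory in between.
	opts := iavl.PruningOptions(10000, 1)
	tree, err := iavl.NewMutableTreeWithOpts(snapshotDB, recentDB, 10000, opts)
	if err != nil {
		panic(err)
	}

	// Between checkpoints, Set() touches O(log n) nodes in recentDB; at a
	// checkpoint version, SaveVersion() flushes them to snapshotDB.
	tree.Set([]byte("key"), []byte("value"))
	if _, _, err := tree.SaveVersion(); err != nil {
		panic(err)
	}
}
```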
Untouched branches are marked by nil in the in-memory tree. The current SaveBranch() traversal calls ndb.GetNode() when a node is marked as nil. If this node happens to have been saved at the previous snapshot version, it will reside in recentDB and snapshotDB, but it will be loaded with node.saved = true and node.persisted = false, even though it was saved to disk and untouched since that snapshot version. Since node.persisted is false, SaveBranch() will be called on this node. This node will have nil children as well, so ndb.GetNode() is called again, and the process goes on. So eventually almost (all?) nodes in recentDB will be saved to snapshotDB again.

And there is a bigger problem: there is no cleaning procedure for recentDB. As the tree mutates over time, recentDB grows indefinitely until it stores almost all the nodes of the tree in MEMORY. Suppose you have a 100GB snapshotDB. That should not happen. And as recentDB grows, the SaveVersion() duration increases almost linearly. That should not happen either.
There could be a design change, I suppose: redefining recentDB as a write-through cache, introducing a "dirty" field in the nodes stored in recentDB, or whatever. However, we can handle this with a simple remedy for now.