Less RAM hungry deduplication implementation #6116
No, you don't understand something important. There is already a bit in the block pointer indicating that the block is dedup'd. If you care about the on-disk DDT contents, it's because you know you're dealing with DDT data and need to modify it. Searching the in-memory DDT for an existing record isn't going to save you much, because if you're searching for the record, you'll either modify it to increase the reference count (on a hit) or create a DDT entry with a count of 1 (on a miss). In either case you're looking at pretty random I/O; all you've saved is deferring the I/O, which may be to the I/O scheduler's advantage. If you're deallocating a block, then you need the DDT record on disk to decrement the reference count (and possibly delete the whole record), and it's the same thing all over again. Sorry, it might help, but it's not going to be a magic bullet or a "no-brainer".
Thank you for your comment! Sorry, most likely this is due to my lack of understanding of ZFS internals, but I still don't see why the in-memory DDT really needs 320 bytes per block. I'm not discussing the I/O that dedup causes here, only the RAM consumption of the deduplication mechanism.
You're assuming that the data on disk and the data in RAM are different. What really happens is that the on-disk data needs to be loaded into RAM to be updated. If nothing needs to be modified, there's really no reason to load the DDT from disk in the first place. When data does get read from disk it goes into the ARC, and it sees massive churn and random-seek workloads, since the access patterns of cryptographically hashed content are effectively random. This is why the DDT hurts so much.
But does it really need to be loaded into RAM completely at all times? Wouldn't it be possible to keep only an essential fraction of the data in RAM, load the rest when it is needed (on a block content match), and unload it once the metadata update has finished?
…Only to have to load the needed data again on the next write cycle? You're thrashing either way.
Only when the block contents match, and only the tiny fraction of the data relevant to the written block has to be loaded, and it does not need to stay loaded. So yes, there will be an extra read for each successfully deduped write, but these reads can be efficiently deferred and batched, so it won't be a huge I/O hit. And for this potential extra read we could get 20(!) times lower RAM usage.
I think the point of the idea is being missed. The idea is to trade off the guarantee of RAM-based hash uniqueness for more RAM efficiency. In other words, rather than storing the searchable hash table in RAM, store it on disk and do all operations on disk rather than in RAM, but with an additional, much smaller in-RAM hash table to search first, accepting that the odds of false collisions are much higher. (And any time a collision is detected, it then has to be looked up in the on-disk table to be sure.) Reference counters etc. would also be stored on disk rather than in RAM. The disk-based structure could be something like an SQLite database (possibly in /var, which has good odds of being on an SSD), or in the pool itself if necessary for integrity and atomicity. So a block write operation would first probe the small in-RAM table, and fall back to the authoritative on-disk table only on a probable match.
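As a hedged sketch of that write path (all names here are illustrative, not ZFS code; plain Python dicts stand in for the small in-RAM table and the authoritative on-disk store, and refcounts live only on the "on-disk" side, as described above):

```python
import hashlib

RAM_TABLE = {}     # short (8-byte) hash -> block location hint, kept in RAM
ON_DISK_DDT = {}   # full SHA-256 -> {"refcount": n}; stands in for the
                   # authoritative on-disk table (e.g. an SQLite database)

def write_block(data: bytes) -> str:
    full = hashlib.sha256(data).digest()
    short = full[:8]
    if short not in RAM_TABLE:
        # No short-hash match: certainly a new block, no disk lookup needed.
        RAM_TABLE[short] = len(ON_DISK_DDT)   # placeholder "short pointer"
        ON_DISK_DDT[full] = {"refcount": 1}
        return "new block written"
    # Short hashes collided: only now consult the on-disk table to see
    # whether this is a real duplicate or a false collision.
    if full in ON_DISK_DDT:
        ON_DISK_DDT[full]["refcount"] += 1
        return "deduplicated"
    ON_DISK_DDT[full] = {"refcount": 1}
    return "new block written (false short-hash collision)"
```

Note that the common case (a genuinely new block with no short-hash match) never touches the on-disk table for lookup; only probable duplicates pay for the extra read.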
In other words, we're accepting an increase in false collisions in exchange for memory efficiency, along with having to search the on-disk database now and then. Overall performance would be slower, since some percentage of blocks would have to be searched for twice, with one of those searches being on disk. The benefit, though, is a significantly smaller memory footprint, and the performance could even be a wash, since most hash searches would be made against a much smaller in-memory hash table. What would a shorter hash look like? There are many options, such as truncating the existing checksum to its first few bytes.
Hope this helps clarify the idea.
Would it be possible to reduce the RAM requirements when the deduplication feature is enabled?
From what I understand, RAM usage is currently about 320 bytes per block, which is connected to the size of the SHA-256 checksum (32 bytes) and blkptr_t (128 bytes).
Would it be possible not to store these huge structures in RAM, but instead use some short references? One could probably use an 8-byte short_data_hash (the first 8 bytes of the SHA-256, for example) and 8 bytes for a short_block_pointer. When the short hash of a new block matches one already in the dedup table, the full-length SHA-256 of the already-stored block must be read/recalculated, and then the full hashes are compared. For the short_block_pointer, the algorithm would have to use some form of offset that can be converted to the full blkptr, but I'm not familiar enough with the internal organization of ZFS to suggest how exactly this could be implemented, if at all.
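As an illustration of the proposed 16-byte in-RAM entry (not an actual ZFS structure — the `>8sQ` layout and the use of a plain block index as the short_block_pointer are assumptions for the sketch):

```python
import hashlib
import struct

# Hypothetical 16-byte in-RAM entry: 8 bytes of truncated SHA-256
# plus an 8-byte block index standing in for the short_block_pointer.
def make_entry(data: bytes, block_index: int) -> bytes:
    digest = hashlib.sha256(data).digest()
    return struct.pack(">8sQ", digest[:8], block_index)

entry = make_entry(b"some block contents", 12345)
print(len(entry))  # 16 bytes per block instead of ~320
```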
Assuming a 1PB dataset and an average block size of 64kB, we get ~16 gigablocks. With an 8-byte hash, that gives a per-write false-collision probability of ~10^-9 (and a false collision only costs an extra full-hash comparison, not incorrect dedup). An 8-byte short_block_pointer should be able to address ~1.2 yottabytes of disk space (again assuming an average block of 64kB).
Such an implementation (if feasible) would require ~16 bytes of RAM per block, or about 250MB of RAM per 1TB of disk space. Such memory requirements would turn dedup from a feature that is only advisable for a limited set of installations into an almost no-brainer.
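A quick back-of-envelope check of the figures above (64 KiB blocks and binary PB/TB, i.e. 2**50 and 2**40 bytes, are assumptions):

```python
# Back-of-envelope check of the numbers in the proposal above.

BLOCK = 64 * 1024

blocks_1pb = 2 ** 50 // BLOCK         # "~16 gigablocks" in a 1 PB dataset
p_false = blocks_1pb / 2 ** 64        # chance a new 8-byte hash prefix
                                      # matches one of the existing entries
addressable = 2 ** 64 * BLOCK         # span of an 8-byte block index
ram_per_tb = (2 ** 40 // BLOCK) * 16  # 16-byte entries per 1 TB of data

print(blocks_1pb)   # 17179869184 (2**34)
print(p_false)      # ~9.3e-10, i.e. the ~10^-9 quoted above
print(addressable)  # 2**80 bytes, ~1.2 yottabytes
print(ram_per_tb)   # 268435456 bytes = 256 MiB, i.e. the "~250MB"
```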