Less RAM hungry deduplication implementation #6116

Closed
mdimura opened this issue May 10, 2017 · 8 comments
@mdimura

mdimura commented May 10, 2017

Would it be possible to reduce the RAM requirements when the deduplication feature is enabled?
From what I understand, RAM usage is currently about 320 bytes per block, which is tied to the sizes of the SHA-256 checksum (32 bytes) and blkptr_t (128 bytes).
Would it be possible not to store these large structures in RAM, but to use short references instead? One could probably use an 8-byte short_data_hash (the first 8 bytes of the SHA-256, for example) and 8 bytes for a short_block_pointer. When the short hash of a new block matches one already in the dedup table, the full-length SHA-256 of the already-stored block would be read or recalculated and the full hashes compared. For the short_block_pointer, the algorithm would have to use some form of offset that can be converted to the full blkptr, but I'm not familiar enough with the internal organization of ZFS to say how exactly that could be implemented, if at all.

Assuming a 1 PB dataset and an average block size of 64 kB, that is ~16 billion blocks. With an 8-byte short hash, the chance that a new block's short hash falsely matches an existing entry (and forces a full-hash verification) is about 16×10^9 / 2^64 ≈ 10^-9 per write. An 8-byte short_block_pointer should be able to address ~1.2 yottabytes of disk space (again assuming 64 kB average blocks).

Such an implementation (if feasible) would require ~16 bytes of RAM per block, or about 250 MB of RAM per 1 TB of disk space. Memory requirements like that would turn dedup from a feature that is advisable only for a limited set of installations into almost a no-brainer.
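For the record, a back-of-the-envelope check of these numbers (illustrative only, using binary units and the same assumptions as above: 1 PB of data, 64 kB average blocks, an 8-byte hash prefix, and 16-byte in-RAM entries):

```c
#include <stdio.h>
#include <math.h>

/* Back-of-the-envelope check of the numbers above:
 * 1 PiB pool, 64 KiB average block, 8-byte hash prefix,
 * 16-byte in-RAM entry per unique block. */
int main(void)
{
    double pool_bytes  = pow(2, 50);                /* 1 PiB            */
    double block_bytes = 64.0 * 1024;               /* 64 KiB per block */
    double nblocks     = pool_bytes / block_bytes;  /* ~1.7e10 blocks   */

    /* Chance a new block's 64-bit prefix falsely matches one of the
     * existing entries, forcing a full-hash verification read. */
    double false_match_per_write = nblocks / pow(2, 64);

    /* Space addressable by an 8-byte block index at 64 KiB per block. */
    double addressable_bytes = pow(2, 64) * block_bytes;

    /* RAM for 16-byte entries, per 1 TiB of pool. */
    double ram_per_tib = 16.0 * (pow(2, 40) / block_bytes);

    printf("blocks:                %.2e\n", nblocks);                /* ~1.7e10  */
    printf("false match per write: %.1e\n", false_match_per_write);  /* ~9.3e-10 */
    printf("addressable bytes:     %.2e\n", addressable_bytes);      /* ~1.2e24  */
    printf("RAM per TiB:           %.0f MiB\n", ram_per_tib / (1024 * 1024));
    return 0;
}
```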

@DeHackEd
Contributor

No, you don't understand something important. There is already a bit in the block pointer indicating that the block is dedup'd. If you care about the on-disk DDT contents, it's because you know you're dealing with DDT data and need to modify it. Searching the in-memory DDT for an existing record isn't going to save you much, because if you're searching for the record, you'll either modify it to increase the reference count (if there's a hit) or create a DDT entry with a count of 1 (if there's a miss). In either case you're looking at pretty random I/O; all you've saved is deferring that I/O, which may be to the I/O scheduler's advantage. If you're deallocating a block, you need the DDT record on disk to decrement the reference count (and possibly delete the whole record), and it's the same thing all over again.
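In rough, self-contained pseudocode (hypothetical structures and names, not the actual ddt.c code; the array below just stands in for the on-disk DDT), the point is that every dedup-aware write or free dirties the authoritative record no matter how it was found:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define DDT_SLOTS 1024

typedef struct {
    uint8_t  full_hash[32];   /* e.g. SHA-256 of the block's contents */
    uint64_t refcount;        /* how many block pointers reference it */
    int      in_use;
} ddt_record_t;

static ddt_record_t disk_ddt[DDT_SLOTS];   /* stand-in for the on-disk DDT */

static ddt_record_t *ddt_lookup(const uint8_t hash[32])
{
    for (int i = 0; i < DDT_SLOTS; i++)    /* random I/O in real life */
        if (disk_ddt[i].in_use && memcmp(disk_ddt[i].full_hash, hash, 32) == 0)
            return &disk_ddt[i];
    return NULL;
}

static void dedup_write(const uint8_t hash[32])
{
    ddt_record_t *rec = ddt_lookup(hash);  /* load record from "disk" */
    if (rec != NULL) {
        rec->refcount++;                   /* hit: bump the reference count */
    } else {
        for (int i = 0; i < DDT_SLOTS; i++) {
            if (!disk_ddt[i].in_use) {     /* miss: new record, count 1 */
                memcpy(disk_ddt[i].full_hash, hash, 32);
                disk_ddt[i].refcount = 1;
                disk_ddt[i].in_use = 1;
                break;
            }
        }
    }
    /* Either way the record was dirtied and has to be written back. */
}

static void dedup_free(const uint8_t hash[32])
{
    ddt_record_t *rec = ddt_lookup(hash);  /* load record from "disk" */
    if (rec != NULL && --rec->refcount == 0)
        rec->in_use = 0;                   /* last reference: drop the entry */
    /* A surviving record with a lower count also has to be written back. */
}

int main(void)
{
    uint8_t h[32] = { 0xab };              /* pretend content hash */
    dedup_write(h);
    dedup_write(h);                        /* second write dedups */
    printf("refcount = %llu\n", (unsigned long long)ddt_lookup(h)->refcount); /* 2 */
    dedup_free(h);
    dedup_free(h);                         /* both references released */
    return 0;
}
```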

Sorry, it might help, but it's not going to be a magic bullet or a "no-brainer".

@mdimura
Author

mdimura commented May 10, 2017

Thank you for your comment! Sorry, most likely this is due to my lack of understanding of ZFS internals, but I still don't see why the in-memory DDT really needs 320 bytes per block. I'm not discussing the I/O that dedup causes here, only the RAM consumption of the deduplication mechanism.
From what I understand, deduplication is implemented as a hash table, where the key is the hash of a block's contents and the value is the location of the block on the disk/array. Each entry in this hash table stands for a unique block of data already on disk. When a new block is written, the deduplication algorithm has to check whether a block with the same content is already present in the file system. To do that, the hash of the new block's contents is looked up in the hash table. If it is not present, a new entry is added. If such a block already exists, the file system needs to update some metadata on disk, such as reference counts. It seems to me that such a hash table could work with not much more than 16 bytes of RAM per block.
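As a sketch of the entry size being argued for (illustrative only; the in-core entry ZFS actually keeps is much larger because it carries the full 256-bit checksum plus block-pointer and refcount data):

```c
#include <stdint.h>

/* Illustrative 16-byte in-RAM entry per unique block: an 8-byte hash
 * prefix plus an 8-byte compact location, instead of keeping the full
 * 32-byte checksum and 128-byte block pointer resident in memory.
 * A prefix match is only a candidate: the full hash of the existing
 * block still has to be read (or recomputed) and compared. */
typedef struct {
    uint64_t hash_prefix;   /* first 8 bytes of the block's SHA-256     */
    uint64_t block_index;   /* compact reference resolvable to a blkptr */
} compact_ddt_entry_t;

_Static_assert(sizeof(compact_ddt_entry_t) == 16,
               "16 bytes of RAM per unique block");
```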

@DeHackEd
Contributor

You're assuming that the data on disk and the data in RAM are different. What really happens is that the data on-disk needs to be loaded into RAM to be updated. If nothing needs to be modified, there's really no reason to load the DDT from disk in the first place. When data does get read from disk, it goes into the ARC, where it sees massive churn and random-seek workloads, since the access pattern over cryptographically hashed content is effectively random. This is why the DDT hurts so much.

@mdimura
Author

mdimura commented May 10, 2017

What really happens is that the data on-disk needs to be loaded into RAM to be updated.

But does it really need to be loaded into RAM completely, at all times? Wouldn't it be possible to keep only an essential fraction of the data in RAM, load the rest when it is needed (on a block-content match), and unload it once the metadata update has finished?

@DeHackEd
Contributor

.. Only to have to load the needed data again on the next write cycle? You're thrashing either way.

@mdimura
Author

mdimura commented May 10, 2017

.. Only to have to load the needed data again on the next write cycle?

Only when the block contents match, and only the tiny fraction of the data relevant to the written block has to be loaded, and it does not need to stay loaded. So yes, there would be an extra read for each successfully deduped write, but these reads can be deferred and batched efficiently, so it shouldn't be a huge I/O hit. And in exchange for this potential extra read we could get 20(!) times lower RAM usage.
It would also be possible to maintain a ring buffer/cache for the full versions of recently read entries, but its size would be under control, unlike the case where everything is loaded all the time.
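For illustration, the kind of bounded cache being suggested might look like this (hypothetical names; a simple fixed-size, direct-mapped table of recently verified full entries, keyed by the 64-bit hash prefix, so its memory footprint is an explicit tunable rather than proportional to pool size):

```c
#include <stdint.h>
#include <stddef.h>

#define FULL_CACHE_SLOTS 4096            /* explicit memory cap (tunable) */

typedef struct {
    uint64_t hash_prefix;                /* tag: first 8 bytes of SHA-256 */
    uint8_t  full_hash[32];              /* full checksum, read from disk */
    uint64_t block_index;
    int      valid;
} full_entry_t;

static full_entry_t full_cache[FULL_CACHE_SLOTS];

/* Return the cached full entry for this prefix, or NULL on a miss. */
full_entry_t *full_cache_lookup(uint64_t prefix)
{
    full_entry_t *slot = &full_cache[prefix % FULL_CACHE_SLOTS];
    return (slot->valid && slot->hash_prefix == prefix) ? slot : NULL;
}

/* Direct-mapped insert: a newer entry simply evicts whatever was there. */
void full_cache_insert(const full_entry_t *e)
{
    full_entry_t *slot = &full_cache[e->hash_prefix % FULL_CACHE_SLOTS];
    *slot = *e;
    slot->valid = 1;
}
```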

@jim-collier

I think the point of the idea is being missed.

The idea is to trade the guarantee of RAM-based hash uniqueness for better RAM efficiency. In other words, rather than storing the searchable hash table in RAM, store it on disk and do all operations on-disk rather than in-RAM, but add a much smaller in-RAM hash table that is searched first, accepting that the odds of false collisions in it are much higher. (Any time a collision is detected there, the on-disk table has to be searched to be sure.)

Reference counters etc. would also be stored on disk rather than in RAM. The disk-based structure could be something like a SQLite database (possibly in /var, which is quite likely to be on an SSD), or in the pool itself if necessary for integrity and atomicity.

So a block write operation might go something like this:

  1. At whatever point a full block hash would be stored in RAM (whether that's on demand or all at once, I don't know how that currently works), instead of doing that one thing, do these two things: store a smaller hash in RAM, and store the full SHA-256 hash in a database or other disk-based hash table, along with reference counters and whatever else is needed.

  2. When writing new blocks, first search for a collision in the smaller in-RAM hash table. If the new block is unique, great: update whatever needs to be updated in the on-disk table, and also add the new short hash to the in-RAM table. If it is not unique, fall back to a second stage of searching for the full SHA-256 hash in the on-disk table to make SURE (or "more sure") there really is a hash collision, then proceed from there using the same logic as just described.

In other words, we're trading an increase in false collisions, and the need to search the on-disk database now and then, for memory efficiency. Overall performance would be somewhat slower, since some percentage of blocks would have to be searched for twice, with one of those searches hitting the disk. The benefit, though, is a significantly smaller memory footprint, and performance could even be a wash, since most hash lookups would only touch the much smaller in-memory table.
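A self-contained sketch of that two-stage write path (all names are illustrative, and the "on-disk" table here is just an array standing in for whatever disk-backed structure would actually hold the authoritative records):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_ENTRIES 1024

static uint64_t ram_prefixes[MAX_ENTRIES];   /* stage 1: small in-RAM index */
static int      num_prefixes;

typedef struct {                             /* stage 2: authoritative record */
    uint8_t  full_hash[32];
    uint64_t block_index;
    uint64_t refcount;
} full_rec_t;
static full_rec_t disk_table[MAX_ENTRIES];   /* stand-in for on-disk storage */
static int        num_recs;

static bool prefix_present(uint64_t p)
{
    for (int i = 0; i < num_prefixes; i++)
        if (ram_prefixes[i] == p)
            return true;
    return false;
}

static full_rec_t *disk_lookup(const uint8_t hash[32])
{
    for (int i = 0; i < num_recs; i++)       /* costs real I/O in practice */
        if (memcmp(disk_table[i].full_hash, hash, 32) == 0)
            return &disk_table[i];
    return NULL;
}

static void insert_new(const uint8_t hash[32], uint64_t block_index)
{
    uint64_t p;
    memcpy(&p, hash, sizeof(p));             /* short hash = SHA-256 prefix */
    ram_prefixes[num_prefixes++] = p;        /* only 8 bytes kept in RAM    */
    memcpy(disk_table[num_recs].full_hash, hash, 32);
    disk_table[num_recs].block_index = block_index;
    disk_table[num_recs].refcount = 1;
    num_recs++;
}

/* Returns true if the write was deduplicated against an existing block. */
static bool dedup_write(const uint8_t sha256[32], uint64_t new_block_index)
{
    uint64_t p;
    memcpy(&p, sha256, sizeof(p));

    if (!prefix_present(p)) {                /* stage 1 miss: definitely new */
        insert_new(sha256, new_block_index);
        return false;
    }

    full_rec_t *rec = disk_lookup(sha256);   /* stage 2: verify the full hash */
    if (rec != NULL) {
        rec->refcount++;                     /* true duplicate */
        return true;
    }

    insert_new(sha256, new_block_index);     /* prefix false positive */
    return false;
}

int main(void)
{
    uint8_t a[32] = { 1 }, b[32] = { 1 };    /* two blocks with equal content */
    printf("%d\n", dedup_write(a, 100));     /* 0: first copy is unique       */
    printf("%d\n", dedup_write(b, 200));     /* 1: second copy deduplicates   */
    return 0;
}
```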

What would a shorter hash look like? There are many options, such as:

  • An actual shorter hash, like CRC-32. (We don't care about cryptographic "security" for this purpose.) A drawback to this approach would be having to compute two hashes every time a block is written, even without dedup turned on. It also might require a change to the on-disk format, which would understandably make it a less attractive solution.

  • The first 1/4 (or some other offset, as mdimura suggested) of the SHA-256 hash, which adds zero overhead in normal operation and only comes into play when dedup is turned on. Even then, it would require no change to the on-disk format and no overhead of calculating two hashes, just an additional per-block operation to take a shorter segment of each hash.

Hope this helps clarify the idea.
