Make DB_SIZE= resize the hashtable #70
Change that and restart beesd. That will resize the hash table.
@Massimo-B Maybe you are confused by how truncate works when you looked at the script. Truncate does what the name says: it truncates the contents of the file at the specified position. Everything after that position is thrown away (you lose that part of the hash table). If you truncate the file after its end, the file is resized and the gap is filled with a hole (it makes the file sparse). Truncate does not create an empty file by overwriting it with zeros; it just creates void (or cuts away what ends up in void).
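A quick way to see the sparse-extend behavior (a minimal sketch, assuming a Unix filesystem with sparse file support such as btrfs or ext4):

```python
import os

# Write one real 4K block, then "grow" the file with truncate.
path = "/tmp/truncate-demo"
with open(path, "wb") as f:
    f.write(b"x" * 4096)

os.truncate(path, 1 << 30)  # extend to 1 GiB without writing any zeros

st = os.stat(path)
print(st.st_size)           # 1073741824 -- the apparent size
print(st.st_blocks * 512)   # ~4096 -- only the written block is allocated
```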
bees saves a few bits per hash table entry by encoding some data in the position within the hash table (i.e. some bits of the hash form the page address in the hash table file, so it is not necessary to store those bits in the hash table entry). If the hash table is resized, bees can no longer find most of the data, and rejects any data that is found as invalid. If the hash table size is multiplied by 2, then 50% of the data lands in the right place by chance (and the other 50% is lost), but this is an exception. If the hash table size is multiplied by 3, over 99% of the data is lost. Issue #44 has a deeper discussion of the technical issues (and an algorithm for resizing at #44 (comment)).
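To make the position encoding concrete, here is a toy model (my own illustration, not bees's actual on-disk layout) of why a resize orphans entries:

```python
import random

# Toy model: the page that holds an entry is derived from the hash
# itself, so those bits never need to be stored in the entry.
def page_of(h: int, n_pages: int) -> int:
    # assume n_pages is a power of two, so low hash bits select the page
    return h & (n_pages - 1)

old_n, new_n = 1 << 10, 1 << 11  # double the table size
hashes = [random.getrandbits(64) for _ in range(100_000)]

# After growing the file, old entries stay on their old pages, but a
# lookup now computes the page with the new size; an entry is only
# found if the two computations happen to agree.
found = sum(page_of(h, old_n) == page_of(h, new_n) for h in hashes)
print(f"{found / len(hashes):.1%} of entries still found")  # ~50%
```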
FWIW as you mentioned in #105 (comment) @Zygo, there are certain situations where creating a new empty hash table is preferable (e.g. full btrfs balance). I read #44 and all of the other documentation extensively. I am very far from an expert at the internals of bees and/or btrfs (I've done enough sysadminning/programming [is that a word?] to understand what's going on in general...) That said, perhaps I can make a small request/plea for some clarification/documentation on the internals of bees? Maybe a section on "Don't do these things - they will break btrfs" or "These things are OK/safe to do"? Especially on interactions with btrfs functionality like defrag/defrag+compress/balance/etc. (I think a huge factor in my lack of clarity comes from carrying assumptions from other filesystems' dedupe mechanisms (e.g. NTFS/ZFS) into the btrfs/bees world - some things are similar, some things are very different.) Maybe just as a suggestion (in no particular order at all):
I'd be happy to volunteer and help write the documentation I suggested above -- I just don't have the understanding (yet :) ) to do so. I hope my comments don't come off as critical -- I am enormously grateful for the project so far and am excited for the future.
NTFS dedupe attacks the problem from a very different vector. As far as I know, it creates a hidden chunk-store file somewhere in the volume's System Volume Information folder. I guess the ZFS approach is more like bees because it keeps a hash table in memory for dedupe, with the difference that it is real-time (a native implementation within the FS; dedupe happens before writing to disk) while bees is near-time (it walks and re-reads the extents as soon as transaction changes are detected; the extents have already been written to disk). From that perspective, you'd be better off using bees on storage holding NTFS images than using NTFS Dedup within a Windows VM.
The hash table becomes empty. You must also delete beescrawl.dat.
The answer is probabilistic and depends on which subset of the data is present in the hash table. If the hash table is large enough and there are no hash collisions, then all the data is present, and all duplicate extents can be deduplicated. If the hash table is smaller, unique data and data scanned a long time ago will be discarded in favor of duplicate data and data scanned recently. There is a random function which selects which hash table entries to discard, to avoid biasing the hash table toward some data structures at the expense of others (e.g. files which are duplicate in the middle, but have unique headers or trailers).

Once bees has located a pair of duplicate blocks, any adjacent blocks in the same extents can be deduplicated. This means that only one block per extent needs to be in the hash table. Blocks are deleted from the hash table randomly, so a larger extent is more likely to retain blocks in the hash table than a smaller extent. When comparing bees to a dedupe engine that uses blocks of constant size and alignment, bees appears to have a block size proportional to the ratio of data size to hash table size; however, this only compares averages while ignoring the rest of the practical differences in implementation.
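As a rough illustration (my own simplification, not bees's actual eviction policy): if each block of an extent survives in the hash table independently with probability p, the extent stays dedupable as long as at least one of its blocks survives:

```python
# Probability an extent can still be deduped when each of its blocks
# survives in the hash table independently with probability p.
def extent_dedupable(p: float, n_blocks: int) -> float:
    return 1 - (1 - p) ** n_blocks

for n in (1, 4, 32):  # 4K, 16K, and 128K extents
    print(f"{n * 4}K extent: {extent_dedupable(0.25, n):.4f}")
# 4K: 0.2500, 16K: 0.6836, 128K: 0.9999 -- larger extents tolerate a
# much smaller hash table than small ones do
```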
It has no effect on previously deduped data--it will stay deduped until it is deleted, overwritten, or defragmented (btrfs defrag is a data copy with some atomic update magic).
Memory usage is constant in bees and all dedupe state is persistent, so the best you could do would be to adjust the number of cores and IO/CPU priorities. It may be possible to sample the hash table to make an existing table smaller (e.g. discard 75% of the hash table) to get something similar to this.
defrag undoes all dedupe whether it compresses or not. bees will treat defragged data as new data and dedupe it again (usually undoing the defrag if a non-defragged reference to the old data still exists). btrfs balance does not modify extents or references except to relocate them, i.e. balanced data retains all fragment sizes and compression, but not position on disk (both virtual and physical addresses change). bees stores references to data by virtual address, so anything that changes the virtual address of data will invalidate the bees hash table. bees will interpret balanced data as new data and reinsert it into the hash table at its new virtual address.
Depends on what you mean by 'well'. bees copes with this by brute force, treating all such operations the same way as any other new data write: new data is read, looked up in the hash table, and inserted/updated/deduped as appropriate. Obviously there are cases (e.g. a full balance to reshape a RAID array) where this is not the best-performing behavior, hence #105 (comment). It will not break a system to run bees and balance together, but running two maintenance functions at the same time is usually slower than running them at separate times on spinning disks. defrag and bees are more directly opposed--each will undo the other.
When you figure this out, please tell me. ;) bees is designed to work with minimal degradation when the hash table is not optimally sized. If the hash table is too small, the dedupe ratio decreases because bees can't match extents that are far apart in space or time. If the hash table is too large, performance decreases because bees will have stale data in the hash table that takes extra time to read and invalidate. An "optimal" hash table size also depends on the desired efficiency rate vs. RAM usage, which can only be answered by a sysadmin: is it worth using an extra 30% of RAM to get a 2.6:1 dedupe ratio instead of 2.5:1?

The time component makes estimation based on size metrics difficult. It matters when data was written to the disk. Data that is written at close to the same time is more likely to be present in an undersized hash table than data that is written at different times (e.g. everybody downloads the same popular file at once). If all of your data is like that, you can reduce the bees hash table size with no impact; if none of your data is like that, you need a bigger hash table or you don't find any duplicates. It's also possible to delete the hash table and start over with a different size.
If your filesystem is too big to run an analysis of the actual duplicate data, some arithmetic gives useful bounds. The total size of the data divided by 4K (the btrfs dedupe alignment block size) and multiplied by 16 (the number of bytes in each bees hash table entry) is the largest hash table that could possibly be useful. At that size, every 4K block on the filesystem is indexed and all duplicate blocks can be found. If the hash table is larger than this size then bees will not use the extra space.

The average size of an extent is the uncompressed disk usage (10G) divided by the regular extent count (84388)--in this example, just under 128K. The probability of an extent being deduped is a function of the extent length and the ratio between the largest useful hash table size and the actual hash table size. For a 4K extent, if the hash table drops 1 block of the extent, the extent cannot be deduped; for a 128K extent, the hash table can drop 31 blocks before the extent cannot be deduped. If all extents were 128K, the filesystem could theoretically be deduped with a hash table 32 times smaller than the maximum useful size. Note these numbers are averages: some extents will certainly have all 32 blocks dropped from the hash table, while others will have more than 1 block still present. Real filesystems don't have a uniform data distribution, which affects the dedupe hit rate either way. Temporal proximity has a large impact as well: very old extents are more likely to be completely dropped, while very new extents probably have all their blocks in the hash table.
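Working through those numbers (a small sketch; the 10G and 84388-extent figures are taken from the example above):

```python
# Largest hash table that could possibly be useful: one 16-byte entry
# per 4K block of data.
data_size = 10 * 2**30   # 10 GiB of uncompressed data
block = 4096             # btrfs dedupe alignment block size
entry = 16               # bytes per bees hash table entry

max_useful = data_size // block * entry
print(max_useful / 2**20)          # 40.0 MiB indexes every 4K block

# Average extent size from the example's regular extent count.
extents = 84388
print(data_size / extents / 1024)  # ~124 KiB, "just under 128K"

# If all extents were 128K (32 blocks), one surviving block per extent
# suffices, so a table ~32x smaller could in theory still find them all.
print(max_useful / 32 / 2**20)     # 1.25 MiB
```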
config.md has a lot of those already. Do you have more concrete criticisms?
I'd start with uncompressed total data size (as reported by compsize).
In practice this is the only question that matters, unless you want to know how much RAM you should buy for a given data size.
bees's on-disk hash table is the same size as its in-RAM hash table. The file can occupy about 10% less space on disk due to filesystem compression.
Pull requests welcome!
bees uses btrfs's backref counting to achieve this. It is similar in some ways, just on a much different scale, i.e. if you have multiple reflinks to an extent and you delete some but not all of them, you can have unreachable blocks on the filesystem that don't get freed until the last reflink to any part of the extent is removed. This means you can sometimes see disk usage greater than referenced size in compsize output.
ZFS allows swapping of the hash table to disk. bees does not--anything that doesn't fit in RAM is simply discarded. I considered various swapping algorithms for bees, but any swapping algorithm would make bees 3-6 orders of magnitude slower, while only adding a few percent more dedupe ratio. 10,000 times slower for 1% more disk space didn't seem like a worthwhile tradeoff. That said, there has been some discussion here about putting some kind of hash table swapping into bees to reduce RAM requirements during long periods of idleness, but this is not the same use case as ZFS DDT swap.
bees trails a little behind the filesystem (waiting until at least two commits have gone to disk since the data was written before bees reads the data) due to limitations of the btrfs kernel interface. bees needs to know the physical locations of data on disk, but the kernel doesn't necessarily allocate data immediately, so bees has to wait until the data flush to disk is done before picking up the data. If all goes well, the data is still in the VFS cache, so at least bees doesn't need to do extra IOs to read it.
Yes, that's true, but it's still different in NTFS: if you delete all files sharing an extent, it won't free up space until the GC has run, because NTFS reflinks shared extents into one additional chunk database file. It also seems to be a very special feature: files with shared extents can only be accessed by Windows Server; Windows consumer versions cannot access such files (or maybe cannot even mount the volume). So extent reflinks in NTFS seem not to be something baked into the FS at a native level; it probably needs some helper service to access those files. This makes me think that NTFS extents shared this way are not really reflinked but are a special type of extent linking into the chunk database. It's probably built around the VSS functions, so the chunk database would be some special volume snapshot, and as such all chunks are kept even if all references to them are deleted. This is also a big headache for backup software which doesn't know about the chunk database and tries to back it up: this database may be very big and take up a good part of the storage space of the volume, and it constantly changes, so it is backed up over and over again by file-based backup software.
@Zygo That would mean the service wrapper script should kill the beescrawl.dat file whenever it changes the hash table size?
If the hash table is deleted and recreated, so should beescrawl.dat be. If the file is just resized with truncate, beescrawl.dat can be kept.
Is this from @Zygo still true, or did it change by now? The beesd wrapper does delete the beescrawl.dat file even when resizing: https://github.com/Zygo/bees/blob/master/scripts/beesd.in#L129-L133

```bash
if (( "$OLD_SIZE" != "$NEW_SIZE" )); then
    INFO "Resize db: $OLD_SIZE -> $NEW_SIZE"
    rm -f "$BEESHOME/beescrawl.dat"
    truncate -s $NEW_SIZE $DB_PATH
fi
```

Besides being unnecessary, does it impact performance, cause reindexing, or other issues?
If the hash table size is altered by any change other than multiplying or dividing the hash table size by 2, then bees won't be able to locate old hash table entries, and won't be able to dedupe new data against copies that were scanned before the hash table resize. That usually makes dedupe stop working, and to fix that, we remove beescrawl.dat so the filesystem is scanned again from the beginning. Even when dividing or multiplying by 2, half of the pre-resize hash table entries will be lost, but bees can still get a reasonable hit rate with only half the entries.

When bees can't locate the old hash table entries, it won't be able to dedupe any data that was scanned before the hash table was resized. This wasn't really an intended feature, but it is a result of the structure of the hash table: it's a giant array that places hash entries in blocks according to some of the bits of the hash itself, so those bits don't have to be stored in the entries.

There's a long-standing feature request to have bees properly support resizing the hash table, or a tool that can read a hash table and write a new table with a different size. This would involve reading the old hash table and inserting its data into a new hash table while preserving the historical LRU order, effectively simulating just the hash table insertions of a filesystem scan. That would remove the need to restart the filesystem scan from the beginning.
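A sketch of what such a tool might look like (hypothetical code: the page layout, entry format, and page_capacity are invented for illustration and are not bees's real on-disk format):

```python
Entry = tuple[int, int]  # (hash value, extent address) -- invented format

def resize_table(old_pages: list[list[Entry]], new_n_pages: int,
                 page_capacity: int = 256) -> list[list[Entry]]:
    """Replay old entries into a new table of a different size."""
    new_pages: list[list[Entry]] = [[] for _ in range(new_n_pages)]
    # Insert entries oldest-first (assuming each page stores its oldest
    # entries first) so each new page ends up holding its most recently
    # inserted entries, approximating the LRU state that a full rescan
    # of the filesystem would have produced.
    for page in old_pages:
        for h, addr in page:
            target = new_pages[h % new_n_pages]
            if len(target) >= page_capacity:
                target.pop(0)  # evict the oldest entry on a full page
            target.append((h, addr))
    return new_pages
```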
Hi,
is it possible to make changing DB_SIZE= in the beesd config trigger a resize of the hash table?
How would I actually resize the hash table myself, without losing all of the existing hash table contents?