Implement Compression #21

Open
scoddy opened this Issue Nov 15, 2014 · 116 comments

@scoddy
Member

scoddy commented Nov 15, 2014

No description provided.

@fd0 fd0 added next release and removed next release labels Nov 15, 2014

@scoddy scoddy added this to the 2014-50 milestone Nov 23, 2014

@fd0 fd0 removed this from the 2014-50 milestone Mar 14, 2015

@fd0

Member

fd0 commented Mar 14, 2015

When implementing this, add benchmarks and especially have a look at memory usage with -benchmem and benchcmp!
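
For illustration, a benchmark skeleton of that kind; compressChunk, the package name and the use of gzip are invented for this sketch and are not restic code.

// compress_bench_test.go - sketch only
package chunkcompress

import (
    "bytes"
    "compress/gzip"
    "math/rand"
    "testing"
)

// compressChunk compresses src into dst with a stand-in algorithm (gzip here).
func compressChunk(dst *bytes.Buffer, src []byte) error {
    zw := gzip.NewWriter(dst)
    if _, err := zw.Write(src); err != nil {
        return err
    }
    return zw.Close()
}

func BenchmarkCompressChunk(b *testing.B) {
    src := make([]byte, 1<<20) // roughly one average-sized chunk
    rand.Read(src)

    b.SetBytes(int64(len(src)))
    b.ReportAllocs() // reports the same numbers as running with -benchmem
    b.ResetTimer()

    var dst bytes.Buffer
    for i := 0; i < b.N; i++ {
        dst.Reset()
        if err := compressChunk(&dst, src); err != nil {
            b.Fatal(err)
        }
    }
}

Run it with go test -bench . -benchmem, save the old and new output, and compare the two with benchcmp.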

@ThomasWaldmann

Contributor

ThomasWaldmann commented Mar 15, 2015

lz4, lzo, lzma, null. bz2 is rather slow.

@andrewchambers

andrewchambers commented Mar 15, 2015

snappy is fast with moderate compression

@cfcs

cfcs commented Mar 26, 2015

If compression is done per-chunk, care should be taken that it doesn't leave restic backups open to watermarking/fingerprinting attacks.

This is essentially the same problem we discussed related to fingerprinting the CDC deduplication process:
With "naive" CDC, a "known plaintext" file can be verified to exist within the backup if the size of individual blocks can be observed by an attacker, by using CDC on the file in parallel and comparing the resulting amount of chunks and individual chunk lengths.
As discussed earlier, this can be somewhat mitigated by salting the CDC algorithm with a secret value, as done in attic.

With salted CDC, I assume compression would happen on each individual chunk, after splitting the problematic file into chunks. Restic chunks are in the range of 512 KB to 8 MB (but not evenly distributed - right?).

  • Attacker knows that the CDC algorithm uses a secret salt, so the attacker generates a range of chunks consisting of the first 512 KB to 8 MB of the file, one for each valid chunk length. The attacker is also able to determine the lengths of compressed chunks.
  • The attacker then compresses that chunk using the compression algorithm.
  • The attacker compares the lengths of the resulting chunks to the first chunk in the restic backup sets.
  • If a matching block length is found, the attacker repeats the exercise with the next chunk, and the next, and so on (roughly the loop sketched after this list).
  • It is my belief that with sufficiently large files, and considering the fact that the CDC algorithm is "biased" (for lack of a better word) towards generating blocks of about 1 MB, this would be sufficient to ascertain whether or not a certain large file exists in the backup.
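
A rough sketch of that guessing loop, purely illustrative: gzip and the fixed chunk-size bounds below are assumptions standing in for whatever algorithm and chunker limits actually apply, and none of it is restic code.

// fingerprinting sketch - illustrative only
package sketch

import (
    "bytes"
    "compress/gzip"
)

const (
    minChunk = 512 << 10 // assumed lower chunk bound, 512 KiB
    maxChunk = 8 << 20   // assumed upper chunk bound, 8 MiB
)

// compressedLen returns the compressed size of b.
func compressedLen(b []byte) int {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    zw.Write(b)
    zw.Close()
    return buf.Len()
}

// matchPrefix tries every candidate chunk length of the known plaintext and
// reports the first prefix whose compressed size equals the observed length of
// a chunk in the backup. The attacker would repeat this on the remainder of
// the plaintext for the next observed length, and so on.
func matchPrefix(plaintext []byte, observedLen int) (split int, ok bool) {
    for n := minChunk; n <= maxChunk && n <= len(plaintext); n++ {
        if compressedLen(plaintext[:n]) == observedLen {
            return n, true
        }
    }
    return 0, false
}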

As always, a paranoid and highly unscientific stream of consciousness.

Thoughts?

@fd0

Member

fd0 commented Mar 26, 2015

Interesting. I don't understand how the attack you describe depends on whether compression is used. Is compression necessary for it at all? Doesn't this attack work both with and without compression?

At the moment I'm thinking about how to implement #56. What are your thoughts on bundling together several blobs into a single file?

@cfcs

cfcs commented Mar 29, 2015

The exact workings of the CDC implementation are a bit unclear to me:

  • Do you split on exact byte boundaries or are the 512 KB - 8 MB blocks "rounded up" to a multiple of something?
  • (The real question is: are there (15 * 512 * 1024) / (16 because of AES-CTR) possible chunk sizes, or fewer?)
  • I'm also curious about how feasible it would be to reconstruct the seed given enough chunks of a known file - not very feasible, I'm guessing?

To answer your first question:
With the seeded CDC, the "fingerprint" depends on the content plus the secret seed. The difference is that when compression is done after the chunking, and assuming that you can distinguish individual blocks from each other, you have a fingerprint/watermark (the compression ratio of a certain block) that depends solely on the content, in this scenario a known plaintext.

Example:
If the watermarked file contains 64 MB (8-128 chunks) of "AAAA", then 64 MB of "ABCABCABCABC", then 64 MB of random data, the first 16-256 chunks would be very small (because those sequences compress very well), whereas the last 8-128 chunks of random data would compress rather poorly.
The attacker would also be able to work backwards, starting with the very last (24th-384th) chunk and compressing 512 KB-8 MB of the known plaintext until a size is found that compresses to exactly the same chunk size. Once that is found, the "next" 512 KB-8 MB of the original plaintext is compressed to find out which length compresses to the length of the second-to-last block (23rd-383rd), and so on, until the attacker meets the small chunks that are the result of the "AAAA" strings.
This doesn't enable an adversary to positively confirm that a watermarked file is stored in the backup, but I think that statistically it can create quite clear results, given enough data.

I see some potential solutions, perhaps you have more ideas:

  • Permit turning off compression and/or deduplication for certain directories (probably the easiest to implement)
  • Randomize compression dictionaries (haven't really thought this one through, but it seems like an interesting idea)
  • Prevent attackers from learning the lengths of individual compressed chunks, for example by padding (potentially quite expensive) or by bundling together several small chunks and padding the remaining (more complexity, but more efficient)
@fd0

Member

fd0 commented Mar 29, 2015

Thanks for your explanation, I understand your scenario now.

There will probably be no option to turn off deduplication in restic, because that's really hard to do given the current structure of the program: restic is built around CDC. Adding compression support is of low priority right now and not a goal for the alpha release.

Your third idea will be implemented in #56 (bundling together multiple chunks), I'm working on that right now. And I'll probably add more documentation to doc/Design.md regarding how the chunker works.

Thanks again for bringing up this scenario!

@fd0 fd0 added feature and removed enhancement labels Jul 5, 2015

@klauspost

Contributor

klauspost commented Aug 11, 2015

I am not sure if I follow @cfcs - isn't the compressed size just like an incredibly bad hash? Given a compressed file size, the number of possible inputs that generate that file size is endless. But probably I just don't understand.

Anyway. I just wanted to shamelessly point you to a modified deflate/gzip library I made. It might interest you that I put in a constant time compression mode, which enables ~150MB/s/core throughput on ANY data, which makes it almost invisible in backup scenarios. There is also a parallel gzip package that gzip-compresses bigger files on multiple cores.
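
For reference, a minimal sketch of wiring up that constant-time mode; it assumes github.com/klauspost/compress/flate, which mirrors the standard compress/flate API, and newFastCompressor is an invented name.

// sketch, not restic code
package sketch

import (
    "io"

    "github.com/klauspost/compress/flate"
)

// newFastCompressor wraps w in a Huffman-only deflate writer: entropy coding
// without match searching, so throughput is roughly constant on any input and
// incompressible data is effectively stored with only a small overhead.
func newFastCompressor(w io.Writer) (*flate.Writer, error) {
    return flate.NewWriter(w, flate.HuffmanOnly)
}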

@cfcs

cfcs commented Aug 11, 2015

@klauspost: Bear in mind that we are discussing the compressed sizes of individual chunks rather than files. A 100GB file with an average chunk size of 1MB will have about 100 x 1024 chunks, each leaking the compression ratio for a certain piece of the file. This results in much more statistical data than the size of a single compressed file, making it possible to compare a known plaintext to a compressed and chunked file even if the CDC salt (and thus the exact alignment borders of chunks) is unknown.
#151 has been merged, though, so this is probably not an issue now.

@fd0

Member

fd0 commented Aug 11, 2015

In my opinion, this information leak is only of minor importance, considering we have a seeded chunker (via the custom per-repo polynomial) and blob packing (where an attacker cannot see individual chunks, but only groups of chunks and the number of chunks in a group, via the cleartext header length). I think it's a good idea to offer disabling compression completely for speed or privacy concerns, but the default (when this is implemented) will probably be "enabled".

@klauspost I will definitely look at your library, thanks for pointing it out!

@cfcs

cfcs commented Aug 11, 2015

I agree with your observations above, @fd0, but I'd like to add that I think there can be another important use case for selectively disabling compression for specific target files/directories/devices, for example when backing up multimedia formats that are already compressed, binary files with high entropy that do not compress well, or when doing incremental backups of encrypted volumes.

@klauspost

Contributor

klauspost commented Aug 11, 2015

@cfcs - ok - from your text I thought you implied you could deduce the data. So for this to make any difference, you would need the original data, which is rare.

Regarding non-compressible files, that is the reason I mentioned the constant-time, Huffman-only mode I put into deflate: as the name implies, it compresses all data at the same speed and automatically falls back to storing uncompressed data. So the maximum overhead is about 0.04% if the content is already compressed.

Here are some benchmarks. The most applicable for backups is probably the "Medium Compression" test, which is 10GB of mixed content, compressible and incompressible.

Defaulting to gzip Huffman-only and having an option to specify a more CPU-consuming compression level could make sense.

@fd0

Member

fd0 commented Aug 12, 2015

In my opinion it could be valuable to selectively enable/disable compression separately for data and tree objects. Especially large tree objects should compress really well.

@klauspost

Contributor

klauspost commented Aug 14, 2015

I looked at some "points" where it could make sense to insert compression. The most transparent and versatile place would be somewhere between the repository and the backend.

I briefly looked at modifying Repository.Encrypt and Repository.DecryptTo, but we don't have types, and the varying sizes would make a mess of things.

My proposition is to implement compression and encryption each as a "backend" that writes to an underlying backend. This will make compression and encryption transparent to the repository.

The reason we need to separate out encryption is that encrypted data doesn't compress (as you probably know).

repository.Repository

The compression algorithm can only be set on initialization, and everything except the configuration is assumed to be compressed with that algorithm.

repository.Config

The configuration cannot be compressed. We add a string that indicates the decompression type to use for everything. Empty ("") is uncompressed. Otherwise it is the last part of the package name of the compression library used.

Note that compression levels can be changed between each run/type. There is no problem having a repo where some snapshots/types are deflate level 0 (store) and some at level 9 (best compression) - as long as the decompressor is the same.

type Config struct {
    Version           uint        `json:"version"`
    ID                string      `json:"id"`
    ChunkerPolynomial chunker.Pol `json:"chunker_polynomial"`
+   Compression       string      `json:"compression"`
}

Compression is added as a create parameter:

-func CreateConfig(r JSONUnpackedSaver) (Config, error) {
+func CreateConfig(r JSONUnpackedSaver, compression string) (Config, error) {

The backend is replaced after LoadConfig/CreateConfig. Here is an example of how it could look:

// SearchKey finds a key with the supplied password, afterwards the config is
// read and parsed.
func (r *Repository) SearchKey(password string) error {
    key, err := SearchKey(r, password)
    if err != nil {
        return err
    }

-   r.key = key.master
-   r.keyName = key.Name()
    r.Config, err = LoadConfig(r)
+   r.be, err = FindCompressor(r.Config.Compression, Encryption(key, r.be))
    return err
}

Compression Implementation

The compressor can implement selective/adjustable compression for some types. Since it is seen as a "backend", the compressed size will never be visible to the repository. The compressor must be symmetric with all settings.
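
To make that concrete, here is a minimal sketch of the chaining idea; Saver and the other names are invented stand-ins, not restic's actual backend interface.

// chaining sketch - not restic code
package sketch

import (
    "bytes"
    "compress/gzip"
    "io"
)

// Saver is the minimal surface of an underlying backend for this sketch.
type Saver interface {
    Save(id string, rd io.Reader) error
}

// compressBackend compresses everything before handing it to the next element
// of the chain, which could itself be the encrypting backend.
type compressBackend struct {
    next Saver
}

func (c compressBackend) Save(id string, rd io.Reader) error {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := io.Copy(zw, rd); err != nil {
        return err
    }
    if err := zw.Close(); err != nil {
        return err
    }
    return c.next.Save(id, &buf)
}

A passthrough implementation of the same interface would then be the "compression disabled" (or "encryption disabled") case, which is what makes the chaining attractive.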

Issues

HELPME/FIXME: "Packed" files seem like a problem, since encryption re-starts at each file. If encryption is moved to the backend, encryption will be for the entire blob, not for each file. I assume that is a problem, and I have no good solution.

TODO: Find some good way for parameters/configuration to be sent to the compressor. Not needed for a first implementation.

TODO: Do we care about the on-disk sizes? If the above is implemented, the repository will not know it.

@fd0

Member

fd0 commented Aug 14, 2015

Thanks for sharing your thoughts, here are mine:

I think that compression and encryption need to be integrated with one another; I will think about this. At the moment, I don't like the thought of completely abstracting the compression and encryption layers from one another. As you already described, we must take care not to do stupid things, e.g. compress encrypted data. In addition, we should offer an option to disable compression in favor of speed and/or security concerns. Regarding encryption: there may be a non-crypto mode for restic later, but for now crypto isn't optional.

Also, the packed files are a level of abstraction in themselves (e.g. stuff can be repacked), which doesn't work with abstracting away the crypto.

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have a good plan ready for it right now.

Regarding the different compression algorithms and options: I think we should select one algorithm (including a set of parameters) for data, and maybe a second one for compressing the JSON tree structures, but that's it. For all optional things (e.g. different configurable compression algorithms and/or parameters) I'd like to have a sound consideration weighing the benefit against the added complexity and the amount of additional code.

Please don't get me wrong, I'd like to add features to restic, but especially for changes that modify the on-disk format, I need very good arguments. In the general case of adding compression, I can see the benefit, but the complexity and the changes to the on-disk format must be manageable.

@ThomasWaldmann

Contributor

ThomasWaldmann commented Aug 14, 2015

you'll need different algorithms and parameters (and not just per repo, but even per backup run); there is no "best for every use case" compression.

just as a practical example:

i do backups from my company server to my backup server at home. dsl uplink with ~700kbit/s.
for this i want the best compression (like lzma + high level). the cpu has lots of spare time at night while waiting for the crappy connection to accept the next packet.

to the same repository i also backup my laptop when i am at home, there i have a wireless "N" connection. of course i don't want to slow down the connection by using lzma + high level there, but rather i want something very quick that doesn't slow down at all - like lz4. i still want compression though, not using lz4 would take about 2x the space.

@fw42

Member

fw42 commented Aug 14, 2015

I want lzma here but lz4 there (rephrased :-))

That's all bikeshedding imho... Let's pick a reasonable default and not expose too much configuration; that will just introduce lots of complexity for little gain, imho.

@klauspost

Contributor

klauspost commented Aug 14, 2015

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have a good plan ready for it right now.

Fair enough. I started looking through repository.go and found all the places you would insert a compression/decompression step, and the added complexity was not a good thing. By implementing it as a backend interface, you could suddenly take out all the encryption, and make it part of a backend chain. You could make a generic test that ensures symmetry of the backend parts, including compression and encryption.

Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional.

Which is why backend chaining makes it so nice. You just leave out the encryption (or create a passthrough backend) and it seamlessly works.

In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be managable.

This was the least intrusive way I could see. If there is a way to fix the 'packed file' issue, uncompressed repos remain fully backwards compatible; old clients will be able to operate on them as before, alongside new ones.

Compressed repos obviously will not be, but we can change the version to 2 only if the repo is compressed; this will make older clients fail. The check should then obviously be changed to if cfg.Version > RepoVersion {, but that has no compatibility problems.

That's all bikeshedding imho... Let's pick a reasonable default and not expose too much configuration; that will just introduce lots of complexity for little gain, imho.

Agree. Most algorithms (lzma/deflate) have plenty of flexibility within the same decompression format.

@Intensity

Intensity commented Nov 3, 2015

To test compressibility there is DataSmoke: https://github.com/Bulat-Ziganshin/DataSmoke

Also, pcompress chooses a nice assortment of compression algorithms: https://github.com/moinakg/pcompress

The squash compression abstraction library has a good list of algorithms and a benchmark: https://quixdb.github.io/squash/

There is a text compression benchmark here: http://mattmahoney.net/dc/text.html

@sanmai

sanmai commented Jan 11, 2018

If we compressed whole files, that could interfere with the deduplication algorithm, possibly making it less efficient.

Other than that, let's not forget that any compression, while offering tremendous advantages space-wise, opens us up to a side-channel attack. From the size of compressed data one can make an educated guess about its contents. I think this was mentioned before, but still.

@fd0

Member

fd0 commented Jan 11, 2018

@alphapapa

We seem to be talking about compression on the chunk level, not on the file level, correct?

Yep, on the chunk level.

If so, that obviously limits the effectiveness since duplicated data in multiple chunks of a single file will be stored and compressed for each chunk. However, that obviously depends on the chunk size as well. So, what is the average chunk size? Seems like this is an important factor in how useful compression is.

We're aiming for 1MiB, but it can be as large as 8MiB.

If the chunk size is fairly small, perhaps we should consider full-file, pre-chunking compression for highly compressible files (e.g. using @klauspost's estimator). For example, a 50 MB text file (e.g. log files, large Org-mode files, etc) is likely to be highly compressible as a single file. But if it's chunked first, and then each chunk is compressed individually, not sharing an index, that will greatly limit the effectiveness of the compression (IIUC).

At first I'd like to integrate compression on the chunk level, and see how well that performs in real-life scenarios. We can revisit this idea later.

@fd0

Member

fd0 commented Jan 11, 2018

@klauspost Thanks a lot for taking the time to benchmark some algorithms/implementations and your recommendations, I appreciate it! While it would be nice to have zstd, I think not depending on cgo is much more important for the project as a whole. And using a compressibility estimator is a great idea, I love that.
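
For illustration only, the cheapest form of such an estimator is a byte-entropy check on (a sample of) each chunk; the sketch below is a generic version, neither klauspost's estimator nor restic code.

// generic compressibility estimate - sketch only
package sketch

import "math"

// entropyBitsPerByte returns the Shannon entropy of b in bits per byte.
// Values close to 8 suggest already-compressed or encrypted data that is not
// worth compressing; low values suggest compression will pay off.
func entropyBitsPerByte(b []byte) float64 {
    if len(b) == 0 {
        return 0
    }
    var counts [256]int
    for _, c := range b {
        counts[c]++
    }
    var h float64
    n := float64(len(b))
    for _, c := range counts {
        if c == 0 {
            continue
        }
        p := float64(c) / n
        h -= p * math.Log2(p)
    }
    return h
}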

The places you mentioned for adding compression/decompression sound good, but we need to track the metadata for that somewhere else. I think we'll probably add meaning to bits in the byte in the pack header, see http://restic.readthedocs.io/en/latest/100_references.html#pack-format. This is the part that needs to be done very carefully.
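
Purely to illustrate what "adding meaning to bits in the byte in the pack header" could look like, here is a hypothetical layout; the constants are invented, and the real encoding is whatever the pack format document ends up specifying.

// invented constants for illustration only
package sketch

const (
    typeData       byte = 0
    typeTree       byte = 1
    flagCompressed byte = 1 << 7 // one spare bit of the type byte marks compression
)

func isCompressed(t byte) bool { return t&flagCompressed != 0 }
func blobType(t byte) byte     { return t &^ flagCompressed }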

So, let me finish with #1494, then we'll see that this gets resolved.

@cfcs

cfcs commented Jan 20, 2018

@sanmai re: side-channels: I brought that up originally.
Various solutions were suggested; I personally would be content with:

  • having configuration options for whitelisting/blacklisting use of compression (similar to what we have for file inclusion)

Another idea was to try to hide the chunk boundaries in the packfiles, which would theoretically make this harder. But I feel that you would still be able to time network writes, and side channels such as which filesystem extent a chunk was written to could be used to infer the boundaries. So I feel that the safest/easiest option would be to just advise against compressing sensitive data.

@lwis

lwis commented Feb 10, 2018

This would be awesome! 🍺 +$10

@iluvcapra

iluvcapra commented Feb 21, 2018

Just throwing it out there, but setting aside lzma or any of the more general compression algos, what about just run-length encoding or zero-squashing? Or would this not be sufficiently useful to enough people?

(I have a dog in this hunt, I often am backing up huge WAV files with lots of silence.)

@jahands

jahands commented Feb 21, 2018

+$15

@bherila

bherila commented Feb 23, 2018

Just throwing it out there, but setting aside lzma or any of the more general compression algos, what about just run-length encoding or zero-squashing? Or would this not be sufficiently useful to enough people?

Also useful for backing up VM drives with mostly empty space / sparse files (not sure if restic already supports backup/restoring sparse files)

@fd0

Member

fd0 commented Feb 24, 2018

@bherila restic does not support archiving/restoring sparse files yet, the files will be stored in the repo as if they just contain many zeroes. These large blocks of zeroes are deduplicated, so it'll not take up much space in the repo. For restore though, you will end up with a regular (non-sparse) file without "holes".

@Alwaysin

Alwaysin commented Mar 8, 2018

I just wanted to check: is there already some kind of compression? I've backed up several computers, including one with 50GB of data, and I get a much lower number on the server:

# du -shc /home/restic/
40G     /home/restic/
40G     total

@rawtaz

Contributor

rawtaz commented Mar 8, 2018

@Alwaysin It's probably the deduplication, unless some files were excluded of course.

@Alwaysin

Alwaysin commented Mar 8, 2018

@rawtaz thank you, I wasn't aware of deduplication; that must be it!

@cfcs

cfcs commented Mar 8, 2018

@iluvcapra squashing of large repeated blocks is already implemented through deduplication, as mentioned by @rawtaz.

@fd0

Member

fd0 commented Mar 12, 2018

@klauspost

Contributor

klauspost commented Mar 13, 2018

Yes, but honestly a stream decoder is the easy part. I have finished FSE en/decoding and have a Huffman encoder ready. Once the Huffman decoding is done a zstd stream decoder is pretty straightforward, with the full encoder being the final part.

@dave-fl

dave-fl commented Mar 20, 2018

LZ4 is perfectly sufficient and would also be a quick win.

Why not add lz4 and create another PR to support zstd?

@fd0

Member

fd0 commented Mar 20, 2018

Why not add lz4 and create another PR to support zstd?

@dave-fl because we need to be very careful when we modify the repository format. It must be done in a backwards-compatible way. The most important part of the whole project is the repo format, not the implementation. People depend on us to not screw up the format so they can restore their data :)

@flibustenet

flibustenet commented Mar 20, 2018

I think we cannot wait too long for compression. I just did some tests on a few repositories of server backups; I gain exactly nothing when I gzip the repository! Like @Alwaysin, I already gain 30% with deduplication.

About the backwards-compatible way: do you mean restic should read both formats, or that there should be a tool to migrate from the old one to the new one? While restic is not yet at v1.0.0, I believe it's OK to just migrate.

@fd0

Member

fd0 commented Mar 20, 2018

I just did some tests on a few repositories of server backups; I gain exactly nothing when I gzip the repository!

Uhm, that's expected: all data in the repo is encrypted, so it is hardly compressible at all. If compression is used, it must be applied to the data before encryption.
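
For illustration, a minimal sketch of that ordering: compress first, then encrypt. AES-GCM and the sealChunk name are stand-ins chosen for this sketch and have nothing to do with restic's actual crypto.

// ordering sketch - not restic code
package sketch

import (
    "bytes"
    "compress/gzip"
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "io"
)

// sealChunk compresses the plaintext and only then encrypts the result;
// reversing the order would leave essentially incompressible ciphertext.
// key must be 16, 24 or 32 bytes for AES.
func sealChunk(key, plaintext []byte) ([]byte, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write(plaintext); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }

    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    aead, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, aead.NonceSize())
    if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }
    // prepend the nonce so the ciphertext is self-contained
    return aead.Seal(nonce, nonce, buf.Bytes(), nil), nil
}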

@dave-fl

dave-fl commented Mar 20, 2018

I don't see how using LZ4 makes things non backwards compatible. Compression is compression. Why not support multiple formats?

@flibustenet

flibustenet commented Mar 20, 2018

You're right, I didn't think about that.
However, when I gzip the source I don't gain more than 30%; deduplication is already very efficient on a big directory with many duplicates. But of course with both it could be impressive.
With zpaq, which does compression and deduplication, I gain just a little more, not much.
I'm very open to testing a branch with compression, it doesn't matter if it's not compatible!

@dimejo

Contributor

dimejo commented Mar 20, 2018

I don't see how using LZ4 makes things non backwards compatible. Compression is compression. Why not support multiple formats?

What happens if 2 clients are using the same repo but 1 of them uses an older version of restic which doesn't support compression? This feature needs to be designed carefully with all possible corner cases in mind.

I'd prefer no compression over a half-working solution that possibly breaks previous backups.

@fd0

Member

fd0 commented Mar 20, 2018

I think there has been enough discussion on the issue of adding compression. I can see it's a highly anticipated feature. I will tackle this next after finishing the new archiver code (see #1494).

Please don't add any further comments, thanks!

@dave-fl

dave-fl commented Mar 20, 2018

@dimejo What you are stating has nothing to do with what I've proposed. Whether you choose to implement zstd or lz4 would impact both cases.

I dare say that the CGO version of zstd looks somewhat portable :)

@anarcat

Contributor

anarcat commented May 24, 2018

I looked, very briefly, at how feasible it would be to write a golang implementation of zstd, based on the specification.

zstd mostly uses in-house algorithms, but it (optionally) relies on the xxHash-64 checksum for error checking, and there's a golang port of that. Since the optional bits are, well, optional, you wouldn't have to implement those parts to get zstd support for a reader/writer in restic. zstd supports the concept of "dictionaries" to optimize compression; I am not sure how that would interact with restic, but it would be an interesting area of research for compressing specific parts of the archive, e.g. JSON or metadata streams. Otherwise that part could also be skipped, as it is optional.

Where it gets trickier, of course, is where entropy coding kicks in. zstd uses a novel approach there called Finite State Entropy (FSE, a variation of ANS, https://en.wikipedia.org/wiki/Asymmetric_numeral_systems), of which only a C implementation exists as well. The other entropy coding bits are implemented with Huffman coding, of which there are several implementations, including two in the standard library: one in compress/flate and another in net/http2/hpack, which is rather odd.

From what I can tell, everything else is glue on top of that: some Huffman trees, sequences, frames and blocks. There are interesting properties in the way blocks and frames are built that might map well onto restic blobs, which might make it possible to compress the repository as a whole while keeping blobs separate inside, although I haven't looked into that in detail. It might also create an unacceptable coupling between the repository format and the compression.

zstd is considerably more complicated than gzip or xzip, with about 70k lines of code (according to cloc) compared to 36k and 12k, respectively. That includes tests, however, which are numerous: when those are ignored, the implementation itself is roughly comparable with gzip (~34k).

So, in summary, it's just a matter of time before this is implemented in Go. I believe such an engine could also leverage golang's parallelism, because zstd "frames" are independent from each other. It's unclear to me, however, how frames are used in practice: most streams I tested had only one (zstd /etc/motd) or two (zstd isos/Fedora-Workstation-Live-x86_64-27-1.6.iso) frames (as found by binwalk -R "\x28\xb5\x2f\xfd"), so maybe not such a gain there, because blocks are interrelated and less parallelizable...
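
As an aside, the frame count can be reproduced without binwalk: a zstd frame starts with the magic number 0xFD2FB528 (little-endian, i.e. the byte sequence 28 B5 2F FD), so a small scan is enough. A sketch:

// counts candidate zstd frame starts, like the binwalk signature scan above
package sketch

import "bytes"

// zstdMagic is the frame magic number 0xFD2FB528 in little-endian byte order.
var zstdMagic = []byte{0x28, 0xB5, 0x2F, 0xFD}

func countFrames(data []byte) int {
    n := 0
    for i := 0; ; {
        j := bytes.Index(data[i:], zstdMagic)
        if j < 0 {
            return n
        }
        n++
        i += j + len(zstdMagic)
    }
}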

Anyway, this is all moot unless someone here wants to actually sit down and port it, but I figured I would share what I found while reading the thing... Considering that zstd is an expansion of LZMA, part of the LZ77 family of compressors, it shouldn't be infeasible to port...
