Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Compression #21

Open
scoddy opened this issue Nov 15, 2014 · 167 comments · May be fixed by #2441
Open

Implement Compression #21

scoddy opened this issue Nov 15, 2014 · 167 comments · May be fixed by #2441

Comments

@scoddy
Copy link
Member

@scoddy scoddy commented Nov 15, 2014

This issue is a tracking issue for the purposes of tracking discussions and other issues/PRs related to the request for implementing compression.

The following issues/PRs are related to this topic (and may therefore be closed in favor of this one):

@fd0 fd0 added SprintBacklog and removed SprintBacklog labels Nov 15, 2014
@scoddy scoddy added this to the 2014-50 milestone Nov 23, 2014
@fd0 fd0 removed this from the 2014-50 milestone Mar 14, 2015
@fd0
Copy link
Member

@fd0 fd0 commented Mar 14, 2015

When implementing this, add benchmarks and especially have a look at memory usage with -benchmem and benchcmp!

@ThomasWaldmann
Copy link
Contributor

@ThomasWaldmann ThomasWaldmann commented Mar 15, 2015

lz4, lzo, lzma, null. bz2 is rather slow.

@andrewchambers
Copy link

@andrewchambers andrewchambers commented Mar 15, 2015

snappy is fast with moderate compression

@cfcs
Copy link

@cfcs cfcs commented Mar 26, 2015

If compression is done per-chunk, care should be taken that it doesn't leave restic backups open to watermarking/fingerprinting attacks.

This is essentially the same problem we discussed related to fingerprinting the CDC deduplication process:
With "naive" CDC, a "known plaintext" file can be verified to exist within the backup if the size of individual blocks can be observed by an attacker, by using CDC on the file in parallel and comparing the resulting amount of chunks and individual chunk lengths.
As discussed earlier, this can be somewhat mitigated by salting the CDC algorithm with a secret value, as done in attic.

With salted CDC, I assume compression would happen on each individual chunk, after splitting the problematic file into chunks. Restic chunks are in the range of 512 KB to 8 MB (but not evenly distributed - right?).

  • Attacker knows that the CDC algorithm uses a secret salt, so the attacker generates a range of chunks consisting of the first 512 KB to 8 MB of the file, one for each valid chunk length. The attacker is also able to determine the lengths of compressed chunks.
  • The attacker then compresses that chunk using the compression algorithm.
  • The attacker compares the lengths of the resulting chunks to the first chunk in the restic backup sets.
  • IF a matching block length is found, the attacker repeats the exercise with the next chunk, and the next chunk, and the next chunk, ... and the next chunk.
  • It is my belief that with sufficiently large files, and considering the fact that the CDC algorithm is "biased" (in lack of better of words) towards generating blocks of about 1 MB, this would be sufficient to ascertain whether or not a certain large file exists in the backup.

AS always, a paranoid and highly unscientific stream of consciousness.

Thoughts?

@fd0
Copy link
Member

@fd0 fd0 commented Mar 26, 2015

Interesting. I don't understand how the attack you describe depends on whether compression is used, is it necessary at all? Doesn't this attack work with and without compression?

At the moment I'm thinking about how to implement #56. What are your thoughts on bundling together several blobs into a single file?

@cfcs
Copy link

@cfcs cfcs commented Mar 29, 2015

The exact workings of the CDC implementation is a bit unclear to me:

  • Do you split on exact byte boundaries or are the 512 KB - 8 MB blocks "rounded up" to a multiple of something?
  • (The real question is: are there (15 * 512 * 1024) / (16 because of AES-CTR) possible chunk size, or fewer?)
  • I'm also curious about how feasible it would be to reconstruct the seed given enough chunks of a known file - not very feasible, I'm guessing?

To answer your first question:
With the seeded CDC, the "fingerprint" depends on the (content + the secret seed), but the difference is that when compression is done after the chunking, and assuming that you can distinguish individual blocks from each other, you have a fingerprint/watermark (the compression rate of a certain block) that depends solely on the content, in this scenario a known plaintext.

Example:
If the watermarked file contains 64 MB (8-128 chunks) of "AAAA" , then 64 MB of "ABCABCABCABC", then 64 MB of random data, the first 16-256 chunks would be very small (because those sequences compress very well, where the 8-128 chunks would compress rather poorly).
The attacker would also be able to work backwards, starting with the very last (24th - 384th) chunk and compress 512KB-8MB until the attacker finds a size that compresses to exactly the same chunk size. Once that is found, the "next" 512KB-8MB of the original plaintext is compressed to find out which length compresses to the length of the second-to-last block (23rd - 383rd), and so on, until the attacker meets the small chunks that are the result of the "AAAA" strings.
This doesn't enable an adversary to positively confirm that a watermarked file is stored in the backup, but I think that statistically it can create quite clear results, given enough data.

I see some potential solutions, perhaps you have more ideas:

  • Permit turning off compression and/or deduplication for certain directories (probably the easiest to implement)
  • Randomize compression dictionaries (haven't really thought this one through, but it seems like an interesting idea)
  • Prevent attackers from learning the lengths of individual compressed chunks, for example by padding (potentially quite expensive) or by bundling together several small chunks and padding the remaining (more complexity, but more efficient)
@fd0
Copy link
Member

@fd0 fd0 commented Mar 29, 2015

Thanks for your explanation, I understand your scenario now.

There will probably be no option to turn off deduplication in restic, because that's really hard to do given the current structure of the program, restic is built around CDC. Adding compression support is of low priority right now and not a goal for the alpha release.

Your third idea will be implemented in #56 (bundling together multiple chunks), I'm working on that right now. And I'll probably add more documentation to doc/Design.md regarding how the chunker works.

Thanks again for bringing up this scenario!

@klauspost
Copy link
Contributor

@klauspost klauspost commented Aug 11, 2015

I am not sure if I follow @cfcs - isn't the compressed size just like an incredibly bad hash? Given a compressed file size, the number of possible inputs that generate that file size is endless. But probably I just don't understand.

Anyway. I just wanted to shamelessly point you to a modified deflate/gzip library I made. It might interest you that I put in a constant time compression mode, which enables ~150MB/s/core throughput on ANY data, which makes it almost invisible in backup scenarios. There is also a parallel gzip package that gzip-compresses bigger files on multiple cores.

@cfcs
Copy link

@cfcs cfcs commented Aug 11, 2015

@klauspost: Bear in mind that we are discussing the compressed sizes of individual chunks rather than files. A 100GB file with an average chunk size of 1MB will have about 100 x 1024 chunks, each leaking the compression ratio for a certain piece of the file. This results in much more statistical data than the size of a single compressed file, making it possible to compare a known plaintext to a compressed and chunked file even if the CDC salt (and thus the exact alignment borders of chunks) is unknown.
#151 has been merged, though, so this is probably not an issue now.

@fd0
Copy link
Member

@fd0 fd0 commented Aug 11, 2015

I my opinion, this information leak is only of minor importance, considering we have a seeded chunker (via the custom per-repo polynomial) and blob packing (where an attacker cannot see individual chunks, but only groups of chunks and the number of chunks in an group, via the cleartext header length). I think it's a good idea to offer disabling compression completely for speed or privacy concerns, but the default (when this is implemented) will probably be "enable".

@klauspost I will definitely look at your library, thanks for pointing it out!

@cfcs
Copy link

@cfcs cfcs commented Aug 11, 2015

I agree with your observations above, @fd0, but I'd like to add that I think there can be another important use-case for selectively disabling compression for specific target files/directories/devices, for example when backing up multimedia formats that already compressed, binary files with high entropy that do not compress well, or when doing incremental backups of encrypted volumes.

@klauspost
Copy link
Contributor

@klauspost klauspost commented Aug 11, 2015

@cfcs - ok - from your text I thought you implied you could deduce the data. So for this to make any difference, you would need the original data, which is rare.

Regarding non-compressible files, that is the reason I mentioned the constant time Huffman only mode I put into deflate, as the name implies it compresses all data at the same speed, and has automatic fallback to storing uncompressed data. So the maximum overhead is about 0.04% if the content is compressed.

Here are some benchmarks. The most applicable for backups is probably on the "Medium Compression", which is 10GB mixed content - compressible and uncompressible.

Defaulting to gzip Huffman only and having an option of specifying more CPU-consuming compression level could make sense.

@fd0
Copy link
Member

@fd0 fd0 commented Aug 12, 2015

In my opinion it could be valuable to selectively enabled/disable compression separately for data and tree objects. Especially large tree objects should compress really good.

@klauspost
Copy link
Contributor

@klauspost klauspost commented Aug 14, 2015

I looked at some "points" where it could make sense to insert compression. The most transparent and versatile place would be somewhere between the repository and the backend.

I briefly looked at modifying Repository.Encrypt and Repository.DecryptTo, but we don't have types, and the varying sizes would make a mess of things.

My proposition is to implement compression and encryption as a "backend", that both writes to an underlying backend. This will make compression and encryption transparent to the repository.

The reason we need to separate out encryption is that encrypted data doesn't compress (as you probably know).

repository.Repository

The compression algorithm can only be be set on initialization, and everything except the configuration is assumed to be compressed with that algorithm.

repository.Config

The configuration cannot be compressed. We add a string that indicates the decompression type to use for everything. Empty ("") is uncompressed. Otherwise it is the last part of the package name of the compression library used.

Note that compression levels can be changed between each run/type. There is no problem having a repo where some snapshots/types are deflate level 0 (store) and some at level 9 (best compression) - as long as the decompressor is the same.

type Config struct {
    Version           uint        `json:"version"`
    ID                string      `json:"id"`
    ChunkerPolynomial chunker.Pol `json:"chunker_polynomial"`
+   Compression       string
}

Compression is added as create parameter:

-func CreateConfig(r JSONUnpackedSaver) (Config, error) {
+func CreateConfig(r JSONUnpackedSaver, compression string) (Config, error) {

The backend is replaced after LoadConfig/CreateConfig. Here is an example of how it could look:

// SearchKey finds a key with the supplied password, afterwards the config is
// read and parsed.
func (r *Repository) SearchKey(password string) error {
    key, err := SearchKey(r, password)
    if err != nil {
        return err
    }

-   r.key = key.master
-   r.keyName = key.Name()
    r.Config, err = LoadConfig(r)
+   r.be, err = FindCompressor(r.Config.Compression, Encryption(key, r.be))
    return err
}

Compression Implementation

The compressor can implement selective/adjustable compression for some types. Since it is seen as a "backend", the compressed size will never we visible to the repository. The compressor must be summetric with all settings.

Issues

HELPME/FIXME: "Packed" files seems like a problem, since encryption re-starts at each file. If encryption is moved to the backend, encryption will be for the entire blob, not for each file. I assume that is a problem, and I have no good solution.

TODO: Find some good way for parameters/configuration to be sent to the compressor. Not needed for a first implementation.

TODO: Do we care about the on-disk sizes? If the above is implemented, the repository will not know it.

@fd0
Copy link
Member

@fd0 fd0 commented Aug 14, 2015

Thanks for sharing your thoughts, here are mine:

I think that compression and encryption need to be integrated with one another, I will think about this. At the moment, I don't like the thought of completely abstracting the compression/encryption layer from one another. As you already described, we must take care to not do stupid things, e.g. compress encrypted data. In addition, we should offer an option to disable compression in favor of speed and/or security concerns. Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional.

Also the packed files are a level of abstraction in itself (e.g. stuff can be repacked), which doesn't work with abstracting crypto.

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have good plan ready for it right now.

Regarding the different compression algorithms and options: I think we should select one algorithm (including a set of parameters) for data, and maybe a second one for compressing the JSON tree structures, but that's it. For all optional things (e.g. different configurable compression algorithms and/or parameters) I'd like to have a sound consideration weighting the benefit against the added complexity and amount of additional code.

Please don't get me wrong, I'd like to add features to restic, but especially for changes that modify the on-disk format, I need very good arguments. In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be managable.

@ThomasWaldmann
Copy link
Contributor

@ThomasWaldmann ThomasWaldmann commented Aug 14, 2015

you'll need different algorithms and parameters (and not just per repo, but even per backup run), there is no "best for every usecase" compression.

just as a practical example:

i do backups from my company server to my backup server at home. dsl uplink with ~700kbit/s.
for this i want the best compression (like lzma + high level). the cpu has lots of spare time at night while waiting for the crappy connection to accept the next packet.

to the same repository i also backup my laptop when i am at home, there i have a wireless "N" connection. of course i don't want to slow down the connection by using lzma + high level there, but rather i want something very quick that doesn't slow down at all - like lz4. i still want compression though, not using lz4 would take about 2x the space.

@fw42
Copy link
Member

@fw42 fw42 commented Aug 14, 2015

I want lzma here but lz4 there (rephrased :-))

That's all bikeshedding imho... Lets pick a reasonable default and not expose too much configuration, that will just introduce lots of complexity for little gain, imho.

@klauspost
Copy link
Contributor

@klauspost klauspost commented Aug 14, 2015

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have good plan ready for it right now.

Fair enough. I started looking through repository.go and found all the places you would insert a compression/decompression step, and the added complexity was not a good thing. By implementing it as a backend interface, you could suddenly take out all the encryption, and make it part of a backend chain. You could make a generic test that ensures symmetry of the backend parts, including compression and encryption.

Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional.

Which is why backend chaining makes it so nice. You just leave out the encryption (or create a passthrough backend) and it seamlessly works.

In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be managable.

This was the least intrusive way I can see. If there is a way to fix the 'packed file' issue, uncompressed repos remains fully backwards compatible; old client will be able to operate on them as before alongside new ones.

Compressed repos will obviously not, but we can change version to 2 only if the repo is compressed, this will make older clients fail. The check should then obviously be changed to if cfg.Version > RepoVersion {, but that has no compatibility problems.

That's all bikeshedding imho... Lets pick a reasonable default and not expose too much configuration, that will just introduce lots of complexity for little gain, imho.

Agree. Most algorithms (lzma/deflate) have plenty of flexibility within the same decompression format.

@Intensity
Copy link

@Intensity Intensity commented Nov 3, 2015

To test compressibility there is DataSmoke: https://github.com/Bulat-Ziganshin/DataSmoke

Also, pcompress chooses a nice assortment of compression algorithms: https://github.com/moinakg/pcompress

The squash compression abstraction library has a good list of algorithms and a benchmark: https://quixdb.github.io/squash/

There is a text compression benchmark here: http://mattmahoney.net/dc/text.html

@etix
Copy link

@etix etix commented Aug 8, 2019

Not every file will have an interesting compression ratio. Compressing a video file is probably worthless but compressing an SQL dump most certainly is. That's why filesystems like Btrfs first try to compress the first 128KB of a file and if there's a significant compression ratio will then compress the whole file. It's definitely not perfect but it's fast and should work for most use cases if it's decided to compress files individually.

@alyandon
Copy link

@alyandon alyandon commented Aug 8, 2019

For those that are arguing against providing compression as an option, my use case is that I back up a mixture of mostly compressible files types that I have no control of the content of and it is unreasonable to expect me to have to compress the data on multiple machines (which either takes up more local disk space in the case of compressing to a new archive or makes the files unusable by their associated applications if compressed in place) prior to performing a backup operation.

I'd prefer to be able to use restic as my DR backup tool but I'm currently using borg (slow, massive ram requirements, etc) because the resulting compression + dedupe it achieves saves me many gigabytes of network transfer per backup operation and easily over a terrabyte of storage space (which I pay for by the month) in the cloud over the entirety of my backup set. I would be able to retain backups for longer or reduce my storage costs if restic supported compression.

@filippobottega
Copy link
Contributor

@filippobottega filippobottega commented Aug 8, 2019

Hello @joerg , thank you for sharing your tests.
Have you tried to backup with restic the output of the Tar compression task?
I'm curious about comparing "Restic S3 Remote Compress" and "Tar" + "Restic S3 Remote Upstream".
Moreover what you says seems to be not really true:

I think restic would gain massively with optional compression of any kind because it reduces pretty much everything

Seeing the test results, it seems that the CPU time needed by restic is 2x longer for local backup and 6x longer on restore. Not really good compared to Tar.

@schmitch
Copy link

@schmitch schmitch commented Aug 8, 2019

tar is not a compression algorithm. of course it is fast.
EDIT: oh and btw. if you tar an directory it will not use multiple threads per file and it will also not work with two files or more at a time, instead it will scan the directory and add a file and then go to the next. quite slow. but the problem is the archive file which is not designed for adding with multiple threads.

@thedaveCA
Copy link

@thedaveCA thedaveCA commented Aug 8, 2019

@shibumi
Copy link

@shibumi shibumi commented Aug 8, 2019

@joerg Can your friends open a Pull Request and make their patches for restic with compression public available? Which compression algorithm do they use?

@filippobottega
Copy link
Contributor

@filippobottega filippobottega commented Aug 8, 2019

@joerg @thedaveCA
I apologies, I misunderstood the meaning of @joerg's assertion. Clearly restic with compression is better than restic without compression. Now my question is: Tar + restic is better or not compared to restic with compression?

@joerg
Copy link

@joerg joerg commented Aug 8, 2019

Please mind that we are not using bare tar archives, but gzipped tar ones with a special parallel zip implementation, otherwise archiving terabytes of data would take days instead of the "just" hours it takes right now: https://gist.github.com/joerg/b88bf1de0ce824894ffc38f597cfef5f#tarpigz
@shibumi I informed them about this issue and my posting so it is now up to them if and in how far they want to be involved in this. Personally, I hope they will open that pull request...

@sanmai
Copy link

@sanmai sanmai commented Aug 9, 2019

Compression is a no go for encryption. It lets an attacker to make a guess if an encrypted repository contains a certain file as a section (chunk) of a file compresses to the same size independent of an encryption key used. This is a very well known vulnerability of encryption protocols, and that is why compression was removed from TLS 1.3.

Let's not create a known problem where's none, shall we?

(I think this problem was already mentioned, and probable not even once. Still this issue is open, where I feel like for this only reason it should be closed once and for all.)

@k3a
Copy link

@k3a k3a commented Aug 9, 2019

Why are you spamming the issue? :( It has been discussed so many times that it's almost offtopic. You will not be FORCED to enable the compression!!

Moreover, I think your attack idea requires the attacker to be able to control the data to be compressed and encrypted (I am not sure though!). https://en.m.wikipedia.org/wiki/CRIME

But in any case, even if it is a security concern, someone may want to use compression only to a storage which is under his own control to simply save the storage space.

@sanmai
Copy link

@sanmai sanmai commented Aug 9, 2019

Having even an optional feature that weakens the encryption invites a false sense of security. Restic claims to be a secure backup program. Adding an optional compression will void this promise as you can't be secure sometimes, only full time, all the time. And there will be CVE reports, for sure. Who wants for their software this type of "popularity"?

But I think that adding a compression in a way that it will never be used with encryption is a viable option.

FWIW in 2017 I made a demo where I stripped encryption from Restic, and showed that compression can be very effective. Hundred times effective. IIRC compression can be add as a some kind of wrapper just like encryption, but I haven't looked at the code for a long time so things may be harder these days, or easier.

@schmitch
Copy link

@schmitch schmitch commented Aug 9, 2019

actually CRIME needs to know the length of the ciphertext, which is basically impossible in restic.
also there is no "secure" backup program. if the backup files are accessible by thirds parties there is always a chance that somebody can tamper or worse, read the data.
so saying compression makes it worse, is just stupid.

@sanmai
Copy link

@sanmai sanmai commented Aug 9, 2019

actually CRIME needs to know the length of the ciphertext

CRIME needs but you don't. Imagine you're a investigative journalist given a set of top secret files by your source. You back them up with encryption and no one will know you have these files.

Now imagine you were not clever enough to enable compression, and now everyone else, who also happen to have these files, just judging by the sizes of compressed-then-encrypted chunks, will know that you have these top secret files in this archive without even needing to know the encryption key. This is very far from being secure. People can go to jail because of this "feature", get tortured, or worse.

@sanmai
Copy link

@sanmai sanmai commented Aug 9, 2019

there is no "secure" backup program

This then needs an update.

Fast, secure, efficient backup program

Also note secure-by-default.

@viric
Copy link
Contributor

@viric viric commented Aug 9, 2019

@bkmeneguello
Copy link

@bkmeneguello bkmeneguello commented Aug 9, 2019

For who want to know more about these security concerns, there is a nice paper describing it http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
In my understand there could be a flaw if the file is chunked and then compressed and encrypted. But if the file is compressed before chunking, it's a binary file like any other and those plaintext attack become useless.

@sanmai
Copy link

@sanmai sanmai commented Aug 9, 2019

But if the file is compressed before chunking, it's a binary file like any other and those plaintext attack become useless.

That is correct. But that'll not exactly help with efficient deduplication if I understand correctly since a compression algorithm may use a different vocabulary for each version of a file resulting in a very different binary result. Which obviously won't deduplicate. Otherwise put, it only makes sense to compress the resulting chunks.

restic stores only packed chunks, so the size of chunks is not evident to someone not having the keys

That's a relief.

My point still stands: that there are many ways one can add a hidden weakness into a program when it implements compression together with encryption, so best not to add one at all. Even encryption experts deciding about TLS chose to remove compression. Guess they had a similar reasoning.

@schmitch
Copy link

@schmitch schmitch commented Aug 9, 2019

btw.:

However, it is important to note that these attacks have little security impact on, say, a bulkencryption application which compresses data before encrypting

...
also CRIME only works if you have multiple different versions of the encrypted files.
i.e. multiple backup runs (to different repositories, where the attacker obtained all of them)
and it also only works by a small amount of data.

CRIME needs but you don't. Imagine you're a investigative journalist given a set of top secret files by your source. You back them up with encryption and no one will know you have these files.

Now imagine you were not clever enough to enable compression, and now everyone else, who also happen to have these files, just judging by the sizes of compressed-then-encrypted chunks, will know that you have these top secret files in this archive without even needing to know the encryption key. This is very far from being secure. People can go to jail because of this "feature", get tortured, or worse.

that is bullshit. because it only works with a small sample size. it also can still be possible to go to jail without compression. especially at some point in time, when an attacker gained your backup files he might be able to brutforce them in the future.
there might be other security problems that appear in the future, etc...
the discussion just turned into meanigless fearmongering.

@alexeymuranov
Copy link

@alexeymuranov alexeymuranov commented Aug 9, 2019

@sanmai, I do not get this example with

Imagine you're a investigative journalist ... Now imagine you were not clever enough to enable compression, and now everyone else, who also happen to have these files, just judging by the sizes of compressed-then-encrypted chunks, will know that you have these top secret files in this archive without even needing to know the encryption key.

What is meant? That someone can guess that an encrypted snapshot has these files just by looking at the size? This assumes that the files are compressed alone, or together with other known files. But then the same guess can be done with an unencrypted shapshot.

Actually, how about gzipping files before backing up? Does this open a security vulnurability too?

I think this example is plain nonsense: if you claim you can determine whether a snapshot contains compressed versions of some (arbitrary) files known to you, you could as well determine whether it contains those files uncompressed.

I do not believe compression can make encryption significantly less secure.

@kieranclancy
Copy link

@kieranclancy kieranclancy commented Aug 9, 2019

Most compression side-channel attacks involve several factors:

  1. Attacker can control input
  2. Attacker can observe the size of the output
  3. Small changes to the input data result in measurable changes to the output size
  4. Attacker can change the input and retry hundreds of thousands of times

Unlike web-based systems, in the vast majority involving restic backups, (1) and (2) will rarely hold at the same time. Furthermore, for block-based compression (3) is not really guaranteed, and for most backup regimes (4) certainly does not hold. Because backup frequency is usually once a day or so, it would take thousands of years to be able to manipulate data and monitor the compressed output size to notice any significant differences, and that's assuming that no other data is changing, which in most cases it would be.

If you were making backups where the output size were visible then you might want to consider disabling compression. Otherwise, there are really no practical attacks against it and it wouldn't make it less secure to have it enabled.

restic already does de-duplication which exposes it to the same theoretical attacks as compression side-channels anyway, and nobody has complained about this to my knowledge.

The fact is, there are hundreds or thousands of users who would benefit from a compression feature with no down-sides whatsoever. Can we please just leave this 5-year old issue to the developers who are working on it?

@cljk
Copy link

@cljk cljk commented Aug 10, 2019

to be honest.... I prefer the concept of restic ... but I made tests in my usecase (lot of CSV files and SQL dumps) and had to switch to borg.

I tested with four generations of incremental backup and my files get a compression of 7:1 and together with deduplication I achieve > 20:1. I can´t ignore that since already said I pay for my online backup storage per GB.

root@xxxx:~# borg list
2019-08-08_14:37                     Thu, 2019-08-08 14:37:10 [5e113a8102f2bd7e40d100343f849dc73843d145011c7214d5fa0895927eb6d1]
2019-08-08_22:28                     Thu, 2019-08-08 22:28:21 [17d815d000212a576610b2fd5688ab87cce00039bb89f63722c6a7819dec1821]
2019-08-09_02:00                     Fri, 2019-08-09 02:00:23 [217c53b07f30dfbca584c49468cfa624a2445a005890220509c97715f7007e81]
2019-08-10_02:00                     Sat, 2019-08-10 02:00:10 [5dd45b8ccf0aa382bf00d5b08e1d5d88daae014f0a1a42b3e2b0fc368623bba0]
root@xxxx:~# borg info
Repository ID: xxxx
Location: ssh://xxxx
Encrypted: Yes (repokey)
Cache: /var/lib/borg/cache/xxxx
Security dir: /var/lib/borg/security/xxxx
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
All archives:               69.02 GB             11.24 GB              2.80 GB

                       Unique chunks         Total chunks
Chunk index:                    9227                41812
@sanmai
Copy link

@sanmai sanmai commented Aug 12, 2019

What is meant? That someone can guess that an encrypted snapshot has these files just by looking at the size? This assumes that the files are compressed alone, or together with other known files. But then the same guess can be done with an unencrypted shapshot.

Exactly. Slice a plain text file into equal pieces, compress then, then encrypt. Slice again, compress and encrypt. As sizes of encrypted files do not change AES-wise, you would see that in both cases you have a range sizes that match each other like a fingerprint. They (and by they I mean mainly administrations of oppressive regimes like Iran or Russia) can make a reasonable assumption that these files are present here, which therefore gives the reason, say, to continue to torture the suspect. I don't understand why y'all get so offended by these ideas, aren't they simple to understand? This ain't CRIME per se, is it?

But as noted before by @viric, technically Restic is not affected by these vulnerabilities as sizes of chunks are not seen without an encryption key. Yet if compression is added at some point, Restic may still not affected now, but may become affected later.

@bherila
Copy link

@bherila bherila commented Aug 12, 2019

Does adding compression expose Restic to any additional vulnerability, given that it already does deduplication?

@thedaveCA
Copy link

@thedaveCA thedaveCA commented Aug 12, 2019

@cljk
Copy link

@cljk cljk commented Aug 12, 2019

I simply don´t understand why you discuss about hypothetical security concerns about guessing a file presence by seeing the size of an encrypted chunk..,,

You Guys use ZIP or GZ? Then you should be fine.

You think that iranian authorities can guess my content by sizes? Then just dont use compression(!). That does simply not mean that compression should not be available.

@fd0
Copy link
Member

@fd0 fd0 commented Aug 12, 2019

I think we've covered all relevant angles of adding compression to restic, thank you very much for all your input.

I think we should add compression and enable it by default, but allow users to disable compression. Please have patience until I have some more time to work on it.

It feels to me this discussion gets out of hand, so I'm locking this issue for now. If you like to continue this discussion, please head over to the forum. Thanks!

@restic restic locked and limited conversation to collaborators Aug 12, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Linked pull requests

Successfully merging a pull request may close this issue.