Support for file encryption (e.g. non-trusted servers) #109
So I have had a look at BitTorrent sync, syncthing and alternatives and what I always wondered about was the possibility to not only sync between resources I own and trust, but also external resources/servers which I do NOT trust with my data, up to a certain extent.
One way to do this is using ecryptfs or encfs, but this has many obvious downsides: it is not an interoperable solution (only works on Linux), the files are actually stored in encrypted form on the disk (even if the resource is trusted and this is not necessary, for instance because of the file system being encrypted already), etc.
What I propose is somehow configuring nodes which are only sent the files in an encrypted format, with all file contents (and potentially file/directory names as well; or even permissions) being encrypted. This way, if I want to store my private files on a fast server in a datacenter to access them from anywhere, I could do this with syncthing without essentially giving up ownership of those files. I could also prevent that particular sync node from being allowed/able to make any changes to the files without me noticing.
I realize that this requires a LOT of additional effort, but it would be a killer feature that seems to not be available in any other "private cloud" solution so far. What are your thoughts on this feature?
EDIT: BitTorrent sync mentions a feature like this in their API docs: "Encryption secret
This would be amazing. I tried to spec out what this might look like in this clearskies extension, but it adds so much complexity that I've tabled plans for it for now.
Like you say, if only the file contents are synchronized to the "untrusted" peers, that would be a lot simpler to implement (i.e. the metadata never hits the untrusted peer in any form). I hadn't thought of that.
It seems like you even thought of a zero-knowledge-proof to show that the server is legitimate/actually stores the files (did I understand that correctly?). Not bad.
CTR mode sounds like an extremely bad choice to me, just like other stream-cipher constructions such as GCM. Yes, it is seekable and that is useful, but XORing two ciphertext snapshots encrypted with the same keystream tells an adversary exactly what changed between the plaintexts of those two versions. CBC is a much better choice: when seeking, you may need two blocks of ciphertext to decrypt the first block of plaintext, but that overhead is usually negligible because you will read more than one block anyway, and the more you decrypt the smaller the relative overhead.
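A minimal sketch of the keystream-reuse problem described above. The hash-based keystream here is a toy stand-in for a real CTR-mode cipher, and the key/nonce/messages are made up for illustration; the point is only that XORing two ciphertexts cancels the keystream and exposes exactly which bytes changed:

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy CTR-style keystream: hash(key || nonce || counter) blocks.
    Stands in for AES-CTR; NOT a real cipher."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key, nonce = b"shared-key", b"fixed-nonce"
v1 = b"pay 100 EUR to alice, ref 314159"  # version 1 of a file
v2 = b"pay 999 EUR to alice, ref 314159"  # version 2, 3 bytes changed

c1 = xor(v1, keystream(key, nonce, len(v1)))
c2 = xor(v2, keystream(key, nonce, len(v2)))

# XOR of the two ciphertexts equals XOR of the two plaintexts:
# the shared keystream cancels out, revealing the positions (and the
# XOR difference) of every changed byte.
diff = xor(c1, c2)
changed = [i for i, b in enumerate(diff) if b != 0]
```

Any adversary who can observe both snapshots on the untrusted node learns `changed` without knowing the key.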
I don't really understand why encrypting everything - including metadata - should somehow be "easier" or simpler to implement. Maybe I'm misunderstanding you? What do you mean?
I could see how this would be useful. As you say, it would require some work because it's currently not a part of the design at all; an additional layer of encryption would be needed. There would obviously be some trade-offs between privacy and efficiency, i.e. if a block changes in the middle of a large file, do we resynchronize just that block and leak that fact, or re-encrypt the entire file, etc.
This idea is particularly useful for people who would use syncthing to set up a network of devices and require one of them to be available 24 hours a day. Let's say all the devices are trusted except for the always-on device, which is a third-party VPS server.
In this case, it would be desirable for syncthing to support some additional properties so that the VPS node has the following characteristics:
READ ONLY (can't change any data)
No doubt this adds complexity and performance hits to support the encryption, especially if this project eventually extends to devices that don't support hardware-based encryption, such as most current smartphones.
Tahoe-LAFS has this feature, and it would be awesome to have a more usable implementation (I find the tahoe-lafs WebAPI very painful and difficult to use).
It has the notion of "storage nodes" that hold chunks of distributed encrypted data. The default configuration is that any 3 chunks can restore the file out of a goal of 10 chunks on different storage nodes using erasure coding.
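The any-3-of-10 idea can be illustrated with a much simpler toy scheme. Below is a sketch of a 2-of-3 code using plain XOR parity; Tahoe-LAFS actually uses Reed-Solomon erasure coding, which generalizes this to arbitrary k-of-N, so this is only meant to show the principle:

```python
def make_shares(data: bytes):
    """Toy 2-of-3 erasure code: two data halves plus an XOR parity share.
    Any 2 of the 3 shares reconstruct the file. Tahoe-LAFS uses real
    Reed-Solomon codes (default: any 3 of 10 shares)."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")  # pad the second half to equal length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {0: a, 1: b, 2: parity}, len(data)

def reconstruct(shares: dict, orig_len: int) -> bytes:
    """Rebuild the original from any two of the three shares."""
    xor = lambda p, q: bytes(x ^ y for x, y in zip(p, q))
    a = shares[0] if 0 in shares else xor(shares[1], shares[2])
    b = shares[1] if 1 in shares else xor(shares[0], shares[2])
    return (a + b)[:orig_len]
```

Losing any single storage node leaves the file recoverable, at the cost of 1.5x storage overhead (real k-of-N codes let you tune that trade-off).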
It would be nice if syncthing could support the distributed fractional concept as well, but that sounds like a topic for another issue. It may be out of scope too, hopefully not :)
Tahoe-LAFS sounds pretty much exactly like what we want - what an incredible find, I hadn't heard of it. Thanks, @kylemanna :)
The way I see it, syncthing already has the functionality to keep directories in sync and perform upload/download operations between the nodes whenever something changes. So the feature we want might not be that far out of reach: whenever a file changes, then we have to call the API of tahoe-lafs and upload/download the file.
I agree that we should start with a configuration where files are simply replicated completely on all foreign nodes. Fractional distribution can be added later if this setup turns out to work well.
The solution would also work on both Windows and Linux, which is a huge plus! And we don't have to do any crypto storage of our own, which would most probably turn out to be a failure anyway, I presume. :)
Sooo... anyone see a problem with this approach yet, from a design perspective? @calmh, do you think syncthing's design is compatible with tahoe-lafs?
I played with Tahoe-lafs for a while and it doesn't really do what I want. The major deal breaker for me was that the storage nodes don't work behind NAT. Everything I could find suggested that I needed to do port forwarding and tunneling of some sort. I'd imagine that a significant portion of the user base for syncthing is behind a NAT.
This would indeed be a great feature, especially if it could be defined on a folder and/or file level as in BTSync. I'd argue adding this feature is part of the "BTSync-replacement" goal. This would add complexity for sure, but it would be great to have one Syncthing interface from which I can manage my synchronised shares with people who are supposed to have access as well as with locations which are not supposed to either have access or be able to see the files. As a VPS user, this would be great for me - and surely for a lot of others as well.
For me, this feature is the only thing keeping me from switching from BTSync to Syncthing.
How the encryption works with BTSync is described here in detail:
The use case for me: I can store data at a friend's home or on my family members' PCs and I don't have to worry about them accessing my data. Additionally, I can store data for them that I cannot access.
The more people who have my data, the faster downloading, uploading, and spreading of data becomes.
Syncing data is for me not only about availability; it has become data safekeeping as well.
Syncthing is open source. That's the point. That's why I don't want the
Syncing important files to untrusted locations is usually not a problem
On Wednesday, 28 May 2014 11:44:09, Jens Diemer wrote:
Quote: Can closed source projects ever offer security? Keyword: verifiability...
That's why I want to change.
Quote: Syncing important files to untrusted locations is usually not a problem when they are encrypted+signed.
I totally agree on that.
One issue with the clearskies proposal is that it only addresses encryption of file data, not metadata about the file like name length, file size, etc.
If you really don't trust the remote storage, this is not sufficient -- it's often possible to tell what's in a directory tree just by looking at the file sizes, for example. Encrypting file systems try to mitigate this somewhat by padding files up and so forth, but dealing with the remote security issues may be rather hard.
At minimum, you probably want to think about randomizing the file locations in the tree and padding the files. Better would be to break them up into blocks and scatter them around in fixed size storage chunks that the remote end doesn't know anything about.
To elaborate a little bit, you can't entirely eliminate the data leakage if you're storing in a completely untrusted location. For example, at an absolute minimum, someone who can watch your traffic can tell how much data you change (and thus need to sync) every hour/minute/etc.
But systems that just encrypt the file data (and hopefully the names) leak a lot more. For example, say I just stored the new Weird Al album in my share. Even encrypted, rounded up to 16 byte boundaries, the directory contains files of these sizes:
By track:          By size:
 1.  7151712        5497664
 2.  9123472        5822608
 3.  5822608        7032608
 4.  5497664        7151712
 5.  9032544        7159040
 6.  8931184        7856016
 7.  9947920        8931184
 8. 10858000        9032544
 9.  7159040        9123472
10.  7856016        9947920
11.  7032608       10858000
12. 21923472       21923472
Probably no other set of files will show this pattern. So it's pretty easy for an adversary with a database of these things (they exist) to tell that I have a Weird Al album there.
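The size-fingerprint attack sketched above is trivial to implement. A hedged illustration (the sizes are the ones listed; the database entry name is made up, and a real adversary database would hold many known releases):

```python
def fingerprint(sizes):
    """Sorted file sizes, rounded up to the 16-byte cipher block
    boundary. Encrypting file contents does not hide this pattern."""
    return sorted(-(-s // 16) * 16 for s in sizes)

# Sizes observed (encrypted) on the untrusted node, in upload order:
observed = [7151712, 9123472, 5822608, 5497664, 9032544, 8931184,
            9947920, 10858000, 7159040, 7856016, 7032608, 21923472]

# Hypothetical adversary database of known releases -> fingerprints:
known = {"suspected album": fingerprint(observed)}

matches = [title for title, fp in known.items()
           if fp == fingerprint(observed)]
```

Twelve sizes sorted together form an essentially unique signature, so a lookup like this identifies the directory's contents without touching the ciphertext.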
You might assume that the sort order of the files will be scrambled, but of course the tool probably uploaded them in order (so they can get it from the CTIME). Even if it didn't, the file sizes are nearly as good in sorted order. You might try to store the files in random locations in the directory tree (better), but that has the same CTIME problem.
If you really want to have much hope of a secure system here, you really want to avoid storing the data in files entirely. One simple way to think of this is to break all the data you want to sync into 4k blocks, and then have the untrusted side store a database of SHA256 hash -> encrypted 4k block. You do updates by sending new blocks, and then giving the remote store a manifest of which blocks are still needed (the data about file names and the map of blocks to files is itself stored in encrypted 4k blocks hidden in the data). The layout of the database is now mostly irrelevant, since the protocol just talks in terms of hashes and manifests.
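The hash -> encrypted-block database described above might look like this minimal sketch. Encryption itself is elided for brevity (in a real system each block would be encrypted and padded client-side before `put`, and the names here are invented for illustration):

```python
import hashlib

BLOCK_SIZE = 4096

class UntrustedStore:
    """The remote side: an opaque map of SHA-256 hash -> encrypted block.
    It never sees file names, file sizes, or directory layout; the
    metadata mapping blocks to files lives in encrypted blocks too."""
    def __init__(self):
        self._blocks = {}

    def put(self, blob: bytes) -> str:
        h = hashlib.sha256(blob).hexdigest()
        self._blocks[h] = blob  # identical blocks dedupe automatically
        return h

    def get(self, h: str) -> bytes:
        return self._blocks[h]

    def retain(self, manifest: set):
        """Apply a manifest of still-needed hashes; drop everything else."""
        self._blocks = {h: b for h, b in self._blocks.items()
                        if h in manifest}

def chunks(data: bytes):
    """Split data into fixed-size blocks (in a real system the last
    block would be padded so block sizes leak nothing)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
```

The protocol then speaks only in hashes and manifests, which is exactly why the storage layout stops mattering.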
You'll note that this is starting to look a lot like a filesystem in its own right. I think something like this is probably needed to have a reasonable level of security.
Well, the question certainly is what counts as "reasonable". There are file systems like EncFS and ecryptfs which expose the same problems that you mention here, but are still widely used - especially for cloud storage. If syncthing can do it just as well as these state-of-the-art systems, then that is a big leap forward!
Security is never absolute, but relative to a use case. Leaking the alphabetical order can easily be circumvented by shuffling the order in which files are uploaded - that is a good idea. Leaking the file sizes can lead to some exposure, but for most use cases leaking your musical preferences will not be the end of the world. Files with private data in them, however, would still benefit enormously from having just their names and contents encrypted, as in EncFS or ecryptfs.
Don't get me wrong: it is important to think about these issues. But we don't have to come up with a perfect solution that exposes absolutely nothing under any circumstances. If a perfect solution covering 100% of the use cases means that much work, then it should be fine to opt for a much less complicated option that fits 95% of use cases - at least for now, until a better option is available.
Perfectionism is the greatest enemy of open source progress. ;) As long as you inform your users of the security implications, e.g. what does not get protected and what does, it's completely legitimate.
I'm really interested in Syncthing, but client-side encryption is a major feature for me, as I want to sync my files to my dedicated server (which hosts several other services), and thus I don't want to risk having any sensitive files unencrypted on such a server.
I read this issue and saw that this feature is in progress. But do you know of any working setups usable as of today? For example, using an encfs or ecryptfs container which could be automatically mounted and unmounted before/after each synchronization, or something similar? (Just for basic file content encryption, while waiting for a better solution implemented directly in syncthing.)
I'd offer 40 EUR if I could replace tahoe-lafs with syncthing...
That's why it would be easier to not send anybody plaintext ever.
Which isn't too difficult if we use symmetric encryption for the data, and do the following:
1. Store only encrypted files on trusted AND untrusted nodes, give the key only to trusted nodes, and present the decrypted files via something like FUSE. If we don't want FUSE, the trusted devices have to store the encrypted AND the decrypted blocks, doubling the storage requirements.
If we only want to store plaintext on trusted nodes, it becomes more complex:
2. Never send unencrypted blocks over the wire to anybody. Trusted peers receive the key to decrypt the encrypted blocks via a secure channel once. (This could be PGP-encrypted email, SCP, someone walking from machine to machine with a thumbdrive, etc., depending on the threat model.) Trusted peers only store plaintext, but have to store the mapping between encrypted blocks and plaintext blocks. This means we need to save twice the number of hashes. The trusted nodes also have to decrypt/encrypt all incoming/outgoing blocks on the fly, which makes this more resource intensive.

The details of the implementation are certainly going to be interesting. As we should use something like AES-GCM, the question arises: how can we make sure that the nonce is the same on all machines AND never used twice? This could be done by using a pseudo-random number with 96 bits of entropy for each block, because it is highly unlikely that the same number will be generated twice. The nonce is then sent unencrypted along with the ciphertext. That's ok, because the nonce doesn't need to be secret. But it means that while a trusted node can decrypt a block, compare its hash to its list of hashes of decrypted blocks, and thus throw it away if it already exists, the block still has to be sent over the wire. It also means that untrusted nodes have to store files multiple times if they are added separately on out-of-sync nodes.
But it could be solved with deterministic encryption (the same input always creates the same output for one key). If the same plaintext always produces the same ciphertext, the untrusted nodes can compare the hashes of ciphertext blocks, so they don't store files multiple times if those were added on different trusted machines while offline. And trusted hosts can compare a list of hashes of encrypted blocks to their own list of hashes of encrypted blocks, which means they don't waste traffic on files they already have. (Note: I used the term deterministic encryption a bit misleadingly here. AES, for example, is deterministic, but is made non-deterministic by using different IVs/nonces.)
Intuitively one could think about something like this:
3. Nearly everything is the same as in 2. but instead of a completely random nonce we use the (first 96 bits of the) hash of the unencrypted block (plus a shared secret to protect against file confirmation and similar attacks) as the nonce.
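A sketch of option 3 above, deriving the nonce from a keyed hash of the plaintext block. The standard library has no AES-GCM, so a toy hash-based keystream stands in for the real cipher here; the only point being demonstrated is the deterministic property that enables deduplication:

```python
import hashlib
import hmac

def encrypt_block(key: bytes, nonce_secret: bytes, block: bytes) -> bytes:
    """Deterministic encryption: nonce = first 96 bits of
    HMAC(nonce_secret, plaintext). The same plaintext block always
    yields the same ciphertext, so untrusted nodes can dedupe blocks
    added independently on out-of-sync trusted machines.
    The keystream below is a toy stand-in for AES-GCM."""
    nonce = hmac.new(nonce_secret, block, hashlib.sha256).digest()[:12]
    ks, ctr = b"", 0
    while len(ks) < len(block):
        ks += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    # Ship the nonce alongside the ciphertext; it need not be secret.
    return nonce + bytes(p ^ k for p, k in zip(block, ks))
```

Keying the nonce derivation with a shared secret is what guards against the confirmation attacks mentioned above: an outsider cannot compute the nonce (or ciphertext) for a guessed plaintext.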
If anybody has more information on how this can be done securely, I'd be happy to hear about it.
So we are left with two options: deterministic asymmetric encryption and convergent encryption.
4. If we instead choose to use deterministic asymmetric encryption (like RSA without padding), we would have to create one shared key pair. This key pair would itself have to be distributed to all trusted devices (just as the key in option 2 has to be).
However, there don't seem to be any widely used encryption schemes using deterministic encryption. Apparently some interesting work was presented at the Crypto 2007 and Crypto 2008 conferences, but I haven't really looked into this yet.
Also I am not sure if there are any modern encryption schemes with elliptic curves which would allow for fast asymmetric encryption with small keys.
Any information on those topics would be appreciated.
5. In convergent encryption, files are encrypted with their own hash (more precisely: their hash plus a static shared salt) as the key.
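A minimal sketch of the convergent-encryption idea in option 5, again with a toy hash-based keystream instead of a real cipher (the key derivation is the part that matters):

```python
import hashlib

def convergent_encrypt(salt: bytes, block: bytes) -> bytes:
    """Convergent encryption: the key is derived from the content
    itself plus a shared salt, so identical plaintexts encrypt
    identically on every device without any nonce coordination.
    That enables cross-device dedup, at the cost of confirmation/
    watermarking attacks: anyone holding the salt and a candidate
    plaintext can check whether you store it.
    Toy keystream; a real system would use AES or similar."""
    key = hashlib.sha256(salt + block).digest()
    ks, ctr = b"", 0
    while len(ks) < len(block):
        ks += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(p ^ k for p, k in zip(block, ks))
```

Note there is no per-message randomness at all here, which is exactly what makes the scheme deterministic and dedup-friendly.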
It doesn't, I just wanted to point out that we have to be very careful which ciphers we use, because most are not resilient enough for a threat model where the adversary can see the ciphertext change over time. And if no node can be tricked into sending the cleartext, because they all just work with the ciphertext that is stored on disk, that's a big bonus. At least if we don't want to store the ciphertext AND the cleartext on all trusted devices, effectively doubling the storage needs.
FUSE is certainly not the solution to all problems, but it's one of the easier ways to allow users on trusted nodes transparent access to the encrypted data with low implementation effort from syncthing, no storage overhead and only computing overhead when the files are accessed or changed. And it supports nearly all platforms.
You're really worrying me with the focus on FUSE.
For syncthing developers, maybe - but try getting your grandma or even a mildly technologically proficient friend to set up encFS on Windows with Dokan, or try grabbing a few of your home files while you're at work or school and don't have admin access to your machine.
On platforms with low computing resources available like the Raspberry or Android devices, storage is often more available than computing.
Show me a FUSE filesystem running on:
BSD might not be a big deal for you but I doubt I'm the only one running this on an NAS. The latter two are pretty important to most users.
I don't think the tradeoffs here are nearly worth it. Dropping support for platforms, devices and use cases isn't a good decision for usability and adoption of the project. The current easy 'run a file' way Syncthing works is great and I think it should stay that way.
@generalmanager I understand very little about crypto and I am not the most clever man on the planet to start with, but my initial ideas were as follows.
Most of it echoes what you already said, but probably in more implementational terms.
First let's start with the fact that we have a constraint:
My basic ideas which might not be secure, but should however make getting plaintext data harder:
Plan A (allows reusing blocks, leaks info about two identical blocks/files):
Plan B (prevents reusing blocks, does not leak info about two identical blocks, 1-3 same as before):
Plan C (a big overhaul):
Ideally I'd like a read-only secret and a read-write secret, but a block cipher with that property does not seem to exist. Plus, the constraint we have means we can probably only use ECB without a major protocol rewrite and going to plan C.
Just wondering: is the possibility of using dedicated server-side software, i.e. a version of syncthing specific to servers, completely out of the question?
By "insecure server" I assume you don't mean a server where you lack the privileges to install and run software and can only store files, but simply a server that might be accessed by others without your consent.
Great discussion here... really interesting food for thought...
As we've discussed, any 'crypto' is only as good as the weakest link. For example, regardless of any implementation (FUSE, kernel drivers, userspace encryption, whatever you choose)... If the 'master' host (for example a Windows machine) is infested with Malware/Keyloggers etc, then the whole point is moot because even a perfect implementation is compromised.
Same premise holds if our hardware is compromised / backdoored. There is a reason DARPA started X-ray'ing ASICs and FPGAs sourced from Asia, to ensure there were no hardware backdoors.
"Almost all FPGAs are now made at foundries outside the United States, about 80 percent of them in Taiwan. Defense contractors have no good way of guaranteeing that these economical chips haven't been tampered with. "
How are we to be sure that the BIOS chips we use (or TPM, etc.) are not backdoored? We can't. We can either accept or reject the premise, but that's about all we can do.
And so any security will only be as good as its assumptions.
Then we have the ironic fact that most crypto is never broken via the cipher itself, but rather through implementation goof-ups or, more insidiously, side-channel attacks.
I think the best example of this is the 'padding oracle' attack, which is theoretically possible against any block cipher operating in CBC mode. In practice this resulted in the Lucky13 attack against SSL/TLS implementations that were thought to have been fixed!
"Any person can invent a security system so clever that she or he can't think of how to break it."
Even the best we can build is only as strong as the weakest link.
There is also the balance between ease-of-use / ease-of-installation vs. security, as well as the issue of 'scope creep' / reinventing the wheel vs. the fact that a P2P block exchange may need its own solution to be optimal.
As for encFS, I've got my own issues with it... It's not suitable for cloud storage beyond a 'single snapshot' model. The main issues, I think, are that (1) any 'fix' of encFS would necessarily break all backward compatibility with previous versions, and (2) it's now more than 10 years old, so it would take significant work to bring it up to the state of the art.
I think what you have outlined is a fair approach, in terms of only approaching the problem of securing data 'in transit'. This lets the user manage their own solution in terms of local crypto, whether encFS, dm-crypt, Truecrypt, or otherwise. We do run into the problem that perhaps these 'user determined' crypto solutions are not ideal for a P2P network which exchanges small block-like chunks.
But if the user implements the same crypto globally, then Syncthing does not really care what it's transporting, which is a nice abstraction that simplifies life for anyone contributing code.
I agree that the particular enhancement detailed in this GitHub issue (baking in crypto for storage on untrusted nodes) opens up a whole can of worms and is a pretty good example of 'scope creep'. So I think we're in agreement that this probably isn't feasible given the constraints, certainly not before nailing down existing issues and closing them out.
Beyond that, as I'm sure you are aware, selective encryption on a per-node basis opens up a new set of problems regarding the synchronization of encrypted vs. unencrypted blocks... not to mention problems of key management, access permissions, key revocation, various chosen-ciphertext attacks, block vs. stream ciphers, selection of appropriate IV generation, operating modes, and so forth.
On and on... Ideally, my feeling at the moment is either (1) let the user handle their crypto problem, or (2) take a long-term view and really implement a novel solution based on proven cutting edge techniques in cryptography -- specifically that relating to cloud storage, authenticated encryption ,and so forth.
In the latter case, it would be ideal to have an 'off-the-shelf' solution that could be dropped in... The only reason I think to DIY is if there were a clever way around some of the limitations... Certainly the current state-of-the-art in crypto literature is not focusing on creating secure P2P applications.
Thanks for clarifying my points -- after our conversations , I think you have done a good job of summarizing my thoughts on the issue, which have changed somewhat as I have done additional research.
"IF they get fixed upstream, this could be a good way to go, because EncFS has the best shot at being a usable cross-platform solution for encrypted storage."
This was my original thinking, but as I've delved deeper, I simply don't think EncFS is feasible as a solution UNLESS it's released as a "2.0" version -- which completely dispenses with legacy code... I do like the 'filesystem' level encryption which is convenient, but I think gains in convenience are a tradeoff with security.
Here is an example of one of the problems I just ran into today using EncFS on Linux... I tried using EncFS coupled with the google-drive-ocamlfuse module for mounting Google Drive as a 'shared' network drive... Now I had no problems with the encFS FUSE driver , but it's more of an integration headache from end-to-end... Note that this commentary is independent of Syncthing, and is simply looking at using encFS locally to mirror content to an encrypted folder on Google Drive (with google drive mounted in linux as a sort of 'network share' showing up as a local dir).
The idea here is to drop large amounts of unencrypted files into a mountpoint, have them transparently encrypted and uploaded to Google Drive without wasting tons of local storage making copies of everything.
Some major problems right out of the gate... The Google Drive FUSE driver is a pain to compile since it's written in OCaml. Even after I got it working, I was astonished to find there is not really decent built-in support for caching or buffering. (This highlights a major advantage of something like Syncthing, which breaks such transport into manageable blocks.)
Anyway, I found that if I copy a large (4 GB) file to my encFS directory (linked to the Google Drive remote network mount), a 'cp' of some large files to the encFS plaintext input dir simply 'hangs' while (1) encFS and the Google Drive OCaml driver encrypt the file and metadata and write to the Google Drive network mount point, and (2) we are subject to network-level failures (wireless AP disconnects, etc.). It's slow as heck, but it works as long as there is no transport failure.
But a wireless AP disconnect during copying in this setup will easily corrupt or ruin large gigabyte-sized files.
Given these issues on a native client on Linux -- there's a whole set of problems of the optimum between a 100% local mirror of content (which requires extensive use of disk space locally) vs a 100% 'network attached drive with minimal caching (which means that network failures of large files 'in transit' can cause data loss and corruption).
This is sort of a microcosm of similar issues which apply to Syncthing, albeit to a lesser extent since it's not 'all or nothing' in the transport sense. But I do have concerns with the lack of multithreaded / asynchronous performance of both encFS and the Google Drive oCaml driver. The performance of encFS-FUSE was quite poor in terms of handling 'below the surface' transport failures.
In plain English, if I were to disconnect / reconnect to the wireless access point while in the middle of copying data to the encFS plaintext dir , which is then writing to a 'network attached' Google Drive ... Network connection loss would corrupt the file in the process of being copied.
Regarding encFS vs. BoxCryptor: after the release of BoxCryptor Classic (which is hypothetically compatible with encFS), BoxCryptor ditched all compatibility with encFS for the 2.x series. So new versions of BoxCryptor are completely incompatible with encFS.
"Just for good measure I wanted to point at https://github.com/jasonmccampbell/GoSodium, which is an (not completely finished) GO wrapper for libsodium"
NaCl and the related wrappers are awesome. Good point. If I personally ever had to implement crypto that was 'roll your own', I'd absolutely say 100% the way to go is with NaCl or related libraries and wrappers. They are a fantastic compromise between using the bloated (but tested) OpenSSL libraries and rolling a custom solution (which could easily introduce major vulnerabilities, which NaCl could help prevent).
"As the article by Thomas Ptacek, which we both referred to, as well as the EncFS audit make very clear there are plenty attacks left, especially if an attacker has knowledge about how the encrypted data changes over time."
This specifically is a huge concern for me, and is why I don't think encFS is feasible for any sort of remote storage which 'incrementally' changes on any regular basis. encFS might be fine for a 'one time' backup, or twice a year backup, but anything that gives a remote node insight into filesystem changes over time is going to be a major problem in ANY modern cryptosystem...
It's far worse with 10 year old technology like encFS.
So I agree there -- to me the limitation (more than FUSE, driver compatibility, trust models, or portability) is actually the fact that encFS is not suitable for cloud backup storage on 'untrusted' nodes. Not without some major updates anyway.
In fact, I think there are very few solutions suitable for this purpose besides more advanced technologies like convergent encryption, authenticated encryption, and so forth.
"Also I don't see how de/encrypting at the block level via fuse suddenly solves the "know how data changes over time" problem." -Audrius
You're right -- it doesn't. This is a major problem that keeps cropping up as I think about how to set up a good P2P network backup / sync system.
There's a critical issue that's become apparent with additional research: how ciphertext data changes over time may allow an adversary to break the entire system. I'm not sure any current cryptosystem is equipped to resist these sorts of attacks (where a potential attacker gets N snapshots of a ciphertext filesystem as it changes over time, where N could be quite large).
I don't know what the solution is for this issue. Even using a P2P sync tool with a Truecrypt file container could be problematic. I suppose it's a matter of balancing the threat vs the countermeasures. Usability vs Security, etc.
I understand the objection regarding the complexity of setting up Dokan on Windows -- any bundled solution needs to work out of the box. You would probably be okay if there was an installer that 'worked' regardless of the underlying solution, right?
I think what we are discussing here is not so much FUSE vs. not-FUSE... it's more questions as to (1) should we bother with this aspect of baking in filesystem encryption (probably not, hah), (2) are there good already-written solutions available off-the-shelf?, (3) what is the state of the art in terms of P2P encryption, (4) benefits vs. drawbacks of various solutions, etc.
The whole FUSE vs non-FUSE debate is really not the core of the issue, because most solutions developed in FUSE could be ported out of FUSE with enough time and effort. FUSE is simply conducive to rapid prototyping and testing, as it's just an abstraction layer on top of the common system calls for file/folder interaction.
Your post on deterministic asymmetric encryption vs convergent encryption is a really good overview. I'll check out some of your links and get back to you.
"This gives us the very interesting property, that it is possible to determine if two encrypted blocks are identical without decrypting them first."
This is a huge benefit for the idea of a distributed p2p system, where we are exchanging data on the 'block' level of arbitrary size (say 1k to 1024k). Ideally, if you and I are on the same network, and we both have a copy of the same movie (perhaps encrypted with different IVs or what-have-you) -- can we mutually share the blocks to accelerate synchronization?
And if that's possible, what do we lose by doing so?
"That's what Tahoe-LAFS and Maidsafe use and seems to be one of the best ways to do this kind of thing."
I agree, based on my somewhat limited knowledge of convergent encryption. But this allows 'watermarking' attacks, no? I.e., some omniscient entity can prove that you and I both possess a copy of 'TransformersTheMovie.avi' or whatever, even if they cannot decrypt the movie from any of our encrypted shares?
Certainly an open-source solution vulnerable to watermarking is preferable to a binary-only client vulnerable to watermarking (i.e. Syncthing vs. BTSync)... Syncthing would be far superior in this case, since at least we can be sure it's not backdoored.
"This means an attacker could prove that you stored a forbidden book/pirated movie/mp3 in your encrypted files"
Okay, that's what I thought. I wish there were a way around this. Perhaps there is? For example, utilizing a 'keyed hash function' to calculate the block hashes for a file we're sharing, where the input to the keyed hash function is related to a shared secret or the result of a key agreement?
I guess anything involves trade-offs.
I think what you have proposed in the latest message is on the right track.
My suggestion is that any implementation use a construct of 'authenticated encryption', where encryption and HMAC-style integrity are combined into a single primitive. The new ChaCha20-Poly1305 TLS cipher suites have this 'baked in' -- whether at the transport level or otherwise.
As you've suggested, AES-GCM as a block cipher mode is another example of authenticated encryption, though not my personal favorite... It is unpatented and already included in OpenSSL, which is nice.
Personally, for block cipher modes, I like OCB, but unfortunately that one is patented. It is free to use for open-source, non-commercial purposes, though.
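The combination being described (encryption plus a MAC fused into one primitive) can be approximated generically as encrypt-then-MAC. A stdlib-only sketch, with a toy HMAC-based XOR stream standing in for a real cipher (all names assumed; this is an illustration of the composition, not a production construction):

```python
import hashlib
import hmac

def _keystream(key: bytes, n: int) -> bytes:
    # Toy PRF keystream: HMAC(key, counter) blocks, truncated to n bytes.
    out, ctr = bytearray(), 0
    while len(out) < n:
        out += hmac.new(key, ctr.to_bytes(8, "big"), hashlib.sha256).digest()
        ctr += 1
    return bytes(out[:n])

def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    # Encrypt-then-MAC: ciphertext followed by an HMAC tag over the ciphertext.
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(enc_key, len(plaintext))))
    tag = hmac.new(mac_key, ct, hashlib.sha256).digest()
    return ct + tag

def open_(enc_key: bytes, mac_key: bytes, sealed: bytes) -> bytes:
    ct, tag = sealed[:-32], sealed[-32:]
    # Verify the tag before decrypting; reject any tampered ciphertext.
    if not hmac.compare_digest(hmac.new(mac_key, ct, hashlib.sha256).digest(), tag):
        raise ValueError("authentication failed")
    return bytes(a ^ b for a, b in zip(ct, _keystream(enc_key, len(ct))))
```

The point of the composition is that an untrusted storage node cannot flip bits in the ciphertext without the tag check failing on the trusted side.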
But regardless, I think we have numerous opposing forces here...
(1) Utilization of 'baked in' crypto vs 'Let the user run TrueCrypt'
(2) Level of Effort and Scope-Creep vs. Broad Spectrum of Applications
(3) Prevention of watermarking attacks etc vs Block P2P inter-operability
(4) FUSE type drivers vs. 'Works out of the box'
(5) Low-level (block/loopback device) vs. High-level (VFS or file/folder encryption)
Great discussion, best ideas I've seen in a long time. Don't want to get too sidetracked from any short term goals, but I think the last few pages of comments really get to the core of the issues regarding client-side and remote-side encryption.
Also, I do like the idea of a non-hardwired block size, but I agree it'd be a major overhaul.
Maybe a solution is to do key agreement on a separate shared secret K_dht -- call it a 'session DHT key' or something that's agreed upon using some decent DH primitive.
For any given file, the file's hash is Fh = HMAC(K_dht, filedata). The block hash for the block at index idx within that file is HMAC(K_dht, Fh + idx + blockdata).
Then the only people who can derive the DHT keys are those with the shared secret. Something like that, just as a first idea, anyway.
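To make the sketch concrete, here is a minimal stdlib-only Python version of that keyed-hash idea. The names K_dht, Fh, and idx come from the comment above; everything else (SHA-256, the 8-byte index encoding) is an assumption:

```python
import hashlib
import hmac

def file_hash(k_dht: bytes, filedata: bytes) -> bytes:
    # Fh = HMAC(K_dht, filedata): only holders of the shared secret can compute it.
    return hmac.new(k_dht, filedata, hashlib.sha256).digest()

def block_hash(k_dht: bytes, fh: bytes, idx: int, blockdata: bytes) -> bytes:
    # HMAC(K_dht, Fh + idx + blockdata), as sketched above (8-byte index assumed).
    msg = fh + idx.to_bytes(8, "big") + blockdata
    return hmac.new(k_dht, msg, hashlib.sha256).digest()
```

A peer without K_dht cannot recompute these hashes from a known plaintext, which is exactly what blocks the watermarking-style confirmation attack discussed earlier.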
Perhaps combining HMACs, Authenticated Encryption, and the idea of 'tweakable ciphers' (like XEX mode, the basis for XTS) can allow a balance between block-level sharing, untrusted storage endpoints, and resistance to watermarking.
Okay, so there's a way around the 'watermarking' problem of convergent encryption.
Convergent Encryption (standard / vulnerable): key = Hash(plaintext); ciphertext = Encrypt(key, plaintext); locator = Hash(ciphertext)
Convergent Encryption (keyed / resistant): key = HMAC(Skey, plaintext); ciphertext = Encrypt(key, plaintext); locator = Hash(ciphertext)
In the latter example, only those who know Skey can conduct 'proof-of-file' and related attacks; thus Skey is shared among all nodes in a cluster that are sharing files.
Using the latter example with AES-CTR mode and a public per-file random IV, we actually get completely random access to file blocks.
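As an illustration of that random-access property, here is a hedged stdlib-only sketch: the per-file key is HMAC(Skey, plaintext) as in the keyed example, but AES-CTR is stood in for by an HMAC-SHA256 keystream (a toy PRF used only so the example runs without external crypto libraries; real code should use AES-CTR from a vetted library):

```python
import hashlib
import hmac

BLOCK = 32  # keystream chunk size (SHA-256 output length)

def derive_key(skey: bytes, plaintext: bytes) -> bytes:
    # Keyed convergent encryption: per-file key = HMAC(Skey, plaintext).
    return hmac.new(skey, plaintext, hashlib.sha256).digest()

def xor_block(key: bytes, iv: bytes, idx: int, data: bytes) -> bytes:
    # CTR-style: keystream block = PRF(key, IV || counter), XORed with the data.
    ks = hmac.new(key, iv + idx.to_bytes(8, "big"), hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(data, ks))

def encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    return b"".join(
        xor_block(key, iv, i // BLOCK, plaintext[i:i + BLOCK])
        for i in range(0, len(plaintext), BLOCK)
    )

# Decryption is the same XOR, so any single block can be decrypted in
# isolation: xor_block(key, iv, idx, ciphertext_block) == plaintext_block.
```

Since the counter depends only on the block index, a node can fetch and decrypt (or re-verify) one block of a large file without touching the rest, which is what makes block-level syncing of encrypted data workable.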
The general way such algorithms work is as follows:
The object to be encrypted is validated to ensure it is suitable for this type of encryption. This generally means, at a minimum, that the file is sufficiently long. (There is no point in encrypting, say, 3 bytes this way. Someone could trivially encrypt every 3-byte combination to create a reversing table.)
Some kind of hash of the decrypted data is created. Usually a specialized function just for this purpose is used, not a generic one like SHA-1. (For example, HMAC-SHA1 can be used with a specially-selected HMAC key not used for any other purpose.)
This hash is called the 'key'. The data is encrypted with the key (using any symmetric encryption function such as AES-CBC).
The encrypted data is then hashed (a standard hash function can be used for this purpose). This hash is called the 'locator'.
The client sends the locator to the server to store the data. If the server already has the data, it can increment the reference count if desired. If the server does not, the client uploads it. The client need not send the key to the server. (The server can validate the locator without knowing the key simply by checking the hash of the encrypted data.)
A client who needs access to this data stores the key and the locator. They send the locator to the server so the server can look up the data for them, then they decrypt it with the key. This function is 100% deterministic, so any clients encrypting the same data will generate the same key, locator, and encrypted data.
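The numbered steps above can be sketched end-to-end in Python (stdlib only; the XOR 'cipher' is a stand-in for AES-CBC, and the dedicated HMAC key value is made up for illustration):

```python
import hashlib
import hmac

HASH_KEY = b"convergent-hash-key"  # dedicated key for step 2 (assumed value)

def make_key(plaintext: bytes) -> bytes:
    # Steps 2-3: a keyed hash of the plaintext becomes the encryption key.
    return hmac.new(HASH_KEY, plaintext, hashlib.sha256).digest()

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Stand-in for AES-CBC; deterministic, so identical inputs converge.
    out = bytearray()
    for i in range(0, len(plaintext), 32):
        ks = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        out += bytes(a ^ b for a, b in zip(plaintext[i:i + 32], ks))
    return bytes(out)

def make_locator(ciphertext: bytes) -> str:
    # Step 4: the locator is a plain hash of the ciphertext.
    return hashlib.sha256(ciphertext).hexdigest()

class Server:
    """Step 5: stores ciphertext by locator and deduplicates repeat uploads."""
    def __init__(self):
        self.store, self.refs = {}, {}

    def upload(self, locator: str, ciphertext: bytes) -> str:
        # The server validates the locator without ever seeing the key.
        assert make_locator(ciphertext) == locator
        self.refs[locator] = self.refs.get(locator, 0) + 1
        if locator in self.store:
            return "deduplicated"
        self.store[locator] = ciphertext
        return "stored"
```

Because every step is deterministic (step 6), two clients encrypting the same bytes produce the same key, ciphertext, and locator, so the server only ever stores one copy.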
(A) Yes, skey is a shared secret.
In other words, if you and I both have a copy of 'Transformers2.avi', and we have the same shared secret, then if I request to upload the movie to you... the P2P network (aka you) will say don't worry about it, we already have a full copy of that file matching this hash ("locator"). BUT the network can only know that the Transformers2.avi file already exists on your computer by comparing the hash of a deterministically generated ciphertext.
File_Locator = SHA1(Transformers2.avi.aes)
I think that when skey is a shared secret across all users sharing files in a 'cluster'... that is, in the latter example above, skey just acts as a 'tweak' to prevent "confirmation that you or I have a copy of Transformers2.avi"-type attacks, unless the attacker knows skey.
The idea is basically that you can request a file from the server (or peer) by asking for its file locator (the hash of the mostly-deterministically encrypted ciphertext), and see whether the file is already stored or not. This principle can apply to fixed- or variable-length 'chunks' as well, supposedly.
The two examples above are functionally identical; the latter is just more secure, and probably more appropriate for this discussion, where we are not sharing 'with the world' on BitTorrent but rather on small private P2P networks.
The main point of convergent encryption is that it allows 'de-duplication' -- meaning that if we've already stored one copy of a file, then the server is smart enough not to store a second copy of the file -- but rather it stores just a reference to it in some sort of metadata / mapping table.
I know this particular scheme (convergent encryption) is utilized in Tahoe-LAFS and Bitcasa... It may be utilized in BitSync as well, though I'm unclear on that.
I'm still trying to understand it fully, so my apologies if my explanation is not very good.
Check out the pdf linked below at the end -- it actually discusses convergent encryption in the context of fixed and variable length 'chunks' of a plaintext file.
Here are two resources that helped me so far:
(1) Secure Data Deduplication
(2) Source code for Convergent Encryption in Python (includes both example 1 and example 2)
So you can't verify that someone has something unless you have skey, because you need it to generate the locator that verifies whether someone has a given file.
If you managed to get a locator from someone for the file you want to verify against, the fact that the content was encrypted using a hash over the plaintext is meaningless, as you already have the locator.
If you have skey, all bets are off, and there is no encryption left at that point.
The only sensible reason I can come up with is this: given the plaintext and ciphertext of some other file, it's not possible to recover skey, because the ciphertext is encrypted with HMAC(skey, plaintext) rather than skey itself. So the only thing you can recover is HMAC(skey, plaintext), which is not good enough to decrypt a different file.
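That property is easy to demonstrate: per-file keys derived as HMAC(skey, plaintext) are unlinkable, so learning one file's key (say, by having both its plaintext and ciphertext) reveals nothing usable against another file. A tiny stdlib sketch (function name assumed):

```python
import hashlib
import hmac

def per_file_key(skey: bytes, plaintext: bytes) -> bytes:
    # The file is encrypted under HMAC(skey, plaintext), never under skey itself.
    return hmac.new(skey, plaintext, hashlib.sha256).digest()
```

Even with per_file_key(skey, file_a) in hand, an attacker cannot derive per_file_key(skey, file_b) without skey, by the PRF property of HMAC.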
If I need to "sell" this feature to non-tech users, I'd say:
"... allows you to setup an 'attic' you can share with your family and friends, where your boxes are sealed and can only be opened by you if needed. It's not a replacement for a data-bunker or data-safe to store your most precious data. Sometimes an attic is what you need, though"
I'm thinking right now it would be cool to have something generic for this task. Sort of a reverse ecryptfs. Instead of an encrypted version on disk and a virtual unencrypted folder like ecryptfs you'd have the files unencrypted on disk and there is a folder which shows them only in encrypted form. Then you could use that folder and sync the encrypted version to other nodes with syncthing. I guess it should not be too hard to implement based on ecryptfs, either, as all the pieces are already there, they just need to be plugged in a different order...
And someone already had the idea: https://bugs.launchpad.net/ecryptfs/+bug/376580
Of course you could also simply encrypt everything with ecryptfs and then sync the encrypted version, which is probably even safer. And even if encfs is not cloud-safe, it is arguably still a lot safer than btsync.
@djtm ecryptfs is a slow and buggy piece of sht. 2 or 3 years ago I tried to do the same trick with encfs --reverse for dropbox-like services and just lost my time - the resulting "frankenstein" was too unstable and slow.
@bademux I'm using the ecryptfs that ships with Ubuntu for my home directory. I have never had any issues of any kind. The only thing is that it's a bit difficult to mount an ecryptfs directory from a bootable Linux distribution. I'm not as worried about speed as about security: the whole point of encryption is that it's reliable. encfs currently allows various attacks which are especially problematic in the cloud. However, I believe it might be better to fix encfs, which has undergone security reviews by cryptography experts, than to invent new encryption here, which will most likely end up being either a ton of work or mostly snake oil. As much as I'd love to see this feature implemented...
ext4 encryption will certainly not help, as the encrypted files will be invisible to syncthing.
As per the previous comments on this bug/enhancement request, @djtm, ecryptfs/encfs isn't good enough for something that is sent over wires you don't own and can't control, as changes within the structure of your database/filesystem can reveal information. Also keep in mind that the solution has to be OS-independent and, in particular, work with Windows (OS X could be easier to work with, from what I know and understand), as well as being something (relatively) simple and easy to do that doesn't go beyond the scope of Syncthing.