Support for file encryption (e.g. non-trusted servers) #109

Open
Natanji opened this Issue Apr 4, 2014 · 136 comments
@Natanji

So I have had a look at BitTorrent Sync, syncthing and alternatives, and what I always wondered about was the possibility of syncing not only between resources I own and trust, but also to external resources/servers which I do NOT trust with my data, at least up to a certain extent.

One way to do this is using ecryptfs or encfs, but this has many obvious downsides: it is not an interoperable solution (it only works on Linux), the files are actually stored in encrypted form on the disk (even if the resource is trusted and this is not necessary, for instance because the file system is already encrypted), etc.

What I propose is somehow configuring nodes which are only sent the files in an encrypted format, with all file contents (and potentially file/directory names as well; or even permissions) being encrypted. This way, if I want to store my private files on a fast server in a datacenter to access them from anywhere, I could do this with syncthing without essentially giving up ownership of those files. I could also prevent that particular sync node from being allowed/able to make any changes to the files without me noticing.

I realize that this requires a LOT of additional effort, but it would be a killer feature that seems to not be available in any other "private cloud" solution so far. What are your thoughts on this feature?

EDIT: BitTorrent sync mentions a feature like this in their API docs: "Encryption secret
API users can generate folder secrets with encrypted peer support. Encryption secrets are read-only. They make Sync data encrypted on the receiver’s side. Recipients can sync files, but they can’t see file content, and they can’t modify the files. Encryption secrets come in handy if you need to sync to an untrusted location." (from http://www.bittorrent.com/intl/de/sync/developers/api)

@jewel

This would be amazing. I tried to spec out what this might look like in this clearskies extension, but it adds so much complexity that I've tabled plans for it for now.

Like you say, if only the file contents are synchronized to the "untrusted" peers, that would be a lot simpler to implement (i.e. the metadata never hits the untrusted peer in any form). I hadn't thought of that.

@Natanji

It seems like you even thought of a zero-knowledge-proof to show that the server is legitimate/actually stores the files (did I understand that correctly?). Not bad.

CTR mode sounds like an extremely bad choice to me, as do other stream-cipher-style modes like GCM. Yes, it is seekable and that is useful, but XORing two encrypted snapshots of the same file on top of each other will tell an adversary exactly what changed between the two plaintexts. CBC is a much better choice: when seeking, you may need two blocks of ciphertext to decrypt the first block of plaintext, but that is usually negligible because you will read more than one block anyway, and the more you decrypt the less overhead you get.
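To illustrate the seek property: in CBC, ciphertext block i-1 serves as the IV for block i, so random access costs at most one extra block read. A minimal Go sketch (key management and padding omitted):

    package cbcseek

    import (
        "crypto/aes"
        "crypto/cipher"
    )

    // decryptBlockAt decrypts the single 16-byte CBC block at index i without
    // touching the rest of the file; block 0 is chained off the file's IV,
    // every later block off its predecessor's ciphertext.
    func decryptBlockAt(key, fileIV, ciphertext []byte, i int) ([]byte, error) {
        blk, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        iv := fileIV
        if i > 0 {
            iv = ciphertext[(i-1)*aes.BlockSize : i*aes.BlockSize]
        }
        out := make([]byte, aes.BlockSize)
        cipher.NewCBCDecrypter(blk, iv).CryptBlocks(out, ciphertext[i*aes.BlockSize:(i+1)*aes.BlockSize])
        return out, nil
    }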

I don't really understand why encrypting everything - including metadata - should somehow be "easier" or simpler to implement. Maybe I'm misunderstanding you? What do you mean?

@jewel

I think you might have misunderstood, I was trying to say that it'd be simpler to implement if the metadata isn't synced.

Thanks for the feedback on CTR mode, I wasn't aware that seeking was possible with CBC mode.

@calmh
The Syncthing Project member

I could see how this would be useful. As you say, it would require some work because it's currently not a part of the design at all - an additional layer of encryption would be needed. There would obviously be some trade-offs between privacy and efficiency, i.e. if a block changes in the middle of a large file, do we resynchronize just that block and leak that fact, or re-encrypt the entire file, etc.

@calmh calmh added the far-future label Apr 5, 2014
@calmh
The Syncthing Project member

Also slightly related to #62, which is similar functionality minus the encryption (i.e. for when we trust the node with the data, just not with modifying it).

@NickPyz

This idea is particularly useful for people who would use syncthing to set up a network of devices and require 1 of them to be available 24 hours a day. Let's say all the devices are trusted except for the always-on device, which is a 3rd-party VPS server.

In this case, it would be desirable to build some additional properties into syncthing so that the VPS node has the following characteristics:

READ ONLY (can't change any data)
ENCRYPTED (so the VPS personnel can't see the data).

No doubt this adds complexity and a performance hit to support the encryption, especially if this project eventually extends to devices that don't support hardware-based encryption, such as most current smartphones.

@kylemanna

Tahoe-LAFS has this feature, and it would be awesome to have a more usable implementation (I find the Tahoe-LAFS WebAPI very painful and difficult to use).

https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/architecture.rst#security

It has the notion of "storage nodes" that hold chunks of distributed encrypted data. The default configuration uses erasure coding so that any 3 chunks, out of a goal of 10 chunks on different storage nodes, can restore the file.

It would be nice if syncthing could support the distributed fractional concept as well, but that sounds like a topic for another issue. It may be out of scope too, hopefully not :)
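For reference, the any-3-of-10 idea maps onto Reed-Solomon erasure coding; a minimal Go sketch, assuming the third-party github.com/klauspost/reedsolomon package (not something syncthing uses today):

    package erasure

    import "github.com/klauspost/reedsolomon"

    // encode splits data into 3 data shards plus 7 parity shards; any 3 of
    // the resulting 10 shards suffice to reconstruct the original, matching
    // Tahoe-LAFS's default 3-of-10 configuration.
    func encode(data []byte) ([][]byte, error) {
        enc, err := reedsolomon.New(3, 7) // 3 data + 7 parity = 10 shards
        if err != nil {
            return nil, err
        }
        shards, err := enc.Split(data)
        if err != nil {
            return nil, err
        }
        if err := enc.Encode(shards); err != nil {
            return nil, err
        }
        return shards, nil // hand one shard to each storage node
    }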

@Natanji

Tahoe-LAFS sounds pretty much exactly like what we want - what an incredible find, I hadn't heard of it. Thanks, @kylemanna :)

The way I see it, syncthing already has the functionality to keep directories in sync and perform upload/download operations between the nodes whenever something changes. So the feature we want might not be that far out of reach: whenever a file changes, we just have to call the tahoe-lafs API and upload/download the file.

I agree that we should start with a configuration where files are simply replicated completely on all foreign nodes. Fractional distribution can be added later if this setup turns out to work well.

The solution would also work on both Windows and Linux, which is a huge plus! And we don't have to do any crypto storage of our own, which would most probably turn out to be a failure anyway, I presume. :)

Sooo... anyone see a problem with this approach yet, from a design perspective? @calmh, do you think syncthing's design is compatible with tahoe-lafs?

@calmh
The Syncthing Project member
@kylemanna

I played with Tahoe-lafs for a while and it doesn't really do what I want. The major deal breaker for me was that the storage nodes don't work behind NAT. Everything I could find suggested that I needed to do port forwarding and tunneling of some sort. I'd imagine that a significant portion of the user base for syncthing is behind a NAT.

@mcg

These days, without some form of encrypted/untrusted node support, Syncthing is probably going to be unusable for some portion of users. One of the reasons I chose BT Sync over other solutions was its support for this.

@elimisteve

This feature would be great because it'd allow me to replace the rather bulky and Mono dependency-laden SparkleShare with syncthing, which is much easier to set up :-).

SparkleShare has been working well for me, though.

@EvdH0

This would indeed be a great feature!

@menelic

This would indeed be a great feature, especially if it could be defined on a folder and/or file level as in BTSync. I'd argue adding this feature is part of the "BTSync-replacement" goal. This would add complexity for sure, but it would be great to have one Syncthing interface from which I can manage my synchronised shares, both with people who are supposed to have access and with locations which are not supposed to have access or be able to see the files. As a VPS user, this would be great for me - and surely for a lot of others as well.

@bigbear2nd

For me, this feature is the only thing keeping me from switching from BTSync to Syncthing.

How the encryption works with BTSync is described here in detail:
http://forum.bittorrent.com/topic/25823-generate-encrypted-read-only-secret-without-api-key/

The use case for me: I can store data at a friend's home or on my family members' PCs, and I don't have to worry about them being able to access my data. Additionally, I can store data for them that I cannot access.

The more people who have my data, the faster my downloads, uploads and data propagation are.
Additionally, the safer my data is.

For me, syncing data is not only about having it available; it has become data safekeeping as well.

@jedie

Can closed source projects ever offer security? Keyword: verifiability...

And should I really sync important files to an untrusted location?

Just my 2¢...

@Natanji
@bigbear2nd

Quote: Can closed source projects ever offer security? Keyword: verifiability...

That's why I want to change.

Quote: Syncing important files to untrusted locations is usually not a problem when they are encrypted+signed.

I totally agree on that.
But for me, I would say that my family members' and friends' computers are kind of "half-trusted" locations.

@nadalle

One issue with the clearskies proposal is that it only addresses encryption of file data, not metadata about the file like name length, file size, etc.

If you really don't trust the remote storage, this is not sufficient -- it's often possible to tell what's in a directory tree just by looking at the file sizes, for example. Encrypting file systems try to mitigate this somewhat by padding files up and so forth, but dealing with the remote security issues may be rather hard.

At minimum, you probably want to think about randomizing the file locations in the tree and padding the files. Better would be to break them up into blocks and scatter them around in fixed size storage chunks that the remote end doesn't know anything about.

@nadalle

To elaborate a little bit, you can't entirely eliminate the data leakage if you're storing in a completely untrusted location. For example, at an absolute minimum, someone who can watch your traffic can tell how much data you change (and thus need to sync) every hour/minute/etc.

But systems that just encrypt the file data (and hopefully the names) leak a lot more. For example, say I just stored the new Weird Al album in my share. Even encrypted, with sizes rounded up to 16-byte boundaries, the directory contains files of these sizes:

    By track     By size
 1.  7151712     5497664
 2.  9123472     5822608
 3.  5822608     7032608
 4.  5497664     7151712
 5.  9032544     7159040
 6.  8931184     7856016
 7.  9947920     8931184
 8. 10858000     9032544
 9.  7159040     9123472
10.  7856016     9947920
11.  7032608    10858000
12. 21923472    21923472

Probably no other set of files will show this pattern. So it's pretty easy for an adversary with a database of these things (they exist) to tell that I have a Weird Al album there.

You might assume that the sort order of the files will be scrambled, but of course the tool probably uploaded them in order (so an adversary can recover the order from the ctimes). Even if it didn't, the file sizes are nearly as identifying in sorted order. You might try to store the files in random locations in the directory tree (better), but that has the same ctime problem.

If you really want to have much hope of a secure system here, you really want to avoid storing the data in files entirely. One simple way to think of this is to break all the data you want to sync into 4k blocks, and then have the untrusted side store a database of SHA256 hash -> encrypted 4k block. You do updates by sending new blocks, and then giving the remote store a manifest of which blocks are still needed (the data about file names and the map of blocks to files is itself stored in encrypted 4k blocks hidden in the data). The layout of the database is now mostly irrelevant, since the protocol just talks in terms of hashes and manifests.

You'll note that this is starting to look a lot like a filesystem in its own right. I think something like this is probably needed to have a reasonable level of security.
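To make that concrete, here is a minimal sketch of what the untrusted side could look like (hypothetical names, not an actual syncthing API; all encryption happens on the trusted side before Put is called):

    package blockstore

    import "crypto/sha256"

    type BlockID [32]byte

    // Store is everything the untrusted server implements: an opaque map
    // from the hash of an encrypted block to the block itself. It never
    // sees file names, sizes or the block-to-file mapping.
    type Store struct {
        blocks map[BlockID][]byte
    }

    func NewStore() *Store {
        return &Store{blocks: make(map[BlockID][]byte)}
    }

    // Put stores one already-encrypted block, keyed by its ciphertext hash.
    func (s *Store) Put(blob []byte) BlockID {
        id := BlockID(sha256.Sum256(blob))
        s.blocks[id] = blob
        return id
    }

    // Retain is the manifest step: the client periodically sends the set of
    // block IDs still in use, and everything else is garbage-collected.
    func (s *Store) Retain(live map[BlockID]bool) {
        for id := range s.blocks {
            if !live[id] {
                delete(s.blocks, id)
            }
        }
    }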

@Natanji

Well, the question certainly is what counts as "reasonable". There are file systems like EncFS and ecryptfs which have the same problems that you mention here, but are still widely used - especially for cloud storage. If syncthing can do this as well as these state-of-the-art systems do, then that is a big leap forward!

Security is never absolute, but relative to a use case. Leaking the alphabetical order can easily be circumvented by shuffling the order in which files are uploaded - that is a good idea. Leaking the file sizes can lead to some exposure, but for most use cases leaking your musical preferences will not be the end of the world. Files with private data in them, however, would still benefit enormously from having just their names and contents encrypted, as in EncFS or ecryptfs.

Don't get me wrong: it is important to think about these issues. But we don't have to come up with a perfect solution that exposes absolutely nothing under any circumstances. If a perfect solution fulfilling 100% of the use cases means that much work, then it should be fine to opt for a much less complicated option that fits just 95% of use cases - at least for now, until a better option is available.

Perfectionism is the greatest enemy of open source progress. ;) As long as you inform your users of the security implications, e.g. what does not get protected and what does, it's completely legitimate.

@calmh calmh added enhancement and removed far-future labels Jul 26, 2014
@Phyks

Hi,

I'm really interested in Syncthing, but client-side encryption is a major feature for me, as I want to sync my files to my dedicated server (which hosts several other services), and I don't want to risk having any sensitive files unencrypted on such a server.

I read this issue and saw that this feature is being considered. But do you know of any working setups usable as of today? For example, using an encfs or ecryptfs container which could be automatically mounted and unmounted before/after each synchronization, or something similar? (Just for basic file content encryption, while waiting for a better solution implemented directly in syncthing.)

Thanks!

@Finkregh

I'd offer 40 EUR if I could replace tahoe-lafs with syncthing... 👍
https://www.bountysource.com/issues/1474343-support-for-file-encryption-e-g-non-trusted-servers

@elimisteve
@Finkregh

Well, I like tahoe-lafs very much concept-wise, but putting files in and the whole setup is a PITA :/

I need something that I can just install on my mother's PC ;)

@aral

Just wanted to chime in that this is a hugely important feature for me also for exactly the reason that NickPyz mentioned in their comment (#109 (comment))

@ghost

This is indeed the one thing keeping the app from becoming a BTSync killer, the simple-to-use P2P sync tool. Since this feature could require some protocol changes, it'd be awesome to see it implemented before the protocol freeze in 1.0.

I'd happily give up forward secrecy if encrypted nodes are joined to a cluster and just use an AES-256 block cipher and proper signing. This could even work on a per-node basis, so normal nodes would still use AES-128-GCM with forward secrecy and keep that slight performance gain.

As for hiding metadata, the untrusted node could use something similar to a virtual drive split into 2GB chunks. Yes, all this is a big challenge - well out of my depth - but the heart wants what the heart wants.

@NickPyz

As I see the discussion so far, there are 2 desired features:

(1) Encrypt data on untrusted nodes (with read only properties)
(2) Obfuscate metadata

I agree with others that the addition of these features + Syncthing's code transparency would be a winner vs. the proprietary alternative(s). I can live with these improvements coming in incremental steps rather than 1 giant leap forward.

I use an open-source, multi-platform backup/restore application called duplicati that has incorporated these principles successfully. It does client-side ZIP compression and encryption (AES-256 or GPG - the user's choice). It breaks the data up into equal-sized chunks (the size is configurable) on the destination end to obscure the metadata. The remote location sees multiple equal-sized *.zip.aes files. The original directory and file names and sizes are gone.

https://github.com/duplicati/duplicati and
http://www.duplicati.com/

I understand that real-time sync presents different challenges than scheduled backup and restore. However, encryption / metadata hiding mechanisms already exist out there in the Open Source community.

EDIT: I just wanted to add my concern that this development should also consider how the resulting Block Exchange Protocol would remain compatible with low-resource devices such as Android phones. I am sitting on the fence about the benefits of remote encryption at the possible expense of not having a working node on my phone.

@djtm

I think it would be both problematic from a security standpoint and from a developer resource standpoint to implement something new here.

Security-wise we should not reinvent the wheel, especially taking into account that a third party might also observe the changes between different file versions on the endpoint (i.e. if you change your file, the difference for that change will show up on the untrusted node, even if encrypted - this makes the encryption easier to break). From a security standpoint the assumption should be that the changes on an untrusted node are visible.

From a developer resources standpoint we should also try to avoid reinventing the wheel. Ideally, something with a go interface of course. But I don't think that exists. So let's look at existing solutions:

  • truecrypt - could already be used, but not for this exact purpose, no growing of volumes, no longer maintained as such etc.
  • cryptsetup - also for volumes, not for files, but a good standard.
  • boxcryptor - seems secure and might probably be adapted to the purpose, but not open source
  • ecryptfs - pretty secure even after audit, supports file content and name encryption (not size though), open source, but implemented in a Linux kernel module
  • encfs - pretty similar, a bit less secure I think, but implemented as FUSE. The last version is from 2010, so it does not seem very healthy, but it is still probably an ideal candidate.
  • gokeyczar - might be worth a look.

So I think the issue at this point is to find a working existing security solution that can be used, not to write down all the wonderful features we would love to have. And - maybe - to think about how the protocol might have to be adapted, and whether that is possible in a backwards-compatible way (i.e. so that old versions may still be used for storing encrypted files).

@Nutomic
The Syncthing Project member

Syncany might be worth looking at for this.

@ghost

@djtm
From a developer-resources standpoint you're right. This is a huge deal. But the thing is, if we were to use yet another application to encrypt data, then all the security of Syncthing is just overhead.

AES CBC or CTR with an IV is a lot more resilient - to changes as well - than you make it sound.
I also use full-disk encryption on my computers, so it's superfluous to encrypt everything again just so an untrusted node can sync it. I'm thinking of something like encfs's reverse mode.

@radeksvarz

Very good discussion here. I am planning to replace CrashPlan backup with either BTSync or syncthing.
My preference is towards the open protocol implementation.
CrashPlan does encryption on the originator node and stores the encrypted data on the other nodes (e.g. a friend's computer). The resulting structure is below.

Since the data is already encrypted, the transfer from the source node to the other nodes might not need to be encrypted again (via HTTPS).

There is, however, an additional aspect to the CrashPlan backup - maintenance runs on the target nodes, which cross-check the stored data for validity (and thus applicability for restoration).

 Directory of C:\CrashPlan_friends_backup

23.03.2013  16:30    <DIR>          .
23.03.2013  16:30    <DIR>          ..
18.08.2014  14:27    <DIR>          575963701973615056
               0 File(s)              0 bytes

 Directory of C:\CrashPlan_friends_backup\575963701973615056

18.08.2014  14:27    <DIR>          .
18.08.2014  14:27    <DIR>          ..
23.03.2013  16:30                 0 .4to5ind
23.03.2013  16:30                18 575963701973615056
20.08.2014  01:11             1 712 cp.properties
23.03.2013  16:30    <DIR>          cpbf0000000000000000000
26.02.2014  22:55               382 cpbp
18.08.2014  14:27         1 375 528 cpfmf
20.08.2014  01:11           562 593 cpfmfp
18.08.2014  14:27                 8 cpfmfs
20.08.2014  01:11           314 984 cpfmfx
18.08.2014  14:27         1 029 083 cphdf
26.02.2014  22:55             1 273 cprp
20.08.2014  01:11                53 cptl
              11 File(s)      3 285 634 bytes

 Directory of C:\CrashPlan_friends_backup\575963701973615056\cpbf000000000000000
0000

23.03.2013  16:30    <DIR>          .
23.03.2013  16:30    <DIR>          ..
23.03.2013  16:30                23 575963701973615056
18.08.2014  14:27     1 255 655 868 cpbdf
20.08.2014  01:11           450 256 cpbmf
               3 File(s)  1 256 106 147 bytes

     Total Files Listed:
              14 File(s)  1 259 391 781 bytes
               8 Dir(s)  29 795 094 528 bytes free
@dr4Ke

In for $30

@davidak

This would be a great feature.

The idea reminds me of a concept from Wuala: a user can give free disk space to the community and get that space on the nodes of other users.

So I could store a repo in the Syncthing Cloud, encrypted and distributed to several nodes. A problem is that nodes are not up 24/7, so chunks of files should be on more than one node.

So you could get more space if you have a higher uptime.

All of that could be implemented as P2P between all nodes.

That would also be interesting for businesses, as they could sell storage to syncthing users. That would help syncthing gain more users and acceptance.

@AudriusButkevicius
The Syncthing Project member

So this is a huge thread which I haven't even read in much detail.

I don't know much about the differences between the AES modes, so please correct my obvious mistakes.

From what I've seen I think the following would be possible:

  1. Per-block encryption - each 128kb block in each file is encrypted, though the blocks are still laid out the same way. We use the block index as the IV for AES CFB, which reduces security but means that the same block at different positions will yield a different encrypted blob.
  2. AES CFB encryption (even when the IV is known) means that each encrypted block ends up the same size as the input block, so the file size stays the same length (unless we choose to pad, or unless someone has a better example of how to get exactly 128kb of encrypted data for every 128kb or less of unencrypted data).
  3. Filepaths are also encrypted in full and base32'ed, resulting in a flat layout where all files are in the root directory.

All nodes share the same encryption key, which they use to encrypt departing data (blocks and filenames in the index) and decrypt incoming data (block data and indexes).

Nodes that do not have a key just assume the data is not encrypted and behave exactly the same way as before.

This is less secure than proper AES CFB, but it allows us to update single blocks rather than rewriting the whole file once one of its blocks changes, and judging by the internet (which is always a trusted source, right?) it doesn't give the attacker much.
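Roughly, in Go, the per-block scheme could look like this (a sketch of the idea only, with the same caveat as above that deriving the IV from the block index weakens CFB; not syncthing's actual code):

    package blockcrypt

    import (
        "crypto/aes"
        "crypto/cipher"
        "encoding/binary"
    )

    // encryptBlock encrypts one 128kb file block under the folder key,
    // deriving the IV from the block index so the on-disk layout of blocks
    // is unchanged. CFB is length-preserving: len(out) == len(plain).
    func encryptBlock(key, plain []byte, index uint64) ([]byte, error) {
        blk, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        iv := make([]byte, aes.BlockSize)
        binary.BigEndian.PutUint64(iv[8:], index) // predictable IV: the stated trade-off
        out := make([]byte, len(plain))
        cipher.NewCFBEncrypter(blk, iv).XORKeyStream(out, plain)
        return out, nil
    }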

@devurandom

If you encrypt the filenames anyway, and thus get a flat list of blobs, I think you could also store each block (e.g. 128KiB) as an individual blob. That way you could still update each block individually, but would not need to ensure the encrypted output is the same size as the unencrypted input. The offset of the block within the file could be appended to the filename before encrypting it, e.g. "{filename}\0{offset}".
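For example (a sketch; this swaps the suggested encryption of the name for a keyed HMAC, which also yields fixed-length, filesystem-safe names that the server cannot link back to paths):

    package names

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/base32"
        "fmt"
    )

    // blobName derives the server-side name for the block of file `name`
    // starting at byte offset off, per the "{filename}\0{offset}" idea.
    func blobName(secret []byte, name string, off int64) string {
        mac := hmac.New(sha256.New, secret)
        fmt.Fprintf(mac, "%s\x00%d", name, off)
        return base32.StdEncoding.EncodeToString(mac.Sum(nil))
    }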

@AudriusButkevicius
The Syncthing Project member

Yeah, but depending on the underlying filesystem, it might hit its limits very soon.

Also, imagine someone requesting a large file: you'd be going through file descriptors like there's no tomorrow. Furthermore, your spinning disk would probably kill itself, because 128k is probably small enough to fill a lot of gaps, hence seeking would be crazy.

Try copying a single 1GB file, and 1GB worth of 128k files, and you'll see what I mean.

@bigbear2nd

if it can be considered:

  • It would be nice if we could still use the versioning option, even with encrypted folders. The ability to protect against unintended deletion is important.
  • If only the encrypted file is left, because all other devices failed, decryption should be possible as long as I have the encryption key. My scenario would be a kind of "Time Machine" backup of the encrypted folder onto another HDD. But if I can only store undecryptable material....
@jpjp

How does obnam do it?

@AudriusButkevicius
The Syncthing Project member

So I've realized why my proposal would have problems:
Scanning - we'd have to have a write-only folder with no scanning going on at all, just receiving indexes and pulling everything as per the remote index.

Furthermore, other nodes should somehow know to reject any index updates from a device that does not hold the key, which means that there has to be a special flag on the index message verifying that the device which sent you the index actually has the key.

For decryption, I see a lot of problems, as you basically have to take the folder offline before you can do any decryption.

Perhaps we should provide a separate utility for decrypting stuff, but I cannot see an easy way of adding this 'Oh here is the key, now decrypt it and work just like before' functionality.
You'd probably have to delete the old folder, take syncthing offline, decrypt all the data, start it back up, and create a folder with the encryption key pointing it at the directory holding the now decrypted data.

Furthermore, every block received by a device which has the key would have to be rehashed and verified to produce the expected hash before it is used, because the encrypted devices hold hashes of the plaintext data but encrypted data on disk, so they have no way of knowing whether the data integrity is still there.

Otherwise, we'd have to encrypt all the data as we scan it in order to get the hashes of the encrypted data, which just seems to be very hairy.

@bigbear2nd

Perhaps we should provide a separate utility for decrypting stuff, but I cannot see an easy way of adding this 'Oh here is the key, now decrypt it and work just like before' functionality.
You'd probably have to delete the old folder, take syncthing offline, decrypt all the data, start it back up, and create a folder with the encryption key pointing it at the directory holding the now decrypted data.

It would be OK for me: once I've selected "encrypted folder" it cannot be undone; if I want to have it decrypted, the steps you mention seem reasonable.

We still have the functionality to sync the files to another device where they get downloaded and decrypted, if we provide the encryption key to the folder.

But if one needs one single file out of the encrypted folder, it would be good to have some decrypter tool. The file + encryption key should be enough; in BTSync you need some metadata, too.

Otherwise, we'd have to encrypt all the data as we scan it in order to get the hashes of the encrypted data, which just seems to be very hairy.

This is exactly what BTSync does.
In BTSync, the hash of the non-encrypted file is also synced to the encrypted device. They call it the metadata, which is also one of the components of their actual encryption key. See more: here

@jnordberg

+$20

@maran

@calmh Would you be at all interested in building this? Would there be a specific amount of money required for you to develop this as a paid feature funded by the community? It seems there might be enough interest to try this.

@calmh
The Syncthing Project member

This is a very interesting feature, yes. Money has absolutely nothing to do with it.

@maran

Time is often a problem, and money can often be exchanged for time. That's why I proposed it. If money is not the issue, is there anything else we as the community can do to help push this feature along?

@AudriusButkevicius
The Syncthing Project member

I think I have already worked out how to do this properly, I am just waiting for my first million dollars to accumulate here and then I'll start working on it ;D

@aral

Hey Audrius — let’s have a chat about this and the other three feature requests I popped on the forum this week. I am actually in London tomorrow for the London Swift meetup, want to pop over?

PS. Use case for Heartbeat: encrypted backup nodes where people do not have to trust the host.

@AudriusButkevicius
The Syncthing Project member

@Aral, I've got a product launch event tomorrow evening (not sure if I will be attending though), but I am free around lunchtime.
The meetup seems to be so secretive that it's impossible to find the address of where it's happening.

@generalmanager

@Natanji
encfs has Windows and Android implementations: encfs4win and Boxcryptor Classic for Windows, Cryptonite for Android. Mac users can use the original or ports. There is also a Java version, which can be compiled to HTML+JS: https://bitbucket.org/marcoschulte/encfsanywhere

Unfortunately encfs is absolutely not state of the art and is susceptible to some not-so-theoretical attacks:
https://defuse.ca/audits/encfs.htm

ecryptfs seems to be better than encfs, but it is no gold standard by any means either:
https://defuse.ca/audits/ecryptfs.htm

@djtm Boxcryptor Classic is just a closed-source implementation of encfs. The new Boxcryptor versions are literally junk, aka "let's roll our own crypto" done wrong.

@Nutomic
The Syncthing Project member

Okay here's what I've come up with, from my very limited understanding of the sync process:

For the index, there would need to be a secondary one that does not contain file names etc., only info about blocks (this index would be shared between the encrypted device and all devices that are directly connected).

In addition to that, the actual index would be stored as normal, encrypted files in the repo. We'd have to consider how the encryption is handled; either
a) the whole index is retransmitted on every tiny change (more traffic, more privacy), or
b) only part of the index is retransmitted (less traffic, but an adversary with disk access could analyze which files are being changed).
Maybe it would be better to "mix" the index in with the encrypted files.

Files could be stored as AudriusButkevicius suggested here, but then one could easily see that a 1kb file is a text file, a 3mb file is an mp3, and so on. It would be better to build bundles of blocks in a deterministic way, e.g. each on-disk file contains 1000 blocks, so all files are 128mb. The mapping of files/blocks could be stored in the secondary index (so for new additions, we could just look up the next free spot).
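A sketch of the bookkeeping this implies (hypothetical types; the secondary index itself would of course also be stored encrypted):

    package bundles

    const (
        blockSize       = 128 << 10 // 128kb blocks, as in BEP
        blocksPerBundle = 1000      // so every on-disk bundle file is ~128mb
    )

    // slot is a global block position, assigned on first store.
    type slot uint64

    // location maps a slot to its bundle file and byte offset; because the
    // bundle geometry is fixed, the remote side needs no per-bundle metadata.
    func location(s slot) (bundle uint64, offset int64) {
        return uint64(s) / blocksPerBundle, int64(s%blocksPerBundle) * blockSize
    }

    // index maps plaintext block hashes to slots; new blocks take the next
    // free slot, and (delayed) deletions free slots up for reuse.
    type index map[[32]byte]slot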

One more thing: it might make sense to handle deletions with some delay (e.g. 1 hour, or even 1 day or 1 week depending on available space), to decrease data leakage and allow reuse of blocks (though I'm not sure how likely reuse is in practice).

@calmh @AudriusButkevicius Does this make sense to you? :D

@bigbear2nd

I am against building bundles of blocks.
It is only useful in rare cases, because the standard user will only want others not to be able to access the data, nothing more. On the other hand, with block bundles:

  • you don't know which blocks to download if you want to get one specific file for decryption.
  • you cannot use things like the versioner any more
  • deletions/changes are much harder
  • a Time Machine backup becomes impossible
  • decryption without Syncthing is difficult (see here)
@nadalle

There are some downsides to building bundles of blocks as stated, but I don't see how most of bigbear2nd's issues are problems:

To download a specific file, you download the file index (you always have to do this!), and then ask for the blocks in that file. Trivial.

Various version tools / time machine / etc. should all basically work. The main difficulty is the same one people have with backing up virtual machines: you need to know that you backed up a sync point, or be able to cope if you didn't. There are two basic solutions to this, also similar:

  • You take a backup when sync is quiescent (pause-backup-resume), or:
  • You write an "fsck" type tool which can repair inconsistencies. In this case, I think the tool would be one which can scan through multiple copies of the data and multiple manifests looking for one that validates.
  • Keeping old blocks for a while would tend to help too, since it would make it likely that a version exists which was in a quiescent state. For example, if I keep a "version" live for 2 weeks, it's likely that a backup will run in that 2 week period, so that version can be recovered.

Decryption without syncthing is a bit involved regardless since you need a tool that understands the syncthing format. You might be able to avoid this if you clone encfs, but I suspect the end result is basically the same since you'll surely have to violate the encfs file protocol to fix all the bugs in encfs (e.g. it explodes on long file names and so forth). Either way, providing a standalone "decrypt a folder" tool seems like basic good manners.

Deletes and changes per se aren't hard (the idea is to use a manifest of blocks for each version, so a delete is an update that doesn't include a block), but they're related to the main problem with bundling:

Fragmentation.

There are a few different general schemes for how you might do this sort of thing, but in general if you break up your source file tree into a bunch of opaque pieces and ask a remote server to store them with no knowledge of where they go, you'll tend to see that the bundled block order may not have a strong correlation to the source file order (and indeed, theoretically you might not want it to since it might leak some sort of information).

This probably gets worse over time, so as files are added/deleted/changed, you might find that the bundle system tends to degrade to random $BLOCK_SIZE IO.

There's also an IO-multiplying effect on delete, since if I delete a 1K file locally, you might want to re-write an entire 1MB bundle on the untrusted server side, as most systems have difficulty poking holes in existing files (and some, such as MacOS, don't support holes at all).

@nadalle

A few more comments on the whole bundling type idea:

The place I started thinking about this is like a dedup block store. Imagine that you hash the blocks in your source files, and then just store one copy of each block remotely. Each version of the backup would just supply a manifest of blocks which are used in that version, and then you send each block that isn't already stored.

In a sort of naive sense, it's easy to see how you could just encrypt every block using a secret (!) IV, and then encrypt the file index which describes the file -> block mapping too.

However, this by itself isn't going to work since you can't use an IV like that with a block cipher. It would give some minimal security, but an attacker could find similarities between the blocks and break your scheme that way.

A better solution would do something like mix a hash of the block plaintext with the IV secret to make the real IV, and then obviously keep the hash secret by putting it in the encrypted file index. You would still find the encrypted blocks by looking them up via a hash of the ciphertext.
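Concretely, that IV construction could look something like this (a sketch; the plaintext hash itself stays hidden inside the encrypted file index):

    package ivderive

    import (
        "crypto/hmac"
        "crypto/sha256"
    )

    // blockIV mixes the IV secret with a hash of the block plaintext, so
    // identical plaintext blocks encrypt identically (enabling dedup) while
    // an attacker without the secret cannot confirm guesses about contents.
    func blockIV(ivSecret, plain []byte) []byte {
        plainHash := sha256.Sum256(plain)
        mac := hmac.New(sha256.New, ivSecret)
        mac.Write(plainHash[:])
        return mac.Sum(nil)[:16] // truncated to the cipher's IV size
    }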

A different approach is to dispense with the deduplication-style concept and just use bundling to disguise the file boundaries. You could imagine just sending the encrypted files in a stream, broken up on bundle boundaries. This is easy the first time, but as you make changes or deletes you'll have to send things like:

  • Replace blocks 23-30 in bundle 744 with this data.
  • Remove blocks 30-42 in bundle 744.

... which leads to complexity and sparse (or underfilled) bundles. The remote server may also have a much easier time inferring where file boundaries are that way.

@Nutomic
The Syncthing Project member

Regarding decryption, I don't think we should worry about it. If needed, it would be trivial to create a script that starts up 2 syncthing instances and transfers the data from the encrypted to the unencrypted folder.

@AudriusButkevicius
The Syncthing Project member

That doesn't work, in my approach at least, because the encrypted node never updates its index, as it has no clue what on earth it holds: the hashes it holds are for plaintext, but the content it holds is encrypted. So it just accepts what it is given, and stays silent.

@Nutomic
The Syncthing Project member

That's why I suggested a secondary index for the encrypted data, with the actual index stored in encrypted form in the repo. That way, other clients can download the actual index with no extra info needed.

@menelic

duplicati offers encrypted backup and restore that works with many cloud backends - surely their approach must have solved some of the issues discussed above? https://github.com/duplicati/duplicati

@Nutomic
The Syncthing Project member

Duplicati is somewhat different in that it is made for daily backups, not continuous syncing.

Still, here's their paper on the storage format:
https://code.google.com/p/duplicati/downloads/detail?name=Block-basedstorageformat.pdf&can=2&q=

@Natanji

@generalmanager Encfs4win is utterly useless. It doesn't work properly, and neither does the Windows version of sshfs. It will be slow and crash randomly, and it is REALLY not how you want to store your data.

@menelic duplicati/duplicity are an entirely different use case. They are not good for syncing files across multiple computers in real time, as far as I can see. For instance, compression is useful for backups, but really not needed for syncing.

For the blocks thing, I looked at syncany (as suggested by @Nutomic), whose security concept REALLY is worth looking into, in my opinion. A while ago I asked the author about some improvements to the documentation of the concept, which he made, and the results can be seen here: http://syncany.readthedocs.org/en/latest/security.html

I think it's a very sound concept. It basically uses bundling into multichunk files of (at most) 4MB size. The concept has built-in versioning as well, which I see as useful.

@davidak

If we talk about encryption, deduplication and blocks, that sounds like a filesystem.
Could we just use a filesystem that does these things? Maybe there is something implemented in Go, or we could use C binaries?
TrueCrypt encrypts files inside a big file containing a filesystem like FAT32 or ext3.

VirtualBox can create files that grow with usage... with a filesystem inside.

@Natanji

We cannot use any existing filesystem, because we need to send data to the server only in encrypted form (never revealing the plaintext to the server), but ideally we want to store only unencrypted data locally.

Also, single files that grow are a terrible thing in a networked environment. If you have a 4GB encrypted truecrypt virtual disk, you might end up uploading all 4 gigs of it when you change just a single character in one file inside it. (Granted, probably not when using the rsync algorithm or something similar for the upload, but still, it's not a very transparent process and you will upload much more than you actually need.)

@davidak

git-annex has several remotes, lots with encryption: http://git-annex.branchable.com/special_remotes/

http://git-annex.branchable.com/encryption/

Maybe this is a way syncthing could do it?

@bobobo1618

To save people the effort: git-annex encrypts the contents of the files and uses HMAC hashes of the filenames.

Syncany breaks files into chunks and stores metadata related to them in a separate encrypted database.

Personally I prefer Syncany's approach. It encrypts all content and metadata, and its chunking approach allows for further extension into deduplication. It also has the added benefit (assuming configurable chunk sizes) of being able to fit files into arbitrary file size limits - for example, if you want to sync to a FAT filesystem or a cloud service.

I'm most interested in this issue because once completed, it invites the possibility of sending data to a conventional cloud without any issues with security. With OneDrive's unlimited storage, I could use Syncthing for backups.

@jceloria

I've used and grown to love EncFS over Dropbox, AeroFS, and BTSync. I've been able to mount the encrypted file system on Windows, Mac, iPhone, Android and of course Linux boxen for years now... With Syncthing, though, I have had issues with the encrypted shares... I too would pony up some cash if there was a solution for this.

@menelic

@aral it would be great if the ind.ie project could pick this up as one of the goals you define for your crowd-funding campaign. As this thread makes clear, people are ready to pay to have this feature implemented. Your campaign is the medium through which monetary contributions to the development of pulse can be made. For pulse to be the backbone of heartbeat, this feature needs to be implemented - that is how I interpret your comment above in the context of the ind.ie campaign. I appreciate that you want the campaign to be legible to non-tech people, but I think there should be a section in which you detail the specific features, goals and, well, milestones that contributions will fund. If you do this for this feature, I am sure some of the above pledges would find their way to your campaign.

@AudriusButkevicius
The Syncthing Project member

@jceloria what were the issues, just out of curiosity? I recall there being something to do with file name limits?

@jceloria
@AudriusButkevicius
The Syncthing Project member

I recall that this might have been due to the file name length limit on Windows?

@jceloria
@benguild

I agree that this feature would be kickass. You could exchange "hosting-only access" with friends and keep folders distributed.

However, other considerations:

  • Remote versioning support... for example, being able to restore various versions of files from other nodes, and having those nodes keep X number of copies for X number of days (both with minimums and configurable)
  • Limiting the overall file size and storage of folders, and being able to purge old versions automatically if a folder is using up too much space on an HDD/SSD, based on what someone allocated and gave permission for (helpful for avoiding malicious peers)

Just thinking out loud!

@davidandreoletti

I would put up $30 for this alone and would help with the implementation.

@stooone

You got my $10 too.

@fschwebel

Bounties amount to $250 so far 👍

@fti7

+$50 :-)

@bademux

I think it would be a terrific feature for creating a safe mirrored backup (kind of like RAID 1) with 2 or more low-power devices (RPi-like).
A small bounty was added :)

@davidak

bounty increased to $400!

@Nutomic
The Syncthing Project member

We're on the first page of Bountysource!

I think we shouldn't fix this until we're at the top :D

@sdaves

We're at the top of the organic links for me, right under the featured items!

@ouroboros8

I was going to ask about this kind of feature myself, looks like I was beaten to it. Looking forward to this!

@benguild

I think the really important point for this... especially in alpha/beta testing... is limiting the amount of changes that the untrusted peers can potentially make to your system. I think the trusted peer(s) will need to be the authority on whether or not changes are valid based on a private key that they retain to validate said changes and data on the untrusted peers.

For example, if malicious data somehow gets stored on an untrusted peer or a system is suffering from drive/data corruption, that data should obviously fail some sort of checksum given that it's not intact as delivered by a trusted peer to the untrusted peer. However, the untrusted peer simultaneously needs to be able to verify using some other key that the trusted peer it's talking to is actually a trusted peer, and THAT data (that it receives) validates as well.

So I think there almost needs to be like 4 different sets of keys for this, but I'm just thinking out loud.

@AudriusButkevicius
The Syncthing Project member

I think one or two keys are enough.
In a single-key system you either have the shared secret and can participate in introducing changes, or you don't, and you can only receive changes and serve requests.

In a two-key system you could perhaps have a key for decrypting received data and a key for encrypting dispatched data. Having both would make you a full-blown member; having none would make you a transit node. Having the decryption key would let you store the data decrypted while still not being able to introduce changes; having the encryption key should be the same as having both keys.

Option 1 is much easier to implement in the UI and code.
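A sketch of how the two-key split could be realized (assuming Ed25519 signatures for the write side, e.g. Go's crypto/ed25519; the read key is a plain symmetric key; hypothetical names, not syncthing's actual key scheme):

    package keys

    import "crypto/ed25519"

    // ReadKey decrypts blocks and indexes. WriteKey signs index updates;
    // its public half ships with the folder configuration, so every node
    // (even a transit node) can verify updates without being able to forge them.
    type ReadKey []byte
    type WriteKey = ed25519.PrivateKey

    func signUpdate(w WriteKey, update []byte) []byte {
        return ed25519.Sign(w, update)
    }

    func verifyUpdate(pub ed25519.PublicKey, update, sig []byte) bool {
        return ed25519.Verify(pub, update, sig)
    }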

@benguild

As long as both the untrusted and trusted peers are able to verify data integrity to some degree, the actual number of keys doesn't matter. I was just saying that it sounded like there needed to be 4, whether or not they are derived or stitched together.

There are situations where an "untrusted" peer could actually be more trustworthy than a trusted peer that's been compromised or is corrupted, for example. There needs to be a chain of trust and validity.

@AudriusButkevicius
The Syncthing Project member

Well, setting up devices gives you that control; it's nothing to do with encryption keys.

What it sounds like you are saying is: it has to be super sophisticated, but I don't know how.

Also, I am not sure what type of integrity you are talking about.
We use TCP, which is a reliable transport, so data cannot get corrupted in transit, and our downloads are hash-verified, so even HDD errors are detectable. What other integrity are you talking about?

If a device is compromised, you remove it from trusted devices on all peers, problem solved.

@benguild

Right, I'm talking about disk corruption on one of the peers.

@AudriusButkevicius
The Syncthing Project member

Well, it's detectable to some extent. If the corrupt file is advertised as changed due to a changed mtime, then the corruption will propagate, and there is no way to prevent that. I haven't seen software which tackles that.

@bherila

P.S. It would be super cool if the encrypted target on the encrypted host could be mounted by encfs.

@fschwebel

@bherila I don't think it would: being compatible with encfs means using encfs, and that's not secure enough, as @generalmanager said.

@vincenzoml

Whoa, what a long and interesting thread. I've been thinking about zero-knowledge hosting for a long time, since a famous service was shut down in New Zealand, in 2011 if I recall correctly. I even started some prototypes that eventually went dead. But before I explain my own ideas, may I ask for a summary of the thread? Is there a proposed/accepted project for the implementation of this feature?

@neolefty

A tangential feature would be read-only sharing. Non-trusted, but without encryption.

For example, KA Lite (an offline fork of Khan Academy) offers the option to use btsync to download its videos. A core group of editors has read-write access, and a much larger group of consumers has read-only access, which is very conveniently supported by btsync. I think KA Lite would love to use an open-source alternative such as Syncthing, but currently Syncthing lacks 2 key features:

  1. Read-only participation, to prevent accidental deletion or corruption by consumers

  2. Reduced administrative overhead for mass sharing of files (auto-peering) -- a separate issue!

@bigbear2nd

@neolefty
1. It has this feature already. It's called "Folder Master".

@calmh calmh modified the milestone: v1.0-maybe Jan 15, 2015
@elimisteve

Added to the BountySource campaign to fund this feature, whose total is now $520. Not bad!

@bitshark

I've done a ton of work on this... The easiest way is to put LUKS volumes on your remote sync machines and use an encfs product on Windows. A crypto option is possible in the software.

But we need to sort out what to do about the filesystem. This is a design decision point about how to proceed (do we use XFS or EXT and write Windows filter drivers? Do we use FUSE drivers?)

We don't even know what the route is going to be to fix the outstanding bugs and feature requests... Do we just fix the bugs and make it file-copying software? Do we move towards a virtual device driver for block devices? Do we implement an object store? A native file system? A distributed hash table with the files inside?

Can this all be fixed with multithreading, so that we don't need to change the protocol at all? If so, great!

Any changes will need benchmark tests -- we need metrics before we select any major change/redesign, because no one is going to use it if it's slow. So it's got to be FAST. My goal is to make this thing the fastest thing out there outside a datacenter.

I'm leaning towards using metadata in the DHT, and FUSE if it's really necessary. We'll have to see, but this is a great thread. Any research is appreciated. The encryption is the easy part, to be honest.

Our block size is going to be dynamic, most likely a power of 2, between about 4k on the lower end (the smallest inode size) and about 1MB or so on the upper end (the largest bittorrent block size, and in the range of the Google FS block size).

@AudriusButkevicius
The Syncthing Project member

@cydron I asked you about crypto in your original thread on discourse, as I do see crypto as one (and perhaps the only one) of the unsolved problems here, at least on this issue.

@SjonHortensius

Wouldn't it be quicker to approach this as 2 different issues? The first to encrypt file contents, the second to encrypt metadata?

I have a feeling the first would be relatively easy compared to the second, while it also helps the most.

@AudriusButkevicius
The Syncthing Project member

Actually the first is harder than the second one, due to us having a fixed block size of 128kb.
I am struggling to find an asymmetric block cipher (I would ideally like read keys and write keys) which can preserve message length when encrypting. I guess eventually I will have to give up and just think about tackling varying block sizes.

@generalmanager

@cydron I don't really think the encryption is the easy part. Also, your notion of rolling our own crypto in the discussion on discourse regarding the transport layer makes me a bit uneasy, because it is strictly against the single most important piece of advice given to people who are not professional cryptographers.
I am not one of those, and I don't know if you are, so I am certainly not going to judge you. But "the encryption is easy" sounds rather hand-wavy. Could you go into the details of how you would implement the encrypted storage and syncing of data and metadata on an untrusted device?

I consider myself an interested amateur in cryptography, thus I generally try to find sources where experts in the field have taken a stance on the things I criticize or recommend.

To everyone interested in crypto I recommend watching the Crypto 101 talk and the accompanying PDF:
https://www.youtube.com/watch?v=3rmCGsCYJF8
https://www.crypto101.io/

@AudriusButkevicius AES CFB doesn't seem like the best choice and is only considered secure with random IVs.
It also doesn't provide authentication, which then has to be achieved through something like HMAC-SHA256.

This is a good read, especially the long post by Perseids with the accompanying warning:
https://stackoverflow.com/questions/1220751/how-to-choose-an-aes-encryption-mode-cbc-ecb-ctr-ocb-cfb

And here is a very informative post by Thomas Ptacek. Even though we don't want to encrypt a block device, the whole post is worth reading.
http://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

As he points out, we should probably use AES-GCM (which also gives you authentication of associated metadata for free) or a true stream cipher like Salsa20 + Poly1305; ChaCha20 + Poly1305 would be even better. AES-OCB is faster, but unfortunately patent-encumbered in the US.

There even seems to be a golang implementation:
https://godoc.org/github.com/codahale/chacha20poly1305
Unfortunately it's not going to become official though: golang/go#9489
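For illustration, authenticated per-block encryption with AES-GCM from Go's standard library (a minimal sketch only; a real scheme must never reuse a nonce under the same key and needs length checks):

    package aead

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
    )

    // seal encrypts and authenticates one block; the random nonce is
    // prepended so open can find it again.
    func seal(key, plain []byte) ([]byte, error) {
        blk, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        gcm, err := cipher.NewGCM(blk)
        if err != nil {
            return nil, err
        }
        nonce := make([]byte, gcm.NonceSize())
        if _, err := rand.Read(nonce); err != nil {
            return nil, err
        }
        return gcm.Seal(nonce, nonce, plain, nil), nil
    }

    // open decrypts one block; it returns an error if the untrusted node
    // modified the ciphertext or its tag in any way.
    func open(key, blob []byte) ([]byte, error) {
        blk, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        gcm, err := cipher.NewGCM(blk)
        if err != nil {
            return nil, err
        }
        n := gcm.NonceSize()
        return gcm.Open(nil, blob[:n], blob[n:], nil)
    }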

One other interesting example is how tarsnap, one of the safest existing backup solutions, does this. BTW: it's written by Colin Percival, the creator of scrypt and ex-Security Officer of FreeBSD.
https://www.tarsnap.com/crypto.html
Unfortunately the code is open to read, but not under an open source license. It's in the wrong language anyway, though.

This still leaves the question of which format we should store the files in.
Something along the lines of kivaloo would seem like a good idea:
https://www.tarsnap.com/kivaloo.html

IMHO we shouldn't try to build a modern replacement for EncFS with FUSE adapters and everything. That is simply too much for such a small development team.
The main use case of encrypted storage is to offer protection against potentially hostile servers which are used as always-on storage nodes, or against snooping by people with whom you share disk space in a tit-for-tat manner.

For all those cases it should be sufficient to decrypt the whole repo via a CLI option or a little single-purpose tool.

@bitshark

@AudriusButkevicius

Sorry if I wasn't clear. First -- I used to design crypto systems for a defense contractor, so you are talking to a crypto engineer here, not an applied math major... hah.

Second, you're correct. There are two issues. Once again, I'm not worried about the crypto beyond my personal interest in the engineering. We can always add modules to convert back from legacy systems like encFS if that became an issue.

I've written FUSE modules... they are cross-platform and give flexibility for both crypto and file handling, since FUSE can trap events.

The issue is more along the lines of solving the integrated problems of distributing little 'blocks' of data on its own, and then making sure that solution will work with a good form of crypto. That all depends on how we want the program to act (does it act like a backup service? Or a network drive that can stream from a remote location? Does it do both sync and stream?)

Do people want a copy of whatever they add to their folder stored locally, or do they want the drive to act like a network drive (meaning the file is immediately streamed out across the network via torrent blocks)? Do we employ a local cache?

Do we transfer file metadata along with the file data? Or are they two separate entities, like in clustered cloud file systems? Do we want fault tolerance built in?

How can we get a third peer in there sharing packets and making the throughput go up?

The crypto is not an issue, in the sense that we have bigger issues to solve first.

We can layer in filesystem-level encryption once some of these other bugs are resolved, assuming we solve the issue of the block checksums and make sure that the crypto doesn't screw up how blocks are identified and stitched together.

We just want to make sure we use an appropriate mode with an appropriate cipher and a good implementation. The details as to whether it's COTS or FUSE or something else are a non-issue to me, as long as it's best practice and passes a security audit. Personally, I'm fine with encFS as a stop-gap solution if we patch its IV handling.

So we need to do some research and make a decision eventually... part of this has to do with the user requirements, part of it with scalability, etc. All that sort of thing.

But personally, I'm going to focus on the BEP protocol behavior in my reference Java client. Once that's all working, then perhaps I'll start thinking about crypto.

Of course, you are all welcome to implement whatever you want now. I'm just going to take a shot at a different part of the system for starters.

@bitshark

Oh, and for everyone worried about Tahoe-LAFS: it has a few problems that make it not a first choice for me, but it's okay. We could still use it if we do the research and test it on EC2.

I'm definitely looking at Tahoe-LAFS... but I think it depends on whether we are running 'peers' in datacenters or on iPhones. Where is the user base located? Probably the latter.

On any machine, Tahoe is going to be a VFS, not native. If we're going to do that, it might make sense to go with something 'lightweight' that's not built for HPC clusters but will be fast enough for a smartphone.

@Zillode

FUSE is not enough; we can't trust the external host, so we should never decrypt traffic on that side. Therefore we are looking for (block) ciphers that let us hand already-encrypted data to the external node. The external node only needs to announce which blocks it has, such that unsynced nodes can pull data from it and decrypt it when it arrives.

@Natanji

First off: cydron, is it possible for you to produce some more readable text in this thread? It's really hard to read. Thanks.

Second: be aware that syncthing wants to be multi-platform, so eCryptFS is not an option as far as I see (unless you mean essentially re-implementing it so that it can be used on Windows). This is also a problem with other technologies such as FUSE; the implementations on Windows are extremely poor. I have tried using encFS on Windows and it just sucks completely.

I very much agree with Zillode that server-side crypto would not make any sense. You go to great lengths explaining (I think) that private keys should be handled on, well, a non-trusted node. That is a plainly bad idea; it's nothing more than security by obscurity. Maybe I misunderstood you, though? I mean, sorry, I really can't read your text that you wrote up there very well, so perhaps you can re-state your main points succinctly.

@igrewup

Because of the ease of sending funds using Bitcoin, I've donated to this cause.
I think Bitcoin is going to unleash a monster that empowers people to develop software and solve a bunch of issues without any company/corporation involved.
With all the negativity going around, seeing people band together for the greater good of society is a wonderful thing! Keep up the good work...

@bitshark

Yeah, sorry, keyboard broke. Here's what I'm suggesting....

  1. The opposition to FUSE is a bit ideological in my opinion.
  2. We gotta fix and sort out the existing transport bugs before adding crypto.
  3. The crypto will require careful planning and evaluation of limitations and constraints.

Generic crypto is easy. Storing (and accepting) data from a potentially malicious node is hard.

The reason for point 3 above is that what we are doing is particularly vulnerable if a cloud data host is the adversary, because they will have snapshots of the encrypted files through time. Lots of attacks are possible if they get to see how the encrypted files change over time.

The use cases that are being discussed are not clearly defined. I think we are unclear about what the objective is in terms of keeping data safe, what is the threat model, and so forth.

At this point, we are just the mailman for file/folder data, and we secure it in transit to a network of peers... some of which are our own devices, some of which are servers or VPS boxes, and some of which are cloud drives. Eventually peers could expand further.

If you start getting out of the realm of 'local peers and a server' and into disk encryption on a case-by-case basis, it's a vast subject because of its complexity.

Let me state some of my points as follows:

(0) I'm going to work on BEP and the Java client. I'm weighing in here to add my 2c so hopefully a solution will be carefully considered.

(1) The currently implemented transport encryption in the code base is fine. It should eventually mandate TLS 1.2 with PFS, specifically ECDHE-ECDSA with AES CCM or ChaCha20-Poly1305.
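For illustration only, here is roughly what pinning such a policy could look like with Go's crypto/tls (which offers AES-GCM and ChaCha20-Poly1305 rather than CCM, so those stand in here; this is a sketch, not the actual Syncthing configuration):

```go
package main

import "crypto/tls"

// strictTLSConfig sketches the transport policy from point (1):
// TLS 1.2 minimum, ECDHE key exchange for forward secrecy, ECDSA
// certificates, AEAD cipher suites only.
func strictTLSConfig() *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12, // mandate TLS 1.2
		CipherSuites: []uint16{
			tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
		},
	}
}
```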

(2) What people are discussing is support for file encryption on 'non-trusted servers', but not on 'trusted peers'. This is a big topic because of the cryptography, and it might break the torrent mechanism. The selectivity of trust -- that some nodes store plaintext and others ciphertext -- is the core issue to solve.

Also... I don't think these use cases or the threat model are well defined. Furthermore, selective 'per-node' encryption will have a huge impact on transport, requiring rewrites of the BEP protocol at minimum. So that needs to be considered: "how does per-node selective encryption affect block transport using BEP1?" This is also connected with the cipher, mode, file data, and control messaging. Big architecture questions from a relatively small-seeming feature.

(3) Personally, I think the idea of encryption being selectively enabled per node/peer is going to be more trouble than it's worth in the short term. Instead, I think in-place file encryption should be globally enabled or globally disabled per cluster, at least for now.

For me -- either I trust the remote machines (because I own them), or I encrypt all the data all the time (including the data stored locally!). There is no real drawback to global encryption, like a speed slowdown, if the encryption is on the fly. The only issue I suppose is Android, which is not as powerful as desktop machines, so there may be some lag there.

(4) With the speed of modern encryption (even excluding super-fast ChaCha20), I feel like selective encryption per node just complicates the design without any benefits. And as I mentioned, it will probably break the block distribution protocol (the block hashes and file hashes will go out of sync and won't transport properly).

(5) If we're going to get into these complex scenarios where data is encrypted on some machines but not others, and we also expect the protocol to distribute the blocks fast and reliably, then we need to get into key management and a secure DHT. This means PKI, authenticated encryption, HMACs, the whole deal -- not to mention a bit of a pain. Why? Well, we'd need to ensure that an 'untrusted' host can distribute blocks to its peers in a 'torrent' block stream even though the IVs between nodes may differ, inducing a change in the SHA1 sum of the encrypted file.

This could cause all sorts of mayhem, because peers will think the file has been updated/changed when it's really just been encrypted and decrypted... There could be different IVs than before (as they are random numbers), which changes the SHA1 sum of the file.

(6) As a feature, encryption should be ALL ON or ALL OFF. For now, that's my opinion anyway. Unless there's an easy way to have it all.

(7) What is done about this issue is going to depend on how someone plans to use this software. If we are just encrypting data to Dropbox from our house, then what's the point of a P2P protocol? How big do we want it to scale?

ENCRYPTION

Let's suppose we agree that encryption is an all-or-nothing feature across a collection of nodes, and we turn it on.

What do we use to make the encryption cross platform, be it Windows, Mac, Linux or Android?

We've only got a couple of solutions here... (1) Roll our own, (2) COTS, or (3) Both

I don't have all the answers, but I can tell you it'd be a mistake to rule out FUSE prior to careful consideration.

FUSE:

FUSE is a project consisting of an emulation layer and API, allowing a non-root user to write a driver for any filesystem or even invent their own! FUSE is steadily gaining more use, and now there are bindings for all major languages including Java and Python. A 'hello world' FUSE filesystem is only a few lines of code.

FUSE is officially supported on Linux, BSD, and OS X. FUSE is supported on Windows through a port called Dokan (there is a commercial port called CFS as well)... FUSE has a respectable security track record -- certainly far better than OpenSSL. FUSE exposes a user-mode API so that clients can write drivers.

One HUGE advantage of FUSE is that you can do commonly required operations (like file system monitoring for events, trapping calls, locking and unlocking files, etc.)... These operations are almost essential in cryptographic filesystems which run in user mode, so FUSE makes them much easier.

FUSE is used in enterprise cloud systems in datacenters all the time, especially in distributed, fault-tolerant clustered file systems, as well as in various cloud products.

eCryptFS

eCryptFS is a neat idea, but it has security and practical issues. One point about it is that it is an actual filesystem that hooks into Linux's filesystem: it does NOT use FUSE on Linux as far as I know -- it is a fully functional stacked filesystem that runs in the kernel on top of your OS's file system, at the vnode level -- eCryptFS was designed around ext filesystems. It includes metadata in the files themselves, so they can be separated from their directory structure and still be decrypted. It also has a cache built in for speed.

Getting it to be cross platform would require some customization and work.

Part of the reason there is no Windows port of eCryptFS is precisely the way it was designed -- it sits above the vnode layer, so it is tightly coupled with Linux. Hypothetically you could modify it to run on Windows, but I think it'd be easier just to build something better. eCryptFS had a major vulnerability (plaintext key disclosure) just last year.

Here is eCryptFS security audit:
https://defuse.ca/audits/ecryptfs.htm

encFS

encFS is an older program that encrypts a nested directory structure on the fly. It is built on FUSE on Linux, and it is supported on other OSes through FUSE ports like Dokan.

It has a couple of security issues that need to be patched, most involving how IVs are handled, and it needs to use a separate master key for the HMACs it uses. So it's old, it works, and it needs to be patched to be fully secure.

I have had no problems at all using encFS with Dokan (a FUSE clone) on Windows 7 x64. I used encFS builds by two different authors as well; both worked.

I'd be happy to send instructions on how I got it working. It will keep out most of the hackers, but not the NSA.

Here is the result of a recent security audit of encFS:
https://defuse.ca/audits/encfs.htm


So my specific question is: what is the threat model that precludes the use of FUSE? Who or what are we worried about, and what are their capabilities? Is it just compatibility? Which OS does it not work on?

Suppose we don't use FUSE prior to transport, and instead we write our own Windows filesystem filter driver... What makes writing a Windows filesystem driver (in kernel mode) preferable to running FUSE?

If we don't use FUSE and we don't use a kernel-mode driver, then our only option is 'all code in userspace'? If that's the case, how does that give us more security? Or make it easier? I think that just means more code to write, unless we find something off the shelf.

I think a side effect to implementing crypto without FUSE would be 'reinventing the wheel' so to speak, essentially rolling our own cryptography for a filesystem tree -- and the codebase will be huge and hard to audit. It's doable, it's just a huge project..

Put it this way: I'm not for or against a particular solution. What I believe is that we have a problem somewhere between the protocol and the models and assumptions about what a 'filesystem' is in the abstract (a tree of nodes of folders and files, but we modelled it like a Windows FAT). And this has resulted in some bugs in the system.

I think enhancement of encryption at the file level is a great idea, but I think it's a bit early to incorporate..

Here's a good example of how this affects UX: (1) I want to mount my P2P drive as a 'sync' device (background sync), vs. (2) I want to have my P2P device mounted 'streaming' (to watch AVI movies on the fly). In the latter case, we need to be able to get an event when the user opens the AVI file, AND we need to stream the blocks from random starting points (if the user pauses the movie, etc.).

I'm suggesting we keep an open mind in general about how to solve problems.

Let's not reinvent the wheel, and let's not discard FUSE entirely... It might be our solution for implementing a platform-independent layered filesystem. Or maybe further research will turn up something better.

Take this example of how FUSE is being used in cloud computing...

  • GlusterFS is a scale-out network-attached storage file system. It has found applications including cloud computing, streaming media services, and content delivery networks. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011.

  • Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or Sockets Direct Protocol, creates composite virtual volumes from multiple remote servers using stackable translators. By default, files are stored whole, but striping of files across multiple remote volumes is also supported. The final volume may then be mounted by the client host using its own native protocol via the FUSE mechanism, using NFS v3 protocol using a built-in server translator, or accessed via gfapi client library. Native-protocol mounts may then be re-exported e.g. via the kernel NFSv4 server, SAMBA, or the object-based OpenStack Storage (Swift) protocol using the "UFO" (Unified File and Object) translator.

Also have a look at this article on why block-level encryption sucks, and why you should never use XTS mode.
http://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

Another good link: writing an encrypted FUSE filesystem in Python for syncing to cloud storage
http://www.stavros.io/posts/python-fuse-filesystem/

For more info on solutions for the stability of a peer-to-peer storage network, look into "distributed, parallel, fault-tolerant filesystems"...

If anyone needs this sort of encryption for their cloud provider, I can send you either the instructions for getting encFS working on windows, or you can download the standalone program Duplicati for Windows (which you can use with any cloud provider more or less).

@fschwebel

@cydron : thanks for the work. I agree with you on many points.
However, I can say Dokan doesn't work on Windows 8.1, and actually isn't supported anymore (its maintainer has been paid by a company to drop the open-source lib and start a commercial one for them that is probably very lucrative now)...

@bobobo1618

I think we should really rule out FUSE or any other filesystem level solution. Not everyone has admin access to the PC they're running this on and it'll essentially have to be reimplemented on Android and iOS anyway since those platforms don't have it.

@Finkregh

👍 to NOT use FUSE as @bobobo1618 stated.

@bitshark

Okay I'm off to code... good luck to anyone who is sorting this out.

@finkregh
Neither of you answered my questions... what is the technical objection to FUSE?

@fsch Thank you for answering my questions with a specific technical limitation. I wasn't aware Dokan doesn't work on Win 8. I know there is an active fork of Dokan called DokanX. DokanX will work on Windows 8.

@bobo Okay, got it: rule out FUSE and 'other filesystem-level' solutions, and we can't have admin, even once for install... From here there are very limited options. That leaves pseudo-block-device emulation encryption (like a TrueCrypt file container). Most cryptographers do not like this method. Did you read that XTS article? It details this specific use case (local machine to cloud hosting).

"Encrypt things at the highest layer you can."

"Sector-level crypto is last-resort crypto."

"If you’re encrypting a filesystem that you’re going to host on a cloud service, consider trying to reframe your problem. Sectors or nothing? By all means, stick a Truecrypt volume on Dropbox. But you have better options.

Remember that disk encryption is designed to counter an attacker with very limited capabilities. That’s why it falls to “Evil Maids”: the threat model doesn’t really accommodate attackers with multiple bites at the (physical) apple. But whatever margin of safety XTS gets you on physical media probably goes out the window when you stick a Truecrypt volume on Dropbox. From the vantage point of Dropbox, attackers have far more capabilities than the XTS designers planned for."

http://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

"Simulated hardware encryption sucks. If you free yourself from the idea that you need to encrypt a whole disk, you win a bunch of things:"

@AudriusButkevicius
The Syncthing Project member

I don't think we should have a discussion about fuse at all, and I think @calmh will agree with me.

My personal view:

Syncthing is not a file system, and I personally have NO plans of making it one.
If you want to write a filesystem which uses BEP to get its data, that's your fight.
The iOS port behaves similarly, it obviously doesn't use FUSE, but pulls stuff on demand.
Every fuse magic I ever used was broken as hell (s3fs, dokan sshfs et al)
Fuse is not cross-platform.

Syncthing is a stand-alone cross-platform statically compiled utility which syncs files across multiple machines, no more, no less.

And quoting:
"Encrypt things at the highest layer you can."

I can say that the highest level in this case, is the protocol level.
The data arrives encrypted, and there will be no possible attacks left.

@bobobo1618

I'm no cryptographer and this is of course far from perfect but I'd be totally fine with file level encryption.

Just encrypt the metadata and contents of each file and folder and be done with it.

This would result in the untrusted server having an idea of the directory structure but that should be all and should be pretty useless.

@generalmanager

@cydron Thanks a lot for the comprehensive writeup and making your previous posts more readable.

We gotta fix and sort out the existing transport bugs before adding crypto.

Seconded.

I also agree that designing and implementing a state of the art encryption and authentication system with key management and key distribution would - at least at this point in time - be overkill for what it delivers: that only some nodes receive encrypted blocks, while other nodes receive decrypted blocks.

Thus encrypting it everywhere and offering easy access to the plaintext should be the way to proceed.

This gives us another nice feature for free:
We always have two layers of encryption in transport, which means nearly all possible vulnerabilities in TLS or our chosen ciphers/implementations would be much harder to exploit, because even a fatal flaw in the crypto would only unwrap the next layer of ciphertext instead of cleartext.

I am familiar with the ecryptfs and encfs audits, but it seems that it would be best to fix encfs upstream:
Quote from the EncFS Audit:

In conclusion, while EncFS is a useful tool, it ignores many standard
best-practices in cryptography. This is most likely due to it's old
age (originally developed before 2005), however, it is still being
used today, and needs to be updated.

The EncFS author says that a 2.0 version is being developed [3]. This
would be a good time to fix the old problems.

EncFS is probably safe as long as the adversary only gets one copy of
the ciphertext and nothing more. EncFS is not safe if the adversary
has the opportunity to see two or more snapshots of the ciphertext at
different times. EncFS attempts to protect files from malicious
modification, but there are serious problems with this feature.

Unfortunately it's not even clear if these problems will ever be fixed upstream:
README.md from https://github.com/vgough/encfs on the current status says:

EncFS has been dormant for a while. I've started cleaning up in order to try and provide a better base for a version 2, but whether EncFS flowers again depends upon community interest. In order to make it easier for anyone to contribute, it is moving to a new home on Github. So if you're interested in EncFS, please dive in!

The issues in the crypto are filed, but nobody seems to work on this at the moment.
IF they get fixed upstream, this could be a good way to go, because EncFS has the best shot at being usable as a cross-platform solution for encrypted storage.
There is even an Android app called cryptonite which can open and decrypt EncFS folders on unrooted devices and mount the fuse fs on rooted devices.

Regarding the portability of EncFS:
I tried to make a shared EncFS folder usable between Windows and Linux. For most operations this worked OK with Dokan/DokanX on Windows, but it was impossible to open, edit and save Microsoft Office documents with Microsoft Office. It just wouldn't write to the mounted device. It only worked with Boxcryptor Classic. I would like to hear more on how you got around those problems.

I'm definitely looking at Tahoe-LAFS... but I think it depends on whether we are running 'peers' at datacenters or on iPhones. Where is the user base located? Probably the latter.

I'm really interested to hear your thoughts on Tahoe-LAFS if you have the time.

I think the problem is that it's both. Many users want to have at least manual access to the decrypted files on their mobile devices, but the encrypted directory function is interesting because a cheap VPS solves the problem small P2P groups have of keeping at least one node with the latest state online at all times.

Could you take a look and tell us what you think about Syncany's concept:
http://syncany.readthedocs.org/en/latest/security.html
http://syncany.readthedocs.org/en/latest/concepts.html
This seems especially interesting:
http://blog.philippheckel.com/2013/05/20/minimizing-remote-storage-usage-and-synchronization-time-using-deduplication-and-multichunking-syncany-as-an-example/5/#Implications-of-the-Architecture

Sorry if I wasn't clear. First -- I used to design crypto systems for a defense contractor, so you are talking to a crypto engineer here, not an applied math major.. hah.

Was this a little jab at Telegram? ;-)

Just for good measure I wanted to point at https://github.com/jasonmccampbell/GoSodium, which is a (not completely finished) Go wrapper for libsodium, the praised cross-platform port of the NaCl crypto library, which was specifically designed to make secure, modern and fast crypto as easy to implement correctly as possible.
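To give a flavour of why NaCl gets praised, here is a minimal authenticated-encryption sketch using Go's x/crypto port of NaCl's secretbox (the function name `seal` is mine, purely illustrative, not part of any proposal here):

```go
package main

import (
	"crypto/rand"
	"io"

	"golang.org/x/crypto/nacl/secretbox"
)

// seal encrypts and authenticates msg in one call; there is no cipher,
// mode or MAC choice to get wrong, which is the whole point of NaCl.
func seal(key *[32]byte, msg []byte) ([]byte, error) {
	var nonce [24]byte
	if _, err := io.ReadFull(rand.Reader, nonce[:]); err != nil {
		return nil, err
	}
	// Prepend the nonce so the receiver can find it; it isn't secret.
	return secretbox.Seal(nonce[:], msg, &nonce, key), nil
}
```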

@Zillode @Natanji

You seem to have misunderstood @cydron as wanting the untrusted nodes to mount the encrypted data as a FUSE device, for which they would obviously need the key. This is not the case. He fixed the typos in his previous posts, and what he proposes is to encrypt on ALL nodes which share an encrypted directory and use FUSE or something similar to transparently allow access to the decrypted files on trusted devices.

@AudriusButkevicius

"Encrypt things at the highest layer you can."

I can say that the highest level in this case, is the protocol level.
The data arrives encrypted, and there will be no possible attacks left.

As the article by Thomas Ptacek, which we both referred to, as well as the EncFS audit make very clear, there are plenty of attacks left, especially if an attacker can observe how the encrypted data changes over time.

@AudriusButkevicius
The Syncthing Project member

Well if it's not at the protocol layer, then I don't even see how you could make this secure, as your shady VPS provider could grab your syncthing private key and start asking for unencrypted data from other peers. Also I don't see how de/encrypting at the block level via fuse suddenly solves the "know how data changes over time" problem.

@Finkregh

@cydron FUSE:

  • it makes things more complicated/layered. The beauty of syncthing at the moment is that I can use my normal filesystem, run one binary (which even patches itself, how cool is that) and I am about done. Mangling all files through another layer (FUSE) adds complexity which I suppose is not that easy to hide
  • it won't run on Windows as far as I know (I might be wrong there; maybe there is some easy-peasy-to-use FUSE that runs on Windows ;) )
  • (if I want something complex which I can only read/write via ftp/fuse/... I use Tahoe-LAFS (I love their concept, tho))
@generalmanager

@AudriusButkevicius

Well if it's not at the protocol layer, then I don't even see how you could make this secure, as your shady VPS provider could grab your syncthing private key and start asking for unencrypted data from other peers.

That's why it would be easier to not send anybody plaintext ever.

Which isn't too difficult if we use symmetric encryption for the data, and do the following:

1. Store only encrypted files on trusted AND untrusted nodes, give the key only to trusted nodes and present the decrypted files via something like fuse. If we don't want fuse, the trusted devices have to store the encrypted AND the decrypted blocks, doubling the storage requirements.

If we only want to store plaintext on trusted nodes, it becomes more complex:

2. Never send unencrypted blocks over the wire to anybody. Trusted peers receive the key to decrypt the encrypted blocks via a secure channel once. (This could be PGP-encrypted email, SCP, someone walking from machine to machine with a thumbdrive etc., depending on the threat model.) Trusted peers only store plaintext, but have to store the mapping between encrypted blocks and plaintext blocks. This means we need to save twice the amount of hashes. The trusted nodes also have to decrypt/encrypt all incoming/outgoing blocks on the fly, which will make it more resource-intensive.

The details of the implementation are certainly going to be interesting. As we should use something like AES-GCM, the question arises how we can make sure that the nonce is the same on all machines AND never used twice. This could be done by using a pseudo-random number with 96 bits of entropy for each block, because it is highly unlikely that the same number will be generated twice. The nonce is then sent unencrypted along with the ciphertext; that's OK because the nonce doesn't need to be secret. But it means that while a trusted node can decrypt a block, compare its hash to its list of hashes of decrypted blocks and thus throw it away if it already exists, the block still has to be sent over the wire. It also means that untrusted nodes have to store files multiple times if they are added separately on out-of-sync nodes.
This obviously sucks.
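For concreteness, here is a minimal sketch of the per-block scheme in 2., using Go's standard AES-GCM (all names are illustrative, not a protocol proposal): a random 96-bit nonce is generated per block and sent in the clear in front of the ciphertext.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"io"
)

// encryptBlock seals one sync block with AES-GCM under the shared key.
// The random 96-bit nonce is prepended: it must never repeat for a
// given key, but it need not be secret.
func encryptBlock(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // 16-, 24- or 32-byte key
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Wire format: nonce || ciphertext || tag.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptBlock reverses encryptBlock and verifies the GCM tag, so a
// tampering storage node is detected.
func decryptBlock(key, msg []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(msg) < aead.NonceSize() {
		return nil, errors.New("message too short")
	}
	nonce, ct := msg[:aead.NonceSize()], msg[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, nil)
}
```

The random nonce is exactly what makes this non-deterministic, so the duplicate-storage problem described above remains.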

But it could be solved with deterministic encryption (the same input always creates the same output for one key). If the same plaintext always produces the same ciphertext, the untrusted nodes can compare the hashes of ciphertext blocks, so they don't store files multiple times if those were added on different trusted machines while offline. And trusted hosts can compare a list of hashes of encrypted blocks to their own list, which means they don't waste traffic on files they already have. (Note: I used the term deterministic encryption a bit misleadingly here. AES, for example, is deterministic, but is made non-deterministic by using different IVs/nonces.)

Intuitively one could think about something like this:

3. Nearly everything is the same as in 2., but instead of a completely random nonce we use the (first 96 bits of the) hash of the unencrypted block (plus a shared secret, to protect against file confirmation and similar attacks) as the nonce.
This way the ciphertext is always the same for identical plaintext blocks, but it leads us into the barren lands of not-well-researched crypto and doesn't sound like a good idea:
https://crypto.stackexchange.com/questions/3754/is-it-safe-to-use-files-hash-as-iv

If anybody has more information on how this can be done securely, I'd be happy to hear about it.

So we are left with two options: asymmetric deterministic encryption and convergent encryption.
Let's take a look at asymmetric deterministic encryption first:

4. If we instead choose to use deterministic asymmetric encryption (like RSA without padding), we would have to create one shared key pair. This would itself have to be moved to all trusted devices (just as the key in 2. has to be).

However, there don't seem to be any widely used encryption schemes using deterministic encryption. Apparently some interesting work was presented at the Crypto 2007 and Crypto 2008 conferences, but I haven't really looked into this yet.

Also I am not sure if there are any modern encryption schemes with elliptic curves which would allow for fast asymmetric encryption with small keys.

Any information on those topics would be appreciated.
https://en.wikipedia.org/wiki/Deterministic_encryption

5. In convergent encryption files are encrypted with their own hash (more precisely: their hash and a static shared salt).
This gives us the very interesting property that it is possible to determine whether two encrypted blocks are identical without decrypting them first. This way untrusted nodes don't need to store some files multiple times, and trusted nodes can check if an encrypted block exists without having to transfer and decrypt it first.
That's what Tahoe-LAFS and Maidsafe use and seems to be one of the best ways to do this kind of thing.
A good writeup can be found here:
http://2π.com/14/convergent-encryption
Now we've won, right? Not quite, because how do the trusted nodes get the keys (hashes) to decrypt files?
We will send the hashes of the unencrypted blocks along with the encrypted blocks; the hashes are themselves encrypted with AES-GCM and a random nonce. The nonce is then appended to the encrypted hashes.
Convergent encryption has the problem that hashes of public files, or of files the attacker knows, can be precomputed and the ciphertexts compared. This means an attacker could prove that you stored a forbidden book/pirated movie/mp3 in your encrypted files. This and another, more nuanced attack can be prevented by adding a static salt to the cleartext of a file before creating the hash with which the file will be encrypted.
More details: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html

Also I don't see how de/encrypting at the block level via fuse suddenly solves the "know how data changes over time" problem.

It doesn't, I just wanted to point out that we have to be very careful which ciphers we use, because most are not resilient enough for a threat model where the adversary can see the ciphertext change over time. And if no node can be tricked into sending the cleartext, because they all just work with the ciphertext, which is stored on disc, that's a big bonus. At least if we don't want to store the ciphertext AND the cleartext on all trusted devices, effectively doubling the storage needs.

FUSE is certainly not the solution to all problems, but it's one of the easier ways to allow users on trusted nodes transparent access to the encrypted data with low implementation effort from syncthing, no storage overhead and only computing overhead when the files are accessed or changed. And it supports nearly all platforms.

@Finkregh
It's possible to use FUSE on Windows, as was previously discussed in this thread. But a second tool called Dokan/DokanX has to be installed.

@cydron
Could you take a look at this? If I made a mistake, it should be corrected as quickly as possible. Thanks!

@bobobo1618

You're really worrying me with the focus on FUSE.

it's one of the easier ways to allow users on trusted nodes transparent access to the encrypted data

For syncthing developers maybe, but try getting your grandma or even a mildly technologically proficient friend to set up encFS on Windows with Dokan, or try grabbing a few of your home files while you're at work or school and don't have admin access to your machine.

no storage overhead and only computing overhead when the files are accessed or changed

On platforms with low computing resources available, like the Raspberry Pi or Android devices, storage is often more plentiful than computing power.

And it supports nearly all platforms.

Show me a FUSE filesystem running on:

  • BSD
  • Android
  • iOS

BSD might not be a big deal for you but I doubt I'm the only one running this on an NAS. The latter two are pretty important to most users.

I don't think the tradeoffs here are nearly worth it. Dropping support for platforms, devices and use cases isn't a good decision for usability and adoption of the project. The current easy 'run a file' way Syncthing works is great and I think it should stay that way.

@AudriusButkevicius
The Syncthing Project member

@generalmanager I understand very little about crypto and I am not the most clever man on the planet to start with, but my initial ideas were as follows.
It does require a bit of understanding how syncthing works under the hood I guess, but you seem to be "on the ball".

Most of it echoes what you already said, but probably in more implementational terms.
Verify that you agree with what I said, and fill in any blanks if you have any answers.

First let's start with the fact that we have a constraint:
We are using a fixed block size (128kb), hence for N bytes of input our ciphertext has to be N or more bytes but not more than 128kb, which sort of corners us into using AES-ECB (google might lie, and there might be something better?).

My basic ideas which might not be secure, but should however make getting plaintext data harder:

Plan A (allows reusing blocks, leaks info about two identical blocks/files):

  1. Have a shared secret [1] .. (which as you say is shared via a postal pigeon or whatever)
  2. Encrypt sensitive metadata [2] in indexes with AES-whatever perhaps using first N bits of the first block hash as the IV (this one is not constrained by length limits as much)
  3. Sign Index/IndexUpdate messages using the shared secret (file versions changing will act as a random nonce) preventing replay attacks by an attacker advertising some random old signed index to the cluster (newest version wins) causing DoS.
  4. In the Index/IndexUpdate messages leave block hashes unencrypted. So plaintext hashes are on the encrypted machine. They will allow you to distinguish two identical 128kb blocks among any files, regardless of how it's stored on the disk. (As well as potentially a known plaintext attack given you have a plaintext block with the same hash? Some random license file in some git checkout for example...)
  5. As other peers start asking us for data for some random block (by hash) we just encrypt the block with the shared secret (constant IV or block hash as the IV?) and send that across. Nodes which have the shared secret decrypt and store plaintext, nodes which don't have the shared secret, store the ciphertext and trust that the given plaintext hash matches what's there.
  6. As we ask information from an encrypted node, the node does block lookup by plaintext hash, and provides us with ciphertext, which we can verify if it matches the plaintext hash once decrypted.

Pros:

  • Reduced bandwidth on encrypted nodes due to block reuse.
  • Potentially eventually reduced storage on encrypted nodes (makes me shake when I think how hard it would be to implement this)

Cons:

  • Fairly weak?

Plan B (prevents reusing blocks, does not leak info about two identical blocks, 1-3 same as before):

  1. Have a shared secret [1] .. (which as you say is shared via a postal pigeon or whatever)
  2. Encrypt sensitive metadata [2] in indexes with AES-whatever perhaps using first N bits of the first block hash as the IV.
  3. Sign Index/IndexUpdate messages using the shared secret (file versions changing will act as a random nonce) preventing replay attacks by an attacker advertising some random old signed index to the cluster (newest version wins) causing DoS.
  4. Compute cipher text hash during the scanning process, store cipher text hashes along with plain text hashes.
  5. Advertise Index/IndexUpdate with ciphertext block hashes. Given we use [3] or something else that prevents you from identifying two identical blocks, you will not be able to reuse anything.
  6. As other peers start asking us for data, we use [3] to encrypt it; the receiving device verifies that the received ciphertext hash matches the advertised hash, decrypts, hashes the plaintext, stores both ciphertext and plaintext hashes, and writes the data to a file.
  7. Encrypted devices work exactly like plaintext devices work now.

Pros:

  • More secure as blocks cannot be identified as exactly the same.

Cons:

  • No block reuse, potentially increased bandwidth.

Plan C (a big overhaul):

  1. Move away from fixed block size throughout the whole protocol.
  2. Use RSA + AES for [1]
  3. ???
  4. Profit

[1] Ideally I'd like a read-only secret and a read-write secret, but a block cipher as such does not seem to exist. Plus the constraint that we have means we can probably only use ECB without a major protocol rewrite and going to plan C.
[2] Mainly file/dir names, flattening out the folder hierarchy. You will still know if a file is a file and not a symlink, what permissions it has, etc.
[3] We could use the shared key + HASH(unencrypted filename + block index in the file/block hash) as the IV for encrypting the block.
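To make footnote [3] concrete, here is a sketch of a deterministic per-block IV plus length-preserving encryption (AES-CTR is used purely because it keeps the ciphertext within the 128kb block-size constraint; the earlier caveats in this thread about CTR and data that changes over time still apply, and every name here is illustrative):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"encoding/binary"
)

// blockIV derives the IV from the unencrypted file name and the
// block's index in the file, as footnote [3] suggests.
func blockIV(filename string, index uint32) []byte {
	h := sha256.New()
	h.Write([]byte(filename))
	var idx [4]byte
	binary.BigEndian.PutUint32(idx[:], index)
	h.Write(idx[:])
	return h.Sum(nil)[:aes.BlockSize] // first 128 bits as the CTR IV
}

// encryptBlockInPlace encrypts one block under the shared secret,
// producing ciphertext exactly as long as the plaintext.
func encryptBlockInPlace(sharedKey, data []byte, filename string, index uint32) error {
	c, err := aes.NewCipher(sharedKey)
	if err != nil {
		return err
	}
	cipher.NewCTR(c, blockIV(filename, index)).XORKeyStream(data, data)
	return nil
}
```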

@JIVS

Just wondering, is the possibility of using dedicated server-side software, ie: a version of syncthing specific for servers, completely out of the question?

By "insecure server" I don't assume you mean you don't have enough privileges to install and run software on it and only store files but simply a server that might be accessed by others without your consent.

@bitshark

Great discussion here... really interesting food for thought...

@generalmanager

As we've discussed, any 'crypto' is only as good as the weakest link. For example, regardless of any implementation (FUSE, kernel drivers, userspace encryption, whatever you choose)... If the 'master' host (for example a Windows machine) is infested with Malware/Keyloggers etc, then the whole point is moot because even a perfect implementation is compromised.

Same premise holds if our hardware is compromised / backdoored. There is a reason DARPA started X-ray'ing ASICs and FPGAs sourced from Asia, to ensure there were no hardware backdoors.

"Almost all FPGAs are now made at foundries outside the United States, about 80 percent of them in Taiwan. Defense contractors have no good way of guaranteeing that these economical chips haven't been tampered with. "

http://spectrum.ieee.org/semiconductors/design/the-hunt-for-the-kill-switch

How are we to be sure that the BIOS chips we use (or TPM etc.) are not backdoored? We can't. We can either accept or reject the premise, but that's about all we can do.

And so any security will only be as good as its assumptions.

Then we have the ironic fact that most crypto is never broken based on the cipher, but rather on implementation goof-ups, or more insidiously -- side-channel attacks.

I think the best example of this is the 'padding oracle' attack, which is theoretically possible against any block cipher operating in CBC mode. In practice this resulted in the Lucky13 attack against SSL/TLS, where the problem was thought to have been fixed!

http://www.isg.rhul.ac.uk/tls/Lucky13.html

"Any person can invent a security system so clever that she or he can't think of how to break it."
-Schneier's law

Even the best we can build is only as strong as the weakest link.

There is also the balance between ease of use / ease of installation vs. security, as well as the tension between 'scope creep' / reinventing the wheel and the fact that a P2P block exchange may need its own solution to be optimal.

encFS Results

As for encFS, I've got my own issues with it... It's not suitable for cloud storage beyond a 'single snapshot' model. The main issues, I think, are that (1) any 'fix' of encFS would necessarily break all backward compatibility with previous versions, and (2) it's 10+ years old now, so it would take significant work to bring it up to the state of the art.

@AudriusButkevicius

I think what you have outlined is a fair approach, in terms of only approaching the problem of securing data 'in transit'. This lets the user manage their own solution in terms of local crypto, whether encFS, dm-crypt, Truecrypt, or otherwise. We do run into the problem that perhaps these 'user determined' crypto solutions are not ideal for a P2P network which exchanges small block-like chunks.

But if the user implements the same crypto globally, then Syncthing does not really care what it's transporting, which is a nice abstraction that simplifies life for anyone contributing code.

I agree that the particular enhancement detailed in this GitHub issue (baking in crypto for storage on untrusted nodes) opens up a whole can of worms and is a pretty good example of 'scope creep'. So I think we're in agreement that this probably isn't feasible given the constraints, certainly not before nailing down existing issues and closing them out.

Beyond that, as I'm sure you are aware, selective encryption on a per-node basis opens up a new set of problems regarding the synchronization of encrypted vs. unencrypted blocks... not to mention problems of key management, access permissions, key revocation, various chosen-ciphertext attacks, block vs. stream ciphers, selection of appropriate IV generation, operating modes, and so forth.

On and on... My feeling at the moment is to either (1) let the user handle their own crypto problem, or (2) take a long-term view and really implement a novel solution based on proven cutting-edge techniques in cryptography -- specifically those relating to cloud storage, authenticated encryption, and so forth.

In the latter case, it would be ideal to have an 'off-the-shelf' solution that could be dropped in... The only reason I think to DIY is if there were a clever way around some of the limitations... Certainly the current state-of-the-art in crypto literature is not focusing on creating secure P2P applications.

@generalmanager

Thanks for clarifying my points -- after our conversations, I think you have done a good job of summarizing my thoughts on the issue, which have changed somewhat as I have done additional research.

"IF they get fixed upstream, this could be a good way to go, because EncFS has the best shot at beeing usable a cross-platform solution for encrypted storage."

This was my original thinking, but as I've delved deeper, I simply don't think EncFS is feasible as a solution UNLESS it's released as a "2.0" version -- which completely dispenses with legacy code... I do like the 'filesystem' level encryption which is convenient, but I think gains in convenience are a tradeoff with security.

Here is an example of one of the problems I ran into just today using EncFS on Linux... I tried using EncFS coupled with the google-drive-ocamlfuse module for mounting Google Drive as a 'shared' network drive. Now, I had no problems with the encFS FUSE driver, but it's more of an end-to-end integration headache... Note that this commentary is independent of Syncthing, and is simply looking at using encFS locally to mirror content to an encrypted folder on Google Drive (with Google Drive mounted in Linux as a sort of 'network share' showing up as a local dir).

The idea here is to drop large amounts of unencrypted files into a mountpoint, have them transparently encrypted and uploaded to Google Drive without wasting tons of local storage making copies of everything.

Some major problems right out of the gate... The Google Drive FUSE driver is a pain to compile since it's written in OCaml. Even after I got it working, I was astonished to find there is no really decent built-in support for caching or buffering. (This highlights a major advantage of something like Syncthing, which breaks such transport into manageable blocks.)

Anyway, here is what happens if I copy a large (4 GB) file to my encFS directory (linked to the Google Drive remote network mount)... A 'cp' of some large files to the encFS plaintext input dir simply 'hangs' while (1) encFS and the Google Drive OCaml driver encrypt the file and metadata and write them to the Google Drive network mount point, and (2) we are subject to network-level failures (wireless AP disconnects, etc.). It's slow as heck, but it works as long as there is no transport failure.

But a wireless AP disconnect during copying in this setup will easily corrupt or ruin large gigabyte-sized files.

Given these issues on a native client on Linux, there's a whole set of problems in finding the optimum between a 100% local mirror of content (which requires extensive local disk space) vs. a 100% 'network attached' drive with minimal caching (where network failures while large files are in transit can cause data loss and corruption).

This is sort of a microcosm of similar issues which apply to Syncthing, albeit to a lesser extent since it's not 'all or nothing' in the transport sense. But I do have concerns with the lack of multithreaded / asynchronous performance of both encFS and the Google Drive OCaml driver. The performance of encFS-FUSE was quite poor in terms of handling 'below the surface' transport failures.

In plain English: if I were to disconnect and reconnect to the wireless access point while in the middle of copying data to the encFS plaintext dir, which is then writing to a 'network attached' Google Drive, the network connection loss would corrupt the file in the process of being copied.

BOXCRYPTOR

Regarding encFS vs. BoxCryptor -- after the release of BoxCryptor Classic (which is hypothetically compatible with encFS), BoxCryptor ditched all compatibility with encFS for the 2.x series. So new versions of BoxCryptor are completely incompatible with encFS.

"Just for good measure I wanted to point at https://github.com/jasonmccampbell/GoSodium, which is an (not completely finished) GO wrapper for libsodium"

NaCl and the related wrappers are awesome. Good point. If I personally ever had to implement crypto that was 'roll your own', I'd absolutely say 100% the way to go is with NaCl or related libraries and wrappers. They are a fantastic compromise between using the bloated (but tested) OpenSSL libraries and rolling a custom solution (which could easily introduce major vulnerabilities, which NaCl could help prevent).

EncFS Attacks

"As the article by Thomas Ptacek, which we both referred to, as well as the EncFS audit make very clear there are plenty attacks left, especially if an attacker has knowledge about how the encrypted data changes over time."

This specifically is a huge concern for me, and is why I don't think encFS is feasible for any sort of remote storage which 'incrementally' changes on any regular basis. encFS might be fine for a 'one time' backup, or twice a year backup, but anything that gives a remote node insight into filesystem changes over time is going to be a major problem in ANY modern cryptosystem...

It's far worse with 10 year old technology like encFS.

So I agree there -- to me the limitation (more than FUSE, driver compatibility, trust models, or portability) is actually the fact that encFS is not suitable for cloud backup storage on 'untrusted' nodes. Not without some major updates anyway.

In fact, I think there are very few solutions suitable for this purpose besides more advanced technologies like convergent encryption, authenticated encryption, and so forth.

"Also I don't see how de/encrypting at the block level via fuse suddenly solves the "know how data changes over time" problem." -Audrius

You're right -- it doesn't. This is a major problem that keeps cropping up as I think about how to set up a good P2P network backup / sync system.

There's the critical issue that's become apparent with additional research -- how ciphertext data changes over time may allow an adversary to break the entire system. I'm not sure any current cryptosystem is equipped to resist these sorts of attacks (where a potential attacker gets N snapshots of a ciphertext filesystem as it changes over time, where N could be quite large).

I don't know what the solution is for this issue. Even using a P2P sync tool with a Truecrypt file container could be problematic. I suppose it's a matter of balancing the threat vs the countermeasures. Usability vs Security, etc.

@Finkregh

I understand the objection regarding the complexity of setting up Dokan on Windows -- any bundled solution needs to work out of the box. You would probably be okay if there was an installer that 'worked' regardless of the underlying solution, right?

I think what we are discussing here is not so much FUSE vs. not-FUSE... it's more questions such as (1) should we bother with this aspect of baking in filesystem encryption (probably not, hah), (2) are there good already-written solutions available off the shelf?, (3) what is the state of the art in terms of P2P encryption, (4) benefits vs. drawbacks of various solutions, etc.

The whole FUSE vs non-FUSE debate is really not the core of the issue, because most solutions developed in FUSE could be ported out of FUSE with enough time and effort. FUSE is simply conducive to rapid prototyping and testing, as it's just an abstraction layer on top of the common system calls for file/folder interaction.

@generalmanager

Your post on deterministic asymmetric encryption vs convergent encryption is a really good overview. I'll check out some of your links and get back to you.

"This gives us the very interesting property, that it is possible to determine if two encrypted blocks are identical without decrypting them first."

This is a huge benefit for the idea of a distributed p2p system, where we are exchanging data on the 'block' level of arbitrary size (say 1k to 1024k). Ideally, if you and I are on the same network, and we both have a copy of the same movie (perhaps encrypted with different IVs or what-have-you) -- can we mutually share the blocks to accelerate synchronization?

And if that's possible, what do we lose by doing so?

"That's what Tahoe-LAFS and Maidsafe use and seems to be one of the best ways to do this kind of thing."

I agree, based on my somewhat limited knowledge of convergent encryption. But this allows 'watermarking' attacks, no? I.e. some omniscient entity can prove you and I possess a copy of 'TransformersTheMovie.avi' or whatever, even if they cannot decrypt the movie from any of our encrypted shares?

Certainly an open-source solution vulnerable to watermarking is preferable to a binary-only client vulnerable to watermarking (i.e. Syncthing vs. BitSync)... Syncthing would be far superior in this case, since at least we can be sure it's not backdoored.

"This means an attacker could proof that you stored a forbidden book/pirated movie/mp3 in your encrypted files"

Okay, that's what I thought. I wish there was a way around this. Perhaps there is? For example, utilizing a 'keyed hash function' to calculate the block hashes for a file we're sharing, where the input to the keyed hash function is related to a shared secret or the result of key agreement?

I guess anything involves trade-offs.

@AudriusButkevicius

I think what you have proposed in the latest message is on the right track.

My suggestion is that any implementation uses a construct of 'authenticated encryption', where the ideas of encryption and HMAC are combined into a basic primitive. The new ChaCha20 TLS standards have this 'baked in' -- whether at the transport level or otherwise.

As you've suggested, AES-GCM is another example of authenticated encryption for a block cipher, though not my personal favorite... It is off-patent and already included in OpenSSL, which is nice.

Personally, for block ciphers, I like OCB mode, but unfortunately this is on-patent. It's free to use for open-source non-commercial purposes though.

http://en.wikipedia.org/wiki/Authenticated_encryption

http://blog.cryptographyengineering.com/2012/05/how-to-choose-authenticated-encryption.html

But regardless, I think we have numerous opposing forces here...

(1) Utilization of 'baked in' crypto vs 'Let the user run TrueCrypt'

(2) Level of Effort and Scope-Creep vs. Broad Spectrum of Applications

(3) Prevention of watermarking attacks etc vs Block P2P inter-operability

(4) FUSE type drivers vs. 'Works out of the box'

(5) Low-level (block/loopback device) vs. High-level (VFS or file/folder encryption)

Great discussion, best ideas I've seen in a long time. Don't want to get too sidetracked from any short term goals, but I think the last few pages of comments really get to the core of the issues regarding client-side and remote-side encryption.

@bitshark

Also, I do like the idea of a non-hardwired block size, but I agree it'd be a major overhaul.

Maybe a solution is to do key agreement on a separate shared secret K_dht -- call it a 'session DHT key' or something that's agreed upon using some decent DH primitive.

For any given file, the file's hash Fh = HMAC(K_dht, filedata). The block hash for a particular block in a file at index idx is HMAC(K_dht, Fh + idx + blockdata).

Then the only people that can derive the DHT are those with the shared secret.

Something like that just as a first idea, anyway.

Perhaps combining HMACs, authenticated encryption, and the idea of 'tweakable ciphers' (like XEX mode, the basis for XTS) can allow a balance between block-level sharing, untrusted storage endpoints, and resistance to watermarking.
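Transcribed directly into code, those two formulas might look like the following (SHA-256 is an arbitrary choice of HMAC hash on my part, and K_dht and friends are just the names from above, not a settled design):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
)

// fileHash computes Fh = HMAC(K_dht, filedata).
func fileHash(kDHT, filedata []byte) []byte {
	mac := hmac.New(sha256.New, kDHT)
	mac.Write(filedata)
	return mac.Sum(nil)
}

// blockHash computes HMAC(K_dht, Fh + idx + blockdata). Only peers
// holding the shared secret K_dht can derive or recognise these
// identifiers, which is the point of keying the DHT.
func blockHash(kDHT, fh []byte, idx uint32, blockdata []byte) []byte {
	mac := hmac.New(sha256.New, kDHT)
	mac.Write(fh)
	var i [4]byte
	binary.BigEndian.PutUint32(i[:], idx)
	mac.Write(i[:])
	mac.Write(blockdata)
	return mac.Sum(nil)
}
```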

@bitshark

Okay, so there's a way around the 'watermarking' problem of convergent encryption.

Convergent Encryption (standard / vulnerable):
Fkey = SHA1(Plaintext)
Ciphertext = AES(Fkey, Plaintext)
Locator_Hash = SHA1(Ciphertext)

Convergent Encryption (keyed / resistant):
Fkey = HMAC_SHA1(Skey, Plaintext)
Ciphertext = AES(Fkey, Plaintext)
Locator_Hash = SHA1(Ciphertext)

In the latter example, only those who know Skey can conduct 'proof-of-file' and related attacks, thus Skey is shared among all nodes in a cluster which are sharing files.

Using the latter example with AES-CTR mode and a public per-file IV (derived deterministically, e.g. from Fkey, so that deduplication still works), we actually get completely random access to file blocks.
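As a sketch only (in Go, with SHA-256 standing in for the SHA1 above and AES-CTR with a deterministic IV; none of this is a vetted design), the keyed variant could look like:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/sha256"
)

// convergentEncrypt derives the content key from the plaintext under
// the cluster-wide secret skey, so identical plaintexts give identical
// ciphertexts inside the cluster, while outsiders without skey cannot
// run proof-of-file attacks. A zero IV is acceptable here because
// every distinct plaintext gets its own distinct key.
func convergentEncrypt(skey, plaintext []byte) (ciphertext, locator []byte, err error) {
	mac := hmac.New(sha256.New, skey)
	mac.Write(plaintext)
	fkey := mac.Sum(nil) // Fkey: 32 bytes -> AES-256

	block, err := aes.NewCipher(fkey)
	if err != nil {
		return nil, nil, err
	}
	iv := make([]byte, aes.BlockSize) // deterministic (all-zero) IV
	ciphertext = make([]byte, len(plaintext))
	cipher.NewCTR(block, iv).XORKeyStream(ciphertext, plaintext)

	loc := sha256.Sum256(ciphertext) // the "locator" peers compare
	return ciphertext, loc[:], nil
}
```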

quote:

The general way such algorithms work is as follows:

The object to be encrypted is validated to ensure it is suitable for this type of encryption. This generally means, at a minimum, that the file is sufficiently long. (There is no point in encrypting, say, 3 bytes this way. Someone could trivially encrypt every 3-byte combination to create a reversing table.)

Some kind of hash of the decrypted data is created. Usually a specialized function just for this purpose is used, not a generic one like SHA-1. (For example, HMAC-SHA1 can be used with a specially-selected HMAC key not used for any other purpose.)

This hash is called the 'key'. The data is encrypted with the key (using any symmetric encryption function such as AES-CBC).

The encrypted data is then hashed (a standard hash function can be used for this purpose). This hash is called the 'locator'.

The client sends the locator to the server to store the data. If the server already has the data, it can increment the reference count if desired. If the server does not, the client uploads it. The client need not send the key to the server. (The server can validate the locator without knowing the key simply by checking the hash of the encrypted data.)

A client who needs access to this data stores the key and the locator. They send the locator to the server so the server can lookup the data for them, then they decrypt it with the key. This function is 100% deterministic, so any clients encrypting the same data will generate the same key, locator, and encrypted data.

http://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure

@AudriusButkevicius
The Syncthing Project member

I assume skey is a shared secret?
Why can't you just use skey in the second example, instead of a hash of skey + plaintext, as the key to the AES function?

@bitshark

(A) Yes, skey is a shared secret.
(B) I'm not 100% clear on why you can't just use skey in example two... so I don't know. But I think the idea is that we want there to be a certain amount of 'determinism' to the encryption so that we don't waste bandwidth and space re-sharing the same file over and over.

In other words, if you and I both have a copy of 'Transformers2.avi' and we have the same shared secret, then if I request to upload the movie to you... the P2P network (aka you) will say don't worry about it, we already have a full copy of that file matching this hash ("locator"). BUT the network can only know that the Transformers2.avi file already exists on your computer by comparing the hash of a deterministically generated ciphertext.

File_Locator = SHA1(Transformers2.avi.aes)

skey is a shared secret across all users sharing files in a 'cluster'... In the latter example above, it just acts as a 'tweak' to prevent "confirm that you or I have a copy of Transformers2.avi"-type attacks, unless the attacker knows skey.

The idea is basically that you can request a file from the server (or peer) by asking for its file locator (the hash of the mostly-deterministically encrypted ciphertext), and see whether the file is already stored or not. This principle can supposedly apply to fixed or variable length 'chunks' as well, as in the sketch below.
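A toy Go sketch of that per-chunk version (fixed chunk size; the zero IV is only tolerable because the key is content-derived - all of this is illustrative, not a vetted design):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/sha1"
	"fmt"
)

const chunkSize = 8 // tiny for the demo; real systems use e.g. 128 KiB

// chunkLocators splits the plaintext into fixed-size chunks and
// computes a convergent locator per chunk, so dedup can happen at
// chunk granularity rather than for whole files only.
func chunkLocators(skey, plaintext []byte) []string {
	var locators []string
	for off := 0; off < len(plaintext); off += chunkSize {
		end := off + chunkSize
		if end > len(plaintext) {
			end = len(plaintext)
		}
		chunk := plaintext[off:end]

		// Per-chunk key: Fkey = HMAC_SHA1(Skey, chunk)
		mac := hmac.New(sha1.New, skey)
		mac.Write(chunk)
		fkey := mac.Sum(nil)[:16]

		// Deterministic encryption: a fixed IV is used only
		// because the key already varies with the content.
		block, _ := aes.NewCipher(fkey)
		ct := make([]byte, len(chunk))
		cipher.NewCTR(block, make([]byte, aes.BlockSize)).XORKeyStream(ct, chunk)

		loc := sha1.Sum(ct)
		locators = append(locators, fmt.Sprintf("%x", loc))
	}
	return locators
}

func main() {
	fmt.Println(chunkLocators([]byte("skey"), []byte("hello chunked world")))
}
```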

The two examples above are functionally identical; the latter is just more secure, and probably more appropriate for this discussion, where we are not sharing 'with the world' on BitTorrent but rather on small private P2P networks.

The main point of convergent encryption is that it allows 'de-duplication' -- meaning that if we've already stored one copy of a file, then the server is smart enough not to store a second copy of the file -- but rather it stores just a reference to it in some sort of metadata / mapping table.

I know this particular scheme (convergent encryption) is utilized in Tahoe-LAFS and Bitcasa... It may be utilized in BitSync as well, though I'm unclear on that.

I'm still trying to understand it fully, so my apologies if my explanation is not very good.

Check out the pdf linked below at the end -- it actually discusses convergent encryption in the context of fixed and variable length 'chunks' of a plaintext file.

Here are two other good links:
https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html

http://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure

and this...

"Both models of our secure deduplication strategy rely on a num-
ber of basic security techniques. First, we utilize convergent en-
cryption [10] to enable encryption while still allowing deduplica-
tion on common chunks. Convergent encryption uses a function
of the hash of the plaintext of a chunk as the encryption key: any
client encrypting a given chunk will use the same key to do so, so
identical plaintext values will encrypt to identical ciphertext values,
regardless of who encrypts them. While this technique does leak
knowledge that a particular ciphertext, and thus plaintext, already
exists, an adversary with no knowledge of the plaintext cannot de-
duce the key from the encrypted chunk. "

Here are two resources that helped me so far:

(1) Secure Data Deduplication
http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf

(2) Source code for Convergent Encryption in Python (includes both example 1 and example 2)
https://github.com/HITGmbH/py-convergent-encryption

@AudriusButkevicius
The Syncthing Project member

So you can't verify that someone has something unless you have skey, because you need it to generate the locator to check whether someone has a given file.

If you managed to get a locator from someone for the file you want to verify against, the fact that the content was encrypted using a hash over the plaintext is meaningless, as you've already got the locator.

If you have skey, all bets are off, and there is no encryption left at that point.

The only sensible reason I can come up with is that, given you have the plaintext and the ciphertext of some other file, it's not possible to recover skey, because the ciphertext is encrypted with HMAC(skey, plaintext) rather than skey; the only thing you can recover is HMAC(skey, plaintext), which is not good enough to decrypt a different file.
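A trivial Go check of that property (reusing the HMAC construction from the sketch above; purely illustrative):

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha1"
	"fmt"
)

// Fkey = HMAC_SHA1(Skey, Plaintext), as in the scheme above.
func deriveFileKey(skey, plaintext []byte) []byte {
	mac := hmac.New(sha1.New, skey)
	mac.Write(plaintext)
	return mac.Sum(nil)[:16]
}

func main() {
	skey := []byte("cluster-shared-secret")
	keyA := deriveFileKey(skey, []byte("file A"))
	keyB := deriveFileKey(skey, []byte("file B"))
	// Recovering keyA (say, from a known plaintext/ciphertext pair)
	// reveals neither skey nor keyB, because HMAC can't be inverted.
	fmt.Println("independent per-file keys:", !bytes.Equal(keyA, keyB))
}
```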

@AliceBlitterCopper

If I need to "sell" this feature to non-tech users, I'd say:
"Now you can set up a family-cloud or friends-cloud in which some of your private data can be stored redundantly: perfect, for example, for backing up the photos you don't want to give away but still need a backup of. Store them at your friends' place."
I really would like to use it this way, instead of storing physical hard drives at several locations, for example.

@bademux

@Zillode let's call it "mirroring".
For me it is all about mirroring some data on different devices without setting up encfs/LUKS.

@AliceBlitterCopper

@Zillode @bademux Yes, that's a very good point. Let's not suggest too high an SLA ;).

"... allows you to setup an 'attic' you can share with your family and friends, where your boxes are sealed and can only be opened by you if needed. It's not a replacement for a data-bunker or data-safe to store your most precious data. Sometimes an attic is what you need, though"

@eyeson

Going to throw my vote in here for this. Having recently moved over to ST from BitTorrent Sync due to their whole faff, this is a feature I am missing.

@benguild

@eyeson I think a lot of us are in that same boat. We'd like to switch from BTS to ST and can't because it's missing this feature.

@djtm

I'm thinking right now it would be cool to have something generic for this task. Sort of a reverse ecryptfs: instead of an encrypted version on disk and a virtual unencrypted folder like ecryptfs, you'd have the files unencrypted on disk plus a folder which shows them only in encrypted form. Then you could use that folder and sync the encrypted version to other nodes with syncthing. I guess it should not be too hard to implement based on ecryptfs either, as all the pieces are already there; they just need to be plugged together in a different order...

@djtm

And someone already had the idea: https://bugs.launchpad.net/ecryptfs/+bug/376580
Oh and look, it's already implemented in encfs as encfs --reverse: http://superuser.com/questions/378249/is-there-a-way-to-encrypt-a-mounted-file-system-for-off-site-backup/379234#379234

from the encfs man page:

--reverse
Normally EncFS provides a plaintext view of data on demand. Normally it stores enciphered data and displays plaintext data. With --reverse it takes as source plaintext data and produces enciphered data on-demand. This can be useful for creating remote encrypted backups, where you do not wish to keep the local files unencrypted.
For example, the following would create an encrypted view in /tmp/crypt-view.
encfs --reverse /home/me /tmp/crypt-view
...

Of course you could also simply encrypt everything with ecryptfs and then sync the encrypted version, which is probably even safer. But even though encfs is not "cloud safe", it is arguably a lot safer than btsync.

@bademux

@djtm ecryptfs is a slow and buggy piece of sht. 2 or 3 years ago I tried to do the same trick with encfs --reverse for dropbox-like services and just lost my time - the resulting "frankenstein" was too unstable and slow.
Wild (and incompatible with the rest of the world) guess:
maybe it is worth a try to use the brand-new ext4 encryption support in Linux 4.1?

@djtm

@bademux I'm using the ecryptfs shipping with Ubuntu for my home directory. I never had any issues of any kind. The only thing is that it's a bit difficult to mount an ecryptfs directory from a bootable Linux distribution. I'm not worried as much about speed as about security. The whole point of encryption is that it's reliable. encfs currently allows various attacks which are especially problematic within the cloud. However, I believe it might be the better option to fix encfs, which has undergone security reviews by cryptography experts, than to invent a new encryption scheme here which will most likely end up being either a ton of work or mostly snake oil. As much as I'd love to see this feature implemented...

ext4 encryption will certainly not help, as the encrypted files will be invisible to syncthing.

@MyPod

As per the previous comments on this bug/enhancement request, @djtm, ecryptfs/encfs isn't good enough for something that is sent over wires you don't own and can't control, as changes within the structure of your database/filesystem can reveal information. Keep in mind also that the solution has to be OS-independent and, in particular, work with Windows (OSX could be easier to work with, from what I know and understand), as well as being something (relatively) simple and easy to do that doesn't go beyond the scope of Syncthing.

@calmh
The Syncthing Project member

The requirement is clear; further discussion about possible workarounds and their limitations fits better in the forum.

@calmh calmh locked and limited conversation to collaborators May 19, 2015
@calmh calmh removed this from the v1.0-maybe milestone Nov 17, 2015
@calmh calmh modified the milestone: Unplanned Jan 1, 2016