Integrated encryption #7
Comments
I'd also include in requirements that it should not trust any single template (including VMs based on a specific template) with cleartext. While template compromise is unlikely and fatal already (for data of VMs based on it), spreading impact to all the VMs is even worse. This means encryption should be done either in dom0, or some entity (unikernel-like?) used only for backups encryption and nothing else. Or use different VMs depending on what data is encrypted (probably too complex).
Thanks, Marek. That is what I was alluding to in references to isolation potential and low interactivity (as well as in the Readme, where it states that untrusted guest volumes are handled safely), but it's best to make it explicit. I had some discussion with the author of CryFS about this issue, backing up from an isolated admin VM, but s/he didn't seem to appreciate why anyone would isolate encryption functions in a disconnected admin environment.
Changing the milestone to v0.4 as that will be the version that gets experimental encryption support.
Discussion should move back to QubesOS/qubes-issues#1293
Locking for now due to extraneous noise.
AES mode selection – Some interesting AES encryption modes (subject to change):

Non-authenticating:

Authenticating:
OCB is said to be much faster than other authenticating modes. IMHO, it's uncertain whether an authenticating mode is necessary here, for a couple of reasons: 1) Wyng already has a basis (hash manifests) for validating chunks of data. 2) The "hash last, then validate hash first" advocates appear to base their argument on attacks that mainly work on network data streams. If that is the case, then there is less to be concerned about in selecting between these modes.

It is also worth assessing the risk of decrypting an (initially) unvalidated ciphertext. My understanding is that a symmetric cipher like AES and popular hashing algorithms are closely related and fall under the class of finite state machines. Therefore, if the very next operation after decrypt() is always either "hash + compare with manifest hash" or "discard", then I think this is safe and secure.

The largest issue in selecting a mode is probably the degree of uniqueness required for the IV/nonce, which will be something to consider going forward. IANAC – this is all open to debate, so convince me otherwise. :)
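As a rough illustration of the "decrypt, then hash + compare or discard" ordering described above, here is a minimal Python sketch. The cipher is a toy XOR stand-in so the example is self-contained, and the names (`recover_chunk`, `manifest_hash`) are illustrative assumptions, not Wyng's actual code:

```python
import hashlib
import hmac
from typing import Optional

def toy_decrypt(key: bytes, chunk: bytes) -> bytes:
    # Stand-in XOR "cipher" so the example runs on its own -- NOT real
    # crypto; in Wyng this would be the AES decrypt step.
    stream = hashlib.sha256(key).digest() * (len(chunk) // 32 + 1)
    return bytes(a ^ b for a, b in zip(chunk, stream))

def recover_chunk(key: bytes, chunk: bytes, manifest_hash: bytes) -> Optional[bytes]:
    """Decrypt, then immediately hash the plaintext and compare it against
    the trusted manifest hash; discard (return None) on any mismatch."""
    plaintext = toy_decrypt(key, chunk)
    if not hmac.compare_digest(hashlib.sha256(plaintext).digest(), manifest_hash):
        return None   # discard: unverified plaintext never leaves this function
    return plaintext
```

The point is purely the ordering: the plaintext of a chunk is never handed onward unless its hash matches the manifest entry, and the comparison is constant-time.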
Encryption work has started in branch wip04, at a POC stage, and as yet cannot be used securely (!) as it leaves the key exposed.... Use only test volumes with this. The metadata will be saved under a separate 'wyng.backup040' dir instead of the usual one, so there is no need to change the meta dir manually for tests. Encryption is enabled by default and uses AES-256-CBC mode cipher. Currently it encrypts+decrypts data only (not metadata). To this extent, "it works."

After testing various cipher modes, I settled on CBC. It is quite secure ("catastrophic" failure of confidentiality is rare/limited) and, since the Python crypto libraries don't appear to be parallelized, it's one of the better performers too. SIV mode had less than half the throughput of CBC in my tests.

I also want to note that I took a fairly harmless liberty using encrypt() in the hope that collision resistance would be improved: a 128-bit random "bolster" is added to the beginning of each plaintext chunk just before encrypting. With a chaining or cascading mode like CBC, I expect this should protect the actual data better than the IV alone. Of course, informed comments on any of this are very welcome.
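A minimal sketch of that "bolster" idea, written against the `cryptography` package (which this thread later switches to). Function names and the (iv, ciphertext) framing are assumptions for illustration, not Wyng's actual implementation:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives import padding

def encrypt_chunk(key: bytes, plaintext: bytes):
    """AES-256-CBC with a random 128-bit 'bolster' prepended to the
    plaintext before padding.  Returns (iv, ciphertext)."""
    iv = os.urandom(16)
    bolster = os.urandom(16)              # randomizes the first cipher block
    padder = padding.PKCS7(128).padder()
    padded = padder.update(bolster + plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv, enc.update(padded) + enc.finalize()

def decrypt_chunk(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return (unpadder.update(padded) + unpadder.finalize())[16:]   # strip bolster
```

Because CBC chains every block into the next, the random bolster block perturbs the entire ciphertext, on top of the randomization the IV already provides.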
New branch 'wip04b' created with the crypto library switched from pycryptodome to cryptography (python3-cryptography package). The reason is that cryptography is benchmarking around 35% faster for the same AES-256-CBC cipher, meaning that pycryptodome is exacting a 50% performance penalty. This was too large to ignore, so I decided to switch now before the code got too dependent on the slower library. Another plus is that cryptography has been available in OS repositories longer and more consistently. Another small change is that encryption happens only just prior to the data buffer being sent, instead of being encrypted and possibly not being sent because of deduplication. Also, MAC tests for receive/verify/diff are now done with the secrets library. Some changes that are needed next:
Once these are implemented, we should have a reasonably secure encryption scheme for Wyng archives.
AES-256-CBC is a poor choice. Not only is it very slow when encrypting (far more common than decrypting), it does not provide authentication, which leaves it vulnerable to chosen-ciphertext attacks. An AEAD cipher such as AES-256-GCM or ChaCha20-Poly1305 is a far better choice. AES is only a reasonable option on platforms where it is hardware accelerated; if portability to other platforms matters, use ChaCha20-Poly1305.

AEAD ciphers usually have short nonces, which must never repeat for a given key. These nonces are too short to be safely chosen at random, and persistently storing a nonce is a recipe for disaster.

Finally, AEAD APIs should require the entire buffer to be passed in one operation. This makes them unsuitable for encrypting large individual messages. Instead, a streaming API should be used. Manually implementing such an API is error-prone. Is https://download.libsodium.org/doc/secret-key_cryptography/secretstream an option? That provides a high-level API for encrypting a sequence of messages, which is what a backup system needs. I strongly recommend just using libsodium for this, rather than trying to implement something similar by hand.
Wyng is already awash in data hashes, which puts it in the position of using CBC mode to its advantage. GCM is based on CTR mode, which has the potential for catastrophic confidentiality failures not present in CBC. Also see my comments above, where the Python library's CBC was found to be about as fast as GCM. If you think Wyng's data verification is an issue, that should be addressed separately, as it exists already and independently from encryption––and that will not change without compelling arguments.

OTOH, that is not to say an AEAD mode won't be added. I think SIV (rather slow) or GCM-SIV (faster, but currently unavailable from OS repositories) would have acceptable confidentiality safeguards. But GCM confidentiality appears too weak on its own (and hence why GCM-SIV was developed). XChaCha20-Poly1305 (X- with beefed-up IV space) also looks interesting, although I'd prefer to see some discussion about non-stream applications, and a formalization of the ChaCha20/XChaCha20 protocol, first. Note that these recent developments (GCM-SIV and XChaCha20) indicate confidentiality weaknesses of prior modes.

The current emphasis on AEADs is controversial. Probably, if Wyng were a network protocol and not an at-rest storage format, I would agree AEADs are compelling. But that is not the case here.
My reading of current practice is that (besides nonce storage being accepted and usually mandatory) new keys are generated when nonce/IV space is exhausted. A particular mode may also have a requirement that an IV is unpredictable. So I think it's more likely Wyng could use a nonce/IV that combines a unique counter with a random portion. A 64-bit counter would accommodate (with the smallest chunk size) 2^64 × 65536 bytes (roughly a yottabyte) of backed-up disk space.
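A counter-plus-random IV along those lines could be sketched as follows. The names and the split (big-endian 64-bit counter half, 64-bit random half) are illustrative assumptions:

```python
import secrets
import struct

class IVFactory:
    """Sketch: a 128-bit IV combining a monotonic 64-bit counter with a
    64-bit random field.  Not Wyng's actual code."""
    def __init__(self, start: int = 0):
        self.counter = start

    def next_iv(self) -> bytes:
        self.counter += 1                       # advance before use
        return struct.pack(">Q", self.counter) + secrets.token_bytes(8)
```

The counter half guarantees uniqueness within a key's lifetime, while the random half adds unpredictability for modes that require it.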
It depends. An incremental backup session records a series of data chunks to the archive. If the whole series must be considered a single stream, then Wyng loses both pruning capability and deduplication. So each chunk would need to be its own stream... and I think we're back to the issues we face with the AES modes. That's why I stated early on that the threat model looks more like the one for whole-disk encryption, which has its own trade-offs. If we break from that threat model, we're probably looking at limiting the storage model to something that is not random-access.

Metadata is a bit of a different story. Because of the way Wyng processes it (funneling digest lists through merge-sort), other encryption modes can be used.
One needs to store the nonce with the data, but one must never use a nonce that was read from disk or the network. Instead, one should generate a fresh key using a KDF whenever the process starts, and store the KDF inputs (except the secret seed) along with the data. Alternatively, one can use XChaCha20-Poly1305, which has a large enough nonce that it can simply be generated at random at each startup.
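A sketch of that fresh-key-per-session approach, using a hand-rolled HKDF-SHA256 (stdlib only, following RFC 5869's extract-and-expand structure). The random salt is the only KDF input stored with the data; the session key is always re-derived, never persisted. All names here are illustrative:

```python
import hashlib
import hmac
import os

def hkdf_sha256(secret: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    prk = hmac.new(salt, secret, hashlib.sha256).digest()        # extract
    okm, block = b"", b""
    for i in range(1, (length + 31) // 32 + 1):                  # expand
        block = hmac.new(prk, block + info + bytes([i]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

# Per session: store only the non-secret salt alongside the archive data;
# the session key itself is re-derived at startup, never read from disk.
seed = os.urandom(32)          # long-term secret seed (assumed, kept private)
salt = os.urandom(16)          # stored with the data, not secret
session_key = hkdf_sha256(seed, salt, b"session-key")
```

With a fresh key per session, a nonce that repeats across sessions no longer repeats under the same key, which is the property the comment above is after.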
The whole-disk encryption threat model is intended for protection against loss of physical media, where chosen-ciphertext attacks are very difficult. That is not the case here.
Re-use... correct?
This is going a little far. FDE is deployed on network storage systems, and in office environments where repeated physical access (without losing media) is a part of the threat model. |
I'm also curious why a chosen-ciphertext attack against data is an issue here. Wyng always works from the assumption that its metadata (digest list) is secure before any data is verified. Hence the 3rd item on the above checklist.
And I'll grant the threat model is not exactly like FDE, where some implementations will re-use IVs. That's why I proposed using a unique counter in the IV. |
Note: I wrote this comment after reading up to #7 (comment), so this text doesn't take into consideration what has been posted after that. I second @DemiMarie's opinion that this should use authenticated encryption and, if possible, some existing higher-level API (I need to look up some details on libsodium before I can comment on whether I think it's a good choice here––probably yes). You are right that this is probably harder to attack in practice than some network protocol, but given how fast modern AEADs are, there is no reason to build something fragile into a new thing.
I don't understand this argument. What issue do you have if you make each chunk a separate stream? That being said, if your chunks are small enough, you can use the simpler AEAD interface instead of a streaming API.
I think backup software needs to support a bit more than FDE. In particular, non-local storage needs stronger authentication requirements than local FDE (as it's currently in use). Qubes' built-in backup also supports strong authentication. FDE is also slowly moving toward authenticating things: for system software (no encryption, only authentication) there's dm-verity, which provides strong authentication, and dm-crypt+dm-integrity can provide sector-level authentication––limited, but still better than plain dm-crypt. In general, FDE makes a lot of compromises due to its requirements. For backup software you are in a much better position, so you can support better crypto.
I would recommend against building in some separate crypto algorithm agility scheme for this use case. Choose a good algo+parameters. And if at some point it really turns out that there is a need to change it, that should be handled by a global format version change like other big changes.
Those developments are mainly to support randomly generated IVs (and IIRC the SIV variants also aim to provide some nonce-reuse resistance); I don't think "confidentiality weakness" is a good way to say that the earlier modes should not be used with random IVs. Whether this is relevant to your usage depends a lot on how you plan to use it (see below).
Key derivation and integration into hierarchical metadata is probably the much trickier task (because here "just use an existing robust higher-level API" is probably not possible to the extent it is for the encrypt-a-chunk part). So I would suggest first drafting the plan for this and discussing that. Then you can take another look at the "how to encrypt a chunk" part (for example, if you derive a new key for each chunk anyway, IV re-use is not an issue).
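The derive-a-new-key-per-chunk idea mentioned above could look like the following hypothetical sketch (HMAC-SHA256 as the subkey derivation; the `chunk:` label and chunk-index scheme are assumptions, not Wyng code):

```python
import hashlib
import hmac

def chunk_key(master_key: bytes, chunk_id: int) -> bytes:
    """Hypothetical per-chunk subkey derivation.  With a fresh key per
    chunk, an IV repeated across chunks is no longer repeated under the
    same key, so IV re-use across chunks becomes moot."""
    return hmac.new(master_key, b"chunk:" + chunk_id.to_bytes(8, "big"),
                    hashlib.sha256).digest()
```

The trade-off is one extra HMAC per chunk versus looser IV-management requirements.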
Such attacks are not addressed by common FDE solutions like dm-crypt (AFAIK MS's BitLocker is very similar, but I'm not familiar with its details).
So you already verify the hash of the encrypted data before decryption? Then you have a custom AEAD scheme, not just CBC. I read your previous comments as you don't do this and only verify after decryption. |
No, I meant “use”. Otherwise one is vulnerable to a replay attack. The only time this is okay is if one has a hardware-enforced monotonic counter, but that is ~never the case in this context.
I think you are talking past each other. You use the stored nonce to decrypt the data that was encrypted using it. What @DemiMarie means is that you should not use stored data to derive another nonce from it. So you should not do something like:
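The elided example can be illustrated with a toy stream cipher (a SHA-256 keystream stand-in, not real cryptography): when a nonce read back from storage is re-used with the same key, XOR-ing the two ciphertexts cancels the keystream and leaks the XOR of the plaintexts.

```python
import hashlib

def toy_stream_encrypt(key: bytes, nonce: bytes, pt: bytes) -> bytes:
    """Toy CTR-style stream cipher -- illustration only, not real crypto."""
    out, i = bytearray(), 0
    while len(out) < len(pt):
        out += hashlib.sha256(key + nonce + i.to_bytes(8, "big")).digest()
        i += 1
    return bytes(a ^ b for a, b in zip(pt, out))

key = b"k" * 32
nonce = b"stored-then-reused"     # the mistake: a nonce read back from disk
p1, p2 = b"first secret message!", b"other secret message!"
c1 = toy_stream_encrypt(key, nonce, p1)
c2 = toy_stream_encrypt(key, nonce, p2)
# With a reused nonce, the keystream cancels out of c1 XOR c2,
# revealing p1 XOR p2 to anyone holding only the ciphertexts.
leak = bytes(a ^ b for a, b in zip(c1, c2))
assert leak == bytes(a ^ b for a, b in zip(p1, p2))
```

This is exactly why the stored nonce may be used to decrypt what it originally encrypted, but never to produce a nonce for new encryption.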
You're confusing the role of metadata and data here. The lion's share of metadata is digest lists. There are too many assumptions being made here by people who have been uninterested in this project until this point.
No, Wyng is not a network protocol, and if you paid attention you'd realize the data chunks are not validated like a network protocol. It's all-or-nothing. There is no "re-play" from error-correction schemes. The digest test already uses a time-invariant function. Anything else an attacker is likely to do is DoS. I'm fine with DoS.
WHY would I read and (yes) re-use any data like that and apply it in such a fashion??? If I store a counter, it can be in a protected header (metadata) such as archive.ini. And PLEASE don't repeat the error and say I can't trust the header either. It can be signed if that's really necessary; that is the whole point of issue 79. The current usage model assumes that the metadata is protected by the isolated admin environment; transitioning to encryption, that metadata will have to be verified before any data can be processed. And if you assume I'm going to use AES-CBC for metadata signing/verification, then /eyeroll.

I should also point out the GCM problems aren't limited to nonce re-use. When the underlying CTR mode fails, it is (I repeat) catastrophic: tons of data (or all of it) gets exposed. With CBC under the same conditions, only the identical repeat messages tend to be exposed.

Finally, key scheduling is better suited to network streams, but won't be out of the question going forward. I still have to assume that unique IVs will be sufficient, because that's what the API documentation and application guides say, so the current tack is a reasonable starting point.

Here's the deal. I do not want this issue flooded with piles of best-practice nostrums from every use case under the sun applied indiscriminately, as is the fashion. Going forward, you can comment if A) you're a cryptographer or B) you demonstrate you've reviewed the Wyng format and present ideas about encryption in "Wyng-ese". The ideas already in Wyng have to be respected, or there will be little point in adding encryption to it.

Please also understand, this is being developed by a single person (me) in my spare time. The encryption feature will be introduced as experimental and will probably stay experimental for some time––as happened with deduplication––barring some considerable increase in participation. Other projects have a lot more manpower, and can still get by with delaying (say) correct verification of system updates for over a decade; in non-experimental releases at that. So, those are the terms and they are terrific. :-)
I did look at the commit mentioning the ticket, but given your comment it wasn't clear what is just done this way because it's a very early version of the feature, and what is your mid-to-long-term plan. Anyway: I commented here because Marek asked me privately if I would have time to take a look. Unfortunately my comments had the opposite of the intended effect, and you perceived them as some outsider to the project trying to force "best-practice nostrums" on you. Sorry about this; I definitely didn't want to annoy you in the issue tracker of your spare-time project. So I will refrain from further comments for now. If you would like to discuss this or related topics in the future, feel free to contact me.
For perspective... qubes-core-admin/qubes/backup.py:
That's literally the only occurrence of this constant in the code; it is not used anywhere :)
@marmarek What do you think about AES-SIV mode?
Change data cipher default to xchacha20 Issue #7
Good Morning.... The basic encryption implementation has been completed!

Usage notes

Upon new archive creation with ... The rest is like using Wyng v0.3, although there are additional features slated for v0.4 that will cause further changes in its command syntax and format.

Compatibility: Wyng 0.4 (wip) is being tested on Fedora 32 (Qubes 4.1), Debian 11 and Ubuntu 21.04. Qubes 4.0 does seem like a possibility if A) encryption is not used, or B) suitable encryption library versions are ported to Fedora 25.

Technical notes
The counters for each key are updated in the remote metadata root archive.ini file as the volume data is being sent, alongside a more frequently updated mirror of the counters tracked in the local .salt file. On startup, the two versions of each counter are compared, the larger is taken, and then one counter update step is added as a precaution. The counter is always advanced before incorporating it into a new IV. The upper bound for each cipher's message counter, plus other parameters for the IV:
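The counter-reconciliation step described above can be sketched as follows (the function name and the size of the safety step are assumptions; the counter values are made up for illustration):

```python
def reconcile_counter(local: int, remote: int, step: int = 1) -> int:
    """On startup, take the larger of the locally tracked counter and the
    mirror in the remote archive.ini, then add one update step as a
    safety margin against either copy being stale."""
    return max(local, remote) + step

# The counter is always advanced *before* being folded into a new IV:
counter = reconcile_counter(local=1041, remote=1040)
counter += 1          # advance first...
# ...then build the IV from it (e.g. counter bytes || random bytes).
```

Taking the maximum plus a margin means a crash that loses the most recent local or remote update can never cause a counter value (and hence an IV) to be re-used.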
Integrate an encryption layer that can also be used to verify metadata and data from the destination archive.
Looking for examples and discussion on applied cryptography techniques from best practices to implementations in various tools including qvm-backup, restic, Time Machine, etc.
Factors
Implementation checklist
Threat model
Wyng's threat model appears to be most similar to an encrypted database: a mass of data that is updated and curated periodically. Attackers gaining access to the entire volume ciphertext, possibly on successive occasions, may be assumed.
Security issues
Encryption scheme should be robust and have low interactivity and complexity as well as high isolation potential.
Isolation would be in the form of a Qubes-like environment where the Admin VM (e.g. Domain 0) running the backup process is blocked from direct network access, and encryption/decryption is performed only there. Wyng should be able to encrypt effectively in such an isolated environment.
Compatibility with Admin isolation also extends to how any guest containers/VMs are handled: Encryption and integrity verification cannot rely on the guest environments or their OS templates.
Encryption strategies
1. LUKS or VeraCrypt on a loop device (which can be isolated) with backing in a remote/shared image file. For example: cryptsetup -> losetup -> sshfs. This solution is readily available but imposes a performance penalty of ~20% on a VM-isolated configuration. It also requires painstaking user setup in a Linux-specific environment; difficult to integrate; poor choice for remote/cloud.
2. Encfs - A FUSE file-encrypting layer that may improve performance over a setup based on a loop device. It may also be simpler to set up, or even to integrate. Advantage: automatic filename (but not size or sequence) obfuscation. Drawback: issues with hardlinks in some encryption modes.
3. CryFS - Another FUSE layer with built-in support for network transports. Complete file metadata obfuscation. Claims superior resistance to attack. Unknowns: hardlink support, transport isolation potential.
4. Direct crypto library/AES utilization - Uses no external layers, but requires painstaking attention to detail, and review by a cryptographer if possible. This option may be a natural choice, given the simplicity of the archive chunk format; any issues around the implementation security should have direct analogues to a wide field of other implementations and their use cases. See initial comments on AES modes.
5. Some encrypted backup tool that can accept a stream of named chunks with very low interactivity between the front end and back end (e.g. a 'push' model).
(After some deliberation and using Wyng with external encryption layers, this issue will be primarily concerned with an integrated solution similar to item 4.)
Types of data
Wyng keeps volume data and metadata as separate files, and the metadata validates the volume data.
See Issue #79 for specifics on metadata, which is expected to use separate encryption keys.
On commenting...
Following a core tenet of cryptography that the application must be understood thoroughly before making specific decisions, a substantial familiarity with Wyng is required to make sense of this issue (ye have been warned...).
It's suggested that making some incremental backups with Wyng and looking at the metadata under '/var/lib/wyng.backup' is a good starting point. In the source code, the classes under ArchiveSet() are instructive, in addition to merge_manifests() and the places where it's used.