Encryption #43

Open
tbaumann opened this Issue Aug 23, 2013 · 19 comments

@tbaumann

I would love to dump some of our company's old data into Glacier.
But I cannot do that unencrypted.

This tool seems to solve a lot of my problems, but plaintext is a no-go for me. Any chance encryption could be added?

@vsespb
Owner
vsespb commented Aug 23, 2013

Hello. Most likely yes, encryption will be added. AFAIK lots of users want this feature. But I cannot promise that it will be done soon: it might look simple, but it is actually pretty complicated to make it work with all current and planned features.

For now you can encrypt it with third-party software (like PGP) and then use mtglacier for further processing. The disadvantage is that you need additional disk space for it.

I will leave this ticket open and will post updates here. (I also had the idea to post a technical RFC before implementing this, to get some feedback and ideas from the community.)

p.s.
and thank you for supporting the project!

@dimbeni
dimbeni commented Sep 27, 2013

Hello,

first and foremost, thanks to you Victor for this very good job.

I would also like some client-side encryption, and would like to help out (assuming I'm able to).

I've been working on a small perl script (in the frame of a different project), where we directly used the power of gpg via system calls.

I realize this would be platform locked (I use Linux), but having the gpg magic working under the hood of your Perl script would allow encryption without the additional disk space, via proper piping to a system call.

I'm thinking Linux, of course, but maybe the same approach could be extended to other OSs. What do you think of this approach? I assume you were more into using Perl modules like Crypt::GPG?

I'm willing to contribute to this, but I would need a few guidelines.

Thanks again

Davide

@vsespb
Owner
vsespb commented Sep 27, 2013

I'm thinking Linux, of course, but maybe the same approach could be extended to other OSs. What do you think of this approach? I assume you were more into using Perl modules like Crypt::GPG?

Actually I was going to use GPG/PGP too, like duplicity and s3cmd do.
I think it will work on any Unix/POSIX system (if the user is able to install GPG on it). Another possible approach is to allow the user to plug in any external encryption tool (as s3cmd does).

Problems with Perl encryption modules, like Crypt-GPG:
a) I don't see good modules with a small number of bugs that are still supported and have a long history.
b) Raw encryption implementations are unusable on their own (say, raw AES gives the same ciphertext for the same plaintext and the same key, so it is not secure without a salt or random IV. And if you invent a salt, you invent a new encryption protocol, like GPG does).

Other problems with implementing encryption in mtglacier:

  1. The logic of sync and other commands relies heavily on the fact that you can verify the original file checksum to see if the file is modified. For this I need to store the checksum in a) the Journal and b) Amazon Glacier servers. And (a) can be restored from (b).
    The problem is that storing an unencrypted checksum of the plaintext is considered a vulnerability. One can attack files with rainbow tables etc. (related issue with some proof links: s3tools/s3cmd#12 (comment)).

So I need to encrypt metadata too. The problem now is that if you retrieve an inventory with 10_000 archives, you'll have to decrypt 10_000 metadata records and thus call the GPG command 10_000 times.

  2. File sizes. Internally I have the file size, and it's used here and there. The problem is that now I need to distinguish between the original file size and the size of the encrypted/compressed archive in Amazon Glacier.
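To illustrate why an unencrypted plaintext checksum is a concern (item 1 above), here is a minimal Python sketch of a lookup attack. It is only an illustration: plain SHA-256 stands in for the TreeHash (for a file that fits in a single 1 MiB chunk the two are the same value), and the candidate plaintexts and the helper names `build_lookup` and `recover_plaintext` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_lookup(candidates):
    """Precompute checksum -> plaintext for every guessed plaintext."""
    return {sha256_hex(c): c for c in candidates}


def recover_plaintext(leaked_checksum: str, lookup):
    """Return the guessed plaintext matching the checksum, or None."""
    return lookup.get(leaked_checksum)


# Hypothetical attacker guesses for a small, low-entropy file.
guesses = [b"salary: 50000\n", b"salary: 60000\n", b"salary: 70000\n"]
lookup = build_lookup(guesses)

# Checksum an attacker could read from unencrypted metadata.
leaked = sha256_hex(b"salary: 60000\n")
```

Calling `recover_plaintext(leaked, lookup)` returns the matching guess even though the archive itself is encrypted, which is why the checksum has to be protected as well.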

I'm willing to contribute to this, but I would need a few guidelines.

Yes, it's hard to contribute now. I have good testing tools for this, like a Glacier emulator (somewhere in the gemu branch), but I have not released documentation for it yet. I'll try to improve the contributor guide, docs, and testing tools before I start working on encryption.

Currently I am busy refactoring the FSM engine to queue HTTP requests and writing another integration testing tool on top of gemu. This will make maintenance and QA of the project much easier.

I'll let you know in this ticket when I have better development docs and when I have a specification (an RFC, to be discussed) of how encryption should work.

@vsespb
Owner
vsespb commented Sep 27, 2013

So I need to encrypt metadata too. The problem now is that if you retrieve an inventory with 10_000 archives, you'll have to decrypt 10_000 metadata records and thus call the GPG command 10_000 times.

Actually, that looks like a show-stopper. I don't know a good way to solve this. It seems there is no batch mode in GPG.
Possible solutions:

a) use different encryption for metadata. (via CPAN modules)
b) don't store metadata for encrypted files on Amazon Glacier servers. Only in journal. Ask users to backup journal.

I don't like either (a) or (b).

@dimbeni
dimbeni commented Sep 27, 2013

Sorry if my reasoning is bugged, I don't know the code. If I understand correctly, all the metadata records are saved in the journal, in the form of one single file.

Isn't it possible to keep that file (the journal) as it is locally and only encrypt the file to be uploaded to Amazon?
When you need it locally, just read it as usual. When you need to retrieve the metadata from Amazon (which I would expect to happen seldom), the first step might be to retrieve the journal file and decrypt it (in one shot), and only then start to access the metadata inside it.

Now I don't know how much of the code should be rearranged for this to work, of course...

@vsespb
Owner
vsespb commented Sep 27, 2013

all the metadata records are saved in the journal, in the form of one single file

yes, correct.

Isn't it possible to keep that file (the journal) as it is locally and only encrypt the file to be uploaded to Amazon?

yes, that's possible, and it does not look like a security problem. Also, the user can manually encrypt and back up the journal on their own (to Amazon Glacier, Amazon S3, or some other way).

When you need it locally, just read it as usual. When you need to retrieve the metadata from Amazon (which I would expect to happen seldom), the first step might be to retrieve the journal file and decrypt it (in one shot), and only then start to access the metadata inside it.

yes, that would work. However, even if we implement it, I don't believe it's a clever idea to automate it (because uploading a new copy of the journal after each small change to it is not really efficient),
so the user will have to do all the steps manually, which is inconvenient (or there should be a separate command to automate it, which is less inconvenient). Also, I think a better place for the Journal backup is Amazon S3, not Glacier (so we'll have to support the S3 protocol and handle/explain in the documentation possible problems with Amazon permissions).

I believe I suggested the same above:
b) don't store metadata for encrypted files on Amazon Glacier servers. Only in journal. Ask users to backup journal.

The only problems are these:

Metadata records are stored in the Journal file, but also on Amazon Glacier servers (there is a special field for this).

Currently you can drop your journal and restore it with the retrieve-inventory and download-inventory commands.

I think this feature is pretty useful. It's also natural to the Amazon Glacier workflow. Pretty much all Amazon Glacier clients use it.

Also, different clients usually do not understand the metadata format of other clients, but sometimes they do. And if they do, they understand
only the metadata stored on Glacier servers (no software tries to parse another program's journals or caches).
So this is also useful if other software tries to read mtglacier files.

So I would like to preserve this feature and store encrypted metadata on the Amazon side somehow.

@vsespb
Owner
vsespb commented Sep 27, 2013

So I need to encrypt metadata too. The problem now is that if you retrieve an inventory with 10_000 archives, you'll have to decrypt 10_000 metadata records and thus call the GPG command 10_000 times.

I've just thought of another approach.

We don't really need to encrypt SHA256/TreeHash sums, because we don't need the actual TreeHash of the plaintext (probably!).
We only need to verify that the TreeHash is correct.

So we can store HMAC(USER-SUPPLIED-PASSWORD, TREEHASH-OF-PLAINTEXT), and later just check that the TreeHash matches.
HMAC will work fine in Perl.

I am not 100% sure yet that this will work. Also, this way we'll completely lose the original TreeHash checksums, which will make it harder for the end user
to prove that everything is correct, debug other problems, etc.
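The HMAC idea above could look roughly like the following Python sketch (Python is used only for illustration; mtglacier itself is Perl). The `tree_hash` function follows the Glacier scheme of hashing 1 MiB chunks with SHA-256 and combining digests pairwise; the function names `protected_checksum` and `is_unmodified` and the choice of HMAC-SHA256 with the raw password as key are assumptions, not something decided in this thread.

```python
import hashlib
import hmac

MIB = 1024 * 1024


def tree_hash(data: bytes) -> bytes:
    """SHA-256 tree hash over 1 MiB chunks, combined pairwise."""
    chunks = [data[i:i + MIB] for i in range(0, len(data), MIB)] or [b""]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                # An odd digest is promoted to the next level unchanged.
                next_level.append(pair[0])
        level = next_level
    return level[0]


def protected_checksum(password: str, data: bytes) -> str:
    """HMAC(password, treehash-of-plaintext), safe to store as metadata."""
    key = password.encode("utf-8")
    return hmac.new(key, tree_hash(data), hashlib.sha256).hexdigest()


def is_unmodified(password: str, data: bytes, stored: str) -> bool:
    """Recompute the HMAC for the local file and compare in constant time."""
    return hmac.compare_digest(protected_checksum(password, data), stored)
```

With this, sync can still detect modified files by recomputing the value locally, but the stored value no longer reveals the TreeHash of the plaintext to someone without the password.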

@vsespb
Owner
vsespb commented Sep 27, 2013

So I need to encrypt metadata too. The problem now is that if you retrieve an inventory with 10_000 archives, you'll have to decrypt 10_000 metadata records and thus call the GPG command 10_000 times.

Seems there is a way to decrypt several files at once (at least with symmetric encryption):

#!/bin/bash
PASS=MyPassword

# Encrypt two sample files symmetrically; gpg writes file1.txt.asc and file2.txt.asc
echo somedata1 > file1.txt
echo -n "$PASS" | gpg --yes --batch -a --passphrase-fd 0 --symmetric --cipher-algo AES256 file1.txt

echo somedata2 > file2.txt
echo -n "$PASS" | gpg --yes --batch -a --passphrase-fd 0 --symmetric --cipher-algo AES256 file2.txt

# Move the originals aside so that decryption can recreate them
mv file1.txt file1.orig
mv file2.txt file2.orig

# Decrypt both files with a single gpg call
echo -n "$PASS" | gpg --yes --batch -a --passphrase-fd 0 --multifile -d file1.txt.asc file2.txt.asc

So the number of files is now limited only by the command line length. Interestingly, the same does not work for encryption.

There is also a way to read filenames from STDIN:

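#!/bin/bash
PASS=MyPassword

# Encrypt two sample files symmetrically; gpg writes file1.txt.asc and file2.txt.asc
echo somedata1 > file1.txt
echo -n "$PASS" | gpg --yes --batch -a --passphrase-fd 0 --symmetric --cipher-algo AES256 file1.txt

echo somedata2 > file2.txt
echo -n "$PASS" | gpg --yes --batch -a --passphrase-fd 0 --symmetric --cipher-algo AES256 file2.txt

mv file1.txt file1.orig
mv file2.txt file2.orig

# Put the names of the encrypted files into a list, one per line
echo file1.txt.asc > list
echo file2.txt.asc >> list

# With --multifile and no file arguments, gpg reads the file names from STDIN
gpg -a --multifile -d < list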

But it asks for a passphrase. I think this can be worked around in a Perl script by supplying --passphrase-fd.
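One way the workaround could look, sketched in Python rather than Perl: the file names go to gpg on STDIN, and the passphrase goes through a separate pipe whose descriptor number is passed with --passphrase-fd. The helper names `build_gpg_argv` and `decrypt_many` are hypothetical, the sketch is POSIX-only (it relies on `pass_fds`), and newer GnuPG 2.1+ versions may additionally need `--pinentry-mode loopback`, which is not included here.

```python
import os
import subprocess


def build_gpg_argv(passphrase_fd: int) -> list:
    """Build the gpg command line. File names are not included because,
    with --multifile and no file arguments, gpg reads them from stdin."""
    return [
        "gpg", "--yes", "--batch",
        "--passphrase-fd", str(passphrase_fd),
        "--multifile", "--decrypt",
    ]


def decrypt_many(paths, passphrase: str) -> None:
    """Decrypt all given files with one gpg process."""
    read_fd, write_fd = os.pipe()
    try:
        # A short passphrase fits into the pipe buffer, so this does not block.
        os.write(write_fd, passphrase.encode("utf-8"))
    finally:
        os.close(write_fd)  # gpg sees EOF after the passphrase
    try:
        subprocess.run(
            build_gpg_argv(read_fd),
            input="\n".join(paths) + "\n",  # one file name per line on stdin
            text=True,
            pass_fds=(read_fd,),  # keep the passphrase pipe open in the child
            check=True,
        )
    finally:
        os.close(read_fd)
```

For the example above this would be called as `decrypt_many(["file1.txt.asc", "file2.txt.asc"], "MyPassword")`, so stdin stays free for the list of names.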

@dimbeni
dimbeni commented Sep 27, 2013

What about gpgdir?

http://cipherdyne.org/gpgdir/

I remember I used it for encrypting several files in a directory...

@vsespb
Owner
vsespb commented Sep 27, 2013

Just in case, I want to clarify that those examples of decrypting multiple files with GPG were about decrypting metadata (i.e. filenames and checksum information) received
from the Amazon Glacier inventory.

I want to represent it with files (one file = one line in journal = one metadata entry) and decrypt with gpg --multifile.

Encryption/decryption of the actual user files will probably be done with one file = one GPG command (and using pipes, no intermediate files).

What about gpgdir?

I checked the source, and it's not small. It does not look like it decrypts files with the --multifile option.
Do you think it will be faster to decrypt thousands (and perhaps millions) of metadata entries with gpgdir than with --multifile calls?

@dimbeni
dimbeni commented Sep 27, 2013

Do you think it will be faster to decrypt thousands (and perhaps millions) of metadata entries with gpgdir than with --multifile calls?

Probably not. Its main plus is that it allows encrypting/decrypting multiple directories from the command line, which is nothing that can't be done better with Perl tools (or gpg itself).

I was also thinking, what about temporarily going for

b) don't store metadata for encrypted files on Amazon Glacier servers. Only in journal. Ask users to backup journal.

at least in case the user chooses to encrypt?

I understand your thinking, and why it would be desirable to have the journal in Amazon as well.

But here's my reasoning: if you set up a remote backup process, a very big use case is disaster recovery (actually, I can't think of any other uses for Glacier).

So when you need the data, it's because your house or office was destroyed by fire or earthquake, and your entire IT infrastructure is likely gone together with the data. If you're worried about hard disk failures you shouldn't use Glacier, because of the long delay in retrieval.

This means that it is certainly of the utmost importance to have the journal somewhere other than the local machine as well, and not only the journal.

I'm currently backing up the entire "infrastructure" needed to recover the data from Glacier (journal, of course, but also scripts, passwords, vault names, etc.) somewhere else (encrypted in Dropbox, but that's just an example), and I think that anyone seriously planning for disaster is likely to do the same. That way the recovery can bootstrap itself anew from just the encryption password of the stuff in Dropbox, instead of passphrases, keys, secrets, vault names and all the related configs.

If the above makes sense, then it seems to me less essential that the encrypted journal should be in Glacier as well, at least in my use case...

@vsespb
Owner
vsespb commented Sep 28, 2013

actually, I can't think of any other uses for Glacier

secondary backup (because restore is too expensive for primary backup), archiving, log archiving enforced by law.

I'm currently backing up the entire "infrastructure" needed to recover the data from Glacier (journal, of course, but also scripts, passwords, vault names, etc.)

Good strategy. But some people might want to back up only scripts + passwords + names, once (perhaps by just printing it on paper together with a "backup restoration policy").

Even if they wish to back up the Journal as well, backing it up after each modification can be inefficient in some rare cases (for example, when the journal is much larger than the backup increment:
say they have millions of files and change just a few very small files before each backup run).

If the above makes sense, then it seems to me less essential that the encrypted journal should be in Glacier as well

I agree that such an implementation of encryption is higher priority (for end users) than a "proper" implementation with metadata encryption.

There are just a few small disadvantages:

a) If we implement "encryption-without-metadata" and then, later, "encryption-with-metadata", there will be a few additional complexities for end users when they decide whether it's safe to drop their journal.
I.e. there will be a note in the documentation (which people have to read) explaining which version backs up metadata during encryption and which does not. And maybe commands/options
to check whether your journal corresponds to the backed-up metadata.

b) There will be another branch in the journal/metadata format (i.e. there will be two).

Next, I think solving that problem is not a very big part of the whole work, maybe ~20%. So I am not sure it is worth releasing an intermediate version and introducing additional overhead.

Well, let's talk about how to actually contribute to this feature.
Unfortunately it's not the first priority now.

There are two issues, #39 and #40, related to file versioning. When I implemented the replace-modified option I had to write most of the file versioning code (for
consistency in case of race conditions), so it's already almost implemented and I need to finish it.

There are also important issues #3 and #37 (not sure yet whether they are more important than encryption).

And, as I said, I am busy now with the rework of the HTTP Queue engine.

My plan for encryption implementation:

a) I introduce some docs for developers
b) I document the Glacier Emulator tool (it's probably a pain to test mtglacier against real Amazon servers)
c) I write an RFC specification for encryption
d) After that I can split (some of the) encryption work into several tickets and give them to anyone who wishes to contribute.
e) I do work unrelated to encryption (issues listed above)
f) I work on encryption, and we release it

I think (d) can happen pretty soon (maybe 0.5-1 month), but (f) won't happen soon (maybe ~6 months or a bit more).

Also I want to mention here a few things missing from the (development) docs:

What features should mtglacier have, how it should behave:

In the first place, mtglacier is an Amazon Glacier client, not a full-featured backup tool.
So the priority is implementing natural Amazon Glacier protocol features (i.e. some useful operation that is possible for a set of files backed up to Amazon Glacier).
Examples:

Natural feature: Amazon Glacier has the range retrievals feature, so it's possible to hack together some command to minimize retrieval cost with it,
for example something like --use-rangeretrievals --retrieve-rate=1Gb/h --storage-for-partial-files=/tmp

Not natural for a Glacier client: file deduplication, file encryption (both can be done with 3rd party tools), backup rotation (can be done with scripting)

(NOTE: sometimes this priority is violated; for example, I often think that encryption is more important than range retrieval)
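As a rough illustration of what a range-retrieval option might do internally, the Python sketch below splits one archive into byte ranges so that each retrieval job stays within a per-step budget (for example, the amount allowed per hour). It assumes, to the best of my understanding of the Glacier API, that ranges must start on a 1 MiB boundary and end either one byte before a 1 MiB boundary or at the end of the archive. The function name `plan_ranges` is hypothetical, and the options in the example above do not exist in mtglacier.

```python
MIB = 1024 * 1024


def plan_ranges(archive_size: int, bytes_per_step: int):
    """Split an archive into inclusive (start, end) byte ranges.

    bytes_per_step must be a positive multiple of 1 MiB so that every
    range starts on a megabyte boundary.
    """
    if bytes_per_step <= 0 or bytes_per_step % MIB != 0:
        raise ValueError("bytes_per_step must be a positive multiple of 1 MiB")
    ranges = []
    start = 0
    while start < archive_size:
        # The last range simply ends at the end of the archive.
        end = min(start + bytes_per_step, archive_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

A scheduler could then submit one range per time slot and store the partial files in a scratch directory until the whole archive has been retrieved.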

Priority during development

  1. critical bug fixes.
  2. fixes for critical missing features (i.e. if we implemented backup but forgot to implement restore, that's a critical missing feature)
  3. command line and API, workflow and compatibility concerns should be sane.
  4. command line API backward compatibility
  5. reducing technical debt ( http://en.wikipedia.org/wiki/Technical_debt ) to actually save development time.
    for example:
    • I don't write and release code if I know it's bad and should be refactored; I refactor it before release
    • I try not to release code without tests, if tests should be implemented in this case
    • I try not to add code to places which I know should be refactored or reworked soon
    • Bug fixes are more important than new features
    • I try to avoid additional mess with the metadata/journal format
  6. eliminating and avoiding each insignificant edge case (in API and docs)
    • Example of an insignificant edge case which is documented but makes the docs harder to read:
      Config file name (specified with --config) can be in any encoding (it's used as is)
      Of course it will work only if your terminal encoding matches your filesystem encoding or if your config file name
      consists of 7-bit ASCII characters only.
    • Another example, where the workflow is more complex but it prevents trouble in edge cases: --check-max-file-size, which is mandatory for --stdin upload
  7. ease of deployment
  8. documentation
  9. small bug fixes
  10. new features/enhancements
  11. other stuff

(NOTE: sometimes these priorities are violated, but not much)

@vsespb
Owner
vsespb commented Oct 1, 2013

I made some benchmarks ( https://gist.github.com/vsespb/6776512 ): GPG is able to decrypt ~1000 small files per second with --multifile (one Core i7 2600, single thread, CPU usage ~90% for the thread; I ran it with 100_000 files). I think this is acceptable for metadata decryption (especially if we make it multithreaded).
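For a feeling of what that rate means in practice, here is a back-of-the-envelope calculation in Python. It takes the ~1000 entries per second figure from the benchmark above and assumes, optimistically, that throughput scales linearly with the number of threads; `estimated_seconds` is just an illustrative helper.

```python
def estimated_seconds(entries: int, per_second: float = 1000.0, threads: int = 1) -> float:
    """Estimated metadata decryption time, assuming linear scaling with threads."""
    return entries / (per_second * threads)


# 10_000 archives on one thread: about 10 seconds.
small_inventory = estimated_seconds(10_000)

# One million archives: about 1000 seconds (~17 minutes) on one thread,
# or about 250 seconds if four threads scaled perfectly.
large_single = estimated_seconds(1_000_000)
large_four_threads = estimated_seconds(1_000_000, threads=4)
```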

@vsespb
Owner
vsespb commented Oct 1, 2013

Another problem is how to prevent people from doing this:

MT_GPG_PASSWORD=MyPassword mtglacier sync --update-modified --encryption  ...

and then

MT_GPG_PASSWORD=DifferentPassword mtglacier sync --update-modified --encryption  ...

i.e. mixing two different passwords for the same journal/vault and perhaps the same filenames (when we have versioning for files)

and, even worse, if metadata is encrypted too, they will not be able to decrypt their metadata into the journal if they ever uploaded files with different passwords to the same vault.

A similar problem exists already: people are allowed to use one journal to upload files to different vaults (this should not be allowed; more info: https://github.com/vsespb/mt-aws-glacier#why-journal-does-not-contain-regionvault-information ).
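One conceivable guard against mixing passwords, sketched in Python, is to keep a salted password verifier next to the journal and refuse to run when the supplied password does not match it. This is purely a hypothetical design and was not proposed in this thread: the names `make_verifier` and `password_matches`, the use of PBKDF2-HMAC-SHA256, and the iteration count are all assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, not a project setting


def make_verifier(password: str, salt: bytes = None) -> dict:
    """Create a record that can be stored alongside the journal on first use."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return {"salt": salt.hex(), "check": digest.hex()}


def password_matches(password: str, verifier: dict) -> bool:
    """Return True only if the password is the one the verifier was made from."""
    salt = bytes.fromhex(verifier["salt"])
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest.hex(), verifier["check"])
```

A sync run started with a different password would then fail the `password_matches` check before uploading anything. Note that this only protects a given journal; it does nothing for two separate journals pointing at the same vault.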

@vsespb
Owner
vsespb commented Oct 2, 2013

Another problem with metadata encryption on Amazon Glacier servers is that, if we allow encryption with asymmetric crypto, the encrypted data can exceed (or eat much of) the 1024 bytes allowed by Amazon for metadata.

Below are the sizes for:
base64(gpg(binary_treehash))
i.e. if we want to encrypt only the SHA256 treehash of the source file in metadata.

symmetric encryption: 102 bytes
asymmetric 2048 bits: 495 bytes
asymmetric 4096 bits: 843 bytes
asymmetric two recipients (4096+2048 bits): 1208 bytes
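The numbers above can be put against the limit with a few lines of Python. The sizes are the ones measured in this comment and the 1024-byte limit is the one quoted above; `remaining_bytes` is just an illustrative helper showing how much room would be left for the rest of the metadata (filename, mtime, and so on).

```python
GLACIER_METADATA_LIMIT = 1024  # bytes, as quoted above

# Sizes of base64(gpg(binary_treehash)) measured above, in bytes.
MEASURED_SIZES = {
    "symmetric encryption": 102,
    "asymmetric 2048 bits": 495,
    "asymmetric 4096 bits": 843,
    "asymmetric two recipients (4096+2048 bits)": 1208,
}


def remaining_bytes(encrypted_size: int, limit: int = GLACIER_METADATA_LIMIT) -> int:
    """Bytes left for other metadata; negative means the limit is exceeded."""
    return limit - encrypted_size


# Modes whose encrypted treehash alone already exceeds the limit.
too_large = [name for name, size in MEASURED_SIZES.items() if remaining_bytes(size) < 0]
```

Only the two-recipient case exceeds the limit outright, but the 4096-bit case leaves just 181 bytes for everything else.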

@alexkrauss

Encryption is important, but it is orthogonal to what mt-aws-glacier does. Here is what I do right now:

  • Assume data is in data/
  • Use encfs --reverse to mount an encrypted version of that tree under data-encrypted/. This is simply a FUSE file system that encrypts the underlying file system on the fly.
  • Use mt-aws-glacier to store the encrypted tree normally.

Others have been doing this with other forms of cloud storage, e.g. http://shrp.me/docs/encrypted_offsite_backup.php .

So building encryption facilities into mt-aws-glacier might not actually be worth the effort, which seems to be considerable...

@vsespb
Owner
vsespb commented Oct 6, 2013

Encryption is important, but it is orthogonal to what mt-aws-glacier does.

I agree.

Use encfs --reverse to mount

I tried that, and indeed it works great. Thank you for the information.

So building encryption facilities into mt-aws-glacier might not actually be worth the effort

Well, indeed, maybe. I need to think about it. One disadvantage that I see for now is that with encfs --reverse you cannot use file filtering ( https://github.com/vsespb/mt-aws-glacier#file-selection-options ) for operations (like restoring only certain files). This can probably be worked around (by dividing files into groups before backup and creating one encfs for each group, so later you can restore only one group, based on filepath again). Another problem can be portability (it seems there are ports for *BSD and Mac OS X, but not for Solaris); well, not a big deal either.

Anyway, it's definitely worth mentioning encfs --reverse in the documentation; I will do that soon.

btw, another example of an often requested feature which is orthogonal to mtglacier functionality is bandwidth throttling.
In theory the OS should do that. There is an example of a setup at https://github.com/uskudnik/amazon-glacier-cmd-interface#bandwidth-throttling ; however, this does not seem to be 100% reliable as IP addresses might change, so a reliable solution should probably be squid based. My thought is that bandwidth throttling is an example of a feature which, if implemented, will make life much easier for end users who want to limit bandwidth, because setting up squid/tc and controlling its configuration can be pretty hard (squid is a server, and you need to restart it to reconfigure).

@sergeyromanov

a) I don't see good modules with small number of bugs, which are still supported and which have long history.

I may suggest Crypt::OpenPGP. Just recently I've adopted this module and I'm planning to bring it up to speed in the near future

@vsespb
Owner
vsespb commented Jun 16, 2014

I may suggest Crypt::OpenPGP. Just recently I've adopted this module and I'm planning to bring it up to speed in the near future

That is interesting! It looks like a pure Perl thing (i.e. not calling an external PGP command for each operation). That would be slow for encrypting data (I was going to call the external GPG command for that), but that is OK; and it will be even faster for encrypting metadata (thousands of small records), while keeping the metadata format compatible with GPG.
