Pickle encryption #13

Encryption during sync has been modified to use an auxiliary file to store the MD5s needed to detect changes between files. There is now a pickled dictionary, stored at ~/.s3metadata, mapping each encrypted file's MD5 to the decrypted file's MD5.
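
A minimal sketch of that idea (the helper names and error handling here are illustrative, not the PR's actual code):

    import os
    import pickle

    METADATA_FILE = os.path.expanduser("~/.s3metadata")

    def load_metadata():
        # Return the {encrypted_md5: decrypted_md5} mapping, or an
        # empty dict if the cache file does not exist yet.
        try:
            with open(METADATA_FILE, "rb") as f:
                return pickle.load(f)
        except (IOError, OSError, EOFError):
            return {}

    def save_metadata(mapping):
        # Persist the mapping for the next sync run.
        with open(METADATA_FILE, "wb") as f:
            pickle.dump(mapping, f)

During a sync, the encrypted object's MD5 (what S3 reports) can then be looked up to recover the local plaintext's MD5 for comparison.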

I have tested it with a put with --encrypt followed by a sync with --encrypt, and it only moves changed files. I have also tested get and sync back to local, and decryption works in both cases. (You do need to specify --encrypt when doing the sync to local, so that s3cmd knows not to check the file size before syncing.)

Let me know if you have any questions or suggestions on this. I am running my backup of over 15,000 files with it, and it is a lot better than the HEAD version ;)

Contributor

mludvig commented Nov 16, 2011

Getting there; a couple of comments still, though :)

  • can we have the metadata file in each directory instead of a central one? As I mentioned, some people run s3cmd with a huuuuge number of files; parsing and updating a pickle file with millions of records may take considerable time and memory. On the other hand, they are unlikely to have millions of files in a single directory...
  • can the workflow hide the actual metadata storage details from cmd_sync_local2remote() and cmd_sync_remote2local()? Something like a single metadata.get_*() call (see the sketch below) which would in turn check whether the original md5 and timestamp (or whatever else you need) is already in "./.s3cmd.data"; if not, run a HEAD request and store the result to "./.s3cmd.data" for next time. A Metadata class should encapsulate the details of handling and storing the actual metadata, both locally and remotely.
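
A rough sketch of how such a class could look (everything here is hypothetical, including the object_info() helper standing in for whatever HEAD-request call s3cmd actually provides):

    import pickle

    class Metadata(object):
        # Hides the storage details: callers ask for metadata and never
        # learn whether it came from a cache file or a HEAD request.
        def __init__(self, cache_path, s3):
            self.cache_path = cache_path
            self.s3 = s3
            try:
                with open(cache_path, "rb") as f:
                    self.cache = pickle.load(f)
            except (IOError, OSError, EOFError):
                self.cache = {}

        def get_md5(self, uri):
            if uri not in self.cache:
                # object_info() is a hypothetical HEAD-request helper.
                info = self.s3.object_info(uri)
                self.cache[uri] = info["md5"]
                self._save()
            return self.cache[uri]

        def _save(self):
            with open(self.cache_path, "wb") as f:
                pickle.dump(self.cache, f)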

What do you think?

The problem I see with the first point is that's not really how my backups work. Since I'm syncing a live site, I actually create a copy of the site, sync that, and then blow it away when I'm done. Any hidden files stored in those directories would be gone. I liked the home-directory solution mainly because it's not in the backed-up directory and it keeps the metadata consolidated with my settings file. The file's size isn't a deterrent to me at the moment (although I could see it growing into one, and I'm still deciding the best way to clean it out). 1 MB for 15,000 files doesn't scare me away; 1 GB for 15,000,000 files I could see being a problem. But this works best for me as is. Is there a forum where this can be discussed? Maybe someone else could step up to handle this?

I'll look at the second one. Good point on that one. I can tell I pushed a little too much responsibility off onto the calling classes.

Contributor

mludvig commented Nov 18, 2011

On 11/18/2011 12:11 PM, Joe Erickson wrote:

The problem I see with the first point is, that's not really how my backups work. Since I'm running a sync on a live site I'm running, I actually create a copy of the site, sync that and then blow it away when I'm done.

That's where the more generic logic hidden in the Metadata class would come in handy. We could have a config setting like "metadata_cache = /some/path": if "/some/path" == "." it would do per-directory storage, otherwise it would store everything in the given file.

The users of the class won't care: simply call metadata.get_blah(), and whether it reads from ./.s3cmd.data or $HOME/.s3cmd.data or performs a HEAD request, the sync_local2remote() caller won't need to know or care.

M.
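
A sketch of the dispatch being suggested (the metadata_cache setting comes from the comment above; treating the config as a plain dict is an assumption):

    import os

    def metadata_path(config, local_dir):
        # metadata_cache == "." means per-directory storage; any other
        # value is treated as a single central cache file.
        cache_setting = config.get("metadata_cache", ".")
        if cache_setting == ".":
            return os.path.join(local_dir, ".s3cmd.data")
        return os.path.expanduser(cache_setting)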

Member

mdomsch commented Mar 9, 2013

Currently in the master branch is my HashCache code, which lets you specify the location of the cache file on a per-run basis. It's not a per-directory cache, though. As a size reference: against the Fedora primary-architecture release tree, with nearly 900k files, the pickle file is only 22 MB. By placing it outside the tree being synced, the cache files need not be excluded on every sync invocation. It would be easy to also store the md5 of the pre-encrypted file there if desired; it currently stores the md5 of whatever is being uploaded.
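
A simplified sketch of the caching idea (the real HashCache keys its entries differently; this only shows how keying by path, mtime, and size avoids re-hashing unchanged files):

    import hashlib
    import os
    import pickle

    class HashCache(object):
        def __init__(self, filename):
            self.filename = filename
            try:
                with open(filename, "rb") as f:
                    self.entries = pickle.load(f)
            except (IOError, OSError, EOFError):
                self.entries = {}

        def md5(self, path):
            # Re-hash only when the file's mtime or size has changed.
            st = os.stat(path)
            key = (path, st.st_mtime, st.st_size)
            if key not in self.entries:
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                self.entries[key] = h.hexdigest()
            return self.entries[key]

        def save(self):
            with open(self.filename, "wb") as f:
                pickle.dump(self.entries, f)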

Contributor

richo commented May 2, 2014

Is there a real need to use pickle instead of a serialization format that isn't Turing-complete?

Member

mdomsch commented May 3, 2014

@richo what format would you recommend? Pickle is trivial to use, and fine for the purpose we are putting it to right now, AFAIK.
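
For comparison, the same flat string-to-string mapping round-trips through JSON, which can only encode data, never code (this is an illustration, not what s3cmd does):

    import json

    def save_metadata(mapping, path):
        # Loading an attacker-supplied JSON file cannot execute
        # anything, unlike unpickling.
        with open(path, "w") as f:
            json.dump(mapping, f)

    def load_metadata(path):
        try:
            with open(path) as f:
                return json.load(f)
        except (IOError, OSError, ValueError):
            return {}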
