Pickle encryption #13

Open
wants to merge 8 commits into from

4 participants

@firstclown

Encryption during sync has been modified to use an auxiliary file to store the MD5s needed to detect changes between files. There is now a pickled dictionary mapping the encrypted file's MD5 to the decrypted file's MD5, stored at ~/.s3metadata.
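For illustration, a minimal sketch of reading that mapping back (the `md5_trans` key and the ~/.s3metadata location come from the diff below; the hash values in the comment are made up):

```
import cPickle
import os

# ~/.s3metadata is written by MetaData.save() (see S3/MetaData.py in this PR).
metadata_file = os.path.join(os.getenv("HOME"), ".s3metadata")
metadata = cPickle.load(open(metadata_file, 'rb'))

# 'md5_trans' maps the MD5 of the encrypted object stored in S3 to the MD5 of
# the original, decrypted local file, e.g.:
#   {'md5_trans': {'9e107d9d372bb6826bd81d3542a419d6': 'e4d909c290d0fb1ca068ffaddf22cbd0'}}
print metadata['md5_trans']
```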

I have tested it with a put with --encrypt and then doing a sync with --encrypt, and it will only move changed files. I have also tested get and sync back to local, and decrypting works successfully in both cases. (You do need to specify --encrypt when doing the sync to local, so that s3cmd knows not to check the file size before syncing.)

Let me know if you have any questions or suggestions on this. I am running my backup of over 15,000 files with it, and it is a lot better than the HEAD version ;)

@mludvig
Owner

Getting there; however, a couple of comments still :)

  • can we have the metadata file in each directory instead of a central one? As I mentioned, some people run s3cmd with a huuuuge number of files - parsing and updating a pickle file with millions of records may take considerable time and memory. On the other hand, they are unlikely to have millions of files in a single directory...

  • can the workflow hide the actual metadata storage details from cmd_sync_local2remote() and cmd_sync_remote2local()? Something like:

    ```
    orig_md5, orig_timestamp = metadata.get_orig_attribs(full_path)
    ```

which in turn would check whether the original md5 and timestamp (or whatever else you need) are already in "./.s3cmd.data". If not, run a HEAD request and store the result to "./.s3cmd.data" for next time. The Metadata class should encapsulate the details of handling and storing the actual metadata, both locally and remotely. A rough sketch of this idea follows below.
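A rough sketch of the interface being proposed (only get_orig_attribs() and the "./.s3cmd.data" file name come from the comment above; the cache layout and the _head_request() placeholder are assumptions, not actual s3cmd code):

```
import cPickle
import os

class Metadata(object):
    """Per-directory cache of original (pre-encryption) file attributes."""

    CACHE_NAME = ".s3cmd.data"

    def get_orig_attribs(self, full_path):
        directory = os.path.dirname(full_path)
        cache = self._load(directory)
        if full_path not in cache:
            # Not cached yet: issue a HEAD request and remember the answer
            # in ./.s3cmd.data for the next run.
            cache[full_path] = self._head_request(full_path)
            self._store(directory, cache)
        attribs = cache[full_path]
        return attribs["md5"], attribs["timestamp"]

    def _load(self, directory):
        path = os.path.join(directory, self.CACHE_NAME)
        if os.path.exists(path):
            return cPickle.load(open(path, "rb"))
        return {}

    def _store(self, directory, cache):
        path = os.path.join(directory, self.CACHE_NAME)
        cPickle.dump(cache, open(path, "wb"), -1)

    def _head_request(self, full_path):
        # Placeholder: in s3cmd this would be an S3 HEAD (object_info) call.
        raise NotImplementedError
```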

What do you think?
@firstclown

The problem I see with the first point is that that's not really how my backups work. Since I'm syncing a live site, I actually create a copy of the site, sync that, and then blow it away when I'm done. Any hidden files I'd store in the directories would be gone. I liked the home directory solution mainly because it's not in the backed-up directory and it keeps it consolidated with my settings file. The fact that it's big isn't a deterrent to me at the moment (although I could see it growing into one, and I'm still deciding the best way to clean it out). 1 MB for 15,000 files doesn't scare me away; 1 GB for 15,000,000 files I could see being a problem. But this works best for me as is. Is there a forum this can be discussed on? Maybe someone else could step up to handle this?

I'll look at the second one. Good point on that one. I can tell I pushed a little too much responsibility off onto the calling classes.

@mludvig
Owner
@mdomsch
Owner

Currently in the master branch is my HashCache code, where on a per-run basis you can specify the location of the cache file. It's not a per-directory cache though. As a size reference, against the Fedora primary architecture release tree with nearly 900k files, the pickle file is only 22MB. By placing it out of the tree being synced, the cache files need not be excluded on every sync invocation. It would be easy to store the md5 for the pre-encrypted file there too if desired. It currently stores the md5 of whatever is being uploaded.
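For reference, pointing the cache outside the tree being synced would look roughly like this (assuming the --cache-file option that came with the HashCache work; the paths are made up):

```
s3cmd sync --cache-file=/var/cache/s3cmd/fedora.hashcache /srv/fedora/ s3://example-bucket/fedora/
```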

@richo

Is there a real need to use pickle instead of a serialization format that isn't Turing-complete?

@mdomsch
Owner

@richo what format would you recommend? Pickle is trivial to use, and fine for the purpose we are putting it to right now, AFAIK.
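For comparison, the same mapping could be kept in a data-only format such as JSON instead of pickle; a minimal sketch, not part of this PR:

```
import json

metadata = {"md5_trans": {"<encrypted md5>": "<original md5>"}}

# json.load() only ever yields plain data structures, so loading an untrusted
# or corrupted cache file cannot execute code the way unpickling can.
with open("/tmp/s3metadata.json", "w") as f:
    json.dump(metadata, f)

with open("/tmp/s3metadata.json") as f:
    metadata = json.load(f)
```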

Showing 4 changed files with 81 additions and 7 deletions.
  1. +1 −0  S3/Config.py
  2. +11 −0 S3/FileLists.py
  3. +47 −0 S3/MetaData.py
  4. +22 −7 s3cmd
1  S3/Config.py
@@ -39,6 +39,7 @@ class Config(object):
proxy_host = ""
proxy_port = 3128
encrypt = False
+ temp_location = "/tmp/tmpfile-"
dry_run = False
preserve_attrs = True
preserve_attrs_list = [
11 S3/FileLists.py
@@ -5,6 +5,7 @@
from S3 import S3
from Config import Config
+from MetaData import MetaData
from S3Uri import S3Uri
from SortedDict import SortedDict
from Utils import *
@@ -282,6 +283,7 @@ def __direction_str(is_remote):
debug("src_list.keys: %s" % src_list.keys())
debug("dst_list.keys: %s" % dst_list.keys())
+ metadata = MetaData()
for file in src_list.keys():
debug(u"CHECK: %s" % file)
if dst_list.has_key(file):
@@ -312,6 +314,10 @@ def __direction_str(is_remote):
elif src_remote == True and dst_remote == True:
src_md5 = src_list[file]['md5']
dst_md5 = dst_list[file]['md5']
+ if cfg.encrypt and src_md5 in metadata.metadata['md5_trans'] and src_remote == True:
+ src_md5 = metadata.metadata['md5_trans'][src_md5]
+ if cfg.encrypt and dst_md5 in metadata.metadata['md5_trans'] and dst_remote == True:
+ dst_md5 = metadata.metadata['md5_trans'][dst_md5]
except (IOError,OSError), e:
# MD5 sum verification failed - ignore that file altogether
debug(u"IGNR: %s (disappeared)" % (file))
@@ -323,6 +329,11 @@ def __direction_str(is_remote):
if src_md5 != dst_md5:
## Checksums are different.
attribs_match = False
+ ## If encrypt, remove all matching keys from the metadata file
+ if cfg.encrypt and src_md5 in metadata.metadata['md5_trans']:
+ del(metadata.metadata['md5_trans'][src_md5])
+ if cfg.encrypt and dst_md5 in metadata.metadata['md5_trans']:
+ del(metadata.metadata['md5_trans'][dst_md5])
debug(u"XFER: %s (md5 mismatch: src=%s dst=%s)" % (file, src_md5, dst_md5))
if attribs_match:
47 S3/MetaData.py
@@ -0,0 +1,47 @@
+## Amazon S3 manager - MetaData library
+## Author: Michal Ludvig <michal@logix.cz>
+## http://www.logix.cz/michal
+## License: GPL Version 2
+
+import cPickle
+import os
+from logging import debug, info, warning, error
+
+class MetaData(object):
+ _instance = None
+ metadata = {}
+ metadata['md5_trans'] = {}
+
+ ## Creating a singleton
+ def __new__(self):
+ if self._instance is None:
+ self._instance = object.__new__(self)
+ return self._instance
+
+ def __init__(self):
+ metadata_file = ".s3metadata"
+ if os.getenv("HOME"):
+ metadata_file = os.path.join(os.getenv("HOME"), ".s3metadata")
+ elif os.name == "nt" and os.getenv("USERPROFILE"):
+ metadata_file = os.path.join(os.getenv("USERPROFILE").decode('mbcs'), "Application Data", "s3metadata.ini")
+
+ debug(u"Loading metadata from %s" % metadata_file)
+
+ if os.path.exists(metadata_file):
+ self.metadata = cPickle.load(open(metadata_file, 'rb'))
+
+
+ def save(self):
+ metadata_file = ".s3metadata"
+ if os.getenv("HOME"):
+ metadata_file = os.path.join(os.getenv("HOME"), ".s3metadata")
+ elif os.name == "nt" and os.getenv("USERPROFILE"):
+ metadata_file = os.path.join(os.getenv("USERPROFILE").decode('mbcs'), "Application Data", "s3metadata.ini")
+
+ debug(u"Saving metadata to %s" % metadata_file)
+ try:
+ cPickle.dump(self.metadata, open(metadata_file, 'wb'), -1)
+ except IOError, e:
+ error(u"Can't write out metadata file to %s: %s" % (metadata_file, e.strerror))
+
+# vim:et:ts=4:sts=4:ai
29 s3cmd
@@ -284,6 +284,7 @@ def cmd_object_put(args):
return
seq = 0
+ metadata = MetaData()
for key in local_list:
seq += 1
@@ -295,6 +296,7 @@ def cmd_object_put(args):
seq_label = "[%d of %d]" % (seq, local_count)
if Config().encrypt:
exitcode, full_name, extra_headers["x-amz-meta-s3tools-gpgenc"] = gpg_encrypt(full_name_orig)
+ metadata.metadata['md5_trans'][Utils.hash_file_md5(full_name)] = Utils.hash_file_md5(full_name_orig)
try:
response = s3.object_put(full_name, uri_final, extra_headers, extra_label = seq_label)
except S3UploadError, e:
@@ -314,6 +316,7 @@ def cmd_object_put(args):
if Config().encrypt and full_name != full_name_orig:
debug(u"Removing temporary encrypted file: %s" % unicodise(full_name))
os.remove(full_name)
+ metadata.save()
def cmd_object_get(args):
cfg = Config()
@@ -746,6 +749,9 @@ def cmd_sync_remote2local(args):
dst_stream = open(dst_file, "wb")
response = s3.object_get(uri, dst_stream, extra_label = seq_label)
dst_stream.close()
+ if response["headers"].has_key("x-amz-meta-s3tools-gpgenc"):
+ gpg_decrypt(dst_file, response["headers"]["x-amz-meta-s3tools-gpgenc"])
+ response["size"] = os.stat(dst_file)[6]
if response['headers'].has_key('x-amz-meta-s3cmd-attrs') and cfg.preserve_attrs:
attrs = _parse_attrs_header(response['headers']['x-amz-meta-s3cmd-attrs'])
if attrs.has_key('mode'):
@@ -837,12 +843,6 @@ def cmd_sync_local2remote(args):
s3 = S3(cfg)
- if cfg.encrypt:
- error(u"S3cmd 'sync' doesn't yet support GPG encryption, sorry.")
- error(u"Either use unconditional 's3cmd put --recursive'")
- error(u"or disable encryption with --no-encrypt parameter.")
- sys.exit(1)
-
## Normalize URI to convert s3://bkt to s3://bkt/ (trailing slash)
destination_base_uri = S3Uri(args[-1])
if destination_base_uri.type != 's3':
@@ -907,10 +907,12 @@ def cmd_sync_local2remote(args):
seq = 0
file_list = local_list.keys()
file_list.sort()
+ metadata = MetaData()
for file in file_list:
seq += 1
item = local_list[file]
src = item['full_name']
+ src_orig = src
uri = S3Uri(item['remote_uri'])
seq_label = "[%d of %d]" % (seq, local_count)
extra_headers = copy(cfg.extra_headers)
@@ -919,6 +921,9 @@ def cmd_sync_local2remote(args):
attr_header = _build_attr_header(src)
debug(u"attr_header: %s" % attr_header)
extra_headers.update(attr_header)
+ if cfg.encrypt:
+ exitcode, src, extra_headers["x-amz-meta-s3tools-gpgenc"] = gpg_encrypt(src_orig)
+ metadata.metadata['md5_trans'][Utils.hash_file_md5(src)] = Utils.hash_file_md5(src_orig)
response = s3.object_put(src, uri, extra_headers, extra_label = seq_label)
except InvalidFileError, e:
warning(u"File can not be uploaded: %s" % e)
@@ -933,7 +938,11 @@ def cmd_sync_local2remote(args):
speed_fmt[0], speed_fmt[1], seq_label))
total_size += response["size"]
uploaded_objects_list.append(uri.object())
+ if cfg.encrypt and src != src_orig:
+ debug(u"Removing temporary encrypted file: %s" % unicodise(src))
+ os.remove(src)
+ metadata.save()
total_elapsed = time.time() - timestamp_start
total_speed = total_elapsed and total_size/total_elapsed or 0.0
speed_fmt = formatSize(total_speed, human_readable = True, floating_point = True)
@@ -1162,7 +1171,7 @@ def gpg_command(command, passphrase = ""):
return p_exitcode
def gpg_encrypt(filename):
- tmp_filename = Utils.mktmpfile()
+ tmp_filename = Utils.mktmpfile( cfg.temp_location )
args = {
"gpg_command" : cfg.gpg_command,
"passphrase_fd" : "0",
@@ -1486,6 +1495,7 @@ def main():
optparser.add_option("-e", "--encrypt", dest="encrypt", action="store_true", help="Encrypt files before uploading to S3.")
optparser.add_option( "--no-encrypt", dest="encrypt", action="store_false", help="Don't encrypt files.")
+ optparser.add_option( "--temp-location", dest="temp_location", metavar="FOLDER", help="Location to store temporary files for encrypt. Add trailing / to signify directory and leave off to signify file prefix. Defaults to /tmp/tmpfile-")
optparser.add_option("-f", "--force", dest="force", action="store_true", help="Force overwrite and other dangerous operations.")
optparser.add_option( "--continue", dest="get_continue", action="store_true", help="Continue getting a partially downloaded file (only for [get] command).")
optparser.add_option( "--skip-existing", dest="skip_existing", action="store_true", help="Skip over files that exist at the destination (only for [get] and [sync] commands).")
@@ -1645,6 +1655,10 @@ def main():
## Some Config() options are not settable from command line
pass
+ ## if encrypt, can't really check size on sync
+ if cfg.encrypt:
+ cfg.sync_checks.remove("size")
+
## Special handling for tri-state options (True, False, None)
cfg.update_option("enable", options.enable)
cfg.update_option("acl_public", options.acl_public)
@@ -1786,6 +1800,7 @@ if __name__ == '__main__':
from S3.CloudFront import Cmd as CfCmd
from S3.CloudFront import CloudFront
from S3.FileLists import *
+ from S3.MetaData import MetaData
main()
sys.exit(0)