sync --preserve checks mtime #35

Closed
wants to merge 1 commit into from

1 participant

@mdomsch
s3tools member

This causes an extra HEAD request for each remote file, which greatly
slows down execution and increases monetary cost ($0.01 per 10,000
requests), but guarantees that files whose mtime has changed will get
resynced.

This is necessary for yum repositories, where repodata/* files may be
updated without changing size. It also correctly handles large files
whose MD5 values as returned by S3 are incorrect, when their content
(and thus mtime) has changed, perhaps by RPM signing.
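The per-file check this patch describes can be sketched as a pure decision function. This is a minimal illustration, not s3cmd's actual code: the `mtime` metadata key stands in for whatever attribute header `--preserve` stores, and the dict mimics the user-metadata portion of a HEAD response.

```python
def needs_resync(local_mtime: int, remote_metadata: dict) -> bool:
    """Decide whether a file must be re-uploaded based on stored mtime.

    remote_metadata mimics the user-metadata dict an S3 HEAD request
    returns; the 'mtime' key is an assumption standing in for s3cmd's
    preserved-attributes header.
    """
    remote_mtime = remote_metadata.get("mtime")
    if remote_mtime is None:
        # No stored mtime on the remote object: be safe and resync.
        return True
    return int(remote_mtime) != int(local_mtime)

# Simulated HEAD responses: unchanged vs. touched file.
print(needs_resync(1342200000, {"mtime": "1342200000"}))  # False
print(needs_resync(1342200000, {"mtime": "1342100000"}))  # True
```

The cost discussed above comes from issuing one such HEAD request per remote object just to populate `remote_metadata`.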

@mdomsch mdomsch sync --preserve checks mtime
282136a
@mdomsch
s3tools member

We're trading a ton of local disk I/O (calculating an MD5 for each file) for a HEAD call to S3 for each file.
We do get the LastModified (upload) time from S3 without the HEAD call.
I wonder if we can simply look at files with mtimes newer than LastModified...
and assume that if a file's mtime is newer than LastModified, it needs to be updated.

For regularly occurring sync runs, I think that's valid...
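The heuristic proposed above can be sketched like this (function name is hypothetical; `LastModified` is the timezone-aware upload timestamp that an S3 bucket listing already returns for each object, so no HEAD request is needed):

```python
from datetime import datetime, timezone

def modified_since_upload(local_mtime: float, last_modified: datetime) -> bool:
    # If the local file's mtime is newer than the time S3 says the
    # object was uploaded (LastModified from the bucket listing),
    # assume the file changed and should be re-uploaded.
    return local_mtime > last_modified.timestamp()

# Example: object uploaded at noon UTC, local file touched an hour later.
uploaded = datetime(2012, 7, 14, 12, 0, 0, tzinfo=timezone.utc)
print(modified_since_upload(uploaded.timestamp() + 3600, uploaded))  # True
print(modified_since_upload(uploaded.timestamp() - 3600, uploaded))  # False
```

As the comment notes, this is only sound for regularly occurring sync runs, since LastModified records the upload time rather than the file's original mtime.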

@mdomsch
s3tools member

With this patch, syncing takes 10x longer. Probably the wrong approach, then. Maybe LastModified as a proxy for mtime is good enough...

@mdomsch
s3tools member

Killing this pull request. What I've done elsewhere in my tree works better w/o the I/O penalty.

@mdomsch mdomsch closed this Jul 14, 2012