MD5 cache file not updated after each calculated hash #641

Open
ramonsmits opened this issue Oct 2, 2015 · 5 comments

Comments

@ramonsmits

I'm currently running an s3cmd sync operation on my laptop from S3 to my NAS. The operation has been running for hours because the MD5 hashes need to be generated. I added the --cache-file option so that the MD5 hashes will be stored.

However, I just looked at the folder and I don't see that file. Does that mean the cache file is only flushed to storage at the end of the sync operation?

Why isn't the cache file flushed after each calculated MD5? I cannot stop the current operation, because then all calculated MD5 hashes are gone.

I'm running the operation with --verbose, and after almost 11 hours it says 9000/40244, so it is now at about 22%.

@mdomsch
Contributor

mdomsch commented Oct 3, 2015

Yes, the cache is stored as a Python pickle, written at the end of the local directory walk, not at the end of the sync operation. It depends on which direction you are syncing: if syncing local to remote, the local file list is read first, the cache saved, then the remote file list is read. If syncing remote to local, the remote list is read first, then the local list is read and the cache saved. Rewriting the pickle after each file is read would be crazy. Now, it could be stored in a different format, one more suitable to appending, I suppose, but pickles were the easy choice and have worked thus far.
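For readers unfamiliar with the mechanism, here is a minimal sketch of a pickle-backed MD5 cache along the lines described above. The `Md5Cache` class and its key scheme are hypothetical illustrations, not s3cmd's actual code:

```python
import pickle


class Md5Cache:
    """Hypothetical pickle-backed MD5 cache (illustrative, not s3cmd's class).

    Entries are keyed by (inode, mtime, size) so that a changed file
    automatically invalidates its cached hash."""

    def __init__(self, path):
        self.path = path
        self.entries = {}

    def load(self):
        # A missing or unreadable cache file simply means an empty cache.
        try:
            with open(self.path, "rb") as f:
                self.entries = pickle.load(f)
        except (OSError, pickle.UnpicklingError):
            self.entries = {}

    def get(self, inode, mtime, size):
        return self.entries.get((inode, mtime, size))

    def put(self, inode, mtime, size, md5):
        self.entries[(inode, mtime, size)] = md5

    def save(self):
        # Written once, after the whole directory walk -- an abort
        # before this point loses every hash computed so far.
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)
```

The single `save()` at the end is exactly why an interrupted run loses all of its work.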


@ramonsmits
Author

I'm not familiar with Python, but is a pickle a sort of hash table or dictionary? Yes, saving that to disk after each file would be weird.

Is a graceful abort possible that still writes the pickle to disk, in case you want to reboot?
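A graceful abort could in principle be approximated with a signal handler that dumps the in-memory cache before exiting. A minimal sketch under assumed names (the cache layout and path below are illustrative, not s3cmd code):

```python
import os
import pickle
import signal
import sys
import tempfile

# Hypothetical in-memory cache, filled as files are hashed.
cache = {}
# Hypothetical on-disk location for the dump.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "hash.cache")


def flush_and_exit(signum, frame):
    # Persist whatever has been computed so far, then exit cleanly.
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(cache, f)
    sys.exit(1)


# SIGINT covers Ctrl-C; SIGTERM covers a polite kill before a reboot.
signal.signal(signal.SIGINT, flush_and_exit)
signal.signal(signal.SIGTERM, flush_and_exit)
```

With handlers like these installed, interrupting a long run would cost nothing but the file currently being hashed.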

A solution would be a log file that gets appended to. At start, you load the pickle, then the log if it exists, and update the pickle. After the file scan, store the pickle and delete the log file.

Another option would be to store the pickle at a fixed interval, so that if the operation is aborted, only the work of one interval is lost.

I prefer the first.
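The load-pickle-then-replay-log scheme proposed above could be sketched like this (function names and the one-line-per-hash log format are hypothetical, not anything s3cmd implements):

```python
import os
import pickle


def append_log(log_path, key, md5):
    # One line per computed hash; appending a line is cheap compared
    # to rewriting the whole pickle after every file.
    with open(log_path, "a") as f:
        f.write("%s %s\n" % (md5, key))


def load_cache(pickle_path, log_path):
    # Load the last complete snapshot, if any.
    try:
        with open(pickle_path, "rb") as f:
            cache = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        cache = {}
    # Replay the log left behind by a previous aborted run, if any.
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                md5, key = line.rstrip("\n").split(" ", 1)
                cache[key] = md5
    return cache


def finish_scan(cache, pickle_path, log_path):
    # After a successful walk: persist the snapshot, drop the log.
    with open(pickle_path, "wb") as f:
        pickle.dump(cache, f)
    if os.path.exists(log_path):
        os.remove(log_path)
```

An aborted run then costs at most one file's hash, since everything else survives in the log.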

That explains why people mention that the MD5 cache is not working. I'm now syncing 40,000+ files, and if I quit the terminal, all MD5 hash data is gone.

Let me dive into Python; maybe I can contribute to s3cmd.

@ramonsmits
Author

Today my Wi-Fi connection failed and the sync quit. No MD5 hashes had been flushed to disk at all.

Also, having a binary file that is only written at the end of the batch makes it impossible for multiple invocations to share the same cache file.

An alternative is to create a .md5 file for each file and either:

  • Store it in the same folder
  • Store it in a separate folder

Or use a file per folder, perhaps even in the same text format as md5sum.

Or use a file per tree.
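The per-folder, md5sum-style variant could look roughly like this (the `.md5` filename and helper functions are assumptions for illustration; the `<md5>  <name>` line format with two spaces matches md5sum's text mode, so `md5sum -c .md5` could verify a folder independently):

```python
import os


def write_folder_cache(folder, hashes):
    # Write an md5sum-compatible text file: "<md5>  <name>", one per line.
    # Plain text appends and diffs cleanly, unlike a binary pickle.
    with open(os.path.join(folder, ".md5"), "w") as f:
        for name, digest in sorted(hashes.items()):
            f.write("%s  %s\n" % (digest, name))


def read_folder_cache(folder):
    # Returns {} when no cache file exists yet for this folder.
    hashes = {}
    path = os.path.join(folder, ".md5")
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                # The digest is hex and contains no spaces, so splitting
                # on the first double space is unambiguous.
                digest, name = line.rstrip("\n").split("  ", 1)
                hashes[name] = digest
    return hashes
```

A per-folder file also limits the blast radius of an abort to the folder currently being scanned.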

@d4v3y0rk

Wow, this does not seem to have gotten any love in a long time. Was there ever a resolution? I am currently facing the same issue: every time I run the sync command, it has to calculate 40k MD5 hashes...

@rchavez-neu

Same issue here. When syncing 8,000 files, s3cmd seems to generate an MD5 every time and doesn't save the MD5 results to a cache on local disk (s3cmd version 2.3.0). It makes for really long sync times.

Does anyone have any ideas?
