MD5 cache file not updated after each calculated hash #641

Open
ramonsmits opened this issue Oct 2, 2015 · 5 comments

Comments

@ramonsmits

I'm currently running an s3cmd sync operation on my laptop from S3 to my NAS. The operation has been running for hours because the MD5 hashes need to be generated. I added the --cache-file option so that the MD5 hashes will be stored.

However, I just looked at the folder and I don't see that file. Does that mean the cache file is only flushed to storage at the end of the sync operation?

Why isn't the cache file flushed after each calculated MD5? I cannot stop the current operation, because then all calculated MD5 hashes are gone.

I'm running the operation with --verbose, and after almost 11 hours it says 9000/40244, so it is now at about 22%.

@mdomsch
Contributor

mdomsch commented Oct 3, 2015

Yes, the cache is stored as a Python pickle, written at the end of the local directory walk, not at the end of the sync operation. It depends on which direction you are syncing: if syncing local to remote, the local file list is read first, the cache saved, then the remote file list is read. If syncing remote to local, the remote list is read first, then the local list is read and the cache saved. Rewriting the pickle after each file is read would be crazy. Now, it could be stored in a different format, one more suitable to appending, I suppose, but pickles were the easy choice and have worked thus far.
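For readers unfamiliar with the mechanism, here is a minimal sketch of a pickle-backed MD5 cache along the lines described above. The `Md5Cache` class and its key scheme are hypothetical illustrations, not s3cmd's actual code:

```python
import pickle


class Md5Cache:
    """Hypothetical pickle-backed MD5 cache (illustrative, not s3cmd's class).

    Entries are keyed by (inode, mtime, size) so that a changed file
    automatically invalidates its cached hash."""

    def __init__(self, path):
        self.path = path
        self.entries = {}

    def load(self):
        # A missing or unreadable cache file simply means an empty cache.
        try:
            with open(self.path, "rb") as f:
                self.entries = pickle.load(f)
        except (OSError, pickle.UnpicklingError):
            self.entries = {}

    def get(self, inode, mtime, size):
        return self.entries.get((inode, mtime, size))

    def put(self, inode, mtime, size, md5):
        self.entries[(inode, mtime, size)] = md5

    def save(self):
        # Written once, after the whole directory walk -- an abort
        # before this point loses every hash computed so far.
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)
```

The single `save()` at the end is exactly why an interrupted run loses all of its work.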


@ramonsmits
Author

I'm not familiar with Python, but is a pickle a sort of hash table or dictionary? Yes, saving that to disk after each file would be weird.

Is a graceful abort possible that still writes the pickle to disk, in case you want to reboot?
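A graceful abort could in principle be approximated with a signal handler that dumps the in-memory cache before exiting. A minimal sketch under assumed names (the cache layout and path below are illustrative, not s3cmd code):

```python
import os
import pickle
import signal
import sys
import tempfile

# Hypothetical in-memory cache, filled as files are hashed.
cache = {}
# Hypothetical on-disk location for the dump.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "hash.cache")


def flush_and_exit(signum, frame):
    # Persist whatever has been computed so far, then exit cleanly.
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(cache, f)
    sys.exit(1)


# SIGINT covers Ctrl-C; SIGTERM covers a polite kill before a reboot.
signal.signal(signal.SIGINT, flush_and_exit)
signal.signal(signal.SIGTERM, flush_and_exit)
```

With handlers like these installed, interrupting a long run would cost nothing but the file currently being hashed.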

A solution would be a log file that gets appended to. At start, you load the pickle, then the log if it exists, and update the pickle. After the file scan, store the pickle and delete the log file.

Another option would be to store the pickle at a fixed interval, so that if the operation is aborted, only the work of one interval is lost.

I prefer the first.
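The load-pickle-then-replay-log scheme proposed above could be sketched like this (function names and the one-line-per-hash log format are hypothetical, not anything s3cmd implements):

```python
import os
import pickle


def append_log(log_path, key, md5):
    # One line per computed hash; appending a line is cheap compared
    # to rewriting the whole pickle after every file.
    with open(log_path, "a") as f:
        f.write("%s %s\n" % (md5, key))


def load_cache(pickle_path, log_path):
    # Load the last complete snapshot, if any.
    try:
        with open(pickle_path, "rb") as f:
            cache = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        cache = {}
    # Replay the log left behind by a previous aborted run, if any.
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                md5, key = line.rstrip("\n").split(" ", 1)
                cache[key] = md5
    return cache


def finish_scan(cache, pickle_path, log_path):
    # After a successful walk: persist the snapshot, drop the log.
    with open(pickle_path, "wb") as f:
        pickle.dump(cache, f)
    if os.path.exists(log_path):
        os.remove(log_path)
```

An aborted run then costs at most one file's hash, since everything else survives in the log.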

That explains why people mention that the MD5 cache is not working. I'm now syncing 40,000+ files, and if I quit the terminal, all MD5 hash data is gone.

Let me dive into Python; maybe I can contribute to s3cmd.

@ramonsmits
Author

Today my Wi-Fi connection failed and the sync quit. No MD5 hashes had been flushed to disk at all.

Also, having a binary file that is only written at the end of the batch makes it impossible for multiple invocations to share the same cache file.

An alternative is to create a .md5 file for each file and either:

  • Store it in the same folder
  • Store it in a separate folder

Or use a file per folder, perhaps even in the same text format as md5sum.

Or use a file per tree.
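The per-folder, md5sum-style variant could look roughly like this (the `.md5` filename and helper functions are assumptions for illustration; the `<md5>  <name>` line format with two spaces matches md5sum's text mode, so `md5sum -c .md5` could verify a folder independently):

```python
import os


def write_folder_cache(folder, hashes):
    # Write an md5sum-compatible text file: "<md5>  <name>", one per line.
    # Plain text appends and diffs cleanly, unlike a binary pickle.
    with open(os.path.join(folder, ".md5"), "w") as f:
        for name, digest in sorted(hashes.items()):
            f.write("%s  %s\n" % (digest, name))


def read_folder_cache(folder):
    # Returns {} when no cache file exists yet for this folder.
    hashes = {}
    path = os.path.join(folder, ".md5")
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                # The digest is hex and contains no spaces, so splitting
                # on the first double space is unambiguous.
                digest, name = line.rstrip("\n").split("  ", 1)
                hashes[name] = digest
    return hashes
```

A per-folder file also limits the blast radius of an abort to the folder currently being scanned.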

@d4v3y0rk

Wow, this does not seem to have gotten any love in a long time. Was there ever a resolution? I am currently facing the same issue: every time I run the sync command, it has to calculate 40k MD5 hashes...

@rchavez-neu

Same issue here. When syncing 8,000 files, s3cmd seems to generate an MD5 every time and doesn't save the MD5 results to a cache on local disk (s3cmd version 2.3.0). It makes for really long sync times.

Does anyone have any ideas?
