Reduce time stamp precision, take two #21

Closed
yemartin opened this issue Jan 3, 2022 · 14 comments

yemartin commented Jan 3, 2022

This is a follow-up to #12 that was closed with the introduction of <timechange>. But unfortunately, <timechange> does not solve the original problem:

cshatag still cannot detect bit corruption that happens during move or copy operations between two filesystems with different timestamp precisions.

With the same example as in #12, with one added command to simulate corruption during transfer, and with:

  • /tmp on my root filesystem (APFS)
  • /Volumes/Organizer from my NAS, mounted through SMB (SMB_3.1.1)

$ rm /Volumes/Organizer/test.bin \
; touch /tmp/test.bin \
&& cshatag -qq /tmp/test.bin \
&& mv /tmp/test.bin /Volumes/Organizer/ \
&& echo 'CORRUPTION' >> /Volumes/Organizer/test.bin \
&& cshatag /Volumes/Organizer/test.bin

Current behavior

<outdated> /Volumes/Organizer/test.bin
 stored: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1641197558.029917810
 actual: 4ef8ee0f9aaecb1597f22dfd7667af4a9b537e11e3aba08729647a882f9aff6e 1641197558.000000000

Expected behavior

<corrupt> /Volumes/Organizer/test.bin
 stored: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1641197558.029917810
 actual: 4ef8ee0f9aaecb1597f22dfd7667af4a9b537e11e3aba08729647a882f9aff6e 1641197558.000000000

<timechange> was a nice introduction for when the data has not changed. But when the data did change, we still need to ignore small time differences below a certain threshold, to differentiate between a legitimately <outdated> file and a <corrupt> one.

Suggested implementation

As per the discussion in #12, I suggest:

  • ignoring time differences of 2 seconds or less (*1)
  • for this to be the default behavior (*2)
  • and if necessary, to add a command line option to use the original exact timestamp comparison, or even specify a custom threshold.

*1: FAT has a 2-second precision for last-modified times
*2: With this new behavior as default, users may get a harmless false positive, but the file content is still good. If the behavior is opt-in, users would get false negatives, meaning corruption would go undetected.

Note: to get the false positive, the user would need to make a legitimate edit within 2 seconds of running cshatag against a given file, which is quite unlikely. And if it does happen, the file content is good anyway, so no harm done.

New status or not?

To keep things simple, I suggest we just do, when data has changed:

  • if time_delta <= threshold: corrupt
  • else (i.e. time_delta > threshold): outdated

but we can also consider introducing a new status, something like:

  • if time_delta == 0: corrupt
  • else if time_delta <= threshold: suspicious
  • else (i.e. time_delta > threshold): outdated
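
A minimal sketch of the two schemes above, written in Python for illustration rather than taken from cshatag. The function name, the `three_way` flag, and the "suspicious" status are hypothetical; the function only covers the case where the data has already been found to differ.

```python
def classify_changed(time_delta: float, threshold: float, three_way: bool = False) -> str:
    """Status for a file whose checksum changed.

    time_delta is the absolute mtime difference in seconds (hypothetical input).
    """
    if three_way:
        # Variant with an extra status for small but non-zero differences.
        if time_delta == 0:
            return "corrupt"
        if time_delta <= threshold:
            return "suspicious"
        return "outdated"
    # Simple variant: anything within the threshold counts as corruption.
    return "corrupt" if time_delta <= threshold else "outdated"


print(classify_changed(0.029917810, 2.0))                  # corrupt
print(classify_changed(0.029917810, 2.0, three_way=True))  # suspicious
print(classify_changed(2.5, 2.0))                          # outdated
```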

What do you think?

Owner

rfjakob commented Jan 3, 2022

Hi, thanks for the summary. I agree that this is an unsolved problem. Here are my comments:

  • ignoring time differences of 2 seconds or less (*1)

I think FAT does not matter as it does not support extended attributes. We can go for 100ns (NTFS).

  • for this to be the default behavior (*2)

OK

  • and if necessary, to add a command line option to use the original exact timestamp comparison, or even specify a custom threshold.

I think 100ns is precise enough for everything. No need for extra options.

  • New status or not?

"corrupt" is ok with 100ns. I don't see this happening by accident.

Author

yemartin commented Jan 4, 2022

Thank you for the feedback @rfjakob

I think FAT does not matter as it does not support extended attributes.

You are totally right, we can ignore FAT! However, SMB does matter: it is a very common file sharing protocol, and does support extended attributes.

The output I originally posted showed only 1 second because 58.029917810 rounded down to 58, but here is a test run that shows a greater than 1 second difference. Here 17.846418404 was rounded down to 16:

<outdated> /Volumes/Organizer/test.bin
 stored: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1641335917.846418404
 actual: 4ef8ee0f9aaecb1597f22dfd7667af4a9b537e11e3aba08729647a882f9aff6e 1641335916.000000000

It is not always rounding down to the closest even integer. Sometimes it rounds down to the closest odd one. Here 75.475935745 was rounded down to 73:

<outdated> /Volumes/Organizer/test.bin
 stored: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1641336175.475935745
 actual: 4ef8ee0f9aaecb1597f22dfd7667af4a9b537e11e3aba08729647a882f9aff6e 1641336173.000000000

Oh darn... In this last one, the time difference is even greater than 2 seconds!
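
For reference, the gaps in the two outputs above can be computed exactly with Python's `decimal` module (floats would lose the last digits of a nanosecond timestamp). This is only a worked subtraction of the values already shown; the variable names are illustrative.

```python
from decimal import Decimal

# (stored, actual) timestamp pairs copied from the two outputs above.
pairs = [
    ("1641335917.846418404", "1641335916.000000000"),
    ("1641336175.475935745", "1641336173.000000000"),
]

gaps = [Decimal(stored) - Decimal(actual) for stored, actual in pairs]
for gap in gaps:
    print(gap)
# 1.846418404
# 2.475935745
```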

Let me do some extra testing and report back here.

Author

yemartin commented Jan 5, 2022

Thanks again for your feedback @rfjakob, it pushed me to get hard data and realize that we may actually need to ignore time differences of up to 3 seconds, maybe even more.

I ran 1000 tests, calculating some aggregates using:

$ for i in $(seq 1000); do \
rm -f /Volumes/Organizer/test.bin \
; touch /tmp/test.bin \
&& cshatag -qq /tmp/test.bin \
&& mv /tmp/test.bin /Volumes/Organizer/ \
&& echo 'CORRUPTION' >> /Volumes/Organizer/test.bin \
&& cshatag /Volumes/Organizer/test.bin \
| awk '{if ($3 > 0) print $3}' | paste -sd- - | bc \
; done \
| datamash min 1 max 1 median 1 mean 1
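
The `awk | paste | bc` stage in the loop above picks the third field of the "stored:" and "actual:" lines and subtracts the second from the first. A rough Python equivalent of that one step, assuming the report layout shown earlier in this issue (the function name is hypothetical):

```python
from decimal import Decimal


def mtime_delta(report: str) -> Decimal:
    """Return stored mtime minus actual mtime from one cshatag report."""
    stamps = {}
    for line in report.splitlines():
        fields = line.split()
        # Only the "stored:" and "actual:" lines carry a timestamp in field 3.
        if len(fields) == 3 and fields[0] in ("stored:", "actual:"):
            stamps[fields[0]] = Decimal(fields[2])
    return stamps["stored:"] - stamps["actual:"]


sample = """<outdated> /Volumes/Organizer/test.bin
 stored: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1641197558.029917810
 actual: 4ef8ee0f9aaecb1597f22dfd7667af4a9b537e11e3aba08729647a882f9aff6e 1641197558.000000000"""

print(mtime_delta(sample))  # 0.029917810
```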

The results were (about 1 hour ago):

Aggregate   Value
min         0.852231178
max         2.763420108
median      2.251456769
mean        2.240907976831

Running the same again now, I get quite different results:

Aggregate   Value
min         1.916112147
max         2.965790765
median      2.447857228
mean        2.449437720647

The min is now barely below 2 seconds, and the max dangerously close to 3. I wonder if it would ever go above 3...

The time of the day may matter. I'll put this script in a cron job, run it throughout the day and see what that gives us. I'll report back here when done.

Author

yemartin commented Jan 5, 2022

The data did not make sense, so I looked closer and found the reason: the echo 'CORRUPTION' >> /Volumes/Organizer/test.bin modifies the timestamp! The fluctuations were just due to my machine load. Doh...

With the fixed script, I get much more sensible data. With 1000 runs:

$ for i in $(seq 1000); do \
rm -f /Volumes/Organizer/test.bin \
; touch /tmp/test.bin \
&& cshatag -qq /tmp/test.bin \
&& mv /tmp/test.bin /Volumes/Organizer/ \
&& cshatag /Volumes/Organizer/test.bin \
| awk '{if ($3 > 0) print $3}' | paste -sd- - | bc \
; done \
| datamash min 1 max 1 median 1 mean 1

We get this, which is quite conclusive:

Aggregate   Value
min         0.001664921
max         0.998636384
median      0.4940002175
mean        0.497302447519
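
These numbers are what one would expect if the fractional second is simply dropped: the lost part then falls in [0, 1) seconds and averages about 0.5. A small simulation under that assumption (uniformly random nanosecond timestamps, truncation to whole seconds; nothing here is measured from a real SMB mount):

```python
import random

NS_PER_S = 1_000_000_000


def truncation_deltas(samples: int, seed: int = 0) -> list:
    """Simulate stored-minus-actual gaps when the fractional second is dropped."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(samples):
        stored_ns = rng.randrange(0, 10 * NS_PER_S)   # arbitrary nanosecond timestamp
        actual_ns = stored_ns - stored_ns % NS_PER_S  # truncate to a whole second
        deltas.append(stored_ns - actual_ns)
    return deltas


deltas = truncation_deltas(1000)
mean_s = sum(deltas) / len(deltas) / NS_PER_S
print(min(deltas) / NS_PER_S, max(deltas) / NS_PER_S, mean_s)
```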

Updating the suggestion accordingly:

Suggested implementation

I suggest:

  • ignoring time differences of 1 second or less (*1)
  • for this to be the default behavior
  • and if necessary, to add a command line option to use the original exact timestamp comparison, or even specify a custom threshold.

*1: SMB, a commonly used file sharing protocol that supports extended attributes, has a 1 second precision

What do you think @rfjakob ?

Owner

rfjakob commented Jan 5, 2022

Hmm, I don't know what's going on on /Volumes/Organizer, but on my Synology SMB mount, the resolution is finer than 1 second:

$ touch foo ; stat foo | grep Modify
Modify: 2022-01-05 08:47:17.043150900 +0100

Author

yemartin commented Jan 5, 2022

OK, so it is not just SMB... It could be a limitation of the MacOS SMB client, or differences on the SMB server (my NAS is a QNAP). I'll do some more experimenting and will report back here.

Author

yemartin commented Jan 5, 2022

Here are my findings:

  • I could not reproduce the issue when mounting the SMB share in a Linux VM client. So the server is fine. The problem lies with the MacOS SMB client.
  • The issue seems to be that the MacOS SMB client can write timestamps with 100ns resolution, but reads them back with only 1 second resolution.

Here is the demonstration. Note: all the touch commands below are run on MacOS, to test the behavior of the MacOS SMB client. But the stat commands are run in the Linux VM, to show the correct timestamps.

1) The MacOS SMB client can write timestamps with 100ns resolution:

mac> touch foo
linux> stat foo | grep Modify
Modify: 2022-01-06 00:24:25.522884500 +0900

=> The file was written with a 100ns timestamp resolution

2) But it will read them back with 1s resolution:

To show this, I used touch -r [file], which reads the timestamps from the given file instead of using the current time:

mac> touch -r foo bar
linux> stat bar | grep Modify
Modify: 2022-01-06 00:24:25.000000000 +0900

=> The fractional part was lost, which shows that touch -r [file] read back the timestamps with only 1 second resolution. Or at least, that's what it looks like to me.

So anyway, to support SMB on MacOS, we would need to ignore time differences of 1 second or less. What do you think?
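
Anyone who wants to check their own mount can run a similar round trip from a single client. The sketch below sets a known mtime with a fractional second through the standard `os.utime` call and reads it back with `os.stat`; the function name and the probe value are illustrative. Note that it measures the combined write and read path of the client it runs on, so it cannot tell which of the two drops the fraction.

```python
import os
import tempfile


def lost_precision_ns(directory: str) -> int:
    """Set a known mtime on a scratch file in `directory` and report how many
    nanoseconds are missing when the mtime is read back."""
    wanted_ns = 1_641_197_558_029_917_810  # arbitrary value with a fractional second
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.utime(path, ns=(wanted_ns, wanted_ns))  # (atime_ns, mtime_ns)
        return wanted_ns - os.stat(path).st_mtime_ns
    finally:
        os.remove(path)


# Point this at the mounted share to probe it; the temp dir is only a default.
print(lost_precision_ns(tempfile.gettempdir()))
```

A result of 0 means the full nanosecond value survived; a result of 29917810 would mean the whole fractional second was dropped.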

Owner

rfjakob commented Jan 5, 2022

Does it always round down to the next second? If it does, we can do

  • If mtime_stored and mtime_actual are completely identical (including nanoseconds) => consider equal
  • If mtime_stored, rounded down to 100ns, is equal to mtime_actual => consider equal
  • If mtime_stored, rounded down to 1s, is equal to mtime_actual => consider equal
  • else consider different
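
The rule above could be sketched as follows, using integer nanoseconds to avoid float rounding. This is an illustration in Python, not cshatag's implementation, and the function name is hypothetical.

```python
NS_PER_S = 1_000_000_000


def mtimes_equal(stored_ns: int, actual_ns: int) -> bool:
    """True if actual_ns equals stored_ns exactly, or stored_ns rounded down
    to 100 ns, or stored_ns rounded down to 1 s."""
    for resolution_ns in (1, 100, NS_PER_S):
        if stored_ns - stored_ns % resolution_ns == actual_ns:
            return True
    return False


stored = 1_641_197_558_029_917_810
print(mtimes_equal(stored, stored))                      # True (identical)
print(mtimes_equal(stored, 1_641_197_558_029_917_800))   # True (100 ns)
print(mtimes_equal(stored, 1_641_197_558_000_000_000))   # True (1 s)
print(mtimes_equal(stored, 1_641_197_558_029_000_000))   # False
```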

Owner

rfjakob commented Jan 5, 2022

PS: The difference from just checking whether the mtimes are less than 100ns (or 1s) apart is that we only consider something equal when it has zeros at the end, which is very unlikely to occur by chance.
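
To make the distinction concrete, here is a small comparison of the two approaches. Both helper names are hypothetical, and `quick_edit` is a made-up timestamp standing in for a legitimate edit about half a second after tagging.

```python
NS_PER_S = 1_000_000_000


def within_window(stored_ns: int, actual_ns: int, window_ns: int) -> bool:
    """Window comparison: equal if the two mtimes are at most window_ns apart."""
    return abs(stored_ns - actual_ns) <= window_ns


def truncation_match(stored_ns: int, actual_ns: int, resolution_ns: int) -> bool:
    """Truncation comparison: equal only if actual is stored rounded down."""
    return stored_ns - stored_ns % resolution_ns == actual_ns


stored = 1_641_197_558_029_917_810
truncated_copy = 1_641_197_558_000_000_000  # fraction dropped during the move
quick_edit = 1_641_197_558_500_000_001      # hypothetical real edit shortly after

print(within_window(stored, truncated_copy, NS_PER_S))     # True
print(truncation_match(stored, truncated_copy, NS_PER_S))  # True
print(within_window(stored, quick_edit, NS_PER_S))         # True
print(truncation_match(stored, quick_edit, NS_PER_S))      # False
```

Only the truncation check tells the quick edit apart from the truncated copy, because a real edit would have to land on an exact whole second to be mistaken for one.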

Author

yemartin commented Jan 5, 2022

Ah, I see your point. It sounds like a good idea! One potential problem is: I am still not sure where the rounding happens exactly. I just found that it is not as simple as "writes with 100ns but reads with 1s". The rounding may happen in multiple places too.

In this example, two "writes" (one for create, and one for update) happen with different resolutions! I just touch the same file twice, the first time to create it, the second time to update its timestamps:

mac> rm -f foo; touch foo
linux> stat foo | grep Modify
Modify: 2022-01-06 07:50:00.414947300 +0900 # <-- create with 100ns resolution

mac> touch foo
linux> stat foo | grep Modify
Modify: 2022-01-06 07:51:30.000000000 +0900 # <-- update with 1s resolution

Anyway, to answer your question, it does seem to be always rounding down. Even 28.91 was rounded down to 28.00:

mac> rm -f foo; touch foo; touch -r foo bar
linux> stat foo bar | grep Modify
Modify: 2022-01-06 07:40:28.917949500 +0900
Modify: 2022-01-06 07:40:28.000000000 +0900

Author

yemartin commented Jan 5, 2022

Oh, this keeps getting weirder: it is NOT as simple as "creates with 100ns, updates with 1s" either:

mac> rm -f foo; touch foo
linux> stat foo | grep Modify
Modify: 2022-01-06 07:58:18.485945400 +0900 # <-- create with 100ns resolution

mac> echo 'FOO' >> foo
linux> stat foo | grep Modify
Modify: 2022-01-06 07:59:58.326903800 +0900 # <-- update also with 100ns resolution

One potential explanation that is simple (so, by Occam's razor, more likely to be correct) and consistent with all previously seen behaviors:

  • When doing a touch foo on creation, or echo 'FOO' >> foo, the client does not explicitly set timestamps. The timestamps are therefore set on the server side, with a 100ns resolution.
  • But when cshatag reads timestamps, or when doing a touch to update an existing file (write timestamps), then the client explicitly reads/sets timestamps, and this happens with 1s resolution.

So the explanation for all this may simply be: the macOS SMB client handles timestamps (read or write) with a 1s resolution, losing the fractional seconds (hence rounding down).

Owner

rfjakob commented Oct 21, 2022

I think I should add --modify-window like rsync, quoting the rsync man page:

       --modify-window=NUM, -@
              When comparing two timestamps, rsync treats the timestamps as
              being equal if they differ by no more than the modify-window
              value.  The default is 0, which matches just integer seconds.  If
              you specify a negative value (and the receiver is at least version
              3.1.3) then nanoseconds will also be taken into account.
              Specifying 1 is useful for copies to/from MS Windows FAT
              filesystems, because FAT represents times with a 2-second
              resolution (allowing times to differ from the original by up to 1
              second).
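
Read literally, the man page text above describes a comparison along these lines. This Python sketch is modeled on the quoted wording only, not on rsync's source, and it is not an existing cshatag option; the function and parameter names are illustrative.

```python
def same_time(a_sec: int, a_nsec: int, b_sec: int, b_nsec: int,
              modify_window: int = 0) -> bool:
    """Compare two timestamps the way the quoted --modify-window text describes."""
    if modify_window < 0:
        # Negative window: nanoseconds are taken into account as well.
        return a_sec == b_sec and a_nsec == b_nsec
    # Zero or positive window: only whole seconds are compared.
    return abs(a_sec - b_sec) <= modify_window


# The first example from this issue: same second, different fraction.
print(same_time(1641197558, 29917810, 1641197558, 0))                    # True
print(same_time(1641197558, 29917810, 1641197558, 0, modify_window=-1))  # False
```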

Ken0sis commented Apr 11, 2023

I second (support) the proposed solution to this problem. I'm in the same boat as yemartin: my files are on an SMB network share and accessed via the MacOS SMB client. I'm facing the same issue: the timestamps cshatag reads on drives mounted over SMB have a lower resolution than those read on a local drive (even if both drives use the same format). I'm only seeing differences of 1s or less, though. However, this time difference makes it impossible to detect corrupt data when I copy files from a local drive to an SMB-mounted drive.

I would like to be able to use cshatag to check the integrity of file contents on network drives.

rfjakob added a commit that referenced this issue Apr 16, 2023
SMB only supports 100ns, so we may have missed corruption in
the following case:

1) Files are stored on ext4, cshatag runs and tags all files
2) Files are moved to SMB
3) Some file content gets corrupted during the move
4) cshatag considers these as "outdated" instead of "corrupt"

100ns should be good enough for everything, and avoids this
problem.

#21

Owner

rfjakob commented Apr 16, 2023

I went for the simpler route in the end. Resolution is now 1s on MacOS and 100ns on Linux.

Please test if that works as intended!
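
For readers who want to reason about the new behavior, here is one way a per-platform resolution could work, applied to the first example in this issue. It is a sketch under assumptions, not cshatag's actual code: the function names are hypothetical, and truncating both timestamps before comparing is a guess at the approach.

```python
import sys

NS_PER_S = 1_000_000_000


def default_resolution_ns(platform: str = sys.platform) -> int:
    """1 s on macOS ("darwin"), 100 ns elsewhere, as described in the comment above."""
    return NS_PER_S if platform == "darwin" else 100


def status_for_changed_file(stored_ns: int, actual_ns: int, resolution_ns: int) -> str:
    """Status for a file whose checksum differs (assumed comparison rule)."""
    same_mtime = stored_ns // resolution_ns == actual_ns // resolution_ns
    return "corrupt" if same_mtime else "outdated"


stored = 1_641_197_558_029_917_810
actual = 1_641_197_558_000_000_000
print(status_for_changed_file(stored, actual, default_resolution_ns("darwin")))  # corrupt
print(status_for_changed_file(stored, actual, default_resolution_ns("linux")))   # outdated
```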
