Siegfried seems to skip certain files without error or warning #115

Closed
MSavels opened this issue May 15, 2018 · 10 comments

@MSavels

MSavels commented May 15, 2018

Hi,

I'm currently comparing the results from DROID and Siegfried (through Brunnhilde). In a dataset containing 216420 files, there are only 2537 discrepancies between the two (roughly 1%), which imho is not bad. However, in my test at least 50% of these discrepancies are due to Siegfried apparently skipping a file. A comparison of the outputs by roy reports these files as "missing" from the Siegfried CSV, and a manual check of the CSV confirms they aren't there, so this is no mistake by roy. I redid the Brunnhilde analysis several times and each time the same files were skipped.

I analysed a few of these files (TIFFs in this case) with other programs (JHOVE, DPF Manager) and there seemed to be nothing wrong with them. I also checked whether it might be due to long paths/filenames, non-standard characters in the filename, too many files in a directory or extremely large files, but none of these seemed to be the problem. This was confirmed by analysing each file individually with Siegfried: the files were correctly analysed. But when I analysed the directory directly with Siegfried, the same files were skipped again. I have no idea why, but I can provide you with the files and the different analyses if you need them.

Kind regards,

Maarten

@richardlehane
Owner

Hi Maarten
Thanks for reporting this. It is a strange issue and, I confess, I'm a bit stumped!

Could you advise what OS you're on and what version of siegfried (sf -version)?
Do the files have any access restrictions different to other files in the directory (I'd still expect an error but possibly worth checking)?

Getting the files from you likely won't help if they can be identified individually; the problem seems to have more to do with their place in the file system. But if you could narrow down the issue and provide a zipped minimal directory of selected files that triggers it, that would be a great help. Happy for you to send things to richard@itforarchivists.com

cheers
Richard

@richardlehane richardlehane self-assigned this May 16, 2018
@MSavels
Author

MSavels commented May 16, 2018

Hi Richard,

The OS is CentOS Linux 7.4.1708.
The Siegfried version is 1.7.8, with signature V93 and containers sig 20171130.
The files themselves are on a different server, mounted on the CentOS server.

I checked the rights too, no anomalies there: all files have the same permissions, regardless of whether they were skipped or analysed.

I'll shortly be sending you a package containing 236 files. 4 of them were consistently skipped during additional tests. The other ones are all the files in one directory that was skipped entirely.

However (the plot thickens), I redid the same tests on a back-up I have of these files (the files are totally identical; they have the same SHA-256 hash values). Here the previously skipped files were analysed as normal, but different files were skipped. So I doubt it has anything to do with the files themselves, and more to do with the way a list of them is built.

Kind regards,

Maarten

@richardlehane
Owner

Thanks Maarten, I'm downloading the files now.

If you're scanning files over a network connection, it might be worth trying the -throttle flag to see if it assists. E.g. sf -throttle 50ms DIR. This may help narrow the issue down.

@MSavels
Author

MSavels commented May 16, 2018

Tried it both with -throttle 50ms and 100ms. The same files were skipped.

@richardlehane
Owner

The files all scanned correctly on my Windows laptop (i.e. 236 files in the zip, and 236 files in the results file). This does seem to be related to the way sf is walking your file system, rather than relating to the file contents.

@richardlehane
Owner

richardlehane commented May 16, 2018

OK this golang bug seems like a possible cause: golang/go#24015

Unfortunately, if this is the bug, then it may be necessary to wait for a Red Hat update to fix it. In later versions of the Linux kernel (> 3.10) this problem seems to have been fixed.

@richardlehane
Owner

richardlehane commented May 16, 2018

If this is a kernel bug, a workaround pending a fix may be to use another tool like ls or find to manage the directory walk and pipe the list of files to sf for scanning.

Like:

find DIR -type f | sf -f -
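
The same idea can be sketched in Python for cases where the list needs filtering or logging before it reaches sf. This is a minimal sketch and not part of siegfried: `list_regular_files` and `scan_with_sf` are hypothetical helper names, the `sf` binary is assumed to be on the PATH, Python 3.7 or later is assumed, and the only sf invocation used is the `-f -` form shown above. Whether Python's directory listing avoids the suspected bug on an affected mount is an assumption that has not been tested in this thread.

```python
import os
import subprocess


def list_regular_files(root):
    """Return sorted paths of regular files under root (roughly `find root -type f`)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, as `-type f` would.
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(path)
    return sorted(paths)


def scan_with_sf(paths, sf_binary="sf"):
    """Send newline-separated paths to `sf -f -` on stdin and return its stdout.

    Assumption: sf_binary names an installed siegfried executable.
    """
    result = subprocess.run(
        [sf_binary, "-f", "-"],
        input="\n".join(paths) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if sf exits with a non-zero status
    )
    return result.stdout
```

Calling `scan_with_sf(list_regular_files("DIR"))` would then play the role of the pipeline above, with the file list also available in Python for counting or logging.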

@MSavels
Author

MSavels commented Jun 1, 2018

The golang bug-workaround (enforcing CIFS version 1.0 on mount) didn't work. The same files were skipped. Piping the list in from find, however, did work. No files were skipped then. So for me, that solved it. Thanks for the help.
Kind regards,
Maarten
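
A check like this can be scripted so that skipped files are caught without manual inspection. The sketch below compares an expected list of paths (for example, the output of find) with the rows of a Siegfried CSV. The function name `missing_from_csv` is hypothetical, and the `filename` column name is an assumption about the CSV header; pass a different `column` if your output differs.

```python
import csv
import io


def missing_from_csv(expected_paths, csv_text, column="filename"):
    """Return the expected paths that have no row in the CSV, sorted.

    Assumption: `column` is the header of the CSV column holding file paths.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {row[column] for row in reader}
    return sorted(set(expected_paths) - seen)
```

The comparison is an exact string match, so both lists need to spell paths the same way (both absolute, or both relative to the same directory).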

@richardlehane
Owner

The recent golang 1.11 release has introduced a fix for this issue. I'm hopeful that a siegfried binary built with 1.11 will resolve this.

Unfortunately, v1.7.9 binaries are still built with 1.10, as that is the current release supported by travis/appveyor. So I will leave this open until the release binaries are built with 1.11.

@richardlehane
Owner

Hi Maarten
I released v1.7.10 with new binaries built with golang 1.11. This should, I believe, finally fix this issue. I will close it now, but please reopen if you can still reproduce it.
cheers
Richard
