Filenames containing ? give warning : 'extension mismatch' #129

Closed
workflowsguy opened this issue Jun 24, 2019 · 2 comments

@workflowsguy
commented Jun 24, 2019

When files are processed with sf, those whose filename contains a question mark (here, directly before the extension) are identified with the correct type, but an "extension mismatch" warning is still output, viz.

sf "/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3"
---
siegfried   : 1.7.12
scandate    : 2019-06-24T16:27:08+02:00
signature   : default.sig
created     : 2019-06-15T12:22:38+02:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V95.xml; container-signature-20180917.xml'
---
filename : '/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3'
filesize : 74564436
modified : 2019-06-21T17:03:54+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    basis   : 'byte match at [[0 3] [74560365 1151] [74562035 1151] [74563705 3]] (signature 1/8)'
    warning : 'extension mismatch'

I am running on macOS, where ? is an allowed character for filenames.

Thanks!

richardlehane self-assigned this Jun 25, 2019

richardlehane added the bug label Jun 25, 2019

@richardlehane

Owner

commented Jun 25, 2019

Thanks for this report, workflowsguy. An interesting bug! I'll look into it.

@richardlehane

Owner

commented Jun 25, 2019

I've found the offending code: https://github.com/richardlehane/siegfried/blob/master/internal/namematcher/namematcher.go#L149

The issue is that some filenames sit inside URLs (because of WARC scanning). Where sf thinks the name is a URL, it strips every character following a "?", because in a URL that marks the start of the query string. For example, it is trying to extract the name from a string like "http://www.mysite.com/file.pdf?user=richard".

But in your case, where the ? is legitimately part of a regular file name, this breaks extension matching.
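
To make the failure mode concrete, here is a minimal Go sketch of the behaviour described above. It is not siegfried's actual code: `naiveExt` and `schemeAwareExt` are hypothetical helpers written for illustration, and treating only names that contain "://" as URLs is an assumed heuristic, not necessarily the fix that will be adopted.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// naiveExt is a hypothetical helper that mimics the reported behaviour:
// everything from the first "?" onward is assumed to be a URL query string
// and is dropped before the extension is read.
func naiveExt(name string) string {
	if i := strings.Index(name, "?"); i >= 0 {
		name = name[:i]
	}
	return strings.TrimPrefix(path.Ext(name), ".")
}

// schemeAwareExt is a hypothetical alternative: it drops the query string
// only when the name looks like a URL (assumed heuristic: contains "://").
func schemeAwareExt(name string) string {
	if strings.Contains(name, "://") {
		if u, err := url.Parse(name); err == nil {
			name = u.Path
		}
	}
	return strings.TrimPrefix(path.Ext(name), ".")
}

func main() {
	local := "/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3"
	warc := "http://www.mysite.com/file.pdf?user=richard"

	fmt.Printf("%q\n", naiveExt(local))       // "2019-06-19 - Kulturkampf im Klassenzimmer"
	fmt.Printf("%q\n", naiveExt(warc))        // "pdf"
	fmt.Printf("%q\n", schemeAwareExt(local)) // "mp3"
	fmt.Printf("%q\n", schemeAwareExt(warc))  // "pdf"
}
```

With the naive version, the local file's "extension" becomes everything after the last remaining dot, which cannot match mp3, hence the warning. A check for "://" is only one possible way to separate the two cases; a regular file name could in principle contain that sequence too.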

I'll have a think about how to re-jig this bit of the code to fix it.

richardlehane added this to the 1.7.13 milestone Jul 2, 2019
