ZIP archive misidentified as video/x-ms-wmv #77

Open
mdavidn opened this issue Sep 14, 2022 · 2 comments

mdavidn commented Sep 14, 2022

I have a valid ZIP archive that happens to include the bytes wmv2 in the first four kilobytes. Active Storage misidentifies the file as Windows Media Video. When scanning over such a broad range of bytes, WMV magic needs a lower priority than other matches.

Marcel::MimeType.for Pathname.new('A-453.zip'), name: 'A-453.zip', declared_type: 'application/zip'
# => "video/x-ms-wmv"

File.read('A-453.zip')[0...4]
# => "PK\u0003\u0004"

File.read('A-453.zip').index('wmv2')
# => 585

`unzip -t A-453.zip`.chomp.split("\n").last
# => "No errors detected in compressed data of A-453.zip."
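To illustrate the priority point, here is a minimal sketch of a detector in which an anchored signature outranks a floating one. The `Rule` struct, the `matches?` and `detect` helpers, and the priority numbers are all hypothetical and are not Marcel's internals; the sketch only shows the ordering idea.

```ruby
# Hypothetical rule type; this is NOT how Marcel stores its magic data.
# range is the span of byte offsets at which the pattern may start.
Rule = Struct.new(:mime, :pattern, :range, :priority)

# Priorities here are illustrative values, not taken from Marcel's data.
RULES = [
  Rule.new("application/zip", "PK\x03\x04".b, 0..0, 50),    # anchored at offset 0
  Rule.new("video/x-ms-wmv", "wmv2".b, 0..8192, 40)         # floating match
].freeze

# True if the pattern starts somewhere inside the rule's offset range.
def matches?(rule, data)
  idx = data.index(rule.pattern, rule.range.begin)
  !idx.nil? && idx <= rule.range.end
end

# Returns the MIME type of the highest-priority matching rule, or nil.
def detect(data, rules = RULES)
  bytes = data.b
  rules.select { |rule| matches?(rule, bytes) }.max_by(&:priority)&.mime
end

# A ZIP header followed by padding so that "wmv2" lands at byte 585.
sample = "PK\x03\x04".b + ("\x00" * 581).b + "wmv2".b
detect(sample)
# => "application/zip"
```

Both rules match the sample, so the result depends entirely on which rule carries the higher priority.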

mdavidn commented Sep 14, 2022

Here's my workaround for now, added to an initializer.

if Marcel::MimeType.for("PK\03\04wmv2") == 'video/x-ms-wmv'
  Marcel::Magic.remove('video/x-ms-wmv')
end


pixeltrix commented Sep 29, 2022

Just been bitten by this for a PDF as well. Looking at the definition here, it seems that any instance of the string wmv2 in the first 8 KB will trigger this match:

marcel/data/tika.xml

Lines 7701 to 7715 in 8e28563

<mime-type type="video/x-ms-wmv">
  <sub-class-of type="video/x-ms-asf" />
  <glob pattern="*.wmv"/>
  <magic priority="60">
    <match value="Windows Media Video" type="unicodeLE" offset="0:8192" />
    <match value="VC-1 Advanced Profile" type="unicodeLE" offset="0:8192" />
    <match value="wmv2" type="unicodeLE" offset="0:8192" />
  </magic>
</mime-type>
<mime-type type="video/x-ms-wmx">
  <glob pattern="*.wmx"/>
</mime-type>
<mime-type type="video/x-ms-wvx">
  <glob pattern="*.wvx"/>
</mime-type>

This seems wildly broad for a magic string. I think the issue is that the Tika rule is designed to match a codec type, so it would only apply in the context of a file ending in .wmv, whereas Marcel applies it as a general magic string. There could be other mismatches like this in the Tika source file 😬
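One rough way to look for similar rules is to scan the XML for short match values paired with wide offset ranges. The sketch below is only an assumption about how such an audit could look: `broad_matches` and its thresholds are hypothetical, the sample is the excerpt quoted above rather than the full file, and the regex-based scan is a shortcut that a real XML parser would handle more robustly.

```ruby
# Excerpt copied from the tika.xml lines quoted in this thread.
SAMPLE = <<~XML
  <mime-type type="video/x-ms-wmv">
    <sub-class-of type="video/x-ms-asf" />
    <glob pattern="*.wmv"/>
    <magic priority="60">
      <match value="Windows Media Video" type="unicodeLE" offset="0:8192" />
      <match value="VC-1 Advanced Profile" type="unicodeLE" offset="0:8192" />
      <match value="wmv2" type="unicodeLE" offset="0:8192" />
    </magic>
  </mime-type>
  <mime-type type="video/x-ms-wmx">
    <glob pattern="*.wmx"/>
  </mime-type>
XML

# Hypothetical helper: lists <match> rules whose value is short but whose
# offset range is wide. Both thresholds are arbitrary choices.
def broad_matches(xml, max_value_length: 4, min_span: 1024)
  results = []
  xml.scan(%r{<mime-type type="([^"]+)">(.*?)</mime-type>}m) do |mime, body|
    body.scan(/<match\s([^>]*)>/).flatten.each do |attr_text|
      # Turn 'value="wmv2" offset="0:8192"' into a hash of attributes.
      attrs = attr_text.scan(/(\w+)="([^"]*)"/).to_h
      value = attrs["value"]
      offset = attrs["offset"]
      next unless value && offset

      first, last = offset.split(":").map(&:to_i)
      last ||= first # a single offset such as "0" has no range
      next unless value.length <= max_value_length && (last - first) >= min_span

      results << { mime: mime, value: value, offset: offset }
    end
  end
  results
end

broad_matches(SAMPLE)
# => [{:mime=>"video/x-ms-wmv", :value=>"wmv2", :offset=>"0:8192"}]
```

Running the same helper over the complete data file would need the thresholds tuned, since a short value is not necessarily a bad rule when it is anchored at a fixed offset.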
