Google Docs/Sheets/PPT detection is being limited to 64kb (offset 30:65536) #59

Open
nvh0412 opened this issue Jun 20, 2021 · 3 comments

nvh0412 commented Jun 20, 2021

Hi team,

Thanks for migrating this gem to Tika and replacing the mimemagic gem. We're using the latest gem version in production and so far so good. Great work, and thank you for it!

We just found out that certain xlsx and docx files uploaded by our users are being misdetected as application/zip, the same as in issue #35.

But it only happens with files larger than 64kb.

Summary:

There were 3 xlsx files:

  1. test.xlsx => 5kb => mimetype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  2. test2.xlsx => 30kb => mimetype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  3. test3.xlsx => 368kb => mimetype: application/zip

The root cause of the 3rd case is that the match for [Content_Types].xml is only attempted within offset 30:65536, and it fails because Google Docs/Sheets files have the fingerprint items at the end of the file.

Can we implement a negative offset to read from the end of the file for these cases?
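To illustrate the failure mode described above, here is a minimal sketch in plain Ruby. It is not Marcel's actual matching code: `marker_in_window?` is a hypothetical helper, and it assumes that an offset of 30:65536 means the pattern may start at any byte from 30 through 65536.

```ruby
# Hypothetical helper, not part of Marcel: looks for the OOXML marker only
# inside a fixed byte window, which is assumed to mimic a 30:65536 magic rule.
MARKER = "[Content_Types].xml".b

def marker_in_window?(bytes, from: 30, to: 65_536)
  # The pattern may start anywhere in from..to, so the window must also
  # cover the bytes of a match that starts at the last allowed offset.
  window = bytes.b.byteslice(from, to - from + MARKER.bytesize)
  !window.nil? && window.include?(MARKER)
end

# Synthetic data: the marker sits right after a 30-byte header in one case
# and after roughly 100 KB of other content in the other.
small = ("\x00" * 30).b + MARKER
large = ("\x00" * 100_000).b + MARKER

p marker_in_window?(small) # => true
p marker_in_window?(large) # => false
p large.include?(MARKER)   # => true, but only by scanning everything
```

The last line shows why a full scan would find the marker while the windowed match does not.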

@gmcgibbon gmcgibbon self-assigned this Sep 29, 2021
@gmcgibbon
Member

I looked into the negative offset approach but I'm not seeing consistent patterns of the placement of [Content_Types].xml in hex dumped files... It seems it can appear at the beginning or end of the file (see the two files in the repo). We may have to scan the entire file for this pattern, but I don't see another example of that in the DB.


dandynaufaldi commented May 19, 2023

Hello @nvh0412 @gmcgibbon, did you ever find a solution or workaround for this?

I also have an issue detecting a docx file. Mine is 1.5 MB and is always detected as application/zip. The file was exported from Google Docs.

When I check a smaller file, it works just fine and I get application/vnd.openxmlformats-officedocument.wordprocessingml.document.

*update

If I change how I call Marcel by also supplying the name argument, from

Marcel::MimeType.for(docx)

into

Marcel::MimeType.for(docx, name: docx_path)

I'm able to get application/vnd.openxmlformats-officedocument.wordprocessingml.document instead of application/zip

With the above change, can I still safely detect whether a file is an actual docx? I'm using Marcel to reject files with non-whitelisted MIME types.

Thank you

Member

jeremy commented Mar 1, 2024

This is pretty unfortunate and very common. Thankfully there's a decent fallback via the file extension.

Any alternative approach to sniffing these files where we needn't load the whole thing?
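One possible direction, sketched here as an assumption rather than anything Marcel implements: a ZIP archive lists all of its entry names in a central directory near the end of the file, so a reader can seek to the tail, locate the End of Central Directory record, and read only that directory. The helpers `zip_entry_names` and `ooxml_type` below are hypothetical, and the sketch ignores ZIP64 archives and trusts the directory size stored in the file.

```ruby
require "stringio"

EOCD_SIGNATURE = "PK\x05\x06".b # End of Central Directory record
CDH_SIGNATURE  = "PK\x01\x02".b # central directory file header
EOCD_MIN_SIZE  = 22             # EOCD size with an empty archive comment
MAX_COMMENT    = 65_535         # largest possible archive comment

# Returns the entry names from the central directory of a seekable IO,
# reading only the tail of the file and the directory itself.
def zip_entry_names(io)
  io.seek(0, IO::SEEK_END)
  size = io.pos
  tail_len = [size, EOCD_MIN_SIZE + MAX_COMMENT].min
  io.seek(size - tail_len)
  tail = io.read(tail_len).to_s.b

  eocd = tail.rindex(EOCD_SIGNATURE)
  return [] unless eocd && eocd + EOCD_MIN_SIZE <= tail.bytesize

  # Offsets 10, 12 and 16 of the EOCD: entry count, directory size, directory offset.
  total_entries, cd_size, cd_offset = tail.byteslice(eocd + 10, 10).unpack("vVV")
  io.seek(cd_offset)
  cd = io.read(cd_size).to_s.b

  names = []
  pos = 0
  total_entries.times do
    break unless cd.byteslice(pos, 4) == CDH_SIGNATURE
    # Offset 28 of each header: name, extra field and comment lengths.
    name_len, extra_len, comment_len = cd.byteslice(pos + 28, 6).unpack("vvv")
    names << cd.byteslice(pos + 46, name_len)
    pos += 46 + name_len + extra_len + comment_len
  end
  names
end

# Top-level folders that OOXML packages use for their main parts.
OOXML_TYPES = {
  "word/" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "xl/"   => "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "ppt/"  => "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}.freeze

# Maps a list of entry names to an OOXML MIME type, or nil if none fits.
def ooxml_type(names)
  return nil unless names.include?("[Content_Types].xml")
  OOXML_TYPES.each do |prefix, type|
    return type if names.any? { |name| name.start_with?(prefix) }
  end
  nil
end

p zip_entry_names(StringIO.new("not a zip")) # => []
```

With a real upload this would be called as `ooxml_type(zip_entry_names(file))` on an open, seekable `File`. Because it inspects the archive contents rather than the file name, it may also be a stricter check for whitelisting than relying on the extension, though it only confirms that the expected entries are listed, not that they are valid.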
