Word files (.docx) identified as application/zip #339

Closed
npalmius opened this issue Mar 17, 2020 · 11 comments

@npalmius

I know that similar issues (#312; #103; #66; #54) have been raised (and closed) before, but we've found some Word files (.docx) that aren't identified correctly using the latest version of file-type.

The test files are the three .docx files from https://file-examples.com/index.php/sample-documents-download/sample-doc-download/.

Note that version 12.4.2 works fine and detects the files as application/vnd.openxmlformats-officedocument.wordprocessingml.document, but versions 13 and 14 (including version 14.1.4) incorrectly identify them as application/zip.

I appreciate that these are probably slightly special/unusual files, and I don't know how they were produced, but Word does open them correctly so I would say that they are valid Word documents.

For now we'll downgrade to version 12.4.2, but it would be great if this could be corrected in the current versions.

@Borewit Borewit self-assigned this Mar 17, 2020
@Borewit
Collaborator

Borewit commented Mar 17, 2020

The algorithm to unzip docx was changed after version 12.4.2. Instead of searching the buffer for the file-header signature, the offset to the next header is calculated.

This has several advantages over the 12.4.2 algorithm:

  • Able to decode beyond the 4k boundary
  • More efficient: (compressed) file data is ignored
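For comparison, signature searching can be sketched as below. This is a hypothetical illustration, not the 12.4.2 source: the function name is invented, and it assumes the older approach amounted to scanning a sampled buffer for the local file header signature bytes `PK\x03\x04`.

```typescript
// Hypothetical sketch of signature searching; not taken from file-type 12.4.2.
// ZIP local file headers start with the bytes "PK\x03\x04".
const SIGNATURE = [0x50, 0x4b, 0x03, 0x04];

// Returns every offset in `buf` at which the full signature occurs.
function findLocalHeaderOffsets(buf: Uint8Array): number[] {
  const offsets: number[] = [];
  for (let i = 0; i + SIGNATURE.length <= buf.length; i++) {
    // Compare the four signature bytes against the buffer at position i.
    if (SIGNATURE.every((byte, j) => buf[i + j] === byte)) {
      offsets.push(i);
    }
  }
  return offsets;
}
```

A scan like this does not need entry sizes at all, which would explain why it copes with entries of unknown length. Its drawbacks are that it only sees headers inside the sampled buffer, it reads through compressed data it does not need, and compressed data can contain the signature bytes by coincidence.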

The docx fixtures you provided are encoded with an unknown file length, which the algorithm does not currently support. That is why the current file detection algorithm fails.
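A minimal sketch of why offset calculation breaks on such files, assuming "unknown file length" refers to bit 3 of the ZIP general-purpose flags (sizes are written in a data descriptor after the entry data, and the local header stores zero). The function and helper names are hypothetical and this is not file-type's actual code; the field offsets follow the ZIP local file header layout.

```typescript
// Hypothetical sketch; not file-type's implementation.
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04" read as a little-endian uint32
const FLAG_DATA_DESCRIPTOR = 0x0008;       // general-purpose bit 3: sizes follow the data
const FIXED_HEADER_LENGTH = 30;            // bytes before the file name

interface WalkResult {
  names: string[];               // entry names reached by following offsets
  stoppedOnUnknownSize: boolean; // true if an entry did not declare its size up front
}

function walkLocalHeaders(buf: Uint8Array): WalkResult {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const names: string[] = [];
  let offset = 0;
  while (offset + FIXED_HEADER_LENGTH <= buf.length &&
         view.getUint32(offset, true) === LOCAL_HEADER_SIGNATURE) {
    const flags = view.getUint16(offset + 6, true);
    const compressedSize = view.getUint32(offset + 18, true);
    const nameLength = view.getUint16(offset + 26, true);
    const extraLength = view.getUint16(offset + 28, true);
    // Stop if the header claims a name that runs past the end of the buffer.
    if (offset + FIXED_HEADER_LENGTH + nameLength > buf.length) break;
    let name = '';
    for (let i = 0; i < nameLength; i++) {
      name += String.fromCharCode(view.getUint8(offset + FIXED_HEADER_LENGTH + i));
    }
    names.push(name);
    if ((flags & FLAG_DATA_DESCRIPTOR) !== 0 && compressedSize === 0) {
      // The size is only stored after the data, so the next offset cannot be computed.
      return { names, stoppedOnUnknownSize: true };
    }
    // Skip the name, the extra field and the compressed data in one step.
    offset += FIXED_HEADER_LENGTH + nameLength + extraLength + compressedSize;
  }
  return { names, stoppedOnUnknownSize: false };
}

// Test helper: builds one local header plus data (no data descriptor is appended).
function makeEntry(name: string, data: number[], flags: number): Uint8Array {
  const sizeKnown = (flags & FLAG_DATA_DESCRIPTOR) === 0;
  const out = new Uint8Array(FIXED_HEADER_LENGTH + name.length + data.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, LOCAL_HEADER_SIGNATURE, true);
  view.setUint16(6, flags, true);
  view.setUint32(18, sizeKnown ? data.length : 0, true); // compressed size
  view.setUint32(22, sizeKnown ? data.length : 0, true); // uncompressed size
  view.setUint16(26, name.length, true);
  for (let i = 0; i < name.length; i++) out[FIXED_HEADER_LENGTH + i] = name.charCodeAt(i);
  out.set(data, FIXED_HEADER_LENGTH + name.length);
  return out;
}

// Test helper: concatenates byte arrays.
function concat(parts: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let pos = 0;
  for (const p of parts) { out.set(p, pos); pos += p.length; }
  return out;
}
```

With sizes declared, walking two entries named `[Content_Types].xml` and `word/document.xml` reaches both names. With the flag set on the first entry, the walk stops after the first name, so a detector that depends on seeing later entry names would have nothing more specific than a plain zip to report.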

@Borewit Borewit added the bug label Mar 17, 2020
@npalmius
Author

Thank you for looking into this issue and for the description of the cause. It sounds like these files were produced using some obscure software (or maybe even manually). All I can really say is that Word opens them, so I think they are valid Word documents, but at the same time I appreciate that it's probably not a priority to support handling of a very unusual version of the file format. Of course, if it could be fixed then that would still be great.

As a side note, when investigating this issue on our side, the first thing we did was to use the file-type-cli tool to test the files, which correctly identified them because it still uses version 12.4.2 of file-type. This threw us off course a bit so it would be useful if the cli tool used the latest version of the library. On the other hand, then we wouldn't have realised that it worked correctly in 12.4.2!

@Borewit
Collaborator

Borewit commented Mar 18, 2020

> It sounds like these files were produced using some obscure software (or maybe even manually).

So far, I have no reason to assume the provided samples are obscure.

I appreciate that you reported it and provided examples. That is a good start. I am busy at the moment; I will investigate later what it takes to support the unknown-file-size flag.

@npalmius
Author

That's a fair comment. I highly appreciate your time and effort on this package!

@tsmx

tsmx commented Apr 7, 2020

From my perspective, this bug can be reproduced if the docx is created/saved using LibreOffice. At least I encountered it when scanning such files with version 14.x, and they are of course not corrupted or obscure. With version 12.x the file type of exactly these files was determined correctly. Maybe that helps. Kind regards.

@TheColorRed

According to Wikipedia, docx, pptx, xlsx, etc. are zip-based file formats:
https://en.wikipedia.org/wiki/List_of_file_signatures

@TheColorRed

TheColorRed commented Apr 28, 2020

It looks like there is a package that has file-type as a dependency, which parses .ppt, .doc, .xls.

https://www.npmjs.com/package/file-type-ext

@deftomat

Still an issue. Any idea how to fix this?

@CSoellinger-IDS

CSoellinger-IDS commented May 19, 2020

Same problem here with xlsx (I think it could be the same problem as for docx). Interestingly, the CLI package returns "xlsx, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" for the same file for which the Node/browser version returns "zip, application/zip".

The CLI uses version 12.
The browser uses 14.4.0.

File to test:
MS-Excel_2007-2013_XML.xlsx
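Until a fixed release is available, one possible stopgap (besides pinning 12.4.2) is to refine an `application/zip` result using the file name. The sketch below is a hypothetical helper, not part of file-type; it trusts the extension, so it is only an assumption-laden workaround and should not be relied on where uploads are untrusted.

```typescript
// Hypothetical stopgap helper; the name and the extension map are assumptions.
const OOXML_MIME_BY_EXTENSION: Record<string, string> = {
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  pptx: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
};

// Returns a more specific MIME type only when detection fell back to plain zip
// and the file name carries a known Office Open XML extension.
function refineZipMime(detectedMime: string, fileName: string): string {
  if (detectedMime !== 'application/zip') return detectedMime;
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return detectedMime;
  const ext = fileName.slice(dot + 1).toLowerCase();
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(OOXML_MIME_BY_EXTENSION, ext)
    ? OOXML_MIME_BY_EXTENSION[ext]
    : detectedMime;
}
```

Any result other than `application/zip` is passed through unchanged, so the helper becomes a no-op once detection itself returns the specific type.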

@Borewit
Collaborator

Borewit commented Jun 7, 2020

Fixed by #369.

@Borewit Borewit closed this as completed Jun 7, 2020
@jessrosenfield

Hmm, I see that this issue was closed, but I am still getting this error with https://www.gsaadvantage.gov/ref_text/GS21F010DA/0VCZQF.3R3CP6_GS-21F-010DA_FSS600.DOCX using `"file-type": "^14.6.2"` (the latest release). Has this fix not been released yet, or are there still docx files that will show up as zips with the latest fix?
