Word files (.docx) identified as application/zip #339
Comments
The algorithm to unzip docx has been changed after version 12.4.2. Instead of searching the buffer for the file-header signature, the offset to the next header is calculated. This has several advantages over the 12.4.2 algorithm:
The docx fixtures you provided are encoded with an unknown file length, which the algorithm does not currently support. That is why the current file detection algorithm fails.
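To make the explanation above concrete, here is a minimal sketch of computing the offset of the next header from a ZIP local file header, and of why that breaks when an entry is written with an unknown length. The field offsets and the meaning of bit 3 of the general-purpose flag come from the ZIP format itself. The helper names `readLocalFileHeader` and `buildLocalEntry` are hypothetical and are not part of file-type's API, and this is not the library's actual code.

```javascript
// Illustrative sketch only: hypothetical helpers, not file-type's implementation.
const LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\x03\x04" on disk
const FLAG_DATA_DESCRIPTOR = 0x0008; // bit 3: sizes are stored after the data

// Parse the fixed 30-byte local file header located at `offset`.
function readLocalFileHeader(buffer, offset) {
  if (offset + 30 > buffer.length ||
      buffer.readUInt32LE(offset) !== LOCAL_FILE_HEADER_SIGNATURE) {
    return undefined;
  }
  const flags = buffer.readUInt16LE(offset + 6);
  const compressedSize = buffer.readUInt32LE(offset + 18);
  const filenameLength = buffer.readUInt16LE(offset + 26);
  const extraLength = buffer.readUInt16LE(offset + 28);
  const dataStart = offset + 30 + filenameLength + extraLength;
  const sizeUnknown = (flags & FLAG_DATA_DESCRIPTOR) !== 0;
  return {
    filename: buffer.toString('utf8', offset + 30, offset + 30 + filenameLength),
    sizeUnknown,
    // With bit 3 set the header carries zero sizes, so the next header
    // cannot be located by arithmetic alone.
    nextHeaderOffset: sizeUnknown ? undefined : dataStart + compressedSize,
  };
}

// Build a tiny uncompressed ("stored") entry so the sketch runs without a real file.
function buildLocalEntry(filename, data, {sizeUnknown = false} = {}) {
  const name = Buffer.from(filename, 'utf8');
  const header = Buffer.alloc(30);
  header.writeUInt32LE(LOCAL_FILE_HEADER_SIGNATURE, 0);
  header.writeUInt16LE(sizeUnknown ? FLAG_DATA_DESCRIPTOR : 0, 6);
  header.writeUInt32LE(sizeUnknown ? 0 : data.length, 18); // compressed size
  header.writeUInt32LE(sizeUnknown ? 0 : data.length, 22); // uncompressed size
  header.writeUInt16LE(name.length, 26);
  return Buffer.concat([header, name, data]);
}

const known = Buffer.concat([
  buildLocalEntry('[Content_Types].xml', Buffer.from('<Types/>')),
  buildLocalEntry('word/document.xml', Buffer.from('<w:document/>')),
]);
const first = readLocalFileHeader(known, 0);
const second = readLocalFileHeader(known, first.nextHeaderOffset);
console.log(first.nextHeaderOffset, second.filename);

const unknown = readLocalFileHeader(
  buildLocalEntry('[Content_Types].xml', Buffer.from('<Types/>'), {sizeUnknown: true}),
  0,
);
console.log(unknown.sizeUnknown, unknown.nextHeaderOffset);
```

In the first case the reader can hop from entry to entry until it finds a name such as `word/document.xml`. In the second case the hop is impossible, which matches the failure described above.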
Thank you for looking into this issue and for the description of the cause. It sounds like these files were produced using some obscure software (or maybe even manually). All I can really say is that Word opens them, so I think they are valid Word documents, but at the same time I appreciate that it's probably not a priority to support a very unusual version of the file format. Of course, if it could be fixed then that would still be great. As a side note, when investigating this issue on our side, the first thing we did was to test the files with the file-type-cli tool, which identified them correctly because it still uses version 12.4.2 of file-type. This threw us off course a bit, so it would be useful if the CLI tool used the latest version of the library. On the other hand, then we wouldn't have realised that it worked correctly in 12.4.2!
So far, I have no reason to assume the provided samples are obscure. I appreciate that you reported it and provided examples. That is a good start. I am busy at the moment; I will investigate later what it takes to support the unknown-file-size flag.
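One possible way to cope with the unknown-file-size flag, sketched here purely as an assumption and not as the approach any particular release takes, is to fall back to scanning forward for the next local file header signature when the size fields cannot be used. The helper name `findNextLocalHeader` is hypothetical. The sample buffer is hand-built: a zeroed header, some payload bytes, a data descriptor, and then a second header.

```javascript
// Hypothetical fallback, not taken from file-type's source.
const SIGNATURE_BYTES = Buffer.from([0x50, 0x4b, 0x03, 0x04]); // "PK\x03\x04"

// Return the offset of the next local file header at or after `fromOffset`,
// or -1 if none is found. This is a heuristic: the same four bytes can
// occur by chance inside compressed data.
function findNextLocalHeader(buffer, fromOffset) {
  return buffer.indexOf(SIGNATURE_BYTES, fromOffset);
}

const sample = Buffer.concat([
  Buffer.from([0x50, 0x4b, 0x03, 0x04]), // first local file header signature
  Buffer.alloc(26),                      // rest of the 30-byte header, zeroed for brevity
  Buffer.from('payload'),                // entry data whose length the header does not state
  Buffer.from([0x50, 0x4b, 0x07, 0x08]), // optional data descriptor signature
  Buffer.alloc(12),                      // CRC-32, compressed size, uncompressed size
  Buffer.from([0x50, 0x4b, 0x03, 0x04]), // next local file header
]);

// Start searching where the first entry's data begins (offset 30 here).
const nextOffset = findNextLocalHeader(sample, 30);
console.log(nextOffset); // 53
```

The trade-off is that a signature search is slower and can hit a false positive, which is presumably part of why offset calculation is preferred when the sizes are known.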
That's a fair comment. I highly appreciate your time and effort on this package!
From my perspective, this bug can be reproduced if the docx is created or saved using LibreOffice. At least, I encountered it when scanning such files with version 14.x, and they are of course not corrupted or obscure. With version 12.x, the file type of exactly these files was determined correctly. Maybe that helps. Kind regards.
According to Wikipedia,
It looks like there is a package that has
Still an issue. Any idea how to fix this?
Same problem here with xlsx (I think it could be the same problem as for docx). Interestingly, the CLI package returns "xlsx, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" for the same file, whereas the Node/browser version returns "zip, application/zip". The CLI uses version 12. File to test:
Fixed by #369.
Hmm, I see that this issue was closed, but I am still getting this error with https://www.gsaadvantage.gov/ref_text/GS21F010DA/0VCZQF.3R3CP6_GS-21F-010DA_FSS600.DOCX using `"file-type": "^14.6.2"` (the latest release). Has this fix not been released yet, or are there still docx files that will show up as zips with the latest fix?
I know that similar issues (#312; #103; #66; #54) have been raised (and closed) before, but we've found some Word files (.docx) that aren't identified correctly using the latest version of file-type.
The test files are the three .docx files from https://file-examples.com/index.php/sample-documents-download/sample-doc-download/.
Note that version 12.4.2 works fine and detects the files as `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, but versions 13 and 14 (including version 14.1.4) incorrectly identify them as `application/zip`.
I appreciate that these are probably slightly special/unusual files, and I don't know how they were produced, but Word does open them correctly so I would say that they are valid Word documents.
For now we'll downgrade to version 12.4.2, but it would be great if this could be corrected in the current versions.