Word files (.docx) identified as application/zip #339

npalmius · 2020-03-17T18:02:14Z

I know that similar issues (#312; #103; #66; #54) have been raised (and closed) before, but we've found some Word files (.docx) that aren't identified correctly using the latest version of file-type.

The test files are the three .docx files from https://file-examples.com/index.php/sample-documents-download/sample-doc-download/.

Note that version 12.4.2 works fine and detects the files as application/vnd.openxmlformats-officedocument.wordprocessingml.document, but versions 13 and 14 (including version 14.1.4) incorrectly identify them as application/zip.

I appreciate that these are probably slightly special/unusual files, and I don't know how they were produced, but Word does open them correctly so I would say that they are valid Word documents.

For now we'll downgrade to version 12.4.2, but it would be great if this could be corrected in the current versions.


Borewit · 2020-03-17T20:04:07Z

The algorithm used to unzip docx files was changed after version 12.4.2. Instead of searching the buffer for the file header signature, the offset to the next header is now calculated.

This has several advantages over the 12.4.2 algorithm:

Able to decode beyond the 4k boundary
More efficient: the (compressed) file data is ignored

The docx fixtures you provided are encoded with an unknown file length, which the algorithm does not currently support. That is why the current file detection algorithm fails.
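For readers unfamiliar with the ZIP layout, here is a minimal sketch of the offset calculation described above. It is not file-type's actual code: `readLocalHeader` and `LocalHeader` are hypothetical names, and it assumes that "unknown file length" refers to bit 3 of the general purpose flags in the ZIP local file header, in which case the sizes in the header are written as zero and the real values follow the data.

```typescript
// Sketch only: field offsets follow the ZIP local file header layout.
// The names below are hypothetical and not part of file-type's API.
const LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04" read as a little-endian uint32
const FLAG_DATA_DESCRIPTOR = 0x0008; // general purpose bit 3: sizes are stored after the data

interface LocalHeader {
  filename: string;
  compressedSize: number;
  sizeUnknown: boolean; // true when bit 3 is set
  nextOffset: number; // start of the next header, valid only if compressedSize can be trusted
}

function readLocalHeader(bytes: Uint8Array, offset: number): LocalHeader | undefined {
  // The fixed part of a local file header is 30 bytes long.
  if (offset < 0 || offset + 30 > bytes.length) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint32(offset, true) !== LOCAL_FILE_HEADER_SIGNATURE) return undefined;

  const flags = view.getUint16(offset + 6, true);
  const compressedSize = view.getUint32(offset + 18, true);
  const nameLength = view.getUint16(offset + 26, true);
  const extraLength = view.getUint16(offset + 28, true);
  const nameStart = offset + 30;
  if (nameStart + nameLength > bytes.length) return undefined;

  // Decode the filename byte by byte (sufficient for ASCII names such as "word/document.xml").
  let filename = "";
  for (let i = 0; i < nameLength; i++) {
    filename += String.fromCharCode(view.getUint8(nameStart + i));
  }

  return {
    filename,
    compressedSize,
    sizeUnknown: (flags & FLAG_DATA_DESCRIPTOR) !== 0,
    nextOffset: nameStart + nameLength + extraLength + compressedSize,
  };
}
```

When `sizeUnknown` is true, `compressedSize` is typically 0, so `nextOffset` points into the compressed data rather than at the next header. Under that assumption, a parser that relies only on the calculated offset stops finding entries and falls back to reporting a plain ZIP.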

npalmius · 2020-03-18T12:20:27Z

Thank you for looking into this issue and for the description of the cause. It sounds like these files were produced using some obscure software (or maybe even manually). All I can really say is that Word opens them, so I think they are valid Word documents, but at the same time I appreciate that it's probably not a priority to support a very unusual variant of the file format. Of course, if it could be fixed then that would still be great.

As a side note, when investigating this issue on our side, the first thing we did was to use the file-type-cli tool to test the files, which correctly identified them because it still uses version 12.4.2 of file-type. This threw us off course a bit so it would be useful if the cli tool used the latest version of the library. On the other hand, then we wouldn't have realised that it worked correctly in 12.4.2!

Borewit · 2020-03-18T12:26:50Z

> It sounds like these files were produced using some obscure software (or maybe even manually).

So far, I have no reason to assume the provided samples are obscure.

I appreciate that you reported it and provided examples. That is a good start. I am busy at the moment; I will investigate later what it takes to support the unknown-file-size flag.
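One conceivable way to handle entries with an unknown size is to fall back to scanning for the next local file header signature, much like the search-based approach mentioned for 12.4.2. The sketch below illustrates only that idea; `findNextLocalHeader` is a hypothetical helper, and nothing here is meant to describe how the library actually handles the flag.

```typescript
// Hypothetical helper: scan forward for the next "PK\x03\x04" byte sequence.
// Returns the index of the signature, or -1 if it does not occur at or after `from`.
function findNextLocalHeader(bytes: Uint8Array, from: number): number {
  for (let i = Math.max(0, from); i + 4 <= bytes.length; i++) {
    if (
      bytes[i] === 0x50 && // 'P'
      bytes[i + 1] === 0x4b && // 'K'
      bytes[i + 2] === 0x03 &&
      bytes[i + 3] === 0x04
    ) {
      return i;
    }
  }
  return -1;
}
```

The trade-off is that the scan has to read through the compressed data, and the same four bytes can occur inside that data by chance, so a match is a candidate header rather than a guaranteed one.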

npalmius · 2020-03-18T13:19:16Z

That's a fair comment. I highly appreciate your time and effort on this package!

tsmx · 2020-04-07T19:21:32Z

From my perspective, this bug can be reproduced if the docx is created/saved using LibreOffice. At least, I encountered it when scanning such files with version 14.x, and they are of course neither corrupted nor obscure. With version 12.x, the file type of exactly these files was determined correctly. Maybe that helps. Kind regards.

TheColorRed · 2020-04-28T14:09:56Z

According to Wikipedia, docx, pptx, xlsx, etc. are ZIP-based file formats:
https://en.wikipedia.org/wiki/List_of_file_signatures
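Because all of these formats share the ZIP signature, a detector has to look inside the archive to tell them apart. A rough sketch of one heuristic, assuming the entry names have already been extracted: Office Open XML packages conventionally keep their main parts under `word/`, `xl/` or `ppt/`. `guessOoxmlType` is a hypothetical function for illustration, not file-type's implementation.

```typescript
type OoxmlGuess = "docx" | "xlsx" | "pptx" | "zip";

// Heuristic sketch: classify a ZIP archive by the directory prefix of its entries.
// Anything without a recognised prefix is reported as a plain ZIP.
function guessOoxmlType(entryNames: string[]): OoxmlGuess {
  for (const name of entryNames) {
    if (name.indexOf("word/") === 0) return "docx";
    if (name.indexOf("xl/") === 0) return "xlsx";
    if (name.indexOf("ppt/") === 0) return "pptx";
  }
  return "zip";
}
```

This also shows why a detector that cannot walk past the first entries ends up answering application/zip: if the telltale entry names are never reached, only the generic container type is known.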

TheColorRed · 2020-04-28T15:23:38Z

It looks like there is a package that has file-type as a dependency, which parses .ppt, .doc, .xls.

https://www.npmjs.com/package/file-type-ext

deftomat · 2020-05-18T17:26:23Z

Still an issue. Any idea how to fix this?

CSoellinger-IDS · 2020-05-19T12:55:52Z

Same problem here with xlsx (I think it is the same problem as for docx). Interestingly, the CLI package returns "xlsx, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" for the same file, whereas the node/browser version returns "zip, application/zip".

The CLI uses version 12.
The browser uses 14.4.0.

File to test:
MS-Excel_2007-2013_XML.xlsx

Borewit · 2020-06-07T12:56:52Z

Fixed by #369.

jessrosenfield · 2020-07-16T18:03:09Z

Hmm, I see that this issue was closed, but I am still getting this error with this file: https://www.gsaadvantage.gov/ref_text/GS21F010DA/0VCZQF.3R3CP6_GS-21F-010DA_FSS600.DOCX using `"file-type": "^14.6.2"` (the latest release). Has this fix not been released yet, or are there still docx files that will show up as zips even with the latest fix?

Borewit self-assigned this Mar 17, 2020

Borewit added the bug label Mar 17, 2020

hermandavid mentioned this issue Jun 4, 2020

Fix ZIP header detection for MS Office files #369

Merged

Borewit closed this as completed Jun 7, 2020
