Not recognising `application/pdf` file types #228

andrea-del-popolo · 2019-07-10T08:40:07Z

Hello,

I have some PDF files whose type file-type fails to recognise. I cannot share these files publicly as they contain private information. However, these files come from different sources and open correctly in every PDF viewer I tried, so they do not appear to be malformed or corrupted.

This is a buffer from one of these files:

Content-time: 1562573880\\nContent-hash: 58f39cf5700921a2dd522755443c7ed7\\nContent-type: application/pdf\\nContent-size: 90200\\n%PDF-1.4\n%����\n%PDF-1.4\n%����\n%PDF-1.4\n%����\n3 0 obj\n<</Type /Page\n/Parent 1 0 R\n/MediaBox [0 0 595.280 841.890]\n/TrimBox [0.000

And this is how I am using file-type:

const fileType = require('file-type');
const readChunk = require('read-chunk');
const buffer = readChunk.sync(`${tmpDir}/${inputFileName}`, 0, fileType.minimumBytes);
const detectedFileType = fileType(buffer) || {ext: '?', mime: '?'};

Note: same behaviour on versions 10.9.0 and 12.0.1 (latest)

How can I help without sharing the entire file? 🤷‍♂

jacor84 · 2019-07-10T14:24:12Z

First, you could specify the source of such files. What program on which platform (operating system) or what kind of website generated them?
Then, could you share the first 20 bytes as hex?

console.log(buffer.toString('hex', 0, 20));

The magic bytes should be %PDF, so maybe before them there is some kind of BOM?
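To make the check above concrete, here is a minimal sketch that reports whether a buffer starts with the `%PDF` signature or with a UTF-8 byte order mark. `describeStart` is a hypothetical helper written for illustration, not part of file-type, and it only looks for the UTF-8 BOM (`EF BB BF`), which is an assumption about which BOM might be present.

```javascript
// Hypothetical helper (not part of file-type): reports how a buffer begins.
function describeStart(buffer) {
  const pdfMagic = Buffer.from('%PDF', 'latin1'); // bytes 25 50 44 46
  const utf8Bom = Buffer.from([0xef, 0xbb, 0xbf]); // UTF-8 byte order mark

  if (buffer.subarray(0, pdfMagic.length).equals(pdfMagic)) {
    return 'pdf';
  }
  if (buffer.subarray(0, utf8Bom.length).equals(utf8Bom)) {
    return 'utf8-bom';
  }
  return 'unknown';
}

console.log(describeStart(Buffer.from('255044462d312e34', 'hex'))); // pdf
console.log(describeStart(Buffer.from('efbbbf25504446', 'hex'))); // utf8-bom
console.log(describeStart(Buffer.from('Content-time: 1', 'latin1'))); // unknown
```

Anything reported as `unknown` means the bytes before `%PDF` are something other than a BOM, and the hex dump is the quickest way to see what they are.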

andrea-del-popolo · 2019-07-11T06:34:32Z

Hi @jacor84, thanks for the help. As for the source of the files, I cannot tell what generated them because I don't know. This issue happens with files coming from different customers and I cannot see a pattern: the contents of the files are totally different and nothing indicates that they were generated in the same way.

The results of console.log(buffer.toString('hex', 0, 20)); for 2 different PDF files affected by the issue are:

436f6e74656e742d74696d653a20313536323537
485454502f312e3020323030204f4b0d0a436163

This instead is the result on a file that is not affected by the issue:

255044462d312e340a25e2e3cfd30a332030206f

By BOM, do you mean BOM (file format)? 🤔

jacor84 · 2019-07-11T09:02:55Z

@andrea-del-popolo Sorry, but the hex content you pasted is not file content, but rather a raw response with headers. Maybe that's where your problem originates. :-)

> console.log(Buffer.from('485454502f312e3020323030204f4b0d0a436163', 'hex').toString());
HTTP/1.0 200 OK
Cac

> console.log(Buffer.from('436f6e74656e742d74696d653a20313536323537', 'hex').toString());
Content-time: 156257

The third hex string you pasted (from the unaffected file) is valid: it starts with the magic bytes 25504446, which is %PDF.

By BOM I meant Byte Order Mark.

jacor84 · 2019-07-11T09:15:08Z

Now I see what could have gone wrong here. Your buffer should contain only file data, but it also contains HTTP headers. The reason is that there should be two consecutive newline markers (`\n`) separating the response headers from the body, making one empty line. In your response, there is only one.

(...) Content-size: 90200\\n%PDF-1.4 (...)

You should investigate how this happened and fix it there, or find some workaround to get rid of those headers. Once the buffer contains only the file data, this library will recognize the files.
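As one possible workaround, a minimal sketch: drop everything before the first `%PDF-` marker and hand only the remainder to the detector. `stripLeadingHeaders` is a hypothetical helper, not part of file-type, and it assumes the stray header text never contains `%PDF-` itself. It treats the symptom only; fixing whatever wrote the headers into the file is still the better solution.

```javascript
// Hypothetical workaround helper (not part of file-type): returns the part of
// the buffer starting at the first "%PDF-" marker, or null if none is found.
function stripLeadingHeaders(buffer) {
  const start = buffer.indexOf('%PDF-', 0, 'latin1');
  return start === -1 ? null : buffer.subarray(start);
}

// Sample input modelled on the buffer from this issue: header lines followed
// by a single newline and then the PDF data.
const raw = Buffer.from(
  'Content-type: application/pdf\nContent-size: 90200\n%PDF-1.4\n',
  'latin1'
);
const body = stripLeadingHeaders(raw);
console.log(body.toString('latin1', 0, 8)); // %PDF-1.4
```

The returned buffer could then be passed to `fileType()` in place of the raw one. Because the headers shift the PDF data forward, it may be safer to read somewhat more than `fileType.minimumBytes` from disk before stripping.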

jacor84 · 2019-07-25T09:30:01Z

@andrea-del-popolo I think this case should be closed.

sindresorhus closed this as completed Sep 3, 2019
