Not recognising application/pdf file types #228

Closed
andrea-del-popolo opened this issue Jul 10, 2019 · 5 comments

Comments

@andrea-del-popolo

Hello,

I have some PDF files whose type file-type fails to recognise. I cannot share these files publicly as they contain private information. However, they come from different sources and open correctly in every PDF viewer I tried, so they do not appear to be malformed or corrupted in any way.

This is a buffer from one of these files:

Content-time: 1562573880\\nContent-hash: 58f39cf5700921a2dd522755443c7ed7\\nContent-type: application/pdf\\nContent-size: 90200\\n%PDF-1.4\n%����\n%PDF-1.4\n%����\n%PDF-1.4\n%����\n3 0 obj\n<</Type /Page\n/Parent 1 0 R\n/MediaBox [0 0 595.280 841.890]\n/TrimBox [0.000

And this is how I am using file-type:

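// tmpDir and inputFileName are set elsewhere in the application (not shown here).
const fileType = require('file-type');
const readChunk = require('read-chunk');
// Read only the leading bytes that file-type needs for detection.
const buffer = readChunk.sync(`${tmpDir}/${inputFileName}`, 0, fileType.minimumBytes);
// Fall back to placeholder values when the type cannot be detected.
const detectedFileType = fileType(buffer) || {ext: '?', mime: '?'};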

Note: same behaviour on versions 10.9.0 and 12.0.1 (latest)

How can I help without sharing the entire file? 🤷‍♂

@jacor84
Contributor

jacor84 commented Jul 10, 2019

First, could you specify the source of these files? Which program, on which platform (operating system), or what kind of website generated them?
Then, could you share the first 20 bytes as hex:

console.log(buffer.toString('hex', 0, 20));

The magic bytes should be %PDF, so maybe some kind of BOM precedes them?
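The check suggested above can be sketched with Node's built-in Buffer alone. This is only an illustration: `hasPrefix` and `inspectPdfStart` are hypothetical helper names, not part of file-type, and only a UTF-8 BOM (EF BB BF) is considered.

```javascript
// Sketch: does a buffer start with the PDF magic bytes, possibly after a UTF-8 BOM?
// hasPrefix and inspectPdfStart are hypothetical helpers, not part of file-type.
const PDF_MAGIC = Buffer.from('%PDF', 'latin1'); // 25 50 44 46
const UTF8_BOM = Buffer.from([0xef, 0xbb, 0xbf]);

// True when `prefix` appears in `buffer` starting at `offset`.
function hasPrefix(buffer, prefix, offset = 0) {
  return buffer.subarray(offset, offset + prefix.length).equals(prefix);
}

// Returns 'pdf', 'bom+pdf', or 'other'.
function inspectPdfStart(buffer) {
  if (hasPrefix(buffer, PDF_MAGIC)) {
    return 'pdf';
  }
  if (hasPrefix(buffer, UTF8_BOM) && hasPrefix(buffer, PDF_MAGIC, UTF8_BOM.length)) {
    return 'bom+pdf';
  }
  return 'other';
}

console.log(inspectPdfStart(Buffer.from('%PDF-1.4\n', 'latin1')));
```

A result of 'other' means something other than a BOM sits in front of the magic bytes, which is worth inspecting as hex.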

@andrea-del-popolo
Author

andrea-del-popolo commented Jul 11, 2019

Hi @jacor84, thanks for the help. As for the source of the files, I cannot tell what generated them because I don't know. This issue happens with files coming from different customers, and I cannot see a pattern: the contents of the files are totally different, and nothing indicates that they were generated in the same way.

The results of console.log(buffer.toString('hex', 0, 20)); for 2 different PDF files affected by the issue are:

  • 436f6e74656e742d74696d653a20313536323537
  • 485454502f312e3020323030204f4b0d0a436163

This, instead, is the result for a file that is not affected by the issue:

  • 255044462d312e340a25e2e3cfd30a332030206f

By BOM, do you mean BOM (file format)? 🤔

@jacor84
Contributor

jacor84 commented Jul 11, 2019

@andrea-del-popolo Sorry, but the hex content you pasted is not file content; it is a raw response with headers. Maybe that's where your problem originates. :-)

> console.log(Buffer.from('485454502f312e3020323030204f4b0d0a436163', 'hex').toString());
HTTP/1.0 200 OK
Cac

> console.log(Buffer.from('436f6e74656e742d74696d653a20313536323537', 'hex').toString());
Content-time: 156257

The last one is valid: it starts with the magic bytes 25504446, which is %PDF.

By BOM I meant Byte Order Mark.
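The manual decoding above can be wrapped in a small helper that labels a hex dump of the leading bytes. This is a rough sketch: `describeHexPrefix` is a hypothetical name, and the header-like pattern is a heuristic chosen for this thread's samples rather than a full HTTP parser.

```javascript
// Sketch: label a hex dump of a file's first bytes.
// describeHexPrefix is a hypothetical helper; the checks are heuristics only.
function describeHexPrefix(hex) {
  // latin1 maps every byte to exactly one character, so binary data is not mangled.
  const text = Buffer.from(hex, 'hex').toString('latin1');
  if (text.startsWith('%PDF')) {
    return 'PDF file data';
  }
  if (text.startsWith('HTTP/')) {
    return 'HTTP status line';
  }
  // Something shaped like "Name: value", for example "Content-time: 156257".
  if (/^[A-Za-z-]+: /.test(text)) {
    return 'header-like text';
  }
  return 'unknown';
}

// The three samples posted in this thread.
const samples = [
  '436f6e74656e742d74696d653a20313536323537',
  '485454502f312e3020323030204f4b0d0a436163',
  '255044462d312e340a25e2e3cfd30a332030206f',
];
for (const hex of samples) {
  console.log(hex, '=>', describeHexPrefix(hex));
}
```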

@jacor84
Contributor

jacor84 commented Jul 11, 2019

Now I see what could have gone wrong here. Your buffer should contain only file data, but it also contains HTTP headers. The reason is that there should be two consecutive newline markers (`\n`) separating the response headers from the body, making one empty line. In your response, there is only one.

(...) Content-size: 90200\\n%PDF-1.4 (...)

You should investigate how this happened and fix it there, or find some workaround to get rid of those headers. Then this library will recognize the files.
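One possible workaround, sketched under assumptions: since the headers in these files are not followed by the empty line that HTTP uses to end the header section (CRLF CRLF), splitting on a blank line will not work, but the data can be cut at the first `%PDF-` marker. `stripLeadingHeaders` is a hypothetical helper, and it assumes the real file begins at that first marker.

```javascript
// Sketch: drop everything that precedes the first "%PDF-" marker.
// stripLeadingHeaders is a hypothetical helper, not part of file-type.
// Assumption: the real file starts at the first "%PDF-" in the buffer.
function stripLeadingHeaders(buffer) {
  const start = buffer.indexOf('%PDF-', 0, 'latin1');
  if (start === -1) {
    return null; // No PDF marker found in the bytes that were read.
  }
  return buffer.subarray(start); // A view onto the same memory, not a copy.
}

// Example modelled on the buffer posted at the top of this issue.
const polluted = Buffer.from(
  'Content-type: application/pdf\nContent-size: 90200\n%PDF-1.4\n3 0 obj\n',
  'latin1'
);
const cleaned = stripLeadingHeaders(polluted);
console.log(cleaned.toString('latin1', 0, 8));
```

Because the snippet in the report reads only `fileType.minimumBytes` bytes from offset 0, the read would need to be large enough to cover the headers plus the bytes needed for detection before the stripped buffer is handed to `fileType`. Fixing whatever writes the headers into the file remains the cleaner solution.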

@jacor84
Contributor

jacor84 commented Jul 25, 2019

@andrea-del-popolo I think this case should be closed.
