Improve PDF / AI (Adobe Illustrator) recognition #396

Borewit · 2020-09-15T19:35:38Z

In line with what @vladfrangu suggested, this searches for AIPrivateData to detect the .ai (Adobe Illustrator) format.

Not a perfect solution:

It is likely to fail on large PDFs, as it reads at most 10 MB of data to search in.
It may be falsely triggered by AIPrivateData appearing in the content.
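
A minimal sketch of the bounded search described above, not the code merged in this PR. The helper name `looksLikeIllustrator` and the `maxBytes` parameter are hypothetical; the sketch assumes the whole file is already in a Buffer.

```javascript
// Hypothetical helper illustrating the approach; not the library's actual code.
const MAX_SEARCH_BYTES = 10 * 1024 * 1024; // the 10 MB cap mentioned above
const MARKER = 'AIPrivateData';

function looksLikeIllustrator(buffer, maxBytes = MAX_SEARCH_BYTES) {
  // Only files that start with the PDF signature are candidates.
  if (buffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    return false;
  }

  // Search only the first maxBytes bytes. A marker located beyond that
  // window is missed, which is the large-file limitation noted above.
  const searchWindow = buffer.subarray(0, maxBytes);
  return searchWindow.includes(MARKER, 0, 'latin1');
}
```

Both limitations show up directly: lowering `maxBytes` below the marker's offset makes the function return false, and an ordinary PDF whose content happens to contain the marker text returns true.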

I removed the fixture.ai because I suspect it is truncated. I don't own Adobe Illustrator myself, so I could not test it with that.

But it probably does the job for most cases.

Fix #360

vladfrangu · 2020-09-15T19:43:43Z

What I was going to do (but didn't get to due to school and time constraints..) was peek around.. 5mb of data at a time, check if the sequence for AIPrivateData is found, and if so, return. Of course this'd mean that we could peek over the entire file... So, also not an optimal thing, however thanks to Adobe being adobe.. we gotta do it..
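
A rough sketch of that chunked search, under stated assumptions: `readChunk` is a hypothetical synchronous callback that returns the next bytes of the input (an empty Buffer once exhausted), and `chunkReaderFrom` is a demo helper over an in-memory Buffer. A real stream reader would be asynchronous. The sketch keeps a short tail between iterations so a marker split across two chunks is still found.

```javascript
const NEEDLE = Buffer.from('AIPrivateData', 'latin1');

// Hypothetical helper: scans the input chunk by chunk until the needle is
// found or the input is exhausted. In the worst case it reads everything.
function inputContains(readChunk, needle = NEEDLE, chunkSize = 5 * 1024 * 1024) {
  let tail = Buffer.alloc(0);
  for (;;) {
    const chunk = readChunk(chunkSize);
    if (chunk.length === 0) {
      return false; // end of input, nothing found
    }

    const searchWindow = Buffer.concat([tail, chunk]);
    if (searchWindow.includes(needle)) {
      return true;
    }

    // Keep the last needle.length - 1 bytes so a match that straddles
    // two chunks is still detected on the next iteration.
    const keep = Math.min(searchWindow.length, needle.length - 1);
    tail = searchWindow.subarray(searchWindow.length - keep);
  }
}

// Demo helper: turns an in-memory buffer into a readChunk callback.
function chunkReaderFrom(buffer) {
  let position = 0;
  return size => {
    const chunk = buffer.subarray(position, position + size);
    position += chunk.length;
    return chunk;
  };
}
```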

Borewit · 2020-09-15T19:52:22Z

> What I was going to do (but didn't get to due to school and time constraints..) was peek around.. 5mb of data at a time, check if the sequence for AIPrivateData is found, and if so, return. Of course this'd mean that we could peek over the entire file... So, also not an optimal thing

Sounds good @vladfrangu.
Peeking in a loop with the tokenizer will result in an infinite loop, as peeking does not advance the stream position past the data read. But if you know it is a PDF, it is perfectly fine to 'consume' all the data you need with read instead of peek.
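
To illustrate the difference, here is a small stand-in that models only the position semantics of peek versus read. `MiniTokenizer` is hypothetical and is not the tokenizer's actual API.

```javascript
// Hypothetical stand-in for a tokenizer; it only models how the position moves.
class MiniTokenizer {
  constructor(buffer) {
    this.buffer = buffer;
    this.position = 0;
  }

  // Returns up to `length` bytes WITHOUT moving the position.
  peek(length) {
    return this.buffer.subarray(this.position, this.position + length);
  }

  // Returns up to `length` bytes and advances the position past them.
  read(length) {
    const chunk = this.peek(length);
    this.position += chunk.length;
    return chunk;
  }
}

const tokenizer = new MiniTokenizer(Buffer.from('abcdef'));
const firstPeek = tokenizer.peek(3).toString();
const secondPeek = tokenizer.peek(3).toString(); // same bytes again: a peek-only loop never progresses
const firstRead = tokenizer.read(3).toString();
const secondRead = tokenizer.read(3).toString(); // the next bytes: read makes progress
```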

> , however thanks to Adobe being adobe.. we gotta do it..

Yeah, the embedded metadata looks amazingly bad. Why bother putting that in if it is always the same?

vladfrangu · 2020-09-15T19:55:44Z

> Peeking in a loop with the tokenizer will result in an infinite loop, as peeking does not advance the stream position past the data read. But if you know it is a PDF, it is perfectly fine to 'consume' all the data you need with read instead of peek.

I'm aware peek doesn't advance the position (hence its name), so you'd skip as many bytes as you peek if you check in a loop, although you can probably just.. read it and skip the peek-ignore step

Also, pro tip: you can skip the entire first metadata part if you parse the header (it includes a length of bytes you can skip to get past the whole xml in pdf ordeal)

Borewit · 2020-09-15T20:03:44Z

> I'm aware peek doesn't advance the position (hence its name), so you'd skip as many bytes as you peek if you check in a loop, although you can probably just.. read it and skip the peek-ignore step

Exactly.

> Also, pro tip: you can skip the entire first metadata part if you parse the header (it includes a length of bytes you can skip to get past the whole xml in pdf ordeal)

Bingo, and then use tokenizer.ignore(nrOfbytesToSkip), which avoids reading the data if the underlying medium supports it.
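
A sketch of why ignoring can be cheaper than reading on a seekable medium such as a file. `FileCursor` is a hypothetical class for illustration, not the tokenizer's implementation: its `ignore` only moves the position, so the skipped bytes are never read from disk.

```javascript
const fs = require('fs');

// Hypothetical minimal file reader that tracks its own position.
class FileCursor {
  constructor(fd) {
    this.fd = fd;
    this.position = 0;
  }

  // Reads up to `length` bytes at the current position and advances past them.
  read(length) {
    const buffer = Buffer.alloc(length);
    const bytesRead = fs.readSync(this.fd, buffer, 0, length, this.position);
    this.position += bytesRead;
    return buffer.subarray(0, bytesRead);
  }

  // Skips `length` bytes without any I/O: only the position changes.
  ignore(length) {
    this.position += length;
  }
}
```

On a non-seekable source, such as a network stream, the skipped bytes still have to be pulled through and discarded, so the saving applies only where the medium allows random access.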

Borewit · 2020-09-18T18:30:06Z

Processing the "PDF blocks" is easier said than done. The COS ("Carousel" Object Structure) format, which PDF is based on, requires semi-text, line-oriented processing, which is complex and tends to go beyond the scope of binary format detection.

sindresorhus · 2020-09-28T17:26:35Z

Can you fix the merge conflict?

Ugzuzg · 2021-01-12T21:06:10Z

Is there any progress on this? Any help needed?

Borewit · 2021-01-12T21:23:43Z

> Can you fix the merge conflict?

Done.

Borewit added the bug label Sep 15, 2020

Borewit self-assigned this Sep 15, 2020

Borewit requested a review from sindresorhus September 15, 2020 19:37

Fix Adobe Illustrator recognition.

ea97c62

Borewit force-pushed the issue-360-pdf-ai branch from 8020a8d to ea97c62 Compare January 12, 2021 21:23

Update test.js

0ff0ecb

sindresorhus merged commit 9736aa3 into master Jan 13, 2021

sindresorhus deleted the issue-360-pdf-ai branch January 13, 2021 10:11

hdavidzhu mentioned this pull request Jan 25, 2021

fromBuffer and stream methods give different results for same ai file #426

Closed

vladfrangu mentioned this pull request Apr 14, 2022

PDF created with Adobe Illustrator are wrongly detected as .ai files #360

Closed
