Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PDF / AI (Adobe Illustrator) recognition #396

Merged
merged 2 commits into from
Jan 13, 2021

Conversation

Borewit
Copy link
Collaborator

@Borewit Borewit commented Sep 15, 2020

In line with what @vladfrangu suggested, it searched for AIPrivateData to detect .ai (Adobe Illustrator) format.

Not a perfect solution:

  • It is likely to fail on large PDF's, as it reads 10 MB of data at most to search in.
  • Maybe triggered by AIPrivateData appearing in the content.

I removed the fixture.ai because I suspect it is truncated. I don't own Adobe Illustrator myself, so I could not test it with that.

But it is probably does the job for most cases.

Fix #360

@Borewit Borewit added the bug label Sep 15, 2020
@Borewit Borewit self-assigned this Sep 15, 2020
@vladfrangu
Copy link
Contributor

What I was going to do (but didn't get to due to school and time constraints..) was peek around.. 5mb of data at a time, check if the sequence for AIPrivateData is found, and if so, return. Of course this'd mean that we could peek over the entire file... So, also not an optimal thing, however thanks to Adobe being adobe.. we gotta do it..

@Borewit
Copy link
Collaborator Author

Borewit commented Sep 15, 2020

What I was going to do (but didn't get to due to school and time constraints..) was peek around.. 5mb of data at a time, check if the sequence for AIPrivateData is found, and if so, return. Of course this'd mean that we could peek over the entire file... So, also not an optimal thing

Sounds good @vladfrangu.
Peeking in a loop with tokenizer will result in an infinite loop as it will not increment the stream position with the data read. But if you know it is PDF, it perfectly fine to 'consume' all the data you need with read instead of peek.

, however thanks to Adobe being adobe.. we gotta do it..
Sounds good @vladfrangu.

Yeah the embedded metadata looks amazingly bad. Why bothering to put that in if it is always the same?

@vladfrangu
Copy link
Contributor

vladfrangu commented Sep 15, 2020

Peeking in a loop with tokenizer will result in an infinite loop as it will not increment the stream position with the data read. But if you know it is PDF, it perfectly fine to 'consume' all the data you need with read instead of peek.

I'm aware peek doesn't advance the position (hence it's name), so you'd skip as many bytes as you peek if you check in a loop, although you can probably just.. read it and skip the peek-ignore step

Also, pro tip: you can skip the entire first metadata part if you parse the header (it includes a length of bytes you can skip to get past the whole xml in pdf ordeal)

@Borewit
Copy link
Collaborator Author

Borewit commented Sep 15, 2020

I'm aware peek doesn't advance the position (hence it's name), so you'd skip as many bytes as you peek if you check in a loop, although you can probably just.. read it and skip the peek-ignore step

Exactly.

Also, pro tip: you can skip the entire first metadata part if you parse the header (it includes a length of bytes you can skip to get past the whole xml in pdf ordeal)

Bingo, en then use tokenizer.ignore(nrOfbytesToSkip), which will prevent reading the data of the underlying media supports it.

@Borewit
Copy link
Collaborator Author

Borewit commented Sep 18, 2020

Processing the "PDF blocks" is easier said then done. The COS ("Carousel" Object Structure) format which PDF is based on, requires semi text line oriented processing. Which is complex and tends to cross the binary format scope boundaries.

@sindresorhus
Copy link
Owner

Can you fix the merge conflict?

@Ugzuzg
Copy link
Contributor

Ugzuzg commented Jan 12, 2021

Is there any progress on this? Any help needed?

@Borewit
Copy link
Collaborator Author

Borewit commented Jan 12, 2021

Can you fix the merge conflict?

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDF created with Adobe Illustrator are wrongly detected as .ai files
4 participants