[BUG] Failures when extracting text from pdf #554
Comments
Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/
Hi @ryankilroy, thank you for reporting this issue. We were able to reproduce it using the sample code and sample file you provided, and we are currently investigating the cause. We will post an update as soon as we identify the source of the issue and a fix.
Hi @ryankilroy, after some investigation, we found out that the issue is in the
Regarding your second issue, i.e., font extraction: it fails because pages 3 and beyond contain no fonts (those pages are scanned). The current error message is not informative enough to convey this, so we will update it as well.
Hi @ryankilroy, this issue is fixed in the new release (
Description
When I attempt to extract the text from a pdf with certain embedded fonts, the result contains missing rune characters. The fonts don't seem to throw errors on the first page (which still has missing runes), but when I attempt to extract the fonts from the later pages in the pdf, I get some
Can't convert font object, invalid type
errors.
Expected Behavior
I expect to be able to extract usable text from the pdf
Actual Behavior
Extracting text from the pdf results in missing runes
Steps to reproduce the behavior:
If you instead run
pdftotext <file.pdf> -
against it, the text is fully readable.
Attachments
Sample PDF.pdf
Examples
There are more missing runes in areas of the actual pdf, but I couldn't replicate them with the anonymized data. Here are some of the examples
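To put a number on the difference, here is a minimal Go sketch that counts missing runes in a piece of extracted text. It assumes the missing runes show up as the Unicode replacement character U+FFFD, which may not hold for every extractor. The helper name `countMissingRunes` is made up for this illustration and is not part of any library. The `main` function just feeds it the output of the `pdftotext <file.pdf> -` command mentioned above as a baseline.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"unicode/utf8"
)

// countMissingRunes is a hypothetical helper: it counts U+FFFD replacement
// characters in text. Assumption: missing runes are emitted as U+FFFD.
// Note that ranging over a string also decodes invalid UTF-8 bytes as U+FFFD,
// so those are counted too.
func countMissingRunes(text string) int {
	n := 0
	for _, r := range text {
		if r == utf8.RuneError {
			n++
		}
	}
	return n
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: missingrunes <file.pdf>")
		return
	}
	// Baseline: run pdftotext and read the extracted text from stdout.
	out, err := exec.Command("pdftotext", os.Args[1], "-").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "pdftotext failed:", err)
		os.Exit(1)
	}
	fmt.Printf("pdftotext: %d missing runes out of %d runes\n",
		countMissingRunes(string(out)), utf8.RuneCount(out))
}
```

Running the same helper on the text returned by the library's extractor would give a per-page count to compare against the baseline.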