Artifact characters extracted in some PDFs #1241

Closed
tristancatteeuw opened this issue Sep 1, 2021 Discussed in #1240 · 4 comments

@tristancatteeuw

Discussed in #1240

Originally posted by tristan19954 September 1, 2021
Hello,
I have an issue when extracting text from a PDF file. I have done this on hundreds of documents with good results, but this particular PDF shows unexpected behavior.
Here is a look at the pdf :
image

and now the text extracted :
image

As you can see, a bunch of "i" and "t" characters appeared on every line of text. I thought it might be an issue with the pdf, but when trying an online pdf converter, I didn't get those characters.

Any idea on what the issue might be?

@JorjMcKie
Collaborator

As I wrote in #1240: I do need a reproducer file and a code snippet for how you extracted the text.

@tristancatteeuw
Author

> As I wrote in #1240: I do need a reproducer file and a code snippet for how you extracted the text.

I just provided a code snippet in the discussion. Maybe that will help.

@JorjMcKie JorjMcKie added the bug label Sep 4, 2021
@JorjMcKie JorjMcKie changed the title from "Unwanted letters in specific pdf file" to "Artifact characters extracted in some PDFs" Sep 4, 2021
@JorjMcKie
Collaborator

Caused by characters with empty bounding boxes (bboxes). Fixed by checking for this in several places.
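For readers stuck on an older release, the idea of skipping characters with empty bboxes can be approximated on the user side. The following is only a minimal sketch, not the actual fix made in the library: it assumes the nested dictionary shape returned by PyMuPDF's `page.get_text("rawdict")` (blocks, lines, spans, chars, each char carrying `"c"` and `"bbox"`), and it assumes "empty" means zero or negative width or height. The helper names `is_empty_bbox` and `text_without_empty_chars` are hypothetical.

```python
def is_empty_bbox(bbox):
    """Assumption: a bbox is 'empty' if it has no positive width or height."""
    x0, y0, x1, y1 = bbox
    return x1 <= x0 or y1 <= y0


def text_without_empty_chars(rawdict):
    """Rebuild plain text from a rawdict-style structure, skipping
    characters whose bbox is empty."""
    lines_out = []
    for block in rawdict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry "lines"
            continue
        for line in block.get("lines", []):
            chars = [
                ch["c"]
                for span in line.get("spans", [])
                for ch in span.get("chars", [])
                if not is_empty_bbox(ch["bbox"])
            ]
            lines_out.append("".join(chars))
    return "\n".join(lines_out)


# Hand-made sample mimicking the rawdict layout; the middle "t" has
# zero width and stands in for an artifact character.
sample = {"blocks": [{"type": 0, "lines": [{"spans": [{"chars": [
    {"c": "H", "bbox": (10.0, 10.0, 16.0, 20.0)},
    {"c": "t", "bbox": (16.0, 10.0, 16.0, 20.0)},
    {"c": "i", "bbox": (16.0, 10.0, 19.0, 20.0)},
]}]}]}]}

print(text_without_empty_chars(sample))  # Hi
```

With a real document, the input would come from something like `page.get_text("rawdict")` on a page object. Upgrading to a fixed release is the simpler route where that is possible.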

@JorjMcKie
Collaborator

New version 1.18.18 is now on PyPI.
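To confirm that an installation already includes this release, the version string can be compared against 1.18.18. This is a small sketch with a hypothetical helper name (`has_empty_bbox_fix`), and it assumes a purely numeric `major.minor.patch` version string.

```python
FIXED_IN = (1, 18, 18)  # release announced in this thread


def parse_version(version):
    """Turn a 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def has_empty_bbox_fix(version):
    """True if the given version is 1.18.18 or later."""
    return parse_version(version) >= FIXED_IN


print(has_empty_bbox_fix("1.18.17"))
print(has_empty_bbox_fix("1.18.18"))
```

In practice the string would come from the installed package, for example `fitz.VersionBind` after `import fitz`. If the check fails, `pip install --upgrade PyMuPDF` should pull in a release that contains the fix.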
