Text in generated PDF in wrong order #170

Closed
stweil opened this issue Dec 11, 2015 · 17 comments

@stweil
Contributor

stweil commented Dec 11, 2015

The characters in the attached image are recognized by Tesseract, but placed in a wrong order when generating a PDF file (which prevents searching in the PDF):

$ tesseract pdf-test.jpg pdf-test pdf
$ pdftotext pdf-test.pdf -
ra
me
ka
ot
ar
fr
In
en
ch
is
om
on
tr
as
r
ne
ei
g
un
ob
pr
Bau und Er

The original text scan is not exactly horizontal (-0.5°). If this is fixed before doing OCR, Tesseract creates a PDF with the correct text.

pdf-test

@jbreiden
Contributor

I suspect the problem is in pdftotext. To find out, open the PDF in another viewer such as Chrome or Adobe Reader and see what you get during search and copy-paste.

@jbreiden
Contributor

Confirmed. Poppler (used by pdftotext) has a very hard time with tilted text. If you want Tesseract to be more aggressive at producing flat text lines for tilted images, modify ClipBaseline(). But a better approach is to use a different text extractor, for example one based on PDFium.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275

@jbreiden
Contributor

And finally, if you want to post-process Tesseract produced PDF to flatten out tilted textlines, that's also fairly easy. Just ask if you need details.

@stweil
Contributor Author

stweil commented Dec 12, 2015

evince (my standard PDF viewer on Linux) has the same problem and is not able to search such PDFs. The PDF viewer of iceweasel / Firefox gets the words right, so it is possible to search 'Infrarotkamera', but not the lines, so searching for word combinations does not work.

I don't want to post-process a PDF produced by Tesseract: the TXT file is perfect, so no need for me to extract text from the PDF.

I want to produce a PDF which I can give to others and which they can use with their normal PDF viewer.

Why is the text in the PDF file split into 2 character sequences? In the HOCR output, words remain words.

@amitdo
Collaborator

amitdo commented Dec 12, 2015

@stweil
Contributor Author

stweil commented Dec 12, 2015

Maybe Firefox also uses parts of the Poppler code. So there are a number of PDF viewers which are less robust in getting correct text lines from a Tesseract PDF. Wikipedia shows that the list of these viewers is quite impressive. This increases my wish to understand and fix what goes wrong in Tesseract's PDF generation process.

@stweil
Contributor Author

stweil commented Dec 12, 2015

I just tested with Chromium (Debian's variant of Chrome). It gets the text better, but not well: "astronomischen Infrarotkamera" is not found because it is split across separate lines.

@amitdo
Collaborator

amitdo commented Dec 12, 2015

> Maybe Firefox also uses parts of the Poppler code.

No.
https://github.com/mozilla/pdf.js

@jbreiden
Contributor

tilt

There's nothing invalid with the PDF files produced; tilted symbolic text lines are legitimate in PDF. They are the accurate representation of OCR results. The words are inside there just fine, along with some geometry describing the angle of the baseline. That said, this does blow the mind of some PDF text extractors who assume that it is impossible for a line of text to have any deviation of y-position between characters. (As you might imagine, this gets particularly fun with vertical Japanese text.)

You have exactly three choices if you want better compatibility with these viewers. One is to deskew the image before calling OCR. The second is to modify the section of code I pointed to earlier in Tesseract, to eliminate tilt in the symbolic text lines. The third is to remove the tilt in the invisible symbolic text lines after the PDF is produced. All three of these approaches are relatively easy for a programmer and I'm happy to provide guidance.

There are major problems with forcing heavily tilted text lines to be flat, a big one being that
word highlighting ends up in the wrong place visually. So I would be reluctant to do that by default.
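
The failure mode described above can be reproduced with a toy model. The sketch below is not Poppler's or any real viewer's algorithm; the `Fragment` tuple and both function names are hypothetical, invented for illustration. It assumes an extractor that joins text fragments only when they share exactly the same y-position, and compares it with one that tolerates a small y drift between neighbouring fragments.

```python
from typing import Dict, List, Tuple

# (x, y, text) in PDF points; y grows upward, as in PDF user space.
Fragment = Tuple[float, float, str]


def naive_lines(fragments: List[Fragment]) -> List[str]:
    """Join fragments into a line only if their y values are identical."""
    rows: Dict[float, List[Tuple[float, str]]] = {}
    for x, y, text in fragments:
        rows.setdefault(y, []).append((x, text))
    # Emit rows from the top of the page down (largest y first).
    return ["".join(t for _, t in sorted(rows[y]))
            for y in sorted(rows, reverse=True)]


def tolerant_lines(fragments: List[Fragment], tolerance: float = 1.0) -> List[str]:
    """Join fragments whose y differs from the previous one by <= tolerance."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f[1], reverse=True):
        if lines and abs(lines[-1][-1][1] - frag[1]) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, read left to right.
    return ["".join(text for _, _, text in sorted(line)) for line in lines]


# "Infrarotkamera" in two-character pieces on a baseline rising 0.25 pt per piece.
pieces = ["In", "fr", "ar", "ot", "ka", "me", "ra"]
fragments = [(10.0 * i, 100.0 + 0.25 * i, p) for i, p in enumerate(pieces)]

print(naive_lines(fragments))     # one "line" per piece, rightmost piece first
print(tolerant_lines(fragments))  # a single line in reading order
```

With this input the naive grouping produces the same kind of reversed two-character output as the pdftotext run at the top of the thread, while the tolerant grouping recovers the whole word.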

@stweil
Contributor Author

stweil commented Dec 13, 2015

There must be some other aspect of the original scan, because Tesseract generates a PDF which works perfectly with all viewers and with pdftotext from this generated image with the same tilted text:
generated-image

@amitdo
Collaborator

amitdo commented Dec 13, 2015

Poppler homepage
http://poppler.freedesktop.org/

Maybe someone wants to file a bug report?

Use bugzilla to report bugs or suggest enhancements. The component is poppler.

From the TODO:

  • Investigate better (that is, normal) text selection.

@amitdo
Collaborator

amitdo commented Dec 13, 2015

@stweil, what @jbreiden is saying is that some pdf viewers have bugs that make them present skewed text lines incorrectly.

@stweil
Contributor Author

stweil commented Dec 13, 2015

OK. Let me explain what I have understood. Obviously it is not possible to describe a rotated text line in PDF, so you either have to pretend that the line is horizontal (that's what ABBYY does with the original image and Tesseract does with the generated image), or you have to approximate the rotation by splitting the text line into smaller parts with different y values (which causes trouble with most or even all free PDF viewers, at least those that are based on Poppler, Mozilla or PDFium).

I noticed a 2nd difference between ABBYY and Tesseract PDF: with Evince, selected text from the Tesseract PDF is shown as a simple blue rectangle, while in the ABBYY PDF it also shows the characters with some built-in font. But that's a different story.

@jbreiden
Contributor

@stweil: You are misunderstanding. It is possible to describe a rotated text line in PDF. Tesseract does exactly this for any text line whose tilt is large enough. The tilted screenshot in Hebrew above is Tesseract output, displayed in Adobe Reader. You can even see the tilted text lines being highlighted for copy-paste in all their tilted glory. I point everyone again at the critical piece of code that explains the situation. It effectively says, 'Is the vertical displacement between the beginning and end of the baseline 2 pts or less? (2 pts = 1/36 inch) If so, pretend the line is horizontal. Otherwise, faithfully record it as tilted.'

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275
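
The rule as described in this thread can be sketched as follows. This is a paraphrase of the comments here, not a copy of `ClipBaseline()` in pdfrenderer.cpp. The function name, the pixel-coordinate inputs, the `threshold_pt` parameter, and the choice to level the line at the midpoint of its two ends are all assumptions made for illustration.

```python
from typing import Tuple


def flatten_if_nearly_level(
    x1: float, y1: float, x2: float, y2: float,
    ppi: float, threshold_pt: float = 2.0,
) -> Tuple[float, float, float, float]:
    """Return baseline endpoints, levelled when the rise is within the threshold.

    Coordinates are in image pixels; ppi converts the rise to points (1/72 inch).
    """
    rise_pt = abs(y2 - y1) * 72.0 / ppi
    if rise_pt <= threshold_pt:
        y_mid = (y1 + y2) / 2.0  # assumed: level the line halfway between both ends
        return (x1, y_mid, x2, y_mid)
    return (x1, y1, x2, y2)  # keep the tilt as measured


# At 300 ppi, a 6 px rise is 1.44 pt (levelled); a 12 px rise is 2.88 pt (kept).
print(flatten_if_nearly_level(100, 500, 1300, 506, ppi=300))
print(flatten_if_nearly_level(100, 500, 1300, 512, ppi=300))
```

Passing `threshold_pt=3.0` to the second call would level that baseline as well, which is the kind of change a higher threshold would make.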

P.S. While there is a prayer that Poppler will evolve, my best guess is it will instead be replaced with PDFium some time over the next 10 years. PDFium is by no means perfect, but it is much stronger than Poppler.

@jbreiden
Contributor

jbreiden commented Feb 2, 2016

I am going to close this issue soon, as working as intended. If people think the
threshold should be increased, that's a reasonable thing to discuss. Right now if
a text line has 2pts or less vertical displacement, we draw it totally flat. Maybe
that number should be 3pts.
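
To get a feel for what a 2 pt versus a 3 pt threshold means, one can work out how wide a line may be at a given skew before its rise exceeds the threshold, using rise = width × tan(skew). This is a minimal sketch: the helper name is made up, and the 0.5° figure is the skew reported for the original scan.

```python
import math


def max_flat_width_pt(skew_deg: float, threshold_pt: float) -> float:
    """Widest baseline (in points) whose rise stays within threshold_pt."""
    if skew_deg == 0:
        return math.inf  # a level line never exceeds the threshold
    return threshold_pt / math.tan(math.radians(abs(skew_deg)))


for threshold in (2.0, 3.0):
    width = max_flat_width_pt(-0.5, threshold)
    print(f"{threshold} pt threshold: lines up to {width:.0f} pt "
          f"({width / 72:.1f} in) are drawn flat")
```

At 0.5° of skew this gives roughly 229 pt (about 3.2 in) for a 2 pt threshold and roughly 344 pt (about 4.8 in) for 3 pt, so under either value a line spanning most of an A4 or Letter page would still be recorded as tilted.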

jbreiden closed this as completed Feb 3, 2016
amitdo added the PDF label May 30, 2016
@ravidocs

I have a similar problem while extracting data from a PDF: the text is split into one or two characters per line. I tried TESS4J, which gets the text much better. So I am trying to find out whether there are any options in Tesseract (Java) I could use.

Thanks
