Text in generated PDF in wrong order #170

stweil · 2015-12-11T06:21:05Z

The characters in the attached image are recognized by Tesseract, but placed in a wrong order when generating a PDF file (which prevents searching in the PDF):

$ tesseract pdf-test.jpg pdf-test pdf
$ pdftotext pdf-test.pdf -
ra
me
ka
ot
ar
fr
In
en
ch
is
om
on
tr
as
r
ne
ei
g
un
ob
pr
Bau und Er

The original text scan is not exactly horizontal (-0.5°). If this is fixed before doing OCR, Tesseract creates a PDF with the correct text.


jbreiden · 2015-12-12T00:36:02Z

I suspect the problem is in pdftotext. To find out, open the PDF in another viewer such as Chrome or Adobe Reader and see what you get during search and copy-paste.

jbreiden · 2015-12-12T00:47:30Z

Confirmed. Poppler (used by pdftotext) has a very hard time with tilted text. If you want Tesseract to be more aggressive at producing flat text lines for tilted images, modify ClipBaseline(). But a better approach is to use a different text extractor, for example one based on PDFium.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275

jbreiden · 2015-12-12T01:29:15Z

And finally, if you want to post-process Tesseract-produced PDFs to flatten out tilted text lines, that's also fairly easy. Just ask if you need details.

stweil · 2015-12-12T07:32:13Z

evince (my standard PDF viewer on Linux) has the same problem and is not able to search such PDFs. The PDF viewer of iceweasel / Firefox gets the words right, so it is possible to search 'Infrarotkamera', but not the lines, so searching for word combinations does not work.

I don't want to post-process a PDF produced by Tesseract: the TXT file is perfect, so no need for me to extract text from the PDF.

I want to produce a PDF which I can give to others and which they can use with their normal PDF viewer.

Why is the text in the PDF file split into two-character sequences? In the HOCR output, words remain words.

amitdo · 2015-12-12T14:23:38Z

Evince's pdf support is based on Poppler.

https://en.wikipedia.org/wiki/Evince#Supported_document_formats
https://en.wikipedia.org/wiki/Poppler_%28software%29

stweil · 2015-12-12T14:36:51Z

Maybe Firefox also uses parts of the Poppler code. So there are a number of PDF viewers which are less robust in getting correct text lines from a Tesseract PDF. Wikipedia shows that the list of these viewers is quite impressive. This increases my wish to understand and fix what goes wrong in Tesseract's PDF generation process.

stweil · 2015-12-12T14:42:29Z

I just tested with Chromium (Debian's variant of Chrome). It extracts the text better, but still not well: "astronomischen Infrarotkamera" is not found because it is split across separate lines.

amitdo · 2015-12-12T14:51:07Z

stweil wrote: "Maybe Firefox also uses parts of the Poppler code."

No.
https://github.com/mozilla/pdf.js

amitdo · 2015-12-12T15:49:50Z

Chromium uses PDFium:
https://pdfium.googlesource.com/pdfium/
https://news.ycombinator.com/item?id=7781878
http://blog.foxitsoftware.com/foxit-pdf-technology-chosen-for-google-open-source/
https://groups.google.com/forum/#!forum/pdfium

jbreiden · 2015-12-13T05:34:54Z

There's nothing invalid with the PDF files produced; tilted symbolic text lines are legitimate in PDF. They are the accurate representation of OCR results. The words are inside there just fine, along with some geometry describing the angle of the baseline. That said, this does blow the mind of some PDF text extractors who assume that it is impossible for a line of text to have any deviation of y-position between characters. (As you might imagine, this gets particularly fun with vertical Japanese text.)
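The failure mode described above can be reproduced with a toy model. The sketch below is not Poppler's actual algorithm, and the coordinates are invented; it only assumes an extractor that treats every distinct y-position as a separate line and emits lines from top to bottom, and a baseline that rises slightly to the right. Under those assumptions the fragments of one tilted line come out in reverse order, much like the pdftotext output in the original report.

```python
# Toy model only: NOT Poppler's real algorithm. Coordinates are made up.
# PDF convention: y grows upward, so a larger y is higher on the page.

def naive_extract(fragments):
    """Treat every distinct y value as its own line, then emit top to bottom."""
    lines = {}
    for x, y, text in fragments:
        lines.setdefault(y, []).append((x, text))
    return ["".join(t for _, t in sorted(lines[y]))
            for y in sorted(lines, reverse=True)]

def tolerant_extract(fragments, tol=1.0):
    """Single-line toy: walk left to right and keep a fragment on the current
    line while its baseline is within `tol` points of the previous one."""
    out, prev_y = [], None
    for x, y, text in sorted(fragments):
        if prev_y is not None and abs(y - prev_y) <= tol:
            out[-1] += text
        else:
            out.append(text)
        prev_y = y
    return out

# One slightly tilted text line, split into fragments with rising baselines.
fragments = [
    (0.0, 100.0, "Bau und Er"),
    (60.0, 100.5, "pr"),
    (72.0, 101.0, "ob"),
    (84.0, 101.5, "un"),
    (96.0, 102.0, "g"),
]

print(naive_extract(fragments))     # fragments reversed, one per "line"
print(tolerant_extract(fragments))  # fragments rejoined into one line
```

The tolerance in `tolerant_extract` is a hypothetical parameter chosen for illustration; a real extractor would also have to separate genuinely different lines.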

You have exactly three choices if you want better compatibility with these viewers. One is to deskew the image before calling OCR. Second is modifying the section of code I pointed to earlier in Tesseract, to eliminate tilt in the symbolic text lines. Third is to remove the tilt in the invisible symbolic text lines, after the PDF is produced. All three of these approaches are relatively easy for a programmer and I'm happy to provide guidance.

There are major problems with forcing heavily tilted text lines to be flat, a big one being that word highlighting ends up in the wrong place visually. So I would be reluctant to do that by default.
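A rough way to see why flattening a heavily tilted line misplaces highlights: if a flat line is drawn at the midpoint height of the true baseline (an assumption made for this sketch), the two ends of the line are off by half the run times the tangent of the tilt angle. The function name and the sample line width are illustrative only.

```python
import math

def flatten_error_pts(run_pts, tilt_deg):
    """Vertical distance in points between the true tilted baseline and a flat
    line drawn at its midpoint height, measured at either end of the line."""
    return (run_pts / 2.0) * math.tan(math.radians(abs(tilt_deg)))

# A 400 pt wide text line (about 5.6 inches) at two different tilts.
for tilt in (0.5, 5.0):
    print(f"{tilt} deg: {flatten_error_pts(400, tilt):.2f} pt off at each end")
```

At 0.5° the error is under 2 pt, which is hard to notice; at 5° it is roughly 17.5 pt, more than a typical line height, so the highlight would land on a neighbouring line.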

stweil · 2015-12-13T09:07:46Z

There must be some other aspect of the original scan, because Tesseract generates a PDF which works perfectly with all viewers and with pdftotext from this generated image with the same tilted text:

amitdo · 2015-12-13T09:17:21Z

Poppler homepage
http://poppler.freedesktop.org/

Maybe someone wants to file a bug report?

Use bugzilla to report bugs or suggest enhancements. The component is poppler.

From the TODO:

Investigate better (that is, normal) text selection.

amitdo · 2015-12-13T09:33:52Z

@stweil, what @jbreiden is saying is that some pdf viewers have bugs that make them present skewed text lines incorrectly.

stweil · 2015-12-13T09:44:08Z

Ok. Let me explain what I have understood. Obviously it is not possible to describe a rotated text line in PDF, so you have either to pretend that the line is horizontal (that's what ABBYY does with the original image and Tesseract does with the generated image), or you have to approximate the rotation by splitting the text line into smaller parts with different y values (which is causing trouble with most or even all free PDF viewers, at least those that are based on Poppler, Mozilla or PDFium).

I noticed a 2nd difference between ABBYY and Tesseract PDF: with Evince, selected text from the Tesseract PDF is shown as a simple blue rectangle, while in the ABBYY PDF it also shows the characters with some built-in font. But that's a different story.

jbreiden · 2015-12-13T19:28:01Z

@stweil: You are misunderstanding. It is possible to describe a rotated text line in PDF. Tesseract does exactly this for any text line whose tilt is large enough. The tilted screenshot in Hebrew above is Tesseract output, displayed in Adobe Reader. You can even see the tilted text lines being highlighted for copy-paste in all their tilted glory. I point everyone again at the critical piece of code that explains the situation. It effectively says, 'Is the vertical displacement between the beginning and end of the baseline no more than 2 pts? (2 pts = 1 / 36 inch) If so, pretend the line is horizontal. Otherwise, faithfully record it as tilted.'

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275
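The rule can be sketched as follows, using the wording given later in this thread ("2pts or less vertical displacement, we draw it totally flat"). This is a Python paraphrase for illustration, not the C++ code behind the link; the function signature, the `threshold_pts` parameter, and flattening at the midpoint height are assumptions, and the real ClipBaseline() may differ in detail.

```python
def clip_baseline(ppi, x1, y1, x2, y2, threshold_pts=2.0):
    """Return baseline endpoints, flattened if the rise is at most threshold_pts.

    Inputs are pixel coordinates; ppi converts pixels to points (72 per inch).
    """
    rise_pts = abs(y2 - y1) * 72.0 / ppi
    if rise_pts <= threshold_pts:
        y_mid = (y1 + y2) / 2.0  # assumption: flatten at the midpoint height
        return x1, y_mid, x2, y_mid
    return x1, y1, x2, y2  # tilt is large enough: keep it as measured

# At 300 ppi, 8 px of rise is 1.92 pt (flattened), 9 px is 2.16 pt (kept tilted).
print(clip_baseline(300, 0, 100, 1000, 108))
print(clip_baseline(300, 0, 100, 1000, 109))
```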

P.S. While there is a prayer that Poppler will evolve, my best guess is it will instead be replaced with PDFium some time over the next 10 years. PDFium is by no means perfect, but it is much stronger than Poppler.

jbreiden · 2016-02-02T18:48:37Z

I am going to close this issue soon, as working as intended. If people think the
threshold should be increased, that's a reasonable thing to discuss. Right now if
a text line has 2pts or less vertical displacement, we draw it totally flat. Maybe
that number should be 3pts.
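To put the threshold in perspective, the sketch below computes the longest text line that would still be drawn flat at a given tilt, assuming the rule depends only on the rise between the baseline's endpoints. The function name is made up for this example; the 0.5° figure is the tilt reported for the original scan.

```python
import math

def max_flat_run_pts(tilt_deg, threshold_pts=2.0):
    """Longest horizontal run (in points) whose rise stays within threshold_pts."""
    slope = math.tan(math.radians(abs(tilt_deg)))
    if slope == 0.0:
        return math.inf  # a perfectly horizontal line is always flat
    return threshold_pts / slope

for threshold in (2.0, 3.0):
    run = max_flat_run_pts(0.5, threshold)
    print(f"{threshold} pt threshold: flat up to {run:.0f} pt ({run / 72:.1f} in)")
```

At 0.5° a 2 pt threshold flattens lines up to about 229 pt (roughly 3.2 inches) wide, and 3 pt raises that to about 344 pt (roughly 4.8 inches), so full-width lines on a typical page would still be recorded as tilted under either value.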

ravidocs · 2017-06-15T21:30:35Z

I have a similar problem when extracting data from a PDF: the text is split into one or two characters per line. I tried Tess4J, which extracts the text much better. So I am trying to find out whether there are any options in Tesseract (Java) I could use.

Thanks

amitdo mentioned this issue Jan 4, 2016

Some programs can't find OCR text in Tesseract's PDFs (3.04) #182

Closed

jbarlow83 mentioned this issue Jan 4, 2016

OCRmyPDF fails to detect text on pages created by Tesseract 3.04 ocrmypdf/OCRmyPDF#26

Closed

jbreiden closed this as completed Feb 3, 2016

amitdo added the PDF label May 30, 2016
