Text in generated PDF in wrong order #170
Comments
I suspect the problem is in pdftotext. To find out, open the PDF in another viewer such as Chrome or Adobe Reader and see what you get during search and copy-paste. |
Confirmed. Poppler (used by pdftotext) has a very hard time with tilted text. If you want Tesseract to be more aggressive at producing flat text lines for tilted images, modify ClipBaseline(). But a better approach is to use a different text extractor, for example one based on PDFium. https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275 |
And finally, if you want to post-process Tesseract produced PDF to flatten out tilted textlines, that's also fairly easy. Just ask if you need details. |
I don't want to post-process a PDF produced by Tesseract: the TXT file is perfect, so no need for me to extract text from the PDF. I want to produce a PDF which I can give to others and which they can use with their normal PDF viewer. Why is the text in the PDF file split into 2 character sequences? In the HOCR output, words remain words. |
Evince's pdf support is based on Poppler. https://en.wikipedia.org/wiki/Evince#Supported_document_formats |
Maybe Firefox also uses parts of the Poppler code. So there are a number of PDF viewers which are less robust in getting correct text lines from a Tesseract PDF. Wikipedia shows that the list of these viewers is quite impressive. This increases my wish to understand and fix what goes wrong in Tesseract's PDF generation process. |
I just tested with Chromium (Debian's variant of Chrome). It gets the text better, but not good: "astronomischen Infrarotkamera" is not found because it is split in separate lines. |
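A toy model may help picture how a phrase can end up split or reordered in a viewer. The sketch below assumes a hypothetical extractor that treats every distinct y position as its own line. It is not the actual algorithm of Poppler, Chromium, or any other viewer, and `place_text` and `naive_lines` are made-up names used only for illustration.

```python
from itertools import groupby


def place_text(text, y0=100, advance=10, glyphs_per_step=None):
    """Hypothetical layout: return (char, x, y) tuples for one text line.

    If glyphs_per_step is set, y rises by one unit every that many glyphs,
    which imitates a slightly tilted baseline. Otherwise the line is flat.
    """
    glyphs = []
    for i, ch in enumerate(text):
        drift = i // glyphs_per_step if glyphs_per_step else 0
        glyphs.append((ch, i * advance, y0 + drift))
    return glyphs


def naive_lines(glyphs):
    """Toy extractor: every distinct y value is treated as a separate line.

    Lines are emitted top of page first (PDF y grows upward), and glyphs
    within a line are ordered left to right.
    """
    ordered = sorted(glyphs, key=lambda g: (-g[2], g[1]))
    return [
        "".join(ch for ch, _, _ in group)
        for _, group in groupby(ordered, key=lambda g: g[2])
    ]


flat = place_text("Infrarotkamera")
tilted = place_text("Infrarotkamera", glyphs_per_step=2)

print(naive_lines(flat))    # ['Infrarotkamera']
print(naive_lines(tilted))  # ['ra', 'me', 'ka', 'ot', 'ar', 'fr', 'In']
```

Under these assumptions the tilted line falls apart into two-character fragments, and because the fragments are sorted by height they also come out in the wrong order, so a search for the whole word fails.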
There's nothing invalid with the PDF files produced; tilted symbolic text lines are legitimate in PDF. They are the accurate representation of OCR results. The words are inside there just fine, along with some geometry describing the angle of the baseline. That said, this does blow the mind of some PDF text extractors that assume it is impossible for a line of text to have any deviation of y-position between characters. (As you might imagine, this gets particularly fun with vertical Japanese text.) You have exactly three choices if you want better compatibility with these viewers. One is to deskew the image before calling OCR. The second is to modify the section of code I pointed to earlier in Tesseract, to eliminate tilt in the symbolic text lines. The third is to remove the tilt in the invisible symbolic text lines after the PDF is produced. All three of these approaches are relatively easy for a programmer and I'm happy to provide guidance. There are major problems with forcing heavily tilted text lines to be flat, a big one being
Poppler homepage. Maybe someone wants to file a bug report?
From the TODO:
Ok. Let me explain what I have understood. Obviously it is not possible to describe a rotated text line in PDF, so you have either to pretend that the line is horizontal (that's what ABBYY does with the original image and Tesseract does with the generated image), or you have to approximate the rotation by splitting the text line into smaller parts with different y values (which is causing trouble with most or even all free PDF viewers, at least those that are based on Poppler, Mozilla or PDFium). I noticed a 2nd difference between ABBYY and Tesseract PDF: with Evince, selected text from the Tesseract PDF is shown as a simple blue rectangle, while in the ABBYY PDF it also shows the characters with some built-in font. But that's a different story. |
@stweil: You are misunderstanding. It is possible to describe a rotated text line in PDF. Tesseract does exactly this for any text line whose tilt is large enough. The tilted screenshot in Hebrew above is Tesseract output, displayed in Adobe Reader. You can even see the tilted text lines being highlighted for copy-paste in all their tilted glory. I point everyone again at the critical piece of code that explains the situation. It effectively says, 'Does the vertical displacement between the beginning and end of the baseline exceed 2 pts? (2 pts = 1 / 36 inch) If not, pretend the line is horizontal. Otherwise, faithfully record it as tilted.' https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L275 P.S. While there is a prayer that Poppler will evolve, my best guess is it will instead be replaced with PDFium some time over the next 10 years. PDFium is by no means perfect, but it is much stronger than Poppler. |
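The rule in this comment (a small tilt is flattened, a large enough tilt is recorded as is) can be sketched roughly as follows. This is a Python paraphrase, not the Tesseract source: the function name, signature, and the choice of the midpoint as the flattened height are assumptions, and the real ClipBaseline() may check further conditions.

```python
def clip_baseline(ppi, x1, y1, x2, y2, max_rise_pt=2.0):
    """Return baseline endpoints (x1, y1, x2, y2) in pixels.

    Hypothetical sketch: the baseline is flattened when its rise is under
    max_rise_pt PDF points (72 points per inch), and kept tilted otherwise.
    """
    rise_pt = abs(y2 - y1) * 72.0 / ppi  # convert pixel rise to points
    if rise_pt < max_rise_pt:
        y_mid = (y1 + y2) // 2  # assumed choice: flatten to the midpoint
        return x1, y_mid, x2, y_mid
    return x1, y1, x2, y2


# At 300 ppi, 2 pt is about 8.3 pixels of rise.
print(clip_baseline(300, 0, 100, 1800, 104))  # (0, 102, 1800, 102)
print(clip_baseline(300, 0, 100, 1800, 116))  # (0, 100, 1800, 116)
```

Making the threshold a parameter shows what "being more aggressive" about flat lines would amount to in this sketch: raising `max_rise_pt` flattens more lines, at the cost of a symbolic text layer that matches the tilted image less closely.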
I am going to close this issue soon, as working as intended. If people think the |
I have a similar problem when extracting data from a PDF: the text is split into one or two characters per line. I tried Tess4J, which extracts the text much better, so I am trying to find out whether there are any options in Tesseract (Java) that I could use. Thanks |
The characters in the attached image are recognized by Tesseract, but placed in a wrong order when generating a PDF file (which prevents searching in the PDF):
The original text scan is not exactly horizontal (-0.5°). If this is fixed before doing OCR, Tesseract creates a PDF with the correct text.
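For a sense of scale, the rise of a baseline tilted by 0.5° can be worked out with basic trigonometry and compared with the 2 pt threshold mentioned in the comments. The line lengths below are assumptions chosen for illustration, not measurements from the attached scan, and `baseline_rise_pt` is a made-up helper name.

```python
import math


def baseline_rise_pt(run_pt, tilt_degrees):
    """Vertical displacement (in points) across a horizontal run of run_pt points."""
    return abs(run_pt * math.tan(math.radians(tilt_degrees)))


THRESHOLD_PT = 2.0  # value quoted in the thread (2 pt = 1/36 inch)

for inches in (2, 6):  # assumed line lengths
    rise = baseline_rise_pt(inches * 72, -0.5)
    print(f"{inches} in line: rise {rise:.2f} pt, exceeds threshold: {rise > THRESHOLD_PT}")

# Horizontal run at which a 0.5 degree tilt reaches the threshold exactly.
break_even_pt = THRESHOLD_PT / math.tan(math.radians(0.5))
print(f"break-even run: {break_even_pt / 72:.2f} in")
```

With these numbers a 6 inch line rises by roughly 3.8 pt, and a 0.5° tilt reaches 2 pt at a run of a little over 3 inches. So if tilts above the threshold are recorded as tilted, as the comments describe, full-width lines in such a scan would end up tilted in the PDF unless the image is deskewed first.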