OCR text mangled after passing through Ghostscript (3.05.00dev) #357

jbarlow83 · 2016-06-25T00:37:41Z

The attached input file in1.pdf was generated by Tesseract 3.05.00dev. I modified it to extract only the first page using pdftk to reduce the size; results are unaffected by this.

in1.pdf
out1.pdf

(The user who forwarded the file to me confirmed that it can be released publicly.)

The OCR text of in1.pdf begins as follows – no problems here:

JP. Morgan (Suisse) SA

Account n“ 7973101
Geneva, 3rd June 2016

The problem manifests after passing the file through Ghostscript 9.18 to refry the PDF, with no other changes...

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o out1.pdf in1.pdf
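For anyone who wants to reproduce this, the command above can be wrapped in a short script that also extracts the text before and after the Ghostscript pass. This is only a sketch: the helper names (`refry_command`, `extract_text_command`, `compare_text`) are made up for illustration, and it assumes `gs` and poppler's `pdftotext` are on the PATH.

```python
import shutil
import subprocess


def refry_command(src, dst):
    """Build the same Ghostscript invocation used in this report."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite", "-o", dst, src]


def extract_text_command(pdf):
    """Build a pdftotext invocation; "-" sends the extracted text to stdout."""
    return ["pdftotext", pdf, "-"]


def compare_text(src="in1.pdf", dst="out1.pdf"):
    """Run Ghostscript on src, then return (text_before, text_after)."""
    for tool in ("gs", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(refry_command(src, dst), check=True)
    before = subprocess.run(
        extract_text_command(src), check=True, capture_output=True, text=True
    ).stdout
    after = subprocess.run(
        extract_text_command(dst), check=True, capture_output=True, text=True
    ).stdout
    return before, after
```

Calling `compare_text()` next to in1.pdf should return the two text layers so they can be diffed.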

After this pass, the OCR text (as extracted by pdftotext) is mangled in three ways: a space is inserted after each recognized letter, line breaks are inserted after certain words, and the original spaces are lost, so the word boundaries are gone (e.g. "3rdJune2016"). Normally one would pass other parameters to Ghostscript, such as those for PDF/A conversion, but the OCR text is mangled regardless of parameters.

J P .

M o r g a n

A c c o u n t n

( S u i s s e ) S A

7 9 7 3 1 0 1

G e n e v a , 3 r d J u n e 2 0 1 6

When viewed in Acrobat XI, every other letter is highlighted. This text is unusable for searching.
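A rough way to flag this kind of damage automatically is to measure how many whitespace-separated tokens in the extracted text are a single character long. The sketch below is only a heuristic: the function names and the 0.8 threshold are arbitrary choices for illustration, not part of Tesseract, Ghostscript, or OCRmyPDF.

```python
def single_char_token_ratio(text):
    """Fraction of whitespace-separated tokens that are exactly one character."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


def looks_letter_spaced(text, threshold=0.8):
    """Heuristic: True when most tokens are single characters.

    The threshold is an assumption; tune it for real documents.
    """
    return single_char_token_ratio(text) >= threshold


# Samples taken from the text layers quoted in this report.
good = "JP. Morgan (Suisse) SA\nGeneva, 3rd June 2016"
bad = "J P .\nM o r g a n\nG e n e v a , 3 r d J u n e 2 0 1 6"

print(looks_letter_spaced(good))
print(looks_letter_spaced(bad))
```

Ordinary text contains a few one-letter tokens, so the ratio stays low. In the mangled output nearly every token is a single glyph.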

I'd be happy to bring this up with the Ghostscript people, but I have a feeling that there's something unusual about how Tesseract generates OCR text in PDFs that causes Ghostscript to mishandle them, rather than the converse.

I tried discarding the OCR information, rasterizing the PDF as an image, and then running OCR on the image again using tesseract 3.04.01 and the updated pdf.ttf ("sharp2.ttf"). Specifically, I used:

ocrmypdf -f --pdf-renderer tesseract in1.pdf out1_3.04.01.pdf

Pinging @jbreiden since you worked with me on the pdf.ttf issues...


jbreiden · 2016-06-25T17:48:41Z

I worked extensively with Ken Sharp on ghostscript compatibility and thought we were in pretty good shape. Not sure what the story is here.

jbreiden · 2016-06-27T21:15:02Z

Well, I've reproduced your problem with a different PDF and the tilt doesn't seem to be a factor. The text is definitely getting represented differently after a pass through ghostscript. I think you should contact Ken and see what he thinks. There's nothing obviously wrong to my eye about the ghostscript representation, but text extraction in PDF is often more of an art than a science. (As always, I think the root problem is the PDF specification itself.)

Before:

BT
3 Tr 1 0 0 1 82.2 512.8 Tm /f-0-0 10 Tf 140.64 Tz [ <0045><0058><0050><0045><0052><0049><0045><004E><0043><0045> ] TJ 77.16 0 Td 156.802 Tz [ <0041><004E><0044> ] TJ 30.6 0 Td 129.334 Tz [ <0052><0045><004C><0041><0054><0049><0056><0049><0054> ] TJ 58.8 0 Td 141.6 Tz [ <0059> ] TJ 30.36 0 Td 94.8 Tz [ <0035><0039> ] TJ 
ET

After:

BT
/R10 10 Tf
1.4064 0 0 1 82.2 512.8 Tm
3 Tr
[(\000E)-500(\000X)-500(\000P)-500(\000E)-500(\000R)-500(\000I)-500(\000E)-500(\000N)-500(\000C)-500(\000E)-500]TJ
1.56802 0 0 1 159.36 512.8 Tm
[(\000A)-500(\000N)-500(\000D)-500]TJ
1.29334 0 0 1 189.96 512.8 Tm
[(\000R)-500(\000E)-500(\000L)-500(\000A)-500(\000T)-500(\000I)-500(\000V)-500(\000I)-500(\000T)-500]TJ
1.416 0 0 1 248.76 512.8 Tm
[(\000Y)-500]TJ
0.948 0 0 1 279.12 512.8 Tm
[(\0005)-500(\0009)-500]TJ
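To check that the two streams really describe the same layout, here is a small sketch that decodes the hex strings in the "Before" stream and accumulates its `Td` offsets and `Tz` scales, then compares them with the `Tm` matrices in the "After" stream. It is a toy interpreter that handles only the operators in this excerpt, and it assumes the initial `Tm` is a pure translation so that `Td` offsets add directly to x. All function names are made up for illustration.

```python
import re

BEFORE = (
    "3 Tr 1 0 0 1 82.2 512.8 Tm /f-0-0 10 Tf 140.64 Tz "
    "[ <0045><0058><0050><0045><0052><0049><0045><004E><0043><0045> ] TJ "
    "77.16 0 Td 156.802 Tz [ <0041><004E><0044> ] TJ "
    "30.6 0 Td 129.334 Tz [ <0052><0045><004C><0041><0054><0049><0056><0049><0054> ] TJ "
    "58.8 0 Td 141.6 Tz [ <0059> ] TJ "
    "30.36 0 Td 94.8 Tz [ <0035><0039> ] TJ"
)

# (horizontal scale, x origin) copied from each Tm line of the "After" stream.
AFTER = [(1.4064, 82.2), (1.56802, 159.36), (1.29334, 189.96),
         (1.416, 248.76), (0.948, 279.12)]


def decode_hex_run(run):
    """Turn '<0045><0058>' into 'EX' (2-byte codes read as code points)."""
    return "".join(chr(int(c, 16)) for c in re.findall(r"<([0-9A-Fa-f]{4})>", run))


def interpret(stream):
    """Return a list of (text, horizontal_scale, x) for each TJ in the stream."""
    x, scale, stack, words = 0.0, 1.0, [], []
    for tok in stream.split():
        if tok == "Tm":
            x = float(stack[-2])           # operands: a b c d e f, x is e
        elif tok == "Td":
            x += float(stack[-2])          # operands: tx ty
        elif tok == "Tz":
            scale = float(stack[-1]) / 100.0
        elif tok == "TJ":
            words.append((decode_hex_run("".join(stack)), scale, x))
        elif tok not in ("Tr", "Tf"):
            stack.append(tok)              # operand: keep until an operator
            continue
        stack.clear()                      # any operator consumes its operands
    return words


def same_layout(before_words, after_layout, tol=1e-6):
    """True when every word has the same scale and x origin in both streams."""
    return len(before_words) == len(after_layout) and all(
        abs(s - a_s) < tol and abs(x - a_x) < tol
        for (_, s, x), (a_s, a_x) in zip(before_words, after_layout)
    )


def tj_adjust_advance(adjust, font_size, h_scale):
    """User-space advance caused by one TJ number (thousandths of a unit,
    subtracted from the current position, so negative values move forward)."""
    return -adjust / 1000.0 * font_size * h_scale


words = interpret(BEFORE)
print([w[0] for w in words])
print(same_layout(words, AFTER))
print(tj_adjust_advance(-500, 10, 1.4064))
```

Under these assumptions the word origins and horizontal scales agree, so the visible difference is inside the `TJ` arrays: the "After" stream adds an explicit -500 adjustment after every glyph, which at font size 10 is half an em of forward movement per glyph. That may be what text extractors are interpreting as spaces, but this sketch does not prove it.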

jbarlow83 · 2016-06-28T19:30:47Z

Reported to Ghostscript.
http://bugs.ghostscript.com/show_bug.cgi?id=696874

I tried deskewing the PDF (ocrmypdf --deskew) and that fixed the problem. You were able to replicate on an unskewed PDF? That's interesting....

jbreiden · 2016-06-30T18:40:58Z

@jbarlow83 Can you please do a compatibility check on 2.pdf as described in this thread?
I want to get as many reports as possible before making a change.

https://groups.google.com/forum/#!topic/tesseract-dev/2EmMMoR3QGs

jbarlow83 · 2016-07-01T20:59:42Z

I found that Acrobat can work with Tesseract-produced PDFs without introducing issues in the OCR text, so it looks like the problem is definitely with Ghostscript/pdfwrite. (I tried using Acrobat for both convert to PDF/A and optimize.)

jbarlow83 mentioned this issue Jun 30, 2016

2 columns only sometimes recognized ocrmypdf/OCRmyPDF#77

Closed

jbarlow83 closed this as completed Jul 1, 2016

jbarlow83 mentioned this issue Feb 7, 2017

[Clarification request] Can OCRmyPDF be modified to also create a plain text output file ? ocrmypdf/OCRmyPDF#126

Closed
