OCR text mangled after passing through Ghostscript (3.05.00dev) #357

Closed
jbarlow83 opened this issue Jun 25, 2016 · 5 comments

@jbarlow83

jbarlow83 commented Jun 25, 2016

The attached input file in1.pdf was generated by Tesseract 3.05.00dev. To reduce the size, I used pdftk to extract only the first page; the results are unaffected by this.

in1.pdf
out1.pdf

(The user who forwarded the file to me confirmed that it can be released publicly.)

The OCR text of in1.pdf begins as follows – no problems here:

JP. Morgan (Suisse) SA

Account n“ 7973101
Geneva, 3rd June 2016

The problem manifests after passing the file through Ghostscript 9.18 to refry the PDF, with no other changes...

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o out1.pdf in1.pdf

The OCR text is mangled in three ways: a space is inserted after each recognized letter, line breaks appear after certain words (as seen in pdftotext output), and the original spaces are lost, so the word boundaries are gone ("3rdJune2016"). Normally one would pass other parameters to Ghostscript, such as those for PDF/A conversion, but the OCR text is mangled regardless of parameters.

J P .

M o r g a n

A c c o u n t n

( S u i s s e ) S A

7 9 7 3 1 0 1

G e n e v a , 3 r d J u n e 2 0 1 6

When viewed in Acrobat XI, every other letter is highlighted. This text is unusable for searching.
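A rough way to flag this kind of damage automatically is to measure how many whitespace-separated tokens in the extracted text are single characters. The sketch below is only a heuristic. The function names and the 0.8 threshold are assumptions chosen for illustration and are not part of Tesseract, Ghostscript, or pdftotext.

```python
# Text samples copied from this report: extracted text before and after Ghostscript.
GOOD = "JP. Morgan (Suisse) SA\n\nAccount n“ 7973101\nGeneva, 3rd June 2016"
MANGLED = (
    "J P .\n\nM o r g a n\n\nA c c o u n t n\n\n( S u i s s e ) S A\n\n"
    "7 9 7 3 1 0 1\n\nG e n e v a , 3 r d J u n e 2 0 1 6"
)


def single_char_token_ratio(text):
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def looks_letter_spaced(text, threshold=0.8):
    """Heuristic: True when most tokens are single characters.

    The 0.8 threshold is an arbitrary assumption, not a tuned value.
    """
    return single_char_token_ratio(text) >= threshold


if __name__ == "__main__":
    for name, sample in (("in1.pdf", GOOD), ("out1.pdf", MANGLED)):
        print(name, single_char_token_ratio(sample), looks_letter_spaced(sample))
```

Normal prose in a language with many one-letter words could trip this check at a lower threshold, so it is better suited to comparing the same page before and after processing than to judging a file in isolation.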

I'd be happy to bring this up with the Ghostscript people, but I have a feeling that something unusual about how Tesseract generates OCR text in PDFs causes Ghostscript to mishandle them, rather than the converse.

I tried discarding the OCR information, rasterizing the PDF as an image, and then running OCR on the image again using Tesseract 3.04.01 and the updated pdf.ttf ("sharp2.ttf"). Specifically, I used:

ocrmypdf -f --pdf-renderer tesseract in1.pdf out1_3.04.01.pdf

Pinging @jbreiden since you worked with me on the pdf.ttf issues...

@jbreiden
Contributor

I worked extensively with Ken Sharp on Ghostscript compatibility and thought we were in pretty good shape. Not sure what the story is here.

@jbreiden
Contributor

jbreiden commented Jun 27, 2016

Well, I've reproduced your problem with a different PDF, and the tilt doesn't seem to be a factor. The text is definitely represented differently after a pass through Ghostscript. I think you should contact Ken and see what he thinks. Nothing looks obviously wrong to my eye in the Ghostscript representation, but text extraction in PDF is often more of an art than a science. (As always, I think the root problem is the PDF specification itself.)

Before:

BT
3 Tr 1 0 0 1 82.2 512.8 Tm /f-0-0 10 Tf 140.64 Tz [ <0045><0058><0050><0045><0052><0049><0045><004E><0043><0045> ] TJ 77.16 0 Td 156.802 Tz [ <0041><004E><0044> ] TJ 30.6 0 Td 129.334 Tz [ <0052><0045><004C><0041><0054><0049><0056><0049><0054> ] TJ 58.8 0 Td 141.6 Tz [ <0059> ] TJ 30.36 0 Td 94.8 Tz [ <0035><0039> ] TJ 
ET

After:

BT
/R10 10 Tf
1.4064 0 0 1 82.2 512.8 Tm
3 Tr
[(\000E)-500(\000X)-500(\000P)-500(\000E)-500(\000R)-500(\000I)-500(\000E)-500(\000N)-500(\000C)-500(\000E)-500]TJ
1.56802 0 0 1 159.36 512.8 Tm
[(\000A)-500(\000N)-500(\000D)-500]TJ
1.29334 0 0 1 189.96 512.8 Tm
[(\000R)-500(\000E)-500(\000L)-500(\000A)-500(\000T)-500(\000I)-500(\000V)-500(\000I)-500(\000T)-500]TJ
1.416 0 0 1 248.76 512.8 Tm
[(\000Y)-500]TJ
0.948 0 0 1 279.12 512.8 Tm
[(\0005)-500(\0009)-500]TJ
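To make the relationship between the two streams easier to check, here is a minimal Python sketch that walks the "Before" stream and reports, for each TJ, the decoded text, the origin, and the horizontal scale. It is not a general PDF parser: it handles only the operators that appear in this snippet, and it assumes the 2-byte hex codes map directly to Unicode code points, which happens to hold for these letters. The names `trace_text_runs` and `tj_adjustment_advance` are made up for illustration.

```python
import re

# "Before" stream from this comment, joined into one string.
BEFORE = (
    "3 Tr 1 0 0 1 82.2 512.8 Tm /f-0-0 10 Tf 140.64 Tz "
    "[ <0045><0058><0050><0045><0052><0049><0045><004E><0043><0045> ] TJ "
    "77.16 0 Td 156.802 Tz [ <0041><004E><0044> ] TJ "
    "30.6 0 Td 129.334 Tz "
    "[ <0052><0045><004C><0041><0054><0049><0056><0049><0054> ] TJ "
    "58.8 0 Td 141.6 Tz [ <0059> ] TJ "
    "30.36 0 Td 94.8 Tz [ <0035><0039> ] TJ"
)

# Arrays, names, numbers, and operators; enough for this snippet only.
TOKEN = re.compile(r"\[[^\]]*\]|/\S+|[-+]?\d*\.?\d+|[A-Za-z]+")


def trace_text_runs(stream):
    """Return (text, x, y, horizontal_scale) for each TJ in a simple stream."""
    stack, runs = [], []
    x = y = 0.0
    h_scale = 1.0
    for tok in TOKEN.findall(stream):
        if tok == "Tm":    # a b c d e f Tm: the last two operands are the origin
            x, y = float(stack[-2]), float(stack[-1])
        elif tok == "Td":  # tx ty Td: offset from the start of the current line
            x, y = x + float(stack[-2]), y + float(stack[-1])
        elif tok == "Tz":  # percent Tz: horizontal scaling, 100 means unscaled
            h_scale = float(stack[-1]) / 100.0
        elif tok == "TJ":
            # Assumption: each 4-digit hex code is a Unicode code point.
            codes = re.findall(r"<([0-9A-Fa-f]{4})>", stack[-1])
            text = "".join(chr(int(code, 16)) for code in codes)
            runs.append((text, round(x, 2), round(y, 2), round(h_scale, 5)))
        if tok.isalpha():
            stack.clear()      # an operator consumes its operands
        else:
            stack.append(tok)  # operand: keep it for the next operator
    return runs


def tj_adjustment_advance(adjust, font_size, h_scale):
    """Horizontal move caused by one numeric TJ entry, in text space units.

    In a TJ array the number is in thousandths of a unit and is subtracted,
    so a negative value such as -500 moves the pen forward.
    """
    return -adjust / 1000.0 * font_size * h_scale


if __name__ == "__main__":
    for run in trace_text_runs(BEFORE):
        print(run)
    print(tj_adjustment_advance(-500, 10, 1.4064))
```

On the "Before" stream this gives the runs EXPERIENCE, AND, RELATIVIT, Y and 59 at x positions 82.2, 159.36, 189.96, 248.76 and 279.12, with horizontal scales 1.4064, 1.56802, 1.29334, 1.416 and 0.948. Those values match the Tm lines in the "After" stream, so Ghostscript appears to have folded each Tz and Td pair into an explicit Tm. What is new in the "After" stream is the -500 after every glyph, which at font size 10 adds about 5 units times the horizontal scale to each advance. Whether that extra advance is what text extractors react to is not established in this thread.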

@jbarlow83
Author

Reported to Ghostscript.
http://bugs.ghostscript.com/show_bug.cgi?id=696874

I tried deskewing the PDF (ocrmypdf --deskew) and that fixed the problem. You were able to replicate on an unskewed PDF? That's interesting....

@jbreiden
Contributor

@jbarlow83 Can you please do a compatibility check on 2.pdf as described in this thread?
I want to get as many reports as possible before making a change.

https://groups.google.com/forum/#!topic/tesseract-dev/2EmMMoR3QGs

@jbarlow83
Author

I found that Acrobat can work with Tesseract-produced PDFs without introducing issues in the OCR text, so it looks like the problem is definitely with Ghostscript/pdfwrite. (I tried using Acrobat for both convert to PDF/A and optimize.)
