OCR text mangled after passing through Ghostscript (3.05.00dev) #357
Comments
I worked extensively with Ken Sharp on Ghostscript compatibility and thought we were in pretty good shape. Not sure what the story is here.
Well, I've reproduced your problem with a different PDF, and the tilt doesn't seem to be a factor. The text is definitely getting represented differently after a pass through Ghostscript. I think you should contact Ken and see what he thinks. There's nothing obviously wrong to my eye about the Ghostscript representation, but text extraction in PDF is often more of an art than a science. (As always, I think the root problem is the PDF specification itself.)
Before:
After:
Reported to Ghostscript. I tried deskewing the PDF (
@jbarlow83 Can you please do a compatibility check on 2.pdf as described in this thread? https://groups.google.com/forum/#!topic/tesseract-dev/2EmMMoR3QGs
I found that Acrobat can work with Tesseract-produced PDFs without introducing issues in the OCR text, so it looks like the problem is definitely with Ghostscript/pdfwrite. (I tried using Acrobat for both convert to PDF/A and optimize.)
The attached input file in1.pdf was generated by Tesseract 3.05.00dev. I modified it to extract only the first page using pdftk to reduce the size; results are unaffected by this.
in1.pdf
out1.pdf
(The user who forwarded the file to me confirmed that it can be released publicly.)
The OCR text of in1.pdf begins as follows – no problems here:
The problem manifests after passing the file through Ghostscript 9.18 to refry the PDF, with no other changes...
The OCR text (as extracted by pdftotext) is mangled by the insertion of spaces after each recognized letter, by line breaks after certain words, and by the loss of the original spaces, so that word boundaries are gone (for example, "3rdJune2016"). Normally one would pass some other parameters to Ghostscript, such as PDF/A conversion, but the OCR text is mangled regardless of parameters. When viewed in Acrobat XI, every other letter is highlighted. This text is unusable for searching.
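For anyone wanting to check their own files for this symptom, here is a rough sketch of a heuristic in Python. It assumes you have already extracted the text from the input and output PDFs (for example with pdftotext) into two strings. The function names, the 0.5 threshold, and the sample strings are all made up for illustration; they are not taken from in1.pdf or out1.pdf.

```python
import re


def single_char_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one letter or digit."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for t in tokens if len(t) == 1 and t.isalnum())
    return singles / len(tokens)


def same_characters(before: str, after: str) -> bool:
    """True if both texts match once all whitespace is ignored."""
    def strip_ws(s: str) -> str:
        return re.sub(r"\s+", "", s)
    return strip_ws(before) == strip_ws(after)


def looks_mangled(before: str, after: str, threshold: float = 0.5) -> bool:
    """Heuristic: same characters, but the 'after' text is mostly
    single-character tokens while the 'before' text is not.
    The threshold is an arbitrary assumption, not a tuned value."""
    return (
        same_characters(before, after)
        and single_char_token_ratio(before) <= threshold
        and single_char_token_ratio(after) > threshold
    )


# Hypothetical sample strings, not the real OCR text from the attachments.
before_text = "Dated 3rd June 2016"
after_text = "D a t e d 3 r d J u n e 2 0 1 6"
print(looks_mangled(before_text, after_text))
```

This only catches the "space after every letter" pattern. It would not flag a file where the characters themselves changed, since `same_characters` would then return False.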
I'd be happy to bring this up with the Ghostscript people, but I have a feeling that there's something unusual about how Tesseract generates OCR text in PDFs that causes Ghostscript to mishandle them, rather than the converse.
I tried discarding the OCR information, rasterizing the PDF as an image, and then running OCR on the image again using Tesseract 3.04.01 and the updated pdf.ttf ("sharp2.ttf"). Specifically I used
Pinging @jbreiden since you worked with me on the pdf.ttf issues...