Tesseract creates PDF with no spaces for Arabic #2446

Closed
yregaieg opened this issue May 20, 2019 · 10 comments

@yregaieg

yregaieg commented May 20, 2019

I am trying Tesseract on an Arabic document, and I noticed that text recognition works extremely well (I am actually quite surprised by its accuracy).
However, when I try to generate a PDF with a text overlay on top of the image using: tesseract -l ara test-ocr.jpg result pdf, the generated document doesn't contain any spaces.
Here is a snippet of the TXT extracted using tesseract -l ara test-ocr.jpg result:

الموضوع : حول تنفيذ مشروع انتزاع للمصلحة العامة

Here is the same snippet, copied from the PDF generated with the command tesseract -l ara test-ocr.jpg result pdf:

الموضوع:حولتنفيذمشروعانتزاعللمصلحةالعامة

Sample image to test this out : https://imgur.com/a/TsfudZ6
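
The comparison between the two snippets above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names strip_whitespace and spaces_were_dropped are hypothetical, and the two strings are the snippets quoted in this issue. It reports whether the PDF text carries the same characters as the TXT output but fewer whitespace-separated words.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(ch for ch in text if not ch.isspace())


def spaces_were_dropped(txt_output: str, pdf_text: str) -> bool:
    """Return True when pdf_text has the same non-space characters as
    txt_output but splits into fewer whitespace-separated words."""
    same_chars = strip_whitespace(txt_output) == strip_whitespace(pdf_text)
    fewer_words = len(pdf_text.split()) < len(txt_output.split())
    return same_chars and fewer_words


if __name__ == "__main__":
    # Snippets quoted in this issue: plain-text output vs. text copied from the PDF.
    txt_output = "الموضوع : حول تنفيذ مشروع انتزاع للمصلحة العامة"
    pdf_text = "الموضوع:حولتنفيذمشروعانتزاعللمصلحةالعامة"
    print(spaces_were_dropped(txt_output, pdf_text))
```

The check only looks at whitespace, so it says nothing about recognition accuracy; it is meant to separate "spaces missing" from "text differs".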


Environment

  • Tesseract Version:
09:50 $ tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
  • Platform:
Darwin XXXX 18.5.0 Darwin Kernel Version 18.5.0: Mon Mar 11 20:40:32 PDT 2019; root:xnu-4903.251.3~3/RELEASE_X86_64 x86_64
@zdenop
Contributor

zdenop commented May 20, 2019

Please use the latest code when reporting an issue.

@Shreeshrii
Collaborator

Shreeshrii commented May 20, 2019 via email

@amitdo
Collaborator

amitdo commented May 20, 2019

Also test with other pdf viewers. The default pdf viewer in macOS is not so good at displaying Tesseract's output. The built-in pdf viewer in Google's Chrome browser is recommended.
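
One way to get a data point that does not depend on any particular viewer's copy/paste behaviour is to extract the text layer with a command-line tool. The sketch below assumes Poppler's pdftotext is installed, which this thread does not mention; the function names are hypothetical. It is only one more extractor with its own heuristics, not a ground truth for what the PDF contains.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build a pdftotext call that writes UTF-8 text to stdout ("-")."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text_layer(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None when pdftotext is not on PATH."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8")
```

Calling extract_text_layer("result.pdf") on the file produced by the command in the report, and comparing it with what each viewer copies, would show whether the missing spaces follow the file or the viewer.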

@yregaieg
Author

yregaieg commented May 20, 2019 via email

@yregaieg
Author

Just an update: following the recommendation from @zdenop, I built Tesseract from source (5.0 alpha) and retried. I can confirm that I still get the same results as with 4.0.0.
I have also tried all possible --psm options (I read somewhere that it made a difference with Japanese), but it doesn't seem to help in my case!
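
For anyone repeating the --psm sweep, here is a minimal sketch of how it could be scripted. It assumes page segmentation modes 0 through 13 (the range listed by tesseract --help-psm in 4.x) and uses hypothetical output names of the form result-psmN; it only prints the commands, so nothing is run unless each list is passed to subprocess.run.

```python
from typing import Iterable, List


def tesseract_pdf_command(image: str, output_base: str, psm: int,
                          lang: str = "ara") -> List[str]:
    """Build the PDF-generating command from the report, plus a --psm value."""
    return ["tesseract", "-l", lang, "--psm", str(psm), image, output_base, "pdf"]


def psm_commands(image: str, psm_values: Iterable[int] = range(0, 14)) -> List[List[str]]:
    """One command per page segmentation mode, each with its own output base."""
    return [
        tesseract_pdf_command(image, f"result-psm{psm}", psm)
        for psm in psm_values
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in psm_commands("test-ocr.jpg"):
        print(" ".join(cmd))
```

Some modes may fail or produce no text layer on a given image (mode 0, for example, only performs orientation and script detection), so a non-zero exit status for individual modes is to be expected.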

@jbreiden
Contributor

jbreiden commented May 20, 2019 via email

@yregaieg
Author

That would be highly appreciated. I have been working on this pro bono project for a couple of months, and it would be a shame to abandon it at this stage because of this bug.
@jbreiden Also, if you manage to locate the culprit but don't have enough time to fine-tune or experiment, I would be more than happy to help :)

@jbreiden
Contributor

jbreiden commented May 23, 2019 via email

@jbreiden
Contributor

jbreiden commented May 23, 2019 via email

@yregaieg
Author

Getting back to this issue: I do indeed see spaces now, at least in Adobe Acrobat Reader DC. I am not sure what I missed when I first raised this issue. In Alfresco and PDF.js, though, I was searching for a word and assumed I had the same issue as in the macOS Preview app, but it turns out I was stumbling on #238, which is even more annoying!
Thanks, guys, and sorry for the false alarm.

amitdo added the PDF label Mar 18, 2021

5 participants