Tesseract creates PDF with no spaces for Arabic #2446

yregaieg · 2019-05-20T07:55:00Z

I am trying Tesseract on an Arabic document, and text recognition works extremely well (I am actually quite surprised by its accuracy).
However, when I generate a PDF with a text overlay on top of the image using the command tesseract -l ara test-ocr.jpg result pdf, the text in the generated document contains no spaces.
Here is a snippet of the TXT output extracted with tesseract -l ara test-ocr.jpg result:

الموضوع : حول تنفيذ مشروع انتزاع للمصلحة العامة

Here is the same snippet, copied from the PDF generated with tesseract -l ara test-ocr.jpg result pdf:

الموضوع:حولتنفيذمشروعانتزاعللمصلحةالعامة
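
The two snippets above appear to differ only in whitespace. One way to check this without relying on a viewer's copy-paste is to extract the PDF text layer from the command line and compare it with the TXT output. The following is a minimal sketch, not something taken from this thread: it assumes pdftotext from Poppler is installed, and the helper names and the file name result-from-pdf.txt are hypothetical. Extraction tools can also reorder right-to-left text, so a mismatch here does not by itself prove a Tesseract bug.

```shell
#!/bin/sh
# Sketch: check whether two text extractions differ only in whitespace.
# Assumption: pdftotext (Poppler) is available; it is not mentioned in the thread.
# All function names below are hypothetical helpers.

# Print stdin with every whitespace character removed.
strip_ws() {
  tr -d '[:space:]'
}

# Print the number of plain space characters read from stdin.
count_spaces() {
  tr -cd ' ' | wc -c | tr -d '[:space:]'
}

# Succeed if the two files contain the same characters once whitespace is ignored.
same_modulo_whitespace() {
  [ "$(strip_ws < "$1")" = "$(strip_ws < "$2")" ]
}

# Usage (left as comments so sourcing this file does not run OCR):
#   tesseract -l ara test-ocr.jpg result        # writes result.txt
#   tesseract -l ara test-ocr.jpg result pdf    # writes result.pdf
#   pdftotext result.pdf result-from-pdf.txt
#   same_modulo_whitespace result.txt result-from-pdf.txt && echo "same characters"
#   count_spaces < result-from-pdf.txt          # 0 would match the symptom reported here
```

If the character comparison succeeds but the space count from the PDF is zero, the text layer itself lacks spaces; if the count is non-zero, the viewer's copy or search behaviour is the more likely suspect.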

Sample image to test this out: https://imgur.com/a/TsfudZ6

Environment

Tesseract Version:

09:50 $ tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE

Platform:

Darwin XXXX 18.5.0 Darwin Kernel Version 18.5.0: Mon Mar 11 20:40:32 PDT 2019; root:xnu-4903.251.3~3/RELEASE_X86_64 x86_64

zdenop · 2019-05-20T08:55:01Z

Please use the latest code when reporting an issue.

Shreeshrii · 2019-05-20T10:27:48Z

@zdenop Please consider releasing 4.1 RC2.

amitdo · 2019-05-20T12:04:34Z

Also test with other PDF viewers. The default PDF viewer in macOS is not very good at displaying Tesseract's output. The built-in PDF viewer in Google Chrome is recommended.

yregaieg · 2019-05-20T12:08:21Z

I tried both, actually, along with the PDF.js and Tika parsers inside Alfresco Content Services.

yregaieg · 2019-05-20T22:16:57Z

Just an update: following the recommendation from @zdenop, I built Tesseract from source (5.0 alpha) and retried. I can confirm that I still get the same results as with 4.0.0.
I have also tried all possible --psm options (I read somewhere that it made a difference with Japanese), but it doesn't seem to help in my case!
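
For anyone who wants to repeat the --psm sweep, a loop along these lines is one way to do it. This is a sketch under assumptions and not necessarily what was run here: the function name run_all_psm and the output names result-psm<N> are hypothetical, and modes 0 to 2 are skipped on the assumption that they either need orientation and script detection data or do not perform OCR. The TESSERACT variable exists only so the loop can be dry-run with echo.

```shell
#!/bin/sh
# Sketch: generate one searchable PDF per page segmentation mode.
# Assumption: modes 3..13 are the OCR-capable modes of interest (see tesseract --help-psm).

# Binary to invoke; override for a dry run, e.g. TESSERACT=echo.
TESSERACT=${TESSERACT:-tesseract}

# run_all_psm IMAGE: writes result-psm3.pdf through result-psm13.pdf.
run_all_psm() {
  image=$1
  psm=3
  while [ "$psm" -le 13 ]; do
    # Same invocation as in the report, with --psm added before the pdf config name.
    "$TESSERACT" -l ara "$image" "result-psm$psm" --psm "$psm" pdf \
      || echo "psm $psm failed" >&2
    psm=$((psm + 1))
  done
}

# Usage (commented out so sourcing this file does not start OCR):
#   run_all_psm test-ocr.jpg
```

Each resulting PDF can then be opened in the same viewer to see whether the segmentation mode changes how spaces appear in the text layer.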

jbreiden · 2019-05-20T22:59:26Z

I'll take a look but no promises.

yregaieg · 2019-05-22T05:59:16Z

That would be highly appreciated. I have been working on this pro bono project for a couple of months, and it would be a shame to abandon it at this stage because of this bug.
@jbreiden Also, if you manage to locate the culprit but don't have enough time to fine-tune or experiment, I would be more than happy to help :)

jbreiden · 2019-05-23T21:12:30Z

The source image is great, but please also attach a problematic PDF to the bug.

jbreiden · 2019-05-23T21:27:11Z

Tried with stock Tesseract shipping with Debian, which is a pretty recent build. Did not reproduce the problem. Seeing all sorts of spaces in Chrome's PDF viewer during highlight and copy-paste operations.

http://metadata.ftp-master.debian.org/changelogs/main/t/tesseract/unstable_changelog

# tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE

There was a really recent change to spacing in PDF about 25 days ago which conceivably could mess things up. If older Tesseract works and newer Tesseract fails, that's the most obvious possible culprit. #1900

yregaieg · 2019-05-30T12:23:07Z

Getting back to this issue: I do see spaces now, at least in Adobe Acrobat Reader DC. I am not sure what I missed when I first raised this issue. In Alfresco and PDF.js, though, I was searching for a word and assumed I had the same issue as in the macOS Preview app, but it turns out I was stumbling on #238, which is even more annoying!
Thanks guys, and sorry for the false alarm.

yregaieg closed this as completed May 30, 2019

amitdo added the PDF label Mar 18, 2021
