Tesseract creates PDF with no spaces for Arabic #2446
Comments
Please use the latest code when reporting an issue. |
@zdenop Please consider releasing 4.1 RC2.
On Mon, 20 May 2019, 14:25, zdenop wrote:
Please use the latest code when reporting issue.
|
Also test with other PDF viewers. The default PDF viewer in macOS is not so good at displaying Tesseract's output. The built-in PDF viewer in Google's Chrome browser is recommended. |
I tried both, actually, along with the PDF.js and Tika parsers inside Alfresco Content Services.
On 20 May 2019, at 14:05, Amit D. wrote:
Also test with other pdf viewers. The default pdf viewer in macOS is not so good at displaying Tesseract's output. The built-in pdf viewer in Google's Chrome browser is recommended.
|
Just an update: following the recommendation from @zdenop, I built Tesseract from source (5.0 alpha) and retried. I can confirm that I still get the same results as with 4.0.0. |
I'll take a look but no promises.
… |
That would be highly appreciated. I have been working on this pro bono project for a couple of months, and it would be a shame to abandon it at this stage because of this bug. |
The source image is great, but please also attach a problematic PDF to the bug.
… |
Tried with stock Tesseract shipping with Debian, which is a pretty recent build.
Did not reproduce the problem. Seeing all sorts of spaces in Chrome's PDF viewer during highlight and copy-paste operations.
http://metadata.ftp-master.debian.org/changelogs/main/t/tesseract/unstable_changelog
# tesseract -v
tesseract 4.0.0
leptonica-1.76.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE
There was a really recent change to spacing in PDF about 25 days ago which conceivably could mess things up. If older Tesseract works and newer Tesseract fails, that's the most obvious possible culprit.
#1900
|
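Since different viewers disagree here, one way to take the viewer out of the picture is to compare word gaps in the plain TXT output against whatever text was extracted or copied from the PDF. The following is only a rough sketch: `count_word_gaps` and `spacing_lost` are hypothetical helper names, the 0.5 tolerance is an arbitrary assumption, and the sample strings are made up for illustration rather than taken from the reported document.

```python
def count_word_gaps(text: str) -> int:
    """Count the gaps between whitespace-separated words in text."""
    # str.split() with no argument splits on any run of whitespace,
    # so N words means N - 1 gaps (and 0 gaps for empty text).
    return max(len(text.split()) - 1, 0)


def spacing_lost(txt_output: str, pdf_text: str, tolerance: float = 0.5) -> bool:
    """Return True if pdf_text has far fewer word gaps than txt_output.

    tolerance is an assumed threshold: the PDF text is flagged when it
    keeps less than this fraction of the gaps seen in the TXT output.
    """
    expected = count_word_gaps(txt_output)
    if expected == 0:
        return False  # nothing to compare against
    return count_word_gaps(pdf_text) < expected * tolerance


if __name__ == "__main__":
    # Made-up sample strings, not taken from the reported document.
    from_txt = "مرحبا بكم في العالم"
    from_pdf = "مرحبابكمفيالعالم"
    print("spacing lost:", spacing_lost(from_txt, from_pdf))
```

The text for `pdf_text` can come from any extractor or from a copy-paste out of a viewer. If two extractors give different answers for the same PDF, that points at the viewer or parser rather than at Tesseract's output.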
Getting back to this issue: I do indeed see spaces now, at least in Adobe Acrobat Reader DC. I am not sure what I missed when I first raised this issue. In Alfresco and PDF.js, though, I was searching for a word and assumed I had the same issue as in the macOS Preview app, but it turns out I was stumbling on #238, which is even more annoying! |
I am trying Tesseract with an Arabic document, and I noticed that text recognition works extremely well (I am actually quite surprised by its accuracy).
However, when I try to generate a PDF with a text overlay on top of the image using:
tesseract -l ara test-ocr.jpg result pdf
the generated document doesn't contain any spaces.
Here is a snippet of the extracted TXT using
tesseract -l ara test-ocr.jpg result
:
Here is the same snippet, but copied from the PDF generated with this command
tesseract -l ara test-ocr.jpg result pdf
:
Sample image to test this out: https://imgur.com/a/TsfudZ6
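For reference, the two invocations above can be scripted so that both outputs come from the same image in one go. This is a minimal sketch, assuming `tesseract` is on the PATH; `tesseract_commands` and `run_both` are hypothetical helper names, and only the arguments already shown in this report are used.

```python
import shutil
import subprocess


def tesseract_commands(image: str, output_base: str, lang: str = "ara") -> dict:
    """Build the argument lists for the TXT run and the PDF run."""
    base = ["tesseract", "-l", lang, image, output_base]
    # The plain run writes <output_base>.txt; appending "pdf" selects
    # the PDF output, which writes <output_base>.pdf.
    return {"txt": list(base), "pdf": base + ["pdf"]}


def run_both(image: str, output_base: str, lang: str = "ara") -> None:
    """Run both commands, stopping on the first failure."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    for argv in tesseract_commands(image, output_base, lang).values():
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # File names as used in this report.
    run_both("test-ocr.jpg", "result")
```

After this runs, `result.txt` and `result.pdf` sit side by side, which makes it easier to compare the text layer of the PDF against the plain text output.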
Environment