Arabic language (right to left in writing) stored (left to right) after create PDF Searchable #238

Open
tbadran opened this issue Feb 25, 2016 · 102 comments

Comments

@tbadran

tbadran commented Feb 25, 2016

I tested the latest release, 3.05, on Windows to OCR an Arabic document into a searchable PDF. When I select text in the output PDF, it appears to be stored in the opposite direction (left to right), whereas the letters should be stored right to left!

For example, the original Arabic text is:
مرحبا
It is stored in the PDF text layer as:
ابحرم

@tbadran tbadran changed the title Arabic language (right to left in writing) stored (left to write) after create PDF Searchable Arabic language (right to left in writing) stored (left to right) after create PDF Searchable Feb 25, 2016
@roozgar

roozgar commented Feb 25, 2016

Please attach your sample file and the command you used for the OCR job.

@tbadran
Author

tbadran commented Feb 25, 2016

This is the command:

tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf

Files are attached (source JPG and output PDF)

test_ara
test_ara.pdf

For example, the original word is:
أنحاء
The output inside the PDF is:
ءاحنا

@tbadran
Author

tbadran commented Feb 25, 2016

The command and sample files are now attached to the previous comment.

@amitdo
Collaborator

amitdo commented Feb 26, 2016

Which program are you using to view the PDF?

@amitdo
Collaborator

amitdo commented Feb 26, 2016

It does not look reversed with the Chrome PDF viewer, just not very accurate...

@roozgar

roozgar commented Feb 26, 2016

@amitdo
Is there any way to get better accuracy for Arabic until the switch to the new engine?
Right now Tesseract gives me close to 100% accuracy for English, but only about 30-40% for Arabic.
By comparison, I checked Google Drive OCR for Arabic and it gave close to 100% on the same image.

Can we work on the language data to get better results?

@tbadran
Author

tbadran commented Feb 26, 2016

I am using Adobe Reader.
Please note, though, that the words do not look reversed when viewing the PDF, because it contains the original image plus a text layer.
What I mean is that when you copy the text layer and paste it into any text editor, it comes out reversed. As a result, you can't search for text inside the PDF, because it is stored reversed in the text layer!

@tbadran
Author

tbadran commented Feb 26, 2016

This is a serious issue with the PDF output feature for Arabic and other languages that are written from right to left.

@amitdo
Collaborator

amitdo commented Feb 26, 2016

@roozgar

It seems that Ray is planning to release a new version of Tesseract soon, which will include a new LSTM-based OCR engine.

With LSTM, OCR for printed Arabic (not actual handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

@amitdo
Collaborator

amitdo commented Feb 26, 2016

> I checked google drive ocr for Arabic and i see it have 100 results for same image..

Neither you nor I know what software they use for OCR there...

@amitdo
Collaborator

amitdo commented Feb 26, 2016

@tbadran

> But please note that words are not reversed while viewing the PDF because it contains the original image with text layer.
> I mean when you copy text layer then paste it to any text editor it will be reversed, so now can't search for the text inside the PDF because it is stored reversed inside the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image:
test_ara

@jbreiden
Contributor

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

@jbreiden
Contributor

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

@jbreiden
Contributor

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

@amitdo
Collaborator

amitdo commented Feb 27, 2016

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the results will be for Arabic.

Like jbreiden said, the timeline could change...

@tbadran
Author

tbadran commented Feb 27, 2016

Please note that I am testing with the Windows binaries downloaded from:
http://domasofan.spdns.eu/tesseract/
I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

@tbadran
Author

tbadran commented Feb 27, 2016

I have tested several different sample files, not only the sample uploaded above, and I get the same issue in the output PDF every time on Windows 10 with Acrobat Pro 11.

@tfmorris
Contributor

On OS X, I'm seeing the opposite of earlier reports:

  • Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
  • Google Chrome Version 48.0.2564.116 (64-bit) appears backwards

@tfmorris
Contributor

Adobe Acrobat:

امهمه مني اهادم
ةييرعلا ةغللا
. هم دهج ةغل
ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

@amitdo
Collaborator

amitdo commented Feb 29, 2016

Tom,

Look at the original jpg.
Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome.
Clearly, that's the 'good' output...

@amitdo
Collaborator

amitdo commented Feb 29, 2016

Again, in Google Chromium.
If I mark the first two lines in the PDF + first word in line 3,
copy the (invisible) text, paste it to a text file,
mark the second to last word in line 3 in the PDF,
copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما
اللغة العريية
لغة مسنره هي انحاء العالم

@jbreiden
Contributor

jbreiden commented Mar 1, 2016

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.

@jbreiden
Contributor

jbreiden commented Mar 9, 2016

There are two things I can think of doing. One is to give up and write Arabic
backwards (which I really hate!). The other is to put an entry in the PDF
metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about
this, slowly.
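
As a rough illustration of the second option, the sketch below builds the text of a PDF catalog dictionary whose ViewerPreferences entry sets Direction to R2L. The helper name `catalog_with_direction` and the `2 0 R` reference to the page tree are assumptions made for illustration; this is not Tesseract's PDF renderer code. Note also that the PDF specification describes Direction as the predominant reading order of the document, so a viewer may well ignore it when extracting or searching text.

```python
def catalog_with_direction(pages_ref: str = "2 0 R", direction: str = "R2L") -> str:
    """Return the body of a PDF catalog dictionary with a reading-direction hint.

    pages_ref is a hypothetical indirect reference to the page tree.
    direction must be one of the two names the PDF spec allows: L2R or R2L.
    """
    if direction not in ("L2R", "R2L"):
        raise ValueError("direction must be 'L2R' or 'R2L'")
    return (
        "<<\n"
        "  /Type /Catalog\n"
        f"  /Pages {pages_ref}\n"
        f"  /ViewerPreferences << /Direction /{direction} >>\n"
        ">>"
    )


if __name__ == "__main__":
    print(catalog_with_direction())
```

Whether Acrobat changes its copy and search behaviour when this entry is present would still have to be tested against real files.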

@amitdo
Collaborator

amitdo commented Mar 9, 2016

@jbreiden
I didn't understand you. In one comment you talk about Hebrew, and in another you refer only to Arabic. Is Hebrew displayed correctly in Adobe Reader?

@amitdo
Collaborator

amitdo commented Mar 9, 2016

Please make sure that any change you make does not cause a regression in the Chrome PDF viewer or OS X Preview. Thanks for your work!

@jbreiden
Contributor

jbreiden commented Mar 9, 2016

@amitdo Hebrew has the exact same problem as Arabic.

@amitdo
Collaborator

amitdo commented Mar 10, 2016

Maybe explicitly using Unicode bidi control characters could help?
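
To make the suggestion concrete, here is a minimal sketch of what wrapping a string in Unicode bidi control characters looks like. The helper names `wrap_rtl` and `strip_bidi_controls` are hypothetical, and nothing here reflects how Tesseract writes its PDF text layer. Whether any PDF viewer honours such characters during copy or search is an open question in this thread.

```python
# Unicode bidirectional control characters (UAX #9).
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
POP = "\u202C"  # POP DIRECTIONAL FORMATTING
BIDI_CONTROLS = frozenset(
    "\u200E\u200F\u061C"            # LRM, RLM, ALM
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def wrap_rtl(text: str) -> str:
    """Mark text as a right-to-left embedding without changing its storage order."""
    return f"{RLE}{text}{POP}"


def strip_bidi_controls(text: str) -> str:
    """Remove bidi control characters, e.g. before comparing extracted text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


if __name__ == "__main__":
    word = "\u0645\u0631\u062D\u0628\u0627"  # the sample word from the first comment
    wrapped = wrap_rtl(word)
    print([f"U+{ord(ch):04X}" for ch in wrapped])
    print(strip_bidi_controls(wrapped) == word)
```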

@jbreiden
Contributor

That's another possibility, thanks for the suggestion.

@jbreiden
Contributor

jbreiden commented Sep 17, 2018 via email

@stweil
Contributor

stweil commented Sep 17, 2018

This sounds as if there will not be a fix in the near future. So we should not require that this bug must be fixed for 4.0.0.

@MalekBadi

Is this issue resolved?

@yregaieg

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:

* Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.

Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

@jbreiden I experimented with the attached files and noticed something that does make sense: the two files seem to have different text-orientation setups, which shows when I start selecting mid-sentence and drag over a few lines.
Tesseract:
image
ABBYY:
image

@yregaieg

yregaieg commented May 30, 2019

Not sure if this is of any value: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Page 598 (606 / 756) has a table describing the writing mode:
image
It seems possible to specify the writing direction as RTL by using this parameter on a structure element and all its child elements. @jbreiden, can you have a look at it?

Section 14.8.2.3.3 could be related to this issue as well. (I see that you looked at this section 3 years ago in #238 (comment), so it is probably not what is causing our issue.)

@ReactNativeFan

ReactNativeFan commented Jun 9, 2020

I would like to report that the problem still persists in Tesseract 4.1.1.
Tesseract recognizes and displays the Arabic text correctly. However, when the results are exported as PDF/A, the text stored in the PDF/A is reversed.

Yes, if you open the PDF in Acrobat, it gives you reversed words, while Google Chrome's PDF reader works fine. However, when I extracted the stored text from the PDF/A using pdfToText, the words were reversed too, which means the text was stored in the wrong order.

See the following example for more details:

ar

Here is the PDF/A generated by Tesseract
Recognized_PDFA_By_Tesseract.pdf

To summarize:

True Text
مرحبا بكم جميعا
اللغة العربية

Tesseract Text 100% correct
مرحبا بكم جميعا
اللغة العربية

Tesseract PDF/A Text
اعيمج مكب ابحرم
ةيبرعلا ةغللا

As you see in the Tesseract PDF/A text, every word is reversed although the .hOCR file is correct.

hOCR

Actually, the words are not reversed (you still can read every letter) but the "entire line is mirrored". Usually, we face this problem when rendering Arabic text in HTML by setting "text-align:right"

I think, the problem here is that the x-coord of each RTL letter is rendered by measuring x from left rather than right i.e., (x,y) should be (W-x, y) where W is the page width.
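
One way to check which description fits the sample above is to reverse each true line code point by code point and compare it with the text extracted from the PDF/A. The sketch below does that with the strings quoted in this comment; the function name `to_visual_order` is made up for illustration, and a plain reversal like this is only meaningful for lines that are purely right-to-left (no digits or Latin text mixed in).

```python
def to_visual_order(line: str) -> str:
    """Reverse a purely RTL line code point by code point.

    This turns logical (reading) order into the left-to-right visual order
    that a naive extractor would produce, and vice versa.
    """
    return line[::-1]


# Strings copied from the comment above.
true_lines = ["مرحبا بكم جميعا", "اللغة العربية"]
pdf_lines = ["اعيمج مكب ابحرم", "ةيبرعلا ةغللا"]

if __name__ == "__main__":
    for want, got in zip(true_lines, pdf_lines):
        # True means the extracted line is the whole true line reversed,
        # i.e. both the word order and the letters within each word are flipped.
        print(to_visual_order(want) == got)
```

If the comparison holds, the extracted text is the complete line in reverse, which is consistent both with "every word is reversed" and with "the entire line is mirrored". The same helper can serve as a stopgap for searching in an affected viewer: reverse the query before searching.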

@Mennaruuk

This issue also persists in Tesseract 5 alpha. Another issue is that when I double-click to select a word and then copy it, my computer copies the whole word correctly, but the blue selection box visibly does not cover the entire word, as the screenshot below shows. Both issues occur in Adobe Reader DC and Okular.
image

@diyajunaid

Could you please advise whether this issue has been resolved in a recent version of Tesseract?

@saleha-DS

I'm using Tess4J 4.5.1 and have the same issue. When I process an image, create a PDF, and open it in Acrobat Reader, it displays 100% correctly. However, search doesn't work unless I reverse the letters. When I copy the text and paste it into MS Word, it appears reversed.

@amitdo
Collaborator

amitdo commented Feb 21, 2022

Unfortunately, this long-standing issue has not been solved.

Copy and paste, or searching for words, in documents written in RTL scripts only works with Google Chrome's PDF reader. Even with this viewer there might be some issues.

@wollmers

> Unfortunately, this long-standing issue has not been solved.

> Copy and paste, or searching for words, in documents written in RTL scripts only works with Google Chrome's PDF reader. Even with this viewer there might be some issues.

RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.

In the above example it's stored the right way, but the GitHub editor cannot handle it correctly. Here is a screenshot of my command line instead:

Bildschirmfoto 2022-02-21 um 23 06 16

Also, on my Mac, Preview displays the PDF correctly, and it is searchable when I copy and paste text into the search field:

Bildschirmfoto 2022-02-21 um 23 00 02

Thus it's a problem of PDF viewers, not of Tesseract.
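
The distinction between storage order and display order can be inspected with Python's standard `unicodedata` module. This is only a small sketch: the helper names `bidi_classes` and `is_rtl_char` are invented here, and the snippet shows how Unicode classifies the characters, not how any PDF viewer actually behaves.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each code point in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def is_rtl_char(ch: str) -> bool:
    """True for strong right-to-left characters (Hebrew 'R', Arabic 'AL')."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


if __name__ == "__main__":
    word = "\u0645\u0631\u062D\u0628\u0627"  # the sample word from the first comment
    # In logical order the first stored code point is the first letter read
    # (U+0645, meem); the renderer places it at the right edge because its
    # bidi class is strongly right-to-left.
    for ch in word:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
    print(all(is_rtl_char(ch) for ch in word))
```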

@amitdo
Collaborator

amitdo commented Feb 23, 2022

> RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.

Which standards are you referring to here?

Here is how it works in html:
https://www.w3.org/International/questions/qa-visual-vs-logical.

Regarding the PDF standard:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Tesseract does not do what is described in:
14.8.2.3.3 Reverse-Order Show Strings

@wollmers

> > RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.

> Which standards are you referring to here?

Unicode
https://unicode.org/reports/tr9/#Introduction

> Here is how it works in html: https://www.w3.org/International/questions/qa-visual-vs-logical.

HTML just implements the Unicode standard. All the editors and command-line clients I know use logical storage order.

> Regarding the PDF standard:

> https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

> Tesseract does not do what is described in: 14.8.2.3.3 Reverse-Order Show Strings

That's an obscurity of the PDF specification. Indeed, I didn't find the string /ReversedChars in the above PDF sample. Does this mean Tesseract should implement this storage method?

At the very least, Acrobat Reader should get this right at its interfaces to the "normal" world, such as the search field. Other PDF viewers do.

@amitdo
Collaborator

amitdo commented Feb 23, 2022

It's not just Adobe; Firefox and Evince have the same issue too.

Years ago, there was an attempt to implement the string reversal described in the spec, to make Adobe and other viewers happy, but it didn't work so well.

@amitdo
Collaborator

amitdo commented Feb 23, 2022

There is a regression with Google's Chrome. It completely fails to render the Arabic pdf above. The same failure occurs with another pdf with Hebrew.

$ chromium --version
Chromium 98.0.4758.102 snap

@florisre

As of now (using tesseract 5.3.1), the issue still exists. Brave (Chromium-based) handles my Arabic-script PDF correctly, Firefox and Okular do not. Haven't tested anything else.

@IrtazaIjaz

This comment was marked as off-topic.
