Tesseract segmentation fault when using Arabic and English #1275

Open
Freedomafia opened this Issue Jan 15, 2018 · 18 comments

Freedomafia commented Jan 15, 2018

OCR works when I run Tesseract with Arabic only or with English only. However, when I use both languages together on the scanned page, I get a segmentation fault. Here is a link to the image, a scanned page of an Arabic-English dictionary: https://imgur.com/a/K8bqz

The command I ran was:

tesseract Arabic_to_English.png -l eng+ara output

The terminal reported a segmentation fault. The full output was:

Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Detected 24 diacritics
no best words!!
no best words!!
no best words!!
no best words!!
Segmentation fault (core dumped)

Is Tesseract able to work with English and Arabic simultaneously?


Environment

  • Tesseract Version:
    tesseract 3.04.01
    leptonica-1.73
    libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Shreeshrii commented Jan 15, 2018

Please try

tesseract Arabic_to_English.png -l Arabic output

and report the result.

Shreeshrii commented Jan 15, 2018

> I have attached an image of a scanned page of the Arabic-English dictionary.

Image is not attached.

amitdo commented Jan 15, 2018

> Tesseract Version: latest

From the master branch?

From which repo did you download the traineddata?

Freedomafia commented Jan 15, 2018

@Shreeshrii
(1) I have edited my original post to include a link to the image.
(2) I tried the command you suggested: tesseract Arabic_to_English.png -l Arabic output. However, it returned this error message:

Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Error opening data file /usr/share/tesseract-ocr/tessdata/Arabic.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'Arabic'
Tesseract couldn't load any languages!
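
The error above says that Tesseract looked for Arabic.traineddata in the tessdata directory and could not open it. As a rough sketch (not part of Tesseract; the helper name `missing_languages` is hypothetical), a `-l` value can be checked against a tessdata directory before running, assuming each language is stored as `<lang>.traineddata` directly inside that directory, as the path in the error message suggests:

```python
import os
import tempfile


def missing_languages(lang_spec, tessdata_dir):
    """Return the languages in a '-l' style spec (e.g. 'eng+ara') that have
    no '<lang>.traineddata' file directly inside tessdata_dir.

    Hypothetical helper for illustration; it only looks at file names.
    """
    langs = [name for name in lang_spec.split("+") if name]
    return [
        name
        for name in langs
        if not os.path.isfile(os.path.join(tessdata_dir, name + ".traineddata"))
    ]


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for the real tessdata dir.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("eng", "ara"):
            open(os.path.join(demo_dir, name + ".traineddata"), "wb").close()
        print(missing_languages("eng+ara", demo_dir))
        print(missing_languages("Arabic", demo_dir))
```

On a real system the second argument would be the directory named in the error message, here /usr/share/tesseract-ocr/tessdata.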

Freedomafia commented Jan 15, 2018

@amitdo I have updated my OP. I did not download any test data

amitdo commented Jan 15, 2018

Duplicates #235

Freedomafia commented Jan 15, 2018

@amitdo Thank you for the link. Is there any way I can get an alert if and when tesseract will work with ara+eng in the future?

stweil commented Jan 15, 2018

That combination should already work with the latest experimental Tesseract 4 (which also supports an alternative combination Arabic+Latin), but I have no personal experience with ara.traineddata or Arabic.traineddata.

amitdo commented Jan 15, 2018

> I have updated my OP. I did not download any test data

An interesting discussion:
https://english.stackexchange.com/questions/424366/does-op-mean-original-poster-or-original-post

amitdo commented Jan 15, 2018

Repeating myself, just to try this GitHub feature:
githubteacher/github-for-developers-sept-2015#705 (comment)

Duplicate of #235

Shreeshrii commented Jan 19, 2018

@Freedomafia If you use Tesseract 4 (LSTM engine) with the Arabic traineddata (which already includes both ara and eng), it works fine. See the attached output.

Arabic_English-png-Arabic-tessdata_best3.txt

Freedomafia commented Jan 20, 2018

Hi @Shreeshrii, I did manage to use it with Tesseract v4 two days ago, but I encountered some issues, which I will post as a comment on this thread later today. Many thanks.

Aspiringarabist commented Jan 20, 2018

Hi @Shreeshrii @amitdo @stweil . Thank you for your help thus far. I have two questions:

(1) The text files produced with Arabic and with Arabic+English are not great. With English only, Tesseract gets all the English words right, but it guesses English characters wherever it meets Arabic letters. I am thinking of replacing the Arabic letters with blank spaces, relying on what I believe to be the case: the English-only LSTM model will report a low confidence score when it comes across Arabic letters. Is there any way to extract confidence scores per letter/character?

(2) I could not run all of the Tesseract 4 engine modes.

I ran the following lines:

tesseract --oem 3 Arabic_to_English.png -l ara+eng outputs/outputoem3_AE
tesseract --oem 2 Arabic_to_English.png -l ara+eng outputs/outputoem2_AE
tesseract --oem 1 Arabic_to_English.png -l ara+eng outputs/outputoem1_AE
tesseract --oem 0 Arabic_to_English.png -l ara+eng outputs/outputoem0_AE
tesseract --oem 3 Arabic_to_English.png -l eng+ara outputs/outputoem3_EA
tesseract --oem 2 Arabic_to_English.png -l eng+ara outputs/outputoem2_EA
tesseract --oem 1 Arabic_to_English.png -l eng+ara outputs/outputoem1_EA

--oem 1 and --oem 3 produced results (see attached). However, every run with --oem 0 or --oem 2 failed to produce any OCR text and returned the same error message:

utputs/outputoem0_AE
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file adaptmatch.cpp, line 537
Segmentation fault (core dumped)

Many thanks :)

outputoem1_AE.txt
outputoem1_EA.txt
outputoem3_AE.txt
outputoem3_EA.txt
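
The masking idea in question (1) can be sketched as follows, assuming per-character confidences are already available from some source. The function name `mask_low_confidence`, the 0-100 confidence scale, and the threshold are all illustrative assumptions, and whether an English-only model really reports low confidence on Arabic letters is the hypothesis being tested, not an established fact:

```python
def mask_low_confidence(chars, confidences, threshold, filler=" "):
    """Replace every character whose confidence is below threshold with
    filler, keeping the others unchanged.

    chars and confidences are parallel sequences; higher means more confident.
    """
    if len(chars) != len(confidences):
        raise ValueError("chars and confidences must have the same length")
    return "".join(
        ch if conf >= threshold else filler
        for ch, conf in zip(chars, confidences)
    )


if __name__ == "__main__":
    # Made-up values: three confident Latin letters, then two doubtful guesses.
    demo_chars = ["c", "a", "t", "x", "q"]
    demo_conf = [95.0, 91.0, 88.0, 12.0, 20.0]
    print(repr(mask_low_confidence(demo_chars, demo_conf, threshold=60.0)))
```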

Shreeshrii commented Jan 20, 2018

  1. tessdata_best and tessdata_fast do NOT include the legacy Tesseract model, hence --oem 0 (legacy Tesseract) and --oem 2 (Tesseract+LSTM) won't work.

     --oem 1 is LSTM only, and --oem 3 is the default, which should fall back to --oem 1 here. So the results of the two should be the same.

  2. Please also test with the Arabic traineddata, which is different from ara. It covers both Arabic and English.
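
The engine-mode rule described above can be written down as a small sketch. The function names and the two boolean flags are hypothetical, and the behaviour of --oem 3 with legacy-only data is an assumption that goes beyond what this thread shows:

```python
def usable_oem_modes(has_legacy, has_lstm):
    """Return the --oem values expected to work for a traineddata file,
    given which model types it contains (illustrative rule only)."""
    modes = []
    if has_legacy:
        modes.append(0)  # legacy Tesseract engine only
    if has_lstm:
        modes.append(1)  # LSTM engine only
    if has_legacy and has_lstm:
        modes.append(2)  # legacy + LSTM combined
    if has_legacy or has_lstm:
        modes.append(3)  # default: use whatever is available
    return modes


def build_command(image, langs, output_base, oem):
    """Build an argument list in the same shape as the commands in this thread."""
    return ["tesseract", "--oem", str(oem), image, "-l", langs, output_base]


if __name__ == "__main__":
    # tessdata_best / tessdata_fast: LSTM model present, legacy model absent.
    for mode in usable_oem_modes(has_legacy=False, has_lstm=True):
        print(" ".join(build_command("Arabic_to_English.png", "ara+eng",
                                     "outputs/outputoem%d_AE" % mode, mode)))
```

With LSTM-only data this yields modes 1 and 3, which matches the runs that produced output.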

Shreeshrii commented Jan 21, 2018

> Is there anywhere to extract confidence scores per letter/character?

See the result iterator example:
https://github.com/tesseract-ocr/tesseract/wiki/APIExample#result-iterator-example

See also #681.

There is also a debug config variable you can set to see such details, as shown in
#681 (comment)

Freedomafia commented Jan 21, 2018

You’re a star @Shreeshrii . I will test this out and report back here.

Many thanks.

Shreeshrii commented Jan 25, 2018

To enable the debug info related to this, update the config file called logfile to the following, and then pass 'logfile' as the last argument of your command.

logfile config

debug_file tesseract.log
multilang_debug_level 3
stopper_debug_level 3

command

time tesseract --tessdata-dir /tesseract_ocr/tessdata_fast/   "${img_file}" "${img_file%.*}-Arabic-tessdata_fast-debug"  --oem 1  -l Arabic+ara --psm 6 logfile

The tesseract.log generated by the command above will look along the following lines.

Processing word with lang Arabic at:Bounding box=(93,1820)->(126,1851)
Trying word using lang Arabic, oem 1
Best choice: accepted=0, adaptable=0, done=1 : Lang result : ینہ : R=2.97306, C=-8.12302, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM
str	ی	ن	ہ
state:	1 	1 	1 
C	-0.195	-0.324	-1.160
1 new words better than 0 old words: r: 2.97306 v 0 c: -8.12302 v 0 valid dict: 1 v 0
Trying word using lang ara, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : ىنب : R=3.02201, C=-2.14964, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM
str	ى	ن	ب
state:	1 	1 	1 
C	-0.208	-0.307	-0.297
1 new words worse than 1 old words: r: 3.02201 v 2.97306 c: -2.14964 v -8.12302 valid dict: 0 v 1
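
For anyone who wants the per-character values out of such a log, here is a minimal parsing sketch. It assumes the layout shown above, where a `str` row of characters is followed by a `C` row with one value per character; the function name is hypothetical, and how to interpret or threshold the C values is left open:

```python
def per_character_confidences(log_text):
    """Pair each 'str' row with the next 'C' row in a tesseract.log dump.

    Returns one list of (character, value) tuples per word. Rows whose
    lengths do not match, or whose values are not numbers, are skipped.
    Fields are split on whitespace, so a space character inside a 'str'
    row would not survive.
    """
    words = []
    pending_chars = None
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "str":
            pending_chars = fields[1:]
        elif fields[0] == "C" and pending_chars is not None:
            try:
                values = [float(field) for field in fields[1:]]
            except ValueError:
                values = []
            if values and len(values) == len(pending_chars):
                words.append(list(zip(pending_chars, values)))
            pending_chars = None
    return words


if __name__ == "__main__":
    # Rows copied from the second word in the log excerpt above.
    sample = (
        "pos\tNORM\tNORM\tNORM\n"
        "str\t\u0649\t\u0646\t\u0628\n"
        "state:\t1 \t1 \t1 \n"
        "C\t-0.208\t-0.307\t-0.297\n"
    )
    for word in per_character_confidences(sample):
        print(word)
```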

Freedomafia commented Jan 25, 2018

Thank you very much @Shreeshrii . This is very detailed and beneficial
