Tesseract creating tiff image get core dumped and segmentation fault. #3909

Open
Aliw7979 opened this issue Aug 24, 2022 · 12 comments

Comments

@Aliw7979

I want to create TIFF images from 55 fonts using this command:

rm -rf train/*
tesstrain.sh --fonts_dir font \
    --lang fas \
    --noextract_font_properties --linedata_only \
    --langdata_dir langdata_lstm \
    --tessdata_dir tesseract/tessdata \
    --save_box_tiff \
    --maxpages 500 \
    --fontlist \
        "IRAban, Regular" \
        "IRHoma, Regular" \
        "IRNarges, Regular" \
        "IRTerafik, Bold" \
        "IRAmir, Regular" \
        "IRJadid, Regular" \
        "IRNaskh, Regular" \
        "IRTerafik, Italic" \
        "IRArshia, Regular" \
        "IRKamran, Regular" \
        "IRNazanin, Bold" \
        "IRTerafik, Regular" \
        "IRBadr, Bold" \
        "IRKhorasan, Regular" \
        "IRNazanin, Italic" \
        "IRTitr, Regular" \
        "IRBadr, Italic" \
        "IRKoodak, Regular" \
        "IRNazanin, Regular" \
        "IRYakout, Bold" \
        "IRBadr, Regular" \
        "IRLotus, Bold" \
        "IRNazli, Bold" \
        "IRYakout, Italic" \
        "IRCompset, Bold" \
        "IRLotus, Italic" \
        "IRNazli, Regular" \
        "IRYakout, Regular" \
        "IRCompset, Italic" \
        "IRLotus, Regular" \
        "IRPooya, Regular" \
        "IRYekan, Bold" \
        "IRCompset, Regular" \
        "IRMaryam, Regular" \
        "IRRoya, Bold" \
        "IRYekan, Regular" \
        "IRDast Nevis, Regular" \
        "IRMashhad, Regular" \
        "IRRoya, Italic" \
        "IRZar, Bold" \
        "IRDavat, Regular" \
        "IRMehr, Regular" \
        "IRRoya, Regular" \
        "IRZar, Italic" \
        "IRElham, Regular" \
        "IRMitra, Bold" \
        "IRShiraz, Regular" \
        "IRZar, Regular" \
        "IREntezar, Regular" \
        "IRMitra, Italic" \
        "IRSina, Regular" \
        "IRZeytoon, Regular" \
        "IRFarnaz, Regular" \
        "IRMitra, Regular" \
        "IRTabassom, Regular" \
    --output_dir trainforB

but every time I get a segmentation fault (core dumped) during the process, with a seemingly random error message such as a double-linked list error or "double free or corruption (!prev)", for example:

[Screenshot: IMG_20220823_235202_669]

The machine has an Intel Xeon X5670 @ 2.93GHz with 24 cores and around 50 GB of RAM.
I checked CPU and RAM usage and everything looks OK, except that some CPU cores sit at 100% before the crash happens.
I should mention that on Google Colab everything works, but it takes too much time because of the limited resources.


Environment

  • Tesseract Version: 4.1.1
  • Platform: Ubuntu 20.04.4 LTS

Current Behavior:

Core dump while creating TIFF images for new fonts

Expected Behavior:

TIFF image creation for new fonts runs to completion

Suggested Fix: None

@zdenop
Contributor

zdenop commented Aug 24, 2022

Please use a recent version of Tesseract (5.2) and a recent version of the training tools. The old version (including "tesstrain.sh") is not supported due to lack of resources.

@zdenop zdenop closed this as completed Aug 24, 2022
@Aliw7979
Author

I installed the latest version of Tesseract and the latest version of the training tools (tesstrain.py), but got the same error:
[Screenshot from 2022-08-24 17-02-20]

tesseract --version
tesseract 5.2.0-13-g74e22
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
and
text2image --version
Using CAIRO_FONT_TYPE_FT.
Pango version: 1.44.7
5.2.0-13-g74e22

@zdenop
Contributor

zdenop commented Aug 24, 2022

Please provide all the files needed to reproduce the problem, plus the whole log (not a screenshot) of the training process.

@zdenop zdenop reopened this Aug 24, 2022
@Aliw7979
Author

Aliw7979 commented Aug 25, 2022

Since I don't have permission to share the data publicly, I created a private repository and shared it with you. It includes the files required to reproduce the problem.
(Note that I ran combine_tessdata -e fas.traineddata fas.lstm before starting the generation.)

@Aliw7979
Author

Were you able to reproduce the problem?

@zdenop
Contributor

zdenop commented Aug 30, 2022

No. I simply do not have time.
Anyway, Tesseract is open source, so you should be able to reproduce the problem with open-source or free-to-use data. That way more testers can join in to find the problem.

@Aliw7979
Author

Aliw7979 commented Aug 30, 2022

I found that generating training files for a single font gives no error, but with two or more fonts I get the core dumped error.
Error log:
tesstrain.log
fonts:
fonts.zip

@Aliw7979
Author

I found the solution :D
My training_text was too big; I reduced it to around 200 KB.
But my question is: does this affect the accuracy?
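
For anyone who wants to repeat the size reduction described above, here is a minimal sketch, not the exact method used in this thread. It assumes the training text is UTF-8 and simply keeps whole lines from the start of the file until a byte budget is reached. The function name take_lines_within and the choice to keep the head of the file (rather than a random sample) are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def take_lines_within(lines: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Yield whole lines from the start until the next one would exceed max_bytes.

    Sizes are measured in UTF-8 bytes, so a Persian letter counts as 2 bytes.
    Hypothetical helper: keeping the head of the file is an assumption,
    not something the thread prescribes.
    """
    used = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            break  # stop at a line boundary, never cut a line in half
        used += size
        yield line


# Usage sketch (the file names are placeholders):
# with open("big.training_text", encoding="utf-8") as src, \
#         open("small.training_text", "w", encoding="utf-8") as dst:
#     dst.writelines(take_lines_within(src, 200 * 1024))
```

Because the input is consumed line by line, the whole file never has to fit in memory.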

@zdenop
Contributor

zdenop commented Aug 31, 2022

I am not sure this is the real problem. I see your training_text size is 12M, while the example training text is 26M...
So maybe there is a lack of resources. It would be great if you could narrow the problem down, e.g. check the line length or find the block of text that causes the crash (train with one font only to speed up the process).
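
One way to follow the line-length suggestion is to list the longest lines in the training text. The sketch below is only an illustration: the helper name longest_lines is made up, and it counts whitespace-separated words, which is an assumption about what "line length" should mean here.

```python
import heapq
from typing import Iterable, List, Tuple


def longest_lines(lines: Iterable[str], top: int = 5) -> List[Tuple[int, int]]:
    """Return (word_count, line_number) pairs for the `top` longest lines.

    Lines are numbered from 1 and the result is sorted longest first.
    Only `top` entries are kept in memory, so large files are fine.
    """
    counted = (
        (len(line.split()), number)
        for number, line in enumerate(lines, start=1)
    )
    return heapq.nlargest(top, counted)


# Usage sketch (the file name is a placeholder):
# with open("fas.training_text", encoding="utf-8") as f:
#     for words, number in longest_lines(f, top=10):
#         print(f"line {number}: {words} words")
```

Lines that stand out in this list are natural first candidates when looking for the block of text that triggers the crash.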

@Aliw7979
Author

Aliw7979 commented Sep 4, 2022

I'm working on it.
As you can see, the sentences in my file are too long; sometimes a newline comes only after 30 words in a sentence.
I'm looking for a script that inserts a newline after every 5 or 6 words. I know I can do that in Python, but a plain for loop does not seem practical for a 1 or 2 GB text file, so I'm looking for a better solution.
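
A streaming approach may be enough here. The following is a minimal sketch under stated assumptions, not a tested recipe for Tesseract training: it reads one line at a time, so memory use does not grow with file size, and it breaks each line after a fixed number of whitespace-separated words. The names rewrap_words and rewrap_file and the default of 6 words are chosen for illustration only.

```python
from typing import Iterable, Iterator


def rewrap_words(lines: Iterable[str], max_words: int = 6) -> Iterator[str]:
    """Yield lines holding at most max_words words each.

    Input is processed one line at a time, and words are never moved
    across the original line breaks. Blank input lines are dropped.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    for line in lines:
        words = line.split()
        for start in range(0, len(words), max_words):
            yield " ".join(words[start:start + max_words])


def rewrap_file(src_path: str, dst_path: str, max_words: int = 6) -> None:
    """Stream src_path into dst_path with shorter lines (both UTF-8)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for short_line in rewrap_words(src, max_words):
            dst.write(short_line + "\n")


# Usage sketch (the file names are placeholders):
# rewrap_file("long.training_text", "short.training_text", max_words=6)
```

The loop is still a plain Python loop, so a multi-gigabyte file will take some time, but the work is a single pass over the file with constant memory.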

@amitdo
Collaborator

amitdo commented Sep 26, 2022

I think this issue is a duplicate of #2860.

> --maxpages 500

Try changing it to a much smaller value.

> As you can see, the sentences in my file are too long; sometimes a newline comes only after 30 words in a sentence.

Yes, a text line should not be too long. Try to limit it to 10-12 words.

@zdenop
Contributor

zdenop commented Sep 26, 2022

> 1 or 2 GB text

???
Is there any rationale for using such a big file when Google's guidance for training is 26 MB of text?
