Tesseract creating tiff image get core dumped and segmentation fault. #3909

Aliw7979 · 2022-08-24T06:55:40Z

I want to create tiff images from 55 fonts using this code:

rm -rf train/*
tesstrain.sh --fonts_dir font \
    --lang fas \
    --noextract_font_properties --linedata_only \
    --langdata_dir langdata_lstm \
    --tessdata_dir tesseract/tessdata \
    --save_box_tiff \
    --maxpages 500 \
    --fontlist \
        "IRAban, Regular" \
        "IRHoma, Regular" \
        "IRNarges, Regular" \
        "IRTerafik, Bold" \
        "IRAmir, Regular" \
        "IRJadid, Regular" \
        "IRNaskh, Regular" \
        "IRTerafik, Italic" \
        "IRArshia, Regular" \
        "IRKamran, Regular" \
        "IRNazanin, Bold" \
        "IRTerafik, Regular" \
        "IRBadr, Bold" \
        "IRKhorasan, Regular" \
        "IRNazanin, Italic" \
        "IRTitr, Regular" \
        "IRBadr, Italic" \
        "IRKoodak, Regular" \
        "IRNazanin, Regular" \
        "IRYakout, Bold" \
        "IRBadr, Regular" \
        "IRLotus, Bold" \
        "IRNazli, Bold" \
        "IRYakout, Italic" \
        "IRCompset, Bold" \
        "IRLotus, Italic" \
        "IRNazli, Regular" \
        "IRYakout, Regular" \
        "IRCompset, Italic" \
        "IRLotus, Regular" \
        "IRPooya, Regular" \
        "IRYekan, Bold" \
        "IRCompset, Regular" \
        "IRMaryam, Regular" \
        "IRRoya, Bold" \
        "IRYekan, Regular" \
        "IRDast Nevis, Regular" \
        "IRMashhad, Regular" \
        "IRRoya, Italic" \
        "IRZar, Bold" \
        "IRDavat, Regular" \
        "IRMehr, Regular" \
        "IRRoya, Regular" \
        "IRZar, Italic" \
        "IRElham, Regular" \
        "IRMitra, Bold" \
        "IRShiraz, Regular" \
        "IRZar, Regular" \
        "IREntezar, Regular" \
        "IRMitra, Italic" \
        "IRSina, Regular" \
        "IRZeytoon, Regular" \
        "IRFarnaz, Regular" \
        "IRMitra, Regular" \
        "IRTabassom, Regular" \
    --output_dir trainforB

But every time, the process crashes with a segmentation fault (core dumped), and the error message varies between runs, for example a double-linked list error or "double free or corruption (!prev)".

The machine has an Intel Xeon X5670 @ 2.93GHz with 24 cores and around 50 GB of RAM.
I checked CPU and RAM usage and everything looks OK, but some CPU cores reach 100% and then the crash happens.
I should mention that on Google Colab everything works, but it takes too long because of the limited resources.

Environment

Tesseract Version: 4.1.1
Platform: Ubuntu 20.04.4 LTS

Current Behavior:

The process crashes with a core dump while creating TIFF images for new fonts.

Expected Behavior:

TIFF image creation for the new fonts completes without crashing.

Suggested Fix: None


zdenop · 2022-08-24T07:04:49Z

Please use a recent version of Tesseract (5.2) and a recent version of the training tools. Old versions (including "tesstrain.sh") are not supported due to lack of resources.

Aliw7979 · 2022-08-24T12:39:28Z

I installed the latest version of Tesseract and the latest training tools with tesstrain.py, but I got the same error.

tesseract --version
tesseract 5.2.0-13-g74e22
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
and
text2image --version
Using CAIRO_FONT_TYPE_FT.
Pango version: 1.44.7
5.2.0-13-g74e22

zdenop · 2022-08-24T14:18:33Z

Please provide all the files needed to reproduce the problem, plus the whole log (not a screenshot) of the training process.

Aliw7979 · 2022-08-25T10:44:46Z

Since I don't have permission to share the data publicly, I created a private repository containing the files required to reproduce the problem and shared it with you.
(Note that I ran combine_tessdata -e fas.traineddata fas.lstm before starting the generation.)

Aliw7979 · 2022-08-29T06:01:46Z

Could you reproduce the problem?

zdenop · 2022-08-30T08:02:17Z

No. I simply do not have time.
Anyway, Tesseract is open source, so you should be able to replicate the problem with open-source or free-to-use data. That way more testers can join in to find the problem.

Aliw7979 · 2022-08-30T15:11:29Z

I found that generating training files for a single font does not produce any error, but with two or more fonts I get the core dumped error.
Error log: tesstrain.log
Fonts: fonts.zip

Aliw7979 · 2022-08-30T20:30:29Z

I found the solution :D
My training_text was too big, so I reduced it to around 200 KB.
But my question is: does this affect the accuracy?

zdenop · 2022-08-31T08:36:22Z

I am not sure this is the real problem. I see your training_text size is 12M, while the example training text is 26M...
So maybe there is a lack of resources. It would be great if you could narrow the problem down, e.g. check the line lengths or find the block of text that causes the crash (train with one font only to speed up the process).
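
A minimal sketch of the two checks suggested above, assuming a POSIX shell with awk, sort, and split available. The function names `longest_lines` and `split_chunks` are hypothetical helpers, not part of Tesseract or its training tools. The first lists the lines with the most words; the second cuts the text into fixed-size chunks so each chunk can be tried separately with a single font.

```shell
# Print "<word count> <line number>" for the TOP_N longest lines of FILE.
# Usage: longest_lines FILE [TOP_N]
longest_lines() {
    file=$1
    top=${2:-10}
    # NF is the number of whitespace-separated words, NR the line number.
    awk '{ print NF, NR }' "$file" | sort -k1,1nr -k2,2n | head -n "$top"
}

# Split FILE into chunks of LINES lines each, named PREFIXaaaa, PREFIXaaab, ...
# Usage: split_chunks FILE [LINES] [PREFIX]
split_chunks() {
    file=$1
    lines=${2:-1000}
    prefix=${3:-chunk_}
    split -l "$lines" -a 4 "$file" "$prefix"
}
```

For example, `longest_lines fas.training_text 20` would show the 20 longest lines (the file name here is an assumption). Running the training on each chunk in turn should narrow the crash down to one block of text, provided the crash depends on the content rather than on the total size.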

Aliw7979 · 2022-09-04T07:15:50Z

I'm working on it.
As you can see, the sentences in my file are too long; sometimes a newline only comes after 30 words of a sentence.
I'm looking for a script that inserts a newline after every 5 or 6 words. I know I could do this in Python, but a for loop is not practical for a 1 or 2 GB text file, so I'm looking for a better solution.
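
One possible approach, sketched here under the assumption that a POSIX awk is available: a streaming filter that re-wraps each input line to at most N words. It reads one line at a time, so memory use does not grow with file size. The function name `wrap_words` is hypothetical.

```shell
# Re-wrap stdin so that no output line has more than MAX words.
# Usage: wrap_words [MAX] < input > output
wrap_words() {
    max=${1:-10}
    awk -v max="$max" '
    {
        line = ""
        n = 0
        for (i = 1; i <= NF; i++) {
            # Append the next word, separated by a single space.
            if (n == 0) line = $i
            else        line = line " " $i
            n++
            # Flush the line once it holds MAX words.
            if (n == max) { print line; line = ""; n = 0 }
        }
        # Print the remaining words of this input line, if any.
        if (n > 0) print line
    }'
}
```

A call such as `wrap_words 6 < fas.training_text > fas.wrapped.training_text` would apply it to a whole file (both file names are assumptions). Note that this sketch collapses runs of spaces and tabs into a single space and drops empty lines, and that it never joins words from different input lines.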

amitdo · 2022-09-26T15:23:18Z

I think this issue is a duplicate of #2860.

> --maxpages 500

Try changing it to a much smaller value.

> As you can see, the sentences in my file are too long; sometimes a newline only comes after 30 words of a sentence.

Yes, the text line should not be too long. Try to limit it to 10-12 words.

zdenop · 2022-09-26T16:17:23Z

> 1 or 2 GB text

???
Is there any rationale for using such a big file when Google's guidance for training is a 26 MB text?

zdenop closed this as completed Aug 24, 2022

zdenop reopened this Aug 24, 2022

amitdo added training text2image labels Sep 23, 2022
