Possible memory leak when building training files #1999

Closed
H-Bluhm opened this issue Oct 18, 2018 · 12 comments

@H-Bluhm

H-Bluhm commented Oct 18, 2018

Environment

  • Tesseract Version: 4.0-rc3, 4.0-rc2, 4.0-rc1
  • Platform: Linux 9000119697 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I cut the training text into smaller chunks (~20,000 lines each, ~1.3 MB) so that the training files can be built in parallel on smaller computers.
I tried building the frk training files using:

src/training/tesstrain.sh \
--lang frk \
--linedata_only \
--noextract_font_properties \
--fonts_dir ~/frk_fonts/ \
--langdata_dir ~/langdata_lstm/ \
--tessdata_dir ~/tessdata/ \
--output_dir ~/tessOutput/

Current Behavior:

Memory usage increases rapidly (total >15 GB when the process was killed by the kernel).

Expected Behavior:

With the same training text, memory usage increased at a much lower rate in versions up to and including 4.0-beta.4 (max usage <4 GB).

@eighttails
Contributor

I was just about to report the same issue and have finished bisecting.
It may be related to this commit:
345e5ee

@eighttails
Contributor

eighttails commented Oct 18, 2018

Well, what I mentioned above is a text2image issue.

The following command now consumes more than 1 GB of memory:
text2image.exe --fonts_dir /c/Windows/Fonts --font Meiryo --text langdata/jpn/jpn.training_text --max_pages 0 --outputbase test
It used to consume about 100 MB.
tesstrain.sh runs 8 tasks simultaneously, so if each task consumes 1 GB of RAM, it crashes a computer with 8 GB of RAM.
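
To make the arithmetic above concrete, here is a minimal sketch that estimates how many jobs fit into a given amount of RAM. The function name `max_parallel_jobs` and the 2 GB system reserve are assumptions made for illustration; they are not part of tesstrain.sh, and whether its parallelism can be limited this way is not covered here.

```shell
# Hypothetical helper, not part of tesstrain.sh: estimate how many jobs of a
# given memory footprint fit into RAM while keeping a reserve for the system.
# Arguments (all in MB): total RAM, per-job usage, reserve.
max_parallel_jobs() {
    total_mb="$1"
    per_job_mb="$2"
    reserve_mb="$3"
    n=$(( (total_mb - reserve_mb) / per_job_mb ))
    # Always allow at least one job, even on very small machines.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

max_parallel_jobs 8192 1024 2048   # about 1 GB per text2image job
max_parallel_jobs 8192 100 2048    # about 100 MB per job, as reported for earlier versions
```

With these assumed numbers, 8 simultaneous jobs at roughly 1 GB each leave no headroom on an 8 GB machine, while the earlier footprint of about 100 MB per job would have fit many times over.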

If you say lstmtraining leaks memory, it is a different problem.

@Shreeshrii
Collaborator

Shreeshrii commented Oct 18, 2018

Sorry about mixing up issues.

OP's issue with tesstrain.sh is related to text2image since lstmtraining is run separately after that. I am deleting my earlier comment since it is a different issue.

@eighttails
Contributor

tesstrain.sh continues even if text2image crashes.
After that, lstmtraining generates an error message because text2image died without generating box files.
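
One way to surface such a failure early, sketched here as an illustration only: wrap each step in a check that the command succeeded and that its expected output file exists. The helper `run_and_check` is hypothetical and is not how tesstrain.sh is actually written.

```shell
# Hypothetical helper: run a command, then verify that it succeeded and that
# the expected output file exists and is non-empty.
# Usage: run_and_check EXPECTED_FILE COMMAND [ARGS...]
run_and_check() {
    expected="$1"
    shift
    if ! "$@"; then
        echo "error: command failed: $*" >&2
        return 1
    fi
    if [ ! -s "$expected" ]; then
        echo "error: expected output '$expected' is missing or empty" >&2
        return 1
    fi
    return 0
}
```

A caller could then stop before the training step, for example `run_and_check test.box text2image ... --outputbase test || exit 1`, assuming the box file is named after the output base.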

@stweil stweil added the bug label Oct 18, 2018
@stweil
Contributor

stweil commented Oct 18, 2018

This looks like a bug, and a regression if beta.4 was still fine. So we have to decide whether fixing it is required for 4.0.0.

@amitdo
Collaborator

amitdo commented Oct 18, 2018

It depends on how quickly you can find the faulty commit...

@Shreeshrii
Collaborator

A number of people use training, especially for fine-tuning non-Latin languages, which were trained at Google with fewer fonts.

While there are many issues with training that are on hold right now, I think this regression should be fixed before 4.0.0.

@Shreeshrii
Collaborator

@eighttails has identified a commit in #1999 (comment)

@stweil stweil added this to the 4.0.0 milestone Oct 18, 2018
@amitdo
Collaborator

amitdo commented Oct 18, 2018

Another training issue is #1052.

Also see #1700 (comment)

@zdenop
Contributor

zdenop commented Oct 18, 2018

@eighttails: you are right. The problem is that a PangoFontMap obtained with pango_cairo_font_map_get_default() must not be freed, but one created with pango_cairo_font_map_new_for_font_type() should be freed...

@zdenop zdenop closed this as completed in d1d73b9 Oct 18, 2018
@zdenop
Contributor

zdenop commented Oct 18, 2018

Please check

@mgeerdsen
Contributor

Memory consumption is looking good again (text2image 4.0.0-rc3-41-g0a42c0).

However, I do not get any text rendered into the images (just blank pages), whereas this worked in the same setup with 4.0.0-beta.4. I will look into this and open a separate issue for it.
