Image too small to scale!! (2x48 vs min width of 3) - vertical lstm training #3001

Closed
davidb1 opened this issue Jun 2, 2020 · 12 comments · Fixed by #3223

Comments

@davidb1

davidb1 commented Jun 2, 2020

Environment

Tesseract Version: 4.00
Commit Number:
Platform: Ubuntu 18.04

Current Behavior:

I am generating vertical LSTM training files using tesstrain.sh, but when I try to train on them I get the following for all of the training data:

Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!

I couldn't find much on the problem except #590 and I couldn't find a solution there.

Where should I be looking in order to fix this?

Expected Behavior:

Suggested Fix:

@stweil
Contributor

stweil commented Jun 2, 2020

Usually you can simply ignore that message. It is typically caused by very narrow glyphs (like i, I, l or |) in your training images, or there are bad training images with a small width. Could you please check that? Tesseract expects that a character image should be at least 3 pixels wide. Maybe we should reduce that limit to 2 pixels if such images are reasonable.
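
To check which image sizes are actually being rejected, the messages can be tallied from a saved training log. The following is only a minimal sketch: the function name `summarize_scale_errors` and the idea of a saved log file are assumptions for illustration, not part of Tesseract.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Tesseract): tally the WxH sizes reported in
# "Image too small to scale!!" messages of a saved training log.
summarize_scale_errors() {
    local log_file="$1"
    # Keep only the message up to the WxH token, strip everything before "(",
    # then count how often each size occurs, most frequent first.
    grep -o 'Image too small to scale!! ([0-9]*x[0-9]*' "$log_file" \
        | sed 's/.*(//' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Called as `summarize_scale_errors train.log` (where `train.log` is an assumed file holding the captured lstmtraining output), it prints one line per distinct size with its count. A handful of narrow sizes points to individual narrow images; the same size on nearly every line suggests a systematic problem with how the training data was generated.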

@amitdo amitdo added the training label Jun 2, 2020
@amitdo
Collaborator

amitdo commented Jun 2, 2020

Duplicate of #2890.

@davidb1
Author

davidb1 commented Jun 2, 2020

@stweil thanks, I do have a lot of small glyphs, but I don't see any output other than these errors. Is the setting that controls the minimum glyph size part of the network specification?
Is the whole image ignored if one glyph is too small (since the output is "Image not trainable")?
@Shreeshrii @amitdo Any answer would be much appreciated; I can't find any information about this.

@YuvalFeldman

I'm having the same issue as @davidb1

@tesseract-ocr tesseract-ocr deleted a comment from niki13010 Jun 3, 2020
@filippesic

I get the same error, but when I'm actually doing OCR rather than training. With Tesseract v4.1.1-rc2-25-g9707 the message is "2x36 vs min width of 3 Line cannot be recognized!!", without the CTC error. There are a lot of these errors, not just a few. The images I'm running OCR on are upscaled from their native 231 DPI to 800 DPI. I'm going to try 600 DPI too, and maybe lower, but I've had the most success with 800 even though it may be too large: the text in the images is very small, it is laid out in a lot of separate columns, and it's a two-page scan.

@Shreeshrii Shreeshrii changed the title from "Image too small to scale!! (2x48 vs min width of 3)" to "Image too small to scale!! (2x48 vs min width of 3) - vertical lstm training" Oct 26, 2020
@Shreeshrii
Collaborator

Shreeshrii commented Jan 11, 2021

I investigated this a bit today, since there were reports of successful training of jpn_vert in the forum in 2018. So it seemed to me that maybe something was broken after that.

I downloaded the traineddata files from the project - https://github.com/zodiac3539/jpn_vert
Unpacking the files I noted the version to be tesseract-4.0.0-beta3.

So, I compared training for same set of inputs, for tesseract-4.0.0-beta3 and tesseract-4.1.0-rc1.

The errors do not appear with 4.0.0-beta3 but do appear with tesseract-4.1.0-rc1, so they were introduced somewhere between the two.

4.0.0-beta3 log

Version string:4.00.00alpha:jpn_vert:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=2586, offset=192
17:lstm:size=12936715, offset=2778
18:lstm-punc-dawg:size=2602, offset=12939493
19:lstm-word-dawg:size=1168650, offset=12942095
20:lstm-number-dawg:size=50, offset=14110745
21:lstm-unicharset:size=173324, offset=14110795
22:lstm-recoder:size=46601, offset=14284119
23:version:size=85, offset=14330720
Extracting tessdata components from /home/ubuntu/tessdata_best/jpn_vert.traineddata
Wrote train-tesseract-4.0/jpn_vert.config
Wrote train-tesseract-4.0/jpn_vert.lstm
Wrote train-tesseract-4.0/jpn_vert.lstm-punc-dawg
Wrote train-tesseract-4.0/jpn_vert.lstm-word-dawg
Wrote train-tesseract-4.0/jpn_vert.lstm-number-dawg
Wrote train-tesseract-4.0/jpn_vert.lstm-unicharset
Wrote train-tesseract-4.0/jpn_vert.lstm-recoder
Wrote train-tesseract-4.0/jpn_vert.version
Loaded file output/jpn_vert_new_checkpoint, unpacking...
Code range changed from 375 to 373!
Must supply the old traineddata for code conversion!
Loaded file train-tesseract-4.0/jpn_vert.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 415 to 373!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc373:373, 191349
Total weights = 1595413
Previous null char=414 mapped to 372
Continuing from train-tesseract-4.0/jpn_vert.lstm
Loaded 8674/8674 lines (1-8674) of document train-tesseract-4.0/jpn.07YasashisaAntique.exp0.lstmf
Iteration 0: GROUND  TRUTH : ようせい ときめき 没後 籠原 章子 終日 クイール 一雄 レビン ヘン ゴヤール 五霞
Iteration 0: ALIGNED TRUTH : ようせい ときめき 没後 籠原 章子 終日 クイール 一雄 レビン ヘン ゴヤール 五起
Iteration 0: BEST OCR TEXT : ようせい ときめき 没後 籠原 章子 終日 クイール 一雄 レビン ヘン ゴヤール 五起
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 3743 :
Mean rms=0.342%, delta=0.202%, train=1.37%(8.333%), skip ratio=0%
Iteration 1: GROUND  TRUTH : もの 甘糟^朗読一緒3禿該冶金 絞り込みはをバイトプレミアムの 遙
Iteration 1: BEST OCR TEXT : もの 上甘糟^朗読一緒3禾訪冶金 絞り込みはをバイトプレミアムの。頂
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 6946 :
Mean rms=0.77%, delta=3.517%, train=11.002%(41.667%), skip ratio=0%
Iteration 2: GROUND  TRUTH : 功利 記念祭 築上 想う あっきー 寂し 寒く 来る ブレット コミュニケーション 弾幕
Iteration 2: ALIGNED TRUTH : 功利 記念祭 築上 想う あっきー 寂し 寒く 来る ブレット コミュニケーション 弾莫
Iteration 2: BEST OCR TEXT : 切利 記念祭 築上 想う あっき1 寂し 寒く 来る ブレット コミュニケーション 弾幕
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 722 :
Mean rms=0.69%, delta=2.866%, train=9.587%(33.838%), skip ratio=0%
Iteration 3: GROUND  TRUTH : に対する紋章、から劇団昴孟チャート(蟹工船陣貫通存在探すは蘆名
Iteration 3: BEST OCR TEXT : に対する絞章、から和劇団尺生チャート(制工船陣綱通存在探すは董和
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 5206 :
Mean rms=0.92%, delta=5.213%, train=18.619%(50.379%), skip ratio=0%
Iteration 4: GROUND  TRUTH : 映え おっちゃん 集まり 失語 ムジュラ 采配 並べて 倍額 スクラム 囲む ウィリアム
Iteration 4: BEST OCR TEXT : 映え おっちゃん 和集まり 失語 ムジュラ 来配 並べて 倍額 スクラム 囲む ウィリアム
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 333 :
Mean rms=0.866%, delta=4.633%, train=17.319%(43.939%), skip ratio=0%
Iteration 5: GROUND  TRUTH : 水前寺 使え 田崎 リュート スレイヴ 付い 親孝行 サブプライ フィル リオナ さっさと
Iteration 5: BEST OCR TEXT : 水前寺 使え 田崎 リュート スレイヴ 付い 親孝行 サブツプライ フィル リオナ さっさざと
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 912 :
Mean rms=0.799%, delta=4.037%, train=14.938%(39.646%), skip ratio=0%
Iteration 6: GROUND  TRUTH : 太い エロゲー アンゴラ 紺屋 神無 大多喜 複利 長寧 シジュウカラ 細長い 詫間
Iteration 6: BEST OCR TEXT : 太い エロゲー アンゴラ 紺屋 神無 大多喜 複利 長密 シジュウカラ 細長い 辿間
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 243 :
Mean rms=0.784%, delta=3.725%, train=13.962%(36.58%), skip ratio=0%
Iteration 7: GROUND  TRUTH : かゆい 昌俊 初詣 し 博客 やっと 大関 ひっかかっ ニューサー 枯木 ピーズ 除却
Iteration 7: BEST OCR TEXT : かゆい 晶俊 初庶 し 靖客 やっと 大関 ひっかかっ ニューサー 枯木 ピーズ 除和
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 1465 :
Mean rms=0.786%, delta=3.556%, train=14.055%(36.174%), skip ratio=0%
Iteration 8: GROUND  TRUTH : 海外麒麟 髑 毒蛾お技祷返信に貴方し、 個数 ) 琢磨暇単元3空港礫で
Iteration 8: BEST OCR TEXT : 海外庶設 隅 毒鼎お技持返信に貴方し、 個数 ) 諒磨陣単元3空港槍で
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 1263 :
Mean rms=0.882%, delta=4.489%, train=17.377%(39.562%), skip ratio=0%
Iteration 9: GROUND  TRUTH : フォーム明確日帰り中料理叩かボタン新株高速雷 データベース金(通り
Iteration 9: BEST OCR TEXT : フォーム明確日帰り中料理釘かボタン新株高速才 データベース金(通り
File /tmp/tmp.MxtdQNVyAV/jpn/jpn.07YasashisaAntique.exp0.lstmf line 4140 :
Mean rms=0.864%, delta=4.283%, train=16.816%(40.606%), skip ratio=0%
Iteration 10: GROUND  TRUTH : 怨霊の2東芝 元棗瓜) パソコン 温泉 ホールディングス
Iteration 10: BEST OCR TEXT : 急霊の2東芝。 元楽払) パソコン 温泉 ホールディングス

4.1.0-rc1 log

Version string:4.00.00alpha:jpn_vert:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=2586, offset=192
17:lstm:size=12936715, offset=2778
18:lstm-punc-dawg:size=2602, offset=12939493
19:lstm-word-dawg:size=1168650, offset=12942095
20:lstm-number-dawg:size=50, offset=14110745
21:lstm-unicharset:size=173324, offset=14110795
22:lstm-recoder:size=46601, offset=14284119
23:version:size=85, offset=14330720
Extracting tessdata components from /home/ubuntu/tessdata_best/jpn_vert.traineddata
Wrote train-tesseract-4.1/jpn_vert.config
Wrote train-tesseract-4.1/jpn_vert.lstm
Wrote train-tesseract-4.1/jpn_vert.lstm-punc-dawg
Wrote train-tesseract-4.1/jpn_vert.lstm-word-dawg
Wrote train-tesseract-4.1/jpn_vert.lstm-number-dawg
Wrote train-tesseract-4.1/jpn_vert.lstm-unicharset
Wrote train-tesseract-4.1/jpn_vert.lstm-recoder
Wrote train-tesseract-4.1/jpn_vert.version
+ lstmtraining --continue_from train-tesseract-4.1/jpn_vert.lstm --old_traineddata /home/ubuntu/tessdata_best/jpn_vert.traineddata --traineddata train-tesseract-4.1/jpn/jpn.traineddata --train_listfile train-tesseract-4.1/jpn.training_files.txt --model_output output-tesseract-4.1/jpn_vert_new --debug_interval -1 --max_iterations 1000000
Loaded file train-tesseract-4.1/jpn_vert.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 415 to 373!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc373:373, 191349
Total weights = 1595413
Previous null char=414 mapped to 372
Continuing from train-tesseract-4.1/jpn_vert.lstm
Loaded 8674/8674 lines (1-8674) of document train-tesseract-4.1/jpn.07YasashisaAntique.exp0.lstmf
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)

To those who want to train vertical fonts, please try with tesseract-4.0.0-beta3.

It would be great if someone could investigate this further to figure out when the bug was introduced.

@stweil
Contributor

stweil commented Jan 11, 2021

The original report was for 4.0.0. So can we reduce the commit range which introduced the changed behaviour to somewhere between 4.0.0-beta3 and 4.0.0? Which data and commands did you use to get the log files shown above?

@Shreeshrii
Collaborator

Shreeshrii commented Jan 11, 2021

4.0.0-beta.4 is ok (Jul 30 2018)

Continuing from train-tesseract-4.0.0-beta.4/jpn_vert.lstm
Loaded 101/101 lines (1-101) of document train-tesseract-4.0.0-beta.4/jpn.07YasashisaAntique.exp0.lstmf
Iteration 0: GROUND  TRUTH : ダンディー 晃弘 ホロスコープ 婚姻 たち吉 生花 タコ焼き 陸前 食い倒れ 漢代
Iteration 0: BEST OCR TEXT : ダンディー 晃弘 ホロスコープ 婚如 たち吉 生花 タコ焼き 陸前 食い倒れ 漢代
File /tmp/tmp.HY0X9TtDNX/jpn/jpn.07YasashisaAntique.exp0.lstmf line 14 :
Mean rms=0.751%, delta=1.919%, train=4.478%(10%), skip ratio=0%
Iteration 1: GROUND  TRUTH : 暴力団 羽立 ピグメント ローレンス 荷主 組立 いぬ 映し出さ アトピタ 永世 モクモク
Iteration 1: BEST OCR TEXT : 暴力団 羽立 ピビグメント ローレンス 主 組立 いぬ 映し出さ アトピタ 永世 モクモク

4.0.0-rc1 fails (Oct 2, 2018)

Continuing from train-tesseract-4.0.0-rc1/jpn_vert.lstm
Loaded 101/101 lines (1-101) of document train-tesseract-4.0.0-rc1/jpn.07YasashisaAntique.exp0.lstmf
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!

2018-09 - commit # 554450c fails (Sep 3, 2018)

Continuing from train-tesseract-2018-09/jpn_vert.lstm
Loaded 101/101 lines (1-101) of document train-tesseract-2018-09/jpn.07YasashisaAntique.exp0.lstmf
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable

So it would be some commit from August 2018.
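
The version-by-version narrowing above could in principle be automated with `git bisect run`, which treats exit code 0 as "good" and exit code 1 as "bad". The sketch below only shows the log check; `log_is_good`, `train.log` and `build_and_train.sh` are hypothetical names, and the assumption that `4.0.0-beta.4` resolves as a tag in the repository has not been verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper for `git bisect run` (not part of Tesseract):
# return 1 ("bad") if the training log contains the error, 0 ("good") otherwise.
log_is_good() {
    local log_file="$1"
    if grep -q 'Image too small to scale!!' "$log_file"; then
        return 1
    fi
    return 0
}

# Possible session, left commented out because it needs a Tesseract checkout.
# build_and_train.sh is an assumed script that builds the current commit,
# regenerates the lstmf files, runs a short training into train.log and
# finishes with `log_is_good train.log`.
#   git bisect start
#   git bisect bad 554450c          # commit reported as failing in this thread
#   git bisect good 4.0.0-beta.4    # assumed tag name for the last good version
#   git bisect run ./build_and_train.sh
#   git bisect reset
```

Because the training data has to be regenerated at every step, each bisect iteration is slow; a small training text like the 101-line set used above keeps it manageable.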

@Shreeshrii
Collaborator

I am using the 07YasashisaAntique font, provided by a user, with tesstrain.sh rendering text in vertical mode. The font needs to be added to the vertical fonts in language_specific.sh.


FONTS_DIR=fonts
TESSDATA_DIR=/home/ubuntu/tessdata_best
LANG_DATA_DIR=langdata_lstm
START_MODEL=jpn_vert
LANG=jpn
OUTPUT_NAME=jpn_vert_new
TESS=tesseract-$1
TRAINDIR=train-$TESS
OUTPUTDIR=output-$TESS

# generates training data
$TESS/src/training/tesstrain.sh \
        --fonts_dir ${FONTS_DIR} \
        --tessdata_dir  ${TESSDATA_DIR} \
        --langdata_dir ${LANG_DATA_DIR} \
        --linedata_only \
        --lang ${LANG} \
        --noextract_font_properties \
        --output_dir $TRAINDIR \
        --exposures "0" \
        --fontlist "07YasashisaAntique" \
        --training_text ${LANG_DATA_DIR}/${LANG}/${LANG}.test.training_text

mkdir -p $OUTPUTDIR

# extracts lstm from old traineddata
combine_tessdata -u ${TESSDATA_DIR}/${START_MODEL}.traineddata $TRAINDIR/${START_MODEL}.

# finetunes model
lstmtraining \
    --continue_from $TRAINDIR/${START_MODEL}.lstm \
    --old_traineddata ${TESSDATA_DIR}/${START_MODEL}.traineddata \
    --traineddata $TRAINDIR/${LANG}/${LANG}.traineddata \
    --train_listfile $TRAINDIR/${LANG}.training_files.txt \
    --model_output $OUTPUTDIR/${OUTPUT_NAME} \
    --debug_interval -1 \
    --max_iterations 1000

# combines into a .traineddata
lstmtraining --stop_training \
    --continue_from $OUTPUTDIR/${OUTPUT_NAME}_checkpoint \
    --traineddata $TRAINDIR/${LANG}/${LANG}.traineddata \
    --model_output $OUTPUTDIR/${OUTPUT_NAME}.traineddata

@Shreeshrii
Collaborator

Please see https://groups.google.com/g/tesseract-ocr/c/QVTAfLGKiNI/m/QfQxv924CgAJ for the report in forum along with a zipped file of training fonts etc.

@Shreeshrii
Collaborator

@stweil This bug seems to have been introduced by my commits related to jav_java. I will revert the change, test and report back later today.

@Shreeshrii
Collaborator

The problem was caused by the addition of --psm 6 to
phase_E_extract_features " lstm.train " 8 "lstmf"
in tesstrain.sh.

This changed the size of the lstmf file (and what was in it).
Removing --psm 6 from the script fixes the issue.
I will make a PR for it.
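
To see whether a local copy of tesstrain.sh is affected, one could look for `--psm 6` on the feature-extraction call quoted above. This is a rough sketch: `has_psm6_in_lstm_train` is a hypothetical name, and it assumes the flag sits on the same line as `phase_E_extract_features` and `lstm.train`.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of Tesseract): succeed if the given script
# passes --psm 6 on the same line as the phase_E_extract_features call
# that uses lstm.train.
has_psm6_in_lstm_train() {
    local script="$1"
    grep 'phase_E_extract_features' "$script" \
        | grep 'lstm\.train' \
        | grep -q -e '--psm 6'
}
```

For example, `has_psm6_in_lstm_train tesstrain.sh && echo "affected"` prints a message only when the flag is still present. If the call is split across several lines in a given version of the script, this check will miss it.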
