Floating point exception on certain images when parameter tosp_old_to_method is set (some languages like Hindi set it by default) #3483

Closed
Daimona opened this issue Jul 1, 2021 · 10 comments

Daimona commented Jul 1, 2021

Environment

  • Tesseract Version:
tesseract 5.0.0-alpha-20210401-139-g38f0f
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
  • Commit Number: Built from source using git master as of today, so it should be 38f0fdc
  • Platform: Debian Buster. I'm hesitant to share the exact kernel version. I can ask if that'd be OK and share it later if it turns out to be important.

Current Behavior:

  • Use this image for testing
  • Run tesseract with -l hin (I'm using the tessdata_fast version of the language files)

Tesseract dies with a floating point exception (division by zero?).

$ curl -s "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9e/%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf/page26-1024px-%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf.jpg" | tesseract stdin stdout -l hin

Estimating resolution as 240
Floating point exception

Expected Behavior:

I'd expect tesseract to transcribe the image without errors, instead of dying with a SIGFPE.

Suggested Fix:

Not really a suggested fix, but gdb shows where the error is happening:

$ gdb tesseract
 [...]
(gdb) run testimg.jpg stdout -l hin
Starting program: /usr/local/bin/tesseract testimg.jpg stdout -l hin
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 240

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7e98196 in tesseract::Textord::make_a_word_break (this=this@entry=0x7ffff69c4c70, row=row@entry=0x555555b81000, blob_box=..., blob_box@entry=..., prev_gap=prev_gap@entry=32767, prev_blob_box=prev_blob_box@entry=...,
    real_current_gap=real_current_gap@entry=294, within_xht_current_gap=294, next_blob_box=..., next_gap=-60, blanks=@0x7fffffffcd11: 0 '\000', fuzzy_sp=@0x7fffffffcd0f: false, fuzzy_non=@0x7fffffffcd10: false,
    prev_gap_was_a_space=@0x7fffffffcd12: false, break_at_next_gap=@0x7fffffffcd13: false) at src/textord/tospace.cpp:1235
1235            blanks = static_cast<uint8_t>(current_gap / row->space_size);

Contributor

stweil commented Jul 1, 2021

Maybe row->space_size is 0.0.

Update 2021-07-02: It is 0.0.
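
For illustration, a minimal sketch of how such a division could be guarded. `blanks_for_gap` is a hypothetical helper, not a Tesseract function, and the fallback of one blank for a non-positive `space_size` is an assumption, not necessarily what the eventual fix does:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of Tesseract): number of blanks for a gap.
// Assumption: a non-positive or NaN space_size falls back to a single blank.
std::uint8_t blanks_for_gap(int current_gap, float space_size) {
  if (!(space_size > 0.0f)) {
    return 1;  // avoid dividing by zero; the negated test also catches NaN
  }
  float ratio = current_gap / space_size;
  // Clamp before the cast: converting an out-of-range float to uint8_t
  // is undefined behaviour.
  ratio = std::max(1.0f, std::min(ratio, 255.0f));
  return static_cast<std::uint8_t>(ratio);
}
```

With a gap of 294 as in the backtrace and a `space_size` of 0.0, this returns 1 instead of dividing by zero.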

Contributor

stweil commented Jul 2, 2021

The problem only occurs when the parameter tosp_old_to_method is set to true. These scripts and languages set it by default: ara and ben (both only in tessdata_best), hin, mar, san and Devanagari (in all three model variants). So the crash can occur with any of those languages.

It is possible to add -c tosp_old_to_method=0 when running Tesseract. That avoids the crash because the relevant code is no longer executed. I am not an expert in any of those languages, so I ask you (and @Shreeshrii):

  • Is that parameter needed at all, or could it always be forced to 0?
  • Why are there inconsistent settings of the parameter for ara and ben? If those languages work fine with tessdata and tessdata_fast, I see no reason why tessdata_best should require a different setting.

stweil changed the title from "Floating point exception on certain images when using Hindi as language" to "Floating point exception on certain images when parameter tosp_old_to_method is set (some languages like Hindi set it by default)" Jul 2, 2021
stweil added this to the 5.0.0 milestone Jul 2, 2021
stweil added the bug label Jul 2, 2021
Contributor

stweil commented Jul 2, 2021

@MerlijnWajer, as you process a huge number of images (as far as I know also with Indic languages), I'd expect that this issue will also affect you as soon as you use 5.0.0-alpha-20210401 (I enabled FP exceptions in commit 422452b).

Older code does not use FP exceptions, but silently shows undefined behaviour.
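
To illustrate the difference, here is a small sketch using only standard `<cfenv>`. By default a float division by zero just sets the `FE_DIVBYZERO` status flag and yields infinity; when traps are enabled (on glibc, for example, via the `feenableexcept` extension), the same division raises SIGFPE instead. `division_raises_divbyzero` is a hypothetical helper name, and the thread does not show how commit 422452b enables the exceptions:

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical helper: performs numerator / denominator at run time and
// reports whether the FE_DIVBYZERO status flag was raised by the division.
// Assumes FP traps are NOT enabled, so the division does not signal.
bool division_raises_divbyzero(float numerator, float denominator,
                               float *quotient) {
  // volatile keeps the compiler from folding the division at compile time.
  volatile float n = numerator;
  volatile float d = denominator;
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile float q = n / d;
  const bool raised = std::fetestexcept(FE_DIVBYZERO) != 0;
  *quotient = q;
  return raised;
}
```

Without traps the quotient is infinite, and converting it to `uint8_t` afterwards is undefined behaviour, which is why the older code misbehaves silently instead of crashing.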

stweil added a commit to stweil/tesseract that referenced this issue Jul 2, 2021
Signed-off-by: Stefan Weil <sw@weilnetz.de>

Contributor

stweil commented Jul 2, 2021

I have now fixed the crash in Git master, but would like to keep this issue open until it is clear how to proceed with tosp_old_to_method.

Author

Daimona commented Jul 2, 2021

Thanks for the fix, I confirm it's working now for my example image. However, I've found other images that, when transcribed with -l hin, make tesseract crash with an FPE. One example is this one. The gdb output is the same for all of them:

(gdb) run test2.jpg stdout -l hin
Starting program: /usr/local/bin/tesseract test2.jpg stdout -l hin
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 284
Detected 209 diacritics

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7e9d5b6 in tesseract::Textord::to_spacing (this=this@entry=0x7ffff69c4c70, page_tr=..., page_tr@entry=..., blocks=blocks@entry=0x7fffffffd0c8) at src/textord/tospace.cpp:84
84              row_spacing_stats(row, gapmap.get(), block_index, row_index, block_space_gap_width,

I've reported it here in case it's related to the original report, but I can open a different issue if it's not.

Contributor

stweil commented Jul 3, 2021

In that new case, the int16_t value block_non_space_gap_width is 0 and is used as a divisor. That code also depends on tosp_old_to_method.

stweil added a commit that referenced this issue Jul 3, 2021
Rewriting the code avoids FP operations (so makes it potentially faster)
and fixes the division by 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

Contributor

stweil commented Jul 3, 2021

I fixed that new division by 0 in commit 158c845.
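
The idea of avoiding the FP division can be sketched as follows. This is not the code from the commit; the function names and the threshold of 3 are assumptions for illustration, and the two versions only agree for non-negative widths:

```cpp
#include <cstdint>

// Floating point version: divides by non_space_width, which raises an
// FP exception when that width is 0 and FP traps are enabled.
bool ratio_below_fp(std::int16_t space_width, std::int16_t non_space_width) {
  return static_cast<float>(space_width) / non_space_width < 3.0f;
}

// Integer version: multiplies instead of dividing, so a zero width is
// harmless. The int16_t operands are promoted to int, so 3 * width
// cannot overflow.
bool ratio_below_int(std::int16_t space_width, std::int16_t non_space_width) {
  return space_width < 3 * non_space_width;
}
```

For a zero `non_space_width` the integer version simply returns false, with no division involved.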

@MerlijnWajer
Contributor

> @MerlijnWajer, as you process a huge amount of image (as far as I know also with Indic languages), I'd expect that this issue will also affect you as soon as you use 5.0.0-alpha-20210401 (I enabled FP exceptions in commit 422452b).
>
> Older code does not use FP exceptions, but silently shows undefined behaviour.

Thanks for the heads up. I'll pull the latest code, and if I run into problems like these I'll be sure to file an issue.

Collaborator

amitdo commented Oct 29, 2021

The two floating point exceptions were fixed by @stweil.

amitdo closed this as completed Oct 29, 2021

Collaborator

amitdo commented Oct 29, 2021

I opened a new related issue: #3614
