
[C-API] TessBaseAPIRecognize() segfaults when no language data files are available #779

Closed
jflesch opened this issue Mar 22, 2017 · 10 comments


jflesch commented Mar 22, 2017

Hello,

When working on openpaperwork/pyocr#51, someone reported the following issue to me:

When no language data file is available at all (not even English), TessBaseAPIRecognize() segfaults.
This also happens when TESSDATA_PREFIX is set to an invalid directory.

Tested with libtesseract 3.04.01 (Debian Sid).

Example code:

#include <stdio.h>
#include <stdlib.h>

#include <leptonica/allheaders.h>

#include <tesseract/capi.h>

/*
 * gcc -Wall -Werror -g -ggdb -ltesseract -llept test.c -o test.out
 * ./test.out ~/git/pyocr/tests/input/real/basic_doc.jpg
 */

int main(int argc, char **argv)
{
	TessBaseAPI *handle;
	PIX *image;

	setenv("TESSDATA_PREFIX", "/opt/tulipe", 1 /* overwrite */);

	handle = TessBaseAPICreate();
	TessBaseAPIInit3(handle, NULL, "fra");
	TessBaseAPISetVariable(handle, "tessedit_zero_rejection", "F");
	image = pixRead(argv[1]);
	TessBaseAPISetImage2(handle, image);

	/* Segfaults here when no language data could be loaded. */
	TessBaseAPIRecognize(handle, NULL);

	TessBaseAPIDelete(handle);

	return 0;
}

Stacktrace:

(gdb) bt
#0  0x00007ffff76f2597 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/x86_64-linux-gnu/libtesseract.so.3
#1  0x00007ffff76dcb7f in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.3
#2  0x0000555555554aa2 in main (argc=2, argv=0x7fffffffe3d8) at test.c:27

PS: This is a very low-priority issue for me. I can work around it really easily on the Python side.


zdenop commented Sep 28, 2018

@jflesch : The problem is that you are not using the C API as designed ;-) : you forgot to check the return value. So instead of

TessBaseAPIInit3(handle, NULL, "fra");

you should use something like this:

if (TessBaseAPIInit3(handle, NULL, "eng") != 0)
	die("Error initializing tesseract\n");

zdenop closed this as completed Sep 28, 2018

jflesch commented Sep 29, 2018

Oops :-/
Thanks for the reply, and sorry for wasting your time. I'll make sure I haven't missed any other return values in Pyocr.


zdenop commented Sep 29, 2018

No problem. While testing your issue (thanks for the test case) I found and fixed another problem ;-)

BTW: maybe it would be good if you joined efforts with @jbarlow83: he is wrapping Leptonica in Python in his project OCRmyPDF. Leptonica provides several great functions for OCR preprocessing, like deskew, dewarp and background removal.
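For context, those operations are single calls in Leptonica's C API. A rough sketch (pixDeskew() and pixBackgroundNormSimple() are Leptonica functions; the parameter choices below are only illustrative, not tuned recommendations):

#include <leptonica/allheaders.h>

/* Rough preprocessing sketch: deskew the page, then normalize the
 * background to white. Parameter values are illustrative only. */
PIX *preprocess(PIX *pixs)
{
	PIX *deskewed, *cleaned;

	/* Find the text skew angle and rotate it away;
	 * 0 lets Leptonica pick its default reduction for the search. */
	deskewed = pixDeskew(pixs, 0);
	if (deskewed == NULL)
		return NULL;

	/* Clean up uneven illumination / gray backgrounds
	 * (expects 8 bpp grayscale or 32 bpp color input). */
	cleaned = pixBackgroundNormSimple(deskewed, NULL, NULL);
	pixDestroy(&deskewed);
	return cleaned;  /* may be NULL on failure */
}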


jflesch commented Sep 30, 2018

Interesting. I did write my own image manipulation library. It contains reimplementations of unpaper's algorithms. The idea at the time was mainly to reduce dependency on stuff like OpenCV as much as possible because they made installing my project a lot more complicated (it was before Flatpak). Unpaper's unskewing is the only algo I haven't re-implemented yet. I'll have a look later at what leptonica and OCRmyPDF provide exactly.


zdenop commented Oct 2, 2018

I spoke with a developer (not a Python developer) and he was quite surprised that Tesseract users are using/looking for other libraries for OCR image preprocessing when everything they really need is in Leptonica ;-)
Maybe an easy-to-use Python wrapper could help promote it.
@jbarlow83 does a good job in this area, but having Leptonica as a separate Python module could help it gain more attention...


jflesch commented Oct 2, 2018

Actually, to be honest, in Paperwork, I don't really worry much about accuracy. OCR is used for indexing documents only. Fuzzy-searching takes care of non-entirely-accurate OCR results. So at the moment I don't even bother pre-processing images at all before passing them to Tesseract.
The image manipulations I do are to fix user perception of the images or to try to simplify their content before exporting to make the output files smaller (Unpaper's algorithms / SWT).

But I'm wondering: is there an official recommendation for a set of pre-processing algorithms to apply before passing an image to Tesseract? Are there algorithms that would make images match more closely the data that was used for training Tesseract? (grayscale? pure b&w? unskewing? ...)


amitdo commented Oct 2, 2018

It depends on which OCR engine in 4.0.0 you want to use (LSTM / legacy).


zdenop commented Oct 2, 2018

Some recommendations are on ImproveQuality - I'm not sure how many of them are a must for 4.0.

Deskew, dewarp, noise removal and cropping should also be generally applicable to your project.
Passing only the text regions (or removing graphics, including lines), and maybe trying another binarization method (Tesseract uses Otsu), could improve the results too.
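If you want to experiment with the binarization step yourself before handing the image to Tesseract, Leptonica exposes a tiled Otsu threshold directly. A rough sketch (pixConvertTo8() and pixOtsuAdaptiveThreshold() are Leptonica functions; the tile size and score fraction below are illustrative guesses, not recommendations):

#include <leptonica/allheaders.h>

/* Binarize a page with Leptonica's tiled Otsu thresholding; the 1 bpp
 * result can then be passed to TessBaseAPISetImage2(). */
PIX *binarize(PIX *pixs)
{
	PIX *gray, *binary = NULL;

	gray = pixConvertTo8(pixs, 0);  /* ensure 8 bpp grayscale input */
	if (gray == NULL)
		return NULL;

	/* 200x200 tiles, no smoothing, scorefract 0.1 -- illustrative values. */
	pixOtsuAdaptiveThreshold(gray, 200, 200, 0, 0, 0.1f, NULL, &binary);
	pixDestroy(&gray);
	return binary;  /* NULL if thresholding failed */
}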


jflesch commented Oct 2, 2018

Thanks :)

@jbarlow83

@jflesch There's some overlap in what we're doing but it's not duplication either. ocrmypdf focuses on converting PDFs into OCRed PDFs. It necessarily includes a tesseract executable wrapper similar to pyocr, with some differences in approach (not that there is much going on here). I've never done much with libtesseract.

Leptonica is a large library and as far as I know its document analysis and cleanup filters are a fair bit more sophisticated than what unpaper has. I can't say I've studied it in great detail, but unpaper strikes me as generally following a homebrewed approach - it relies on a lot of assumptions ("there are either one or two columns of text") and basic methods like counting the number of black/white pixels in an area, and thresholds. Leptonica has features to do things like generate masks that cover all of the text and image regions on a page, which is done with morphology. Its author, Dan Bloomberg, has an academic background in the topic. (ocrmypdf uses unpaper too, but mainly for historical reasons - the original author added it and I maintained it as-is, but I've never really given it a hard look.)

My wrapper doesn't cover all of Leptonica, but it's fairly easy to pull in more functions. I'd certainly consider spinning it off to a new project.

(There is also a "pyleptonica" wrapper that is derived from parsing the leptonica source, but it has not been updated for several years and is stuck in Python 2. In a way this is the right thing to do to wrap such a large library. However, pyleptonica uses its own custom C parser (!), and outputs a massive 2.6 MB Python script for its wrappers. I decided this was madness, especially the huge pure Python script. Anyway, it's there, and forking it and taming it might be an approach to consider.)
