
[C-API] TessBaseAPIRecognize() segfaults when no language data files are available #779

Closed
jflesch opened this issue Mar 22, 2017 · 10 comments


jflesch commented Mar 22, 2017

Hello,

When working on openpaperwork/pyocr#51, someone reported the following issue to me:

When no language data file is available at all (not even English), TessBaseAPIRecognize() segfaults.
This also happens when TESSDATA_PREFIX is set to an invalid directory.

Tested with libtesseract 3.04.01 (Debian Sid).

Example code:

#include <stdio.h>
#include <stdlib.h>

#include <leptonica/allheaders.h>

#include <tesseract/capi.h>

/*
 * gcc -Wall -Werror -g -ggdb -ltesseract -llept test.c -o test.out
 * ./test.out ~/git/pyocr/tests/input/real/basic_doc.jpg
 */

int main(int argc, char **argv)
{
	TessBaseAPI *handle;
	PIX *image;

	setenv("TESSDATA_PREFIX", "/opt/tulipe", 1 /* overwrite */);

	handle = TessBaseAPICreate();
	TessBaseAPIInit3(handle, NULL, "fra");
	TessBaseAPISetVariable(handle, "tessedit_zero_rejection", "F");
	image = pixRead(argv[1]);
	TessBaseAPISetImage2(handle, image);

	/* Segfaults here when no language data could be loaded. */
	TessBaseAPIRecognize(handle, NULL);

	TessBaseAPIDelete(handle);

	return 0;
}

Stacktrace:

(gdb) bt
#0  0x00007ffff76f2597 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/x86_64-linux-gnu/libtesseract.so.3
#1  0x00007ffff76dcb7f in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.3
#2  0x0000555555554aa2 in main (argc=2, argv=0x7fffffffe3d8) at test.c:27

PS: This is a very low-priority issue for me. I can work around it really easily on the Python side.


zdenop commented Sep 28, 2018

@jflesch : The problem is that you are not using the C API as designed ;-) : you forgot to check the return value. So instead of

TessBaseAPIInit3(handle, NULL, "fra");

you should use something like this:

if (TessBaseAPIInit3(handle, NULL, "eng") != 0)
	die("Error initializing tesseract\n");

zdenop closed this as completed Sep 28, 2018

jflesch commented Sep 29, 2018

Oops :-/
Thanks for the reply, and sorry for wasting your time. I'll make sure I haven't missed any other return values in Pyocr.


zdenop commented Sep 29, 2018

No problem. While testing your issue (thanks for the test case) I found and fixed another problem ;-)

BTW: maybe it would be good if you joined efforts with @jbarlow83: he is wrapping Leptonica in Python in his project OCRmyPDF. Leptonica provides several great functions for OCR preprocessing, like deskew, dewarp and background removal.
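For context, those operations are single calls in Leptonica's C API. A rough sketch (pixDeskew() and pixBackgroundNormSimple() are Leptonica functions; the parameter choices below are only illustrative, not tuned recommendations):

#include <leptonica/allheaders.h>

/* Rough preprocessing sketch: deskew the page, then normalize the
 * background to white. Parameter values are illustrative only. */
PIX *preprocess(PIX *pixs)
{
	PIX *deskewed, *cleaned;

	/* Find the text skew angle and rotate it away;
	 * 0 lets Leptonica pick its default reduction for the search. */
	deskewed = pixDeskew(pixs, 0);
	if (deskewed == NULL)
		return NULL;

	/* Clean up uneven illumination / gray backgrounds
	 * (expects 8 bpp grayscale or 32 bpp color input). */
	cleaned = pixBackgroundNormSimple(deskewed, NULL, NULL);
	pixDestroy(&deskewed);
	return cleaned;  /* may be NULL on failure */
}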


jflesch commented Sep 30, 2018

Interesting. I did write my own image manipulation library. It contains reimplementations of unpaper's algorithms. The idea at the time was mainly to reduce dependency on stuff like OpenCV as much as possible because they made installing my project a lot more complicated (it was before Flatpak). Unpaper's unskewing is the only algo I haven't re-implemented yet. I'll have a look later at what leptonica and OCRmyPDF provide exactly.


zdenop commented Oct 2, 2018

I spoke with a developer (not a Python developer) and he was quite surprised that Tesseract users are using/looking for other libraries for OCR image preprocessing when everything they really need is in Leptonica ;-)
Maybe an easy-to-use Python wrapper could help promote it.
@jbarlow83 does a good job in this area, but having Leptonica as a separate Python module could help it gain more attention...


jflesch commented Oct 2, 2018

Actually, to be honest, in Paperwork, I don't really worry much about accuracy. OCR is used for indexing documents only. Fuzzy-searching takes care of non-entirely-accurate OCR results. So at the moment I don't even bother pre-processing images at all before passing them to Tesseract.
The image manipulations I do are to fix user perception of the images or to try to simplify their content before exporting to make the output files smaller (Unpaper's algorithms / SWT).

But I'm wondering: is there an official recommendation for a set of pre-processing algorithms to apply before passing an image to Tesseract? Are there algorithms that would make images match more closely the data that was used for training Tesseract? (grayscale? pure b&w? unskewing? ...)


amitdo commented Oct 2, 2018

It depends on which OCR engine in 4.0.0 you want to use (LSTM / legacy).


zdenop commented Oct 2, 2018

Some recommendations are on ImproveQuality - I'm not sure how many of them are a must for 4.0.

Deskew, dewarp, noise removal and cropping should also be generally applicable to your project.
Passing only the text regions (or removing graphics, including lines), and maybe trying another binarization method (Tesseract uses Otsu), could improve the results too.
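If you want to experiment with the binarization step yourself before handing the image to Tesseract, Leptonica exposes a tiled Otsu threshold directly. A rough sketch (pixConvertTo8() and pixOtsuAdaptiveThreshold() are Leptonica functions; the tile size and score fraction below are illustrative guesses, not recommendations):

#include <leptonica/allheaders.h>

/* Binarize a page with Leptonica's tiled Otsu thresholding; the 1 bpp
 * result can then be passed to TessBaseAPISetImage2(). */
PIX *binarize(PIX *pixs)
{
	PIX *gray, *binary = NULL;

	gray = pixConvertTo8(pixs, 0);  /* ensure 8 bpp grayscale input */
	if (gray == NULL)
		return NULL;

	/* 200x200 tiles, no smoothing, scorefract 0.1 -- illustrative values. */
	pixOtsuAdaptiveThreshold(gray, 200, 200, 0, 0, 0.1f, NULL, &binary);
	pixDestroy(&gray);
	return binary;  /* NULL if thresholding failed */
}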


jflesch commented Oct 2, 2018

Thanks :)

@jbarlow83

@jflesch There's some overlap in what we're doing but it's not duplication either. ocrmypdf focuses on converting PDFs into OCRed PDFs. It necessarily includes a tesseract executable wrapper similar to pyocr, with some differences in approach (not that there is much going on here). I've never done much with libtesseract.

Leptonica is a large library and as far as I know its document analysis and cleanup filters are a fair bit more sophisticated than what unpaper has. I can't say I've studied it in great detail, but unpaper strikes me as generally following a homebrewed approach - it relies on a lot of assumptions ("there are either one or two columns of text") and basic methods like counting the number of black/white pixels in an area, and thresholds. Leptonica has features to do things like generate masks that cover all of the text and image regions on a page, which is done with morphology. Its author, Dan Bloomberg, has an academic background in the topic. (ocrmypdf uses unpaper too, but mainly for historical reasons - the original author added it and I maintained it as-is, but I've never really given it a hard look.)

My wrapper doesn't cover all of Leptonica, but it's fairly easy to pull in more functions. I'd certainly consider spinning it off to a new project.

(There is also a "pyleptonica" wrapper that is derived from parsing the leptonica source, but it has not been updated for several years and is stuck in Python 2. In a way this is the right thing to do to wrap such a large library. However, pyleptonica uses its own custom C parser (!), and outputs a massive 2.6 MB Python script for its wrappers. I decided this was madness, especially the huge pure Python script. Anyway, it's there, and forking it and taming it might be an approach to consider.)
