Avoid tesseract writing Pix out/reading them back. #2965

robinwatts · 2020-05-04T23:24:54Z

By default, when we call ImageData::SetPix, we write the data out as
a PNG, just to read it back in to get a compressed buffer of data.
We then use this buffer to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/decompressing is slow, and on minimal
builds of Leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.
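
The trade-off between sharing and duplicating can be sketched as below. This is a minimal illustration, not Tesseract or Leptonica code: `Image`, `ShareForReading` and `CopyForWriting` are hypothetical names, and `std::shared_ptr` stands in for the reference count that a Leptonica Pix carries. In Leptonica terms, sharing corresponds roughly to pixClone (a new handle to the same data) and duplicating to pixCopy (a separate copy of the pixels).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a Leptonica Pix: shared_ptr supplies the
// reference count, the vector holds the pixel data.
using Image = std::shared_ptr<std::vector<std::uint8_t>>;

// Read-only access: hand back another handle to the same pixels.
// No pixel data is duplicated.
Image ShareForReading(const Image& original) {
  assert(original != nullptr);
  return original;
}

// The caller may modify the result: make a private deep copy so the
// stored original stays untouched.
Image CopyForWriting(const Image& original) {
  assert(original != nullptr);
  return std::make_shared<std::vector<std::uint8_t>>(*original);
}
```

Sharing is only safe when every holder of the handle treats the pixels as read-only, which is the condition the paragraph above relies on.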

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.
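
As a rough sketch of what such a build-time switch looks like, assuming a simplified holder class: only the macro name TESSERACT_IMAGEDATA_AS_PIX comes from this commit. `RawImage`, `ImageHolder`, `Encode` and `Decode` are hypothetical, and the placeholder codec just passes bytes through where a real build would call an image writer and reader.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical raw image; stands in for a Leptonica Pix.
struct RawImage {
  int width = 0;
  int height = 0;
  std::vector<std::uint8_t> pixels;
};

// Placeholder codec pair (assumption): passes the bytes through unchanged.
inline std::vector<std::uint8_t> Encode(const RawImage& image) {
  return image.pixels;
}
inline RawImage Decode(int width, int height,
                       const std::vector<std::uint8_t>& bytes) {
  RawImage image;
  image.width = width;
  image.height = height;
  image.pixels = bytes;
  return image;
}

// Illustrative holder, not Tesseract's ImageData.
class ImageHolder {
 public:
  void SetImage(RawImage image) {
#ifdef TESSERACT_IMAGEDATA_AS_PIX
    // Keep the image itself: no encode step and no temp file.
    image_ = std::move(image);
#else
    // Default path: keep an encoded byte buffer instead.
    width_ = image.width;
    height_ = image.height;
    encoded_ = Encode(image);
#endif
  }

  RawImage GetImage() const {
#ifdef TESSERACT_IMAGEDATA_AS_PIX
    return image_;
#else
    return Decode(width_, height_, encoded_);
#endif
  }

 private:
#ifdef TESSERACT_IMAGEDATA_AS_PIX
  RawImage image_;
#else
  int width_ = 0;
  int height_ = 0;
  std::vector<std::uint8_t> encoded_;
#endif
};
```

Callers see the same interface in both configurations; only the stored representation changes, which is why a single define is enough to select the behaviour.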

Given we don't do compression, and the write/read round trip was only
going to memory, this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixClone rather than pixCopy, as only the scaler uses this.

robinwatts · 2020-05-04T23:28:37Z

I am looking into using Tesseract with Ghostscript. These are some changes that I made locally to smooth operation. On systems without fmemopen (such as Windows), Tesseract drives Leptonica to save images out, just so it can then read the compressed version back in again.

This commit enables a build define that avoids that.

zdenop · 2020-05-16T18:35:10Z

I like the idea of building a minimal Tesseract (I tried to do it some time ago; AFAIK only zlib was needed).

zdenop merged commit b5d639d into tesseract-ocr:master May 16, 2020

amitdo referenced this pull request Mar 31, 2021

Remove unused ifdef.

4fa05b9
