corrupt pdf output on cygwin #63

Shreeshrii · 2015-07-24T10:31:52Z

Using Windows binaries compiled by Simon on Cygwin, from http://domasofan.spdns.eu/tesseract/

$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu pdf

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing\eurotext.tif format is 4; unreadable
Error during processing.

The PDF is created, but it cannot be opened.
Adobe Reader reports that the file is corrupted.

(Forum thread - https://groups.google.com/forum/#!msg/tesseract-ocr/ToWcnyHqF4c/FHWGlQhd6poJ )

Shreeshrii · 2015-07-24T10:33:12Z

Version information -
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

zdenop · 2015-07-24T10:50:28Z

Did you read https://groups.google.com/d/msg/tesseract-ocr/ToWcnyHqF4c/P7HDEKsR1cEJ ?

matzeri · 2015-07-26T19:41:10Z

on cygwin x86_64 and same on x86:
$ tesseract --version
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

$ tesseract eurotext.tif eurotext -l eng+deu pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

$ ls -lrt eurotext.pdf
-rw-r--r-- 1 marco Administrators 13K Jul 26 21:36 eurotext.pdf

Shreeshrii · 2015-07-27T03:01:07Z

Marco, the version I tested was 'v3.05.00dev' based on the master branch from git (built by Simon).

Could it be that one of the newer commits has caused this issue?

matzeri · 2015-07-27T10:44:18Z

I doubt it. More likely you are missing some additional library or program,
or some configuration.

jbreiden · 2015-07-27T23:52:35Z

Tesseract knows that PDF creation failed and returns an error code. So at least this is not silent data corruption. I'd like to know if the problem is present for PNG input or if it is restricted to TIFF.
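
As a rough illustration of relying on that error code, the sketch below runs a command and reports whether it exited with status 0. The helper name `run_and_check` is hypothetical, and the commented `tesseract` invocation assumes the binary is on PATH; nothing here is taken from Tesseract's own code.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run cmd (a list of arguments) and return (ok, returncode, stderr_text).

    ok is True only when the process exits with status 0.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.returncode, proc.stderr


# Hypothetical usage, assuming `tesseract` is installed and on PATH:
#
#   ok, code, err = run_and_check(
#       ["tesseract", "testing/phototest.tif", "testing/phototest",
#        "-l", "eng", "pdf"])
#   if not ok:
#       print("tesseract failed with exit code", code, file=sys.stderr)
#       print(err, file=sys.stderr)
```

A caller that checks the return code this way can discard the PDF from a failed run instead of passing it on.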

Shreeshrii · 2015-07-28T02:53:58Z

Jeff, PDF output worked for PNG and JPG input. This is using the versions compiled by Simon.

C:\Users\User\Downloads\TESS>tesseract -v
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

C:\Users\User\Downloads\TESS>tesseract testing/phototest.gif testing/phototest.gif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/leptonica/847980_4108_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing/phototest.tif format is 4; unreadable
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

C:\Users\User\Downloads\TESS>tesseract testing/phototest.png testing/phototest.png -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

C:\Users\User\Downloads\TESS>tesseract testing/phototest.jpg testing/phototest.jpg -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

Shreeshrii · 2015-07-28T03:00:20Z

Directory of C:\Users\User\Downloads\TESS\testing

07/28/15 08:10 55,504 phototest.gif
07/28/15 08:19 0 phototest.gif.pdf
08/28/14 20:38 57,772 phototest.jpg
07/28/15 08:20 61,460 phototest.jpg.pdf
08/28/14 20:38 5,265 phototest.png
07/28/15 08:20 8,890 phototest.png.pdf
07/24/15 12:15 38,668 phototest.tif
07/28/15 08:20 2,910 phototest.tif.pdf
07/28/15 08:20 287 phototest.tif.txt
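
The listing above shows a 0-byte phototest.gif.pdf and a 2,910-byte phototest.tif.pdf for the two failing runs. A minimal sketch of a sanity check for such output follows. The function name `looks_like_pdf` is hypothetical, and the check is deliberately shallow: it only confirms that the file is non-empty, starts with `%PDF-`, and has an `%%EOF` marker near the end. Passing it does not prove the PDF is valid, and it is an assumption that the failed TIFF output would be caught by it.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Return True if the file has a PDF header and an %%EOF marker near its end."""
    if not os.path.isfile(path):
        return False
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, "rb") as f:
        head = f.read(5)
        # Read only the last tail_bytes bytes to look for the end marker.
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

Combined with the exit code, a check like this can flag an empty or truncated output file before anyone tries to open it.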

jbreiden · 2015-07-28T03:10:34Z

Hmmm.... interesting. I suspect this is related to that classic Windows problem
where you can't pass file pointers between different DLLs, especially if they use
different runtimes. If so, we may be in trouble.

jbreiden · 2015-07-28T03:35:10Z

Or... do we still have some ifdefs in the code to do Windows streaming I/O a little differently? I vaguely remember writing some back in the day. Maybe they are misbehaving under Cygwin? Can't seem to find them at the moment.

Shreeshrii · 2015-07-28T04:51:30Z

Marco is able to get the pdf output from the 3.04.00 version he packaged
for cygwin.

I was testing based on the (3.05.dev version) files that were built by Simon. I do not have
cygwin installed but will try downloading the files from the mirrors Marco
suggested and see what happens.

FYI, I downloaded the MSYS2 tesseract-ocr package for 3.04.00 (packaged by
Alex at
https://github.com/Alexpux/MINGW-packages/tree/master/mingw-w64-tesseract-ocr)
and am able to get the pdf output from it.


Shreeshrii · 2015-07-28T04:53:26Z

Just to clarify, I am referring to the PDF output from TIF input in the above post.

Shreeshrii · 2015-07-29T06:24:06Z

Working with 3.04.00 packaged by Marco for cygwin

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif phototest.tif
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif hocr
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract --list-langs
List of available languages (2):
eng
osd

ra@Shree ~/tesseract-ocr
$ tesseract -v
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

Shreeshrii · 2015-07-29T06:25:05Z

ra@Shree ~/tesseract-ocr/testing
$ ls -lrt
total 165
-rwx---r-x 1 ra ra 38668 Jul 29 11:45 phototest.tif
-rwx---r-x 1 ra ra 102598 Jul 29 11:45 eurotext.tif
-rw----r-- 1 ra ra 7712 Jul 29 11:47 phototest.tif.pdf
-rw----r-- 1 ra ra 287 Jul 29 11:48 phototest.tif.txt
-rw----r-- 1 ra ra 8394 Jul 29 11:48 phototest.tif.hocr

LeeBear35 · 2016-08-30T14:10:02Z

I went into pbrush and created a Hello World image and saved it as bmp, gif, jpg, png, and tif. When I process those files using tesseract.exe imagefile textfile -l eng, all the files process correctly except the GIF file. I included the GIF and the output below:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/199506_720_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
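Error during processing.

[image: helloworld]
https://cloud.githubusercontent.com/assets/11964590/18092293/6a92bfd8-6e91-11e6-8c27-2e66a0da3114.gif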

Also here is the version dump:

tesseract 3.05.00dev
leptonica-1.73
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3

If I were a guessing man, I would say the problem is the temporary file name /tmp/199506_720_mem.gif, which likely does not conform to MS Windows path conventions.

A little more information: the pixReadMemGif routine makes a call to get a temporary file name, and that routine (genTempFilename in the Leptonica utils.c file) tries to ensure that the tmp directory exists. When I created a tmp directory at the root of the drive where I am running tesseract, the GIF file was read correctly.

Shreeshrii · 2016-08-30T14:40:43Z

Maybe Leptonica is not built with the GIF library.

LeeBear35 · 2016-08-30T14:45:30Z

After further research, the issue is with the genTempFilename routine in the Leptonica utils.c file. It attempts to ensure that the tmp directory exists on the drive where the program is executing, but it fails to create the directory, so the temp file name it returns cannot be created or used. If the tmp directory is created by hand, the GIF file is processed and extracted correctly.

I updated my post when I discovered this shortcoming.
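
The workaround described above (creating the tmp directory by hand) can be sketched as a pre-flight check. This is only an illustration under stated assumptions: `ensure_writable_dir` is a hypothetical helper, not part of Leptonica or Tesseract, and the directory to pass in (for example the root-level tmp directory on the current drive) depends on how the binary was built.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create path if it is missing and confirm that a file can be written there.

    Returns True on success, False if the directory cannot be created or used.
    """
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and deleting a scratch file proves the directory is usable.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True
```

Running a check like this before invoking tesseract turns the opaque "stream not opened" errors into an explicit message about the temporary directory.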

matzeri · 2016-08-30T15:07:05Z

/tmp/199506_720_mem.gif is fine for Cygwin. Are you using a Cygwin build without a proper directory structure?

jbreiden · 2016-08-31T01:51:05Z

I have a number of tempfile patches already written for Leptonica to make these calls more
secure and less brittle, and there is ongoing work on this topic. I actually don't know if
cygwin is using the Unix or Windows code path for temporary files, but just want to
mention that there is activity. Don't know why you are getting bad results compared to
other cygwin users.

https://sources.debian.net/src/leptonlib/1.73-5/debian/patches/

matzeri · 2016-08-31T04:57:57Z

@jbreiden Starting from Leptonica 1.73, Cygwin follows the Unix tmp path.

LeeBear35 · 2016-08-31T13:10:05Z

It might be that I am running on the e: drive instead of the c: drive and there was no e:\tmp. It was just a matter of the routine not swapping out /tmp for the Windows temporary directory.
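
To illustrate why the drive matters, the sketch below uses Python's `ntpath` module to model how a native Windows program resolves a rooted path that has no drive letter: it is taken relative to the root of the current drive. The function name `resolve_on_windows` and the working directory `E:\work\tess` are made up for the example. It models native Windows path rules only; as noted earlier in the thread, a proper Cygwin installation provides its own /tmp.

```python
import ntpath


def resolve_on_windows(cwd, path):
    """Resolve path the way Windows would for a process whose working directory is cwd."""
    return ntpath.normpath(ntpath.join(cwd, path))


# Hypothetical working directory on the e: drive.
print(resolve_on_windows("E:\\work\\tess", "/tmp/199506_720_mem.gif"))
# -> E:\tmp\199506_720_mem.gif
```

So when no e:\tmp directory exists, opening that file fails, which matches the "stream not opened" errors in the log.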

zdenop closed this as completed Jul 29, 2015

amitdo added the PDF label Jun 3, 2016

Sharcoux mentioned this issue Jan 10, 2023 in "The font size returned by the recognize api is often incorrect" #3988 (closed)
