
corrupt pdf output on cygwin #63

Closed
Shreeshrii opened this issue Jul 24, 2015 · 21 comments

@Shreeshrii
Collaborator

Using the Windows binaries that Simon compiled on Cygwin, from http://domasofan.spdns.eu/tesseract/

$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu pdf

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing\eurotext.tif format is 4; unreadable
Error during processing.

A PDF file is produced, but it cannot be opened.
Adobe Reader reports that the file is corrupted.

(Forum thread - https://groups.google.com/forum/#!msg/tesseract-ocr/ToWcnyHqF4c/FHWGlQhd6poJ )

@Shreeshrii
Collaborator Author

Version information -
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

@zdenop
Contributor

zdenop commented Jul 24, 2015

@matzeri
Contributor

matzeri commented Jul 26, 2015

On Cygwin x86_64 (and the same on x86):
$ tesseract --version
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

$ tesseract eurotext.tif eurotext -l eng+deu pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

$ ls -lrt eurotext.pdf
-rw-r--r-- 1 marco Administrators 13K Jul 26 21:36 eurotext.pdf

@Shreeshrii
Collaborator Author

Marco, the version I tested was 'v3.05.00dev' based on the master branch from git (built by Simon).

Could it be that one of the newer commits has caused this issue?

@matzeri
Contributor

matzeri commented Jul 27, 2015

I doubt it. More likely you are missing an additional library or program,
or some configuration.


@jbreiden
Contributor

Tesseract knows that PDF creation failed and returns an error code. So at least this is not silent data corruption. I'd like to know if the problem is present for PNG input or if it is restricted to TIFF.
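
One way to act on that exit code from a script is sketched below in Python. It assumes the `tesseract` executable is on PATH and uses the same argument order as the commands in this thread. The helper names `run_tesseract_pdf` and `looks_like_complete_pdf` are made up for illustration, and the `%PDF-` / `%%EOF` check is only a rough heuristic for catching empty or truncated output, not a full PDF validation.

```python
import subprocess
from pathlib import Path


def looks_like_complete_pdf(path):
    """Heuristic: file starts with %PDF- and has %%EOF near the end."""
    data = Path(path).read_bytes()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def run_tesseract_pdf(image, output_base, lang="eng"):
    """Run the tesseract CLI; raise on a non-zero exit code or a suspicious PDF."""
    # Command shape as used in this thread: tesseract IMAGE OUTPUTBASE -l LANG pdf
    result = subprocess.run(
        ["tesseract", str(image), str(output_base), "-l", lang, "pdf"],
        capture_output=True,
        text=True,
    )
    pdf_path = Path(str(output_base) + ".pdf")
    if result.returncode != 0:
        raise RuntimeError(
            f"tesseract exited with {result.returncode}:\n{result.stderr}"
        )
    if not pdf_path.is_file() or not looks_like_complete_pdf(pdf_path):
        raise RuntimeError(f"{pdf_path} is missing or looks truncated")
    return pdf_path
```

Checking the return code first means a failed run is reported even when a partial `.pdf` file was left on disk.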

@Shreeshrii
Collaborator Author

Jeff, PDF output worked for PNG and JPG input. This is using the versions compiled by Simon.

C:\Users\User\Downloads\TESS>tesseract -v
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

C:\Users\User\Downloads\TESS>tesseract testing/phototest.gif testing/phototest.gif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/leptonica/847980_4108_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing/phototest.tif format is 4; unreadable
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

C:\Users\User\Downloads\TESS>tesseract testing/phototest.png testing/phototest.png -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

C:\Users\User\Downloads\TESS>tesseract testing/phototest.jpg testing/phototest.jpg -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

@Shreeshrii
Collaborator Author

Directory of C:\Users\User\Downloads\TESS\testing

07/28/15 08:10 55,504 phototest.gif
07/28/15 08:19 0 phototest.gif.pdf
08/28/14 20:38 57,772 phototest.jpg
07/28/15 08:20 61,460 phototest.jpg.pdf
08/28/14 20:38 5,265 phototest.png
07/28/15 08:20 8,890 phototest.png.pdf
07/24/15 12:15 38,668 phototest.tif
07/28/15 08:20 2,910 phototest.tif.pdf
07/28/15 08:20 287 phototest.tif.txt

@jbreiden
Contributor

Hmmm.... interesting. I suspect this is related to that classic Windows problem
where you can't pass file pointers between different DLLs, especially if they use
different runtimes. If so, we may be in trouble.

@jbreiden
Contributor

Or... do we still have some ifdefs in the code to do Windows streaming I/O a little differently? I vaguely remember writing some back in the day. Maybe they are misbehaving under Cygwin? Can't seem to find them at the moment.

@Shreeshrii
Collaborator Author

Marco is able to get the PDF output from the 3.04.00 version he packaged for Cygwin.

I was testing with the 3.05.00dev files that were built by Simon. I do not have Cygwin installed, but I will try downloading the files from the mirrors Marco suggested and see what happens.

FYI, I downloaded the MSYS2 tesseract-ocr package for 3.04.00 (packaged by Alex at https://github.com/Alexpux/MINGW-packages/tree/master/mingw-w64-tesseract-ocr) and am able to get the PDF output from it.


@Shreeshrii
Collaborator Author

Just to clarify, I am referring to the PDF output from TIFF input in the above post.

@Shreeshrii
Collaborator Author

Working with 3.04.00 packaged by Marco for cygwin

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif phototest.tif
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif hocr
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract --list-langs
List of available languages (2):
eng
osd

ra@Shree ~/tesseract-ocr
$ tesseract -v
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

@Shreeshrii
Collaborator Author

ra@Shree ~/tesseract-ocr/testing
$ ls -lrt
total 165
-rwx---r-x 1 ra ra 38668 Jul 29 11:45 phototest.tif
-rwx---r-x 1 ra ra 102598 Jul 29 11:45 eurotext.tif
-rw----r-- 1 ra ra 7712 Jul 29 11:47 phototest.tif.pdf
-rw----r-- 1 ra ra 287 Jul 29 11:48 phototest.tif.txt
-rw----r-- 1 ra ra 8394 Jul 29 11:48 phototest.tif.hocr

@zdenop zdenop closed this as completed Jul 29, 2015
@amitdo amitdo added the PDF label Jun 3, 2016
@LeeBear35

LeeBear35 commented Aug 30, 2016

I went into pbrush and created a Hello World image and saved it as bmp, gif, jpg, png, and tif. When I process those files using tesseract.exe imagefile textfile -l eng, all the files process correctly except the GIF file. I included the GIF and the output below:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/199506_720_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.
helloworld

Also here is the version dump:

tesseract 3.05.00dev
leptonica-1.73
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3

My guess is that the temporary file name /tmp/199506_720_mem.gif does not conform to MS Windows path conventions.

A little more information: the pixReadMemGif routine calls a helper to get a temporary file name, and that helper (genTempFilename, in the Leptonica utils.c file) tries to ensure that the tmp directory exists. After I created a tmp directory at the root of the drive where I am running tesseract, the GIF file was read and processed correctly.
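
As an illustration of why a directory at the root of the current drive matters, here is a small Python sketch of how native Windows path rules treat a rooted path that has no drive letter. This assumes a build that uses plain Windows path handling; under a full Cygwin installation `/tmp` is mapped by the Cygwin runtime instead. The working directory `E:\work` and the helper name are hypothetical.

```python
from pathlib import PureWindowsPath


def resolve_on_current_drive(cwd, rooted_path):
    """Resolve a rooted, drive-less path against the drive of cwd (Windows rules)."""
    # Joining a rooted path keeps the drive of the left-hand side
    # and replaces everything after it.
    return PureWindowsPath(cwd) / rooted_path


resolved = resolve_on_current_drive(r"E:\work", "/tmp/199506_720_mem.gif")
print(resolved)         # E:\tmp\199506_720_mem.gif
print(resolved.parent)  # E:\tmp
```

Under these rules the temp file can only be opened if `\tmp` already exists on whichever drive the process is running from.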

@Shreeshrii
Collaborator Author

Maybe Leptonica was not built with the GIF library.


@LeeBear35

After further research, the issue is in the genTempFilename function in the Leptonica utils.c file. It attempts to ensure that the tmp directory exists on the drive where the program is executing, but fails to create it, so the temp file name it returns cannot be created or used. If the tmp directory is created beforehand, the GIF file is processed and extracted correctly.

I updated my post when I discovered this shortcoming.
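
The write-to-temp-file-then-read-back pattern, with the directory created first, can be sketched in Python as follows. This is a stand-in to show the idea, not Leptonica's code; `write_then_read` is a hypothetical helper name.

```python
import os


def write_then_read(temp_path, payload):
    """Write payload to temp_path and read it back, creating the directory first."""
    directory = os.path.dirname(temp_path)
    if directory:
        # Without this step, open() fails when the directory is missing,
        # which matches the "stream not opened" failure reported above.
        os.makedirs(directory, exist_ok=True)
    with open(temp_path, "wb") as stream:
        stream.write(payload)
    with open(temp_path, "rb") as stream:
        return stream.read()
```

`os.makedirs(..., exist_ok=True)` raises if the directory cannot be created, so a failure shows up at the directory step rather than as a later "file not found".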


@matzeri
Contributor

matzeri commented Aug 30, 2016

/tmp/199506_720_mem.gif is fine for Cygwin. Are you using a Cygwin build without a proper directory structure?

@jbreiden
Contributor

jbreiden commented Aug 31, 2016

I have a number of tempfile patches already written for Leptonica to make these calls more
secure and less brittle, and there is ongoing work on this topic. I actually don't know if
cygwin is using the Unix or Windows code path for temporary files, but just want to
mention that there is activity. Don't know why you are getting bad results compared to
other cygwin users.

https://sources.debian.net/src/leptonlib/1.73-5/debian/patches/

@matzeri
Contributor

matzeri commented Aug 31, 2016

@jbreiden Starting from 1.73, it follows the Unix tmp path.

@LeeBear35

It might be that I am running on the e: drive instead of the c: drive and there was no e:\tmp. The routine simply did not swap /tmp for the Windows temporary directory.
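
For comparison, this is roughly what "swapping /tmp for the platform temporary directory" looks like in Python. It is only an illustration of the idea, not a description of what Leptonica does; `platform_temp_path` is a hypothetical helper name.

```python
import os
import tempfile


def platform_temp_path(filename):
    """Build a temp file path under the platform's temp directory."""
    # tempfile.gettempdir() consults the TMPDIR, TEMP and TMP environment
    # variables, then falls back to platform-specific locations, so no
    # hard-coded "/tmp" literal is needed.
    return os.path.join(tempfile.gettempdir(), filename)


print(platform_temp_path("199506_720_mem.gif"))
```

The printed path depends on the machine, so no expected output is shown.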

