corrupt pdf output on cygwin #63
Version information -
on cygwin x86_64 and same on x86:
$ tesseract eurotext.tif eurotext -l eng+deu pdf
$ ls -lrt eurotext.pdf
Marco, the version I tested was 'v3.05.00dev' based on the master branch from git (built by Simon). Could it be that one of the newer commits has caused this issue?
I doubt it; more likely you are missing some additional library/program or
Tesseract knows that PDF creation failed and returns an error code. So at least this is not silent data corruption. I'd like to know if the problem is present for PNG input or if it is restricted to TIFF.
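Because the failure is reported through the exit status, a wrapper script can detect it without inspecting the PDF. The following is a minimal sketch, not part of Tesseract: `run_ocr` is a hypothetical helper name, and a stand-in Python subprocess is used in place of a real `tesseract` call so the example runs anywhere.

```python
import subprocess
import sys


def run_ocr(cmd):
    """Run a command; return (succeeded, captured stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr


# Stand-in for a failing run (assumption: the real call would be something like
# ["tesseract", "eurotext.tif", "eurotext", "-l", "eng+deu", "pdf"]).
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Error during processing.'); sys.exit(1)",
]
ok, err = run_ocr(failing_cmd)
if not ok:
    print("OCR failed:", err)
```

Checking the exit status this way only tells you that the run failed; the messages on stderr are still needed to see which step went wrong.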
Jeff, it worked for png and jpg for pdf output. This is using the versions compiled by Simon.
C:\Users\User\Downloads\TESS>tesseract -v
C:\Users\User\Downloads\TESS>tesseract testing/phototest.gif testing/phototest.gif -l eng pdf
C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng pdf
C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng
C:\Users\User\Downloads\TESS>tesseract testing/phototest.png testing/phototest.png -l eng pdf
C:\Users\User\Downloads\TESS>tesseract testing/phototest.jpg testing/phototest.jpg -l eng pdf
Directory of C:\Users\User\Downloads\TESS\testing
07/28/15 08:10 55,504 phototest.gif
Hmmm.... interesting. I suspect this is related to that classic Windows problem
Or... do we still have some ifdefs in the code to do Windows streaming I/O a little differently? I vaguely remember writing some back in the day. Maybe they are misbehaving under Cygwin? Can't seem to find them at the moment.
Marco is able to get the pdf output from the 3.04.00 version he packaged
I was testing based on the (3.05.dev version) files that were built by Simon. I do not have
FYI, I downloaded the MSYS2 tesseract-ocr package for 3.04.00 (packaged by
Just to clarify, I am referring to the pdf output from tif input in the above post.
Working with 3.04.00 packaged by Marco for cygwin
ra@Shree ~/tesseract-ocr
ra@Shree ~/tesseract-ocr
ra@Shree ~/tesseract-ocr
ra@Shree ~/tesseract-ocr
ra@Shree ~/tesseract-ocr
ra@Shree ~/tesseract-ocr/testing
Maybe Leptonica is not built with the GIF library.
After further research, the issue is with the Leptonica utils.c genTempFilename method. It attempts to ensure that the tmp directory exists on the drive where the program is executing, but fails to create the directory, so the temp file it returns cannot be created or used. If the tmp directory is created, then the GIF file is processed and extracted correctly. I updated my post when I discovered this shortcoming.
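The workaround described above (make sure the directory exists before asking for a temp file in it) can be sketched as follows. This is an illustration in Python, not Leptonica's C code; `make_temp_file` and its `lept_` prefix are hypothetical names chosen for the example.

```python
import os
import tempfile


def make_temp_file(tmp_dir, prefix="lept_", suffix=".tif"):
    """Create tmp_dir if it is missing, then create a unique empty file in it."""
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(tmp_dir, exist_ok=True)
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=tmp_dir)
    os.close(fd)  # the caller reopens the file by name
    return path


# Demonstration inside a throwaway base directory, so nothing is written to a real /tmp.
base = tempfile.mkdtemp()
missing_tmp = os.path.join(base, "tmp")
created = make_temp_file(missing_tmp)
print(created)
```

Creating the directory up front means a missing tmp directory no longer surfaces later as a "stream not opened" style failure when the file is written.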
/tmp/199506_720_mem.gif is fine for cygwin. Are you using a cygwin build without a proper directory structure?
I have a number of tempfile patches already written for Leptonica to make these calls more https://sources.debian.net/src/leptonlib/1.73-5/debian/patches/
@jbreiden Starting from 1.73, Leptonica follows the Unix tmp path.
It might be that I am running on the e: drive instead of the c: drive and there was no e:\tmp. It was just a matter of the routine not swapping out /tmp for the Windows temporary directory.
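The missing substitution described here, replacing a leading `/tmp` with the Windows temporary directory, might look roughly like the sketch below. It is an assumption-laden illustration rather than Leptonica's behavior: `resolve_tmp_path`, its `on_windows` flag, and the `windows_tmp` override are all hypothetical, and `ntpath` is used so the result is the same on any host.

```python
import ntpath
import tempfile


def resolve_tmp_path(path, on_windows, windows_tmp=None):
    """Map a Unix-style /tmp path onto the Windows temp directory."""
    if not on_windows:
        return path
    # Only rewrite /tmp itself or paths beneath it, not e.g. /tmpfoo.
    if path != "/tmp" and not path.startswith("/tmp/"):
        return path
    base = windows_tmp if windows_tmp is not None else tempfile.gettempdir()
    rest = path[len("/tmp"):].lstrip("/")
    if not rest:
        return base
    return ntpath.join(base, *rest.split("/"))


print(resolve_tmp_path("/tmp/199506_720_mem.gif", True, r"C:\Temp"))
```

With a mapping like this, the temp file no longer depends on a `\tmp` directory existing on whichever drive the program happens to run from.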
Using windows binaries compiled by Simon on cygwin from http://domasofan.spdns.eu/tesseract/
$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing\eurotext.tif format is 4; unreadable
Error during processing.
The PDF file is written, but it cannot be opened.
Adobe Reader shows an error saying the file is corrupted.
(Forum thread - https://groups.google.com/forum/#!msg/tesseract-ocr/ToWcnyHqF4c/FHWGlQhd6poJ )
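For anyone triaging a similar report, a quick structural check can confirm that an output file is incomplete before opening it in a viewer. The sketch below is a heuristic only: it assumes a well-formed PDF starts with `%PDF-` and has an `%%EOF` marker near the end, and it does not establish what was actually wrong with the file in this report. `looks_like_complete_pdf` is a hypothetical helper name.

```python
def looks_like_complete_pdf(data):
    """Cheap check: PDF header at the start and an %%EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


# Synthetic byte strings stand in for real files here.
complete = b"%PDF-1.5\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
truncated = b"%PDF-1.5\n1 0 obj\n<<>>\nendobj\n"
print(looks_like_complete_pdf(complete), looks_like_complete_pdf(truncated))
```

A file that passes this check can still be damaged in other ways, so a failing result is more informative than a passing one.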