PDF output: odd spaces on OSX preview #699

Closed
RNCTX opened this issue Feb 4, 2017 · 5 comments

@RNCTX
Contributor

RNCTX commented Feb 4, 2017

I found a mention of this in an earlier issue filed against a prior version (referenced below). I produced the output below with -l eng and --oem 1.

#337

tesseract -v
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

OpenCL info:
Found 1 platform(s).
Platform 1 name: Apple.
Version: OpenCL 1.2 (Jan 4 2017 22:35:59).
Found 2 device(s).
Device 1 name: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz.
Device 2 name: Intel(R) Iris(TM) Graphics 6100.
Found AVX
Found SSE

Gilbert. Certainly. Anybody can write a three-volumed novel.* It merely requires a complete ignorance of both life and literature. The difficulty that I should fancy the reviewer feels is the difficulty of sustaining any standard. Where there is no style a standard must be impossible. The poor reviewers are apparently reduced to be the reporters of the police court of literature, the chroniclers of the doings of the habitual criminals of art. It is sometimes said of them that they do not read all through the works they are called upon to criticise. They do not. Or at least they should not. If they did so, they would become confirmed misanthropes; or, ifI may borrow a phrase from one of the pretty N e w n h a m graduates, confirmed womanthropes' for the rest of their lives. Nor is it necessary. To know the vintage and quality of a wine one need not drink the whole cask. It must be perfectly easy in half an h o u r t o s a y w h e t h e r a b o o k is w o r t h a n y t h i n g o r w o r t h n o t h i n g . T e n m i n u t e s are really sufficient, if one has the instinct for form. W h o wants to wade t h r o u g h a d u l l v o l u m e ? O n e t a s t e s it, a n d t h a t is q u i t e e n o u g h - m o r e t h a n enough, Ishould imagine. Iam aware that there are many honest workers in painting as well as in literature w h o object to criticism entirely. T h e y are quite right. Their work stands in no intellectual relation to their age. Itbrings u s n o n e w e l e m e n t o f p l e a s u r e . It s u g g e s t s n o f r e s h d e p a r t u r e o f t h o u g h t , o r passion, or beauty. It should not be spoken of. It should be left to the oblivion that it deserves.

[Screenshot: screen shot 2017-02-04 at 1 39 41 am]

However, the same text copied/pasted from the same file opened in Adobe Acrobat Pro 10 is (almost) flawless...

Gilbert. Certainly. Anybody can write a three-volumed novel.* It merely
requires a complete ignorance of both life and literature. The difficulty that
I should fancy the reviewer feels is the difficulty of sustaining any standard.
Where there is no style a standard must be impossible. The poor reviewers
are apparently reduced to be the reporters of the police court of literature,
the chroniclers of the doings of the habitual criminals of art. It is sometimes
said of them that they do not read all through the works they are called upon
to criticise. They do not. Or at least they should not. If they did so, they
would become confirmed misanthropes; or, if I may borrow a phrase from
one of the pretty Newnham graduates, confirmed womanthropes' for the
rest of their lives. Nor is it necessary. To know the vintage and quality of a
wine one need not drink the whole cask. It must be perfectly easy in half an
hour to say whether a book is worth anything or worth nothing. Ten minutes
are really sufficient, if one has the instinct for form. Who wants to wade
through a dull volume? One tastes it, and that is quite enough-more than
enough, I should imagine. I am aware that there are many honest workers
in painting as well as in literature who object to criticism entirely. They are
quite right. Their work stands in no intellectual relation to their age. It brings
us no new element of pleasure. It suggests no fresh departure of thought, or
passion, or beauty. It should not be spoken of. It should be left to the oblivion
that it deserves.

@Shreeshrii
Collaborator

It seems to be working fine in my case.

tesseract -v
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Found AVX
Found SSE

tesseract testeng.png testeng --oem 1 -l eng

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

Here is the output:

Gilbert. Certainly. Anybody can write a three-volumed novel." It merely
requires a complete ignorance of both life and literature. The difficulty that
I should fancy the reviewer feels is the difficulty of sustaining any standard.
Where there is no style a standard must be impossible. The poor reviewers
are apparently reduced to be the reporters of the police court of literature,
the chroniclers of the doings of the habitual criminals of art. It is sometimes
said of them that they do not read all through the works they are called upon
to criticise. They do not. Or at least they should not. If they did so, they
would become confirmed misanthropes; or, if I may borrow a phrase from
one of the pretty Newnham graduates, confirmed womanthropes' for the
rest of their lives. Nor is it necessary. To know the vintage and quality of a
wine one need not drink the whole cask. It must be perfectly easy in half an
hour to say whether a book is worth anything or worth nothing. Ten minutes
are really sufficient, if one has the instinct for form. Who wants to wade
through a dull volume? One tastes it, and that is quite enough-more than
enough, I should imagine. I am aware that there are many honest workers
in painting as well as in literature who object to criticism entirely. They are
quite right. Their work stands in no intellectual relation to their age. It brings
us no new element of pleasure. It suggests no fresh departure of thought, or
passion, or beauty. It should not be spoken of. It should be left to the oblivion
that it deserves.

@Shreeshrii
Collaborator

Could be related to OpenCL.

@RNCTX
Contributor Author

RNCTX commented Feb 4, 2017

I rebuilt HEAD without OpenCL and got the same result. I suppose this has to be something screwy with OS X Preview, since it doesn't happen in Adobe Acrobat (or in Chrome, which I also opened the file in just now).

Here's the complete page...

The946.png.pdf

@jbreiden
Contributor

jbreiden commented Feb 4, 2017

Known problem. The root cause is the PDF spec, which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics.
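
To illustrate why two viewers can extract different text from the same PDF, here is a minimal sketch of a gap-based space heuristic. It is not Preview's or Acrobat's actual algorithm, and it does not reflect how Tesseract lays out its PDF text layer. The `Glyph` type, the `layout` and `extract_text` helpers, and every number in it are hypothetical, chosen only to show how a stricter or looser threshold changes the result for identical glyph positions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph (hypothetical units)
    width: float  # advance width of the glyph


def layout(text, glyph_width, letter_gap, word_gap):
    """Place glyphs on a line. Spaces are NOT stored as glyphs;
    they only show up as extra horizontal distance (word_gap)."""
    glyphs = []
    x = 0.0
    for ch in text:
        if ch == " ":
            x += word_gap
            continue
        glyphs.append(Glyph(ch, x, glyph_width))
        x += glyph_width + letter_gap
    return glyphs


def extract_text(glyphs, gap_ratio):
    """Rebuild a string, guessing a space wherever the gap to the
    previous glyph exceeds gap_ratio * previous glyph width."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Slightly letter-spaced text: 1.5 units between letters,
# an extra 4.0 units between words.
glyphs = layout("hour to say", glyph_width=5.0, letter_gap=1.5, word_gap=4.0)

print(extract_text(glyphs, gap_ratio=0.5))  # hour to say
print(extract_text(glyphs, gap_ratio=0.2))  # h o u r t o s a y
```

With the looser threshold, only the word gaps are read as spaces. With the stricter one, every letter gap also crosses the threshold, which gives the same kind of spaced-out output shown in the original report.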

@jbreiden
Contributor

jbreiden commented Feb 4, 2017

Definitely not related to OpenCL.
