Page size with the new pdf option #150

Closed
lanthaler opened this issue Nov 15, 2015 · 13 comments

@lanthaler

The new pdf option to directly create a PDF with embedded text is awesome. Unfortunately, I haven't been able to figure out yet how to specify the page size (e.g. A4, letter, ...). Is that possible?

@jbreiden
Contributor

Sorry, no. If the input image is A4 then the output PDF is A4. The design goal of Tesseract's PDF module is to not change anything about the image. If you want to modify page size, either change the input image or post process the output PDF.

@Wikinaut
Contributor

Here's a tip on how to rescale all pages to, e.g., DIN A4:
https://github.com/Wikinaut/utils/wiki#scale_all_pages_in_PDF_to_A4

@jbreiden
Contributor

jbreiden commented Feb 1, 2016

This issue should be closed. (Working as intended)

@zdenop zdenop closed this as completed Feb 1, 2016
@amitdo amitdo added the PDF label May 30, 2016
@bocekm

bocekm commented Aug 2, 2016

@jbreiden, how does Tesseract determine the page size of the input image? The page size depends on the DPI, which Tesseract has no information about. For example, an A5 page (5.83 x 8.27 inch) scanned at 300 DPI has a resolution of 1748 x 2480 pixels.
The problem is that when the input image has 1748 x 2480 pixels, Tesseract outputs a PDF file with a page size of 24.97 × 35.43 inch, not 5.83 x 8.27 inch.
Can you please reopen this issue, or should I create a new one?
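
The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: the helper names are illustrative and not part of Tesseract. It shows that the reported 24.97 x 35.43 inch page is what results when 1748 x 2480 pixels are interpreted at roughly 70 DPI instead of 300.

```python
def page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by a pixel size and a resolution."""
    return width_px / dpi, height_px / dpi


def implied_dpi(pixels, inches):
    """Resolution that maps a pixel length onto a physical length."""
    return pixels / inches


# A5 scanned at 300 DPI, as in the example above.
w_in, h_in = page_size_inches(1748, 2480, 300)
print(f"{w_in:.2f} x {h_in:.2f} inch")  # 5.83 x 8.27 inch

# Resolution implied by the page size of the PDF that was actually produced.
print(round(implied_dpi(1748, 24.97)), round(implied_dpi(2480, 35.43)))  # 70 70
```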

@jbreiden
Contributor

jbreiden commented Aug 2, 2016

Resolution (DPI) is extracted from the header of the input image. If missing, then Tesseract has no choice but to make something up. Don't do that! Many tools can be used to inspect and adjust DPI for an input image file. If you want to use ImageMagick, the commands are "identify -verbose" to inspect and "mogrify -density 300x300 -units PixelsPerInch" to set.
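
For PNG input, the resolution header in question is the pHYs chunk, which stores pixels per metre. As a rough illustration of what "inspecting the DPI" means without ImageMagick, the sketch below walks the PNG chunks with the Python standard library only. It is not how Tesseract or Leptonica read the value, the function name `png_dpi` is made up for this example, and CRCs are not verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if absent."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # unit 0 gives an aspect ratio only, no physical size
                return None
            return x_ppu * 0.0254, y_ppu * 0.0254  # pixels/metre -> pixels/inch
        if ctype in (b"IDAT", b"IEND"):  # pHYs must come before the image data
            return None
        pos += 12 + length  # skip length, type, data and CRC
    return None
```

Reading a file in binary mode and passing its bytes to `png_dpi` returns `None` when no resolution is stored, which is the case where Tesseract has to guess.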

@brlin-tw
Contributor

brlin-tw commented Jan 6, 2018

Is there any way to directly specify the image's DPI to Tesseract?

@jbreiden
Contributor

jbreiden commented Jan 6, 2018

No, there is not. And I am reluctant to add this capability.

@monuminu

There should be an option to specify the size of the output PDF. The PDF page size is becoming very large. If anyone has found a way to avoid this, please let me know.

@Wikinaut
Contributor

I have made related proposals many times, at least with the goal of combining the original image and the OCRed text afterwards. All of them were dismissed. I fully support your proposal!

@jbreiden
Contributor

jbreiden commented Jul 27, 2019 via email

@nettoyoussef

nettoyoussef commented Aug 19, 2020

I have a file whose original page format is A4.
When I convert it to images and then run OCR with Tesseract, the page size changes, as described here.

I tried setting the DPI using ImageMagick's convert, as shown below. However, the output PDF kept the same page size as before.
What am I doing wrong?

$ identify my_file.png  
my_file.png PNG 3040x4560 3040x4560+0+0 8-bit Gray 2c 29989B 0.000u 0:00.000

$ convert my_file.png -page a4 my_file-1.png

$ identify my_file-1.png                        
my_file-1.png PNG 3040x4560 595x842+0+0 8-bit Gray 2c 35878B 0.000u 0:00.000

$ tesseract -l eng my_file-1.png my_file-1 pdf

$ pdfinfo my_file-1.pdf
Title:          
Producer:       Tesseract 3.04.01
CreationDate:   Wed Aug 19 11:26:53 2020 -03
...
Pages:          1
Page size:      3040 x 4560 pts
Page rot:       0
File size:      11336 bytes
Optimized:      no
PDF version:    1.5

However, if I skip tesseract, and export directly to pdf, everything is fine:

$ convert my_file.png -page A4 my_file-1.pdf && pdfinfo my_file-1.pdf
...
Producer:       https://imagemagick.org
CreationDate:   Wed Aug 19 16:43:17 2020 -03
ModDate:        Wed Aug 19 16:43:17 2020 -03
...
Pages:          1
Page size:      595.165 x 842.234 pts (A4)
Page rot:       0
File size:      45306 bytes
Optimized:      no
PDF version:    1.3

If I use mogrify, I can alter the resolution:

mogrify -density 360x360 -units PixelsPerInch my_file.png

And that indeed comes close to A4, but I have to calculate the resolution for each image accordingly.

If I don't know the original dpi which was used, how can I automatically set the image size using mogrify or tesseract (i.e., without having to calculate it manually for each image separately)?

My images are extracted directly from a PDF to which I want to add an OCR layer.
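
The identify output above shows that `-page a4` changed only the page geometry (595x842+0+0), not the pixel dimensions or the density, which fits the unchanged PDF page size. One way to avoid working out the density by hand is to compute it from the pixel dimensions and the target page size. The sketch below is an illustration only: `fit_dpi` and `mogrify_command` are hypothetical helpers, and the command is built but not executed. Because a 3040 x 4560 image has a 2:3 aspect ratio and A4 does not, a single uniform DPI can make the image fit inside A4 but cannot make it exactly A4.

```python
import math

A4_POINTS = (595.276, 841.890)  # 210 x 297 mm in PDF points (1/72 inch)


def fit_dpi(width_px, height_px, page_pts=A4_POINTS):
    """Smallest uniform DPI at which the image fits inside the page."""
    page_w_in = page_pts[0] / 72.0
    page_h_in = page_pts[1] / 72.0
    return max(width_px / page_w_in, height_px / page_h_in)


def mogrify_command(path, width_px, height_px, page_pts=A4_POINTS):
    """Build (but do not run) a mogrify call that stamps the fitted DPI."""
    # Round up so the resulting page never exceeds the target size.
    dpi = math.ceil(fit_dpi(width_px, height_px, page_pts))
    return ["mogrify", "-density", f"{dpi}x{dpi}",
            "-units", "PixelsPerInch", path]


print(" ".join(mogrify_command("my_file.png", 3040, 4560)))
# mogrify -density 390x390 -units PixelsPerInch my_file.png
```

Since the images come from an existing PDF, the page size that pdfinfo reports for the original could be passed as `page_pts` instead of assuming A4.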

@brlin-tw
Contributor

FYI, I've implemented a utility to fix the DPI of an image if you know its actual dimensions: Image Density Fixer, available for Linux from the Snap Store.

@Kunzol

Kunzol commented Nov 21, 2020

In my experience, the easiest way to bring the PDF to any page size is pdfjam. It keeps the text overlay in place.
