Page size with the new pdf option #150
Comments
Sorry, no. If the input image is A4 then the output PDF is A4. The design goal of Tesseract's PDF module is to not change anything about the image. If you want to modify the page size, either change the input image or post-process the output PDF.
Here's a tip on how to rescale all pages to e.g. DIN A4:
This issue should be closed. (Working as intended)
@jbreiden, how does Tesseract determine the page size of the input image? The page size depends on the DPI, which Tesseract has no information about. For example, an A5 page (5.83 x 8.27 inch) scanned at 300 dpi measures 1748 x 2480 pixels.
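To make the arithmetic in that example explicit, here is a minimal sketch of the conversion from physical page size to pixels. The function name `page_pixels` is made up for illustration and is not part of Tesseract or any other tool mentioned in this thread.

```python
MM_PER_INCH = 25.4  # exact by definition


def page_pixels(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a page scanned at `dpi`.

    Hypothetical helper for illustration only.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


# A5 is 148 x 210 mm, which is about 5.83 x 8.27 inch.
print(page_pixels(148, 210, 300))  # (1748, 2480)
```

The same pixel dimensions describe a different physical page at any other DPI, which is why the pixel count alone cannot tell you the page size.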
Resolution (DPI) is extracted from the header of the input image. If missing, then Tesseract has no choice but to make something up. Don't do that! Many tools can be used to inspect and adjust DPI for an input image file. If you want to use ImageMagick, the commands are "identify -verbose" to inspect and "mogrify -density 300x300 -units PixelsPerInch" to set.
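As a rough illustration of why the DPI in the image header matters: PDF page sizes are expressed in points (1/72 inch), so the page size follows from the pixel dimensions divided by the DPI. The sketch below only shows that relationship; `pdf_page_points` is a hypothetical name, and this is not a copy of Tesseract's code.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def pdf_page_points(width_px, height_px, dpi):
    """Return the (width, height) of the page in PDF points.

    Hypothetical helper for illustration only.
    """
    return (width_px / dpi * POINTS_PER_INCH,
            height_px / dpi * POINTS_PER_INCH)


# A 2480 x 3508 pixel image tagged as 300 dpi gives a page of
# about 595.2 x 841.92 points, which is A4.
w, h = pdf_page_points(2480, 3508, 300)
print(round(w, 2), round(h, 2))

# The same pixels tagged as 72 dpi give a page more than four times
# as wide and tall, so a wrong or missing DPI yields a huge page.
w, h = pdf_page_points(2480, 3508, 72)
print(round(w, 2), round(h, 2))
```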
Is there any way to directly specify the image's DPI to Tesseract?
No, there is not. And I am reluctant to add this capability.
There should be an option to specify the size of the output PDF. The PDF page size is becoming very large. If anyone has found a way to avoid this, please let me know.
I have made related proposals many times, at least to achieve the goal of mixing the original image with the OCRed text afterwards, and all of them were dismissed. I fully support your proposal!
Set the dpi of the input images. Use mogrify from ImageMagick or similar.
I have a file which has as its original format A4. I tried setting the dpi using magick convert as below. However, the files kept the same size as the original ones.
However, if I skip tesseract, and export directly to pdf, everything is fine:
If I use mogrify, I can alter the resolution:
And that indeed comes close to A4, but I have to calculate the resolution for each image accordingly. If I don't know the original DPI that was used, how can I automatically set the image size using mogrify or tesseract (i.e., without having to calculate it manually for each image separately)? My images are extracted directly from a PDF to which I want to add an OCR layer.
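One way to avoid doing the calculation by hand is to derive the density from the pixel dimensions and the page size you want. The sketch below assumes the target is portrait A4 (210 x 297 mm) and that you already know each image's pixel dimensions; `density_for_page` is a hypothetical helper, not a feature of mogrify or Tesseract.

```python
MM_PER_INCH = 25.4  # exact by definition


def density_for_page(width_px, height_px, page_mm=(210.0, 297.0)):
    """Return the (x_dpi, y_dpi) that maps the image onto the page size.

    Hypothetical helper for illustration only. `page_mm` defaults to
    portrait A4; pass another (width, height) in millimetres if needed.
    """
    page_width_in = page_mm[0] / MM_PER_INCH
    page_height_in = page_mm[1] / MM_PER_INCH
    return width_px / page_width_in, height_px / page_height_in


# A 2480 x 3508 pixel image needs close to 300 dpi on both axes
# to come out as A4.
x_dpi, y_dpi = density_for_page(2480, 3508)
print(round(x_dpi), round(y_dpi))  # 300 300
```

The two values could then be passed to the "mogrify -density ... -units PixelsPerInch" command quoted earlier in the thread. If the image's aspect ratio differs from the target page, the two values will differ noticeably, and you have to decide which axis should win.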
FYI, I've implemented a utility to fix the DPI of an image if you know its actual dimensions: Install Image Density Fixer for Linux using the Snap Store | Snapcraft.
In my experience, the easiest way to get the PDF to any size is using
The new pdf option to directly create a PDF with embedded text is awesome. Unfortunately, I haven't been able to figure out yet how to specify the page size (e.g. A4, letter, ...). Is that possible?