Searchable-Image-PDF-Creat-O-Mat

This batch script turns a PDF containing one or more scanned pages into a searchable PDF, using the image processing software ImageMagick and the OCR software Tesseract. To start the process, drag and drop one or more PDF files onto the batch file, or run it from the command line.

Prerequisites:

ImageMagick (7.0.8-27 and newer) https://imagemagick.org/
Ghostscript (9.xx) https://www.ghostscript.com/
Tesseract (4.0 and newer) https://github.com/tesseract-ocr/tesseract/wiki
Operating System: Microsoft Windows 7 (with PowerShell), 8, 8.1 or 10

How to use it

Installation

Install ImageMagick
Install Ghostscript
Install Tesseract
Put the batch file into the folder where you or your scanner store the scanned PDF files
Open the batch file with a text editor (e.g. Notepad or SciTE) and
- fill in the correct (absolute) folders of the ImageMagick, Ghostscript and Tesseract executable files at the beginning of the file.
- edit the source language if necessary. You can set one or more languages that Tesseract should look for in the scanned documents. (Note: there are different types of training data files for Tesseract. These seem to be a good choice: https://github.com/tesseract-ocr/tessdata_best )
- save the changes.
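
As a small illustration of the language setting: Tesseract's -l option takes several languages joined with a plus sign (for example deu+eng). The sketch below is written in Python rather than in the script's batch syntax, and the helper name tesseract_language_arg is hypothetical. It only shows how such a value is formed, assuming the script passes its language setting on to -l.

```python
def tesseract_language_arg(languages):
    """Join language codes the way Tesseract's -l option expects them."""
    # Drop empty entries and surrounding whitespace before joining.
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


print(tesseract_language_arg(["deu", "eng"]))  # deu+eng
```

Each code must match an installed training data file, e.g. deu.traineddata for deu.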

Usage

Drag and drop one or multiple PDF files onto this batch file to start the process

or
Use the command line to start the script: <script filename> [pdf filename #1] [pdf filename #2] ... [pdf filename #n]

The Process

The script uses ImageMagick and Ghostscript to extract the scanned pages from the PDF file and stores them temporarily in a subfolder of the batch file's location.
ImageMagick then deskews the image files to improve the OCR results (there is an option to prevent that).
Tesseract then processes the temporary image files and creates a new PDF file with a searchable text layer.
Afterwards, Ghostscript repacks the PDF file to make it smaller (there is an option to prevent that).
The batch file also creates a further subfolder (\searchable_PDF) in which it stores the searchable PDF files.
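
The steps above can be sketched as a sequence of command lines. This is a rough Python illustration, not the batch file's actual code: the function name, the tmp folder, the file names, the 300 dpi density, the 40% deskew threshold and the /ebook preset are all assumptions. Extraction and deskewing are combined into one ImageMagick call here, which may differ from what the script does.

```python
from pathlib import Path

# Hypothetical executable names; the batch file uses the absolute paths
# configured at the beginning of the file instead.
MAGICK = "magick"          # ImageMagick 7
TESSERACT = "tesseract"
GHOSTSCRIPT = "gswin64c"   # 64-bit Ghostscript console executable on Windows


def build_commands(pdf, workdir, languages=("eng",), deskew=True, repack=True):
    """Return the command lines for one scanned PDF, in execution order."""
    pdf = Path(pdf)
    tmp = Path(workdir) / "tmp"                  # assumed temporary subfolder
    out_dir = Path(workdir) / "searchable_PDF"
    pages = tmp / "page-%03d.png"                # ImageMagick numbers the pages
    page_list = tmp / "pages.txt"                # lists the page images, one per line
    ocr_base = tmp / pdf.stem                    # Tesseract appends ".pdf"

    # Step 1: rasterize the PDF (ImageMagick delegates to Ghostscript)
    # and optionally deskew each page.
    extract = [MAGICK, "-density", "300", str(pdf)]
    if deskew:
        extract += ["-deskew", "40%"]
    extract.append(str(pages))

    # Step 2: OCR all pages into one PDF with a text layer. The caller has
    # to write pages.txt after step 1 has produced the images.
    ocr = [TESSERACT, str(page_list), str(ocr_base),
           "-l", "+".join(languages), "pdf"]

    commands = [extract, ocr]

    # Step 3 (optional): rewrite the PDF with Ghostscript to shrink it.
    # Without this step the OCR result would simply be moved to out_dir.
    if repack:
        commands.append([
            GHOSTSCRIPT, "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={out_dir / pdf.name}", f"{ocr_base}.pdf",
        ])
    return commands


for command in build_commands("scan.pdf", ".", languages=("deu", "eng")):
    print(" ".join(command))
```

Each returned list could be handed to subprocess.run once the tmp and searchable_PDF folders exist.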
