
Searchable-Image-PDF-Creat-O-Mat

This batch script creates a searchable PDF from a PDF with one or more scanned pages, using the image handling software ImageMagick and the OCR software Tesseract. To start the process, drag and drop one or more PDF files onto the batch file, or run it from the command line.

Prerequisites:

How to use it

Installation

  • Install ImageMagick
  • Install Ghostscript
  • Install Tesseract
  • Put the batch file into the folder where you or your scanner stores the scanned PDF files
  • Open the batch file with a text editor, e.g. Notepad or SciTE, and
    • fill in the correct (absolute) folders of the ImageMagick, Ghostscript and Tesseract executable files at the beginning of the file.
    • edit the source language if necessary. You can set one or more languages that Tesseract should look for in the scanned documents. (Note: there are different types of training data files for Tesseract. These seem to be a good choice: https://github.com/tesseract-ocr/tessdata_best )
    • save the changes.
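
As a small illustration of the language setting: Tesseract's -l option accepts several language codes joined with a plus sign (for example deu+eng). The helper below is only a sketch in Python, not part of the batch file; the function name tesseract_lang_arg is hypothetical, and how the batch file itself stores the language value is not shown here.

```python
def tesseract_lang_arg(languages):
    """Join language codes into the form Tesseract's -l option expects.

    Hypothetical helper for illustration only; each code must match an
    installed traineddata file (e.g. "eng" for eng.traineddata).
    """
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


if __name__ == "__main__":
    # German and English documents in the same scan
    print(tesseract_lang_arg(["deu", "eng"]))
```

Every code in the list needs a matching traineddata file in Tesseract's tessdata folder, otherwise Tesseract reports that the language could not be loaded.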

Usage

  • Drag and drop one or multiple PDF files onto this batch file to start the process

    or

  • Use a command line window to start the script: <script filename> [pdf filename #1] [pdf filename #2] ... [pdf filename #n]

The Process

  • The script uses ImageMagick and Ghostscript to extract the scanned pages from the PDF file and stores them temporarily in a subfolder of the batch file's location.
  • ImageMagick is then used to deskew the image files in order to get better OCR results (there is an option to prevent that).
  • The temporary image files are then processed by Tesseract, which creates a new PDF file with a searchable text layer.
  • Afterwards Ghostscript is used to repack the PDF file in order to get a smaller file (there is an option to prevent that).
  • The batch file also creates a further subfolder (\searchable_PDF) and stores the searchable PDF files there.
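
The steps above can be sketched as three command lines. The Python snippet below only builds the argument lists and does not run anything. It is a rough sketch under assumptions: the executable names, the 300 dpi density, the 40% deskew threshold and the /ebook Ghostscript preset are common choices, not necessarily the options the batch file uses, and build_commands is a hypothetical name.

```python
from pathlib import Path

# Assumed executable names; on Windows these would normally be the absolute
# paths configured at the top of the batch file.
MAGICK = "magick"          # ImageMagick 7 command-line tool
TESSERACT = "tesseract"
GHOSTSCRIPT = "gswin64c"   # Ghostscript console executable on 64-bit Windows


def build_commands(pdf, work_dir, out_dir, langs="eng", deskew=True, repack=True):
    """Return the command lines (as argument lists) for one scanned PDF."""
    pdf, work_dir, out_dir = Path(pdf), Path(work_dir), Path(out_dir)
    pages = work_dir / f"{pdf.stem}-%03d.png"    # one PNG per page
    page_list = work_dir / f"{pdf.stem}.txt"     # text file listing the PNGs
    # Tesseract appends ".pdf" to the output base name itself.
    ocr_base = (work_dir if repack else out_dir) / pdf.stem

    # 1. Rasterize the pages (ImageMagick hands PDF reading to Ghostscript)
    #    and optionally straighten them.
    extract = [MAGICK, "-density", "300", str(pdf)]
    if deskew:
        extract += ["-deskew", "40%"]
    extract.append(str(pages))

    # 2. OCR: Tesseract accepts a text file with one image path per line
    #    and writes a PDF with a searchable text layer.
    ocr = [TESSERACT, str(page_list), str(ocr_base), "-l", langs, "pdf"]

    commands = [extract, ocr]

    # 3. Optionally repack with Ghostscript to reduce the file size.
    if repack:
        commands.append([
            GHOSTSCRIPT, "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={out_dir / pdf.name}", f"{ocr_base}.pdf",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("scan.pdf", "temp", "searchable_PDF", langs="deu+eng"):
        print(" ".join(cmd))
```

In a real run, the page list file would have to be written between steps 1 and 2, once the PNG files exist, and the temporary folder would be cleaned up afterwards.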