Arabic PDF OCR - Searchable PDF

This script performs Optical Character Recognition (OCR) on a scanned PDF file containing Arabic text. It uses Tesseract OCR to extract text from each page, generates a searchable PDF, and saves the OCR text as a separate text file. It can help with digitizing Arabic text from scanned PDFs and creating searchable documents.

Requirements

Input / Output

  • Input: the filePath variable points to your input PDF file.
  • Output: a new PDF file with searchable text generated from the OCR results, and a text file containing the Arabic text extracted from each page.
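As an illustration of how the two output files relate to the input, here is a minimal sketch that derives both output paths from filePath. The function name output_paths and the "_ocr" suffix are hypothetical; the actual script may name its outputs differently.

```python
from pathlib import Path


def output_paths(file_path: str) -> tuple[Path, Path]:
    """Return (searchable_pdf_path, text_path) placed next to the input PDF."""
    src = Path(file_path)
    # Hypothetical naming scheme: "<name>_ocr.pdf" and "<name>_ocr.txt".
    pdf_out = src.with_name(f"{src.stem}_ocr.pdf")
    txt_out = src.with_name(f"{src.stem}_ocr.txt")
    return pdf_out, txt_out


pdf_out, txt_out = output_paths("/path/to/your/input.pdf")
print(pdf_out)
print(txt_out)
```

Keeping the outputs beside the input is only one possible layout; it avoids needing a separate output-directory setting.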

Usage

  1. Install the required libraries from requirements.txt.
  2. Modify the filePath variable to point to your input PDF file.
  3. If needed, set the path to the Tesseract OCR command in the script by modifying the line pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'.
  4. Run the script. The combined PDF and the extracted text will be saved in the same directory.
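The steps above can be sketched roughly as follows. This is not the script itself, only an outline under stated assumptions: pdf2image is assumed for rendering PDF pages to images and pypdf for merging the per-page PDFs (the README only names Tesseract and pytesseract), and the helper names ocr_pdf and join_page_texts are hypothetical. Tesseract's Arabic language data ("ara") must be installed for this to work.

```python
import io


def join_page_texts(page_texts):
    """Combine per-page OCR text into one string with page markers."""
    parts = [
        f"--- Page {i} ---\n{text.strip()}"
        for i, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(parts) + "\n"


def ocr_pdf(file_path, pdf_out, txt_out, lang="ara", dpi=300):
    """OCR every page of file_path; write a searchable PDF and a text file."""
    # Third-party imports are local so join_page_texts works without them.
    # pdf2image and pypdf are assumptions, not confirmed by the README.
    import pytesseract
    from pdf2image import convert_from_path
    from pypdf import PdfWriter

    writer = PdfWriter()
    page_texts = []
    for image in convert_from_path(file_path, dpi=dpi):
        # One single-page searchable PDF (image plus hidden text layer).
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            image, lang=lang, extension="pdf"
        )
        writer.append(io.BytesIO(pdf_bytes))
        # Plain text of the same page for the separate text file.
        page_texts.append(pytesseract.image_to_string(image, lang=lang))

    with open(pdf_out, "wb") as fh:
        writer.write(fh)
    with open(txt_out, "w", encoding="utf-8") as fh:
        fh.write(join_page_texts(page_texts))
```

A call such as ocr_pdf(filePath, "output.pdf", "output.txt") would then produce both files. Running OCR twice per page (once for the PDF, once for the text) is simple but slow; a real implementation may prefer a single pass.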

Example:

import pytesseract

# Set the path to the input PDF file
filePath = '/path/to/your/input.pdf'

# Set the path to the Tesseract OCR command
pytesseract.pytesseract.tesseract_cmd = '/path/to/your/tesseract'

# Run the script from a shell:
#   python script.py
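If you are unsure where Tesseract is installed, the path could be looked up instead of hard-coded. The following is a small sketch using only the standard library; find_tesseract is a hypothetical helper, and the fallback path is the one mentioned in the usage steps.

```python
import shutil


def find_tesseract(default="/usr/bin/tesseract"):
    """Return the tesseract executable found on PATH, else the given default."""
    return shutil.which("tesseract") or default


tesseract_cmd = find_tesseract()
print(tesseract_cmd)
# In the script, this value would be assigned to
# pytesseract.pytesseract.tesseract_cmd.
```

Note that the fallback is returned even if no file exists at that location, so a missing installation would still surface as an error when OCR is first run.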