### Different PDF parsers

- `PyMuPDF` is a Python binding for MuPDF, a lightweight PDF and XPS viewer and parser. It can extract text, images, and other data from PDF files and manipulate them programmatically: merging and splitting documents, adding annotations and bookmarks, and converting PDFs to other formats.

- `deepdoctection` is a Python library for document analysis and OCR (Optical Character Recognition). It provides tools for detecting text, images, and tables in PDF files and for running OCR on scanned documents, using deep learning models for both layout analysis and text recognition.
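
To make the difference between direct parsing and OCR concrete, here is a minimal sketch of reading a PDF's embedded text layer with `PyMuPDF`. It assumes the package is installed and importable as `fitz`. The function names `extract_text_layer` and `needs_ocr`, and the `min_chars` threshold, are hypothetical and chosen only for illustration.

```python
def extract_text_layer(pdf_path):
    """Return the embedded text of each page as a list of strings."""
    import fitz  # PyMuPDF; imported lazily so needs_ocr works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def needs_ocr(page_text, min_chars=20):
    """Heuristic (an assumption): a near-empty text layer suggests a scanned page."""
    return len(page_text.strip()) < min_chars
```

A page for which `needs_ocr` returns `True` probably has no usable text layer, which is the case where an OCR pipeline becomes necessary.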

#### For this example we will use `pdf2image` and `pytesseract`

The code below renders each page of the PDF as an image and runs OCR on it to extract the text:

1. Import the necessary libraries: `os`, `pytesseract`, `Image` from `PIL`, and `convert_from_path` from `pdf2image`.
2. Set the Tesseract command path to the correct location on your system.
3. Define the path to the PDF file you want to process.
4. Use the `convert_from_path` function from the `pdf2image` library to convert the PDF into a list of images, one for each page.
5. Create a directory named `ocr_images` in the same location as the PDF file to store the page images as JPEG files.
6. Iterate through the list of images, and for each image:
   a. Save the image as a JPEG file in the `ocr_images` directory with a name like `ocr_image_0.jpeg`, `ocr_image_1.jpeg`, etc.
   b. Add the file path of the saved JPEG image to a list `ocr_image_paths`.
7. Initialize an empty string `full_text` to store the extracted text from the images.
8. Iterate through the list of saved JPEG images, and for each image:
   a. Open the image using the `Image.open()` function from the `PIL` library.
   b. Extract the text from the image using the `pytesseract.image_to_string()` function.
   c. Append the extracted text to the `full_text` string, followed by a newline character.
9. Save the extracted text (the content of the `full_text` string) to a file named `ocr_text.txt` in the same location as the PDF file.

In short, the code renders the PDF pages as images, extracts their text with Tesseract OCR, and saves the result as a plain text file.

In [None]:
# Python packages
%pip install pdf2image pytesseract pillow PyMuPDF
# System dependencies (macOS with Homebrew); shell commands need `!`, not `%`
!brew install tesseract
!which tesseract
!brew install poppler
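
Before running the OCR cell, it can help to confirm that the system tools are actually reachable from Python. The following is a minimal sketch using only the standard library. It assumes the executables are named `tesseract` and `pdftoppm` (the Poppler renderer that `pdf2image` calls); `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    # This is the location to use for pytesseract's tesseract_cmd setting.
    print("tesseract:", shutil.which("tesseract"))
```

If `tesseract` is found, the printed path can be used in place of a hard-coded location when configuring `pytesseract`.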

In [None]:
import os
import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# Set Tesseract path (placeholder; use the location reported by `which tesseract`)
pytesseract.pytesseract.tesseract_cmd = r'/somepath/tesseract'

# PDF file path
pdf_file_path = '/somepath/test.pdf'

# Convert PDF to images
images = convert_from_path(pdf_file_path)

# Save images as OCR-like JPEGs
ocr_image_dir = os.path.join(os.path.dirname(pdf_file_path), 'ocr_images')
os.makedirs(ocr_image_dir, exist_ok=True)

ocr_image_paths = []
for i, img in enumerate(images):
    ocr_image_path = os.path.join(ocr_image_dir, f'ocr_image_{i}.jpeg')
    img.save(ocr_image_path, 'JPEG')
    ocr_image_paths.append(ocr_image_path)

# Perform OCR using Tesseract
full_text = ''
for ocr_image_path in ocr_image_paths:
    img = Image.open(ocr_image_path)
    text = pytesseract.image_to_string(img)
    full_text += text + '\n'

# Save the OCR text to a file
ocr_text_path = os.path.join(os.path.dirname(pdf_file_path), 'ocr_text.txt')
with open(ocr_text_path, 'w', encoding='utf-8') as f:
    f.write(full_text)
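
The JPEG round trip above is useful for inspecting the rendered pages, but it is not strictly required, since `pytesseract.image_to_string()` also accepts the in-memory `PIL` images that `convert_from_path` returns. The sketch below shows that variant. The function names `ocr_pages` and `pdf_to_text`, the injectable `ocr` parameter, and `dpi=300` are assumptions made for illustration, not part of the original code.

```python
def ocr_pages(images, ocr=None):
    """Run OCR on each page image and join the results, one newline per page.

    `ocr` is any callable mapping an image to a string; by default it is
    pytesseract.image_to_string.
    """
    if ocr is None:
        import pytesseract  # imported lazily so a stub can be passed instead
        ocr = pytesseract.image_to_string
    return "".join(ocr(img) + "\n" for img in images)


def pdf_to_text(pdf_path, dpi=300):
    """Render the PDF at `dpi` and OCR the pages without writing them to disk."""
    from pdf2image import convert_from_path

    return ocr_pages(convert_from_path(pdf_path, dpi=dpi))
```

Passing the OCR function as a parameter keeps the loop easy to check with a stand-in, for example `ocr_pages(["a", "b"], ocr=str.upper)`, and makes it simple to swap in a different OCR engine later.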
