# Raw RAG 03: OCR, Much Clearer

Many enterprises still rely heavily on paper documents. These documents are often scanned and stored as images, and the information in them has to be extracted and digitized. Optical Character Recognition (OCR) is the traditional tool for this task, but even advanced cloud-based OCR services often fall short on complex, real-world documents.

## The Limitations of Existing OCR Solutions

Despite technological advancements, OCR remains challenging:

- Traditional OCR struggles with handwritten text and non-standard formats
- Even cloud OCR services from tech giants like AWS and Google Cloud have limitations:
  - Constraints on document formats
  - Difficulty with unconventional layouts
  - Inconsistent performance across different document types
- Many solutions miss the mark when it comes to real-world, diverse business documents

These shortcomings highlight the need for more robust, flexible approaches to document preprocessing.

## Innovative Approaches

This notebook introduces three progressive techniques to significantly improve OCR results:

1. **Local OCR Implementation**: We'll start with a basic, locally-run OCR system. This gives us full control over the initial text extraction process.

2. **LLM-Based Correction and Enhancement**: We'll use Large Language Models (LLMs) to correct and refine the output from our local OCR. This combination has yielded surprisingly good results, especially for challenging documents.

3. **Multi-Modal LLM for Direct OCR**: Leveraging the latest advancements in AI, we'll explore using multi-modal LLMs to perform OCR directly on document images. This cutting-edge approach can potentially overcome many limitations of traditional OCR methods.

## What We'll Cover

In this notebook, we'll explore:

1. Setting up and running a local OCR system
2. Implementing LLM-based correction of OCR output
3. Utilizing multi-modal LLMs for direct image-to-text conversion
4. Analyzing and comparing the results of each approach
5. Discussing the pros and cons of each method for different document types

By exploring these three approaches - from traditional local OCR to advanced AI-driven techniques - we aim to provide a comprehensive overview of modern document preprocessing methods. This exploration will equip you with the knowledge to choose the best approach for your specific document understanding needs.

Let's dive in and discover how we can push the boundaries of OCR technology, addressing the limitations of both traditional and cloud-based solutions while exploring the exciting possibilities of multi-modal AI!

In [None]:
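# Install local OCR dependencies
# Note: pytesseract is only a wrapper; the Tesseract OCR binary must also be installed on the system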

%pip install PyPDF2 pytesseract pillow python-dotenv

In [None]:
# Load the environment variables from the .env file

from dotenv import load_dotenv
import os

dotenv_path = ".env"
load_dotenv(dotenv_path=dotenv_path)

In [None]:
import PyPDF2
import pytesseract
from PIL import Image
import io


def extract_text_from_pdf(pdf_path: str) -> str:
  """
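  Extracts text from a PDF file, falling back to Tesseract OCR on the
  embedded images of any page that has no extractable text layer.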

  Args:
    pdf_path (str): The path to the PDF file.

  Returns:
    str: The extracted text from the PDF file.
  """
  # Open the PDF file
  with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)

    # Initialize an empty string to store all the text
    full_text = ""

    # Iterate through each page
    for page in reader.pages:
      # Try to extract text directly first
      text = page.extract_text()

      # If no usable text was extracted (empty or whitespace only),
      # it might be an image-based PDF
      if not text or not text.strip():
        # Extract images from the page
        for image in page.images:
          # Open the image using PIL
          img = Image.open(io.BytesIO(image.data))

          # Use Tesseract to do OCR on the image
          text = pytesseract.image_to_string(img)

          # Add the extracted text to our full text
          full_text += text + "\n"
      else:
        # If text was extracted directly, add it to our full text
        full_text += text + "\n"

  return full_text


In [None]:
# Test the function with a US Supreme Court Opinion PDF
pdf_path = "docs/service-ll-usrep-usrep001-usrep001410a-usrep001410a.pdf"

In [None]:
# Perform local OCR text extraction

extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)

In [None]:
# Display the PDF file

from IPython.display import IFrame

IFrame(pdf_path, width=600, height=300)

Upon comparing the original PDF with the extracted text, it becomes evident that our OCR output, while functional, is far from perfect. The results are riddled with various issues:

- Numerous spelling mistakes
- Misinterpreted characters
- Formatting inconsistencies
- Potential loss of critical information

These imperfections underscore the limitations of traditional OCR techniques, especially when dealing with complex or low-quality documents. However, this is where our innovative approach comes into play.

In the next exciting step of our process, we'll harness the power of Large Language Models (LLMs) to address these shortcomings. By leveraging LLMs, we aim to:

1. Correct spelling and grammatical errors
2. Reconstruct misinterpreted words and phrases
3. Infer missing or unclear information based on context
4. Enhance the overall quality and readability of the extracted text

This LLM-driven correction step acts as a post-processing layer on top of the OCR output and can turn an imperfect extraction into more accurate, usable text. Because the model infers unclear content from context, the corrected text is worth spot-checking against the original scan. Let's see how much it improves our OCR results.
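
One practical point before calling the model: the cell below sends the whole extracted text in a single request, which may not fit the model's context window for longer documents. The following is a minimal sketch of one way to split the text first, assuming that blank lines are reasonable split points and that a character budget is a good enough proxy for tokens. The helper `split_into_chunks` and the `max_chars` value are hypothetical and not part of the original notebook.

```python
def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue

        # +2 accounts for the blank line used to rejoin paragraphs
        added_len = len(paragraph) + (2 if current else 0)

        # Flush the current chunk if this paragraph would overflow it
        if current and current_len + added_len > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added_len = len(paragraph)

        current.append(paragraph)
        current_len += added_len

    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical sample; in the notebook this would be extracted_text
sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for i, chunk in enumerate(split_into_chunks(sample, max_chars=35)):
    print(i, repr(chunk))
```

Each chunk could then be sent through the same correction prompt and the responses joined in order. Splitting on paragraph boundaries keeps sentences intact, which gives the model more context than a fixed-width cut would.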

In [None]:
# Enhance OCR output using LLM-based text refinement
#
# This step leverages a Large Language Model to:
#   1. Correct OCR-induced errors (e.g., misspellings, misinterpretations)
#   2. Improve text coherence and readability
#   3. Reconstruct ambiguous or partially extracted content
#
# The LLM acts as an intelligent post-processing layer, significantly
# elevating the quality of our extracted text beyond standard OCR capabilities.

from openai import OpenAI

client = OpenAI()

full_query = f"""Below is text extracted by OCR from a scanned PDF document. Please clean up the text and return it formatted. Don't add any text before or after the cleaned text.

Original text:
\"\"\"
{extracted_text}
\"\"\"

"""

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You help user with their request.",
        },
        {"role": "user", "content": full_query},
    ],
    model="gpt-4-turbo",
    temperature=0,
)

corrected_text = response.choices[0].message.content

print(corrected_text)

The LLM-enhanced result demonstrates a marked improvement over the original OCR output. We observe:

- Significant reduction in spelling and grammatical errors
- Enhanced coherence and readability of the extracted text
- Better preservation of the document's original meaning and context

This substantial enhancement underscores the power of combining traditional OCR with LLM-based refinement. However, our exploration doesn't stop here.

Next, we'll push the boundaries further by leveraging a state-of-the-art vision model to extract text directly from the PDF. This cutting-edge approach bypasses traditional OCR entirely, potentially offering:

1. Improved handling of complex layouts and non-standard fonts
2. Better interpretation of visual context and document structure
3. Possibly higher accuracy in challenging scenarios (e.g., handwritten text, low-quality scans)

By comparing the results of this vision model-based extraction with our LLM-enhanced OCR output, we'll gain valuable insights into the strengths and limitations of each approach. This comparison will help us determine the most effective method for various document types and quality levels.

Let's proceed with this exciting next step and see how the vision model performs in our document extraction task!

In [None]:
# Install additional dependencies

%pip install PyMuPDF

In [None]:
# Convert PDF first page to base64 string

import fitz  # PyMuPDF
from PIL import Image
import io
import base64

def pdf_first_page_to_base64(pdf_path: str) -> str:
  """
  Converts the first page of a PDF file to a base64 encoded string.

  Args:
    pdf_path (str): The path to the PDF file.

  Returns:
    str: The base64 encoded string representation of the first page of the PDF.
  """
  
  # Open the PDF file
  pdf_document = fitz.open(pdf_path)
  
  # Get the first page
  first_page = pdf_document[0]
  
  # Convert the page to an image
  pix = first_page.get_pixmap()
  
  # Convert pixmap to PIL Image
  img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
  
  # Create a byte stream
  img_byte_arr = io.BytesIO()
  
  # Save the image as PNG to the byte stream
  img.save(img_byte_arr, format='PNG')
  
  # Get the byte string
  img_byte_arr = img_byte_arr.getvalue()
  
  # Encode to base64
  base64_encoded = base64.b64encode(img_byte_arr).decode('utf-8')
  
  # Close the PDF document
  pdf_document.close()
  
  return base64_encoded


In [None]:
# Convert the first page of the PDF to a base64 string

base64_string = pdf_first_page_to_base64(pdf_path)
print(base64_string[:100])

In [None]:
# Extract text from the image using an OpenAI multi-modal model (GPT-4o)

img_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "extract text from the image, return it formatted. Do not add any additional text pre or post.",
                },
                {
                    "type": "image_url",
                    # The page was saved as PNG above, so label the data URL as image/png
                    "image_url": {"url": f"data:image/png;base64,{base64_string}"},
                },
            ],
        }
    ],
)

base64_resp = img_response.choices[0].message.content

print(base64_resp)

In [None]:
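def create_markdown_table(col1, col2, col3, text1, text2, text3):
    # A Markdown table row must stay on one line, so escape pipes
    # and turn newlines into <br> before placing text in a cell
    def cell(text):
        return text.strip().replace("|", "\\|").replace("\n", "<br>")

    table = f"""
| {col1} | {col2} | {col3} |
|----------|----------|----------|
| {cell(text1)} | {cell(text2)} | {cell(text3)} |
"""

    return table.strip()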

In [None]:
compare_text = create_markdown_table(
    "Local OCR Extraction",
    "GPT-4 Turbo Formatted (Based on Local OCR)",
    "GPT-4o Vision Extraction",
    extracted_text,
    corrected_text,
    base64_resp,
)

In [None]:
# Render the comparison table as Markdown

from IPython.display import Markdown, display

display(Markdown(compare_text))
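
Reading three long columns side by side only goes so far, so a rough numeric comparison can help. The sketch below uses `difflib.SequenceMatcher` from the standard library to score how similar each pair of outputs is after whitespace is normalized. The sample strings are hypothetical stand-ins; in this notebook they would be `extracted_text`, `corrected_text`, and `base64_resp`. The score only measures agreement between outputs, not correctness against a ground truth.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


# Hypothetical stand-ins for the three outputs produced in this notebook
outputs = {
    "local_ocr": "Th3 court  ruled in fav0r of the\nplaintiff.",
    "llm_corrected": "The court ruled in favor of the plaintiff.",
    "vision_model": "The court ruled in favor of the plaintiff.",
}

# Print the similarity of every pair of outputs
for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
    print(f"{name_a} vs {name_b}: {similarity(text_a, text_b):.3f}")
```

A low score between the local OCR output and the other two suggests heavy correction, which is a hint to review that document by hand. `SequenceMatcher` can be slow on very long texts, so comparing page by page may be more practical.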

**Conclusion:**
OCR alone is not enough for non-standardized text. Using an LLM to correct the OCR output can help, but at large data volumes it may not be feasible to run an LLM over every document. Local models can be a substitute for cloud services.
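
One way to act on the volume concern is to send only the noisiest OCR output to an LLM. The sketch below is an illustrative heuristic, not a method from this notebook: it measures the share of tokens that look like ordinary words and flags text that falls below a threshold. The function names and the `0.8` threshold are assumptions and would need tuning on real documents. Text with many numbers or citations will score low even when the OCR is correct.

```python
import string


def word_like_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def looks_like_word(token: str) -> bool:
        # Ignore surrounding punctuation, allow internal apostrophes and hyphens
        core = token.strip(string.punctuation)
        return core.replace("'", "").replace("-", "").isalpha()

    return sum(looks_like_word(t) for t in tokens) / len(tokens)


def needs_llm_cleanup(text: str, threshold: float = 0.8) -> bool:
    """Flag non-empty text whose word-like ratio is below the (assumed) threshold."""
    return bool(text.strip()) and word_like_ratio(text) < threshold


# Hypothetical per-page OCR outputs
pages = [
    "The court ruled in favor of the plaintiff.",
    "Th3 c0urt ru|ed 1n fav0r 0f the p1aintiff.",
]

for number, page_text in enumerate(pages, start=1):
    print(number, round(word_like_ratio(page_text), 2), needs_llm_cleanup(page_text))
```

Pages that pass the check can be kept as they are, and only the flagged pages go through the LLM correction step, which keeps cost roughly proportional to the amount of noisy input.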