# OCR and Data Extraction from Provided PDFs

## Objective
This notebook will extract claims and benefits data from the provided PDFs, "OCR Template Test" and "OCR Template Test Scanned". The extracted data will be structured into two pandas DataFrames: one for **claims data** and one for **benefits data**.

## Steps
1. Extract text from the PDFs: directly from the text-based PDF, and via OCR from the scanned PDF.
2. Parse the text to extract claims and benefits data.
3. Structure the parsed data into two pandas DataFrames.

---


In [21]:
!pip install tensorflow




In [24]:
import fitz  # PyMuPDF for PDF handling
import pytesseract  # For OCR
from PIL import Image
import io

# Ensure that Tesseract is installed and pytesseract knows its path (if on Windows)
# Uncomment and specify the correct path for Tesseract if necessary:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def extract_text_from_regular_pdf(pdf_path):
    """
    Extract text from a regular PDF using PyMuPDF without converting it to images.
    """
    text = ""
    # Open the PDF using PyMuPDF
    pdf_document = fitz.open(pdf_path)
    
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        
        # Extract text directly from the PDF (for regular PDFs)
        extracted_text = page.get_text("text")
        if extracted_text.strip():
            text += f"Text from Page {page_number + 1}:\n" + extracted_text + "\n"
    
    pdf_document.close()
    return text

def extract_text_from_scanned_pdf(pdf_path):
    """
    Extract text from a scanned PDF by converting each page to an image and using Tesseract OCR.
    """
    text = ""
    # Open the PDF using PyMuPDF
    pdf_document = fitz.open(pdf_path)
    
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        
        # Check if the page contains any text
        extracted_text = page.get_text("text")
        if not extracted_text.strip():
            # If no text is found, extract the images from the page and use OCR
            image_list = page.get_images(full=True)
            if image_list:
                for img_index, img in enumerate(image_list):
                    xref = img[0]
                    base_image = pdf_document.extract_image(xref)
                    image_bytes = base_image["image"]
                    pil_image = Image.open(io.BytesIO(image_bytes))
                    
                    # Use Tesseract to extract text from the image
                    ocr_text = pytesseract.image_to_string(pil_image)
                    text += f"OCR from Page {page_number + 1}:\n" + ocr_text + "\n"
        else:
            text += f"Text from Page {page_number + 1} (scanned but contains text):\n" + extracted_text + "\n"
    
    pdf_document.close()
    return text

# Paths to the regular and scanned PDFs
pdf_path_regular = 'OCR_template_test.pdf'
pdf_path_scanned = 'OCR_template_test_scanned.pdf'

# Extract text from the regular PDF using direct text extraction
regular_pdf_text = extract_text_from_regular_pdf(pdf_path_regular)
print("Extracted Text from Regular PDF:\n", regular_pdf_text)

# Extract text from the scanned PDF using Tesseract OCR
scanned_pdf_text = extract_text_from_scanned_pdf(pdf_path_scanned)
print("Extracted Text from Scanned PDF:\n", scanned_pdf_text)


Looking for C:\Users\owner\.keras-ocr\craft_mlt_25k.h5


ValueError: Unrecognized keyword arguments passed to Dense: {'weights': [array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.]], dtype=float32), array([1., 0., 0., 0., 1., 0.], dtype=float32)]}

---
## Step 2: Parsing the Extracted Text into Claims and Benefits DataFrames

We will now parse the text obtained from OCR into two structured DataFrames: one for **claims** data and one for **benefits** data. These DataFrames will contain the relevant columns as specified in the task.
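As a rough illustration of this parsing step, the sketch below turns lines of extracted text into row dictionaries with a regular expression. The line layout (`Mon-YYYY`, then four numeric columns separated by spaces), the sample text, and the helper names `CLAIMS_ROW`, `to_number`, and `parse_claims_rows` are all assumptions made for the example; the real PDFs may be laid out differently, and the pattern would need to be adapted accordingly.

```python
import re

# Hypothetical row layout (an assumption, not taken from the actual PDFs):
# "<Mon-YYYY> <insured lives> <claims> <paid> <paid with VAT>"
CLAIMS_ROW = re.compile(
    r"^(?P<month>[A-Za-z]{3}-\d{4})[ \t]+"
    r"(?P<lives>[\d,]+)[ \t]+"
    r"(?P<claims>[\d,]+)[ \t]+"
    r"(?P<paid>[\d,]+(?:\.\d+)?)[ \t]+"
    r"(?P<paid_vat>[\d,]+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def to_number(raw):
    """Convert '1,250' to an int or '125,000.50' to a float."""
    cleaned = raw.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


def parse_claims_rows(text):
    """Return one dict per line that matches the assumed claims layout."""
    rows = []
    for m in CLAIMS_ROW.finditer(text):
        rows.append({
            "Monthly claims": m.group("month"),
            "Number of insured lives": to_number(m.group("lives")),
            "Number of claims": to_number(m.group("claims")),
            "Amount of paid claims": to_number(m.group("paid")),
            "Amount of paid claims (with VAT)": to_number(m.group("paid_vat")),
        })
    return rows


# Made-up sample text, shaped like the output of the extraction functions
sample_text = """Text from Page 1:
Jan-2023 1,250 340 125,000.50 143,750.58
Feb-2023 1,260 310 98,400.00 113,160.00
Overall Limit 500,000
"""
rows = parse_claims_rows(sample_text)
```

Lines that do not fit the layout, such as the page header, are skipped. A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which avoids keeping several parallel lists in sync.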

---


In [None]:
# Step 2: Parsing Extracted Text for Claims and Benefits
import re

import pandas as pd

# Define patterns or rules for extracting the claims and benefits data from the text
# This will depend on the structure of your specific dataset and text format

# Sample dictionaries to store claims and benefits data
claims_data = {
    'Monthly claims': [],
    'Number of insured lives': [],
    'Number of claims': [],
    'Amount of paid claims': [],
    'Amount of paid claims (with VAT)': [],
    'Policy Year': [],
    'End date': [],
    'Class': [],
    'Overall Limit': []
}

benefits_data = {
    'Benefit_Sama': [],
    'Number of Claims': [],
    'Amount of Claims': [],
    'Amount of Claims with VAT': [],
    'Notes': [],
    'Policy Year': [],
    'End date': [],
    'Class': [],
    'Overall Limit': []
}

# Example parsing logic for claims data (to be adjusted based on text structure)
def parse_claims_data(text):
    # Placeholder pattern: replace it with a real expression for your text layout.
    # It needs at least one capture group; group 1 is stored as the month.
    pattern = re.compile(r'(pattern_to_extract_claims_data)')
    for match in pattern.finditer(text):
        # Add extracted data to the claims_data dictionary
        claims_data['Monthly claims'].append(match.group(1))
        # Continue parsing other fields
        # For simplicity, adding dummy values here
        claims_data['Number of insured lives'].append("dummy_value")
        claims_data['Number of claims'].append("dummy_value")
        claims_data['Amount of paid claims'].append("dummy_value")
        claims_data['Amount of paid claims (with VAT)'].append("dummy_value")
        claims_data['Policy Year'].append("dummy_value")
        claims_data['End date'].append("2024-12-31")  # Example of formatting end date
        claims_data['Class'].append("dummy_class")
        claims_data['Overall Limit'].append("dummy_limit")

# Example parsing logic for benefits data (to be adjusted based on text structure)
def parse_benefits_data(text):
    # Placeholder pattern: replace it with a real expression for your text layout.
    # It needs two capture groups: group 1 is the benefit name, group 2 the notes.
    pattern = re.compile(r'(pattern_to_extract_benefits_data)(pattern_to_extract_notes)')
    for match in pattern.finditer(text):
        # Add extracted data to the benefits_data dictionary
        benefits_data['Benefit_Sama'].append(match.group(1))
        # Continue parsing other fields
        benefits_data['Number of Claims'].append("dummy_value")
        benefits_data['Amount of Claims'].append("dummy_value")
        benefits_data['Amount of Claims with VAT'].append("dummy_value")
        benefits_data['Notes'].append(parse_notes(match.group(2)))  # Example note processing
        benefits_data['Policy Year'].append("dummy_value")
        benefits_data['End date'].append("2024-12-31")  # Example of formatting end date
        benefits_data['Class'].append("dummy_class")
        benefits_data['Overall Limit'].append("dummy_limit")

# Example function to parse the Notes column in the benefits data
def parse_notes(note):
    # Look for a percentage such as "20%"; a bare "%" without digits is ignored
    percentage = re.search(r'\d+%', note)
    if percentage:
        return percentage.group(0)  # Return only the percentage
    if 'cesarean' in note.lower():
        return 'Yes'
    return 'No info'  # If empty or no specific info

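# Example usage: parse the text extracted in Step 1.
# The regular PDF's text is used here; pass scanned_pdf_text to parse the OCR output instead.
parse_claims_data(regular_pdf_text)
parse_benefits_data(regular_pdf_text)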

# Convert dictionaries to DataFrames
claims_df = pd.DataFrame(claims_data)
benefits_df = pd.DataFrame(benefits_data)

# Display the DataFrames
print("Claims DataFrame:")
print(claims_df)

print("Benefits DataFrame:")
print(benefits_df)


---
## Step 3: Finalizing the DataFrames

Ensure that:
- The **end date** is formatted as `yyyy-mm-dd`.
- The **Notes** column in the benefits DataFrame is parsed to return only the percentage, cesarean coverage, or "No info" based on its content.

The final cell displays the two DataFrames built in Step 2, one for **claims** and one for **benefits**, and can optionally save them to CSV.
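The two formatting rules listed above could be applied with helpers along the following lines. This is only a sketch: the accepted input date formats in `DATE_FORMATS` (day-first, with an English month abbreviation) are assumptions about how dates appear in the PDFs, and `normalize_end_date` and `normalize_note` are hypothetical names introduced for the example.

```python
import re
from datetime import datetime

# Assumed input formats (day-first); extend this tuple to match the real PDFs.
# "%b" depends on the locale and is assumed to be English here.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%Y-%m-%d")


def normalize_end_date(raw):
    """Return the date as yyyy-mm-dd, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def normalize_note(note):
    """Reduce a free-text note to a percentage, 'Yes' (cesarean), or 'No info'."""
    note = (note or "").strip()
    percentage = re.search(r"\d+(?:\.\d+)?\s*%", note)
    if percentage:
        # Drop any whitespace between the number and the percent sign
        return re.sub(r"\s+", "", percentage.group(0))
    if "cesarean" in note.lower():
        return "Yes"
    return "No info"


# Made-up inputs for illustration
dates = [normalize_end_date(d) for d in ("31/12/2024", "31 Dec 2024", "2024-12-31", "n/a")]
notes = [normalize_note(n) for n in ("Co-pay 20 % up to limit", "Covers cesarean delivery", "")]
```

Returning `None` for an unrecognized date, rather than guessing, keeps unparsed values visible so they can be checked by hand. On a DataFrame, the helpers could be applied column-wise, for example with `benefits_df['Notes'].map(normalize_note)`.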

---


In [None]:
# Step 3: Final DataFrames - Displaying the structured claims and benefits data

# Output the final claims DataFrame
print("Final Claims DataFrame:")
display(claims_df)

# Output the final benefits DataFrame
print("Final Benefits DataFrame:")
display(benefits_df)

# Save the DataFrames to CSV or Excel if needed
# claims_df.to_csv('claims_data.csv', index=False)
# benefits_df.to_csv('benefits_data.csv', index=False)
