# OCR and Data Extraction from PDFs and Scanned Images

## Objective
We will develop a system that recognizes printed text in PDFs and scanned document images, converts it into machine-readable text, and structures the result as two pandas DataFrames: one for claims data and one for benefits data.

## Steps
1. Load the dataset (PDFs or scanned images).
2. Perform OCR to extract the text from PDFs or images.
3. Parse and structure the extracted text into two DataFrames: one for claims data and one for benefits data.
4. Ensure the data is properly formatted, including the end date (`yyyy-mm-dd`) and the specific parsing rules for the Notes column.

---


In [None]:
# Step 1: OCR Process for PDFs and Scanned Images

import pytesseract
from PIL import Image
import pdfplumber
import pandas as pd
import re

# OCR for scanned images (Example for PNG images)
def ocr_from_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

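# Text extraction for PDFs (reads the embedded text layer of every page).
# pdfplumber does not run OCR, so a scanned PDF without a text layer yields no text here.
def ocr_from_pdf(pdf_path):
    page_texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page with no text layer
            page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)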

# Example usage
pdf_text = ocr_from_pdf("path_to_pdf.pdf")
image_text = ocr_from_image("path_to_image.png")

# Print the extracted text (for demonstration)
print("Extracted Text from PDF:\n", pdf_text)
print("Extracted Text from Image:\n", image_text)
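
pdfplumber reads a PDF's embedded text layer and does not perform OCR itself, so a scanned PDF may come back empty from `ocr_from_pdf`. The following is a minimal sketch of one possible fallback, assuming Tesseract is installed: any page with little or no extracted text is rendered to an image and passed to `pytesseract`. The names `needs_ocr` and `extract_pdf_text_with_fallback`, the 20-character threshold, and the 300 DPI resolution are illustrative assumptions, not part of the original notebook.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic (assumed): treat a page as scanned if it has almost no text."""
    return len((page_text or "").strip()) < min_chars


def extract_pdf_text_with_fallback(pdf_path, min_chars=20, resolution=300):
    # Imported lazily so needs_ocr() can be used without these packages
    import pdfplumber
    import pytesseract

    page_texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if needs_ocr(text, min_chars):
                # Render the page to a PIL image and run Tesseract on it
                image = page.to_image(resolution=resolution).original
                text = pytesseract.image_to_string(image)
            page_texts.append(text)
    return "\n".join(page_texts)
```

A per-page check like this keeps the fast text-layer path for digital PDFs and only pays the OCR cost on pages that look scanned.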


---
## Step 2: Parsing the Extracted Text into Claims and Benefits DataFrames

We will now parse the text obtained from OCR into two structured DataFrames: one for **claims** data and one for **benefits** data. These DataFrames will contain the relevant columns as specified in the task.
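
As a rough illustration of what such parsing can look like, the sketch below turns lines of text into row dictionaries using a regular expression with named groups. The sample text, the `CLAIMS_ROW` pattern, and the assumed row layout (month, insured lives, claim count, paid amount) are hypothetical; the real pattern depends on how the OCR output of your documents is laid out.

```python
import re

# Hypothetical OCR output; real documents will be laid out differently
sample_text = """Monthly claims  Insured lives  Claims  Paid amount
Jan-2024  1,250  310  45,300.50
Feb-2024  1,248  295  41,020.00
"""

# Assumed row layout: month, insured lives, number of claims, paid amount
CLAIMS_ROW = re.compile(
    r"^(?P<month>[A-Za-z]{3}-\d{4})\s+"
    r"(?P<lives>[\d,]+)\s+"
    r"(?P<claims>[\d,]+)\s+"
    r"(?P<paid>[\d,]+\.\d{2})\s*$",
    re.MULTILINE,
)


def to_number(value):
    """Strip thousands separators and convert to int or float."""
    cleaned = value.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


def parse_claims_rows(text):
    rows = []
    for match in CLAIMS_ROW.finditer(text):
        rows.append({
            "Monthly claims": match.group("month"),
            "Number of insured lives": to_number(match.group("lives")),
            "Number of claims": to_number(match.group("claims")),
            "Amount of paid claims": to_number(match.group("paid")),
        })
    return rows


rows = parse_claims_rows(sample_text)
```

Passing `rows` to `pd.DataFrame(rows)` would then give one DataFrame row per matched line; the header line is skipped because it does not match the pattern.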

---


In [None]:
# Step 2: Parsing Extracted Text for Claims and Benefits

# Define patterns or rules for extracting the claims and benefits data from the text
# This will depend on the structure of your specific dataset and text format

# Sample dictionaries to store claims and benefits data
claims_data = {
    'Monthly claims': [],
    'Number of insured lives': [],
    'Number of claims': [],
    'Amount of paid claims': [],
    'Amount of paid claims (with VAT)': [],
    'Policy Year': [],
    'End date': [],
    'Class': [],
    'Overall Limit': []
}

benefits_data = {
    'Benefit_Sama': [],
    'Number of Claims': [],
    'Amount of Claims': [],
    'Amount of Claims with VAT': [],
    'Notes': [],
    'Policy Year': [],
    'End date': [],
    'Class': [],
    'Overall Limit': []
}

# Example parsing logic for claims data (to be adjusted based on text structure)
def parse_claims_data(text):
    # Placeholder pattern: replace it with a regex that matches one claims row
    # and defines one capture group per field
    pattern = re.compile(r'pattern_to_extract_claims_data')
    for match in pattern.finditer(text):
        # group(1) is the first captured field; read further fields with
        # match.group(2), match.group(3), ...
        claims_data['Monthly claims'].append(match.group(1))
        # Continue parsing other fields
        # For simplicity, adding dummy values here
        claims_data['Number of insured lives'].append("dummy_value")
        claims_data['Number of claims'].append("dummy_value")
        claims_data['Amount of paid claims'].append("dummy_value")
        claims_data['Amount of paid claims (with VAT)'].append("dummy_value")
        claims_data['Policy Year'].append("dummy_value")
        claims_data['End date'].append("2024-12-31")  # Example of formatting end date
        claims_data['Class'].append("dummy_class")
        claims_data['Overall Limit'].append("dummy_limit")

# Example parsing logic for benefits data (to be adjusted based on text structure)
def parse_benefits_data(text):
    # Placeholder pattern: the real regex needs two capture groups,
    # (1) the benefit name and (2) the raw notes text
    pattern = re.compile(r'pattern_to_extract_benefits_data')
    for match in pattern.finditer(text):
        # Add extracted data to the benefits_data dictionary
        benefits_data['Benefit_Sama'].append(match.group(1))
        # Continue parsing other fields
        benefits_data['Number of Claims'].append("dummy_value")
        benefits_data['Amount of Claims'].append("dummy_value")
        benefits_data['Amount of Claims with VAT'].append("dummy_value")
        benefits_data['Notes'].append(parse_notes(match.group(2)))  # Example note processing
        benefits_data['Policy Year'].append("dummy_value")
        benefits_data['End date'].append("2024-12-31")  # Example of formatting end date
        benefits_data['Class'].append("dummy_class")
        benefits_data['Overall Limit'].append("dummy_limit")

# Example function to parse the Notes column in the benefits data
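def parse_notes(note):
    note = note or ""  # Treat a missing note as empty
    # Accept "20%", "20 %", and decimal values such as "12.5%"
    percent = re.search(r'\d+(?:\.\d+)?\s*%', note)
    if percent:
        return re.sub(r'\s+', '', percent.group(0))  # Return percentage
    elif 'cesarean' in note.lower():
        return 'Yes'
    else:
        return 'No info'  # If empty or no specific info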

# Example usage: Parse the extracted text
parse_claims_data(pdf_text)
parse_benefits_data(pdf_text)

# Convert dictionaries to DataFrames
claims_df = pd.DataFrame(claims_data)
benefits_df = pd.DataFrame(benefits_data)

# Display the DataFrames
print("Claims DataFrame:")
print(claims_df)

print("Benefits DataFrame:")
print(benefits_df)


---
## Step 3: Finalizing the DataFrames

Ensure that:
- The **end date** is formatted as `yyyy-mm-dd`.
- The **Notes** column in the benefits DataFrame is parsed to return only the percentage, cesarean coverage, or "No info" based on its content.
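
One way to enforce the `yyyy-mm-dd` requirement is sketched below. The helper name `normalize_end_date` and the list of accepted input formats are assumptions for illustration; the formats actually present in your documents may differ, and ambiguous day/month orderings need a deliberate choice.

```python
from datetime import datetime

# Assumed input formats, tried in order; day-first is assumed for slashes
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%d %B %Y")


def normalize_end_date(raw):
    """Return the date as yyyy-mm-dd, or None if no known format matches."""
    value = (raw or "").strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Applied to a column, this could look like `benefits_df['End date'] = benefits_df['End date'].map(normalize_end_date)`. Returning `None` for unrecognized values makes the rows that need manual review easy to find.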

The cell below displays the two final pandas DataFrames, one for **claims** and one for **benefits**, and can optionally save them to disk.

---


In [None]:
# Step 3: Final DataFrames - Displaying the structured claims and benefits data

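# display() is built in inside Jupyter; the explicit import keeps the cell runnable in plain IPython
from IPython.display import display

# Output the final claims DataFrame
print("Final Claims DataFrame:")
display(claims_df)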

# Output the final benefits DataFrame
print("Final Benefits DataFrame:")
display(benefits_df)

# Save the DataFrames to CSV or Excel if needed
# claims_df.to_csv('claims_data.csv', index=False)
# benefits_df.to_csv('benefits_data.csv', index=False)
