<a href="https://colab.research.google.com/github/sprouse9/OCR/blob/main/InsuranceApplicationOCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
!apt-get install -y tesseract-ocr
!pip install pytesseract Pillow pdf2image

from PIL import Image
import pytesseract
import re
import requests
from io import BytesIO


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [12]:
import os

folder_path = "/content/"  # Replace with the path to your folder

# check if we have documents present in /content
has_files = any(os.path.isfile(os.path.join(folder_path, item)) for item in os.listdir(folder_path))


if has_files:
    print(f"The folder '{folder_path}' contains documents (files).")
else:
    print(f"The folder '{folder_path}' does not contain any documents (files).")



The folder '/content/' contains documents (files).


In [13]:
# Step 1: Get image files in folder
image_extensions = (".png", ".jpg", ".jpeg", ".tiff", ".bmp")
image_files = [f for f in os.listdir(folder_path) if f.lower().endswith(image_extensions)]

# Step 2: Check if any images exist
if image_files:
    print(f"Found {len(image_files)} image file(s): {image_files}")

    # Step 3: Load first image
    image_path = os.path.join(folder_path, image_files[0])
    image = Image.open(image_path)

    # Step 4: Perform OCR
    text = pytesseract.image_to_string(image)

    # Step 5: Check for date and signature
    date_pattern = r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4})\b'
    signature_keywords = ['signature', 'signed by']

    date_found = re.search(date_pattern, text, re.IGNORECASE)
    signature_found = any(keyword in text.lower() for keyword in signature_keywords)

    print("\nExtracted Text:\n", text)
    if date_found:
        print("\n✅ Date Found:", date_found.group(0))
    else:
        print("\n❌ Date Missing")
    print("✅ Signature Mentioned" if signature_found else "❌ Signature Missing")
else:
    print(f"No image files found in '{folder_path}'.")





Found 1 image file(s): ['LIfeApplication20230405-5593.png']

Extracted Text:
 LIFE INSURANCE
APPLICATION

INSURED’S INFORMATION

Full Name: John Doe
Address: _!234 Elm St.

 

 

Springfield , IL 62701
Date of Birth: June 8, 1983

BENEFICIARY DESIGNATION

Primary Beneficiary: Jane Doe

 

 

Signature: Gabe Dee
Date: 06/08/2023

 


✅ Date Found: June 8, 1983
✅ Signature Mentioned
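
Note that the date reported above, June 8, 1983, is the date of birth rather than the signing date (06/08/2023): `re.search` returns the first match in the text. A minimal sketch of one way around this, assuming the signing date sits on a line labeled `Date:` as it does on this form. `DATE_RE` and `find_labeled_date` are hypothetical names, not part of the notebook:

```python
import re

# Same two date shapes as the notebook's pattern: 06/08/2023 or June 8, 1983
DATE_RE = (
    r'(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'
    r'|(?:January|February|March|April|May|June|July|August|September'
    r'|October|November|December)\s+\d{1,2},\s+\d{4})'
)

def find_labeled_date(text, label="Date"):
    """Return the date on a line that starts with `label:`, or None."""
    # The label must be followed directly by a colon, so "Date of Birth:"
    # does not match when the label is "Date".
    pattern = rf'^\s*{re.escape(label)}\s*:\s*({DATE_RE})\b'
    match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None

# Shortened copy of the OCR output above, used as sample input
sample = "Date of Birth: June 8, 1983\nSignature: Gabe Dee\nDate: 06/08/2023\n"
signing_date = find_labeled_date(sample)
print("Signing date:", signing_date if signing_date else "missing")
```

Anchoring on the label keeps the `Date of Birth:` line from satisfying the check, though it still depends on OCR reading the label itself correctly.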


As the extracted text shows, the address field was read incorrectly; the leading `1` came out as `_!`:

**Address: _!234 Elm St.**

Let's attempt to improve OCR accuracy by enhancing contrast and removing noise.


In [14]:
import cv2
import numpy as np

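# Load image via OpenCV (imread returns None on failure instead of raising)
img_cv = cv2.imread(image_path)
if img_cv is None:
    raise FileNotFoundError(f"Could not read image: {image_path}")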

# Convert to grayscale
gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)

# Apply thresholding to clean up lines and underlines:
# pixels brighter than 150 become white (255), all others black (0),
# which gives Tesseract dark text on a light background
_, preprocessed = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Save and re-OCR
cv2.imwrite("/content/cleaned.png", preprocessed)
image_cleaned = Image.open("/content/cleaned.png")
text = pytesseract.image_to_string(image_cleaned)
print(text)


LIFE INSURANCE
APPLICATION

INSURED’S INFORMATION

FullName: John Doe
Address: 1234 Elm St.

 

Springfield , IL 62701
Date of Birth: June 8, 1983

BENEFICIARY DESIGNATION

Primary Beneficiary: Jane Doe

 

Signature: Gabe Dee
Date: 06/08/2023

 



The address field is now read correctly (although "Full Name" lost its space), but the signature is still transcribed as "Gabe Dee" rather than "John Doe".  
This is a real-world pain point in intelligent document processing (IDP):

Even when a signature is present, OCR often misreads handwritten names (here, "John Doe" became "Gabe Dee") because of:

* Inconsistent handwriting

* Stylized script fonts

* Noise, blur, or compression artifacts
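
One way to at least detect such a misread, sketched here under assumptions: compare the OCR'd signature line with the applicant's name using a fuzzy string ratio, and send low-scoring forms to manual review. `extract_field`, `name_similarity`, `signature_needs_review`, and the 0.8 cutoff are hypothetical choices, and the sample string is a shortened copy of the output above:

```python
import re
from difflib import SequenceMatcher

def extract_field(text, label_pattern):
    """Return the value after `label:` on its own line, or None if absent."""
    match = re.search(
        rf'^\s*{label_pattern}\s*:\s*(.+?)\s*$',
        text,
        re.IGNORECASE | re.MULTILINE,
    )
    return match.group(1) if match else None

def name_similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def signature_needs_review(text, threshold=0.8):
    """Flag a form whose OCR'd signature does not resemble the applicant name."""
    # \s* between the words tolerates both "Full Name" and "FullName"
    applicant = extract_field(text, r'Full\s*Name')
    signature = extract_field(text, r'Signature')
    if applicant is None or signature is None:
        return True  # a missing field also warrants a manual look
    return name_similarity(applicant, signature) < threshold

ocr_text = (
    "FullName: John Doe\n"
    "Address: 1234 Elm St.\n"
    "Signature: Gabe Dee\n"
    "Date: 06/08/2023\n"
)
print("Needs manual review:", signature_needs_review(ocr_text))
```

`difflib.SequenceMatcher` is part of the standard library. A low score does not prove the signature is wrong, only that the OCR result should not be trusted without a human look.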

