## Selecting the regions to extract from a scanned document

Because we screen scanned documents submitted by candidates, we first need to convert each PDF into an image. Once that is done, it helps a great deal to tell the algorithm where on the page to look for each piece of information before converting the image to text. Otherwise, we need extra post-processing of the text to isolate the elements our application requires.

To do this, we use a Matplotlib widget that reports the coordinates of the rectangle we draw around each element on the page image. We repeat the selection on many documents so that the final coordinates account for the noise and variability in how candidates scan their documents.
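One way to combine the selections made on several documents is to take the smallest rectangle that covers all of them and add a small safety margin. The sketch below illustrates that idea only; the helper name `merge_zones`, the `margin` parameter, and the sample coordinates are assumptions made for the example, not part of the original workflow.

```python
from typing import Iterable, Tuple

# (x, y, width, height), the same layout the selector callback stores
Zone = Tuple[float, float, float, float]

def merge_zones(samples: Iterable[Zone], margin: float = 0.0) -> Zone:
    """Return the smallest rectangle covering every sample, grown by `margin` pixels."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one zone is required")
    left = min(x for x, _, _, _ in samples) - margin
    top = min(y for _, y, _, _ in samples) - margin
    right = max(x + w for x, _, w, _ in samples) + margin
    bottom = max(y + h for _, y, _, h in samples) + margin
    # Do not let the margin push the rectangle outside the image
    left, top = max(left, 0), max(top, 0)
    return (left, top, right - left, bottom - top)

# Hypothetical selections of the same field on three different scans
name_samples = [(259, 928, 1874, 78), (250, 935, 1880, 75), (265, 920, 1860, 80)]
print(merge_zones(name_samples, margin=10))  # (240, 910, 1903, 110)
```

A larger margin tolerates more shift between scans, at the cost of possibly capturing neighbouring text.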

In [87]:
from pdf2image import convert_from_path
import matplotlib.pyplot as plt
from matplotlib.widgets import RectangleSelector
from PIL import Image

# Convert the first page of a PDF to an image
pdf_path = "/Users/gaetanrieben/Desktop/Automation_Work/RH_Lausanne/bulletin-1ere.pdf_24979517.pdf"  # Replace with your PDF path
pages = convert_from_path(pdf_path, dpi=300)
first_page = pages[0]

# To store rectangle coordinates
zones = []

# Callback function to capture rectangle coordinates
def on_select(eclick, erelease):
    """
    Callback function that gets the coordinates of the rectangle.
    :param eclick: Mouse click event (x1, y1)
    :param erelease: Mouse release event (x2, y2)
    """
    x1, y1 = eclick.xdata, eclick.ydata
    x2, y2 = erelease.xdata, erelease.ydata
    # Normalise to (x, y, width, height) whatever the drag direction
    x, y = min(x1, x2), min(y1, y2)
    width, height = abs(x2 - x1), abs(y2 - y1)
    zones.append((x, y, width, height))
    print(f"Zone: x={x:.0f}, y={y:.0f}, width={width:.0f}, height={height:.0f}")

# Function to display the image and enable rectangle drawing
def define_zones(image):
    """
    Allows the user to draw rectangles interactively on an image.
    :param image: PIL.Image object
    """
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image)
    ax.set_title("Drag to select zones. Close the window when done.")

    # Initialize RectangleSelector
    toggle_selector = RectangleSelector(
        ax, on_select,
        interactive=True  # Enables interactive rectangle drawing
    )

    plt.show()

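# Call the interactive function.
# This needs an interactive Matplotlib backend (for example `%matplotlib qt` in Jupyter);
# with the static inline backend no rectangle can be drawn and `zones` stays empty.
define_zones(first_page)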

# Display final zones
print("Final zones:", zones)


Final zones: []


## Converting the image to text and post-processing the text

In this step, we use Tesseract (through the pytesseract wrapper), which does a good job of converting images into text. Other options, such as the Google Cloud Vision API, would also work; I chose Tesseract because it is free.

We first define a function that extracts text from an image, given the image and one of the regions we defined earlier. This lets us read the text at an exact location on the page.

Once that is done, we post-process the extracted text. For instance, the scanned document contains a line in the following format: "First name Last name - Class number". We remove the parts we are not interested in and keep everything to the left of the " - " separator. The last name is then the last element, and the first name is the element before it. In French, compound first names such as Pierre-David are common, so it is important to keep the whole first name.
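The rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the helper name `split_candidate_name` and the sample string are made up, the separator is assumed to be a hyphen followed by a digit, and everything before the last token is treated as the first name.

```python
import re

def split_candidate_name(raw: str):
    """Return (first_name, last_name) from 'First name Last name - Class number'."""
    # Keep what precedes the "- <digit>" suffix; a hyphen inside a compound
    # first name is not followed by a digit, so it is left untouched
    name = re.split(r"\s*-\s*\d", raw, maxsplit=1)[0].strip()
    parts = name.split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    # Last token is the last name, everything before it is the first name
    return " ".join(parts[:-1]), parts[-1]

print(split_candidate_name("Pierre-David Dupont - 12"))  # ('Pierre-David', 'Dupont')
```

Returning empty strings for an empty OCR result keeps the caller from crashing on a badly scanned page; whether that or an exception is preferable depends on how the ATS import handles missing names.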

We also have to find the class level, which is either VP, VG or Gymnase and appears in the text written by the school.
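A minimal sketch of this keyword lookup, assuming a case-insensitive whole-word match is acceptable. The helper name `find_class_level` and the sample strings are hypothetical.

```python
import re
from typing import Optional

# Whole-word, case-insensitive match on the three known levels
LEVEL_PATTERN = re.compile(r"\b(gymnase|vp|vg)\b", re.IGNORECASE)

def find_class_level(text: str) -> Optional[str]:
    """Return 'GYMNASE', 'VP' or 'VG' if one appears in the text, else None."""
    match = LEVEL_PATTERN.search(text)
    return match.group(1).upper() if match else None

print(find_class_level("Voie: VP"))            # VP
print(find_class_level("gymnase, 1re année"))  # GYMNASE
print(find_class_level("no level here"))       # None
```

The word boundaries avoid false positives inside longer words, but they also mean that a level glued to digits (for example "11VG") would not match; in that case the `\b` anchors would have to be relaxed.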

Lastly, we have a list of classes (school subjects) and a list of grades. It is essential to match each class with the right grade; otherwise we would attach the wrong grade to a class, which would affect the candidate's ATS score. This is why the classes are stored in one array and the grades in another, in the same order, so that we can map them into a dictionary.
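The mapping step could look like the sketch below. It assumes both arrays are in the same top-to-bottom order and refuses to pair them when their lengths differ, since a missing OCR line would otherwise shift every following grade. The helper name `pair_subjects_with_grades` and the sample values are hypothetical.

```python
def pair_subjects_with_grades(subjects, grades):
    """Map each subject to the grade on the same row, failing loudly on a mismatch."""
    subjects = [s.strip() for s in subjects if s.strip()]
    grades = [g.strip() for g in grades if g.strip()]
    if len(subjects) != len(grades):
        raise ValueError(
            f"{len(subjects)} subjects but {len(grades)} grades; check the OCR output"
        )
    return dict(zip(subjects, grades))

# Made-up OCR output for illustration
print(pair_subjects_with_grades(["Français", "Mathématiques"], ["5.0", "4.5"]))
# {'Français': '5.0', 'Mathématiques': '4.5'}
```

Note that a dictionary keeps only one entry per key, so this approach assumes each subject name appears once per document.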

Those elements are then ready to be transferred into the ATS application.

In [85]:
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import re


# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'


def extract_text_from_region(image, region):
    """
    Extracts text from a specific region of an image using Tesseract OCR.
    :param image: PIL.Image object
    :param region: Tuple (x, y, width, height) for the ROI
    :return: Extracted text
    """
    x, y, width, height = region
    left = x
    upper = y
    right = x + width
    lower = y + height
    cropped_image = image.crop((left, upper, right, lower))
    text = pytesseract.image_to_string(cropped_image)
    return text

def zonal_ocr_from_pdf(pdf_path, regions):
    """
    Perform zonal OCR on a PDF.
    :param pdf_path: Path to the input PDF
    :param regions: Dictionary of regions with keys as labels and values as ROI tuples
    :return: Extracted data dictionary
    """
    # Convert PDF pages to images
    pages = convert_from_path(pdf_path, dpi=300)
    extracted_data = {}

    for page_number, page_image in enumerate(pages, start=1):
        page_data = {}
        for label, region in regions.items():
            text = extract_text_from_region(page_image, region)
            page_data[label] = text.strip()
        extracted_data[f'Page_{page_number}'] = page_data

    return extracted_data

def process_grades_and_classes(results):
    """
    Processes the Grades and Classes_Name into separate arrays, removing empty entries.
    :param results: Dictionary containing OCR results.
    :return: Tuple of arrays (classes_array, grades_array).
    """
    classes_array = []
    grades_array = []

    for page, data in results.items():
        if "Classes_Name" in data:
            classes_array = [cls for cls in data["Classes_Name"].split('\n') if cls.strip()]
        if "Grades" in data:
            grades_array = [grade for grade in data["Grades"].split('\n') if grade.strip()]

    return classes_array, grades_array

def process_name(results):
    """
    Splits the OCR'd Name field into first name and last name(s).
    :param results: Dictionary containing OCR results.
    :return: Tuple (first_name, last_name); empty strings if no name was found.
    """
    first_name, last_name = "", ""
    for page, data in results.items():
        if data.get("Name"):
            # Remove everything from the "-" followed by a number onwards; a hyphen
            # inside a compound first name is not followed by a digit, so it is kept
            name = re.split(r'\s*-\s*\d', data["Name"])[0].strip()
            name_parts = name.split()
            if not name_parts:
                continue  # nothing usable on this page
            first_name = name_parts[0]
            last_name = " ".join(name_parts[1:])
    return first_name, last_name

def process_class_level(results):
    """
    Looks for known keywords in the Class_Level field and returns the class level.
    :param results: Dictionary containing OCR results.
    :return: "GYMNASE", "VP" or "VG", or None if no keyword was found.
    """
    # Keywords are lowercase because the OCR text is lowercased before matching
    keywords = ["gymnase", "vp", "vg"]
    class_level = None
    for page, data in results.items():
        if "Class_Level" in data:
            class_level_text = data["Class_Level"].lower()
            for keyword in keywords:
                if keyword in class_level_text:
                    class_level = keyword.upper()
                    break

    return class_level

if __name__ == "__main__":
    # Define the input PDF
    pdf_path = "/Users/gaetanrieben/Desktop/Automation_Work/RH_Lausanne/bulletin-1ere.pdf_24979517.pdf"

    # Define the regions for Zonal OCR (adjust these coordinates based on your document layout)
    # Format: {'label': (x, y, width, height)}
    regions = {
        "Classes_Name": (259, 1188, 1366, 611),  # x, y, width, height based on latest screenshot
        "Name": (259, 928, 1874, 78),  # x, y, width, height based on latest screenshot
        "Class_Level": (228, 1006, 2159, 182),  # x, y, width, height based on latest screenshot
        "Grades": (1643, 1182, 230, 623)  # x, y, width, height based on latest screenshot
    }

    # Perform Zonal OCR
    results = zonal_ocr_from_pdf(pdf_path, regions)

    # Process Grades and Classes_Name into arrays
    classes_array, grades_array = process_grades_and_classes(results)

    first_name, last_name = process_name(results)
    class_level = process_class_level(results)