---

# Bounding Box Merger Project

## Objective

This project combines **Optical Character Recognition (OCR)** and **data science algorithms** to process images of an Excel sheet. Each image shows two columns of bounding box coordinates, and the two boxes in each row are closely related and could potentially be merged into a single bounding box. Specifically, the project aims to:

1. Develop an **OCR-based pipeline** to extract data from images and save this data into a CSV file.
2. Create a **data science model** to automatically identify and merge these related bounding boxes.

### Background

The dataset is a ZIP archive named `screening_data_csv.zip`, which contains **51 images**. Each image shows part of an Excel sheet with two columns: **List A** and **List B**. These columns contain bounding box coordinates in the format `[x1, y1, x2, y2]`. The coordinates in each row of List A are matched with the coordinates in the same row of List B, so each row is a pair of closely related boxes that can be merged.

### Expected Output

The project's final output should be a list of coordinates from Lists A and B, highlighting which bounding boxes can be combined into a single bounding box. This involves:

- Accurately identifying closely related bounding boxes.
- Merging these bounding boxes into a single entity.
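
The merge step can be sketched as follows. This is a minimal illustration, not the project's final model: it assumes that "merging" means taking the smallest axis-aligned rectangle that encloses both boxes, and that every box satisfies `x1 < x2` and `y1 < y2`. The name `merge_boxes` is hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def merge_boxes(box_a: Sequence[int], box_b: Sequence[int]) -> Box:
    """Return the smallest axis-aligned box that contains both inputs.

    Assumption: each box is given as [x1, y1, x2, y2] with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))


# Row 0 of List A and List B as extracted later in this notebook.
merged = merge_boxes([2929, 1727, 3056, 1801], [2962, 1442, 3038, 1693])
print(merged)  # (2929, 1442, 3056, 1801)
```

An enclosing rectangle is the simplest possible merge rule. Whether it is the right one for this dataset depends on how "closely related" is ultimately defined.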

### Deliverables

1. **OCR Pipeline Code**: To convert image data into CSV format.
2. **Data Science Model Code**: For merging related bounding boxes.
3. **Evaluation Metrics**: To assess the model's accuracy and efficiency in merging bounding boxes.
4. **Documentation**: Detailing the approach, algorithms used, and instructions for system execution on new datasets.
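
The deliverables do not name a specific evaluation metric. One common choice for comparing bounding boxes is Intersection over Union (IoU), sketched below as an assumption rather than as the metric this project uses. The function name `iou` is hypothetical, and boxes are assumed to be `[x1, y1, x2, y2]` with `x1 < x2` and `y1 < y2`.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero when disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0
```

A score of 1.0 means a predicted merged box coincides exactly with a reference box, and 0.0 means the two do not overlap at all.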

This project tests both the application of OCR technology and the use of data science algorithms for spatial analysis and decision-making.

---


# Data Extraction 

### 1. Unzip the ZIP file into the data/processed/images directory

In [78]:
# import zipfile
# import os

# # Path to your ZIP file
# zip_file_path = r"D:\bounding_box_merger\data\raw\screening_data_csv.zip"

# # Destination folder where you want to place the unzipped files
# destination_path = r"D:\bounding_box_merger\data\processed\images"

# # Create the destination folder if it doesn't exist
# if not os.path.exists(destination_path):
#     os.makedirs(destination_path)

# # Unzip the file
# with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
#     zip_ref.extractall(destination_path)

# print(f"Files extracted to {destination_path}")


Files extracted to D:\bounding_box_merger\data\processed\images
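
Before or after extraction, it can be useful to confirm that the archive holds the expected number of images (51 according to the Background section). The sketch below counts `.png` entries without extracting anything. The helper name `count_png_entries` and the file names in the demo archive are made up for illustration.

```python
import io
import zipfile


def count_png_entries(zip_source) -> int:
    """Count .png members of a ZIP archive without extracting it.

    `zip_source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return sum(
            1 for name in zip_ref.namelist() if name.lower().endswith(".png")
        )


# Self-contained demo on an in-memory archive with made-up file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("screening_data_csv_1/page_01.png", b"")
    demo_zip.writestr("screening_data_csv_1/page_02.png", b"")
    demo_zip.writestr("screening_data_csv_1/notes.txt", b"")
buffer.seek(0)
print(count_png_entries(buffer))  # 2
```

On the real data, the same function could be called with the path to `screening_data_csv.zip` and the result compared against 51.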


# Reading and Preprocessing the images

In [79]:
# # Import all the Libraries and Modules.
# import cv2
# import pytesseract
# import os
# import numpy as np
# import pandas as pd
# from pathlib import Path
# import re

# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


Split each image vertically: the left 30% of the image width is saved to a 'List A' folder and the remaining 70% to a 'List B' folder.

In [80]:
# # Define the source directory where all images are stored
# source_dir = Path(r"D:\bounding_box_merger\data\processed\images\screening_data_csv_1")

# # Define the target directories for List A and List B images
# list_a_dir = Path(r"D:\bounding_box_merger\data\processed\images\List_A")
# list_b_dir = Path(r"D:\bounding_box_merger\data\processed\images\List_B")

# # Create the directories if they don't exist
# list_a_dir.mkdir(parents=True, exist_ok=True)
# list_b_dir.mkdir(parents=True, exist_ok=True)

# # Iterate over each image in the source directory
# for image_path in source_dir.glob('*.png'):
#     # Load the image
#     image = cv2.imread(str(image_path))

#     # Check if the image was loaded successfully
#     if image is None:
#         print(f"Error: Unable to load image at {image_path}")
#     else:
#         # Calculate the split column index based on 30% for List A and 70% for List B
#         split_col_index = int(image.shape[1] * 0.3)  # 30% of the width

#         # Split the image
#         list_a_image = image[:, :split_col_index]
#         list_b_image = image[:, split_col_index:]

#         # Define the paths to save split images
#         list_a_image_path = list_a_dir / f'{image_path.stem}_list_a.png'
#         list_b_image_path = list_b_dir / f'{image_path.stem}_list_b.png'
        
#         # Save the split images
#         cv2.imwrite(str(list_a_image_path), list_a_image)
#         cv2.imwrite(str(list_b_image_path), list_b_image)


* Convert each split image to text using pytesseract.
* Clean the raw text with a regular expression to extract the numbers in the form [x1, y1, x2, y2] for List A and List B.
* Convert the extracted numbers to a DataFrame.

In [81]:
# # Function to perform OCR and extract bounding boxes to a DataFrame
# def ocr_to_dataframe(image_path, pattern):
#     # Load the image from disk
#     image = cv2.imread(image_path)

#     # Convert the image to grayscale to improve OCR accuracy
#     gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

#     # Perform OCR using PyTesseract
#     text = pytesseract.image_to_string(gray_image)

#     # Use regular expressions to find all bounding boxes in the text
#     matches = re.findall(pattern, text)

#     # Convert matches to DataFrame
#     return pd.DataFrame(matches, columns=['x1', 'y1', 'x2', 'y2']).astype(int)

# # Define the folders for List A and List B images
# list_a_folder = Path(r"D:\bounding_box_merger\data\processed\images\List_A")
# list_b_folder = Path(r"D:\bounding_box_merger\data\processed\images\List_B")

# # Pattern to extract bounding boxes
# pattern = r'\[?\(?(\d+), (\d+), (\d+), (\d+)\)?\]?'

# # Initialize empty DataFrames for List A and List B
# df_list_a = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'])
# df_list_b = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'])

# # Process List A images
# for image_path in list_a_folder.glob('*.png'):
#     df_temp = ocr_to_dataframe(str(image_path), pattern)
#     df_list_a = pd.concat([df_list_a, df_temp], ignore_index=True)

# # Process List B images
# for image_path in list_b_folder.glob('*.png'):
#     df_temp = ocr_to_dataframe(str(image_path), pattern)
#     df_list_b = pd.concat([df_list_b, df_temp], ignore_index=True)

# # Show the combined DataFrame shapes
# print(f"List A DataFrame shape: {df_list_a.shape}")
# print(f"List B DataFrame shape: {df_list_b.shape}")


List A DataFrame shape: (2503, 4)
List B DataFrame shape: (2503, 4)
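
The pattern used above expects exactly one space after each comma. OCR output does not always preserve spacing, so a more tolerant variant may be worth considering. The sketch below is an assumption about how such a variant could look, not the pattern that produced the results in this notebook. `BOX_PATTERN`, `parse_boxes`, and the sample text are hypothetical.

```python
import re

# Tolerant variant of the notebook's pattern: optional whitespace around commas.
BOX_PATTERN = re.compile(r"(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)")


def parse_boxes(text: str):
    """Return every group of four integers in the text as an (x1, y1, x2, y2) tuple."""
    return [tuple(int(v) for v in match) for match in BOX_PATTERN.findall(text)]


# Hypothetical OCR output; the spacing irregularities are made up.
sample_text = "[2929, 1727, 3056, 1801]\n(714,3826, 784 , 4033)\nList A"
print(parse_boxes(sample_text))
# [(2929, 1727, 3056, 1801), (714, 3826, 784, 4033)]
```

Lines that contain no four-number group, such as a column header, are simply skipped.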


Then convert both DataFrames to CSV format, one file for List A and one for List B.

In [82]:
# # Define file paths for the CSV files
# list_a_csv_path = r"D:\bounding_box_merger\data\processed\csv\list_a.csv"
# list_b_csv_path = r"D:\bounding_box_merger\data\processed\csv\list_b.csv"

# # Convert df_list_a to CSV
# df_list_a.to_csv(list_a_csv_path, index=False)

# # Convert df_list_b to CSV
# df_list_b.to_csv(list_b_csv_path, index=False)

# print(f"List A DataFrame saved to {list_a_csv_path}")
# print(f"List B DataFrame saved to {list_b_csv_path}")


List A DataFrame saved to D:\bounding_box_merger\data\processed\csv\list_a.csv
List B DataFrame saved to D:\bounding_box_merger\data\processed\csv\list_b.csv
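
Because rows are paired by position, a misread or dropped row in either list would silently shift every pair after it. A quick sanity check on the extracted boxes can catch some of these cases. The sketch below works on plain lists of tuples and assumes valid boxes satisfy `x1 < x2` and `y1 < y2`. The function names are hypothetical, and the third row of `list_a` is a made-up OCR misread.

```python
def find_invalid_rows(boxes):
    """Return the indices of boxes where x1 >= x2 or y1 >= y2."""
    return [
        i for i, (x1, y1, x2, y2) in enumerate(boxes)
        if x1 >= x2 or y1 >= y2
    ]


def check_paired(list_a, list_b):
    """Raise if the lists differ in length, else return invalid indices per list."""
    if len(list_a) != len(list_b):
        raise ValueError(f"Row count mismatch: {len(list_a)} vs {len(list_b)}")
    return find_invalid_rows(list_a), find_invalid_rows(list_b)


# First two rows come from this notebook's output; the third List A row
# is a made-up misread (x2 lost a digit) to show what the check reports.
list_a = [(2929, 1727, 3056, 1801), (714, 3826, 784, 4033), (4026, 3674, 413, 3781)]
list_b = [(2962, 1442, 3038, 1693), (800, 3858, 1032, 3987), (4136, 3655, 4310, 3802)]
print(check_paired(list_a, list_b))  # ([2], [])
```

Equal row counts and well-ordered coordinates do not prove the OCR was correct, but a failure of either check is a clear sign that something went wrong.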


# Data Pipeline

In [85]:
import os
import cv2
import zipfile
import pandas as pd
import pytesseract
import numpy as np
from pathlib import Path
import re

def process_images(zip_file_path, destination_path, tesseract_path):
    # Set tesseract path
    pytesseract.pytesseract.tesseract_cmd = tesseract_path

    # Extract zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(destination_path)
    print(f"Files extracted to {destination_path}")

    # Define directories
    source_dir = Path(destination_path, "screening_data_csv_1")
    list_a_dir = Path(destination_path, "List_A")
    list_b_dir = Path(destination_path, "List_B")

    # Create directories if they don't exist
    list_a_dir.mkdir(parents=True, exist_ok=True)
    list_b_dir.mkdir(parents=True, exist_ok=True)

    # Split images
    for image_path in source_dir.glob('*.png'):
        image = cv2.imread(str(image_path))
        if image is None:
            print(f"Error: Unable to load image at {image_path}")
        else:
            split_col_index = int(image.shape[1] * 0.3)  # 30% of the width
            list_a_image = image[:, :split_col_index]
            list_b_image = image[:, split_col_index:]
            list_a_image_path = list_a_dir / f'{image_path.stem}_list_a.png'
            list_b_image_path = list_b_dir / f'{image_path.stem}_list_b.png'
            cv2.imwrite(str(list_a_image_path), list_a_image)
            cv2.imwrite(str(list_b_image_path), list_b_image)

    # OCR to dataframe
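    def ocr_to_dataframe(image_path, pattern):
        image = cv2.imread(image_path)
        # cv2.imread returns None instead of raising when a file cannot be read
        if image is None:
            raise FileNotFoundError(f"Unable to load image at {image_path}")
        # Convert to grayscale to improve OCR accuracy
        gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray_image)
        # Each match is a tuple of four digit strings: (x1, y1, x2, y2)
        matches = re.findall(pattern, text)
        return pd.DataFrame(matches, columns=['x1', 'y1', 'x2', 'y2']).astype(int)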

    # Process List A and List B images
    pattern = r'\[?\(?(\d+), (\d+), (\d+), (\d+)\)?\]?'
    df_list_a = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'])
    df_list_b = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'])

    # glob() does not guarantee an order; sort both listings so that List A
    # and List B are read in the same file order and rows stay aligned
    for image_path in sorted(list_a_dir.glob('*.png')):
        df_temp = ocr_to_dataframe(str(image_path), pattern)
        df_list_a = pd.concat([df_list_a, df_temp], ignore_index=True)

    for image_path in sorted(list_b_dir.glob('*.png')):
        df_temp = ocr_to_dataframe(str(image_path), pattern)
        df_list_b = pd.concat([df_list_b, df_temp], ignore_index=True)

    print(f"List A DataFrame shape: {df_list_a.shape}")
    print(f"List B DataFrame shape: {df_list_b.shape}")

    # Delete processed images
    for image_path in list_a_dir.glob('*.png'):
        os.remove(image_path)

    for image_path in list_b_dir.glob('*.png'):
        os.remove(image_path)
    
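    # The output folder is hardcoded; create it if it does not exist yet
    csv_dir = Path(r"D:\bounding_box_merger\data\processed\csv")
    csv_dir.mkdir(parents=True, exist_ok=True)
    list_a_csv_path = csv_dir / "list_a.csv"
    list_b_csv_path = csv_dir / "list_b.csv"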

    # Convert df_list_a to CSV
    df_list_a.to_csv(list_a_csv_path, index=False)

    # Convert df_list_b to CSV
    df_list_b.to_csv(list_b_csv_path, index=False)

    print(f"List A DataFrame saved to {list_a_csv_path}")
    print(f"List B DataFrame saved to {list_b_csv_path}")

    return df_list_a, df_list_b


In [86]:
process_images(
    zip_file_path=r"D:\bounding_box_merger\data\raw\screening_data_csv.zip",
    destination_path=r"D:\bounding_box_merger\data\processed\images",
    tesseract_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
)

Files extracted to D:\bounding_box_merger\data\processed\images
List A DataFrame shape: (2503, 4)
List B DataFrame shape: (2503, 4)
List A DataFrame saved to D:\bounding_box_merger\data\processed\csv\list_a.csv
List B DataFrame saved to D:\bounding_box_merger\data\processed\csv\list_b.csv


(        x1    y1    x2    y2
 0     2929  1727  3056  1801
 1      714  3826   784  4033
 2     1970  2461  2028  2654
 3     5690  2156  5801  2247
 4     4026  3674  4138  3781
 ...    ...   ...   ...   ...
 2498  1930  1728  2049  1798
 2499  4028  3468  4110  3601
 2500  5714  2484  5794  2668
 2501  4277  1554  4357  1665
 2502  3922  1724  4064  1804
 
 [2503 rows x 4 columns],
         x1    y1    x2    y2
 0     2962  1442  3038  1693
 1      800  3858  1032  3987
 2     2042  2490  2284  2628
 3     5800  2111  5973  2267
 4     4136  3655  4310  3802
 ...    ...   ...   ...   ...
 2498  1967  1443  2036  1697
 2499  4139  3498  4343  3576
 2500  5802  2513  6034  2642
 2501  4382  1575  4608  1640
 2502  3963  1444  4037  1698
 
 [2503 rows x 4 columns])

The `process_images` function performs the following steps:

* Extract Images: Unzips the provided ZIP file into the destination folder.
* Split Images: Splits each image at 30% of its width, treating the left part as List A and the right part as List B.
* OCR Processing: Applies OCR to each part, extracts the bounding box coordinates with a regular expression, and collects the results in two pandas DataFrames.
* Clean Up: Deletes the intermediate split images once OCR is complete.
* Save to CSV: Writes the DataFrames to CSV files for List A and List B, making the data easier to access and use for further processing.