# Motivation
I want to try my hand at training a CNN for letter recognition sometime soon. In order to do that, I'm going to need to collect some training data. 

This notebook parses all of the labeled board images I have, extracts the processed image of each tile, and saves each tile as a separate image file.

# Setup
The cells below will set up the rest of this notebook. 

I'll start by configuring the kernel: 

In [1]:
# Change directories to the root of the project
%cd ..

# Enable autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision


Next, I'm going to import some relevant libraries:

In [2]:
# Import statements
import cv2
import pandas as pd
from pathlib import Path
import math
from matplotlib import pyplot as plt
import numpy as np
from statistics import mode
import utils
import pytesseract
from PIL import Image
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
import shutil

# Importing custom modules
import utils.board_detection as board_detect

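Finally, I'm going to create a `training-data/` folder inside `data/`; this is where I'll store the extracted tile images. Any existing copy of the folder is deleted first, so each run starts from scratch.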

In [3]:
# Declare the path to the training-data folder
training_data_path = Path("data/training-data")

# If it already exists, delete it (even if it's not empty)
if training_data_path.exists():
    shutil.rmtree(training_data_path)

# Create the training-data folder
training_data_path.mkdir(parents=True, exist_ok=True)

# Loading Data
Here, I'm going to load the table of labeled boards from `data/labeled-boards.csv`, as well as each of the board pictures it references.

In [4]:
# Open the .csv file containing the labeled boards
board_data_df = pd.read_csv("data/labeled-boards.csv")

# Add a column which is the parsed letter sequence
board_data_df["parsed_letter_sequence"] = board_data_df["letter_sequence"].apply(
    lambda letter_list: letter_list.split(";")
)

# Load all of the images using cv2
file_path_to_image = {}
for row in board_data_df.itertuples():
    file_path_to_image[row.file_path] = cv2.imread(row.file_path)
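
One caveat with the cell above: `cv2.imread` does not raise when a file is missing or unreadable; it returns `None` instead. Below is a minimal sketch of a check that could run right after loading. `find_unreadable_images` is a hypothetical helper (it isn't part of this project), and the example dictionary is made up.

```python
def find_unreadable_images(file_path_to_image):
    """Return the file paths whose image failed to load (i.e. is None)."""
    return [
        file_path
        for file_path, image in file_path_to_image.items()
        if image is None
    ]


# Toy stand-in for the real dictionary: any non-None object plays the
# role of a successfully loaded image
example_images = {"boards/good.png": object(), "boards/missing.png": None}
unreadable = find_unreadable_images(example_images)
print(unreadable)
```

If the returned list isn't empty, those paths are probably worth fixing in the .csv before parsing any boards.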

# Parsing Boards
Below, I'm going to run each of the boards labeled as "easy" through a "parsing" method. This will extract the "processed" letter image for each tile on the board. Boards that fail to parse are skipped.

In [5]:
# We'll collect some results about the board data here
all_parsed_boards_df_records = []

# Iterate through all of the rows in the board data
for row in tqdm(list(board_data_df.query("difficulty == 'easy'").itertuples())):
    # Try and parse the board
    error_msg = None
    try:
        letter_img_sequence = board_detect.parse_boggle_board(
            file_path_to_image[row.file_path],
            max_image_height=1200,
            return_parsed_img_sequence=True
        )

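    except Exception:
        # Skip boards that fail to parse; they won't appear in the results,
        # so error_msg is always None for the rows recorded below
        continue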

    # Add some information to the all_parsed_boards_df_records
    all_parsed_boards_df_records.append(
        {
            "file_path": row.file_path,
            "letter_img_sequence": letter_img_sequence,
            "error_msg": error_msg,
        }
    )

# Parse the results into a dataframe
all_parsed_boards_df = pd.DataFrame(all_parsed_boards_df_records)

100%|██████████| 8/8 [00:03<00:00,  2.20it/s]
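
As written, the loop above skips any board that fails to parse, so `error_msg` is always `None` in the recorded rows. If keeping track of failures turns out to be useful, one option is to record a row either way. Here's a minimal sketch of that idea. `parse_boards`, `parse_fn`, and `fake_parse` are hypothetical names; `parse_fn` only stands in for `board_detect.parse_boggle_board`, and I'm assuming nothing about that function beyond "returns the tile images or raises".

```python
def parse_boards(file_path_to_image, parse_fn):
    """Run parse_fn on every image, keeping one record per board.

    parse_fn is a stand-in for the real board parser: it takes an image
    and returns the parsed tile images, or raises on failure.
    """
    records = []
    for file_path, image in file_path_to_image.items():
        letter_img_sequence = None
        error_msg = None
        try:
            letter_img_sequence = parse_fn(image)
        except Exception as e:
            # Keep the failure instead of silently dropping the board
            error_msg = f"{type(e).__name__}: {e}"
        records.append(
            {
                "file_path": file_path,
                "letter_img_sequence": letter_img_sequence,
                "error_msg": error_msg,
            }
        )
    return records


def fake_parse(image):
    # Hypothetical parser used only for this demo: fails on None input
    if image is None:
        raise ValueError("no image")
    return {0: image}


records = parse_boards({"a.png": "img-a", "b.png": None}, fake_parse)
failed = [r["file_path"] for r in records if r["error_msg"] is not None]
print(failed)
```

With this approach, rows with a non-null `error_msg` have no `letter_img_sequence`, so they would need to be filtered out before extracting tiles.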


# Associating Images with Letters
Next, I'm going to create a DataFrame where each row represents a single tile from a labeled board image. Each tile image is also saved to the training-data folder along the way.

In [6]:
# We're going to store each row of the eventual DataFrame in this list 
parsed_letter_img_df_records = []

# Iterate through each row in the parsed boards DataFrame
for row in tqdm(list(all_parsed_boards_df.merge(board_data_df, on="file_path").itertuples())):
    
    # Extract some information about the current board
    cur_board_file_path = Path(row.file_path)
    cur_board_letter_sequence = row.parsed_letter_sequence
    cur_board_letter_img_sequence = row.letter_img_sequence
    
    # Iterate through each of the letters in the parsed letter sequence
    for tile_idx, letter in enumerate(cur_board_letter_sequence):
        
        # Determine the image associated with the current tile_idx
        cur_tile_img = cur_board_letter_img_sequence.get(tile_idx, None)
        if cur_tile_img is None:
            continue
        
        # Save the current tile image to the training-data directory
        cur_tile_img_file_path = training_data_path / f"{letter}_{cur_board_file_path.stem}_{tile_idx}.png"
        cv2.imwrite(str(cur_tile_img_file_path), cur_tile_img)
        
        # Store the information about the current tile in the parsed_letter_img_df_records
        parsed_letter_img_df_records.append(
            {
                "board_img_file_path": cur_board_file_path,
                "tile_idx": tile_idx,
                "letter": letter,
                "tile_img": cur_tile_img,
                "tile_img_file_path": cur_tile_img_file_path,
            }
        )

# Now, make a DataFrame from the parsed_letter_img_df_records
parsed_letter_img_df = pd.DataFrame(parsed_letter_img_df_records)

100%|██████████| 8/8 [00:00<00:00, 66.30it/s]
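
Since each tile is saved as `{letter}_{board file stem}_{tile_idx}.png`, the label can be recovered from the file name alone later on (for example, when building the training set for the CNN). Here's a minimal sketch of that. `parse_tile_file_name` is a hypothetical helper, the file names in the example are made up, and it assumes that the letter label itself never contains an underscore (the board file stem is allowed to).

```python
from pathlib import Path


def parse_tile_file_name(tile_img_file_path):
    """Split '<letter>_<board stem>_<tile_idx>.png' into its three parts.

    Assumes the letter label itself contains no underscore; the board
    stem is allowed to contain underscores.
    """
    stem = Path(tile_img_file_path).stem
    # The letter is everything before the first underscore...
    letter, rest = stem.split("_", 1)
    # ...and the tile index is everything after the last one
    board_stem, tile_idx = rest.rsplit("_", 1)
    return letter, board_stem, int(tile_idx)


parsed = parse_tile_file_name("data/training-data/Qu_board_01_7.png")
print(parsed)
```

Splitting once from the left and once from the right means that underscores in the board file name don't break the parsing.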


Now, with this DataFrame created, I'm going to drop the image column from it and save the rest as an index of the tile images.

In [7]:
# Save the index (i.e., a list of all of the tile images) to an Excel file
parsed_letter_img_df.drop(columns=["tile_img"]).to_excel(
    "data/training-data/index.xlsx", index=False
)

I'm also going to save some stats: a count of how many images I have of each tile.

In [9]:
# Determine the count of each tile in the dataset
original_tile_count_df = (
    parsed_letter_img_df.drop(columns=["tile_img"])
    .groupby("letter")
    .agg(count=("tile_idx", "count"))
    .reset_index()
)

# Ensure that all of the tile types are represented, with a count of 0
# for any tile that never appears in the dataset
from utils.settings import allowed_boggle_tiles

tile_count_df = (
    original_tile_count_df.set_index("letter")["count"]
    .reindex(allowed_boggle_tiles, fill_value=0)
    .rename_axis("letter")
    .reset_index()
)

# Store the tile counts in an Excel file
tile_count_df.to_excel("data/training-data/stats.xlsx", index=False)
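
As a quick cross-check on these stats, the same zero-filled count can be computed with just the standard library. This is a minimal sketch: `count_tiles` is a hypothetical helper, `allowed_tiles` stands in for `allowed_boggle_tiles`, and the labels in the example are made up.

```python
from collections import Counter


def count_tiles(letters, allowed_tiles):
    """Count each allowed tile, reporting 0 for tiles that never appear."""
    counts = Counter(letters)
    return {tile: counts.get(tile, 0) for tile in allowed_tiles}


# Made-up labels and tile list, for illustration only
tile_counts = count_tiles(["A", "E", "A", "T"], allowed_tiles=["A", "E", "S", "T"])
print(tile_counts)
```

Tiles with a count of 0 (or close to it) are the ones I'll need more pictures of before training.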