# Motivation
I want to try my hand at training a CNN for letter recognition sometime soon. In order to do that, I'm going to need to collect some training data. 

This notebook will parse through all of the labeled images I have, extract the processed tile images, and then save them as separate images. 

# Setup
The cells below will set up the rest of this notebook. 

I'll start by configuring the kernel: 

In [1]:
# Change directories to the root of the project
%cd ..

# Enable autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision


Next, I'm going to import some relevant libraries:

In [2]:
# Import statements
import cv2
import pandas as pd
from pathlib import Path
import math
from matplotlib import pyplot as plt
import numpy as np
from statistics import mode
import utils
import pytesseract
from PIL import Image
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
import shutil

# Importing custom modules
import utils.board_detection as board_detect

Finally, I'm going to create two folders inside `data/`: one for the training data, and one that will hold a copy of the original, unaugmented tiles.

In [3]:
# Declare the path to the training-data folder
training_data_path = Path("data/training-data")
original_data_path = Path("data/original-training-data")

# If either folder already exists, delete it (even if it's not empty)
if training_data_path.exists():
    shutil.rmtree(training_data_path)
    
if original_data_path.exists():
    shutil.rmtree(original_data_path)

# Create the training-data folder
training_data_path.mkdir(parents=True, exist_ok=True)
original_data_path.mkdir(parents=True, exist_ok=True)
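
The delete-then-recreate pattern above can be wrapped in a small helper. The sketch below is only an illustration: `reset_dir` is a hypothetical name, not something defined in this project, and it is demonstrated on a temporary directory rather than the real `data/` folders.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path: Path) -> Path:
    """Delete `path` if it exists (even if non-empty), then recreate it empty."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate on a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "training-data"
    demo_dir.mkdir()
    (demo_dir / "stale.png").write_bytes(b"old")

    reset_dir(demo_dir)
    remaining_files = list(demo_dir.iterdir())
    dir_exists_after_reset = demo_dir.exists()
```

Because the helper removes everything under the given path, it is only safe to point it at folders that hold regenerated output, as is the case here.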

# Loading Data
Here, I'm going to load in all of the pictures, as well as some information about each of them. 

In [4]:
# Open the .csv file containing the labeled boards
board_data_df = pd.read_csv("data/labeled-boards.csv")

# Add a column which is the parsed letter sequence
board_data_df["parsed_letter_sequence"] = board_data_df["letter_sequence"].apply(
    lambda letter_list: letter_list.split(";")
)

# Load all of the images using cv2
file_path_to_image = {}
for row in board_data_df.itertuples():
    file_path_to_image[row.file_path] = cv2.imread(row.file_path)
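
One thing to keep in mind: `cv2.imread` returns `None` rather than raising when a file cannot be read, so a bad path in the CSV would only surface later. A minimal sketch of a guard is below. `load_images` and its `reader` parameter are hypothetical names used for illustration, the labels are made up, and a stand-in reader is used so the example runs without any image files. In this notebook, `reader` would be `cv2.imread`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def split_letter_sequence(letter_sequence: str) -> List[str]:
    """Split a ';'-separated label string into one label per tile."""
    return letter_sequence.split(";")


def load_images(
    file_paths: List[str],
    reader: Callable[[str], Optional[Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Read each path with `reader`, separating successes from unreadable paths."""
    images: Dict[str, Any] = {}
    unreadable: List[str] = []
    for file_path in file_paths:
        img = reader(file_path)
        if img is None:
            unreadable.append(file_path)
        else:
            images[file_path] = img
    return images, unreadable


# Stand-in reader: pretends only "board-1.png" exists (assumption for the demo)
def fake_reader(file_path: str) -> Optional[str]:
    return "pixels" if file_path == "board-1.png" else None


images, unreadable = load_images(["board-1.png", "missing.png"], fake_reader)
labels = split_letter_sequence("A;Qu;E;T")  # made-up label string
```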

# Parsing Boards
Below, I'm going to run each of the boards labeled `easy` through a "parsing" method. This will extract the "processed" letter images for each of the different tiles.

In [5]:
# We'll collect some results about the board data here
all_parsed_boards_df_records = []

# Iterate through all of the rows in the board data
for row in tqdm(list(board_data_df.query("difficulty == 'easy'").itertuples())):
    # Try and parse the board
    error_msg = None
    try:
        letter_img_sequence = board_detect.parse_boggle_board(
            file_path_to_image[row.file_path],
            max_image_height=1200,
            return_parsed_img_sequence=True
        )

    except Exception:
        # Skip boards that fail to parse; they contribute no tiles
        continue

    # Add some information to the all_parsed_boards_df_records
    all_parsed_boards_df_records.append(
        {
            "file_path": row.file_path,
            "letter_img_sequence": letter_img_sequence,
            "error_msg": error_msg,
        }
    )

# Parse the results into a dataframe
all_parsed_boards_df = pd.DataFrame(all_parsed_boards_df_records)

100%|██████████| 30/30 [00:11<00:00,  2.72it/s]
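
The loop above skips any board whose parsing raises. If it's useful to know which boards failed and why, one option is to keep a record for every board and fill in `error_msg` on failure. This is a sketch with a stand-in parser: `parse_all` and `fake_parse` are hypothetical names, with `fake_parse` standing in for `board_detect.parse_boggle_board`. Downstream code would then need to filter out rows where `error_msg` is set.

```python
from typing import Any, Callable, Dict, List


def parse_all(
    file_paths: List[str],
    parse: Callable[[str], Dict[int, Any]],
) -> List[Dict[str, Any]]:
    """Run `parse` on every path, recording the error message on failure."""
    records = []
    for file_path in file_paths:
        letter_img_sequence, error_msg = None, None
        try:
            letter_img_sequence = parse(file_path)
        except Exception as e:
            error_msg = f"{type(e).__name__}: {e}"
        records.append(
            {
                "file_path": file_path,
                "letter_img_sequence": letter_img_sequence,
                "error_msg": error_msg,
            }
        )
    return records


# Stand-in parser: fails on one board (assumption for the demo)
def fake_parse(file_path: str) -> Dict[int, Any]:
    if "blurry" in file_path:
        raise ValueError("could not find board contour")
    return {0: "tile-0", 1: "tile-1"}


records = parse_all(["good.png", "blurry.png"], fake_parse)
failed = [r["file_path"] for r in records if r["error_msg"] is not None]
```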


# Associating Images with Letters
Next, I'm going to create a DataFrame where each row represents a single tile from a labeled board image.

In [6]:
# We're going to store each row of the eventual DataFrame in this list 
parsed_letter_img_df_records = []

# Iterate through each row in the parsed boards DataFrame
for row in tqdm(list(all_parsed_boards_df.merge(board_data_df, on="file_path").itertuples())):
    
    # Extract some information about the current board
    cur_board_file_path = Path(row.file_path)
    cur_board_letter_sequence = row.parsed_letter_sequence
    cur_board_letter_img_sequence = row.letter_img_sequence
    
    # Iterate through each of the letters in the parsed letter sequence
    for tile_idx, letter in enumerate(cur_board_letter_sequence):
        
        # Determine the image associated with the current tile_idx
        cur_tile_img = cur_board_letter_img_sequence.get(tile_idx, None)
        if cur_tile_img is None:
            continue
        
        # Save the current tile image to the training-data directory
        cur_tile_img_file_path = training_data_path / f"{letter}_{cur_board_file_path.stem}_{tile_idx}.png"
        
        # Save cur_tile_img to cur_tile_img_file_path as a PNG using PIL (PNG is lossless, so no compression artifacts are introduced)
        Image.fromarray(cur_tile_img).save(cur_tile_img_file_path)
        
        # Store the information about the current tile in the parsed_letter_img_df_records
        parsed_letter_img_df_records.append(
            {
                "board_img_file_path": cur_board_file_path,
                "tile_idx": tile_idx,
                "letter": letter,
                "tile_img": cur_tile_img,
                "tile_img_file_path": cur_tile_img_file_path,
            }
        )

# Now, make a DataFrame from the parsed_letter_img_df_records
parsed_letter_img_df = pd.DataFrame(parsed_letter_img_df_records)

100%|██████████| 30/30 [00:00<00:00, 64.10it/s]
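
Because each tile is saved as `{letter}_{board stem}_{tile_idx}.png`, the label can later be recovered from the file name alone, which is handy when building a training set straight from the folder. A minimal sketch, assuming labels never contain an underscore (`label_from_file_name` is a hypothetical helper, and the file names are made up):

```python
from pathlib import Path


def label_from_file_name(file_path: Path) -> str:
    """Return the tile label, i.e. everything before the first underscore."""
    return file_path.name.split("_", 1)[0]


# Made-up file names following the naming scheme used above
example_paths = [
    Path("data/training-data/A_board-01_0.png"),
    Path("data/training-data/Qu_board_02_13.png"),  # board stem may contain "_"
]
example_labels = [label_from_file_name(p) for p in example_paths]
```

Splitting on the first underscore only means that suffixes appended to the stem later on do not affect the recovered label.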


Now, with this DataFrame created, we're going to drop the images from it and then save it. 

In [7]:
# Save the index (i.e., a list of all of the tile images) to an Excel file
parsed_letter_img_df.drop(columns=["tile_img"]).to_excel(
    "data/training-data-index.xlsx", index=False
)

We're also going to save some stats; this will be a count of how many of each tile we have. 

In [8]:
# Determine the count of each tile in the dataset
original_tile_count_df = (
    parsed_letter_img_df.drop(columns=["tile_img"])
    .groupby("letter")
    .agg(count=("tile_idx", "count"))
    .reset_index()
)

# Ensure that all of the tile types are represented
from utils.settings import allowed_boggle_tiles

# Create a DataFrame to store the tile counts
tile_count_df_records = []
for tile in allowed_boggle_tiles:
    if tile not in original_tile_count_df["letter"].values:
        tile_count_df_records.append(
            {
                "letter": tile,
                "count": 0,
            }
        )
    else:
        tile_count_df_records.append(
            {
                "letter": tile,
                "count": original_tile_count_df.query(f"letter == '{tile}'")[
                    "count"
                ].values[0],
            }
        )
tile_count_df = pd.DataFrame(tile_count_df_records)

# Store the tile counts in an Excel file
tile_count_df.sort_values("count", ascending=False).to_excel(
    "data/training-data-stats.xlsx", index=False
)
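
The zero-filling step can also be expressed with `collections.Counter`, which returns 0 for keys it has never seen. A small sketch with made-up labels, where `allowed_tiles` is a hypothetical stand-in for `allowed_boggle_tiles`:

```python
from collections import Counter

# Made-up data: observed tile labels and the full set of allowed tiles
observed_letters = ["A", "E", "E", "T", "A", "E"]
allowed_tiles = ["A", "E", "T", "Z"]

# Counter[...] yields 0 for tiles that were never observed
observed_counts = Counter(observed_letters)
tile_counts = {tile: observed_counts[tile] for tile in allowed_tiles}

# Sort from most to least common, mirroring the sort applied before the export
sorted_tile_counts = sorted(tile_counts.items(), key=lambda kv: kv[1], reverse=True)
```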

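Finally, we're going to copy every tile image into `data/original-training-data`, so that the original tiles are preserved separately before any augmentation.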

In [9]:
# Copy all of the images from the data/training-data folder to the data/original-training-data folder
for file_path in tqdm(list(training_data_path.glob("*.png"))):
    shutil.copy(str(file_path), str(original_data_path / file_path.name))

100%|██████████| 1080/1080 [00:02<00:00, 417.11it/s]


# Augmenting Training Data
In order to bolster the training data that I've got, I'm going to do a little duplication and rotation. 

First, I'll set up this process and parameterize it a little bit. 

In [10]:
# Parameterize the data augmentation process
max_class_count_multiplier = 2
n_per_rotation = 1

# Create a working copy of the tile DataFrame with only the columns needed for augmentation
tile_img_df = parsed_letter_img_df[["tile_img_file_path", "tile_idx", "letter", "tile_img"]].copy()

Now, to start: I'm going to determine how many of each letter I need to create.

In [11]:
# Create a DataFrame mapping the letter to the tile count
letter_to_tile_ct_df = (
    tile_img_df.groupby("letter")
    .agg(count=("tile_idx", "count"))
    .reset_index()
    .sort_values("count", ascending=False)
)

# Determine the number of tiles in the largest class
max_class_count = letter_to_tile_ct_df["count"].max()

# Determine how many samples each class ought to have 
samples_per_class = math.ceil(max_class_count * max_class_count_multiplier)
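
As a worked example of the arithmetic: every class is topped up to `ceil(max_class_count * max_class_count_multiplier)` samples, so rarer letters receive more duplicates, and even the largest class is duplicated when the multiplier exceeds 1. The counts below are made up for illustration, and `duplication_plan` is a hypothetical helper.

```python
import math
from typing import Dict


def duplication_plan(counts: Dict[str, int], multiplier: float) -> Dict[str, int]:
    """Return how many duplicates each class needs to reach the common target."""
    samples_per_class = math.ceil(max(counts.values()) * multiplier)
    return {letter: samples_per_class - count for letter, count in counts.items()}


# Made-up class counts; the target here is ceil(10 * 2) = 20 samples per class
plan = duplication_plan({"E": 10, "T": 7, "Q": 3}, multiplier=2)
```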

Next, I'm going to iterate through each of the existing letter images, and determine how many times I need to save them. 

In [12]:
# As we duplicate images, we're going to keep track of them in this list
duplicated_letter_img_df_records = []

# We're going to keep track of how many times we've duplicated each image
image_duplicate_ct_dict = {}

# Iterate through each of the letters and determine how many times they need to be duplicated
for row in letter_to_tile_ct_df.itertuples():
    
    # Determine the letter and the number of times it needs to be duplicated
    cur_letter = row.letter
    cur_letter_count = row.count
    n_times_to_duplicate = samples_per_class - cur_letter_count
    
    # Subset the tile_img_df to only include the current letter
    cur_letter_tile_img_df = tile_img_df.query("letter == @cur_letter").copy()
    
    # For each of the times to duplicate, duplicate the letter
    for duplicate_idx in range(n_times_to_duplicate):
        
        # Determine which letter we're going to copy 
        letter_to_copy_idx = duplicate_idx % cur_letter_count
        df_row_to_copy = cur_letter_tile_img_df.iloc[letter_to_copy_idx]
        
        # Now, we're going to create a new filepath for the duplicated image 
        cur_letter_file_path = Path(df_row_to_copy.tile_img_file_path)
        cur_letter_file_stem = cur_letter_file_path.stem
        duplicate_ct_for_cur_img = image_duplicate_ct_dict.get(cur_letter_file_stem, 0)
        duplicate_file_path = f"{cur_letter_file_stem}_copy-{(duplicate_ct_for_cur_img+1):04}.png"
        duplicate_file_path = cur_letter_file_path.parent / duplicate_file_path
        
        # Make sure to store the fact that we've duplicated this image
        image_duplicate_ct_dict[cur_letter_file_stem] = duplicate_ct_for_cur_img + 1
        
        # Now, we'll add a record to the duplicated_letter_img_df_records
        duplicated_letter_img_df_records.append({
            "tile_img_file_path": duplicate_file_path,
            "tile_idx": df_row_to_copy.tile_idx,
            "letter": df_row_to_copy.letter,
            "tile_img": df_row_to_copy.tile_img,
        })

# Now: we're going to create a DataFrame out of the duplicated_letter_img_df_records
duplicated_letter_img_df = pd.DataFrame.from_records(duplicated_letter_img_df_records)

# Create a DataFrame containing all of the letter images
all_letter_img_df = pd.concat([
    tile_img_df,
    duplicated_letter_img_df,
])

Now that we've created this DataFrame, we'll need to save all of the files. 

In [13]:
# Iterate through each of the rows in the duplicated_letter_img_df and save the images
for row in tqdm(list(duplicated_letter_img_df.itertuples())):
    Image.fromarray(row.tile_img).save(str(row.tile_img_file_path))

100%|██████████| 7304/7304 [00:03<00:00, 2283.75it/s]
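
The duplication loop cycles through a letter's existing tiles in round-robin order (`duplicate_idx % cur_letter_count`) and numbers each copy per source image. The sketch below reproduces just that naming logic on made-up file stems; `duplicate_file_names` is a hypothetical helper.

```python
from typing import Dict, List


def duplicate_file_names(stems: List[str], n_duplicates: int) -> List[str]:
    """Name `n_duplicates` copies, cycling through `stems` in round-robin order."""
    copies_so_far: Dict[str, int] = {}
    names = []
    for duplicate_idx in range(n_duplicates):
        stem = stems[duplicate_idx % len(stems)]
        copies_so_far[stem] = copies_so_far.get(stem, 0) + 1
        names.append(f"{stem}_copy-{copies_so_far[stem]:04}.png")
    return names


# Two made-up source tiles for one letter, topped up with three copies
names = duplicate_file_names(["A_board-01_0", "A_board-02_5"], n_duplicates=3)
```

Cycling this way spreads the copies as evenly as possible across the source tiles, so no single tile dominates its class.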


Now that this is done, we need to rotate all of the pictures. Each tile (original or duplicate) is saved at 0, 90, 180, and 270 degrees.

In [14]:
# We're going to keep track of the images to save in a DataFrame
rotated_tile_img_df_records = []

# Iterate through each of the rotations that we want to do 
for cur_rotation_idx in range(n_per_rotation):
    
    # Iterate through each of the tiles in the dataset
    for row in all_letter_img_df.itertuples():
        
        # Iterate through each of the rotation angles
        for cur_rotation_angle in [0, 90, 180, 270]:
            
            # Determine the new file path for the rotated image
            cur_file_path = Path(row.tile_img_file_path)
            cur_file_stem = cur_file_path.stem
            rotated_file_path = cur_file_path.parent / f"{cur_file_stem}_rotate_{cur_rotation_idx:02}-{cur_rotation_angle:03}.png"
            
            # Rotate the image 
            img = row.tile_img
            image_pil = Image.fromarray(np.uint8(img))
            rotated_img = image_pil.rotate(cur_rotation_angle)
            rotated_img = np.array(rotated_img)
            
            
            # Add a record to the rotated_tile_img_df_records
            rotated_tile_img_df_records.append({
                "tile_img_file_path": rotated_file_path,
                "tile_idx": row.tile_idx,
                "letter": row.letter,
                "tile_img": rotated_img,
            })

# Finally, we're going to create a DataFrame out of the rotated_tile_img_df_records
rotated_tile_img_df = pd.DataFrame.from_records(rotated_tile_img_df_records)

Now, we're going to save all of the rotated tiles. 

In [15]:
# Iterate through each of the rows in the rotated_tile_img_df and save the images
for row in tqdm(list(rotated_tile_img_df.itertuples())):
    Image.fromarray(row.tile_img).save(str(row.tile_img_file_path))

100%|██████████| 33536/33536 [00:14<00:00, 2255.60it/s]
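
For multiples of 90 degrees, the same rotations can be produced directly on the arrays with `np.rot90`, which rotates counterclockwise, the same direction PIL's `rotate` uses for positive angles. This is an alternative sketch on a made-up 2x2 array, not the approach used above. Note that the 0-degree case is an unrotated copy, so every tile also gets one more identical duplicate.

```python
import numpy as np

# Made-up 2x2 "image" so the effect of each rotation is easy to read
tile = np.array([[1, 2],
                 [3, 4]], dtype=np.uint8)

# k quarter-turns counterclockwise for each angle
rotations = {angle: np.rot90(tile, k=angle // 90) for angle in [0, 90, 180, 270]}

# Sanity check on the progress bars above: every original (1080) and
# duplicated (7304) tile is saved once per angle, with n_per_rotation = 1
n_rotated_files = (1080 + 7304) * 1 * 4
```

The file count agrees with the 33536 iterations reported by the final progress bar.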
