# Training our Classification Model

This notebook outlines how we used both printed and handwritten characters to train our classification model and walks through the different classes of models that we tried.

In [1]:
import math
import numpy
import os
import PIL.Image
import random
import scipy.ndimage
import tensorflow

## Data Sources

We used two data sources to train our classification model: handwritten letters from the [EMNIST dataset](https://www.nist.gov/itl/iad/image-group/emnist-dataset) and printed letters from the [Chars74K dataset](http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/). Though we will ultimately be classifying printed characters from title pages, we decided to train our model on both printed _and_ handwritten characters in the hope of making it more robust. Both data sources consist of 62 classes of characters (uppercase and lowercase versions of the 26 letters in the English alphabet, plus the 10 digits), but initially we only use the letters to train our model, since title pages consist predominantly of letters. We also treat the uppercase and lowercase versions of a letter as the same class, since our classifications don't need to be case-sensitive and merging these classes improves accuracy.
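As a minimal sketch of what merging uppercase and lowercase into one class means, the hypothetical helper below (`letter_class` is not part of this notebook) maps a letter to a case-insensitive class index. The 0-based indexing is an assumption chosen to line up with a 26-way output layer.

```python
import string


def letter_class(char):
    """Map an English letter to a case-insensitive class index (a/A -> 0, ..., z/Z -> 25)."""
    lowered = char.lower()
    # reject anything that is not a single English letter (digits, punctuation, ...)
    if len(lowered) != 1 or lowered not in string.ascii_lowercase:
        raise ValueError("not an English letter: {!r}".format(char))
    return string.ascii_lowercase.index(lowered)


print(letter_class("a"), letter_class("A"))  # 0 0
print(letter_class("Z"))  # 25
```

Digits are rejected on purpose here, mirroring the decision to train on letters only.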

### EMNIST dataset (handwritten)

Character images are 28x28 pixels with white text on a black background. The distribution of images per class is unbalanced and is as follows:

|Letter|Uppercase|Lowercase|Total|
|------|---------|---------|-----|
|A|7469|11677|19146|
|B|4526|6012|10538|
|C|11833|3286|15119|
|D|5341|11860|17201|
|E|5785|28723|34508|
|F|10622|2961|13583|
|G|2964|4276|7240|
|H|3673|10217|13890|
|I|13994|3152|17146|
|J|4388|2213|6601|
|K|2850|2957|5807|
|L|5886|17853|23739|
|M|10487|3109|13596|
|N|9588|13316|22904|
|O|29139|3215|32354|
|P|9744|2816|12560|
|Q|3018|3499|6517|
|R|5882|16425|22307|
|S|24272|3136|27408|
|T|11396|21227|32623|
|U|14604|3312|17916|
|V|5433|3378|8811|
|W|5501|3164|8665|
|X|3203|3292|6495|
|Y|5541|2746|8287|
|Z|3165|3176|6341|
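To make the imbalance concrete, the sketch below copies the counts from the table above into a dictionary (the name `EMNIST_COUNTS` is introduced here for illustration) and finds the smallest letter/case combination, which is the limiting factor when drawing the same number of handwritten images for every combination.

```python
# (uppercase, lowercase) image counts per letter, copied from the table above
EMNIST_COUNTS = {
    "A": (7469, 11677), "B": (4526, 6012), "C": (11833, 3286), "D": (5341, 11860),
    "E": (5785, 28723), "F": (10622, 2961), "G": (2964, 4276), "H": (3673, 10217),
    "I": (13994, 3152), "J": (4388, 2213), "K": (2850, 2957), "L": (5886, 17853),
    "M": (10487, 3109), "N": (9588, 13316), "O": (29139, 3215), "P": (9744, 2816),
    "Q": (3018, 3499), "R": (5882, 16425), "S": (24272, 3136), "T": (11396, 21227),
    "U": (14604, 3312), "V": (5433, 3378), "W": (5501, 3164), "X": (3203, 3292),
    "Y": (5541, 2746), "Z": (3165, 3176),
}

# the smallest letter/case combination bounds how many images can be drawn per combination
smallest = min(
    (
        (letter, case, count)
        for letter, counts in EMNIST_COUNTS.items()
        for case, count in zip(("uppercase", "lowercase"), counts)
    ),
    key=lambda item: item[2],
)
total_images = sum(sum(counts) for counts in EMNIST_COUNTS.values())

print(smallest)      # ('J', 'lowercase', 2213)
print(total_images)  # 411302
```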

### Chars74K dataset (printed)

Character images are 128x128 pixels with black text on a white background. The distribution of images per class is balanced with 1,016 images for each class. The 1,016 images for each class come from 254 different fonts, each rendered in 4 styles of emphasis (normal, _italics_, **bold**, and ***bold & italics***). The grayscale of these printed character images is reversed before the data is zipped up for use in this notebook. All other transformations necessary to make the printed character dataset consistent with the handwritten character dataset are contained within this notebook.
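The per-class counts follow directly from the number of fonts and styles. The short check below spells out that arithmetic; the variable names are illustrative, and the factor of 2 reflects merging uppercase and lowercase into a single letter class.

```python
NUM_FONTS = 254          # fonts in the printed dataset
STYLES_PER_FONT = 4      # normal, italics, bold, bold & italics
CASES_PER_LETTER = 2     # uppercase and lowercase are merged into one letter class
NUM_LETTERS = 26

images_per_case = NUM_FONTS * STYLES_PER_FONT             # one letter/case combination
images_per_letter = images_per_case * CASES_PER_LETTER    # one merged letter class
total_printed_letters = images_per_letter * NUM_LETTERS   # all printed letter images

print(images_per_case, images_per_letter, total_printed_letters)  # 1016 2032 52832
```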

The data used in this notebook is 293 MB in size and is available to download [here](https://drive.google.com/file/d/1Lz9bK85vTPnjS_D7WErI_zXIssIVdgro/view?usp=sharing). The data consists of 26 numbered subdirectories (one for each letter, with the number representing the index of the letter in the English alphabet), where the file names encode whether the character is printed or handwritten, the case of the letter (uppercase or lowercase), and, for printed characters, an identifier for the font.

## Selecting Images to Train/Test the Model

Below is a function that selects which of the available images to use as training data and test data. The function is helpful for testing different handwritten/printed splits of the training data to see which split produces the "best" model. Since the available data contains nearly 8 times as many handwritten characters as printed characters, we won't always use all of the available handwritten data to train our model; how much is used depends on the specified handwritten/printed split. At the moment, these behaviors are hard-coded into this process:

*  training/test data will use all of the available printed characters
*  training/test data will be balanced by letter/case combination, though it might be better to distribute the letters in the training/test data so that they more closely match the natural frequency of letters in the Italian language
*  fonts of printed characters that appear in the training data will never also appear in the test data
*  test data will be 100% printed characters (since we will ultimately be classifying printed characters from title pages), but if test accuracy is used as one of the main metrics to compare models, it might be worth including some handwritten characters in the test set as well
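The font-disjoint behavior in the list above can be sketched on toy data as follows. The helper `split_by_font` and the toy file names are hypothetical; the idea is only that whole fonts, not individual images, are assigned to either the training set or the test set.

```python
import random


def split_by_font(images_by_font, num_train_fonts, rng):
    """Assign whole fonts to the training or test set so that no font appears in both."""
    fonts = sorted(images_by_font)
    rng.shuffle(fonts)
    train_fonts, test_fonts = fonts[:num_train_fonts], fonts[num_train_fonts:]
    train = [image for font in train_fonts for image in images_by_font[font]]
    test = [image for font in test_fonts for image in images_by_font[font]]
    return train, test


# toy data: 5 hypothetical fonts, each with 4 styles of emphasis
toy_images_by_font = {
    font: ["font{}_style{}".format(font, style) for style in range(4)]
    for font in range(5)
}
toy_train, toy_test = split_by_font(toy_images_by_font, 3, random.Random(0))

print(len(toy_train), len(toy_test))  # 12 8
```

Splitting by font rather than by image keeps the test accuracy from being inflated by fonts the model has already seen in another style.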

In [2]:
# the number of emphases used for each font/letter/case combination in the printed dataset
# (normal, italics, bold, or bold & italics)
CHARS_PER_FONT = 4

# the minimum number of images for a letter/case combination in the handwritten dataset
# (lowercase J)
MIN_HANDWRITTEN_CLASS_SIZE = 2213

# the number of fonts used in the printed dataset
NUM_UNIQUE_FONTS = 254

# the number of images for each letter in the printed dataset (combining the totals of 
# lowercase and uppercase letters)
PRINTED_CHARS_PER_CLASS = 2032

In [3]:
def get_data(train_ratio, printed_train_ratio, data_dir):
    """
    Helper function that uses random components to determine which of the available
    character images to use as training data and test data.

    Args:
        train_ratio (float): the fraction (between 0 and 1) of the combined training/test
            data that is used in the training set
        printed_train_ratio (float): the fraction (between 0 and 1) of the training data
            that are printed characters (as opposed to handwritten characters)
        data_dir (string): path to where the data directory is located

    Returns:
        A 2-tuple of the full paths of images to use in the training set and the full paths
        of images to use in the testing set.
    """

    if printed_train_ratio == 1:
        # the parameters specify that only printed characters should be used in the training
        # set
        handwritten_train = 0
        printed_train = PRINTED_CHARS_PER_CLASS * train_ratio
    else:
        # otherwise, estimate the number of handwritten and printed characters to use as 
        # training data from each class from a system of equations using the specified
        # parameters
        common_term = train_ratio * (1 - printed_train_ratio)
        handwritten_train = (common_term * PRINTED_CHARS_PER_CLASS) / (1 - common_term)
        printed_train = (printed_train_ratio * handwritten_train) / (1 - printed_train_ratio)

    # round the estimated number of handwritten characters to use as training data up to the
    # nearest even number so we are using the same number of uppercase letters as lowercase
    # letters
    handwritten_train = math.ceil(handwritten_train / 2) * 2
    # round the estimated number of printed characters to use as training data down to the
    # point where we are using all of the characters of a certain font
    printed_train = math.floor(printed_train / (CHARS_PER_FONT * 2)) * (CHARS_PER_FONT * 2)
    # after rounding both the number of handwritten and printed characters to use as training
    # data, increase the number of handwritten characters to use as training data until we've
    # reached the specified ratio of printed characters to handwritten characters in the
    # training set
    while printed_train / (handwritten_train + printed_train) > printed_train_ratio:
        handwritten_train += 2

    # the number of handwritten characters to use for each letter/case combination
    handwritten_train_per_class = handwritten_train // 2
    # the number of printed fonts to use for each letter/case combination
    printed_train_fonts_per_class = printed_train // (CHARS_PER_FONT * 2)

    # raise an exception if the specified percentage of printed characters in the training set
    # is small enough where we're not guaranteed to have a balanced number of handwritten
    # characters in the training set by letter/case combination
    assert handwritten_train_per_class <= MIN_HANDWRITTEN_CLASS_SIZE, (
        "Not enough available handwritten characters to use a constant number of handwritten "
        "characters in the training set for each letter/case combination. Please increase "
        "the percentage of printed characters to use in the training set."
    )

    # will store the full paths of all images to use in the training set and test set,
    # respectively; these lists will then be returned by the function
    train_images, test_images = [], []

    for class_dir in os.listdir(data_dir):

        # for each class, will store the full paths of all handwritten uppercase and lowercase
        # images belonging to that class, respectively
        handwritten_uppercase, handwritten_lowercase = [], []

        # for each class, will store the full paths of all printed images that belong to that
        # class, partitioned by font
        fonts = list(range(NUM_UNIQUE_FONTS))
        printed_by_font = {font: [] for font in fonts}

        for char_image in os.listdir(os.path.join(data_dir, class_dir)):

            # the image file name encodes handwritten/printed, case of letter and identifies
            # the font for printed images
            filename_parts = char_image.split("_")
            char_image_full_path = os.path.join(data_dir, class_dir, char_image)

            if filename_parts[1] == "handwritten":
                if filename_parts[-1][:-4] == "uppercase":
                    handwritten_uppercase.append(char_image_full_path)
                else:
                    assert filename_parts[-1][:-4] == "lowercase", (
                        "Unexpected format of image file name: {}".format(char_image)
                    )
                    handwritten_lowercase.append(char_image_full_path)
            else:
                assert filename_parts[1] == "printed", (
                    "Unexpected format of image file name: {}".format(char_image)
                )
                printed_by_font[int(filename_parts[2][-3:])].append(char_image_full_path)

        # random component used to make the selection of which images to use and in which data
        # set (training or test) random
        random.shuffle(handwritten_uppercase)
        random.shuffle(handwritten_lowercase)
        random.shuffle(fonts)

        # select the calculated number of handwritten characters to be used in the training
        # data per class/case combination for both the sets of uppercase and lowercase
        # handwritten characters from this class
        handwritten_train_images = handwritten_uppercase[:handwritten_train_per_class]
        handwritten_train_images.extend(handwritten_lowercase[:handwritten_train_per_class])

        # select the calculated number of printed character fonts to be used in the training
        # data; all other printed character fonts will be used in the test data
        printed_train_images, printed_test_images = [], []
        for index, font_num in enumerate(fonts):
            if index < printed_train_fonts_per_class:
                # all characters using this font will be used in the training data
                printed_train_images.extend(printed_by_font[font_num])
            else:
                # all characters using this font will be used in the test data
                printed_test_images.extend(printed_by_font[font_num])

        train_images.extend(handwritten_train_images + printed_train_images)
        test_images.extend(printed_test_images)

    return train_images, test_images

The training/test split was held constant at 85/15 while training our model, but it is left as a parameter. The data directory can also be changed if need be. After trying multiples of 5% from 50% to 100% for the percentage of printed characters in the training set, we found that a 75/25 printed/handwritten split of the training data gave the "best" results, judged by the model's test accuracy and its accuracy on a couple of the generated title pages.

In [4]:
train_image_files, test_image_files = get_data(
    train_ratio=0.85,
    printed_train_ratio=0.75,
    data_dir="data",
)
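To see what these parameters select per letter, the sketch below repeats only the arithmetic from `get_data` (no file access). With H handwritten and p printed training images per letter, P = 2032 printed images per letter, t the training ratio, and r the printed ratio, the system being solved is H + p = t (H + P) and p = r (H + p). The function name `split_counts` is introduced here for illustration.

```python
import math

CHARS_PER_FONT = 4
PRINTED_CHARS_PER_CLASS = 2032


def split_counts(train_ratio, printed_train_ratio):
    """Return (handwritten_train, printed_train) per letter, mirroring get_data's arithmetic."""
    if printed_train_ratio == 1:
        handwritten, printed = 0, PRINTED_CHARS_PER_CLASS * train_ratio
    else:
        common_term = train_ratio * (1 - printed_train_ratio)
        handwritten = (common_term * PRINTED_CHARS_PER_CLASS) / (1 - common_term)
        printed = (printed_train_ratio * handwritten) / (1 - printed_train_ratio)
    # round handwritten up to an even number, printed down to whole fonts (both cases)
    handwritten = math.ceil(handwritten / 2) * 2
    images_per_font = CHARS_PER_FONT * 2
    printed = math.floor(printed / images_per_font) * images_per_font
    # top up handwritten images until the printed share is no larger than requested
    while printed / (handwritten + printed) > printed_train_ratio:
        handwritten += 2
    return handwritten, printed


handwritten_train, printed_train = split_counts(0.85, 0.75)
printed_test = PRINTED_CHARS_PER_CLASS - printed_train

print(handwritten_train, printed_train, printed_test)  # 550 1640 392
```

For these parameters each letter contributes 275 handwritten images per case and 205 fonts to the training set, leaving 49 fonts for the test set. Because of the rounding, the realized ratios are approximate: about 84.8% of the data is used for training and about 74.9% of the training data is printed.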

## Normalizing Handwritten/Printed Data Sources into One Consistent Dataset

Since the original formats of the EMNIST (handwritten) data source and the Chars74K (printed) data source are different, we need to convert both data sources into one consistent format. The first difference (handwritten characters being white text on a black background and printed characters being the reverse) was handled outside of this notebook, before zipping up all our data. The next difference (handwritten characters are 28x28 pixels in size whereas printed characters are 128x128 pixels in size) will be resolved by following the same process used to produce the EMNIST handwritten characters. The EMNIST characters originated as 128x128 pixel images in NIST Special Database 19. The process used to convert these 128x128 pixel images to the 28x28 pixel images we see in our dataset (meant to match the format of the original MNIST dataset) is outlined in the [EMNIST paper](https://arxiv.org/pdf/1702.05373v2.pdf) and is shown below.

![EMNIST Conversion Process](./Data/emnist_conversion_process.png)

We will roughly follow this same process to convert the 128x128 pixel printed images in our dataset to the same format as the 28x28 pixel handwritten images in our dataset. This process can be viewed as one of the layers in our model (the first layer).

In [5]:
# border padding used when the region of interest is first placed into a square image; this
# follows the text of the EMNIST paper, but note that this interpretation of the text doesn't
# match the image from the EMNIST paper shown above
BORDER_PADDING = 2

# maximum pixel value used in the grayscale images; 0.0=black, 255.0=white
MAX_PIXEL_VALUE = 255.0

# dimensions of the EMNIST data and the target dimensions for the printed data so both data
# sources are in the same format
TARGET_IMAGE_DIMS = (28, 28)

In [6]:
def transform_data(image_files):
    """
    Helper function that transforms images into a target size, following a certain process
    along the way.

    Args:
        image_files (list): list of full paths of images to transform

    Returns:
        A numpy.ndarray of the transformed images.
    """

    # will store numpy.ndarrays containing the transformed version of each image represented
    # by a full path in the input list; will be converted to a numpy.array itself and returned
    transformed_data = []

    for full_image_path in image_files:

        # the transformation process utilizes different libraries including PIL.Image,
        # scipy.ndimage, and numpy to be as fast as possible

        # load the image using PIL.Image
        pixel_array = PIL.Image.open(full_image_path)

        # convert to a numpy.ndarray for the next step in the transformation process
        cols, rows = pixel_array.size
        pixel_matrix = numpy.array(pixel_array, dtype=numpy.float64).reshape((rows, cols))

        # if the image is already of the correct size (is handwritten), no need to go through
        # the bulk of the transformation process; otherwise (is printed) it does need to go
        # through the bulk of the transformation process
        if pixel_matrix.shape != TARGET_IMAGE_DIMS:

            # apply a Gaussian filter to the image with sigma=1 to soften the edges, using
            # scipy.ndimage
            pixel_matrix = scipy.ndimage.gaussian_filter(pixel_matrix, sigma=1)

            # convert back to a PIL.Image object for the next step in the transformation
            # process
            pixel_array = PIL.Image.fromarray(pixel_matrix)

            # extract the region of interest (remove the empty background surrounding the
            # character)
            pixel_array = pixel_array.crop(box=pixel_array.getbbox())

            # convert back to a numpy.ndarray for the next step in the transformation process
            cols, rows = pixel_array.size
            pixel_matrix = numpy.array(pixel_array, dtype=numpy.float64).reshape((rows, cols))

            # place and center region of interest into a square image, while preserving aspect
            # ratio (add an equal amount of empty background to both sides of the shorter
            # dimension of the image until it has square dimensions)
            if cols > rows:
                if (cols - rows) % 2 == 0:
                    row_padding_1 = numpy.zeros(((cols - rows) // 2, cols))
                else:
                    # if an odd number of rows need to be added to the numpy.ndarray to make
                    # it square, add the extra row to the top of the image
                    row_padding_1 = numpy.zeros((((cols - rows) // 2) + 1, cols))
                row_padding_2 = numpy.zeros(((cols - rows) // 2, cols))
                pixel_matrix = numpy.concatenate(
                    (row_padding_1, pixel_matrix, row_padding_2),
                    axis=0,
                )
            elif rows > cols:
                if (rows - cols) % 2 == 0:
                    col_padding_2 = numpy.zeros((rows, (rows - cols) // 2))
                else:
                    # if an odd number of columns need to be added to the numpy.ndarray to
                    # make it square, add the extra column to the right of the image
                    col_padding_2 = numpy.zeros((rows, ((rows - cols) // 2) + 1))
                col_padding_1 = numpy.zeros((rows, (rows - cols) // 2))
                pixel_matrix = numpy.concatenate(
                    (col_padding_1, pixel_matrix, col_padding_2),
                    axis=1,
                )

            # add an empty background border to the square image (keeping it square)
            rows, cols = pixel_matrix.shape
            row_padding = numpy.zeros((BORDER_PADDING, cols))
            pixel_matrix = numpy.concatenate(
                (row_padding, pixel_matrix, row_padding),
                axis=0,
            )
            col_padding = numpy.zeros((rows + (2 * BORDER_PADDING), BORDER_PADDING))
            pixel_matrix = numpy.concatenate(
                (col_padding, pixel_matrix, col_padding),
                axis=1,
            )

            # convert back to a PIL.Image object for the next step in the transformation
            # process
            pixel_array = PIL.Image.fromarray(pixel_matrix)

            # downsample the image to the target dimensions using bi-cubic interpolation
            pixel_array = pixel_array.resize(
                TARGET_IMAGE_DIMS,
                resample=PIL.Image.BICUBIC,
            )

            # convert back to a numpy.ndarray for the next step in the transformation process
            cols, rows = pixel_array.size
            pixel_matrix = numpy.array(pixel_array, dtype=numpy.float64).reshape((rows, cols))

        # the optional downsampling step in the transformation process can potentially
        # increase pixel values over the maximum pixel value, this step reduces these pixel
        # values back down to the maximum pixel value; this step also scales the pixel values
        # down to a range of [0, 1] for ease of use in the model
        pixel_matrix = numpy.minimum(pixel_matrix / MAX_PIXEL_VALUE, 1.0)

        transformed_data.append(pixel_matrix)

    transformed_data = numpy.array(transformed_data)

    return transformed_data
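The squaring and border steps inside `transform_data` can also be expressed with a single `numpy.pad` call. The sketch below is an alternative formulation, not the code used above; the helper name `pad_to_square` is hypothetical, and it keeps the same convention of putting an odd leftover row on top and an odd leftover column on the right.

```python
import numpy


def pad_to_square(pixel_matrix, border=2):
    """Center a 2-D array in a square of zeros, then add a zero border on every side."""
    rows, cols = pixel_matrix.shape
    size = max(rows, cols)
    extra_rows, extra_cols = size - rows, size - cols
    # an odd leftover row goes on top and an odd leftover column goes on the right
    top, bottom = extra_rows - extra_rows // 2, extra_rows // 2
    left, right = extra_cols // 2, extra_cols - extra_cols // 2
    return numpy.pad(
        pixel_matrix,
        ((top + border, bottom + border), (left + border, right + border)),
        mode="constant",
        constant_values=0.0,
    )


# toy region of interest: 3 rows by 6 columns of non-zero pixels
toy_glyph = numpy.ones((3, 6))
toy_square = pad_to_square(toy_glyph)

print(toy_square.shape)  # (10, 10)
```

Here the 3x6 region gains 2 rows on top and 1 on the bottom to become 6x6, and the 2-pixel border then brings it to 10x10.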

In [7]:
def get_labels(image_files):
    """
    Helper function that extracts the labeled classes from the full paths of images.

    Args:
        image_files (list): list of full paths of images to extract class labels from

    Returns:
        A numpy.ndarray of the class labels.
    """

    # will store the class labels encoded in all the full paths of images provided as input
    labels = []

    for image_file in image_files:
        # the class label is the numbered subdirectory (first directory up from the image
        # file) in the image's full path
        label = int(image_file.split(os.sep)[-2])
        labels.append(label)

    labels = numpy.array(labels)

    return labels
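As a small illustration of how the label is read from a path, the sketch below uses `pathlib` instead of splitting on `os.sep`, which is one possible alternative rather than the approach used above. The helper name and the example file name are hypothetical.

```python
import os
import pathlib


def label_from_path(image_file):
    """Read the class label from the numbered directory that contains the image."""
    return int(pathlib.Path(image_file).parent.name)


# hypothetical path: an image stored in the subdirectory for class 7
example_path = os.path.join("data", "7", "example.png")

print(label_from_path(example_path))  # 7
```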

In [8]:
train_images = transform_data(train_image_files)
test_images = transform_data(test_image_files)

In [9]:
train_labels = get_labels(train_image_files)
test_labels = get_labels(test_image_files)

## Training the Model

The final model we came up with is a 2-layer CNN (64 and 32 filters) with a 4x4 kernel size and a cross-entropy loss function, trained for 2 epochs. This model was deemed "best" based on its test accuracy and on how well it classified characters extracted from our generated title pages.

We started with a basic neural network and kept improving from there. The first big improvement (not directly related to tuning the model) came when we changed our target format from black text on a white background to white text on a black background (easier for the model to deal with since there is more non-text space in the images and the color black is equivalent to a pixel value of 0). The next big improvement came when we shifted from a basic neural network to a convolutional neural network. A CNN seemed like the better choice for our image-structured data. From there, it has been slight tuning of many different parameters, mainly the split of printed/handwritten characters in the training set, the kernel size, and the number of epochs to train for, but there are many parameters in the model that have remained fixed and have yet to be explored.
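To get a feel for the size of this architecture, the sketch below works out the layer output shapes and parameter counts by hand for a 28x28 single-channel input. It assumes unpadded ("valid") convolutions with a stride of 1, which are the Keras `Conv2D` defaults; the helper name `conv_output_size` is introduced here for illustration.

```python
def conv_output_size(size, kernel_size, stride=1):
    """Output length of an unpadded ('valid') convolution along one dimension."""
    return (size - kernel_size) // stride + 1


KERNEL_SIZE = 4
NUM_CLASSES = 26

height = width = 28
channels = 1
total_params = 0

for filters in (64, 32):
    # each filter has kernel_size * kernel_size * input_channels weights plus one bias
    total_params += (KERNEL_SIZE * KERNEL_SIZE * channels + 1) * filters
    height = conv_output_size(height, KERNEL_SIZE)
    width = conv_output_size(width, KERNEL_SIZE)
    channels = filters
    print("conv output:", (height, width, channels))

flattened = height * width * channels
# the dense layer has one weight per flattened input per class, plus one bias per class
total_params += (flattened + 1) * NUM_CLASSES

print("flattened:", flattened)        # flattened: 15488
print("parameters:", total_params)    # parameters: 436602
```

Most of the parameters sit in the final dense layer, since the two convolutions only shrink the image from 28x28 to 22x22 before flattening.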

In [10]:
model = tensorflow.keras.Sequential([
    # adds a dimension for the CNN (this step wasn't necessary to train a basic neural
    # network)
    tensorflow.keras.layers.Reshape(TARGET_IMAGE_DIMS + (1,), input_shape=TARGET_IMAGE_DIMS),
    tensorflow.keras.layers.Conv2D(
        64,
        kernel_size=4,
        activation=tensorflow.nn.relu,
    ),
    tensorflow.keras.layers.Conv2D(
        32,
        kernel_size=4,
        activation=tensorflow.nn.relu,
    ),
    tensorflow.keras.layers.Flatten(),
    tensorflow.keras.layers.Dense(26, activation=tensorflow.nn.softmax)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=2)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

Epoch 1/2
Epoch 2/2
Test accuracy: 0.9267072213500785
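A single test accuracy hides which letters are being confused. The sketch below shows one way a per-class accuracy could be computed from true and predicted labels; it runs on toy labels, and the helper `per_class_accuracy` is hypothetical rather than part of this notebook.

```python
import numpy


def per_class_accuracy(true_labels, predicted_labels, num_classes=26):
    """Fraction of correctly classified images for each class (NaN for absent classes)."""
    true_labels = numpy.asarray(true_labels)
    predicted_labels = numpy.asarray(predicted_labels)
    totals = numpy.bincount(true_labels, minlength=num_classes)
    correct = numpy.bincount(
        true_labels[true_labels == predicted_labels],
        minlength=num_classes,
    )
    # classes with no test images would divide by zero, so report NaN for them instead
    with numpy.errstate(divide="ignore", invalid="ignore"):
        return numpy.where(totals > 0, correct / totals, numpy.nan)


# toy labels for 3 classes: class 0 is right 1 of 2 times, class 1 is right 2 of 3 times
toy_true = [0, 0, 1, 1, 1, 2]
toy_predicted = [0, 1, 1, 1, 0, 2]
toy_accuracy = per_class_accuracy(toy_true, toy_predicted, num_classes=3)

print(toy_accuracy)
```

With the model above, the predicted labels could be obtained with something like `model.predict(test_images).argmax(axis=1)` and compared against `test_labels`.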


In [None]:
# uncomment the line below to save the model after re-training it or training a new model
# to use in our full pipeline

#tensorflow.keras.models.save_model(model, "./Data/model.hdf5")