<a href="https://colab.research.google.com/github/clemsage/NeuralDocumentClassification/blob/master/skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the computing environment


## Install and import PyTorch

In Colab, open Edit > Notebook settings and select "GPU" in the hardware accelerator drop-down.

In [24]:
# %pip install torch torchvision numpy matplotlib Pillow datasets
import torch

print(torch.__version__)

2.4.1+cu121


## Confirm PyTorch can see the GPU

In [2]:
print(torch.cuda.is_available())

True
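Once the GPU is visible, a common pattern is to pick a device once and reuse it for every model and tensor. The snippet below is a minimal sketch of that pattern; the `device` variable name is an assumption and is not used elsewhere in this notebook.

```python
import torch

# Fall back to the CPU when no CUDA device is visible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved explicitly with .to(device).
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```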


## Additional information about hardware

For CPU information and RAM, run:

In [3]:
!cat /proc/cpuinfo
!cat /proc/meminfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 154
model name	: 12th Gen Intel(R) Core(TM) i7-12800H
stepping	: 3
microcode	: 0xffffffff
cpu MHz		: 2803.212
cache size	: 24576 KB
physical id	: 0
siblings	: 20
core id		: 0
cpu cores	: 10
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 21
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs retbleed eibrs_pbrsb
bogomips	: 5606.42
clflush size	

MemTotal:       32708976 kB
MemFree:        16791772 kB
MemAvailable:   28810364 kB
Buffers:          218656 kB
Cached:         11807216 kB
SwapCached:            0 kB
Active:           907692 kB
Inactive:       14107624 kB
Active(anon):       2352 kB
Inactive(anon):  2991704 kB
Active(file):     905340 kB
Inactive(file): 11115920 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8388608 kB
SwapFree:        8388608 kB
Dirty:              8424 kB
Writeback:             0 kB
AnonPages:       2949352 kB
Mapped:           774036 kB
Shmem:              4644 kB
KReclaimable:     457268 kB
Slab:             533240 kB
SReclaimable:     457268 kB
SUnreclaim:        75972 kB
KernelStack:       11408 kB
PageTables:        37524 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24743096 kB
Committed_AS:    5517696 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       33812 kB
VmallocChunk:          0 kB
Percpu:          
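The `!cat` commands above only work from a notebook shell on Linux. As a rough alternative, the same figures can be read from plain Python. The sketch below parses text in the `/proc/meminfo` format; the helper name `parse_meminfo` is hypothetical, and the sample string reuses two values from the output above.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse '/proc/meminfo'-style text into {field: numeric value}.

    Most fields are expressed in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip blank or truncated lines that carry no numeric value.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


sample = "MemTotal:       32708976 kB\nMemFree:        16791772 kB\nPercpu:"
mem = parse_meminfo(sample)
print(f"Logical CPUs: {os.cpu_count()}")
print(f"Total RAM: {mem['MemTotal'] / 1024**2:.1f} GiB")
```

On a Linux machine, `parse_meminfo(open("/proc/meminfo").read())` would return the live values.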

## Other useful package imports

In [22]:
import importlib
import os
import pickle
import sys
from collections import Counter
from dataclasses import dataclass
from os import path

import matplotlib.pyplot as plt
import numpy as np


# Working on the dataset

The dataset is a subset of the [RVL-CDIP dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/). See the papers by [Harley et al.](http://scs.ryerson.ca/~aharley/icdar15/harley_convnet_icdar15.pdf) and [Asim et al.](https://www.dfki.de/fileadmin/user_upload/import/10637_Asim_Document_Image_Classification.pdf) for prior work on this dataset.

## Information about the dataset

This project only considers the following 5 classes among the 16 classes of the original dataset:

In [29]:
class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
NUM_CLASSES = len(class_names)

## Import the dataset

First, clone or pull the GitHub repository of the project:

In [None]:
if not os.path.exists("NeuralDocumentClassification"):
    !git clone https://github.com/thibaultdouzon/NeuralDocumentClassification.git
else:
    !git -C NeuralDocumentClassification pull
sys.path.append("NeuralDocumentClassification")

Download the train, test and validation dataset assignments from this [Google Drive](https://drive.google.com/drive/folders/1Pkd6sUkDGBUymWKK93abZx1MQiWmzFgP):

In [3]:
from src import download_dataset

importlib.reload(download_dataset)
dataset_path = "dataset"

download_dataset.download_and_extract("all", dataset_path)

Each dataset file is a binary dump that can be loaded with the [Pickle](https://docs.python.org/3.11/library/pickle.html) module.

In [26]:
with open(path.join(dataset_path, "train.pkl"), "rb") as f:
    train_dataset = pickle.load(f)

with open(path.join(dataset_path, "test.pkl"), "rb") as f:
    test_dataset = pickle.load(f)

with open(path.join(dataset_path, "validation.pkl"), "rb") as f:
    validation_dataset = pickle.load(f)


for split_name, split_dataset in zip(
    ["train", "test", "validation"], [train_dataset, test_dataset, validation_dataset]
):
    print(f"{split_name}_dataset contains {len(split_dataset)} documents")
train_dataset[0].keys()


train_dataset contains 5000 documents
test_dataset contains 1000 documents
validation_dataset contains 500 documents


dict_keys(['id', 'image', 'label', 'words', 'boxes'])

Each `dataset` object is a `list` of documents. A document is a `dict` with the following structure:

```json
{
    "id": "Unique document identifier",
    "image": "A PIL.Image object containing the document's image",
    "label": "An integer in [0 .. 4] representing the class of the document",
    "words": "A list of words extracted from the image by OCR",
    "boxes": "A list of tuples of numbers giving the position of each word in the document"
}
```

List the image files of the training and test dataset:

Display 5 images from the training dataset using the [matplotlib](https://matplotlib.org/stable/tutorials/images.html) `plt` module:

In [None]:
### Insert your code here ###
# See the expected solution by clicking on the cell below

In [None]:
# @title
for document in train_dataset[:5]:
    print(class_names[document["label"]])
    plt.imshow(document["image"].convert("RGB"))
    plt.show()

# Creating PyTorch datasets and dataloaders for a Computer Vision task

The first goal of this section is to create `torch.utils.data.Dataset` for the classification task using only the image of the document.

We will define a class inheriting [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) called `DocumentImageDataset`.

It should be possible to create an instance of `DocumentImageDataset` from our previously loaded datasets.
For simplicity, all images should be converted to grayscale (dropping the color channel) and resized to a fixed (512, 512) size. Use the [`torchvision.transforms.v2.functional`](https://pytorch.org/vision/main/transforms.html#v2-api-reference-recommended) module to convert a `PIL.Image` to a `torch.Tensor` and perform these simplifications.

When indexed, it should return an `ImageSample` object defined as follows:



In [48]:
import torch
import torch.utils.data as data
import torchvision.transforms.v2.functional as F


@dataclass
class ImageSample:
    image: torch.Tensor  # shape: (H, W)
    label: int  # 0 ≤ label < NUM_CLASSES

    def __post_init__(self):
        "Some assertions to check the validity of the data"
        assert self.image.shape == (
            512,
            512,
        ), f"Expected shape (512, 512), got {self.image.shape}"
        assert torch.all(self.image <= 1.0) and torch.all(
            self.image >= 0.0
        ), "Expected each pixel of image in range [0.0, 1.0]"
        assert self.label in range(
            NUM_CLASSES
        ), f"Expected label in range [0 .. {NUM_CLASSES-1}], got {self.label}"

In [None]:
## Fill the methods of the class DocumentImageDataset


class DocumentImageDataset(data.Dataset):
    def __init__(self, dataset: list[dict]):
        self.dataset = dataset
        raise NotImplementedError

    def __len__(self) -> int:
        """This method returns the length of the dataset"""
        raise NotImplementedError

    def __getitem__(self, idx: int) -> ImageSample:
        """This method returns the idx-th sample of the dataset
        If idx is out of bounds, it should raise an IndexError"""

        raise NotImplementedError

In [49]:
# @title


class DocumentImageDataset(data.Dataset):
    def __init__(self, dataset: list[dict]):
        self.dataset = dataset

    def __len__(self) -> int:
        """This method returns the length of the dataset"""
        return len(self.dataset)

    def __getitem__(self, idx: int) -> ImageSample:
        """This method returns the idx-th sample of the dataset
        If idx is out of bounds, it should raise an IndexError"""

        return ImageSample(
            image=F.to_dtype(
                F.to_image(F.resize(self.dataset[idx]["image"], size=[512, 512])),
                dtype=torch.float32,
                scale=True,
            )[0],
            label=self.dataset[idx]["label"],
        )

If your implementation is correct, you should be able to create an instance of `DocumentImageDataset` and get its first element (index 0) without error.

In [50]:
image_dataset = DocumentImageDataset(validation_dataset)
image_dataset[0]  # no error here

ImageSample(image=tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]]), label=3)

The final goal of this section is to implement a [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) that wraps the `DocumentImageDataset` and handles useful tasks like shuffling and batching.

There is no need to create a new class: we only need to implement a `collate_fn` that takes a list of `ImageSample` objects and returns an `ImageBatch`.


Hint: use `torch.tensor` to convert a Python list to a `torch.Tensor`, and `torch.stack` to stack multiple tensors into a new one along a new dimension.
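To make the difference between the two functions concrete, here is a small sketch on toy data. The tiny (2, 2) tensors stand in for real (512, 512) images, and the variable names are illustrative only.

```python
import torch

# Three fake "images" of shape (2, 2); real samples would be (512, 512).
images = [torch.full((2, 2), float(i)) for i in range(3)]
labels = [0, 1, 2]

# torch.stack adds a new leading dimension: 3 x (2, 2) -> (3, 2, 2).
stacked = torch.stack(images, dim=0)
# torch.tensor converts a Python list of ints into a 1-D integer tensor.
label_tensor = torch.tensor(labels)

print(stacked.shape, label_tensor.shape, label_tensor.dtype)
```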

In [51]:
@dataclass
class ImageBatch:
    images: torch.Tensor
    labels: torch.Tensor

    def __post_init__(self):
        assert self.images.shape[0] == self.labels.shape[0]
        assert self.images.shape[1:] == (512, 512)
        assert len(self.images.shape) == 3
        assert len(self.labels.shape) == 1

In [None]:
def collate_fn(batch: list[ImageSample]) -> ImageBatch:
    """This function should return a batch of samples as an ImageBatch object"""
    raise NotImplementedError

In [53]:
# @title


def collate_fn(batch: list[ImageSample]) -> ImageBatch:
    """This function should return a batch of samples as an ImageBatch object"""
    return ImageBatch(
        images=torch.stack(
            [sample.image for sample in batch], dim=0
        ),  # shape: (B, H, W)
        labels=torch.tensor([sample.label for sample in batch]),  # shape: (B,)
    )

If your implementation is correct, you should be able to create a dataloader with a batch size and retrieve the first batch.

In [55]:
dataloader = data.DataLoader(
    image_dataset, batch_size=5, collate_fn=collate_fn, shuffle=True
)
next(iter(dataloader))  # no error here

ImageBatch(images=tensor([[[1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         ...,
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000]],

        [[1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         ...,
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000]],

        [[0.3922, 0.9098, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [0.2549, 0.8902, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [0.2549, 0.8902, 1.0000,  ..., 

## Explore the data

Get image shape and label for one element of the training dataset:

In [None]:
for image, label in labeled_train_ds.take(1):
    print("Image shape (height, width, depth):", image.numpy().shape)
    print("Label:", class_names[label.numpy()])

Plot 3 random training images of each class:

In [None]:
plt.figure(figsize=(30, 60))
n_images_per_class = 3

for image, label in labeled_train_ds:
    break  # Sample images and labels for each class

for class_idx, class_name in enumerate(class_names):
    for i in range(n_images_per_class):
        plt.subplot(
            NUM_CLASSES, n_images_per_class, class_idx * n_images_per_class + i + 1
        )
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        # plt.imshow(...)
        # plt.xlabel(...)

# plt.show()

In [None]:
# @title
plt.figure(figsize=(30, 60))
n_images_per_class = 3
images = {class_name: [] for class_name in class_names}

for image, label in labeled_train_ds:
    image = image.numpy()
    label = label.numpy()

    if len(images[class_names[label]]) < n_images_per_class:
        images[class_names[label]].append(image)

    if all(
        [len(images[class_name]) == n_images_per_class for class_name in class_names]
    ):
        break

for class_idx, class_name in enumerate(class_names):
    for i in range(n_images_per_class):
        plt.subplot(
            NUM_CLASSES, n_images_per_class, class_idx * n_images_per_class + i + 1
        )
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(np.squeeze(images[class_name][i]), cmap="gray")
        plt.xlabel(class_name)

plt.show()

Print the class distribution in the training set:

In [None]:
cnt_class = Counter()
for file_path in dataset["training"]:
    label = labels_idx.lookup(tf.constant(file_path))
    # Update the counter with the label value

# Print the class counter

In [None]:
# @title
cnt_class = Counter()
for file_path in dataset["training"]:
    label = labels_idx.lookup(tf.constant(file_path))
    cnt_class.update([class_names[label.numpy()]])

for key, val in cnt_class.most_common():
    print("%s: %d" % (key, val))
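Assuming the list-of-dicts format loaded from the pickle files earlier (`train_dataset`), the same count can be sketched without TensorFlow. The helper name `class_distribution` and the toy dataset below are hypothetical; only the `"label"` key of each document is used.

```python
from collections import Counter


def class_distribution(documents: list[dict], class_names: list[str]) -> Counter:
    """Count how many documents fall into each class name."""
    return Counter(class_names[doc["label"]] for doc in documents)


# Toy stand-in for train_dataset: only the "label" key matters here.
class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
toy_dataset = [{"label": 0}, {"label": 3}, {"label": 3}, {"label": 1}]

for name, count in class_distribution(toy_dataset, class_names).most_common():
    print(f"{name}: {count}")
```

With the real data, the call would be `class_distribution(train_dataset, class_names)`.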

## Prepare training

Use a temporary folder for caching elements of the dataset in order to speed up training and testing:

In [None]:
temp_folder = "/tmp/%dx%dx1" % (IMG_HEIGHT, IMG_WIDTH)
labeled_train_ds = labeled_train_ds.cache(temp_folder)
labeled_test_ds = labeled_test_ds.cache(temp_folder)

Shuffle the documents within each subset:

In [None]:
labeled_train_ds = labeled_train_ds.shuffle(2048)
labeled_test_ds = labeled_test_ds.shuffle(2048)

Batch documents within each subset:


In [None]:
batch_size = 128
labeled_train_ds = labeled_train_ds.batch(batch_size)
labeled_test_ds = labeled_test_ds.batch(batch_size)

Prefetch the subsets in the background while the model is computing:

In [None]:
labeled_train_ds = labeled_train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
labeled_test_ds = labeled_test_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
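The caching, shuffling, batching and prefetching steps above are written against `tf.data`. In PyTorch, a `DataLoader` covers shuffling and batching, and background worker processes play a role similar to prefetching. The sketch below uses synthetic tensors and assumes the dataset already fits in memory, so no explicit cache is needed; the names `train_loader`, `images` and `labels` are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 10 grayscale 8x8 "images" with labels in [0, 5).
images = torch.rand(10, 8, 8)
labels = torch.randint(0, 5, (10,))
dataset = TensorDataset(images, labels)

train_loader = DataLoader(
    dataset,
    batch_size=4,   # the cells above use 128
    shuffle=True,   # reshuffles at the start of every epoch
    num_workers=0,  # values > 0 load batches in background worker processes
)

for batch_images, batch_labels in train_loader:
    print(batch_images.shape, batch_labels.shape)
```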

# Visual classifiers

## Fully connected neural network

### Set up the layers

Build a neural network composed of one fully connected (aka dense) hidden layer with 128 [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) units and one output softmax layer.

Each image must be reshaped to a one-dimensional vector before being fed to the hidden layer.

In [None]:
model = tf.keras.Sequential(
    [
        # Insert your layers here, see the following documentation:
        # https://www.tensorflow.org/tutorials/quickstart/beginner
    ]
)

In [None]:
# @title
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ]
)
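For readers following the PyTorch setup from the beginning of this notebook, a roughly equivalent network could look like the sketch below. This is not the Keras model above but a PyTorch counterpart under stated assumptions: the helper name `build_mlp` is hypothetical, and the output layer returns raw logits because `nn.CrossEntropyLoss` applies the softmax internally.

```python
import torch
from torch import nn


def build_mlp(height: int, width: int, num_classes: int) -> nn.Sequential:
    """One hidden layer of 128 ReLU units, then a linear output layer."""
    return nn.Sequential(
        nn.Flatten(),  # (B, H, W) -> (B, H * W)
        nn.Linear(height * width, 128),
        nn.ReLU(),
        # No softmax here: nn.CrossEntropyLoss expects raw logits.
        nn.Linear(128, num_classes),
    )


mlp = build_mlp(height=32, width=32, num_classes=5)  # small size for the demo
logits = mlp(torch.rand(4, 32, 32))
# Apply softmax only when explicit probabilities are needed.
probabilities = logits.softmax(dim=1)
print(logits.shape)
```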

### Compile the model

Compile the model by providing the optimizer, the loss function you want to minimize and the metrics to monitor during training:

In [None]:
model.compile(
    optimizer="adam",  # https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
    loss="sparse_categorical_crossentropy",  # Loss used for multi-class classification with integer labels
    # https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
    metrics=[
        "sparse_categorical_accuracy"
    ],  # https://www.tensorflow.org/api_docs/python/tf/keras/metrics/SparseCategoricalAccuracy
)

Print a summary of the model:

In [None]:
model.summary()  # summary() prints the table itself and returns None

### Train the model

Fit the model on the training set for 20 epochs:

In [None]:
EPOCHS = 20
# model.fit(...)  # https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit

In [None]:
# @title
EPOCHS = 20
model.fit(labeled_train_ds, epochs=EPOCHS)

### Evaluation on the test set

Get the values of the loss and accuracy: 

In [None]:
# model.evaluate(..., verbose=2) # https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate

In [None]:
# @title
model.evaluate(labeled_test_ds, verbose=2)

Are these values different from their training counterparts?
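In PyTorch there is no built-in `evaluate` method, so the loss and accuracy are usually computed in a short loop with gradients disabled. The following is a sketch under assumptions: the function name `evaluate` and the toy identity model are illustrative, and the loader is assumed to yield `(inputs, targets)` pairs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model: nn.Module, loader: DataLoader) -> tuple[float, float]:
    """Return (mean loss, accuracy) of `model` over all samples in `loader`."""
    model.eval()
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, correct, count = 0.0, 0, 0
    for inputs, targets in loader:
        logits = model(inputs)
        total_loss += loss_fn(logits, targets).item()
        correct += (logits.argmax(dim=1) == targets).sum().item()
        count += targets.numel()
    return total_loss / count, correct / count


# Toy check: one-hot inputs fed through an identity "model" are always right.
targets = torch.tensor([0, 1, 2, 3, 4, 0])
inputs = nn.functional.one_hot(targets, num_classes=5).float()
toy_loader = DataLoader(TensorDataset(inputs, targets), batch_size=4)
loss, accuracy = evaluate(nn.Identity(), toy_loader)
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```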

### Prediction on the test set

Implement a function that gathers the model predictions and the ground truth labels for a random batch of a given dataset:

In [None]:
def predict_random_batch(model, dataset):
    """
    Sample a random batch of the dataset and return the images of this batch
    as well as its labels and the predicted classes of a given model

    Parameters
    ----------
    model : tf.keras.Model
    dataset: tf.data.Dataset

    Returns
    -------
    images: np.ndarray or EagerTensor
    labels: list of str
    predicted_classes: list of str
    """
    images, labels = next(iter(dataset))

    # get label names for the sampled batch

    # make predictions
    predicted_classes = None

    return images, labels, predicted_classes

In [None]:
# @title
def predict_random_batch(model, dataset):
    """
    Sample a random batch of the dataset and return the images of this batch
    as well as its labels and the predicted classes of a given model

    Parameters
    ----------
    model : tf.keras.Model
    dataset: tf.data.Dataset

    Returns
    -------
    images: np.ndarray or EagerTensor
    labels: list of str
    predicted_classes: list of str
    """
    images, labels = next(iter(dataset))

    # get label names for the sampled batch
    labels = [class_names[i] for i in labels]

    # make predictions
    predictions = model.predict(images)
    predicted_classes_idx = np.argmax(predictions, axis=1)
    predicted_classes = [class_names[i] for i in predicted_classes_idx]

    return images, labels, predicted_classes

Plot the first 9 images of this batch, give their labels and predicted classes in the legend:

In [None]:
def plot_images_predictions_and_labels(images, labels, predicted_classes):
    plt.figure(figsize=(30, 40))

    for im_idx in range(9):
        plt.subplot(3, 3, im_idx + 1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(np.squeeze(images[im_idx]), cmap="gray")
        plt.xlabel("label: %s\npred: %s" % (labels[im_idx], predicted_classes[im_idx]))

    plt.show()


result = predict_random_batch(model, labeled_test_ds)
plot_images_predictions_and_labels(*result)

### Under the Hood

Implement a ReLU dense layer by creating its weights and biases and giving the transformation from inputs to outputs:

In [None]:
# https://www.tensorflow.org/guide/keras/custom_layers_and_models#the_layer_class
class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, units, input_dim):
        super(MyDenseLayer, self).__init__()
        self.w = self.add_weight(
            shape=None,  ## Insert the shape of the weight matrix here
            initializer="glorot_uniform",  # Default initializer for weights of a tf.keras.layers.Dense layer
            trainable=True,
        )

        self.b = self.add_weight(
            shape=None,  ## Insert the shape of the bias vector here
            initializer="zeros",  # Default initializer for biases of a tf.keras.layers.Dense layer
            trainable=True,
        )

    def call(self, inputs):
        outputs = None
        return outputs

In [None]:
# @title
# https://www.tensorflow.org/guide/keras/custom_layers_and_models#the_layer_class
class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, units, input_dim):
        super(MyDenseLayer, self).__init__()
        self.w = self.add_weight(
            shape=(input_dim, units),
            initializer="glorot_uniform",  # Default initializer for weights of a tf.keras.layers.Dense layer
            trainable=True,
        )

        self.b = self.add_weight(
            shape=(units,),
            initializer="zeros",  # Default initializer for biases of a tf.keras.layers.Dense layer
            trainable=True,
        )

    def call(self, inputs):
        return tf.keras.activations.relu(tf.matmul(inputs, self.w) + self.b)
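The same layer can be written in PyTorch by subclassing `nn.Module`. The sketch below mirrors the Keras version above (same weight and bias shapes, Glorot/Xavier uniform weights, zero biases); the class name `TorchDenseLayer` is an assumption chosen to avoid clashing with `MyDenseLayer`.

```python
import torch
from torch import nn


class TorchDenseLayer(nn.Module):
    """Dense layer with ReLU activation: relu(x @ w + b)."""

    def __init__(self, units: int, input_dim: int):
        super().__init__()
        # Weight matrix of shape (input_dim, units), Glorot/Xavier uniform init.
        self.w = nn.Parameter(torch.empty(input_dim, units))
        nn.init.xavier_uniform_(self.w)
        # Bias vector of shape (units,), initialised to zeros.
        self.b = nn.Parameter(torch.zeros(units))

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        return torch.relu(inputs @ self.w + self.b)


layer = TorchDenseLayer(units=128, input_dim=64)
outputs = layer(torch.rand(4, 64))
print(outputs.shape)
```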

Using your custom hidden layer, set up the layers of the previously defined model again:

In [None]:
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
        ## Insert your custom hidden layer here
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ]
)

In [None]:
# @title
model = tf.keras.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
        MyDenseLayer(128, IMG_HEIGHT * IMG_WIDTH),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ]
)

Lower-level implementation of the model compile step:

In [None]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name="train_loss")
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")


@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(labels, predictions)

Lower-level implementation of the model fit step:

In [None]:
for epoch in range(EPOCHS):
    for images, labels in labeled_train_ds:
        train_step(images, labels)

    # Report once per epoch; the metrics accumulate over all batches seen so far
    template = "Epoch {}, Loss: {}, Accuracy: {}"
    print(
        template.format(epoch + 1, train_loss.result(), train_accuracy.result() * 100)
    )
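For comparison, PyTorch always uses this lower-level style: there is no `compile` or `fit`, only an explicit loop. The sketch below is a PyTorch counterpart of the two cells above, run on synthetic data; the function name `train_one_epoch` and the small stand-in model are assumptions made for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, loss_fn, optimizer):
    """Run one pass over `loader`; return (mean loss, accuracy)."""
    model.train()
    total_loss, correct, count = 0.0, 0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()  # clear gradients from the previous step
        logits = model(inputs)  # forward pass
        loss = loss_fn(logits, targets)
        loss.backward()  # backpropagation (the role of GradientTape)
        optimizer.step()  # parameter update (the role of apply_gradients)
        total_loss += loss.item() * targets.numel()
        correct += (logits.argmax(dim=1) == targets).sum().item()
        count += targets.numel()
    return total_loss / count, correct / count


# Synthetic data, only to make the sketch runnable.
torch.manual_seed(0)
inputs, targets = torch.rand(32, 16), torch.randint(0, 5, (32,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(toy_model.parameters())

for epoch in range(3):
    loss, accuracy = train_one_epoch(toy_model, loader, loss_fn, optimizer)
    print(f"Epoch {epoch + 1}, Loss: {loss:.4f}, Accuracy: {accuracy * 100:.1f}")
```

Because the running totals are local to `train_one_epoch`, the reported loss and accuracy restart from zero at every epoch.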

## Convolutional Neural Networks (CNN)

### Training from scratch

Create and compile a model alternating convolution and max pooling layers. You can add some fully connected layers between the last convolution or pooling layer and the output layer. Start with a shallow network (4 or 5 convolution layers) and progressively move to deeper architectures:

In [None]:
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

model = tf.keras.Sequential(
    [
        # Alternating Conv2D and MaxPooling2D layers
        # Some dense hidden layer(s)
        Dense(NUM_CLASSES, activation="softmax")
    ]
)

model.summary()  # summary() prints the table itself and returns None

model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

In [None]:
# @title
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

shallow_model = tf.keras.Sequential(
    [
        Conv2D(
            16,
            3,
            padding="same",
            activation="relu",
            input_shape=(IMG_HEIGHT, IMG_WIDTH, 1),
        ),
        MaxPooling2D(4),
        Conv2D(32, 3, padding="same", activation="relu"),
        MaxPooling2D(4),
        Conv2D(64, 3, padding="same", activation="relu"),
        MaxPooling2D(4),
        Conv2D(128, 3, padding="same", activation="relu"),
        MaxPooling2D(4),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(NUM_CLASSES, activation="softmax"),
    ]
)

deep_model = tf.keras.Sequential(
    [
        Conv2D(
            16,
            3,
            padding="same",
            activation="relu",
            input_shape=(IMG_HEIGHT, IMG_WIDTH, 1),
        ),
        MaxPooling2D(),
        Conv2D(32, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(64, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(128, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(256, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(256, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(256, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(256, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Flatten(),
        Dense(256, activation="relu"),
        Dense(NUM_CLASSES, activation="softmax"),
    ]
)

model_with_strides = tf.keras.Sequential(
    [
        Conv2D(
            16,
            3,
            padding="same",
            activation="relu",
            input_shape=(IMG_HEIGHT, IMG_WIDTH, 1),
        ),
        MaxPooling2D(),
        Conv2D(32, 3, padding="same", activation="relu"),
        MaxPooling2D(),
        Conv2D(64, 3, padding="same", activation="relu", strides=2),
        MaxPooling2D(),
        Conv2D(128, 3, padding="same", activation="relu", strides=2),
        MaxPooling2D(),
        Conv2D(128, 3, padding="same", activation="relu", strides=2),
        MaxPooling2D(),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(NUM_CLASSES, activation="softmax"),
    ]
)
model = deep_model
model.summary()  # summary() prints the table itself and returns None

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
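A PyTorch counterpart of `shallow_model` could be sketched as follows. The helper name `build_shallow_cnn` is hypothetical, the function assumes square inputs whose side is a multiple of 256, and, unlike Keras, PyTorch convolutions expect channels-first tensors of shape (batch, channels, height, width).

```python
import torch
from torch import nn


def build_shallow_cnn(num_classes: int, image_size: int) -> nn.Sequential:
    """Four conv + max-pool stages followed by two fully connected layers.

    Assumes square inputs whose side is a multiple of 256 (4 poolings of 4).
    """
    final_side = image_size // 4**4
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        nn.Flatten(),
        nn.Linear(128 * final_side * final_side, 128), nn.ReLU(),
        nn.Linear(128, num_classes),  # raw logits, no softmax
    )


cnn = build_shallow_cnn(num_classes=5, image_size=256)
# Add the channel dimension: (B, H, W) -> (B, 1, H, W).
batch = torch.rand(2, 256, 256).unsqueeze(1)
cnn_logits = cnn(batch)
print(cnn_logits.shape)
```

With the (512, 512) images used earlier in this notebook, `image_size=512` gives a 2 x 2 final feature map and therefore 512 inputs to the first linear layer.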

Fit the CNN on the training data:

In [None]:
EPOCHS = 10
model.fit(labeled_train_ds, epochs=EPOCHS)

Evaluate the trained model on the test set:

In [None]:
model.evaluate(labeled_test_ds, verbose=2)

You should reach a test accuracy greater than 0.99!

Plot images, predictions and labels for some test documents:

In [None]:
plot_images_predictions_and_labels(*predict_random_batch(model, labeled_test_ds))

### Transfer Learning with pre-trained models 

The objective is to leverage the knowledge learned by a pre-trained image classifier. See [TensorFlow Hub](https://tfhub.dev/) to browse available state-of-the-art models such as [Inception V3](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf) or [MobileNet V2](https://arxiv.org/pdf/1801.04381.pdf).

Choose a pre-trained model for extracting high level feature vectors of document images:

In [None]:
extractor_model = "inception_v3"
if extractor_model == "inception_v3":
    feature_extraction_url = (
        "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4"
    )
    IMG_HEIGHT, IMG_WIDTH = None, None  ## Insert expected input image shape here
elif extractor_model == "mobilenet_v2":
    feature_extraction_url = (
        "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
    )
    IMG_HEIGHT, IMG_WIDTH = None, None  ## Insert expected input image shape here

In [None]:
# @title
extractor_model = "inception_v3"
if extractor_model == "inception_v3":
    feature_extraction_url = (
        "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4"
    )
    IMG_HEIGHT, IMG_WIDTH = 299, 299
elif extractor_model == "mobilenet_v2":
    feature_extraction_url = (
        "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
    )
    IMG_HEIGHT, IMG_WIDTH = 224, 224

Reshape images to the format expected by the chosen model, i.e. IMG_HEIGHT x IMG_WIDTH x 3 (RGB), and recreate the training dataset:

In [None]:
def decode_img(img):
    # convert the compressed string to a uint8 tensor
    img = tf.io.decode_png(img, channels=1)
    # convert to floats in the [0,1] range
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size
    img = tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
    # convert to RGB color scale
    img = tf.concat([img for _ in range(3)], axis=-1)  # R = G = B
    return img


def process_path(file_path):
    label = labels_idx.lookup(file_path)

    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label


labeled_train_ds = list_train_ds.map(
    process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE
)
labeled_train_ds = labeled_train_ds.cache("/tmp/%dx%dx3" % (IMG_HEIGHT, IMG_WIDTH))
labeled_train_ds = labeled_train_ds.shuffle(2048).batch(batch_size)
labeled_train_ds = labeled_train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Construct our own image classifier by retrieving and freezing the hidden layers of the pre-trained model: 

In [None]:
import tensorflow_hub as hub


In [None]:
# @title


Train this new model:

In [None]:
EPOCHS = 10
model.fit(labeled_train_ds, epochs=EPOCHS)