<a href="https://colab.research.google.com/github/clemsage/NeuralDocumentClassification/blob/master/skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Setting up the computing environment


## Install and import PyTorch


In [None]:
# Install all packages listed in pyproject.toml
%pip install click datasets gdown ipython jupyter matplotlib nltk numpy openai pillow polars pydantic requests ruff scikit-learn torch torchmetrics torchvision tqdm transformers types-requests types-tqdm

Select "GPU" in the Accelerator drop-down on Notebook Settings through the Edit menu.


In [None]:
# %pip install torch torchvision numpy matplotlib Pillow datasets
import torch

print(torch.__version__)

## Confirm PyTorch can see the GPU


In [None]:
print(torch.cuda.is_available())

## Additional information about hardware


For CPU information and RAM, run:


In [None]:
!cat /proc/cpuinfo
!cat /proc/meminfo

## Other useful package imports


In [11]:
import importlib
import operator
import os
import pickle
import sys
from dataclasses import dataclass
from functools import reduce
from os import path

import matplotlib.pyplot as plt
import tqdm


# Working on the dataset


The dataset is a subset of the [RVL-CDIP dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/). See [Harley et al.](http://scs.ryerson.ca/~aharley/icdar15/harley_convnet_icdar15.pdf) and [Asim et al.](https://www.dfki.de/fileadmin/user_upload/import/10637_Asim_Document_Image_Classification.pdf) papers for recent works on this dataset.


## Information about the dataset


This project only considers the following 5 classes among the 16 classes of the original dataset:


In [9]:
class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
NUM_CLASSES = len(class_names)

## Import the dataset


If you are on Google Colab, first clone the repository.


In [None]:
if not os.path.exists("NeuralDocumentClassification"):
    !git clone https://github.com/thibaultdouzon/NeuralDocumentClassification.git
else:
    !git -C NeuralDocumentClassification pull
sys.path.append("NeuralDocumentClassification")

You now either have a "NeuralDocumentClassification" folder or are already inside it.
Download the train, test and validation dataset assignments from this [Google Drive](https://drive.google.com/drive/folders/1Pkd6sUkDGBUymWKK93abZx1MQiWmzFgP) using the provided code in `src.download_dataset`:


In [13]:
from src import download_dataset

dataset_path = "dataset"

download_dataset.download_and_extract("all", dataset_path)

Each dataset file is a binary dump that can be loaded with the [Pickle](https://docs.python.org/3.11/library/pickle.html) module.


In [None]:
with open(path.join(dataset_path, "train.pkl"), "rb") as f:
    train_dataset = pickle.load(f)

with open(path.join(dataset_path, "test.pkl"), "rb") as f:
    test_dataset = pickle.load(f)

with open(path.join(dataset_path, "validation.pkl"), "rb") as f:
    validation_dataset = pickle.load(f)


for split_name, split_dataset in zip(
    ["train", "test", "validation"], [train_dataset, test_dataset, validation_dataset]
):
    print(f"{split_name}_dataset contains {len(split_dataset)} documents")
train_dataset[0].keys()


Each `dataset` object is a `list` of documents. Each document is a `dict` with the following structure:

```json
{
  "id": "Unique document identifier",
  "image": "A PIL.Image object containing the document's image",
  "label": "A number in [0 .. 4] representing the class of the document",
  "words": "A list of words extracted from the image with an OCR",
  "boxes": "A list of tuples of numbers providing the position of each word in the document"
}
```


# Explore the data


Display 5 images from the training dataset using [matplotlib](https://matplotlib.org/stable/tutorials/images.html)'s `pyplot` module (imported as `plt`):


In [None]:
### Insert your code here ###
# See the expected solution by clicking on the cell below

In [None]:
# @title
for document in train_dataset[:5]:
    print(class_names[document["label"]])
    plt.imshow(document["image"].convert("RGB"))
    plt.show()

Try to answer the following questions:

What is the shape of the images?
How are the different classes distributed?
Using subplots, show an image of each class.
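
One possible way to approach the class distribution question is to count labels with `collections.Counter`. The sketch below assumes each document is a `dict` with an integer `"label"` key, as described earlier. The helper name `class_distribution` and the toy data are made up for illustration and are not part of the repository.

```python
from collections import Counter


def class_distribution(dataset, class_names):
    """Return a {class name: count} dict for a list of document dicts."""
    counts = Counter(document["label"] for document in dataset)
    # Include classes with zero documents so that gaps are visible.
    return {name: counts.get(idx, 0) for idx, name in enumerate(class_names)}


# Toy stand-in for train_dataset (hypothetical data, labels only).
toy_dataset = [{"label": 0}, {"label": 3}, {"label": 3}, {"label": 4}]
toy_class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
distribution = class_distribution(toy_dataset, toy_class_names)
print(distribution)
```

On the real data, the same helper could be called as `class_distribution(train_dataset, class_names)`.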


# Creating PyTorch datasets and dataloaders for a computer vision task

The first goal of this section is to create `torch.utils.data.Dataset` for the classification task using only the image of the document.

We will define a class inheriting [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) called `DocumentImageDataset`.

It should be possible to create an instance of `DocumentImageDataset` from our previously loaded datasets.
For simplicity, all images should be resized to a fixed (512, 512) size. Use the [`torchvision.transforms.v2.functional`](https://pytorch.org/vision/main/transforms.html#v2-api-reference-recommended) module to convert a `PIL.Image` to a `torch.Tensor` and perform the resizing.

When indexed, it should return an `ImageSample` object defined as follows:


In [15]:
import torch
import torch.utils.data as data
import torchvision.transforms.v2.functional as F


@dataclass
class ImageSample:
    image: torch.Tensor  # shape: (C, H, W)
    label: int  # 0 ≤ label < NUM_CLASSES

    def __post_init__(self):
        "Some assertions to check the validity of the data"
        assert self.image.shape == (
            1,
            512,
            512,
        ), f"Expected shape (1, 512, 512), got {self.image.shape}"
        assert torch.all(self.image <= 1.0) and torch.all(
            self.image >= 0.0
        ), "Expected each pixel of image in range [0.0, 1.0]"
        assert self.label in range(
            NUM_CLASSES
        ), f"Expected label in range [0 .. {NUM_CLASSES-1}], got {self.label}"

In [16]:
# Fill the methods of the class DocumentImageDataset


class DocumentImageDataset(data.Dataset):
    def __init__(self, dataset: list[dict]):
        self.dataset = dataset
        raise NotImplementedError

    def __len__(self) -> int:
        """This method returns the length of the dataset"""
        raise NotImplementedError

    def __getitem__(self, idx: int) -> ImageSample:
        """This method returns the idx-th sample of the dataset
        If idx is out of bounds, it should raise an IndexError"""

        raise NotImplementedError

In [17]:
# @title


class DocumentImageDataset(data.Dataset):
    def __init__(self, dataset: list[dict]):
        self.dataset = dataset

    def __len__(self) -> int:
        """This method returns the length of the dataset"""
        return len(self.dataset)

    def __getitem__(self, idx: int) -> ImageSample:
        """This method returns the idx-th sample of the dataset
        If idx is out of bounds, it should raise an IndexError"""

        return ImageSample(
            # F.to_tensor is deprecated, use F.to_dtype(F.to_image(...), dtype=torch.float32, scale=True) instead
            image=F.to_dtype(
                F.to_image(F.resize(self.dataset[idx]["image"], size=[512, 512])),
                dtype=torch.float32,
                scale=True,
            ),
            label=self.dataset[idx]["label"],
        )

If your implementation is correct, you should be able to create an instance of `DocumentImageDataset` and get its 0th element without error.


In [None]:
image_dataset = DocumentImageDataset(validation_dataset)
image_dataset[0]  # no error here

The final goal of this section is to implement a [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) that wraps the `DocumentImageDataset` and handles useful tasks like shuffling and batching.

There is no need to create a new class: we only need to implement a `collate_fn` that takes a list of `ImageSample` objects and returns an `ImageBatch`.

hint: Use `torch.tensor` and `torch.stack` to respectively convert a python list to a `torch.Tensor` and stack multiple tensors together into a new one along a new dimension.
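
To make the hint concrete, here is a small sketch of what the two functions do to shapes. The tensors are tiny placeholders (4×4 instead of 512×512) chosen only to keep the example light.

```python
import torch

# Three fake single-channel "images" of size 4×4 (the notebook uses 512×512).
images = [torch.zeros(1, 4, 4) for _ in range(3)]
labels = [0, 2, 1]

stacked = torch.stack(images, dim=0)  # adds a new leading batch dimension
label_tensor = torch.tensor(labels)  # Python list -> 1-D integer tensor

print(stacked.shape)  # torch.Size([3, 1, 4, 4])
print(label_tensor.shape)  # torch.Size([3])
```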


In [19]:
@dataclass
class ImageBatch:
    images: torch.Tensor
    labels: torch.Tensor

    def __post_init__(self):
        assert self.images.shape[0] == self.labels.shape[0]
        assert self.images.shape[1:] == (1, 512, 512)
        assert len(self.images.shape) == 4
        assert len(self.labels.shape) == 1

In [20]:
def collate_fn(batch: list[ImageSample]) -> ImageBatch:
    """This function should return a batch of samples as an ImageBatch object"""
    raise NotImplementedError

In [21]:
# @title


def collate_fn(batch: list[ImageSample]) -> ImageBatch:
    """This function should return a batch of samples as an ImageBatch object"""
    return ImageBatch(
        images=torch.stack(
            [sample.image for sample in batch], dim=0
        ),  # shape: (B, C, H, W)
        labels=torch.tensor([sample.label for sample in batch]),  # shape: (B,)
    )

If your implementation is correct, you should be able to create a dataloader with a batch size and retrieve the first batch.


In [None]:
dataloader = data.DataLoader(
    image_dataset, batch_size=5, collate_fn=collate_fn, shuffle=True, drop_last=True
)
next(iter(dataloader))  # no error here

# Visual classifiers


In [23]:
from torch import nn

## Multi Layer Perceptron


### Set up the layers


Build a neural network composed of one fully connected (aka dense, or `Linear` in torch) hidden layer with 128 [ReLU](<https://en.wikipedia.org/wiki/Rectifier_(neural_networks)>) units.

Each image must be flattened to a single dimension of size 512 × 512 before being fed to the linear layer.

Use `torch.nn` (nn stands for Neural Network) module for all those operations.


In [None]:
mlp_model = nn.Sequential(
    # Fill the layers of the model
    # It should take an input of shape (B, 1, 512, 512)
    # and output a tensor of shape (B, NUM_CLASSES)
)

In [15]:
# @title
mlp_model = nn.Sequential(
    nn.Flatten(start_dim=1),  # Do not flatten the batch dimension
    nn.Linear(512 * 512, 128),  # d_input = n_pixels in an image = 512 × 512
    nn.ReLU(),
    nn.Linear(128, NUM_CLASSES),  # d_output = NUM_CLASSES
)

Side question: how many trainable parameters does your model have?

hint: use the `model.parameters()` method to iterate over all the model's parameters
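
As a sanity check, the count can also be worked out by hand. The sketch below assumes the architecture described above (one hidden `Linear` layer of 128 units on a 512 × 512 input, followed by an output layer with 5 classes). The helper `linear_parameter_count` is a made-up name for illustration.

```python
def linear_parameter_count(in_features: int, out_features: int) -> int:
    """Weights (in_features × out_features) plus one bias per output unit."""
    return in_features * out_features + out_features


hidden = linear_parameter_count(512 * 512, 128)  # hidden layer
output = linear_parameter_count(128, 5)  # output layer, 5 classes
total = hidden + output
print(f"{total:_}")  # 33_555_205
```

Almost all parameters sit in the first layer, which is one reason fully connected networks scale poorly with image size.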


In [None]:
def count_parameters(model, trainable=True):
    return sum(
        reduce(operator.mul, p.shape, 1)  # or p.numel()
        for p in model.parameters()
        if p.requires_grad == trainable
    )


print(f"Your model uses {count_parameters(mlp_model):_} trainable parameters")

### Train the model


PyTorch does not provide a ready-to-use training loop like TensorFlow's Keras API does.
We will implement it ourselves.

We first implement training for one epoch, i.e. one full pass over the dataloader.
It will take the model, the dataloader, a loss function, an optimizer and a device to run on.

hint: the PyTorch [documentation](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#the-training-loop) on training loops can help.


In [24]:
def train_one_epoch(
    model: nn.Module,
    dataloader: data.DataLoader,
    loss_fn: nn.Module,
    optimizer: torch.optim.Optimizer,  # type: ignore
    device: torch.device,
) -> float:
    """This function should train the model for one epoch and return the average loss"""

    raise NotImplementedError

In [25]:
# @title
def train_one_epoch(
    model: nn.Module,
    dataloader: data.DataLoader,
    loss_fn: nn.Module,
    optimizer: torch.optim.Optimizer,  # type: ignore
    device: torch.device,
) -> float:
    """This function should train the model for one epoch and return the average loss"""
    model.train()
    model.to(device)

    epoch_loss = 0.0
    with tqdm.tqdm(desc="Training", total=len(dataloader)) as pbar:
        for i, batch in enumerate(dataloader):
            images, labels = batch.images.to(device), batch.labels.to(device)

            optimizer.zero_grad()  # Reset gradients
            outputs = model(images)  # Compute model's predictions
            loss = loss_fn(outputs, labels)  # Compute the loss

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

            pbar.set_postfix(loss=epoch_loss / (i + 1))
            pbar.update(1)
    mean_loss = epoch_loss / len(dataloader)
    print(f"Training loss (↓): {mean_loss:.4f}")
    return mean_loss

We also need to implement an evaluation method that evaluates the model's performance on a test or validation set.

It should compute the average loss and a performance metric that we will use to compare models.
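
One subtlety when averaging a metric over batches: if the last batch is smaller than the others, the mean of per-batch accuracies is not exactly the accuracy over the whole dataset. The sketch below illustrates this in plain Python with made-up numbers. Both function names are hypothetical.

```python
def mean_of_batch_accuracies(correct_per_batch, size_per_batch):
    """Average the per-batch accuracies, giving every batch the same weight."""
    accuracies = [c / n for c, n in zip(correct_per_batch, size_per_batch)]
    return sum(accuracies) / len(accuracies)


def overall_accuracy(correct_per_batch, size_per_batch):
    """Accuracy over all samples, giving every sample the same weight."""
    return sum(correct_per_batch) / sum(size_per_batch)


# Made-up example: two full batches of 16 and a last batch of only 4 samples.
correct = [8, 8, 4]
sizes = [16, 16, 4]
batch_mean = mean_of_batch_accuracies(correct, sizes)
overall = overall_accuracy(correct, sizes)
print(f"{batch_mean:.4f} vs {overall:.4f}")  # 0.6667 vs 0.5556
```

With many batches of equal size the difference is usually small, so a per-batch average can be a reasonable approximation.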


In [26]:
def evaluate(
    model: nn.Module,
    dataloader: data.DataLoader,
    loss_fn: nn.Module,
    metric_fn: nn.Module,
    device: torch.device,
) -> tuple[float, float]:
    """This function should evaluate the model on the dataset and return the average loss and metric"""

    raise NotImplementedError

In [27]:
def evaluate(
    model: nn.Module,
    dataloader: data.DataLoader,
    loss_fn: nn.Module,
    metric_fn: nn.Module,
    device: torch.device,
    dataset_name: str = "validation",
) -> tuple[float, float]:
    """This function should evaluate the model on the dataset and return the average loss and metric"""
    model.eval()
    model.to(device)

    epoch_loss = 0.0
    epoch_metric = 0.0
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader, desc="Evaluation"):
            images, labels = batch.images.to(device), batch.labels.to(device)

            outputs = model(images)
            loss = loss_fn(outputs, labels)
            metric = metric_fn(outputs.argmax(dim=-1), labels)

            epoch_loss += loss.item()
            epoch_metric += metric.item()

        mean_loss = epoch_loss / len(dataloader)
        print(f"{dataset_name.capitalize()} loss (↓): {mean_loss:.4f}")
        mean_metric = epoch_metric / len(dataloader)
        print(f"{dataset_name.capitalize()} metric (↑): {mean_metric:.4f}")
        return mean_loss, mean_metric


Let's now implement the outer loop that trains the model over several epochs.

After each epoch, we want to check the model's performance on the validation set.

More sophisticated training procedures might include saving model checkpoints, adjusting the learning rate or reporting to a dashboard.
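
As an illustration of two of these ideas, the sketch below keeps a copy of the best weights seen so far and lowers the learning rate when the validation loss stops improving. It is a minimal sketch, not the notebook's training loop: the tiny model and the list of validation losses are placeholders standing in for a real model and for the values returned by an evaluation function.

```python
import copy

import torch
from torch import nn

# Tiny stand-in model and optimizer (hypothetical, for illustration only).
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate as soon as the monitored loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0
)

best_loss = float("inf")
best_state = None

# Made-up validation losses, one per epoch.
for validation_loss in [1.0, 0.8, 0.9, 0.95]:
    if validation_loss < best_loss:
        best_loss = validation_loss
        # Deep copy: state_dict() returns references to the live tensors.
        best_state = copy.deepcopy(model.state_dict())
    scheduler.step(validation_loss)

model.load_state_dict(best_state)  # restore the best weights seen so far
print(best_loss, optimizer.param_groups[0]["lr"])
```

To persist the checkpoint on disk, `torch.save(best_state, some_path)` could be used instead of keeping the copy in memory.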


In [28]:
def train(
    model: nn.Module,
    train_dataloader: data.DataLoader,
    validation_dataloader: data.DataLoader,
    loss_fn: nn.Module,
    metric_fn: nn.Module,
    optimizer: torch.optim.Optimizer,  # type: ignore
    device: torch.device,
    n_epochs: int = 10,
) -> tuple[list[float], list[float], list[float]]:
    """This function should train the model for n_epochs epochs and return the training losses, validation losses and validation metrics"""
    for epoch in range(n_epochs):
        print(f"Epoch {epoch + 1}/{n_epochs}")
        # Train the model here

        # Evaluate the model here
    raise NotImplementedError


In [29]:
# @title


def train(
    model: nn.Module,
    train_dataloader: data.DataLoader,
    validation_dataloader: data.DataLoader,
    loss_fn: nn.Module,
    metric_fn: nn.Module,
    optimizer: torch.optim.Optimizer,  # type: ignore
    device: torch.device,
    n_epochs: int = 10,
) -> tuple[list[float], list[float], list[float]]:
    """This function trains the model for n_epochs epochs and returns the training losses, validation losses and validation metrics"""
    train_losses = []
    validation_losses = []
    validation_metrics = []

    for epoch in range(n_epochs):
        print(f"Epoch {epoch + 1}/{n_epochs}")
        train_loss = train_one_epoch(
            model, train_dataloader, loss_fn, optimizer, device
        )
        train_losses.append(train_loss)

        validation_loss, validation_metric = evaluate(
            model, validation_dataloader, loss_fn, metric_fn, device
        )
        validation_losses.append(validation_loss)
        validation_metrics.append(validation_metric)

    return train_losses, validation_losses, validation_metrics

In [None]:
import torchmetrics

train_loader = data.DataLoader(
    DocumentImageDataset(train_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=True,
)
validation_loader = data.DataLoader(
    DocumentImageDataset(validation_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=False,
)

device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

selected_model = mlp_model

optimizer = torch.optim.Adam(selected_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
metric_fn = torchmetrics.Accuracy(task="multiclass", num_classes=NUM_CLASSES).to(device)

n_epochs = 5

hist = train(
    selected_model,
    train_loader,
    validation_loader,
    loss_fn,
    metric_fn,
    optimizer,
    device,
    n_epochs=n_epochs,
)

Plot the losses and the accuracies on 2 different subplots to observe how the training went.


In [None]:
plt.figure(figsize=(12, 6))

# First subplot
plt.subplot(1, 2, 1)
# Subplot code here

# Second subplot
plt.subplot(1, 2, 2)
# Subplot code here

plt.show()

In [None]:
# @title

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)

plt.plot(hist[0], label="Training loss")
plt.plot(hist[1], label="Validation loss")
plt.legend()
plt.yscale("log")
plt.ylabel("Loss")
plt.xlabel("Epoch")

plt.subplot(1, 2, 2)
plt.plot(hist[2], label="Validation accuracy")
plt.legend()
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.xlabel("Epoch")

# figure title
plt.suptitle("Training history")
plt.show()

### Evaluation on the test set

Now evaluate the model on the remaining test set and store its accuracy.


In [None]:
# @title
test_loader = data.DataLoader(
    DocumentImageDataset(test_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=False,
)

test_loss, test_metric = evaluate(
    mlp_model, test_loader, loss_fn, metric_fn, device, dataset_name="test"
)

Are these values different from their training counterparts?


## Convolutional Neural Networks (CNN)


### Training from scratch


Create a model alternating convolution and max pooling layers. You can add some fully connected layers between the last convolutional layer and the output layer. Start with a shallow network (4 or 5 convolution layers) and progressively move to deeper architectures:
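
When a fully connected layer follows the convolutional part, its input size must equal channels × height × width of the last feature map. The sketch below computes that size for a hypothetical stack of three blocks (3×3 convolution with padding 1, then 2×2 max pooling) applied to a 512 × 512 single-channel image. The helper `conv_output_size` is a made-up name and assumes a dilation of 1.

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 512  # input height and width
channels = 1  # grayscale input
for out_channels in [16, 32, 64]:  # hypothetical channel widths
    size = conv_output_size(size, kernel_size=3, padding=1)  # 3×3 conv keeps the size
    size = conv_output_size(size, kernel_size=2, stride=2)  # 2×2 max pooling halves it
    channels = out_channels

flattened = channels * size * size
print(size, flattened)  # 64 262144
```

If the architecture changes (more blocks, different strides or padding), the same loop can be adapted to find the new input size of the first `nn.Linear` layer.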


In [None]:
# @title

conv_model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Dropout(0.3),
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Dropout(0.3),
    nn.Flatten(start_dim=1),
    nn.Linear(64 * 64 * 64, NUM_CLASSES),  # 64 channels × 64 × 64 after three 2×2 poolings
)

print(f"Your model uses {count_parameters(conv_model):_} trainable parameters")


Train the CNN model


In [None]:
# @title

train_loader = data.DataLoader(
    DocumentImageDataset(train_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=True,
)
validation_loader = data.DataLoader(
    DocumentImageDataset(validation_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=False,
)

device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

selected_model = conv_model

optimizer = torch.optim.Adam(selected_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
metric_fn = torchmetrics.Accuracy(task="multiclass", num_classes=NUM_CLASSES).to(device)

n_epochs = 5

hist = train(
    selected_model,
    train_loader,
    validation_loader,
    loss_fn,
    metric_fn,
    optimizer,
    device,
    n_epochs=n_epochs,
)

### Evaluation on the test set

How does it compare with the MLP model?
What is the best accuracy you can get?


In [None]:
# @title
test_loader = data.DataLoader(
    DocumentImageDataset(test_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=False,
)

test_loss, test_metric = evaluate(
    conv_model, test_loader, loss_fn, metric_fn, device, dataset_name="test"
)

### Using a pre-trained model

Download a pre-trained vision model from the [PyTorch hub](https://pytorch.org/vision/stable/models.html#using-the-pre-trained-models).

E.g. ResNet


In [None]:
# @title

from torchvision.models import list_models

print(list_models())

# Note: the variable is named `resnet`, but the model loaded here is EfficientNet-B1
resnet = torch.hub.load("pytorch/vision", "efficientnet_b1", pretrained=True)
resnet

By default, the loaded model does not make predictions for our problem.

We need to slightly modify its output to fit our requirements.

Models trained on ImageNet expect color images with 3 channels instead of 1.
We can either modify the first convolution layer of the model to accept a single channel,
or repeat our input image 3 times along the channel dimension. The latter could be done in a new `collate_fn`.
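
A minimal sketch of the channel repetition option is shown below. It assumes a batch of shape (B, 1, H, W) and only handles the channel dimension. The helper name `to_three_channels` is made up, and any ImageNet-style normalization is left out.

```python
import torch


def to_three_channels(images: torch.Tensor) -> torch.Tensor:
    """Repeat a (B, 1, H, W) grayscale batch into a (B, 3, H, W) batch."""
    return images.repeat(1, 3, 1, 1)


gray_batch = torch.rand(2, 1, 8, 8)  # small placeholder batch
rgb_batch = to_three_channels(gray_batch)
print(rgb_batch.shape)  # torch.Size([2, 3, 8, 8])
```

Such a helper could be applied to the stacked images inside a custom `collate_fn`, in which case the shape assertions on the batch would need to expect 3 channels.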


In [None]:
resnet.features[0][0]

In [45]:
# @title


# For resnet18
# ## convert input to grayscale
# resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# resnet.conv1.weight = nn.Parameter(resnet.conv1.weight.sum(dim=1, keepdim=True) / 3)

# ## new classification head
# resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)


# For efficientnet_b1
## convert input to grayscale
old_weights = nn.Parameter(resnet.features[0][0].weight.sum(dim=1, keepdim=True) / 3)
resnet.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1, bias=False)
resnet.features[0][0].weight = old_weights

resnet.classifier[1] = nn.Linear(resnet.classifier[1].in_features, NUM_CLASSES)

We advise freezing all parameters except those of the last convolution layer and the new classification head.

This reduces the memory required to train the model and ensures that the features the pre-trained model learned to extract are not modified by the fine-tuning.
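
The effect of freezing can be checked by counting trainable and frozen parameters. The sketch below uses a toy two-layer model as a stand-in for a pre-trained network (the layer sizes are arbitrary) and also shows one way to hand only the trainable parameters to the optimizer.

```python
import torch
from torch import nn

toy_model = nn.Sequential(
    nn.Linear(8, 4),  # stands in for the pre-trained backbone
    nn.ReLU(),
    nn.Linear(4, 2),  # stands in for the new classification head
)

# Freeze everything, then unfreeze only the head.
for param in toy_model.parameters():
    param.requires_grad = False
for param in toy_model[2].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)
print(trainable, frozen)  # 10 36

# Optionally pass only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in toy_model.parameters() if p.requires_grad], lr=1e-4
)
```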


In [46]:
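# Freeze every parameter of the pre-trained model
for param in resnet.parameters():
    param.requires_grad = False

# For efficientnet_b1: unfreeze the last convolution layer and the new head
# (for resnet18, use resnet.layer4 and resnet.fc instead)
for param in resnet.features[-1].parameters():
    param.requires_grad = True

for param in resnet.classifier.parameters():
    param.requires_grad = True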

In [None]:
import torchmetrics

train_loader = data.DataLoader(
    DocumentImageDataset(train_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=True,
)
validation_loader = data.DataLoader(
    DocumentImageDataset(validation_dataset),
    batch_size=16,
    collate_fn=collate_fn,
    shuffle=False,
)

device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

selected_model = resnet

optimizer = torch.optim.Adam(selected_model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
metric_fn = torchmetrics.Accuracy(task="multiclass", num_classes=NUM_CLASSES).to(device)

n_epochs = 5

hist = train(
    selected_model,
    train_loader,
    validation_loader,
    loss_fn,
    metric_fn,
    optimizer,
    device,
    n_epochs=n_epochs,
)