<a href="https://colab.research.google.com/github/clemsage/NeuralDocumentClassification/blob/master/skeleton_ocr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Training a classifier on OCR text input


# Imports & Cloning repository


In [2]:
import os
import pickle
import sys
from dataclasses import dataclass
from os import path

import matplotlib.pyplot as plt
import tqdm


In [3]:
class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
NUM_CLASSES = len(class_names)

In [None]:
if not os.path.exists("NeuralDocumentClassification"):
    !git clone https://github.com/thibaultdouzon/NeuralDocumentClassification.git
else:
    !git -C NeuralDocumentClassification pull
sys.path.append("NeuralDocumentClassification")

In [4]:
from src import download_dataset

dataset_path = "dataset"

download_dataset.download_and_extract("all", dataset_path)

In [None]:
with open(path.join(dataset_path, "train.pkl"), "rb") as f:
    train_dataset = pickle.load(f)

with open(path.join(dataset_path, "test.pkl"), "rb") as f:
    test_dataset = pickle.load(f)

with open(path.join(dataset_path, "validation.pkl"), "rb") as f:
    validation_dataset = pickle.load(f)


for split_name, split_dataset in zip(
    ["train", "test", "validation"], [train_dataset, test_dataset, validation_dataset]
):
    print(f"{split_name}_dataset contains {len(split_dataset)} documents")
train_dataset[0].keys()


Each `dataset` object is a `list` of documents. Each document is a `dict` with the following structure:

```json
{
  "id": "Unique document identifier",
  "image": "A PIL.Image object containing the document's image",
  "label": "An integer in [0 .. 4] representing the class of the document",
  "words": "A list of strings (not individual words!) extracted from the image by OCR",
  "boxes": "A list of tuples of numbers giving the position of each string in the document"
}
```
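
To make this structure concrete, here is a minimal sketch on a hypothetical toy dataset (the documents below are invented and the `image` field is omitted). It assumes that `label` is an index into the `class_names` list, which is an assumption and not something the dataset description states.

```python
from collections import Counter

# Hypothetical class list and documents, for illustration only.
class_names = ["email", "form", "handwritten", "invoice", "advertisement"]
toy_dataset = [
    {
        "id": "doc-0",
        "label": 3,
        "words": ["Invoice No. 42", "Total due"],
        "boxes": [(10, 10, 200, 30), (10, 40, 120, 60)],
    },
    {"id": "doc-1", "label": 0, "words": ["Dear all,"], "boxes": [(5, 5, 90, 25)]},
    {"id": "doc-2", "label": 3, "words": ["Amount"], "boxes": [(5, 5, 80, 25)]},
]

# Number of documents per class name (assuming label == index in class_names).
label_counts = Counter(class_names[doc["label"]] for doc in toy_dataset)

# Each OCR string may hold several whitespace-separated words.
n_words = [sum(len(s.split()) for s in doc["words"]) for doc in toy_dataset]

print(label_counts)
print(n_words)
```

The same two expressions can be run on `train_dataset` to check the class balance and the typical document length.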


# Explore the data

Take the time to explore the textual data included in the dataset.


Ideas

- 10 most common words? (hint: Counter)
- Count number of unique words
- Distribution of words (cumulative occurrences plot)


In [None]:
# @title

from collections import Counter

all_texts = [
    [word for sentence in doc["words"] for word in sentence.split()]
    for doc in validation_dataset + test_dataset + train_dataset
]

most_common_words = Counter([w for text in all_texts for w in text])
most_common_words.most_common(10)

In [None]:
# @title

n_unique_words = len({w for text in all_texts for w in text})
n_unique_words

In [None]:
# @title

# Zipf's law

plt.figure(figsize=(10, 5))
plt.plot(
    [c / sum(most_common_words.values()) for w, c in most_common_words.most_common(50)]
)

# put words on xlabel
plt.xticks(
    range(50),
    [w for w, c in most_common_words.most_common(50)],
    rotation=80,
    fontsize=9,
)
plt.ylabel("Word frequency")
plt.title("Word frequency in the dataset")
plt.show()

In [None]:
# @title
from itertools import accumulate

cum_word_occurrences = list(
    accumulate([count for word, count in most_common_words.most_common(n_unique_words)])
)

plt.figure(figsize=(10, 5))
plt.plot(cum_word_occurrences)

plt.xlabel("Rank of the word")
plt.ylabel("Number of occurrences")
plt.title("Cumulative number of occurrences of the most common words")
plt.show()

# Classification with Scikit Learn


In [6]:
import nltk
import sklearn


@dataclass
class TextSample:
    text: str
    label: int

    def __init__(self, document: dict):
        self.text = " ".join(
            [word for sentence in document["words"] for word in sentence.split()]
        )
        self.label = document["label"]


train_samples = [TextSample(doc) for doc in train_dataset]

test_samples = [TextSample(doc) for doc in test_dataset]

validation_samples = [TextSample(doc) for doc in validation_dataset]


## Tokenization and Vectorization

To train models on our problem, we need to convert each text into a vector that represents the document.
Take a look at Scikit-Learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#) and [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
First fit a vectorizer on the training set only, then apply the fitted transformation to each dataset split.

What are the shapes of the resulting vectors? What does each dimension mean?
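
To build intuition for what a count-based vectorizer does, here is a hand-rolled sketch on a hypothetical toy corpus. It is a simplification: it splits on whitespace only, whereas `CountVectorizer` also lowercases the text and applies its own token pattern. The names `vectorize`, `X_train_toy` and `X_test_toy` are illustrative.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
train_texts = ["invoice total due", "dear john invoice attached"]
test_texts = ["total invoice invoice unknown"]

# "Fit": the vocabulary is built from the training texts only.
vocabulary = sorted({word for text in train_texts for word in text.split()})


def vectorize(text):
    """Return one count per vocabulary word; unseen words are ignored."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocabulary]


# "Transform": every split is mapped onto the same vocabulary.
X_train_toy = [vectorize(text) for text in train_texts]
X_test_toy = [vectorize(text) for text in test_texts]

print(vocabulary)
print(X_test_toy)
```

Each row has as many entries as the training vocabulary, and the word `unknown` does not contribute because it never appeared in the training texts.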


In [None]:
# @title

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([sample.text for sample in train_samples])
X_test = vectorizer.transform([sample.text for sample in test_samples])
X_validation = vectorizer.transform([sample.text for sample in validation_samples])

Y_train = [sample.label for sample in train_samples]
Y_test = [sample.label for sample in test_samples]
Y_validation = [sample.label for sample in validation_samples]


X_train.shape, X_test.shape, X_validation.shape
# Each matrix has shape (number of documents, vocabulary size), where the vocabulary
# is the set of unique tokens found in the training set when fitting the vectorizer.
# The value at (i, j) is the number of occurrences of the j-th token in the i-th document.

## Basic Model: Scikit-Learn Classification

Use any Scikit-Learn classification model to train a first text model.
Good first picks: [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) or [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)


In [None]:
# @title

from sklearn.svm import SVC

model = SVC(kernel="linear")
model.fit(X_train, Y_train)


## Evaluate the model

Use Scikit-Learn [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) to evaluate your model
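
As a reminder of what the two most common metrics measure, here is a plain-Python sketch on hypothetical labels. The helper names `accuracy` and `confusion` are illustrative and are not Scikit-Learn functions; the matrix follows the same convention as Scikit-Learn's `confusion_matrix`, with true classes as rows and predicted classes as columns.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels, for illustration only.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

print(accuracy(y_true, y_pred))
print(confusion(y_true, y_pred, n_classes=3))
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.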


In [None]:
# @title

from sklearn.metrics import accuracy_score, confusion_matrix

print("Test")
Y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")
print(confusion_matrix(Y_test, Y_pred))

print("Validation")
Y_pred = model.predict(X_validation)
accuracy = accuracy_score(Y_validation, Y_pred)
print(f"Accuracy on the validation set: {accuracy:.2f}")
print(confusion_matrix(Y_validation, Y_pred))


# Transformers

Done playing with kids' toys.

Most modern AI models build on the [Transformer architecture](https://arxiv.org/pdf/1706.03762). The original research paper is one of the most influential of the last decade.


In [13]:
import transformers

## Tokenization

Transformers usually use a subword tokenizer, i.e. a word _can_ be split into multiple tokens.
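
The idea can be sketched with a greedy longest-match-first split, the strategy used by WordPiece tokenizers. The vocabulary below is a tiny hypothetical one, not the vocabulary of any real model, and the function name `wordpiece` is illustrative.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hello", "world", "token", "##ize", "##s"}

print(wordpiece("tokenize", toy_vocab))
print(wordpiece("tokens", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

A real tokenizer also lowercases, splits punctuation and adds special tokens, which this sketch leaves out.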


In [None]:
# Let's use LayoutLM tokenizer first

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "microsoft/layoutlm-base-uncased"
)

In [None]:
encoding = tokenizer("Hello, world! I can tokenize any sentence.")

for token_id in encoding["input_ids"]:
    print(tokenizer.decode(token_id))

# Note how `tokenize` is encoded as `token ##ize`

## Dataset for LayoutLM

LayoutLM uses both textual and 2D positional information. Here is a new data sample class to work with:
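
LayoutLM expects bounding box coordinates on a 0 to 1000 scale relative to the page size. Whether the boxes in this dataset are already normalized is not stated here, so the sketch below only applies if they are given in pixels. The helper name `normalize_box` and the example page size are assumptions for illustration.

```python
def normalize_box(box, width, height):
    """Rescale a pixel box (left, top, right, bottom) to the 0-1000 range."""
    left, top, right, bottom = box
    return (
        int(1000 * left / width),
        int(1000 * top / height),
        int(1000 * right / width),
        int(1000 * bottom / height),
    )


# Hypothetical box on a 500 x 1000 pixel page.
normalized = normalize_box((50, 100, 250, 150), width=500, height=1000)
print(normalized)
```

With a PIL image, `width` and `height` can be read from `image.size`.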


In [23]:
@dataclass
class TextBoxSample:
    words: list[str]
    boxes: list[tuple[int, int, int, int]]  # (left, top, right, bottom)
    label: int

    def __init__(self, document: dict):
        self.words = []
        self.boxes = []

        # We need to split the words in the sentences and compute the bounding boxes for each word
        for sentence, sentence_box in zip(document["words"], document["boxes"]):
            words = sentence.split()
            self.words.extend(words)

            words_len = [len(word) for word in words]
            box_width = sentence_box[2] - sentence_box[0]

            word_left = sentence_box[0]
            for word_len in words_len:
                word_right = word_left + int(word_len * box_width / len(sentence))
                self.boxes.append(
                    (word_left, sentence_box[1], word_right, sentence_box[3])
                )
                word_left = word_right + int(1 * box_width / len(sentence))

        self.label = document["label"]


train_samples = [TextBoxSample(doc) for doc in train_dataset]

test_samples = [TextBoxSample(doc) for doc in test_dataset]

validation_samples = [TextBoxSample(doc) for doc in validation_dataset]

In [None]:
# The LayoutLM tokenizer does not handle bounding boxes, so we use the LayoutLMv2 tokenizer instead.
# Otherwise we would have to map the bounding boxes to tokens ourselves,
# which is tricky because a word can be split into multiple tokens.

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "microsoft/layoutlmv2-base-uncased"
)

# Use it like this, it can support batched inputs
tokenizer(
    text=train_samples[0].words, boxes=train_samples[0].boxes, padding="max_length"
)

## Batching function

As we did in the vision part, we need to implement a batching function that groups multiple inputs and prepares them to be fed to the model.
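
The core of batching variable-length token sequences is padding them to a common length and recording which positions are real. Here is a plain-Python sketch of that step; the function name `pad_batch`, the padding id `0` and the token ids are hypothetical, and a real tokenizer does this for you when called with `padding=...`.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (or cut) token id lists to one length and build attention masks."""
    target = max_length or max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate sequences that are too long
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids, for illustration only.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch)
```

Once every row has the same length, the lists can be stacked into a single tensor per key.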


In [None]:
# @title


import torch


def collate_fn(
    samples: list[TextBoxSample],
    tokenizer: transformers.LayoutLMv2Tokenizer = tokenizer,
):
    encodings = tokenizer(
        text=[sample.words for sample in samples],
        boxes=[sample.boxes for sample in samples],
        padding="max_length",
        truncation=True,  # cut documents longer than the model's maximum length
        return_tensors="pt",  # return PyTorch tensors
    )
    # Fill the labels with -100, the default ignore value for the loss function
    encodings["labels"] = torch.full_like(encodings["input_ids"], -100)
    # Store each document's class label at the first token position
    encodings["labels"][:, 0] = torch.tensor(
        [sample.label for sample in samples], dtype=torch.long
    )

    return encodings

## Model

The `transformers` library provides both model code and weights. We will use the weights of a model fine-tuned on RVL-CDIP, available from the Hub.
Let's first download the weights and rename the checkpoint file so that the model weights can be loaded.


In [None]:
!git lfs install
!git clone https://huggingface.co/gurvgupta/LayoutLM_rvl-cdip
!mv LayoutLM_rvl-cdip/LayoutLM_rvl-cdip_epoch_50.pt LayoutLM_rvl-cdip/pytorch_model.bin

In [None]:
from transformers.models.layoutlm import LayoutLMForSequenceClassification

model = LayoutLMForSequenceClassification.from_pretrained(
    "./LayoutLM_rvl-cdip", num_labels=NUM_CLASSES, ignore_mismatched_sizes=True
)