## TrOCR PyTorch Fine-tuning with a Custom Dataset

This Jupyter notebook was used to fine-tune the `microsoft/trocr-large-stage1` base model. (I don't use the handwritten fine-tuned version, to avoid language mistakes.)

I used the same dataset as the DETR project but, instead of downloading it in COCO JSON format, I downloaded it in XML format and parsed it with the "xml_workbench.ipynb" notebook.

Unlike the DETR dataset, this dataset contains all the original labels (~790 labels).

Each dataset entry has two fields: the image path (read as `image` in the code below) and the "label" (as text).
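
As a rough sketch of that structure, the snippet below parses a small metadata list with the `image` and `label` keys that the training loop reads. The file names and label strings are invented for illustration; the real `train_metadata.json` is not shown in this notebook.

```python
import json

# Hypothetical contents of a metadata file: a list of records, each with
# a relative image path and its transcription (values are made up).
raw = """
[
  {"image": "images/sample_0001.png", "label": "hello"},
  {"image": "images/sample_0002.png", "label": "world"}
]
"""

records = json.loads(raw)
image_paths = [r["image"] for r in records]
labels = [r["label"] for r in records]
print(image_paths)
print(labels)
```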

I wrote the training loop "manually" because of GPU memory problems (out of memory). To solve them, I implemented the loop from scratch, based on causal LM fine-tuning (SFTTrainer).

Basically, I used the "Right Shift" technique.

The input for the encoder is the pixel values; the input for the decoder is the target text including the BOS token and excluding the EOS token, with padding.

To illustrate:

* Original text: `This is the text`
* Decoder input: `<bos> This is the text`
* Loss target: `This is the text <eos>`
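
The same idea can be sketched on plain token-id lists. The special-token ids and the stand-in ids for the words are assumptions chosen for illustration (the real ids come from the tokenizer), and `make_decoder_pair` is a hypothetical helper, not part of the notebook.

```python
# Illustrative special-token ids (assumed; read the real ones from the tokenizer).
BOS, PAD, EOS = 0, 1, 2


def make_decoder_pair(token_ids, max_len):
    """Return (decoder_input, target), both right-padded to max_len."""

    def pad(seq):
        return seq + [PAD] * (max_len - len(seq))

    decoder_input = [BOS] + token_ids  # <bos> This is the text
    target = token_ids + [EOS]         # This is the text <eos>
    return pad(decoder_input), pad(target)


text_ids = [11, 12, 13, 14]  # stand-in ids for "This is the text"
decoder_input, target = make_decoder_pair(text_ids, max_len=6)
print(decoder_input)  # [0, 11, 12, 13, 14, 1]
print(target)         # [11, 12, 13, 14, 2, 1]
```

At every position, the target holds the token that should follow the decoder input seen so far, which is what the loss compares against the logits.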

Author: Rodrigo Alvarez

In [1]:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
from clearml import Task
import os
from tqdm.auto import tqdm
import json
import torch



In [2]:
# Optional: use ClearML (running locally with Docker) to log the metrics
%env CLEARML_WEB_HOST=http://localhost:8080
%env CLEARML_API_HOST=http://localhost:8008
%env CLEARML_FILES_HOST=http://localhost:8081
%env CLEARML_API_ACCESS_KEY=AEBY191O3R1U4SGBDPLA
%env CLEARML_API_SECRET_KEY=OVvAzcKHtSfqP95jjMHgmgAvzDcSKIKRt5wv1hE1PerO5D3uiT
%env CLEARML_LOG_MODEL=False

env: CLEARML_WEB_HOST=http://localhost:8080
env: CLEARML_API_HOST=http://localhost:8008
env: CLEARML_FILES_HOST=http://localhost:8081
env: CLEARML_API_ACCESS_KEY=AEBY191O3R1U4SGBDPLA
env: CLEARML_API_SECRET_KEY=OVvAzcKHtSfqP95jjMHgmgAvzDcSKIKRt5wv1hE1PerO5D3uiT
env: CLEARML_LOG_MODEL=False


In [3]:
HF_CACHE = "/home/ralvarez22/Documentos/llm_data/llm_cache"
TROCR_MODEL = "/home/ralvarez22/Documentos/llm_data/llm_cache/models--microsoft--trocr-large-stage1/snapshots/3c8ead8dfda428d914334169380bb546f770a300"

DATASET_PATH = "../hand-cursive-trocr"

METADATA_FILE = "train_metadata.json"

In [4]:
# Prepare the processor and the model
processor = TrOCRProcessor.from_pretrained(TROCR_MODEL, cache_dir=HF_CACHE, device_map="cuda")
model = VisionEncoderDecoderModel.from_pretrained(TROCR_MODEL, cache_dir=HF_CACHE, device_map="cuda")
# None of the tutorials I found modify the config of the processor and model
# This configuration sets the special tokens so that training and inference work correctly
# Please make sure to set the decoder_start_token_id to the tokenizer bos_token_id
# In some cases, the bos_token_id is the same as the eos_token_id. This results in NO generation, because decoding starts with the end-of-sequence token
model.generation_config.decoder_start_token_id = processor.tokenizer.bos_token_id
model.config.decoder.bos_token_id = processor.tokenizer.bos_token_id
model.config.decoder.decoder_start_token_id = processor.tokenizer.bos_token_id
model.config.decoder.eos_token_id = processor.tokenizer.eos_token_id
model.config.decoder.pad_token_id = processor.tokenizer.pad_token_id
model.config.encoder.bos_token_id = processor.tokenizer.bos_token_id
model.config.encoder.decoder_start_token_id = processor.tokenizer.bos_token_id
model.config.encoder.eos_token_id = processor.tokenizer.eos_token_id


  return self.fget.__get__(instance, owner)()
Config of the encoder: <class 'transformers.models.vit.modeling_vit.ViTModel'> is overwritten by shared encoder config: ViTConfig {
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 1024,
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 16,
  "num_channels": 3,
  "num_hidden_layers": 24,
  "patch_size": 16,
  "qkv_bias": false,
  "transformers_version": "4.46.3"
}

Config of the decoder: <class 'transformers.models.trocr.modeling_trocr.TrOCRForCausalLM'> is overwritten by shared decoder config: TrOCRConfig {
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "add_cross_attention": true,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
 

In [5]:
BATCH_SIZE = 5 # Modify in case of CUDA OUT OF MEMORY
MODEL_USED = "Trocr Large Stage 1" # Name of the model used, this for logs
CKP_PATH = "../finetuned/trocr"
FINAL_MODEL_PATH = "../finetuned/trocr"
MODEL_CODENAME = "Terminus" # Model Codename versioning
MODEL_VERSION = 1
SAVE_CKP_EVERY = 20
MAX_ITEMS = -1
EPOCHS = 5 # I use this value because it was only a proof-of-concept test. With more epochs, the accuracy should (in theory) be better
LR = 1e-5 # All the tutorials recommend 4e-5 or 5e-5, but with those values I couldn't get a good model: it stopped learning around epoch 20 or 25 and the loss curve began to rise instead of going down

In [6]:
# A function to manually chunk the data
def divide_chunks(l, n):
    # looping till length l 
    for i in range(0, len(l), n):  
        yield l[i:i + n] 
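
A small usage sketch of the chunking helper, with the function repeated so the snippet runs on its own. The point to notice is that the last chunk can be shorter than the requested size, so any code that consumes the batches has to cope with a smaller (possibly single-item) final batch.

```python
def divide_chunks(items, n):
    """Yield consecutive slices of `items` with at most `n` elements each."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


samples = list(range(12))  # stand-in for 12 dataset records
batches = list(divide_chunks(samples, 5))
print([len(b) for b in batches])  # [5, 5, 2]
print(batches[-1])                # [10, 11]
```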

In [7]:
# Load the metadata file
with open(os.path.join(DATASET_PATH, METADATA_FILE), "r") as f:
    dataset_metadata = json.load(f)

In [8]:
if MAX_ITEMS > 0:
    dataset_metadata = dataset_metadata[:MAX_ITEMS]

In [9]:
# Create the chunks
chunked_dataset = list(divide_chunks(dataset_metadata, BATCH_SIZE))

In [10]:
log_info = {
    "type": "TROCR Cursive Handwritten",
    "codename": MODEL_CODENAME,
    "version": MODEL_VERSION,
    "epochs": EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LR,
    "dataset": "Handwritten App V1",
    "model": MODEL_USED
}

In [11]:
trocr_total_params = sum(p.numel() for p in model.parameters())
trocr_train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total params: {}\nTrainable params: {} M".format(trocr_total_params / 1e6, trocr_train_params/ 1e6))
log_info["total_params"] = trocr_total_params
log_info["trainable_params"] = trocr_train_params

Total params: 609.169408
Trainable params: 609.169408 M
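
The parameter count gives a rough feel for why GPU memory is tight. The estimate below is back-of-the-envelope arithmetic under stated assumptions: float32 values (4 bytes each) and four per-parameter copies (weights, gradients, and the two moment buffers that AdamW keeps). Activations and framework overhead are not counted, so the real footprint is higher.

```python
total_params = 609_169_408  # value printed by the cell above
bytes_per_value = 4         # assumption: float32
copies = 4                  # assumption: weights + gradients + 2 AdamW moment buffers

total_bytes = total_params * bytes_per_value * copies
print(f"{total_bytes / 2**30:.2f} GiB")  # 9.08 GiB
```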


In [12]:
tsk_name = "{}_V{}".format(MODEL_CODENAME, str(MODEL_VERSION))
task = Task.init(task_name=tsk_name, project_name="HandCursive-I")
task.set_parameters(log_info)

ClearML Task: created new task id=18714187cb0b475282f708791d8c99a3
2025-01-17 17:49:18,733 - clearml.Task - INFO - Storing jupyter notebook directly as code
CLEARML-SERVER new package available: UPGRADE to v2.0.0 is recommended!
Release Notes:
### Breaking Changes

MongoDB major version was upgraded from v5.x to 6.x.
Please note that if your current ClearML Server version is smaller than v1.17 (where MongoDB v5.x was first used), you'll need to first upgrade to ClearML Server v1.17.
#### Upgrading to ClearML Server v1.17 from a previous version
- If using docker-compose,  use the following docker-compose files:
  * [docker-compose file](https://github.com/allegroai/clearml-server/blob/2976ce69cc91550a3614996e8a8d8cd799af2efd/upgrade/1_17_to_2_0/docker-compose.yml)
  * [docker-compose file foe Windows](https://github.com/allegroai/clearml-server/blob/2976ce69cc91550a3614996e8a8d8cd799af2efd/upgrade/1_17_to_2_0/docker-compose-win10.yml)

### New Features

- New look and feel: Full light/

In [13]:
# Prepare the Loss Function (CrossEntropy) and the Optimizer (AdamW)
# I set ignore_index to the tokenizer pad token so that padded positions do not contribute to the loss
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=processor.tokenizer.pad_token_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
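
To make the effect of `ignore_index` concrete, here is a framework-free sketch of a mean cross-entropy that skips positions whose target equals the ignored id. It is written in plain Python so the arithmetic is visible; it is an illustration of the idea, not PyTorch's implementation, and the logits and ids are made up.

```python
import math


def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(v) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count


PAD = 1  # illustrative pad id
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [0, 2, PAD]  # the last position is padding
loss = cross_entropy(logits, targets, ignore_index=PAD)

# Changing the logits at the padded position leaves the loss untouched.
other = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [-3.0, 9.0, 1.0]]
same_loss = cross_entropy(other, targets, ignore_index=PAD)
print(loss, same_loss)
```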

In [14]:
model_name = "{}/V_{}".format(MODEL_CODENAME, MODEL_VERSION)
epochs_path = os.path.join(CKP_PATH, model_name)
print("Saving checkpoints to {}".format(epochs_path))
os.makedirs(epochs_path, exist_ok=True)

Saving checkpoints to ../finetuned/trocr/Terminus/V_1


In [15]:
# Auxiliary function to open the images and load the pixel values
def load_and_process_images(images_chunk, troc_proc):
    proc_chunk = []
    for x in images_chunk:
        proc_chunk.append(
            troc_proc(
                Image.open(os.path.join(DATASET_PATH, x)).convert("RGB"),
                return_tensors="pt",
            ).pixel_values.to("cuda")
        )
    # Each item has shape (1, C, H, W); squeeze only that singleton dimension so that
    # a chunk with a single image keeps its batch dimension
    return torch.stack(proc_chunk, 0).squeeze(1)

In [16]:
# The "training step" function
def train_epoch(trocr_model: VisionEncoderDecoderModel, dataset, ls_fn, optim):
    losses = 0  # Accumulation of loss for every epoch
    for chunk in dataset:
        chunk_images = [x["image"] for x in chunk]
        labels = [x["label"] for x in chunk]
        chunk_images = load_and_process_images(
            chunk_images, processor
        )  # Process the batch images and get the batch pixels
        # Tokenize the labels
        labels = processor.tokenizer(
            labels, add_special_tokens=True, return_tensors="pt", padding=True
        )["input_ids"].to("cuda")
        # Clone the labels to avoid modifications in the original tensor
        input_labels = labels.clone()
        # Convert the EOS token to a padding token
        input_labels = torch.where(
            input_labels == processor.tokenizer.eos_token_id,
            processor.tokenizer.pad_token_id,
            input_labels,
        )
        # The targets drop the first (BOS) token, so I append one padding column to preserve the dimensions
        to_concat = torch.full(
            (labels.shape[0], 1),
            processor.tokenizer.pad_token_id,
            dtype=torch.long,
            device="cuda",
        )
        # These are the shifted labels (the loss targets)
        shifted_labels = torch.cat((labels[:, 1:], to_concat), dim=1)
        # Create the attention mask for the decoder
        # The mask is 0 for pad tokens (positions to ignore) and 1 for all other values
        shifted_mask = torch.where(
            shifted_labels == processor.tokenizer.pad_token_id, 0, 1
        ).to("cuda")
        # Run the forward pass to get the logits
        logits = trocr_model(
            pixel_values=chunk_images,
            decoder_input_ids=input_labels,
            decoder_attention_mask=shifted_mask,
        ).logits
        # Flatten the logits to (batch * seq_len, vocab_size) and the targets to (batch * seq_len,)
        loss = ls_fn(
            logits.contiguous().view(-1, trocr_model.config.decoder.vocab_size),
            shifted_labels.contiguous().view(-1),
        )
        # Get the loss item
        loss_item = loss.item()
        # Reset the gradients
        optim.zero_grad()
        # Backpropagate and update the weights
        loss.backward()
        optim.step()
        losses += loss_item
    return losses / len(dataset)
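
One detail of the training step is that the decoder attention mask is computed from the shifted labels but applied to the decoder input. The sketch below replays the same three transformations on plain lists to check that the two line up. The token ids are illustrative assumptions, not the tokenizer's real values.

```python
BOS, PAD, EOS = 0, 1, 2  # illustrative special-token ids

# Tokenized labels as the tokenizer would return them: <bos> ... <eos> <pad>*
labels = [
    [BOS, 11, 12, 13, EOS],
    [BOS, 21, EOS, PAD, PAD],
]

# Decoder input: the labels with EOS replaced by PAD
decoder_input = [[PAD if t == EOS else t for t in row] for row in labels]
# Targets: drop the first token and append one PAD to keep the length
targets = [row[1:] + [PAD] for row in labels]
# Mask: 0 where the target is PAD, 1 elsewhere
mask = [[0 if t == PAD else 1 for t in row] for row in targets]

print(decoder_input)  # [[0, 11, 12, 13, 1], [0, 21, 1, 1, 1]]
print(targets)        # [[11, 12, 13, 2, 1], [21, 2, 1, 1, 1]]
print(mask)           # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 0]]
```

In this example the mask is 1 exactly where the decoder input holds a non-pad token, so attending by the target-derived mask is equivalent to masking the padded decoder inputs.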

In [None]:
logger = task.get_logger()
model.train()
for epoch in tqdm(range(EPOCHS)):
    train_loss = train_epoch(model, chunked_dataset, loss_fn, optimizer)
    if epoch > 0 and epoch % SAVE_CKP_EVERY == 0: # Save every N epochs, but not the 0 epoch
        ckp_path = os.path.join(CKP_PATH, MODEL_CODENAME, "V_{}".format(MODEL_VERSION), "Epoch_{}".format(epoch))
        model.save_pretrained(ckp_path, safe_serialization=True)
        processor.save_pretrained(ckp_path)
    #print(train_loss)
    logger.report_scalar(title='Train Loss', series='Loss', value=train_loss, iteration=epoch)

  0%|          | 0/5 [00:00<?, ?it/s]
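
The checkpoint condition in the loop can be checked in isolation. The helper below, `checkpoint_epochs`, is a hypothetical name that reuses the same test (`epoch > 0 and epoch % SAVE_CKP_EVERY == 0`) to list which epochs would write a checkpoint. With the values set earlier (`EPOCHS = 5`, `SAVE_CKP_EVERY = 20`) no intermediate checkpoint is written, so only the final save applies.

```python
def checkpoint_epochs(epochs, save_every):
    """Return the zero-based epochs at which the loop would save a checkpoint."""
    return [e for e in range(epochs) if e > 0 and e % save_every == 0]


print(checkpoint_epochs(5, 20))   # []
print(checkpoint_epochs(50, 20))  # [20, 40]
```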


In [None]:
os.makedirs(FINAL_MODEL_PATH, exist_ok=True)
final_ckp_file = os.path.join(FINAL_MODEL_PATH, MODEL_CODENAME, "V_{}_final".format(MODEL_VERSION) )
model.save_pretrained(final_ckp_file, safe_serialization=True)
processor.save_pretrained(final_ckp_file)

In [19]:
task.flush()
task.mark_completed()
task.close()

At the end, the metrics were the following:

<img src="./images/trocr_metrics.png" width="800">