# Donut Fine-tuning

This notebook walks through fine-tuning [Donut](https://huggingface.co/naver-clova-ix/donut-base-finetuned-cord-v2), starting from the `naver-clova-ix/donut-base-finetuned-cord-v2` checkpoint.

In [1]:
pip install datasets transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import matplotlib.pyplot as plt
import numpy as np

2025-06-18 19:18:44.184242: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-18 19:18:44.193722: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750285124.203394   66240 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750285124.206382   66240 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750285124.215655   66240 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [4]:
# Prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(device)  # Expected: 'cuda' on a machine with an NVIDIA GPU, 'mps' on Apple Silicon

cuda


In [5]:
import transformers

print(torch.__version__)
print(transformers.__version__)

2.6.0+cu126
4.52.4


# Loading environment variables

In [8]:
from dotenv import load_dotenv
import os

# The .env file is expected one directory above the notebook
env_path = os.path.join("..", ".env")

print(f"🔍 Trying .env path: {env_path}")

# Load environment variables
found = load_dotenv(env_path)
print(f"✅ .env file {'found' if found else 'not found'}")

# Verify loaded variables
print("\n🔎 Environment variables:")
for var in ["PROCESSED_DATA_DIR", "RAW_DATA_DIR", "CHECKPOINT_DIR"]:
    value = os.getenv(var)
    print(f"{var}: {'✅' if value else '❌'} {value}")

img_dir_train = os.path.join(os.getenv("PROCESSED_DATA_DIR"),'train2014')
img_dir_val = os.path.join(os.getenv("PROCESSED_DATA_DIR"),'val2014')
img_dir_test =  os.path.join(os.getenv("PROCESSED_DATA_DIR"),'test2014')
ann_coco_text = os.path.join(os.getenv("RAW_DATA_DIR"),'cocotext.v2.zip')

print(ann_coco_text)

🔍 Trying .env path: ../.env
✅ .env file found

🔎 Environment variables:
PROCESSED_DATA_DIR: ✅ /home/juan/CEIA/vpc3-proyecto/vpc3_proyecto/data/processed
RAW_DATA_DIR: ✅ /home/juan/CEIA/CEIA-ViT/TrabajosPracticos/TP_Final/data/raw
CHECKPOINT_DIR: ✅ /home/juan/CEIA/vpc3-proyecto/vpc3_proyecto/models
/home/juan/CEIA/CEIA-ViT/TrabajosPracticos/TP_Final/data/raw/cocotext.v2.zip
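
The cell above passes `os.getenv(...)` straight into `os.path.join`, which raises a `TypeError` when a variable is unset. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of the project.

```python
import os


def require_env(names, environ=os.environ):
    """Return {name: value} for the given variables; raise if any is unset or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}
```

Calling `require_env(["PROCESSED_DATA_DIR", "RAW_DATA_DIR", "CHECKPOINT_DIR"])` right after `load_dotenv` would report every missing variable at once, before any path is built.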


In [9]:
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Use the tokenizer's pad token as both the decoder start token and the padding token
model.config.decoder_start_token_id = processor.tokenizer.pad_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [10]:

from vpc3_proyecto.model_training.dataset_donut_model import DonutTextDatasetFromCocoTextV2Raw

train_dataset = DonutTextDatasetFromCocoTextV2Raw(img_dir_train, ann_coco_text, processor=processor)
val_dataset = DonutTextDatasetFromCocoTextV2Raw(img_dir_val, ann_coco_text, processor=processor)
test_dataset = DonutTextDatasetFromCocoTextV2Raw(img_dir_test, ann_coco_text, processor=processor)


In [9]:
len(val_dataset)


4697

In [10]:
len(train_dataset)

16440

In [11]:
len(test_dataset)

2348

# Training (fine-tuning)

In [11]:
import os
os.environ["WANDB_DISABLED"] = "true"  # we do not use Weights & Biases

In [12]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=os.path.join(os.getenv("CHECKPOINT_DIR"),"donut-finetuned-coco2014"),
    per_device_train_batch_size=1,  # Minimum possible
    gradient_accumulation_steps=1,  # No accumulation
    fp16=True,                      # Mixed precision
    gradient_checkpointing=True,    # Memory optimization
    optim="adamw_8bit",             # 8-bit optimizer
    eval_strategy="no",
    per_device_eval_batch_size=1,
    save_strategy="epoch",          # Save a checkpoint at the end of each epoch
    logging_steps=50,
    learning_rate=1e-5,
    num_train_epochs=1              # Start with 1 epoch
)





Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
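
With a batch size of 1 and no gradient accumulation, every training sample triggers its own optimizer update. The sketch below estimates how many updates one epoch takes from the dataset size reported earlier (16440 training samples). It assumes a single GPU, and `optimizer_steps_per_epoch` is a hypothetical helper; the exact rounding used inside `Seq2SeqTrainer` may differ when the sample count does not divide evenly.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch_size,
                              gradient_accumulation_steps=1, num_devices=1):
    """Approximate number of optimizer updates in one epoch."""
    samples_per_update = per_device_batch_size * gradient_accumulation_steps * num_devices
    return math.ceil(num_samples / samples_per_update)


# Values from this notebook: 16440 training samples, batch size 1, no accumulation.
steps = optimizer_steps_per_epoch(16440, per_device_batch_size=1)
print(steps)        # 16440 optimizer updates for the single epoch
print(steps // 50)  # 328 log entries with logging_steps=50
```

Raising `gradient_accumulation_steps` increases the effective batch size without raising per-step memory use, which can be relevant on a GPU this small.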


In [13]:
from vpc3_proyecto.model_evaluation.utils import compute_metrics
from functools import partial
from transformers import Seq2SeqTrainer 
compute_metrics_bound = partial(compute_metrics, donut_processor=processor)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_bound,
)


In [14]:
torch.cuda.empty_cache()  # clear the CUDA cache

trainer.train()
# trainer.train(resume_from_checkpoint=True)  # Automatically finds latest checkpoint

`use_cache=True` is incompatible with gradient checkpointing`. Setting `use_cache=False`...
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 7.63 GiB of which 3.00 MiB is free. Process 61934 has 3.56 GiB memory in use. Including non-PyTorch memory, this process has 4.02 GiB memory in use. Of the allocated memory 3.76 GiB is allocated by PyTorch, and 95.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
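
The figures in the error message are worth adding up. The short calculation below uses only numbers copied from the message and suggests that a second process (PID 61934) holds nearly half of the GPU, so the training run never had the full 7.63 GiB available.

```python
# Figures copied from the OutOfMemoryError message above, in GiB.
total_capacity = 7.63
this_process = 4.02    # this notebook, including non-PyTorch memory
other_process = 3.56   # process 61934

# Memory left once both processes are accounted for.
unaccounted = total_capacity - this_process - other_process
# Fraction of the GPU held by the other process.
other_share = other_process / total_capacity

print(f"Left after both processes: {unaccounted * 1024:.0f} MiB")
print(f"Share held by the other process: {other_share:.0%}")
```

Only 95.52 MiB is reserved but unallocated, so the `expandable_segments` setting suggested in the message would probably recover little here; stopping the other process is likely to help more.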

In [24]:

from vpc3_proyecto.model_evaluation.utils import get_last_checkpoint_folder

last_ckpt = get_last_checkpoint_folder(training_args.output_dir)

if last_ckpt:
    # Build the target path only once a checkpoint is known to exist;
    # os.path.join(None, ...) would raise a TypeError.
    manually_saved_folder = os.path.join(last_ckpt, "manually-saved")
    print(f"✅ Last checkpoint: {last_ckpt}")
    processor.save_pretrained(manually_saved_folder)
    trainer.save_model(manually_saved_folder)
    print("✅ Processor saved to the manual-save directory inside the checkpoint: " + manually_saved_folder)
else:
    print("❌ No checkpoints found")

✅ Last checkpoint: ./donut-finetuned-coco2014/checkpoint-15656
✅ Processor saved to the manual-save directory inside the checkpoint: ./donut-finetuned-coco2014/checkpoint-15656/manually-saved
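
The project helper `get_last_checkpoint_folder` is not shown in this notebook. As a rough illustration of what such a lookup might do, the sketch below picks the `checkpoint-<step>` subfolder with the highest step number. The name `find_last_checkpoint` and its behavior are assumptions, not the project's implementation; `transformers.trainer_utils.get_last_checkpoint` offers a similar lookup.

```python
import os
import re

# Trainer checkpoints are saved as "checkpoint-<global step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint subfolder, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare step numbers as integers so checkpoint-1000 beats checkpoint-999.
    return os.path.join(output_dir, max(candidates)[1])
```

Returning `None` when nothing is found matches the `if last_ckpt:` check used in the cell above.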
