# BART Model

This notebook implements a Sumerian-to-English translation task using BART (Bidirectional and Auto-Regressive Transformers). It demonstrates how transfer learning can be applied to low-resource languages by fine-tuning a pre-trained language model on specialized domain data.

The notebook covers the following steps:
- **Data Preparation:** load and preprocess Sumerian-English parallel texts from the SumTablets dataset
- **Model Configuration:** set up a BART model (either base or large version) with appropriate parameters for the translation task
- **Training Pipeline:** implement a complete training workflow using Hugging Face's Transformers library with:
    - Tokenization of source and target texts (padded to a fixed maximum length)
    - Sequence-to-sequence training with teacher forcing
    - Early stopping to prevent overfitting
    - Learning rate scheduling and mixed precision training
    - Evaluation metrics tracking (BLEU, METEOR, ROUGE)
- **Translation Interface:** provide functionality to translate new Sumerian texts using the fine-tuned model
- **Model Saving:** save the trained model and back it up to OneDrive to facilitate deployment
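
As a rough illustration of the teacher-forcing step listed above: during training the decoder is fed the reference translation shifted one position to the right, so that at each step it predicts the next reference token. The sketch below mimics that shift with plain Python lists. The function name `shift_right` and the token IDs are illustrative assumptions, not code from this notebook.

```python
def shift_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels by shifting one step to the right.

    Positions masked with -100 (ignored by the loss) are replaced by the
    padding ID, since -100 is not a valid index for the embedding layer.
    """
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == -100 else tok for tok in shifted]


# Toy IDs (assumed for illustration): 0 = <s>, 2 = </s>, 1 = <pad>
labels = [0, 713, 16, 2, -100, -100]
decoder_inputs = shift_right(labels, pad_token_id=1, decoder_start_token_id=2)
print(decoder_inputs)  # [2, 0, 713, 16, 2, 1]
```

At position `i` the decoder sees `decoder_inputs[:i + 1]` and is trained to predict `labels[i]`, which is why the ground-truth prefix, rather than the model's own previous output, drives each step.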

In [1]:
import os
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BartTokenizer, BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from transformers import GenerationConfig, EarlyStoppingCallback

from datasets import load_dataset, Dataset as HFDataset
from load_dataset import preprocess_dataset
from compute_metrics import compute_metrics


In [2]:
large = False  # Set to True if using the large version of BART

MODEL_NAME = "facebook/bart-base" if not large else "facebook/bart-large"
# Directory to save the fine-tuned model
OUTPUT_DIR = "./bart_model" if not large else "./bart_large_model"
# Directory for TensorBoard logs
LOGGING_DIR = "./bart_logs" if not large else "./bart_large_logs"

# Some hyperparameters
MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 512
BATCH_SIZE = 8
LEARNING_RATE = 1e-5
NUM_TRAIN_EPOCHS = 100

# check if dirs exist, if not create them
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(LOGGING_DIR, exist_ok=True)

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)
model.to(device)

tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)

if not hasattr(model.config, "decoder_start_token_id") or model.config.decoder_start_token_id is None:
    model.config.decoder_start_token_id = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.eos_token_id

In [None]:
preprocessed_train = preprocess_dataset('../datasets/SumTablets_English_train.csv')
preprocessed_val = preprocess_dataset('../datasets/SumTablets_English_validation.csv')
preprocessed_test = preprocess_dataset('../datasets/SumTablets_English_test.csv')

def to_records(df):
    """Convert a preprocessed DataFrame into a list of {'source', 'target'} dicts."""
    return [{
        'source': row['sumerian'],
        'target': row['english']
    } for _, row in df.iterrows()]

train_data = to_records(preprocessed_train)
val_data = to_records(preprocessed_val)
test_data = to_records(preprocessed_test)

Loaded 1907 examples from ../datasets/SumTablets_English_train.csv
Preprocessed dataset contains 1905 examples
Loaded 107 examples from ../datasets/SumTablets_English_validation.csv
Preprocessed dataset contains 107 examples
Loaded 113 examples from ../datasets/SumTablets_English_test.csv
Preprocessed dataset contains 113 examples


In [None]:
def preprocess_function(examples):
    """
    Tokenizes the source (Sumerian) and target (English) texts.
    """
    inputs = examples['source']
    targets = examples['target']

    # DEBUG: Print flag if any of inputs or targets are none
    if any(x is None for x in inputs) or any(x is None for x in targets):
        print("Warning: Found None values in inputs or targets. This may affect training.")


    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True, padding="max_length")

    # Tokenize targets (English) via `text_target`, which replaces the older `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Convert lists to Hugging Face Dataset objects
train_dataset = HFDataset.from_list(train_data)
val_dataset = HFDataset.from_list(val_data)

# Apply preprocessing to the datasets
print("Tokenizing datasets...")
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_val_dataset = val_dataset.map(preprocess_function, batched=True)

Tokenizing datasets...
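
One caveat with `padding="max_length"` on the targets: the padded positions in `labels` keep the tokenizer's padding ID, and the cross-entropy loss only skips positions set to -100. A common remedy, sketched below on plain lists, is to mask the padding before training. The helper name `mask_pad_labels` is hypothetical, and whether the `compute_metrics` module used here converts -100 back to padding IDs before decoding is not shown in this notebook, so treat this as something to verify rather than a drop-in change.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(label_ids, pad_token_id):
    """Replace padding IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch: 1 is BART's <pad> ID, the other IDs are made up.
batch_labels = [[0, 713, 2, 1, 1], [0, 9, 8, 7, 2]]
print(mask_pad_labels(batch_labels, pad_token_id=1))
# [[0, 713, 2, -100, -100], [0, 9, 8, 7, 2]]
```

Inside `preprocess_function`, this would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`.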



In [6]:
print("Example of tokenized input:")
print(tokenized_train_dataset[0])

Example of tokenized input:
{'source': ' 1(u) la₂ 1(diš) udu u₄ 2(u) 8(diš)-kam ki ab-ba-sa₆-ga-ta na-lu₅ i₃-dab₅   iti <unk> bi₂-gu₇ mu en-unu₆-gal {d}inana unu{ki}ga ba-hun  1(u) la₂ 1(diš)', 'target': '9 rams, 28th day, from Abba-saga, Nalu accepted; month: “ubi-feast,” year: “Enunugal of Inanna of Uruk was installed;” (total:) 9 (rams).', 'input_ids': [0, 112, 1640, 257, 43, 897, 24987, 9264, 9264, 112, 1640, 7506, 4654, 43, 1717, 6588, 1717, 24987, 9264, 11936, 132, 1640, 257, 43, 290, 1640, 7506, 4654, 19281, 330, 424, 27651, 4091, 12, 3178, 12, 11146, 24987, 9264, 27819, 12, 2538, 12, 4349, 2750, 12, 6487, 24987, 9264, 5782, 939, 24987, 9264, 862, 12, 417, 873, 24987, 9264, 5782, 1437, 1437, 24, 118, 1437, 3, 4003, 24987, 9264, 9264, 12, 5521, 24987, 9264, 6382, 14701, 1177, 12, 879, 257, 24987, 9264, 27819, 12, 9487, 25522, 417, 24303, 179, 1113, 542, 257, 45152, 3144, 24303, 2538, 17279, 12, 18458, 1437, 112, 1640, 257, 43, 897, 24987, 9264, 9264, 112, 1640, 7506, 4654, 43, 2,

In [None]:
# Set the training arguments for the Seq2SeqTrainer
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,                  # Directory to save the model checkpoints
    
    num_train_epochs=NUM_TRAIN_EPOCHS,      # Number of training epochs
    per_device_train_batch_size=BATCH_SIZE, # Batch size for training
    per_device_eval_batch_size=BATCH_SIZE,  # Batch size for evaluation
    
    learning_rate=LEARNING_RATE,            # Learning rate for the optimizer
    weight_decay=0.01,                      # Weight decay for regularization
    warmup_ratio=0.1,                       # Warmup ratio for learning rate scheduler
    gradient_accumulation_steps=1,          # 1 = no accumulation; raise to simulate larger batch sizes
    lr_scheduler_type="cosine",             # Use cosine learning rate scheduler
    label_smoothing_factor=0.1,             # Label smoothing factor for better generalization

    save_total_limit=1,                     # Keep one checkpoint (plus the best one, since load_best_model_at_end=True)
    predict_with_generate=True,             # Enable generation during evaluation
    report_to="tensorboard",                # Report metrics to TensorBoard
    logging_dir=LOGGING_DIR,                # Directory for TensorBoard logs
    logging_steps=50,                       # Log every 50 steps
    
    eval_strategy="epoch",                  # Evaluate at the end of each epoch
    save_strategy="epoch",                  # Save model at the end of each epoch
    load_best_model_at_end=True,            # Load the best model at the end of training
    metric_for_best_model="meteor",         # Metric for picking the best model; compute_metrics must return a "meteor" key
    fp16=torch.cuda.is_available(),         # Use mixed precision training if GPU is available
)

# Set up generation configuration for the model
generation_config = GenerationConfig(
    max_length=MAX_TARGET_LENGTH,           # Maximum length of the generated sequences
    early_stopping=True,                    # Stop beam search once num_beams finished candidates are found
    num_beams=4,                            # Number of beams for beam search
    no_repeat_ngram_size=3,                 # Prevent repetition of n-grams in the generated text
    forced_bos_token_id=0,                  # Force the beginning of the sequence to be the BOS token
    pad_token_id=tokenizer.pad_token_id,    # Padding token ID for the tokenizer
    eos_token_id=tokenizer.eos_token_id,    # End of sequence token ID for the tokenizer
    decoder_start_token_id=model.config.decoder_start_token_id  # Same start token used to shift labels during training (set above if missing)
)
model.generation_config = generation_config

# Set up the DataCollator for Seq2Seq tasks to handle padding and batching
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Seq2SeqTrainer with the model, training arguments, datasets, tokenizer, data collator, and metrics computation
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=lambda p: compute_metrics(p, tokenizer),        # Function to compute metrics during evaluation
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]    # Early stopping callback to prevent overfitting
    )

# Start the training process
print("Starting model training...")
try:
    trainer.train()
    print("Training finished successfully!")

    # Save the final model and tokenizer
    trainer.save_model(f"{OUTPUT_DIR}/final_model")
    tokenizer.save_pretrained(f"{OUTPUT_DIR}/final_model_tokenizer")
    print(f"Final model saved to {OUTPUT_DIR}/final_model")

except Exception as e:
    print(f"An error occurred during training: {e}")
    raise  # Re-raise so a failed run does not pass silently into later cells
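
To make the `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` settings above more concrete, the sketch below computes the multiplier applied to the base learning rate at each optimizer step: a linear ramp during warmup, then a half-cosine decay to zero. It is a standalone approximation of that schedule shape, not the Trainer's internal code. The function name `lr_multiplier` and the step count in the example are assumptions.

```python
import math


def lr_multiplier(step, total_steps, warmup_ratio=0.1):
    """Return the factor (between 0 and 1) that scales the base learning rate at `step`."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from 1 down to 0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 1e-5      # matches LEARNING_RATE in this notebook
total_steps = 1000  # assumed value, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * lr_multiplier(step, total_steps))
```

With these assumed numbers the rate peaks at step 100, is halved by step 550, and reaches zero at the final step. Because early stopping may end training before `NUM_TRAIN_EPOCHS` is reached, the tail of the decay may never be used in practice.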

In [3]:
model = BartForConditionalGeneration.from_pretrained(f"{OUTPUT_DIR}/final_model")
tokenizer = BartTokenizer.from_pretrained(f"{OUTPUT_DIR}/final_model_tokenizer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate_sumerian_to_english(text, trained_model, trained_tokenizer, device):
    """
    Translates a Sumerian text to English using the fine-tuned model.

    Args:
        text (str): The Sumerian text to translate.
        trained_model (BartForConditionalGeneration): The fine-tuned BART model.
        trained_tokenizer (BartTokenizer): The tokenizer used for the model.
        device (torch.device): The device to run the model on (CPU or GPU).
    Returns:
        str: The translated English text.
    """
    
    # Set model to evaluation mode
    trained_model.eval()
    trained_model.to(device)

    # Prepare the input text
    inputs = trained_tokenizer(text, return_tensors="pt", max_length=MAX_INPUT_LENGTH, truncation=True, padding=True)
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

    # Generate translation
    with torch.no_grad():   # Disable gradient calculations for inference
        outputs = trained_model.generate(
            input_ids,
            attention_mask=attention_mask,      # Use attention mask to ignore padding tokens
            max_length=MAX_TARGET_LENGTH + 2,   # +2 for start/end tokens
            num_beams=5,                        # Beam search width
            early_stopping=True                 # Stop beam search once num_beams finished candidates are found
        )

    # Decode the generated ids to text
    translated_text = trained_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text

testing_data = preprocess_dataset('../datasets/SumTablets_English_test.csv')

for index, row in testing_data.iterrows():
    sumerian_text = row['sumerian']
    english_translation = translate_sumerian_to_english(sumerian_text, model, tokenizer, device)
    true_english_translation = row['english']
    
    print(f"Sumerian: {sumerian_text}")
    print(f"Predicted English: {english_translation}")
    print(f"True English: {true_english_translation}")
    print("-" * 50)

Loaded 113 examples from ../datasets/SumTablets_English_test.csv
Preprocessed dataset contains 113 examples
Sumerian:  ...guruš engar dumu-ni ...ur-mes 1(u) 1(diš) guruš ugula ur-lugal 8(diš) guruš ugula ab-ba-sag₁₀ 6(diš) guruš ugula lugal-ku₃-zu 3(diš) guruš ugula šeš-kal-la 2(diš) guruš ugula lugal-iti-da 4(diš) guruš ugula lu₂-dingir-ra 7(diš) guruš ugula ur-am₃-ma 4(diš) guruš ugula ur-e₂-nun-na  1(geš₂) guruš ugula al-la-igi-še₃-du gurum₂ u₄ 2(diš)-kam ki-su₇ ka-ma-ri₂ gub-ba giri₃ i₃-kal-la iti še-kar-ra-gal₂-la mu {d}šu{d}suen lugal uri₅-ma{ki}...da za-ab-ša-li{ki} mu-hul
Predicted English: n male laborers, plowmen, son of Umes; 11 male laborers: foreman: Ur-lugal; 8 male laborer: Abba-saga; 6 male laborers (from) Lugal-kuzu; 3 male laborers stationed: Šeškalla; 2 male laborers for Lugalitida; 4 male laborers from (the account of) Lu-dingira; 7 male laborers of (the accounts of) Ur-amma; 4 workmen, foreman of Ur-Enunna; 90 male laborers foreman (of) Alla-igiše, the threshing fl
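
Side-by-side printouts like the one above are hard to compare at scale. As a small, self-contained illustration of the idea behind BLEU (one of the metrics tracked during training), the sketch below computes clipped n-gram precision between a prediction and a reference. It is a simplification: real BLEU combines several n-gram orders and applies a brevity penalty, and the `compute_metrics` module used in this notebook is not shown, so this is not its implementation. The function name and the sentences are made-up toy inputs.

```python
from collections import Counter


def ngram_precision(prediction, reference, n=1):
    """Clipped n-gram precision of `prediction` against a single `reference`."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    pred_ngrams = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_ngrams[ngram]) for ngram, count in pred_ngrams.items())
    return matched / total


print(ngram_precision("the cat sat", "the cat sat down", n=1))  # 1.0
print(ngram_precision("the the the", "the cat", n=1))           # 0.3333333333333333
```

The second call shows why clipping matters: repeating a word that appears once in the reference earns credit only once.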

In [5]:
import sys  
sys.path.insert(1, '../utils')

from rclone import update_folder_on_onedrive

In [6]:
update_folder_on_onedrive("bart_model", "bart_model")

Updating 'bart_model' on OneDrive with 'bart_model'...
rclone command: rclone sync bart_model onedrive_bocconi:AI-project/bart_model -P
SUCCESS: Folder updated successfully.
Local folder 'bart_model' has been removed after successful update.


True
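
The `update_folder_on_onedrive` helper is imported from `../utils`, and its source is not shown here. Judging only from the log line above, it wraps an `rclone sync` call. A minimal sketch of how such a command could be assembled is given below. The function name `build_rclone_sync_command` and its parameters are hypothetical; the remote name and base path are copied from the logged command.

```python
import shlex


def build_rclone_sync_command(local_folder, remote_folder,
                              remote="onedrive_bocconi", base_path="AI-project"):
    """Return the argument list for syncing a local folder to an rclone remote."""
    destination = f"{remote}:{base_path}/{remote_folder}"
    # -P shows transfer progress, as in the logged command
    return ["rclone", "sync", local_folder, destination, "-P"]


cmd = build_rclone_sync_command("bart_model", "bart_model")
print(shlex.join(cmd))
# rclone sync bart_model onedrive_bocconi:AI-project/bart_model -P
```

The list can be passed to `subprocess.run(cmd, check=True)`. Since `rclone sync` makes the destination match the source, and the log shows the local folder being removed afterwards, trying the command first with rclone's `--dry-run` flag is a cautious habit.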