# Translation task (en-uk)

Welcome to the 'eng_uk_translation_opus_kde4.ipynb' notebook! 

Here we fine-tune a pretrained model for an English-to-Ukrainian translation task with the Trainer API, aiming for more accurate translations tailored to a specific domain.

Mostly based on the translation chapter of the Hugging Face NLP Course: https://huggingface.co/learn/nlp-course/chapter7/4


In [None]:
!pip install sacrebleu # SacreBLEU, a commonly used metric for benchmarking translation models (not used in most of the other examples, so not present in requirements.txt)

In [1]:
# all imports
from datasets import load_dataset
import evaluate
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from huggingface_hub import notebook_login

import numpy as np
import os

# dataset
raw_datasets = load_dataset("kde4", lang1="en", lang2="uk")

# baseline model checkpoint
model_checkpoint = "Helsinki-NLP/opus-mt-en-uk"

# baseline model
translator = pipeline("translation", model=model_checkpoint)

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt") 

# model to fine-tune
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# data collator (to deal with the padding)
# we use DataCollatorForSeq2Seq because both inputs and labels must be padded, and labels are padded with -100 so the loss ignores those positions
# DataCollatorForSeq2Seq also generates "decoder_input_ids", the shifted versions of the labels, for our model
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) # the collator behaves differently depending on the model architecture, so we pass the model in

# define metrics
metric = evaluate.load("sacrebleu") # SacreBLEU metric, score can go from 0 to 100, and higher is better

Found cached dataset kde4 (C:/Users/SUPERSOKOL/.cache/huggingface/datasets/kde4/en-uk-lang1=en,lang2=uk/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac)
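
The two collator behaviors described in the comments above (padding labels with -100 and building `decoder_input_ids` by shifting the labels right) can be illustrated without any library. This is a minimal pure-Python sketch, not the `DataCollatorForSeq2Seq` implementation; the token ids, `pad_id=0` and `decoder_start_id=0` are hypothetical toy values chosen only for the illustration.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def pad_labels(batch, ignore_index=IGNORE_INDEX):
    """Pad every label sequence to the longest one in the batch with -100."""
    longest = max(len(seq) for seq in batch)
    return [seq + [ignore_index] * (longest - len(seq)) for seq in batch]


def shift_right(labels, decoder_start_id, pad_id, ignore_index=IGNORE_INDEX):
    """Build decoder inputs: prepend the start token and drop the last label.

    -100 is not a real token id, so it is replaced by the pad id.
    """
    shifted = []
    for seq in labels:
        row = [decoder_start_id] + seq[:-1]
        shifted.append([pad_id if tok == ignore_index else tok for tok in row])
    return shifted


# two toy target sequences of different lengths (hypothetical token ids)
labels = pad_labels([[11, 12, 13, 2], [21, 2]])
decoder_input_ids = shift_right(labels, decoder_start_id=0, pad_id=0)

print(labels)
print(decoder_input_ids)
```

At each position the decoder sees the tokens up to the previous label and is trained to predict the current one, which is why the inputs are the labels shifted by one step.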





# Baseline test

In [4]:
# test baseline
original_sentence = raw_datasets["train"][6]['translation']['en']
baseline_translation = translator(original_sentence)[0]['translation_text']
true_translation = raw_datasets["train"][6]['translation']['uk']
print(f'original sentence in English: \n{original_sentence}')
print(f'\n\nbaseline translation: \n{baseline_translation}')
print(f'\n\ntrue translation: \n{true_translation}')


original sentence in English: 
Open a module by clicking its name; a list of submodules will appear. Then, click one of the submodule category names to edit its configuration in the right pane.


baseline translation: 
Відкрити модуль натисканням його назви; з' явиться список підмодулів. Після цього натисніть одну з назв підмодулів, щоб змінити налаштування підкатегорії на правій панелі.


true translation: 
Якщо ви відкриєте модуль наведенням на його позначку вказівника миші з наступним клацанням лівою кнопкою миші, з’ явиться список підмодулів. Після цього вам слід натиснути назву одного підмодулів категорії, щоб отримати доступ до відповідних налаштувань на панелі праворуч.


# Preparations

In [5]:
# Data preprocessing 
# split dataset
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
# rename 'test' to 'validation'
split_datasets["validation"] = split_datasets.pop("test")

max_length = 64    # cap sequences at 64 tokens because the sentences in this dataset are fairly short

# data preprocessing function
def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]       # English sentences
    targets = [ex["uk"] for ex in examples["translation"]]      # Ukrainian sentences
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True     # tokenize; note that the Ukrainian sentences go into the 'text_target' arg
    )
    return model_inputs

# apply preprocessing
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

# Metrics
# metrics computation function for model outputs
def compute_metrics(eval_preds):
    preds, labels = eval_preds # generated token IDs (because predict_with_generate=True) and tokenized true labels
    # In case the model returns more than the predictions
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)         # decode model predictions

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)       # decode true labels

    # Some simple post-processing (remove leading and trailing whitespace)
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    # compute metric
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

Map:   0%|          | 0/210249 [00:00<?, ? examples/s]

Map:   0%|          | 0/23362 [00:00<?, ? examples/s]
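
The label handling inside `compute_metrics` has two details that are easy to miss: -100 must be swapped for a real token id before decoding, and each reference is wrapped in its own list because the metric accepts several references per prediction. The sketch below mimics both steps in pure Python. The vocabulary, the ids, and the `toy_decode` helper are hypothetical stand-ins for the real tokenizer.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical pad token id for this toy vocabulary

# hypothetical toy vocabulary: token id -> text
VOCAB = {0: "<pad>", 1: "</s>", 2: "відкрити", 3: "модуль", 4: "файл"}
SPECIAL_IDS = {0, 1}


def toy_decode(ids):
    """Join the non-special tokens, mimicking skip_special_tokens=True."""
    return " ".join(VOCAB[i] for i in ids if i not in SPECIAL_IDS)


def decode_labels(label_batch):
    # -100 is not a real token id, so swap it for the pad id before decoding
    cleaned = [
        [PAD_ID if tok == IGNORE_INDEX else tok for tok in row]
        for row in label_batch
    ]
    # one list of references per prediction, as in compute_metrics
    return [[toy_decode(row).strip()] for row in cleaned]


# a padded toy batch of labels: the second row ends with -100 padding
references = decode_labels([[2, 3, 4, 1], [2, 4, 1, -100]])
print(references)
```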

# Log in to Hub

In [7]:
notebook_login() # to log in to Hugging Face, so you’re able to upload your results to the Model Hub if you want to


# Final preparation before fine-tuning

In [8]:
# to define our training args we use Seq2SeqTrainingArguments, a subclass of TrainingArguments that contains a few more fields
args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-uk",
    evaluation_strategy="no",       # evaluate only once before and once after training, because evaluation takes a while
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16, # I use small batches because my laptop's GPU is not very powerful
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,     # use generate() during evaluation so BLEU is computed on generated text
    fp16=True,                      # speeds up training on modern GPUs; comment this line out if you don't have a GPU with CUDA
    push_to_hub=True,               # upload the model to the Hub at the end of each epoch; comment this line out if you are not logged in to Hugging Face
)

# next we need to define trainer instance and pass everything to it
trainer = Seq2SeqTrainer(
    model,                                              # model to fine-tune
    args,                                               # training args we defined earlier
    train_dataset=tokenized_datasets["train"],          # train split of preprocessed dataset
    eval_dataset=tokenized_datasets["validation"],      # validation split of preprocessed dataset
    data_collator=data_collator,                        # here we pass our DataCollatorForSeq2Seq instance 
    tokenizer=tokenizer,                                # our tokenizer instance
    compute_metrics=compute_metrics,                    # our custom metrics function
)

q:\SANDBOX\huggingface_nlp_tutorial\marian-finetuned-kde4-en-to-uk is already a clone of https://huggingface.co/SUPERSOKOL/marian-finetuned-kde4-en-to-uk. Make sure you pull the latest changes with `repo.git_pull()`.
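
As a rough sanity check, the number of training and evaluation steps can be estimated from the split sizes reported by the `Map` step and the batch sizes set above. This sketch assumes a single GPU and no gradient accumulation. Neither is configured in the arguments, so both are assumptions about the run.

```python
import math

train_examples = 210249   # size of the train split, from the Map output
eval_examples = 23362     # size of the validation split, from the Map output
train_batch_size = 16     # per_device_train_batch_size
eval_batch_size = 32      # per_device_eval_batch_size
num_epochs = 3            # num_train_epochs

# the last, partially filled batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_train_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(eval_examples / eval_batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)
```

The results agree with the totals shown in the progress bars of the fine-tuning run: 39423 training steps and 731 evaluation steps.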


# Fine-tuning

In [9]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"     # limit the allocator's max split size to 512 MB to reduce fragmentation and help avoid CUDA "out of memory" errors

# check the score our model gets before fine-tuning (will take a bit of time)
start_scores = trainer.evaluate(max_length=max_length)
print(f'scores before fine-tuning: \n{start_scores}')

# train model (will take a bit of time)
trainer.train()

# after training we need to evaluate our model again (will take a bit of time)
final_scores = trainer.evaluate(max_length=max_length)
print(f'\n\nscores after fine-tuning: \n{final_scores}')

# push our model to Model Hub, comment this line if you are not logged in to Hugging Face 
trainer.push_to_hub(tags="translation", commit_message="Training complete")

  0%|          | 0/731 [00:00<?, ?it/s]

scores before fine-tuning: 
{'eval_loss': 1.1544969081878662, 'eval_bleu': 48.76479032184241, 'eval_runtime': 1204.7683, 'eval_samples_per_second': 19.391, 'eval_steps_per_second': 0.607}




  0%|          | 0/39423 [00:00<?, ?it/s]

{'loss': 1.0977, 'learning_rate': 1.9747355604596302e-05, 'epoch': 0.04}
{'loss': 1.0133, 'learning_rate': 1.9493696573066485e-05, 'epoch': 0.08}
{'loss': 0.9932, 'learning_rate': 1.9240037541536668e-05, 'epoch': 0.11}
{'loss': 1.0227, 'learning_rate': 1.898637851000685e-05, 'epoch': 0.15}
{'loss': 0.9757, 'learning_rate': 1.8732719478477033e-05, 'epoch': 0.19}
{'loss': 0.9917, 'learning_rate': 1.8479060446947215e-05, 'epoch': 0.23}
{'loss': 0.978, 'learning_rate': 1.8225401415417395e-05, 'epoch': 0.27}
{'loss': 1.0071, 'learning_rate': 1.797174238388758e-05, 'epoch': 0.3}
{'loss': 0.9657, 'learning_rate': 1.771859067042082e-05, 'epoch': 0.34}
{'loss': 0.9652, 'learning_rate': 1.7464931638891007e-05, 'epoch': 0.38}
{'loss': 0.9761, 'learning_rate': 1.7211272607361186e-05, 'epoch': 0.42}
{'loss': 0.9672, 'learning_rate': 1.695761357583137e-05, 'epoch': 0.46}
{'loss': 0.9703, 'learning_rate': 1.6704461862364612e-05, 'epoch': 0.49}
{'loss': 0.9877, 'learning_rate': 1.6450802830834795e-05,

Adding files tracked by Git LFS: ['source.spm', 'target.spm']. This may take a bit of time if the files are large.


{'loss': 0.8478, 'learning_rate': 1.3154757375136343e-05, 'epoch': 1.03}
{'loss': 0.831, 'learning_rate': 1.2901098343606527e-05, 'epoch': 1.07}
{'loss': 0.8111, 'learning_rate': 1.2647439312076708e-05, 'epoch': 1.1}
{'loss': 0.804, 'learning_rate': 1.239378028054689e-05, 'epoch': 1.14}
{'loss': 0.8177, 'learning_rate': 1.2140121249017071e-05, 'epoch': 1.18}
{'loss': 0.8121, 'learning_rate': 1.1886462217487256e-05, 'epoch': 1.22}
{'loss': 0.833, 'learning_rate': 1.1632803185957437e-05, 'epoch': 1.26}
{'loss': 0.8165, 'learning_rate': 1.1379651472490678e-05, 'epoch': 1.29}
{'loss': 0.8338, 'learning_rate': 1.1125992440960863e-05, 'epoch': 1.33}
{'loss': 0.8149, 'learning_rate': 1.0872333409431044e-05, 'epoch': 1.37}
{'loss': 0.8045, 'learning_rate': 1.0618674377901226e-05, 'epoch': 1.41}
{'loss': 0.8201, 'learning_rate': 1.0365522664434468e-05, 'epoch': 1.45}
{'loss': 0.8127, 'learning_rate': 1.011186363290465e-05, 'epoch': 1.48}
{'loss': 0.8191, 'learning_rate': 9.858204601374832e-06, 

  0%|          | 0/731 [00:00<?, ?it/s]



scores after fine-tuning: 
{'eval_loss': 0.7623581290245056, 'eval_bleu': 50.09005982889118, 'eval_runtime': 1322.1733, 'eval_samples_per_second': 17.669, 'eval_steps_per_second': 0.553, 'epoch': 3.0}


Upload file runs/Aug17_14-46-59_DESKTOP-SH88-S-K/events.out.tfevents.1692274030.DESKTOP-SH88-S-K.38808.0: 100%…

Upload file runs/Aug17_14-46-59_DESKTOP-SH88-S-K/events.out.tfevents.1692280738.DESKTOP-SH88-S-K.38808.1: 100%…

To https://huggingface.co/SUPERSOKOL/marian-finetuned-kde4-en-to-uk
   6c54a97..e38e2d0  main -> main

To https://huggingface.co/SUPERSOKOL/marian-finetuned-kde4-en-to-uk
   e38e2d0..c84c9e7  main -> main



'https://huggingface.co/SUPERSOKOL/marian-finetuned-kde4-en-to-uk/commit/e38e2d0d88dda090f8a918565ea7950c3d93f242'
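
The two evaluation results printed above can be compared directly. The short sketch below copies the reported `eval_loss` and `eval_bleu` values and computes the absolute and relative changes. The variable names are illustrative only.

```python
# values copied from the evaluation outputs before and after fine-tuning
before = {"eval_loss": 1.1544969081878662, "eval_bleu": 48.76479032184241}
after = {"eval_loss": 0.7623581290245056, "eval_bleu": 50.09005982889118}

bleu_gain = after["eval_bleu"] - before["eval_bleu"]
bleu_gain_pct = 100 * bleu_gain / before["eval_bleu"]
loss_drop = before["eval_loss"] - after["eval_loss"]
loss_drop_pct = 100 * loss_drop / before["eval_loss"]

print(f"BLEU: +{bleu_gain:.2f} points ({bleu_gain_pct:.1f}% relative)")
print(f"eval loss: -{loss_drop:.3f} ({loss_drop_pct:.1f}% relative)")
```

Fine-tuning gains about 1.33 BLEU points on the validation split, while the evaluation loss falls by roughly a third.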

# Final tests

In [18]:
# Fine-tuned model loading and usage (if you have pushed your model to Hub)
model_checkpoint = "SUPERSOKOL/marian-finetuned-kde4-en-to-uk"                      # Replace this with your own checkpoint 
translator_fine_tuned = pipeline("translation", model=model_checkpoint)             # Load our fine-tuned model from Hub
new_translation = translator_fine_tuned(original_sentence)[0]['translation_text']   # Feed sentence into pipeline     

# compare our model to baseline

print(f'original sentence in English: \n{original_sentence}')       # original sentence
print(f'\n\nbaseline translation: \n{baseline_translation}')        # baseline translation
print(f'\n\ntrue translation: \n{true_translation}')                # ground truth

print(f'\n\nfine-tuned model translation: \n{new_translation}')             # fine-tuned model translation

original sentence in English: 
Open a module by clicking its name; a list of submodules will appear. Then, click one of the submodule category names to edit its configuration in the right pane.


baseline translation: 
Відкрити модуль натисканням його назви; з' явиться список підмодулів. Після цього натисніть одну з назв підмодулів, щоб змінити налаштування підкатегорії на правій панелі.


true translation: 
Якщо ви відкриєте модуль наведенням на його позначку вказівника миші з наступним клацанням лівою кнопкою миші, з’ явиться список підмодулів. Після цього вам слід натиснути назву одного підмодулів категорії, щоб отримати доступ до відповідних налаштувань на панелі праворуч.


fine-tuned model translation: 
Відкрити модуль можна натисканням його назви. Буде відкрито список підмодулів. Після цього натисніть одну з підмодулів назв категорій, щоб змінити налаштування підкатегорії на правій панелі.
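
To get a feel for what the BLEU score reported earlier rewards, recall that BLEU combines clipped n-gram precisions (for n from 1 to 4) with a brevity penalty. The sketch below computes only the clipped unigram precision on whitespace-split toy sentences. It is a simplified illustration, not SacreBLEU's implementation, and the function name and example sentences are hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference.

    Each candidate word is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / total


reference = "a list of submodules will appear"
close_candidate = clipped_unigram_precision("a list of submodules appears", reference)
repetitive_candidate = clipped_unigram_precision("list list list", reference)
print(close_candidate, repetitive_candidate)
```

Clipping is what prevents a candidate from scoring well by repeating a single matching word, as the second call shows.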
