This project uses a transformer for English-to-Italian translation, which is a sequence-to-sequence task. The model has an encoder-decoder structure: the encoder converts the input sentence into an internal representation, and the decoder uses that representation to generate the translation one token at a time.
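To make the encoder-decoder idea concrete, here is a toy sketch of the generation loop. Everything in it is hypothetical: `toy_encode`, `toy_next_token`, and `TOY_LEXICON` are stand-ins invented for illustration, and a real model replaces them with learned neural networks. Only the loop structure (encode once, then generate token by token until an end marker) reflects how such models are typically used.

```python
# Toy stand-ins only: a real model uses learned networks, not a lookup table.
START, END = "<s>", "</s>"

# Hypothetical word-level lexicon, used purely for this illustration.
TOY_LEXICON = {"good": "buon", "morning": "giorno"}


def toy_encode(sentence):
    """Hypothetical encoder: turn the input sentence into a representation."""
    return tuple(sentence.lower().split())


def toy_next_token(encoded, generated):
    """Hypothetical decoder step: pick the next token from the encoder
    representation and the tokens generated so far."""
    position = len(generated) - 1  # generated[0] is the START marker
    if position >= len(encoded):
        return END
    return TOY_LEXICON.get(encoded[position], encoded[position])


def greedy_decode(sentence, max_new_tokens=20):
    """Encode once, then generate one token at a time until END."""
    encoded = toy_encode(sentence)
    generated = [START]
    for _ in range(max_new_tokens):
        token = toy_next_token(encoded, generated)
        if token == END:
            break
        generated.append(token)
    return " ".join(generated[1:])


print(greedy_decode("Good morning"))
```

The point of the sketch is that the decoder is called repeatedly and sees its own previous outputs at every step.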

In [1]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

!apt install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting evaluate
  Downloading evaluate-0.3.0-py3-none-any.whl (72 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading mul

In [2]:
# Loading the dataset

from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="it")

Downloading and preparing dataset kde4/en-it to /root/.cache/huggingface/datasets/kde4/en-it-lang1=en,lang2=it/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac...


Dataset kde4 downloaded and prepared to /root/.cache/huggingface/datasets/kde4/en-it-lang1=en,lang2=it/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac. Subsequent calls will reuse this data.



In [3]:
# what the dataset looks like:

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 220566
    })
})

The dataset has 220566 rows and only a train split, so we will split it into train and validation sets.

In [4]:
# splitting the dataset into train and validation using dataset split

split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 198509
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 22057
    })
})
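The split sizes above follow from the 90/10 ratio. The sketch below illustrates the idea of a seeded shuffle-and-cut split in plain Python. It is not the `datasets` implementation, and `split_indices` is a hypothetical helper, so the row assignment will differ from the library's even with the same seed. Only the resulting sizes are expected to agree.

```python
import random


def split_indices(num_rows, train_size=0.9, seed=20):
    """Shuffle row indices reproducibly, then cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_train = int(num_rows * train_size)  # fractional part is dropped
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(220566)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible, so that validation scores from different runs are computed on the same held-out rows.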

In [5]:
# Renaming the test to validation

split_datasets["validation"] = split_datasets.pop("test")

In [6]:
# Let's have a look at some of our data

split_datasets['train'][1]['translation']

{'en': 'Slideshow runs in a loop',
 'it': 'Esegui la presentazione in ciclo continuo'}

In [7]:
# Let's check a few translations using transformers pipelines

from transformers import pipeline


In [8]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-it"
translator = pipeline("translation", model=model_checkpoint)
translator("Please send me an email by the end of the day")

[{'translation_text': "Vi prego di inviarmi un'email entro la fine della giornata"}]

In [9]:
translator('what checkpoint did you use for the translation task with transformers?')

[{'translation_text': "quale checkpoint hai usato per l'attività di traduzione con i trasformatori?"}]

The pretrained pipeline already produces translations of acceptable quality. To improve on them, we can fine-tune the model on our dataset and check the results.

In [10]:
# Now let's train our model on the dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
# The first step is to use the tokenizer to prepare the dataset for our model. 

max_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["it"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [12]:
# After defining the preprocessing function, we need to apply it on the train and validation sets:

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

In [13]:
tokenized_datasets['train'][0]

{'input_ids': [8004, 1419, 40, 0],
 'attention_mask': [1, 1, 1, 1],
 'labels': [4038, 1142, 0]}

The keys are:
- input_ids: the token IDs that the tokenizer assigns to the English input sentence
- attention_mask: tells the model which positions to attend to; it is 1 for real tokens and 0 for padding tokens
- labels: the token IDs of the Italian target sentence
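The example above has no padding, so its attention mask is all ones. The sketch below shows what the mask looks like once two sequences of different lengths are padded to a common length. `pad_batch` is a hypothetical helper, the second sequence is made up, and the padding ID 80034 is an assumption taken from the collator output printed later in this notebook.

```python
def pad_batch(sequences, pad_token_id):
    """Pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * num_pad)
        attention_mask.append([1] * len(seq) + [0] * num_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


PAD_ID = 80034  # assumption: the padding ID seen in the collator output
toy_batch = pad_batch([[8004, 1419, 40, 0], [8004, 0]], PAD_ID)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding only to the longest sequence in each batch, rather than to a global maximum, is what "dynamic padding" refers to.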

In [14]:
# Fine tuning the model using a seq2seq trainer

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Data collation: we need a data collator that pads dynamically, and the padding must be applied to both the inputs and the labels. We use DataCollatorForSeq2Seq, which takes the tokenizer to process the inputs and also takes the model, because it prepares the decoder input IDs: a shifted version of the labels that starts with a special token that depends on the architecture.

In [15]:


from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [16]:
# Let's see the result 

batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

We can see that the decoder input IDs are added to the keys of the dictionary.

In [17]:
batch['labels']

tensor([[   59,  9222,    15,  4777,     7,  5883,  8853,     0,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [  136,    15, 13829,   557,  1248,  9592, 19430,   334, 12960,    61,
           396,    32, 32943,     3,    15,  6166,    61, 12960,    16, 30594,
           134,    46,  3508,  8565,    26,  2873,    26,   396,     2,     0]])

The labels are padded with -100, the value that the loss function ignores, so padded positions do not contribute to the training loss.
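To illustrate why -100 is a convenient padding value for labels, here is a plain-Python sketch of a loss average that skips those positions. It mirrors the idea behind the `ignore_index=-100` default of PyTorch's cross-entropy loss but is not that implementation. `pad_labels` and `masked_mean_loss` are hypothetical helpers, and the label IDs and per-token loss values are made up.

```python
IGNORE_INDEX = -100


def pad_labels(label_lists):
    """Pad label sequences to the longest one using the ignore index."""
    longest = max(len(labels) for labels in label_lists)
    return [
        list(labels) + [IGNORE_INDEX] * (longest - len(labels))
        for labels in label_lists
    ]


def masked_mean_loss(per_token_losses, labels):
    """Average the per-token losses, skipping positions labelled -100."""
    kept = [
        loss
        for loss, label in zip(per_token_losses, labels)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


padded = pad_labels([[4038, 1142, 0], [59, 0]])
# Made-up per-token losses for the second row; the last one sits on padding.
loss = masked_mean_loss([0.5, 1.5, 9.0], padded[1])
print(padded[1], loss)
```

The large made-up loss on the padded position has no effect on the average, which is the behaviour we want during training.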

In [18]:
# Let's take a look at the decoder input ids:
batch['decoder_input_ids']

tensor([[80034,    59,  9222,    15,  4777,     7,  5883,  8853,     0, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034],
        [80034,   136,    15, 13829,   557,  1248,  9592, 19430,   334, 12960,
            61,   396,    32, 32943,     3,    15,  6166,    61, 12960,    16,
         30594,   134,    46,  3508,  8565,    26,  2873,    26,   396,     2]])

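The decoder input IDs are the labels shifted one position to the right: each row starts with token 80034, and the positions that hold -100 in the labels are filled with that same token ID.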
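The relationship between the two tensors can be reproduced with a small pure-Python sketch. `shift_right_sketch` is a hypothetical helper written for illustration, not the library's code. Using 80034 as both the start token and the padding token is an assumption read off the output above, and the label row is a shortened copy of the first row printed earlier.

```python
def shift_right_sketch(labels, pad_token_id, decoder_start_token_id):
    """Prepend the start token, drop the last label, and replace -100."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    return [pad_token_id if token == -100 else token for token in shifted]


# First label row from the batch above, shortened to two padded positions.
example_labels = [59, 9222, 15, 4777, 7, 5883, 8853, 0, -100, -100]
example_decoder_input_ids = shift_right_sketch(example_labels, 80034, 80034)
print(example_decoder_input_ids)
```

With this shift, the decoder input at each position is the label from the previous position, so the model learns to predict the next token from the ones before it.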

In [19]:
# An appropriate metric for translation is SacreBLEU, which scores how closely the predicted
# sequences match the reference translations
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting portalocker
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.6.0 sacrebleu-2.3.1


In [20]:
import evaluate

metric = evaluate.load("sacrebleu")

The metric expects a list of sentences for the predictions and a list of lists of sentences for the references, because a sentence can have more than one acceptable translation.
The score ranges from 0 to 100; higher is better.


In [21]:
# Let's see a few examples:

predictions = ['the bread is on the table']
labels = [['bread on table']]
metric.compute(predictions=predictions, references=labels)

{'score': 10.682175159905853,
 'counts': [3, 0, 0, 0],
 'totals': [6, 5, 4, 3],
 'precisions': [50.0, 10.0, 6.25, 4.166666666666667],
 'bp': 1.0,
 'sys_len': 6,
 'ref_len': 3}

In [22]:
predictions = ['the bread is on the table']
labels = [['the bread is on the table']]
metric.compute(predictions=predictions, references=labels)

{'score': 100.00000000000004,
 'counts': [6, 5, 4, 3],
 'totals': [6, 5, 4, 3],
 'precisions': [100.0, 100.0, 100.0, 100.0],
 'bp': 1.0,
 'sys_len': 6,
 'ref_len': 6}
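To see where the numbers in these two outputs come from, here is a simplified single-reference BLEU sketch in plain Python. It is not the sacrebleu implementation. It assumes whitespace tokenization, and a smoothing rule inferred from the `precisions` printed above: each n-gram order with zero matches gets a precision of 1 / (2^k * total), where k counts the zero-match orders seen so far. `simple_bleu` and `ngram_counts` are hypothetical names.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=4):
    """Simplified sentence-level BLEU on a 0-100 scale."""
    pred, ref = prediction.split(), reference.split()
    smooth = 1.0
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_ngrams = ngram_counts(pred, n)
        ref_ngrams = ngram_counts(ref, n)
        total = sum(pred_ngrams.values())
        if total == 0:
            return 0.0  # simplification: prediction too short for this order
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
        if matches == 0:
            smooth *= 2
            precision = 1.0 / (smooth * total)
        else:
            precision = matches / total
        log_precisions.append(math.log(precision))
    # Brevity penalty: only predictions shorter than the reference are penalized.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_order)


partial_score = simple_bleu("the bread is on the table", "bread on table")
exact_score = simple_bleu("the bread is on the table", "the bread is on the table")
print(partial_score, exact_score)
```

The score is the geometric mean of the four n-gram precisions multiplied by the brevity penalty, which is why a prediction with no matching bigrams still gets a small non-zero score under smoothing.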

In [23]:
# Now we are ready to build our compute_metrics function

import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

In [24]:
# we need to define our training arguments, and for this task it will be seq2seq training arguments

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-it",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)


In [25]:
# Now that we have our trainer arguments, we can put everything together in the trainer

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [26]:
trainer.train()

***** Running training *****
  Num examples = 198509
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 18612
  Number of trainable parameters = 85116416


Step,Training Loss
500,1.2064
1000,1.131
1500,1.103
2000,1.0607
2500,1.017
3000,1.0309
3500,0.9964
4000,1.0006
4500,1.0
5000,0.9839


Saving model checkpoint to marian-finetuned-kde4-en-to-it/checkpoint-6204
Configuration saved in marian-finetuned-kde4-en-to-it/checkpoint-6204/config.json
Model weights saved in marian-finetuned-kde4-en-to-it/checkpoint-6204/pytorch_model.bin
tokenizer config file saved in marian-finetuned-kde4-en-to-it/checkpoint-6204/tokenizer_config.json
Special tokens file saved in marian-finetuned-kde4-en-to-it/checkpoint-6204/special_tokens_map.json
Saving model checkpoint to marian-finetuned-kde4-en-to-it/checkpoint-12408
Configuration saved in marian-finetuned-kde4-en-to-it/checkpoint-12408/config.json
Model weights saved in marian-finetuned-kde4-en-to-it/checkpoint-12408/pytorch_model.bin
tokenizer config file saved in marian-finetuned-kde4-en-to-it/checkpoint-12408/tokenizer_config.json
Special tokens file saved in marian-finetuned-kde4-en-to-it/checkpoint-12408/special_tokens_map.json
Saving model checkpoint to marian-finetuned-kde4-en-to-it/checkpoint-18612
Configuration saved in marian-fi

TrainOutput(global_step=18612, training_loss=0.9004241206326742, metrics={'train_runtime': 4505.2313, 'train_samples_per_second': 132.186, 'train_steps_per_second': 4.131, 'total_flos': 1.2786014627168256e+16, 'train_loss': 0.9004241206326742, 'epoch': 3.0})
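The step counts and checkpoint names in this log can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which matches the log (total train batch size 32, gradient accumulation steps 1), and that the last partial batch of each epoch counts as a step.

```python
import math

num_examples = 198509   # from "Num examples" in the training log
batch_size = 32         # per-device batch size, single device assumed
num_epochs = 3
train_runtime = 4505.2313  # seconds, from TrainOutput

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# save_strategy="epoch" writes one checkpoint at the end of every epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps, checkpoint_steps)
print(round(samples_per_second, 3))
```

The computed checkpoint steps line up with the `checkpoint-6204`, `checkpoint-12408`, and `checkpoint-18612` directories in the log.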

In [27]:
trainer.evaluate(max_length=max_length)

***** Running Evaluation *****
  Num examples = 22057
  Batch size = 64


{'eval_loss': 0.8286335468292236,
 'eval_bleu': 49.5133784080541,
 'eval_runtime': 1430.6242,
 'eval_samples_per_second': 15.418,
 'eval_steps_per_second': 0.241,
 'epoch': 3.0}

The fine-tuned model reaches a BLEU score of about 49.5 on the validation set, which is considered a good result.