<a href="https://colab.research.google.com/github/rokmr/Natural-Language-Processing/blob/main/translation/02_MT_Model_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 2: Fine-tuning Seq2Seq Pre-trained Models for Translation

Welcome to the second tutorial of our class on "Practical Machine Translation for Low Resource Languages". Today, we continue our journey of building practical translation systems by learning how to fine-tune a pre-trained model for the translation task. We will take the [mT5](https://arxiv.org/abs/2010.11934) model by Google and fine-tune it to translate English sentences to Hindi. For fine-tuning, we will use the [IIT Bombay English-Hindi Parallel Corpus](https://www.cfilt.iitb.ac.in/iitb_parallel/).





In [1]:
 !pip install -q -U torch transformers  datasets sacrebleu evaluate  sentencepiece accelerate


In [2]:
#Loading packages that we will be making use of in the tutorial
import io
import accelerate
import json
import argparse
import os
import random
import numpy as np
import datasets
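from datasets import load_dataset, load_from_disk
from evaluate import load as load_metric  # `datasets.load_metric` is deprecated; `evaluate.load` is its replacement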
from dataclasses import dataclass, field
import transformers
from transformers import (
    HfArgumentParser,
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed
)

In [3]:
# Define the data paths; modify these to match your machine
DATA_DIR = "/content"

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%%capture
!unzip /content/drive/Shareddrives/Colab/Datasets/iitb-en-hi.zip

In [6]:
import requests
import zipfile
from pathlib import Path
import os

data_dir = Path("data")

if data_dir.is_dir():
    print(f"{data_dir} directory exists.")
else:
    print(f"Did not find {data_dir} directory, creating one...")
    data_dir.mkdir(parents=True, exist_ok=True)

with open(data_dir / "flores200_dataset.zip", "wb") as f:
    request = requests.get("https://raw.githubusercontent.com/rokmr/Natural-Language-Processing/main/translation/dataset/flores200_dataset.zip")
    print("Downloading flores200_dataset data...")
    f.write(request.content)

with zipfile.ZipFile(data_dir / "flores200_dataset.zip", "r") as zip_ref:
    print("Unzipping flores200_dataset data...")
    zip_ref.extractall(data_dir)

# Remove zip file
os.remove(data_dir / "flores200_dataset.zip")

Did not find data directory, creating one...
Downloading flores200_dataset data...
Unzipping flores200_dataset data...


## Task 1: Pre-processing the Datasets and Building Dataloaders

In this part, we will first read the parallel train, dev and test corpora, pre-process them, and then set up batching so that we can efficiently iterate through the dataset during training and evaluation.

We will read the data files using `load_dataset` from the [Datasets](https://huggingface.co/docs/datasets/index) library by 🤗. Datasets is a powerful library that makes handling and processing large-scale datasets very convenient. Below we load the English and Hindi text datasets.

In [7]:
en_hi_dataset = load_dataset(
    "json",
    data_files = {
        "train": f"{DATA_DIR}/iitb-en-hi/train_sample_en_hi.json",
        "dev": f"{DATA_DIR}/iitb-en-hi/dev_en_hi.json",
        "test": f"{DATA_DIR}/iitb-en-hi/test_en_hi.json"
    },
)


In [8]:
en_hi_dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'hi'],
        num_rows: 99999
    })
    dev: Dataset({
        features: ['en', 'hi'],
        num_rows: 520
    })
    test: Dataset({
        features: ['en', 'hi'],
        num_rows: 2507
    })
})

A short tutorial on the Datasets library: we can directly index the `DatasetDict` object created above to obtain a dataset split by its name. For example:

In [9]:
print(f"Train Dataset: {en_hi_dataset['train']}")
print(f"Dev Dataset: {en_hi_dataset['dev']}")
print(f"Test Dataset: {en_hi_dataset['test']}")

Train Dataset: Dataset({
    features: ['en', 'hi'],
    num_rows: 99999
})
Dev Dataset: Dataset({
    features: ['en', 'hi'],
    num_rows: 520
})
Test Dataset: Dataset({
    features: ['en', 'hi'],
    num_rows: 2507
})


`num_rows` shows the number of examples present in the selected dataset split. The example at index `i` can be accessed directly by indexing:

In [10]:
print(f"Train Dataset: {en_hi_dataset['train'][0]}")

Train Dataset: {'en': 'The wealth of the castle was stored in the form of bars of silver.', 'hi': 'किले की समृद्धि चांदी की ईंटों के रूप में भंडारित करके रखी गयी थी। '}


Now that our dataset is loaded, we will process it so that it can be used by the model. As explained in the last tutorial, we will use a tokenizer to convert both English and Hindi sentences to sequences of token ids.

We start by loading the tokenizer for the model that we want to fine-tune, i.e. `mt5-small`.

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")


You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


Let's revisit the output of a tokenizer object:

In [12]:
example_text = "Fine-tuning machine translation models is fun!"
tokenizer(example_text)

{'input_ids': [38820, 264, 110160, 10902, 53802, 33477, 339, 2925, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

* As discussed in the previous class, the tokenizer returns a list of token ids, represented here by `input_ids`. It also returns an `attention_mask`, which we briefly alluded to in the previous tutorial.
* The purpose of the `attention_mask` is to specify which tokens should not be attended to while computing the hidden representations. These are generally padding tokens, which are appended to the inputs to ensure that all sequences in a batch have the same length.
* In this particular example, the `attention_mask` is a vector of all 1s, meaning that there is no padding token in the processed input. Providing a value for `max_length` while calling the tokenizer and setting `padding = "max_length"` will pad the input to the specified length.

In [13]:
tokenizer(example_text, padding = "max_length", max_length = 128)

{'input_ids': [38820, 264, 110160, 10902, 53802, 33477, 339, 2925, 309, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

As can be seen, the `input_ids` are now followed by a run of token id 0, which in the case of mT5 is the padding token. Similarly, the `attention_mask` now ends in a run of 0s that marks the padding positions.

In [14]:
tokenizer(example_text, max_length = 48, padding = True, truncation = True)

{'input_ids': [38820, 264, 110160, 10902, 53802, 33477, 339, 2925, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
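The three tokenizer calls above differ only in their padding strategy. As a rough mental model, padding and the attention mask can be pictured with the pure-Python sketch below. This is an illustration under assumptions, not the tokenizer's actual implementation: `pad_batch` and its arguments are hypothetical names, and the truncation here is simplified (the real tokenizer also keeps the end-of-sequence token when it truncates). Note that `padding = True` pads to the longest sequence in the batch, which is why it has no visible effect on a single sentence.

```python
def pad_batch(batch_ids, strategy="longest", max_length=None, pad_id=0):
    """Pad (and optionally truncate) lists of token ids; return ids and attention masks."""
    if max_length is not None:
        # Simplified truncation: drop every token beyond max_length
        batch_ids = [ids[:max_length] for ids in batch_ids]
    if strategy == "max_length":
        target = max_length                       # always pad to the fixed length
    else:
        target = max(len(ids) for ids in batch_ids)  # pad to the longest sequence in the batch
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids of the example sentence, copied from the tokenizer output above
sentence = [38820, 264, 110160, 10902, 53802, 33477, 339, 2925, 309, 1]

single = pad_batch([sentence], strategy="longest", max_length=48)     # like padding=True
fixed = pad_batch([sentence], strategy="max_length", max_length=128)  # like padding="max_length"
pair = pad_batch([sentence, [486, 259, 1]], strategy="longest")       # a batch of two sequences

print(len(single["input_ids"][0]), len(fixed["input_ids"][0]))
print(pair["attention_mask"][1])
```

Only in the two-sequence batch does the "longest" strategy actually add padding: the shorter sequence is padded up to the length of the longer one.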

We will now implement a function that takes in a dataset example and tokenizes both the English and Hindi text to obtain the model inputs as well as the expected output. Since we are building a translation system from English to Hindi, we use the English text for the model inputs and the Hindi text for the labels.

In [15]:
def process_example(
    example,
    tokenizer,
    src_lang = "en",
    tgt_lang = "hi",
    max_length = 48,
    prefix = ""
  ):

  """
  Takes in a dataset `example` and tokenizes both the English and Hindi texts to
  obtain the model inputs as well as the expected output.

  Inputs:
    - example (dict) : A dictionary with keys "en" for English text and "hi" for
    Hindi text
    - tokenizer (PreTrainedTokenizer) : The tokenizer to use to process input
    and output texts
    - src_lang (str) : Language of the text to use as input
    - tgt_lang (str) : Language to translate to and hence use as output
    - max_length (int) : Maximum number of tokens to keep; longer texts are truncated
    - prefix (str) : Prefix to prepend to the input text

  Returns:
    - (dict) : Dictionary containing `input_ids`, `attention_mask` and `labels`
  """

  # Step 1: Tokenize source language text to be used as inputs
  input_text = prefix + " " + example[src_lang]
  model_inputs = None
  ### BEGIN SOLUTION
  model_inputs = tokenizer(input_text, max_length = max_length,  padding = True, truncation = True)
  ### END SOLUTION

  input_ids = model_inputs["input_ids"]
  attention_mask = model_inputs["attention_mask"]

  # Step 2: Tokenize target language text
  output_text = example[tgt_lang]
  ### BEGIN SOLUTION
  model_outputs = tokenizer(output_text, max_length = max_length,  padding = True, truncation = True)
  ### END SOLUTION

  output_ids = model_outputs["input_ids"]

  return {
      "input_ids" : input_ids,
      "attention_mask" : attention_mask,
      "labels" : output_ids
  }

Let's test the function on an example:

In [16]:
example = en_hi_dataset['train'][0]
process_example(example, tokenizer)

{'input_ids': [486,
  259,
  91250,
  304,
  287,
  259,
  108708,
  639,
  5317,
  285,
  281,
  287,
  2885,
  304,
  68751,
  304,
  33043,
  260,
  1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [1676,
  1048,
  975,
  3526,
  127776,
  3081,
  28523,
  113006,
  975,
  259,
  1665,
  25324,
  1396,
  641,
  259,
  9293,
  844,
  2573,
  83764,
  76735,
  1100,
  2916,
  259,
  17928,
  1304,
  1462,
  17149,
  259,
  8798,
  378,
  259,
  1]}

We can now apply this function to all examples in our datasets. The Datasets library provides a `map` function that makes parallel processing of dataset examples extremely convenient.

In [17]:
train_dataset = en_hi_dataset["train"]
train_dataset = train_dataset.map(
    lambda example: process_example(example, tokenizer,
                                    prefix = "translate English to Hindi",
                                    max_length = 48),
    num_proc = 8, # For parallelization
    remove_columns = ["en", "hi"]
)


dev_dataset = en_hi_dataset["dev"]
dev_dataset = dev_dataset.map(
    lambda example: process_example(example, tokenizer,
                                    prefix = "translate English to Hindi",
                                    max_length = 48),
    num_proc = 8, # For parallelization
    remove_columns = ["en", "hi"]
)

test_dataset = en_hi_dataset["test"]
test_dataset = test_dataset.map(
    lambda example: process_example(example, tokenizer,
                                    prefix = "translate English to Hindi",
                                    max_length = 48),
    num_proc = 8, # For parallelization
    remove_columns = ["en", "hi"]
)


In [18]:
train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 99999
})

In [19]:
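# DataCollatorForSeq2Seq is the collate function used when batching (e.g. by a DataLoader):
# it pads the inputs and labels of a list of examples to a common length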
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer)
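
To make the role of the collator concrete: it is the function that turns a list of variable-length examples into one rectangular batch. The pure-Python sketch below is an illustration under assumptions, not the library's code. `collate` is a hypothetical name, the token ids are toy values, and it assumes the collator's defaults of padding inputs with the pad token id (0 for mT5) and labels with -100, so that padded label positions are ignored by the loss.

```python
def collate(examples, pad_id=0, label_pad_id=-100):
    """Pad a list of tokenized examples to the longest length in the batch."""
    max_in = max(len(ex["input_ids"]) for ex in examples)
    max_lab = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        in_pad = max_in - len(ex["input_ids"])
        lab_pad = max_lab - len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * in_pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * in_pad)
        # Labels are padded with -100 instead of the pad id
        batch["labels"].append(ex["labels"] + [label_pad_id] * lab_pad)
    return batch

# Two toy examples of different lengths
examples = [
    {"input_ids": [486, 259, 1], "attention_mask": [1, 1, 1], "labels": [1676, 1048, 975, 1]},
    {"input_ids": [38820, 1], "attention_mask": [1, 1], "labels": [3526, 1]},
]
batch = collate(examples)
print(batch["input_ids"])
print(batch["labels"])
```

Because the labels carry -100 at padded positions, they have to be mapped back to the pad token id before they can be decoded into text for evaluation.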

## Task 2: Fine-tuning mT5 using Trainer





Let's revisit the concept of working with models. We briefly discussed using models that are publicly available off the shelf (remember the "MarianMT" models?). As you can imagine, these models already have a meaningful idea about the task they are being asked to perform (in this case, translation), because they have been explicitly trained for translation before. Such models, which have already been trained for a given task (or a collection of tasks), are called **Pre-Trained Models**.

Pre-trained models are very beneficial because they

(a) can be rapidly adapted to a given task on your own data,

(b) without incurring a significant time and computation cost.


What you'll see in the next steps is a powerful pre-trained learner called **mT5, or the multilingual T5**. This model has seen data from 101 languages and, being an encoder-decoder model, can develop very powerful contextual representations for these languages, and to some extent _even for unseen ones_.


In [20]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [21]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small").to(device)


Now that you have a powerful initial model and a good dataset, you need a way of evaluating the model so that

(a) during training, you can monitor whether the model is adapting in the best possible manner for translation on your dataset, and

(b) during evaluation, you can understand how it performs on an unseen set of samples.

For both of these purposes, we can use the evaluation process that we began this class with.

In [22]:
# First we load BLEU, our primary metric, as implemented by sacrebleu.
# `load_metric` from `datasets` is deprecated; the `evaluate` library (installed above)
# provides the same metric.
import evaluate

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    '''
    A helper function which strips leading and trailing whitespace from our predictions and references.
    sacrebleu expects a list of references per prediction, hence the nested list for labels.
    '''
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    '''
    A function that computes BLEU, given the predictions of the model.
    BLEU's computation requires us to pass the gold labels and the model's predictions.
    '''
    # Step 1: We take our model's predictions and the encoded labels to start the comparison process.
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Label padding is marked with -100; replace it with the pad token id so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Step 2: Then we decode them
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Step 3: Finally, we pass them to BLEU
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    # Step 4: Here we write the prediction file so that we can observe the evolution of the model's generations.
    # Note that this relies on the global `training_args` defined further below.
    os.makedirs(training_args.output_dir, exist_ok=True)
    output_prediction_file = os.path.join(training_args.output_dir, f'{result["score"]}_generated_predictions.txt')
    with open(output_prediction_file, "w", encoding="utf-8") as writer:
        writer.write("\n".join(decoded_preds))
    return {"bleu": result["score"]}
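
To build some intuition for what the metric measures: BLEU rewards n-grams in the prediction that also occur in the reference, and penalizes predictions that are too short. The sketch below is a deliberately simplified, sentence-level version written for illustration. `simple_bleu` is a hypothetical helper, not part of sacrebleu, which additionally applies its own tokenization, corpus-level aggregation and optional smoothing, so the numbers will not match sacrebleu's.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=4):
    """Unsmoothed, single-reference BLEU on whitespace tokens, scaled to 0-100."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: discourage predictions shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

exact = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
close = simple_bleu("the cat sat on a mat", "the cat sat on the mat")
unrelated = simple_bleu("a dog ran in my yard", "the cat sat on the mat")
print(exact, close, unrelated)
```

An exact match scores the maximum, a near match lands somewhere in between, and a prediction sharing no words with the reference scores zero.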


## Training the Model

Now that we have all the pillars (data, model and evaluation pipeline), we can start training our model. As before, we will start with an abstracted pipeline that will train the model for us. Let's look at some of the parameters that we will use here.

In [23]:
from transformers import Seq2SeqTrainingArguments

In [24]:
training_args = Seq2SeqTrainingArguments('./')

In [25]:
training_args



### Learning Parameters
1. Learning Rate
2. Gradient Accumulation Steps
3. Batch Size
4. Evaluation Accumulation Steps
5. Number of Epochs
6. Save Total Limit

In [27]:
save_dir = "/content/drive/MyDrive/mt5_en_hi"
training_args = Seq2SeqTrainingArguments(
    output_dir = save_dir,
    learning_rate = 1e-3,
    gradient_accumulation_steps = 2,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size=16,
    do_train = True,
    do_eval = True,
    do_predict = False,
    num_train_epochs = 10,
    fp16 = False,
    eval_accumulation_steps=2,
    log_on_each_node=False,
    eval_steps = 1000,
    predict_with_generate = True,
    evaluation_strategy = 'steps',
    save_total_limit=1,
    save_steps=1000,
    save_strategy="steps"
    )
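
Batch size and gradient accumulation interact, so it helps to work out what these settings imply. The arithmetic below is a sketch under one assumption that is not stated in the notebook, namely that training runs on a single GPU (`num_devices = 1`); the number of updates per epoch is approximate, since the exact handling of the last partial accumulation step can vary between library versions.

```python
import math

num_train_examples = 99999          # size of the train split shown earlier
per_device_train_batch_size = 32
gradient_accumulation_steps = 2
num_devices = 1                     # assumption: a single GPU

# Number of examples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Number of batches the data loader yields per epoch
batches_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))

# Approximate number of optimizer updates per epoch
approx_updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

print(effective_batch_size)       # 64
print(batches_per_epoch)          # 3125
print(approx_updates_per_epoch)   # 1562
```

With `eval_steps = 1000` and `save_steps = 1000`, the model is therefore evaluated and checkpointed roughly every two-thirds of an epoch.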

In [28]:
# Fix the random seed for reproducibility: random operations (such as data shuffling and dropout) then repeat identically across runs on the same setup.
set_seed(42)

# Now we initialize the object that is responsible for handling our training
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
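
What fixing a seed buys us can be shown with Python's built-in `random` module alone. `set_seed` applies the same idea to the random number generators of Python, NumPy and PyTorch; the `draw` helper below is a hypothetical name used only for this illustration.

```python
import random

def draw(seed):
    """Reset the generator to a known state and draw three numbers."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

same = draw(42) == draw(42)        # same seed, same sequence
different = draw(42) != draw(43)   # different seed, different sequence
print(same, different)
```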

In [29]:
train_result = trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


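| Step | Training Loss | Validation Loss | Bleu |
|------|---------------|-----------------|------|
| 1000 | 2.8708 | 2.520659 | 2.55615 |
| 2000 | 2.4475 | 2.331734 | 3.883074 |
| 3000 | 2.285 | 2.23438 | 4.892527 |
| 4000 | 2.1133 | 2.175542 | 5.531558 |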

KeyboardInterrupt: ignored

Now that the model is trained, we can load it (not needed if the runtime wasn't disconnected) and qualitatively examine its outputs:

In [None]:
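from transformers.trainer_utils import get_last_checkpoint

# The Trainer saves its weights into `checkpoint-<step>` sub-directories of `output_dir`,
# so we load from the most recent checkpoint rather than from the directory itself.
checkpoint_dir = get_last_checkpoint("/content/drive/MyDrive/mt5_en_hi")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # a tokenizer class, not a model class

input_text = "learning is fun"
prompt = f"translate English to Hindi {input_text}"
tokenized_prompt = tokenizer(prompt, return_tensors="pt").to(device)
model_output = model.generate(**tokenized_prompt, max_new_tokens=48)
generation_text = tokenizer.batch_decode(model_output, skip_special_tokens=True)
print(generation_text)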

## Task 3: Diving Deep into the Trainer: Writing our own training loop

In [30]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm.auto import tqdm  # progress bars used inside the trainer class below

In [31]:
class CustomSeq2SeqTrainer:

  def __init__(
      self,
      train_dataset,
      eval_dataset,
      model,
      tokenizer,
      data_collator,
      compute_metrics_fn,
      eval_metric="bleu",
      device="cuda",
      output_dir="checkpoints/",
      batch_size=8,
      learning_rate=1e-3,
      num_train_epochs=10
  ):

    self.train_dataset = train_dataset
    self.eval_dataset = eval_dataset
    self.model = model
    self.device = device
    self.tokenizer = tokenizer
    self.data_collator = data_collator
    self.compute_metrics_fn = compute_metrics_fn
    self.eval_metric = eval_metric

    # Put the model to device
    self.model.to(device)

    # Create Dataloaders
    self.train_loader = DataLoader(train_dataset,
                                   batch_size = batch_size,
                                   shuffle = True, # Visit training examples in a new random order each epoch
                                   collate_fn=self.data_collator)
    self.eval_loader = DataLoader(eval_dataset,
                                  batch_size = batch_size,
                                  collate_fn=self.data_collator)

    # Initialize the Optimizer
    self.optimizer = AdamW(model.parameters(), lr=learning_rate)

    # Initialize train arguments
    self.output_dir = output_dir
    self.batch_size = batch_size
    self.learning_rate = learning_rate
    self.num_train_epochs = num_train_epochs

  def send_batch_to_device(self, batch):
    return {k : v.to(self.device) for k,v in batch.items()}

  def save_pretrained(self):
    self.model.save_pretrained(self.output_dir)

  def eval_step(self, batch):

    # Get model predictions
    with torch.no_grad():
      preds = self.model.generate(
          input_ids=batch["input_ids"],
          attention_mask=batch["attention_mask"]
      )
      preds = preds.detach().cpu().numpy()

    # Call compute metrics
    eval_results = self.compute_metrics_fn((preds, batch["labels"].cpu().numpy()))

    return eval_results[self.eval_metric]


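  def eval(self):
    # Switch to evaluation mode (disables dropout) and restore the previous mode afterwards
    was_training = self.model.training
    self.model.eval()

    # Note: averaging per-batch BLEU scores only approximates corpus-level BLEU
    eval_score = 0
    for batch in tqdm(self.eval_loader):
      batch = self.send_batch_to_device(batch)
      eval_score += self.eval_step(batch)

    self.model.train(was_training)
    eval_score = eval_score / len(self.eval_loader)
    return eval_score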


  def train_step(self, batch):

    # Zero out any existing computed gradients
    self.optimizer.zero_grad()

    # Perform forward pass through the model
    model_output = self.model(**batch)
    loss = model_output.loss # Since the batch in our case contains labels too, the forward pass will compute the loss as well

    # Compute gradients
    loss.backward()

    # Update the model parameters using optimizer
    self.optimizer.step()

    return loss.item()

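  def train(self):

    for epoch in tqdm(range(self.num_train_epochs)):
      # Enable training mode (e.g. dropout); `from_pretrained` returns the model in evaluation mode
      self.model.train()
      train_loss_epoch = 0
      for batch in tqdm(self.train_loader):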
        # Copy the batch to the specified device (cuda or cpu)
        batch = self.send_batch_to_device(batch)

        # Do one train step
        train_loss_epoch += self.train_step(batch)

      train_loss_epoch /= len(self.train_loader)
      print("Evaluating...")
      eval_score = self.eval()

      print(f"Epoch {epoch+1} completed.")
      print(f"Train loss: {train_loss_epoch}")
      print(f"Eval {self.eval_metric}: {eval_score}")

      self.save_pretrained()
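
The heart of the class is `train_step`, which always performs the same four steps: zero the gradients, run the forward pass, run the backward pass, and update the parameters. The toy below mirrors those steps without any deep learning library, as a sketch for intuition only: it fits a single weight `w` to data generated from `y = 3 * x`, computes the gradient by hand, and uses plain gradient descent instead of AdamW. All names in it are hypothetical.

```python
# Toy data generated from y = 3 * x; the single parameter w should approach 3
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
learning_rate = 0.01

def train_step(w, batch):
    grad = 0.0                                    # 1. zero out the gradient
    loss = 0.0
    for x, y in batch:
        pred = w * x                              # 2. forward pass
        loss += (pred - y) ** 2 / len(batch)      #    mean squared error over the batch
        grad += 2 * (pred - y) * x / len(batch)   # 3. backward pass: d(loss)/d(w)
    w = w - learning_rate * grad                  # 4. parameter update (plain gradient descent)
    return w, loss

losses = []
for epoch in range(50):
    epoch_loss = 0.0
    batches = [data[:2], data[2:]]                # two mini-batches of size 2
    for batch in batches:
        w, loss = train_step(w, batch)
        epoch_loss += loss
    losses.append(epoch_loss / len(batches))      # mean loss over batches, as in train()

print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

In the real trainer, PyTorch's autograd computes the gradients in `loss.backward()` and the optimizer applies the update in `optimizer.step()`, but the structure of the loop is the same.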

In [32]:
train_dataset_small = train_dataset.select(range(1000))
trainer = CustomSeq2SeqTrainer(
      train_dataset_small,
      dev_dataset,
      model,
      tokenizer,
      data_collator,
      compute_metrics_fn = compute_metrics,
      eval_metric="bleu",
      device="cuda",
      output_dir="checkpoints/",
      batch_size=8,
      learning_rate=1e-3,
      num_train_epochs=3
)

In [34]:
from tqdm.auto import tqdm

In [35]:
trainer.train()

Evaluating...
Epoch 1 completed.
Train loss: 2.1587034816741943
Eval bleu: 3.1147747563787496

Evaluating...
Epoch 2 completed.
Train loss: 1.7426650009155273
Eval bleu: 2.782454599482715

Evaluating...
Epoch 3 completed.
Train loss: 1.4689962677955628
Eval bleu: 2.762368518533578