# English to Dutch translation using the opus book dataset and t5 tranformer


### Loading the dataset


The dataset is downloaded directly from the hugging face library interface using the 'datasets' library. </br>
Once downloaded the dataset will be present in the cache memory of the notebook and can be accessed for future use.


In [2]:
#Downloading opus_books dataset 
from datasets import load_dataset, load_metric
raw_data = load_dataset("opus_books", "en-nl")
raw_data

Downloading and preparing dataset opus_books/en-nl (download: 3.56 MiB, generated: 9.80 MiB, post-processed: Unknown size, total: 13.36 MiB) to C:\Users\Sharanya Manohar\.cache\huggingface\datasets\opus_books\en-nl\1.0.0\e8f950a4f32dc39b7f9088908216cd2d7e21ac35f893d04d39eb594746af2daf...


Downloading data:   0%|          | 0.00/3.73M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/38652 [00:00<?, ? examples/s]

Dataset opus_books downloaded and prepared to C:\Users\Sharanya Manohar\.cache\huggingface\datasets\opus_books\en-nl\1.0.0\e8f950a4f32dc39b7f9088908216cd2d7e21ac35f893d04d39eb594746af2daf. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 38652
    })
})

The dataset is available in a single split and needs to be split to create a validation set.

In [3]:
split_data = raw_data["train"].train_test_split(train_size=0.9, seed=20)
split_data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 34786
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 3866
    })
})

'test' key can be renamed as 'validation' for interpretability

In [4]:
split_data["validation"] = split_data.pop("test")

Let us look at one instance of the dataset.

In [5]:
split_data["train"][1]["translation"]

{'en': 'That was a good time."', 'nl': 'Dat was een goede tijd."'}

### Defining the tokenizer

Pre-defined tokenizer of the pre-trained "t5-small" model is used to tokenize the text in the dataset

In [6]:
model_t5 = "t5-small"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_t5,use_fast=False)

In [7]:
#t5 transformer models require a prefix indicating the action to be performed on the input provided.
prefix = "translate English to Dutch:" 

### Defining the pre-processing function

In [8]:
#Refer hugging face documentations for language codes
source_language = "en"
target_language = "nl"

max_input_length = 128
max_target_length = 128

def preprocess(instances):
   input = [prefix + i[source_language] for i in instances["translation"]]
   target = [i[target_language] for i in instances["translation"]]
   tokenized_inputs = tokenizer(input, max_length=max_input_length, truncation=True)
   # Setup the tokenizer for target
   with tokenizer.as_target_tokenizer():
       label = tokenizer(target, max_length=max_target_length, truncation=True)
   tokenized_inputs["labels"] = label["input_ids"]
   return tokenized_inputs

In [9]:
#Applying the pre processing on the entire dataset
tokenized_datasets = split_data.map(preprocess, batched=True)

  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

### Creating subsets of the dataset for faster training

In [10]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(1000))

### Using the 't5-small' pre-trained model

In [11]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_t5)

In [12]:
batch_size = 16

#defining training attributes
args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy = "epoch",
   learning_rate=2e-5,
   per_device_train_batch_size=batch_size,
   per_device_eval_batch_size=batch_size,
   weight_decay=0.01,
   save_total_limit=3,
   num_train_epochs=1,
   predict_with_generate=True   
)

In [13]:
#pad inputs and label them
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [14]:
import numpy as np
from datasets import load_metric
metric = load_metric("sacrebleu")
meteor = load_metric('meteor')

#customizing compute_metrics function to display bleu score, mean prediction length and meteor score
def compute_metrics(eval_preds):
   preds, labels = eval_preds
   if isinstance(preds, tuple):
       preds = preds[0]
    
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   # Replacing -100 in the labels as they are not needed and cannot be decoded
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
   
   decoded_preds = [pred.strip() for pred in decoded_preds]
   decoded_labels = [[label.strip()] for label in decoded_labels]
   
   result = metric.compute(predictions=decoded_preds, references=decoded_labels)
   meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)
   prediction_length = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
   
   result = {'bleu' : result['score']}
   result["gen_len"] = np.mean(prediction_length)
   result["meteor"] = meteor_result["meteor"]
   result = {x: round(y, 4) for x, y in result.items()}
   return result

[nltk_data] Downloading package wordnet to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [15]:
#training object with customized parameters
trainer = Seq2SeqTrainer(
   model,
   args,
   train_dataset=train_dataset,
   eval_dataset=eval_dataset,
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)


In [17]:
#train model using train function
trainer.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, translation. If id, translation are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 63


Epoch,Training Loss,Validation Loss,Bleu,Gen Len,Meteor
1,No log,4.019485,0.3646,17.109,0.0623


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, translation. If id, translation are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=63, training_loss=4.33593992203001, metrics={'train_runtime': 1143.1448, 'train_samples_per_second': 0.875, 'train_steps_per_second': 0.055, 'total_flos': 27013377687552.0, 'train_loss': 4.33593992203001, 'epoch': 1.0})