# T5 Practice Implementation - Samyukt Sriram

Implementing T5 from the HuggingFace Transformers library for the task of machine translation. Following this guide: https://huggingface.co/docs/transformers/tasks/translation

Just a learning exercise to familiarize myself with the workflow of the transformers library.

In [1]:
# installing imports

!pip install datasets
!pip install transformers
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
[K     |████████████████████████████████| 346 kB 5.2 MB/s 
[?25hCollecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.6 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 56.4 MB/s 
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 75.1 MB/s 
[?25hCollecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
[K     |████████████████████████

In [2]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [3]:
import numpy as np

from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer



In [6]:
model_checkpoint = 't5-small'

#Change the below languages to get the language pair you want.
#Initially was lower in the Preprocessing function. Moved up here to adjust corpus as well.
source_lang = 'en'
target_lang = 'de'
prefix = 'translate English to German: '

In [7]:
#Loading in data into the datasets object

raw_datasets = load_dataset("wmt16", f"{target_lang}-{source_lang}", split = {'train':'train[:2%]', 'test':'test[:10%]'}) #using fraction of corpus as a trial
#train_ds, test_ds = load_dataset('wmt16','de-en', split=['train[:100]', 'test[:100]'])
print(raw_datasets)
#raw_datasets = raw_datasets['train'].train_test_split(test_size = 0.2)
#print(raw_datasets)
metric = load_metric('sacrebleu')

Downloading and preparing dataset wmt16/de-en (download: 1.57 GiB, generated: 1.28 GiB, post-processed: Unknown size, total: 2.85 GiB) to /root/.cache/huggingface/datasets/wmt16/de-en/1.0.0/9e0038fe4cc117bd474d2774032cc133e355146ed0a47021b2040ca9db4645c0...


Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/658M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/919M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/75.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/38.7M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting data files: 0it [00:00, ?it/s]

Generating train split:   0%|          | 0/4548885 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2169 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2999 [00:00<?, ? examples/s]

Dataset wmt16 downloaded and prepared to /root/.cache/huggingface/datasets/wmt16/de-en/1.0.0/9e0038fe4cc117bd474d2774032cc133e355146ed0a47021b2040ca9db4645c0. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 90978
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 300
    })
})


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

In [8]:
raw_datasets["train"][0]
#raw_datasets["test"][0]

{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'}}

In [9]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
#Preprocessing Function

#source_lang = 'en'
#target_lang = 'de'
#prefix = 'translate English to German: ' #Some models that are capable of multiple tasks need this prefix in the inputs to specifcy what the task is.

def preprocess_function(examples):

  inputs = [prefix + example[source_lang] for example in examples['translation']]
  targets = [example[target_lang] for example in examples['translation']]
  model_inputs = tokenizer(targets, max_length = 128, truncation = True)

  #Not sure why this has to be coded this way. EXPERIMENT
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length = 128, truncation = True)
  
  model_inputs['labels'] = labels['input_ids']
  return model_inputs

In [11]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched = True)

tokenized_datasets

  0%|          | 0/91 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 90978
    })
    test: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 300
    })
})

In [12]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

In [13]:
#TRAINING ARGS

batch_size = 16
model_name = model_checkpoint.split('/')[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 1,
    predict_with_generate = True,
    fp16 = True #Can only be run with GPU
)

In [14]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model= model)

In [15]:
#These functions help generate predicitons, and compute metrics

def postprocess_text(preds, labels):
  #print(f'preds unprocessed: {preds} \n labels unprocessed: {labels}')
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]
  return preds, labels
  
def compute_metrics(eval_preds):

  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens = True)

  #Replacing -100 in the labels, we can't decode them. (my guess is these are unknown words?)
  labels = np.where(labels !=-100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens = True)

  #Applying the postprocessing function from above

  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  print(f'predictions: {decoded_preds} \n labels: {decoded_labels}')

  #Computing metric
  result = metric.compute(predictions = [decoded_preds], references = [decoded_labels])
  result = {'bleu': result['score']}
  prediction_lens = [np.count_nonzero(pred!= tokenizer.pad_token_id) for pred in preds]
  result['gen_len'] = np.mean(prediction_lens)
  result = {k: round(v,4) for k,v in result.items()}

  return result

In [16]:
#initializing trainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['test'],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

Using amp half precision backend


In [17]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: translation. If translation are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 90978
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 5687


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.0063,0.002503,50.0307,17.7967


Saving model checkpoint to t5-small-finetuned-en-to-de/checkpoint-500
Configuration saved in t5-small-finetuned-en-to-de/checkpoint-500/config.json
Model weights saved in t5-small-finetuned-en-to-de/checkpoint-500/pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-en-to-de/checkpoint-500/tokenizer_config.json
Special tokens file saved in t5-small-finetuned-en-to-de/checkpoint-500/special_tokens_map.json
Saving model checkpoint to t5-small-finetuned-en-to-de/checkpoint-1000
Configuration saved in t5-small-finetuned-en-to-de/checkpoint-1000/config.json
Model weights saved in t5-small-finetuned-en-to-de/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-en-to-de/checkpoint-1000/tokenizer_config.json
Special tokens file saved in t5-small-finetuned-en-to-de/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to t5-small-finetuned-en-to-de/checkpoint-1500
Configuration saved in t5-small-finetuned-en-to-de/checkpoint-1500/config.js

predictions: ['Obama empfängt Netanyahu', 'Das Verhältnis zwischen Obama und Netanyahu ist nicht gerade freundschaftlich.', 'Die beiden wollten über die Umsetzung der internationalen Vereinbarung sowie über Teherans destabil', 'Bei der Begegnung soll es aber auch um den Konflikt mit den Paläst', 'Das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt.', 'Washington kritisiert den andauernden Siedlungsbau Israels und wirf', 'Durch den von Obama beworbenen Deal um das iranische Atomprogramm', 'Im März hatte Netanyahu auf Einladung der Republikaner vor dem US-Kongress eine', 'Die Rede war mit Obama nicht abgesprochen, ein Treffen hatte dieser mit Hinweis auf die', 'In einem Notruf gesteht Professor, seine Freundin erschossen zu haben', 'In einem Notruf erzählte Professor Shannon Lamb mit einer etwas zittrigen Stimme der Polizei', 'Lamb war es wichtig zu betonen, dass sein "süßer Hund"', 'Innerhalb des Hauses fanden die Beamten die Leiche von Amy Prentis', 'Es gab keinen Hinw



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=5687, training_loss=0.014935218598231796, metrics={'train_runtime': 736.0392, 'train_samples_per_second': 123.605, 'train_steps_per_second': 7.726, 'total_flos': 2334439361937408.0, 'train_loss': 0.014935218598231796, 'epoch': 1.0})

In [19]:
#For a standard task, you can use this

while False:
  from transformers import pipeline

  model = model.to(device)

  translator = pipeline(f'translation_{source_lang}_to_{target_lang}', model = model, tokenizer = tokenizer)
  #print(f'Translating {source_lang} to {target_lang}')

  sentence_to_translate = 'I really want to eat a big breakfast, take a long nap, and fall asleep on my balcony in the warm summer shade'

  print(translator(sentence_to_translate))
  break

In [20]:
input_sentence = 'There was once a big man who ate a massice donut'

input_ids = tokenizer.encode(input_sentence, return_tensors = 'pt').to(device) #dont forget that input tensors and model have to be on the same GPU

model = model.to(device)
output = model.generate(input_ids = input_ids,
                        )
print(tokenizer.decode(output[0], skip_special_tokens = True))

Es gab einmal einen großen Mann, der ate a massice donut.
