# T5 Implementation for Generating Match Commentary - Samyukt Sriram

Implementing T5 from the HuggingFace Transformers library for the task of generating cricket commentary. Code structure based on this guide: https://huggingface.co/docs/transformers/tasks/translation

The idea of using machine translation to generate long commentaries from short inputs comes from this paper: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15723716.pdf

Data used: https://www.kaggle.com/datasets/saivamshi/ipl-2019-commentary-data

Wanted to implement the above idea on data sourced from cricket. This cricket commentary database is a lot more uniform in its language compared to the paper's, so better results might be possible.

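This is still a work in progress: something is going wrong at some point in this pipeline. The generated output is identical to the input, and changing the decoder's generation parameters doesn't fix it.

Possibly this is because T5 was trained for specific kinds of translation tasks, and the data used here is not enough to change that behaviour. A validation loss close to zero after the first epoch (see the training log below) also points to the input/label construction in preprocessing as something to check.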

In [None]:
import os
import pandas as pd


#ipl_path = '../input/ipl-2019-commentary-data/ipl2019_final.csv'
ipl_path = '/content/ipl2019_final.csv'
df = pd.read_csv(ipl_path)
df = df[['Short_comm', 'Commentary']]


df

In [2]:
# Installing dependencies

!pip install datasets
!pip install transformers
!pip install sacrebleu


In [3]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [4]:
import numpy as np

from datasets import load_dataset, load_metric, Dataset
from transformers import AutoTokenizer

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer



In [5]:
model_checkpoint = 't5-base'

# Task prefix prepended to every input. T5 was pretrained with prefixes such as
# 'translate English to German: '; this one is a custom prefix for the commentary task.
# source_lang and target_lang are left over from the translation guide and are not used below.
source_lang = 'en'
target_lang = 'de'
prefix = 'Translate English to English: '

In [6]:
# Loading the data into a datasets.Dataset object

raw_datasets = Dataset.from_pandas(df) # all rows of df are used
#train_ds, test_ds = load_dataset('wmt16','de-en', split=['train[:100]', 'test[:100]'])
print(raw_datasets)
raw_datasets = raw_datasets.train_test_split(test_size = 0.1)
print(raw_datasets)
metric = load_metric('sacrebleu')

Dataset({
    features: ['Short_comm', 'Commentary'],
    num_rows: 13521
})
DatasetDict({
    train: Dataset({
        features: ['Short_comm', 'Commentary'],
        num_rows: 12168
    })
    test: Dataset({
        features: ['Short_comm', 'Commentary'],
        num_rows: 1353
    })
})



In [7]:
raw_datasets["train"][0]
#raw_datasets["test"][0]

{'Commentary': 'short ball down leg side, looks for the hook, mistimed towards fine leg',
 'Short_comm': "['Stokes to Shubman Gill, 1 run']"}
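
The sample row above shows that `Short_comm` is stored as the string form of a Python list, so indexing it with `[0]` returns the character `[` rather than the first entry. A minimal sketch of one way to recover the text is given below; `parse_short_comm` is a hypothetical helper, not part of the original notebook, and it assumes each row holds at most one list entry of interest.

```python
import ast

def parse_short_comm(raw):
    """Return the first entry of a stringified list, or the raw text if it is not a list."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the row as plain text
        return raw.strip()
    if isinstance(value, list) and value:
        return str(value[0]).strip()
    return raw.strip()

sample = "['Stokes to Shubman Gill, 1 run']"
print(sample[0])                 # the first character of the string: [
print(parse_short_comm(sample))  # Stokes to Shubman Gill, 1 run
```

Falling back to the raw text keeps the helper usable on rows that are not list-formatted, at the cost of silently passing through malformed rows.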

In [8]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [9]:
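# Preprocessing function

def preprocess_function(examples):

  # Short_comm holds a stringified list such as "['Stokes to Shubman Gill, 1 run']".
  # example[0] would return only the first character '[', so strip the list
  # punctuation instead (assumes one entry per row, as in the sample above).
  inputs = [prefix + example.strip("[]'\"") for example in examples['Short_comm']]
  targets = list(examples['Commentary'])
  # Encode the short inputs here. Encoding the targets as model inputs makes the
  # labels identical to the inputs, which trains the model to copy its input.
  model_inputs = tokenizer(inputs, max_length = 256, truncation = True)

  # as_target_tokenizer() switches the tokenizer to target mode for models with
  # separate source/target vocabularies. T5 shares one vocabulary, so the ids should
  # be the same either way; it is kept here to follow the guide.
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length = 256, truncation = True)

  model_inputs['labels'] = labels['input_ids']
  return model_inputs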

In [10]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched = True)

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['Short_comm', 'Commentary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12168
    })
    test: Dataset({
        features: ['Short_comm', 'Commentary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1353
    })
})
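
Since the symptom described at the top is output identical to the input, one quick sanity check on the tokenized data is to measure how often `labels` equals `input_ids`. The sketch below is an illustration only: `copy_fraction` is a hypothetical helper and the toy rows stand in for real tokenized examples.

```python
def copy_fraction(rows):
    """Fraction of rows whose labels are exactly equal to their input_ids."""
    rows = list(rows)
    if not rows:
        return 0.0
    same = sum(1 for row in rows if list(row["input_ids"]) == list(row["labels"]))
    return same / len(rows)

# Toy rows standing in for tokenized examples (ids are made up)
toy_rows = [
    {"input_ids": [5, 6, 1], "labels": [5, 6, 1]},  # labels copy the input
    {"input_ids": [7, 8, 1], "labels": [9, 1]},     # labels differ from the input
]
print(copy_fraction(toy_rows))  # 0.5
```

On the notebook's data this could be called on a small sample, for example `copy_fraction(tokenized_datasets["train"].select(range(100)))`. A value near 1.0 would mean the model is being trained on a copy task.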

In [11]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)


In [12]:
#TRAINING ARGS

batch_size = 16
model_name = model_checkpoint.split('/')[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-for-cricket",
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 4,
    predict_with_generate = True,
    fp16 = True, #Can only be run with GPU
)

In [13]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model= model)

In [14]:
# These functions post-process the decoded predictions and compute metrics

def postprocess_text(preds, labels):
  #print(f'preds unprocessed: {preds} \n labels unprocessed: {labels}')
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]
  return preds, labels
  
def compute_metrics(eval_preds):

  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens = True)

  # Replace -100 in the labels before decoding. -100 marks padded label positions that the loss ignores (set by the data collator), so it is not a valid token id.
  labels = np.where(labels !=-100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens = True)

  #Applying the postprocessing function from above

  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  print(f'predictions: {decoded_preds} \n labels: {decoded_labels}')

  #Computing metric
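  # sacreBLEU expects one prediction string per example and a list of reference strings per example
  result = metric.compute(predictions = decoded_preds, references = [[label] for label in decoded_labels])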
  result = {'bleu': result['score']}
  prediction_lens = [np.count_nonzero(pred!= tokenizer.pad_token_id) for pred in preds]
  result['gen_len'] = np.mean(prediction_lens)
  result = {k: round(v,4) for k,v in result.items()}

  return result
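
The `-100` replacement in `compute_metrics` can be illustrated without NumPy. The sketch below is a plain-Python version of the same idea, not the notebook's code: `restore_padding` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption to be replaced by `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # label value used for padded positions that the loss skips

def restore_padding(label_rows, pad_id=0):
    """Swap the ignore index for the pad id so each row can be decoded."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_rows]

# Two label rows padded to the same length; ids are illustrative only
batch = [[71, 1, -100, -100], [5, 6, 7, 1]]
print(restore_padding(batch))  # [[71, 1, 0, 0], [5, 6, 7, 1]]
```

After this step, decoding with `skip_special_tokens = True` drops the pad tokens, so the padded positions do not appear in the decoded text.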

In [16]:
print(tokenizer)
vocab = 32128

0


In [15]:
#initializing trainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['test'],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

Using amp half precision backend


In [16]:
#os.environ["WANDB_DISABLED"] = "true" #For some reason this needs to be disabled, bc i don't have a wandb key.

In [17]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Commentary, Short_comm. If Commentary, Short_comm are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 12168
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3044


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.106,0.000351,27.4808,17.864
2,0.0022,0.000194,27.4702,17.864
3,0.0018,0.00011,27.4702,17.864
4,0.0013,8.2e-05,27.4702,17.864


Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-500
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-500/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-500/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-500/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Commentary, Short_comm. If Commentary, Short_comm are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1353
  Batch size = 16


predictions: ['nicely placed by Stoinis, into the acres of space on the leg side with midwick', 'a low full toss outside off this time, Warner this time moves across and flick', 'fuller around off, comes forward and pushes it towards covers', 'flat, full, angling down leg, clipped down to deep backward square leg', 'knocked away through square leg for one', 'slower ball but short and wide, Bairstow was a little early into the', 'another googly, and de Kock bends into a reverse sweep -', '135kph, short ball, hurries Dhoni, he', 'neatly tucked away. Full on middle stump, gets low with the front foot across', 'yorker, tailing in, de Villiers squeezes it out to short fine-leg', 'length outside off, has a tentative poke at it from the crease, the hint of', 'length ball on middle, but sliding down the leg side. Dhawan misses the', 'full on middle, du Plessis makes a bit of room and goes through the line', 'waits for the ball to come to him from outside off and steers it late to third', 'floa

Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-1000
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-1000/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-1000/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-1500
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-1500/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-1500/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-1500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` an

predictions: ['nicely placed by Stoinis, into the acres of space on the leg side with midwick', 'a low full toss outside off this time, Warner this time moves across and flick', 'fuller around off, comes forward and pushes it towards covers', 'flat, full, angling down leg, clipped down to deep backward square leg', 'knocked away through square leg for one', 'slower ball but short and wide, Bairstow was a little early into the', 'another googly, and de Kock bends into a reverse sweep -', '135kph, short ball, hurries Dhoni, he', 'neatly tucked away. Full on middle stump, gets low with the front foot across', 'yorker, tailing in, de Villiers squeezes it out to short fine-leg', 'length outside off, has a tentative poke at it from the crease, the hint of', 'length ball on middle, but sliding down the leg side. Dhawan misses the', 'full on middle, du Plessis makes a bit of room and goes through the line', 'waits for the ball to come to him from outside off and steers it late to third', 'floa

Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-2000
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-2000/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-2000/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-2000/special_tokens_map.json
Deleting older checkpoint [t5-base-finetuned-for-cricket/checkpoint-500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Commentary, Short_comm. If Commentary, Short_comm are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1353
  Batch size = 16


predictions: ['nicely placed by Stoinis, into the acres of space on the leg side with midwick', 'a low full toss outside off this time, Warner this time moves across and flick', 'fuller around off, comes forward and pushes it towards covers', 'flat, full, angling down leg, clipped down to deep backward square leg', 'knocked away through square leg for one', 'slower ball but short and wide, Bairstow was a little early into the', 'another googly, and de Kock bends into a reverse sweep -', '135kph, short ball, hurries Dhoni, he', 'neatly tucked away. Full on middle stump, gets low with the front foot across', 'yorker, tailing in, de Villiers squeezes it out to short fine-leg', 'length outside off, has a tentative poke at it from the crease, the hint of', 'length ball on middle, but sliding down the leg side. Dhawan misses the', 'full on middle, du Plessis makes a bit of room and goes through the line', 'waits for the ball to come to him from outside off and steers it late to third', 'floa

Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-2500
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-2500/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-2500/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-2500/special_tokens_map.json
Deleting older checkpoint [t5-base-finetuned-for-cricket/checkpoint-1000] due to args.save_total_limit
Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-3000
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-3000/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-3000/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-3000/special_tokens_map.json
Deleting older chec

predictions: ['nicely placed by Stoinis, into the acres of space on the leg side with midwick', 'a low full toss outside off this time, Warner this time moves across and flick', 'fuller around off, comes forward and pushes it towards covers', 'flat, full, angling down leg, clipped down to deep backward square leg', 'knocked away through square leg for one', 'slower ball but short and wide, Bairstow was a little early into the', 'another googly, and de Kock bends into a reverse sweep -', '135kph, short ball, hurries Dhoni, he', 'neatly tucked away. Full on middle stump, gets low with the front foot across', 'yorker, tailing in, de Villiers squeezes it out to short fine-leg', 'length outside off, has a tentative poke at it from the crease, the hint of', 'length ball on middle, but sliding down the leg side. Dhawan misses the', 'full on middle, du Plessis makes a bit of room and goes through the line', 'waits for the ball to come to him from outside off and steers it late to third', 'floa



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=3044, training_loss=0.01913762918909652, metrics={'train_runtime': 1157.3905, 'train_samples_per_second': 42.053, 'train_steps_per_second': 2.63, 'total_flos': 4388673882562560.0, 'train_loss': 0.01913762918909652, 'epoch': 4.0})
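
The step count reported in the log can be reproduced from the dataset size. The sketch below assumes a single device and no gradient accumulation, as the log reports; both function names are hypothetical, and the 500-step save interval is read off the checkpoint names in the log.

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # a final partial batch still counts
    return steps_per_epoch * epochs

def checkpoint_steps(total_steps, save_every=500):
    """Steps at which a checkpoint is written for a fixed save interval."""
    return list(range(save_every, total_steps + 1, save_every))

steps = total_optimization_steps(12168, 16, 4)
print(steps)                    # 3044
print(checkpoint_steps(steps))  # [500, 1000, 1500, 2000, 2500, 3000]
```

With 12168 training examples and a batch size of 16 there are 761 steps per epoch, so four epochs give the 3044 steps shown in `TrainOutput`.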

In [19]:
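# Note: training used the prefix 'Translate English to English: ' (see `prefix`),
# while this prompt uses 'Translate: '. A prompt should match the training format.
input_sentence = 'Translate: Malinga to Thakur'

# Input tensors and model have to be on the same device
input_ids = tokenizer.encode(input_sentence, return_tensors = 'pt').to(device)

model = model.to(device)
# min_length = 100 forces at least 100 generated tokens, even if the model would stop earlier
output = model.generate(input_ids = input_ids,
                        min_length = 100,
                        max_length = 200,
                        do_sample = True,
                        num_return_sequences = 1,
                        temperature = 1,
                        repetition_penalty = 2.0
                        )
print(tokenizer.decode(output[0], skip_special_tokens = True))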

Translate: Malinga to Thakur. Translat-bois: German to Thaikur [Diffuser] by Brahm Akin tashita (*) 2/3/14 "From Malian to Thakur." Translation and translation: Malingpto Thakur Swedish to Thakur Thesaurus Greek with Turkish subtitles, translated in C Täu tayk or Arabic into Thakur French for translating: Malingá to Thakur
