# T5 Implementation for Generating Match Commentary - Samyukt Sriram

Implementing T5 from the HuggingFace Transformers library for the task of generating cricket commentary. Code structure based on this guide: https://huggingface.co/docs/transformers/tasks/translation

Idea for using Machine Translation for generating long commentaries from short inputs from this paper: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15723716.pdf

Data used: https://www.kaggle.com/datasets/saivamshi/ipl-2019-commentary-data.
Add the ipl2019_final.csv file to the content directory in colab.

Wanted to implement the above idea on data sourced from cricket. This cricket commentary database is a lot more uniform in its language compared to the paper's, so better results might be possible.

This is still a work in progress, something is going wrong at some point in this pipeline. Output generated is identical to the input, and playing with decoder output generation parameters doesn't fix it.

Possibly because T5 is trained for specific kinds of translation tasks, and the data being used is not enough to change that behaviour?

In [2]:
import os
import pandas as pd


#ipl_path = '../input/ipl-2019-commentary-data/ipl2019_final.csv'
ipl_path = '/content/ipl2019_final.csv'
df = pd.read_csv(ipl_path)
df = df[['Short_comm', 'Commentary']]


df

Unnamed: 0,Short_comm,Commentary
0,"['Malinga to Thakur, OUT']","Mumbai Indians win IPL 2019! Slower ball, yor..."
1,"['Malinga to Thakur, 2 runs']","full toss on leg, and swiped away to deep back..."
2,"['Malinga to Watson, 1 run, OUT']","yorker, just outside off, and slapped to deep ..."
3,"['Malinga to Watson, 2 runs']","yorker, just outside leg, and hammered to wide..."
4,"['Malinga to Jadeja, 1 run']","low full toss on middle, and drilled back to M..."
...,...,...
13516,"['Chahar to Patel, FOUR runs']",misfield by by mid-off too full and Parthiv c...
13517,"['Chahar to Patel, no run']","fuller and straighter, Parthiv wants to drive,..."
13518,"['Chahar to Patel, no run']","good bowling, keeps it wide of off stump but w..."
13519,"['Chahar to Patel, no run']","back of a length ball, given some width outsid..."


In [3]:
# installing imports

!pip install datasets
!pip install transformers
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
[K     |████████████████████████████████| 362 kB 14.2 MB/s 
[?25hCollecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 62.6 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 67.6 MB/s 
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.0 MB/s 
[?25hCollecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
[K     |███████████████████████

In [4]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
import numpy as np

from datasets import load_dataset, load_metric, Dataset
from transformers import AutoTokenizer

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer



In [6]:
model_checkpoint = 't5-base'

#Change the below languages to get the language pair you want.
#Initially was lower in the Preprocessing function. Moved up here to adjust corpus as well.
source_lang = 'en'
target_lang = 'de'
prefix = 'Translate English to English: '

In [7]:
#Loading in data into the datasets object

raw_datasets = Dataset.from_pandas(df) #using fraction of corpus as a trial
#train_ds, test_ds = load_dataset('wmt16','de-en', split=['train[:100]', 'test[:100]'])
print(raw_datasets)
raw_datasets = raw_datasets.train_test_split(test_size = 0.1)
print(raw_datasets)
metric = load_metric('sacrebleu')

Dataset({
    features: ['Short_comm', 'Commentary'],
    num_rows: 13521
})
DatasetDict({
    train: Dataset({
        features: ['Short_comm', 'Commentary'],
        num_rows: 12168
    })
    test: Dataset({
        features: ['Short_comm', 'Commentary'],
        num_rows: 1353
    })
})


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

In [8]:
raw_datasets["train"][0]
#raw_datasets["test"][0]

{'Commentary': 'back of a length, wide outside off, and steered to short third man',
 'Short_comm': "['Unadkat to Dhoni, 1 run']"}

In [9]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [10]:
#Preprocessing Function

def preprocess_function(examples):

  inputs = [prefix + example[0] for example in examples['Short_comm']]
  targets = [example for example in examples['Commentary']]
  model_inputs = tokenizer(targets, max_length = 256, truncation = True)

  #Not sure why this has to be coded this way. EXPERIMENT
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length = 256, truncation = True)
  
  model_inputs['labels'] = labels['input_ids']
  return model_inputs

In [11]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched = True)

tokenized_datasets



  0%|          | 0/13 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['Short_comm', 'Commentary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12168
    })
    test: Dataset({
        features: ['Short_comm', 'Commentary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1353
    })
})

In [12]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

In [13]:
#TRAINING ARGS

batch_size = 16
model_name = model_checkpoint.split('/')[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-for-cricket",
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 4,
    predict_with_generate = True,
    fp16 = True, #Can only be run with GPU
)

In [14]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model= model)

In [15]:
#These functions help generate predicitons, and compute metrics

def postprocess_text(preds, labels):
  #print(f'preds unprocessed: {preds} \n labels unprocessed: {labels}')
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]
  return preds, labels
  
def compute_metrics(eval_preds):

  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens = True)

  #Replacing -100 in the labels, we can't decode them. (my guess is these are unknown words?)
  labels = np.where(labels !=-100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens = True)

  #Applying the postprocessing function from above

  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  #Computing metric
  result = metric.compute(predictions = [decoded_preds], references = [decoded_labels])
  result = {'bleu': result['score']}
  prediction_lens = [np.count_nonzero(pred!= tokenizer.pad_token_id) for pred in preds]
  result['gen_len'] = np.mean(prediction_lens)
  result = {k: round(v,4) for k,v in result.items()}

  return result

In [16]:
#initializing trainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['test'],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

Using cuda_amp half precision backend


In [17]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Short_comm, Commentary. If Short_comm, Commentary are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 12168
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3044


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.1111,0.000474,27.7469,17.8234
2,0.0023,9.7e-05,27.7469,17.8234
3,0.0019,6.9e-05,27.7469,17.8234
4,0.0012,5.6e-05,27.7469,17.8234


Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-500
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-500/config.json
Model weights saved in t5-base-finetuned-for-cricket/checkpoint-500/pytorch_model.bin
tokenizer config file saved in t5-base-finetuned-for-cricket/checkpoint-500/tokenizer_config.json
Special tokens file saved in t5-base-finetuned-for-cricket/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: Short_comm, Commentary. If Short_comm, Commentary are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1353
  Batch size = 16
Saving model checkpoint to t5-base-finetuned-for-cricket/checkpoint-1000
Configuration saved in t5-base-finetuned-for-cricket/checkpoint-1000/config.json
Model weights saved in t5-base-finetuned-for-cricket/

TrainOutput(global_step=3044, training_loss=0.020033849033670264, metrics={'train_runtime': 1183.1796, 'train_samples_per_second': 41.137, 'train_steps_per_second': 2.573, 'total_flos': 4356494264033280.0, 'train_loss': 0.020033849033670264, 'epoch': 4.0})

In [20]:
input_sentence = 'Translate: Malinga to Thakur'

input_ids = tokenizer.encode(input_sentence, return_tensors = 'pt').to(device) #dont forget that input tensors and model have to be on the same GPU

model = model.to(device)
output = model.generate(input_ids = input_ids,
                        max_length = 200,
                        do_sample = True,
                        num_return_sequences = 1,
                        temperature = 1,
                        )
print(tokenizer.decode(output[0], skip_special_tokens = True))

Translate: Malinga to Thakur
