# Fine-Tuning T5 Models

## One-Time Setup

### Install Dependencies

In [1]:
!pip install transformers
!pip install sentencepiece
!pip install git+https://github.com/google-research/bleurt.git
!pip install setuptools accelerate nvidia-ml-py3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.3 transformers-4.27.4
Looking in indexes: https://pypi.org/simple, htt

In [2]:
import torch
print(f'torch.__version__: {torch.__version__}')
!nvcc --version
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.__version__: 2.0.0+cu118
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [3]:
# # !git clone https://github.com/NVIDIA/apex
# !pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
# import torch
# print(f'torch.__version__: {torch.__version__}')
# torch.randn(1, 1, 32000).to(device='cuda:0')

### Connect to Google Drive
We load the data from Google Drive and save the trained models there as well, so let's mount Google Drive first.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports and Constants

In [4]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import transformers
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from pynvml import *
import os,sys,humanize,psutil
import gc
from torch.utils.data.dataloader import DataLoader

def print_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print("CPU RAM Used: " + humanize.naturalsize( psutil.virtual_memory().used))
    print("CPU RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available))

    print(f"GPU memory occupied: {info.used//1024**2} MB.")
    print('Using device:', device)
    print()
    if device.type == 'cuda':
        print(torch.cuda.get_device_name(0))
        print('Memory Usage:')
        print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
        print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_utilization()

print_utilization()
!nvidia-smi

CPU RAM Used: 1.9 GB
CPU RAM Free: 86.9 GB
GPU memory occupied: 449 MB.
Using device: cuda

NVIDIA A100-SXM4-40GB
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
Thu Apr  6 04:46:07 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    43W / 400W |      3MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
    

In [30]:
# HYPER_PARAMS = [{'model': 't5-base', 'dataset': 's1', 'max_len': 65, 'epochs': 3, 'training_samples': 100000, 'val_samples': 1000, 'batch_size': 16},
#                 {'model': 't5-base', 'dataset': 's2', 'max_len': 110, 'epochs': 3, 'training_samples': 100000, 'val_samples': 1000, 'batch_size': 16},
#                 {'model': 't5-base', 'dataset': 's3', 'max_len': 150, 'epochs': 3, 'training_samples': 100000, 'val_samples': 1000, 'batch_size': 16}]

DATA_NAME = "s2"
# T5_MODEL_NAME = "t5-small"
# T5_MODEL_NAME = "t5-base"
# T5_MODEL_NAME = "t5-large" - colab instances do not have enough memory for T5 large.
T5_MODEL_NAME = 'google/t5-v1_1-small'
# T5_MODEL_NAME = 'google/t5-v1_1-base'

MAIN_DATA_FILE = f'drive/MyDrive/MIDS/w266/project/datasci-w266-2023-spring-team-story-bot/posptproc_corpus_spacy_{DATA_NAME}.csv'
TRAIN_DATA_FILE = f'posptproc_corpus_spacy_{DATA_NAME}_train.csv'
VAL_DATA_FILE = f'posptproc_corpus_spacy_{DATA_NAME}_val.csv'
SEED = 42
CHECKPOINTS_TO_SAVE = 1

NUM_TRAIN_SAMPLES = 100000
# NUM_TRAIN_SAMPLES = 25000
# NUM_VAL_SAMPLES = 45000
NUM_VAL_SAMPLES = 1000
# MAX_LOAD_AT_ONCE = 10000
SRC_MAX_LENGTH=512
TARGET_MAX_LENGTH=128

# MODEL_CKPT_FOLDER = 'drive/MyDrive/MIDS/w266/project/checkpoints/'
# MODEL_CKPT_FILE = MODEL_CKPT_FOLDER + f'{T5_MODEL_NAME}-finetuned-02'
TUNED_T5_SAVED = f'drive/MyDrive/MIDS/w266/project/saved_models/final/{T5_MODEL_NAME}-data{DATA_NAME}-finetuned'
PROMPT = 'Generate next line: '
BATCH_SIZE = 16
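
Because `T5_MODEL_NAME` contains a slash here, the f-string behind `TUNED_T5_SAVED` expands to a path with an extra `google/` directory level. The sketch below only reproduces that string expansion; `flat_name` and `flat_path` are hypothetical names showing one possible way to keep everything in a single folder, and are not part of the notebook.

```python
# Values copied from the constants cell above.
T5_MODEL_NAME = 'google/t5-v1_1-small'
DATA_NAME = 's2'

# The same f-string the notebook uses for TUNED_T5_SAVED.
saved_path = f'drive/MyDrive/MIDS/w266/project/saved_models/final/{T5_MODEL_NAME}-data{DATA_NAME}-finetuned'
print(saved_path)

# Hypothetical alternative: replace the slash so the model lands directly under final/.
flat_name = T5_MODEL_NAME.replace('/', '_')
flat_path = f'drive/MyDrive/MIDS/w266/project/saved_models/final/{flat_name}-data{DATA_NAME}-finetuned'
print(flat_path)
```

Either form should work with `trainer.save_model`; the nested form just means the saved model sits one directory deeper than the path prefix suggests.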

### Split Data File

In [31]:
def split_datafile(main_file, train_file, val_file):
  data_df = pd.read_csv(main_file)
  data_df = data_df.astype({'variable':'string', 'label':'string'})
  
  x_train, x_val, y_train, y_val = train_test_split(data_df['variable'], data_df['label'], train_size=0.7, random_state=SEED)
  x_train = [PROMPT + x for x in x_train]
  x_val = [PROMPT + x for x in x_val]
  xy_train = {'variable': x_train, 'label': y_train}
  xy_val = {'variable': x_val, 'label': y_val}

  df_train = pd.DataFrame(xy_train)
  df_val = pd.DataFrame(xy_val)
  df_train.to_csv(train_file, index=False)
  df_val.to_csv(val_file, index=False)
  print(f'Split {data_df.shape[0]} entries into {df_train.shape[0]} and {df_val.shape[0]}')
  return x_train[:NUM_TRAIN_SAMPLES], x_val[:NUM_VAL_SAMPLES], y_train[:NUM_TRAIN_SAMPLES], y_val[:NUM_VAL_SAMPLES]

x_train, x_val, y_train, y_val = split_datafile(MAIN_DATA_FILE, TRAIN_DATA_FILE, VAL_DATA_FILE)

Split 205705 entries into 143993 and 61712
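
The two split sizes reported above can be checked by hand. The sketch below assumes that a fractional `train_size` is rounded down and that the validation set receives the remaining rows; under that assumption the numbers match the output of `split_datafile`.

```python
import math

n_total = 205705       # rows reported by split_datafile
train_fraction = 0.7   # train_size passed to train_test_split

# Assumption: train count is floor(fraction * n); validation gets the remainder.
n_train = math.floor(train_fraction * n_total)
n_val = n_total - n_train
print(n_train, n_val)
```

Only the first `NUM_TRAIN_SAMPLES` and `NUM_VAL_SAMPLES` rows of each split are returned, which is why the next cell reports 100000 and 1000.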


In [32]:
print(len(x_train), len(x_val), len(y_train), len(y_val))

100000 1000 100000 1000


In [33]:
t5_tokenizer = T5Tokenizer.from_pretrained(T5_MODEL_NAME)
t5_model = T5ForConditionalGeneration.from_pretrained(T5_MODEL_NAME).cuda()
print_utilization()

CPU RAM Used: 7.5 GB
CPU RAM Free: 81.3 GB
GPU memory occupied: 6709 MB.
Using device: cuda

NVIDIA A100-SXM4-40GB
Memory Usage:
Allocated: 3.7 GB
Cached:    4.5 GB


In [34]:
def tokenize(tokenizer, data, max_length, to_cuda=False):
  tokenized = tokenizer(
    list(data),
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt')
  if to_cuda:
    tokenized['input_ids'] = tokenized['input_ids'].cuda(non_blocking=True)
    tokenized['attention_mask'] = tokenized['attention_mask'].cuda(non_blocking=True)
  return tokenized
  
x_train_tokenized = tokenize(t5_tokenizer, x_train, SRC_MAX_LENGTH)
y_train_tokenized = tokenize(t5_tokenizer, y_train, TARGET_MAX_LENGTH)
x_val_tokenized = tokenize(t5_tokenizer, x_val, SRC_MAX_LENGTH)
y_val_tokenized = tokenize(t5_tokenizer, y_val, TARGET_MAX_LENGTH)
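
Since `padding='max_length'` pads every row to the full length, the size of the tokenized training data can be estimated up front. This is a rough sketch that assumes the returned tensors are 64-bit integers and counts only `input_ids` and `attention_mask`; `padded_bytes` is a hypothetical helper, not part of the notebook.

```python
BYTES_PER_TOKEN = 8  # assumption: int64 tensors

def padded_bytes(num_rows, max_length, tensors_per_row=2):
    """Bytes held by input_ids plus attention_mask when every row is padded to max_length."""
    return num_rows * max_length * tensors_per_row * BYTES_PER_TOKEN

train_src = padded_bytes(100_000, 512)  # NUM_TRAIN_SAMPLES x SRC_MAX_LENGTH
train_tgt = padded_bytes(100_000, 128)  # NUM_TRAIN_SAMPLES x TARGET_MAX_LENGTH
print(f'{train_src / 1e6:.1f} MB for sources, {train_tgt / 1e6:.1f} MB for targets')
```

Roughly 1 GB of host memory for the training tensors is comfortable on this instance, but the estimate scales linearly with both the sample count and the maximum lengths.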


In [35]:
# Create torch dataset
class ForT5Dataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets
    
    def __len__(self):
        return len(self.targets["input_ids"])
    
    def __getitem__(self, index):
        input_ids = self.inputs["input_ids"][index].squeeze()
        target_ids = self.targets["input_ids"][index].squeeze()
        attention_mask = self.inputs['attention_mask'][index].squeeze()
        return {'input_ids': input_ids,
                'attention_mask': attention_mask,
                'labels': target_ids}

training_set = ForT5Dataset(x_train_tokenized, y_train_tokenized)
validation_set = ForT5Dataset(x_val_tokenized, y_val_tokenized)
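
The dataset above returns the padded target ids unchanged as `labels`, so padding positions contribute to the loss. A common adjustment, shown here as a hedged sketch and not as what this notebook does, is to replace padding ids in the labels with `-100`, the value that PyTorch's cross-entropy loss ignores. `mask_padding` is a hypothetical helper, and the token ids in the toy tensor are made up for illustration.

```python
import torch

PAD_TOKEN_ID = 0     # T5's pad token id; in the notebook this would be t5_tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of label_ids with padding positions replaced by IGNORE_INDEX."""
    return label_ids.masked_fill(label_ids == pad_token_id, IGNORE_INDEX)

# Toy batch: two target sequences padded to length 5 (ids are illustrative only).
labels = torch.tensor([[37, 1712, 1, 0, 0],
                       [27, 1, 0, 0, 0]])
print(mask_padding(labels))
```

If adopted, the masking would apply to `target_ids` in `__getitem__`; the reported losses would then no longer be directly comparable with the ones recorded in this notebook.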

In [37]:
print(len(training_set), len(validation_set))

100000 1000


In [25]:
# def print_n(it, n=5):
#   for i in range(n):
#     print(f'{i+1}: {next(it)}')

# print_n(train_data_iterator(), n=1)
# print_n(val_data_iterator(), n=1)


## Train Model

In [39]:
%%time
args = Seq2SeqTrainingArguments(
    output_dir='checkpoints',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    load_best_model_at_end=True,
    save_total_limit=CHECKPOINTS_TO_SAVE,
    # learning_rate=3e-4,
    # optim='adamw_torch',
    learning_rate=1e-3,
    adafactor=True,
    # gradient_accumulation_steps=4,
    # fp16=True,
    bf16=True,
    tf32=True
)

# Define the trainer, passing in the model, training args, and the training and validation datasets

trainer = Seq2SeqTrainer(
    t5_model,
    args,
    train_dataset=training_set,
    eval_dataset=validation_set
)

result = trainer.train()
print_summary(result)



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.7612        | 0.709733        |
| 2     | 0.7274        | 0.688406        |
| 3     | 0.7005        | 0.679588        |


Time: 3179.66
Samples/second: 94.35
CPU RAM Used: 8.2 GB
CPU RAM Free: 80.5 GB
GPU memory occupied: 15245 MB.
Using device: cuda

NVIDIA A100-SXM4-40GB
Memory Usage:
Allocated: 0.9 GB
Cached:    12.9 GB
CPU times: user 53min 5s, sys: 12.3 s, total: 53min 17s
Wall time: 52min 59s
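
The reported throughput can be cross-checked against the run configuration. This sketch assumes a single GPU and no gradient accumulation (that option is commented out in the arguments), so one optimizer step consumes exactly one batch.

```python
import math

num_samples = 100_000    # NUM_TRAIN_SAMPLES
batch_size = 16          # BATCH_SIZE
epochs = 3               # num_train_epochs
train_runtime = 3179.66  # seconds, from the "Time" line above

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = num_samples * epochs / train_runtime
print(steps_per_epoch, total_steps, round(samples_per_second, 2))
```

The computed rate agrees with the `Samples/second` value printed by `print_summary`.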


In [40]:
trainer.save_model(TUNED_T5_SAVED)

In [41]:
# Post training cleanup
trainer = None
t5_model = None
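gc.collect()  # collect the dropped references first so their GPU memory can be freed
torch.cuda.empty_cache()  # then return cached, unused blocks to the driver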
!nvidia-smi -caa
print_utilization()

Cleared Accounted PIDs for GPU 00000000:00:04.0.
All done.
CPU RAM Used: 8.2 GB
CPU RAM Free: 80.5 GB
GPU memory occupied: 4529 MB.
Using device: cuda

NVIDIA A100-SXM4-40GB
Memory Usage:
Allocated: 0.9 GB
Cached:    2.4 GB


## Inference

In [42]:
# # Final test list for model trained against s2 dataset.
# FINAL_TEST_LIST = ['Princess Leia lay upon her bed all the night. She could not sleep at all.',
#                    'He stopped himself for a minute and thought if it was the right thing to do. It did seem like a good thing to do.',
#                    'There once lived king named Rama. He was very wise and just.',
#                    'Once upon a time, an old owl lived in the forest. He was very wise.']

# Final test list for model trained against s1 dataset.
FINAL_TEST_LIST = ['Princess Leia lay upon her bed all the night.',
                   'He stopped himself for a minute and thought if it was the right thing to do.',
                   'There once lived king named Rama.',
                   'Once upon a time, an old owl lived in the forest.']

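def evaluate(model, tokenizer, lines, prompt):
  """Print five sampled continuations for each input line."""
  transformers.logging.set_verbosity_error()  # silence generation warnings
  for test_input_text in lines:
    # Prepend the same prompt that was used during fine-tuning.
    test_inputs = tokenizer([prompt + test_input_text], return_tensors='pt')
    test_output_ids = model.generate(
        test_inputs['input_ids'].cuda(),
        num_beams=5,
        no_repeat_ngram_size=3,  # never repeat a 3-gram within one output
        num_return_sequences=5,
        max_new_tokens=100,
        do_sample=True,          # sample within beam search instead of always taking the top token
        top_k=0)                 # 0 disables top-k filtering
    print(f'Input: {test_input_text}')
    decoded = [tokenizer.decode(out_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False) for out_ids in test_output_ids]
    print(f'Output: {decoded}')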
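
With `num_return_sequences=5`, the list of decoded candidates can contain the same sentence more than once. A minimal sketch of removing such repeats while keeping the original order; `unique_in_order` is a hypothetical helper, not part of the notebook, and the sample strings are taken from the outputs recorded further down.

```python
def unique_in_order(candidates):
    """Drop repeated strings while preserving first-seen order."""
    return list(dict.fromkeys(candidates))

decoded = ['She was so tired that she could not sleep at all.',
           'She could not sleep at all.',
           'She could not sleep at all.']
print(unique_in_order(decoded))
```

This only collapses exact duplicates; near-duplicates that differ by a word or punctuation mark would still appear separately.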

In [43]:
## Untrained T5 model
# evaluate(T5ForConditionalGeneration.from_pretrained("t5-large").cuda(), t5_tokenizer, FINAL_TEST_LIST, "Continue the next sentence of the story: ")

In [44]:
## Fine tuned T5 model
evaluate(T5ForConditionalGeneration.from_pretrained(TUNED_T5_SAVED).cuda(), t5_tokenizer, FINAL_TEST_LIST, PROMPT)


Input: Princess Leia lay upon her bed all the night.
Output: ['"It\'s a pity," she said.', 'She sat on the bed, and she slept for a long time.', 'She sat down on the bed, and said, “Alas!', 'She was a beautiful princess, and she was as beautiful as the sun, and the sun was shining brightly on the sky.', 'She sat on the bed.']
Input: He stopped himself for a minute and thought if it was the right thing to do.
Output: ['He had a frightened face, and he had sat down on the floor and listened to the sound of a thunderbolt.', '“It’s the right thing,” he said.', '“Oh, I’m going to give you a piece of money,” he said.', '"It\'s the best thing you can do," he said.', 'He did not know what he was thinking about.']
Input: There once lived king named Rama.
Output: ['He was a king, and he had a great deal of money in his pocket.', "He was the king's son, and he was a king.", 'He was a king, and he had a son.', 'He was a rich man, and he was rich.', 'He was a king, and he lived in a cave in the for

In [None]:
TUNED_T5_SAVED

'drive/MyDrive/MIDS/w266/project/saved_models/t5-base-datas1-finetuned'

In [None]:
evaluate(T5ForConditionalGeneration.from_pretrained(TUNED_T5_SAVED).cuda(), t5_tokenizer, FINAL_TEST_LIST, PROMPT)


Input: Princess Leia lay upon her bed all the night. She could not sleep at all.
Output: ['She slept in a slumber, but she did not know how to get out of bed.', 'She was very ill, and she could not sleep for a long time.', 'She was so tired that she could not sleep at all.', 'She could not sleep at all.', 'She could not sleep at all.']
Input: He stopped himself for a minute and thought if it was the right thing to do. It did seem like a good thing to do.
Output: ['He went out to eat a little, and then he went to bed, and he sat down with a cup of tea, and said, “It’s a good thing to do.”', 'It was a good thing to do.', 'It was a good thing to do.', 'It was a good thing to do.', 'He thought it was the right thing to do.']
Input: There once lived king named Rama. He was very wise and just.
Output: ['Rama was a king of India, and he had a great wealth of wealth.', 'Rama was a very good king, and he was very good to his people.', 'Rama was a good king, and he had a great wealth of wealth.'