# Fine-tune T5-small on XSum

## Libraries and environment preparation

In [1]:
%%capture
# Install essential packages (the %%capture magic must be the first line of the cell)
!pip install datasets transformers rouge-score nltk wandb

In [2]:
#install Git-LFS
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (910 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155229 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [3]:
#Colab Environment Check for GPU and RAM
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

#GPU check
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Your runtime has 27.3 gigabytes of available RAM

Tue Jan 25 20:01:43 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-------------------------------------

Make sure your version of Transformers is at least 4.11.0, since the functionality used below was introduced in that version:

In [17]:
# Make sure your version of Transformers is at least 4.11.0 
# to run the following code correctly:
import transformers
import datasets
print(transformers.__version__)

4.15.0
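The version check above is done by eye. A minimal sketch of doing it programmatically follows; `parse_version` and `meets_minimum` are hypothetical helper names, and the parsing is deliberately simplified (the `packaging` library is the more robust choice when it is available). Comparing tuples of integers avoids the pitfall of plain string comparison, where "4.9.2" would sort after "4.11.0".

```python
import re


def parse_version(text):
    """Turn a release string such as '4.15.0' into a tuple of ints: (4, 15, 0)."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum="4.11.0"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("4.15.0"))  # the version printed above
```

In the notebook, `transformers.__version__` would be passed as the `installed` argument.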


In [5]:
# Import Wandb and set the API key
import os
import wandb
from getpass import getpass
# Enter the key at the prompt instead of hard-coding it in the notebook
os.environ["WANDB_API_KEY"] = getpass("W&B API key: ")

## Loading and preprocessing the dataset

In [6]:
from datasets import load_dataset
raw_datasets = load_dataset("xsum")

Using custom data configuration default


Downloading and preparing dataset xsum/default (download: 245.38 MiB, generated: 507.60 MiB, post-processed: Unknown size, total: 752.98 MiB) to /root/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934...


Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934. Subsequent calls will reuse this data.


In [7]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [8]:
model_checkpoint = "t5-small"
from transformers import T5TokenizerFast
tokenizer = T5TokenizerFast.from_pretrained("t5-small")

In [9]:
# If you are using one of the five T5 checkpoints, the inputs must be prefixed
# with "summarize: " (T5 is a multi-task model).

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

For XSum, the input documents are about 1500 tokens long and the summaries are about 160 tokens long. Here we truncate them to 512 and 64 tokens respectively:

In [10]:
# Tokenize the inputs and targets; this function is applied to the dataset with map()

max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
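To get a feel for how much text a length limit discards, the share of truncated sequences and dropped tokens can be computed from a list of token counts. The sketch below is illustrative only: `truncation_stats` is a hypothetical helper and the token counts are made up. Real counts could be gathered by tokenizing a sample of documents without truncation and taking the length of each `input_ids` list.

```python
def truncation_stats(token_lengths, max_length):
    """Return (share of sequences cut, share of tokens dropped) for a length limit."""
    total_tokens = sum(token_lengths)
    kept_tokens = sum(min(n, max_length) for n in token_lengths)
    cut = sum(1 for n in token_lengths if n > max_length)
    return cut / len(token_lengths), 1 - kept_tokens / total_tokens


# Made-up token counts for five documents, for illustration only
example_lengths = [300, 512, 800, 1500, 2048]
share_cut, share_dropped = truncation_stats(example_lengths, max_length=512)
print(f"{share_cut:.0%} of documents truncated, {share_dropped:.0%} of tokens dropped")
```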

In [11]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

In [12]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11334
    })
})

## Fine-tuning the model

In [44]:
# Import Huggingface Automodel class from model checkpoint and print details

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
 

In [45]:
# data collator: pad the inputs and labels during each batch to save space
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
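The idea behind per-batch padding can be sketched in plain Python. This is a simplified stand-in, not the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the token ids are invented, and the real collator additionally returns tensors and can prepare decoder inputs from the model. Inputs are padded with the pad token id (0 for T5), while labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest one in the batch (simplified collator)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch


# Invented token ids for two examples of different lengths
toy = [
    {"input_ids": [21, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 9, 4, 8, 1], "labels": [6, 3, 2, 1]},
]
batch = pad_batch(toy)
print(batch["input_ids"][0])  # [21, 7, 1, 0, 0]
print(batch["labels"][0])     # [5, 1, -100, -100]
```

Padding to the longest example in each batch, rather than to a global maximum, is what saves memory and compute.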

In [46]:
# keep track with wandb
wandb.init(project="T5-small")

0,1
eval/gen_len,▁
eval/loss,▁
eval/rouge1,▁
eval/rouge2,▁
eval/rougeL,▁
eval/rougeLsum,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁█

0,1
eval/gen_len,18.8429
eval/loss,2.42466
eval/rouge1,29.312
eval/rouge2,8.4332
eval/rougeL,23.2114
eval/rougeLsum,23.2169
eval/runtime,319.3021
eval/samples_per_second,35.49
eval/steps_per_second,2.22
train/epoch,0.1


Define a `compute_metrics` function that the `Seq2SeqTrainer` will use to compute the metrics from the predictions. It also does a bit of pre-processing to decode the predictions into text:

In [18]:
# Define compute_metrics
import nltk
import numpy as np
nltk.download('punkt')

metric = datasets.load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
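To make the reported numbers easier to interpret, here is a minimal sketch of what a ROUGE-1 F-measure captures: the overlap of unigrams between a prediction and its reference. It is a simplification, not the `rouge_score` implementation: `rouge1_f` is a hypothetical helper that uses whitespace tokenization and omits stemming and the bootstrap aggregation behind `value.mid.fmeasure`.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat", "the cat sat on the mat")
print(round(score * 100, 2))  # 66.67 (precision 1.0, recall 0.5)
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of n-gram counts.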


In [47]:
# Define training args, batch size and epochs
# (the maximum batch size was 8 for input length 1024 on Colab Pro)

batch_size = 16
epochs = 1
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    load_best_model_at_end=True,  # expects a bool; the metric is set on the next line
    metric_for_best_model="eval_loss",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=1000,  # log the training loss every 1000 steps
    save_steps=1250,  # save a checkpoint every 1250 steps
    eval_steps=1250,  # evaluate every 1250 steps
    save_total_limit=3,
    num_train_epochs=epochs,
    predict_with_generate=True,
    fp16=True,
    report_to="wandb",
)

PyTorch: setting up devices


In [48]:
# Pass into the trainer

train_dataset=tokenized_datasets["train"]
eval_dataset=tokenized_datasets["validation"]

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Using amp half precision backend


We can now fine-tune our model by just calling the `train` method:

In [49]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running training *****
  Num examples = 204045
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 12753
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1250 | 2.9154 | 2.567737 | 27.1573 | 6.9159 | 21.2421 | 21.2466 | 18.8317 |
| 2500 | 2.8007 | 2.520872 | 27.9247 | 7.4419 | 21.9669 | 21.9659 | 18.8217 |
| 3750 | 2.7672 | 2.49243 | 28.3801 | 7.7393 | 22.3389 | 22.3441 | 18.7944 |
| 5000 | 2.7283 | 2.475576 | 28.6439 | 7.8905 | 22.5284 | 22.5279 | 18.8419 |
| 6250 | 2.7082 | 2.458713 | 28.7996 | 7.987 | 22.6692 | 22.6716 | 18.8124 |
| 7500 | 2.694 | 2.449851 | 28.9603 | 8.1435 | 22.848 | 22.8445 | 18.8064 |
| 8750 | 2.6906 | 2.441512 | 29.1038 | 8.2011 | 22.931 | 22.9365 | 18.8113 |
| 10000 | 2.6765 | 2.435417 | 29.1162 | 8.2422 | 22.9891 | 22.9925 | 18.8087 |
| 11250 | 2.6634 | 2.432583 | 29.1001 | 8.2451 | 23.002 | 23.0016 | 18.818 |
| 12500 | 2.671 | 2.430608 | 29.1736 | 8.2798 | 23.0388 | 23.0324 | 18.8128 |


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running Evaluation *****
  Num examples = 11332
  Batch size = 16
Saving model checkpoint to t5-small-finetuned-xsum/checkpoint-1250
Configuration saved in t5-small-finetuned-xsum/checkpoint-1250/config.json
Model weights saved in t5-small-finetuned-xsum/checkpoint-1250/pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-xsum/checkpoint-1250/tokenizer_config.json
Special tokens file saved in t5-small-finetuned-xsum/checkpoint-1250/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running Evaluation *****
  Num examples = 11332
  Batch size = 16
Saving model checkpoint to t5-small-finetuned-xsum/checkpoint-2500
Configuration saved in t5-small-finetuned-xsum

TrainOutput(global_step=12753, training_loss=2.724424991683942, metrics={'train_runtime': 7885.3621, 'train_samples_per_second': 25.876, 'train_steps_per_second': 1.617, 'total_flos': 2.761556411547648e+16, 'train_loss': 2.724424991683942, 'epoch': 1.0})
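The step count in the log can be reproduced with a little arithmetic. The sketch below assumes a single device and that the last, partially filled batch is kept; `optimization_steps` is a hypothetical helper, not a Transformers function.

```python
import math


def optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Number of optimizer updates for a run on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


total = optimization_steps(204045, batch_size=16, epochs=1)
print(total)          # 12753, as in the log
print(total // 1250)  # 10 evaluations at eval_steps=1250
```

The same function gives 12500 steps for 10,000 examples trained for 20 epochs, which is the configuration used for the smaller dataset later in the notebook.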

In [50]:
wandb.finish()

0,1
eval/gen_len,▆▅▁█▄▃▃▃▄▄
eval/loss,█▆▄▃▂▂▂▁▁▁
eval/rouge1,▁▄▅▆▇▇████
eval/rouge2,▁▄▅▆▆▇████
eval/rougeL,▁▄▅▆▇▇████
eval/rougeLsum,▁▄▅▆▇▇████
eval/runtime,▁▄▇▂█▇▅▄▄▇
eval/samples_per_second,█▅▂▇▁▂▄▅▅▂
eval/steps_per_second,█▅▃▇▁▂▄▅▅▂
train/epoch,▁▁▂▂▂▃▃▃▃▄▄▅▅▅▆▆▆▆▇▇███

0,1
eval/gen_len,18.8128
eval/loss,2.43061
eval/rouge1,29.1736
eval/rouge2,8.2798
eval/rougeL,23.0388
eval/rougeLsum,23.0324
eval/runtime,322.2967
eval/samples_per_second,35.16
eval/steps_per_second,2.2
train/epoch,1.0


In [51]:
!ls t5-small-finetuned-xsum/

checkpoint-10000  checkpoint-11250  checkpoint-12500
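Only three checkpoint folders remain because `save_total_limit=3` deletes older ones as new ones are written. A simplified sketch of that rotation, under the assumption that the oldest checkpoint is always the one removed (the Trainer may additionally protect the best checkpoint, which in this run is also the latest one); `remaining_checkpoints` is a hypothetical helper:

```python
def remaining_checkpoints(total_steps, save_steps, save_total_limit):
    """Steps of the checkpoints left on disk after oldest-first rotation."""
    saved = list(range(save_steps, total_steps + 1, save_steps))
    return saved[-save_total_limit:]


print(remaining_checkpoints(12753, save_steps=1250, save_total_limit=3))
# [10000, 11250, 12500]
```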


In [62]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [59]:
!zip -r /content/t5-small-finetuned-xsum.zip /content/t5-small-finetuned-xsum/checkpoint-12500/

  adding: content/t5-small-finetuned-xsum/checkpoint-12500/ (stored 0%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/training_args.bin (deflated 48%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/scheduler.pt (deflated 49%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/scaler.pt (deflated 55%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/special_tokens_map.json (deflated 83%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/optimizer.pt (deflated 7%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/rng_state.pth (deflated 27%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/pytorch_model.bin (deflated 8%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/config.json (deflated 62%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/trainer_state.json (deflated 78%)
  adding: content/t5-small-finetuned-xsum/checkpoint-12500/tokenizer.json (deflated 59%)
  adding: content/t5-small-finetun

In [63]:
!cp t5-small-finetuned-xsum.zip '/content/drive/My Drive/weights/'

## Trying with a smaller dataset

In [28]:
# Init new logging params
wandb.init(project="T5-small")

In [29]:
# Select the first examples to get a smaller dataset
small_train = raw_datasets['train'].select(range(10000))
small_val = raw_datasets['validation'].select(range(1000))
small_train

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10000
})

In [30]:
tokenized_train = small_train.map(preprocess_function, batched=True)
tokenized_val = small_val.map(preprocess_function, batched=True)
tokenized_train

Dataset({
    features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

In [31]:
# Import a new T5-small
model_small = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
 

In [32]:
# data collator: pad the inputs and labels during each batch to save space
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_small)

In [33]:
# Define training args, batch size and epochs
# (the maximum batch size is 16 on Colab Pro)

batch_size = 16
epochs = 20
model_name = model_checkpoint.split("/")[-1]
args_small = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum-small",
    load_best_model_at_end=True,  # expects a bool; the metric is set on the next line
    metric_for_best_model="eval_loss",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=1000,  # log the training loss every 1000 steps
    save_steps=1250,  # save a checkpoint every 1250 steps
    eval_steps=1250,  # evaluate every 1250 steps
    save_total_limit=3,
    num_train_epochs=epochs,
    predict_with_generate=True,
    fp16=True,
    report_to="wandb",
)

PyTorch: setting up devices


In [34]:
# Pass into the trainer

train_dataset=tokenized_train
eval_dataset=tokenized_val

trainer_small = Seq2SeqTrainer(
    model_small,
    args_small,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Using amp half precision backend


In [35]:
trainer_small.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running training *****
  Num examples = 10000
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 12500
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1250 | 2.9136 | 2.567745 | 27.3293 | 7.2675 | 21.628 | 21.6164 | 18.788 |
| 2500 | 2.7553 | 2.521096 | 28.1019 | 7.7336 | 22.2597 | 22.2611 | 18.791 |
| 3750 | 2.6799 | 2.503271 | 28.1919 | 7.8234 | 22.3281 | 22.3369 | 18.794 |
| 5000 | 2.5815 | 2.492357 | 28.558 | 7.9851 | 22.3408 | 22.3591 | 18.819 |
| 6250 | 2.5425 | 2.485655 | 28.8282 | 8.0046 | 22.7569 | 22.7682 | 18.829 |
| 7500 | 2.5176 | 2.480971 | 29.1581 | 8.1411 | 22.989 | 23.0065 | 18.851 |
| 8750 | 2.4913 | 2.480038 | 29.4905 | 8.4728 | 23.3022 | 23.3217 | 18.854 |
| 10000 | 2.459 | 2.478985 | 29.2759 | 8.2754 | 23.2017 | 23.2311 | 18.827 |
| 11250 | 2.443 | 2.479385 | 29.4248 | 8.5089 | 23.3547 | 23.3805 | 18.834 |
| 12500 | 2.4426 | 2.4789 | 29.3509 | 8.3785 | 23.2531 | 23.2799 | 18.842 |


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to t5-small-finetuned-xsum-small/checkpoint-1250
Configuration saved in t5-small-finetuned-xsum-small/checkpoint-1250/config.json
Model weights saved in t5-small-finetuned-xsum-small/checkpoint-1250/pytorch_model.bin
tokenizer config file saved in t5-small-finetuned-xsum-small/checkpoint-1250/tokenizer_config.json
Special tokens file saved in t5-small-finetuned-xsum-small/checkpoint-1250/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: id, document, summary.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to t5-small-finetuned-xsum-small/checkpoint-2500
Configuratio

TrainOutput(global_step=12500, training_loss=2.57158525390625, metrics={'train_runtime': 4861.3738, 'train_samples_per_second': 41.141, 'train_steps_per_second': 2.571, 'total_flos': 2.7067328313163776e+16, 'train_loss': 2.57158525390625, 'epoch': 20.0})
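The `train_samples_per_second` values in the two `TrainOutput` records follow directly from the example counts, epochs, and runtimes. A small sketch (with `samples_per_second` as a hypothetical helper) reproduces both:

```python
def samples_per_second(num_examples, epochs, runtime_seconds):
    """Training throughput: examples processed per second of wall-clock time."""
    return num_examples * epochs / runtime_seconds


full_run = samples_per_second(204045, 1, 7885.3621)
small_run = samples_per_second(10000, 20, 4861.3738)
print(round(full_run, 3), round(small_run, 3))  # 25.876 41.141
```

Both runs take a similar number of optimization steps, so much of the throughput gap likely comes from evaluation time: each of the ten evaluations covers 11,332 validation examples in the full run but only 1,000 in the small one.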

In [36]:
wandb.finish()

0,1
eval/gen_len,▁▁▂▄▅██▅▆▇
eval/loss,█▄▃▂▂▁▁▁▁▁
eval/rouge1,▁▄▄▅▆▇█▇██
eval/rouge2,▁▄▄▅▅▆█▇█▇
eval/rougeL,▁▄▄▄▆▇█▇██
eval/rougeLsum,▁▄▄▄▆▇█▇██
eval/runtime,▃▂▃▁█▄▄▂▂▃
eval/samples_per_second,▆▇▆█▁▅▅▇▇▆
eval/steps_per_second,▆▇▇█▁▅▅▇█▆
train/epoch,▁▁▂▂▂▃▃▃▃▄▄▅▅▅▆▆▆▆▇▇███

0,1
eval/gen_len,18.842
eval/loss,2.4789
eval/rouge1,29.3509
eval/rouge2,8.3785
eval/rougeL,23.2531
eval/rougeLsum,23.2799
eval/runtime,28.4256
eval/samples_per_second,35.18
eval/steps_per_second,2.216
train/epoch,20.0


In [64]:
!zip -r /content/t5-small-finetuned-xsum-small.zip /content/t5-small-finetuned-xsum-small/checkpoint-12500/

  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/ (stored 0%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/training_args.bin (deflated 49%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/scheduler.pt (deflated 49%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/scaler.pt (deflated 55%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/special_tokens_map.json (deflated 83%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/optimizer.pt (deflated 7%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/rng_state.pth (deflated 27%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/pytorch_model.bin (deflated 8%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/config.json (deflated 62%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-12500/trainer_state.json (deflated 79%)
  adding: content/t5-small-finetuned-xsum-small/checkpoint-1250

In [65]:
!cp t5-small-finetuned-xsum-small.zip '/content/drive/My Drive/weights/'

## Results of T5-small on a small test batch

In [66]:
from transformers import T5ForConditionalGeneration

In [71]:
num_start = 20
num_select = 10

In [72]:
small_test = raw_datasets['test'].select(list(range(num_start, num_start+num_select)))
small_test

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10
})

In [68]:
# T5 already defines a pad token and is an encoder-decoder model, so the default
# right padding works and the tokenizer needs no changes for batched generation.
sentences = [prefix + doc for doc in small_test['document']]  # documents of different lengths, to test batching
# The prefix is added only once (above); truncation=True makes max_length take effect
inputs = tokenizer(sentences, max_length=max_input_length, truncation=True, padding=True, return_tensors="pt")


In [69]:
output_sequences = model.generate(
    input_ids=inputs['input_ids'].cuda(),
    attention_mask=inputs['attention_mask'].cuda(),
    do_sample=False, # disable sampling to test if batching affects output
)
prediction = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)

In [70]:
output_sequences_small = model_small.generate(
    input_ids=inputs['input_ids'].cuda(),
    attention_mask=inputs['attention_mask'].cuda(),
    do_sample=False, # disable sampling to test if batching affects output
)
prediction_small = tokenizer.batch_decode(output_sequences_small, skip_special_tokens=True)
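Since both models decode the same inputs, a quick count of how many outputs differ helps confirm that the two prediction lists really come from different models. The sketch below uses placeholder strings in place of decoded summaries; `differing_indices` is a hypothetical helper.

```python
def differing_indices(first, second):
    """Indices at which two equally long lists of predictions differ."""
    if len(first) != len(second):
        raise ValueError("prediction lists must have the same length")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


# Placeholder strings stand in for decoded summaries
preds_full = ["summary a", "summary b", "summary c"]
preds_small = ["summary a", "summary x", "summary c"]
print(differing_indices(preds_full, preds_small))  # [1]
```

If the result is empty for real outputs, it is worth checking that each list was decoded from its own `generate` call.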

In [73]:
for i in range(num_select):
    print("Original Text: %s" % small_test[i]['document'])
    print("\nActual Summary: %s" % small_test[i]['summary'])
    print("\nBatch Predicted: %s" % prediction[i])
    print("\nSmall_Set Summary: %s" % prediction_small[i])
    print("=====================================================================\n")

Original Text: Pakistan's telecoms regulator said the ban was no longer necessary because Google, which owns YouTube, had now launched a Pakistan-specific version.
YouTube has denied claims that the authorities can filter content.
Many young Pakistanis have welcomed the lifting of the ban but some activists want details of the deal with Google.
They say there should be greater transparency of the terms agreed between Google and the government.
A Pakistan Telecommunication Authority (PTA) official confirmed to the BBC that all internet service providers had been directed to open access to YouTube.
The Pakistan Telecommunication Company Ltd posted on its Facebook page on Monday: "Welcome Back YouTube".
Pakistan's ministry of information technology said: "Google has provided an online web process through which requests for blocking access of offending material can be made by the PTA to Google directly.
"Google/YouTube will accordingly restrict access to the said offending material for use