## Section 2 of LLM Project - Summarisation, following the earlier preprocessing

**In the previous notebook** (llm_project_1_preprocess) we took a [Hugging Face Dataset](https://huggingface.co/datasets/alexfabbri/multi_news) containing a large number of long-form news articles, each paired with a human-written summary. There we:
- Ran EDA on the range of character lengths and removed outliers
- Preprocessed the training, validation and test data by removing punctuation, lower-casing and tokenizing
- Saved the results as CSVs to preserve the steps taken

**This notebook** will explore LLMs to summarize the original Multi-News documents, and use ROUGE scores to measure how closely the generated summaries match the human-written reference summaries. This is a sequence-to-sequence task.

**Start with the Google T5 model.**

As a possible phase 2 activity I'd like to see the difference in performance by feeding the tokenized text that I preprocessed myself in the previous notebook.

I was running into major RAM / Memory issues with the free-to-use Google Colab and experimented with reducing the dataset size - the original set linked above had 45,000 records in training. I will point out any 'trims' to dataset size below.

Begin with Imports

In [21]:
!pip install transformers datasets evaluate rouge_score



In [None]:
from datasets import load_dataset

multi_news = load_dataset("multi_news")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for multi_news contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/multi_news.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


In [None]:
multi_news

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})

Same dataset as before. It is loaded as a Hugging Face dataset; previously we converted it to a pandas DataFrame to preprocess and tokenize it ourselves.

This time we'll use the AutoTokenizer from transformers to tokenize the text for the model.

In [22]:
# log in to upload model

from huggingface_hub import notebook_login

notebook_login()

## RAM / Memory Issues

#### The multi-news dataset is too big (45k records in training) and causing too many issues running in Google Colab.

#### I am going to use train_test_split on the three sections to retain only **25% of the original dataset for each**

In [None]:
from datasets import DatasetDict

# reduce train set using train_test_split, keeping only the train section
train_reduced = multi_news['train'].train_test_split(test_size=0.75, seed=42)['train']

# reduce test set using train_test_split, keeping only the train section
test_reduced = multi_news['test'].train_test_split(test_size=0.75, seed=42)['train']

# reduce validation set in the same way
val_reduced = multi_news['validation'].train_test_split(test_size=0.75, seed=42)['train']

# create a new DatasetDict with the reduced sets - news_small
news_small = DatasetDict({
    'train': train_reduced,
    'test': test_reduced,
    'validation': val_reduced
})

In [None]:
news_small

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 11243
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 1405
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 1405
    })
})
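
The row counts above can be reproduced by hand. This is a minimal sketch, assuming the held-out portion is rounded up and the remainder is kept, which is consistent with the numbers shown; `kept_rows` is a hypothetical helper, not part of the datasets library.

```python
import math

def kept_rows(n_rows: int, test_size: float) -> int:
    """Rows left in the 'train' part after holding out `test_size` of the data.

    Assumption: the held-out part is rounded up and the remainder is kept.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test

# Original split sizes taken from the DatasetDict printed earlier
for split, n in [("train", 44972), ("validation", 5622), ("test", 5622)]:
    print(split, kept_rows(n, 0.75))
```

This gives 11243 rows for train and 1405 for each of the other two splits, matching `news_small`.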

In [None]:
news_small['train'][0]

{'document': 'New Zealand is wild. \n \n Just ask this Kiwi kayaker, Kyle Mulinder, who happened to be in “the wrong place at the wrong time” to get slapped directly in the face by an octopus wielded by a seal. \n \n Yes, you read that right: a seal had an octopus in its mouth in the midst of a battle royale of marine creatures. The seal emerged from the water flailing around with the octopus, which ended up hitting the unsuspecting kayaker in the face while he was out enjoying the great outdoors. \n \n “We were just sitting out in the middle of the ocean and then this huge male seal appeared with an octopus and he was thrashing him about for ages,” Mulinder told Yahoo News about the freak encounter with nature. Then came the big moment. “I was like ‘mate, what just happened?’” As it turns out, Mulinder was out kayaking with a team of content creators for GoPro testing camera equipment — so they were able to capture all the footage of the incident on video. \n \n Mulinder appears to be

Above: Confirm same structure, 'document' is the feature and 'summary' is the label or target variable.

## Preprocessing with T5 Tokenizer

We'll load the T5 tokenizer with AutoTokenizer. The only preprocessing needed from here is to prefix each input so the task is clear to the model: T5 handles several NLP tasks and uses a text prefix to tell them apart, in this case summarisation.

In [None]:
from transformers import AutoTokenizer

# load the tokenizer
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


In [None]:
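# T5 was pretrained with the lowercase task prefix "summarize: ";
# the capitalised form below is what this run used
prefix = "Summarize: "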

In [None]:
def preprocess_function(examples):

    # define input with prefix
    inputs = [prefix + doc for doc in examples["document"]]

    # tokenize the inputs, truncating any document longer than 1024 tokens
    model_inputs = tokenizer(
        inputs,
        max_length=1024,
        truncation=True
    )

    # text_target tokenizes the summaries as labels; anything past 128 tokens is cut off
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=128,
        truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

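The Hugging Face `map` method applies the preprocessing function to the entire dataset. With `batched=True` it passes examples in batches, which is why `preprocess_function` works on lists of documents and summaries.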

In [None]:
# Apply preprocessing and remove the raw text columns
tokenized_news_small = news_small.map(preprocess_function,
                                      batched=True,
                                      remove_columns=["document", "summary"])

In [None]:
from transformers import DataCollatorForSeq2Seq

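# note: `model` is meant to receive the model object; passing the checkpoint string
# has no effect here, and T5 builds its decoder inputs from the labels itself
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)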

Re above:

- Tokenized documents and summaries have different lengths, but every sequence in a batch must be the same length so the batch can be processed in parallel.
- `DataCollatorForSeq2Seq` handles this with dynamic padding: each batch is padded only up to the longest sequence in that batch, rather than padding the whole dataset to the maximum length. This saves memory and compute.
- The collator pads labels with -100 by default, a value the loss function ignores, so padding does not contribute to the training loss.

We will use PyTorch here, not TensorFlow.
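
To make dynamic padding concrete, here is a minimal pure-Python sketch of what a collator does to one batch. It is an illustration only, not the `DataCollatorForSeq2Seq` implementation; `pad_batch` is a hypothetical helper and the token ids in `toy_batch` are made up.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # label padding value that the loss ignores

def pad_batch(batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in this batch only."""
    max_in = max(len(ex["input_ids"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    out: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_in = len(ex["input_ids"])
        n_lab = len(ex["labels"])
        out["input_ids"].append(ex["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        # 1 marks a real token, 0 marks padding
        out["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        out["labels"].append(ex["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return out

# Two toy examples of different lengths (illustrative ids)
toy_batch = [
    {"input_ids": [12198, 1635, 1737, 10, 368], "labels": [71, 7042, 1]},
    {"input_ids": [12198, 1635], "labels": [71, 1]},
]
padded = pad_batch(toy_batch)
print(padded)
```

The shorter example is padded to length 5 for inputs and 3 for labels, the longest in this batch, rather than to 1024 and 128.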

## Evaluation

Using ROUGE

In [None]:
import evaluate

rouge = evaluate.load("rouge")

This function decodes the predictions and labels and passes them to `rouge.compute` to calculate the ROUGE metrics (it is called later, at each evaluation during training).

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) # labels replacement, so padding tokens are properly ignored when evaluating the Rouge score
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) # will remove special tokens like pad

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
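
To show what the ROUGE-1 and ROUGE-2 numbers measure, here is a simplified sketch of n-gram overlap F1. It assumes plain whitespace tokenization and no stemming, so it will not reproduce the exact values from `rouge.compute`; `ngrams` and `rouge_n_f1` are hypothetical helpers and the two sentences are made up.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(prediction: str, reference: str, n: int = 1) -> float:
    """F1 of clipped n-gram overlap between a prediction and a reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counted at most as often as in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

prediction = "a seal hit a kayaker"
reference = "a seal slapped a kayaker with an octopus"
print(round(rouge_n_f1(prediction, reference, n=1), 4))
print(round(rouge_n_f1(prediction, reference, n=2), 4))
```

A short prediction can have high precision but low recall against a longer reference, which pulls the F1 down.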

## Training

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Further steps needed:
- Define the training hyperparameters in Seq2SeqTrainingArguments
  - the only required argument is output_dir, which sets where the model is saved (and, with push_to_hub=True, the name of the repository on HF)
  - Reduce the batch size to 4 (down from 16; the library default is 8) to avoid memory issues, on top of the dataset already being reduced in size

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="small-multi-news-model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4, # reduced from 16 to 4 to avoid memory issues
    per_device_eval_batch_size=4, # reduced from 16 to 4 to avoid memory issues
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True, #change to bf16=True for XPU
    push_to_hub=True, # this model will automatically be on HF hub when done
    remove_unused_columns=False
)

- Pass the training arguments to Seq2SeqTrainer along with the model, datasets, tokenizer, data collator and the `compute_metrics` function above


In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_news_small["train"], # used tokenized dataset not original news_small
    eval_dataset=tokenized_news_small["test"],  # used tokenized dataset not original news_small
    tokenizer=tokenizer,
    data_collator=data_collator, # data collator is handling padding dynamically (see above)
    compute_metrics=compute_metrics,
)


In [None]:
print(tokenized_news_small['train'][0])
# 'document' and 'summary' are gone now; only input_ids, attention_mask and labels remain. Good

{'input_ids': [12198, 1635, 1737, 10, 368, 5725, 19, 3645, 5, 1142, 987, 48, 31092, 16697, 49, 6, 19454, 10094, 77, 588, 6, 113, 2817, 12, 36, 16, 105, 532, 1786, 286, 44, 8, 1786, 97, 153, 12, 129, 3, 7, 8478, 3138, 1461, 16, 8, 522, 57, 46, 3, 32, 75, 2916, 302, 587, 40, 221, 26, 57, 3, 9, 7042, 5, 2163, 6, 25, 608, 24, 269, 10, 3, 9, 7042, 141, 46, 3, 32, 75, 2916, 302, 16, 165, 4247, 16, 8, 3, 12342, 13, 3, 9, 3392, 11268, 15, 13, 8769, 14231, 5, 37, 7042, 13999, 45, 8, 387, 5731, 173, 53, 300, 28, 8, 3, 32, 75, 2916, 302, 6, 84, 3492, 95, 10849, 8, 1149, 76, 5628, 53, 16697, 49, 16, 8, 522, 298, 3, 88, 47, 91, 5889, 8, 248, 10962, 5, 105, 1326, 130, 131, 3823, 91, 16, 8, 2214, 13, 8, 5431, 11, 258, 48, 1450, 5069, 7042, 4283, 28, 46, 3, 32, 75, 2916, 302, 11, 3, 88, 47, 3, 189, 12380, 53, 376, 81, 21, 3, 2568, 642, 10094, 77, 588, 1219, 15670, 3529, 81, 8, 19866, 6326, 28, 1405, 5, 37, 29, 764, 8, 600, 798, 5, 105, 196, 47, 114, 458, 5058, 6, 125, 131, 2817, 58, 22, 153, 282, 34, 

- train() fine tunes the model

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.995 | 2.73036 | 0.1468 | 0.0464 | 0.1121 | 0.1121 | 19.0 |
| 2 | 2.929 | 2.698031 | 0.148 | 0.0475 | 0.1132 | 0.1132 | 19.0 |
| 3 | 2.9059 | 2.682364 | 0.1498 | 0.0483 | 0.1142 | 0.1142 | 19.0 |
| 4 | 2.9034 | 2.678337 | 0.1502 | 0.0493 | 0.1146 | 0.1146 | 19.0 |




TrainOutput(global_step=11244, training_loss=2.956871952801911, metrics={'train_runtime': 1576.5135, 'train_samples_per_second': 28.526, 'train_steps_per_second': 7.132, 'total_flos': 1.2169978140033024e+16, 'train_loss': 2.956871952801911, 'epoch': 4.0})
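
As a sanity check, the figures in the TrainOutput above can be rebuilt from the dataset size and batch size. This is a rough sketch that assumes a single device, no gradient accumulation and a final partial batch that is kept in each epoch.

```python
import math

train_rows = 11243         # rows in news_small['train']
batch_size = 4             # per_device_train_batch_size
epochs = 4                 # num_train_epochs
train_runtime = 1576.5135  # seconds, from the TrainOutput

# Assumption: one optimizer step per batch, last partial batch included
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This reproduces global_step=11244 and the reported throughput of about 28.526 samples and 7.132 steps per second.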

In [23]:
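# Reload the model after the runtime finished.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# a bare name like this resolves to a local directory if one exists (here the training output_dir);
# the pushed copy is on the Hugging Face Hub as "tjjdoherty/small-multi-news-model"
model = AutoModelForSeq2SeqLM.from_pretrained("small-multi-news-model")
tokenizer = AutoTokenizer.from_pretrained("small-multi-news-model")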

## Interpreting the results table above:

- **Epoch:** 4 - the model sees the full training data 4 separate times, adjusting its weights on each pass

- **Training loss:** error on the training dataset. Shows decrease over the four epochs so the model is better predicting over time.

- **Validation loss:** Loss on held-out data not seen in training. Note that the trainer was given the `test` split as `eval_dataset`, so this column is computed on that split rather than on `validation`. It also decreases modestly over time and stays close to the training loss, which is a good sign that the model isn't overfitting.

- **ROUGE scores:** Higher values indicate better performance, and the scores improve slightly with each epoch, so the model is getting better at summarisation. Values of around 0.15 suggest there is some overlap with the reference summaries, but the generated summaries are probably still missing key details or failing to produce informative sentences.
  - **ROUGE-1 and ROUGE-2 are usually low in summarisation tasks if the dataset is full of long documents or the summary styles vary among the training.** I noticed in my EDA in the previous notebook that some of the news reports are written as online articles (very matter of fact) and others appear to be news anchor-led reports from television shows (full of addresses to the viewer/reader) and the range of document lengths were vast. There were some extremely long documents and some brief news summary reports.
  - Definitions of the scores below:
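    - Rouge1 is the overlap of individual words (a.k.a. unigrams)
    - Rouge2 is the overlap of pairs of consecutive words (bigrams)
    - Rougel (ROUGE-L) is based on the longest common subsequence between the predicted summary and the human/label summary. Rougelsum (ROUGE-Lsum) applies the same idea sentence by sentence, splitting the text on newlines. The two columns are identical in the table, which is what you would expect if the decoded text contains no newlines.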

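- **GEN LEN** is the generation length, which was 19 tokens in every epoch. This is very short, roughly one sentence. A value that is identical across all epochs suggests generation is stopping at a length cap rather than finishing naturally: the default generation `max_length` in transformers is 20 tokens, and `Seq2SeqTrainingArguments` accepts `generation_max_length` to raise it. I expect the ROUGE scores could be improved simply by allowing longer summaries, even by a small amount.
  - **See below**: The model-generated summaries are far too short compared to the references, which average around 315 tokens. This is consistent with the average **character length** of around 1200 found in the earlier EDA.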




In [26]:
import numpy as np

ref_lengths = [len(tokenizer.encode(summary)) for summary in news_small['train']['summary']]
print("Average training reference summary length:", np.mean(ref_lengths))


Average training reference summary length: 315.212843547096
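
Since the labels were tokenized with `max_length=128`, it is worth estimating how much of each reference the model ever sees. The sketch below is a hypothetical helper (`truncation_report`) run on made-up lengths; in the notebook it could be called on `ref_lengths` instead.

```python
from typing import Dict, List

def truncation_report(lengths: List[int], max_length: int) -> Dict[str, float]:
    """Share of sequences that get truncated, and share of all tokens kept."""
    n_truncated = sum(1 for n in lengths if n > max_length)
    kept_tokens = sum(min(n, max_length) for n in lengths)
    return {
        "share_truncated": n_truncated / len(lengths),
        "share_of_tokens_kept": kept_tokens / sum(lengths),
    }

# Made-up token lengths for illustration only
toy_lengths = [90, 250, 315, 400, 520]
report = truncation_report(toy_lengths, max_length=128)
print(report)
```

With an average reference length of about 315 tokens, a 128-token label limit keeps well under half of a typical summary.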


## Key insights:

- The model is learning from each new epoch, both training and validation loss gradually decreasing and ROUGE is slightly improving over time.
- It does not seem to be overfitting, since the validation loss keeps falling alongside the training loss
- ROUGE Scores are quite low and the big gap in summary length probably contributes significantly to this.
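
The effect of the length gap on ROUGE can be illustrated with a best-case calculation for a single pair: even if every word of a short prediction appears in the reference, recall is limited by the length ratio. This is a rough sketch. `rouge1_best_case` is a hypothetical helper, and the token counts from this notebook are used as a stand-in for word counts, so it indicates the scale of the problem rather than a strict bound on the reported scores.

```python
from typing import Dict

def rouge1_best_case(pred_len: int, ref_len: int) -> Dict[str, float]:
    """Best possible ROUGE-1 precision, recall and F1 for one prediction/reference pair."""
    overlap = min(pred_len, ref_len)  # every prediction unigram matches
    precision = overlap / pred_len
    recall = overlap / ref_len
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# 19 generated tokens against a 315-token reference
best = rouge1_best_case(pred_len=19, ref_len=315)
print({k: round(v, 4) for k, v in best.items()})
```

In this idealised case recall is about 0.06 and F1 about 0.11, so scores in the 0.1 to 0.2 range are roughly what such short outputs allow.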

In [None]:
# push to HF
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/tjjdoherty/small-multi-news-model/commit/1cb22f93afea5a1dd01fad2edcd8daf2fbffc876', commit_message='End of training', commit_description='', oid='1cb22f93afea5a1dd01fad2edcd8daf2fbffc876', pr_url=None, pr_revision=None, pr_num=None)