## **Author**: *Sena Nur Bilgin*
## **Specialization**:  *DSA*
## **Subject**: *AL Modelling T5 (Full-Fine Tuning with BBC & Dialog Summary)*

### Full-Fine Tuning Example Notebook for T5.1 and T5.2 Models:  

This notebook demonstrates full fine-tuning for the T5.1 and T5.2 models. The second stage, Dialog Summary fine-tuning, was run with this same notebook. The only difference in that stage is that it loads an already fine-tuned model and its tokenizer instead of the base checkpoint, so training continues from the previous stage.

### Zero-shot Inference in the Inference Notebook:

Refer to the Inference Notebook for detailed explanations and examples of zero-shot inference. It examines how the T5.1 and T5.2 models perform on tasks and datasets they were not explicitly trained on.

### Dependencies & Importing Libraries:

In [None]:
!pip install -U "transformers[torch]" datasets accelerate
!pip install tensorboard sentencepiece evaluate rouge_score

In [None]:
import os
import pprint

import numpy as np
import pandas as pd
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer, GenerationConfig
from datasets import load_dataset
import evaluate

pp = pprint.PrettyPrinter()

input_directory = '/kaggle/input'
for dirname, _, filenames in os.walk(input_directory):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Constants:

In [None]:
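MODEL = 't5-base'
BATCH_SIZE = 4
NUM_PROCS = 4
EPOCHS = 10  # not used below; TrainingArguments sets num_train_epochs=5 directly
OUT_DIR = 'results_t5base'
MAX_LENGTH = 512  # maximum number of tokens for both inputs and targets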

### Model Loading:

In [None]:
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Extractive Text Summarization Dataset : BBC

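This dataset contains BBC news articles from 2004 to 2005, each paired with a summary. The 417 political news articles are one category of it; the copy loaded below holds 2,224 article and summary pairs in total (1,779 for training and 445 for validation after the split). The dataset is organized into two main folders: `Articles` and `Summaries`.

### Folder Structure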

- `Articles/`: Contains the original news articles.
- `Summaries/`: Contains summaries for each news article.


### Dialogue Summarization Dataset : DialogSum
The DialogSum dataset is a large-scale dialogue summarization dataset comprising 13,460 dialogues, with an additional 1,000 dialogues reserved for testing purposes. The dataset is split into training, testing, and validation sets. As loaded below, the training split has 12,460 rows and the test split has 1,500 rows; this notebook uses the test split for validation.


### Fields

- `dialogue`: Textual representation of the dialogue.
- `summary`: Human-written summary of the dialogue, aimed at capturing the essential information.
- `topic`: Human-written topic or one-liner that encapsulates the main theme or subject of the dialogue.



In [None]:
dataset = load_dataset('gopalkalpande/bbc-news-summary', split='train')
full_dataset = dataset.train_test_split(test_size=0.2, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']

print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 1779
})
Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 445
})
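The row counts in the output above follow from the 20% test share. The sketch below reproduces the arithmetic; it assumes the test portion is rounded up and the rest goes to training, which matches the printed counts. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test share.

    Assumption: the test share is rounded up and the remainder is used for training.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


# 1,779 + 445 rows from the output above give 2,224 rows in total.
n_train, n_test = split_sizes(1779 + 445, 0.2)
print(n_train, n_test)
```

Because `shuffle=True` is used without a fixed seed, the rows that land in each part change between runs; `train_test_split` also accepts a `seed` argument when a repeatable split is needed.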


In [2]:
dataset_dialog = load_dataset("knkarthick/dialogsum") 
dataset_train_dialog = dataset_dialog['train']
dataset_valid_dialog = dataset_dialog['test']

print(dataset_train_dialog)
print(dataset_valid_dialog)

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 12460
})
Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 1500
})


### Preprocessing & Tokenization Function:

In [None]:
def preprocess_function(examples):
    """
    Converts a batch of articles and summaries into model inputs and labels.
    """
    # "summarize: " is the task prefix prepended to every article.
    inputs = [f"summarize: {article}" for article in examples['Articles']]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # text_target tokenizes the summaries as labels; it replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager.
    labels = tokenizer(
        text_target=examples['Summaries'],
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # Padded label positions keep the pad token id, so they count towards the loss.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = dataset_train.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)
tokenized_valid = dataset_valid.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)

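The preprocessing function above reads the BBC columns `Articles` and `Summaries`, whereas DialogSum uses `dialogue` and `summary`. One way to reuse the same logic for the second stage is to pass the column names in. The sketch below only builds the text pairs and leaves tokenization out; `make_text_builder` and the toy batch are hypothetical and are not taken from the notebook.

```python
from typing import Callable, Dict, List


def make_text_builder(source_col: str, target_col: str,
                      prefix: str = "summarize: ") -> Callable[[Dict[str, List[str]]], Dict[str, List[str]]]:
    """Return a function that turns a batch into prefixed inputs and raw targets."""

    def build(examples: Dict[str, List[str]]) -> Dict[str, List[str]]:
        return {
            "inputs": [prefix + text for text in examples[source_col]],
            "targets": list(examples[target_col]),
        }

    return build


# One builder per dataset, differing only in the column names.
bbc_builder = make_text_builder("Articles", "Summaries")
dialogsum_builder = make_text_builder("dialogue", "summary")

# Made-up batch, for illustration only.
toy_batch = {
    "dialogue": ["A: Are we still meeting at noon? B: Yes, see you then."],
    "summary": ["A and B confirm their noon meeting."],
}
print(dialogsum_builder(toy_batch))
```

The returned `inputs` and `targets` lists could then be passed to the tokenizer in the same way as in `preprocess_function`.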



### GPU usage:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

### Checking number of trainable parameters:

In [None]:
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

222,903,552 total parameters.
222,903,552 training parameters.
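Since every parameter is trainable, the parameter count gives a rough lower bound on the memory that full fine-tuning needs. The sketch below assumes 32-bit floats and an Adam-style optimizer that keeps two extra values per parameter; activations and framework overhead are ignored, so real usage is higher. `training_state_bytes` is a hypothetical helper.

```python
N_PARAMS = 222_903_552  # total parameter count printed above
BYTES_FP32 = 4          # assumption: weights are stored as 32-bit floats


def training_state_bytes(n_params: int, bytes_per_value: int = BYTES_FP32,
                         optimizer_states: int = 2) -> int:
    """Bytes for weights, gradients, and optimizer states (activations excluded)."""
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    return n_params * bytes_per_value * copies


weights_gb = N_PARAMS * BYTES_FP32 / 1e9
total_gb = training_state_bytes(N_PARAMS) / 1e9
print(f"weights only: {weights_gb:.2f} GB")
print(f"weights + gradients + optimizer states: {total_gb:.2f} GB")
```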


### Computing Metrics:

In [None]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    """
    Computes evaluation metrics for summarization using ROUGE.
    """
    predictions, labels = eval_pred.predictions[0], eval_pred.label_ids

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
        rouge_types=[
            'rouge1',
            'rouge2',
            'rougeL'
        ]
    )

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
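To make the reported numbers easier to read, here is a simplified illustration of what ROUGE-1 measures: the unigram overlap between a prediction and its reference, combined into an F1 score. It is a sketch under stated assumptions (lowercasing and whitespace tokenization, no stemming), not the implementation behind `evaluate.load("rouge")`, so its values will not match that library exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

ROUGE-2 applies the same idea to pairs of adjacent tokens, and ROUGE-L uses the longest common subsequence instead of n-gram counts.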


In [None]:
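def preprocess_logits_for_metrics(logits, labels):
    """
    Reduces the logits to predicted token ids before the Trainer accumulates them.
    Keeping the full logits of every evaluation batch can exhaust memory,
    so only the argmax over the vocabulary is returned.
    """
    # logits is a tuple here; its first element holds the token logits.
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels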
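The effect of this reduction can be sketched with NumPy. The example first shows the shape change on a small random array, then estimates the size of the full logits for the validation set. The figures are assumptions for illustration: 445 validation rows, labels padded to 512 tokens, 32-bit floats, and a vocabulary dimension of 32,128 for `t5-base`.

```python
import numpy as np


def logits_to_ids(logits: np.ndarray) -> np.ndarray:
    """Keep only the most likely token id at each position."""
    return logits.argmax(axis=-1)


# Toy logits with shape (batch, sequence length, vocabulary size).
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(2, 5, 11)).astype(np.float32)
toy_ids = logits_to_ids(toy_logits)
print(toy_logits.shape, "->", toy_ids.shape)

# Rough size comparison for the whole validation set (assumed figures, see above).
N_EVAL, SEQ_LEN, VOCAB = 445, 512, 32_128
full_logits_gb = N_EVAL * SEQ_LEN * VOCAB * 4 / 1e9  # float32 logits
ids_mb = N_EVAL * SEQ_LEN * 8 / 1e6                  # int64 token ids
print(f"full logits: {full_logits_gb:.2f} GB, token ids: {ids_mb:.2f} MB")
```

Note that these ids come from a forward pass that is given the reference summary as decoder input, so metrics computed from them are not directly comparable to metrics on freely generated summaries.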

### Training Parameters for Full-Fine Tuning:

In [None]:
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=OUT_DIR,
    logging_steps=5,
    evaluation_strategy='steps',  # renamed to eval_strategy in recent transformers releases
    eval_steps=200,
    save_strategy='epoch',
    save_total_limit=2,
    report_to='tensorboard',
    learning_rate=0.0001,
    dataloader_num_workers=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics
)

history = trainer.train()
trainer.save_model(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)



| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | Gen Len |
|------|---------------|-----------------|--------|--------|--------|---------|
| 200 | 0.3829 | 0.451746 | 0.8918 | 0.8135 | 0.8696 | 241.7438 |
| 400 | 0.1828 | 0.397108 | 0.899 | 0.8239 | 0.8785 | 242.4989 |
| 600 | 0.3997 | 0.377828 | 0.9024 | 0.8286 | 0.8824 | 242.5011 |
| 800 | 0.376 | 0.371118 | 0.9035 | 0.831 | 0.884 | 242.5011 |
| 1000 | 0.2635 | 0.364494 | 0.9053 | 0.833 | 0.8857 | 242.5011 |
| 1200 | 0.1908 | 0.359677 | 0.9064 | 0.8352 | 0.8872 | 242.5011 |
| 1400 | 0.2131 | 0.356932 | 0.9072 | 0.8366 | 0.888 | 242.5011 |
| 1600 | 0.4016 | 0.354457 | 0.9082 | 0.8383 | 0.889 | 242.5011 |
| 1800 | 0.266 | 0.353737 | 0.9081 | 0.8382 | 0.8891 | 242.5011 |
| 2000 | 0.3448 | 0.352719 | 0.9083 | 0.8388 | 0.8894 | 242.5011 |
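The step numbers in this log can be related to the training arguments with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the Trainer's default linear schedule with warmup; `linear_warmup_lr` is a hypothetical re-implementation for illustration, not a library function.

```python
import math


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate that rises linearly during warmup, then decays linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


TRAIN_ROWS, BATCH_SIZE, NUM_EPOCHS = 1779, 4, 5
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
print(steps_per_epoch, total_steps)

# Learning rate at a few points of the run (base rate 1e-4, 500 warmup steps).
for step in (0, 250, 500, 1000, 2000, total_steps):
    print(step, linear_warmup_lr(step, 1e-4, 500, total_steps))
```

Under these assumptions the run has 2,225 optimizer steps, which agrees with the `checkpoint-2225` directory loaded later in the notebook, and the 500 warmup steps cover a little more than the first epoch.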





### Reloading the Model & Inference with BBC (Same) Dataset:

In [None]:
MODEL_PATH = "/content/drive/MyDrive/results_t5base/checkpoint-2225"

model_finetuned = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)
tokenizer_finetuned = T5Tokenizer.from_pretrained("/content/drive/MyDrive/results_t5base")
tokenizer_original = T5Tokenizer.from_pretrained('t5-base')
model_original = T5ForConditionalGeneration.from_pretrained('t5-base')


In [None]:
# Note: dataset[10:20] comes from the full dataset, so some of these
# articles may also have been part of the training split.
articles = dataset[10:20]['Articles']
human_baseline_summaries = dataset[10:20]['Summaries']

original_model_summaries = []
full_tuned_model_summaries = []

# Sampling is enabled, so the generated summaries vary from run to run.
generation_config = GenerationConfig(max_new_tokens=100, do_sample=True, num_beams=1)

for article in articles:
    # Note: this prompt differs from the "summarize: " prefix used during fine-tuning.
    prompt = f"""
Summarize the following conversation.

{article}

Summary: """

    # Both models share the t5-base vocabulary, so one tokenization serves both.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = model_original.generate(input_ids=input_ids, generation_config=generation_config)
    original_model_text_output = tokenizer_original.decode(original_model_outputs[0], skip_special_tokens=True)

    full_tuned_model_outputs = model_finetuned.generate(input_ids=input_ids, generation_config=generation_config)
    full_tuned_model_text_output = tokenizer_finetuned.decode(full_tuned_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    full_tuned_model_summaries.append(full_tuned_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, full_tuned_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'full_tuned_model_summaries'])
df

|   | human_baseline_summaries | original_model_summaries | full_tuned_model_summaries |
|---|--------------------------|--------------------------|----------------------------|
| 0 | Engineering firm Balfour Beatty and five railw... | Balfour Beatty and five railway managers face ... | Engineering firm Balfour Beatty and five railw... |
| 1 | But Mr Howard said he rejected the idea that t... | conservative leader: "i don't want anybody to ... | He said he found BNP's policies "abhorrent" bu... |
| 2 | As his party set out detailed asylum reform pl... | a new ruling party to put quotas into effect a... | Lib Dem chairman Matthew Taylor said there nee... |
| 3 | It was claimed he had been embarrassed by the ... | former cabinet minister luke douglas returns t... | And he again heaped praise on Mr Brown saying ... |
| 4 | "Nearly two thirds of people with TB are born ... | migrant health checks for citizens coming unde... | Lib Dem leader Michael Howard said the checks ... |
| 5 | Mr Kennedy criticised Mr Brown for failing to ... | chancellor unveils a one-off £200 council tax ... | "Mr Brown argues for a softer approach to tax ... |
| 6 | Ministers have insisted they are committed to ... | ministers insist they are committed to free ca... | Ms Sturgeon said that while she had no reason ... |
| 7 | The UK has welcomed the decision by India and ... | the'spirit of cooperation' between the two cou... | Mr Straw said he hoped the agreement would mak... |
| 8 | A deal bringing Turkey a step closer to EU mem... | talks between turkey, ankara are set to start ... | The deal to open formal talks with Ankara came... |
| 9 | Lord Goldsmith said the answer represented his... | attorney general's advice did not seem to cont... | Former foreign secretary Robin Cook said Lord ... |


In [None]:
df['full_tuned_model_summaries'][9]

'Former foreign secretary Robin Cook said Lord Goldsmith\'s admission that his parliamentary answer was not a summary of his legal opinion suggested Parliament may have been misled."As I have always made clear, I set out in the answer my own genuinely held, independent view that military action was lawful under the existing (UN) Security Council resolutions," he said.Lord Goldsmith said the answer represented his "genuinely held independent view" the war was legal.Cl'

In [None]:
df['human_baseline_summaries'][9]

'Lord Goldsmith said the answer represented his "genuinely held independent view" the war was legal.Former foreign secretary Robin Cook said Lord Goldsmith\'s admission that his parliamentary answer was not a summary of his legal opinion suggested Parliament may have been misled."The attorney general may never have presented his answer as a summary, but others certainly did," he said.Last week Lord Goldsmith said in a statement: "I was fully involved throughout the drafting process and personally finalised, and of course approved, the answer."In the House of Lords, the attorney general faced a call by former Tory lord chancellor Lord Mackay to now publish the "full text" of the advice - the suggestion was rejected.Tony Blair has dismissed questions about the attorney general\'s advice, and said his Parliamentary statement had been a "fair summary" of his opinion."The answer did not purport to be a summary of my confidential legal advice to government.""As I have always made clear, I se

In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# The second model is fully fine-tuned (no PEFT adapter), so it is named accordingly.
full_tuned_model_results = rouge.compute(
    predictions=full_tuned_model_summaries,
    references=human_baseline_summaries[0:len(full_tuned_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('FULL FINE-TUNED MODEL:')
print(full_tuned_model_results)

ORIGINAL MODEL:
{'rouge1': 0.22296541032849315, 'rouge2': 0.07660635569370747, 'rougeL': 0.1366560860367862, 'rougeLsum': 0.13726203830336203}
FULL FINE-TUNED MODEL:
{'rouge1': 0.43872278341426113, 'rouge2': 0.3325722423458035, 'rougeL': 0.3353004249423641, 'rougeLsum': 0.33623703559489393}
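The two result dictionaries are easier to compare as differences. The sketch below copies the scores from the output above and reports the absolute gain and the ratio for each metric; `rouge_gains` is a hypothetical helper. The comparison rests on ten articles and sampled generations, so it should be read as indicative rather than as a stable estimate.

```python
# Scores copied from the output above.
original_scores = {
    'rouge1': 0.22296541032849315,
    'rouge2': 0.07660635569370747,
    'rougeL': 0.1366560860367862,
    'rougeLsum': 0.13726203830336203,
}
fine_tuned_scores = {
    'rouge1': 0.43872278341426113,
    'rouge2': 0.3325722423458035,
    'rougeL': 0.3353004249423641,
    'rougeLsum': 0.33623703559489393,
}


def rouge_gains(baseline: dict, tuned: dict) -> dict:
    """Return (absolute gain, ratio) per metric for two ROUGE result dictionaries."""
    return {
        name: (tuned[name] - baseline[name], tuned[name] / baseline[name])
        for name in baseline
    }


gains = rouge_gains(original_scores, fine_tuned_scores)
for name, (delta, ratio) in gains.items():
    print(f"{name}: +{delta:.4f} absolute, {ratio:.2f}x the original score")
```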
