## **Author**: *Sena Nur Bilgin*
## **Specialization**:  *DSA*
## **Subject**: *AL Modelling Google-T5 (Full-Fine Tuning & PEFT with BBC)*

### Full-Fine Tuning Example Notebook Google T5 Flan:  
This notebook demonstrates the process of Full-Fine tuning & PEFT for Google T5 model.

### Zeroshot Learning Inferences in Inference Notebook:
Refer to the Inference Notebook for detailed explanations and examples of zero-shot inferences.

### Dependencies & Importing Libraries:

In [None]:
!pip install -U transformers
!pip install -U datasets
!pip install tensorboard
!pip install sentencepiece
!pip install accelerate
!pip install evaluate
!pip install rouge_score
!pip install accelerate -U
!pip install transformers[torch]
!pip install peft

In [3]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
import pprint
import torch
import evaluate
import numpy as np
from datasets import load_dataset
import time

pp = pprint.PrettyPrinter()


### Constants:

In [None]:
MODEL = 'google/flan-t5-base'
BATCH_SIZE = 4
NUM_PROCS = 4
EPOCHS = 5
OUT_DIR = './results_t5base_google'
MAX_LENGTH = 512 
PEFT_MODEL_PATH="./peft-bbc-summary-checkpoint-local"


### Model Upload:

*In this study, we employed state-of-the-art sequence-to-sequence models for natural language processing tasks. Specifically, we utilized the Transformers library to load and configure our models. The initialization process involved loading a pretrained sequence-to-sequence model and its tokenizer using AutoModelForSeq2SeqLM.from_pretrained() and AutoTokenizer.from_pretrained() functions, respectively. Throughout our experimentation, we ensured compatibility with our hardware capabilities by setting torch_dtype=torch.bfloat16 for both the model and tokenizer. This choice optimizes memory usage during training and inference This setup allowed us to effectively deploy the models for tasks such as text generation and summarization, demonstrating their robust performance across various datasets and evaluation metrics.*

In [None]:

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### Extractive Text Summarization Dataset : BBC

This dataset contains 417 political news articles from BBC, spanning from 2004 to 2005. The dataset is organized into two main folders: ` Articles` and `Summaries`.

### Folder Structure

- `Articles/`: Contains the original news articles.
- `Summaries/`: Contains summaries for each news article.

In [None]:
dataset = load_dataset('gopalkalpande/bbc-news-summary', split='train')
full_dataset = dataset.train_test_split(test_size=0.2, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']

print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 1779
})
Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 445
})


### Preprocessing & Tokenization Function:

*The code defines a preprocess_function designed for sequence-to-sequence model training, specifically tailored for text summarization tasks. It prepares input sequences by prefixing each article with "summarize: " and tokenizes them using a specified tokenizer (tokenizer). Similarly, it processes target summaries and assigns them as labels to the model inputs. The function ensures uniform sequence lengths using padding and truncation techniques to facilitate efficient batch processing for both training and validation datasets.*

In [None]:
def preprocess_function(examples):
    """
    Preprocesses examples for sequence-to-sequence model training.
    """
    inputs = [f"summarize: {article}" for article in examples['Articles']]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    targets = [summary for summary in examples['Summaries']]
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=MAX_LENGTH,
            truncation=True,
            padding='max_length'
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = dataset_train.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)

tokenized_valid = dataset_valid.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)


  self.pid = os.fork()


Map (num_proc=4):   0%|          | 0/1779 [00:00<?, ? examples/s]

  self.pid = os.fork()


Map (num_proc=4):   0%|          | 0/445 [00:00<?, ? examples/s]



### GPU usage:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

### Checking number of trainable parameters:

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


### Computing Metrics:

*The compute_metrics function evaluates text summarization performance using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) algorithm. It processes predicted and actual token IDs to decode them into human-readable summaries. ROUGE scores (rouge1, rouge2, rougeL) are computed by comparing these decoded summaries, optionally using stemming for linguistic normalization. Additionally, the function calculates the average length of generated summaries based on non-padding tokens in predictions. The computed metrics are returned as a dictionary, rounded to four decimal places, providing insights into the quality and length of generated summaries.*

In [None]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    """
    Computes evaluation metrics for summarization using ROUGE.
    """
    predictions, labels = eval_pred.predictions[0], eval_pred.label_ids

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
        rouge_types=[
            'rouge1',
            'rouge2',
            'rougeL'
        ]
    )

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}


In [None]:
def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels

### Training Parameters for Full-Fine Tuning:

In [None]:
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=OUT_DIR,
    logging_steps=5,
    evaluation_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    save_total_limit=2,
    report_to='tensorboard',
    learning_rate=0.0001,
    dataloader_num_workers=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics
)

history = trainer.train()
trainer.save_model(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)

  self.pid = os.fork()


Step,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Gen Len
200,16.7375,15.023314,0.8776,0.7809,0.8528,512.0
400,1.8703,1.356794,0.8967,0.8196,0.8757,234.5483
600,0.757,0.385215,0.8995,0.8261,0.8801,233.6427
800,0.6473,0.332949,0.9035,0.8326,0.8856,233.5708
1000,0.2427,0.32286,0.9054,0.8348,0.8879,233.5663
1200,0.3185,0.320745,0.9061,0.8357,0.8885,233.5663
1400,0.3533,0.319323,0.9068,0.8365,0.8891,233.5663
1600,0.4133,0.318254,0.9069,0.8369,0.8894,233.5663
1800,0.2848,0.317875,0.9071,0.8371,0.8897,233.5663
2000,0.4379,0.31777,0.907,0.8369,0.8894,233.5663


  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()
  self.pid = os.fork()


TypeError: Object of type dtype is not JSON serializable

## Introducing PEFT: Tuning Google Flan with Basic & Complex Lora Settings

*Here, we've conducted modeling for both Basic and Complex PEFT. However, only the training results for Complex PEFT are displayed. We forgot to include evaluation arguments during training. Due to the complexity involved in recomputation, we've opted to maintain the original tuning steps. For model performance details, please refer to the Inference Notebook.*


### Details:

*In configuring both Basic and Complex PEFT models, we focused on tuning key hyperparameters to optimize performance for sequence-to-sequence language modeling tasks. For the Basic PEFT model, we conducted tuning primarily on two hyperparameters: r and lora_alpha, setting them to 32. These parameters are responsible for the model's capacity and learning dynamics. There are crucial for balancing complexity and performance. In contrast, the Complex PEFT model underwent more extensive tuning, with r and lora_alpha adjusted to 64. This higher configuration aimed to enhance the model's ability to handle more intricate patterns and details in data. We assumed that those settings might improve its overall effectiveness in capturing complex linguistic structures. Both configurations targeted modules "q" and "v", employed a dropout rate of 0.05 for regularization, applied no bias (bias="none"), and were tailored for sequence-to-sequence language modeling (task_type=TaskType.SEQ_2_SEQ_LM).*



## Basic Settings:

In [None]:

lora_config = LoraConfig(
    r=32, # The hyperparameter we have tuned for Basic PEFT model.
    lora_alpha=32, # The hyperparameter we have tuned for Basic PEFT model.
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)


## Complex Settings:

In [None]:
lora_config = LoraConfig(
    r=64, # The hyperparameter we have tuned for Complex PEFT model.
    lora_alpha=64, # The hyperparameter we have tuned for Complex PEFT model.
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)


### Training Parameters for PEFT Tuning - Complex:

In [None]:
peft_model = get_peft_model(model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

peft_training_args = TrainingArguments(
    output_dir=PEFT_MODEL_PATH,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=1e-3, 
    num_train_epochs=1,
    logging_steps=1,
    max_steps=300,
    dataloader_num_workers=4
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_train["train"],
)

peft_trainer.train()
peft_trainer.model.save_pretrained(PEFT_MODEL_PATH)
tokenizer.save_pretrained(PEFT_MODEL_PATH)

max_steps is given, it will override any value given in num_train_epochs


trainable model parameters: 7077888
all model parameters: 254655744
percentage of trainable model parameters: 2.78%


Step,Training Loss
1,49.25
2,44.0
3,39.0
4,32.75
5,28.625
6,25.375
7,22.875
8,18.5
9,14.875
10,12.0




('./33enhanced-peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './33enhanced-peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './33enhanced-peft-dialogue-summary-checkpoint-local/spiece.model',
 './33enhanced-peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './33enhanced-peft-dialogue-summary-checkpoint-local/tokenizer.json')