#### Sunday, May 12, 2024

Running '4. Calculating the Rouge scores after fine-tuning'

You cannot run this notebook in one pass: you need to restart the kernel before running '3. Fine-tuning with QLoRA' and again before running '4. Calculating the Rouge scores after fine-tuning'.

#### Saturday, May 11, 2024

mamba activate ftllm

[Google’s Gemma vs Microsoft’s Phi-2 vs Mistral on Summarisation](https://pub.towardsai.net/googles-gemma-vs-microsoft-s-phi-2-vs-mistral-on-summarisation-6877bc7b1a69)

https://colab.research.google.com/drive/11_UrXd7PMB1NAV51JEJ5R3Y__oLoZRnW?usp=sharing

In [1]:
# Make sure we always use this folder for all things huggingface!
import os

os.environ["HF_HOME"] = "/home/rob/Data2/huggingface"

In [2]:
# only target the 4090 ...
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [3]:
# And again, we are getting this error! ...
# Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. 
# Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.

os.environ["NCCL_P2P_DISABLE"]="1"
os.environ["NCCL_IB_DISABLE"]="1"
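Because the notebook needs kernel restarts, the three setup cells above have to be rerun each time. A minimal sketch that gathers them in one helper follows; `configure_environment` is a hypothetical name, and the values are simply the ones used above. As a side note, `nvidia-smi` lists GPUs in PCI bus order while CUDA by default orders them fastest first, which is probably why index 0 selects the 4090 here even though `nvidia-smi` shows the GTX 1050 as GPU 0.

```python
import os

def configure_environment(hf_home="/home/rob/Data2/huggingface", gpu_index="0"):
    """Set the environment variables from the cells above in one place.

    Hypothetical helper; call it before importing torch or transformers,
    because these variables are read when those libraries initialise.
    """
    settings = {
        "HF_HOME": hf_home,                 # Hugging Face cache location
        "CUDA_VISIBLE_DEVICES": gpu_index,  # expose only one GPU to CUDA
        "NCCL_P2P_DISABLE": "1",            # workaround for the RTX 4000 series error
        "NCCL_IB_DISABLE": "1",
    }
    os.environ.update(settings)
    return settings

configure_environment()
```

Calling it at the top of the notebook after each restart replaces the three separate cells.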


# Code explanation for Causal models

Welcome to this Colab code-sharing notebook. It is part of the Medium publication "[Google's Gemma vs Microsoft's Phi-2 vs Mistral on Summarisation](https://medium.com/@Farhang87/googles-gemma-vs-microsoft-s-phi-2-vs-mistral-on-summarisation-6877bc7b1a69)". Read the full article for further guidance.

Let's start by installing the libraries.

In [None]:
# Install necessary libraries with specific versions to ensure compatibility
# !pip install torch==2.1.2 tensorboard rouge_score
# !pip install --upgrade datasets==2.16.1 accelerate==0.26.1 evaluate==0.4.1 bitsandbytes==0.42.0
# !pip install --upgrade git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e
# !pip install --upgrade git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f
# !pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
# !pip install ninja packaging
# !MAX_JOBS=4 pip install flash-attn --no-build-isolation

# 1. Logging into Huggingface and loading the SamSum dataset

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [4]:
from datasets import load_dataset

# Load the SamSum dataset for training, validation, and testing
dataset = load_dataset("samsum")
train_dataset, validation_dataset, test_dataset = dataset['train'], dataset['validation'], dataset['test']

From here on, you can continue with 2. Baseline Rouge evaluation, 3. Fine-tuning, or 4. Post-fine-tuning Rouge evaluation.

# 2. Baseline Rouge evaluation

We'll start by loading the model, if possible in full precision, and the tokenizer.

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch

# Model checkpoint to evaluate (here the Gemma 2B base model)
model_id = "google/gemma-2b"

Hmm, the first time I tried to download this, I got the error message ...

    Cannot access gated repo for url https://huggingface.co/google/gemma-2b/resolve/main/config.json.
    Access to model google/gemma-2b is restricted. You must be authenticated to access it.

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True, padding_side='right', trust_remote_code=True)



In [7]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, trust_remote_code=True).to("cuda")
# 70m 54.6s

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
!nvidia-smi

Sun May 12 08:07:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:01:00.0  On |                  N/A |
| 38%   58C    P0              N/A /  70W |    481MiB /  2048MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:02:00.0 Off |  

In [9]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

As a sanity check, I always run this code block to see what the model outputs for a few sample dialogues from the test dataset before starting the Rouge evaluation.

In [10]:
from torch.cuda.amp import autocast
import random

def generate_summary(dialogue):
    # Adjusting the prompt to QA format
    prompt = f"Instruct: Please summarize the following dialogue in less than 70 words:\n\n{dialogue}\nOutput:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512, padding=True).to("cuda")

    with autocast():
        outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extracting the summary part from the generated text
    summary_start = generated_text.find("Output:")
    if summary_start != -1:
        summary = generated_text[summary_start + len("Output:"):]
    else:
        summary = generated_text
    return summary.strip()
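The last step of `generate_summary` is plain string handling and can be checked without a GPU. Here is a small sketch of that step on its own; `extract_after_marker` is a hypothetical helper name and the example text is made up.

```python
def extract_after_marker(generated_text, marker="Output:"):
    """Return the text following the first occurrence of `marker`.

    Falls back to the whole text when the marker is missing.
    """
    start = generated_text.find(marker)
    if start == -1:
        return generated_text.strip()
    return generated_text[start + len(marker):].strip()

# Made-up generated text, only to exercise the helper
print(extract_after_marker("Instruct: Please summarize ...\nOutput: Sergio wrote a speech."))
```

Note that `find` returns the first match, so a dialogue that itself contains the marker text would cut the result at that earlier position.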

In [11]:
# Test the summarization on random samples
random_samples = random.sample(list(test_dataset), 3)

In [12]:
for sample in random_samples:
    dialogue = sample["dialogue"]
    true_summary = sample["summary"]

    generated_summary = generate_summary(dialogue)

    print(f"Dialogue: {dialogue}\nTrue Summary: {true_summary}\nGenerated Summary: {generated_summary}\n")


Dialogue: Ralph: Have you prepared a speech for Ulrich's wedding?
Sergio: Yes, it took me a long time
Ralph: What are you going to mention?
Sergio: I'll mostly just talk about how he's been a great friend over the years.
Ralph: Yeah, he is a great guy. He deserves this.
Sergio: I'm a bit nervous about it though--giving a speech.
Ralph: You'll be fine. He'll know you put a lot of thought into it.
True Summary: Sergio needed a long time to prepare a speech for Ulrich's wedding. He's going to talk about their long-lasting friendship and is nervous about giving a speech. Ralph is sure it will be fine. 
Generated Summary: 

Dialogue: Tricia: The cake is still not ready.
Zandra: Which cake?
Tricia: For your daughter’s birthday, Tam ;)
Zandra: Oh, of course, there are so many of them, I don’t even know what’s going on.
Tricia: Sure thing, you need a hand ;]
Zandra: Thank you so much, what would I do without you…
Zandra: But what about the cake, the party is tomorrow!
Tricia: You finally reali

Then we define the generate_summaries function which will be used during the Rouge calculations.

In [13]:
def generate_summaries(dialogues):
    generated_summaries = []
    for dialogue in dialogues:
        # Same prompt format as in generate_summary above
        prompt = f"Instruct: Please summarize the following dialogue in less than 70 words:\n\n{dialogue}\nOutput:"
        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")

        with autocast():
            outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, num_return_sequences=1)

        # Decode and clean up the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Extracting the summary part from the generated text
        summary_start = generated_text.find("Output:")
        if summary_start != -1:
            summary = generated_text[summary_start + len("Output:"):]
        else:
            summary = generated_text
        generated_summaries.append(summary.strip())

    return generated_summaries


In [14]:
import evaluate
from tqdm.auto import tqdm

# Initialize the ROUGE metric
rouge = evaluate.load("rouge")
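To make clear what is being measured, here is a simplified sketch of the ROUGE-1 F-measure as unigram overlap. It is not the implementation behind `evaluate.load("rouge")`: it assumes lowercasing and whitespace tokenization and does no stemming, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F-measure: lowercase, whitespace tokens, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped unigram overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat", "the dog"))
```

An empty prediction scores zero under this definition, which is relevant for the baseline, where the sanity check above produced an empty generated summary.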

In [15]:
rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': [], 'rougeLsum': []}
batch_size = 8  # Adjust based on your GPU's capabilities

In [16]:
for i in tqdm(range(0, len(test_dataset), batch_size), desc="Processing"):
    batch_indices = list(range(i, min(i + batch_size, len(test_dataset))))
    batch_dataset = test_dataset.select(batch_indices)
    batch_dialogues = [example['dialogue'] for example in batch_dataset]
    true_summaries = [example['summary'] for example in batch_dataset]

    generated_summaries = generate_summaries(batch_dialogues)
    scores = rouge.compute(predictions=generated_summaries, references=true_summaries)

    for key in scores.keys():
        # Directly append the score as a percentage without trying to access non-existing dictionary keys
        rouge_scores[key].append(scores[key] * 100)


# 1m 6.7s
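The loop above stores one aggregated score per batch, and the later averaging gives every batch the same weight, so a smaller final batch counts as much as a full one. A sketch of a size-weighted mean is shown below; `weighted_mean` is a hypothetical helper and the numbers are made up for illustration.

```python
def weighted_mean(batch_scores, batch_sizes):
    """Average per-batch scores, weighting each batch by its number of examples."""
    if len(batch_scores) != len(batch_sizes):
        raise ValueError("need one size per batch score")
    total = sum(batch_sizes)
    return sum(score * size for score, size in zip(batch_scores, batch_sizes)) / total

# Made-up numbers: a full batch of 8 and a final batch of 2
scores, sizes = [40.0, 10.0], [8, 2]
plain = sum(scores) / len(scores)
weighted = weighted_mean(scores, sizes)
print(plain, weighted)
```

With 103 batches the effect on the reported averages is likely small, but collecting all predictions and calling `rouge.compute` once would avoid the question entirely.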

Processing:   0%|          | 0/103 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
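The warning is about where padding tokens end up. A decoder-only model continues from the last position of each row, so with right padding that position can be a pad token instead of the end of the prompt. The toy sketch below illustrates the difference with plain lists; `pad_batch` is a hypothetical function, not the tokenizer's implementation.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to equal length on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        padded.append(padding + list(seq) if side == "left" else list(seq) + padding)
    return padded

batch = [[5, 6, 7], [8, 9]]  # toy token ids, 0 is the pad id
print(pad_batch(batch, pad_id=0, side="right"))  # short row ends in a pad token
print(pad_batch(batch, pad_id=0, side="left"))   # every row ends in a real token
```

In this notebook the corresponding change, as the warning itself suggests, would be passing `padding_side='left'` when creating the tokenizer used for generation.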

In [17]:
# Calculate average ROUGE scores
average_scores = {key: sum(values) / len(values) for key, values in rouge_scores.items()}
print("Average Baseline ROUGE Scores:", average_scores)

Average Baseline ROUGE Scores: {'rouge1': 0.8506188783237173, 'rouge2': 0.24196585582431088, 'rougeL': 0.6846194132470514, 'rougeLsum': 0.7498105442299694}


# 3. Fine-tuning with QLoRA

Before starting the fine-tuning process, it helps to free as much GPU memory as possible. I suggest restarting the session so the GPU memory gets flushed. Do rerun Step 1 (where we load samsum), together with the HF_HOME cell, the CUDA_VISIBLE_DEVICES cell for the 4090, and the NCCL fix cell for the 4090, before continuing.

In [5]:
# Loading the model
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import prepare_model_for_kbit_training

In [6]:
model_id = "google/gemma-2b"

# Configure model for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
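To get a feel for why 4-bit loading helps, here is a back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count of roughly 2.5 billion is an assumption for illustration, `weight_memory_gib` is a hypothetical helper, and the estimate ignores quantization constants, layers that stay unquantized (such as the embedding), activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 2.5e9  # assumed rough parameter count for gemma-2b
for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(n_params, bits):.2f} GiB")
```

The real footprint will be higher than the 4-bit figure, so treat this as a lower bound rather than a prediction.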

In [7]:
# Load the model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # needs the flash-attn package and an Ampere-or-newer GPU (the RTX 4090 qualifies)
)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
# Prepare the model for k-bit training and load tokenizer
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Ensure padding token is correctly set
tokenizer.padding_side = "right"  # Set padding side to right for consistency

As before, run these tests before starting the training to make sure the model is loaded correctly. As you can see, I changed the prompt here so that it aligns with the training format.

In [9]:
from torch.cuda.amp import autocast
import random

def generate_summary(dialogue):
    # Adjusting the prompt to QA format
    prompt = f"""<s>###Instruction:
              You are a helpful, respectful and honest assistant. \
              Your task is to summarize the following dialogue. \
              Your answer should be based on the provided dialogue only.\n ### Dialogue:
              {dialogue}\n Summary:"""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512, padding=True).to("cuda")

    with autocast():
        outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extracting the summary part from the generated text
    summary_start = generated_text.find("Summary:")
    if summary_start != -1:
        summary = generated_text[summary_start + len("Summary:"):]
    else:
        summary = generated_text
    return summary.strip()

In [10]:
# Test the summarization on random samples
random_samples = random.sample(list(test_dataset), 3)

for sample in random_samples:
    dialogue = sample["dialogue"]
    true_summary = sample["summary"]

    generated_summary = generate_summary(dialogue)

    print(f"Dialogue: {dialogue}\nTrue Summary: {true_summary}\nGenerated Summary: {generated_summary}\n")

# 5.2s


The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.float16.


Dialogue: Marta: <file_gif>
Marta: Sorry girls, I clicked something by accident :D
Agnieszka: No problem :p
Weronika: Hahaha
Agnieszka: Good thing you didn't send something from your gallery ;)
True Summary: Marta sent a file accidentally,
Generated Summary: Marta: Sorry girls, I clicked something by accident :D
 Agnieszka: No problem :p
 Weronika: Hahaha
 Agnieszka: Good thing you didn't send something from your gallery ;)
</s>

Dialogue: Jeremih: hey, tell your sis to text back
Hansel: haha, thats your issues bro, dont drag me into it
Jeremih: she's mad at me
Hansel: for what
Jeremih: i dont even know😔
Hansel:😢😂
Jeremih: youre laughing
Hansel: haha, ill tell her but next time i wont interfere
Jeremih: Okay bro, thanks
True Summary: Hansel will tell his sis to text Jeremih back.
Generated Summary: Hansel: haha, thats your issues bro, dont drag me into it
 Jeremih: she's mad at me
 Hansel: for what
 Jeremih: i dont even know
 Hansel:😢😂
 Jer

Dialogue: Sophia: I'm sorry
Mason: It's fine

In [11]:
# Prompt formatter
def prompt_formatter(sample):
    return f"""<s>### Instruction:
    You are a helpful, respectful and honest assistant. \
    Your task is to summarize the following dialogue in a concise way. \
    Your answer should be based on the provided dialogue only.
    ### Dialogue:
    {sample['dialogue']}
    ### Summary:
    {sample['summary']} </s>"""

# Preview the formatted prompt for one training sample
n = 0
print(prompt_formatter(train_dataset[n]))
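A triple-quoted f-string keeps the indentation of the source code, so every line of the prompt produced by `prompt_formatter` starts with four spaces. Whether that matters for training is not something this notebook tests. As an alternative sketch (not the formatter used for the results here), the prompt can be assembled from a list of lines; `format_prompt_lines` is a hypothetical name and the sample is made up.

```python
def format_prompt_lines(sample):
    """Build the training prompt line by line so no source indentation leaks in."""
    instruction = (
        "You are a helpful, respectful and honest assistant. "
        "Your task is to summarize the following dialogue in a concise way. "
        "Your answer should be based on the provided dialogue only."
    )
    lines = [
        "<s>### Instruction:",
        instruction,
        "### Dialogue:",
        sample["dialogue"],
        "### Summary:",
        f"{sample['summary']} </s>",
    ]
    return "\n".join(lines)

# Made-up sample with the same keys as a SamSum row
example = {"dialogue": "A: Lunch at noon?\nB: Sure.", "summary": "A and B will have lunch at noon."}
print(format_prompt_lines(example))
```

If this variant were used for training, the inference prompts would need the same layout to stay aligned with it.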

Before setting the training variables for PEFT, have a look at the Linear layers that should be defined as the target_modules by running:

In [12]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaFlashAttention2(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
  

In [13]:
from peft import LoraConfig, get_peft_model

# the QLoRA paper recommends LoRA dropout = 0.05 for small models (less than 13B)

peft_config = LoraConfig(
    target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
    "lm_head",
    ],
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
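For intuition about how small the adapter is, the number of LoRA parameters can be estimated from the layer shapes printed above: each adapted Linear layer gets two matrices of rank `r`, adding `r * (in_features + out_features)` parameters. The sketch below covers only the seven projection layers per decoder block; `lm_head` is left out because its shape is not visible in the truncated printout, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)

# (in_features, out_features) copied from the print(model) output above
layer_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
rank, n_layers = 8, 18  # r from LoraConfig, 18 decoder layers from the printout

per_layer = sum(lora_param_count(i, o, rank) for i, o in layer_shapes.values())
total = per_layer * n_layers  # excludes lm_head
print(per_layer, total)
```

Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count, which will be somewhat higher because it includes the `lm_head` adapter.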

In [14]:
from transformers import TrainingArguments
from trl import SFTTrainer

# set up the trainer
args = TrainingArguments(
    output_dir="gemma2b-samsum",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    logging_steps=4,
    save_strategy="epoch",
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    bf16=True, # make sure this works with your GPU, otherwise set to False and choose fp16 = True
    fp16=False,
    tf32=True, # requires an Ampere or newer GPU, otherwise set to False
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_seq_length=1024,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=prompt_formatter,
    args=args,
)

Generating train split: 0 examples [00:00, ? examples/s]

In [15]:
trainer.train()

# 15m 46.9s

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: robkayinto. Use `wandb login --relogin` to force relogin


  0%|          | 0/418 [00:00<?, ?it/s]

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


{'loss': 2.5701, 'grad_norm': 3.5357959270477295, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.3167, 'grad_norm': 0.5911887288093567, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 2.2311, 'grad_norm': 4.342338562011719, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 2.0312, 'grad_norm': 1.7371952533721924, 'learning_rate': 0.0002, 'epoch': 0.04}
{'loss': 1.9146, 'grad_norm': 2.1151106357574463, 'learning_rate': 0.0002, 'epoch': 0.05}
{'loss': 1.8679, 'grad_norm': 2.0839924812316895, 'learning_rate': 0.0002, 'epoch': 0.06}
{'loss': 1.8357, 'grad_norm': 0.4784912168979645, 'learning_rate': 0.0002, 'epoch': 0.07}
{'loss': 1.713, 'grad_norm': 0.4830818176269531, 'learning_rate': 0.0002, 'epoch': 0.08}
{'loss': 1.8088, 'grad_norm': 0.4389052987098694, 'learning_rate': 0.0002, 'epoch': 0.09}
{'loss': 1.7498, 'grad_norm': 0.48688098788261414, 'learning_rate': 0.0002, 'epoch': 0.1}
{'loss': 1.6936, 'grad_norm': 1.191637635231018, 'learning_rate': 0.0002, 'epoch': 0.11}
{'loss': 1.68



{'train_runtime': 951.9125, 'train_samples_per_second': 3.511, 'train_steps_per_second': 0.439, 'train_loss': 1.7100665626343357, 'epoch': 1.0}


TrainOutput(global_step=418, training_loss=1.7100665626343357, metrics={'train_runtime': 951.9125, 'train_samples_per_second': 3.511, 'train_steps_per_second': 0.439, 'total_flos': 4.093825814573875e+16, 'train_loss': 1.7100665626343357, 'epoch': 1.0})
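The 418 steps in the log can be reconstructed from the training arguments. This is a sketch under assumptions: a single GPU, and a trainer that counts optimizer steps per epoch as the number of batches floor-divided by the gradient accumulation steps. The number of training sequences is inferred from the logged runtime and throughput; with `packing=True` these are presumably packed sequences rather than individual dialogues.

```python
import math

per_device_batch, grad_accum = 4, 2          # from TrainingArguments above
runtime_s, samples_per_s = 951.9125, 3.511   # from the logged train metrics

effective_batch = per_device_batch * grad_accum
n_sequences = round(runtime_s * samples_per_s)   # inferred, not logged directly
batches = math.ceil(n_sequences / per_device_batch)
optimizer_steps = batches // grad_accum

print(effective_batch, n_sequences, optimizer_steps)
```

The result matches `global_step=418`, so each optimizer update here sees an effective batch of 8 sequences.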

After fine-tuning, save the PEFT adapter (or share it on your Huggingface account) so you can re-use it in the next steps.

In [16]:
# Save our tokenizer and create model card
tokenizer.save_pretrained("gemma2b-samsum_4bitqlora")
trainer.create_model_card()
# save model
trainer.save_model()
# Push the results to the hub ... um, nope!
# trainer.push_to_hub()



# 4. Calculating the Rouge scores after fine-tuning

After the fine-tuning process, it again helps to free as much GPU memory as possible. I suggest restarting the session so the GPU memory gets flushed. Do rerun Step 1 before continuing.

Let's load the PEFT adapter, model and tokenizer.

In [5]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

In [6]:
# config = PeftConfig.from_pretrained("Farhang87/gemma2b-samsum") #use your own Huggingface link
config = PeftConfig.from_pretrained("gemma2b-samsum") #use your own Huggingface link

In [7]:
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", trust_remote_code=True, device_map="auto")

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
# model = PeftModel.from_pretrained(model, "Farhang87/gemma2b-samsum") #use your own Huggingface link
model = PeftModel.from_pretrained(model, "gemma2b-samsum") #use your own Huggingface link

In [9]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Like earlier, do some sanity checks again, making sure the PEFT adapter has been loaded correctly.

In [10]:
from torch.cuda.amp import autocast
import random

def generate_summary(dialogue):
    # Adjusting the prompt to QA format
    prompt = f"""<s>###Instruction:
              You are a helpful, respectful and honest assistant. \
              Your task is to summarize the following dialogue. \
              Your answer should be based on the provided dialogue only.\n ### Dialogue:
              {dialogue}\n Summary:"""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512, padding=True).to("cuda")

    with autocast():
        outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extracting the summary part from the generated text
    summary_start = generated_text.find("Summary:")
    if summary_start != -1:
        summary = generated_text[summary_start + len("Summary:"):]
    else:
        summary = generated_text
    return summary.strip()

In [11]:
# Test the summarization on random samples
random_samples = random.sample(list(test_dataset), 3)

for sample in random_samples:
    dialogue = sample["dialogue"]
    true_summary = sample["summary"]

    generated_summary = generate_summary(dialogue)

    print(f"Dialogue: {dialogue}\nTrue Summary: {true_summary}\nGenerated Summary: {generated_summary}\n")


Dialogue: Fiona: I just can’t stand it
Wanda: What again
Fiona: When I’m in one room with him… I just go crazy
Wanda: Conrad?
Fiona: Yesss, he’s absolutely lovely!!
Wanda: IT IS YOUR STUDENT
Fiona: So what? I mean sure, I know it’s… inappropriate xd but still, he’s only 5 years younger than me
Wanda: It’s so fucked up, I knew you are crazy before but it’s too much
Fiona: I knoooow when I come back home after the class I can’t do anything for like an hour or two, I just listen to the music
Wanda: You’re literally in love with him
Fiona: I mean I don’t expect anything, we’re from different worlds but… Yes, I just want to be around him. All the time xd
Wanda: So you need to do something about it
Fiona: Are you crazy, I can’t!!
Wanda: Why not
Fiona: What would my boss say if she knew
Wanda: Will she know?
Fiona: How can I know what will he do, he can tell his mother as well
Wanda: Just go for it!!
True Summary: Fiona fell in love with his student, Conrad.
Generated Summary: Fiona is in lov

In [12]:
from torch.cuda.amp import autocast

def generate_summaries(dialogues):
    generated_summaries = []
    for dialogue in dialogues:
        # Use the same prompt format as in generate_summary
        prompt = f"""<s>###Instruction:
              You are a helpful, respectful and honest assistant. \
              Your task is to summarize the following dialogue. \
              Your answer should be based on the provided dialogue only.\n ### Dialogue:
              {dialogue}\n Summary:"""
        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")

        with autocast():
            outputs = model.generate(**inputs, max_new_tokens=70, pad_token_id=tokenizer.eos_token_id, num_return_sequences=1)

        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extracting the summary part from the generated text following the "Summary:" marker
        summary_start = generated_text.find("Summary:")
        summary = generated_text[summary_start + len("Summary:"):] if summary_start != -1 else generated_text
        generated_summaries.append(summary.strip())

    return generated_summaries

In [13]:
import evaluate
from datasets import load_dataset
from tqdm.auto import tqdm

# Initialize the ROUGE metric
rouge = evaluate.load("rouge")

rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': [], 'rougeLsum': []}
batch_size = 8  # Adjust based on your GPU's capabilities

In [14]:
for i in tqdm(range(0, len(test_dataset), batch_size), desc="Processing"):
    batch_indices = list(range(i, min(i + batch_size, len(test_dataset))))
    batch_dataset = test_dataset.select(batch_indices)
    batch_dialogues = [example['dialogue'] for example in batch_dataset]
    true_summaries = [example['summary'] for example in batch_dataset]

    generated_summaries = generate_summaries(batch_dialogues)
    scores = rouge.compute(predictions=generated_summaries, references=true_summaries)

    for key in scores.keys():
        # Directly append the score as a percentage without trying to access non-existing dictionary keys
        rouge_scores[key].append(scores[key] * 100)

Processing:   0%|          | 0/103 [00:00<?, ?it/s]

In [15]:
# Calculate average ROUGE scores
average_scores = {key: sum(values) / len(values) for key, values in rouge_scores.items()}
print("Average ROUGE Scores after Fine-tuning:", average_scores)


Average ROUGE Scores after Fine-tuning: {'rouge1': 38.333791665390336, 'rouge2': 17.805572671744667, 'rougeL': 31.955494312574686, 'rougeLsum': 32.03829314198248}
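To put the two result lines side by side, the sketch below computes the gain per metric. The values are the averages printed in sections 2 and 4, rounded to four decimals; both are on a 0 to 100 scale because the evaluation loops multiply each score by 100. Keep in mind that the two runs also differ in prompt format and in `max_new_tokens` (50 versus 70), so the gain is not attributable to fine-tuning alone.

```python
# Averages copied from the two evaluation runs above, rounded to four decimals
baseline = {"rouge1": 0.8506, "rouge2": 0.2420, "rougeL": 0.6846, "rougeLsum": 0.7498}
finetuned = {"rouge1": 38.3338, "rouge2": 17.8056, "rougeL": 31.9555, "rougeLsum": 32.0383}

gains = {key: round(finetuned[key] - baseline[key], 2) for key in baseline}
for key, gain in gains.items():
    print(f"{key}: {baseline[key]:.2f} -> {finetuned[key]:.2f} (+{gain:.2f} points)")
```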
