# <font color = 'pickle'>**Summarization Dataset using GEMMA** </font>

**Objective:**

In this notebook, we use GEMMA models with QLoRA for a summarization task. **We now use a causal language model; in effect, we do instruction tuning.**


**Plan**

1. Set Environment
2. Load Dataset
3. Accessing and Manipulating Splits
4. Load Pre-trained Tokenizer
5. Create Prompts
6. Model Training
  1. Download pre-trained model
  2. PEFT Setup
  3. Training Arguments
  4. Instantiate Trainer
  5. Setup WandB
  6. Training
  7. Validation
7. Performance on Test Set
8. Model Inference

### <font color='pickle'>**1. Set Environment**</font>


In [1]:
import sys

# Check if running in Google Colab
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    # Mount Google Drive with force_remount to avoid mounting issues
    drive.mount('/content/drive', force_remount=True)

    # Set the base path to a Google Drive folder
    base_path = '/content/drive/MyDrive/Colab Notebooks'

    # Install necessary packages
    print("Installing required libraries...")
    !pip install transformers evaluate wandb datasets accelerate peft bitsandbytes trl -U -qq
    !pip install iterative-stratification -qq

    print("All packages installed successfully.")

# No need to force restart the runtime as this approach aligns with Colab's auto-resolved environment



Mounted at /content/drive
Installing required libraries...

In [2]:
# standard data science libraries for data handling and visualization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import seaborn as sns
from pathlib import Path
import re
import gc
import time

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix, precision_score, recall_score, f1_score
import joblib

import torch
import torch.nn as nn
import ast


# New libraries introduced in this notebook
import evaluate
from datasets import load_dataset, DatasetDict, Dataset, ClassLabel
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import (
    TrainingArguments,
    Trainer,
    set_seed,
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    AutoConfig,
    pipeline,
    BitsAndBytesConfig,
)
from peft import (
    TaskType,
    LoraConfig,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import wandb
from google.colab import userdata
from huggingface_hub import login

In [3]:
set_seed(42)

In [4]:
wandb_api_key = userdata.get('WANDB_API_KEY')
hf_token = userdata.get('HF_TOKEN')


In [5]:
if hf_token:
    # Log in to Hugging Face
    login(token=hf_token)
    print("Successfully logged in to Hugging Face!")
else:
    print("Hugging Face token not found in notebook secrets.")

Successfully logged in to Hugging Face!


In [6]:
if wandb_api_key:
    wandb.login(key=wandb_api_key)
    print("Successfully logged in to WANDB!")
else:
    print("WANDB key not found in notebook secrets.")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Successfully logged in to WANDB!


In [7]:
base_folder = Path(base_path)

model_folder = base_folder/'models/gemma_qlora_lmh_summarization'

model_folder.mkdir(exist_ok=True, parents = True)

### <font color='pickle'>**2. Load Dataset**</font>


In [8]:
from datasets import load_dataset

ds = load_dataset("harpreetmann/samsum_max_500")


In [9]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 12908
    })
    test: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 1614
    })
    validation: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 1613
    })
    discarded: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 234
    })
})

In [10]:
ds['train'][0]

{'id': '13818877',
 'input': "Samantha: Where is the entrance to the building? Are you inside already?\r\nManuel: ohm you have to go from the back, sorry I didn't tell you\r\nCory: don't worry, everybody has a problem with it\r\nSamantha: it's super weird",
 'output': "Samantha can't find the entrance. Manuel helps her."}

In [11]:
# Calculate the average length of the summarization in 'output'
average_length = sum(len(summary.split()) for summary in ds['train']['output']) / len(ds['train']['output'])

print(f"Average length of summaries: {average_length:.2f} words")

Average length of summaries: 19.99 words


In [12]:
def free_gpu_memory():
    """
    Frees up GPU memory after a CUDA out-of-memory error in Colab.

    This function performs the following steps:
    1. Calls garbage collection to remove unreferenced objects from memory.
    2. Uses torch.cuda.empty_cache() to release cached GPU memory.
    3. Waits for a moment to ensure memory is fully released.

    Note: tensors that are still referenced elsewhere (e.g. `model`, `trainer`)
    are not freed; delete those references (`del model`) before calling this.
    """
    try:
        # Collect garbage to release any unreferenced Python objects
        gc.collect()

        # Empty the CUDA cache to release cached GPU memory (skipped without a GPU)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Adding a small delay to allow memory to be fully released
        time.sleep(2)

        print("GPU memory has been freed.")
    except Exception as e:
        print(f"Error while freeing GPU memory: {e}")

In [13]:
free_gpu_memory()

GPU memory has been freed.


In [14]:
checkpoint = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [15]:
tokenizer.eos_token

'<eos>'

In [16]:
tokenizer.pad_token

'<pad>'

In [17]:
def format_chat(example):
    instruction = f"Summarize the the TEXT, limit your response to around 20 words.\n\nTEXT: {example['input']}"
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": f"{example['output']}"}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

In [18]:
ds = ds.map(format_chat)


In [19]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'input', 'output', 'text'],
        num_rows: 12908
    })
    test: Dataset({
        features: ['id', 'input', 'output', 'text'],
        num_rows: 1614
    })
    validation: Dataset({
        features: ['id', 'input', 'output', 'text'],
        num_rows: 1613
    })
    discarded: Dataset({
        features: ['id', 'input', 'output', 'text'],
        num_rows: 234
    })
})

In [20]:
ds['train'][0]

{'id': '13818877',
 'input': "Samantha: Where is the entrance to the building? Are you inside already?\r\nManuel: ohm you have to go from the back, sorry I didn't tell you\r\nCory: don't worry, everybody has a problem with it\r\nSamantha: it's super weird",
 'output': "Samantha can't find the entrance. Manuel helps her.",
 'text': "<bos><start_of_turn>user\nSummarize the the TEXT, limit your response to around 20 words.\n\nTEXT: Samantha: Where is the entrance to the building? Are you inside already?\r\nManuel: ohm you have to go from the back, sorry I didn't tell you\r\nCory: don't worry, everybody has a problem with it\r\nSamantha: it's super weird<end_of_turn>\n<start_of_turn>model\nSamantha can't find the entrance. Manuel helps her.<end_of_turn>\n"}
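
The `text` field above shows the layout that `tokenizer.apply_chat_template` produced for this checkpoint. As a rough, pure-Python illustration of that layout (not the tokenizer's actual template), the sketch below rebuilds the same kind of string. `render_gemma_chat` is a hypothetical helper, and the mapping of the `assistant` role to `model` is inferred from the output shown above.

```python
def render_gemma_chat(messages):
    """Hypothetical re-implementation of the layout seen in the `text` column."""
    parts = ["<bos>"]
    for message in messages:
        # The template output above labels assistant turns as "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Summarize: A: hi\nB: hello"},
    {"role": "assistant", "content": "A and B greet each other."},
]
rendered = render_gemma_chat(example)
print(rendered)
```

The part that matters for training is the `<start_of_turn>model\n` header: everything after it is the target summary.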

In [21]:
def check_length(example):
    # Keep examples whose dialogue ('input') is at most 1000 tokens
    encoding = tokenizer.encode(example['input'])
    return len(encoding) <= 1000
train_filtered = ds['train'].filter(check_length)
valid_filtered = ds['validation'].filter(check_length)

print(f"Train: {len(ds['train'])} -> {len(train_filtered)}")
print(f"Valid: {len(ds['validation'])} -> {len(valid_filtered)}")



Train: 12908 -> 12908
Valid: 1613 -> 1613


In [22]:
def get_appropriate_dtype():
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0):
        return torch.bfloat16
    return torch.float16

In [23]:
torch_data_type = get_appropriate_dtype()
torch_data_type

torch.bfloat16

In [24]:
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_use_double_quant=True,
  bnb_4bit_compute_dtype=torch_data_type,
  bnb_4bit_quant_storage=torch_data_type,
)
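
To get a feel for what `nf4` with double quantization saves, here is a back-of-the-envelope sketch. It uses the `Linear4bit` shapes that this notebook prints when it displays `model`, and it assumes a block size of 64 for the first-level quantization constants and 256 for the second level (the figures reported in the QLoRA paper). It covers only the quantized linear layers and ignores the embedding, activations, and optimizer state.

```python
# (in_features, out_features) of the Linear4bit projections in one decoder layer,
# taken from the model printout in this notebook.
LAYER_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
NUM_LAYERS = 26
BLOCK_SIZE = 64           # assumption: weights per first-level quantization block
SECOND_LEVEL_BLOCK = 256  # assumption: constants per second-level block

quantized_params = NUM_LAYERS * sum(i * o for i, o in LAYER_SHAPES.values())

# Bits per weight: 4-bit value plus the amortized cost of quantization constants.
bits_single = 4 + 32 / BLOCK_SIZE
bits_double = 4 + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


def gib(bits_per_param):
    """Approximate storage in GiB for the quantized linear layers."""
    return quantized_params * bits_per_param / 8 / 2**30


print(f"Linear-layer parameters: {quantized_params:,}")
print(f"bf16:              {gib(16):.2f} GiB")
print(f"nf4:               {gib(bits_single):.2f} GiB")
print(f"nf4 + double quant: {gib(bits_double):.2f} GiB")
```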

In [25]:
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             quantization_config=bnb_config,
                                             torch_dtype=torch_data_type,
                                             trust_remote_code=True,)

`low_cpu_mem_usage` was None, now default to True since model is quantized.

In [26]:
model

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforward_layernorm)

In [27]:
def find_linear_layers(model):
    """
    Extracts the unique names of Linear layers from a model.

    Args:
    model (nn.Module): The model from which to extract Linear layer names.

    Returns:
    list: A list of unique names of Linear layers.
    """
    # Convert the model's printed structure to a string
    model_modules = str(model)
    # Pattern to extract names of Linear layers
    pattern = r'\((\w+)\): Linear'
    # Find all occurrences of the pattern
    linear_layer_names = re.findall(pattern, model_modules)
    print(linear_layer_names)
    # Get unique names using a set, then convert back to list
    target_modules = list(set(linear_layer_names))
    return target_modules

In [28]:
find_linear_layers(model)

['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'lm_head']


['gate_proj',
 'v_proj',
 'k_proj',
 'q_proj',
 'up_proj',
 'o_proj',
 'down_proj',
 'lm_head']
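
The regular expression in `find_linear_layers` can be tried on its own. The sketch below runs the same pattern on a few lines copied from the model printout and shows that it matches `Linear4bit` modules (the pattern only requires the class name to start with `Linear`) while skipping other module types. Sorting the set is an addition here, used to make the order deterministic.

```python
import re

# A few lines copied from the printed model, including non-Linear modules.
repr_snippet = """
(q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
(k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
(rotary_emb): Gemma2RotaryEmbedding()
(act_fn): PytorchGELUTanh()
(input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
"""

pattern = r'\((\w+)\): Linear'
matches = re.findall(pattern, repr_snippet)

# A set removes duplicates; sorting gives a stable order across runs.
target_modules = sorted(set(matches))
print(target_modules)
```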

In [29]:
TaskType.CAUSAL_LM

<TaskType.CAUSAL_LM: 'CAUSAL_LM'>

In [30]:
peft_config = LoraConfig(
    task_type = TaskType.CAUSAL_LM,
    r = 128,
    lora_alpha = 256,
    lora_dropout = 0.01,
    target_modules = ['v_proj',  'q_proj',  'up_proj', 'o_proj', 'down_proj', 'gate_proj','k_proj', 'lm_head']
)

lora_model = get_peft_model(model, peft_config)

lora_model.print_trainable_parameters()

trainable params: 199,196,672 || all params: 2,813,538,560 || trainable%: 7.0799
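
The trainable-parameter count above can be checked by hand. LoRA adds two matrices per adapted linear layer, one of shape `r x in_features` and one of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` parameters. The sketch below applies this to the shapes printed earlier; the `lm_head` shape (2304 to 256000) is an assumption inferred from the embedding size, since the printout is truncated before that layer.

```python
R = 128            # LoRA rank from peft_config
NUM_LAYERS = 26    # decoder layers in the printout
HIDDEN, VOCAB = 2304, 256000

# (in_features, out_features) of each target module inside one decoder layer
PER_LAYER_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}


def lora_params(in_features, out_features, r=R):
    """Parameters added by one LoRA adapter: A is r x in, B is out x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in PER_LAYER_SHAPES.values())
lm_head = lora_params(HIDDEN, VOCAB)  # assumed shape, see note above
trainable = NUM_LAYERS * per_layer + lm_head

ALL_PARAMS = 2_813_538_560  # "all params" as reported by print_trainable_parameters
percent = f"{100 * trainable / ALL_PARAMS:.4f}"
print(f"trainable params: {trainable:,} || trainable%: {percent}")
```

Note that `lm_head` alone accounts for about 33 million of the trainable parameters, so leaving it out of `target_modules` would noticeably shrink the adapter.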




In [31]:
# Define the directory where model checkpoints will be saved
model_folder = base_folder/'models/gemma_qlora_lmh_inst'

# Create the directory if it doesn't exist
model_folder.mkdir(exist_ok=True, parents=True)
run_name= 'stack_exp_lmh_gemma_inst'

use_fp16 = torch_data_type == torch.float16
use_bf16 = torch_data_type == torch.bfloat16

# Configure training parameters
training_args = SFTConfig(
    seed = 42,
    # Training-specific configurations
    num_train_epochs=2,  # Total number of training epochs
    # Number of samples per training batch for each device
    per_device_train_batch_size=4,
    # Number of samples per evaluation batch for each device
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant":False},
    # torch_empty_cache_steps=5,
    weight_decay=0.0,  # No weight decay (L2 regularization is disabled)
    learning_rate=1e-5,  # Step size for the optimizer during training
    optim='adamw_torch',  # Optimizer,

    # Checkpoint saving and model evaluation settings
    output_dir=str(model_folder),  # Directory to save model checkpoints
    eval_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=20,  # Perform evaluation every 20 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=20,  # Save a model checkpoint every 20 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use validation loss as the metric to determine the best model
    metric_for_best_model="eval_loss",
    greater_is_better=False,  # A model is 'better' if its validation loss is lower


    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=20,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name= run_name,  # Experiment name for Weights & Biases

    # Precision settings determined based on GPU capability
    fp16=use_fp16 ,  # Set True if torch_data_type is torch.float16
    bf16=use_bf16,  # Set True if torch_data_type is torch.bfloat16
    tf32=False,  # Disable tf32 unless you want to use Ampere specific optimization
)
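
It helps to translate these settings into step counts before starting a run. The sketch below is plain arithmetic on the values above and the dataset sizes shown earlier. It assumes a single GPU, and the exact rounding of steps per epoch may differ slightly between `transformers` versions.

```python
import math

TRAIN_EXAMPLES = 12908
VALID_EXAMPLES = 1613
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
EPOCHS = 2
EVAL_STEPS = 20
NUM_DEVICES = 1  # assumption: single Colab GPU

# Examples that contribute to one optimizer update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Forward/backward passes per epoch, and optimizer updates per epoch (approximate)
micro_batches_per_epoch = math.ceil(TRAIN_EXAMPLES / (PER_DEVICE_BATCH * NUM_DEVICES))
optimizer_steps_per_epoch = micro_batches_per_epoch // GRAD_ACCUM
total_steps = optimizer_steps_per_epoch * EPOCHS

# Evaluation cost over the whole run
num_evaluations = total_steps // EVAL_STEPS
eval_batches_per_run = math.ceil(VALID_EXAMPLES / PER_DEVICE_BATCH)
total_eval_batches = num_evaluations * eval_batches_per_run
total_train_micro_batches = micro_batches_per_epoch * EPOCHS

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps (approx.): {total_steps}")
print(f"Evaluation passes: {num_evaluations}")
print(f"Eval batches in total: {total_eval_batches}")
print(f"Training micro-batches in total: {total_train_micro_batches}")
```

Under these assumptions, evaluating the full validation split every 20 steps runs more evaluation batches than training micro-batches, so a larger `eval_steps` or a smaller evaluation subset is one way to shorten the run.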


In [32]:
if training_args.gradient_checkpointing:
  model.config.use_cache = False # disabling cache for compatibility

model.config

Gemma2Config {
  "_attn_implementation_autoset": true,
  "_name_or_path": "google/gemma-2-2b-it",
  "architectures": [
    "Gemma2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": 50.0,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": [
    1,
    107
  ],
  "final_logit_softcapping": 30.0,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 2304,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 8192,
  "model_type": "gemma2",
  "num_attention_heads": 8,
  "num_hidden_layers": 26,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,

In [37]:
data_collator_comp = DataCollatorForCompletionOnlyLM(tokenizer = tokenizer, mlm = False,
                                                     response_template = "<start_of_turn>model\n") # pulled from tokenizer could be different for different models

trainer = SFTTrainer(
    model = model,
    args = training_args,
    data_collator = data_collator_comp,
    peft_config = peft_config,
    train_dataset = train_filtered,
    eval_dataset = valid_filtered,
    dataset_text_field = 'text',
    packing = False
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
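
The completion-only collator is what restricts the loss to the summary tokens. The sketch below illustrates the idea in plain Python: locate the response template in the token ids and set every label up to and including it to `-100`, the value the loss ignores. `mask_prompt_tokens` is a hypothetical helper written for illustration, not TRL's implementation, and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels with everything up to and including the template ignored."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: no token contributes to the loss
    return [IGNORE_INDEX] * len(input_ids)


# Toy token ids (illustrative only, not real Gemma vocabulary ids)
input_ids = [2, 10, 11, 12, 90, 91, 40, 41, 42, 99]
response_template_ids = [90, 91]  # stands in for "<start_of_turn>model\n"

labels = mask_prompt_tokens(input_ids, response_template_ids)
print(labels)
```

If the response template string does not tokenize to the same ids inside the full prompt, the template is never found and the example contributes nothing to the loss, which is why the template has to match the tokenizer's chat format exactly.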





In [38]:
%env WANDB_PROJECT = summarization_instruction_gemma

env: WANDB_PROJECT=summarization_instruction_gemma


In [39]:
try:
    # Training is the most memory-intensive step and may raise a CUDA out-of-memory error
    trainer.train()
except RuntimeError as e:
    if 'CUDA out of memory' in str(e):
        print("CUDA out of memory error detected. Freeing GPU memory.")
        free_gpu_memory()
        # Optionally, retry here after freeing memory (e.g. with a smaller batch size)
    else:
        raise

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20 | 1.38 | 1.19494 |
| 40 | 1.1783 | 1.158712 |
| 60 | 1.1402 | 1.147249 |




KeyboardInterrupt: 

# <font color = 'orange'>**Model Training Stopped because of time**</font>

We may need to change the clean function, or check whether we need to clean at all: can we simply take everything after the response template?

We might not need the clean function at all. The `extract_content_fixed` function may be good enough, and we may not need to do any further cleaning or summarization.
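
As a starting point for the "take everything after the response template" idea, here is a minimal sketch. `extract_response` is a hypothetical helper (it is not the cleaning function mentioned above), and it assumes the decoded text still contains the special tokens.

```python
RESPONSE_TEMPLATE = "<start_of_turn>model\n"  # same string used for the collator
END_OF_TURN = "<end_of_turn>"


def extract_response(decoded_text):
    """Hypothetical helper: keep only the model's reply from a decoded generation."""
    # Everything after the last response template (whole text if it is absent)
    reply = decoded_text.rpartition(RESPONSE_TEMPLATE)[2]
    # Drop the end-of-turn marker and anything generated after it
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()


decoded = (
    "<bos><start_of_turn>user\nSummarize the TEXT.<end_of_turn>\n"
    "<start_of_turn>model\nSamantha can't find the entrance. Manuel helps her.<end_of_turn>\n"
)
summary = extract_response(decoded)
print(summary)
```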

Greedy decoding is not enough here; we have to use beam search.

To do that, we need to call `model.generate` directly and pass the beam-search arguments.
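
A possible shape for that inference step is sketched below. The decoding values (`num_beams=4`, `max_new_tokens=40`) are illustrative assumptions, not tuned settings, and `build_messages` and `summarize` are hypothetical helpers. The instruction string is copied verbatim from `format_chat` so that the inference prompt matches the one used in training.

```python
# Assumed decoding settings; the values are illustrative, not tuned.
GENERATION_KWARGS = {
    "max_new_tokens": 40,   # summaries average about 20 words
    "num_beams": 4,         # beam search instead of greedy decoding
    "do_sample": False,
    "early_stopping": True,
}


def build_messages(dialogue):
    # Instruction copied verbatim from format_chat to match the training prompt.
    instruction = (
        "Summarize the the TEXT, limit your response to around 20 words."
        f"\n\nTEXT: {dialogue}"
    )
    return [{"role": "user", "content": instruction}]


def summarize(model, tokenizer, dialogue, **overrides):
    """Generate one summary with beam search (sketch; needs a loaded model)."""
    inputs = tokenizer.apply_chat_template(
        build_messages(dialogue),
        add_generation_prompt=True,  # appends the "<start_of_turn>model\n" header
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, **{**GENERATION_KWARGS, **overrides})
    # Keep only the newly generated tokens, not the prompt
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

A call would look like `summarize(model, tokenizer, ds['test'][0]['input'])`, with keyword overrides such as `num_beams=2` for quicker experiments.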