# Overview
- This notebook fine-tunes Gemma, the open models released by Google. It can fine-tune both the 2b and 7b models.
- Fine-tuning was run on an A10G GPU.
- LoRA is used for fine-tuning.
- The dataset is the Financial Phrase Bank from Kaggle.
- Inspired by https://www.kaggle.com/code/lucamassaron/fine-tune-gemma-7b-it-for-sentiment-analysis


In [1]:
from watermark import watermark
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # expose only the first GPU to PyTorch
os.environ["TOKENIZERS_PARALLELISM"] = "false" # disable parallelism in the tokenizers library

In [2]:
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [3]:
# IMPORTS
import numpy as np
import pandas as pd
import os
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)

from datasets import Dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer

from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [4]:
%load_ext watermark
%watermark -p torch,transformers,datasets,accelerate,trl,peft

torch       : 2.2.1
transformers: 4.38.1
datasets    : 2.17.1
accelerate  : 0.27.2
trl         : 0.7.11
peft        : 0.8.2



In [5]:
# Local file name.
filename = "all-data.csv"

# Read the file to pandas
df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

# Create train, and test
X_train = list()
X_test = list()

# sample from each sentiment randomly and add 300 each to train and test.
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df[df.sentiment==sentiment], 
                                    train_size=300,
                                    test_size=300, 
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

# Rows not used for train or test form the eval pool (randomly sample 50 from each sentiment).
# Exclude every row already assigned to train or test, across all three sentiments.
used_idx = set(X_train.index) | set(X_test.index)
X_eval = df[~df.index.isin(used_idx)]
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)


# Generate prompt for fine-tuning
def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()

# Generate prompt for test
def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = 

            """.strip()

# Prompt from gemma model card
# def generate_prompt(data_point):
#     return f"""<start_of_turn>user
#             Analyze the sentiment of the news headline enclosed in square brackets, 
#             determine if it is positive, neutral, or negative, and return the answer as 
#             the corresponding sentiment label "positive" or "neutral" or "negative"
            
#             [{data_point["text"]}]<end_of_turn>
#             <start_of_turn>model
#             {data_point["sentiment"]}<end_of_turn>""".strip()

# def generate_test_prompt(data_point):
#     return f"""<start_of_turn>user
#             Analyze the sentiment of the news headline enclosed in square brackets, 
#             determine if it is positive, neutral, or negative, and return the answer as 
#             the corresponding sentiment label "positive" or "neutral" or "negative"
            
#             [{data_point["text"]}]<end_of_turn>
#             <start_of_turn>model
#             """.strip()

# Convert train and eval to Datasets
X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
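Because the train, test, and eval frames are all carved out of the same `df`, it can be worth confirming that no row lands in more than one split. The sketch below uses a hypothetical helper (`find_overlaps` is not part of the notebook) that works on plain collections of row indices. In the notebook it would have to be called on the original indices, before `reset_index` discards them.

```python
from itertools import combinations

def find_overlaps(splits):
    """Return {(name_a, name_b): shared_ids} for every pair of splits that overlaps."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Toy indices standing in for X_train.index, X_test.index and X_eval.index.
toy_splits = {
    "train": [0, 1, 2, 3],
    "test": [4, 5, 6],
    "eval": [3, 7, 8],  # row 3 appears in both train and eval
}
print(find_overlaps(toy_splits))
```

An empty dictionary means the splits are disjoint. Any non-empty result points at rows that would leak between splits.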

In [6]:
df.head(2)

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...


## Function to evaluate the performance of LLMs

In [7]:
def evaluate(y_true, y_pred):
    """
    Map the labels to integers: positive (2), neutral (1), negative (0).
    Predictions that match no label ("none") are counted as neutral (1).
    Print the overall accuracy and the accuracy for each label,
    then the classification report and the confusion matrix.
    """
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

# Load the model and tokenizer

In [8]:
# model_name = "/kaggle/input/gemma/transformers/7b-it/1"
model_name = "google/gemma-7b-it"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # load in 4bit
    bnb_4bit_use_double_quant=False, # do not apply nested (double) quantization
    bnb_4bit_quant_type="nf4", # nf4 type of quantization
    bnb_4bit_compute_dtype=compute_dtype, # use float16 for computation.
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)


## Make predictions using the LLM

In [9]:
def predict(X_test, model, tokenizer):
    """
    Generate predictions. Input prompt is first converted to tokens,
    The tokens are sent to the LLM to generate an output.
    The output is then mapped to positive, negative, neutral
    """
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**input_ids, max_new_tokens=1, do_sample=False)  # greedy decoding
        result = tokenizer.decode(outputs[0])
        answer = result.split("=")[-1].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred
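The label-extraction step in `predict` can be exercised without a GPU. The sketch below isolates the same string logic in a hypothetical helper, `extract_label` (not part of the notebook), and runs it on hand-written strings that stand in for decoded model output.

```python
def extract_label(decoded):
    """Return the sentiment named after the last '=' in the decoded text."""
    answer = decoded.split("=")[-1].lower()
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"  # no recognizable label was generated

# Hand-written stand-ins for tokenizer.decode(outputs[0]).
print(extract_label("[Profit rose 10%] = Positive"))
print(extract_label("[Sales were flat] = \n"))
```

Only the text after the last `=` is inspected, so as long as the generated token contains no `=` itself, an `=` inside the headline does not affect the result.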

In [10]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████████████████████████████████████████████████████████████████████████████████████| 900/900 [01:33<00:00,  9.63it/s]

Accuracy: 0.631
Accuracy for label 0: 0.803
Accuracy for label 1: 0.193
Accuracy for label 2: 0.897

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.80      0.86       300
           1       0.46      0.19      0.27       300
           2       0.52      0.90      0.66       300

    accuracy                           0.63       900
   macro avg       0.64      0.63      0.60       900
weighted avg       0.64      0.63      0.60       900


Confusion Matrix:
[[241  40  19]
 [ 13  58 229]
 [  4  27 269]]





### The accuracy across all the sentiments is 0.631. It is best for label 2 (positive, 0.897) and worst for label 1 (neutral, 0.193).
- Most neutral headlines (229 of 300) are misclassified as positive.
- This is very different from gemma-2b-it, which had good accuracy on label 0 and its worst accuracy on label 2.
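The accuracy figures and the classification report can all be derived from the confusion matrix printed above. The sketch below is a plain-Python illustration of that relationship; `per_class_metrics` is a hypothetical helper, and the row and column order follows the integer mapping in `evaluate` (0 negative, 1 neutral, 2 positive).

```python
def per_class_metrics(conf_matrix):
    """Compute overall accuracy plus per-class recall and precision.

    Rows are true labels and columns are predicted labels,
    the same orientation scikit-learn's confusion_matrix uses.
    """
    n = len(conf_matrix)
    total = sum(sum(row) for row in conf_matrix)
    correct = sum(conf_matrix[i][i] for i in range(n))
    recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(n)]
    col_sums = [sum(conf_matrix[r][c] for r in range(n)) for c in range(n)]
    precision = [conf_matrix[i][i] / col_sums[i] for i in range(n)]
    return correct / total, recall, precision

# Confusion matrix of the base model, copied from the output above.
baseline = [[241, 40, 19],
            [13, 58, 229],
            [4, 27, 269]]

accuracy, recall, precision = per_class_metrics(baseline)
print(f"accuracy:  {accuracy:.3f}")
print("recall:   ", [round(r, 3) for r in recall])
print("precision:", [round(p, 3) for p in precision])
```

The per-label accuracy that `evaluate` prints is the same quantity as the recall column of the classification report, because it only looks at rows with a given true label.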

# Fine-tuning the model
- Use the SFTTrainer from trl.
- Use the PEFT method and set the parameters for LoRA.
- Set the training parameters such as logging, number of epochs, batch size, learning rate, etc.
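LoRA keeps each pretrained weight matrix `W` frozen and learns a low-rank update, so the adapted layer uses `W + (alpha / r) * B @ A`, where `B` is `d_out x r` and `A` is `r x d_in`. The sketch below only does the bookkeeping for a single linear layer. The rank and alpha match the `LoraConfig` in the next cell; the 3072 x 3072 layer shape is an illustrative assumption, not a value read from the model, and `lora_stats` is a hypothetical helper.

```python
def lora_stats(d_in, d_out, r, alpha):
    """Count trainable LoRA parameters for one linear layer and the update scale."""
    full_params = d_in * d_out          # frozen pretrained weights
    lora_params = r * (d_in + d_out)    # A is r x d_in, B is d_out x r
    scale = alpha / r                   # factor applied to the product B @ A
    return full_params, lora_params, scale

# r and lora_alpha as set in LoraConfig; the layer shape is assumed for illustration.
full, lora, scale = lora_stats(d_in=3072, d_out=3072, r=64, alpha=16)
print(f"frozen: {full:,}  trainable: {lora:,}  ({lora / full:.1%} of the layer)")
print(f"update scale alpha / r = {scale}")
```

With `target_modules="all-linear"`, an adapter pair like this is attached to every linear layer, so the total trainable count depends on the actual layer shapes of the model.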

In [11]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    do_eval=False,
    evaluation_strategy="no",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=1024,
)

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Detected kernel version 4.14.326, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [12]:
# Train model 
trainer.train()

# Save trained model
# trainer.model.save_pretrained("trained-model_it")

Step,Training Loss
25,5.0038
50,1.0692
75,0.9686
100,0.9876
125,0.851
150,0.7029
175,0.6463
200,0.6534
225,0.638
250,0.4414


TrainOutput(global_step=336, training_loss=0.9918200472990671, metrics={'train_runtime': 769.96, 'train_samples_per_second': 3.507, 'train_steps_per_second': 0.436, 'total_flos': 1.124157764164608e+16, 'train_loss': 0.9918200472990671, 'epoch': 2.99})
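The `global_step=336` and `'epoch': 2.99` values in the output above appear consistent with the batch settings: 900 training prompts, a per-device batch size of 1, and 8 gradient-accumulation steps. The sketch below reproduces that arithmetic, assuming a single GPU and assuming that a trailing partial accumulation window is dropped in each epoch. `expected_steps` is a hypothetical helper, not a Trainer API.

```python
def expected_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate optimizer steps, assuming partial accumulation windows are dropped."""
    batch = per_device_batch * num_devices
    batches_per_epoch = -(-num_samples // batch)      # ceiling division
    steps_per_epoch = batches_per_epoch // grad_accum
    total_steps = steps_per_epoch * epochs
    samples_seen = total_steps * grad_accum * batch
    return total_steps, samples_seen / num_samples

steps, epochs_done = expected_steps(num_samples=900, per_device_batch=1,
                                    grad_accum=8, epochs=3)
print(steps, round(epochs_done, 2))
```

The effective batch size is 1 x 8 = 8 prompts per optimizer step, which is why the loss is logged only ten times with `logging_steps=25`.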

In [13]:
%load_ext tensorboard
%tensorboard --logdir logs/runs --bind_all

In [15]:
from tensorboard import notebook
print(notebook.list()) # View open TensorBoard instances
# !kill 14175

Known TensorBoard instances:
  - port 6006: logdir logs/runs (started 0:00:26 ago; pid 14175)
None


# Evaluate the model after training.

In [16]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)


100%|██████████████████████████████████████████████████████████████████████████████████████████| 900/900 [02:10<00:00,  6.89it/s]

Accuracy: 0.886
Accuracy for label 0: 0.977
Accuracy for label 1: 0.877
Accuracy for label 2: 0.803

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97       300
           1       0.81      0.88      0.84       300
           2       0.90      0.80      0.85       300

    accuracy                           0.89       900
   macro avg       0.89      0.89      0.89       900
weighted avg       0.89      0.89      0.89       900


Confusion Matrix:
[[293   4   3]
 [ 12 263  25]
 [  2  57 241]]





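### After training, the performance improves significantly
- Overall accuracy rises from 0.631 to 0.886, a little better than gemma-2b-it.
- The model now performs well on all the labels: accuracy on label 1 (neutral) rises from 0.193 to 0.877, while accuracy on label 2 (positive) dips from 0.897 to 0.803.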
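To see where the gain comes from, the two confusion matrices printed in this notebook can be compared label by label. The sketch below is a plain-Python illustration; `recall_per_label` is a hypothetical helper, and the label names follow the integer mapping used in `evaluate`.

```python
LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def recall_per_label(conf_matrix):
    """Diagonal count divided by the row total (true-label count) for each class."""
    return [row[i] / sum(row) for i, row in enumerate(conf_matrix)]

before = [[241, 40, 19], [13, 58, 229], [4, 27, 269]]   # base model
after = [[293, 4, 3], [12, 263, 25], [2, 57, 241]]      # after LoRA fine-tuning

for idx, (b, a) in enumerate(zip(recall_per_label(before), recall_per_label(after))):
    print(f"{LABELS[idx]:>8}: {b:.3f} -> {a:.3f} ({a - b:+.3f})")
```

Most of the improvement comes from neutral headlines that the base model labeled as positive. The small drop on positive headlines is the other side of that shift: 57 of them are now predicted as neutral.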