# About this notebook
The excellent [original notebook](https://www.kaggle.com/code/emiz6413/training-gemma-2-9b-4-bit-qlora-fine-tuning/notebook) no longer runs because of version conflicts between its open-source dependencies.  
To fix this, the following package versions are pinned so that it runs correctly.

- transformers : 4.42.3
- bitsandbytes : 0.43.1
- accelerate : 0.32.1
- peft : 0.11.1

---

## What this notebook is
This notebook demonstrates how I trained Gemma-2 9b to obtain LB: 0.941. The inference code can be found [here](https://www.kaggle.com/code/emiz6413/inference-gemma-2-9b-4-bit-qlora).
I used the 4-bit quantized [Gemma 2 9b Instruct](https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit) uploaded by the unsloth team as the base model, added LoRA adapters, and trained for 1 epoch.

## Result

I used `id % 5 == 0` as an evaluation set and used all the rest for training.

| subset | log loss |
| - | - |
| eval | 0.9371|
| LB | 0.941 |

## What is QLoRA fine-tuning?

In conventional fine-tuning, the weight matrix ($\mathbf{W}$) is updated as follows:

$$
\mathbf{W} \leftarrow \mathbf{W} - \eta \frac{{\partial L}}{{\partial \mathbf{W}}} = \mathbf{W} + \Delta \mathbf{W}
$$

where $L$ is the loss at this step and $\eta$ is the learning rate.

[LoRA](https://arxiv.org/abs/2106.09685) approximates $\Delta \mathbf{W} \in \mathbb{R}^{\text{d} \times \text{k}}$ by factorizing it into two (much) smaller matrices, $\mathbf{B} \in \mathbb{R}^{\text{d} \times \text{r}}$ and $\mathbf{A} \in \mathbb{R}^{\text{r} \times \text{k}}$, with $\text{r} \ll \min(\text{d}, \text{k})$.

$$
\Delta \mathbf{W} \approx \mathbf{B} \mathbf{A}
$$
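
For reference, the factorization enters the forward pass as sketched here. The scaling factor $\alpha / \text{r}$ and the initialization come from the LoRA paper, not from this notebook, so treat them as background rather than as something this notebook configures explicitly:

$$
\mathbf{h} = \mathbf{W} \mathbf{x} + \frac{\alpha}{\text{r}} \mathbf{B} \mathbf{A} \mathbf{x}
$$

In the paper, $\mathbf{A}$ is initialized randomly and $\mathbf{B}$ with zeros, so $\mathbf{B} \mathbf{A} = \mathbf{0}$ at the start of training and the adapted model initially behaves exactly like the frozen base model.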

<img src="https://storage.googleapis.com/pii_data_detection/lora_diagram.png">

During training, only $\mathbf{A}$ and $\mathbf{B}$ are updated while the original weights stay frozen, so the number of trainable parameters is only a small fraction (e.g. <1%) of the original parameter count. This significantly reduces GPU memory usage during training while often achieving performance comparable to the usual (full) fine-tuning.
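
To make the "<1%" figure concrete, here is a minimal sketch that counts parameters for a single weight matrix. The layer size `d = k = 4096` is a hypothetical value chosen for illustration and is not taken from Gemma-2.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) parameter counts for one d x k weight matrix."""
    full = d * k        # parameters in W (and in a dense Delta W)
    lora = r * (d + k)  # parameters in B (d x r) plus A (r x k)
    return full, lora


# Hypothetical layer size, for illustration only
full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  lora: 131,072  ratio: 0.78%
```

The ratio shrinks further when adapters are attached to only some modules or layers, as this notebook does.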

[QLoRA](https://arxiv.org/abs/2305.14314) pushes the efficiency further by quantizing the LLM. For example, the weights of an 8B-parameter model alone take up 32GB of VRAM in 32-bit precision, whereas the 8-bit and 4-bit quantized versions need only about 8GB and 4GB, respectively.  
Note that QLoRA stores only the LLM's weights in low precision (e.g. 8-bit or 4-bit); the forward/backward computation is done in higher precision (e.g. 16-bit), and the LoRA adapter weights are also kept in higher precision.
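
The memory figures above follow from simple arithmetic, sketched below. It counts only the storage for the weights, taking 1 GB as 1e9 bytes, and ignores quantization metadata, activations, optimizer state, and the adapters.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 8e9  # 8B parameters, as in the example above
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n_params, bits):.0f} GB")
```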

One epoch on an A6000 took ~15h in 4-bit and ~24h in 8-bit, and the difference in log loss was not significant.

## Note
It takes a prohibitively long time to run the full training on a Kaggle kernel. I recommend using an external compute resource for the full training.
This notebook uses only the first 200 samples for demo purposes, but everything else is the same as my setup.

In [None]:
# gemma-2 is available from transformers>=4.42.3
!pip install transformers==4.42.3
!pip install bitsandbytes==0.43.1
!pip install accelerate==0.32.1
!pip install peft==0.11.1

In [None]:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Define the model name
model_name = 'unsloth/gemma-2-2b-it-bnb-4bit'

# Download the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save them locally
model.save_pretrained('./gemma_2b_model')
tokenizer.save_pretrained('./gemma_2b_model')


In [None]:
import shutil

# Zip the directory where the model and tokenizer were saved in the previous cell
shutil.make_archive('/kaggle/working/gemma_2b_model', 'zip', './gemma_2b_model')


In [None]:
import os
import copy
from dataclasses import dataclass

import numpy as np
import torch
from datasets import Dataset
from transformers import (
    BitsAndBytesConfig,
    Gemma2ForSequenceClassification,
    GemmaTokenizerFast,
    Gemma2Config,
    PreTrainedTokenizerBase, 
    EvalPrediction,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from sklearn.metrics import log_loss, accuracy_score

### Configurations

In [None]:
@dataclass
class Config:
    output_dir: str = "output"
    checkpoint: str = "unsloth/gemma-2-9b-it-bnb-4bit"  # 4-bit quantized gemma-2-9b-instruct
    max_length: int = 1024
    n_splits: int = 5
    fold_idx: int = 0
    optim_type: str = "adamw_8bit"
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 2  # effective batch size is 2 * 2 = 4 per device
    per_device_eval_batch_size: int = 8
    n_epochs: int = 1
    freeze_layers: int = 16  # there're 42 layers in total, we don't add adapters to the first 16 layers
    lr: float = 2e-4
    warmup_steps: int = 20
    lora_r: int = 16
    lora_alpha: float = lora_r * 2
    lora_dropout: float = 0.05
    lora_bias: str = "none"
    
config = Config()

#### Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    report_to="none",
    num_train_epochs=config.n_epochs,
    per_device_train_batch_size=config.per_device_train_batch_size,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    per_device_eval_batch_size=config.per_device_eval_batch_size,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="steps",
    save_steps=200,
    optim=config.optim_type,
    fp16=True,
    learning_rate=config.lr,
    warmup_steps=config.warmup_steps,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    logging_dir="./logs",
)
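
As a rough check of how these arguments interact in the demo run, the sketch below estimates the number of optimizer updates per epoch. It assumes a single device and 160 training rows (200 demo rows with one of five folds held out). The helper name is hypothetical and only approximates how the `Trainer` counts steps.

```python
import math


def optimizer_steps_per_epoch(
    n_train: int, per_device_batch: int, grad_accum: int, n_devices: int = 1
) -> int:
    """Approximate number of optimizer updates in one epoch."""
    # Number of batches the dataloader yields per epoch
    batches = math.ceil(n_train / (per_device_batch * n_devices))
    # One optimizer update per `grad_accum` batches
    return max(batches // grad_accum, 1)


steps = optimizer_steps_per_epoch(n_train=160, per_device_batch=2, grad_accum=2)
print(steps)  # 40
```

Under these assumptions the demo run has about 40 updates, so `warmup_steps=20` covers roughly half of it. On the full dataset the warmup is a negligible share of training.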


#### LoRA config

In [None]:
lora_config = LoraConfig(
    r=config.lora_r,
    lora_alpha=config.lora_alpha,
    # only target self-attention
    target_modules=["q_proj", "k_proj", "v_proj"],
    layers_to_transform=[i for i in range(42) if i >= config.freeze_layers],
    lora_dropout=config.lora_dropout,
    bias=config.lora_bias,
    task_type=TaskType.SEQ_CLS,
)
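
The sketch below spells out what this configuration selects, using the values from `Config`: which layers receive adapters and how strongly the LoRA update is scaled. It is plain Python and does not call PEFT. The `lora_alpha / r` scaling is PEFT's default behavior when rank-stabilized LoRA is not enabled.

```python
n_layers = 42       # number of decoder layers, as stated in the Config comment
freeze_layers = 16  # the first 16 layers get no adapters
lora_r = 16
lora_alpha = lora_r * 2
target_modules = ["q_proj", "k_proj", "v_proj"]

layers_to_transform = [i for i in range(n_layers) if i >= freeze_layers]
n_adapted_modules = len(layers_to_transform) * len(target_modules)
scaling = lora_alpha / lora_r  # factor applied to the B @ A update

print(layers_to_transform[0], layers_to_transform[-1])  # 16 41
print(len(layers_to_transform), n_adapted_modules)      # 26 78
print(scaling)                                          # 2.0
```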

### Instantiate the tokenizer & model

In [None]:
tokenizer = GemmaTokenizerFast.from_pretrained(config.checkpoint)
tokenizer.add_eos_token = True  # We'll add <eos> at the end
tokenizer.padding_side = "right"

In [None]:
model = Gemma2ForSequenceClassification.from_pretrained(
    config.checkpoint,
    num_labels=2,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model

In [None]:
model.print_trainable_parameters()

### Instantiate the dataset

In [None]:
import pandas as pd

# Load the parquet file
df = pd.read_parquet('/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet')

# Save the dataframe as a CSV file with escape characters
df.to_csv('/kaggle/working/train.csv', index=False, escapechar='\\')


In [None]:
ds = Dataset.from_csv("/kaggle/working/train.csv")
print(len(ds))
ds = ds.select(range(200))  # use only the first 200 rows for demo purposes

In [None]:
class CustomTokenizer:
    def __init__(
        self, 
        tokenizer: PreTrainedTokenizerBase, 
        max_length: int
    ) -> None:
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __call__(self, batch: dict) -> dict:
        prompt = ["<prompt>: " + self.process_text(t) for t in batch["prompt"]]
        response_a = ["\n\n<response_a>: " + self.process_text(t) for t in batch["response_a"]]
        response_b = ["\n\n<response_b>: " + self.process_text(t) for t in batch["response_b"]]
        texts = [p + r_a + r_b for p, r_a, r_b in zip(prompt, response_a, response_b)]
        tokenized = self.tokenizer(texts, max_length=self.max_length, truncation=True)
        labels=[]
        for a_win, b_win in zip(batch["winner_model_a"], batch["winner_model_b"]):
            if a_win:
                label = 0
            elif b_win:
                label = 1
            else:
                label = 2
            labels.append(label)
        return {**tokenized, "labels": labels}
        
    @staticmethod
    def process_text(text: str) -> str:
        return text.replace("null", "").strip()
    # def process_text(text: str) -> str:
    #     return " ".join(eval(text, {"null": ""}))

 

In [None]:
class CustomTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizerBase, max_length: int) -> None:
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __call__(self, batch: dict) -> dict:
        # Ensure that the keys exist in the batch before processing
        prompt = ["<prompt>: " + self.process_text(t) for t in batch.get("prompt", [])]
        response_a = ["\n\n<response_a>: " + self.process_text(t) for t in batch.get("response_a", [])]
        response_b = ["\n\n<response_b>: " + self.process_text(t) for t in batch.get("response_b", [])]
        
        # Concatenate all parts into one text field for tokenization
        texts = [p + r_a + r_b for p, r_a, r_b in zip(prompt, response_a, response_b)]
        
        # Tokenize without padding; DataCollatorWithPadding pads each training batch dynamically
        tokenized = self.tokenizer(texts, max_length=self.max_length, truncation=True)
        
        # Map the winner column to class ids: 'model_a' -> 0, 'model_b' -> 1
        labels = []
        winners = batch.get("winner", [])
        
        for winner in winners:
            if winner == 'model_a':
                label = 0
            elif winner == 'model_b':
                label = 1
            else:
                # Skipping a row here would leave `labels` shorter than the tokenized
                # batch and break `ds.map`, so fail loudly on unexpected values instead
                raise ValueError(f"Unexpected winner value: {winner!r}")
                
            labels.append(label)
        
        # Return tokenized output with labels
        return {**tokenized, "labels": labels}

    @staticmethod
    def process_text(text: str) -> str:
        return text.replace("null", "").strip()
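
To see the exact string the tokenizer receives, here is a standalone sketch of the same template applied to one made-up row. `build_text` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
def clean(text: str) -> str:
    """Same cleaning as `process_text`: drop the substring "null" and trim whitespace."""
    return text.replace("null", "").strip()


def build_text(prompt: str, response_a: str, response_b: str) -> str:
    """Concatenate one example into the single string passed to the tokenizer."""
    return (
        "<prompt>: " + clean(prompt)
        + "\n\n<response_a>: " + clean(response_a)
        + "\n\n<response_b>: " + clean(response_b)
    )


# Made-up example row
text = build_text("What is 2 + 2?", " 4 ", "It is five.")
print(text)
```

Two things are worth keeping in mind. First, `replace("null", "")` removes the substring anywhere in the text, so a word like "nullable" becomes "able". Second, with `truncation=True` and the default right-side truncation, tokens beyond `max_length` are dropped from the end, so `response_b` is the first part to be cut for long examples.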


In [None]:
print(ds[0])  # Check a single example


In [None]:
encode = CustomTokenizer(tokenizer, max_length=config.max_length)
ds = ds.map(encode, batched=True)

### Compute metrics

We'll compute the log-loss used in LB and accuracy as an auxiliary metric.
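
As a dependency-free illustration of the two metrics, the sketch below turns raw logits into probabilities with a softmax and then computes the mean negative log-likelihood and the accuracy. The logits and labels are made up, and the helper names are hypothetical. The notebook itself relies on `torch` and `sklearn` for this.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss_and_accuracy(logits: list[list[float]], labels: list[int]) -> tuple[float, float]:
    """Mean negative log-likelihood of the true class, and top-1 accuracy."""
    probs = [softmax(row) for row in logits]
    loss = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    acc = sum(int(pred == y) for pred, y in zip(preds, labels)) / len(labels)
    return loss, acc


# Made-up logits for three examples and two classes
logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
labels = [0, 1, 0]
loss, acc = log_loss_and_accuracy(logits, labels)
print(f"log_loss={loss:.4f}  acc={acc:.4f}")
```

Log loss penalizes confident wrong predictions heavily: the third example, where the model prefers the wrong class, contributes far more to the loss than the first.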

In [None]:
def compute_metrics(eval_preds: EvalPrediction) -> dict:
    preds = eval_preds.predictions
    labels = eval_preds.label_ids
    probs = torch.from_numpy(preds).float().softmax(-1).numpy()
    loss = log_loss(y_true=labels, y_pred=probs)
    acc = accuracy_score(y_true=labels, y_pred=preds.argmax(-1))
    return {"acc": acc, "log_loss": loss}

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import EvalPrediction

def compute_metrics(eval_pred: EvalPrediction) -> dict:
    # Extract predictions and labels from the EvalPrediction object
    logits, labels = eval_pred.predictions, eval_pred.label_ids

    # Convert logits to predicted labels (assuming binary classification with logits)
    pred_labels = logits.argmax(axis=-1)  # For multi-class, use argmax along the correct axis

    # Calculate accuracy and other metrics
    accuracy = accuracy_score(labels, pred_labels)

    # Return the metrics as a dictionary
    return {
        "accuracy": accuracy,
    }


### Split

Here, the train and eval sets are split by row index modulo 5 (`i % config.n_splits`).
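
A toy version of the modulo split, on 10 rows instead of the real dataset, shows which indices land in each subset. `make_folds` is a hypothetical wrapper around the same list comprehension used in the cell below.

```python
def make_folds(n_rows: int, n_splits: int) -> list[tuple[list[int], list[int]]]:
    """Return one (train_idx, eval_idx) pair per fold, split by index modulo n_splits."""
    return [
        (
            [i for i in range(n_rows) if i % n_splits != fold],
            [i for i in range(n_rows) if i % n_splits == fold],
        )
        for fold in range(n_splits)
    ]


folds = make_folds(n_rows=10, n_splits=5)
train_idx, eval_idx = folds[0]
print(train_idx)  # [1, 2, 3, 4, 6, 7, 8, 9]
print(eval_idx)   # [0, 5]
```

Every row appears in the eval set of exactly one fold, and with `fold_idx = 0` one fifth of the rows are held out.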

In [None]:
folds = [
    (
        [i for i in range(len(ds)) if i % config.n_splits != fold_idx],
        [i for i in range(len(ds)) if i % config.n_splits == fold_idx]
    ) 
    for fold_idx in range(config.n_splits)
]

In [None]:
train_idx, eval_idx = folds[config.fold_idx]

trainer = Trainer(
    args=training_args, 
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds.select(train_idx),
    eval_dataset=ds.select(eval_idx),
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()