# Fine-Tuning of the Gemma-2b Model for Data Science Question Generation

**Abstract:** This project details the fine-tuning of the Gemma-2b language model. The primary objective is to enhance the model's capability to generate relevant and insightful questions within the domain of data science and machine learning.

**Model Selection:** Gemma-2b was selected because, at roughly 2.5 billion parameters, it is small enough to fine-tune on a single GPU once quantized, while still being capable enough for domain-specific question generation. This keeps computational cost low without giving up the performance this application needs.

---

## 1. Environment Setup

### 1.1. Dependency Installation

The following libraries are required for the experiment and are installed in this section:

* `bitsandbytes`: For model quantization, which reduces the model's memory footprint and computational demand.
* `trl`: The Transformer Reinforcement Learning library, utilized for its `SFTTrainer` to facilitate supervised fine-tuning.
* `tensorboard`: To monitor and visualize the training process and model performance metrics.
* `jupyter_tensorboard`: To integrate TensorBoard with the Jupyter environment.

In [None]:
%pip install bitsandbytes --upgrade --no-cache-dir
%pip install trl
%pip install tensorboard
%pip install jupyter_tensorboard

### 1.2. Hugging Face Authentication

Authentication with Hugging Face is a prerequisite for downloading the Gemma-2b model because it is a gated model: you must accept the license terms on the model page and then log in with a personal access token. For detailed instructions on token generation, please refer to the [Hugging Face documentation](https://huggingface.co/docs/hub/en/security-tokens).

In [None]:
from huggingface_hub import login

hf_token = "hf_token"  # placeholder: replace with your own access token
login(token=hf_token)
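
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token has been exported as the `HF_TOKEN` environment variable (the name `huggingface_hub` itself looks for). The helper `read_hf_token` is a hypothetical name introduced only for illustration:

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    Illustrative helper; it is not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Usage in the notebook (requires huggingface_hub):
#   from huggingface_hub import login
#   login(token=read_hf_token())
```

Failing early with a clear message is arguably preferable to passing an empty token to `login` and hitting an authorization error later, during the model download.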

## 2. Model Configuration and Loading

### 2.1. Quantization Configuration

To optimize the model for training, we employ 4-bit quantization using the `bitsandbytes` library. The specific configurations are as follows:

* **`load_in_4bit=True`**: This enables the loading of the model with 4-bit precision.
* **`bnb_4bit_use_double_quant=True`**: Double quantization is applied for enhanced memory efficiency.
* **`bnb_4bit_quant_type="nf4"`**: The "Normal Float 4" (nf4) quantization type is used, which is a 4-bit data type optimized for normally distributed weights.
* **`bnb_4bit_compute_dtype=torch.bfloat16`**: The computation is performed using the `bfloat16` data type for a balance of precision and performance.

In [8]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
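
To give a sense of what 4-bit loading buys, the arithmetic below compares weight storage at 16 and 4 bits per parameter. It is a rough sketch under stated assumptions: the base parameter count is taken as the total reported in section 4.4 minus the LoRA adapter parameters. It also ignores tensors that stay in higher precision (embeddings and normalization weights, for example), quantization constants, activations, and optimizer state, so the real saving is smaller than the ideal 4x.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# ASSUMPTION: base parameter count = total from section 4.4 minus the
# LoRA adapter parameters reported there.
BASE_PARAMS = 2_584_619_008 - 78_446_592

fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
nf4_gib = weight_memory_gib(BASE_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {nf4_gib:.2f} GiB (idealized, before overheads)")
```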

### 2.2. Model and Tokenizer Loading

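The pre-trained Gemma-2b model (specifically the instruction-tuned `google/gemma-2b-it` checkpoint) and its corresponding tokenizer are loaded from the Hugging Face model hub, with the previously defined quantization configuration applied. A short generation call then shows how the base model responds before any fine-tuning.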

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map={"":0}
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", add_eos_token=True)

input_text = "Ask a question about Overfitting/Underfitting."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

## 3. Data Preparation

### 3.1. Dataset Loading and Preprocessing

The dataset for fine-tuning is loaded from a JSON file. A preprocessing step reformats each record into a `prompt` and `completion` pair, which is one of the dataset formats that `SFTTrainer` accepts.

In [None]:
import json
import pandas as pd

dataset = 'assets/dataset/dataset_5k.json'

with open(dataset, 'r', encoding='utf-8') as f:
    data = json.load(f)

converted_data = []

for item in data:
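    converted_item = {
        # Field mapping not recoverable here. Each item must supply the keys
        # "prompt" and "completion", as in the example output below.
    }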
    converted_data.append(converted_item)

print("Example:")
print(json.dumps(converted_data[0], indent=2, ensure_ascii=False))

output_file = '/content/dataset_2k_prompt_completion.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(converted_data, f, indent=2, ensure_ascii=False)

main_df = pd.read_json('/content/dataset_2k_prompt_completion.json')

Example:
{
  "prompt": "Act as a machine learning interviewer and formulate a question on Overfitting/Underfitting.",
  "completion": "What is the role of complexity in model performance?"
}
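
The conversion step can be sketched as follows. The input field names `topic` and `question`, the prompt wording held in `PROMPT_TEMPLATE`, and the helper name `to_prompt_completion` are all assumptions made for illustration; the raw dataset's actual schema may differ and should be substituted.

```python
import json

# ASSUMPTION: each raw record has "topic" and "question" fields. These names
# are hypothetical; adapt them to the real schema of the JSON file.
raw_records = [
    {
        "topic": "Overfitting/Underfitting",
        "question": "What is the role of complexity in model performance?",
    },
]

PROMPT_TEMPLATE = (
    "Act as a machine learning interviewer and formulate a question on {topic}."
)


def to_prompt_completion(record: dict) -> dict:
    """Map one raw record to the prompt/completion structure."""
    return {
        "prompt": PROMPT_TEMPLATE.format(topic=record["topic"]),
        "completion": record["question"],
    }


converted = [to_prompt_completion(r) for r in raw_records]
print(json.dumps(converted[0], indent=2, ensure_ascii=False))
```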


### 3.2. Dataset Splitting

The preprocessed dataset is split into three subsets in an 80/10/10 ratio:

* **Training Set:** Used to train the model.
* **Validation Set:** Used to evaluate the model's performance during training and to tune hyperparameters.
* **Test Set:** Reserved for the final evaluation of the model's performance after training is complete.

In [11]:
from sklearn.model_selection import train_test_split

train_data, temp_data = train_test_split(main_df, test_size=0.2, random_state=42)

eval_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)
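
The two calls above first hold out 20% of the rows and then halve that holdout, which yields roughly 80/10/10. The same idea can be sketched with the standard library only. This is not scikit-learn's implementation, and the partition sizes may differ by one row for dataset sizes that are not divisible by ten; `split_80_10_10` is a hypothetical helper name.

```python
import random


def split_80_10_10(rows: list, seed: int = 42) -> tuple[list, list, list]:
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_eval = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    eval_ = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, eval_, test


train, eval_, test = split_80_10_10(list(range(1000)))
print(len(train), len(eval_), len(test))

# The three partitions must not share any example.
overlap = (set(train) & set(eval_)) | (set(train) & set(test)) | (set(eval_) & set(test))
print("overlap:", len(overlap))
```

Checking for overlap is worth doing once after any split, because a validation or test set that shares rows with the training set gives optimistic loss values.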

## 4. Model Training

### 4.1. Preparing the Model for K-bit Training

The model is prepared for k-bit training using the `peft` (Parameter-Efficient Fine-Tuning) library. This step enables gradient checkpointing to reduce memory usage during the training process.

In [None]:
from peft import PeftModel, prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

### 4.2. Identifying Target Layers for LoRA

For the LoRA (Low-Rank Adaptation) fine-tuning technique, we need to identify the specific layers of the model that will be adapted. This typically involves targeting the linear (fully connected) layers of the transformer architecture. The following code identifies all 4-bit linear modules in the model that will be targeted by LoRA.

In [13]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Exclude the output head from the LoRA targets (checked once, after the loop).
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')

    return list(lora_module_names)

modules = find_all_linear_names(model)
print(modules)

['down_proj', 'gate_proj', 'o_proj', 'v_proj', 'k_proj', 'up_proj', 'q_proj']


### 4.3. LoRA Configuration

The LoRA configuration is defined with the following parameters:

* **`r` (Rank):** 64
* **`lora_alpha` (Alpha):** 32
* **`target_modules`:** The list of linear layers identified in the previous step.
* **`lora_dropout`:** 0.05
* **`bias`:** "none"
* **`task_type`:** "CAUSAL_LM"

This configuration is then applied to the model using `get_peft_model` from the `peft` library.

In [14]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
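
For intuition about what `r` and `lora_alpha` control: LoRA keeps the original weight `W` frozen and adds a low-rank update `B A` scaled by `lora_alpha / r`, which is 32 / 64 = 0.5 for the configuration above. The toy sketch below uses tiny hand-written matrices and pure Python; the helper names and all numeric values are illustrative assumptions, not part of `peft`.

```python
def matvec(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, r: int, lora_alpha: int) -> list[float]:
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with r = 2 (the notebook uses r = 64).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
A = [[0.5, 0.5, 0.0], [0.0, 1.0, 1.0]]
B_zero = [[0.0, 0.0], [0.0, 0.0]]      # B starts at zero, so the adapter is a no-op
B_trained = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical values after training
x = [1.0, 2.0, 3.0]

out_initial = lora_forward(W, A, B_zero, x, r=2, lora_alpha=1)
out_trained = lora_forward(W, A, B_trained, x, r=2, lora_alpha=1)
print(out_initial)  # → [7.0, 5.0], identical to W x
print(out_trained)  # → [7.75, 10.0], W x plus the scaled low-rank update
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the difference.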

### 4.4. Verifying Trainable Parameters

After applying LoRA, we can verify the number of trainable parameters.

In [15]:
trainable, total = model.get_nb_trainable_parameters()
print(f'Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%')

Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
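
The trainable count reported above can be reproduced by hand. A LoRA adapter on a linear layer mapping `d_in` to `d_out` features adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions taken from the published Gemma-2b configuration (hidden size 2048, MLP size 16384, 18 decoder layers, 8 query heads of dimension 256, and a single key/value head):

```python
R = 64
NUM_LAYERS = 18
HIDDEN = 2048
INTERMEDIATE = 16384
Q_DIM = 8 * 256    # 8 query heads, head dimension 256
KV_DIM = 1 * 256   # a single key/value head (multi-query attention)

# (d_in, d_out) for every module targeted by LoRA in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in target_shapes.values())
trainable = per_layer * NUM_LAYERS
total = 2_584_619_008  # total reported by the notebook

print(trainable)                         # → 78446592
print(f"{trainable / total * 100:.4f}%")  # → 3.0351%
```

The MLP projections account for most of the adapter parameters, so restricting `target_modules` to the attention projections would shrink the trainable set considerably.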


### 4.5. Training Arguments and Trainer Initialization

The training arguments are defined using the `TrainingArguments` class from the `transformers` library. These arguments specify various training parameters such as batch size, learning rate, and logging strategy. The `SFTTrainer` is then initialized with the model, datasets, LoRA configuration, and training arguments.

In [None]:
from trl import SFTTrainer
import transformers
from transformers import TrainingArguments
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_data)
eval_dataset = Dataset.from_pandas(eval_data)  # validation split only; temp_data also contains the test set

print(train_data.columns)

print('prompt' in train_data.columns)

training_arguments = TrainingArguments(
    output_dir="gemma-2b-FT_results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    eval_strategy="steps",
    eval_steps=10,
    num_train_epochs=1,  # has no effect here: a positive max_steps overrides it
    max_steps=250,
    fp16=True,
    logging_dir='./logs',
    logging_steps=10,
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    args=training_arguments
)
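
The arguments above fix the training budget in optimizer steps rather than epochs, so it helps to work out how much data the run actually sees. The sketch below does that arithmetic. The training set size of 4,000 rows is a hypothetical value (80% of a 5,000-row file), and `training_budget` is an illustrative helper, not a `transformers` function.

```python
def training_budget(
    max_steps: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    eval_steps: int,
    train_set_size: int,
    num_devices: int = 1,
) -> dict:
    """Summarize how many examples a step-limited run processes."""
    examples_per_step = per_device_batch_size * grad_accum_steps * num_devices
    examples_seen = max_steps * examples_per_step
    return {
        "examples_per_step": examples_per_step,
        "examples_seen": examples_seen,
        "epochs_covered": examples_seen / train_set_size,
        "num_evaluations": max_steps // eval_steps,
    }


# ASSUMPTION: 4,000 training rows; replace with len(train_dataset).
budget = training_budget(
    max_steps=250,
    per_device_batch_size=8,
    grad_accum_steps=1,
    eval_steps=10,
    train_set_size=4000,
)
print(budget)
```

Under that assumption the run covers about half an epoch and evaluates 25 times, so a large validation set adds noticeable overhead at `eval_steps=10`.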

### 4.6. Model Training

The training process is initiated by calling the `train()` method on the `trainer` object. The training progress, including training and validation loss, is logged to TensorBoard.

To view the training logs in Colab, run the following notebook magics:

- `%load_ext tensorboard`
- `%tensorboard --logdir ./logs`


In [None]:
model.config.use_cache = False
trainer.train()

In [19]:
new_model = "Gemma-2b_interview-FT"

## 5. Model Saving and Merging

### 5.1. Saving the Fine-Tuned Model

The fine-tuned model adapters are saved to the specified directory.

In [20]:
trainer.model.save_pretrained(new_model)

In [21]:
import gc
gc.collect()

1517

In [22]:
torch.cuda.empty_cache()
gc.collect()

0

### 5.2. Merging the Model with Base Model

The fine-tuned LoRA adapters are merged with the base Gemma-2b model to create a standalone, fine-tuned model. The merged model is then saved to a new directory.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

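# Attach the saved LoRA adapters to the half-precision base model, then fold them in.
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the standalone model together with its tokenizer.
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.padding_side = "right"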
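
Conceptually, merging folds the scaled low-rank product into the frozen weight, `W_merged = W + (lora_alpha / r) * B A`, so inference no longer needs the adapter. The pure-Python sketch below shows this on toy matrices; the helper names and all values are hypothetical, and this is not how `peft` implements the merge internally.

```python
def matmul(X: list[list[float]], Y: list[list[float]]) -> list[list[float]]:
    """Plain matrix product of X (n x k) and Y (k x m)."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def merge_lora(W, A, B, r: int, lora_alpha: int) -> list[list[float]]:
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with r = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
A = [[0.5, 0.5, 0.0], [0.0, 1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]

merged = merge_lora(W, A, B, r=2, lora_alpha=1)
x = [1.0, 2.0, 3.0]
merged_out = [sum(w * xi for w, xi in zip(row, x)) for row in merged]

print(merged)      # → [[1.25, 0.25, 2.0], [0.0, 2.0, 2.0]]
print(merged_out)  # → [7.75, 10.0]
```

The merged matrix has the same shape as `W`, which is why the merged model can be saved and loaded like an ordinary checkpoint.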

## 6. Model Evaluation

A simple test is conducted to evaluate the performance of the fine-tuned model on a sample prompt. This provides a qualitative assessment of the model's ability to generate relevant questions.

In [66]:
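def get_completion(query: str, model, tokenizer) -> str:
  """Generate a response to `query` and return the decoded text (prompt included)."""
  device = "cuda:0"

  prompt_template = """
  <start_of_turn>
   user
   {query}
   <end_of_turn>\n
   <start_of_turn>model
  """

  prompt = prompt_template.format(query=query)

  # Tokenize the prompt and move the tensors to the GPU.
  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encodeds.to(device)

  # Sampling is enabled, so the output varies between runs.
  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  # Decode the full sequence; special tokens such as <start_of_turn> are dropped.
  decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

  return decoded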
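
The instruction-tuned Gemma checkpoints use a turn format in which the role name follows `<start_of_turn>` directly and every turn ends with `<end_of_turn>` and a newline. The template above inserts extra whitespace around those markers, which the model may tolerate but which differs from that format. A minimal sketch of a stricter prompt builder follows; `build_gemma_prompt` is a hypothetical helper name, and in practice `tokenizer.apply_chat_template` can build the prompt from a list of messages instead.

```python
def build_gemma_prompt(query: str) -> str:
    """Format a single user turn and open the model turn.

    The beginning-of-sequence token is left to the tokenizer
    (add_special_tokens=True), so it is not included here.
    """
    return (
        "<start_of_turn>user\n"
        f"{query.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_gemma_prompt("Ask a question about Overfitting/Underfitting.")
print(prompt)
```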

In [67]:
result = get_completion(query="Ask a question about Overfitting/Underfitting.", model=merged_model, tokenizer=tokenizer)
print(result)



  
   user
   Ask a question about Overfitting/Underfitting.
   

   model
  What causes overfitting and how can it be prevented?
