# Introduction
## Gemma LoRA Fine-tuning for Beginners with Hugging Face

In this notebook, we'll learn the basics of fine-tuning the Gemma model with Hugging Face libraries. It sticks to the simplest workflow, with no complex processing. As a practical exercise, we fine-tune Gemma, a Large Language Model (LLM), with LoRA so that it answers Python questions in the style of a Q&A dataset.

### Table of Contents:
 
1. What is Gemma?<br>
2. Package Installation and Importing<br>
3. Data Loading <br>
4. Data Preprocessing for Training<br>
5. Loading the Gemma Model<br>
6. Q & A Results Before Finetuning<br>
7. Applying Gemma LoRA<br>
8. Training Gemma<br>
9. Q & A Results After Finetuning<br>
10. Conclusion<br>

### Dataset Used
- [Dataset_Python_Question_Answer](https://www.kaggle.com/datasets/chinmayadatt/dataset-python-question-answer): This dataset covers Python programming. Its questions and answers were generated using Gemma, and it contains more than four hundred question-answer pairs.

---

# 1. What is Gemma?

Gemma is a family of lightweight, open-weight large language models from Google. This section introduces the basics of Gemma, its use cases, and why it can be useful for your projects.

### Summary

- Gemma models are built from the ground up to be lightweight and state-of-the-art. They are text-to-text, decoder-only large language models, available primarily in English.
- They come with open weights, offering both pre-trained and instruction-tuned variants to suit a wide array of text generation tasks.
- Ideal for applications such as question answering, summarization, and reasoning, Gemma models can be deployed on relatively modest hardware, including laptops and desktops, or within your own cloud infrastructure.

### Description

- **Lightweight and Open**: Gemma models are designed to be both powerful and accessible, embodying Google's commitment to democratizing state-of-the-art AI technology.
- **Versatile Applications**: Whether it's generating answers to questions, summarizing documents, or facilitating complex reasoning tasks, Gemma models are equipped to handle a diverse set of challenges.
- **Democratizing AI**: By making Gemma models lightweight and open, Google ensures that cutting-edge AI technology is no longer confined to those with access to extensive computational resources.

### Inputs and Outputs

- **Input**: Gemma models take in text strings, which can range from questions and prompts to longer documents that require summarization.
- **Output**: In response, they generate text in English, offering answers, summaries, or other forms of text-based output, tailored to the input provided.


# 2. Package Installation and Importing

Before we start, it's essential to install all necessary packages. The Gemma weights themselves are not installed with pip; they are added as a Kaggle input in section 5. This part covers the installation step by step.

In [28]:
# Quietly install pinned versions of transformers, accelerate, bitsandbytes, and peft, plus the latest evaluate.
!pip install -q -U evaluate transformers==4.38.0 accelerate==0.27.2 bitsandbytes==0.42.0 peft==0.8.2

# Quietly install pinned versions of the trl and datasets packages.
!pip install -U -q trl==0.7.11 datasets==2.17.1
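
After installing, it can help to confirm which versions actually ended up in the environment, since Kaggle images ship with their own preinstalled packages. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip commands above.
pinned = ["peft", "transformers", "accelerate", "bitsandbytes", "trl", "datasets"]
for name, ver in installed_versions(pinned).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a reported version differs from the pinned one, restarting the notebook session after installation is usually the first thing to try.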

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
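
The warning above is harmless, and it names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming parallel tokenization is not needed for a dataset of this size:

```python
import os

# Set this early, before a tokenizer is first used, so that later process forks
# (for example the ones started by `!pip`) no longer trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```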


### Package Description

#### Basic modules (standard library and numerical libraries)
- `os`: Provides ways to interact with the operating system and its environment variables.
- `torch`: PyTorch library for deep learning applications.
- `numpy`: Essential library for linear algebra and mathematical operations.
- `pandas`: Powerful data processing tool, ideal for handling CSV files and other forms of structured data.

#### transformers module
- `AutoTokenizer`: Used to automatically load a pre-trained tokenizer.
- `AutoModelForCausalLM`: Used to automatically load pre-trained models for causal language modeling.
- `BitsAndBytesConfig`: Configuration class for loading a model in quantized form (8-bit or 4-bit) through the `bitsandbytes` backend.
- `AutoConfig`: Used to automatically load the model's configuration.
- `TrainingArguments`: Defines arguments for training setup.

#### datasets module
- `Dataset`: A class for handling datasets.

#### peft module
- `LoraConfig`: A configuration class for the LoRA adapter (rank, target modules, task type).
- `PeftModel`: The class that wraps a base model together with its PEFT adapter.
- `prepare_model_for_kbit_training`: A function that prepares a model for k-bit training.
- `get_peft_model`: Function to get the PEFT model.

#### trl module
- `SFTTrainer`: Trainer class for SFT (Supervised Fine-Tuning) training.

#### IPython.display module
- `Markdown`: Used to output text in Markdown format.
- `display`: Used to display objects in Jupyter notebooks.

In [29]:
import os
import torch

import numpy as np
import pandas as pd

from transformers import (AutoTokenizer, 
                          AutoModelForCausalLM, 
                          BitsAndBytesConfig, 
                          AutoConfig,
                          TrainingArguments)

from datasets import Dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from IPython.display import Markdown, display

In [30]:
# Disable the CA bundle check. This can work around SSL errors in some environments, but it may weaken certificate verification, so use it only on a network you trust.
os.environ['CURL_CA_BUNDLE'] = ''

# Set the order of devices as seen by CUDA to PCI bus ID order. This is to ensure consistency in device selection.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Check if CUDA is available, and if so, specify which GPU(s) to be made visible to the process.
if torch.cuda.is_available():
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Only set this if CUDA is available.
    print("CUDA is available")
else:
    print("CUDA is not available")

CUDA is available


Wandb (Weights & Biases) is a tool for tracking and visualizing Machine Learning experiments. It helps you easily manage metrics, hyperparameters, experiment code, and model artifacts during model training.<br>
<a href="https://github.com/wandb/wandb">wandb github</a>

In [31]:
# Wandb for experiment tracking
import wandb

# Initialize Weights & Biases (wandb) for experiment tracking.
# If a wandb account exists, it can typically be used by specifying project and entity.
# However, for this example, we're disabling wandb to ignore it by setting mode to "disabled".
wandb.init(mode="disabled")



# 3. Data Loading

Loading your data is the first step in the machine learning pipeline. This section will guide you through loading your dataset into the Jupyter notebook environment.

### To download a dataset, follow these simple steps:
1. Look for the "Input" option located below the "Notebook" section in the right-side menu.
2. Click on the "+ Add Input" button.
3. In the search bar that appears, type "dataset-python-question-answer".
4. Find the dataset in the search results and click the "+" button to add it to your notebook. This action will automatically download the dataset for you.

In [32]:
# The necessary packages `os` and `pandas` are required for this section of the code. 
# However, they have already been imported in the "2. Package Installation" section, so their import statements are omitted here to avoid redundancy.

# Define the filename of the target dataset (Python questions and answers).

target_filename = 'Dataset_Python_Question_Answer.csv'

# Initialize a variable to hold the full path to the target CSV file.
csv_file_path = None

# Walk through the directory structure starting from '/kaggle/input'.
for dirname, _, filenames in os.walk('/kaggle/input'):
    # Check if the target filename is present in the current directory's list of files.
    if target_filename in filenames:
        # Construct the full path to the target file and update the csv_file_path variable.
        csv_file_path = os.path.join(dirname, target_filename)
        break  # Exit the loop after finding the target file.
        
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Check if the specific CSV file's path has been found.
if csv_file_path:
    print(csv_file_path)
else:
    # Print an error message if the specific CSV file was not found.
    # Also, suggest checking the 'Input' menu to ensure the file has been properly added.
    print(f"The specified file '{target_filename}' was not found. Please ensure the file has been correctly added to the 'Input' menu on the right.")

/kaggle/input/dataset-python-question-answer/Dataset_Python_Question_Answer.csv
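
The same search can be written more compactly with `pathlib`. This is only an alternative sketch, not the notebook's method; the helper name `find_file` is hypothetical, and the root directory is the Kaggle input path used above.

```python
from pathlib import Path


def find_file(root, filename):
    """Return the first path under `root` whose name equals `filename`, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    return next((p for p in root.rglob(filename) if p.is_file()), None)


csv_path = find_file("/kaggle/input", "Dataset_Python_Question_Answer.csv")
print(csv_path if csv_path else "Dataset not found; check the 'Input' menu.")
```

Returning `None` instead of raising keeps the behaviour close to the loop above, which also leaves the path unset when nothing is found.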


In [33]:
# Load the dataset from the identified CSV file.
# csv_file_path = "/kaggle/input/dataset-python-question-answer/Dataset_Python_Question_Answer.csv"
original_data = pd.read_csv(csv_file_path)

# Print the shape of the dataset to understand its dimensions (number of rows and columns).
print('original_data shape:',original_data.shape)

original_data shape: (419, 2)


In [34]:
# Display a random sample of 2 rows from the original_data to get a quick overview of the data.
original_data.sample(2)

Unnamed: 0,Question,Answer
203,What is recursion with a default case in Pyth...,"[""Sure. Here's an example of recursion with a ..."
278,What is the difference between a dictionary a...,"[""Sure. Here's the difference between a dictio..."


# 4. Data Preprocessing for Training

Before training Gemma, we need to prepare the dataset. The goal is to turn each row into one training example by combining the 'Question' and 'Answer' columns, which are later formatted with a prompt template.

An important part of preprocessing is managing example length. Long examples need more GPU memory and more time per step, so very long rows can make training on modest hardware impractical. To keep resource usage low, we measure the combined character length of each question and answer and drop the longer half of the dataset. Note that these lengths are counted in characters, not tokens; the token limit itself is applied later through `max_seq_length` in the trainer.
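
To make the filtering idea concrete before applying it with pandas, here is a small sketch in plain Python. The toy rows and the helper name `keep_shorter_half` are hypothetical; only the rule (keep rows at or below the median combined length) mirrors what this section does.

```python
from statistics import median

# Toy rows standing in for the real dataset (hypothetical values).
rows = [
    {"Question": "What is a list?", "Answer": "A mutable sequence."},
    {"Question": "What is a tuple?", "Answer": "An immutable sequence of items."},
    {"Question": "What is a dict?", "Answer": "A mapping from keys to values, " * 5},
]


def keep_shorter_half(rows):
    """Keep rows whose combined question+answer character length is at or below the median."""
    lengths = [len(r["Question"]) + len(r["Answer"]) for r in rows]
    threshold = median(lengths)
    return [r for r, n in zip(rows, lengths) if n <= threshold]


kept = keep_shorter_half(rows)
print(len(rows), "->", len(kept))
```

Because rows equal to the median are kept, slightly more than half of the rows can survive when the row count is odd.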

In [35]:
question_column = "Question"
answer_column = "Answer"

# Calculate the combined character length of each 'Question' and 'Answer' and add it as a new column
original_data['text_length'] = original_data[question_column].str.len() + original_data[answer_column].str.len()

# Calculate the average combined length in the original dataset
average_length = int(original_data['text_length'].mean())

# Find the shortest and longest combined lengths in the original dataset
shortest_length = int(original_data['text_length'].min())
longest_length = int(original_data['text_length'].max())

# Print the statistics
print("Average length of 'Question and Answer' in original dataset:", average_length)
print("Shortest length of 'Question and Answer' in original dataset:", shortest_length)
print("Longest length of 'Question and Answer' in original dataset:", longest_length)

Average length of 'Question and Answer' in original dataset: 1708
Shortest length of 'Question and Answer' in original dataset: 139
Longest length of 'Question and Answer' in original dataset: 3511


In [36]:
# Calculate the median length of 'text_length' to set a threshold for filtering
median_text_length_threshold = int(original_data['text_length'].quantile(0.5))

# Retain only rows where 'text_length' is less than or equal to the median text length
filtered_data = original_data[original_data['text_length'] <= median_text_length_threshold]

# Output the number of entries before and after filtering to assess the impact
print("Number of entries before filtering:", len(original_data))
print("Number of entries after filtering:", len(filtered_data))

print("---"*10)

# Determine the maximum 'text_length' in the filtered dataset
max_text_length_in_filtered_data = int(filtered_data['text_length'].max())

# Compare the maximum 'text_length' before and after filtering
print(f"Maximum text length before filtering: {longest_length}\nMaximum text length after filtering: {max_text_length_in_filtered_data}")


Number of entries before filtering: 419
Number of entries after filtering: 210
------------------------------
Maximum text length before filtering: 3511
Maximum text length after filtering: 1754


In [37]:
# Display a random sample of 2 rows from the filtered_data to get a quick overview of the data.
filtered_data.sample(2)

Unnamed: 0,Question,Answer,text_length
233,Define a function that takes a list of string...,"['```python', 'def remove_first_letter(strings...",1520
256,Define a function that takes a string and two...,"['```python', 'def get_first_letter(text, num1...",537


Furthermore, the Hugging Face training tools expect the data as a `datasets.Dataset` object rather than a pandas DataFrame. The `Dataset` class, imported earlier with `from datasets import Dataset`, provides `Dataset.from_pandas` for this conversion. Note that the conversion keeps the DataFrame index as an extra column named `__index_level_0__`, which is visible in the output below.
<a href="https://huggingface.co/docs/transformers/index">Transformers documentation</a>


In [38]:
import random

# Convert dataset to Dataset object
dataset = Dataset.from_pandas(filtered_data)

# Print the dataset structure (column names and number of rows)
print("<Data structure>")
print(dataset)

# Generate a random index based on the dataset length
random_index = random.randint(0, len(dataset) - 1)

# Print a random sample of the dataset
print("\n\n<Random sample dataset>")
print("\n- Question:", dataset[random_index][question_column])
print("\n- Answer:", dataset[random_index][answer_column])

<Data structure>
Dataset({
    features: ['Question', 'Answer', 'text_length', '__index_level_0__'],
    num_rows: 210
})


<Random sample dataset>

- Question:  What is the difference between a tuple and a list in Python?

- Answer: ["Sure, here's the difference between a tuple and a list in Python.", '**Tuples** are unordered collections of elements that are defined at the time of creation. They are created using parentheses, and the elements are separated by commas, with each element on a separate line. Tuples are immutable, meaning their contents cannot be changed after they are created.', '**Lists** are ordered collections of elements that are defined at the time of creation. They are created using square brackets, and the elements are separated by commas, with each element on a separate line. Lists are mutable, meaning their contents can be changed after they are created.', "Here's an example to illustrate the difference between tuples and lists:", '```python', 'tuple1 = (1, 2, 3
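
The sample above suggests that each answer is stored as the string form of a Python list of lines, so the model is trained on text that includes brackets and quotes. If plain text is preferred, one option is to parse and join the lines before building the dataset. This is a sketch under that assumption about the data format; the helper name `answer_to_text` is hypothetical, and the notebook itself trains on the raw strings.

```python
import ast


def answer_to_text(raw_answer):
    """Convert a stringified list of lines into plain text; return the input unchanged otherwise."""
    try:
        parsed = ast.literal_eval(raw_answer)
    except (ValueError, SyntaxError, TypeError):
        return raw_answer
    if isinstance(parsed, list) and all(isinstance(item, str) for item in parsed):
        return "\n".join(parsed)
    return raw_answer


sample = "['Sure. Here is an example:', 'x = 1', 'print(x)']"
print(answer_to_text(sample))
```

It could be applied with something like `filtered_data[answer_column].map(answer_to_text)` before the conversion to a `Dataset`.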

# 5. Loading the Gemma Model

Here, we'll cover how to load the Gemma model so it's ready for finetuning. This includes where to download the model from and how to load it into your notebook.

### Adding the Gemma Model
1. Still in the "Input" section of the right-side menu in your Kaggle notebook, click on the "+ Add Input" button again.
2. Below the search bar that appears, click on the "Models" option.
3. In the search bar, type "Gemma" to find the model.
4. From the filtered results, select the Gemma model by clicking on the "+" button next to it. Make sure to choose the correct version by noting the framework as "Transformers", the variation as "2b-it", and the version as "v3".
5. After selecting the correct Gemma model, click on "Add Model" at the bottom.
6. The Gemma model, specifically "Gemma.v3", should now be listed under the "Models" subsection of the "Input" section in the right-side menu of your notebook, indicating successful addition.

In [39]:
# Check if gemma/transformers/2b-it/3 exists.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/data-assistants-with-gemma/submission_categories.txt
/kaggle/input/data-assistants-with-gemma/submission_instructions.txt
/kaggle/input/dataset-python-question-answer/Dataset_Python_Question_Answer.csv
/kaggle/input/gemma/transformers/2b-it/3/model.safetensors.index.json
/kaggle/input/gemma/transformers/2b-it/3/gemma-2b-it.gguf
/kaggle/input/gemma/transformers/2b-it/3/config.json
/kaggle/input/gemma/transformers/2b-it/3/model-00001-of-00002.safetensors
/kaggle/input/gemma/transformers/2b-it/3/model-00002-of-00002.safetensors
/kaggle/input/gemma/transformers/2b-it/3/tokenizer.json
/kaggle/input/gemma/transformers/2b-it/3/tokenizer_config.json
/kaggle/input/gemma/transformers/2b-it/3/special_tokens_map.json
/kaggle/input/gemma/transformers/2b-it/3/.gitattributes
/kaggle/input/gemma/transformers/2b-it/3/tokenizer.model
/kaggle/input/gemma/transformers/2b-it/3/generation_config.json


### BitsAndBytesConfig Overview

`BitsAndBytesConfig` is a configuration class provided by the `transformers` library, which is designed for controlling the behavior of model quantization and optimization during both the training and inference phases of model deployment. Quantization is a technique used to reduce the memory footprint and computational requirements of deep learning models by representing model weights and activations in lower-precision data types, such as 8-bit integers (`int8`) or even 4-bit representations.
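
To illustrate what "representing weights in a lower-precision data type" means, here is a deliberately simplified sketch of absmax quantization to 8-bit integers in plain Python. It is not the NF4 scheme that `bitsandbytes` applies in this notebook; the function names and the weight values are hypothetical and serve only to show the store-small, rescale-on-use idea.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale factor."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # hypothetical weight values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The rounding error per value is bounded by half the scale factor, which is the price paid for the smaller storage.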

#### Benefits of Quantization

The primary benefits of quantization include:

- **Reduced Memory Usage**: Lower-precision representations require less memory, enabling the deployment of larger models on devices with limited memory capacity.
- **Increased Inference Speed**: Operations with lower-precision data types can be executed faster, thus speeding up the inference time.
- **Energy Efficiency**: Reduced computational requirements translate to lower energy consumption, which is crucial for mobile and embedded devices.
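
The memory benefit is easy to estimate with simple arithmetic. The sketch below counts weight storage only and ignores activations, optimizer state, and quantization metadata. The parameter count is an assumption (a round figure of about 2.5 billion for a "2b" model), and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Assumption: roughly 2.5 billion parameters, a round figure for a "2b" model.
num_parameters = 2_500_000_000
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_parameters, bits):.2f} GB")
```

In practice the saving is smaller than this estimate because some layers, such as the embedding shown as a plain `Embedding` in the model printout, are not stored in 4-bit.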

#### `BitsAndBytesConfig` Parameters

In the context of the `transformers` library, `BitsAndBytesConfig` allows users to configure the quantization behavior specifically for using the `bitsandbytes` backend. Below is an example configuration along with comments explaining each parameter:


In [40]:
# Checking for the available device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Available devices print
print("device:",device)

# Defining the path to the pre-trained model
model_path = "/kaggle/input/gemma/transformers/2b-it/3"

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Defining BitsAndBytesConfig
bnbConfig = BitsAndBytesConfig(
    load_in_4bit=True, # Enable loading of the model in 4-bit quantized format.
    bnb_4bit_quant_type="nf4", # Quantization type. "nf4" is the 4-bit NormalFloat data type introduced with QLoRA.
    bnb_4bit_compute_dtype=torch.bfloat16, # Data type used for computation. Weights are stored in 4-bit and computed in bfloat16.
)

# Loading the model for causal language modeling
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             quantization_config=bnbConfig
                                            )

# No explicit move is needed: device_map="auto" already places the model on the available device.
# model = model.to(device)

device: cuda


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [41]:
# Display the model to inspect its architecture and layer shapes.
model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

### Generating Text with the Gemma Model

This code provides a simple function to generate text with the Gemma model from a given prompt. The function takes the model and tokenizer loaded above, and both the prompt and the extracted output follow a fixed instruction/response template for clarity and consistency.

In [42]:
# Define a template for formatting instructions and responses.
# This template will be used to format the text data in a LLM structure.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
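
As a quick illustration of how the template is used, the sketch below fills it in two ways: once as a complete training example and once as an inference prompt with an empty response. The question and answer strings are hypothetical.

```python
# Same template string as in the notebook cell above.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

# Training example: both fields are filled in.
training_text = template.format(
    instruction="What is a list in Python?",
    response="A list is a mutable, ordered collection.",
)

# Inference prompt: the response is left empty so the model continues after "Response:".
inference_prompt = template.format(instruction="What is a list in Python?", response="")

print(training_text)
print("---")
print(repr(inference_prompt))
```

Using the same template for training and inference matters, because the model learns to continue text that looks exactly like the training examples.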

In [43]:
def generate_response(model, tokenizer, prompt, device, max_new_tokens=128):
    """
    This function generates a response to a given prompt using a specified model and tokenizer.

    Parameters:
    - model (PreTrainedModel): The machine learning model pre-trained for text generation.
    - tokenizer (PreTrainedTokenizer): A tokenizer for converting text into a format the model understands.
    - prompt (str): The initial text prompt to generate a response for.
    - device (torch.device): The computing device (CPU or GPU) the model should use for calculations.
    - max_new_tokens (int, optional): The maximum number of new tokens to generate. Defaults to 128.

    Returns:
    - str: The text generated in response to the prompt.
    """
    # Convert the prompt into a format the model can understand using the tokenizer.
    # The result is also moved to the specified computing device.
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(device)

    # Generate a response based on the tokenized prompt.
    outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=max_new_tokens)

    # Convert the generated tokens back into readable text.
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract and return the response text. This assumes the decoded text contains the "Response:" marker
    # from the template; splitting only on the first marker keeps any later "Response:" inside the answer.
    response_text = text.split("Response:", 1)[1]
    
    return response_text
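
The `[1]` index on the split result raises an `IndexError` if the decoded text does not contain the marker at all. A more defensive variant could look like the sketch below; the helper name `extract_response` is hypothetical and is not used elsewhere in this notebook.

```python
def extract_response(text, marker="Response:"):
    """Return the text after the first `marker`, or the whole text if the marker is missing."""
    _, found, after = text.partition(marker)
    return after.strip() if found else text.strip()


decoded = "Instruction:\nWhat is a tuple?\n\nResponse:\nA tuple is an immutable sequence."
print(extract_response(decoded))
print(extract_response("No marker in this text."))
```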

# 6. Q & A Results Before Finetuning

Before we start the finetuning process, let's see how the Gemma model performs out of the box on our dataset. This section will show you how to run a simple question-answering test.

In [44]:
question = "What is the difference between an expression and an operator?"

prompt = template.format(
    instruction=question,
    response="",
)

response_text = generate_response(model, tokenizer, prompt, device, 128)

Markdown(response_text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Sure. Here's the difference between an expression and an operator:

**Expression:**

* An expression is a combination of numbers, variables, operators, and literals that is evaluated to a single value.
* It is a mathematical or logical statement that represents a numerical value or a truth value.
* An expression can be a single value, a complex expression, or a compound expression.

**Operator:**

* An operator is a symbol or character that is used to combine or modify the values of other operands.
* It is not evaluated to a specific value and is not a numerical expression.
* Operators are used

# 7. Applying Gemma LoRA

In this section, we'll apply LoRA (**Low-Rank Adaptation**) to the **Gemma model**, a method designed to make fine-tuning large models like Gemma both **fast and efficient**. LoRA, one of the **PEFT** (**Parameter-Efficient Fine-Tuning**) methods, freezes the pre-trained weights and trains only small low-rank matrices added to selected layers. This drastically cuts the computational demands and GPU memory needs, and because the trained matrices can be merged into the original weights, inference takes no extra time. Here's what makes LoRA so powerful for our purposes, according to the LoRA paper:

<center><img src="https://cdn-lfs.huggingface.co/datasets/huggingface/documentation-images/4313422c5f2755897fb8ddfc5b99251358f679647ec0f2d120a3f1ff060defe7?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27lora_diagram.png%3B+filename%3D%22lora_diagram.png%22%3B&response-content-type=image%2Fpng&Expires=1713275384&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzI3NTM4NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9odWdnaW5nZmFjZS9kb2N1bWVudGF0aW9uLWltYWdlcy80MzEzNDIyYzVmMjc1NTg5N2ZiOGRkZmM1Yjk5MjUxMzU4ZjY3OTY0N2VjMGYyZDEyMGEzZjFmZjA2MGRlZmU3P3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiZyZXNwb25zZS1jb250ZW50LXR5cGU9KiJ9XX0_&Signature=NAlgCQRn6ktvkOq8WpJkP7DyBvC3ta3Z5gGREWKvLDGQLYpypCszzucGL7nFdzirC4Py9CkgAgkAwbtGAkBU0JvbDVqxIAK9SzpX34xyFmoERdHqH2sQUh17cZ42f60MU9E%7E209I%7Ec6HgUNponN8lhoQzn0jEKYvkzsVsVUPu4OuYONDx4C1tywJIDovcKZCqEQY7f9-OjEKjLPr-CkNymcE%7Eprd83SMPThprA3HVl4gmMbCslQgUM8mM5imHcFxozdbzgD1Mb0U%7El7THXSeBWXdpGdZIBjbJSwJBEEMBtlVbbKtncPTrZWUjrrq03EJJSB7Cc8IA%7EgtJ3cbUerDGw__&Key-Pair-Id=KVTP0A1DKRTAX" width="500"><br/>
Paper: <a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a></center>

- **Dramatically reduces the number of trainable parameters**, by up to **10,000 times**.
- **Cuts down GPU memory usage** during training by up to **three times**.
- **Maintains quick inference times** with **no additional latency**.

The essence of PEFT, and by extension LoRA, is to enhance a model's performance using minimal resources, focusing on fine-tuning a handful of parameters for specific tasks. This technique is particularly advantageous as it:
  
- Optimizes rank decomposition matrices: the original model weights stay frozen while the added low-rank matrices **A** and **B** are trained.
- Lowers training cost substantially; the LoRA paper reports up to a **threefold reduction** in GPU memory.
- Enables easy swapping of the LoRA module (weights **A** and **B**) according to the task at hand, lowering storage requirements and avoiding any increase in inference time.
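
For reference, the update described above can be written in the notation of the LoRA paper. Here $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight, $x$ is the layer input, $r$ is the rank, and $\alpha$ is a scaling hyperparameter (exposed as `lora_alpha` in PEFT):

$$
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so each adapted layer contributes $r(d + k)$ trainable parameters instead of $d \cdot k$.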

When applied to **Transformer architectures**, the LoRA paper adapts only the **attention weights** and keeps the MLP modules frozen. (The configuration below goes further and also targets Gemma's MLP projections.) For GPT-3 175B, the paper reports that LoRA:
  
- **Reduces VRAM usage** from **1.2TB to 350GB**.
- **Lowers checkpoint size** from **350GB to 35MB**.
- **Boosts training speed** by approximately **25%**.

By integrating LoRA into Gemma, we aim to streamline the model's fine-tuning process in this section, making it quicker and more resource-efficient, without compromising on performance.

In [45]:
# LoRA configuration: Sets up the parameters for Low-Rank Adaptation, which is a method for efficient fine-tuning of transformers.
lora_config = LoraConfig(
    r = 8,  # Rank of the adaptation matrices. A lower rank means fewer parameters to train.
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],  # Transformer modules to apply LoRA.
    task_type = "CAUSAL_LM",  # The type of task, here it is causal language modeling.
)
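
To get a feel for how small the adapter is, the number of trainable LoRA parameters can be worked out by hand from the layer shapes shown in the model printout in section 5. The sketch below assumes one `A` and one `B` matrix per targeted layer and no bias terms; the helper name `lora_parameter_count` is hypothetical.

```python
# (in_features, out_features) of each targeted projection, read from the model printout.
projection_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
num_layers = 18  # "18 x GemmaDecoderLayer" in the printout
rank = 8         # r in the LoraConfig above


def lora_parameter_count(shapes, rank, num_layers):
    """Each adapted d_in x d_out layer gains A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_parameter_count(projection_shapes, rank, num_layers)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the adapter has just under ten million trainable parameters, a small fraction of the base model.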

# 8. Training Gemma

Now that everything is set up, it's time to finetune the Gemma model on your data. This section will guide you through the training process, including setting up your training loop and selecting the right hyperparameters.

In [46]:
def formatting_func(example):
    """
    Formats a given example (a dictionary containing question and answer) using the predefined template.
    
    Parameters:
    - example (dict): A dictionary with keys corresponding to the columns of the dataset, such as 'Question' and 'Answer'.
    
    Returns:
    - list: A list containing a single formatted string that combines the instruction and the response.
    """
    # Format the text with the template, using the example's question as the instruction and its answer as the response.
    line = template.format(instruction=example[question_column], response=example[answer_column])
    return [line]


In [47]:
# Setup for the trainer object that will handle fine-tuning of the model.
trainer = SFTTrainer(
    model=model,  # The pre-trained model to fine-tune.
    train_dataset=dataset,  # The dataset used for training.
    max_seq_length=512,  # The maximum sequence length for the model inputs.
    args=TrainingArguments(  # Arguments for training setup.
        per_device_train_batch_size=1,  # Batch size per device (e.g., GPU).
        gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating model weights.
        warmup_steps=5,  # Number of steps to gradually increase the learning rate at the beginning of training.
        max_steps=30,  # Total number of training steps to perform.
        learning_rate=2e-4,  # Learning rate for the optimizer.
        fp16=False,  # Whether to use fp16 mixed-precision training; disabled here.
        logging_steps=1,  # How often to log training information.
        output_dir="outputs",  # Directory where training outputs will be saved.
        optim="paged_adamw_8bit"  # The optimizer to use, with 8-bit precision for efficiency.
    ),
    peft_config=lora_config,  # The LoRA configuration for efficient fine-tuning.
    formatting_func=formatting_func,  # The function to format the dataset examples.
)
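
It is worth checking how much of the dataset these settings actually cover. The arithmetic sketch below assumes a single GPU, which matches `CUDA_VISIBLE_DEVICES = "0"` set earlier; the helper name `training_coverage` is hypothetical.

```python
def training_coverage(dataset_size, per_device_batch_size, accumulation_steps, max_steps, num_devices=1):
    """Return (effective batch size, examples seen, fraction of one pass over the data)."""
    effective_batch = per_device_batch_size * accumulation_steps * num_devices
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / dataset_size


# Values from this notebook: 210 filtered rows, batch size 1, 4 accumulation steps, 30 steps.
batch, seen, epochs = training_coverage(210, 1, 4, 30)
print(f"effective batch: {batch}, examples seen: {seen}, epochs: {epochs:.2f}")
```

With these values the run covers less than one full pass over the filtered data, so raising `max_steps` is the simplest way to train longer.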


Map:   0%|          | 0/210 [00:00<?, ? examples/s]



In [48]:
# Train the model on the processed data.
trainer.train()

| Step | Training Loss |
|---|---|
| 1 | 0.3112 |
| 2 | 0.3112 |
| 3 | 0.2976 |
| 4 | 0.2712 |
| 5 | 0.2365 |
| 6 | 0.2011 |
| 7 | 0.1683 |
| 8 | 0.1426 |
| 9 | 0.1204 |
| 10 | 0.1015 |


TrainOutput(global_step=30, training_loss=0.08974544471129775, metrics={'train_runtime': 41.8676, 'train_samples_per_second': 2.866, 'train_steps_per_second': 0.717, 'total_flos': 183554172518400.0, 'train_loss': 0.08974544471129775, 'epoch': 0.57})

# 9. Q & A Results After Finetuning

After training, let's see how much our Gemma model has improved. We'll rerun the question-answering test and compare the results to the pre-finetuning performance.

In [49]:
question = "What is the difference between an expression and an operator?"

prompt = template.format(
    instruction=question,
    response="",
)

response_text = generate_response(trainer.model, tokenizer, prompt, device, 128)

Markdown(response_text)


Sure. Here's the difference between an expression and an operator:

**Expression:**

* An expression is a combination of one or more operands (numbers, variables, functions, etc.) connected by operators.
* It tells the computer what to do or calculate.
* An expression can be a single value or a complex expression.

**Operator:**

* An operator is a symbol used to combine or manipulate operands.
* It tells the computer how to perform the operation.
* An operator can be a binary operator (e.g., +, -, *, /) or a unary operator (e.g.,

The instruction-tuned Gemma model already answered this question well before training, so the difference after such a short fine-tuning run is small. The value of this notebook lies in showing beginners the complete workflow. With a larger dataset and longer training, the same steps can teach Gemma topics or answer styles it was previously unfamiliar with.

# 10. Conclusion

In this beginner-friendly notebook, we've outlined the process of fine-tuning the Gemma model, a Large Language Model (LLM), specifically for Python Q&A generation. Starting from data loading and preprocessing, we've demonstrated how to train the Gemma model effectively, even for those new to working with LLMs.

We leveraged the Dataset_Python_Question_Answer, featuring hundreds of Python programming questions and answers, to train and refine the Gemma model's capabilities in generating accurate Q&As. This journey, while introductory, underscores the potential and straightforward path to engaging with LLMs through the Gemma model.

Achieving the best performance with the Gemma model (or any LLM) generally requires training with more extensive datasets and over more epochs. Future enhancements could include Retrieval-Augmented Generation (RAG), which incorporates external knowledge bases for more precise and relevant responses, and Direct Preference Optimization (DPO), a training technique that aligns the model with preferred answers.

Ultimately, this notebook is designed to make the Gemma model approachable for beginners, illustrating that straightforward steps can unlock the potential of LLMs for diverse domain-specific tasks. It encourages users to experiment with the Gemma model across various fields, broadening the scope of its application and enhancing its utility.

<b>If you find this notebook useful, please consider upvoting it.</b> 
   
<b>This will help others find it and encourage me to write more code, which benefits everyone.</b>