
# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 2: PEFT (Parameter Efficient Fine-Tuning)

## Learning Objectives

At the end of the experiment, you will be able to:

* understand how LoRA, a parameter-efficient fine-tuning method, works
* load and quantize the `google/gemma-2b` model
* fine-tune the `google/gemma-2b` model on a subset of the `databricks-dolly-15k` dataset for text generation using LoRA
* perform inference with the fine-tuned model

## Dataset Description

This experiment uses the **`ai-bites/databricks-mini`** dataset, which is a subset of `databricks-dolly-15k`.

The complete **`databricks-dolly-15k`** dataset is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create **prompt / response pairs** in several different instruction categories such as

- brainstorming
- classification
- closed QA
- generation
- information extraction
- open QA
- summarization
- general QA

The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Data Splits:
- train: 15,011

Data Fields:

- ***instruction***: instruction or prompt
- ***context***: context information to consider while giving response
- ***response***: response of prompt
- ***category***: category in which instruction lies

<br>

**Example 1:**

\{
> '**instruction**': "When did Virgin Australia start operating?",
>
>'**context**': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
>
>'**response**': "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
>
>'**category**': "closed_qa"

\}

**Example 2:**

\{
> '**instruction**': "Which is a species of fish? Tope or Rope",
>
>'**context**': " ",
>
>'**response**': "Tope"
>
>'**category**': "classification"

\}

To learn more about the `databricks-dolly-15k` dataset, see its [dataset card](https://huggingface.co/datasets/databricks/databricks-dolly-15k).

The **`ai-bites/databricks-mini`** subset used in this experiment contains only the records that do not have a *context*. In each record, the *instruction* and *response* are combined into a single text string.

Data Splits:
- train: 10,544

Data Fields:

- ***text***: instruction and response combined into a single text string

<br>

**Example 1:**

\{
> '**text**': "Instruction:\nWhich is a species of fish? Tope or Rope\n\nResponse:\nTope",

\}

**Example 2:**

\{
> '**text**': "Instruction:\nWhy can camels survive for long without water?\n\nResponse:\nCamels use the fat in their humps to keep them filled with energy and hydration for long periods of time.",

\}

**Example 3:**

\{
> '**text**': "Instruction:\nAlice's parents have three daughters: Amy, Jessy, and what's the name of the third daughter?\n\nResponse:\nThe name of the third daughter is Alice",

\}

To learn more about the `ai-bites/databricks-mini` dataset, see its [dataset card](https://huggingface.co/datasets/ai-bites/databricks-mini).
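
The relationship between the two datasets can be sketched in a few lines of Python. This is only an illustration: the template is inferred from the `text` examples above, and the helper names `to_text` and `build_mini_subset` are hypothetical, not part of either dataset's tooling.

```python
def to_text(instruction: str, response: str) -> str:
    """Join an instruction/response pair into a single `text` string."""
    return f"Instruction:\n{instruction}\n\nResponse:\n{response}"


def build_mini_subset(records):
    """Keep only records with an empty context and convert them to {'text': ...}."""
    return [
        {"text": to_text(r["instruction"], r["response"])}
        for r in records
        if not r.get("context", "").strip()
    ]


# Two records in the databricks-dolly-15k layout (context shortened for brevity).
records = [
    {
        "instruction": "When did Virgin Australia start operating?",
        "context": "It commenced services on 31 August 2000 as Virgin Blue.",
        "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue.",
        "category": "closed_qa",
    },
    {
        "instruction": "Which is a species of fish? Tope or Rope",
        "context": " ",
        "response": "Tope",
        "category": "classification",
    },
]

mini = build_mini_subset(records)
print(len(mini))        # only the record without a context survives
print(mini[0]["text"])
```

The same template matters again at inference time: a model fine-tuned on this format tends to continue a prompt with a `Response:` section.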

## Information

### **Parameter-Efficient Fine-Tuning (PEFT) methods**

Fine-tuning large pretrained models is often prohibitively costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.

PEFT is integrated with Transformers for easy model training and inference, and with Accelerate for distributed training and inference of very large models.

[PEFT](https://github.com/huggingface/peft) is also the name of an open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications ***without*** fine-tuning all of the model's parameters.

PEFT currently includes techniques for:

- **LoRA:** Low-Rank Adaptation of Large Language Models
- **Prefix Tuning:** P-Tuning v2
- **P-Tuning**
- **Prompt Tuning**


### **LoRA**

LoRA (Low-Rank Adaptation) is a technique that accelerates the fine-tuning of large models while consuming less memory.

To make fine-tuning more efficient, LoRA represents the weight updates with two smaller matrices (called update matrices) obtained through low-rank decomposition.

In the figure below, A and B are the update matrices.

<center>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png" width=900px>
</center>
<br>

- These new matrices can be trained to **adapt to the new data** while keeping the overall number of changes low.
- The original weight matrix **remains frozen** and doesn't receive any further adjustments.
- To produce the final results, both the original and the adapted weights are **combined**.
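
The three points above can be illustrated with a toy example in plain Python. It is a minimal sketch, not the PEFT implementation: the matrix values, the sizes, and the helper names are made up, and the scaling factor `alpha / r` is assumed to follow the usual LoRA convention.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(X, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * a for a in row] for row in X]


d_out, d_in, r, alpha = 3, 4, 1, 2          # toy sizes; real layers are far larger
W = [[1.0, 0.0, 2.0, -1.0],                 # frozen pretrained weight (d_out x d_in)
     [0.5, 1.0, 0.0, 0.0],
     [0.0, -1.0, 1.0, 3.0]]
B = [[0.1], [0.0], [-0.2]]                  # trainable update matrix (d_out x r)
A = [[1.0, 2.0, 0.0, -1.0]]                 # trainable update matrix (r x d_in)
x = [[1.0], [2.0], [3.0], [4.0]]            # input as a column vector

scaling = alpha / r

# During training: frozen path plus the scaled low-rank path.
h_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# After merging: fold the update into the weight once, then use a single matrix.
W_merged = add(W, scale(matmul(B, A), scaling))
h_merged = matmul(W_merged, x)

full_param_count = d_out * d_in             # parameters a full update would train
lora_param_count = r * (d_out + d_in)       # parameters LoRA trains instead

print(h_adapter)
print(h_merged)
print(full_param_count, lora_param_count)
```

Both forward passes give the same result up to floating-point rounding, which is why the adapter can be merged into the base weights after training without changing the model's outputs.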

### Install Dependencies

In [1]:
%%capture

# For loading models, tokenizers, and datasets from HuggingFace
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

# For doing PEFT
!pip -q install --upgrade peft
!pip -q install --upgrade trl    # TRL - Transformer Reinforcement Learning; built on top of the 'transformers' library; Full stack library to fine-tune and align large language models
!pip -q install bitsandbytes     # for quantization


### <font color="#990000">Restart Session/Runtime</font>

### Setup Steps:

In [2]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2303646" #@param {type:"string"}

In [3]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "9002239227" #@param {type:"string"}

In [4]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M6_AST_02_PEFT_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      with open(notebook + ".ipynb", "rb") as f:
        file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")

Setup completed successfully


### Import required packages

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTTrainer             # Supervised Fine-tuning (SFT) Trainer

import warnings
warnings.filterwarnings('ignore')

### Login to HuggingFace

In [None]:
# Run, and paste your HF Access token when prompted
!huggingface-cli login

# OR

# from huggingface_hub import notebook_login
# notebook_login()            ## paste your HF Access token when prompted


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### **Load Model & Tokenizer**

* Load the model - `"google/gemma-2b"`

  <font color="#990000">NOTE: you need to request access before you can use the "google/gemma-2b" model.</font>

  <font color="#990000">Log in to your account on HuggingFace, go to https://huggingface.co/google/gemma-2b, click on "Accept License" and follow the prompts. Requests are approved instantly.</font>

* Load the tokenizer
* Visualize the model architecture
* Finally, query the model with a prompt and see the response before fine-tuning

In [None]:
# Load model from HF Model Hub

"""
Gemma-2b model has 2 billion parameters.
Look into Model card: https://huggingface.co/google/gemma-2b
"""



model_name = "google/gemma-2b"          # username/repo-name

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model
model = AutoModelForCausalLM.from_pretrained(model_name,                # Make sure you have access to this model
                                             device_map='auto'          # Places the model on the GPU automatically if one is available
                                             )

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
# Visualize the model architecture
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-

In [None]:
# Check the device of the model
device = next(model.parameters()).device
device

device(type='cuda', index=0)

In [None]:
# Query the model with a prompt and see the response before fine-tuning

input_text = "What should I do on a trip to Europe?"

input_ids = tokenizer(input_text, return_tensors="pt").to(device)
outputs = model.generate(**input_ids, max_length=128)

print(tokenizer.decode(outputs[0]))

<bos>What should I do on a trip to Europe?

The answer to this question is not as simple as it seems. There are many different things to see and do in Europe, and it can be difficult to know where to start.

If you’re planning a trip to Europe, here are some tips to help you get started:

1. Decide what you want to see and do.

There are so many amazing places to see and do in Europe, it can be hard to know where to start. Start by deciding what you want to see and do. Do you want to see the Eiffel Tower in Paris? Or


In [None]:
# Query the model with a prompt and see the response before fine-tuning

input_text = "Explain the process of photosynthesis in a way that a child could understand"

input_ids = tokenizer(input_text, return_tensors="pt").to(device)
print(input_ids)

outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

{'input_ids': tensor([[     2,  74198,    573,   2185,    576, 105156,    575,    476,   1703,
            674,    476,   2047,   1538,   3508]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
<bos>Explain the process of photosynthesis in a way that a child could understand.

A 100-W lightbulb is plugged into a standard $120-\mathrm{V}$ (rms) outlet. Find $(a) I_{\text {mas }}$ and $(b) I_{\max }$ when a device like this is operating at the maximum current allowed by its own internal circuitry. (Such a device is often called a light dimmer.)

A 100-turn, 2.0-cm-diameter coil is at rest with its axis vertical. A uniform magnetic field $60^{\circ}$ away


## Motivation for PEFT

- Try to finetune Gemma-2b on the subset of `databricks-dolly-15k` dataset **without LoRA**
  * Load and visualize the dataset
  * Initiate the trainer
  * Start the training
  * Note that full fine-tuning does not fit on a single T4 GPU (about 15 GB of memory) on Colab

In [None]:
# Load dataset
dataset = load_dataset("ai-bites/databricks-mini", split="train[0:1000]")
dataset

Dataset({
    features: ['text'],
    num_rows: 1000
})

In [None]:
print(dataset['text'][0])

Instruction:
Which is a species of fish? Tope or Rope

Response:
Tope


In [None]:
print(dataset['text'][1])

Instruction:
Why can camels survive for long without water?

Response:
Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.


In [None]:
# Fine-tune the model

from trl import SFTTrainer         # Supervised Fine-tuning step (SFT) Trainer

trainer = SFTTrainer(model=model,
                     train_dataset=dataset,
                     tokenizer=tokenizer,
                     dataset_text_field="text",
                     )

print("Initialized trainer for training!")

trainer.train()

Initialized trainer for training!


OutOfMemoryError: CUDA out of memory. Tried to allocate 102.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 91.06 MiB is free. Process 18021 has 14.66 GiB memory in use. Of the allocated memory 14.39 GiB is allocated by PyTorch, and 142.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

<font color="#119911">Full fine-tuning of a model with about 2 billion parameters does not fit in the memory of this GPU, so the code cell above fails with an `OutOfMemoryError` and the session may crash.</font>

At this point, restart the session and continue executing the code cells below. Make sure you are on a GPU runtime.
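
A rough back-of-the-envelope estimate shows why the training step runs out of memory. The sketch below is an assumption-laden lower bound, not a measurement: it assumes fp32 weights and gradients (4 bytes each per parameter) and an AdamW-style optimizer that keeps two fp32 values per parameter, and it ignores activations entirely. The parameter count is the total that `print_trainable_parameters()` reports later in this notebook, and the helper name is hypothetical.

```python
GIB = 1024 ** 3


def full_finetune_memory_gib(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Lower bound in GiB: weights + gradients + optimizer state, no activations."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / GIB


n_params = 2_509_748_224           # total parameter count reported later in this notebook
gpu_capacity_gib = 14.75           # capacity quoted in the OutOfMemoryError above

weights_only_gib = n_params * 4 / GIB
training_gib = full_finetune_memory_gib(n_params)

print(f"fp32 weights only:        {weights_only_gib:.1f} GiB")
print(f"weights + grads + optim:  {training_gib:.1f} GiB")
print(f"GPU capacity:             {gpu_capacity_gib:.2f} GiB")
```

Under these assumptions the weights alone fit on the GPU, which is why loading and inference worked, but the gradients and optimizer state needed for training push the requirement well beyond the available memory.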

# Finetune using Parameter Efficient Fine-tuning (PEFT)

### Import required packages

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
from transformers import TrainingArguments
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

import warnings
warnings.filterwarnings('ignore')

### Load dataset

In [None]:
# Load dataset
dataset = load_dataset("ai-bites/databricks-mini", split="train[0:1000]")
dataset

Dataset({
    features: ['text'],
    num_rows: 1000
})

In [None]:
print(dataset['text'][0])

Instruction:
Which is a species of fish? Tope or Rope

Response:
Tope


In [None]:
print(dataset['text'][1])

Instruction:
Why can camels survive for long without water?

Response:
Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.


## Load and Quantize the Model

**Quantization** is a technique for reducing the size of deep neural networks (including LLMs) by lowering the numerical precision used to store their weights and biases.

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types such as 8-bit integers (int8).

**bitsandbytes** is the easiest option for quantizing a model to 8-bit and 4-bit.

* 8-bit quantization keeps outlier values in fp16 and quantizes the remaining values to int8. The two parts are multiplied separately, the int8 results are converted back to fp16, and both results are added together to give an fp16 output. This reduces the degradative effect outlier values have on a model's performance.

* 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to fine-tune quantized LLMs.

You can quantize a model by passing a `BitsAndBytesConfig` to the `from_pretrained()` method.

To learn more about bitsandbytes, see the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes?bnb=4-bit).
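
To make the idea of lower-precision storage concrete, here is a minimal sketch of symmetric "absmax" quantization to int8 in plain Python. It is a simplified illustration only, not the algorithm bitsandbytes uses (NF4, for instance, works block-wise with a different set of levels); the function names and sample weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.12, -0.5, 0.03, 0.9, -0.27]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"largest round-trip error: {max_error:.5f} (bounded by scale / 2 = {scale / 2:.5f})")

# Storage needed for the weights alone at different precisions. This ignores
# layers that stay in higher precision and the stored scale factors.
n_params = 2_509_748_224
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {n_params * bits / 8 / 1024 ** 3:.2f} GiB")
```

Each stored value costs one byte instead of four, at the price of a small rounding error that is bounded by half the scale factor.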

In [None]:
# Create `BitsAndBytesConfig` object for Quantization

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                            # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",                    # Quantization type (fp4 or nf4)    nf4 -> Normal Float 4
    bnb_4bit_compute_dtype=torch.float16,         # Compute dtype for 4-bit base models
    bnb_4bit_use_double_quant=False,              # Whether to activate nested quantization for 4-bit base models (double quantization)
)

In [None]:
model_name = "google/gemma-2b"

# Load and quantize the base model
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=quantization_config,    # Note that a GPU is needed for quantization
                                             device_map="auto"                           # Places the model on the GPU automatically if one is available
                                             )
model.config.use_cache = False
model.config.pretraining_tp = 1


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


### **Fine-tuning Model using LoRA**

To fine-tune a model using LoRA, you need to:

- Instantiate a base model, here it is `google/gemma-2b`
- Create a configuration (`LoraConfig`) where you define LoRA-specific parameters
- Pass this configuration to the `SFTTrainer` class through its `peft_config` argument (`peft_config=LoraConfig(...)`)
- Start the training process

In [None]:
# Create LoRA configuration object with LoRA-specific parameters

peft_config = LoraConfig(
    lora_alpha=16,                         # LoRA scaling factor
    r=4,                                   # Rank of the update matrices (other common choices: 8, 16, 32)
    lora_dropout=0.1,                      # Dropout probability for LoRA layers
    bias="none",                           # Specifies whether the bias parameters should be trained
    task_type="CAUSAL_LM",                 # Tells PEFT that this is a causal language modeling task
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]   # Linear layers of gemma-2b that receive LoRA adapters
)

In [None]:
from peft import get_peft_model

# Number of trainable parameters
get_peft_model(model, peft_config=peft_config).print_trainable_parameters()



trainable params: 3,575,808 || all params: 2,509,748,224 || trainable%: 0.1425


The output above shows that only about 0.14% of the roughly 2.5 billion parameters will be updated during training.
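
The trainable-parameter count printed above can be checked by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains a matrix A of shape `r x in_features` and a matrix B of shape `out_features x r`. The sketch below takes the layer shapes from the printed model architecture and the rank and target modules from the `LoraConfig` above; the constant and helper names are only illustrative.

```python
RANK = 4            # r in the LoraConfig above
NUM_LAYERS = 18     # GemmaDecoderLayer count in the printed architecture

# (in_features, out_features) of each targeted linear layer in one decoder layer
TARGET_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
}


def lora_params(in_features, out_features, rank):
    """Parameters added by one adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
all_params = 2_509_748_224   # "all params" from the output above

print(trainable)                                  # 3575808
print(f"{100 * trainable / all_params:.4f}%")     # 0.1425%
```

Doubling the rank doubles the number of trainable parameters, so `r` is the main knob for trading adapter capacity against memory.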

In [None]:
# Set training parameters

training_arguments = TrainingArguments(
    output_dir="./results",                     # Output directory where the model predictions and checkpoints will be stored
    num_train_epochs=2,                         # Number of training epochs
    per_device_train_batch_size=4,              # Batch size per GPU for training
    gradient_accumulation_steps=4,              # Number of batches whose gradients are accumulated before each optimizer update
    optim="paged_adamw_32bit",                  # Optimizer to use (AdamW optimizer)
    save_steps=25,                              # Save checkpoint every X updates steps
    logging_steps=25,                           # Log every X updates steps
    learning_rate=2e-4,                         # Initial learning rate (AdamW optimizer)
    weight_decay=0.001,                         # Weight decay to apply to all layers except bias/LayerNorm weights
    max_grad_norm=0.3,                          # Maximum gradient norm (gradient clipping)
    warmup_ratio=0.03,                          # Ratio of steps for a linear warmup (from 0 to learning rate); has no effect with the "constant" schedule below
    group_by_length=True,                       # Group sequences of similar length into batches; saves memory and speeds up training considerably
    lr_scheduler_type="constant",               # Learning rate schedule (constant a bit better than cosine)
)
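
The batch size and gradient accumulation settings above combine into an effective batch size per optimizer update. The short sketch below spells out that arithmetic; the helper names are hypothetical, and the number of training sequences in the usage example is an assumed value, since the real count depends on how packing splits the dataset.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_updates_per_epoch(num_sequences, batch_size):
    """Approximate number of optimizer updates needed to see every sequence once."""
    return math.ceil(num_sequences / batch_size)


batch = effective_batch_size(per_device_batch_size=4, gradient_accumulation_steps=4)
print(batch)                                    # 16

# Assumed example: 2000 packed sequences in the training set
print(approx_updates_per_epoch(2000, batch))    # 125
```

Gradient accumulation gives the optimizer the statistics of a batch of 16 while only 4 sequences need to be held on the GPU at a time.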

In [None]:
# Set Supervised Fine-Tuning (SFT) parameters

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,                  # LoRA configuration to use
    dataset_text_field="text",
    max_seq_length=40,                        # Maximum sequence length to use
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True,                             # Pack multiple short examples in the same input sequence to increase efficiency
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


In [None]:
# Train model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 25   | 5.5938        |
| 50   | 3.5142        |
| 75   | 3.0969        |
| 100  | 2.9027        |
| 125  | 3.0027        |
| 150  | 2.8497        |
| 175  | 2.7795        |
| 200  | 2.8685        |
| 225  | 2.8281        |
| 250  | 2.9448        |


In [None]:
# Save model
new_model = "gemma-finetuned"
trainer.model.save_pretrained(new_model)

## Prompt the newly fine-tuned model
* Load and MERGE the LoRA weights with the model weights
* Run inference with the same prompt we used to test the pre-trained model

In [None]:
# Base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [None]:
# Load and MERGE the LoRA weights with the model weights

model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

In [None]:
model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-

In [None]:
# Reload the tokenizer for inference
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# Run inference

input_text = "What should I do on a trip to Europe?"

input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
print(input_ids)
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

{'input_ids': tensor([[     2,   1841,   1412,    590,    749,    611,    476,   7957,    577,
           4238, 235336]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
<bos>What should I do on a trip to Europe?

Response:
There are many options for a trip to Europe.  You can visit a country like France, where you can visit the Eiffel Tower, the Louvre, and the Arc de Triomphe.  You can visit a country like Italy, where you can visit the Colosseum, the Vatican, and the Trevi Fountain.  You can visit a country like Spain, where you can visit the Sagrada Familia, the Cathedral of St. John, and the Plaza de Espana.  You can visit a country like Germany, where you can visit the Berlin Wall, the


In [None]:
tokenizer.decode(outputs[0])

'<bos>What should I do on a trip to Europe?\n\nResponse:\nThere are many options for a trip to Europe.  You can visit a country like France, where you can visit the Eiffel Tower, the Louvre, and the Arc de Triomphe.  You can visit a country like Italy, where you can visit the Colosseum, the Vatican, and the Trevi Fountain.  You can visit a country like Spain, where you can visit the Sagrada Familia, the Cathedral of St. John, and the Plaza de Espana.  You can visit a country like Germany, where you can visit the Berlin Wall, the'
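
The decoded string above still contains the `<bos>` token, the original prompt, and the `Response:` marker that the model learned from the training data. A small post-processing helper can isolate the answer. The sketch below is a hypothetical helper written for this notebook's output format, not part of Transformers or TRL.

```python
def build_prompt(instruction: str) -> str:
    """Format a prompt in the same template as the training data."""
    return f"Instruction:\n{instruction}\n\nResponse:\n"


def extract_response(decoded: str, marker: str = "Response:\n") -> str:
    """Return the text after the first response marker, without special tokens."""
    text = decoded.replace("<bos>", "").replace("<eos>", "")
    _, found, tail = text.partition(marker)
    return tail.strip() if found else text.strip()


decoded = (
    "<bos>What should I do on a trip to Europe?\n\n"
    "Response:\nThere are many options for a trip to Europe."
)
print(extract_response(decoded))
print(repr(build_prompt("What should I do on a trip to Europe?")))
```

Special tokens can also be dropped at decode time with `tokenizer.decode(outputs[0], skip_special_tokens=True)`. Prompting with the training template via `build_prompt` may give more consistent answers, although that is not verified in this notebook.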

In [None]:
# Run inference

input_text = "What are some good places to visit during holidays?"

input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
print(input_ids)
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

{'input_ids': tensor([[     2,   1841,    708,   1009,   1426,   6853,    577,   3532,   2290,
          23567, 235336]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
<bos>What are some good places to visit during holidays?

Response:
There are many places to visit during holidays. Some popular places are:
- Paris, France
- London, England
- Rome, Italy
- Venice, Italy
- Amsterdam, Netherlands
- Prague, Czech Republic
- Berlin, Germany
- Barcelona, Spain
- Istanbul, Turkey
- Athens, Greece
- Prague, Czech Republic
- Budapest, Hungary
- Prague, Czech Republic
- Prague, Czech Republic
- Prague, Czech Republic
- Prague, Czech Republic
- Prague, Czech Republic
- Prague, Czech Republic



### Please answer the questions below to complete the experiment:




In [6]:
#@title Select the False statement w.r.t LoRA: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "It is a technique that accelerates the fine-tuning of large models while consuming less memory" #@param ["", "It is a technique that accelerates the fine-tuning of large models while consuming less memory", "During training, the original weight matrix remains frozen and doesn't receive any further adjustments", "During inference, only the original weight matrix is used to produce final results", "None of the above"]

In [7]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [8]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "na" #@param {type:"string"}


In [9]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [10]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [11]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [12]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 7714
Date of submission:  18 Oct 2024
Time of submission:  12:17:39
View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions
