<a href="https://colab.research.google.com/github/vanderbilt-data-science/PEFT/blob/main/peft_gemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter-Efficient Fine-Tuning Gemma

*By Myranda Shirk and Umang Chaudhry, Vanderbilt Data Science Institute*

Notebook created with help from [Gemma fine-tuning documentation](https://github.com/huggingface/notebooks/blob/main/peft/gemma_7b_english_quotes.ipynb).

## Fine-Tuning in Google Colab

According to [Gemma's HuggingFace model page](https://huggingface.co/google/gemma-7b), this fine-tuning code can run on a free Google Colab instance using the available GPU runtime. To switch your runtime to GPU, select "Runtime" -> "Change Runtime Type" -> GPU.

If you cannot use a GPU for any reason, use the cells marked for CPU instead.

LoRA (Low-Rank Adaptation) is the PEFT method we will explore today; see the [LoRA paper](https://arxiv.org/abs/2106.09685) for details. More specifically, we will use QLoRA, since we fine-tune a quantized version of the model.
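
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass as described in the LoRA paper: the frozen weight `W0` stays untouched, and a trainable rank-`r` update `B A`, scaled by `alpha / r`, is added to its output. The helper names (`matvec`, `lora_forward`) and the toy dimensions are illustrative assumptions and are not part of the PEFT library.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x, alpha, r):
    """Return W0 x + (alpha / r) * B (A x); W0 is frozen, A and B are trainable."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 64, 64, 4, 8  # toy sizes, chosen only for illustration

W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]  # zero init, so the update starts at zero
x = [1.0] * d_in

full_params = d_out * d_in        # weights a full fine-tune would update
lora_params = r * (d_in + d_out)  # weights LoRA trains instead

print("full:", full_params, "lora:", lora_params)
print("same output at init:", lora_forward(W0, A, B, x, alpha, r) == matvec(W0, x))
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly; training then only updates `A` and `B`.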

### Libraries and APIs

To access Gemma, you need a [HuggingFace](https://www.huggingface.co) account and a HuggingFace API token with write permissions (in HF: Profile -> Settings -> Access Tokens). You also need to visit [Gemma's HuggingFace model page](https://huggingface.co/google/gemma-7b) and click the button to accept the terms of use. Access should be granted immediately after you accept.

In [None]:
import os
import getpass
os.environ["HF_TOKEN"] = getpass.getpass("Enter your HuggingFace token: ")

Enter your HuggingFace token: ··········


In [None]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1

### Model Setup and Training Objective

For this example, we will be fine-tuning Gemma on the English Quotes Dataset. We want our model to output a quote and its author given the start of a quote.

First, we can access Gemma-7B through HuggingFace (or any other Gemma model: simply change to "gemma-2b" for the 2B parameter model, etc.). This is where you need the HF authentication token we set above.

**NOTE**: The cells below need a GPU connection, which you can get by selecting "Runtime" -> "Change Runtime Type" -> GPU.

If you are not connected to a GPU, the error will say something along the lines of "you must have accelerate and bitsandbytes installed."

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-7b"

# 4-bit NF4 quantization; computation runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# HF_TOKEN was set in the first cell; device_map={"": 0} places the whole model on GPU 0
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])




In [None]:
# FOR CPU: DELETE QUOTES AND RUN THIS CELL
'''
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

'''



Let's see how Gemma does on this task without any fine-tuning. We will give it the start of a quote.

In [None]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge.

Albert Einstein

I am a creative and curious person who loves to learn


As we can see above, the model finishes the quote and names an author, but then it keeps generating unrelated text that we did not ask for. Not exactly what we want!

### Data and Training Functions

Next, we will set up our training configuration for [LoRA](https://www.run.ai/guides/generative-ai/lora-fine-tuning), a highly efficient training method.

In [None]:
os.environ["WANDB_DISABLED"] = "true"

In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear4bit(in_features=24576, out_features=3072, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
   

The QLoRA paper reports: "We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers is required to match full finetuning performance." For that reason, we target all linear layers to get the best tuning results.

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
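
As a rough check on how small this adapter is, the sketch below counts the LoRA weights implied by `r=8` and the layer shapes printed by `print(model)` above. It assumes the standard LoRA layout (one `r x in_features` matrix and one `out_features x r` matrix per targeted layer, no biases); the constant and function names are ours. PEFT's own `print_trainable_parameters()` on the wrapped model remains the authoritative count.

```python
# (in_features, out_features) for each targeted layer, copied from print(model)
LAYER_SHAPES = {
    "q_proj": (3072, 4096),
    "k_proj": (3072, 4096),
    "v_proj": (3072, 4096),
    "o_proj": (4096, 3072),
    "gate_proj": (3072, 24576),
    "up_proj": (3072, 24576),
    "down_proj": (24576, 3072),
}
NUM_LAYERS = 28  # "(0-27): 28 x GemmaDecoderLayer"
RANK = 8         # r in LoraConfig


def lora_param_count(in_features, out_features, r):
    """Weights in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in LAYER_SHAPES.values())
total_lora = per_layer * NUM_LAYERS

# Frozen weights in the same linear layers (embeddings and norms not included)
frozen_linear = NUM_LAYERS * sum(i * o for i, o in LAYER_SHAPES.values())

print(f"LoRA weights: {total_lora:,}")
print(f"Frozen linear weights: {frozen_linear:,}")
print(f"Ratio: {100 * total_lora / frozen_linear:.2f}%")
```

Under these assumptions, the trainable adapter weights are well under one percent of the frozen linear weights they sit beside.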

We can load our dataset through the HF datasets library.

In [None]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Take a look at the [Dataset](https://huggingface.co/datasets/Abirate/english_quotes) to see what we're fine-tuning the model on.

Now we can define our Supervised Fine-Tuning (SFT) trainer below and start the training!

In [None]:
import transformers
from trl import SFTTrainer

def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]} <eos>"
    return [text]

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).




| Step | Training Loss |
|------|---------------|
| 1 | 1.8727 |
| 2 | 0.6616 |
| 3 | 0.9745 |
| 4 | 0.6288 |
| 5 | 0.339 |
| 6 | 0.7132 |
| 7 | 0.628 |
| 8 | 0.191 |
| 9 | 0.3427 |
| 10 | 0.3348 |
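
The logged numbers can be sanity-checked by hand. The sketch below recomputes the effective batch size from the `TrainingArguments` values and averages the ten step losses, which should land close to the `training_loss` reported in the `TrainOutput` line (the logged values are rounded, so the match is approximate). It assumes a single GPU, so the per-device batch size is the total batch size.

```python
# Per-step losses copied from the training log
step_losses = [1.8727, 0.6616, 0.9745, 0.6288, 0.339,
               0.7132, 0.628, 0.191, 0.3427, 0.3348]

# Values from TrainingArguments in the training cell
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10

# Sequences contributing to each optimizer step (single-GPU assumption)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = effective_batch_size * max_steps

mean_loss = sum(step_losses) / len(step_losses)

print("effective batch size:", effective_batch_size)
print("sequences seen:", sequences_seen)
print("mean step loss:", round(mean_loss, 4))
```

With only ten optimizer steps, this run is a quick demonstration rather than a full fine-tune; raising `max_steps` is the obvious knob for longer training.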


TrainOutput(global_step=10, training_loss=0.6686274752020835, metrics={'train_runtime': 6.6269, 'train_samples_per_second': 6.036, 'train_steps_per_second': 1.509, 'total_flos': 23002150103040.0, 'train_loss': 0.6686274752020835, 'epoch': 6.67})
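
A note on `formatting_func`: the `[0]` indexing suggests that this version of `SFTTrainer` passes the function a batch (a dict of lists), in which case only the first row of each batch gets formatted. The sketch below shows a variant that formats every row. The name `format_batch` and the sample batch are made up for illustration, and we have not re-run training with it, so the loss values above would likely differ.

```python
def format_batch(examples):
    """Format every (quote, author) pair in a batch given as a dict of lists."""
    return [
        f"Quote: {quote}\nAuthor: {author} <eos>"
        for quote, author in zip(examples["quote"], examples["author"])
    ]


# Hypothetical two-row batch in the same column layout as the dataset
sample_batch = {
    "quote": [
        "Imagination is more important than knowledge.",
        "Be yourself; everyone else is already taken.",
    ],
    "author": ["Albert Einstein", "Oscar Wilde"],
}

formatted = format_batch(sample_batch)
for text in formatted:
    print(text)
    print("---")
```

Passing a function like this as `formatting_func=` would be the drop-in change, assuming the trainer does call it on batches.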

### Evaluation

Let's see how our model does after fine-tuning. We will run the same example we did at the beginning. Remember that we want our model to give us the rest of the quote and its author.

In [None]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

# stopping criteria: eos token
outputs = model.generate(**inputs, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge.
Author: Albert Einstein 
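
The effect of the two stopping rules can be pictured with a toy decoding loop. This is a simplified sketch, not how `generate` is implemented: `toy_generate` and the scripted "model" are hypothetical stand-ins that replay a fixed list of tokens. As in the real API, the token limit always applies, and the end-of-sequence check stops generation earlier when that token appears.

```python
def toy_generate(next_token, prompt, eos_token=None, max_new_tokens=20):
    """Append tokens until eos_token is produced or max_new_tokens is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        tokens.append(token)
        if eos_token is not None and token == eos_token:
            break  # the model signalled that it is done
    return tokens


def make_scripted_model(script, prompt_len):
    """Hypothetical stand-in for a language model: replays a fixed script."""
    def next_token(tokens):
        return script[len(tokens) - prompt_len]
    return next_token


prompt = ["Imagination", "is"]
script = ["more", "important", "than", "knowledge.", "Author:", "Albert",
          "Einstein", "<eos>", "I", "am", "a", "creative"]
model_fn = make_scripted_model(script, len(prompt))

with_eos = toy_generate(model_fn, prompt, eos_token="<eos>", max_new_tokens=10)
without_eos = toy_generate(model_fn, prompt, eos_token=None, max_new_tokens=10)

print(" ".join(with_eos))
print(" ".join(without_eos))
```

With the end-of-sequence check, the output stops right after the author; without it, the loop runs until the token budget is spent and spills into extra text.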


In the original version of this demo, we used "max_new_tokens" to determine when our model should stop generating. Try this option out below and note how the model behavior changes.

In [None]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

# stopping criterion: max new tokens
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Great! Our model performs exactly as we want. We can now save our model as a .pt file to access later.

In [None]:
torch.save(model.state_dict(), "gemma-7b-peft-quotes.pt")

Additionally, we can upload it to the Hub on HuggingFace.

NOTE: You will need an API token with WRITE access!

In [None]:
# upload to huggingface hub
# CHANGE TO YOUR USERNAME
model.push_to_hub("myshirk/my-finetuned-model")



CommitInfo(commit_url='https://huggingface.co/myshirk/my-finetuned-model/commit/529be97c1fce0fea249c6f0266649d0156a9380b', commit_message='Upload GemmaForCausalLM', commit_description='', oid='529be97c1fce0fea249c6f0266649d0156a9380b', pr_url=None, pr_revision=None, pr_num=None)

## Conclusion

You have just successfully fine-tuned Google's Gemma model on our English Quotes dataset. Feel free to adapt this process for SFT on your own project.