<a href="https://colab.research.google.com/github/sufyanAshraf/Fine-tune-google-gemma_with_custom_dataset/blob/main/Fine_tuning_gemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q -U huggingface_hub

In [1]:
import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [2]:
from huggingface_hub import login
login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U peft
!pip install -q -U trl
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U transformers

In [3]:

import transformers
import torch
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig, GemmaTokenizer

# Import the model from Hugging Face

In [4]:
# Quantization configuration

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(           # bitsandbytes quantization settings
    load_in_4bit=True,                     # load the model weights quantized to 4 bits
    bnb_4bit_quant_type="nf4",             # use the 4-bit NormalFloat (NF4) quantization data type
    bnb_4bit_compute_dtype=torch.bfloat16  # run computations (matrix multiplications) in bfloat16
)
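
To get a rough sense of why 4-bit loading matters on a Colab GPU, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count (about 2.5 billion for gemma-2b) is an approximation assumed here, and `weight_memory_gib` is a hypothetical helper. The estimate ignores quantization constants, activations, optimizer state, and any layers that stay in higher precision, so the real footprint will be somewhat higher.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: gemma-2b has roughly 2.5 billion parameters.
N_PARAMS = 2_500_000_000

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```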

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             token=os.environ['HF_TOKEN'])

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


# Test the model

In [6]:
input_text = "What is Machine Learning."
# The tokenizer returns both input_ids and attention_mask, so name the result `inputs`
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))


<bos>What is Machine Learning.

Machine learning is a subfield of computer science that focuses on the development of computer programs that can learn and improve from experience without being explicitly programmed.

Machine learning is a subfield of computer science that focuses on the development of computer programs that can learn and improve from experience without being explicitly programmed.

Machine learning is a subfield of computer science that focuses on the development of computer programs that can learn and improve from experience without being explicitly programmed.

Machine learning is a subfield of computer science


In [7]:
text = "Quote: Imagination is more,"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more, than knowledge.

I am a self-taught artist, born in 1985 in


# Let's fine-tune with LoRA

In [8]:
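# Value taken from the Gemma documentation. Note that "false" leaves Weights & Biases
# logging enabled; set it to "true" to turn W&B reporting off.
os.environ["WANDB_DISABLED"] = "false"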

In [9]:
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",  # causal (autoregressive) language modeling
)
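
As a rough illustration of how small the LoRA adapters are, the sketch below counts the parameters that rank-8 adapters would add to the seven target modules. The layer shapes and the number of decoder layers are assumptions about gemma-2b's architecture, not values read from this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed (in_features, out_features) shapes for one gemma-2b decoder layer.
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
NUM_LAYERS = 18  # assumed number of decoder layers
RANK = 8         # r from the LoraConfig

per_layer = sum(lora_param_count(i, o, RANK) for i, o in LAYER_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")
```

Once the model has been wrapped by PEFT, this estimate can be compared with what `print_trainable_parameters()` reports for the actual model.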

## Dataset of quotes and their authors

In [10]:
# pip install datasets==2.20.0

In [11]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")

In [12]:
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [None]:
data['train']['quote']

In [14]:
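# Format the dataset as "Quote: ...\nAuthor: ..." prompts.
# Depending on the TRL version, SFTTrainer passes either a single example
# (string fields) or a batch (list fields), so handle both cases.
def formatting_func(example):
    quotes, authors = example["quote"], example["author"]
    if isinstance(quotes, str):
        return f"Quote: {quotes}\nAuthor: {authors}"
    return [f"Quote: {q}\nAuthor: {a}" for q, a in zip(quotes, authors)]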
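
To make the training template concrete, here is a minimal sketch that builds one prompt from a single record. `build_prompt` is a hypothetical helper and the record is made up for illustration; it is not taken from the dataset. The same "Quote: " prefix is what the test prompts in this notebook start with, so a fine-tuned model can be expected to continue the quote and then add an "Author:" line.

```python
def build_prompt(quote: str, author: str) -> str:
    """Hypothetical helper mirroring the 'Quote / Author' template."""
    return f"Quote: {quote}\nAuthor: {author}"


# Made-up record used only to illustrate the format.
sample = {"quote": "An example quote.", "author": "Example Author"}
prompt = build_prompt(sample["quote"], sample["author"])
print(prompt)
```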

In [15]:
data['train']

Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 2508
})

## Let's train

In [25]:
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

max_steps is given, it will override any value given in num_train_epochs
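
The training arguments set `warmup_steps=2`, `max_steps=100`, and `learning_rate=2e-4` without choosing a scheduler. Assuming the default linear schedule is in effect, the learning rate ramps up over the first two steps and then decays linearly to zero. The function below is a plain-Python sketch of that shape, not the library's own code.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, total_steps=100):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


for step in (0, 1, 2, 50, 100):
    print(step, linear_schedule_lr(step))
```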


In [17]:
trainer.train()

Step,Training Loss
1,2.6716
2,1.7279
3,2.54
4,2.5231
5,2.864
6,2.8956
7,2.5953
8,2.3157
9,3.0888
10,2.6408


TrainOutput(global_step=100, training_loss=2.1341786229610444, metrics={'train_runtime': 143.9784, 'train_samples_per_second': 2.778, 'train_steps_per_second': 0.695, 'total_flos': 189744345784320.0, 'train_loss': 2.1341786229610444, 'epoch': 0.1594896331738437})
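
The reported `epoch` and throughput can be cross-checked from the training arguments and the dataset size. The sketch below assumes a single GPU, so the effective batch size is the per-device batch size times the gradient accumulation steps.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
num_rows = 2508            # rows in data["train"]
train_runtime = 143.9784   # seconds, from TrainOutput

# Assumption: one GPU, so effective batch = batch size x accumulation steps.
effective_batch = per_device_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_rows

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}")
print(f"epochs: {epochs:.4f}")
print(f"samples/second: {samples_seen / train_runtime:.3f}")
```

In other words, 100 optimizer steps cover only about 16% of the dataset, so this run is a short demonstration rather than a full fine-tune.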

## Test the fine-tuned model on a new prompt

In [19]:
text = "Quote: A woman is like a tea bag;"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: A woman is like a tea bag; you can't tell how strong she is until you put her in hot water.

I'm not sure if this is a quote or not, but I've heard it before. I've heard it used in a few different ways
