<a href="https://colab.research.google.com/github/subedinab/finetunig/blob/master/Llama_2_Fine_Tuning_using_QLora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tune Llama-2-7b on Google Colab

This Google Colab notebook shows how to fine-tune the Llama-2-7b model in a single Colab session and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, together with QLoRA for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For this experiment we need `accelerate`, `peft`, `transformers`, `datasets`, and TRL, which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
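As a rough illustration of why 4-bit quantization matters here, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of 7 billion is a nominal figure taken from the model name, and the helper `weight_memory_gib` is hypothetical; activations, optimizer state, and quantization metadata are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 7e9  # nominal size implied by the name "Llama-2-7b" (assumption)

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>18}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 footprint.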

In [1]:
!pip install -q wandb

In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q trl

## Dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import json
import numpy as np
file_path = "/content/drive/MyDrive/ml/grading_gpt_training_data.json"
data_list = []
# Read the file line by line and load each line as a JSON object (dictionary)
with open(file_path, "r") as file:
    for line in file:
        data = json.loads(line)
        data_list.append(data)

# Now you have the list of dictionaries loaded from the JSON file
print(data_list[-1])

{'input': 'Describe Bhaktapur Durbar Square.', 'label': "Bhaktapur Durbar Square is a UNESCO World Heritage Site known for its 55-window palace, Lion's Gate, Golden Gate, and Nyatapola Temple, showcasing intricate Newar architecture."}
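The loading cell above treats the file as JSON Lines, with one JSON object per line. A slightly more defensive variant, sketched here with a hypothetical helper `parse_json_lines` and made-up sample records, skips blank lines and checks that each record carries the `input` and `label` keys used later in the notebook.

```python
import json

REQUIRED_KEYS = {"input", "label"}  # keys used by the rest of this notebook

def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate empty lines, e.g. a trailing newline
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        records.append(record)
    return records

# Illustrative sample data, not taken from the real training file
sample = '{"input": "What is 2 + 2?", "label": "4"}\n\n{"input": "Say hi.", "label": "Hi!"}\n'
records = parse_json_lines(sample)
print(len(records), records[0]["input"])
```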


In [5]:
# Upsample to 10,000 examples by drawing records uniformly at random, with replacement
upsampled_list = [np.random.choice(data_list) for _ in range(10000)]
print(len(upsampled_list))

10000
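
Drawing with replacement, as in the cell above, gives a different sample on every run. If reproducibility matters, one option is to seed the generator. The sketch below uses the standard library's `random` module on toy records rather than the notebook's `np.random.choice`, and the helper name `upsample` is hypothetical.

```python
import random

def upsample(records, n, seed=0):
    """Draw n records uniformly at random, with replacement, reproducibly."""
    rng = random.Random(seed)  # local generator so global random state is untouched
    return rng.choices(records, k=n)

# Toy records for illustration only
toy_records = [{"input": "q1", "label": "a1"}, {"input": "q2", "label": "a2"}]
sample_a = upsample(toy_records, 10)
sample_b = upsample(toy_records, 10)
print(len(sample_a), sample_a == sample_b)
```

Two calls with the same seed return the same sample, so a training run can be repeated on identical data.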


In [6]:
from huggingface_hub import notebook_login
notebook_login()

```
data_list = [
    {"instruction": "### Human: Please tell me a joke.", "response": "Assistant: Why don't scientists trust atoms? Because they make up everything!"},
    {"instruction": "### Human: Translate 'hello' to French.", "response": "Assistant: Bonjour"},
    {"instruction": "### Human: What is the capital of Spain?", "response": "Assistant: Madrid"},
]
```

In [7]:
from trl import SFTTrainer

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


In [8]:
from datasets import load_dataset, Dataset

training_data = []
for i in upsampled_list:
  prompt = "### Human: "+i['input']
  completion = "### Assistant: "+i['label']
  training_data.append({"prompt":prompt, "completion": completion})

In [9]:
# Create a Hugging Face Dataset from a list
dataset = Dataset.from_list(training_data)

## Loading the model

In [10]:
model_id = "meta-llama/Llama-2-7b-chat-hf"

In [12]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cpu


In [13]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
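# Pass bnb_config so the base model is actually loaded in 4-bit (requires a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})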

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

In [None]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

### Zero-shot inference

In [None]:
index = 200

question = dataset[index]['prompt']
answer = dataset[index]['completion']

prompt = f"""
Answer the following question.

{question}

Answer:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{answer}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Answer the following question.

### Human: Can you answer a few questions about how grading works ?

Answer:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
### Assistant: Sure, I'll be more than happy to help.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:

Answer the following question.

### Human: Can you answer a few questions about how grading works ?

Answer:
Of course! I'd be happy to help you understand how grading works. Can you tell me a bit more about what you're looking for? Are you interested in learning about the different grading scales, or maybe you want to know more about how teachers and professors determine grades? Let me know and I'll do my best to help!


### Full fine-tuning

In [None]:
def formatting_prompts_func(x):
    output_texts = []
    for i in range(len(x['prompt'])):
        text = f"Answer the following question \n ### Question: {x['prompt'][i]}\n ### Answer: {x['completion'][i]}"
        output_texts.append(text)
    return output_texts
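
To see what `SFTTrainer` receives, the function can be applied to a small hand-built batch. The batch below is illustrative and uses the column-wise layout (a dict of lists) that the function indexes into; the function is repeated so that the snippet runs on its own.

```python
def formatting_prompts_func(x):
    """Build one training text per example from column-wise batch data."""
    output_texts = []
    for i in range(len(x['prompt'])):
        text = f"Answer the following question \n ### Question: {x['prompt'][i]}\n ### Answer: {x['completion'][i]}"
        output_texts.append(text)
    return output_texts

# Illustrative batch with a shortened completion
batch = {
    "prompt": ["### Human: Describe Bhaktapur Durbar Square."],
    "completion": ["### Assistant: A UNESCO World Heritage Site."],
}
texts = formatting_prompts_func(batch)
print(texts[0])
```

Because the `prompt` and `completion` fields already carry the `### Human:` and `### Assistant:` prefixes, the resulting text nests them inside the `### Question:` and `### Answer:` markers.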

In [None]:
print_trainable_parameters(model)

trainable params: 262410240 || all params: 3500412928 || trainable%: 7.496550989769399
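
The figures reported above can be reproduced by hand. The sketch below assumes the published Llama-2-7b dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000), that the 4-bit linear layers store two weights per element and are frozen, and that the embeddings, output head, and norm weights stay unquantized and trainable. These assumptions are inferred from the printed numbers rather than stated in the notebook.

```python
HIDDEN, MLP, LAYERS, VOCAB = 4096, 11008, 32, 32000  # assumed Llama-2-7b dimensions

attention = 4 * HIDDEN * HIDDEN          # q, k, v, o projections per layer
feed_forward = 3 * HIDDEN * MLP          # gate, up, down projections per layer
decoder_linear = LAYERS * (attention + feed_forward)

embeddings = VOCAB * HIDDEN              # input embedding matrix
output_head = VOCAB * HIDDEN             # lm_head, assumed left unquantized
norms = (2 * LAYERS + 1) * HIDDEN        # two norm layers per block plus the final one

unquantized = embeddings + output_head + norms
packed_4bit = decoder_linear // 2        # two 4-bit weights per stored element (assumption)

trainable_params = unquantized
all_params = unquantized + packed_4bit
print(f"trainable params: {trainable_params} || all params: {all_params}")
```

The reported total is therefore roughly half of the nominal 7 billion: it counts packed storage elements, not logical weights.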


In [None]:
from transformers import TrainingArguments
output_dir = "./results"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=100
)

max_seq_length = 512

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=formatting_prompts_func,
)

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

In [None]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RuntimeError: ignored

### PEFT

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, peft_config)
print_trainable_parameters(model)

trainable params: 33554432 || all params: 3533967360 || trainable%: 0.9494833591219133
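
The trainable count above is consistent with rank-64 adapters on two 4096 x 4096 projections per layer. The sketch below assumes that PEFT's default target modules for Llama are `q_proj` and `v_proj` (no `target_modules` is set in the config), and it reuses the `all params` value printed before the adapters were added.

```python
HIDDEN, LAYERS, RANK = 4096, 32, 64
TARGETS_PER_LAYER = 2                     # assumed defaults: q_proj and v_proj

# Each adapted projection gains A (RANK x in_features) and B (out_features x RANK)
params_per_target = RANK * HIDDEN + HIDDEN * RANK
lora_params = LAYERS * TARGETS_PER_LAYER * params_per_target

base_params = 3500412928                  # "all params" printed before adding LoRA
all_params = base_params + lora_params
trainable_pct = 100 * lora_params / all_params
print(f"trainable params: {lora_params} || all params: {all_params} || trainable%: {trainable_pct}")
```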


## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
from transformers import TrainingArguments
output_dir = "./results"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-3,
    logging_steps=10,
    max_steps=100
)

max_seq_length = 512

Finally, pass everything to the trainer. Because the dataset is in prompt-completion format, we also pass a formatting function that turns each example into a single training text.

In [None]:
from trl import SFTTrainer


trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=formatting_prompts_func,
)

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10 | 2.8275 |
| 20 | 1.5584 |
| 30 | 0.8046 |
| 40 | 0.3471 |
| 50 | 0.1598 |
| 60 | 0.1027 |
| 70 | 0.0828 |
| 80 | 0.0714 |
| 90 | 0.0649 |
| 100 | 0.065 |


TrainOutput(global_step=100, training_loss=0.6084123599529266, metrics={'train_runtime': 3194.0751, 'train_samples_per_second': 0.501, 'train_steps_per_second': 0.031, 'total_flos': 3152687269969920.0, 'train_loss': 0.6084123599529266, 'epoch': 0.16})
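
The metrics above follow from the training arguments. The sketch below recomputes them with plain arithmetic, assuming a single GPU, so that the effective batch size is the per-device batch size times the gradient accumulation steps.

```python
per_device_batch = 4
grad_accum_steps = 4
max_steps = 100
dataset_size = 10_000
train_runtime_s = 3194.0751               # from the TrainOutput above

effective_batch = per_device_batch * grad_accum_steps   # single GPU assumed
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / dataset_size

# Losses logged every 10 steps, copied from the table above
logged_losses = [2.8275, 1.5584, 0.8046, 0.3471, 0.1598,
                 0.1027, 0.0828, 0.0714, 0.0649, 0.065]
mean_logged_loss = sum(logged_losses) / len(logged_losses)

print(effective_batch, samples_seen, epoch_fraction)
print(round(samples_seen / train_runtime_s, 3), round(max_steps / train_runtime_s, 3))
print(round(mean_logged_loss, 4))
```

With an effective batch of 16, the 100 steps cover 1,600 of the 10,000 upsampled examples, which matches the reported `epoch` of 0.16.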

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [None]:
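from peft import PeftModel
# Attach the saved adapter weights to a freshly loaded base model; get_peft_model() would add a new, untrained adapter instead
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})
model = PeftModel.from_pretrained(base_model, "outputs")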

In [None]:
text = "Answer the question being asked ### Question : What are the most important features of the model?"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Answer the question being asked ### Question : What are the most important features of the model?
 Unterscheidung between the different types of models.

Answer: The most important features of a model are:

1. **Simplification**: A model simplifies a complex system or phenomenon by representing it in a more manageable
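
The prompt used above differs from the template applied during training (`Answer the following question`, then `### Question:` and `### Answer:`). A possible alternative, sketched here under the assumption that matching the training template at inference time is desirable, builds the prompt from the same pieces and leaves the answer open for the model to complete. The helper name `build_inference_prompt` is hypothetical.

```python
def build_inference_prompt(question: str) -> str:
    """Mirror the training template, stopping right after the answer marker."""
    prompt = "### Human: " + question     # same prefix as in the training data
    return f"Answer the following question \n ### Question: {prompt}\n ### Answer:"

print(build_inference_prompt("Can you answer a few questions about how grading works ?"))
```

The returned string can be passed to the tokenizer in place of `text` in the cell above.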
