**Code Credit: Hugging Face**

**Dataset Credit: https://twitter.com/Dorialexander/status/1681671177696161794**

## Fine-tune Llama-2-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the Llama-2-7b model on a single Google Colab instance and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For our experiment we need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). `wandb` is installed for experiment logging, and `einops` is only required to load Falcon models, so it is optional here.

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb


In [2]:
import pandas as pd

## Dataset



In [11]:
dp=pd.read_csv('/content/train.csv')

In [12]:
dp

| Unnamed: 0 | Questions | Answers | text |
|---|---|---|---|
| 0 | Problem in- PC, Problem type is- Does not star... | currupt profile reset | you are given details on an issue faced by our... |
| 1 | Problem in- PRINTER, Problem type is- Other - ... | Changed setting | you are given details on an issue faced by our... |
| 2 | Problem in- PC, Problem type is- Does not star... | config properly | you are given details on an issue faced by our... |
| 3 | Problem in- PRINTER, Problem type is- Other - ... | Resolved issue | you are given details on an issue faced by our... |
| 4 | Problem in- PC, Problem type is- Other - Pleas... | Changed setting | you are given details on an issue faced by our... |
| ... | ... | ... | ... |
| 49995 | Problem in- OTHER, Problem type is- Other - Pl... | Call transferred - 128205 | you are given details on an issue faced by our... |
| 49996 | Problem in- PC, Problem type is- Virus, Proble... | Call transferred - 128202 | you are given details on an issue faced by our... |
| 49997 | Problem in- PC, Problem type is- Virus, Proble... | Call transferred - 128184 | you are given details on an issue faced by our... |
| 49998 | Problem in- SCANNER, Problem type is- Other - ... | scanner settings done | you are given details on an issue faced by our... |


## Loading the model

In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False
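To see why 4-bit loading matters on a single Colab GPU, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count (about 6.74 billion for Llama-2-7b) is an approximation, and the helper name `weight_memory_gb` is made up for this illustration. The estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 6.74e9  # assumed approximate parameter count of Llama-2-7b

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # weights stored in 16-bit
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # weights stored in 4-bit (as with load_in_4bit=True)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the memory of the 16-bit weights.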


Let's also load the tokenizer below.

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


In [7]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
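To get a feel for what `r=64` and `lora_alpha=16` mean in practice, the following sketch estimates how many parameters the adapter trains. It rests on several assumptions that are not stated in the notebook: that PEFT falls back to its default Llama target modules (`q_proj` and `v_proj`) because no `target_modules` is given, and that Llama-2-7b has 32 decoder layers with a hidden size of 4096.

```python
lora_r = 64
lora_alpha = 16

hidden_size = 4096      # assumed: Llama-2-7b hidden dimension
num_layers = 32         # assumed: number of Llama-2-7b decoder layers
targets_per_layer = 2   # assumed: PEFT default targets for Llama (q_proj, v_proj)

# Each adapted linear layer gets two low-rank matrices:
# A with shape (r, d_in) and B with shape (d_out, r).
params_per_module = lora_r * (hidden_size + hidden_size)
trainable_params = params_per_module * targets_per_layer * num_layers

# The LoRA update is scaled by alpha / r before being added to the frozen weight.
scaling = lora_alpha / lora_r

adapter_mb_fp32 = trainable_params * 4 / 1e6  # 4 bytes per float32 parameter

print(f"trainable parameters: {trainable_params:,}")
print(f"LoRA scaling factor:  {scaling}")
print(f"adapter size (fp32):  ~{adapter_mb_fp32:.0f} MB")
```

Under these assumptions the adapter holds about 33.5 million parameters, roughly 134 MB in float32, which is in line with the `adapter_model.bin` size reported when the adapter is pushed to the Hub later in this notebook.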

## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets with PEFT adapters. Let's first define the training arguments.

In [8]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
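The batch size, gradient accumulation, and step count together determine how much of the dataset the run actually covers. The short calculation below makes that explicit; the single GPU and the dataset size of about 50,000 rows are assumptions about this Colab setup.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100

num_devices = 1         # assumed: a single Colab GPU
dataset_size = 50_000   # assumed: about 50,000 rows in train.csv

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total number of examples processed over the whole run.
samples_seen = effective_batch_size * max_steps

# Fraction of one pass over the dataset.
epoch_fraction = samples_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen:         {samples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.3f}")
```

With these settings the run touches only about 3% of the data, so `max_steps` would need to be raised for anything beyond a quick demonstration.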

Finally, pass everything to the trainer.

In [16]:
from datasets import Dataset

# Build a Hugging Face dataset from the question and answer columns.
dset = {
    "Human": dp["Questions"],
    "Assistant": dp["Answers"]
}
dataset = Dataset.from_dict(dset)

In [18]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="Human",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
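In the trainer cell above, `dataset_text_field="Human"` points the trainer at the question column only, so the answers are never seen during training. One possible way to train on both is to join each question and answer into a single string first. The sketch below is an illustration, not the notebook's method: the `### Human:` / `### Assistant:` template and the helper name `format_example` are assumptions, and the CSV's own `text` column may already hold a similar combined prompt.

```python
def format_example(question: str, answer: str) -> str:
    """Join one question/answer pair into a single training string."""
    return f"### Human: {question.strip()} ### Assistant: {answer.strip()}"


# One row taken from the dataset preview in this notebook.
questions = [
    "Problem in- PRINTER, Problem type is- Other - Please Specify, "
    "Problem description is- Printer off line, Name of Equipment- PC GEN, "
    "Service Company is- WIPRO",
]
answers = ["Changed setting"]

texts = [format_example(q, a) for q, a in zip(questions, answers)]
print(texts[0])
```

The resulting list could then be wrapped with `Dataset.from_dict({"text": texts})` and selected with `dataset_text_field="text"`.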

We also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [19]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() converts parameters in place, so no reassignment is needed.
        module.to(torch.float32)

## Train the model

Now let's train the model! Simply call `trainer.train()`.

In [20]:
trainer.train()

| Step | Training Loss |
|---|---|
| 10 | 3.8876 |
| 20 | 3.4234 |
| 30 | 1.9829 |
| 40 | 1.0878 |
| 50 | 0.7455 |
| 60 | 1.9716 |
| 70 | 1.2153 |
| 80 | 1.0031 |
| 90 | 0.9111 |
| 100 | 0.8751 |


TrainOutput(global_step=100, training_loss=1.7103435611724853, metrics={'train_runtime': 712.9108, 'train_samples_per_second': 2.244, 'train_steps_per_second': 0.14, 'total_flos': 1568544198819840.0, 'train_loss': 1.7103435611724853, 'epoch': 0.03})
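The summary numbers in `TrainOutput` can be cross-checked against the logged losses and the training configuration. The snippet below is a small sanity check, assuming each logged value is the average over the preceding 10 steps and that the run used one GPU with batch size 4 and 4 gradient accumulation steps.

```python
# Losses logged every 10 steps during the run above.
logged_losses = [3.8876, 3.4234, 1.9829, 1.0878, 0.7455,
                 1.9716, 1.2153, 1.0031, 0.9111, 0.8751]

# Averaging the equally sized logging windows gives the overall mean training loss.
mean_loss = sum(logged_losses) / len(logged_losses)

train_runtime = 712.9108   # seconds, from TrainOutput
global_step = 100
samples_seen = global_step * 4 * 4  # steps * batch size * accumulation steps

samples_per_second = samples_seen / train_runtime
steps_per_second = global_step / train_runtime

print(f"mean training loss: {mean_loss:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.2f}")
```

The mean agrees with the reported `training_loss` to about four decimal places. Note that this average is dominated by the high losses of the first 20 steps; the loss toward the end of the run is closer to 0.9.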

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [21]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [22]:
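# LoraConfig.from_pretrained reads only the saved adapter config, not the trained weights;
# PeftModel.from_pretrained(base_model, "outputs") is the PEFT call that reloads saved weights.
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)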

In [24]:
dataset['Human']

['Problem in- PC, Problem type is- Does not start, Problem description is- pc not lofin trust error  show, Name of Equipment- HIGHER RAM HP PC, Service Company is- WIPRO',
 'Problem in- PRINTER, Problem type is- Other - Please Specify, Problem description is- Printer off line, Name of Equipment- PC GEN, Service Company is- WIPRO',
 'Problem in- PC, Problem type is- Does not start, Problem description is- new pc configration and domain add, Name of Equipment- PC GEN, Service Company is- WIPRO',
 'Problem in- PRINTER, Problem type is- Other - Please Specify, Problem description is- Printer off line, Name of Equipment- PC GEN, Service Company is- WIPRO',
 'Problem in- PC, Problem type is- Other - Please Specify, Problem description is- .lst file not open, Name of Equipment- PC GEN, Service Company is- WIPRO',
 'Problem in- OTHER, Problem type is- Other - Please specify, Problem description is- Intranet is not working. , Name of Equipment- pc, Service Company is- HCL',
 'Problem in- NETWOR

In [27]:
text = "Problem in- OTHER, Problem type is- Other - Please specify, Problem description is- Printer is not working with new HP computer (they are not sy, Name of Equipment- Printer no 20/11323, Service Company is- NETWORK ### Assistant: Call transferred - 183109"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Problem in- OTHER, Problem type is- Other - Please specify, Problem description is- Printer is not working with new HP computer (they are not sy, Name of Equipment- Printer no 20/11323, Service Company is- NETWORK ### Assistant: Call transferred - 18310900000000000000000000000000000000000000000000000000
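The prompt in this test already contains the expected answer after `### Assistant:`, so the model can only continue it, here with a run of zeros. A more telling check is to end the prompt at the marker and keep only what the model adds. The helpers below are a sketch of that idea; the names `build_prompt` and `extract_answer` are hypothetical, and the sample completion is invented to stand in for real model output.

```python
ASSISTANT_MARKER = "### Assistant:"


def build_prompt(question: str) -> str:
    """End the prompt at the marker so the model has to produce the answer itself."""
    return f"{question.strip()} {ASSISTANT_MARKER}"


def extract_answer(decoded: str) -> str:
    """Return only the text that follows the last assistant marker."""
    _, found, tail = decoded.rpartition(ASSISTANT_MARKER)
    return tail.strip() if found else ""


question = (
    "Problem in- PC, Problem type is- Does not start, "
    "Problem description is- PC does not start, "
    "Name of Equipment- PC GEN, Service Company is- NETWORK"
)
prompt = build_prompt(question)

# Stand-in for tokenizer.decode(outputs[0], ...); the completion is invented for illustration.
decoded = prompt + " Changed setting"
print(extract_answer(decoded))
```

In the notebook, `prompt` would be passed to the tokenizer in place of `text`, and `extract_answer` would be applied to the decoded output.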


In [28]:
from huggingface_hub import login
login()


In [29]:
model.push_to_hub("llama2-qlora-finetunined-query-resolver")

adapter_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Devis2awe/llama2-qlora-finetunined-query-resolver/commit/f49659482866e77d70f63340a98a1107fbf6220f', commit_message='Upload model', commit_description='', oid='f49659482866e77d70f63340a98a1107fbf6220f', pr_url=None, pr_revision=None, pr_num=None)

In [30]:
text = "Problem in- PC, Problem type is- Does not start, Problem description is- PC does not start, Name of Equipment- PC GEN, Service Company is- NETWORK"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Problem in- PC, Problem type is- Does not start, Problem description is- PC does not start, Name of Equipment- PC GEN, Service Company is- NETWORK, Service Type is- Onsite, Service Location is- Home, Service Date is- 12/12/2018, Service Time is- 10:00 AM, Service Duration is- 1 H
