<a href="https://colab.research.google.com/github/vsingh9076/Generative_AI/blob/main/QLoRA/Sharded_Llama2_Fine_Tuning_QLora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Code Credit: Hugging Face**

**Dataset Credit: https://twitter.com/Dorialexander/status/1681671177696161794**

## Fine-tune Llama-2-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the Llama-2-7b model on a single Google Colab instance and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.

## Setup

Run the cells below to install the required libraries. For this experiment we need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last for its [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We also install `einops`, a requirement for loading Falcon models; it is not used by the Llama-2 model loaded here.

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
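
As a quick sanity check once the install cell has run, the sketch below reports which of the packages named above are installed and at what version. It uses only the standard library; the helper name `installed_versions` is illustrative and not part of any of these libraries.

```python
from importlib import metadata

PACKAGES = ["trl", "transformers", "accelerate", "peft",
            "datasets", "bitsandbytes", "einops", "wandb"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {version or 'not installed'}")
```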


## Dataset



In [2]:
from datasets import load_dataset

# Alternative dataset using the same "### Human: ... ### Assistant: ..." format:
# dataset_name = "timdettmers/openassistant-guanaco"

dataset_name = 'AlexanderDoria/novel17_test' #french novels
dataset = load_dataset(dataset_name, split="train")
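
The dataset used here stores each example as a single `text` string with `### Human:` and `### Assistant:` markers (the sample printed later in this notebook has that shape). A minimal sketch of splitting such a record into its two turns, assuming exactly one marker pair per record; `split_turns` is a hypothetical helper, not part of `datasets`:

```python
HUMAN = "### Human:"
ASSISTANT = "### Assistant:"

def split_turns(text):
    """Return (prompt, response) from a '### Human: ... ### Assistant: ...' record."""
    if not text.startswith(HUMAN) or ASSISTANT not in text:
        raise ValueError("record does not follow the expected format")
    prompt, response = text[len(HUMAN):].split(ASSISTANT, 1)
    return prompt.strip(), response.strip()

record = "### Human: Écrire un texte dans un style baroque.### Assistant: Si j'en luis éton."
prompt, response = split_turns(record)
print(prompt)    # Écrire un texte dans un style baroque.
print(response)  # Si j'en luis éton.
```

A check like this is a cheap way to spot malformed records before they reach the trainer.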


## Loading the model

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False
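
To see why 4-bit loading matters on a single Colab GPU, here is a rough weight-memory estimate. It assumes a nominal 7 billion parameters and counts weights only; quantization constants, activations, LoRA weights and optimizer state are ignored, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: nominal size of a "7B" model

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>20}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of the 16-bit checkpoint.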


Let's also load the tokenizer.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


In [5]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)
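
A back-of-the-envelope count of what this configuration trains. The sketch assumes the adapters attach to `q_proj` and `v_proj` (the `target_modules` shown when the saved config is printed later in this notebook) and that Llama-2-7B has 32 decoder layers with 4096 × 4096 projections for both; those architecture numbers are not stated in this notebook.

```python
def lora_param_count(r, in_features, out_features):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted linear layer."""
    return r * (in_features + out_features)

R, LORA_ALPHA = 64, 16               # from the LoraConfig above
HIDDEN_SIZE, NUM_LAYERS = 4096, 32   # assumption: Llama-2-7B architecture
TARGETS_PER_LAYER = 2                # assumption: q_proj and v_proj

per_module = lora_param_count(R, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_module)  # 524288
print(total)       # 33554432
print(scaling)     # 0.25
```

Roughly 33.6 million trainable parameters is well under 1% of the base model, which is what makes training feasible on one GPU.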

## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets with PEFT adapters. First, define the training arguments.

In [6]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=100,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)
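
The batch-related arguments combine as follows. A small sketch, assuming a single GPU (`num_devices` is not a `TrainingArguments` field, just a name used here):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
num_devices = 1  # assumption: one GPU

# Sequences contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on sequences processed over the whole run.
nominal_sequences = effective_batch_size * max_steps

print(effective_batch_size)  # 16
print(nominal_sequences)     # 1600
```

The second figure is an upper bound on sequences processed, not a count of distinct examples; a dataset smaller than the effective batch yields fewer.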

Finally, pass everything to the trainer.

In [7]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)






We also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [8]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() converts parameters in place, so no reassignment is needed
        module.to(torch.float32)
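
The motivation for upcasting can be illustrated without a GPU. The sketch below uses Python's `struct` half-precision format as a stand-in for `torch.float16` and shows two limits that matter for normalization layers, which square and average activations: small increments are rounded away, and values above 65504 cannot be represented. The helper `to_fp16` is illustrative only.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Near 1.0 the spacing between half-precision values is 2**-10,
# so a small increment is lost entirely.
print(to_fp16(1.0 + 0.0004))  # 1.0

# The largest finite half-precision value is 65504; squaring an activation
# of 300 already exceeds it. struct raises here, whereas a float16 tensor
# would typically hold inf instead.
try:
    to_fp16(300.0 * 300.0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```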

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [9]:
trainer.train()



| Step | Training Loss |
| --- | --- |
| 10 | 0.871 |
| 20 | 0.4571 |
| 30 | 0.0792 |
| 40 | 0.0147 |
| 50 | 0.011 |
| 60 | 0.0103 |
| 70 | 0.0095 |
| 80 | 0.007 |
| 90 | 0.0006 |
| 100 | 0.0002 |


TrainOutput(global_step=100, training_loss=0.14605359879671595, metrics={'train_runtime': 152.6977, 'train_samples_per_second': 10.478, 'train_steps_per_second': 0.655, 'total_flos': 1207315265126400.0, 'train_loss': 0.14605359879671595, 'epoch': 100.0})
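
The logged losses can be cross-checked against the reported `training_loss`. Each logged value covers 10 steps (`logging_steps=10`), so, assuming each is the mean loss over its window, their average should reproduce the overall mean up to rounding:

```python
# Values copied from the training log and TrainOutput above.
logged_losses = [0.871, 0.4571, 0.0792, 0.0147, 0.011,
                 0.0103, 0.0095, 0.007, 0.0006, 0.0002]
reported_training_loss = 0.14605359879671595

mean_logged = sum(logged_losses) / len(logged_losses)
print(round(mean_logged, 5))  # 0.14606

# The small gap comes from the logged values being rounded to 4 decimals.
assert abs(mean_logged - reported_training_loss) < 1e-4
```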

During training, the loss should fall steadily, much like this reference curve (the image file name indicates it comes from a Falcon-7B run):

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [10]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [12]:
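lora_config = LoraConfig.from_pretrained('outputs')
# The trainer already wrapped the model with the LoRA adapters, so reuse it;
# calling get_peft_model() on the model a second time raised an AttributeError in this run.
model = trainer.model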

In [13]:
lora_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='TinyPixel/Llama-2-7B-bf16-sharded', revision=None, task_type='CAUSAL_LM', inference_mode=True, r=64, target_modules={'q_proj', 'v_proj'}, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

In [14]:
dataset['text']

["### Human: Écrire un texte dans un style baroque, utilisant le langage et la syntaxe du 17ème siècle, mettant en scène un échange entre un prêtre et un jeune homme confus au sujet de ses péchés.### Assistant: Si j'en luis éton. né ou empêché ce n'eſt pas ſans cauſe vů que ſouvent les liommes ne ſçaventque dire non plus que celui de tantôt qui ne ſçavoit rien faire que des civiéresVALDEN: Jefusbien einpêché confeſſant un jour un jeune Breton Vallonqui enfin de confeſſion me dit qu'il avoit beſongné une civiere . Quoilui dis je mon amice peché n'eſt point écrit au livre Angeli que d'enfernommé la ſommedes pechez ,qui eſt le livre le plus déteſtable qui fut jamais fait& le plus blafphematoire d'autant qu'il eſt dédié à la plus femme de bien je ne ſçai quelle penitence te donner ; mais non mon amiquel goûty prenois-tu ? Mon fieur bon & delectable. Quoi!"]

In [15]:
text = "Écrire un texte dans un style baroque sur la glace et le feu ### Assistant: Si j'en luis éton"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Écrire un texte dans un style baroque sur la glace et le feu ### Assistant: Si j'en luis éton.### Jefusbien écrien ou de fois ni de lieu.### Jefusbien peché de la nature ou de lieu.### Jefusbien écrien de la peché ou de lieu.##
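
The generation above keeps emitting `###` markers after the first reply. A minimal post-processing sketch, assuming the `### Human:` / `### Assistant:` convention from the training data; `build_prompt` and `extract_reply` are hypothetical helpers, not part of `transformers`:

```python
ASSISTANT = "### Assistant:"

def build_prompt(instruction, reply_start=""):
    """Format an instruction the way the training records are formatted."""
    return f"### Human: {instruction}{ASSISTANT} {reply_start}".rstrip()

def extract_reply(decoded):
    """Keep the text after the first assistant marker, up to the next '###'."""
    reply = decoded.split(ASSISTANT, 1)[1]
    return reply.split("###", 1)[0].strip()

decoded = "Écrire un texte ### Assistant: Si j'en luis éton.### Jefusbien écrien ou de fois"
print(extract_reply(decoded))  # Si j'en luis éton.
print(build_prompt("Écrire un texte dans un style baroque."))
```

Building the prompt with the same markers as the training records keeps inference inputs consistent with what the adapters saw during fine-tuning.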


In [19]:
from huggingface_hub import login
login()


In [20]:
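# Push only the LoRA adapters held by the trainer's PEFT model;
# calling push_to_hub() on the 4-bit base model raised a NotImplementedError in this run.
trainer.model.push_to_hub("llama2-qlora-finetunined-french")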