<a href="https://colab.research.google.com/github/yaseen2402/deep-learning/blob/main/qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U datasets fsspec

In [None]:
import os
os.kill(os.getpid(), 9)  # Kill the kernel process so Colab restarts the runtime and loads the upgraded packages

In [1]:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth



In [2]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # pre-quantized 4-bit checkpoint, so the download is smaller and faster
    max_seq_length = 2048,
    dtype = None,  # None lets Unsloth pick the dtype (float16 on this T4, which lacks bfloat16)
    load_in_4bit = True,  # 4-bit base weights: the "Q" in QLoRA
    token = userdata.get('HF_ACCESS_TOKEN'),
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.52.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
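
A rough check on why the 4-bit checkpoint matters on this GPU. The sketch below estimates memory for the weights only. It assumes a rounded count of 8 billion parameters and ignores quantization constants, activations, optimizer state, and the CUDA context, so the numbers are lower bounds. `weight_memory_gib` is a hypothetical helper for illustration, not part of Unsloth.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

NUM_PARAMS = 8_000_000_000  # assumption: rounded parameter count of the 8B model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # unquantized half-precision weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone come close to or exceed the 14.741 GB the T4 reports, while the 4-bit weights leave room for adapters, activations, and optimizer state.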


In [3]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France"},
]

formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted_text)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France<|eot_id|><|start_header_id|>assistant<|end_header_id|>
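
To make the printed layout easier to read, here is a hand-written approximation of it in plain Python. It is not the tokenizer's actual template: `render_chat` and `header` are hypothetical helpers, the date lines are copied from the output above, and stripping whitespace from message content is an assumption. Tool calls and other template features are not handled.

```python
# Date lines copied verbatim from the printed output above.
DATE_PREAMBLE = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n"

def header(role: str) -> str:
    """Role header that opens every turn."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

def render_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3.1 chat layout for a list of role/content dicts."""
    messages = list(messages)  # shallow copy so the caller's list is not modified
    system_text = ""
    if messages and messages[0]["role"] == "system":
        system_text = messages.pop(0)["content"].strip()
    # A system block is always emitted, even when no system message is given.
    out = "<|begin_of_text|>" + header("system") + DATE_PREAMBLE + system_text + "<|eot_id|>"
    for message in messages:
        out += header(message["role"]) + message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply.
        out += header("assistant")
    return out

rendered = render_chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France"},
    ],
    add_generation_prompt=True,
)
print(rendered)
```

With `add_generation_prompt=False` the string ends at the last `<|eot_id|>`, which is the form used for training text.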




In [4]:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Set to True only if you added new special tokens to the tokenizer.
train_embeddings = False

if train_embeddings:
  # Training both "lm_head" and "embed_tokens" runs out of memory on Colab:
  # target_modules = target_modules + ["lm_head", "embed_tokens"]
  # so on Colab with new tokens, train only "lm_head" instead.
  target_modules = target_modules + ["lm_head"]

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # rank of the LoRA matrices; the LoRA paper reports little quality loss at relatively low ranks
    target_modules = target_modules,  # which linear layers of the LLM get LoRA adapters
    lora_alpha = 16,  # adapter output is scaled by lora_alpha / r; 16 was recommended on Reddit
    lora_dropout = 0,  # the tutorial default was 0.05, but Unsloth says 0 is the optimized setting
    bias = "none",  # "none" is the optimized setting
    use_gradient_checkpointing = "unsloth",  # Unsloth's checkpointing; lowers VRAM use, useful for very long contexts
    random_state = 3407,
    use_rslora = False,  # rank-stabilized LoRA scales by lora_alpha / sqrt(r) instead of lora_alpha / r; Hugging Face says it works better
    loftq_config = None,  # LoftQ initialization is not used
)

Unsloth 2025.5.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
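
A quick sanity check on what these settings add to the model. The sketch assumes the usual Llama 3.1 8B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336), which this notebook does not print. `lora_param_count` and `lora_scale` are hypothetical helpers, not PEFT or Unsloth functions.

```python
import math

# Assumed Llama 3.1 8B shapes (not printed anywhere in this notebook).
HIDDEN = 4096
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336
NUM_LAYERS = 32       # matches "patched 32 layers" in the output above

# (in_features, out_features) of each targeted linear layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r, shapes=SHAPES, num_layers=NUM_LAYERS):
    """Each adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

def lora_scale(lora_alpha, r, use_rslora=False):
    """Factor applied to the adapter output: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

total = lora_param_count(r=16)
print(f"trainable LoRA parameters: {total:,}")
print("scale (standard):", lora_scale(16, 16))
print("scale (rsLoRA)  :", lora_scale(16, 16, use_rslora=True))
```

Under these assumptions the count is 41,943,040, the same trainable-parameter figure the trainer reports when training starts. With `r = 16` and `lora_alpha = 16` the standard scale is 1.0, whereas rsLoRA would give 4.0.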


In [6]:
from datasets import load_dataset

def formatting_prompts_func(examples):
  # With batched=True, examples["conversations"] is a list of conversations,
  # each one a list of {"role": ..., "content": ...} messages.
  convos = examples["conversations"]
  # The chat template already closes every turn with <|eot_id|>, so no EOS token is appended by hand.
  texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
  return {"text": texts}

dataset = load_dataset("pookie3000/pg_chat", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

pg_chat_combined.jsonl:   0%|          | 0.00/625k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/484 [00:00<?, ? examples/s]

Map:   0%|          | 0/484 [00:00<?, ? examples/s]

In [7]:
for i, sample in enumerate(dataset):
  print(f"\n----- Sample {i+1} -----")
  print(sample["text"])
  if i>2:
    break


----- Sample 1 -----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Nice to meet you! My name is Paul Graham, and I'm delighted to make your acquaintance.<|eot_id|>

----- Sample 2 -----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What's your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Nice to meet you! My name is Paul Graham, and I'm delighted to make your acquaintance.<|eot_id|>

----- Sample 3 -----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your name?<|eot_id|><|start_header_id|>assistant<|end_h

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",  # column produced by formatting_prompts_func
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 5,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/484 [00:00<?, ? examples/s]
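
The `warmup_steps`, `learning_rate`, and `lr_scheduler_type = "linear"` settings together define a ramp-up followed by a linear decay. Below is a minimal sketch of that shape. `linear_warmup_lr` is a hypothetical helper that mirrors the usual linear-with-warmup rule and does not call the trainer's own scheduler. The default `total_steps = 305` is the value the trainer logs once training starts.

```python
def linear_warmup_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=305):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 155, 305):
    print(f"step {step:>3}: lr = {linear_warmup_lr(step):.2e}")
```

The peak rate of 2e-4 is reached at step 5, and half of it (1e-4) at step 155, midway through the decay.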

In [9]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 484 | Num Epochs = 5 | Total steps = 305
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.0843
2,2.1336
3,2.3462
4,1.9373
5,1.8643
6,1.8088
7,1.6509
8,1.5675
9,1.4699
10,1.4233
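
The step count in the banner follows from the dataset size and the batch settings. The sketch below reproduces the arithmetic. `total_optimizer_steps` is a hypothetical helper, and rounding a trailing partial accumulation group up to a full step is an assumption, chosen because it matches the logged total.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer steps for a run, counting a partial final accumulation group as one step."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

effective_batch = 2 * 4 * 1  # per-device batch x accumulation steps x GPUs
steps = total_optimizer_steps(num_examples=484, per_device_batch=2, grad_accum=4, epochs=5)

print("effective batch size:", effective_batch)
print("total steps:", steps)
```

With 484 examples this gives 242 batches and 61 optimizer steps per epoch, so 305 steps over 5 epochs.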


In [10]:
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "How to become rich",
    },
]

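inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # appends the assistant header so the model starts its reply
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
# Only input_ids are passed, so generate() warns that the attention mask is not set.
# That is harmless for a single unpadded prompt; pass attention_mask explicitly when batching padded prompts.
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)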

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

How to become rich<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"Start a startup, kid! That's the way to become rich. I'm not saying it's easy, but it's the best way. Don't waste your time with get-rich-quick schemes or trying to get promoted at some boring company. If you want to be rich, you need to think big, take risks, and be willing to put in the hard work. And don't worry if people call you crazy - the most successful startups are often the ones that people don't understand at first. Just remember, it's not for everyone, and you need to be willing to put in the effort. But if you're


In [1]:
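# Run this in the same session as training: `model`, `tokenizer`, and `userdata` must still be defined.
# The NameError below most likely comes from running this cell after a runtime restart (note the In [1] counter).
model.push_to_hub_gguf(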
    "Meta-Llama-3.1-8B-q4_k_m-paul-graham-guide-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
)

NameError: name 'model' is not defined