Big thanks to Daniel Han & the unsloth team.

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!





# Install Packages


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
#%%capture
## Installs Unsloth, Xformers (Flash Attention) and all other packages!
#!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
#!pip install --no-deps "xformers==0.0.27" trl peft accelerate bitsandbytes
## !pip install xformers trl peft accelerate bitsandbytes

In [1]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 Laptop GPU. Max memory: 7.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [2]:
# !pip install unsloth

# Load Model

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0! Suggested 8, 16, 32, 64, 128 (changed from 16)
#    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
#                      "gate_proj", "up_proj", "down_proj",],
    target_modules = ["q_proj", "k_proj", "v_proj"],
    lora_alpha = 16, # the adapter update is scaled by lora_alpha / r (here 16 / 8 = 2)
    lora_dropout = 0, #0.05, #0 until 4/18/25 # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.12 patched 32 layers with 32 QKV layers, 0 O layers and 0 MLP layers.
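
As a rough check on how few weights LoRA trains, the adapter size for this configuration can be worked out by hand. The sketch below assumes a hidden size of 3072 for Phi-3.5-mini and square `q_proj` / `k_proj` / `v_proj` projections (both are assumptions, not values read from the loaded model); the layer count of 32 comes from the patch message above. Each adapted linear layer gains two small matrices, `A` (r x in_features) and `B` (out_features x r).

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


hidden_size = 3072          # ASSUMPTION: hidden size of Phi-3.5-mini
num_layers = 32             # from the "patched 32 layers" message
target_modules = ["q_proj", "k_proj", "v_proj"]
r = 8
lora_alpha = 16

per_module = lora_param_count(hidden_size, hidden_size, r)
total = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r    # factor applied to the adapter update (use_rslora = False)

print(f"Trainable LoRA parameters: {total:,}")  # Trainable LoRA parameters: 4,718,592
print(f"LoRA scaling factor: {scaling}")        # LoRA scaling factor: 2.0
```

Under those assumptions the total agrees with the number of trainable parameters that the trainer reports when training starts.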


<a name="Data"></a>
# Data Prep
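We now use the `Phi-3` format for conversation-style finetunes. The Unsloth template this notebook is based on uses [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style; the cell below instead loads a local JSON file whose `messages` column is passed directly to the chat template. Phi-3 renders multi-turn conversations like below: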

```
<|user|>
Hi!<|end|>
<|assistant|>
Hello! How are you?<|end|>
<|user|>
I'm doing great! And you?<|end|>

```
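
To make the layout concrete, here is a minimal pure-Python sketch that reproduces the format shown above. `render_phi3` is a hypothetical helper written only for illustration; in the notebook the rendering is done by `tokenizer.apply_chat_template`, which may differ in details such as special-token handling.

```python
def render_phi3(messages, add_generation_prompt=False):
    """Render role/content messages in the Phi-3 layout shown above.

    Hypothetical helper for illustration; not the tokenizer's own template.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How are you?"},
    {"role": "user", "content": "I'm doing great! And you?"},
]
print(render_phi3(conversation))
```

Passing `add_generation_prompt=True` appends an open `<|assistant|>` turn, which is what the inference cells rely on.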

In [3]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
   # mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass


from datasets import load_dataset
#dataset = load_dataset("json", data_files="/content/drive/MyDrive/NPC_Research/chandler_conversations3.json", split="train")
#dataset = load_dataset("json", data_files="../../Datasets/chandler_conversations3.json", split="train")
dataset = load_dataset("json", data_files="../../Datasets/rachel_conversations.json", split="train")

dataset = dataset.map(formatting_prompts_func, batched = True,)

In [4]:
dataset

Dataset({
    features: ['messages', 'text'],
    num_rows: 267
})
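
The JSON file itself is not shown, so the following is only a sketch of the record shape that `formatting_prompts_func` appears to expect: one `messages` list per row, each entry a dict with `role` and `content`. The helper `validate_record` and the sample record are hypothetical; the sample uses placeholder text, not a row from the dataset.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one dataset row (empty list means OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not a dict")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


# Placeholder record in the assumed shape.
sample = {
    "messages": [
        {"role": "system", "content": "You are Rachel Green from the TV show Friends."},
        {"role": "user", "content": "placeholder user line"},
        {"role": "assistant", "content": "placeholder reply"},
    ]
}
print(validate_record(sample))  # []
```

Running a check like this over every row before `dataset.map(...)` can surface malformed conversations early, instead of as a template error mid-run.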

In [5]:
#dataset = dataset.shuffle(seed=42).select(range(3000))

In [6]:
#dataset[0]["messages"]

In [7]:
print(dataset[5]["text"])

<|system|>
You are Rachel Green from the TV show Friends. Respond to questions and engage in conversations in her signature style.<|end|>
<|user|>
I wanted this to work so much. I mean I'm still in there, changing his diapers, pickin' his fleas... but he's just phoning it in. Just so hard to accept the fact that something you love so much doesn't love you back.<|end|>
<|assistant|>
...I think that bitch cracked my tooth.<|end|>



In [8]:
#for d in dataset:
#    if "\"" in d["text"]:
#        print(d["text"])
#        break

In [9]:
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"


if False:
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )
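
The custom template above is disabled (`if False:`), so its output never appears in this notebook. As an illustration only, the sketch below mirrors the Jinja logic in plain Python to show what it would render. `render_unsloth_template` is a hypothetical helper, and the `<s>` / `</s>` tokens are placeholder BOS/EOS values, not the tokens of any particular tokenizer.

```python
def render_unsloth_template(messages, bos_token="<s>", eos_token="</s>",
                            add_generation_prompt=False):
    """Plain-Python mirror of the Jinja template above (illustration only)."""
    out = bos_token + "You are a helpful assistant to the user\n"
    for message in messages:
        if message["role"] == "user":
            out += ">>> User: " + message["content"] + "\n"
        elif message["role"] == "assistant":
            # Only assistant turns are terminated with the EOS token.
            out += ">>> Assistant: " + message["content"] + eos_token + "\n"
        # Any other role (for example "system") is skipped, as in the template.
    if add_generation_prompt:
        out += ">>> Assistant: "
    return out


print(render_unsloth_template(
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]
))
```

Note that this template ignores `system` messages, so the character prompts used elsewhere in the notebook would be dropped if it were enabled unchanged.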

<a name="Train"></a>
# Train the model


In [10]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.

    args = TrainingArguments(
        per_device_train_batch_size = 2, # can be 1: slower, noisier training, but may capture more personality
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, #max_steps = 60,
        learning_rate = 5e-5, #2e-4
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Tokenizing train dataset (num_proc=2):   0%|          | 0/267 [00:00<?, ? examples/s]


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
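
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up linearly from 0 to `5e-5` and then decays linearly back to 0 by the last step. The sketch below is a plain-Python approximation of that shape, not the scheduler object the trainer builds; `linear_schedule_lr` is a hypothetical helper, and the total of 33 steps is taken from the banner printed by `trainer.train()`.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=5, total_steps=33):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 19, 33):
    print(f"step {step:2d}: lr = {linear_schedule_lr(step):.2e}")
```

With only 33 steps, the warmup takes up roughly the first 15% of training, which is worth keeping in mind when changing `num_train_epochs` or the batch size.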


In [11]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070 Laptop GPU. Max memory = 7.996 GB.
2.143 GB of memory reserved.


In [12]:
#!pip uninstall -y xformers
#!pip install "xformers==0.0.27"
#import xformers

In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 267 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 33
 "-____-"     Number of trainable parameters = 4,718,592


Step,Training Loss
1,3.8934
2,4.1125
3,4.0222
4,3.891
5,4.1071
6,3.8308
7,3.913
8,3.7784
9,4.1461
10,3.8039
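
The step count in the banner can be reproduced from the dataset size and the batch settings. The sketch below assumes the step count is derived by rounding the number of batches up and the number of optimizer updates per epoch down; that rounding is an assumption that happens to match the 33 steps reported here and may differ between library versions. `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, per_device_batch_size,
                         gradient_accumulation_steps, num_epochs, num_gpus=1):
    """Estimate optimizer steps for a run (rounding behaviour is an assumption)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


effective_batch_size = 2 * 4 * 1  # per-device batch * accumulation steps * GPUs
steps = total_training_steps(267, 2, 4, 1)
print(effective_batch_size, steps)  # 8 33
```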


In [14]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

102.3152 seconds used for training.
1.71 minutes used for training.
Peak reserved memory = 2.582 GB.
Peak reserved memory for training = 0.439 GB.
Peak reserved memory % of max memory = 32.291 %.
Peak reserved memory for training % of max memory = 5.49 %.


<a name="Inference"></a>
# Inference
Let's run the model! You can change the system prompt and the user message; the assistant's reply is what the model generates.



In [13]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    #mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

#messages = [
#    {"role": "system", "content": "You are Ross."},
#    {"role": "user", "content": "What is bigger sun or the moon?"},
#]
messages = [
    {"role": "system", "content": "You are Rachel from the TV show Friends. Respond to questions and engage in conversations in her signature style."},
    {"role": "user", "content": "What is bigger sun or the moon?"},
    #    {"role": "user", "content": "Are you an AI?"},
]

   # {"role": "assistant", "content": "Well, the moon is bigger than the sun."},
   # {"role": "user", "content": "Why?"},
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

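# eos_token_id = 32007 is expected to be the id of the <|end|> token, so generation stops at the end of the assistant turn
outputs = model.generate(input_ids = inputs, max_new_tokens = 100, use_cache = True, eos_token_id=32007) #, repetition_penalty=1.01,do_sample=True, top_p=0.9)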
tokenizer.batch_decode(outputs)

print(tokenizer.batch_decode(outputs))
xx = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(xx, skip_special_tokens=True))

["<|system|> You are Rachel from the TV show Friends. Respond to questions and engage in conversations in her signature style.<|end|><|user|> What is bigger sun or the moon?<|end|><|assistant|> Oh, darling, they're both pretty big, but the sun is a whole lot bigger. The sun is like a giant ball of fire, and it's about 109 times the diameter of the moon. The moon, on the other hand, is about 1/4 the diameter of the sun. So, while the moon is pretty big, the sun is a whole lot bigger.<|end|>"]
Oh, darling, they're both pretty big, but the sun is a whole lot bigger. The sun is like a giant ball of fire, and it's about 109 times the diameter of the moon. The moon, on the other hand, is about 1/4 the diameter of the sun. So, while the moon is pretty big, the sun is a whole lot bigger.


In [15]:
for name, param in model.named_parameters():
    if 'lora' in name.lower():  # Freeze LoRA layers
        param.requires_grad = False

In [14]:
#sentence = f"<s>Your sentence goes here. </s> {tokenizer.eos_token}"
#tokenized_output = tokenizer(sentence)

# Print tokenized ids (numerical representation of the tokens)
#print("Token IDs:", tokenized_output.input_ids)

# convert_ids_to_tokens expects token ids, not decoded strings
# (passing the batch_decode output raises "ValueError: invalid literal for int()")
tokens = tokenizer.convert_ids_to_tokens(outputs[0].tolist())
print("Tokens:", tokens)
# Tokens of the generated reply only. The original cell used an undefined name `t2`;
# `xx` (sliced from `outputs` in the inference cell above) is assumed to be what was meant.
tokens = tokenizer.convert_ids_to_tokens(xx.tolist())
print("Tokens:", tokens)


In [26]:
tokenizer.pad_token_id = 32007

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [29]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
messages = [
    {"role": "system", "content": "You are Chandler Bing from Friends. Talk like him with his sarcasm."},
    {"role": "user", "content": "Tell me a story with sarcasm?"},
]
messages = [
    {"role": "system", "content": "You are Monica from the TV show Friends. Respond to questions and engage in conversations in her signature style."},
    {"role": "user", "content": "Tell me a story"},
    #    {"role": "user", "content": "Are you an AI?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 428, use_cache = True, eos_token_id=32007, temperature = .8)

Alright, here goes.

Once upon a time, in the bustling city of New York, there was a group of friends. There was Monica, a waitress at a diner, Joey, a struggling actor, Chandler, a misfit with a penchant for pranks, and Ross, a nerdy guy with a heart of gold.

Their lives were intertwined, and they were each other's support system. They navigated through the ups and downs of life, the highs of love, and the lows of heartbreak.

Their story was a roller coaster ride, filled with laughter, tears, and a lot of heart.

And that, my dear, is a story of friendship, love, and life.<|end|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [15]:
#model.save_pretrained("lora_model_phi3_chandler_test2") # Local saving
model.save_pretrained("lora_model_phi3_rachel_2") # Local saving
#tokenizer.save_pretrained("lora_model_tokenizer_phi3_chandler_test2") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
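
To use the saved adapters later, they can be loaded back through the same `FastLanguageModel.from_pretrained` entry point used at the top of this notebook, pointing `model_name` at the adapter directory. The sketch below wraps that in a hypothetical helper, `load_saved_adapter`; it assumes the directory name from the cell above and that a tokenizer is available for it (the `tokenizer.save_pretrained` line above is commented out, so it may need to be saved alongside the adapters first).

```python
def load_saved_adapter(adapter_dir="lora_model_phi3_rachel_2",
                       max_seq_length=2048, dtype=None, load_in_4bit=True):
    """Reload LoRA adapters saved with save_pretrained (sketch, needs a GPU)."""
    # Imported lazily so that defining this helper does not require unsloth.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,          # directory written by model.save_pretrained
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer


# Usage (not executed here):
# model, tokenizer = load_saved_adapter()
```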

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

In [16]:
model.save_pretrained_merged("model", tokenizer, save_method = "lora",)

Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... Done.


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other quantization methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )