<a href="https://colab.research.google.com/github/siquick/fine-tuning-experiments/blob/main/Blue_Toon_fine_tune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/tutorial-how-to-finetune-llama-3-and-use-in-ollama

In [None]:
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes


In [None]:
from unsloth import FastLanguageModel
import torch

# This sets the maximum context length (in tokens) used during finetuning.
# Gemini, for example, supports over 1 million tokens of context, whilst Llama-3 supports 8192.
# You can select ANY number, but we recommend setting it to 2048 for testing purposes.
# Unsloth also supports very long context finetuning, and reports up to 4x longer context lengths than the best alternatives.
max_seq_length = 2048

# Keep this as None for auto-detection, or select torch.float16 (older GPUs) or torch.bfloat16 (newer GPUs).
dtype = None

# We do finetuning in 4 bit quantization.
# This reduces memory usage by 4x, allowing us to actually do finetuning in a free 16GB memory GPU.
# 4 bit quantization essentially converts weights into a limited set of numbers to reduce memory usage.
# A drawback of this is there is a 1-2% accuracy degradation.
# Set this to False on larger GPUs like H100s if you want that tiny extra accuracy.
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)
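
To make the "4x" memory claim above concrete, here is a rough back-of-the-envelope sketch. It only counts the weights themselves and assumes "8B" means exactly 8e9 parameters; quantization constants, activations, optimizer state and the LoRA adapters are all ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: "8B" is treated as exactly 8e9 parameters for illustration.
num_params = 8e9

fp16_gib = weight_memory_gib(num_params, 16)  # 16-bit weights
int4_gib = weight_memory_gib(num_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"Reduction:      {fp16_gib / int4_gib:.0f}x")
```

Under these assumptions the 16-bit weights alone would nearly fill a 16GB GPU, which is why 4-bit loading matters on the free Colab tier.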


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    # The rank of the finetuning process. A larger number uses more memory and will be slower, but can increase accuracy on harder tasks. We normally suggest numbers from 8 (for fast finetunes) up to 128. Numbers that are too large can cause over-fitting, damaging your model's quality.
    r = 16,

    # We select all modules to finetune. You can remove some to reduce memory usage and make training faster, but we strongly advise against this. Just train on all modules!
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],

    # The scaling factor for finetuning. A larger number will make the finetune learn more about your dataset, but can promote over-fitting. We suggest setting this equal to the rank r, or double it.
    lora_alpha = 16,

    # Leave this as 0 for faster training! Dropout can reduce over-fitting, but not by much.
    lora_dropout = 0,

    # Leave this as "none" for faster and less over-fit training!
    bias = "none",

    # Options include True, False and "unsloth". We suggest "unsloth" since it reduces memory usage by an extra 30% and supports extremely long context finetunes. You can read more here: https://unsloth.ai/blog/long-context
    use_gradient_checkpointing = "unsloth",

    # The random seed. Training and finetuning need random numbers, so fixing this number makes experiments reproducible.
    random_state = 3407,

    # Advanced feature: rank-stabilized LoRA, which scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r. You can use this if you want!
    use_rslora = False,

    # Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights. Can improve accuracy somewhat, but can make memory usage explode at the start.
    loftq_config = None,
)
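
The interaction between `r`, `lora_alpha` and `use_rslora` is easier to see with numbers. The sketch below is not Unsloth code; `lora_scaling` and `lora_params` are hypothetical helpers, and the 4096 x 4096 layer shape is an assumed example size, not the real shape of any Qwen3 layer.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A."""
    if use_rslora:
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Values from the cell above: r = 16, lora_alpha = 16.
print("standard scaling:", lora_scaling(16, 16))
print("rsLoRA scaling:  ", lora_scaling(16, 16, use_rslora=True))

# Assumption: a hypothetical 4096 x 4096 projection layer.
print("adapter params:  ", lora_params(4096, 4096, 16))
print("full layer params:", 4096 * 4096)
```

With `lora_alpha` equal to `r` the standard scaling is 1, which is one way to read the suggestion to set alpha to the rank or double it.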

In [None]:
from datasets import load_dataset

dataset = load_dataset("franco334578/bluetoon")['train']
print(dataset.column_names)

from unsloth import to_sharegpt

dataset = to_sharegpt(
    dataset,
    merged_prompt="{instruction}[[\nYour input is:\n{input}]]",
    output_column_name="output",
    conversation_extension=3,  # Select more to handle longer conversations
)

from unsloth import standardize_sharegpt

dataset = standardize_sharegpt(dataset)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

chat_template = """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.

### Instruction:
{INPUT}

### Response:
{OUTPUT}"""

from unsloth import apply_chat_template

dataset = apply_chat_template(
    dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)
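
To see roughly what the two templates above do to a single row, here is a plain-Python sketch. It is only an approximation of the idea, not Unsloth's implementation: `render_merged_prompt` and `render_turn` are hypothetical helpers, the example row is made up, and the real `to_sharegpt` / `apply_chat_template` also handle multi-turn conversations, special tokens and the Ollama Modelfile. The assumption here is that a `[[...]]` section is dropped when the column it references is empty.

```python
import re


def render_merged_prompt(template: str, row: dict) -> str:
    """Fill {column} fields; drop [[...]] sections whose columns are empty."""

    def keep_or_drop(match: re.Match) -> str:
        section = match.group(1)
        columns = re.findall(r"\{(\w+)\}", section)
        return section if all(row.get(col) for col in columns) else ""

    text = re.sub(r"\[\[(.*?)\]\]", keep_or_drop, template, flags=re.DOTALL)
    return text.format(**row)


def render_turn(template: str, user: str, assistant: str) -> str:
    """Substitute one user/assistant pair into an {INPUT}/{OUTPUT} template."""
    return template.replace("{INPUT}", user).replace("{OUTPUT}", assistant)


merged_prompt = "{instruction}[[\nYour input is:\n{input}]]"
demo_template = "### Instruction:\n{INPUT}\n\n### Response:\n{OUTPUT}"

# Made-up rows, only for illustration.
with_input = {"instruction": "Translate to Doric", "input": "How are you?"}
without_input = {"instruction": "Say hello", "input": ""}

print(render_merged_prompt(merged_prompt, with_input))
print(render_merged_prompt(merged_prompt, without_input))
print(render_turn(demo_template, "How are you?", "Nae bad min"))
```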



In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field = "text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=20,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="paged_adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
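
A quick sanity check on what these training arguments mean in practice. This is just arithmetic on the values above, assuming a single GPU; `linear_schedule_lr` is a hypothetical helper that mirrors the usual linear warmup then linear decay shape, not a call into `transformers`.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
warmup_steps = 5
max_steps = 20
learning_rate = 2e-4
num_devices = 1  # Assumption: a single GPU, as on a free Colab runtime.

# Examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total examples processed over the whole (very short) run.
examples_seen = effective_batch_size * max_steps


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, max_steps - step)
    return learning_rate * remaining / (max_steps - warmup_steps)


print("effective batch size:", effective_batch_size)
print("examples seen:", examples_seen)
for step in (0, 5, 10, 20):
    print(f"step {step:2d}: lr = {linear_schedule_lr(step):.2e}")
```

With `max_steps=20` the run only touches a small number of examples, so treat it as a smoke test; raise `max_steps` (or switch to epochs) for a real finetune.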

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                    # Change below!
    {"role": "user", "content": "How are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                         # Change below!
    {"role": "user",      "content": "How are you?"},
    {"role": "assistant", "content": "Nae bad min"},
    {"role": "user",      "content": "What are you doing tonight?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

In [None]:
# Install Ollama so the finetuned model can be served locally
!curl -fsSL https://ollama.com/install.sh | sh


In [None]:
!uv pip install python-dotenv
import os
from dotenv import load_dotenv
load_dotenv()

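# Quantize to GGUF (q4_k_m) and push to the Hugging Face Hub.
# HF_TOKEN must be set in the environment or in a .env file.
model.push_to_hub_gguf("franco334578/blue-tooner-8B-q4_k_m", tokenizer, quantization_method = "q4_k_m", token = os.environ.get('HF_TOKEN'))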




In [None]:
import subprocess
import time

# Start the Ollama server in the background.
subprocess.Popen(["ollama", "serve"])

time.sleep(3)  # Wait a few seconds for Ollama to start.
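
A fixed sleep can be too short on a slow runtime. One possible alternative, sketched below, is to poll the server's port until it accepts connections. `wait_for_port` is a hypothetical helper, not part of Ollama or Unsloth; port 11434 is the one used by the `curl` call later in this notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not up yet (connection refused or timed out); retry shortly.
            time.sleep(interval)
    return False
```

Usage would look like `wait_for_port("localhost", 11434)` right after starting `ollama serve`. It only checks that the port is open, not that a model has been loaded.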

In [None]:
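# Print the Ollama Modelfile that Unsloth generated from the chat template.
print(tokenizer._ollama_modelfile)
# Assumes ./model/Modelfile exists, e.g. from a local GGUF export into the "model" directory.
!ollama create blue-tooner -f ./model/Modelfile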


In [None]:
!curl http://localhost:11434/api/chat -d '{ \
    "model": "blue-tooner", \
    "messages": [ \
        { "role": "user", "content": "How are you?" } \
    ] \
    }'
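
The same request can be built from Python with only the standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, and it sets `"stream": False` so the server returns a single JSON object, whereas the `curl` call above leaves streaming at its default. Actually sending the request needs the Ollama server from the earlier cell to be running, so that part is left commented out.

```python
import json
import urllib.request


def build_chat_request(model: str, messages: list, host: str = "http://localhost:11434"):
    """Build a POST request for Ollama's /api/chat endpoint."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "blue-tooner",
    [{"role": "user", "content": "How are you?"}],
)

# Requires a running Ollama server with the blue-tooner model created:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["message"]["content"])
```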

# To run in interactive mode

Open the Terminal in Colab (bottom of the screen).  
Then type `ollama run blue-tooner`.

You can also use the `blue-tooner-8B-q4_k_m.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in llama.cpp.