To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [4]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
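
The Colab branch of the install cell picks an `xformers` pin from the installed `torch` version. The sketch below isolates that logic in plain Python so it can be checked without installing anything. The function name `pick_xformers_pin` and the sample version strings are illustrative assumptions; the regex and the two pins are copied from the cell above.

```python
import re

def pick_xformers_pin(torch_version: str) -> str:
    """Reproduce the install cell's choice of xformers pin (hypothetical helper)."""
    # Keep only the leading run of digits and dots, e.g. "2.8.0+cu126" -> "2.8.0".
    match = re.match(r"[0-9\.]{3,}", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    v = match.group(0)
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

# Sample version strings (assumed for illustration only).
print(pick_xformers_pin("2.8.0+cu126"))  # → xformers==0.0.32.post2
print(pick_xformers_pin("2.6.0+cu124"))  # → xformers==0.0.29.post3
```

Any version other than exactly `2.8.0` falls through to the older pin, so it is worth re-checking the cell whenever Colab updates its default `torch`.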

### Unsloth

If you want to finetune Llama-3 2x faster and use 70% less VRAM, go to our [finetuning notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)!

In [6]:
from unsloth import FastLanguageModel

# 4-bit pre-quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 8192,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


ImportError: Unsloth: Please install unsloth_zoo via `pip install unsloth_zoo`

In [None]:
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
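
The `mapping` argument above describes how ShareGPT-style messages (`"from"` / `"value"`, with speakers `"human"` and `"gpt"`) line up with the `role` / `content` convention. As a rough mental model only, and not Unsloth's actual implementation, the correspondence can be sketched in plain Python. The helper name `sharegpt_to_role_content` is hypothetical.

```python
# Same dictionary as passed to get_chat_template above.
MAPPING = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}

def sharegpt_to_role_content(messages, mapping=MAPPING):
    """Rename ShareGPT keys and speakers to role/content form (illustrative helper)."""
    role_key, content_key = mapping["role"], mapping["content"]
    # Invert the speaker names: "human" -> "user", "gpt" -> "assistant".
    speakers = {mapping["user"]: "user", mapping["assistant"]: "assistant"}
    return [
        {"role": speakers.get(m[role_key], m[role_key]), "content": m[content_key]}
        for m in messages
    ]

chat = [{"from": "human", "value": "Hello"}, {"from": "gpt", "value": "Hi!"}]
print(sharegpt_to_role_content(chat))
# → [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi!'}]
```

Speakers not listed in the mapping (for example a `"system"` turn) pass through unchanged in this sketch.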

Change the `"value"` field in the messages below to send your own prompt to the model!

Unsloth makes inference natively 2x faster, with no changes needed on your side!

In [None]:
# Unsloth → PubMedQA, SIMPLE one-by-one inference (no padding, no batching)
# Assumes you've already loaded Unsloth Llama 3.1 8B Instruct as:
#   model, tokenizer = FastLanguageModel.from_pretrained(..., load_in_4bit=True or fp16)
#   tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1",
#                                 mapping={"role":"from","content":"value","user":"human","assistant":"gpt"})
#   FastLanguageModel.for_inference(model)
# DO NOT call model.to("cuda") if you loaded in 4/8-bit; we only move INPUTS.

import re, torch, gc
from datasets import load_dataset
from tqdm.auto import tqdm
from sklearn.metrics import accuracy_score, f1_score, classification_report

DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
MAX_INP  = 384   # per-example input cap (tokens). Lower if you still hit OOM.
MAX_NEW  = 2     # only need the label
SYSTEM   = "Answer ONLY with one word: yes, no, or maybe."
LABELS   = {"yes","no","maybe"}

def build_messages(q, ctx):
    if isinstance(ctx, list): ctx = " ".join(ctx)
    return [{"from":"human", "value": f"{SYSTEM}\n\nQuestion: {q}\n\nAbstract:\n{ctx}\n\nAnswer:"}]

def prompt_text(row):
    return tokenizer.apply_chat_template(build_messages(row["question"], row["context"]),
                                         tokenize=False, add_generation_prompt=True)

def parse_first_label(completion: str) -> str:
    # first non-empty line, first word (handles "Yes,", "No." etc.)
    for line in completion.splitlines():
        line = line.strip()
        if not line:
            continue
        m = re.match(r"^[^\w]*([A-Za-z]+)", line)
        if m:
            tok = m.group(1).lower()
            if tok in LABELS:
                return tok
        m2 = re.search(r"\b(yes|no|maybe)\b", line.lower())
        return m2.group(1) if m2 else "maybe"
    return "maybe"

# ---- Load data ----
ds   = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
pmqa = ds["train"]           # the pqa_labeled config exposes its labeled examples under "train"
N    = len(pmqa)            # set smaller for a smoke test, e.g., N = 100

preds, golds = [], []

for i in tqdm(range(N), desc="Unsloth one-by-one", ncols=100):
    row = pmqa[i]
    golds.append(row["final_decision"].lower())

    prompt = prompt_text(row)
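    # tokenize ONE example: no padding, no batching.
    # add_special_tokens=False because the chat template already emits <|begin_of_text|>.
    # Caution: right-side truncation can cut the trailing "Answer:" cue on long abstracts.
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False,
                    padding=False, truncation=True, max_length=MAX_INP).to(DEVICE)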

    with torch.inference_mode():
        out = model.generate(
            **enc,
            max_new_tokens=MAX_NEW,
            do_sample=False,                      # greedy decoding; temperature is unused when sampling is off
            use_cache=False,                      # reduce KV cache since we generate 1–2 tokens
            pad_token_id=tokenizer.eos_token_id,  # harmless even without padding
            return_dict_in_generate=False,
            output_scores=False,
        )

    # Decode only the newly generated tokens. The decoded text has its special
    # tokens stripped, so it never starts with the templated prompt; slicing by
    # token count is the reliable way to isolate the completion.
    comp = tokenizer.decode(out[0][enc["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()

    preds.append(parse_first_label(comp))

    # free per-iteration to prevent VRAM creep
    del enc, out
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    if i % 50 == 0:
        gc.collect()

# ---- Metrics ----
print(f"\nAccuracy:  {accuracy_score(golds, preds):.4f}")
print(f"Macro-F1:  {f1_score(golds, preds, average='macro'):.4f}\n")
print(classification_report(golds, preds, digits=4))

# ---- Peek a few rows ----
for j in range(min(10, N)):
    print(f"[{j:03d}] gold={golds[j]:<6} pred={preds[j]}")
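
To make the two headline metrics concrete, here is a small hand-rolled version of accuracy and macro-F1 on a toy set of labels. This is a sketch for intuition, not a replacement for the `sklearn` calls above; the helper name `macro_f1` and the toy `golds` / `preds` lists are made up for illustration.

```python
def macro_f1(golds, preds, labels=("yes", "no", "maybe")):
    """Unweighted mean of per-label F1 scores (illustrative helper)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(golds, preds))
        fp = sum(g != label and p == label for g, p in zip(golds, preds))
        fn = sum(g == label and p != label for g, p in zip(golds, preds))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); score 0.0 when the label never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (assumed): one "yes" missed, one "maybe" mistaken for "yes".
golds = ["yes", "yes", "no", "maybe"]
preds = ["yes", "no", "no", "yes"]

accuracy = sum(g == p for g, p in zip(golds, preds)) / len(golds)
print(accuracy)                         # → 0.5
print(round(macro_f1(golds, preds), 4)) # → 0.3889
```

Because macro-F1 weights every label equally, a model that never predicts "maybe" is penalised heavily even when overall accuracy looks acceptable. One difference from `sklearn`: this sketch always averages over the three fixed labels, whereas `f1_score` averages over the labels that actually occur in the data.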


In [7]:
messages = [
                               # EDIT HERE!
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The Fibonacci sequence continues as follows:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144.<|eot_id|>
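
The attention-mask warning printed before this output appears because the pad token is the same as the eos token, so padding cannot be told apart from a genuine end-of-sequence token by id alone. The plain-Python sketch below illustrates the ambiguity. The token ids other than 128001 (the eos id reported in this notebook's generation log) are hypothetical, as are the helper names.

```python
EOS_ID = 128001   # eos id as printed in this notebook's generation log
PAD_ID = EOS_ID   # pad token reused as eos, the situation the warning describes

def infer_mask(ids, pad_id):
    """Naive guess: treat every position holding pad_id as padding."""
    return [0 if t == pad_id else 1 for t in ids]

def explicit_mask(real_length, total_length):
    """Mask built from the known unpadded length (right padding assumed)."""
    return [1] * real_length + [0] * (total_length - real_length)

# Hypothetical sequence: three tokens, a real EOS, then two pad positions.
ids = [11, 22, 33, EOS_ID, PAD_ID, PAD_ID]

print(infer_mask(ids, PAD_ID))     # → [1, 1, 1, 0, 0, 0]  (real EOS wrongly masked)
print(explicit_mask(4, len(ids)))  # → [1, 1, 1, 1, 0, 0]
```

In these single-prompt cells there is no padding, so the correct mask is all ones. Passing it explicitly to `model.generate(..., attention_mask=...)` should silence the warning; in recent `transformers` versions, calling `apply_chat_template` with `return_dict=True` is one way to obtain both `input_ids` and `attention_mask`, though that depends on the installed version.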


In [8]:
messages = [
    {"from": "human", "value": "Describe the tallest tower in the world."},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

Describe the tallest tower in the world.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The tallest tower in the world is the Burj Khalifa, located in Dubai, United Arab Emirates. It stands at a height of 828 meters (2,722 feet) and has held this record since its completion in 2010.

The Burj Khalifa was designed by the American architectural firm Skidmore, Owings & Merrill, and was developed by Emaar Properties. It has 163 floors, including residential, commercial, and hotel space. The tower is a mixed-use development that includes luxury apartments, offices, and the Armani Hotel.

Some of the notable features of the Burj Khalifa include:

* The highest occupied floor, which is at a height of 585.4 meters (1,920 feet)
* The highest outdoor observation deck, which is at a height of 555.7 meters (1,823 

In [None]:
messages = [
    {"from": "human", "value": "What is Unsloth?"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>human<|end_header_id|>

What is Unsloth?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Unsloth is a fascinating topic!

Unsloth is a term used to describe a hypothetical, hypothetical creature that is the opposite of a sloth. While sloths are known for their slow movements and sedentary lifestyle, Unsloth would be a creature that is incredibly fast, agile, and energetic.

The concept of Unsloth is often used as a thought experiment or a humorous idea, rather than a serious scientific concept. It's a fun way to imagine what a creature would be like if it were the exact opposite of a sloth in terms of its physical abilities and behavior.

In reality, there is no such creature as Unsloth, and it's not a recognized scientific term. However, the idea of Unsloth can be a fun and imaginative concept to explore, and it can even inspire creative writing, art, or even scientific speculation about what such a creature might look like or how it might b

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or would like to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

