To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
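
The Colab branch of the install cell pins xformers according to the installed torch version. As a rough illustration of that selection logic outside a notebook, the sketch below reuses the same regex and comparison. `pick_xformers` is a hypothetical helper name introduced here for illustration; it is not part of Unsloth.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Return the xformers pin the install cell would choose for a torch version string."""
    # Keep the leading run of digits and dots, e.g. "2.8.0+cu126" -> "2.8.0".
    v = re.match(r"[0-9\.]{3,}", torch_version).group(0)
    # Only an exact "2.8.0" selects the newer pin; everything else gets the older one.
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

print(pick_xformers("2.8.0+cu126"))
print(pick_xformers("2.6.0+cu124"))
```

Note that any torch version other than exactly 2.8.0 falls through to the older pin, so the cell may need updating when Colab ships a newer torch.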

### Unsloth

If you want to finetune Llama-3 2x faster and use 70% less VRAM, go to our [finetuning notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)!

In [2]:
from unsloth import FastLanguageModel

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 8192,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [3]:
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference

Change the "value" part to call the model!
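
For readers unfamiliar with the ShareGPT layout, here is a small, library-free sketch of how the `mapping` passed to `get_chat_template` relates `from`/`value` messages to the more common `role`/`content` form. `to_role_content` is a hypothetical helper written only for this illustration. It is not how Unsloth applies the mapping internally, and the recorded outputs in this notebook still show `human` in the prompt header.

```python
MAPPING = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}

def to_role_content(messages, mapping=MAPPING):
    """Convert ShareGPT-style messages into role/content dictionaries."""
    # Invert the speaker names: "human" -> "user", "gpt" -> "assistant".
    speakers = {mapping[k]: k for k in ("user", "assistant")}
    converted = []
    for msg in messages:
        speaker = msg[mapping["role"]]
        converted.append({
            "role": speakers.get(speaker, speaker),  # unknown speakers pass through unchanged
            "content": msg[mapping["content"]],
        })
    return converted

sharegpt = [{"from": "human", "value": "What is Unsloth?"}]
print(to_role_content(sharegpt))
```

Either way, the text you edit to ask a new question lives under the `"value"` key.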

Unsloth makes inference natively 2x faster!! No need to change or do anything!

In [None]:

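# Imports this cell relies on (none of the earlier cells load them)
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, f1_score
from torch.nn.utils.rnn import pad_sequence
from tqdm.auto import tqdm

# Enable Unsloth's fast inference path
FastLanguageModel.for_inference(model)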

# padding setup for batching
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model.config.pad_token_id = tokenizer.pad_token_id  # keep the model config in sync with the tokenizer's pad token

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# ----- Data -----
ds   = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
split = "test" if "test" in ds else "train"  # prefer "test"; fall back to "train" if this config has no such split
pmqa = ds[split]
K    = len(pmqa)   # set to a smaller number for a quick run

# ----- Build chat prompts for scoring -----
SYSTEM_SHORT = "Answer ONLY with one word: yes, no, or maybe."
CANDS = {"yes":["yes","Yes"], "no":["no","No"], "maybe":["maybe","Maybe"]}

def build_chat_prompt_text(question, context):
    if isinstance(context, list):
        context = " ".join(context)
    msgs = [
        {"from":"human", "value": f"{SYSTEM_SHORT}\n\nQuestion: {question}\n\nAbstract:\n{context}\n\nAnswer:"}
    ]
    # We want the raw text prompt; add_generation_prompt=True gives us the assistant prefix to append to.
    return tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True
    )

# ----- Helper: tokenize text → ids (no special tokens) -----
def _tok_ids(text):
    return tokenizer(text, add_special_tokens=False).input_ids

# ----- Safe LL scoring that NEVER drops the continuation -----
@torch.inference_mode()
def score_ll_unsloth(prompts, conts_per_ex, per_example_max_len=512, flat_micro=6):
    """
    For each example i, and its candidate strings, returns a list of summed log-likelihoods.
    - tokenizes prompt and continuation separately
    - left-truncates the prompt so the FULL continuation is kept within per_example_max_len
    - processes flattened pairs in micro-batches to keep VRAM ≤ T4 limits
    """
    pad_id = tokenizer.pad_token_id
    # Pre-tokenize prompts
    p_ids_list = [_tok_ids(p) for p in prompts]

    flat_inputs, flat_labels = [], []
    for p_ids, cands in zip(p_ids_list, conts_per_ex):
        for c in cands:
            c_ids = _tok_ids(c)
            need  = len(c_ids)
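            keep_prompt = max(0, per_example_max_len - need)
            # slice from an explicit start index: p_ids[-0:] would keep the whole prompt when keep_prompt == 0
            p_keep = p_ids[len(p_ids) - keep_prompt:] if len(p_ids) > keep_prompt else p_ids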
            inp = torch.tensor(p_keep + c_ids, dtype=torch.long)
            lab = torch.tensor([-100]*len(p_keep) + c_ids, dtype=torch.long)
            flat_inputs.append(inp); flat_labels.append(lab)

    # Run in micro-batches (on the flattened prompt+continuation pairs)
    seq_lp_all = []
    for m in range(0, len(flat_inputs), flat_micro):
        mb_inputs = flat_inputs[m:m+flat_micro]
        mb_labels = flat_labels[m:m+flat_micro]

        input_ids = pad_sequence(mb_inputs, batch_first=True, padding_value=pad_id).to(device)
        labels    = pad_sequence(mb_labels, batch_first=True, padding_value=-100).to(device)
        attn_mask = (input_ids != pad_id).long()

        out  = model(input_ids=input_ids, attention_mask=attn_mask, use_cache=False, return_dict=True)
        logp = torch.log_softmax(out.logits, dim=-1)

        # Shift for next-token prediction
        if input_ids.size(1) < 2:
            seq_lp_all.extend([0.0]*input_ids.size(0))
        else:
            shift_logp = logp[:, :-1, :]
            shift_lbls = labels[:, 1:]

            # safe gather: replace -100 by 0 *before* gather, then zero them out
            safe_lbls = shift_lbls.clone()
            safe_lbls[safe_lbls == -100] = 0
            gathered = torch.gather(shift_logp, 2, safe_lbls.unsqueeze(-1)).squeeze(-1)
            gathered[shift_lbls == -100] = 0.0

            seq_lp_all.extend(gathered.sum(dim=1).tolist())

        # free ASAP
        del input_ids, labels, attn_mask, out, logp
        torch.cuda.empty_cache()

    # regroup to per-example lists
    grouped, k = [], 0
    for cands in conts_per_ex:
        grouped.append(seq_lp_all[k:k+len(cands)])
        k += len(cands)
    return grouped

def classify_ll_unsloth(rows, batch=6, per_example_max_len=512, flat_micro=6,
                        return_details=True, print_first=8):
    preds, details = [], []
    # length bucketing → less padding
    def prompt_len_tokens(r):
        return len(_tok_ids(build_chat_prompt_text(r["question"], r["context"])))
    rows_sorted = sorted(rows, key=prompt_len_tokens)

    for s in tqdm(range(0, len(rows_sorted), batch), desc="Unsloth LL scoring", ncols=100):
        b = rows_sorted[s:s+batch]
        prompts = [build_chat_prompt_text(r["question"], r["context"]) for r in b]
        conts   = [CANDS["yes"] + CANDS["no"] + CANDS["maybe"] for _ in b]
        grouped = score_ll_unsloth(prompts, conts, per_example_max_len, flat_micro)

        for j, sc in enumerate(grouped):
            ll_yes   = max(sc[0:2]); ll_no = max(sc[2:4]); ll_maybe = max(sc[4:6])
            scores   = {"yes": ll_yes, "no": ll_no, "maybe": ll_maybe}
            v        = torch.tensor([ll_yes, ll_no, ll_maybe])
            pr       = torch.softmax(v, dim=0).tolist()
            probs    = {"yes": float(pr[0]), "no": float(pr[1]), "maybe": float(pr[2])}
            pred     = max(scores.items(), key=lambda kv: kv[1])[0]
            preds.append(pred)
            if return_details:
                gold = b[j]["final_decision"].lower()
                details.append({"scores": scores, "probs": probs, "pred": pred, "gold": gold})
    if return_details and print_first:
        print("\nExamples (LL → probs):")
        for i in range(min(print_first, len(details))):
            d = details[i]; sc = d["scores"]; pr = d["probs"]
            print(f"[{i:03d}] gold={d['gold']:<6} pred={d['pred']:<6} "
                  f"LLs: yes={sc['yes']:.3f} no={sc['no']:.3f} maybe={sc['maybe']:.3f}  "
                  f"probs: yes={pr['yes']:.3f} no={pr['no']:.3f} maybe={pr['maybe']:.3f}")
    return (preds, details) if return_details else preds

# ----- Run -----
rows  = [pmqa[i] for i in range(K)]
# knobs for VRAM/time: per_example_max_len (256–512), batch (4–8), flat_micro (4–8)
preds, details = classify_ll_unsloth(rows, batch=6, per_example_max_len=512, flat_micro=6, return_details=True)
golds = [r["final_decision"].lower() for r in rows]

print(f"\nAccuracy:  {accuracy_score(golds, preds):.4f}")
print(f"Macro-F1:  {f1_score(golds, preds, average='macro'):.4f}\n")
print(classification_report(golds, preds, digits=4))
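
To make the scoring step in the cell above easier to follow, here is a minimal pure-Python sketch of the same idea: left-truncate the prompt, mask prompt positions with `-100`, and sum the next-token log-probabilities over the continuation only. The uniform "model" logits, the 4-token vocabulary, and the helper names (`build_pair`, `continuation_ll`) are assumptions made for illustration; the real cell does this in batches with torch tensors.

```python
import math

IGNORE = -100  # label value for positions that should not be scored

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def build_pair(prompt_ids, cont_ids, max_len):
    """Left-truncate the prompt so the whole continuation fits in max_len tokens."""
    keep = max(0, max_len - len(cont_ids))
    kept = prompt_ids[len(prompt_ids) - keep:] if len(prompt_ids) > keep else list(prompt_ids)
    return kept + list(cont_ids), [IGNORE] * len(kept) + list(cont_ids)

def continuation_ll(logits, labels):
    """Sum log P(token t+1 | tokens up to t) over the continuation positions only."""
    total = 0.0
    for t in range(len(labels) - 1):
        target = labels[t + 1]  # logits[t] predicts the token at position t + 1
        if target != IGNORE:
            total += log_softmax(logits[t])[target]
    return total

# Toy setup: a stand-in "model" that emits uniform logits over 4 tokens.
inputs, labels = build_pair(prompt_ids=[1, 2, 3], cont_ids=[0, 1], max_len=4)
logits = [[0.0] * 4 for _ in inputs]
ll = continuation_ll(logits, labels)
print(inputs, labels, round(ll, 4))
```

With uniform logits each of the two continuation tokens has probability 1/4, so the summed log-likelihood is 2·ln(1/4) ≈ -2.7726. The classifier then compares such sums across the yes/no/maybe candidates and picks the largest.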

In [None]:
messages = [
                               # EDIT HERE!
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>human<|end_header_id|>

Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The next numbers in the Fibonacci sequence would be:

13, 21, 34, 55, 89, 144,...<|eot_id|>
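
The warning printed with this output appears because `generate` received only `input_ids`. As a library-free sketch of what an attention mask encodes under left padding (the padding side configured earlier with `tokenizer.padding_side = "left"`), consider the hypothetical helper `left_pad` below. It is not part of transformers or Unsloth, and the token ids are made up.

```python
def left_pad(sequences, pad_id):
    """Left-pad token id lists to a common width and build the matching attention mask."""
    width = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = width - len(s)
        input_ids.append([pad_id] * n_pad + list(s))
        # 0 marks padding the model should ignore, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(s))
    return input_ids, attention_mask

ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

In practice the tokenizer can build the mask for you. As far as we know, passing `return_dict=True` to `apply_chat_template` returns an `attention_mask` alongside `input_ids`, which can then be forwarded to `generate`, but check the transformers documentation for your installed version.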


In [None]:
messages = [
    {"from": "human", "value": "Describe the tallest tower in the world."},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>human<|end_header_id|>

Describe the tallest tower in the world.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The tallest tower in the world is the Tokyo Skytree, located in Tokyo, Japan. It stands at an incredible height of 634 meters (2,080 feet) and was completed in 2012. The Tokyo Skytree is not only the tallest tower in the world but also the tallest free-standing tower, meaning it is not supported by any external structures.

The Tokyo Skytree was built as a broadcasting tower, designed to replace the aging Tokyo Tower, which was built in 1958. The new tower was designed to provide better broadcasting services to the Tokyo metropolitan area, as well as to serve as a iconic landmark and tourist attraction.

The tower's design is unique, with a distinctive shape that resembles a giant antenna. It has a square base that tapers to a point at the top, with a series of observation decks and broadcasting equipment installed along the way. T

In [None]:
messages = [
    {"from": "human", "value": "What is Unsloth?"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>human<|end_header_id|>

What is Unsloth?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Unsloth is a fascinating topic!

Unsloth is a term used to describe a hypothetical, hypothetical creature that is the opposite of a sloth. While sloths are known for their slow movements and sedentary lifestyle, Unsloth would be a creature that is incredibly fast, agile, and energetic.

The concept of Unsloth is often used as a thought experiment or a humorous idea, rather than a serious scientific concept. It's a fun way to imagine what a creature would be like if it were the exact opposite of a sloth in terms of its physical abilities and behavior.

In reality, there is no such creature as Unsloth, and it's not a recognized scientific term. However, the idea of Unsloth can be a fun and imaginative concept to explore, and it can even inspire creative writing, art, or even scientific speculation about what such a creature might look like or how it might b

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

This notebook and all Unsloth notebooks are licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
