To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).

**[NEW] Unsloth supports Gemma2 9b and Gemma2 27b!**

## Kaggle is slow - you'll have to wait **5 minutes** for it to install.

I suggest using our free Colab notebooks instead. Our Mistral Colab notebook is linked here: [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)

In [1]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth
!pip install info-nce-pytorch

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
* [**NEW**] We make Llama-3 8b, 70b **2x faster**! See our [Llama-3 8b notebook](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)
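
The RoPE scaling mentioned in the list above can be pictured as linear position interpolation: positions are compressed by a constant factor so that a longer sequence maps into the position range seen during pretraining. The sketch below is a plain-Python illustration of that idea, not Unsloth's implementation; `rope_angles`, `original_len`, `target_len`, and the tiny `dim` are hypothetical names and values chosen for the example.

```python
import math

def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one position; scale < 1 compresses positions."""
    return [(position * scale) / (base ** (2 * i / dim)) for i in range(dim // 2)]

original_len = 2048   # assumed pretraining context length (illustrative only)
target_len = 8192     # assumed longer context to support (illustrative only)
scale = original_len / target_len

# The last position of the long sequence is mapped inside the original range.
scaled = rope_angles(target_len - 1, scale=scale)
reference = rope_angles((target_len - 1) * scale)
assert all(math.isclose(a, b) for a, b in zip(scaled, reference))
print(scale, scaled[0])
```

With these assumed lengths, position 8191 behaves like position 2047.75 of the original range, which is why a larger `max_seq_length` does not require new position embeddings.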

In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("huggingface")

In [3]:
from unsloth import FastLanguageModel
import torch


max_seq_length = 512 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/Qwen2-0.5b-bnb-4bit",           # Qwen2 2x faster!
    "unsloth/Qwen2-1.5b-bnb-4bit",
    "unsloth/Qwen2-7b-bnb-4bit",
    "unsloth/Qwen2-72b-bnb-4bit",
    "unsloth/gemma-2-9b-bnb-4bit",           # 8T tokens 2x faster!
    "unsloth/gemma-2-27b-bnb-4bit",          # 13T tokens 2x faster!
] # Try more models at https://huggingface.co/unsloth!

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-bnb-4bit", # Reminder we support ANY Hugging Face model!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    # modules_to_save = ['embed_tokens'],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
_model = model

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.11: Fast Gemma2 patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/6.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/46.4k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

Unsloth 2024.11.11 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.
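
The patch message above reports 42 layers, each with adapted QKV, O, and MLP projections. As a rough sketch of how many trainable parameters `r = 16` adds, the snippet below sums `r * (in_features + out_features)` over the adapted projections. The projection shapes are assumptions about the Gemma 2 9B configuration and are not stated in this notebook; check them against `model.config` before relying on the result.

```python
# Assumed Gemma 2 9B projection shapes as (in_features, out_features).
# These are NOT taken from this notebook; verify them with model.config.
hidden, q_out, kv_out, mlp = 3584, 4096, 2048, 14336
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}
r = 16
num_layers = 42  # from the Unsloth patch message

def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumed shapes the adapters add on the order of tens of millions of parameters, a small fraction of the base model.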


In [4]:
# Configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 2
max_seq_length = max_seq_length  # Replace with your sequence length
seed = 3407

# Set random seed for reproducibility
torch.manual_seed(seed);

## Load dataset

In [5]:
import pandas as pd
import datasets
from torch.utils.data import DataLoader

EOS_TOKEN = tokenizer.eos_token

passage_col = 'finalpassage_cro'
query_col = 'query_cro'
train_path = "/kaggle/input/ms-marco-hr/ms-marco-translated.csv"
df = pd.read_csv(train_path)
ds = datasets.arrow_dataset.Dataset.from_pandas(df)

def to_text(examples):
    # Prefix each field with its role and append the EOS token.
    queries = ["Query: " + x + EOS_TOKEN for x in examples[query_col]]
    passages = ["Passage: " + x + EOS_TOKEN for x in examples[passage_col]]
    return {"query_text": queries, "passage_text": passages}

def tokenize(examples):
    """Adds input_ids and attention_mask columns by tokenizing each text column."""
    data = {}
    for col in ds.features:
        if "text" in col:
            new_data = tokenizer(examples[col], truncation=True, padding='max_length', max_length=max_seq_length)
            data[col.replace("text", "input_ids")] = new_data["input_ids"]
            data[col.replace("text", "attention_mask")] = new_data["attention_mask"]
    return data

ds = ds.map(to_text, batched = True)
ds = ds.map(tokenize, batched=True)
ds.set_format(type='torch', 
              columns=['query_input_ids',
                       'query_attention_mask',
                       'passage_input_ids',
                       'passage_attention_mask'])

seed = 42  # Seed used for both dataset splits
_train_val, test_ds = ds.train_test_split(test_size=0.1, seed=seed).values()
train_ds, val_ds = _train_val.train_test_split(test_size=0.1111, seed=seed).values()  # 10/90 = 0.1111

train_dataloader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
print(len(train_dataloader), len(val_dataloader), len(test_dataloader), batch_size)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

4000 500 500 2
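
The printed batch counts follow from the two-stage split. The sketch below redoes the arithmetic, assuming the held-out size is rounded up to a whole example; that rounding rule is my understanding of `train_test_split` and should be treated as an assumption.

```python
import math

n_examples = 10_000
batch_size = 2

n_test = math.ceil(0.1 * n_examples)       # first split: 10% held out for testing
n_train_val = n_examples - n_test
n_val = math.ceil(0.1111 * n_train_val)    # second split: about 10% of the original data
n_train = n_train_val - n_val

# Number of batches per DataLoader (the last batch may be partial in general).
batches = [math.ceil(n / batch_size) for n in (n_train, n_val, n_test)]
print(*batches, batch_size)
```

This gives an 80/10/10 split of the 10,000 examples.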


In [6]:
import torch
import torch.nn as nn
from transformers import get_scheduler
from transformers.modeling_outputs import ModelOutput
from tqdm import tqdm
from unsloth import is_bfloat16_supported
from info_nce import InfoNCE

In [7]:
class ModelWrapper(nn.Module):
    def __init__(self, model, pooling_method="cls"):
        super().__init__()
        self.model = model
        self.pooling_method = pooling_method
        self.loss_fn = InfoNCE()

    def forward_multiple(self, query_input_ids, query_attention_mask, 
             passage_input_ids, passage_attention_mask, **kwargs):
        query_embeddings = self(
            input_ids=query_input_ids,
            attention_mask=query_attention_mask).embeddings
        passage_embeddings = self(
            input_ids=passage_input_ids,
            attention_mask=passage_attention_mask).embeddings
        loss = self.loss_fn(query_embeddings, passage_embeddings)
        return ModelOutput(
            query_embeddings=query_embeddings,
            passage_embeddings=passage_embeddings,
            loss=loss)
        
    
    def forward(self, input_ids, attention_mask, **kwargs) -> ModelOutput:
        kwargs["output_hidden_states"] = True
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        hidden_states = outputs.hidden_states[-1]
        if self.pooling_method == "cls":
            # Last-token pooling: index of the last non-padding token (assumes right padding).
            indices = attention_mask.sum(dim=-1) - 1
            indices = indices.unsqueeze(-1).expand(-1, hidden_states.size(-1))  # Shape: (batch_size, hidden_size)
            embeddings = hidden_states.gather(1, indices.unsqueeze(1)).squeeze(1)  # Shape: (batch_size, hidden_size)
        elif self.pooling_method == "mean":
            # Mean over non-padding tokens.
            mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
            embeddings = torch.sum(hidden_states * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)
        else:
            raise ValueError(f"Unknown pooling_method: {self.pooling_method!r}")
        return ModelOutput(embeddings=embeddings)

    def __getattr__(self, name: str):
        """Forward missing attributes to the wrapped module."""
        try:
            return super().__getattr__(name)  # defer to nn.Module's logic
        except AttributeError:
            return getattr(self.model, name)
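
`forward_multiple` passes one batch of query and passage embeddings to `InfoNCE` without explicit negatives. Conceptually, each query's own passage is the positive and the other passages in the batch act as negatives. The plain-Python sketch below illustrates that idea on toy 2-D vectors; the function name `info_nce` and the temperature of 0.1 are assumptions for illustration, and this is not the `info-nce-pytorch` source.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, passages, temperature=0.1):
    """Mean cross-entropy where passage i is the positive for query i."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    losses = []
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every passage in the batch.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[2.0, 0.1], [0.1, 3.0]])   # passages match their queries
swapped = info_nce(queries, [[0.1, 3.0], [2.0, 0.1]])   # passages are mismatched
print(f"aligned: {aligned:.4f}, swapped: {swapped:.4f}")
```

With `batch_size = 2` as configured earlier, each query is contrasted against a single in-batch negative, which makes the task easy and is consistent with the very small training losses in the logs.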

In [8]:
def mean_reciprocal_rank(cosine_sim_matrix, ground_truth_indices):
    num_queries = cosine_sim_matrix.size(0)
    reciprocal_ranks = []
    for i in range(num_queries):
        sorted_indices = torch.argsort(cosine_sim_matrix[i], descending=True)
        rank = (sorted_indices == ground_truth_indices[i]).nonzero(as_tuple=True)[0].item() + 1
        reciprocal_ranks.append(1 / rank)
    return sum(reciprocal_ranks) / num_queries

def hit_rate_at_1(cosine_sim_matrix, ground_truth_indices):
    top_1_indices = torch.argmax(cosine_sim_matrix, dim=1)  # Shape: (num_samples,)
    hits = (top_1_indices == ground_truth_indices).sum().item()
    return hits / ground_truth_indices.size(0)


class Looper:
    def __init__(self, model, **kwargs):
        self.model = model

    def loop(self, dataloader, num_steps, call, train=False, **kwargs):
        assert len(dataloader) >= num_steps, "Dataloader is smaller than number of steps!"
        step = 0
        with tqdm(range(num_steps), leave=True, position=0,
                  desc="Training" if train else "Testing", unit="step") as progress_bar:
            for batch in dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                with torch.amp.autocast(device.type, dtype=torch.bfloat16 if use_bfloat16 else torch.float16):
                    if train:
                        outputs = self.model.forward_multiple(**batch, **kwargs)
                    else:
                        with torch.no_grad():
                            outputs = self.model.forward_multiple(**batch, **kwargs)
                call(outputs=outputs, step=step, progress_bar=progress_bar)
                progress_bar.update(1)
                step += 1
                if step >= num_steps:
                    break
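
To make the two metrics concrete, here is a small plain-Python sketch on a 3x3 similarity matrix where the correct passage for query `i` is column `i`, as in the evaluation code. The helper name `reciprocal_ranks` and the toy scores are made up for illustration.

```python
def reciprocal_ranks(sim_matrix):
    """Row i's correct passage is column i; rank 1 means it scored highest."""
    result = []
    for i, row in enumerate(sim_matrix):
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rank = order.index(i) + 1
        result.append(1 / rank)
    return result

sim = [
    [0.9, 0.1, 0.3],  # correct passage ranked 1st
    [0.8, 0.5, 0.2],  # correct passage ranked 2nd
    [0.7, 0.6, 0.1],  # correct passage ranked 3rd
]
rr = reciprocal_ranks(sim)
mrr = sum(rr) / len(rr)
hit_at_1 = sum(r == 1 for r in rr) / len(rr)
print(f"MRR: {mrr:.4f} | Hit Rate @ 1: {hit_at_1:.4f}")
```

MRR rewards near misses (rank 2 still contributes 0.5), while Hit Rate @ 1 only counts queries whose correct passage is ranked first.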


In [9]:
model = ModelWrapper(_model, pooling_method="cls")
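
Despite its name, the `"cls"` option selected here takes the hidden state of the last non-padding token rather than a leading classification token, while `"mean"` averages over non-padding tokens. The plain-Python sketch below mirrors both on one toy sequence. It assumes right-padded inputs, which the `attention_mask.sum(dim=-1) - 1` index in the wrapper also relies on; the function names are hypothetical.

```python
def last_token_pool(hidden_states, attention_mask):
    """Hidden state at the last position where the mask is 1 (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]

def mean_pool(hidden_states, attention_mask):
    """Average of hidden states over positions where the mask is 1."""
    count = max(sum(attention_mask), 1e-9)
    dim = len(hidden_states[0])
    return [
        sum(h[d] * m for h, m in zip(hidden_states, attention_mask)) / count
        for d in range(dim)
    ]

# One sequence with 3 real tokens and 1 padding token, hidden size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(last_token_pool(hidden_states, attention_mask))
print(mean_pool(hidden_states, attention_mask))
```

If the tokenizer pads on the left instead, the last-token index would point at the wrong position, so it is worth checking `tokenizer.padding_side` before training.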

In [10]:
# A step is one batch
gradient_accumulation_steps = 8
learning_rate = 2e-4
num_training_steps = 4000
num_val_steps = 500
val_every_steps = 1000
warmup_steps = 10

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
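# Note: scheduler.step() runs once per optimizer update (every
# gradient_accumulation_steps batches), so only
# num_training_steps // gradient_accumulation_steps scheduler steps are taken.
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=num_training_steps,
)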
use_bfloat16 = is_bfloat16_supported()
scaler = torch.amp.GradScaler(device.type, enabled=not use_bfloat16)
looper = Looper(model)

def test_loop(test_dataloader, num_test_steps):
    query_embeddings = []
    passage_embeddings = []
    print()
    def test_callback(outputs, step, progress_bar):
        query_embeddings.append(outputs.query_embeddings)
        passage_embeddings.append(outputs.passage_embeddings)
    looper.loop(test_dataloader, num_test_steps, test_callback)
    query_embeddings = torch.concatenate(query_embeddings, dim=0)  # along the batch dimension
    passage_embeddings = torch.concatenate(passage_embeddings, dim=0)

    # Note: the embeddings are not L2-normalized here, so this is a dot-product similarity matrix.
    cosine_sim_matrix = torch.matmul(query_embeddings, passage_embeddings.T)
    ground_truth_indices = torch.arange(query_embeddings.shape[0], device=query_embeddings.device)
    mrr_score = mean_reciprocal_rank(cosine_sim_matrix, ground_truth_indices)
    hr_1 = hit_rate_at_1(cosine_sim_matrix, ground_truth_indices)
    print(f"MRR: {mrr_score:.4f} | Hit Rate @ 1: {hr_1:.4f}")

def train_loop():
    test_loop(val_dataloader, num_val_steps)
    def train_callback(outputs, step, progress_bar):
        loss = outputs.loss / gradient_accumulation_steps
        scaler.scale(loss).backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
        progress_bar.set_postfix({'loss': loss.item()})
        # progress_bar.set_description(f"Training [loss={loss.item()}]")
        if (step + 1) % val_every_steps == 0:
            test_loop(val_dataloader, num_val_steps)
    looper.loop(train_dataloader, num_training_steps, train_callback, train=True)

train_loop()




Testing: 100%|██████████| 500/500 [22:39<00:00,  2.72s/step]


MRR: 0.1590 | Hit Rate @ 1: 0.0930


Training:  25%|██▍       | 999/4000 [2:07:38<6:19:51,  7.59s/step, loss=9.62e-5]




Testing: 100%|██████████| 500/500 [23:04<00:00,  2.77s/step]
Training:  25%|██▌       | 1000/4000 [2:30:43<352:33:55, 423.08s/step, loss=9.62e-5]

MRR: 0.9530 | Hit Rate @ 1: 0.9250


Training:  50%|████▉     | 1999/4000 [4:37:52<4:13:18,  7.60s/step, loss=6.94e-5]




Testing: 100%|██████████| 500/500 [23:02<00:00,  2.76s/step]
Training:  50%|█████     | 2000/4000 [5:00:55<234:39:25, 422.38s/step, loss=6.94e-5]

MRR: 0.9552 | Hit Rate @ 1: 0.9240


Training:  75%|███████▍  | 2999/4000 [7:08:08<2:07:23,  7.64s/step, loss=3.19e-5]




Testing: 100%|██████████| 500/500 [23:01<00:00,  2.76s/step]
Training:  75%|███████▌  | 3000/4000 [7:31:10<117:15:53, 422.15s/step, loss=3.19e-5]

MRR: 0.9561 | Hit Rate @ 1: 0.9300


Training: 100%|█████████▉| 3999/4000 [9:38:10<00:07,  7.57s/step, loss=0.00102]




Testing: 100%|██████████| 500/500 [22:56<00:00,  2.75s/step]
Training: 100%|██████████| 4000/4000 [10:01:08<00:00,  9.02s/step, loss=0.00102] 

MRR: 0.9461 | Hit Rate @ 1: 0.9170
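
With `gradient_accumulation_steps = 8`, the optimizer and the scheduler advance once every eight batches. The sketch below counts the updates in this run and evaluates a linear warmup/decay multiplier at the last one. The `linear_multiplier` function is my reading of a standard linear schedule, written out for illustration rather than copied from `transformers`.

```python
def linear_multiplier(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, then linear decay to 0.0 at total_steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

num_batches = 4000
batch_size = 2
gradient_accumulation_steps = 8
learning_rate = 2e-4
warmup_steps = 10
scheduler_total_steps = 4000  # the value passed as num_training_steps in the training cell

optimizer_updates = num_batches // gradient_accumulation_steps
examples_per_update = batch_size * gradient_accumulation_steps
final_lr = learning_rate * linear_multiplier(
    optimizer_updates, warmup_steps, scheduler_total_steps
)
print(optimizer_updates, examples_per_update, f"{final_lr:.3e}")
```

Under these assumptions the learning rate has only decayed to roughly 88% of its peak by the end of training; passing the number of optimizer updates as `num_training_steps` would let the schedule reach zero.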





In [11]:
# Save model and tokenizer
model.save_pretrained("outputs")
tokenizer.save_pretrained("outputs")

('outputs/tokenizer_config.json',
 'outputs/special_tokens_map.json',
 'outputs/tokenizer.model',
 'outputs/added_tokens.json',
 'outputs/tokenizer.json')


To load the LoRA adapters saved above for inference, pass the `outputs` directory as `model_name` to `FastLanguageModel.from_pretrained`.

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM developments, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Llama-3 8b, 70b **2x faster**! See our [Llama-3 8b notebook](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>