This notebook shows how to fine-tune Mistral 7B with unsloth.

The first section covers recent GPUs (Ampere or newer), such as the A100 available on Colab. Jump to the Non-Ampere GPU section if you want to use an older GPU such as the T4.



First, we check which CUDA version PyTorch was built with, so that we can install the matching optimized build of unsloth.

In [None]:
import torch; torch.version.cuda

'12.1'
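The CUDA version is only half of the selection: the unsloth build also depends on whether the GPU is Ampere or newer. Below is a minimal sketch of that choice; it is not part of unsloth. The helper name `unsloth_extra` is hypothetical, and only the two extras used in this notebook (cu121_ampere and cu121) are confirmed here. Any other name the helper can build is an assumption to check against the unsloth README.

```python
def unsloth_extra(cuda_version: str, capability_major: int) -> str:
    """Build the pip extra name from the CUDA version and the GPU generation.

    cuda_version: the string returned by torch.version.cuda, e.g. "12.1".
    capability_major: first element of torch.cuda.get_device_capability();
        8 or higher is treated as Ampere or newer (A100 reports 8.0, T4 7.5).
    """
    major, minor = cuda_version.split(".")[:2]
    extra = f"cu{major}{minor}"
    if capability_major >= 8:
        extra += "_ampere"
    return extra


print(unsloth_extra("12.1", 8))  # the A100 case of this notebook
print(unsloth_extra("12.1", 7))  # the T4 case of this notebook
```

On a live runtime, the two arguments could come from `torch.version.cuda` and `torch.cuda.get_device_capability()[0]`.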

With an Ampere GPU and CUDA 12.1, we need the cu121_ampere variant of unsloth:

In [None]:
!pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth[cu121_ampere]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-iz452p8c/unsloth_db3b147ed93c4385af78610355429f3c
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-iz452p8c/unsloth_db3b147ed93c4385af78610355429f3c
  Resolved https://github.com/unslothai/unsloth.git to commit 24133feda65a6996a1fd242ac8f9fc237a3befe5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes (from unsloth[cu121_ampere]@ git+https://github.com/unslothai/unsloth.git)
  Downloading bitsandbytes-0.41.3.post2-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 18.0 MB/s eta 0:00:00
Collecting flas

This installs unsloth together with all its dependencies. We can then import what we need:

In [None]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.


FastLanguageModel.from_pretrained loads the model and its tokenizer.

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 512,
    dtype = None,
    load_in_4bit = True,
)


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2023.12
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB
O^O/ \_/ \    CUDA capability = 8.0. Xformers = 0.0.22.post7. FA = True.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = TRUE. Platform = Linux



model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Then, we add the LoRA adapters. I only target the MLP modules: "gate_proj", "up_proj", and "down_proj".

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = 512,
)

Unsloth 2023.12 patched 32 layers with 0 QKV layers, 0 O layers and 32 MLP layers.
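The size of this adapter can be worked out by hand. For each targeted linear layer, LoRA adds two matrices of shapes r × in_features and out_features × r, so r × (in_features + out_features) weights. The sketch below assumes the dimensions of the published Mistral 7B configuration (hidden size 4096, MLP intermediate size 14336, 32 layers), which are not stated in this notebook. The result agrees with the 28,311,552 trainable parameters that the trainer reports in its log.

```python
# Assumed Mistral 7B dimensions (from the model's config.json, not from this notebook).
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 14336
NUM_LAYERS = 32
R = 16  # LoRA rank passed to get_peft_model

# (in_features, out_features) of each targeted MLP projection.
targets = {
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    # Matrix A is r x in_features, matrix B is out_features x r.
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in targets.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```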


Training arguments are standard:

In [None]:
training_arguments = TrainingArguments(
        output_dir="./unsloth_results",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=6,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=100,
        logging_steps=25,
        learning_rate=2e-4,
        eval_steps=50,
        optim='adamw_8bit',
        bf16=True, #change to fp16 if not using an Ampere GPU
        weight_decay=0.1,
        max_steps=500,
        warmup_ratio=0.01,
        lr_scheduler_type="linear",
)
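To make the schedule concrete: with warmup_ratio=0.01 and max_steps=500, the learning rate ramps up over the first 5 steps and then decays linearly to zero. The function below is a plain-Python sketch of such a linear warmup and decay rule, written for illustration only. It is not the Trainer's own code, the helper name `linear_lr` is hypothetical, and rounding the warmup length up is an assumption.

```python
import math

BASE_LR = 2e-4
MAX_STEPS = 500
WARMUP_RATIO = 0.01

# Assumption: warmup length = ratio * total steps, rounded up.
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp from 0 to BASE_LR.
        return BASE_LR * step / warmup_steps
    # Linear decay from BASE_LR down to 0 at MAX_STEPS.
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / (MAX_STEPS - warmup_steps)


for step in (0, 5, 250, 500):
    print(step, f"{linear_lr(step):.3e}")
```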

Load the dataset used for fine-tuning:

In [None]:
dataset = load_dataset("timdettmers/openassistant-guanaco")

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Fine-tuning with TRL's SFTTrainer:

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 6
  Total optimization steps = 500
  Number of trainable parameters = 28,311,552
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`


Step,Training Loss,Validation Loss
50,1.1241,1.148698
100,1.1267,1.141217
150,1.0996,1.13867
200,1.0915,1.134471
250,1.0893,1.131233
300,1.1145,1.128209
350,1.0814,1.126775
400,1.0913,1.125081
450,0.9878,1.128103
500,1.0032,1.128946


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results/checkpoint-100
tokenizer config file saved in ./unsloth_results/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./unsloth_results/checkpoint-100/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results/checkpoint-200
tokenizer config file saved in ./unsloth_results/checkpoint-200/tokenizer_config.json
Special tokens file saved in ./unsloth_results/checkpoint-200/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results/checkpoint-300
tokenizer config file saved in ./unsloth_results/checkpoint-30

TrainOutput(global_step=500, training_loss=1.0917891845703125, metrics={'train_runtime': 1921.0503, 'train_samples_per_second': 6.247, 'train_steps_per_second': 0.26, 'total_flos': 2.5001061360176333e+17, 'train_loss': 1.0917891845703125, 'epoch': 1.22})
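The figures in this output can be cross-checked against the training arguments. A short sanity check, using only values shown in the log above and assuming a single GPU:

```python
per_device_batch = 4
grad_accum = 6
max_steps = 500
num_examples = 9846        # "Num examples" in the training log
train_runtime = 1921.0503  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum  # one GPU, so no extra factor
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_examples
samples_per_second = samples_seen / train_runtime

print(effective_batch, samples_seen, round(epochs, 2), round(samples_per_second, 3))
```

These values line up with the "Total train batch size = 24", 'epoch': 1.22 and 'train_samples_per_second': 6.247 entries reported by the trainer. The "Num Epochs = 2" line is consistent with that epoch count being rounded up.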

For comparison, this is the code I used for fine-tuning without unsloth:

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
model_name = "mistralai/Mistral-7B-v0.1"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id =  tokenizer.unk_token_id
tokenizer.padding_side = 'left'

compute_dtype = torch.bfloat16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0}
)


model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.0,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ["gate_proj", "up_proj", "down_proj"]
)

dataset = load_dataset("timdettmers/openassistant-guanaco")

training_arguments = TrainingArguments(
        output_dir="./drive/MyDrive/unsloth_results2",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=6,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=100,
        logging_steps=25,
        learning_rate=2e-4,
        eval_steps=50,
        optim='adamw_8bit',
        bf16=True, #change to fp16 if not using an Ampere GPU
        weight_decay=0.1,
        max_steps=500,
        warmup_ratio=0.01,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 6
  Total optimization steps = 500
  Number of trainable parameters = 28,311,552
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
50,1.1411,1.16348
100,1.1342,1.146146
150,1.1027,1.140601
200,1.093,1.136812
250,1.1307,1.185034
300,1.1563,1.153719
350,1.0902,1.134425
400,1.0957,1.13104
450,1.0057,1.134104
500,1.0255,1.135155


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./drive/MyDrive/unsloth_results2/checkpoint-100
tokenizer config file saved in ./drive/MyDrive/unsloth_results2/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./drive/MyDrive/unsloth_results2/checkpoint-100/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./drive/MyDrive/unsloth_results2/checkpoint-200
tokenizer config file saved in ./drive/MyDrive/unsloth_results2/checkpoint-200/tokenizer_config.json
Special tokens file saved in ./drive/MyDrive/unsloth_results2/checkpoint-200/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./dri

TrainOutput(global_step=500, training_loss=1.1072690620422363, metrics={'train_runtime': 4002.947, 'train_samples_per_second': 2.998, 'train_steps_per_second': 0.125, 'total_flos': 2.5001061360176333e+17, 'train_loss': 1.1072690620422363, 'epoch': 1.22})
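Both A100 runs use the same hyperparameters and the same number of steps, so their reported figures can be compared directly. A small calculation, using only numbers copied from the outputs above; it reflects a single run of each setup, not a benchmark:

```python
# Values copied from the two TrainOutput lines and loss tables (A100 runs).
unsloth = {"runtime_s": 1921.0503, "final_val_loss": 1.128946}
baseline = {"runtime_s": 4002.947, "final_val_loss": 1.135155}

speedup = baseline["runtime_s"] / unsloth["runtime_s"]
minutes_saved = (baseline["runtime_s"] - unsloth["runtime_s"]) / 60
loss_gap = baseline["final_val_loss"] - unsloth["final_val_loss"]

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {minutes_saved:.1f} min")
print(f"validation loss gap at step 500: {loss_gap:.6f}")
```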

# *Non* Ampere GPU (T4)

The following QLoRA fine-tuning is the same as above, except that it uses fp16 instead of bfloat16 (which the T4 does not support) and installs the cu121 version of unsloth.
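The precision choice can also be derived from the GPU instead of being edited by hand. A minimal sketch, assuming that a compute capability of 8.0 or higher means bfloat16 is available (the unsloth banners in this notebook show 8.0 with bfloat16 = TRUE for the A100 and 7.5 with bfloat16 = FALSE for the T4). The helper name `precision_flags` is hypothetical.

```python
def precision_flags(capability_major: int) -> dict:
    """Return the mixed-precision flags for TrainingArguments (hypothetical helper)."""
    use_bf16 = capability_major >= 8  # assumption: Ampere or newer supports bfloat16
    return {"bf16": use_bf16, "fp16": not use_bf16}


print(precision_flags(8))  # A100
print(precision_flags(7))  # T4
```

The returned dictionary could be unpacked into the call, for example `TrainingArguments(..., **precision_flags(torch.cuda.get_device_capability()[0]))`.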

In [None]:
!pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth[cu121]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-suof1hsm/unsloth_ec0a51ef3d274815b155a322ce6ca9c9
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-suof1hsm/unsloth_ec0a51ef3d274815b155a322ce6ca9c9
  Resolved https://github.com/unslothai/unsloth.git to commit 0369e7aa7ad67e1f9e59ecbc98306f4f19afb0b3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes (from unsloth[cu121]@ git+https://github.com/unslothai/unsloth.git)
  Downloading bitsandbytes-0.41.3.post2-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 19.3 MB/s eta 0:00:00
Collecting xformers@ https://

In [None]:
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import (
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 512,
    dtype = None,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = 512,
)

training_arguments = TrainingArguments(
        output_dir="./unsloth_results_t4",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=6,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=100,
        logging_steps=50,
        learning_rate=2e-4,
        eval_steps=50,
        optim='adamw_8bit',
        fp16=True, # fp16 because the T4 does not support bfloat16
        weight_decay=0.1,
        max_steps=500,
        warmup_ratio=0.01,
        lr_scheduler_type="linear",
)

dataset = load_dataset("timdettmers/openassistant-guanaco")

trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2023.12
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB
O^O/ \_/ \    CUDA capability = 7.5. Xformers = 0.0.22.post7. FA = False.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = FALSE. Platform = Linux



model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Unsloth 2023.12 patched 32 layers with 0 QKV layers, 0 O layers and 32 MLP layers.


Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 6
  Total optimization steps = 500
  Number of trainable parameters = 28,311,552
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results_t4/checkpoint-100
tokenizer config file saved in ./unsloth_results_t4/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./unsloth_results_t4/checkpoint-100/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results_t4/checkpoint-200
tokenizer config file saved in ./unsloth_results_t4/checkpoint-200/tokenizer_config.json
Special tokens file saved in ./unsloth_results_t4/checkpoint-200/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4


Step,Training Loss,Validation Loss
50,1.1589,1.148608
100,1.1206,1.140937
150,1.1192,1.138291
200,1.0972,1.134411
250,1.0935,1.130968
300,1.1168,1.128115
350,1.1033,1.12657
400,1.0829,1.124936
450,1.0154,1.127879
500,1.0075,1.128644


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results_t4/checkpoint-300
tokenizer config file saved in ./unsloth_results_t4/checkpoint-300/tokenizer_config.json
Special tokens file saved in ./unsloth_results_t4/checkpoint-300/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results_t4/checkpoint-400
tokenizer config file saved in ./unsloth_results_t4/checkpoint-400/tokenizer_config.json
Special tokens file saved in ./unsloth_results_t4/checkpoint-400/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./unsloth_results_t4/checkpoint-500
tokenizer config file saved in ./unsloth_results_t4/checkpoint-500/tokenizer_config.json
Special tokens file s

TrainOutput(global_step=500, training_loss=1.0915244064331056, metrics={'train_runtime': 17339.8068, 'train_samples_per_second': 0.692, 'train_steps_per_second': 0.029, 'total_flos': 2.5001061360176333e+17, 'train_loss': 1.0915244064331056, 'epoch': 1.22})
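The T4 runtime is easier to read once converted. A short calculation with numbers taken from the TrainOutput lines in this notebook; again, these are single runs:

```python
t4_runtime_s = 17339.8068   # T4 with unsloth, fp16
a100_runtime_s = 1921.0503  # A100 with unsloth, bf16
max_steps = 500

t4_hours = t4_runtime_s / 3600
slowdown_vs_a100 = t4_runtime_s / a100_runtime_s
seconds_per_step = t4_runtime_s / max_steps

print(f"T4 run: {t4_hours:.2f} h, {seconds_per_step:.1f} s per optimizer step")
print(f"T4 vs A100 (both with unsloth): {slowdown_vs_a100:.2f}x slower")
```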