# 🚀 OpenSloth Demo Training Notebook

This notebook has two main sections:
 1. Training a model with OpenSloth
 2. Training with Unsloth

- Setup: both sections use the same dataset, sequences, and global batch size (64).
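
As a rough check on the shared setup, the sketch below recomputes the global batch size and the number of optimizer steps per epoch for both runs. It assumes the global batch size is the product of per-device batch size, gradient accumulation steps, and GPU count (the same product the Unsloth banner prints as "Total batch size"). The helper names `global_batch_size` and `steps_per_epoch` are illustrative and are not part of OpenSloth or Unsloth.

```python
import math


def global_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs (assumed formula)."""
    return per_device_batch_size * grad_accum_steps * num_gpus


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps needed to see every sample once; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)


# Values taken from the two configurations in this notebook (1 GPU each).
opensloth_batch = global_batch_size(per_device_batch_size=1, grad_accum_steps=64, num_gpus=1)
unsloth_batch = global_batch_size(per_device_batch_size=4, grad_accum_steps=16, num_gpus=1)

print(opensloth_batch, unsloth_batch)            # 64 64
print(steps_per_epoch(10_000, opensloth_batch))  # 157
```

Both configurations land on 64 samples per update and 157 steps for 10,000 samples, which matches the progress bar and the "Total steps" value in the logs.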

In [1]:
# import wandb
# wandb.init(project="compare-unsloth")

In [None]:
import os
os.environ['WANDB_PROJECT'] = 'compare-unsloth'
from opensloth.scripts.opensloth_sft_trainer import run_mp_training, setup_envs
from opensloth.opensloth_config import (
    OpenSlothConfig,
    HFDatasetConfig,
    FastModelArgs,
    LoraArgs,
    TrainingArguments,
)
from loguru import logger

# from transformers.training_args import TrainingArguments


# Main configuration using Pydantic models
def get_configs(devices) -> tuple[OpenSlothConfig, TrainingArguments]:
    num_gpu = len(devices)
    opensloth_config = OpenSlothConfig(
        data=HFDatasetConfig(
            tokenizer_name="Qwen/Qwen3-8B",
            chat_template="qwen3",
            instruction_part="<|im_start|>user\n",
            response_part="<|im_start|>assistant\n",
            num_samples=10000,
            nproc=52,
            max_seq_length=4096,
            source_type="hf",
            dataset_name="mlabonne/FineTome-100k",
            split="train",
        ),
        devices=devices,  # list of int representing GPU ids
        fast_model_args=FastModelArgs(
            model_name="model_store/unsloth/Qwen3-14B-bnb-4bit",
            max_seq_length=4096,
            load_in_4bit=True,
        ),
        lora_args=LoraArgs(
            r=8,
            lora_alpha=16,
            target_modules=[
                "q_proj",
                "k_proj",
                "v_proj",
                "o_proj",
                "gate_proj",
                "up_proj",
                "down_proj",
            ],
            lora_dropout=0,
            bias="none",
            use_rslora=False,
        ),
        sequence_packing=True,
    )

    # Training arguments using Pydantic model
    training_config = TrainingArguments(
        output_dir=f"outputs/exps/qwen3-14b-FineTome-{num_gpu}gpus-seql-packing",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,  # Adjust based on n_gpu
        learning_rate=1e-5,
        logging_steps=1,
        num_train_epochs=1,
        lr_scheduler_type="linear",
        warmup_steps=5,
        save_total_limit=1,
        weight_decay=0.01,
        optim="adamw_8bit",
        seed=3407,
        report_to="wandb",  # "tensorboard" or "wandb"
    )
    setup_envs(opensloth_config, training_config)
    return opensloth_config, training_config


if __name__ == "__main__":
    opensloth_config, training_config = get_configs(devices=[0])
    # opensloth_config, training_config = get_configs(devices=[0,1,2,3])
    run_mp_training(opensloth_config.devices, opensloth_config, training_config)

Global batch size: 64
[MP] Running on 1 GPUs


03:29:22 | INFO     | GPU0 | opensloth_sft_trainer.py:41 | Training on GPU 0 with output_dir outputs/exps/qwen3-14b-FineTome-1gpus-seql-packing
03:29:22 | INFO     | GPU0 | opensloth_sft_trainer.py:44 | 🚀 Starting total training timer


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_0
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.08it/s]
03:29:42 | INFO     | GPU0 | logging_config.py:161 | ⏱️  model_loading: 8.57s
03:29:42 | INFO     | GPU0 | nccl_grad_sync.py:124 | [GPU=0] NCCL env: RANK=0, WORLD_SIZE=1, MASTER_ADDR=127.0.0.1, MASTER_PORT=29501
03:29:42 | INFO     | GPU0 | nccl_grad_sync.py:128 | [GPU=0] Setting current CUDA device to:0, os.environ['CUDA_VISIBLE_DEVICES']='0'
03:29:42 | INFO     | GPU0 | init_modules.py:50 | Model loaded on device cuda:0, tokenizer: Qwen2TokenizerFast
03:29:45 | INFO     | GPU0 | logging_config.py:161 | ⏱️  lora_setup: 3.23s
03:29:45 | INFO     | GPU0 | init_modules.py:74 | Applied chat template: qwen3
03:29:45 | INFO     | GPU0 |

Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=0] Patching log. Dir: outputs/exps/qwen3-14b-FineTome-1gpus-seql-packing, GPUs: 1
[LOCAL_RANK=0] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch


  0%|          | 0/157 [00:00<?, ?it/s]
03:29:48 | INFO     | GPU0 | patch_sampler.py:21 | 🔄 Starting epoch 1
03:29:48 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 10000 indices
First ids dataset samples: [3771, 6672, 7261, 760, 3779, 1772, 7509, 2679, 2305, 9215]
...Last ids: [9674, 1424, 8935, 1679, 2286, 3657, 4012, 4506, 409, 1824]
03:29:49 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
03:29:49 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 10000 indices
First ids dataset samples: [3771, 6672, 7261, 760, 3779, 1772, 7509, 2679, 2305, 9215]
...Last ids: [9674, 1424, 8935, 1679, 2286, 3657, 4012, 4506, 409, 1824]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  1%|      


=== EXAMPLE #1 ===
<|im_start|>user
What is the similarity between elements in the same vertical column of the periodic table?<|im_end|>
<|im_start|>assistant
<think>

</think>

Valence electrons are the electrons in the outermost shell of an atom, and they determine the chemical properties of the element. Elements in the same vertical column have the same number of valence electrons, which means they have similar chemical properties. For example, all the elements in Group 1 (the alkali metals) have one valence electron, and they are all highly reactive. Similarly, all the elements in Group 17 (the halogens) have seven valence electrons, and they are all highly reactive nonmetals.
####
Elements in the same vertical column, also known as groups or families, share the same number and arrangement of valence electrons.<|im_end|>

More training debug examples written to .log/dataloader_examples.html
[Update step: 0]0/63 - Total tokens seen: 0.00M, Non-padded tokens: 0.00

  1%|▏         | 2/157 [02:13<2:49:43, 65.70s/it]


[Update step: 1]9/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 849
[Update step: 1]10/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 438
[Update step: 1]11/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 609
[Update step: 1]12/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 216
[Update step: 1]13/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 575
[Update step: 1]14/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 489
[Update step: 1]15/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 355
[Update step: 1]16/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 709
[Update step: 1]17/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 287
[Update step: 1]18/63 - Total tokens seen: 0.04M, Non-padded tokens: 0.04M - Sequence length: 447
[Update step: 1]19/6

  2%|▏         | 3/157 [03:06<2:34:17, 60.11s/it]


[Update step: 2]28/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 350
[Update step: 2]29/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 693
[Update step: 2]30/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 420
[Update step: 2]31/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 642
[Update step: 2]32/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 190
[Update step: 2]33/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 136
[Update step: 2]34/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 277
[Update step: 2]35/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 914
[Update step: 2]36/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 826
[Update step: 2]37/63 - Total tokens seen: 0.09M, Non-padded tokens: 0.09M - Sequence length: 1307
[Update step: 2]38

  3%|▎         | 5/157 [05:01<2:29:18, 58.94s/it]


[Update step: 3]47/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 277
[Update step: 3]48/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 469
[Update step: 3]49/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 808
[Update step: 3]50/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 207
[Update step: 3]51/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 659
[Update step: 3]52/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 629
[Update step: 3]53/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 592
[Update step: 3]54/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 1048
[Update step: 3]55/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 449
[Update step: 3]56/63 - Total tokens seen: 0.14M, Non-padded tokens: 0.14M - Sequence length: 815
[Update step: 3]57

  4%|▍         | 6/157 [06:05<2:32:42, 60.68s/it]


[Update step: 5]1/63 - Total tokens seen: 0.18M, Non-padded tokens: 0.18M - Sequence length: 419
[Update step: 5]2/63 - Total tokens seen: 0.18M, Non-padded tokens: 0.18M - Sequence length: 645
[Update step: 5]3/63 - Total tokens seen: 0.18M, Non-padded tokens: 0.18M - Sequence length: 107
[Update step: 5]4/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 808
[Update step: 5]5/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 563
[Update step: 5]6/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 609
[Update step: 5]7/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 971
[Update step: 5]8/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 389
[Update step: 5]9/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 1025
[Update step: 5]10/63 - Total tokens seen: 0.19M, Non-padded tokens: 0.19M - Sequence length: 206
[Update step: 5]11/63 - Tot

  4%|▍         | 7/157 [07:07<2:33:13, 61.29s/it]


[Update step: 6]20/63 - Total tokens seen: 0.23M, Non-padded tokens: 0.23M - Sequence length: 402
[Update step: 6]21/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 1205
[Update step: 6]22/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 269
[Update step: 6]23/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 1053
[Update step: 6]24/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 244
[Update step: 6]25/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 326
[Update step: 6]26/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 807
[Update step: 6]27/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 362
[Update step: 6]28/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 347
[Update step: 6]29/63 - Total tokens seen: 0.24M, Non-padded tokens: 0.24M - Sequence length: 660
[Update step: 6]3

  5%|▌         | 8/157 [08:12<2:34:44, 62.31s/it]


## Unsloth default 

In [None]:
import os

from opensloth.patching.patch_sampler import patch_sampler

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["OPENSLOTH_LOCAL_RANK"] = "0"


def train_qwen3_model():
    """Train Qwen3 model with minimal setup."""
    from opensloth.dataset_utils import get_tokenized_dataset, HFDatasetConfig

    text_dataset = get_tokenized_dataset(
        HFDatasetConfig(
            tokenizer_name="Qwen/Qwen3-8B",
            chat_template="qwen3",
            instruction_part="<|im_start|>user\n",
            response_part="<|im_start|>assistant\n",
            num_samples=10000,
            nproc=52,
            max_seq_length=4096,
            source_type="hf",
            dataset_name="mlabonne/FineTome-100k",
            split="train",
        ),
        do_tokenize=False,
    )
    from unsloth import FastLanguageModel
    import torch
    from trl import SFTTrainer, SFTConfig

    # Load model. Note: this cell loads the 0.6B checkpoint, whereas the
    # OpenSloth section above loads Qwen3-14B.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-0.6B-bnb-4bit",
        max_seq_length=4096,
        load_in_4bit=True,
        load_in_8bit=False,
        full_finetuning=False,
    )

    # Add LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r=8,
        lora_alpha=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing=True,
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )
    args = SFTConfig(
        output_dir="outputs/exps/qwen3-14b-FineTome-unsloth",
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,  # 4 x 16 x 1 GPU = global batch size 64; adjust based on n_gpu
        warmup_steps=5,
        learning_rate=1e-5,
        num_train_epochs=1,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="wandb",  
    )

    # args.skip_prepare_dataset = True
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=text_dataset,
        eval_dataset=None,
        args=args,
    )
    from unsloth_zoo.dataset_utils import train_on_responses_only

    trainer = train_on_responses_only(
        trainer,
        tokenizer=tokenizer,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
    )

    # Show memory stats
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"{start_gpu_memory} GB of memory reserved.")

    # Patch the trainer to use the same epoch-seeded sampler as the OpenSloth run,
    # then train the model
    trainer = patch_sampler(trainer)
    trainer_stats = trainer.train()

    # Show final memory and time stats
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(
        f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
    )
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

    return model, tokenizer


if __name__ == "__main__":
    model, tokenizer = train_qwen3_model()
    print("Training completed successfully!")

16:47:41 | INFO     | GPU0 | dataset_utils.py:222 | Preparing dataset 7fe3c373565b53a9...


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.5.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Tokenizing ["text"] (num_proc=104):   0%|          | 0/10000 [00:00<?, ? examples/s]

Map (num_proc=104):   0%|          | 0/10000 [00:00<?, ? examples/s]

GPU = NVIDIA H100 80GB HBM3. Max memory = 79.189 GB.
0.812 GB of memory reserved.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 157
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 16 x 1) = 64
 "-____-"     Trainable parameters = 5,046,272/6,000,000,000 (0.08% trained)
wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: anhvth to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


16:49:02 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 10000 indices
First ids dataset samples: [3771, 6672, 7261, 760, 3779, 1772, 7509, 2679, 2305, 9215]
...Last ids: [9674, 1424, 8935, 1679, 2286, 3657, 4012, 4506, 409, 1824]


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.334
2,1.3177
3,1.2919
4,1.3261
5,1.3738
6,1.3305
7,1.3372
8,1.1476
9,1.2005
10,1.3537


17:13:00 | INFO     | GPU0 | patch_sampler.py:61 | 🎲 Sampler epoch 0: dataset_size=10000
   📋 First 10 indices: [3771, 6672, 7261, 760, 3779, 1772, 7509, 2679, 2305, 9215]
   📋 Last 10 indices: [9674, 1424, 8935, 1679, 2286, 3657, 4012, 4506, 409, 1824]


1444.214 seconds used for training.
24.07 minutes used for training.
Peak reserved memory = 1.725 GB.
Peak reserved memory for training = 0.913 GB.
Peak reserved memory % of max memory = 2.178 %.
Peak reserved memory for training % of max memory = 1.153 %.
Training completed successfully!
