# 🚀 HyperSloth Demo Training Notebook

This notebook demonstrates how to fine-tune large language models using HyperSloth's multi-GPU capabilities. It's equivalent to running:

```bash
hypersloth-train examples/example_sharegpt_lora_2gpus.py
```

## What This Demo Does

- **Multi-GPU Training**: Trains across multiple GPUs with NCCL synchronization (the run recorded in this notebook uses 4)
- **Adaptive Batching**: Optimizes sequence sorting and padding
- **LoRA Fine-tuning**: Efficient parameter updates with Low-Rank Adaptation
- **Response-only Loss**: Calculates loss only on assistant responses
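
Response-only loss is commonly implemented by masking the labels of every token outside an assistant turn, so those positions do not contribute to the cross-entropy. The sketch below illustrates the idea on plain Python lists. It is not HyperSloth's implementation: the helpers `find_all` and `mask_non_response` and the toy token IDs are hypothetical, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def find_all(seq, pattern):
    """Return the start offsets of every occurrence of `pattern` in `seq`."""
    n, m = len(seq), len(pattern)
    return [i for i in range(n - m + 1) if seq[i:i + m] == pattern]


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant turns.

    An assistant turn starts right after a `response_ids` marker and ends at
    the next `instruction_ids` marker (or at the end of the sequence).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # First instruction marker after this response, if there is one.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy IDs (assumption): 1 = user marker, 2 = assistant marker, the rest is content.
input_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_response(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

In a real pipeline the two markers would be the tokenized forms of strings such as `<|im_start|>user\n` and `<|im_start|>assistant\n`, which usually span several tokens each.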

## Prerequisites

1. HyperSloth installed: `pip install git+https://github.com/anhvth/HyperSloth.git`
2. At least 2 GPUs available (the configuration in this notebook uses `gpus=[0, 1, 2, 3]`; adjust the list to match your machine)
3. Sufficient VRAM (reduce batch size if needed)

In [1]:
%%capture
%load_ext autoreload
%autoreload 2

In [2]:
# Import HyperSloth configuration classes
from HyperSloth.hypersloth_config import *

# Check GPU availability
import torch
print(f'🔥 CUDA Available: {torch.cuda.is_available()}')
print(f'🔥 GPU Count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'   GPU {i}: {torch.cuda.get_device_name(i)}')


🔥 CUDA Available: True
🔥 GPU Count: 4
   GPU 0: NVIDIA H100 80GB HBM3
   GPU 1: NVIDIA H100 80GB HBM3
   GPU 2: NVIDIA H100 80GB HBM3
   GPU 3: NVIDIA H100 80GB HBM3


## ⚙️ Configuration Setup

HyperSloth uses Pydantic models for type-safe configuration. We'll set up:

1. **Data Configuration**: Dataset and tokenization settings
2. **Training Configuration**: GPU allocation and loss calculation
3. **Model Configuration**: Base model and LoRA parameters
4. **Training Arguments**: Learning rate, batch size, and optimization settings
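
To make the LoRA parameters concrete: the configuration in this notebook uses `r=8`, `lora_alpha=16`, and `use_rslora=False`. In standard LoRA the low-rank update is scaled by `lora_alpha / r`, while rank-stabilized LoRA scales by `lora_alpha / sqrt(r)`. The sketch below works through that arithmetic and the adapter size for a single linear layer. The helper names and the 1024 x 1024 layer shape are assumptions made for illustration, not values taken from the model.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Scale applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
d_in = d_out = 1024  # hypothetical layer shape, for illustration only

scaling = lora_scaling(r, lora_alpha)
adapter_params = lora_param_count(d_in, d_out, r)
full_params = d_in * d_out

print(f"scaling = {scaling}")
print(f"adapter params = {adapter_params} "
      f"({adapter_params / full_params:.2%} of the full weight matrix)")
```

Under these assumptions the adapter adds only a small fraction of the parameters of the frozen weight matrix, which is what keeps LoRA fine-tuning cheap.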

In [5]:
from HyperSloth.hypersloth_config import *
from HyperSloth.scripts.hp_trainer import run_mp_training, setup_envs

# Main configuration using Pydantic models
hyper_config_model = HyperConfig(
    data=HFDatasetConfig(
        dataset_name="llamafactory/OpenThoughts-114k",
        split="train",
        tokenizer_name="Qwen/Qwen3-8B",  # exact model size does not matter; any tokenizer from the same Qwen3 family works
        num_samples=1000,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
        chat_template="chatml",
    ),
    training=TrainingConfig(
        gpus=[0, 1, 2, 3],
        loss_type="response_only",
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/Qwen3-0.6b-bnb-4bit",
        max_seq_length=32_000,
        load_in_4bit=True,
    ),
    lora_args=LoraArgs(
        r=8,
        lora_alpha=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_dropout=0,
        bias="none",
        use_rslora=False,
    ),
)

# Training arguments using Pydantic model
training_config_model = TrainingArgsConfig(
    output_dir="outputs/qwen3-8b-openthought-2gpus/",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    logging_steps=3,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=5,
    save_total_limit=2,
    weight_decay=0.01,
    optim="adamw_8bit",
    seed=3407,
    report_to="none",  # or "tensorboard" / "wandb"
)

setup_envs(hyper_config_model, training_config_model)

run_mp_training(
    hyper_config_model.training.gpus, hyper_config_model, training_config_model
)

Global batch size: 32
[MP] Running on 4 GPUs
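
The `Global batch size: 32` line above is consistent with multiplying the per-device batch size by the gradient accumulation steps and the number of GPUs. The sketch below reproduces that arithmetic and, assuming each optimizer step consumes one global batch and a partial final batch still counts as a step, the 96 total steps shown by the progress bar. The variable names mirror the configuration fields but are otherwise illustrative, not HyperSloth internals.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4              # gpus=[0, 1, 2, 3]
num_samples = 1000
num_train_epochs = 3

# Samples consumed per optimizer step across all GPUs.
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: the last, partially filled batch of an epoch still triggers a step.
steps_per_epoch = math.ceil(num_samples / global_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(global_batch_size, steps_per_epoch, total_steps)
```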


06:37:23 | INFO     | GPU0 | hp_trainer.py:42 | Training on GPU 0 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:37:23 | INFO     | GPU0 | hp_trainer.py:45 | 🚀 Starting total training timer
06:37:23 | INFO     | GPU2 | hp_trainer.py:42 | Training on GPU 2 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:37:23 | INFO     | GPU2 | hp_trainer.py:45 | 🚀 Starting total training timer
06:37:23 | INFO     | GPU1 | hp_trainer.py:42 | Training on GPU 1 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:37:23 | INFO     | GPU1 | hp_trainer.py:45 | 🚀 Starting total training timer
06:37:23 | INFO     | GPU3 | hp_trainer.py:42 | Training on GPU 3 with output_dir outputs/qwen3-8b-o

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_2
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=2] Patching log. Dir: outputs/qwen3-8b-openthought-2gpus/, GPUs: 4
[LOCAL_RANK=2] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unslot

  0%|          | 0/96 [00:00<?, ?it/s]
06:38:02 | INFO     | GPU3 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
  0%|          | 0/96 [00:00<?, ?it/s]
06:38:02 | INFO     | GPU0 | patch_sampler.py:21 | 🔄 Starting epoch 1
06:38:02 | INFO     | GPU2 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
06:38:02 | INFO     | GPU1 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
06:38:02 | I


=== EXAMPLE #1 ===
<|im_start|>system
You are an assistant that thoroughly explores questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop a well-considered thinking process. Detail your reasoning process using the specified format: <think>thought with steps separated by '

'</think> Each step should include detailed considerations such as analyzing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. Based on various attempts, explorations, and reflections from the thoughts, you should systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary steps needed to reach 

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
06:38:11 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
06:38:11 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
 32%|███▏      | 31/96 [03:08<05:50,  5.39s/it]
06:41:11 | INFO     | GPU3 | patch_sampler.py:61 | 🎲 Sampler epoch 0: dataset_size=1000
   📋 First 10 indices: [776, 507, 89


More training debug examples written to .log/dataloader_examples.html
Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 0.7813999652862549, 'grad_norm': 0.6751890182495117, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}
{'loss': 0.7476000189781189, 'grad_norm': 0.5184872150421143, 'learning_rate': 1e-05, 'epoch': 0.05}
{'loss': 0.7621999979019165, 'grad_norm': 0.5991377234458923, 'learning_rate': 9.670329670329671e-06, 'epoch': 0.07}
{'loss': 0.7696999907493591, 'grad_norm': 0.517305314540863, 'learning_rate': 9.340659340659341e-06, 'epoch': 0.1}
{'loss': 0.7705000042915344, 'grad_norm': 0.5889832973480225, 'learning_rate': 9.010989010989011e-06, 'epoch': 0.12}
{'loss': 0.7658999562263489, 'grad_norm': 0.5188556909561157, 'learning_rate': 8.681318681318681e-06, 'epoch': 0.14}
{'loss': 0.7606000304222107, 'grad_norm': 0.5299318432807922, 'learning_rate': 8.351648351648353e-06, 'epoch': 0.17}
{'loss': 0.7849999666213989, 'grad_norm': 0.5340387225151062, 'learning_ra

06:41:20 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
06:41:20 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 1: emitting 1000 indices
First ids: [64, 461, 401, 241, 916, 634, 953, 426, 183, 501]
...Last ids: [715, 687, 378, 473, 982, 147, 780, 712, 292, 39]
 66%|██████▌   | 63/96 [06:03<02:49,  5.14s/it]
06:44:06 | INFO     | GPU3 | patch_sampler.py:61 | 🎲 Sampler epoch 1: dataset_size=1000
   📋 First 10 indices: [64, 461, 401, 241, 916, 634, 953, 426, 183, 501]
   📋 Last 10 indices: [715, 687, 378, 473, 982, 147, 780, 712, 292, 39]
06:44:06 | INFO     | GPU2 | patch_sampler.py:61 | 🎲 Sampler epoch 1: dataset_size=1000
   📋 First 10 indices: [64, 461, 401, 241, 916, 634, 953, 426, 183, 501]
   📋 Last 10 indices: [7


More training debug examples written to .log/dataloader_examples.html
{'loss': 0.7128000259399414, 'grad_norm': 0.4320141673088074, 'learning_rate': 7.032967032967034e-06, 'epoch': 1.01}
{'loss': 0.7687000036239624, 'grad_norm': 0.4193204939365387, 'learning_rate': 6.703296703296703e-06, 'epoch': 1.03}
{'loss': 0.7821000218391418, 'grad_norm': 0.4171277582645416, 'learning_rate': 6.373626373626373e-06, 'epoch': 1.06}
{'loss': 0.7347999811172485, 'grad_norm': 0.3646993339061737, 'learning_rate': 6.043956043956044e-06, 'epoch': 1.08}
{'loss': 0.7346000075340271, 'grad_norm': 0.3870265483856201, 'learning_rate': 5.7142857142857145e-06, 'epoch': 1.1}
{'loss': 0.6715999841690063, 'grad_norm': 0.34849029779434204, 'learning_rate': 5.384615384615385e-06, 'epoch': 1.13}
{'loss': 0.6632000207901001, 'grad_norm': 0.3121817708015442, 'learning_rate': 5.054945054945055e-06, 'epoch': 1.15}
{'loss': 0.6910999417304993, 'grad_norm': 0.3255983293056488, 'learning_rate': 4.725274725274726e-06, 'epoch'

06:44:14 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
06:44:14 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 2: emitting 1000 indices
First ids: [103, 186, 913, 930, 85, 790, 223, 505, 925, 979]
...Last ids: [296, 230, 388, 180, 119, 882, 718, 554, 532, 418]
 99%|█████████▉| 95/96 [08:53<00:04,  4.71s/it]
06:46:56 | INFO     | GPU2 | patch_sampler.py:61 | 🎲 Sampler epoch 2: dataset_size=1000
   📋 First 10 indices: [103, 186, 913, 930, 85, 790, 223, 505, 925, 979]
   📋 Last 10 indices: [296, 230, 388, 180, 119, 882, 718, 554, 532, 418]
06:46:56 | INFO     | GPU1 | patch_sampler.py:61 | 🎲 Sampler epoch 2: dataset_size=1000
   📋 First 10 indices: [103, 186, 913, 930, 85, 790, 223, 505, 925, 979]
   📋 Last 10 ind

Unsloth: Will smartly offload gradients to save VRAM!
Unsloth: Will smartly offload gradients to save VRAM!


100%|██████████| 96/96 [08:57<00:00,  5.60s/it]
06:46:59 | INFO     | GPU2 | logging_config.py:140 | ⏱️  actual_training: 9.0m


Unsloth: Will smartly offload gradients to save VRAM!

More training debug examples written to .log/dataloader_examples.html
{'loss': 0.7157999873161316, 'grad_norm': 0.31424659490585327, 'learning_rate': 3.406593406593407e-06, 'epoch': 2.02}
{'loss': 0.6995999813079834, 'grad_norm': 0.32045120000839233, 'learning_rate': 3.0769230769230774e-06, 'epoch': 2.04}
{'loss': 0.7498999834060669, 'grad_norm': 0.31497055292129517, 'learning_rate': 2.7472527472527476e-06, 'epoch': 2.06}
{'loss': 0.6959999799728394, 'grad_norm': 0.2789263129234314, 'learning_rate': 2.4175824175824177e-06, 'epoch': 2.09}
{'loss': 0.7142999768257141, 'grad_norm': 0.2758362889289856, 'learning_rate': 2.0879120879120883e-06, 'epoch': 2.12}
{'loss': 0.7074999809265137, 'grad_norm': 0.27244481444358826, 'learning_rate': 1.7582417582417585e-06, 'epoch': 2.14}
{'loss': 0.6798999905586243, 'grad_norm': 0.2971792221069336, 'learning_rate': 1.4285714285714286e-06, 'epoch': 2.16}
{'loss': 0.7002000212669373, 'grad_norm': 0.28



All processes finished
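
The logged `learning_rate` values follow the shape expected from `lr_scheduler_type="linear"` with `warmup_steps=5`: a linear ramp up to `1e-5`, then a linear decay toward zero. The sketch below re-implements that schedule in plain Python as an illustration (the function name `linear_lr` is hypothetical). It assumes 96 total optimizer steps, as shown by the progress bar, and that the value logged at optimizer step `n` corresponds to scheduler step `n - 1`; the latter is an inference from the numbers in the log, not documented behavior.

```python
def linear_lr(step, base_lr=1e-5, warmup_steps=5, total_steps=96):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# logging_steps=3, so metrics are reported at optimizer steps 3, 6, 9, ...
for logged_step in (3, 6, 9, 12):
    print(logged_step, linear_lr(logged_step - 1))
```

Under these assumptions the computed values line up with the first four logged entries (about `4e-06`, `1e-05`, `9.67e-06`, and `9.34e-06`).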
