# Unsloth challenge 2- Make QLoRA work with FSDP2

## Problem statement

---
---
---
<a name="FSDP2"></a>
## B) Make `QLoRA` work with `FSDP2` [Difficulty: Medium to Hard] [Max points: 10]

1. Goal: Write a single Python script to finetune Llama 3.1 8B on 2x or more GPUs with FSDP2.

2. You must showcase this working in a free **Kaggle notebook with 2 x Tesla T4 GPUs**.

3. Pipeline parallelism is also fine, but must utilize [`zero bubble scheduling`](https://pytorch.org/docs/stable/distributed.pipelining.html#torch.distributed.pipelining.schedules.ScheduleInterleavedZeroBubble) somehow.

4. Can use a pre-quantized 4bit BnB safetensor file from [Unsloth's HF page](https://huggingface.co/unsloth) or a full 16bit one, but must do QLoRA.

5. Can use `accelerate` but must be FSDP2 or related - you can investigate https://github.com/huggingface/accelerate/pull/3394, Torch Titan, other repos etc.

6. Must be fully `transformers` compatible - so we must use `TrainingArguments` and `Trainer`, or `TRL` related classes.

7. The loss must be equivalent to single GPU training.

8. You must enable all features in FSDP2 - ie showcase offloading, checkpointing, mixed precision training etc.

9. You can use `nf4` from `torch AO`, but best from `bitsandbytes`.

10. Finally showcase everything working in a free Kaggle 2x Tesla T4 notebook.

## Evaluation Parameters

## Marking Criteria for B) Max points = 10
```python
if attemped_B:
    B_score = 0
    if FSDP2_works_with_QLoRA:
        if torch_compile_works: B_score += 5
        else: B_score += 3
        if uses_part_A_and_single_kernel_and_faster: B_score += 3
        elif uses_torchAO:
            if torchAO_slower_than_BnB: B_score -= 3
    elif TP_or_PP_with_QLoRA:
        if zero_bubble: B_score += 3
        else: B_score += 2
    elif FSDP1_works_with_QLoRA:
        B_score += 1
    if kaggle_notebook_2_tesla_t4_example:
        B_score += 2
    else:
        B_score = 0
    final_score += B_score
else:
    final_score -= 2
```

start with library setup 

In [4]:
!pip install bitsandbytes
!pip install peft
!pip install transformers
!pip install datasets



In [5]:
!pip install torch



In [6]:
import os
import sys
import subprocess
from packaging import version

# Clear environment variables
if "MASTER_ADDR" in os.environ: del os.environ["MASTER_ADDR"]
if "MASTER_PORT" in os.environ: del os.environ["MASTER_PORT"]
if "RANK" in os.environ: del os.environ["RANK"]
if "WORLD_SIZE" in os.environ: del os.environ["WORLD_SIZE"]
if "LOCAL_RANK" in os.environ: del os.environ["LOCAL_RANK"]

# Setup CUDA devices
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Check bitsandbytes
try:
    import bitsandbytes
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "bitsandbytes"])
    import bitsandbytes

if version.parse(bitsandbytes.__version__) < version.parse("0.39.0"):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "bitsandbytes"])
    print("Please restart the runtime after upgrading bitsandbytes.")
    sys.exit(0)

# Install packages
subprocess.check_call([sys.executable, "-m", "pip", "install", "accelerate>=0.23.0"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "trl>=0.7.4"])

# Performance settings
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,roundup_power2_divisions:[32:256,64:128,256:64,>:32]"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, TaskType
from trl import SFTTrainer

print("Setting up QLoRA with model parallel...")

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

model_name = "unsloth/meta-Llama-3.1-8B-Instruct-bnb-4bit"
max_seq_length = 1024

# Configure BnB quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model with device_map to distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # This is what provides the multi-GPU support
    torch_dtype=torch.float16,
    offload_folder="offload",  # Enable CPU offloading
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"

model.config.use_cache = False

# Apply LoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)

# Set trainable parameters - only LoRA adapters
with torch.no_grad():
    for name, param in model.named_parameters():
        if ".lora_A." in name or ".lora_B." in name:
            param.requires_grad_(True)
        else:
            param.requires_grad_(False)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Load dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train[:5%]")

# Training arguments WITHOUT FSDP as it's not compatible with single-process notebook
training_args = TrainingArguments(
    output_dir="./llama-3.1-8b-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=1,
    max_steps=15,
    logging_steps=1,
    save_steps=15,
    learning_rate=2e-5,
    fp16=True,
    optim="adamw_torch",
    report_to="none",
    # No FSDP settings - we're using device_map="auto" instead
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Print configuration
print(f"Starting training with QLoRA and model parallelism")
print(f"Model: {model_name}")
print(f"Device map: {model.hf_device_map}")
print(f"Dataset size: {len(dataset)}")
print(f"Batch size per device: {training_args.per_device_train_batch_size}")
print(f"Gradient accumulation: {training_args.gradient_accumulation_steps}")

# Start training
trainer.train()

# Save model
model.save_pretrained("./llama-3.1-8b-lora")
print("Training complete. Model saved to ./llama-3.1-8b-lora")


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Setting up QLoRA with model parallel...
CUDA available: True
Number of GPUs: 2




Starting training with QLoRA and model parallelism
Model: unsloth/meta-Llama-3.1-8B-Instruct-bnb-4bit
Device map: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 1, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
Dataset size: 10514
Batch size per device: 2
Gradient accumulation: 4


Step,Training Loss
1,9.9611
2,8.9489
3,8.4481
4,10.8177
5,7.3112
6,8.7521
7,6.9094
8,8.5203
9,7.5204
10,6.8818


Training complete. Model saved to ./llama-3.1-8b-lora
