# Qwen Model Fine-tuning with Unsloth

This notebook demonstrates how to fine-tune a Qwen model using the Unsloth framework for efficient training.

## Features
- Up to 2x faster training and up to 70% less VRAM usage (figures reported by Unsloth)
- Support for 4-bit quantization and LoRA adapters
- Chat-template data formatting that extends to mixed datasets (e.g. reasoning + conversational)
- Compatible with Google Colab and local environments

## Installation

First, let's install the required dependencies:

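```python
# Install PyTorch first only if your environment lacks it (Colab ships it);
# installing it after Unsloth can replace the build Unsloth expects
# !pip install torch torchvision torchaudio
# Unsloth and its companion package (no --force-reinstall, which would also reinstall torch)
!pip install --upgrade --no-cache-dir unsloth unsloth_zoo
# Hugging Face training stack and 4-bit kernels (no-op for packages already installed)
!pip install transformers datasets accelerate peft trl bitsandbytes
# Optional: xformers must match your torch version, so install it only if needed
# !pip install xformers
```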

## Import Libraries

```python
import torch
from unsloth import FastLanguageModel  # import before trl/transformers so Unsloth's patches apply
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Check GPU availability
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
```

## Model Configuration

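```python
# Model configuration
model_name = "unsloth/Qwen3-14B"  # choose a Qwen3 size that fits your GPU memory
max_seq_length = 2048  # maximum tokens per training example; adjust to your data
dtype = None  # None = auto-detect (bfloat16 on Ampere or newer GPUs, else float16)
load_in_4bit = True  # Use 4-bit quantization for memory efficiency

# LoRA configuration
lora_r = 16  # rank of the low-rank update matrices
lora_alpha = 16  # updates are scaled by lora_alpha / lora_r (1.0 here)
lora_dropout = 0  # 0 uses Unsloth's fully optimized path; non-zero works but is slower
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
    "gate_proj", "up_proj", "down_proj",     # MLP projections
]

print(f"Loading model: {model_name}")
print(f"Max sequence length: {max_seq_length}")
print(f"4-bit quantization: {load_in_4bit}")
```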
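Whether a model fits is driven mostly by the `load_in_4bit` choice. A rough lower bound for the frozen base weights is parameters × bits ÷ 8 bytes. The sketch below assumes a round 14 billion parameters for the 14B model and deliberately ignores quantization metadata (per-block scales), activations, the LoRA weights and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


num_params = 14e9  # assumption: a round 14B parameters
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(num_params, bits):.1f} GiB")
```

For 14B parameters this gives about 26.1 GiB at 16-bit versus about 6.5 GiB at 4-bit, which is why 4-bit loading lets the 14B model fit on a single consumer or Colab GPU.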

## Load Model and Tokenizer

```python
# Load model and tokenizer with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="auto",
)

print("Model loaded successfully!")
print(f"Model type: {type(model)}")
print(f"Tokenizer type: {type(tokenizer)}")
```

## Add LoRA Adapters

```python
# Add LoRA adapters for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_r,
    target_modules=target_modules,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving gradient checkpointing
    random_state=42,
    use_rslora=False,  # True scales updates by alpha / sqrt(r) instead of alpha / r
    loftq_config=None,
)

print("LoRA adapters added successfully!")
# PEFT's helper counts 4-bit weights correctly; summing p.numel() undercounts
# them because bitsandbytes packs two 4-bit values into each stored byte
model.print_trainable_parameters()
```
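The trainable-parameter count follows from how LoRA is built: each adapted linear layer mapping `d_in` to `d_out` gets a matrix A of shape (r × d_in) and a matrix B of shape (d_out × r), adding r·(d_in + d_out) parameters. The sketch below applies that to the seven target modules of each decoder layer. `DecoderDims` and the helper functions are hypothetical, and the dimensions are assumed values meant to match Qwen3-14B's `config.json`; verify them against the actual file before relying on the number.

```python
from dataclasses import dataclass


@dataclass
class DecoderDims:
    """Layer shapes needed to size LoRA adapters (hypothetical helper)."""
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: int
    num_layers: int


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


def lora_params_per_model(dims: DecoderDims, r: int) -> int:
    """Total LoRA parameters for q/k/v/o and gate/up/down in every layer."""
    h, i = dims.hidden_size, dims.intermediate_size
    q_out = dims.num_attention_heads * dims.head_dim
    kv_out = dims.num_key_value_heads * dims.head_dim  # grouped-query attention
    per_layer = (
        lora_params(h, q_out, r)      # q_proj
        + lora_params(h, kv_out, r)   # k_proj
        + lora_params(h, kv_out, r)   # v_proj
        + lora_params(q_out, h, r)    # o_proj
        + lora_params(h, i, r)        # gate_proj
        + lora_params(h, i, r)        # up_proj
        + lora_params(i, h, r)        # down_proj
    )
    return per_layer * dims.num_layers


# Assumed dimensions; check them against the model's config.json
qwen3_14b_dims = DecoderDims(
    hidden_size=5120, intermediate_size=17408,
    num_attention_heads=40, num_key_value_heads=8, head_dim=128, num_layers=40,
)
print(f"{lora_params_per_model(qwen3_14b_dims, r=16):,} trainable LoRA parameters")
```

With these assumed dimensions and r = 16 the count is 64,225,280; if the dimensions are right, this should agree with the trainable figure from `print_trainable_parameters()`. Doubling `lora_r` doubles the count.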

## Load and Prepare Dataset

```python
# Load a sample dataset (you can replace this with your own)
# Alpaca has "instruction", "input", "output" and a pre-formatted "text" column
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")  # first 1,000 rows for the demo

print(f"Dataset loaded: {len(dataset)} samples")
print(f"Sample keys: {dataset.column_names}")

# Show a sample, truncating long fields
sample = dataset[0]
print("\nSample data:")
for key, value in sample.items():
    text = str(value)
    print(f"{key}: {text[:100]}..." if len(text) > 100 else f"{key}: {text}")
```

## Format Dataset for Chat Template

```python
# Format the dataset with the model's chat template
def format_chat_template(examples):
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Build a conversation; the optional Alpaca "input" is appended to the user turn
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": instruction + (f"\n\n{input_text}" if input_text else "")},
            {"role": "assistant", "content": output}
        ]

        # Render to a single string (tokenization happens later, in the trainer)
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return {"text": texts}

# Apply formatting and drop the original columns, including Alpaca's own
# "text" field, which is replaced by the chat-formatted version
dataset = dataset.map(format_chat_template, batched=True, remove_columns=dataset.column_names)

print("Dataset formatted successfully!")
print("Sample formatted text:")
print(dataset[0]["text"][:500] + "...")
```
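`apply_chat_template` renders the messages with the Jinja template shipped in the tokenizer. Qwen models use a ChatML-style layout in which each turn is wrapped in `<|im_start|>role` ... `<|im_end|>` markers. The functions below are a simplified, hypothetical approximation meant only for seeing that structure without loading a tokenizer; the real Qwen3 template adds more (for example handling of `<think>` reasoning blocks and tool calls), so the training data should always come from the tokenizer itself.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Simplified ChatML rendering (hypothetical helper, not Qwen's exact template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def build_messages(instruction: str, input_text: str, output: str) -> list[dict[str, str]]:
    """Same message layout as format_chat_template above."""
    user = instruction + (f"\n\n{input_text}" if input_text else "")
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": output},
    ]


print(render_chatml(build_messages("Name a primary color.", "", "Red.")))
```

Training examples end with a closed assistant turn (`add_generation_prompt=False`), while inference prompts end with an open one (`add_generation_prompt=True`), which is the difference between the formatting here and the test prompt later in the notebook.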

## Training Configuration

```python
from trl import SFTConfig  # TrainingArguments subclass that also carries SFT-specific options

# Training arguments
training_args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size per GPU = 2 * 4 = 8
    warmup_steps=5,
    num_train_epochs=1,  # Reduced for demo
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    output_dir="./qwen_finetuned",
    save_steps=50,
    save_total_limit=2,
    dataloader_num_workers=0,
    report_to="none",  # turn off W&B/TensorBoard reporting unless you configure it
    # SFT options: recent TRL releases expect these here rather than as SFTTrainer arguments
    dataset_text_field="text",
    dataset_num_proc=2,
    packing=False,
)

print("Training arguments configured:")
print(f"- Batch size: {training_args.per_device_train_batch_size}")
print(f"- Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"- Learning rate: {training_args.learning_rate}")
print(f"- Mixed precision: {'BF16' if training_args.bf16 else 'FP16' if training_args.fp16 else 'FP32'}")
print(f"- Epochs: {training_args.num_train_epochs}")
```

## Create Trainer

```python
# Create SFT trainer. Unsloth's notebooks pass `tokenizer=` (plain TRL 0.12+
# names this argument `processing_class`) and rely on the max_seq_length given
# to FastLanguageModel.from_pretrained for the length limit.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

print("Trainer created successfully!")
print(f"Training dataset size: {len(trainer.train_dataset)}")
```
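The batch settings fix how many optimizer updates one epoch performs: each update sees `per_device_train_batch_size × gradient_accumulation_steps` examples (8 here on one GPU). The sketch below computes the step count and reproduces the linear warmup-then-decay multiplier that `lr_scheduler_type="linear"` applies to the `2e-4` peak learning rate (the same shape as `transformers.get_linear_schedule_with_warmup`). It assumes a single GPU, and the function names are hypothetical; the Trainer's exact rounding of partial batches can differ between versions.

```python
import math


def update_steps_per_epoch(num_examples: int, batch_size: int, grad_accum: int) -> int:
    """Optimizer updates per epoch on one device (sketch; partial micro-batches count)."""
    micro_batches = math.ceil(num_examples / batch_size)
    return max(micro_batches // grad_accum, 1)


def linear_lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Ramp linearly from 0 to 1 during warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total = update_steps_per_epoch(num_examples=1000, batch_size=2, grad_accum=4)
peak_lr = 2e-4
for step in (0, 5, total // 2, total):
    print(step, peak_lr * linear_lr_multiplier(step, warmup_steps=5, total_steps=total))
```

With the notebook's 1,000 examples this gives 125 updates, so `save_steps=50` writes checkpoints at steps 50 and 100, and the learning rate peaks at step 5 before decaying to zero.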

## Start Training

```python
# Start training
print("Starting training...")
trainer.train()

print("Training completed successfully!")
```
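After `train()` returns, the metrics logged every step (`logging_steps=1`) are kept in `trainer.state.log_history`, a list of dicts that each carry a `step` key. The helper below is hypothetical, not part of the Trainer API: it pulls out the training-loss curve and skips entries without a `loss` key, such as the final summary entry (which reports `train_loss`) and any evaluation entries (which report `eval_loss`).

```python
def loss_curve(log_history: list[dict]) -> tuple[list[int], list[float]]:
    """Return (steps, losses) from Trainer log entries that contain a 'loss' key."""
    steps, losses = [], []
    for entry in log_history:
        if "loss" in entry:  # per-step training logs only
            steps.append(entry["step"])
            losses.append(entry["loss"])
    return steps, losses


# Usage in the notebook:
# steps, losses = loss_curve(trainer.state.log_history)
# print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The two lists can be passed straight to a plotting library to check that the loss trends downward.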

## Save the Model

```python
# Save the LoRA adapter and tokenizer (small: only the adapter weights are stored)
output_dir = "./qwen_finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to {output_dir}")

# Optional: merge the adapter into the base weights and save a standalone
# 16-bit model (needs roughly 2 bytes of disk per parameter)
print("\nSaving model in merged 16-bit format...")
model.save_pretrained_merged(
    output_dir + "_merged_16bit",
    tokenizer,
    save_method="merged_16bit",
)

print("Model saved as a LoRA adapter and as a merged 16-bit model!")
```
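For the LoRA save, PEFT writes an `adapter_config.json` next to the adapter weights (by default `adapter_model.safetensors` in recent versions). A quick sanity check, sketched below with a hypothetical helper, reads the config back and confirms that the rank and alpha match the values used for training. The file names are assumptions about PEFT's defaults; the `r` and `lora_alpha` keys are `LoraConfig` fields.

```python
import json
from pathlib import Path


def check_adapter_dir(path: str, expected_r: int, expected_alpha: int) -> list[str]:
    """Return problems found in a saved LoRA adapter directory (empty list if none)."""
    directory = Path(path)
    config_file = directory / "adapter_config.json"
    if not config_file.exists():
        return [f"missing {config_file.name}"]

    config = json.loads(config_file.read_text())
    problems = []
    if config.get("r") != expected_r:
        problems.append(f"r={config.get('r')}, expected {expected_r}")
    if config.get("lora_alpha") != expected_alpha:
        problems.append(f"lora_alpha={config.get('lora_alpha')}, expected {expected_alpha}")
    # Accept either .safetensors or the older .bin weights file
    if not list(directory.glob("adapter_model.*")):
        problems.append("no adapter_model.* weights file")
    return problems


# Usage in the notebook:
# print(check_adapter_dir(output_dir, lora_r, lora_alpha) or "adapter looks complete")
```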

## Test the Fine-tuned Model

```python
# Test the fine-tuned model
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Test prompt
test_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what machine learning is in simple terms."}
]

test_prompt = tokenizer.apply_chat_template(
    test_messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 template option: answer without a <think> reasoning block
)

print("Test prompt:")
print(test_prompt)
print("\n" + "="*50 + "\n")

# Generate response
inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens. Slicing the decoded string by
# len(test_prompt) is unreliable: skip_special_tokens=True strips the template's
# special tokens, so the decoded text no longer starts with the exact prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print("Generated response:")
print(response)
```
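The generation call samples (`do_sample=True`) with `temperature=0.7`. Temperature divides the logits before the softmax, so values below 1 sharpen the distribution toward the most likely token and values above 1 flatten it. The toy sketch below uses made-up logits and illustrates only that step; `generate` may combine it with other logits processors, such as top-k or top-p settings from the model's generation config.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not from a real model
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lowering the temperature raises the top token's probability, which makes answers more deterministic; raising it gives more varied but less focused output.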

## Memory Usage Statistics

```python
# Display GPU memory usage statistics
if torch.cuda.is_available():
    print("GPU Memory Usage:")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")  # held by live tensors
    print(f"Reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")  # held by PyTorch's caching allocator
    print(f"Max allocated: {torch.cuda.max_memory_allocated(0) / 1024**3:.2f} GB")  # peak so far

    # Return unused cached blocks to the driver (live tensors are unaffected)
    torch.cuda.empty_cache()
    print("\nUnused cached memory released.")
else:
    print("No GPU available - running on CPU")
```

## Summary

This notebook demonstrated:

1. **Efficient Model Loading**: Using Unsloth's optimized model loading with 4-bit quantization
2. **LoRA Fine-tuning**: Adding Low-Rank Adaptation layers for efficient parameter updates
3. **Dataset Formatting**: Converting datasets to chat template format
4. **Optimized Training**: Using TRL's `SFTTrainer`, patched by Unsloth, with memory-efficient settings
5. **Model Saving**: Saving both the LoRA adapter and a merged 16-bit model for different use cases
6. **Inference Testing**: Testing the fine-tuned model with faster inference

### Key Benefits of Unsloth:
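- **Up to 2x faster training** than a standard Hugging Face + PEFT setup, per Unsloth's benchmarks
- **Up to 70% less VRAM usage**, enabling larger models on smaller GPUs
- **No approximations** - Unsloth reports that its kernels are exact, so results match standard training with the same settings (4-bit loading is a separate memory/accuracy trade-off)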
- **Easy integration** with existing workflows

### Next Steps:
1. Experiment with different model sizes (4B, 8B, 14B, 32B)
2. Try different datasets and mixing ratios
3. Adjust LoRA parameters for your specific use case
4. Implement evaluation metrics for your domain
5. Deploy the model using your preferred serving framework
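For the mixing-ratio experiments (for example the reasoning + conversational mix mentioned under Features), one simple approach is to fix the share of one dataset in the final mix and subsample the other to match. The sketch below does only that arithmetic, and `mix_counts` is a hypothetical name. With Hugging Face `datasets`, you would then `select` that many rows from each dataset, combine them with `concatenate_datasets`, and `shuffle` the result.

```python
def mix_counts(n_a: int, n_b: int, share_a: float) -> tuple[int, int]:
    """Rows to take from datasets A and B so that A makes up `share_a` of the mix.

    Uses all of A when B is large enough; otherwise uses all of B and
    subsamples A instead.
    """
    if not 0 < share_a < 1:
        raise ValueError("share_a must be strictly between 0 and 1")
    # Rows of B needed to pair with all of A at the requested share
    needed_b = round(n_a * (1 - share_a) / share_a)
    if needed_b <= n_b:
        return n_a, needed_b
    # Not enough rows in B: keep all of B and shrink A to preserve the ratio
    return round(n_b * share_a / (1 - share_a)), n_b


print(mix_counts(n_a=20_000, n_b=100_000, share_a=0.75))  # 75% A, 25% B
```

Fixing the ratio rather than concatenating whole datasets keeps a large conversational set from drowning out a smaller reasoning set (or the reverse).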