# 🚀 LLM Fine-Tuning Quick Start Demo

This notebook demonstrates the complete pipeline:
1. GPU availability check
2. Model loading (PyTorch)
3. Inference testing
4. Fine-tuning setup (configuration, dataset loading, memory profiling)
5. Benchmarking
6. ONNX export

**Hardware**: NVIDIA L40S (48GB VRAM)  
**Target**: Cross-platform deployment (GPU → ONNX → NPU)

## 1. Environment Setup

In [None]:
import sys
from pathlib import Path

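# Add the project root (the parent of the current working directory) to
# sys.path so that the `src` package can be imported
sys.path.insert(0, str(Path.cwd().parent))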

print("✅ Path configured")
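
The cell above assumes the notebook is launched from a directory that sits directly under the project root. A minimal sketch of a more defensive lookup is shown here; `find_project_root` and its `marker` argument are hypothetical names introduced for illustration, not part of this project.

```python
from pathlib import Path


def find_project_root(start=None, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: `marker` is assumed to be a folder name that only
    exists at the project root (here, the `src` package).
    """
    current = Path(start or Path.cwd()).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' above {current}")
```

With this helper, `sys.path.insert(0, str(find_project_root()))` would work regardless of how deep the notebook directory is nested.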

## 2. GPU Availability Check

In [None]:
from src.utils.gpu_check import check_gpu_availability, print_gpu_info

gpu_info = check_gpu_availability()
print_gpu_info(gpu_info)
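
The source of `check_gpu_availability` is not shown in this notebook. As a rough illustration of what such a check might involve, the sketch below queries PyTorch's CUDA API directly. The function name `describe_gpus` and the shape of the returned dictionary are assumptions made for this example.

```python
def describe_gpus():
    """Return a small summary of the CUDA devices visible to PyTorch.

    Illustrative only: the project's own gpu_check module may report
    different fields.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query.
        return {"cuda_available": False, "devices": []}

    if not torch.cuda.is_available():
        return {"cuda_available": False, "devices": []}

    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append(
            {
                "index": index,
                "name": props.name,
                # total_memory is reported in bytes; convert to GiB.
                "total_memory_gib": round(props.total_memory / 1024**3, 1),
            }
        )
    return {"cuda_available": True, "devices": devices}


gpu_summary = describe_gpus()
print(gpu_summary)
```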

## 3. Load Pre-trained Model (Demo)

We'll use a small model (GPT-2) for quick testing.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Use a small model for demo
model_name = "gpt2"  # Change to "meta-llama/Llama-3.2-3B-Instruct" for full model

print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

print("✅ Model loaded!")

## 4. Test Inference

In [None]:
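def generate_text(prompt, model=model, tokenizer=tokenizer, max_new_tokens=50):
    """Generate a sampled completion for `prompt`.

    `model` and `tokenizer` default to the base model loaded above, so a
    fine-tuned pair can be passed in for comparison.
    """
    # Tokenize and move the input tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            # GPT-2 defines no pad token; reuse EOS to avoid a warning
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode the full sequence (prompt plus continuation)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)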

# Test generation
prompt = "Write a Python function to calculate fibonacci numbers:"
result = generate_text(prompt)

print("📝 Generated text:")
print(result)
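
The `temperature=0.7` and `top_p=0.9` settings control how the next token is sampled. The sketch below illustrates the idea in plain Python on a toy list of logits. It is a conceptual model only, not the code path `transformers` uses internally, and `top_p_filter` is a name made up for this example.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p sampling."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [value / temperature for value in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {index: probs[index] / mass for index in kept}


nucleus = top_p_filter([2.0, 1.0, 0.5, -1.0])
print(nucleus)
```

The next token is then drawn only from the kept set, which is why lowering `top_p` or `temperature` tends to make the output more conservative.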

## 5. Configuration for Fine-Tuning

In [None]:
from src.utils.config_loader import ConfigLoader

# Load training config
config_loader = ConfigLoader()
config = config_loader.load()

print("📋 Training Configuration:")
print(f"  Model: {config['model']['name']}")
print(f"  Dataset: {config['dataset']['name']}")
print(f"  Batch Size: {config['training']['per_device_train_batch_size']}")
print(f"  Learning Rate: {config['training']['learning_rate']}")
print(f"  LoRA Rank: {config['lora']['r']}")
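
The prints above rely on a nested configuration with at least the keys shown. A small check such as the following can surface a missing key before it raises a `KeyError`. `REQUIRED_KEYS` and `missing_config_keys` are hypothetical helpers, and the numeric values in `example_config` are placeholders rather than the project's actual settings.

```python
# Keys read by the summary cell above, grouped by config section.
REQUIRED_KEYS = {
    "model": ["name"],
    "dataset": ["name"],
    "training": ["per_device_train_batch_size", "learning_rate"],
    "lora": ["r"],
}


def missing_config_keys(config):
    """Return dotted names of required entries that are absent from `config`."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or malformed.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


# Placeholder values for illustration only.
example_config = {
    "model": {"name": "gpt2"},
    "dataset": {"name": "iamtarun/python_code_instructions_18k_alpaca"},
    "training": {"per_device_train_batch_size": 4, "learning_rate": 2e-4},
    "lora": {"r": 16},
}

print(missing_config_keys(example_config))
```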

## 6. Dataset Loading Demo

In [None]:
from src.training.dataset import DatasetLoader

# Load dataset (small subset for demo)
dataset_loader = DatasetLoader(
    dataset_name="iamtarun/python_code_instructions_18k_alpaca",
    tokenizer=tokenizer,
    max_seq_length=512,  # Smaller for demo
    train_split=0.99,  # Use most for training
    eval_split=0.01,
)

datasets = dataset_loader.load()

print("📊 Dataset Statistics:")
print(f"  Train samples: {len(datasets['train'])}")
print(f"  Eval samples: {len(datasets['eval'])}")
print("\n📝 Sample data:")
print(datasets['train'][0])
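
How `DatasetLoader` turns each record into training text is not shown here. Alpaca-style datasets conventionally carry `instruction`, `input`, and `output` fields, so one plausible formatting step looks like the sketch below. Both the field names and the prompt template are assumptions, and `format_alpaca_example` is a hypothetical function.

```python
def format_alpaca_example(example):
    """Build a single training string from an Alpaca-style record.

    Assumes the record has `instruction` and `output` keys and an optional
    `input` key; the project's DatasetLoader may use a different template.
    """
    instruction = example["instruction"].strip()
    context = example.get("input", "").strip()
    response = example["output"].strip()

    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}"
        )
    # Omit the Input section when no extra context is provided.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


sample = {
    "instruction": "Write a function that adds two numbers.",
    "input": "",
    "output": "def add(a, b):\n    return a + b",
}
formatted = format_alpaca_example(sample)
print(formatted)
```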

## 7. Memory Profiling

In [None]:
from src.utils.model_utils import print_model_memory

print_model_memory()
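
As a sanity check on the numbers reported above, the memory taken by the weights alone can be estimated from the parameter count and the dtype width. This is a back-of-the-envelope sketch: `estimate_weight_memory_gib` is a hypothetical helper, and it ignores activations, the KV cache, gradients, and optimizer state.

```python
# Bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_gib(num_params, dtype="float16"):
    """Estimate the memory needed to hold model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Approximate parameter count for the `gpt2` checkpoint.
gpt2_params = 124_000_000
print(f"gpt2 in float16: {estimate_weight_memory_gib(gpt2_params):.2f} GiB")
```

For `gpt2` in float16 this comes to roughly 0.23 GiB. During fine-tuning the real footprint is considerably higher, which is one reason LoRA is used to limit the number of trainable parameters.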

## 8. Start Training (Commented Out)

Uncomment to start training. This will take 2-3 hours.

In [None]:
# from src.training.train import main

# # Start training
# main(config_path="../configs/training_config.yaml")

## 9. Load Fine-Tuned Model

In [None]:
# After training completes, load the fine-tuned model

# fine_tuned_model_path = "../checkpoints/final_model"

# tokenizer_ft = AutoTokenizer.from_pretrained(fine_tuned_model_path)
# model_ft = AutoModelForCausalLM.from_pretrained(
#     fine_tuned_model_path,
#     torch_dtype=torch.float16,
#     device_map="auto",
# )

# print("✅ Fine-tuned model loaded!")

## 10. Compare Before/After

In [None]:
# test_prompt = "Write a Python function to reverse a string:"

# print("🔵 Base Model:")
# print(generate_text(test_prompt, model, tokenizer))
# print()

# print("🟢 Fine-Tuned Model:")
# print(generate_text(test_prompt, model_ft, tokenizer_ft))

## 11. Benchmark Performance

In [None]:
# from src.evaluation.benchmark import quick_benchmark

# # Benchmark fine-tuned model
# result = quick_benchmark(
#     model_path="../checkpoints/final_model",
#     device="cuda",
#     num_runs=50,
# )
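
The internals of `quick_benchmark` are not shown in this notebook. A generic latency benchmark usually follows the pattern sketched below: a few warm-up calls, then timed runs summarized by mean, median, and a high percentile. `time_callable` is a hypothetical helper, not the project's implementation.

```python
import statistics
import time


def time_callable(fn, num_runs=50, warmup=5):
    """Time `fn` over `num_runs` calls and return latency statistics in ms."""
    # Warm-up calls absorb one-time costs such as caching and lazy initialization.
    for _ in range(warmup):
        fn()

    latencies_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)

    latencies_ms.sort()
    # Index of the 95th-percentile sample, clamped to the last element.
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "runs": num_runs,
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
    }


stats = time_callable(lambda: sum(range(1000)), num_runs=10, warmup=2)
print(stats)
```

When timing GPU work, the callable should end with `torch.cuda.synchronize()`; CUDA kernels launch asynchronously, so without it the timer may stop before the work has finished.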

## 12. Export to ONNX

In [None]:
# from src.export.onnx_export import export_model_to_onnx

# # Export to ONNX with optimization and quantization
# export_model_to_onnx(
#     model_path="../checkpoints/final_model",
#     output_path="../models/onnx_model",
#     optimize=True,
#     quantize=True,
# )

## 13. Summary & Next Steps

You've completed the demo! 🎉

**Next Steps:**
1. Run full training with a larger model (Llama 3.2 3B)
2. Export to ONNX for deployment
3. Benchmark GPU vs NPU performance
4. Deploy on Snapdragon X Elite

**Resources:**
- Training script: `python src/training/train.py`
- Config: `configs/training_config.yaml`
- Documentation: `README.md`