![Thinkube AI Lab](../icons/tk_full_logo.svg)

# Model Evaluation and Deployment 🚀

This notebook walks through evaluating and deploying a fine-tuned model:
- Load the fine-tuned model
- Compute evaluation metrics
- Compare against the base model
- Quantize for deployment
- Deploy with vLLM
- Optimize inference

## Evaluation is Critical

Before deploying, check the model on each of these fronts:

- **Quantitative Metrics**: Perplexity, accuracy, F1
- **Qualitative Review**: Human evaluation of sample outputs
- **Comparison**: Against the base model and standard benchmarks
- **Safety**: Check for biased, harmful, or factually wrong outputs
- **Performance**: Latency, throughput, and memory usage

## Load Fine-Tuned Model

In [None]:
# Load your fine-tuned model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# TODO: Load fine-tuned model
# TODO: Load tokenizer
# TODO: Move to GPU
# TODO: Set to eval mode
# TODO: Display model info
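
One way to fill in the steps above is sketched here. It assumes the training run saved merged, full weights; a run that saved only LoRA adapters would likely need the `peft` library instead. The directory `./outputs/finetuned-model` and the helper names are hypothetical placeholders. The `torch` and `transformers` imports sit inside the loader only so that the helpers can be defined before those packages are installed.

```python
def summarize_parameters(model):
    """Return (total, trainable) parameter counts for a PyTorch-style model."""
    total = 0
    trainable = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return total, trainable


def load_finetuned_model(model_dir):
    """Load a causal LM and tokenizer, move the model to GPU if present, set eval mode."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()  # turns off dropout and other training-only behaviour
    return model, tokenizer, device


# Example usage (hypothetical path; uncomment once a model is saved there):
# model, tokenizer, device = load_finetuned_model("./outputs/finetuned-model")
# total, trainable = summarize_parameters(model)
# print(f"{total / 1e9:.2f}B parameters on {device} ({trainable:,} trainable)")
```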

## Evaluation Metrics

Quantitative assessment:

In [None]:
# Calculate metrics

# TODO: Calculate perplexity on test set
# TODO: Measure accuracy (if applicable)
# TODO: Calculate BLEU/ROUGE for generation
# TODO: Measure inference speed
# TODO: Display metrics comparison
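
As a minimal sketch of the first two metrics, the helpers below compute perplexity and exact-match accuracy from plain Python lists. They assume you have already collected per-token negative log-likelihoods (natural log) from the model on the test set, for example from its cross-entropy loss. The function names are illustrative, not part of any library.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_nlls = [math.log(4)] * 10
print(f"perplexity: {perplexity(uniform_nlls):.2f}")

# Toy strings only, to show the call shape.
print(exact_match_accuracy(["Paris", "4 ", "blue"], ["Paris", "4", "red"]))
```

Lower perplexity means the model finds the test text less surprising. Compare the fine-tuned and base models on the same test set and tokenizer, since perplexities computed over different tokenizations are not comparable.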

## Generate Sample Outputs

Qualitative evaluation:

In [None]:
# Test generation quality

# TODO: Create diverse test prompts
# TODO: Configure generation parameters (temperature, top_p)
# TODO: Generate responses
# TODO: Display outputs
# TODO: Evaluate quality manually
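
To make the `temperature` and `top_p` parameters concrete, here is a small pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering. It is a teaching sketch of the idea, not the code any library runs; with `transformers` you would normally just pass `do_sample=True, temperature=..., top_p=...` to `model.generate`. The logits are made-up toy values.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # toy values for three candidate tokens
probs = softmax_with_temperature(toy_logits)
print(top_p_filter(probs, top_p=0.9))  # token 2 falls outside the nucleus
print(sample_token(toy_logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Lower temperatures and smaller `top_p` values make outputs more conservative and repeatable; higher values increase diversity at the cost of more off-target text.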

## Compare with Base Model

Side-by-side comparison:

In [None]:
# Base vs fine-tuned comparison

# TODO: Load base model
# TODO: Generate with both models
# TODO: Compare outputs
# TODO: Highlight improvements
# TODO: Display comparison table
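
For the comparison table, one lightweight option is to render a Markdown table from the prompts and the two sets of outputs. The sketch below assumes the generations are already collected into lists of strings; the helper names are hypothetical and the example rows are placeholders, not real model output.

```python
def _cell(text):
    """Make text safe for a single Markdown table cell."""
    return text.replace("|", "\\|").replace("\n", " ").strip()


def comparison_table(prompts, base_outputs, tuned_outputs):
    """Build a Markdown table with one row per prompt."""
    if not (len(prompts) == len(base_outputs) == len(tuned_outputs)):
        raise ValueError("all three lists must have the same length")
    lines = [
        "| Prompt | Base model | Fine-tuned model |",
        "| --- | --- | --- |",
    ]
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        lines.append(f"| {_cell(prompt)} | {_cell(base)} | {_cell(tuned)} |")
    return "\n".join(lines)


table = comparison_table(
    ["(prompt 1)", "(prompt 2)"],
    ["(base output 1)", "(base output 2)"],
    ["(fine-tuned output 1)", "(fine-tuned output 2)"],
)
print(table)

# In a notebook, the table can be rendered instead of printed:
# from IPython.display import Markdown, display
# display(Markdown(table))
```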

## Quantize for Deployment

Reduce model size:

In [None]:
# Quantization for production

# TODO: Quantize to INT8 or INT4
# TODO: Measure size reduction
# TODO: Benchmark inference speed
# TODO: Verify quality maintained
# TODO: Save quantized model

## Export to Different Formats

Export the model in formats suited to different deployment targets:

In [None]:
# Export model

# TODO: Save as safetensors
# TODO: Export to GGUF (for llama.cpp)
# TODO: Export to ONNX (optional)
# TODO: Create model card
# TODO: Display export locations
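
For the model card step, a minimal sketch is to write a `README.md` whose YAML front matter carries the metadata fields commonly used on the Hugging Face Hub (`license`, `base_model`, `tags`). Every value below (model name, base model id, license, tags) is a hypothetical placeholder to replace with your own.

```python
def render_model_card(name, base_model, license_id, tags, description):
    """Return README.md text with YAML front matter followed by a short description."""
    front_matter = [f"license: {license_id}", f"base_model: {base_model}"]
    if tags:
        front_matter.append("tags:")
        front_matter.extend(f"- {tag}" for tag in tags)
    header = "\n".join(front_matter)
    return f"---\n{header}\n---\n\n# {name}\n\n{description}\n"


card = render_model_card(
    name="my-finetuned-model",          # hypothetical name
    base_model="your-org/base-model",   # hypothetical base model id
    license_id="apache-2.0",            # assumption: use your model's actual license
    tags=["fine-tuned", "qlora"],
    description="Fine-tuned for (describe task). Evaluated on (describe test set).",
)
print(card)

# Saving alongside safetensors weights (assumes `model` and `tokenizer` are loaded):
# from pathlib import Path
# export_dir = Path("./export/my-finetuned-model")
# model.save_pretrained(export_dir, safe_serialization=True)
# tokenizer.save_pretrained(export_dir)
# (export_dir / "README.md").write_text(card, encoding="utf-8")
```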

## Deploy with vLLM

High-performance inference:

In [None]:
# vLLM deployment

# TODO: Show vLLM server configuration
# TODO: Load model with vLLM
# TODO: Configure batch size and parallelism
# TODO: Test inference speed
# TODO: Compare with vanilla transformers
# TODO: Display throughput improvement
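
A possible starting point for the server configuration is sketched below. It assumes a recent vLLM release that ships the `vllm serve` command and its OpenAI-compatible HTTP server; the model path, port, and tuning values are hypothetical and should be adjusted to your hardware. The code only builds the command line and a request body, so it runs without vLLM installed.

```python
import json
import shlex


def vllm_serve_command(model_path, port=8000, tensor_parallel_size=1,
                       max_num_seqs=64, gpu_memory_utilization=0.90):
    """Build the argument list for launching the vLLM OpenAI-compatible server."""
    return [
        "vllm", "serve", model_path,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),      # GPUs to shard across
        "--max-num-seqs", str(max_num_seqs),                      # max sequences per batch
        "--gpu-memory-utilization", str(gpu_memory_utilization),  # fraction of GPU memory to use
    ]


def completion_request(model_path, prompt, max_tokens=128, temperature=0.7):
    """JSON body for POST /v1/completions on the running server."""
    return {
        "model": model_path,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


cmd = vllm_serve_command("./outputs/finetuned-model")  # hypothetical path
print(shlex.join(cmd))
print(json.dumps(completion_request("./outputs/finetuned-model", "Hello"), indent=2))
```

Launch the printed command in a terminal, then send the JSON body to `http://localhost:8000/v1/completions` with any HTTP client to test the server.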

## Inference Optimization

Maximize performance:

In [None]:
# Optimization techniques

# TODO: Enable KV cache
# TODO: Use Flash Attention
# TODO: Batch inference
# TODO: Continuous batching
# TODO: Measure latency and throughput
# TODO: Display performance metrics
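
Several of these optimizations are switches in the serving framework rather than code you write: with `transformers`, for instance, the KV cache is controlled by `use_cache` and Flash Attention by `attn_implementation="flash_attention_2"` at load time, while vLLM handles continuous batching internally. For plain static batching, one technique you can apply yourself is grouping prompts of similar length so less compute is spent on padding. The sketch below illustrates that with toy token counts; the helper names are hypothetical.

```python
def make_batches(lengths, batch_size):
    """Split a list of sequence lengths into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_tokens(batches):
    """Padding needed when each sequence is padded to its batch's longest sequence."""
    return sum(max(batch) - length for batch in batches for length in batch)


def length_sorted_batches(lengths, batch_size):
    """Sort by length first so each batch holds similarly sized sequences."""
    return make_batches(sorted(lengths), batch_size)


prompt_lengths = [5, 100, 6, 98]  # toy token counts
naive = make_batches(prompt_lengths, batch_size=2)
by_length = length_sorted_batches(prompt_lengths, batch_size=2)
print(padding_tokens(naive))      # 187 padding tokens
print(padding_tokens(by_length))  # 3 padding tokens
```

Sorting changes the order in which results come back, so keep the original indices if callers expect outputs in request order.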

## Benchmark Inference

Production readiness test:

In [None]:
# Benchmark production inference

# TODO: Test various input lengths
# TODO: Measure p50, p95, p99 latencies
# TODO: Test concurrent requests
# TODO: Monitor GPU utilization
# TODO: Calculate tokens/second
# TODO: Display benchmark results
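
The latency percentiles and tokens-per-second figures can be computed with the standard library alone. The sketch below uses the nearest-rank definition of a percentile and times a caller-supplied `generate_fn`, a hypothetical callable that takes a prompt and returns the number of tokens it generated. The stand-in function at the bottom only exists to show the call shape; swap in a real call to your model or server. Requests are issued one at a time, so concurrency testing would need a separate harness.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def benchmark(generate_fn, prompts):
    """Time generate_fn(prompt) -> generated token count, sequentially."""
    latencies = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate_fn(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompt):
    """Stand-in for a real model call; pretends one token per input word."""
    return len(prompt.split())


# Vary input length, as the benchmark plan above suggests.
prompts = ["short prompt", "a somewhat longer prompt " * 10, "long " * 500]
print(benchmark(fake_generate, prompts))
```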

## Deploy to Kubernetes

Production deployment:

In [None]:
# Kubernetes deployment spec

# TODO: Show Deployment manifest
# TODO: Configure GPU resources
# TODO: Setup Service and Ingress
# TODO: Add health checks
# TODO: Configure autoscaling
# TODO: Display deployment guide
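
As a rough sketch of the Deployment manifest, the function below builds it as a Python dictionary and prints JSON, which `kubectl apply -f` accepts alongside YAML. Several things are assumptions to verify for your cluster: the `vllm/vllm-openai` image and its `--model` argument, the `/health` endpoint on port 8000, the NVIDIA device plugin that exposes `nvidia.com/gpu`, and the hypothetical `/models/finetuned` path (the volume that would hold it is omitted). Service, Ingress, and autoscaling objects are also left out for brevity.

```python
import json


def vllm_deployment(name, image, model_path, gpus=1, replicas=1, port=8000):
    """Return a Kubernetes apps/v1 Deployment manifest as a dict."""
    labels = {"app": name}
    probe = {"httpGet": {"path": "/health", "port": port}}
    container = {
        "name": name,
        "image": image,
        "args": ["--model", model_path, "--port", str(port)],
        "ports": [{"containerPort": port}],
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
        # Generous delays because loading model weights can take minutes.
        "readinessProbe": {**probe, "initialDelaySeconds": 60, "periodSeconds": 10},
        "livenessProbe": {**probe, "initialDelaySeconds": 120, "periodSeconds": 30},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = vllm_deployment(
    name="finetuned-llm",              # hypothetical name
    image="vllm/vllm-openai:latest",   # assumption: pin a specific tag in production
    model_path="/models/finetuned",    # hypothetical path inside the container
)
print(json.dumps(manifest, indent=2))
# Save the output to deployment.json, then: kubectl apply -f deployment.json
```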

## Monitor in Production

Track model performance:

In [None]:
# Production monitoring

# TODO: Setup Langfuse tracking
# TODO: Log all requests/responses
# TODO: Track latency and errors
# TODO: Monitor costs
# TODO: Setup alerts
# TODO: Display monitoring dashboard
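
Before wiring up a tracing tool, it can help to decide which fields to record and which thresholds should raise an alert. The sketch below is a library-free stand-in that shows those pieces. It is not the Langfuse API, and the field names and thresholds are hypothetical choices to adapt.

```python
import json
import time


def log_record(prompt, response, latency_s, error=None, model_version="v1"):
    """One request/response record, ready to be written as a JSON line."""
    return {
        "timestamp": time.time(),
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
        "latency_s": latency_s,
        "error": error,
    }


def summarize(records):
    """Aggregate request count, error rate, and mean latency of successful requests."""
    if not records:
        raise ValueError("need at least one record")
    errors = sum(r["error"] is not None for r in records)
    ok_latencies = [r["latency_s"] for r in records if r["error"] is None]
    mean_latency = sum(ok_latencies) / len(ok_latencies) if ok_latencies else None
    return {
        "requests": len(records),
        "error_rate": errors / len(records),
        "mean_latency_s": mean_latency,
    }


def alerts(summary, max_error_rate=0.05, max_mean_latency_s=2.0):
    """Return human-readable alert messages for any threshold that is exceeded."""
    messages = []
    if summary["error_rate"] > max_error_rate:
        messages.append(f"error rate {summary['error_rate']:.1%} exceeds {max_error_rate:.1%}")
    latency = summary["mean_latency_s"]
    if latency is not None and latency > max_mean_latency_s:
        messages.append(f"mean latency {latency:.2f}s exceeds {max_mean_latency_s:.2f}s")
    return messages


# Placeholder records to show the call shape.
records = [
    log_record("(prompt)", "(response)", latency_s=0.8),
    log_record("(prompt)", None, latency_s=5.0, error="timeout"),
]
print(json.dumps(records[0]))
print(alerts(summarize(records)))
```

Tagging each record with a model version makes it possible to compare behaviour before and after a rollout, which supports the rollback plan in the checklist below.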

## Best Practices

- ✅ Evaluate thoroughly before deployment
- ✅ Test with diverse inputs
- ✅ Compare quantized vs full precision
- ✅ Benchmark under realistic load
- ✅ Monitor production performance
- ✅ Version all models
- ✅ Keep rollback plan ready
- ✅ Document model behavior
- ✅ Collect user feedback
- ✅ Plan for model updates

## Deployment Checklist

- [ ] Model evaluated on test set
- [ ] Compared with base model
- [ ] Manual quality review done
- [ ] Quantization tested
- [ ] Inference optimized
- [ ] Benchmarked under load
- [ ] Monitoring configured
- [ ] Documentation complete
- [ ] Rollback plan ready
- [ ] Stakeholders informed

## Congratulations!

You've completed the fine-tuning track!

### What you've learned:
- Unsloth for efficient fine-tuning
- QLoRA for memory-efficient training
- Dataset preparation best practices
- Model evaluation and deployment

### Next steps:
- Explore **agent-dev/** for multi-agent systems
- Check **ml-gpu/** for advanced GPU training
- Build production applications!