## This notebook requires a GPU

This lab must be run in Google Colab in order to use GPU acceleration for model training. Click the button below to open this notebook in Colab, then set your runtime to GPU:

**Runtime > Change Runtime Type > T4 GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/coursera-msds-public/blob/main/notebooks/3_llm_classification_course.ipynb)

# Advanced LLM Classification with vLLM and Unsloth Fine-tuning

Welcome to the LLM Classification course! This notebook walks through classifying text with large language models: inspecting a model's architecture, running inference with vLLM, fine-tuning with Unsloth and LoRA, and evaluating and deploying the result.

## What's Covered in This Course:
- **Modern LLM Architectures**: Understanding transformers, attention mechanisms, and scaling laws
- **Efficient Inference**: Using vLLM for high-throughput, low-latency inference
- **Advanced Fine-tuning**: Parameter-efficient fine-tuning with Unsloth
- **Classification Tasks**: Multi-class, multi-label, and hierarchical classification
- **Optimization Techniques**: Quantization, pruning, and distillation
- **Production Deployment**: Serving LLMs at scale
- **Evaluation & Benchmarking**: Comprehensive model evaluation
- **Ethical Considerations**: Bias detection and mitigation

## Learning Objectives:
1. Master modern LLM architectures and training techniques
2. Implement efficient inference pipelines with vLLM
3. Apply advanced fine-tuning methods with Unsloth
4. Deploy and serve LLMs in production environments
5. Optimize models for performance and efficiency
6. Evaluate and benchmark LLM classification systems
7. Address ethical considerations in LLM deployment

In [None]:
# Install required packages
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install vllm unsloth transformers datasets accelerate peft trl
!pip install bitsandbytes scipy wandb huggingface-hub
!pip install scikit-learn pandas numpy matplotlib seaborn plotly
!pip install fastapi uvicorn gradio

# For advanced features
!pip install flash-attn --no-build-isolation
!pip install auto-gptq optimum

In [None]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    BitsAndBytesConfig, pipeline
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import vllm
from vllm import LLM, SamplingParams
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## Understanding Modern LLM Architectures

Let's explore the fundamental components of modern large language models.
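
One of those components is causal self-attention, the operation at the heart of decoder-only models. The block below is a minimal pure-Python sketch of scaled dot-product attention with a causal mask. The toy vectors are made up for illustration, and real models add learned projections, multiple heads, and batched tensor math.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the query against the current and earlier positions only (causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w * values[j][c] for j, w in enumerate(weights))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: 3 positions, 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(x, x, x)
for position, vector in enumerate(attended):
    print(position, [round(v, 3) for v in vector])
```

The first position can only attend to itself, so its output equals its own value vector. Later positions blend the vectors that precede them.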

In [None]:
# Load a modern LLM and examine its architecture
model_name = "microsoft/DialoGPT-medium"  # Start with a smaller model for demonstration

print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"Model type: {type(model)}")
print(f"Number of parameters: {model.num_parameters():,}")
print(f"Model config: {model.config}")

# Examine model architecture
print("\nModel Architecture:")
for name, module in model.named_modules():
    if len(name.split('.')) <= 2:  # Show modules up to two levels deep
        print(f"  {name}: {type(module).__name__}")

## Efficient Inference with vLLM

Learn how to use vLLM for high-throughput, low-latency inference.
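
A generative model returns free-form text, so a classification pipeline built on generation needs a step that maps each response back to a label. The helper below is a hypothetical sketch of that step (the name `parse_label` and its defaults are assumptions, not part of vLLM). It picks whichever label string appears first in the response.

```python
def parse_label(generated_text, labels=("positive", "negative"), default="unknown"):
    """Return the label that appears earliest in the text, or `default` if none does."""
    lowered = generated_text.lower()
    best_label, best_pos = default, len(lowered) + 1
    for label in labels:
        pos = lowered.find(label)
        if pos != -1 and pos < best_pos:
            best_label, best_pos = label, pos
    return best_label

print(parse_label("Positive. The reviewer clearly liked it."))
print(parse_label("Hard to say."))
```

Plain substring matching is crude: a response such as "not positive" would still be read as positive. Constraining the output format in the prompt, or scoring label tokens directly, is usually more robust.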

In [None]:
# Initialize vLLM for efficient inference
def setup_vllm(model_name, quantization="awq", max_model_len=None):
    """Set up vLLM with an optimized configuration."""

    # vLLM configuration
    llm = LLM(
        model=model_name,
        quantization=quantization,
        tensor_parallel_size=1,  # Adjust based on available GPUs
        gpu_memory_utilization=0.9,
        # None lets vLLM derive the limit from the model config; a fixed 4096
        # exceeds DialoGPT's 1024-token context and makes vLLM raise an error
        max_model_len=max_model_len,
        enforce_eager=False,  # Use CUDA graphs for better performance
    )

    return llm

# For demonstration, we'll use a smaller model that works well with vLLM
demo_model = "microsoft/DialoGPT-small"

try:
    print("Setting up vLLM...")
    llm = setup_vllm(demo_model, quantization=None)  # No quantization for this demo
    print("vLLM setup complete!")

    # Test inference
    prompts = [
        "Classify this product review as positive or negative: 'This product exceeded my expectations!'",
        "Analyze the sentiment: 'I'm disappointed with the quality.'"
    ]

    sampling_params = SamplingParams(
        temperature=0.1,
        max_tokens=100,
        stop=["\n", "\r"]
    )

    outputs = llm.generate(prompts, sampling_params)

    for i, output in enumerate(outputs):
        print(f"\nPrompt {i+1}: {prompts[i]}")
        print(f"Response: {output.outputs[0].text.strip()}")

except Exception as e:
    print(f"vLLM setup failed (expected in some environments): {e}")
    print("Continuing with standard transformers inference...")

## Data Preparation for Classification

Prepare datasets for LLM classification tasks.
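
Before fine-tuning on a subset, it is worth checking that every class is actually present in it, since a slice of a label-ordered dataset can contain just one class. The helpers below are a small illustrative sketch using only the standard library. The function names are assumptions made for this example.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction of examples} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

def missing_classes(labels, expected=(0, 1)):
    """List the expected classes that never occur in `labels`."""
    present = set(labels)
    return [c for c in expected if c not in present]

sample_labels = [0, 0, 0, 1]  # made-up labels for illustration
print(label_distribution(sample_labels))
print(missing_classes([0, 0, 0]))
```

With a Hugging Face dataset, the same check could be run on a label column such as `train_dataset["label"]`.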

In [None]:
# Load and prepare classification dataset
def load_classification_data():
    """Load a text classification dataset"""

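    # The IMDB train split is ordered by label, so a leading slice such as
    # "train[:5%]" contains a single class; shuffle before taking the subset
    dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1250))  # ~5% for demo

    # IMDB is already binary; add a human-readable label column (positive/negative)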
    def preprocess_function(examples):
        return {
            "text": examples["text"],
            "label": examples["label"],
            "label_text": "positive" if examples["label"] == 1 else "negative"
        }

    dataset = dataset.map(preprocess_function)
    return dataset

print("Loading classification dataset...")
train_dataset = load_classification_data()
print(f"Dataset size: {len(train_dataset)}")
print("Sample:")
print(train_dataset[0])

## Advanced Fine-tuning with Unsloth

Learn parameter-efficient fine-tuning using Unsloth.
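
To see why LoRA counts as parameter-efficient, it helps to count what it trains. For a frozen weight matrix of shape `d_out x d_in`, LoRA learns two low-rank factors of shapes `r x d_in` and `d_out x r`, which adds `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic. The 4096 x 4096 layer size is a hypothetical example, not a value read from any model.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)

def trainable_fraction(d_in, d_out, r):
    """Added LoRA parameters relative to the size of the frozen full matrix."""
    return lora_added_params(d_in, d_out, r) / (d_in * d_out)

# Hypothetical 4096 x 4096 projection with LoRA rank 16.
added = lora_added_params(4096, 4096, 16)
fraction = trainable_fraction(4096, 4096, 16)
print(f"Added parameters: {added:,} ({fraction:.2%} of the full matrix)")
```

Doubling the rank doubles the number of added parameters, so `r` trades adapter capacity against memory and compute.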

In [None]:
# Unsloth setup for efficient fine-tuning
try:
    from unsloth import FastLanguageModel

    print("Setting up Unsloth for efficient fine-tuning...")

    # Load model with Unsloth
    model_name = "unsloth/mistral-7b-bnb-4bit"  # Quantized model for efficiency

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True,
    )

    print("Model loaded with Unsloth!")
    print(f"Model type: {type(model)}")

    # Add LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,  # LoRA rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0,  # Any value is supported; 0 uses Unsloth's optimized path
        bias="none",    # Any value is supported; "none" uses Unsloth's optimized path
        use_gradient_checkpointing=True,
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )

    print("LoRA adapters added!")

except ImportError:
    print("Unsloth not available. Using standard PEFT instead...")

    # Fallback to standard transformers + PEFT
    from peft import LoraConfig, get_peft_model

    model_name = "microsoft/DialoGPT-medium"
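    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        # Passing load_in_8bit=True directly is deprecated; use a config object
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto"
    )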
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Prepare model for training
    model = prepare_model_for_kbit_training(model)

    # Add LoRA
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["c_attn", "c_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, config)

    print("Standard PEFT setup complete!")

In [None]:
# Prepare data for instruction tuning
def format_instruction_example(example):
    """Format examples for instruction tuning"""
    instruction = "Classify the sentiment of this movie review as positive or negative."

    formatted_text = f"""### Instruction:
{instruction}

### Input:
{example['text']}

### Response:
{example['label_text']}
"""

    return {"text": formatted_text, "label": example["label"]}

print("Formatting dataset for instruction tuning...")
formatted_dataset = train_dataset.map(format_instruction_example)
print("Sample formatted example:")
print(formatted_dataset[0]["text"][:500] + "...")

In [None]:
# Training setup
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,  # Ignored here because max_steps below takes precedence
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=50,  # Very short for demo
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=20,
    eval_strategy="steps",
    load_best_model_at_end=True,
)

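# Trainer expects tokenized inputs, so convert the formatted text to token IDs first
from transformers import DataCollatorForLanguageModeling

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers define no pad token

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized_dataset = formatted_dataset.map(
    tokenize_function, remove_columns=formatted_dataset.column_names
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Demo only: this evaluates on training examples; use a held-out split in practice
    eval_dataset=tokenized_dataset.select(range(min(50, len(tokenized_dataset)))),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

print("Trainer initialized.")
# trainer.train()  # Commented out for demo
print("Training skipped for this demo; uncomment trainer.train() to run it.")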

## Quantization and Optimization

Learn advanced model optimization techniques.
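
The main payoff of quantization is a smaller weight footprint, which follows directly from the number of bits stored per parameter. The sketch below estimates that footprint. The 7-billion-parameter count is a hypothetical round number, and the estimate ignores quantization overhead such as scaling factors, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), ignoring any overhead."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model
fp16_gb = weight_memory_gb(num_params, 16)
int8_gb = weight_memory_gb(num_params, 8)
int4_gb = weight_memory_gb(num_params, 4)

print(f"fp16:  {fp16_gb:.1f} GB")
print(f"8-bit: {int8_gb:.1f} GB")
print(f"4-bit: {int4_gb:.1f} GB ({int4_gb / fp16_gb:.0%} of fp16)")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of 16-bit weights, which is where the commonly quoted 75% reduction comes from.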

In [None]:
# Model quantization
def quantize_model(model, method="4bit"):
    """Reload a model with quantized weights to reduce memory use"""
    # Import at function level so both branches can use it
    from transformers import BitsAndBytesConfig

    if method == "4bit":
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4"
        )

    elif method == "8bit":
        quantization_config = BitsAndBytesConfig(load_in_8bit=True)

    else:
        # Previously an unknown method fell through and hit an unbound variable
        raise ValueError(f"Unsupported quantization method: {method!r}")

    # Reload the checkpoint from its original path with the chosen config
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model.config.name_or_path,
        quantization_config=quantization_config,
        device_map="auto"
    )

    return quantized_model

print("Demonstrating model quantization...")
# quantized_model = quantize_model(model, method="4bit")
print("Quantization example: 4-bit weights take roughly 25% of the memory of 16-bit weights")

# Parameter count of the currently loaded model; quantization changes the
# bits stored per weight, not the number of weights
original_params = sum(p.numel() for p in model.parameters())
print(f"Current model parameters: {original_params:,}")
print("A 4-bit copy would cut weight memory by about 75% relative to fp16")

## Advanced Classification Techniques

Implement sophisticated classification approaches with LLMs.
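
One such approach is few-shot prompting, where a handful of labeled examples are placed in the prompt ahead of the text to classify. The helper below is a hypothetical sketch of how such a prompt might be assembled. The function name, the template wording, and the example reviews are all assumptions made for illustration.

```python
def build_few_shot_prompt(examples, query, labels=("positive", "negative")):
    """Assemble a few-shot classification prompt from (text, label) pairs."""
    lines = [f"Classify each review as {' or '.join(labels)}.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # Leave the final answer slot empty for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

demo_examples = [
    ("A wonderful, moving film.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
]
prompt = build_few_shot_prompt(demo_examples, "I would watch it again.")
print(prompt)
```

Ending the prompt on the open `Sentiment:` slot encourages the model to continue with a single label, which keeps the response easy to parse.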

In [None]:
# Zero-shot classification
def setup_classification_pipeline(model_name="facebook/bart-large-mnli"):
    """Set up a zero-shot classification pipeline backed by an NLI model"""

    # The zero-shot pipeline needs a model trained on natural language inference
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model=model_name,  # Previously hardcoded, which ignored the argument
        device=0 if torch.cuda.is_available() else -1
    )

    return classifier

# Setup classification
try:
    classifier = setup_classification_pipeline()

    # Test classification
    test_texts = [
        "This movie was absolutely fantastic! The acting was superb.",
        "I hated this film. It was boring and poorly made.",
        "The product arrived quickly and works as expected."
    ]

    candidate_labels = ["positive", "negative", "neutral"]

    for text in test_texts:
        result = classifier(text, candidate_labels)
        print(f"\nText: {text}")
        print(f"Prediction: {result['labels'][0]} (confidence: {result['scores'][0]:.3f})")

except Exception as e:
    print(f"Classification pipeline setup failed: {e}")
    print("This is expected in some environments")

## Production Deployment

Learn how to deploy LLMs for production use.

In [None]:
# FastAPI deployment example
fastapi_code = '''
import torch  # Needed for the torch.cuda.is_available() check below
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="LLM Classification API")

# Load model at startup
@app.on_event("startup")
async def load_model():
    global classifier
    classifier = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",
        device=0 if torch.cuda.is_available() else -1
    )

class ClassificationRequest(BaseModel):
    text: str
    labels: list[str]

@app.post("/classify")
async def classify_text(request: ClassificationRequest):
    try:
        result = classifier(request.text, request.labels)
        return {
            "prediction": result["labels"][0],
            "confidence": result["scores"][0],
            "all_scores": dict(zip(result["labels"], result["scores"]))
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

# Save FastAPI code
import os
os.makedirs('scripts', exist_ok=True)  # open() fails if the directory is missing
with open('scripts/api_server.py', 'w') as f:
    f.write(fastapi_code)

print("FastAPI deployment code saved to scripts/api_server.py")
print("To run: python scripts/api_server.py")
print("API will be available at http://localhost:8000")
print("\nExample request:")
# Single quotes around the JSON body keep its double quotes intact in the shell
print("""curl -X POST "http://localhost:8000/classify" \\
     -H "Content-Type: application/json" \\
     -d '{"text": "This product is amazing!", "labels": ["positive", "negative"]}'""")

## Model Evaluation and Benchmarking

Comprehensive evaluation techniques for LLM classification.
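
Precision, recall, and F1 all derive from three counts: true positives, false positives, and false negatives. The sketch below computes them by hand for a binary task so the definitions are explicit. The label lists are made-up illustration data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0]  # made-up ground truth
y_pred = [1, 1, 0, 1, 0, 0]  # made-up predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Library implementations such as scikit-learn's add averaging modes across classes, but the per-class numbers reduce to these same counts.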

In [None]:
# Advanced evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, classification_report
)
import numpy as np

def evaluate_classification_model(predictions, true_labels, model_name="Model"):
    """Comprehensive model evaluation"""

    # Basic metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, support = precision_recall_fscore_support(
        true_labels, predictions, average='weighted'
    )

    # Per-class metrics
    per_class = precision_recall_fscore_support(
        true_labels, predictions, average=None
    )

    # Confusion matrix
    cm = confusion_matrix(true_labels, predictions)

    print(f"\n=== {model_name} Evaluation ===")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    print("\nPer-class metrics:")
    for i, (p, r, f) in enumerate(zip(per_class[0], per_class[1], per_class[2])):
        print(f"Class {i}: Precision={p:.4f}, Recall={r:.4f}, F1={f:.4f}")

    print("\nConfusion Matrix:")
    print(cm)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'confusion_matrix': cm
    }

# Example evaluation with dummy data
np.random.seed(42)
dummy_predictions = np.random.choice([0, 1], size=100)
dummy_true_labels = np.random.choice([0, 1], size=100)

results = evaluate_classification_model(
    dummy_predictions,
    dummy_true_labels,
    "Dummy LLM Classifier"
)

## Ethical Considerations and Bias Detection

Address ethical issues in LLM classification systems.
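
One common starting point is demographic parity, which compares how often each group receives the positive prediction. The sketch below computes the largest gap in positive-prediction rate between groups. The predictions and group labels are made up, and the function names are assumptions for this example.

```python
def positive_rate(predictions, positive=1):
    """Fraction of predictions equal to the positive class."""
    return sum(1 for p in predictions if p == positive) / len(predictions)

def demographic_parity_difference(predictions, groups, positive=1):
    """Return (largest gap in positive rate between groups, per-group rates)."""
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    rates = {g: positive_rate(preds, positive) for g, preds in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates

preds = [1, 1, 0, 1, 0, 0, 0, 1]                   # made-up predictions
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]  # made-up group membership
gap, rates = demographic_parity_difference(preds, groups)
print(f"Positive rates: {rates}, gap: {gap:.2f}")
```

A gap of 0 means all groups receive positive predictions at the same rate. This metric ignores the true labels, so it is usually read alongside accuracy-based measures such as equal opportunity.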

In [None]:
# Bias detection and mitigation
def analyze_model_bias(predictions, true_labels, demographic_data):
    """Analyze potential biases in model predictions"""

    bias_analysis = {}

    # Performance by demographic group
    for group_name, group_mask in demographic_data.items():
        group_predictions = predictions[group_mask]
        group_true = true_labels[group_mask]

        if len(group_predictions) > 0:
            group_accuracy = accuracy_score(group_true, group_predictions)
            bias_analysis[group_name] = {
                'accuracy': group_accuracy,
                'sample_size': len(group_predictions)
            }

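    # Fairness metrics: an ad hoc summary, 1 minus the coefficient of variation
    # of per-group accuracy (1.0 means identical accuracy across groups)
    accuracies = [metrics['accuracy'] for metrics in bias_analysis.values()]
    if accuracies and np.mean(accuracies) > 0:  # Avoid dividing by a zero mean
        fairness_score = 1 - (np.std(accuracies) / np.mean(accuracies))
        bias_analysis['fairness_score'] = fairness_score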

    return bias_analysis

print("Bias analysis framework implemented")
print("In practice, you would analyze:")
print("- Performance across demographic groups")
print("- Fairness metrics (demographic parity, equal opportunity)")
print("- Bias mitigation techniques (reweighting, adversarial training)")
print("- Regular audits and monitoring")

## Performance Benchmarking

Compare different models and configurations.
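
When comparing configurations, averages can hide slow outliers, so latency is often reported as percentiles such as p50 and p95. The sketch below times a callable per input and summarizes the results with a nearest-rank percentile. The `str.upper` call is a hypothetical stand-in for a real model call.

```python
import math
import time

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def measure_latencies(fn, inputs):
    """Call fn on each input and return the per-call latencies in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies

# str.upper is a placeholder for an inference call.
latencies = measure_latencies(str.upper, ["first text", "second text", "third text"])
print(f"p50: {percentile(latencies, 50):.6f}s, p95: {percentile(latencies, 95):.6f}s")
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it is monotonic and has higher resolution.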

In [None]:
# Benchmarking framework
import time
from contextlib import contextmanager

@contextmanager
def timer():
    start = time.time()
    try:
        yield start  # Yield the start time
    finally:
        # Report the elapsed time even if the timed block raises
        end = time.time()
        print(f"Time elapsed: {end - start:.2f} seconds")

def benchmark_inference(model, input_texts, batch_sizes=(1, 4, 8)):
    """Benchmark inference performance (inference is mocked with time.sleep)"""

    results = {}

    for batch_size in batch_sizes:
        print(f"\nBenchmarking batch size {batch_size}...")

        # Prepare batched inputs
        batched_texts = [input_texts[i:i+batch_size]
                        for i in range(0, len(input_texts), batch_size)]

        total_tokens = 0

        with timer() as t:
            for batch in batched_texts:
                # Simulate inference
                time.sleep(0.1 * len(batch))  # Mock inference time
                # Whitespace-separated words stand in for tokens here
                total_tokens += sum(len(text.split()) for text in batch)
            elapsed = time.time() - t  # Measure once so both metrics share it

        results[batch_size] = {
            'throughput': total_tokens / elapsed,
            'latency': elapsed / len(batched_texts)
        }

    return results

# Example benchmarking
sample_texts = [
    "This is a test review for benchmarking.",
    "Another sample text for performance testing.",
    "Classification performance evaluation text."
] * 10  # Repeat for more data

print("Running inference benchmarks...")
benchmark_results = benchmark_inference(None, sample_texts)

for batch_size, metrics in benchmark_results.items():
    print(f"Batch {batch_size}: Throughput = {metrics['throughput']:.2f} tokens/sec, "
          f"Latency = {metrics['latency']:.3f} sec/batch")

## Summary and Best Practices

Key takeaways from the LLM Classification course.

In [None]:
# Best practices summary
best_practices = {
    "Model Selection": [
        "Choose model size based on available resources",
        "Consider domain-specific pretraining",
        "Evaluate multiple architectures (Decoder-only, Encoder-Decoder)"
    ],

    "Efficient Inference": [
        "Use vLLM for high-throughput serving",
        "Implement quantization (4-bit, 8-bit)",
        "Batch requests for optimal throughput",
        "Use model parallelism for large models"
    ],

    "Fine-tuning": [
        "Use parameter-efficient methods (LoRA, QLoRA)",
        "Implement proper data formatting",
        "Monitor for overfitting with validation",
        "Use gradient checkpointing for memory efficiency"
    ],

    "Production Deployment": [
        "Implement proper error handling",
        "Add request rate limiting",
        "Monitor model performance continuously",
        "Plan for model updates and A/B testing"
    ],

    "Ethical Considerations": [
        "Regular bias audits and monitoring",
        "Implement fairness constraints",
        "Transparent decision explanations",
        "Responsible data collection practices"
    ]
}

for category, practices in best_practices.items():
    print(f"\n{category}:")
    for practice in practices:
        print(f"  • {practice}")

print("\n" + "="*50)
print("🎉 Congratulations! You've completed the LLM Classification course!")
print("You now have the skills to:")
print("  • Build and deploy advanced LLM classification systems")
print("  • Optimize models for production environments")
print("  • Address ethical considerations in AI deployment")
print("  • Implement modern fine-tuning techniques")
print("="*50)