# Lab 2: Evaluation and Deployment

In this lab, you'll learn how to evaluate your fine-tuned model, merge the LoRA weights into the base model, and prepare the result for deployment.

## Learning Objectives
- Load and test a fine-tuned LoRA adapter
- Evaluate model quality with keyword, length, and repetition heuristics
- Merge LoRA weights into the base model
- Export the model for deployment
- Understand quantization options for inference

## Prerequisites
- Completed Lab 1 (or have a trained LoRA adapter)
- GPU with 16GB+ VRAM

## Step 1: Install Dependencies

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate peft bitsandbytes
!pip install -q rouge-score nltk

## Step 2: Import Libraries

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel, PeftConfig
from datasets import Dataset
import json
import os

# Check GPU
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Load the Fine-Tuned Model

Load the base model with the LoRA adapter from Lab 1.

In [None]:
# Configuration
base_model_name = "mistralai/Mistral-7B-v0.1"
adapter_path = "./lora-adapter"  # Path from Lab 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
tokenizer.pad_token = tokenizer.eos_token

print("Tokenizer loaded!")

In [None]:
# Quantization config for loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model = PeftModel.from_pretrained(base_model, adapter_path)

print("Model with adapter loaded!")
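
As a rough sanity check on the memory requirements, the sketch below estimates the weight-only footprint of a model at different precisions. The parameter count (about 7.24 billion for Mistral-7B) and the helper name `estimate_weight_memory_gb` are assumptions for illustration. Real usage is higher because of activations, the KV cache, and quantization metadata.

```python
# Rough weight-only memory estimate; ignores activations, KV cache, and overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def estimate_weight_memory_gb(num_params, dtype):
    """Return the approximate size of the weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


approx_params = 7.24e9  # assumed approximate parameter count for Mistral-7B
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>5}: {estimate_weight_memory_gb(approx_params, dtype):5.1f} GiB")
```

The gap between the `nf4` and `fp16` rows is why 4-bit loading is used for evaluation here, while the fp16 reload in Step 8 needs most of a 16GB GPU.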

## Step 4: Create Evaluation Dataset

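Create a small held-out test set: prompts that were not part of the Lab 1 training data, each paired with keywords that a good answer should mention.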

In [None]:
# Evaluation dataset - questions with expected answers
eval_data = [
    {
        "instruction": "What is deep learning?",
        "expected_keywords": ["neural network", "layers", "learn", "data", "patterns"]
    },
    {
        "instruction": "Explain the concept of backpropagation.",
        "expected_keywords": ["gradient", "weights", "error", "chain rule", "update"]
    },
    {
        "instruction": "What is a transformer architecture?",
        "expected_keywords": ["attention", "sequence", "parallel", "encoder", "decoder"]
    },
    {
        "instruction": "Explain regularization in machine learning.",
        "expected_keywords": ["overfitting", "penalty", "generalization", "complexity"]
    },
    {
        "instruction": "What is the purpose of an activation function?",
        "expected_keywords": ["non-linear", "neuron", "output", "learn", "complex"]
    },
]

print(f"Evaluation samples: {len(eval_data)}")
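
A test set is only held out if none of its prompts appeared in training. The sketch below is one possible way to check this, assuming you still have the list of training instructions from Lab 1. The name `train_instructions` and the sample data are hypothetical.

```python
def find_leaked_instructions(eval_samples, train_instructions):
    """Return eval instructions that also occur in the training set.

    Comparison ignores case and repeated whitespace.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    seen = {normalize(text) for text in train_instructions}
    return [s["instruction"] for s in eval_samples if normalize(s["instruction"]) in seen]


# Hypothetical data for illustration only
train_instructions = ["What is deep learning?", "Define overfitting."]
sample_eval = [
    {"instruction": "what is  deep learning?"},
    {"instruction": "Explain dropout."},
]
print(find_leaked_instructions(sample_eval, train_instructions))
```

Any instruction this returns should be removed from the evaluation set or replaced with a new prompt.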

## Step 5: Generate Responses

In [None]:
def generate_response(instruction, max_tokens=256):
    """Generate a response for a given instruction."""
    prompt = f"""### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
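    # Decode only the newly generated tokens so the prompt is not echoed back
    # and a later "### Response:" marker in the output cannot be mistaken for the answer
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()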

# Generate responses for all eval samples
responses = []
for sample in eval_data:
    response = generate_response(sample["instruction"])
    responses.append({
        "instruction": sample["instruction"],
        "response": response,
        "expected_keywords": sample["expected_keywords"]
    })
    print(f"\n{'='*60}")
    print(f"Q: {sample['instruction']}")
    print(f"A: {response[:200]}..." if len(response) > 200 else f"A: {response}")
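
Models trained on this instruction format sometimes keep generating after the answer and start a new `### Instruction:` turn. The sketch below shows one way to post-process decoded strings under that assumption. The helper names `build_prompt`, `trim_extra_turns`, and `extract_response` are illustrative and not part of any library.

```python
RESPONSE_MARKER = "### Response:"
INSTRUCTION_MARKER = "### Instruction:"


def build_prompt(instruction):
    """Format an instruction with the same template used in this lab."""
    return f"{INSTRUCTION_MARKER}\n{instruction}\n\n{RESPONSE_MARKER}\n"


def trim_extra_turns(completion):
    """Cut a completion (text without the prompt) at the first extra turn."""
    answer, _, _ = completion.partition(INSTRUCTION_MARKER)
    return answer.strip()


def extract_response(full_text):
    """Extract the first answer from full decoded text (prompt plus completion)."""
    _, marker, after = full_text.partition(RESPONSE_MARKER)
    if not marker:
        # No response marker at all: treat the whole text as the completion
        after = full_text
    return trim_extra_turns(after)


# Simulated decoded output where the model invented a second turn
decoded = build_prompt("Q1") + "A1\n\n### Instruction:\nQ2\n\n### Response:\nA2"
print(extract_response(decoded))
```

Taking the last segment after splitting on `### Response:` would return the invented second answer here, whereas cutting at the first extra turn keeps the answer to the original question.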

## Step 6: Evaluate Response Quality

In [None]:
def evaluate_keywords(response, keywords):
    """Check how many expected keywords appear in the response."""
    response_lower = response.lower()
    found = [kw for kw in keywords if kw.lower() in response_lower]
    return len(found) / len(keywords), found

def evaluate_length(response):
    """Check if response length is appropriate."""
    words = len(response.split())
    if words < 20:
        return 0.3, "Too short"
    elif words < 50:
        return 0.6, "Brief"
    elif words < 150:
        return 1.0, "Good"
    elif words < 300:
        return 0.8, "Verbose"
    else:
        return 0.5, "Too long"

def evaluate_coherence(response):
    """Simple coherence check based on the share of unique sentences."""
    # Split on periods and drop empty fragments (simplified sentence splitting)
    sentences = [s.strip().lower() for s in response.split('.') if s.strip()]
    if len(sentences) > 1:
        # 1.0 means every sentence is distinct; lower values mean repeated sentences
        unique_ratio = len(set(sentences)) / len(sentences)
        return unique_ratio, "Low repetition" if unique_ratio > 0.8 else "Some repetition"
    return 1.0, "Single sentence"

# Evaluate all responses
print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)

total_keyword_score = 0
total_length_score = 0
total_coherence_score = 0

for resp in responses:
    print(f"\nInstruction: {resp['instruction'][:50]}...")
    
    # Keyword evaluation
    kw_score, found_kw = evaluate_keywords(resp['response'], resp['expected_keywords'])
    print(f"  Keywords: {kw_score:.0%} ({len(found_kw)}/{len(resp['expected_keywords'])}) - Found: {found_kw}")
    
    # Length evaluation
    len_score, len_desc = evaluate_length(resp['response'])
    print(f"  Length: {len_score:.0%} ({len_desc}) - {len(resp['response'].split())} words")
    
    # Coherence evaluation
    coh_score, coh_desc = evaluate_coherence(resp['response'])
    print(f"  Coherence: {coh_score:.0%} ({coh_desc})")
    
    total_keyword_score += kw_score
    total_length_score += len_score
    total_coherence_score += coh_score

n = len(responses)
print(f"\n{'='*80}")
print(f"AVERAGE SCORES:")
print(f"  Keywords: {total_keyword_score/n:.0%}")
print(f"  Length: {total_length_score/n:.0%}")
print(f"  Coherence: {total_coherence_score/n:.0%}")
print(f"  Overall: {(total_keyword_score + total_length_score + total_coherence_score)/(3*n):.0%}")
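
Plain substring matching can over-count keywords, for example `error` also matches inside `terror`. The sketch below is one possible refinement that anchors each keyword at the start of a word, so inflected forms such as `learning` still count. The function name `keyword_hits` is an assumption for illustration.

```python
import re


def keyword_hits(response, keywords):
    """Return keywords that occur at the start of a word in the response."""
    text = response.lower()
    hits = []
    for kw in keywords:
        # Leading \b only: "learn" matches "learning", "error" does not match "terror"
        pattern = r"\b" + re.escape(kw.lower())
        if re.search(pattern, text):
            hits.append(kw)
    return hits


print(keyword_hits("The terror of learning neural networks", ["error", "learn", "neural network"]))
```

Keyword heuristics remain a coarse proxy for quality either way, so treat the scores as a smoke test rather than a benchmark.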

## Step 7: Compare with Base Model (Optional)

Compare responses from the fine-tuned model vs the base model.

In [None]:
# To compare, you would load the base model without the adapter
# This is optional and requires additional memory

# Example comparison structure:
print("To compare with base model:")
print("1. Load base model without adapter")
print("2. Generate responses with same prompts")
print("3. Run same evaluation metrics")
print("4. Compare scores side by side")

# Uncomment below to actually compare (requires ~2x memory)
# base_only = AutoModelForCausalLM.from_pretrained(
#     base_model_name,
#     quantization_config=bnb_config,
#     device_map="auto",
# )
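
The comparison steps above can be sketched as a small harness that accepts any two generation functions. The names `compare_models` and `keyword_coverage` and the stub generators are hypothetical. In the notebook you would pass real generation functions instead. With PEFT, the base model's output can also be obtained without loading a second copy by generating inside the `model.disable_adapter()` context manager.

```python
def keyword_coverage(response, keywords):
    """Fraction of expected keywords found in the response (substring match)."""
    text = response.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def compare_models(eval_samples, generate_base, generate_tuned, score_fn=keyword_coverage):
    """Score two generation functions on the same prompts and report the difference."""
    rows = []
    for sample in eval_samples:
        base_score = score_fn(generate_base(sample["instruction"]), sample["expected_keywords"])
        tuned_score = score_fn(generate_tuned(sample["instruction"]), sample["expected_keywords"])
        rows.append({
            "instruction": sample["instruction"],
            "base": base_score,
            "tuned": tuned_score,
            "delta": tuned_score - base_score,
        })
    return rows


# Stub generators stand in for real models (illustration only).
# In the notebook, generate_base could wrap generate_response in
# `with model.disable_adapter(): ...`
samples = [{"instruction": "Explain backpropagation.", "expected_keywords": ["gradient", "weights"]}]
rows = compare_models(
    samples,
    generate_base=lambda q: "I am not sure.",
    generate_tuned=lambda q: "Gradients update the weights.",
)
for row in rows:
    print(f"{row['instruction']}: base={row['base']:.0%} tuned={row['tuned']:.0%} delta={row['delta']:+.0%}")
```

Because sampling is random, use greedy decoding or average over several runs before reading much into small deltas.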

## Step 8: Merge LoRA Weights

Merge the LoRA adapter weights into the base model for deployment.
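
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The sketch below illustrates this on a toy 2x2 matrix in pure Python. It assumes the standard LoRA scaling of `alpha / r` and is not how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as nested lists."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# Toy example: 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (rank, in_features)
B = [[0.5], [0.25]]   # shape (out_features, rank)
merged = merge_lora(W, A, B, alpha=2, rank=1)

# The merged weight gives the same output as base path plus adapter path
x = [[1.0], [1.0]]    # column vector input
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (2 / 1) * l[0]] for b, l in zip(base_out, lora_out)]
print(merged)
print(matmul(merged, x) == unmerged)
```

After merging, inference no longer needs the PEFT library or the extra adapter matrices, which is why the merged model is convenient for deployment.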

In [None]:
# First, reload the base model in half precision (fp16) for merging
# Note: This requires more memory than 4-bit loading (about 14 GB of weights for a 7B model)

print("Loading base model in fp16 for merging...")

# Clean up current model to free memory
import gc

del model, base_model
gc.collect()  # Release Python references before clearing the CUDA cache
torch.cuda.empty_cache()

# Load base model in fp16
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model_fp16, adapter_path)

print("Model loaded for merging!")

In [None]:
# Merge LoRA weights into base model
print("Merging LoRA weights...")
merged_model = model_with_adapter.merge_and_unload()

print(f"Merged model parameters: {merged_model.num_parameters():,}")

In [None]:
# Save merged model
merged_path = "./merged-model"
os.makedirs(merged_path, exist_ok=True)

print(f"Saving merged model to {merged_path}...")
merged_model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)

print("Merged model saved!")
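
Before reloading the merged model, it can help to confirm that the export directory looks complete. The sketch below is a minimal check, assuming the usual `save_pretrained` layout (a `config.json`, weight files ending in `.safetensors` or `.bin`, and tokenizer files whose names start with `tokenizer`). The helper name `check_export_dir` is hypothetical.

```python
import os


def check_export_dir(path):
    """Return a list of problems found in a saved model directory (empty if none)."""
    files = set(os.listdir(path)) if os.path.isdir(path) else set()
    if not files:
        return [f"{path} is missing or empty"]

    problems = []
    if "config.json" not in files:
        problems.append("config.json not found")
    if not any(name.endswith((".safetensors", ".bin")) for name in files):
        problems.append("no weight files (*.safetensors or *.bin) found")
    if not any(name.startswith("tokenizer") for name in files):
        problems.append("no tokenizer files found")
    return problems


print(check_export_dir("./merged-model"))
```

An empty list means the basic files are present. It does not validate the weights themselves, which is what the generation test in the next step is for.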

## Step 9: Verify Merged Model

In [None]:
# Load and test merged model
print("Loading merged model for verification...")

# Clean up
import gc

del merged_model, model_with_adapter, base_model_fp16
gc.collect()  # Release Python references before clearing the CUDA cache
torch.cuda.empty_cache()

# Load merged model with quantization for inference
test_model = AutoModelForCausalLM.from_pretrained(
    merged_path,
    quantization_config=bnb_config,
    device_map="auto",
)
test_tokenizer = AutoTokenizer.from_pretrained(merged_path)

print("Merged model loaded for testing!")

In [None]:
# Test merged model
def test_merged_model(instruction):
    prompt = f"""### Instruction:
{instruction}

### Response:
"""
    inputs = test_tokenizer(prompt, return_tensors="pt").to(test_model.device)
    
    with torch.no_grad():
        outputs = test_model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
            pad_token_id=test_tokenizer.eos_token_id,
        )
    
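    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[1]
    response = test_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()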

# Verify with a test question
test_q = "What is the difference between L1 and L2 regularization?"
print(f"Question: {test_q}")
print(f"\nResponse: {test_merged_model(test_q)}")

## Step 10: Export Options

Different export formats for various deployment scenarios.

In [None]:
print("Deployment Options:")
print("")
print("1. vLLM (High throughput serving)")
print("   python -m vllm.entrypoints.api_server --model ./merged-model")
print("")
print("2. Text Generation Inference (TGI)")
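print("   # Mount the model directory so the container can read it")
print("   docker run --gpus all --shm-size 1g -p 8080:80 -v $(pwd)/merged-model:/data/merged-model ghcr.io/huggingface/text-generation-inference --model-id /data/merged-model")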
print("")
print("3. Ollama (Local deployment)")
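print("   # First convert to GGUF format (newer llama.cpp releases name these convert_hf_to_gguf.py and llama-quantize)")
print("   python llama.cpp/convert.py ./merged-model --outfile ./model.gguf --outtype f16")
print("   ./llama.cpp/quantize ./model.gguf ./model-q4.gguf q4_k_m")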
print("   # Then create Modelfile and run with ollama")
print("")
print("4. Hugging Face Hub")
print("   from huggingface_hub import HfApi")
print("   api = HfApi()")
print("   api.create_repo(repo_id='your-username/model-name', exist_ok=True)")
print("   api.upload_folder(folder_path='./merged-model', repo_id='your-username/model-name')")
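
Once a server is running, the model is queried over HTTP. The sketch below builds a request for TGI's `/generate` endpoint using only the standard library. It assumes the server from option 2 is listening on `localhost:8080`, and the helper name `build_tgi_request` is hypothetical. The request is constructed but not sent, so the cell runs without a server.

```python
import json
import urllib.request


def build_tgi_request(instruction, base_url="http://localhost:8080", max_new_tokens=256):
    """Build a POST request for a TGI /generate endpoint using the lab's prompt format."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.7,
            "top_p": 0.9,
            "do_sample": True,
        },
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_tgi_request("What is deep learning?")
print(request.full_url)

# To send it once the server is running (not executed here):
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["generated_text"])
```

Reusing the same prompt template as in training matters here, because the serving stack does not add it for you.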

## Step 11: Create Model Card

In [None]:
# Create a model card for documentation
model_card = """---
language: en
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
tags:
  - fine-tuned
  - lora
  - qlora
---

# Fine-Tuned Mistral-7B

## Model Description
This model is a fine-tuned version of Mistral-7B-v0.1 using QLoRA.

## Training Details
- **Base Model:** mistralai/Mistral-7B-v0.1
- **Method:** QLoRA (4-bit quantization + LoRA)
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Training Epochs:** 3

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/model")
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

prompt = "### Instruction:\\nYour question here\\n\\n### Response:\\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

## Limitations
- Fine-tuned on a small demonstration dataset
- May not generalize to all topics
- Should be evaluated before production use
"""

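# Write the card as README.md so the Hub renders it on the model page
with open(os.path.join(merged_path, "README.md"), "w", encoding="utf-8") as f:
    f.write(model_card)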

print("Model card created!")

## Exercises

1. **Add more evaluation metrics**: Implement ROUGE or BLEU scores
2. **A/B Testing**: Compare outputs from the base model with the adapter attached vs the merged model
3. **Quantization**: Export the model in different quantization levels and compare
4. **Deployment**: Deploy the model using one of the serving options
5. **Safety Evaluation**: Test the model with adversarial prompts

In [None]:
# YOUR CODE HERE - Add your own experiments!
