# Lab 1.3 - Micro-Evaluation Exercise

**Duration**: 60 minutes | **Difficulty**: Intermediate

## Objectives
- Build a systematic evaluation framework
- Compare Ollama and watsonx.ai outputs
- Create evaluation metrics and rubrics
- Store results in structured format for analysis

## Setup

In [None]:
import pandas as pd
import time
from typing import List, Dict
import ollama
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

## Step 1: Define Test Set

Create diverse test prompts covering different task types.

In [None]:
TEST_PROMPTS = [
    {
        "prompt": "Summarize the following in 2 sentences: Machine learning is a subset of artificial intelligence that enables systems to learn from data without explicit programming.",
        "task_type": "summarization",
        "expected_keywords": ["machine learning", "AI", "data", "learn"]
    },
    {
        "prompt": "Extract the main entities from: 'IBM released watsonx.ai in 2023 as an enterprise AI platform.'",
        "task_type": "extraction",
        "expected_keywords": ["IBM", "watsonx.ai", "2023"]
    },
    {
        "prompt": "Rewrite formally: 'Hey team, the API is down, can someone check it ASAP?'",
        "task_type": "style_transfer",
        "expected_keywords": ["API", "unavailable", "investigate"]
    },
    {
        "prompt": "Answer: What is the capital of France?",
        "task_type": "qa_factual",
        "ground_truth": "Paris",
        "expected_keywords": ["Paris"]
    },
    {
        "prompt": "Explain in simple terms: What is a REST API?",
        "task_type": "explanation",
        "expected_keywords": ["API", "web", "HTTP", "request", "response"]
    },
    {
        "prompt": "List 3 benefits of cloud computing in bullet points.",
        "task_type": "list_generation",
        "expected_format": "bullets"
    },
    {
        "prompt": "Calculate: If a product costs $100 and has a 20% discount, what is the final price?",
        "task_type": "reasoning",
        "ground_truth": "$80"
    },
    {
        "prompt": "Based on this context: 'Python 3.11 was released in October 2022.', answer: When was Python 3.11 released?",
        "task_type": "qa_context",
        "ground_truth": "October 2022"
    }
]

print(f"✅ Created test set with {len(TEST_PROMPTS)} prompts")
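
Not every test case carries the same scoring fields: some have `expected_keywords`, some `ground_truth`, and one only `expected_format`. A quick structural check before any model call can catch cases that would be impossible to score. The sketch below is an assumption about what "valid" means for this lab; the helper name `validate_test_cases` and the sample cases are hypothetical.

```python
from typing import Dict, List

# Fields every case needs, and fields of which at least one is needed for scoring.
REQUIRED_FIELDS = ("prompt", "task_type")
CHECK_FIELDS = ("expected_keywords", "ground_truth", "expected_format")

def validate_test_cases(test_cases: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    problems = []
    for i, case in enumerate(test_cases):
        for field in REQUIRED_FIELDS:
            if not case.get(field):
                problems.append(f"case {i}: missing '{field}'")
        if not any(case.get(field) for field in CHECK_FIELDS):
            problems.append(f"case {i}: nothing to score against")
    return problems

# Hypothetical sample: the second case has no scoring field at all.
sample_cases = [
    {"prompt": "Answer: What is the capital of France?",
     "task_type": "qa_factual", "ground_truth": "Paris"},
    {"prompt": "List 3 benefits of cloud computing.",
     "task_type": "list_generation"},
]
print(validate_test_cases(sample_cases))
```

Running the same function on `TEST_PROMPTS` should return an empty list if every case has something to score against.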

## Step 2: Set Up Models

In [None]:
# Set up watsonx.ai
credentials = Credentials(
    url=os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
    api_key=os.getenv("WATSONX_APIKEY")
)

PROJECT_ID = os.getenv("WATSONX_PROJECT_ID")

model_params = {
    GenParams.MAX_NEW_TOKENS: 200,
    GenParams.TEMPERATURE: 0.7,
    GenParams.TOP_P: 1,
    GenParams.TOP_K: 50
}

granite_model = ModelInference(
    model_id="ibm/granite-13b-chat-v2",
    credentials=credentials,
    project_id=PROJECT_ID,
    params=model_params
)

print("✅ Models initialized")
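
The setup cell reads `WATSONX_APIKEY` and `WATSONX_PROJECT_ID` with `os.getenv`, which silently returns `None` when a variable is missing, so a misconfigured `.env` file only surfaces later as an API error. One way to fail early is sketched below; the helper name `missing_settings` is hypothetical, and treating these two variables as mandatory is an assumption based on the cell above.

```python
import os
from typing import List, Mapping

# Variables the setup cell reads without a default value.
REQUIRED_VARS = ("WATSONX_APIKEY", "WATSONX_PROJECT_ID")

def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
print(missing_settings({"WATSONX_APIKEY": "example-key"}))
```

Calling `missing_settings()` with no argument checks the real environment; a non-empty result is a good reason to stop before creating `Credentials`.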

## Step 3: Collect Outputs

Run all prompts through both models and measure latency.

In [None]:
def get_ollama_response(prompt: str) -> Dict:
    """Get response from Ollama model."""
    # perf_counter() is a monotonic clock, better suited to timing than time.time()
    start_time = time.perf_counter()
    try:
        response = ollama.chat(
            model="qwen2.5:0.5b-instruct",
            messages=[{"role": "user", "content": prompt}]
        )
        output = response["message"]["content"]
        latency_ms = (time.perf_counter() - start_time) * 1000
        return {"output": output, "latency_ms": latency_ms, "error": None}
    except Exception as e:
        # NaN (not 0) so that failed calls do not pull the average latency down
        return {"output": None, "latency_ms": float("nan"), "error": str(e)}

def get_granite_response(prompt: str) -> Dict:
    """Get response from Granite model."""
    # perf_counter() is a monotonic clock, better suited to timing than time.time()
    start_time = time.perf_counter()
    try:
        output = granite_model.generate_text(prompt=prompt)
        latency_ms = (time.perf_counter() - start_time) * 1000
        return {"output": output.strip(), "latency_ms": latency_ms, "error": None}
    except Exception as e:
        # NaN (not 0) so that failed calls do not pull the average latency down
        return {"output": None, "latency_ms": float("nan"), "error": str(e)}

In [None]:
# Collect all responses
results = []

for i, test_case in enumerate(TEST_PROMPTS):
    print(f"Processing {i+1}/{len(TEST_PROMPTS)}: {test_case['task_type']}")
    
    # Get responses from both models
    ollama_result = get_ollama_response(test_case["prompt"])
    granite_result = get_granite_response(test_case["prompt"])
    
    results.append({
        "prompt": test_case["prompt"],
        "task_type": test_case["task_type"],
        "ollama_output": ollama_result["output"],
        "ollama_latency_ms": ollama_result["latency_ms"],
        "ollama_error": ollama_result["error"],
        "granite_output": granite_result["output"],
        "granite_latency_ms": granite_result["latency_ms"],
        "granite_error": granite_result["error"],
        "expected_keywords": test_case.get("expected_keywords", []),
        "ground_truth": test_case.get("ground_truth", "")
    })

# Create DataFrame
df_results = pd.DataFrame(results)
print(f"\n✅ Collected {len(df_results)} results")
df_results.head()
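
The loop above times each prompt once per model, so a single slow call (for example a cold model load) can dominate the average. A possible refinement, sketched here under the assumption that repeating each call a few times is affordable, is to report the median of several runs. The helper `median_latency_ms` and the stand-in `fake_model` are hypothetical and do not call Ollama or watsonx.ai.

```python
import statistics
import time
from typing import Callable, List

def median_latency_ms(fn: Callable[[str], str], prompt: str, runs: int = 3) -> float:
    """Call fn(prompt) `runs` times and return the median latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(prompt)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-in for a model call so the example runs without any backend.
calls: List[str] = []

def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()

latency = median_latency_ms(fake_model, "ping", runs=5)
print(f"median latency: {latency:.3f} ms over {len(calls)} runs")
```

The median is less sensitive to one-off outliers than the mean, at the cost of more model calls per prompt.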

## Step 4: Apply Rating Rubric

Score each response on three criteria: keyword coverage, ground-truth match, and a length-based completeness proxy.

In [None]:
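def keyword_coverage(text: str, keywords: List[str]) -> float:
    """Calculate the fraction (0.0 to 1.0) of expected keywords present in text."""
    if not keywords:
        # Nothing to score: return NaN so pandas mean() skips this case
        # instead of counting it as a zero.
        return float("nan")
    if not text:
        return 0.0
    text_lower = text.lower()
    # Case-insensitive substring match
    matches = sum(1 for kw in keywords if kw.lower() in text_lower)
    return matches / len(keywords)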

def contains_ground_truth(text: str, ground_truth: str) -> bool:
    """Check if text contains ground truth answer."""
    if not ground_truth or not text:
        return False
    return ground_truth.lower() in text.lower()

def rate_response(output: str, expected_keywords: List[str], ground_truth: str) -> Dict:
    """Rate a model response."""
    if not output:
        return {"keyword_score": 0.0, "ground_truth_match": False, "completeness": 0}
    
    keyword_score = keyword_coverage(output, expected_keywords)
    gt_match = contains_ground_truth(output, ground_truth) if ground_truth else None
    
    # Crude completeness proxy based on length: word count divided by 20,
    # capped at 1.0 (any response of 20+ words scores 1.0)
    completeness = min(len(output.split()) / 20, 1.0)
    
    return {
        "keyword_score": keyword_score,
        "ground_truth_match": gt_match,
        "completeness": completeness
    }
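
`keyword_coverage` uses substring matching, so short keywords can match inside unrelated words: "API" is found in "rapidly" and "AI" in "explain". A stricter variant that only counts whole-word matches is sketched below. The function name `keyword_coverage_whole_word` is hypothetical, and returning NaN for an empty keyword list is an assumption chosen so that averages can skip unscored cases.

```python
import math
import re
from typing import List

def keyword_coverage_whole_word(text: str, keywords: List[str]) -> float:
    """Fraction of keywords that appear in text as whole words (case-insensitive)."""
    if not keywords:
        return math.nan  # nothing to score
    if not text:
        return 0.0
    hits = 0
    for kw in keywords:
        # Lookarounds instead of \b so keywords with punctuation
        # (e.g. "watsonx.ai") are still bounded correctly.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits += 1
    return hits / len(keywords)

print(keyword_coverage_whole_word("We rapidly explain the plan.", ["API", "AI"]))
print(keyword_coverage_whole_word("The API uses AI.", ["API", "AI"]))
```

The first call would score 1.0 under substring matching even though neither keyword appears as a word; the whole-word variant avoids that false positive.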

In [None]:
# Rate all responses
for idx, row in df_results.iterrows():
    # Rate Ollama
    ollama_rating = rate_response(
        row["ollama_output"],
        row["expected_keywords"],
        row["ground_truth"]
    )
    df_results.at[idx, "ollama_keyword_score"] = ollama_rating["keyword_score"]
    df_results.at[idx, "ollama_gt_match"] = ollama_rating["ground_truth_match"]
    df_results.at[idx, "ollama_completeness"] = ollama_rating["completeness"]
    
    # Rate Granite
    granite_rating = rate_response(
        row["granite_output"],
        row["expected_keywords"],
        row["ground_truth"]
    )
    df_results.at[idx, "granite_keyword_score"] = granite_rating["keyword_score"]
    df_results.at[idx, "granite_gt_match"] = granite_rating["ground_truth_match"]
    df_results.at[idx, "granite_completeness"] = granite_rating["completeness"]

print("✅ All responses rated")
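
The list-generation test case is tagged with `"expected_format": "bullets"`, but the rubric above never reads that field. A minimal format check could look like the sketch below. The helper names are hypothetical, and the definition of a bullet line (a line starting with `-`, `*`, or `•` followed by a space) is an assumption; numbered lists are deliberately not counted.

```python
import re

# A bullet line: optional indentation, a bullet marker, whitespace, then content.
BULLET_LINE = re.compile(r"^\s*[-*\u2022]\s+\S")

def count_bullets(text: str) -> int:
    """Count lines that look like bullet points."""
    return sum(1 for line in text.splitlines() if BULLET_LINE.match(line))

def matches_bullet_format(text: str, expected_items: int = 3) -> bool:
    """True when the response contains exactly the requested number of bullets."""
    return count_bullets(text) == expected_items

sample = "Benefits:\n- Scalability\n- Lower cost\n* Flexibility"
print(count_bullets(sample), matches_bullet_format(sample))
```

A check like this could be called from the rating loop for cases where `expected_format` is set, alongside the keyword and ground-truth scores.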

## Step 5: Analyze Results

In [None]:
# Summary statistics
summary = pd.DataFrame({
    "Metric": [
        "Avg Latency (ms)",
        "Avg Keyword Score",
        "Avg Completeness",
        "Success Rate"
    ],
    "Ollama": [
        df_results["ollama_latency_ms"].mean(),
        df_results["ollama_keyword_score"].mean(),
        df_results["ollama_completeness"].mean(),
        (df_results["ollama_error"].isna().sum() / len(df_results))
    ],
    "Granite": [
        df_results["granite_latency_ms"].mean(),
        df_results["granite_keyword_score"].mean(),
        df_results["granite_completeness"].mean(),
        (df_results["granite_error"].isna().sum() / len(df_results))
    ]
})

print("\n📊 Summary Statistics:\n")
print(summary)
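
The summary table leaves out the ground-truth results, even though `rate_response` computes them. Because only some test cases define a ground truth, an accuracy figure should be computed over the scored cases only. The sketch below works on plain Python values as returned by `rate_response` (`True`, `False`, or `None` when there is no ground truth); the helper name `ground_truth_accuracy` is hypothetical.

```python
from typing import List, Optional

def ground_truth_accuracy(matches: List[Optional[bool]]) -> Optional[float]:
    """Share of True among cases that have a ground truth; None if there are none."""
    scored = [m for m in matches if m is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)

# Three scored cases (two correct) and two cases without a ground truth.
print(ground_truth_accuracy([True, None, False, True, None]))
```

Dividing by the number of scored cases rather than the total keeps prompts without a ground truth from being counted as misses.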

In [None]:
# Performance by task type
task_performance = df_results.groupby("task_type").agg({
    "ollama_keyword_score": "mean",
    "granite_keyword_score": "mean",
    "ollama_latency_ms": "mean",
    "granite_latency_ms": "mean"
}).round(3)

print("\n📊 Performance by Task Type:\n")
print(task_performance)

## Visualization

In [None]:
import matplotlib.pyplot as plt

# Latency and quality comparison, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Latency
ax1.bar(["Ollama", "Granite"], 
        [df_results["ollama_latency_ms"].mean(), df_results["granite_latency_ms"].mean()],
        color=['#3b82f6', '#10b981'])
ax1.set_ylabel("Average Latency (ms)")
ax1.set_title("Response Time Comparison")

# Quality
ax2.bar(["Ollama", "Granite"],
        [df_results["ollama_keyword_score"].mean(), df_results["granite_keyword_score"].mean()],
        color=['#3b82f6', '#10b981'])
ax2.set_ylabel("Average Keyword Score")
ax2.set_title("Quality Comparison")
ax2.set_ylim([0, 1])

plt.tight_layout()
plt.savefig("evaluation_comparison.png", dpi=150, bbox_inches='tight')
plt.show()

print("✅ Visualization saved as 'evaluation_comparison.png'")

## Save Results

In [None]:
# Save to CSV for further analysis
df_results.to_csv("evaluation_results.csv", index=False)
print("✅ Results saved to 'evaluation_results.csv'")
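
One thing to watch when reloading the CSV: the `expected_keywords` column holds Python lists, and CSV has no list type, so each list is written as its string representation and comes back as a plain string. The sketch below shows the effect with the standard-library `csv` module (an in-memory stand-in for the saved file, not the lab's actual output) and recovers the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Write one row containing a list value to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["task_type", "expected_keywords"])
writer.writeheader()
writer.writerow({
    "task_type": "extraction",
    "expected_keywords": ["IBM", "watsonx.ai", "2023"],
})

# Read it back: the list is now a string such as "['IBM', 'watsonx.ai', '2023']".
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["expected_keywords"]

# literal_eval safely parses Python literals without executing code.
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored)
```

When reading the file with pandas, the same conversion can be applied at load time by passing `converters={"expected_keywords": ast.literal_eval}` to `pd.read_csv`.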

## Key Takeaways

- ✅ Built systematic evaluation framework
- ✅ Compared two LLM backends quantitatively
- ✅ Created reusable evaluation metrics
- ✅ Stored results in structured format
- ✅ Generated visualizations for analysis

## Next Steps

1. Extend this framework for Day 2 RAG evaluation
2. Add more sophisticated metrics (BLEU, ROUGE, semantic similarity)
3. Integrate with production logging (see `Analyze_Log_and_Feedback.ipynb`)
4. Create automated testing pipelines
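
As a first step toward the overlap-based metrics mentioned in item 2, the sketch below computes a unigram F1 score between a model output and a reference answer. It is similar in spirit to a ROUGE-1 F-measure but is a simplified illustration, not a replacement for a library implementation; the function names and the tokenization rule (lowercased `\w+` runs) are assumptions.

```python
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def unigram_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall, counting repeated tokens."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("The capital is Paris", "Paris"))
```

Unlike the substring check in `contains_ground_truth`, this score penalizes verbose answers through the precision term, which may or may not be desirable depending on the task type.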