# Evaluating Prompt Effectiveness Tutorial

## Overview

This tutorial covers strategies for assessing the effectiveness of prompts used with AI language models. We'll define several metrics for quantifying prompt performance and walk through both manual and automated evaluation approaches.

## Motivation

As prompt engineering plays a growing role in AI applications, we need reliable methods for gauging prompt effectiveness. Such methods let developers and researchers refine their prompts, which leads to better model performance and more dependable outputs.

## Key Components

1. Quantitative metrics for assessing prompt performance.
2. Manual evaluation methodologies.
3. Automated assessment techniques.
4. Real-world examples using Amazon Nova via OpenRouter and LangChain.

## Setup

First, let's import the necessary libraries and set up our environment.

In [1]:
from os import getenv

import numpy as np
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()

# Initialize the language model
llm = ChatOpenAI(
    openai_api_key=getenv("OPENROUTER_API_KEY"),
    openai_api_base=getenv("OPENROUTER_BASE_URL"),
    model_name="bedrock/nova-lite-v1",
)

# Initialize sentence transformer for semantic similarity
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    embeddings = sentence_model.encode([text1, text2])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
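
To make the similarity measure concrete, here is a minimal sketch of the cosine computation applied to a pair of vectors. The helper name `cosine` and the toy 3-dimensional vectors are illustrative assumptions. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the tutorial itself relies on scikit-learn's `cosine_similarity` rather than this hand-written version.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing texts of different sizes.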

## Metrics for Measuring Prompt Performance

Let's define some key metrics for evaluating prompt effectiveness:

In [2]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)


def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)


def specificity_score(response):
    """Approximate specificity as lexical diversity: unique words / total words."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0.0
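
The metrics above can be sanity-checked on toy inputs without calling the model. The sketch below reproduces the unique-word ratio used by `specificity_score` and the pairwise averaging used by `consistency_score`. To keep it runnable offline, it substitutes a simple word-overlap (Jaccard) similarity for the embedding model. The names `unique_word_ratio`, `jaccard`, and `pairwise_mean` are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def unique_word_ratio(text):
    """Same ratio as specificity_score: unique words / total words."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def jaccard(text1, text2):
    """Word-overlap similarity; a stand-in for semantic_similarity (assumption)."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def pairwise_mean(responses, similarity):
    """Average similarity over all unordered pairs, as consistency_score does."""
    if len(responses) < 2:
        return 1.0
    scores = [similarity(x, y) for x, y in combinations(responses, 2)]
    return sum(scores) / len(scores)


# "the" repeats, so 5 of the 6 words are unique
print(unique_word_ratio("the cat sat on the mat"))

responses = [
    "supervised and unsupervised",
    "supervised and reinforcement",
    "supervised and unsupervised",
]
print(pairwise_mean(responses, jaccard))
```

Because the ratio is computed on whitespace-separated tokens, `Learning` and `learning.` count as different words. Lowercasing and stripping punctuation first would give a cleaner measure.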

## Manual Evaluation Techniques

Manual evaluation involves human assessment of prompt-response pairs. The function below prompts a reviewer for a score on each criterion, plus free-form comments:

In [3]:
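def manual_evaluation(prompt, response, criteria):
    """Collect a reviewer's 0-10 scores and comments for a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    scores = {}
    for criterion in criteria:
        # float() raises ValueError if the reviewer enters a non-numeric value
        score = float(input(f"Score for {criterion} (0-10): "))
        scores[criterion] = score
        print(f"{criterion}: {score}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")
    # Return the ratings so they can be stored or aggregated later
    return {"scores": scores, "comments": comments}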


# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Imagine you have a puppy you're trying to teach to sit.  Instead of explicitly telling it *how* to sit (step-by-step instructions), you show it examples: when it gets close to sitting, you give it a treat.  Over time, the puppy learns to associate the action of sitting with the reward, and eventually sits on command without needing further instruction.

Machine learning is similar.  Instead of giving a computer explicit instructions for every task, we give it lots of examples (data) and let it figure out the patterns and rules on its own.  If it does well (makes accurate predictions or takes the right actions), we "reward" it by adjusting its internal "rules" slightly.  Through this process of learning from examples and feedback, the computer becomes better at the task over time, just like the puppy.

So basically, machine learning is about teaching computers to learn from data without being explicitly programme
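
Manual scores become more useful once ratings from several reviewers are combined. The following is a minimal sketch of that aggregation, assuming each reviewer's ratings are stored as a dictionary mapping criterion name to a 0-10 score. The function `aggregate_ratings` and the sample scores are hypothetical and do not come from a real evaluation.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings):
    """Return the per-criterion mean and spread across reviewers.

    `ratings` is a list of dicts mapping criterion name -> score (0-10).
    """
    summary = {}
    for criterion in ratings[0].keys():
        scores = [r[criterion] for r in ratings]
        # Population standard deviation shows how much reviewers disagree
        summary[criterion] = {"mean": mean(scores), "spread": pstdev(scores)}
    return summary


# Made-up scores from three hypothetical reviewers
ratings = [
    {"Clarity": 8, "Accuracy": 7, "Simplicity": 9},
    {"Clarity": 6, "Accuracy": 7, "Simplicity": 9},
    {"Clarity": 7, "Accuracy": 7, "Simplicity": 6},
]
summary = aggregate_ratings(ratings)
for criterion, stats in summary.items():
    print(f"{criterion}: mean={stats['mean']:.2f}, spread={stats['spread']:.2f}")
```

A large spread on a criterion suggests that reviewers interpret it differently, which is usually a sign that the scoring rubric needs a clearer definition.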

## Automated Evaluation Techniques

Now, let's implement some automated evaluation techniques:

In [4]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")

    return {"relevance": relevance, "specificity": specificity}


# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning:**  The algorithm learns from a labeled dataset, meaning the data includes both the input features and the desired output (the "label").  The algorithm learns to map inputs to outputs. Examples include image classification (input: image; output: object label) and spam detection (input: email; output: spam/not spam).

2. **Unsupervised Learning:** The algorithm learns from an unlabeled dataset, meaning the data only includes input features without corresponding outputs. The algorithm tries to find patterns, structure, or relationships within the data. Examples include clustering (grouping similar data points together) and dimensionality reduction (reducing the number of variables while preserving important information).

3. **Reinforcement Learning:** The algorithm learns through trial and error by interacting with an environment. It receives rewar

{'relevance': 0.8201715, 'specificity': 0.6845637583892618}
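
Embedding similarity may give a fairly high score to a fluent answer that omits an expected item, so a simple complementary check is whether the response mentions each term you expect. The sketch below is one way to do that. The function `keyword_coverage` and the list of key terms are hypothetical additions for illustration and are not part of the metrics defined earlier.

```python
import re


def keyword_coverage(response, key_terms):
    """Fraction of expected key terms that appear in the response (case-insensitive)."""
    if not key_terms:
        return 1.0  # nothing was required
    text = response.lower()
    hits = sum(
        1
        for term in key_terms
        # Word boundaries stop "supervised learning" matching inside "unsupervised learning"
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    )
    return hits / len(key_terms)


key_terms = ["supervised learning", "unsupervised learning", "reinforcement learning"]
complete = "Supervised Learning, Unsupervised Learning, and Reinforcement Learning."
partial = "The main types are supervised learning and unsupervised learning."

print(keyword_coverage(complete, key_terms))
print(keyword_coverage(partial, key_terms))
```

Exact term matching is brittle against paraphrases, so a check like this works best alongside the semantic relevance score rather than as a replacement for it.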

## Comparative Analysis

Let's compare the effectiveness of different prompts for the same task:

In [5]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})

    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x["relevance"], reverse=True)

    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")

    return sorted_results


# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning.",
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can be broadly categorized in several ways, and these categories often overlap.  Here are some common types:

**By Learning Style:**

* **Supervised Learning:**  The algorithm learns from labeled data; it's given input data and corresponding correct outputs.  The goal is to learn a mapping from input to output.  Examples include:
    * **Regression:** Predicting a continuous value (e.g., house price prediction).
    * **Classification:** Predicting a categorical value (e.g., spam detection).

* **Unsupervised Learning:** The algorithm learns from unlabeled data; it's given only input data and must find structure or patterns on its own. Examples include:
    * **Clustering:** Grouping similar data points together (e.g., customer segmentation).
    * **Dimensionality Reduction:** Reducing the number of variables while preserving important information (e.g., principal component analysis).
    * **Association Rule Learn

[{'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': 0.73312235,
  'specificity': 0.6455026455026455},
 {'prompt': 'List the types of machine learning.',
  'relevance': 0.69804984,
  'specificity': 0.617816091954023},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': 0.6432309,
  'specificity': 0.57421875}]
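
A single response per prompt can be a noisy basis for ranking, because model output can vary between calls. One option is to average relevance over several runs before sorting. The sketch below shows that aggregation step on its own. The function `rank_by_mean_relevance` is hypothetical and the scores are made up. In practice each score would come from `relevance_score` applied to a fresh `llm.invoke` response.

```python
from statistics import mean


def rank_by_mean_relevance(scores_by_prompt):
    """Sort prompts by mean relevance across repeated runs, best first."""
    ranked = [
        {"prompt": prompt, "mean_relevance": mean(scores), "runs": len(scores)}
        for prompt, scores in scores_by_prompt.items()
    ]
    return sorted(ranked, key=lambda row: row["mean_relevance"], reverse=True)


# Made-up relevance scores, three runs per prompt, for illustration only
scores_by_prompt = {
    "List the types of machine learning.": [0.70, 0.74, 0.66],
    "What are the main categories of machine learning algorithms?": [0.73, 0.71, 0.75],
    "Explain the different approaches to machine learning.": [0.64, 0.60, 0.62],
}
ranking = rank_by_mean_relevance(scores_by_prompt)
for position, row in enumerate(ranking, 1):
    print(f"{position}. {row['prompt']} (mean relevance {row['mean_relevance']:.2f})")
```

When the gap between two prompts is smaller than the run-to-run variation of either one, the ranking between them should be treated as inconclusive.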

## Putting It All Together

Now, let's create a comprehensive prompt evaluation function that combines both manual and automated techniques:

In [6]:
def evaluate_prompt(
    prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")
):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content

    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)

    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)

    return {"prompt": prompt, "response": response, **auto_results}


# Example usage
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting in machine learning occurs when a model learns the training data *too well*.  Instead of learning the underlying patterns and relationships in the data that generalize to unseen data, it memorizes the specific details and noise present in the training set.  This leads to excellent performance on the training data but poor performance on new, unseen data (the test data).

Imagine you're trying to learn the relationship between hours studied and exam scores.  You have a small dataset.  An overfit model might perfectly capture every data point in your training set, perhaps even drawing a wildly complex curve that zig-zags through each point.  However, this complex curve is unlikely to accurately reflect the true relationship between studying and exam scores.  When presented with a new student's study hours, the model will likely give a wildly inaccurate prediction because it's focus

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting in machine learning occurs when a model learns the training data *too well*.  Instead of learning the underlying patterns and relationships in the data that generalize to unseen data, it memorizes the specific details and noise present in the training set.  This leads to excellent performance on the training data but poor performance on new, unseen data (the test data).\n\nImagine you're trying to learn the relationship between hours studied and exam scores.  You have a small dataset.  An overfit model might perfectly capture every data point in your training set, perhaps even drawing a wildly complex curve that zig-zags through each point.  However, this complex curve is unlikely to accurately reflect the true relationship between studying and exam scores.  When presented with a new student's study hours, the model will likely give a wildly inaccurate prediction because it's focused on the 