### Evaluating Prompt Effectiveness
#### Overview
This tutorial covers methods for evaluating how effective prompts are when used with AI language models. It defines several metrics for prompt performance and walks through both manual and automated evaluation techniques.

#### Motivation
As prompt engineering becomes increasingly crucial in AI applications, it's essential to have robust methods for assessing prompt effectiveness. This enables developers and researchers to optimize their prompts, leading to better AI model performance and more reliable outputs.

In [1]:
import os
from langchain_openai import ChatOpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

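# Check that the OpenAI API key is available (load_dotenv() reads it from .env);
# ChatOpenAI picks it up from the environment
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file.")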

# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize sentence transformer for semantic similarity
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    embeddings = sentence_model.encode([text1, text2])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
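
To make the similarity measure concrete, here is a minimal sketch of the cosine computation that `semantic_similarity` relies on, written with plain NumPy. The helper name `cosine` and the toy three-dimensional vectors are made up for illustration; real sentence embeddings from `all-MiniLM-L6-v2` have many more dimensions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Treat a zero vector as having no similarity to anything.
    return float(u @ v / denom) if denom else 0.0

# Toy "embeddings" (hypothetical values, not produced by a real model).
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Vectors that point the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure is a common choice for comparing embeddings.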

#### Metrics for Measuring Prompt Performance

In [7]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)

def specificity_score(response):
    """Calculate specificity as the ratio of unique words to total words (a type-token ratio)."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0
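
As a quick illustration of what the specificity metric rewards, the sketch below repeats the same unique-words-over-total-words calculation on a few hand-written strings. The function name `type_token_ratio` and the sample strings are assumptions made for this example only.

```python
def type_token_ratio(text):
    """Unique whitespace-separated tokens divided by total tokens."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0

repetitive = "the cat saw the cat"                # 3 unique tokens out of 5
varied = "supervised unsupervised reinforcement"  # 3 unique tokens out of 3
mixed_case = "Data data"                          # case differs, so 2 unique tokens

print(type_token_ratio(repetitive))  # 0.6
print(type_token_ratio(varied))      # 1.0
print(type_token_ratio(mixed_case))  # 1.0
print(type_token_ratio(""))          # 0
```

Because the split is on whitespace only, casing and attached punctuation produce distinct tokens, and longer responses tend to score lower simply because common words repeat. The score is therefore best read as a rough lexical-diversity proxy rather than a direct measure of how specific an answer is.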

In [3]:
def manual_evaluation(prompt, response, criteria):
    """Simulate manual evaluation of a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    for criterion in criteria:
        score = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {score}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")

# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Machine learning is a type of technology that allows computers to learn from data and improve their performance on a task without being explicitly programmed for it. 

Here’s how it works in simple terms:

1. **Data**: First, we gather a lot of information (data) about a specific task. For example, if we want to teach a computer to recognize cats in photos, we would collect many pictures of cats and also pictures of other things.

2. **Learning**: Instead of writing specific rules for the computer about how to identify a cat, we use algorithms—these are like recipes—that analyze the data. The computer looks for patterns and features that help distinguish cats from non-cats.

3. **Training**: The computer uses a part of this data to "train" itself. It makes guesses about which images contain cats and checks its answers against the correct outcomes (whether each image actually has a cat or not). The more guesses i

Score for Clarity (0-10):  9
Clarity: 9.0/10
Score for Accuracy (0-10):  8
Accuracy: 8.0/10
Score for Simplicity (0-10):  10
Simplicity: 10.0/10

Additional Comments:
Enter any additional comments:  best prompt
Comments: best prompt
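
The `manual_evaluation` function above prints the scores but does not return them, so they cannot be aggregated afterwards. One possible variation, sketched here under that assumption, collects the scores into a dictionary and validates each entry. The name `collect_manual_scores` and its `ask`, `low`, and `high` parameters are hypothetical and not part of the original notebook.

```python
def collect_manual_scores(criteria, ask=input, low=0.0, high=10.0):
    """Prompt for one score per criterion and return them as a dict.

    `ask` is any callable with the same interface as input(); passing a
    stub makes the function usable in tests or non-interactive runs.
    """
    scores = {}
    for criterion in criteria:
        while True:
            raw = ask(f"Score for {criterion} ({low:g}-{high:g}): ")
            try:
                value = float(raw)
            except ValueError:
                print(f"'{raw}' is not a number, please try again.")
                continue
            if low <= value <= high:
                scores[criterion] = value
                break
            print(f"Score must be between {low:g} and {high:g}.")
    return scores

# Non-interactive demonstration with canned answers (hypothetical values);
# "abc" and "11" are rejected and the next answer is used instead.
answers = iter(["9", "abc", "11", "8", "10"])
scores = collect_manual_scores(
    ["Clarity", "Accuracy", "Simplicity"], ask=lambda _: next(answers)
)
average = sum(scores.values()) / len(scores)
print(scores, average)
```

Returning the scores makes it straightforward to average them per prompt or to store them alongside the automated metrics.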


In [4]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)
    
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")
    
    return {"relevance": relevance, "specificity": specificity}

# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning**: In supervised learning, the model is trained on labeled data, which means that each training example is paired with an output label. The algorithm learns to map inputs to the correct outputs by minimizing the error between predicted and actual values. Common applications include classification tasks (e.g., spam detection) and regression tasks (e.g., predicting house prices).

2. **Unsupervised Learning**: Unsupervised learning involves training a model on data that does not have labeled outputs. The goal is to find hidden patterns or intrinsic structures within the data. Common techniques include clustering (e.g., grouping customers by purchasing behavior) and dimensionality reduction (e.g., simplifying data by reducing the number of features).

3. **Reinforcement Learning**: In reinforcement learning, an agent learns to make decisions by takin

{'relevance': 0.767517, 'specificity': 0.6836734693877551}
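
The automated evaluation above reports relevance and specificity, while `consistency_score` is defined earlier but never exercised. The sketch below shows one way consistency could be measured by sampling the same prompt several times. To keep it self-contained it uses a word-overlap (Jaccard) similarity and a canned generator as stand-ins; the names `jaccard`, `consistency`, and `canned` are hypothetical.

```python
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Stand-in similarity: overlap of lowercase word sets (not embeddings)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency(prompt, generate, similarity, n=3):
    """Sample n responses to one prompt and average their pairwise similarity."""
    responses = [generate(prompt) for _ in range(n)]
    if len(responses) < 2:
        return 1.0  # a single response is trivially consistent
    return mean(similarity(a, b) for a, b in combinations(responses, 2))

# Hypothetical generator that returns canned replies instead of calling a model.
canned = iter([
    "supervised unsupervised reinforcement",
    "supervised unsupervised reinforcement",
    "supervised unsupervised semi-supervised",
])
score = consistency(
    "List the types of machine learning.", lambda p: next(canned), jaccard
)
print(round(score, 3))
```

In the notebook itself, `generate` could presumably be `lambda p: llm.invoke(p).content` and `similarity` could be `semantic_similarity`. Repeated sampling is only informative when the model's sampling settings allow the responses to vary.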

In [5]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})
    
    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)
    
    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")
    
    return sorted_results

# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can generally be categorized into several types based on how learning is structured and the type of feedback provided. The primary types include:

1. **Supervised Learning**: In this approach, the model is trained on a labeled dataset, meaning that each training example is paired with an output label. The goal is to learn a mapping from inputs to outputs. Common algorithms include:
   - Linear Regression
   - Logistic Regression
   - Decision Trees
   - Support Vector Machines (SVM)
   - Neural Networks
   - Random Forests

2. **Unsupervised Learning**: This type involves training on data without labeled responses. The goal is to identify patterns or structures within the data. Common techniques include:
   - Clustering (e.g., K-Means, Hierarchical Clustering)
   - Dimensionality Reduction (e.g., Principal Component Analysis, t-SNE)
   - Association Rule Learning (e.g., Apriori Algorithm)

3. **Semi-Supervised Learn

Prompt: What are the main categories of machine learning algorithms?
Response: Machine learning algorithms can be broadly categorized into several main categories based on how they learn from data and the type of tasks they are designed to perform. Here are the primary categories:

1. **Supervised Learning**:
   - Algorithms are trained on labeled data, meaning that the input data is paired with the correct output (the label).
   - Common tasks include classification (e.g., identifying categories) and regression (e.g., predicting continuous values).
   - Examples: Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), Neural Networks.

2. **Unsupervised Learning**:
   - Algorithms are trained on unlabeled data, where the system tries to learn the underlying structure of the data without explicit output labels.
   - Common tasks include clustering (grouping similar instances) and dimensionality reduction (reducing the number of features).
   - Examples: K

[{'prompt': 'List the types of machine learning.',
  'relevance': 0.73224485,
  'specificity': 0.6276923076923077},
 {'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': 0.67915523,
  'specificity': 0.5989717223650386},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': 0.66815615,
  'specificity': 0.5982300884955752}]

In [6]:
def evaluate_prompt(prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content
    
    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)
    
    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)
    
    return {"prompt": prompt, "response": response, **auto_results}

# Example usage
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting in machine learning refers to a modeling error that occurs when a machine learning model captures noise or random fluctuations in the training data, rather than the underlying patterns that generalize to unseen data. When a model is overfitted, it performs well on the training dataset but poorly on validation or test datasets because it has learned to recognize too many specific details and anomalies in the training data, rather than the underlying relationships.

### Key Characteristics of Overfitting:

1. **High Training Accuracy and Low Testing Accuracy**: An overfitted model shows very high performance (accuracy, error rate, etc.) on the training set while exhibiting significantly lower performance on unseen data.

2. **Complexity of the Model**: Overfitting is often associated with models that are too complex relative to the amount of training data available. Simple models m

Score for Clarity (0-10):  8
Clarity: 8.0/10
Score for Accuracy (0-10):  10
Accuracy: 10.0/10
Score for Relevance (0-10):  10
Relevance: 10.0/10

Additional Comments:
Enter any additional comments:  best prompt
Comments: best prompt


{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting in machine learning refers to a modeling error that occurs when a machine learning model captures noise or random fluctuations in the training data, rather than the underlying patterns that generalize to unseen data. When a model is overfitted, it performs well on the training dataset but poorly on validation or test datasets because it has learned to recognize too many specific details and anomalies in the training data, rather than the underlying relationships.\n\n### Key Characteristics of Overfitting:\n\n1. **High Training Accuracy and Low Testing Accuracy**: An overfitted model shows very high performance (accuracy, error rate, etc.) on the training set while exhibiting significantly lower performance on unseen data.\n\n2. **Complexity of the Model**: Overfitting is often associated with models that are too complex relative to the amount of training data available. Simple models may und