# Evaluating Prompt Effectiveness
### Overview
This tutorial focuses on methods and techniques for evaluating the effectiveness of prompts in AI language models. We'll explore various metrics for measuring prompt performance and discuss both manual and automated evaluation techniques.

### Motivation
As prompt engineering becomes increasingly crucial in AI applications, it's essential to have robust methods for assessing prompt effectiveness. This enables developers and researchers to optimize their prompts, leading to better AI model performance and more reliable outputs.

### Key Components
1. Metrics for measuring prompt performance
2. Manual evaluation techniques
3. Automated evaluation techniques
4. Practical examples using Gemini and LangChain

### Method Details
We'll start by setting up our environment and introducing key metrics for evaluating prompts. We'll then explore manual evaluation techniques, including human assessment and comparative analysis. Next, we'll look at automated evaluation methods, using techniques like perplexity scoring and automated semantic similarity comparisons. Throughout the tutorial, we'll provide practical examples using Gemini models and the LangChain library to demonstrate these concepts in action.

### Conclusion
By the end of this tutorial, you'll have a comprehensive understanding of how to evaluate prompt effectiveness using both manual and automated techniques. You'll be equipped with practical tools and methods to optimize your prompts, leading to more efficient and accurate AI model interactions.

### Setup
First, let's import the necessary libraries and set up our environment.

In [2]:
import os
import numpy as np
from langchain_core.prompts import PromptTemplate
from sklearn.metrics.pairwise import cosine_similarity
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings

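# Load environment variables from a .env file into os.environ
from dotenv import load_dotenv
load_dotenv()

# load_dotenv() has already copied the keys into os.environ, so we only check
# that the Google API key is present. Assigning os.getenv(...) back into
# os.environ raises a TypeError when the variable is missing.
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
# HF_TOKEN, if defined in .env, is now available in the environment as well.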

# Initialize the language model and the embedding model
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    # Generate embeddings for both texts
    embeddings1 = embedding.embed_query(text1)
    embeddings2 = embedding.embed_query(text2)
    
    # Calculate cosine similarity
    return cosine_similarity([embeddings1], [embeddings2])[0][0]
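
To make the cosine similarity step less of a black box, here is a minimal sketch of the same formula in plain Python. It uses small hand-made vectors instead of real embeddings, so the helper name `cosine_sim` and the toy vectors are assumptions for illustration only, not part of the tutorial's pipeline.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

# Toy 2-dimensional "embeddings" (made up for illustration)
same_direction = cosine_sim([3.0, 4.0], [6.0, 8.0])   # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their length, which is why a long response and a short one can still score as highly similar.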

### Metrics for Measuring Prompt Performance
Let's define some key metrics for evaluating prompt effectiveness:

In [3]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)

def specificity_score(response):
    """Calculate specificity as the ratio of unique words to total words (type-token ratio)."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0
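
Two details of these metrics are easier to see on toy input. The sketch below is a hedged illustration, not a replacement for the functions above: `type_token_ratio` shows how case and punctuation affect a unique-word ratio, and `mean_pairwise` shows the "average over all pairs" idea behind the consistency score. The `jaccard` helper is a stand-in similarity based on word overlap, assumed here only so the example runs without an embedding model.

```python
import re
from itertools import combinations
from statistics import mean

def type_token_ratio(text, normalize=True):
    """Unique words / total words; optionally ignore case and punctuation."""
    if normalize:
        words = re.findall(r"[a-z0-9']+", text.lower())
    else:
        words = text.split()  # same tokenization as a plain whitespace split
    return len(set(words)) / len(words) if words else 0.0

def mean_pairwise(items, sim):
    """Average sim(a, b) over all unordered pairs; 1.0 with fewer than two items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0
    return mean(sim(a, b) for a, b in pairs)

def jaccard(a, b):
    """Stand-in similarity (word-set overlap), NOT the embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

text = "The cat sat. the Cat sat!"
print(type_token_ratio(text, normalize=False))  # every token looks unique
print(type_token_ratio(text))                   # repeats are detected

print(mean_pairwise(["a b", "a b", "a c"], jaccard))
```

With a plain whitespace split, `sat.` and `sat!` count as different words, so the unnormalized ratio can overstate how varied a response is.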

### Manual Evaluation Techniques
Manual evaluation involves human assessment of prompt-response pairs. Let's create a function to simulate this process:

In [4]:
def manual_evaluation(prompt, response, criteria):
    """Simulate manual evaluation of a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    for criterion in criteria:
        score = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {score}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")

# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Imagine you're teaching a dog a trick. You show it what you want it to do, and when it gets it right, you give it a treat. The dog learns by seeing examples and getting feedback.

Machine learning is similar, but instead of a dog, it's a computer.  We give the computer lots of data (like examples) and tell it what we want it to learn.  The computer then analyzes the data and tries to find patterns and rules.  

Instead of treats, the computer gets better at making predictions or decisions. The more data it sees, the better it gets.

**Here's a breakdown:**

* **Data:**  The "examples" you give the computer. This could be anything from pictures of cats and dogs to sales figures from the past year.
* **Learning:** The process of the computer finding patterns and relationships in the data.
* **Model:** The "trick" the computer learns. It's a set of rules or a formula that helps it make predictions or decisions.
* *
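
The manual evaluation function above prints the scores but does not validate or return them. If we want to store human ratings alongside automated ones, a variant along the following lines could help. This is a minimal sketch: `collect_scores` and its `ask` parameter are hypothetical names introduced here, and `ask` defaults to the built-in `input` so the function can also be driven by scripted answers.

```python
def collect_scores(criteria, ask=input, low=0.0, high=10.0):
    """Ask for one numeric score per criterion, re-asking on invalid input."""
    scores = {}
    for criterion in criteria:
        while True:
            raw = ask(f"Score for {criterion} ({low:g}-{high:g}): ")
            try:
                value = float(raw)
            except ValueError:
                print("Please enter a number.")
                continue
            if low <= value <= high:
                scores[criterion] = value
                break
            print(f"Score must be between {low:g} and {high:g}.")
    return scores

# Non-interactive demo: scripted answers stand in for a human rater.
# "abc" and "11" are rejected, then 8 and 9.5 are accepted.
answers = iter(["abc", "11", "8", "9.5"])
scores = collect_scores(["Clarity", "Accuracy"], ask=lambda _: next(answers))
print(scores)
```

Returning a dictionary makes it straightforward to average ratings across several evaluators or to compare them with the automated metrics later.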

### Automated Evaluation Techniques
Now, let's implement some automated evaluation techniques:

In [5]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)
    
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")
    
    return {"relevance": relevance, "specificity": specificity}

# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1.  **Supervised Learning:** In supervised learning, the algorithm is trained on a labeled dataset, meaning the data includes both the input features and the desired output (the "label"). The algorithm learns a mapping function that can predict the output for new, unseen input data.  Think of it like learning with a teacher who provides the correct answers.

    *   **Examples:** Image classification (identifying objects in images), spam detection (classifying emails as spam or not spam), regression (predicting house prices based on features like size and location).

2.  **Unsupervised Learning:**  In unsupervised learning, the algorithm is trained on an unlabeled dataset, meaning the data only includes the input features without any corresponding output labels. The algorithm aims to discover hidden patterns, structures, or relationships within the data without any prior 

{'relevance': 0.7502844021521002, 'specificity': 0.6428571428571429}
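
The method overview mentions perplexity scoring, which the cell above does not compute. As a rough sketch of the idea: perplexity is the exponential of the negative mean log-probability per token, so lower values mean the model found the text more predictable. The example assumes we already have per-token natural-log probabilities; how to obtain them from a particular model or API is not shown here and would need to be checked against that provider's documentation.

```python
import math

def perplexity(token_logprobs):
    """exp(-mean(log p)) over a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up log-probabilities for illustration only
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain = [math.log(0.25)] * 8  # uniform choice among 4 options per token

print(perplexity(confident))
print(perplexity(uncertain))   # approximately 4
```

A model that picks uniformly among four options at every step has a perplexity of about 4, which gives an intuitive reading of the number as an "effective branching factor".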

### Comparative Analysis
Let's compare the effectiveness of different prompts for the same task:

In [6]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})
    
    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)
    
    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")
    
    return sorted_results

# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can be broadly categorized into the following types:

**1. Supervised Learning:**

*   **Definition:** Learns from labeled data, where the input features and the corresponding output are provided. The algorithm learns a mapping function to predict the output for new, unseen inputs.
*   **Types of Problems:**
    *   **Classification:** Predicts a categorical output (e.g., spam or not spam, cat or dog).
    *   **Regression:** Predicts a continuous output (e.g., house price, temperature).
*   **Common Algorithms:**
    *   Linear Regression
    *   Logistic Regression
    *   Support Vector Machines (SVM)
    *   Decision Trees
    *   Random Forest
    *   K-Nearest Neighbors (KNN)
    *   Naive Bayes
    *   Neural Networks (e.g., Multi-Layer Perceptron)

**2. Unsupervised Learning:**

*   **Definition:** Learns from unlabeled data, where only the input features are provided. The algorithm aims to discover hidden p

[{'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': 0.7058402065193237,
  'specificity': 0.5550786838340487},
 {'prompt': 'List the types of machine learning.',
  'relevance': 0.6883750515394957,
  'specificity': 0.5508196721311476},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': 0.6571559594837572,
  'specificity': 0.4625801853171775}]
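
The ranking above is based on a single response per prompt, and the relevance scores are close together, so another run could plausibly reorder them. One way to get a steadier comparison is to sample each prompt several times and rank by the mean score. The sketch below assumes two callables that we pass in ourselves: `generate(prompt)` returning a response string and `score(response)` returning a number. The demo uses deterministic stand-ins rather than a model call, and all names in it are hypothetical.

```python
from itertools import cycle
from statistics import mean, stdev

def compare_prompts_repeated(prompts, generate, score, n_samples=3):
    """Score each prompt over n_samples generations and rank by mean score."""
    rows = []
    for prompt in prompts:
        scores = [score(generate(prompt)) for _ in range(n_samples)]
        rows.append({
            "prompt": prompt,
            "mean": mean(scores),
            # Spread across samples; 0.0 when only one sample was drawn
            "spread": stdev(scores) if len(scores) > 1 else 0.0,
        })
    return sorted(rows, key=lambda r: r["mean"], reverse=True)

# Deterministic stand-ins so the example runs without an API key
canned = {
    "short prompt": ["a b", "a b c", "a"],
    "long prompt": ["a b c d", "a b c", "a b c d e"],
}
streams = {p: cycle(responses) for p, responses in canned.items()}

def fake_generate(prompt):
    return next(streams[prompt])

def word_count(response):
    return float(len(response.split()))

for row in compare_prompts_repeated(list(canned), fake_generate, word_count):
    print(row)
```

In the notebook, `generate` could be `lambda p: llm.invoke(p).content` and `score` could be `lambda r: relevance_score(r, expected_content)`. A large spread relative to the gap between means suggests the ranking is not yet reliable.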

### Putting It All Together
Now, let's create a comprehensive prompt evaluation function that combines both manual and automated techniques:

In [7]:
def evaluate_prompt(prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content
    
    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)
    
    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)
    
    return {"prompt": prompt, "response": response, **auto_results}

# Example usage
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: ## Overfitting in Machine Learning: A Detailed Explanation

Overfitting is a common problem in machine learning where a model learns the training data **too well**, including its noise and specific patterns.  This results in a model that performs exceptionally well on the training data but performs poorly on new, unseen data (the test data).  Essentially, the model has memorized the training data instead of learning the underlying, generalizable patterns.

**Here's a breakdown of the key aspects:**

* **What it means:**
    * The model has learned the **noise** and **outliers** present in the training data.
    * The model is too complex and has too many parameters.
    * The model's decision boundaries are overly intricate and tailored to the specific training instances.
    * The model fails to generalize to new data because it's specifically adapted to the quirks of the training set.

* *

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': '## Overfitting in Machine Learning: A Detailed Explanation\n\nOverfitting is a common problem in machine learning where a model learns the training data **too well**, including its noise and specific patterns.  This results in a model that performs exceptionally well on the training data but performs poorly on new, unseen data (the test data).  Essentially, the model has memorized the training data instead of learning the underlying, generalizable patterns.\n\n**Here\'s a breakdown of the key aspects:**\n\n* **What it means:**\n    * The model has learned the **noise** and **outliers** present in the training data.\n    * The model is too complex and has too many parameters.\n    * The model\'s decision boundaries are overly intricate and tailored to the specific training instances.\n    * The model fails to generalize to new data because it\'s specifically adapted to the quirks of the training set.\n\n