
# Evaluating Prompt Effectiveness

## Overview
This tutorial focuses on methods and techniques for evaluating the effectiveness of prompts in AI language models. We'll explore various metrics for measuring prompt performance and discuss both manual and automated evaluation techniques.

## Motivation
As prompt engineering becomes increasingly crucial in AI applications, it's essential to have robust methods for assessing prompt effectiveness. This enables developers and researchers to optimize their prompts, leading to better AI model performance and more reliable outputs.

## Key Components
1. Metrics for measuring prompt performance
2. Manual evaluation techniques
3. Automated evaluation techniques
4. Practical examples using OpenAI and LangChain

## Method Details
We'll start by setting up our environment and introducing key metrics for evaluating prompts. We'll then explore manual evaluation techniques, including human assessment and comparative analysis. Next, we'll cover automated evaluation methods, such as perplexity scoring and automated semantic similarity comparisons. Throughout the tutorial, we'll provide practical examples using OpenAI's GPT models and the LangChain library to demonstrate these concepts in action.

## Conclusion
By the end of this tutorial, you'll have a comprehensive understanding of how to evaluate prompt effectiveness using both manual and automated techniques. You'll be equipped with practical tools and methods to optimize your prompts, leading to more efficient and accurate AI model interactions.

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
!pip install -q langchain-openai python-dotenv sentence-transformers scikit-learn

In [None]:
import os
from langchain_openai import ChatOpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

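# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file, if present
from dotenv import load_dotenv
load_dotenv()

# Fail early with a clear message instead of overwriting the key with an empty string
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your environment or .env file.")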

# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize sentence transformer for semantic similarity
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    embeddings = sentence_model.encode([text1, text2])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
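
For intuition about what `semantic_similarity` returns: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) to 1 (same direction). The sketch below computes it in plain Python on small made-up vectors rather than real sentence embeddings; `cosine` is a hypothetical helper name, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero-length vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for embeddings (illustrative values only)
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # unrelated directions
opposite = cosine([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector (as in the first pair) does not change the result.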


## Metrics for Measuring Prompt Performance

Let's define some key metrics for evaluating prompt effectiveness:

In [None]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)

def specificity_score(response):
    """Calculate specificity as the ratio of unique words to total words."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0
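
To see how these metrics behave on concrete inputs, here is a small dependency-free sketch. It mirrors the unique-word ratio used by `specificity_score` and the pairwise averaging used by `consistency_score`, but swaps the embedding-based similarity for a simple word-overlap (Jaccard) measure so it runs without downloading a model. The names `type_token_ratio`, `jaccard`, and `mean_pairwise` are hypothetical, and the sample strings are made up.

```python
from itertools import combinations

def type_token_ratio(text):
    """Unique words divided by total words, mirroring specificity_score."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0

def jaccard(a, b):
    """Stand-in similarity: shared words / all words (NOT the embedding metric)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def mean_pairwise(items, similarity):
    """Average similarity over all unordered pairs, mirroring consistency_score."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0  # a single item is trivially consistent with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

ttr_repetitive = type_token_ratio("the model the model the model")    # 2 unique / 6 total
ttr_varied = type_token_ratio("supervised unsupervised reinforcement")  # 3 unique / 3 total

responses = ["cats are pets", "cats are animals", "dogs are pets"]
consistency = mean_pairwise(responses, jaccard)  # pair scores 0.5, 0.5, 0.2
print(ttr_repetitive, ttr_varied, consistency)
```

Note that a three-word answer with no repeated words gets a ratio of 1.0, so this metric rewards lexical variety and brevity rather than depth of detail. It is worth reading alongside the relevance score rather than on its own.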

## Manual Evaluation Techniques

Manual evaluation involves human assessment of prompt-response pairs. Let's create a function that shows a reviewer the prompt and response, then asks for a score on each criterion:

In [None]:
def manual_evaluation(prompt, response, criteria):
    """Collect human scores (0-10) and comments for a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    scores = {}
    for criterion in criteria:
        # Keep each score so it can be analyzed later instead of only printed
        scores[criterion] = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {scores[criterion]}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")
    return {"scores": scores, "comments": comments}

# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Machine learning is a field of computer science that teaches computers to learn from data and make decisions or predictions without being explicitly programmed for each specific task. 

In simple terms, think of it like teaching a child. Instead of telling the child every single rule or fact, you show them examples and let them figure things out on their own. For instance, if you show a child many pictures of cats and dogs, they will eventually learn to distinguish between the two animals based on the features you’ve shown them.

Similarly, in machine learning, you feed a computer a lot of data (like images, text, or numbers) and it analyzes this data to find patterns. Once it learns these patterns, you can give it new data, and it can make predictions or decisions based on what it has learned. 

Overall, machine learning is about creating algorithms that allow computers to improve their performance on a task ov
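
When more than one person scores the same response, it helps to summarize the ratings per criterion. The following is a minimal sketch, assuming each rater's scores are stored as a dictionary keyed by criterion; `aggregate_ratings` is a hypothetical helper and the scores are invented for illustration, not measured data.

```python
from statistics import mean, pstdev

def aggregate_ratings(ratings):
    """Return per-criterion mean and population standard deviation across raters."""
    summary = {}
    for criterion in ratings[0].keys():
        scores = [rater[criterion] for rater in ratings]
        summary[criterion] = {"mean": mean(scores), "spread": pstdev(scores)}
    return summary

# Hypothetical scores from three raters on a 0-10 scale
ratings = [
    {"Clarity": 8, "Accuracy": 9, "Simplicity": 7},
    {"Clarity": 9, "Accuracy": 9, "Simplicity": 5},
    {"Clarity": 7, "Accuracy": 9, "Simplicity": 9},
]
summary = aggregate_ratings(ratings)
for criterion, stats in summary.items():
    print(f"{criterion}: mean={stats['mean']:.2f}, spread={stats['spread']:.2f}")
```

A large spread on one criterion (here, Simplicity) suggests the raters interpret it differently, which is usually a sign that the criterion needs a clearer definition before the scores are compared across prompts.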

## Automated Evaluation Techniques

Now, let's implement some automated evaluation techniques:

In [None]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")

    return {"relevance": relevance, "specificity": specificity}

# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning**: In supervised learning, the model is trained on a labeled dataset, which means that each training example is paired with an output label. The algorithm learns to map inputs to the correct outputs, and it can then predict the labels for new, unseen data. Common applications include classification tasks (e.g., spam detection) and regression tasks (e.g., predicting housing prices).

2. **Unsupervised Learning**: In unsupervised learning, the model is trained on data that does not have labeled outputs. The goal is to identify patterns or structures within the data. This type of learning is commonly used for clustering (grouping similar data points together) and dimensionality reduction (reducing the number of features while preserving essential information). Examples include customer segmentation and anomaly detection.

3. **Reinforcement Learning*

{'relevance': np.float32(0.7616832), 'specificity': 0.6763285024154589}
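
The method overview mentions perplexity scoring, which the cells above do not implement. As a minimal sketch, assuming you can obtain per-token natural-log probabilities for a response (how to get them depends on your model provider and is not shown here), perplexity is the exponential of the negative mean log-probability. The `perplexity` helper and the probability values below are illustrative assumptions.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up log-probabilities for two hypothetical four-token responses
confident = [math.log(0.5)] * 4   # every token assigned probability 0.5
uncertain = [math.log(0.1)] * 4   # every token assigned probability 0.1
ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(f"{ppl_confident:.2f} {ppl_uncertain:.2f}")
```

Lower perplexity means the model assigned higher probability to the tokens it produced. It measures the model's confidence, not correctness, so it complements the relevance score rather than replacing it.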

## Comparative Analysis

Let's compare the effectiveness of different prompts for the same task:

In [None]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})

    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)

    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")

    return sorted_results

# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can be broadly categorized into several types based on the nature of the learning processes and the types of data available. Here are the main types:

1. **Supervised Learning**: 
   - Involves training a model on a labeled dataset, which means the input data is paired with the correct output. The model learns to predict the output from the input data. Common algorithms include:
     - Linear Regression
     - Logistic Regression
     - Decision Trees
     - Support Vector Machines (SVM)
     - Neural Networks
     
2. **Unsupervised Learning**: 
   - Involves training a model on an unlabeled dataset, where the model tries to learn the underlying structure or distribution in the data without explicit feedback. Common techniques include:
     - K-Means Clustering
     - Hierarchical Clustering
     - Principal Component Analysis (PCA)
     - t-Distributed Stochastic Neighbor Embedding (t-SNE)
     - Autoencoders

3. 

[{'prompt': 'List the types of machine learning.',
  'relevance': np.float32(0.76925355),
  'specificity': 0.55},
 {'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': np.float32(0.7125774),
  'specificity': 0.6348122866894198},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': np.float32(0.6880341),
  'specificity': 0.5311973018549747}]
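
`compare_prompts` ranks by relevance alone. If specificity also matters for your task, one option is a weighted blend of the two scores. The sketch below reuses the scores from the output above (rounded to four decimals); `rank_by_composite` and the weights are assumptions for illustration, not a recommended setting.

```python
def rank_by_composite(results, relevance_weight):
    """Sort results by a weighted blend of relevance and specificity."""
    def composite(result):
        return (relevance_weight * result["relevance"]
                + (1 - relevance_weight) * result["specificity"])
    return sorted(results, key=composite, reverse=True)

# Scores taken from the comparison output, rounded to four decimals
results = [
    {"prompt": "List the types of machine learning.",
     "relevance": 0.7693, "specificity": 0.5500},
    {"prompt": "What are the main categories of machine learning algorithms?",
     "relevance": 0.7126, "specificity": 0.6348},
    {"prompt": "Explain the different approaches to machine learning.",
     "relevance": 0.6880, "specificity": 0.5312},
]

relevance_heavy = [r["prompt"] for r in rank_by_composite(results, 0.7)]
equal_weights = [r["prompt"] for r in rank_by_composite(results, 0.5)]
print(relevance_heavy[0])
print(equal_weights[0])
```

With a 0.7 relevance weight the order matches the relevance-only ranking, while equal weights move the second prompt to the top. The ranking is sensitive to the weighting, so choose weights to reflect what the task actually needs, and consider repeating each prompt several times before trusting small differences.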

## Putting It All Together

Now, let's create a comprehensive prompt evaluation function that combines both manual and automated techniques:

In [None]:
def evaluate_prompt(prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content

    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)

    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)

    return {"prompt": prompt, "response": response, **auto_results}

# Example usage
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting is a common issue in machine learning that occurs when a model learns not only the underlying patterns in the training data but also the noise and outliers present in that specific dataset. As a result, an overfitted model performs exceptionally well on the training data but fails to generalize to new, unseen data.

### Key Characteristics of Overfitting:

1. **High Training Accuracy**: The model shows a low training error, indicating that it has memorized the training data well.

2. **Low Validation/Test Accuracy**: When evaluated on validation or test datasets, the performance of the model significantly drops compared to its performance on the training data.

3. **Complex Models**: Overfitting is more likely to occur with overly complex models, such as those with too many parameters or layers relative to the amount of training data available.

4. **Sensitivity to Noise**: An ov

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting is a common issue in machine learning that occurs when a model learns not only the underlying patterns in the training data but also the noise and outliers present in that specific dataset. As a result, an overfitted model performs exceptionally well on the training data but fails to generalize to new, unseen data.\n\n### Key Characteristics of Overfitting:\n\n1. **High Training Accuracy**: The model shows a low training error, indicating that it has memorized the training data well.\n\n2. **Low Validation/Test Accuracy**: When evaluated on validation or test datasets, the performance of the model significantly drops compared to its performance on the training data.\n\n3. **Complex Models**: Overfitting is more likely to occur with overly complex models, such as those with too many parameters or layers relative to the amount of training data available.\n\n4. **Sensitivity to Noise**: An over