# DeepEval: LLM Evaluation Framework Tutorial

DeepEval is an open-source framework for evaluating the outputs of Large Language Models (LLMs). It works much like Pytest but is specialized for LLM outputs, and it offers 40+ research-backed evaluation metrics for assessing dimensions such as correctness, faithfulness to context, relevancy, and safety.

## Key Features
- **LLM-as-a-Judge**: Uses an LLM to score outputs against natural-language criteria, and returns a reason alongside each score
- **Comprehensive Metrics**: G-Eval, Faithfulness, Toxicity, Answer Relevancy, and more
- **Easy Integration**: Works with any LLM provider (OpenAI, Anthropic, Hugging Face, etc.)
- **Unit Testing**: Pytest-like interface for systematic LLM testing

## Installation
```bash
pip install deepeval
```

In [None]:
# Install DeepEval if not already installed
!pip install deepeval python-dotenv

# Load API keys from a local .env file
import os
from dotenv import load_dotenv

# load_dotenv() copies the variables in .env into os.environ,
# so no manual re-assignment is needed afterwards
load_dotenv()

# Report which provider keys are available instead of assuming success
for key_name in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY"):
    status = "set" if os.getenv(key_name) else "missing"
    print(f"{key_name}: {status}")

## Core Concepts

### LLMTestCase
The fundamental unit in DeepEval representing a single LLM interaction with:
- **input**: The prompt/question
- **actual_output**: LLM's response
- **expected_output**: Ideal answer (optional)
- **retrieval_context**: Retrieved text chunks for RAG applications, passed as a list of strings (optional)
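
Which optional fields a test case needs depends on the metrics you plan to run. The sketch below uses a plain dictionary and a hypothetical `REQUIRED_FIELDS` table to check a case before evaluation. Neither `REQUIRED_FIELDS` nor `missing_fields` is part of DeepEval, and the field requirements listed are assumptions based on how the metrics are described in this tutorial.

```python
# Hypothetical pre-flight check; REQUIRED_FIELDS and missing_fields are
# illustrative names, not DeepEval APIs.
REQUIRED_FIELDS = {
    "answer_relevancy": ("input", "actual_output"),
    "faithfulness": ("input", "actual_output", "retrieval_context"),
}

def missing_fields(case: dict, metric: str) -> list[str]:
    """Return the fields the metric needs that are absent or empty in the case."""
    return [name for name in REQUIRED_FIELDS[metric] if not case.get(name)]

case = {
    "input": "What is the capital of France?",
    "actual_output": "The capital of France is Paris.",
}

print(missing_fields(case, "answer_relevancy"))  # []
print(missing_fields(case, "faithfulness"))      # ['retrieval_context']
```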

### Evaluation Metrics
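DeepEval provides research-backed metrics for LLM assessment. Each metric assigns a test case a score, typically between 0 and 1, and compares it against a threshold to produce a pass/fail result. This tutorial covers four of them: G-Eval, Faithfulness, Toxicity, and Answer Relevancy.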

In [None]:
# Import necessary libraries
import deepeval
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    GEval,
    FaithfulnessMetric,
    ToxicityMetric,
    AnswerRelevancyMetric
)

# Create sample test cases
test_cases = [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris is the capital of France."
    ),
    LLMTestCase(
        input="Explain quantum computing in simple terms.",
        actual_output="Quantum computing uses quantum mechanics principles like superposition and entanglement to process information in ways classical computers cannot, potentially solving certain problems exponentially faster.",
        expected_output="Quantum computing is a type of computing that uses quantum mechanical phenomena to process information differently than classical computers."
    )
]

print(f"Created {len(test_cases)} test cases")

## 1. G-Eval Metric

G-Eval uses LLM-as-a-judge with chain-of-thought reasoning to evaluate outputs based on custom criteria. It's the most versatile metric in DeepEval.
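
The G-Eval paper reduces the judge's rating to a single number by weighting each candidate rating by the probability the judge assigns to it. A minimal sketch of that step follows. The probabilities are made-up illustrative values, `weighted_score` is a hypothetical helper, and the final mapping onto a 0 to 1 range is an assumption about how a 1 to 5 rating could be rescaled, not DeepEval's exact implementation.

```python
def weighted_score(rating_probs: dict[int, float]) -> float:
    """Expected rating: sum of rating * probability, after renormalizing."""
    total = sum(rating_probs.values())
    return sum(rating * p for rating, p in rating_probs.items()) / total

# Illustrative judge probabilities for ratings 1-5 (not real model output)
probs = {1: 0.0, 2: 0.0, 3: 0.25, 4: 0.5, 5: 0.25}

expected = weighted_score(probs)        # 3*0.25 + 4*0.5 + 5*0.25 = 4.0
normalized = (expected - 1) / (5 - 1)   # map the 1-5 scale onto 0-1
print(expected, normalized)             # 4.0 0.75
```

Weighting by probability gives finer-grained scores than taking the single most likely rating, which would report 4 for any distribution that peaks there.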

In [None]:
# LLMTestCaseParams names the test-case fields that G-Eval passes to the judge
from deepeval.test_case import LLMTestCaseParams

# Create G-Eval metric for correctness
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check if the main facts in the actual output align with the expected output",
        "Verify there are no factual errors or contradictions",
        "Assess if the response directly answers the input question"
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Create G-Eval metric for coherence
coherence_metric = GEval(
    name="Coherence",
    criteria="Evaluate the coherence and logical flow of the actual output.",
    evaluation_steps=[
        "Check if the response flows logically from one point to the next",
        "Assess if the ideas are well-connected and organized",
        "Verify the response maintains consistency throughout"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

print("G-Eval metrics created successfully!")

## 2. Faithfulness Metric

Measures whether the LLM output is factually consistent with the provided `retrieval_context`. This is crucial for RAG applications, where it helps detect hallucinations.
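
DeepEval's documentation describes the faithfulness score as the share of claims in the output that the retrieval context supports. The sketch below reproduces only that final ratio. The claim verdicts are hand-written stand-ins for what the judge LLM would produce, and `faithfulness_score` is a hypothetical helper, not a DeepEval function.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims supported by the context (1.0 if there are no claims)."""
    if not verdicts:
        return 1.0  # assumption: nothing claimed means nothing contradicted
    return sum(verdicts) / len(verdicts)

# Hand-labelled verdicts for three claims extracted from an answer
verdicts = [True, True, False]  # the third claim is not backed by the context

score = faithfulness_score(verdicts)
threshold = 0.7
print(f"score={score:.2f}, passed={score >= threshold}")  # score=0.67, passed=False
```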

In [None]:
# Create test case with retrieval context for RAG evaluation
rag_test_case = LLMTestCase(
    input="What is the population of Tokyo?",
    actual_output="Tokyo has a population of approximately 14 million people in the city proper and about 38 million in the greater metropolitan area.",
    retrieval_context=[
        "Tokyo is the capital of Japan with a city population of around 14 million.",
        "The Greater Tokyo Area has a population of approximately 38 million people."
    ]
)

# Create Faithfulness metric
faithfulness_metric = FaithfulnessMetric(threshold=0.7)

print("Faithfulness metric created for RAG evaluation!")

## 3. Toxicity Metric

Detects harmful, offensive, or toxic content in LLM outputs to ensure safe and appropriate responses.
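
Unlike most metrics, a toxicity score is better when it is lower, so the threshold acts as a ceiling, not a floor. The sketch below contrasts the two pass rules. `passes` is a hypothetical helper and the scores are illustrative values, not output from DeepEval.

```python
def passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    """Apply the threshold as a ceiling for harm-style metrics, else as a floor."""
    return score <= threshold if lower_is_better else score >= threshold

# Same numbers, opposite outcomes depending on the metric's direction
print(passes(0.2, 0.5, lower_is_better=True))  # True: low toxicity passes
print(passes(0.2, 0.5))                        # False: low relevancy fails
print(passes(0.8, 0.5, lower_is_better=True))  # False: high toxicity fails
```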

In [None]:
# Create test cases for toxicity evaluation
toxicity_test_cases = [
    LLMTestCase(
        input="Tell me about renewable energy.",
        actual_output="Renewable energy sources like solar, wind, and hydroelectric power are sustainable alternatives to fossil fuels that help reduce environmental impact."
    ),
    LLMTestCase(
        input="How can I stay healthy?",
        actual_output="Maintaining a balanced diet, regular exercise, adequate sleep, and managing stress are key components of a healthy lifestyle."
    )
]

# Create Toxicity metric
# For toxicity, threshold is the maximum passing score: lower scores are better
toxicity_metric = ToxicityMetric(threshold=0.5)

print("Toxicity metric created for safety evaluation!")

## 4. Answer Relevancy Metric

Measures how well the LLM output addresses the input question, ensuring responses are on-topic and useful.

In [None]:
# Create Answer Relevancy metric
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

print("Answer Relevancy metric created for relevance evaluation!")

## Running Evaluations

Execute evaluations using the `evaluate()` function with your test cases and metrics.

In [None]:
# Run G-Eval evaluation
print("Running G-Eval Correctness evaluation...")
correctness_results = evaluate(
    test_cases=test_cases,
    metrics=[correctness_metric]
)

# Run multiple metrics evaluation
print("\nRunning comprehensive evaluation...")
comprehensive_results = evaluate(
    test_cases=test_cases,
    metrics=[coherence_metric, relevancy_metric]
)

# Run RAG-specific evaluation
print("\nRunning RAG faithfulness evaluation...")
rag_results = evaluate(
    test_cases=[rag_test_case],
    metrics=[faithfulness_metric]
)

# Run toxicity evaluation
print("\nRunning toxicity evaluation...")
toxicity_results = evaluate(
    test_cases=toxicity_test_cases,
    metrics=[toxicity_metric]
)

print("\nAll evaluations completed successfully!")

## Viewing Results

DeepEval provides detailed results including scores, reasons, and pass/fail status for each metric.
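
When there are many test cases, a per-metric pass rate gives a quick overview alongside the per-case details. The sketch below aggregates plain `(metric_name, success)` tuples. In practice these could be collected from each test result's metric data, but the record shape here is a simplifying assumption and `pass_rates` is a hypothetical helper.

```python
from collections import defaultdict

def pass_rates(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each metric name to the fraction of test cases it passed."""
    passed, total = defaultdict(int), defaultdict(int)
    for name, success in records:
        total[name] += 1
        passed[name] += int(success)
    return {name: passed[name] / total[name] for name in total}

# Illustrative records, not real evaluation output
records = [
    ("Correctness", True),
    ("Correctness", False),
    ("Toxicity", True),
    ("Toxicity", True),
]
print(pass_rates(records))  # {'Correctness': 0.5, 'Toxicity': 1.0}
```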

In [None]:
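# Function to display results
# Assumes evaluate() returns an object exposing a `test_results` list
def display_results(results, metric_name):
    print(f"\n=== {metric_name} Results ===")
    for i, result in enumerate(results.test_results, start=1):
        print(f"\nTest Case {i}:")
        print(f"Input: {result.input}")
        # Truncate long outputs and add "..." only when something was cut
        output = result.actual_output or ""
        print(f"Output: {output[:100]}{'...' if len(output) > 100 else ''}")

        for metric_result in result.metrics_data or []:
            print(f"Metric: {metric_result.name}")
            # The score can be None if the metric errored during evaluation
            if metric_result.score is not None:
                print(f"Score: {metric_result.score:.3f}")
            print(f"Success: {metric_result.success}")
            if getattr(metric_result, "reason", None):
                print(f"Reason: {metric_result.reason}")
        print("-" * 50)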

# Display all results
display_results(correctness_results, "G-Eval Correctness")
display_results(comprehensive_results, "Comprehensive Evaluation")
display_results(rag_results, "RAG Faithfulness")
display_results(toxicity_results, "Toxicity Check")

## Best Practices

1. **Choose Appropriate Metrics**: Select metrics relevant to your use case (RAG, chatbots, content generation)
2. **Set Realistic Thresholds**: Adjust thresholds based on your quality requirements
3. **Use Multiple Metrics**: Combine different metrics for comprehensive evaluation
4. **Custom Criteria**: Leverage G-Eval for domain-specific evaluation criteria
5. **Continuous Testing**: Integrate DeepEval into your CI/CD pipeline for ongoing quality assurance
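
One way to set a realistic threshold is to score a small set of outputs that people have already labelled as acceptable or not, then pick the cutoff that agrees with those labels most often. The sketch below does this by brute force. The scores and labels are invented for illustration, and `best_threshold` is a hypothetical helper, not part of DeepEval.

```python
def best_threshold(scores: list[float], labels: list[bool]) -> float:
    """Return the candidate cutoff whose pass/fail calls best match the labels."""
    def agreement(cutoff: float) -> int:
        return sum((s >= cutoff) == ok for s, ok in zip(scores, labels))
    # Try each observed score as a cutoff; ties go to the lowest cutoff
    return max(sorted(set(scores)), key=agreement)

# Invented metric scores with human accept/reject labels
scores = [0.35, 0.55, 0.62, 0.78, 0.91]
labels = [False, False, True, True, True]

print(best_threshold(scores, labels))  # 0.62
```

The chosen value can then be passed as the `threshold` argument when constructing a metric.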

## Conclusion

DeepEval provides a robust framework for LLM evaluation with research-backed metrics and easy integration. It enables systematic testing and quality assurance for LLM applications, helping ensure reliable and safe AI systems.