<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/prometheus_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation using [Prometheus](https://huggingface.co/TheBloke/prometheus-13B-v1.0-GPTQ) model

Evaluation is a crucial part of iterating on a RAG (Retrieval-Augmented Generation) pipeline. Until now, this process has relied heavily on GPT-4. However, an open-source model named [Prometheus](https://arxiv.org/abs/2310.08491) has recently emerged as an alternative for evaluation.

In this notebook, we demonstrate how to use the Prometheus model for evaluation through the LlamaIndex abstractions.

We run correctness, faithfulness, and relevancy evaluation with the Prometheus model on the two datasets from Llama Datasets listed below. If you haven't yet explored Llama Datasets, we recommend taking some time to read about them [here](https://blog.llamaindex.ai/introducing-llama-datasets-aadb9994ad9e).

1. Paul Graham Essay
2. Llama2

In [None]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

## Download Datasets

In [None]:
from llama_index.llama_dataset import download_llama_dataset

paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data/paul_graham"
)

llama2_rag_dataset, llama2_documents = download_llama_dataset(
    "Llama2PaperDataset", "./data/llama2"
)

## Define the Prometheus LLM hosted on Hugging Face and the prompt templates

We hosted the model on a Hugging Face Inference Endpoint with an Nvidia A10G GPU.

In [None]:
from llama_index.llms import HuggingFaceInferenceAPI

HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = (
    "https://lj6l3d9g2zwx5gfn.us-east-1.aws.endpoints.huggingface.cloud"
)

prometheus_llm = HuggingFaceInferenceAPI(
    model_name=HF_ENDPOINT_URL,
    token=HF_TOKEN,
    temperature=0.1,
    do_sample=True,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)


### Correctness Evaluation Prompt

In [None]:
prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing the evaluation criteria are given. 
			1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general. 
			2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric. 
			3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer between 1 and 5)" 
			4. Please do not generate any other opening, closing, and explanations. 

			###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}

			###Generated answer to evaluate: {generated_answer} 

            ###Reference Answer (Score 5): {reference_answer}
            
    		###Score Rubrics: 
            Score 1: If the generated answer is not relevant to the user query and reference answer.
            Score 2: If the generated answer is according to reference answer but not relevant to user query.
            Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
    		Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
            Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
    
    		###Feedback:"""
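
The placeholders `{query}`, `{generated_answer}` and `{reference_answer}` are filled in at evaluation time. As a rough illustration, the sketch below lists the placeholder names of a shortened template and fills them with plain `str.format`. The shortened template and the sample values are assumptions made for this example, and the evaluator's internal formatting may differ.

```python
from string import Formatter

# Shortened, hypothetical template with the same three placeholders
short_template = (
    "###The instruction to evaluate: {query}\n"
    "###Generated answer to evaluate: {generated_answer}\n"
    "###Reference Answer (Score 5): {reference_answer}\n"
    "###Feedback:"
)

# List the placeholder names the template expects
fields = [name for _, name, _, _ in Formatter().parse(short_template) if name]
print(fields)

# Fill the placeholders with made-up values
filled = short_template.format(
    query="Who wrote the essay?",
    generated_answer="Paul Graham.",
    reference_answer="The essay was written by Paul Graham.",
)
print(filled)
```

Listing the fields first is a cheap way to catch a misspelled placeholder before any request is sent to the endpoint.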

### Faithfulness Evaluation Prompt

In [None]:
prometheus_faithfulness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a piece of information, a context, and a score rubric representing evaluation criteria are given. 
	        1. You are provided with an evaluation task, together with the information and the context, and must give a result based on the score rubric.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
            4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)" 
            5. Please do not generate any other opening, closing, and explanations. 

        ###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context. You are provided with some examples for reference.

        ###Information: {query_str} 

        ###Context: {context_str}
            
        ###Score Rubrics: 
        Score YES: If the given piece of information is supported by context.
        Score NO: If the given piece of information is not supported by context
    
        ###Feedback: """

prometheus_faithfulness_refine_prompt_template = """###Task Description: An instruction (might include an Input inside it), a piece of information, context information, an existing answer, and a score rubric representing the evaluation criteria are given. 
			1. You are provided with an evaluation task, together with the information, the context information and an existing answer.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
			4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)" 
			5. Please do not generate any other opening, closing, and explanations. 

			###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.

			###Existing answer: {existing_answer} 

            ###Information: {query_str}

            ###Context: {context_msg}
            
    		###Score Rubrics: 
            Score YES: If the existing answer is already YES or If the Information is present in the context.
            Score NO: If the existing answer is NO and If the Information is not present in the context.
    
    		###Feedback: """

### Relevancy Evaluation Prompt

In [None]:
prometheus_relevancy_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given. 
            1. You are provided with evaluation task with the help of a query with response and context.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
            4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)" 
            5. Please do not generate any other opening, closing, and explanations. 

        ###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

        ###Query and Response: {query_str} 

        ###Context: {context_str}
            
        ###Score Rubrics: 
        Score YES: If the response for the query is in line with the context information provided.
        Score NO: If the response for the query is not in line with the context information provided.
    
        ###Feedback: """

prometheus_relevancy_refine_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, context, an existing answer, and a score rubric representing a evaluation criteria are given. 
			1. You are provided with evaluation task with the help of a query with response and context and an existing answer.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
			4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)" 
			5. Please do not generate any other opening, closing, and explanations. 

			###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

			###Existing answer: {existing_answer} 

			###Query and Response: {query_str} 

            ###Context: {context_msg}
            
    		###Score Rubrics: 
            Score YES: If the existing answer is already YES or if the response for the query is in line with the context information provided.
            Score NO: If the existing answer is NO and the response for the query is not in line with the context information provided.
    
    		###Feedback: """

## Set OpenAI key for indexing

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"

## Define parser function

The correctness evaluator uses it to extract the score and the feedback from the model output.

In [None]:
from typing import Optional, Tuple
import re


def parser_function(
    output_str: str,
) -> Tuple[Optional[float], Optional[str]]:
    # Match the feedback text, then '[RESULT]', then a single-digit score
    pattern = r"(.+?) \[RESULT\] (\d)"

    # Find all (feedback, score) pairs in the output
    matches = re.findall(pattern, output_str)

    if matches:
        # Use the first match; the prompt asks for exactly one result
        feedback, score = matches[0]
        return float(score), feedback.strip()

    # No '[RESULT] <digit>' found in the output
    return None, None
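
As a quick sanity check, the sketch below applies the same regular expression to two hand-written strings. The sample outputs and the `parse_prometheus_output` name are assumptions made for illustration; they are not real Prometheus responses.

```python
import re
from typing import Optional, Tuple


def parse_prometheus_output(
    output_str: str,
) -> Tuple[Optional[float], Optional[str]]:
    # Same pattern as parser_function: feedback, '[RESULT]', one digit
    match = re.search(r"(.+?) \[RESULT\] (\d)", output_str)
    if match is None:
        return None, None
    return float(match.group(2)), match.group(1).strip()


# Hypothetical outputs, written by hand for this example
well_formed = "Feedback: The answer matches the reference. [RESULT] 5"
malformed = "The answer looks fine overall."

print(parse_prometheus_output(well_formed))
print(parse_prometheus_output(malformed))
```

The second call returns `(None, None)` because the string contains no `[RESULT]` marker followed by a digit.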

## Define Correctness, Faithfulness, Relevancy Evaluators

In [None]:
from llama_index import ServiceContext
from llama_index.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

# Provide Prometheus model in service_context
prometheus_service_context = ServiceContext.from_defaults(llm=prometheus_llm)

# CorrectnessEvaluator with Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
    service_context=prometheus_service_context,
    parser_function=parser_function,
    eval_template=prometheus_correctness_eval_prompt_template,
)

# FaithfulnessEvaluator with Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
    service_context=prometheus_service_context,
    eval_template=prometheus_faithfulness_eval_prompt_template,
    refine_template=prometheus_faithfulness_refine_prompt_template,
)

# RelevancyEvaluator with Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
    service_context=prometheus_service_context,
    eval_template=prometheus_relevancy_eval_prompt_template,
    refine_template=prometheus_relevancy_refine_prompt_template,
)

## Function to create `query_engine` and `rag_dataset` for different datasets

In [None]:
from llama_index.llama_dataset import LabelledRagDataset
from llama_index import SimpleDirectoryReader, VectorStoreIndex


def create_query_engine_rag_dataset(dataset_path):
    rag_dataset = LabelledRagDataset.from_json(
        f"{dataset_path}/rag_dataset.json"
    )
    documents = SimpleDirectoryReader(
        input_dir=f"{dataset_path}/source_files"
    ).load_data()

    index = VectorStoreIndex.from_documents(documents=documents)
    query_engine = index.as_query_engine()

    return query_engine, rag_dataset

## Function to check the distribution of scores

In [None]:
from collections import Counter
from typing import Dict, List, Optional


def get_scores_distribution(
    scores: List[Optional[float]],
) -> Dict[Optional[float], float]:
    # Counting the occurrences of each score
    score_counts = Counter(scores)

    # Total number of scores
    total_scores = len(scores)

    # Calculating the percentage distribution
    percentage_distribution = {
        score: (count / total_scores) * 100
        for score, count in score_counts.items()
    }

    return percentage_distribution
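
To make the calculation concrete, here is a small worked example of the same counting logic on a made-up list of scores. The values in `example_scores` are hypothetical, and `None` stands for a sample without a parsed score.

```python
from collections import Counter

# Hypothetical scores; None stands for a sample without a parsed score
example_scores = [3.0, 3.0, 1.0, None]

# Count each distinct score, then convert the counts to percentages
counts = Counter(example_scores)
distribution = {
    score: count / len(example_scores) * 100
    for score, count in counts.items()
}
print(distribution)
```

Because `None` is counted like any other key, the percentages always sum to 100 across parsed and unparsed samples.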

## Function to check faithfulness and relevancy evaluation score

In [None]:
def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Score: {round(score, 2)}")
    return score
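
The score is simply the fraction of results whose `passing` flag is set. The sketch below reproduces that calculation on stub objects; `StubResult`, `pass_rate` and the sample data are hypothetical stand-ins, not LlamaIndex classes.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    # Hypothetical stand-in exposing only the `passing` field used here
    passing: bool


def pass_rate(results):
    # Fraction of results whose `passing` flag is true
    return sum(1 for r in results if r.passing) / len(results)


# Made-up results: three passing, one failing
stub_eval_results = {
    "faithfulness": [
        StubResult(True),
        StubResult(True),
        StubResult(False),
        StubResult(True),
    ],
}
print(pass_rate(stub_eval_results["faithfulness"]))
```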

## Function to compute correctness evaluation scores

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm


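def process_sample(sample, evaluator):
    query = sample.query
    reference_answer = sample.reference_answer
    # Note: `query_engine` is the notebook-level variable assigned in the
    # dataset cells below, so it must be defined before this function runs.
    response = query_engine.query(query).response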

    result = evaluator.evaluate(
        query=query,
        response=response,
        reference=reference_answer,
    )
    return result


def correctness_evaluation_scores(rag_dataset, evaluator):
    scores = []

    with ThreadPoolExecutor() as executor:
        # Create a list to hold the future results
        futures = [
            executor.submit(process_sample, ex, evaluator)
            for ex in rag_dataset.examples
        ]

        # Process the futures as they are completed
        for future in tqdm(
            as_completed(futures), total=len(rag_dataset.examples)
        ):
            result = future.result()
            # Collect the score (None when no result was produced)
            scores.append(result.score)

    return scores
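
The submit and `as_completed` pattern used above can be tried in isolation. In the sketch below, `fake_evaluate` and the sample queries are hypothetical stand-ins for the query engine and evaluator round trip, so nothing is sent to a model.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_evaluate(query: str) -> float:
    # Hypothetical stand-in for a query + evaluator round trip
    return float(len(query) % 5 + 1)


queries = ["What is RAG?", "Who wrote the essay?", "Define faithfulness."]

scores = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Submit one task per query
    futures = [executor.submit(fake_evaluate, q) for q in queries]
    # as_completed yields futures in completion order, not submission order
    for future in as_completed(futures):
        scores.append(future.result())

print(sorted(scores))
```

Since results arrive in completion order, the collected scores are not aligned with the order of the input examples. That is fine for a score distribution, but it would matter if scores had to be matched back to individual queries.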

### PaulGraham Essay text

In [None]:
query_engine, rag_dataset = create_query_engine_rag_dataset(
    "./data/paul_graham"
)

In [None]:
questions = [example.query for example in rag_dataset.examples]

### Compute Correctness Evaluation

In [None]:
prometheus_scores = correctness_evaluation_scores(
    rag_dataset, prometheus_correctness_evaluator
)

100%|██████████| 44/44 [01:21<00:00,  1.85s/it]


### Compute Faithfulness and Relevancy Evaluation

In [None]:
from llama_index.evaluation import BatchEvalRunner

batch_runner = BatchEvalRunner(
    {
        "faithfulness": prometheus_faithfulness_evaluator,
        "relevancy": prometheus_relevancy_evaluator,
    },
    workers=8,
)

eval_results = await batch_runner.aevaluate_queries(
    query_engine, queries=questions
)

### Correctness Evaluation score distribution.

In [None]:
get_scores_distribution(prometheus_scores)

{3.0: 59.09090909090909,
 1.0: 34.090909090909086,
 5.0: 2.272727272727273,
 None: 2.272727272727273,
 4.0: 2.272727272727273}

### Faithfulness Evaluation score

In [None]:
score = get_eval_results("faithfulness", eval_results)

faithfulness Score: 0.82


### Relevancy Evaluation score

In [None]:
score = get_eval_results("relevancy", eval_results)

relevancy Score: 0.82


### Llama2 paper

In [None]:
query_engine, rag_dataset = create_query_engine_rag_dataset("./data/llama2")

In [None]:
questions = [example.query for example in rag_dataset.examples]

### Compute Correctness Evaluation

In [None]:
prometheus_scores = correctness_evaluation_scores(
    rag_dataset, prometheus_correctness_evaluator
)

100%|██████████| 100/100 [03:05<00:00,  1.86s/it]


### Compute Faithfulness and Relevancy Evaluation

In [None]:
batch_runner = BatchEvalRunner(
    {
        "faithfulness": prometheus_faithfulness_evaluator,
        "relevancy": prometheus_relevancy_evaluator,
    },
    workers=8,
)

eval_results = await batch_runner.aevaluate_queries(
    query_engine, queries=questions
)

### Correctness Evaluation score distribution.

In [None]:
get_scores_distribution(prometheus_scores)

{1.0: 11.0, 3.0: 78.0, 4.0: 7.000000000000001, 5.0: 4.0}

### Faithfulness Evaluation score

In [None]:
score = get_eval_results("faithfulness", eval_results)

faithfulness Score: 0.56


### Relevancy Evaluation score

In [None]:
score = get_eval_results("relevancy", eval_results)

relevancy Score: 0.61


## Observation:

1. The approximate cost of evaluation is `$0.433` for `144` queries (`44` for Paul Graham Essay and `100` for Llama2 paper), which amounts to about `$0.003` per query.
2. The high percentage of examples with a score of `3.0` suggests that the generated answers often contain mistakes relative to the reference answer. Additionally, in some cases, the model fails to provide a result, which is indicated by `None`.
3. The faithfulness and relevancy scores indicate that some responses contain hallucinations or are not relevant to the query and context.

Note: The endpoint on HF is served on AWS (Nvidia A10G · 1x GPU · 24 GB), which costs $1.3/h.
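
The per-query figure follows directly from the numbers reported above. The sketch below redoes the arithmetic; the implied endpoint time is only an inference and assumes that the whole cost comes from endpoint uptime at the quoted hourly rate.

```python
total_cost_usd = 0.433  # approximate total reported above
num_queries = 44 + 100  # Paul Graham Essay + Llama2 paper
hourly_rate_usd = 1.3  # endpoint price reported above

# Cost per evaluated query
cost_per_query = total_cost_usd / num_queries

# Assumption: the total cost is entirely endpoint uptime at the hourly rate
implied_minutes = total_cost_usd / hourly_rate_usd * 60

print(f"cost per query: ${cost_per_query:.4f}")
print(f"implied endpoint time: {implied_minutes:.0f} minutes")
```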