# Response Evaluator

In this example we test the `ResponseEvaluator` on a vector index and a tree index. The data is extracted from the [New York City](https://en.wikipedia.org/wiki/New_York_City) Wikipedia page.

In [1]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

In [2]:
# configuring logger to INFO level
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
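
The log output in this notebook shows every message twice, once with an `INFO:<logger>:` prefix and once without. A likely cause is that the cell above attaches two stream handlers to the root logger: one via `basicConfig` and one via `addHandler`. The sketch below reproduces the effect in isolation; the logger name `demo_duplicate_handlers` and the in-memory buffer are illustrative assumptions, not part of the original notebook.

```python
import io
import logging

# In-memory stream so the demo does not touch stdout handlers.
buffer = io.StringIO()

# Hypothetical logger name, used only for this illustration.
demo_logger = logging.getLogger("demo_duplicate_handlers")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo isolated from the root logger

# Two handlers writing to the same stream: each record is emitted twice.
demo_logger.addHandler(logging.StreamHandler(buffer))
demo_logger.addHandler(logging.StreamHandler(buffer))

demo_logger.info("hello")
lines = buffer.getvalue().splitlines()
print(lines)
```

If the duplication is unwanted, keeping only the `basicConfig` call should be enough to get a single copy of each message.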

In [3]:
from llama_index import (
    TreeIndex,
    VectorStoreIndex,
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext,
    Response,
)
from llama_index.llms import OpenAI
from llama_index.evaluation import ResponseEvaluator
import pandas as pd

pd.set_option("display.max_colwidth", 0)

INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
NumExpr defaulting to 8 threads.


We use GPT-4 as the evaluation model.

In [4]:
# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

evaluator_gpt4 = ResponseEvaluator(service_context=service_context_gpt4)

In [5]:
documents = SimpleDirectoryReader("./test_wiki_data/").load_data()

In [7]:
# create tree index
tree_index = TreeIndex.from_documents(documents=documents)

INFO:llama_index.indices.common_tree.base:> Building index from nodes: 3 chunks
> Building index from nodes: 3 chunks


In [8]:
# create vector index
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=ServiceContext.from_defaults(chunk_size=512)
)

In [9]:
# define jupyter display function


def display_eval_df(response: Response, eval_result: str) -> None:
    if not response.source_nodes:
        print("no response!")
        return
    eval_df = pd.DataFrame(
        {
            "Response": str(response),
            "Source": response.source_nodes[0].node.text[:1000] + "...",
            "Evaluation Result": eval_result,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

To run an evaluation, pass the `Response` object returned from the query to the evaluator's `.evaluate()` method. Let's evaluate the outputs of both `tree_index` and `vector_index` to see if there is any difference.

In [16]:
query_engine = tree_index.as_query_engine()
response_tree = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator_gpt4.evaluate(response_tree)

In [17]:
display_eval_df(response_tree, eval_result)

no response!


In [14]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator_gpt4.evaluate(response_vector)

In [15]:
display_eval_df(response_vector, eval_result)

|   | Response | Source | Evaluation Result |
|---|----------|--------|-------------------|
| 0 | The Battle of Long Island, which was the largest battle of the American Revolutionary War, took place in August 1776 within the modern-day borough of Brooklyn. The only attempt at a peaceful solution to the war took place at the Conference House on Staten Island between American delegates, including Benjamin Franklin, and British general Lord Howe on September 11, 1776. | enslaved few or several people. Others were hired out to work at labor. Slavery became integrally tied to New York's economy through the labor of slaves throughout the port, and the banking and shipping industries trading with the American South. During construction in Foley Square in the 1990s, the African Burying Ground was discovered; the cemetery included 10,000 to 20,000 of graves of colonial-era Africans, some enslaved and some free.The 1735 trial and acquittal in Manhattan of John Peter Zenger, who had been accused of seditious libel after criticizing colonial governor William Cosby, helped to establish the freedom of the press in North America. In 1754, Columbia University was founded under charter by King George II as King's College in Lower Manhattan. === American Revolution === The Stamp Act Congress met in New York in October 1765, as the Sons of Liberty organization emerged in the city and skirmished over the next ten years with British troops stationed there. The Battl... | YES |


## Benchmark on Generated Questions

Now let's generate a few more questions so that we have more to evaluate and can run a small benchmark. In practic

In [19]:
from llama_index.evaluation import DatasetGenerator

question_generator = DatasetGenerator.from_documents(documents)
eval_questions = question_generator.generate_questions_from_nodes(5)

eval_questions

chunk_size_limit is deprecated, please specify chunk_size instead


['What is the population of New York City as of 2020?',
 'Which borough of New York City has the highest population?',
 'What is the economic significance of New York City?',
 'How did New York City get its name?',
 'What is the significance of the Statue of Liberty in New York City?']

In [22]:
import asyncio


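def evaluate_query_engine(query_engine, questions):
    # Issue all queries concurrently. Calling asyncio.run inside the
    # notebook's running loop relies on nest_asyncio.apply() from the first cell.
    c = [query_engine.aquery(q) for q in questions]
    results = asyncio.run(asyncio.gather(*c))
    print("finished query")

    total_correct = 0
    for r in results:
        # evaluate with GPT-4; a "YES" result counts as correct
        eval_result = 1 if evaluator_gpt4.evaluate(r) == "YES" else 0
        total_correct += eval_result

    return total_correct, len(results)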
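
The benchmark pattern above (fire all queries concurrently, then count the `"YES"` verdicts) can be tried without an API key. The following is a minimal sketch under stated assumptions: `FakeQueryEngine`, `fake_evaluate`, `gather_responses`, and `run_benchmark` are hypothetical stand-ins written for illustration and are not part of `llama_index`. It wraps `asyncio.gather` in a coroutine so that it can be passed to a plain `asyncio.run`.

```python
import asyncio


class FakeQueryEngine:
    """Hypothetical stand-in exposing an async `aquery`, like a query engine."""

    async def aquery(self, question: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"answer to: {question}"


def fake_evaluate(response: str) -> str:
    """Hypothetical judge returning "YES" or "NO" (toy keyword rule)."""
    return "YES" if "population" in response else "NO"


async def gather_responses(engine, questions):
    # Launch all queries concurrently; results come back in input order.
    return await asyncio.gather(*(engine.aquery(q) for q in questions))


def run_benchmark(engine, questions, evaluate):
    # Collect every response, then tally the ones the judge accepts.
    responses = asyncio.run(gather_responses(engine, questions))
    correct = sum(1 for r in responses if evaluate(r) == "YES")
    return correct, len(responses)


demo_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "How did New York City get its name?",
]
demo_correct, demo_total = run_benchmark(
    FakeQueryEngine(), demo_questions, fake_evaluate
)
print(f"score: {demo_correct}/{demo_total}")
```

Run as a script, this works as is. Inside a notebook, where an event loop is already running, `asyncio.run` would additionally need the `nest_asyncio.apply()` call from the first cell.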

In [23]:
vector_query_engine = vector_index.as_query_engine()
correct, total = evaluate_query_engine(vector_query_engine, eval_questions[:5])

print(f"score: {correct}/{total}")

INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=681 request_id=f587e4f416ad0d995a4400bc0a3be400 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=681 request_id=f587e4f416ad0d995a4400bc0a3be400 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=1128 request_id=9af1677bef53dc66d8824900c8958b27 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=1128 request_id=9af1677bef53dc66d8824900c8958b27 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=2529 request_id=a58de27cf52038d452c2df03dd9251cb response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=2529 request_id=a58de27cf52038d452c2df03dd9251cb response_code=200
INFO:openai:message='OpenAI API response' 

In [25]:
tree_query_engine = tree_index.as_query_engine()
correct, total = evaluate_query_engine(tree_query_engine, eval_questions[:5])

print(f"score: {correct}/{total}")

INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 0] Selected node: [1]/[1]
>[Level 0] Selected node: [1]/[1]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 1] Selected node: [1]/[1]
>[Level 1] Selected node: [1]/[1]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 0] Selected node: [1]/[1]
>[Level 0] Selected node: [1]/[1]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 1] Selected node: [2]/[2]
>[Level 1] Selected node: [2]/[2]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 0] Selected node: [3]/[3]
>[Level 0] Selected node: [3]/[3]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 1] Selected node: [5]/[5]
>[Level 1] Selected node: [5]/[5]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 0] Selected node: [1]/[1]
>[Level 0] Selected node: [1]/[1]
INFO:llama_index.indices.tree.select_leaf_retriever:>[Level 1] Selected node: [4]/[4]
>[Level 1] Selected node: [4]/[4]
INFO:llama_index.indices.tree.select_lea