## Llama-Index Agents + Ground Truth & Custom Evaluations

In this example, we build an agent-based app with Llama Index to answer questions with the help of Yelp. We'll evaluate it using two feedback functions:

1. Definitiveness - we want our app to respond with authority. We'll accomplish this with a simple, custom feedback function.
2. Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.

Finally, we'll compare the evaluation of this app against a standalone LLM. May the best bot win!

In [None]:
# Set up the OpenAI agent
from llama_index.agent import OpenAIAgent
import openai
openai.api_key = '...'

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."

In [None]:
# Import and initialize our tool spec
from llama_hub.tools.yelp.base import YelpToolSpec
from llama_index.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

tool_spec = YelpToolSpec(api_key='...', client_id='...')

In [None]:
def llm_standalone(prompt):
    """Answer `prompt` with a bare chat completion (no tools), as a baseline for the agent."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a question and answer bot, and you answer concisely."},
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]

In [None]:
llm_standalone("what are good restaurants in toronto?")

In [None]:
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools(
    [
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list()
    ],
    verbose=True
)

In [None]:
agent.chat("what are good restaurants in toronto")

In [None]:
from trulens_eval import Feedback, OpenAI, Tru, TruBasicApp, TruLlama
from trulens_eval.feedback import GroundTruthAgreement

tru = Tru()

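class OpenAI_custom(OpenAI):
    def definitive(self, response: str) -> float:
        """Ask the LLM to rate how definitive `response` is (1 to 10) and scale the rating to 0.1-1.0."""
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to rate how definitive the following text is on a scale of 1 to 10. Respond with the number only."},
                {"role": "user", "content": response},
            ],
        )
        # float() raises ValueError if the model replies with anything other than a bare number.
        return float(completion["choices"][0]["message"]["content"]) / 10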

custom = OpenAI_custom()
definitive = Feedback(custom.definitive).on_output()
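The `definitive` feedback function passes the model's reply straight to `float()`, so a reply such as "7/10" or "Rating: 7" would raise an error. Below is a minimal sketch of a more forgiving parser. The helper name `parse_rating` and its `low`/`high` parameters are hypothetical and not part of TruLens or OpenAI; it only illustrates one way to extract the number before scaling it.

```python
import re
from typing import Optional


def parse_rating(text: str, low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Pull the first number out of an LLM reply and scale it by dividing by `high`.

    Hypothetical helper for illustration. Returns None when the reply
    contains no number or the number falls outside [low, high].
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    if not low <= value <= high:
        return None
    return value / high


print(parse_rating("8"))                    # bare number, as the prompt requests
print(parse_rating("Rating: 7/10"))         # first number in the reply is used
print(parse_rating("I cannot rate this."))  # no number found
```

Inside `definitive`, a `None` result could be mapped to a default score or raised as an error, depending on how strictly you want to treat malformed replies.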

In [None]:
golden_set = [
    {"query": "What's the vibe like at oprhan andy's in SF?", "response": "cute diner, 50s/60s vibe"},
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "I'm in san francisco for the morning, does Juniper serve pastries?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?", "response": "5176 3rd St, San Francisco, CA 94124"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.7/5"}
]

f_groundtruth = Feedback(GroundTruthAgreement(golden_set).agreement_measure).on_input_output()
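A ground truth evaluation needs to find the expected response for each incoming query. The sketch below assumes the lookup is an exact string match on the `query` field; that is an assumption about the behavior, not a copy of the TruLens implementation, and `find_expected` is a hypothetical name. It shows why a paraphrased prompt would have no reference answer to compare against.

```python
from typing import Dict, List, Optional

# Two entries copied from the golden set above.
golden_subset: List[Dict[str, str]] = [
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?",
     "response": "5176 3rd St, San Francisco, CA 94124"},
]


def find_expected(query: str, ground_truth: List[Dict[str, str]]) -> Optional[str]:
    """Return the reference response whose query matches exactly, else None."""
    for pair in ground_truth:
        if pair["query"] == query:
            return pair["response"]
    return None


print(find_expected("Is park tavern in San Fran open yet?", golden_subset))
print(find_expected("Is Park Tavern in SF open?", golden_subset))  # paraphrase: no match
```

Under this assumption, the prompts sent to the apps should be copied verbatim from the golden set.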

In [None]:
tru.run_dashboard()

In [None]:
standalone_app = TruBasicApp(llm_standalone, app_id="OpenAIChatCompletion", feedbacks=[f_groundtruth, definitive])

In [None]:
yelp_app = TruLlama(agent,
    app_id='YelpAgent',
    feedbacks=[definitive, f_groundtruth])

In [None]:
prompt_set = [
    "What's the vibe like at oprhan andy's in SF?",
    "Is park tavern in San Fran open yet?",
    "What's the address of Gumbo Social in San Francisco?",
    "What are the reviews like of Gola in SF?",
]
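Before running the apps, it can help to check how the prompt set lines up with the golden set. The sketch below copies the query strings from this notebook and compares the two lists; the variable names `golden_queries`, `not_prompted`, and `no_reference` are hypothetical.

```python
# Query strings copied from golden_set in this notebook.
golden_queries = [
    "What's the vibe like at oprhan andy's in SF?",
    "Is park tavern in San Fran open yet?",
    "I'm in san francisco for the morning, does Juniper serve pastries?",
    "What's the address of Gumbo Social in San Francisco?",
    "What are the reviews like of Gola in SF?",
]

prompt_set = [
    "What's the vibe like at oprhan andy's in SF?",
    "Is park tavern in San Fran open yet?",
    "What's the address of Gumbo Social in San Francisco?",
    "What are the reviews like of Gola in SF?",
]

# Golden-set queries that will never be sent to the apps.
not_prompted = [q for q in golden_queries if q not in prompt_set]
# Prompts that have no reference answer in the golden set.
no_reference = [p for p in prompt_set if p not in golden_queries]

print("Golden queries not prompted:", not_prompted)
print("Prompts without a reference:", no_reference)
```

Here every prompt has a reference answer, while the Juniper question in the golden set is not exercised by the prompt set.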

In [None]:
for prompt in prompt_set:
    standalone_app.call_with_record(prompt)
    yelp_app.query(prompt)