## Llama-Index Agents + Ground Truth & Custom Evaluations

In this example, we build an agent-based app with Llama-Index that answers questions with the help of Yelp. We'll evaluate it using three kinds of feedback functions:

1. Custom evals - checks specific to this app, such as whether the query sent to Yelp matches the user's question and whether ratings appear in the returned context.
2. Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.
3. Standard relevance evals - question/statement and question/answer relevance.

Last, we'll compare the evaluation of this app against a standalone LLM. May the best bot win!

In [None]:
!pip install trulens_eval==0.7.0 \
             llama_index==0.7.11 \
             llama_hub==0.0.13 \
             yelpapi==2.5.0

In [None]:
YELP_API_KEY = "..."
YELP_CLIENT_ID = "..."
OPENAI_API_KEY = "..."

import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

### Set up our Llama-Index App

In [None]:
# Setup OpenAI Agent
import llama_index
from llama_index.agent import OpenAIAgent
from llama_index import question_gen
from llama_index.question_gen import types
import openai
import os
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
# Import and initialize our tool spec
from llama_hub.tools.yelp.base import YelpToolSpec
from llama_index.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

# Add Yelp API key and client ID
tool_spec = YelpToolSpec(api_key=YELP_API_KEY, client_id=YELP_CLIENT_ID)

In [None]:
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools(
    [
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list()
    ],
    verbose=True
)

### Create a standalone GPT-3.5 for comparison

In [None]:
def llm_standalone(prompt):
    # Ask GPT-3.5 directly, with no tools, as a baseline for comparison
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a question and answer bot, and you answer concisely."},
            {"role": "user", "content": prompt}
        ]
    )["choices"][0]["message"]["content"]

## Custom Yelp evaluations

Here we'll define two custom evals specific to our problem:
1. Query translation score: checks that the query used by Yelp Business Search matches the user query.
2. Ratings usage: checks whether Yelp ratings are included in the context returned by Yelp Business Search.

In [None]:
from trulens_eval import Feedback, OpenAI, Tru, TruBasicApp, TruLlama
from trulens_eval.feedback import GroundTruthAgreement

tru = Tru()

class OpenAI_custom(OpenAI):
    def query_translation_score(self, question1: str, question2: str) -> float:
        # Rate the similarity of the two questions from 1 to 10, then scale to 0-1
        return float(openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to rate how similar two questions are on a scale of 1 to 10. Respond with the number only."},
                {"role": "user", "content": f"QUESTION 1: {question1}; QUESTION 2: {question2}"}
            ]
        )["choices"][0]["message"]["content"]) / 10

    def ratings_usage(self, last_context: str) -> float:
        # Print the context being scored so it is visible while the app runs
        print(last_context)
        # Returns 1.0 if the context mentions ratings or reviews, else 0.0
        return float(openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to respond with a '1' if the following statement mentions ratings or reviews, and a '0' if not."},
                {"role": "user", "content": f"STATEMENT: {last_context}"}
            ]
        )["choices"][0]["message"]["content"])

custom = OpenAI_custom()
from trulens_eval import Select
query_translation_score = Feedback(custom.query_translation_score).on_input().on(
    Select.Record.calls[0].args.str_or_query_bundle # check the query bundle passed to yelp api
)
ratings_usage = Feedback(custom.ratings_usage).on(
    Select.App.app.chat_history[-1]["content"] # check the last content chunk for mentions of ratings or reviews
)
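
Both custom feedback functions above pass the model's raw reply straight to `float()`, which raises a `ValueError` if the model adds any words around the number. The sketch below shows one possible way to make that parsing more forgiving. `parse_score` is a hypothetical helper, not part of trulens_eval or of this notebook, and the fallback value of 0.0 is an assumption you may want to change.

```python
import re


def parse_score(text: str, scale: float = 10.0, default: float = 0.0) -> float:
    """Pull the first number out of an LLM reply and normalize it to [0, 1].

    Hypothetical helper: `scale` is the top of the rating scale the prompt
    asked for, and `default` is returned when the reply has no number.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return default
    value = float(match.group()) / scale
    # Clamp so an out-of-range reply cannot produce a score outside [0, 1]
    return min(max(value, 0.0), 1.0)


# A reply with extra words still yields a usable score
score = parse_score("I would rate these a 7 out of 10.")
```

With a helper like this, `query_translation_score` could call `parse_score(reply, scale=10)` and `ratings_usage` could call `parse_score(reply, scale=1)` in place of the bare `float(...)` conversions.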

### Ground Truth Eval

It's also useful in many cases to do ground truth eval with small golden sets. We'll do so here.

In [None]:
golden_set = [
    {"query": "What's the vibe like at oprhan andy's in SF?", "response": "welcoming and friendly"},
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "I'm in san francisco for the morning, does Juniper serve pastries?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?", "response": "5176 3rd St, San Francisco, CA 94124"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.6/5"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
    {"query": "What's the best diner in Toronto?", "response": "The George Street Diner"}
]

f_groundtruth = Feedback(GroundTruthAgreement(golden_set).agreement_measure).on_input_output()
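
To make the role of the golden set concrete, here is a minimal sketch of how an expected response could be looked up for a given query. It assumes the lookup is an exact string match on the `query` field; `find_expected_response` is a hypothetical function written for illustration and is not the trulens_eval implementation.

```python
from typing import Dict, List, Optional


def find_expected_response(query: str, golden_set: List[Dict[str, str]]) -> Optional[str]:
    """Return the golden response whose query matches exactly, or None."""
    for entry in golden_set:
        if entry["query"] == query:
            return entry["response"]
    return None


# Two entries copied from the golden set above
sample_golden_set = [
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
]

exact = find_expected_response("Where's the best pizza in New York City", sample_golden_set)
reworded = find_expected_response("Where is the best pizza in NYC?", sample_golden_set)
```

Under this assumption, a reworded question finds no expected response, so it pays to send the app exactly the query strings stored in the golden set.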

### Standard Relevance Evals

Last, we'll add in our standard relevance evals.

We can use Select to reach intermediate context inside our app for QS (question/statement) relevance, alongside our standard QA (question/answer) relevance on input and output.

In [None]:
import numpy as np
from trulens_eval import OpenAI as fOpenAI
fopenai = fOpenAI()
# Question/statement relevance between question and last context chunk (i.e. summary)
f_qs_relevance = Feedback(fopenai.qs_relevance).on_input().on(
    Select.App.app.chat_history[-1]["content"] # check the last context chunk
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(fopenai.relevance).on_input_output()

### Run the dashboard

By running the dashboard before we start to make app calls, we can see them come in one by one.

In [None]:
tru.run_dashboard()

### Instrument Yelp App

We can instrument our Yelp app with TruLlama and use the full suite of evals we set up.

In [None]:
yelp_app = TruLlama(
    agent,
    app_id="YelpAgent",
    tags="agent prototype",
    feedbacks=[f_qa_relevance, f_groundtruth, f_qs_relevance, query_translation_score, ratings_usage]
)

### Instrument Standalone LLM app

Since we don't have insight into OpenAI's inner workings, we cannot run the evals that depend on intermediate steps.

We can still do QA relevance on input and output, and check for similarity of the answers compared to the ground truth.

In [None]:
standalone_app = TruBasicApp(
    llm_standalone,
    app_id="OpenAIChatCompletion",
    tags="comparison",
    feedbacks=[f_qa_relevance, f_groundtruth]
)

### Start using our apps!

In [None]:
prompt_set = ["What's the vibe like at oprhan andy's in SF?",
                "What are the reviews like of Gola in SF?",
                "Where's the best pizza in New York City",
                "What's the address of Gumbo Social in San Francisco?",
                "I'm in san francisco for the morning, does Juniper serve pastries?",
                "What's the best diner in Toronto?"
                ]
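
Before running the prompts, it can help to confirm that each one has a matching golden-set entry, since a prompt with no entry has nothing to be compared against. The sketch below is one way to do that check. `prompts_without_ground_truth` is a hypothetical helper, and it assumes ground truth is matched on the exact query string.

```python
from typing import Dict, List


def prompts_without_ground_truth(prompts: List[str], golden_set: List[Dict[str, str]]) -> List[str]:
    """List the prompts that have no exactly matching query in the golden set."""
    known_queries = {entry["query"] for entry in golden_set}
    return [prompt for prompt in prompts if prompt not in known_queries]


# Small illustrative inputs; in the notebook you would pass prompt_set and golden_set
sample_golden_set = [
    {"query": "What's the best diner in Toronto?", "response": "The George Street Diner"},
]
sample_prompts = [
    "What's the best diner in Toronto?",
    "What is the best diner in Toronto?",  # reworded, so it will not match
]

missing = prompts_without_ground_truth(sample_prompts, sample_golden_set)
```

An empty result means every prompt can be scored by the ground truth eval.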

In [None]:
for prompt in prompt_set:
    standalone_app.call_with_record(prompt)
    yelp_app.query(prompt)