## Llama-Index Agents + Ground Truth & Custom Evaluations

In this example, we build an agent-based app with Llama-Index to answer questions with the help of Yelp. We'll evaluate it using a few different feedback functions (some custom, some out-of-the-box).

The first set of feedback functions completes what we call the non-hallucination triad. However, because we're dealing with agents here, we've added a fourth leg (query translation) to cover the additional interaction between the query planner and the agent. This combination provides a foundation for eliminating hallucination in LLM applications.

1. Query Translation - The first step. Here we compare the similarity of the original user query to the query sent to the agent. This ensures that we're providing the agent with the correct question.
2. Context or QS Relevance - Next, we compare the relevance of the context provided by the agent back to the original query. This ensures that we're providing context for the right question.
3. Groundedness - Third, we ensure that the final answer is supported by the context. This ensures that the LLM is not extending beyond the information provided by the agent.
4. Question Answer Relevance - Last, we want to make sure that the final answer provided is relevant to the user query. This last step confirms that the answer is not only supported but also useful to the end user.

In this example, we'll add two additional feedback functions.

5. Ratings usage - evaluate if the summarized context uses ratings as justification. Note: this may not be relevant for all queries.
6. Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.

Last, we'll compare the evaluation of this app against a standalone LLM. May the best bot win?

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/expositional/frameworks/llama_index/llama_index_agents.ipynb)

### Install TruLens and Llama-Index

In [None]:
#! pip install trulens_eval==0.18.2 llama_index==0.9.11.post1 llama_hub==0.0.52 yelpapi==2.5.1 openai==1.3.7

In [None]:
# If running from the GitHub repo, uncomment the lines below to set up paths.
#from pathlib import Path
#import sys
#trulens_path = Path().cwd().parent.parent.parent.parent.resolve()
#sys.path.append(str(trulens_path))

In [None]:
# Setup OpenAI Agent
import llama_index
from llama_index.agent import OpenAIAgent
import openai

import os

In [None]:
# Set your API keys. If they are already set as environment variables, you can skip these steps.

os.environ["OPENAI_API_KEY"] = "..."
openai.api_key = os.environ["OPENAI_API_KEY"]

os.environ["YELP_API_KEY"] = "..."
os.environ["YELP_CLIENT_ID"] = "..."

# If your keys are already set as environment variables, use this to check them instead:
# from trulens_eval.keys import check_keys
# check_keys("OPENAI_API_KEY", "YELP_API_KEY", "YELP_CLIENT_ID")
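
If you'd rather confirm the keys with plain Python before making any API calls, something like the following sketch could work. `missing_keys` is a hypothetical helper written for this illustration (it is not part of TruLens), and it assumes that an empty value or the "..." placeholder used above both count as missing.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_keys(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset, empty, or still the "..." placeholder."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in names if env.get(name, "").strip() in ("", "...")]
```

Calling `missing_keys(["OPENAI_API_KEY", "YELP_API_KEY", "YELP_CLIENT_ID"])` returns an empty list when all three keys are set to real values.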

### Set up our Llama-Index App

For this app, we will use a tool from Llama-Index to connect to Yelp and allow the Agent to search for businesses and fetch reviews.

In [None]:
# Import and initialize our tool spec
from llama_hub.tools.yelp.base import YelpToolSpec
from llama_index.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

# Add Yelp API key and client ID
tool_spec = YelpToolSpec(
    api_key=os.environ.get("YELP_API_KEY"),
    client_id=os.environ.get("YELP_CLIENT_ID")
)

In [None]:
gordon_ramsay_prompt = "You answer questions about restaurants in the style of Gordon Ramsay, often insulting the asker."

In [None]:
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools([
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list()
    ],
    verbose=True,
    system_prompt=gordon_ramsay_prompt
)

### Create a standalone GPT-3.5 for comparison

In [None]:
client = openai.OpenAI()

chat_completion = client.chat.completions.create

In [None]:
from trulens_eval.tru_custom_app import TruCustomApp, instrument

class LLMStandaloneApp():
    @instrument
    def __call__(self, prompt):
        return chat_completion(
            model="gpt-3.5-turbo",
            messages=[
                    {"role": "system", "content": gordon_ramsay_prompt},
                    {"role": "user", "content": prompt}
                ]
        ).choices[0].message.content

llm_standalone = LLMStandaloneApp()

## Evaluation and Tracking with TruLens

In [None]:
# imports required for tracking and evaluation
from trulens_eval import Feedback, OpenAI, Tru, TruLlama, Select, OpenAI as fOpenAI
from trulens_eval.feedback import GroundTruthAgreement, Groundedness

tru = Tru()
# tru.reset_database() # if needed

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


## Evaluation setup

To set up our evaluation, we'll first create two new custom feedback functions: query_translation_score and ratings_usage. These are straightforward prompts to the OpenAI API.

In [None]:
class OpenAI_custom(OpenAI):
    def query_translation_score(self, question1: str, question2: str) -> float:
        return float(chat_completion(
            model="gpt-3.5-turbo",
            messages=[
                    {"role": "system", "content": "Your job is to rate how similar two questions are on a scale of 1 to 10. Respond with the number only."},
                    {"role": "user", "content": f"QUESTION 1: {question1}; QUESTION 2: {question2}"}
                ]
        ).choices[0].message.content) / 10

    def ratings_usage(self, last_context: str) -> float:
        return float(chat_completion(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to respond with a '1' if the following statement mentions ratings or reviews, and a '0' if not."},
                {"role": "user", "content": f"STATEMENT: {last_context}"}
            ]
        ).choices[0].message.content)
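
Both functions above pass the model's reply straight to `float(...)`, which raises if the model answers with something like "Rating: 7/10". As a rough sketch of one way to make the parsing more forgiving, the hypothetical helper below (`parse_score` is not part of TruLens or the OpenAI client) pulls out the first number in the reply, clamps it to the expected range, and divides by the top of the scale, mirroring the `/ 10` above.

```python
import re


def parse_score(text: str, low: float = 1.0, high: float = 10.0) -> float:
    """Extract the first number in an LLM reply, clamp it, and divide by `high`."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No numeric score found in: {text!r}")
    # Clamp so an out-of-range reply cannot push the score outside low/high.
    value = min(max(float(match.group()), low), high)
    return value / high
```

For the 0-or-1 reply expected by ratings_usage, the same helper can be called with `low=0, high=1`. Note that it takes the first number it sees, so a reply that repeats "QUESTION 1" before giving the rating would be misread; keeping the "Respond with the number only" instruction in the prompt still matters.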

Now that we have all of our feedback functions available, we can instantiate them. For many of our evals, we want to check on intermediate parts of our app, such as the query passed to the Yelp app or the summarization of the Yelp content. We'll do so here using Select.
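
To make the idea of selecting an intermediate value concrete, here is a small sketch of walking a path through a nested record. It is only an illustration of the concept, not how TruLens implements Select: the `resolve` helper and the simplified `record` layout are assumptions, and real records carry many more fields.

```python
from typing import Any, Sequence, Union


def resolve(record: Any, path: Sequence[Union[str, int]]) -> Any:
    """Walk a nested dict/list structure one key or index at a time."""
    current = record
    for step in path:
        current = current[step]
    return current


# Hypothetical, simplified record for a single agent run.
record = {
    "main_input": "What's the address of Gumbo Social in San Francisco?",
    "app": {
        "query": [
            {
                "args": {
                    "str_or_query_bundle": "What is the address of Gumbo Social in San Francisco?"
                },
                "rets": {
                    "response": "The address of Gumbo Social in San Francisco is 5176 3rd St, San Francisco, CA 94124."
                },
            }
        ]
    },
}

# The query the agent passed to the tool, and the summary the tool returned.
tool_query = resolve(record, ["app", "query", 0, "args", "str_or_query_bundle"])
summary = resolve(record, ["app", "query", 0, "rets", "response"])
```

Query translation compares `main_input` with `tool_query`, while context relevance and groundedness read `summary`.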

In [None]:
# unstable: perhaps reduce temperature?

custom = OpenAI_custom()
# Input to tool based on trimmed user input.
f_query_translation = Feedback(
    custom.query_translation_score,
    name="Query Translation") \
.on_input() \
.on(Select.Record.app.query[0].args.str_or_query_bundle)

f_ratings_usage = Feedback(
    custom.ratings_usage,
    name="Ratings Usage") \
.on(Select.Record.app.query[0].rets.response)

# Result of this prompt: Given the context information and not prior knowledge, answer the query.
# Query: address of Gumbo Social
# Answer: "
fopenai = fOpenAI()
# Question/statement (context) relevance between question and last context chunk (i.e. summary)
f_context_relevance = Feedback(
    fopenai.qs_relevance,
    name="Context Relevance") \
.on_input() \
.on(Select.Record.app.query[0].rets.response)

# Groundedness
grounded = Groundedness(groundedness_provider=fopenai)

f_groundedness = Feedback(
    grounded.groundedness_measure,
    name="Groundedness") \
.on(Select.Record.app.query[0].rets.response) \
.on_output().aggregate(grounded.grounded_statements_aggregator)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(
    fopenai.relevance,
    name="Answer Relevance"
).on_input_output()

✅ In Query Translation, input question1 will be set to __record__.main_input or `Select.RecordInput` .
✅ In Query Translation, input question2 will be set to __record__.app.query[0].args.str_or_query_bundle .
✅ In Ratings Usage, input last_context will be set to __record__.app.query[0].rets.response .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query[0].rets.response .
✅ In Groundedness, input source will be set to __record__.app.query[0].rets.response .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
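
Groundedness is scored per statement and then aggregated into one number. The sketch below shows one plausible aggregation: keep the best-supporting source for each statement, then average across statements. This reflects my understanding of the general approach rather than a copy of `grounded_statements_aggregator`, so treat the function name, the input shape, and the scores in the usage note as assumptions.

```python
from statistics import mean
from typing import Dict, List


def aggregate_groundedness(per_source_scores: List[Dict[str, float]]) -> float:
    """Average, over statements, the highest score any source gave each statement."""
    best: Dict[str, float] = {}
    for scores in per_source_scores:
        for statement, score in scores.items():
            # A statement is as grounded as its best-supporting source.
            best[statement] = max(score, best.get(statement, 0.0))
    if not best:
        raise ValueError("No statements to aggregate")
    return mean(best.values())
```

With made-up scores, `aggregate_groundedness([{"s1": 0.25, "s2": 1.0}, {"s1": 0.5}])` gives 0.75, because "s1" keeps its higher score of 0.5 before averaging with "s2".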


### Ground Truth Eval

It's also useful in many cases to do ground truth eval with small golden sets. We'll do so here.

In [None]:
golden_set = [
    {"query": "Hello there mister AI. What's the vibe like at oprhan andy's in SF?", "response": "welcoming and friendly"},
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "I'm in san francisco for the morning, does Juniper serve pastries?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?", "response": "5176 3rd St, San Francisco, CA 94124"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.6/5"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
    {"query": "What's the best diner in Toronto?", "response": "The George Street Diner"}
]

f_groundtruth = Feedback(
    GroundTruthAgreement(golden_set).agreement_measure,
    name="Ground Truth Eval") \
.on_input_output()

✅ In Ground Truth Eval, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Ground Truth Eval, input response will be set to __record__.main_output or `Select.RecordOutput` .
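
One thing to watch with a golden set is how prompts are matched to entries. Assuming entries are looked up by exact query string (an assumption worth checking against the GroundTruthAgreement source for your version), a prompt that differs even slightly from the stored query finds no reference answer. The hypothetical `find_golden_response` helper below sketches that behavior using two entries from the golden set above.

```python
from typing import Dict, List, Optional


def find_golden_response(
    golden_set: List[Dict[str, str]], prompt: str
) -> Optional[str]:
    """Return the expected response for an exactly matching query, else None."""
    for pair in golden_set:
        if pair["query"] == prompt:
            return pair["response"]
    return None


sample_set = [
    {
        "query": "Hello there mister AI. What's the vibe like at oprhan andy's in SF?",
        "response": "welcoming and friendly",
    },
    {
        "query": "What's the address of Gumbo Social in San Francisco?",
        "response": "5176 3rd St, San Francisco, CA 94124",
    },
]

# Exact match: the stored answer comes back.
exact = find_golden_response(
    sample_set, "What's the address of Gumbo Social in San Francisco?"
)
# Shortened prompt without the greeting: no entry matches under this assumption.
shortened = find_golden_response(
    sample_set, "What's the vibe like at oprhan andy's in SF?"
)
```

If matching does work this way, it is worth keeping the golden set queries and the prompts you actually send character-for-character identical.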


### Run the dashboard

By running the dashboard before we start to make app calls, we can see them come in one by one.

In [None]:
tru.run_dashboard(
#     _dev=trulens_path, force=True  # if running from github
)

### Instrument Yelp App

We can instrument our Yelp app with TruLlama and utilize the full suite of evals we set up.

In [None]:
tru_agent = TruLlama(agent,
    app_id='YelpAgent',
    tags = "agent prototype",
    feedbacks = [
        f_qa_relevance,
        f_groundtruth,
        f_context_relevance,
        f_groundedness,
        f_query_translation,
        f_ratings_usage
    ]
)

In [None]:
tru_agent.print_instrumented()

### Instrument Standalone LLM App

Since we don't have insight into OpenAI's inner workings, we cannot run many of the evals on intermediate steps.

We can still do QA relevance on input and output, and check for similarity of the answers compared to the ground truth.

In [None]:
tru_llm_standalone = TruCustomApp(
    llm_standalone,
    app_id="OpenAIChatCompletion",
    tags = "comparison",
    feedbacks=[
        f_qa_relevance,
        f_groundtruth
    ]
)

In [None]:
tru_llm_standalone.print_instrumented()

### Start using our apps!

In [None]:
prompt_set = [
    "What's the vibe like at oprhan andy's in SF?",
    "What are the reviews like of Gola in SF?",
    "Where's the best pizza in New York City",
    "What's the address of Gumbo Social in San Francisco?",
    "I'm in san francisco for the morning, does Juniper serve pastries?",
    "What's the best diner in Toronto?"
]

In [None]:
for prompt in prompt_set:
    print(prompt)

    with tru_llm_standalone as recording:
        llm_standalone(prompt)
    record_standalone = recording.get()

    with tru_agent as recording:
        agent.query(prompt)
    record_agent = recording.get()

What's the vibe like at oprhan andy's in SF?
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "San Francisco",
  "term": "Orphan Andy's"
}


[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Got output: Content loaded! You can now search the information using read_business_search

STARTING TURN 2
---------------

=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "What's the vibe like at Orphan Andy's in SF?"
}
Got output: The vibe at Orphan Andy's in San Francisco is not provided in the given context information.

STARTING TURN 3
---------------

What are the reviews like of Gola in SF?
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "San Francisco",
  "term": "Gola"
}
Got output: Content loaded! You can now search the information using read_business_search

STARTING TURN 2
---------------

=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "reviews of Gola in SF"
}
Got output: Gola in San Francisco has received 54 reviews.

STARTING TURN 3
---------------

=== Calling Function ===
Calling function: business_reviews with args: {
  "id": "Gola-san-francisco"
}
Got output: Content loaded! You can now search the information using read_business_reviews

STARTING TURN 4
---------------

=== Calling Function ===
Calling function: read_business_reviews with args: {
  "query": "reviews of Gola in SF"
}
Got output: There are several reviews of Gola in San Francisco. One reviewer mentioned that the food was average-above average, but the service was not great. Another reviewer mentioned that the dishes were amazing and they were impressed from the beginning to the end. Another reviewer described the food as absolutely delicious and flavorful, and mentioned that the restaurant is a gem in the Mission neighborhood. Overall, there are positive reviews of Gola in San Francisco.

STARTING TURN 5
---------------

Where's the best pizza in New York City
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "New York City",
  "term": "pizza"
}
Got output: Content loaded! You can now search the information using read_business_search

STARTING TURN 2
---------------

=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "What are the best pizza places in New York City?"
}
Got output: Some of the best pizza places in New York City include Joe's Pizza, Juliana's - Time Out Market, Scarr's Pizza, Grimaldi's Pizzeria, Rubirosa, and Lombardi's Pizza.

STARTING TURN 3
---------------

What's the address of Gumbo Social in San Francisco?
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "San Francisco",
  "term": "Gumbo Social"
}
Got output: Content loaded! You can now search the information using read_business_search

STARTING TURN 2
---------------

=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "What is the address of Gumbo Social in San Francisco?"
}
Got output: The address of Gumbo Social in San Francisco is 5176 3rd St, San Francisco, CA 94124.

STARTING TURN 3
---------------

I'm in san francisco for the morning, does Juniper serve pastries?
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "san francisco",
  "term": "Juniper"
}
Got output: Error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

STARTING TURN 2
---------------

What's the best diner in Toronto?
STARTING TURN 1
---------------

=== Calling Function ===
Calling function: business_search with args: {
  "location": "Toronto",
  "term": "diner"
}
Got output: Content loaded! You can now search the information using read_business_search

STARTING TURN 2
---------------

=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "best diner in Toronto"
}
Got output: White Lily Diner is the best diner in Toronto.

STARTING TURN 3
---------------