<a href="https://colab.research.google.com/github/rastringer/promptcraft_notebooks/blob/main/evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evaluating outputs

In this notebook we will explore using the model to evaluate the quality and relevance of its own outputs. This may seem meta, but extracting responses into variables and asking follow-up questions with clear instructions can be a simple and accurate way of checking performance.

We import the helper functions from the last notebook via `helper_functions.py`, and our products are in a separate `products.json` file.

If you're on Colab, uncomment and run the following cell to authenticate.

In [None]:
# from google.colab import auth
# auth.authenticate_user()

In [None]:
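# Helper functions from the previous notebook; the `products` variable used
# below is assumed to be provided by this module as well.
# (The Vertex AI SDK itself is imported in the next cell.)
from helper_functions import *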

In [None]:
import vertexai
from vertexai.preview.language_models import ChatModel, InputOutputTextPair

# Replace the project and location placeholder values below
vertexai.init(project="<your-project-id>", location="<location>")
chat_model = ChatModel.from_pretrained("chat-bison@001")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 1024,
    "top_p": 0.8,
    "top_k": 40
}

### Set up

Once again, let's run the user query and extract the product information.

In [None]:
context = f"""
You're a customer service assistant for a coffee shop's \
e-commerce site. Our product list can be found in {products}. Respond in a friendly and professional \
tone with concise answers. \
Please ask the user relevant follow-up questions.
"""

user_message_1 = f"""
Tell me about the Brew Blend pro and \
the stovetop coffee maker. \
I'm also interested in espresso machines."""

chat = chat_model.start_chat(
    context=context,
    examples=[]
)

assistant_response = chat.send_message(user_message_1, **parameters)
print(assistant_response)

We can then convert the text response into a product list. This step is hidden from the user. The resulting product list lets us check the relevance of our recommendations.

In [None]:
context = f"""
Take as input the {assistant_response} and output a python dictionary of objects, \
where each object has \
the following format:
    'category': <one of \
    Espresso Machines, \
    Single Serve Coffee Makers, \
    Drip Coffee Makers, \
    Stovetop Coffee Makers,
    Coffee and Espresso Combo Machines>,
AND
    'products': <a list of products that must \
    be found in the allowed products below>

For example,
  'category': 'Coffee and Espresso Combo Machines', 'products': ['AeroBlend Max'],

Where the categories and products must be found in \
the customer service query.
If a product is mentioned, it must be associated with \
the correct category in the allowed products list below.
If no products or categories are found, output an \
empty list.

Allowed products:

Espresso Machines category:
Caffeino Classic

Single Serve Coffee Makers:
BeanPresso

Drip Coffee Makers:
BrewBlend Pro

Stovetop Coffee Makers:
SteamGenie

Coffee and Espresso Combo Machines:
AeroBlend Max

Only output the list of objects, with nothing else.
"""

chat = chat_model.start_chat(
    context=context,
    examples=[]
)

products_response = chat.send_message(user_message_1)
print(products_response)

In [None]:
temp_str = str(products_response)
category_and_product_list = read_string_to_list(temp_str)
category_and_product_list
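
`read_string_to_list` lives in `helper_functions.py`, which isn't shown in this notebook. As a rough idea of what such a helper has to do, here is a minimal sketch. The name `parse_model_list` is hypothetical and this is not the actual helper: it swaps single quotes for double quotes and tries to parse the result as JSON, returning `None` when the model's output isn't a valid list.

```python
import json

def parse_model_list(raw_text):
    """Hypothetical stand-in for read_string_to_list: parse model output into a list."""
    if not raw_text:
        return None
    # Models often emit Python-style single quotes; JSON requires double quotes.
    # Caveat: this naive swap breaks on apostrophes inside values.
    candidate = raw_text.strip().replace("'", '"')
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    # Only accept a list, since that is what the prompt asks the model to output
    return parsed if isinstance(parsed, list) else None

sample = "[{'category': 'Drip Coffee Makers', 'products': ['BrewBlend Pro']}]"
print(parse_model_list(sample))
```

Returning `None` instead of raising makes it easy to fall back to a retry or a default answer when the model ignores the format instructions.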

In [None]:
product_info_for_user_message_1 = generate_output_string(category_and_product_list)
print(product_info_for_user_message_1)

### Check output

Now that we have our outputs as handy lists and strings, we can pass them back to the model as inputs to check. This step will become less necessary as models become more sophisticated. Because it adds cost and latency, it is only recommended for highly sensitive applications.

In [None]:
context = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""
customer_message = f"""
Tell me all about the Brew Blend pro and \
the stovetop coffee maker - features and pricing. \
I'm also interested in an espresso machine"""

q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_info_for_user_message_1}```
Agent response: ```{assistant_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""

chat = chat_model.start_chat(
    context=context,
    examples=[]
)

response = chat.send_message(f"""{q_a_pair}""")
print(response)
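
Since the checker is asked for a single `Y` or `N`, the reply can be turned into a boolean before acting on it. The sketch below uses a hypothetical helper, `parse_verdict`, which is not part of `helper_functions.py`. It assumes that anything other than a clean `Y` or `N` should be treated as "unknown" rather than guessed at.

```python
def parse_verdict(raw_text):
    """Map a Y/N model reply to True/False; return None if the reply is ambiguous."""
    cleaned = str(raw_text).strip().upper()
    if cleaned == "Y":
        return True
    if cleaned == "N":
        return False
    # Unexpected output (e.g. a full sentence): leave the decision to the caller
    return None

print(parse_verdict(" y\n"))
print(parse_verdict("Yes, it does"))
```

In the notebook this would be called as `parse_verdict(response)`; a `None` result is a reasonable trigger for a retry or a human review.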

### Evaluation

In [None]:
def eval_with_rubric(customer_product_info, assistant_response):
    # Unpack the customer's question and the product information
    # the agent relied on when answering
    customer_message = customer_product_info["customer_message"]
    product_info = customer_product_info["context"]

    # System prompt: the grading rubric
    context = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response.
    Compare the factual content of the submitted answer with the context. \
    Ignore any differences in style, grammar, or punctuation.
    Answer the following questions:
        - Is the Assistant response based only on the context provided? (Y or N)
        - Does the answer include information that is not provided in the context? (Y or N)
        - Is there any disagreement between the response and the context? (Y or N)
        - Count how many questions the user asked. (output a number)
        - For each question that the user asked, is there a corresponding answer to it?
          Question 1: (Y or N)
          Question 2: (Y or N)
          ...
          Question N: (Y or N)
        - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
    """

    # User message: the question, the product information, and the answer to grade
    user_message = f"""\
    You are evaluating a submitted answer to a question based on the context \
    that the agent uses to answer the question.
    Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {customer_message}
    ************
    [Context]: {product_info}
    ************
    [Submission]: {assistant_response}
    ************
    [END DATA]
"""
    chat = chat_model.start_chat(
    context=context,
    examples=[]
    )

    response = chat.send_message(user_message, max_output_tokens=1024)
    return response

In [None]:
product_info = product_info_for_user_message_1

customer_product_info = {
    "customer_message": customer_message,
    "context": product_info
}
eval_output = eval_with_rubric(customer_product_info, assistant_response)

In [None]:
print(eval_output)

### Evaluate based on an expert human answer

We can write our own example of what an excellent human answer would be, then ask the model to compare its responses with our example.

In [None]:
ideal_example = {
    'customer_message': """\
    Tell me all about the Brew Blend pro and \
    the stovetop coffee maker - features and pricing. \
    I'm also interested in an espresso machine?""",

    'ideal_answer': """\
    Of course! The BrewBlend pro is a powerhouse of a drip coffee maker. \
    The BrewBlend offers a superior brewing experience with adjustable \
    brew strength and an anti-drip system. \
    Love your coffee first thing when you wake up? Just set the programmable \
    timer. It's priced at 389.99. \
    The stovetop option is the SteamGenie, a coffee maker crafted with \
    durable stainless steel. The SteamGenie delivers a rich, strong and authentic \
    coffee experience with every brew. \
    We do have an espresso machine, the Caffeino Classic. It has a 15-bar \
    pump for authentic espresso extraction, with a milk frother and \
    water reservoir for easy refilling. It costs 179.99.
    """
}

### Evals

There are scoring systems such as *BLEU* that researchers have used to measure model performance on language tasks. Another approach is to use OpenAI's [evals framework](https://github.com/openai/evals), from which the grading criteria in the prompt below are taken.
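
To make the contrast concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is a simplification: full BLEU combines several n-gram orders and applies a brevity penalty. The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference (counts clipped)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    ref_counts = ngrams(reference)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count at its count in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

reference = "the SteamGenie is made of stainless steel"
print(clipped_ngram_precision("the SteamGenie is stainless steel", reference))
print(clipped_ngram_precision("the SteamGenie is not stainless steel", reference))
```

The second candidate contradicts the reference yet still shares five of its six words with it, which is one reason a model-graded comparison of factual content can be more informative than word overlap for this kind of task.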

In [None]:
def eval_vs_ideal(ideal_example, assistant_response):

    customer_message = ideal_example['customer_message']
    ideal_answer = ideal_example['ideal_answer']
    completion = assistant_response

    context = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response.
    Output a single letter and nothing else.
    Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {customer_message}
    ************
    [Expert]: {ideal_answer}
    ************
    [Submission]: {completion}
    ************
    [END DATA]
"""

    chat = chat_model.start_chat(
    context=context,
    examples=[]
    )

    response = chat.send_message(user_message, max_output_tokens=1024)
    return response

In [None]:
eval_vs_ideal(ideal_example, assistant_response)
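
The single letter returned by the grader can be mapped back to its meaning before it is logged or acted on. The sketch below is one possible way to do that; `GRADE_MEANINGS` and `interpret_grade` are hypothetical names, and treating only `D` as a failure is an assumption you may want to tighten.

```python
# Short labels for the five options listed in the grading prompt
GRADE_MEANINGS = {
    "A": "subset of the expert answer, fully consistent",
    "B": "superset of the expert answer, fully consistent",
    "C": "same details as the expert answer",
    "D": "disagrees with the expert answer",
    "E": "differences do not matter for factuality",
}

def interpret_grade(raw_text):
    """Return (letter, meaning, is_consistent), or None if no valid grade is found."""
    # Accept replies such as "D", "(D)" or "d." but nothing longer
    cleaned = str(raw_text).strip().strip("().").upper()
    if cleaned not in GRADE_MEANINGS:
        return None
    # Assumption: only an explicit disagreement (D) counts as a failure
    return cleaned, GRADE_MEANINGS[cleaned], cleaned != "D"

print(interpret_grade("(D)"))
print(interpret_grade(" b "))
```

A `B` grade is consistent with the expert answer but contains extra details, so for sensitive applications it may be worth checking those details against the product information as well.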