## Lab 5: Professional Chat with Tools and Evaluation

This lab combines tool use (from Lab 4) with quality evaluation (from Lab 3).

The system will:
1. Use tools to record user details and unknown questions
2. Evaluate the quality of responses
3. Rerun the response once if it doesn't meet quality standards
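
The three steps above can be sketched as a small control loop. This is a minimal illustration, not the notebook's implementation: `answer_with_retry`, `Verdict`, and the stub functions are hypothetical names, and the stubs stand in for the real model calls so the flow can be run offline.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical stand-in for the evaluator's structured result."""
    is_acceptable: bool
    feedback: str


def answer_with_retry(generate_fn, evaluate_fn, question, max_retries=1):
    """Generate a reply, evaluate it, and regenerate with feedback if rejected."""
    reply = generate_fn(question, feedback=None)
    for _ in range(max_retries):
        verdict = evaluate_fn(question, reply)
        if verdict.is_acceptable:
            break
        # Pass the evaluator's feedback into the next attempt
        reply = generate_fn(question, feedback=verdict.feedback)
    return reply


# Stubs (assumptions for illustration): the first draft is rejected, the retry passes
def fake_generate(question, feedback=None):
    return "polished answer" if feedback else "rough answer"


def fake_evaluate(question, reply):
    ok = reply.startswith("polished")
    return Verdict(ok, "" if ok else "Too rough; please polish.")


result = answer_with_retry(fake_generate, fake_evaluate, "What do you do?")
print(result)
```

With `max_retries=1` the sketch matches the single rerun used later in this lab: a rejected reply is regenerated once and then returned.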

### Continuing from Lab 4

We're building on the Pushover and tool functionality from the previous lab.


In [1]:
# imports

from dotenv import load_dotenv
from openai import OpenAI
import json
import os
import requests
from pypdf import PdfReader
import gradio as gr
import httpx
from pydantic import BaseModel


In [2]:
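# The usual start: load API keys from .env and create the OpenAI client

load_dotenv(override=True)
# NOTE: verify=False disables TLS certificate verification. Keep it only if your
# network requires it (for example, an HTTPS-intercepting proxy); otherwise remove it.
openai = OpenAI(http_client=httpx.Client(verify=False))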


In [3]:
# For pushover

pushover_user = os.getenv("PUSHOVER_USER")
pushover_token = os.getenv("PUSHOVER_TOKEN")
pushover_url = "https://api.pushover.net/1/messages.json"

if pushover_user:
    print(f"Pushover user found and starts with {pushover_user[0]}")
else:
    print("Pushover user not found")

if pushover_token:
    print(f"Pushover token found and starts with {pushover_token[0]}")
else:
    print("Pushover token not found")


Pushover user found and starts with u
Pushover token found and starts with a


In [4]:
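def push(message):
    """Echo the message to the console and send it as a Pushover notification."""
    print(f"Push: {message}")
    payload = {"user": pushover_user, "token": pushover_token, "message": message}
    # verify=False skips TLS certificate checks; drop it unless your network requires it
    requests.post(pushover_url, data=payload, verify=False)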


In [5]:
def record_user_details(email, name="Name not provided", notes="not provided"):
    push(f"Recording interest from {name} with email {email} and notes {notes}")
    return {"recorded": "ok"}


def record_unknown_question(question):
    push(f"Recording {question} asked that I couldn't answer")
    return {"recorded": "ok"}


In [6]:
record_user_details_json = {
    "name": "record_user_details",
    "description": "Use this tool to record that a user is interested in being in touch and provided an email address",
    "parameters": {
        "type": "object",
        "properties": {
            "email": {
                "type": "string",
                "description": "The email address of this user"
            },
            "name": {
                "type": "string",
                "description": "The user's name, if they provided it"
            },
            "notes": {
                "type": "string",
                "description": "Any additional information about the conversation that's worth recording to give context"
            }
        },
        "required": ["email"],
        "additionalProperties": False
    }
}

record_unknown_question_json = {
    "name": "record_unknown_question",
    "description": "Always use this tool to record any question that couldn't be answered as you didn't know the answer",
    "parameters": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The question that couldn't be answered"
            },
        },
        "required": ["question"],
        "additionalProperties": False
    }
}

tools = [{"type": "function", "function": record_user_details_json},
        {"type": "function", "function": record_unknown_question_json}]


In [7]:
# This function can take a list of tool calls, and run them.

def handle_tool_calls(tool_calls):
    results = []
    for tool_call in tool_calls:
        tool_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        print(f"Tool called: {tool_name}", flush=True)
        tool = globals().get(tool_name)
        result = tool(**arguments) if tool else {}
        results.append({"role": "tool", "content": json.dumps(result), "tool_call_id": tool_call.id})
    return results
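
The dispatcher above looks tools up with `globals()`, so any global function whose name the model returns could be called. A hedged alternative is an explicit registry that acts as an allow-list. In the sketch below, `TOOL_REGISTRY`, `dispatch_tool_calls`, and `fake_call` are hypothetical names, the tool is a stand-in that skips the push notification, and the tool-call objects are faked with `SimpleNamespace` so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def record_unknown_question(question):
    # Stand-in for the notebook's tool; the real one also sends a push notification
    return {"recorded": "ok"}


# Explicit allow-list of callable tools (hypothetical; not part of the notebook)
TOOL_REGISTRY = {"record_unknown_question": record_unknown_question}


def dispatch_tool_calls(tool_calls):
    """Run each requested tool and build the matching 'tool' role messages."""
    results = []
    for tool_call in tool_calls:
        name = tool_call.function.name
        tool = TOOL_REGISTRY.get(name)
        if tool is None:
            # Report unknown tools back to the model instead of failing silently
            result = {"error": f"unknown tool: {name}"}
        else:
            result = tool(**json.loads(tool_call.function.arguments))
        results.append({"role": "tool", "content": json.dumps(result), "tool_call_id": tool_call.id})
    return results


def fake_call(call_id, name, arguments):
    """Build an object shaped like the tool calls that handle_tool_calls reads."""
    function = SimpleNamespace(name=name, arguments=json.dumps(arguments))
    return SimpleNamespace(id=call_id, function=function)


calls = [
    fake_call("call_1", "record_unknown_question", {"question": "Do you sing?"}),
    fake_call("call_2", "delete_everything", {}),
]
results = dispatch_tool_calls(calls)
print(results)
```

The trade-off is one extra line per tool in the registry, in exchange for a dispatcher that can only reach functions you have listed.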


In [8]:
reader = PdfReader("me/sudarshan_linkedin.pdf")
linkedin = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        linkedin += text

with open("me/summary_1.txt", "r", encoding="utf-8") as f:
    summary = f.read()

name = "Sudarshan A R"


In [9]:
system_prompt = f"You are acting as {name}. You are answering questions on {name}'s website, \
particularly questions related to {name}'s career, background, skills and experience. \
Your responsibility is to represent {name} for interactions on the website as faithfully as possible. \
You are given a summary of {name}'s background and LinkedIn profile which you can use to answer questions. \
Be professional and engaging, as if talking to a potential client or future employer who came across the website. \
If you don't know the answer to any question, use your record_unknown_question tool to record the question that you couldn't answer, even if it's about something trivial or unrelated to career. \
If the user is engaging in discussion, try to steer them towards getting in touch via email; ask for their email and record it using your record_user_details tool. "

system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
system_prompt += f"With this context, please chat with the user, always staying in character as {name}."


### Adding the Evaluator

Now we'll add the evaluation system from Lab 3 to ensure quality responses.


In [10]:
# Create a Pydantic model for the Evaluation

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str


In [11]:
# Set up the evaluator system prompt

evaluator_system_prompt = f"You are an evaluator that decides whether a response to a question is acceptable. \
You are provided with a conversation between a User and an Agent. Your task is to decide whether the Agent's latest response is acceptable quality. \
The Agent is playing the role of {name} and is representing {name} on their website. \
The Agent has been instructed to be professional and engaging, as if talking to a potential client or future employer who came across the website. \
The Agent has been provided with context on {name} in the form of their summary and LinkedIn details. Here's the information:"

evaluator_system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
evaluator_system_prompt += "With this context, please evaluate the latest response, replying with whether the response is acceptable and your feedback."


In [12]:
def evaluator_user_prompt(reply, message, history):
    # Clean history to only include role and content
    clean_history = [{"role": h["role"], "content": h["content"]} for h in history if isinstance(h, dict)]
    user_prompt = f"Here's the conversation between the User and the Agent: \n\n{clean_history}\n\n"
    user_prompt += f"Here's the latest message from the User: \n\n{message}\n\n"
    user_prompt += f"Here's the latest response from the Agent: \n\n{reply}\n\n"
    user_prompt += "Please evaluate the response, replying with whether it is acceptable and your feedback."
    return user_prompt


In [13]:
# Set up the evaluator client: Gemini via its OpenAI-compatible endpoint (as shown in Lab 3)

gemini = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"), 
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    http_client=httpx.Client(verify=False)
)


In [14]:
def evaluate(reply, message, history) -> Evaluation:
    messages = [{"role": "system", "content": evaluator_system_prompt}] + [{"role": "user", "content": evaluator_user_prompt(reply, message, history)}]
    # Using Gemini's structured output parsing with Pydantic (as shown in Lab 3)
    response = gemini.beta.chat.completions.parse(model="gemini-2.0-flash", messages=messages, response_format=Evaluation)
    return response.choices[0].message.parsed


In [15]:
def rerun(reply, message, history, feedback):
    updated_system_prompt = system_prompt + "\n\n## Previous answer rejected\nYou just tried to reply, but the quality control rejected your reply\n"
    updated_system_prompt += f"## Your attempted answer:\n{reply}\n\n"
    updated_system_prompt += f"## Reason for rejection:\n{feedback}\n\n"
    # Clean history for rerun
    clean_history = [{"role": h["role"], "content": h["content"]} for h in history if isinstance(h, dict)]
    messages = [{"role": "system", "content": updated_system_prompt}] + clean_history + [{"role": "user", "content": message}]
    # Note: the model may call tools again during the rerun, which can repeat push notifications
    done = False
    while not done:
        response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
        finish_reason = response.choices[0].finish_reason
        
        if finish_reason == "tool_calls":
            msg = response.choices[0].message
            tool_calls = msg.tool_calls
            results = handle_tool_calls(tool_calls)
            messages.append(msg)
            messages.extend(results)
        else:
            done = True
    return response.choices[0].message.content


### Integrated Chat Function with Tools and Evaluation

This chat function combines:
1. Tool calls (from Lab 4)
2. Quality evaluation (from Lab 3)
3. A single automatic rerun if the evaluator rejects the reply (the rerun is returned even if it is rejected again)
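
The tool-call part of this function is a loop: call the model, run any tools it requests, append the results to the message list, and repeat until it returns plain text. The sketch below isolates that loop under stated assumptions: `run_until_final`, `fake_model`, and `fake_run_tools` are hypothetical names, and the scripted responses only mimic the `finish_reason` and `message` fields that the notebook code reads.

```python
from types import SimpleNamespace


def run_until_final(call_model, run_tools, messages):
    """Call the model repeatedly until it stops requesting tools; return its text."""
    while True:
        choice = call_model(messages)
        if choice.finish_reason != "tool_calls":
            return choice.message.content
        # Keep the assistant's tool request and the tool results in the transcript
        messages.append(choice.message)
        messages.extend(run_tools(choice.message.tool_calls))


# Scripted fake responses (assumption): one tool request, then a final answer
scripted = iter([
    SimpleNamespace(
        finish_reason="tool_calls",
        message=SimpleNamespace(content=None, tool_calls=["call_1"]),
    ),
    SimpleNamespace(
        finish_reason="stop",
        message=SimpleNamespace(content="Happy to help!", tool_calls=None),
    ),
])


def fake_model(messages):
    return next(scripted)


def fake_run_tools(tool_calls):
    return [{"role": "tool", "content": "{}", "tool_call_id": call_id} for call_id in tool_calls]


messages = [{"role": "user", "content": "hello"}]
final = run_until_final(fake_model, fake_run_tools, messages)
print(final, len(messages))
```

Only the final text is sent to the evaluator; the intermediate tool messages stay in the message list used for generation.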


In [18]:
def chat(message, history):
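    # Clean history to ensure compatibility
    clean_history = [{"role": h["role"], "content": h["content"]} for h in history if isinstance(h, dict)]
    # Test hook: force a deliberately bad (pig latin) reply when the message mentions
    # "singing", so the evaluator has something to reject. Remove this outside of testing.
    if "singing" in message:
        system = system_prompt + (
            "\n\nEverything in your reply needs to be in pig latin - "
            "it is mandatory that you respond only and entirely in pig latin"
        )
    else:
        system = system_prompt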
    
    messages = [{"role": "system", "content": system}] + clean_history + [{"role": "user", "content": message}]
    done = False
    
    # First, handle tool calls
    while not done:
        response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
        finish_reason = response.choices[0].finish_reason
        
        if finish_reason == "tool_calls":
            msg = response.choices[0].message
            tool_calls = msg.tool_calls
            results = handle_tool_calls(tool_calls)
            messages.append(msg)
            messages.extend(results)
        else:
            done = True
    
    reply = response.choices[0].message.content
    
    # Now evaluate the response
    evaluation = evaluate(reply, message, clean_history)
    
    if evaluation.is_acceptable:
        print("✓ Passed evaluation - returning reply")
    else:
        print("✗ Failed evaluation - retrying")
        print(f"Feedback: {evaluation.feedback}")
        reply = rerun(reply, message, clean_history, evaluation.feedback)
        # Evaluate the rerun as well
        evaluation = evaluate(reply, message, clean_history)
        if evaluation.is_acceptable:
            print("✓ Passed evaluation after rerun")
        else:
            print("⚠ Still not acceptable after rerun, but returning anyway")
    
    return reply


In [19]:
gr.ChatInterface(chat, type="messages").launch()


* Running on local URL:  http://127.0.0.1:7864
* To create a public link, set `share=True` in `launch()`.




✓ Passed evaluation - returning reply
✓ Passed evaluation - returning reply
Tool called: record_unknown_question
Push: Recording let me know more about your singing asked that I couldn't answer




✓ Passed evaluation - returning reply
Tool called: record_unknown_question
Push: Recording let me know more about singing asked that I couldn't answer




✗ Failed evaluation - retrying
Feedback: The agent is doing a good job staying in character. However, it seems like it's generating gibberish instead of an 'I don't have that information' response. I recommend fixing the response so it sounds more natural while keeping the overall tone. 
Tool called: record_unknown_question
Push: Recording let me know more about singing asked that I couldn't answer




✓ Passed evaluation after rerun
Tool called: record_unknown_question
Push: Recording let me know more about singing asked that I couldn't answer




✓ Passed evaluation - returning reply
Tool called: record_unknown_question
Push: Recording Let me know more about singing. asked that I couldn't answer




✓ Passed evaluation - returning reply
✗ Failed evaluation - retrying
Feedback: The agent is meant to be representing Sudarshan A R and responding in a professional and engaging manner. The agent should not be generating a pig latin version of its response to indicate it can't respond to the query.
Tool called: record_unknown_question
Push: Recording let me know more about singing asked that I couldn't answer




✓ Passed evaluation after rerun
