# Penelope + LangChain + SDK Conversational Metrics

This notebook demonstrates how to use **Penelope** with **LangChain** chains and multiple **SDK conversational metrics** for comprehensive testing and evaluation.

## Prerequisites

1. **Clone the repository**:
   ```bash
   git clone https://github.com/rhesis-ai/rhesis.git
   cd rhesis/penelope
   ```

2. **Install uv** (if not already installed):
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

3. **Set up the environment with LangChain dependencies**:
   ```bash
   uv sync --group langchain
   ```

4. **Get your Google API key** for Gemini from [https://aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)

5. **Start Jupyter**:
   ```bash
   uv run jupyter notebook
   ```


## Setup and Configuration


In [None]:
# Configure your Google API credentials
import os

# Replace the placeholder with your actual API key (do not commit real keys)
os.environ["GOOGLE_API_KEY"] = "your_api_key_here"

print("✓ SDK configured successfully")
print("Ready to test LangChain chains with Penelope and SDK metrics!")
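Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the key from the environment and prompts only when it is missing or still the placeholder. The name `ensure_api_key` and its parameters are hypothetical and not part of Penelope or the SDK; only `os.environ` and `getpass` from the standard library are used.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping

PLACEHOLDER = "your_api_key_here"  # the placeholder used in the cell above


def ensure_api_key(
    name: str = "GOOGLE_API_KEY",
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key stored in `env`, prompting once if it is missing.

    Hypothetical helper: `env` and `prompt` are injectable so the function
    can be exercised without touching the real environment or terminal.
    """
    value = env.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        # getpass hides the typed key so it does not end up in cell output
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required")
        env[name] = value
    return value
```

In the notebook, calling `ensure_api_key()` could stand in for the direct `os.environ[...]` assignment, so the key never appears in the saved file.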


## Example: Customer Support Chain with Multiple Metrics

We'll create a customer support chain and test it with:
1. **GoalAchievementJudge** - Default SDK metric for goal completion
2. **Customer Service Quality** - Custom metric for service excellence
3. **Response Helpfulness** - Custom metric for response quality


In [None]:
from typing import List

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_google_genai import ChatGoogleGenerativeAI

# Create LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.1,
)

# Create prompt with memory
prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a professional customer support assistant for an online retail company. "
        "Help customers with orders, shipping, returns, and product inquiries. "
        "Be empathetic, clear, and solution-oriented.",
    ),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])

# Create the base chain
chain = prompt | llm

# Simple in-memory chat history
class InMemoryChatMessageHistory(BaseChatMessageHistory):
    def __init__(self):
        self.messages: List[BaseMessage] = []

    def add_message(self, message: BaseMessage) -> None:
        self.messages.append(message)

    def clear(self) -> None:
        self.messages = []

# Session store
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

# Create conversational chain with memory
conversational_chain = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

print("✓ Customer support chain created successfully")
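To make the role of the session store concrete, here is a dependency-free sketch of the pattern the cell above relies on: look up the history for a session ID, pass it to the model together with the new input, then append both messages. This is an illustrative stand-in, not LangChain's implementation; `make_conversational` and `echo_model` are hypothetical names, and the stub model only reports how much history it received.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


def make_conversational(
    respond: Callable[[List[Message], str], str],
    store: Dict[str, List[Message]],
) -> Callable[[str, str], str]:
    """Wrap `respond` so that each session_id keeps its own history."""

    def invoke(session_id: str, user_input: str) -> str:
        # Get-or-create, mirroring get_session_history in the cell above
        history = store.setdefault(session_id, [])
        # Pass a copy so the model cannot mutate the stored history
        reply = respond(list(history), user_input)
        history.append(("human", user_input))
        history.append(("ai", reply))
        return reply

    return invoke


def echo_model(history: List[Message], user_input: str) -> str:
    """Stub standing in for `prompt | llm`; no API call is made."""
    return f"[{len(history)} prior messages] {user_input}"


sessions: Dict[str, List[Message]] = {}
chat = make_conversational(echo_model, sessions)

first = chat("alice", "Where is order #12345?")
second = chat("alice", "Can you expedite it?")
other = chat("bob", "Hello")  # a different session starts with no history
```

Each session ID accumulates its own messages, which is why a test run against the chain should use one consistent session ID per conversation.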


### Configure Metrics

Set up multiple SDK metrics for comprehensive evaluation:


In [None]:
from rhesis.penelope import PenelopeAgent
from rhesis.penelope.targets.langchain import LangChainTarget
from rhesis.sdk.metrics import ConversationalJudge, GoalAchievementJudge
from rhesis.sdk.models import get_model

gemini = get_model(provider="gemini", model_name="gemini-2.0-flash")

# Metric 1: Default Goal Achievement Judge (primary metric for stopping condition)
goal_achievement_judge = GoalAchievementJudge(
    threshold=0.7,  # Stop when 70% achieved
    model=gemini
)

# Metric 2: Customer Service Quality
customer_service_quality = ConversationalJudge(
    evaluation_prompt="""
    Evaluate the quality of customer service interaction.
    
    Key criteria:
    1. **Empathy**: Does the agent show understanding and concern?
    2. **Professionalism**: Is communication clear, polite, and professional?
    3. **Problem Understanding**: Does the agent correctly identify the customer's issue?
    4. **Solution Quality**: Are proposed solutions appropriate and actionable?
    5. **Responsiveness**: Does the agent address questions promptly and thoroughly?
    """,
    evaluation_steps="""
    1. Review the conversation for empathetic language and tone
    2. Assess professional communication style
    3. Verify the agent understood the customer's problem
    4. Evaluate solution appropriateness and clarity
    5. Check for complete, thorough responses
    6. Assign overall service quality score
    """,
    evaluation_examples="""
    High Score (0.9):
    Customer: "My package hasn't arrived and I need it urgently."
    Agent: "I understand how important this is for you. Let me check your order status right away and find the best solution to get this resolved quickly."
    
    Low Score (0.3):
    Customer: "My package hasn't arrived and I need it urgently."
    Agent: "Packages can take time. What's your tracking number?"
    """,
    name="customer_service_quality",
    description="Evaluates customer service excellence and empathy",
    threshold=0.6,
    model=gemini
)

# Metric 3: Response Helpfulness
response_helpfulness = ConversationalJudge(
    evaluation_prompt="""
    Evaluate how helpful and actionable the agent's responses are.
    
    Key criteria:
    1. **Specificity**: Are responses specific rather than generic?
    2. **Actionability**: Do responses provide clear next steps?
    3. **Completeness**: Are all aspects of the question addressed?
    4. **Clarity**: Are instructions and explanations easy to understand?
    5. **Relevance**: Do responses directly address the customer's needs?
    """,
    evaluation_steps="""
    1. Check if responses are specific and not generic
    2. Identify clear action items or next steps
    3. Verify all customer questions are answered
    4. Assess clarity of instructions and explanations
    5. Evaluate relevance to customer's actual needs
    6. Score overall helpfulness
    """,
    name="response_helpfulness",
    description="Evaluates practical helpfulness of responses",
    threshold=0.65,
    model=gemini
)

print("✓ Metrics configured successfully")


### Run Penelope Test with Multiple Metrics

Penelope will automatically evaluate all metrics during the test:


In [None]:
# Create LangChain target for Penelope
target = LangChainTarget(
    runnable=conversational_chain,
    target_id="customer-support-chain",
    description="Customer support chain with memory",
)

# Initialize Penelope with multiple metrics
agent = PenelopeAgent(
    enable_transparency=True,
    verbose=True,
    max_iterations=8,
    metrics=[
        goal_achievement_judge,  # Primary metric for stopping
        customer_service_quality,  # Service quality evaluation
        response_helpfulness,  # Response effectiveness evaluation
    ],
    model=gemini
)

# Define test goal and instructions
goal = "Successfully resolve a shipping delay issue with professional, empathetic service"

instructions = """
You are a customer with a shipping concern. Engage in a realistic support interaction:
1. Explain that your order #12345 hasn't arrived and you're concerned
2. Mention you need it by this weekend
3. Provide additional details when asked
4. Ask about compensation or expedited shipping
5. Verify the resolution plan

The agent should demonstrate excellent customer service throughout.
"""

# Execute test - Penelope evaluates all metrics automatically
print("Starting Penelope test with multiple metrics...\n")

result = agent.execute_test(
    target=target,
    goal=goal,
    instructions=instructions,
)

print("\n" + "=" * 70)
print("🎯 TEST RESULTS")
print("=" * 70)
print(f"Status: {result.status.value}")
print(f"Goal Achieved: {'✓' if result.goal_achieved else '✗'}")
print(f"Turns Used: {result.turns_used}")
if result.duration_seconds:
    print(f"Duration: {result.duration_seconds:.2f}s")

print("\n📊 METRICS SCORES:")
for metric_name, metric_data in result.metrics.items():
    score = metric_data.get("score", "N/A")
    if isinstance(score, (int, float)):
        # Single 0.6 display cutoff; each metric's configured threshold may differ
        threshold_met = "✓" if score >= 0.6 else "✗"
        print(f"   {metric_name}: {score:.3f} {threshold_met}")
    else:
        # Non-numeric scores (e.g. "N/A") cannot be formatted with :.3f
        print(f"   {metric_name}: {score} ✗")
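The metrics above were configured with different thresholds (0.6 and 0.65 for the two custom judges), so a single cutoff in the report can mislabel a result. The sketch below checks each score against its own threshold. It assumes `result.metrics` maps metric names to dictionaries with a `"score"` entry, as the loop above implies; `summarize_metrics` is a hypothetical helper, the key `"goal_achievement"` is a guess at how that metric might be named, and the sample scores are made up for illustration.

```python
from typing import Any, List, Mapping


def summarize_metrics(
    metrics: Mapping[str, Mapping[str, Any]],
    thresholds: Mapping[str, float],
    default_threshold: float = 0.6,
) -> List[str]:
    """Format one line per metric, comparing each score to its own threshold."""
    lines = []
    for name, data in metrics.items():
        score = data.get("score")
        threshold = thresholds.get(name, default_threshold)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            mark = "✓" if score >= threshold else "✗"
            lines.append(f"{name}: {score:.3f} (threshold {threshold:.2f}) {mark}")
        else:
            # Missing or non-numeric scores are reported as not met
            lines.append(f"{name}: N/A (threshold {threshold:.2f}) ✗")
    return lines


# Thresholds copied from the metric definitions earlier in the notebook
thresholds = {"customer_service_quality": 0.6, "response_helpfulness": 0.65}

# Made-up scores for illustration only; not real test output
sample = {
    "customer_service_quality": {"score": 0.82},
    "response_helpfulness": {"score": 0.61},
    "goal_achievement": {"score": None},  # hypothetical key name
}
report = summarize_metrics(sample, thresholds)
```

With a real run, `summarize_metrics(result.metrics, thresholds)` would produce the same kind of lines, provided the keys in `thresholds` match the names that appear in `result.metrics`.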
