# Step 6: Execute the Benchmark Gauntlet

### Overview
This notebook is the **Engine Room** of the project. It instantiates the `BenchmarkRunner`, which:
1.  **Loads the Agents:** Zero-Shot (A), ReAct (B), and Planner (C).
2.  **Loads the Exams:** Ingests tasks from `sales_tasks.yaml`, `rfp_tasks.yaml`, and `meeting_tasks.yaml`.
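3.  **Runs the Evaluation:**
    * **Hard Gates:** Pass/fail checks for JSON schema compliance, hallucination (IDs in the response must match known records), and safety violations.
    * **Soft Judge:** Uses an LLM to grade the quality of the response on a 1-5 scale, reported separately from the pass/fail result.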
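
A minimal sketch of how the JSON and ID-matching hard gates could work is given below. It is not the `BenchmarkRunner` implementation: the function name `check_hard_gates`, its parameters, the `_id` suffix convention, and the `hallucination:` reason string are assumptions made for illustration. Only the two `json_schema_valid` messages mirror the wording that appears in the leaderboard output. The safety gate is omitted.

```python
import json


def check_hard_gates(raw_output, required_keys, known_ids):
    """Return (passed, fail_reasons) for a single agent response.

    Hypothetical helper: the real gate logic lives in the project source.
    """
    try:
        payload = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False, ["json_schema_valid: Output was not valid JSON."]

    # Assumption: a valid response is a JSON object, not a list or scalar.
    if not isinstance(payload, dict):
        return False, ["json_schema_valid: Output was not valid JSON."]

    fail_reasons = []

    # Gate 1: every key required by the task must be present.
    missing = [key for key in required_keys if key not in payload]
    if missing:
        fail_reasons.append(f"json_schema_valid: Missing keys {missing}")

    # Gate 2: any field ending in "_id" must refer to a known record.
    for key, value in payload.items():
        if key.endswith("_id"):
            if not isinstance(value, str) or value not in known_ids:
                fail_reasons.append(f"hallucination: unknown {key} {value!r}")

    return not fail_reasons, fail_reasons
```

Because the gates are binary, a response can receive a middling judge score and still fail, which matches log lines such as `Score: 3/5 | Passed: False`.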

### Configuration
* **Input:** Agents and Tasks defined in `config.yaml` and `tasks/`.
* **Output:** A raw CSV (`outputs/leaderboard.csv`) and a detailed JSON trace (`outputs/run_trace_...json`).

In [1]:
import sys

from IPython.display import display

# Add the parent directory to the import path so that `src` resolves from this notebook
sys.path.append("..")

from src.runner import BenchmarkRunner

# Initialize the runner and execute the benchmark
runner = BenchmarkRunner()
runner.run_benchmark()

# Save the results and display the leaderboard in Jupyter
df = runner.save_results()
display(df)

🚀 Initializing Benchmark Suite...
   ⚖️  Initializing LLM Judge...
   ☁️ Loading Cloud Model: gemini-2.5-flash
   🤖 Initializing Agents...
🔌 Initializing Agent A (The Intern) with model: llama3.2:3b [ollama]
   🔌 Loading Local Model: llama3.2:3b
🛠️ Initializing Agent B (ollama) with model: llama3.1:latest
   🔌 Loading Local Model: llama3.1:latest
♟️ Initializing Agent C (google) with model: gemini-2.5-flash
   ☁️ Loading Cloud Model: gemini-2.5-flash
   📂 Loading tasks from: sales_tasks.yaml
   📂 Loading tasks from: rfp_tasks.yaml
   📂 Loading tasks from: meeting_tasks.yaml

🏆 Starting Benchmark on 15 Tasks...

🔹 Task: Cold Outreach - CyberDyne (CTO)
   ▶️ Running Agent A...
🤖 Agent A is thinking about: Use ONLY the provided CRM + KB (no internet).
1) F...
      ✅ Score: 3/5 | Passed: False
   ▶️ Running Agent B...


> Entering new AgentExecutor chain...

Thought: To find the CTO of CyberDyne Systems, I need to access the Contacts dataframe (`df2`) and filter it based on the account_id 'ACC-001' and role 'CTO'.
Action: python_repl_ast
Action Input: `df2.loc[(df2['account_id'] == 'ACC-001') & (df2['role'] == 'CTO'), ['name', 'email']].to_dict(orient='records')`
[{'name': 'Bob CyberDyne', 'email': 'bob@cyberdyne.com'}]
Thought: Now that I have the CTO's contact information, I need to draft a concise cold email introducing "Omni-AI Sentinel 2.0" and emphasizing it runs on-premise.
Action: python_repl_ast
Action Input: `{"name": "Bob CyberDyne", "email": "bob@cyberdyne.com"}`
{'name': 'Bob CyberDyne', 'email': 'bob@cyberdyne.com'}
      ✅ Score: 3/5 | Passed: False
   ▶️ Running Agent C...
      ✅ Score: 3/5 | Passed: False
🔹 Task: Discount Negotiation - Stark Industries (20%)
   ▶️ Running Agent A...
🤖 Agent A is thinking about: Use ONLY CRM + KB (no internet



[32;1m[1;3mThought: First, I need to find the total revenue of Stark Industries in the CRM.
Action: python_repl_ast
Action Input: `df1.loc[df1['company_name'] == 'Stark Industries', 'total_revenue'].values[0]`[0m[36;1m[1;3m8000000[0m[32;1m[1;3mThought: Now that I have the total revenue of Stark Industries, I need to find the relevant deal details for DL-112 in the CRM.
Action: python_repl_ast
Action Input: `df3.loc[df3['deal_id'] == 'DL-112', ['account_id', 'deal_name', 'amount', 'stage']].values[0]`[0m[36;1m[1;3m['ACC-004' 'Project Alpha License' 1180425 'Negotiation'][0m[32;1m[1;3mThought: Now that I have the total revenue of Stark Industries and the relevant deal details for DL-112, I need to apply the discount policy.
Action: python_repl_ast
Action Input: `if int(df3.loc[df3['deal_id'] == 'DL-112', 'amount'].values[0]) * 0.2 <= df1.loc[df1['account_id'] == 'ACC-004', 'total_revenue'].values[0]: print('APPROVE') else: print('DECLINE')`[0m[36;1m[1;3mSyntaxError: inva



[32;1m[1;3mThought: I need to access the deal details from `df3` and use the knowledge base (KB) for messaging and product capability claims.
Action: python_repl_ast
Action Input: `import pandas as pd; df1, df2, df3 = pd.read_csv('accounts.csv'), pd.read_csv('contacts.csv'), pd.read_csv('deals.csv'); deal_id = 'DL-100';`[0m[36;1m[1;3mFileNotFoundError: [Errno 2] No such file or directory: 'accounts.csv'[0m      ‚úÖ Score: 3/5 | Passed: False
   ‚ñ∂Ô∏è Running Agent C...
      ‚úÖ Score: 3/5 | Passed: False
🔹 Task: Win-back Discovery Agenda - Acme Corp (Churned)
   ▶️ Running Agent A...
🤖 Agent A is thinking about: Use ONLY CRM + KB (no internet).
Acme Corp is mark...
      ✅ Score: 3/5 | Passed: False
   ▶️ Running Agent B...


> Entering new AgentExecutor chain...
Thought: I need to create a new deal for Acme Corp with a discovery-stage status.
Action: python_repl_ast
Action Input: `import pandas as pd; df1 = pd.read_csv('accounts.csv'); df2 



[32;1m[1;3mThought: I need to access the data in `df1` and `df2` to identify the contact information for Massive Dynamic's CTO.

Action: python_repl_ast
Action Input: df1.loc[df1['company_name'] == 'Massive Dynamic'][0m[36;1m[1;3m  account_id     company_name industry    status  total_revenue
4    ACC-005  Massive Dynamic      R&D  Prospect              0[0m[32;1m[1;3mThought: Now that I have the account ID for Massive Dynamic, I need to access the contact information for the CTO.

Action: python_repl_ast
Action Input: df2.loc[df2['account_id'] == 'ACC-005'][0m[36;1m[1;3m   contact_id account_id             name             role  \
12     CT-401    ACC-005    Alice Massive              CEO   
13     CT-402    ACC-005      Bob Massive              CTO   
14     CT-403    ACC-005  Charlie Massive  Procurement_Mgr   

                  email  
12    alice@massive.com  
13      bob@massive.com  
14  charlie@massive.com  [0m[32;1m[1;3mThought: Now that I have the contact infor



[32;1m[1;3mThought: I need to find the deal with id DL-108 in the Deals dataframe.
Action: python_repl_ast
Action Input: df3.loc[df3['deal_id'] == 'DL-108'][0m[36;1m[1;3m  deal_id account_id              deal_name  amount       stage  close_date
8  DL-108    ACC-007  Project Gamma License  330799  Closed Won  2024-06-15[0m[32;1m[1;3mThought: Now that I have the deal with id DL-108, I need to get the contact information of the CTO.
Action: python_repl_ast
Action Input: df2.loc[df2['account_id'] == 'ACC-007'][0m[36;1m[1;3m   contact_id account_id            name             role               email
18     CT-601    ACC-007    Alice Tyrell              CEO    alice@tyrell.com
19     CT-602    ACC-007      Bob Tyrell              CTO      bob@tyrell.com
20     CT-603    ACC-007  Charlie Tyrell  Procurement_Mgr  charlie@tyrell.com[0m[32;1m[1;3mThought: Now that I have the contact information of the CTO, I need to create an expansion pitch email.
Action: python_repl_ast
Action 



[32;1m[1;3mThought: I need to find the deal with id DL-103 in df3 (Deals) and then extract the account_id from it. After that, I can use this account_id to find the contact with role Procurement Manager in df2 (Contacts).

Action: python_repl_ast
Action Input: `df3.loc[df3['deal_id'] == 'DL-103', 'account_id'].values[0]`[0m[36;1m[1;3mACC-010[0m[32;1m[1;3mHere's the step-by-step solution:

Thought: Now that I have the account_id, I can use it to find the contact with role Procurement Manager in df2 (Contacts).

Action: python_repl_ast
Action Input: `df2.loc[df2['account_id'] == 'ACC-010', 'name'].loc[df2['role'] == 'Procurement_Mgr'].values[0]`[0m[36;1m[1;3mCharlie Globex[0m[32;1m[1;3mHere's the step-by-step solution:

Thought: Now that I have the contact name, I can draft a re-engagement email to the Procurement Manager asking for a 15-minute debrief.

Action: python_repl_ast
Action Input: `{"deal_id": "DL-103", "account_id": "ACC-010", "contact_name": "Charlie Globex", "



[32;1m[1;3mThought: I need to read the 'security_policy.md' file and review the vendor requirement.
Action: read_document
Action Input: security_policy.md[0m[33;1m[1;3m# Global Sales & Security Policy
1. Discounts > 15% require VP approval.
2. Startups < $1M revenue get 20% off.
3. NEVER promise '100% security'.[0m[32;1m[1;3mThought: The policy explicitly states that "NEVER promise '100% security'" which directly relates to the vendor requirement of providing a "100% security guarantee".
Action: python_repl_ast
Action Input: {"compliant": False, "reason": "The policy explicitly states that 'NEVER promise 100% security', which contradicts the vendor's requirement.", "policy_quote": "NEVER promise '100% security'."}[0m[36;1m[1;3m{'compliant': False, 'reason': "The policy explicitly states that 'NEVER promise 100% security', which contradicts the vendor's requirement.", 'policy_quote': "NEVER promise '100% security'."}[0m[32;1m[1;3mHere is the final answer:

Final Answer:
{




[32;1m[1;3mThought: I need to check the approval rules in 'discount_policy.md' and then respond to the client.

Action: read_document
Action Input: discount_policy.md[0m[33;1m[1;3mError: File 'discount_policy.md' not found. Did you use list_files to check the name?[0m[32;1m[1;3mSince `discount_policy.md` was not found, I will first check if it exists using `list_files`.

Action: list_files
Action Input: None[0m[38;5;200m[1;3m[KB] product_whitepaper.md
[KB] security_policy.md
[KB] pricing_guide.pdf
[KB] pricing_guide.md
[TRANSCRIPTS] meeting_005.txt
[TRANSCRIPTS] meeting_004.txt
[TRANSCRIPTS] meeting_001.txt
[TRANSCRIPTS] meeting_003.txt
[TRANSCRIPTS] meeting_002.txt
[TRANSCRIPTS] meeting_cyberdyne_001.txt[0m[32;1m[1;3mBased on the output of `list_files`, it seems that there is no file named 'discount_policy.md'. However, I notice that there is a file named 'security_policy.md' in the knowledge base directory. Let's assume that this might be related to discount policies.





[32;1m[1;3mThought: I need to read the file 'meeting_cyberdyne_001.txt' first.
Action: list_files
Action Input: None[0m[38;5;200m[1;3m[KB] product_whitepaper.md
[KB] security_policy.md
[KB] pricing_guide.pdf
[KB] pricing_guide.md
[TRANSCRIPTS] meeting_005.txt
[TRANSCRIPTS] meeting_004.txt
[TRANSCRIPTS] meeting_001.txt
[TRANSCRIPTS] meeting_003.txt
[TRANSCRIPTS] meeting_002.txt
[TRANSCRIPTS] meeting_cyberdyne_001.txt[0m[32;1m[1;3mThought: I need to read the file 'meeting_cyberdyne_001.txt' now that it's listed.
Action: read_document
Action Input: meeting_cyberdyne_001.txt[0m[33;1m[1;3m
    [00:00:00] Alice (CEO): Okay, let's get started. Thanks for joining.
    [00:00:15] Sales Rep: Thanks Alice. I wanted to review the timeline.
    [00:01:00] Bob (CTO): Yeah, the timeline is tight. I still haven't received the API keys.
    [00:01:30] Sales Rep: I can get those to you.
    [00:02:00] Alice (CEO): Bob, can you please finalize the firewall configuration by next Friday? We can'

|   | task_id | task_name | agent | passed | quality_score | duration_seconds | fail_reasons | error |
|---|---|---|---|---|---|---|---|---|
| 0 | SALES-001 | Cold Outreach - CyberDyne (CTO) | Agent A | False | 3 | 5.25 | json_schema_valid: Output was not valid JSON.;... | False |
| 1 | SALES-001 | Cold Outreach - CyberDyne (CTO) | Agent B | False | 3 | 24.56 | json_schema_valid: Output was not valid JSON.;... | False |
| 2 | SALES-001 | Cold Outreach - CyberDyne (CTO) | Agent C | False | 3 | 33.6 | json_schema_valid: Output was not valid JSON.;... | True |
| 3 | SALES-002 | Discount Negotiation - Stark Industries (20%) | Agent A | False | 3 | 6.3 | json_schema_valid: Missing keys ['account_id',... | False |
| 4 | SALES-002 | Discount Negotiation - Stark Industries (20%) | Agent B | False | 3 | 168.87 | json_schema_valid: Missing keys ['account_id',... | False |
| 5 | SALES-002 | Discount Negotiation - Stark Industries (20%) | Agent C | False | 3 | 33.46 | json_schema_valid: Output was not valid JSON.;... | True |
| 6 | SALES-003 | Pursuit Plan One-Pager - Wayne Enterprises | Agent A | False | 3 | 51.32 | json_schema_valid: Missing keys ['account_id',... | False |
| 7 | SALES-003 | Pursuit Plan One-Pager - Wayne Enterprises | Agent B | False | 3 | 15.65 | json_schema_valid: Output was not valid JSON.;... | False |
| 8 | SALES-003 | Pursuit Plan One-Pager - Wayne Enterprises | Agent C | False | 3 | 32.5 | json_schema_valid: Output was not valid JSON.;... | True |
| 9 | SALES-004 | Win-back Discovery Agenda - Acme Corp (Churned) | Agent A | False | 3 | 54.26 | json_schema_valid: Output was not valid JSON.;... | False |
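
The leaderboard rows are easier to compare once they are aggregated per agent. The sketch below computes pass rate, mean quality score, and mean duration using only the standard library. The helper name `summarize_leaderboard` is hypothetical, and it assumes the CSV keeps the column names shown in the output (`agent`, `passed`, `quality_score`, `duration_seconds`) with `passed` written as the text `True` or `False`.

```python
import csv
from collections import defaultdict


def summarize_leaderboard(csv_lines):
    """Aggregate leaderboard rows per agent.

    `csv_lines` is any iterable of CSV text lines, for example an open file.
    Hypothetical helper; column names are taken from the leaderboard output.
    """
    totals = defaultdict(
        lambda: {"runs": 0, "passed": 0, "quality": 0.0, "seconds": 0.0}
    )
    for row in csv.DictReader(csv_lines):
        agent = totals[row["agent"]]
        agent["runs"] += 1
        # Assumption: booleans are serialized as the strings "True"/"False".
        agent["passed"] += row["passed"] == "True"
        agent["quality"] += float(row["quality_score"])
        agent["seconds"] += float(row["duration_seconds"])

    return {
        name: {
            "pass_rate": t["passed"] / t["runs"],
            "mean_quality": t["quality"] / t["runs"],
            "mean_seconds": t["seconds"] / t["runs"],
        }
        for name, t in totals.items()
    }


# Usage against the file written by the run (path taken from Configuration):
#
#     with open("outputs/leaderboard.csv", newline="") as f:
#         summary = summarize_leaderboard(f)
```

Duration is worth reading alongside the scores: in the rows shown, every agent has the same quality score of 3, so latency is the only column here that separates them.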
