This notebook is for experimentation: it tracks how changes to our agent affect its performance.

### Setup

In [1]:
# Libraries
from typing import Literal
from datasets import load_dataset
import pandas as pd
import os
from random import sample
from dotenv import load_dotenv
from huggingface_hub import snapshot_download, login



In [2]:
# Download the files attached to the GAIA tasks
load_dotenv()
login(token=os.getenv(key="HF_TOKEN_CHAPPIE"))
gaia_repo_dir = snapshot_download(repo_id="gaia-benchmark/GAIA", repo_type="dataset")
print(gaia_repo_dir)

Fetching 114 files: 100%|██████████| 114/114 [00:00<00:00, 2811.46it/s]

/home/santiagoal/.cache/huggingface/hub/datasets--gaia-benchmark--GAIA/snapshots/897f2dfbb5c952b5c3c1509e648381f9c7b70316





In [3]:
# Local Modules
import sys

sys.path.append("../src/")
sys.path.append("../src/agents/")
sys.path.append("../src/utils/")

import react  # My AI assistant
import gaia_eval

Langfuse client is disabled since no public_key was provided as a parameter or environment variable 'LANGFUSE_PUBLIC_KEY'. See our docs: https://langfuse.com/docs/sdk/python/low-level-sdk#initialize-client


In [4]:
# Paths
output_results_path = "/home/santiagoal/current-projects/chappie/data/agent_experiments/"
experiment_iterations_path = os.path.join(output_results_path, "iterations/")
summary_experiments_path = os.path.join(output_results_path, "summary.csv")

In [5]:
# GAIA dataset
gaia_dataset = load_dataset("gaia-benchmark/GAIA", "2023_level1")

The docs suggest focusing on the validation set for development, so let's explore it.

In [6]:
dev_set = gaia_dataset["validation"]
dev_set

Dataset({
    features: ['task_id', 'Question', 'Level', 'Final answer', 'file_name', 'file_path', 'Annotator Metadata'],
    num_rows: 53
})

In [7]:
dev_set.shape

(53, 7)

### Explore GAIA Lvl. 1 Questions

In [8]:
df_dev = pd.DataFrame(dev_set)

In [9]:
df_dev.head()

Unnamed: 0,task_id,Question,Level,Final answer,file_name,file_path,Annotator Metadata
0,e1fc63a2-da7a-432f-be78-7c4a95598703,If Eliud Kipchoge could maintain his record-ma...,1,17.0,,,{'Steps': '1. Googled Eliud Kipchoge marathon ...
1,8e867cd7-cff9-4e6c-867a-ff5ddc2550be,How many studio albums were published by Merce...,1,3.0,,,{'Steps': '1. I did a search for Mercedes Sosa...
2,ec09fa32-d03f-4bf8-84b0-1f16922c3ae4,Here's a fun riddle that I think you'll enjoy....,1,3.0,,,{'Steps': 'Step 1: Evaluate the problem statem...
3,5d0080cb-90d7-4712-bc33-848150e917d3,What was the volume in m^3 of the fish bag tha...,1,0.1777,,,"{'Steps': '1. Searched '""Can Hiccup Supply Eno..."
4,a1e91b78-d3d8-4675-bb8d-62741b4b68a6,In the video https://www.youtube.com/watch?v=L...,1,3.0,,,{'Steps': '1. Navigate to the YouTube link. 2....


To start the dev phase, let's observe how our React agent performs on a sample of questions.

In [10]:
# DEL
# Experiment with only .txt-like tasks
# Temporary cell
#df_dev = df_dev[df_dev["file_path"].map(lambda f: f.endswith("44.png"))]

In [20]:
n_samples = min(20, df_dev.shape[0])
sample_questions = df_dev.sample(n_samples)
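Because the sample is drawn at random, two iterations may be scored on different questions, which makes their accuracies hard to compare. A minimal sketch of a seeded draw follows, assuming a fixed seed is acceptable for this tracking workflow. The helper `sample_row_positions` and the seed value are hypothetical and not part of the notebook; pandas' `DataFrame.sample` also accepts a `random_state` argument for the same purpose.

```python
import random
from typing import List


def sample_row_positions(n_rows: int, n_samples: int, seed: int = 42) -> List[int]:
    """Return a reproducible, sorted list of row positions to evaluate.

    Hypothetical helper: the seed value is an arbitrary assumption.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    k = min(n_samples, n_rows)  # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), k))


# 53 rows is the size of the validation split shown earlier
positions = sample_row_positions(n_rows=53, n_samples=20)
print(positions)
```

With a DataFrame, the positions could then be used as `df_dev.iloc[positions]`.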

In [21]:
# Dataset copy just for eval

results_df = sample_questions.copy()[["Question", "Final answer", "file_path"]]
results_df["Agent response"] = None
results_df["is_correct"] = None  # 1 if it is correct, 0 otherwise

results_df = results_df[["Question", "file_path", "Agent response", "Final answer", "is_correct"]]

In [22]:
results_df

Unnamed: 0,Question,file_path,Agent response,Final answer,is_correct
23,If there is anything that doesn't make sense i...,,,Guava,
2,Here's a fun riddle that I think you'll enjoy....,,,3,
26,Examine the video at https://www.youtube.com/w...,,,Extremely,
1,How many studio albums were published by Merce...,,,3,
14,"In the fictional language of Tizin, basic sent...",,,Maktay mato apple,
16,Review the chess position provided in the imag...,/home/santiagoal/.cache/huggingface/hub/datase...,,Rd5,
31,"In the Scikit-Learn July 2017 changelog, what ...",,,BaseLabelPropagation,
49,What country had the least number of athletes ...,,,CUB,
15,In Nature journal's Scientific Reports confere...,,,diamond,
3,What was the volume in m^3 of the fish bag tha...,,,0.1777,


### Experiment and Track performance on dev set

In [23]:
# Get XP history
old_experiments_data = pd.read_csv(summary_experiments_path)
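Reading the history fails if `summary.csv` does not exist yet, for example on a fresh checkout. The sketch below shows one possible way to bootstrap an empty history file first. The helper `ensure_summary_file` is hypothetical, and the column names are taken from the summary table this notebook prints. Note that the next cell reads the last row of the history, so it would still need at least one recorded experiment.

```python
import csv
import os
import tempfile

# Column names as they appear in the experiment summary table
SUMMARY_COLUMNS = ["iteration", "experiment", "agent", "tools", "accuracy"]


def ensure_summary_file(path: str) -> bool:
    """Create a header-only summary CSV if it is missing. Returns True if created."""
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(SUMMARY_COLUMNS)
    return True


# Demo in a temporary directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "agent_experiments", "summary.csv")
    created_first = ensure_summary_file(demo_path)   # file is created
    created_second = ensure_summary_file(demo_path)  # file already exists
    with open(demo_path, newline="") as f:
        header = next(csv.reader(f))
```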

In [24]:
# Form

# Get last information 
latest_experiment = old_experiments_data.iloc[-1]
#latest_xp_name = latest_experiment["experiment"]
latest_agent = latest_experiment["agent"]
latest_tools = latest_experiment["tools"]
latest_iteration = latest_experiment["iteration"]
current_iteration = latest_iteration + 1

# Get XP name
msg = "Type the experiment name (E.g. Integrate Whisper Transcriber)"
usr_response = ""
while usr_response.lower() == "":
    usr_response = input(msg + ": ")
xp_name = usr_response


# Get agent architecture
usr_response = ""
msg = f"Is your agent different from '{latest_agent}'? [y/N]"
usr_response = input(msg + ": ")
# Template filled with the latest invalid response on every retry
warning_msg = "Oops! '{}' is not a valid response, pls try again. "
while usr_response.lower() not in ("y", "n"):
    usr_response = input(warning_msg.format(usr_response) + msg + ": ")

if usr_response.lower() == "n":
    agent_architecture = latest_agent
elif usr_response.lower() == "y":
    agent_architecture = input("Please type the new agent architecture to track" + ": ")

# Get Tools
usr_response = ""
msg = "Are there new tools to track? [y/N]"
usr_response = input(msg + ": ")
while usr_response.lower() not in ("y", "n"):
    usr_response = input(warning_msg.format(usr_response) + msg + ": ")

if usr_response.lower() == "n":
    new_tools = ""
elif usr_response.lower() == "y":
    new_tools = input("Please type the new tools list to track (comma separated)" + ": ")

# Only append when there are new tools, to avoid a trailing ", "
new_tools_list = latest_tools + ", " + new_tools if new_tools else latest_tools

# Format xp name and path
xp_name_snake = str(current_iteration) + "_" + xp_name.replace(" ", "_").replace(",", "").lower()
xp_path = os.path.join(experiment_iterations_path, xp_name_snake + ".csv")
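The tools column is a comma-separated string that grows with each iteration, so it can pick up empty or repeated entries (one row of the summary table ends with a dangling comma). A small sketch of a merge that strips blanks and duplicates follows. `merge_tools` is a hypothetical helper, not something the notebook defines.

```python
def merge_tools(previous: str, new: str) -> str:
    """Merge two comma-separated tool lists, dropping blanks and duplicates.

    Order of first appearance is preserved.
    """
    seen = []
    for name in (previous + "," + new).split(","):
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return ", ".join(seen)


print(merge_tools("Aritmetic, Search, Code", ""))
print(merge_tools("Aritmetic, Search,", "Chess, Search"))
```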

In [26]:
# Compute and save agent responses and their evaluation

results_df["Agent response"] = results_df.apply(func=gaia_eval.get_agent_response, axis=1)
results_df["is_correct"] = results_df.apply(func=gaia_eval.evaluate_response, axis=1)
results_df

attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: /home/santiagoal/.cache/huggingface/hub/datasets--gaia-benchmark--GAIA/snapshots/897f2dfbb5c952b5c3c1509e648381f9c7b70316/2023/validation/cca530fc-4052-43b2-b130-b30968d8aa44.png


Using CPU. Note: This module is much faster with a GPU.


attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: /home/santiagoal/.cache/huggingface/hub/datasets--gaia-benchmark--GAIA/snapshots/897f2dfbb5c952b5c3c1509e648381f9c7b70316/2023/validation/a3fbeb63-0e8c-4a11-bff6-0e3b484c3e9c.pptx
attached_files: 
attached_files: 
attached_files: 
attached_files: 
attached_files: /home/santiagoal/.cache/huggingface/hub/datasets--gaia-benchmark--GAIA/snapshots/897f2dfbb5c952b5c3c1509e648381f9c7b70316/2023/validation/1f975693-876d-457b-a649-393859e79bf3.mp3




attached_files: /home/santiagoal/.cache/huggingface/hub/datasets--gaia-benchmark--GAIA/snapshots/897f2dfbb5c952b5c3c1509e648381f9c7b70316/2023/validation/99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3.mp3




attached_files: 
String To maximize your odds of winning the grand prize you should choose ball number **1**. cannot be normalized to number str.
String I couldn't find the specific number of studio albums released by Mercedes Sosa between 2000 and 2009. Please check the latest version of her discography on Wikipedia for detailed information. cannot be normalized to number str.
String Please provide the file containing the University of Leicester paper so I can extract the required information. cannot be normalized to number str.
String I can only process text files. Please convert the PowerPoint presentation to a text format and provide it again. cannot be normalized to number str.
String Please provide the text of the poem "Father Son and Holy Ghost" by Audre Lorde and I will help you identify the stanza with indented lines. cannot be normalized to number str.


Unnamed: 0,Question,file_path,Agent response,Final answer,is_correct
23,If there is anything that doesn't make sense i...,,Guava,Guava,1
2,Here's a fun riddle that I think you'll enjoy....,,To maximize your odds of winning the grand pri...,3,0
26,Examine the video at https://www.youtube.com/w...,,I can't access or analyze video content direct...,Extremely,0
1,How many studio albums were published by Merce...,,I couldn't find the specific number of studio ...,3,0
14,"In the fictional language of Tizin, basic sent...",,"""Maktay Zapple Pa""",Maktay mato apple,0
16,Review the chess position provided in the imag...,/home/santiagoal/.cache/huggingface/hub/datase...,Rd5,Rd5,1
31,"In the Scikit-Learn July 2017 changelog, what ...",,Please provide the file path to the Scikit-Lea...,BaseLabelPropagation,0
49,What country had the least number of athletes ...,,To find the country with the least number of a...,CUB,0
15,In Nature journal's Scientific Reports confere...,,Please provide the file containing the confere...,diamond,0
3,What was the volume in m^3 of the fish bag tha...,,Please provide the file containing the Univers...,0.1777,0


In [27]:
accuracy = results_df["is_correct"].mean()
print(f" Experiment Accuracy: {(100 * accuracy):.2f} %")

 Experiment Accuracy: 20.00 %
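With only 20 sampled questions, a single accuracy figure is noisy, so small differences between iterations may not be meaningful. As a rough illustration, the sketch below computes a Wilson score interval for a proportion. The choice of the Wilson interval and of `z = 1.96` (about 95% confidence) are assumptions made for this example, not something the notebook uses.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# 20% accuracy on 20 questions corresponds to 4 correct answers
low, high = wilson_interval(successes=4, n=20)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

For 4 correct answers out of 20 the interval spans roughly 8% to 42%, which is wide enough to cover every accuracy recorded in the experiment history so far.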


### Save Results

In [29]:
# Save current experiment
results_df.to_csv(xp_path, index=False)

# Update experimentation history

xp_results = {
    "iteration": current_iteration,
    "experiment": xp_name,
    "agent": agent_architecture,
    "tools": new_tools_list,
    "accuracy": round(results_df.copy()["is_correct"].mean(), 2),
}

updated_experiments_data = pd.concat([old_experiments_data, pd.DataFrame([xp_results])], ignore_index=True)
updated_experiments_data.drop_duplicates(inplace=True)
updated_experiments_data.to_csv(summary_experiments_path, index=False)

In [30]:
updated_experiments_data

Unnamed: 0,iteration,experiment,agent,tools,accuracy
0,1,Implement Calculator Tool,React agent,Aritmetic,0.0
1,2,Implement Search and Code tools,React agent,"Aritmetic, Search, Code",0.17
2,3,Integrate Whisper Audio Transcriber,React agent,"Aritmetic, Search, Code, Audio Transcriber",0.15
3,4,Test Workflow,React agent,"Aritmetic, Search, Code, Audio Transcriber,",0.2
4,5,Integrate Text processing tool,React agent,"Aritmetic, Search, Code, Audio Transcriber, Te...",0.1
5,6,Integrate Text Handler tool,React agent,"Aritmetic, Search, Code, Audio Transcriber, Te...",0.2
6,7,Test Agent performance against tasks with atta...,React agent,"Aritmetic, Search, Code, Audio Transcriber, Te...",0.0
7,8,Integrate Chess Tool,React agent,"Aritmetic, Search, Code, Audio Transcriber, Te...",0.2


### Evaluation Summary



In [None]:
# Split responses into correct and incorrect ones, and collect the file extensions seen in each group
good_responses = results_df[results_df["is_correct"] == 1].copy()
good_extensions = (
    good_responses["file_path"]
    .fillna("No files")
    .apply(lambda row: row.split(".")[-1] if "." in row else "No files")
    .unique()
)
good_file_management = ", ".join(sorted(good_extensions))

bad_responses = results_df[results_df["is_correct"] == 0].copy()
bad_extensions = (
    bad_responses["file_path"]
    .fillna("No files")
    .apply(lambda row: row.split(".")[-1] if "." in row else "No files")
    .unique()
)
bad_file_management = ", ".join(sorted(bad_extensions))

# Accuracy on tasks without an attached file (missing paths are treated as empty)
performance_no_attached = results_df[results_df["file_path"].fillna("").str.len() == 0]["is_correct"].mean()

In [32]:
print(
    "Insights\n\n",
    "-" * 50,
    f"\n\n1. The Agent has an overall accuracy of {100 * accuracy:.1f}%"
    f"\n2. The Agent succeeded at questions with the following file types: {good_file_management} ({good_responses.is_correct.shape[0]}/{results_df.shape[0]})",
    f"\n3. The Agent failed at questions with the following file types: {bad_file_management} ({bad_responses.is_correct.shape[0]}/{results_df.shape[0]})",
    f"\n4. The Agent has an accuracy of {100 * performance_no_attached:.1f}% at tasks with no attached files"
)

Insights

 -------------------------------------------------- 

1. The Agent has an overall accuracy of 20.0%
2. The Agent succeeded at questions with the following file types: No files, mp3, png (4/20) 
3. The Agent failed at questions with the following file types: No files, pptx (16/20) 
4. The Agent has an accuracy of 6.2% at tasks with no attached files
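Listing which extensions appear among correct and incorrect answers does not say how often each one succeeds. A per-extension breakdown could be sketched as below. `accuracy_by_extension` is a hypothetical helper, and the demo rows are made-up toy data rather than results from this run. It uses `os.path.splitext`, so dots in directory names (such as `.cache`) are not mistaken for extensions.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_extension(rows: Iterable[Tuple[object, int]]) -> Dict[str, Tuple[int, int, float]]:
    """Map each file extension to (correct, total, accuracy).

    rows: iterable of (file_path, is_correct); empty or missing paths count as "No files".
    """
    totals = defaultdict(lambda: [0, 0])  # extension -> [correct, total]
    for file_path, is_correct in rows:
        path = file_path if isinstance(file_path, str) else ""
        if not path:
            ext = "No files"
        else:
            ext = os.path.splitext(path)[1].lstrip(".").lower() or "no extension"
        totals[ext][0] += int(is_correct)
        totals[ext][1] += 1
    return {ext: (c, t, c / t) for ext, (c, t) in totals.items()}


# Toy rows for illustration only (not the notebook's results)
toy_rows = [("", 1), ("", 0), ("/tmp/a.png", 1), ("/tmp/b.pptx", 0), ("/tmp/.cache/c.mp3", 1)]
stats = accuracy_by_extension(toy_rows)
for ext, (correct, total, acc) in sorted(stats.items()):
    print(f"{ext}: {correct}/{total} ({100 * acc:.0f}%)")
```

With the notebook's data, the rows could come from `results_df[["file_path", "is_correct"]].itertuples(index=False)`.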


Conclusions

1. The Agent makes poor use of the **web search tool**. The tool probably needs to be expanded, since it only returns the main web search results: that is useful for summarizing but not for deep web search.
2. Reasoning tasks have a large margin for improvement (0% success on reasoning tasks). Some questions require watching online videos, so tools that address this kind of task may be necessary.
3. Some tasks seem addressable by implementing a YouTube video scraper and an office document scanner.

Next steps

1. Build tools to deal with .png, .pptx, .py, and .xlsx files
2. Investigate improvements for pure-reasoning tasks
3. Build tools that enable the agent to watch / process YouTube videos
4. Strengthen the web search tool for deep research
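One way to organize the file-handling tools from the first step is a registry that maps each extension to a loader, with an explicit fallback for unsupported types. The sketch below is purely illustrative: `register`, `HANDLERS`, `read_text_file`, and `load_attachment` are hypothetical names, and only a plain-text loader is implemented. Real handlers for .pptx, .xlsx, or images would need dedicated libraries.

```python
import os
import tempfile
from typing import Callable, Dict

FileHandler = Callable[[str], str]
HANDLERS: Dict[str, FileHandler] = {}  # extension (with dot) -> loader


def register(*extensions: str):
    """Decorator that registers a loader for one or more extensions."""
    def decorator(func: FileHandler) -> FileHandler:
        for ext in extensions:
            HANDLERS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".py")
def read_text_file(path: str) -> str:
    """Return the raw text content of a plain-text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def load_attachment(path: str) -> str:
    """Dispatch to the loader registered for the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    handler = HANDLERS.get(ext)
    if handler is None:
        return f"Unsupported file type: {ext or 'no extension'}"
    return handler(path)


# Demo with a temporary file and an unsupported extension
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "script.py")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("print('hi')\n")
    loaded = load_attachment(demo_path)
unsupported = load_attachment("slides.pptx")
```

Adding support for a new format then only requires registering one more function, and the agent gets a clear message for anything not yet covered.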


### Main Questions to solve

$\square$ Which are the core tools for each question level?
  - Level 1:
  - Level 2:
  - Level 3:

