This notebook is for experimentation: it tracks how changes to our agent affect its performance.

### Setup

### Explore GAIA Lvl. 1 Questions

### Experiment and Track performance on dev set

### Save Results

In [1]:
from datasets import load_dataset
from typing import Literal



In [2]:
output_results_path = "/home/santi/current-projects/chappie/data/agent_experiments/"

In [3]:
gaia_dataset = load_dataset("gaia-benchmark/GAIA", "2023_level1")

The docs suggest focusing on the validation set for development, so let's explore it.

In [4]:
dev_set = gaia_dataset["validation"]

In [5]:
dev_set

Dataset({
    features: ['task_id', 'Question', 'Level', 'Final answer', 'file_name', 'file_path', 'Annotator Metadata'],
    num_rows: 53
})

### Gather Level 1 Dev Questions

In [6]:
import pandas as pd

In [7]:
df_dev = pd.DataFrame(dev_set)

In [8]:
df_dev.head()

Unnamed: 0,task_id,Question,Level,Final answer,file_name,file_path,Annotator Metadata
0,e1fc63a2-da7a-432f-be78-7c4a95598703,If Eliud Kipchoge could maintain his record-ma...,1,17.0,,,{'Steps': '1. Googled Eliud Kipchoge marathon ...
1,8e867cd7-cff9-4e6c-867a-ff5ddc2550be,How many studio albums were published by Merce...,1,3.0,,,{'Steps': '1. I did a search for Mercedes Sosa...
2,ec09fa32-d03f-4bf8-84b0-1f16922c3ae4,Here's a fun riddle that I think you'll enjoy....,1,3.0,,,{'Steps': 'Step 1: Evaluate the problem statem...
3,5d0080cb-90d7-4712-bc33-848150e917d3,What was the volume in m^3 of the fish bag tha...,1,0.1777,,,"{'Steps': '1. Searched '""Can Hiccup Supply Eno..."
4,a1e91b78-d3d8-4675-bb8d-62741b4b68a6,In the video https://www.youtube.com/watch?v=L...,1,3.0,,,{'Steps': '1. Navigate to the YouTube link. 2....


To start the dev phase, let's observe how our React agent performs on a small sample of questions.

In [9]:
sample_questions = df_dev.sample(12)

In [10]:
# Dataset copy just for eval

results_df = sample_questions.copy()[["Question", "Final answer"]]
results_df["Agent response"] = None
results_df["is_correct"] = None  # 1 if it is correct, 0 otherwise

results_df = results_df[["Question", "Agent response", "Final answer", "is_correct"]]

In [11]:
results_df

Unnamed: 0,Question,Agent response,Final answer,is_correct
17,"In the year 2022, and before December, what do...",,research,
43,In Audre Lorde’s poem “Father Son and Holy Gho...,,2,
31,"In the Scikit-Learn July 2017 changelog, what ...",,BaseLabelPropagation,
1,How many studio albums were published by Merce...,,3,
12,In Emily Midkiff's June 2014 article in a jour...,,fluffy,
7,An office held a Secret Santa gift exchange wh...,,Fred,
36,"Bob was invited to participate in a game show,...",,16000,
51,The attached Excel file contains the sales of ...,,89706.00,
47,Where were the Vietnamese specimens described ...,,Saint Petersburg,
8,".rewsna eht sa ""tfel"" drow eht fo etisoppo eht...",,Right,


### React Agent Development

In [None]:
import sys

# Make the local agent and utility modules importable from the notebook
sys.path.append("../src/")
sys.path.append("../src/agents/")
sys.path.append("../src/utils/")

import react  # My AI assistant

In [12]:
#react.run_app(user_query="Calculate the result of: (12 multiplied by 3) minus (15 divided by 5) plus (8 added to 2).")

Can I create a column agent_answer in order to evaluate? (i.e. running multiple questions in parallel)
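One way to do this is sketched below: a thread pool sends several questions to the agent at once and returns the answers in input order, so they can be assigned straight to a new column. This is a minimal sketch, not the notebook's actual code. `answer_in_parallel` and `fake_agent` are hypothetical names, and it assumes the agent call is I/O-bound and safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_in_parallel(questions, agent_fn, max_workers=4):
    """Call agent_fn on every question concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return [str(result) for result in pool.map(agent_fn, questions)]


def fake_agent(question: str) -> str:
    # Hypothetical stand-in for react.run_app(user_query=question)
    return question.upper()


demo_questions = ["what is 2 + 2?", "capital of France?"]
demo_answers = answer_in_parallel(demo_questions, fake_agent)
```

With the real agent, the same helper could fill the column with something like `results_df["agent_answer"] = answer_in_parallel(results_df["Question"].tolist(), lambda q: react.run_app(user_query=q))`, assuming `react.run_app` tolerates concurrent calls.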

In [None]:
# TODO: move to utils

def eval_answer(row: pd.Series) -> Literal[0, 1]:
    """
    DEPRECATED. Evaluate an agent response against a GAIA-style ground truth.
    Only an exact match counts as correct.

    Parameters
    ----------
    row: pd.Series
        Row with the model response in "Agent response" and the
        ground truth in "Final answer"

    Returns:
        Literal[0, 1]: 1 if the answer is correct, 0 otherwise

    Example:
        >>> eval_answer(pd.Series({"Agent response": "32.0", "Final answer": "32.1"}))
        0
    """
    model_response = row["Agent response"]
    gt_answer = row["Final answer"]
    return 1 if (model_response == gt_answer) else 0




In [18]:
import gaia_scorer

In [24]:
# TODO: move to utils

def evaluate_response(row: pd.Series) -> int:
    """Score one row with the GAIA scorer: 1 if the agent response matches the ground truth, 0 otherwise."""
    model_res = row["Agent response"]
    gt_ans = row["Final answer"]
    score = gaia_scorer.question_scorer(
        model_answer=model_res,
        ground_truth=gt_ans
    )
    return int(score)

def get_agent_response(row: pd.Series) -> str:
    """Run the agent on the row's question and return its response as a string."""
    user_query = row["Question"]
    agent_response = react.run_app(user_query=user_query)
    return str(agent_response)

In [25]:
# Save agent responses and their evaluation

results_df["Agent response"] = results_df.apply(func=get_agent_response, axis=1)


In [26]:

results_df["is_correct"] = results_df.apply(func=evaluate_response, axis=1)

String I couldn't find the specific number of studio albums published by Mercedes Sosa between 2000 and 2009 from the search. Please check the latest version of English Wikipedia for detailed discography information. cannot be normalized to number str.
String To solve this problem we need to determine the distribution of coins in the boxes that satisfies the given conditions and then find Bob's optimal strategy to maximize his winnings.

### Conditions:
1. Total coins = 30
2. One box must contain at least 2 coins.
3. One box must contain 6 more coins than another box.

### Analysis:
Let's denote the number of coins in the three boxes as \( x \) \( y \) and \( z \) such that \( x + y + z = 30 \).

- Without loss of generality assume \( x \leq y \leq z \).
- From the conditions we have:
  - \( x \geq 2 \)
  - \( z = x + 6 \)

### Solving the Equations:
Substitute \( z = x + 6 \) into the total:
\[ x + y + (x + 6) = 30 \]
\[ 2x + y + 6 = 30 \]
\[ 2x + y = 24 \]
\[ y = 24 - 2x \]

Since \(



In [27]:
# issue: Ground truth answers have str type 
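Because the ground truth answers are stored as strings, a plain `==` comparison treats `"3"` and `"3.0"` as different answers. The sketch below illustrates one way around that: compare as numbers whenever the ground truth parses as one, and as case-insensitive text otherwise. It is a simplified illustration and not the `gaia_scorer` implementation; `answers_match` is a hypothetical helper.

```python
def answers_match(model_answer, ground_truth) -> int:
    """Return 1 if the answers agree, 0 otherwise.

    Numbers are compared as floats; everything else as case-insensitive text.
    """
    def to_float(text):
        try:
            # Drop thousands separators before parsing, e.g. "89,706" -> 89706.0
            return float(str(text).replace(",", "").strip())
        except ValueError:
            return None

    truth_num = to_float(ground_truth)
    if truth_num is not None:
        # Numeric ground truth: the model answer must parse to the same number
        return int(to_float(model_answer) == truth_num)
    return int(str(model_answer).strip().lower() == str(ground_truth).strip().lower())
```

A long free-text response to a numeric question simply fails to parse and scores 0, which appears to be the situation reported in the scorer output above.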

In [40]:
results_df.to_csv("search_and_code_tools_integrated.csv", index=False)

In [None]:
xp_name = "search and code tools integrated"
results_df.to_csv(output_results_path + xp_name + ".csv", index=False)

In [31]:
accuracy = results_df["is_correct"].mean()

In [32]:
f"{100 * accuracy:.1f} %"

'16.7 %'

In [38]:
path = output_results_path + "historical.csv"

performance_df = pd.read_csv(path)

results = {
    "iteration": 2,
    "agent": "React agent",
    "tools": "Aritmetic, Search, Code",
    "accuracy": round(results_df["is_correct"].mean(), 2),
}

performance_df = pd.concat([performance_df, pd.DataFrame([results])], ignore_index=True)
performance_df.drop_duplicates(inplace=True)
performance_df.to_csv(path, index=False)

In [39]:
performance_df

Unnamed: 0,iteration,agent,tools,accuracy
0,1,React agent,Aritmetic,0.0
1,2,React agent,"Aritmetic, Search, Code",0.17
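The tracking cell above assumes `historical.csv` already exists and that the iteration number is typed in by hand. A minimal sketch of a helper that creates the file on first use and numbers iterations automatically could look like this. `log_experiment` and `FIELDS` are hypothetical names, it uses the standard `csv` module instead of pandas to stay self-contained, and it assumes one row per iteration.

```python
import csv
import os
import tempfile

FIELDS = ["iteration", "agent", "tools", "accuracy"]


def log_experiment(path, agent, tools, accuracy):
    """Append one experiment row to the CSV at `path`, creating it if needed."""
    rows = []
    if os.path.exists(path):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
    record = {
        "iteration": len(rows) + 1,  # assumption: one row per iteration
        "agent": agent,
        "tools": tools,
        "accuracy": round(accuracy, 2),
    }
    rows.append(record)
    # Rewrite the whole file so the header is always present
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return record


# Demo in a throwaway directory so no real results file is touched
demo_path = os.path.join(tempfile.mkdtemp(), "historical.csv")
first = log_experiment(demo_path, "React agent", "Arithmetic", 0.0)
second = log_experiment(demo_path, "React agent", "Arithmetic, Search, Code", 2 / 12)
```

Unlike the `drop_duplicates` approach, this sketch always appends, so re-running a cell records a new iteration. Whether that is desirable depends on how the history is meant to be read.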


How to access and handle this dataset?

### Main Questions to solve

$\square$ What are the core tools for each question level?
  - Level 1:
  - Level 2:
  - Level 3:

