# GAIA Agent Testing Notebook

- This notebook explores the development of an AI agent for the GAIA benchmark. 
- It analyzes GAIA tasks, extracts tool requirements, tests vector database retrieval, and implements a LangGraph architecture with Gemini. 
- The code shows prompting techniques and necessary tools for effectively solving GAIA challenges.

## Set the Env

In [2]:
# Enable auto-reloading of external modules when they change
%load_ext autoreload

# Set auto-reload mode to 2: reload all modules (except those excluded by %aimport) before executing code
%autoreload 2

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

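## LangSmith tracking
# os.environ values must be strings: assigning None (variable missing) raises a TypeError,
# so only re-export the variables that are actually set
for var in ("LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT"):
    value = os.getenv(var)
    if value:
        os.environ[var] = value
os.environ["LANGCHAIN_TRACING_V2"] = "true"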

## Load Data

In [None]:
import json

# Load metadata.jsonl: one JSON object per line
with open('data/metadata.jsonl', 'r', encoding='utf-8') as f:
    # Skip blank lines so a trailing newline does not break json.loads
    json_QA = [json.loads(line) for line in f if line.strip()]

### Data Analyses

Here we analyze the data to find out which tools a robust agent needs.

In [5]:
json_QA[0] # Display the first entry to check the structure

{'task_id': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466',
 'Question': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?',
 'Level': 2,
 'Final answer': 'egalitarian',
 'file_name': '',
 'Annotator Metadata': {'Steps': '1. Go to arxiv.org and navigate to the Advanced Search page.\n2. Enter "AI regulation" in the search box and select "All fields" from the dropdown.\n3. Enter 2022-06-01 and 2022-07-01 into the date inputs, select "Submission date (original)", and submit the search.\n4. Go through the search results to find the article that has a figure with three axes and labels on each end of the axes, titled "Fairness in Agreement With European Values: An Interdisciplinary Perspective on AI Regulation".\n5. Note the six words used as labels: deon

In [None]:
json_QA[0]['Question'] 

'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?'
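Before sampling individual tasks, a quick tally of difficulty levels and attached files gives a sense of the dataset's shape. The sketch below assumes every record carries the `Level` and `file_name` fields seen in the entry above. The three entries in `records` are made-up stand-ins so the snippet runs on its own, and `summarize` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def summarize(records):
    """Count tasks per level and how many tasks reference an attached file."""
    levels = Counter(r["Level"] for r in records)
    with_file = sum(1 for r in records if r.get("file_name"))
    return dict(sorted(levels.items())), with_file

# Made-up stand-in records; in the notebook, pass json_QA instead
records = [
    {"Level": 2, "file_name": ""},
    {"Level": 1, "file_name": "table.xlsx"},
    {"Level": 1, "file_name": ""},
]
levels, with_file = summarize(records)
print(levels, with_file)
```

Tasks with a non-empty `file_name` are the ones likely to need file-handling tools, so this count is a useful complement to the tool analysis later on.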

In [7]:
import random

random_samples = random.sample(json_QA, 3)
for sample in random_samples:
    print("=" * 50)
    print(f"Task ID: {sample['task_id']}")
    print(f"Question: {sample['Question']}")
    print(f"Level: {sample['Level']}")
    print(f"Final Answer: {sample['Final answer']}")
    
    print("Annotator Metadata: ")
    print("  ├── Steps: ")
    for step in sample['Annotator Metadata']['Steps'].split('\n'):
        print(f"  │      ├── {step}")
    print(f"  ├── Number of steps: {sample['Annotator Metadata']['Number of steps']}")
    print(f"  ├── How long did this take?: {sample['Annotator Metadata']['How long did this take?']}")
    print("  ├── Tools:")
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        print(f"  │      ├── {tool}")
    print(f"  └── Number of tools: {sample['Annotator Metadata']['Number of tools']}")
print("=" * 50)

Task ID: 42576abe-0deb-4869-8c63-225c2d75a95a
Question: In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. 

The word that indicates oneself is "Pa" is the nominative form, "Mato" is the accusative form, and "Sing" is the genitive form. 

The root verb that indicates an intense like for something is "Maktay". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is "Tay", and when it is used in the imperfect past, it is "Aktay". It is used differently than in English, and is better translated as "is pleasing to", meaning that the thing doing the liking is actually the object of the sentence rather than the subject.

The word for apples is borrowed from English in Tizin, and so it is "Apple" is the nominative form, "Zapple" is the accusative form, and "Izapple" is the genitive 

## Create Database 

- Supabase database: https://supabase.com/docs/guides/database/overview

In [None]:
### Linking to supabase server
import os
from dotenv import load_dotenv
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from supabase.client import Client, create_client

load_dotenv()

# supabase client
supabase_url = os.environ.get("SUPABASE_URL")
supabase_key = os.environ.get("SUPABASE_SERVICE_KEY")
supabase: Client = create_client(supabase_url, supabase_key)

# embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") #  dim=768



## Vector Storage

- Langchain Supabase Docs: https://python.langchain.com/docs/integrations/vectorstores/supabase/ 

In [None]:
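def ensure_vector_search_setup(supabase_client, embedding_dimension=768):
    """
    Check whether the vector search setup (extension, table, function) exists in Supabase.

    Args:
        supabase_client: Supabase client used to call the match_documents RPC.
        embedding_dimension: Size of the embedding vectors (768 for all-mpnet-base-v2).

    Returns:
        bool: True if setup is complete, False if SQL needs to be run
    """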
    try:
        # Try to call the match_documents function with a dummy embedding to check if it exists
        dummy_embedding = [0] * embedding_dimension
        supabase_client.rpc(
            "match_documents", 
            {"query_embedding": dummy_embedding, "match_threshold": 0.0, "match_count": 1}
        ).execute()
        print("Vector search setup is complete and ready to use")
        return True
    except Exception as e:
        if "Could not find the function" in str(e) or "relation \"documents\" does not exist" in str(e):
            print("Vector search setup is incomplete. Please run the following SQL in your Supabase SQL editor:")
            print(f"""
-- Enable the pgvector extension to work with embedding vectors
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table to store your documents
CREATE TABLE IF NOT EXISTS
  documents (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    content text, -- corresponds to Document.pageContent
    metadata jsonb, -- corresponds to Document.metadata
    embedding vector({embedding_dimension}) -- {embedding_dimension} for your HF model
  );

-- Create a function to search for documents
CREATE OR REPLACE FUNCTION match_documents (
  query_embedding vector({embedding_dimension}),
  match_threshold float DEFAULT 0.5,
  match_count int DEFAULT 10
) RETURNS TABLE (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
) LANGUAGE plpgsql AS $$
#variable_conflict use_column
BEGIN
  RETURN QUERY
  SELECT
    id,
    content,
    metadata,
    1 - (documents.embedding <=> query_embedding) AS similarity
  FROM documents
  WHERE 1 - (documents.embedding <=> query_embedding) > match_threshold
  ORDER BY documents.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;
            """)
            return False
        else:
            print(f"Unexpected error while checking vector search setup: {e}")
            return False

# Check and ensure the vector search setup before creating the vector store
if ensure_vector_search_setup(supabase):
    vector_store = SupabaseVectorStore(
        client=supabase,
        embedding=embeddings,
        table_name="documents",
        query_name="match_documents",
    )
    retriever = vector_store.as_retriever()
else:
    print("Cannot proceed without the proper vector search setup in Supabase")

Vector search setup is incomplete. Please run the following SQL in your Supabase SQL editor:
Cannot proceed without the proper vector search setup in Supabase
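To make the SQL function above easier to reason about, here is a minimal in-memory analogue of what `match_documents` computes: cosine similarity (pgvector's `<=>` operator returns the cosine distance, i.e. one minus this value), a threshold filter, a sort by similarity, and a limit. The two-dimensional vectors and the `rows` list are illustrative stand-ins, not data from the notebook, and the sketch does not handle zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_documents(query, rows, match_threshold=0.5, match_count=10):
    """In-memory analogue of the SQL function: filter by threshold, sort, limit."""
    scored = [(cosine_similarity(query, r["embedding"]), r) for r in rows]
    kept = [(s, r) for s, r in scored if s > match_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [{"content": r["content"], "similarity": s} for s, r in kept[:match_count]]

# Illustrative rows with tiny embeddings
rows = [
    {"content": "same direction", "embedding": [1.0, 0.0]},
    {"content": "diagonal", "embedding": [1.0, 1.0]},
    {"content": "orthogonal", "embedding": [0.0, 1.0]},
]
results = match_documents([1.0, 0.0], rows, match_threshold=0.5, match_count=2)
print([r["content"] for r in results])
```

The orthogonal row has similarity 0 and is dropped by the threshold, which is the same effect the `WHERE` clause has in the SQL version.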


In [29]:
# Convert question-answer pairs to row dictionaries for insertion into the Supabase table
import uuid

docs = []
for sample in json_QA:
    # Create a formatted text combining question and answer
    content = f"Question : {sample['Question']}\n\nFinal answer : {sample['Final answer']}"
    
    # Create a document dictionary with content, metadata, and vector embedding
    doc = {
        "id": str(uuid.uuid4()),  # Generate a unique ID for each document
        "content": content,  # The actual text (question + answer)
        "metadata": {
            "source": sample['task_id']
        },
        "embedding": embeddings.embed_query(content),
    }
    docs.append(doc)

# Insert all documents into Supabase vector database
try:
    response = supabase.table("documents").insert(docs).execute()
    print("Data inserted successfully:", response)
except Exception as e:
    print("Error inserting data into Supabase:", e)

Data inserted successfully: data=[{'id': '4aef7ccb-556e-4ba2-957c-885e8a928d14', 'content': 'Question : A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?\n\nFinal answer : egalitarian', 'metadata': {'source': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466'}, 'embedding': '[-0.0026346273,0.02306005,-0.01757305,-0.013291867,-0.020390447,-0.016542414,0.040015858,0.017529149,0.02352338,-0.028402735,0.05848893,0.038486365,-0.03603375,0.05896377,-0.023190942,-0.04314032,0.02055803,0.04227184,-0.0155118145,0.0112034185,-0.023943441,0.0084628705,0.034201175,0.014322752,0.032600865,0.018118242,0.031462196,-0.011441558,0.024735529,-0.010139682,0.06563964,0.06661008,0.0109800035,0.022725089,2.0829177e-06,-0.031171668,-0.019049047,0.015555067,0.047359683,-0.03333
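The cell above embeds one document per call and inserts every row in a single request. A possible alternative, sketched here under assumptions, is to embed all texts in one batched call and insert in chunks. `build_rows`, `chunked`, and `fake_embed` are hypothetical names. In the notebook, the real `embeddings.embed_documents` method would be passed in place of the stand-in embedder. Note also that every run generates fresh UUIDs, so re-running the insert cell adds duplicate rows.

```python
import uuid

def build_rows(samples, embed_documents):
    """Build insertable rows, embedding all texts in a single batched call."""
    contents = [
        f"Question : {s['Question']}\n\nFinal answer : {s['Final answer']}"
        for s in samples
    ]
    vectors = embed_documents(contents)  # one call instead of one per sample
    return [
        {
            "id": str(uuid.uuid4()),
            "content": content,
            "metadata": {"source": s["task_id"]},
            "embedding": vector,
        }
        for s, content, vector in zip(samples, contents, vectors)
    ]

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in embedder so the sketch runs without a model
def fake_embed(texts):
    return [[float(len(t)), 0.0] for t in texts]

# Made-up samples in the shape of the GAIA metadata records
samples = [
    {"task_id": "t1", "Question": "Q1?", "Final answer": "a"},
    {"task_id": "t2", "Question": "Q2?", "Final answer": "b"},
    {"task_id": "t3", "Question": "Q3?", "Final answer": "c"},
]
rows = build_rows(samples, fake_embed)
batches = list(chunked(rows, 2))
# In the notebook, each batch could then be sent with:
#   supabase.table("documents").insert(batch).execute()
print(len(rows), [len(b) for b in batches])
```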

## Create Vector Store and Retriever

In [30]:
# Wrap the populated "documents" table in a LangChain vector store and expose it as a retriever
vector_store = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings,
    table_name="documents",
    query_name="match_documents",
)
retriever = vector_store.as_retriever()

## DB Test Query

In [31]:
query = "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?"

# matched_docs = vector_store.similarity_search(query, 2)
docs = retriever.invoke(query)
docs[0]

Document(metadata={'source': '840bfca7-4f7b-481a-8794-c560c340185d'}, page_content='Question : On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?\n\nFinal answer : 80GSFC21M0002')

# Tool Usage Frequency in GAIA Benchmark

- This code analyzes the annotated tools from the GAIA benchmark dataset, counting how frequently each tool type appears across all tasks. The results help identify which tools should be prioritized when building our agent.

In [32]:
from collections import Counter, OrderedDict

tools = []
for sample in json_QA:
    for tool in sample['Annotator Metadata']['Tools'].split('\n'):
        tool = tool[2:].strip().lower()
        if tool.startswith("("):
            tool = tool[11:].strip()
        tools.append(tool)
        
tools_counter = OrderedDict(Counter(tools))
print("List of tools used in all samples:")
print("Total number of tools used:", len(tools_counter))
for tool, count in tools_counter.items():
    print(f"  ├── {tool}: {count}")

List of tools used in all samples:
Total number of tools used: 83
  ├── web browser: 107
  ├── image recognition tools (to identify and parse a figure with three axes): 1
  ├── search engine: 101
  ├── calculator: 34
  ├── unlambda compiler (optional): 1
  ├── a web browser.: 2
  ├── a search engine.: 2
  ├── a calculator.: 1
  ├── microsoft excel: 5
  ├── google search: 1
  ├── ne: 9
  ├── pdf access: 7
  ├── file handling: 2
  ├── python: 3
  ├── image recognition tools: 12
  ├── jsonld file access: 1
  ├── video parsing: 1
  ├── python compiler: 1
  ├── video recognition tools: 3
  ├── pdf viewer: 7
  ├── microsoft excel / google sheets: 3
  ├── word document access: 1
  ├── tool to extract text from images: 1
  ├── a word reversal tool / script: 1
  ├── counter: 1
  ├── excel: 3
  ├── image recognition: 5
  ├── color recognition: 3
  ├── excel file access: 3
  ├── xml file access: 1
  ├── access to the internet archive, web.archive.org: 1
  ├── text processing/diff tool: 1
  ├── gi
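The output above hints at two weaknesses in the fixed-width slicing: `ne: 9` is consistent with the string `None` losing its first two characters, and `a web browser.` is counted separately from `web browser`. The reported total of 83 is also the number of distinct labels, not the number of tool uses. A more forgiving normalizer might look like the sketch below; the regular expressions and the sample `fields` are assumptions about the annotation format, not taken from the dataset.

```python
import re
from collections import Counter

def normalize_tool(line):
    """Lowercase a tool line and strip numbering, '(optional)', articles, trailing dots."""
    tool = line.strip().lower()
    tool = re.sub(r"^\d+\.\s*", "", tool)         # "1. " or "12. "
    tool = re.sub(r"^\(optional\)\s*", "", tool)  # leading "(optional) "
    tool = re.sub(r"^an?\s+", "", tool)           # leading "a " or "an "
    return tool.rstrip(".").strip()

def count_tools(tool_fields):
    """Count normalized tool names across all annotation fields, skipping 'none'."""
    counts = Counter()
    for field in tool_fields:
        for line in field.split("\n"):
            tool = normalize_tool(line)
            if tool and tool != "none":
                counts[tool] += 1
    return counts

# Made-up fields in the style the output above suggests
fields = ["1. Web browser\n2. Search engine", "1. A web browser.\n10. Calculator", "None"]
counts = count_tools(fields)
for tool, n in counts.most_common():
    print(f"{tool}: {n}")
```

In the notebook, `count_tools` could be called with `[s['Annotator Metadata']['Tools'] for s in json_QA]`. Near-synonyms such as `excel` and `microsoft excel` would still need a manual mapping.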

# Graph Implementation

Here we build the agent graph with LangGraph to solve the GAIA tasks.

## System Prompt

In [33]:
system_prompt = """ 
You are a helpful assistant tasked with answering questions using a set of tools.
If the tool is not available, you can try to find the information online. 
You can also use your own knowledge to answer the question. 
You need to provide a step-by-step explanation of how you arrived at the answer.
==========================
Here are a few examples showing you how to answer the question step by step.
"""
for i, sample in enumerate(random_samples):
    system_prompt += f"\nQuestion {i+1}: {sample['Question']}\nSteps:\n{sample['Annotator Metadata']['Steps']}\nTools:\n{sample['Annotator Metadata']['Tools']}\nFinal Answer: {sample['Final answer']}\n"
system_prompt += "\n==========================\n"
system_prompt += "Now, please answer the following question step by step.\n"

# save the system_prompt to a file
with open('system_prompt.txt', 'w') as f:
    f.write(system_prompt)

In [34]:
# load the system prompt from the file
with open('system_prompt.txt', 'r') as f:
    system_prompt = f.read()
print(system_prompt) 

 
You are a helpful assistant tasked with answering questions using a set of tools.
If the tool is not available, you can try to find the information online. 
You can also use your own knowledge to answer the question. 
You need to provide a step-by-step explanation of how you arrived at the answer.
Here is a few examples showing you how to answer the question step by step.

Question 1: In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. 

The word that indicates oneself is "Pa" is the nominative form, "Mato" is the accusative form, and "Sing" is the genitive form. 

The root verb that indicates an intense like for something is "Maktay". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is "Tay", and when it is used in the imperfect past, it is "Aktay". It is used different
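Because the few-shot examples come from `random_samples`, the contents of `system_prompt.txt` change whenever the sampling cell is re-run. One way to make the prompt reproducible, sketched here with hypothetical helpers `pick_examples` and `format_examples` and an arbitrary seed, is to sample with a private seeded generator and format the examples in the same layout as above. The `pool` entries are made-up stand-ins for GAIA records.

```python
import random

def format_examples(samples):
    """Render samples as numbered few-shot examples in the layout used above."""
    parts = []
    for i, s in enumerate(samples, start=1):
        meta = s["Annotator Metadata"]
        parts.append(
            f"\nQuestion {i}: {s['Question']}\n"
            f"Steps:\n{meta['Steps']}\n"
            f"Tools:\n{meta['Tools']}\n"
            f"Final Answer: {s['Final answer']}\n"
        )
    return "".join(parts)

def pick_examples(pool, k, seed):
    """Sample k examples with a private, seeded RNG so reruns give the same prompt."""
    return random.Random(seed).sample(pool, k)

# Made-up stand-in records; in the notebook, pass json_QA instead
pool = [
    {
        "Question": f"What is {n} + {n}?",
        "Final answer": str(2 * n),
        "Annotator Metadata": {"Steps": f"1. Add {n} and {n}.", "Tools": "1. Calculator"},
    }
    for n in range(5)
]
first = pick_examples(pool, 2, seed=0)
second = pick_examples(pool, 2, seed=0)
print(format_examples(first))
```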

## Tools

In [None]:
from tools.searchtools import question_retrieve_tool, wiki_search, web_search, arvix_search
from tools.mathtools import multiply, add, subtract, divide, modulus

# Define the tools we created
tools = [
    multiply,
    add,
    subtract,
    divide,
    modulus,
    wiki_search,
    web_search,
    arvix_search,
    question_retrieve_tool
]
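The `tools.mathtools` module is not shown in this notebook. As a purely hypothetical sketch of what two of its functions might look like, the plain Python below uses type hints and docstrings; in a LangChain project such functions are typically wrapped with the `@tool` decorator from `langchain_core.tools`, which derives the tool's name, argument schema, and description from them. The zero checks are an assumption about sensible behavior, not something the source confirms.

```python
def divide(a: float, b: float) -> float:
    """Divide a by b.

    Args:
        a: numerator
        b: denominator (must not be zero)
    """
    if b == 0:
        raise ValueError("Cannot divide by zero.")
    return a / b

def modulus(a: int, b: int) -> int:
    """Return the remainder of a divided by b."""
    if b == 0:
        raise ValueError("Cannot take a modulus by zero.")
    return a % b

print(divide(7, 2), modulus(7, 2))
```

Clear docstrings matter here because the model only sees the tool name, its argument schema, and its description when deciding which tool to call.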

## Create Chain: LLM Model | Tools

- Get your Google API key: https://aistudio.google.com/app/apikey

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

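# Fail early with a clear message if the Google API key is missing
# (assigning os.getenv(...) back to os.environ would raise a TypeError when it is None)
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")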

from langchain_google_genai import ChatGoogleGenerativeAI

# Set the Google model
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm_with_tools = llm.bind_tools(tools) # integrate the tools with the LLM

## Build Graph

In [None]:
from langgraph.graph import MessagesState, START, StateGraph
from langgraph.prebuilt import tools_condition
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage, SystemMessage

# System message
sys_msg = SystemMessage(content=system_prompt)

# Node
def assistant(state: MessagesState):
    """Assistant node"""
    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])]}

# Build graph
builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
    "assistant",
    # If the latest message from the assistant contains a tool call, tools_condition routes to "tools"
    # Otherwise, tools_condition routes to END
    tools_condition,
)
builder.add_edge("tools", "assistant")

# Compile graph
graph = builder.compile()
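The control flow of the graph above is a loop: the assistant runs, its last message is inspected, and execution either moves to the tools node and back or stops. The sketch below imitates that loop in plain Python without LangGraph. `FakeAIMessage`, `route_after_assistant`, `run_loop`, and the scripted functions are all hypothetical stand-ins, and `route_after_assistant` is only a simplified analogue of `tools_condition`, not its actual implementation.

```python
class FakeAIMessage:
    """Stand-in for an AI message; tool_calls is empty when the model answers directly."""
    def __init__(self, content, tool_calls=None):
        self.content = content
        self.tool_calls = tool_calls or []

def route_after_assistant(state):
    """Return 'tools' if the last message requests a tool, else '__end__'."""
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "__end__"

def run_loop(assistant, run_tools, question, max_turns=5):
    """Alternate assistant and tools until the assistant stops requesting tools."""
    state = {"messages": [question]}
    for _ in range(max_turns):
        state["messages"].append(assistant(state))
        if route_after_assistant(state) == "__end__":
            break
        state["messages"].extend(run_tools(state["messages"][-1].tool_calls))
    return state

# Scripted assistant: first asks for a tool, then answers using its result
def scripted_assistant(state):
    if len(state["messages"]) == 1:
        return FakeAIMessage("", tool_calls=[{"name": "multiply", "args": {"a": 6, "b": 7}}])
    return FakeAIMessage(f"The answer is {state['messages'][-1]}")

# Scripted tool executor: multiplies the arguments of each call
def scripted_tools(tool_calls):
    return [str(c["args"]["a"] * c["args"]["b"]) for c in tool_calls]

state = run_loop(scripted_assistant, scripted_tools, "What is 6 times 7?")
print(state["messages"][-1].content)
```

The `max_turns` guard is part of the sketch only; it illustrates why an agent loop needs some bound on the number of assistant and tool round trips.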

In [None]:
from IPython.display import Image, display

display(Image(graph.get_graph(xray=True).draw_mermaid_png()))

## Test Query LLM Agent

In [None]:
question = "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?"
messages = [HumanMessage(content=question)]
messages = graph.invoke({"messages": messages})

In [None]:
for m in messages['messages']:
    m.pretty_print()
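Since the expected answer for this question is stored in the vector database (see the DB test query above), the agent's reply can be sanity-checked automatically. The sketch below is a loose containment check, not the benchmark's official scoring. `final_text` and `contains_answer` are hypothetical helpers, the reply text is made up, and the code assumes the last message's `content` is a plain string.

```python
from types import SimpleNamespace

def final_text(result):
    """Return the content of the last message in a graph result dict."""
    return result["messages"][-1].content

def contains_answer(text, expected):
    """Case-insensitive check that the expected answer string appears in the text."""
    return expected.strip().lower() in text.strip().lower()

# Stand-in for the dict returned by graph.invoke; real entries are message objects
result = {
    "messages": [
        SimpleNamespace(content="question text"),
        SimpleNamespace(content="The work was supported under award 80GSFC21M0002."),
    ]
}
expected = "80GSFC21M0002"  # final answer recorded for this question in the dataset
print(contains_answer(final_text(result), expected))
```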