### 🗂️ DATA INSPECTION AND ANALYSIS

Before evaluating the models, a detailed inspection of the dataset was conducted to understand the **typology of data** and tailor the evaluation metrics accordingly.

---

### Answer Types Distribution

The dataset consists of structured QA entries, each containing one or more **expected answers** (`exe_ans`). These answers fall into two primary categories:

- **Numerical answers** (e.g., revenue figures, ratios, growth rates)  
- **Boolean string answers**, strictly `"yes"` or `"no"`  

Through the analysis of the dataset fields (`qa`, `qa_0`, `qa_1`), it was observed that:

- The **vast majority** of entries contain **numerical answers**
- Only a **small fraction** of entries involve **string responses**
- Among string values, **100%** of them are either `"yes"` or `"no"` (case-insensitive, whitespace-normalized)

---


In [None]:
import json

import pandas as pd

# data path
data_path = "/Users/francescostocchi/ConvFinQA_LLM_Project/data/train.json"
# directory where the metrics files are saved
directory = "/Users/francescostocchi/ConvFinQA_LLM_Project/results"

# load the raw JSON records
with open(data_path, "r") as f:
    data = json.load(f)

# convert to dataframe
df = pd.DataFrame(data)

# print the first few rows
print(df.head())

In [None]:
# remove unnecessary columns
df = df.drop(columns=["annotation", "filename"])

# print a list of df columns
print(df.columns)


In [None]:
# Count the number of valid (non-null) entries
qa = df["qa"].notnull().sum()
qa_0 = df["qa_0"].notnull().sum()
qa_1 = df["qa_1"].notnull().sum()

# Sum them to get the total number of questions
total_number_of_questions = qa + qa_0 + qa_1

# Print the result
print(f"The total number of questions in the dataset is: {total_number_of_questions}")



In [None]:
# check that every question carries an exact answer ("exe_ans")
def check_exe_ans(qa):
    return isinstance(qa, dict) and "exe_ans" in qa

# Drop NaNs before applying the check to each QA column
qa_columns = ["qa", "qa_0", "qa_1"]
if all(df[col].dropna().apply(check_exe_ans).all() for col in qa_columns):
    print("Exact answer is present in all the question fields (excluding NaNs)")
else:
    print("Some questions are missing the exact answer field (excluding NaNs)")

In [None]:
# analyze the kind of answers
def get_exe_ans_type(qa):
    if isinstance(qa, dict) and "exe_ans" in qa:
        val = qa["exe_ans"]
        # bool is a subclass of int, so exclude it explicitly from "number"
        if isinstance(val, (int, float)) and not isinstance(val, bool):
            return "number"
        elif isinstance(val, str):
            return "string"
        else:
            return type(val).__name__  # catch other types like list, bool, None, etc.
    return "missing"

# apply the function to the qa column
df["exe_ans_type_qa"] = df["qa"].dropna().apply(get_exe_ans_type)

print(df["exe_ans_type_qa"].value_counts())

# apply the function to the qa_0 column
df["exe_ans_type_qa_0"] = df["qa_0"].dropna().apply(get_exe_ans_type)
print(df["exe_ans_type_qa_0"].value_counts())

# apply the function to the qa_1 column
df["exe_ans_type_qa_1"] = df["qa_1"].dropna().apply(get_exe_ans_type)
print(df["exe_ans_type_qa_1"].value_counts())


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define the custom palette
metavoice_palette = {
    'qa': '#309227',   # Green
    'qa_0': '#051827', # Dark Blue
    'qa_1': '#d8ff02'  # Yellow
}


# Apply type detection to all three columns without dropna (get_exe_ans_type returns "missing" for NaNs)
df["exe_ans_type_qa"] = df["qa"].apply(get_exe_ans_type)
df["exe_ans_type_qa_0"] = df["qa_0"].apply(get_exe_ans_type)
df["exe_ans_type_qa_1"] = df["qa_1"].apply(get_exe_ans_type)

# Combine results into a single DataFrame for plotting
type_data = pd.concat([
    df[["exe_ans_type_qa"]].rename(columns={"exe_ans_type_qa": "type"}).assign(source="qa"),
    df[["exe_ans_type_qa_0"]].rename(columns={"exe_ans_type_qa_0": "type"}).assign(source="qa_0"),
    df[["exe_ans_type_qa_1"]].rename(columns={"exe_ans_type_qa_1": "type"}).assign(source="qa_1"),
])

# Filter out "missing"
type_data = type_data[type_data["type"] != "missing"]

# plotting
plt.figure(figsize=(10, 6))
sns.countplot(data=type_data, x="type", hue="source", palette=metavoice_palette)

# Bigger title and labels
plt.title("Distribution of exact answers types across QA sources", fontsize=18, weight='bold')
plt.xlabel("Exact answers type", fontsize=14)
plt.ylabel("Count", fontsize=14)

# Increase tick label sizes
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Legend title and text size
plt.legend(title="QA Source", title_fontsize=13, fontsize=12)

# Optional: log scale
plt.yscale("log")

plt.tight_layout()
plt.show()


In [None]:
# Check if ALL string values in exe_ans are either "yes" or "no"
def all_string_answers_yes_no(column):
    return not column.apply(
        lambda qa: isinstance(qa, dict) and isinstance(qa.get("exe_ans"), str) and qa["exe_ans"].strip().lower() not in ["yes", "no"]
    ).any()

if all_string_answers_yes_no(df["qa"]) and all_string_answers_yes_no(df["qa_0"]) and all_string_answers_yes_no(df["qa_1"]):
    print("Yes — all string answers are either 'yes' or 'no'")
else:
    print("Some string answers are NOT 'yes' or 'no'")



### 🧠 AGENTIC MODEL EVALUATION

This experiment evaluates the performance of three leading LLMs:
- **GPT-4o** from OpenAI  
- **Claude 3.7 Sonnet** from Anthropic  
- **Gemini 2.0 Flash** from Google  

### Prompting Strategies (Prompt Styles)

The evaluation includes three distinct **prompt styles**, each designed to test different agent capabilities:

| Prompt Style   | Name          | Description |
|----------------|---------------|-------------|
| **JSON-Chat**  | `json-chat`   | A custom prompt template using a **system** and **user prompt** defined in a structured way within the prompt templates file. This prompt is optimized for clarity and consistency in numerical QA. |
| **ReAct**      | `react`       | A **ReAct-style** prompt where the model is guided to follow a reasoning-then-acting structure. It mimics classical step-by-step CoT (Chain-of-Thought) reasoning followed by action. |
| **Tool Agent** | `tools-agent` | A tool-oriented prompt that explicitly encourages the model to use **tool calls**. It's crafted to test how well the model can interact with external tools (e.g., calculators, parsers) during reasoning. |

Each prompt style is evaluated independently to observe how **prompt engineering** affects model performance on both numeric and boolean tasks.

---

These prompt strategies help assess:
- The model’s ability to **reason step-by-step**
- Its capacity to **trigger and use external tools**
- Its consistency in producing **well-formatted, verifiable answers**

The evaluation focuses on **numerical accuracy** and **string classification correctness** over a randomly selected sample of **20 entries**, approximately **1% of the total dataset**. Because of resource constraints (API call limits), a sample this small gives only a rough indication of the models' capabilities, with **low statistical validity**. The same methodology can be applied to the full dataset when a comprehensive evaluation is needed.

---

### Evaluation Methodology

The evaluation is conducted using a custom metric function: `measure_accuracy`, supported by two helper functions: `compute_single_sample_accuracy` and `evaluate_answer`. Here's an explanation of each.

---

#### `measure_accuracy(...)`

This is the **main evaluation function**, which:
- Randomly samples a subset of questions from a given dataset.
- Prompts the agentic model to answer the questions.
- Compares the model’s output (`actual_answers`) with the ground truth (`expected_answers`).
- Returns key metrics:  
  - `mean_accuracy`: overall % of correct answers  
  - `mae`: mean absolute error for numeric predictions  
  - `mse`: mean squared error for numeric predictions  
  - `accuracy_measurements`: list of 0/1 indicating per-answer correctness
  - `llm_average_score`: average score assigned by the LLM judge (see `evaluate_answer`)

A **tolerance margin** is applied to numeric comparisons to account for minor variations. Answers outside the tolerance are considered incorrect. For string answers (e.g., "yes"/"no"), a strict match is used after lowercasing and removing spaces.
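As a rough illustration of how these metrics fit together, the sketch below draws a random subset of questions and aggregates numeric answers into accuracy, MAE, and MSE. It is not the project's `measure_accuracy`: the helper names `sample_questions` and `aggregate_metrics`, the fixed seed, and the use of an absolute tolerance are all assumptions made for the example.

```python
import random
from statistics import mean


def sample_questions(questions, number_samples, seed=0):
    """Draw a reproducible random subset (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(number_samples, len(questions))
    return rng.sample(questions, k)


def aggregate_metrics(expected, actual, tolerance=0.005):
    """Aggregate numeric answers into accuracy, MAE and MSE.

    Assumes an absolute tolerance; the real function may differ.
    """
    errors = [abs(e - a) for e, a in zip(expected, actual)]
    hits = [1 if err <= tolerance else 0 for err in errors]
    return {
        "mean_accuracy": mean(hits),  # fraction of correct answers, in [0, 1]
        "mae": mean(errors),
        "mse": mean(err ** 2 for err in errors),
        "accuracy_measurements": hits,
    }


# Made-up numbers, purely to exercise the function
metrics = aggregate_metrics([0.25, 10.0, 3.0], [0.251, 12.0, 3.0])
print(metrics)
```

Here the first and third predictions fall within the tolerance and the second does not, so two of the three answers count as correct.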

---

#### `compute_single_sample_accuracy(...)`

This sub-function performs **answer-level comparison**:
- For **numeric answers**:  
  - Compares using `abs(expected - actual)`  
  - Applies the `tolerance` threshold to mark correctness (score 1 or 0)  
  - Calculates **MAE** and **MSE**
- For **string answers**:  
  - Performs strict comparison after normalization (lowercased, whitespace removed)  
  - Only "yes" and "no" are valid answers; all others are treated as incorrect
- Supports **asymmetric lengths** between predicted and expected answers by scoring missing answers as incorrect.
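The comparison rules above can be sketched as follows. This is an illustrative reading of the description, not the actual `compute_single_sample_accuracy`: the names `score_answer` and `score_sample` are hypothetical, and the tolerance is assumed to be an absolute threshold on `abs(expected - actual)`.

```python
from itertools import zip_longest

VALID_STRINGS = {"yes", "no"}


def score_answer(expected, actual, tolerance=0.005):
    """Return (score, abs_error); abs_error is None for non-numeric answers."""
    # A missing answer on either side is scored as incorrect
    if expected is None or actual is None:
        return 0, None
    if isinstance(expected, str):
        if not isinstance(actual, str):
            return 0, None
        # Normalize: remove all whitespace and lowercase
        norm_expected = "".join(expected.split()).lower()
        norm_actual = "".join(actual.split()).lower()
        ok = norm_actual in VALID_STRINGS and norm_actual == norm_expected
        return int(ok), None
    try:
        error = abs(float(expected) - float(actual))
    except (TypeError, ValueError):
        return 0, None
    return int(error <= tolerance), error


def score_sample(expected_answers, actual_answers, tolerance=0.005):
    """Score answers pairwise, padding the shorter list with None."""
    pairs = zip_longest(expected_answers, actual_answers, fillvalue=None)
    return [score_answer(e, a, tolerance) for e, a in pairs]


# Three expected answers but only two predictions: the third scores 0
results = score_sample([0.5, "yes", 12.0], [0.502, " Yes "])
print(results)
```

Padding with `None` via `zip_longest` is one simple way to handle asymmetric lengths; it also penalizes extra predictions that have no expected counterpart.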

---

#### `evaluate_answer(...)`

This optional function introduces an **LLM-as-a-judge** approach:
- Used when the model's answer might be in an unexpected format (e.g., verbose explanations, embedded numbers, misformatted strings).
- It sends the original question, expected answer, and the model's response to another LLM acting as a "judge".
- The judge returns a score (e.g., 1 or 0) and an **explanation**.

This provides an **additional quality-control layer** that can catch cases where a numerically correct answer is returned in the wrong format and would otherwise be unfairly marked as incorrect.
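A minimal sketch of the judge step is shown below, under stated assumptions. The prompt wording, the JSON reply format, and the names `judge_answer` and `call_judge` are hypothetical and not taken from `evaluate_answer`. The judge model is passed in as a plain callable, so the example runs without any API access.

```python
import json

# Hypothetical judge prompt; the project's actual wording is not shown here
JUDGE_TEMPLATE = (
    "You are grading an answer to a financial question.\n"
    "Question: {question}\n"
    "Expected answer: {expected}\n"
    "Model response: {response}\n"
    'Reply with JSON only: {{"score": 0 or 1, "explanation": "..."}}'
)


def judge_answer(question, expected, response, call_judge):
    """Ask a judge model for a 0/1 score and an explanation.

    call_judge is any function mapping a prompt string to the judge's raw reply.
    """
    prompt = JUDGE_TEMPLATE.format(
        question=question, expected=expected, response=response
    )
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
        score = 1 if int(verdict["score"]) == 1 else 0
        explanation = str(verdict.get("explanation", ""))
    except (ValueError, KeyError, TypeError):
        # Treat any unparseable reply as a failed judgement
        score, explanation = 0, "unparseable judge reply"
    return {"score": score, "explanation": explanation}


def fake_judge(prompt):
    """Stand-in for a real LLM call, used only for this demonstration."""
    return '{"score": 1, "explanation": "The response states 14.1%, which matches 0.141."}'


verdict = judge_answer(
    "What was the growth rate?", 0.141, "The growth rate was 14.1%.", fake_judge
)
print(verdict)
```

Scoring unparseable replies as 0 is a conservative choice; a real pipeline might instead retry or flag such cases for manual review.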

---

### Metrics Summary

Since most of the answers in the dataset are **numerical**, the metrics naturally emphasize:
- **Numerical Accuracy**
- **MAE (Mean Absolute Error)**
- **MSE (Mean Squared Error)**

The few **string-based answers** (all either "yes" or "no") are treated as **binary classification tasks** using exact matching.

---

### Why This Approach?

- **Lightweight**: The method can evaluate a small sample with minimal cost.
- **Scalable**: Easily extends to evaluate the full dataset.
- **Robust**: Supports both numeric and string answer types.
- **Adaptable**: The LLM-judge adds human-like reasoning for edge cases.

---

### Limitations

- The 20-sample evaluation is **not statistically significant** and is meant only for indicative benchmarking.
- String answer evaluation is **strict** — any extra formatting or words can cause a mismatch unless the LLM-judge is used.
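To give a sense of how wide the uncertainty is at this sample size, the sketch below computes an approximate 95% Wilson score interval for an accuracy estimate. The count of 14 correct answers out of 20 is a made-up input used only for illustration, not a result from this evaluation, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical outcome: 14 correct answers out of 20
low, high = wilson_interval(correct=14, total=20)
print(f"accuracy 0.70, interval {low:.2f} to {high:.2f}")
```

With these inputs the interval spans roughly 0.48 to 0.85, so differences of a few percentage points between models or prompt styles cannot be distinguished from noise at this sample size.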

---



In [None]:
import importlib
import json
import os
import sys
from datetime import datetime

# Go up one directory to reach the project root
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Import the project modules and reload them so that local edits are picked up
import src.utils.data_extractor
importlib.reload(src.utils.data_extractor)
import src.metrics.accuracy
importlib.reload(src.metrics.accuracy)
import src.agent.agent_builder
#importlib.reload(src.agent.agent_builder)
import src.agent.agent_tools
importlib.reload(src.agent.agent_tools)
import src.agent.prompt_templates
importlib.reload(src.agent.prompt_templates)

# Re-import the function
from src.metrics.accuracy import measure_accuracy


In [None]:
# setting the number of samples
number_samples = 10

#### TESTING OPENAI MODEL

In [None]:
model = "gpt-4o"
provider = "openai"
tolerance = 0.005


In [None]:
## TESTING REACT PROMPT STYLE
prompt_style = "react"  

# save the metrics for gpt-4o with the react prompt
metrics_gpt4o_react = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for gpt-4o_react:")
for key, value in metrics_gpt4o_react.items():
    print(f"{key}: {value}")
    
# Save the metrics to a JSON file
import json 
from datetime import datetime

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gpt4o_react, f, indent=4)

print(f"Metrics saved successfully to {file_path}")


In [None]:
## TESTING JSON-CHAT PROMPT STYLE
prompt_style = "json-chat"  

# save the metrics for gpt-4o with the json-chat prompt
metrics_gpt4o_custom = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)


In [None]:
## print the metrics
print("Metrics for gpt-4o_custom:")
for key, value in metrics_gpt4o_custom.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gpt4o_custom, f, indent=4)

print(f"Metrics saved successfully to {file_path}")


In [None]:
## TESTING FEW-SHOT-COT PROMPT STYLE
prompt_style = "few-shot-CoT"  

# save the metrics for gpt-4o with the few-shot-CoT prompt
metrics_gpt4o_cot = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for gpt-4o_cot:")
for key, value in metrics_gpt4o_cot.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gpt4o_cot, f, indent=4)

print(f"Metrics saved successfully to {file_path}")
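The save steps repeated in the cells above could be folded into a single helper. The following is a minimal sketch, assuming the same filename pattern and JSON layout used in this notebook; the name `save_metrics` is hypothetical and does not exist in the project, and the demo writes to a temporary directory rather than the real results folder.

```python
import json
import os
import tempfile
from datetime import datetime


def save_metrics(metrics, model, prompt_style, directory):
    """Write a metrics dict to a timestamped JSON file and return its path."""
    # Timestamp keeps filenames unique across runs
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
    file_path = os.path.join(directory, filename)

    # Ensure the directory exists before writing
    os.makedirs(directory, exist_ok=True)
    with open(file_path, "w") as f:
        json.dump(metrics, f, indent=4)
    return file_path


# Demo with placeholder values and a temporary directory
demo_dir = tempfile.mkdtemp()
saved_path = save_metrics({"mean_accuracy": 0.5}, "demo-model", "react", demo_dir)
print(f"Metrics saved successfully to {saved_path}")
```

Each save cell would then reduce to one call, for example `save_metrics(metrics_gpt4o_react, model, prompt_style, directory)`.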


#### TESTING ANTHROPIC MODEL

In [None]:
# model = "claude-3-7-sonnet-20250219"
# model = "claude-3-5-haiku-20241022"
model = "claude-3-5-sonnet-20241022"
provider = "anthropic"
tolerance = 0.005


In [None]:
## TESTING REACT PROMPT STYLE
prompt_style = "react"  

# save the metrics for the Anthropic model with the react prompt
metrics_sonnet3_7_react = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for sonnet_3_5_react:")
for key, value in metrics_sonnet3_7_react.items():
    print(f"{key}: {value}")
    
# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_sonnet3_7_react, f, indent=4)

print(f"Metrics saved successfully to {file_path}")

In [None]:
## TESTING JSON-CHAT PROMPT STYLE
prompt_style = "json-chat"  

# save the metrics for the Anthropic model with the json-chat prompt
metrics_sonnet3_7_custom = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for sonnet_3_5_custom:")
for key, value in metrics_sonnet3_7_custom.items():
    print(f"{key}: {value}")
    
# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_sonnet3_7_custom, f, indent=4)

print(f"Metrics saved successfully to {file_path}")

In [None]:
## TESTING FEW-SHOT-COT PROMPT STYLE
prompt_style = "few-shot-CoT"  

# save the metrics for the Anthropic model with the few-shot-CoT prompt
metrics_sonnet3_7_cot = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for sonnet_3_5_few_shot_cot:")
for key, value in metrics_sonnet3_7_cot.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_sonnet3_7_cot, f, indent=4)

print(f"Metrics saved successfully to {file_path}")

#### TESTING GOOGLE MODEL

In [None]:
model = "gemini-2.0-flash"
provider = "google"
tolerance = 0.005


In [None]:
## TESTING REACT PROMPT STYLE
prompt_style = "react"  

metrics_gemini_react = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for gemini_react:")
for key, value in metrics_gemini_react.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gemini_react, f, indent=4)

print(f"Metrics saved successfully to {file_path}")

In [None]:
## TESTING JSON-CHAT PROMPT STYLE
prompt_style = "json-chat"  

metrics_gemini_json_chat = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for gemini_custom:")
for key, value in metrics_gemini_json_chat.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gemini_json_chat, f, indent=4)

print(f"Metrics saved successfully to {file_path}")

In [None]:
## TESTING FEW-SHOT-COT PROMPT STYLE
prompt_style = "few-shot-CoT"  

metrics_gemini_cot = measure_accuracy(
    data_path=data_path,
    model=model,
    provider=provider,
    prompt_style=prompt_style,
    tolerance=tolerance,
    number_samples=number_samples,
    verbose=True
)

In [None]:
## print the metrics
print("Metrics for gemini_cot:")
for key, value in metrics_gemini_cot.items():
    print(f"{key}: {value}")

# Create a timestamp for uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define directory and dynamic filename
filename = f"metrics_{model}_{prompt_style}_{timestamp}.json"
file_path = os.path.join(directory, filename)

# Ensure the directory exists
os.makedirs(directory, exist_ok=True)

# Save the metrics to the JSON file
with open(file_path, "w") as f:
    json.dump(metrics_gemini_cot, f, indent=4)

print(f"Metrics saved successfully to {file_path}")