# FinanceBench Evaluation Analysis

This notebook helps inspect JSONL logs generated by `scripts/run_financebench_eval.py`.

## How to use
1. Point the path below to the latest JSONL file.
2. Load the data and inspect basic stats (overall accuracy, accuracy by question type, etc.).
3. Filter rows where `eval_is_same == False` to study bad cases.
4. Inspect hits/citations for deeper debugging and extend with your own cells if needed.


In [49]:
from pathlib import Path
import json
import pandas as pd

# Update this path to the JSONL log you want to inspect
LOG_PATH = Path("/home/moon/Desktop/Financial_Document_Analyzer/data/logs/financebench_eval_20251124_193308.jsonl")
assert LOG_PATH.exists(), f"Log file not found: {LOG_PATH}"

rows = []
with LOG_PATH.open("r", encoding="utf-8") as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        rows.append(json.loads(line))
df = pd.DataFrame(rows)
df

,doc_name,question,ground_truth,question_type,question_reasoning,answer,citations,hits,eval_is_same,eval_reasoning
0,PEPSICO_2022_10K,What are the geographies that Pepsico primaril...,"As of FY2022, Pepsico primarily operates in th...",domain-relevant,Information extraction,The text appears to be the annual report of Pe...,[],"[{'chunk_id': '1571', 'source_doc': 'PEPSICO_2...",True,
1,PEPSICO_2022_10K,Has Pepsico reported any materially important ...,"No, Pepsico is not involved in material legal ...",domain-relevant,Information extraction,"Yes, PepsiCo reported ongoing legal battles.","[{'i': 6, 'source_doc': 'PEPSICO_2022_10K', 'c...","[{'chunk_id': '352', 'source_doc': 'PEPSICO_20...",False,
2,PEPSICO_2022_10K,What is the quantity of restructuring costs di...,Pepsico's restructuring costs in FY2022 amount...,domain-relevant,Information extraction,334,"[{'i': 8, 'source_doc': 'PEPSICO_2022_10K', 'c...","[{'chunk_id': '267', 'source_doc': 'PEPSICO_20...",False,The model output Answer 2 ('334') is a numeric...
3,PEPSICO_2022_10K,What is the FY2022 unadjusted EBITDA less cape...,$9068.00,metrics-generated,Numerical reasoning,5662,[],"[{'chunk_id': '356', 'source_doc': 'PEPSICO_20...",False,
4,PEPSICO_2022_10K,What is the FY2022 unadjusted EBITDA % margin ...,16.5%,metrics-generated,Numerical reasoning,16.9%,"[{'i': 6, 'source_doc': 'PEPSICO_2022_10K', 'c...","[{'chunk_id': '356', 'source_doc': 'PEPSICO_20...",False,


In [50]:
# Basic metrics
total = len(df)
correct = int(df['eval_is_same'].sum())
accuracy = correct / total if total else 0
print(f'Total questions: {total}')
print(f'Correct: {correct} ({accuracy:.2%})')

print('\nAccuracy by question_type:')
grouped = df.groupby('question_type')['eval_is_same'].agg(['mean', 'count'])
display(grouped.rename(columns={'mean': 'accuracy'}))


Total questions: 5
Correct: 1 (20.00%)

Accuracy by question_type:


question_type,accuracy,count
domain-relevant,0.333333,3
metrics-generated,0.0,2


In [51]:
# Inspect incorrect predictions
bad_cases = df[df['eval_is_same'] == False]
print(f'Bad cases: {len(bad_cases)}')
bad_cases[['question', 'answer', 'ground_truth']]


Bad cases: 4


,question,answer,ground_truth
1,Has Pepsico reported any materially important ...,"Yes, PepsiCo reported ongoing legal battles.","No, Pepsico is not involved in material legal ..."
2,What is the quantity of restructuring costs di...,334,Pepsico's restructuring costs in FY2022 amount...
3,What is the FY2022 unadjusted EBITDA less cape...,5662,$9068.00
4,What is the FY2022 unadjusted EBITDA % margin ...,16.9%,16.5%
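
Some of the failures above are numeric (for example `16.9%` against `16.5%`), so it can help to quantify how far off each answer is. The following is a minimal sketch, not part of `scripts/run_financebench_eval.py`: `extract_number` and `relative_error` are hypothetical helper names, and the code assumes the first number in each string is the value of interest. That assumption breaks for sentence-style ground truths that mention a year such as "FY2022" before the figure, so it is best applied to short numeric answers only.

```python
import re

# Hypothetical helpers for illustration; not part of the evaluation script.
# Matches an optional minus sign, digits with optional thousands separators,
# and an optional decimal part.
_NUMBER_RE = re.compile(r"-?\d[\d,]*\.?\d*")


def extract_number(text):
    """Return the first number found in `text` as a float, or None."""
    match = _NUMBER_RE.search(str(text))
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def relative_error(answer, ground_truth):
    """Relative difference between the first numbers in the two strings.

    Returns None when either side has no number or the ground truth is zero.
    """
    predicted = extract_number(answer)
    expected = extract_number(ground_truth)
    if predicted is None or expected is None or expected == 0:
        return None
    return abs(predicted - expected) / abs(expected)


# Possible use in the notebook (assumes the `bad_cases` DataFrame from above):
# bad_cases.apply(lambda r: relative_error(r["answer"], r["ground_truth"]), axis=1)
```

A small relative error points to a rounding or definition mismatch, while a large one suggests the wrong figure was retrieved or computed.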


In [52]:
# Inspect a specific bad case (default: first failure)
idx = bad_cases.index[0] if len(bad_cases) else None
if idx is not None:
    rec = bad_cases.loc[idx]
    print('Question:', rec['question'])
    print('\nGround Truth:', rec['ground_truth'])
    print('\nModel Answer:', rec['answer'])
    print('\nCitations:', rec['citations'])
    print('\nHits (top-k chunk excerpts):')
    for hit in rec['hits']:
        print('-', hit.get('doc_id'), '|', (hit.get('text') or ''))
else:
    print('No bad cases available.')


Question: Has Pepsico reported any materially important ongoing legal battles from FY2022 and FY2021?

Ground Truth: No, Pepsico is not involved in material legal battles.

Model Answer: Yes, PepsiCo reported ongoing legal battles.

Citations: [{'i': 6, 'source_doc': 'PEPSICO_2022_10K', 'chunk_id': '694', 'element': 'title', 'page_start': 133, 'page_end': 133, 'text': 'Certain Provisions of PepsiCo’s Articles of Incorporation and By Laws; Director Indemnification Agreements'}]

Hits (top-k chunk excerpts):
- PEPSICO_2022_10K_p65 | PepsiCo, Inc. and Subsidiaries Fiscal years ended December 31, 2022, December 25, 2021 and December 26, 2020 (in millions)
- PEPSICO_2022_10K_p66 | PepsiCo, Inc. and Subsidiaries December 31, 2022 and December 25, 2021 (in millions except per share amounts)
- PEPSICO_2022_10K_p482 | PEPSICO, INC. SUBSIDIARIES (as of December 31, 2022)
- PEPSICO_2022_10K_p104 | Note 11 — Accumulated Other Comprehensive Loss Attributable to PepsiCo
- PEPSICO_2022_10K_p112 | To 

In [None]:
# Inspect correct predictions
good_cases = df[df['eval_is_same'] == True]
print(f'Good cases: {len(good_cases)}')
good_cases[['question', 'answer', 'ground_truth']]

Good cases: 1


,question,answer,ground_truth
0,What are the geographies that Pepsico primaril...,The text appears to be the annual report of Pe...,"As of FY2022, Pepsico primarily operates in th..."


In [None]:
# Inspect a specific good case (default: first success)
idx = good_cases.index[0] if len(good_cases) else None
if idx is not None:
    rec = good_cases.loc[idx]
    print('Question:', rec['question'])
    print('\nGround Truth:', rec['ground_truth'])
    print('\nModel Answer:', rec['answer'])
    print('\nCitations:', rec['citations'])
    print('\nHits (top-k chunk excerpts):')
    for hit in rec['hits']:
        print('chunk_id:', hit.get('chunk_id'), '|', (hit.get('text') or ''))
else:
    print('No good cases available.')


Question: What are the geographies that Pepsico primarily operates in as of FY2022?

Ground Truth: As of FY2022, Pepsico primarily operates in the following geographies: North America, Latin America, Europe, Africa, Middle East, South Asia, Asia Pacific, Australia, New Zealand and China.

Model Answer: The text appears to be the annual report of PepsiCo, Inc. for the fiscal year ended December 31, 2022. The report provides an overview of the company's business, financial performance, and sustainability initiatives. Here are some key points extracted from the text: 

**Business Overview**: PepsiCo is a leading global beverage and convenient food company with a portfolio of brands including Lay’s, Doritos, Cheetos, Gatorade, Pepsi Cola, Mountain Dew, Quaker, and SodaStream. The company operates in more than 200 countries and territories through its own operations, authorized bottlers, contract manufacturers, and other third parties.

**Challenges Faced**: In 2022, the company faced sever

In [55]:
# TODO: add more analysis cells here.
# e.g., citation coverage, performance by question_reasoning, evidence matching heuristics, etc.
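
Two of the ideas in the TODO above, citation coverage and accuracy by `question_reasoning`, can be sketched directly over the list of parsed JSONL records (`rows` in the first cell). This is a minimal illustration under stated assumptions: `accuracy_by` and `citation_coverage` are hypothetical helper names, a record counts as "covered" when its `citations` list is non-empty, and a missing or null `eval_is_same` counts as incorrect.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by `records[key]` and return {group: (accuracy, count)}.

    A missing or falsy `eval_is_same` is counted as incorrect (assumption).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, count]
    for rec in records:
        bucket = totals[rec.get(key)]
        if rec.get("eval_is_same"):
            bucket[0] += 1
        bucket[1] += 1
    return {
        group: (correct / count, count)
        for group, (correct, count) in totals.items()
    }


def citation_coverage(records):
    """Fraction of records that have at least one citation."""
    if not records:
        return 0.0
    covered = sum(1 for rec in records if rec.get("citations"))
    return covered / len(records)


# Possible use in the notebook (assumes `rows` from the loading cell):
# accuracy_by(rows, "question_reasoning")
# citation_coverage(rows)
```

Working on plain dictionaries keeps the helpers independent of pandas; the same grouping can also be expressed with `df.groupby('question_reasoning')`, mirroring the `question_type` cell earlier in the notebook.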
