# FinanceBench Evaluation Analysis

This notebook helps inspect JSONL logs generated by `scripts/run_financebench_eval.py`.

## How to use
1. Set `LOG_PATH` in the first code cell to the JSONL log you want to inspect (usually the latest run).
2. Load the data and inspect basic stats (overall accuracy, accuracy by question type, etc.).
3. Filter rows where `eval_is_same == False` to study the failure cases.
4. Inspect the `hits` and `citations` of individual failures for deeper debugging, and extend with your own cells if needed.
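
Step 1 can be automated if the logs sit in one directory. The sketch below is only an illustration: `latest_log` is a hypothetical helper (not part of `scripts/run_financebench_eval.py`), and it assumes every log follows the `financebench_eval_YYYYMMDD_HHMMSS.jsonl` naming seen in the example path, so that sorting the names alphabetically also sorts them by time.

```python
from pathlib import Path


def latest_log(log_dir, pattern="financebench_eval_*.jsonl"):
    """Return the newest matching log in log_dir, or None if there is none.

    Hypothetical helper. Assumes the timestamp embedded in each filename is
    zero-padded (YYYYMMDD_HHMMSS), so lexicographic order is chronological.
    """
    candidates = sorted(Path(log_dir).glob(pattern))
    return candidates[-1] if candidates else None
```

In the notebook this could replace the hard-coded path, for example `LOG_PATH = latest_log("data/logs")`, followed by the existing `assert LOG_PATH.exists()` check (guard against `None` first if the directory may be empty).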


In [None]:
from pathlib import Path
import json
import pandas as pd

# Update this path to the JSONL log you want to inspect
LOG_PATH = Path("/home/moon/Desktop/Financial_Document_Analyzer/data/logs/financebench_eval_20251124_184953.jsonl")
assert LOG_PATH.exists(), f"Log file not found: {LOG_PATH}"

rows = []
with LOG_PATH.open("r", encoding="utf-8") as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        rows.append(json.loads(line))
df = pd.DataFrame(rows)
df

index,doc_name,question,ground_truth,question_type,question_reasoning,answer,citations,hits,eval_is_same,eval_result,eval_reasoning
0,AMERICANEXPRESS_2022_10K,Which debt securities are registered to trade ...,There are none,domain-relevant,Information extraction,There is no information provided about debt se...,[],"[{'chunk_id': '943', 'source_doc': 'AMERICANEX...",False,False,The two answers convey different meanings. Ans...
1,AMERICANEXPRESS_2022_10K,What are the geographies that American Express...,"United States, EMEA, APAC, and LACC",domain-relevant,Information extraction,"The United States, the Netherlands, Ireland, M...","[{'i': 5, 'source_doc': 'AMERICANEXPRESS_2022_...","[{'chunk_id': '1197', 'source_doc': 'AMERICANE...",False,False,While Answer 1 lists broad geographic regions ...
2,AMERICANEXPRESS_2022_10K,Does AMEX have an improving operating margin p...,Performance is not measured through operating ...,domain-relevant,Numerical reasoning OR information extraction,There is no specific question provided in the ...,[],"[{'chunk_id': '492', 'source_doc': 'AMERICANEX...",False,False,
3,AMERICANEXPRESS_2022_10K,What drove gross margin change as of the FY202...,Performance is not measured through gross margin,domain-relevant,Logical reasoning (based on numerical reasonin...,The company's net interest income and revenue ...,"[{'i': 4, 'source_doc': 'AMERICANEXPRESS_2022_...","[{'chunk_id': '492', 'source_doc': 'AMERICANEX...",False,False,Answer 1 states a vague assertion that perform...
4,AMERICANEXPRESS_2022_10K,How much has the effective tax rate of America...,The effective tax rate for American Express ha...,domain-relevant,Numerical reasoning,The effective tax rate decreased by 2.9 percen...,"[{'i': 6, 'source_doc': 'AMERICANEXPRESS_2022_...","[{'chunk_id': '275', 'source_doc': 'AMERICANEX...",False,False,Although Answer 1 and Answer 2 convey the same...
5,AMERICANEXPRESS_2022_10K,What was the largest liability in American Exp...,Customer deposits,novel-generated,,Financial liabilities were the largest liabili...,"[{'i': 9, 'source_doc': 'AMERICANEXPRESS_2022_...","[{'chunk_id': '24', 'source_doc': 'AMERICANEXP...",False,False,"Answer 1 is more specific and accurate, as 'cu..."
6,AMERICANEXPRESS_2022_10K,Was American Express able to retain card membe...,Yes,novel-generated,,Yes,"[{'i': 6, 'source_doc': 'AMERICANEXPRESS_2022_...","[{'chunk_id': '24', 'source_doc': 'AMERICANEXP...",True,True,Both Answer 1 and Answer 2 provide a positive ...


In [23]:
# Basic metrics
total = len(df)
correct = int(df['eval_is_same'].sum())
accuracy = correct / total if total else 0
print(f'Total questions: {total}')
print(f'Correct: {correct} ({accuracy:.2%})')

print('\nAccuracy by question_type:')
grouped = df.groupby('question_type')['eval_is_same'].agg(['mean', 'count'])
display(grouped.rename(columns={'mean': 'accuracy'}))


Total questions: 7
Correct: 1 (14.29%)

Accuracy by question_type:


question_type,accuracy,count
domain-relevant,0.0,5
novel-generated,0.5,2


In [None]:
# Inspect incorrect predictions
bad_cases = df[df['eval_is_same'] == False]
print(f'Bad cases: {len(bad_cases)}')
bad_cases[['question', 'answer', 'ground_truth']]


Bad cases: 6


index,question,answer,ground_truth
0,Which debt securities are registered to trade ...,There is no information provided about debt se...,There are none
1,What are the geographies that American Express...,"The United States, the Netherlands, Ireland, M...","United States, EMEA, APAC, and LACC"
2,Does AMEX have an improving operating margin p...,There is no specific question provided in the ...,Performance is not measured through operating ...
3,What drove gross margin change as of the FY202...,The company's net interest income and revenue ...,Performance is not measured through gross margin
4,How much has the effective tax rate of America...,The effective tax rate decreased by 2.9 percen...,The effective tax rate for American Express ha...
5,What was the largest liability in American Exp...,Financial liabilities were the largest liabili...,Customer deposits


In [None]:
# Inspect a specific bad case (default: first failure)
idx = bad_cases.index[0] if len(bad_cases) else None
if idx is not None:
    rec = bad_cases.loc[idx]
    print('Question:', rec['question'])
    print('\nGround Truth:', rec['ground_truth'])
    print('\nModel Answer:', rec['answer'])
    print('\nCitations:', rec['citations'])
    print('\nHits (top-k chunk excerpts):')
    for hit in rec['hits']:
        print('-', hit.get('doc_id'), '|', (hit.get('text') or ''))
else:
    print('No bad cases available.')


Question: Which debt securities are registered to trade on a national securities exchange under American Express' name as of 2022?

Ground Truth: There are none

Model Answer: There is no information provided about debt securities registered to trade on a national securities exchange under American Express' name as of 2022.

Citations: []

Hits (top-k chunk excerpts):
- AMERICANEXPRESS_2022_10K_p176 | AMERICAN EXPRESS COMPANY AMERICAN EXPRESS NATIONAL BANK
- AMERICANEXPRESS_2022_10K_p252 | American Express Company 56th Street AXP Campus LLC American Express Bank LLC Russian Federation American Express Banking Corp. American Express Travel Related Services Company, Inc. Netherlands Antilles Page 1
- AMERICANEXPRESS_2022_10K_p252 | Country Name Jurisdiction United States New York United States Arizona Russian Federation United States New York United States New York Accertify, Inc. United States Delaware AE Innovation Labs Holdings, LLC United States Delaware American Express Innovation L

In [None]:
# TODO: add more analysis cells here.
# e.g., citation coverage, performance by question_reasoning, evidence matching heuristics, etc.
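
Two of the ideas in the TODO above, citation coverage and performance by `question_reasoning`, can be sketched as follows. This is a minimal sketch, not the project's own analysis: `citation_coverage` and `accuracy_by` are hypothetical helper names, and the code assumes `citations` holds a Python list per row and `eval_is_same` holds booleans, as in the table shown earlier. Rows with no `question_reasoning` (the `novel-generated` questions here) are grouped under a `"(missing)"` label, because `groupby` would otherwise drop them.

```python
import pandas as pd


def citation_coverage(df: pd.DataFrame) -> float:
    """Fraction of rows whose `citations` value is a non-empty list."""
    if df.empty:
        return 0.0
    has_citation = df["citations"].apply(
        lambda c: isinstance(c, list) and len(c) > 0
    )
    return float(has_citation.mean())


def accuracy_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Accuracy and row count per value of `column`.

    Missing values in `column` are relabeled "(missing)" so those rows
    still appear in the result instead of being dropped by groupby.
    """
    filled = df.assign(
        **{column: df[column].fillna("(missing)")},
        eval_is_same=df["eval_is_same"].astype(bool),
    )
    grouped = filled.groupby(column)["eval_is_same"].agg(["mean", "count"])
    return grouped.rename(columns={"mean": "accuracy"})
```

In a notebook cell, `print(f"Citation coverage: {citation_coverage(df):.2%}")` and `display(accuracy_by(df, "question_reasoning"))` would report both numbers for the loaded log.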
