# Calculate execution accuracy metric

In this notebook, we calculate the execution accuracy of the tabular agent by comparing the results of its generated SQL queries against the ground truth results.

The evaluation dataset supplies the questions and the evidence. The evidence represents external knowledge given to the LLM, such as knowledge about the data schema, definitions of columns or specific values, definitions of acronyms, or specific function names for mathematical calculations. We calculate the execution accuracy both without and with evidence, which lets us estimate the impact of external knowledge on the tabular agent's performance. We also calculate the execution accuracy with evidence excluding challenging questions, which gives a more representative performance metric for less challenging use cases or less complex user questions. For more information about execution accuracy, see the BIRD paper: https://arxiv.org/pdf/2305.03111
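
To make the metric concrete, here is a minimal sketch of an execution-accuracy check for a single question. It assumes a SQLite database and treats each query result as an unordered set of rows. The function name `execution_match`, the toy `account` table, and the example queries are hypothetical and only serve as illustration; this is not the evaluation harness used to produce the results in this notebook.

```python
import sqlite3


def execution_match(conn, predicted_sql, ground_truth_sql):
    """Return True if both queries produce the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Assumption: a predicted query that fails to execute counts as inaccurate.
        return False
    ground_truth_rows = conn.execute(ground_truth_sql).fetchall()
    # Compare as sets so that row order does not matter.
    return set(predicted_rows) == set(ground_truth_rows)


# Hypothetical toy table, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER, frequency TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "monthly"), (2, "weekly"), (3, "monthly")],
)

ground_truth = "SELECT COUNT(*) FROM account WHERE frequency = 'monthly'"
equivalent = "SELECT COUNT(account_id) FROM account WHERE frequency = 'monthly'"
wrong = "SELECT COUNT(*) FROM account"

match_equivalent = execution_match(conn, equivalent, ground_truth)
match_wrong = execution_match(conn, wrong, ground_truth)
match_invalid = execution_match(conn, "SELECT nope FROM missing_table", ground_truth)
print(match_equivalent, match_wrong, match_invalid)
```

A differently written query still counts as accurate as long as it returns the same rows, which is why the metric is computed on execution results and not on the SQL text.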

The baseline performance reported in the paper above depends on the chosen model. When the paper was released (2023), the execution accuracy achieved by GPT-4 was 30% without evidence and 46% with evidence. Please note that this is the accuracy achieved by GPT-4 in 2023, and that the baseline would likely be higher for a model re-trained after 2023. More recent models combined with data profiling techniques and other approaches have achieved higher accuracy (75% on the Dev set). The performance leaderboard can be accessed here: https://bird-bench.github.io. Please note that the Dev set is a much larger dataset containing multiple databases (including the Financial dataset used for evaluation here).

In [1]:
import json
import numpy as np



In [2]:
def derive_accuracy(eval_data):
    """Return execution accuracy without evidence, with evidence, and with
    evidence excluding challenging questions.

    Each row of eval_data is expected to have the keys "evidence", "difficulty",
    "is_accurate_with_evidence" and "is_accurate_without_evidence".
    """
    accuracy_with_evidence_list = []
    accuracy_without_evidence_list = []
    accuracy_with_evidence_notchallenging_list = []
    for row in eval_data:
        # Accuracy without evidence is computed over all questions.
        accuracy_without_evidence_list.append(row["is_accurate_without_evidence"])
        # Accuracy with evidence only counts questions that come with evidence.
        if row["evidence"] != "":
            accuracy_with_evidence_list.append(row["is_accurate_with_evidence"])
            if row["difficulty"] != "challenging":
                accuracy_with_evidence_notchallenging_list.append(row["is_accurate_with_evidence"])

    def mean(values):
        # Return NaN for an empty subset to avoid a ZeroDivisionError.
        return sum(values) / len(values) if values else float("nan")

    accuracy_without_evidence = mean(accuracy_without_evidence_list)
    accuracy_with_evidence = mean(accuracy_with_evidence_list)
    accuracy_with_evidence_notchallenging = mean(accuracy_with_evidence_notchallenging_list)

    return accuracy_without_evidence, accuracy_with_evidence, accuracy_with_evidence_notchallenging
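
The function above separates out only the "challenging" questions. As a possible extension, the sketch below reports accuracy with evidence for every difficulty level. The helper `accuracy_by_difficulty` and the toy records are hypothetical; the record keys mirror those read by `derive_accuracy`, and the difficulty labels other than "challenging" are assumed for illustration.

```python
from collections import defaultdict


def accuracy_by_difficulty(eval_data):
    """Return accuracy with evidence per difficulty level (hypothetical helper)."""
    outcomes = defaultdict(list)
    for row in eval_data:
        # Mirror derive_accuracy: skip questions that have no evidence.
        if row["evidence"] != "":
            outcomes[row["difficulty"]].append(row["is_accurate_with_evidence"])
    return {level: sum(flags) / len(flags) for level, flags in outcomes.items()}


# Toy records, made up for illustration only.
toy_eval_data = [
    {"evidence": "A stands for ...", "difficulty": "simple", "is_accurate_with_evidence": True},
    {"evidence": "B stands for ...", "difficulty": "simple", "is_accurate_with_evidence": True},
    {"evidence": "C stands for ...", "difficulty": "moderate", "is_accurate_with_evidence": True},
    {"evidence": "D stands for ...", "difficulty": "moderate", "is_accurate_with_evidence": False},
    {"evidence": "E stands for ...", "difficulty": "challenging", "is_accurate_with_evidence": False},
    {"evidence": "", "difficulty": "challenging", "is_accurate_with_evidence": True},
]

by_difficulty = accuracy_by_difficulty(toy_eval_data)
print(by_difficulty)
```

A per-level breakdown like this shows whether a change in overall accuracy comes from one difficulty level or is spread across all of them.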

## Evaluation results with Claude 3.7 without extended thinking

In [3]:
with open('../data_results/tabular/evaluation_results.json') as f:
    eval_data = json.load(f)

In [4]:
accuracy_without_evidence,  accuracy_with_evidence, accuracy_with_evidence_notchallenging = derive_accuracy(eval_data)

In [5]:
# Accuracy when evidence (external knowledge) is NOT in the prompt
print("execution accuracy without evidence: ",accuracy_without_evidence)
# Accuracy when evidence (external knowledge) is in the prompt
print("execution accuracy with evidence:",accuracy_with_evidence)
# Accuracy when evidence (external knowledge) is in the prompt, excluding challenging questions
print("execution accuracy with evidence excluding challenging questions: ",accuracy_with_evidence_notchallenging)


execution accuracy without evidence:  0.23333333333333334
execution accuracy with evidence: 0.39285714285714285
execution accuracy with evidence excluding challenging questions:  0.5


## Evaluation results with Claude 3.7 WITH extended thinking

In [6]:
with open('../data_results/tabular/evaluation_results_extended_thinking.json') as f:
    eval_data = json.load(f)

In [7]:
accuracy_without_evidence,  accuracy_with_evidence, accuracy_with_evidence_notchallenging = derive_accuracy(eval_data)

In [8]:
# Accuracy when evidence (external knowledge) is NOT in the prompt
print("execution accuracy without evidence: ",accuracy_without_evidence)
# Accuracy when evidence (external knowledge) is in the prompt
print("execution accuracy with evidence:",accuracy_with_evidence)
# Accuracy when evidence (external knowledge) is in the prompt, excluding challenging questions
print("execution accuracy with evidence excluding challenging questions: ",accuracy_with_evidence_notchallenging)

execution accuracy without evidence:  0.3
execution accuracy with evidence: 0.32142857142857145
execution accuracy with evidence excluding challenging questions:  0.4090909090909091


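Extended thinking does not improve accuracy when evidence is supplied. In fact, accuracy with evidence drops from 39.3% to 32.1% (and from 50.0% to 40.9% when challenging questions are excluded), which is unexpected. Accuracy without evidence, by contrast, rises from 23.3% to 30.0%.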
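
When interpreting these differences, it helps to know how many questions lie behind each number. The printed accuracies are ratios of counts, so a rough sketch is to convert them back to fractions in lowest terms. The helper `as_fraction` is hypothetical, and the cap of 100 on the denominator is an assumption about the size of the evaluation set. Because the fractions are reduced, each denominator is only a lower bound on the number of questions in that subset.

```python
from fractions import Fraction

# Accuracies copied from the printed outputs above.
reported = {
    "standard, without evidence": 0.23333333333333334,
    "standard, with evidence": 0.39285714285714285,
    "standard, with evidence, not challenging": 0.5,
    "extended thinking, without evidence": 0.3,
    "extended thinking, with evidence": 0.32142857142857145,
    "extended thinking, with evidence, not challenging": 0.4090909090909091,
}


def as_fraction(value, max_denominator=100):
    """Closest fraction with a bounded denominator (the bound is an assumption)."""
    return Fraction(value).limit_denominator(max_denominator)


recovered = {name: as_fraction(value) for name, value in reported.items()}
for name, fraction in recovered.items():
    print(f"{name}: {fraction.numerator}/{fraction.denominator}")
```

If the subsets are in fact this small, each percentage-point gap corresponds to only a handful of questions, so the comparison between the two runs should be read with caution.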