## Calculate execution accuracy metric

In this notebook, we calculate the execution accuracy of the tabular agent by comparing the results of its generated SQL queries against the ground truth results.
The evaluation dataset supplies a question and evidence for each example. The evidence is external knowledge given to the LLM, such as information about the data schema, definitions of columns or specific values, definitions of acronyms, or the names of functions used for mathematical calculations. We calculate the execution accuracy both without and with evidence, which lets us estimate the impact of external knowledge on the tabular agent's performance. We also calculate the execution accuracy with evidence while excluding challenging questions, which gives a more representative metric for less complex user questions. For more information about execution accuracy, see the BIRD paper: https://arxiv.org/pdf/2305.03111
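
As a rough illustration of the metric, the sketch below marks a prediction as accurate when the set of rows returned by the predicted SQL equals the set returned by the ground truth SQL, then averages that flag over all questions. The function names and the toy rows are hypothetical and not part of this notebook; the per-row accuracy flags loaded below are assumed to come from a comparison along these lines.

```python
def is_execution_accurate(predicted_rows, ground_truth_rows):
    """Return True when both queries return the same set of rows (order ignored)."""
    return set(map(tuple, predicted_rows)) == set(map(tuple, ground_truth_rows))


def execution_accuracy(pairs):
    """Fraction of (predicted_rows, ground_truth_rows) pairs that match."""
    if not pairs:
        return 0.0
    return sum(is_execution_accurate(p, g) for p, g in pairs) / len(pairs)


# Toy data (hypothetical): two questions, one answered correctly.
toy_pairs = [
    ([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]),  # same rows, different order
    ([(3,)], [(4,)]),                              # different result
]
toy_accuracy = execution_accuracy(toy_pairs)
print("toy execution accuracy:", toy_accuracy)
```

Comparing results as sets ignores row order and duplicate rows, so a query that differs from the ground truth only in its `ORDER BY` clause still counts as accurate under this sketch.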

In [1]:
import json
import numpy as np



In [2]:
with open('../data_results/tabular/evaluation_results.json') as f:
    eval_data = json.load(f)

In [3]:
# Collect the with-evidence accuracy flags, skipping rows that have no evidence
accuracy_with_evidence_list = []
for row in eval_data:
    if row["evidence"] != "":
        accuracy_with_evidence_list.append(row["is_accurate_with_evidence"])

In [4]:
# Collect the without-evidence accuracy flags for every row
accuracy_without_evidence_list = []
for row in eval_data:
    accuracy_without_evidence_list.append(row["is_accurate_without_evidence"])

In [5]:

accuracy_without_evidence = sum(accuracy_without_evidence_list) / len(accuracy_without_evidence_list)

In [6]:
accuracy_with_evidence = sum(accuracy_with_evidence_list) / len(accuracy_with_evidence_list)

In [11]:
print("execution accuracy without evidence: ", accuracy_without_evidence)
print("execution accuracy with evidence:", accuracy_with_evidence)

execution accuracy without evidence:  0.23333333333333334
execution accuracy with evidence: 0.39285714285714285
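
Note that the two figures above are computed over different sets of questions: the without-evidence accuracy uses every row, while the with-evidence accuracy uses only rows with non-empty evidence. A like-for-like comparison could restrict both metrics to the same rows, as in the sketch below. The helper name `paired_accuracy` and the sample rows are hypothetical; the field names match those used in this notebook.

```python
def paired_accuracy(rows):
    """Accuracy with and without evidence, both over rows that have evidence."""
    subset = [r for r in rows if r["evidence"] != ""]
    if not subset:
        return None  # no rows with evidence, nothing to compare
    n = len(subset)
    without_ev = sum(r["is_accurate_without_evidence"] for r in subset) / n
    with_ev = sum(r["is_accurate_with_evidence"] for r in subset) / n
    return {
        "n": n,
        "without_evidence": without_ev,
        "with_evidence": with_ev,
        "gain": with_ev - without_ev,
    }


# Sample rows (hypothetical) with the same keys as evaluation_results.json
sample_rows = [
    {"evidence": "x", "is_accurate_with_evidence": True, "is_accurate_without_evidence": False},
    {"evidence": "y", "is_accurate_with_evidence": True, "is_accurate_without_evidence": True},
    {"evidence": "", "is_accurate_with_evidence": False, "is_accurate_without_evidence": False},
]
paired = paired_accuracy(sample_rows)
print(paired)
```

On the real data, this would be called as `paired_accuracy(eval_data)`.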


In [8]:
# Collect the with-evidence accuracy flags for rows that have evidence and are not "challenging"
accuracy_with_evidence_notchallenging_list = []
for row in eval_data:
    if row["evidence"] != "" and row["difficulty"] != "challenging":
        accuracy_with_evidence_notchallenging_list.append(row["is_accurate_with_evidence"])

In [9]:
accuracy_with_evidence_notchallenging = sum(accuracy_with_evidence_notchallenging_list) / len(accuracy_with_evidence_notchallenging_list)

In [13]:
# Accuracy when evidence (external knowledge) is in the prompt, excluding challenging questions
print("execution accuracy with evidence excluding challenging questions: ", accuracy_with_evidence_notchallenging)


execution accuracy with evidence excluding challenging questions:  0.5
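
Rather than excluding one difficulty level at a time, the same calculation can be grouped by the `difficulty` field to report accuracy and question count per level. The following is a minimal sketch under that assumption; the helper name `accuracy_by_difficulty` and the sample rows (including the `simple` and `moderate` labels) are hypothetical, since only the `challenging` label appears in this notebook.

```python
from collections import defaultdict


def accuracy_by_difficulty(rows, flag="is_accurate_with_evidence"):
    """Group rows that have evidence by difficulty and average the given flag."""
    buckets = defaultdict(list)
    for r in rows:
        if r["evidence"] != "":
            buckets[r["difficulty"]].append(r[flag])
    return {
        level: {"accuracy": sum(flags) / len(flags), "count": len(flags)}
        for level, flags in buckets.items()
    }


# Sample rows (hypothetical) with the same keys as evaluation_results.json
sample_rows = [
    {"difficulty": "simple", "evidence": "e1", "is_accurate_with_evidence": True},
    {"difficulty": "simple", "evidence": "e2", "is_accurate_with_evidence": False},
    {"difficulty": "moderate", "evidence": "e3", "is_accurate_with_evidence": True},
    {"difficulty": "challenging", "evidence": "e4", "is_accurate_with_evidence": False},
    {"difficulty": "challenging", "evidence": "", "is_accurate_with_evidence": True},  # skipped: no evidence
]
by_level = accuracy_by_difficulty(sample_rows)
for level, stats in by_level.items():
    print(level, stats)
```

Reporting the count next to each accuracy makes it easier to judge how much weight a per-level figure deserves when a level contains only a few questions.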
