# Testing the Evaluators Against Human-Annotated Data
- To verify the robustness of the LLM-as-a-judge evaluator, we test it against data annotated by human experts.
- Step 1: Evaluate on negative examples only, using the existing dataset, and calculate FP and TN.
- Step 2: Evaluate on the full set of positive and negative examples by adding a hand-made positive set, and calculate TP and FN.
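
The four counts named in the steps above can be tallied as in the sketch below. It assumes the judge emits the strings `"correct"` / `"incorrect"` (as the later cells suggest) and that `label` is True when the SQL is actually correct. `confusion_counts` is a hypothetical helper, not part of the evaluation package.

```python
from collections import Counter

def confusion_counts(labels, judgements):
    """Tally TP/FP/TN/FN for a judge.

    labels: iterable of bool, True when the SQL is actually correct.
    judgements: iterable of str, "correct" or "incorrect" (assumed judge output).
    """
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for label, judgement in zip(labels, judgements):
        judged_correct = judgement == "correct"
        if label and judged_correct:
            counts["TP"] += 1
        elif label and not judged_correct:
            counts["FN"] += 1
        elif not label and judged_correct:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return dict(counts)

# Step 1 uses only negative rows, so only FP and TN can be non-zero.
example = confusion_counts([False, False, False], ["correct", "incorrect", "incorrect"])
print(example)  # {'TP': 0, 'FP': 1, 'TN': 2, 'FN': 0}
```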

### Build a negative-only evaluation-pipeline test set from the NL2SQL-Bugs set
- Import the NL2SQL test set and create a negative-only evaluation-pipeline test set.

In [1]:
import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem
from dotenv import load_dotenv
import os
import json

In [2]:
load_dotenv()
data_storage_uri = os.getenv("AZURE_DATASTORAGE_URI")
fs = AzureMachineLearningFileSystem(data_storage_uri)
fs.ls()


['UI/']

In [3]:
with fs.open('./UI/2025-07-25_003545_UTC/NL2SQL-Bugs-with-evidence.json') as f:
    data = json.load(f)

nl2sql_bug_df = pd.DataFrame(data)
nl2sql_bug_df["label"] = nl2sql_bug_df["label"].astype(bool)
len(nl2sql_bug_df.loc[nl2sql_bug_df["label"]== True])

1019
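
One caveat on the `astype(bool)` conversion: it is safe when the labels are already booleans or 0/1 integers, but any non-empty string, including `"False"`, converts to True. The sketch below shows a stricter conversion in case labels ever arrive as strings. `to_bool_label` and the sample values are hypothetical.

```python
import pandas as pd

def to_bool_label(value):
    """Map common label encodings to bool; raise on anything unexpected."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)) and value in (0, 1):
        return bool(value)
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        return value.strip().lower() == "true"
    raise ValueError(f"Unrecognized label: {value!r}")

raw = pd.Series(["True", "False", True, 0])
naive = raw.astype(bool)         # the string "False" becomes True here
strict = raw.map(to_bool_label)  # the string "False" becomes False

print(naive.tolist())   # [True, True, True, False]
print(strict.tolist())  # [True, False, True, False]
```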

In [4]:
# Check whether any True-labeled rows share the same question but have different SQL
dupes = (
    nl2sql_bug_df[nl2sql_bug_df["label"] == True]
    .groupby("question")
    # Run the lambda on each group and return the original DataFrame rows that satisfy the condition
    .filter(lambda x: len(x) > 1)
)
print(dupes)  # True-labeled rows whose question appears more than once
print(dupes["question"].unique())  # Which questions are duplicated



        id                                           question  \
1126  1126  Show me the season page of year when the race ...   
1145  1145  Show me the season page of year when the race ...   

                                               evidence      db_id  \
1126                      race number refers to raceId;  formula_1   
1145  the season page refers to url; race number ref...  formula_1   

                                                    sql  label error_types  
1126  SELECT T2.url FROM races AS T1 INNER JOIN seas...   True          []  
1145  SELECT T2.url FROM races AS T1 INNER JOIN seas...   True          []  
['Show me the season page of year when the race No. 901 took place.']


In [5]:
# 0) Make sure the label column is boolean.
#    astype(bool) is safe for bools and 1/0 ints; string labels would need explicit
#    parsing, because any non-empty string (including "False") converts to True.
nl2sql_bug_df["label"] = nl2sql_bug_df["label"].astype(bool)

# 1) Build a map from question → gold_sql.
#    If a question has several True rows, to_dict() keeps only the last one.
gold_map = (
    nl2sql_bug_df
    # Select rows with a boolean mask
    .loc[nl2sql_bug_df["label"]]
    # Use the question column as the index and keep only the sql column
    .set_index("question")["sql"]
    .to_dict()
)

len(gold_map)

1018
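
The map has 1018 entries rather than 1019 because one question has two True rows (ids 1126 and 1145), and `set_index(...).to_dict()` keeps only the last SQL for a repeated key. The sketch below reproduces the collapse on made-up rows and shows one way to keep every gold SQL per question. The toy data and the `gold_lists` name are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "question": ["q1", "q1", "q2"],
    "sql": ["SELECT a FROM t", "SELECT a FROM t AS x", "SELECT b FROM t"],
})

# Duplicate index keys collapse: the later row overwrites the earlier one.
last_wins = toy.set_index("question")["sql"].to_dict()

# Alternative: keep all gold SQLs per question as a list.
gold_lists = toy.groupby("question")["sql"].agg(list).to_dict()

print(last_wins)   # {'q1': 'SELECT a FROM t AS x', 'q2': 'SELECT b FROM t'}
print(gold_lists)  # {'q1': ['SELECT a FROM t', 'SELECT a FROM t AS x'], 'q2': ['SELECT b FROM t']}
```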

In [6]:
# 2) Walk through all FALSE‐label rows and emit one entry per pred_sql
eval_pipeline_test_list = []
for _, row in nl2sql_bug_df.loc[~nl2sql_bug_df["label"]].iterrows():
    q    = row["question"]
    pred = row["sql"]
    eval_pipeline_test_list.append({
        "question":    q,
        "gold_sql":    gold_map.get(q),       # ← will be None only if no True‐row ever existed
        "pred_sql":    pred,
        "label":       False,
        "evidence":    row["evidence"],
        "error_types": row["error_types"],
    })



In [7]:
eval_pipeline_test_df = pd.DataFrame(eval_pipeline_test_list)
eval_pipeline_test_df["gold_sql"].isnull().sum()
eval_pipeline_test_df = eval_pipeline_test_df.loc[eval_pipeline_test_df["gold_sql"].notnull()].copy()
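
The row-by-row loop and the null filter above could also be written without `iterrows`. The following is a minimal sketch on toy data, assuming the same column names (`question`, `sql`, `label`). It is an alternative formulation, not the notebook's actual code.

```python
import pandas as pd

toy = pd.DataFrame({
    "question": ["q1", "q1", "q2", "q3"],
    "sql": ["gold1", "bad1", "bad2", "bad3"],
    "label": [True, False, False, False],
})
gold_map = toy.loc[toy["label"]].set_index("question")["sql"].to_dict()

negatives = (
    toy.loc[~toy["label"]]
    .rename(columns={"sql": "pred_sql"})
    # Look up the gold SQL for each question; questions without one become NaN
    .assign(gold_sql=lambda d: d["question"].map(gold_map))
    # Drop negatives that have no gold SQL to compare against
    .dropna(subset=["gold_sql"])
)
print(negatives[["question", "gold_sql", "pred_sql"]])
```

Only `q1` survives here, because `q2` and `q3` have no True-labeled row to serve as gold SQL.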

In [8]:
eval_pipeline_test_df

Unnamed: 0,question,gold_sql,pred_sql,label,evidence,error_types
0,What is the highest eligible free rate for K-1...,SELECT MAX(CAST(`Free Meal Count (K-12)` AS RE...,SELECT max(`free meal count (k-12)`) / max(`en...,False,Eligible free rate for K-12 = `Free Meal Count...,"[{'error_type': 'Function-Related Errors', 'su..."
1,Please list the lowest three eligible free rat...,SELECT T2.`Free Meal Count (Ages 5-17)` / T2.`...,SELECT min(`free meal count (ages 5-17)`) FROM...,False,Eligible free rates for students aged 5-17 = `...,"[{'error_type': 'Attribute-Related Errors', 's..."
2,Please list the zip code of all the charter sc...,SELECT T2.Zip FROM frpm AS T1 INNER JOIN schoo...,SELECT schools.zip FROM schools INNER JOIN frp...,False,Charter schools refers to `Charter School (Y/N...,"[{'error_type': 'Value-Related Errors', 'sub_e..."
3,What is the unabbreviated mailing street addre...,SELECT T2.MailStreet FROM frpm AS T1 INNER JOI...,SELECT schools.streetabr FROM frpm INNER JOIN ...,False,,"[{'error_type': 'Attribute-Related Errors', 's..."
8,Among the schools with the SAT test takers of ...,SELECT T2.School FROM satscores AS T1 INNER JO...,SELECT schools.school FROM schools INNER JOIN ...,False,Magnet schools or offer a magnet program means...,"[{'error_type': 'Attribute-Related Errors', 's..."
...,...,...,...,...,...,...
986,For all the transactions happened during 8:00-...,SELECT count(transactions_1k.transactionid) FR...,select count(transactions_1k.transactionid) fr...,False,Czech Republic can be represented as the Count...,"[{'error_type': 'Attribute-Related Errors', 's..."
987,There's one customer spent 214582.17 in the Ju...,SELECT customers.currency FROM customers INNER...,select distinct customers.currency from custom...,False,June of 2013 means Date contains '201306' in t...,"[{'error_type': 'Value-Related Errors', 'sub_e..."
993,Which gas station has the highest amount of re...,SELECT gasstationid FROM transactions_1k GROUP...,select gasstations.gasstationid from gasstatio...,False,,"[{'error_type': 'Table-Related Errors', 'sub_e..."
997,For all the people who paid more than 29.00 pe...,SELECT T1.Consumption FROM yearmonth AS T1 INN...,select distinct yearmonth.consumption from yea...,False,August of 2012 refers to the Date value = '201...,"[{'error_type': 'Value-Related Errors', 'sub_e..."


### Evaluate llm-as-judge-raw-sql-evaluators 
- Test with the negative-only (label = False) dataset.

In [None]:
import sys
import os
from evaluation.evaluators.llm_as_judge_raw_sql_evaluator import LLMasJudgeRawSQL
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint= os.environ["AZURE_ENDPOINT"],
    azure_key = os.environ["AZURE_API_KEY"],
    azure_deployment = os.environ["AZURE_4O_DEPLOYMENT"],
    api_version = os.environ["AZURE_4O_API_VERSION"]
)




In [None]:
llm_sql_evaluator = LLMasJudgeRawSQL(model_config= model_config)
llm_raw_sql_pipeline_eval_result = []
for i, row in tqdm(eval_pipeline_test_df.iterrows(),ncols=100,colour="cyan", total= len(eval_pipeline_test_df)):
    question = row["question"]
    gold_sql = row["gold_sql"]
    pred_sql = row["pred_sql"]

    result = llm_sql_evaluator(question = question, gold_sql= gold_sql, pred_sql = pred_sql)
    result = json.loads(result)
    llm_raw_sql_pipeline_eval_result.append(
        {"question": question, "gold_sql": gold_sql, "pred_sql": pred_sql, "llm_judgement": result["label"], "reason": result["reason"]}
    )
    print(f"##### {i}th DEBUG LOG #####")
    print("Question:", question )
    print("Gold Sql:", gold_sql)
    print("Pred Sql:", pred_sql)
    print("LLM Judgement:", result["label"])
    print("Reason:", result["reason"])
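
The loop above assumes every judge response is valid JSON with `label` and `reason` keys, so a single malformed response would abort the whole run. The sketch below is a defensive variant, assuming the evaluator returns a JSON string. `parse_judgement` and the `"parse_error"` label are hypothetical.

```python
import json

def parse_judgement(raw):
    """Parse a judge response into a dict with 'label' and 'reason'.

    Falls back to a 'parse_error' label instead of raising, so one bad
    response does not stop the evaluation loop.
    """
    try:
        parsed = json.loads(raw)
        return {"label": parsed["label"], "reason": parsed["reason"]}
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"label": "parse_error", "reason": f"{type(exc).__name__}: {raw!r}"}

ok = parse_judgement('{"label": "incorrect", "reason": "wrong column"}')
bad = parse_judgement("not json at all")
print(ok["label"], bad["label"])  # incorrect parse_error
```

Rows labeled `parse_error` can then be counted separately rather than silently falling into FP or TN.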




In [None]:
llm_raw_sql_pipeline_eval_df = pd.DataFrame(llm_raw_sql_pipeline_eval_result)

# Join with the original eval test set.
# Note: merging on pred_sql alone duplicates rows when the same pred_sql occurs more than once.
merged_llm_raw_sql_pipeline_eval_df = pd.merge(eval_pipeline_test_df, llm_raw_sql_pipeline_eval_df, on="pred_sql", how='left')
# drop() returns a new DataFrame, so assign the result back
merged_llm_raw_sql_pipeline_eval_df = merged_llm_raw_sql_pipeline_eval_df.drop(["question_y","gold_sql_y"], axis = 1)

fp = len(merged_llm_raw_sql_pipeline_eval_df[merged_llm_raw_sql_pipeline_eval_df["llm_judgement"] == "correct"])
tn = len(merged_llm_raw_sql_pipeline_eval_df[merged_llm_raw_sql_pipeline_eval_df["llm_judgement"] == "incorrect"])

print("Total:", len(merged_llm_raw_sql_pipeline_eval_df))
print("FP:", fp)
print("TN:", tn)
print(fp/tn*100)  # FP-to-TN ratio in percent (not the false positive rate)


Total: 551
FP: 98
TN: 453
21.63355408388521
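
The last value printed above is FP divided by TN, which is a ratio rather than a rate. On a negatives-only set the usual summaries are the false positive rate FP/(FP+TN) and the specificity TN/(FP+TN). Worked out with the counts printed above:

```python
fp, tn = 98, 453  # counts printed above

false_positive_rate = fp / (fp + tn)  # share of wrong SQL judged "correct"
specificity = tn / (fp + tn)          # share of wrong SQL judged "incorrect"

print(f"FPR: {false_positive_rate:.2%}")   # FPR: 17.79%
print(f"Specificity: {specificity:.2%}")   # Specificity: 82.21%
```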


### Evaluate Component Matching 
- Test with the negative-only (incorrect) set.

In [9]:
from evaluators.cm_evaluator import ComponentMatchingEvaluator
import json 
from tqdm import tqdm
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
cm_evaluator = ComponentMatchingEvaluator()
cm_eval_result = []
for i, row in tqdm(eval_pipeline_test_df.iterrows(),ncols=100,colour="cyan", total= len(eval_pipeline_test_df)):
    gold_sql = row["gold_sql"]
    pred_sql = row["pred_sql"]
    print(f"##### {i}th DEBUG LOG #####")
    print("Gold Sql:", gold_sql)
    print("Pred Sql:", pred_sql)
    cm, detail = cm_evaluator(gold_sql= gold_sql, pred_sql = pred_sql)
    
    cm_eval_result.append(
        {"gold_sql": gold_sql, "pred_sql": pred_sql, "cm_score": cm, "detail": detail}
    )
    print("CM Score", cm)
    print("CM Detail ", detail)

[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
  0%|                                                                       | 0/549 [00:00<?, ?it/s]

 61%|███████████████████████████████████▉                       | 334/549 [00:00<00:00, 1704.37it/s]

##### 0th DEBUG LOG #####
Gold Sql: SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda'
Pred Sql: SELECT max(`free meal count (k-12)`) / max(`enrollment (k-12)`) FROM frpm WHERE `county name` = 'Alameda'
CM Score 0.8571428571428571
CM Detail  {'select': False, 'from': True, 'where': True, 'groupBy': True, 'having': True, 'orderBy': True, 'limit': True}
##### 1th DEBUG LOG #####
Gold Sql: SELECT T2.`Free Meal Count (Ages 5-17)` / T2.`Enrollment (Ages 5-17)` AS EligibleFreeRate FROM schools AS T1 INNER JOIN frpm AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.EdOpsName = 'Continuation School' AND T2.`Enrollment (Ages 5-17)` > 0 AND T2.`Free Meal Count (Ages 5-17)` IS NOT NULL AND T2.`Enrollment (Ages 5-17)` IS NOT NULL ORDER BY EligibleFreeRate ASC LIMIT 3
Pred Sql: SELECT min(`free meal count (ages 5-17)`) FROM frpm WHERE `educational option type` = 'Continuation School'
CM Score 0.7142857142857143
CM Detail  {'select': False, 'fr

100%|███████████████████████████████████████████████████████████| 549/549 [00:00<00:00, 1665.17it/s]

CM Score 0.5714285714285714
CM Detail  {'select': True, 'from': True, 'where': False, 'groupBy': True, 'having': True, 'orderBy': False, 'limit': False}
##### 612th DEBUG LOG #####
Gold Sql: SELECT surname FROM drivers WHERE nationality = 'Italian'
Pred Sql: select surname from drivers where nationality = 'italian'
CM Score 1.0
CM Detail  {'select': True, 'from': True, 'where': True, 'groupBy': True, 'having': True, 'orderBy': True, 'limit': True}
##### 615th DEBUG LOG #####
Gold Sql: SELECT T2.date FROM circuits AS T1 INNER JOIN races AS T2 ON T1.circuitId = T2.circuitId WHERE T1.name = 'Circuit de Barcelona-Catalunya'
Pred Sql: select races.date from races inner join circuits on races.circuitid = circuits.circuitid where circuits.name = 'Barcelona-Catalunya'
CM Score 0.7142857142857143
CM Detail  {'select': False, 'from': False, 'where': True, 'groupBy': True, 'having': True, 'orderBy': True, 'limit': True}
##### 616th DEBUG LOG #####
Gold Sql: select circuits.url from races inner jo
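
In the logs above, each CM score equals the fraction of the seven clause checks that are True (one False out of seven gives 6/7 ≈ 0.857). The sketch below works under that assumption. It also shows how a perfect score on a known-wrong query, such as the `'italian'` vs `'Italian'` case, would count as a false positive if a score of 1.0 is treated as "judged correct". The scoring rule, the threshold, and both helper names are assumptions, not the `ComponentMatchingEvaluator` API.

```python
def cm_score_from_detail(detail):
    """Fraction of clause checks that matched (assumed scoring rule)."""
    return sum(detail.values()) / len(detail)

def is_false_positive(cm_score, label, threshold=1.0):
    """True for a wrong query (label False) whose score reaches the threshold."""
    return (not label) and cm_score >= threshold

detail = {"select": False, "from": True, "where": True, "groupBy": True,
          "having": True, "orderBy": True, "limit": True}
score = cm_score_from_detail(detail)

print(score)                                  # 0.8571428571428571
print(is_false_positive(score, label=False))  # False
print(is_false_positive(1.0, label=False))    # True
```

Counting such perfect-score negatives over `cm_eval_result` would give an FP figure comparable to the one reported for the LLM judge.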


