## 🧪 Exercise 2: Evaluating the NL2SQL Model with Rule-Based Metrics

In this section, we will evaluate the performance of our NL2SQL model using **rule-based metrics** provided by the **NL2SQL360** library (developed by Dial Lab at HKUST).  
We will focus on the following two metrics:

---

### ✅ Metric 1: Execution Accuracy (EX)

| Metric               | Evaluation Target                                           | Requires DB Execution? | Notes                                                                 |
|----------------------|-------------------------------------------------------------|-------------------------|-----------------------------------------------------------------------|
| **Execution Accuracy (EX)** | Compares the execution result of predicted SQL with the ground-truth | ✅ Yes                  | Strict semantic measure. Different SQLs are allowed if results match. |

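To make the EX idea concrete, here is a minimal sketch using an in-memory SQLite database. It assumes that results are compared as unordered sets of rows and that a prediction which fails to execute counts as incorrect; `execution_match` and the `student` table are hypothetical names for illustration, not part of NL2SQL360.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # assumption: a prediction that fails to run is wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption of this sketch: row order and duplicates are ignored.
    return set(predicted_rows) == set(gold_rows)

# Tiny demo database (hypothetical schema and data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER, name TEXT, gpa REAL);
    INSERT INTO student VALUES (1, 'Ana', 3.9), (2, 'Ben', 3.1), (3, 'Cho', 3.6);
""")

gold = "SELECT name FROM student WHERE gpa > 3.5"
pred_ok = "SELECT name FROM student WHERE NOT gpa <= 3.5"  # different SQL, same rows
pred_bad = "SELECT name FROM student WHERE gpa > 3.0"      # returns an extra row

ex_ok = execution_match(conn, pred_ok, gold)
ex_bad = execution_match(conn, pred_bad, gold)
print(ex_ok, ex_bad)  # True False
```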
---

### ✅ Metric 2: Valid Efficiency Score (VES)

| Metric               | Evaluation Target                                           | Requires DB Execution? | Notes                                                                 |
|----------------------|-------------------------------------------------------------|-------------------------|-----------------------------------------------------------------------|
| **Valid Efficiency Score (VES)** | Measures execution efficiency relative to ground-truth SQL | ✅ Yes                  | Combines execution correctness with runtime/memory efficiency.        |

- `VES = EX × EfficiencyRatio`
- VES penalizes correct SQLs that are inefficient compared to the reference.

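The formula above can be sketched as follows, assuming the definition used by the BIRD benchmark, where the efficiency ratio is the square root of the gold runtime divided by the predicted runtime. The function names are hypothetical, and NL2SQL360's own implementation may differ in details such as how runtimes are measured or scaled.

```python
import math

def efficiency_ratio(gold_seconds, predicted_seconds):
    """Assumed BIRD-style ratio: sqrt(gold runtime / predicted runtime)."""
    return math.sqrt(gold_seconds / predicted_seconds)

def valid_efficiency_score(results):
    """Average EX x EfficiencyRatio over all samples.

    results: iterable of (is_correct, gold_seconds, predicted_seconds).
    Incorrect predictions contribute 0 regardless of how fast they run.
    """
    results = list(results)
    total = sum(
        efficiency_ratio(gold, pred) if is_correct else 0.0
        for is_correct, gold, pred in results
    )
    return total / len(results)

# Hypothetical timings for three predictions
timings = [
    (True, 1.0, 1.0),   # correct, same speed as gold -> 1.0
    (True, 1.0, 4.0),   # correct, 4x slower          -> 0.5
    (False, 1.0, 0.5),  # wrong, speed is irrelevant  -> 0.0
]
score = valid_efficiency_score(timings)
print(score)  # 0.5
```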
---

These metrics help us go beyond syntax and evaluate whether the predicted SQL is **semantically accurate and computationally efficient**.


In [2]:
# ================================================================
# WARNING: DO NOT INSTALL nl2sql360 VIA pip
# ---------------------------------------------------------------
# This project uses a **locally customized version** of NL2SQL360
# located at: ../nl2sql360/src
#
# Why?
# - Fixed minor bugs in the original implementation
# - Added internal debugging and logging features
# - Integrated with our custom evaluation pipeline and prompt structure
#
# ✔️ No need to install the library via pip
# ✔️ This will automatically load the local version
# ================================================================

import sys
import importlib

# Add local path with customized NL2SQL360 implementation
sys.path.insert(0, '../nl2sql360/src')

# Import and reload to reflect any recent changes
import nl2sql360
importlib.reload(nl2sql360)

# Confirm that the local path is used (not global site-packages)
print(nl2sql360.__file__)  # Should point to ../nl2sql360/src/nl2sql360/__init__.py


/home/azureuser/workspace/nl2sql-samsung-finanical/evaluation/../nl2sql360/src/nl2sql360/__init__.py

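Reading the printed path by eye works, but the check can also be automated. The sketch below is one possible way to do it with `pathlib`; `is_loaded_from` is a hypothetical helper, and the paths in the demo are made up for illustration.

```python
from pathlib import Path

def is_loaded_from(module_file, expected_dir):
    """Return True if module_file is located somewhere under expected_dir."""
    module_path = Path(module_file).resolve()
    expected_path = Path(expected_dir).resolve()
    try:
        module_path.relative_to(expected_path)
    except ValueError:
        return False  # module_file is outside expected_dir
    return True

# Demo with made-up paths; in the notebook one would instead call
# is_loaded_from(nl2sql360.__file__, "../nl2sql360/src")
local_ok = is_loaded_from(
    "/work/nl2sql360/src/nl2sql360/__init__.py", "/work/nl2sql360/src"
)
global_hit = is_loaded_from(
    "/usr/lib/python3/site-packages/nl2sql360/__init__.py", "/work/nl2sql360/src"
)
print(local_ok, global_hit)
```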

## Step 1: Import Dataset

Before evaluating the NL2SQL model, you must import a properly structured dataset into the NL2SQL360 pipeline.  
We use `Core.import_dataset()` with `DatasetArguments` to register the dataset.

### 🗂 Expected Directory Structure
```
../dataset/bird_benchmark/
├── dev.json ← Sample file
├── dev_tables.json ← (Optional) Table schema info
├── dev_databases/ ← Folder containing SQLite DBs
│ ├── academic/
│ │ └── academic.sqlite
│ ├── college/
│ │ └── college.sqlite
```
> Official BIRD Dataset: https://bird-bench.github.io/

### 📄 Required Files

| File | Description |
|------|-------------|
| `dev.json` | JSON list of samples. Each entry should contain: <br>• `"question"`: natural language query <br>• `"SQL"`: gold SQL query <br>• `"db_id"`: matching database ID <br>• (Optional) `"difficulty"`: difficulty tag for analysis |
| `dev_databases/` | Directory with subfolders (e.g. `academic/`, `college/`), each containing a `.sqlite` file |
| `dev_tables.json` *(optional)* | JSON file containing the database schema (Spider format). Needed for exact-match evaluation |

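Before registering the dataset, it can help to confirm that every sample carries the required keys listed in the table. The following is a minimal sketch; `find_invalid_samples` is a hypothetical helper and the two samples are made up. In practice the list would come from `json.load` on `dev.json`.

```python
REQUIRED_KEYS = ("question", "SQL", "db_id")

def find_invalid_samples(samples):
    """Return (index, missing_keys) pairs for entries lacking required keys."""
    problems = []
    for index, sample in enumerate(samples):
        # A key counts as missing if it is absent or holds an empty value
        missing = [key for key in REQUIRED_KEYS if not sample.get(key)]
        if missing:
            problems.append((index, missing))
    return problems

# Made-up samples; the second one has no gold SQL
samples = [
    {
        "question": "How many students are there?",
        "SQL": "SELECT COUNT(*) FROM student",
        "db_id": "college",
        "difficulty": "simple",
    },
    {"question": "List all authors.", "db_id": "academic"},
]
problems = find_invalid_samples(samples)
print(problems)  # [(1, ['SQL'])]
```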

In [3]:

from nl2sql360.core import Core
from nl2sql360.arguments import CoreArguments, DatasetArguments


# Create the core object that manages registered datasets and evaluations
core_args = CoreArguments()
core = Core(core_args)

# Describe where the dataset lives and which JSON keys hold each field
dataset_args = DatasetArguments(
    dataset_name="bird_benchmark",
    dataset_dir="../dataset/bird_benchmark",
    samples_file="dev.json",
    database_dir="dev_databases",
    tables_file="dev_tables.json",
    question_key="question",
    sql_key="SQL",
    db_id_key="db_id",
    sql_complexity_key="difficulty",
    database_domain_file=None,
)
core.import_dataset(dataset_args)



## Step 2: Run Evaluation with Rule-Based Metrics

Once the dataset has been registered with `Core.import_dataset()`, you can evaluate model predictions using built-in rule-based metrics such as:

- **Execution Accuracy (EX)** — whether the execution results match
- **Valid Efficiency Score (VES)** — combines correctness with efficiency (runtime/memory)

In [4]:
from nl2sql360.core import Core
from nl2sql360.arguments import CoreArguments, EvaluationArguments
import nest_asyncio

# Allow nested event loops, since Jupyter already runs its own asyncio loop
nest_asyncio.apply()

evaluation_args = EvaluationArguments(
    eval_name="Test_Evaluation_1",
    eval_dataset="bird_benchmark",
    eval_metrics=["ex", "ves"],
    pred_sqls_file="../dataset/bird_benchmark/dev.sql",
    enable_spider_eval=True,
)
core.evaluate(evaluation_args)

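The `pred_sqls_file` above points to a plain-text file of predicted queries. Here is a minimal sketch for producing such a file, assuming the evaluator reads one SQL statement per line in the same order as `dev.json`. That layout is an assumption to verify against the local NL2SQL360 source, and `to_single_line` and `write_predictions` are hypothetical helper names.

```python
import os
import tempfile

def to_single_line(sql):
    """Collapse newlines and repeated whitespace so the query fits on one line."""
    # Caveat: this also alters whitespace inside SQL string literals.
    return " ".join(sql.split())

def write_predictions(predicted_sqls, output_path):
    """Write one predicted query per line (assumed file layout)."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for sql in predicted_sqls:
            handle.write(to_single_line(sql) + "\n")

# Made-up model outputs, one per sample, in dataset order
predictions = [
    "SELECT name\nFROM student\nWHERE gpa > 3.5",
    "SELECT COUNT(*) FROM course",
]
output_path = os.path.join(tempfile.mkdtemp(), "dev.sql")
write_predictions(predictions, output_path)

with open(output_path, encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()
print(written_lines)
```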


## Step 3: Generate an Evaluation Report
- There are several ways to generate a report.
- The cell below shows two simple ways to view the results. You can also define a `Filter` or a `Scenario` to restrict the results to specific query types.
- For more examples and use cases, refer to the official library repository.
> Official Library: https://github.com/HKUSTDial/NL2SQL360/blob/master/examples/py_examples/report.py#L5

In [5]:
from nl2sql360.core import Core
from nl2sql360.arguments import CoreArguments, EvaluationArguments
from nl2sql360.filter import Filter, Scenario, Field, Operator

    

print(core.query_overall_performance(dataset_name="bird_benchmark", metric="ex", eval_name="Test_Evaluation_1"))

print(core.query_overall_performance(dataset_name="bird_benchmark", metric="ves", eval_name="Test_Evaluation_1"))



# There are many more options for viewing results. The snippets below are illustrative examples of report options.

'''
SUBQUERY_FILTER = Filter(
    name="subquery",
    field=Field.SUBQUERY,
    operator=Operator.GT,
    value=0
)

BI_SCENARIO = Scenario(
    name="BI",
    filters=[Filter('agg', Field.AGGREGATION, Operator.GT, 0), Filter('join', Field.JOIN, Operator.GT, 0)]
)

print(core.query_overall_leaderboard(dataset_name="bird_benchmark", metric="ex"))

print(core.query_filter_performance(dataset_name="bird_benchmark", metric="ex", filter=SUBQUERY_FILTER, eval_name="Test_Evaluation_1"))

print(core.query_filter_leaderboard(dataset_name="bird_benchmark", metric="ex", filter=SUBQUERY_FILTER))

print(core.query_scenario_leaderboard(dataset_name="bird_benchmark", metric="ex", scenario=BI_SCENARIO))

print(core.query_dataset_domain_distribution(dataset_name="bird_benchmark"))

print(core.generate_evaluation_report(dataset_name="bird_benchmark",
                                       filters=[SUBQUERY_FILTER],
                                       scenarios=[BI_SCENARIO],
                                       metrics=["ex", "ves"]))
'''

          Evaluation     EX
0  Test_Evaluation_1  100.0
          Evaluation   VES
0  Test_Evaluation_1  2.27



## Step 4: Clear History Cache
- The cell below deletes the stored results of `Test_Evaluation_1`, then removes the registered `bird_benchmark` dataset together with its related evaluations.
- Run it when you want to repeat the exercise from a clean state.

In [7]:
core.delete_evaluation_history(
    dataset_name="bird_benchmark",
    eval_name="Test_Evaluation_1"
    )
core.delete_dataset_history(
    dataset_name="bird_benchmark",
    delete_relavant_evaluations=True
    )

2025-07-30 05:09:43.919 | SUCCESS | nl2sql360.core.core:delete_evaluation_history:537 - Delete evaluation `Test_Evaluation_1` for dataset `bird_benchmark` successfully.
2025-07-30 05:09:45.955 | SUCCESS | nl2sql360.core.core:delete_dataset_history:515 - Delete dataset `bird_benchmark` successfully.
