# Text-to-SQL

In this example, we adapt the [code](https://github.com/ShayanTalaei/CHESS/tree/fc6f0b7ef34ccb573d764be8fba52b4afdd20ff5) from the paper [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/abs/2405.16755).  

The workflow is as follows:

![text-to-sql](../imgs/text_to_sql.png)

To try this example, create and activate a new Python virtual environment, then run the following commands:

```console
    pip install -r requirements.txt
    pip install pysqlite3-binary
    pip install -U cognify-ai
```

Next, run the pre-processing script in `./run/run_preprocess.sh` to create the databases. This should generate a `data` folder. Ensure your `.env` file contains the following keys:
- `OPENAI_API_KEY`
- `DB_ROOT_PATH`, which should be set to the path of `data/dev`
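
A missing key typically only surfaces once the workflow is already running, so it can help to check the environment up front. The snippet below is a minimal sketch rather than part of the original repository: the helper name `missing_env_keys` is hypothetical, and it assumes the `.env` file has already been loaded into the process environment (for example with `dotenv.load_dotenv()`).

```python
import os

# The two keys this example expects in `.env`.
REQUIRED_KEYS = ("OPENAI_API_KEY", "DB_ROOT_PATH")


def missing_env_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in `env`.

    `env` defaults to the current process environment; passing a plain
    dict makes the helper easy to test. (Hypothetical helper name.)
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_env_keys()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

The check only confirms that the keys are set; it does not verify that `DB_ROOT_PATH` points at the generated `data/dev` folder.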

## Data loader

The original repository expects command-line arguments to be passed to its data loader. We keep the original parser function and set the arguments in the script itself. The dataset has no ground truth, so each example is a tuple whose second element is an empty dictionary `{}`.

Then, we use the data files generated by the pre-processing step in the `data` folder.

In [21]:
import json

import cognify
import dotenv
import numpy as np
from src.utils import parse_arguments

# Load OPENAI_API_KEY and DB_ROOT_PATH from the .env file
dotenv.load_dotenv()

@cognify.register_data_loader
def load_data():
    args = parse_arguments()  # reuse the original repository's argument parser

    def read_from_file(data_path, args):
        with open(data_path, "r") as file:
            dataset = json.load(file)

        inputs = []
        for data in dataset:
            inputs.append(
                {
                    'args': args,
                    'dataset': [data],
                }
            )
        # No ground truth in this case, so the label is an empty dictionary
        eval_data = [(item, {}) for item in inputs]
        return eval_data

    all_train = read_from_file('data/dev/other_sub_sampled.json', args)
    test_set = read_from_file('data/dev/sub_sampled_bird_dev_set.json', args)
    
    # Shuffle the training pool, then split it into train and validation sets
    all_train = np.random.permutation(all_train).tolist()
    return all_train[:100], all_train[100:], test_set[:10]
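
The loader above shuffles without a fixed seed, so the train and validation sets differ from run to run. If reproducible splits matter, one option is a seeded shuffle. The following is a minimal sketch on toy data using only the standard library; the helper `shuffle_and_split` and the `question_id` field are hypothetical and not part of the original code.

```python
import random


def shuffle_and_split(examples, n_train, seed=0):
    """Shuffle a copy of `examples` with a fixed seed and split it.

    Returns (train, val): the first `n_train` shuffled items and the rest.
    (Hypothetical helper, shown for illustration only.)
    """
    shuffled = list(examples)  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy stand-ins for the (input, ground_truth) pairs built by the loader.
toy = [({"dataset": [{"question_id": i}]}, {}) for i in range(5)]
train, val = shuffle_and_split(toy, n_train=3)
print(len(train), len(val))
```

Unlike converting through a NumPy array, shuffling the Python list directly keeps each example as an `(input, ground_truth)` tuple.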

## Evaluator

In this case, the SQL code is executed *during* the workflow in a sandbox environment, so the evaluator does not need to re-execute it. Instead, it only reports whether the result was correct. The returned value is a boolean; since Python's `bool` is a subclass of `int`, `True` and `False` behave as 1 and 0.

In [22]:
@cognify.register_evaluator
def eval_text_to_sql(counts):
    """
    Return whether any entry in the run statistics is marked correct.
    """
    correct = any(vs['correct'] == 1 for vs in counts.values())
    return correct
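
To make the check concrete, here is a small sketch that applies the same logic to toy input. The shape of `counts` is an assumption inferred from the evaluator body (a dictionary whose values each carry a `correct` field); the stage names used as keys are purely illustrative.

```python
def is_correct(counts):
    """Same check as the evaluator above, without the Cognify decorator."""
    return any(stats["correct"] == 1 for stats in counts.values())


# Hypothetical per-stage statistics; the key names are illustrative only.
passing = {"candidate_generation": {"correct": 0}, "revision": {"correct": 1}}
failing = {"candidate_generation": {"correct": 0}, "revision": {"correct": 0}}

# bool is a subclass of int, so the result can be used as a 0/1 score.
print(is_correct(passing), int(is_correct(passing)))
print(is_correct(failing), int(is_correct(failing)))
```

Because the check uses `any`, a single entry marked correct is enough for the whole example to count as correct.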

## Configuring the Optimizer

We've created a search option for text-to-SQL that searches over the following:
- Chain-of-Thought reasoning
- Planning before acting
- 2 few-shot examples
- An ensemble of 3 agents for a task

Let's use these search settings to conduct the optimization.

In [23]:
from cognify.hub.search import text_to_sql
search_settings = text_to_sql.create_search()

## Start the optimization

We've provided the three code blocks above in `config.py`. With the Cognify command line interface (CLI), you can start the optimization like this:

```console
$ cognify optimize workflow.py
```

Alternatively, you can run the following cell (*warning*: this workflow may run for quite some time):

In [None]:
train, val, dev = load_data()

opt_cost, pareto_frontier, opt_logs = cognify.optimize(
    script_path="workflow.py",
    control_param=search_settings,
    train_set=train,
    val_set=val,
    eval_fn=eval_text_to_sql,
    force=True, # This will overwrite the existing results
)