# Text2SQL Combined with RAG

In [2]:
import os, sys
sys.path.append(os.path.abspath(os.path.join('..', 'model_evaluation')))
from utils import preprare_directory
from dotenv import load_dotenv
load_dotenv()


True

# Workflow Diagram

![](/root/workspace/vrdc_text2sql/images/text2sql_rag.png)

**Features**:
- By default, the whole DDL is inserted directly into the prompt, although it can also be chunked
- RAG is performed only on Q&A pairs from the training dataset
- The top 3 most similar Q&A pairs are retrieved by default
- RAG over documentation is not supported
- Execution is asynchronous, unlike vanna, which does not support async (see [this discussion](https://github.com/vanna-ai/vanna/discussions/394))
- The vector index uses Faiss-CPU for now
- The embedding model can be swapped for a different one
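
The retrieval and prompt-assembly flow described in this list can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the prompt layout, and the toy 2-D embeddings are all assumptions, and a plain-Python cosine similarity stands in for the Faiss-CPU index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_embedding, qa_pairs, k=3):
    """Return the k stored Q&A pairs most similar to the query embedding."""
    ranked = sorted(
        qa_pairs,
        key=lambda pair: cosine_similarity(query_embedding, pair["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(ddl, question, examples):
    """Combine the full DDL, the retrieved examples, and the new question."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nSQL: {ex['sql']}" for ex in examples
    )
    return (
        f"Schema:\n{ddl}\n\n"
        f"Similar examples:\n{shots}\n\n"
        f"Question: {question}\nSQL:"
    )


# Toy data for illustration only (hypothetical questions, SQL, and embeddings).
qa_pairs = [
    {"question": "How many patients are there?",
     "sql": "SELECT COUNT(*) FROM patient", "embedding": [1.0, 0.0]},
    {"question": "Count all patients.",
     "sql": "SELECT COUNT(*) FROM patient", "embedding": [0.9, 0.1]},
    {"question": "List all lab names.",
     "sql": "SELECT DISTINCT labname FROM lab", "embedding": [0.0, 1.0]},
    {"question": "How many labs per patient?",
     "sql": "SELECT COUNT(*) FROM lab GROUP BY patientunitstayid",
     "embedding": [0.7, 0.7]},
]

ddl = "CREATE TABLE patient (uniquepid TEXT);"  # placeholder schema
top_examples = retrieve_top_k([1.0, 0.0], qa_pairs, k=3)
prompt = build_prompt(ddl, "What is the total number of patients?", top_examples)
print(prompt)
```

In the real pipeline the query embedding would come from the configured embedding model and the nearest-neighbour search from the Faiss index; only the overall shape (embed, retrieve top k, prepend DDL and examples) is meant to carry over.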

# Q&A Pair database

The train and validation splits of eICU are used as the vector database.

In [25]:
import pandas as pd
fp = "/root/workspace/vrdc_text2sql/model_evaluation/dataset/train_eval/eicu/train_val.csv"
df = pd.read_csv(fp)

print("Number of Q&A pairs in the vector database: ", len(df))

Number of Q&A pairs in the vector database:  10387


# Steps to run


1. Confirm that your vLLM server is still running:
	```bash
	curl -v http://localhost:8000/health
	```
2. Create the output directory where you want to store the results, for example:
	```bash
	mkdir -p model_predictions/eICU/rag/mistral_finetuned_openai_embed 
	```
3. Run the inference script:
	```bash
	python model_inference/rag/mistral-text2sql-rag-vllm.py \
	    --task_name ehrsql_eicu \
	    --ip localhost \
	    --port 8000 \
	    --checkpoint_path /root/workspace/mistral-nemo-minitron-8b-healthcare-text2sql_v1.0 \
	    --max_seq_length 4096 \
	    --batch_size 4 \
	    --save_dir model_predictions/eICU/rag/mistral_finetuned_openai_embed \
	    --train_file model_evaluation/dataset/train_eval/eicu/train_val.csv \
	    --embedding_model_name text-embedding-3-large \
	    --embedding_cache_file model_evaluation/dataset/train_eval/eicu/train_database_openai-text-embedding-3-large.pkl \
	    --top_k 3 \
	    --dataset_path model_evaluation/dataset/test/test_ehrsql_eicu_data_benchmark_rag.json \
	    --metadata_path model_evaluation/dataset/metadata/eicu_instruct_benchmark_rag.sql \
	    --format sqlite
	```

**Key Parameters:**

- `--task_name`: Dataset name (e.g., ehrsql_eicu, mimicsql)
- `--ip` and `--port`: vLLM server connection details
- `--checkpoint_path`: Path to your fine-tuned model
- `--train_file`: CSV file with training examples for RAG retrieval
- `--embedding_model_name`: Embedding model for vector search
- `--embedding_cache_file`: Cache file for pre-computed embeddings. If this is not a valid, pre-existing file, the pipeline computes the embeddings itself.
- `--top_k`: Number of similar examples to retrieve (default: 3)
- `--dataset_path`: Test dataset JSON file
- `--metadata_path`: Database schema/metadata file
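
The behaviour described for `--embedding_cache_file` follows a common load-or-compute pattern, sketched below. The helper name `load_or_create_embeddings` and the `embed_fn` callable are hypothetical, and writing the freshly computed embeddings back to the cache path is an assumption about the script, not something confirmed here.

```python
import os
import pickle


def load_or_create_embeddings(cache_file, texts, embed_fn):
    """Reuse cached embeddings when the cache is valid, otherwise compute them.

    `embed_fn` is a hypothetical callable mapping one text to one vector.
    """
    if os.path.isfile(cache_file):
        try:
            with open(cache_file, "rb") as f:
                cached = pickle.load(f)
            # Treat the cache as valid only if it matches the current texts.
            if isinstance(cached, list) and len(cached) == len(texts):
                return cached
        except (pickle.UnpicklingError, EOFError, ValueError):
            pass  # unreadable cache: fall through and rebuild it

    embeddings = [embed_fn(text) for text in texts]
    # Assumption: the rebuilt embeddings are saved for later runs.
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

With a cache in place, a second run over the same training file skips the embedding calls entirely, which matters most when the embedding model is a paid API.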

**Outputs:**

- Model predictions in JSONL format
- System performance metrics (memory, latency, throughput)
- FAISS index cache for faster subsequent runs

**Performance Optimization:**

- Adjust `--batch_size` based on GPU memory
- Use `--top_k` to control retrieval context size
- Monitor memory usage and adjust `--max-num-seqs` on the vLLM server accordingly
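
One way to picture the effect of `--batch_size` is as a cap on the number of requests in flight at once. The sketch below assumes that interpretation, which the script may or may not share; `generate_sql` is a hypothetical stand-in for an async call to the vLLM server.

```python
import asyncio


async def generate_sql(prompt):
    """Hypothetical stand-in for an async request to the vLLM server."""
    await asyncio.sleep(0.01)  # simulate network and inference latency
    return f"SELECT 1 -- for: {prompt}"


async def run_batched(prompts, batch_size=4):
    """Run all prompts with at most `batch_size` requests in flight."""
    semaphore = asyncio.Semaphore(batch_size)

    async def worker(prompt):
        async with semaphore:
            return await generate_sql(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(worker(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(run_batched([f"q{i}" for i in range(10)], batch_size=4))
    print(len(results), "results")
```

Raising the cap increases throughput until the server's GPU memory or its own sequence limit becomes the bottleneck, so the two settings are best tuned together.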

# Evaluation

In [3]:


# Create the output directory for the evaluation results inside the prediction directory.
# The evaluation needs a clean, new folder because it overwrites any existing files in it.
pred_directory = "/root/workspace/vrdc_text2sql/model_predictions/eICU/rag/claude_sonnet_4_no_thinking_ddl5_qa6"
eval_directory = os.path.join(pred_directory, "evaluation")
preprare_directory(eval_directory, exist_ok=False)

# the predicted file from previous step
pred_file = f"{pred_directory}/test_rag_vllm_ehrsql_eicu_result_mis_embedd.jsonl"

print("Using predictions from: ", pred_file)

# path to the eICU database
db_path = "/root/workspace/vrdc_text2sql/model_evaluation/databases/eicu.sqlite"

Using predictions from:  /root/workspace/vrdc_text2sql/model_predictions/eICU/rag/claude_sonnet_4_no_thinking_ddl5_qa6/test_rag_vllm_ehrsql_eicu_result_mis_embedd.jsonl


In [4]:
# run evaluation
!python ../model_evaluation/ehrsql_eval.py \
    --pred_file {pred_file} \
    --db_path {db_path} \
    --num_workers -1 \
    --timeout 60 \
    --out_file {eval_directory} \
    --ndigits 2

In [5]:
import json

# file path to the evaluation result file. 
fp = f"{pred_directory}/evaluation/test_rag_vllm_ehrsql_eicu_result_mis_embedd_metrics.json"
print("Reading from file: ", fp)

with open(fp, "r") as f:
    metrics = json.load(f)

print(json.dumps(metrics, indent=4))

Reading from file:  /root/workspace/vrdc_text2sql/model_predictions/eICU/rag/claude_sonnet_4_no_thinking_ddl5_qa6/evaluation/test_rag_vllm_ehrsql_eicu_result_mis_embedd_metrics.json
{
    "precision_ans": 100.0,
    "recall_ans": 100.0,
    "f1_ans": 100.0,
    "precision_exec": 92.63,
    "recall_exec": 92.63,
    "f1_exec": 92.63,
    "acc": 92.63
}
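
As a quick sanity check on the execution metrics above, the reported F1 can be recomputed from precision and recall. This assumes the evaluator uses the standard harmonic-mean definition of F1; the script's own formula is not shown in this notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics file printed above.
reported = {"precision_exec": 92.63, "recall_exec": 92.63, "f1_exec": 92.63}

recomputed = round(
    f1_score(reported["precision_exec"], reported["recall_exec"]), 2
)
print("Recomputed f1_exec:", recomputed)
```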
