# Text-to-SQL with MIMIC-IV Demo & CodeS-1B Model

This notebook provides a complete pipeline for converting natural language questions into SQL queries using a local LLM. It is configured to work with the **MIMIC-IV Demo database** (`mimic_iv.sqlite`) and the resources from the **EHRSQL 2024 Shared Task**.

### Workflow:
1.  **Setup & Configuration**: Define paths to your database and project files.
2.  **Load Database & Schema**: Connect to the SQLite database and load the schema context from `tables.json`.
3.  **Load LLM**: Load the `seeklhy/codes-1b` model for SQL generation.
4.  **Generate & Execute**: Take a natural language question, generate an SQL query, and execute it against the database.

## 1. Install Required Packages

In [1]:
%pip install torch transformers accelerate sqlalchemy pandas python-dateutil

Collecting accelerate
  Downloading accelerate-1.10.0-py3-none-any.whl (374 kB)
Collecting sqlalchemy
  Downloading sqlalchemy-2.0.43-cp310-cp310-win_amd64.whl (2.1 MB)
Collecting greenlet>=1
  Downloading greenlet-3.2.4-cp310-cp310-win_amd64.whl (298 kB)
Installing collected packages: greenlet, sqlalchemy, accelerate
Successfully installed accelerate-1.10.0 greenlet-3.2.4 sqlalchemy-2.0.43
Note: you may need to restart the kernel to use updated packages.





## 2. Import Libraries

In [2]:
import os
import json
import warnings
import pandas as pd
from sqlalchemy import create_engine, text
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully.")

✅ Libraries imported successfully.


## 3. Configuration

Update the `PROJECT_PATH` variable to the absolute path of your `ehrsql-2024` project folder.

In [25]:
# IMPORTANT: Set this to the path of your ehrsql-2024 folder
# Example for Windows with WSL: '/mnt/c/Uni/Bachelorarbeit/ehrsql-2024/'
# Example for standard Windows: 'C:/Uni/Bachelorarbeit/ehrsql-2024/'
PROJECT_PATH = 'C:/Uni/Bachelorarbeit/ehrsql-2024'

DB_PATH = os.path.join(PROJECT_PATH, 'data/mimic_iv/mimic_iv.sqlite')
SCHEMA_PATH = os.path.join(PROJECT_PATH, 'data/mimic_iv/tables.json')
QUESTIONS_PATH = os.path.join(PROJECT_PATH, 'data/mimic_iv/test/data.json')


print(f"Database Path: {DB_PATH}")
print(f"Schema Path: {SCHEMA_PATH}")
print(f"Questions Path: {QUESTIONS_PATH}")

# Verify that each required file exists
for label, path in [("Database", DB_PATH), ("Schema", SCHEMA_PATH), ("Questions", QUESTIONS_PATH)]:
    if not os.path.exists(path):
        print(f"❌ ERROR: {label} file not found. Please check your PROJECT_PATH.")

Database Path: C:/Uni/Bachelorarbeit/ehrsql-2024\data/mimic_iv/mimic_iv.sqlite
Schema Path: C:/Uni/Bachelorarbeit/ehrsql-2024\data/mimic_iv/tables.json
Questions Path: C:/Uni/Bachelorarbeit/ehrsql-2024\data/mimic_iv/test/data.json
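
The printed paths mix `/` and `\` because `os.path.join` inserts the platform separator between the components it joins. Windows accepts both, so this is harmless. As a minimal sketch, assuming the same folder layout as the configuration cell, `pathlib` can build the same paths with consistent separators. `build_paths` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path

def build_paths(project_path):
    """Return the three file paths used in this notebook, rooted at project_path."""
    # Assumed layout: <project>/data/mimic_iv/..., as in the configuration cell
    data_dir = Path(project_path) / "data" / "mimic_iv"
    return {
        "db": data_dir / "mimic_iv.sqlite",
        "schema": data_dir / "tables.json",
        "questions": data_dir / "test" / "data.json",
    }

paths = build_paths("C:/Uni/Bachelorarbeit/ehrsql-2024")
for name, path in paths.items():
    # as_posix() renders the path with forward slashes on every platform
    print(f"{name}: {path.as_posix()}")
```

`Path` objects can be passed to `os.path.exists` and `open` directly, so the rest of the notebook would not need to change.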


## 4. Connect to the Database

In [18]:
def connect_to_db(db_path):
    """Create a connection engine for the SQLite database."""
    try:
        engine = create_engine(f'sqlite:///{db_path}')
        with engine.connect() as conn:
            print(f"✅ Database connection successful to: {db_path}")
        return engine
    except Exception as e:
        print(f"❌ Database connection failed: {e}")
        return None

db_engine = connect_to_db(DB_PATH)

✅ Database connection successful to: C:/Uni/Bachelorarbeit/ehrsql-2024\data/mimic_iv/mimic_iv.sqlite
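
By default SQLite creates an empty database file when the given path does not exist, so a successful connection alone does not prove that the intended file was opened. A minimal sketch of a sanity check follows. It uses the standard-library `sqlite3` module rather than the SQLAlchemy engine above, `list_tables` is a hypothetical helper, and the in-memory database with two made-up tables stands in for the real file.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all user tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

# Demo on an in-memory database; for the real file, use sqlite3.connect(DB_PATH).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE patients (subject_id INTEGER, gender TEXT)")
demo_conn.execute("CREATE TABLE admissions (hadm_id INTEGER, subject_id INTEGER)")
tables = list_tables(demo_conn)
print(tables)
```

An empty list would indicate that a new, empty database was created instead of the expected one being opened.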


## 5. Load and Format the Database Schema

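The model can only write valid SQL if it knows the table and column names, so the schema is passed to the LLM as prompt context. This resembles the context-injection step of Retrieval-Augmented Generation (RAG), although here the full schema is included in every prompt rather than a retrieved subset. We load the schema from `tables.json` and format it as a string with one line per table.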

In [19]:
def get_schema_context(schema_path):
    """Loads the database schema from tables.json and formats it for the LLM prompt."""
    try:
        with open(schema_path, 'r') as f:
            schema_data = json.load(f)[0] # The data is inside a list
        
        context_parts = []
        for i, table_name in enumerate(schema_data['table_names_original']):
            table_columns = [col[1] for col in schema_data['column_names_original'] if col[0] == i]
            context_parts.append(f"Table {table_name}, columns = [{', '.join(table_columns)}]")
            
        schema_context = '\n'.join(context_parts)
        print("✅ Schema context created successfully.")
        # print("--- Schema Context Preview ---")
        # print(schema_context[:500] + "...")
        return schema_context
    except Exception as e:
        print(f"❌ Failed to load or parse schema file: {e}")
        return None

schema_context = get_schema_context(SCHEMA_PATH)

✅ Schema context created successfully.
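
To make the expected input format concrete: the function above relies on `column_names_original` being a list of `[table_index, column_name]` pairs, where the index points into `table_names_original`. The sketch below applies the same formatting logic to a small made-up schema. The two tables, their columns, and the `[-1, "*"]` placeholder entry (a Spider-style convention assumed here) are illustrative and not taken from the real `tables.json`.

```python
# Toy schema in the assumed tables.json layout (illustrative values only)
toy_schema = {
    "table_names_original": ["patients", "admissions"],
    "column_names_original": [
        [-1, "*"],           # placeholder entry that belongs to no table
        [0, "subject_id"],
        [0, "gender"],
        [1, "hadm_id"],
        [1, "subject_id"],
    ],
}

def format_schema(schema_data):
    """Format a schema dict as one 'Table ..., columns = [...]' line per table."""
    lines = []
    for i, table_name in enumerate(schema_data["table_names_original"]):
        # Keep only the columns whose table index matches the current table
        columns = [
            name
            for table_idx, name in schema_data["column_names_original"]
            if table_idx == i
        ]
        lines.append(f"Table {table_name}, columns = [{', '.join(columns)}]")
    return "\n".join(lines)

toy_context = format_schema(toy_schema)
print(toy_context)
# Table patients, columns = [subject_id, gender]
# Table admissions, columns = [hadm_id, subject_id]
```

Entries with index `-1` match no table and are therefore dropped from the context string.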


## 6. Load the LLM (CodeS-1B)

In [20]:
def load_codes_model():
    """Load the CodeS-1B model for SQL generation."""
    print("🤖 Loading CodeS-1B model...")
    print("📥 This may take a few minutes on the first run as it downloads the model (~2GB).")
    
    model_name = "seeklhy/codes-1b"
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float32, # Use float32 for CPU
            trust_remote_code=True
        )
        
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        print("🔥 Model loaded successfully on CPU!")
        return model, tokenizer
        
    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        return None, None

model, tokenizer = load_codes_model()

🤖 Loading CodeS-1B model...
📥 This may take a few minutes on the first run as it downloads the model (~2GB).
🔥 Model loaded successfully on CPU!


## 7. Core Functions for Text-to-SQL

These functions handle SQL generation and query execution. Execution errors are caught and returned as a message rather than raised.
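
Model output often continues past the query itself, so the generation function defined next cuts the decoded text at the first semicolon. As a minimal sketch of that kind of cleanup in isolation, the hypothetical helper `clean_sql` below also discards an echoed prompt and collapses whitespace. The `raw` string is a made-up example of model output, not a recorded result.

```python
def clean_sql(generated_text, marker="### SQL Query:"):
    """Reduce raw model output to a single SQL statement (best effort)."""
    text = generated_text
    # If the prompt was echoed back, keep only what follows the last marker
    if marker in text:
        text = text.rsplit(marker, 1)[1]
    # Keep only the first statement; anything after ';' is treated as noise.
    # Limitation: a ';' inside a string literal would cut the query short.
    text = text.split(";")[0]
    # Collapse newlines and repeated spaces into single spaces
    return " ".join(text.split())

# Made-up example of raw output (illustrative only)
raw = "### SQL Query:\nSELECT COUNT(*)\nFROM patients\nWHERE gender = 'F';\n\n### Explanation: ..."
cleaned = clean_sql(raw)
print(cleaned)
# SELECT COUNT(*) FROM patients WHERE gender = 'F'
```

Cutting at the first semicolon is a heuristic: it is simple and predictable, but it cannot tell a statement terminator from a semicolon inside a quoted value.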

In [None]:
def generate_sql(question: str, schema: str, llm_model, llm_tokenizer):
    """Generate SQL using the loaded LLM."""
    if not llm_model or not llm_tokenizer:
        return "❌ Model not loaded"

    prompt = f"""### Instructions:
Your task is to convert a question into a SQL query, given a database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question.
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT t1.col1, t2.col2 FROM table1 AS t1 JOIN table2 AS t2 ON t1.id = t2.id`.

### Input:
Question: {question}

### Database Schema:
{schema}

### SQL Query:"""
    
    # Tokenize the input and create an attention mask
    inputs = llm_tokenizer(
        prompt, 
        return_tensors="pt", 
        truncation=True, 
        max_length=2048
    )
    
    with torch.no_grad():
        outputs = llm_model.generate(
            **inputs, # Pass both input_ids and attention_mask
            max_new_tokens=512,
            do_sample=False, # Use greedy decoding for more consistent results
            pad_token_id=llm_tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens; the output of generate() starts with the prompt tokens
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = llm_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    # A simple cleanup: keep only the text before the first semicolon
    sql_query = generated_text.split(';')[0].strip()
        
    return sql_query

def execute_sql(engine, query):
    """Execute a SQL query and return the result as a DataFrame."""
    if not engine:
        return pd.DataFrame(), "No database connection."
    try:
        with engine.connect() as conn:
            df = pd.read_sql_query(text(query), conn)
        return df, "✅ Success"
    except Exception as e:
        return pd.DataFrame(), f"❌ Query execution error: {str(e)}"
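
The `execute_sql` function above catches errors but runs whatever statement the model produces. Because generated SQL is untrusted, one option is to reject anything that is not a single read-only statement before executing it. The sketch below is a deliberately simple keyword check, not a full SQL parser, and the helper name `is_read_only_query` is hypothetical. Opening the SQLite file in read-only mode would be a stronger safeguard.

```python
import re

# Keywords that modify data, schema, or connection state. The list is
# conservative: it also rejects harmless queries that merely use one of
# these words, for example the REPLACE() string function.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|replace|attach|detach|pragma|vacuum)\b",
    re.IGNORECASE,
)

def is_read_only_query(sql):
    """Return True if sql looks like a single SELECT (or WITH ... SELECT) statement."""
    statement = sql.strip().rstrip(";").strip()
    # Reject empty input and anything containing a second statement
    if not statement or ";" in statement:
        return False
    first_word = statement.split(None, 1)[0].lower()
    if first_word not in ("select", "with"):
        return False
    return FORBIDDEN.search(statement) is None

print(is_read_only_query("SELECT COUNT(*) FROM patients"))   # True
print(is_read_only_query("DROP TABLE patients"))             # False
print(is_read_only_query("SELECT 1; DROP TABLE patients"))   # False
```

A caller could run this check before `execute_sql` and return an error message for rejected queries instead of sending them to the database.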

## 8. Run a Test Query

Let's test the full pipeline. The EHRSQL questions file is loaded to confirm that it is readable, but the test itself uses a simple hand-written question.

In [26]:
def run_full_test(question, schema, engine, llm_model, llm_tokenizer):
    print(f"❓ Question: {question}")
    
    # Generate SQL
    generated_query = generate_sql(question, schema, llm_model, llm_tokenizer)
    print(f"\n🤖 Generated SQL:\n```sql\n{generated_query}\n```")
    
    # Execute SQL
    result_df, message = execute_sql(engine, generated_query)
    
    print(f"\n📊 Execution Result: {message}")
    if not result_df.empty:
        display(result_df)

# Load the questions from the EHRSQL test split (QUESTIONS_PATH)
try:
    with open(QUESTIONS_PATH, 'r') as f:
        questions_data = json.load(f)['data']
    
    # Use a simple hand-written question for a clear test
    # (questions_data is loaded above but not used in this cell)
    test_question = "How many patients are female?"
    
    # Run the test if all components are ready
    if db_engine and schema_context and model and tokenizer:
        run_full_test(test_question, schema_context, db_engine, model, tokenizer)
    else:
        print("⚠️ Cannot run test. One or more components (DB, Schema, Model) failed to load.")
except Exception as e:
    print(f"❌ Could not load or run test questions: {e}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


❓ Question: How many patients are female?

🤖 Generated SQL:
```sql
SELECT COUNT(DISTINCT gender)
FROM patients
JOIN admissions ON patients.subject_id = admissions.subject_id
JOIN d_icd_diagnoses ON admissions.hadm_id = d_icd_diagnoses.hadm_id
JOIN d_icd_procedures ON admissions.hadm_id = d_icd_procedures.hadm_id
JOIN d_labitems ON admissions.hadm_id = d_labitems.hadm_id
JOIN d_items ON d_labitems.itemid = d_items.itemid
JOIN diagnoses_icd ON admissions.hadm_id = diagnoses_icd.hadm_id
JOIN procedures_icd ON admissions.hadm_id = procedures_icd.hadm_id
JOIN labevents ON admissions.hadm_id = labevents.hadm_id
JOIN prescriptions ON admissions.hadm_id = prescriptions.hadm_id
JOIN cost ON admissions.hadm_id = cost.hadm_id
JOIN chartevents ON admissions.hadm_id = chartevents.hadm_id
JOIN inputevents ON admissions.hadm_id = inputevents.hadm_id
JOIN outputevents ON admissions.hadm_id = outputevents.hadm_id
JOIN microbiologyevents ON admissions.hadm_id = microbiologyevents.hadm_id
JOIN icust