# Database AI Agents: Text-to-SQL for HR Recruitment

This notebook demonstrates **Database AI Agent development** for AI-powered data analysis by building an agent that:

1. **Converts natural language to SQL** - Translates HR questions into database queries
2. **Enforces safety guardrails** - SELECT-only validation, a default row limit, and a prompt that restricts queries to the known tables
3. **Applies time constraints** - Prompts the LLM to add SQLite date filters for time-based questions
4. **Generates professional summaries** - Clear explanations for HR teams

## Key Concepts Demonstrated

- **Natural Language to SQL**: Using LLMs to convert questions to queries
- **Safety Guardrails**: Preventing dangerous database operations
- **Schema-Aware Processing**: Understanding database structure and relationships
- **Error Handling & Retry Logic**: Graceful failure recovery
- **Professional Output Generation**: Business-ready summaries

## Scenario
An HR analytics agent that helps recruitment teams get insights from their hiring database without writing SQL. The agent answers questions about candidates, interviews, offers, and hiring pipelines.

**Note**: This demo uses **SQLite** for simplicity, but the same patterns work with PostgreSQL, MySQL, or other databases via SQLAlchemy.
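
One thing that does change between databases is date arithmetic, and the prompt rules later in this notebook are SQLite-specific. The sketch below shows how a "last 30 days" filter differs per dialect. The `RECENT_30_DAYS` mapping and the `recent_filter` helper are illustrative names, not part of the notebook, and only the SQLite form is exercised here, against a throwaway in-memory table.

```python
import sqlite3

# "Last 30 days" cutoff per dialect. Only the SQLite entry is run below;
# the PostgreSQL and MySQL forms are listed for comparison.
RECENT_30_DAYS = {
    "sqlite": "date('now', '-30 days')",
    "postgresql": "CURRENT_DATE - INTERVAL '30 days'",
    "mysql": "CURDATE() - INTERVAL 30 DAY",
}


def recent_filter(column: str, dialect: str = "sqlite") -> str:
    """Build a WHERE fragment that keeps rows from the last 30 days."""
    return f"{column} >= {RECENT_30_DAYS[dialect]}"


# Check the SQLite form against a small in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (application_id INTEGER, application_date TEXT)")
conn.execute("INSERT INTO applications VALUES (1, date('now', '-5 days'))")
conn.execute("INSERT INTO applications VALUES (2, date('now', '-60 days'))")
recent_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT application_id FROM applications WHERE {recent_filter('application_date')}"
    )
]
conn.close()
print(recent_ids)
```

When porting the agent, the date examples in the prompt would need the same kind of swap, since the rules given to the LLM explicitly forbid the non-SQLite forms.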

In [1]:
# Import required libraries
import os
import re
import json
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Optional, Dict, Any, Tuple
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError
from openai import OpenAI

# Initialize OpenAI client with Vocareum endpoint
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("🔧 Environment Setup:")
print(f"   ✅ OpenAI API Key: {'✓ Configured' if os.getenv('OPENAI_API_KEY') else '❌ Missing'}")
print(f"   🔧 Database: Using SQLite for demo")

🔧 Environment Setup:
   ✅ OpenAI API Key: ✓ Configured
   🔧 Database: Using SQLite for demo


## Define Data Models and Schema

We'll use dataclasses to structure our query results and define the database schema.

In [2]:
@dataclass
class QueryResult:
    """Represents the result of a text-to-SQL operation"""
    original_question: str
    generated_sql: str
    executed_sql: str
    data: pd.DataFrame
    row_count: int
    summary: str
    time_filter_applied: Optional[str] = None
    assumptions_made: Optional[List[str]] = None
    
    def __post_init__(self):
        if self.assumptions_made is None:
            self.assumptions_made = []

@dataclass
class DatabaseSchema:
    """Represents our known database schema for validation"""
    tables: Dict[str, List[str]]
    relationships: Dict[str, str]
    time_columns: Dict[str, str]

# Define our HR recruitment database schema
HR_SCHEMA = DatabaseSchema(
    tables={
        'departments': ['department_id', 'department_name', 'hiring_manager', 'budget_usd'],
        'positions': ['position_id', 'department_id', 'job_title', 'level', 'salary_min', 'salary_max', 'status', 'posted_date'],
        'candidates': ['candidate_id', 'full_name', 'email', 'phone', 'years_experience', 'current_company', 'source'],
        'applications': ['application_id', 'candidate_id', 'position_id', 'application_date', 'status', 'resume_score'],
        'interviews': ['interview_id', 'application_id', 'interview_date', 'interview_type', 'interviewer_name', 'rating', 'feedback_summary'],
        'offers': ['offer_id', 'application_id', 'offer_date', 'salary_offered', 'signing_bonus', 'status', 'response_date']
    },
    relationships={
        'positions.department_id': 'departments.department_id',
        'applications.candidate_id': 'candidates.candidate_id',
        'applications.position_id': 'positions.position_id',
        'interviews.application_id': 'applications.application_id',
        'offers.application_id': 'applications.application_id'
    },
    time_columns={
        'positions': 'posted_date',
        'applications': 'application_date',
        'interviews': 'interview_date',
        'offers': 'offer_date'
    }
)

print("📋 HR Database Schema Loaded:")
for table, columns in HR_SCHEMA.tables.items():
    print(f"   📊 {table}: {len(columns)} columns")

📋 HR Database Schema Loaded:
   📊 departments: 4 columns
   📊 positions: 8 columns
   📊 candidates: 7 columns
   📊 applications: 6 columns
   📊 interviews: 7 columns
   📊 offers: 7 columns


## Database Connection and Utilities

In [3]:
def get_schema_description(schema: DatabaseSchema) -> str:
    """Get a formatted description of database schema for LLM"""
    desc = "Available Tables and Columns:\n"
    
    for table, columns in schema.tables.items():
        desc += f"\n{table}:\n"
        for col in columns:
            desc += f"  - {col}\n"
    
    desc += "\nKey Relationships:\n"
    for rel, target in schema.relationships.items():
        desc += f"  - {rel} → {target}\n"
        
    desc += "\nTime Columns (for filtering):\n"
    for table, time_col in schema.time_columns.items():
        desc += f"  - {table}.{time_col}\n"
            
    return desc

def check_database_exists():
    """Check if the HR database exists and show stats"""
    db_path = "hr_recruitment.db"
    if not os.path.exists(db_path):
        print(f"❌ Database file '{db_path}' not found!")
        print("   Run 'python3 setup_hr_database.py' first to create the database.")
        return False
    
    # Connect and show stats
    engine = create_engine(f"sqlite:///{db_path}", echo=False)
    with engine.connect() as conn:
        # Get table counts
        tables_info = []
        for table in HR_SCHEMA.tables.keys():
            result = conn.execute(text(f"SELECT COUNT(*) FROM {table}"))
            count = result.fetchone()[0]
            tables_info.append(f"   📊 {table}: {count:,} records")
        
        # Get key metrics
        result = conn.execute(text("SELECT AVG(salary_offered) FROM offers"))
        avg_salary = result.fetchone()[0]
        
        result = conn.execute(text("SELECT COUNT(*) FROM applications WHERE status = 'Hired'"))
        hired_count = result.fetchone()[0]
        
        print("✅ Database connection successful!")
        print("📋 Database Statistics:")
        for info in tables_info:
            print(info)
        print(f"   💰 Average salary offered: ${avg_salary:,.0f}")
        print(f"   👥 Total hires: {hired_count}")
        print(f"   📅 Data range: Last 180 days of recruitment data")
    
    return engine

# Initialize database connection
print("🔗 Connecting to HR recruitment database...")
engine = check_database_exists()

if engine:
    print("🤖 Ready to initialize the HR Text-to-SQL Agent!")
else:
    print("⚠️  Please run the database setup script first.")

🔗 Connecting to HR recruitment database...
✅ Database connection successful!
📋 Database Statistics:
   📊 departments: 8 records
   📊 positions: 39 records
   📊 candidates: 200 records
   📊 applications: 263 records
   📊 interviews: 402 records
   📊 offers: 76 records
   💰 Average salary offered: $146,671
   👥 Total hires: 42
   📅 Data range: Last 180 days of recruitment data
🤖 Ready to initialize the HR Text-to-SQL Agent!


## Build the HR Text-to-SQL Agent

The agent orchestrates multiple capabilities:
1. **SQL Generation**: Converts natural language to SQL using LLM
2. **Safety Validation**: Applies guardrails before execution
3. **Query Execution**: Runs safe queries and returns data
4. **Summary Generation**: Creates professional explanations
5. **Retry Logic**: Handles errors with feedback loop
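
For the safety-validation step, the prompt asks for at most 20 rows, but a model can still return a larger `LIMIT`. The sketch below is one possible way to cap it. `enforce_limit` is a hypothetical helper, not a method of the agent that follows, and it only understands a plain trailing `LIMIT n`; any other form (for example `LIMIT ... OFFSET ...`) is passed through unchanged.

```python
import re


def enforce_limit(sql: str, max_rows: int = 20) -> str:
    """Append LIMIT if absent, or lower a trailing LIMIT that exceeds max_rows."""
    # Normalise the tail: surrounding whitespace and a final semicolon
    body = sql.strip().rstrip(";").rstrip()
    match = re.search(r"\bLIMIT\s+(\d+)\s*$", body, flags=re.IGNORECASE)
    if match is None:
        if re.search(r"\bLIMIT\b", body, flags=re.IGNORECASE):
            # LIMIT is present in a form this sketch does not parse
            return body + ";"
        return f"{body} LIMIT {max_rows};"
    if int(match.group(1)) > max_rows:
        # Replace the oversized trailing LIMIT with the cap
        body = body[: match.start()] + f"LIMIT {max_rows}"
    return body + ";"


for candidate in (
    "SELECT * FROM candidates",
    "SELECT * FROM candidates LIMIT 500;",
    "SELECT * FROM candidates LIMIT 5;\n",
):
    print(enforce_limit(candidate))
```

A production version would more likely parse the statement than pattern-match it; the regex is kept here only to make the idea visible in a few lines.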

In [4]:
class HRTextToSQLAgent:
    """AI agent for converting natural language to safe SQL queries for HR operations"""
    
    def __init__(self, engine, schema: DatabaseSchema):
        self.engine = engine
        self.schema = schema
        self.query_history = []
        
    def process_question(self, question: str, show_sql_answer: bool = False) -> QueryResult:
        """
        Main method to process a natural language question
        
        Args:
            question: Natural language question about HR data
            show_sql_answer: Whether to display SQL queries during processing
            
        Returns:
            QueryResult with SQL, data, and summary
        """
        print(f"🔍 Processing: {question}")
        
        # Step 1: Generate SQL from natural language with retry logic
        generated_sql, generation_attempts = self._generate_sql_with_retry(question, show_sql_answer)
        if show_sql_answer:
            print(f"📝 Generated SQL (attempt {generation_attempts}): {generated_sql}")
        
        # Step 2: Apply safety checks and modifications
        safe_sql, assumptions = self._apply_safety_checks(generated_sql, question)
        if show_sql_answer:
            print(f"🛡️ Safe SQL: {safe_sql}")
        
        # Step 3: Execute the query
        data, row_count = self._execute_query(safe_sql)
        
        # Step 4: Generate summary
        summary = self._generate_summary(question, safe_sql, data, assumptions)
        
        # Create and store result
        result = QueryResult(
            original_question=question,
            generated_sql=generated_sql,
            executed_sql=safe_sql,
            data=data,
            row_count=row_count,
            summary=summary,
            assumptions_made=assumptions
        )
        
        self.query_history.append(result)
        return result
    
    def _generate_sql_with_retry(self, question: str, show_sql_answer: bool = False, max_attempts: int = 3) -> Tuple[str, int]:
        """Generate SQL with retry logic and error feedback"""
        last_error = None
        # Fallback so the final return never references an unbound name
        # if every attempt raises before sql is assigned
        sql = "SELECT 'Error generating SQL' as error_message LIMIT 1;"
        
        for attempt in range(1, max_attempts + 1):
            try:
                sql = self._generate_sql(question, previous_error=last_error, attempt=attempt)
                validation_error = self._validate_sql_syntax(sql)
                
                if validation_error is None:
                    if attempt > 1 and show_sql_answer:
                        print(f"✅ SQL generation successful on attempt {attempt}")
                    return sql, attempt
                else:
                    last_error = validation_error
                    if show_sql_answer:
                        print(f"❌ Attempt {attempt} failed: {validation_error}")
                        if attempt < max_attempts:
                            print(f"🔄 Retry attempt {attempt + 1} with error feedback...")
                    
            except Exception as e:
                last_error = f"Generation error: {str(e)}"
                if show_sql_answer:
                    print(f"❌ Attempt {attempt} failed: {last_error}")
        
        if show_sql_answer:
            print(f"⚠️  All {max_attempts} attempts failed, using last attempt")
        return sql, max_attempts
    
    def _generate_sql(self, question: str, previous_error: Optional[str] = None, attempt: int = 1) -> str:
        """Generate SQL query from natural language using LLM"""
        
        schema_info = get_schema_description(self.schema)
        
        base_rules = """Important Rules:
1. Only use SELECT statements (no INSERT, UPDATE, DELETE, DROP, etc.)
2. Only query from the tables listed above
3. Always include a LIMIT clause (max 20 rows)
4. For time-based queries, include appropriate date filters
5. Use proper JOINs to get related data
6. Use meaningful column aliases for readability
7. Order results logically
8. **CRITICAL: Use SQLite functions ONLY - NO MySQL/PostgreSQL syntax**

SQLite Date/Time Functions (USE THESE):
- Time filters: application_date >= date('now', '-30 days')
- Extract month: strftime('%m', application_date) AS month
- Extract year: strftime('%Y', application_date) AS year
- Month name: SQLite strftime has no '%B'; return the month number with strftime('%m', application_date)

FORBIDDEN Functions (DO NOT USE):
- MONTH() ❌ Use strftime('%m', date_column) ✅
- YEAR() ❌ Use strftime('%Y', date_column) ✅
- NOW() ❌ Use date('now') ✅
- INTERVAL ❌ Use date('now', '-X days') ✅"""
        
        error_feedback = ""
        if previous_error and attempt > 1:
            error_feedback = f"\nPREVIOUS ATTEMPT FAILED with error: {previous_error}\nFix the previous error and generate a corrected SQL query.\n"
        
        prompt = f"""You are a SQL expert helping HR teams analyze recruitment data. Convert this natural language question into a SELECT SQL query.

Database Schema:
{schema_info}

{error_feedback}{base_rules}

Question: {question}

Return only the SQL query, no explanations or markdown formatting:"""
        
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": question}
                ],
                temperature=0.1,
                max_tokens=500
            )
            
            sql = response.choices[0].message.content.strip()
            # Remove markdown fences, then any whitespace they leave behind
            sql = re.sub(r'```sql\n?', '', sql)
            sql = re.sub(r'```\n?', '', sql)
            
            return sql.strip()
            
        except Exception as e:
            print(f"❌ Error generating SQL: {e}")
            return "SELECT 'Error generating SQL' as error_message LIMIT 1;"
    
    def _validate_sql_syntax(self, sql: str) -> Optional[str]:
        """Quick validation of SQL syntax"""
        sql_upper = sql.upper().strip()
        
        # Check for common MySQL/PostgreSQL syntax issues
        if 'MONTH(' in sql_upper or 'YEAR(' in sql_upper:
            return "Invalid function: Use strftime() instead of MONTH()/YEAR()"
        
        if 'NOW()' in sql_upper and 'INTERVAL' in sql_upper:
            return "Invalid syntax: Use date('now', '-X days') instead of NOW() - INTERVAL"
        
        if not sql_upper.startswith('SELECT'):
            return "Query must start with SELECT"
        
        return None
    
    def _apply_safety_checks(self, sql: str, question: str) -> Tuple[str, List[str]]:
        """Apply safety checks and modifications to the generated SQL"""
        
        assumptions = []
        sql_upper = sql.upper().strip()
        
        # 1. Ensure it's a SELECT statement
        if not sql_upper.startswith('SELECT'):
            return "SELECT 'Error: Only SELECT queries are allowed' as error_message;", ["Query rejected - only SELECT allowed"]
        
        # 2. Check for forbidden keywords (whole words only, so a column
        #    name such as updated_at does not trigger a false rejection)
        forbidden = ['INSERT', 'UPDATE', 'DELETE', 'DROP', 'CREATE', 'ALTER', 'EXEC']
        for keyword in forbidden:
            if re.search(rf'\b{keyword}\b', sql_upper):
                return f"SELECT 'Error: {keyword} operations not allowed' as error_message;", [f"Query rejected - {keyword} not allowed"]
        
        # 3. Ensure LIMIT is present
        if not re.search(r'\bLIMIT\b', sql_upper):
            # Strip trailing whitespace first so the final ';' is actually removed
            sql = sql.strip().rstrip(';') + ' LIMIT 20;'
            assumptions.append("Added LIMIT 20 for performance")
        
        return sql, assumptions
    
    def _execute_query(self, sql: str) -> Tuple[pd.DataFrame, int]:
        """Execute SQL query and return results as DataFrame"""
        
        try:
            with self.engine.connect() as conn:
                result = conn.execute(text(sql))
                df = pd.DataFrame(result.fetchall(), columns=result.keys())
                row_count = len(df)
                
                print(f"📊 Query executed: {row_count} rows returned")
                return df, row_count
                
        except SQLAlchemyError as e:
            print(f"❌ Database error: {e}")
            error_df = pd.DataFrame({'error': [f"Database error: {str(e)}"]})
            return error_df, 0
        except Exception as e:
            print(f"❌ Execution error: {e}")
            error_df = pd.DataFrame({'error': [f"Execution error: {str(e)}"]})
            return error_df, 0
    
    def _generate_summary(self, question: str, sql: str, data: pd.DataFrame, assumptions: List[str]) -> str:
        """Generate natural language summary of query results"""
        
        if 'error' in data.columns:
            return f"Query failed: {data['error'].iloc[0]}"
        
        row_count = len(data)
        summary_stats = self._get_data_summary(data)
        
        prompt = f"""You are an HR analyst summarizing query results for a recruitment team.

Original Question: {question}
SQL Executed: {sql}
Rows Returned: {row_count}
Data Summary: {summary_stats}
Assumptions: {', '.join(assumptions) if assumptions else 'None'}

Write a 2-4 sentence professional summary that:
1. Describes what was analyzed
2. Mentions any assumptions made
3. Highlights key insights from the results
4. Uses clear language for HR staff

Summary:"""
        
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": "Generate the summary."}
                ],
                temperature=0.3,
                max_tokens=200
            )
            
            return response.choices[0].message.content.strip()
            
        except Exception as e:
            print(f"❌ Error generating summary: {e}")
            assumptions_text = f" (Assumptions: {', '.join(assumptions)})" if assumptions else ""
            return f"Query returned {row_count} rows{assumptions_text}. Review results for insights."
    
    def _get_data_summary(self, data: pd.DataFrame) -> str:
        """Get summary statistics for LLM context"""
        
        if data.empty:
            return "No data returned"
        
        stats = []
        
        # Numeric columns (salary, counts, etc.)
        numeric_cols = [col for col in data.columns if data[col].dtype in ['float64', 'int64']]
        for col in numeric_cols[:3]:  # First 3 numeric columns
            if 'salary' in col.lower() or 'bonus' in col.lower():
                total = data[col].sum()
                avg = data[col].mean()
                stats.append(f"{col} total: ${total:,.0f}, average: ${avg:,.0f}")
            else:
                total = data[col].sum()
                stats.append(f"{col} total: {total:,}")
        
        # Categorical columns
        categorical_cols = [col for col in data.columns if data[col].dtype == 'object']
        for col in categorical_cols[:2]:
            unique_count = data[col].nunique()
            stats.append(f"{col}: {unique_count} unique values")
        
        return "; ".join(stats) if stats else "Mixed data types"

# Initialize the agent
agent = HRTextToSQLAgent(engine, HR_SCHEMA)
print("🤖 HR Text-to-SQL Agent initialized and ready!")

🤖 HR Text-to-SQL Agent initialized and ready!


## Utility Functions for Display

In [5]:
def display_result(result: QueryResult, show_sql: bool = True):
    """Display query result in a formatted, professional way"""
    
    print("=" * 80)
    print("📊 HR DATABASE QUERY RESULT")
    print("=" * 80)
    
    print(f"\n🔍 Question:")
    print(f"   {result.original_question}")
    
    if show_sql:
        print(f"\n📝 Executed SQL:")
        print(f"   {result.executed_sql}")
    
    if result.assumptions_made:
        print(f"\n⚠️ Assumptions Made:")
        for assumption in result.assumptions_made:
            print(f"   • {assumption}")
    
    print(f"\n📊 Results ({result.row_count} rows):")
    if not result.data.empty and 'error' not in result.data.columns:
        pd.set_option('display.max_columns', None)
        pd.set_option('display.width', None)
        pd.set_option('display.max_colwidth', 30)
        print(result.data.to_string(index=False, max_rows=20))
    else:
        print("   No data returned or error occurred")
    
    print(f"\n💡 Summary:")
    print(f"   {result.summary}")
    
    print("=" * 80)

print("✅ Display utilities loaded")

✅ Display utilities loaded


## Demo: Natural Language to SQL for HR Analytics

Let's test our agent with typical HR recruitment questions. The database contains 200 candidates, 39 positions, 263 applications, 402 interviews, and 76 offers covering the last 180 days.

In [6]:
# Test Case 1: Top departments by hiring activity
print("🧪 Test Case 1: Hiring activity by department")
result1 = agent.process_question("Show me the top 5 departments by number of positions posted in the last 90 days", show_sql_answer=True)
display_result(result1, show_sql=False)

🧪 Test Case 1: Hiring activity by department
🔍 Processing: Show me the top 5 departments by number of positions posted in the last 90 days
📝 Generated SQL (attempt 1): SELECT d.department_name AS department, COUNT(p.position_id) AS positions_posted
FROM departments d
JOIN positions p ON d.department_id = p.department_id
WHERE p.posted_date >= date('now', '-90 days')
GROUP BY d.department_id
ORDER BY positions_posted DESC
LIMIT 5;
🛡️ Safe SQL: SELECT d.department_name AS department, COUNT(p.position_id) AS positions_posted
FROM departments d
JOIN positions p ON d.department_id = p.department_id
WHERE p.posted_date >= date('now', '-90 days')
GROUP BY d.department_id
ORDER BY positions_posted DESC
LIMIT 5;
📊 Query executed: 5 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Show me the top 5 departments by number of positions posted in the last 90 days

📊 Results (5 rows):
        department  positions_posted
         Marketing                 5
         

In [7]:
# Test Case 2: Candidate pipeline analysis
print("\n🧪 Test Case 2: Candidate pipeline by status")
result2 = agent.process_question("Count of applications by status, show me the funnel")
display_result(result2, show_sql=False)


🧪 Test Case 2: Candidate pipeline by status
🔍 Processing: Count of applications by status, show me the funnel
📊 Query executed: 7 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Count of applications by status, show me the funnel

📊 Results (7 rows):
      status  application_count
   Interview                 46
       Hired                 42
   Withdrawn                 38
     Applied                 37
Phone Screen                 34
       Offer                 34
    Rejected                 32

💡 Summary:
   The analysis focused on the count of job applications categorized by their current status, providing a clear view of the recruitment funnel. A total of 263 applications were analyzed, revealing 7 unique status values. No assumptions were made during this analysis. The results indicate a diverse range of application statuses, which can help the recruitment team identify areas for improvement in the hiring process.


In [8]:
# Test Case 3: Salary analysis
print("\n🧪 Test Case 3: Salary offers by department")
result3 = agent.process_question("Average salary offered by department, include number of offers made")
display_result(result3)


🧪 Test Case 3: Salary offers by department
🔍 Processing: Average salary offered by department, include number of offers made
📊 Query executed: 5 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Average salary offered by department, include number of offers made

📝 Executed SQL:
   SELECT d.department_name AS department, 
       AVG(o.salary_offered) AS average_salary_offered, 
       COUNT(o.offer_id) AS number_of_offers
FROM offers o
JOIN applications a ON o.application_id = a.application_id
JOIN positions p ON a.position_id = p.position_id
JOIN departments d ON p.department_id = d.department_id
GROUP BY d.department_name
ORDER BY d.department_name
LIMIT 20;

📊 Results (5 rows):
        department  average_salary_offered  number_of_offers
      Data Science           126683.230769                13
       Engineering           140305.000000                13
         Marketing           130349.136364                22
Product Management           192852.0

In [9]:
# Test Case 4: Interview performance
print("\n🧪 Test Case 4: Top rated candidates")
result4 = agent.process_question("Show candidates with average interview rating above 4, include their name, current company, and average rating")
display_result(result4)


🧪 Test Case 4: Top rated candidates
🔍 Processing: Show candidates with average interview rating above 4, include their name, current company, and average rating
📊 Query executed: 7 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Show candidates with average interview rating above 4, include their name, current company, and average rating

📝 Executed SQL:
   SELECT c.full_name AS candidate_name, c.current_company AS company, AVG(i.rating) AS average_rating
FROM candidates c
JOIN applications a ON c.candidate_id = a.candidate_id
JOIN interviews i ON a.application_id = i.application_id
GROUP BY c.candidate_id
HAVING AVG(i.rating) > 4
ORDER BY average_rating DESC
LIMIT 20;

📊 Results (7 rows):
 candidate_name    company  average_rating
  Skylar Thomas     GitHub        5.000000
    Sage Miller       None        5.000000
   Reese Wilson Salesforce        5.000000
    Ryan Wilson       None        4.666667
      Avery Lee     PayPal        4.333333
Skylar Gonza

In [10]:
# Test Case 5: Recruiting source effectiveness
print("\n🧪 Test Case 5: Recruiting source effectiveness")
result5 = agent.process_question("Which recruiting sources brought in candidates that got hired? Show source, number of hires, and conversion rate", show_sql_answer=True)
display_result(result5, show_sql=False)


🧪 Test Case 5: Recruiting source effectiveness
🔍 Processing: Which recruiting sources brought in candidates that got hired? Show source, number of hires, and conversion rate
📝 Generated SQL (attempt 1): SELECT 
    c.source AS recruiting_source, 
    COUNT(DISTINCT a.candidate_id) AS number_of_hires, 
    (COUNT(DISTINCT a.candidate_id) * 1.0 / COUNT(DISTINCT a.application_id)) * 100 AS conversion_rate
FROM 
    candidates c
JOIN 
    applications a ON c.candidate_id = a.candidate_id
JOIN 
    offers o ON a.application_id = o.application_id
WHERE 
    o.status = 'hired'
GROUP BY 
    c.source
ORDER BY 
    number_of_hires DESC
LIMIT 20;
🛡️ Safe SQL: SELECT 
    c.source AS recruiting_source, 
    COUNT(DISTINCT a.candidate_id) AS number_of_hires, 
    (COUNT(DISTINCT a.candidate_id) * 1.0 / COUNT(DISTINCT a.application_id)) * 100 AS conversion_rate
FROM 
    candidates c
JOIN 
    applications a ON c.candidate_id = a.candidate_id
JOIN 
    offers o ON a.application_id = 

## Safety Validation Tests

Let's verify our safety guardrails prevent dangerous operations:
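
The tests in this section go through the LLM, which may rewrite a destructive request into a harmless SELECT before any guardrail sees it. To exercise the checks themselves, hand-written SQL can be passed to a validator directly. The sketch below is a standalone, hypothetical `guardrail_violation` function, not the agent's `_apply_safety_checks`. It assumes whole-word keyword matching and a naive table whitelist built from identifiers that follow `FROM` or `JOIN`, and the column `updated_at` is made up for the false-positive check.

```python
import re
from typing import Optional, Set

# Table names taken from HR_SCHEMA in this notebook
ALLOWED_TABLES = {"departments", "positions", "candidates",
                  "applications", "interviews", "offers"}
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "DROP", "CREATE", "ALTER", "EXEC")


def guardrail_violation(sql: str, allowed: Set[str] = ALLOWED_TABLES) -> Optional[str]:
    """Return a reason string if the SQL should be rejected, else None."""
    stripped = sql.strip()
    if not stripped.upper().startswith("SELECT"):
        return "only SELECT statements are allowed"
    for keyword in FORBIDDEN:
        # \b keeps a column such as updated_at from matching UPDATE
        if re.search(rf"\b{keyword}\b", stripped, flags=re.IGNORECASE):
            return f"{keyword} is not allowed"
    # Naive extraction: identifiers that directly follow FROM or JOIN
    tables = re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", stripped, flags=re.IGNORECASE
    )
    unknown = {t.lower() for t in tables} - allowed
    if unknown:
        return f"unknown table(s): {', '.join(sorted(unknown))}"
    return None


for statement in (
    "DELETE FROM applications WHERE status = 'Rejected'",
    "SELECT * FROM applications; DROP TABLE offers",
    "SELECT updated_at FROM applications LIMIT 5",
    "SELECT name FROM sqlite_master LIMIT 5",
):
    print(f"{statement!r} -> {guardrail_violation(statement)}")
```

Regex-based table extraction misses cases such as quoted identifiers or comma-separated table lists, so this is a way to test the idea rather than a complete whitelist.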

In [11]:
# Safety Test 1: Attempt forbidden operations
print("🛡️ Safety Test 1: Attempt to DELETE data")
safety_result1 = agent.process_question("Delete all rejected applications")
display_result(safety_result1)

🛡️ Safety Test 1: Attempt to DELETE data
🔍 Processing: Delete all rejected applications
📊 Query executed: 0 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Delete all rejected applications

📝 Executed SQL:
   SELECT * FROM applications WHERE status = 'rejected' LIMIT 20;

📊 Results (0 rows):
   No data returned or error occurred

💡 Summary:
   The analysis focused on identifying rejected applications within the recruitment database. No assumptions were made during this query. The results indicated that there were no rejected applications present, as the SQL query returned zero rows. This suggests that the current application pool does not contain any candidates who have been marked as rejected.


In [12]:
# Safety Test 2: Missing LIMIT clause
print("\n🛡️ Safety Test 2: Missing LIMIT - should add default")
safety_result2 = agent.process_question("Show all candidates who applied to Engineering positions")
display_result(safety_result2)


🛡️ Safety Test 2: Missing LIMIT - should add default
🔍 Processing: Show all candidates who applied to Engineering positions
📊 Query executed: 20 rows returned
📊 HR DATABASE QUERY RESULT

🔍 Question:
   Show all candidates who applied to Engineering positions

📝 Executed SQL:
   SELECT c.candidate_id AS CandidateID, c.full_name AS FullName, c.email AS Email, c.phone AS Phone, c.years_experience AS YearsExperience, c.current_company AS CurrentCompany, c.source AS Source
FROM candidates c
JOIN applications a ON c.candidate_id = a.candidate_id
JOIN positions p ON a.position_id = p.position_id
JOIN departments d ON p.department_id = d.department_id
WHERE d.department_name = 'Engineering'
LIMIT 20;

📊 Results (20 rows):
 CandidateID         FullName                        Email           Phone  YearsExperience CurrentCompany          Source
         129      River Smith      river.smith@example.com +1-555-483-5752               14           Meta       Recruiter
      

## Key Learning Points

### ✅ **Core Features Demonstrated**

1. **Natural Language Processing**: Converts plain English questions to SQL using LLMs
2. **Safety Guardrails**: Enforces read-only operations and row limits in code, and restricts tables through the prompt
3. **Schema-Aware Processing**: Supplies tables, relationships, and time columns to the LLM
4. **Professional Summaries**: Generates clear explanations suitable for HR teams
5. **Database Integration**: SQLAlchemy support for multiple database types

### 🛡️ **Security & Safety Measures**

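- **Query Validation**: Rejects write and schema-changing statements (INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, EXEC)
- **Table Whitelisting**: The prompt limits the LLM to the approved schema tables; the demo code does not verify table names after generation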
- **Automatic Limits**: Adds LIMIT 20 to prevent large result sets
- **Error Handling**: Graceful failure with informative error messages
- **Retry Logic**: Attempts to fix errors with LLM feedback

### 🏗️ **Architecture Highlights**

- **Modular Design**: Separate classes for schema, results, and agent logic
- **Schema-Aware**: Passes tables, relationships, and time columns to the LLM as prompt context
- **Query History**: Tracks all processed queries for auditing
- **Assumption Tracking**: Records when defaults are applied
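
As a rough illustration of how the query history could feed an audit trail, the sketch below flattens history entries into CSV text. `HistoryEntry` is a hypothetical stand-in that carries a subset of `QueryResult`'s fields, `history_to_csv` is not part of the notebook, and the sample entry uses illustrative values.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class HistoryEntry:
    """Hypothetical stand-in with a subset of QueryResult's fields."""
    original_question: str
    executed_sql: str
    row_count: int
    assumptions_made: List[str] = field(default_factory=list)


def history_to_csv(history: List[HistoryEntry]) -> str:
    """Write one CSV row per processed question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question", "executed_sql", "row_count", "assumptions"])
    for entry in history:
        writer.writerow([
            entry.original_question,
            entry.executed_sql,
            entry.row_count,
            "; ".join(entry.assumptions_made),
        ])
    return buffer.getvalue()


# Illustrative entry; real entries would come from agent.query_history
sample_history = [
    HistoryEntry(
        original_question="Count of applications by status",
        executed_sql="SELECT status, COUNT(*) FROM applications GROUP BY status LIMIT 20;",
        row_count=7,
        assumptions_made=["Added LIMIT 20 for performance"],
    )
]
csv_text = history_to_csv(sample_history)
print(csv_text)
```

Since the function only reads four attributes, the `QueryResult` objects stored in `agent.query_history` should work with it unchanged.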

### 💡 **Applications to Other Domains**

This pattern extends to:
- **Finance Operations** (expense tracking, budget analysis)
- **Sales Analytics** (pipeline metrics, revenue forecasting)
- **Customer Support** (ticket analysis, response times)
- **Inventory Management** (stock levels, reorder points)
- **Healthcare Analytics** (patient data, appointment scheduling)