# Chapter 3: Advanced SQL

This chapter covers advanced SQL concepts for complex data analysis, optimization, and sophisticated querying techniques.

## Topics Covered:
1. Subqueries and Correlated Queries
2. Common Table Expressions (CTEs)
3. Advanced Window Functions
4. Advanced Data Manipulation
5. Query Optimization Techniques
6. Real-world Analysis Scenarios

## Prerequisites

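Complete Chapters 1 and 2 first. The examples below query the `employees`, `departments`, and `projects` tables in `my_database.db`; the setup cell in this chapter only adds the `employee_performance` table.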

In [1]:
import sqlite3

import pandas as pd
from IPython.display import display

# Connect to the database
conn = sqlite3.connect('my_database.db')
cursor = conn.cursor()

# Add some additional sample data for advanced examples
cursor.execute('''
CREATE TABLE IF NOT EXISTS employee_performance (
    emp_id INTEGER,
    year INTEGER,
    performance_score DECIMAL(3,2),
    bonus DECIMAL(10,2),
    PRIMARY KEY (emp_id, year),
    FOREIGN KEY (emp_id) REFERENCES employees (emp_id)
)
''')

# Insert performance data
performance_data = [
    (1, 2022, 4.2, 5000), (1, 2023, 4.5, 6000),
    (2, 2022, 4.8, 8000), (2, 2023, 4.7, 7500),
    (3, 2022, 3.9, 3500), (3, 2023, 4.1, 4000),
    (4, 2022, 4.6, 7000), (4, 2023, 4.8, 8500),
    (5, 2022, 4.4, 6500), (5, 2023, 4.3, 6000),
    (6, 2022, 4.0, 4500), (6, 2023, 4.2, 5000),
    (7, 2022, 4.7, 7500), (7, 2023, 4.9, 9000),
    (8, 2022, 4.1, 4000), (8, 2023, 4.3, 4500)
]

cursor.executemany('INSERT OR REPLACE INTO employee_performance VALUES (?, ?, ?, ?)', performance_data)
conn.commit()

print("Advanced examples database setup complete!")

Advanced examples database setup complete!


## 1. Subqueries and Correlated Queries

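A subquery is a query nested inside another query, so the outer query can use its result. A correlated subquery also references columns of the outer query, which means it is logically evaluated once per outer row.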

In [2]:
# Example 1: Simple subquery - Employees earning above average
print("Employees earning above company average:")
df = pd.read_sql_query("""
SELECT 
    e.first_name,
    e.last_name,
    e.salary,
    d.dept_name,
    (SELECT AVG(salary) FROM employees) as company_avg
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > (SELECT AVG(salary) FROM employees)
ORDER BY e.salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Correlated subquery - Employees earning above their department average
print("Employees earning above their department average:")
df = pd.read_sql_query("""
SELECT 
    e.first_name,
    e.last_name,
    e.salary,
    d.dept_name,
    (SELECT AVG(e2.salary) 
     FROM employees e2 
     WHERE e2.dept_id = e.dept_id) as dept_avg
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > (
    SELECT AVG(e2.salary) 
    FROM employees e2 
    WHERE e2.dept_id = e.dept_id
)
ORDER BY d.dept_name, e.salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: EXISTS clause - Departments with high-budget projects
print("Departments that have projects with budget > 200,000:")
df = pd.read_sql_query("""
SELECT d.dept_name, d.location
FROM departments d
WHERE EXISTS (
    SELECT 1 
    FROM projects p 
    WHERE p.dept_id = d.dept_id 
    AND p.budget > 200000
)
""", conn)
display(df)

Employees earning above company average:


Unnamed: 0,first_name,last_name,salary,dept_name,company_avg
0,David,Brown,95000,Sales,79625.0
1,Sarah,Williams,90000,Human Resources,79625.0
2,Robert,Wilson,88000,Engineering,79625.0
3,Jane,Smith,82000,Engineering,79625.0




Employees earning above their department average:


Unnamed: 0,first_name,last_name,salary,dept_name,dept_avg
0,Robert,Wilson,88000,Engineering,81666.666667
1,Jane,Smith,82000,Engineering,81666.666667
2,Lisa,Anderson,72000,Marketing,68500.0




Departments that have projects with budget > 200,000:


Unnamed: 0,dept_name,location
0,Engineering,San Francisco
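
Example 2 writes the same correlated subquery twice, once in the SELECT list and once in WHERE. One alternative, sketched here on a throwaway in-memory database, attaches the department average to every row with a window function inside a CTE and filters in the outer query. The `staff` table and its values are made up for illustration and are not the chapter's `employees` data.

```python
import sqlite3

# Hypothetical in-memory table, for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ana", "Eng", 90), ("Ben", "Eng", 70), ("Cy", "Ops", 50), ("Di", "Ops", 60)],
)

# Correlated subquery: the inner query depends on the outer row's department
correlated = demo.execute("""
    SELECT s.name
    FROM staff s
    WHERE s.salary > (SELECT AVG(s2.salary) FROM staff s2 WHERE s2.dept = s.dept)
    ORDER BY s.name
""").fetchall()

# Window-function version: compute the average once per partition, then filter
windowed = demo.execute("""
    WITH with_avg AS (
        SELECT name, salary, AVG(salary) OVER (PARTITION BY dept) AS dept_avg
        FROM staff
    )
    SELECT name FROM with_avg WHERE salary > dept_avg ORDER BY name
""").fetchall()

print(correlated)
print(windowed)
demo.close()
```

Both queries return the same names. Which form runs faster depends on the data and the engine, so it is worth checking the query plan rather than assuming.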


## 2. Common Table Expressions (CTEs)

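A CTE (`WITH name AS (...)`) gives an intermediate result a name, so a complex query can be read top to bottom and the named result can be referenced more than once within the same statement.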

In [3]:
# Example 1: Basic CTE - Department statistics
print("Department analysis using CTE:")
df = pd.read_sql_query("""
WITH dept_stats AS (
    SELECT 
        d.dept_id,
        d.dept_name,
        COUNT(e.emp_id) as employee_count,
        AVG(e.salary) as avg_salary,
        SUM(e.salary) as total_salary
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_id, d.dept_name
)
SELECT 
    dept_name,
    employee_count,
    ROUND(avg_salary, 2) as avg_salary,
    total_salary,
    -- 100.0 forces floating-point division; integer / integer would truncate to 0
    ROUND(total_salary * 100.0 / (SELECT SUM(total_salary) FROM dept_stats), 2) as salary_percentage
FROM dept_stats
ORDER BY total_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Chained CTEs - Performance trends compared with the company average
print("Performance trends using CTE:")
df = pd.read_sql_query("""
WITH performance_trends AS (
    SELECT 
        e.emp_id,
        e.first_name || ' ' || e.last_name as full_name,
        d.dept_name,
        AVG(p.performance_score) as avg_performance,
        SUM(p.bonus) as total_bonus,
        COUNT(p.year) as years_reviewed
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    INNER JOIN employee_performance p ON e.emp_id = p.emp_id
    GROUP BY e.emp_id, e.first_name, e.last_name, d.dept_name
),
company_stats AS (
    SELECT 
        AVG(avg_performance) as company_avg_performance,
        AVG(total_bonus) as company_avg_bonus
    FROM performance_trends
)
SELECT 
    pt.full_name,
    pt.dept_name,
    ROUND(pt.avg_performance, 2) as avg_performance,
    pt.total_bonus,
    ROUND(cs.company_avg_performance, 2) as company_avg_performance,
    CASE 
        WHEN pt.avg_performance > cs.company_avg_performance THEN 'Above Average'
        ELSE 'Below Average'
    END as performance_rating
FROM performance_trends pt
CROSS JOIN company_stats cs
ORDER BY pt.avg_performance DESC
""", conn)
display(df)

Department analysis using CTE:


Unnamed: 0,dept_name,employee_count,avg_salary,total_salary,salary_percentage
0,Engineering,3,81666.67,245000,38.46
1,Marketing,2,68500.0,137000,21.51
2,Sales,1,95000.0,95000,14.91
3,Human Resources,1,90000.0,90000,14.13
4,Finance,1,70000.0,70000,10.99




Performance trends using CTE:


Unnamed: 0,full_name,dept_name,avg_performance,total_bonus,company_avg_performance,performance_rating
0,Robert Wilson,Engineering,4.8,16500,4.41,Above Average
1,Jane Smith,Engineering,4.75,15500,4.41,Above Average
2,Sarah Williams,Human Resources,4.7,15500,4.41,Above Average
3,John Doe,Engineering,4.35,11000,4.41,Below Average
4,David Brown,Sales,4.35,12500,4.41,Below Average
5,Lisa Anderson,Marketing,4.2,8500,4.41,Below Average
6,Emily Davis,Finance,4.1,9500,4.41,Below Average
7,Mike Johnson,Marketing,4.0,7500,4.41,Below Average
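
The CTEs above are all non-recursive. SQLite also supports `WITH RECURSIVE`, which is the usual way to walk a hierarchy such as a reporting chain. The schema shown in this chapter has no manager column, so the sketch below invents a small `org` table with a hypothetical `manager_id` column in an in-memory database.

```python
import sqlite3

# Hypothetical reporting structure, for illustration only
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE org (emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
demo.executemany("INSERT INTO org VALUES (?, ?, ?)", [
    (1, "CEO", None), (2, "VP Eng", 1), (3, "VP Sales", 1), (4, "Engineer", 2),
])

hierarchy = demo.execute("""
    WITH RECURSIVE chain(emp_id, name, depth) AS (
        -- Anchor member: employees with no manager
        SELECT emp_id, name, 0 FROM org WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: direct reports of rows already in the chain
        SELECT o.emp_id, o.name, c.depth + 1
        FROM org o
        JOIN chain c ON o.manager_id = c.emp_id
    )
    SELECT name, depth FROM chain ORDER BY depth, emp_id
""").fetchall()

for name, depth in hierarchy:
    print("  " * depth + name)
demo.close()
```

The recursion stops on its own once a pass finds no further reports. A cycle in `manager_id` would make it loop, so real data may need a depth limit.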


## 3. Advanced Window Functions

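Window functions perform calculations across a set of rows related to the current row. Unlike `GROUP BY`, they do not collapse those rows: every input row stays in the result and gains the computed value.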

In [4]:
# Example 1: Ranking functions
print("Employee rankings across multiple dimensions:")
df = pd.read_sql_query("""
SELECT 
    e.first_name || ' ' || e.last_name as full_name,
    d.dept_name,
    e.salary,
    ROW_NUMBER() OVER (ORDER BY e.salary DESC) as overall_rank,
    RANK() OVER (PARTITION BY d.dept_name ORDER BY e.salary DESC) as dept_rank,
    DENSE_RANK() OVER (ORDER BY e.salary DESC) as dense_rank,
    NTILE(3) OVER (ORDER BY e.salary DESC) as salary_tier
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
ORDER BY e.salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Analytical functions with performance data
print("Performance analysis with trends:")
df = pd.read_sql_query("""
SELECT 
    e.first_name || ' ' || e.last_name as full_name,
    d.dept_name,
    p.year,
    p.performance_score,
    LAG(p.performance_score) OVER (PARTITION BY e.emp_id ORDER BY p.year) as prev_score,
    p.performance_score - LAG(p.performance_score) OVER (PARTITION BY e.emp_id ORDER BY p.year) as score_change,
    AVG(p.performance_score) OVER (PARTITION BY d.dept_name) as dept_avg_performance,
    MAX(p.performance_score) OVER (PARTITION BY d.dept_name) as dept_max_performance
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
INNER JOIN employee_performance p ON e.emp_id = p.emp_id
ORDER BY d.dept_name, e.last_name, p.year
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Running totals and comparison against the department average
print("Financial analysis with running totals:")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    p.project_name,
    p.budget,
    SUM(p.budget) OVER (PARTITION BY d.dept_name ORDER BY p.budget) as running_total,
    AVG(p.budget) OVER (PARTITION BY d.dept_name) as dept_avg_budget,
    p.budget - AVG(p.budget) OVER (PARTITION BY d.dept_name) as budget_vs_avg,
    PERCENT_RANK() OVER (ORDER BY p.budget) as budget_percentile
FROM projects p
INNER JOIN departments d ON p.dept_id = d.dept_id
ORDER BY d.dept_name, p.budget
""", conn)
display(df)

Employee rankings across multiple dimensions:


Unnamed: 0,full_name,dept_name,salary,overall_rank,dept_rank,dense_rank,salary_tier
0,David Brown,Sales,95000,1,1,1,1
1,Sarah Williams,Human Resources,90000,2,1,2,1
2,Robert Wilson,Engineering,88000,3,1,3,1
3,Jane Smith,Engineering,82000,4,2,4,2
4,John Doe,Engineering,75000,5,3,5,2
5,Lisa Anderson,Marketing,72000,6,1,6,2
6,Emily Davis,Finance,70000,7,1,7,3
7,Mike Johnson,Marketing,65000,8,2,8,3




Performance analysis with trends:


Unnamed: 0,full_name,dept_name,year,performance_score,prev_score,score_change,dept_avg_performance,dept_max_performance
0,John Doe,Engineering,2022,4.2,,,4.633333,4.9
1,John Doe,Engineering,2023,4.5,4.2,0.3,4.633333,4.9
2,Jane Smith,Engineering,2022,4.8,,,4.633333,4.9
3,Jane Smith,Engineering,2023,4.7,4.8,-0.1,4.633333,4.9
4,Robert Wilson,Engineering,2022,4.7,,,4.633333,4.9
5,Robert Wilson,Engineering,2023,4.9,4.7,0.2,4.633333,4.9
6,Emily Davis,Finance,2022,4.0,,,4.1,4.2
7,Emily Davis,Finance,2023,4.2,4.0,0.2,4.1,4.2
8,Sarah Williams,Human Resources,2022,4.6,,,4.7,4.8
9,Sarah Williams,Human Resources,2023,4.8,4.6,0.2,4.7,4.8




Financial analysis with running totals:


Unnamed: 0,dept_name,project_name,budget,running_total,dept_avg_budget,budget_vs_avg,budget_percentile
0,Engineering,Website Redesign,150000,150000,225000.0,-75000.0,0.5
1,Engineering,Mobile App Development,300000,450000,225000.0,75000.0,1.0
2,Finance,Financial System Upgrade,200000,200000,200000.0,0.0,0.75
3,Marketing,Marketing Campaign Q2,75000,75000,75000.0,0.0,0.25
4,Sales,Sales Training Program,50000,50000,50000.0,0.0,0.0
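
The running total in Example 3 relies on the default window frame. An explicit frame clause gives finer control, for instance a moving average over the current row and the one before it. The sketch below assumes a hypothetical `monthly_sales` table built in memory; window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical monthly figures, for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE monthly_sales (month INTEGER, amount INTEGER)")
demo.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 300), (4, 400)],
)

rows = demo.execute("""
    SELECT
        month,
        amount,
        -- Running total: everything from the first row up to this one
        SUM(amount) OVER (
            ORDER BY month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total,
        -- Two-row moving average: previous row and current row
        AVG(amount) OVER (
            ORDER BY month
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS moving_avg_2
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
demo.close()
```

When `ORDER BY` is given without a frame, the default is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which treats rows with equal sort keys as peers and gives them the same running total. Spelling out `ROWS` avoids that surprise when the sort key has ties.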


## 4. Advanced Data Manipulation

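This section uses conditional aggregation (`CASE` inside `SUM` and `AVG`) to pivot rows into columns, combines several metrics per department, and compares values across years with `LAG`.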

In [5]:
# Example 1: Pivot-like analysis using CASE statements
print("Employee distribution by salary ranges:")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) as total_employees,  -- COUNT(*) would report 1 for a department with no employees
    SUM(CASE WHEN e.salary < 70000 THEN 1 ELSE 0 END) as below_70k,
    SUM(CASE WHEN e.salary BETWEEN 70000 AND 85000 THEN 1 ELSE 0 END) as mid_range,
    SUM(CASE WHEN e.salary > 85000 THEN 1 ELSE 0 END) as above_85k,
    ROUND(
        AVG(CASE WHEN e.salary > 85000 THEN e.salary END), 2
    ) as avg_high_earner_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY total_employees DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

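# Example 2: Complex conditional aggregation
# Caution: this query joins employees, projects, and performance rows in one pass,
# so each project budget is summed once per employee-year and total_budget is inflated
# (Engineering shows 2,700,000 although its two projects total 450,000).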
print("Department performance matrix:")
df = pd.read_sql_query("""
WITH dept_metrics AS (
    SELECT 
        d.dept_name,
        COUNT(DISTINCT e.emp_id) as employee_count,
        COUNT(DISTINCT pr.project_id) as project_count,
        AVG(e.salary) as avg_salary,
        COALESCE(SUM(pr.budget), 0) as total_budget,
        AVG(perf.performance_score) as avg_performance
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    LEFT JOIN projects pr ON d.dept_id = pr.dept_id
    LEFT JOIN employee_performance perf ON e.emp_id = perf.emp_id
    GROUP BY d.dept_name
)
SELECT 
    dept_name,
    employee_count,
    project_count,
    ROUND(avg_salary, 2) as avg_salary,
    total_budget,
    ROUND(avg_performance, 2) as avg_performance,
    CASE 
        WHEN avg_performance > 4.5 AND avg_salary > 80000 THEN 'High Performance, High Cost'
        WHEN avg_performance > 4.5 THEN 'High Performance, Standard Cost'
        WHEN avg_salary > 80000 THEN 'High Cost, Standard Performance'
        ELSE 'Standard'
    END as department_profile,
    ROUND(total_budget / NULLIF(employee_count, 0), 2) as budget_per_employee
FROM dept_metrics
ORDER BY avg_performance DESC, avg_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Time-based analysis
print("Year-over-year performance comparison:")
df = pd.read_sql_query("""
WITH yearly_performance AS (
    SELECT 
        e.emp_id,
        e.first_name || ' ' || e.last_name as full_name,
        d.dept_name,
        p.year,
        p.performance_score,
        p.bonus,
        LAG(p.performance_score) OVER (PARTITION BY e.emp_id ORDER BY p.year) as prev_year_score,
        LAG(p.bonus) OVER (PARTITION BY e.emp_id ORDER BY p.year) as prev_year_bonus
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    INNER JOIN employee_performance p ON e.emp_id = p.emp_id
)
SELECT 
    full_name,
    dept_name,
    year,
    performance_score as current_score,
    prev_year_score,
    ROUND(performance_score - prev_year_score, 2) as score_improvement,  -- NULL when there is no prior year
    bonus as current_bonus,
    prev_year_bonus,
    bonus - prev_year_bonus as bonus_change,
    CASE 
        WHEN prev_year_score IS NULL THEN 'No Prior Year'
        WHEN performance_score > prev_year_score THEN 'Improved'
        WHEN performance_score = prev_year_score THEN 'Stable'
        ELSE 'Declined'
    END as performance_trend
FROM yearly_performance
WHERE year = 2023  -- Focus on 2023 with comparison to 2022
ORDER BY score_improvement DESC
""", conn)
display(df)

Employee distribution by salary ranges:


Unnamed: 0,dept_name,total_employees,below_70k,mid_range,above_85k,avg_high_earner_salary
0,Engineering,3,0,2,1,88000.0
1,Marketing,2,1,1,0,
2,Sales,1,0,0,1,95000.0
3,Human Resources,1,0,0,1,90000.0
4,Finance,1,0,1,0,




Department performance matrix:


Unnamed: 0,dept_name,employee_count,project_count,avg_salary,total_budget,avg_performance,department_profile,budget_per_employee
0,Human Resources,1,0,90000.0,0,4.7,"High Performance, High Cost",0.0
1,Engineering,3,2,81666.67,2700000,4.63,"High Performance, High Cost",900000.0
2,Sales,1,1,95000.0,100000,4.35,"High Cost, Standard Performance",100000.0
3,Finance,1,1,70000.0,400000,4.1,Standard,400000.0
4,Marketing,2,1,68500.0,300000,4.1,Standard,150000.0




Year-over-year performance comparison:


Unnamed: 0,full_name,dept_name,year,current_score,prev_year_score,score_improvement,current_bonus,prev_year_bonus,bonus_change,performance_trend
0,John Doe,Engineering,2023,4.5,4.2,0.3,6000,5000,1000,Improved
1,Mike Johnson,Marketing,2023,4.1,3.9,0.2,4000,3500,500,Improved
2,Sarah Williams,Human Resources,2023,4.8,4.6,0.2,8500,7000,1500,Improved
3,Emily Davis,Finance,2023,4.2,4.0,0.2,5000,4500,500,Improved
4,Robert Wilson,Engineering,2023,4.9,4.7,0.2,9000,7500,1500,Improved
5,Lisa Anderson,Marketing,2023,4.3,4.1,0.2,4500,4000,500,Improved
6,Jane Smith,Engineering,2023,4.7,4.8,-0.1,7500,8000,-500,Declined
7,David Brown,Sales,2023,4.3,4.4,-0.1,6000,6500,-500,Declined
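
In the department performance matrix, Engineering's `total_budget` of 2,700,000 is six times the 450,000 that its two projects add up to in Section 3. That is what join fan-out would produce: joining employees, projects, and performance rows in one pass repeats every project once per employee-year. A common remedy is to aggregate each child table to one row per department before joining. The sketch below reproduces the effect on a minimal in-memory schema; the table names (`dept`, `emp`, `proj`) and rows are illustrative assumptions, not the chapter's database.

```python
import sqlite3

# Minimal hypothetical schema: one department, three employees, two projects
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    CREATE TABLE proj (project_id INTEGER PRIMARY KEY, dept_id INTEGER, budget INTEGER);
    INSERT INTO dept VALUES (1, 'Engineering');
    INSERT INTO emp  VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO proj VALUES (1, 1, 150000), (2, 1, 300000);
""")

# One-pass join: 3 employees x 2 projects = 6 rows, so each budget is summed 3 times
inflated = demo.execute("""
    SELECT SUM(p.budget)
    FROM dept d
    LEFT JOIN emp e ON e.dept_id = d.dept_id
    LEFT JOIN proj p ON p.dept_id = d.dept_id
""").fetchone()[0]

# Pre-aggregate each child table to one row per department, then join
correct = demo.execute("""
    WITH emp_agg AS (
        SELECT dept_id, COUNT(*) AS employee_count FROM emp GROUP BY dept_id
    ),
    proj_agg AS (
        SELECT dept_id, SUM(budget) AS total_budget FROM proj GROUP BY dept_id
    )
    SELECT d.dept_name,
           COALESCE(ea.employee_count, 0),
           COALESCE(pa.total_budget, 0)
    FROM dept d
    LEFT JOIN emp_agg ea ON ea.dept_id = d.dept_id
    LEFT JOIN proj_agg pa ON pa.dept_id = d.dept_id
""").fetchone()

print("one-pass join:", inflated)
print("pre-aggregated:", correct)
demo.close()
```

`COUNT(DISTINCT ...)` hides the fan-out for counts, which is why the employee and project counts in the matrix look right while the sums do not.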


## 5. Query Optimization Techniques

This section reads a query plan with `EXPLAIN QUERY PLAN`, compares `EXISTS` with an equivalent `INNER JOIN`, and limits results per group with a window function.

In [6]:
# Example 1: Inspecting the query plan with EXPLAIN QUERY PLAN (this cell does not create an index)
print("Query plan analysis:")
df = pd.read_sql_query("""
EXPLAIN QUERY PLAN
SELECT e.first_name, e.last_name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > 80000
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Two ways to write a semi-join - EXISTS vs. INNER JOIN with DISTINCT
print("Optimized query - using EXISTS instead of IN:")

# More efficient approach
df1 = pd.read_sql_query("""
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
    SELECT 1 FROM employees e 
    WHERE e.dept_id = d.dept_id AND e.salary > 85000
)
""", conn)
print("Departments with high earners (using EXISTS):")
display(df1)

# Alternative approach (for comparison)
df2 = pd.read_sql_query("""
SELECT DISTINCT d.dept_name
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
WHERE e.salary > 85000
""", conn)
print("\nSame result using INNER JOIN:")
display(df2)

print("\n" + "="*50 + "\n")

# Example 3: Top-N per group with RANK() (a plain LIMIT cannot cap rows per department)
print("Top performers by department (limited results):")
df = pd.read_sql_query("""
WITH ranked_employees AS (
    SELECT 
        e.first_name || ' ' || e.last_name as full_name,
        d.dept_name,
        e.salary,
        RANK() OVER (PARTITION BY d.dept_name ORDER BY e.salary DESC) as rank
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
)
SELECT full_name, dept_name, salary
FROM ranked_employees
WHERE rank <= 2  -- Top 2 per department
ORDER BY dept_name, rank
""", conn)
display(df)

Query plan analysis:


Unnamed: 0,id,parent,notused,detail
0,3,0,0,SCAN e
1,7,0,0,SEARCH d USING INTEGER PRIMARY KEY (rowid=?)




Optimized query - using EXISTS instead of IN:
Departments with high earners (using EXISTS):


Unnamed: 0,dept_name
0,Human Resources
1,Engineering
2,Sales



Same result using INNER JOIN:


Unnamed: 0,dept_name
0,Human Resources
1,Sales
2,Engineering




Top performers by department (limited results):


Unnamed: 0,full_name,dept_name,salary
0,Robert Wilson,Engineering,88000
1,Jane Smith,Engineering,82000
2,Emily Davis,Finance,70000
3,Sarah Williams,Human Resources,90000
4,Lisa Anderson,Marketing,72000
5,Mike Johnson,Marketing,65000
6,David Brown,Sales,95000


## 6. Real-world Analysis Scenarios

Apply advanced SQL techniques to solve complex business questions.

In [7]:
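# Scenario 1: Executive Dashboard Query
# Caution: company_metrics joins employees, projects, and performance rows in one pass,
# so total_payroll and total_project_budget count each salary and budget several times.
# top_performers counts review rows above 4.5 (employee-years), not distinct employees.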
print("Executive Dashboard - Company Overview:")
df = pd.read_sql_query("""
WITH company_metrics AS (
    SELECT 
        COUNT(DISTINCT d.dept_id) as total_departments,
        COUNT(DISTINCT e.emp_id) as total_employees,
        COUNT(DISTINCT p.project_id) as total_projects,
        SUM(e.salary) as total_payroll,
        SUM(p.budget) as total_project_budget,
        AVG(perf.performance_score) as avg_company_performance
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    LEFT JOIN projects p ON d.dept_id = p.dept_id
    LEFT JOIN employee_performance perf ON e.emp_id = perf.emp_id
),
top_performers AS (
    SELECT COUNT(*) as high_performers
    FROM employees e
    INNER JOIN employee_performance perf ON e.emp_id = perf.emp_id
    WHERE perf.performance_score > 4.5
),
dept_efficiency AS (
    SELECT 
        AVG(total_budget / NULLIF(employee_count, 0)) as avg_budget_per_employee
    FROM (
        SELECT 
            d.dept_id,
            COUNT(e.emp_id) as employee_count,
            COALESCE(SUM(p.budget), 0) as total_budget
        FROM departments d
        LEFT JOIN employees e ON d.dept_id = e.dept_id
        LEFT JOIN projects p ON d.dept_id = p.dept_id
        GROUP BY d.dept_id
    )
)
SELECT 
    cm.total_departments,
    cm.total_employees,
    cm.total_projects,
    cm.total_payroll,
    cm.total_project_budget,
    cm.total_payroll + cm.total_project_budget as total_investment,
    ROUND(cm.avg_company_performance, 2) as avg_performance,
    tp.high_performers,
    ROUND(tp.high_performers * 100.0 / cm.total_employees, 1) as high_performer_percentage,
    ROUND(de.avg_budget_per_employee, 2) as avg_budget_per_employee
FROM company_metrics cm
CROSS JOIN top_performers tp
CROSS JOIN dept_efficiency de
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Scenario 2: Talent Management Analysis
print("Talent Management - Risk Assessment:")
df = pd.read_sql_query("""
WITH employee_analysis AS (
    SELECT 
        e.emp_id,
        e.first_name || ' ' || e.last_name as full_name,
        d.dept_name,
        e.salary,
        e.hire_date,
        AVG(perf.performance_score) as avg_performance,
        SUM(perf.bonus) as total_bonus,
        CASE 
            WHEN DATE('now') > DATE(e.hire_date, '+5 years') THEN 'Senior'
            WHEN DATE('now') > DATE(e.hire_date, '+2 years') THEN 'Mid-level'
            ELSE 'Junior'
        END as tenure_level,
        CASE 
            WHEN e.salary > (SELECT AVG(salary) * 1.2 FROM employees) THEN 'High'
            WHEN e.salary > (SELECT AVG(salary) * 0.8 FROM employees) THEN 'Medium'
            ELSE 'Low'
        END as compensation_level
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    LEFT JOIN employee_performance perf ON e.emp_id = perf.emp_id
    GROUP BY e.emp_id, e.first_name, e.last_name, d.dept_name, e.salary, e.hire_date
)
SELECT 
    full_name,
    dept_name,
    tenure_level,
    compensation_level,
    ROUND(avg_performance, 2) as avg_performance,
    total_bonus,
    CASE 
        WHEN avg_performance > 4.5 AND compensation_level = 'Low' THEN 'Flight Risk - High Performer, Low Pay'
        WHEN avg_performance < 3.5 AND compensation_level = 'High' THEN 'Performance Concern - Low Performance, High Pay'
        WHEN avg_performance > 4.5 AND tenure_level = 'Senior' THEN 'Key Talent - Retain'
        WHEN avg_performance < 3.5 THEN 'Development Needed'
        ELSE 'Standard'
    END as talent_status,
    salary
FROM employee_analysis
ORDER BY 
    CASE 
        WHEN avg_performance > 4.5 AND compensation_level = 'Low' THEN 1
        WHEN avg_performance < 3.5 AND compensation_level = 'High' THEN 2
        WHEN avg_performance > 4.5 AND tenure_level = 'Senior' THEN 3
        ELSE 4
    END,
    avg_performance DESC
""", conn)
display(df)

Executive Dashboard - Company Overview:


Unnamed: 0,total_departments,total_employees,total_projects,total_payroll,total_project_budget,total_investment,avg_performance,high_performers,high_performer_percentage,avg_budget_per_employee
0,5,8,5,1764000,3500000,5264000,4.47,6,75.0,110000.0




Talent Management - Risk Assessment:


Unnamed: 0,full_name,dept_name,tenure_level,compensation_level,avg_performance,total_bonus,talent_status,salary
0,Robert Wilson,Engineering,Senior,Medium,4.8,16500,Key Talent - Retain,88000
1,Jane Smith,Engineering,Senior,Medium,4.75,15500,Key Talent - Retain,82000
2,Sarah Williams,Human Resources,Mid-level,Medium,4.7,15500,Standard,90000
3,John Doe,Engineering,Senior,Medium,4.35,11000,Standard,75000
4,David Brown,Sales,Senior,Medium,4.35,12500,Standard,95000
5,Lisa Anderson,Marketing,Mid-level,Medium,4.2,8500,Standard,72000
6,Emily Davis,Finance,Mid-level,Medium,4.1,9500,Standard,70000
7,Mike Johnson,Marketing,Mid-level,Medium,4.0,7500,Standard,65000
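
The dashboard's `high_performers` value of 6 counts performance rows above 4.5, that is, employee-years, and dividing it by 8 employees gives the 75% shown. If the intent is to count people, one option is to average each employee's scores first. The sketch below loads the scores from the setup cell into an in-memory table named `perf` (a name chosen here for illustration). Defining a high performer as someone whose average score exceeds 4.5 is an assumption about the intended meaning.

```python
import sqlite3

# Scores copied from the chapter's setup cell; the table name is hypothetical
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE perf (emp_id INTEGER, year INTEGER, score REAL)")
demo.executemany("INSERT INTO perf VALUES (?, ?, ?)", [
    (1, 2022, 4.2), (1, 2023, 4.5), (2, 2022, 4.8), (2, 2023, 4.7),
    (3, 2022, 3.9), (3, 2023, 4.1), (4, 2022, 4.6), (4, 2023, 4.8),
    (5, 2022, 4.4), (5, 2023, 4.3), (6, 2022, 4.0), (6, 2023, 4.2),
    (7, 2022, 4.7), (7, 2023, 4.9), (8, 2022, 4.1), (8, 2023, 4.3),
])

# What the dashboard counts: individual review rows above the threshold
review_rows = demo.execute(
    "SELECT COUNT(*) FROM perf WHERE score > 4.5"
).fetchone()[0]

# Counting people instead: one row per employee, filtered on the average
people = demo.execute("""
    SELECT COUNT(*) FROM (
        SELECT emp_id FROM perf GROUP BY emp_id HAVING AVG(score) > 4.5
    )
""").fetchone()[0]

total_employees = demo.execute(
    "SELECT COUNT(DISTINCT emp_id) FROM perf"
).fetchone()[0]
share = round(people * 100.0 / total_employees, 1)

print(review_rows, people, share)
demo.close()
```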


## Practice Exercises - Chapter 3

Challenge yourself with these advanced SQL problems:

### Exercise Questions:
1. **Create a CTE that identifies the most improved employee in each department (2022 vs 2023)**
2. **Write a query using window functions to find employees whose salary is in the top 25% company-wide**
3. **Build a comprehensive department scorecard using subqueries and aggregations**
4. **Use EXISTS to find departments that have both high-budget projects (>150k) and high-performing employees (>4.5 avg)**
5. **Create a query that shows month-over-month trends (simulate monthly data)**

In [8]:
# Advanced Practice Area

# Exercise 1: Most improved employee per department
print("Exercise 1: Most improved employees by department (2022 vs 2023)")
query1 = """
WITH performance_improvement AS (
    SELECT 
        e.emp_id,
        e.first_name || ' ' || e.last_name as full_name,
        d.dept_name,
        p2023.performance_score as score_2023,
        p2022.performance_score as score_2022,
        p2023.performance_score - p2022.performance_score as improvement,
        RANK() OVER (PARTITION BY d.dept_name ORDER BY (p2023.performance_score - p2022.performance_score) DESC) as improvement_rank
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    INNER JOIN employee_performance p2023 ON e.emp_id = p2023.emp_id AND p2023.year = 2023
    INNER JOIN employee_performance p2022 ON e.emp_id = p2022.emp_id AND p2022.year = 2022
)
SELECT 
    dept_name,
    full_name,
    score_2022,
    score_2023,
    ROUND(improvement, 2) as improvement
FROM performance_improvement
WHERE improvement_rank = 1
ORDER BY improvement DESC
"""
df = pd.read_sql_query(query1, conn)
display(df)

print("\n" + "="*50 + "\n")

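# Exercise 2: Top 25% earners using window functions
# Note: as written, this query fails in SQLite with "misuse of window function NTILE()",
# because window functions are not allowed in WHERE. Compute the quartile in a CTE or
# subquery and filter on it in the outer query.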
print("Exercise 2: Employees in top 25% salary range")
query2 = """
SELECT 
    e.first_name || ' ' || e.last_name as full_name,
    d.dept_name,
    e.salary,
    NTILE(4) OVER (ORDER BY e.salary DESC) as salary_quartile,
    PERCENT_RANK() OVER (ORDER BY e.salary DESC) as salary_percentile,
    ROUND((SELECT AVG(salary) FROM employees), 2) as company_avg
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE NTILE(4) OVER (ORDER BY e.salary DESC) = 1
ORDER BY e.salary DESC
"""
df = pd.read_sql_query(query2, conn)
display(df)

print("\n" + "="*50 + "\n")

# Exercise 3: Comprehensive department scorecard
print("Exercise 3: Department Performance Scorecard")
query3 = """
WITH dept_scorecard AS (
    SELECT 
        d.dept_name,
        COUNT(DISTINCT e.emp_id) as employee_count,
        ROUND(AVG(e.salary), 2) as avg_salary,
        COUNT(DISTINCT p.project_id) as project_count,
        COALESCE(SUM(p.budget), 0) as total_budget,
        ROUND(AVG(perf.performance_score), 2) as avg_performance,
        SUM(perf.bonus) as total_bonuses,
        ROUND(COALESCE(SUM(p.budget), 0) / NULLIF(COUNT(DISTINCT e.emp_id), 0), 2) as budget_per_employee
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    LEFT JOIN projects p ON d.dept_id = p.dept_id
    LEFT JOIN employee_performance perf ON e.emp_id = perf.emp_id
    GROUP BY d.dept_name
),
company_benchmarks AS (
    SELECT 
        AVG(avg_salary) as company_avg_salary,
        AVG(avg_performance) as company_avg_performance,
        AVG(budget_per_employee) as company_avg_budget_per_emp
    FROM dept_scorecard
)
SELECT 
    ds.dept_name,
    ds.employee_count,
    ds.avg_salary,
    ds.project_count,
    ds.total_budget,
    ds.avg_performance,
    ds.budget_per_employee,
    CASE 
        WHEN ds.avg_performance > cb.company_avg_performance * 1.1 THEN 'Excellent'
        WHEN ds.avg_performance > cb.company_avg_performance * 0.9 THEN 'Good'
        ELSE 'Needs Improvement'
    END as performance_rating,
    CASE 
        WHEN ds.budget_per_employee > cb.company_avg_budget_per_emp * 1.2 THEN 'High Investment'
        WHEN ds.budget_per_employee > cb.company_avg_budget_per_emp * 0.8 THEN 'Standard Investment'
        ELSE 'Low Investment'
    END as investment_level
FROM dept_scorecard ds
CROSS JOIN company_benchmarks cb
ORDER BY ds.avg_performance DESC, ds.total_budget DESC
"""
df = pd.read_sql_query(query3, conn)
display(df)

# Continue with remaining exercises...

Exercise 1: Most improved employees by department (2022 vs 2023)


Unnamed: 0,dept_name,full_name,score_2022,score_2023,improvement
0,Engineering,John Doe,4.2,4.5,0.3
1,Finance,Emily Davis,4.0,4.2,0.2
2,Human Resources,Sarah Williams,4.6,4.8,0.2
3,Marketing,Lisa Anderson,4.1,4.3,0.2
4,Sales,David Brown,4.4,4.3,-0.1




Exercise 2: Employees in top 25% salary range


DatabaseError: Execution failed on sql '
SELECT 
    e.first_name || ' ' || e.last_name as full_name,
    d.dept_name,
    e.salary,
    NTILE(4) OVER (ORDER BY e.salary DESC) as salary_quartile,
    PERCENT_RANK() OVER (ORDER BY e.salary DESC) as salary_percentile,
    ROUND((SELECT AVG(salary) FROM employees), 2) as company_avg
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE NTILE(4) OVER (ORDER BY e.salary DESC) = 1
ORDER BY e.salary DESC
': misuse of window function NTILE()
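
The error above arises because `WHERE` is evaluated before window functions, so `NTILE()` cannot appear there. The usual workaround is to compute the window value in a CTE or subquery and filter in the outer query. The sketch below shows both the failure and the working pattern on a hypothetical in-memory `pay` table with placeholder names, not on the chapter's database.

```python
import sqlite3

# Hypothetical table with eight salaries, for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pay (name TEXT, salary INTEGER)")
demo.executemany("INSERT INTO pay VALUES (?, ?)", [
    ("A", 95000), ("B", 90000), ("C", 88000), ("D", 82000),
    ("E", 75000), ("F", 72000), ("G", 70000), ("H", 65000),
])

# A window function directly in WHERE is rejected when the statement is prepared
try:
    demo.execute("SELECT name FROM pay WHERE NTILE(4) OVER (ORDER BY salary DESC) = 1")
    misuse_error = None
except sqlite3.OperationalError as exc:
    misuse_error = str(exc)

# Working pattern: compute the quartile first, filter on it afterwards
top_quartile = demo.execute("""
    WITH ranked AS (
        SELECT name, salary,
               NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile
        FROM pay
    )
    SELECT name, salary
    FROM ranked
    WHERE salary_quartile = 1
    ORDER BY salary DESC
""").fetchall()

print(misuse_error)
print(top_quartile)
demo.close()
```

The same restriction applies to `GROUP BY` and `HAVING`, so the wrap-then-filter pattern is worth remembering for any ranking or percentile filter.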

In [None]:
# Clean up connections
conn.close()
print("Database connections closed. Advanced SQL chapter complete!")

## Chapter Summary

Congratulations! You've mastered advanced SQL concepts:

✅ **Subqueries & Correlated Queries** - Complex nested query patterns  
✅ **Common Table Expressions (CTEs)** - Readable and maintainable complex queries  
✅ **Advanced Window Functions** - Analytical functions for sophisticated analysis  
✅ **Data Manipulation** - Complex transformations and conditional logic  
✅ **Query Optimization** - Performance considerations and best practices  
✅ **Real-world Scenarios** - Business intelligence and analytics applications  

### Key Advanced Concepts Mastered:
- **Correlated subqueries** for row-by-row comparisons
- **CTEs** for breaking down complex logic
- **Window functions** for rankings and running calculations
- **Advanced aggregations** with conditional logic
- **Performance optimization** techniques
- **Business intelligence** query patterns

### Professional Development:
You now have the SQL skills to:
- Build executive dashboards and reports
- Perform complex data analysis
- Optimize query performance
- Handle real-world business scenarios
- Work with large datasets efficiently

### Next Steps for Continued Learning:
1. **Database-Specific Features** - Learn PostgreSQL, MySQL, or SQL Server specific functions
2. **Data Warehousing** - Explore OLAP, star schemas, and dimensional modeling
3. **Big Data SQL** - Learn Spark SQL, Presto, or other distributed SQL engines
4. **Database Administration** - Indexing, query tuning, and performance monitoring
5. **Integration Projects** - Combine SQL with Python, R, or BI tools

**🎉 You've completed the SQL Learning Journey! You're now equipped with professional-level SQL skills.**