# Advanced SQL Concepts

This notebook covers advanced SQL topics that build upon the fundamentals. You'll learn powerful features used in real-world data analysis and database management.

## Topics Covered:
1. Window Functions
2. Common Table Expressions (CTEs)
3. Advanced Joins (Self-joins, Cross joins)
4. Views and Indexes
5. Advanced Conditional Logic (CASE expressions, conditional aggregation)
6. Practice Exercises

## Setup and Database Connection

Let's connect to our existing database and ensure we have the sample data.

In [1]:
import sqlite3
import pandas as pd
from IPython.display import display
import os

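# Connect to the SQLite database, checking the current and parent directories
if os.path.exists('my_database.db'):
    conn = sqlite3.connect('my_database.db')
elif os.path.exists('../my_database.db'):
    conn = sqlite3.connect('../my_database.db')
else:
    # No existing file found: sqlite3.connect() creates a new, empty database here
    conn = sqlite3.connect('my_database.db')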

cursor = conn.cursor()

print("Connected to SQLite database successfully!")
print(f"Database location: {os.path.abspath(conn.execute('PRAGMA database_list').fetchone()[2])}")

# Verify our tables exist
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", conn)
print(f"\nAvailable tables: {', '.join(tables['name'].tolist())}")

Connected to SQLite database successfully!
Database location: /workspaces/sql-notes/notebooks/my_database.db

Available tables: departments, employees, projects, sales_transactions, performance_reviews


## 1. Window Functions

Window functions perform calculations across a set of rows related to the current row. Unlike GROUP BY, they do not collapse those rows, so every input row stays in the result and carries its own computed value.

### Key Window Functions:
- **ROW_NUMBER()** - Assigns a unique sequential number to each row
- **RANK()** - Assigns ranks, leaving gaps after ties
- **DENSE_RANK()** - Assigns ranks without gaps after ties
- **LAG()/LEAD()** - Access values from the previous/next row
- **FIRST_VALUE()/LAST_VALUE()** - Get the first/last value in the window frame
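
FIRST_VALUE() and LAST_VALUE() are listed above but not used in the cells that follow, so here is a minimal sketch of how they behave. It runs against a hypothetical in-memory `staff` table (the table, names, and salaries are made up for illustration and are not part of `my_database.db`). The point it tries to show: with an ORDER BY in the window, the default frame ends at the current row, so LAST_VALUE() only returns the partition's last value when the frame is widened explicitly.

```python
import sqlite3

# Hypothetical demo data, separate from the notebook's database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, dept_id INTEGER, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", 1, 50000), ("Bob", 1, 60000), ("Cy", 1, 70000)],
)

rows = demo.execute("""
SELECT
    name,
    salary,
    FIRST_VALUE(salary) OVER (PARTITION BY dept_id ORDER BY salary) AS lowest,
    -- Default frame stops at the current row, so this echoes the row's own salary
    LAST_VALUE(salary) OVER (PARTITION BY dept_id ORDER BY salary) AS last_default,
    -- Explicit frame covering the whole partition
    LAST_VALUE(salary) OVER (
        PARTITION BY dept_id ORDER BY salary
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS highest
FROM staff
ORDER BY salary
""").fetchall()

for row in rows:
    print(row)
```

In this sketch `last_default` matches each row's own salary, while `highest` is 70000 on every row.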

In [2]:
# Example 1: ROW_NUMBER() - Number employees by salary within each department
print("Employee ranking by salary within each department:")
query = """
SELECT 
    first_name,
    last_name,
    salary,
    dept_id,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as salary_rank
FROM employees
ORDER BY dept_id, salary_rank
"""
df = pd.read_sql_query(query, conn)
display(df)

print("\n" + "="*50 + "\n")

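# Example 2: RANK() and DENSE_RANK() comparison
# (the sample salaries are all distinct, so the three columns agree here;
#  they only differ when two or more salaries tie)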
print("Comparison of RANK() vs DENSE_RANK():")
query = """
SELECT 
    first_name,
    last_name,
    salary,
    RANK() OVER (ORDER BY salary DESC) as rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank,
    ROW_NUMBER() OVER (ORDER BY salary DESC) as row_number
FROM employees
ORDER BY salary DESC
"""
df = pd.read_sql_query(query, conn)
display(df)

Employee ranking by salary within each department:


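|  | first_name | last_name | salary | dept_id | salary_rank |
|---|---|---|---|---|---|
| 0 | Sarah | Williams | 90000 | 1 | 1 |
| 1 | Robert | Wilson | 88000 | 2 | 1 |
| 2 | Jane | Smith | 82000 | 2 | 2 |
| 3 | John | Doe | 75000 | 2 | 3 |
| 4 | Lisa | Anderson | 72000 | 3 | 1 |
| 5 | Mike | Johnson | 65000 | 3 | 2 |
| 6 | David | Brown | 95000 | 4 | 1 |
| 7 | Emily | Davis | 70000 | 5 | 1 |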




Comparison of RANK() vs DENSE_RANK():


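|  | first_name | last_name | salary | rank_with_gaps | dense_rank | row_number |
|---|---|---|---|---|---|---|
| 0 | David | Brown | 95000 | 1 | 1 | 1 |
| 1 | Sarah | Williams | 90000 | 2 | 2 | 2 |
| 2 | Robert | Wilson | 88000 | 3 | 3 | 3 |
| 3 | Jane | Smith | 82000 | 4 | 4 | 4 |
| 4 | John | Doe | 75000 | 5 | 5 | 5 |
| 5 | Lisa | Anderson | 72000 | 6 | 6 | 6 |
| 6 | Emily | Davis | 70000 | 7 | 7 | 7 |
| 7 | Mike | Johnson | 65000 | 8 | 8 | 8 |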
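
Because every salary in the sample data is distinct, the three ranking columns above come out identical. The following sketch uses a hypothetical in-memory `staff` table with a deliberate tie (names and salaries are invented for illustration) to show where RANK(), DENSE_RANK(), and ROW_NUMBER() diverge.

```python
import sqlite3

# Hypothetical demo data with two employees tied at the top salary
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ann", 90000), ("Bob", 90000), ("Cy", 80000), ("Di", 70000)],
)

rows = demo.execute("""
SELECT
    name,
    salary,
    RANK()       OVER (ORDER BY salary DESC)       AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY salary DESC)       AS rank_no_gaps,
    -- name is added as a tie-breaker so the numbering is deterministic
    ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num
FROM staff
ORDER BY salary DESC, name
""").fetchall()

for row in rows:
    print(row)
# ('Ann', 90000, 1, 1, 1)
# ('Bob', 90000, 1, 1, 2)
# ('Cy', 80000, 3, 2, 3)
# ('Di', 70000, 4, 3, 4)
```

After the tie, RANK() skips to 3 while DENSE_RANK() continues with 2; ROW_NUMBER() never repeats a value.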


In [3]:
# Example 3: LAG() - Compare each salary with the previous one
# (LEAD() works the same way but looks at the following row)
print("Salary comparison with previous employee:")
query = """
SELECT 
    first_name,
    last_name,
    salary,
    LAG(salary, 1) OVER (ORDER BY salary) as prev_salary,
    salary - LAG(salary, 1) OVER (ORDER BY salary) as salary_diff
FROM employees
ORDER BY salary
"""
df = pd.read_sql_query(query, conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 4: Running totals with SUM() OVER()
print("Running total of salaries by department:")
query = """
SELECT 
    e.first_name,
    e.last_name,
    d.dept_name,
    e.salary,
    SUM(e.salary) OVER (PARTITION BY e.dept_id ORDER BY e.salary) as running_total
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
ORDER BY d.dept_name, e.salary
"""
df = pd.read_sql_query(query, conn)
display(df)

Salary comparison with previous employee:


|  | first_name | last_name | salary | prev_salary | salary_diff |
|---|---|---|---|---|---|
| 0 | Mike | Johnson | 65000 |  |  |
| 1 | Emily | Davis | 70000 | 65000.0 | 5000.0 |
| 2 | Lisa | Anderson | 72000 | 70000.0 | 2000.0 |
| 3 | John | Doe | 75000 | 72000.0 | 3000.0 |
| 4 | Jane | Smith | 82000 | 75000.0 | 7000.0 |
| 5 | Robert | Wilson | 88000 | 82000.0 | 6000.0 |
| 6 | Sarah | Williams | 90000 | 88000.0 | 2000.0 |
| 7 | David | Brown | 95000 | 90000.0 | 5000.0 |




Running total of salaries by department:


|  | first_name | last_name | dept_name | salary | running_total |
|---|---|---|---|---|---|
| 0 | John | Doe | Engineering | 75000 | 75000 |
| 1 | Jane | Smith | Engineering | 82000 | 157000 |
| 2 | Robert | Wilson | Engineering | 88000 | 245000 |
| 3 | Emily | Davis | Finance | 70000 | 70000 |
| 4 | Sarah | Williams | Human Resources | 90000 | 90000 |
| 5 | Mike | Johnson | Marketing | 65000 | 65000 |
| 6 | Lisa | Anderson | Marketing | 72000 | 137000 |
| 7 | David | Brown | Sales | 95000 | 95000 |
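
One subtlety of running totals: when a window has ORDER BY but no explicit frame, the default frame is RANGE-based and treats rows with equal sort values as peers, so tied rows all receive the same cumulative sum. The sketch below illustrates this on a hypothetical in-memory `staff` table (invented data, not the notebook's `employees` table) and contrasts it with an explicit ROWS frame.

```python
import sqlite3

# Hypothetical demo data with two equal salaries
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ann", 50000), ("Bob", 50000), ("Cy", 60000)],
)

rows = demo.execute("""
SELECT
    name,
    salary,
    -- Default frame: tied salaries are peers and are summed together
    SUM(salary) OVER (ORDER BY salary) AS range_total,
    -- Explicit ROWS frame: strictly row-by-row accumulation
    SUM(salary) OVER (
        ORDER BY salary, name
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_total
FROM staff
ORDER BY salary, name
""").fetchall()

for row in rows:
    print(row)
# ('Ann', 50000, 100000, 50000)
# ('Bob', 50000, 100000, 100000)
# ('Cy', 60000, 160000, 160000)
```

The notebook's own data has no tied salaries within a department, so both forms would give the same result there.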


## 2. Common Table Expressions (CTEs)

A CTE is a named, temporary result set introduced with `WITH`. It exists only for the duration of the statement that follows and can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
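
The examples in this section use CTEs only with SELECT. As a small sketch of the UPDATE case, the block below gives a flat raise to everyone earning less than their department's average. It uses a hypothetical in-memory `staff` table and an invented raise amount, so it does not touch the notebook's database.

```python
import sqlite3

# Hypothetical demo data, separate from the notebook's database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, dept_id INTEGER, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", 1, 50000), ("Bob", 1, 70000), ("Cy", 2, 60000)],
)

# The CTE computes per-department averages; the UPDATE refers to it in a subquery
demo.execute("""
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM staff
    GROUP BY dept_id
)
UPDATE staff
SET salary = salary + 5000
WHERE salary < (
    SELECT avg_salary FROM dept_avg WHERE dept_avg.dept_id = staff.dept_id
)
""")

updated = demo.execute("SELECT name, salary FROM staff ORDER BY name").fetchall()
print(updated)
# [('Ann', 55000), ('Bob', 70000), ('Cy', 60000)]
```

Only Ann is below her department's average (60000), so only her row changes.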

In [4]:
# Example 1: Basic CTE - Calculate department averages
print("Employees above their department average salary:")
query = """
WITH dept_averages AS (
    SELECT 
        dept_id,
        AVG(salary) as avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT 
    e.first_name,
    e.last_name,
    e.salary,
    d.dept_name,
    da.avg_salary,
    ROUND(e.salary - da.avg_salary, 2) as above_avg
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
JOIN dept_averages da ON e.dept_id = da.dept_id
WHERE e.salary > da.avg_salary
ORDER BY above_avg DESC
"""
df = pd.read_sql_query(query, conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Multiple CTEs - Complex analysis
print("Department performance analysis:")
query = """
WITH dept_stats AS (
    SELECT 
        dept_id,
        COUNT(*) as employee_count,
        AVG(salary) as avg_salary,
        MAX(salary) as max_salary,
        MIN(salary) as min_salary
    FROM employees
    GROUP BY dept_id
),
project_budgets AS (
    SELECT 
        dept_id,
        COUNT(*) as project_count,
        SUM(budget) as total_budget
    FROM projects
    GROUP BY dept_id
)
SELECT 
    d.dept_name,
    d.location,
    ds.employee_count,
    ROUND(ds.avg_salary, 0) as avg_salary,
    pb.project_count,
    pb.total_budget,
    ROUND(pb.total_budget * 1.0 / ds.employee_count, 0) as budget_per_employee  -- * 1.0 avoids integer division
FROM departments d
LEFT JOIN dept_stats ds ON d.dept_id = ds.dept_id
LEFT JOIN project_budgets pb ON d.dept_id = pb.dept_id
ORDER BY budget_per_employee DESC
"""
df = pd.read_sql_query(query, conn)
display(df)

Employees above their department average salary:


|  | first_name | last_name | salary | dept_name | avg_salary | above_avg |
|---|---|---|---|---|---|---|
| 0 | Robert | Wilson | 88000 | Engineering | 81666.666667 | 6333.33 |
| 1 | Lisa | Anderson | 72000 | Marketing | 68500.0 | 3500.0 |
| 2 | Jane | Smith | 82000 | Engineering | 81666.666667 | 333.33 |




Department performance analysis:


|  | dept_name | location | employee_count | avg_salary | project_count | total_budget | budget_per_employee |
|---|---|---|---|---|---|---|---|
| 0 | Finance | Boston | 1 | 70000.0 | 1.0 | 200000.0 | 200000.0 |
| 1 | Engineering | San Francisco | 3 | 81667.0 | 2.0 | 450000.0 | 150000.0 |
| 2 | Sales | Los Angeles | 1 | 95000.0 | 1.0 | 50000.0 | 50000.0 |
| 3 | Marketing | Chicago | 2 | 68500.0 | 1.0 | 75000.0 | 37500.0 |
| 4 | Human Resources | New York | 1 | 90000.0 |  |  |  |


## 3. Advanced Joins

Beyond basic INNER and LEFT joins, let's explore self-joins and cross joins.

In [5]:
# Example 1: Self-join - Find employees in the same department
print("Employees working in the same department:")
query = """
SELECT 
    e1.first_name || ' ' || e1.last_name as employee1,
    e2.first_name || ' ' || e2.last_name as employee2,
    d.dept_name,
    d.location
FROM employees e1
JOIN employees e2 ON e1.dept_id = e2.dept_id AND e1.emp_id < e2.emp_id
JOIN departments d ON e1.dept_id = d.dept_id
ORDER BY d.dept_name, employee1
"""
df = pd.read_sql_query(query, conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Cross join for combinations (be careful with large datasets!)
print("All possible employee-project combinations (first 10):")
query = """
SELECT 
    e.first_name || ' ' || e.last_name as employee,
    p.project_name,
    CASE 
        WHEN e.dept_id = p.dept_id THEN 'Assigned Department'
        ELSE 'Other Department'
    END as assignment_type
FROM employees e
CROSS JOIN projects p
ORDER BY employee, p.project_name
LIMIT 10
"""
df = pd.read_sql_query(query, conn)
display(df)

Employees working in the same department:


|  | employee1 | employee2 | dept_name | location |
|---|---|---|---|---|
| 0 | Jane Smith | Robert Wilson | Engineering | San Francisco |
| 1 | John Doe | Jane Smith | Engineering | San Francisco |
| 2 | John Doe | Robert Wilson | Engineering | San Francisco |
| 3 | Mike Johnson | Lisa Anderson | Marketing | Chicago |




All possible employee-project combinations (first 10):


|  | employee | project_name | assignment_type |
|---|---|---|---|
| 0 | David Brown | Financial System Upgrade | Other Department |
| 1 | David Brown | Marketing Campaign Q2 | Other Department |
| 2 | David Brown | Mobile App Development | Other Department |
| 3 | David Brown | Sales Training Program | Assigned Department |
| 4 | David Brown | Website Redesign | Other Department |
| 5 | Emily Davis | Financial System Upgrade | Assigned Department |
| 6 | Emily Davis | Marketing Campaign Q2 | Other Department |
| 7 | Emily Davis | Mobile App Development | Other Department |
| 8 | Emily Davis | Sales Training Program | Other Department |
| 9 | Emily Davis | Website Redesign | Other Department |
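
Two details of the joins above are easy to miss. In the self-join, the `e1.emp_id < e2.emp_id` condition keeps each pair once and excludes self-pairs. A CROSS JOIN, in turn, returns one row per combination, so its size is the product of the two row counts. The sketch below checks both on hypothetical in-memory tables (`staff` and `tasks` are invented for illustration).

```python
import sqlite3

# Hypothetical demo data: three people in department 1, one in department 2
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Ann", 1), (2, "Bob", 1), (3, "Cy", 1), (4, "Di", 2)],
)
demo.execute("CREATE TABLE tasks (title TEXT)")
demo.executemany("INSERT INTO tasks VALUES (?)", [("Audit",), ("Launch",)])


def pair_count(op):
    """Count same-department pairs using the given emp_id comparison operator."""
    sql = (
        "SELECT COUNT(*) FROM staff a "
        f"JOIN staff b ON a.dept_id = b.dept_id AND a.emp_id {op} b.emp_id"
    )
    return demo.execute(sql).fetchone()[0]


ordered_pairs = pair_count("<>")  # (Ann, Bob) and (Bob, Ann) both appear
unique_pairs = pair_count("<")    # each pair appears once
cross_rows = demo.execute("SELECT COUNT(*) FROM staff CROSS JOIN tasks").fetchone()[0]

print(ordered_pairs, unique_pairs, cross_rows)
# 6 3 8
```

With 4 staff rows and 2 task rows the cross join yields 8 rows; the same arithmetic on two large tables grows quickly, which is why the LIMIT in the example above is a sensible precaution.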


## 4. Views and Indexes

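A view is a stored query that behaves like a virtual table and is re-evaluated each time it is queried. An index speeds up lookups and sorts on the indexed columns, at the cost of extra storage and slightly slower writes.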
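
To make the "virtual table" idea concrete, here is a minimal sketch showing that a view stores no data of its own: after the underlying table changes, the same view query returns different rows. The `staff` table and `high_earners` view are hypothetical names used only for this illustration.

```python
import sqlite3

# Hypothetical demo data, separate from the notebook's database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
demo.execute("INSERT INTO staff VALUES ('Ann', 80000)")
demo.execute("""
CREATE VIEW high_earners AS
SELECT name, salary FROM staff WHERE salary >= 90000
""")

before = demo.execute("SELECT name FROM high_earners").fetchall()

# Change the base table; the view definition is untouched
demo.execute("UPDATE staff SET salary = 95000 WHERE name = 'Ann'")

after = demo.execute("SELECT name FROM high_earners").fetchall()

print(before, after)
# [] [('Ann',)]
```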

In [6]:
# Example 1: Create a view for employee details
print("Creating and using a view:")
cursor.execute("""
CREATE VIEW IF NOT EXISTS employee_details AS
SELECT 
    e.emp_id,
    e.first_name || ' ' || e.last_name as full_name,
    e.email,
    e.salary,
    d.dept_name,
    d.location,
    CASE 
        WHEN e.salary >= 90000 THEN 'Senior'
        WHEN e.salary >= 75000 THEN 'Mid-level'
        ELSE 'Junior'
    END as level
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
""")

# Query the view
df = pd.read_sql_query("SELECT * FROM employee_details ORDER BY salary DESC", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Create an index for better performance
print("Creating index on salary for faster queries:")
cursor.execute("CREATE INDEX IF NOT EXISTS idx_employee_salary ON employees(salary)")
cursor.execute("CREATE INDEX IF NOT EXISTS idx_employee_dept ON employees(dept_id)")

# Show index information
indexes = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='index' AND tbl_name='employees'", conn)
print("Indexes on employees table:")
display(indexes)

conn.commit()

Creating and using a view:


|  | emp_id | full_name | email | salary | dept_name | location | level |
|---|---|---|---|---|---|---|---|
| 0 | 5 | David Brown | david.brown@company.com | 95000 | Sales | Los Angeles | Senior |
| 1 | 4 | Sarah Williams | sarah.williams@company.com | 90000 | Human Resources | New York | Senior |
| 2 | 7 | Robert Wilson | robert.wilson@company.com | 88000 | Engineering | San Francisco | Mid-level |
| 3 | 2 | Jane Smith | jane.smith@company.com | 82000 | Engineering | San Francisco | Mid-level |
| 4 | 1 | John Doe | john.doe@company.com | 75000 | Engineering | San Francisco | Mid-level |
| 5 | 8 | Lisa Anderson | lisa.anderson@company.com | 72000 | Marketing | Chicago | Junior |
| 6 | 6 | Emily Davis | emily.davis@company.com | 70000 | Finance | Boston | Junior |
| 7 | 3 | Mike Johnson | mike.johnson@company.com | 65000 | Marketing | Chicago | Junior |




Creating index on salary for faster queries:
Indexes on employees table:


|  | name |
|---|---|
| 0 | sqlite_autoindex_employees_1 |
| 1 | idx_employee_salary |
| 2 | idx_employee_dept |
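
Creating an index does not by itself show that queries use it. One way to check in SQLite is `EXPLAIN QUERY PLAN`, sketched below on a hypothetical in-memory `staff` table (the table and the index name `idx_staff_dept` are assumptions for this example). The exact wording of the plan text varies between SQLite versions, so the sketch only looks for the keywords.

```python
import sqlite3

# Hypothetical demo data, separate from the notebook's database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, dept_id INTEGER, salary INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", 1, 50000), ("Bob", 2, 60000), ("Cy", 2, 70000)],
)


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in demo.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT name FROM staff WHERE dept_id = 2"

plan_before = plan(query)  # expected to mention a full SCAN of staff
demo.execute("CREATE INDEX idx_staff_dept ON staff(dept_id)")
plan_after = plan(query)   # expected to mention a SEARCH using idx_staff_dept

print(plan_before)
print(plan_after)
```

On a table this small the speed difference is negligible; the plan output is mainly useful for confirming that an index matches the query's WHERE clause.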


## 5. Advanced Conditional Logic

CASE expressions and conditional aggregation let you encode multi-branch business rules directly in a query.

In [7]:
# Example 1: Complex CASE statements
print("Employee categorization with multiple conditions:")
query = """
SELECT 
    first_name || ' ' || last_name as employee,
    salary,
    hire_date,
    CASE 
        WHEN salary >= 90000 AND hire_date < '2020-01-01' THEN 'Senior Veteran'
        WHEN salary >= 90000 THEN 'Senior New Hire'
        WHEN salary >= 75000 AND hire_date < '2020-01-01' THEN 'Mid-level Veteran'
        WHEN salary >= 75000 THEN 'Mid-level New Hire'
        ELSE 'Junior Level'
    END as employee_category,
    CASE 
        WHEN salary > (SELECT AVG(salary) FROM employees) THEN 'Above Average'
        ELSE 'Below Average'
    END as salary_comparison
FROM employees
ORDER BY salary DESC
"""
df = pd.read_sql_query(query, conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Conditional aggregations (pivot-like behavior)
print("Department salary distribution:")
query = """
SELECT 
    d.dept_name,
    COUNT(e.emp_id) as total_employees,  -- COUNT(*) would report 1 for a department with no employees
    SUM(CASE WHEN e.salary >= 90000 THEN 1 ELSE 0 END) as senior_count,
    SUM(CASE WHEN e.salary BETWEEN 75000 AND 89999 THEN 1 ELSE 0 END) as mid_count,
    SUM(CASE WHEN e.salary < 75000 THEN 1 ELSE 0 END) as junior_count,
    ROUND(AVG(CASE WHEN e.salary >= 90000 THEN e.salary END), 0) as avg_senior_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
ORDER BY total_employees DESC
"""
df = pd.read_sql_query(query, conn)
display(df)

Employee categorization with multiple conditions:


|  | employee | salary | hire_date | employee_category | salary_comparison |
|---|---|---|---|---|---|
| 0 | David Brown | 95000 | 2018-11-12 | Senior Veteran | Above Average |
| 1 | Sarah Williams | 90000 | 2020-09-05 | Senior New Hire | Above Average |
| 2 | Robert Wilson | 88000 | 2019-08-17 | Mid-level Veteran | Above Average |
| 3 | Jane Smith | 82000 | 2019-03-22 | Mid-level Veteran | Above Average |
| 4 | John Doe | 75000 | 2020-01-15 | Mid-level New Hire | Below Average |
| 5 | Lisa Anderson | 72000 | 2021-04-03 | Junior Level | Below Average |
| 6 | Emily Davis | 70000 | 2022-02-28 | Junior Level | Below Average |
| 7 | Mike Johnson | 65000 | 2021-06-10 | Junior Level | Below Average |




Department salary distribution:


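|  | dept_name | total_employees | senior_count | mid_count | junior_count | avg_senior_salary |
|---|---|---|---|---|---|---|
| 0 | Engineering | 3 | 0 | 3 | 0 |  |
| 1 | Marketing | 2 | 0 | 0 | 2 |  |
| 2 | Human Resources | 1 | 1 | 0 | 0 | 90000.0 |
| 3 | Sales | 1 | 1 | 0 | 0 | 95000.0 |
| 4 | Finance | 1 | 0 | 0 | 1 |  |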
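
The empty `avg_senior_salary` cells above follow from how NULL interacts with aggregates: a CASE with no ELSE yields NULL for non-matching rows, and AVG() and COUNT(column) skip NULLs, whereas COUNT(*) counts every row, including the all-NULL row a LEFT JOIN produces for an unmatched department. The sketch below demonstrates these rules on hypothetical in-memory tables (invented data; `Ops` is a department with no staff).

```python
import sqlite3

# Hypothetical demo data: 'Ops' has no staff at all
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
demo.execute("CREATE TABLE staff (name TEXT, dept_id INTEGER, salary INTEGER)")
demo.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "Eng"), (2, "Ops")])
demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", 1, 95000), ("Bob", 1, 70000)],
)

rows = demo.execute("""
SELECT
    d.dept_name,
    COUNT(*)      AS joined_rows,   -- counts the NULL-padded row for Ops
    COUNT(s.name) AS staff_count,   -- skips NULLs, so Ops gets 0
    AVG(CASE WHEN s.salary >= 90000 THEN s.salary END)        AS avg_senior,
    AVG(CASE WHEN s.salary >= 90000 THEN s.salary ELSE 0 END) AS avg_with_zeros
FROM departments d
LEFT JOIN staff s ON d.dept_id = s.dept_id
GROUP BY d.dept_id, d.dept_name
ORDER BY d.dept_id
""").fetchall()

for row in rows:
    print(row)
# ('Eng', 2, 2, 95000.0, 47500.0)
# ('Ops', 1, 0, None, 0.0)
```

Adding `ELSE 0` changes the meaning of the average, since the zeros are then included in the denominator, so omitting ELSE is usually what a conditional average needs.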


## 6. Practice Exercises - Advanced Level

Test your understanding with these challenging exercises!

### Advanced Exercise Questions:

1. **Window Function Challenge**: Find the 2nd highest paid employee in each department
2. **CTE Challenge**: Calculate the percentage of total company salary each department represents
3. **Self-Join Challenge**: Find pairs of employees with salary differences less than $5,000
4. **Performance Analysis**: Create a view showing project budget per employee for each department (a rough proxy for ROI)
5. **Complex Aggregation**: Show month-over-month hiring trends (extract month from hire_date)

In [8]:
# Advanced Practice Area - Try the exercises above!

# Solution 1: 2nd highest paid employee in each department
print("Solution 1: 2nd highest paid employee in each department")
query1 = """
SELECT 
    first_name,
    last_name,
    salary,
    dept_id,
    salary_rank
FROM (
    SELECT 
        first_name,
        last_name,
        salary,
        dept_id,
        -- DENSE_RANK() still yields a rank 2 when the top salary is tied; RANK() would skip it
        DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2
"""
df = pd.read_sql_query(query1, conn)
display(df)

print("\n" + "="*50 + "\n")

# Solution 2: Department salary percentage of total
print("Solution 2: Department salary as percentage of total company")
query2 = """
WITH company_total AS (
    SELECT SUM(salary) as total_salary FROM employees
),
dept_totals AS (
    SELECT 
        dept_id,
        SUM(salary) as dept_salary
    FROM employees
    GROUP BY dept_id
)
SELECT 
    d.dept_name,
    dt.dept_salary,
    ct.total_salary,
    ROUND((dt.dept_salary * 100.0 / ct.total_salary), 2) as percentage_of_total
FROM departments d
JOIN dept_totals dt ON d.dept_id = dt.dept_id
CROSS JOIN company_total ct
ORDER BY percentage_of_total DESC
"""
df = pd.read_sql_query(query2, conn)
display(df)

# Add your solutions for exercises 3-5 here!

Solution 1: 2nd highest paid employee in each department


|  | first_name | last_name | salary | dept_id | salary_rank |
|---|---|---|---|---|---|
| 0 | Jane | Smith | 82000 | 2 | 2 |
| 1 | Mike | Johnson | 65000 | 3 | 2 |




Solution 2: Department salary as percentage of total company


|  | dept_name | dept_salary | total_salary | percentage_of_total |
|---|---|---|---|---|
| 0 | Engineering | 245000 | 637000 | 38.46 |
| 1 | Marketing | 137000 | 637000 | 21.51 |
| 2 | Sales | 95000 | 637000 | 14.91 |
| 3 | Human Resources | 90000 | 637000 | 14.13 |
| 4 | Finance | 70000 | 637000 | 10.99 |
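
As a possible starting point for exercise 5, the sketch below groups hires by month with SQLite's `strftime()` and compares each month with the previous one using LAG(). It assumes `hire_date` is stored as `YYYY-MM-DD` text, as in the outputs above, and runs on a hypothetical in-memory `staff` table with invented dates. It is one approach among several, not the intended solution.

```python
import sqlite3

# Hypothetical demo data with ISO-formatted hire dates
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, hire_date TEXT)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ann", "2021-04-03"), ("Bob", "2021-04-20"), ("Cy", "2021-06-10")],
)

rows = demo.execute("""
WITH monthly AS (
    SELECT
        strftime('%Y-%m', hire_date) AS hire_month,
        COUNT(*) AS hires
    FROM staff
    GROUP BY hire_month
)
SELECT
    hire_month,
    hires,
    hires - LAG(hires) OVER (ORDER BY hire_month) AS change_vs_prev
FROM monthly
ORDER BY hire_month
""").fetchall()

for row in rows:
    print(row)
# ('2021-04', 2, None)
# ('2021-06', 1, -1)
```

Months with no hires do not appear at all, so here June is compared with April; a calendar table or recursive CTE would be needed to fill such gaps.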


In [9]:
# Wrap-up
print("Advanced SQL concepts completed!")
print("Remember to practice these concepts with your own datasets.")
print("Next: Try the Data Analysis with SQL notebook for real-world applications!")

# Note: the connection is left open so earlier cells can be re-run.
# Call conn.close() once you are finished with this notebook.

Advanced SQL concepts completed!
Remember to practice these concepts with your own datasets.
Next: Try the Data Analysis with SQL notebook for real-world applications!


