# Correlated Subqueries

### Introduction

In this lesson, we'll learn about correlated subqueries.  A correlated subquery is a subquery that references a column from the outer query.  So we first return a set of rows from an outer table, and then the subquery is evaluated again for each one of these rows, using that row's values.

Let's see this by way of example.

### Loading our data

In [1]:
import pandas as pd
import sqlite3
conn = sqlite3.connect('users.db')

table_names = ['employees', 'orders', 'customers', 'dog_foods']
root_url = "https://raw.githubusercontent.com/tech-interviews-jigsaw/sql-advanced-joins/main/6-common-strategies"
dfs = [pd.read_csv(f"{root_url}/{table_name}.csv") for table_name in table_names]

In [2]:
[df.to_sql(table_name, conn, index = True, index_label = 'id', if_exists = 'replace') for df, table_name in zip(dfs, table_names)]

[5, 3, 5, 5]

### Correlated Subqueries

With a correlated subquery, the inner query is conceptually evaluated once for each row returned by the outer query.

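Our `employees` table lists each employee along with a department and a salary.  As a first look at the data, the query below ranks salaries within each department and returns the employee ranked second.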

In [17]:
query = """ with ranked_salaries as (
SELECT employee_name, department, salary,
       rank() OVER (PARTITION BY department order by salary desc) AS dept_rank
FROM employees e1
)
select * from ranked_salaries where dept_rank = 2
"""
pd.read_sql(query, conn)

Unnamed: 0,employee_name,department,salary,dept_rank
0,Alice,HR,50000,2
1,Carol,IT,55000,2


In [18]:
query = """SELECT employee_name, department, salary, 
(SELECT MAX(salary) FROM employees e2
    WHERE e2.department = e1.department) AS max_dept_salary
FROM employees e1"""
pd.read_sql(query, conn)

Unnamed: 0,employee_name,department,salary,max_dept_salary
0,Alice,HR,50000,52000
1,Bob,IT,60000,60000
2,Carol,IT,55000,60000
3,David,HR,52000,52000
4,Eve,Finance,62000,62000


In [19]:
query = """
select *, 
(select salary from employees
e2 where e1.department = e2.department order by salary desc limit 1 offset 1 )
as second_highest
from employees e1
order by department, salary desc
"""

pd.read_sql(query, conn)

Unnamed: 0,id,employee_id,employee_name,department,salary,second_highest
0,4,5,Eve,Finance,62000,
1,3,4,David,HR,52000,50000.0
2,0,1,Alice,HR,50000,50000.0
3,1,2,Bob,IT,60000,55000.0
4,2,3,Carol,IT,55000,55000.0


Let's say that we want to find the highest salary for each department.  One way to do this is with a window function. 

In [7]:
query = """SELECT employee_name, department, salary,
       MAX(salary) OVER (PARTITION BY department) AS max_dept_salary
FROM employees e1
ORDER BY department, salary"""
pd.read_sql(query, conn)

Unnamed: 0,employee_name,department,salary,max_dept_salary
0,Eve,Finance,62000,62000
1,Alice,HR,50000,52000
2,David,HR,52000,52000
3,Carol,IT,55000,60000
4,Bob,IT,60000,60000
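
A detail worth checking with window aggregates: when `OVER (...)` includes an `ORDER BY`, the default frame runs from the start of the partition to the current row (and its peers), so `MAX` becomes a running maximum rather than the maximum of the whole partition. The sketch below puts both forms side by side. It assumes an in-memory SQLite database loaded with the lesson's five employee rows and a SQLite build with window-function support (3.25 or later). The column names `running_max` and `dept_max` are made up for illustration.

```python
import sqlite3

# In-memory database holding the five employee rows shown in this lesson.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Alice", "HR", 50000),
        ("Bob", "IT", 60000),
        ("Carol", "IT", 55000),
        ("David", "HR", 52000),
        ("Eve", "Finance", 62000),
    ],
)

# running_max: ORDER BY inside OVER limits the frame to rows up to the current one.
# dept_max: no ORDER BY, so the frame is the whole department.
query = """
SELECT employee_name, department, salary,
       MAX(salary) OVER (PARTITION BY department ORDER BY salary) AS running_max,
       MAX(salary) OVER (PARTITION BY department) AS dept_max
FROM employees
ORDER BY department, salary
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For Alice, the lowest-paid HR employee, `running_max` is her own 50000 while `dept_max` is 52000, which is the value we want when asking for the highest salary in the department.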


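Another way is with a correlated subquery, which looks up the department maximum separately for each employee row.

In [None]: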
query = """SELECT employee_name, department, salary,
(SELECT MAX(salary) FROM employees e2
    WHERE e2.department = e1.department) AS max_dept_salary
FROM employees e1"""
pd.read_sql(query, conn)

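Let's break it down.  The outer query (1) returns each row of `employees` as `e1`.  Then, for each of those rows, the correlated subquery (2) finds the maximum salary among the `e2` rows in the same department.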

```sql
SELECT employee_name, department, salary,
(SELECT MAX(salary) FROM employees e2 WHERE e2.department = e1.department) AS max_sal -- 2. correlated subquery
FROM employees e1 -- 1. outer query
```
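
To make the "once per outer row" idea concrete, here is a minimal sketch in plain Python that mimics the two steps with a loop. It is only a mental model: a database engine is free to plan the query differently. The helper name `max_dept_salary` is hypothetical, and the rows are copied from the lesson's `employees` table.

```python
# Employee rows copied from the lesson's employees table.
employees = [
    {"employee_name": "Alice", "department": "HR", "salary": 50000},
    {"employee_name": "Bob", "department": "IT", "salary": 60000},
    {"employee_name": "Carol", "department": "IT", "salary": 55000},
    {"employee_name": "David", "department": "HR", "salary": 52000},
    {"employee_name": "Eve", "department": "Finance", "salary": 62000},
]


def max_dept_salary(e1, rows):
    """Play the role of the correlated subquery for one outer row e1."""
    return max(
        e2["salary"] for e2 in rows if e2["department"] == e1["department"]
    )


# 1. Outer query: visit every row of employees as e1.
# 2. Correlated subquery: recompute the maximum using e1's department.
result = [
    (e1["employee_name"], e1["department"], e1["salary"],
     max_dept_salary(e1, employees))
    for e1 in employees
]

for row in result:
    print(row)
```

The inner function reads `e1["department"]`, just as the SQL subquery reads `e1.department`. That reference to the outer row is what makes the subquery correlated.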

### Moving to a use case

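Suppose we want the second-highest salary in each department.  In the subquery, we sort the department's salaries from highest to lowest, skip the first row with `offset 1`, and return the next row with `limit 1`.  A department with only one employee, like Finance, has no second row, so the subquery returns `NULL`.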

In [13]:
query = """SELECT employee_name, department, salary,
(SELECT salary FROM employees e2 WHERE e2.department = e1.department
order by salary desc limit 1 offset 1 ) AS second_highest
FROM employees e1 order by department desc, salary desc"""
pd.read_sql(query, conn)

Unnamed: 0,employee_name,department,salary,second_highest
0,Bob,IT,60000,55000.0
1,Carol,IT,55000,55000.0
2,David,HR,52000,50000.0
3,Alice,HR,50000,50000.0
4,Eve,Finance,62000,


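We can find the second-ranked employees with a window function as well.  Here `DENSE_RANK()` numbers the salaries within each department, and we keep only the rows ranked second.

In [135]: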
query = """
with ranked_employees as (
    SELECT employee_name, department, salary,
     DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
    FROM  employees
)
select * from ranked_employees where dept_rank = 2
"""
pd.read_sql(query, conn)

Unnamed: 0,employee_name,department,salary,dept_rank
0,Alice,HR,50000,2
1,Carol,IT,55000,2
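
The two approaches agree on the lesson's data, but they can differ when salaries tie. The sketch below uses hypothetical rows (not the lesson's data) in which two IT employees share the top salary. In that case `limit 1 offset 1` returns the tied top salary again, while `DENSE_RANK()` returns the next distinct salary. Adding `DISTINCT` inside the subquery is one way to make the two agree. It assumes an in-memory SQLite database with window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER)"
)
# Hypothetical rows: Bob and Frank tie for the highest IT salary.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Bob", "IT", 60000), ("Frank", "IT", 60000), ("Carol", "IT", 55000)],
)

# Correlated subquery over all rows: the second row is the tied 60000.
offset_query = """
SELECT DISTINCT department,
       (SELECT salary FROM employees e2
        WHERE e2.department = e1.department
        ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest
FROM employees e1
"""

# Same subquery over distinct salaries: the second value is 55000.
distinct_offset_query = """
SELECT DISTINCT department,
       (SELECT DISTINCT salary FROM employees e2
        WHERE e2.department = e1.department
        ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest
FROM employees e1
"""

# DENSE_RANK gives tied salaries the same rank, so rank 2 is 55000.
rank_query = """
WITH ranked_employees AS (
    SELECT employee_name, department, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
    FROM employees
)
SELECT department, salary FROM ranked_employees WHERE dept_rank = 2
"""

offset_rows = conn.execute(offset_query).fetchall()
distinct_rows = conn.execute(distinct_offset_query).fetchall()
rank_rows = conn.execute(rank_query).fetchall()
print(offset_rows, distinct_rows, rank_rows)
```

Which behavior is right depends on whether "second highest" means the second row or the second distinct value, so it is worth deciding that before picking an approach.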


### Two tables

In [20]:
pd.read_sql("""select * from orders""", conn)

Unnamed: 0,id,order_id,customer_id,order_date
0,0,101,1,2023-01-15
1,1,102,1,2023-02-10
2,2,103,2,2023-03-05


In [3]:
pd.read_sql("""select * from customers""", conn)

Then we can use a correlated subquery with `EXISTS` to find the customers who have placed at least one order.

In [137]:
query = """SELECT name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
)"""
pd.read_sql(query, conn)

Unnamed: 0,name
0,John Smith
1,Jane Doe


### A better use case

Let's start with our table of dog foods, each with a brand and a price.

In [21]:
pd.read_sql("""select * from dog_foods""", conn)

Unnamed: 0,id,brand,price
0,0,Acme Dog Food,22
1,1,Puppy Chow,32
2,2,Healthy Paws,38
3,3,Bark Bites,19
4,4,Superior K9,45


And our customer budgets.

In [22]:
pd.read_sql("""select * from customers""", conn)

Unnamed: 0,id,customer_id,name,budget
0,0,1,John Smith,25
1,1,2,Jane Doe,30
2,2,3,Michael Brown,40
3,3,4,Emily Johnson,22
4,4,5,David Lee,50


Now let's find, for each customer, the most expensive dog food that fits within their budget.  In the outer query, we select all of our customers.  Then in the subquery, for each customer, we find the dog foods priced at or below that customer's budget, sort them from highest to lowest price, and return the first one.

In [27]:
query = """SELECT *, 
(select dog_foods.price 
from dog_foods where dog_foods.price <= c.budget
order by price desc limit 1) as dog_food_price
FROM customers c"""

pd.read_sql(query, conn)

Unnamed: 0,id,customer_id,name,budget,dog_food_price
0,0,1,John Smith,25,22
1,1,2,Jane Doe,30,22
2,2,3,Michael Brown,40,38
3,3,4,Emily Johnson,22,22
4,4,5,David Lee,50,45


Here is the query again on its own.  The subquery is correlated because it references `c.budget`, a column from the outer query.

```sql
SELECT *, 
(select dog_foods.price 
from dog_foods where dog_foods.price <= c.budget
order by price desc limit 1) as dog_food_price
FROM customers c
```
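
The query above returns only the price. One possible extension, sketched below under the assumption that we also want the brand, is to add a second correlated subquery with the same filter and ordering. The snippet rebuilds the lesson's `customers` and `dog_foods` rows in an in-memory SQLite database so it can run on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, budget INTEGER)")
conn.execute("CREATE TABLE dog_foods (brand TEXT, price INTEGER)")

# Rows copied from the lesson's customers and dog_foods tables.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "John Smith", 25),
        (2, "Jane Doe", 30),
        (3, "Michael Brown", 40),
        (4, "Emily Johnson", 22),
        (5, "David Lee", 50),
    ],
)
conn.executemany(
    "INSERT INTO dog_foods VALUES (?, ?)",
    [
        ("Acme Dog Food", 22),
        ("Puppy Chow", 32),
        ("Healthy Paws", 38),
        ("Bark Bites", 19),
        ("Superior K9", 45),
    ],
)

# Two correlated subqueries share the same filter and ordering,
# so they pick the brand and the price from the same dog food row.
query = """
SELECT name, budget,
       (SELECT brand FROM dog_foods
        WHERE dog_foods.price <= c.budget
        ORDER BY price DESC LIMIT 1) AS dog_food_brand,
       (SELECT price FROM dog_foods
        WHERE dog_foods.price <= c.budget
        ORDER BY price DESC LIMIT 1) AS dog_food_price
FROM customers c
ORDER BY c.customer_id
"""
matches = conn.execute(query).fetchall()
for row in matches:
    print(row)
```

A customer whose budget is below every price would get `NULL` in both columns, since the subquery would find no rows.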

### Summary

In this lesson, we saw how to write a correlated subquery.  A correlated subquery needs an outer query and a subquery that references a column from that outer query.

The correlated subquery is evaluated for each row in the outer query, and we can use it either as a filter (in the `WHERE` clause) or as a calculation (in the `SELECT` list).

```sql
SELECT employee_name, department, salary,
(SELECT MAX(salary) FROM employees e2 WHERE e2.department = e1.department) AS max_sal -- 2. correlated subquery
FROM employees e1 -- 1. outer query
```

We also saw some use cases for our correlated subquery.  

For example, we saw a query that returns the second-highest salary per department, which we can also find with a ranking window function such as `DENSE_RANK()`.

In [None]:
query = """SELECT employee_name, department, salary,
(SELECT salary FROM employees e2 WHERE e2.department = e1.department
order by salary desc limit 1 offset 1 ) AS second_highest
FROM employees e1 order by department desc, salary desc"""

Finally, we saw how a correlated subquery can relate two tables, matching rows based on the subquery, as we did by finding the priciest dog food within each customer's budget.

### Resources

[Window Fn vs Subqueries](https://www.linkedin.com/pulse/comparing-sql-subqueries-window-functions-differences-siva-kowsika/)