# SQL Subqueries

Just as you might nest one function call inside another in Python, you can nest queries in SQL. A **subquery** is a query placed inside another query, which lets us write a multi-step query succinctly as a single statement.

## Objectives

- Use SQL subqueries to nest queries
- Identify common SQL dialects and tools
- Query data from web databases

In [1]:
import pandas as pd
import sqlite3

In [2]:
conn = sqlite3.connect('data/flights.db')

## Subqueries in `FROM`

You can use a subquery in the `FROM` clause - this is useful, for example, if you want to apply multiple aggregation functions.

Let's say we want the average number of routes departing from an airport. First we'd need to count the routes departing from each airport, then take the average of those counts.

In [5]:
pd.read_sql("SELECT * FROM routes LIMIT 5;", conn)

Unnamed: 0,index,airline,airline_id,source,source_id,dest,dest_id,codeshare,stops,equipment
0,0,2B,410,AER,2965,KZN,2990,,0,CR2
1,1,2B,410,ASF,2966,KZN,2990,,0,CR2
2,2,2B,410,ASF,2966,MRV,2962,,0,CR2
3,3,2B,410,CEK,2968,KZN,2990,,0,CR2
4,4,2B,410,CEK,2968,OVB,4078,,0,CR2


In [3]:
query = '''
SELECT 
    source AS depart_airport, 
    COUNT(*) AS number_of_departures
FROM
    routes
GROUP BY
    source
'''

pd.read_sql(query, conn)

Unnamed: 0,depart_airport,number_of_departures
0,AAE,9
1,AAL,20
2,AAN,2
3,AAQ,3
4,AAR,8
...,...,...
3404,ZUH,60
3405,ZUM,2
3406,ZVK,3
3407,ZYI,15


We can use this query as a subquery, and take the average of the new `number_of_departures` column.

In [4]:
query = '''
SELECT
    AVG(number_of_departures)
FROM (
    SELECT 
        source AS depart_airport,
        COUNT(*) AS number_of_departures
    FROM
        routes
    GROUP BY
        source
)
'''

pd.read_sql(query, conn)

Unnamed: 0,AVG(number_of_departures)
0,19.848343
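
To make the two steps easier to verify by hand, here is a minimal sketch that runs the same pattern on a tiny in-memory database. The table contents and the names `toy_conn` and `toy_routes` are made up for illustration and are not part of `flights.db`; it uses `sqlite3` directly so it runs without pandas.

```python
import sqlite3

# Hypothetical toy data standing in for the routes table: (source, dest)
toy_routes = [
    ("AAA", "BBB"),
    ("AAA", "CCC"),
    ("AAA", "DDD"),
    ("BBB", "AAA"),
    ("CCC", "AAA"),
    ("CCC", "BBB"),
]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE routes (source TEXT, dest TEXT)")
toy_conn.executemany("INSERT INTO routes VALUES (?, ?)", toy_routes)

# Step 1 (the inner query): one row per departure airport with its route count
per_airport = toy_conn.execute("""
    SELECT source AS depart_airport, COUNT(*) AS number_of_departures
    FROM routes
    GROUP BY source
    ORDER BY source
""").fetchall()

# Step 2 (the outer query): average over the rows produced by the inner query
avg_departures = toy_conn.execute("""
    SELECT AVG(number_of_departures)
    FROM (
        SELECT source AS depart_airport, COUNT(*) AS number_of_departures
        FROM routes
        GROUP BY source
    )
""").fetchone()[0]

print(per_airport)     # [('AAA', 3), ('BBB', 1), ('CCC', 2)]
print(avg_departures)  # 2.0
```

The counts 3, 1 and 2 average to 2.0, which is what the outer query returns.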


## Note: Subqueries are Like New Tables!

If you squint, you'll notice that the subquery sits exactly where we would normally put a table name!

For example, check out the SQL we wrote in our first subquery example:

```sql
SELECT 
    AVG(number_of_departures)
FROM (
    SELECT 
        source AS depart_airport,
        COUNT(*) AS number_of_departures
    FROM
        routes
    GROUP BY
        source
)
```

We could imagine that the result of the subquery existed as a table of its own (let's call it `airport_departures`) and put that table name in place of the subquery:

```sql
SELECT 
    AVG(number_of_departures)
FROM
    airport_departures -- Replacing subquery with this hypothetical table
```

You can actually use syntax close to this with **Common Table Expressions (CTEs)**, covered in a section below.
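
As a preview, here is a minimal sketch of the same average written with a CTE, so that `airport_departures` really can be used like a table name. It runs against a small in-memory database with made-up rows; `cte_conn`, `cte_query` and `cte_avg` are illustrative names, not part of the lesson's `flights.db`.

```python
import sqlite3

# Hypothetical toy routes table, just enough rows to see the pattern
cte_conn = sqlite3.connect(":memory:")
cte_conn.execute("CREATE TABLE routes (source TEXT, dest TEXT)")
cte_conn.executemany(
    "INSERT INTO routes VALUES (?, ?)",
    [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "AAA"), ("CCC", "AAA")],
)

cte_query = """
-- Name the subquery result, then select from it like a table
WITH airport_departures AS (
    SELECT
        source AS depart_airport,
        COUNT(*) AS number_of_departures
    FROM
        routes
    GROUP BY
        source
)
SELECT
    AVG(number_of_departures)
FROM
    airport_departures
"""

# Counts are AAA=2, BBB=1, CCC=1, so the average is 4/3
cte_avg = cte_conn.execute(cte_query).fetchone()[0]
print(cte_avg)
```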

## Subqueries in `WHERE`

You can use a subquery in the `WHERE` clause - this is useful, for example, if you want to filter a query based on results from another query.

Let's say that we want a table with the departure and destination airports for all flight routes, but only for flights departing from the five countries with the most airports.

To do this, we'd first need to identify the five countries that have the most airports. 

In [8]:
query = '''
SELECT 
    country, 
    COUNT(*) AS number_of_airports_in_country
FROM
    airports
GROUP BY
    country
ORDER BY
    number_of_airports_in_country DESC
LIMIT 6
'''

pd.read_sql(query, conn)

Unnamed: 0,country,number_of_airports_in_country
0,United States,1697
1,Canada,435
2,Germany,321
3,Australia,263
4,Russia,249
5,France,233


We could enter the top five of these results manually into a new query of the routes table to get the data we want.

In [7]:
query = '''
SELECT 
    rt.source AS depart_airport,
    rt.dest AS destination_airport,
    ap.country AS depart_country
FROM
    routes AS rt
    LEFT JOIN airports AS ap
        ON rt.source_id = ap.id
WHERE 
    ap.country IN (
        'United States', 
        'Canada', 
        'Germany', 
        'Australia', 
        'Russia'
    )
ORDER BY 
    depart_country
'''

pd.read_sql(query, conn)

Unnamed: 0,depart_airport,destination_airport,depart_country
0,DRW,SIN,Australia
1,PER,SIN,Australia
2,MEL,CTU,Australia
3,SYD,CKG,Australia
4,ADL,BNE,Australia
...,...,...,...
20330,SOW,FMN,United States
20331,SOW,PHX,United States
20332,SVC,PHX,United States
20333,VIS,LAX,United States


This approach works but has a few limitations:

- We have to manually enter the countries to filter them
- The list of countries won't update with our data, so we'd have to monitor and manually change them in the future
- We have to look at two separate queries to understand what our code is supposed to do
- We have to run two separate queries, which might take longer than one combined query

A better solution uses a subquery to get the list of 5 countries and feed it into our `WHERE` clause.

In [11]:
query = '''
SELECT 
    rt.source AS depart_airport,
    rt.dest AS destination_airport,
    ap.country AS depart_country
FROM
    routes AS rt
    LEFT JOIN airports AS ap
        ON rt.source_id = ap.id
WHERE ap.country IN (
-- Subquery to get the 5 countries with the most airports
    SELECT 
        country
    FROM 
        airports
    GROUP BY 
        country
    ORDER BY 
        COUNT(*) DESC
    LIMIT 5
)

ORDER BY
    depart_country
'''

pd.read_sql(query, conn)

Unnamed: 0,depart_airport,destination_airport,depart_country
0,DRW,SIN,Australia
1,PER,SIN,Australia
2,MEL,CTU,Australia
3,SYD,CKG,Australia
4,ADL,BNE,Australia
...,...,...,...
20330,SOW,FMN,United States
20331,SOW,PHX,United States
20332,SVC,PHX,United States
20333,VIS,LAX,United States
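
One benefit of the subquery is that the filter follows the data. The sketch below illustrates this on a small in-memory database: the same query text returns different rows once more airports are added, with no manual edits. The country names, airport names, and the variables `demo` and `top_country_routes` are invented for this illustration, and `LIMIT 1` is used instead of `LIMIT 5` to keep the data small.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE airports (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE routes (source_id INTEGER, dest_id INTEGER);
    INSERT INTO airports VALUES
        (1, 'N1', 'Northland'),
        (2, 'N2', 'Northland'),
        (3, 'S1', 'Southland');
    INSERT INTO routes VALUES (1, 3), (2, 3), (3, 1);
""")

# Routes departing from the country that currently has the most airports
top_country_routes = """
SELECT ap.name, ap.country
FROM routes AS rt
JOIN airports AS ap
    ON rt.source_id = ap.id
WHERE ap.country IN (
    SELECT country
    FROM airports
    GROUP BY country
    ORDER BY COUNT(*) DESC
    LIMIT 1
)
ORDER BY ap.name
"""

before = demo.execute(top_country_routes).fetchall()

# Southland gains two airports and overtakes Northland
demo.executescript("""
    INSERT INTO airports VALUES (4, 'S2', 'Southland'), (5, 'S3', 'Southland');
""")
after = demo.execute(top_country_routes).fetchall()

print(before)  # [('N1', 'Northland'), ('N2', 'Northland')]
print(after)   # [('S1', 'Southland')]
```

With a hard-coded list of countries, the second result would still show Northland until someone updated the query by hand.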


## Common Table Expressions

Common Table Expressions (CTEs) are a more readable way to implement subqueries, using `WITH` and `AS`.

In [None]:
query = '''
-- Basically creating a new table named top_5_countries
WITH top_5_countries AS (
    SELECT 
        country 
    FROM 
        airports
    GROUP BY 
        country
    ORDER BY 
        COUNT(*) DESC
    LIMIT 5
) 

SELECT 
    rt.source AS depart_airport,
    rt.dest AS destination_airport,
    ap.country AS depart_country
FROM
    routes AS rt
    LEFT JOIN airports AS ap
        ON rt.source_id = ap.id
WHERE 
    ap.country IN (SELECT country FROM top_5_countries)
ORDER BY 
    depart_country
'''

pd.read_sql(query, conn)
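
CTEs can also be chained, with a later CTE reading from an earlier one, which is where the readability gain tends to show. The following is a minimal sketch on made-up in-memory data that finds airports with an above-average number of departures. The names `multi`, `above_avg_query`, `departures` and `overall_average` are illustrative assumptions, not part of the lesson's database.

```python
import sqlite3

multi = sqlite3.connect(":memory:")
multi.execute("CREATE TABLE routes (source TEXT, dest TEXT)")
multi.executemany(
    "INSERT INTO routes VALUES (?, ?)",
    [
        ("AAA", "BBB"), ("AAA", "CCC"), ("AAA", "DDD"),
        ("BBB", "AAA"),
        ("CCC", "AAA"), ("CCC", "BBB"),
    ],
)

above_avg_query = """
-- First CTE: departures per airport
WITH departures AS (
    SELECT source AS depart_airport, COUNT(*) AS number_of_departures
    FROM routes
    GROUP BY source
),
-- Second CTE: a single row holding the average of the first CTE
overall_average AS (
    SELECT AVG(number_of_departures) AS avg_departures
    FROM departures
)
SELECT depart_airport, number_of_departures
FROM departures
CROSS JOIN overall_average
WHERE number_of_departures > avg_departures
ORDER BY depart_airport
"""

busy_airports = multi.execute(above_avg_query).fetchall()
print(busy_airports)  # [('AAA', 3)]
```

Here `departures` is defined once and used twice, which would require repeating the whole subquery if it were written inline.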

### Exercise

Create a table listing all airlines that serve the three airports with the most outbound routes.

In [None]:
# Your work here


# Web Databases: data.world

For the rest of this lesson, we'll be exploring databases in [data.world](https://data.world/), a web database that we can query using SQL in our browser. For reference, you can see the instructions for creating a new project here: [Getting Started Working with Data at data.world](https://help.data.world/hc/en-us/articles/360008853693-Getting-started-guide#working_with_data).

## Step 1: Create a data.world account

You will need to enter and verify your email address.

## Step 2: Create a project using this [Austin AirBnB](https://data.world/jonloyens/inside-airbnb-austin) dataset

Navigate to [this page](https://data.world/jonloyens/inside-airbnb-austin) and use the button at the top right of the page to create a new project using the data.

![](images/data_world_austin_airbnb_new_proj.png)

## Step 3: Create a SQL query

In your project, use the "+ Add" button to add a SQL query.

![](images/data_world_add_sql_query.png)

## Step 4: Run a simple SQL query

Try entering `SELECT * FROM calendar LIMIT 5;` and clicking the "Run Query" button in the top right.

## Step 5: Practice using SQL to explore the data

Below are some exercises to practice your SQL skills and help explore the data. You will need to explore the table schemas to complete these exercises. 

You may find it helpful to click on the corresponding .csv files to inspect the data, or look at the columns in each table in the right sidebar.

### Exercise 1: Create a table showing the number of listings in each neighborhood

In [None]:
'''
SELECT neighbourhood, COUNT(*) AS neighborhood_count
FROM listings
GROUP BY neighbourhood
ORDER BY neighborhood_count DESC;
'''

### Exercise 2: Create a table showing the 20 listings with the most reviews

In [None]:
'''
SELECT reviews.listing_id, COUNT(reviews.listing_id) AS review_count
FROM reviews
GROUP BY reviews.listing_id
ORDER BY review_count DESC
LIMIT 20;
'''
# Or
'''
SELECT id, number_of_reviews
FROM listings
ORDER BY number_of_reviews DESC
LIMIT 20;
'''

### Exercise 3: Create a table showing all of the reviews for listings that are "Bed & Breakfast" property types

In [None]:
'''
SELECT r.id, r.comments
FROM reviews AS r
JOIN listings AS l
    ON l.id = r.listing_id
WHERE property_type = "Bed & Breakfast";
'''

### Exercise 4: Run your own query using a subquery or CTE.

Note that the syntax and functionality for subqueries and CTEs in data.world are more limited than SQLite, so try creating simple ones.

-----

## Extra Resources: More Practice

Want more practice? See if you can solve the mystery in the [SQL Murder Mystery game](https://mystery.knightlab.com/index.html)!

[Kaggle](https://www.kaggle.com/learn/intro-to-sql) and [Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql) both have short free courses on SQL. The Kaggle one has you practice by connecting to a Google BigQuery database, and the Khan Academy one has resources about creating and updating databases.

## Extra: SQL Versions

There is no one version of SQL - there are many versions out there! Most of what you're learning about SQL with SQLite will carry over to all of them. Just keep in mind when you apply for jobs that you may see any of these listed in a job posting, and they are all variations on what you already know.

## SQL Dialects

As with dialects of spoken languages, SQL dialects have many commonalities but some differences in syntax and functionality.  Here are a few of the major players:

- SQLite (we've already seen this!)
- PostgreSQL (free and open-source!)
- Oracle SQL
- MySQL (open-source, owned by Oracle)
- Microsoft SQL Server
- Transact-SQL (T-SQL, Microsoft's extension of SQL used by SQL Server)
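
As one concrete example of a syntax difference, limiting the number of rows returned is written differently across dialects. The sketch below lists three common spellings as plain strings and runs only the SQLite one, against a made-up in-memory `airports` table. The names `limit_examples` and `dialect_conn` are illustrative, and the non-SQLite strings are shown for comparison only and are not executed here.

```python
import sqlite3

# The same "first 2 rows" request written for different dialects
limit_examples = {
    "SQLite / PostgreSQL / MySQL":
        "SELECT name FROM airports ORDER BY name LIMIT 2",
    "SQL Server (Transact-SQL)":
        "SELECT TOP 2 name FROM airports ORDER BY name",
    "Oracle (12c and later)":
        "SELECT name FROM airports ORDER BY name FETCH FIRST 2 ROWS ONLY",
}

# Hypothetical toy table to run the SQLite version against
dialect_conn = sqlite3.connect(":memory:")
dialect_conn.execute("CREATE TABLE airports (name TEXT)")
dialect_conn.executemany(
    "INSERT INTO airports VALUES (?)", [("C",), ("A",), ("B",)]
)

# Only the LIMIT form is run here, since this connection is SQLite
first_two = dialect_conn.execute(
    limit_examples["SQLite / PostgreSQL / MySQL"]
).fetchall()
print(first_two)  # [('A',), ('B',)]
```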

## Extra Resources: SQL Versions

[What Is a SQL Dialect, and Which one Should You Learn?](https://learnsql.com/blog/what-sql-dialect-to-learn/)

[SQLite vs MySQL vs PostgreSQL](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems)

[SQL Dialect Reference](https://en.wikibooks.org/wiki/SQL_Dialects_Reference)