![sql](images/sql-logo.jpg)

## Objectives

- Use SQL aggregation functions with GROUP BY
- Use HAVING for group filtering
- Use SQL JOIN to combine tables using keys

In [None]:
import pandas as pd
import sqlite3

conn = sqlite3.connect("data/flights.db")
cur = conn.cursor()

## A Quick Note about Execution Order!

![query execution order, from https://wizardzines.com/comics/sql-query-order/](images/sql-query-order.png)

[[Image Source]](https://wizardzines.com/comics/sql-query-order/)

# Aggregating Functions

>  A SQL **aggregating function** takes in many values and returns one value.

We have already seen some SQL aggregating functions like `COUNT()`. There are also others, like `SUM()`, `AVG()`, `MIN()`, and `MAX()`.
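As a rough illustration of how these functions collapse many rows into one value, here is a minimal sketch on a throwaway in-memory database. The table `airports_demo` and its rows are made up for this example and are not the real `airports` schema. The queries in this notebook write `COUNT()`, which SQLite accepts; `COUNT(*)` is the more portable spelling and is used here.

```python
import sqlite3

# Hypothetical toy table, NOT the real flights.db schema
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE airports_demo (name TEXT, longitude REAL)")
demo.executemany(
    "INSERT INTO airports_demo VALUES (?, ?)",
    [("A", -73.8), ("B", 2.5), ("C", 139.8), ("D", None)],
)

# Each aggregate takes the whole column and returns a single value
row = demo.execute("""
    SELECT
        COUNT(*)         AS n_rows,      -- counts every row
        COUNT(longitude) AS n_non_null,  -- skips NULL values
        MIN(longitude)   AS min_lon,
        MAX(longitude)   AS max_lon,
        AVG(longitude)   AS avg_lon      -- NULLs are ignored here too
    FROM airports_demo
""").fetchone()

print(row)
```

Note that `COUNT(*)` and `COUNT(column)` can differ: the latter ignores rows where the column is `NULL`.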

### Example Simple Aggregations

Find the max value for `longitude` in the `airports` table:

In [None]:
query = '''

'''

pd.read_sql(query, conn)

Find the max value for `id` in the `airports` table:

In [None]:
query = '''

'''

pd.read_sql(query, conn)

Find the count for all inactive airlines:

In [None]:
query = '''

'''

pd.read_sql(query, conn)

## Grouping in SQL

We can go deeper and use aggregation functions on _groups_ using the `GROUP BY` clause.

The `GROUP BY` clause splits the rows into groups that share the same value in one or more columns, then applies the aggregation function to each group separately.

## Example `GROUP BY`  Statements

Let's say we want to know how many active and non-active airlines there are.

### Without `GROUP BY`

Let's first start with just seeing how many airlines there are:

In [None]:
query = '''
    SELECT 
        COUNT() AS "Number of Airlines"
    FROM 
        airlines
'''

pd.read_sql(query, conn)

One way for us to get the counts for each is to create two queries that will filter each kind of airline (active vs non-active) and count those values:

In [None]:
active_query = '''
    SELECT 
        COUNT() AS "Number of Active Airlines"
    FROM 
        airlines
    WHERE 
        active='Y'
'''

not_active_query = '''
    SELECT 
        COUNT() AS "Number of Non Active Airlines"
    FROM 
        airlines
    WHERE 
        active='N'
'''

display(pd.read_sql(active_query, conn))
display(pd.read_sql(not_active_query, conn))

This works, but it's inefficient: we have to run one query per category, and we need to know every category in advance.

### With `GROUP BY`

Instead, we can have the database do the work for us by grouping the rows on the column we care about!

In [None]:
query = '''
    SELECT 
        COUNT() AS number_of_airlines
    FROM 
        airlines
    GROUP BY
        active
'''

pd.read_sql(query, conn)

This is great! And if you look closely, you can observe we have _three_ different groups instead of our expected two!

Let's also print out the `active` column values for each group/aggregation so we know what we're looking at:

In [None]:
query = '''
    SELECT 
        active,
        COUNT() AS number_of_airlines
    FROM 
        airlines
    GROUP BY
        active
'''

pd.read_sql(query, conn)

What do we think this extra category captures? Can we filter those out?
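One possible way to approach this, sketched on made-up data: the toy table `airlines_demo` below assumes the unexpected group comes from an inconsistent value (a lowercase `'n'`). That is only an assumption for illustration; inspect the real `active` column before deciding how to handle it.

```python
import sqlite3

# Hypothetical toy data with one inconsistent value ('n')
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE airlines_demo (name TEXT, active TEXT)")
demo.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [("A", "Y"), ("B", "Y"), ("C", "N"), ("D", "n"), ("E", "N")],
)

# Grouping on the raw column produces three groups
raw_groups = demo.execute("""
    SELECT active, COUNT(*) AS number_of_airlines
    FROM airlines_demo
    GROUP BY active
    ORDER BY active
""").fetchall()

# WHERE runs before GROUP BY, so unwanted rows never reach a group
filtered_groups = demo.execute("""
    SELECT active, COUNT(*) AS number_of_airlines
    FROM airlines_demo
    WHERE active IN ('Y', 'N')
    GROUP BY active
    ORDER BY active
""").fetchall()

print(raw_groups)
print(filtered_groups)
```

Filtering with `WHERE` drops the odd rows entirely; whether they should instead be folded into an existing category is a data-cleaning decision.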

## Exercises

### Question 1:

Which countries have the highest numbers of active airlines? Return the top 10.

> Note that the `GROUP BY` clause is evaluated _before_ the `ORDER BY` and `LIMIT` clauses, so sorting and limiting apply to the grouped results.

In [None]:
query = '''

'''

pd.read_sql(query, conn)

### Question 2:

How many airports are in each time zone?

In [None]:
query = '''

'''

pd.read_sql(query, conn)

## Filtering Groups with `HAVING`

We showed that you can filter tables with `WHERE`. We can similarly filter _groups/aggregations_ using `HAVING` clauses.
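The difference between the two is when they run: `WHERE` removes individual rows before grouping, while `HAVING` removes whole groups after aggregation. A minimal sketch of that distinction, using a made-up `airlines_demo` table (hypothetical names and rows, not the real data):

```python
import sqlite3

# Hypothetical toy data: three airlines per country
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE airlines_demo (name TEXT, country TEXT, active TEXT)")
demo.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?, ?)",
    [
        ("A", "US", "Y"), ("B", "US", "Y"), ("C", "US", "N"),
        ("D", "FR", "Y"), ("E", "FR", "N"), ("F", "FR", "N"),
    ],
)

# WHERE first drops inactive rows, THEN groups are counted and filtered
active_then_filtered = demo.execute("""
    SELECT country, COUNT(*) AS num
    FROM airlines_demo
    WHERE active = 'Y'
    GROUP BY country
    HAVING COUNT(*) >= 2
    ORDER BY country
""").fetchall()

# Without WHERE, every row is counted before HAVING filters the groups
all_then_filtered = demo.execute("""
    SELECT country, COUNT(*) AS num
    FROM airlines_demo
    GROUP BY country
    HAVING COUNT(*) >= 3
    ORDER BY country
""").fetchall()

print(active_then_filtered)
print(all_then_filtered)
```

In the first query FR has only one active airline left after `WHERE`, so its group fails the `HAVING` condition.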

## Examples of  `HAVING`

### Simple Filtering - Number of Airlines in a Country

Let's come back to the count of active airlines in each country:

In [None]:
query = '''
    SELECT 
        COUNT() AS num,
        country
    FROM 
        airlines
    WHERE 
        active='Y'
    GROUP BY 
        country
    ORDER BY 
        num DESC
'''

pd.read_sql(query, conn)

We can see we have a lot of results. But maybe we only want to keep the countries that have more than $30$ active airlines:

In [None]:
query = '''
    SELECT 
        country,
        COUNT() AS num
    FROM 
        airlines
    WHERE 
        active='Y'
    GROUP BY 
        country
    HAVING
        num > 30
    ORDER BY 
        num DESC
'''

pd.read_sql(query, conn)

## Filtering Different Aggregations - Airport Altitudes

We can also filter on other aggregations. For example, let's say we want to investigate the `airports` table.

Specifically, we want to know the height of the _highest airport_ in a country given that it has _at least $100$ airports_.

### Looking at the `airports` Table

In [None]:
query = '''
    SELECT 
        *
    FROM 
        airports 
'''
pd.read_sql(query, conn).head()

### Looking at the Highest Airport

Let's first get the highest airport altitude in each country:

In [None]:
query = '''
    SELECT 
        country,
        MAX(CAST(altitude AS int)) AS highest_airport_in_country
    FROM 
        airports 
    GROUP BY
        country
    ORDER BY
        country
'''

pd.read_sql(query, conn)
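The `CAST(altitude AS int)` in the query above matters if `altitude` is stored as text, which the cast suggests but which should be checked against the real schema. On text values, `MAX()` compares strings character by character rather than numerically. A small sketch with a made-up table `airports_demo`:

```python
import sqlite3

# Hypothetical toy table where altitude is stored as TEXT
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE airports_demo (country TEXT, altitude TEXT)")
demo.executemany(
    "INSERT INTO airports_demo VALUES (?, ?)",
    [("X", "9"), ("X", "250"), ("X", "10")],
)

# Text comparison: '9' sorts after '250' because '9' > '2'
max_as_text = demo.execute("""
    SELECT country, MAX(altitude)
    FROM airports_demo
    GROUP BY country
""").fetchall()

# Numeric comparison after casting each value to an integer
max_as_int = demo.execute("""
    SELECT country, MAX(CAST(altitude AS int))
    FROM airports_demo
    GROUP BY country
""").fetchall()

print(max_as_text)
print(max_as_int)
```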

### Looking at the Number of Airports Too

We can also get the number of airports for each country.

In [None]:
query = '''
    SELECT 
        country,
        MAX(CAST(altitude AS int)) AS highest_airport_in_country,
        COUNT() AS number_of_airports_in_country
    FROM
        airports 
    GROUP BY
        country
    ORDER BY
        country
'''

pd.read_sql(query, conn)

### Filtering on Aggregations

> Recall:
>
> We want to know the height of the _highest airport_ in a country given that it has _at least $100$ airports_.

In [None]:
query = '''
    SELECT 
        country,
        MAX(CAST(altitude AS int)) AS highest_airport_in_country
        -- Note we don't have to include this in our SELECT to use it to filter!
        --,COUNT() AS number_of_airports_in_country
    FROM
        airports 
    GROUP BY
        country
    HAVING
        COUNT() >= 100
    ORDER BY
        country
'''

pd.read_sql(query, conn)

# Joins

The biggest advantage of using a relational database (like the one we've been querying with SQL) is that you can create **joins**.

> By using **`JOIN`** in our query, we can connect different tables using their _relationships_ to other tables.
>
> Usually we use a key (*foreign key*) to tell us how the two tables are related.

There are different types of joins, each with its own use case, because a join can both **add** data to a table (columns from the other table) and **remove** data from a table (rows with no match).

![venn](images/venn.png)

## `INNER JOIN`

> An **inner join** will join two tables together and only keep rows whose _key is in both tables_.

![](images/inner_join.png)

Example of an inner join:

```sql
SELECT
    table1.column_name,
    table2.different_column_name
FROM
    table1
    INNER JOIN table2
        ON table1.shared_column_name = table2.shared_column_name
```
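To make the template concrete, here is a minimal sketch on two made-up tables, `routes_demo` and `airlines_demo` (hypothetical names and rows, much simpler than the real tables). One route points at an airline id that does not exist, and one airline has no routes.

```python
import sqlite3

# Hypothetical toy tables for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE routes_demo (airline_id INTEGER, source TEXT, dest TEXT)")
demo.execute("CREATE TABLE airlines_demo (id INTEGER, name TEXT)")
demo.executemany(
    "INSERT INTO routes_demo VALUES (?, ?, ?)",
    [(1, "JFK", "LHR"), (2, "CDG", "NRT"), (99, "SFO", "SEA")],  # 99 has no airline
)
demo.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [(1, "Alpha Air"), (2, "Beta Jet"), (3, "Gamma")],  # 3 has no routes
)

# Only rows whose key appears in BOTH tables survive the inner join
joined = demo.execute("""
    SELECT r.source, r.dest, a.name
    FROM routes_demo AS r
    INNER JOIN airlines_demo AS a
        ON r.airline_id = a.id
    ORDER BY r.source
""").fetchall()

print(joined)
```

The SFO to SEA route and the airline "Gamma" both disappear from the result because neither has a partner in the other table.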

### Code Example for Inner Joins

Let's say we want to look at the different airplane routes:

In [None]:
query = '''
    SELECT 
        *
    FROM
        routes 
'''

pd.read_sql(query, conn)

This is great but notice the `airline_id` column. It'd be nice to have some more information about the airlines associated with these routes.

We can do an **inner join** to get this information!

#### Inner Join Routes & Airline Data

In [None]:
query = '''
    SELECT 
        *
    FROM
        routes
    INNER JOIN airlines
        ON routes.airline_id = airlines.id
'''

pd.read_sql(query, conn)

We can also specify that we want to retain only certain columns in the `SELECT` clause:

In [None]:
query = '''
    SELECT 
        routes.source AS departing,
        routes.dest AS destination,
        routes.stops AS stops_before_destination,
        airlines.name AS airline_name
    FROM
        routes
        INNER JOIN airlines
            ON routes.airline_id = airlines.id
'''

pd.read_sql(query, conn)

We can also alias the tables to make the queries a bit easier to write!

In [None]:
query = '''
    SELECT 
        r.source AS departing,
        r.dest AS destination,
        r.stops AS stops_before_destination,
        a.name AS airline_name
    FROM
        routes AS r
        INNER JOIN airlines AS a
            ON r.airline_id = a.id
'''

pd.read_sql(query, conn)

#### Note: Losing Data with Inner Joins

Since rows are kept only if _both_ tables have the key, some data can be lost:

In [None]:
df_all_routes = pd.read_sql('''
    SELECT 
        *
    FROM
        routes
''', conn)

df_routes_after_join = pd.read_sql('''
    SELECT 
        *
    FROM
        routes
        INNER JOIN airlines
            ON routes.airline_id = airlines.id
''', conn)

In [None]:
# Compare the number of rows before and after the inner join
df_all_routes.shape, df_routes_after_join.shape

If you want to keep your data from at least one of your tables, you should use a left join instead of an inner join.

## `LEFT JOIN`

> A **left join** will join two tables together but will keep all rows from the first (left) table, even when the key has no match in the second table.

![](images/left_join.png)

Example of a left join:

```sql
SELECT
    table1.column_name,
    table2.different_column_name
FROM
    table1
    LEFT JOIN table2
        ON table1.shared_column_name = table2.shared_column_name
```

### Code Example for Left Join

If we wanted to ensure we always had every route even if the key in `airlines` was not found, we could replace our `INNER JOIN` with a `LEFT JOIN`:

In [None]:
# This will include all the data from routes
df_routes_after_left_join = pd.read_sql('''
    SELECT 
        *
    FROM
        routes
        LEFT JOIN airlines
            ON routes.airline_id = airlines.id
''', conn)

df_routes_after_left_join.shape
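A left join fills the right-hand columns with `NULL` wherever no match was found, which also gives a simple way to list the unmatched rows. The sketch below uses two made-up tables, `routes_demo` and `airlines_demo` (hypothetical data, not the real tables):

```python
import sqlite3

# Hypothetical toy tables for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE routes_demo (airline_id INTEGER, source TEXT, dest TEXT)")
demo.execute("CREATE TABLE airlines_demo (id INTEGER, name TEXT)")
demo.executemany(
    "INSERT INTO routes_demo VALUES (?, ?, ?)",
    [(1, "JFK", "LHR"), (2, "CDG", "NRT"), (99, "SFO", "SEA")],  # 99 has no airline
)
demo.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [(1, "Alpha Air"), (2, "Beta Jet")],
)

# Every route is kept; the unmatched one gets NULL (None in Python)
left_joined = demo.execute("""
    SELECT r.source, r.dest, a.name
    FROM routes_demo AS r
    LEFT JOIN airlines_demo AS a
        ON r.airline_id = a.id
    ORDER BY r.source
""").fetchall()

# Keep only the routes that found no matching airline
unmatched = demo.execute("""
    SELECT r.source, r.dest
    FROM routes_demo AS r
    LEFT JOIN airlines_demo AS a
        ON r.airline_id = a.id
    WHERE a.id IS NULL
""").fetchall()

print(left_joined)
print(unmatched)
```

Checking `IS NULL` on the right table's key is a common way to find exactly the rows an inner join would have dropped.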

## Exercise: 

Which airline has the most routes listed in our database?

In [None]:
query = '''
    
'''

pd.read_sql(query, conn)