# Database 2
## Last lecture: SQL query clauses
- FROM: table
- SELECT: columns
- WHERE: row condition (a boolean expression); rows are filtered before LIMIT is applied
    - boolean operators: AND, OR, NOT
    - AND / OR: can be used to combine conditions
- LIMIT: simple limitation of the number of rows
- ORDER BY: sorting
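
As a quick refresher, the clauses above can be combined roughly as follows. This is a minimal sketch on a hypothetical in-memory `films` table (the table name, columns, and rows are made up for illustration; the real `movies.db` schema may differ). It uses a plain `sqlite3` cursor so that it runs on its own.

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (title TEXT, year INTEGER, rating REAL)")
demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 2015, 7.1), ("B", 2016, 8.3), ("C", 2016, 6.4), ("D", 2016, 9.0)],
)

# WHERE filters rows first, ORDER BY sorts the remaining rows,
# and LIMIT then keeps only the first 2 of the sorted rows.
top_2016 = demo.execute("""
    SELECT title, rating
    FROM films
    WHERE year = 2016 AND NOT rating < 7
    ORDER BY rating DESC
    LIMIT 2
""").fetchall()
print(top_2016)
demo.close()
```

Because filtering happens before LIMIT, the limit applies to the matching rows only, not to the first two rows of the table.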

## Today's lecture:
- aggregation: SUM, AVG, COUNT, MIN, MAX
- GROUP BY: equivalent to bucketization; each row belongs to exactly one bucket
- HAVING: applies a condition to groups

In [None]:
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import HTML
HTML('<style>em { color: red; }</style>')

In [None]:
# import statements
import sqlite3
import pandas as pd
import os

In [None]:
movies_path = "movies.db"
assert os.path.exists(movies_path)

c = sqlite3.connect(movies_path)
c

In [None]:
pd.read_sql("select * from sqlite_master", c)

In [None]:
pd.read_sql("select * from movies", c).head(5)

In [None]:
def qry(sql, conn = c):
    return pd.read_sql(sql, conn)

In [None]:
qry("""
SELECT *
FROM movies
""")

# Review: Simple Selections

SQL query outline format:

SELECT<br>
FROM<br>
WHERE<br>
ORDER BY<br>
LIMIT<br>

### Which *movie* has the *highest rating*?

In [None]:
qry("""

""")

### Which *director* made the *shortest movie*?

In [None]:
qry("""

""")

### Which *director* made the *highest-revenue movie*?

In [None]:
qry("""

""")

### Which *movie* had the *highest revenues* in *2016*?

In [None]:
qry("""

""")

### Which *3 movies* had the *highest revenues* in *2016*?

In [None]:
qry("""

""")

### Which *3 movies* have the *highest rating-to-revenue ratios*?

Introduce `AS`
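
A minimal sketch of what `AS` does, assuming a hypothetical in-memory `films` table (names and values are invented for illustration and are not the real `movies.db` data). The alias names the computed column and can be reused in `ORDER BY`.

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (title TEXT, rating REAL, revenue REAL)")
demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 8.0, 100.0), ("B", 6.0, 20.0), ("C", 9.0, 300.0)],
)

# "AS ratio" gives the computed expression a column name.
cur = demo.execute("""
    SELECT title, rating / revenue AS ratio
    FROM films
    ORDER BY ratio DESC
    LIMIT 2
""")
column_names = [d[0] for d in cur.description]  # names of the result columns
rows = cur.fetchall()
print(column_names)
print(rows)
demo.close()
```

Without the alias, the result column would be labeled with the raw expression text, which is harder to read and to refer to.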

In [None]:
qry("""

""")

# Aggregate Queries

```
SUM, AVG, COUNT, MIN, MAX
```
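
To see what these functions return, here is a small sketch on a hypothetical in-memory `films` table (the schema and rows are assumptions for illustration, not the real `movies.db`). Each aggregate collapses all selected rows into a single value, and `COUNT(DISTINCT ...)` counts each distinct value once.

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (title TEXT, director TEXT, revenue REAL)")
demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", "Ann", 10.0), ("B", "Ann", 30.0), ("C", "Bob", 20.0)],
)

# One output row, no matter how many input rows there are.
row = demo.execute("""
    SELECT COUNT(*),
           COUNT(director),           -- counts "Ann" twice
           COUNT(DISTINCT director),  -- counts each director once
           SUM(revenue),
           MIN(revenue),
           MAX(revenue)
    FROM films
""").fetchone()
print(row)  # (3, 3, 2, 60.0, 10.0, 30.0)
demo.close()
```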

SQL query outline format:

SELECT<br>
FROM<br>
WHERE<br>
ORDER BY<br>
LIMIT<br>

### How many *movies* are there?

In [None]:
qry("""

""")

### How many *directors* are there?

In [None]:
# This doesn't feel correct - it counts duplicates for director names!
qry("""

""")

In [None]:
qry("""

""")

### What is the *total revenue* of *all the movies*?

In [None]:
qry("""

""")

### What is the *average rating* across *all movies*?

* v1: with `SUM` and `COUNT`
* v2: with `AVG`
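
One difference between the two versions is worth a sketch. In SQLite, dividing one integer by another truncates, so `SUM / COUNT` on an integer column can silently lose the fractional part, while `AVG` always returns a floating-point value. The example below uses a hypothetical integer `runtime` column in an in-memory table (invented data, not the real `movies.db`).

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (title TEXT, runtime INTEGER)")
demo.executemany("INSERT INTO films VALUES (?, ?)", [("A", 90), ("B", 101)])

v1_int, v1_float, v2 = demo.execute("""
    SELECT SUM(runtime) / COUNT(*),        -- integer division: truncates
           1.0 * SUM(runtime) / COUNT(*),  -- forces floating-point division
           AVG(runtime)                    -- always floating point
    FROM films
""").fetchone()
print(v1_int, v1_float, v2)  # 95 95.5 95.5
demo.close()
```

When the column already holds floating-point values, as a rating column might, v1 and v2 agree.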

In [None]:
# v1
qry("""

""")

In [None]:
# v2
qry("""

""")

### What is the *average revenue* and *average runtime* of *all the movies*?

In [None]:
qry("""

""")

### What is the *average runtime* for a *James Gunn* movie?

In [None]:
qry("""

""")

### What is the *average revenue* for a *Ridley Scott* movie?

In [None]:
qry("""

""")

### *How many movies* were there in *2016*?

In [None]:
qry("""

""")

### What *percentage* of the *total revenue* came from the *highest-revenue movie*?

In [None]:
qry("""

""")

### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2016*?

In [None]:
qry("""

""")

### Follow up question: *which movie* was this?

In [None]:
qry("""

""")

# GROUP BY Queries

```sql
SELECT ???, ??? FROM Movies
GROUP BY ???
```
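
One way the template above might be filled in, sketched on a hypothetical in-memory `films` table (invented schema and rows, not the real `movies.db`). Each row falls into exactly one bucket, determined by its `year`, and the aggregates are computed once per bucket.

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (director TEXT, year INTEGER, revenue REAL)")
demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Ann", 2015, 10.0), ("Bob", 2015, 5.0), ("Ann", 2016, 30.0)],
)

# Selecting the grouping column labels each aggregate with its bucket.
per_year = demo.execute("""
    SELECT year, SUM(revenue) AS total, COUNT(*) AS n
    FROM films
    GROUP BY year
    ORDER BY year
""").fetchall()
print(per_year)  # [(2015, 15.0, 2), (2016, 30.0, 1)]
demo.close()
```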

SQL query outline format:

SELECT<br>
FROM<br>
WHERE<br>
GROUP BY<br>
ORDER BY<br>
LIMIT<br>

### What is the *total revenue* for each *year*?

* v1: the amounts
* v2: the amounts, as labeled by year

In [None]:
# v1
qry("""

""")

In [None]:
# v2
qry("""

""")

### *How many movies* were by each *director*?

In [None]:
qry("""

""")

### What is the *average rating* for each *director*?

In [None]:
qry("""

""")

### What is the *average runtime* for each *director*?

In [None]:
qry("""

""")

### How many *unique directors* created a movie in each *year*?

In [None]:
qry("""

""")

# Combining GROUP BY with other CLAUSES

<img src="groupby.png">

### What is the *total revenue* per *year*, in *recent* years?

In [None]:
# recent means 5 years
qry("""

""")

### Which *directors* have had the *largest number of movies* earning *over 100M dollars*?

In [None]:
qry("""

""")

### Which *three* of the *directors* have the *greatest average rating*?

In [None]:
qry("""

""")

Why might the above question not be the best one to ask?

In [None]:
# We want to consider if the director has multiple great movies, instead of just one

### Which *three* of the *directors* have the *greatest average rating* over at *least three movies*?

In [None]:
# WHERE cannot filter on aggregates, because aggregate values don't exist as columns in the original table
qry("""

""")

We need filtering both BEFORE and AFTER the GROUP BY operation.

<img src="pipeline.png">

# WHERE vs. HAVING

* WHERE: filter rows in original table
* HAVING: filter groups
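
The two filters can appear in the same query. The sketch below uses a hypothetical in-memory `films` table (invented data, not the real `movies.db`): `WHERE` drops individual rows before grouping, and `HAVING` drops whole groups after the aggregates are computed.

```python
import sqlite3

# Hypothetical toy data, NOT the real movies.db contents.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE films (director TEXT, rating REAL)")
demo.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Ann", 8.0), ("Ann", 9.0), ("Ann", 7.0),
     ("Bob", 10.0),
     ("Cy", 6.0), ("Cy", 6.0), ("Cy", 6.0), ("Cy", 2.0)],
)

result = demo.execute("""
    SELECT director, AVG(rating) AS avg_rating, COUNT(*) AS n
    FROM films
    WHERE rating > 5          -- row filter: removes Cy's 2.0 before grouping
    GROUP BY director
    HAVING COUNT(*) >= 3      -- group filter: removes Bob (only 1 movie)
    ORDER BY avg_rating DESC
""").fetchall()
print(result)  # [('Ann', 8.0, 3), ('Cy', 6.0, 3)]
demo.close()
```

Bob has the highest single rating but is excluded, because his group has fewer than three rows after the `WHERE` filter.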

### Repeat: Which *three* of the *directors* have the *greatest average rating* over at *least three movies*?

<img src="having.png">

SQL query outline format:

SELECT<br>
FROM<br>
WHERE<br>
GROUP BY<br>
HAVING<br>
ORDER BY<br>
LIMIT<br>

In [None]:
# HAVING filters on aggregates; WHERE cannot, because aggregate values don't exist as columns in the original table
qry("""

""")

### Which *directors* have made *more than 3 movies* *since 2010*?

In [None]:
qry("""

""")

### Which *directors* have had more than *three* movies with runtimes under *100* minutes?

In [None]:
qry("""

""")

In [None]:
# Don't forget to close the movies.db connection
c.close()

# Practice: Survey data

In [None]:
# open a connection to survey.db
survey_path = 'survey.db'
assert os.path.exists(survey_path)

conn = sqlite3.connect(survey_path)
conn

In [None]:
qry("""
select *
from sqlite_master
""", conn)

### Take a peek at fall_2021 table data

In [None]:
qry("""
select *
from fall_2021
""", conn)

### How many students in *LEC003* are graduating with an *Engineering* major?

In [None]:
qry("""

""", conn)

### How many students are in *each major*?
- bonus: sort based on majors, with most popular at the top

In [None]:
qry("""

""", conn)

### What are the *5 most popular majors*?

In [None]:
qry("""

""", conn)

### What is the *average age* for *each major* with *at least 10 people*?
- bonus: sort by major popularity, with the most popular at the top

In [None]:
qry("""

""", conn)

### How many *CS or DS majors* are in *each lecture*?

In [None]:
qry("""

""", conn)

### What are the *top 10 pizza toppings*?

In [None]:
qry("""

""", conn)

### Which 2 lectures like pineapple the most?

In [None]:
qry("""

""", conn)

In [None]:
# Close the survey.db connection (c, the movies.db connection, was closed earlier)
conn.close()