# Database 2
## Last lecture: SQL query clauses
- FROM: table
- SELECT: columns
- WHERE: row condition (a boolean expression); written and applied before LIMIT
- LIMIT: caps the number of rows returned
- ORDER BY: sorting

## Today's lecture:
- aggregation: SUM, AVG, COUNT, MIN, MAX
- group by: equivalent to bucketization; one row can only be part of one bucket
- having: applying condition to groups
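
The bucket picture of GROUP BY can be sketched in plain Python before touching SQL. This only illustrates the idea; it is not how SQLite implements grouping. The rows are copied from the movies table displayed in this notebook, and the names `buckets`, `totals`, and `multi` are made up for the example.

```python
from collections import defaultdict

# A few (Title, Year, Revenue) rows copied from the movies table in this notebook.
rows = [
    ("Guardians of the Galaxy", 2014, 333.13),
    ("Prometheus", 2012, 126.46),
    ("Split", 2016, 138.12),
    ("Sing", 2016, 270.32),
]

# GROUP BY Year: every row lands in exactly one bucket, keyed by its Year.
buckets = defaultdict(list)
for title, year, revenue in rows:
    buckets[year].append(revenue)

# Aggregation: collapse each bucket to one value per function (here SUM and COUNT).
totals = {year: (round(sum(revs), 2), len(revs)) for year, revs in buckets.items()}

# HAVING: keep only the buckets whose aggregate satisfies a condition.
multi = {year: agg for year, agg in totals.items() if agg[1] >= 2}

print(totals)
print(multi)
```

Because each row is appended to a single bucket, the bucket sizes always add up to the number of input rows.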

In [1]:
import sqlite3
import pandas as pd

In [2]:
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.core.display import HTML
HTML('<style>em { color: red; }</style>')

In [3]:
c = sqlite3.connect("movies.db")
c

<sqlite3.Connection at 0x24bd04ab9d0>

In [5]:
pd.read_sql("select * from sqlite_master", c)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movies,movies,2,"CREATE TABLE ""movies"" (\n""Title"" TEXT,\n ""Dir..."


In [6]:
pd.read_sql("select * from movies", c).head(15)

Unnamed: 0,Title,Director,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,James Gunn,2014,121,8.1,333.13
1,Prometheus,Ridley Scott,2012,124,7.0,126.46
2,Split,M. Night Shyamalan,2016,117,7.3,138.12
3,Sing,Christophe Lourdelet,2016,108,7.2,270.32
4,Suicide Squad,David Ayer,2016,123,6.2,325.02
5,The Great Wall,Yimou Zhang,2016,103,6.1,45.13
6,La La Land,Damien Chazelle,2016,128,8.3,151.06
7,Mindhorn,Sean Foley,2016,89,6.4,0.0
8,The Lost City of Z,James Gray,2016,141,7.1,8.01
9,Passengers,Morten Tyldum,2016,116,7.0,100.01


In [7]:
# Run a query on connection c and return at most `cap` rows of the result as a DataFrame
def qry(sql, cap=10):
    return pd.read_sql(sql, c).head(cap)

In [8]:
qry("""
SELECT *
FROM movies
""")

Unnamed: 0,Title,Director,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,James Gunn,2014,121,8.1,333.13
1,Prometheus,Ridley Scott,2012,124,7.0,126.46
2,Split,M. Night Shyamalan,2016,117,7.3,138.12
3,Sing,Christophe Lourdelet,2016,108,7.2,270.32
4,Suicide Squad,David Ayer,2016,123,6.2,325.02
5,The Great Wall,Yimou Zhang,2016,103,6.1,45.13
6,La La Land,Damien Chazelle,2016,128,8.3,151.06
7,Mindhorn,Sean Foley,2016,89,6.4,0.0
8,The Lost City of Z,James Gray,2016,141,7.1,8.01
9,Passengers,Morten Tyldum,2016,116,7.0,100.01


# Review: Simple Selections

### Which *movie* has the *highest rating*?

In [11]:
qry("""
SELECT Title,Director,Rating
FROM movies
ORDER BY Rating DESC
LIMIT 1
""")

Unnamed: 0,Title,Director,Rating
0,The Dark Knight,Christopher Nolan,9.0


### Which *director* made the *shortest movie*?

In [14]:
qry("""
SELECT Director, Runtime
FROM movies
ORDER BY Runtime ASC
LIMIT 1
""")

Unnamed: 0,Director,Runtime
0,Claude Barras,66


### Which *director* made the *highest-revenue movie*?

In [16]:
qry("""
SELECT Director, Revenue
FROM movies
ORDER BY Revenue DESC
LIMIT 1
""")

Unnamed: 0,Director,Revenue
0,J.J. Abrams,936.63


### Which *movie* had the *highest revenues* in *2016*?

In [17]:
qry("""
SELECT Title, Revenue
FROM movies
WHERE Year = 2016
ORDER BY Revenue DESC
LIMIT 1
""")

Unnamed: 0,Title,Revenue
0,Rogue One,532.17


### Which *3 movies* had the *highest revenues* in *2016*?

In [18]:
qry("""
SELECT Title, Revenue
FROM movies
WHERE Year = 2016
ORDER BY Revenue DESC
LIMIT 3
""")

Unnamed: 0,Title,Revenue
0,Rogue One,532.17
1,Finding Dory,486.29
2,Captain America: Civil War,408.08


### Which *3 movies* have the *highest rating-to-revenue ratios*?

Introduce `AS`: it names a computed column so that later clauses such as `ORDER BY` can refer to it.

In [19]:
qry("""
SELECT Title, Rating/Revenue AS Ratio
FROM movies
ORDER BY Ratio DESC
LIMIT 3
""")

Unnamed: 0,Title,Ratio
0,Wakefield,750.0
1,"Love, Rosie",720.0
2,Lovesong,640.0
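
One thing worth knowing about a computed column like `Ratio`: in SQLite, dividing by zero yields NULL rather than raising an error, and NULLs sort after every other value under `ORDER BY ... DESC`. A minimal sketch, assuming a hypothetical in-memory table named `demo` filled with three rows copied from the movies listing earlier in the notebook:

```python
import sqlite3

# Hypothetical in-memory table; the three rows are copied from the movies listing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Title TEXT, Rating REAL, Revenue REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", [
    ("Mindhorn", 6.4, 0.0),            # Revenue of 0 -> Ratio becomes NULL
    ("The Lost City of Z", 7.1, 8.01),
    ("The Great Wall", 6.1, 45.13),
])

# The alias Ratio is defined in SELECT and reused in ORDER BY.
ratios = conn.execute("""
SELECT Title, Rating / Revenue AS Ratio
FROM demo
ORDER BY Ratio DESC
""").fetchall()
conn.close()

for title, ratio in ratios:
    print(title, ratio)
```

The zero-revenue movie comes back with a ratio of `None` (SQL NULL) and lands at the bottom of the descending sort, so it cannot win a "highest ratio" query.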


# Aggregate Queries

```
SUM, AVG, COUNT, MIN, MAX
```

### How many *movies* are there?

In [20]:
qry("""
SELECT COUNT(*)
FROM movies
""")

Unnamed: 0,COUNT(*)
0,998


### How many *directors* are there?

In [21]:
# COUNT(Director) counts the rows that have a non-NULL Director, not the number of distinct directors
qry("""
SELECT COUNT(Director)
FROM movies
""")

Unnamed: 0,COUNT(Director)
0,998


In [22]:
qry("""
SELECT COUNT(DISTINCT Director)
FROM movies
""")

Unnamed: 0,COUNT(DISTINCT Director)
0,643
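
The three flavors of `COUNT` differ only when a column has NULLs or repeated values. The sketch below uses a hypothetical in-memory table `demo` with made-up titles and directors (not rows from movies.db) so that all three counts come out different:

```python
import sqlite3

# Hypothetical data: one director appears twice and one movie has no director.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Title TEXT, Director TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [
    ("Movie A", "Director X"),
    ("Movie B", "Director X"),
    ("Movie C", "Director Y"),
    ("Movie D", None),  # stored as NULL
])

all_rows, with_director, unique_directors = conn.execute("""
SELECT COUNT(*),                 -- every row
       COUNT(Director),          -- rows where Director is not NULL
       COUNT(DISTINCT Director)  -- distinct non-NULL directors
FROM demo
""").fetchone()
conn.close()

print(all_rows, with_director, unique_directors)  # 4 3 2
```

In the movies table the first two counts agree (both 998), which suggests that no row there is missing a director.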


### What is the *total revenue* of *all the movies*?

In [23]:
qry("""
SELECT SUM(Revenue)
FROM movies
""")

Unnamed: 0,SUM(Revenue)
0,72215.45


### What is the *average rating* across *all movies*?

* v1: with `SUM` and `COUNT`
* v2: with `AVG`

In [25]:
qry("""
SELECT SUM(Rating) / COUNT(Rating) AS AvgRating
FROM movies
""")

Unnamed: 0,AvgRating
0,6.723447


In [26]:
qry("""
SELECT AVG(Rating)
FROM movies
""")

Unnamed: 0,AVG(Rating)
0,6.723447
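
The `SUM / COUNT` version matches `AVG` here because Rating holds floating-point values. When both operands are integers, SQLite's `/` performs integer division, so the two versions can disagree. A small sketch, assuming a hypothetical in-memory table `demo` with an INTEGER Runtime column (the real schema is truncated in the `sqlite_master` output, so the column type is an assumption) and three runtimes taken from the movies listing:

```python
import sqlite3

# Hypothetical table; the runtimes 121, 124, 117 come from the movies listing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Runtime INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(121,), (124,), (117,)])

truncated, exact, built_in = conn.execute("""
SELECT SUM(Runtime) / COUNT(Runtime),        -- integer / integer truncates
       1.0 * SUM(Runtime) / COUNT(Runtime),  -- forcing a float keeps the fraction
       AVG(Runtime)                          -- AVG always returns a float
FROM demo
""").fetchone()
conn.close()

print(truncated, exact, built_in)
```

Using `AVG` avoids the issue entirely; multiplying by `1.0` is the usual workaround when the average has to be spelled out by hand.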


### What is the *average rating* and *average runtime* of *all the movies*?

In [27]:
qry("""
SELECT AVG(Rating),AVG(Runtime)
FROM movies
""")

Unnamed: 0,AVG(Rating),AVG(Runtime)
0,6.723447,113.170341


### What is the *average runtime* for a *James Gunn* movie?

In [28]:
qry("""
SELECT AVG(Rating),AVG(Runtime)
FROM movies
WHERE Director = 'James Gunn'
""")

Unnamed: 0,AVG(Rating),AVG(Runtime)
0,7.133333,104.0


### What is the *average revenue* for a *Ridley Scott* movie?

In [29]:
qry("""
SELECT AVG(Revenue)
FROM movies
WHERE Director = 'Ridley Scott'
""")

Unnamed: 0,AVG(Revenue)
0,89.8825


### *How many movies* were there in *2016*?

In [30]:
qry("""
SELECT COUNT(*)
FROM movies
WHERE Year = 2016
""")

Unnamed: 0,COUNT(*)
0,296


### What *percentage* of the *total revenue* came from the *highest-revenue movie*?

In [31]:
qry("""
SELECT MAX(Revenue) / SUM(Revenue) * 100 AS Percent
FROM movies
""")

Unnamed: 0,Percent
0,1.296994


### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2016*?

In [32]:
qry("""
SELECT MAX(Revenue) / SUM(Revenue) * 100 AS Percent
FROM movies
WHERE Year = 2016
""")

Unnamed: 0,Percent
0,4.746581


# GROUP BY Queries

```sql
SELECT ???, ??? FROM Movies
GROUP BY ???
```

### What is the *total revenue* for each *year*?

* v1: the amounts
* v2: the amounts, as labeled by year

In [34]:
qry("""
SELECT SUM(Revenue), Year
FROM movies
GROUP BY year
""")

Unnamed: 0,SUM(Revenue),Year
0,3624.46,2006
1,4306.23,2007
2,5053.22,2008
3,5292.26,2009
4,5989.65,2010
5,5431.96,2011
6,6910.29,2012
7,7544.21,2013
8,7997.4,2014
9,8854.12,2015


### *How many movies* were by each *director*?

In [36]:
qry("""
SELECT COUNT(*), Director
FROM movies
GROUP BY Director
""",5)

Unnamed: 0,COUNT(*),Director
0,1,Aamir Khan
1,1,Abdellatif Kechiche
2,1,Adam Leon
3,4,Adam McKay
4,2,Adam Shankman


In [37]:
qry("""
SELECT COUNT(*) AS mov_count, Director
FROM movies
GROUP BY Director
ORDER BY mov_count DESC
""",5)

Unnamed: 0,mov_count,Director
0,8,Ridley Scott
1,6,Paul W.S. Anderson
2,6,Michael Bay
3,6,M. Night Shyamalan
4,6,David Yates


### What is the *average rating* for each *director*?

In [38]:
qry("""
SELECT AVG(Rating) AS ar, Director
FROM movies
GROUP BY Director
ORDER BY ar DESC
""",5)

Unnamed: 0,ar,Director
0,8.8,Nitesh Tiwari
1,8.68,Christopher Nolan
2,8.6,Olivier Nakache
3,8.6,Makoto Shinkai
4,8.5,Florian Henckel von Donnersmarck


### What is the *average runtime* for each *director*?

In [39]:
qry("""
SELECT AVG(Runtime) AS ar, Director
FROM movies
GROUP BY Director
ORDER BY ar DESC
""",5)

Unnamed: 0,ar,Director
0,180.0,Abdellatif Kechiche
1,165.0,Aamir Khan
2,163.0,Andrea Arnold
3,162.0,Maren Ade
4,162.0,James Cameron


### How many *unique directors* created a movie in each *year*?

In [40]:
qry("""
SELECT Year, COUNT(DISTINCT Director) AS directors
FROM movies
GROUP BY Year
""")

Unnamed: 0,Year,directors
0,2006,44
1,2007,51
2,2008,51
3,2009,51
4,2010,60
5,2011,63
6,2012,64
7,2013,88
8,2014,97
9,2015,127


# Combining GROUP BY with Other Clauses

<img src="groupby.png">

### What is the *total revenue* per *year*, in *recent* years?

In [42]:
qry("""
SELECT year, SUM(Revenue) as total_rev
FROM movies
GROUP BY year
ORDER BY year DESC
LIMIT 4
""")

Unnamed: 0,Year,total_rev
0,2016,11211.65
1,2015,8854.12
2,2014,7997.4
3,2013,7544.21


### Which *directors* have had the *largest number of movies* earning *over 100M dollars*?

In [43]:
qry("""
SELECT director, COUNT(title) as num_movies
FROM movies
WHERE revenue > 100
GROUP BY director
ORDER BY num_movies DESC
LIMIT 4
""")

Unnamed: 0,Director,num_movies
0,David Yates,6
1,J.J. Abrams,5
2,Zack Snyder,4
3,Ridley Scott,4


### Which *three* of the *directors* have the *greatest average rating*?

In [44]:
qry("""
SELECT director, AVG(rating) as ar
FROM movies
GROUP BY director
ORDER BY ar DESC
""")

Unnamed: 0,Director,ar
0,Nitesh Tiwari,8.8
1,Christopher Nolan,8.68
2,Olivier Nakache,8.6
3,Makoto Shinkai,8.6
4,Florian Henckel von Donnersmarck,8.5
5,Aamir Khan,8.5
6,Naoko Yamada,8.4
7,Damien Chazelle,8.4
8,Thomas Vinterberg,8.3
9,S.S. Rajamouli,8.3


Why might the question above not be the best one to ask?

In [None]:
# We might care that some directors have only a single movie in the dataset, even if it is highly rated

### Which *three* of the *directors* have the *greatest average rating* over *at least three movies*?

In [45]:
qry("""
SELECT director, AVG(rating) as ar, COUNT(*) as mov_count
FROM movies
WHERE mov_count >= 3
GROUP BY director
ORDER BY ar DESC
LIMIT 3
""")

DatabaseError: Execution failed on sql '
SELECT director, AVG(rating) as ar, COUNT(*) as mov_count
FROM movies
WHERE mov_count >= 3
GROUP BY director
ORDER BY ar DESC
LIMIT 3
': misuse of aggregate: COUNT()

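WHERE cannot use `mov_count`: it filters individual rows before any grouping happens, so the aggregate does not exist yet. We need one filter BEFORE the GROUP BY step and another AFTER it.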

<img src="pipeline.png">

# WHERE vs. HAVING

* WHERE: filter rows in original table
* HAVING: filter groups
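
The two filters can be seen side by side in one query. This is a minimal sketch, assuming a hypothetical in-memory table `demo` that holds seven rows copied from the movies listing; it also shows that SQLite rejects an aggregate placed in `WHERE`.

```python
import sqlite3

# Hypothetical in-memory table; rows are copied from the movies listing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Title TEXT, Year INTEGER, Revenue REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", [
    ("Guardians of the Galaxy", 2014, 333.13),
    ("Prometheus", 2012, 126.46),
    ("Split", 2016, 138.12),
    ("Sing", 2016, 270.32),
    ("Suicide Squad", 2016, 325.02),
    ("The Great Wall", 2016, 45.13),
    ("Mindhorn", 2016, 0.0),
])

# WHERE drops individual rows first; HAVING then drops whole groups.
result = conn.execute("""
SELECT Year, COUNT(*) AS n
FROM demo
WHERE Revenue > 100   -- row filter, applied before grouping
GROUP BY Year
HAVING n >= 2         -- group filter, applied after aggregation
ORDER BY Year
""").fetchall()
print(result)

# An aggregate inside WHERE is an error, because groups do not exist yet.
try:
    conn.execute("SELECT Year FROM demo WHERE COUNT(*) >= 2 GROUP BY Year")
    where_failed = False
except sqlite3.Error as err:
    where_failed = True
    print("rejected:", err)

conn.close()
```

Only 2016 survives: two of its five movies are removed by `WHERE`, and the 2012 and 2014 groups, with one movie each, are removed by `HAVING`.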

### Repeat: Which *three* of the *directors* have the *greatest average rating* over *at least three movies*?

<img src="having.png">

In [47]:
qry("""
SELECT director, AVG(rating) as ar, COUNT(*) as mov_count
FROM movies
GROUP BY director
HAVING mov_count >= 3
ORDER BY ar DESC
LIMIT 5
""")

Unnamed: 0,Director,ar,mov_count
0,Christopher Nolan,8.68,5
1,Martin Scorsese,7.92,5
2,Quentin Tarantino,7.9,4
3,Wes Anderson,7.9,3
4,David Fincher,7.82,5


### Which *directors* have had *more than 3 movies* released *since 2010*?

In [53]:
qry("""
SELECT Director, COUNT(title) as num_movies
FROM movies
WHERE year >= 2010
GROUP BY director
HAVING num_movies > 3
ORDER BY num_movies ASC
""")

Unnamed: 0,Director,num_movies
0,Antoine Fuqua,4
1,David O. Russell,4
2,David Yates,4
3,James Wan,4
4,M. Night Shyamalan,4
5,Martin Scorsese,4
6,Michael Bay,4
7,Mike Flanagan,4
8,Paul Feig,4
9,Peter Berg,4


### Which *directors* have had more than *three* movies with runtimes under *100* minutes?

In [51]:
qry("""
SELECT Director, COUNT(title) as num_movies
FROM movies
WHERE runtime < 100
GROUP BY director
HAVING num_movies > 3
""")

Unnamed: 0,Director,num_movies
0,Woody Allen,4


In [54]:
c.close()