In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("proj1.ipynb")

# Project 1 - SQL

In this project we will be working with SQL on the IMDB database.

## Due Date

This assignment is due on **Friday, February 19th** at 11:59 pm.

## Scoring Breakdown

|Question|Points|
|---|---|
|0|1|
|1a|1|
|1b|2|
|1c|1|
|2ai|1|
|2aii|3|
|2bi|2|
|2bii|2|
|2biii|2|
|2biv|1|
|3a|2|
|3b|2|
|3c|1|
|4|2|
|**Total**|23|

In [2]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

# Getting Connected
We are going to be using the `ipython-sql` library to connect our notebook to a PostgreSQL database server on your JupyterHub account. The next cell should do the trick; you should not see any error messages after it completes.

In [3]:
# The first time you are running this cell, you may need to run the following line as: %load_ext sql 
%reload_ext sql
%sql postgresql://jovyan@127.0.0.1:5432/postgres

Now we need to unzip the data.

In [4]:
!unzip -u data/imdbdb.zip -d data/

Archive:  data/imdbdb.zip


We will use PostgreSQL commands to create a database and import our data into it. Run the following cell to do this. The line `%sql \l` lists the databases on the server; you should see a database called `imdb`.

In [5]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb'
!psql -h localhost -c 'CREATE DATABASE imdb' 
!psql -h localhost -d imdb -f data/imdbdb.sql
%sql \l

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
psql:data/imdbdb.sql:34: ERROR:  role "postgres" does not exist
CREATE TABLE
psql:data/imdbdb.sql:48: ERROR:  role "postgres" does not exist
CREATE TABLE
psql:data/imdbdb.sql:62: ERROR:  role "postgres" does not exist
CREATE VIEW
psql:data/imdbdb.sql:74: ERROR:  role "postgres" does not exist
CREATE TABLE
psql:data/imdbdb.sql:86: ERROR:  role "imdb" does not exist
CREATE TABLE
psql:data/imdbdb.sql:99: ERROR:  role "postgres" does not exist
CREATE TABLE
psql:data/imdbdb.sql:111: ERROR:  role "imdb" does not exist
COPY 500000
COPY 3804162
COPY 113
COPY 2433431
COPY 337179
COPY 12
ALTER TABLE
ALTER TABLE
 * postgresql://jovyan@127.0.0.1:5432/postgres
5 rows affected.


Name,Owner,Encoding,Collate,Ctype,Access privileges
imdb,jovyan,UTF8,en_US.utf8,en_US.utf8,
jovyan,jovyan,UTF8,en_US.utf8,en_US.utf8,
postgres,jovyan,UTF8,en_US.utf8,en_US.utf8,
template0,jovyan,UTF8,en_US.utf8,en_US.utf8,=c/jovyan jovyan=CTc/jovyan
template1,jovyan,UTF8,en_US.utf8,en_US.utf8,=c/jovyan jovyan=CTc/jovyan


Now let's connect to the new database we just created! There should be no errors after running the following cell.

In [6]:
%sql postgresql://jovyan@127.0.0.1:5432/imdb

To make sure things are working, let's fetch 10 rows from one of our tables.

In [7]:
%%sql
SELECT * 
  FROM cast_sample
 LIMIT 10

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
10 rows affected.


id,person_id,movie_id,role_id
708,235,2345369,1
721,241,2504309,1
789,264,2156734,1
875,299,1954994,1
888,302,765037,1
889,302,765172,1
898,306,291387,1
899,306,1477434,1
931,324,824119,1
1936,543,1754068,1


## Table Descriptions

We are working with the IMDB database. This database is huge and has a lot of information that we have pared down for this project. 

- actor_sample: information about the actors including id, name, and gender
- cast_sample: each person on the cast of each movie gets a row including cast id, person's id (actor_sample.id), movie id, and role id
- movie_sample: sample of movies the actors have been in including movie's id, title, and the production year
- movie_info_sample: this table originally had a lot of information for each movie (take a look at info_type to see the information available) but we have dropped some information to make it a bit easier to manage. This table includes movie info's id, movie id, info type id, and the info itself
- info_type: reference table to match each info type id to the description of the type of information
- role_type: reference table for cast_sample to match role id to the description of the role

To make these tables smaller in your snapshot of IMDB, we took a random sample of actors from the full database, and included their corresponding movies and cast info. You will learn how to do sampling in SQL below.

Let's look at metadata about the tables. Many database client applications (like DBeaver or psql) and connectivity libraries offer some convenience commands for exploring metadata. We can use psql's meta-commands (also called "backslash commands") directly from Jupyter.

In [8]:
%sql \d

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
7 rows affected.


Schema,Name,Type,Owner
public,actor_sample,table,jovyan
public,cast_sample,table,jovyan
public,cut1,view,jovyan
public,info_type,table,jovyan
public,movie_info_sample,table,jovyan
public,movie_sample,table,jovyan
public,role_type,table,jovyan


You can get quick help on psql meta-commands via `\?`:

In [9]:
%sql \?

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
29 rows affected.


Command,Description
\! [command],Pass commands to shell.
\?,Show Commands.
\copy [tablename] to/from [filename],Copy data between a file and a table.
\d[+] [pattern],"List or describe tables, views and sequences."
\dD[+] [pattern],List or describe domains.
\dE[+] [pattern],List foreign tables.
\dF[+] [pattern],List text search configurations.
\dT[S+] [pattern],List data types
\db[+] [pattern],List tablespaces.
\df[+] [pattern],List functions.


There is a more "native SQL" way to look at metadata. A SQL database stores its metadata in tables: essentially, metadata is just data! So we can use SQL to query the metadata. We access the metadata tables via the prefix `information_schema.`; for example, the table of all tables is `information_schema.tables`. We want to restrict it to tables in our default schema, `public`, so we use a `WHERE` clause:

In [10]:
%%sql
SELECT * 
FROM information_schema.tables
WHERE table_schema = 'public';

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
7 rows affected.


table_catalog,table_schema,table_name,table_type,self_referencing_column_name,reference_generation,user_defined_type_catalog,user_defined_type_schema,user_defined_type_name,is_insertable_into,is_typed,commit_action
imdb,public,actor_sample,BASE TABLE,,,,,,YES,NO,
imdb,public,cast_sample,BASE TABLE,,,,,,YES,NO,
imdb,public,movie_info_sample,BASE TABLE,,,,,,YES,NO,
imdb,public,cut1,VIEW,,,,,,NO,NO,
imdb,public,info_type,BASE TABLE,,,,,,YES,NO,
imdb,public,movie_sample,BASE TABLE,,,,,,YES,NO,
imdb,public,role_type,BASE TABLE,,,,,,YES,NO,


## Question 0
Now write a query to get the names of all the tables in the PostgreSQL `information_schema` schema!

Note: The `result_0 <<` syntax stores the result of the SQL query to the variable `result_0`.

<!--
BEGIN QUESTION
name: q0
points: 1
-->

In [11]:
%%sql result_0 <<
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'information_schema';

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
65 rows affected.
Returning data to local variable result_0


In [12]:
grader.check("q0")

# Question 1: Assessing Table Contents 
One of the first things you'll want to do with a database table is get a sense for its metadata: column names and types, and number of rows. 

We can use the PostgreSQL `\d` meta-command to get a description of all the columns in the `movie_info_sample` table.

In [13]:
%sql \d movie_info_sample

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
4 rows affected.


Column,Type,Modifiers
id,integer,
movie_id,integer,
info_type_id,integer,
info,character varying,


## Question 1a
In the next cell, write an SQL query to calculate the number of rows in the `movie_info_sample` table.

<!--
BEGIN QUESTION
name: q1a
points: 1
-->

In [14]:
%sql SELECT COUNT(*) FROM movie_info_sample;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
1 rows affected.


count
2433431


In [15]:
count_1a = %sql SELECT COUNT(*) FROM movie_info_sample;
count_1a = count_1a.DataFrame().iloc[0, 0]
count_1a

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
1 rows affected.


2433431

In [16]:
grader.check("q1a")

### Taking a Random Sample (and Python variable substitution)

The SQL cell at the end of Question 1b below shows a good way to take a random sample from a table: use the `TABLESAMPLE` clause after the table name in the `FROM` clause. It also shows how to reference a Python variable within a SQL string in the connection library we're using, `ipython-sql`. (The syntax for language-variable substitution is not part of the SQL standard; it is implemented by the connection library, so this syntax is specific to `ipython-sql`.)

The SQL query below fetches a sample of size `p%` via the [Bernoulli sampling](https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#BERNOULLI_Option) scheme, which instructs the database engine to "flip a coin" before deciding whether to return each tuple. 
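The coin-flip description can be simulated in plain Python. This is only a rough sketch of the idea, not how PostgreSQL implements `TABLESAMPLE`; the table size, the rate, and the helper name `bernoulli_sample` are hypothetical values chosen for illustration.

```python
import random


def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100."""
    keep_probability = percent / 100
    return [row for row in rows if rng.random() < keep_probability]


# Hypothetical table of 10,000 row ids (not the real movie_info_sample).
table = range(10_000)
percent = 0.05  # expressed in percent, like the argument to TABLESAMPLE

# Expected sample size is (number of rows) * (percent / 100).
expected_rows = len(table) * percent / 100

rng = random.Random(0)  # seeded only so that a rerun repeats the same draws
sample_sizes = [len(bernoulli_sample(table, percent, rng)) for _ in range(20)]

print(expected_rows)
print(sample_sizes)
```

Because every row gets its own coin flip, the sample size is itself random: individual runs return more or fewer rows than the expected count, which is the behavior to look for when rerunning the SQL cell.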

## Question 1b
Given that you know the size of the table from the previous query, choose a sampling rate `p` that retrieves 5 tuples in expectation. Don't forget to express `p` in units of percent (i.e. `p=5` is 5%)!

Try running the SQL cell many times and see what you notice.

<!--
BEGIN QUESTION
name: q1b
points: 2
-->

In [17]:
p_1b = (5 / count_1a) * 100
p_1b

0.0002054712050598517

In [18]:
grader.check("q1b")

In [19]:
%%sql
SELECT *
  FROM movie_info_sample TABLESAMPLE bernoulli('{p_1b}')

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
2 rows affected.


id,movie_id,info_type_id,info
1480763,2434620,105,"$5,000"
4391653,1001360,8,USA


## Question 1c

If you don't care about randomness, a more efficient way to get some arbitrary example rows from a table is to just use a `LIMIT` clause. In the next cell, fetch 5 rows from `movie_info_sample` using `LIMIT`.

<!--
BEGIN QUESTION
name: q1c
points: 1
-->

In [20]:
%%sql result_1c <<
SELECT * 
FROM movie_info_sample
LIMIT 5;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
5 rows affected.
Returning data to local variable result_1c


In [21]:
grader.check("q1c")

# Question 2: Data Cleaning and Transformation

## Question 2a: MPAA Rating

The MPAA rating is something most datasets about movies contain and ours is no exception! But it's pretty messy to extract from the existing data. We're going to create a nice refined view of the `movie_sample` table that also includes a rating field.

### Question 2ai:
To start, using the `info_type` table, find which `info_type_id` corresponds to a film's MPAA rating. Store the resulting tuple from the `info_type` table in results.

<!--
BEGIN QUESTION
name: q2ai
points: 1
-->

In [22]:
mpaa_rating_id = %sql SELECT id FROM info_type WHERE info = 'mpaa';
mpaa_rating_id = mpaa_rating_id.DataFrame().iloc[0, 0]
mpaa_rating_id

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
1 rows affected.


97

In [23]:
grader.check("q2ai")

You will want to reuse the query above via copy/paste in the next cell, or use `mpaa_rating_id` directly through Python variable substitution.

### Question 2aii:
In the next cell, please construct a view named `movie_ratings` with one row per movie and columns `(movie_id, title, info, mpaa_rating)`, where `info` is the full MPAA rating string from `movie_info_sample` and `mpaa_rating` is just the rating itself (e.g. `R`, `PG-13`, `PG`). Following the view definition, write a `SELECT` query to return the first 20 rows of the view ordered by ascending `movie_id`.

Before you get started, take a look at the `movie_info_sample` tuples corresponding to the MPAA rating. The `info` field is a little longer than just the rating. It includes an explanation for why that movie received its score. You will need to extract a substring from the `info` column of `movie_info_sample`; you can use the [string functions](https://www.postgresql.org/docs/current/functions-string.html) in PostgreSQL to do it. There are many possible solutions. You might choose to do it by position (STRPOS), by using the SUBSTRING function, or both. We recommend you define a CTE (`WITH` clause) that does this part, and then use it to complete the query.
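To make the extraction step concrete outside the database, here is a minimal Python sketch. The sample strings are hypothetical and assume the `info` text has the shape "Rated <rating> for <reason>"; check a few real tuples before relying on that shape. The helper name `extract_rating` is also made up for this example.

```python
import re

# Hypothetical `info` strings, assumed to look like "Rated <rating> for <reason>".
examples = [
    "Rated R for strong violence and language",
    "Rated PG-13 for some action sequences",
    "Rated PG for mild thematic elements",
]

# Capture the first run of non-space characters after the word "Rated".
RATING_PATTERN = re.compile(r"Rated (\S+)")


def extract_rating(info):
    """Return the token that follows 'Rated ', or None if the shape differs."""
    match = RATING_PATTERN.search(info)
    return match.group(1) if match else None


ratings = [extract_rating(info) for info in examples]
print(ratings)  # ['R', 'PG-13', 'PG']

# Pitfall: a plain search for the letter R succeeds on every string,
# because the word "Rated" itself contains an R.
contains_r = ["R" in info for info in examples]
print(contains_r)  # [True, True, True]
```

In PostgreSQL, `SUBSTRING(string FROM pattern)` with a parenthesized group returns the text matched by that group, so the same capture-group idea carries over to SQL.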

<!--
BEGIN QUESTION
name: q2aii
points: 3
-->

In [24]:
%%sql result_2aii <<

DROP VIEW IF EXISTS movie_ratings;
CREATE VIEW movie_ratings AS
(SELECT DISTINCT movie_id, title, info,
-- Take the token after the word Rated; matching a bare R would also hit the R in Rated.
-- Assumes every MPAA info string starts with the word Rated followed by the rating.
SUBSTRING(info FROM 'Rated (\S+)') AS mpaa_rating
FROM movie_sample AS ms, movie_info_sample AS ms_info
WHERE ms.id = ms_info.movie_id AND
ms_info.info_type_id = 97);

SELECT * 
FROM movie_ratings
ORDER BY movie_id ASC
LIMIT 20;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
20 rows affected.
Returning data to local variable result_2aii


In [25]:
grader.check("q2aii")

## Question 2b - Movie Moola

One measure of a movie's success is how much money it makes. If we look at our `info_type` table, we have information about the film's gross earnings and the budget for a film. It would be nice to know how much money a film made using the profit formula:
$$\text{profit} = \text{earnings} - \text{money spent}$$

We start by taking a look at the gross info types.

In [26]:
%%sql
SELECT * 
FROM movie_info_sample
WHERE info_type_id = 107
ORDER BY id
LIMIT 10 OFFSET 100000;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
10 rows affected.


id,movie_id,info_type_id,info
1464348,2281091,107,"INR 23,373,000 (India) (25 February 2005)"
1464349,2281091,107,"INR 19,207,000 (India) (18 February 2005)"
1464374,1766950,107,"HKD 826,364 (Hong Kong) (11 December 1975)"
1464375,1769023,107,"HKD 3,148,549 (Hong Kong) (19 November 1980)"
1464378,1799099,107,"HKD 6,493,694 (Hong Kong) (22 December 1981)"
1464383,1847670,107,"$21,438 (USA) (9 August 2009)"
1464384,1847670,107,"$10,266 (USA) (2 August 2009)"
1464396,1916002,107,"$5,932 (USA) (27 November 2005)"
1464397,1916002,107,"$4,206 (USA) (20 November 2005)"
1464398,1916002,107,"$2,939 (USA) (23 October 2005)"


There are a lot of things to notice here. First of all, `info` is a string containing not only the earnings but also the country and the date through which the earnings are cumulatively summed. Additionally, the values are not all in the same currency! On top of that, it looks like some of the gross earnings, even among those in USD, are from worldwide sales while others only count sales within the USA.

For consistency, let's only use movies with gross earnings counted in the USA.
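As a rough illustration of the cleaning this implies, the sketch below parses a few `info` strings copied from the query output above, keeps only the USD figures reported for the USA, and takes the largest value per movie. It is a Python stand-in for the idea, not the SQL you are asked to write; the pattern and the helper name `parse_usa_gross` are assumptions made for this example.

```python
import re

# (movie_id, info) pairs copied from the query output above.
rows = [
    (2281091, "INR 23,373,000 (India) (25 February 2005)"),
    (1766950, "HKD 826,364 (Hong Kong) (11 December 1975)"),
    (1847670, "$21,438 (USA) (9 August 2009)"),
    (1847670, "$10,266 (USA) (2 August 2009)"),
    (1916002, "$5,932 (USA) (27 November 2005)"),
    (1916002, "$4,206 (USA) (20 November 2005)"),
    (1916002, "$2,939 (USA) (23 October 2005)"),
]

# A dollar sign, then digits and commas, then the literal "(USA)".
USD_USA = re.compile(r"^\$([\d,]+) \(USA\)")


def parse_usa_gross(info):
    """Return the dollar amount as a float, or None for non-USD/non-USA rows."""
    match = USD_USA.match(info)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Keep the maximum (latest cumulative) figure for each movie.
max_gross = {}
for movie_id, info in rows:
    gross = parse_usa_gross(info)
    if gross is not None:
        max_gross[movie_id] = max(gross, max_gross.get(movie_id, gross))

print(max_gross)  # {1847670: 21438.0, 1916002: 5932.0}
```

The INR and HKD rows drop out because they fail the pattern, which mirrors restricting the SQL query to USD figures for the USA.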

### Question 2bi:
We want the number part of the `info` column and the latest (maximum) earnings value for a particular film. Clean `info` for the gross earnings information and extract the dollar amount as a float. The resulting table should have columns for the `gross` as a float, `movie_id`, and `title`. 

We are going to need this table later on when calculating the movie's profit, so let's save the query as a view called `movie_gross`.

To take a look at our cleaned data, display the `title` for the top 10 highest grossing films along with their `movie_id` and `gross`.

- HINT: Isolating the dollar amount as a string is similar to how we extracted the rating above (there are multiple ways of doing this).

- HINT: Look at the [regexp_replace](https://www.postgresql.org/docs/9.4/functions-matching.html) function and the 'g' flag.

<!--
BEGIN QUESTION
name: q2bi
points: 2
-->

In [27]:
%%sql

SELECT MAX(CAST(regexp_replace(info, '[^\d|^\(]+|\([^\)]*', '', 'g') AS float)) AS gross, movie_id, title
FROM movie_sample AS ms, movie_info_sample AS ms_info
WHERE ms.id = ms_info.movie_id
AND info_type_id = 107
AND info LIKE '$%(USA)%'
GROUP BY movie_id, title
ORDER BY gross desc;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
11527 rows affected.


gross,movie_id,title
760507625.0,1704289,Avatar
658672302.0,2438179,Titanic
623357910.0,2346436,The Avengers
534858444.0,2360583,The Dark Knight
460935665.0,2310522,Star Wars
448139099.0,2360588,The Dark Knight Rises
436471036.0,2285018,Shrek 2
435110554.0,1851357,E.T. the Extra-Terrestrial
431065444.0,2310573,Star Wars: Episode I - The Phantom Menace
423315812.0,2204345,Pirates of the Caribbean: Dead Man's Chest


In [28]:
# delete cell above
# split_part(split_part(REPLACE(info, ',', ''), '$', 2), ' ', 1) AS float

In [29]:
%%sql result_2bi <<

DROP VIEW IF EXISTS movie_gross;
CREATE VIEW movie_gross AS
(SELECT MAX(CAST(regexp_replace(info, '[^\d|^\(]+|\([^\)]*', '', 'g') AS float)) AS gross, movie_id, title
FROM movie_sample AS ms, movie_info_sample AS ms_info
WHERE ms.id = ms_info.movie_id
AND info_type_id = 107
AND info LIKE '$%(USA)%'
GROUP BY movie_id, title
ORDER BY gross DESC);

SELECT * 
FROM movie_gross
ORDER BY gross DESC
LIMIT 10

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
10 rows affected.
Returning data to local variable result_2bi


In [30]:
grader.check("q2bi")

### Question 2bii:

In [31]:
# delete these cells
#regexp_replace(info, '[^\d|^\(]+|\([^\)]*', '', 'g') AS budget

In [32]:
%%sql budgets <<
select MAX(CAST(regexp_replace(info, '[^\d|^\(]+|\([^\)]*', '', 'g') AS float)) AS budget, movie_id, title
FROM movie_sample MS, movie_info_sample MI
where MS.id = MI.movie_id
and info_type_id = 105
GROUP BY movie_id, title
ORDER BY budget DESC;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
57268 rows affected.
Returning data to local variable budgets


In [33]:
new = budgets.DataFrame()
new[new['title'] == 'Gwoemul']

Unnamed: 0,budget,movie_id,title
5,12215500000.0,1932032,Gwoemul


In [34]:
budgets.DataFrame()#.iloc[5, 2]
budgets.DataFrame()['budget'].iloc[:5].sum() - 1113000000000

0.0

In [35]:
# delete cells above

Now, we want something similar for the budget of the film so we can subtract the `budget` from the `gross`. Clean `info` for the budget information and extract the dollar amount as a float. The resulting table should have columns for the `budget` as a float, `movie_id`, and `title`. 

We are going to need this table later on as well when calculating the movie's profit so let's save the query as a view called `movie_budget`.

To take a look at our cleaned data, display the `title` for the top 10 highest budget films along with their `movie_id` and `budget`.

<!--
BEGIN QUESTION
name: q2bii
points: 2
-->

In [36]:
%%sql result_2bii <<

DROP VIEW IF EXISTS movie_budget;
CREATE VIEW movie_budget AS
(SELECT MAX(CAST(regexp_replace(info, '[^\d|^\(]+|\([^\)]*', '', 'g') AS float)) AS budget, movie_id, title
FROM movie_sample AS MS, movie_info_sample AS MI
WHERE MS.id = MI.movie_id
AND info_type_id = 105
GROUP BY movie_id, title
ORDER BY budget DESC);

SELECT * 
FROM movie_budget
ORDER BY budget DESC
LIMIT 10

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
10 rows affected.
Returning data to local variable result_2bii


In [37]:
grader.check("q2bii")

In [38]:
# delete cells below

In [39]:
%%sql
select mg.movie_id, mg.title, gross - budget as profit
from movie_gross as mg, movie_budget as mb
where mg.movie_id = mb.movie_id
and mg.title like '%102%';

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
1 rows affected.


movie_id,title,profit
1635380,102 Dalmatians,-18042974.0


In [40]:
# delete cells above

### Question 2biii:

We have all the parts we need to calculate the profits. Using the `movie_gross` and `movie_budget` views created above, we can subtract the numeric columns and save the result in another column called `profit`.

In the next cell, construct a view named `movie_profit` containing one row for each movie containing `(movie_id, title, profit)`, where `profit` is the result of subtracting that movie's `budget` from `gross`. Following the view definition, write a `SELECT` query to return the first 20 rows of the view ordered by descending `profit`. This may take a second.
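The shape of such a view can be tried out on toy data first. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, with hypothetical films and amounts, and plain tables in place of the `movie_gross` and `movie_budget` views; it is meant only to show how the join and subtraction behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for the movie_gross and movie_budget views (hypothetical data).
    CREATE TABLE movie_gross (movie_id INTEGER, title TEXT, gross REAL);
    CREATE TABLE movie_budget (movie_id INTEGER, title TEXT, budget REAL);
    INSERT INTO movie_gross VALUES
        (1, 'Film A', 500.0), (2, 'Film B', 120.0), (3, 'Film C', 80.0);
    INSERT INTO movie_budget VALUES
        (1, 'Film A', 200.0), (2, 'Film B', 150.0), (4, 'Film D', 60.0);

    CREATE VIEW movie_profit AS
    SELECT mg.movie_id, mg.title, mg.gross - mb.budget AS profit
    FROM movie_gross AS mg
    JOIN movie_budget AS mb ON mg.movie_id = mb.movie_id;
""")

profit_rows = conn.execute(
    "SELECT movie_id, title, profit FROM movie_profit ORDER BY profit DESC"
).fetchall()
conn.close()

print(profit_rows)  # [(1, 'Film A', 300.0), (2, 'Film B', -30.0)]
```

Films C and D vanish from the result because an inner join keeps only movies that have both a gross and a budget figure.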

<!--
BEGIN QUESTION
name: q2biii
points: 2
-->

In [41]:
%%sql result_2biii <<

DROP VIEW IF EXISTS movie_profit;
CREATE VIEW movie_profit AS
(SELECT mg.movie_id, mg.title, gross - budget AS profit
FROM movie_gross AS mg, movie_budget AS mb
WHERE mg.movie_id = mb.movie_id
);

SELECT *
FROM movie_profit
ORDER BY profit DESC
LIMIT 20

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
20 rows affected.
Returning data to local variable result_2biii


In [42]:
grader.check("q2biii")

<!-- BEGIN QUESTION -->

### Question 2biv:

We analyzed the data but something seems odd. On a closer look, there are many negative values for `profit`. For example, the movie `102 Dalmatians` appears to have lost ~$18M, but it was a widely successful film! What may account for this issue?

<!--
BEGIN QUESTION
name: q2biv
manual: true
points: 1
-->

Because we are only considering movies with gross earnings counted in the USA, we ignore sales and box office performance beyond the USA audience. For one, we need to account for international sales by incorporating non-USA countries. In addition, we need to normalize sales by converting every currency to dollars so that performance is neither inflated nor deflated, since some countries have stronger currencies while others have weaker ones. We also need to account for movies with higher budgets and whether the domestic box office alone would be enough to recoup them; including total worldwide sales may resolve this issue. With this approach, we will have a much better idea of how well 102 Dalmatians and other movies with negative profits actually performed at the box office. 

<!-- END QUESTION -->



# Question 3: Using Cleaned Data

Now that we have gone through all the work of cleaning our financials from `info` in `movie_info_sample`, let's take a closer look at the data we generated. 

In [43]:
# delete cells below

In [44]:
%%sql
Select * from info_type
where info like '%genre%';

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
2 rows affected.


id,info
3,genres
62,LD group genre


In [45]:
%%sql
Select distinct info from movie_info_sample
Where info_type_id = 3;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
30 rows affected.


info
Sport
Biography
Talk-Show
Thriller
Game-Show
Romance
Reality-TV
Film-Noir
Adventure
Family


In [46]:
%%sql
Select mg.movie_id, title, gross, info As genre, AVG(gross) Over (Partition By info) As genre_average
From movie_info_sample As mis, movie_gross As mg
Where mis.movie_id = mg.movie_id
And info_type_id = 3
limit 10;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
10 rows affected.


movie_id,title,gross,genre,genre_average
1736033,Blood: The Last Vampire,256681.0,Action,42123826.131625965
2415643,The Spy Next Door,24268828.0,Action,42123826.131625965
2132092,Mr. & Mrs. Smith,186336103.0,Action,42123826.131625965
2414480,The Soldier,6328816.0,Action,42123826.131625965
2379419,The Hunter,16274150.0,Action,42123826.131625965
2498494,Whiteout,10275638.0,Action,42123826.131625965
2370048,The Forbidden Kingdom,52040293.0,Action,42123826.131625965
2443506,Tora! Tora! Tora!,14500000.0,Action,42123826.131625965
2380050,The Incredible Hulk,134518390.0,Action,42123826.131625965
2441738,Tombstone,56505065.0,Action,42123826.131625965


In [47]:
# delete cells above

### Question 3a: Earnings per Genre

Another `info_type` available to us is the movie genre. Looking at the `movie_gross` values, how much does each genre earn on average from the US?

Create a view with the columns `movie_id`, `title`, `gross`, `genre`, and `average_genre` where `gross` is a movie's gross US earnings, `genre` is the movie's genre, and `average_genre` is the average earnings for the corresponding genre. If a movie has multiple genres, the movie should appear in multiple rows with each genre as a row.

Following the view definition, write a `SELECT` query to return the rows for the movie "Mr. & Mrs. Smith" ordered by genre alphabetically.

- HINT: Look into [window functions](https://www.postgresql.org/docs/9.1/tutorial-window.html)
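What distinguishes a window function from `GROUP BY` is that every input row survives and simply gains the aggregate for its partition. The pure-Python sketch below mimics that behavior on hypothetical rows; it illustrates the concept and is not a substitute for the SQL view.

```python
from collections import defaultdict

# Hypothetical (title, genre, gross) rows; a film with two genres appears twice.
rows = [
    ("Film A", "Action", 100.0),
    ("Film A", "Comedy", 100.0),
    ("Film B", "Action", 300.0),
    ("Film C", "Comedy", 50.0),
]

# First pass: accumulate a total and a count per genre (the "partition").
totals = defaultdict(float)
counts = defaultdict(int)
for _, genre, gross in rows:
    totals[genre] += gross
    counts[genre] += 1

# Second pass: keep every row and attach its genre's average.
with_average = [
    (title, genre, gross, totals[genre] / counts[genre])
    for title, genre, gross in rows
]

for row in with_average:
    print(row)
```

The output has the same number of rows as the input, whereas a `GROUP BY genre` would collapse it to one row per genre.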

<!--
BEGIN QUESTION
name: q3a
points: 2
-->

In [48]:
%%sql result_3a <<

DROP VIEW IF EXISTS movie_avg_genre;
CREATE VIEW movie_avg_genre AS
(SELECT mg.movie_id, title, gross, info As genre, AVG(gross) OVER (PARTITION BY info) As average_genre
FROM movie_info_sample As mis, movie_gross As mg
WHERE mis.movie_id = mg.movie_id
AND info_type_id = 3);

SELECT *
FROM movie_avg_genre
WHERE title LIKE '%Mr. & Mrs. Smith%'
ORDER BY genre

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
3 rows affected.
Returning data to local variable result_3a


In [49]:
grader.check("q3a")

### Question 3b: Analyzing Gross Earnings

A good way to view numerical data is through a histogram. The histogram shows the spread of the data along with several other key attributes that allow for the data to be analyzed. 

We went through a lot of work transforming the gross earnings `info` string into a numerical value. Because of our hard work, we can now look more closely at this data and understand what it looks like. We will generate a [five-number summary](https://en.wikipedia.org/wiki/Five-number_summary) and the average of the US gross earnings data to do this.

Create a view of a one row summary of the `movie_gross` `gross` data with the `min`, `25th_percentile`, `median`, `75th_percentile`, `max`, and `average`. Following the view definition, display it.

- HINT: Look at SQL [aggregate functions](https://www.postgresql.org/docs/9.4/functions-aggregate.html). You may find some useful.
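For intuition about continuous percentiles, the sketch below computes a five-number summary in Python using linear interpolation between neighboring sorted values, which is the usual definition of a continuous percentile such as PostgreSQL's `percentile_cont`. The data and the helper name are hypothetical, and the helper assumes a non-empty list.

```python
def percentile_cont(values, fraction):
    """Continuous percentile: interpolate linearly between sorted neighbors."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional index
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Hypothetical gross values, not taken from the IMDB data.
gross = [10.0, 20.0, 30.0, 40.0]

summary = {
    "min": min(gross),
    "25th_percentile": percentile_cont(gross, 0.25),
    "median": percentile_cont(gross, 0.50),
    "75th_percentile": percentile_cont(gross, 0.75),
    "max": max(gross),
    "average": sum(gross) / len(gross),
}
print(summary)
```

With four values, the 25th percentile falls three quarters of the way from 10 to 20, so the summary reports 17.5 rather than one of the original data points.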

<!--
BEGIN QUESTION
name: q3b
points: 2
-->

In [50]:
%%sql result_3b <<

DROP VIEW IF EXISTS earnings_summary;
CREATE VIEW earnings_summary AS
(SELECT MIN(gross) AS min, Percentile_Cont(0.25) Within Group (Order By gross) AS "25th_percentile", 
Percentile_Cont(0.50) Within Group (Order By gross) AS "median", 
Percentile_Cont(0.75) Within Group (Order By gross) AS "75th_percentile",
MAX(gross) As max, AVG(gross) AS average
FROM movie_gross);

SELECT *
FROM earnings_summary

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
1 rows affected.
Returning data to local variable result_3b


In [51]:
grader.check("q3b")

<!-- BEGIN QUESTION -->

### Question 3c

What do you notice about the summary values generated in `earnings_summary`? Identify two properties about the histogram of the data. 
- HINT: Think in terms of concepts from statistics like spread, modality, skew, etc. and how they may apply here.

<!--
BEGIN QUESTION
name: q3c
manual: true
points: 1
-->

In [52]:
result_3b.DataFrame()

Unnamed: 0,min,25th_percentile,median,75th_percentile,max,average
0,30.0,166623.0,2317091.0,20002717.5,760507625.0,19594420.0


* Because the mean is much greater than the median of the gross sales, the histogram will be skewed right. This means there will be a longer right tail, stretched by outliers that pull up the average gross value.
* There is a sizable amount of spread and variability in the data, as we can see from the range (max - min) as well as the interquartile range of about 19.8 million dollars, computed from the 75th and 25th percentile values. At each quartile (from the 25th to the 50th to the 75th), the corresponding value jumps by roughly a factor of 10: from about 200,000 to 2 million to 20 million.
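The figures quoted in these observations can be checked with a few lines of arithmetic on the summary row shown above. This is just a quick sanity check using the displayed values, not part of the graded answer.

```python
# Values copied from the earnings_summary output displayed above.
summary = {
    "min": 30.0,
    "25th_percentile": 166623.0,
    "median": 2317091.0,
    "75th_percentile": 20002717.5,
    "max": 760507625.0,
    "average": 19594420.0,
}

iqr = summary["75th_percentile"] - summary["25th_percentile"]
value_range = summary["max"] - summary["min"]
mean_to_median = summary["average"] / summary["median"]

print(iqr)          # 19836094.5
print(value_range)  # 760507595.0
print(round(mean_to_median, 2))
```

A mean several times larger than the median is the numeric signature of the right skew described in the first bullet.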

In [53]:
# delete cells below

In [54]:
%%sql
Select actor_sample.id, name, count(*)
From actor_sample Right Join cast_sample
On actor_sample.id = cast_sample.person_id
Where gender Like 'm'
Group By name, actor_sample.id
Order By count Desc
Limit 10;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
10 rows affected.


id,name,count
95397,"Barker, Bob",6853
515315,"Freeman, Morgan",5938
677696,"Hinnant, Skip",4697
1573853,"Trebek, Alex",4690
1362169,"Sajak, Pat",3937
1417394,"Shaffer, Paul",3546
911160,"Lima, Pedro",2911
900749,"Letterman, David",2895
487253,"Filipe, Guilherme",2861
356575,"Davidson, Doug",2760


In [55]:
%%sql
Select *
From actor_sample
Limit 10;

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
10 rows affected.


id,name,gender
2591445,"Taylor, Joan",f
640876,"Harris, R.H.",m
614937,"Gélinas, Jean-Maurice",m
987924,"Martins, Pedro",m
104379,"Bass, Monty",m
66819,"Atanasov, Grudi",m
1309781,"Rios, Raul",m
497885,"Flouw, Jonathan",m
484858,"Ficarra, Amedeo",m
2356357,"Nance, Nichole",f


<!-- END QUESTION -->

# Question 4: Joins

Joins are a powerful tool in database cleaning. They allow the user to create useful tables and bring together information in a meaningful way. 

There are many types of joins: inner, outer, left, right, etc. Let's practice these in a special scenario. 

You are now working as a talent director and you need a list of everyone with an `actor` role and the number of movies in which they have acted. 

Create a view called `number_movies` with columns `id`, `name`, and `number`, where `id` is the actor's id, `name` is the actor's name, and `number` is the number of movies they have acted in. 

NOTE: The `cast_sample` may include actors not included in `actor_sample` table. We still want to include these actors in our result by reference to their id. The `name` field can be NULL.

Following your view, display the `id`, `name`, and `number` of films for the top 10 actors who have been in the most films.
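To see why the join type matters for the NOTE above, the sketch below builds two tiny hypothetical tables with Python's built-in `sqlite3` module (standing in for PostgreSQL) and compares a `LEFT JOIN` with an inner `JOIN`. Person 3 appears in the cast table but has no entry in the actor table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical miniature versions of the project tables.
    CREATE TABLE actor_sample (id INTEGER, name TEXT);
    CREATE TABLE cast_sample (person_id INTEGER, movie_id INTEGER);
    INSERT INTO actor_sample VALUES (1, 'Actor One'), (2, 'Actor Two');
    INSERT INTO cast_sample VALUES
        (1, 10), (1, 11), (2, 10), (3, 12), (3, 13), (3, 14);
""")

query = """
    SELECT c.person_id AS id, a.name, COUNT(*) AS number
    FROM cast_sample AS c
    LEFT JOIN actor_sample AS a ON a.id = c.person_id
    GROUP BY c.person_id, a.name
    ORDER BY number DESC, c.person_id
"""

left_rows = conn.execute(query).fetchall()
# Same query with an inner join, for comparison.
inner_rows = conn.execute(query.replace("LEFT JOIN", "JOIN")).fetchall()
conn.close()

print(left_rows)   # [(3, None, 3), (1, 'Actor One', 2), (2, 'Actor Two', 1)]
print(inner_rows)  # [(1, 'Actor One', 2), (2, 'Actor Two', 1)]
```

Two details keep person 3 in the result: the id is taken from the cast table rather than the actor table, and no filter is applied to actor-table columns, since a `WHERE` condition on a column that is NULL for unmatched rows would discard them again.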

<!--
BEGIN QUESTION
name: q4
points: 2
-->

In [56]:
%%sql result_4 <<

DROP VIEW IF EXISTS number_movies;
CREATE VIEW number_movies AS 
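(SELECT cast_sample.person_id AS id, name, count(*) AS number
FROM cast_sample LEFT JOIN actor_sample
ON actor_sample.id = cast_sample.person_id
-- Filter on cast_sample so people missing from actor_sample survive the LEFT JOIN.
-- Assumes role_id 1 is the actor role in role_type.
WHERE cast_sample.role_id = 1
GROUP BY cast_sample.person_id, name
ORDER BY number DESC);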

SELECT *
FROM number_movies
ORDER BY number DESC
LIMIT 10

 * postgresql://jovyan@127.0.0.1:5432/imdb
   postgresql://jovyan@127.0.0.1:5432/postgres
Done.
Done.
10 rows affected.
Returning data to local variable result_4


In [57]:
grader.check("q4")

# DONE!

Great job! You finished your first project in Data Engineering. To submit, please run the `grader.check_all()` cell below and save your notebook. After that, submit your `proj1.ipynb` file to Gradescope. Make sure that we can see the results of your `grader.check_all()` cell on your Gradescope submission. That's all!

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [58]:
grader.check_all()