# Advanced Aggregates

Remember to run `EXPLAIN` on a query before you execute it. Checking the plan first helps avoid unnecessary load on the DBMS and long waits for results.

For that reason, each question has one cell for the `EXPLAIN` and one for the final SQL.


## Our practice schema:

We will use the DVD Rental database.

A PDF of the _Entity-Relationship Diagrams_ (ERD) is available [here](https://web.dsa.missouri.edu/static/PDF/DVD_Rental_ERD2.pdf).   
Printing it out is recommended.


**NOTE**: These queries are more complex than the others.
If you get stuck on one, skip it and come back to it later.

**NOTE**: For this notebook, please construct your solutions using advanced aggregates and derived tables.

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dvdrental

'Connected: dsa_ro_user@dvdrental'

### 1
### What are the average, variance, and standard deviation of film length?


In [2]:
%%sql
EXPLAIN
SELECT avg(length) as avg_length, var_pop(length) as var_length, stddev(length) as std_length
from film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


QUERY PLAN
Aggregate (cost=71.51..71.52 rows=1 width=96)
-> Seq Scan on film (cost=0.00..64.00 rows=1000 width=2)


In [3]:
%%sql
SELECT avg(length) as avg_length, var_pop(length) as var_length, stddev(length) as std_length
from film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1 rows affected.


avg_length,var_length,std_length
115.272,1632.654016,40.426331818559845
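
Note that the query mixes two conventions. `var_pop` is the population variance (divides by n), while `stddev` in PostgreSQL is an alias for `stddev_samp`, the sample standard deviation (divides by n - 1). That is why the reported 40.426 is not the square root of 1632.654 (about 40.406). Pairing `var_pop` with `stddev_pop`, or `var_samp` with `stddev`, keeps the two columns consistent. A minimal sketch of the difference in Python, using made-up lengths rather than values from the database:

```python
import math
import statistics

# Hypothetical film lengths in minutes (toy data, not taken from dvdrental).
lengths = [86, 48, 50, 117, 130, 169, 62, 54]

var_pop = statistics.pvariance(lengths)  # like SQL var_pop: divides by n
std_pop = statistics.pstdev(lengths)     # like SQL stddev_pop: divides by n
std_samp = statistics.stdev(lengths)     # like SQL stddev / stddev_samp: divides by n - 1

# The population pair agrees; mixing population variance with sample stddev does not.
print("stddev_pop squared equals var_pop: ", math.isclose(std_pop ** 2, var_pop))
print("stddev_samp squared equals var_pop:", math.isclose(std_samp ** 2, var_pop))
```

With 1000 films the gap is small, but it is still visible in the last digits of the result.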


### 2
### What are the average, variance, and standard deviation of film length, broken down by film category?

In [4]:
%%sql
EXPLAIN
SELECT name, avg(length) as avg_length, var_pop(length) as var_length, stddev(length) as std_length
from film
JOIN film_category
  ON (film_category.film_id = film.film_id)
JOIN category
  ON (film_category.category_id = category.category_id)
GROUP BY name
ORDER BY name ASC;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
13 rows affected.


QUERY PLAN
Sort (cost=110.41..110.45 rows=16 width=164)
Sort Key: category.name
-> HashAggregate (cost=109.81..110.09 rows=16 width=164)
Group Key: category.name
-> Hash Join (cost=77.86..99.81 rows=1000 width=70)
Hash Cond: (film_category.category_id = category.category_id)
-> Hash Join (cost=76.50..95.14 rows=1000 width=4)
Hash Cond: (film_category.film_id = film.film_id)
-> Seq Scan on film_category (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=64.00..64.00 rows=1000 width=6)


In [5]:
%%sql
SELECT name, avg(length) as avg_length, var_pop(length) as var_length, stddev(length) as std_length
from film
JOIN film_category USING (film_id)
JOIN category USING (category_id)
GROUP BY name
ORDER BY name ASC;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
16 rows affected.


name,avg_length,var_length,std_length
Action,111.609375,1819.4880371093752,42.99265983401323
Animation,111.01515151515152,1696.984618916437,41.51014423718706
Children,109.8,1475.8933333333332,38.74156004314064
Classics,111.66666666666666,1449.345029239766,38.40867337563471
Comedy,115.82758620689656,1750.6254458977407,42.2059021111829
Documentary,108.75,1787.9816176470588,42.598919122998424
Drama,120.83870967741936,1631.6514047866804,40.72345501638715
Family,114.78260869565216,1501.1556395715183,39.02859794817784
Foreign,121.6986301369863,1836.4023268905985,43.14983099345905
Games,127.83606557377048,1241.7764041924213,35.531291527266895


[Helpful Hints Video](https://youtu.be/jy9H2KLI4Iw) 

### 3
### A movie's "cumulative rented duration" is the total duration of all its rentals in the rental table. What is the average _cumulative rented duration_ per store (inventory.store_id)?

In [None]:
%%sql
EXPLAIN
SELECT s.store_id, avg(x.cumulative) as avg_cumul_rented_duration
from store s
JOIN address USING (address_id)
JOIN customer USING (address_id)
NATURAL JOIN (
    SELECT inventory_id, sum(r.return_date - r.rental_date) as cumulative
    FROM rental r 
    WHERE r.rental_date IS NOT NULL
    GROUP BY inventory_id) as x
GROUP BY s.store_id;

In [53]:
%%sql
SELECT s.store_id, avg(x.cumulative) as avg_cumul_rented_duration
from store s
JOIN address USING (address_id)
JOIN customer USING (address_id)
NATURAL JOIN (
    SELECT inventory_id, sum(r.return_date - r.rental_date) as cumulative
    FROM rental r 
    WHERE r.rental_date IS NOT NULL
    GROUP BY inventory_id) as x
GROUP BY s.store_id;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
0 rows affected.


store_id,avg_cumul_rented_duration


In [51]:
%%sql
SELECT inventory_id, sum(r.return_date - r.rental_date) as cumulative
    FROM rental r 
    WHERE r.rental_date IS NOT NULL
    GROUP BY inventory_id

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
4580 rows affected.


inventory_id,cumulative
1489,"32 days, 21:19:00"
273,"32 days, 19:37:00"
3936,"9 days, 18:37:00"
2574,"30 days, 2:49:00"
951,"21 days, 22:07:00"
4326,"15 days, 23:59:00"
2614,"13 days, 23:38:00"
2520,"23 days, 11:50:00"
2466,"19 days, 1:20:00"
2196,"11 days, 11:25:00"


[Helpful Hints Video](https://youtu.be/Scyn7exzUcY)  

### 4
### Which three categories of film have the highest average number of actors per film?

In [47]:
%%sql
EXPLAIN
SELECT c.name, avg(x.count_actors) as avg_num_actors
from category c
NATURAL JOIN (
    SELECT film_id, count(actor_id) as count_actors
    from film_actor
    JOIN film USING (film_id)
    JOIN film_category USING (film_id)
    JOIN category USING (category_id)
    GROUP BY film_id
    )  AS x
GROUP BY c.name
ORDER BY avg_num_actors DESC
LIMIT 3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
23 rows affected.


QUERY PLAN
Limit (cost=600.95..600.96 rows=3 width=100)
-> Sort (cost=600.95..600.99 rows=16 width=100)
Sort Key: (avg((count(film_actor.actor_id)))) DESC
-> HashAggregate (cost=600.54..600.74 rows=16 width=100)
Group Key: c.name
-> Nested Loop (cost=299.34..520.54 rows=16000 width=76)
-> HashAggregate (cost=299.34..309.34 rows=1000 width=12)
Group Key: film.film_id
-> Hash Join (cost=112.31..272.03 rows=5462 width=6)
Hash Cond: (film_actor.film_id = film.film_id)


In [69]:
%%sql
SELECT c.name, avg(x.count_actors) as avg_num_actors
from category c
NATURAL JOIN (
    SELECT film_id, count(actor_id) as count_actors
    FROM film_actor
    JOIN film USING (film_id)
    JOIN film_category USING (film_id)
    JOIN category USING (category_id)
    GROUP BY film_id
    )  AS x
GROUP BY c.name
ORDER BY avg_num_actors DESC
LIMIT 3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
3 rows affected.


name,avg_num_actors
Games,5.478435305917753
Animation,5.478435305917753
Family,5.478435305917753
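
Every category reports the same average, which is a sign that the join is not doing what was intended. The derived table `x` exposes only `film_id` and `count_actors`, and `category` has neither column, so `NATURAL JOIN` finds no shared column and degenerates into a cross join: every film is paired with every category. A minimal sketch of the problem and one possible fix, assuming an in-memory SQLite database with hypothetical toy rows (the table and column names follow the ERD, the data does not):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    -- Hypothetical toy rows: film 10 (Action) has 3 actors, film 20 (Comedy) has 1.
    INSERT INTO category VALUES (1, 'Action'), (2, 'Comedy');
    INSERT INTO film_category VALUES (10, 1), (20, 2);
    INSERT INTO film_actor VALUES (1, 10), (2, 10), (3, 10), (1, 20);
""")

# Same shape as the query above: x shares no column with category.
buggy = """
    SELECT c.name, avg(x.count_actors) AS avg_num_actors
    FROM category c
    NATURAL JOIN (
        SELECT film_id, count(actor_id) AS count_actors
        FROM film_actor
        GROUP BY film_id
    ) AS x
    GROUP BY c.name
    ORDER BY c.name;
"""

# Carry category_id out of the derived table and join on it explicitly.
fixed = """
    SELECT c.name, avg(x.count_actors) AS avg_num_actors
    FROM category c
    JOIN (
        SELECT fc.category_id, fa.film_id, count(fa.actor_id) AS count_actors
        FROM film_actor fa
        JOIN film_category fc USING (film_id)
        GROUP BY fc.category_id, fa.film_id
    ) AS x USING (category_id)
    GROUP BY c.name
    ORDER BY avg_num_actors DESC;
"""

buggy_rows = conn.execute(buggy).fetchall()
fixed_rows = conn.execute(fixed).fetchall()
print("cross join:   ", buggy_rows)
print("explicit join:", fixed_rows)
```

On the toy data the cross join gives both categories an average of 2 actors, while the explicit join gives 3 for Action and 1 for Comedy. The same change should carry over to the PostgreSQL query, with `LIMIT 3` added back to keep the top three categories.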


### 5
### For each staff member, list their average daily payment amount processed.

In [67]:
%%sql
EXPLAIN
SELECT s.first_name, s.last_name, avg(x.sum_amount) as avg_daily_payment
from staff s
NATURAL JOIN (
    SELECT p.payment_date::date, sum(p.amount) as sum_amount, p.staff_id
    from payment p
    group by payment_date::date, staff_id
    ) as x
GROUP BY s.first_name, s.last_name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
11 rows affected.


QUERY PLAN
GroupAggregate (cost=803.76..805.23 rows=2 width=248)
"Group Key: s.first_name, s.last_name"
-> Sort (cost=803.76..804.12 rows=144 width=248)
"Sort Key: s.first_name, s.last_name"
-> Hash Join (cost=400.96..798.60 rows=144 width=248)
Hash Cond: (p.staff_id = s.staff_id)
-> HashAggregate (cost=399.92..615.40 rows=14365 width=38)
"Group Key: (p.payment_date)::date, p.staff_id"
-> Seq Scan on payment p (cost=0.00..290.45 rows=14596 width=12)
-> Hash (cost=1.02..1.02 rows=2 width=220)


In [66]:
%%sql
SELECT s.first_name, s.last_name, avg(x.sum_amount) as avg_daily_payment
from staff s
NATURAL JOIN (
    SELECT p.payment_date::date, sum(p.amount) as sum_amount, p.staff_id
    from payment p
    group by payment_date::date, staff_id
    ) as x
GROUP BY s.first_name, s.last_name;


 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


first_name,last_name,avg_daily_payment
Jon,Stephens,970.6225
Mike,Hillyer,945.37875


### 6
### What is the statistical correlation between film length and rental rate?

In [40]:
%%sql
EXPLAIN
SELECT corr(length, rental_rate)
FROM film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


QUERY PLAN
Aggregate (cost=71.50..71.51 rows=1 width=8)
-> Seq Scan on film (cost=0.00..64.00 rows=1000 width=8)


In [39]:
%%sql
SELECT corr(length, rental_rate)
FROM film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1 rows affected.


corr
0.0297892586459086


[Helpful Hints Video](https://youtu.be/3d2vgLn9KVs)  

# Save your Notebook, then `File > Close and Halt`