# IMDB Window Functions Lab

### Introduction

In this lesson, we'll practice working with window functions to aggregate our data.  Let's get started.

### Loading our Data

Let's begin by loading our data, which consists of a relational database of various movies and related ratings and actors.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/data-eng-10-21/window-functions/main/data/"
# url already ends with a slash, so the file name is appended directly
movies_df = pd.read_csv(f'{url}movies.csv')
names_df = pd.read_csv(f'{url}names.csv')
ratings_df = pd.read_csv(f'{url}ratings.csv')
title_principals_df = pd.read_csv(f'{url}title_principals.csv')


In [3]:
import sqlite3
conn = sqlite3.connect('imdb.db')

Now we can create our various tables.

In [4]:
movies_df.to_sql('movies', conn, index = False, if_exists = 'replace')

In [14]:
names_df.to_sql('names', conn, index = False, if_exists = 'replace')

In [15]:
ratings_df.to_sql('ratings', conn, index = False, if_exists = 'replace')

In [16]:
title_principals_df.to_sql('movie_roles', conn, index = False, if_exists = 'replace')
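
To confirm the tables were written, one option is to query SQLite's built-in `sqlite_master` catalog. The sketch below is only an illustration: it uses an in-memory database and hypothetical one-column placeholder tables instead of the real `imdb.db`, so it runs on its own.

```python
import sqlite3

# In-memory stand-in for imdb.db (assumption: the lab itself uses the file created above).
demo_conn = sqlite3.connect(":memory:")
for table in ["movies", "names", "ratings", "movie_roles"]:
    # Placeholder schema; in the lab the real tables come from DataFrame.to_sql.
    demo_conn.execute(f"CREATE TABLE {table} (id INTEGER)")

# sqlite_master lists every object in the database; keep only the tables.
rows = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
```

Against the lab database, the same `SELECT` can be passed to `pd.read_sql(..., conn)`.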

### Using window functions

In [19]:
pd.read_sql('SELECT * FROM movies LIMIT 1;', conn)

Unnamed: 0,imdb_title_id,title,year,date_published,genre,duration,country,language,director,writer,budget,worlwide_gross_income,metascore,income
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,,,,


Let's begin by selecting movies released after the year 2000. For each movie, include the title, the movie's runtime (`duration`), and the average duration of all movies released in the same year (`avg_duration`). Order the movies by year and duration.

In [24]:
query = """
"""

pd.read_sql(query, conn)

Unnamed: 0,title,duration,avg_duration
0,Kai Doh Maru,45,102.467865
1,Wave Twisters,46,102.467865
2,The Yellow Sign,47,102.467865
3,China: The Panda Adventure,48,102.467865
4,Lay It Down,50,102.467865
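
The pattern behind this exercise can be sketched on a toy table. This is a minimal illustration, not the lab's reference solution: the four rows below are made up, and the example assumes the SQLite library bundled with Python is version 3.25 or later, which is when window functions were added. `AVG(duration) OVER (PARTITION BY year)` attaches the yearly average to every row without collapsing the rows the way `GROUP BY` would.

```python
import sqlite3

# Toy data (hypothetical); the real lab queries the movies table in imdb.db.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, duration INTEGER)")
toy_conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("A", 1999, 90),   # dropped by the WHERE clause
        ("B", 2001, 80),
        ("C", 2001, 100),
        ("D", 2002, 120),
    ],
)

query = """
SELECT title,
       duration,
       AVG(duration) OVER (PARTITION BY year) AS avg_duration
FROM movies
WHERE year > 2000
ORDER BY year, duration;
"""
# Each row keeps its own duration and also carries its year's average.
yearly_rows = toy_conn.execute(query).fetchall()
for row in yearly_rows:
    print(row)
```

The same query string can be run against the lab database with `pd.read_sql(query, conn)`.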


Next, select each movie's year, title, and duration, and calculate the average duration (`avg_length`) across movies of the same year and genre.  Then calculate how far each movie's runtime deviates from that average (`length_minus_avg`).

Select only those movies made after 2018 whose genre is `Drama` or `Comedy`, and order the results by year, genre, and duration (longest first, as in the expected output).

In [42]:
query = """
"""

df = pd.read_sql(query, conn)
df

Unnamed: 0,year,title,duration,avg_length,length_minus_avg
0,2019,Present Laughter,180,99.968627,80.031373
1,2019,Rangeela Raja,162,99.968627,62.031373
2,2019,F2: Fun and Frustration,148,99.968627,48.031373
3,2019,ABCD: American-Born Confused Desi,145,99.968627,45.031373
4,2019,Takatak,144,99.968627,44.031373
...,...,...,...,...,...
890,2020,Domangchin yeoja,77,108.096154,-31.096154
891,2020,Zima,76,108.096154,-32.096154
892,2020,Betta Fish,76,108.096154,-32.096154
893,2020,A Stormy Night,75,108.096154,-33.096154
894,2020,Ar Condicionado,72,108.096154,-36.096154
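
Partitioning by two columns and subtracting the window average can be sketched on made-up rows as follows. This is an illustration under stated assumptions rather than the lab's solution: the six toy movies are hypothetical, and the `genre` filter matches the exact strings `'Drama'` and `'Comedy'`.

```python
import sqlite3

# Toy data (hypothetical); column names mirror the lab's movies table.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE movies (title TEXT, year INTEGER, genre TEXT, duration INTEGER)"
)
toy_conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        ("A", 2019, "Comedy", 100),
        ("B", 2019, "Comedy", 80),
        ("C", 2019, "Drama", 120),
        ("D", 2020, "Drama", 90),
        ("E", 2020, "Horror", 70),  # dropped: genre is not Drama or Comedy
        ("F", 2018, "Drama", 95),   # dropped: year is not after 2018
    ],
)

query = """
SELECT year,
       title,
       duration,
       AVG(duration) OVER (PARTITION BY year, genre) AS avg_length,
       duration - AVG(duration) OVER (PARTITION BY year, genre) AS length_minus_avg
FROM movies
WHERE year > 2018 AND genre IN ('Drama', 'Comedy')
ORDER BY year, genre, duration DESC;
"""
# The window average is computed per (year, genre) pair, after the WHERE filter.
deviation_rows = toy_conn.execute(query).fetchall()
for row in deviation_rows:
    print(row)
```

Because the `WHERE` clause is applied before the window is evaluated, the averages only cover the rows that survive the filter.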


### Window functions with GROUP BY

Now use SQL to reproduce the table shown in the image below.

> We'll explain more below.

<img src="./data.png" width="80%">

In the image above, the data is restricted to years after 2015, and there is one row for each year and genre.  For each year and genre, we calculate the average length of the movies and the number of movies in that genre for that year, and then also calculate the average length of movies for that year across all genres.

In [63]:
query = """

"""
pd.read_sql(query, conn)

Unnamed: 0,year,genre,avg_duration_per_genre,num_movies,avg_per_year
0,2016,Drama,101.268456,447,102.954704
1,2016,Comedy,98.249135,289,102.954704
2,2016,"Comedy, Drama",102.988166,169,102.954704
3,2016,"Drama, Romance",104.401786,112,102.954704
4,2016,Horror,89.145631,103,102.954704
...,...,...,...,...,...
1374,2020,Musical,92.000000,1,107.808140
1375,2020,Mystery,140.000000,1,107.808140
1376,2020,"Mystery, Sci-Fi, Thriller",97.000000,1,107.808140
1377,2020,Sport,104.000000,1,107.808140
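
One possible way to combine a `GROUP BY` with a window function is to aggregate per year and genre in a common table expression and then apply the window over those grouped rows. The sketch below uses made-up rows and is not the lab's reference solution. In particular, the lab does not say whether `avg_per_year` weights every movie equally or averages the per-genre averages; this sketch assumes the former, and the helper column `total_duration` is an illustrative name.

```python
import sqlite3

# Toy data (hypothetical); the real lab queries the movies table in imdb.db.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE movies (year INTEGER, genre TEXT, duration INTEGER)")
toy_conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (2015, "Drama", 80),    # dropped by the WHERE clause
        (2016, "Drama", 100),
        (2016, "Drama", 110),
        (2016, "Comedy", 90),
        (2017, "Drama", 120),
    ],
)

query = """
WITH per_genre AS (
    -- one row per (year, genre), like an ordinary GROUP BY report
    SELECT year,
           genre,
           AVG(duration) AS avg_duration_per_genre,
           COUNT(*)      AS num_movies,
           SUM(duration) AS total_duration
    FROM movies
    WHERE year > 2015
    GROUP BY year, genre
)
SELECT year,
       genre,
       avg_duration_per_genre,
       num_movies,
       -- yearly average over all movies: total minutes / total movies per year
       SUM(total_duration) OVER (PARTITION BY year) * 1.0
           / SUM(num_movies) OVER (PARTITION BY year) AS avg_per_year
FROM per_genre
ORDER BY year, num_movies DESC;
"""
grouped_rows = toy_conn.execute(query).fetchall()
for row in grouped_rows:
    print(row)
```

Replacing the last expression with `AVG(avg_duration_per_genre) OVER (PARTITION BY year)` would instead average the genre averages, which gives a different number whenever genres have different movie counts.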


### Summary

In this lesson, we practiced working with window functions.  As we saw, we can use a window function to partition our data by multiple columns.  And we saw that we can use a window function to calculate how a movie's runtime differs from the average for its group.