# CS-6570 Homework 1

*Weber State University*

**Your Name**: Rob Christiansen


The goal of this first assignment is to give you some experience working with Pandas dataframes. The assignment is focused on one of the most important, if not always the most exciting, tasks in industry data science - exploring, transforming, and preparing data.

For this assignment, we'll be working with the [Amazon Prime Video Data](https://www.kaggle.com/datasets/victorsoeiro/amazon-prime-tv-shows-and-movies), which you can download from Canvas. It consists of two files, one containing movie titles and associated data about the movies, and the other containing movie credits and associated data about the crew members. The linked Amazon Prime Video Data website provides a data dictionary.

The first thing we'll want to do is import our usual libraries - Numpy and Pandas, which you can do by running the code below.

In [1]:
import numpy as np
import pandas as pd

Next, you should download the datasets (credits.csv and titles.csv) and import them as the respective Pandas dataframes credits_df and titles_df.

In [2]:
# TODO: Import the credits.csv dataset as a dataframe named credits_df, and the titles.csv dataset as a dataframe named titles_df.
credits_df = pd.read_csv('Datasets/credits.csv')
titles_df = pd.read_csv('Datasets/titles.csv')

As an initial exploration, display the first 10 rows of titles_df.

In [3]:
# TODO: Display the first 10 rows of titles_df.
titles_df.head(10)


Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,TV-PG,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],26.0,tt0850645,8.6,1092.0,15.424,7.6
1,tm19248,The General,MOVIE,"During America’s Civil War, Union spies steal ...",1926,,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],,tt0017925,8.2,89766.0,8.647,8.0
2,tm82253,The Best Years of Our Lives,MOVIE,It's the hope that sustains the spirit of ever...,1946,,171,"['romance', 'war', 'drama']",['US'],,tt0036868,8.1,63026.0,8.435,7.8
3,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,,92,"['comedy', 'drama', 'romance']",['US'],,tt0032599,7.8,57835.0,11.27,7.4
4,tm56584,In a Lonely Place,MOVIE,An aspiring actress begins to suspect that her...,1950,,94,"['thriller', 'drama', 'romance']",['US'],,tt0042593,7.9,30924.0,8.273,7.6
5,tm160494,Stagecoach,MOVIE,A group of people traveling on a stagecoach fi...,1939,,96,"['western', 'drama']",['US'],,tt0031971,7.8,48149.0,11.786,7.7
6,tm87233,It's a Wonderful Life,MOVIE,A holiday favourite for generations... George...,1946,PG,130,"['drama', 'family', 'fantasy', 'romance', 'com...",['US'],,tt0038650,8.6,444243.0,26.495,8.3
7,tm19424,Detour,MOVIE,"The life of Al Roberts, a pianist in a New Yor...",1945,,66,"['thriller', 'drama', 'crime']",['US'],,tt0037638,7.3,17233.0,7.757,7.2
8,tm116781,My Man Godfrey,MOVIE,"Fifth Avenue socialite Irene Bullock needs a ""...",1936,,95,"['comedy', 'romance', 'drama']",['US'],,tt0028010,8.0,23532.0,8.633,7.6
9,tm112005,Marihuana,MOVIE,A young girl named Burma attends a beach party...,1936,,57,"['crime', 'drama']",['US'],,tt0026683,4.0,864.0,3.748,3.6


Next, we'll examine the data types in our titles dataframe.

In [4]:
#TODO: Display the data types in the titles_df dataframe.
titles_df.dtypes

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

We're only going to want to use the *id, title, type, release_year, runtime, genres, production_countries, imdb_score, imdb_votes, tmdb_popularity*, and *tmdb_score* columns. So, drop all other columns from titles_df.

In [5]:
#TODO: Drop all columns from titles_df except id, title, type, release_year, runtime, genres, production_countries, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score.
titles_df = titles_df[['id', 'title', 'type', 'release_year', 'runtime', 'genres', 'production_countries', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']] 
# Rob: Double brackets will drop all columns except those list
titles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9871 entries, 0 to 9870
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    9871 non-null   object 
 1   title                 9871 non-null   object 
 2   type                  9871 non-null   object 
 3   release_year          9871 non-null   int64  
 4   runtime               9871 non-null   int64  
 5   genres                9871 non-null   object 
 6   production_countries  9871 non-null   object 
 7   imdb_score            8850 non-null   float64
 8   imdb_votes            8840 non-null   float64
 9   tmdb_popularity       9324 non-null   float64
 10  tmdb_score            7789 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 848.4+ KB


Now we'll want to clean up our dataframe a bit. First, drop all rows with any null values.

In [6]:
#TODO: Drop all rows in titles_df containing any null values.
titles_df = titles_df.dropna(axis=0, how="any") 

By dropping these rows, the index of titles_df is no longer perfectly sequential - it skips some values.

In [7]:
titles_df

Unnamed: 0,id,title,type,release_year,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],8.6,1092.0,15.424,7.6
1,tm19248,The General,MOVIE,1926,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],8.2,89766.0,8.647,8.0
2,tm82253,The Best Years of Our Lives,MOVIE,1946,171,"['romance', 'war', 'drama']",['US'],8.1,63026.0,8.435,7.8
3,tm83884,His Girl Friday,MOVIE,1940,92,"['comedy', 'drama', 'romance']",['US'],7.8,57835.0,11.270,7.4
4,tm56584,In a Lonely Place,MOVIE,1950,94,"['thriller', 'drama', 'romance']",['US'],7.9,30924.0,8.273,7.6
...,...,...,...,...,...,...,...,...,...,...,...
9843,tm616953,Ammaa Ki Boli,MOVIE,2021,117,"['comedy', 'drama']",['IN'],7.3,1335.0,2.382,1.0
9844,tm1068475,Alleyway,MOVIE,2021,67,"['action', 'crime', 'thriller']",[],5.4,92.0,1.870,6.8
9847,tm1098070,Girls' Night In,MOVIE,2021,91,"['comedy', 'drama']",['US'],2.8,28.0,1.306,7.0
9856,tm1019060,Anbirkiniyal,MOVIE,2021,118,"['thriller', 'drama']",['IN'],6.8,361.0,2.191,7.0


We'll want to clean this up. So, reset the index, and then drop the old index column.

In [8]:
#TODO: Reset the index column of titles_df
titles_df = titles_df.reset_index()

In [9]:
titles_df

Unnamed: 0,index,id,title,type,release_year,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],8.6,1092.0,15.424,7.6
1,1,tm19248,The General,MOVIE,1926,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],8.2,89766.0,8.647,8.0
2,2,tm82253,The Best Years of Our Lives,MOVIE,1946,171,"['romance', 'war', 'drama']",['US'],8.1,63026.0,8.435,7.8
3,3,tm83884,His Girl Friday,MOVIE,1940,92,"['comedy', 'drama', 'romance']",['US'],7.8,57835.0,11.270,7.4
4,4,tm56584,In a Lonely Place,MOVIE,1950,94,"['thriller', 'drama', 'romance']",['US'],7.9,30924.0,8.273,7.6
...,...,...,...,...,...,...,...,...,...,...,...,...
7312,9843,tm616953,Ammaa Ki Boli,MOVIE,2021,117,"['comedy', 'drama']",['IN'],7.3,1335.0,2.382,1.0
7313,9844,tm1068475,Alleyway,MOVIE,2021,67,"['action', 'crime', 'thriller']",[],5.4,92.0,1.870,6.8
7314,9847,tm1098070,Girls' Night In,MOVIE,2021,91,"['comedy', 'drama']",['US'],2.8,28.0,1.306,7.0
7315,9856,tm1019060,Anbirkiniyal,MOVIE,2021,118,"['thriller', 'drama']",['IN'],6.8,361.0,2.191,7.0


In [10]:
#TODO: Drop the old index column from titles_df
titles_df = titles_df.drop(['index'], axis=1)
# Rob: Note that if you don't have the axis parameter, the function will look to the rows by default and won't find the match

Now if we examine titles_df, the indices should be sequential without any gaps.

In [11]:
titles_df

Unnamed: 0,id,title,type,release_year,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],8.6,1092.0,15.424,7.6
1,tm19248,The General,MOVIE,1926,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],8.2,89766.0,8.647,8.0
2,tm82253,The Best Years of Our Lives,MOVIE,1946,171,"['romance', 'war', 'drama']",['US'],8.1,63026.0,8.435,7.8
3,tm83884,His Girl Friday,MOVIE,1940,92,"['comedy', 'drama', 'romance']",['US'],7.8,57835.0,11.270,7.4
4,tm56584,In a Lonely Place,MOVIE,1950,94,"['thriller', 'drama', 'romance']",['US'],7.9,30924.0,8.273,7.6
...,...,...,...,...,...,...,...,...,...,...,...
7312,tm616953,Ammaa Ki Boli,MOVIE,2021,117,"['comedy', 'drama']",['IN'],7.3,1335.0,2.382,1.0
7313,tm1068475,Alleyway,MOVIE,2021,67,"['action', 'crime', 'thriller']",[],5.4,92.0,1.870,6.8
7314,tm1098070,Girls' Night In,MOVIE,2021,91,"['comedy', 'drama']",['US'],2.8,28.0,1.306,7.0
7315,tm1019060,Anbirkiniyal,MOVIE,2021,118,"['thriller', 'drama']",['IN'],6.8,361.0,2.191,7.0


Now let's play around a bit with the data types. Cast the data type of the *imdb_votes* column to be integer.

In [12]:
#TODO: Change the data type of the imdb_votes column to be integer. Check to make sure this has been done by displaying the data types of titles_df once you've done so.
titles_df['imdb_score'] = titles_df['tmdb_score'].astype(int)
titles_df.dtypes

id                       object
title                    object
type                     object
release_year              int64
runtime                   int64
genres                   object
production_countries     object
imdb_score                int64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

Next, we're going to create a numeric flag that indicates whether a given title is a movie. This should be a new column, called *is_movie*, that has a value of 1 if the type of content is MOVIE, and a value of 0 otherwise.

In [13]:
#TODO: Create a new column in titles_df called 'is_movie' that equals 1 if 'type' is MOVIE and 0 otherwise.
titles_df['is_movie'] = titles_df['type'].apply(lambda x:  True if x == 'MOVIE' else False)
# Rob: Note that lambda function has to be applied to the columm containing the relevant values
# print(titles_df['is_movie'])

In [14]:
titles_df.dtypes

id                       object
title                    object
type                     object
release_year              int64
runtime                   int64
genres                   object
production_countries     object
imdb_score                int64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
is_movie                   bool
dtype: object

If we check out our dataframe, we note that the 'genres' column and the 'production_countries' column contain lists. This makes sense, as a movie could belong to more than one genre, or be made in more than one country. However, suppose that instead of collapsing this information as a list, we want for there to be one row for each list element. This would mean that, for example, a romantic comedy might be in both the 'comedy' and 'romance' genre, and would get a separate row for each. Let's do this for the 'genres' column.

HOWEVER, before you do this you'll need to let your kernel know that the data in the 'genres' column are lists. You can do this using the literal_eval function from the ast module. You can find more info about the ast (which stands for "Abstract Syntax Tree") [here](https://docs.python.org/3/library/ast.html), and a bit about how to use it [here](https://medium.com/@aniruddhapal/the-power-of-the-ast-literal-eval-method-in-python-8fb4014a2574). I've included code below that should creates the 'genres_expanded' column and cast the data appropriately. You should "explode" this column.

In [15]:
#TODO: Crete a genres_expanded column, and create individual rows for each genre. Note if you're having trouble here, pay attention to data types!
import ast
titles_df['genres_expanded'] = titles_df['genres'].map(lambda x: ast.literal_eval(x))
# Your code here:
titles_df = titles_df.explode('genres_expanded')

# titles_df

In [16]:
titles_df

Unnamed: 0,id,title,type,release_year,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,is_movie,genres_expanded
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],7,1092.0,15.424,7.6,False,comedy
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],7,1092.0,15.424,7.6,False,family
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],7,1092.0,15.424,7.6,False,animation
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],7,1092.0,15.424,7.6,False,action
0,ts20945,The Three Stooges,SHOW,1934,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],7,1092.0,15.424,7.6,False,fantasy
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7314,tm1098070,Girls' Night In,MOVIE,2021,91,"['comedy', 'drama']",['US'],7,28.0,1.306,7.0,True,comedy
7314,tm1098070,Girls' Night In,MOVIE,2021,91,"['comedy', 'drama']",['US'],7,28.0,1.306,7.0,True,drama
7315,tm1019060,Anbirkiniyal,MOVIE,2021,118,"['thriller', 'drama']",['IN'],7,361.0,2.191,7.0,True,thriller
7315,tm1019060,Anbirkiniyal,MOVIE,2021,118,"['thriller', 'drama']",['IN'],7,361.0,2.191,7.0,True,drama


OK. Now we're going to want to do some cleanup on our dataframe. We've produced a lot of rows with the same index, and we should probably reset that. Plus, there are now some columns that we've made redundant. So:
* Drop the columns *type* and *genres*.
* Drop all rows with NA values.
* Reset the index and drop the old index.

In [17]:
#TODO: The three bullet points above to the titles_df dataframe.

titles_df = titles_df.drop(['type', 'genres'], axis=1)
titles_df = titles_df.dropna(axis=0)
titles_df = titles_df.reset_index()
titles_df = titles_df.drop(columns=['index'])
titles_df                           

Unnamed: 0,id,title,release_year,runtime,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,is_movie,genres_expanded
0,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,comedy
1,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,family
2,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,animation
3,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,action
4,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,fantasy
...,...,...,...,...,...,...,...,...,...,...,...
18233,tm1098070,Girls' Night In,2021,91,['US'],7,28.0,1.306,7.0,True,comedy
18234,tm1098070,Girls' Night In,2021,91,['US'],7,28.0,1.306,7.0,True,drama
18235,tm1019060,Anbirkiniyal,2021,118,['IN'],7,361.0,2.191,7.0,True,thriller
18236,tm1019060,Anbirkiniyal,2021,118,['IN'],7,361.0,2.191,7.0,True,drama


In [18]:
titles_df

Unnamed: 0,id,title,release_year,runtime,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,is_movie,genres_expanded
0,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,comedy
1,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,family
2,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,animation
3,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,action
4,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False,fantasy
...,...,...,...,...,...,...,...,...,...,...,...
18233,tm1098070,Girls' Night In,2021,91,['US'],7,28.0,1.306,7.0,True,comedy
18234,tm1098070,Girls' Night In,2021,91,['US'],7,28.0,1.306,7.0,True,drama
18235,tm1019060,Anbirkiniyal,2021,118,['IN'],7,361.0,2.191,7.0,True,thriller
18236,tm1019060,Anbirkiniyal,2021,118,['IN'],7,361.0,2.191,7.0,True,drama


Alright, now that we've transformed our dataframe, let's use it! Specifically, let's analyze it a bit. I wonder which genre tends to have the highest score. Let's investigate!

In [19]:
#TODO: Create a dataframe "genres_df" consisting ONLY OF MOVIES with only the columns genres_expanded, tmdb_popularity, imdb_score, and tmdb_score
genres_df = titles_df[['genres_expanded', 'tmdb_popularity', 'imdb_score','tmdb_score']]

Filter genres to keep only movies with a *tmdb_popularity* greater than 2.0.

In [20]:
#TODO: Filter genres to keep only movies with a tmdb_popularity greater than 2.0.
genres_df = genres_df.loc[genres_df['tmdb_popularity'] > 2]
# Rob: Note the use of a bracket not a parenthesis after loc. loc is more of a location operator than a classic function

# Check results below
# genres_df.loc[genres_df['tmdb_popularity'] <= 2]

# or alternatively
print(f"The lowest tmdb score is: {genres_df['tmdb_popularity'].min()}" )

The lowest tmdb score is: 2.005


Now, let's calculate the average imdb_score by genre, and the average tmdb_score by genre.

In [21]:
#TODO: Create a dataframe genre_scores_df with the average imdb_score and tmdb_score by genre.
genre_scores_df = genres_df.pivot_table(index='genres_expanded', values=('imdb_score', 'tmdb_score'), aggfunc=np.mean)

# Rob: Need to better understand when 'pivot' and when 'pivot_table'. It doesn't appear pivot supports aggfunctions. Are there other differences?


  genre_scores_df = genres_df.pivot_table(index='genres_expanded', values=('imdb_score', 'tmdb_score'), aggfunc=np.mean)


Now, let's check out which genres tend to have the highest imdb_score and tmdb_score.

In [22]:
#TODO: Sort the genre_scores_df dataframe descending by average imdb_score. What are the top 5 genres?
genre_scores_df = genre_scores_df.sort_values(['imdb_score'], ascending=False)
genre_scores_df.head(5)

Unnamed: 0_level_0,imdb_score,tmdb_score
genres_expanded,Unnamed: 1_level_1,Unnamed: 2_level_1
animation,6.596091,7.023127
reality,6.507692,6.853846
documentation,6.428904,6.830769
history,6.272727,6.701455
sport,6.083916,6.503497


In [23]:
#TODO: Sort the columns descending by average tmdb_score. What are the top 5 genres?
genre_scores_df = genre_scores_df.sort_values(['tmdb_score'], ascending=False)
genre_scores_df.head(5)

Unnamed: 0_level_0,imdb_score,tmdb_score
genres_expanded,Unnamed: 1_level_1,Unnamed: 2_level_1
animation,6.596091,7.023127
reality,6.507692,6.853846
documentation,6.428904,6.830769
history,6.272727,6.701455
sport,6.083916,6.503497


Well, that was interesting. Looks like documentaries tend to have higher scores.

Let's look into another question. I wonder whether movies are getting better over the years. 

To look into this, let's calculate the average imdb_score by decade.

But first, let's do some cleaning on our *titles_df* dataframe.

In [24]:
#TODO: From the titles_df dataframe drop the genres_expanded column, drop duplicate rows, reset the index, and drop the old index column.
titles_df = titles_df.drop(['genres_expanded'], axis=1)
titles_df = titles_df.drop_duplicates()
titles_df = titles_df.reset_index()
titles_df = titles_df.drop(columns=['index'])
titles_df


Unnamed: 0,id,title,release_year,runtime,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,is_movie
0,ts20945,The Three Stooges,1934,19,['US'],7,1092.0,15.424,7.6,False
1,tm19248,The General,1926,78,['US'],8,89766.0,8.647,8.0,True
2,tm82253,The Best Years of Our Lives,1946,171,['US'],7,63026.0,8.435,7.8,True
3,tm83884,His Girl Friday,1940,92,['US'],7,57835.0,11.270,7.4,True
4,tm56584,In a Lonely Place,1950,94,['US'],7,30924.0,8.273,7.6,True
...,...,...,...,...,...,...,...,...,...,...
7298,tm616953,Ammaa Ki Boli,2021,117,['IN'],1,1335.0,2.382,1.0,True
7299,tm1068475,Alleyway,2021,67,[],6,92.0,1.870,6.8,True
7300,tm1098070,Girls' Night In,2021,91,['US'],7,28.0,1.306,7.0,True
7301,tm1019060,Anbirkiniyal,2021,118,['IN'],7,361.0,2.191,7.0,True


Now, create a new column in titles_df - call it *decade* - that returns the decade in which the movie was released. So, for example, if the movie was released in 1994, the decade would be 1990.

In [25]:
#TODO: Create the decade column described above in titles_df
titles_df['decade'] = titles_df['release_year'].apply(lambda x: (x // 10) * 10  ) # The idea here is to integer divide by 10 (to round down) and then multiply it by 10
titles_df[['release_year', 'decade']] # Rob: Note how to display two columns (frustrated I had to look this up)

Unnamed: 0,release_year,decade
0,1934,1930
1,1926,1920
2,1946,1940
3,1940,1940
4,1950,1950
...,...,...
7298,2021,2020
7299,2021,2020
7300,2021,2020
7301,2021,2020


For each decade, calculate the average imdb_score, and sort them ascending to determine the worst movie decade.

In [26]:
#TODO: Create a new dataframe, movie_decades_df, that has the decade as the index column, and calculates the average imdb_score by decade from titles_df.
movie_decades_df = titles_df.pivot_table(index='decade', values='imdb_score', aggfunc=np.mean) # Rob: Note the np.mean does not need to be in quotes and does not need to written as np.mean()
movie_decades_df

  movie_decades_df = titles_df.pivot_table(index='decade', values='imdb_score', aggfunc=np.mean) # Rob: Note the np.mean does not need to be in quotes and does not need to written as np.mean()


Unnamed: 0_level_0,imdb_score
decade,Unnamed: 1_level_1
1910,5.85
1920,5.807018
1930,5.03537
1940,5.248521
1950,5.16055
1960,5.323671
1970,5.44586
1980,5.336391
1990,5.535377
2000,5.596048


In [27]:
#TODO: Sort movie_decades_df ascending
# Note to Dr. Zwick... which field do you want to sort by? I'll assume decade
movie_decades_df = movie_decades_df.sort_values(by='decade', ascending=True)


Finally, let's see which decade improved the most compared to the previous one, and which decade degraded the most compared to the previous one.

In [28]:
#TODO: Define a new column in movie_decade_df, 'shift', which gives the shift in average imdb_score for each decade from the previous one.

movie_decades_df['shift'] = (movie_decades_df['imdb_score'] - movie_decades_df.shift(1)['imdb_score'])


In [29]:
movie_decades_df

Unnamed: 0_level_0,imdb_score,shift
decade,Unnamed: 1_level_1,Unnamed: 2_level_1
1910,5.85,
1920,5.807018,-0.042982
1930,5.03537,-0.771648
1940,5.248521,0.213151
1950,5.16055,-0.08797
1960,5.323671,0.163121
1970,5.44586,0.122188
1980,5.336391,-0.109468
1990,5.535377,0.198986
2000,5.596048,0.060671


Now, we're going to (finally!) bring in the credits_df dataframe. 

We're going to bring it in because we want to determine who the most popular TV show actors are based on the imdb_votes received by their shows.

In [30]:
credits_df

Unnamed: 0,person_id,id,name,character,role
0,59401,ts20945,Joe Besser,Joe,ACTOR
1,31460,ts20945,Moe Howard,Moe,ACTOR
2,31461,ts20945,Larry Fine,Larry,ACTOR
3,21174,tm19248,Buster Keaton,Johnny Gray,ACTOR
4,28713,tm19248,Marion Mack,Annabelle Lee,ACTOR
...,...,...,...,...,...
124230,1938589,tm1054116,Sangam Shukla,Madhav,ACTOR
124231,1938565,tm1054116,Vijay Thakur,Sanjay Thakur,ACTOR
124232,728899,tm1054116,Vanya Wellens,Budhiya,ACTOR
124233,1938620,tm1054116,Vishwa Bhanu,Gissu,ACTOR


First, we'll create a new dataframe, show_df, that's just the *id* and *imdb_votes* columns from titles_df, restricted to just entries that are NOT movies.

In [31]:
#TODO: Create a new dataframe, show_df, that takes the 'id' and 'imdb_votes' columns from titles_df, restricted to rows where is_movie != 1.
show_df = titles_df[['id', 'imdb_votes']].loc[titles_df['is_movie'] != 1]
show_df


Unnamed: 0,id,imdb_votes
0,ts20945,1092.0
88,ts55748,1563.0
755,ts20005,25944.0
767,ts42867,8675.0
778,ts21930,2116.0
...,...,...
7184,ts289350,46.0
7209,ts300503,29.0
7215,ts299724,8.0
7256,ts287547,978.0


Now, we'll create a dataframe called actor_votes_df, that merges show_df with a table consisting of just the *person_id, id*, and *name* columns from credits_df, restricted to just the actor rows in credits_df. It should merge on the *id* column.

In [32]:
#TODO: Create a new dataframe, actor_votes_df, that merges show_df with the 'person_id', 'id', and 'name' columns from ACTOR rows in credits_df. 
# The merge should use the 'id' column
actor_votes_df = pd.merge(show_df, credits_df[['person_id', 'id', 'name']], on='id', how="inner")
actor_votes_df




Unnamed: 0,id,imdb_votes,person_id,name
0,ts20945,1092.0,59401,Joe Besser
1,ts20945,1092.0,31460,Moe Howard
2,ts20945,1092.0,31461,Larry Fine
3,ts55748,1563.0,730679,John Charles Daly
4,ts20005,25944.0,67198,Lucille Ball
...,...,...,...,...
6778,ts287826,17.0,1919083,Zhang Dayuan
6779,ts287826,17.0,1215260,Fu Shuyang
6780,ts287826,17.0,1918873,Liu Qi
6781,ts287826,17.0,2038762,Hu Wei


Now, we want to calculate the TV actor with the most imdb_votes. To do this, create a new dataframe called votes_by_actor_df, that adds up the imdb_votes for each actor in actor_votes_df. Then, sort the resulting dataframe by that sum.

In [36]:
#TODO: Create a dataframe votes_by_actor_df that adds up the imdb_votes for each TV actor.
votes_by_actor_df = actor_votes_df.pivot_table(index='name', values='imdb_votes', aggfunc=np.sum)
votes_by_actor_df


  votes_by_actor_df = actor_votes_df.pivot_table(index='name', values='imdb_votes', aggfunc=np.sum)


Unnamed: 0_level_0,imdb_votes
name,Unnamed: 1_level_1
A.R. Bala,898.0
Aadar Malik,978.0
Aakash Dahiya,3069.0
Aakash Gupta,3115.0
Aarif Rahman,13.0
...,...
김정교,1178.0
양원쥔,846.0
역군,153.0
유국동,38.0


In [38]:
#TODO: Sort votes_by_actor_df by imdb_votes descending. Who is the most popular TV actor?
votes_by_actor_df.sort_values(by=['imdb_votes'], ascending=False).head(1)


Unnamed: 0_level_0,imdb_votes
name,Unnamed: 1_level_1
Alyson Hannigan,804863.0


The final thing I'm going to ask you to do won't be laid out quite so step by step. For this part, you should:
* For every person / role combination find the average imdb_score for that person in that role.
* Find the people with the highest average imdb_score for each role, and list them.

In [42]:
#TODO: The instructions above.
# First we need to merge titles_df and credits_df on id
best_by_role_df = pd.merge(titles_df[['id','imdb_votes']], credits_df[['id', 'name', 'role']], how="inner", on="id")

# Second we use a pivot table to perform the aggregation. In this case we want to aggregate on every name and role combination
best_by_role_df = pd.pivot_table(best_by_role_df, index=['name', 'role'], values='imdb_votes', aggfunc={'imdb_votes':np.average})

# This sort is just for me to check that the values are indeed grouped as expected
best_by_role_df = best_by_role_df.sort_values(by=['name', 'role'], ascending=True)

# 'name' and 'role' are now literally in the index, but I want them as columns again so reset_index
best_by_role_df = best_by_role_df.reset_index() 

# Look through the data for a name who had multiple roles. I tried with 'Clint Eastwood', but he's only listed as an 'Actor' - data issue
best_by_role_df.loc[best_by_role_df['name']=='Clint Eastwood']

# Now I can use the group by function along with rank
best_by_role_df['rank'] = best_by_role_df.groupby('role')['imdb_votes'].rank(ascending=False, method='min')
best_by_role_df = best_by_role_df.sort_values(by='rank', ascending=True )

# With the rank column now in place, it's relatively easy to get the top people by role
# best_by_role_df[['rank', 'role', 'name', 'imdb_votes']].loc[best_by_role_df['rank']=='1.0']
best_by_role_df.loc[best_by_role_df['rank']==1.0]
# best_by_role_df.loc[best_by_role_df['role']=='ACTOR']





Unnamed: 0,name,role,imdb_votes,rank
2029,Alexandrea Owens,ACTOR,1133692.0,1.0
3228,Andie Hicks,ACTOR,1133692.0,1.0
9944,Camilla Overbye Roos,ACTOR,1133692.0,1.0
120,Aaron James Cash,ACTOR,1133692.0,1.0
21482,Fannie Brett,ACTOR,1133692.0,1.0
...,...,...,...,...
29502,James Garrett,ACTOR,1133692.0,1.0
45492,Martin Jarvis,ACTOR,1133692.0,1.0
68101,Tony Kenny,ACTOR,1133692.0,1.0
45472,Martin East,ACTOR,1133692.0,1.0


When this is done, please upload your completed notebook to Canvas for grading. Thank you!