# Assignment - Analyzing the IMDB Top 1000 Movies

In the next few assignments, you will work with a data set of the IMDB top 1000 movies.

Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

In [None]:
import pandas as pd
import numpy as np 

In [None]:
# Read the data file "imdb_top_1000.csv" into a dataframe named "imdb"
imdb = pd.read_csv('../data/imdb_top_1000.csv', header=0)
imdb.head()

In [None]:
# Number of rows?
# Number of columns?
imdb.shape

In [None]:
# Describe the dataframe using the info() method.
imdb.info()

In [None]:
# Use describe() to summarize the descriptive statistics of *all* columns
imdb.describe(include='all')

In [None]:
# List all the column names: 
imdb.columns

## Part 1: Data Manipulation

In [None]:
# In this dataset, there is a movie with an error in "Released_Year". (Hint: Released_Year should be a 4-digit integer.)
# Find this movie. 
imdb[imdb['Released_Year'].str.isdigit()==False]

In [None]:
imdb['Released_Year'].unique()

In [None]:
# Correct the values in the corresponding columns ("Released_Year" and "Certificate").
# You may want to look up this movie on www.imdb.com.
# Hint: You can set the value of a particular cell with: df.loc[row_name, column_name] = new_value
imdb.loc[966,'Released_Year'] = 1995
imdb.loc[966,'Certificate'] = 'PG'
imdb.loc[966]

In [None]:
# Change "Released_Year" from string to int
imdb['Released_Year'] = imdb['Released_Year'].astype(int)
imdb['Released_Year'].dtype

In [None]:
# Create a new dataframe called "stars" including the following columns: 
# Series_Title, Released_Year, Star1, Star2, Star3, Star4
stars = imdb[['Series_Title', 'Released_Year', 'Star1', 'Star2', 'Star3', 'Star4']]
stars

In [None]:
# Create a new dataframe called "genres" including the following columns: 
# Series_Title, Released_Year, Genre.
genres = imdb[['Series_Title', 'Released_Year', 'Genre']]
genres

In [None]:
# Select all movies released in or after 2010 (>= 2010) and with IMDB_Rating >= 8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
(
    imdb[(imdb.Released_Year>=2010) & (imdb.IMDB_Rating>=8.5)]
    [['Series_Title','Released_Year','Certificate','Gross']]
    .sort_values('Gross', ascending=False)
)

In [None]:
# Does the sorting result look right to you? What's the problem?

# Answer: Gross is stored as strings, so the values are sorted character by character rather than numerically.
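To see why sorting a text column can mislead, here is a minimal sketch on made-up values (the three figures are illustrative and not taken from the data set). Strings are compared character by character, so `'9,000'` sorts above `'200,000'` until the column is converted to numbers.

```python
import pandas as pd

# Toy gross figures stored as comma-formatted strings (made-up values)
gross = pd.Series(['9,000', '10,000', '200,000'], name='Gross')

# Sorting strings compares character by character, so '9,000' ranks first
as_text = gross.sort_values(ascending=False).tolist()

# Stripping the commas and casting to float gives the numeric order
as_number = (
    gross.str.replace(',', '', regex=False)
    .astype(float)
    .sort_values(ascending=False)
    .tolist()
)

print(as_text)
print(as_number)
```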

In [None]:
# Resolve this problem of "Gross" and convert its data type to float
# Hint: You may find this webpage useful: 
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe

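# Cast to str first (missing values become 'nan'), strip the thousands separators, then parse as float
imdb['Gross'] = imdb['Gross'].apply(str).str.replace(',','').apply(float)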
imdb['Gross']

In [None]:
# Next, redo the sorting on Gross

# Select all movies released in or after 2010 (>= 2010) and with IMDB_Rating >= 8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
(
    imdb[(imdb.Released_Year>=2010) & (imdb.IMDB_Rating>=8.5)]
    [['Series_Title','Released_Year','Certificate','Gross']]
    .sort_values('Gross', ascending=False)
)

In [None]:
# Add a new column "Runtime_min" by removing the string ' min' from "Runtime"
# Set its data type as int

imdb['Runtime_min'] = imdb['Runtime'].str[:-4].astype(int)
imdb['Runtime_min']

In [None]:
# Add a new column "Decade" with values as 1980, 1990, 2000, 2010, 2020, etc. 
imdb['Decade'] = imdb['Released_Year'] // 10 * 10
imdb['Decade']

## Part 2: Data Summarization
Done! 

## Part 3: Data Visualization
Done! 

## Part 4: Tidy Data

After running all cells above, you should have three dataframes: 
- **imdb**
- **stars**
- **genres**

Now, let's take a quick look at the three dataframes. 

In [None]:
print(imdb.columns)
imdb.head()

In [None]:
print(stars.columns)
stars.head()

In [None]:
print(genres.columns)
genres.head()

Follow the instructions below and write your code to answer the questions: 

To better understand the advantages of tidy data, you will first use the "un-tidy" dataframes alone to answer the next few questions: 

In [None]:
# In dataframe "stars", find all movies that star "Morgan Freeman". 
# Hint: he could be Star1, Star2, Star3, or Star4. 


In [None]:
# In dataframe 'stars', who appeared in Star2 the most times? List the top five actors.


In [None]:
# In dataframe 'imdb', find all comedies and list Series_Title, Released_Year, Director, and IMDB_Rating
# Sort them by Released_Year in descending order.
# Hint: use .str.contains(...) 


In [None]:
# In dataframe 'imdb', find all values in the Genre column and the number of occurrences for each value. 
# Hint: use value_counts()


### Tidying the data

Next, you will further tidy the two dataframes **stars** and **genres**. 

Let's start with **stars**.
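As a warm-up, the sketch below runs a wide-to-long-to-wide round trip on a toy table. The titles and actor names are placeholders, not rows from the data set, and passing a list to `index` in `pivot()` assumes pandas 1.1 or later. It is meant only to show the shape each step produces, not the one right way to do it.

```python
import pandas as pd

# Toy wide table; titles and names are placeholders, not rows from the data set
wide = pd.DataFrame({
    'Series_Title': ['Movie A', 'Movie B'],
    'Released_Year': [2001, 2005],
    'Star1': ['Actor X', 'Actor Y'],
    'Star2': ['Actor Y', 'Actor Z'],
})

# Wide -> long: one row per (movie, star slot)
long = wide.melt(
    id_vars=['Series_Title', 'Released_Year'],
    var_name='StarNo',
    value_name='StarName',
)

# Long -> wide: the star slots become columns again
back = (
    long.pivot(index=['Series_Title', 'Released_Year'],
               columns='StarNo', values='StarName')
    .reset_index()
)
back.columns.name = None  # drop the leftover 'StarNo' label on the column axis

print(long)
print(back)
```

In the long form, a question such as "which movies feature Actor Y?" needs a filter on a single column (`StarName`) instead of one condition per star column.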

In [None]:
# Transform the dataframe "stars" to a new dataframe named "stars_long" with the following four columns: 
# Series_Title, Released_Year, StarNo (e.g., Star1, Star2, ...), StarName
# Hint: use melt()
stars_long = stars.melt(...)
stars_long

In [None]:
# Can you transform dataframe 'stars_long' back to its original shape? 
# Hint: use pivot()
stars_long.pivot(...)

In [None]:
# In dataframe "stars_long", find all movies that star "Morgan Freeman". 


In [None]:
# Who appeared as Star2 the most times? List the top five actors.


In [None]:
# Who stars in the most movies in this list? List the top 20 actors.


In [None]:
# Which movie stars had the highest total gross in this movie list? Show the top 10 actors. 
# Hint: Join "stars_long" and "imdb"; then group by StarName


In [None]:
# Find the best director-actor duos,
# i.e., director-actor pairs that collaborated on at least five movies,
# sorted in descending order of average IMDB_Rating
# Hint: Join imdb and stars_long; group by ['Director','StarName']
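The "count, filter, then rank" pattern behind this kind of question can be sketched on a toy table that is already joined. All names and ratings below are made up, and `min_movies` is a hypothetical threshold set to 2 so the tiny example has something to show.

```python
import pandas as pd

# Toy joined table: one row per (movie, star), with the movie's director and rating
pairs = pd.DataFrame({
    'Director': ['D1', 'D1', 'D1', 'D2', 'D2'],
    'StarName': ['A', 'A', 'B', 'A', 'A'],
    'IMDB_Rating': [8.0, 9.0, 8.5, 7.0, 8.0],
})

# One row per pair, with the number of movies and the average rating
summary = (
    pairs.groupby(['Director', 'StarName'])
    .agg(n_movies=('IMDB_Rating', 'count'), avg_rating=('IMDB_Rating', 'mean'))
    .reset_index()
)

# Keep only pairs with enough collaborations, then rank by average rating
min_movies = 2  # hypothetical threshold for this toy example
duos = summary[summary['n_movies'] >= min_movies].sort_values(
    'avg_rating', ascending=False
)

print(duos)
```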


In [None]:
# Bonus question
# Who did "Amy Adams" co-star with in this movie list? 
# Hint: Join stars_long with itself to find pairs of co-stars
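A self-join pairs every star of a movie with every other star of the same movie. The sketch below uses a toy cast table with placeholder names. It joins on the title alone, which assumes titles are unique; with real data it may be safer to join on both `Series_Title` and `Released_Year`.

```python
import pandas as pd

# Toy long table: one row per (movie, star); names are placeholders
cast = pd.DataFrame({
    'Series_Title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'StarName': ['X', 'Y', 'Z', 'X', 'W'],
})

# Join the table with itself on the movie to get all star pairs per movie
pairs = cast.merge(cast, on='Series_Title', suffixes=('', '_co'))

# Drop the rows where a star is paired with themselves
pairs = pairs[pairs['StarName'] != pairs['StarName_co']]

# Everyone who shared a movie with star 'X'
co_stars_of_x = sorted(pairs.loc[pairs['StarName'] == 'X', 'StarName_co'].unique())

print(co_stars_of_x)
```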


Next, let's reshape the dataframe **genres**, which is a little bit more complicated. 
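The extra complication is that several genres are packed into one string. A minimal sketch of the split, rename, concat, and melt sequence on a toy table follows; the titles are placeholders, and the final `dropna()` is an optional assumption (it discards the empty slots of movies that have fewer genres than the widest row).

```python
import pandas as pd

# Toy table with comma-separated genres; titles are placeholders
toy = pd.DataFrame({
    'Series_Title': ['Movie A', 'Movie B'],
    'Genre': ['Drama', 'Action, Comedy'],
})

# Split each string into separate columns; shorter lists are padded with missing values
parts = toy['Genre'].str.split(', ', expand=True)
parts.columns = [f'Genre{i + 1}' for i in range(parts.shape[1])]

# Put the identifying column and the split columns side by side
wide = pd.concat([toy[['Series_Title']], parts], axis=1)

# One row per (movie, genre slot); drop the padded empty slots
long = (
    wide.melt(id_vars='Series_Title', var_name='GenreNo', value_name='GenreName')
    .dropna(subset=['GenreName'])
)

print(long)
```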

In [None]:
# Step 1: Split the 'Genre' string by ', ' into a list of individual genres and expand them to different columns
genres_split = ...
genres_split

In [None]:
# Step 2: Rename the columns as: Genre1, Genre2, Genre3


In [None]:
# Step 3: Combine ['Series_Title','Released_Year'] in 'genres' and ['Genre1','Genre2','Genre3'] in 'genres_split'. 
# Save it to a new dataframe named 'genres_wide'.     
# Hint: Use pd.concat(...)
genres_wide = pd.concat(...)
genres_wide

In [None]:
# Step 4: Transform genres_wide to a new dataframe genres_long with the following four columns: 
# Series_Title, Released_Year, GenreNo (e.g., Genre1, Genre2, Genre3), GenreName
# Hint: use melt()
genres_long = genres_wide.melt(...)
genres_long

In [None]:
# How many unique genres (atomic values, e.g., Drama, Comedy, ...) are there? 
# How many movies are there for each genre? 


In [None]:
# What is the average IMDB rating for each genre? 
# Sort the genres in descending order of average IMDB_Rating. 
# Hint: join imdb with genres_long; group by GenreName


In [None]:
# Who is the "King of Comedy" (i.e., the actor who starred in the most comedy movies)?
# Hint: find all comedies in genres_long ; join with stars_long; group by StarName


In [None]:
# Bonus Question: 
# Who are the best action stars? 
# i.e., the actors who star in at least 5 action movies, sorted by average IMDB rating in descending order 
# Hint: You need to join columns from all three dataframes: imdb, stars_long, genres_long


In [None]:
# Bonus Question: 
# Create a pivot table to show the number of movies of different genres in different decades. 
# Row: GenreName
# Column: Decade
# Hint: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
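For orientation, here is a hedged sketch of a count-style pivot table on a toy long table. The rows are made up, and counting `Series_Title` is just one possible way to tally movies per cell.

```python
import pandas as pd

# Toy long table: one row per (movie, genre); values are made up
toy = pd.DataFrame({
    'GenreName': ['Drama', 'Drama', 'Comedy', 'Drama'],
    'Decade': [1990, 2000, 2000, 2000],
    'Series_Title': ['Movie A', 'Movie B', 'Movie B', 'Movie C'],
})

# Rows are genres, columns are decades, cells count the movies
counts = toy.pivot_table(
    index='GenreName',
    columns='Decade',
    values='Series_Title',
    aggfunc='count',
    fill_value=0,  # show 0 instead of a missing value for empty combinations
)

print(counts)
```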
