<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

For these exercises, you will conduct basic exploratory data analysis on IMDb's movie data, looking to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [2]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

FileNotFoundError: File b'./data/imdb_1000.csv' does not exist

#### Check the number of rows and columns.

In [None]:
# Answer:
movies.shape

#### Check the data type of each column.

In [None]:
# Answer:
movies.dtypes

#### Calculate the average movie duration.

In [None]:
# Answer:
movies.loc[:, 'duration'].mean()

#### Sort the DataFrame by duration to find the shortest and longest movies.

In [None]:
# Answer:
movies.sort_values('duration', ascending = False)

#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [None]:
# Answer:
fig, ax = plt.subplots()
movies.hist('duration', ax=ax, bins = 20);

#### Use a box plot to display that same data.

In [None]:
# Answer:
fig, ax = plt.subplots()
movies.loc[:, 'duration'].plot(kind='box', ax=ax);

## Intermediate level

#### Count how many movies have each of the content ratings.

In [None]:
# Answer:
movies.loc[:, 'content_rating'].value_counts()

#### Use a visualization to display that same data, including a title and x and y labels.

In [None]:
# Answer:
fig, ax = plt.subplots()
movies.loc[:, 'content_rating'].value_counts().plot(kind='bar', ax=ax)
ax.set_xlabel('Content Rating')
ax.set_ylabel('Number of Movies')
ax.set_title('Number of Movies per Content Rating');

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [None]:
# Answer:
# Replace only within content_rating, on a copy so that movies itself is unchanged
ratingschangedf = movies.copy()
ratingschangedf.loc[:, 'content_rating'] = ratingschangedf.loc[:, 'content_rating'].replace(['NOT RATED', 'APPROVED', 'PASSED', 'GP'], 'UNRATED')
ratingschangedf.loc[:,'content_rating'].value_counts()

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [None]:
# Answer:
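# Replace only within content_rating so matching strings in other columns are untouched
ratingschangedf.loc[:, 'content_rating'] = ratingschangedf.loc[:, 'content_rating'].replace(['X', 'TV-MA'], 'NC-17')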
ratingschangedf.loc[:,'content_rating'].value_counts()

#### Count the number of missing values in each column.

In [None]:
# Answer:
ratingschangedf.isnull().sum()

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [None]:
# Answer:
#Look at rows where content_rating is null (there are only 3):
ratingschangedf.loc[ratingschangedf.loc[:,'content_rating'].isnull(), :]
#genres are  biography, action, & adventure

In [None]:
#COME BACK TO THIS!#
#Check what content ratings are most common for each genre and actor
#ratingschangedf.loc[ratingschangedf.loc[:, 'duration']>120, :].groupby(['genre', 'content_rating']).count()

ratingschangedf.loc[ratingschangedf.loc[:, 'actors_list'].str.contains('Robert Redford'), :].groupby([ 'content_rating']).count()
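
The fill step itself is still open above. One possible approach, sketched here on a small made-up DataFrame rather than the real data, is to fill each missing content rating with the most common rating in that movie's genre. The titles, genres, and ratings in `sample` are invented for illustration, and choosing the per-genre mode as the "reasonable" value is my assumption, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data; every value here is invented for illustration.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'genre': ['Action', 'Action', 'Action', 'Drama', 'Drama'],
    'content_rating': ['R', 'R', np.nan, 'PG', np.nan],
})

# Most common content rating per genre. mode() ignores NaN and returns ties in
# sorted order, so iloc[0] picks the first one. A genre with no ratings at all
# would raise an IndexError here and would need separate handling.
mode_by_genre = sample.groupby('genre')['content_rating'].agg(lambda s: s.mode().iloc[0])

# Look up each row's genre-level mode, then fill only the missing entries.
fill_values = sample['genre'].map(mode_by_genre)
filled = sample.assign(content_rating=sample['content_rating'].fillna(fill_values))
print(filled)
```

With only three missing ratings in the real data, looking each movie up by hand would also be a defensible choice.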


#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [None]:
# Answer:
avg_rating_2plus_hours = round(movies.loc[movies.loc[:, 'duration'] >=120, 'star_rating'].mean(), 3)
avg_rating_under2hours = round(movies.loc[movies.loc[:, 'duration'] <120, 'star_rating'].mean(), 3)
print("Average rating for 2+ hour movies:", avg_rating_2plus_hours)
print("Average rating for <2 hour movies:", avg_rating_under2hours)
if avg_rating_2plus_hours > avg_rating_under2hours:
    print("2+ hour movies are typically rated higher than those under 2 hours.")
else:
    print("Movies with durations under 2 hours are typically rated higher than those over 2 hours.")

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [None]:
# Answer:
fig, ax = plt.subplots()
movies.plot(kind='scatter', x='duration', y= 'star_rating', ax=ax)

The scatter plot suggests a weak positive relationship between star rating and duration, and it looks weaker still if we focus on the dense cluster in the middle. The correlation matrix computed below confirms this: the correlation is about 0.23, which is positive but modest.

In [None]:
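# numeric_only=True skips text columns such as title and genre (required on pandas 2.0+, available from 1.5)
movies.corr(numeric_only=True)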

#### Calculate the average duration for each genre.

In [None]:
# Answer:
movies.loc[:, ['genre', 'duration']].groupby('genre').mean()
# or, to print both star rating and duration: movies.groupby('genre').mean(numeric_only=True)

## Advanced level

#### Visualize the relationship between content rating and duration.

In [None]:
# Answer:
import seaborn as sns
fig, ax = plt.subplots(figsize = (20,12))
boxplotdata = ratingschangedf.loc[:, ['content_rating', 'duration']]
# Pass ax explicitly so the plot is drawn on the figure created above
sns.boxplot(x='content_rating', y='duration', data=boxplotdata, order=['UNRATED', 'G', 'PG', 'PG-13', 'R', 'NC-17'], ax=ax);

Methodology:
- Plotted box plots of duration for each content rating to compare the median and spread of each group
- Used Seaborn for the plot, which made it easy to order the categories by rating

Going from G to NC-17 (excluding unrated movies), median duration tends to increase and peaks at PG-13, while NC-17 movies run shorter. Unrated movies have a wider range of durations than any rated category once outliers are set aside. NC-17 is the only category with no outliers, though it also has the fewest movies of any rating.
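
To back up a reading of the box plot with numbers, a summary table of count, median, minimum, and maximum per content rating can help. The sketch below uses a small invented DataFrame, so its values say nothing about the real IMDb data. The names `sample`, `rating_order`, and `summary` are mine.

```python
import pandas as pd

# Invented durations for three ratings; not taken from the IMDb file.
sample = pd.DataFrame({
    'content_rating': ['G', 'G', 'G', 'PG-13', 'PG-13', 'PG-13', 'NC-17', 'NC-17'],
    'duration': [90, 100, 95, 120, 140, 130, 110, 100],
})

# One row per rating: how many movies, the median duration, and the range.
summary = sample.groupby('content_rating')['duration'].agg(['count', 'median', 'min', 'max'])

# groupby sorts ratings alphabetically, so reindex to put them in rating order.
rating_order = ['G', 'PG-13', 'NC-17']
summary = summary.reindex(rating_order)
print(summary)
```

The count column is worth reading alongside the medians, since a category with very few movies gives a less reliable picture of spread.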

In [None]:
ratingschangedf.loc[ratingschangedf.loc[:,'genre'] == 'Action'].sort_values(by = 'star_rating')

#### Determine the top rated movie (by star rating) for each genre.

In [None]:
# Answer:
genres = list(ratingschangedf.loc[:,'genre'].unique())
top_rated_per_genre = {}

for value in genres:
    tempdf = ratingschangedf.loc[ratingschangedf.loc[:, 'genre'] == value, :]
    highest_rated_title = list(tempdf.nlargest(1, 'star_rating')['title'])
    highest_rated_stars = list(tempdf.nlargest(1,'star_rating')['star_rating'])
    # Values are converted to lists so the printout does not include the 'dtype=object' part
    top_rated_per_genre[value] = highest_rated_title + highest_rated_stars
    
top_rated_per_genre
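
A loop-free alternative is to ask pandas for the row label of the highest star rating in each genre with `idxmax`, then pull those rows. This is a sketch on invented data, not the method the exercise expects. When two movies tie for the top rating, `idxmax` keeps only the first one it encounters.

```python
import pandas as pd

# Invented sample rows; titles and ratings are placeholders.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genre': ['Action', 'Action', 'Drama', 'Drama'],
    'star_rating': [8.1, 8.7, 9.0, 7.9],
})

# idxmax returns, for each genre, the index label of its highest star_rating.
top_idx = sample.groupby('genre')['star_rating'].idxmax()

# Use those labels to select the matching rows, then index the result by genre.
top_per_genre = sample.loc[top_idx, ['genre', 'title', 'star_rating']].set_index('genre')
print(top_per_genre)
```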

#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [None]:
# Answer:
title_counts = ratingschangedf.groupby('title').count().sort_values(by='star_rating', ascending = False)
duplicate_titles = list(title_counts.loc[title_counts.loc[:, 'star_rating'] > 1, :].index)
duplicate_title_rows = ratingschangedf.loc[ratingschangedf.loc[:, 'title'].isin(duplicate_titles), :].sort_values(by='title')
duplicate_title_rows

A: The rows that share a title have different sets of actors and different durations, which leads me to believe they are different versions or remakes rather than actual duplicates.
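
The same check can be sketched with `DataFrame.duplicated`, which separates "same title" from "same in every column" directly. The data below is invented so that one title is a remake and another is an exact duplicate. The real data, per the answer above, contains only the first kind.

```python
import pandas as pd

# Invented rows: 'Movie A' is a remake pair, 'Movie C' is an exact duplicate pair.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie C', 'Movie C'],
    'duration': [120, 95, 110, 100, 100],
    'actors_list': ["[u'X']", "[u'Y']", "[u'Z']", "[u'W']", "[u'W']"],
})

# keep=False flags every row whose title appears more than once.
same_title = sample.loc[sample.duplicated(subset='title', keep=False)].sort_values('title')

# With no subset, rows must match in every column to count as duplicates.
exact_duplicates = sample.loc[sample.duplicated(keep=False)]

print(same_title)
print(exact_duplicates)
```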

#### Calculate the average star rating for each genre, but only include genres with at least 10 movies


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
# Answer:
ratingschangedf.groupby('genre').count() #view list of genres and their counts, then manually add each relevant one to a list
genres_highvolume = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Horror', 'Mystery']
ratingschangedf.loc[ratingschangedf.loc[:, 'genre'].isin(genres_highvolume), ['genre', 'star_rating']].groupby('genre').mean()

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [None]:
# Answer:
genre_counts = ratingschangedf.groupby('genre').count()
top_genres = list(genre_counts.loc[genre_counts.loc[:, 'star_rating'] > 9, :].index)
ratingschangedf.loc[ratingschangedf.loc[:, 'genre'].isin(top_genres), ['genre', 'star_rating']].groupby('genre').mean()

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [None]:
# Answer:
avg_rating_by_genre = ratingschangedf.loc[:, ['genre', 'star_rating']].groupby('genre').agg(['count', 'mean'])
# The aggregated columns form a MultiIndex, so each one is selected with a tuple
has_enough_movies = avg_rating_by_genre.loc[:, ('star_rating', 'count')] > 9
avg_rating_by_genre.loc[has_enough_movies, ('star_rating', 'mean')]

#### Option 4: aggregate by count and mean, then filter using the count

In [None]:
# Answer:
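
This cell was left empty. A minimal sketch of what Option 4 describes, shown on invented data: aggregate `star_rating` by count and mean in one step, then keep only the rows whose count meets the threshold. The threshold is lowered to 2 here because the toy data is tiny. The exercise itself asks for at least 10.

```python
import pandas as pd

# Invented ratings for three genres; 'Western' has too few movies to keep.
sample = pd.DataFrame({
    'genre': ['Action'] * 3 + ['Drama'] * 2 + ['Western'],
    'star_rating': [8.0, 8.5, 9.0, 7.5, 8.5, 8.9],
})
min_movies = 2  # use 10 for the real exercise

# Selecting the column before agg gives flat 'count' and 'mean' columns.
genre_stats = sample.groupby('genre')['star_rating'].agg(['count', 'mean'])

# Filter on the count column, keeping the mean alongside it.
result = genre_stats.loc[genre_stats['count'] >= min_movies]
print(result)
```

Selecting `star_rating` before calling `agg` avoids the MultiIndex columns that appear when aggregating a whole DataFrame, which makes the filter easier to write.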

## Bonus

#### Figure out something "interesting" using the actors data!
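
One possible starting point, sketched on invented data: parse each `actors_list` entry into a real Python list, then count actors per movie and appearances per actor. This assumes each cell is a string containing a Python list literal such as `"[u'Name One', u'Name Two']"`. Check a few rows of the real column before relying on that. The actor names below are placeholders.

```python
import ast
import pandas as pd

# Invented rows; the string format of actors_list is an assumption about the file.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'actors_list': [
        "[u'Actor One', u'Actor Two', u'Actor Three']",
        "[u'Actor Two', u'Actor Four', u'Actor Five']",
        "[u'Actor Two', u'Actor One', u'Actor Six']",
    ],
})

# literal_eval safely turns each string into a list without executing code.
actors = sample['actors_list'].apply(ast.literal_eval)

# How many actors are listed for each movie?
actors_per_movie = actors.apply(len)

# explode gives one row per (movie, actor); value_counts then ranks the actors.
appearances = actors.explode().value_counts()

print(actors_per_movie.tolist())
print(appearances.head())
```

From here, the exploded Series could be joined back to `star_rating` to compare average ratings across frequently appearing actors.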