<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises that explore movie rating data from IMDb. You will conduct basic exploratory data analysis to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [2]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

#### Check the number of rows and columns.

In [0]:
# Answer:
movies.shape

#### Check the data type of each column.

In [0]:
# Answer:
movies.dtypes

#### Calculate the average movie duration.

In [0]:
# Answer:
movies['duration'].mean()
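
The mean and the median can differ noticeably when a few very long movies skew the distribution, so it may be worth looking at both. A minimal sketch on a small made-up `duration` column (the `toy` values are hypothetical, not taken from `imdb_1000.csv`):

```python
import pandas as pd

# Hypothetical durations in minutes; one long outlier pulls the mean upward.
toy = pd.DataFrame({'duration': [90, 95, 100, 105, 210]})

mean_duration = toy['duration'].mean()      # arithmetic average
median_duration = toy['duration'].median()  # middle value of the sorted data
print(mean_duration, median_duration)
```

With the real data, the same two calls on `movies['duration']` show whether the average is being pulled by outliers.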


#### Sort the DataFrame by duration to find the shortest and longest movies.

In [0]:
# Answer:
movies.sort_values(by='duration', ascending=True).head()

In [0]:
movies.sort_values(by='duration', ascending=False).head()

#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [0]:
# Answer:
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later).
sns.histplot(movies['duration'], bins=15);
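
One possible way to get a starting point for an "appropriate" bin count is to let NumPy suggest bin edges and then adjust by eye. The sketch below assumes a small made-up series of durations (hypothetical values) and uses `np.histogram_bin_edges` with the `'auto'` rule; it is a suggestion for exploration, not a rule the exercise requires.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, standing in for movies['duration'].
durations = pd.Series([64, 88, 95, 101, 110, 118, 125, 133, 150, 242])

# 'auto' picks between the Sturges and Freedman-Diaconis estimators.
edges = np.histogram_bin_edges(durations, bins='auto')
n_bins = len(edges) - 1
print(n_bins, edges[0], edges[-1])
```

The resulting `n_bins` can be passed as the `bins` argument of the histogram call and then tuned up or down.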

#### Use a box plot to display that same data.

In [0]:
# Answer:
# "That same data" is the duration column alone, so plot a single box.
sns.boxplot(x=movies['duration']);

## Intermediate level

#### Count how many movies have each of the content ratings.

In [0]:
# Answer:
movies['content_rating'].value_counts()

#### Use a visualization to display that same data, including a title and x and y labels.

In [0]:
# Answer:
graphic = movies['content_rating'].value_counts().plot(kind='bar')

graphic.set_title("Movies and content ratings")
graphic.set_xlabel("Content rating")
graphic.set_ylabel("Number of movies")

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [0]:
# Answer:
# Map each listed rating to "UNRATED"; all other values (including NaN) stay unchanged.
unrated = ['NOT RATED', 'APPROVED', 'PASSED', 'GP']
movies['content_rating'] = movies['content_rating'].replace(unrated, 'UNRATED')

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [0]:
# Answer:
# Map "X" and "TV-MA" to "NC-17"; all other values (including NaN) stay unchanged.
movies['content_rating'] = movies['content_rating'].replace(['X', 'TV-MA'], 'NC-17')
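
Both conversions can also be expressed as a single lookup table. The sketch below assumes a small made-up `content_rating` column (the `toy` frame and `rating_map` name are hypothetical) and uses `Series.replace` with a dictionary, then checks the result with `value_counts`.

```python
import pandas as pd

# Hypothetical ratings standing in for movies['content_rating'].
toy = pd.DataFrame({'content_rating': ['R', 'NOT RATED', 'GP', 'X', 'TV-MA', 'PG']})

# Old value -> new value; ratings not listed here are left untouched.
rating_map = {
    'NOT RATED': 'UNRATED', 'APPROVED': 'UNRATED',
    'PASSED': 'UNRATED', 'GP': 'UNRATED',
    'X': 'NC-17', 'TV-MA': 'NC-17',
}
toy['content_rating'] = toy['content_rating'].replace(rating_map)

# Re-count to confirm the old labels are gone.
counts = toy['content_rating'].value_counts()
print(counts)
```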

#### Count the number of missing values in each column.

In [0]:
# Answer:
movies.isnull().sum()

In [0]:
movies[movies['content_rating'].isnull()]

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [0]:
# Answer:
# Ratings looked up by title. fillna only touches rows where content_rating is
# missing, so a rated movie that shares one of these titles is left as-is.
fill_values = {
    'Butch Cassidy and the Sundance Kid': 'PG-13',
    'Where Eagles Dare': 'PG-13',
    'True Grit': 'G',
}

movies['content_rating'] = movies['content_rating'].fillna(movies['title'].map(fill_values))

#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [0]:
# Answer:
(movies[movies['duration'] >= 120]['star_rating'].mean()) - (movies[movies['duration'] < 120]['star_rating'].mean())
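
To see both averages side by side rather than only their difference, one option is to group on a derived label. The sketch below assumes a small made-up frame (the `toy` values and the label strings are hypothetical) and groups `star_rating` by whether `duration` is at least 120 minutes.

```python
import pandas as pd

# Hypothetical data standing in for the duration and star_rating columns.
toy = pd.DataFrame({
    'duration': [95, 110, 120, 150, 180],
    'star_rating': [7.5, 7.7, 8.0, 8.2, 8.4],
})

# Label every movie by length; 120 minutes counts as "2 hours or longer".
length_group = toy['duration'].ge(120).map(
    {True: '2 hours or longer', False: 'under 2 hours'}
).rename('length')

avg_by_length = toy.groupby(length_group)['star_rating'].mean()
difference = avg_by_length['2 hours or longer'] - avg_by_length['under 2 hours']
print(avg_by_length)
print(difference)
```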

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [0]:
# Answer:
# Both variables are numeric, so a scatter plot shows the relationship directly.
movies.plot(kind='scatter', x='duration', y='star_rating', alpha=0.3);


#### Calculate the average duration for each genre.

In [0]:
# Answer:

movies.groupby('genre')['duration'].mean()

## Advanced level

#### Visualize the relationship between content rating and duration.

In [0]:
# Answer:
# One box per content rating shows how the duration distribution varies by category.
sns.boxplot(x='content_rating', y='duration', data=movies);

#### Determine the top rated movie (by star rating) for each genre.

In [0]:
# Answer:
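# Sort so the highest-rated movie comes first, then take the first row of each genre.
movies.sort_values('star_rating', ascending=False).groupby('genre')[['title', 'star_rating']].first()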
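
An alternative that does not depend on row order is to look up the index of the maximum rating within each genre. The sketch below assumes a small made-up frame (hypothetical titles and ratings) and uses `idxmax`; note that when several movies tie for the top rating, only the first one is returned.

```python
import pandas as pd

# Hypothetical movies standing in for the real DataFrame.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy'],
    'star_rating': [8.1, 9.0, 7.9, 7.4],
})

# Row label of the highest star_rating within each genre.
top_idx = toy.groupby('genre')['star_rating'].idxmax()

# Pull the full rows for those labels.
top_per_genre = toy.loc[top_idx, ['genre', 'title', 'star_rating']]
print(top_per_genre)
```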

#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [0]:
# Answer:
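# keep=False shows every row that shares a title, so the copies can be compared side by side.
movies[movies.duplicated('title', keep=False)].sort_values('title')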
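
A shared title does not by itself mean two rows are duplicates, since remakes reuse names. One way to separate the two cases is to compare a title-only check with a check across all columns. The sketch below assumes a small made-up frame (hypothetical titles and durations).

```python
import pandas as pd

# Hypothetical rows: "Same Name" is a remake, "Twin" is a true duplicate.
toy = pd.DataFrame({
    'title': ['Same Name', 'Same Name', 'Other', 'Twin', 'Twin'],
    'duration': [110, 128, 95, 100, 100],
})

# Rows whose title appears more than once.
same_title = toy[toy.duplicated('title', keep=False)]

# Rows that are identical in every column.
full_duplicates = toy[toy.duplicated(keep=False)]
print(same_title)
print(full_duplicates)
```

Rows that appear in `same_title` but not in `full_duplicates` share a title while differing in at least one other column.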

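#### Calculate the average star rating for each genre, but only include genres with at least 10 movies.

In [0]:
# Answer:
genremovies = movies.groupby('genre')['title'].count()
genremovies = genremovies[genremovies >= 10]  # "at least 10" includes exactly 10
ratings = movies.groupby('genre')['star_rating'].sum()

# Genres filtered out of genremovies become NaN after the division, so drop them.
genreavg = (ratings / genremovies).dropna()

genreavg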

#### Option 1: manually create a list of relevant genres, then filter using that list

In [0]:
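# Answer:
# Hand-picked genres. The strings must match the values in movies['genre'] exactly,
# and the list should contain the genres that have at least 10 movies.
genreslist = ['Sci-fi', 'Western', 'Film-Noir', 'Horror']

genremovies = movies[movies['genre'].isin(genreslist)]

genremovies.groupby('genre')['star_rating'].mean()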


#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [0]:
# Answer:
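
One way this option could look, sketched on a small made-up frame: save the `value_counts` result, keep the genres that meet the threshold, and filter with `isin`. The `toy` data and the `min_movies` threshold of 2 are hypothetical; with the real data the frame would be `movies` and the threshold 10.

```python
import pandas as pd

# Hypothetical data standing in for movies.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Western'],
    'star_rating': [8.0, 8.5, 9.0, 7.0, 8.0, 7.5],
})
min_movies = 2  # assumption for the toy data; the exercise asks for 10

# Count movies per genre and keep the genre names that meet the threshold.
genre_counts = toy['genre'].value_counts()
relevant_genres = genre_counts[genre_counts >= min_movies].index

# Filter to those genres, then average the rating within each.
avg_rating = toy[toy['genre'].isin(relevant_genres)].groupby('genre')['star_rating'].mean()
print(avg_rating)
```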

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [0]:
# Answer:
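
A possible sketch of this option on a small made-up frame: compute the mean for every genre first, then index that result with a boolean Series built from the genre counts. The `toy` data and the `min_movies` threshold of 2 are hypothetical; with the real data the frame would be `movies` and the threshold 10.

```python
import pandas as pd

# Hypothetical data standing in for movies.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Western'],
    'star_rating': [8.0, 8.5, 9.0, 7.0, 8.0, 7.5],
})
min_movies = 2  # assumption for the toy data; the exercise asks for 10

# Average rating for every genre, regardless of size.
avg_all = toy.groupby('genre')['star_rating'].mean()

# Boolean Series indexed by genre: True where the genre has enough movies.
# sort_index puts it in the same genre order as avg_all.
enough_movies = (toy['genre'].value_counts() >= min_movies).sort_index()

avg_rating = avg_all[enough_movies]
print(avg_rating)
```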



#### Option 4: aggregate by count and mean, then filter using the count

In [0]:
# Answer:
genremovies = movies.groupby('genre')['star_rating'].agg(['count', 'mean'])

# Keep only the genres whose movie count is at least 10.
genremovies[genremovies['count'] >= 10]

## Bonus

#### Figure out something "interesting" using the actors data!

In [0]:
# Actors that appear in various movies
movies.groupby(['genre'])['actors_list'].describe()

In [0]:
movies[movies['actors_list'].duplicated()]
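
To count individual actors rather than whole cast lists, one approach is to split each list into one row per actor. The sketch below rests on an assumption that is not confirmed by this notebook: that each `actors_list` value is a string holding a Python list literal. The `toy` frame and actor names are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical data; assumes actors_list stores list literals as strings.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'actors_list': [
        "['Actor X', 'Actor Y']",
        "['Actor X', 'Actor Z']",
        "['Actor Y', 'Actor X']",
    ],
})

# Parse each string into a real list, then give every actor a row of their own.
actors = toy['actors_list'].apply(ast.literal_eval).explode()

# Number of movies each actor appears in.
appearances = actors.value_counts()
print(appearances)
```

If the real column uses a different format, the parsing step would need to change, but the `explode` and `value_counts` steps would stay the same.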