# Movies dataset
## Solution Notebook

This is a set of worked solutions to the `08.1 Movies dataset` Notebook.
The solutions given here should be viewed as a guide only; other equally acceptable solutions may be possible.

## Activity
Using the tools that you have used previously to manipulate `DataFrames`, characterise the Movies dataset by answering the following questions about the data recorded in the dataset:

    1 How many movies, actors, directors and countries are there?
    2 How many unique movie titles are there?
    3 What are the earliest and latest years of release?
    4 What are the ranges of values for critics, audience and user ratings?
    5 How many movies are classified under each genre? 
    6 Missing data - How many movies are recorded without:
        6.1 a title?
        6.2 a year of release?
        6.3 critics, audience or user ratings?
        6.4 any actors?
        6.5 a director?
        6.6 a country of origin?
        6.7 any genres?

#### Create a separate DataFrame from each individual Movies dataset

In [None]:
import pandas as pd

In [None]:
# Create the DataFrame 'movie' from the CSV data file 'movie.csv'.
movie = pd.read_csv('data/movie.csv')
# Create the DataFrame 'movie_actor' from the CSV data file 'movie_actor.csv'.
movie_actor = pd.read_csv('data/movie_actor.csv')
# Create the DataFrame 'movie_country' from the CSV data file 'movie_country.csv'.
movie_country = pd.read_csv('data/movie_country.csv')
# Create the DataFrame 'movie_director' from the CSV data file 'movie_director.csv'.
movie_director = pd.read_csv('data/movie_director.csv')
# Create the DataFrame 'movie_genre' from the CSV data file 'movie_genre.csv'.
movie_genre = pd.read_csv('data/movie_genre.csv')

1 How many movies, actors, directors and countries are there?

In [None]:
len(movie)

Notes:

The Python built-in function [len( )](https://docs.python.org/3/library/functions.html#len) 
returns the number of items (rows) in the `DataFrame` specified. 
See 01.3 Basic python data structures Notebook. 


In [None]:
len(movie_actor['actor_name'].unique())

Notes:

[movie_actor['actor_name']](http://pandas.pydata.org/pandas-docs/version/0.17.1/dsintro.html#indexing-selection) 
returns a `Series` containing the values of 'actor_name'. 
See 02.1 Pandas Dataframes Notebook. 

[unique()](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.unique.html) 
returns an `Array` of unique values of 'actor_name' in the `Series` specified.
See 02.1 Pandas Dataframes Notebook. 
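
One point worth checking when counting with `unique()`: a missing value (`NaN`) is returned as one of the unique values, so it is included in the count. The sketch below uses a small invented `Series` (not the real 'actor_name' data) to compare `len(...unique())` with `nunique()`, which ignores missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movie_actor['actor_name'].
actor_names = pd.Series(['Tom Hanks', 'Meg Ryan', 'Tom Hanks', np.nan])

# unique() keeps NaN as one of the distinct values, so it is counted here.
count_with_missing = len(actor_names.unique())      # 3

# nunique() ignores NaN by default (dropna=True).
count_without_missing = actor_names.nunique()       # 2

print(count_with_missing, count_without_missing)
```

If the column has no missing values, both expressions give the same count.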

In [None]:
len(movie_director['director_name'].unique())

In [None]:
len(movie_country['country'].unique())

2 How many unique movie titles are there?

In [None]:
len(movie['title'].unique())

3 What are the earliest and latest years of release?

In [None]:
print(movie['year'].min())
print(movie['year'].max())

Notes:

[min()](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.min.html) 
returns the minimum of the values in the `Series` specified. See 04.2 Descriptive statistics in pandas Notebook.

[max()](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.max.html) 
returns the maximum of the values in the `Series` specified. See 04.2 Descriptive statistics in pandas Notebook.

4 What are the ranges of values for critics, audience and user ratings?

In [None]:
# RottenTomatoes - all critics: average rating
print(movie['rt_all_critics_rating'].min())
print(movie['rt_all_critics_rating'].max())

In [None]:
# RottenTomatoes - top critics: average rating
print(movie['rt_top_critics_rating'].min())
print(movie['rt_top_critics_rating'].max())

In [None]:
# RottenTomatoes - audience: average rating
print(movie['rt_audience_rating'].min())
print(movie['rt_audience_rating'].max())

In [None]:
# MovieLens - users: average rating
print(movie['ml_user_rating'].min())
print(movie['ml_user_rating'].max())
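
As a possible alternative to the four separate cells above, the minimum and maximum of several columns can be computed in one step with `agg()`. This is a minimal sketch: `toy_movie` and its rating values are invented for illustration, and only the column names are taken from the cells above.

```python
import pandas as pd

# Toy stand-in for the 'movie' DataFrame; the values are made up.
toy_movie = pd.DataFrame({
    'rt_all_critics_rating': [7.5, 4.0, 9.1],
    'rt_top_critics_rating': [8.0, 3.5, 9.0],
    'rt_audience_rating': [3.9, 2.5, 4.4],
    'ml_user_rating': [3.7, 2.1, 4.3],
})

rating_columns = [
    'rt_all_critics_rating',
    'rt_top_critics_rating',
    'rt_audience_rating',
    'ml_user_rating',
]

# One row for 'min' and one for 'max', one column per rating.
rating_ranges = toy_movie[rating_columns].agg(['min', 'max'])
print(rating_ranges)
```

With the real data, replacing `toy_movie` with `movie` should give the same figures as the individual `min()` and `max()` calls.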

5 How many movies are classified under each genre?

In [None]:
movie_genre.pivot_table(index=['genre'], aggfunc='count')

Notes:
    
See 04.1 Crosstabs and pivot tables Notebook.
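
Other equally acceptable ways to count movies per genre exist. The sketch below, on an invented `toy_movie_genre` table, shows `value_counts()`, which counts rows per genre, and a `groupby` with `nunique()`, which counts distinct 'movie_id' values per genre. The two agree as long as no movie is listed twice under the same genre.

```python
import pandas as pd

# Toy stand-in for 'movie_genre'; ids and genres are made up.
toy_movie_genre = pd.DataFrame({
    'movie_id': [1, 1, 2, 3, 3],
    'genre': ['Comedy', 'Drama', 'Drama', 'Comedy', 'Thriller'],
})

# Number of rows recorded for each genre.
genre_counts = toy_movie_genre['genre'].value_counts()

# Number of distinct movies recorded for each genre.
movies_per_genre = toy_movie_genre.groupby('genre')['movie_id'].nunique()

print(genre_counts)
print(movies_per_genre)
```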

6.1 How many movies are recorded without a title?

In [None]:
len(movie[movie['title'].isnull()])

Notes:

[movie['title']](http://pandas.pydata.org/pandas-docs/version/0.17.1/dsintro.html#indexing-selection) 
returns a `Series` containing the values of 'title'. 
See 02.1 Pandas Dataframes Notebook. 

[isnull()](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.isnull.html) 
returns a `boolean` for each element in the `Series` indicating whether the value is `null`. See 03.4 Handling missing data Notebook.

`movie[movie['title'].isnull()]` performs a `selection` choosing those rows where `isnull()` returns `True`. 
See 03.2 Selecting and projecting, sorting and limiting Notebook.

The Python built-in function [len()](https://docs.python.org/3/library/functions.html#len) 
returns the number of items (rows) in the `DataFrame` specified. 
See 01.3 Basic python data structures Notebook. 


Alternative solution:

`len(movie[movie.title.isnull()])`

See 02.1 Pandas Dataframes Notebook.
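
Another possible approach is to sum the boolean `Series` returned by `isnull()`, since `True` counts as 1 and `False` as 0. The sketch below uses an invented `toy_movie` table; applying `isnull().sum()` to a whole `DataFrame` gives the number of missing values in every column at once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for 'movie'; titles and years are made up.
toy_movie = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Alien', np.nan, 'Big'],
    'year': [1979, 1995, np.nan],
})

# Count of rows whose title is missing.
missing_titles = toy_movie['title'].isnull().sum()

# Count of missing values in each column.
missing_per_column = toy_movie.isnull().sum()

print(missing_titles)
print(missing_per_column)
```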

6.2 How many movies are recorded without a year of release?

In [None]:
len(movie[movie.year.isnull()])

6.3 How many movies are recorded without critics, audience or user ratings?

In [None]:
# RottenTomatoes - all critics: average rating
len(movie[movie.rt_all_critics_rating.isnull()])

In [None]:
# RottenTomatoes - top critics: average rating
len(movie[movie.rt_top_critics_rating.isnull()])

In [None]:
# RottenTomatoes - audience: average rating
len(movie[movie.rt_audience_rating.isnull()])

In [None]:
# MovieLens - users: average rating
len(movie[movie.ml_user_rating.isnull()])
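
The cells above count missing values one rating at a time. If the question is instead read as "movies missing at least one rating" or "movies missing every rating", the row-wise `any(axis=1)` and `all(axis=1)` methods can be used. This is a sketch under that assumed reading, on invented data that reuses the column names above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for 'movie'; the rating values are made up.
toy_movie = pd.DataFrame({
    'rt_all_critics_rating': [7.5, np.nan, np.nan],
    'rt_top_critics_rating': [8.0, 6.0, np.nan],
    'rt_audience_rating': [3.9, 3.1, np.nan],
    'ml_user_rating': [3.7, 3.0, np.nan],
})

rating_columns = list(toy_movie.columns)
is_missing = toy_movie[rating_columns].isnull()

# Rows where at least one of the ratings is missing.
missing_any = is_missing.any(axis=1).sum()

# Rows where all of the ratings are missing.
missing_all = is_missing.all(axis=1).sum()

print(missing_any, missing_all)
```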

6.4 How many movies are recorded without any actors?

In [None]:
len(set(movie['movie_id']).difference(set(movie_actor['movie_id'])))

Notes:

[set(...)](https://docs.python.org/3/library/stdtypes.html#set) 
returns a `Set` whose elements are taken from the `Series`. 
See 01.3 Basic python data structures Notebook. 

[.difference(...)](https://docs.python.org/3/library/stdtypes.html#set) 
performs a set `difference` returning those elements present in the first set but not the second.
See 01.3 Basic python data structures Notebook. 
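
The set difference can be illustrated on a small example. The tables below are invented; only the 'movie_id' and 'actor_name' column names come from the dataset. The second expression is an alternative that stays within pandas, using `isin()` and `~` (logical not) to select the rows of the movie table whose id never appears in the actor table.

```python
import pandas as pd

# Toy stand-ins for 'movie' and 'movie_actor'; ids and names are made up.
toy_movie = pd.DataFrame({'movie_id': [1, 2, 3, 4]})
toy_movie_actor = pd.DataFrame({
    'movie_id': [1, 1, 3],
    'actor_name': ['A', 'B', 'C'],
})

# Ids present in the movie table but absent from the actor table.
ids_without_actors = set(toy_movie['movie_id']).difference(
    set(toy_movie_actor['movie_id']))

# Same idea with pandas: keep rows whose id is NOT in the actor table.
rows_without_actors = toy_movie[
    ~toy_movie['movie_id'].isin(toy_movie_actor['movie_id'])]

print(len(ids_without_actors), len(rows_without_actors))
```

The `isin()` version returns the matching rows themselves, which can be useful if the movies need to be inspected rather than just counted.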

6.5 How many movies are recorded without a director?

In [None]:
len(movie)-len(movie_director)

Notes:

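Since "Each movie has one director" we can simply compare the number of rows in each `DataFrame`. This relies on 'movie_director' holding at most one row per movie, and only for movies that appear in 'movie'.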
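
The row-count comparison can be cross-checked against the set-difference approach used for actors. The sketch below uses invented tables; `is_unique` tests the assumption that no movie has more than one director row. If that check fails on the real data, the set-based count would be the safer figure.

```python
import pandas as pd

# Toy stand-ins for 'movie' and 'movie_director'; values are made up.
toy_movie = pd.DataFrame({'movie_id': [1, 2, 3, 4]})
toy_movie_director = pd.DataFrame({
    'movie_id': [1, 2, 4],
    'director_name': ['X', 'Y', 'Z'],
})

# Count based on comparing the number of rows.
by_row_count = len(toy_movie) - len(toy_movie_director)

# Count based on which movie ids have no director row.
by_set_difference = len(
    set(toy_movie['movie_id']).difference(set(toy_movie_director['movie_id'])))

# True when every movie_id appears at most once in the director table.
one_row_per_movie = toy_movie_director['movie_id'].is_unique

print(by_row_count, by_set_difference, one_row_per_movie)
```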

6.6 How many movies are recorded without a country of origin?

In [None]:
len(movie)-len(movie_country)

6.7 How many movies are recorded without any genres?

In [None]:
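# Keep only the rows that name a real genre, then find the movie ids that have none.
genre_listed = movie_genre[movie_genre['genre'] != '(no genres listed)']
len(set(movie['movie_id']).difference(set(genre_listed['movie_id'])))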

Notes:
    
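Since we identified earlier the presence of a '(no genres listed)' genre, we filter those rows out of the 'movie_genre' `DataFrame` before taking the set difference, so that movies carrying only this placeholder are counted as having no genre.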