# Pandas Examples using Movie Dataset

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

In [57]:
import pandas as pd
import numpy as np

## Movie Data Set

In [85]:
# read in IMDB movie dataset
movies = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [86]:
# display first 2 rows of the DataFrame
movies.head(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [87]:
# display last 2 rows
movies.tail(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


In [88]:
movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [89]:
movies.index

RangeIndex(start=0, stop=1000, step=1)

In [90]:
movies.iloc[0:3, 0:3]

Unnamed: 0,Rank,Title,Genre
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi"
1,2,Prometheus,"Adventure,Mystery,Sci-Fi"
2,3,Split,"Horror,Thriller"


## Unit of Analysis: Movie

The primary object of interest is a movie.  For data analysis, it is helpful to know what uniquely identifies a movie, and then set df.index to this unique identifier.

In [91]:
# groupby and filter are discussed later ...
def get_dups(df, cols):
    return df.groupby(cols).filter(lambda x: len(x) > 1)

In [92]:
# Titles are not unique in this dataset
get_dups(movies, ['Title'])

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
239,240,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,633,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [93]:
# Title, Year is unique
get_dups(movies, ['Title', 'Year'])

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
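
As an alternative to the `get_dups` helper, `DataFrame.duplicated` with `keep=False` flags every row whose key columns repeat. The following is a minimal sketch on a made-up three-row frame; `toy` and its values are illustrative and are not read from the dataset.

```python
import pandas as pd

# hypothetical toy data: two movies share a title but differ in year
toy = pd.DataFrame({
    "Title": ["The Host", "Split", "The Host"],
    "Year": [2013, 2016, 2006],
})

# keep=False marks every member of a duplicated group, not just the repeats
dup_titles = toy[toy.duplicated(subset=["Title"], keep=False)]
dup_title_year = toy[toy.duplicated(subset=["Title", "Year"], keep=False)]

print(len(dup_titles))      # rows whose Title repeats
print(len(dup_title_year))  # rows whose (Title, Year) pair repeats
```

An empty result for a set of columns suggests that those columns can serve as a unique identifier.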


### Rename Columns
Rename the long column names and drop the Rank column to make the DataFrame easier to work with.

In [94]:
change_names = {'Revenue (Millions)':'Revenue', 
                'Runtime (Minutes)':'Runtime'}
movies = movies.rename(change_names, axis='columns')
movies = movies.drop('Rank', axis=1)
movies.head(2)

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


### Use Modified Title as Unique Index

A DataFrame does not require that its index be unique, but data analysis of a given unit is often easier if there is a way to uniquely identify that unit.  Here the unit of analysis is a movie.

There are several ways to create a unique index for this dataset:
1. use Rank (renamed to ID)
2. use the sequence of integers created by default (or create another one)
3. use the combination of Title and Year.

The best identifier to use depends on the application.  For a "learn by example" Jupyter Notebook such as this, the best identifier will be the most "natural" identifier.  The natural identifier for a movie is Title.  Using this will make the code easier to read and understand.

For this notebook, the Title field will be appended with the Year for those 2 records that have a duplicate title, thereby making the Title field unique.

Writing maintainable production code often has different requirements than writing an easy-to-understand Jupyter Notebook.  For production code, a multi-index of title and year may be a better approach, especially if new movies are to be added to the dataset.
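
The multi-index approach mentioned above might look like the following sketch. It is not used in the rest of this notebook; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# hypothetical toy data with a repeated title
toy = pd.DataFrame({
    "Title": ["The Host", "Split", "The Host"],
    "Year": [2013, 2016, 2006],
    "Rating": [5.9, 7.3, 7.0],
})

# verify_integrity=True raises ValueError if a (Title, Year) pair repeats,
# which guards against duplicates when new movies are added later
indexed = toy.set_index(["Title", "Year"], verify_integrity=True).sort_index()

# a single movie is addressed by a (title, year) tuple
host_2006_rating = indexed.loc[("The Host", 2006), "Rating"]
print(host_2006_rating)
```

The trade-off is that every lookup needs both parts of the key, which is why this notebook uses a single Title index instead.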

In [95]:
dups = get_dups(movies, ['Title']).copy()
dups

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
239,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [96]:
dups.index

Int64Index([239, 632], dtype='int64')

In [97]:
movies.loc[dups.index]

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
239,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [98]:
# append the Year to the Title of each row in the dups dataframe (movies is updated in the next cell)
dups['Title'] += ': ' + dups['Year'].astype('str')
dups

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
239,The Host: 2013,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,The Host: 2006,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [99]:
movies.loc[dups.index, 'Title'] = dups['Title']

In [100]:
movies.loc[dups.index]

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
239,The Host: 2013,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,The Host: 2006,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [101]:
# verify titles are now unique
get_dups(movies, ['Title'])

Unnamed: 0,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
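
The copy-modify-assign steps above can also be collapsed into a single masked assignment. This is a sketch on made-up data (`toy` is hypothetical), not the approach the notebook itself takes.

```python
import pandas as pd

# hypothetical toy data with a repeated title
toy = pd.DataFrame({
    "Title": ["The Host", "Split", "The Host"],
    "Year": [2013, 2016, 2006],
})

# Boolean mask of rows whose Title occurs more than once
mask = toy["Title"].duplicated(keep=False)

# append the year only to the duplicated titles; .loc aligns on the row index
toy.loc[mask, "Title"] = (
    toy.loc[mask, "Title"] + ": " + toy.loc[mask, "Year"].astype(str)
)

print(toy["Title"].tolist())
```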


In [102]:
# set the index to Title and sort it
movies = movies.set_index('Title', drop=True)
movies = movies.sort_index()
movies.head(3)

Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
10 Cloverfield Lane,"Drama,Horror,Mystery","After getting in a car accident, a woman is he...",Dan Trachtenberg,"John Goodman, Mary Elizabeth Winstead, John Ga...",2016,104,7.2,192968,71.9,76.0
10 Years,"Comedy,Drama,Romance","The night before their high school reunion, a ...",Jamie Linden,"Channing Tatum, Rosario Dawson, Chris Pratt, J...",2011,100,6.1,19636,0.2,
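
With a unique, sorted Title index, rows can be looked up directly by label. A minimal sketch on a made-up frame (`toy` is hypothetical and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Prometheus", "Guardians of the Galaxy", "Split"],
    "Year": [2012, 2014, 2016],
}).set_index("Title").sort_index()

# a row label plus a column label returns a scalar
prometheus_year = toy.loc["Prometheus", "Year"]

# on a sorted index, label slices include both endpoints
through_p = toy.loc["Guardians of the Galaxy":"Prometheus"]

print(prometheus_year)
print(list(through_p.index))
```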


## Persist DataFrame

There are several options.  The best option depends on which criterion is most important for your application.  Some criteria:
* time used to read/write
* memory used to read/write
* space used by persisted object
* use by other applications
* whether indexes, categorical variables, etc. are saved when persisted
* security
* size of data

For a very small dataframe such as movies, I would argue there are only two concerns:
* security (if published publicly)
* keeping indexes, categorical variables, etc.

Pickle is the easiest option, but it raises a security concern: unpickling can execute arbitrary code.  In theory, a GitHub account could be compromised and a valid .pickle file replaced with one that executes code upon being loaded.

hdf5 has a lot of overhead for files smaller than 1 or 2 GB.  The movies dataframe is much less than that, but hdf5 is easy to use, secure, and maintains indexes, categorical variables, etc.  The .h5 file for this small dataset is larger than the original csv, but it is under 2 MB.
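
To illustrate the "keeping indexes, categorical variables, etc." criterion, the sketch below round-trips a small made-up frame through CSV in memory. CSV is used here only because it needs no extra dependencies; `toy` and its values are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"Genre": pd.Categorical(["Drama", "Comedy", "Drama"]),
     "Year": [2006, 2013, 2016]},
    index=pd.Index(["A", "B", "C"], name="Title"),
)

buf = io.StringIO()
toy.to_csv(buf)

# naive read: Title comes back as an ordinary column, Genre is no longer categorical
buf.seek(0)
naive = pd.read_csv(buf)

# the metadata has to be restated by hand to recover the original structure
buf.seek(0)
restored = pd.read_csv(buf, index_col="Title", dtype={"Genre": "category"})

print(naive.dtypes["Genre"], restored.dtypes["Genre"])
```

A format that stores this metadata alongside the data avoids having to repeat it at every read.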

In [103]:
# for convenience and security, use hdf5
movies.to_hdf('../data/movies.h5', key='movies', mode='w')

## Simple Data Queries

### Find Longest and Shortest Runtimes

In [19]:
runtime_min = movies['Runtime'].min()
runtime_max = movies['Runtime'].max()
print(f"Shortest Runtime: {runtime_min:>3} minutes")
print(f"Longest Runtime:  {runtime_max:>3} minutes")

Shortest Runtime:  66 minutes
Longest Runtime:  191 minutes


### Find Stats about a Numeric Column

In [20]:
# find stats about Runtime
movies['Runtime'].describe()

count    1000.000000
mean      113.172000
std        18.810908
min        66.000000
25%       100.000000
50%       111.000000
75%       123.000000
max       191.000000
Name: Runtime, dtype: float64

In [21]:
# get the min and max directly from the Series produced by describe
stats = movies['Runtime'].describe()
print(stats['min'], stats['max'])

66.0 191.0


### Using sum() and mean() with Boolean Series
True is 1, False is 0  
sum() counts the number of True values  
mean() computes the fraction of True values  
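
A minimal sketch of these three rules on made-up runtimes (the values are illustrative):

```python
import pandas as pd

runtimes = pd.Series([90, 120, 150, 100])  # made-up runtimes in minutes
is_long = runtimes > 110                   # Boolean Series: False, True, True, False

n_long = is_long.sum()      # number of True values
frac_long = is_long.mean()  # fraction of True values

print(n_long, frac_long)
```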

### Find Percent of Movies longer than 75th Runtime Percentile

In [22]:
long_movies_frac = (movies['Runtime'] > stats['75%']).mean()

print(f"{long_movies_frac*100:4.1f} percent of movies are longer than the 75th percentile.")

24.6 percent of movies are longer than the 75th percentile.


The result is not exactly 25% due to ties in the length of movies.  Movie length is recorded to the nearest minute, and this rounding creates ties.  Had movie length been recorded to the nearest nanosecond, it's unlikely there would have been any ties.
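
The effect of ties can be reproduced on a small made-up Series in which several values equal the 75th percentile. This is a sketch with illustrative numbers, using the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# made-up runtimes with a four-way tie at 120 minutes
runtimes = pd.Series([90, 100, 110, 120, 120, 120, 120, 130])

p75 = runtimes.quantile(0.75)  # falls inside the run of tied values

frac_above = (runtimes > p75).mean()         # strictly longer than the 75th percentile
frac_at_or_above = (runtimes >= p75).mean()  # the tied values are counted as well

print(p75, frac_above, frac_at_or_above)
```

Here only one of eight values is strictly above the percentile, so the fraction is well below 25%.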

### Display Movie with Highest Rating

In [23]:
criteria = (movies['Rating'].max() == movies['Rating'])
movies[criteria]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
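
The Boolean-mask approach above returns every movie tied for the highest rating. By contrast, `idxmax()` returns only the label of the first maximum. A sketch on made-up ratings (`ratings` is hypothetical):

```python
import pandas as pd

# made-up ratings with a tie for first place
ratings = pd.Series([8.8, 9.0, 9.0], index=["A", "B", "C"])

# label of the first maximum only
first_best = ratings.idxmax()

# every label tied for the maximum
all_best = ratings[ratings == ratings.max()].index.tolist()

print(first_best, all_best)
```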


## Pandas and Python Data Types

**Python**

1. Python uses "duck typing", one consequence of which is that the type of a variable is rarely checked in code.
2. It is now recommended practice in Python to use "type hints".

At first glance, the above two statements appear contradictory, but they are not.  In the first case, the "consumer" of the code is the Python interpreter.  In the second case the "consumer" of the code is the developer who must read and maintain the code.

Although "type hints" will not be discussed here, the take-away is that it is helpful to the developer to understand the type of a variable, even though it is usually not necessary to write application code that checks for the type of a variable.
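
For readers who have not seen type hints, the sketch below shows both statements at once. The function name `total_runtime` is made up for illustration.

```python
from typing import List


def total_runtime(runtimes: List[int]) -> int:
    """Return the combined runtime in minutes."""
    return sum(runtimes)


# the hint tells the reader that a list of integers is expected ...
from_list = total_runtime([121, 124])

# ... but the interpreter does not enforce it: a tuple "quacks" the same way
from_tuple = total_runtime((121, 124))

print(from_list, from_tuple)
```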

**Data Analysis**

A similar situation arises in analyzing data.  Although writing application code for data analysis rarely requires checking the type of a Series, understanding the type of a Series is helpful to both exploratory data analysis and to writing good code.

Understanding the type of a Series allows for:
1. a better understanding of what that variable means to the application
2. understanding what happens when a Series contains a "null" value
3. understanding what happens when a Series uses "object" to represent strings
4. understanding how to minimize the amount of memory required by a DataFrame.

### Unknown or Null Values
In data analysis every variable can have a value that is either: "known" or "unknown".  

Another name for "unknown" is "null".

In Pandas and Numpy, "np.nan" is used to represent "unknown" or "null".

Example: a person is asked a yes/no question and refuses to answer; the answer is "unknown".

In [24]:
# np.NaN is an alias for np.nan (the alias was removed in NumPy 2.0, so this cell needs an older NumPy)
np.nan is np.NaN

True

In [25]:
# convenience function
def print_and_eval(s):
    print(f'{s:<16}  {eval(s)}')

In [26]:
# comparison with nan

print_and_eval('3 < np.nan')
print_and_eval('3 > np.nan')
print_and_eval('np.nan == np.nan')
print()
print_and_eval('3 != np.nan')
print_and_eval('np.nan != np.nan')
print()
print_and_eval('np.isnan(np.nan)')

3 < np.nan        False
3 > np.nan        False
np.nan == np.nan  False

3 != np.nan       True
np.nan != np.nan  True

np.isnan(np.nan)  True
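
One practical consequence of these comparisons: testing a Series with `== np.nan` never finds the missing values, so `isna()` should be used instead. A minimal sketch (`vals` is a made-up Series):

```python
import numpy as np
import pandas as pd

vals = pd.Series([1.0, np.nan, 3.0])

matches_eq = (vals == np.nan).sum()  # nothing compares equal to nan
matches_isna = vals.isna().sum()     # counts the missing value

print(matches_eq, matches_isna)
```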


In [27]:
# the data type of np.nan is float
type(np.nan)

float

### Series have a Single Data Type
A Series is implemented as a numpy array with all values having the same data type.

As everything is a subclass of "object", a Series of type "object" can hold values of any data type.

In [28]:
# When the Series constructor is given a Python list of integers,
# the default type of Series will be int64
s = pd.Series([1, -2, 3])
s.dtype

dtype('int64')

In [29]:
# It is possible to specify the datatype,
# but the data must conform to the specification
try:
    s = pd.Series([1, -2, 3], dtype='uint8')
    print(s.dtype)
except OverflowError as err:
    print(err)

Trying to coerce negative values to unsigned integers


In [30]:
# For the purpose of minimizing memory, it is possible to let Pandas
# find the smallest datatype that can hold the given data
s = pd.Series([1, -2, 3])
print(s.dtype)
s2 = pd.to_numeric(s, downcast='integer')
print(s2.dtype)

int64
int8
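
The memory saving from downcasting can be checked with `Series.nbytes`. A sketch on a made-up Series of 1000 integers; the names `big` and `big_small` are hypothetical.

```python
import pandas as pd

big = pd.Series(range(1000), dtype="int64")
big_small = pd.to_numeric(big, downcast="integer")  # values up to 999 need int16

print(big.dtype, big.nbytes)              # 8 bytes per value
print(big_small.dtype, big_small.nbytes)  # 2 bytes per value
```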


### Distinction between Series.equals and **==**

Two Series are not "equal" if their datatypes differ.  However, as with numpy, if two numeric values are the same, they compare equal regardless of their datatype.

In [31]:
s.equals(s2)

False

In [32]:
(s == s2).all()

True
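
The two comparisons also differ in how they treat missing values: `equals` considers NaNs in the same position to be equal, whereas `==` does not. A sketch with made-up Series `x` and `y`:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan])
y = pd.Series([1.0, np.nan])

same_by_equals = x.equals(y)       # NaNs in the same position count as equal
same_by_eq = bool((x == y).all())  # nan == nan is False, so this fails

print(same_by_equals, same_by_eq)
```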

### isclose()
Series values of type float generally should not be compared with ==, but with either np.isclose() or math.isclose(), to account for the limited precision of the float data type.

For a more in-depth discussion of this issue, see my notebook: "Core Python 1".

In [33]:
a = 1.1
b = 2.2
a + b == 3.3

False

In [34]:
np.isclose(a+b, 3.3)

True

### A Series having Unknown Values

If any of the values in a Series are unknown, then the type of that series must be float or object in order to hold the np.nan value.

In [35]:
pd.Series([1, 2, 3], dtype='int64')

0    1
1    2
2    3
dtype: int64

In [36]:
try:
    pd.Series([1, 2, 3, np.nan], dtype='int64')
except ValueError as err:
    print(err)

cannot convert float NaN to integer


In [37]:
pd.Series([1, 2, 3, np.nan], dtype=float)

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [38]:
pd.Series([1, 2, 3, np.nan], dtype=object)

0      1
1      2
2      3
3    NaN
dtype: object
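
The float-or-object restriction applies to NumPy-backed dtypes. Pandas 0.24 and later also provide a nullable integer extension dtype, written "Int64" with a capital I. It is not used elsewhere in this notebook, so the following is only a sketch (`nullable` is a made-up name):

```python
import pandas as pd

# "Int64" (capital I) is the nullable extension dtype, distinct from NumPy's "int64"
nullable = pd.Series([1, 2, 3, None], dtype="Int64")

print(nullable.dtype)         # stays an integer dtype despite the missing value
print(nullable.isna().sum())  # number of missing values
print(nullable.sum())         # missing values are skipped by default
```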

### Datatype: Object
The only way a Series can hold values of different datatypes is to use 'object'.  Every datatype is a subclass of object, so this works.

In the pandas version used in this notebook, a Series holds strings with dtype 'object' (pandas 1.0 and later also provide a dedicated 'string' dtype).  In practice, a column of type 'object' is almost always a column of strings, rather than a column of different datatypes.

If there are a small number of unique string values, the column datatype should be converted to category, both to reduce memory and to clarify the meaning of that variable.

In [39]:
# different datatypes, default to object to hold them
s = pd.Series([{"one":1}, [2, 3], (3,4), 5])
s.dtype

dtype('O')

In [40]:
# same datatype, but strings must be held as objects
s = pd.Series(['one', 'two'])
s.dtype

dtype('O')

In [41]:
# only two unique values, this is better represented as a category
s.nunique()

2

In [42]:
# a category is like a factor in R, or an enumerated datatype in other languages
s = s.astype('category')
s

0    one
1    two
dtype: category
Categories (2, object): [one, two]
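
The memory reduction from converting to category can be measured with `memory_usage(deep=True)`. A sketch on made-up data with 3000 strings and only three unique values; the names `genres` and `genres_cat` are hypothetical.

```python
import pandas as pd

genres = pd.Series(["Drama", "Comedy", "Action"] * 1000)  # 3000 strings, 3 unique
genres_cat = genres.astype("category")

# deep=True includes the memory used by the string values themselves
bytes_plain = genres.memory_usage(deep=True)
bytes_cat = genres_cat.memory_usage(deep=True)

print(bytes_plain, bytes_cat)
```

The categorical version stores each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.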

### View vs Copy

A view of a dataframe is a reference into the dataframe.  If the view is changed, the dataframe is changed.

With a copy of a dataframe, if the copy is changed, the original dataframe remains unchanged.

If "is" returns True, then the variable must be a view.  However, if "is" returns False, the variable may or may not be a view: it may have different metadata and yet still refer to the same underlying data in memory.
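
The distinction is easiest to demonstrate with a plain NumPy array, since whether a given pandas operation returns a view or a copy depends on the pandas version and its Copy-on-Write setting. The sketch below uses made-up names (`base`, `view`, `copied`) and is an illustration of the concept, not of pandas behavior.

```python
import numpy as np

base = np.arange(5)          # array([0, 1, 2, 3, 4])
view = base[1:3]             # basic slicing returns a view
copied = base[1:3].copy()    # an explicit, independent copy

view[0] = 99                 # writes through to base
copied[1] = -1               # leaves base untouched

print(base)                            # the change made through the view is visible
print(view is base)                    # a view is a different Python object
print(np.shares_memory(base, view))    # ... that refers to the same memory
print(np.shares_memory(base, copied))  # the copy has its own memory
```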

### Handling Null Values

In [43]:
# df[column] is a view into the DataFrame, not a copy
metascore = movies['Metascore']
metascore is movies['Metascore']

True

In [44]:
# number of known values
metascore.count()

936

In [45]:
# number of unknown values
metascore.isna().sum()

64

In [46]:
# sum of known values (ignores np.nan by default)
metascore.sum()

55210.0

In [47]:
# sum of known values, without ignoring nan
metascore.sum(skipna=False)

nan

In [48]:
# compute the average of the known values
metascore.sum() / metascore.count()

58.98504273504273

In [49]:
# compute the average of the known values
metascore.mean()

58.98504273504273

In [50]:
# When using Machine Learning algorithms, it can be helpful to impute a missing value rather than
# use null.  Sometimes the mean is a good imputed value.
metascore = metascore.fillna(value=metascore.mean())

In [51]:
# number of known values
metascore.count()

1000

In [52]:
# one reason for imputing with the mean value is that it does not change the overall mean
# of the column
metascore.mean()

58.985042735042654
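
Mean imputation preserves the mean, but it reduces the spread of the column, which is worth keeping in mind. A sketch on made-up scores (`scores` and `filled` are hypothetical names):

```python
import numpy as np
import pandas as pd

scores = pd.Series([50.0, 70.0, np.nan, 90.0])  # made-up scores, one missing
filled = scores.fillna(scores.mean())

print(scores.mean(), filled.mean())  # the mean is preserved
print(scores.std(), filled.std())    # the standard deviation shrinks
```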

## Movie Ratings

### Comparing Series of type float with ==

If the value you are comparing against was computed using floating point arithmetic, it may no longer be exact.

In [53]:
movies['Rating'].nlargest(n=5)

Title
The Dark Knight    9.0
Dangal             8.8
Inception          8.8
Interstellar       8.6
Kimi no na wa      8.6
Name: Rating, dtype: float64

In [54]:
# find all movies with rating = 8.5 + 0.1 + 0.1 + 0.1
# there are no movies!
movies[movies['Rating'] == 8.5 + 0.1 + 0.1 + 0.1]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore


In [55]:
# this gives the expected result
movies[np.isclose(movies['Rating'], 8.5 + 0.1 + 0.1 + 0.1)]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
