# Pandas Introduction Part 2

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

In [1]:
import pandas as pd
import numpy as np

## Movie Data Set

In [2]:
# read in IMDB movie dataset
movies = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [3]:
# display first 2 rows of the DataFrame
movies.head(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [4]:
# display last 2 rows
movies.tail(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


In [5]:
movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [6]:
movies.index

RangeIndex(start=0, stop=1000, step=1)

In [7]:
movies.iloc[0:3, 0:3]

Unnamed: 0,Rank,Title,Genre
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi"
1,2,Prometheus,"Adventure,Mystery,Sci-Fi"
2,3,Split,"Horror,Thriller"


## Find Fields for Unique Identification

The primary object of interest is a movie, so it is necessary to know which field (or combination of fields) uniquely identifies a movie.

The DataFrame index will then be set to this unique identifier (or identifiers).

In [8]:
# groupby is discussed later ...
def get_dups(df, cols):
    """Find all duplicate records in df for the given list of columns cols."""
    dfs = [g for name, g in df.groupby(cols) if len(g) > 1]
    if dfs:
        return pd.concat(dfs)
    else:
        return pd.DataFrame()

In [9]:
# Titles are not unique in this dataset
get_dups(movies, ['Title'])

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
239,240,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,633,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [10]:
# Title, Year is unique
get_dups(movies, ['Title', 'Year'])

In [11]:
# Rank is unique
get_dups(movies, ['Rank'])
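
As a possible alternative to the groupby-based `get_dups` helper, pandas also provides `DataFrame.duplicated`. The sketch below uses a small made-up table (`sample` is hypothetical, not the real dataset) to show the same kind of check.

```python
import pandas as pd

# Hypothetical miniature of the movie table, for illustration only
sample = pd.DataFrame({
    'Title': ['The Host', 'Split', 'The Host'],
    'Year': [2013, 2016, 2006],
})

# keep=False marks every member of a duplicate group, not just the repeats
title_dups = sample[sample.duplicated(subset=['Title'], keep=False)]
title_year_dups = sample[sample.duplicated(subset=['Title', 'Year'], keep=False)]

print(len(title_dups))        # number of rows that share a Title
print(title_year_dups.empty)  # True when (Title, Year) is unique
```

Unlike `get_dups`, this keeps the rows in their original order rather than grouping them by key.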

### Rename Columns

In [12]:
change_names = {'Rank':'ID', 
                'Revenue (Millions)':'Revenue', 
                'Runtime (Minutes)':'Runtime'}
movies = movies.rename(change_names, axis='columns')
movies.head(2)

Unnamed: 0,ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


### Use Modified Title as Unique Index

A DataFrame does not require that its index be unique, but most data analysis of a given "unit" is easier if there is a way to uniquely identify that unit.  Here the unit of analysis is a movie.

There are several ways to create a unique index for this dataset:
1. use Rank (renamed to ID)
2. use the sequence of integers created by default (or create another one)
3. use the combination of Title and Year.

The best identifier to use depends on the application.  For a "learn by example" Jupyter Notebook, the best identifier will be the most "natural" identifier.  The natural identifier for a movie is Title.  Using this will make the code easier to read and understand.

For this notebook, the Title field will be appended with the Year for those 2 records that have a duplicate title, thereby making the Title field unique.

Writing maintainable production code often has different requirements than writing a maintainable Jupyter Notebook.  A multi-index of title and year may be a better approach, especially if new movies are to be added to the dataset.  If performance has been proven to be a problem, the ID field might be better.
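
The multi-index approach mentioned above could look roughly like the following. This is only a sketch on made-up rows (`sample` and `by_title_year` are hypothetical names); the notebook itself continues with the modified Title instead.

```python
import pandas as pd

# Hypothetical rows standing in for the movie data
sample = pd.DataFrame({
    'Title': ['The Host', 'Split', 'The Host'],
    'Year': [2013, 2016, 2006],
    'Director': ['Andrew Niccol', 'M. Night Shyamalan', 'Bong Joon Ho'],
})

# Index on the (Title, Year) pair instead of altering any Title
by_title_year = sample.set_index(['Title', 'Year'])
print(by_title_year.index.is_unique)

# A full (Title, Year) key selects exactly one row
director = by_title_year.loc[('The Host', 2006), 'Director']
print(director)
```

The trade-off is that every lookup needs both parts of the key, which is more verbose than indexing by Title alone.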

In [13]:
# easiest to use a unique ID as index, to alter the duplicate Titles
movies = movies.set_index('ID')
dups = get_dups(movies, ['Title'])
dups

ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
240,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
633,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [14]:
# for each ID in the dups dataframe, modify Title in the movies dataframe
# (movie_id is used as the loop variable to avoid shadowing the built-in id)
for movie_id in dups.index:
    movies.loc[movie_id, 'Title'] += ': ' + str(movies.loc[movie_id, 'Year'])

In [15]:
# Verify change to Title worked
movies.loc[dups.index][['Title', 'Year']]

ID,Title,Year
240,The Host: 2013,2013
633,The Host: 2006,2006


In [16]:
# verify titles are now unique
get_dups(movies, ['Title'])

In [17]:
# remove ID as an index, but keep it as a column
movies.reset_index(drop=False, inplace=True)

# set the index to Title
movies.set_index('Title', drop=True, inplace=True)
movies.head(3)

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


In [18]:
# save a copy to easily revert back to for later notebooks
import pickle
with open('../data/movies.pickle','wb') as p:
    pickle.dump(movies, p)
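
A later notebook could reload the saved DataFrame with `pd.read_pickle`. The following is a minimal round-trip sketch; it writes to a temporary directory rather than `../data/`, and `sample` is a hypothetical stand-in for `movies`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the movies DataFrame
sample = pd.DataFrame(
    {'ID': [1, 2]},
    index=pd.Index(['Guardians of the Galaxy', 'Prometheus'], name='Title'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'movies.pickle')  # temporary path, not the notebook's
    sample.to_pickle(path)                     # DataFrame method that writes a pickle
    restored = pd.read_pickle(path)            # reads it back, index included

print(restored.equals(sample))
print(restored.index.name)
```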

## Simple Data Queries

### Find Longest and Shortest Runtimes

In [19]:
runtime_min = movies['Runtime'].min()
runtime_max = movies['Runtime'].max()
print(f"Shortest Runtime: {runtime_min:>3} minutes")
print(f"Longest Runtime:  {runtime_max:>3} minutes")

Shortest Runtime:  66 minutes
Longest Runtime:  191 minutes


### Find Stats about a Numeric Column

In [20]:
# find stats about Runtime
movies['Runtime'].describe()

count    1000.000000
mean      113.172000
std        18.810908
min        66.000000
25%       100.000000
50%       111.000000
75%       123.000000
max       191.000000
Name: Runtime, dtype: float64

In [21]:
# get the min and max directly from the result Series
stats = movies['Runtime'].describe()
print(stats['min'], stats['max'])

66.0 191.0


### Using sum() and mean() with Boolean Series
True is 1, False is 0  
sum() counts the number of True values  
mean() computes the fraction of True values  
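
A minimal sketch of these three points, using made-up runtimes rather than the movie data:

```python
import pandas as pd

runtimes = pd.Series([90, 125, 140, 100])  # made-up runtimes in minutes
is_long = runtimes > 120                   # Boolean Series: False, True, True, False

print(is_long.sum())   # number of True values
print(is_long.mean())  # fraction of True values
```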

### Find Percent of Movies longer than 75th Runtime Percentile

In [22]:
# about 25% of the movies should be longer than the 75th percentile
fraction = (movies['Runtime'] > stats['75%']).mean()
print(f"{fraction*100:4.1f} percent of movies are longer than the 75th percentile.")

24.6 percent of movies are longer than the 75th percentile.


The result is not exactly 25% because runtimes are recorded to the nearest minute: several movies tie at the 75th percentile value, and the strict `>` comparison excludes them.
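
The effect of ties can be reproduced on a small made-up Series (the values below are illustrative only). Three entries sit exactly on the 75th percentile, so the strict and non-strict comparisons give quite different fractions.

```python
import pandas as pd

# Made-up runtimes with three values tied at 120 minutes
runtimes = pd.Series([90, 100, 100, 110, 120, 120, 120, 130])
q75 = runtimes.quantile(0.75)

strictly_longer = (runtimes > q75).mean()  # excludes the tied values
at_least = (runtimes >= q75).mean()        # includes the tied values

print(q75, strictly_longer, at_least)
```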

### Display Movie Row with Highest Rating

In [23]:
criteria = (movies['Rating'].max() == movies['Rating'])
movies[criteria]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
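
When only one top-rated row is needed, `Series.idxmax` may be a shorter route. The sketch below uses a hypothetical three-row table indexed by Title.

```python
import pandas as pd

# Hypothetical ratings, indexed by Title as in the notebook
sample = pd.DataFrame(
    {'Rating': [8.1, 9.0, 7.0]},
    index=pd.Index(
        ['Guardians of the Galaxy', 'The Dark Knight', 'Prometheus'],
        name='Title',
    ),
)

best_title = sample['Rating'].idxmax()  # index label of the first maximum
best_row = sample.loc[best_title]       # that row, returned as a Series

print(best_title)
```

Note that `idxmax` returns only the first label when several rows tie for the maximum, whereas the Boolean mask used above returns all of them.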


## Unknown Values
In data analysis every variable can have a value that is either: "known" or "unknown".  

Another name for "unknown" is "null".

In Pandas and Numpy, "np.nan" is used to represent "unknown" or "null".

### Example of known/unknown:  
A person is asked a yes/no question and refuses to answer; the answer is: "unknown"  
A person answers yes; the answer is: yes
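
The survey example can be sketched as a Series in which the refused answer is stored as `np.nan` (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Three people are asked; the second refuses to answer
answers = pd.Series(['yes', np.nan, 'no'])

unknown = answers.isna()  # True where the answer is unknown
print(unknown.tolist())
print(answers.count())    # count() includes only the known answers
```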

In [24]:
# nan is written as np.nan; the alias np.NaN is only available in NumPy versions before 2.0
np.nan is np.NaN

True

In [25]:
# any ordering or equality comparison with np.nan evaluates to False,
# even np.nan == np.nan
print(3 < np.nan)
print(3 > np.nan)
print(np.nan == np.nan)

# a special function is needed to determine if the value is unknown
print(np.isnan(np.nan))

False
False
False
True


In [26]:
# the data type of np.nan is float
type(np.nan)

float

### Series have a Single Data Type
A Series is implemented as a numpy array with all values having the same data type.

As everything is a subclass of "object", a Series of type "object" can hold values of any data type.
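
As a small illustration with made-up values, a Series built from mixed Python values gets the 'object' dtype, and each element keeps its own Python type:

```python
import pandas as pd

mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)

# Each element is still the original Python object
element_types = [type(v).__name__ for v in mixed]
print(element_types)
```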

In [27]:
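# numpy creates a float64 array by default for floating-point values
# (np.ndarray([1, 2, 3]) would instead allocate an uninitialized array of shape (1, 2, 3))
a = np.array([1.0, 2.0, 3.0])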
a.dtype

dtype('float64')

In [28]:
# tell numpy what type of array to create
a = np.array([1, 2, 3], dtype='int32')
a.dtype

dtype('int32')

In [29]:
# Pandas creates an int64 Series by default, for integers
s = pd.Series([1, 2, 3])
s.dtype

dtype('int64')

In [30]:
# tell Pandas what type of Series to create
s = pd.Series([1, 2, 3], dtype='int32')
s.dtype

dtype('int32')

### A Series with an Unknown Value
If any of the values are unknown, then with the default NumPy-backed types the Series must be float or object in order to hold np.nan.
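
As an aside, recent pandas versions also offer a nullable integer extension type, requested with the capitalized name `'Int64'`, which can hold missing values without converting to float. The notebook does not use it; this is only a brief sketch.

```python
import pandas as pd

# 'Int64' (capital I) is the pandas nullable integer type, not NumPy's int64
nullable = pd.Series([1, 2, None], dtype='Int64')

print(nullable.dtype)
print(nullable.isna().sum())  # number of missing values
print(nullable.sum())         # missing values are skipped by default
```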

In [31]:
# Pandas defaults this to int64
s = pd.Series([1, 2, 3])
print(s.dtype)

# Pandas defaults this to float64, so it can hold np.nan
s = pd.Series([1, 2, 3, np.nan])
print(s.dtype)

int64
float64


In [32]:
# arbitrary Python objects (dicts, lists, tuples) are held with the catch-all dtype, 'object'
# a Pandas row will often contain values of different data types
s = pd.Series([{"one":1}, [2, 3], (3,4), 5])
s.dtype

dtype('O')

In [33]:
# the default way to hold strings is also to use 'object'
s = pd.Series(['one', 'two'])
s.dtype

dtype('O')

In practice, a Series of type 'object' almost always holds strings and only strings.
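
Since the dtype alone does not guarantee this, one way to check is to test each element. The helper below, `holds_only_strings`, is a hypothetical name introduced here for illustration.

```python
import pandas as pd

def holds_only_strings(s):
    """Hypothetical helper: True if every element of s is a Python str."""
    return bool(s.map(lambda v: isinstance(v, str)).all())

strings = pd.Series(['one', 'two'])
mixed = pd.Series(['one', 2])

print(holds_only_strings(strings))
print(holds_only_strings(mixed))
```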

### Handling Null Values

In [34]:
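# note: in the pandas version used here, metascore refers to the DataFrame's own column, not a copy
# (the result of this identity check may differ in newer pandas versions)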
metascore = movies['Metascore']
metascore is movies['Metascore']

True

In [35]:
# number of non-null values
metascore.count()

936

In [36]:
# number of null values
metascore.isnull().sum()

64

In [37]:
# sum of non-null values
metascore.sum()

55210.0

In [38]:
# compute the average of the non-null values
metascore.sum() / metascore.count()

58.98504273504273

In [39]:
# compute the average of the non-null values
metascore.mean()

58.98504273504273

In [40]:
# When using Machine Learning algorithms, it can be helpful to impute a missing value
# rather than leave it null.  Sometimes the mean is a good imputed value.
metascore = metascore.fillna(value=metascore.mean())

In [41]:
metascore.count()

1000

In [42]:
# one reason for imputing with the mean value is that it does not change the overall mean
# of the column
metascore.mean()

58.985042735042626
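
The same behavior can be checked on a small made-up Series. The sketch also prints the standard deviation, which does change, because the imputed values all sit exactly at the mean.

```python
import numpy as np
import pandas as pd

scores = pd.Series([50.0, 70.0, np.nan, 60.0])  # made-up scores
filled = scores.fillna(scores.mean())

print(scores.mean(), filled.mean())  # equal, up to floating-point rounding
print(scores.std(), filled.std())    # the spread shrinks after imputation
```

Mean imputation therefore preserves the column mean but understates its variability, which is worth keeping in mind before using it.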