# Pandas Introduction Part 2

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv


TODO Links to other pandas notebooks.

In [244]:
import pandas as pd
import numpy as np

## Movie Data Set

In [245]:
# read in the IMDB movie dataset
movies = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [246]:
# display first 2 rows of the DataFrame
movies.head(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [247]:
# display last 2 rows
movies.tail(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


In [248]:
movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [249]:
movies.index

RangeIndex(start=0, stop=1000, step=1)

In [250]:
movies.iloc[0:3, 0:3]

Unnamed: 0,Rank,Title,Genre
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi"
1,2,Prometheus,"Adventure,Mystery,Sci-Fi"
2,3,Split,"Horror,Thriller"


## Look for Unique Identifier

Check if Title is unique.  
Check if Title, Year is unique.

In [251]:
def get_dups(df, cols):
    dfs = [g for name, g in df.groupby(cols) if len(g) > 1]
    if dfs:
        return pd.concat(dfs)
    else:
        return None

In [252]:
get_dups(movies, ['Title'])

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
239,240,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
632,633,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [253]:
get_dups(movies, ['Title', 'Year'])

In [254]:
get_dups(movies, ['Rank'])
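
The same uniqueness check can be sketched without an explicit groupby, using `DataFrame.duplicated`. The toy frame below is a hypothetical stand-in for `movies`, not data taken from the real dataset.

```python
import pandas as pd

# hypothetical toy frame standing in for the movies DataFrame
toy = pd.DataFrame({
    'Title': ['The Host', 'Split', 'The Host'],
    'Year': [2013, 2016, 2006],
})

# keep=False flags every row of a duplicated group, not only the repeats
title_dups = toy[toy.duplicated(subset=['Title'], keep=False)]
title_year_dups = toy[toy.duplicated(subset=['Title', 'Year'], keep=False)]

print(title_dups)             # both 'The Host' rows
print(title_year_dups.empty)  # True: (Title, Year) is unique here
```

An empty result plays the role of the `None` returned by `get_dups`.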

In [255]:
change_names = {'Rank':'ID', 'Revenue (Millions)':'Revenue', 'Runtime (Minutes)':'Runtime'}
movies = movies.rename(change_names, axis='columns')
movies.head(2)

Unnamed: 0,ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In the data processing that follows, it is convenient to have Title as a single level index that refers to unique movies.

This can be done by renaming each duplicate to "{title}: {year}".

In [256]:
movies = movies.set_index('ID')
dups = get_dups(movies, ['Title'])
dups

ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
240,The Host,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
633,The Host,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [257]:
# movie_id avoids shadowing the built-in id()
for movie_id in dups.index:
    movies.loc[movie_id, 'Title'] += ': ' + str(movies.loc[movie_id, 'Year'])
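
A vectorized variant of the renaming step, as a rough sketch on a hypothetical toy frame (the IDs and titles are illustrative only). It assumes the duplicates are identified by Title alone, as in the cells above.

```python
import pandas as pd

# hypothetical toy frame indexed by ID, standing in for movies
toy = pd.DataFrame(
    {'Title': ['The Host', 'Split', 'The Host'], 'Year': [2013, 2016, 2006]},
    index=pd.Index([240, 3, 633], name='ID'),
)

# mark every row whose Title occurs more than once
dup_mask = toy['Title'].duplicated(keep=False)

# append ": <year>" only to the duplicated titles
toy.loc[dup_mask, 'Title'] = (
    toy.loc[dup_mask, 'Title'] + ': ' + toy.loc[dup_mask, 'Year'].astype(str)
)

print(toy['Title'].tolist())
```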

In [258]:
# Verify change to unique Title
movies.loc[dups.index]

ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
240,The Host: 2013,"Action,Adventure,Romance",When an unseen enemy threatens mankind by taki...,Andrew Niccol,"Saoirse Ronan, Max Irons, Jake Abel, Diane Kruger",2013,125,5.9,96852,26.62,35.0
633,The Host: 2006,"Comedy,Drama,Horror",A monster emerges from Seoul's Han River and f...,Bong Joon Ho,"Kang-ho Song, Hee-Bong Byun, Hae-il Park, Doon...",2006,120,7.0,73491,2.2,85.0


In [259]:
# verify titles are now unique
get_dups(movies, ['Title'])

In [260]:
# save a copy to easily revert to later
movies_copy = movies.copy()

### Simple Data Queries

In [261]:
# find the longest and shortest Runtimes
runtime_min = movies['Runtime'].min()
runtime_max = movies['Runtime'].max()
print(f"Shortest Runtime: {runtime_min:>3} minutes")
print(f"Longest Runtime:  {runtime_max:>3} minutes")

Shortest Runtime:  66 minutes
Longest Runtime:  191 minutes


In [262]:
# find stats about Runtime
movies['Runtime'].describe()

count    1000.000000
mean      113.172000
std        18.810908
min        66.000000
25%       100.000000
50%       111.000000
75%       123.000000
max       191.000000
Name: Runtime, dtype: float64

In [263]:
stats = movies['Runtime'].describe()
print(stats['min'], stats['max'])

66.0 191.0


### Using sum() and mean() with Boolean Series
True is 1, False is 0  
sum() counts the number of True values  
mean() computes the fraction of True values  
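
A minimal sketch of both operations on a hypothetical four-value Series (the numbers are made up for illustration):

```python
import pandas as pd

# hypothetical runtimes in minutes
runtimes = pd.Series([90, 130, 125, 100])

# comparing a Series to a scalar gives a Boolean Series
is_long = runtimes > 123

print(is_long.sum())   # 2   (number of True values)
print(is_long.mean())  # 0.5 (fraction of True values)
```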

In [192]:
# verify 75th percentile above
# expect about 25% of the movies longer than 123 minutes
fraction = (movies['Runtime'] > 123).mean()
print(f'{fraction*100:3.2f} percent of movies are longer than 123 minutes')

24.60 percent of movies are longer than 123 minutes


In [193]:
# display the row with the highest rating
criteria = (movies['Rating'].max() == movies['Rating'])
movies[criteria]

ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
55,The Dark Knight,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0


## Series have a Single Data Type
A Series is backed by a one-dimensional array (by default a numpy array), so all of its values share the same data type.

As everything is a subclass of "object", a Series of type "object" can hold values of any data type.

In [194]:
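# np.array infers the dtype from its input; float values give float64
# (np.ndarray([1, 2, 3]) would instead allocate an uninitialized array of shape (1, 2, 3))
a = np.array([1.0, 2.0, 3.0])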
a.dtype

dtype('float64')

In [195]:
# tell numpy what type of array to create
a = np.array([1, 2, 3], dtype='int32')
a.dtype

dtype('int32')

In [196]:
# Pandas creates an int64 Series by default, for integers
s = pd.Series([1, 2, 3])
s.dtype

dtype('int64')

In [197]:
# tell Pandas what type of Series to create
s = pd.Series([1, 2, 3], dtype='int32')
s.dtype

dtype('int32')

## Unknown Values
In data analysis every variable can have a value that is either: "known" or "unknown".  

Another name for "unknown" is "null".

In Pandas and Numpy, "np.nan" is used to represent "unknown" or "null".

#### Example of known/unknown:  
A person is asked a yes/no question and refuses to answer; the answer is: "unknown"  
A person answers yes; the answer is: yes

In [198]:
# nan is written as np.nan; np.NaN is an alias in NumPy versions before 2.0
# (NumPy 2.0 removed the alias, so prefer np.nan)
np.nan is np.NaN

True

In [199]:
# any ordering or equality comparison with np.nan evaluates to False
# (only != evaluates to True)
print(3 < np.nan)
print(3 > np.nan)
print(np.nan == np.nan)

# a special function is needed to determine if the value is unknown
print(np.isnan(np.nan))

False
False
False
True
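
Because `==` cannot find unknown values, a Series is usually tested with `isna()`. A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# nan is the only value that is not equal to itself
print(np.nan != np.nan)  # True

# hypothetical Series with one unknown value
s = pd.Series([1.0, np.nan, 3.0])

# an equality test never matches nan, so it finds nothing
print((s == np.nan).any())  # False

# isna() is the reliable element-wise test
mask = s.isna()
print(mask.tolist())  # [False, True, False]
```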


In [200]:
# the data type of np.nan is float
type(np.nan)

float

## A Series with an Unknown Value
If any of the values are unknown, then with the default numpy-backed dtypes the Series must be float or object in order to hold np.nan.

In [201]:
# Pandas defaults this to int64
s = pd.Series([1, 2, 3])
print(s.dtype)

# Pandas defaults this to float64, so it can hold np.nan
s = pd.Series([1, 2, 3, np.nan])
print(s.dtype)

int64
float64
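
As an aside, pandas also offers nullable extension dtypes that keep integers as integers when a value is missing. A brief sketch, assuming a pandas version that supports the `'Int64'` dtype (note the capital I):

```python
import pandas as pd

# 'Int64' is the nullable integer dtype; missing values are stored as pd.NA
s = pd.Series([1, 2, None], dtype='Int64')

print(s.dtype)         # Int64
print(s.isna().sum())  # 1
print(s.sum())         # 3 (missing values are skipped by default)
```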


In [202]:
# arbitrary Python objects (dicts, lists, tuples, ...) are held using the catch-all dtype, 'object'
# a Pandas row will often contain values of different data types
s = pd.Series([{"one":1}, [2, 3], (3,4), 5])
s.dtype

dtype('O')

In [203]:
# the default way to hold strings is also to use 'object'
s = pd.Series(['one', 'two'])
s.dtype

dtype('O')

In practice, a Series of type 'object' almost always holds strings and only strings.

### Handling Null Values

In [204]:
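# note: in the pandas version used here, selecting a column returns the same cached
# Series object rather than a copy; in newer versions with Copy-on-Write this
# identity check may return False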
metascore = movies['Metascore']
metascore is movies['Metascore']

True

In [205]:
# number of non-null values
metascore.count()

936

In [206]:
# number of null values
metascore.isnull().sum()

64

In [207]:
# sum of non-null values
metascore.sum()

55210.0

In [208]:
# compute the average of the non-null values
metascore.sum() / metascore.count()

58.98504273504273

In [209]:
# compute the average of the non-null values
metascore.mean()

58.98504273504273

In [210]:
# When using Machine Learning algorithms, it can be helpful to impute a missing value rather than
# leave it null.  Sometimes the column mean is a good imputed value.
metascore = metascore.fillna(value=metascore.mean())

In [211]:
metascore.count()

1000

In [212]:
# one reason for imputing with the mean value is that it does not change the overall mean
# of the column
metascore.mean()

58.985042735042626
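
Mean imputation keeps the mean but does shrink the spread, which is worth keeping in mind. A small sketch on hypothetical scores:

```python
import numpy as np
import pandas as pd

# hypothetical scores with one unknown value
scores = pd.Series([50.0, 70.0, np.nan, 60.0])
filled = scores.fillna(scores.mean())

# the mean is unchanged by filling with the mean
print(scores.mean(), filled.mean())  # 60.0 60.0

# the standard deviation gets smaller, because the imputed
# value sits exactly at the center of the distribution
print(scores.std(), filled.std())
```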

## Data Wrangling: Queries involving Genre
Each cell in the Genre column is a single string.  This string consists of several genres separated by commas.  The Actors column is similar.

With data in this format, it is not obvious how to answer queries such as "what genre has the highest average rating?".

There are (at least) three ways to wrangle this column's data:
1. Create an indicator column for the presence/absence of each genre for each movie
2. Compute movies per genre on the fly using apply
3. Use a dictionary of genre_to_titles

In all cases, it is helpful to compute the set of unique Genres (or unique Actors), before doing data analysis.

Examples of each of the above 3 ways of dealing with multiple values inside a single string follow.

In [220]:
# restore the movies DataFrame saved after the Title cleanup
movies = movies_copy.copy()
movies.head(2)

ID,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [221]:
movies = movies.reset_index(drop=False).set_index('Title', drop=True)
movies.head(2)

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [222]:
# Look at Genre as is
movies['Genre'].head(5)

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
Name: Genre, dtype: object

In [223]:
# normal way of finding unique values
# works when cell contains one value
movies['Director'].value_counts().head()

Ridley Scott          8
Michael Bay           6
David Yates           6
Paul W.S. Anderson    6
M. Night Shyamalan    6
Name: Director, dtype: int64

In [225]:
# this doesn't work the way we want it to for Genre or Actors
movies['Genre'].value_counts().head()

Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64

The above is count per genre string, not per genre.

In [226]:
def parse_genre_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    genres = s.split(',')
    return [genre.strip() for genre in genres]

In [227]:
# with parsing of Genre string to list of genres
movies['Genre'].apply(parse_genre_string).head(5)

Title
Guardians of the Galaxy     [Action, Adventure, Sci-Fi]
Prometheus                 [Adventure, Mystery, Sci-Fi]
Split                                [Horror, Thriller]
Sing                        [Animation, Comedy, Family]
Suicide Squad              [Action, Adventure, Fantasy]
Name: Genre, dtype: object

In [228]:
# find the set of unique genres
genres = set()
title_genres_series = movies['Genre'].apply(parse_genre_string)
title_genres_series.apply(lambda x: genres.update(x))
genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}
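
An alternative sketch for the same step, using `str.split` and `explode` instead of a side-effecting `apply`. The three genre strings are hypothetical, and `explode` assumes pandas 0.25 or later.

```python
import pandas as pd

# hypothetical Genre strings
toy = pd.Series(['Drama,Music', 'Comedy,Musical', 'Drama'], name='Genre')

# one row per (movie, genre) pair, with surrounding whitespace removed
per_genre = toy.str.split(',').explode().str.strip()

unique_genres = set(per_genre)
counts = per_genre.value_counts()

print(sorted(unique_genres))
print(counts)
```

Because each genre becomes its own value, `value_counts()` now counts per genre rather than per genre string.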

#### Using Apply to compute Movies per Genre

In [229]:
%%timeit

# vars inside %%timeit are local to the cell
# make movies_per_genre global instead
global movies_per_genre
movies_per_genre = {}

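# get movie count per Genre, computing movie lists on the fly
# caution: `genre in x` is a substring test on the raw string,
# so 'Music' also matches movies tagged 'Musical'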
for genre in genres:
    movies_per_genre['Genre_'+genre] = movies['Genre'].apply(lambda x: genre in x).sum()

7.84 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [230]:
movies_per_genre

{'Genre_Western': 7,
 'Genre_Sport': 18,
 'Genre_Sci-Fi': 120,
 'Genre_Action': 303,
 'Genre_Biography': 81,
 'Genre_Crime': 150,
 'Genre_Adventure': 259,
 'Genre_Family': 51,
 'Genre_Music': 21,
 'Genre_Drama': 513,
 'Genre_Horror': 119,
 'Genre_War': 13,
 'Genre_Animation': 49,
 'Genre_Fantasy': 101,
 'Genre_Thriller': 195,
 'Genre_Mystery': 106,
 'Genre_Romance': 141,
 'Genre_Comedy': 279,
 'Genre_Musical': 5,
 'Genre_History': 29}

#### Using Indicator Variables to Compute Movies per Genre

In [231]:
# Compute genre indicator variables up front, once
# caution: `genre in x` is a substring test, so Genre_Music is also True for 'Musical'
for genre in genres:
    movies['Genre_'+genre] = movies['Genre'].apply(lambda x: genre in x)
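
A substring test on the raw string treats 'Music' as present in 'Musical'. One way to get exact matches is `Series.str.get_dummies`, sketched here on hypothetical data. It assumes the genre strings have no spaces around the commas, as in the Genre column shown above.

```python
import pandas as pd

# hypothetical movies; film B is a Musical, not Music
toy = pd.DataFrame(
    {'Genre': ['Drama,Music', 'Comedy,Musical', 'Drama']},
    index=['Film A', 'Film B', 'Film C'],
)

# substring test: also counts the Musical
substring_hits = toy['Genre'].apply(lambda x: 'Music' in x).sum()

# exact test: split on ',' and build one Boolean column per genre
indicators = (
    toy['Genre'].str.get_dummies(sep=',')
    .astype(bool)
    .add_prefix('Genre_')
)
exact_hits = indicators['Genre_Music'].sum()

print(substring_hits, exact_hits)  # 2 1
```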

In [232]:
# display the indicator columns just created
# select all columns that begin with Genre_
movies_genre_cols = movies.filter(regex='^Genre_', axis='columns')
movies_genre_cols.head(2)

Title,Genre_Western,Genre_Sport,Genre_Sci-Fi,Genre_Action,Genre_Biography,Genre_Crime,Genre_Adventure,Genre_Family,Genre_Music,Genre_Drama,Genre_Horror,Genre_War,Genre_Animation,Genre_Fantasy,Genre_Thriller,Genre_Mystery,Genre_Romance,Genre_Comedy,Genre_Musical,Genre_History
Guardians of the Galaxy,False,False,True,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False


In [233]:
# sum every indicator column
movies_per_genre2 = movies_genre_cols.sum(axis='rows')
movies_per_genre2

Genre_Western        7
Genre_Sport         18
Genre_Sci-Fi       120
Genre_Action       303
Genre_Biography     81
Genre_Crime        150
Genre_Adventure    259
Genre_Family        51
Genre_Music         21
Genre_Drama        513
Genre_Horror       119
Genre_War           13
Genre_Animation     49
Genre_Fantasy      101
Genre_Thriller     195
Genre_Mystery      106
Genre_Romance      141
Genre_Comedy       279
Genre_Musical        5
Genre_History       29
dtype: int64

In [234]:
%%timeit
global movies_per_genre2

movies_per_genre2 = movies_genre_cols.sum(axis='rows').to_dict()

408 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [235]:
movies_per_genre == movies_per_genre2

True

Computing the movies per genre using precomputed Indicator Variables was almost 20 times faster.

It is often easier to work with indicator columns, if there are not too many of them.  
After the one time creation of the indicator columns, queries are fast.

#### Using an Inverted Dictionary to Compute Movies per Genre

In [236]:
# create a mapping from genre to titles
from collections import defaultdict
genre_to_titles = defaultdict(list)
for genre in genres:
    titles = movies[movies['Genre_'+genre]].index.tolist()
    for title in titles:
        genre_to_titles[genre].append(title)
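
The inverted dictionary can also be built in a single pass over the raw Genre strings, without relying on the indicator columns. A sketch on hypothetical data; `toy_genre_to_titles` is an illustrative name, not part of the notebook.

```python
from collections import defaultdict

import pandas as pd

# hypothetical Genre column indexed by Title
toy = pd.Series(
    {'Film A': 'Drama,Music', 'Film B': 'Comedy,Musical', 'Film C': 'Drama'},
    name='Genre',
)

# visit each movie once and file its title under each of its genres
toy_genre_to_titles = defaultdict(list)
for title, genre_string in toy.items():
    for genre in genre_string.split(','):
        toy_genre_to_titles[genre.strip()].append(title)

print(dict(toy_genre_to_titles))
```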

In [237]:
# compute count per genre using the genre_to_titles dictionary
genres_to_count = {genre:len(titles) for genre,titles in genre_to_titles.items()}
genres_to_count

{'Western': 7,
 'Sport': 18,
 'Sci-Fi': 120,
 'Action': 303,
 'Biography': 81,
 'Crime': 150,
 'Adventure': 259,
 'Family': 51,
 'Music': 21,
 'Drama': 513,
 'Horror': 119,
 'War': 13,
 'Animation': 49,
 'Fantasy': 101,
 'Thriller': 195,
 'Mystery': 106,
 'Romance': 141,
 'Comedy': 279,
 'Musical': 5,
 'History': 29}

In [265]:
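# groupby(by=dict) expects a mapping from index label to group name;
# genre_to_titles maps the other way, so iterate over the dictionary directly
for name, titles in genre_to_titles.items():
    group = movies.loc[titles]
    print(name)
    print(group['Genre'])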

In [238]:
# find highest rated action movie on the fly
action_titles = movies['Genre'].apply(lambda x: 'Action' in x)
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,...,Genre_Horror,Genre_War,Genre_Animation,Genre_Fantasy,Genre_Thriller,Genre_Mystery,Genre_Romance,Genre_Comedy,Genre_Musical,Genre_History
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,...,False,False,False,False,False,False,False,False,False,False


In [239]:
# find highest rated action movie using Indicator Variable: Genre_Action
action_movies = movies[movies['Genre_Action']]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,...,Genre_Horror,Genre_War,Genre_Animation,Genre_Fantasy,Genre_Thriller,Genre_Mystery,Genre_Romance,Genre_Comedy,Genre_Musical,Genre_History
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,...,False,False,False,False,False,False,False,False,False,False


In [240]:
# find the highest rated action movie using dictionary
action_titles = genre_to_titles['Action']
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,...,Genre_Horror,Genre_War,Genre_Animation,Genre_Fantasy,Genre_Thriller,Genre_Mystery,Genre_Romance,Genre_Comedy,Genre_Musical,Genre_History
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,...,False,False,False,False,False,False,False,False,False,False


In [241]:
%%timeit
# find highest rated action movie on the fly
action_titles = movies['Genre'].apply(lambda x: 'Action' in x)
action_movies = movies[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

1.3 ms ± 6.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [242]:
%%timeit
# find highest rated action movie using precomputed Indicator Variable: Genre_Action
action_movies = movies[movies['Genre_Action']]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

990 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [243]:
%%timeit
# find the highest rated action movie using precomputed dictionary
action_titles = genre_to_titles['Action']
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

1.37 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
# find Sci-Fi movies
sci_fi_movies = movies[movies['Genre_Sci-Fi']]
sci_fi_movies.head(2)

In [None]:
# find movies that are both action and sci-fi
action_sci_fi_movies = movies[movies['Genre_Action'] & movies['Genre_Sci-Fi']]
action_sci_fi_movies.head(3)

In [None]:
# find highest rated movie that is both Action and Sci-Fi
movie_subset = movies[movies['Genre_Action'] & movies['Genre_Sci-Fi']]
movie_subset[movie_subset['Rating'] == movie_subset['Rating'].max()]

In [None]:
# find the avg rating of all action movies
action_movies = movies[movies['Genre_Action']]
action_movies['Rating'].mean()

In [None]:
# usually groupby is not used on a boolean series, but it works
# the group of interest is the True group
groupby = movies.groupby('Genre_Action')

for name, group in groupby:
    print(name, np.round(group['Rating'].mean(),4), group['Rating'].count())

### Group By
Find the highest rated directors who directed at least 5 movies

In [None]:
def group_filter(g):
    if g['Rating'].count() >= 5:
        return g['Rating'].mean()

In [None]:
s = movies.groupby('Director').apply(group_filter)
s[s.notnull()].nlargest(5)
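
The same query can be sketched with `agg`, which returns the mean and the count side by side so the threshold is applied as an ordinary filter. The frame and the `min_movies` threshold below are hypothetical; the notebook's query uses 5.

```python
import pandas as pd

# hypothetical directors and ratings
toy = pd.DataFrame({
    'Director': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Rating': [8.0, 6.0, 7.0, 9.0, 8.0, 9.5],
})
min_movies = 2  # illustrative threshold for this small example

# one row per director with the mean rating and the number of rated movies
stats = toy.groupby('Director')['Rating'].agg(['mean', 'count'])

# keep directors with enough movies, then take the best means
top = stats[stats['count'] >= min_movies].nlargest(5, 'mean')
print(top)
```

Director C has the highest rating but only one movie, so it is filtered out.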

## Data Wrangling: Queries involving Actor
As there are so many different actors, it's probably better to rely on dictionaries instead of creating a great many indicator columns.

In [333]:
def parse_genre_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    genres = s.split(',')
    return [genre.strip() for genre in genres]

In [334]:
# parsing of Genre is identical to parsing of Actors
movies['Actors'].apply(parse_genre_string).head(5)

Title
Guardians of the Galaxy    [Chris Pratt, Vin Diesel, Bradley Cooper, Zoe ...
Prometheus                 [Noomi Rapace, Logan Marshall-Green, Michael F...
Split                      [James McAvoy, Anya Taylor-Joy, Haley Lu Richa...
Sing                       [Matthew McConaughey, Reese Witherspoon, Seth ...
Suicide Squad              [Will Smith, Jared Leto, Margot Robbie, Viola ...
Name: Actors, dtype: object

In [335]:
# find the set of unique actors
actors = set()
title_actors_series = movies['Actors'].apply(parse_genre_string)
title_actors_series.apply(lambda x: actors.update(x))
len(actors)

1985

#### Using an Inverted Dictionary to Compute Movies per Actor

In [336]:
# create a mapping from actor to titles
from collections import defaultdict
actor_to_titles = defaultdict(list)
for actor in actors:
    titles = movies['Actors'].apply(lambda x: actor in x)
    titles = movies[titles].index.tolist()
    for title in titles:
        actor_to_titles[actor].append(title)

In [337]:
# compute count per actor using the actor_to_titles dictionary
actors_to_count = {actor:len(titles) for actor,titles in actor_to_titles.items()}
len(actors_to_count)

1985

In [338]:
# try inverted dictionary
movies.loc[actor_to_titles['Zooey Deschanel']]

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Trolls,24,"Animation,Adventure,Comedy","After the Bergens invade Troll Village, Poppy,...",Walt Dohrn,"Anna Kendrick, Justin Timberlake,Zooey Deschan...",2016,92,6.5,38552,153.69,56.0
(500) Days of Summer,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
The Happening,513,"Sci-Fi,Thriller","A science teacher, his wife, and a young girl ...",M. Night Shyamalan,"Mark Wahlberg, Zooey Deschanel, John Leguizamo...",2008,91,5.0,170897,64.51,34.0
Bridge to Terabithia,913,"Adventure,Drama,Family",A preteen's life turns upside down when he bef...,Gabor Csupo,"Josh Hutcherson, AnnaSophia Robb, Zooey Descha...",2007,96,7.2,117297,82.23,74.0


In [None]:
# find the average rating of actors in at least 5 movies
actor_rating = {}
for actor in actors:
    movies_with_actor = movies.loc[actor_to_titles[actor]]
    if movies_with_actor['Rating'].count() >= 5:
        actor_rating[actor] = movies_with_actor['Rating'].mean()

In [None]:
# read in the IMDB movie dataset
movies = pd.read_csv('../data/IMDB-Movie-Data.csv', index_col='Title')
movies.head(2)

A dictionary is more memory efficient than denormalizing the table, but denormalizing is much easier to work with!
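
A rough sketch of the denormalized approach: one row per (movie, actor) pair, after which an ordinary groupby answers per-actor questions. The titles, actors and ratings are hypothetical, and `DataFrame.explode` assumes pandas 0.25 or later.

```python
import pandas as pd

# hypothetical movies with comma separated actors
toy = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Actors': ['Ann Lee, Bob Ray', 'Bob Ray,Cy Dee', 'Ann Lee, Cy Dee'],
    'Rating': [8.0, 6.0, 7.0],
})

# split the string into a list, then emit one row per list element
long = (
    toy.assign(Actor=toy['Actors'].str.split(','))
    .explode('Actor')
    .reset_index(drop=True)
)
long['Actor'] = long['Actor'].str.strip()

# per-actor mean rating and movie count
per_actor = long.groupby('Actor')['Rating'].agg(['mean', 'count'])
print(per_actor)
```

The long table repeats every movie column once per actor, which is the memory cost mentioned above.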

In [372]:
def custom_groupby(df, mapping, values):
    for value in values:
        yield (value, df.loc[mapping[value]])

In [343]:
movies = movies_copy.copy()
movies = movies.reset_index(drop=False).set_index('Title', drop=True)

In [344]:
movies.head(1)

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0


In [357]:
np.std([1, 2, 3])

0.816496580927726

In [366]:
df = pd.DataFrame([1, 2, 3], columns=['A'])
df.apply(np.std)

A    0.816497
dtype: float64
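
Note that numpy and pandas use different defaults for the standard deviation: `np.std` divides by n (ddof=0), while the pandas `std()` method divides by n-1 (ddof=1). A short sketch of the difference:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3])

# numpy default: population standard deviation (ddof=0), about 0.8165
print(np.std(values.to_numpy()))

# pandas default: sample standard deviation (ddof=1)
print(values.std())  # 1.0

# passing ddof=0 makes pandas agree with numpy
print(values.std(ddof=0))
```

The `std()` calls on groups elsewhere in this notebook therefore report the sample standard deviation.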

In [367]:
df.columns

Index(['A'], dtype='object')

In [386]:
my_group = custom_groupby(movies, actor_to_titles, ['Zooey Deschanel', 'John Goodman'])

In [387]:
from IPython.display import display
for name, group in my_group:
    print(name, len(group), group['Rating'].mean(), group['Metascore'].std())
    print()
    display(group)

Zooey Deschanel 4 6.6 19.527758021169078



Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Trolls,24,"Animation,Adventure,Comedy","After the Bergens invade Troll Village, Poppy,...",Walt Dohrn,"Anna Kendrick, Justin Timberlake,Zooey Deschan...",2016,92,6.5,38552,153.69,56.0
(500) Days of Summer,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
The Happening,513,"Sci-Fi,Thriller","A science teacher, his wife, and a young girl ...",M. Night Shyamalan,"Mark Wahlberg, Zooey Deschanel, John Leguizamo...",2008,91,5.0,170897,64.51,34.0
Bridge to Terabithia,913,"Adventure,Drama,Family",A preteen's life turns upside down when he bef...,Gabor Csupo,"Josh Hutcherson, AnnaSophia Robb, Zooey Descha...",2007,96,7.2,117297,82.23,74.0


John Goodman 6 7.1499999999999995 10.684880283216403



Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Patriots Day,67,"Drama,History,Thriller",The story of the 2013 Boston Marathon bombing ...,Peter Berg,"Mark Wahlberg, Michelle Monaghan, J.K. Simmons...",2016,133,7.4,39784,31.86,69.0
10 Cloverfield Lane,119,"Drama,Horror,Mystery","After getting in a car accident, a woman is he...",Dan Trachtenberg,"John Goodman, Mary Elizabeth Winstead, John Ga...",2016,104,7.2,192968,71.9,76.0
Monsters University,413,"Animation,Adventure,Comedy",A look at the relationship between Mike and Su...,Dan Scanlon,"Billy Crystal, John Goodman, Steve Buscemi, He...",2013,104,7.3,252119,268.49,65.0
Argo,475,"Biography,Drama,History",Acting under the cover of a Hollywood producer...,Ben Affleck,"Ben Affleck, Bryan Cranston, John Goodman, Ala...",2012,120,7.7,481274,136.02,86.0
The Gambler,806,"Crime,Drama,Thriller",Lit professor and gambler Jim Bennett's debt c...,Rupert Wyatt,"Mark Wahlberg, Jessica Lange, John Goodman, Br...",2014,111,6.0,52537,33.63,55.0
Flight,846,"Drama,Thriller",An airline pilot saves almost all his passenge...,Robert Zemeckis,"Denzel Washington, Nadine Velazquez, Don Chead...",2012,138,7.3,276347,93.75,76.0


In [388]:
def gen1(n):
    a = 0
    b = 1
    for i in range(n):
        a, b = a+b, a
        yield a

In [282]:
g = gen1(7)

In [None]:
tuples = []
for actor in actors:
    for title in actor_to_titles[actor]:
        tuples.append((actor, title))
        
df_actor = pd.DataFrame(data=tuples, columns=['Actor','Title'])
df_actor.set_index('Actor', drop=True, inplace=True)

In [None]:
df_actor.head()

In [None]:
df = df_actor.join(movies, on='Title', how='left')
df.head()

In [None]:
len(df_actor)

In [None]:
len(movies.index)

In [None]:
len(set(movies.index))

In [None]:
m2 = movies.reset_index()
b = m2.duplicated(subset='Title', keep='first')
m2[b]

In [None]:
b = m2.duplicated(subset='Title', keep='last')
m2[b]

In [None]:
df2 = df.set_index(index)

In [None]:
actor_to_titles

In [None]:
df_actor.head()

In [None]:
tuples = []
for actor in actors:
    for title in actor_to_titles[actor]:
        tuples.append((actor, title))
        
index = pd.MultiIndex.from_tuples(tuples, names=['Actor', 'Title'])        

In [None]:
movies.set_index(index)

In [None]:
# it's easier to work with DataFrames than dictionaries
dfs = []
for actor in actors:
    # copy so that adding a column does not write to a slice of movies
    df = movies.loc[actor_to_titles[actor]].copy()
    df['Actor'] = actor
    df.set_index('Actor', drop=False, inplace=True)
    dfs.append(df)
df_new = pd.concat(dfs)
df_new.head()

In [None]:
df_actor_rating = pd.DataFrame.from_dict(actor_rating, orient='index')
# rename_axis returns a new object, so assign the result back
df_actor_rating = df_actor_rating.rename_axis('Actor')
df_actor_rating.columns = ['Rating']
df_actor_rating.nlargest(n=5, columns='Rating')

In [None]:
df_actor_rating.columns

In [None]:
df_actor_rating.index

In [None]:
# sort by value
actor_rating_sorted = sorted(actor_rating.items(), key=lambda x: x[1], reverse=True)
for actor, rating in actor_rating_sorted:
    print(f'{actor}: {rating:4.2f}')

In [None]:
# get the average rating and number of movies for each actor in at least 5 movies
actor_rating = {}
for actor in actors_to_title:
    titles = actors_to_title[actor]
    movies_with_actor = movie.loc[titles]
    count = movies_with_actor['Rating'].count()
    if count >= 5:
        actor_rating[actor] = movies_with_actor['Rating'].mean()

In [None]:
actor_rating_sorted = sorted(actor_rating.items(), key=lambda kv: kv[1], reverse=True)
actor_rating_sorted[:10]

In [None]:
movie.loc[actor_to_titles['Leonardo DiCaprio']]

In [None]:
# with dictionaries, it's slightly harder to query for two actors than with indicator columns
# find all movies with Leonardo DiCaprio and Matt Damon
titles1 = actors_to_title['Leonardo DiCaprio']
titles2 = actors_to_title['Matt Damon']
titles = set(titles1).intersection(set(titles2))
titles
# recent pandas versions do not accept a set as an indexer, so convert it to a list
movie.loc[list(titles)]

## Categorical Values
Categorical Values are also called enumerated types or factors.

Converting a column to categorical can save on memory, make the data processing clearer, and may help certain Machine Learning algorithms to do a better job.
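
The memory claim can be checked on a small example. The sketch below uses a made-up Series (not the IMDB data) with only three distinct strings and compares its footprint as `object` and as `category`; the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample: 3,000 rows drawn from only three distinct genre strings
genres = pd.Series(['Action', 'Comedy', 'Drama'] * 1000)

as_object = genres.astype('object')
as_category = genres.astype('category')

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it shrinks as the number of unique values approaches the number of rows.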

In [None]:
# read in IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')
movie = movie.set_index('Title')

In [None]:
# select the columns that are of type 'object'
movie_obj = movie.select_dtypes('object')
movie_obj.columns

In [None]:
# examine how many unique values there are for each such column
movie_obj.nunique()

In [None]:
# as it turns out, none of the columns in this dataset are good candidates for 
# being converted to a category

In [None]:
movie['Year'] = year
movie.info()

In [None]:
years

In [None]:
# There are too many unique values to create a category
# Although unusual, year could be converted to an ordered categorical variable
# (astype no longer accepts ordered=True directly; pass a CategoricalDtype instead)
year = movie['Year'].astype(pd.CategoricalDtype(ordered=True))

In [None]:
# Genre is a candidate for converting to a categorical field
movie_obj['Genre'] = movie_obj['Genre'].astype('category')

### Above Warning is Important
To save memory, Pandas does not copy objects or parts of objects by default.

movie_obj is a slice of movie, so an assignment made to movie_obj may also be made to movie.

Although Pandas correctly figures out the intent here, this type of warning should never occur in well-written code.
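
A minimal sketch of the safe pattern, assuming a made-up two-row frame rather than the IMDB data: take an explicit `.copy()` of the subset, then assign to the copy. The names `frame` and `subset` are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the movie data
frame = pd.DataFrame({'Genre': ['Horror,Thriller', 'Comedy'], 'Rating': [7.3, 5.3]})

# .copy() makes the subset an independent object, so assigning to it is unambiguous
subset = frame[['Genre']].copy()
subset['Genre'] = subset['Genre'].astype('category')

print(subset['Genre'].dtype)  # category
print(frame['Genre'].dtype)   # the original frame is unchanged
```

Because the copy owns its data, the assignment cannot leak back into the original frame, and there is nothing for Pandas to warn about.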

In [None]:
# reread the IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [None]:
# select the columns that are of type 'object', AS A COPY
movie_obj = movie.select_dtypes('object').copy()
movie_obj.columns

In [None]:
# Convert to categorical, note there is no warning this time
movie_obj['Genre'] = movie_obj['Genre'].astype('category')

In [None]:
movie.info()

In [None]:
# overwrite the Genre column in the movie dataframe
movie['Genre'] = movie_obj['Genre']
movie.info()

In [None]:
genre = movie['Genre']
genre is movie['Genre']

In [None]:
genre

In [None]:
# When the variable is categorical, the imputed value might be the mode

In [None]:
# which director had the highest average rating and directed at least 3 movies
groupby = movie.groupby('Director')

# boolean Series: True for directors with at least 3 rated movies
groupby['Rating'].count() >= 3

In [None]:
# iterating over a groupby object yields (name, sub-DataFrame) pairs
for name, group in groupby:
    print(name)
    print(len(group))

In [None]:
dir(groupby)

## Start of Actors Section

### Actors Column
The Actors column contains a single string per cell, but each string consists of a comma-separated list of actors.

Whenever a single cell contains more than 1 value, or a string is used to represent more than 1 value, special processing is required.

#### Quick Review of defaultdict(list)
A dictionary is a mapping between a key and a value.

The value can be a collection.

defaultdict(list): says that each value defaults to an empty list

In [None]:
# Book analogy:
from collections import defaultdict
d = defaultdict(list)
d['p1'].append("word 1")
d['p1'].append("word 2")
d['p1'].append("word 3")
d['p2'].append("word 2")
d['p2'].append("word 4")
d

In [None]:
# if the value is a collection, the entire collection is accessed by key
d['p1']

In [None]:
# Create an inverted index (values to keys)
# The index in the back of a book is an "inverted index"
inverted_dict = defaultdict(list)
for key, values in d.items():
    for value in values:
        inverted_dict[value].append(key)
        
inverted_dict

In [None]:
inverted_dict['word 2']

#### Convert Series to dict
The key is the index of the series.
The value is the series value.

In [None]:
# Map Title to Actor string
title_to_actor_string = movie['Actors'].to_dict()

In [None]:
title_to_actor_string['10 Cloverfield Lane']

In [None]:
# convert this string to a list of actor names
title_to_actors = defaultdict(list)
for title, actor_string in title_to_actor_string.items():
    for actor in actor_string.split(','):
        title_to_actors[title].append(actor.strip())

In [None]:
title_to_actors['10 Cloverfield Lane']

#### Create Inverted dict (Actors -> Titles)

In [None]:
# invert the index
actors_to_title = defaultdict(list)
for title, actors in title_to_actors.items():
    for actor in actors:
        actors_to_title[actor].append(title)

In [None]:
actors_to_title['John Goodman']

## END of ACTORS Sections

In [None]:
# get 1 column of values
votes = movie['Votes']

# first 5 votes in the column
votes[:5]

In [None]:
print(f'type of votes: {type(votes)}')
print(f'type of votes values: {votes.dtype}')

### DataFrame

In [None]:
# Get overall information about a DataFrame
movie.info()

In [None]:
# number of records
len(movie)

In [None]:
# number of non-null movie titles
movie['Title'].count()

In [None]:
# number of null movie titles
# True counts as 1, False counts as 0, sum up the number of True values
movie['Title'].isna().sum()

In [None]:
metascores = movie['Metascore']
print(f'Number of non-null Metascores: {metascores.count()}')
print(f'Number of null Metascores: {metascores.isnull().sum()}')
print(f'Number of Movies: {len(movie)}')

# number of non-null Metascores + number of null Metascores = number of records
metascores.count() + metascores.isna().sum() == len(movie)

In [None]:
# the datatypes of each of the movie columns
movie.dtypes

In [None]:
# distribution of datatypes
movie.dtypes.value_counts()

## Selecting Values in Row(s) or Column(s) or Both

In [None]:
# Column Selection
cols_to_select = ['Title', 'Rating']

# verify every column to be selected is in movie.columns
columns_all = set(movie.columns)
columns_subset = set(cols_to_select)
columns_subset.issubset(columns_all)

In [None]:
# select these columns (first 3 rows)
movie[cols_to_select].head(3)

In [None]:
# select these columns (first 3 rows), alternative syntax
movie[cols_to_select][:3]

In [None]:
# add a column that isn't there
cols_to_select2 = ['Title', 'Rating', 'Not_There']
try:
    movie[cols_to_select2]
except KeyError as err:
    print(err)

### Note on Above Error Message

The error message says that the column name is not in the "index".  As the column labels are themselves an index, this means the column name is not in the index of column labels.

This does **not** mean that 'Not_There' is not in the df.index.  df.index refers specifically to the index of row labels.  
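
The distinction between the two indexes can be seen with membership tests. This is a small sketch on a hypothetical two-row frame, not the notebook's `movie` DataFrame.

```python
import pandas as pd

# Hypothetical frame with the default integer row index
df = pd.DataFrame({'Title': ['Split', 'Sing'], 'Rating': [7.3, 7.2]})

print('Rating' in df.columns)  # True: 'Rating' is a column label
print('Rating' in df.index)    # False: the row labels are 0 and 1
print(0 in df.index)           # True
```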

### Pandas Infers Whether: df\[cols_to_select\] Selects Columns or Rows

Often the inference allows for convenient use of the bracket operator.  
Sometimes it is clearer to explicitly state whether you are searching for row IDs or column IDs or both.

If you want to specifically state that you want the rows with IDs 0-2 and the columns with labels 'cols_to_select', use **.loc\[\]**

In [None]:
# Row ID selection: with .loc the slice is inclusive, so :2 returns rows 0, 1 and 2
# Col ID selection: a list of labels is provided
movie.loc[:2,cols_to_select]
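
The inclusive slice is specific to label-based selection. As a quick comparison on a hypothetical five-row frame, the same `:2` slice returns three rows with `.loc` and two rows with `.iloc`:

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex (row labels 0..4)
df = pd.DataFrame({'Rating': [8.1, 7.0, 7.3, 7.2, 6.2]})

by_label = df.loc[:2]      # label slice: rows labelled 0, 1 and 2
by_position = df.iloc[:2]  # position slice: rows at positions 0 and 1

print(len(by_label), len(by_position))
```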

### Boolean Series Row Filtering

Comparing values in one column with a relational operator results in a Series having the same length as the column, and an index being identical to the column's index.

In [None]:
# get a boolean series for a relational operation
good_rating = movie['Rating'] > 8.6
print(type(good_rating)) # Series
print(good_rating.dtype) # bool
print(len(good_rating) == len(movie)) # same length

In [None]:
# show first 3 values of this series
good_rating[:3]

#### Boolean Series Index must match DataFrame index

When a boolean series is used to filter rows, its index must "match" the DataFrame's index.  This will always be the case when the boolean series was created with a relational operator applied to the values of one of the DataFrame's columns.

In [None]:
movie.index

In [None]:
good_rating.index

In [None]:
# all index values (aka row IDs) are the same
(movie.index == good_rating.index).all()

In [None]:
# how many movies meet the good_rating criteria?
# True is 1, False is 0, the sum of the True values is the number of True values
good_rating.sum()

In [None]:
movie[good_rating]

### Bracket Notation **df\[filter\]**

Pandas can infer:
1. columns are to be selected if the input is a list of values which are a subset of the column labels
2. rows are to be selected if the input is a boolean series having a matching index

In [None]:
# filtering by column, analogous to SQL SELECT
cols_to_select = ['Title', 'Rating']
movie[cols_to_select].head(3)

In [None]:
# filtering by row, analogous to SQL WHERE
movie[good_rating].head(3)

In [None]:
# To not clutter the notebook output, warnings are ignored
# Normally warnings should not be ignored
import warnings
warnings.filterwarnings('ignore')

# create a non-matching boolean_series
good_rating2 = good_rating[:-1]
try:
    movie[good_rating2]
except pd.core.indexing.IndexingError as err:
    print(err)

In [None]:
cols_to_select = ['Title', 'Rating']
good_rating = movie['Rating'] > 8.6

# filter both columns and rows
movie[cols_to_select][good_rating]

In [None]:
# filter on both rows and columns, reverse order
movie[good_rating][cols_to_select]

In [None]:
# clearer code: Pandas is not inferring what is being selected
movie.loc[good_rating, cols_to_select]

# Example DataFrame Queries

In [None]:
# rename the columns to be easier to work with
movie.columns

In [None]:
cols = ['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue',
       'Metascore']
movie.columns = cols
movie.columns

In [None]:
# sort the rows by index (movie titles)
movie = movie.sort_index()

In [None]:
# select last two rows, first 3 columns
movie.iloc[-2:, :3]

In [None]:
# select movies longer than 150 minutes with a rating higher than 8.3
row_filter = (movie['Runtime'] > 150) & (movie['Rating'] > 8.3)

# select Genre, Runtime and Rating
cols = ['Genre','Runtime','Rating']

movie.loc[row_filter, cols].head(3)

## Handle Multiple Values in Actors Column

The Actors column is not a collection per se, although it is used like one.  Each value in the Actors column is a single value, which happens to be a string that contains one or more actor names.
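
One pandas-native way to turn such strings into one value per row is `str.split` followed by `explode`. This is an alternative to the dictionary approach used in this notebook, sketched on a hypothetical three-row table with placeholder titles.

```python
import pandas as pd

# Hypothetical miniature of the movie table, indexed by title
sample = pd.DataFrame(
    {'Actors': ['John Goodman, Mary Elizabeth Winstead',
                'Ben Affleck, John Goodman',
                'Ben Affleck']},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='Title'),
)

# Split each string into a list, give every actor its own row, trim whitespace
actor_rows = sample['Actors'].str.split(',').explode().str.strip()

# Number of titles per actor
movie_counts = actor_rows.value_counts()
print(movie_counts)
```

The exploded Series keeps the title as its index, so each row is a (title, actor) pair that can be counted or grouped directly.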

In [None]:
# Find all movies with Zooey Deschanel in this dataset
titles_filter = movie['Actors'].apply(lambda x: 'Zooey Deschanel' in x)

In [None]:
movie[titles_filter]

In [None]:
# It's a bit awkward to hardcode a string inside the lambda expression
# pd.Series.apply() calls the function with one argument: each value in the Series
# To bind the actor name as well, use a closure
def make_actor_filter(actor):
    def my_filter(x):
        return actor in x
    return my_filter

In [None]:
zooey_filter = make_actor_filter('Zooey Deschanel')
titles_filter = movie['Actors'].apply(zooey_filter)
movie[titles_filter]
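
Two alternatives to the closure are worth knowing, shown here as a sketch on a hypothetical Series: `Series.apply` forwards extra positional arguments passed through `args=`, and `Series.str.contains` performs the substring test without a Python-level function. The helper `has_actor` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical Actors column
actors = pd.Series(['Zooey Deschanel, Mark Wahlberg', 'Hugh Jackman', 'Zooey Deschanel'])

def has_actor(cell, actor):
    """Return True when the actor name occurs in the cell string."""
    return actor in cell

# Series.apply forwards extra positional arguments given through args=
mask_apply = actors.apply(has_actor, args=('Zooey Deschanel',))

# Vectorized string method; regex=False treats the name as a literal substring
mask_contains = actors.str.contains('Zooey Deschanel', regex=False)

print(mask_apply.tolist())
print(mask_contains.tolist())
```

Both masks are substring tests, so like the closure they would also match a longer name that happens to contain the search string.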

In [None]:
# to find out which actors appeared in the most movies, create a dictionary
# and invert it
title_to_actor_string = movie['Actors'].to_dict()
title_to_actor_string['10 Cloverfield Lane']

In [None]:
# convert this string to a list of actor names
title_to_actors = defaultdict(list)
for title, actor_string in title_to_actor_string.items():
    for actor in actor_string.split(','):
        title_to_actors[title].append(actor.strip())

In [None]:
title_to_actors['10 Cloverfield Lane']

In [None]:
# invert the index
actors_to_title = defaultdict(list)
for title, actors in title_to_actors.items():
    for actor in actors:
        actors_to_title[actor].append(title)

In [None]:
actors_to_title['Zooey Deschanel']

In [None]:
# with a dictionary of actor to movies, it's easy to get the count of movies per actor
movie_count_by_actor = {key:len(value) for key, value in actors_to_title.items()}

In [None]:
movie_count_by_actor['Zooey Deschanel']

In [None]:
# find which actors were in the most movies
# recall that sorted takes an optional user specified key
sorted_by_count = sorted(movie_count_by_actor.items(), key=lambda kv: kv[1], reverse=True)
sorted_by_count[:3]

In [None]:
# find Hugh Jackman movies longer than 140 minutes
titles = actors_to_title['Hugh Jackman']
hugh_movies = movie.loc[titles]
long_hugh_movies = hugh_movies[hugh_movies['Runtime'] > 140]
long_hugh_movies

## Find which actor was in movies with the highest ratings

This won't work because the groups are not mutually exclusive...

Ensure actor is in at least two movies.
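
One way around the overlap, sketched here on a hypothetical three-movie table, is to reshape the data so that each row is a single (movie, actor) pair. Every row then belongs to exactly one actor, and an ordinary `groupby` works. The column name `Actor` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the movie table
sample = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Actors': ['Actor X, Actor Y', 'Actor X', 'Actor Y, Actor Z'],
    'Rating': [8.0, 6.0, 7.0],
})

# One row per (movie, actor) pair, so each row belongs to exactly one actor group
pairs = (sample.assign(Actor=sample['Actors'].str.split(','))
               .explode('Actor')
               .reset_index(drop=True))
pairs['Actor'] = pairs['Actor'].str.strip()

# Mean rating and movie count per actor, keeping actors with at least two movies
stats = pairs.groupby('Actor')['Rating'].agg(['mean', 'count'])
stats = stats[stats['count'] >= 2].sort_values('mean', ascending=False)
print(stats)
```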

In [None]:
group_by = movie.groupby(actors_to_title.keys())
# group_by_having_2plus = group_by # group_by.filter(lambda x: len(x) > 1)

In [None]:
for group in group_by_having_2plus:
    print(group)

In [None]:
group_by.agg(np.mean)

In [None]:
# number of movies that are greater than 120 minutes
(movie['duration'] > 120).sum()

In [None]:
# percent of movies greater than 120 minutes long
(movie['duration'] > 120).mean()

In [None]:
# normalized true/false value_counts
(movie['duration'] > 120).value_counts(normalize=True)

In [None]:
movie[title_filter]

In [None]:
bale_movies = movie.loc[titles]
long_bale_movies = bale_movies[bale_movies['Runtime'] > 140]
long_bale_movies

In [None]:
# apply needs 1 argument
# use closure to bind 1 argument
def make_actor_filter(actor):
    def my_filter(x):
        return actor in x
    return my_filter

In [None]:
filter_f = make_actor_filter('Christian Bale')

In [None]:
filter1 = movie['Actors'].apply(filter_f)
filter2 = movie['Runtime'] > 140
movie[filter1 & filter2]

In [None]:
# boolean Series, aligned with movie.index, that is True where the title is in titles
title_row_filter = pd.Series(movie.index.isin(titles), index=movie.index)

In [None]:
movie.index

In [None]:
movie.index.take(titles)

In [None]:
type(title_row_filter)

In [None]:
type(row_filter)

In [None]:
title_row_filter & row_filter

In [None]:
long_movies = movie[row_filter]
long_bale_movies = long_movies.loc[title_row_filter]
long_bale_movies

In [None]:
# deal with multiple actors in one field
from collections import defaultdict
actors = defaultdict(list)

# change to map
for i in movie.index:
    print(movie.loc[i].Title)
    print(movie.loc[i, 'Actors'].split(','))
    if i > 5:
        break

In [None]:
actor_series = movie['Actors'].map(lambda x: x.split(','))
# actors = movies['Actors'].map(lambda x: x.split(',')).to_dict(defaultdict(list))

In [None]:
actor_series.head()

In [None]:
actors = actor_series.to_dict(defaultdict(list))

In [None]:
actors

In [None]:
# inverted index
movie_actors = defaultdict(list)
for key, values in actors.items():
    for value in values:
        movie_actors[value].append(key)

In [None]:
movie_actors['Noomi Rapace']

In [None]:
movie.loc[movie_actors['Noomi Rapace']]

### Notes
1. dtype: object -- most often used for strings but may indicate a column having different data types, or a list, dict, etc.
2. dtype: category -- like R's factor.  Used for a small number of possibly ordered enumerated types.

In [None]:
# sorted will automatically convert the columns to a list, and then sort them
cols = sorted(movie.columns)
cols

In [None]:
# count returns the number of non-null (i.e. non-NaN) values
actor = movie['Actors']
actor.count()

In [None]:
movie.shape

In [None]:
len(actor)

In [None]:
actor.count() + actor.isnull().sum()

In [None]:
# describe stats depend upon datatype
movie['Revenue (Millions)'].describe()

In [None]:
movie['actor_1_name'].describe()

In [None]:
# each (computed) series has the same index as the dataframe it came from
(movie['actor_1_name'].isnull().index == movie.index).all()

In [None]:
((movie['actor_1_name'] == 'Robert De Niro').index == movie.index).all()

In [None]:
m = movie.set_index('movie_title')
m.head()

In [None]:
m['cast_total_facebook_likes'].sort_values(ascending=False).head()

In [None]:
actor_1_fb_likes = movie['actor_1_facebook_likes']
actor_1_fb_likes.fillna(0).astype(int).head()

In [None]:
# be careful when computing truthiness of Series that has nan!
s = pd.Series([np.nan] * 3)
print(s)
s.all()

In [None]:
# with skipna=True the NaNs are skipped, and all() over no remaining values is True
pd.Series([np.nan] * 3).all(skipna=True)

In [None]:
s1 = pd.Series([np.nan] * 3)
# Series.append was removed in pandas 2.0; pd.concat does the same job
s1 = pd.concat([s1, pd.Series([0.0])])
s1

In [None]:
s1.all(skipna=True)

In [None]:
s1.any(skipna=True)

In [None]:
# a single column is a Series
movie['movie_title'].head()

In [None]:
# a single column as part of a list, is a DataFrame
movie[['movie_title']].head()

In [None]:
# note: get_dtype_counts() was removed in pandas 1.0; on newer versions use the next cell
movie.get_dtype_counts()

In [None]:
# equivalent to preceding
movie.dtypes.value_counts()

In [None]:
# select column by type
# number includes float and int
movie.select_dtypes(include=['number']).head()

In [None]:
# select column by name
# use a regex that matches column names containing one or more digits
movie.filter(regex=r'\d+').head()

In [None]:
movie.filter(regex='^act').head()

In [None]:
# filter does not throw key error, but indexing operator does
movie.filter(items=['no_key_error', 'actor_1_name']).head()

In [None]:
movie.isnull().sum()

In [None]:
movie.isnull().sum().sum()

In [None]:
# DataFrames work the same as Series with respect to '==' and 'equals'
s1 = pd.Series([1, 2, 3, np.nan])
s2 = pd.Series([1, 2, 3, np.nan])
print(f's1 is s2: {s1 is s2}')
print(f'All values are same given np.nan != np.nan: {(s1 == s2).all()}')

In [None]:
# but .equals DOES consider np.nan to equal np.nan!
s1.equals(s2)

In [None]:
# shape counts every row and column, including null values
ugds.shape

In [None]:
# count number of non-null values
ugds.count()

In [None]:
ugds.count(axis='columns').value_counts()

In [None]:
# isclose is helpful
np.isclose(1.000, 0.9999, atol=1e-4)

In [None]:
# the sum across each row is nearly 0 or 1
a = ugds.sum(axis=1).apply(lambda x: np.isclose(x, 1.0, atol=3e-4) or np.isclose(x, 0.0, atol=3e-4))
a.all()

In [None]:
# from the above, either the row has all the values, or it has none of the values
# drop rows that are missing all of the values
ugds.dropna(how='all').count()

In [None]:
ugds_clean = ugds.dropna(how='all')
ugds_clean.count()

In [None]:
ugds_clean.isnull().sum().sum()

In [None]:
# create a diversity metric and apply it
ugds_clean.ge(.15).head()

In [None]:
# count the number of columns that are greater than 0.15 for each row
ugds_clean.ge(.15).sum(axis=1).sort_values(ascending=False).head()

## Data Analysis

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display
pd.options.display.max_columns = 50

In [None]:
# convert the 0,1 value to a single byte integer
col2['RELAFFIL'] = col2['RELAFFIL'].astype(np.int8)
col2.memory_usage(deep=True)

In [None]:
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()

In [None]:
best_score = movie2.nlargest(100,'imdb_score')
best_score.shape

In [None]:
best_value = best_score.nsmallest(5,'budget')
best_value

In [None]:
movie2.nlargest(100,'imdb_score').nsmallest(5,'budget')

In [None]:
movie3 = movie[['movie_title', 'title_year', 'imdb_score']]
movie3.head()

In [None]:
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).head(5)

In [None]:
# sort by title_year and imdb_score, then drop duplicate title_year to get the best title per year
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).drop_duplicates('title_year').head()

In [None]:
# create initial sort order via score
m_score = movie3.sort_values(['imdb_score'], ascending=False)

# group by year
m_score = m_score.groupby(['title_year'])

# first row in each group is highest score
m_score.first().tail()

In [None]:
# as one liner
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).groupby('title_year').first().tail()

## Selection

In [None]:
college = pd.read_csv(college_path[0], index_col='INSTNM')
city = college['CITY']
city.value_counts().head()

In [None]:
# slice only works if index is sorted
try:
    df = college['Spa':'Spr']['CITY']
    print(df)
except KeyError as err:
    print(f'KeyError: {err}')

In [None]:
college.sort_index(inplace=True)

In [None]:
try:
    df = college['Sp':'Su']
    print(df.index)
except KeyError as err:
    print(err)

## Boolean Indexing

In [None]:
movie = pd.read_csv(movie_path[0], index_col='movie_title')
movie.head()

In [None]:
# number of movies that are greater than 120 minutes
(movie['duration'] > 120).sum()

In [None]:
# percent of movies greater than 120 minutes long
(movie['duration'] > 120).mean()

In [None]:
# normalized true/false value_counts
(movie['duration'] > 120).value_counts(normalize=True)

In [None]:
len(movie['duration'])

In [None]:
movie['duration'].count()

In [None]:
movie['duration'].isnull().sum()

In [None]:
# drop na's before computing the mean -- no difference
movie['duration'].dropna().gt(120).mean()

In [None]:
movie['duration'].gt(120).sum() / movie['duration'].count()
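
Whether dropping NaN changes the result depends on whether the column has missing values. A comparison with NaN is `False`, so a missing row still counts in the denominator of `.mean()`. The sketch below uses a hypothetical four-value Series to show the two results diverging.

```python
import numpy as np
import pandas as pd

# Hypothetical durations with one missing value
duration = pd.Series([100, 130, np.nan, 150])

# NaN > 120 evaluates to False, so the missing row still counts in the denominator
share_all_rows = duration.gt(120).mean()             # 2 of 4 rows

# Dropping NaN first divides by the non-null count instead
share_known_rows = duration.dropna().gt(120).mean()  # 2 of 3 rows

print(share_all_rows, share_known_rows)
```

The second form matches `gt(120).sum() / count()`, since `count()` also ignores missing values.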

In [None]:
actors = movie[['actor_1_facebook_likes', 'actor_2_facebook_likes']].dropna()

In [None]:
(actors.iloc[:,0] > actors.iloc[:,1]).mean()

In [None]:
# numpy semantics for boolean arrays same as semantics for Pandas Boolean Series
a = np.array([True, False])
b = np.array([True, True])
a & b

In [None]:
a | b

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
college_stabbr_sorted = college_stabbr.sort_index()
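# is_monotonic was removed in pandas 2.0; is_monotonic_increasing is the equivalent check
college_stabbr_sorted.index.is_monotonic_increasing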

In [None]:
# boolean series for row selection
%timeit college[college['STABBR'] == 'TX']

In [None]:
# unsorted index for row selection
%timeit college_stabbr.loc['TX']

In [None]:
# sorted index for row selection
%timeit college_stabbr_sorted.loc['TX']

In [None]:
# boolean series for row selection
states = ['TX', 'CA', 'NY']
%timeit college[college['STABBR'].isin(states)]

In [None]:
# unsorted index for row selection
%timeit college_stabbr.loc[states]

In [None]:
# sorted index for row selection
%timeit college_stabbr_sorted.loc[states]

In [None]:
# try unsorted and randomized (rather than order read in from file)
df = college.sample(len(college))

# boolean series for row selection -- rows in random order
states = ['TX', 'CA', 'NY']
%timeit df[df['STABBR'].isin(states)]

In [None]:
# verify results are the same
college = pd.read_csv(college_path[0])
college_stabbr_sorted = college.set_index('STABBR').sort_index()

r1 = college[college['STABBR'].isin(states)]
r2 = college_stabbr_sorted.loc[states]

### DataFrame equals
1. indexes must be equal and in the same order
2. columns must be equal and in the same order
3. values must be the same, np.nan compares equal to np.nan
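
These rules can be illustrated on two small hypothetical frames that hold the same data with the columns in a different order. The sketch also contrasts `equals` with element-wise `==`, which treats NaN as unequal to itself.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames with the same data but different column order
left = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
right = pd.DataFrame({'b': [3.0, 4.0], 'a': [1.0, np.nan]})

print(left.equals(right))                # False: column order differs
print(left.equals(right[left.columns]))  # True: same order, NaN matches NaN

# Element-wise == treats NaN as unequal to itself
print((left == right[left.columns]).all().all())  # False
```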

In [None]:
# arrange columns
cols = r1.columns.sort_values()
r1 = r1[cols]
r2 = r2[cols]
(r1.columns == r2.columns).all()

In [None]:
# INSTNM is a unique key
college['INSTNM'].value_counts().max()

In [None]:
# another way to check for uniqueness
college['INSTNM'].nunique() == college['INSTNM'].count()

In [None]:
# and the preferred way to see if the values in a series are unique
college['INSTNM'].is_unique

## Index Alignment

In [None]:
college_path = !find .. -name college.csv
college_path = college_path.sort()
college_path[0]

# read into a DataFrame; no index_col is given, so INSTNM remains a regular column
college = pd.read_csv(college_path[0])
college.head()

In [None]:
college.columns

In [None]:
college.columns.values

In [None]:
columns = college.columns
columns[1:6:2]

In [None]:
columns.min(), columns.max(), columns.isnull().sum()

In [None]:
columns + '_A'

In [None]:
s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s1

In [None]:
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
s2

In [None]:
# adding Series whose indexes differ and contain duplicate labels creates a Cartesian product per label
s = s1 + s2
s

In [None]:
len(s1), len(s2), len(s)

In [None]:
# a a a b
# 0 1 2 3

# c a b a b b
# 0 1 2 3 4 5

# a's
# 0+1, 0+3, 1+1, 1+3, 2+1, 2+3
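
The same counting can be done for every label. As a self-contained sketch that rebuilds the two Series from the cells above, the result should have 3 x 2 rows for 'a', 1 x 3 rows for 'b', and one NaN row for 'c', which appears only in the second Series.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
s = s1 + s2

# 'a': 3 rows in s1 x 2 rows in s2 = 6 sums
# 'b': 1 row  in s1 x 3 rows in s2 = 3 sums
# 'c': only in s2, so it contributes 1 NaN row
print(len(s))
print(sorted(s['a']))
print(sorted(s['b']))
```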

In [None]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(10,15))
s1 + s2

In [None]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(10,15))
s1 + s2

In [None]:
employee = pd.read_csv(emp_path[0], index_col='RACE')
employee.head()

In [None]:
# view is created, not copy
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']

In [None]:
# same information is referred to
salary1 is salary2

In [None]:
salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2

In [None]:
salary1 = salary1.sort_index()
salary1.head()

In [None]:
# values are in a different order
# they don't line up, so a Cartesian Product occurs
salary2.head()

In [None]:
salary_add = salary1 + salary2
salary_add.head()

In [None]:
salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)

In [None]:
b14 = !find .. -name baseball14.csv
b15 = !find .. -name baseball15.csv
b16 = !find .. -name baseball16.csv

In [None]:
baseball_14 = pd.read_csv(b14[0], index_col='playerID')
baseball_15 = pd.read_csv(b15[0], index_col='playerID')
baseball_16 = pd.read_csv(b16[0], index_col='playerID')
baseball_14.head()

In [None]:
# values in 14 that are not in 15
baseball_14.index.difference(baseball_15.index)

In [None]:
# values in 14 that are not in 16
baseball_14.index.difference(baseball_16.index)

In [None]:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

In [None]:
(hits_14 + hits_15).head()

In [None]:
hits_14.add(hits_15, fill_value=0).head()
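
The effect of `fill_value` is easiest to see on a tiny example. The sketch below uses two hypothetical seasons with made-up player IDs, where only one player appears in both.

```python
import pandas as pd

# Hypothetical hit totals for two seasons; only 'p2' played in both
season_1 = pd.Series({'p1': 10, 'p2': 20})
season_2 = pd.Series({'p2': 5, 'p3': 7})

plain_sum = season_1 + season_2                    # p1 and p3 become NaN
filled_sum = season_1.add(season_2, fill_value=0)  # the missing side is treated as 0

print(plain_sum)
print(filled_sum)
```

`fill_value` only substitutes for a label missing from one side; a label whose value is missing in both Series would still produce NaN.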

In [None]:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()

In [None]:
hits_total.hasnans

In [None]:
len(hits_14), len(hits_15), len(hits_16), len(hits_total)

In [None]:
# take union of indexes as sets
ss = set(hits_14.index) | set(hits_15.index) | set(hits_16.index)
len(ss)

In [None]:
set(hits_total.index) == ss

In [None]:
# let's check the above
columns = college_n2.columns
max_indexes = []
for col in columns:
    df = college[college[col] == college[col].max()]
    max_indexes.append((col, df.index.values[0]))
    
max_indexes

In [None]:
df = pd.DataFrame(max_indexes, columns='column INSTNM'.split())
df = df.set_index('column')
series = df['INSTNM']
# clear the index name (del on the name attribute fails in recent pandas)
series.index.name = None
series

In [None]:
# it worked!
series.equals(max_cols)

In [None]:
# better way to convert a list of tuples to a series with an index
instnm = [x[1] for x in max_indexes]
index = [x[0] for x in max_indexes]
s = pd.Series(data=instnm, index=index)
s

In [None]:
%%timeit
columns = college_n2.columns
max_indexes = []
for col in columns:
    df = college[college[col] == college[col].max()]
    max_indexes.append((col, df.index.values[0]))

In [None]:
# this is 10 times faster
%timeit college_n2.idxmax()

In [None]:
%%timeit
# above appears to be O(n^2) to find max and then find row with max
# this should be much faster -- it is!
columns = college_n2.columns
max_indexes = []
for col in columns:
    max_indexes.append((col, college[col].idxmax()))
    
instnm = [x[1] for x in max_indexes]
index = [x[0] for x in max_indexes]
s = pd.Series(data=instnm, index=index)

### Examining the groupby object

In [None]:
grouped = college.groupby(['STABBR', 'RELAFFIL'])
type(grouped)

In [None]:
print([attr for attr in dir(grouped) if not attr.startswith('_')])

In [None]:
grouped.ngroups

In [None]:
groups = list(grouped.groups.keys())
groups[:6]

In [None]:
grouped.get_group(('FL', 1)).head()

In [None]:
from IPython.display import display
i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break

In [None]:
grouped.head(2).head(6)

### Grouping by continuous variables

In [None]:
flights = pd.read_csv('data/flights.csv')
flights.head()

In [None]:
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(flights['DIST'], bins=bins)
cuts.head()

In [None]:
cuts.value_counts()

In [None]:
flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True).round(3).head(40)
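
To see how `pd.cut` assigns values to the bins, the sketch below applies the same bin edges to a few hypothetical distances. The intervals are closed on the right by default, so a value equal to an edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical flight distances
dist = pd.Series([100, 200, 450, 1200, 2500])

bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(dist, bins=bins)

# Intervals are closed on the right by default, so 200 lands in (-inf, 200]
print(cuts)

# Position of each value's bin within the ordered list of categories
print(cuts.cat.codes.tolist())
```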

In [None]:
# although unusual and unnecessary, year could be considered an ordered categorical type
from pandas.api.types import CategoricalDtype
years = movie['Year'].unique()
years.sort()
dtype = CategoricalDtype(categories=years, ordered=True)
dtype

In [None]:
year = movie['Year'].astype(dtype)
year