# Pandas Introduction Part 2

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

This notebook has examples which illustrate the basics of Pandas.

Another notebook TODO Link will discuss Pandas in more depth.

In [1]:
import pandas as pd
import numpy as np

## Movie Data Set

In [58]:
# read in the IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [28]:
# display first 2 rows of the DataFrame
movie.head(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [4]:
# display last 2 rows
movie.tail(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


In [5]:
movie.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [6]:
movie.index

RangeIndex(start=0, stop=1000, step=1)

In [7]:
movie.iloc[0:3, 0:3]

Unnamed: 0,Rank,Title,Genre
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi"
1,2,Prometheus,"Adventure,Mystery,Sci-Fi"
2,3,Split,"Horror,Thriller"


In [8]:
# use Title as the index instead of the default RangeIndex;
# drop=True removes Title from the regular columns
movie = movie.set_index('Title', drop=True)
movie.iloc[0:3, 0:3]

Unnamed: 0_level_0,Rank,Genre,Description
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te..."
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...


In [9]:
# rename the columns to be shorter by removing the units
movie.columns = ['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue',
       'Metascore']

### Simple Data Queries

In [10]:
# find the longest and shortest movies on the list
movie['Runtime'].min(), movie['Runtime'].max()

(66, 191)

In [11]:
# find stats about Rating
movie['Rating'].describe()

count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64

In [12]:
# find stats about Runtime
movie['Runtime'].describe()

count    1000.000000
mean      113.172000
std        18.810908
min        66.000000
25%       100.000000
50%       111.000000
75%       123.000000
max       191.000000
Name: Runtime, dtype: float64

In [13]:
# let's verify the above stat for the 75th percentile
# expect to see about 25% of the records above 123
criteria = movie['Runtime'] > 123

# True is 1, False is 0, sum() therefore counts the number of True values in a boolean series
top_percentile = criteria.sum() / len(criteria)
print(f'{top_percentile*100:3.2f} percent of movies are longer than 123 minutes')

24.60 percent of movies are longer than 123 minutes
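
The same check can be written without hard-coding 123. The following is a minimal sketch on made-up data (the Series `runtimes` is an assumption for illustration, not the IMDB column); it uses `Series.quantile` and the fact that the mean of a boolean Series is the fraction of True values.

```python
import pandas as pd

# hypothetical runtimes: the integers 1..100, not the IMDB data
runtimes = pd.Series(range(1, 101))

# 75th percentile, the same statistic describe() reports as '75%'
q75 = runtimes.quantile(0.75)

# the mean of a boolean Series is the fraction of True values
frac_above = (runtimes > q75).mean()
print(q75, frac_above)
```

On real data the fraction is only approximately 25%, because several movies can share the runtime that sits exactly at the percentile.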


In [14]:
# display the row with the highest rating
criteria = (movie['Rating'].max() == movie['Rating'])
movie[criteria]

Unnamed: 0_level_0,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0


### Series have a Single Data Type
A Series is backed by a NumPy array, so all of its values share one data type.

Because every Python value is an instance of "object", a Series of type "object" can hold values of any data type.

In [15]:
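# numpy uses float64 by default for floating-point values
# (np.ndarray([1, 2, 3]) would instead allocate an uninitialized array of shape (1, 2, 3))
a = np.array([1.0, 2.0, 3.0])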
a.dtype

dtype('float64')

In [16]:
# tell numpy what type of array to create
a = np.array([1, 2, 3], dtype='int32')
a.dtype

dtype('int32')

In [17]:
# Pandas creates an int64 Series by default, for integers
s = pd.Series([1, 2, 3])
s.dtype

dtype('int64')

In [18]:
# tell Pandas what type of Series to create
s = pd.Series([1, 2, 3], dtype='int32')
s.dtype

dtype('int32')

### Unknown Values
In data analysis every variable can have a value that is either: "known" or "unknown".  

Another name for "unknown" is "null".

In Pandas and Numpy, "np.nan" is used to represent "unknown" or "null".

#### Example of known/unknown:  
A person is asked a yes/no question and refuses to answer; the answer is: "unknown"  
A person answers yes; the answer is: yes

In [19]:
# any comparison involving np.nan evaluates to False, even np.nan == np.nan
print(3 < np.nan)
print(3 > np.nan)
print(np.nan == np.nan)

# a special function is needed to determine if the value is unknown
print(np.isnan(np.nan))

False
False
False
True


In [21]:
# the data type of np.nan is float
type(np.nan)

float
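
The same rules apply element by element inside a Series. A small sketch, assuming a made-up Series `s` (not part of the movie data), shows that a comparison is False at the unknown position and that `Series.isna` is what actually locates it:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # made-up values with one unknown

# element-wise comparison: the NaN position compares as False
greater = s > 0

# Series.isna (or pd.isna) is the reliable way to locate unknown values
missing = s.isna()

print(greater.tolist())
print(missing.tolist())
```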

### A Series with an Unknown Value
If any of the values are unknown, then the type of the series must be float or object in order to hold np.nan.

In [22]:
# Pandas defaults this to int64
s = pd.Series([1, 2, 3])
print(s.dtype)

# Pandas defaults this to float64, so it can hold np.nan
s = pd.Series([1, 2, 3, np.nan])
print(s.dtype)

int64
float64


In [23]:
# the only way to hold non-numeric types is to use the catch-all, 'object'
# a Pandas row will often contain values of different data types
s = pd.Series([{"one":1}, [2, 3], (3,4), 5])
s.dtype

dtype('O')

In [24]:
# the default way to hold strings is also to use 'object'
s = pd.Series(['column_one', 'column_two'])
s.dtype

dtype('O')

### Summary of Above
An integer series that needs to represent "unknown" will have its data type promoted to "float" as seen above.

In practice, a column of type "object" is most often a Series containing strings, although it could be an array of disparate data types.
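
As a side note, recent Pandas versions also offer a nullable integer dtype that avoids the promotion to float. This notebook does not use it; the sketch below, on made-up values, only contrasts the two behaviours. The dtype string is "Int64" with a capital "I".

```python
import pandas as pd

# default behaviour: the missing value forces promotion to float64
promoted = pd.Series([1, 2, None])

# nullable integer dtype (capital "I"): the known values stay integers
nullable = pd.Series([1, 2, None], dtype="Int64")

print(promoted.dtype, nullable.dtype)
print(nullable.isna().sum())
```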

### Handling Null Values

In [40]:
# selecting a column does not copy the data; in the Pandas version used here,
# repeated selection returns the same cached Series object
metascore = movie['Metascore']
metascore is movie['Metascore']

True

In [41]:
# number of non-null values
metascore.count()

936

In [42]:
# number of null values
metascore.isnull().sum()

64

In [37]:
# sum of non-null values
metascore.sum()

55210.0

In [38]:
# compute the average of the non-null values
metascore.sum() / metascore.count()

58.98504273504273

In [39]:
# compute the average of the non-null values
metascore.mean()

58.98504273504273

In [44]:
# When using Machine Learning algorithms, it can be helpful to impute a missing value rather than
# use null.  Sometimes the imputed value is the mean.
metascore = metascore.fillna(value=metascore.mean())

In [45]:
metascore.count()

1000

In [47]:
# one reason for imputing with the mean value is that it does not change the overall mean
# of the column
metascore.mean()

58.985042735042626
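
The difference in the last digits above is floating-point rounding, not a real change in the mean. A minimal sketch on made-up scores (the Series `scores` is an assumption for illustration) makes the invariance easy to see:

```python
import math

import numpy as np
import pandas as pd

scores = pd.Series([60.0, 80.0, np.nan, 70.0, np.nan])  # made-up values

mean_before = scores.mean()          # NaN values are skipped
filled = scores.fillna(mean_before)  # every NaN becomes the mean
mean_after = filled.mean()

print(mean_before, mean_after, filled.count())
```

Note that while the mean is preserved, the spread is not: every imputed value sits exactly at the mean, so the standard deviation of the filled column shrinks.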

## Data Wrangling: Queries involving Genre
Each cell in the Genre column is a single string.  This string consists of several genres separated by commas.

With the data in this format, it is not possible to directly answer queries such as "what genre has the highest average rating?"

In the following, an indicator column, indicating the presence or absence of a genre, will be created.  With this it is possible to select the movies for a given genre and compute their average rating.
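
One possible way to build such indicator columns, sketched here on made-up rows rather than the IMDB file, is `Series.str.get_dummies`. It splits each string on a separator and matches whole genre names, so "Music" and "Musical" stay distinct. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# three made-up rows in the same comma-separated format as the Genre column
toy = pd.DataFrame(
    {"Genre": ["Action,Sci-Fi", "Comedy,Musical", "Drama,Music"],
     "Rating": [8.0, 6.0, 7.0]},
    index=["A", "B", "C"],
)

# one 0/1 column per distinct genre, converted to booleans
indicators = toy["Genre"].str.get_dummies(sep=",").astype(bool)

# average rating of the rows flagged as Music (exact match, "Musical" is excluded)
music_mean = toy.loc[indicators["Music"], "Rating"].mean()
print(indicators.columns.tolist(), music_mean)
```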

In [438]:
# read in the IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')
movie = movie.set_index('Title')

In [439]:
def parse_genre_string(s):
    genres = s.split(',')
    return [genre.strip() for genre in genres]

In [440]:
# get the set of unique genres
genres = set()
s = movie['Genre'].apply(parse_genre_string)
s.apply(lambda x: genres.update(x))
genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}

In [441]:
# Create one indicator column per genre
# note: `col in x` is a substring test on the raw string, so 'Music' also matches 'Musical'
for col in genres:
    movie['Genre_'+col] = movie['Genre'].apply(lambda x: col in x)

In [442]:
# try it
cols = ['Genre', 'Genre_Action', 'Genre_Sci-Fi', 'Rating']
action_criteria = movie['Genre_Action']
action_movies = movie.loc[action_criteria, cols]
action_movies.head(5)

Unnamed: 0_level_0,Genre,Genre_Action,Genre_Sci-Fi,Rating
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",True,True,8.1
Suicide Squad,"Action,Adventure,Fantasy",True,False,6.2
The Great Wall,"Action,Adventure,Fantasy",True,False,6.1
The Lost City of Z,"Action,Adventure,Biography",True,False,7.1
Rogue One,"Action,Adventure,Sci-Fi",True,True,7.9


In [443]:
# find the avg rating of action movies
action_movies['Rating'].mean()

6.614521452145213

In [444]:
# the same, but use groupby
# groupby is overkill for a boolean_series, but it works
# groupby with an indicator variable results in two groups: False, True
# the group of interest is 'True'
groupby = movie.groupby('Genre_Action')
groupby['Rating'].mean()

Genre_Action
False    6.770445
True     6.614521
Name: Rating, dtype: float64

In [445]:
groupby['Rating'].mean()[True]

6.614521452145213

In [446]:
# find the highest rated action movie
criteria = (action_movies['Rating'] == action_movies['Rating'].max())
action_movies[criteria]

Unnamed: 0_level_0,Genre,Genre_Action,Genre_Sci-Fi,Rating
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Dark Knight,"Action,Crime,Drama",True,False,9.0


In [447]:
# how many action movies are there?
# True counts as 1, False as 0, so sum() is number of Action Movies
movie['Genre_Action'].sum()

303

In [448]:
groupby['Rating'].count()[True]

303

In [449]:
# how many of each genre are there?
genre_count = {}
for genre in genres:
    num_movies = movie['Genre_'+genre].sum()
    genre_count[genre] = num_movies
    
genre_count_sorted = sorted(genre_count.items(), key=lambda kv: kv[1], reverse=True)
for genre, count in genre_count_sorted:
    print(f'{genre:<10} Count: {count:>4}')

Drama      Count:  513
Action     Count:  303
Comedy     Count:  279
Adventure  Count:  259
Thriller   Count:  195
Crime      Count:  150
Romance    Count:  141
Sci-Fi     Count:  120
Horror     Count:  119
Mystery    Count:  106
Fantasy    Count:  101
Biography  Count:   81
Family     Count:   51
Animation  Count:   49
History    Count:   29
Music      Count:   21
Sport      Count:   18
War        Count:   13
Western    Count:    7
Musical    Count:    5


In [450]:
# find all genres that apply to at least 20 titles
genre_20 = [genre for genre,count in genre_count.items() if count >= 20]
genre_20

['Sci-Fi',
 'Mystery',
 'Fantasy',
 'Thriller',
 'Drama',
 'Adventure',
 'Biography',
 'Romance',
 'Animation',
 'Action',
 'History',
 'Comedy',
 'Family',
 'Crime',
 'Horror',
 'Music']

In [451]:
# find the average rating for each genre that applies to at least 20 titles
cols = ['Genre', 'Rating']
genre_avg_rating = {}
for genre in genre_20:
    genre_criteria = movie['Genre_'+genre]
    genre_movies = movie.loc[genre_criteria, cols]
    genre_avg_rating[genre] = genre_movies['Rating'].mean()

genre_rating_sorted = sorted(genre_avg_rating.items(), key=lambda kv: kv[1], reverse=True)
for genre, rating in genre_rating_sorted:
    print(f'{genre:<10} Rating: {rating:3.2f}')

Animation  Rating: 7.32
Biography  Rating: 7.29
History    Rating: 7.13
Music      Rating: 7.04
Drama      Rating: 6.95
Mystery    Rating: 6.89
Crime      Rating: 6.79
Adventure  Rating: 6.77
Sci-Fi     Rating: 6.72
Romance    Rating: 6.69
Family     Rating: 6.68
Comedy     Rating: 6.65
Action     Rating: 6.61
Thriller   Rating: 6.59
Fantasy    Rating: 6.55
Horror     Rating: 6.09


In [452]:
movie.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore', 'Genre_Action', 'Genre_Adventure', 'Genre_Animation',
       'Genre_Biography', 'Genre_Comedy', 'Genre_Crime', 'Genre_Drama',
       'Genre_Family', 'Genre_Fantasy', 'Genre_History', 'Genre_Horror',
       'Genre_Music', 'Genre_Musical', 'Genre_Mystery', 'Genre_Romance',
       'Genre_Sci-Fi', 'Genre_Sport', 'Genre_Thriller', 'Genre_War',
       'Genre_Western'],
      dtype='object')

In [456]:
# find all movies that are both Comedy and Sci-Fi
criteria = movie['Genre_'+'Comedy'] & movie['Genre_'+'Sci-Fi']
movie[criteria]

Unnamed: 0_level_0,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),...,Genre_Horror,Genre_Music,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Absolutely Anything,399,"Comedy,Sci-Fi",A group of eccentric aliens confer a human bei...,Terry Jones,"Simon Pegg, Kate Beckinsale, Sanjeev Bhaskar, ...",2015,85,6.0,26587,,...,False,False,False,False,False,True,False,False,False,False
Idiocracy,560,"Adventure,Comedy,Sci-Fi","Private Joe Bauers, the definition of ""average...",Mike Judge,"Luke Wilson, Maya Rudolph, Dax Shepard, Terry ...",2006,84,6.6,115355,0.44,...,False,False,False,False,False,True,False,False,False,False
Paul,732,"Adventure,Comedy,Sci-Fi",Two British comic-book geeks traveling across ...,Greg Mottola,"Simon Pegg, Nick Frost, Seth Rogen, Mia Stallard",2011,104,7.0,201707,37.37,...,False,False,False,False,False,True,False,False,False,False
The World's End,800,"Action,Comedy,Sci-Fi",Five friends who reunite in an attempt to top ...,Edgar Wright,"Simon Pegg, Nick Frost, Martin Freeman, Rosamu...",2013,109,7.0,199813,26.0,...,False,False,False,False,False,True,False,False,False,False
Slither,909,"Comedy,Horror,Sci-Fi","A small town is taken over by an alien plague,...",James Gunn,"Nathan Fillion, Elizabeth Banks, Michael Rooke...",2006,95,6.5,64351,7.77,...,True,False,False,False,False,True,False,False,False,False


### Group By
The Year column is a more natural column to group by than a boolean indicator.

In [373]:
groupby_year = movie.groupby('Year')

# average rating and number of rated movies per year
groupby_year[['Rating']].agg(['mean', 'count'])

Unnamed: 0_level_0,Rating,Rating
Unnamed: 0_level_1,mean,count
Year,Unnamed: 1_level_2,Unnamed: 2_level_2
2006,7.125,44
2007,7.133962,53
2008,6.784615,52
2009,6.960784,51
2010,6.826667,60
2011,6.838095,63
2012,6.925,64
2013,6.812088,91
2014,6.837755,98
2015,6.602362,127


In [532]:
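def group_filter(g):
    # return the mean rating only for groups with more than 4 ratings;
    # smaller groups implicitly return None, which shows up as null in the result
    if g['Rating'].count() > 4:
        return g['Rating'].mean()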

In [535]:
s = movie.groupby('Director').apply(group_filter)
s[s.notnull()].nlargest(5)

Director
Christopher Nolan    8.68
Martin Scorsese      7.92
David Fincher        7.82
Denis Villeneuve     7.76
J.J. Abrams          7.58
dtype: float64

In [534]:
# for comparison: without the minimum-count filter, directors with very few movies dominate
s = movie.groupby('Director')['Rating'].mean()
s[s.notnull()].nlargest(5)

Director
Nitesh Tiwari        8.80
Christopher Nolan    8.68
Makoto Shinkai       8.60
Olivier Nakache      8.60
Aamir Khan           8.50
Name: Rating, dtype: float64
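
The minimum-count filter can also be expressed without a custom function. The following is a sketch on made-up data, not the notebook's own approach: it computes the mean and the count together with `agg`, then filters on the count. The directors "X", "Y", "Z" and the threshold `min_movies` are hypothetical.

```python
import pandas as pd

# made-up ratings for three hypothetical directors
toy = pd.DataFrame({
    "Director": ["X", "X", "X", "Y", "Z", "Z"],
    "Rating":   [8.0, 9.0, 7.0, 9.5, 6.0, 7.0],
})

# compute both statistics in one pass, then filter on the count
stats = toy.groupby("Director")["Rating"].agg(["mean", "count"])
min_movies = 2  # hypothetical threshold
top = stats[stats["count"] >= min_movies].sort_values("mean", ascending=False)
print(top)
```

Director "Y" has the highest mean but only one movie, so it is excluded, which mirrors the difference between the two outputs above.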

## Data Wrangling: Queries involving Actor
This could be done exactly as with Genres, but as there are so many different actors, it's probably better to rely on a couple of dictionaries instead of creating a great many indicator columns.

### Convert Actor Series to dict
The key is the index of the series.
The value is the series value.

In [416]:
# Map Title to Actor string
title_to_actor_string = movie['Actors'].to_dict()

In [417]:
title_to_actor_string['10 Cloverfield Lane']

'John Goodman, Mary Elizabeth Winstead, John Gallagher Jr., Douglas M. Griffin'

In [418]:
# convert this string to a list of actor names
from collections import defaultdict

title_to_actors = defaultdict(list)
for title, actor_string in title_to_actor_string.items():
    for actor in actor_string.split(','):
        title_to_actors[title].append(actor.strip())

In [419]:
# try it
title_to_actors['10 Cloverfield Lane']

['John Goodman',
 'Mary Elizabeth Winstead',
 'John Gallagher Jr.',
 'Douglas M. Griffin']

### Create Inverted dict (Actors -> Titles)

In [422]:
# invert the index
actor_to_titles = defaultdict(list)
for title, actors in title_to_actors.items():
    for actor in actors:
        actor_to_titles[actor].append(title)

In [423]:
# try it
actor_to_titles['John Goodman']

['Patriots Day',
 '10 Cloverfield Lane',
 'Monsters University',
 'Argo',
 'The Gambler',
 'Flight']

In [425]:
movie.loc[actor_to_titles['John Goodman'],'Actors']

Title
Patriots Day           Mark Wahlberg, Michelle Monaghan, J.K. Simmons...
10 Cloverfield Lane    John Goodman, Mary Elizabeth Winstead, John Ga...
Monsters University    Billy Crystal, John Goodman, Steve Buscemi, He...
Argo                   Ben Affleck, Bryan Cranston, John Goodman, Ala...
The Gambler            Mark Wahlberg, Jessica Lange, John Goodman, Br...
Flight                 Denzel Washington, Nadine Velazquez, Don Chead...
Name: Actors, dtype: object

In [429]:
# get the average rating and number of movies for each actor in at least 5 movies
actor_rating = {}
for actor in actor_to_titles:
    titles = actor_to_titles[actor]
    movies_with_actor = movie.loc[titles]
    count = movies_with_actor['Rating'].count()
    if count >= 5:
        actor_rating[actor] = movies_with_actor['Rating'].mean()

In [433]:
actor_rating_sorted = sorted(actor_rating.items(), key=lambda kv: kv[1], reverse=True)
actor_rating_sorted[:10]

[('Leonardo DiCaprio', 7.969999999999999),
 ('Michael Caine', 7.766666666666667),
 ('Domhnall Gleeson', 7.716666666666668),
 ('Rachel McAdams', 7.671428571428571),
 ('Christopher Plummer', 7.639999999999999),
 ('Ian McKellen', 7.583333333333333),
 ('Robert Downey Jr.', 7.500000000000001),
 ('Mark Ruffalo', 7.5),
 ('Brad Pitt', 7.484615384615385),
 ('Zoe Saldana', 7.483333333333333)]
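
As an alternative to the dictionaries, the per-actor statistics can be computed inside Pandas by giving every actor a row of their own. This is a sketch under assumptions, not the method used in this notebook: the titles, names and ratings are made up, and `DataFrame.explode` requires Pandas 0.25 or later.

```python
import pandas as pd

# made-up titles and casts in the same format as the Actors column
toy = pd.DataFrame(
    {"Actors": ["Ann Lee, Bob Ray", "Ann Lee, Cy Dow", "Bob Ray"],
     "Rating": [8.0, 6.0, 7.0]},
    index=pd.Index(["T1", "T2", "T3"], name="Title"),
)

# split each cast string into a list, then give every actor a separate row
long = toy.assign(Actor=toy["Actors"].str.split(",")).explode("Actor")
long["Actor"] = long["Actor"].str.strip()

# average rating and number of movies per actor
per_actor = long.groupby("Actor")["Rating"].agg(["mean", "count"])
print(per_actor)
```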

In [464]:
movie.loc[actor_to_titles['Leonardo DiCaprio']]

Unnamed: 0_level_0,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),...,Genre_Horror,Genre_Music,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,...,False,False,False,False,False,True,False,False,False,False
The Wolf of Wall Street,83,"Biography,Comedy,Crime","Based on the true story of Jordan Belfort, fro...",Martin Scorsese,"Leonardo DiCaprio, Jonah Hill, Margot Robbie,M...",2013,180,8.2,865134,116.87,...,False,False,False,False,False,False,False,False,False,False
The Departed,100,"Crime,Drama,Thriller",An undercover cop and a mole in the police att...,Martin Scorsese,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...",2006,151,8.5,937414,132.37,...,False,False,False,False,False,False,False,True,False,False
The Revenant,130,"Adventure,Drama,Thriller",A frontiersman on a fur trading expedition in ...,Alejandro González Iñárritu,"Leonardo DiCaprio, Tom Hardy, Will Poulter, Do...",2015,156,8.0,499424,183.64,...,False,False,False,False,False,False,False,True,False,False
The Great Gatsby,138,"Drama,Romance","A writer and wall street trader, Nick, finds h...",Baz Luhrmann,"Leonardo DiCaprio, Carey Mulligan, Joel Edgert...",2013,143,7.3,386102,144.81,...,False,False,False,False,True,False,False,False,False,False
Shutter Island,139,"Mystery,Thriller","In 1954, a U.S. marshal investigates the disap...",Martin Scorsese,"Leonardo DiCaprio, Emily Mortimer, Mark Ruffal...",2010,138,8.1,855604,127.97,...,False,False,False,True,False,False,False,True,False,False
Django Unchained,145,"Drama,Western","With the help of a German bounty hunter , a fr...",Quentin Tarantino,"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio...",2012,165,8.4,1039115,162.8,...,False,False,False,False,False,False,False,False,False,True
Revolutionary Road,460,"Drama,Romance",A young couple living in a Connecticut suburb ...,Sam Mendes,"Leonardo DiCaprio, Kate Winslet, Christopher F...",2008,119,7.3,159736,22.88,...,False,False,False,False,True,False,False,False,False,False
Blood Diamond,670,"Adventure,Drama,Thriller","A fisherman, a smuggler, and a syndicate of bu...",Edward Zwick,"Leonardo DiCaprio, Djimon Hounsou, Jennifer Co...",2006,143,8.0,422014,57.37,...,False,False,False,False,False,False,False,True,False,False
Body of Lies,738,"Action,Drama,Romance",A CIA agent on the ground in Jordan hunts down...,Ridley Scott,"Leonardo DiCaprio, Russell Crowe, Mark Strong,...",2008,128,7.1,182305,39.38,...,False,False,False,False,True,False,False,False,False,False


In [466]:
# with dictionaries, it's slightly harder to query for two actors than with indicator columns
# find all movies with Leonardo DiCaprio and Matt Damon
titles1 = actor_to_titles['Leonardo DiCaprio']
titles2 = actor_to_titles['Matt Damon']
titles = set(titles1).intersection(titles2)
# .loc expects a list-like of labels rather than a set
movie.loc[list(titles)]

Unnamed: 0_level_0,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),...,Genre_Horror,Genre_Music,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
The Departed,100,"Crime,Drama,Thriller",An undercover cop and a mole in the police att...,Martin Scorsese,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...",2006,151,8.5,937414,132.37,...,False,False,False,False,False,False,False,True,False,False


## Categorical Values
Categorical Values are also called enumerated types or factors.

Converting a column to categorical can save on memory, make the data processing clearer, and may help certain Machine Learning algorithms to do a better job.
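
The memory saving is easy to measure. A minimal sketch, assuming a made-up column with only three distinct labels (the Series `labels` is not part of the movie data):

```python
import pandas as pd

# made-up column: three distinct labels repeated many times
labels = pd.Series(["low", "medium", "high"] * 1000)
as_category = labels.astype("category")

# deep=True also counts the bytes of the stored strings themselves
bytes_object = labels.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(bytes_object, bytes_category)
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it only pays off when the number of unique values is small relative to the number of rows.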

In [401]:
# read in the IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')
movie = movie.set_index('Title')

In [402]:
# select the columns that are of type 'object'
movie_obj = movie.select_dtypes('object')
movie_obj.columns

Index(['Genre', 'Description', 'Director', 'Actors'], dtype='object')

In [403]:
# examine how many unique values there are for each such column
movie_obj.nunique()

Genre           207
Description    1000
Director        644
Actors          996
dtype: int64

In [None]:
# as it turns out, none of the columns in this dataset are good candidates for 
# being converted to a category

In [399]:
# `year` is the ordered categorical Series created in cell In [378] below
movie['Year'] = year
movie.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null category
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: category(1), float64(3), int64(3), object(4)
memory usage: 127.3+ KB


In [392]:
years

array([2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016])

In [378]:
# The object columns have too many unique values to be useful as categories
# Although unusual, Year could be converted to an ordered categorical variable
# (astype('category', ordered=True) is deprecated; pass a CategoricalDtype instead)
year = movie['Year'].astype(pd.CategoricalDtype(ordered=True))
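
What "ordered" adds is support for comparisons, `min` and `max`. The sketch below uses made-up years (the Series `toy_years` is an assumption for illustration). Note that an ordered categorical can only be compared against a value that is one of its categories.

```python
import pandas as pd

toy_years = pd.Series([2014, 2006, 2016, 2006])  # made-up values

# categories are inferred from the data and sorted; ordered=True enables <, >, min, max
year_cat = toy_years.astype(pd.CategoricalDtype(ordered=True))

print(year_cat.cat.categories.tolist())
print((year_cat > 2006).tolist())  # 2006 must be an existing category
print(year_cat.max())
```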


In [77]:
# Genre is a candidate for converting to a categorical field
movie_obj['Genre'] = movie_obj['Genre'].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


### Above Warning is Important
To save memory, Pandas does not copy objects or parts of objects by default.

movie_obj is a slice of movie, so an assignment made to it may or may not also modify movie.

Although Pandas correctly figures out the intent here, this type of warning should never occur in well-written code.

In [84]:
# reread the IMDB movie dataset
movie = pd.read_csv('../data/IMDB-Movie-Data.csv')

In [162]:
# select the columns that are of type 'object', AS A COPY
movie_obj = movie.select_dtypes('object').copy()
movie_obj.columns

Index(['Genre', 'Description', 'Director', 'Actors'], dtype='object')

In [163]:
# Convert to categorical, note there is no warning this time
movie_obj['Genre'] = movie_obj['Genre'].astype('category')

In [164]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB


In [88]:
# overwrite the Genre column in the movie dataframe
movie['Genre'] = movie_obj['Genre']
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Rank                  1000 non-null int64
Title                 1000 non-null object
Genre                 1000 non-null category
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: category(1), float64(3), int64(4), object(4)
memory usage: 99.6+ KB


In [90]:
genre = movie['Genre']
genre is movie['Genre']

True

In [91]:
genre

0         Action,Adventure,Sci-Fi
1        Adventure,Mystery,Sci-Fi
2                 Horror,Thriller
3         Animation,Comedy,Family
4        Action,Adventure,Fantasy
5        Action,Adventure,Fantasy
6              Comedy,Drama,Music
7                          Comedy
8      Action,Adventure,Biography
9         Adventure,Drama,Romance
10       Adventure,Family,Fantasy
11        Biography,Drama,History
12        Action,Adventure,Sci-Fi
13     Animation,Adventure,Comedy
14            Action,Comedy,Drama
15     Animation,Adventure,Comedy
16        Biography,Drama,History
17                Action,Thriller
18                Biography,Drama
19           Drama,Mystery,Sci-Fi
20       Adventure,Drama,Thriller
21                          Drama
22             Crime,Drama,Horror
23     Animation,Adventure,Comedy
24        Action,Adventure,Sci-Fi
25                         Comedy
26         Action,Adventure,Drama
27                Horror,Thriller
28                         Comedy
29         Act

In [None]:
# When the variable is categorical, the imputed value might be the mode

In [None]:
# which directors directed at least 3 movies?
# (first step toward finding the highest average rating among them)
groupby = movie.groupby('Director')

# boolean Series indexed by Director: True where the director has at least 3 rated movies
groupby['Rating'].count() >= 3

In [None]:
# iterating over a groupby yields (group name, group DataFrame) pairs
for name, group in groupby:
    print(name)
    print(len(group))

In [None]:
dir(groupby)

## Start of Actors Section

### Actors Column
The Actors column contains a single string per cell, but each string consists of a comma-separated list of actors.

Whenever a single cell contains more than 1 value, or a string is used to represent more than 1 value, special processing is required.

#### Quick Review of defaultdict(list)
A dictionary is a mapping between a key and a value.

The value can be a collection.

defaultdict(list): says that each value defaults to an empty list

In [None]:
# Book analogy:
from collections import defaultdict
d = defaultdict(list)
d['p1'].append("word 1")
d['p1'].append("word 2")
d['p1'].append("word 3")
d['p2'].append("word 2")
d['p2'].append("word 4")
d

In [None]:
# if the value is a collection, the entire collection is accessed by key
d['p1']

In [None]:
# Create an inverted index (values to keys)
# The index in the back of a book is an "inverted index"
inverted_dict = defaultdict(list)
for key, values in d.items():
    for value in values:
        inverted_dict[value].append(key)
        
inverted_dict

In [None]:
inverted_dict['word 2']

#### Create Inverted dict (Actors -> Titles)

In [None]:
# invert the index
actors_to_title = defaultdict(list)
for title, actors in title_to_actors.items():
    for actor in actors:
        actors_to_title[actor].append(title)

In [None]:
actors_to_title['John Goodman']

## END of ACTORS Sections

In [None]:
# get 1 column of values
votes = movie['Votes']

# first 5 votes in the column
votes[:5]

In [None]:
print(f'type of votes: {type(votes)}')
print(f'type of votes values: {votes.dtype}')

### DataFrame

In [None]:
# Get overall information about a DataFrame
movie.info()

In [None]:
# number of records
len(movie)

In [None]:
# number of non-null movie titles
movie['Title'].count()

In [None]:
# number of null movie titles
# True counts as 1, False counts as 0, sum up the number of True values
movie['Title'].isna().sum()

In [None]:
metascores = movie['Metascore']
print(f'Number of non-null Metascores: {metascores.count()}')
print(f'Number of null Metascores: {metascores.isnull().sum()}')
print(f'Number of Movies: {len(movie)}')

# number of non-null Metascores + number of null Metascores = number of records
metascores.count() + metascores.isna().sum() == len(movie)

In [None]:
# the datatypes of each of the movie columns
movie.dtypes

In [None]:
# distribution of datatypes
movie.dtypes.value_counts()

## Selecting Values in Row(s) or Column(s) or Both

In [None]:
# Column Selection
cols_to_select = ['Title', 'Rating']

# verify every column to be selected is in movie.columns
columns_all = set(movie.columns)
columns_subset = set(cols_to_select)
columns_subset.issubset(columns_all)

In [None]:
# select these columns (first 3 rows)
movie[cols_to_select].head(3)

In [None]:
# select these columns (first 3 rows), alternative syntax
movie[cols_to_select][:3]

In [None]:
# add a column that isn't there
cols_to_select2 = ['Title', 'Rating', 'Not_There']
try:
    movie[cols_to_select2]
except KeyError as err:
    print(err)

### Note on Above Error Message

The error message says that the column name is not in the "index".  As the column labels are themselves an index, this means the column name is not in the index of column labels.

This does **not** mean that 'Not_There' is not in the df.index.  df.index refers specifically to the index of row labels.  
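
The distinction can be checked on a toy DataFrame (hypothetical data). This is only a sketch; the exact wording of the KeyError message may differ between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({'Title': ['Movie 1', 'Movie 2'], 'Rating': [7.5, 8.1]})

columns_are_index = isinstance(toy.columns, pd.Index)   # column labels
rows_are_index = isinstance(toy.index, pd.Index)        # row labels

# membership tests run against the respective Index
in_columns = 'Not_There' in toy.columns
in_rows = 'Not_There' in toy.index

try:
    toy[['Title', 'Not_There']]
except KeyError as err:
    message = str(err)   # names the missing label
```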

### Pandas Infers Whether: df\[cols_to_select\] Selects Columns or Rows

Often the inference allows for convenient use of the bracket operator.  
Sometimes it is clearer to explicitly state whether you are searching for row IDs or column IDs or both.

To state explicitly that you want the rows with IDs 0-2 and the columns with the labels in 'cols_to_select', use **.loc\[\]**

In [None]:
# Row ID Selection, the slice is label-based and includes the end label
# Col ID Selection, list of labels is provided
movie.loc[:2,cols_to_select]
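
A short sketch contrasting the inclusive label slice of `.loc` with the exclusive positional slice of `.iloc`, on hypothetical data with a default RangeIndex:

```python
import pandas as pd

toy = pd.DataFrame(
    {'Title': ['M0', 'M1', 'M2', 'M3'], 'Rating': [7.0, 8.0, 9.0, 6.0]}
)  # default RangeIndex 0..3

by_label = toy.loc[:2, ['Title', 'Rating']]   # labels 0, 1, 2 (end label included)
by_position = toy.iloc[:2, [0, 1]]            # positions 0, 1 (end position excluded)
```

With a RangeIndex the labels and positions look alike, which is why the one-row difference is easy to miss.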

### Boolean Series Row Filtering

Comparing the values in one column with a relational operator results in a boolean Series that has the same length and the same index as the column.

In [None]:
# get a boolean series for a relational operation
good_rating = movie['Rating'] > 8.6
print(type(good_rating)) # Series
print(good_rating.dtype) # bool
print(len(good_rating) == len(movie)) # same length

In [None]:
# show first 3 values of this series
good_rating[:3]

#### Boolean Series Index must match DataFrame index

When a boolean series is used to filter rows, its index must "match" the DataFrame's index.  This will always be the case when the boolean series was created with a relational operator applied to the values of one of the DataFrame's columns.
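
What "match" means can be sketched on toy data: the labels must be alignable, but they do not have to be in the same order. The example assumes pandas 1.5 or later, where the exception is exposed as `pd.errors.IndexingError`.

```python
import pandas as pd

toy = pd.DataFrame({'Rating': [9.0, 7.0, 8.8]}, index=['a', 'b', 'c'])
mask = toy['Rating'] > 8.6

# same labels, different order: .loc aligns the mask to the DataFrame by label
shuffled = mask.loc[['c', 'a', 'b']]
selected = toy.loc[shuffled]

# a label is missing: the mask cannot be aligned
truncated = mask.iloc[:-1]
try:
    toy.loc[truncated]
    unalignable = False
except pd.errors.IndexingError:
    unalignable = True
```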

In [None]:
movie.index

In [None]:
good_rating.index

In [None]:
# all index values (aka row IDs) are the same
(movie.index == good_rating.index).all()

In [None]:
# how many movies meet the good_rating criteria?
# True is 1, False is 0, the sum of the True values is the number of True values
good_rating.sum()

In [None]:
movie[good_rating]

### Bracket Notation **df\[filter\]**

Pandas can infer:
1. columns are to be selected if the input is a list of values which are a subset of the column labels
2. rows are to be selected if the input is a boolean series having a matching index

In [None]:
# filtering by column, analogous to SQL Select
cols_to_select = ['Title', 'Rating']
movie[cols_to_select].head(3)

In [None]:
# filtering by row, analogous to SQL Where
movie[good_rating].head(3)

In [None]:
# To not clutter the notebook output, warnings are ignored
# Normally warnings should not be ignored
import warnings
warnings.filterwarnings('ignore')

# create a non-matching boolean_series
good_rating2 = good_rating[:-1]
try:
    movie[good_rating2]
except pd.core.indexing.IndexingError as err:
    print(err)

In [None]:
cols_to_select = ['Title', 'Rating']
good_rating = movie['Rating'] > 8.6

# filter both columns and rows
movie[cols_to_select][good_rating]

In [None]:
# filter on both rows and columns, reverse order
movie[good_rating][cols_to_select]

In [None]:
# clearer code, Pandas is not inferring what is being selected
movie.loc[good_rating, cols_to_select]

# Example DataFrame Queries

In [None]:
# rename the columns to be easier to work with
movie.columns

In [None]:
cols = ['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue',
       'Metascore']
movie.columns = cols
movie.columns

In [None]:
# sort the rows by index (movie titles)
movie = movie.sort_index()

In [None]:
# select last two rows, first 3 columns
movie.iloc[-2:, :3]

In [None]:
# select movies more than 150 minutes with a rating higher than 8.3
row_filter = (movie['Runtime'] > 150) & (movie['Rating'] > 8.3)

# select Genre, Runtime and Rating
cols = ['Genre','Runtime','Rating']

movie.loc[row_filter, cols].head(3)

## Handle Multiple Values in Actors Column

The Actors column is not a collection per se, although it is used like one.  Each value in the Actors column is a single value, which happens to be a string that contains one or more actor names.
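
As a sketch of what this implies, pandas string methods can treat each value either as one string or as a list of names. The data below is hypothetical, and `str.split` plus `explode` is offered as one possible approach, not the one this notebook uses.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Actors': ['Actor A, Actor B', 'Actor B,Actor C', 'Actor A']},
    index=['Movie 1', 'Movie 2', 'Movie 3'],
)

# substring test on each string, without apply/lambda
has_b = toy['Actors'].str.contains('Actor B', regex=False)

# split each string into a list, then give every actor its own row
actors_long = toy['Actors'].str.split(',').explode().str.strip()
movies_per_actor = actors_long.value_counts()
```

After `explode`, the index (the title) is repeated once per actor, so each row is one (title, actor) pair.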

In [None]:
# Find all movies with Zooey Deschanel in this dataset
titles_filter = movie['Actors'].apply(lambda x: 'Zooey Deschanel' in x)

In [None]:
movie[titles_filter]

In [None]:
# It's a bit awkward to hardcode a string inside the lambda expression
# pd.Series.apply() calls the function with one argument: each value in the Series
# To supply the actor name as well, bind it with a closure
def make_actor_filter(actor):
    def my_filter(x):
        return actor in x
    return my_filter
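
For comparison, a sketch of the same binding done with `functools.partial` from the standard library. The helper `actor_in` and the sample strings are hypothetical.

```python
from functools import partial

def make_actor_filter(actor):
    """Closure: returns a one-argument function that remembers `actor`."""
    def my_filter(actor_string):
        return actor in actor_string
    return my_filter

def actor_in(actor_string, actor):
    """Plain two-argument version, for use with functools.partial."""
    return actor in actor_string

rows = ['Actor A, Actor B', 'Actor C']   # hypothetical Actors values
closure_filter = make_actor_filter('Actor B')
partial_filter = partial(actor_in, actor='Actor B')

closure_result = [closure_filter(row) for row in rows]
partial_result = [partial_filter(row) for row in rows]
```

`Series.apply` also forwards extra keyword arguments to the function, so `movie['Actors'].apply(actor_in, actor='Zooey Deschanel')` should give the same filter without a closure.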

In [None]:
zooey_filter = make_actor_filter('Zooey Deschanel')
titles_filter = movie['Actors'].apply(zooey_filter)
movie[titles_filter]

In [None]:
# to find out which actors appeared in the most movies, create a dictionary
# and invert it
title_to_actor_string = movie['Actors'].to_dict()
title_to_actor_string['10 Cloverfield Lane']

In [None]:
# convert this string to a list of actor names
title_to_actors = defaultdict(list)
for title, actor_string in title_to_actor_string.items():
    for actor in actor_string.split(','):
        title_to_actors[title].append(actor.strip())

In [None]:
title_to_actors['10 Cloverfield Lane']

In [None]:
# invert the index
actors_to_title = defaultdict(list)
for title, actors in title_to_actors.items():
    for actor in actors:
        actors_to_title[actor].append(title)

In [None]:
actors_to_title['Zooey Deschanel']

In [None]:
# with a dictionary of actor to movies, it's easy to get the count of movies per actor
movie_count_by_actor = {key:len(value) for key, value in actors_to_title.items()}

In [None]:
movie_count_by_actor['Zooey Deschanel']

In [None]:
# find which actors were in the most movies
# recall that sorted takes an optional user specified key
sorted_by_count = sorted(movie_count_by_actor.items(), key=lambda kv: kv[1], reverse=True)
sorted_by_count[:3]
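
The same count-and-sort can be sketched with `collections.Counter`. The dictionary below is a hypothetical stand-in for `title_to_actors`.

```python
from collections import Counter

# hypothetical stand-in for title_to_actors
title_to_actors = {
    'Movie 1': ['Actor A', 'Actor B'],
    'Movie 2': ['Actor B', 'Actor C'],
    'Movie 3': ['Actor B', 'Actor A'],
}

# count every appearance of every actor across all titles
counts = Counter(actor for actors in title_to_actors.values() for actor in actors)
top_two = counts.most_common(2)
```

`most_common(n)` returns (actor, count) pairs sorted by descending count, which replaces both the dict comprehension and the call to `sorted`.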

In [None]:
# find Hugh Jackman movies longer than 140 minutes
titles = actors_to_title['Hugh Jackman']
hugh_movies = movie.loc[titles]
long_hugh_movies = hugh_movies[hugh_movies['Runtime'] > 140]
long_hugh_movies

## Find which actor was in movies with the highest ratings

A plain groupby on the movie rows won't work because the groups are not mutually exclusive: each movie belongs to several actors.

Ensure actor is in at least two movies.
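
One possible way around the overlapping groups, sketched on hypothetical data: give every (movie, actor) pair its own row first, so that each row belongs to exactly one actor. The column name `Actor` is introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Actors': ['A, B', 'A, C', 'B, C', 'A, D'], 'Rating': [8.0, 6.0, 7.0, 9.0]},
    index=['m1', 'm2', 'm3', 'm4'],
)

# one row per (movie, actor) pair
per_actor = toy.assign(Actor=toy['Actors'].str.split(',')).explode('Actor')
per_actor['Actor'] = per_actor['Actor'].str.strip()

# number of movies and mean rating per actor
stats = per_actor.groupby('Actor')['Rating'].agg(['count', 'mean'])

# keep actors with at least two movies, best mean rating first
stats = stats[stats['count'] >= 2].sort_values('mean', ascending=False)
```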

In [None]:
group_by = movie.groupby(actors_to_title.keys())
# group_by_having_2plus = group_by # group_by.filter(lambda x: len(x) > 1)

In [None]:
for group in group_by:
    print(group)

In [None]:
group_by.agg(np.mean)

In [None]:
# number of movies that are greater than 120 minutes
(movie['duration'] > 120).sum()

In [None]:
# percent of movies greater than 120 minutes long
(movie['duration'] > 120).mean()

In [None]:
# normalized true/false value_counts
(movie['duration'] > 120).value_counts(normalize=True)

In [None]:
movie[titles_filter]

In [None]:
bale_movies = movie.loc[titles]
long_bale_movies = bale_movies[bale_movies['Runtime'] > 140]
long_bale_movies

In [None]:
filter_f = make_actor_filter('Christian Bale')

In [None]:
filter1 = movie['Actors'].apply(filter_f)
filter2 = movie['Runtime'] > 140
movie[filter1 & filter2]

In [None]:
# boolean row filter aligned with movie.index: True where the title is in titles
title_row_filter = pd.Series(movie.index.isin(titles), index=movie.index)
title_row_filter.head()

In [None]:
movie.index

In [None]:
movie.index.take(titles)

In [None]:
type(title_row_filter)

In [None]:
type(row_filter)

In [None]:
title_row_filter & row_filter

In [None]:
long_movies = movie[row_filter]
long_bale_movies = long_movies.loc[title_row_filter]
long_bale_movies

In [None]:
# deal with multiple actors in one field
from collections import defaultdict
actors = defaultdict(list)

# change to map
for i in movie.index:
    print(movie.loc[i].Title)
    print(movie.loc[i, 'Actors'].split(','))
    if i > 5:
        break

In [None]:
actor_series = movie['Actors'].map(lambda x: x.split(','))
# actors = movies['Actors'].map(lambda x: x.split(',')).to_dict(defaultdict(list))

In [None]:
actor_series.head()

In [None]:
# pass an initialized defaultdict as the target mapping type
actors = actor_series.to_dict(into=defaultdict(list))

In [None]:
actors

In [None]:
# inverted index
movie_actors = defaultdict(list)
for key, values in actors.items():
    for value in values:
        movie_actors[value].append(key)

In [None]:
movie_actors['Noomi Rapace']

In [None]:
movie.loc[movie_actors['Noomi Rapace']]

### Notes
1. dtype: object -- most often used for strings, but may also indicate a column that holds mixed data types, or lists, dicts, etc.
2. dtype: category -- like R's factor.  Used for a column with a small number of distinct values, which may optionally be ordered.
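
A brief sketch of both notes on hypothetical values; the category names and their order are made up for illustration.

```python
import pandas as pd

mixed = pd.Series([1, 'two', [3, 4]])          # no common type, so dtype is object
sizes = pd.Series(['small', 'large', 'small', 'medium'])

size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes_cat = sizes.astype(size_type)

codes = sizes_cat.cat.codes.tolist()           # integer position of each value in categories
at_least_medium = (sizes_cat >= 'medium').tolist()   # comparison uses the category order
largest = sizes_cat.max()
```

Only an ordered categorical supports `>=` and `max()`; the order comes from the `categories` list, not from alphabetical sorting.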

In [None]:
# sorted will automatically convert the columns to a list, and then sort them
cols = sorted(movie.columns)
cols

In [None]:
# count returns the number of non-null (i.e. non-NaN) values
actor = movie['Actors']
actor.count()

In [None]:
movie.shape

In [None]:
len(actor)

In [None]:
actor.count() + actor.isnull().sum()

In [None]:
# describe stats depend upon datatype
movie['Revenue (Millions)'].describe()

In [None]:
movie['actor_1_name'].describe()

In [None]:
# each (computed) series has the same index as the dataframe it came from
(movie['actor_1_name'].isnull().index == movie.index).all()

In [None]:
((movie['actor_1_name'] == 'Robert De Niro').index == movie.index).all()

In [None]:
m = movie.set_index('movie_title')
m.head()

In [None]:
m['cast_total_facebook_likes'].sort_values(ascending=False).head()

In [None]:
actor_1_fb_likes = movie['actor_1_facebook_likes']
actor_1_fb_likes.fillna(0).astype(int).head()

In [None]:
# be careful when computing truthiness of a Series that has NaN!
# all() skips NaN by default (skipna=True), so an all-NaN Series is treated as empty
s = pd.Series([np.nan] * 3)
print(s)
s.all()

In [None]:
# all() of an empty Series is True, so this is also True
pd.Series([np.nan] * 3).all(skipna=True)

In [None]:
s1 = pd.Series([np.nan] * 3)
# Series.append() was removed in pandas 2.0; pd.concat() is the replacement
s1 = pd.concat([s1, pd.Series([0.0])])
s1

In [None]:
s1.all(skipna=True)

In [None]:
s1.any(skipna=True)

In [None]:
# a single column is a Series
movie['movie_title'].head()

In [None]:
# a single column as part of a list, is a DataFrame
movie[['movie_title']].head()

In [None]:
# number of columns per dtype
# (DataFrame.get_dtype_counts() was removed in pandas 1.0; this is its replacement)
movie.dtypes.value_counts()

In [None]:
# select column by type
# number includes float and int
movie.select_dtypes(include=['number']).head()

In [None]:
# select column by name
# use a regex that matches column names containing one or more digits
movie.filter(regex=r'\d+').head()

In [None]:
movie.filter(regex='^act').head()

In [None]:
# filter does not throw key error, but indexing operator does
movie.filter(items=['no_key_error', 'actor_1_name']).head()

In [None]:
movie.isnull().sum()

In [None]:
movie.isnull().sum().sum()

In [None]:
# DataFrames work the same as Series with respect to '==' and 'equals'
s1 = pd.Series([1, 2, 3, np.nan])
s2 = pd.Series([1, 2, 3, np.nan])
print(f's1 is s2: {s1 is s2}')
print(f'All values are same given np.nan != np.nan: {(s1 == s2).all()}')

In [None]:
# but .equals DOES consider np.nan to equal np.nan!
s1.equals(s2)

In [None]:
# shape counts all values, including nulls
ugds.shape

In [None]:
# count number of non-null values
ugds.count()

In [None]:
ugds.count(axis='columns').value_counts()

In [None]:
# isclose is helpful
np.isclose(1.000, 0.9999, atol=1e-4)

In [None]:
# the sum across each row is nearly 0 or 1
a = ugds.sum(axis=1).apply(lambda x: np.isclose(x, 1.0, atol=3e-4) or np.isclose(x, 0.0, atol=3e-4))
a.all()

In [None]:
# from the above, either the row has all the values, or it has none of the values
# drop rows that are missing all of the values
ugds.dropna(how='all').count()

In [None]:
ugds_clean = ugds.dropna(how='all')
ugds_clean.count()

In [None]:
ugds_clean.isnull().sum().sum()

In [None]:
# create a diversity metric and apply it
ugds_clean.ge(.15).head()

In [None]:
# count the number of columns that are greater than 0.15 for each row
ugds_clean.ge(.15).sum(axis=1).sort_values(ascending=False).head()

## Data Analysis

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display
pd.options.display.max_columns = 50

In [None]:
# convert the 0,1 value to a single byte integer
col2['RELAFFIL'] = col2['RELAFFIL'].astype(np.int8)
col2.memory_usage(deep=True)

In [None]:
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()

In [None]:
best_score = movie2.nlargest(100,'imdb_score')
best_score.shape

In [None]:
best_value = best_score.nsmallest(5,'budget')
best_value

In [None]:
movie2.nlargest(100,'imdb_score').nsmallest(5,'budget')

In [None]:
movie3 = movie[['movie_title', 'title_year', 'imdb_score']]
movie3.head()

In [None]:
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).head(5)

In [None]:
# sort by title_year and imdb_score (descending), then drop duplicate title_year to keep the best title per year
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).drop_duplicates('title_year').head()

In [None]:
# create initial sort order via score
m_score = movie3.sort_values(['imdb_score'], ascending=False)

# group by year
m_score = m_score.groupby(['title_year'])

# first row in each group is highest score
m_score.first().tail()

In [None]:
# as one liner
movie3.sort_values(['title_year', 'imdb_score'], ascending=False).groupby('title_year').first().tail()
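
The sort-then-keep-first idea can be sketched on hypothetical data and compared with a `groupby(...).idxmax()` lookup, which is one alternative way to get the top row per group.

```python
import pandas as pd

toy = pd.DataFrame({
    'movie_title': ['A', 'B', 'C', 'D'],
    'title_year': [2015, 2015, 2016, 2016],
    'imdb_score': [7.1, 8.3, 6.5, 6.9],
})

# sort so the best score comes first within each year, then keep the first row per year
by_sort = (toy.sort_values(['title_year', 'imdb_score'], ascending=False)
              .drop_duplicates('title_year'))

# alternative: ask each group for the row label of its maximum score
by_idxmax = toy.loc[toy.groupby('title_year')['imdb_score'].idxmax()]
```

Both select the same movies; only the row order differs (years descending for the sort, ascending for the groupby).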

## Selection

In [None]:
college = pd.read_csv(college_path[0], index_col='INSTNM')
city = college['CITY']
city.value_counts().head()

In [None]:
# slice only works if index is sorted
try:
    df = college['Spa':'Spr']['CITY']
    print(df)
except KeyError as err:
    print(f'KeyError: {err}')

In [None]:
college.sort_index(inplace=True)

In [None]:
try:
    df = college['Sp':'Su']
    print(df.index)
except KeyError as err:
    print(err)

## Boolean Indexing

In [None]:
movie = pd.read_csv(movie_path[0], index_col='movie_title')
movie.head()

In [None]:
# number of movies that are greater than 120 minutes
(movie['duration'] > 120).sum()

In [None]:
# percent of movies greater than 120 minutes long
(movie['duration'] > 120).mean()

In [None]:
# normalized true/false value_counts
(movie['duration'] > 120).value_counts(normalize=True)

In [None]:
len(movie['duration'])

In [None]:
movie['duration'].count()

In [None]:
movie['duration'].isnull().sum()

In [None]:
# drop na's before computing the mean -- no difference
movie['duration'].dropna().gt(120).mean()

In [None]:
movie['duration'].gt(120).sum() / movie['duration'].count()

In [None]:
actors = movie[['actor_1_facebook_likes', 'actor_2_facebook_likes']].dropna()

In [None]:
(actors.iloc[:,0] > actors.iloc[:,1]).mean()

In [None]:
# numpy semantics for boolean arrays same as semantics for Pandas Boolean Series
a = np.array([True, False])
b = np.array([True, True])
a & b

In [None]:
a | b

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
college_stabbr_sorted = college_stabbr.sort_index()
# Index.is_monotonic was removed in pandas 2.0; use is_monotonic_increasing
college_stabbr_sorted.index.is_monotonic_increasing

In [None]:
# boolean series for row selection
%timeit college[college['STABBR'] == 'TX']

In [None]:
# unsorted index for row selection
%timeit college_stabbr.loc['TX']

In [None]:
# sorted index for row selection
%timeit college_stabbr_sorted.loc['TX']

In [None]:
# boolean series for row selection
states = ['TX', 'CA', 'NY']
%timeit college[college['STABBR'].isin(states)]

In [None]:
# unsorted index for row selection
%timeit college_stabbr.loc[states]

In [None]:
# sorted index for row selection
%timeit college_stabbr_sorted.loc[states]

In [None]:
# try unsorted and randomized (rather than order read in from file)
df = college.sample(len(college))

# boolean series for row selection -- rows in random order
states = ['TX', 'CA', 'NY']
%timeit df[df['STABBR'].isin(states)]

In [None]:
# verify results are the same
college = pd.read_csv(college_path[0])
college_stabbr_sorted = college.set_index('STABBR').sort_index()

r1 = college[college['STABBR'].isin(states)]
r2 = college_stabbr_sorted.loc[states]

### DataFrame equals
1. indexes must be equal and in the same order
2. columns must be equal and in the same order
3. values must be the same, np.nan compares equal to np.nan
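
These three rules can be checked on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [3, 4]})
df2 = df1.copy()

same = df1.equals(df2)                              # NaN in the same position counts as equal
elementwise = bool((df1 == df2).all().all())        # NaN != NaN under ==
cols_reordered = df1.equals(df2[['b', 'a']])        # column order differs
rows_reordered = df1.equals(df2.iloc[::-1])         # row order differs
rows_restored = df1.equals(df2.iloc[::-1].sort_index())
```

This is why the cell that follows sorts the columns of both results before comparing them.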

In [None]:
# arrange columns
cols = r1.columns.sort_values()
r1 = r1[cols]
r2 = r2[cols]
(r1.columns == r2.columns).all()

In [None]:
# INSTNM is a unique key
college['INSTNM'].value_counts().max()

In [None]:
# another way to check for uniqueness
college['INSTNM'].nunique() == college['INSTNM'].count()

In [None]:
# and the preferred way to see if the values in a series are unique
college['INSTNM'].is_unique

## Index Alignment

In [None]:
college_path = !find .. -name college.csv
college_path = college_path.sort()
college_path[0]

# read into a DataFrame (pass index_col='INSTNM' to use the institution name as the row index)
college = pd.read_csv(college_path[0])
college.head()

In [None]:
college.columns

In [None]:
college.columns.values

In [None]:
columns = college.columns
columns[1:6:2]

In [None]:
columns.min(), columns.max(), columns.isnull().sum()

In [None]:
columns + '_A'

In [None]:
s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s1

In [None]:
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
s2

In [None]:
# adding Series whose indexes are not identical and contain duplicate labels
# creates a Cartesian product for each shared label
s = s1 + s2
s

In [None]:
len(s1), len(s2), len(s)

In [None]:
# a a a b
# 0 1 2 3

# c a b a b b
# 0 1 2 3 4 5

# a's
# 0+1, 0+3, 1+1, 1+3, 2+1, 2+3
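
The pairing written out above can be turned into a count. The helper `expected_length` is hypothetical and only meant for indexes that are not identical (identical indexes are added position by position, with no product).

```python
from collections import Counter

import numpy as np
import pandas as pd

def expected_length(left_labels, right_labels):
    """Rows after aligning two non-identical indexes (outer join on labels)."""
    left, right = Counter(left_labels), Counter(right_labels)
    total = 0
    for label in set(left) | set(right):
        if label in left and label in right:
            total += left[label] * right[label]   # every pairing of the duplicates
        else:
            total += left[label] + right[label]   # unmatched rows are kept, value is NaN
    return total

s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
predicted = expected_length(s1.index, s2.index)   # a: 3*2, b: 1*3, c: 1
actual = len(s1 + s2)
```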

In [None]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(10,15))
s1 + s2

In [None]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(10,15))
s1 + s2

In [None]:
employee = pd.read_csv(emp_path[0], index_col='RACE')
employee.head()

In [None]:
# view is created, not copy
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']

In [None]:
# same information is referred to
salary1 is salary2

In [None]:
salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2

In [None]:
salary1 = salary1.sort_index()
salary1.head()

In [None]:
# values are in a different order
# they don't line up, so a Cartesian Product occurs
salary2.head()

In [None]:
salary_add = salary1 + salary2
salary_add.head()

In [None]:
salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)

In [None]:
b14 = !find .. -name baseball14.csv
b15 = !find .. -name baseball15.csv
b16 = !find .. -name baseball16.csv

In [None]:
baseball_14 = pd.read_csv(b14[0], index_col='playerID')
baseball_15 = pd.read_csv(b15[0], index_col='playerID')
baseball_16 = pd.read_csv(b16[0], index_col='playerID')
baseball_14.head()

In [None]:
# values in 14 that are not in 15
baseball_14.index.difference(baseball_15.index)

In [None]:
# values in 14 that are not in 16
baseball_14.index.difference(baseball_16.index)

In [None]:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

In [None]:
(hits_14 + hits_15).head()

In [None]:
hits_14.add(hits_15, fill_value=0).head()
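
The effect of `fill_value` can be seen on two tiny hypothetical Series whose indexes only partly overlap:

```python
import pandas as pd

year_1 = pd.Series({'p1': 10, 'p2': 5})
year_2 = pd.Series({'p2': 7, 'p3': 3})

plain_sum = year_1 + year_2                    # p1 and p3 become NaN
filled_sum = year_1.add(year_2, fill_value=0)  # the missing side is treated as 0
```

`fill_value` only helps when a label is missing on one side; if the value is missing on both sides the result is still NaN, which is why checking `hasnans` afterwards is worthwhile.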

In [None]:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()

In [None]:
hits_total.hasnans

In [None]:
len(hits_14), len(hits_15), len(hits_16), len(hits_total)

In [None]:
# take union of indexes as sets
ss = set(hits_14.index) | set(hits_15.index) | set(hits_16.index)
len(ss)

In [None]:
set(hits_total.index) == ss

In [None]:
# let's check the above
columns = college_n2.columns
max_indexes = []
for col in columns:
    df = college[college[col] == college[col].max()]
    max_indexes.append((col, df.index.values[0]))
    
max_indexes

In [None]:
df = pd.DataFrame(max_indexes, columns='column INSTNM'.split())
df = df.set_index('column')
series = df['INSTNM']
series.index.name = None
series

In [None]:
# it worked!
series.equals(max_cols)

In [None]:
# better way to convert a list of tuples to a Series with an index
instnm = [x[1] for x in max_indexes]
index = [x[0] for x in max_indexes]
s = pd.Series(data=instnm, index=index)
s

In [None]:
%%timeit
columns = college_n2.columns
max_indexes = []
for col in columns:
    df = college[college[col] == college[col].max()]
    max_indexes.append((col, df.index.values[0]))

In [None]:
# this is 10 times faster
%timeit college_n2.idxmax()

In [None]:
%%timeit
# above appears to be O(n^2) to find max and then find row with max
# this should be much faster -- it is!
columns = college_n2.columns
max_indexes = []
for col in columns:
    max_indexes.append((col, college[col].idxmax()))
    
instnm = [x[1] for x in max_indexes]
index = [x[0] for x in max_indexes]
s = pd.Series(data=instnm, index=index)

### Examining the groupby object

In [None]:
grouped = college.groupby(['STABBR', 'RELAFFIL'])
type(grouped)

In [None]:
print([attr for attr in dir(grouped) if not attr.startswith('_')])

In [None]:
grouped.ngroups

In [None]:
groups = list(grouped.groups.keys())
groups[:6]

In [None]:
grouped.get_group(('FL', 1)).head()

In [None]:
from IPython.display import display
i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break

In [None]:
grouped.head(2).head(6)

### Grouping by continuous variables

In [None]:
flights = pd.read_csv('data/flights.csv')
flights.head()

In [None]:
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(flights['DIST'], bins=bins)
cuts.head()

In [None]:
cuts.value_counts()

In [None]:
flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True).round(3).head(40)
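
A reduced sketch of the same binning on hypothetical distances. The `labels` are made up for readability; the cells above keep the default interval labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'DIST': [100, 300, 1500, 2500, 200],
    'AIRLINE': ['AA', 'AA', 'UA', 'UA', 'DL'],
})

bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
labels = ['<=200', '200-500', '500-1000', '1000-2000', '>2000']   # hypothetical labels
cuts = pd.cut(toy['DIST'], bins=bins, labels=labels)   # bins are closed on the right

# observed=True keeps only the bins that actually occur in the data
counts = toy.groupby(cuts, observed=True).size()
```

Because the bins are closed on the right by default, a distance of exactly 200 falls in the first bin rather than the second.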

In [408]:
# although unusual and unnecessary, year could be considered an ordered categorical type
from pandas.api.types import CategoricalDtype
years = movie['Year'].unique()
years.sort()
dtype = CategoricalDtype(categories=years, ordered=True)
dtype

CategoricalDtype(categories=[2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
                  2016],
                 ordered=True)

In [409]:
year = movie['Year'].astype(dtype)
year

Title
Guardians of the Galaxy                    2014
Prometheus                                 2012
Split                                      2016
Sing                                       2016
Suicide Squad                              2016
The Great Wall                             2016
La La Land                                 2016
Mindhorn                                   2016
The Lost City of Z                         2016
Passengers                                 2016
Fantastic Beasts and Where to Find Them    2016
Hidden Figures                             2016
Rogue One                                  2016
Moana                                      2016
Colossal                                   2016
The Secret Life of Pets                    2016
Hacksaw Ridge                              2016
Jason Bourne                               2016
Lion                                       2016
Arrival                                    2016
Gold                              