# Pandas Introduction Part 3

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv


The notebook demonstrates data wrangling for columns that pack several comma-separated values into one cell.

In [35]:
import pandas as pd
import numpy as np

## Read Movie Data Set

In [36]:
import pickle
with open('../data/movies.pickle','rb') as p:
    movies = pickle.load(p)

In [37]:
# display first 2 rows of the DataFrame
movies.head(2)

Title,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


## Data Wrangling: Queries involving Genre
Each cell in the Genre column is a single string.  This string consists of several genres separated by commas.  The Actors column is the same.

With data in this format, it is not obvious how to answer queries such as "which genre has the highest average rating?" or "which actor has the highest average rating?".

There are (at least) 3 ways to wrangle this column's data:
1. Create an indicator column for the presence/absence of each genre for each movie.
2. Create a mapping of genre to titles and a custom "groupby" that yields the group of movies for each genre.  The normal Pandas groupby won't work for genre or actors.
3. "Denormalize" the data.

In all cases, it is helpful to begin by creating a list of unique Genres (or unique Actors) before doing the data analysis.

Briefly, the pros and cons of the above are:
1. This may be best if a machine learning algorithm is to be used on the data.  It also makes data analysis easy and it is easy to set up.
2. This solution is easy to implement for a senior-level developer, but may be difficult to code and maintain for a junior data analyst.  It offers good speed and good memory usage.
3. This is the easiest form for a data analyst to work with, and is relatively easy to set up.  However it could require significant additional memory.

Examples of each of the above are shown below.

### Quick Look at the Data

In [38]:
# Look at Genre as is
movies['Genre'].head(5)

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
Name: Genre, dtype: object

In [39]:
# typical way of finding unique values
# works when cell contains one value
movies['Director'].value_counts().head()

Ridley Scott          8
Michael Bay           6
David Yates           6
Paul W.S. Anderson    6
M. Night Shyamalan    6
Name: Director, dtype: int64

In [40]:
# this doesn't work as needed for Genre or Actors
movies['Genre'].value_counts().head()

Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64

The above is count per genre string, not per genre.
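To make the difference concrete, here is a minimal sketch on a toy DataFrame (the titles and values are made up for illustration, not taken from the dataset). It assumes pandas 0.25 or later, where `Series.explode` is available, and shows one possible way to count individual genres instead of whole strings.

```python
import pandas as pd

# Toy data (hypothetical titles), same comma-separated layout as the Genre column
toy = pd.DataFrame(
    {"Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# split each string into a list, then give every list element its own row
per_genre = toy["Genre"].str.split(",").explode()

# value_counts now counts single genres rather than genre strings
genre_counts = per_genre.value_counts()
print(genre_counts)
```

Here "Action" and "Drama" are each counted twice, even though no two movies share the same genre string.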

### Write Parser for Genre (or Actors) String

In [46]:
def parse_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    values = s.split(',')
    return [value.strip() for value in values]

#### Use Parser
1. use the Pandas 'apply' method to avoid explicit iteration over all titles
2. note: index of result Series is Title
3. note: value of result Series is list of genres

The same mapping could have been created using a defaultdict(list) and explicitly iterating over each movie, but Pandas makes this easier and possibly faster.

In [25]:
# parse_string is the parser defined above
title_genres_series = movies['Genre'].apply(parse_string)
title_genres_series.head()

Title
Guardians of the Galaxy     [Action, Adventure, Sci-Fi]
Prometheus                 [Adventure, Mystery, Sci-Fi]
Split                                [Horror, Thriller]
Sing                        [Animation, Comedy, Family]
Suicide Squad              [Action, Adventure, Fantasy]
Name: Genre, dtype: object
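For comparison, the defaultdict(list) alternative mentioned above might look roughly like the following. This is only a sketch on toy data; the name `title_to_genres` and the titles are hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Toy data (hypothetical titles); note the stray space after one comma
toy = pd.DataFrame(
    {"Genre": ["Action, Sci-Fi", "Drama"]},
    index=pd.Index(["Movie A", "Movie B"], name="Title"),
)

# explicit iteration: one dictionary entry per title, holding a list of genres
title_to_genres = defaultdict(list)
for title, genre_string in toy["Genre"].items():
    for genre in genre_string.split(","):
        title_to_genres[title].append(genre.strip())

print(dict(title_to_genres))
```

The result is a plain dictionary rather than a Series, so it loses the index alignment that the 'apply' version keeps.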

For each title, add the list of genres to the genre set.

In [48]:
genres = set()

title_genres_series.apply(lambda x: genres.update(x))
genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}
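Calling 'apply' only for its side effect works, but it also builds a throwaway Series of None values. One alternative, sketched here on a toy Series of lists (the values are illustrative), is to take the union of all lists directly.

```python
import pandas as pd

# Toy stand-in for the Series of genre lists
genre_lists = pd.Series([["Action", "Sci-Fi"], ["Drama"], ["Action", "Drama"]])

# unpacking the Series passes each list to set.union as a separate argument
unique_genres = set().union(*genre_lists)
print(sorted(unique_genres))
```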

### Create Indicator Variables
It is helpful to prefix the name of each new column so they can be treated together.

Each indicator column will have True for presence of genre and False for absence of genre.

In [55]:
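for genre in genres:
    # caution: `genre in x` is a substring test on the raw string, so 'Music'
    # also matches 'Musical'; `genre in parse_string(x)` would test exact genres
    movies['Genre_'+genre] = movies['Genre'].apply(lambda x: genre in x)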

In [56]:
# select all columns that begin with Genre_
genre_indicators = movies.filter(regex='^Genre_', axis='columns')
genre_indicators.head(2)

Title,Genre_Crime,Genre_Action,Genre_Drama,Genre_Musical,Genre_Romance,Genre_Western,Genre_Fantasy,Genre_Mystery,Genre_Biography,Genre_Music,Genre_Adventure,Genre_Thriller,Genre_Animation,Genre_Sport,Genre_Sci-Fi,Genre_War,Genre_Comedy,Genre_Horror,Genre_Family,Genre_History
Guardians of the Galaxy,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,True,False,False,True,False,False,False,True,False,False,False,False,False


In [57]:
# Pandas uses bool for datatype for each column, so memory usage is efficient
genre_indicators.dtypes.value_counts()

bool    20
dtype: int64
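As a point of comparison, pandas can also build indicator columns in one step with `Series.str.get_dummies`. The sketch below uses toy data with hypothetical titles and assumes there is no whitespace around the commas, since `get_dummies` does not strip it. It also shows why exact matching matters: the string 'Music' is a substring of 'Musical'.

```python
import pandas as pd

# Toy data (hypothetical titles) with two genres that share a prefix
toy = pd.DataFrame(
    {"Genre": ["Action,Sci-Fi", "Musical", "Music,Drama"]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# one 0/1 column per distinct genre, converted to bool and prefixed
indicators = toy["Genre"].str.get_dummies(sep=",").astype(bool).add_prefix("Genre_")
print(indicators)

# a plain substring test would flag the Musical-only movie as Music
substring_hit = "Music" in toy.loc["Movie B", "Genre"]
exact_hit = bool(indicators.loc["Movie B", "Genre_Music"])
print(substring_hit, exact_hit)
```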

#### Using Indicator Variables to Compute Movies per Genre

In [71]:
movies_per_genre = genre_indicators.sum(axis='rows').to_dict()
movies_per_genre

{'Genre_Crime': 150,
 'Genre_Action': 303,
 'Genre_Drama': 513,
 'Genre_Musical': 5,
 'Genre_Romance': 141,
 'Genre_Western': 7,
 'Genre_Fantasy': 101,
 'Genre_Mystery': 106,
 'Genre_Biography': 81,
 'Genre_Music': 21,
 'Genre_Adventure': 259,
 'Genre_Thriller': 195,
 'Genre_Animation': 49,
 'Genre_Sport': 18,
 'Genre_Sci-Fi': 120,
 'Genre_War': 13,
 'Genre_Comedy': 279,
 'Genre_Horror': 119,
 'Genre_Family': 51,
 'Genre_History': 29}

It is often easier to work with indicator columns, if there are not too many of them.  
After the one time creation of the indicator columns, queries are fast.
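As an illustration of such a query, the following sketch computes the mean rating per indicator column. The data is a toy frame with made-up ratings, and `mean_rating` is a hypothetical name.

```python
import pandas as pd

# Toy frame: a Rating column plus two boolean indicator columns
toy = pd.DataFrame(
    {
        "Rating": [8.0, 6.0, 7.0],
        "Genre_Action": [True, False, True],
        "Genre_Drama": [False, True, True],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# each boolean column doubles as a row mask for the ratings
indicator_cols = toy.filter(regex="^Genre_").columns
mean_rating = {col: toy.loc[toy[col], "Rating"].mean() for col in indicator_cols}
print(mean_rating)
```

A movie with several genres contributes to every matching column, so the groups overlap by design.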

## Create Mapping of Genre to Titles

Make use of genre indicator columns.

In [98]:
genre_to_titles = {}
for genre in genres:
    titles = movies[movies['Genre_'+genre]].index.tolist()
    genre_to_titles[genre] = titles

### Use Genre to Titles Mapping to Compute Movies per Genre

In [99]:
movies_per_genre2 = {'Genre_'+genre:len(titles) for genre,titles in genre_to_titles.items()}

In [100]:
# verify results are the same
movies_per_genre == movies_per_genre2

True

### Use Genre to Titles Mapping to Define Custom (Overlapping) Groups
The genre to titles mapping is a one-to-many mapping.

Generators are memory efficient.  The dataframe per value is created only as necessary and if there is no reference to it, Python's garbage collector can reclaim the memory.

In [130]:
def custom_groupby(df, mapping, values):
    for value in values:
        yield (value, df.loc[mapping[value]])

In [136]:
groups = custom_groupby(movies, genre_to_titles, genres)
for name, group in groups:
    print(f"{name:<9} Rating: {group['Rating'].mean():4.2f} Count: {len(group):>3}")

Crime     Rating: 6.79 Count: 150
Action    Rating: 6.61 Count: 303
Drama     Rating: 6.95 Count: 513
Musical   Rating: 6.94 Count:   5
Romance   Rating: 6.69 Count: 141
Western   Rating: 6.77 Count:   7
Fantasy   Rating: 6.55 Count: 101
Mystery   Rating: 6.89 Count: 106
Biography Rating: 7.29 Count:  81
Music     Rating: 7.04 Count:  21
Adventure Rating: 6.77 Count: 259
Thriller  Rating: 6.59 Count: 195
Animation Rating: 7.32 Count:  49
Sport     Rating: 7.01 Count:  18
Sci-Fi    Rating: 6.72 Count: 120
War       Rating: 7.35 Count:  13
Comedy    Rating: 6.65 Count: 279
Horror    Rating: 6.09 Count: 119
Family    Rating: 6.68 Count:  51
History   Rating: 7.13 Count:  29
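The lazy behaviour of a generator can be checked with a small experiment in plain Python. This sketch mirrors the shape of custom_groupby but uses hypothetical names (`lazy_groups`, `built`) and a toy mapping.

```python
built = []  # records which groups have actually been produced


def lazy_groups(mapping, keys):
    """Yield (key, values) pairs one at a time, like custom_groupby."""
    for key in keys:
        built.append(key)
        yield key, mapping[key]


toy_mapping = {"Action": ["Movie A", "Movie C"], "Drama": ["Movie B"]}

groups = lazy_groups(toy_mapping, ["Action", "Drama"])
built_before = list(built)  # still empty: creating the generator does no work

first = next(groups)        # only now is the first group produced
built_after = list(built)
print(built_before, built_after, first)
```

One consequence is that a generator can be consumed only once, which is why the notebook calls custom_groupby again before each loop.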


In [144]:
# same, but store result in df
rows = []
groups = custom_groupby(movies, genre_to_titles, genres)
for name, group in groups:
    row = (name, group['Rating'].mean(), len(group))
    rows.append(row)
df = pd.DataFrame(data=rows, columns=['Genre', 'Rating', 'Count'])
df.set_index('Genre', inplace=True)
df

Genre,Rating,Count
Crime,6.786667,150
Action,6.614521,303
Drama,6.953801,513
Musical,6.94,5
Romance,6.685816,141
Western,6.771429,7
Fantasy,6.548515,101
Mystery,6.886792,106
Biography,7.290123,81
Music,7.042857,21


In [145]:
# display the 5 highest avg ratings by genre in Ratings order
df.nlargest(n=5, columns='Rating')

Genre,Rating,Count
War,7.353846,13
Animation,7.32449,49
Biography,7.290123,81
History,7.127586,29
Music,7.042857,21


In [146]:
# display the 5 highest avg ratings by genre having at least 20 movies
df_20 = df[df['Count'] >= 20]
df_20.nlargest(5, 'Rating')

Genre,Rating,Count
Animation,7.32449,49
Biography,7.290123,81
History,7.127586,29
Music,7.042857,21
Drama,6.953801,513


## Denormalize and then Analyze

In [153]:
# create the one-to-many "table"
rows = []
for key in genre_to_titles:
    for value in genre_to_titles[key]:
        rows.append((key, value))
df_genre_title = pd.DataFrame(data=rows, columns=['Genre', 'Title'])
df_genre_title.set_index('Title', inplace=True)
df_genre_title.head()

Title,Genre
Hounds of Love,Crime
John Wick,Crime
The Dark Knight,Crime
Don't Breathe,Crime
The Girl on the Train,Crime


In [160]:
# reload the original movies so the Genre_ indicator columns are not carried into the join
import pickle
with open('../data/movies.pickle','rb') as p:
    movies = pickle.load(p)

In [162]:
def drop_col(df, suffix):
    to_drop = [x for x in df if x.endswith(suffix)]
    df.drop(to_drop, axis=1, inplace=True)

In [166]:
# join movies with the one-to-many table
df_full = df_genre_title.join(movies, lsuffix='', rsuffix='_right')
drop_col(df_full, '_right')
df_full.head(2)

Title,Genre,ID,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Romance,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0


In [172]:
# create a multi-index on Genre, Title
df_full.reset_index(inplace=True)
df_full.set_index(['Genre', 'Title'], inplace=True)
df_full.sort_index(inplace=True)

In [176]:
len(df_full), len(movies)

(2560, 1000)

In [175]:
# now group by will work for each Genre
groupby = df_full.groupby('Genre')
groupby['Rating'].mean()

Genre
Action       6.614521
Adventure    6.772201
Animation    7.324490
Biography    7.290123
Comedy       6.647670
Crime        6.786667
Drama        6.953801
Family       6.684314
Fantasy      6.548515
History      7.127586
Horror       6.089916
Music        7.042857
Musical      6.940000
Mystery      6.886792
Romance      6.685816
Sci-Fi       6.716667
Sport        7.011111
Thriller     6.593333
War          7.353846
Western      6.771429
Name: Rating, dtype: float64
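The denormalized table above was assembled from the genre_to_titles mapping and a join. A shorter route, sketched here on toy data with made-up titles and ratings, is to split the Genre strings and call `DataFrame.explode`. This assumes pandas 0.25 or later and is not the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical titles and ratings)
toy = pd.DataFrame(
    {
        "Genre": ["Action,Sci-Fi", "Drama", "Action,Drama"],
        "Rating": [8.0, 6.0, 7.0],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# one row per (title, genre) pair; the other columns are repeated
long = toy.assign(Genre=toy["Genre"].str.split(",")).explode("Genre")

# the normal groupby now works because each row has exactly one genre
mean_by_genre = long.groupby("Genre")["Rating"].mean()
print(len(toy), len(long))
print(mean_by_genre)
```

The row count grows from 3 to 5 here for the same reason it grows from 1000 to 2560 above: each movie is repeated once per genre.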

In [None]:
# compute count per genre using the genre_to_titles dictionary
genres_to_count = {genre:len(titles) for genre,titles in genre_to_titles.items()}
genres_to_count

In [None]:
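# note: a dict passed to groupby maps index labels (titles) to group names,
# so genre_to_titles (genre -> titles) does not yield per-genre groups here;
# use custom_groupby for overlapping groups
g = movies.groupby(by=genre_to_titles)
for name, group in g:
    print(name)
    print(group['Genre'])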

In [None]:
# find highest rated action movie on the fly
action_titles = movies['Genre'].apply(lambda x: 'Action' in x)
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
# find highest rated action movie using Indicator Variable: Genre_Action
action_movies = movies[movies['Genre_Action']]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
# find the highest rated action movie using dictionary
action_titles = genre_to_titles['Action']
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
%%timeit
# find highest rated action movie on the fly
action_titles = movies['Genre'].apply(lambda x: 'Action' in x)
action_movies = movies[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
%%timeit
# find highest rated action movie using precomputed Indicator Variable: Genre_Action
action_movies = movies[movies['Genre_Action']]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
%%timeit
# find the highest rated action movie using precomputed dictionary
action_titles = genre_to_titles['Action']
action_movies = movies.loc[action_titles]
action_movies[action_movies['Rating'] == action_movies['Rating'].max()]

In [None]:
# find Sci-Fi movies
sci_fi_movies = movies[movies['Genre_Sci-Fi']]
sci_fi_movies.head(2)

In [None]:
# find movies that are both action and sci-fi
action_sci_fi_movies = movies[movies['Genre_Action'] & movies['Genre_Sci-Fi']]
action_sci_fi_movies.head(3)

In [None]:
# find highest rated movie that is both Action and Sci-Fi
movie_subset = movies[movies['Genre_Action'] & movies['Genre_Sci-Fi']]
movie_subset[movie_subset['Rating'] == movie_subset['Rating'].max()]

In [None]:
# find the avg rating of all action movies
action_movies = movies[movies['Genre_Action']]
action_movies['Rating'].mean()

In [None]:
# usually groupby is not used on a boolean series, but it works
# the group of interest is the True group
groupby = movies.groupby('Genre_Action')

for name, group in groupby:
    print(name, np.round(group['Rating'].mean(),4), group['Rating'].count())

### Group By
Find the highest rated directors who directed at least 5 movies

In [None]:
def group_filter(g):
    if g['Rating'].count() >= 5:
        return g['Rating'].mean()

In [None]:
s = movies.groupby('Director').apply(group_filter)
s[s.notnull()].nlargest(5)
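An alternative that avoids returning None from the applied function is to aggregate first and filter afterwards. This is a sketch on toy data (hypothetical directors and ratings) with a threshold of 2 movies so that the toy frame stays small.

```python
import pandas as pd

# Toy data (hypothetical directors and ratings)
toy = pd.DataFrame(
    {
        "Director": ["Director A", "Director A", "Director B"],
        "Rating": [8.0, 6.0, 9.0],
    }
)

# compute mean and count per director in one pass
stats = toy.groupby("Director")["Rating"].agg(["mean", "count"])

# keep directors with enough movies, then rank by mean rating
min_movies = 2
top = stats[stats["count"] >= min_movies].nlargest(1, "mean")
print(top)
```

Director B has the higher rating but only one movie, so it is filtered out before ranking.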

## Data Wrangling: Queries involving Actor
As there are so many different actors, it's probably better to rely on dictionaries instead of creating a great many indicator columns.

In [None]:
def parse_genre_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    genres = s.split(',')
    return [genre.strip() for genre in genres]

In [None]:
# parsing of Genre is identical to parsing of Actors
movies['Actors'].apply(parse_genre_string).head(5)

In [None]:
# find the set of unique actors
actors = set()
title_actors_series = movies['Actors'].apply(parse_genre_string)
title_actors_series.apply(lambda x: actors.update(x))
len(actors)

#### Using an Inverted Dictionary to Compute Movies per Actor

In [None]:
# create a mapping from actor to titles
from collections import defaultdict
actor_to_titles = defaultdict(list)
for actor in actors:
    titles = movies['Actors'].apply(lambda x: actor in x)
    titles = movies[titles].index.tolist()
    for title in titles:
        actor_to_titles[actor].append(title)
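The loop above scans the whole Actors column once per actor and matches names by substring. A single pass over the movies, sketched below on toy data with hypothetical actor names, builds the same kind of inverted dictionary and matches names exactly.

```python
from collections import defaultdict

import pandas as pd

# Toy data (hypothetical titles and actor names)
toy = pd.DataFrame(
    {"Actors": ["Ann Lee, Bo Chan", "Bo Chan, Cy Diaz"]},
    index=pd.Index(["Movie A", "Movie B"], name="Title"),
)

# one pass over the movies: append each title to every actor it lists
toy_actor_to_titles = defaultdict(list)
for title, actor_string in toy["Actors"].items():
    for actor in actor_string.split(","):
        toy_actor_to_titles[actor.strip()].append(title)

print(dict(toy_actor_to_titles))
```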

In [None]:
# compute count per actor using the actor_to_titles dictionary
actors_to_count = {actor:len(titles) for actor,titles in actor_to_titles.items()}
len(actors_to_count)

In [None]:
# try inverted dictionary
movies.loc[actor_to_titles['Zooey Deschanel']]

In [None]:
# find the average rating of actors in at least 5 movies
actor_rating = {}
for actor in actors:
    movies_with_actor = movies.loc[actor_to_titles[actor]]
    if movies_with_actor['Rating'].count() >= 5:
        actor_rating[actor] = movies_with_actor['Rating'].mean()

In [None]:
# read in IMDB movie dataset
movies = pd.read_csv('../data/IMDB-Movie-Data.csv', index_col='Title')
movies.head(2)

A dictionary is more memory efficient than denormalizing the table, but denormalizing is much easier to work with!

In [None]:
def custom_groupby(df, mapping, values):
    for value in values:
        yield (value, df.loc[mapping[value]])

In [None]:
movies = movies_copy.copy()
movies = movies.reset_index(drop=False).set_index('Title', drop=True)

In [None]:
movies.head(1)

In [None]:
np.std([1, 2, 3])

In [None]:
df = pd.DataFrame([1, 2, 3], columns=['A'])
df.apply(np.std)

In [None]:
df.columns

In [None]:
my_group = custom_groupby(movies, actor_to_titles, ['Zooey Deschanel', 'John Goodman'])

In [None]:
from IPython.display import display
for name, group in my_group:
    print(name, len(group), group['Rating'].mean(), group['Metascore'].std())
    print()
    display(group)

In [None]:
def gen1(n):
    a = 0
    b = 1
    for i in range(n):
        a, b = a+b, a
        yield a

In [None]:
g = gen1(7)

In [None]:
tuples = []
for actor in actors:
    for title in actor_to_titles[actor]:
        tuples.append((actor, title))
        
df_actor = pd.DataFrame(data=tuples, columns=['Actor','Title'])
df_actor.set_index('Actor', drop=True, inplace=True)

In [None]:
df_actor.head()

In [None]:
df = df_actor.join(movies, on='Title', how='left')
df.head()

In [None]:
len(df_actor)

In [None]:
len(movies.index)

In [None]:
len(set(movies.index))

In [None]:
m2 = movies.reset_index()
b = m2.duplicated(subset='Title', keep='first')
m2[b]

In [None]:
b = m2.duplicated(subset='Title', keep='last')
m2[b]

In [None]:
df2 = df.set_index(index)

In [None]:
actor_to_titles

In [None]:
df_actor.head()

In [None]:
tuples = []
for actor in actors:
    for title in actor_to_titles[actor]:
        tuples.append((actor, title))
        
index = pd.MultiIndex.from_tuples(tuples, names=['Actor', 'Title'])        

In [None]:
movies.set_index(index)

In [None]:
# it's easier to work with dataframes than dictionaries
dfs = []
for actor in actors:
    # copy so that adding a column does not write to a slice of movies
    df = movies.loc[actor_to_titles[actor]].copy()
    df['Actor'] = actor
    df.set_index('Actor', drop=False, inplace=True)
    dfs.append(df)
df_new = pd.concat(dfs)
df_new.head()

In [None]:
df_actor_rating = pd.DataFrame.from_dict(actor_rating, orient='index')
# rename_axis returns a new DataFrame, so assign the result
df_actor_rating = df_actor_rating.rename_axis('Actor')
df_actor_rating.columns = ['Rating']
df_actor_rating.nlargest(n=5, columns='Rating')

In [None]:
df_actor_rating.columns

In [None]:
df_actor_rating.index

In [None]:
# sort by value
actor_rating_sorted = sorted(actor_rating.items(), key=lambda x: x[1], reverse=True)
for actor, rating in actor_rating_sorted:
    print(f'{actor}: {rating:4.2f}')

In [None]:
# get the average rating for each actor in at least 5 movies
actor_rating = {}
for actor in actor_to_titles:
    titles = actor_to_titles[actor]
    movies_with_actor = movies.loc[titles]
    count = movies_with_actor['Rating'].count()
    if count >= 5:
        actor_rating[actor] = movies_with_actor['Rating'].mean()

In [None]:
actor_rating_sorted = sorted(actor_rating.items(), key=lambda kv: kv[1], reverse=True)
actor_rating_sorted[:10]

In [None]:
movies.loc[actor_to_titles['Leonardo DiCaprio']]

In [None]:
# with dictionaries, it's slightly harder to query for two actors than with indicator columns
# find all movies with Leonardo DiCaprio and Matt Damon
titles1 = actor_to_titles['Leonardo DiCaprio']
titles2 = actor_to_titles['Matt Damon']
titles = set(titles1).intersection(set(titles2))
titles
# .loc needs a list-like indexer, not a set
movies.loc[sorted(titles)]
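The two-actor query generalizes to any number of actors by intersecting one set of titles per actor. The helper below is a hypothetical sketch on a toy mapping and assumes at least one actor is passed.

```python
# Toy inverted dictionary (hypothetical actors and titles)
toy_actor_to_titles = {
    "Ann Lee": ["Movie A", "Movie B"],
    "Bo Chan": ["Movie B", "Movie C"],
    "Cy Diaz": ["Movie B"],
}


def titles_with_all(mapping, actors):
    """Return the sorted titles that feature every actor in `actors`."""
    title_sets = [set(mapping[actor]) for actor in actors]
    # set.intersection needs at least one set, so `actors` must not be empty
    return sorted(set.intersection(*title_sets))


shared = titles_with_all(toy_actor_to_titles, ["Ann Lee", "Bo Chan"])
print(shared)
```

The returned list can be passed straight to `.loc` to retrieve the matching rows.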