# Pandas Tidy Data

## Overview

This notebook uses the IMDB dataset from Kaggle, after it has been processed by the preceding notebooks.  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

This notebook shows how to tidy the Genre and Actors columns, so that groupby and other methods work as expected.

There are many ways to convert the list of actors and genres into tidy format.  Two general approaches stand out:
1. melt() and merge()
2. stack() and join()

melt and merge are more general than stack and join. However, melt and merge don't retain the index, which in this case means extra calls to reset_index and set_index.

The simplest set of transformation steps I have found for this dataset uses stack and join.
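
As a rough illustration of the trade-off described above, the sketch below runs both approaches on a two-row toy table. The `toy` data and the use of `str.split(expand=True)` to build the wide intermediate are assumptions made for this example; they are not the steps used later in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the movies table, indexed by Title.
toy = pd.DataFrame(
    {"Genre": ["Comedy,Drama", "Horror"], "Year": [2009, 2016]},
    index=pd.Index(["A", "B"], name="Title"),
)

# One column per list position; missing positions are null.
wide = toy["Genre"].str.split(",", expand=True)

# Approach 1: melt() and merge(). melt discards the index, so Title
# must become a column first and be restored as the index afterwards.
melted = (
    wide.reset_index()
    .melt(id_vars="Title", value_name="Genre")
    .dropna(subset=["Genre"])
    .drop(columns="variable")
    .merge(toy.drop(columns="Genre").reset_index(), on="Title")
    .set_index("Title")
)

# Approach 2: stack() and join(). The Title index is carried through.
stacked = (
    wide.stack()
    .dropna()  # some pandas versions keep null entries when stacking
    .rename("Genre")
    .reset_index(level=-1, drop=True)
    .to_frame()
    .join(toy.drop(columns="Genre"))
)
```

Both results hold the same (Title, Genre) pairs; the melt version needs the extra `reset_index` and `set_index` calls to get there.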

In [1]:
import pandas as pd
import numpy as np

In [2]:
print(f'Pandas Version: {pd.__version__}')
print(f'Numpy Version:  {np.__version__}')

Pandas Version: 0.24.1
Numpy Version:  1.15.4


## Read Movie Data Set
This data was read in and preprocessed in Pandas02.  Load the saved result.

In [3]:
movies = pd.read_hdf('../data/movies.h5')
movies.head(2)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
10 Cloverfield Lane,119,"Drama,Horror,Mystery","After getting in a car accident, a woman is he...",Dan Trachtenberg,"John Goodman, Mary Elizabeth Winstead, John Ga...",2016,104,7.2,192968,71.9,76.0


### Examine Genre and Actors

In [4]:
movies['Genre'].head(5)

Title
(500) Days of Summer         Comedy,Drama,Romance
10 Cloverfield Lane          Drama,Horror,Mystery
10 Years                     Comedy,Drama,Romance
12 Years a Slave          Biography,Drama,History
127 Hours               Adventure,Biography,Drama
Name: Genre, dtype: object

In [5]:
# typical way of finding unique values
# only works when cell contains one value, as it does for Director
movies['Director'].value_counts().head()

Ridley Scott          8
Paul W.S. Anderson    6
David Yates           6
Michael Bay           6
M. Night Shyamalan    6
Name: Director, dtype: int64

In [6]:
# this doesn't work as needed for Genre
movies['Genre'].value_counts().head()

Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64

The above is count per genre string, not per genre.

In [7]:
# this doesn't work as needed for Actors
movies['Actors'].value_counts().head()

Daniel Radcliffe, Emma Watson, Rupert Grint, Michael Gambon            2
Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Woody Harrelson    2
Gerard Butler, Aaron Eckhart, Morgan Freeman,Angela Bassett            2
Shia LaBeouf, Megan Fox, Josh Duhamel, Tyrese Gibson                   2
Kate Mara, Anya Taylor-Joy, Rose Leslie, Michael Yare                  1
Name: Actors, dtype: int64

The above is count per actor string, not per actor.
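
To make the difference between counting strings and counting individual values concrete, here is a minimal sketch on a made-up three-row `genre` Series. It is a quick check only, assuming comma-separated values; it does not produce the full long-form table built in the next section.

```python
import pandas as pd

# Hypothetical miniature of the Genre column.
genre = pd.Series(
    ["Comedy,Drama", "Drama", "Comedy,Drama"],
    index=pd.Index(["A", "B", "C"], name="Title"),
    name="Genre",
)

# Counts per distinct string: what value_counts() gives on the raw column.
per_string = genre.value_counts()

# Counts per individual genre: split into columns, stack into one long
# Series, strip stray whitespace, then count.
per_genre = (
    genre.str.split(",", expand=True)
    .stack()
    .dropna()
    .str.strip()
    .value_counts()
)
```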

## Create Long Form Representation of Genre

This has one row per (Genre, Title).

**Method Chaining Note**  
Method chaining is a concise way to represent a data processing pipeline, but it is difficult to debug.  It is easier to develop the pipeline by adding one method at a time.
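
One possible way to inspect a chain without breaking it apart is to pass intermediate results through `pipe`. The `peek` helper below is hypothetical (it is not part of pandas or of this notebook); it prints the shape and returns its input unchanged.

```python
import pandas as pd

def peek(obj, label):
    """Print the shape of an intermediate result and pass it through unchanged."""
    print(f"{label}: shape={obj.shape}")
    return obj

# Made-up data for illustration.
toy = pd.Series(["a,b", "c"], name="letters")

result = (
    toy.str.split(",", expand=True)
    .pipe(peek, "after split")
    .stack()
    .dropna()
    .pipe(peek, "after stack")
)
```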

**Step by Step**
1. Write a parse function to convert a comma separated string into a Python list
2. Transform Genre Column from string to list of Genres using the parse function
3. Transform list of Genres into pd.Series
4. Stack (which takes the innermost column index and makes it the innermost row index)
5. Set the series name (creates a better name for the following join)
6. Remove the unnecessary innermost row index created by stack
7. Convert to a DataFrame (to allow for join)
8. Join with movies on Title
9. Remove the original Genre column

In [8]:
def parse_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    values = s.split(',')
    return [value.strip() for value in values]

In [9]:
tmp = movies['Genre'].transform(parse_string)
tmp = tmp.transform(pd.Series)
tmp = tmp.stack()
tmp = tmp.rename('Genre')
tmp = tmp.reset_index(level=-1, drop=True)
tmp = tmp.to_frame()
tmp = tmp.join(movies, on='Title', rsuffix='_Original')
tmp = tmp.drop(['Genre_Original'], axis=1)
tmp.head(4)

Title,Genre,Rank,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Comedy,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Romance,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
10 Cloverfield Lane,Drama,119,"After getting in a car accident, a woman is he...",Dan Trachtenberg,"John Goodman, Mary Elizabeth Winstead, John Ga...",2016,104,7.2,192968,71.9,76.0


In [10]:
# save the result
df_genre = tmp.copy()

In [11]:
def is_unique(df, cols):
    """Fast determination of multi-column uniqueness."""
    return not (df.duplicated(subset=cols)).any()

In [12]:
# reset_index() to allow for checking against columns (without index)
# could also set_index(['Title','Genre']) and use df.index.duplicated()
is_unique(df_genre.reset_index(), ['Title', 'Genre'])

True
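
The index-based alternative mentioned in the comment above might look like the following sketch. The `long_form` frame is made-up data; `append=True` is used so the existing Title index is kept rather than replaced.

```python
import pandas as pd

# Hypothetical long-form table: one row per (Title, Genre).
long_form = pd.DataFrame(
    {"Genre": ["Comedy", "Drama", "Drama"]},
    index=pd.Index(["A", "A", "B"], name="Title"),
)

# Append Genre to the existing Title index, then test the MultiIndex.
keyed = long_form.set_index("Genre", append=True)
no_dups = not keyed.index.duplicated().any()

# Index.is_unique answers the same question directly.
same_answer = no_dups == keyed.index.is_unique
```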

### All in One Step

Identical to the above, but without saving the intermediate values. 

In [13]:
(movies['Genre']
    .transform(parse_string)
    .transform(pd.Series)
    .stack()
    .rename('Genre')
    .reset_index(level=-1, drop=True)
    .to_frame()
    .join(movies, on='Title', rsuffix='_Original')
    .drop(['Genre_Original'], axis=1)
    .head(4))

Title,Genre,Rank,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Comedy,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Romance,508,An offbeat romantic comedy about a woman who d...,Marc Webb,"Zooey Deschanel, Joseph Gordon-Levitt, Geoffre...",2009,95,7.7,398972,32.39,76.0
10 Cloverfield Lane,Drama,119,"After getting in a car accident, a woman is he...",Dan Trachtenberg,"John Goodman, Mary Elizabeth Winstead, John Ga...",2016,104,7.2,192968,71.9,76.0


In [14]:
# try it
df_genre[df_genre['Genre'] == 'Sci-Fi'].head()

Title,Genre,Rank,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
2012,Sci-Fi,473,A frustrated writer struggles to keep his fami...,Roland Emmerich,"John Cusack, Thandie Newton, Chiwetel Ejiofor,...",2009,158,5.8,297984,166.11,49.0
2307: Winter's Dream,Sci-Fi,617,"In 2307, a future soldier is sent on a mission...",Joey Curtis,"Paul Sidhu, Branden Coles, Arielle Holmes, Kel...",2016,101,4.0,277,20.76,53.0
28 Weeks Later,Sci-Fi,876,Six months after the rage virus was inflicted ...,Juan Carlos Fresnadillo,"Jeremy Renner, Rose Byrne, Robert Carlyle, Har...",2007,100,7.0,221858,28.64,78.0
Absolutely Anything,Sci-Fi,399,A group of eccentric aliens confer a human bei...,Terry Jones,"Simon Pegg, Kate Beckinsale, Sanjeev Bhaskar, ...",2015,85,6.0,26587,,31.0
After Earth,Sci-Fi,949,A crash landing leaves Kitai Raige and his fat...,M. Night Shyamalan,"Jaden Smith, David Denman, Will Smith,Sophie O...",2013,100,4.9,166512,60.52,33.0


### Aggregate for Rating and Runtime

In [15]:
# Flatten Multi-index Column
def flatten_columns(df):
    return ['_'.join(col) for col in df.columns.values]

In [16]:
# data is tidy with respect to Genre, so groupby will work correctly
df_agg = df_genre.groupby('Genre').agg(
    {'Rating':['mean','count'], 'Runtime':['mean','count']})
df_agg.columns = flatten_columns(df_agg)
df_agg.head()

Genre,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Action,6.614521,303,116.739274,303
Adventure,6.772201,259,117.698842,259
Animation,7.32449,49,98.142857,49
Biography,7.290123,81,122.580247,81
Comedy,6.64767,279,105.899642,279
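
As a possible alternative to flattening the MultiIndex columns afterwards, named aggregation produces flat column names directly. This is a sketch under the assumption that pandas 0.25 or later is available (this notebook was run on 0.24.1, which does not support it); the `toy` data is made up.

```python
import pandas as pd

# Hypothetical long-form data: one row per (Genre, Title).
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Comedy"],
    "Rating": [8.0, 6.0, 7.0],
    "Runtime": [120, 100, 90],
})

# Each keyword names an output column: (source column, aggregation).
df_agg_named = toy.groupby("Genre").agg(
    Rating_mean=("Rating", "mean"),
    Rating_count=("Rating", "count"),
    Runtime_mean=("Runtime", "mean"),
    Runtime_count=("Runtime", "count"),
)
```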


### Note: Avg of Averages != Avg

In [17]:
# avg != avg of per genre averages
df_genre['Rating'].mean() == df_agg['Rating_mean'].mean()

False

In [18]:
np.round([df_genre['Rating'].mean(), df_agg['Rating_mean'].mean()], 2)

array([6.76, 6.84])
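
The two numbers differ because the plain mean of per-genre means gives every genre equal weight, regardless of how many rows it has. A small sketch on made-up data shows that weighting each group mean by its row count recovers the overall mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three Drama rows and one Comedy row.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Drama", "Comedy"],
    "Rating": [8.0, 8.0, 8.0, 4.0],
})
per_genre = toy.groupby("Genre")["Rating"].agg(["mean", "count"])

overall = toy["Rating"].mean()                # (8 + 8 + 8 + 4) / 4 = 7.0
mean_of_means = per_genre["mean"].mean()      # (8 + 4) / 2 = 6.0
weighted = np.average(per_genre["mean"], weights=per_genre["count"])  # 7.0
```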

## Genre Queries

### Query: Top 3 Genres by Rating & At Least 20 Movies

In [19]:
criteria_count = df_agg['Rating_count'] >= 20
df_agg[criteria_count].nlargest(n=3, columns=['Rating_mean'])

Genre,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Animation,7.32449,49,98.142857,49
Biography,7.290123,81,122.580247,81
History,7.127586,29,130.689655,29


### Query: Top 3 Genres by Runtime & Better than Avg Rating & At Least 20 Movies

In [20]:
criteria_rating = df_agg['Rating_mean'] > df_genre['Rating'].mean()
df_agg[criteria_count & criteria_rating].nlargest(n=3, columns=['Runtime_mean'])

Genre,Rating_mean,Rating_count,Runtime_mean,Runtime_count
History,7.127586,29,130.689655,29
Biography,7.290123,81,122.580247,81
Adventure,6.772201,259,117.698842,259


## Create Long Form Representation of Actor

This has one row per (Actor, Title).

This is identical to the processing for Genre above.
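
Since the same steps apply to both columns, they could be wrapped in one function. The `to_long` helper below is a hypothetical sketch, not code from this notebook: it swaps `parse_string` and `transform(pd.Series)` for `str.split(expand=True)` plus `str.strip()`, and it drops the source column before the join so no `rsuffix` is needed.

```python
import pandas as pd

def to_long(df, column, new_name):
    """Return one row per (index, value) for a comma-separated string column."""
    values = (
        df[column]
        .str.split(",", expand=True)
        .stack()
        .dropna()  # some pandas versions keep null entries when stacking
        .str.strip()
        .rename(new_name)
        .reset_index(level=-1, drop=True)
        .to_frame()
    )
    # Dropping the source column first avoids a name clash in the join.
    return values.join(df.drop(columns=column))

# Made-up data for illustration.
toy = pd.DataFrame(
    {"Actors": ["Ann Lee, Bo Kim", "Cy Roe"], "Rating": [7.0, 6.0]},
    index=pd.Index(["A", "B"], name="Title"),
)
long_actor = to_long(toy, "Actors", "Actor")
```

With the notebook's data, the equivalent calls would be `to_long(movies, 'Genre', 'Genre')` and `to_long(movies, 'Actors', 'Actor')`.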

In [21]:
tmp = movies['Actors'].transform(parse_string)
tmp = tmp.transform(pd.Series)
tmp = tmp.stack()
tmp = tmp.rename('Actor')
tmp = tmp.reset_index(level=-1, drop=True)
tmp = tmp.to_frame()
tmp = tmp.join(movies, on='Title')
tmp = tmp.drop(['Actors'], axis=1)
tmp.head(4)

Title,Actor,Rank,Genre,Description,Director,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Zooey Deschanel,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Joseph Gordon-Levitt,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Geoffrey Arend,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Chloë Grace Moretz,508,"Comedy,Drama,Romance",An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0


In [22]:
(tmp.reset_index().duplicated(subset=['Title', 'Actor'], keep=False)).any()

False

In [23]:
# save the result, only a few columns are used in the following analysis
df_actor = tmp[['Actor','Rating','Runtime']].copy()

## Actor Queries

### Example: Query for Brad Pitt Movies

In [24]:
df_actor[df_actor['Actor'] == 'Brad Pitt']

Title,Actor,Rating,Runtime
12 Years a Slave,Brad Pitt,8.1,134
Allied,Brad Pitt,7.1,124
Babel,Brad Pitt,7.5,143
Fury,Brad Pitt,7.6,134
Inglourious Basterds,Brad Pitt,8.3,153
Megamind,Brad Pitt,7.3,95
Moneyball,Brad Pitt,7.6,133
Ocean's Thirteen,Brad Pitt,6.9,122
The Assassination of Jesse James by the Coward Robert Ford,Brad Pitt,7.5,160
The Big Short,Brad Pitt,7.8,130


In [25]:
# Flatten Multi-index Column
def flatten_columns(df):
    return ['_'.join(col) for col in df.columns.values]

In [26]:
# data is tidy with respect to Actor, so groupby will work correctly
df_agg = df_actor.groupby('Actor').agg(
    {'Rating':['mean','count'], 'Runtime':['mean','count']})
df_agg.columns = flatten_columns(df_agg)
df_agg.head()

Actor,Rating_mean,Rating_count,Runtime_mean,Runtime_count
50 Cent,6.7,1,115.0,1
A.C. Peterson,5.3,1,81.0,1
AJ Michalka,7.1,1,112.0,1
Aamir Khan,8.475,4,162.25,4
Aaron Burns,5.15,2,99.5,2


### Query: Top 15 Actors by Movie Count, Sorted by Rating

In [27]:
# find the 15 actors who are in the most movies and sort by Rating
actor_top15 = df_agg.nlargest(n=15, columns=['Rating_count'])
actor_top15.sort_values(by='Rating_mean', ascending=False)

Actor,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Robert Downey Jr.,7.5,12,132.25,12
Brad Pitt,7.484615,13,134.538462,13
Tom Hardy,7.358333,12,122.75,12
Scarlett Johansson,7.325,12,116.833333,12
Christian Bale,7.246154,13,132.461538,13
Bradley Cooper,7.190909,11,118.090909,11
Hugh Jackman,7.135714,14,124.5,14
Cate Blanchett,7.045455,11,124.272727,11
Anne Hathaway,6.983333,12,121.25,12
Michael Fassbender,6.941667,12,121.0,12


### Query: Top 15 Actors by Movie Count, Display Top 5 by Rating

In [28]:
# very similar to above query
actor_top15.nlargest(n=5, columns=['Rating_mean'])

Actor,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Robert Downey Jr.,7.5,12,132.25,12
Brad Pitt,7.484615,13,134.538462,13
Tom Hardy,7.358333,12,122.75,12
Scarlett Johansson,7.325,12,116.833333,12
Christian Bale,7.246154,13,132.461538,13


### Query: Actors with Above Avg Rating & in at least 5 movies, Display Top 5 by Runtime

In [29]:
criteria_above_avg = df_agg['Rating_mean'] > df_actor['Rating'].mean()
criteria_common = df_agg['Rating_count'] >= 5

df_agg[criteria_above_avg & criteria_common].nlargest(n=5, columns=['Runtime_mean'])

Actor,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Leonardo DiCaprio,7.97,10,147.1,10
Ian McKellen,7.583333,6,147.0,6
Kurt Russell,7.36,5,146.0,5
Andrew Garfield,7.283333,6,136.833333,6
Russell Crowe,7.042857,7,136.285714,7


### Query: Actors Top 20% by Rating & Actors Top 20% by Count, Display Top 5 by Runtime

In [30]:
# rating cutoff for the top 20% of (Actor, Title) rows
top_20percent_rating = df_actor['Rating'].quantile(q=0.80)
top_20percent_rating

7.5

In [31]:
# Note: the quantile over (Actor, Title) rows differs from the quantile over per-actor means
df_actor['Rating'].quantile(q=0.80) == df_agg['Rating_mean'].quantile(q=0.80)

False

In [32]:
# Note: compare the two cutoffs side by side
df_actor['Rating'].quantile(q=0.80), df_agg['Rating_mean'].quantile(q=0.80)

(7.5, 7.4)
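
The cutoffs differ because the first quantile is taken over one row per (Actor, Title), so actors with many movies count many times, while the second is taken over one mean per actor. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data: actor X appears in three movies, actor Y in one.
toy = pd.DataFrame({
    "Actor": ["X", "X", "X", "Y"],
    "Rating": [9.0, 9.0, 9.0, 5.0],
})

# Quantile over the four rows: X's ratings dominate.
row_cutoff = toy["Rating"].quantile(q=0.80)

# Quantile over the two per-actor means: X and Y count once each.
actor_cutoff = toy.groupby("Actor")["Rating"].mean().quantile(q=0.80)
```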

In [33]:
# number of movies for top 20% cutoff
top_20percent_count = df_agg['Rating_count'].quantile(q=0.80)
top_20percent_count

3.0

In [34]:
criteria_top20_rated = df_agg['Rating_mean'] >= top_20percent_rating
criteria_top20_count = df_agg['Rating_count'] >= top_20percent_count
top_2020 = df_agg[criteria_top20_rated & criteria_top20_count].nlargest(n=5, columns=['Runtime_mean'])
top_2020

Actor,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Aamir Khan,8.475,4,162.25,4
Richard Armitage,7.733333,3,158.0,3
Leonardo DiCaprio,7.97,10,147.1,10
Ian McKellen,7.583333,6,147.0,6
Rupert Grint,7.7,4,141.75,4


### Query: Number of Actors: Top 20% by Rating & Top 20% by Count

In [35]:
(criteria_top20_rated & criteria_top20_count).sum()

23

## Genre and Actor Queries
This requires one row per Genre, Actor, Title.

This is just a join of df_actor and df_genre on the Title index, after dropping duplicate or unnecessary columns.
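
Because Title is repeated in both long-form tables, joining on it pairs every actor row with every genre row of the same movie. A small sketch with made-up `actors` and `genres` frames shows the resulting per-Title cross product:

```python
import pandas as pd

# Hypothetical long-form tables for a single movie "A".
actors = pd.DataFrame(
    {"Actor": ["Ann", "Bo"]},
    index=pd.Index(["A", "A"], name="Title"),
)
genres = pd.DataFrame(
    {"Genre": ["Comedy", "Drama", "Romance"]},
    index=pd.Index(["A", "A", "A"], name="Title"),
)

# 2 actor rows x 3 genre rows = 6 rows, one per (Title, Actor, Genre).
combined = actors.join(genres)
```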

In [36]:
tmp1 = df_actor.drop(['Rating','Runtime'], axis=1)
tmp2 = df_genre.drop(['Actors'], axis=1)
df_genre_actor = tmp1.join(tmp2)
df_genre_actor.head()

Title,Actor,Genre,Rank,Description,Director,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Zooey Deschanel,Comedy,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Zooey Deschanel,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Zooey Deschanel,Romance,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Joseph Gordon-Levitt,Comedy,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Joseph Gordon-Levitt,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0


In [37]:
# verify that there are no dups for Title, Actor, Genre
# any() is True if at least one row is duplicated; all() would only be True if every row were
((df_genre_actor
    .reset_index()
    .duplicated(subset=['Title', 'Actor', 'Genre'], keep=False))
    .any())

False

### Query: For top 15 Actors by Count, Which Genre Had Highest Mean Rating

In [38]:
# top 15 actors by movie count was computed above
actor_top15.index

Index(['Mark Wahlberg', 'Hugh Jackman', 'Brad Pitt', 'Christian Bale',
       'Anne Hathaway', 'Channing Tatum', 'Johnny Depp', 'Michael Fassbender',
       'Robert Downey Jr.', 'Scarlett Johansson', 'Tom Hardy', 'Anna Kendrick',
       'Bradley Cooper', 'Cate Blanchett', 'Chloë Grace Moretz'],
      dtype='object', name='Actor')

In [39]:
criteria = df_genre_actor['Actor'].isin(actor_top15.index)
df_genre_actor_common = df_genre_actor[criteria]
df_genre_actor_common.head()

Title,Actor,Genre,Rank,Description,Director,Year,Runtime,Rating,Votes,Revenue,Metascore
(500) Days of Summer,Chloë Grace Moretz,Comedy,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Chloë Grace Moretz,Drama,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
(500) Days of Summer,Chloë Grace Moretz,Romance,508,An offbeat romantic comedy about a woman who d...,Marc Webb,2009,95,7.7,398972,32.39,76.0
10 Years,Channing Tatum,Comedy,697,"The night before their high school reunion, a ...",Jamie Linden,2011,100,6.1,19636,0.2,
10 Years,Channing Tatum,Drama,697,"The night before their high school reunion, a ...",Jamie Linden,2011,100,6.1,19636,0.2,


In [40]:
# aggregate the mean Rating over each genre for each actor
df_genre_actor_agg = (
    df_genre_actor_common
        .groupby(['Actor', 'Genre'])
        .agg({'Rating':'mean'})
        .add_suffix('_Mean')
        .reset_index())
df_genre_actor_agg.head()

,Actor,Genre,Rating_Mean
0,Anna Kendrick,Action,6.85
1,Anna Kendrick,Adventure,6.166667
2,Anna Kendrick,Animation,6.5
3,Anna Kendrick,Comedy,6.5875
4,Anna Kendrick,Crime,7.55


In [41]:
# find the best movie genre per top15 actor
# keep ties (idxmax does not keep ties)
rating_max = df_genre_actor_agg.groupby('Actor')['Rating_Mean'].max()
pd.merge(df_genre_actor_agg, rating_max, on=['Actor', 'Rating_Mean'])

,Actor,Genre,Rating_Mean
0,Anna Kendrick,Thriller,7.7
1,Anne Hathaway,Sci-Fi,8.6
2,Brad Pitt,History,8.1
3,Bradley Cooper,Romance,7.8
4,Cate Blanchett,Animation,7.9
5,Channing Tatum,Animation,7.3
6,Chloë Grace Moretz,Romance,7.7
7,Christian Bale,Mystery,8.5
8,Christian Bale,Sci-Fi,8.5
9,Hugh Jackman,Mystery,8.3
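
The difference between `idxmax` and a tie-preserving selection can be seen on a small made-up table. The sketch below uses `transform('max')` rather than the merge above; this is a swapped-in technique that should select the same rows, shown only for comparison.

```python
import pandas as pd

# Hypothetical aggregated table: actor X has two genres tied at 8.5.
agg = pd.DataFrame({
    "Actor": ["X", "X", "X", "Y"],
    "Genre": ["Mystery", "Sci-Fi", "Drama", "Drama"],
    "Rating_Mean": [8.5, 8.5, 7.0, 6.0],
})

# idxmax returns a single row label per group, so one of X's tied genres is lost.
one_per_actor = agg.loc[agg.groupby("Actor")["Rating_Mean"].idxmax()]

# Comparing each row with its group maximum keeps every tied row.
group_max = agg.groupby("Actor")["Rating_Mean"].transform("max")
with_ties = agg[agg["Rating_Mean"] == group_max]
```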


### Spot Check by Hand

Use a completely different technique to spot check the result.

In [42]:
pitt_movies = movies['Actors'].str.contains('Brad Pitt')
history_movies = movies['Genre'].str.contains('History')

In [43]:
pitt_history_movies = movies[pitt_movies & history_movies]
pitt_history_movies

Title,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
12 Years a Slave,112,"Biography,Drama,History","In the antebellum United States, Solomon North...",Steve McQueen,"Chiwetel Ejiofor, Michael Kenneth Williams, Mi...",2013,134,8.1,486338,56.67,96.0


In [44]:
pitt_history_movies['Rating'].mean()

8.1

In [45]:
df = df_genre_actor_agg
df[(df['Actor'] == 'Brad Pitt') & (df['Genre'] == 'History')]

,Actor,Genre,Rating_Mean
28,Brad Pitt,History,8.1


Above results are the same.

In [46]:
war_movies = movies['Genre'].str.contains('War')
pitt_war_movies = movies[pitt_movies & war_movies]
pitt_war_movies

Title,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Fury,241,"Action,Drama,War",A grizzled tank commander makes tough decision...,David Ayer,"Brad Pitt, Shia LaBeouf, Logan Lerman, Michael...",2014,134,7.6,332234,85.71,64.0
Inglourious Basterds,78,"Adventure,Drama,War","In Nazi-occupied France during World War II, a...",Quentin Tarantino,"Brad Pitt, Diane Kruger, Eli Roth,Mélanie Laurent",2009,153,8.3,959065,120.52,69.0


In [47]:
pitt_war_movies['Rating'].mean()

7.95

In [48]:
df[(df['Actor'] == 'Brad Pitt') & (df['Genre'] == 'War')]

,Actor,Genre,Rating_Mean
33,Brad Pitt,War,7.95


Above results are the same.

In [49]:
# look at top 5 average movie rating by genre for Brad Pitt
df[df['Actor'] == 'Brad Pitt'].nlargest(n=5, columns=['Rating_Mean'])

,Actor,Genre,Rating_Mean
28,Brad Pitt,History,8.1
33,Brad Pitt,War,7.95
23,Brad Pitt,Biography,7.75
21,Brad Pitt,Adventure,7.65
26,Brad Pitt,Drama,7.61


History was the highest, as found above.