# Pandas Tidy Data

## Overview

This notebook uses the IMDB dataset from Kaggle:  
https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv

This notebook is about how to make queries using the Genre and Actors column, given that each cell in these column contains a list.

There are at least 3 different ways to do this:
1. create an indicator variable for each possible genre or actor (good for Machine Learning)
2. create a dictionary that maps genres (or actors) to movies and use it in all queries
3. reformat Movies to be tidy with respect to genres or actors

This notebook only discusses the 3rd option above.

This notebook does not discuss melt and pivot, or stack and unstack, which are often helpful methods to reformat data into "long" or "wide" format as needed.

In [1]:
import pandas as pd
import numpy as np

## Read Movie Data Set
This data was read in and preprocessed in Pandas02.  Load that notebook.

In [2]:
movies = pd.read_hdf('../data/movies.h5')
movies.head(2)

Unnamed: 0_level_0,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


### Examine Genre and Actors

In [3]:
movies['Genre'].head(5)

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
Name: Genre, dtype: object

In [4]:
# typical way of finding unique values
# works when cell contains one value
movies['Director'].value_counts().head()

Ridley Scott          8
David Yates           6
M. Night Shyamalan    6
Paul W.S. Anderson    6
Michael Bay           6
Name: Director, dtype: int64

In [5]:
# this doesnt work as needed for Genre
movies['Genre'].value_counts().head()

Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64

The above is count per genre string, not per genre.

In [6]:
# this doesnt work as needed for Actor
movies['Actors'].value_counts().head()

Daniel Radcliffe, Emma Watson, Rupert Grint, Michael Gambon            2
Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Woody Harrelson    2
Shia LaBeouf, Megan Fox, Josh Duhamel, Tyrese Gibson                   2
Gerard Butler, Aaron Eckhart, Morgan Freeman,Angela Bassett            2
Tom Hardy, Kelly Adams, Luing Andrews,Katy Barker                      1
Name: Actors, dtype: int64

The above is count per actor string, not per actor.

## Write Parser for Genre and Actors String

In [7]:
def parse_string(s):
    """Convert string with comma separated values to list of strings.
    For each string, remove leading and trailing whitespace.
    """
    values = s.split(',')
    return [value.strip() for value in values]

### Use Parser
1. use Pandas Vectorized operator 'apply' to avoid explict iteration over all titles
2. index of result Series is Title
3. value of result Series is list of genres

In [8]:
# create a Series with values being the parsed genre string
title_genre_series = movies['Genre'].apply(parse_string)
title_genre_series.head()

Title
Guardians of the Galaxy     [Action, Adventure, Sci-Fi]
Prometheus                 [Adventure, Mystery, Sci-Fi]
Split                                [Horror, Thriller]
Sing                        [Animation, Comedy, Family]
Suicide Squad              [Action, Adventure, Fantasy]
Name: Genre, dtype: object

For each title, add the list of genres to the genre set.

In [9]:
# find the unique genres
genres = set()
title_genre_series.apply(lambda x: genres.update(x))
genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}

### Example: Query for Sci-Fi Movies

In [10]:
movies['Genre'].str.contains('Sci-Fi').head()

Title
Guardians of the Galaxy     True
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
Name: Genre, dtype: bool

## Create Long Form Representation of Genre

This has one row per (Genre, Title)

In [11]:
# keep columns for for Rating and Runtime
dfs = []
for genre in genres:
    has_genre = movies['Genre'].str.contains(genre)
    df = movies.loc[has_genre, ['Rating', 'Runtime']]
    df['Genre'] = genre
    dfs.append(df)
    
df_genre = pd.concat(dfs)
df_genre.head()

Unnamed: 0_level_0,Rating,Runtime,Genre
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Sing,7.2,108,Family
Fantastic Beasts and Where to Find Them,7.5,133,Family
Miss Peregrine's Home for Peculiar Children,6.7,127,Family
The Jungle Book,7.5,106,Family
Mamma Mia!,6.4,108,Family


In [12]:
# Keep Title, but make Genre the index
# often it is good to sort by the index
df_genre = df_genre.reset_index()
df_genre = df_genre.set_index('Genre')
df_genre = df_genre.sort_index()
df_genre.head()

Unnamed: 0_level_0,Title,Rating,Runtime
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,I Am Number Four,6.1,109
Action,Kickboxer: Vengeance,4.9,90
Action,The Hunger Games: Catching Fire,7.6,146
Action,Tron,6.8,125
Action,The Finest Hours,6.8,117


In [13]:
# verify Genre, Title is unique
def get_dups(df, cols):
    return df.groupby(cols).filter(lambda x: len(x) > 1)

get_dups(df_genre, ['Title', 'Genre'])

Unnamed: 0_level_0,Title,Rating,Runtime
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


### Aggregate for Rating and Runtime

In [14]:
# Flatten Multi-index Column
def flatten_columns(df):
    return ['_'.join(col) for col in df.columns.values]

In [15]:
# data is tidy with respect to Genre, so groupby will work correctly
df_agg = df_genre.groupby('Genre').agg({'Rating':['mean','count'], 'Runtime':['mean','count']})
df_agg.columns = flatten_columns(df_agg)
df_agg.head()

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Action,6.614521,303,116.739274,303
Adventure,6.772201,259,117.698842,259
Animation,7.32449,49,98.142857,49
Biography,7.290123,81,122.580247,81
Comedy,6.64767,279,105.899642,279


### Note: Avg of Averages != Avg

In [16]:
df_genre['Rating'].mean() == df_agg['Rating_mean'].mean()

False

In [17]:
df_genre['Rating'].mean(), df_agg['Rating_mean'].mean()

(6.756835937500007, 6.842082767468817)

## Genre Querys

### Query: Top 3 Genres by Rating & At Least 20 Movies

In [18]:
criteria_count = df_agg['Rating_count'] >= 20
df_agg[criteria_count].nlargest(n=3, columns=['Rating_mean'])

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Animation,7.32449,49,98.142857,49
Biography,7.290123,81,122.580247,81
History,7.127586,29,130.689655,29


### Query: Top 3 Genres by Runtime & Better than Avg Rating & At Least 20 Movies

In [19]:
criteria_rating = df_agg['Rating_mean'] > df_genre['Rating'].mean()
df_agg[criteria_count & criteria_rating].nlargest(n=3, columns=['Runtime_mean'])

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
History,7.127586,29,130.689655,29
Biography,7.290123,81,122.580247,81
Adventure,6.772201,259,117.698842,259


## Actor Queries

This parallels the above.  This format does not allow for queries by both actor and genre.

In [20]:
# create a Series with values being the parsed actor string
title_actor_series = movies['Actors'].apply(parse_string)
title_actor_series.head()

Title
Guardians of the Galaxy    [Chris Pratt, Vin Diesel, Bradley Cooper, Zoe ...
Prometheus                 [Noomi Rapace, Logan Marshall-Green, Michael F...
Split                      [James McAvoy, Anya Taylor-Joy, Haley Lu Richa...
Sing                       [Matthew McConaughey, Reese Witherspoon, Seth ...
Suicide Squad              [Will Smith, Jared Leto, Margot Robbie, Viola ...
Name: Actors, dtype: object

For each title, add the list of actors to the actors set.

In [21]:
# find the unique actors
actors = set()
title_actor_series.apply(lambda x: actors.update(x))
list(actors)[:5]

['Arturo Castro',
 'Andy Samberg',
 'Haley Lu Richardson',
 'Moon Bloodgood',
 'Christopher Jordan Wallace']

In [22]:
# create a "long form" representation of Actor
# this has one row per Actor, Title
# with columns for for Rating and Runtime
dfs = []
for actor in actors:
    has_actor = movies['Actors'].str.contains(actor)
    df = movies.loc[has_actor, ['Rating', 'Runtime']]
    df['Actor'] = actor
    dfs.append(df)
    
df_actor = pd.concat(dfs)
df_actor.head()

Unnamed: 0_level_0,Rating,Runtime,Actor
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Billy Lynn's Long Halftime Walk,6.3,113,Arturo Castro
Storks,6.9,87,Andy Samberg
Popstar: Never Stop Never Stopping,6.7,87,Andy Samberg
Hotel Transylvania 2,6.7,89,Andy Samberg
Split,7.3,117,Haley Lu Richardson


In [23]:
# Keep Title, but make Actor the index
# often it is good to sort by the index
df_actor = df_actor.reset_index()
df_actor = df_actor.set_index('Actor')
df_actor = df_actor.sort_index()
df_actor.head()

Unnamed: 0_level_0,Title,Rating,Runtime
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
50 Cent,Escape Plan,6.7,115
A.C. Peterson,The Stakelander,5.3,81
AJ Michalka,Super 8,7.1,112
Aamir Khan,Taare Zameen Par,8.5,165
Aamir Khan,3 Idiots,8.4,170


In [24]:
# Actor, Title is unique
get_dups(df_actor, ['Title', 'Actor'])

Unnamed: 0_level_0,Title,Rating,Runtime
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


In [25]:
# Flatten Multi-index Column
def flatten_columns(df):
    return ['_'.join(col) for col in df.columns.values]

In [26]:
# data is tidy with respect to Actor, so groupby will work correctly
df_agg = df_actor.groupby('Actor').agg({'Rating':['mean','count'], 'Runtime':['mean','count']})
df_agg.columns = flatten_columns(df_agg)
df_agg.head()

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
50 Cent,6.7,1,115.0,1
A.C. Peterson,5.3,1,81.0,1
AJ Michalka,7.1,1,112.0,1
Aamir Khan,8.475,4,162.25,4
Aaron Burns,5.15,2,99.5,2


### Query: Top 15 Actors by Movie Count, Sorted by Rating

In [27]:
# find the 15 actors who are in the most movies and sort by Rating
actor_top15 = df_agg.nlargest(n=15, columns=['Rating_count'])
actor_top15.sort_values(by='Rating_mean', ascending=False)

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Robert Downey Jr.,7.5,12,132.25,12
Brad Pitt,7.484615,13,134.538462,13
Tom Hardy,7.358333,12,122.75,12
Scarlett Johansson,7.325,12,116.833333,12
Christian Bale,7.246154,13,132.461538,13
Bradley Cooper,7.190909,11,118.090909,11
Hugh Jackman,7.135714,14,124.5,14
Cate Blanchett,7.045455,11,124.272727,11
Anne Hathaway,6.983333,12,121.25,12
Michael Fassbender,6.941667,12,121.0,12


### Query: Top 15 Actors by Movie Count, Display Top 5 by Rating

In [28]:
# very similar to above query
actor_top15.nlargest(n=5, columns=['Rating_mean'])

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Robert Downey Jr.,7.5,12,132.25,12
Brad Pitt,7.484615,13,134.538462,13
Tom Hardy,7.358333,12,122.75,12
Scarlett Johansson,7.325,12,116.833333,12
Christian Bale,7.246154,13,132.461538,13


### Query: Actors with Above Avg Rating & in at least 5 movies, Display Top 5 by Runtime

In [29]:
criteria_above_avg = df_agg['Rating_mean'] > df_actor['Rating'].mean()
criteria_common = df_agg['Rating_count'] >= 5

df_agg[criteria_above_avg & criteria_common].nlargest(n=5, columns=['Runtime_mean'])

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Leonardo DiCaprio,7.97,10,147.1,10
Ian McKellen,7.583333,6,147.0,6
Kurt Russell,7.36,5,146.0,5
Andrew Garfield,7.283333,6,136.833333,6
Russell Crowe,7.042857,7,136.285714,7


### Query: Actors Top 20% by Rating & Actors Top 20% by Count, Display Top 5 by Runtime

In [30]:
# rating for top 20% of actors
top_20percent_rating = df_actor['Rating'].quantile(q=0.80)
top_20percent_rating

7.5

In [31]:
# Note
df_actor['Rating'].quantile(q=0.80) == df_agg['Rating_mean'].quantile(q=0.80)

False

In [32]:
# Note
df_actor['Rating'].quantile(q=0.80), df_agg['Rating_mean'].quantile(q=0.80)

(7.5, 7.4)

In [33]:
# number of movies for top 20% cutoff
top_20percent_count = df_agg['Rating_count'].quantile(q=0.80)
top_20percent_count

3.0

In [34]:
criteria_top20_rated = df_agg['Rating_mean'] >= top_20percent_rating
criteria_top20_count = df_agg['Rating_count'] >= top_20percent_count
top_2020 = df_agg[criteria_top20_rated & criteria_top20_count].nlargest(n=5, columns=['Runtime_mean'])
top_2020

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aamir Khan,8.475,4,162.25,4
Richard Armitage,7.733333,3,158.0,3
Leonardo DiCaprio,7.97,10,147.1,10
Ian McKellen,7.583333,6,147.0,6
Rupert Grint,7.7,4,141.75,4


### Query: Number of Actors: Top 20% by Rating & Top 20% by Count

In [35]:
(criteria_top20_rated & criteria_top20_count).sum()

22

## Genre and Actor Queries
This requires one row per Genre, Actor, Title.

In [36]:
# keep columns for for Rating, Runtime and Actors
dfs = []
for genre in genres:
    has_genre = movies['Genre'].str.contains(genre)
    df = movies.loc[has_genre, ['Rating', 'Runtime', 'Actors']]
    df['Genre'] = genre
    
    for actor in actors:
        has_actor = df['Actors'].str.contains(actor)
        df2 = df.loc[has_actor, ['Genre', 'Rating', 'Runtime']]
        df2['Actor'] = actor
        dfs.append(df2)
    
df_genre_actor = pd.concat(dfs)
df_genre_actor.head()

Unnamed: 0_level_0,Genre,Rating,Runtime,Actor
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Hotel Transylvania 2,Family,6.7,89,Andy Samberg
"Alexander and the Terrible, Horrible, No Good, Very Bad Day",Family,6.2,81,Jennifer Garner
Miracles from Heaven,Family,7.0,109,Jennifer Garner
Nine Lives,Family,5.3,87,Jennifer Garner
The Jungle Book,Family,7.5,106,Ben Kingsley


In [37]:
df_genre_actor = df_genre_actor.reset_index()
df_genre_actor.head()

Unnamed: 0,Title,Genre,Rating,Runtime,Actor
0,Hotel Transylvania 2,Family,6.7,89,Andy Samberg
1,"Alexander and the Terrible, Horrible, No Good,...",Family,6.2,81,Jennifer Garner
2,Miracles from Heaven,Family,7.0,109,Jennifer Garner
3,Nine Lives,Family,5.3,87,Jennifer Garner
4,The Jungle Book,Family,7.5,106,Ben Kingsley


In [38]:
# verify unique Actor, Genre, Title
get_dups(df_genre_actor, ['Actor', 'Genre', 'Title'])

Unnamed: 0,Title,Genre,Rating,Runtime,Actor


### Query: For top 15 Actors by Count, Which Genre Had Highest Mean Rating

In [39]:
actor_top15

Unnamed: 0_level_0,Rating_mean,Rating_count,Runtime_mean,Runtime_count
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mark Wahlberg,6.773333,15,120.466667,15
Hugh Jackman,7.135714,14,124.5,14
Brad Pitt,7.484615,13,134.538462,13
Christian Bale,7.246154,13,132.461538,13
Anne Hathaway,6.983333,12,121.25,12
Channing Tatum,6.308333,12,111.416667,12
Johnny Depp,6.533333,12,127.166667,12
Michael Fassbender,6.941667,12,121.0,12
Robert Downey Jr.,7.5,12,132.25,12
Scarlett Johansson,7.325,12,116.833333,12


In [40]:
criteria = df_genre_actor['Actor'].isin(actor_top15.index)
df_genre_actor_common = df_genre_actor[criteria]
df_genre_actor_common.head()

Unnamed: 0,Title,Genre,Rating,Runtime,Actor
19,Daddy's Home,Family,6.1,96,Mark Wahlberg
80,Hugo,Family,7.5,126,Chloë Grace Moretz
95,Alice Through the Looking Glass,Family,6.2,113,Johnny Depp
96,Alice in Wonderland,Family,6.5,108,Johnny Depp
120,Cinderella,Family,7.0,105,Cate Blanchett


In [41]:
# aggregate the mean Rating over each genre for each actor
df_genre_actor_agg = \
    df_genre_actor_common.groupby(['Actor', 'Genre']).agg({'Rating':'mean'}).reset_index()
df_genre_actor_agg.head()

Unnamed: 0,Actor,Genre,Rating
0,Anna Kendrick,Action,6.85
1,Anna Kendrick,Adventure,6.166667
2,Anna Kendrick,Animation,6.5
3,Anna Kendrick,Comedy,6.5875
4,Anna Kendrick,Crime,7.55


In [42]:
# sort and pick off the first one in the group (no ties)
df3 = df_genre_actor_agg.sort_values(['Rating'], ascending=False)
df3.groupby(['Actor']).first()

Unnamed: 0_level_0,Genre,Rating
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1
Anna Kendrick,Thriller,7.7
Anne Hathaway,Sci-Fi,8.6
Brad Pitt,History,8.1
Bradley Cooper,Romance,7.8
Cate Blanchett,Animation,7.9
Channing Tatum,Animation,7.3
Chloë Grace Moretz,Romance,7.7
Christian Bale,Mystery,8.5
Hugh Jackman,Mystery,8.3
Johnny Depp,Musical,7.4


In [43]:
# same as above, but show ties
criteria = df_genre_actor_agg.groupby(['Actor']).apply(lambda x: x['Rating'] == x['Rating'].max())
df_genre_actor_agg[criteria.values].set_index('Actor')

Unnamed: 0_level_0,Genre,Rating
Actor,Unnamed: 1_level_1,Unnamed: 2_level_1
Anna Kendrick,Thriller,7.7
Anne Hathaway,Sci-Fi,8.6
Brad Pitt,History,8.1
Bradley Cooper,Romance,7.8
Cate Blanchett,Animation,7.9
Channing Tatum,Animation,7.3
Chloë Grace Moretz,Romance,7.7
Christian Bale,Mystery,8.5
Christian Bale,Sci-Fi,8.5
Hugh Jackman,Mystery,8.3


### Spot Check by Hand

In [62]:
pitt_movies = movies['Actors'].str.contains('Brad Pitt')
history_movies = movies['Genre'].str.contains('History')

In [63]:
pitt_history_movies = movies[pitt_movies & history_movies]
pitt_history_movies

Unnamed: 0_level_0,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
12 Years a Slave,112,"Biography,Drama,History","In the antebellum United States, Solomon North...",Steve McQueen,"Chiwetel Ejiofor, Michael Kenneth Williams, Mi...",2013,134,8.1,486338,56.67,96.0


In [64]:
pitt_history_movies['Rating'].mean()

8.1

In [65]:
df3[(df3['Actor'] == 'Brad Pitt') & (df3['Genre'] == 'History')]

Unnamed: 0,Actor,Genre,Rating
29,Brad Pitt,History,8.1


In [66]:
war_movies = movies['Genre'].str.contains('War')
pitt_war_movies = movies[pitt_movies & war_movies]
pitt_war_movies

Unnamed: 0_level_0,ID,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Inglourious Basterds,78,"Adventure,Drama,War","In Nazi-occupied France during World War II, a...",Quentin Tarantino,"Brad Pitt, Diane Kruger, Eli Roth,Mélanie Laurent",2009,153,8.3,959065,120.52,69.0
Fury,241,"Action,Drama,War",A grizzled tank commander makes tough decision...,David Ayer,"Brad Pitt, Shia LaBeouf, Logan Lerman, Michael...",2014,134,7.6,332234,85.71,64.0


In [67]:
pitt_war_movies['Rating'].mean()

7.95

In [68]:
df3[(df3['Actor'] == 'Brad Pitt') & (df3['Genre'] == 'War')]

Unnamed: 0,Actor,Genre,Rating
34,Brad Pitt,War,7.95


In [69]:
df3[df3['Actor'] == 'Brad Pitt'].nlargest(n=5, columns=['Rating'])

Unnamed: 0,Actor,Genre,Rating
29,Brad Pitt,History,8.1
34,Brad Pitt,War,7.95
24,Brad Pitt,Biography,7.75
22,Brad Pitt,Adventure,7.65
27,Brad Pitt,Drama,7.61
