# Rhysea's Assignment 5: Using Groupby

**Note**: I am a cinephile, so I chose the IMDB Top 1000 Database from our USC folder to practice Groupby on. I hope to find some interesting things today!

In [94]:
import pandas as pd

In [95]:
movie_df = pd.read_csv("../data/raw/imdb_1000.csv")

### Part I: Finding out basic details about the database

In [96]:
movie_df.head()

,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [97]:
movie_df.tail()

,star_rating,title,content_rating,genre,duration,actors_list
974,7.4,Tootsie,PG,Comedy,116,"[u'Dustin Hoffman', u'Jessica Lange', u'Teri G..."
975,7.4,Back to the Future Part III,PG,Adventure,118,"[u'Michael J. Fox', u'Christopher Lloyd', u'Ma..."
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo..."
977,7.4,Poltergeist,PG,Horror,114,"[u'JoBeth Williams', u""Heather O'Rourke"", u'Cr..."
978,7.4,Wall Street,R,Crime,126,"[u'Charlie Sheen', u'Michael Douglas', u'Tamar..."


In [98]:
print(f"The number of movies in the database is {len(movie_df)}.")

The number of movies in the database is 979.


In [99]:
movie_df.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

Looking at the data types, only two columns are numeric: star_rating and duration. Numeric aggregations such as mean will therefore only work on those two, while the text columns are limited to things like counts.
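Numeric summaries need the two numeric columns, but counting-style aggregations still work on text columns. Here is a minimal sketch of that idea, using ten rows copied from the head() and tail() output above; the names `sample` and `genres_per_rating` are only for this illustration.

```python
import pandas as pd

# Ten rows copied from the head() and tail() output above (text columns only).
sample = pd.DataFrame({
    "content_rating": ["R", "R", "R", "PG-13", "R", "PG", "PG", "PG-13", "PG", "R"],
    "genre": ["Crime", "Crime", "Crime", "Action", "Crime",
              "Comedy", "Adventure", "Action", "Horror", "Crime"],
})

# count and nunique are defined for strings, so no numeric column is needed.
genres_per_rating = sample.groupby("content_rating")["genre"].agg(["count", "nunique"])
print(genres_per_rating)
```

In this small sample the three PG films come from three different genres, while the five R films are all Crime.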

In [100]:
#trying to see if I could change the actors list into something more useful
movie_df = movie_df.astype({'actors_list': 'string'})

In [101]:
movie_df.head()

,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [102]:
#yeah, no idea how to change the 'u' char strings :/
movie_df.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        string
dtype: object
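One possible way to deal with the `u'...'` strings: each cell looks like the text of a Python list, so `ast.literal_eval` from the standard library can probably turn it into a real list. This is only a sketch on three complete cast strings copied from the outputs in this notebook, and it assumes every cell in the real column parses the same way; `sample`, `actors` and `per_actor` are names made up for the illustration.

```python
import ast
import pandas as pd

# Three complete actors_list strings copied from outputs in this notebook.
sample = pd.DataFrame({
    "title": ["The Godfather", "The Wizard of Oz", "Duck Soup"],
    "actors_list": [
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Judy Garland', u'Frank Morgan', u'Ray Bolger']",
        "[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']",
    ],
})

# Each cell is text that looks like a Python list; literal_eval parses it into a list.
sample["actors"] = sample["actors_list"].apply(ast.literal_eval)

# explode gives one row per (title, actor) pair, which groupby can then work with.
per_actor = sample.explode("actors")[["title", "actors"]].reset_index(drop=True)
print(per_actor)
```

With one actor per row, something like `per_actor.groupby("actors").size()` would count films per actor instead of per whole cast.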

### Part II: Playing around + experimentation

**Note**: In this part, I will be playing around with the database to find interesting information. I will use Groupby to find out the trends in the data.

In [103]:
#Since all the actors are clumped together, it will be hard to perform certain tasks with this list
#I will try to find how many movies have the same cast
#Will probably show movie franchises
movie_df.value_counts('actors_list')

actors_list
[u'Daniel Radcliffe', u'Emma Watson', u'Rupert Grint']              6
[u'Mark Hamill', u'Harrison Ford', u'Carrie Fisher']                3
[u'Michael J. Fox', u'Christopher Lloyd', u'Lea Thompson']          2
[u'Ian McKellen', u'Martin Freeman', u'Richard Armitage']           2
[u'Tom Hanks', u'Tim Allen', u'Joan Cusack']                        2
                                                                   ..
[u'Gary Oldman', u'Keri Russell', u'Andy Serkis']                   1
[u'Gary Oldman', u'Winona Ryder', u'Anthony Hopkins']               1
[u'Gena Rowlands', u'James Garner', u'Rachel McAdams']              1
[u'Gene Hackman', u'Barbara Hershey', u'Dennis Hopper']             1
[u'Zooey Deschanel', u'Joseph Gordon-Levitt', u'Geoffrey Arend']    1
Length: 969, dtype: int64

So the database has six Harry Potter films, three Star Wars films with Mark Hamill, Harrison Ford and Carrie Fisher, and Back to the Future I and II. The Hobbit and Toy Story casts also show up twice each. Very cool.
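To see which titles sit behind a repeated cast, one option is to group by actors_list and collect the titles into a list. The sketch below uses a toy frame: the cast strings are copied from the value_counts output above, but the titles "Film A" to "Film D" are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy frame: cast strings are from the value_counts output above,
# titles are hypothetical placeholders.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "actors_list": [
        "[u'Tom Hanks', u'Tim Allen', u'Joan Cusack']",
        "[u'Gary Oldman', u'Keri Russell', u'Andy Serkis']",
        "[u'Tom Hanks', u'Tim Allen', u'Joan Cusack']",
        "[u'Gary Oldman', u'Winona Ryder', u'Anthony Hopkins']",
    ],
})

# Collect every title that shares exactly the same cast string.
titles_by_cast = toy.groupby("actors_list")["title"].agg(list)

# Keep only the casts that appear in more than one film.
shared = titles_by_cast[titles_by_cast.apply(len) > 1]
print(shared)
```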

In [104]:
#selecting duration explicitly so that 'mean' is never attempted on the text columns
movie_df.groupby("star_rating")[["duration"]].agg(['min','max','mean','count']).reset_index()

,star_rating,duration,duration,duration,duration
,,min,max,mean,count
0,7.4,75,175,113.530612,49
1,7.5,78,172,116.194444,108
2,7.6,69,205,117.322581,124
3,7.7,80,202,118.265487,113
4,7.8,78,242,118.767241,116
5,7.9,75,220,119.12,75
6,8.0,64,197,120.814433,97
7,8.1,67,212,120.0,103
8,8.2,83,238,126.470588,51
9,8.3,81,224,132.162791,43


The star rating with the highest mean duration is 9.1 (200 minutes), but only one movie has that rating: The Godfather: Part II, as the head() output shows.

It is interesting that the lowest-rated films (7.4) also have the shortest average duration, about 113.5 minutes.
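Since a mean based on a single film can be misleading, it may help to keep the group size next to it. A small sketch with named aggregation, using the ten rating and duration pairs from the head() and tail() output; the column names `mean_duration` and `n_films` are my own choice.

```python
import pandas as pd

# Rating/duration pairs copied from the head() and tail() output above.
sample = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1, 9.0, 8.9, 7.4, 7.4, 7.4, 7.4, 7.4],
    "duration": [142, 175, 200, 152, 154, 116, 118, 138, 114, 126],
})

# Named aggregation gives flat, readable column names instead of a two-level header.
summary = (
    sample.groupby("star_rating")
    .agg(mean_duration=("duration", "mean"), n_films=("duration", "count"))
    .sort_values("mean_duration", ascending=False)
)
print(summary)
```

In this sample the 9.1 row comes first, with `n_films` equal to 1, which makes it obvious that the top value rests on one film.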

In [105]:
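#numeric_only=True skips the text columns, which newer pandas versions no longer drop automatically
content = movie_df.groupby(['content_rating']).mean(numeric_only=True)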
content

content_rating,star_rating,duration
APPROVED,8.02766,113.914894
G,7.990625,112.34375
GP,7.933333,135.666667
NC-17,7.614286,119.857143
NOT RATED,8.123077,122.661538
PASSED,8.157143,104.285714
PG,7.879675,115.300813
PG-13,7.828571,127.195767
R,7.854783,122.163043
TV-MA,8.1,131.0


In [106]:
genre = movie_df.groupby(['genre']).mean(numeric_only=True)
genre

genre,star_rating,duration
Action,7.884559,126.485294
Adventure,7.933333,134.84
Animation,7.914516,96.596774
Biography,7.862338,131.844156
Comedy,7.822436,107.602564
Crime,7.916935,122.298387
Drama,7.902518,126.539568
Family,7.85,107.5
Fantasy,7.7,112.0
Film-Noir,8.033333,97.333333


In [107]:
#Nothing good will come out of this one, since almost every title is unique (975 distinct titles for 979 rows)
title = movie_df.groupby(['title']).mean(numeric_only=True)
title

title,star_rating,duration
(500) Days of Summer,7.8,95.0
12 Angry Men,8.9,96.0
12 Years a Slave,8.1,134.0
127 Hours,7.6,94.0
2001: A Space Odyssey,8.3,160.0
...,...,...
Zero Dark Thirty,7.4,157.0
Zodiac,7.7,157.0
Zombieland,7.7,88.0
Zulu,7.8,138.0


In [108]:
#Same here. This analysis is useless.
actors = movie_df.groupby(['actors_list']).mean(numeric_only=True)
actors

actors_list,star_rating,duration
"[u""Brian O'Halloran"", u'Jeff Anderson', u'Marilyn Ghigliotti']",7.9,92.0
"[u""Brian O'Halloran"", u'Jeff Anderson', u'Rosario Dawson']",7.5,97.0
"[u""Paige O'Hara"", u'Robby Benson', u'Richard White']",8.1,84.0
"[u""Peter O'Toole"", u'Alec Guinness', u'Anthony Quinn']",8.4,216.0
"[u""Ryan O'Neal"", u'Marisa Berenson', u'Patrick Magee']",8.1,184.0
...,...,...
"[u'Zach Galifianakis', u'Bradley Cooper', u'Justin Bartha']",7.8,100.0
"[u'Zbigniew Zamachowski', u'Julie Delpy', u'Janusz Gajos']",7.7,91.0
"[u'Zero Mostel', u'Gene Wilder', u'Dick Shawn']",7.7,88.0
"[u'Ziyi Zhang', u'Takeshi Kaneshiro', u'Andy Lau']",7.6,119.0


In [109]:
#Seeing if there is any relationship between the content rating of the movie and its duration
content.sort_values('duration',ascending = True)

content_rating,star_rating,duration
PASSED,8.157143,104.285714
X,7.925,106.25
UNRATED,7.994737,109.789474
G,7.990625,112.34375
APPROVED,8.02766,113.914894
PG,7.879675,115.300813
NC-17,7.614286,119.857143
R,7.854783,122.163043
NOT RATED,8.123077,122.661538
PG-13,7.828571,127.195767


In [110]:
#trying the same with the star rating
content.sort_values('star_rating',ascending=True)

content_rating,star_rating,duration
NC-17,7.614286,119.857143
PG-13,7.828571,127.195767
R,7.854783,122.163043
PG,7.879675,115.300813
X,7.925,106.25
GP,7.933333,135.666667
G,7.990625,112.34375
UNRATED,7.994737,109.789474
APPROVED,8.02766,113.914894
TV-MA,8.1,131.0


Interesting. NC-17 films have the lowest mean star rating (about 7.61), and PASSED films have the highest (about 8.16), though each of those categories has only seven films.
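Because some categories are tiny, one way to make this comparison a bit safer is to compute the count together with the mean and ignore very small groups. The sketch below uses the star ratings from the NC-17, PASSED and TV-MA tables in this notebook; the cutoff of 5 films is an arbitrary assumption, not something from the assignment.

```python
import pandas as pd

# Star ratings copied from the NC-17, PASSED and TV-MA tables in this notebook.
rows = (
    [("NC-17", r) for r in [7.9, 7.7, 7.6, 7.6, 7.6, 7.5, 7.4]]
    + [("PASSED", r) for r in [8.6, 8.4, 8.2, 8.1, 8.1, 8.0, 7.7]]
    + [("TV-MA", 8.1)]
)
sample = pd.DataFrame(rows, columns=["content_rating", "star_rating"])

# Mean and group size side by side.
stats = sample.groupby("content_rating")["star_rating"].agg(["mean", "count"])

# Hypothetical cutoff: only compare categories with at least 5 films.
big_enough = stats[stats["count"] >= 5]

lowest = big_enough["mean"].idxmin()
highest = big_enough["mean"].idxmax()
print(lowest, highest)
```

idxmin and idxmax return the category label directly, so there is no need to sort the whole table just to read off the first and last rows.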

In [111]:
#checking out what movies were rated NC-17
movie_df[movie_df['content_rating'] == 'NC-17']

,star_rating,title,content_rating,genre,duration,actors_list
456,7.9,Blue Is the Warmest Color,NC-17,Drama,179,"[u'L\xe9a Seydoux', u'Ad\xe8le Exarchopoulos',..."
604,7.7,Mysterious Skin,NC-17,Drama,105,"[u'Brady Corbet', u'Joseph Gordon-Levitt', u'E..."
715,7.6,Man Bites Dog,NC-17,Comedy,95,"[u'Beno\xeet Poelvoorde', u'Jacqueline Poelvoo..."
755,7.6,"Lust, Caution",NC-17,Drama,157,"[u'Tony Chiu Wai Leung', u'Wei Tang', u'Joan C..."
796,7.6,The Evil Dead,NC-17,Horror,85,"[u'Bruce Campbell', u'Ellen Sandweiss', u'Rich..."
915,7.5,Bad Education,NC-17,Crime,106,"[u'Gael Garc\xeda Bernal', u'Fele Mart\xednez'..."
972,7.4,Blue Valentine,NC-17,Drama,112,"[u'Ryan Gosling', u'Michelle Williams', u'John..."


In [112]:
#checking out what movies were rated 'Passed', whatever that means
#oh, this is an old rating. Very interesting!
movie_df[movie_df['content_rating'] == 'PASSED']

,star_rating,title,content_rating,genre,duration,actors_list
29,8.6,City Lights,PASSED,Comedy,87,"[u'Charles Chaplin', u'Virginia Cherrill', u'F..."
79,8.4,Double Indemnity,PASSED,Crime,107,"[u'Fred MacMurray', u'Barbara Stanwyck', u'Edw..."
159,8.2,The Best Years of Our Lives,PASSED,Drama,172,"[u'Fredric March', u'Dana Andrews', u'Myrna Loy']"
224,8.1,The Wizard of Oz,PASSED,Adventure,102,"[u'Judy Garland', u'Frank Morgan', u'Ray Bolger']"
293,8.1,Duck Soup,PASSED,Comedy,68,"[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']"
358,8.0,The Lady Vanishes,PASSED,Comedy,96,"[u'Margaret Lockwood', u'Michael Redgrave', u'..."
619,7.7,Forbidden Planet,PASSED,Action,98,"[u'Walter Pidgeon', u'Anne Francis', u'Leslie ..."


So, many of the PASSED films are very old, and since they still made the IMDB Top 1000 list, I am assuming they are the ones that are still considered very good, which would explain the high mean rating.

In [113]:
#Looking at the aggregations for star rating and duration based on Content Rating
movie_df.groupby("content_rating")[["star_rating","duration"]].agg(['min','max','mean','count']).reset_index()

,content_rating,star_rating,star_rating,star_rating,star_rating,duration,duration,duration,duration
,,min,max,mean,count,min,max,mean,count
0,APPROVED,7.5,8.7,8.02766,47,78,220,113.914894,47
1,G,7.4,8.6,7.990625,32,75,238,112.34375,32
2,GP,7.7,8.1,7.933333,3,91,172,135.666667,3
3,NC-17,7.4,7.9,7.614286,7,85,179,119.857143,7
4,NOT RATED,7.5,8.9,8.123077,65,68,189,122.661538,65
5,PASSED,7.7,8.6,8.157143,7,68,172,104.285714,7
6,PG,7.4,8.8,7.879675,123,76,224,115.300813,123
7,PG-13,7.4,9.0,7.828571,189,78,242,127.195767,189
8,R,7.4,9.3,7.854783,460,69,229,122.163043,460
9,TV-MA,8.1,8.1,8.1,1,131,131,131.0,1


R is by far the largest category: 460 of the 979 films, or about 47%, which is interesting for a Top 1000 list. The TV-MA category has the fewest films, just one.

The highest-rated movie is R-rated (9.3, The Shawshank Redemption). The highest mean rating, as seen earlier, belongs to PASSED films (older films that have managed to stay relevant and popular today?).

The minimum star rating of 7.4 is shared by several content-rating categories. TV-MA has the highest minimum (8.1), but only because there is just one TV-MA film in the database.

The longest film (242 minutes) is in the PG-13 category. The shortest film belongs to the Unrated category. On average, PASSED films have the shortest duration, while GP films have the longest, although GP has only three films.
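The min and max columns give the durations but not which films they belong to. One way to recover the titles is `idxmax` / `idxmin` on the grouped column, followed by `.loc`. A sketch using the NC-17 and PASSED rows from the filtered tables earlier; `longest` and `shortest` are names chosen for the example.

```python
import pandas as pd

# Titles and durations copied from the NC-17 and PASSED tables earlier.
sample = pd.DataFrame({
    "title": [
        "Blue Is the Warmest Color", "Mysterious Skin", "Man Bites Dog",
        "Lust, Caution", "The Evil Dead", "Bad Education", "Blue Valentine",
        "City Lights", "Double Indemnity", "The Best Years of Our Lives",
        "The Wizard of Oz", "Duck Soup", "The Lady Vanishes", "Forbidden Planet",
    ],
    "content_rating": ["NC-17"] * 7 + ["PASSED"] * 7,
    "duration": [179, 105, 95, 157, 85, 106, 112, 87, 107, 172, 102, 68, 96, 98],
})

# idxmax/idxmin return the row label of the extreme value inside each group.
by_rating = sample.groupby("content_rating")["duration"]
longest = sample.loc[by_rating.idxmax()]
shortest = sample.loc[by_rating.idxmin()]

print(longest[["content_rating", "title", "duration"]])
print(shortest[["content_rating", "title", "duration"]])
```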

In [114]:
#Checking out the TV-MA movie.
movie_df[movie_df["content_rating"] == "TV-MA"]

,star_rating,title,content_rating,genre,duration,actors_list
219,8.1,Who's Afraid of Virginia Woolf?,TV-MA,Drama,131,"[u'Elizabeth Taylor', u'Richard Burton', u'Geo..."


In [115]:
#Looking at aggregations for star ratings and durations based on the genre
movie_df.groupby("genre")[["star_rating","duration"]].agg(['min','max','mean','count']).reset_index()

,genre,star_rating,star_rating,star_rating,star_rating,duration,duration,duration,duration
,,min,max,mean,count,min,max,mean,count
0,Action,7.4,9.0,7.884559,136,80,205,126.485294,136
1,Adventure,7.4,8.9,7.933333,75,89,224,134.84,75
2,Animation,7.4,8.6,7.914516,62,75,134,96.596774,62
3,Biography,7.4,8.9,7.862338,77,85,202,131.844156,77
4,Comedy,7.4,8.6,7.822436,156,68,187,107.602564,156
5,Crime,7.4,9.3,7.916935,124,67,229,122.298387,124
6,Drama,7.4,8.9,7.902518,278,64,242,126.539568,278
7,Family,7.8,7.9,7.85,2,100,115,107.5,2
8,Fantasy,7.7,7.7,7.7,1,112,112,112.0,1
9,Film-Noir,7.7,8.3,8.033333,3,88,111,97.333333,3


The highest-rated film belongs to the Crime genre (The Shawshank Redemption, 9.3), while the lowest rating of 7.4 appears in several genres. The genre with the highest mean rating is Western, and the genre with the lowest mean rating is Thriller. Neither genre has many films, however.

The History genre has the shortest average duration, but the database has only one film in that genre. Animation is more telling: its 62 films average about 97 minutes. The genres with the longest mean durations are Western (highest) and Adventure (second-highest).

In [116]:
movie_df[movie_df['genre']=='Western']

,star_rating,title,content_rating,genre,duration,actors_list
6,8.9,"The Good, the Bad and the Ugly",NOT RATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
26,8.6,Once Upon a Time in the West,PG-13,Western,175,"[u'Henry Fonda', u'Charles Bronson', u'Claudia..."
59,8.5,Django Unchained,R,Western,165,"[u'Jamie Foxx', u'Christoph Waltz', u'Leonardo..."
107,8.3,For a Few Dollars More,APPROVED,Western,132,"[u'Clint Eastwood', u'Lee Van Cleef', u'Gian M..."
119,8.3,Unforgiven,R,Western,131,"[u'Clint Eastwood', u'Gene Hackman', u'Morgan ..."
236,8.1,High Noon,PG,Western,85,"[u'Gary Cooper', u'Grace Kelly', u'Thomas Mitc..."
263,8.1,Rio Bravo,NOT RATED,Western,141,"[u'John Wayne', u'Dean Martin', u'Ricky Nelson']"
421,7.9,The Outlaw Josey Wales,PG,Western,135,"[u'Clint Eastwood', u'Sondra Locke', u'Chief D..."
704,7.6,High Plains Drifter,R,Western,105,"[u'Clint Eastwood', u'Verna Bloom', u'Marianna..."


**Note**: We only have two columns with numerical values, so I don't know what else I can do.  
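One thing that still works with only two numeric columns is to turn one of them into a category and group by that. A sketch on the nine Westerns listed above: the 120-minute cutoff and the labels "2h or more" / "under 2h" are assumptions made up for the example.

```python
import pandas as pd

# The nine Western rows from the table above.
westerns = pd.DataFrame({
    "star_rating": [8.9, 8.6, 8.5, 8.3, 8.3, 8.1, 8.1, 7.9, 7.6],
    "content_rating": ["NOT RATED", "PG-13", "R", "APPROVED", "R",
                       "PG", "NOT RATED", "PG", "R"],
    "duration": [161, 175, 165, 132, 131, 85, 141, 135, 105],
})

# Hypothetical cutoff: label each film by whether it runs at least 120 minutes.
westerns["length"] = westerns["duration"].ge(120).map(
    {True: "2h or more", False: "under 2h"}
)

# Group by the derived label instead of an existing column.
by_length = westerns.groupby("length")["star_rating"].agg(["mean", "count"])
print(by_length)

# size() counts rows per group and needs no numeric column at all.
per_rating = westerns.groupby("content_rating").size()
print(per_rating)
```

With only nine films this says little about Westerns in general, but the same two steps could be run on the full movie_df.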

In [117]:
#I know this is useless, but I wanted to experiment.
movie_df.groupby(["title"])['genre'].min()

title
(500) Days of Summer        Comedy
12 Angry Men                 Drama
12 Years a Slave         Biography
127 Hours                Adventure
2001: A Space Odyssey      Mystery
                           ...    
Zero Dark Thirty             Drama
Zodiac                       Crime
Zombieland                  Comedy
Zulu                         Drama
[Rec]                       Horror
Name: genre, Length: 975, dtype: object
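The length of 975 here, against 979 rows in total, suggests that a few titles appear more than once. A possible way to list such rows is `duplicated(keep=False)`. The toy frame below reuses titles from the output above, but the repeated "Zulu" row is invented purely to have something to find.

```python
import pandas as pd

# Toy frame: the second "Zulu" row is a made-up duplicate for illustration.
toy = pd.DataFrame({
    "title": ["Zulu", "Zodiac", "Zulu", "Zombieland"],
    "genre": ["Drama", "Crime", "Drama", "Comedy"],
})

# keep=False marks every copy of a repeated title, not just the later ones.
repeated = toy[toy["title"].duplicated(keep=False)]
print(repeated)

# ngroups is the number of distinct titles; compare it with the row count.
n_groups = toy.groupby("title").ngroups
print(len(toy), n_groups)
```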