In [1]:
%load_ext lab_black
import pandas as pd
import altair as alt

For my final project, I wanted to look at IMDB top 1000 movie data like we did earlier in the term, but focus on financial information. I first found a dataset for the top 1000 movies that included gross earnings for each movie, but it turned out to be messy. For example, each movie had between 1 and 3 genres listed, and the top 4 billed stars were each in a different column. I originally tried writing a Boolean function to check what genre each movie was, but I got confused about how to run queries on the data set as a whole if each title needed its own yes/no check for every genre. I decided to limit each movie to its first-listed genre, or what IMDB considered its "primary" genre.

I played around with which actors were in the most movies on the list, what the highest grossing movie overall was, and which director had the most movies on the IMDB list. After pulling out the Star1, Star2, Star3 and Star4 columns, I saw that the actor who appears most often is Tom Hanks, with 12 first-billed appearances.

My end goal was to look for a correlation between genre and gross. My personal guess was that Action movies would make the most money. After limiting each movie to one genre, which I did with a string split function, I found that the highest grossing genres for both mean and median gross were Family, Action and Animation. The lowest grossing genres for mean and median were Film-Noir, Western, and Thriller. After charting median and mean gross by genre, I found that Family movies actually had substantially higher mean and median gross earnings than Action. However, there were only 2 Family movies: "E.T. the Extra-Terrestrial" and "Willy Wonka & the Chocolate Factory." There were 172 Action movies, though, which accounts for over 1/10 of the list, and for this reason I conclude that Action was the most financially successful genre on the IMDB top 1000 list.

I think this data collection could lead into an interesting article that looks at how different genres have become more financially successful over time. The two highest earning movies on the IMDB top 1000 list were "Star Wars: Episode VII - The Force Awakens" and "Avengers: Endgame," which are both Action movies. The rise of Marvel movies in the last decade has popularized the Action genre perhaps more than ever before, and created some of the highest earning franchises ever. I would love to find the IMDB top 1000 list from 5 years ago, 10 years ago, and 20 years ago and run the same analysis to see whether more Action movies have been added over time, and whether the median and mean gross earnings for Action movies have gone up. This would be a great way to start talking about how certain studios, which I would call cinematic conglomerates at this point, can have a large hand in which genres are successful and what types of movies other studios make.

Thank you for an amazing two semesters, Matt! I'm so glad I took this class and got to build on everything from the fall. It's been incredibly interesting and worthwhile.


### Import Data

In [2]:
movies_df = pd.read_csv("data/raw/imdb_top_1000.csv")

In [3]:
movies_df.head(2)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411


### Clean up and rename columns

In [4]:
movies_df = movies_df[
    [
        "Series_Title",
        "Released_Year",
        "Runtime",
        "Genre",
        "IMDB_Rating",
        "Overview",
        "Meta_score",
        "Director",
        "Star1",
        "Star2",
        "Star4",
        "Star3",
        "Gross",
    ]
]

In [5]:
movies_df = movies_df.rename(
    {
        "Series_Title": "title",
        "Released_Year": "year",
        "Runtime": "runtime",
        "Genre": "genre",
        "IMDB_Rating": "imdb_rating",
        "Overview": "synopsis",
        "Meta_score": "meta_score",
        "Director": "director",
        "Star1": "star1",
        "Star2": "star2",
        "Star3": "star3",
        "Star4": "star4",
        "Gross": "gross",
    },
    axis=1,
)

In [6]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        1000 non-null   object 
 1   year         1000 non-null   object 
 2   runtime      1000 non-null   object 
 3   genre        1000 non-null   object 
 4   imdb_rating  1000 non-null   float64
 5   synopsis     1000 non-null   object 
 6   meta_score   843 non-null    float64
 7   director     1000 non-null   object 
 8   star1        1000 non-null   object 
 9   star2        1000 non-null   object 
 10  star4        1000 non-null   object 
 11  star3        1000 non-null   object 
 12  gross        831 non-null    object 
dtypes: float64(2), object(11)
memory usage: 101.7+ KB


### Change 'gross' column to float64, remove commas

In [7]:
movies_df["gross"] = movies_df["gross"].str.replace(",", "")
movies_df.head(3)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
0,The Shawshank Redemption,1994,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,William Sadler,Bob Gunton,28341469
1,The Godfather,1972,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,Diane Keaton,James Caan,134966411
2,The Dark Knight,2008,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Michael Caine,Aaron Eckhart,534858444


In [8]:
# Cast the cleaned strings to floats; missing values stay as NaN.
movies_df["gross"] = movies_df["gross"].astype("float64")

In [9]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        1000 non-null   object 
 1   year         1000 non-null   object 
 2   runtime      1000 non-null   object 
 3   genre        1000 non-null   object 
 4   imdb_rating  1000 non-null   float64
 5   synopsis     1000 non-null   object 
 6   meta_score   843 non-null    float64
 7   director     1000 non-null   object 
 8   star1        1000 non-null   object 
 9   star2        1000 non-null   object 
 10  star4        1000 non-null   object 
 11  star3        1000 non-null   object 
 12  gross        831 non-null    float64
dtypes: float64(3), object(10)
memory usage: 101.7+ KB
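
The two-step conversion above (strip commas, then cast) can also be written as a single expression with `pd.to_numeric`. The sketch below is one possible variant, not what this notebook ran: the `raw` frame is a small hypothetical sample, and `errors="coerce"` is an assumption that any unparseable value should become NaN rather than raise an error.

```python
import pandas as pd

# Hypothetical sample mimicking the raw "gross" strings (commas, one missing value).
raw = pd.DataFrame(
    {"gross": pd.Series(["28,341,469", "134,966,411", None, "4,360,000"], dtype="object")}
)

# Strip the thousands separators, then parse to numbers.
# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
raw["gross"] = pd.to_numeric(
    raw["gross"].str.replace(",", "", regex=False), errors="coerce"
)

print(raw["gross"])
```

Passing `regex=False` makes it explicit that the comma is a literal character, not a pattern.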


### Check what genre names look like

In [10]:
movies_df.genre.value_counts()

Drama                           85
Drama, Romance                  37
Comedy, Drama                   35
Comedy, Drama, Romance          31
Action, Crime, Drama            30
                                ..
Action, Comedy, Mystery          1
Horror, Mystery, Sci-Fi          1
Action, Adventure, Family        1
Adventure, Comedy, Film-Noir     1
Crime, Drama, Musical            1
Name: genre, Length: 202, dtype: int64

### Later on, I want to look at the correlation between the genre of a movie and how much money it makes.
### It will be hard to run queries on the genre column while it has so many distinct values (202 combinations), so I will use a string split to drop everything after the first delimiter (comma) and keep only the first-listed genre for each movie.

In [11]:
movies_df["genre"] = movies_df["genre"].str.split(", ", expand=True)[0]
###movies_df["genre"] = movies_df["genre"].str.rsplit(", ", n=1).str.get(0)

In [12]:
movies_df.head()

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
0,The Shawshank Redemption,1994,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,William Sadler,Bob Gunton,28341469.0
1,The Godfather,1972,175 min,Crime,9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,Diane Keaton,James Caan,134966411.0
2,The Dark Knight,2008,152 min,Action,9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Michael Caine,Aaron Eckhart,534858444.0
3,The Godfather: Part II,1974,202 min,Crime,9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Diane Keaton,Robert Duvall,57300000.0
4,12 Angry Men,1957,96 min,Crime,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,John Fiedler,Martin Balsam,4360000.0
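
Keeping only the first-listed genre is the approach used in this notebook. A possible alternative, sketched here as an assumption and not something the analysis above relies on, is to keep every genre by splitting the string into a list and using `explode` so each movie gets one row per genre. The `sample` frame reuses three rows from the output above; `long_genres` is a made-up name.

```python
import pandas as pd

# Three rows copied from the head() output, before the genre column was trimmed.
sample = pd.DataFrame(
    {
        "title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
        "genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
        "gross": [28341469.0, 134966411.0, 534858444.0],
    }
)

# Split each genre string into a list, then give every genre its own row.
long_genres = sample.assign(genre=sample["genre"].str.split(", ")).explode("genre")

# Each movie now counts once toward every genre it is tagged with.
print(long_genres.groupby("genre")["gross"].mean())
```

The trade-off is that a movie's gross is counted in several genres at once, so totals across genres no longer add up to the total for the list.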


### What is the highest rated crime movie?

In [13]:
movies_df[movies_df["genre"].str.contains("Crime")].sort_values(
    "imdb_rating", ascending=False
).head(5)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
1,The Godfather,1972,175 min,Crime,9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,Diane Keaton,James Caan,134966411.0
4,12 Angry Men,1957,96 min,Crime,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,John Fiedler,Martin Balsam,4360000.0
3,The Godfather: Part II,1974,202 min,Crime,9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Diane Keaton,Robert Duvall,57300000.0
6,Pulp Fiction,1994,154 min,Crime,8.9,"The lives of two mob hitmen, a boxer, a gangst...",94.0,Quentin Tarantino,John Travolta,Uma Thurman,Bruce Willis,Samuel L. Jackson,107928762.0
22,Cidade de Deus,2002,130 min,Crime,8.6,"In the slums of Rio, two kids' paths diverge a...",79.0,Fernando Meirelles,Kátia Lund,Alexandre Rodrigues,Matheus Nachtergaele,Leandro Firmino,7563397.0


### What were the 3 highest grossing movies on the list?

In [93]:
movies_df.sort_values("gross", ascending=False).head(3)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
477,Star Wars: Episode VII - The Force Awakens,2015,138 min,Action,7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Domhnall Gleeson,Oscar Isaac,936662225.0
59,Avengers: Endgame,2019,181 min,Action,8.4,After the devastating events of Avengers: Infi...,78.0,Anthony Russo,Joe Russo,Robert Downey Jr.,Mark Ruffalo,Chris Evans,858373000.0
623,Avatar,2009,162 min,Action,7.8,A paraplegic Marine dispatched to the moon Pan...,83.0,James Cameron,Sam Worthington,Zoe Saldana,Michelle Rodriguez,Sigourney Weaver,760507625.0


### What was the highest grossing thriller movie?

In [17]:
movies_df[movies_df["genre"].str.contains("Thriller")].sort_values(
    "gross", ascending=False
).head(1)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
700,Wait Until Dark,1967,108 min,Thriller,7.8,A recently blinded woman is terrorized by a tr...,81.0,Terence Young,Audrey Hepburn,Alan Arkin,Efrem Zimbalist Jr.,Richard Crenna,17550741.0


### I want to look at which stars were in the highest number of IMDB top 1000 movies, but the top 4 billed stars are in separate columns.
### As a first step, I will pull the four star columns into their own DataFrame.

In [19]:
stars = movies_df[["star1", "star2", "star4", "star3"]]

### Which star appears most often in each billing position on the IMDB top 1000 list?

In [20]:
stars.mode()

Unnamed: 0,star1,star2,star4,star3
0,Tom Hanks,Emma Watson,Michael Caine,Rupert Grint
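
`mode()` works column by column, so the result above is the most common name in each billing position, not across all four. One way to count appearances regardless of position is to reshape the star columns into a single column with `melt` and then call `value_counts`. This is a sketch on a hypothetical three-movie sample with placeholder names; `billing` and `appearances` are made-up variable names.

```python
import pandas as pd

# Hypothetical sample with placeholder names; the real frame has one row per movie.
sample = pd.DataFrame(
    {
        "title": ["Movie 1", "Movie 2", "Movie 3"],
        "star1": ["Star A", "Star B", "Star C"],
        "star2": ["Star B", "Star A", "Star A"],
        "star3": ["Star D", "Star E", "Star F"],
        "star4": ["Star G", "Star H", "Star I"],
    }
)

star_cols = ["star1", "star2", "star3", "star4"]

# Reshape wide -> long: one row per (movie, billed star) pair.
billing = sample.melt(
    id_vars="title", value_vars=star_cols, var_name="billing", value_name="star"
)

# Count appearances regardless of billing position.
appearances = billing["star"].value_counts()
print(appearances.head(3))
```

Here "Star A" is first-billed only once but appears in all three movies, which a per-column `mode()` would not show.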


### How many movies on the IMDB list did each of these stars appear in?

In [21]:
movies_df[movies_df["star1"] == "Tom Hanks"].count()

title          12
year           12
runtime        12
genre          12
imdb_rating    12
synopsis       12
meta_score     12
director       12
star1          12
star2          12
star4          12
star3          12
gross          12
dtype: int64

In [22]:
movies_df[movies_df["star2"] == "Emma Watson"].count()

title          7
year           7
runtime        7
genre          7
imdb_rating    7
synopsis       7
meta_score     7
director       7
star1          7
star2          7
star4          7
star3          7
gross          7
dtype: int64

In [23]:
movies_df[movies_df["star3"] == "Rupert Grint"].count()

title          5
year           5
runtime        5
genre          5
imdb_rating    5
synopsis       5
meta_score     5
director       5
star1          5
star2          5
star4          5
star3          5
gross          5
dtype: int64

In [24]:
movies_df[movies_df["star4"] == "Michael Caine"].count()

title          4
year           4
runtime        4
genre          4
imdb_rating    4
synopsis       4
meta_score     4
director       4
star1          4
star2          4
star4          4
star3          4
gross          4
dtype: int64

### Tom Hanks was in the most movies on the list. What movies were they?

In [89]:
movies_df[movies_df["star1"].str.contains("Tom Hanks")].sort_values(
    "title", ascending=True
)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
966,Apollo 13,PG,140 min,Adventure,7.6,NASA must devise a strategy to return Apollo 1...,77.0,Ron Howard,Tom Hanks,Bill Paxton,Gary Sinise,Kevin Bacon,173837933.0
890,Bridge of Spies,2015,142 min,Drama,7.6,"During the Cold War, an American lawyer is rec...",81.0,Steven Spielberg,Tom Hanks,Mark Rylance,Amy Ryan,Alan Alda,72313754.0
604,Captain Phillips,2013,134 min,Adventure,7.8,The true story of Captain Richard Phillips and...,82.0,Paul Greengrass,Tom Hanks,Barkhad Abdi,Catherine Keener,Barkhad Abdirahman,107100855.0
647,Cast Away,2000,143 min,Adventure,7.8,A FedEx executive undergoes a physical and emo...,73.0,Robert Zemeckis,Tom Hanks,Helen Hunt,Lari White,Paul Sanchez,233632142.0
11,Forrest Gump,1994,142 min,Drama,8.8,"The presidencies of Kennedy and Johnson, the e...",82.0,Robert Zemeckis,Tom Hanks,Robin Wright,Sally Field,Gary Sinise,330252182.0
818,Philadelphia,1993,125 min,Drama,7.7,When a man with HIV is fired by his law firm b...,66.0,Jonathan Demme,Tom Hanks,Denzel Washington,Buzz Kilman,Roberta Maxwell,77324422.0
791,Road to Perdition,2002,117 min,Crime,7.7,"A mob enforcer's son witnesses a murder, forci...",72.0,Sam Mendes,Tom Hanks,Tyler Hoechlin,Liam Aiken,Rob Maxey,104454762.0
24,Saving Private Ryan,1998,169 min,Drama,8.6,"Following the Normandy Landings, a group of U....",91.0,Steven Spielberg,Tom Hanks,Matt Damon,Edward Burns,Tom Sizemore,216540909.0
25,The Green Mile,1999,189 min,Crime,8.6,The lives of guards on Death Row are affected ...,61.0,Frank Darabont,Tom Hanks,Michael Clarke Duncan,Bonnie Hunt,David Morse,136801374.0
101,Toy Story,1995,81 min,Animation,8.3,A cowboy doll is profoundly threatened and jea...,95.0,John Lasseter,Tom Hanks,Tim Allen,Jim Varney,Don Rickles,191796233.0


### What directors appear on this list the most often?

In [25]:
movies_df.director.mode()

0    Alfred Hitchcock
dtype: object

### What directors on the list had the highest grossing movies on average? Let's say that a financially successful movie grossed over 100 million dollars.

In [32]:
high_grossing = movies_df[movies_df["gross"] > 100000000.0]

In [94]:
# Average the numeric columns per director, using the filtered frame from above.
high_grossing.groupby("director").mean(numeric_only=True).sort_values(
    "gross", ascending=False
).head(10)

Unnamed: 0_level_0,imdb_rating,meta_score,gross
director,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Joss Whedon,8.0,69.0,623279500.0
Anthony Russo,8.075,72.75,551259900.0
James Cameron,8.033333,77.666667,541558800.0
Gareth Edwards,7.8,65.0,532177300.0
J.J. Abrams,7.833333,78.0,474390300.0
Josh Cooley,7.8,84.0,434038000.0
Roger Allers,8.5,88.0,422783800.0
Tim Miller,8.0,65.0,363070700.0
James Gunn,7.8,71.5,361494900.0
Brad Bird,7.866667,88.666667,358822800.0


### I want to see if there is a correlation between the genre of a movie and how financially successful it is.

In [28]:
### Import csv of all genre names

In [29]:
###genre_list = pd.read_csv("data/raw/genres.csv")

In [30]:
###genres_movies = movies_df["genre"].values()
###genres_movies = pd.merge(movies_df, genre_list, left_on="genre", right_on="Genre")

In [31]:
###genres_movies[genres_movies["title"] == "Star Wars: Episode VII - The Force Awakens"]

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross,Genre
471,Star Wars: Episode VII - The Force Awakens,2015,138 min,Action,7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Domhnall Gleeson,Oscar Isaac,936662225.0,Action


### What genres have the highest mean gross per film?

In [41]:
movies_df.groupby("genre").mean(numeric_only=True).sort_values(
    "gross", ascending=False
).head(3)

Unnamed: 0_level_0,imdb_rating,meta_score,gross
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Family,7.8,79.0,219555300.0
Action,7.949419,73.41958,141963100.0
Animation,7.930488,81.093333,127967500.0


### What genres have the lowest mean gross per film?

In [42]:
movies_df.groupby("genre").mean(numeric_only=True).sort_values(
    "gross", ascending=True
).head(3)

Unnamed: 0_level_0,imdb_rating,meta_score,gross
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Film-Noir,7.966667,95.666667,1278625.5
Western,8.35,78.25,14555377.0
Thriller,7.8,81.0,17550741.0


### Family movies have the highest mean gross per movie. What family movie made the most money?

In [92]:
movies_df[movies_df["genre"] == "Family"].sort_values("gross", ascending=False).head(2)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
688,E.T. the Extra-Terrestrial,1982,115 min,Family,7.8,A troubled child summons the courage to help a...,91.0,Steven Spielberg,Henry Thomas,Drew Barrymore,Dee Wallace,Peter Coyote,435110554.0
698,Willy Wonka & the Chocolate Factory,1971,100 min,Family,7.8,A poor but hopeful boy seeks one of the five c...,67.0,Mel Stuart,Gene Wilder,Jack Albertson,Roy Kinnear,Peter Ostrum,4000000.0


### What Action movie made the most money?

In [45]:
movies_df[movies_df["genre"] == "Action"].sort_values("gross", ascending=False).head(1)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
477,Star Wars: Episode VII - The Force Awakens,2015,138 min,Action,7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Domhnall Gleeson,Oscar Isaac,936662225.0


### What Animation movie made the most money?

In [46]:
movies_df[movies_df["genre"] == "Animation"].sort_values("gross", ascending=False).head(
    1
)

Unnamed: 0,title,year,runtime,genre,imdb_rating,synopsis,meta_score,director,star1,star2,star4,star3,gross
891,Incredibles 2,2018,118 min,Animation,7.6,The Incredibles family takes on a new mission ...,80.0,Brad Bird,Craig T. Nelson,Holly Hunter,Huck Milner,Sarah Vowell,608581744.0


### Okay, this query will also depend on the number of movies in each genre that are on the list and will include outliers, but I'm curious. What genre has the highest total gross for all of its movies on the list?

In [48]:
movies_df.groupby("genre").sum(numeric_only=True).sort_values(
    "gross", ascending=False
).head(3)

Unnamed: 0_level_0,imdb_rating,meta_score,gross
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,1367.3,10499.0,20016800000.0
Drama,2299.7,19208.0,9050484000.0
Animation,650.3,6082.0,8573824000.0


### How many Action movies are there, though?

In [51]:
movies_df[movies_df["genre"] == "Action"].count()

title          172
year           172
runtime        172
genre          172
imdb_rating    172
synopsis       172
meta_score     143
director       172
star1          172
star2          172
star4          172
star3          172
gross          141
dtype: int64

### Okay, over 1/10 of the movies are Action movies, and the highest grossing movie on the list was Action, so Action having the highest total gross doesn't mean much on its own. How many Family movies were there?

In [95]:
movies_df[movies_df["genre"] == "Family"].count()

title          2
year           2
runtime        2
genre          2
imdb_rating    2
synopsis       2
meta_score     2
director       2
star1          2
star2          2
star4          2
star3          2
gross          2
dtype: int64

### Wow, okay, only two movies. With a group that small, the Family average is driven almost entirely by E.T. (about 435 million dollars, while Willy Wonka grossed 4 million), so it isn't really comparable to Action's average across 172 movies.

### Okay, last query. What movie genres had the highest median gross earnings?

In [54]:
movies_df.groupby("genre").median(numeric_only=True).sort_values(
    "gross", ascending=False
).head(3)

Unnamed: 0_level_0,imdb_rating,meta_score,gross
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Family,7.8,79.0,219555277.0
Animation,7.9,82.0,75082668.0
Action,7.9,74.0,66208183.0
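
Because group size matters so much here (2 Family movies against 172 Action movies), it can help to see the count, mean and median side by side in one table. The sketch below uses named aggregation on a small sample. The Family and Action gross values are copied from the outputs above, the missing value is hypothetical, and the `MIN_FILMS` threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Gross values copied from the outputs above, plus one hypothetical missing value.
sample = pd.DataFrame(
    {
        "genre": ["Family", "Family", "Action", "Action", "Action", "Action"],
        "gross": [435110554.0, 4000000.0, 936662225.0, 858373000.0, 760507625.0, None],
    }
)

# One row per genre: how many movies have a gross value, plus mean and median.
summary = (
    sample.groupby("genre")["gross"]
    .agg(films="count", mean_gross="mean", median_gross="median")
    .sort_values("median_gross", ascending=False)
)

# Keep only genres with enough movies for an average to be meaningful.
MIN_FILMS = 3  # arbitrary threshold chosen for this illustration
reliable = summary[summary["films"] >= MIN_FILMS]
print(reliable)
```

`count` ignores missing gross values, so `films` reflects how many movies actually contribute to each average.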


### I'm going to use altair to make a graph charting median gross and mean gross for each genre.

### First, assign mean and median gross values to a variable.

In [96]:
# Mean and median of the numeric columns for each genre, with genre as a regular column.
genre_mean = movies_df.groupby("genre").mean(numeric_only=True).reset_index()
genre_median = movies_df.groupby("genre").median(numeric_only=True).reset_index()

In [97]:
alt.Chart(genre_mean).mark_bar(color="teal").encode(x="gross", y="genre").properties(
    width=650
)

In [98]:
alt.Chart(genre_median).mark_bar(color="purple").encode(
    x="gross", y="genre"
).properties(width=650)