# Top Earners in Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This analysis project uses the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors, the top 5 highest-grossing movie genres of all time, how the revenues of the highest-grossing movies compare, and which production companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already placed the needed code for getting the packages and datafile that you will be using for the project. 

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('imdb-movies.csv')

### Drop columns without necessary information and remove all records with no financial information. Pay close attention to columns that tell you nothing about the financial data.

In [3]:
# Inspect the data to decide which columns you don't need.
df.head(10)
df.count()

id                      10866
imdb_id                 10856
popularity              10866
budget                  10866
revenue                 10866
original_title          10866
cast                    10790
homepage                 2936
director                10822
tagline                  8042
keywords                 9373
overview                10862
runtime                 10866
genres                  10843
production_companies     9836
release_date            10866
vote_count              10866
vote_average            10866
release_year            10866
budget_adj              10866
revenue_adj             10866
dtype: int64

### Data Cleaning

In [4]:
# Delete all records with null or empty values
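
The cell above is left empty in the notebook. One possible way to fill it is sketched below. It assumes that a `0` in `budget` or `revenue` means the figure was not recorded; the notebook only says to remove records with no financial information, so that reading is an assumption. The helper name `drop_missing_financials` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missing_financials(frame, money_cols=("budget", "revenue")):
    """Return a copy without rows whose money columns are null or zero.

    Assumption: a value of 0 in these columns means "not recorded".
    """
    cleaned = frame.copy()
    cols = list(money_cols)
    # Treat 0 as missing so that dropna() removes those rows as well.
    cleaned[cols] = cleaned[cols].replace(0, np.nan)
    return cleaned.dropna(subset=cols)


# Tiny made-up frame, only to show the behaviour.
demo = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 0, 50],
    "revenue": [300, 20, 0],
})
demo_clean = drop_missing_financials(demo)
print(demo_clean)
```

The outputs shown in the rest of this notebook were produced without such a step (rows with a budget and revenue of 0 are still present), so applying it to `df` would change the counts and sums reported further down.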



In [5]:
df = df.drop(['keywords', 'release_date', 'cast', 'runtime', 'homepage', 'overview', 'vote_count', 'vote_average', 'tagline'], axis=1)

In [6]:
df[df.isna().any(axis=1)].head(15)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,genres,production_companies,release_year,budget_adj,revenue_adj
228,300792,tt1618448,0.584363,0,0,Racing Extinction,Louie Psihoyos,Adventure|Documentary,,2015,0.0,0.0
259,360603,tt5133572,0.476341,0,0,Crown for Christmas,Alex Zamm,TV Movie,,2015,0.0,0.0
295,363483,tt5133810,0.417191,0,0,12 Gifts of Christmas,Peter Sullivan,Family|TV Movie,,2015,0.0,0.0
298,354220,tt3826866,0.370258,0,0,The Girl in the Photographs,Nick Simon,Crime|Horror|Thriller,,2015,0.0,0.0
328,308457,tt3090670,0.367617,0,0,Advantageous,Jennifer Phang,Science Fiction|Drama|Family,,2015,0.0,0.0
370,318279,tt2545428,0.314199,0,2334228,Meru,Jimmy Chin|Elizabeth Chai Vasarhelyi,Adventure|Documentary,,2015,0.0,2147489.0
374,206197,tt1015471,0.302474,0,0,The Sisterhood of Night,Caryn Waechter,Mystery|Drama|Thriller,,2015,0.0,0.0
382,306197,tt4145304,0.295946,0,0,Unexpected,Kris Swanberg,Drama|Comedy,,2015,0.0,0.0
388,323967,tt2016335,0.289526,700000,0,Walter,Anna Mastro,Drama|Comedy,,2015,643999.7,0.0
393,343284,tt3602128,0.283194,2000000,0,Night Of The Living Deb,Kyle Rankin,Comedy|Horror,,2015,1839999.0,0.0


In [7]:
df.isna().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
director                  44
genres                    23
production_companies    1030
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [8]:
df.dropna(inplace=True)
df.isna().sum()

id                      0
imdb_id                 0
popularity              0
budget                  0
revenue                 0
original_title          0
director                0
genres                  0
production_companies    0
release_year            0
budget_adj              0
revenue_adj             0
dtype: int64

In [9]:
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,genres,production_companies,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015,174799900.0,1385749000.0


#### Here's a helpful hint from my own analysis when I first ran this. It may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`,<br>and then tried to run calculations, it wouldn't work because for many records the number of `production_companies`<br>and the number of `genres` aren't the same. So I'll create 2 dataframes: one split into a row per entry in `production_companies` and one split into a row per entry in `genres`

In [39]:
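# Note: the `profit` column in the output of In [42] appears because this cell was run after In [37].
prod_df = df.drop(['budget_adj'], axis=1)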

In [40]:
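def splitDataFrameList(df, target_column, separator):
    """Return a new dataframe with one row per value in `target_column`.

    Each cell of `target_column` is split on `separator`; the other columns are repeated.
    """
    def splitListToRows(row, row_accumulator, target_column, separator):
        # Split this row's delimited string and emit one copy of the row per piece.
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    # apply() is used only for its side effect of filling new_rows.
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df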
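
As an alternative to the helper above, pandas' built-in `Series.str.split` and `DataFrame.explode` can produce the same one-row-per-value layout. The following is a minimal sketch, not the notebook's method; the function name `split_to_rows` and the demo frame are made up for illustration, and it assumes the target column has no missing values (which holds here after `dropna`).

```python
import pandas as pd


def split_to_rows(frame, target_column, separator="|"):
    """Give each separator-delimited value in target_column its own row."""
    out = frame.copy()
    # Turn "a|b|c" into ["a", "b", "c"]; a single-character separator is matched literally.
    out[target_column] = out[target_column].str.split(separator)
    # explode() repeats the other columns once per list element.
    return out.explode(target_column).reset_index(drop=True)


# Tiny made-up frame, only to show the behaviour.
demo = pd.DataFrame({
    "original_title": ["Film X", "Film Y"],
    "genres": ["Action|Comedy", "Drama"],
})
demo_rows = split_to_rows(demo, "genres")
print(demo_rows)
```

In the notebook this would be called the same way as the helper, for example `split_to_rows(prod_df, 'production_companies')`.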

In [41]:
prod_df = splitDataFrameList(prod_df, 'production_companies', '|')

In [42]:
prod_df

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,genres,production_companies,release_year,revenue_adj,profit
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios,2015,1.392446e+09,1363528810
1,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Amblin Entertainment,2015,1.392446e+09,1363528810
2,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Legendary Pictures,2015,1.392446e+09,1363528810
3,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Fuji Television Network,2015,1.392446e+09,1363528810
4,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Dentsu,2015,1.392446e+09,1363528810
...,...,...,...,...,...,...,...,...,...,...,...,...
23187,20379,tt0060472,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Joel Productions,1966,0.000000e+00,0
23188,20379,tt0060472,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Douglas & Lewis Productions,1966,0.000000e+00,0
23189,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,Mosfilm,1966,0.000000e+00,0
23190,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,Benedict Pictures Corp.,1966,0.000000e+00,0


<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

In [17]:
# Keep releases from 2006 onward, i.e. the last 10 years of the data (2006-2015).
year = prod_df.loc[prod_df['release_year'] > 2005]

In [18]:
year['production_companies'].value_counts().nlargest(10)

DreamWorks Animation       25
The Asylum                 21
Pixar Animation Studios    21
Marvel Studios             20
Walt Disney Pictures       18
Dimension Films            14
Lions Gate Films           12
New Line Cinema            12
Disney Channel             12
Warner Bros.               11
Name: production_companies, dtype: int64
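
Since this section asks for Matplotlib output and for the top 5 only, the counts above could be drawn as a horizontal bar chart. The sketch below hard-codes the five largest counts from the output above so that it runs on its own; in the notebook the series would presumably come from `year['production_companies'].value_counts().nlargest(5)`. The axis label and title are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Top five counts copied from the output above.
top5 = pd.Series(
    [25, 21, 21, 20, 18],
    index=[
        "DreamWorks Animation",
        "The Asylum",
        "Pixar Animation Studios",
        "Marvel Studios",
        "Walt Disney Pictures",
    ],
    name="production_companies",
)

fig, ax = plt.subplots(figsize=(8, 4))
# Reverse the order so the largest bar is drawn at the top of the chart.
ax.barh(list(top5.index[::-1]), list(top5.values[::-1]))
ax.set_xlabel("Number of movies")
ax.set_title("Top 5 production companies, releases after 2005")
fig.tight_layout()
```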

### What 5 movie genres grossed the highest all-time?

In [23]:
genre_df = df.drop(['popularity'], axis=1)
genre_df

Unnamed: 0,id,imdb_id,budget,revenue,original_title,director,genres,production_companies,release_year,budget_adj,revenue_adj
0,135397,tt0369610,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015,1.379999e+08,1.392446e+09
1,76341,tt1392190,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015,1.379999e+08,3.481613e+08
2,262500,tt2908446,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015,1.012000e+08,2.716190e+08
3,140607,tt2488496,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015,1.839999e+08,1.902723e+09
4,168259,tt2820852,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015,1.747999e+08,1.385749e+09
...,...,...,...,...,...,...,...,...,...,...,...
10861,21,tt0060371,0,0,The Endless Summer,Bruce Brown,Documentary,Bruce Brown Films,1966,0.000000e+00,0.000000e+00
10862,20379,tt0060472,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,1966,0.000000e+00,0.000000e+00
10863,39768,tt0060161,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,Mosfilm,1966,0.000000e+00,0.000000e+00
10864,21449,tt0061177,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,Benedict Pictures Corp.,1966,0.000000e+00,0.000000e+00


In [26]:
genre_df = splitDataFrameList(genre_df, 'genres', '|')
genre_df

Unnamed: 0,id,imdb_id,budget,revenue,original_title,director,genres,production_companies,release_year,budget_adj,revenue_adj
0,135397,tt0369610,150000000,1513528810,Jurassic World,Colin Trevorrow,Action,Universal Studios|Amblin Entertainment|Legenda...,2015,1.379999e+08,1.392446e+09
1,135397,tt0369610,150000000,1513528810,Jurassic World,Colin Trevorrow,Adventure,Universal Studios|Amblin Entertainment|Legenda...,2015,1.379999e+08,1.392446e+09
2,135397,tt0369610,150000000,1513528810,Jurassic World,Colin Trevorrow,Science Fiction,Universal Studios|Amblin Entertainment|Legenda...,2015,1.379999e+08,1.392446e+09
3,135397,tt0369610,150000000,1513528810,Jurassic World,Colin Trevorrow,Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015,1.379999e+08,1.392446e+09
4,76341,tt1392190,150000000,378436354,Mad Max: Fury Road,George Miller,Action,Village Roadshow Pictures|Kennedy Miller Produ...,2015,1.379999e+08,3.481613e+08
...,...,...,...,...,...,...,...,...,...,...,...
24705,39768,tt0060161,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery,Mosfilm,1966,0.000000e+00,0.000000e+00
24706,39768,tt0060161,0,0,Beregis Avtomobilya,Eldar Ryazanov,Comedy,Mosfilm,1966,0.000000e+00,0.000000e+00
24707,21449,tt0061177,0,0,"What's Up, Tiger Lily?",Woody Allen,Action,Benedict Pictures Corp.,1966,0.000000e+00,0.000000e+00
24708,21449,tt0061177,0,0,"What's Up, Tiger Lily?",Woody Allen,Comedy,Benedict Pictures Corp.,1966,0.000000e+00,0.000000e+00


In [27]:
genre_df.groupby("genres").revenue.sum().nlargest(5)

genres
Action       173418313979
Adventure    166317625752
Comedy       142141376544
Drama        138896772395
Thriller     121189561087
Name: revenue, dtype: int64

### Who are the top 5 grossing directors?

In [35]:
df.groupby("director").revenue.sum().nlargest(5)

director
Steven Spielberg     9018563772
Peter Jackson        6523244659
James Cameron        5841894863
Michael Bay          4917208171
Christopher Nolan    4167548502
Name: revenue, dtype: int64
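
One caveat: the `director` column is pipe-delimited for co-directed films (for example `Jimmy Chin|Elizabeth Chai Vasarhelyi` in the rows shown earlier), so grouping on the raw string treats each directing team as its own "director". The sketch below splits the column first and credits the full revenue to every co-director; that attribution rule is an assumption, and the demo data is made up.

```python
import pandas as pd

# Tiny made-up frame, only to show the behaviour.
demo = pd.DataFrame({
    "original_title": ["Film X", "Film Y", "Film Z"],
    "director": ["Ann Lee|Bo Chan", "Ann Lee", "Bo Chan"],
    "revenue": [100, 40, 10],
})

# One row per (movie, director) pair; each co-director gets the full revenue.
per_director = demo.assign(
    director=demo["director"].str.split("|")
).explode("director")

top_directors = per_director.groupby("director")["revenue"].sum().nlargest(5)
print(top_directors)
```

Applied to `df`, this might shift the totals for directors who share credits, so the ranking above should be read as a ranking of exact `director` strings.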

### Compare the revenue of the highest grossing movies of all time.

In [37]:
df['profit'] = df.revenue - df.budget
print("profit")
df.groupby("original_title").profit.sum().nlargest(15)

profit


original_title
Avatar                                           2544505847
Star Wars: The Force Awakens                     1868178225
Titanic                                          1632034188
Jurassic World                                   1363528810
Furious 7                                        1316249360
The Avengers                                     1288080742
Harry Potter and the Deathly Hallows: Part 2     1202817822
Frozen                                           1127284869
Avengers: Age of Ultron                          1125035767
The Net                                          1084279658
Minions                                          1082730962
The Lord of the Rings: The Return of the King    1024888979
Iron Man 3                                       1015439994
Transformers: Dark of the Moon                    928746996
Skyfall                                           908561013
Name: profit, dtype: int64
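
The cell above ranks by profit and sums per `original_title`, which would merge different films that happen to share a title. If the goal is to compare revenue, as the heading asks, one option is to rank the rows directly. This is a minimal sketch on made-up data, not the notebook's own result.

```python
import pandas as pd

# Tiny made-up frame with two different films that share a title.
demo = pd.DataFrame({
    "original_title": ["Remake", "Remake", "Solo"],
    "release_year": [1980, 2010, 2000],
    "budget": [10, 50, 30],
    "revenue": [40, 200, 90],
})

# Rank individual rows by revenue; no grouping, so same-title films stay separate.
top_revenue = demo.nlargest(2, "revenue")[
    ["original_title", "release_year", "revenue"]
]
print(top_revenue)
```

In the notebook the equivalent call would be `df.nlargest(15, 'revenue')` with the same column selection.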

<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

In [None]:
DreamWorks Animation released the most films over the last 10 years of the data (2006-2015), with 25 films.

In [None]:
Action is the highest-grossing genre of all time, with total revenue of 173,418,313,979.

In [None]:
Steven Spielberg is the highest-grossing director, with total revenue of 9,018,563,772.

In [None]:
Avatar is the most profitable movie of all time, with a profit (revenue minus budget) of 2,544,505,847.