# Top Earners in Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This analysis project uses the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors, the top 5 highest-grossing movie genres of all time, how the revenues of the highest-grossing movies compare, and which production companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already placed the needed code for getting the packages and datafile that you will be using for the project. 

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('imdb-movies.csv')

### Drop columns without necessary information and remove all records with no financial information -- Pay close attention to things that don't tell you anything regarding financial data

In [3]:
df.head(1)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0


In [4]:
df.dtypes

id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

In [5]:
df.drop(['imdb_id', 'popularity', 'cast', 'homepage', 'tagline', 'keywords', 'overview', 'runtime', 'release_date', 'vote_average'], axis=1, inplace=True)
df.columns

Index(['id', 'budget', 'revenue', 'original_title', 'director', 'genres',
       'production_companies', 'vote_count', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   budget                10866 non-null  int64  
 2   revenue               10866 non-null  int64  
 3   original_title        10866 non-null  object 
 4   director              10822 non-null  object 
 5   genres                10843 non-null  object 
 6   production_companies  9836 non-null   object 
 7   vote_count            10866 non-null  int64  
 8   release_year          10866 non-null  int64  
 9   budget_adj            10866 non-null  float64
 10  revenue_adj           10866 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 933.9+ KB


In [7]:
df.duplicated().sum()

1
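
The check above reports one fully duplicated row, but the notebook never removes it. A minimal sketch of how that could be handled, using a small made-up frame (the titles and numbers here are hypothetical, not taken from the data file):

```python
import pandas as pd

# Toy frame standing in for the movie data (all values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["Movie A", "Movie B", "Movie B", "Movie C"],
    "revenue": [100, 250, 250, 0],
})

# Count rows that are exact copies of an earlier row
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

On the real frame the equivalent call would be `df.drop_duplicates(inplace=True)`, assuming an exact-row match is the right definition of a duplicate for this project.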

In [8]:
df.nunique().sum()

50045

### Data Cleaning

In [9]:
# Display all records that contain at least one null or empty value
df[df.isna().any(axis=1)]


Unnamed: 0,id,budget,revenue,original_title,director,genres,production_companies,vote_count,release_year,budget_adj,revenue_adj
228,300792,0,0,Racing Extinction,Louie Psihoyos,Adventure|Documentary,,36,2015,0.0,0.0
259,360603,0,0,Crown for Christmas,Alex Zamm,TV Movie,,10,2015,0.0,0.0
295,363483,0,0,12 Gifts of Christmas,Peter Sullivan,Family|TV Movie,,12,2015,0.0,0.0
298,354220,0,0,The Girl in the Photographs,Nick Simon,Crime|Horror|Thriller,,10,2015,0.0,0.0
328,308457,0,0,Advantageous,Jennifer Phang,Science Fiction|Drama|Family,,29,2015,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
10804,15867,0,0,Interiors,Woody Allen,Drama,,35,1978,0.0,0.0
10806,24998,0,0,Gates of Heaven,Errol Morris,Documentary,,12,1978,0.0,0.0
10816,16378,0,0,The Rutles: All You Need Is Cash,Eric Idle|Gary Weis,Comedy,,14,1978,0.0,0.0
10842,36540,0,0,Winnie the Pooh and the Honey Tree,Wolfgang Reitherman,Animation|Family,,12,1966,0.0,0.0


In [10]:
df.isnull().sum()

id                         0
budget                     0
revenue                    0
original_title             0
director                  44
genres                    23
production_companies    1030
vote_count                 0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [11]:
df.dropna(inplace=True)

In [12]:
df.isnull().sum()

id                      0
budget                  0
revenue                 0
original_title          0
director                0
genres                  0
production_companies    0
vote_count              0
release_year            0
budget_adj              0
revenue_adj             0
dtype: int64
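
Dropping nulls does not remove records without financial information, because those rows store `0` in `budget` and `revenue` rather than a null. A hedged sketch of one way to filter them, on made-up data; treating `0` as "not recorded" is my assumption, not something the data file states:

```python
import pandas as pd

# Toy frame (all values are made up for illustration)
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [50, 0, 30, 0],
    "revenue": [100, 0, 0, 20],
})

# Assumption: a 0 in `revenue` means the figure is missing, not a true zero gross
has_revenue = toy[toy["revenue"] > 0]

# Stricter variant: require both a budget and a revenue figure
has_both = toy[(toy["budget"] > 0) & (toy["revenue"] > 0)]

print(has_revenue["original_title"].tolist())
print(has_both["original_title"].tolist())
```

Which variant fits depends on the question: the revenue rankings only need `revenue`, so the looser filter keeps more movies.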

In [13]:
df.to_csv('imdb-movies_v2.csv', index=False)

In [14]:
df = pd.read_csv('imdb-movies_v2.csv')

#### Here's a helpful hint from my own analysis when I ran this the first time. This may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`<br>and then tried to run calculations, it wouldn't work, because for many records the number of `production_companies`<br>and the number of `genres` aren't the same. So I'll create 2 dataframes: one without a `production_companies` column and one without a `genres` column.
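
To illustrate the problem, here is a small sketch on a single made-up movie. It assumes the values are pipe-delimited strings, as in the `df.head()` output, and uses `str.split` with `DataFrame.explode`; this is one possible approach, not necessarily how the original analysis was run:

```python
import pandas as pd

# One hypothetical movie with 2 genres and 3 production companies
toy = pd.DataFrame({
    "original_title": ["Movie A"],
    "genres": ["Action|Adventure"],
    "production_companies": ["Studio X|Studio Y|Studio Z"],
    "revenue": [100],
})

# Turn each pipe-delimited string into a list
split = toy.assign(
    genres=toy["genres"].str.split("|"),
    production_companies=toy["production_companies"].str.split("|"),
)

# Exploding one column gives one row per genre
by_genre = split.drop(columns="production_companies").explode("genres")

# Exploding both columns in sequence gives 2 genres x 3 companies = 6 rows,
# so the movie's revenue would be counted 6 times in any sum
both = split.explode("genres").explode("production_companies")

print(len(by_genre), len(both), both["revenue"].sum())
```

Keeping two separate frames, each with only one of the list-like columns, avoids this cross product.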

In [15]:
df.head()

Unnamed: 0,id,budget,revenue,original_title,director,genres,production_companies,vote_count,release_year,budget_adj,revenue_adj
0,135397,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,2015,137999900.0,1392446000.0
1,76341,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,2015,137999900.0,348161300.0
2,262500,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2480,2015,101200000.0,271619000.0
3,140607,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,5292,2015,183999900.0,1902723000.0
4,168259,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2947,2015,174799900.0,1385749000.0


In [16]:
df.drop(['production_companies'], axis=1, inplace=True)

In [18]:
df.to_csv('df_imdb_movies_without_prod_companies.csv', index=False)

In [19]:
df = pd.read_csv('df_imdb_movies_without_prod_companies.csv')
df.columns

Index(['id', 'budget', 'revenue', 'original_title', 'director', 'genres',
       'vote_count', 'release_year', 'budget_adj', 'revenue_adj'],
      dtype='object')

In [25]:
df = pd.read_csv('imdb-movies_v2.csv')
df.head()

Unnamed: 0,id,budget,revenue,original_title,director,genres,production_companies,vote_count,release_year,budget_adj,revenue_adj
0,135397,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,2015,137999900.0,1392446000.0
1,76341,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,2015,137999900.0,348161300.0
2,262500,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2480,2015,101200000.0,271619000.0
3,140607,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,5292,2015,183999900.0,1902723000.0
4,168259,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2947,2015,174799900.0,1385749000.0


In [26]:
df.drop(['genres'], axis=1, inplace=True)

In [27]:
df.to_csv('df_imdb_movies_without_genres.csv', index=False)

In [28]:
df = pd.read_csv('df_imdb_movies_without_genres.csv')
df.columns

Index(['id', 'budget', 'revenue', 'original_title', 'director',
       'production_companies', 'vote_count', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

In [113]:
df = pd.read_csv('df_imdb_movies_without_genres.csv')
year = df.loc[df['release_year'] > 2005]  # keep only releases after 2005
year['production_companies'].value_counts().nlargest(5)

DreamWorks Animation       25
Marvel Studios             22
Pixar Animation Studios    21
The Asylum                 21
Walt Disney Pictures       18
Name: production_companies, dtype: int64
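
Note that `value_counts()` on the raw column counts each full string, so a movie listed as `Studio X|Studio Y` is not credited to either studio on its own. A hedged sketch of counting per individual company instead, on made-up data (the studio names are hypothetical):

```python
import pandas as pd

# Toy frame (all values are made up for illustration)
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "production_companies": ["Studio X", "Studio X|Studio Y", "Studio Y|Studio Z"],
    "release_year": [2004, 2010, 2014],
})

recent = toy[toy["release_year"] > 2005]

# Counting whole strings: every distinct combination is its own "company"
combo_counts = recent["production_companies"].value_counts()

# Counting individual companies: split on the pipe, then one row per company
company_counts = (
    recent["production_companies"]
    .str.split("|")
    .explode()
    .value_counts()
)

print(company_counts.nlargest(5))
```

Applied to the real data, this would likely change the ranking, since co-produced movies would then count toward each company involved.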

### What 5 movie genres grossed the highest all-time?

In [105]:
df = pd.read_csv('df_imdb_movies_without_prod_companies.csv')
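# Note: this groupby result is not assigned or displayed, and `genres` still holds pipe-joined combinations
df.groupby('genres').revenue_adj.sum()
# Displays the 5 highest-grossing individual movies (by adjusted revenue), not per-genre totals
df.sort_values(by=['revenue_adj'], ascending=False).head(5)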

Unnamed: 0,id,budget,revenue,original_title,director,genres,vote_count,release_year,budget_adj,revenue_adj
1254,19995,237000000,2781505847,Avatar,James Cameron,Action|Adventure|Fantasy|Science Fiction,8458,2009,240886900.0,2827124000.0
1199,11,11000000,775398007,Star Wars,George Lucas,Adventure|Action|Science Fiction,4428,1977,39575590.0,2789712000.0
4649,597,200000000,1845034188,Titanic,James Cameron,Drama|Romance|Thriller,4654,1997,271692100.0,2506406000.0
9545,9552,8000000,441306145,The Exorcist,William Friedkin,Drama|Horror|Thriller,1113,1973,39289280.0,2167325000.0
8790,578,7000000,470654000,Jaws,Steven Spielberg,Horror|Thriller|Adventure,1415,1975,28362750.0,1907006000.0
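
The table above ranks individual movies, so it does not directly answer the per-genre question. A minimal sketch of one way to total revenue per single genre, using made-up numbers; it assumes a movie's full revenue is credited to each of its genres, which is a modelling choice rather than something the data dictates:

```python
import pandas as pd

# Toy frame (all values are made up for illustration)
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
    "revenue_adj": [300.0, 100.0, 50.0],
})

# One row per (movie, genre) pair
per_genre = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Sum adjusted revenue within each genre and keep the 5 largest totals
genre_revenue = per_genre.groupby("genres")["revenue_adj"].sum().nlargest(5)

print(genre_revenue)
```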


### Who are the top 5 grossing directors?

In [103]:
df = pd.read_csv('df_imdb_movies_without_genres.csv')
df.groupby("director").revenue.sum().nlargest(5)

director
Steven Spielberg     9018563772
Peter Jackson        6523244659
James Cameron        5841894863
Michael Bay          4917208171
Christopher Nolan    4167548502
Name: revenue, dtype: int64

### Compare the revenue of the highest grossing movies of all time.

In [106]:
df = pd.read_csv('df_imdb_movies_without_genres.csv')
df.groupby('original_title').revenue_adj.sum().nlargest(10)

original_title
Avatar                            2.827124e+09
Star Wars                         2.789712e+09
Titanic                           2.506406e+09
The Exorcist                      2.167325e+09
Jaws                              1.907006e+09
Star Wars: The Force Awakens      1.902723e+09
E.T. the Extra-Terrestrial        1.791694e+09
The Net                           1.583050e+09
One Hundred and One Dalmatians    1.574815e+09
The Avengers                      1.508100e+09
Name: revenue_adj, dtype: float64
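
Since this section asks for Matplotlib, the comparison could also be shown as a chart. A hedged sketch with made-up titles and revenues; on the real data, `top` would be the `nlargest(10)` series computed above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame (all values are made up for illustration)
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "revenue_adj": [3.0e9, 1.5e9, 2.0e9],
})

top = toy.groupby("original_title")["revenue_adj"].sum().nlargest(3)

fig, ax = plt.subplots(figsize=(8, 4))
# Reverse the order so the largest bar is drawn at the top of the chart
ax.barh(top.index[::-1], top.values[::-1] / 1e9)
ax.set_xlabel("Adjusted revenue (billions of dollars)")
ax.set_title("Highest-grossing movies (toy data)")
fig.tight_layout()
```

A horizontal bar chart keeps long movie titles readable on the axis.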

<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

The production company that released the most movies in the last 10 years is:
    DreamWorks Animation, with 25 releases.

The genre combination of the highest-grossing movie, Avatar, is "Action|Adventure|Fantasy|Science Fiction".
    
The highest-grossing director is Steven Spielberg, with $9,018,563,772 in total revenue.

The highest-grossing movie is Avatar, with an adjusted revenue (`revenue_adj`) of about $2,827,124,000 (2.827124e+09).