# Top Earners in Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This project analyzes the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors and the top 5 highest-grossing movie genres of all time, compare the revenue of the highest-grossing movies, and determine which companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already added the code that installs the packages and loads the data file you will use for the project.

In [None]:
!pip install numpy
!pip install pandas
import csv
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [37]:
# Open a CSV file and return its rows as a list of lists
def open_csv(filename, d=','):
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(filename, encoding='utf-8', newline='') as mData:
        # csv.reader yields each row as a list of strings
        info = csv.reader(mData, delimiter=d)
        data = list(info)
    return data

csv_data = open_csv('../imdb-movies/files/imdb-movies.csv')

print(csv_data[1:2])

[['135397', 'tt0369610', '32.985763', '150000000', '1513528810', 'Jurassic World', "Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson", 'http://www.jurassicworld.com/', 'Colin Trevorrow', 'The park is open.', 'monster|dna|tyrannosaurus rex|velociraptor|island', 'Twenty-two years after the events of Jurassic Park, Isla Nublar now features a fully functioning dinosaur theme park, Jurassic World, as originally envisioned by John Hammond.', '124', 'Action|Adventure|Science Fiction|Thriller', 'Universal Studios|Amblin Entertainment|Legendary Pictures|Fuji Television Network|Dentsu', '6/9/2015', '5562', '6.5', '2015', '137999939.3', '1392445893']]


In [41]:
# pandas read csv
movies = pd.read_csv('../imdb-movies/files/imdb-movies.csv', sep=',')
movies.head(1)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0


### Drop columns without necessary information and remove all records with no financial information -- Pay close attention to columns that tell you nothing about financial data

In [45]:
movies[['original_title','budget','budget_adj','revenue','revenue_adj']]

Unnamed: 0,original_title,budget,budget_adj,revenue,revenue_adj
0,Jurassic World,150000000,1.379999e+08,1513528810,1.392446e+09
1,Mad Max: Fury Road,150000000,1.379999e+08,378436354,3.481613e+08
2,Insurgent,110000000,1.012000e+08,295238201,2.716190e+08
3,Star Wars: The Force Awakens,200000000,1.839999e+08,2068178225,1.902723e+09
4,Furious 7,190000000,1.747999e+08,1506249360,1.385749e+09
...,...,...,...,...,...
10861,The Endless Summer,0,0.000000e+00,0,0.000000e+00
10862,Grand Prix,0,0.000000e+00,0,0.000000e+00
10863,Beregis Avtomobilya,0,0.000000e+00,0,0.000000e+00
10864,"What's Up, Tiger Lily?",0,0.000000e+00,0,0.000000e+00
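The selection above displays the financial columns but still includes rows whose budget and revenue are 0. Below is a minimal sketch of one way to drop unneeded columns and filter out those rows. It assumes a zero means the figure was not reported. `movies_sample` and the columns chosen for dropping are illustrative stand-ins, not the notebook's actual cleaning step.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real `movies` DataFrame.
movies_sample = pd.DataFrame({
    "original_title": ["Jurassic World", "Grand Prix", "Insurgent"],
    "homepage": ["http://www.jurassicworld.com/", None, None],
    "budget_adj": [137999939.3, 0.0, 101200000.0],
    "revenue_adj": [1392445893.0, 0.0, 271619000.0],
})

# Drop columns that carry no financial information.
# errors="ignore" skips any listed column that is not present in the frame.
trimmed = movies_sample.drop(columns=["homepage", "tagline"], errors="ignore")

# Assumption: a zero budget or revenue means "not reported", so remove those rows.
has_financials = (trimmed["budget_adj"] > 0) & (trimmed["revenue_adj"] > 0)
financial = trimmed[has_financials].reset_index(drop=True)

print(financial)
```

On the real data, the same boolean mask could be applied to `movies` once the non-financial columns have been chosen.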


### Data Cleaning

In [9]:
# Delete all records with null or empty values
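This cell was left empty. One possible way to fill it is sketched below, assuming that "empty" means an empty or whitespace-only string. The `sample` frame is hypothetical; on the real data the same chain would be applied to `movies`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with a null director, one with an empty director.
sample = pd.DataFrame({
    "director": ["Colin Trevorrow", None, ""],
    "genres": ["Action|Adventure", "Drama", "Comedy"],
    "revenue_adj": [1392445893.0, 5000000.0, 1200000.0],
})

# Convert empty or whitespace-only strings to NaN so dropna() catches them too,
# then drop every row that has at least one missing value.
cleaned = (
    sample.replace(r"^\s*$", np.nan, regex=True)
    .dropna()
    .reset_index(drop=True)
)

print(cleaned)
```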



#### Here's a helpful hint from my own analysis when I ran this the first time. This may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`,<br>and then tried to run calculations, it wouldn't work, because for many records the numbers of `production_companies`<br>and `genres` aren't the same. So I'll create 2 dataframes: one without a `production_companies` column and one without a `genres` column.
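The two-dataframe idea in the hint could be sketched as follows. This is an illustration under assumptions, not the author's code: `explode_column` is a hypothetical helper, and `sample` holds two rows copied from the outputs in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "original_title": ["Avatar", "Star Wars"],
    "genres": [
        "Action|Adventure|Fantasy|Science Fiction",
        "Adventure|Action|Science Fiction",
    ],
    "production_companies": [
        "Ingenious Film Partners|Twentieth Century Fox Film Corporation",
        "Lucasfilm|Twentieth Century Fox Film Corporation",
    ],
    "revenue_adj": [2827124000.0, 2789712000.0],
})


def explode_column(df, column, sep="|"):
    """Return a copy of df with one row per value in a delimiter-separated column."""
    out = df.copy()
    out[column] = out[column].str.split(sep)
    return out.explode(column).reset_index(drop=True)


# One frame per multi-valued column, each without the other column,
# so the differing list lengths never have to line up.
by_genre = explode_column(sample.drop(columns="production_companies"), "genres")
by_company = explode_column(sample.drop(columns="genres"), "production_companies")

print(by_genre)
print(by_company)
```

Keeping the frames separate avoids multiplying rows by both lists at once, which would count each movie's revenue once per genre-company pair.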

<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

<ol>
    <li>Ingenious Film Partners|Twentieth Century Fox Film Corporation|Dune Entertainment|Lightstorm Entertainment</li>
    <li>Lucasfilm|Twentieth Century Fox Film Corporation</li>
    <li>Paramount Pictures|Twentieth Century Fox Film Corporation|Lightstorm Entertainment</li>
    <li>Warner Bros.|Hoya Productions</li>
    <li>Universal Pictures|Zanuck/Brown Productions</li>
</ol>

In [35]:
highest_gross_production_co = movies[['revenue_adj', 'production_companies','original_title']].sort_values(['revenue_adj'], ascending=False).reset_index(drop=True)
highest_gross_production_co.head(5)

Unnamed: 0,revenue_adj,production_companies,original_title
0,2827124000.0,Ingenious Film Partners|Twentieth Century Fox ...,Avatar
1,2789712000.0,Lucasfilm|Twentieth Century Fox Film Corporation,Star Wars
2,2506406000.0,Paramount Pictures|Twentieth Century Fox Film ...,Titanic
3,2167325000.0,Warner Bros.|Hoya Productions,The Exorcist
4,1907006000.0,Universal Pictures|Zanuck/Brown Productions,Jaws
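The cell above ranks individual movies by `revenue_adj`; counting releases per company over a recent window is a different calculation. Below is a minimal sketch of that count, assuming "the last 10 years" means the ten most recent values of `release_year` in the data. The titles and rows in `sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "production_companies": [
        "Universal Pictures|Amblin Entertainment",
        "Universal Pictures",
        "Warner Bros.",
        "Warner Bros.|Universal Pictures",
    ],
    "release_year": [2015, 2012, 1990, 2008],
})

# Keep only movies from the ten most recent release years in the data.
latest_year = sample["release_year"].max()
recent = sample[sample["release_year"] > latest_year - 10]

# Split the pipe-delimited names, give each company its own row, and count.
company_counts = (
    recent["production_companies"]
    .dropna()
    .str.split("|")
    .explode()
    .value_counts()
)
top_companies = company_counts.head(5)

print(top_companies)
```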


### Which 5 movie genres grossed the highest of all time?

<ol>
    <li>Action|Adventure|Fantasy|Science Fiction</li>
    <li>Adventure|Action|Science Fiction</li>
    <li>Drama|Romance|Thriller</li>
    <li>Drama|Horror|Thriller</li>
    <li>Horror|Thriller|Adventure</li>
</ol>

In [24]:
highest_gross_genres = movies[['revenue_adj', 'genres','imdb_id' ]].sort_values(['revenue_adj','genres'], ascending=False).reset_index(drop=True)
highest_gross_genres.head(5)

Unnamed: 0,revenue_adj,genres,imdb_id
0,2827124000.0,Action|Adventure|Fantasy|Science Fiction,tt0499549
1,2789712000.0,Adventure|Action|Science Fiction,tt0076759
2,2506406000.0,Drama|Romance|Thriller,tt0120338
3,2167325000.0,Drama|Horror|Thriller,tt0070047
4,1907006000.0,Horror|Thriller|Adventure,tt0073195
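The ranking above treats each pipe-delimited combination as a single genre. An alternative is to total revenue per individual genre, sketched below using three rows from the output above. Under this approach a movie's full revenue is counted once for every genre it lists, which is an assumption about how to attribute revenue and not something the notebook specifies.

```python
import pandas as pd

# Three rows taken from the output above (values as displayed, i.e. rounded).
sample = pd.DataFrame({
    "genres": [
        "Action|Adventure|Fantasy|Science Fiction",
        "Adventure|Action|Science Fiction",
        "Drama|Romance|Thriller",
    ],
    "revenue_adj": [2827124000.0, 2789712000.0, 2506406000.0],
})

# One row per (movie, genre) pair.
per_genre = sample.assign(genres=sample["genres"].str.split("|")).explode("genres")

# Total adjusted revenue per individual genre, highest first.
genre_revenue = (
    per_genre.groupby("genres")["revenue_adj"].sum().sort_values(ascending=False)
)

print(genre_revenue.head(5))
```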


### Who are the top 5 grossing directors?

<ol>
    <li>James Cameron</li>
    <li>George Lucas</li>
    <li>William Friedkin</li>
    <li>Steven Spielberg</li>
    <li>J.J. Abrams</li>
</ol>

In [19]:
highest_gross_directors = movies[['revenue_adj', 'director', 'imdb_id']].sort_values(['revenue_adj'], ascending=False).reset_index(drop=True)
highest_gross_directors.head(10)

Unnamed: 0,revenue_adj,director,imdb_id
0,2827124000.0,James Cameron,tt0499549
1,2789712000.0,George Lucas,tt0076759
2,2506406000.0,James Cameron,tt0120338
3,2167325000.0,William Friedkin,tt0070047
4,1907006000.0,Steven Spielberg,tt0073195
5,1902723000.0,J.J. Abrams,tt2488496
6,1791694000.0,Steven Spielberg,tt0083866
7,1583050000.0,Irwin Winkler,tt0113957
8,1574815000.0,Clyde Geronimi|Hamilton Luske|Wolfgang Reitherman,tt0055254
9,1443191000.0,Joss Whedon,tt0848228
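The list above is read off the per-movie ranking, so a director appears once for each film. A different way to answer the question is to sum `revenue_adj` per director, sketched below. The sample uses only the first seven rows of the output above with their displayed (rounded) values, so the totals are illustrative and do not reflect the full dataset.

```python
import pandas as pd

# First seven rows of the output above, values as displayed.
sample = pd.DataFrame({
    "director": [
        "James Cameron", "George Lucas", "James Cameron", "William Friedkin",
        "Steven Spielberg", "J.J. Abrams", "Steven Spielberg",
    ],
    "revenue_adj": [
        2827124000.0, 2789712000.0, 2506406000.0, 2167325000.0,
        1907006000.0, 1902723000.0, 1791694000.0,
    ],
})

# Sum adjusted revenue across each director's movies, highest total first.
director_totals = (
    sample.groupby("director")["revenue_adj"].sum().sort_values(ascending=False)
)

print(director_totals.head(5))
```

Rows that list several directors in one pipe-delimited string would need to be split first if each co-director should be credited separately.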


### Compare the revenue of the highest grossing movies of all time.

In [23]:
# Too many columns: select a few, sort by adjusted revenue, and reset the index
avgsortmovies = movies[['original_title','revenue_adj','imdb_id']].sort_values(['revenue_adj'], ascending=False).reset_index(drop=True)
avgsortmovies.head(5)

Unnamed: 0,original_title,revenue_adj,imdb_id
0,Avatar,2827124000.0,tt0499549
1,Star Wars,2789712000.0,tt0076759
2,Titanic,2506406000.0,tt0120338
3,The Exorcist,2167325000.0,tt0070047
4,Jaws,1907006000.0,tt0073195
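Since this section asks for Matplotlib output, the comparison above could also be shown as a bar chart. This is a minimal sketch using the five rows from the table above; the figure size, labels, and scaling to billions are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five rows shown in the output above, values as displayed.
top_movies = pd.DataFrame({
    "original_title": ["Avatar", "Star Wars", "Titanic", "The Exorcist", "Jaws"],
    "revenue_adj": [2827124000.0, 2789712000.0, 2506406000.0, 2167325000.0, 1907006000.0],
})

fig, ax = plt.subplots(figsize=(8, 4))

# Horizontal bars keep long titles readable; scale to billions for the axis.
ax.barh(top_movies["original_title"], top_movies["revenue_adj"] / 1e9)
ax.invert_yaxis()  # put the highest-grossing movie at the top
ax.set_xlabel("Adjusted revenue (billions)")
ax.set_title("Highest-grossing movies by revenue_adj")
fig.tight_layout()
```

With `%matplotlib inline` active, the figure renders below the cell without an explicit `plt.show()`.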


<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

In conclusion, the highest-grossing director is James Cameron, who directed the highest-grossing movie, 'Avatar', in the genres Action, Adventure, Fantasy, and Science Fiction.

In [21]:
# Sort by adjusted revenue and show the top 5 movies with their directors and genres
avgsortmovies = movies[['original_title','revenue_adj','director', 'genres','imdb_id']].sort_values(['revenue_adj'], ascending=False).reset_index(drop=True)
avgsortmovies.head(5)

Unnamed: 0,original_title,revenue_adj,director,genres,imdb_id
0,Avatar,2827124000.0,James Cameron,Action|Adventure|Fantasy|Science Fiction,tt0499549
1,Star Wars,2789712000.0,George Lucas,Adventure|Action|Science Fiction,tt0076759
2,Titanic,2506406000.0,James Cameron,Drama|Romance|Thriller,tt0120338
3,The Exorcist,2167325000.0,William Friedkin,Drama|Horror|Thriller,tt0070047
4,Jaws,1907006000.0,Steven Spielberg,Horror|Thriller|Adventure,tt0073195
