
# Project: Investigate a Dataset (IMDB database)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.

> Fields included in the dataset:
> id, imdb_id, popularity, budget, revenue, original_title, cast, homepage, director, tagline, keywords, overview, runtime, genres, production_companies, release_date, vote_count, vote_average, release_year, budget_adj, revenue_adj


> This report investigates the IMDB database, focusing primarily on genres.
1. Plot profit vs. popularity to see whether the more popular movies ended up making more profit.
2. (i) Plot popularity rating vs. genre for a specific year to see which genres are the most successful.
     (ii) Plot profit vs. genre for a specific year.
     (iii) Build a table showing the genre ranking for that year. Assume that profit and popularity are correlated, and define success by popularity rating for the rest of the report.
3. Repeat this genre ranking for every year, with rank on the y-axis and year on the x-axis. Each genre will be a differently coloured point.
4. Use pie charts to show trending genres. For each studio, draw a pie chart of the genres it produces.
5. Success can't be assumed to depend only on a movie's genre. Analyse over a few years which quarter:
(i) had the most movie releases
(ii) had the most profit
6. Build a table of directors with the highest-grossing films.
7. State how extraneous factors are also involved in the success of a movie. Bad weather, political turmoil and similar events can stop people from going to the movies, as can how much advertising was done and how well the adverts were received. It is difficult to make accurate predictions that take in all of the different factors that could affect a movie's success. Make predictions of what the future trends are likely to be, based on the data analysed.
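
As a rough sketch of question 1: the dataset has no profit column, so the example below assumes profit can be taken as `revenue - budget` (an assumption, not something the dataset defines). The rows in `sample` are made-up values that only borrow the dataset's column names.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are invented for illustration.
sample = pd.DataFrame({
    "popularity": [1.0, 2.0, 3.0, 4.0],
    "budget": [10, 20, 30, 40],
    "revenue": [15, 40, 75, 120],
})

# Assumption: profit is revenue minus budget.
sample["profit"] = sample["revenue"] - sample["budget"]

# Pearson correlation between popularity and profit (a value between -1 and 1).
corr = sample["popularity"].corr(sample["profit"])
print(sample[["popularity", "profit"]])
print("correlation:", round(corr, 3))
```

A correlation close to 1 on the real data would support using popularity as a stand-in for success, but it would not show that popularity causes profit.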


In [1]:
# Set up import statements for all of the packages used in this report.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", color_codes=True)

# IPython 'magic word' so that visualizations are plotted inline with the notebook:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline


In [2]:
data = pd.read_csv('tmdb-movies.csv')

<a id='wrangling'></a>
## Data Wrangling



### General Properties

Take a look at what data is included in the dataset.

In [3]:
data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [4]:
data.shape

(10866, 21)

The data contains 21 variables (columns) and 10866 rows.
First, take a quick look at the data above to see whether any of it needs cleaning or trimming. On first look, these points stand out:
1. The cast, genres and production_companies fields contain lists that are separated by |. These need to be changed into Python lists.
2. The overview, tagline and homepage fields will not be useful for data analysis: they contain strings that are unique to each film, so they can't be compared across films. These fields won't be included in the new table.
3. Budget and revenue figures may be displayed in scientific (exponential) notation, so take this into account when analysing them and make sure they're stored as numeric values in Python.
4. Vote counts vary quite a bit. To make vote_average a more accurate representation, it should be standardised. Has this already been taken into account?
5. Note: release_date is in the American format of month/day/year.
6. Make a column that shows which quarter each release date is in.
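
On the vote count point: one common way to account for very different vote counts is a weighted (Bayesian-style) average that pulls ratings with few votes towards the overall mean. The sketch below is only an illustration of that idea, not something the dataset is known to apply. The threshold `m` and the rows in `votes` are assumptions chosen for the example.

```python
import pandas as pd

# Invented rows using the dataset's column names.
votes = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "vote_count": [10, 1000, 5000],
    "vote_average": [9.0, 7.0, 6.0],
})

m = 500                              # assumed vote count needed for a rating to be trusted
C = votes["vote_average"].mean()     # mean rating across all films
v = votes["vote_count"]
R = votes["vote_average"]

# Weighted rating: films with few votes are pulled towards the overall mean C.
votes["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C
print(votes)
```

Here Film A's 9.0 from only 10 votes ends up close to the overall mean, while Film C's rating barely moves because it has many votes.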

### Data Cleaning

First, create a new table with just the fields that we're interested in.

In [5]:
new_dataset = data.filter(['revenue','original_title','director', 'genres', 'production_companies', 'release_date', 'release_year', 'vote_average'], axis=1)

In [6]:
new_dataset.head()

Unnamed: 0,revenue,original_title,director,genres,production_companies,release_date,release_year,vote_average
0,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,2015,6.5
1,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,2015,7.1
2,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2015,6.3
3,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,2015,7.5
4,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2015,7.3


Next, split the |-separated genres and production companies into Python lists.

In [7]:
new_dataset['genres'] = new_dataset['genres'].str.split("|")

In [8]:
new_dataset['production_companies'] = new_dataset['production_companies'].str.split("|")

In [9]:
new_dataset.head()

Unnamed: 0,revenue,original_title,director,genres,production_companies,release_date,release_year,vote_average
0,1513528810,Jurassic World,Colin Trevorrow,"[Action, Adventure, Science Fiction, Thriller]","[Universal Studios, Amblin Entertainment, Lege...",6/9/15,2015,6.5
1,378436354,Mad Max: Fury Road,George Miller,"[Action, Adventure, Science Fiction, Thriller]","[Village Roadshow Pictures, Kennedy Miller Pro...",5/13/15,2015,7.1
2,295238201,Insurgent,Robert Schwentke,"[Adventure, Science Fiction, Thriller]","[Summit Entertainment, Mandeville Films, Red W...",3/18/15,2015,6.3
3,2068178225,Star Wars: The Force Awakens,J.J. Abrams,"[Action, Adventure, Science Fiction, Fantasy]","[Lucasfilm, Truenorth Productions, Bad Robot]",12/15/15,2015,7.5
4,1506249360,Furious 7,James Wan,"[Action, Crime, Thriller]","[Universal Pictures, Original Film, Media Righ...",4/1/15,2015,7.3
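
Once genres are stored as lists, one possible way to compare genres is to give each film one row per genre and then group by genre. The sketch below assumes pandas 0.25 or later (for `DataFrame.explode`) and uses invented rows. It ranks by `vote_average` only because that column is in the table above; any other success measure could be swapped in.

```python
import pandas as pd

# Invented rows using the dataset's column names.
films = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "genres": ["Action|Thriller", "Action|Comedy"],
    "vote_average": [6.0, 8.0],
})

# Same split as in the cleaning step: "Action|Thriller" -> ["Action", "Thriller"].
films["genres"] = films["genres"].str.split("|")

# One row per (film, genre) pair.
per_genre = films.explode("genres")

# Mean rating per genre, highest first.
genre_rating = (
    per_genre.groupby("genres")["vote_average"]
    .mean()
    .sort_values(ascending=False)
)
print(genre_rating)
```

A film with several genres is counted once in each of them, so the per-genre figures are not independent of each other.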


We will be analysing movies by the quarter in which they were released, so convert the release date to a datetime and add a quarter column, numbered as follows.
Movies released:
* January - March = 1
* April - June = 2
* July - September = 3
* October - December = 4


In [11]:
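# Dates are month/day/two-digit year (e.g. 6/9/15), so state the format explicitly.
new_dataset['release_date'] = pd.to_datetime(new_dataset['release_date'], format='%m/%d/%y')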
new_dataset['quarter'] = new_dataset['release_date'].dt.quarter      
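
One caveat with two-digit years: they can be resolved to the wrong century (with `%y`, values 00 to 68 are read as 2000 to 2068), which affects any film released before 1969. The quarter is unaffected, but the year is. A possible workaround, sketched here on invented rows, is to take the month and day from release_date and the year from the separate release_year column. The names `movies`, `parts` and `dates` are hypothetical.

```python
import pandas as pd

# Invented rows using the dataset's column names.
movies = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66", "12/25/99"],
    "release_year": [2015, 1966, 1999],
})

# Split "month/day/yy" into integer columns 0 (month), 1 (day), 2 (two-digit year).
parts = movies["release_date"].str.split("/", expand=True).astype(int)

# Assemble full dates from the trusted four-digit release_year.
dates = pd.to_datetime(pd.DataFrame({
    "year": movies["release_year"],
    "month": parts[0],
    "day": parts[1],
}))

movies["quarter"] = dates.dt.quarter

# Number of releases in each quarter, ordered by quarter number.
releases_per_quarter = movies["quarter"].value_counts().sort_index()
print(dates)
print(releases_per_quarter)
```

The same `value_counts` call on the real data would answer which quarter had the most releases.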

In [15]:
# DataFrame has no order() method; sort_values() sorts the rows by a column.
new_dataset.sort_values('release_date')

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!