# Business Understanding

Our company has decided to create a **new movie studio**. This is a new venture for the company and our first entry into the film industry. In this notebook, I explore which types of films are currently performing best at the box office.

To help decide which types of films to make, I investigate the following three business questions:
1. What **genre** of movie should we produce?
2. Who should **direct** the movie?
3. What **movie rating** (for example PG-13 or R) should the movie have?

# Data Understanding

I use three datasets in this notebook.
1. **bom.movie_gross.csv.gz**: This is a dataset from [BoxOfficeMojo](BoxOfficeMojo.com) containing __ entries ... I use this dataset to determine the most profitable movies.
2. **im.db**: This is a dataset from [IMDB](IMDB.com) containing __ entries ... I use this dataset to determine the director and genre that will result in the most profitable movie.
3. **rt.movie_info.tsv.gz**: This is a dataset from [Rotten Tomatoes](RottenTomatoes.com) containing __ entries ... I use this dataset to determine the movie rating that will result in the most profitable movie.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
# Save Box Office Mojo dataset to bom_df
bom_df = pd.read_csv('data/zippedData/bom.movie_gross.csv.gz')

In [4]:
# Save Rotten Tomatoes dataset to rt_movie_info_df
rt_movie_info_df = pd.read_csv('data/zippedData/rt.movie_info.tsv.gz', sep='\t', encoding='windows-1252')

In [8]:
# Open a connection to the IMDB SQLite database
conn = sqlite3.connect('data/zippedData/im.db')

# Preview the first 10 rows of movie_basics as a pandas DataFrame
# (LIMIT 10 keeps this to a sample; remove it to load the full table)
query_movie_basics = """

SELECT *
  FROM movie_basics
 LIMIT 10
"""

movie_basics_df = pd.read_sql(query_movie_basics, conn)

# Preview the first 10 rows of movie_ratings as a pandas DataFrame
# (LIMIT 10 keeps this to a sample; remove it to load the full table)
query_movie_ratings = """

SELECT *
  FROM movie_ratings
 LIMIT 10
"""

movie_ratings_df = pd.read_sql(query_movie_ratings, conn)
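
Because the queries above use `LIMIT 10`, the resulting DataFrames cannot be used to fill in the entry counts for **im.db**. One way to get those counts is a `COUNT(*)` query per table. The sketch below is a minimal illustration: `count_rows` is a hypothetical helper, not part of the notebook, and it runs against a made-up in-memory database so that it works without `im.db`.

```python
import sqlite3


def count_rows(conn, table):
    """Return the number of rows in `table`, after checking the table exists."""
    # Table names cannot be bound as SQL parameters, so validate the name
    # against sqlite_master before interpolating it into the query.
    known = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")
    (n_rows,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return n_rows


# Stand-in database so the sketch runs without im.db; the rows are made up.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO movie_ratings VALUES (?, ?, ?)",
    [("tt0000001", 8.3, 31), ("tt0000002", 6.4, 20), ("tt0000003", 4.2, 148)],
)

demo_rows = count_rows(demo_conn, "movie_ratings")
print(demo_rows)
```

In the notebook itself, the same helper could be called with the real connection, for example `count_rows(conn, 'movie_basics')`.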

In [5]:
##DELETE
tmdb_df = pd.read_csv('data/zippedData/tmdb.movies.csv.gz')
tmdb_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [6]:
##DELETE 
movie_budgets_df = pd.read_csv('data/zippedData/tn.movie_budgets.csv.gz')
movie_budgets_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [8]:
##DELETE
rot_tom_df = pd.read_csv('data/zippedData/rt.reviews.tsv.gz', sep='\t', encoding='windows-1252')
rot_tom_df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [15]:
rt_movie_info_df['rating'].value_counts()

rating
R        521
NR       503
PG       240
PG-13    235
G         57
NC17       1
Name: count, dtype: int64
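
Raw counts are hard to compare at a glance, and about a third of the rows (503 of 1,557) are `NR`, which says nothing about a target rating. The sketch below turns the counts shown above into shares, with and without `NR`. The `shares` helper is hypothetical, and whether `NR` should be excluded is an assumption for illustration, not a decision made in this notebook.

```python
# Counts copied from the value_counts() output above.
rating_counts = {"R": 521, "NR": 503, "PG": 240, "PG-13": 235, "G": 57, "NC17": 1}


def shares(counts, exclude=()):
    """Return each category's share of the total, skipping excluded keys."""
    kept = {k: v for k, v in counts.items() if k not in exclude}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


all_shares = shares(rating_counts)                      # includes NR
rated_shares = shares(rating_counts, exclude=("NR",))   # rated films only

for rating, share in rated_shares.items():
    print(f"{rating:<6}{share:6.1%}")
```

On the DataFrame itself, `rt_movie_info_df['rating'].value_counts(normalize=True)` should give the same shares as `all_shares`.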

In [14]:
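# Preview movie_basics. Only the last expression in a notebook cell is
# displayed, so this result is not shown in the output below.
query = """

SELECT *
  FROM movie_basics
 LIMIT 10
"""

pd.read_sql(query, conn)

# Preview movie_ratings. As the last expression in the cell, this is the
# table displayed below.
query = """

SELECT *
  FROM movie_ratings
 LIMIT 10
"""

pd.read_sql(query, conn)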

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571
8,tt1156528,7.2,265
9,tt1161457,4.2,148
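
The preview shows that `numvotes` varies widely (from 20 to 50,352 in these ten rows), so an average rating based on a few dozen votes may not be comparable to one based on thousands. One possible way to handle this is to join the two tables on `movie_id` and keep only movies above a vote threshold. The sketch below assumes that `movie_basics` also has a `movie_id` column plus `primary_title` and `genres` columns, which the output above does not confirm. It uses a made-up in-memory database, and the threshold of 1,000 votes is purely illustrative.

```python
import sqlite3

# Stand-in database. Only the movie_ratings columns (movie_id, averagerating,
# numvotes) come from the preview above; the movie_basics columns are assumed.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE movie_basics (movie_id TEXT PRIMARY KEY, primary_title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('tt1', 'Hypothetical Film A', 'Drama'),
        ('tt2', 'Hypothetical Film B', 'Comedy'),
        ('tt3', 'Hypothetical Film C', 'Action');
    INSERT INTO movie_ratings VALUES
        ('tt1', 8.3, 31),
        ('tt2', 4.2, 50352),
        ('tt3', 7.0, 1613);
    """
)

# Join ratings onto basics and keep only movies with enough votes.
query_rated = """
SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
  FROM movie_basics AS b
  JOIN movie_ratings AS r
    ON b.movie_id = r.movie_id
 WHERE r.numvotes >= ?
 ORDER BY r.averagerating DESC
"""

MIN_VOTES = 1000  # illustrative threshold, not taken from the notebook
rated_rows = demo.execute(query_rated, (MIN_VOTES,)).fetchall()
for row in rated_rows:
    print(row)
```

In the notebook, the same query could be passed to pandas with `pd.read_sql(query_rated, conn, params=(MIN_VOTES,))`, provided the assumed column names match the real table.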


## Data Preparation

# Exploratory Data Analysis

# Conclusions

## Limitations

## Recommendations

## Next Steps