
## Analysis of Major Revenue Driving Factors in American Movies
#### by Stephen Gou
#### Student Number: 1000382908
### Questions
1. Is there a statistically significant difference between the mean revenues of New York Times (NYT) critic picks and non-pick movies?
2. Do reviews of NYT critic picks display a different set of sentiments than reviews of non-picks?
3. What are the major characteristics of a modern American movie that affect its domestic lifetime revenue? "Modern" is defined here as released in 1990 or later.

### Data Collection
- The OMDb API provides good baseline metadata about movies, including release date, runtime, genre, director, writer, actors, production companies, and opening box office. However, several fields of interest are missing, e.g. lifetime revenue (it only provides opening box office) and budget.

- To supplement OMDb, I found The Movies Dataset on Kaggle:
    https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv
    This data was collected from TMDB. It contains revenue, budget, multiple genre tags, and keyword data.

- NYT movie reviews can be accessed through the NYT API.

### EDA
1. Plot the distribution of revenues of critic-picks vs non-picks.
    
2. Generate frequencies of words with different sentiments.

3. Feature selection and transformation for analyzing movie revenue factors
 - Use intuition to pick out features in the data that are relevant for predicting revenue, e.g. genre, director, keywords and actors.

 - Examine the validity of important data such as revenue. For example, are the revenue figures inflation-adjusted? Are they domestic, lifetime revenues? Validate some revenues against another data source such as Box Office Mojo.

 - Think about how to represent and transform certain features so they are ready for modelling. Actors are one example: one way to use them meaningfully is to pull external references and assign a "popularity score" to each actor.

 - Keywords are another example: we have to explore their range and values and figure out how to incorporate them into the model. How many unique keywords are there in total, and how many are associated with each movie? "The Dark Knight Rises" has 21 keywords associated with it, including "dc comics", "terrorist", "gotham city" and "catwoman". Keywords like "dc comics" might offer good predictive value since they are associated with many movies, while others like "gotham city" and "catwoman" might be too specific. A histogram of the frequencies of the most popular keywords would help here.
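
The keyword exploration described above could be sketched roughly as follows. This is only an illustration: it assumes the `keywords` column holds JSON-encoded lists of `{"id", "name"}` entries, as in the TMDB data, and the helper names `keyword_names` and `keyword_frequencies` as well as the sample rows are made up for the example.

```python
import json
from collections import Counter

import pandas as pd


def keyword_names(cell):
    """Return the keyword names stored in one JSON-encoded `keywords` cell."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing or empty cell
    return [entry["name"] for entry in json.loads(cell)]


def keyword_frequencies(df, column="keywords"):
    """Count in how many movies each keyword appears."""
    counts = Counter()
    for cell in df[column]:
        # A set ensures each keyword is counted at most once per movie.
        counts.update(set(keyword_names(cell)))
    return counts


# Made-up rows in the same shape as the `keywords` column (not real data).
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "keywords": [
        '[{"id": 849, "name": "dc comics"}, {"id": 1, "name": "hypothetical keyword"}]',
        '[{"id": 849, "name": "dc comics"}]',
        "[]",
    ],
})

freq = keyword_frequencies(sample)
keywords_per_movie = sample["keywords"].apply(lambda cell: len(keyword_names(cell)))

print(len(freq))                    # number of unique keywords
print(freq.most_common(5))          # most widely shared keywords
print(keywords_per_movie.tolist())  # keywords attached to each movie
```

The resulting counts could then be plotted as a histogram, and a minimum-frequency threshold could be one way to drop keywords that are too specific to be useful as features.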

### Analysis
1. **Is there a statistically significant difference between the mean revenues of NYT critic picks and non-pick movies?**

    Conduct a two-sample t-test comparing the mean revenue of critic picks with that of non-picks, and determine whether the difference is statistically significant.


2. **Do reviews of NYT critic picks display a different set of sentiments than reviews of non-picks?**

    Compare the most frequent keywords in reviews of picks with those in reviews of non-picks.
    
    
3. **What are the major characteristics of a modern American movie that affect its domestic lifetime revenue?**
    - Construct a linear regression model that fits a movie's features to its revenue. Categorical features such as genre tags and keywords will be one-hot encoded. Reason about possible interaction terms and include them in the model.
    - Examine the coefficients and their corresponding p-values to identify the most influential features that drive revenue.
    - Finally, test for likely confounders. For instance, genre might affect both a movie's revenue and the type of director attached to it.
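
Steps 1 and 3 above could be sketched as follows. This is a minimal illustration under stated assumptions, not the final analysis: the data is synthetic, the column names `budget`, `genre`, `critic_pick` and `revenue` and the helpers `welch_t_test` and `ols_with_pvalues` are hypothetical, and Welch's unequal-variance form of the t-test is an assumption, since the plan does not say which variant to use.

```python
import numpy as np
import pandas as pd
from scipy import stats


def welch_t_test(picks, non_picks):
    """Two-sided Welch t-test on the mean revenue of two groups."""
    result = stats.ttest_ind(picks, non_picks, equal_var=False)
    return result.statistic, result.pvalue


def ols_with_pvalues(X, y):
    """Ordinary least squares with an intercept; returns coefficients and p-values."""
    design = np.column_stack([np.ones(len(X)), X.astype(float).to_numpy()])
    names = ["intercept"] + list(X.columns)
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = design.shape[0] - design.shape[1]          # n - number of coefficients
    sigma2 = residuals @ residuals / dof             # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)  # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
    return pd.DataFrame({"coef": beta, "p_value": p_values}, index=names)


# Synthetic data (assumption): revenue depends on budget and on being "Action".
rng = np.random.default_rng(0)
n = 200
movies = pd.DataFrame({
    "budget": rng.uniform(1, 200, n),
    "genre": rng.choice(["Action", "Comedy", "Drama"], n),
    "critic_pick": rng.integers(0, 2, n).astype(bool),
})
movies["revenue"] = (2.5 * movies["budget"]
                     + 40 * (movies["genre"] == "Action")
                     + rng.normal(0, 20, n))

# Step 1: compare mean revenue of picks vs. non-picks.
t_stat, p_val = welch_t_test(movies.loc[movies["critic_pick"], "revenue"],
                             movies.loc[~movies["critic_pick"], "revenue"])
print(t_stat, p_val)

# Step 3: one-hot encode the categorical feature and fit the regression.
# drop_first=True avoids perfect collinearity with the intercept.
features = pd.get_dummies(movies[["budget", "genre"]], columns=["genre"], drop_first=True)
summary = ols_with_pvalues(features, movies["revenue"].to_numpy())
print(summary)
```

With real data, movies carry several genre tags and keywords at once, so a multi-hot encoding would be needed in place of `pd.get_dummies` on a single column. Interaction terms could be added as products of existing feature columns before fitting.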



#### Code for importing, basic trimming and inspection of data from TMDB and OMDb

In [5]:
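import json
import requests
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
from urllib.parse import quote_plus

NYT_API_KEY = '53223e11b006467490bde835d45b0c74'
OMDB_API_KEY = 'd42886f4'

def fetch_omdb(title):
    """Query OMDb by title, save the response to omdb_data.csv and return it as a DataFrame."""
    # URL-encode the title so spaces and characters such as '&' or ':' are safe in the query string
    title = 't=' + quote_plus(title)
    print(title)
    req = 'http://www.omdbapi.com/?apikey=' + OMDB_API_KEY + '&' + title
    omdb_df = pd.read_json(req)
    omdb_df.to_csv('omdb_data.csv')
    return omdb_df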
    
#Load TMDB Movie Data
tmdb_df = pd.read_csv('tmdb_5000_movies.csv')
print(tmdb_df.shape)

#Data Cleaning
#Remove irrelevant columns and movies released before 1990.
#Note: movies not from the US are not filtered out in this cell.
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
tmdb_df = tmdb_df.drop(columns = ['original_language','popularity','homepage','overview','spoken_languages','tagline','original_title','vote_average','vote_count'])
tmdb_df.drop(tmdb_df[tmdb_df['release_date'].dt.year < 1990].index, inplace=True)
tmdb_df.head()

(4803, 20)


Unnamed: 0,budget,genres,id,keywords,production_companies,production_countries,release_date,revenue,runtime,status,title
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,Released,Avatar
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,Released,Pirates of the Caribbean: At World's End
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,Released,Spectre
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,Released,The Dark Knight Rises
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,Released,John Carter
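
The `genres` and `production_countries` columns in the output above hold JSON-encoded lists. A minimal sketch of how they might be unpacked, to restrict the data to US productions and to build one indicator column per genre tag, is given below. The helper names `names_in`, `keep_us_movies` and `add_genre_indicators` and the sample rows are hypothetical, and treating a movie as "from the US" whenever `US` appears anywhere in its country list is an assumption.

```python
import json

import pandas as pd


def names_in(cell, key="name"):
    """Extract one field from each entry of a JSON-encoded list cell."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing or empty cell
    return [entry[key] for entry in json.loads(cell)]


def keep_us_movies(df):
    """Keep rows whose production_countries list includes the code 'US'."""
    is_us = df["production_countries"].apply(
        lambda cell: "US" in names_in(cell, key="iso_3166_1"))
    return df[is_us].copy()


def add_genre_indicators(df):
    """Append one 0/1 column per genre tag (a movie may carry several tags)."""
    genre_lists = df["genres"].apply(names_in)
    indicators = (genre_lists.str.join("|")
                  .str.get_dummies(sep="|")
                  .add_prefix("genre_"))
    return df.join(indicators)


# Made-up rows with simplified entries (not real data).
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 28, "name": "Action"}]',
        '[{"id": 12, "name": "Adventure"}]',
    ],
    "production_countries": [
        '[{"iso_3166_1": "US"}]',
        '[{"iso_3166_1": "GB"}, {"iso_3166_1": "US"}]',
        '[{"iso_3166_1": "GB"}]',
    ],
})

us_movies = keep_us_movies(sample)
encoded = add_genre_indicators(us_movies)
print(encoded[["title", "genre_Action", "genre_Adventure"]])
```

Whether co-productions should count as US movies is a modelling choice. A stricter rule, such as requiring `US` to be the first listed country, would only change the lambda in `keep_us_movies`.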


title = 't=' + nytdata['display_title'][1].replace(' ', '+')
req = 'http://www.omdbapi.com/?apikey='+ OMDB_API_KEY + '&'+ title
print(pd.read_json(req))

In [7]:
fetch_omdb('The Dark Knight').head(1)

t=The+Dark+Knight


Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,Plot,...,Metascore,imdbRating,imdbVotes,imdbID,Type,DVD,BoxOffice,Production,Website,Response
0,The Dark Knight,2008,PG-13,18 Jul 2008,152 min,"Action, Crime, Drama",Christopher Nolan,"Jonathan Nolan (screenplay), Christopher Nolan...","Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...,...,84,9,1969949,tt0468569,movie,09 Dec 2008,"$533,316,061",Warner Bros. Pictures/Legendary,http://thedarkknight.warnerbros.com/,True
