
## Analysis of Major Revenue-Driving Factors in American Movies
#### by Stephen Gou
#### Student Number: 1000382908
### Questions
1. Is there a statistically significant difference between the mean revenues of NY critic picks and non-pick movies?
2. Do reviews of NY critic picks display a different set of sentiments than reviews of non-picks?
3. What are the major characteristics of a modern American movie that affect its domestic lifetime revenue? "Modern" is defined here as released in 1990 or later.

### Data Collection
- The OMDb API provides good baseline metadata about movies, including release date, runtime, genre, director, writer, actors, production companies, and opening box office. However, several fields of interest are missing, e.g. lifetime revenue (it only provides opening box office) and budget.

- To supplement OMDb, I found The Movies Dataset on Kaggle:
    https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv
    This data was collected from TMDB. It contains revenue, budget, multiple genre tags, and keyword data.

- New York Times (NYT) movie reviews can be accessed through the NYT API.

- TextBlob for sentiment analysis: https://textblob.readthedocs.io/en/dev/quickstart.html

### EDA
1. Plot the distribution of revenues of critic-picks vs non-picks.
    
2. Generate frequencies of words with different sentiments.

3. Feature selection and transformation for analyzing movie revenue factors
 - Pick out features in the data that are intuitively relevant for predicting revenue, e.g. genre, director, keywords and actors.

 - Examine the validity of important data like revenue. For example, are these revenue figures inflation-adjusted? Are they domestic and lifetime revenues? Validate some revenues against another data source such as Box Office Mojo.

 - Think about how to represent and transform certain features so that they are ready for modelling. Actors are one example: one way to use them meaningfully is to pull external references and assign a "popularity score" to each actor.
    
 - Keywords are another feature whose range and values have to be explored before deciding how to incorporate it into the model. How many unique keywords are there in total, and how many are associated with each movie? "The Dark Knight Rises" has 21 keywords associated with it, including "dc comics", "terrorist", "gotham city", and "catwoman". Some keywords like "dc comics" might offer good predictive value since they are associated with many movies, while others like "gotham city" and "catwoman" might be too specific. A histogram of the frequencies of the most popular keywords would help here.
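
The keyword exploration described above could start from a simple frequency count. The following is a minimal sketch using only the standard library. It assumes each movie's keywords are stored as a JSON-encoded list of objects with a `name` field; the sample rows and the helper name `keyword_counts` are hypothetical and only illustrate the idea.

```python
import json
from collections import Counter

# Hypothetical sample rows; each value mimics a JSON-encoded list of keyword objects.
sample_keywords = [
    '[{"id": 1, "name": "dc comics"}, {"id": 2, "name": "gotham city"}]',
    '[{"id": 1, "name": "dc comics"}, {"id": 3, "name": "superhero"}]',
    '[{"id": 3, "name": "superhero"}, {"id": 1, "name": "dc comics"}]',
]

def keyword_counts(rows):
    """Count how many movies each keyword is attached to."""
    counts = Counter()
    for raw in rows:
        # Use a set so a keyword repeated within one movie is counted once.
        names = {kw["name"] for kw in json.loads(raw)}
        counts.update(names)
    return counts

counts = keyword_counts(sample_keywords)
per_movie = [len(json.loads(raw)) for raw in sample_keywords]

print("unique keywords:", len(counts))
print("keywords per movie:", per_movie)
# Crude text histogram, most frequent keyword first.
for name, n in counts.most_common():
    print(f"{name:12s} {'#' * n}")
```

Keywords that appear in only one or two movies could then be dropped or grouped before encoding, which keeps the number of model features manageable.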

### Analysis
**1. Is there a statistically significant difference between the mean revenues of NY critic picks and non-pick movies?**
    
    Conduct a t-test on the mean revenues of critic picks and non-picks to determine whether the difference is statistically significant.
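
As an illustration of the test statistic, here is a minimal sketch of Welch's variant of the two-sample t-test, which does not assume equal variances in the two groups. Choosing Welch's variant is my assumption, since the plan only says "t-test". The revenue figures and the helper name `welch_t` are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical revenues in millions of USD, for illustration only.
picks = [120.0, 85.0, 40.0, 210.0, 95.0]
non_picks = [60.0, 30.0, 150.0, 20.0, 45.0, 80.0]

t_stat, dof = welch_t(picks, non_picks)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The p-value follows from the t distribution with `dof` degrees of freedom. In practice, `scipy.stats.ttest_ind(picks, non_picks, equal_var=False)` computes the statistic and the p-value in one call.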


**2. Do reviews of NY critic picks display a different set of sentiments than reviews of non-picks?**
    
    Compare the most frequent sentiment-bearing words in reviews of picks vs. non-picks.
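
The comparison could be sketched as a plain word-frequency count per group. This is a minimal standard-library sketch: the review snippets, the tiny stop-word list, and the helper name `top_words` are all hypothetical, and it counts all non-stop-words rather than only sentiment-bearing ones.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "of", "is", "but", "with", "in", "to", "it"}

def top_words(reviews, n=3):
    """Return the n most frequent non-stop-words across a list of review texts."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical review snippets, for illustration only.
pick_reviews = ["A moving and beautiful film", "Beautiful, moving, quietly brilliant"]
non_pick_reviews = ["A dull and predictable sequel", "Predictable plot, dull jokes"]

print("picks:    ", top_words(pick_reviews))
print("non-picks:", top_words(non_pick_reviews))
```

To restrict the counts to sentiment-bearing words, each word could first be scored with TextBlob (for example via `TextBlob(word).sentiment.polarity`) and kept only if its polarity is clearly non-zero.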
    
    
**3. What are the major characteristics of a modern American movie that affect its domestic lifetime revenue?**
- Construct a linear regression model that fits a movie's features to its revenue. Categorical features like genre tags and keywords will be one-hot encoded. Reason about possible interaction terms and include them in the model.
- Examine coefficients and their corresponding p-values to identify the most influential features that drive revenue.
- Finally, test for likely confounders. For instance, genre might affect both a movie's revenue and the type of director attached to it.
- Try a random forest of regression trees and compare its performance with the linear model.
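
Because a movie can carry several genre tags or keywords at once, the one-hot encoding mentioned above becomes a multi-hot encoding: one 0/1 column per distinct tag. The following minimal sketch shows the idea in plain Python. The genre lists and the helper name `multi_hot` are hypothetical.

```python
def multi_hot(tag_lists):
    """Encode lists of tags as 0/1 rows, one column per distinct tag."""
    columns = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if col in tags else 0 for col in columns] for tags in tag_lists]
    return columns, rows

# Hypothetical genre tags for three movies.
genres = [["Action", "Thriller"], ["Comedy"], ["Action", "Comedy"]]

columns, rows = multi_hot(genres)
print(columns)
for row in rows:
    print(row)
```

The resulting 0/1 columns can be passed to the regression as ordinary numeric features. With pandas, `pd.get_dummies` produces the same kind of indicator columns for single-valued categorical features.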

## Introduction

## Methods
## Results
## Conclusion

#### Code for importing, basic trimming, and inspection of data from TMDB and OMDb.

In [30]:
import json
import requests
import pandas as pd
from pandas import json_normalize  # top-level import path (pandas >= 1.0)

# Load datasets
tmdb_df = pd.read_csv('tmdb_5000_movies.csv')
ny_df = pd.read_csv('NY Movie Reviews.csv')
print(tmdb_df.shape)
print(ny_df.shape)

# Data cleaning: drop irrelevant columns and movies released before 1990.
# Filtering out non-US movies is not implemented in this cell.
# 'original_title' is kept because it is needed for the critic pick match below.
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
tmdb_df = tmdb_df.drop(columns=['original_language', 'popularity', 'homepage', 'overview', 'spoken_languages', 'tagline', 'vote_average', 'vote_count'])
tmdb_df.drop(tmdb_df[tmdb_df['release_date'].dt.year < 1990].index, inplace=True)
tmdb_df.head()

# Flag movies whose title appears among the NYT critic picks.
# isin() performs an element-wise membership test against the pick titles.
critics_picks = ny_df[ny_df['critics_pick'] > 0]['display_title']
print(critics_picks.head())
tmdb_df['critic_pick'] = tmdb_df['original_title'].isin(critics_picks)

(4803, 20)
(7539, 18)
4                        Leonard Cohen: Bird on a Wire
5     Accidental Courtesy: Daryl Davis, Race & America
6                                             Paterson
7                                         Toni Erdmann
14                                              Fences
Name: display_title, dtype: object


In [25]:
# Scraping code. Do not re-run: the results are already saved to CSV files.
NYT_API_KEY = '53223e11b006467490bde835d45b0c74'
OMDB_API_KEY = 'd42886f4'

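def fetch_omdb(title):
    """Fetch OMDb metadata for one title and save it to omdb_data.csv."""
    # OMDb expects the title as a 't=' query parameter with '+' for spaces.
    title = 't=' + title.replace(' ', '+')
    print(title)
    req = 'http://www.omdbapi.com/?apikey=' + OMDB_API_KEY + '&' + title
    omdb_df = pd.read_json(req)
    # Note: this overwrites omdb_data.csv on every call.
    omdb_df.to_csv('omdb_data.csv')
    return omdb_df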

# Page through the NYT review search results, 20 reviews per request.
all_ny_df = []
for offset in range(0,8000,20):
    url = 'http://api.nytimes.com/svc/movies/v2/reviews/search.json?opening-date=1990-01-01;2016-12-31&offset={0}&api-key=ae71411b586e4f9c82502e7e782b122d'.format(offset)
    ny_json = pd.read_json(url, orient = 'records')
    ny_df = json_normalize(ny_json['results'])
    # Stop once a page comes back with no reviews.
    if ny_df.empty:
        break
    all_ny_df.append(ny_df)

# Combine all pages; ignore_index gives each review a unique row label.
ny_df = pd.concat(all_ny_df, ignore_index=True)
print(ny_df.tail())
ny_df.to_csv('NY Movie Reviews.csv')

# Look up one review title in OMDb as a quick sanity check.
title = 't=' + ny_df['display_title'].iloc[1].replace(' ', '+')
req = 'http://www.omdbapi.com/?apikey='+ OMDB_API_KEY + '&'+ title
print(pd.read_json(req))