# Exporting American Movie Box Office Hits 

### Regression Model Stepwise Analysis <a id='top'></a> 

1. [Research Question](#1)<br/>
2. [Scraped: Movie Adaptations Data](#2) <br/>
3. [Exploratory Data Analysis: Movie Adaptations Dataframe](#3)<br/>
   [3a. Explore features correlation](#3a)<br/>
   [3b. Explore and handle categorical data](#3b)<br/>
4. [Cross-Validation](#4)<br/>
5. [Modeling](#5)<br/>
6. [Model Tuning](#6) <br/>
   [6a. Regularization](#6a)<br/>
   [6b. Features engineering](#6b)<br/>
   [6c. Modeling with new features](#6c)<br/> 
   [6d. Linear regression assumptions](#6d)<br/>
7. [Best Model](#7)<br/>
8. [Results](#8)<br/>
   [8a. Interpretability](#8a)<br/>
   [8b. Predictions](#8b)<br/>

In [None]:
import pandas as pd
import numpy as np
import sklearn
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pylab as py
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.metrics import r2_score, mean_absolute_error
%matplotlib inline


## 1. Research Question<a id='1'></a> 

* RQ. Can a model predict a movie adaptation's <sup>1</sup> international total gross revenue based on movie data available on boxofficemojo.com?
* Data source: boxofficemojo.com 
* Error metrics: R<sup>2</sup> and mean absolute error (MAE)

<sup>1</sup> Adapted from books, television shows, events, video games, or plays. 
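
Because box office grosses are heavy-tailed, the choice of error metric affects how much a single blockbuster miss dominates the score. The sketch below uses made-up numbers (in millions of dollars, not taken from the dataset) to compare mean absolute error with squared-error metrics.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# hypothetical actual vs. predicted international gross, in $ millions
y_true = np.array([10.0, 50.0, 120.0, 400.0])
y_hat = np.array([20.0, 40.0, 100.0, 300.0])

mae = mean_absolute_error(y_true, y_hat)   # average absolute miss
mse = mean_squared_error(y_true, y_hat)    # squares each miss first
rmse = np.sqrt(mse)                        # back in $ millions

print(f'MAE: {mae:.1f}  MSE: {mse:.1f}  RMSE: {rmse:.1f}')
```

The one large miss (100) pulls RMSE well above MAE, which is why MAE can be easier to read as "typical dollars off" for this kind of target.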


## 2. Scraped [Movie Adaptations Data](https://github.com/slp22/regression-project/blob/main/adaptation_movies_webscraping.ipynb) 


## 3. Exploratory Data Analysis: [Movie Adaptations Dataframe](https://github.com/slp22/regression-project/blob/main/adaptation_movies_eda.ipynb) 

In [None]:
movie_df = pd.read_csv('clean_df.csv')
movie_df.drop(columns=['link_stub'], inplace=True)
movie_df.head(1)

In [None]:
movie_df.describe()

In [None]:
# check for null values
movie_df.isnull().sum()

In [None]:
# drop null values
movie_df.dropna(axis=0, how='any', inplace=True)

In [None]:
# double check for null values
movie_df.isnull().sum()

In [None]:
movie_df.info()

### 3a. Explore features correlation<a id='3a'></a> 

In [None]:
sns.pairplot(movie_df, height=5, aspect=1.5);

In [None]:
# heatmap correlation matrix (numeric columns only; text columns such as genres cannot be correlated)
sns.heatmap(movie_df.corr(numeric_only=True), cmap="seismic", annot=True, vmin=-1, vmax=1);

### Correlation Summary

#### Target-Features
*target = `international_total_gross`*
* target correlated with (highest to lowest):
    * `domestic_total_gross`
    * `domestic_opening`
    * `budget`
    * `max_theaters`
    * `opening_theathers`

The target is also highly correlated with `worldwide_total_gross`, which is left out of the features because it contains the target by construction:<br/>
`worldwide_total_gross` = `domestic_total_gross` + `international_total_gross`
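
To illustrate why a feature built from the target has to be excluded, here is a minimal sketch on synthetic data (the numbers and the 1.2 multiplier are invented for illustration, not estimated from the scraped data). Once `worldwide` sits next to `domestic`, the regression can recover the target exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# synthetic grosses in dollars (assumed relationship, illustration only)
domestic = rng.uniform(1e6, 5e8, size=n)
international = 1.2 * domestic + rng.normal(0, 5e7, size=n)
worldwide = domestic + international  # contains the target by construction

X_leaky = np.column_stack([domestic, worldwide])
X_clean = domestic.reshape(-1, 1)

r2_leaky = LinearRegression().fit(X_leaky, international).score(X_leaky, international)
r2_clean = LinearRegression().fit(X_clean, international).score(X_clean, international)

print(f'R^2 with worldwide: {r2_leaky:.4f}')
print(f'R^2 without worldwide: {r2_clean:.4f}')
```

The leaky fit scores essentially 1.0 because `international = worldwide - domestic` is an exact linear combination of its features.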

####  Features-Features: Positive Correlation
* domestic_total_gross:
    * `domestic_opening`
    * `worldwide_total_gross`
    * `budget`
    * `max_theaters`
    * `opening_theathers`
* domestic_opening:
    * `budget`
    * `max_theaters`
    * `opening_theathers`
* max_theaters:
    * `opening_theathers`
    * `budget`
    * `domestic_opening`

####  Features-Features: Negative Correlation
* rank:
    * `domestic_total_gross`
    * `max_theaters`
    * `opening_theathers`
    * `domestic_opening`
    * `budget`
    


### 3b. Explore and handle categorical data<a id='3b'></a> 

In [None]:
# explore genres as candidate for dummies
print('Unique genres:', movie_df.genres.nunique())
print('\n')
print('Genres counts\n', movie_df['genres'].value_counts())
# 👎 too many; look for other dummy variables. 

In [None]:
# explore MPAA rating as candidate for dummies
print('Unique MPAA ratings:', movie_df.rating.nunique())
print('\n')
rating_count = movie_df['rating'].value_counts()
print('Rating counts\n', rating_count)
# 👍 easy-to-use for dummy variables

In [None]:
# get dummies for MPAA rating 
df_dummies_rating = pd.get_dummies(movie_df, columns=['rating'], drop_first=True)
df_dummies_rating.head(2)
movie_df = df_dummies_rating
movie_df.head(1)

In [None]:
# explore distributor as candidate for dummies 
print('Unique distributors:', movie_df.distributor.nunique())
print('\n')
distributor_count = movie_df['distributor'].value_counts()
print('Distributor count\n', distributor_count)
# 👍 Reasonable amount, group lower frequencies into an other category.

In [None]:
# create distributor other category
distributor_other = list(distributor_count[distributor_count < 20].index)
movie_df['distributor'] = movie_df['distributor'].replace(distributor_other, 'other')

# get dummies for distributor
df_dummies_distributor = pd.get_dummies(movie_df, columns=['distributor'], drop_first=True)
movie_df = df_dummies_distributor
movie_df.head(1)
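
The rare-category grouping above can be checked on a toy frame. This is a minimal sketch with invented distributor names and an illustrative threshold of 3 (the notebook uses 20).

```python
import pandas as pd

# toy data: two frequent distributors and two rare ones (names are made up)
toy = pd.DataFrame({'distributor': ['A'] * 5 + ['B'] * 4 + ['C', 'D']})

counts = toy['distributor'].value_counts()
rare = counts[counts < 3].index  # categories seen fewer than 3 times

# keep frequent categories, relabel rare ones as 'other'
toy['distributor'] = toy['distributor'].where(~toy['distributor'].isin(rare), 'other')

# drop_first=True drops the first category alphabetically ('A') as the baseline
dummies = pd.get_dummies(toy, columns=['distributor'], drop_first=True)
print(dummies.columns.tolist())
```

With `drop_first=True`, the dropped category becomes the reference level that the remaining dummy coefficients are measured against.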

In [None]:
movie_df.columns

[back to top](#top)

## 4. Cross-Validation<a id='4'></a> 

In [None]:
# separate target from select features
y = movie_df['international_total_gross']
X = movie_df.loc[:,['domestic_total_gross', 
                    'domestic_opening', 
                    'budget',
                    'max_theaters', 
                    'opening_theathers',
                    'rank',
                    'runtime',
                    'release_date',
                    'rating_PG', 
                    'rating_PG13',
                    'rating_R',
                    'distributor_Paramount Pictures',
                    'distributor_Sony Pictures Entertainment (SPE)',
                    'distributor_Twentieth Century Fox', 
                    'distributor_Universal Pictures',
                    'distributor_Walt Disney Studios Motion Pictures',
                    'distributor_Warner Bros.', 
                    'distributor_other']]

In [None]:
# split test data set
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [None]:
# set up k-folds 
kfold = KFold(n_splits=5, 
              shuffle=True, 
              random_state = 42)

[back to top](#top)

## 5. Modeling<a id='5'></a> 

### simple linear regression model 

In [None]:
# lin_reg 
lin_reg = LinearRegression()

scores = cross_val_score(lin_reg, X_train, y_train, cv=kfold)
print('k-fold individual scores:', scores)
print('linear regression k-fold mean score:', round(np.mean(scores), 3))

lin_reg.fit(X_train, y_train)

In [None]:
# lin_reg train: fitted vs. actual
y_train_predict = lin_reg.predict(X_train)

plt.scatter(y_train, y_train_predict)
lims = [0, y_train.max()]
plt.plot(lims, lims)  # 45-degree reference line spanning the data range
plt.title('Predictions vs. Actual (X_train)')
plt.xlabel('actual')
plt.ylabel('predictions')
plt.grid();

In [None]:
# lin_reg test: fitted vs. actual
y_test_predict = lin_reg.predict(X_test)

plt.scatter(y_test, y_test_predict)
lims = [0, y_test.max()]
plt.plot(lims, lims)  # 45-degree reference line spanning the data range
plt.title('Predictions vs. Actual (X_test)')
plt.xlabel('actual')
plt.ylabel('predictions')
plt.grid();

In [None]:
# lin_reg: residuals vs. predicted
y_predict = lin_reg.predict(X)
residuals = y - y_predict

plt.scatter(y_predict, residuals)
plt.axhline(0)  # zero-residual reference line
plt.title("Residuals vs. Predicted")
plt.xlabel('predicted')
plt.ylabel('residuals')
plt.grid();

## 6. Model Tuning<a id='6'></a> 

In [None]:
# standard-scaling features before regularization 
std = StandardScaler()
std.fit(X_train.values)

# apply scaler to train data
X_train_std = std.transform(X_train.values)

# apply scaler to test data
X_test_std = std.transform(X_test.values)
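
An alternative to scaling by hand is to put the scaler and the model in one `Pipeline`, so that each cross-validation fold fits the scaler on its own training part only. The sketch below uses synthetic data and an arbitrary `alpha=1.0`; the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic features on very different scales, like theaters vs. dollars
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 1e3, 1e6])
y_demo = X_demo @ np.array([2.0, 3e-3, 4e-6]) + rng.normal(scale=0.1, size=100)

# the scaler is re-fit inside every fold, so no fold sees held-out statistics
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge(alpha=1.0))])
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print('pipeline k-fold mean R^2:', round(cv_scores.mean(), 3))
```

The same pipeline object can then be fit on the full training set and scored on the test set without a separate `transform` step.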

### 6a. Regularization<a id='6a'></a> 

#### Lasso Model


In [None]:
# create lasso model
lasso_model = Lasso(alpha = 100000, fit_intercept=True, random_state=42)
lasso_model.fit(X_train_std, y_train)

# cross-validate on the standardized features the model was fit on
scores = cross_val_score(lasso_model, X_train_std, y_train, cv=kfold)
print('lasso model k-fold individual scores:', scores)
print('lasso model k-fold mean score:', round(np.mean(scores), 3))

# evaluate 
lasso_r2_train = lasso_model.score(X_train_std, y_train)
lasso_r2_test = lasso_model.score(X_test_std, y_test)
print('lasso r^2 train std:', round(lasso_r2_train, 3))
print('lasso r^2 test std:', round(lasso_r2_test, 3))
print('train < test, likely outliers')

#### Ridge Model


In [None]:
# create ridge model 
ridge_model = Ridge(alpha = 1000, fit_intercept=True, random_state=42)
ridge_model.fit(X_train_std, y_train)
 
# cross-validate on the standardized features the model was fit on
scores = cross_val_score(ridge_model, X_train_std, y_train, cv=kfold)
print('ridge model k-fold individual scores:', scores)
print('ridge model k-fold mean score:', round(np.mean(scores), 3))

# evaluate
ridge_r2_train = ridge_model.score(X_train_std, y_train)
ridge_r2_test = ridge_model.score(X_test_std, y_test)
print('ridge r^2 train std:', round(ridge_r2_train, 3))
print('ridge r^2 test std:', round(ridge_r2_test, 3))
print('train = test, and high, good fit')

#### Elastic Net Model


In [None]:
# create elasticnet model 
elastic_model = ElasticNet(alpha = 1000, l1_ratio=.5)
elastic_model.fit(X_train_std, y_train)

# cross-validate on the standardized features the model was fit on
scores = cross_val_score(elastic_model, X_train_std, y_train, cv=kfold)
print('elastic model k-fold individual scores:', scores)
print('elastic model k-fold mean score:', round(np.mean(scores), 3))

elastic_r2_train = elastic_model.score(X_train_std, y_train)
elastic_r2_test = elastic_model.score(X_test_std, y_test)

print('elastic r^2 train std:', round(elastic_r2_train, 3))
print('elastic r^2 test std:', round(elastic_r2_test, 3))
print('train = test, but low, underfit')

*Ridge model is best fit using r<sup>2</sup> score*
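
The regularization strengths above are fixed by hand. One way to choose `alpha` from the data is `RidgeCV`, sketched here on synthetic data; the grid, the data, and the variable names are assumptions for illustration, not the notebook's tuning procedure.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# synthetic regression problem with one irrelevant feature
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=120)

X_std = StandardScaler().fit_transform(X_demo)

# search a log-spaced grid of penalties with 5-fold cross-validation
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_std, y_demo)

best_alpha = ridge_cv.alpha_
print('selected alpha:', best_alpha)
```

`LassoCV` and `ElasticNetCV` offer the same pattern for the other two models.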

### Error metric: MAE

In [None]:
# evaluate models using mean absolute error 
y_pred = lin_reg.predict(X_test)
print(f'Linear Regression MAE on test: {mean_absolute_error(y_test, y_pred):.2f}')

# regularized models were fit on standardized features, so predict on X_test_std
y_pred = lasso_model.predict(X_test_std)
print(f'Lasso Regression MAE on test: {mean_absolute_error(y_test, y_pred):.2f}')

y_pred = ridge_model.predict(X_test_std)
print(f'Ridge Regression MAE on test: {mean_absolute_error(y_test, y_pred):.2f}')

y_pred = elastic_model.predict(X_test_std)
print(f'ElasticNet Regression MAE on test: {mean_absolute_error(y_test, y_pred):.2f}')


*Simple linear regression is the best model by MAE, but its predictions are still off by about $45.5 million on average.* 

[back to top](#top)

### 6b. Features engineering<a id='6b'></a> 

In [None]:
# log transformation for monetary columns 

# check for zeros in columns before log transformation 
count = (movie_df['international_total_gross'] == 0).sum()
print('count of zeros in international_total_gross:', count)

count = (movie_df['domestic_total_gross'] == 0).sum()
print('count of zeros in domestic_total_gross:', count)

count = (movie_df['budget'] == 0).sum()
print('count of zeros in budget:', count)
# budget: min $0, max $270,000,000
# 👎 np.log(0) returns -inf with a divide-by-zero warning; will not transform 

count = (movie_df['domestic_opening'] == 0).sum()
print('count of zeros in domestic_opening:', count)
# domestic_opening: min $0, max $191,770,800
# 👎 np.log(0) returns -inf with a divide-by-zero warning; will not transform 


# 👍 international_total_gross and domestic_total_gross
# international_total_gross: min $98, max $1,119,261,000
movie_df['log_international_total_gross'] = np.log(movie_df['international_total_gross'])

# domestic_total_gross: min $742, max $543,638,043
movie_df['log_domestic_total_gross'] = np.log(movie_df['domestic_total_gross'])
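
For columns that do contain zeros, such as `budget` and `domestic_opening`, one common option is `np.log1p`, which computes log(1 + x) and maps 0 to 0. The notebook does not use it; the sketch below only shows the behaviour on a toy column.

```python
import numpy as np
import pandas as pd

# toy budget column including a zero (values chosen to match the min/max noted above)
toy = pd.DataFrame({'budget': [0, 1_000_000, 270_000_000]})

# log1p(x) = log(1 + x): finite at zero, close to log(x) for large x
toy['log1p_budget'] = np.log1p(toy['budget'])

# expm1 inverts the transform when predictions need to go back to dollars
recovered = np.expm1(toy['log1p_budget'])
print(toy)
```

Whether a zero budget is a true value or a missing one is a separate data question that the transform does not answer.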


In [None]:
# profit = domestic_total_gross - budget
movie_df['profit'] = (movie_df['domestic_total_gross'] - movie_df['budget'])

In [None]:
# opening_profit = domestic_opening - budget
movie_df['opening_profit'] = (movie_df['domestic_opening'] - movie_df['budget'])

In [None]:
# opening = domestic_opening * opening_theathers
movie_df['opening'] = (movie_df['domestic_opening'] * movie_df['opening_theathers'])

In [None]:
movie_df.columns

[back to top](#top)

### 6c. Modeling with new features<a id='6c'></a> 

In [None]:
# separate target from new and original features
y2 = movie_df['international_total_gross']
X2 = movie_df.loc[:,['domestic_total_gross', 
                       'rank', 
                       'max_theaters', 
                       'opening_theathers',
                       'domestic_opening', 
                       'budget', 
                       'release_date',
                       'runtime', 
                       'rating_PG', 
                       'rating_PG13', 
                       'rating_R',
                       'distributor_Paramount Pictures',
                       'distributor_Sony Pictures Entertainment (SPE)',
                       'distributor_Twentieth Century Fox', 
                       'distributor_Universal Pictures',
                       'distributor_Walt Disney Studios Motion Pictures',
                       'distributor_Warner Bros.', 
                       'distributor_other',
                       'log_domestic_total_gross', 
                       'profit',
                       'opening', 
                       'opening_profit']]


In [None]:
# split test data set
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, 
                                                        y2,
                                                        test_size=0.2, 
                                                        random_state=42)


In [None]:
# lin_reg2: simple linear regression 
lin_reg2 = LinearRegression()

scores = cross_val_score(lin_reg2, X2_train, y2_train, cv=kfold)
print('k-fold individual scores:', scores)
print('linear regression2 k-fold mean score:', round(np.mean(scores), 3))

lin_reg2.fit(X2_train, y2_train)

# mean k-fold score improves by about 0.2 over lin_reg


In [None]:
# standard-scaling features before regularization 
std = StandardScaler()
std.fit(X2_train.values)

# apply scaler to train data
X2_train_std = std.transform(X2_train.values)

# apply scaler to test data
X2_test_std = std.transform(X2_test.values)

In [None]:
# create ridge model2
ridge_model2 = Ridge(alpha = 1000, fit_intercept=True, random_state=42)
ridge_model2.fit(X2_train_std, y2_train)

# cross-validate on the standardized features the model was fit on
scores = cross_val_score(ridge_model2, X2_train_std, y2_train, cv=kfold)
print('ridge model2 k-fold individual scores:', scores)
print('ridge model2 k-fold mean score:', round(np.mean(scores), 3))

# evaluate
ridge2_r2_train = ridge_model2.score(X2_train_std, y2_train)
ridge2_r2_test = ridge_model2.score(X2_test_std, y2_test)
print('ridge2 r^2 train std:', round(ridge2_r2_train, 3))
print('ridge2 r^2 test std:', round(ridge2_r2_test, 3))
print('train > test, overfit')

In [None]:
y2_pred = lin_reg2.predict(X2_test)
print(f'Linear Regression2 MAE on test: {mean_absolute_error(y2_test, y2_pred):.2f}')

# ridge_model2 was fit on standardized features, so predict on X2_test_std
y2_pred = ridge_model2.predict(X2_test_std)
print(f'Ridge Regression2 MAE on test: {mean_absolute_error(y2_test, y2_pred):.2f}')

In [None]:
# separate target, drop some features
y3 = movie_df['international_total_gross']
X3 = movie_df.loc[:,['domestic_total_gross', 
#                        'rank', 
#                        'max_theaters', 
#                        'opening_theathers',
                       'domestic_opening', 
                       'budget', 
#                        'release_date',
#                        'runtime', 
#                        'rating_PG', 
#                        'rating_PG13', 
#                        'rating_R',
#                        'distributor_Paramount Pictures',
#                        'distributor_Sony Pictures Entertainment (SPE)',
#                        'distributor_Twentieth Century Fox', 
#                        'distributor_Universal Pictures',
#                        'distributor_Walt Disney Studios Motion Pictures',
#                        'distributor_Warner Bros.', 
#                        'distributor_other', 
#                        'log_domestic_total_gross', 
                       'profit',
                       'opening', 
                       'opening_profit']]

In [None]:
# split test data set
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, 
                                                        y3,
                                                        test_size=0.2, 
                                                        random_state=42)


In [None]:
# simple linear regression minus features 
lin_reg3 = LinearRegression()

scores = cross_val_score(lin_reg3, X3_train, y3_train, cv=kfold)
print('k-fold individual scores:', scores)
print('linear regression3 k-fold mean score:', np.mean(scores))

lin_reg3.fit(X3_train, y3_train)

In [None]:
# standard-scaling features before regularization 
std = StandardScaler()
std.fit(X3_train.values)

# apply scaler to train data
X3_train_std = std.transform(X3_train.values)

# apply scaler to test data
X3_test_std = std.transform(X3_test.values)

In [None]:
# create ridge model3
ridge_model3 = Ridge(alpha = 100)
ridge_model3.fit(X3_train_std, y3_train)

# cross-validate
scores = cross_val_score(ridge_model3, X3_train_std, y3_train, cv=kfold)
print('ridge model3 k-fold individual scores:', scores)
print('ridge model3 k-fold mean score:', round(np.mean(scores), 3))

# evaluate
ridge3_r2_train = ridge_model3.score(X3_train_std, y3_train)
ridge3_r2_test = ridge_model3.score(X3_test_std, y3_test)
print('ridge3 r^2 train std:', round(ridge3_r2_train, 3))
print('ridge3 r^2 test std:', round(ridge3_r2_test, 3))
print('train = test, good fit')

In [None]:
y3_pred = lin_reg3.predict(X3_test)
print(f'Linear Regression3 MAE on test: {mean_absolute_error(y3_test, y3_pred):.2f}')

# ridge_model3 was fit on standardized features, so predict on X3_test_std
y3_pred = ridge_model3.predict(X3_test_std)
print(f'Ridge Regression3 MAE on test: {mean_absolute_error(y3_test, y3_pred):.2f}')

[back to top](#top)

### 6d. Linear regression assumptions<a id='6d'></a> 

In [None]:
# residuals vs. predicted
y3_predict = lin_reg3.predict(X3)
residuals = y3 - y3_predict
plt.scatter(y3_predict, residuals)
plt.axhline(0)  # zero-residual reference line
plt.title("Residuals vs. Predicted")
plt.xlabel("predictions")
plt.ylabel("residuals")
plt.grid();
    

In [None]:
# normal q-q plot: residuals look heavy-tailed
y3_predict = lin_reg3.predict(X3)
residuals = y3 - y3_predict
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.xlabel("theoretical")
plt.ylabel("observed values");
 

In [None]:
# distribution of the dependent variable
movie_df.international_total_gross.hist(bins=10)
plt.title('Histogram of Dependent Variable (international_total_gross)');
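
A numeric companion to the Q-Q plot and histogram is sample skewness. The sketch below uses synthetic lognormal "grosses" (parameters invented for illustration) to show how a right-skewed dollar variable looks before and after a log transform.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)

# synthetic right-skewed revenues; lognormal parameters are assumptions
gross = rng.lognormal(mean=17, sigma=1.5, size=1000)

skew_raw = stats.skew(gross)          # strongly positive for a long right tail
skew_log = stats.skew(np.log(gross))  # near zero once the tail is compressed

print(f'skew raw: {skew_raw:.2f}  skew log: {skew_log:.2f}')
```

Running `stats.skew` on the notebook's residuals or on `international_total_gross` would give a comparable single-number check.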

[back to top](#top)

## 7. Best Model<a id='7'></a> 
Fit best model on (train + val), score on test!

In [None]:
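# quick reg plot: actual vs. predicted against a single feature
# (plt.scatter needs x and y of the same size, so X3 cannot be passed whole)
y3_predict = lin_reg3.predict(X3)
x_plot = X3['domestic_total_gross']

plt.scatter(x_plot, y3, label='actual')
plt.scatter(x_plot, y3_predict, label='predicted')
plt.legend();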

[back to top](#top)

## 8. Results<a id='8'></a> 

### 8a. Interpretability<a id='8a'></a> 

In [None]:
lin_reg2 = sm.OLS(y2, X2)
fit2 = lin_reg2.fit()
fit2.summary()

In [None]:
# lin_reg3 has lower adjusted R^2 compared to lin_reg2
# despite removing features based on p-values
lin_reg3 = sm.OLS(y3, X3)
fit3 = lin_reg3.fit()
fit3.summary()

### 8b. Predictions<a id='8b'></a> 

Slides, article, and code available at: https://github.com/slp22/regression-project

[back to top](#top)