# Day 20 Workout - Multivariate Stats

### Objective

Create a multiple linear regression (MLR) model to explain and predict the gross revenue of Disney movies released from 1937 to 2016.

- Use the disney_movies_total_gross.csv file
- Follow the code below to create a column called "days_since_release", which you should calculate as 1/1/2017 minus the release date
- Manipulate the dataset however needed to identify insights on both the total_gross and inflation_adjusted_gross variables


In [52]:
import pandas as pd
from datetime import datetime

df = pd.read_csv('data/disney_movies_total_gross.csv')

In [53]:
df["release_date"] = pd.to_datetime(df["release_date"])

df['days_since_release'] = datetime(2017, 1, 1) - df['release_date']

df['days_since_release'] = df['days_since_release'].dt.days
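
As a quick sanity check of the date arithmetic, here is a minimal sketch on a toy frame. The three release dates are hypothetical values chosen for illustration; they are not read from the CSV.

```python
import pandas as pd
from datetime import datetime

# Hypothetical release dates, used only to illustrate the calculation.
toy = pd.DataFrame({"release_date": ["2016-12-31", "2016-01-01", "1937-12-21"]})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# Subtracting a datetime column from a fixed date gives a Timedelta column;
# .dt.days then extracts the whole-day count as an integer.
toy["days_since_release"] = (datetime(2017, 1, 1) - toy["release_date"]).dt.days

print(toy)
```

A movie released the day before the reference date gets a value of 1, and older movies get larger values, so a negative slope on this feature means older releases are associated with lower gross.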

Create a regression model with days_since_release as the only feature and total_gross as the label.

In [71]:
from scipy import stats

stats.linregress(df.days_since_release, y=df.total_gross)

LinregressResult(slope=-6332.669588619881, intercept=107443786.44752352, rvalue=-0.2838544996620264, pvalue=3.424281529173464e-12, stderr=890.5569082475823, intercept_stderr=7063367.921552288)

In [73]:
import statsmodels.api as sm

X = pd.DataFrame(df.days_since_release).assign(const=1)

model = sm.OLS(df.total_gross, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            total_gross   R-squared:                       0.081
Model:                            OLS   Adj. R-squared:                  0.079
Method:                 Least Squares   F-statistic:                     50.57
Date:                Wed, 06 Dec 2023   Prob (F-statistic):           3.42e-12
Time:                        14:47:55   Log-Likelihood:                -11420.
No. Observations:                 579   AIC:                         2.284e+04
Df Residuals:                     577   BIC:                         2.285e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
days_since_release -6332.6696    890
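
The two outputs above describe the same fit, which can be cross-checked with a little arithmetic. In a one-feature regression, R-squared is the squared correlation and the F-statistic is the squared t-statistic of the slope. The sketch below uses only numbers copied from the outputs above; the per-year conversion assumes an average year of 365.25 days.

```python
# Values copied from the linregress result and the OLS summary above.
slope = -6332.669588619881
stderr = 890.5569082475823
rvalue = -0.2838544996620264
n_obs = 579        # "No. Observations" in the summary
n_features = 1     # "Df Model" in the summary

r_squared = rvalue ** 2          # compare with "R-squared" in the summary
t_stat = slope / stderr          # t-statistic of the slope
f_stat = t_stat ** 2             # compare with "F-statistic" (one-feature case only)

# Adjusted R-squared penalizes R-squared for the number of features.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

# Assumption: 365.25 days per year, to express the slope per year instead of per day.
slope_per_year = slope * 365.25

print(f"R-squared:      {r_squared:.3f}")
print(f"t-statistic:    {t_stat:.3f}")
print(f"F-statistic:    {f_stat:.2f}")
print(f"Adj. R-squared: {adj_r_squared:.3f}")
print(f"Slope per year: {slope_per_year:,.0f}")
```

Read this way, the slope says that each additional day since release is associated with roughly 6,333 less in total_gross, while the low R-squared says that release age alone explains only a small share of the variation.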

- Fill in all null values for mpaa_rating with "Empty"
- Create dummy-coded features that convert mpaa_rating into numeric 0/1 columns for every value except "Empty" (hint: drop_first=True drops the first category in sorted order, which is "Empty" for this column)

In [55]:
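# Assign the result back; calling fillna(..., inplace=True) on a column
# accessor triggers chained-assignment warnings in newer pandas versions.
df['mpaa_rating'] = df['mpaa_rating'].fillna('Empty')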

In [64]:
df2 = pd.get_dummies(df, columns=['mpaa_rating'], dtype=int, drop_first=True).assign(const=1)
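
To see what the dummy coding produces, here is a minimal sketch on a toy frame. The titles and ratings are hypothetical and only mimic the kind of values found in mpaa_rating, including one missing value.

```python
import pandas as pd

# Hypothetical titles and ratings, with one missing rating, for illustration only.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E", "F"],
    "mpaa_rating": ["G", None, "PG", "R", "PG-13", "Not Rated"],
})

# Replace the missing rating with the placeholder category.
toy["mpaa_rating"] = toy["mpaa_rating"].fillna("Empty")

# One 0/1 column per category; drop_first=True drops the first category in
# sorted order ("Empty" here), which becomes the baseline.
dummies = pd.get_dummies(toy, columns=["mpaa_rating"], dtype=int, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```

The row that had a missing rating ends up with 0 in every rating column, so each rating coefficient in the regression is read as a difference from the "Empty" baseline.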

Create a new regression model with days_since_release and the rating dummies as the features and total_gross as the label.

In [74]:
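# Features are the columns from position 5 onward:
# days_since_release, the rating dummies, and const.
model = sm.OLS(df2.total_gross, df2.iloc[:,5:]).fit()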
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            total_gross   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.112
Method:                 Least Squares   F-statistic:                     13.19
Date:                Wed, 06 Dec 2023   Prob (F-statistic):           5.12e-14
Time:                        14:49:15   Log-Likelihood:                -11407.
No. Observations:                 579   AIC:                         2.283e+04
Df Residuals:                     572   BIC:                         2.286e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
days_since_release    -5481.89
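
The positional slice `df2.iloc[:,5:]` depends on the column order of the CSV. One alternative is to pick the features by name. The sketch below uses a hypothetical toy frame that only mimics the layout of df2; its values are made up, and the `mpaa_rating_` prefix is the one `pd.get_dummies` generates from the column name.

```python
import pandas as pd

# Hypothetical frame mimicking df2 after get_dummies; all values are made up.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "total_gross": [100.0, 250.0, 75.0],
    "days_since_release": [10, 500, 9000],
    "mpaa_rating_G": [1, 0, 0],
    "mpaa_rating_PG": [0, 1, 0],
    "const": [1, 1, 1],
})

# Collect the dummy columns by their shared prefix instead of by position.
rating_cols = [c for c in toy.columns if c.startswith("mpaa_rating_")]
feature_cols = ["days_since_release", *rating_cols, "const"]

X = toy[feature_cols]      # design matrix
y = toy["total_gross"]     # label

print(X.columns.tolist())
```

The resulting `X` and `y` can be passed to `sm.OLS(y, X)` in the same way as in the cell above, and the selection keeps working if columns are added to or reordered in the source file.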

If you change the label to inflation_adjusted_gross, how do the results change?

In [75]:
model = sm.OLS(df2.inflation_adjusted_gross, df2.iloc[:,5:]).fit()
print(model.summary())

                               OLS Regression Results                               
Dep. Variable:     inflation_adjusted_gross   R-squared:                       0.257
Model:                                  OLS   Adj. R-squared:                  0.249
Method:                       Least Squares   F-statistic:                     32.94
Date:                      Wed, 06 Dec 2023   Prob (F-statistic):           3.79e-34
Time:                              14:53:20   Log-Likelihood:                -12009.
No. Observations:                       579   AIC:                         2.403e+04
Df Residuals:                           572   BIC:                         2.406e+04
Df Model:                                 6                                         
Covariance Type:                  nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------