In [None]:
%reload_ext nb_black

In [None]:
import warnings

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# https://gist.github.com/AdamSpannbauer/c99c366b0c7d5b6c4920a46c32d738e5
def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF; assumed to be a pandas.DataFrame
    :return: None; the VIFs are printed as a pandas Series
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)

    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print("VIF results\n-------------------------------")
    print(pd.Series(vifs, index=x.columns))
    print("-------------------------------\n")

* Read in and get to know the data.  We eventually want to predict `Profit`.

In [None]:
data_url = "https://docs.google.com/spreadsheets/d/1RJrLftlRnj6gmrYewqxykVKSyl7aV-Ktd3sUNQILidM/export?format=csv"

* Do we have an even distribution of states?  We'll eventually encode this variable to be numeric.  How should we encode it?  Which category would be the 'default'?

* Create a pair plot of all of the data.  What do you see?

* Create a train/test split stratified by `State`

* One-hot encode

We'll take a look at using the `ColumnTransformer` today.  This is a way to write a one-stop shop for all of your column preprocessing for a supervised learning model.  We can use it to one-hot encode categorical variables and scale numeric variables all at once.

In [None]:
# Identify which columns are cat and num
# For one hot encoding, specify which categories to drop
cat_cols = ["State"]
drop_cats = ["California"]

num_cols = ["R&D Spend", "Administration", "Marketing Spend"]

----

Option showing how to one-hot encode and leave the numerics untouched:

In [None]:
ct = ColumnTransformer(
    #   Format
    #   [("name of step", WhatToDo(), list_of_columns_to_do_it_to)]
    [("one_hot_encode", OneHotEncoder(drop=drop_cats), cat_cols)],
    # Do nothing to the rest of the data
    remainder="passthrough",
)

Option showing how to one-hot encode and scale the numerics:

In [None]:
# ct = ColumnTransformer(
#     #   Format
#     #   [("name of step", WhatToDo(), list_of_columns_to_do_it_to)]
#     [("one_hot_encode", OneHotEncoder(drop=drop_cats), cat_cols)],
#     # Scale the rest of the data
#     remainder=StandardScaler(),
# )

----

A big benefit of this is a single `fit` method that figures out how to one-hot encode, scale, and whatever else all at once.  We also have a single `transform` method that prepares any data we hand it in exactly the same way.  This is a huge plus for being able to predict on new data.

In [None]:
ct.fit(X_train)

X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)

X_train = pd.DataFrame(X_train_trans, index=X_train.index)
X_test = pd.DataFrame(X_test_trans, index=X_test.index)

X_train.head(2)

The downside is that it's harder to track down the variable names :(  By default, `transform` hands back a plain array, so the column names are lost.

If we don't care about interpretability (just focused on accuracy), this isn't terrible.  It's annoying if we do care about interpreting, and with linear regression we almost always care about interpreting coefficients.

This is, admittedly, a pain.  Ugly code below to address the issue for this model.

In [None]:
# scikit-learn >= 1.0; older versions called this method get_feature_names
cat_names = ct.transformers_[0][1].get_feature_names_out(cat_cols)
cat_names = list(cat_names)

new_col_names = cat_names + num_cols

X_train.columns = new_col_names
X_test.columns = new_col_names

X_train.head(2)

Let's try to drive home why this is better than `pd.get_dummies` in a machine learning context.  For our models to make any difference to the business they need to be 'deployed'; that is, they need to be living somewhere that they can receive new data and make predictions.

Let's say we run a website for people to judge how well a startup would do.  All users need to do is go to our website, fill out a form that asks them for the `'R&D Spend'`, `'Administration'`, `'Marketing Spend'`, and `'State'`.  Given this info, our model is expected to predict how much `'Profit'` we think the startup will have.

This means that we'll get one new observation at a time.  Below is an example of what a user might input.

In [None]:
new_observation = pd.DataFrame(
    {
        "R&D Spend": [73721],
        "Administration": [121344],
        "Marketing Spend": [211025],
        "State": ["California"],
    }
)

new_observation

For us to make a prediction, we need to reformat this data the same way we reformatted our original data.

In [None]:
# Reminder of what the training data looks like right now.
# Our new observation needs to match for our model to know
# what to do (column names optional, models don't care about them)
X_train.head(2)

Maybe you'd think to use `pd.get_dummies()`. What's the issue with this?

In [None]:
pd.get_dummies(new_observation)

This is where the extra work we put into the `ColumnTransformer()` pays off.
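
To make the contrast concrete, here is a minimal, self-contained sketch.  The training rows are made up, and `demo_train`, `demo_ct`, and `demo_new` are hypothetical names used only for illustration; the transformer mirrors the passthrough option from earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows with the same columns as the startup data
demo_train = pd.DataFrame(
    {
        "R&D Spend": [100000.0, 80000.0, 60000.0],
        "Administration": [120000.0, 110000.0, 90000.0],
        "Marketing Spend": [300000.0, 250000.0, 200000.0],
        "State": ["California", "Florida", "New York"],
    }
)

demo_ct = ColumnTransformer(
    [("one_hot_encode", OneHotEncoder(drop=["California"]), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
demo_ct.fit(demo_train)

# A single new observation, as a user might submit it
demo_new = pd.DataFrame(
    {
        "R&D Spend": [73721],
        "Administration": [121344],
        "Marketing Spend": [211025],
        "State": ["California"],
    }
)

# get_dummies only knows about the categories present in *this* frame,
# so it produces a single State_California column and nothing else
dummies = pd.get_dummies(demo_new)
print(list(dummies.columns))

# The fitted transformer remembers the categories seen during fit,
# so the output has the same 5 columns as the training data
encoded = demo_ct.transform(demo_new)
print(encoded)
```

With `get_dummies` the new row has 4 columns, none of which line up with the training layout.  With the fitted transformer, the row comes back with both state columns set to 0, which is how the dropped 'default' category is represented.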

------

* Check for multicollinearity with VIF

* Build a model using statsmodels and display the summary
    * Interpret $R^2$

* Check the normality of residuals assumption with a qqplot

* Check the homoscedasticity assumption with `statsmodels`

* Make a plot of actuals vs predicted

* Calculate MAE, MAPE, MSE, & RMSE
  * Interpret MAE and MAPE
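
As a reference for the error metrics, here is a small sketch that computes all four with plain NumPy.  The `y_true` and `y_pred` values are made up for illustration and are not output from this notebook's model.

```python
import numpy as np

# Made-up actuals and predictions (same units as the target)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_pred

# Mean absolute error: average size of a miss, in the target's units
mae = np.mean(np.abs(errors))

# Mean absolute percentage error: average miss as a fraction of the actual
mape = np.mean(np.abs(errors / y_true))

# Mean squared error: penalizes large misses more heavily
mse = np.mean(errors ** 2)

# Root mean squared error: MSE brought back to the target's units
rmse = np.sqrt(mse)

print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape:.1%}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

For these toy numbers, the MAE of 17.5 reads as "on average, a prediction misses by 17.5 units", and the MAPE of 7.5% reads as "on average, a prediction is off by 7.5% of the actual value".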

-------

#### Group A: 

Re-fit the model, but use either `QuantileTransformer()`, `StandardScaler()`, or `MinMaxScaler()`.

#### Group B:

Re-fit the model, but drop the predictor that was the worst predictor in our original model.

----

* Using the `statsmodels` output as a reference: Is your model performing better, worse, or no different than our original model? Which numbers back this up?
* Use `MAE` to evaluate your model.  Interpret this number for a business person.  According to this metric, how does your model perform compared to the original?  (Again, express this as if you're talking to a business person.)

In [None]:
# You can write new code for this or modify the above existing code
# Modifying is probably less effort


--------

Let's use `sklearn`'s `cross_val_score` to see a more 'stable' picture of your model's accuracy:
* Use `cross_val_score` to calculate $R^2$
    * $R^2$ is the default score
    * We can choose from [a long list of scores](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) `sklearn` can do for us.

* Use `cross_val_score` with a different score than $R^2$
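
A minimal sketch of both calls, using synthetic data rather than the startup data (`X_demo` and `y_demo` are made-up names and values).  Note that `sklearn` scorers follow a "higher is better" convention, so error metrics come back negated, for example `"neg_mean_absolute_error"`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 60 rows, 3 features, known linear signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)

model = LinearRegression()

# Default scoring for a regressor is R^2; cv=5 gives one score per fold
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# Ask for a different score; this one is returned as a negative number
neg_mae_scores = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
mae_scores = -neg_mae_scores  # flip the sign to get ordinary MAE

print("R^2 per fold:", r2_scores.round(3))
print("MAE per fold:", mae_scores.round(3))
print("Mean MAE:", mae_scores.mean().round(3))
```

Looking at the spread across folds, not just the mean, is what gives the more 'stable' picture: one lucky or unlucky split no longer decides the verdict.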