# Python Practice Lecture 9 MATH 342W Queens College - OLS Using Categorical Predictors
## Author: Amir ElTabakh
## Date: March 1, 2022

## Agenda:
* OLS using categorical predictors

## OLS Using Categorical Predictors

Note that historically this is called "Analysis of Variance", or "ANOVA" for short. But there is no difference to the computer: it still crunches the same matrices.

Let's get the `Cars93` data again:

In [None]:
# Importing dependencies
import numpy as np # mathematical operations
import pandas as pd # pandas DataFrame object
from sklearn.linear_model import LinearRegression # Build Linear Regression models
import statsmodels.api as sm # Get standard R datasets
from sklearn.metrics import mean_squared_error, r2_score # RMSE, R^2

# Load dataset
cars = sm.datasets.get_rdataset("Cars93", "MASS")

# Assign data to a variable as a df object
cars_df = pd.DataFrame(cars.data)
cars_df.head()

Let's try to model `Type`, a factor with 6 levels.

In [None]:
# Print out categories
cars_df['Type'].unique()

What will $\hat{y}$ look like? It should be the $\bar{y}$ for each level. What is $p$? 6. First we'll use the `pandas.get_dummies` function to convert the categorical variable into dummy/indicator variables. Regression results are easiest to interpret when each dummy variable takes only the values 1 or 0: typically, 1 represents the presence of a qualitative attribute and 0 its absence.

In [None]:
# Set X
X = cars_df[['Type']]
X

In [None]:
# dummify categorical variables
X = pd.get_dummies(data=X, drop_first=True)
X.head()

The one categorical variable got blown up into 5 features. How do we interpret them? First we need to know the "reference category", i.e. which level is missing from the list. Cross-referencing the column names with the levels of the raw feature shows that the reference category is `Compact`. So what is the prediction for the Compact type? The intercept. What is the prediction for the Large type? Intercept + Large, etc. We do not need to add a column of 1's to generate an intercept value, since `LinearRegression` fits an intercept by default.
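The interpretation above can be checked on a small example. This is a minimal sketch on hypothetical toy data (the `toy` frame and its prices are made up, not taken from `Cars93`): with the first level dropped, the intercept should equal the reference level's mean, and the intercept plus each dummy coefficient should equal that level's mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three levels, two observations each (made-up prices)
toy = pd.DataFrame({
    "Type": ["Compact", "Compact", "Large", "Large", "Van", "Van"],
    "Price": [10.0, 12.0, 20.0, 24.0, 15.0, 17.0],
})

# Dummify; the first level in sorted order (Compact) becomes the reference
D = pd.get_dummies(toy[["Type"]], drop_first=True, dtype=float)

# Model matrix with an explicit column of 1's for the intercept
X_toy = np.column_stack([np.ones(len(toy)), D.to_numpy()])
y_toy = toy["Price"].to_numpy()

# Least-squares fit: b_toy = [intercept, Type_Large, Type_Van]
b_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Per-level means for comparison
group_means = toy.groupby("Type")["Price"].mean()

print(list(D.columns))
print(b_toy)
print(group_means)
```

Here the intercept matches the Compact mean, and adding the `Type_Large` or `Type_Van` coefficient recovers the mean of that level.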

Now let's build a linear model and get the coefficients and intercept then get the $R^2$ value.

In [None]:
# Setting X and y
y = cars_df[['Price']]

# initialize model
anova_model = LinearRegression()

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

In [None]:
# R^2
print(anova_model.score(X, y))

Let's create our model matrix. We'll calculate our $R^2$ via the theory we learn in class.
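For reference, these are the quantities computed in the cells that follow, written in one place. The notation is an assumption on my part ($X$ is the $n \times p$ model matrix including the column of 1's, and $s^2_y$, $s^2_e$ are the variances of the response and the residuals as computed by `np.var`):

$$
\begin{aligned}
b &= (X^\top X)^{-1} X^\top y, \\
\hat{y} &= X b, \\
e &= y - \hat{y}, \\
R^2 &= \frac{s^2_y - s^2_e}{s^2_y}, \\
\text{RMSE} &= \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - p}}.
\end{aligned}
$$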

In [None]:
# Insert intercept
X.insert(0, 'Intercept', 1)

# Convert df to a float matrix (dummy columns may be boolean)
X_m = X.to_numpy(dtype=float)

# Print first 10 rows of X_m
X_m[0:10]

In [None]:
# X transpose
Xt = X_m.transpose()

In [None]:
# X transpose * X
XtX = Xt @ X_m

In [None]:
# XtX Inverse
XtX_inv = np.linalg.inv(XtX)

In [None]:
# solve for b
b = XtX_inv @ Xt @ y

# Rename column name
b.columns = ['b vector']
b

In [None]:
# Get yhat values
yhat = (X_m @ b).to_numpy()
yhat[0:10]

In [None]:
# define residual error
e = (y - yhat).to_numpy()
e[0:10]

In [None]:
# R^2
Rsq = float((np.var(y.to_numpy()) - np.var(e)) / np.var(y.to_numpy()))
Rsq

In [None]:
# RMSE
np.sqrt(np.sum(e**2) / (len(X) - 6))

And of course the coefficients and $R^2$ are identical to the output from `LinearRegression` above.

If we want to do a more "pure ANOVA", we can get rid of the intercept and see the $\bar{y}$'s immediately. This is handled when you initialize the model object:

In [None]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type']], drop_first=False)

# initialize model
anova_model = LinearRegression(fit_intercept = False)

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

Is this correct?

In [None]:
cars_df.groupby('Type')['Price'].mean()

What does $R^2$ look like?

In [None]:
# R^2 no intercept
anova_model.score(X, y)

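Remember this from last time? What happened? In R, the $R^2$ reported by `lm` is not accurate without the intercept, because it is computed against the uncentered total sum of squares $\sum y_i^2$. Here, `score` always uses the centered total sum of squares, and the six dummy columns span the same space as an intercept plus five dummies, so the $R^2$ matches the model with the intercept. Keep this difference in mind.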
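To see how much the choice of total sum of squares matters once the intercept is dropped, here is a minimal sketch on hypothetical toy data (the one-hot matrix and prices are made up). It fits a no-intercept model and computes $R^2$ two ways: against the centered total sum of squares and against the uncentered one.

```python
import numpy as np

# Hypothetical toy response and one-hot model matrix (no intercept column)
y_toy = np.array([10.0, 12.0, 20.0, 24.0, 15.0, 17.0])
X_toy = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

# Least-squares fit without an intercept; coefficients are the group means
b_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
e_toy = y_toy - X_toy @ b_toy
sse = np.sum(e_toy ** 2)

# Two candidate denominators for R^2
sst_centered = np.sum((y_toy - y_toy.mean()) ** 2)
sst_uncentered = np.sum(y_toy ** 2)

rsq_centered = 1 - sse / sst_centered
rsq_uncentered = 1 - sse / sst_uncentered

print(f"centered R^2:   {rsq_centered:.4f}")
print(f"uncentered R^2: {rsq_uncentered:.4f}")
```

The uncentered version is larger because $\sum y_i^2$ also counts the distance of the mean from zero, which inflates the apparent fit.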

What does the design matrix (model matrix) look like? We can use the `.to_numpy()` method to generate the columns of $X$ from the data frame.

In [None]:
# Convert df to matrix
X_m = X.to_numpy()
X_m[0:20]

Regressions without an intercept are not recommended. Here's why. What if we were doing two factors? I want a linear model with both `Type` and `AirBags`:

In [None]:
# Exploring AirBags column
print(cars_df['AirBags'].value_counts())

AirBags is another nominal categorical variable, this time with three levels.

We invoke the model as follows.

In [None]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'AirBags']], drop_first=True)

# X column names
X.columns

In [None]:
# initialize model
anova_model = LinearRegression(fit_intercept = True)

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

# get yhat
yhat = anova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {np.sqrt(mean_squared_error(y_true=y, y_pred=yhat))}")

In [None]:
# Get yhat
yhat = anova_model.predict(X)

# Calculating RMSE
rmse = np.sqrt(mean_squared_error(y_true=y, y_pred=yhat))
print(f"RMSE: {rmse}")

What are the interpretations now? What is the "reference level"? It's actually two levels in one: Type = Compact and AirBags = Driver & Passenger.

A deeper question: can we read off the average for Type = Midsize and AirBags = None? No... that requires a modeling "enhancement" we will discuss a few lectures from now.
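In an additive model like this one, the fitted value for any combination of levels is the intercept plus one coefficient per factor, which need not equal the observed mean of that cell. The sketch below uses a hypothetical 2x2 toy design with one made-up observation per cell (not the `Cars93` data) to illustrate the gap.

```python
import numpy as np

# Hypothetical 2x2 toy design, one observation per cell.
# Columns: intercept, Type = Midsize, AirBags = None
# (reference levels here: Type = Compact, AirBags = Driver only)
X_toy = np.array([
    [1, 0, 0],  # Compact, Driver only
    [1, 0, 1],  # Compact, None
    [1, 1, 0],  # Midsize, Driver only
    [1, 1, 1],  # Midsize, None
], dtype=float)
y_toy = np.array([10.0, 14.0, 20.0, 30.0])  # made-up prices

# Additive least-squares fit
b_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Fitted value for (Midsize, None): intercept + Midsize + None
pred_midsize_none = b_toy.sum()
observed_midsize_none = y_toy[3]

print(b_toy)
print(pred_midsize_none, observed_midsize_none)
```

The additive prediction for the (Midsize, None) cell differs from the value observed in that cell, which is why the cell average cannot simply be read off the coefficients.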

If we model it without an intercept,

In [None]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'AirBags']], drop_first=False)

# X column names
X.columns

In [None]:
# initialize model
anova_model = LinearRegression(fit_intercept = False) # not modeling intercept

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

# get yhat
yhat = anova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {np.sqrt(mean_squared_error(y_true=y, y_pred=yhat))}")

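In R, we would only get $\bar{y}$'s for the first factor predictor crossed with the reference category of the second, so `TypeCompact` would refer to the average of Type = Compact and AirBags = Driver & Passenger. Note that here `drop_first=False` keeps every level of both factors, so the dummy columns are linearly dependent (each factor's dummies sum to a column of 1's) and the coefficients are not uniquely determined.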
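One thing worth checking when every level of two factors is kept is whether the dummy columns are linearly independent. This is a minimal sketch on a hypothetical toy frame (the level combinations are made up) comparing the rank of the full dummy matrix with the rank of the reference-coded matrix plus an intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two factors (3 levels x 2 levels)
toy = pd.DataFrame({
    "Type": ["Compact", "Compact", "Large", "Large", "Van", "Van"],
    "AirBags": ["None", "Driver only", "None", "Driver only",
                "None", "Driver only"],
})

# All levels of both factors: 3 + 2 = 5 dummy columns
X_full = pd.get_dummies(toy, drop_first=False, dtype=float)

# Reference coding: (3 - 1) + (2 - 1) = 3 dummy columns, plus an intercept
X_ref = pd.get_dummies(toy, drop_first=True, dtype=float)
X_ref_int = np.column_stack([np.ones(len(toy)), X_ref.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full.to_numpy())
rank_ref = np.linalg.matrix_rank(X_ref_int)

print(X_full.shape[1], rank_full)    # number of columns vs. rank
print(X_ref_int.shape[1], rank_ref)
```

The full dummy matrix has more columns than its rank, while the reference-coded matrix with an intercept is full rank.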

Now let's create a linear model using one categorical predictor and one continuous predictor. The combination is called for historical reasons "Analysis of Covariance" or "ANCOVA" for short.

Let's use `Type` and `Horsepower`:

In [None]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'Horsepower']], drop_first=True)

# X column names
X.columns

In [None]:
# initialize model
ancova_model = LinearRegression(fit_intercept = True) # modeling intercept

# fit model
ancova_model.fit(X, y)

# print b0
print(ancova_model.intercept_)

# print coefficients
print(ancova_model.coef_)

# get yhat
yhat = ancova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {np.sqrt(mean_squared_error(y_true=y, y_pred=yhat))}")

Interpretation of estimated coefficients? Why did $R^2$ increase? (We will be explaining this in detail in the next unit).

What's going on in the design / model matrix? Note that `X` itself has no column of 1's; the intercept is accounted for by `fit_intercept = True` in the model initialization line.

In [None]:
X.head()

Same as the model matrix with just `Type`, plus a `Horsepower` column. Since `Horsepower` is continuous, it doesn't get dummified into more features.
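One way to picture this kind of model is as parallel lines: the continuous predictor contributes a single shared slope, and each dummy shifts the intercept for its level. The sketch below uses hypothetical, noise-free toy data (made-up horsepower values and a made-up two-level factor) so that the fit recovers the generating coefficients exactly.

```python
import numpy as np

# Hypothetical toy data: one continuous predictor and one two-level factor
horsepower = np.array([100.0, 150.0, 200.0, 100.0, 150.0, 200.0])
is_large = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # dummy for a non-reference level

# Made-up, noise-free response: shared slope 0.1, intercept 5, level shift 3
price = 5.0 + 0.1 * horsepower + 3.0 * is_large

# Model matrix: intercept, continuous column (left as-is), dummy column
X_toy = np.column_stack([np.ones(len(price)), horsepower, is_large])

# Least-squares fit: b_toy = [intercept, slope, shift for the dummy level]
b_toy, *_ = np.linalg.lstsq(X_toy, price, rcond=None)

print(b_toy)
```

The slope is the change in price per unit of horsepower within any level, and the dummy coefficient is the vertical gap between the two lines.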

What if we went back to the `Type` regression, left out the intercept, dummified and added the intercept back in?

In [None]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type']], drop_first=False)

# X column names
X.columns

In [None]:
# initialize model
ancova_model = LinearRegression(fit_intercept = False) # not modeling intercept

# fit model
ancova_model.fit(X, y)

# print b0
print(ancova_model.intercept_)

# print coefficients
print(ancova_model.coef_)

# get yhat
yhat = ancova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {np.sqrt(mean_squared_error(y_true=y, y_pred=yhat))}")

And let's derive the coefficients ourselves,

In [None]:
# Convert df to matrix
X = X.to_numpy(dtype=float)

XtX = X.transpose() @ X

XtX_inverse = np.linalg.inv(XtX)

b = XtX_inverse @ X.transpose() @ y
b

# NOT CONSISTENT WITH KAPS R NOTES: no intercept column was added back to X here, so XtX is invertible and the computation works fine.
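To see what would change if a column of 1's were appended to a full set of dummies, here is a minimal sketch on a hypothetical toy one-hot matrix (made-up data, three levels). It compares the rank of $X^\top X$ with and without the extra intercept column.

```python
import numpy as np

# Hypothetical one-hot matrix for a three-level factor, two rows per level
X_onehot = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)
y_toy = np.array([10.0, 12.0, 20.0, 24.0, 15.0, 17.0])  # made-up prices

# Without an intercept column: XtX is diagonal and invertible
XtX = X_onehot.T @ X_onehot
b_toy = np.linalg.inv(XtX) @ X_onehot.T @ y_toy  # the group means

# With an intercept column added back: the 1's equal the sum of the dummies
X_plus = np.column_stack([np.ones(len(y_toy)), X_onehot])
XtX_plus = X_plus.T @ X_plus

rank_onehot = np.linalg.matrix_rank(XtX)
rank_plus = np.linalg.matrix_rank(XtX_plus)

print(b_toy)
print(XtX.shape[0], rank_onehot)
print(XtX_plus.shape[0], rank_plus)
```

With the intercept column added, $X^\top X$ is 4 x 4 but its rank stays at 3, so it has no inverse; without it, the matrix is full rank and the coefficients are the group means.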