# Python Practice Lecture 9 MATH 342W Queens College - OLS Using Categorical Predictors
## Author: Amir ElTabakh
## Date: March 1, 2022

## Agenda:
* OLS using categorical predictors

## OLS Using Categorical Predictors

Historically, this is called "Analysis of Variance", or "ANOVA" for short. To the computer there is no difference: it still crunches the same matrices.

Let's get the cars93 data again:

In [1]:
# Importing dependencies
import numpy as np # mathematical operations
import pandas as pd # pandas DataFrame object
from sklearn.linear_model import LinearRegression # Build Linear Regression models
import statsmodels.api as sm # Get standard R datasets
from sklearn.metrics import mean_squared_error, r2_score # RMSE, R^2

# Load dataset
cars = sm.datasets.get_rdataset("Cars93", "MASS")

# Assign data to a variable as a df object
cars_df = pd.DataFrame(cars.data)
cars_df.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25,31,,Front,...,5,177,102,68,37,26.5,11.0,2705,non-USA,Acura Integra
1,Acura,Legend,Midsize,29.2,33.9,38.7,18,25,Driver & Passenger,Front,...,5,195,115,71,38,30.0,15.0,3560,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20,26,Driver only,Front,...,5,180,102,67,37,28.0,14.0,3375,non-USA,Audi 90
3,Audi,100,Midsize,30.8,37.7,44.6,19,26,Driver & Passenger,Front,...,6,193,106,70,37,31.0,17.0,3405,non-USA,Audi 100
4,BMW,535i,Midsize,23.7,30.0,36.2,22,30,Driver only,Rear,...,4,186,109,69,39,27.0,13.0,3640,non-USA,BMW 535i


Let's try to model `Type`, a factor with 6 levels.

In [2]:
# Print out categories
cars_df['Type'].unique()

array(['Small', 'Midsize', 'Compact', 'Large', 'Sporty', 'Van'],
      dtype=object)

What will $\hat{y}$ look like? It should be the $\bar{y}$ for each level. What is $p$? 6. First we'll use the `pandas.get_dummies` function to convert the categorical variable into dummy (indicator) variables. Regression results are easiest to interpret when each dummy variable takes only the values 1 or 0: typically, 1 represents the presence of a qualitative attribute and 0 its absence.
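As a small illustration of the dummy encoding described above, the sketch below applies `pd.get_dummies` to a made-up column. The toy data and the `dtype=int` argument are assumptions for illustration and are not part of the lecture's dataset.

```python
import pandas as pd

# Toy categorical column (hypothetical data, not Cars93)
toy = pd.DataFrame({"Type": ["Small", "Van", "Compact", "Small"]})

# One indicator column per level; dtype=int forces 0/1 instead of booleans
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first level in sorted order,
# which then acts as the reference category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the full encoding every row has exactly one 1; with `drop_first=True` a row of all 0's marks the reference level.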

In [3]:
# Set X
X = cars_df[['Type']]
X

Unnamed: 0,Type
0,Small
1,Midsize
2,Compact
3,Midsize
4,Midsize
...,...
88,Van
89,Compact
90,Sporty
91,Compact


In [4]:
# dummify categorical variables
X = pd.get_dummies(data=X, drop_first=True)
X.head()

Unnamed: 0,Type_Large,Type_Midsize,Type_Small,Type_Sporty,Type_Van
0,0,0,1,0,0
1,0,1,0,0,0
2,0,0,0,0,0
3,0,1,0,0,0
4,0,1,0,0,0


The one categorical variable got blown up into 5 features. How do we interpret them? First we need to know the "reference category", i.e. which level is missing from the list. Cross-referencing the dummy column names with the table of the raw feature shows that the reference category is `Compact`. So what is the prediction for the Compact type? The intercept. What is the prediction for the Large type? Intercept + Large coefficient, etc. We do not need to add a column of 1's ourselves, because `LinearRegression` fits an intercept by default.

Now let's build a linear model and get the coefficients and intercept then get the $R^2$ value.

In [5]:
# Setting X and y
y = cars_df[['Price']]

# initialize model
anova_model = LinearRegression()

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

[18.2125]
[[ 6.0875      9.00568182 -8.04583333  1.18035714  0.8875    ]]
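To make the interpretation concrete, the sketch below adds each coefficient to the intercept to get the predicted price per level. The numbers are copied from the printed output above; the dictionary names are only illustrative.

```python
# Intercept and coefficients copied from the printed output of the fit
b0 = 18.2125
coefs = {
    "Large": 6.0875,
    "Midsize": 9.00568182,
    "Small": -8.04583333,
    "Sporty": 1.18035714,
    "Van": 0.8875,
}

# The reference level (Compact) is predicted by the intercept alone;
# every other level is the intercept plus its own coefficient
preds = {"Compact": b0}
preds.update({level: b0 + c for level, c in coefs.items()})

for level, pred in preds.items():
    print(f"{level}: {pred:.4f}")
```

These sums should agree with the per-level average prices, which is the sense in which $\hat{y}$ is the $\bar{y}$ of each level.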


In [6]:
# R^2
print(anova_model.score(X, y))

0.3985818528346551


Let's create our model matrix. We'll calculate our $R^2$ via the theory we learned in class.

Here we must include an intercept column. The `LinearRegression` class fits an intercept automatically unless you change the `fit_intercept` parameter. Since we are now building the model matrix by hand, we need to add the column of 1's ourselves so that the $R^2$ value is calculated appropriately.

`fit_intercept = False` is the setting you would use if you don't want the model to fit an intercept.
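The theory referred to above can be summarized as follows, writing $X$ for the model matrix including the column of 1's, $b$ for the coefficient vector and $e$ for the residuals. The $s^2$ notation for variance is an assumption about the class notation.

$$
b = (X^\top X)^{-1} X^\top y, \qquad \hat{y} = X b, \qquad e = y - \hat{y}, \qquad R^2 = \frac{s_y^2 - s_e^2}{s_y^2} = 1 - \frac{SSE}{SST}
$$

The last equality relies on the intercept: with a column of 1's in $X$, the residuals average to zero, so $s_e^2$ and $s_y^2$ are $SSE$ and $SST$ divided by the same constant.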

In [7]:
# Insert intercept
X.insert(0, 'Intercept', 1)  # a scalar is broadcast to every row

# Convert df to matrix
X_m = X.to_numpy()

# Print first 10 rows of X_m
X_m[0:10]

array([[1, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0]], dtype=int64)

In [8]:
# X transpose
Xt = X_m.transpose()

In [9]:
# X transpose * X
XtX = Xt @ X_m

In [10]:
# XtX Inverse
XtX_inv = np.linalg.inv(XtX)

In [11]:
# solve for b
b = XtX_inv @ Xt @ y

# Rename column name
b.columns = ['b vector']
b

Unnamed: 0,b vector
0,18.2125
1,6.0875
2,9.005682
3,-8.045833
4,1.180357
5,0.8875
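As a side note, the same coefficients can be obtained without forming the inverse explicitly. The sketch below compares the two routes on a tiny made-up design matrix. The data are hypothetical, and using `np.linalg.lstsq` is a suggested alternative, not what the lecture does.

```python
import numpy as np

# Toy design matrix: intercept column plus one dummy (hypothetical data)
X = np.array([[1, 0], [1, 0], [1, 1], [1, 1]], dtype=float)
y = np.array([10.0, 12.0, 20.0, 24.0])

# Normal equations with an explicit inverse, as in the cells above
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem handed to a solver that never forms the inverse
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_inv)
print(b_lstsq)
```

Both routes give an intercept equal to the reference group's mean and a slope equal to the difference between the two group means.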


In [12]:
# Get yhat values
yhat = (X_m @ b).to_numpy()
yhat[0:10]

array([[10.16666667],
       [27.21818182],
       [18.2125    ],
       [27.21818182],
       [27.21818182],
       [27.21818182],
       [24.3       ],
       [24.3       ],
       [27.21818182],
       [24.3       ]])

In [13]:
# define residual error
e = (y - yhat).to_numpy()
e[0:10]

array([[  5.73333333],
       [  6.68181818],
       [ 10.8875    ],
       [ 10.48181818],
       [  2.78181818],
       [-11.51818182],
       [ -3.5       ],
       [ -0.6       ],
       [ -0.91818182],
       [ 10.4       ]])

In [14]:
# R^2
Rsq = float((np.var(y) - np.var(e)) / np.var(y))
Rsq

0.3985818528346551

In [15]:
# RMSE
np.sqrt(sum(e**2) / (len(X) - 6))[0]

7.703250679453583

And of course the coefficients and $R^2$ are identical to the output from `LinearRegression`.

If we want to do a more "pure ANOVA", we can get rid of the intercept and see the $\bar{y}$'s immediately. This is handled when you initialize the model object:

In [35]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type']], drop_first=False)

# initialize model
anova_model_no_intercept = LinearRegression(fit_intercept = False)

# fit model
anova_model_no_intercept.fit(X, y)

# print b0
print(anova_model_no_intercept.intercept_)

# print coefficients
print(anova_model_no_intercept.coef_)

0.0
[[18.2125     24.3        27.21818182 10.16666667 19.39285714 19.1       ]]


Is this correct?

In [36]:
cars_df.groupby('Type').mean()['Price']

Type
Compact    18.212500
Large      24.300000
Midsize    27.218182
Small      10.166667
Sporty     19.392857
Van        19.100000
Name: Price, dtype: float64

What does $R^2$ look like?

In [37]:
# R^2 no intercept
anova_model_no_intercept.score(X, y)

0.3985818528346551

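Remember this from last time? In R, the $R^2$ reported by `lm` is not accurate without the intercept. Here, `score` reports the same $R^2$ as the model with the intercept (0.3986), because both models produce the same fitted values (the group means) and `score` measures them against the same total variation about $\bar{y}$. Keep this in mind when comparing against R output.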
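A minimal sketch of why $R^2$ for a no-intercept fit can differ between software: the total sum of squares can be taken about $\bar{y}$ (centered) or about zero (uncentered). The toy data are hypothetical. The claim that R's `lm` summary uses the uncentered version when there is no intercept reflects my understanding of R, not something shown in this notebook.

```python
import numpy as np

# Toy one-factor data: full set of dummies, no intercept column (hypothetical)
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
y = np.array([10.0, 12.0, 20.0, 24.0])

# Least-squares fit; the coefficients are the two group means
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
sse = np.sum(e ** 2)

# Centered total sum of squares (the convention behind sklearn's r2_score)
r2_centered = 1 - sse / np.sum((y - y.mean()) ** 2)

# Uncentered total sum of squares (measured about zero instead of the mean)
r2_uncentered = 1 - sse / np.sum(y ** 2)

print(r2_centered, r2_uncentered)
```

The uncentered version is larger whenever $\bar{y} \neq 0$, which is why the two conventions should not be compared directly.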

What does the design matrix (model matrix) look like? We can use the `.to_numpy()` method to generate the columns of $X$ from the data frame.

In [19]:
# Convert df to matrix
X_m = X.to_numpy()
X_m[0:20]

array([[0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0]], dtype=uint8)

Regressions without an intercept are not recommended. Here's why. What if we were doing two factors? I want a linear model with both `Type` and `AirBags`:

In [20]:
# Exploring AirBags column
print(cars_df['AirBags'].value_counts())

Driver only           43
None                  34
Driver & Passenger    16
Name: AirBags, dtype: int64


AirBags is another nominal categorical variable, this time with three levels.

We invoke the model as follows.

In [21]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'AirBags']], drop_first=True)

# X column names
X.columns

Index(['Type_Large', 'Type_Midsize', 'Type_Small', 'Type_Sporty', 'Type_Van',
       'AirBags_Driver only', 'AirBags_None'],
      dtype='object')

In [22]:
# initialize model
anova_model = LinearRegression(fit_intercept = True)

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

# get yhat
yhat = anova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {mean_squared_error(y_true=y, y_pred=yhat, squared=False)}")

[24.28327824]
[[  3.29537038   7.35691165  -5.15459312   0.22922837   3.30250817
   -5.15216212 -10.15259856]]
R Squared: 0.4924977895569449
RMSE: 6.844203087778953


In [23]:
# Get yhat
yhat = anova_model.predict(X)

# Calculating RMSE
rmse = mean_squared_error(y_true=y, y_pred=yhat, squared=False)
print(f"RMSE: {rmse}")

RMSE: 6.844203087778953


What are the interpretations now? What is the "reference level"? It's actually two levels in one: Type = Compact and AirBags = Driver & Passenger.

A deeper question: can we read off the average for Type = Midsize and AirBags = None? No... this requires a modeling "enhancement" that we will discuss a few lectures from now.
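What the model does provide for that combination is an additive prediction: the reference-cell intercept plus one shift per factor. The sketch below uses the coefficients printed above; the variable names are only illustrative.

```python
# Values copied from the printed output of the Type + AirBags fit with an intercept
intercept = 24.28327824      # Type = Compact, AirBags = Driver & Passenger
type_midsize = 7.35691165    # shift for Type = Midsize
airbags_none = -10.15259856  # shift for AirBags = None

# Additive prediction: reference cell plus one shift per factor
pred_midsize_none = intercept + type_midsize + airbags_none
print(round(pred_midsize_none, 4))  # 21.4876
```

Because the two shifts are simply added, this prediction is in general not the raw average price of Midsize cars without airbags.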

If we model it without an intercept,

In [24]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'AirBags']], drop_first=False)

# X column names
X.columns

Index(['Type_Compact', 'Type_Large', 'Type_Midsize', 'Type_Small',
       'Type_Sporty', 'Type_Van', 'AirBags_Driver & Passenger',
       'AirBags_Driver only', 'AirBags_None'],
      dtype='object')

In [25]:
# initialize model
anova_model = LinearRegression(fit_intercept = False) # not modeling intercept

# fit model
anova_model.fit(X, y)

# print b0
print(anova_model.intercept_)

# print coefficients
print(anova_model.coef_)

# get yhat
yhat = anova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {mean_squared_error(y_true=y, y_pred=yhat, squared=False)}")

0.0
[[ 5.39062762  8.685998   12.74753927  0.2360345   5.619856    8.69313579
  18.89265062 13.7404885   8.74005206]]
R Squared: 0.4924977895569448
RMSE: 6.844203087778954


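In R, `lm` would drop one level of the second factor, so the coefficients of the first factor would be the predictions for each `Type` crossed with the reference category of `AirBags`. That is not what happens here. With `drop_first=False` the dummies of each factor sum to a column of 1's, so the nine columns are linearly dependent and the individual coefficients are not uniquely determined. Only certain sums are interpretable. For example, `Type_Compact` + `AirBags_Driver & Passenger` = 5.3906 + 18.8927 = 24.2833, which is the intercept of the previous model, i.e. the prediction for Type = Compact and AirBags = Driver & Passenger. The $R^2$ and RMSE are unchanged because the fitted values are the same.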
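A minimal sketch of the linear dependence that arises when two factors are both fully dummified and no intercept is used. It uses hypothetical toy data with a 3-level and a 2-level factor.

```python
import numpy as np

# Toy data: factor A has 3 levels, factor B has 2 levels (hypothetical)
A = np.array([0, 0, 1, 1, 2, 2])
B = np.array([0, 1, 0, 1, 0, 1])
y = np.array([10.0, 14.0, 20.0, 23.0, 30.0, 35.0])

# Full dummy sets for both factors, no intercept column
X = np.hstack([np.eye(3)[A], np.eye(2)[B]])
n_cols = X.shape[1]
rank = np.linalg.matrix_rank(X)
print(n_cols, rank)

# One least-squares solution, and a second obtained by shifting all A
# coefficients up and all B coefficients down by the same amount
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b_alt = b + 5 * np.array([1, 1, 1, -1, -1], dtype=float)

# Different coefficients, identical fitted values
same_fit = np.allclose(X @ b, X @ b_alt)
print(same_fit)
```

Since many coefficient vectors give the same fit, individual coefficients in such a model should not be interpreted on their own.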

Now let's create a linear model using one categorical predictor and one continuous predictor. For historical reasons, this combination is called "Analysis of Covariance", or "ANCOVA" for short.

Let's use `Type` and `Horsepower`:

In [26]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type', 'Horsepower']], drop_first=True)

# X column names
X.columns

Index(['Horsepower', 'Type_Large', 'Type_Midsize', 'Type_Small', 'Type_Sporty',
       'Type_Van'],
      dtype='object')

In [27]:
# initialize model
ancova_model = LinearRegression(fit_intercept = True) # modeling intercept

# fit model
ancova_model.fit(X, y)

# print b0
print(ancova_model.intercept_)

# print coefficients
print(ancova_model.coef_)

# get yhat
yhat = ancova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {mean_squared_error(y_true=y, y_pred=yhat, squared=False)}")

[1.85998155]
[[ 0.12482839  0.03899734  3.75154161 -3.05269793 -2.45749865 -1.41489021]]
R Squared: 0.678697245526901
RMSE: 5.445793172712398


Interpretation of estimated coefficients? Why did $R^2$ increase? (We will be explaining this in detail in the next unit).
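One way to start on the interpretation is to treat each `Type` as its own line in `Horsepower` with a shared slope. The sketch below uses the coefficients printed above and a hypothetical car with 200 horsepower; the variable names are only illustrative.

```python
# Values copied from the printed output of the Type + Horsepower fit
b0 = 1.85998155         # intercept (Compact at 0 horsepower)
b_hp = 0.12482839       # price change per additional unit of horsepower
b_midsize = 3.75154161  # vertical shift for Midsize relative to Compact

hp = 200  # hypothetical horsepower value

# Same slope for both types; Midsize is the Compact line shifted upward
pred_compact = b0 + b_hp * hp
pred_midsize = b0 + b_hp * hp + b_midsize

print(pred_compact, pred_midsize)
```

The gap between the two predictions equals the `Type_Midsize` coefficient at every horsepower value, because this model has no interaction between the two predictors.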

What's going on in the design / model matrix? Note that the column of 1's does not appear below; it is accounted for by `fit_intercept = True` in the model initialization line.

In [28]:
X.head()

Unnamed: 0,Horsepower,Type_Large,Type_Midsize,Type_Small,Type_Sporty,Type_Van
0,140,0,0,1,0,0
1,200,0,1,0,0,0
2,172,0,0,0,0,0
3,172,0,1,0,0,0
4,208,0,1,0,0,0


Same as the model matrix with just `Type`, plus one `Horsepower` column. Since `Horsepower` is continuous, it doesn't get dummified into more features.

What if we went back to the `Type` regression, left out the intercept, dummified and added the intercept back in?

In [29]:
# Setting X and y
y = cars_df[['Price']]
X = pd.get_dummies(data=cars_df[['Type']], drop_first=False)

# X column names
X.columns

Index(['Type_Compact', 'Type_Large', 'Type_Midsize', 'Type_Small',
       'Type_Sporty', 'Type_Van'],
      dtype='object')

In [30]:
# initialize model
ancova_model = LinearRegression(fit_intercept = False) # not modeling intercept

# fit model
ancova_model.fit(X, y)

# print b0
print(ancova_model.intercept_)

# print coefficients
print(ancova_model.coef_)

# get yhat
yhat = ancova_model.predict(X)

# print R^2
print(f"R Squared: {r2_score(y, yhat)}")

# print RMSE
print(f"RMSE: {mean_squared_error(y_true=y, y_pred=yhat, squared=False)}")

0.0
[[18.2125     24.3        27.21818182 10.16666667 19.39285714 19.1       ]]
R Squared: 0.3985818528346551
RMSE: 7.450616038363192
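This RMSE (7.4506) is smaller than the 7.7033 computed by hand earlier for the same one-factor model. The two differ only in the denominator: the hand calculation divides the sum of squared residuals by $n - p$, while `mean_squared_error` divides by $n$. A quick check, assuming $n = 93$ rows and $p = 6$ coefficients:

```python
import numpy as np

n, p = 93, 6                 # rows in Cars93 and fitted coefficients
rmse_df = 7.703250679453583  # hand-computed value, denominator n - p

# Rescale from the (n - p) denominator to the n denominator
rmse_n = rmse_df * np.sqrt((n - p) / n)
print(rmse_n)
```

The rescaled value should agree with the `mean_squared_error` output up to floating-point error.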


And let's derive the coefficients ourselves,

In [31]:
# Convert df to matrix
X = X.to_numpy()

XtX = X.transpose() @ X

XtX_inverse = np.linalg.inv(XtX)

b = XtX_inverse @ X.transpose() @ y
b

Unnamed: 0,Price
0,18.2125
1,24.3
2,27.218182
3,10.166667
4,19.392857
5,19.1


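Note that this is not consistent with Kap's R notes: here the matrix is invertible and works fine, because the code above never adds the intercept column back in. With all six dummy columns plus a column of 1's, the dummies would sum to the intercept column and $X^\top X$ would be singular.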
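For comparison, a minimal sketch of what happens when a column of 1's is added next to a full set of dummies. It uses a hypothetical 3-level toy factor, not the Cars93 data.

```python
import numpy as np

# Toy factor with 3 levels, two rows per level (hypothetical data)
levels = np.array([0, 0, 1, 1, 2, 2])
D = np.eye(3)[levels]  # full set of dummies

# Dummies only: columns are linearly independent
rank_dummies = np.linalg.matrix_rank(D.T @ D)

# Intercept plus all dummies: the dummies sum to the column of 1's
X_plus = np.hstack([np.ones((len(levels), 1)), D])
rank_plus = np.linalg.matrix_rank(X_plus.T @ X_plus)

print(D.shape[1], rank_dummies)       # 3 columns, rank 3
print(X_plus.shape[1], rank_plus)     # 4 columns, rank 3
```

A rank below the number of columns means $X^\top X$ has no inverse, so one column (either the intercept or one dummy) has to be dropped.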