# Regression with dummy variables, no tears

This notebook shows how to apply ordinary least squares (OLS) regression to a dataset with both continuous and categorical data. OLS regression operates only on numeric variables, so one way to include categorical variables in the model is to create dummy variables from them. One way to create dummy variables from a categorical variable is [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) (OHE).

The idea behind OHE is very simple. Imagine we have a categorical variable `gender` that takes only two values, `male` and `female` (we know that there are [more genders](https://en.wikipedia.org/wiki/Third_gender) than male and female, but please bear with me; this example is only for illustration purposes). We might observe many instances of gender values as in the following table.

<table>
    <tr><th>gender</th></tr>
    <tr><td>male</td></tr>
    <tr><td>female</td></tr>
    <tr><td>female</td></tr>
    <tr><td>male</td></tr>
</table>

If we apply OHE to this gender variable, we will get the following data.

<table>
    <tr><th>gender_male</th><th>gender_female</th></tr>
    <tr><td>1</td><td>0</td></tr>
    <tr><td>0</td><td>1</td></tr>
    <tr><td>0</td><td>1</td></tr>
    <tr><td>1</td><td>0</td></tr>
</table>

Here's another example; imagine we have a categorical variable `height` and it has three values `tall`, `medium`, and `short`. We might observe many instances of height values as in the following table.

<table>
    <tr><th>height</th></tr>
    <tr><td>tall</td></tr>
    <tr><td>tall</td></tr>
    <tr><td>short</td></tr>
    <tr><td>medium</td></tr>
</table>

Applying OHE to this height variable, we will get the following data.

<table>
    <tr><th>height_tall</th><th>height_medium</th><th>height_short</th></tr>
    <tr><td>1</td><td>0</td><td>0</td></tr>
    <tr><td>1</td><td>0</td><td>0</td></tr>
    <tr><td>0</td><td>0</td><td>1</td></tr>
    <tr><td>0</td><td>1</td><td>0</td></tr>
</table>

Generally speaking, for a categorical variable with $k$ values, we will have $k$ new variables after applying OHE.
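
As a minimal sketch of the idea (not the scikit-learn `OneHotEncoder` linked above), the height example can be encoded with plain NumPy. The helper name `one_hot` is made up for this illustration, and its columns come out in sorted category order rather than in the order shown in the table.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) where matrix[i, j] is 1 if values[i] == categories[j]."""
    # np.unique sorts the categories and returns, for each value, its category index
    categories, codes = np.unique(values, return_inverse=True)
    # row i of the identity matrix is the one-hot vector for category i
    return categories, np.eye(len(categories), dtype=int)[codes]

height = np.array(['tall', 'tall', 'short', 'medium'])
categories, encoded = one_hot(height)

print(categories)  # categories in sorted order, one per output column
print(encoded)     # one row per observation, k = 3 columns
```

Each row contains exactly one 1, and a variable with $k$ values yields $k$ columns.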

Now, let us say we want to apply OLS regression on three independent variables 

* `age` (continuous), 
* `gender` (categorical), and
* `height` (categorical).

Our output variable $Y$ is a continuous dependent variable and could represent something like blood pressure. Then, our OLS regression model looks like the following. 

$Y = 5.0 + w_1 A + w_2 G + w_3 H$

where

* $A$ is age
* $G$ is gender
* $H$ is height

We know that we cannot use the categorical values directly in the OLS regression model, and so we OHE each of the categorical variables. After applying OHE to the categorical variables, our OLS regression model looks like the following.

$Y = 5.0 + w_1 A + w_2 G_{\mathrm{male}} + w_3 G_{\mathrm{female}} + w_4 H_{\mathrm{short}} + w_5 H_{\mathrm{medium}} + w_6 H_{\mathrm{tall}}$

Of course, we can try to fit this model, but modeling the data in this way results in a few problems.

* [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)
* [dummy variable trap](https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29)

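Multicollinearity occurs when an independent variable can be predicted linearly from the others, which makes the coefficients hard to estimate and to interpret. The dummy variable trap is the extreme case that arises when we keep every OHE variable in the model: the dummy variables of each categorical variable sum to 1 in every row (for example, $G_{\mathrm{male}} + G_{\mathrm{female}} = 1$), which duplicates the intercept column, so the system of equations (matrix algebra) used to estimate the coefficients does not have a unique solution.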
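
The trap can be checked numerically. The sketch below uses the four toy gender observations from the table above; the variable names and the weight vectors `w_a` and `w_b` are made up for illustration.

```python
import numpy as np

# Four toy observations: an intercept column plus BOTH gender dummies.
intercept = np.ones((4, 1))
gender = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)  # [male, female]

X_full = np.hstack([intercept, gender])            # all OHE variables kept
X_reduced = np.hstack([intercept, gender[:, :1]])  # female dummy removed

rank_full = np.linalg.matrix_rank(X_full)        # rank is less than the 3 columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # rank equals the 2 columns

# Shifting weight between the intercept and the two dummies changes nothing,
# because intercept - male - female is zero in every row.
w_a = np.array([5.0, 0.8, 0.0])
w_b = w_a + 2.0 * np.array([1.0, -1.0, -1.0])
same_predictions = np.allclose(X_full @ w_a, X_full @ w_b)

print(rank_full, rank_reduced, same_predictions)  # → 2 2 True
```

Two different weight vectors give identical predictions on the full matrix, which is what "no unique solution" means in practice.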

The way to deal with this situation is to remove one OHE variable from the set associated with each categorical variable. Which one do we remove? Here are two approaches.

* Remove the OHE variable whose associated value is the most frequent.
* Remove the OHE variable whose associated value is the most natural reference point.

In this toy example, we might remove $G_{\mathrm{female}}$ since it is the most frequently observed value, and we might remove $H_{\mathrm{short}}$ since it is a natural reference point. Our new OLS regression model will look like the following.

$Y = 5.0 + w_1 A + w_2 G_{\mathrm{male}} + w_3 H_{\mathrm{tall}} + w_4 H_{\mathrm{medium}}$

Note that $w_2, w_3, w_4$ are called __differential intercept coefficients__. Each one is measured relative to the category that was removed (the reference category) and is interpreted as follows.

* $w_2$ is the expected difference in $Y$ between males and females, holding everything else constant
* $w_3$ is the expected difference in $Y$ between tall and short, holding everything else constant
* $w_4$ is the expected difference in $Y$ between medium and short, holding everything else constant
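
To make the interpretation concrete, here is a small sketch on hypothetical noise-free data with only the gender variable, $Y = 5.0 + 0.8\,G_{\mathrm{male}}$, with female as the reference. The five observations are invented for illustration.

```python
import numpy as np

# Hypothetical noise-free data: Y = 5.0 + 0.8 * G_male, female is the reference.
g_male = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
y = 5.0 + 0.8 * g_male

# Design matrix with an intercept column and the single remaining dummy.
X = np.column_stack([np.ones_like(g_male), g_male])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Difference between the two group means, computed directly from the data.
group_gap = y[g_male == 1].mean() - y[g_male == 0].mean()

print(coef)       # intercept (the female mean) and the male-minus-female difference
print(group_gap)  # matches the second coefficient
```

The intercept recovers the mean of the reference group, and the dummy coefficient recovers the gap between the two groups.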

# Sample (synthetic) data



We will sample $A$ and $\epsilon$ (error) from normal distributions and $G$ and $H$ from multinomial distributions as follows.

* $A \sim \mathcal{N}(25, 1)$
* $G \sim \mathrm{Mult}([p_{\mathrm{male}}, p_{\mathrm{female}}]) = \mathrm{Mult}([0.2, 0.8])$
* $H \sim \mathrm{Mult}([p_{\mathrm{short}}, p_{\mathrm{medium}}, p_{\mathrm{tall}}]) = \mathrm{Mult}([0.2, 0.6, 0.2])$
* $\epsilon \sim \mathcal{N}(0, 1)$

Then, we will simulate $Y$ according to the following equation.

$Y = 5.0 + 3 A + 0.8 G_{\mathrm{male}} - 2 H_{\mathrm{tall}} + 3 H_{\mathrm{medium}} + \epsilon$

In [1]:
%matplotlib inline
from scipy.stats import multinomial, norm
import numpy as np

np.random.seed(37)
n = 2000
half = n // 2  # integer division so that half can be used as a slice index

intercept = np.ones(n).reshape(n, 1)
age = norm.rvs(25, 1, size=n).reshape(n, 1)
gender = multinomial.rvs(1, [0.2, 0.8], size=n)
height = multinomial.rvs(1, [0.2, 0.6, 0.2], size=n)
noise = norm.rvs(0, 1, size=n)

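# the full synthetic data
# X columns: 0 intercept, 1 age, 2-3 gender dummies, 4-6 height dummies
X = np.hstack([intercept, age, gender, height]) # all the variables
Z = np.delete(X, [2, 4], axis=1) # get rid of 2 dummy variables (first gender dummy, first height dummy)
# true weights, one per column of X; the two removed columns carry weight 0
w = np.array([5.0, 3.0, 0.0, 0.8, 0.0, -2.0, 3.0], dtype=float) # np.float was removed in NumPy 1.24
y = np.dot(X, w) + noise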

# split the data into half training and half testing
X_train = X[0:half, :]
Z_train = Z[0:half, :]
y_train = y[0:half]

X_test = X[half:n, :]
Z_test = Z[half:n, :]
y_test = y[half:n]

## Learn from the data with all variables

We will use OLS, Lasso, and Ridge regression to learn the weights on the data with all the variables (no dummy variable removed).

In [2]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score

def print_mat(m):
    rows = m.shape[0]
    cols = m.shape[1]
    for r in range(rows):
        print(','.join(['{:.2f}'.format(m[r, c]) for c in range(cols)]))
    print('{}x{}'.format(rows, cols))
    
def print_vec(hint, v):
    print(
        ','.join(['{:.2f}'.format(c) for c in v]) + 
        ', shape=' + '1x{}'.format(len(v)) +
        ', ' + hint
    ) 
    
lr = LinearRegression(fit_intercept=False)
la = Lasso(fit_intercept=False)
ri = Ridge(fit_intercept=False)

lr.fit(X_train, y_train)
la.fit(X_train, y_train)
ri.fit(X_train, y_train)

print('model coefficients')
print_vec('OLS', lr.coef_)
print_vec('LASSO', la.coef_)
print_vec('RIDGE', ri.coef_)

print('')
print('R^2')
print('{:.5f} OLS'.format(r2_score(y_test, lr.predict(X_test))))
print('{:.5f} LASSO'.format(r2_score(y_test, la.predict(X_test))))
print('{:.5f} RIDGE'.format(r2_score(y_test, ri.predict(X_test))))

model coefficients
3.14,3.00,1.20,1.94,0.71,-1.33,3.76, shape=1x7, OLS
0.00,3.20,-0.00,0.00,0.00,-0.00,0.00, shape=1x7, LASSO
2.34,3.06,0.80,1.54,0.44,-1.58,3.48, shape=1x7, RIDGE

R^2
0.92537 OLS
0.64030 LASSO
0.92519 RIDGE


## Learn from the data with 2 dummy variables removed

We will use OLS, Lasso, and Ridge regression to learn the weights on the data with 2 dummy variables removed.

In [3]:
lr = LinearRegression(fit_intercept=False)
la = Lasso(fit_intercept=False)
ri = Ridge(fit_intercept=False)

lr.fit(Z_train, y_train)
la.fit(Z_train, y_train)
ri.fit(Z_train, y_train)

print('model coefficients')
print_vec('OLS', lr.coef_)
print_vec('LASSO', la.coef_)
print_vec('RIDGE', ri.coef_)

print('')
print('R^2')
print('{:.5f} OLS'.format(r2_score(y_test, lr.predict(Z_test))))
print('{:.5f} LASSO'.format(r2_score(y_test, la.predict(Z_test))))
print('{:.5f} RIDGE'.format(r2_score(y_test, ri.predict(Z_test))))

model coefficients
5.04,3.00,0.74,-2.04,3.05, shape=1x5, OLS
0.00,3.20,0.00,-0.00,0.00, shape=1x5, LASSO
3.09,3.08,0.74,-2.02,3.05, shape=1x5, RIDGE

R^2
0.92537 OLS
0.64030 LASSO
0.92499 RIDGE


## Comparing models without and with removing dummy variables

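Note how LASSO regression's $R^2$ values are dismal both without and with removing the dummy variables. With scikit-learn's default regularization strength (`alpha=1.0`), LASSO shrinks every coefficient except the one for age to zero, so it behaves as if only age matters in predicting the outcome.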
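
The effect of the penalty strength can be sketched on a smaller, freshly simulated dataset. This is an illustration only, not a rerun of the cells above: the data, the seed, and the weaker value `alpha=0.01` are arbitrary choices, and `fit_intercept` is left at its default here instead of using an explicit column of ones.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(25, 1, size=n)
male = rng.binomial(1, 0.2, size=n)  # one dummy variable, female is the reference
X = np.column_stack([age, male])
y = 5.0 + 3.0 * age + 0.8 * male + rng.normal(0, 1, size=n)

strong = Lasso(alpha=1.0).fit(X, y)  # scikit-learn's default penalty strength
weak = Lasso(alpha=0.01).fit(X, y)   # hypothetical, much weaker penalty

print(strong.coef_)  # coefficients for [age, male] under the default penalty
print(weak.coef_)    # coefficients for [age, male] under the weaker penalty
```

Under the default penalty the dummy coefficient is driven to zero, while a weaker penalty lets it stay in the model, which suggests the poor LASSO fit above is a consequence of the untuned `alpha` rather than of the dummy encoding.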

Note how the OLS and RIDGE regressions' $R^2$ values are really good both without and with removing the dummy variables. After removing the dummy variables, the OLS and RIDGE coefficients are nearly identical except for the intercept: OLS learns 5.04, close to the true value of 5.0, while RIDGE learns 3.09.
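
One plausible reading of the RIDGE intercept is that, with `fit_intercept=False`, the column of ones is penalized like any other feature. The sketch below illustrates this with the closed-form ridge solution $(X^T X + \alpha I)^{-1} X^T y$ on freshly simulated data containing only an intercept column and age; the helper `ridge_closed_form`, the data, and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(25, 1, size=n)
X = np.column_stack([np.ones(n), age])  # intercept column, age
y = 5.0 + 3.0 * age + rng.normal(0, 1, size=n)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; every column, including the constant one, is penalized."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge_closed_form(X, y, alpha=0.0)    # alpha = 0 reduces to OLS
w_ridge = ridge_closed_form(X, y, alpha=1.0)  # same value as Ridge's default alpha

print(w_ols)    # [intercept, age] without a penalty
print(w_ridge)  # [intercept, age] with the penalty applied to both
```

The penalized intercept is pulled toward zero and the age coefficient shifts slightly to compensate, the same pattern seen in the RIDGE output above.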

Since the OLS and RIDGE regression models learned after removing the dummy variables predict unseen data essentially as well as the corresponding models that keep all the variables, we should favor them (parsimony and [Occam's razor](https://simple.wikipedia.org/wiki/Occam%27s_razor)).

# Take a Look!

Take a look at [Dr. Vladimir Vapnik](https://en.wikipedia.org/wiki/Vladimir_Vapnik).