#### Introduction to Statistical Learning, Lab 3.4

# Qualitative Predictors

A qualitative (categorical) predictor has to be encoded as dummy variables, one 0/1 indicator for each class other than a baseline, before it can enter a regression fit. Fortunately, the formula interface of the `statsmodels` library does this for us.


  - [statsmodels documentation](https://www.statsmodels.org/stable/)
  - [statsmodels formula interface](https://www.statsmodels.org/stable/example_formulas.html)
  - [the formula mini language](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language)
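To make the idea of dummy variables concrete, here is a minimal sketch using `pandas.get_dummies` on made-up data. It only illustrates the resulting 0/1 columns and is not how `statsmodels` builds its design matrix internally. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the three shelf-location levels.
toy = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium", "Good"]})

# One indicator column per level. drop_first=True drops the first level
# in sorted order ("Bad"), which then acts as the baseline.
dummies = pd.get_dummies(
    toy["ShelveLoc"], prefix="ShelveLoc", drop_first=True
).astype(int)

print(dummies)
```

A row with zeros in both columns belongs to the baseline level `Bad`.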

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics as smg
from islpy import datasets, utils, lmplots
import plots
sns.set()
%matplotlib inline

#### Data Set

We use the `Carseats` data set to demonstrate the usage of qualitative variables.

In [2]:
carseats = datasets.Carseats()
carseats.head()

||Sales|CompPrice|Income|Advertising|Population|Price|ShelveLoc|Age|Education|Urban|US|
|:---|---:|---:|---:|---:|---:|---:|:---|---:|---:|:---|:---|
|__0__|9.5|138|73|11|276|120|Bad|42|17|Yes|Yes|
|__1__|11.22|111|48|16|260|83|Good|65|10|Yes|Yes|
|__2__|10.06|113|35|10|269|80|Medium|59|12|Yes|Yes|
|__3__|7.4|117|100|4|466|97|Medium|55|14|Yes|Yes|
|__4__|4.15|141|64|3|340|128|Bad|38|13|Yes|No|


#### Model Specification & Fit

We would like to predict `Sales` based on all predictors plus two interaction terms, `Income:Advertising` and `Price:Age`.

In [None]:
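# Main effects: every column except the response
formula = 'Sales~' + '+'.join(carseats.columns.drop('Sales'))
# Interaction terms
formula += '+Income:Advertising'
formula += '+Price:Age'
# Fit by ordinary least squares; categorical columns are dummy-encoded automatically
lm = smf.ols(formula=formula, data=carseats).fit()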
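It can help to look at the formula string that this cell assembles. The sketch below rebuilds it from a plain list of the predictor names shown in the `head()` output above, so it runs without the data set. The `predictors` list is an assumption based on that output.

```python
# Predictor names as they appear in the head() output (response excluded).
predictors = ["CompPrice", "Income", "Advertising", "Population", "Price",
              "ShelveLoc", "Age", "Education", "Urban", "US"]

# Main effects joined with "+", then the two interaction terms.
formula = "Sales ~ " + " + ".join(predictors)
formula += " + Income:Advertising + Price:Age"

print(formula)
```

In the formula language `a:b` adds only the interaction term, whereas `a*b` is shorthand for `a + b + a:b`.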

#### Fit Result Summary

We can get a comprehensive summary using the `summary()` method. The coefficient table now has one row for every term in the model: the intercept, each numeric predictor, each dummy variable, and the two interaction terms.

In [None]:
lm.summary()
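The numbers in the summary table are also available as attributes of the fit result, which is convenient for further processing. The following sketch fits a small model on made-up data (the `toy` frame and all its values are hypothetical, not taken from `Carseats`) and reads out the estimates.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: one categorical and one numeric predictor.
toy = pd.DataFrame({
    "Sales": [4.0, 6.5, 10.0, 12.5, 7.0, 9.5],
    "Price": [120, 100, 110, 90, 130, 95],
    "ShelveLoc": ["Bad", "Bad", "Good", "Good", "Medium", "Medium"],
})

fit = smf.ols("Sales ~ Price + ShelveLoc", data=toy).fit()

print(fit.params)      # coefficient estimates, indexed by term name
print(fit.bse)         # standard errors
print(fit.pvalues)     # p-values of the t-tests
print(fit.conf_int())  # 95% confidence intervals by default
```

The dummy variables appear under names such as `ShelveLoc[T.Good]`, so a single estimate can be selected with `fit.params["ShelveLoc[T.Good]"]`.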

#### The Dummy Encoding

Note that the dummy encoding for the categories was done automatically!

The large, positive coefficients of `ShelveLoc[T.Good]` and `ShelveLoc[T.Medium]` indicate that, with the other predictors held fixed, a good or medium shelf location is associated with higher `Sales` than the baseline location `Bad`.
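Each dummy coefficient is measured relative to the baseline level. A minimal sketch with made-up numbers shows this for a model that contains only the categorical predictor: the intercept equals the mean of the baseline group, and each dummy coefficient equals the difference between its group mean and the baseline mean. The arrays `shelf` and `sales` are hypothetical.

```python
import numpy as np

# Hypothetical sales figures for three shelf locations.
shelf = np.array(["Bad", "Bad", "Good", "Good", "Medium", "Medium"])
sales = np.array([4.0, 6.0, 10.0, 12.0, 7.0, 9.0])

# Design matrix: intercept plus one dummy for each non-baseline level.
X = np.column_stack([
    np.ones(len(shelf)),
    (shelf == "Good").astype(float),
    (shelf == "Medium").astype(float),
])

# Ordinary least squares via numpy.
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, good, medium = beta

print(intercept)  # mean of the "Bad" group
print(good)       # mean("Good") - mean("Bad")
print(medium)     # mean("Medium") - mean("Bad")
```

In the full `Carseats` model the interpretation is the same, except that the differences are adjusted for the other predictors in the formula.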

The `ShelveLoc` categories were encoded like this, with `Bad` as the baseline level for which both dummy variables are zero:

||ShelveLoc[T.Good]|ShelveLoc[T.Medium]|
|:---|---:|---:|
|__Bad__|0|0|
|__Good__|1|0|
|__Medium__|0|1|

The statsmodels documentation on [contrast coding](https://www.statsmodels.org/dev/contrasts.html) describes what happens behind the scenes.
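The encoding table can be reproduced by building the design matrix directly. This sketch assumes that `patsy`, the formula library linked at the top of this lab, is installed. The one-row-per-level `levels` frame is hypothetical. The second call shows how a different baseline could be chosen with `Treatment(reference=...)`.

```python
import pandas as pd
from patsy import dmatrix

# One row per level, so that the matrix mirrors the encoding table.
levels = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium"]})

# Default treatment coding: the first level in sorted order is the baseline.
default = dmatrix("ShelveLoc", levels, return_type="dataframe")
print(default)

# Same data with "Medium" as the baseline instead of "Bad".
relevel = dmatrix(
    "C(ShelveLoc, Treatment(reference='Medium'))",
    levels,
    return_type="dataframe",
)
print(list(relevel.columns))
```

The same `C(..., Treatment(reference=...))` term can be used inside a formula passed to `smf.ols`.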