## Categorical Variables

Until now, we have only worked with variables (both dependent and independent) that are *continuous*: numeric values that can take any value within a range. However, many of the variables that you will encounter in real-life situations are *categorical*, meaning that they can only take values from a fixed, discrete set. For example, the variable `color` in the `car.prices.df` data frame below is a *categorical variable*:

In [18]:
car.prices.df <- read.csv("data/car-prices.csv")
head(car.prices.df)

price.thousand.eur,top.speed.kph,color
<int>,<int>,<fct>
39,250,red
28,190,black
35,230,blue
29,160,blue
38,240,red
37,280,black
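
The `<fct>` type shown above means that `color` was read in as a factor. Since R 4.0.0, `read.csv` defaults to `stringsAsFactors = FALSE`, so on a recent R version the same call would likely produce a character column instead. The sketch below uses a hypothetical inline CSV (not the course file) and requests factor conversion explicitly:

```r
# Hypothetical inline CSV standing in for the first rows of data/car-prices.csv.
csv.text <- "price.thousand.eur,top.speed.kph,color
39,250,red
28,190,black
35,230,blue"

# Ask for factor conversion explicitly so the result does not depend on the R version.
toy.df <- read.csv(text = csv.text, stringsAsFactors = TRUE)

class(toy.df$color)   # "factor"
levels(toy.df$color)  # "black" "blue" "red" (levels are sorted alphabetically)
```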


We can see that there are three car colors in this dataset:

In [2]:
unique(car.prices.df$color)

How can we fit a line to this data? We'll need to convert our color values `red`, `black`, and `blue` into numbers. We can do this using the `caret` package:

In [3]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘caret’

The following object is masked from ‘package:httr’:

    progress



Specifically, we'll want to use the `dummyVars` function:

In [4]:
dummies <- dummyVars( ~ ., data=car.prices.df)
car.prices.encoded.df <- as.data.frame(predict(dummies, newdata=car.prices.df))
head(car.prices.encoded.df)

price.thousand.eur,top.speed.kph,color.black,color.blue,color.red
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
39,250,0,0,1
28,190,1,0,0
35,230,0,1,0
29,160,0,1,0
38,240,0,0,1
37,280,1,0,0


Calling `dummyVars` has *dummy encoded* our categorical variable `color`: each row now has a `1` in the column corresponding to its color and a `0` in the other color columns.
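
To make the encoding rule concrete, here is a minimal base-R sketch that builds the same kind of indicator columns by hand. It uses a hypothetical toy vector and is only an illustration of the idea, not a description of how `caret` implements `dummyVars` internally:

```r
# Hypothetical toy factor with the same three colors as the dataset.
color <- factor(c("red", "black", "blue", "blue", "red", "black"))

# One indicator column per level: 1 where the row has that color, 0 otherwise.
encoded <- sapply(levels(color), function(lvl) as.integer(color == lvl))
colnames(encoded) <- paste0("color.", levels(color))
encoded

# Every row contains exactly one 1, because each car has exactly one color.
rowSums(encoded)
```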

Right now, our dataset has a critical problem: it suffers from the *multicollinearity* issue that we learned about in the previous lesson. We can see this when we try to build a model using all of our columns - notice that we get `NA` values in the last row of the coefficient table (`color.red`):

In [5]:
model.fail <- lm(price.thousand.eur ~ ., data=car.prices.encoded.df)
summary(model.fail)


Call:
lm(formula = price.thousand.eur ~ ., data = car.prices.encoded.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.13913 -0.77951 -0.06656  0.85740  2.85392 

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12.848963   0.397209  32.348  < 2e-16 ***
top.speed.kph  0.099653   0.001681  59.290  < 2e-16 ***
color.black   -2.712952   0.153577 -17.665  < 2e-16 ***
color.blue    -0.619549   0.152927  -4.051 6.51e-05 ***
color.red            NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.081 on 296 degrees of freedom
Multiple R-squared:  0.9335,	Adjusted R-squared:  0.9328 
F-statistic:  1384 on 3 and 296 DF,  p-value: < 2.2e-16


These `NA` values appear because R cannot estimate a separate coefficient for every column when the data is perfectly multicollinear. Our three `color` columns contain redundant information: every car has exactly one color, so given any two of these columns we automatically know all values in the third. (In the output above, R dropped `color.red` for us, which effectively made red the reference level.) We therefore need to choose one of these color columns as our *reference level* and remove it from the dataset - let's choose `color.black`:

In [6]:
car.prices.encoded.df$color.black <- NULL
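
The redundancy can also be checked numerically. The sketch below uses hypothetical toy data (six cars, not the course dataset) to show that the three indicator columns add up to the intercept column, so the full design matrix has fewer independent columns than it has columns:

```r
# Hypothetical toy data: six cars with a color and a top speed.
color <- factor(c("red", "black", "blue", "blue", "red", "black"))
top.speed <- c(250, 190, 230, 160, 240, 280)

# Full indicator matrix (one column per color), built without an intercept.
indicators <- model.matrix(~ color - 1)

# Design matrix with an intercept, the speed, and all three indicators.
X.full <- cbind(intercept = 1, top.speed = top.speed, indicators)

# The three indicators sum to the intercept column in every row.
all(rowSums(indicators) == 1)

# Five columns, but only four of them are linearly independent.
ncol(X.full)
qr(X.full)$rank

# Dropping one indicator (here black, the reference level) restores full rank.
X.reduced <- X.full[, colnames(X.full) != "colorblack"]
qr(X.reduced)$rank == ncol(X.reduced)
```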

Now that we have removed the multicollinearity in our data, R gives us a sensible result:

In [7]:
model.success <- lm(price.thousand.eur ~ ., data=car.prices.encoded.df)
summary(model.success)


Call:
lm(formula = price.thousand.eur ~ ., data = car.prices.encoded.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.13913 -0.77951 -0.06656  0.85740  2.85392 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.136011   0.383481   26.43   <2e-16 ***
top.speed.kph  0.099653   0.001681   59.29   <2e-16 ***
color.blue     2.093403   0.153796   13.61   <2e-16 ***
color.red      2.712952   0.153577   17.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.081 on 296 degrees of freedom
Multiple R-squared:  0.9335,	Adjusted R-squared:  0.9328 
F-statistic:  1384 on 3 and 296 DF,  p-value: < 2.2e-16


Since the only possible values of `color.blue` and `color.red` are zero and one, these coefficients have a simple interpretation. Both colors increase the predicted price relative to the reference level (black) for a car with the same top speed: making the car blue increases the price by a fixed value of `coefficients(model.success)['color.blue']` (about 2.09 thousand EUR), while making it red increases the price by `coefficients(model.success)['color.red']` (about 2.71 thousand EUR).
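
As a worked illustration of this interpretation, the sketch below hard-codes the coefficients printed in the summary above and compares predictions for cars that differ only in color. The helper `predicted.price` and the example speed are hypothetical and exist only for this illustration:

```r
# Coefficients copied (as printed) from summary(model.success) above.
b.intercept <- 10.136011
b.speed     <- 0.099653
b.blue      <- 2.093403
b.red       <- 2.712952

# Hypothetical helper: predicted price (thousand EUR) for a given speed and color.
# The logical comparisons act as the 0/1 dummy variables.
predicted.price <- function(top.speed.kph, color) {
  b.intercept + b.speed * top.speed.kph +
    b.blue * (color == "blue") + b.red * (color == "red")
}

speed <- 250  # arbitrary example speed

# Differences relative to a black car of the same speed
# (equal to b.blue and b.red, up to floating-point rounding).
predicted.price(speed, "blue") - predicted.price(speed, "black")
predicted.price(speed, "red")  - predicted.price(speed, "black")
```

The differences do not depend on the chosen speed, which is what "a fixed value" means here.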

## Automatic Encoding

It turns out that we don't need to perform the above dummy encoding step manually - we only did so in the section above to help you understand the interpretation of the parameters associated with categorical variables in the model `summary`. If we pass our categorical variables to R directly, it will encode them automatically:

In [8]:
summary(lm(price.thousand.eur ~ ., data=car.prices.df))


Call:
lm(formula = price.thousand.eur ~ ., data = car.prices.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.13913 -0.77951 -0.06656  0.85740  2.85392 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.136011   0.383481   26.43   <2e-16 ***
top.speed.kph  0.099653   0.001681   59.29   <2e-16 ***
colorblue      2.093403   0.153796   13.61   <2e-16 ***
colorred       2.712952   0.153577   17.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.081 on 296 degrees of freedom
Multiple R-squared:  0.9335,	Adjusted R-squared:  0.9328 
F-statistic:  1384 on 3 and 296 DF,  p-value: < 2.2e-16


Notice that `black` is automatically chosen as the reference level, because it is the first level of the corresponding factor:

In [9]:
levels(car.prices.df$color)
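
To see what the automatic encoding looks like, the following sketch inspects a hypothetical toy factor. It assumes R's default contrast settings (treatment contrasts for unordered factors), under which the first level gets no column of its own:

```r
# Hypothetical toy factor with the same three colors as the dataset.
color <- factor(c("red", "black", "blue", "blue", "red", "black"))

levels(color)     # the first level is used as the reference
contrasts(color)  # rows are levels; columns are the indicator columns R creates

# The design matrix R builds from the factor: an intercept plus two indicators.
model.matrix(~ color)
```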

<span style="color:blue;font-weight:bold">Exercise</span>: Change the reference level of the `color` variable in our model to `blue` by changing the level ordering of `car.prices.df$color` and re-fitting a new model. Store your new model in the variable `model.new.reference`:

In [29]:
# Make blue the first level so that it becomes the reference level
head(car.prices.df)
car.prices.df$color <- factor(car.prices.df$color, levels = c("blue", "black", "red"))
str(car.prices.df)
head(car.prices.df)

# Re-fit the model using the re-ordered factor
model.new.reference <- lm(price.thousand.eur ~ ., data = car.prices.df)

summary(model.new.reference)

price.thousand.eur,top.speed.kph,color
<int>,<int>,<fct>
39,250,red
28,190,black
35,230,blue
29,160,blue
38,240,red
37,280,black


'data.frame':	300 obs. of  3 variables:
 $ price.thousand.eur: int  39 28 35 29 38 37 39 39 30 39 ...
 $ top.speed.kph     : int  250 190 230 160 240 280 240 270 190 290 ...
 $ color             : Factor w/ 3 levels "blue","black",..: 3 2 1 1 3 2 1 3 2 2 ...


price.thousand.eur,top.speed.kph,color
<int>,<int>,<fct>
39,250,red
28,190,black
35,230,blue
29,160,blue
38,240,red
37,280,black



Call:
lm(formula = price.thousand.eur ~ ., data = car.prices.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.13913 -0.77951 -0.06656  0.85740  2.85392 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12.229414   0.399312  30.626  < 2e-16 ***
top.speed.kph  0.099653   0.001681  59.290  < 2e-16 ***
colorblack    -2.093403   0.153796 -13.612  < 2e-16 ***
colorred       0.619549   0.152927   4.051 6.51e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.081 on 296 degrees of freedom
Multiple R-squared:  0.9335,	Adjusted R-squared:  0.9328 
F-statistic:  1384 on 3 and 296 DF,  p-value: < 2.2e-16
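
Changing the reference level only re-expresses the same fit, which is why the residuals and R-squared are unchanged. As a sketch, the block below shows `relevel()` as an alternative way to move a level to the front (on a hypothetical toy factor) and derives the new coefficients from the black-reference values printed earlier in the lesson:

```r
# Hypothetical toy factor with the same three colors as the dataset.
color <- factor(c("red", "black", "blue", "blue", "red", "black"))

# relevel() moves the chosen level to the front and keeps the others in order.
color.blue.ref <- relevel(color, ref = "blue")
levels(color.blue.ref)  # "blue" "black" "red"

# Coefficients copied from the black-reference summary earlier in the lesson.
b.intercept <- 10.136011
b.blue      <- 2.093403
b.red       <- 2.712952

# Coefficients implied once blue is the reference level.
expected <- c(intercept  = b.intercept + b.blue,  # 12.229414
              colorblack = -b.blue,               # -2.093403
              colorred   = b.red - b.blue)        #  0.619549
expected
```

These values agree with the `(Intercept)`, `colorblack`, and `colorred` estimates in the summary above.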


In [30]:
check.variable.definition("model.new.reference")
assert.true(round(coefficients(model.new.reference)['colorblack'], 2) == -2.09, "Model incorrect. Did you set the reference level of <code>car.prices.df$color</code> to <code>blue</code>?")
success()