<!-- dummy-coding: -->
# Dummy coding and the general linear model

Before we start using GLMs, we should address a problem that many of you will confront when dealing with categorical predictor variables. **It is important to remember that you cannot put categorical variables straight into a regression model without some form of transformation.** Two techniques are typically used:

* one-hot encoding the categorical variables, or
* converting them to dummy variables.

Imagine we are trying to build a model that predicts ice-cream sales based on the purchaser's age category. We have 3 categories: Child, Young adult and Adult. If we were to code them, we might use the numbers $\{1,2,3\}$. This does not mean that people in category $1$ are 3 times younger than those in category $3$. The codes are not really numbers but category labels with no real scale, so feeding them to the model as numbers causes a problem.

To use one-hot encoded variables, we convert the category variable into 3 indicator variables: $\hat{X_1}$ equals $1$ where the person is a child and $0$ elsewhere, $\hat{X_2}$ equals $1$ where the person is a young adult and $0$ elsewhere, and $\hat{X_3}$ equals $1$ where the person is an adult and $0$ elsewhere.
Now suppose we had 12 cases, 4 in each category. If we were to one-hot encode the age variable, we would create a matrix such as the one outlined below:


$$\hat{X} =
 \begin{pmatrix}
  1  & 0 & 0 \\
  1  & 0 & 0 \\
  1  & 0 & 0 \\
  1  & 0 & 0 \\
  0  & 1 & 0 \\
  0  & 1 & 0 \\
  0  & 1 & 0 \\
  0  & 1 & 0 \\
  0  & 0 & 1 \\
  0  & 0 & 1 \\
  0  & 0 & 1 \\
  0  & 0 & 1 \\
 \end{pmatrix}$$

Notice how every entry in this matrix is a 1 or a 0, and each row has only a single 1, signifying the age category of the case.
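As a rough sketch of how such a matrix could be built in code, the snippet below constructs it with plain NumPy. The variable names (`categories`, `ages`, `X_onehot`) are my own choices for illustration, and the 12 cases are assumed to arrive sorted by category, as in the matrix above.

```python
import numpy as np

# Assumed layout: 12 purchasers, 4 per age category, sorted by category.
categories = ["Child", "Young adult", "Adult"]
ages = np.repeat(categories, 4)

# One column per category: 1 where the row belongs to that category, else 0.
X_onehot = np.array([[int(a == c) for c in categories] for a in ages])

print(X_onehot)
```

Each row sums to 1, which is exactly the property that matters later when deciding whether to include an intercept.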

Now we would use this matrix in our model to transform our original model:
$$Y_i=\alpha + \beta_1. X_1 + \epsilon_i$$

to

$$Y_i=\hat{\beta_1}. \hat{X_1} +\hat{\beta_2}.\hat{X_2}+\hat{\beta_3}. \hat{X_3}+ \epsilon_i ~~~~ (1)$$

<br>

If we were to use a dummy variable approach, our $\hat{X}$ matrix would be as follows:

$$\hat{X} =
 \begin{pmatrix}
  1  & 0  \\
  1  & 0 \\
  1  & 0 \\
  1  & 0 \\
  0  & 1 \\
  0  & 1 \\
  0  & 1 \\
  0  & 1 \\
  0  & 0 \\
  0  & 0 \\
  0  & 0 \\
  0  & 0 \\
 \end{pmatrix}$$

The model would look like that shown in equation 2:

 $$Y_i=\alpha +\hat{\beta_1}.\hat{X_1}+\hat{\beta_2}. \hat{X_2}+ \epsilon_i ~~~~(2)$$
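The dummy-variable matrix can be sketched in the same way by leaving out the column for the reference category. This is only an illustration: the names `reference`, `kept`, `X_dummy` and `design` are hypothetical, and Adult is taken as the reference category because its rows are the all-zero rows in the matrix above.

```python
import numpy as np

categories = ["Child", "Young adult", "Adult"]
ages = np.repeat(categories, 4)

# Drop the reference category; its rows become all zeros.
reference = "Adult"
kept = [c for c in categories if c != reference]
X_dummy = np.array([[int(a == c) for c in kept] for a in ages])

# Equation (2) also has an intercept, so prepend a column of ones.
design = np.column_stack([np.ones(len(ages)), X_dummy])

print(design)
```

The resulting design matrix has 3 columns (intercept, Child, Young adult) and full column rank, so the coefficients in equation (2) are uniquely determined.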

The difference between the 2 equations is subtle but should not be ignored. In equation (1) we assume there is no intercept, so the interpretation is that you have a variable intercept, that is, a different intercept for each age category. In this case the $\hat{\beta}$'s are the average sales for each group. If you use the dummy variable approach, the $\alpha$ value is the average sales for Adults (the reference category), and the remaining $\hat{\beta}$ parameters are the differences in average sales between each of the other age categories and Adults. They are not the effect of each age category on sales.

The method you use depends on the effect you want to study, but do not include an $\alpha$ with a one-hot encoded variable, as you will induce perfect multicollinearity: the three indicator columns add up to the intercept column. If you are using scikit-learn's linear regression model or statsmodels, you should specify whether you want an intercept or not. For one-hot encoding, set `fit_intercept=False` in sklearn. For dummy encoding, `fit_intercept` should be set to `True`. As far as I can see, there is no warning about this on the scikit-learn website.
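To make the two interpretations concrete, here is a small sketch that fits both models by ordinary least squares with NumPy. The sales figures are invented purely for illustration, and `np.linalg.lstsq` stands in for whatever regression routine you actually use.

```python
import numpy as np

# Hypothetical sales figures, 4 cases per age category (invented for illustration).
sales = np.array([10., 12., 11., 13.,   # Child
                  20., 22., 21., 23.,   # Young adult
                  15., 17., 16., 18.])  # Adult

# One-hot design matrix: columns are Child, Young adult, Adult.
X_onehot = np.kron(np.eye(3), np.ones((4, 1)))
intercept = np.ones((12, 1))

# Equation (1): one-hot columns, no intercept.
beta_onehot, *_ = np.linalg.lstsq(X_onehot, sales, rcond=None)

# Equation (2): intercept plus Child and Young adult dummies (Adult is the reference).
X_dummy = np.hstack([intercept, X_onehot[:, :2]])
beta_dummy, *_ = np.linalg.lstsq(X_dummy, sales, rcond=None)

# One-hot plus intercept: 4 columns but rank 3, so the coefficients are not unique.
rank_with_intercept = np.linalg.matrix_rank(np.hstack([intercept, X_onehot]))

print(beta_onehot)          # one coefficient per group: the group means
print(beta_dummy)           # Adult mean, then each group's difference from Adult
print(rank_with_intercept)
```

With these made-up numbers the one-hot coefficients equal the three group means, while the dummy-coded fit returns the Adult mean as the intercept followed by the Child and Young adult differences from it. In scikit-learn the same choice is controlled by the `fit_intercept` parameter of `LinearRegression`.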

As usual leave your thoughts on the comments board.



In a regression model, the intercept $\beta_0$ essentially represents the mean of the baseline group:

$$Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\epsilon$$

If we use one-hot encoding and keep the intercept, the equation becomes:

$$Y=\beta_0+\beta_1(\text{Child})+\beta_2(\text{Young Adult})+\beta_3(\text{Adult})+\epsilon$$

But we already know that:

$$\text{Child}+\text{Young Adult}+\text{Adult}=1$$

This means that the intercept $\beta_0$ and one of the variables (for example, Adult) are completely redundant!

We can rewrite the formula:

$$Y=\beta_0+\beta_1(\text{Child})+\beta_2(\text{Young Adult})+(\beta_3\cdot\text{Adult})+\epsilon$$

Since Child + Young Adult + Adult = 1, we can substitute Adult = 1 - Child - Young Adult into the equation above:

$$Y=\beta_0+\beta_1(\text{Child})+\beta_2(\text{Young Adult})+\beta_3(1-\text{Child}-\text{Young Adult})+\epsilon$$

Expanding:

$$Y=\beta_0+\beta_3+\beta_1(\text{Child})+\beta_2(\text{Young Adult})-\beta_3(\text{Child})-\beta_3(\text{Young Adult})+\epsilon$$

Collecting terms:

$$Y=(\beta_0+\beta_3)+(\beta_1-\beta_3)\cdot\text{Child}+(\beta_2-\beta_3)\cdot\text{Young Adult}+\epsilon$$

At this point we find:

* The intercept ($\beta_0$) and Adult ($\beta_3$) are completely redundant!
* The model's coefficients can no longer be uniquely determined: only the combinations $\beta_0+\beta_3$, $\beta_1-\beta_3$ and $\beta_2-\beta_3$ are pinned down by the data.

This is why "multicollinearity" occurs.
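The non-uniqueness can also be checked numerically. The sketch below is only an illustration with arbitrary numbers of my own choosing: it takes one coefficient vector for the one-hot-plus-intercept model, shifts $\beta_0$ up by a constant `c` and $\beta_1,\beta_2,\beta_3$ down by the same constant, and confirms that the predictions do not change.

```python
import numpy as np

# Intercept column followed by the one-hot columns Child, Young adult, Adult.
X_onehot = np.kron(np.eye(3), np.ones((4, 1)))
design = np.hstack([np.ones((12, 1)), X_onehot])

# Two different coefficient vectors (beta_0, beta_1, beta_2, beta_3); values are arbitrary.
beta_a = np.array([0.0, 11.5, 21.5, 16.5])
c = 7.0
beta_b = beta_a + np.array([c, -c, -c, -c])

# Both vectors give identical predictions for every case.
same_predictions = np.allclose(design @ beta_a, design @ beta_b)
print(same_predictions)
```

Because any value of `c` works, there are infinitely many coefficient vectors that fit the data equally well, which is what perfect multicollinearity means in practice.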









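---

## Using a dummy variable

If we use dummy encoding, the model takes the following form:

$$Y_i=\alpha+\beta_1 X_1+\beta_2 X_2+\epsilon_i$$

where:

* $\alpha$ represents the mean of the baseline category (Adult).
* $\beta_1$ represents the change for "Child" relative to "Adult".
* $\beta_2$ represents the change for "Young Adult" relative to "Adult".

Key points:

* The effect of the baseline (reference) category is absorbed into the intercept $\alpha$.
* The intercept must be included; otherwise the model cannot be specified correctly.
* This approach suits cases where the model needs a baseline category.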