$$\text{Linear Regression}$$

#### What is Linear Regression

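Linear regression is a parametric model that predicts a numerical response variable from one or more predictor variables. It is parametric because it assumes a linear relationship between the response and the predictors. Linear regression usually relies on the *least squares* method for parameter estimation, that is, it finds the parameters that minimize the residual sum of squares (*Residual Sum of Squares*, RSS). The linear regression model is defined as: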

$$Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon$$
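As a minimal sketch of least squares, the special case with a single predictor has a closed-form solution: the slope is the covariance of the predictor and the response divided by the variance of the predictor. The function name `fit_simple_ols` and the toy data are assumptions made for illustration only.

```python
def fit_simple_ols(x, y):
    """Return (beta0, beta1) minimizing the residual sum of squares of y ~ beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Toy data lying exactly on the line y = 1 + 2x (made up for this example).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1)
```

With several predictors the same idea applies, but the estimates are usually obtained with a linear algebra routine rather than by hand.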

#### How Does Linear Regression Work?

To illustrate the linear regression model, we first define the loss function (*Loss Function*) of a bivariate linear regression:

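$$L(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)}) - y^{(i)})^2$$

- $h_{\theta}(x^{(i)})$ is our assumed functional form (the hypothesis), evaluated at the $i$-th observation
- $y^{(i)}$ is the actual value of the response variable for the $i$-th observation
- $m$ is the number of training observations

$$\because h_{\theta}(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$$

$$\therefore L(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$

- $\theta_0 + \theta_1x^{(i)} - y^{(i)}$ is the difference between the value predicted by the linear regression and the actual value, also known as the residual (*Residual*)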

Our goal is to minimize this loss function, which is exactly the least squares method.

$$\min_{\theta_0,\theta_1} L(\theta_0,\theta_1) = \min_{\theta_0,\theta_1} \frac{1}{2m}\sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$
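The loss can be evaluated directly for any candidate pair of parameters. The sketch below is illustrative only; the helper names `predict` and `loss` and the three data points are assumptions, not part of the original notes.

```python
def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def loss(theta0, theta1, xs, ys):
    """Half the mean squared residual, matching the 1/(2m) factor in the formula."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Made-up points lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

loss_perfect = loss(0.0, 2.0, xs, ys)  # the line passes through every point
loss_flat = loss(0.0, 0.0, xs, ys)     # predicts 0 everywhere
print(loss_perfect, loss_flat)
```

A line that passes through every point has zero loss, while a worse candidate such as the flat line has a strictly larger one.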

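Another metric for judging how well the model fits is $R^2$, which measures the proportion of the variance of the response variable that is explained by the predictors. The higher, the better.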

$$R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}\\
TSS = \sum_{i=1}^m(y_i - \bar{y})^2\\
RSS = \sum_{i=1}^m(y_i - \hat{y}_i)^2
$$
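A small sketch of the computation, using $R^2 = 1 - RSS/TSS$. The function name `r_squared` and the sample values are hypothetical and only serve to show the two extreme cases.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS for observed values and model predictions."""
    y_bar = sum(y_true) / len(y_true)
    tss = sum((y - y_bar) ** 2 for y in y_true)               # total sum of squares
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    return 1.0 - rss / tss


y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r_squared(y_true, y_true)       # predictions equal the observations
r2_mean = r_squared(y_true, [2.5] * 4)       # always predicting the mean of y
print(r2_perfect, r2_mean)
```

Perfect predictions give an $R^2$ of 1, and a model that only predicts the mean of the response gives 0.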

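If we interpret the loss function graphically, the training set is a collection of points scattered in the plane formed by the X and Y axes. Linear regression fits a line that passes as close to these points as possible, so that the sum of the squared vertical distances from the points to the line, i.e. the residual sum of squares (RSS), is minimized. This is shown in the figure below: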

<img src="https://www.ehdp.com/vn/ro/images/800px-Residuals_for_Linear_Regression_Fit.png" width=400>

The goal is now clear, so we need a method for finding the optimal parameters. **Gradient descent** can be used to find the optimal parameters of a linear regression.

#### Gradient Descent

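The basic idea of the algorithm is to start from a random guess for the parameter values, then repeatedly adjust them and plug them back into the loss function so that the loss keeps decreasing, until the loss function finally converges to a minimum.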

**How Does Gradient Descent Work?**

![image.png](attachment:image.png)

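The figure above is a three-dimensional contour plot with the parameter values on the X and Y axes and the objective function value on the Z axis. We pick a random starting point and then keep changing the $\theta$ values so that the objective function value decreases. In the figure, the path descends gradually from the peak toward the lower right until it reaches an optimum. Gradient descent therefore has two key ingredients: the *size of each downhill step* and the *slope of each downhill step*. The steps above can be written mathematically as: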

$\theta_j:=\theta_j-\alpha\frac{\partial{}}{\partial{\theta_j}}J(\theta_0,\theta_1)$ (for j = 0 and j=1)



- $\alpha$, the **learning rate**, controls how big a step we take downhill
- $\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$, the partial derivative of the loss function (written $L$ above), determines the slope and direction in which the step is taken

$$temp_0:=\theta_0-\alpha\frac{\partial{}}{\partial{\theta_0}}J(\theta_0,\theta_1)\\
temp_1:=\theta_1-\alpha\frac{\partial{}}{\partial{\theta_1}}J(\theta_0,\theta_1)\\
\theta_0:= temp_0\\
\theta_1:=temp_1$$

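Both parameters are updated simultaneously, as the temporary variables indicate, and the algorithm repeats this update until the value of the loss function converges.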
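The simultaneous update can be sketched in a few lines. The objective $J(\theta_0,\theta_1) = (\theta_0-3)^2 + (\theta_1+1)^2$, the learning rate, and the step count are all assumptions chosen for illustration; its minimum is at $(3, -1)$.

```python
def grad_J(theta0, theta1):
    """Partial derivatives of the toy objective J = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return 2 * (theta0 - 3), 2 * (theta1 + 1)


def gradient_descent(theta0, theta1, alpha=0.1, n_steps=200):
    for _ in range(n_steps):
        g0, g1 = grad_J(theta0, theta1)
        # Compute both new values from the OLD parameters before assigning,
        # which is what the temp_0 / temp_1 variables in the formula express.
        temp0 = theta0 - alpha * g0
        temp1 = theta1 - alpha * g1
        theta0, theta1 = temp0, temp1
    return theta0, theta1


theta0, theta1 = gradient_descent(0.0, 0.0)
print(theta0, theta1)
```

A fixed number of steps is used here instead of a convergence test to keep the sketch short; in practice one would stop once the change in the loss falls below a tolerance.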

#### Gradient Descent for Linear Regression

When specifically applied to the case of linear regression, a new form of gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:

repeat until convergence: {

$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})$

$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$

}

So how do we get there? The update rules come from the partial derivatives of the cost function. Here is the derivation:

Given the hypothesis function $h_{\theta}(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$, the cost function is:

$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^m(\theta_0+\theta_1x^{(i)}-y^{(i)})^2$

Taking the partial derivative with respect to $\theta_0$ and applying the chain rule (the factor 2 from the square cancels the $\frac{1}{2}$), we have:

$\frac{\partial}{\partial \theta_0}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})\frac{\partial}{\partial \theta_0}(h_{\theta}(x^{(i)})-y^{(i)})$

Since $\frac{\partial}{\partial \theta_0}(h_{\theta}(x^{(i)})-y^{(i)}) = 1$, this simplifies to:

$\frac{\partial}{\partial \theta_0}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})$

Likewise, since $\frac{\partial}{\partial \theta_1}(h_{\theta}(x^{(i)})-y^{(i)}) = x^{(i)}$, the partial derivative with respect to $\theta_1$ is:

$\frac{\partial}{\partial \theta_1}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$
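One way to sanity-check the derivation is to compare the analytic partial derivatives with central finite differences of the cost function. The sketch below does that on made-up data; the function names, the data, and the step size `eps` are assumptions for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared residuals."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def analytic_gradient(theta0, theta1, xs, ys):
    """Partial derivatives obtained from the derivation."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def numeric_gradient(theta0, theta1, xs, ys, eps=1e-6):
    """Central finite differences, used only to cross-check the formulas."""
    d_theta0 = (cost(theta0 + eps, theta1, xs, ys) - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
    d_theta1 = (cost(theta0, theta1 + eps, xs, ys) - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)
    return d_theta0, d_theta1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.5, 3.5, 6.0, 7.5]
analytic = analytic_gradient(0.5, 1.0, xs, ys)
numeric = numeric_gradient(0.5, 1.0, xs, ys)
print(analytic, numeric)
```

The two gradients should agree up to floating-point error, which supports the formulas used in the update rules.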


**How Does Gradient Descent Work for Linear Regression?**

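We start from a random guess for the linear expression, then train the model on the entire training set, repeatedly applying the gradient descent update above and evaluating the loss function. Because every update uses all of the training data, this method is also known as *Batch Gradient Descent*.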

***Note: For linear regression, the cost function J is a convex quadratic function, so it has a single global optimum and no other local optima. Gradient descent therefore converges to the global optimum, provided the learning rate is not too large. Here is an example of gradient descent as it is run to minimize a quadratic function.***
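A minimal sketch of batch gradient descent for a one-predictor regression follows. It starts from zeros rather than a random guess so that the run is reproducible; the learning rate, iteration count, and toy data are assumptions chosen so that the quadratic cost converges.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, n_iters=5000):
    """Fit y ~ theta0 + theta1 * x using every observation in each update."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # Residuals of the current line on the whole training set (the "batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
print(theta0, theta1)
```

Because the cost is convex, the starting point does not change where the algorithm ends up, only how many iterations it needs.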



#### Assumptions of Linear Regression

**Linearity:** 

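Linear regression assumes a straightforward linear relationship between the predictors and the response. A residual scatter plot can be used to check this assumption: plot the predictor on the X axis and the residuals on the Y axis. If the relationship is linear, the residual plot shows no discernible nonlinear pattern.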

![image-3.png](attachment:image-3.png)

**Normality:** 

Linear regression assumes that the residuals follow a normal distribution. A Q-Q plot can help assess the normality of the residuals.
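The points of a Q-Q plot can be computed without any plotting library: sort the residuals and pair each with the standard normal quantile at the same rank. In the sketch below the residuals are simulated, and the plotting position `(i + 0.5) / n`, the helper names, and the use of a correlation as a numeric summary are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def qq_points(residuals):
    """Pair each sorted residual with the standard normal quantile at the same rank."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(residuals)


random.seed(0)  # fixed seed so the simulated residuals are reproducible
normal_resid = [random.gauss(0.0, 1.0) for _ in range(200)]
skewed_resid = [random.expovariate(1.0) for _ in range(200)]

# If the residuals are normal, the Q-Q points fall close to a straight line,
# so the correlation between the two coordinates is close to 1.
r_normal = pearson(*qq_points(normal_resid))
r_skewed = pearson(*qq_points(skewed_resid))
print(r_normal, r_skewed)
```

Plotting the two lists returned by `qq_points` against each other gives the usual Q-Q plot; the skewed sample bends away from the straight line and has the lower correlation.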


**Independence:** 

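Linear regression assumes that the error terms are uncorrelated, i.e. $\epsilon_i$ provides no information about the sign of $\epsilon_{i+1}$. Time series data often suffer from correlated residuals, because the residuals of observations taken at adjacent time points are usually positively correlated. To check whether the data satisfy independence, plot the residuals against time to see how they evolve. If the residuals are uncorrelated, the plot shows no discernible pattern.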
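As a numeric complement to the residual-versus-time plot, one can compute the correlation between each residual and the next one. This lag-1 autocorrelation is a swapped-in summary, not something the notes prescribe; the function name and the two residual series are made up for illustration.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and its successor in time order."""
    n = len(residuals)
    e_bar = sum(residuals) / n
    centered = [e - e_bar for e in residuals]
    numerator = sum(centered[t] * centered[t + 1] for t in range(n - 1))
    denominator = sum(c ** 2 for c in centered)
    return numerator / denominator


# Residuals that drift slowly over time: neighbors tend to share the same sign.
drifting = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
# Residuals that flip sign at every step: neighbors have opposite signs.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_drifting = lag1_autocorrelation(drifting)
r_alternating = lag1_autocorrelation(alternating)
print(r_drifting, r_alternating)
```

A value near zero is consistent with uncorrelated errors, while a clearly positive value (as in the drifting series) or a clearly negative one suggests the independence assumption is violated.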


**Homoscedasticity:**

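Linear regression assumes that the residuals have constant variance at every level of the predictors. In many cases, however, the variance of the error terms grows as the predictor changes. We can draw a scatter plot with the fitted values on the X axis and the residuals on the Y axis. If the residual plot has a funnel shape, the data do not satisfy the constant-variance assumption, as shown in the figure below: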

![image-4.png](attachment:image-4.png)

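To address non-constant variance (heteroscedasticity), we can transform the response variable Y with a concave function, such as $\sqrt{Y}$ or $\log{Y}$. Such a transformation shrinks large values of the response much more than small ones, which reduces the variability of the variance.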
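The effect of a concave transformation can be seen on a toy example in which the spread of the response grows in proportion to its level. The groups and values below are invented for illustration: each group contains observations 20% below and 20% above its level.

```python
import math

# Hypothetical responses grouped by level; the spread is proportional to the level.
groups = {
    10.0: [8.0, 12.0],
    100.0: [80.0, 120.0],
    1000.0: [800.0, 1200.0],
}


def spread(values):
    """Range of a group, used here as a simple stand-in for its variability."""
    return max(values) - min(values)


raw_spreads = [spread(v) for v in groups.values()]
log_spreads = [spread([math.log(y) for y in v]) for v in groups.values()]

print(raw_spreads)  # grows with the level of the response
print(log_spreads)  # roughly the same for every group
```

On the raw scale the spread grows by a factor of ten from group to group, while on the log scale every group has the same spread, $\log(1.2/0.8)$.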

**Collinearity:**

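Finally, linear regression assumes that the predictors are not correlated with one another. Correlation among predictors is known as collinearity. In a regression setting, collinearity reduces the precision of the estimated regression coefficients, because it makes it hard for the model to separate the individual effects of correlated predictors on the response. For example, if `limit` and `rating` tend to increase or decrease together, it is hard to determine how each of them individually affects the response `balance`.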
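A quick first check for collinearity between two predictors is their Pearson correlation. The `limit` and `rating` values below are invented for illustration and are not taken from any real dataset.

```python
def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


# Hypothetical predictor values that rise and fall together.
limit = [3000.0, 4000.0, 5000.0, 6000.0, 7000.0]
rating = [250.0, 320.0, 380.0, 460.0, 520.0]

corr = pearson(limit, rating)
print(corr)
```

A correlation close to 1 or -1 signals that the two predictors carry nearly the same information, so their separate coefficients will be estimated imprecisely.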

#### Other Considerations of Linear Regression

##### Is there a relationship between Predictors and the Response?

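In multiple linear regression, we test whether $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_p = 0$. The null hypothesis is therefore:

$$
H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_p = 0
$$

The alternative hypothesis is:

$$
H_1: \text{at least one } \beta_j \text{ is non-zero}
$$

We decide whether to reject the null hypothesis by computing the F-statistic of the regression: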

$$F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}
$$

The F-statistic can also be understood as:

$$F = \frac{\text{variance explained by predictors}}{\text{variance not explained by predictors}}
$$

- $p$ is the number of predictors
- $n$ is the number of observations
- $n-p-1$ is the denominator degrees of freedom; together with the numerator degrees of freedom $p$, it determines the shape of the F-distribution

We calculate the F-statistic to determine whether there is a relationship between the predictors and the response variable. If there is no relationship, we expect the F-statistic to be close to 1, meaning that the predictors do not explain more variance than a model that ignores them.

On the other hand, a higher value of the F-statistic suggests that the predictors help explain the variance of the response variable. We decide whether to reject the null hypothesis by calculating the p-value for the F-statistic.
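The F-statistic follows directly from TSS, RSS, the number of observations, and the number of predictors. The function name and the input values in this sketch are hypothetical; the p-value is left out because it requires the F-distribution, which the Python standard library does not provide.

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    explained = (tss - rss) / p          # variance explained per predictor
    unexplained = rss / (n - p - 1)      # residual variance per degree of freedom
    return explained / unexplained


# Made-up sums of squares for a model with 3 predictors and 50 observations.
f_value = f_statistic(tss=100.0, rss=20.0, n=50, p=3)
print(f_value)
```

Here the predictors explain 80 of the 100 units of total variation, which yields an F-statistic far above 1.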

**When to use F-statistic?**

The approach of using least squares and the F-statistic to test for any association between the predictors and the response works when p is relatively small, and certainly small compared to n. However, sometimes we have a very large number of variables. If p > n, there are more coefficients $\beta_j$ to estimate than observations from which to estimate them. In this case, least squares cannot be used to fit the linear regression model.

Hence, the F-statistic cannot be used, and neither can $R^2$. When p is large, feature selection approaches, such as *forward feature selection and backward feature selection*, should be used to select features.


##### Qualitative Predictors

The predictor variables in linear regression can be both quantitative and qualitative. Suppose that we want to investigate the difference in card balance between males and females, ignoring all other variables. We can create one dummy variable that encodes the two levels of gender, with 0 for male and 1 for female. The predictor variable `Gender` can then be defined as:

$$x_i = \begin{cases} 1 & \text{if ith person is female}\\
0 & \text{if ith person is male}
\end{cases}$$

Hence, we obtain the linear regression:

$$
y_i = \beta_0 + \beta_1x_i + \epsilon = \begin{cases}
\beta_0 + \epsilon & \text{if ith person is male}\\
\beta_0 + \beta_1 + \epsilon & \text{if ith person is female}
\end{cases}
$$

Therefore, $\beta_0$ is the average card balance among males, $\beta_0+\beta_1$ is the average card balance among females, and $\beta_1$ is the average difference in card balance between females and males.

***Notes: There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline.***
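The interpretation of the dummy-variable coefficients can be checked on a small example. The balances below are invented, and `fit_simple_ols` is a hypothetical helper implementing the one-predictor least squares formulas.

```python
def fit_simple_ols(x, y):
    """Least squares estimates (beta0, beta1) for y ~ beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


# Hypothetical card balances; male is the baseline level (dummy = 0).
male_balance = [500.0, 520.0, 480.0]
female_balance = [530.0, 550.0, 510.0, 530.0]

gender = [0] * len(male_balance) + [1] * len(female_balance)  # dummy variable
balance = male_balance + female_balance

beta0, beta1 = fit_simple_ols(gender, balance)
male_mean = sum(male_balance) / len(male_balance)
female_mean = sum(female_balance) / len(female_balance)
print(beta0, beta1, male_mean, female_mean)
```

The fitted intercept matches the average balance of the baseline group, and the slope matches the difference between the two group averages.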