# Linear Regression

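## Background

These notes follow Andrew Ng's Machine Learning course on [网易云课堂 (NetEase Cloud Classroom)](https://study.163.com/course/courseMain.htm?courseId=1004570029) and implement the linear regression algorithms in Python, in order to understand these models more deeply.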

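## Linear Hypothesis Function (LHF)

The linear hypothesis function is the model behind linear regression. It assumes that the target (output) is a linear function of the data features (input), which can be written as:

$$
h_{\theta}(x)=\theta_0+\theta_{1}x_1+...+\theta_{n}x_n
$$

- $h_{\theta}(x)$: the predicted value
- $\theta$: the coefficients $\theta_0...\theta_n$; each $\theta_j$ is a scalar
- $x$: the data features (input) $x_1...x_n$, where $n$ is the number of features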
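
As a minimal sketch of the formula above, the hypothesis can be evaluated for several samples at once with NumPy. The function name `predict` and the numbers are made up for illustration; `theta[0]` is assumed to hold $\theta_0$.

```python
import numpy as np

def predict(theta, X):
    """Evaluate h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n per row.

    theta: shape (n + 1,), with theta[0] the intercept theta_0.
    X:     shape (m, n), one sample per row, without a constant column.
    """
    return theta[0] + X @ theta[1:]

# Made-up coefficients and two samples with two features each.
theta = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 1.0],
              [2.0, 0.0]])
predictions = predict(theta, X)  # 1 + 2*1 + 3*1 and 1 + 2*2 + 3*0
```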

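## Cost Function (CF)

The cost function measures how well the LHF fits the training data; the best coefficients are the ones that minimise it. Mean Squared Error (MSE) is the usual cost function for the LHF above:

$$
J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2
$$

- $J(\theta)$: the value of the cost function $J$ for the given $\theta$
- $m$: the number of samples
- $x^{(i)}$: the input vector of the $i$-th training sample
- $y^{(i)}$: the actual output value (the target) of the $i$-th training sample
- $\theta$: the chosen parameter values or weights ($\theta_0, \theta_1, \theta_2, ...$)
- $h_{\theta}(x^{(i)})$: the prediction for the $i$-th training sample under the given $\theta$, computed with the linear hypothesis function (with a single feature, $h_{\theta}(x)=\theta_0+\theta_1x$)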
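
The MSE formula can be sketched in a few lines of NumPy. This is an illustration only: the name `mse_cost` is hypothetical, and the matrix `X` is assumed to already carry a leading column of ones so that $\theta_0$ is handled like any other coefficient.

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = (1/m) * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = X @ theta - y      # h_theta(x_i) - y_i for every sample
    return np.mean(residuals ** 2)

# Made-up data lying exactly on y = 2x; first column is the constant 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_zero = mse_cost(np.zeros(2), X, y)             # (4 + 16 + 36) / 3
cost_exact = mse_cost(np.array([0.0, 2.0]), X, y)   # perfect fit
```

A lower value means the line fits the samples better; the perfect fit gives a cost of zero.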

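## Gradient Descent (GD)

Gradient descent is an iterative algorithm for finding the coefficients ($\theta$) of the Linear Hypothesis Function. Each iteration moves $\theta$ a step against the gradient of the CF, that is, downhill, until the CF reaches a minimum; hence the name "gradient descent".

The algorithm is:
1. Choose initial values for $\theta$; they are usually set to 0
2. Use the LHF to compute the prediction for every sample in the training set
3. Compute the CF (here the MSE cost function), denoted $J(\theta)$, where $h_{\theta}(x^{(i)})$ is the predicted value and $y^{(i)}$ is the actual value (target)
4. Update $\theta$ with the rule $\theta=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)$, where the learning rate $\alpha$ sets the step size (with $\alpha=1$ this is the plain rule $\theta=\theta-\frac{\partial}{\partial \theta}J(\theta)$)
5. Repeat steps 2-4 until $J(\theta)$ stops changing or changes only very little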

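The key part is how to compute $\frac{\partial}{\partial \theta}J(\theta)$. First, some notation:
- $x_{j}^{(i)}$: the value of the $j$-th feature of the $i$-th sample
- Define $x_0=1$. In the LHF, $\theta_0$ has no corresponding variable, so we introduce one and fix it at 1, which leaves the LHF unchanged
- To simplify the calculation, Andrew Ng adjusts the MSE function slightly by putting a factor of $\frac{1}{2}$ in front, so that it becomes $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$. This only scales $J$ by a constant, so the $\theta$ that minimises it stays the same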

Derivation, taking the partial derivative with respect to a single coefficient $\theta_j$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}J(\theta)
&=\frac{\partial}{\partial \theta_j}\left(\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2\right)\\
&=\frac{1}{2m}\sum_{i=1}^{m}\frac{\partial}{\partial \theta_j}(h_{\theta}(x^{(i)})-y^{(i)})^2\\
&=\frac{1}{2m}\sum_{i=1}^{m}2(h_{\theta}(x^{(i)})-y^{(i)})\frac{\partial}{\partial \theta_j}(h_{\theta}(x^{(i)})-y^{(i)})\\
&=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\frac{\partial}{\partial \theta_j}(h_{\theta}(x^{(i)})-y^{(i)})\\
&=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}
\end{aligned}
$$

The last step uses $h_{\theta}(x^{(i)})=\sum_{k=0}^{n}\theta_k x_{k}^{(i)}$ (with $x_{0}^{(i)}=1$): only the term $\theta_j x_{j}^{(i)}$ depends on $\theta_j$, and $y^{(i)}$ is a constant, so the inner derivative is $x_{j}^{(i)}$.
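
One way to gain confidence in the result of this derivation is to compare it with a finite-difference approximation of the $\frac{1}{2m}$-scaled cost. The sketch below does that on random made-up data; the function names are hypothetical, and `X` is assumed to include the $x_0=1$ column.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) with the 1/(2m) scaling used in the derivation."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def analytic_gradient(theta, X, y):
    """Stacks (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij for every j."""
    return X.T @ (X @ theta - y) / len(y)

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coefficient at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # x_0 = 1 column first
y = rng.normal(size=5)
theta = rng.normal(size=3)

max_gap = np.max(np.abs(analytic_gradient(theta, X, y) - numeric_gradient(theta, X, y)))
```

If the derived formula is right, `max_gap` should be close to zero, limited only by floating-point rounding.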

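The update rule for each component of $\theta$ is therefore:

$$
\begin{aligned}
\theta_0&=\theta_0-\alpha\frac{\partial}{\partial \theta_0}J(\theta)=\theta_0-\frac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\\
\theta_1&=\theta_1-\alpha\frac{\partial}{\partial \theta_1}J(\theta)=\theta_1-\frac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{1}^{(i)}\\
&\;\;\vdots\\
\theta_n&=\theta_n-\alpha\frac{\partial}{\partial \theta_n}J(\theta)=\theta_n-\frac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{n}^{(i)}
\end{aligned}
$$

Here $\alpha$ is the learning rate that scales each step ($\alpha=1$ gives the unscaled rule). All $\theta_j$ are updated simultaneously: every sum on the right-hand side uses the predictions from the previous $\theta$.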
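
Putting the five steps and the per-coefficient updates together, batch gradient descent might be sketched as follows. This is not the course's reference code: the function name, the learning rate `alpha`, the tolerance `tol` and the tiny data set are assumptions chosen for illustration, and `X` is assumed to include the $x_0=1$ column.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, max_iters=10_000, tol=1e-12):
    """Batch gradient descent on the 1/(2m)-scaled squared-error cost."""
    m, n = X.shape
    theta = np.zeros(n)                                 # step 1: start from zeros
    prev_cost = np.inf
    for _ in range(max_iters):
        residuals = X @ theta - y                       # step 2: predictions minus targets
        cost = np.sum(residuals ** 2) / (2 * m)         # step 3: J(theta)
        if abs(prev_cost - cost) < tol:                 # step 5: stop when J barely changes
            break
        theta = theta - alpha * (X.T @ residuals) / m   # step 4: update all theta_j at once
        prev_cost = cost
    return theta, cost

# Made-up data on the line y = 1 + 2x, so the expected result is known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, final_cost = gradient_descent(X, y)
```

On real data such as house prices the features differ by orders of magnitude, so a much smaller `alpha` or feature scaling would likely be needed for the iteration to converge.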


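## Stochastic Gradient Descent (SGD)

When the data set is very large, gradient descent becomes inefficient, because every iteration has to go through all of the data.

Stochastic gradient descent addresses this efficiency problem by updating $\theta$ from a single sample at a time. The algorithm is:
1. Choose initial values for $\theta$; they are usually set to 0
2. Pick one sample at random from the data set and use the LHF to compute its prediction
3. Compute the CF for that sample (here the MSE cost function), denoted $J(\theta)$, where $h_{\theta}(x^{(i)})$ is the predicted value and $y^{(i)}$ is the actual value (target)
4. Update $\theta$ with the rule $\theta=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)$, where the learning rate $\alpha$ sets the step size (with $\alpha=1$ this is the plain rule $\theta=\theta-\frac{\partial}{\partial \theta}J(\theta)$)
5. Repeat steps 2-4 until $J(\theta)$ stops changing or changes only very little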
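
A minimal sketch of these steps, under a few assumptions: the function name `sgd`, the learning rate `alpha`, the fixed number of steps (used here in place of a convergence test) and the toy data are all made up for illustration, and `X` is assumed to include the $x_0=1$ column. For a single sample the gradient reduces to $(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$.

```python
import numpy as np

def sgd(X, y, alpha=0.05, steps=5000, seed=0):
    """Stochastic gradient descent: one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)                      # step 1: start from zeros
    for _ in range(steps):
        i = rng.integers(m)                  # step 2: pick one sample at random
        error = X[i] @ theta - y[i]          # prediction minus target for that sample
        theta -= alpha * error * X[i]        # step 4: single-sample gradient step
    return theta

# Made-up, noise-free data on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = sgd(X, y)
```

Each update touches one row only, so its cost does not grow with the size of the data set; the price is a noisier path towards the minimum.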


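## Normal Equation (NE)

The normal equation computes $\theta$ with linear algebra. It needs no iteration over the data set and gives the result in one step, but it has the following limitations:
- It only applies to linear regression models
- It becomes inefficient when the number of features (input fields) is very large, because it has to invert the matrix $X^TX$, whose size grows with the number of features. Andrew Ng's advice is to avoid it beyond about 10,000 features

$$
\theta=(X^TX)^{-1}(X^Ty)
$$

- $\theta$ is a vector whose elements are the coefficients of the LHF; it is the final result
- $X$ is an $m\times n$ matrix, where $m$ is the number of records in the data set and $n$ the number of feature columns; its elements are the actual data, so $X_{i,j}$ is the value of feature column $j$ for sample $i$. For $\theta$ to include $\theta_0$, $X$ also needs a column of ones for $x_0=1$, which makes it $m\times(n+1)$
- $y$ is the vector of targets, that is, the labels or outputs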
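
The formula translates almost directly into NumPy. The sketch below, with a hypothetical function name and made-up data, solves the linear system $(X^TX)\theta=X^Ty$ with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $\theta$ when $X^TX$ is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data on the line y = 1 + 2x; the first column of X is x_0 = 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = normal_equation(X, y)  # recovers intercept 1 and slope 2
```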

## Dataset

Implementing and validating the linear regression model requires a data set (`data/kc_house_data.csv`); here we use the House Sales in King County, USA house price data from Kaggle.

Field descriptions:
- id: a notation for a house
- date: date the house was sold
- price: sale price; this is the prediction target
- bedrooms: number of bedrooms per house
- bathrooms: number of bathrooms per house
- sqft_living: square footage of the home
- sqft_lot: square footage of the lot
- floors: total floors (levels) in the house
- waterfront: house which has a view to a waterfront
- view: has been viewed
- condition: how good the condition is (overall)
- grade: overall grade given to the housing unit, based on the King County grading system
- sqft_above: square footage of the house apart from the basement
- sqft_basement: square footage of the basement
- yr_built: year built
- yr_renovated: year when the house was renovated
- zipcode: zip code
- lat: latitude coordinate
- long: longitude coordinate
- sqft_living15: living room area in 2015 (implies some renovations); this might or might not have affected the lot size area
- sqft_lot15: lot size area in 2015 (implies some renovations)

```python
import pandas as pd

df = pd.read_csv('../data/kc_house_data.csv')
df
```

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 5 | 7237550310 | 20140512T000000 | 1225000.0 | 4 | 4.50 | 5420 | 101930 | 1.0 | 0 | 0 | ... | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
| 6 | 1321400060 | 20140627T000000 | 257500.0 | 3 | 2.25 | 1715 | 6819 | 2.0 | 0 | 0 | ... | 7 | 1715 | 0 | 1995 | 0 | 98003 | 47.3097 | -122.327 | 2238 | 6819 |
| 7 | 2008000270 | 20150115T000000 | 291850.0 | 3 | 1.50 | 1060 | 9711 | 1.0 | 0 | 0 | ... | 7 | 1060 | 0 | 1963 | 0 | 98198 | 47.4095 | -122.315 | 1650 | 9711 |
| 8 | 2414600126 | 20150415T000000 | 229500.0 | 3 | 1.00 | 1780 | 7470 | 1.0 | 0 | 0 | ... | 7 | 1050 | 730 | 1960 | 0 | 98146 | 47.5123 | -122.337 | 1780 | 8113 |
| 9 | 3793500160 | 20150312T000000 | 323000.0 | 3 | 2.50 | 1890 | 6560 | 2.0 | 0 | 0 | ... | 7 | 1890 | 0 | 2003 | 0 | 98038 | 47.3684 | -122.031 | 2390 | 7570 |


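Use the pandas `DataFrame.sample` method to create the training set from 20% of the whole data set. To keep the training set reproducible, the seed (`random_state`) is fixed at 1.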

```python
train_set = df.sample(frac=0.2, random_state=1)
train_set
```

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15544 | 1310430130 | 20141009T000000 | 459000.0 | 4 | 2.75 | 2790 | 6600 | 2.0 | 0 | 0 | ... | 9 | 2790 | 0 | 2000 | 0 | 98058 | 47.4362 | -122.109 | 2900 | 6752 |
| 17454 | 2540830020 | 20150401T000000 | 445000.0 | 3 | 2.25 | 1630 | 6449 | 1.0 | 0 | 0 | ... | 7 | 1310 | 320 | 1986 | 0 | 98011 | 47.7275 | -122.232 | 1620 | 7429 |
| 21548 | 8835770330 | 20140819T000000 | 1057000.0 | 2 | 1.50 | 2370 | 184231 | 2.0 | 0 | 0 | ... | 11 | 2370 | 0 | 2005 | 0 | 98045 | 47.4543 | -121.778 | 3860 | 151081 |
| 3427 | 7732400490 | 20141105T000000 | 732350.0 | 4 | 2.50 | 2270 | 7665 | 2.0 | 0 | 0 | ... | 9 | 2270 | 0 | 1986 | 0 | 98052 | 47.6612 | -122.148 | 2450 | 8706 |
| 8809 | 2800031 | 20150401T000000 | 235000.0 | 3 | 1.00 | 1430 | 7599 | 1.5 | 0 | 0 | ... | 6 | 1010 | 420 | 1930 | 0 | 98168 | 47.4783 | -122.265 | 1290 | 10320 |
| 3294 | 686450490 | 20140929T000000 | 555000.0 | 3 | 2.00 | 2240 | 11250 | 1.0 | 0 | 0 | ... | 8 | 2240 | 0 | 1968 | 0 | 98008 | 47.6371 | -122.119 | 2200 | 12500 |
| 275 | 4215100060 | 20150320T000000 | 365000.0 | 3 | 2.50 | 2653 | 4510 | 2.0 | 0 | 0 | ... | 8 | 2653 | 0 | 2006 | 0 | 98031 | 47.4145 | -122.166 | 2653 | 4927 |
| 8736 | 7853301570 | 20150430T000000 | 685000.0 | 4 | 2.50 | 3550 | 10968 | 2.0 | 0 | 0 | ... | 9 | 3550 | 0 | 2006 | 0 | 98065 | 47.5431 | -121.886 | 3550 | 8583 |
| 6161 | 7977201845 | 20140514T000000 | 525000.0 | 3 | 1.75 | 1600 | 6120 | 1.5 | 0 | 0 | ... | 7 | 1600 | 0 | 1924 | 0 | 98115 | 47.6847 | -122.291 | 1670 | 4590 |
| 19832 | 2970800105 | 20150313T000000 | 449950.0 | 4 | 2.50 | 2420 | 5244 | 2.0 | 0 | 0 | ... | 9 | 2420 | 0 | 2007 | 0 | 98166 | 47.4729 | -122.350 | 1400 | 5250 |


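Use the pandas `DataFrame.sample` method to create the test set from 30% of the whole data set. To keep the test set reproducible, the seed (`random_state`) is fixed at 2. Because this sample is drawn from the full `df` independently of the training sample, the two sets can share rows.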

```python
test_set = df.sample(frac=0.3, random_state=2)
test_set
```

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6638 | 3491300052 | 20150409T000000 | 735000.0 | 4 | 2.25 | 2410 | 4250 | 1.5 | 0 | 0 | ... | 7 | 1460 | 950 | 1929 | 0 | 98117 | 47.6849 | -122.376 | 1360 | 5074 |
| 7366 | 8917100020 | 20140606T000000 | 1150000.0 | 3 | 1.50 | 2170 | 16600 | 1.0 | 1 | 2 | ... | 10 | 1130 | 1040 | 1979 | 0 | 98052 | 47.6307 | -122.088 | 3130 | 13875 |
| 3158 | 5100401429 | 20141009T000000 | 350500.0 | 2 | 1.00 | 1450 | 6380 | 1.0 | 0 | 0 | ... | 7 | 1450 | 0 | 1967 | 0 | 98115 | 47.6924 | -122.321 | 1240 | 6380 |
| 9117 | 1454600156 | 20140625T000000 | 860000.0 | 5 | 3.25 | 4500 | 9648 | 2.0 | 0 | 4 | ... | 8 | 3000 | 1500 | 1968 | 0 | 98125 | 47.7262 | -122.282 | 2780 | 21132 |
| 3392 | 1917300025 | 20150127T000000 | 122000.0 | 2 | 1.00 | 860 | 6000 | 1.0 | 0 | 0 | ... | 6 | 860 | 0 | 1945 | 0 | 98022 | 47.2109 | -121.985 | 1300 | 6000 |
| 305 | 5016001535 | 20150217T000000 | 725000.0 | 3 | 1.75 | 1920 | 3300 | 1.0 | 0 | 0 | ... | 8 | 960 | 960 | 1913 | 0 | 98112 | 47.6239 | -122.298 | 1740 | 4000 |
| 14462 | 2113700620 | 20140804T000000 | 417000.0 | 3 | 1.50 | 2500 | 6000 | 1.5 | 0 | 0 | ... | 7 | 1730 | 770 | 1941 | 1984 | 98106 | 47.5297 | -122.354 | 1340 | 5000 |
| 6196 | 629600130 | 20140801T000000 | 594950.0 | 4 | 2.25 | 2380 | 35008 | 1.0 | 0 | 0 | ... | 8 | 2380 | 0 | 1977 | 0 | 98075 | 47.5834 | -122.001 | 2250 | 34794 |
| 10194 | 7787120260 | 20140609T000000 | 471000.0 | 4 | 2.50 | 2330 | 9928 | 2.0 | 0 | 0 | ... | 8 | 2330 | 0 | 1998 | 0 | 98045 | 47.4836 | -121.783 | 2430 | 8175 |
| 13457 | 7589200191 | 20140807T000000 | 634950.0 | 3 | 3.00 | 2180 | 2650 | 1.5 | 0 | 0 | ... | 8 | 1410 | 770 | 1930 | 0 | 98117 | 47.6891 | -122.375 | 1570 | 4820 |
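
Both samples above are drawn from the same `df`, so some rows may land in both the training and the test set. If disjoint sets are wanted, one possible variation (not what the cells above do) is to drop the training rows before sampling the test set. The sketch uses a small made-up frame in place of the real CSV so that it runs on its own; the column names only mirror the data set.

```python
import pandas as pd

# Stand-in for kc_house_data.csv: 100 made-up rows, values are not real prices.
df = pd.DataFrame({"sqft_living": range(100, 200), "price": range(1000, 1100)})

# 20% of all rows for training, same call as in the text.
train_set = df.sample(frac=0.2, random_state=1)

# Remove the training rows, then draw 30% of the original row count for testing.
remaining = df.drop(train_set.index)
test_set = remaining.sample(n=round(0.3 * len(df)), random_state=2)

# Index labels shared by both sets; empty when the sets are disjoint.
overlap = train_set.index.intersection(test_set.index)
```

With this variation no row is used for both fitting and evaluation, at the cost of producing different samples from the ones shown above.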
