## Analytical Solution of Linear Regression

### 1. Model

Design matrix (one input vector per row) and label vector:
$$
X = \begin{bmatrix}x^T_1 \\ x^T_2\\ \vdots \\ x^T_N\end{bmatrix}, \quad y = \begin{bmatrix}y_1 \\ y_2\\ \vdots \\ y_N\end{bmatrix}
$$

Loss function (mean squared error):

$$
J(\theta) = \frac{1}{2N}(y-X\theta)^T(y-X\theta)
$$

Closed-form solution (set the gradient to zero):
$$
\theta = (X^TX)^{-1}X^Ty
$$
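
The step from the loss to the closed-form solution can be written out as follows. This is a short sketch of the standard derivation, under the assumption (not stated in the original) that $X^TX$ is invertible, i.e. $X$ has full column rank:

$$
\nabla_\theta J(\theta) = -\frac{1}{N}X^T(y - X\theta) = 0
\;\Longrightarrow\; X^TX\theta = X^Ty
\;\Longrightarrow\; \theta = (X^TX)^{-1}X^Ty
$$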

Model predictions on the training data:

$$
f(\theta) = X\theta = X(X^TX)^{-1}X^Ty
$$

### 2. Hand-written implementation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from sklearn.preprocessing import StandardScaler

In [25]:
# Load the data and inspect its features
lines = np.loadtxt('USA_Housing.csv',delimiter=',',dtype='str')
header = lines[0]
lines = lines[1:].astype(float)
print('Features:',', '.join(header[:-1]))
print('Label:',header[-1])
print('Number of samples:',len(lines))

Features: Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms, Avg. Area Number of Bedrooms, Area Population
Label: Price
Number of samples: 5000


In [26]:
# Split into training and test sets
ratio = 0.8
split = int(len(lines)*ratio)
lines = np.random.permutation(lines)
train,test = lines[:split], lines[split:]
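
Because `np.random.permutation` is called without a seed, the split (and therefore the coefficients and RMSE reported later) changes from run to run. A minimal sketch of a reproducible variant is shown below; the array `data` is a hypothetical stand-in for `lines`, and the seed value `0` is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical stand-in for the `lines` array: 10 samples, 3 columns.
data = np.arange(30, dtype=float).reshape(10, 3)

# A seeded generator makes the shuffle repeatable (seed 0 is arbitrary).
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(data)  # shuffles rows and returns a copy

ratio = 0.8
split = int(len(shuffled) * ratio)
train, test = shuffled[:split], shuffled[split:]
print(train.shape, test.shape)
```

Running the cell twice with the same seed gives the same training and test rows.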

In [27]:
# Standardize the data (every column, including the Price label)
scaler = StandardScaler() # create the scaler
scaler.fit(train) # compute the mean and variance from the training set only
train = scaler.transform(train)
test = scaler.transform(test)

In [30]:
# Split into inputs and labels
x_train, y_train = train[:,:-1], train[:,-1]
print(x_train.shape)
print(y_train.shape)
print(type(y_train))
x_train, y_train = train[:,:-1], train[:,-1].flatten()
print(y_train.shape)
print(type(y_train))

(4000, 5)
(4000,)
<class 'numpy.ndarray'>
(4000,)
<class 'numpy.ndarray'>


In [29]:
x_test, y_test = test[:,:-1], test[:,-1].flatten()

Root mean squared error (evaluation metric):

$$
\mathcal{L}_{RMSE}(y,\hat{y}) = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2}
$$

Mean squared error (training loss):

$$
\mathcal{L}_{MSE}(y,\hat{y}) = \frac{1}{2N}\sum_{i=1}^N(y_i-\hat{y}_i)^2
$$
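
The two formulas above can be sketched as small NumPy helpers. The function names `mse_loss` and `rmse` and the toy arrays are illustrative assumptions, not part of the original notebook. Note that with the $\frac{1}{2N}$ convention used here, $\mathcal{L}_{RMSE} = \sqrt{2\,\mathcal{L}_{MSE}}$.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Training loss: sum of squared errors divided by 2N."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.square(y - y_hat).sum() / (2 * len(y))

def rmse(y, y_hat):
    """Evaluation metric: square root of the mean squared error."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.square(y - y_hat).mean())

# Toy values chosen for illustration: only the last prediction is off, by 2.
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

mse_value = mse_loss(y_true, y_hat)   # 4 / (2 * 3)
rmse_value = rmse(y_true, y_hat)      # sqrt(4 / 3)
print(mse_value, rmse_value)
```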

In [37]:
# Append a column of ones to X to represent the constant (intercept) term
print(x_train.shape)
print(type(x_train))
X = np.concatenate([x_train,np.ones((len(x_train),1))],axis=1)
print(X.shape)
print(type(X))

(4000, 5)
<class 'numpy.ndarray'>
(4000, 6)
<class 'numpy.ndarray'>


Closed-form solution (set the gradient to zero):
$$
\theta = (X^TX)^{-1}X^Ty
$$

In [39]:
# @ is matrix multiplication, X.T is the transpose of X, np.linalg.inv computes the matrix inverse
theta = np.linalg.inv(X.T @ X) @ X.T @ y_train
print('Coefficients:',theta)

Coefficients: [6.55393036e-01 4.59613908e-01 3.42308053e-01 3.04235311e-03
 4.19771186e-01 6.76542156e-17]
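
Forming the inverse explicitly matches the formula, but it is usually not the most numerically robust way to solve the normal equations. As a hedged sketch, the same coefficients can be obtained with `np.linalg.solve` or `np.linalg.lstsq`. The synthetic `X`, `y`, and `true_theta` below are assumptions made for illustration and are unrelated to the housing data.

```python
import numpy as np

# Synthetic regression problem (illustrative only): 200 samples, 3 features + bias.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(size=(200, 3)), np.ones((200, 1))], axis=1)
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.01 * rng.normal(size=200)

# 1) Explicit inverse, as in the notebook.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solve the normal equations X^T X theta = X^T y without forming the inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least-squares solver working on X directly; also handles rank-deficient X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_inv, theta_solve), np.allclose(theta_inv, theta_lstsq))
```

On a well-conditioned problem like this one all three agree; the difference tends to matter when columns of `X` are nearly collinear.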


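Model predictions on the test data, where $X$ and $y$ are the training matrix and labels and $X_{test}$ is the test matrix with the column of ones appended:

$$
\hat{y}_{test} = X_{test}\theta = X_{test}(X^TX)^{-1}X^Ty
$$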

In [41]:
# Predict on the test set using the regression coefficients
X_test = np.concatenate([x_test,np.ones((len(x_test),1))],axis=1)
y_pred = X_test @ theta

In [42]:
# Compute the RMSE between the predicted and true values
rmse_loss = np.sqrt(np.square(y_test - y_pred).mean())
print("RMSE: ",rmse_loss)

RMSE:  0.2757154767338514
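
Since the label column was standardized together with the features, this RMSE is expressed in units of the training-set standard deviation of the price, not in currency. Standardization is an affine map, so multiplying by that standard deviation recovers the RMSE in the original units. The sketch below checks this on synthetic numbers (all values are made up for illustration); with the fitted `StandardScaler` from above, the corresponding factor would be `scaler.scale_[-1]`.

```python
import numpy as np

# Synthetic "prices" and predictions in original units (illustrative values only).
rng = np.random.default_rng(0)
price = rng.normal(loc=1.2e6, scale=3.5e5, size=1000)
pred = price + rng.normal(scale=1.0e5, size=1000)

# StandardScaler uses the population standard deviation (ddof=0), like np.std.
mu, sigma = price.mean(), price.std()
price_std = (price - mu) / sigma
pred_std = (pred - mu) / sigma

rmse_std = np.sqrt(np.square(price_std - pred_std).mean())   # standardized units
rmse_orig = np.sqrt(np.square(price - pred).mean())          # original units

# Rescaling the standardized RMSE by sigma gives the original-unit RMSE.
print(np.isclose(rmse_std * sigma, rmse_orig))
```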


### 3. Implementation with sklearn

In [45]:
from sklearn.linear_model import LinearRegression

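# Initialize the linear regression model
linreg = LinearRegression()

# LinearRegression fits the constant (intercept) term itself, so no column of ones is needed
linreg.fit(x_train,y_train)
# coef_ holds the fitted regression coefficients, intercept_ the constant term
print("Coefficients:",linreg.coef_,linreg.intercept_)

Coefficients: [0.65539304 0.45961391 0.34230805 0.00304235 0.41977119] 9.374571425301378e-17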


In [46]:
y_pred = linreg.predict(x_test)
# Compute the RMSE between the true and predicted values
rmse_loss = np.sqrt(np.square(y_test - y_pred).mean())
print("RMSE:", rmse_loss)

RMSE: 0.2757154767338514
