
# Linear Regression

- What is linear regression
- The linear regression hypothesis
- Training a linear regression model
- Evaluating the model
- scikit-learn implementation

y = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ

```
# compute the weighted sum via matrix multiplication
y_pred = np.dot(x, self.w_)
```

Here y is the predicted value.

- θ₀ is the bias term.
- θ₁, ..., θₙ are the model parameters.
- x₁, x₂, ..., xₙ are the feature values.

In vectorized form, y = θᵀx, where:

- θ is the model's parameter vector, including the bias term θ₀.
- x is the feature vector, with x₀ = 1.
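As a small illustrative sketch (the data and θ values here are made up), the vectorized hypothesis can be evaluated for a whole dataset at once by prepending a column of ones so that x₀ = 1:

```python
import numpy as np

# three samples, one feature each (illustrative values)
x = np.array([[1.0], [2.0], [3.0]])
# prepend x0 = 1 so theta[0] acts as the bias term
x_b = np.hstack([np.ones((x.shape[0], 1)), x])

theta = np.array([[2.0], [3.0]])  # theta0 = 2 (bias), theta1 = 3 (slope)
y_pred = np.dot(x_b, theta)       # shape (3, 1)
print(y_pred.ravel())             # -> [ 5.  8. 11.]
```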

```
# imports
import numpy as np
import matplotlib.pyplot as plt

# generate random data-set
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

# plot
plt.scatter(x, y, s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

m is the total number of training examples in our dataset.

```
cost = (1 / (2*m)) * ((y_pred1 - y1)**2 + (y_pred2 - y2)**2 + ... + (y_predm - ym)**2)

# predicted y values for the data points x under the current weights
y_pred = np.dot(x_train, self.w_)
# the difference between the predicted and actual y values is called the residual
residuals = y_pred - y
# the cost function is half the mean of the squared residuals
cost = np.sum((residuals ** 2)) / (2 * m)
```
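As a minimal numeric sketch of this cost (the toy data and all-zero initial weights below are assumptions for illustration only):

```python
import numpy as np

# toy data: x already carries a leading column of ones for the bias term
x_train = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[5.0], [8.0], [11.0]])
w = np.zeros((2, 1))  # all-zero initial weights

m = x_train.shape[0]
y_pred = np.dot(x_train, w)          # all predictions are 0 at first
residuals = y_pred - y               # [-5, -8, -11]
cost = np.sum(residuals ** 2) / (2 * m)
print(cost)                          # (25 + 64 + 121) / 6 = 35.0
```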

First, initialize the model parameters. For fitting the points with a straight line, there are two weights: theta0 and theta1.

The update formula for theta0 (the y-intercept of the line) is:

θ₀ := θ₀ − (η/m) · Σᵢ (ŷᵢ − yᵢ)

The update formula for theta1 (the slope of the line) is:

θ₁ := θ₁ − (η/m) · Σᵢ (ŷᵢ − yᵢ) · xᵢ

Here η is the learning rate (eta in the code) and the sums run over all m training examples.

```
# compute the vector of partial derivatives for all parameters
gradient_vector = np.dot(x.T, residuals)
# update the weights with a single vector operation
self.w_ -= (self.eta / m) * gradient_vector
```
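To make the update concrete, here is a single gradient-descent step on toy data (the data and eta are illustrative assumptions):

```python
import numpy as np

x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + feature
y = np.array([[5.0], [8.0], [11.0]])
w = np.zeros((2, 1))
eta = 0.1
m = x.shape[0]

y_pred = np.dot(x, w)                     # all zeros initially
residuals = y_pred - y                    # [-5, -8, -11]
gradient_vector = np.dot(x.T, residuals)  # [-24, -54]
w -= (eta / m) * gradient_vector
print(w.ravel())                          # ≈ [0.8 1.8]
```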

```
# imports
import numpy as np

class LinearRegressionUsingGD:
    """Linear regression using batch gradient descent.

    Parameters
    ----------
    eta : float
        Learning rate
    n_iterations : int
        No of passes over the training set

    Attributes
    ----------
    w_ : array-like, shape = [n_features, 1]
        Weights after fitting the model
    cost_ : list
        Total error of the model after each iteration
    """

    def __init__(self, eta=0.05, n_iterations=1000):
        self.eta = eta
        self.n_iterations = n_iterations

    def fit(self, x, y):
        """Fit the training data

        Parameters
        ----------
        x : array-like, shape = [n_samples, n_features]
            Training samples
        y : array-like, shape = [n_samples, n_target_values]
            Target values

        Returns
        -------
        self : object
        """
        self.cost_ = []
        self.w_ = np.zeros((x.shape[1], 1))
        m = x.shape[0]

        for _ in range(self.n_iterations):
            y_pred = np.dot(x, self.w_)
            residuals = y_pred - y
            # vector of partial derivatives of the cost w.r.t. each weight
            gradient_vector = np.dot(x.T, residuals)
            self.w_ -= (self.eta / m) * gradient_vector
            cost = np.sum((residuals ** 2)) / (2 * m)
            self.cost_.append(cost)
        return self

    def predict(self, x):
        """Predicts the value after the model has been trained.

        Parameters
        ----------
        x : array-like, shape = [n_samples, n_features]
            Test samples

        Returns
        -------
        Predicted value
        """
        return np.dot(x, self.w_)
```
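Putting the fit together end to end on the same random dataset generated earlier, the following standalone loop mirrors the class body (with a prepended x₀ = 1 column so that w[0] plays the role of the intercept):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

# prepend the x0 = 1 column so w[0] is the intercept
x_b = np.hstack([np.ones((x.shape[0], 1)), x])

eta, n_iterations = 0.05, 1000
m = x_b.shape[0]
w = np.zeros((x_b.shape[1], 1))

for _ in range(n_iterations):
    residuals = np.dot(x_b, w) - y
    w -= (eta / m) * np.dot(x_b.T, residuals)

print('Intercept:', w[0, 0])
print('Slope:', w[1, 0])
```

The learned slope and intercept should land close to the true generating values of 3 and 2 (plus the mean of the uniform noise).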

```
Slope: [2.89114079]
y-intercept: [2.58109277]
```

The complete fitting process, with comments, is implemented in LRToGifPro.py.

`Evaluating the model's performance`

`RMSE` is the square root of the mean of the squared residuals. RMSE is defined as

RMSE = √( (1/m) · Σᵢ (ŷᵢ − yᵢ)² )

```
# sum of squared errors
sse = np.sum((y_pred - y_actual)**2)

# root mean squared error
# m is the number of training examples
rmse = np.sqrt(sse / m)
```
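A quick self-contained check of the formula on made-up numbers:

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
m = y_actual.shape[0]

sse = np.sum((y_pred - y_actual) ** 2)  # 1 + 0 + 4 = 5
rmse = np.sqrt(sse / m)
print(rmse)                             # sqrt(5/3) ≈ 1.291
```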

For our dataset, the RMSE score is 2.764182038967211.
The `R²` score, or `coefficient of determination`, indicates how much of the total variance of the dependent variable is explained by the least-squares regression. R² is given by the following formula:

R² = 1 − SSᵣ / SSₜ

SSᵣ is the sum of squared residuals and SSₜ is the total sum of squares.

```
# sum of squares of residuals
ssr = np.sum((y_pred - y_actual)**2)
# total sum of squares
sst = np.sum((y_actual - np.mean(y_actual))**2)
# R2 score
r2_score = 1 - (ssr/sst)
```
```
SSₜ - 69.47588572871659
SSᵣ - 7.64070234454893
R² score - 0.8900236785122296
```
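The same made-up numbers from the RMSE sketch show how SSᵣ and SSₜ combine into R² (illustrative data only):

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

ssr = np.sum((y_pred - y_actual) ** 2)             # 1 + 0 + 4 = 5
sst = np.sum((y_actual - np.mean(y_actual)) ** 2)  # 4 + 0 + 4 = 8
r2 = 1 - ssr / sst
print(r2)                                          # 1 - 5/8 = 0.375
```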

Scikit-learn implementation:
scikit-learn is a very powerful data-science library. The complete code is as follows:

```# imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# generate random data-set
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

# scikit-learn implementation

# Model initialization
regression_model = LinearRegression()
# Fit the data(train the model)
regression_model.fit(x, y)
# Predict
y_predicted = regression_model.predict(x)

# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

# plotting values

# data points
plt.scatter(x, y, s=10)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='r')
plt.show()
```

```
Slope: [[2.93655106]]
y-intercept: [2.55808002]

R² score: 0.9038655568672764
```
