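From the plots produced in the numpy and matplotlib analysis above, only CRIM, RM and LSTAT appear to be linearly related to the house price  
So CRIM, RM and LSTAT are chosen as the independent variables, and MEDV as the dependent variable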

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('./boston.csv', encoding = 'gbk')

In [4]:
df

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |


## Selecting the independent and dependent variables

Drop the other, unrelated features and keep only these four variables

In [5]:
df = df[['CRIM', 'RM', 'LSTAT', 'MEDV']]
df

|  | CRIM | RM | LSTAT | MEDV |
|---|---|---|---|---|
| 0 | 0.00632 | 6.575 | 4.98 | 24.0 |
| 1 | 0.02731 | 6.421 | 9.14 | 21.6 |
| 2 | 0.02729 | 7.185 | 4.03 | 34.7 |
| 3 | 0.03237 | 6.998 | 2.94 | 33.4 |
| 4 | 0.06905 | 7.147 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... |
| 501 | 0.06263 | 6.593 | 9.67 | 22.4 |
| 502 | 0.04527 | 6.120 | 9.08 | 20.6 |
| 503 | 0.06076 | 6.976 | 5.64 | 23.9 |
| 504 | 0.10959 | 6.794 | 6.48 | 22.0 |


## Splitting the dataset

Use slicing to split the data into a training set and a test set at a ratio of 8:2

In [6]:
ratio = int(0.8 * len(df))
df_Train = df[:ratio]
df_Test = df[ratio:]
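
The slice above takes the first 80% of the rows in file order. If the rows happen to be ordered in some systematic way, the two sets may not follow the same distribution, so a shuffled split is a common alternative. The following is only a minimal sketch on a made-up frame (`toy`, `order`, `split` and the seed 42 are assumptions for illustration, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the four-column DataFrame
toy = pd.DataFrame({
    'CRIM': np.arange(10, dtype=float),
    'RM': np.arange(10, 20, dtype=float),
    'LSTAT': np.arange(20, 30, dtype=float),
    'MEDV': np.arange(30, 40, dtype=float),
})

rng = np.random.default_rng(42)      # fixed seed (an assumption) for reproducibility
order = rng.permutation(len(toy))    # shuffled row positions
split = int(0.8 * len(toy))          # position where the test set starts

toy_train = toy.iloc[order[:split]]
toy_test = toy.iloc[order[split:]]

print(toy_train.shape, toy_test.shape)  # (8, 4) (2, 4)
```

The rest of this notebook keeps the unshuffled slice, so the recorded outputs refer to that split.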

Assign y first, then use the drop function to remove y and leave x

In [7]:
y_Train = df_Train['MEDV']
y_Test = df_Test['MEDV']

axis is the dimension: 0 means rows and 1 means columns; here the whole MEDV column is removed

In [8]:
df_Train = df_Train.drop(['MEDV'], axis = 1)
df_Test = df_Test.drop(['MEDV'], axis = 1)

Define the independent variable x, which contains three variables: CRIM, RM and LSTAT

In [9]:
x_Train = df_Train
x_Test = df_Test

Next, check whether the split succeeded

In [10]:
print(x_Train.shape, y_Test.shape)

(404, 3) (102,)


## Linear regression code

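Convert the DataFrames to arrays, and copy X into X_calculate, which is used for the computation

In [11]:
y = y_Train.to_numpy()
X = x_Train.to_numpy()
X_calculate = X.copy()  # copy, so the in-place standardization below does not also modify X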

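Standardize each feature separately, i.e. (value - mean) / standard deviation  
Then reshape the result into a two-dimensional array (it already has shape (404, 3), so the reshape leaves it unchanged)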

In [12]:
for i in range(X_calculate.shape[1]):
    mu = np.average(X_calculate[:,i])
    sigma = np.std(X_calculate[:,i])
    X_calculate[:,i] = (X_calculate[:,i] - mu)/sigma
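
The loop above can also be written without an explicit loop, keeping the per-column mean and standard deviation so that the same transformation can be applied later to other data such as the test set. A minimal sketch on a made-up array (`X_demo`, `X_new` and the vector-valued `mu` and `sigma` are illustrative names, not the notebook's variables):

```python
import numpy as np

# Hypothetical small feature matrix: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mu = X_demo.mean(axis=0)       # per-column mean
sigma = X_demo.std(axis=0)     # per-column population standard deviation (ddof=0, as np.std defaults to)
X_std = (X_demo - mu) / sigma  # broadcasting applies mu and sigma column by column

# New data is transformed with the SAME mu and sigma, not with its own statistics
X_new = np.array([[2.0, 30.0]])
X_new_std = (X_new - mu) / sigma

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_new_std)
```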

In [13]:
X_calculate = X_calculate.reshape(404, 3)

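Here an array of all ones is created, to be added as an extra column; it pairs with the intercept term and makes the matrix computation easier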

In [14]:
a = np.ones(X_calculate.shape[0]).reshape(-1, 1)
a.shape

(404, 1)

Concatenate the two arrays

In [15]:
X_calculate = np.concatenate((a, X_calculate), axis = 1)

Check the shape of the array

In [16]:
X_calculate.shape

(404, 4)

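m is the total number of samples; n is the number of columns of X_calculate, i.e. the three features plus the added column of ones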

In [17]:
m = X_calculate.shape[0]
n = X_calculate.shape[1]

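Define the learning rate (step size) and the number of iterations  
and draw the initial theta uniformly at random between 0 and 1 as the starting point for the fit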

In [18]:
alpha = 0.001
times = 1000
theta = np.random.uniform(0, 1, [1, X_calculate.shape[1]])
theta = theta.reshape(-1, 1)

Use gradient descent to update theta iteratively, and finally fit the intercept theta[0] and **the theta value for each independent variable**
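
For reference, the standard batch gradient-descent update for a mean-squared-error loss can be written as follows. This is the textbook form, stated here as an assumption about the formula the next cell is meant to implement; $x^{(i)}$ denotes row $i$ of X_calculate (including the leading 1) and $y^{(i)}$ the matching target:

$$
\theta_j := \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^{\top} x^{(i)} \right) x_j^{(i)}, \qquad j = 0, 1, \dots, n-1
$$

The sum runs over a single index $i$, so each residual is paired only with the feature value from its own row.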

In [19]:
for num in range(times):
    for j in range(n):
        theta[j] = theta[j] + (alpha/m)*(np.sum((y- np.dot(X_calculate, theta))* X_calculate[:,j].reshape(-1, 1)))

In [20]:
theta

array([[ 2.41757426e+01],
       [-8.48341054e-17],
       [ 4.57996841e-16],
       [ 2.37765197e-16]])
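
The three feature coefficients above are of order 1e-16, so the fitted model is close to the constant 24.18. One possible explanation lies in the array shapes: `y` comes from `y_Train.to_numpy()` and has shape (404,), while `np.dot(X_calculate, theta)` has shape (404, 1), so their difference broadcasts to shape (404, 404) rather than giving one residual per sample. The sketch below reproduces the effect on small made-up arrays (`y_flat`, `pred` and `x_col` are illustrative names, not notebook variables). It also shows that, for a zero-mean column, the summed product no longer depends on the targets:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
y_flat = rng.normal(size=m)        # shape (m,), like y_Train.to_numpy()
pred = rng.normal(size=(m, 1))     # shape (m, 1), like np.dot(X_calculate, theta)
x_col = rng.normal(size=(m, 1))
x_col = (x_col - x_col.mean()) / x_col.std()   # zero-mean column, like a standardized feature

residual_broadcast = y_flat - pred               # broadcasts to (m, m)
residual_column = y_flat.reshape(-1, 1) - pred   # stays (m, 1)

print(residual_broadcast.shape)  # (5, 5)
print(residual_column.shape)     # (5, 1)

# With the broadcast residual, shifting every target by 100 leaves the sum unchanged
s1 = np.sum((y_flat - pred) * x_col)
s2 = np.sum((y_flat + 100.0 - pred) * x_col)
print(np.isclose(s1, s2))        # True
```

Reshaping `y` with `y.reshape(-1, 1)` before the loop would keep the residual at shape (404, 1). The outputs recorded in this notebook were produced without that change.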

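Update X and y to the test-set data

Then transpose X (this X does not contain the added column of ones, only the three independent variables)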

In [21]:
y = y_Test.to_numpy()
X = x_Test.to_numpy()

In [22]:
X = X.transpose()

Obtain the array of predicted values y_Test_prediction

In [23]:
y_Test_prediction = theta[0] + theta[1] * X[0] + theta[2] * X[1] + theta[3] * X[2]

Compute the MSE as the loss

In [24]:
LOSS2 = y - y_Test_prediction
np.sum(np.power(LOSS2, 2))/len(y)

93.31498368575947
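
Note that theta was fitted on standardized training features, while the prediction above uses the raw test features. A possible way to keep the two consistent is to standardize the test features with the training mean and standard deviation before predicting. The following is a sketch under that assumption, with made-up numbers; `predict` and `mse` are hypothetical helper names, not functions from this notebook:

```python
import numpy as np

def predict(X_raw, theta, mu, sigma):
    """Standardize raw features with training statistics, then apply the linear model."""
    X_std = (X_raw - mu) / sigma
    X_design = np.hstack([np.ones((X_std.shape[0], 1)), X_std])  # prepend the column of ones
    return X_design @ theta                                       # shape (n_samples, 1)

def mse(y_true, y_pred):
    """Mean squared error; both inputs are flattened to avoid accidental broadcasting."""
    diff = np.ravel(y_true) - np.ravel(y_pred)
    return float(np.mean(diff ** 2))

# Hypothetical values for illustration only
mu = np.array([1.0, 2.0, 3.0])        # training means
sigma = np.array([2.0, 2.0, 2.0])     # training standard deviations
theta = np.array([[10.0], [1.0], [-1.0], [0.5]])
X_test_demo = np.array([[1.0, 2.0, 3.0],
                        [3.0, 4.0, 5.0]])
y_test_demo = np.array([10.0, 11.0])

y_hat = predict(X_test_demo, theta, mu, sigma)
print(mse(y_test_demo, y_hat))  # 0.125
```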

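After repeatedly tuning the learning rate and the number of iterations, the MSE was found to be lowest with these settings (I still feel that something is wrong somewhere, but I really cannot find it; I checked the gradient descent formula and it seems correct T_T)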

The sample size feels fairly large; maybe a different loss function would work better?
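
Before changing the loss function, one way to check a gradient-descent implementation is to run it on synthetic data with known coefficients and compare the result with a least-squares solver. The sketch below does this with `np.linalg.lstsq`. The data, the coefficients in `true_theta` and the learning rate 0.1 are all made up for the check, and the update is written in vectorized form with the targets kept as a column vector:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X_raw = rng.normal(size=(m, 3))
true_theta = np.array([[24.0], [-1.0], [3.0], [-4.0]])   # made-up coefficients

# Standardize the features and prepend a column of ones
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_design = np.hstack([np.ones((m, 1)), X_std])

# Targets as a column vector of shape (m, 1), with a little noise
y_col = X_design @ true_theta + rng.normal(scale=0.1, size=(m, 1))

alpha = 0.1     # assumed value, larger than the notebook's 0.001 so 1000 steps are enough here
times = 1000
theta = np.zeros((4, 1))
for _ in range(times):
    residual = y_col - X_design @ theta                     # shape (m, 1)
    theta = theta + (alpha / m) * (X_design.T @ residual)   # simultaneous update of all theta_j

# Reference solution from the least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_design, y_col, rcond=None)
print(np.allclose(theta, theta_lstsq, atol=1e-6))
```

If the two solutions agree on synthetic data but the fit on the real data still looks off, the array shapes in the real pipeline are a reasonable next thing to inspect.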