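This post explores how to implement a **linear regression** model with the deep learning frameworks MXNet or TensorFlow, using the Kaggle dataset [USA_Housing](https://www.kaggle.com/vedavyasv/usa-housing) as a regression task to predict house prices.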

Regression can also be done with scikit-learn; for details see [The principles of linear regression and a scikit-learn implementation](https://www.jianshu.com/p/a65c3965e290).

## Loading the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
name = '../dataset/USA_Housing.csv'
dataset = pd.read_csv(name)

train = dataset.iloc[:3000,:]
test = dataset.iloc[3000:,:]

print(train.shape)
print(test.shape)

(3000, 7)
(2000, 7)


Check whether there are any missing values:

In [3]:
print(np.unique(train.isnull().any()))
print(np.unique(test.isnull().any()))

[False]
[False]
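
The check above only reports whether any column contains a missing value. As a minimal sketch, a per-column count can be obtained as follows; the `demo` frame and its values are made up for illustration and are not taken from USA_Housing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
demo = pd.DataFrame({
    'Avg. Area Income': [60000.0, np.nan, 75000.0],
    'Price': [1.0e6, 1.2e6, np.nan],
    'Address': ['a', 'b', 'c'],
})

# isnull() marks missing cells; sum() counts them column by column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True if at least one cell anywhere in the frame is missing.
has_missing = bool(demo.isnull().values.any())
print(has_missing)
```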


In [4]:
dataset.columns  # list all column (feature) names

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

We ignore the `'Address'` column and use the features `'Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population'` to predict `'Price'`.

In [5]:
features_column = [
    name for name in dataset.columns if name not in ['Price', 'Address']
]
label_column = ['Price']

x_train = train[features_column]
y_train = train[label_column]
x_test = test[features_column]
y_test = test[label_column]

To better understand how linear regression works, we first implement it by hand:

## Training the model with MXNet

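A linear regression model is a single-layer neural network. When training a neural network, the data should be standardized so that all features are on a common scale.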

In [7]:
from sklearn.preprocessing import scale

from mxnet import nd, autograd
from mxnet.gluon import nn

Standardize the features:

In [8]:
x_train = scale(x_train)
x_test = scale(x_test)
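
The two calls above standardize each split with its own mean and standard deviation. A common alternative is to compute the statistics on the training split only and reuse them for the test split. The sketch below shows that idea in plain NumPy on made-up data; `standardize_with_train_stats` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def standardize_with_train_stats(train, test):
    """Hypothetical helper: scale both splits with the training mean/std."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0       # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std


# Made-up data standing in for the real feature matrices.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

train_std, test_std = standardize_with_train_stats(toy_train, toy_test)
print(train_std.mean(axis=0), train_std.std(axis=0))
```

With this approach the test features generally do not have exactly zero mean and unit variance, which is expected: they are measured on the scale learned from the training data.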

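To manage the dataset more conveniently, we first define a unified API for handling it: `Loader`. So that it can interface with different deep learning frameworks, `Loader` is restricted to yielding NumPy arrays.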

In [9]:
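class Loader(dict):
    """
    Methods
    ========
    L is an instance of this class
    len(L):: returns the number of batches
    iter(L):: the data iterator

    Return
    ========
    An iterable yielding NumPy arrays
    """

    def __init__(self, batch_size, X, Y=None, shuffle=True, name=None):
        '''
        X and Y are NumPy-like arrays; HDF5 datasets also work
        '''
        if name is not None:
            self.name = name
        self.X = np.asanyarray(X[:])
        if Y is None:
            # No labels were provided.
            self.Y = None
        else:
            self.Y = np.asanyarray(Y[:])
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.nrows = self.X.shape[0]

    def __len__(self):
        # Number of batches, rounding up to count a final partial batch.
        return -(-self.nrows // self.batch_size)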

    def __iter__(self):
        idx = np.arange(self.nrows)

        if self.shuffle:
            np.random.shuffle(idx)

        for k in range(0, self.nrows, self.batch_size):
            K = idx[k:min(k + self.batch_size, self.nrows)]
            if self.Y is None:
                yield np.take(self.X, K, 0)
            else:
                yield np.take(self.X, K, 0), np.take(self.Y, K, 0)
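
The batching logic in `__iter__` can be tried in isolation. The sketch below re-creates only the index shuffling and slicing in a hypothetical stand-alone function, `iter_batches` (not part of `Loader`), and shows that 10 rows with a batch size of 4 produce batches of 4, 4 and 2 rows.

```python
import numpy as np


def iter_batches(X, Y, batch_size, shuffle=True, seed=None):
    """Hypothetical stand-alone version of the slicing done in Loader.__iter__."""
    nrows = X.shape[0]
    idx = np.arange(nrows)
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for k in range(0, nrows, batch_size):
        K = idx[k:k + batch_size]   # slicing past the end is safe in NumPy
        yield np.take(X, K, 0), np.take(Y, K, 0)


# Made-up data: 10 rows, 2 features, 1 label column.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10).reshape(10, 1)

batch_sizes = [x.shape[0] for x, _ in iter_batches(X_demo, Y_demo, 4, seed=0)]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller whenever the number of rows is not a multiple of the batch size; every row is still visited exactly once per pass.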

In [11]:
batch_size = 64
# x_train and x_test were already standardized above, so they are passed as-is.
trainset = Loader(batch_size, x_train, y_train)
testset = Loader(batch_size, x_test, y_test)

In [None]:
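def linreg(X, w, b):
    """Linear model: X has shape (batch, n_features), w has shape (n_features, 1)."""
    return nd.dot(X, w) + b


def squared_loss(y_hat, y):
    """Per-sample squared loss, halved so the gradient is simply (y_hat - y)."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


def sgd(params, lr, batch_size):
    """Mini-batch SGD: update each parameter in place with its averaged gradient."""
    for param in params:
        param[:] -= lr * param.grad / batch_size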
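
For reference, the three functions above correspond to the following formulas. The notation is an assumption introduced here for illustration: $\mathcal{B}$ is a mini-batch, $x_i$ is the feature vector of sample $i$ (a row of $X$, written as a column vector), and $\eta$ is the learning rate `lr`.

```latex
\begin{aligned}
\hat{y} &= X w + b
  && \text{(linreg)} \\
\ell_i(w, b) &= \tfrac{1}{2}\left(\hat{y}_i - y_i\right)^2
  && \text{(squared\_loss)} \\
w &\leftarrow w - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_w \ell_i
   = w - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_i \left(\hat{y}_i - y_i\right)
  && \text{(sgd)} \\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\hat{y}_i - y_i\right)
  && \text{(sgd)}
\end{aligned}
```

The division by `batch_size` in `sgd` plays the role of $1/|\mathcal{B}|$, so it matches these formulas exactly only when the batch is full.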

In [None]:
n_features = x_train.shape[1]
w = nd.random_normal(shape=(n_features, 1))
b = nd.zeros([1])
params = [w, b]

for param in params:
    param.attach_grad()

lr = .03
epochs = 10
net = linreg
loss = squared_loss

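def evaluate(net, w, b, testset):
    """Return the loss averaged over the batches of testset."""
    test_l, n_batches = 0, 0
    for x, y in testset:
        # Loader yields NumPy arrays; convert them to NDArray for MXNet.
        x, y = nd.array(x), nd.array(y)
        out = net(x, w, b)
        L = loss(out, y)
        test_l += L.mean().asscalar()
        n_batches += 1
    return test_l / n_batches

for epoch in range(epochs):
    train_l, n_batches = 0, 0
    for x, y in trainset:
        # Loader yields NumPy arrays; convert them to NDArray for MXNet.
        x, y = nd.array(x), nd.array(y)
        with autograd.record():
            out = net(x, w, b)
            L = loss(out, y)
        L.backward()  # L is not a scalar, so its elements are summed first
        sgd([w, b], lr, batch_size)
        train_l += L.mean().asscalar()
        n_batches += 1
    train_l /= n_batches  # average over batches so it is comparable to test_l
    test_l = evaluate(net, w, b, testset)
    print(f'Epoch {epoch}, train loss {train_l},\ttest loss {test_l}')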
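
As a sanity check for a hand-written SGD loop like the one above, the same update rule can be run in plain NumPy on synthetic data and compared with the closed-form least-squares solution. This is a sketch under stated assumptions, not the MXNet code itself: the data, `true_w` and `true_b` are made up, the learning rate and batch size mirror the values in the text, and the number of epochs is raised to 20 so that the toy run gets close to the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = X @ true_w + true_b + small noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = np.array([[2.0], [-3.4], [0.5], [1.0], [4.2]])
true_b = 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=(n, 1))

w = rng.normal(scale=0.01, size=(d, 1))
b = np.zeros(1)
lr, epochs, batch_size = 0.03, 20, 64

for epoch in range(epochs):
    idx = rng.permutation(n)
    for k in range(0, n, batch_size):
        K = idx[k:k + batch_size]
        xb, yb = X[K], y[K]
        err = xb @ w + b - yb              # derivative of the halved squared loss w.r.t. y_hat
        w -= lr * (xb.T @ err) / len(K)    # averaged gradient step for w
        b -= lr * err.sum() / len(K)       # averaged gradient step for b

# Closed-form least squares on [X, 1] as a reference solution.
A = np.hstack([X, np.ones((n, 1))])
ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.abs(w - ref[:d]).max(), abs(b[0] - ref[d, 0]))
```

If the two printed differences are small, the update rule is doing what it should; a large gap usually points to a sign error, a missing division by the batch size, or unscaled features.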