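This post explores how to implement a **linear regression** model with the deep learning framework Gluon, using the Kaggle dataset [USA_Housing](https://www.kaggle.com/vedavyasv/usa-housing) as a regression task to predict house prices.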

Regression can also be done with scikit-learn; for details, see [Principles of linear regression and a scikit-learn implementation](https://www.jianshu.com/p/a65c3965e290).

## Loading the data

In [54]:
import pandas as pd
import numpy as np

name = '../dataset/USA_Housing.csv'
dataset = pd.read_csv(name)

train = dataset.iloc[:3000,:]
test = dataset.iloc[3000:,:]

features_column = [
    name for name in dataset.columns if name not in ['Price', 'Address']
]
label_column = ['Price']

x_train = train[features_column]
y_train = train[label_column]
x_test = test[features_column]
y_test = test[label_column]

## Standardizing the data

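A linear regression model is a single-layer neural network. When training a neural network, the data should be standardized so that all features are on a common scale.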

In [55]:
from sklearn.preprocessing import scale

x_train_s = scale(x_train)
x_test_s = scale(x_test)
y_train_s = scale(y_train)
y_test_s = scale(y_test)
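
As a rough illustration of what the standardization above computes, here is a NumPy-only sketch. The helper `zscore` and the toy arrays are hypothetical and not part of the original notebook. With its default settings, `scale` subtracts each column's mean and divides by its population standard deviation (`ddof=0`), which is what the helper does.

```python
import numpy as np

def zscore(a, mean=None, std=None):
    """Column-wise standardization: (a - mean) / std.

    If mean/std are not given, they are computed from `a` itself.
    """
    a = np.asarray(a, dtype=float)
    if mean is None:
        mean = a.mean(axis=0)
    if std is None:
        std = a.std(axis=0)  # population std (ddof=0)
    return (a - mean) / std, mean, std

rng = np.random.default_rng(0)
# Toy stand-ins for x_train / x_test with very different column scales.
toy_train = rng.normal(loc=[50.0, 5.0], scale=[10.0, 0.5], size=(200, 2))
toy_test = rng.normal(loc=[50.0, 5.0], scale=[10.0, 0.5], size=(50, 2))

# Statistics come from the training split and are reused for the test split.
train_s, mu, sigma = zscore(toy_train)
test_s, _, _ = zscore(toy_test, mu, sigma)

print(train_s.mean(axis=0), train_s.std(axis=0))
```

The cell above standardizes the training and test splits independently. The sketch instead reuses the training statistics for the test split, a common alternative that puts both splits through exactly the same transformation. Switching to it would change the numbers reported later in this post.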

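To manage the dataset more conveniently, we first define a unified API for dataset handling: `Loader`. So that it can be used with different deep learning frameworks, `Loader` is restricted to yielding NumPy arrays.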

In [56]:
class Loader(dict):
    """
    Methods
    ========
    L is an instance of this class
    len(L) :: returns the number of samples
    iter(L) :: the data iterator

    Return
    ========
    An iterable of NumPy arrays
    """

    def __init__(self, batch_size, X, Y=None, shuffle=True, name=None):
        '''
        X and Y are NumPy-like arrays; they may be HDF5 datasets
        '''
        if name is not None:
            self.name = name
        self.X = np.asanyarray(X[:])
        if Y is None:
            # print('No labels!')
            self.Y = None
        else:
            self.Y = np.asanyarray(Y[:])
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.nrows = self.X.shape[0]

    def __iter__(self):
        idx = np.arange(self.nrows)

        if self.shuffle:
            np.random.shuffle(idx)

        for k in range(0, self.nrows, self.batch_size):
            K = idx[k:min(k + self.batch_size, self.nrows)]
            if self.Y is None:
                yield np.take(self.X, K, 0)
            else:
                yield np.take(self.X, K, 0), np.take(self.Y, K, 0)
                
    def __len__(self):
        return self.nrows
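
To make the batching behaviour concrete, the sketch below reproduces the same index-shuffling and slicing logic as `Loader.__iter__` in a standalone function. The name `iter_minibatches`, its `seed` argument, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def iter_minibatches(X, Y, batch_size, shuffle=True, seed=None):
    """Yield (X_batch, Y_batch) pairs, mirroring the slicing in Loader.__iter__."""
    X = np.asanyarray(X)
    Y = np.asanyarray(Y)
    idx = np.arange(X.shape[0])
    if shuffle:
        # Shuffle the row indices, not the data, so X and Y stay aligned.
        np.random.default_rng(seed).shuffle(idx)
    for k in range(0, len(idx), batch_size):
        K = idx[k:k + batch_size]  # the last slice may be shorter
        yield np.take(X, K, 0), np.take(Y, K, 0)

# 10 toy samples: row i of X is [2i, 2i + 1], and its label is i.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10).reshape(10, 1)

batches = list(iter_minibatches(X, Y, batch_size=4, seed=0))
sizes = [xb.shape[0] for xb, _ in batches]
print(sizes)  # [4, 4, 2]
```

The final batch is simply whatever remains. With the 3000 training rows and `batch_size = 512` used below, one pass yields five batches of 512 rows and a last batch of 440. Note also that `len(L)` on a `Loader` returns the number of samples, not the number of batches.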

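We can now build a training set in the form the deep learning framework expects (the test set is passed to the network later as a single array):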

In [100]:
batch_size = 512
trainset = Loader(batch_size, x_train_s, y_train_s)
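
The Gluon model in this post is a single `Dense(1)` layer trained with squared loss, which is linear regression: `out = X @ w + b`. The following framework-free sketch shows the same idea in NumPy. It uses synthetic data and plain mini-batch gradient descent with a fixed learning rate instead of Adadelta, and every name and hyperparameter in it is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known parameters.
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([[2.0], [-1.0], [0.5]])
true_b = 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=(n, 1))

# Parameters of the one-unit linear layer.
w = np.zeros((d, 1))
b = 0.0
lr, batch_size = 0.1, 100

for epoch in range(50):
    idx = rng.permutation(n)
    for k in range(0, n, batch_size):
        K = idx[k:k + batch_size]
        Xb, yb = X[K], y[K]
        err = Xb @ w + b - yb  # prediction error, shape (batch, 1)
        # Gradients of the batch mean of 0.5 * err**2.
        grad_w = Xb.T @ err / len(K)
        grad_b = err.mean()
        w -= lr * grad_w
        b -= lr * grad_b

print(w.ravel(), b)
```

On this toy problem the learned `w` and `b` end up close to `true_w` and `true_b`. The sketch divides by the actual batch length `len(K)`, so a shorter final batch is averaged correctly.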

In [109]:
from mxnet.gluon import nn
from mxnet import init
from mxnet import nd, autograd
from mxnet.gluon import loss as gloss
from mxnet import gluon


net = nn.Sequential()
net.add(nn.Dense(1))
loss = gloss.L2Loss()  # squared loss, also called L2-norm loss

net.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(net.collect_params(), 'adadelta', {'rho': 0.9})

num_epochs = 100
for epoch in range(1, num_epochs + 1):
    for X, y in trainset:
        X = nd.array(X)
        y = nd.array(y)
        with autograd.record():
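            # The predicted values are large, so we can map them to a smaller scale.
            # Note: the ReLU below is applied only during training; the
            # evaluation after each epoch uses the raw output of net.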
            out = net(X)
            out = nd.relu(out)
            l = loss(out, y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(nd.array(x_test_s)), nd.array(y_test_s))
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 0.420815
epoch 2, loss: 0.335122
epoch 3, loss: 0.273987
epoch 4, loss: 0.230418
epoch 5, loss: 0.197117
epoch 6, loss: 0.171459
epoch 7, loss: 0.151247
epoch 8, loss: 0.134480
epoch 9, loss: 0.119742
epoch 10, loss: 0.106986
epoch 11, loss: 0.095315
epoch 12, loss: 0.084960
epoch 13, loss: 0.075951
epoch 14, loss: 0.068014
epoch 15, loss: 0.061575
epoch 16, loss: 0.056335
epoch 17, loss: 0.052137
epoch 18, loss: 0.049383
epoch 19, loss: 0.046681
epoch 20, loss: 0.044657
epoch 21, loss: 0.043820
epoch 22, loss: 0.042778
epoch 23, loss: 0.042254
epoch 24, loss: 0.042134
epoch 25, loss: 0.041724
epoch 26, loss: 0.041711
epoch 27, loss: 0.041550
epoch 28, loss: 0.041608
epoch 29, loss: 0.041457
epoch 30, loss: 0.041672
epoch 31, loss: 0.041680
epoch 32, loss: 0.041552
epoch 33, loss: 0.041596
epoch 34, loss: 0.041865
epoch 35, loss: 0.041722
epoch 36, loss: 0.041452
epoch 37, loss: 0.041614
epoch 38, loss: 0.041962
epoch 39, loss: 0.041465
epoch 40, loss: 0.041769
epoch 41,

In [110]:
from sklearn.metrics import r2_score

out = net(nd.array(x_test_s)).asnumpy()
r2_score(y_test_s, out)

0.9168694624481616
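
The logged loss and this R² score are directly related. Gluon's `L2Loss` computes `0.5 * (pred - label)**2` per sample. The test labels were standardized on their own, so they have mean 0 and variance 1, and therefore R² = 1 - 2 × (mean loss). A logged loss of about 0.0416 thus corresponds to R² ≈ 0.917. The sketch below checks this identity in NumPy on made-up data; the arrays are hypothetical stand-ins, not the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels, standardized to mean 0 and population variance 1.
y_raw = rng.normal(loc=3.0, scale=5.0, size=(500, 1))
y = (y_raw - y_raw.mean()) / y_raw.std()

# Hypothetical predictions: the labels plus some noise.
pred = y + 0.3 * rng.normal(size=y.shape)

# Mean of 0.5 * squared error, the quantity printed in the training loop.
half_mse = np.mean(0.5 * (pred - y) ** 2)

# R^2 from its definition: 1 - SS_res / SS_tot.
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(r2, 1.0 - 2.0 * half_mse)  # the two values agree
```

The identity holds only because the labels have exactly unit variance. On unstandardized labels, the loss would first have to be divided by the label variance.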