# Linear Model

## Model
$$\hat{y} = w_1x_1 + w_2x_2 + \dots + w_nx_n + b\tag{1} $$

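## Loss Function
$$ L = \frac{1}{2}\sum_{i=1}^n(y_i - \hat{y}_i)^2\tag{2} $$

Here the sum runs over the $n$ training samples and $\hat{y}_i$ is the prediction for sample $i$ (in equation (1), $n$ denoted the number of features).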
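
As a small worked illustration of equations (1) and (2), the sketch below evaluates the model and the loss on two hand-picked samples. The sample values and the names `X_toy`, `y_toy`, `w_toy`, `b_toy` are assumptions made up for this example only.

```python
import numpy as np

# Two toy samples with two features each (illustrative values only).
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
y_toy = np.array([[10.0], [18.0]])

# Parameters of the linear model, equation (1).
w_toy = np.array([[1.4], [2.3]])
b_toy = 4.7

# Predictions: y_hat = X w + b, here roughly [10.7, 18.1].
y_hat_toy = X_toy @ w_toy + b_toy

# Loss, equation (2): half the sum of squared residuals, roughly 0.25 here.
L_toy = 0.5 * np.sum((y_toy - y_hat_toy) ** 2)
print(y_hat_toy.ravel(), L_toy)
```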

Prepare the data: there are 100 records, each with two features $x_1$ and $x_2$.

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd

In [2]:
true_w = np.array([[1.4], [2.3]], dtype = np.float64)
true_b = 4.7

In [3]:
X = np.random.normal(0, 1, (100,2))
zao = np.random.normal(0, 0.001,(100,1))
y = np.dot(X, true_w) + true_b + zao

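Now we have the data: the features $X$ and the corresponding target values $y$. How do we obtain the parameters $w$ and $b$ of the linear model?  
If we write $\bar{w}=\left[\begin{matrix}w \\ b \end{matrix}\right]$ and $\bar{X}=\left[\begin{matrix}X & 1 \end{matrix}\right]$, statistics gives the closed-form least-squares solution $\hat{\bar{w}}^*=(\bar{X}^\intercal\bar{X})^{-1}(\bar{X}^\intercal)y$. However, this approach has significant limitations:  
1. $\bar{X}^\intercal\bar{X}$ must be invertible
2. When both the number of samples and the feature dimension are large, forming the product is expensive and inverting it is even more so  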
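
The closed-form solution can be sketched in NumPy as follows. This is a minimal illustration, not part of the original notebook: the seed, the regenerated data and the names `X_bar`, `w_bar` and `w_bar_lstsq` are assumptions. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and also shows `np.linalg.lstsq`, which still returns a solution when $\bar{X}^\intercal\bar{X}$ is singular.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Synthetic data mirroring the setup above.
true_w = np.array([[1.4], [2.3]])
true_b = 4.7
X = rng.normal(0, 1, (100, 2))
y = X @ true_w + true_b + rng.normal(0, 0.001, (100, 1))

# Append a column of ones so that the bias is learned as an extra weight.
X_bar = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve (X_bar^T X_bar) w_bar = X_bar^T y without computing the inverse.
w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)

# Least-squares solver that also copes with rank-deficient inputs.
w_bar_lstsq, *_ = np.linalg.lstsq(X_bar, y, rcond=None)

print(w_bar.ravel())  # first two entries estimate w, the last one estimates b
```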

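Is there another way? Yes: treat it as an optimization problem. The loss is a convex function of the parameters, so gradient descent with a suitably small learning rate converges to the global optimum.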
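
Convergence does depend on the learning rate. For a quadratic loss like this one, gradient descent on the mean loss converges when the learning rate is below $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $\bar{X}^\intercal\bar{X}/N$. The sketch below checks this numerically; the data, the helper `run_gd` and the two test learning rates are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
y = X @ np.array([[1.4], [2.3]]) + 4.7  # noise-free targets for a clean check
X_bar = np.hstack([X, np.ones((100, 1))])
N = X_bar.shape[0]

# Hessian of the mean loss and the resulting learning-rate threshold.
H = X_bar.T @ X_bar / N
lr_max = 2.0 / np.linalg.eigvalsh(H).max()

def run_gd(lr, steps=200):
    """Run plain gradient descent from zero and return the final mean loss."""
    w_bar = np.zeros((3, 1))
    for _ in range(steps):
        grad = X_bar.T @ (X_bar @ w_bar - y) / N
        w_bar = w_bar - lr * grad
    return 0.5 * np.mean((X_bar @ w_bar - y) ** 2)

loss_ok = run_gd(0.5 * lr_max)   # below the threshold: the loss shrinks
loss_bad = run_gd(1.1 * lr_max)  # above the threshold: the loss blows up
print(lr_max, loss_ok, loss_bad)
```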

First, compute the partial derivatives of the loss with respect to the parameters:  
$$\frac{\partial{L}}{\partial{w_1}}= \sum_{i=1}^n(\hat{y}_i-y_i)x_{i1}\tag{3}$$
$$\frac{\partial{L}}{\partial{b}}= \sum_{i=1}^n(\hat{y}_i-y_i)\tag{4}$$

In this example, if we define $\vec{w}=\left[\begin{matrix}w_1 \\ w_2 \\ b \end{matrix}\right]$, then  
$$\frac{\partial{L}}{\partial{\vec{w}}}= \sum_{i=1}^n(\hat{y}_i-y_i)\left[\begin{matrix}x_{i1}\\x_{i2}\\1 \end{matrix}\right]\tag{5}$$

In matrix form, with $X_{100\times2}$, $y_{100\times1}$ and $\hat{y}_{100\times1}$, the gradient with respect to $w$ is
$$\frac{\partial{L}}{\partial{\left[\begin{matrix}w_1\\w_2 \end{matrix}\right]}} = X^\intercal(\hat{y}-y)\tag{6}$$  
The gradient with respect to $b$ is, where $\vec{1}$ is a column vector whose elements are all 1,
$$\frac{\partial{L}}{\partial{b}} = {\vec{1}}^\intercal(\hat{y}-y)\tag{7}$$ 
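
One way to gain confidence in hand-derived gradients is a finite-difference check. The sketch below compares the matrix-form gradients with central differences of the summed loss. The data, the step size `eps` and the helper names `total_loss`, `analytic_grad` and `numeric_grad` are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
y = X @ np.array([[1.4], [2.3]]) + 4.7
w = rng.normal(size=(2, 1))
b = rng.normal(size=(1, 1))

def total_loss(X, y, w, b):
    # L = 1/2 * sum_i (y_i - y_hat_i)^2
    return 0.5 * np.sum((y - (X @ w + b)) ** 2)

def analytic_grad(X, y, w, b):
    # Matrix form: X^T (y_hat - y) for w and 1^T (y_hat - y) for b.
    residual = X @ w + b - y
    return X.T @ residual, np.ones((1, X.shape[0])) @ residual

def numeric_grad(X, y, w, b, eps=1e-6):
    # Central differences, perturbing one parameter at a time.
    gw = np.zeros_like(w)
    for k in range(w.shape[0]):
        step = np.zeros_like(w)
        step[k, 0] = eps
        gw[k, 0] = (total_loss(X, y, w + step, b)
                    - total_loss(X, y, w - step, b)) / (2 * eps)
    gb = (total_loss(X, y, w, b + eps) - total_loss(X, y, w, b - eps)) / (2 * eps)
    return gw, np.array([[gb]])

gw_a, gb_a = analytic_grad(X, y, w, b)
gw_n, gb_n = numeric_grad(X, y, w, b)
print(np.max(np.abs(gw_a - gw_n)), np.max(np.abs(gb_a - gb_n)))
```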

First, randomly initialize $w$ and $b$.

In [4]:
w = np.random.randn(2,1)
b = np.random.randn(1)
w,b

(array([[0.15392858],
        [0.22306389]]), array([1.459993]))

Compute the gradients using the formulas derived above.

In [5]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [6]:
get_grad(X, y, w, b)

(array([[-105.49510806],
        [-175.3691483 ]]), array([[-315.65400173]]))

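Deep learning frameworks, however, provide automatic differentiation, which removes the tedium of deriving gradients by hand and is especially useful for models such as neural networks.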

## MXNet

In [7]:
from mxnet import ndarray as nd, autograd
ndX = nd.array(X)
ndy = nd.array(y)
ndzao = nd.array(zao)
ndtrue_w = nd.array(true_w)
ndtrue_b = nd.array([true_b])
ndw = nd.array(w)
ndb = nd.array(b)

In [8]:
# For MXNet to compute gradients of a variable, first allocate memory to store them
ndw.attach_grad()
ndb.attach_grad()

In [9]:
with autograd.record():
    los = (ndy-(nd.dot(ndX, ndw) +  ndb))**2/2

In [10]:
los.backward()

In [11]:
ndw.grad,ndb.grad

(
 [[-105.4951 ]
  [-175.36911]]
 <NDArray 2x1 @cpu(0)>, 
 [-315.654]
 <NDArray 1 @cpu(0)>)

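We can see that the gradients computed by MXNet's automatic differentiation agree with the ones we computed by hand, up to float32 precision.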

## PyTorch

## Training the Model with Gradient Descent

We first define a few helper functions.

Loss function

In [12]:
def loss(X, y, w, b):
    return 1/2*(np.dot(X,w)+b-y)**2

Gradient function

In [13]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [14]:
lr = 0.3
num_epochs = 100
N = X.shape[0]
for i in range(num_epochs):
    w_grad, b_grad = get_grad(X,y,w,b)
    w = w - lr*w_grad/N
    b = b - lr*b_grad/N
    los = loss(X, y, w, b)
    if not i % 10:
        print(i,'th iter,','loss is:', np.mean(los))

0 th iter, loss is: 3.9427338594420513
10 th iter, loss is: 0.006435777080695627
20 th iter, loss is: 1.3393769114546138e-05
30 th iter, loss is: 5.644060828737941e-07
40 th iter, loss is: 5.321654575407989e-07
50 th iter, loss is: 5.320526772619975e-07
60 th iter, loss is: 5.32052130905547e-07
70 th iter, loss is: 5.320521277102273e-07
80 th iter, loss is: 5.320521276900484e-07
90 th iter, loss is: 5.320521276899181e-07


In [15]:
w,b

(array([[1.40002084],
        [2.30000097]]), array([[4.70002848]]))

In [16]:
true_w,true_b

(array([[1.4],
        [2.3]]), 4.7)

## Estimating the Model Parameters with Minibatch Stochastic Gradient Descent  
First, split the dataset into minibatches.

In [17]:
def data_iter1(batch_size, X, y):
    num_examples = len(X)
    indices = list(range(num_examples))
    np.random.shuffle(indices)  # Read the samples in random order.
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield X[j], y[j]  # Integer-array indexing selects the rows of this minibatch.

In [18]:
batch_size = 10
for i,j in data_iter1(batch_size, X, y):
    print(i,j)
    break

[[-1.18685698 -0.05292217]
 [-1.05680667  0.14374236]
 [-1.01158179 -0.38416092]
 [-0.21631229 -0.41169566]
 [ 2.26181203 -1.20254096]
 [-0.571994   -1.12633831]
 [ 0.36500332 -0.76084153]
 [ 0.23405349 -0.85803565]
 [ 0.43666814  0.24758787]
 [-2.38083768 -0.88643925]] [[ 2.91490929]
 [ 3.55090278]
 [ 2.40009139]
 [ 3.45249447]
 [ 5.09940214]
 [ 1.30942582]
 [ 3.46016434]
 [ 3.05369184]
 [ 5.88066803]
 [-0.67195291]]
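
Because the minibatches partition the dataset, the minibatch gradients computed at fixed parameters add up to the full-batch gradient over one pass. Each minibatch step therefore follows a noisy estimate of the same descent direction. The sketch below verifies this. The data, the seed and the helper `batches` are assumptions for illustration, and `get_grad` is restated here only so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
y = X @ np.array([[1.4], [2.3]]) + 4.7
w = rng.normal(size=(2, 1))
b = rng.normal(size=(1, 1))

def get_grad(X, y, w, b):
    # Gradients of the summed squared loss with respect to w and b.
    residual = X @ w + b - y
    return X.T @ residual, residual.sum(axis=0, keepdims=True)

def batches(batch_size, X, y, rng):
    # Yield shuffled, non-overlapping minibatches covering every sample once.
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = indices[start:start + batch_size]
        yield X[idx], y[idx]

full_w, full_b = get_grad(X, y, w, b)

# Accumulate the minibatch gradients without updating the parameters.
sum_w = np.zeros_like(w)
sum_b = np.zeros_like(b)
for X_s, y_s in batches(10, X, y, rng):
    g_w, g_b = get_grad(X_s, y_s, w, b)
    sum_w += g_w
    sum_b += g_b

print(np.max(np.abs(sum_w - full_w)), np.max(np.abs(sum_b - full_b)))
```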


Reinitialize $w$ and $b$.

In [19]:
w = np.random.randn(2,1)
b = np.random.randn(1,1)

In [20]:
lr = 0.3
num_epochs = 100
N = X.shape[0]
for i in range(num_epochs):
    for X_s,y_s in data_iter1(batch_size, X, y):
        w_grad, b_grad = get_grad(X_s,y_s,w,b)
        w = w - lr*w_grad/batch_size
        b = b - lr*b_grad/batch_size
    los = loss(X, y, w, b)
    if not i % 10:
        print(i,'th iter,','loss is:', np.mean(los))

0 th iter, loss is: 0.027347295275899294
10 th iter, loss is: 5.445108377788829e-07
20 th iter, loss is: 5.397653213488734e-07
30 th iter, loss is: 5.373639096945176e-07
40 th iter, loss is: 5.501392030148602e-07
50 th iter, loss is: 5.369086956030732e-07
60 th iter, loss is: 5.385231576704392e-07
70 th iter, loss is: 5.405155918590798e-07
80 th iter, loss is: 5.375904719613152e-07
90 th iter, loss is: 5.596308086531687e-07


In [21]:
w,b

(array([[1.399912  ],
        [2.30007441]]), array([[4.69987222]]))

## Training the Model with MXNet

In [22]:
def net(X, w, b):
    return nd.dot(X,w)+b
def myloss(y,yhat):
    return (y-yhat)**2/2
def sgd(params, lr, batch_size):  # This function is also provided by the gluonbook package for later reuse.
    for param in params:
        param[:] = param - lr * param.grad / batch_size

In [23]:
ndw = nd.random.randn(2,1)
ndb = nd.random.randn(1,1)

In [24]:
ndw, ndb

(
 [[1.1630785]
  [0.4838046]]
 <NDArray 2x1 @cpu(0)>, 
 [[0.29956347]]
 <NDArray 1x1 @cpu(0)>)

In [25]:
lr = 0.3
num_epochs = 100
N = ndX.shape[0]
ndw.attach_grad()
ndb.attach_grad()

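One caveat: if the parameter update assigns to ndw rather than to ndw[:], the gradient buffer has to be allocated again with attach_grad(). The assignment ndw = ... rebinds the name to a newly created array, so the gradient storage attached to the old array no longer applies. Assigning to ndw[:] instead writes the values into the memory of the original array, so no new gradient buffer is needed. (As an aside, Python lists behave differently when read: for a = [1,2,3,4], the expression a[:] creates a new list, although the assignment a[:] = ... still modifies the existing list in place.)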
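
The difference between rebinding a name and writing in place can be sketched with plain Python lists and NumPy arrays. NumPy is used here only as a stand-in to illustrate the same slice-assignment idea; the block does not call MXNet, and all variable names are assumptions for this example.

```python
import numpy as np

# Python list: reading a[:] builds a new list (a shallow copy).
a = [1, 2, 3, 4]
a_copy = a[:]
a_copy[0] = 99
list_unchanged = (a[0] == 1)  # the original list is not affected

# Python list: assigning to a[:] mutates the existing list object.
alias = a
a[:] = [5, 6, 7, 8]
list_same_object = (alias is a) and (alias[0] == 5)

# NumPy array: w[:] = ... writes into the existing buffer.
w = np.zeros((2, 1))
w_alias = w
w[:] = w - 1.0
in_place = (w_alias is w) and (w_alias[0, 0] == -1.0)

# w = ... rebinds the name to a freshly allocated array instead.
w = w - 1.0
rebound = (w_alias is not w) and (w_alias[0, 0] == -1.0) and (w[0, 0] == -2.0)

print(list_unchanged, list_same_object, in_place, rebound)
```

Anything attached to the original array object, such as a gradient buffer, stays valid only in the in-place case.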

In [26]:
for i in range(num_epochs):
    with autograd.record():
        los = (ndy-(nd.dot(ndX, ndw) +  ndb))**2/2
    los.backward()
    ndw[:] = ndw-lr*ndw.grad/N
    ndb[:] = ndb-lr*ndb.grad/N
    '''ndw = ndw-lr*ndw.grad/N
    ndb = ndb-lr*ndb.grad/N
    ndw.attach_grad()
    ndb.attach_grad()'''
    
    if not i % 10:
        print(i,'th iter,','loss is:', ((ndy-(nd.dot(ndX, ndw) +  ndb))**2/2).mean().asnumpy())

0 th iter, loss is: [5.500447]
10 th iter, loss is: [0.00769663]
20 th iter, loss is: [1.6042066e-05]
30 th iter, loss is: [5.7849286e-07]
40 th iter, loss is: [5.322418e-07]
50 th iter, loss is: [5.320153e-07]
60 th iter, loss is: [5.320284e-07]
70 th iter, loss is: [5.320284e-07]
80 th iter, loss is: [5.320284e-07]
90 th iter, loss is: [5.320284e-07]


In [27]:
ndw, ndb

(
 [[1.4000211]
  [2.3000007]]
 <NDArray 2x1 @cpu(0)>, 
 [[4.700028]]
 <NDArray 1x1 @cpu(0)>)

## Minibatch Stochastic Gradient Descent in MXNet

In [28]:
def data_iter2(batch_size, X, y):
    num_examples = len(X)
    indices = list(range(num_examples))
    np.random.shuffle(indices)  # Read the samples in random order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield X.take(j), y.take(j)  # take returns the elements at the given indices.

In [29]:
batch_size = 10
for i,j in data_iter2(batch_size, ndX, ndy):
    print(i,j)
    break


[[ 1.1328384   0.97010976]
 [ 2.261812   -1.202541  ]
 [ 0.5233994  -2.3260677 ]
 [-0.13625841  0.04932534]
 [ 0.35480747  0.30289814]
 [ 0.18187748 -0.98623437]
 [ 1.1560868  -0.41092116]
 [-0.32945085  0.5558534 ]
 [-1.0568067   0.14374235]
 [ 0.4996388   0.7169842 ]]
<NDArray 10x2 @cpu(0)> 
[[8.517731  ]
 [5.099402  ]
 [0.08239988]
 [4.6215725 ]
 [5.8935075 ]
 [2.6884203 ]
 [5.373425  ]
 [5.5161014 ]
 [3.5509028 ]
 [7.049156  ]]
<NDArray 10x1 @cpu(0)>


Reinitialize ndw and ndb.

In [30]:
ndw = nd.random.randn(2,1)
ndb = nd.random.randn(1,1)
ndw.attach_grad()
ndb.attach_grad()

In [31]:
lr = 0.03
num_epochs = 20
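for epoch in range(num_epochs):  # Training runs for num_epochs epochs in total.
    # In each epoch, every sample in the training set is used once
    # (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of a minibatch.
    for X, y in data_iter2(batch_size, ndX, ndy):
        with autograd.record():
            l = myloss(net(X, ndw, ndb), y)  # l is the loss on the minibatch X, y.
        l.backward()  # Gradients of the minibatch loss with respect to the parameters.
        sgd([ndw, ndb], lr, batch_size)  # Update the parameters with minibatch SGD.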
    train_l = myloss(net(ndX, ndw, ndb), ndy)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

epoch 1, loss 6.100148
epoch 2, loss 3.510528
epoch 3, loss 2.020575
epoch 4, loss 1.163837
epoch 5, loss 0.669829
epoch 6, loss 0.385972
epoch 7, loss 0.222735
epoch 8, loss 0.128599
epoch 9, loss 0.074279
epoch 10, loss 0.042887
epoch 11, loss 0.024789
epoch 12, loss 0.014326
epoch 13, loss 0.008279
epoch 14, loss 0.004782
epoch 15, loss 0.002767
epoch 16, loss 0.001602
epoch 17, loss 0.000927
epoch 18, loss 0.000535
epoch 19, loss 0.000310
epoch 20, loss 0.000180


In [32]:
ndw,ndb

(
 [[1.3938206]
  [2.2838824]]
 <NDArray 2x1 @cpu(0)>, 
 [[4.689897]]
 <NDArray 1x1 @cpu(0)>)

Try a different parameter initialization.

In [33]:
ndw = nd.random.normal(0,0.001,(2,1))
ndb = nd.zeros((1,1))
ndw.attach_grad()
ndb.attach_grad()

In [34]:
lr = 0.03
num_epochs = 20


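for epoch in range(num_epochs):  # Training runs for num_epochs epochs in total.
    # In each epoch, every sample in the training set is used once
    # (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of a minibatch.
    for X, y in data_iter2(batch_size, ndX, ndy):
        with autograd.record():
            l = myloss(net(X, ndw, ndb), y)  # l is the loss on the minibatch X, y.
        l.backward()  # Gradients of the minibatch loss with respect to the parameters.
        sgd([ndw, ndb], lr, batch_size)  # Update the parameters with minibatch SGD.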
    train_l = myloss(net(ndX, ndw, ndb), ndy)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

epoch 1, loss 7.770519
epoch 2, loss 4.367507
epoch 3, loss 2.457646
epoch 4, loss 1.384984
epoch 5, loss 0.781441
epoch 6, loss 0.441956
epoch 7, loss 0.250328
epoch 8, loss 0.141902
epoch 9, loss 0.080633
epoch 10, loss 0.045870
epoch 11, loss 0.026125
epoch 12, loss 0.014905
epoch 13, loss 0.008503
epoch 14, loss 0.004857
epoch 15, loss 0.002779
epoch 16, loss 0.001592
epoch 17, loss 0.000913
epoch 18, loss 0.000524
epoch 19, loss 0.000302
epoch 20, loss 0.000174


In [35]:
ndw, ndb

(
 [[1.3919284]
  [2.2873595]]
 <NDArray 2x1 @cpu(0)>, 
 [[4.687408]]
 <NDArray 1x1 @cpu(0)>)

## Implementation with Gluon, MXNet's High-Level API

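Recall the three ingredients of the approach:
1. Model
2. Loss function
3. Optimization algorithm  
  
Each of these has a counterpart in Gluon.

To load the data, Gluon provides the data module, which splits the data into batches.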

In [36]:
from mxnet.gluon import data as gdata

In [37]:
??gdata

In [38]:
?? gdata.ArrayDataset

In [39]:
?? gdata.DataLoader

In [40]:
batch_size = 10
dataset = gdata.ArrayDataset(ndX, ndy)
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)

### Define the Model

In [41]:
from mxnet.gluon import nn

In [42]:
net = nn.Sequential()
net.add(nn.Dense(1))

### Initialize the Model Parameters

In [43]:
from mxnet import init
?? init.Normal

In [44]:
net.initialize(init.Normal(sigma=0.01))

### Define the Loss Function

In [45]:
from mxnet.gluon import loss as gloss

In [46]:
?? gloss

In [47]:
loss = gloss.L2Loss()

### Define the Optimization Algorithm

In [48]:
from mxnet import gluon
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})

### Train the Model

In [49]:
num_epochs = 20
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(ndX), ndy)  # Evaluate the loss on the full dataset once per epoch.
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 7.732026
epoch 2, loss: 4.346726
epoch 3, loss: 2.445875
epoch 4, loss: 1.378127
epoch 5, loss: 0.778209
epoch 6, loss: 0.440118
epoch 7, loss: 0.249306
epoch 8, loss: 0.141301
epoch 9, loss: 0.080303
epoch 10, loss: 0.045641
epoch 11, loss: 0.025995
epoch 12, loss: 0.014827
epoch 13, loss: 0.008467
epoch 14, loss: 0.004841
epoch 15, loss: 0.002767
epoch 16, loss: 0.001587
epoch 17, loss: 0.000910
epoch 18, loss: 0.000523
epoch 19, loss: 0.000301
epoch 20, loss: 0.000173


In [50]:
ndtrue_w, ndtrue_b

(
 [[1.4]
  [2.3]]
 <NDArray 2x1 @cpu(0)>, 
 [4.7]
 <NDArray 1 @cpu(0)>)

In [51]:
dense = net[0]

In [52]:
dense.weight.data()


[[1.3917496 2.2874672]]
<NDArray 1x2 @cpu(0)>

In [53]:
dense.bias.data()


[4.687459]
<NDArray 1 @cpu(0)>