# Linear model

## Model
$$\hat{y} = w_1x_1 + w_2x_2 + \dots + w_dx_d + b\tag{1} $$
Here $d$ is the number of features; $n$ is reserved below for the number of samples.

## Loss function
$$ L = \frac{1}{2}\sum_{i=1}^n(y_i - \hat{y_i})^2\tag{2} $$
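
As a quick sanity check of equation (2), the loss can be evaluated by hand on three made-up points. The values of `y_true` and `y_pred` below are purely illustrative and do not come from the dataset used later.

```python
import numpy as np

# Illustrative targets and predictions (assumed values, not from the notebook data).
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Equation (2): half the sum of squared residuals.
loss_value = 0.5 * np.sum((y_true - y_pred) ** 2)
print(loss_value)  # 0.5 * (0.25 + 0 + 1) = 0.625
```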

Prepare the data: 100 records, each with two features $x_1$ and $x_2$.

In [431]:
import numpy as np
import seaborn as sns
import pandas as pd

In [432]:
true_w = np.array([[1.4], [2.3]], dtype = np.float64)
true_b = 4.7

In [433]:
X = np.random.normal(0, 1, (100,2))
zao = np.random.randn(100,1)
y = np.dot(X, true_w) + true_b + zao

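Now we have the data: the features $X$ and the corresponding target values $y$. How do we obtain the parameters $w$ and $b$ of the linear model?  
Let $\bar{w}=\left[\begin{matrix}w \\ b \end{matrix}\right]$ and $\bar{X}=\left[\begin{matrix}X & 1 \end{matrix}\right]$. From statistics, the optimal solution can be derived in closed form: $\hat{\bar{w}}^*=(\bar{X}^\intercal\bar{X})^{-1}\bar{X}^\intercal y$. However, this method has serious limitations:  
1. $\bar{X}^\intercal\bar{X}$ must be invertible.
2. When both the number of samples and the feature dimension are large, forming the matrix product is slow and inverting it is even slower.  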
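
For a problem as small as this one, the closed-form solution is still cheap to compute. The sketch below regenerates data in the same way as the cells above, with a fixed seed and `_demo` names as assumptions so that it does not overwrite the notebook's variables. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`, which also handles the case where $\bar{X}^\intercal\bar{X}$ is singular.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n_samples = 100
w_demo = np.array([[1.4], [2.3]])
b_demo = 4.7

X_demo = rng.normal(0.0, 1.0, (n_samples, 2))
y_demo = X_demo @ w_demo + b_demo + rng.standard_normal((n_samples, 1))

# Append a column of ones so that the bias becomes the last entry of w_bar.
X_bar = np.hstack([X_demo, np.ones((n_samples, 1))])

# Normal equations: solve (X_bar^T X_bar) w_bar = X_bar^T y without an explicit inverse.
w_bar_normal = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y_demo)

# Least-squares solver as a cross-check; it also works for rank-deficient X_bar.
w_bar_lstsq, *_ = np.linalg.lstsq(X_bar, y_demo, rcond=None)

print(w_bar_normal.ravel())
print(w_bar_lstsq.ravel())
```

Both vectors should agree and lie close to $(1.4, 2.3, 4.7)$, up to the effect of the noise.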

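Is there another way? Yes: treat it as an optimization problem. The loss is a convex function of the parameters, so gradient descent with a sufficiently small learning rate converges to a global optimum.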
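
The convexity claim can be checked numerically: the Hessian of the loss with respect to $\bar{w}$ is $\bar{X}^\intercal\bar{X}$, and the loss is convex exactly when this matrix has no negative eigenvalues. The sketch below uses freshly generated features (the seed and the `_demo` names are assumptions). It also prints the largest learning rate for which a full-batch update of the form `w - lr * grad / N`, as used later in this notebook, converges on a quadratic loss.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n_samples = 100
X_demo = rng.normal(size=(n_samples, 2))
X_bar = np.hstack([X_demo, np.ones((n_samples, 1))])

# Hessian of L = 1/2 * ||X_bar w_bar - y||^2 with respect to w_bar.
hessian = X_bar.T @ X_bar
eigenvalues = np.linalg.eigvalsh(hessian)  # returned in ascending order
print(eigenvalues)

# With the step  w_bar <- w_bar - (lr / n_samples) * gradient,
# gradient descent on this quadratic converges when lr < 2 * n_samples / lambda_max.
lr_bound = 2 * n_samples / eigenvalues[-1]
print(lr_bound)
```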

First, compute the partial derivatives of the loss with respect to the parameters  
$$\frac{\partial{L}}{\partial{w_1}}= \sum_{i=1}^n(\hat{y_i}-y_i)x_{i1}\tag{3}$$
$$\frac{\partial{L}}{\partial{b}}= \sum_{i=1}^n(\hat{y_i}-y_i)\tag{4}$$

In this example, if we define $\vec{w}=\left[\begin{matrix}w_1 \\ w_2 \\ b \end{matrix}\right]$, then  
$$\frac{\partial{L}}{\partial{\vec{w}}}= \sum_{i=1}^n(\hat{y_i}-y_i)\left[\begin{matrix}x_{i1}\\x_{i2}\\1 \end{matrix}\right]\tag{5}$$

In matrix form, with $X_{100\times2}$, $y_{100\times1}$ and $\hat{y}_{100\times1}$, the gradient with respect to $w$ is
$$\frac{\partial{L}}{\partial{\left[\begin{matrix}w_1\\w_2 \end{matrix}\right]}} = X^\intercal(\hat{y}-y)\tag{6}$$  
and the gradient with respect to $b$ is, where $\vec{1}$ is a column vector whose elements are all 1,
$$\frac{\partial{L}}{\partial{b}} = {\vec{1}}^\intercal(\hat{y}-y)\tag{7}$$ 
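
A common way to validate hand-derived gradients like these is to compare them with central finite differences of the loss. The following is a minimal sketch on freshly generated data. The helper names `loss_sum`, `analytic_grad` and `numeric_grad`, the seed, and the step size `eps` are all assumptions made for illustration.

```python
import numpy as np

def loss_sum(X, y, w, b):
    """Loss (2): half the sum of squared residuals."""
    residual = X @ w + b - y
    return 0.5 * np.sum(residual ** 2)

def analytic_grad(X, y, w, b):
    """Matrix-form gradients with respect to w and b."""
    residual = X @ w + b - y
    return X.T @ residual, np.sum(residual)

def numeric_grad(X, y, w, b, eps=1e-5):
    """Central finite differences, one parameter at a time."""
    grad_w = np.zeros_like(w)
    for k in range(w.shape[0]):
        step = np.zeros_like(w)
        step[k, 0] = eps
        grad_w[k, 0] = (loss_sum(X, y, w + step, b)
                        - loss_sum(X, y, w - step, b)) / (2 * eps)
    grad_b = (loss_sum(X, y, w, b + eps) - loss_sum(X, y, w, b - eps)) / (2 * eps)
    return grad_w, grad_b

rng = np.random.default_rng(1)  # assumed seed
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([[1.4], [2.3]]) + 4.7 + rng.standard_normal((100, 1))
w_demo = rng.standard_normal((2, 1))
b_demo = float(rng.standard_normal())

gw_analytic, gb_analytic = analytic_grad(X_demo, y_demo, w_demo, b_demo)
gw_numeric, gb_numeric = numeric_grad(X_demo, y_demo, w_demo, b_demo)
print(np.max(np.abs(gw_analytic - gw_numeric)), abs(gb_analytic - gb_numeric))
```

Because the loss is quadratic, central differences are exact up to rounding, so the two printed differences should be tiny compared with the gradient magnitudes.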

First, initialize $w$ and $b$ randomly

In [434]:
w = np.random.randn(2,1)
b = np.random.randn(1)
w,b

(array([[-1.963915  ],
        [-0.63304225]]), array([-0.040506]))

Compute the gradients using the formulas derived above

In [435]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [436]:
get_grad(X, y, w, b)

(array([[-431.78921776],
        [-338.20905788]]), array([[-526.9522258]]))

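Deep learning frameworks provide automatic differentiation, which removes the tedium of deriving formulas by hand and is even more useful for models such as neural networks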

## MXNet

In [437]:
from mxnet import ndarray as nd, autograd
ndX = nd.array(X)
ndy = nd.array(y)
ndzao = nd.array(zao)
ndtrue_w = nd.array(true_w)
ndtrue_b = nd.array([true_b])
ndw = nd.array(w)
ndb = nd.array(b)

In [438]:
# For automatic differentiation, MXNet first needs memory allocated to store each variable's gradient
ndw.attach_grad()
ndb.attach_grad()

In [439]:
with autograd.record():
    los = (ndy-(nd.dot(ndX, ndw) +  ndb))**2/2

In [440]:
los.backward()

In [441]:
ndw.grad,ndb.grad

(
 [[-431.7892]
  [-338.209 ]]
 <NDArray 2x1 @cpu(0)>, 
 [-526.9522]
 <NDArray 1 @cpu(0)>)

The gradients computed by MXNet's automatic differentiation agree with the ones we computed by hand

## PyTorch

Train the model with gradient descent. First, define a few functions 

Loss function

In [442]:
def loss(X, y, w, b):
    return 1/2*(np.dot(X,w)+b-y)**2

Gradient function

In [443]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [444]:
lr = 0.3
num_epochs = 100
N = X.shape[0]
for i in range(num_epochs):
    w_grad, b_grad = get_grad(X,y,w,b)
    w = w - lr*w_grad/N
    b = b - lr*b_grad/N
    los = loss(X, y, w, b)
    if not i % 10:
        print(i,'th iter,','loss is:', np.mean(los))

0 th iter, loss is: 10.748258895560653
10 th iter, loss is: 0.5551090349265329
20 th iter, loss is: 0.552559886859715
30 th iter, loss is: 0.5525582376749849
40 th iter, loss is: 0.5525582354808939
50 th iter, loss is: 0.5525582354771611
60 th iter, loss is: 0.5525582354771544
70 th iter, loss is: 0.5525582354771545
80 th iter, loss is: 0.5525582354771545
90 th iter, loss is: 0.5525582354771544


In [445]:
w,b

(array([[1.27517596],
        [2.26746547]]), array([[4.72462129]]))

In [446]:
true_w,true_b

(array([[1.4],
        [2.3]]), 4.7)
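
The estimates differ slightly from `true_w` and `true_b` because the targets contain noise: gradient descent converges to the least-squares solution for the sampled data, not to the generating parameters. The sketch below illustrates this on freshly generated data (the seed and the `_demo`, `_gd` and `_ls` names are assumptions). It runs the same full-batch update and measures the largest gap to the `np.linalg.lstsq` solution.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n_samples = 100
X_demo = rng.normal(size=(n_samples, 2))
y_demo = X_demo @ np.array([[1.4], [2.3]]) + 4.7 + rng.standard_normal((n_samples, 1))

# Full-batch gradient descent with the same learning rate and epoch count as above.
w_gd = rng.standard_normal((2, 1))
b_gd = rng.standard_normal((1, 1))
lr, num_epochs = 0.3, 100
for _ in range(num_epochs):
    residual = X_demo @ w_gd + b_gd - y_demo
    w_gd = w_gd - lr * (X_demo.T @ residual) / n_samples
    b_gd = b_gd - lr * residual.sum(axis=0, keepdims=True) / n_samples

# Least-squares reference solution on the augmented feature matrix.
X_bar = np.hstack([X_demo, np.ones((n_samples, 1))])
w_bar_ls, *_ = np.linalg.lstsq(X_bar, y_demo, rcond=None)

gap = np.max(np.abs(np.vstack([w_gd, b_gd]) - w_bar_ls))
print(gap)
```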

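Next, implement minibatch stochastic gradient descent to estimate the model parameters  
First, split the dataset into minibatches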

In [449]:
def data_iter1(batch_size, X, y):
    num_examples = len(X)
    indices = list(range(num_examples))
    np.random.shuffle(indices)  # Samples are read in random order.
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield X[j], y[j]  # Integer-array indexing returns the rows at the given indices.

In [450]:
batch_size = 10
for i,j in data_iter1(batch_size, X, y):
    print(i,j)
    break

[[-1.10425416  0.53744733]
 [ 1.41665835 -0.8384792 ]
 [ 1.23419133  0.7761752 ]
 [-0.58011035  1.00976837]
 [ 0.92207333 -0.17040899]
 [-0.76202002 -0.82587629]
 [ 0.47843025  0.86187774]
 [-0.10212744 -0.73475063]
 [ 0.0365292   2.09224611]
 [ 1.31619544  0.77441899]] [[4.09030254]
 [3.80566454]
 [8.99708103]
 [8.63543191]
 [5.53068968]
 [1.09725473]
 [9.02026778]
 [3.08251743]
 [8.37406336]
 [6.60325469]]
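
A property worth checking for any such iterator is that one pass visits every sample exactly once, including when the dataset size is not a multiple of the batch size. The sketch below uses a hypothetical helper, `minibatch_indices`, which yields only the index arrays. It relies on the fact that slicing past the end of a NumPy array is safe, so no explicit `min` is needed.

```python
import numpy as np

def minibatch_indices(num_examples, batch_size, rng):
    """Yield shuffled index arrays that together cover 0..num_examples-1 once."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]  # the last slice may be shorter

rng = np.random.default_rng(0)  # assumed seed
batches = list(minibatch_indices(25, 10, rng))

batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [10, 10, 5]

seen = np.sort(np.concatenate(batches))
print(np.array_equal(seen, np.arange(25)))  # True
```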


Re-initialize $w$ and $b$

In [455]:
w = np.random.randn(2,1)
b = np.random.randn(1,1)

In [456]:
lr = 0.3
num_epochs = 100
N = X.shape[0]
for i in range(num_epochs):
    for X_s,y_s in data_iter1(batch_size, X, y):
        w_grad, b_grad = get_grad(X_s,y_s,w,b)
        w = w - lr*w_grad/batch_size
        b = b - lr*b_grad/batch_size
    los = loss(X, y, w, b)
    if not i % 10:
        print(i,'th iter,','loss is:', np.mean(los))

0 th iter, loss is: 0.5564595708043266
10 th iter, loss is: 0.5698418517348373
20 th iter, loss is: 0.5982968558560376
30 th iter, loss is: 0.5557177891091425
40 th iter, loss is: 0.5626929669253424
50 th iter, loss is: 0.5679093231445218
60 th iter, loss is: 0.583278473915394
70 th iter, loss is: 0.672769470699366
80 th iter, loss is: 0.5660121623422896
90 th iter, loss is: 0.5744911931686882


In [457]:
w,b

(array([[1.33977475],
        [2.16482354]]), array([[4.74738742]]))
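
Unlike the full-batch run, the loss here does not settle: with a constant learning rate of 0.3, the minibatch gradient noise keeps the parameters bouncing around the optimum. The sketch below is a rough illustration on freshly generated data, with the seed, the `_demo` names and the helper `excess_loss_after_sgd` all being assumptions. It compares the average excess loss over the final 50 epochs for `lr=0.3` and `lr=0.03`, the two learning rates that appear in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n_samples, batch_size = 100, 10
X_demo = rng.normal(size=(n_samples, 2))
y_demo = X_demo @ np.array([[1.4], [2.3]]) + 4.7 + rng.standard_normal((n_samples, 1))

def mean_loss(w, b):
    return np.mean(0.5 * (X_demo @ w + b - y_demo) ** 2)

# Smallest achievable mean loss, taken from the least-squares solution.
X_bar = np.hstack([X_demo, np.ones((n_samples, 1))])
w_bar_star, *_ = np.linalg.lstsq(X_bar, y_demo, rcond=None)
best_loss = mean_loss(w_bar_star[:2], w_bar_star[2:])

def excess_loss_after_sgd(lr, num_epochs=100):
    """Run minibatch SGD; return the mean excess loss over the last 50 epochs."""
    w = np.zeros((2, 1))
    b = np.zeros((1, 1))
    excess = []
    for _ in range(num_epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X_demo[idx] @ w + b - y_demo[idx]
            w = w - lr * (X_demo[idx].T @ residual) / len(idx)
            b = b - lr * residual.sum(axis=0, keepdims=True) / len(idx)
        excess.append(mean_loss(w, b) - best_loss)
    return float(np.mean(excess[-50:]))

excess_large_lr = excess_loss_after_sgd(lr=0.3)
excess_small_lr = excess_loss_after_sgd(lr=0.03)
print(excess_large_lr, excess_small_lr)
```

The smaller learning rate should end much closer to the minimum, which is consistent with the smoother loss curve in the MXNet run below.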

In [453]:
# This function is saved in the gluonbook package for later use.
def data_iter2(batch_size, X, y):
    num_examples = len(X)
    indices = list(range(num_examples))
    np.random.shuffle(indices)  # Samples are read in random order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield X.take(j), y.take(j)  # take returns the elements at the given indices.

In [454]:
batch_size = 10
for i,j in data_iter2(batch_size, ndX, ndy):
    print(i,j)
    break


[[-0.9468651   1.5843596 ]
 [-1.4950799  -0.68095076]
 [ 0.39265046 -0.9277623 ]
 [ 2.0839138  -1.780824  ]
 [ 0.30021167  0.5575545 ]
 [ 0.4478474  -0.7138504 ]
 [ 0.8718394   1.8295546 ]
 [ 0.22883023  0.5474122 ]
 [ 1.6809354   1.8031437 ]
 [-1.1868412  -0.6501476 ]]
<NDArray 10x2 @cpu(0)> 
[[ 6.15197   ]
 [ 0.20289575]
 [ 4.178192  ]
 [ 2.6170437 ]
 [ 6.072993  ]
 [ 4.013717  ]
 [11.572858  ]
 [ 5.986799  ]
 [10.344066  ]
 [ 3.4257388 ]]
<NDArray 10x1 @cpu(0)>


In [404]:
def linreg(X, w, b):  # This function is saved in the gluonbook package for later use.
    return nd.dot(X, w) + b
def squared_loss(y_hat, y):  # This function is saved in the gluonbook package for later use.
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
def sgd(params, lr, batch_size):  # This function is saved in the gluonbook package for later use.
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        

In [409]:
lr = 0.03
num_epochs = 20
net = linreg
loss = squared_loss

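for epoch in range(num_epochs):  # Train for num_epochs epochs in total.
    # Each epoch uses every sample in the training set once
    # (assuming the number of samples is divisible by the batch size).
    # X_batch and y_batch are the features and labels of one minibatch.
    for X_batch, y_batch in data_iter2(batch_size, ndX, ndy):
        with autograd.record():
            l = loss(net(X_batch, ndw, ndb), y_batch)  # l is the loss on the minibatch.
        l.backward()  # Gradient of the minibatch loss with respect to the parameters.
        sgd([ndw, ndb], lr, batch_size)  # Update the parameters with minibatch SGD.
    train_l = loss(net(ndX, ndw, ndb), ndy)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))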

epoch 1, loss 1.252480
epoch 2, loss 0.899208
epoch 3, loss 0.700107
epoch 4, loss 0.589044
epoch 5, loss 0.527703
epoch 6, loss 0.492404
epoch 7, loss 0.473696
epoch 8, loss 0.462425
epoch 9, loss 0.456762
epoch 10, loss 0.453340
epoch 11, loss 0.451572
epoch 12, loss 0.450525
epoch 13, loss 0.450052
epoch 14, loss 0.449882
epoch 15, loss 0.449718
epoch 16, loss 0.449573
epoch 17, loss 0.449483
epoch 18, loss 0.449394
epoch 19, loss 0.449377
epoch 20, loss 0.449336


In [410]:
ndw,ndb

(
 [[1.2964016]
  [2.4212751]]
 <NDArray 2x1 @cpu(0)>, 
 [4.629563]
 <NDArray 1 @cpu(0)>)