# Linear Model

## Model
$$\hat{y} = w_1x_1 + w_2x_2 + \dots + w_nx_n + b\tag{1} $$

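## Loss function
$$ L = \frac{1}{2}\sum_{i=1}^n(y_i - \hat{y_i})^2\tag{2} $$
Note that the sum in equation (2) runs over the training samples, so $n$ there is the number of samples, whereas $n$ in equation (1) is the number of features.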
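
As a minimal sketch of equations (1) and (2), the prediction and the loss can be computed in vectorized form with NumPy. The function names `predict` and `sse_loss` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def predict(X, w, b):
    """Equation (1) for all rows of X at once: y_hat = X w + b."""
    return X @ w + b

def sse_loss(y, y_hat):
    """Equation (2): half the sum of squared residuals."""
    return 0.5 * np.sum((y - y_hat) ** 2)

# Toy data (hypothetical): 3 samples, 2 features.
X_demo = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
w_demo = np.array([[1.0], [2.0]])
b_demo = 0.5
y_demo = np.array([[5.0], [2.5], [2.5]])

y_hat_demo = predict(X_demo, w_demo, b_demo)   # [[5.5], [2.5], [1.5]]
loss_demo = sse_loss(y_demo, y_hat_demo)       # 0.5 * (0.25 + 0 + 1) = 0.625
print(loss_demo)
```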

Prepare the data: 100 records, each with two features $x_1$ and $x_2$.

In [316]:
import numpy as np
import seaborn as sns
import pandas as pd

In [349]:
true_w = np.array([[1.4], [2.3]], dtype = np.float64)
true_b = 4.7

In [350]:
X = np.random.normal(0, 1, (100,2))
zao = np.random.randn(100,1)
y = np.dot(X, true_w) + true_b + zao

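Now that we have the features $X$ and the corresponding target values $y$, how do we obtain the parameters $w$ and $b$ of the linear model?  
Let $\bar{w}=\left[\begin{matrix}w \\ b \end{matrix}\right]$ and $\bar{X}=\left[\begin{matrix}X & 1 \end{matrix}\right]$. From statistics, the optimal solution can be derived in closed form as $\hat{\bar{w}}^*=(\bar{X}^\intercal\bar{X})^{-1}\bar{X}^\intercal y$. However, this approach has significant limitations:  
1. $\bar{X}^\intercal\bar{X}$ must be invertible
2. When the number of samples and the feature dimension are both large, forming $\bar{X}^\intercal\bar{X}$ is expensive, and inverting it is even more so  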
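
The closed-form solution can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the `_demo` names, the seed, and the regenerated data are not from the notebook. The sketch solves the normal equations with `np.linalg.solve` rather than forming an explicit inverse, and also calls `np.linalg.lstsq`, which still returns a least-squares solution when $\bar{X}^\intercal\bar{X}$ is singular.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X_demo = rng.normal(0, 1, (100, 2))
true_w_demo = np.array([[1.4], [2.3]])
true_b_demo = 4.7
y_demo = X_demo @ true_w_demo + true_b_demo + rng.standard_normal((100, 1))

# Augment X with a column of ones so that the bias is the last entry of w_bar.
X_bar = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])

# Normal equations: solve (X_bar^T X_bar) w_bar = X_bar^T y without an explicit inverse.
w_bar_normal = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y_demo)

# Least-squares solver applied directly to X_bar.
w_bar_lstsq, *_ = np.linalg.lstsq(X_bar, y_demo, rcond=None)

print(w_bar_normal.ravel())  # estimates of [w1, w2, b]
```

Both routes should agree to numerical precision on well-conditioned data like this.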

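Is there another way? Yes: treat it as an optimization problem. The loss is a convex function of the parameters, so gradient descent with a suitably small learning rate converges to the global optimum.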
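
How small the learning rate has to be can be made concrete for this quadratic loss. If the gradient is divided by the number of samples, the Hessian is $H=\bar{X}^\intercal\bar{X}/N$, and fixed-step gradient descent converges when the learning rate is below $2/\lambda_{\max}(H)$ (assuming $H$ is positive definite). The sketch below, with hypothetical names and regenerated data, runs one learning rate inside that range and one outside it.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([[1.4], [2.3]]) + 4.7 + rng.standard_normal((100, 1))
N = X_demo.shape[0]
X_bar = np.hstack([X_demo, np.ones((N, 1))])

# Hessian of the averaged loss (1 / 2N) * ||X_bar w_bar - y||^2.
H = X_bar.T @ X_bar / N
lam_max = np.linalg.eigvalsh(H).max()
lr_limit = 2.0 / lam_max

def run_gd(lr, steps=200):
    """Run full-batch gradient descent from zero and return the final averaged loss."""
    w_bar = np.zeros((3, 1))
    for _ in range(steps):
        grad = X_bar.T @ (X_bar @ w_bar - y_demo) / N
        w_bar = w_bar - lr * grad
    return 0.5 * np.mean((X_bar @ w_bar - y_demo) ** 2)

loss_ok = run_gd(0.5 * lr_limit)    # inside the stable range
loss_bad = run_gd(1.1 * lr_limit)   # outside it: the iterates grow without bound
print(lr_limit, loss_ok, loss_bad)
```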

First, compute the partial derivatives of the loss with respect to the parameters  
$$\frac{\partial{L}}{\partial{w_1}}= \sum_{i=1}^n(\hat{y_i}-y_i)x_{i1}\tag{3}$$
$$\frac{\partial{L}}{\partial{b}}= \sum_{i=1}^n(\hat{y_i}-y_i)\tag{4}$$

In this example, if we define $\vec{w}=\left[\begin{matrix}w_1 \\ w_2 \\ b \end{matrix}\right]$, then  
$$\frac{\partial{L}}{\partial{\vec{w}}}= \sum_{i=1}^n(\hat{y_i}-y_i)\left[\begin{matrix}x_{i1}\\x_{i2}\\1 \end{matrix}\right]\tag{5}$$

In matrix form, with $X_{100\times2}$, $y_{100\times1}$ and $\hat{y}_{100\times1}$, the gradient with respect to $w$ is
$$\frac{\partial{L}}{\partial{\left[\begin{matrix}w_1\\w_2 \end{matrix}\right]}} = X^\intercal(\hat{y}-y)\tag{6}$$  
and the gradient with respect to $b$ is, where $\vec{1}$ is a column vector of all ones,
$$\frac{\partial{L}}{\partial{b}} = {\vec{1}}^\intercal(\hat{y}-y)\tag{7}$$ 
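
A hand-derived gradient is easy to get wrong, so it can be worth comparing it with a finite-difference estimate. The sketch below is an illustrative check rather than part of the notebook: the function names, the seed, and the random test data are assumptions. It compares the matrix-form gradients against central differences of the loss in equation (2).

```python
import numpy as np

def sse_loss(X, y, w, b):
    """Equation (2): half the sum of squared residuals."""
    return 0.5 * np.sum((X @ w + b - y) ** 2)

def analytic_grad(X, y, w, b):
    """Matrix-form gradients: X^T (y_hat - y) and 1^T (y_hat - y)."""
    err = X @ w + b - y
    return X.T @ err, np.sum(err)

def numeric_grad(X, y, w, b, eps=1e-6):
    """Central differences, perturbing one parameter at a time."""
    gw = np.zeros_like(w)
    for k in range(w.shape[0]):
        step = np.zeros_like(w)
        step[k, 0] = eps
        gw[k, 0] = (sse_loss(X, y, w + step, b) - sse_loss(X, y, w - step, b)) / (2 * eps)
    gb = (sse_loss(X, y, w, b + eps) - sse_loss(X, y, w, b - eps)) / (2 * eps)
    return gw, gb

rng = np.random.default_rng(1)  # arbitrary seed
X_chk = rng.normal(size=(20, 2))
y_chk = rng.normal(size=(20, 1))
w_chk = rng.normal(size=(2, 1))
b_chk = 0.3

gw_a, gb_a = analytic_grad(X_chk, y_chk, w_chk, b_chk)
gw_n, gb_n = numeric_grad(X_chk, y_chk, w_chk, b_chk)
print(np.max(np.abs(gw_a - gw_n)), abs(gb_a - gb_n))  # both should be tiny
```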

First, randomly initialize $w$ and $b$

In [383]:
w = np.random.randn(2,1)
b = np.random.randn(1)
w,b

(array([[-0.86699247],
        [ 0.01012463]]), array([0.11836671]))

Compute the gradients from the derived formulas

In [384]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [385]:
get_grad(X, y, w, b)

(array([[-205.32582638],
        [-218.83858727]]), array([[-438.51617656]]))

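Deep learning frameworks all provide automatic differentiation, which removes the tedium of deriving gradients by hand and is even more useful for models such as neural networks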

## MXNet

In [386]:
from mxnet import ndarray as nd, autograd
ndX = nd.array(X)
ndy = nd.array(y)
ndzao = nd.array(zao)
ndtrue_w = nd.array(true_w)
ndtrue_b = nd.array([true_b])
ndw = nd.array(w)
ndb = nd.array(b)

In [387]:
# For MXNet autograd, first allocate memory to store each variable's gradient
ndw.attach_grad()
ndb.attach_grad()

In [388]:
with autograd.record():
    los = (ndy-(nd.dot(ndX, ndw) +  ndb))**2/2

In [389]:
los.backward()

In [390]:
ndw.grad,ndb.grad

(
 [[-205.32582]
  [-218.83862]]
 <NDArray 2x1 @cpu(0)>, 
 [-438.51617]
 <NDArray 1 @cpu(0)>)

## PyTorch

To train the model with gradient descent, first define a few functions 

Loss function

In [391]:
def loss(X, y, w, b):
    return 1/2*np.mean((np.dot(X,w)+b-y)**2)

Gradient function

In [392]:
def get_grad(X, y, w, b):
    return np.dot(X.T,np.dot(X,w)+b-y), np.dot(np.ones((1,X.shape[0])),np.dot(X,w)+b-y)

In [393]:
w_grad, b_grad = get_grad(X,y,w,b)

In [394]:
w_grad, b_grad

(array([[-205.32582638],
        [-218.83858727]]), array([[-438.51617656]]))

In [395]:
w,b

(array([[-0.86699247],
        [ 0.01012463]]), array([0.11836671]))

In [396]:
lr = 0.3
num_epochs = 1000
N = X.shape[0]
for i in range(num_epochs):
    w_grad, b_grad = get_grad(X,y,w,b)
    w = w - lr*w_grad/N
    b = b - lr*b_grad/N
    los = loss(X, y, w, b)
    if not i % 10:
        print(i,'th iter,','loss is:', los)

0 th iter, loss is: 7.951563325225119
10 th iter, loss is: 0.4581301101113125
20 th iter, loss is: 0.44932491751667475
30 th iter, loss is: 0.44931436383712614
40 th iter, loss is: 0.4493143505047864
50 th iter, loss is: 0.449314350484543
60 th iter, loss is: 0.44931435048449586
70 th iter, loss is: 0.44931435048449564
80 th iter, loss is: 0.4493143504844957
90 th iter, loss is: 0.4493143504844958
100 th iter, loss is: 0.44931435048449564
110 th iter, loss is: 0.44931435048449564
120 th iter, loss is: 0.4493143504844957
130 th iter, loss is: 0.4493143504844957
140 th iter, loss is: 0.4493143504844957
150 th iter, loss is: 0.4493143504844957
160 th iter, loss is: 0.4493143504844957
170 th iter, loss is: 0.4493143504844957
180 th iter, loss is: 0.4493143504844957
190 th iter, loss is: 0.4493143504844957
200 th iter, loss is: 0.4493143504844957
210 th iter, loss is: 0.4493143504844957
220 th iter, loss is: 0.4493143504844957
230 th iter, loss is: 0.4493143504844957
240 th iter, loss is: 0

In [397]:
w,b

(array([[1.30066018],
        [2.42602976]]), array([[4.62771146]]))

In [398]:
true_w,true_b

(array([[1.4],
        [2.3]]), 4.7)
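
The learned parameters are close to `true_w` and `true_b` but not equal to them. This is expected: the data contain noise, so gradient descent converges to the least-squares estimate for this particular sample, not to the generating parameters. A minimal sketch of that comparison, using regenerated data and hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
N = 100
X_demo = rng.normal(size=(N, 2))
true_w_bar = np.array([[1.4], [2.3], [4.7]])   # [w1, w2, b]
X_bar = np.hstack([X_demo, np.ones((N, 1))])
y_demo = X_bar @ true_w_bar + rng.standard_normal((N, 1))

# Full-batch gradient descent with the averaged gradient.
w_gd = np.zeros((3, 1))
for _ in range(1000):
    w_gd -= 0.3 * X_bar.T @ (X_bar @ w_gd - y_demo) / N

# Least-squares estimate for the same data.
w_ls, *_ = np.linalg.lstsq(X_bar, y_demo, rcond=None)

gap_to_ls = np.max(np.abs(w_gd - w_ls))          # effectively zero
gap_to_true = np.max(np.abs(w_gd - true_w_bar))  # noticeably larger, caused by noise
print(gap_to_ls, gap_to_true)
```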

In [401]:
import random
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # Samples are read in random order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)  # take returns the elements at the given indices.
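
The same batching idea can be sketched with NumPy arrays only, which may help readers who do not have MXNet installed. `data_iter_np` and the toy arrays are hypothetical names introduced for illustration. Fancy indexing plays the role of `take`, and slicing clips at the end of the array, so no explicit `min` is needed.

```python
import numpy as np

def data_iter_np(batch_size, features, labels, rng):
    """Yield shuffled mini-batches; the last batch may be smaller."""
    num_examples = len(features)
    indices = rng.permutation(num_examples)
    for i in range(0, num_examples, batch_size):
        j = indices[i:i + batch_size]   # slicing already clips at the end
        yield features[j], labels[j]

rng = np.random.default_rng(4)  # arbitrary seed
feats = np.arange(50, dtype=float).reshape(25, 2)
labs = np.arange(25, dtype=float).reshape(25, 1)
batches = list(data_iter_np(10, feats, labs, rng))
print([len(xb) for xb, _ in batches])  # prints [10, 10, 5]
```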

In [403]:
batch_size = 10

for X, y in data_iter(batch_size, ndX, ndy):
    print(X, y)
    break


[[ 5.0287610e-01  5.3495818e-01]
 [-2.4273744e+00 -3.7820321e-01]
 [-5.6928647e-01  2.1097383e+00]
 [-4.3524254e-02  3.9931270e-01]
 [ 6.2985349e-01 -5.2471608e-01]
 [ 1.1761853e+00  1.6975348e+00]
 [-6.5805078e-02  1.2352763e+00]
 [-2.0735729e-03  1.8669657e+00]
 [-1.9155878e-01  1.9280884e-01]
 [-2.9792911e-01 -8.8309121e-01]]
<NDArray 10x2 @cpu(0)> 
[[ 6.6100607]
 [ 1.2608424]
 [ 5.9588785]
 [ 5.9672117]
 [ 3.758598 ]
 [10.082812 ]
 [ 7.688988 ]
 [ 9.469128 ]
 [ 3.848411 ]
 [ 2.0874534]]
<NDArray 10x1 @cpu(0)>


In [404]:
def linreg(X, w, b):  # This function is saved in the gluonbook package for later use.
    return nd.dot(X, w) + b
def squared_loss(y_hat, y):  # This function is saved in the gluonbook package for later use.
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
def sgd(params, lr, batch_size):  # This function is saved in the gluonbook package for later use.
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        

In [409]:
lr = 0.03
num_epochs = 20
net = linreg
loss = squared_loss

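for epoch in range(num_epochs):  # Training runs for num_epochs epochs in total.
    # In each epoch, every sample in the training set is used once (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of a mini-batch.
    for X, y in data_iter(batch_size, ndX, ndy):
        with autograd.record():
            l = loss(net(X, ndw, ndb), y)  # l is the loss on the mini-batch X and y.
        l.backward()  # Gradient of the mini-batch loss with respect to the model parameters.
        sgd([ndw, ndb], lr, batch_size)  # Update the parameters with mini-batch stochastic gradient descent.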
    train_l = loss(net(ndX, ndw, ndb), ndy)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

epoch 1, loss 1.252480
epoch 2, loss 0.899208
epoch 3, loss 0.700107
epoch 4, loss 0.589044
epoch 5, loss 0.527703
epoch 6, loss 0.492404
epoch 7, loss 0.473696
epoch 8, loss 0.462425
epoch 9, loss 0.456762
epoch 10, loss 0.453340
epoch 11, loss 0.451572
epoch 12, loss 0.450525
epoch 13, loss 0.450052
epoch 14, loss 0.449882
epoch 15, loss 0.449718
epoch 16, loss 0.449573
epoch 17, loss 0.449483
epoch 18, loss 0.449394
epoch 19, loss 0.449377
epoch 20, loss 0.449336
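
For comparison, the mini-batch loop can also be sketched in plain NumPy, using the hand-derived gradient in place of autograd. This is an illustrative sketch under stated assumptions: the data are regenerated with an arbitrary seed and all names are hypothetical, so the numbers will not match the MXNet output above. As in `sgd`, the summed mini-batch gradient is divided by the batch size.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
N, batch_size, lr, num_epochs = 100, 10, 0.03, 20
X_demo = rng.normal(size=(N, 2))
y_demo = X_demo @ np.array([[1.4], [2.3]]) + 4.7 + rng.standard_normal((N, 1))

w_sgd = rng.standard_normal((2, 1))
b_sgd = 0.0

def mean_half_sq_loss(X, y, w, b):
    """Half the mean squared residual over all rows of X."""
    return 0.5 * np.mean((X @ w + b - y) ** 2)

history = [mean_half_sq_loss(X_demo, y_demo, w_sgd, b_sgd)]
for epoch in range(num_epochs):
    order = rng.permutation(N)                    # reshuffle every epoch
    for i in range(0, N, batch_size):
        idx = order[i:i + batch_size]
        Xb, yb = X_demo[idx], y_demo[idx]
        err = Xb @ w_sgd + b_sgd - yb             # residuals on the mini-batch
        w_sgd = w_sgd - lr * (Xb.T @ err) / len(idx)
        b_sgd = b_sgd - lr * err.sum() / len(idx)
    history.append(mean_half_sq_loss(X_demo, y_demo, w_sgd, b_sgd))

print(history[0], history[-1])  # loss before training and after the last epoch
```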


In [410]:
ndw,ndb

(
 [[1.2964016]
  [2.4212751]]
 <NDArray 2x1 @cpu(0)>, 
 [4.629563]
 <NDArray 1 @cpu(0)>)