# ch05_Neural Networks and Fully Connected Layers

## 5.1 Logistic (Sigmoid) Regression

- for continuous output (the output is continuous, which is why it is called regression):
$$ y = x w + b$$
- for probability output:
$$ y = \sigma ( x w + b )$$
 - $\sigma$: the sigmoid (logistic) function
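
As a minimal sketch of the two formulas above, the snippet below evaluates both on one sample. The 3-feature shape, the seed, and the variable names are assumptions made for illustration; they are not from the original notes.

```python
import torch

torch.manual_seed(0)           # illustrative seed, only for reproducibility
x = torch.randn(1, 3)          # one sample with 3 features (assumed shape)
w = torch.randn(3, 1)          # weights
b = torch.zeros(1)             # bias

y_linear = x @ w + b                 # continuous output: any real number
y_prob = torch.sigmoid(x @ w + b)    # probability output: squashed into [0, 1]

print(y_linear, y_prob)
```

The only difference between the two outputs is the sigmoid applied on top of the same linear map.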

### Binary Classification:
- interpret the network as $ f : x \to p( y \mid x ; \theta) $
- output $ \in [ 0 , 1 ]$
- which is exactly where the logistic function comes in: it squashes any real-valued score into that range

### For Regression:
- Goal : $ pred = y $
- Approach : minimize $ dist(pred,y) $

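### For Classification:
- Goal : maximize a benchmark metric, e.g. accuracy
- Approach 1: minimize $ dist ( p_\theta ( y \mid x ) , p_r ( y \mid x )) $
- Approach 2: minimize $ divergence ( p_\theta ( y \mid x ) , p_r (y \mid x )) $
 - here $p_\theta$ is the distribution predicted by the model and $p_r$ is the real (label) distribution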
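
To make the two approaches concrete, here is a small sketch that compares a distance and a divergence on the same pair of distributions. The choice of squared Euclidean distance and KL divergence, as well as the numbers, are assumptions for illustration; the notes do not fix a particular `dist` or `divergence`.

```python
import math

p_r = [0.0, 1.0, 0.0]        # assumed real distribution: one-hot label for class 1
p_theta = [0.1, 0.7, 0.2]    # assumed model prediction

# Approach 1: a distance between the two distributions (squared Euclidean here)
dist = sum((q - r) ** 2 for q, r in zip(p_theta, p_r))

# Approach 2: a divergence, here KL(p_r || p_theta); terms with p_r = 0 contribute 0
kl = sum(r * math.log(r / q) for r, q in zip(p_r, p_theta) if r > 0)

print(dist, kl)
```

For a one-hot $p_r$ the KL divergence reduces to $-\log p_\theta(\text{true class})$, which is the cross entropy discussed in 5.2.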

### Why is it called logistic regression?
- "logistic": it uses the sigmoid (logistic) function
- "regression" is controversial:
 - MSE $\to$ regression
 - Cross Entropy $\to$ classification

## 5.2 Cross Entropy Loss

### Loss for classification:
- MSE
- Hinge Loss (SVM)
$$ \sum_i \max (0 , 1-y_i \cdot h_\theta(x_i)) $$
- Cross Entropy Loss
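
The hinge loss formula above can be sketched in a few lines. This assumes labels in $\{-1, +1\}$ (the usual SVM convention, not stated in the notes) and made-up scores.

```python
import torch

y = torch.tensor([1.0, -1.0, 1.0, -1.0])    # labels in {-1, +1} (assumed convention)
h = torch.tensor([2.0, -0.5, 0.3, 1.0])     # raw scores h_theta(x_i) (made-up values)

# max(0, 1 - y_i * h(x_i)) per sample, then summed over i
per_sample = torch.clamp(1 - y * h, min=0)
hinge = per_sample.sum()

print(per_sample, hinge)
```

Only samples that are misclassified or inside the margin ($y_i \cdot h_\theta(x_i) < 1$) contribute to the sum.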

### What does entropy mean?
- Uncertainty
 - a measure of surprise
- higher entropy = more uncertainty = less information
$$ Entropy = - \sum_i P(i) \log P(i) $$
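
A quick sketch of the entropy formula on three distributions. It assumes log base 2, so the result is in bits (the formula above does not fix the base), and the helper name `entropy` and the example distributions are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])      # maximally uncertain
skewed = entropy([0.1, 0.1, 0.1, 0.7])           # one outcome is likely
peaked = entropy([0.001, 0.001, 0.001, 0.997])   # outcome is almost certain

print(uniform, skewed, peaked)
```

The uniform distribution has the highest entropy, and the value shrinks toward 0 as the outcome becomes more predictable.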

### Binary Classification
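Because this is binary classification, $P(dog) = 1-P(cat)$ and $Q(dog) = 1-Q(cat)$. Writing $y = P(cat)$ and $p = Q(cat)$:
$$
\begin{aligned}
H(P,Q) &= - \sum_{i \in \{cat,\,dog\}} P(i)\log Q(i) \\
&= -P(cat)\log Q(cat) - P(dog)\log Q(dog) \\
&= -P(cat)\log Q(cat) - (1-P(cat))\log (1-Q(cat)) \\
&= -\big(y\log (p) + (1-y)\log (1-p)\big)
\end{aligned}
$$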
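
As a sanity check on the last line of the derivation, the sketch below evaluates it by hand and compares it with `F.binary_cross_entropy`. The labels and predictions are made-up values.

```python
import torch
from torch.nn import functional as F

y = torch.tensor([1.0, 0.0, 1.0])    # true labels (made up): 1 = cat, 0 = dog
p = torch.tensor([0.9, 0.2, 0.6])    # predicted P(cat) (made up)

# -(y log p + (1 - y) log(1 - p)), averaged over the samples
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = F.binary_cross_entropy(p, y)   # default reduction is the mean

print(manual, builtin)
```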

### Why not use MSE for classification?
- sigmoid + MSE $\to$ vanishing gradients (a saturated sigmoid has a derivative close to 0)
- converges more slowly
- But sometimes MSE is still worth trying
 - e.g. meta-learning
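
The vanishing-gradient point can be illustrated with autograd. This sketch assumes a single confidently wrong logit of -8 with target 1 (values chosen only for illustration) and compares the gradient of sigmoid + MSE with that of cross entropy on the same logit.

```python
import torch
from torch.nn import functional as F

target = torch.tensor([1.0])

# A confidently wrong logit: sigmoid(-8) is about 0.0003 while the target is 1
z_mse = torch.tensor([-8.0], requires_grad=True)
F.mse_loss(torch.sigmoid(z_mse), target).backward()

z_ce = torch.tensor([-8.0], requires_grad=True)
F.binary_cross_entropy_with_logits(z_ce, target).backward()

print('grad with sigmoid + MSE:', z_mse.grad)
print('grad with cross entropy:', z_ce.grad)
```

With MSE the gradient carries an extra factor $\sigma(z)(1-\sigma(z))$, which is tiny when the sigmoid saturates; with cross entropy the gradient is $\sigma(z) - y$, which stays large when the prediction is badly wrong.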

In [1]:
import torch
import numpy as np
from torch.nn import functional as F
from torch import optim
from torch import nn
import torchvision

In [2]:
# Numerical Stability
x = torch.rand(1,784)
w = torch.rand(10,784)
logits = x@w.t()
print('x.shape: ',x.shape,'\nw.shape: ',w.shape,
      '\nlogits = x@w.t(),logits.shape: ',logits.shape)

pred = torch.softmax(logits,dim=1)
print('pred = softmax(logits,dim=1) :\n',pred)
pred_log = torch.log(pred)
print('log(pred):',pred_log)

# F.cross_entropy takes raw logits and applies log-softmax internally
loss = F.cross_entropy(logits,torch.tensor([3]))
print('Method 1: computed directly with F.cross_entropy(logits,torch.tensor([3])):',loss)
# F.nll_loss expects log-probabilities
my_loss = F.nll_loss(pred_log,torch.tensor([3]))
print('Method 2: computed from pred_log, the log of the softmax output:',my_loss)
print('Method 1 takes a single step, while Method 2 computes CE = softmax -> log -> nll_loss')

x.shape:  torch.Size([1, 784]) 
w.shape:  torch.Size([10, 784]) 
logits = x@w.t(),logits.shape:  torch.Size([1, 10])
pred = softmax(logits,dim=1) :
 tensor([[2.8702e-04, 4.0199e-01, 1.3270e-05, 9.6294e-04, 1.8527e-02, 2.3045e-01,
         7.3673e-03, 4.2058e-04, 3.3930e-01, 6.7891e-04]])
log(pred): tensor([[ -8.1559,  -0.9113, -11.2300,  -6.9455,  -3.9885,  -1.4677,  -4.9107,
          -7.7739,  -1.0809,  -7.2950]])
Method 1: computed directly with F.cross_entropy(logits,torch.tensor([3])): tensor(6.9455)
Method 2: computed from pred_log, the log of the softmax output: tensor(6.9455)
Method 1 takes a single step, while Method 2 computes CE = softmax -> log -> nll_loss
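
The two methods agree on the moderate logits used in this cell. The sketch below suggests why the one-step version is the numerically safer choice, using deliberately extreme logits (values chosen for illustration, not from the original run).

```python
import torch
from torch.nn import functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])   # extreme values, for illustration only
target = torch.tensor([2])

# softmax underflows to exactly 0 for the target class, so its log is -inf
naive = F.nll_loss(torch.log(torch.softmax(logits, dim=1)), target)

# log_softmax works in log space and never forms that 0
stable = F.nll_loss(F.log_softmax(logits, dim=1), target)

# cross_entropy fuses log_softmax and nll_loss
fused = F.cross_entropy(logits, target)

print(naive, stable, fused)
```

Taking the log of an already-underflowed probability loses the information, while the fused call keeps the loss finite.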


## 5.3 Multi-class Classification in Practice

In [None]:
from visdom import Visdom

learning_rate = 1e-2
epochs = 10
batch_size = 64

train_load = torch.utils.data.DataLoader(torchvision.datasets.MNIST(
    '../data/',
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307, ), (0.3081, ))
    ])),
    batch_size=batch_size,
    shuffle=True)

test_load = torch.utils.data.DataLoader(torchvision.datasets.MNIST(
    '../data/',
    train=False,
    transform=torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307, ), (0.3081, ))])),
            batch_size=batch_size,
            shuffle=True)


w1,b1 = torch.randn(200,784,requires_grad=True),\
        torch.zeros(200,requires_grad=True)
w2,b2 = torch.randn(200,200,requires_grad=True),\
        torch.zeros(200,requires_grad=True)
w3,b3 = torch.randn(10,200,requires_grad=True),\
        torch.zeros(10,requires_grad=True)

nn.init.kaiming_normal_(w1)
nn.init.kaiming_normal_(w2)
nn.init.kaiming_normal_(w3)

def forward(x):
    x = x@w1.t() + b1
    x = F.relu(x)
    x = x@w2.t() + b2
    x = F.relu(x)
    x = x@w3.t() + b3
    # Note: nn.CrossEntropyLoss expects raw logits; this final ReLU clamps negative logits to 0
    x = F.relu(x)
    return x


optimizer = optim.SGD([w1,b1,w2,b2,w3,b3],lr=learning_rate)
criteon = nn.CrossEntropyLoss()
global_step = 0
global_test_step = 0

# Requires a running Visdom server (python -m visdom.server)
vis = Visdom()
vis.line([0.],[0.],win='train_loss',opts=dict(title='train_loss'))
vis.line([[0.0,0.0]],[0.],win='test',opts=dict(title='test loss&acc.',legend=['loss','acc.']))

for epoch in range(epochs):
    for batch_idx, (data,target) in enumerate(train_load):
        data = data.view(-1,28*28)
        logits = forward(data)
        loss = criteon(logits,target)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch_idx%100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.
                 format(epoch,batch_idx*len(data),len(train_load.dataset),
                       100. *batch_idx/len(train_load),loss.item()))
        global_step += 1
        vis.line([loss.item()],[global_step],win='train_loss',update='append')
            
    test_loss = 0
    correct = 0
    for data,target in test_load:
        
        vis.images(data.view(-1,1,28,28),win='x')
        
        data = data.view(-1,28*28)
        logits = forward(data)
        test_loss += criteon(logits,target).item()
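        pred = logits.data.max(1)[1]
        # .item() keeps `correct` a plain Python int instead of a tensor
        correct += pred.eq(target.data).sum().item()
        
        global_test_step += 1
        # plots the running (cumulative) test loss and accuracy after each test batch
        vis.line([[test_loss,correct/len(test_load.dataset)]],[global_test_step],win='test',update='append')
        vis.text(str(pred.detach().cpu().numpy()),win='pred',opts=dict(title='pred'))
    
    # criteon returns a per-batch mean, so this is the sum of batch means divided by the sample count
    test_loss /= len(test_load.dataset)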
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.
         format(test_loss,correct,len(test_load.dataset),100. *correct/len(test_load.dataset)))
    

Setting up a new session...







Test set: Average loss: 0.0071, Accuracy: 8427/10000 (84%)


Test set: Average loss: 0.0064, Accuracy: 8538/10000 (85%)


Test set: Average loss: 0.0060, Accuracy: 8597/10000 (86%)


Test set: Average loss: 0.0056, Accuracy: 8659/10000 (87%)


Test set: Average loss: 0.0054, Accuracy: 8708/10000 (87%)


Test set: Average loss: 0.0053, Accuracy: 8725/10000 (87%)


Test set: Average loss: 0.0052, Accuracy: 8755/10000 (88%)


Test set: Average loss: 0.0051, Accuracy: 8770/10000 (88%)



## 5.4 Fully Connected Layers

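- x = F.relu(x,inplace = True)
 - inplace = True performs the operation in place, overwriting the input tensor instead of allocating a new one, which can save about half the memory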
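
A minimal sketch of the same 784 → 200 → 200 → 10 network from 5.3, written with `nn.Linear` layers instead of hand-managed weight tensors. The class name `MLP` is hypothetical, and unlike the hand-written `forward` in 5.3 this version leaves the last layer without a ReLU so that it returns raw logits.

```python
import torch
from torch import nn

class MLP(nn.Module):   # hypothetical class name, for illustration
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 200),     # fully connected layer: holds its own weight and bias
            nn.ReLU(inplace=True),   # overwrites its input instead of allocating a new tensor
            nn.Linear(200, 200),
            nn.ReLU(inplace=True),
            nn.Linear(200, 10),      # raw logits for the 10 classes
        )

    def forward(self, x):
        return self.model(x)

net = MLP()
out = net(torch.randn(4, 784))   # a batch of 4 flattened 28x28 images (random data)
print(out.shape)
```

With this layout the optimizer can be built from `net.parameters()` rather than an explicit list of tensors.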

## 5.5 Visdom Visualization
- Install:
 - python -m pip install --upgrade pip
 - python -m pip install visdom
- Start the server:
 - python -m visdom.server