# Weight initialization

- Why good initialization?
- RBM / DBN
- Xavier / He initialization
- Code: mnist_nn_xavier
- Code: mnist_nn_deep

Weight initialization matters more than you might expect.

### Need to set the initial weight values wisely

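- __Not all 0's__  
    A neural network is updated with backpropagation:
    the gradient of each layer is computed with the chain rule and used for the update.
    If every weight is 0, the gradient passed back through a layer is multiplied by those weights and becomes 0, so the earlier layers receive no learning signal.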
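
The effect can be checked by hand on a tiny network. The sketch below is only an illustration, not the lab's code: the helper `grads`, the 2-2-1 layer sizes, the squared-error loss, and the sample values are all assumptions chosen to keep the arithmetic visible.

```python
def grads(W1, b1, W2, b2, x, t):
    """Manual backprop for x -> ReLU(W1 x + b1) -> W2 h + b2, loss = 0.5 * (y - t)^2."""
    # forward pass
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, zi) for zi in z]
    y = sum(w * hi for w, hi in zip(W2, h)) + b2

    # backward pass (chain rule)
    dy = y - t                                   # dL/dy
    dW2 = [dy * hi for hi in h]                  # dL/dW2
    db2 = dy                                     # dL/db2
    dh = [w * dy for w in W2]                    # gradient passed back through W2
    dz = [dhi if zi > 0 else 0.0 for dhi, zi in zip(dh, z)]  # ReLU gate
    dW1 = [[dzi * xi for xi in x] for dzi in dz]  # dL/dW1
    return dW1, dW2, db2


x, t = [1.0, 2.0], 1.0

# all weights and biases set to 0
zero_dW1, zero_dW2, zero_db2 = grads(
    [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0, x, t)

# small non-zero weights
rand_dW1, rand_dW2, rand_db2 = grads(
    [[0.5, -0.2], [0.3, 0.8]], [0.0, 0.0], [0.7, -0.4], 0.0, x, t)

print(zero_dW1, zero_dW2, zero_db2)
print(rand_dW1, rand_dW2, rand_db2)
```

With the all-zero setting every weight gradient is 0 and only the output bias receives a non-zero gradient, so the weights never move away from 0. With the non-zero setting every entry of `dW1` is non-zero.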

### RBM ( Restricted Boltzmann Machine )

![RBM](https://miro.medium.com/max/1760/1*ZY4c980_7MfEMYTIi6jvTw.png)
Given an input x, a forward pass produces Y,  
and conversely, given Y, a backward pass reconstructs x as x'.


- Restricted : no connections within a layer  
    There are no connections between units inside a layer.  
    Units in different layers are fully connected.  
- KL divergence : compare actual to recreation
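
The forward and backward passes and the comparison between x and x' can be sketched in plain Python. This is a rough illustration under stated assumptions, not an RBM implementation: the function names, the sigmoid units, the 4-visible/3-hidden sizes, and the random weights are made up, training is omitted, and treating the normalized x and x' as two distributions for the KL divergence is a simplification.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b_hidden):
    # x -> Y: each hidden unit sees every visible unit (fully connected across layers)
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b_hidden)]


def backward(y, W, b_visible):
    # Y -> x': the same weights W are reused in the opposite direction
    return [sigmoid(sum(W[j][i] * y[j] for j in range(len(y))) + b_visible[i])
            for i in range(len(b_visible))]


def normalize(v):
    total = sum(v)
    return [vi / total for vi in v]


def kl_divergence(p, q):
    # KL(p || q); terms with p_i = 0 contribute nothing
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(777)
x = [1.0, 0.0, 1.0, 1.0]                                    # visible layer
W = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(3)]  # 3 hidden units
y = forward(x, W, [0.0, 0.0, 0.0])
x_rec = backward(y, W, [0.0, 0.0, 0.0, 0.0])
kl = kl_divergence(normalize(x), normalize(x_rec))
print(y, x_rec, kl)
```

Note that there is no weight between two units of the same layer anywhere in the code, which is the "restricted" part. A smaller `kl` means the recreation x' is closer to the actual input x.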

### Deep Belief Network

In [2]:
# Lab 10 MNIST and softmax
import torch
import torchvision.datasets as dsets
import torchvision.transforms as transforms
import random

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# for reproducibility
random.seed(777)
torch.manual_seed(777)
if device == 'cuda':
    torch.cuda.manual_seed_all(777)

In [5]:
# parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100

In [6]:
# MNIST dataset
mnist_train = dsets.MNIST(root='MNIST_data/',
                          train=True,
                          transform=transforms.ToTensor(),
                          download=True)

mnist_test = dsets.MNIST(root='MNIST_data/',
                         train=False,
                         transform=transforms.ToTensor(),
                         download=True)

In [7]:
# dataset loader
data_loader = torch.utils.data.DataLoader(dataset=mnist_train,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          drop_last=True)

In [8]:
# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True)
relu = torch.nn.ReLU()

In [10]:
# xavier initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)

Parameter containing:
tensor([[-0.0039, -0.0166, -0.0415,  ...,  0.1375, -0.0556, -0.0868],
        [-0.0806,  0.0315, -0.1408,  ..., -0.1330, -0.0807,  0.0600],
        [ 0.1287,  0.0014, -0.0580,  ...,  0.0508, -0.0500, -0.1308],
        ...,
        [-0.1004, -0.0187, -0.0075,  ..., -0.0087, -0.0063,  0.0588],
        [ 0.1414,  0.1065,  0.1057,  ...,  0.1154, -0.0272, -0.0432],
        [-0.1199, -0.0441, -0.1109,  ..., -0.0223,  0.0153,  0.0347]],
       requires_grad=True)
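
For reference, `xavier_uniform_` with its default gain draws each weight from a uniform distribution on [-a, a] with a = sqrt(6 / (fan_in + fan_out)). He initialization, the other scheme named in the outline, depends on fan_in only; its uniform variant uses sqrt(6 / fan_in). The sketch below just evaluates these two formulas for the three layers of this lab; the helper names are hypothetical and not part of PyTorch.

```python
import math


def xavier_uniform_bound(fan_in, fan_out):
    # Xavier (Glorot) uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_uniform_bound(fan_in):
    # He (Kaiming) uniform for ReLU: U(-a, a) with a = sqrt(6 / fan_in)
    return math.sqrt(6.0 / fan_in)


# (fan_in, fan_out) of the layers defined above
layers = {"linear1": (784, 256), "linear2": (256, 256), "linear3": (256, 10)}

xavier_bounds = {name: xavier_uniform_bound(i, o) for name, (i, o) in layers.items()}
he_bounds = {name: he_uniform_bound(i) for name, (i, _) in layers.items()}

for name in layers:
    print(name, round(xavier_bounds[name], 4), round(he_bounds[name], 4))
```

The tensor printed above is the return value of the last call, so it is `linear3.weight`. Its Xavier bound is about 0.150, and the displayed entries all lie inside that range. To try He initialization in the notebook instead, `torch.nn.init.kaiming_uniform_(linear1.weight, nonlinearity='relu')` should be the corresponding call.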

In [11]:
# model
model = torch.nn.Sequential(linear1, relu, linear2, relu, linear3).to(device)

In [12]:
# define cost/loss & optimizer
criterion = torch.nn.CrossEntropyLoss().to(device)    # Softmax is internally computed.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
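
"Softmax is internally computed" means that `CrossEntropyLoss` takes the raw scores of `linear3` together with integer class labels, which is why the model has no softmax layer and the labels are not one-hot encoded. The per-sample computation can be sketched in plain Python; `cross_entropy` here is a hypothetical helper written for illustration, and the scores are made up.

```python
import math


def cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed with the log-sum-exp trick."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# ten equal scores: the model has no preference among the 10 digits
loss_uniform = cross_entropy([0.0] * 10, 3)

# a large score on the correct class gives a loss close to 0
confident = [0.0] * 10
confident[3] = 10.0
loss_confident = cross_entropy(confident, 3)

print(loss_uniform, loss_confident)
```

`loss_uniform` equals log(10), roughly 2.30, which is the cost to expect from a 10-class model that has not learned anything yet. By default `CrossEntropyLoss` averages this per-sample value over the batch.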

In [None]:

total_batch = len(data_loader)
for epoch in range(training_epochs):
    avg_cost = 0

    for X, Y in data_loader:
        # reshape input image into [batch_size by 784]
        # label is not one-hot encoded
        X = X.view(-1, 28 * 28).to(device)
        Y = Y.to(device)

        optimizer.zero_grad()
        hypothesis = model(X)
        cost = criterion(hypothesis, Y)
        cost.backward()
        optimizer.step()

        avg_cost += cost.item() / total_batch  # .item() keeps a plain float instead of a tensor tied to the graph

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))

print('Learning finished')

Epoch: 0001 cost = 0.243988752
Epoch: 0002 cost = 0.091930427
Epoch: 0003 cost = 0.059369326
Epoch: 0004 cost = 0.042475790
Epoch: 0005 cost = 0.033861812
Epoch: 0006 cost = 0.025505863
Epoch: 0007 cost = 0.022579098
Epoch: 0008 cost = 0.017337879
Epoch: 0009 cost = 0.016488504
Epoch: 0010 cost = 0.014489295
Epoch: 0011 cost = 0.014997463
Epoch: 0012 cost = 0.010930263


Compared with the earlier versions, the only change here is the weight initialization, and that alone is enough to improve accuracy.
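
The log above reports the cost only. Accuracy is the fraction of samples whose highest-scoring class matches the label, which can be sketched as follows. The helper `accuracy` and the scores are made up for illustration and are not output of the trained model.

```python
def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += predicted == label
    return correct / len(labels)


# four samples, three classes
scores = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 3.0],
          [0.9, 0.8, 0.7]]
labels = [1, 0, 2, 2]

acc = accuracy(scores, labels)
print(acc)
```

In the notebook the same comparison could be run on `mnist_test` inside `torch.no_grad()`, for example by comparing `torch.argmax(model(X_test), 1)` with the test labels and averaging the matches.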