![](pics/header.png)

# Deep Learning: Multi-Layer Perceptron (MLP) Networks

Kevin Walchko

---

These notes come from Udacity's Deep Learning Nanodegree.

## Resources

- github: [udacity deep-learning-v2-pytorch](https://github.com/udacity/deep-learning-v2-pytorch)
- github: [deeptraffic](https://github.com/lexfridman/deeptraffic) **broken**
- github: [flappybird](https://github.com/yenchenlin/DeepLearningFlappyBird)
- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap1.html)
- [Deep Learning](https://www.deeplearningbook.org/)
- [CNN driving car in Grand Theft Auto](https://pythonprogramming.net/game-frames-open-cv-python-plays-gta-v/)
- [CS231n Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/understanding-cnn/)

## Signal to Noise

A neural net is a tool for finding a pattern. However, if there is too much noise in the training data, it can be hard or impossible for the network to do that. If your inputs are [18, 200, 0, 0.1, 2 ...], then noise associated with the large 200 term can dominate the smaller inputs and make training difficult.

So scale your inputs so that they are either binary (0 or 1) or lie in the continuous range [0, 1].
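
One common way to get inputs into [0, 1] is min-max scaling per feature. The following is a minimal sketch, assuming the features are the columns of a NumPy array; the function name `min_max_scale` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def min_max_scale(x, eps=1e-12):
    """Scale each column (feature) of x into the range [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    # eps guards against division by zero when a column is constant
    return (x - lo) / np.maximum(hi - lo, eps)

# hypothetical raw features with very different ranges
raw = np.array([[18.0, 200.0, 0.0],
                [25.0, 120.0, 0.1],
                [40.0,  80.0, 2.0]])

scaled = min_max_scale(raw)
print(scaled.min(axis=0), scaled.max(axis=0))
```

After scaling, every feature spans the same range, so no single input dominates simply because of its units.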

## Setting Initial Weights

When setting initial weights, you have a couple of options:

- **constant values:** set all initial weights to 0 or 1. However, if all of the weights in a layer are the same, every neuron in that layer computes the same output and receives the same gradient during backpropagation, so the neurons cannot learn different features and the network performs poorly.
- **normal distribution:** pick random values with `np.random.normal(mean, std, size)` where `mean` is 0, `std` is a small value based on the number of layer inputs (see the rule below) and `size` is the shape of the weight matrix (layer inputs by layer outputs). **This is generally your best option.**
- **uniform distribution:** see the discussion below, but the normal distribution is slightly better with large networks. For small networks, there is little difference between normal and uniform distributions. PyTorch's default initialization for `nn.Linear` is a scaled uniform distribution of this kind.
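
The problem with constant weights can be checked directly. The sketch below is an illustration only: the tiny network, the constant 0.5, and the random data are all hypothetical. It sets every parameter to the same value and inspects the gradient of the first layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# tiny hypothetical network: 4 inputs -> 3 hidden units -> 1 output
model = nn.Sequential(nn.Linear(4, 3), nn.Sigmoid(), nn.Linear(3, 1))

# set every weight and bias to the same constant
for p in model.parameters():
    nn.init.constant_(p, 0.5)

x = torch.randn(8, 4)   # hypothetical inputs
y = torch.randn(8, 1)   # hypothetical targets

loss = nn.MSELoss()(model(x), y)
loss.backward()

# each row of grad is the gradient for one hidden unit
grad = model[0].weight.grad
print(grad)

rows_identical = bool(torch.allclose(grad[0], grad[1]) and
                      torch.allclose(grad[1], grad[2]))
print("all hidden units get the same gradient:", rows_identical)
```

Because every hidden unit receives the same update, the units stay identical after each step, and the layer behaves like a single neuron. Random initial weights break this symmetry.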

The general rule for setting the weights in a neural network is to set them close to zero without being too small. A good range is [-y, y] where:

$$
y = \frac {1}{\sqrt{n}}
$$

where $n$ is the number of inputs to the layer. Making this change *should* enable training the NN with a larger learning rate!

```python
# Initialize weights

# These are the weights between the input layer and the hidden layer.
self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))

# These are the weights between the hidden layer and the output layer.
## NOTE: the difference in the standard deviation of the normal weights
## This was changed from `self.output_nodes**-0.5` to `self.hidden_nodes**-0.5`
self.weights_1_2 = np.random.normal(
    0.0, # mean 
    self.hidden_nodes**-0.5, # standard deviation 
    (self.hidden_nodes, self.output_nodes)) # size of random weight matrix

```

*Note:* Here, the hidden nodes act as the layer inputs ($n$ is `self.hidden_nodes`), since this code removed the input layer to reduce the memory footprint of the NN.

> Apply those weights to an initialized model using `nn.Module.apply(fn)`, which applies the function `fn` recursively to every submodule (layer) of the model.

```python
# this uses Torch functions rather than Numpy functions as seen above
import numpy as np  # only needed for np.sqrt

# normal distribution
def weights_init_normal(m):
    classname = m.__class__.__name__
    # for every Linear layer in a model..
    if classname.find('Linear') != -1:
        # get the number of the inputs
        n = m.in_features
        y = (1.0/np.sqrt(n))
        m.weight.data.normal_(0, y)
        m.bias.data.fill_(0)
        
# uniform distribution
def weights_init_uniform_rule(m):
    classname = m.__class__.__name__
    # for every Linear layer in a model..
    if classname.find('Linear') != -1:
        # get the number of the inputs
        n = m.in_features
        y = 1.0/np.sqrt(n)
        m.weight.data.uniform_(-y, y)
        m.bias.data.fill_(0)

# create a new model with these weights
# (Net is the model class defined in the "Build the Network" section)
model_rule = Net()
model_rule.apply(weights_init_uniform_rule)
```
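
As a quick sanity check, the sketch below applies the same $1/\sqrt{n}$ uniform rule and confirms that the resulting weights stay inside $[-y, y]$. It uses `isinstance` and the `nn.init` helpers rather than class-name matching and `.data`; this is an alternative style, not the course's code. The `tiny` model is a hypothetical stand-in for `Net`.

```python
import math
from torch import nn

def init_uniform_rule(m):
    # only touch fully connected layers
    if isinstance(m, nn.Linear):
        y = 1.0 / math.sqrt(m.in_features)
        nn.init.uniform_(m.weight, -y, y)
        nn.init.zeros_(m.bias)

# hypothetical stand-in for the Net class
tiny = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
tiny.apply(init_uniform_rule)

# record the bound y and the largest absolute weight for each Linear layer
report = {}
for layer in tiny:
    if isinstance(layer, nn.Linear):
        y = 1.0 / math.sqrt(layer.in_features)
        report[layer.in_features] = (y, layer.weight.abs().max().item())

for n, (y, max_abs) in report.items():
    print(f"n={n}: bound y={y:.4f}, largest |weight|={max_abs:.4f}")
```

The layer with more inputs gets a tighter range, which keeps the sum of its weighted inputs at a similar scale.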

## Multi-Layer Perceptron

Generally you use a CNN for image work, but the MNIST data set is simple enough and well conditioned enough that an MLP can be used.

- **MNIST Data:** grayscale images of handwritten digits, normalized to the range 0-1 and 28 x 28 pixels in size
    - This dataset is preprocessed and very clean; therefore, it works nicely with an MLP
    - If this data were messier, then you would likely need to use a CNN
    - Other [classifiers](http://yann.lecun.com/exdb/mnist/) on the MNIST dataset

## MLP Development Pipeline

![](pics/pipeline.png)

## Using PyTorch

<table> 
    <tr>
        <td><img src="pics/neuron.png"></td>
        <td><img src="pics/model.jpg"></td>
    </tr>
</table>

- $LayerOutput = f(xW+b)$ where: 
    - $f$ is an activation function
    - $x$ is an input from a previous layer
    - $W$ are the weights applied between neuron layers
    - $b$ is a bias added to the layer's weighted inputs
- [Activation functions](https://cs231n.github.io/neural-networks-1/#actfun)
- The final outputs of a NN are one of two things:
    - **Class scores:** raw, unnormalized outputs of the NN; the class with the highest score is the answer
    - **Class probability:** the scores converted to probabilities with `Softmax(class_score)`, or more commonly to log-probabilities with `LogSoftmax(class_score)`
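
The relationship between class scores and class probabilities can be sketched as follows. The tensor sizes and random values are hypothetical, and the final layer here has no activation function, so `scores` is simply $xW+b$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(2, 4)   # batch of 2 inputs with 4 features (hypothetical sizes)
W = torch.randn(4, 3)   # weights: 4 inputs -> 3 classes
b = torch.zeros(3)      # bias, one per class

scores = x @ W + b                          # class scores, any real value
log_probs = F.log_softmax(scores, dim=1)    # log-probabilities, all <= 0
probs = log_probs.exp()                     # probabilities, each row sums to 1

print(probs.sum(dim=1))
# the predicted class is the same whichever output you use
print(scores.argmax(dim=1), probs.argmax(dim=1))
```

Since softmax preserves the ordering of the scores, returning raw scores is enough for prediction; probabilities matter when you need calibrated confidence values or a loss function that expects them.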

## Build the Network

```python
from torch import nn
from torch.nn import functional as F
from helper import summary
```

```python
## Define the NN architecture
class Net(nn.Module):
    def __init__(self, dp=0.2):
        super(Net, self).__init__()

        # linear layers (784 inputs -> 128 hidden -> 64 hidden -> 10 output)
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

        # dropout helps prevent overfitting
        self.dropout = nn.Dropout(dp)

        # optional
        # self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)

        # NN
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)

        # get class scores as output
        x = self.fc3(x)
        # or to get log-probabilities as output
        # x = self.log_softmax(self.fc3(x))
        return x
```

```python
model = Net()
print(model)
```

```
Net(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
```


```python
summary(model)
```

```
Layer (type (var_name))                  Kernel Shape              Param #
Net                                      --                        --
├─Linear (fc1)                           [784, 128]                100,480
├─Linear (fc2)                           [128, 64]                 8,256
├─Linear (fc3)                           [64, 10]                  650
├─Dropout (dropout)                      --                        --
Total params: 109,386
Trainable params: 109,386
Non-trainable params: 0
```
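
The parameter counts in the summary can be reproduced by hand: each linear layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, where the helper name `linear_params` is hypothetical:

```python
# (inputs, outputs) for fc1, fc2, fc3 as defined in the Net class
layer_sizes = [(784, 128), (128, 64), (64, 10)]

def linear_params(n_in, n_out):
    # one weight per input-output pair plus one bias per output
    return n_in * n_out + n_out

counts = [linear_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(counts)
print(counts, total)  # [100480, 8256, 650] 109386
```

The same total should come out of `sum(p.numel() for p in model.parameters())` on the `Net` model. Dropout adds no parameters, which is why its row shows `--`.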


## Train the Network

The steps for training/learning from a batch of data:

1. Clear the gradients of all optimized variables
2. Forward pass: compute predicted outputs by passing inputs to the model
3. Calculate the loss
4. Backward pass: compute gradient of the loss with respect to model parameters
5. Perform a single optimization step (parameter update)
6. Update average training loss

- Training the NN uses two sets of data:
    - Training data: used to update weights during backpropagation
    - Validation data: used to check how well the model generalizes **but not used to update weights**
- Testing data:
    - Not seen during training but used to test the accuracy of the **trained** model
    - Holding this data out gives an honest check for overfitting, since the model cannot have memorized it
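
One way to carve a validation set out of the training data is `torch.utils.data.random_split`. The sketch below is an assumption about how the split could be done, not the course's exact code: the fake images stand in for MNIST, and the 20% validation fraction and batch size are arbitrary choices.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)

# hypothetical stand-in for the MNIST training set: 100 fake 28x28 images
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
full_train = TensorDataset(images, labels)

# hold out 20% of the training data for validation
n_valid = int(0.2 * len(full_train))
n_train = len(full_train) - n_valid
train_data, valid_data = random_split(full_train, [n_train, n_valid])

# only the training loader is shuffled; validation order does not matter
train_loader = DataLoader(train_data, batch_size=20, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=20)

print(len(train_data), len(valid_data))  # 80 20
```

The test set stays completely separate from both loaders and is only used once training is finished.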

```python
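from torch import nn, optim

# assumes `model` and `train_loader` are already defined

# specify loss function
# CrossEntropyLoss expects raw class scores (it applies log-softmax itself);
# use NLLLoss instead if the model's final layer is LogSoftmax
# criterion = nn.NLLLoss()
criterion = nn.CrossEntropyLoss()

# specify optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# optimizer = optim.SGD(model.parameters(), lr=0.003)

###################
# train the model #
###################
model.train()     # prep model for *training* (enables dropout)
train_loss = 0.0  # running training loss for this epoch

for data, target in train_loader:
    # clear the gradients of all optimized variables
    optimizer.zero_grad()
    # forward pass: compute predicted outputs by passing inputs to the model
    output = model(data)
    # calculate the loss
    loss = criterion(output, target)
    # backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()
    # perform a single optimization step (parameter update)
    optimizer.step()
    # update running training loss (loss is a batch mean, so weight by batch size)
    train_loss += loss.item()*data.size(0)

# average training loss per sample for this epoch
train_loss = train_loss/len(train_loader.sampler)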
```
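
The two loss functions mentioned in the code comments are closely related: `nn.CrossEntropyLoss` on raw class scores gives the same value as `nn.NLLLoss` on log-softmax outputs. A minimal sketch with hypothetical random scores and labels:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(5, 10)           # hypothetical raw class scores for 5 samples
target = torch.randint(0, 10, (5,))   # hypothetical true labels

# option 1: raw scores + CrossEntropyLoss
ce = nn.CrossEntropyLoss()(scores, target)

# option 2: log-probabilities + NLLLoss
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)

print(ce.item(), nll.item())
```

So the choice of loss follows from the model's output: keep `CrossEntropyLoss` when `forward` returns class scores, and switch to `NLLLoss` if you enable the optional `LogSoftmax` layer. Applying `LogSoftmax` and then `CrossEntropyLoss` would apply the normalization twice.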

## Test the Network

![](pics/mlp-mnist.png)

```python
import numpy as np
import torch

# assumes `model`, `criterion` and `test_loader` are already defined

# initialize lists to monitor test loss and accuracy
test_loss = 0.0
class_correct = [0.0] * 10
class_total = [0.0] * 10

model.eval() # prep model for *evaluation* (disables dropout)

with torch.no_grad():  # gradients are not needed for testing
    for data, target in test_loader:
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # update test loss
        test_loss += loss.item()*data.size(0)
        # convert class scores to predicted class (index of the highest score)
        _, pred = torch.max(output, 1)
        # compare predictions to true label
        correct = pred.eq(target.view_as(pred))
        # calculate test accuracy for each object class
        for label, c in zip(target, correct):
            class_correct[label.item()] += c.item()
            class_total[label.item()] += 1

# calculate and print avg test loss
test_loss = test_loss/len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
            str(i), 100 * class_correct[i] / class_total[i],
            class_correct[i], class_total[i]))
    else:
        print('Test Accuracy of %5s: N/A (no test examples)' % str(i))

print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))
```