# The Multi-Layer Perceptron

![](images/perceptron.png)


We'll start by modifying the logistic regression structure somewhat, replacing the sigmoid or softmax with a different non-linearity. A typical choice is the ReLU (rectified linear unit) function, $\mathrm{ReLU}(z) = \max(0, z)$:

![](https://miro.medium.com/max/1400/1*XxxiA0jJvPrHEJHD4z893g.png)

At this point the unit is no longer a classifier but a compute unit in a larger structure, called a perceptron. We will combine perceptrons to create a neural network called a *Multi-Layer Perceptron*.
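
As a minimal sketch of one such compute unit, the snippet below takes a weighted sum of the inputs, adds a bias, and applies ReLU. The function names and the numbers are illustrative assumptions, not part of the notebook's model.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    """One compute unit: a weighted sum of the inputs followed by ReLU."""
    return relu(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])   # illustrative input
w = np.array([0.4, 0.3, -0.2])   # illustrative weights
b = 0.1                          # illustrative bias

print(perceptron(x, w, b))       # the weighted sum is negative here, so ReLU outputs 0.0
```

Flipping the sign of the input makes the weighted sum positive, and the unit then passes that value through unchanged.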

For example, we can combine 2 perceptrons, and then repeat the process:

![](images/mlp.png)

We can combine many more:

![](images/Figure-18-032.png)
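
Stacking units in this way amounts to repeated matrix multiplication followed by the non-linearity. The sketch below assumes random weights and arbitrary layer sizes purely for illustration; `mlp_forward` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Pass x through a list of (W, b) layers, applying ReLU after each hidden layer."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    W_out, b_out = layers[-1]
    return W_out @ h + b_out   # raw output; add softmax or leave linear as the task needs

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # illustrative: 3 inputs, two hidden layers of 4 units, 2 outputs
layers = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

y = mlp_forward(np.array([1.0, -2.0, 0.5]), layers)
print(y.shape)  # (2,)
```

Making the network wider means larger entries in `sizes`; making it deeper means more entries.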

Indeed, such combinations:

- can be made both deep and wide
- buy us complex nonlinearity
- can be used for both regression and classification
- rely on a key technical advance: backpropagation with autodiff
- rely on a second key technical advance: GPUs

Why does this work? The Universal Approximation theorem says that a network with one hidden layer can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy, given an appropriate choice of nonlinearity and enough hidden units:

- under appropriate conditions, sigmoid, tanh, and ReLU all work
- but you may need lots of units
- and the network will learn the function it thinks the data has, not the one you think it has
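
To make the theorem a little more concrete, here is a rough sketch of a one-hidden-layer ReLU network approximating $\sin(x)$ on $[-\pi, \pi]$. As a simplifying assumption, the hidden weights are hand-picked so that each unit switches on at a different point, and only the output layer is fitted, by least squares rather than by gradient descent. The target function and the number of units are arbitrary choices.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# Hidden layer: 50 ReLU units, each switching on at a different knot (hand-picked, not learned).
knots = np.linspace(-np.pi, np.pi, 50, endpoint=False)
hidden = np.maximum(0.0, x[:, None] - knots[None, :])   # shape (400, 50)

# Output layer: a linear combination of hidden units plus a bias, fitted by least squares.
design = np.hstack([hidden, np.ones((x.size, 1))])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

approx = design @ coef
max_err = np.max(np.abs(approx - target))
print(f"max abs error with {knots.size} hidden units: {max_err:.4f}")
```

Reducing the number of knots makes the piecewise-linear fit visibly coarser, which is the "may need lots of units" caveat in action.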





## MLP on the MNIST dataset

In [None]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical

In [None]:
class Config:
  pass
config = Config()

In [None]:
config.optimizer = "adam"
config.epochs = 10
config.hidden_nodes = 100

# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]

X_train = X_train.astype('float32')
X_train /= 255.
X_test = X_test.astype('float32')
X_test /= 255.

# one hot encode outputs (to_categorical is imported above; np_utils is gone from recent Keras releases)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
labels = range(10)

num_classes = y_train.shape[1]

In [None]:
# create model
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
model.add(Dense(config.hidden_nodes, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=config.optimizer,
              metrics=['accuracy'])
model.summary()
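
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes the standard 28x28 MNIST image size and the 100 hidden nodes configured earlier; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_inputs = 28 * 28   # MNIST images are 28x28 pixels; Flatten itself has no parameters
n_hidden = 100       # same value as config.hidden_nodes
n_classes = 10       # digits 0-9

hidden_layer = dense_params(n_inputs, n_hidden)
output_layer = dense_params(n_hidden, n_classes)
print(hidden_layer, output_layer, hidden_layer + output_layer)  # 78500 1010 79510
```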

## A segue on SGD

SGD has a very simple idea: don't construct the loss surface from the full data. Take a **batch** of data instead, of size $n$:

$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} {\cal L}(\theta_t; x^{(i:i+n)}; y^{(i:i+n)})$$

The big idea here is that at each batch, you have a **new loss surface**.

```python
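# Schematic only: `data`, `params`, `loss_function`, `get_batches` and
# `evaluate_gradient` are placeholders, not functions from a library.
for i in range(n_epochs):
  np.random.shuffle(data)  # reshuffle so each epoch sees different batches, and hence different loss surfaces
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)  # gradient on this batch only
    params = params - learning_rate * params_grad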
```
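
For a version of this loop that actually runs, here is a small sketch of mini-batch SGD on a one-dimensional linear regression. The synthetic data, the learning rate, and the `gradient` helper are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x - 1 plus small noise.
X = rng.uniform(-1, 1, size=500)
y = 3.0 * X - 1.0 + rng.normal(scale=0.1, size=500)

def gradient(params, xb, yb):
    """Gradient of the mean squared error on one batch for the model y = w*x + b."""
    w, b = params
    residual = w * xb + b - yb
    return np.array([2 * np.mean(residual * xb), 2 * np.mean(residual)])

params = np.zeros(2)
learning_rate, batch_size, n_epochs = 0.1, 50, 50

for epoch in range(n_epochs):
    order = rng.permutation(len(X))          # reshuffle so each epoch sees different batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        params = params - learning_rate * gradient(params, X[idx], y[idx])

print(params)  # should land near [3, -1]
```

Each inner step uses only 50 points, so the gradient it follows belongs to that batch's loss surface rather than to the full-data one.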

Indeed, SGD **changes the loss surface**.

One can see this even for the convex linear regression loss, where the different surfaces look like this.

![fit, inline](images/animsgd.gif)
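
The same effect can be checked numerically. In the sketch below, which assumes synthetic data and a no-intercept model $y = wx$, each batch of 20 points has its own squared-error loss, and therefore its own minimizing slope. `best_slope` is a hypothetical helper that uses the closed-form minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): y = 2x plus noise, no intercept.
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + rng.normal(scale=0.3, size=200)

def best_slope(xb, yb):
    """Slope minimizing the squared-error loss sum((w*xb - yb)**2) on the given points."""
    return np.sum(xb * yb) / np.sum(xb ** 2)

full_minimum = best_slope(x, y)
batch_minima = [best_slope(x[i:i + 20], y[i:i + 20]) for i in range(0, 200, 20)]

print(f"full data: {full_minimum:.3f}")
print("batches:  ", np.round(batch_minima, 3))
```

The batch minima scatter around the full-data minimum, which is one way to read "a new loss surface at each batch".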

This property of SGD matters little on convex surfaces, where gradient descent always gets to the bottom anyway. But the losses of non-linear neural nets are **not convex**. Indeed, they are like the broken-up Everest, Lhotse, Nuptse valleys we saw before.

Here SGD achieves something amazing for us:


![left, fit](images/flbl.png)

Remember you want to find the best minimum of the loss.

- If the loss is not convex, simple gradient descent can get trapped in a local minimum (unless the learning rate is high).
- But if the learning rate is always high, you might skip over good minima.
- By using only some of the data you change the surface, and this may release you from the trap.
- Thus you can avoid shallow local minima and make your way towards something deeper.

SGD does not necessarily find the global minimum, but it is unreasonably effective at getting us close.

## Back to the fit

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=config.epochs)

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch'); plt.ylabel('loss')
plt.legend()

In [None]:
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch'); plt.ylabel('accuracy')
plt.legend()

After some epochs of training, the validation accuracy falls below the training accuracy, while the validation loss rises above the training loss.

This is overfitting: we do not do as well on unseen data (the validation set) as on the data we trained with. This is bad in any machine learning model, since we overpromise but under-deliver and, in the extreme case, our predictions are not robust.
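
One rough way to quantify this from the training curves is to find the epoch where the validation loss bottoms out and to measure the final gap between training and validation accuracy. The helper below is a hypothetical sketch, and the numbers fed to it are made up for illustration; they are not results from the MNIST run.

```python
import numpy as np

def overfitting_report(history):
    """Return the epoch with the lowest validation loss and the final train/validation accuracy gap."""
    val_loss = np.asarray(history["val_loss"])
    best_epoch = int(np.argmin(val_loss))
    final_gap = history["accuracy"][-1] - history["val_accuracy"][-1]
    return best_epoch, final_gap

# Made-up numbers for illustration only, NOT results from the MNIST run.
fake_history = {
    "loss":         [0.60, 0.30, 0.20, 0.12, 0.07],
    "val_loss":     [0.55, 0.32, 0.28, 0.30, 0.34],
    "accuracy":     [0.82, 0.91, 0.94, 0.96, 0.98],
    "val_accuracy": [0.84, 0.90, 0.92, 0.92, 0.91],
}

best_epoch, final_gap = overfitting_report(fake_history)
print(best_epoch, round(final_gap, 2))  # 2 0.07
```

With a real run, the dictionary to pass in would be `history.history`, which uses the same keys.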

This overfitting can be seen even more clearly in this set of images, which uses multilayer perceptrons on the 2D half-moon dataset you can generate in `sklearn`.

First the dataset followed by logistic regression:

![inline](images/halfmoonsset.png)![inline](images/mlplogistic.png)

Now consider a 1 hidden layer MLP with 2 vs 10 neurons:

![inline](images/mlp2102.png.png)![inline](images/mlp2110.png)


Finally, a 2-layer MLP with 20 neurons per layer vs a 5-layer MLP with 1000 neurons per layer:

![inline](images/mlp2220.png)![inline](images/mlp251000.png)

You can clearly see that once the model becomes too complex, it no longer does as well.
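
A comparison along these lines can be sketched with `sklearn` itself. The sample size, noise level, train/test split, and `MLPClassifier` settings below are assumptions, not the settings used to produce the figures, and the 5-layer, 1000-neuron network is left out to keep the run short.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Half-moon data; noise and sample size are illustrative choices.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "MLP, 1 layer x 2 units": MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000, random_state=0),
    "MLP, 1 layer x 10 units": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "MLP, 2 layers x 20 units": MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000, random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) so the gap between them is visible.
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"{name:26s} train={scores[name][0]:.2f} test={scores[name][1]:.2f}")
```

A widening gap between the training and test columns as the model grows is the numerical counterpart of the increasingly contorted decision boundaries in the images.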