# The multi-layer perceptron with Regularization

We use a regularization technique called **Dropout**. At every training step (each batch, not just each epoch) we randomly sever the connections to and from some fraction of the nodes in the "previous" layer.

It reduces overfitting by not allowing specific nodes to specialize in, say, "cat eye detection" at the expense of everything else. Because any node may be severed, other neurons are forced to step in.

Dropout is applied only during training. At prediction time all connections are restored, using the final weights that we computed.

![](images/Figure-20-008.png)
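To make the train-versus-predict distinction concrete, here is a minimal NumPy sketch of "inverted" dropout, the variant Keras's `Dropout` layer uses. The helper name `dropout`, the batch shape, and the seed are assumptions made for illustration; this is not the Keras implementation itself.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of units while training."""
    if not training or rate == 0.0:
        # Prediction time: every unit is used and nothing is rescaled.
        return activations
    keep_prob = 1.0 - rate
    # A fresh random mask is drawn on every call, i.e. on every training step.
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    # Scale the surviving units so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))  # hypothetical batch: 4 samples, 1000 hidden units

train_out = dropout(hidden, rate=0.1, training=True, rng=rng)
test_out = dropout(hidden, rate=0.1, training=False, rng=rng)

dropped_fraction = float(np.mean(train_out == 0.0))
print("fraction dropped during training:", dropped_fraction)
print("mean activation during training:", float(train_out.mean()))
print("unchanged at prediction time:", np.array_equal(test_out, hidden))
```

About 10% of the units are zeroed in the training pass, while the mean activation stays close to 1 because the survivors are scaled up by `1 / (1 - rate)`. That rescaling is why no extra correction is needed at prediction time.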

In [None]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.utils import to_categorical

In [None]:
class Config:
  pass
config = Config()

In [None]:
config.optimizer = "adam"
config.epochs = 10
config.hidden_nodes = 100

# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]

# scale pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32')
X_train /= 255.
X_test = X_test.astype('float32')
X_test /= 255.

# one hot encode outputs
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
labels = range(10)

num_classes = y_train.shape[1]

In [None]:
config.dropout = 0.1

In [None]:
# create model
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
# drop a fraction of the input pixels
model.add(Dropout(config.dropout))
model.add(Dense(config.hidden_nodes, activation='relu'))
# drop a fraction of the hidden activations
model.add(Dropout(config.dropout))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=config.optimizer,
              metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=config.epochs)

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.legend()

In [None]:
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()

With dropout added, the model now does much better.

## TECHNIQUE

1. Normalize the data and initialize the coefficients.
2. Increase complexity until you overfit: add more layers, or more nodes per Dense layer.
3. Increase regularization until you no longer overfit. Do this via dropout and weight decay (Ridge regularization, more later).
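As a preview of the weight-decay half of step 3, the sketch below adds a Ridge (L2) penalty `lam * sum(w**2)` to the loss and shows its effect on one gradient-descent step. The function names, learning rate, and penalty strength are all hypothetical values chosen for illustration, not settings taken from the model above.

```python
import numpy as np

def ridge_penalty(weights, lam):
    """Ridge (L2) penalty that is added to the data loss."""
    return lam * np.sum(weights ** 2)

def sgd_step(weights, data_grad, lr, lam):
    """One gradient step on data loss + Ridge penalty.

    The gradient of lam * sum(w**2) with respect to w is 2 * lam * w,
    so every step also pulls each weight toward zero ("decay").
    """
    return weights - lr * (data_grad + 2.0 * lam * weights)

weights = np.array([1.0, -2.0])   # hypothetical weights
zero_grad = np.zeros_like(weights)  # pretend the data loss is flat here

penalty = ridge_penalty(weights, lam=0.5)
decayed = sgd_step(weights, zero_grad, lr=0.1, lam=0.5)

print("penalty:", penalty)
print("weights after one step:", decayed)
```

Even with a zero data gradient, each weight shrinks by the factor `1 - 2 * lr * lam` (0.9 with these values), which is where the name "weight decay" comes from. Larger `lam` means stronger regularization.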