# Regularization and Optimization in Neural Networks

## Regularization of NNs

Does regularization make sense in the context of neural networks?

Yes! We still have all of the salient ingredients: a loss function, overfitting vs. underfitting, and coefficients (weights) that could get too large.

But there are now a few more flavors besides L1 and L2 regularization. (Note that L1 regularization is not common in the context of neural networks.)
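To make the "weights that could get too large" point concrete, here is a minimal NumPy sketch of an L2 penalty added to a mean-squared-error loss. The helper name `l2_penalized_mse` and the strength `lam` are illustrative assumptions, not Keras functions; Keras applies an analogous per-layer penalty through `kernel_regularizer`.

```python
import numpy as np

def l2_penalized_mse(y_true, y_pred, weights, lam=0.01):
    """MSE plus lam * (sum of squared weights). Hypothetical helper for illustration."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    # Sum the squared entries of every weight matrix in the network
    penalty = lam * sum(np.sum(np.square(w)) for w in weights)
    return mse + penalty

# Perfect predictions, so any remaining loss comes from the penalty alone
y_true = np.array([5.0, 6.0, 7.0])
y_pred = np.array([5.0, 6.0, 7.0])

small = [np.array([[0.1, -0.2], [0.3, 0.0]])]
large = [10 * small[0]]  # same weights, ten times bigger

loss_small = l2_penalized_mse(y_true, y_pred, small)
loss_large = l2_penalized_mse(y_true, y_pred, large)
print(loss_small, loss_large)
```

Scaling the weights by 10 scales the penalty by 100, which is why the optimizer is pushed toward smaller weights when the fit is otherwise equal.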

In [23]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot as plt
%matplotlib inline

In [14]:
wine = pd.read_csv('wine.csv')
wine.head()

In [3]:
X = wine.drop('quality', axis=1)
y = wine.quality

In [15]:
wine['quality'].value_counts()

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=11, stratify=y)
ss = StandardScaler()
# Keep the standardized features as floats: casting them to an integer
# type would truncate nearly every value to -1, 0, or 1.
X_train_s = ss.fit_transform(X_train)
X_test_s = ss.transform(X_test)

In [16]:
model = Sequential()

n_input = X_train_s.shape[1]

model.add(Dense(n_input, activation='relu'))
model.add(Dense(1))

In [17]:
# Accuracy is not meaningful for a continuous target, so track mean absolute error
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

By the way, here is a helpful blog post that walks carefully through a list of similarly named activation functions and loss functions: https://gombru.github.io/2018/05/23/cross_entropy_loss/

In [18]:
history = model.fit(X_train_s, np.array(y_train),
                    validation_data=(X_test_s, np.array(y_test)),
                    epochs=30, batch_size=None)

In [19]:
for layer in model.layers:
    print(layer.get_weights())

In [20]:
plt.plot(history.history['loss'], label='Train loss')
plt.plot(history.history['val_loss'], label='Test loss')
plt.legend();

In [21]:
# Total absolute error over the test set (lower is better)
sum(abs(np.array(model.predict(X_test_s).T) - np.array(y_test))[0])

## Adding Regularization

Here's a helpful review article on regularization techniques: https://towardsdatascience.com/regularization-in-machine-learning-connecting-the-dots-c6e030bfaddd

In [36]:
from keras import regularizers

In [53]:
model_r = Sequential()

n_input = X_train_s.shape[1]

model_r.add(Dense(n_input, activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)))
model_r.add(Dense(1,
                  kernel_regularizer=regularizers.l2(0.01)))

model_r.compile(optimizer='adam', loss='mean_squared_error')

history_r = model_r.fit(X_train_s, np.array(y_train),
                        validation_data=(X_test_s, np.array(y_test)),
                        epochs=42, batch_size=None)

In [54]:
plt.plot(history_r.history['loss'], label='Training loss')
plt.plot(history_r.history['val_loss'], label='Testing loss')
plt.legend();

#### Examining Our Predictions

In [55]:
model_r.predict(X_test_s[:10]).round(2)

In [56]:
y_test[:10]

In [57]:
model_r.predict(X_test_s).T

In [58]:
(np.array(model_r.predict(X_test_s).T) - np.array(y_test))[0]

In [59]:
sum(abs(np.array(model_r.predict(X_test_s).T) - np.array(y_test))[0])

### Dropout

Here's a new regularization idea: Turn some neurons off during training. We'll assign each neuron a probability of 'dropout' and then let chance decide, batch by batch, which ones sit out.

$\rightarrow$ Why is this a good idea? *Is* it a good idea?

Was this sort of regularization available to us before? Why (not)?
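As a rough illustration of what "turning neurons off" means, here is a minimal NumPy sketch of inverted dropout, the variant in which surviving activations are rescaled so their expected value is unchanged. The function `dropout_forward` is a hypothetical helper written for this note, not the Keras implementation; the rate of 0.2 matches the `Dropout(0.2)` layer used in this section.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Zero each unit with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        # At prediction time every neuron stays on
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit survives
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 11))  # stand-in for one hidden layer's activations

dropped = dropout_forward(a, rate=0.2, rng=rng)
unchanged = dropout_forward(a, rate=0.2, rng=rng, training=False)

print(np.unique(dropped))  # only zeros and the rescaled value survive
print(dropped.mean())      # close to the original mean of 1
```

Because a different random mask is drawn on every call, no single neuron can be relied on, which discourages the network from leaning too heavily on any one weight.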

In [44]:
from keras.layers import Dropout

In [60]:
model_d = Sequential()

n_input = X_train_s.shape[1]

model_d.add(Dense(n_input, activation='relu'))
model_d.add(Dropout(0.2))
model_d.add(Dense(1))

model_d.compile(optimizer='adam', loss='mean_squared_error')

history_d = model_d.fit(X_train_s, np.array(y_train),
                        validation_data=(X_test_s, np.array(y_test)),
                        epochs=42, batch_size=None)

In [61]:
plt.plot(history_d.history['loss'], label='Training loss')
plt.plot(history_d.history['val_loss'], label='Testing loss')
plt.legend();

In [47]:
# history_d.history['acc'][-1], history_d.history['val_acc'][-1]

In [62]:
sum(abs(np.array(model_d.predict(X_test_s).T) - np.array(y_test))[0])

### Early Stopping

Another idea is to terminate the training process early, before the pre-specified number of epochs has run, once the validation loss stops improving.

$\rightarrow$ Why is this a good idea? *Is* it a good idea?

Was this sort of regularization available to us before? Why (not)?
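The stopping rule itself can be sketched in plain Python. This is a simplified illustration under stated assumptions: `stopping_epoch` is a hypothetical helper, the validation losses are made-up numbers, and the real Keras `EarlyStopping` callback has further options (such as `mode` and `restore_best_weights`) that are not modeled here.

```python
def stopping_epoch(val_losses, patience=0, min_delta=0.0):
    """Return the 1-based epoch at which training would halt (hypothetical helper)."""
    best = float("inf")
    wait = 0  # consecutive epochs without a meaningful improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # rule never triggered, so every epoch ran

# Made-up validation losses: improving, then a small bump, then drifting upward
val_losses = [0.90, 0.70, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69]

stop_strict = stopping_epoch(val_losses, patience=0)
stop_patient = stopping_epoch(val_losses, patience=2)
print(stop_strict, stop_patient)
```

With `patience=0` the first bump ends training, even though a better loss was still to come; a little patience lets the run get past that noise.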

In [49]:
from keras.callbacks import EarlyStopping

In [63]:
model_es = Sequential()

n_input = X_train_s.shape[1]
n_hidden = n_input

model_es.add(Dense(n_hidden, input_dim=n_input, activation='relu'))
model_es.add(Dense(1))

model_es.compile(optimizer='adam', loss='mean_squared_error')

early_stop = EarlyStopping(monitor='val_loss', min_delta=1e-08, patience=0, verbose=1,
                           mode='auto')

callbacks_list = [early_stop]

history_es = model_es.fit(X_train_s, np.array(y_train),
                          validation_data=(X_test_s, np.array(y_test)),
                          epochs=40, batch_size=None, callbacks=callbacks_list)

In [64]:
plt.plot(history_es.history['loss'], label='Training loss')
plt.plot(history_es.history['val_loss'], label='Testing loss')
plt.legend();

In [65]:
sum(abs(np.array(model_es.predict(X_test_s).T) - np.array(y_test))[0])

## Exercise

Build your own network *with some sort of regularization built in* to predict digits using sklearn's `load_digits` dataset!

The imports you need are in the next cell.

Here are a couple of hints and leading questions:

1. You'll need to use `to_categorical()` on your target. (What does this function do?)
2. What should your output layer look like? How many neurons should it have and what should your activation function be there?
3. When we compile this network, what loss function should we use?
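As a nudge on the first hint, here is a small NumPy sketch of one-hot encoding, the kind of transformation `to_categorical()` performs on integer class labels. The function `one_hot` is a hypothetical stand-in written for this note, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 in the label's column."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    # Row i gets a 1 in column labels[i]
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([0, 2, 9], num_classes=10)
print(encoded.shape)
print(encoded[1])
```

The shape of the encoded target is a useful clue for the second hint.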

In [None]:
from sklearn.datasets import load_digits
from keras.utils import to_categorical

In [53]:
data = load_digits()
print(data.data)
print(data.target)
print(data.DESCR)

In [54]:
plt.matshow(data.images[0]);