In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
matplotlib.rcParams['figure.dpi'] = 100

In [None]:
from pylib.draw_nn import draw_neural_net_fig
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

In [None]:
#Start Spark with BigDL support
from pyspark import SparkContext
import bigdl
import bigdl.util.common
sc = SparkContext.getOrCreate(conf=bigdl.util.common.create_spark_conf().setMaster("local[3]")
                              .set("spark.driver.memory","2g"))
bigdl.util.common.init_engine()

<!-- requirement: images/neuron.svg -->

# Neural Networks

Neural networks are the current leading (often only) choice for doing deep learning.  They are conceptually modeled after how brains work&mdash;each neuron is fairly simple, but they can be combined to perform an enormous variety of tasks.  To specialize to a particular task, we allow the network to learn how strongly to connect the individual neurons in the network.  There has been ongoing research on them since the 1940's, but they weren't useful for practical applications until rather recently, thanks to advances in computing power.  

## Simplest network: a neuron

In [None]:
draw_neural_net_fig([4, 1])

This will be our way of displaying neural network topologies, and is the most common, so we should take a moment to explain it.  Our neuron is the orange circle, and it is getting four inputs.  They're coming from our data, so we'll call them $x_0$, $x_1$, $x_2$, and $x_3$.  The neuron then does a weighted sum with a bias, to get its internal value
$$y = \sum_i W_i x_i + b = \vec{W}\cdot\vec{x} + b$$
Then we route this through an activation function to get our output
$$\mathrm{out} = f(y) = f(\vec{W}\cdot\vec{x} + b)$$

This should look very familiar - if that activation function is the identity (i.e. it does nothing), we get linear regression.  If it's a sigmoid, we get logistic regression.
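
As a minimal sketch of the two formulas above, the following NumPy snippet evaluates a single neuron with an identity activation and with a sigmoid activation.  The function names and the input, weight, and bias values are hypothetical, chosen only for illustration.

```python
import numpy as np

def neuron(x, W, b, activation=lambda y: y):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(W, x) + b)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical inputs and parameters, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([0.5, -0.25, 0.0, 0.25])
b = -1.0

linear_out = neuron(x, W, b)             # identity activation: linear regression
logistic_out = neuron(x, W, b, sigmoid)  # sigmoid activation: logistic regression
```

With these particular values the weighted sum plus bias is zero, so the sigmoid neuron sits exactly at its midpoint of 0.5.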

Why is this called a neuron?  Its structure mimics that of a biological neuron:

![neuron](images/neuron.svg)
<!-- Copyright Quasar Jarosz.  Distributed under the CC Attribution-Share Alike 3.0 Unported Licence.  https://commons.wikimedia.org/wiki/File:Neuron_Hand-tuned.svg -->

It receives many inputs from other neurons via its dendrites, then makes a decision as to whether to send a pulse along its axon (to other neurons) or not based on what it gets in.  Note the word pulse there - a real neuron basically fires or doesn't, whereas we're allowing ours to take on a continuous range of values.

## Hidden Layers

Previously, we made a simple logistic regressor for three classes.  While this worked fine for the problem we had, it clearly had limits - it could only draw three lines in the space.  We will now look at a more complicated example, related to the famous XOR problem pointed out in 1969 by Marvin Minsky and Seymour Papert.  We would reproduce this problem exactly, but it is fundamentally a two-class problem, and currently BigDL's support for two-class classification (via the `SoftMarginCriterion`) has a bug in it.

In [None]:
centers = np.array([[0, 0]] * 100 + [[1, 1]] * 100
                   + [[0, 1]] * 100 + [[1, 0]] * 100 + [[0.5,0.5]] * 200)
np.random.seed(42)
data_in = np.random.normal(0, 0.1, (600, 2)) + centers
labels_in = np.array([[1]] * 200 + [[2]] * 200 + [[3]] * 200)

#Shuffle the data to prevent the ordered inputs from biasing the minibatches
idx = list(range(len(labels_in)))
np.random.shuffle(idx)
data = data_in[idx]
labels = labels_in[idx]

plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg)
plt.colorbar();

Let's try our three-class logistic regression again.

In [None]:
from bigdl.util.common import Sample

data_with_labels = zip(data, labels)
#np.random.shuffle(data_with_labels)
samples = sc.parallelize(data_with_labels).map(lambda x: Sample.from_ndarray(x[0],x[1]))

In [None]:
%%time
from bigdl.nn import layer
from bigdl.nn import criterion 
from bigdl.optim import optimizer

model = layer.Sequential()
model.add(layer.Linear(2,3))


fitter = optimizer.Optimizer(model=model, training_rdd=samples, criterion=criterion.CrossEntropyCriterion(), 
                     optim_method=optimizer.Adam(), end_trigger=optimizer.MaxEpoch(100), 
                                batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "simple_linear_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

trained_model = fitter.optimize()

In [None]:
def get_accuracy(predicts, trues):
    return sum([int(predicts[i] == trues[i]) for i in range(len(predicts))]) * 1.0 / len(trues)
predictions = [x.argmax() + 1 for x in trained_model.predict(samples).collect()]
get_accuracy(predictions, [x for x in labels])

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1])

In [None]:
#Set up a 100x100 grid of points across our space, predict on those to get our background coloring
mesh = np.column_stack([a.reshape(-1) for a in np.meshgrid(np.r_[-0.5:1.6:100j], np.r_[-0.5:1.5:100j])])
predict_mesh = trained_model.predict(mesh).argmax(axis=1)+1

plt.imshow(predict_mesh.reshape(100,100), cmap=plt.cm.brg, origin='lower', alpha=0.5,
           extent=(-0.5, 1.6, -0.5, 1.5), vmin=1, vmax=3)

plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg, edgecolors='black')
plt.colorbar();

plt.ylim(-0.5,1.5)
plt.show()

This is clearly not the right solution.  These are separable classes, so we should be able to get nearly 100% accuracy.  If you try running this again and tinkering with the initial weights, you can get different bad solutions.

We need more flexibility here.  Let's try to combine these artificial neurons into a more complex configuration.  We'll make a network with a single **hidden layer** of size five (a relatively arbitrary choice).  That is, we will have five logistic regressions whose outputs are not visible.  Instead, they are fed into our three visible neurons from before, whose output we use.

In [None]:
draw_neural_net_fig([2,5,3])

The math behind this isn't as bad as it might seem at first.  All of the weights of the neurons in the hidden layer can be combined into a single $2\times5$ matrix $W^{(1)}$.  The final neurons' weights will be in a $5\times3$ matrix $W^{(2)}$.  The biases behave similarly.  Then our final probabilistic prediction for sample $j$ and class $l$ is just

$$ p_{jl} = f_2\bigg( f_1\left( X_{ji} W^{(1)}_{ik} + b^{(1)}_k \right) W^{(2)}_{kl} + b^{(2)}_l \bigg)$$

We are using the Einstein notation: all repeated indices are implicitly summed over.  Both $f_1$ and $f_2$ represent our activation functions, which are taken to operate element-wise over tensors.
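
The forward pass can be sketched directly in NumPy, where `np.einsum` mirrors the index notation.  This is only an illustration: the random weights, the choice of tanh for $f_1$, and the choice of a softmax for $f_2$ are assumptions, not something fixed by the formula.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
X = rng.normal(size=(4, 2))    # 4 samples, 2 features
W1 = rng.normal(size=(2, 5))   # input -> hidden weights
b1 = rng.normal(size=5)
W2 = rng.normal(size=(5, 3))   # hidden -> output weights
b2 = rng.normal(size=3)

# Repeated letters in the einsum strings are the summed indices
hidden = np.tanh(np.einsum('ji,ik->jk', X, W1) + b1)   # assumed f_1 = tanh
p = softmax(np.einsum('jk,kl->jl', hidden, W2) + b2)   # assumed f_2 = softmax
```

Each row of `p` holds the three class probabilities for one sample.  The `einsum` calls are equivalent to ordinary matrix products (`X @ W1` and `hidden @ W2`).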

The **backpropagation** algorithm, developed by Paul Werbos in 1975, points out that we can use gradient descent (or similar algorithms) to optimize all of the parameters in these sorts of expressions.  All it takes is successive applications of the chain rule.  This is done for us by the optimizer in BigDL.
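
To make the chain rule concrete, here is a hand-written backward pass for a tiny network with one tanh hidden layer and a single linear output.  It is a sketch under simplifying assumptions: a squared-error loss stands in for the cross-entropy used elsewhere in this notebook, and all values are random or hypothetical.  One gradient entry is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=2)           # one input sample
t = 1.0                          # hypothetical target value
W1, b1 = rng.normal(size=(2, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=5), 0.1

def loss(W1, b1, W2, b2):
    h = np.tanh(x @ W1 + b1)
    y = h @ W2 + b2
    return 0.5 * (y - t) ** 2

# Forward pass, keeping the intermediate values
h = np.tanh(x @ W1 + b1)
y = h @ W2 + b2

# Backward pass: apply the chain rule from the output back to the input
dy = y - t                       # dL/dy
dW2 = dy * h                     # dL/dW2
dh = dy * W2                     # dL/dh
dz = dh * (1 - h ** 2)           # tanh'(z) = 1 - tanh(z)^2
dW1 = np.outer(x, dz)            # dL/dW1

# Compare one entry with a finite-difference estimate of the same derivative
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (loss(W1_shift, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
```

The analytic value `dW1[0, 0]` and the estimate `numeric` should agree to several decimal places.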

In [None]:
%%time
model = layer.Sequential()
model.add(layer.Linear(2,5))
#Tanh converges faster for this problem
model.add(layer.Tanh())
model.add(layer.Linear(5,3))


fitter = optimizer.Optimizer(model=model, training_rdd=samples, criterion=criterion.CrossEntropyCriterion(), 
                     optim_method=optimizer.Adam(), end_trigger=optimizer.MaxEpoch(400), 
                                batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "simple_linear_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

trained_model = fitter.optimize()

In [None]:
predictions = [x.argmax() + 1 for x in trained_model.predict(samples).collect()]
get_accuracy(predictions, [x for x in labels])

In [None]:
# Check convergence
loss_summary = np.array(trainSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1])

In [None]:
#Set up a 100x100 grid of points across our space, predict on those to get our background coloring
mesh = np.column_stack([a.reshape(-1) for a in np.meshgrid(np.r_[-0.5:1.6:100j], np.r_[-0.5:1.5:100j])])
predict_mesh = trained_model.predict(mesh).argmax(axis=1)+1

plt.imshow(predict_mesh.reshape(100,100), cmap=plt.cm.brg, origin='lower', alpha=0.5,
           extent=(-0.5, 1.6, -0.5, 1.5), vmin=1, vmax=3)

plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg, edgecolors='black')
plt.colorbar();

plt.ylim(-0.5,1.5)
plt.show()

### Exercise: Number of hidden neurons

Change the number of neurons in the hidden layer.  How does this change the predictions made by the model?  What happens when you add many neurons to this hidden layer?

## Activation functions

How important is the nonlinear function in the hidden neurons?  We can easily take it out and see what happens.

In [None]:
%%time
model = layer.Sequential()
model.add(layer.Linear(2,5))
model.add(layer.Linear(5,3))


fitter = optimizer.Optimizer(model=model, training_rdd=samples, criterion=criterion.CrossEntropyCriterion(), 
                     optim_method=optimizer.Adam(), end_trigger=optimizer.MaxEpoch(200), 
                                batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "simple_linear_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

trained_model = fitter.optimize()

In [None]:
predictions = [x.argmax() + 1 for x in trained_model.predict(samples).collect()]
get_accuracy(predictions, [x for x in labels])

In [None]:
#Set up a 100x100 grid of points across our space, predict on those to get our background coloring
mesh = np.column_stack([a.reshape(-1) for a in np.meshgrid(np.r_[-0.5:1.6:100j], np.r_[-0.5:1.5:100j])])
predict_mesh = trained_model.predict(mesh).argmax(axis=1)+1

plt.imshow(predict_mesh.reshape(100,100), cmap=plt.cm.brg, origin='lower', alpha=0.5,
           extent=(-0.5, 1.6, -0.5, 1.5), vmin=1, vmax=3)

plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg, edgecolors='black')
plt.colorbar();

plt.ylim(-0.5,1.5)
plt.show()

That doesn't look good.  It seems that we've fallen back to the single neuron case again.

This behavior becomes obvious when we consider the full neural network function, with $f_1$ being the identity function:

$$ p_{jl} = f_2\bigg( \left( X_{ji} W^{(1)}_{ik} + b^{(1)}_k \right) W^{(2)}_{kl} + b^{(2)}_l \bigg) = f_2\bigg( X_{ji} \color{red}{W^{(1)}_{ik} W^{(2)}_{kl}} + \color{blue}{b^{(1)}_k W^{(2)}_{kl} + b^{(2)}_l} \bigg) $$

This is just logistic regression, with the <font color="red">weights</font> and <font color="blue">bias</font> written in a rather funny way.  Given this, it would be odd if we didn't see this behavior!
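
The collapse can be checked numerically.  The sketch below uses random, purely illustrative weights to confirm that two stacked linear layers give exactly the same outputs as one linear layer with the combined weights and bias.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(10, 2))
W1, b1 = rng.normal(size=(2, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)

# Two stacked linear layers with no activation in between
two_layer = (X @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W_eff = W1 @ W2          # combined weights
b_eff = b1 @ W2 + b2     # combined bias
one_layer = X @ W_eff + b_eff
```

However wide the hidden layer is, `W_eff` is still only a $2\times3$ matrix, so the model has no more expressive power than the single-layer case.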

This function, $f_1$, is known as the **activation function** of the neuron.  As we just saw, the fact that the activation function is nonlinear is crucial.  This is what keeps the whole network from just being a linear transformation.  Any non-linearity will do though, so a number of different activation functions have been proposed, with the most common here:

In [None]:
from scipy import special

xx = np.linspace(-4, 4)
plt.plot(xx, xx > 0, label='Heaviside')
plt.plot(xx, special.expit(xx), label='sigmoid')
plt.plot(xx, np.tanh(xx), label='tanh')
plt.plot(xx, np.maximum(xx,0), label='relu')
plt.legend(loc=2)
plt.ylim(-1, 2);

The first perceptron, a single-layer neural network, designed by Frank Rosenblatt in 1957, used the **Heaviside** or **step** function.  This is essentially equivalent to using a threshold with logistic regression.  While this is fine for predicting a class, it has slope 0 almost everywhere, and therefore is unsuitable for use with gradient descent algorithms.

We have already seen the **sigmoid** function used in logistic regression.  In a sense, it smooths out the step function, allowing a usable gradient in the area near $x = 0$.  Because the function saturates at $\pm\infty$, the gradient goes to zero for large positive or negative inputs.  This can cause optimization algorithms to slow down.
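
The saturation is easy to see numerically.  This short sketch evaluates the derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, at a few illustrative points.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(xs)
```

The gradient peaks at 0.25 for $x = 0$ and has already dropped below $10^{-4}$ by $x = \pm 10$.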

The average output of a sigmoid is 0.5, but it performs best when the average input is 0.  Thus, several layers of sigmoid neurons may push themselves away from optimal behavior.  One solution to this is to use a **tanh** instead.  While the general shape is the same, its range is (-1, 1), so the output will on average be 0.

The tanh will still have trouble with saturation of the signal.  Recently, many researchers have had success with the **rectified linear unit (ReLU)**: $f(x) = \max(0, x)$.  While it might seem to combine the problems of the other functions (non-analytic points, zero derivatives, non-centered output), in practice it tends to be quite successful.

ReLU neurons are susceptible to dying, however.  If they get into a state where the combined input is negative, both their output value and gradient become zero, and the neurons cease to learn.  Two other activation functions take the linear part of the ReLU and replace the constant zero.  The **leaky ReLU** is also linear for negative inputs, but with a much smaller slope, typically 0.01.  This keeps the benefits of the ReLU, but allows dead neurons to recover eventually.  The **exponential linear unit (ELU)** marries the linear portion to an exponential decay with negative inputs.  This allows the activation function to have a continuous derivative, at the cost of being more expensive to compute.
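
A quick way to see why these variants help is to compare gradients for negative inputs.  The NumPy functions below are illustrative reference implementations, not BigDL's; the leaky slope of 0.01 and an ELU scale of 1 are assumed defaults.

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)

def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)

def elu_grad(x, alpha=1.0):
    # ELU is x for x > 0 and alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x_neg = np.array([-3.0, -1.0, -0.1])
relu_g = relu_grad(x_neg)          # all zeros: no learning signal
leaky_g = leaky_relu_grad(x_neg)   # small constant slope
elu_g = elu_grad(x_neg)            # positive, shrinking for very negative inputs
```

Only the plain ReLU has an exactly zero gradient for negative inputs, which is what allows a neuron to die.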

In [None]:
def leaky_relu(x):
    # NumPy reference version of the leaky ReLU, with the typical slope of 0.01
    return np.maximum(0.01 * x, x)

plt.plot(xx, layer.ReLU().forward(xx), label='relu')
plt.plot(xx, layer.LeakyReLU(0.03).forward(xx), label='leaky relu', ls='--')
plt.plot(xx, layer.ELU().forward(xx), label='elu', ls=':')
plt.legend(loc=2)
plt.ylim(-1, 2);

### Exercise: Exploring activation functions

Adjust the activation function in the XOR-like neural network.  What function gives the fastest training?  The highest accuracy?

## Training

### Overfitting
Overfitting stems from too much flexibility in a model, which causes the model to start fitting the noise in the data instead of the signal.  The main source of flexibility in a neural network is the weights, which describe the relationship between the input features and output values.  Usually we will constrain this flexibility through the use of **regularization**, which penalizes non-zero weights in our model.  Thus any non-zero weight must improve the loss function by more than the penalty applied, whose strength is usually a tunable hyperparameter.  Another way to think of regularization is that, instead of assuming a uniform prior for the parameters of the model (i.e. the weights), one assumes some sort of distribution peaked about zero, whose width is set by the tunable regularization hyperparameter.  Propagating this prior through a maximum likelihood calculation produces the weight penalty.
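
As a sketch of that last statement, assume (purely for illustration) an independent zero-mean Gaussian prior of width $\sigma$ on each weight.  Maximizing the posterior is the same as minimizing its negative logarithm:

$$ -\log p(W \mid D) = -\log p(D \mid W) + \frac{1}{2\sigma^2}\sum W^2 + \text{const} $$

The first term is the usual loss, and the second is a penalty on the squared weights whose strength, $1/(2\sigma^2)$, grows as the prior narrows.  A Laplace prior leads in the same way to a penalty on the absolute values of the weights.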

The two main types of regularization are $L_2$-regularization, which adds a penalty proportional to the sum of the squares of the weights, and $L_1$-regularization, which uses the sum of the absolute values of the weights (we generally don't apply these to the biases).  We can write these as
$$ L_2\text{ term: }\alpha\sum W^2 \qquad L_1\text{ term: }\alpha\sum\left|W\right| $$

The hyperparameter $\alpha$ controls the degree of regularization and is generally a parameter which should be tuned during training.
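
The two penalty terms are simple to compute by hand.  In this sketch the weight matrices and the value of $\alpha$ are hypothetical, and the helper names are our own rather than part of any library.

```python
import numpy as np

def l2_penalty(weights, alpha):
    return alpha * sum((W ** 2).sum() for W in weights)

def l1_penalty(weights, alpha):
    return alpha * sum(np.abs(W).sum() for W in weights)

# Hypothetical weight matrices for a tiny two-layer network (biases excluded)
weights = [np.array([[1.0, -2.0], [0.0, 0.5]]), np.array([[3.0], [-1.0]])]
alpha = 0.1

l2 = l2_penalty(weights, alpha)   # 0.1 * (1 + 4 + 0 + 0.25 + 9 + 1)
l1 = l1_penalty(weights, alpha)   # 0.1 * (1 + 2 + 0 + 0.5 + 3 + 1)
```

Either term would be added to the data loss before taking gradients.  Note how the $L_2$ term is dominated by the largest weight, while the $L_1$ term treats all magnitudes proportionally.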

*Question*: What happens in the limits $\alpha\to0$? $\alpha\to\infty$?

In order to regularize, we can use the `weightdecay` parameter of the optimization method.  Note that this uses an $L_2$ norm.  Below, we will train two models that are the same except for the regularization parameter.

Let us first look at an example of an overfit model.  To start, let's generate some data which is a bit noisier (this will make it easier to force overfitting).

In [None]:
# Generate some random data
import matplotlib.pyplot as plt

centers = np.array([[0, 0]] * 50 + [[1, 0.5]] * 50 + [[0,1]]*50)
np.random.seed(47)
data_in = np.random.normal(0, 0.35, (150, 2)) + centers
labels_in = np.array([[1]] * 50 + [[2]] * 50 + [[3]]*50)

#Shuffle the data to prevent the ordered inputs from biasing the minibatches
idx = list(range(len(labels_in)))
np.random.shuffle(idx)
data = data_in[idx]
labels = labels_in[idx]

data_with_labels = zip(data, labels)
samples = sc.parallelize(data_with_labels).map(lambda x: Sample.from_ndarray(x[0],x[1]))

plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg)
plt.colorbar();

In [None]:
#We'll also generate a separate, similar sample to do some testing
data_in_test = np.random.normal(0, 0.35, (150, 2)) + centers
labels_in_test = np.array([[1]] * 50 + [[2]] * 50 + [[3]]*50)

#Shuffle the data to prevent the ordered inputs from biasing the minibatches
idx = list(range(len(labels_in_test)))
np.random.shuffle(idx)
data_test = data_in_test[idx]
labels_test = labels_in_test[idx]

plt.scatter(data_test[:,0], data_test[:,1], c=labels_test.reshape(-1), cmap=plt.cm.brg)
plt.colorbar();

data_with_labels_test = zip(data_test, labels_test)
samples_test = sc.parallelize(data_with_labels_test).map(lambda x: Sample.from_ndarray(x[0],x[1]))

We will write a convenience function for visualizing this data and model accuracy.

In [None]:
def plot_predictions(model):
    
    predictions = [x.argmax() + 1 for x in model.predict(samples).collect()]
    predictions_test = [x.argmax() + 1 for x in model.predict(samples_test).collect()]
    
    accuracy = get_accuracy(predictions, [x for x in labels])
    accuracy_test = get_accuracy(predictions_test, [x for x in labels_test])
    
    print("training accuracy:   {:.5f}".format(accuracy) + "\n" + "test accuracy:       {:.5f}".format(accuracy_test)) 

    #Set up a 100x100 grid of points across our space, predict on those to get our background coloring
    mesh = np.column_stack([a.reshape(-1) for a in np.meshgrid(np.r_[-0.7:1.7:100j], np.r_[-0.5:1.5:100j])])
    predict_mesh = model.predict(mesh).argmax(axis=1)+1

    def plot_mesh():
        plt.xlim(-0.7,1.6)
        plt.ylim(-0.5,1.5)
        plt.imshow(predict_mesh.reshape(100,100), cmap=plt.cm.brg, origin='lower', alpha=0.5,
                   extent=(-0.7, 1.7, -0.5, 1.5), vmin=1, vmax=3)
    
    fig = plt.figure(figsize=(10,5))
    
    fig.add_subplot(121)
    plt.title("Training")
    plot_mesh()
    plt.scatter(data[:,0], data[:,1], c=labels.reshape(-1), cmap=plt.cm.brg, edgecolors='black')
    
    fig.add_subplot(122)
    plt.title("Test")
    plot_mesh()
    plt.scatter(data_test[:,0], data_test[:,1], c=labels_test.reshape(-1), cmap=plt.cm.brg, edgecolors='black')
    
    plt.show()

Now we can add more neurons in the hidden layer, and add an additional layer.

In [None]:
%%time
hidden_size = 512
model = layer.Sequential()
model.add(layer.Linear(2, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, 3))


fitter = optimizer.Optimizer(model=model, 
                             training_rdd=samples, 
                             criterion=criterion.CrossEntropyCriterion(), 
                             optim_method=optimizer.Adam(), 
                             end_trigger=optimizer.MaxEpoch(400), 
                             batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "overfit_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

#add test tracking, will run slower now
fitter.set_validation(batch_size=60, val_rdd=samples_test, 
                      trigger=optimizer.EveryEpoch(), val_method=[optimizer.Loss(criterion.CrossEntropyCriterion())])
valSummary = optimizer.ValidationSummary("./logs", "overfit_test_{}".format(now_string))
fitter.set_val_summary(valSummary)

overfit_model = fitter.optimize()

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
test_loss_summary = np.array(valSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1], test_loss_summary[:,0], test_loss_summary[:,1])

In [None]:
plot_predictions(overfit_model)

Now we add regularization to prevent the overfitting.  We'll try two different values of $\alpha$.  First the "low" one

In [None]:
%%time
hidden_size = 512
model = layer.Sequential()
model.add(layer.Linear(2, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, 3))


fitter = optimizer.Optimizer(model=model, 
                             training_rdd=samples, 
                             criterion=criterion.CrossEntropyCriterion(), 
                             optim_method=optimizer.SGD(weightdecay=.000001), 
                             end_trigger=optimizer.MaxEpoch(600), 
                             batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "alpha_low_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

#add test tracking
fitter.set_validation(batch_size=60, val_rdd=samples_test, 
                      trigger=optimizer.EveryEpoch(), val_method=[optimizer.Loss(criterion.CrossEntropyCriterion())])
valSummary = optimizer.ValidationSummary("./logs", "alpha_low_test_{}".format(now_string))
fitter.set_val_summary(valSummary)

alpha_low = fitter.optimize()

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
test_loss_summary = np.array(valSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1], test_loss_summary[:,0], test_loss_summary[:,1])

In [None]:
plot_predictions(alpha_low)

And now the "high" one

In [None]:
%%time

hidden_size = 512
model = layer.Sequential()
model.add(layer.Linear(2, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, hidden_size))
model.add(layer.Tanh())
model.add(layer.Linear(hidden_size, 3))


fitter = optimizer.Optimizer(model=model, 
                             training_rdd=samples, 
                             criterion=criterion.CrossEntropyCriterion(), 
                             optim_method=optimizer.SGD(weightdecay=.01), 
                             end_trigger=optimizer.MaxEpoch(600), 
                             batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "alpha_high_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

#add test tracking
fitter.set_validation(batch_size=60, val_rdd=samples_test, 
                      trigger=optimizer.EveryEpoch(), val_method=[optimizer.Loss(criterion.CrossEntropyCriterion())])
valSummary = optimizer.ValidationSummary("./logs", "alpha_high_test_{}".format(now_string))
fitter.set_val_summary(valSummary)

alpha_high = fitter.optimize()

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
test_loss_summary = np.array(valSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1], test_loss_summary[:,0], test_loss_summary[:,1])

In [None]:
plot_predictions(alpha_high)

In [None]:
def sum_weights(m):
    return sum([(i**2).sum() for i in m.get_weights()])

print("no alpha is {}".format(sum_weights(overfit_model)))
print("low alpha is : {}".format(sum_weights(alpha_low)))
print("high alpha is : {}".format(sum_weights(alpha_high)))

Notice that with increasing $\alpha$ the sum of the squared weights is reduced.
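
The same shrinking effect can be seen in a much simpler setting.  The sketch below is not the network above: it assumes a linear model with squared-error loss (ridge regression) on synthetic data, where the $L_2$-penalized solution has a closed form.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    # Minimizes ||X w - y||^2 + alpha * ||w||^2 in closed form
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 100.0]
sq_norms = [float((ridge_weights(X, y, a) ** 2).sum()) for a in alphas]
```

The entries of `sq_norms` decrease as `alpha` grows, mirroring what we saw for the network.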

### Dropout

Regularization is a popular mechanism for simple linear models, where training is rather fast.  This makes it easy to try a number of values for $\alpha$ in order to determine the optimal value.  Deep nets take much longer to train, making this search much more painful.

For this reason, another approach for dealing with overfitting has been widely adopted.  Introduced by Hinton, Srivastava, *et al.*, in two [recent](https://arxiv.org/pdf/1207.0580.pdf) [papers](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) (from about 2012), **dropout** is a remarkably simple strategy.  At each training step, a random selection of neurons is removed from the network:

In [None]:
# Network with 50% dropout
draw_neural_net_fig([20, 14, 12, 10], 0.5)

The idea behind this is to keep individual neurons from becoming too specialized.  Because each neuron may or may not be in any given run, it cannot be the only neuron detecting a particular feature.  If that feature is important, responsibility for its detection needs to be spread out among several neurons.  When we make a prediction, we will keep all of the neurons, in order to make the best prediction possible.

Another way of thinking about dropout is that, instead of training a single network of $N$ neurons, we are training $2^N$ networks of different combinations of neurons.  Then when we make a prediction, we are essentially taking an ensemble average of all of these $2^N$ networks.  A well-known feature of ensemble models is that they tend to avoid overfitting.
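
A minimal NumPy sketch of the idea follows.  It uses the common "inverted dropout" convention, where surviving activations are rescaled during training so that nothing needs to change at prediction time.  That rescaling, and the function itself, are assumptions for illustration rather than a description of any particular library's internals.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob
    mask = rng.uniform(size=activations.shape) < keep_prob
    # Rescale the surviving units so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 20))
h_train = dropout(h, 0.5, rng, training=True)    # about half the entries zeroed
h_test = dropout(h, 0.5, rng, training=False)    # all neurons kept
```

During training each entry is either zeroed or doubled, so the mean activation stays close to its original value, while at prediction time the input passes through untouched.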

We can add dropout layers to our model quite easily.  BigDL's Dropout layers will randomly drop elements with a configurable probability. 

The standard value for dropout is 50%, which works well for most problems.  In this example, we use different dropout rates for different layers.  A lower dropout rate is better for the top of the network because we only have two inputs, and a higher dropout rate is better for the hidden layers because they have too many neurons to begin with.

In [None]:
%%time
hidden_size = 512
model = layer.Sequential()
model.add(layer.Dropout(0.1))
model.add(layer.Linear(2, hidden_size))
model.add(layer.Tanh())
model.add(layer.Dropout(0.8))
model.add(layer.Linear(hidden_size, hidden_size)) 
model.add(layer.Tanh())                           
model.add(layer.Dropout(0.8))                     
model.add(layer.Linear(hidden_size, 3))


fitter = optimizer.Optimizer(model=model, 
                             training_rdd=samples, 
                             criterion=criterion.CrossEntropyCriterion(), 
                             optim_method=optimizer.Adam(), 
                             end_trigger=optimizer.MaxEpoch(400), 
                             batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "dropout_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

#add test tracking
fitter.set_validation(batch_size=60, val_rdd=samples_test, 
                      trigger=optimizer.EveryEpoch(), val_method=[optimizer.Loss(criterion.CrossEntropyCriterion())])
valSummary = optimizer.ValidationSummary("./logs", "dropout_test_{}".format(now_string))
fitter.set_val_summary(valSummary)

dropout_model = fitter.optimize()

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
test_loss_summary = np.array(valSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1], test_loss_summary[:,0], test_loss_summary[:,1])

In [None]:
plot_predictions(dropout_model)

### Batch Normalization 

Another [recently-developed](https://arxiv.org/pdf/1502.03167v3.pdf) tool for deep networks is **batch normalization**.  Although it can help with overfitting, it was originally developed to deal with the vanishing gradient problem.  Recall that activation functions have flat regions, where their gradients are small.  When the input is in these regions, gradient descent will only move the weights small amounts, leaving them stuck in the low-gradient regions.  Intelligent choices for initializations and activation functions try to avoid this as much as possible.

Batch normalization takes a more proactive approach, scaling and shifting the inputs to a layer so that, over the whole batch, they have a target mean and standard deviation.  These target values become parameters of the model, tuned during training.

By keeping gradients from vanishing, batch normalization reduces the importance of the weight initialization and the activation function.  Larger learning rates can be used.  Although the initial steps may proceed more slowly, as the correct normalizations must be learned, learning should proceed much faster overall than without.  Batch normalization can also have a regularization effect, reducing the propensity towards overfitting!
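
The training-time transformation can be sketched in a few lines of NumPy.  This is a simplified illustration: `gamma` and `beta` stand for the learned target standard deviation and mean, their values here are hypothetical, and the running averages that a real layer would keep for use at prediction time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.RandomState(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(30, 4))   # badly centered inputs
gamma = np.full(4, 2.0)    # hypothetical target standard deviation
beta = np.full(4, -1.0)    # hypothetical target mean
out = batch_norm(batch, gamma, beta)
```

Whatever the statistics of the incoming batch, each column of `out` has mean `beta` and standard deviation very close to `gamma`.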


In [None]:
%%time

hidden_size = 512
model = layer.Sequential()
model.add(layer.BatchNormalization(2))
model.add(layer.Linear(2, hidden_size))
model.add(layer.Tanh())
model.add(layer.BatchNormalization(hidden_size))
model.add(layer.Linear(hidden_size, hidden_size)) 
model.add(layer.Tanh())                           
model.add(layer.BatchNormalization(hidden_size))  
model.add(layer.Linear(hidden_size, 3))


fitter = optimizer.Optimizer(model=model, 
                             training_rdd=samples, 
                             criterion=criterion.CrossEntropyCriterion(), 
                             optim_method=optimizer.Adam(), 
                             end_trigger=optimizer.MaxEpoch(200), 
                             batch_size=30)

#add tracking
now_string = datetime.now().strftime("%Y%m%d-%H%M%S")
trainSummary = optimizer.TrainSummary("./logs", "batchnorm_{}".format(now_string))
trainSummary.set_summary_trigger("Loss", optimizer.EveryEpoch())
fitter.set_train_summary(trainSummary)

#add test tracking
fitter.set_validation(batch_size=60, val_rdd=samples_test, 
                      trigger=optimizer.EveryEpoch(), val_method=[optimizer.Loss(criterion.CrossEntropyCriterion())])
valSummary = optimizer.ValidationSummary("./logs", "batchnorm_test_{}".format(now_string))
fitter.set_val_summary(valSummary)

batch_model = fitter.optimize()

In [None]:
loss_summary = np.array(trainSummary.read_scalar('Loss'))
test_loss_summary = np.array(valSummary.read_scalar('Loss'))
plt.plot(loss_summary[:,0],loss_summary[:,1], test_loss_summary[:,0], test_loss_summary[:,1])

In [None]:
plot_predictions(batch_model)

### Exercise: Better MNIST

Build a multi-layer network for our MNIST data from before.  In the interest of training time, you'll want to start with just one or two relatively small (less than 200 node) hidden layers.  Try different architectures and see what does best - and be sure to evaluate your progress on the _test_ data as well as the _training_ data to see if you've overfit.  

*Copyright &copy; 2018 The Data Incubator.  All rights reserved.*