# Experiments

We've now got our learning algorithm working, so let's run some simulations and see if these more "advanced" techniques really make much difference.  Recall we've got the following choices:

1. Different activation functions: sigmoid, hyperbolic tangent, or rectifier.
2. Different cost functions: MSE or CE.
3. Different initialization functions: uniform in $[-1,1]$ or Xavier.
4. Different learning rates.

Let's investigate all of them.  We'll use the same architecture for each experiment -- 80+40+10 neurons.  The last layer will always be sigmoid neurons to force them into the appropriate range.  We'll run 10 epochs with a batch size of 50; this is enough epochs to make the difference known, but not so many that I'm going to spend an entire afternoon watching little numbers go down.

We'll investigate every combination of activation function, cost function, and initialization method.  For each one, we'll start with a learning rate of 1024, then cut it in half until it stops blowing up, and consider that the learning rate for the experiment.  Note that it's not fair to pick a uniform learning rate across all the experiments, since different rates work better for different activation functions and so on.
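The halving search described above can be sketched in a few lines.  This is only an illustration, not the procedure actually used for the runs below (those were done by hand): `find_learning_rate`, `train_fn`, and `toy_train` are hypothetical names, and the toy "training run" is plain gradient descent on $w^2$ rather than a neural network.

```python
import math


def find_learning_rate(train_fn, start=1024.0, min_rate=1e-6):
    """Halve the learning rate until train_fn reports a finite final cost.

    train_fn is a hypothetical callable: it takes a learning rate and
    returns the final training cost (inf or nan if training blew up).
    """
    rate = start
    while rate >= min_rate:
        cost = train_fn(rate)
        if math.isfinite(cost):
            return rate
        rate /= 2.0
    raise ValueError("no stable learning rate found")


def toy_train(rate, steps=200):
    """Gradient descent on f(w) = w**2, standing in for a real training run."""
    w = 1.0
    for _ in range(steps):
        w -= rate * 2.0 * w      # the gradient of w**2 is 2w
        if abs(w) > 1e12:        # treat runaway weights as "blowing up"
            return math.inf
    return w * w


best = find_learning_rate(toy_train)
print(best)
```

Note that this search only checks that training stops diverging; it says nothing about whether the first stable rate is also the best one.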

To start, let's write an experiment function which allows us to specify which parameters we want to use, and then run it.

In [1]:
from basic_nn import *
import numpy as np  # np.random.seed is used below; harmless if basic_nn already exports np
import time

def run_exp(act_fn, cost_fn, init_fn, learning_rate):
    np.random.seed(313) # for determinism
    
    # Step 1: pick architecture; in prose and parameters
    neuron_sizes = [80, 40]
    
    num_epochs = 10
    batch_size = 50
    
    # Step 2: initialize
    weights, biases = init_fn(n, k, neuron_sizes)
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)

            for i in range(0, len(weights)):
                weights[i] -= learning_rate * bp_grad_w[i] / len(X_mb)
                biases[i] -= learning_rate * bp_grad_b[i] / len(X_mb)

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
        print("Cost {2:0.7f} through epoch {0}; took {1:0.3f} seconds so far.".format(epoch, time.time()-t1, cost))
    
    # Step 4: evaluate
    _, _, y = forward_prop(weights, biases, acts, train_X)
    success_rate = classification_success_rate(y[-1], train_Y)
    print("After {0} epochs, got {1:0.3f}% classifications correct.".format(num_epochs, 100*success_rate))

# The Data

As before, we'll use the MNIST dataset.

In [2]:
from mnist_import import get_mnist_nice

train_X, train_Y, test_X, test_Y = get_mnist_nice()

n = train_X.shape[1]
k = train_Y.shape[1]

# Sigmoid Neurons

Let's get started!  We'll go from "worst" to "best," in terms of the reputation of the algorithm and its importance (in my opinion, based on what I've read and seen).  So we'll loop through initialization functions (least important), then cost functions (more important), then activation functions (most important).

I won't show the results of failed learning rates, but trust me that I've started at 1024 and halved until a good learning rate was found.

In [3]:
learning_rate = 4
run_exp(act_sigmoid, cost_MSE, initialize_network_uniform, learning_rate)

Cost 0.4591226 through epoch 0; took 10.680 seconds so far.
Cost 0.2402632 through epoch 1; took 18.142 seconds so far.
Cost 0.2083375 through epoch 2; took 25.790 seconds so far.
Cost 0.1922131 through epoch 3; took 32.938 seconds so far.
Cost 0.1811940 through epoch 4; took 40.097 seconds so far.
Cost 0.1733975 through epoch 5; took 47.260 seconds so far.
Cost 0.1668856 through epoch 6; took 54.353 seconds so far.
Cost 0.1617854 through epoch 7; took 61.505 seconds so far.
Cost 0.1574871 through epoch 8; took 68.684 seconds so far.
Cost 0.1534947 through epoch 9; took 75.794 seconds so far.
After 10 epochs, got 87.817% classifications correct.


In [4]:
learning_rate = 4
run_exp(act_sigmoid, cost_MSE, initialize_xavier_sigmoid, learning_rate)

Cost 0.3648476 through epoch 0; took 8.327 seconds so far.
Cost 0.1342702 through epoch 1; took 15.487 seconds so far.
Cost 0.1036181 through epoch 2; took 22.799 seconds so far.
Cost 0.0864062 through epoch 3; took 29.926 seconds so far.
Cost 0.0750066 through epoch 4; took 37.008 seconds so far.
Cost 0.0668906 through epoch 5; took 44.147 seconds so far.
Cost 0.0605494 through epoch 6; took 51.454 seconds so far.
Cost 0.0554018 through epoch 7; took 59.546 seconds so far.
Cost 0.0511987 through epoch 8; took 67.156 seconds so far.
Cost 0.0475507 through epoch 9; took 74.309 seconds so far.
After 10 epochs, got 97.738% classifications correct.


In [5]:
learning_rate = 4
run_exp(act_sigmoid, cost_CE, initialize_network_uniform, learning_rate)

Cost 1.1200734 through epoch 0; took 7.547 seconds so far.
Cost 0.4754880 through epoch 1; took 15.070 seconds so far.
Cost 0.3797369 through epoch 2; took 22.531 seconds so far.
Cost 0.3265063 through epoch 3; took 30.061 seconds so far.
Cost 0.2917096 through epoch 4; took 37.810 seconds so far.
Cost 0.2629701 through epoch 5; took 46.256 seconds so far.
Cost 0.2401513 through epoch 6; took 53.742 seconds so far.
Cost 0.2229482 through epoch 7; took 61.223 seconds so far.
Cost 0.2066669 through epoch 8; took 68.743 seconds so far.
Cost 0.1926037 through epoch 9; took 76.280 seconds so far.
After 10 epochs, got 98.055% classifications correct.


In [6]:
learning_rate = 4
run_exp(act_sigmoid, cost_CE, initialize_xavier_sigmoid, learning_rate)

Cost 0.9786934 through epoch 0; took 7.553 seconds so far.
Cost 0.3796509 through epoch 1; took 15.298 seconds so far.
Cost 0.2925477 through epoch 2; took 23.926 seconds so far.
Cost 0.2421980 through epoch 3; took 31.671 seconds so far.
Cost 0.2086297 through epoch 4; took 39.461 seconds so far.
Cost 0.1852873 through epoch 5; took 46.984 seconds so far.
Cost 0.1662026 through epoch 6; took 54.445 seconds so far.
Cost 0.1508255 through epoch 7; took 62.691 seconds so far.
Cost 0.1380346 through epoch 8; took 70.203 seconds so far.
Cost 0.1268596 through epoch 9; took 77.751 seconds so far.
After 10 epochs, got 98.882% classifications correct.


The results are quite impressive, in my opinion.  The initialization really matters -- observe that in both cases, switching from the naive uniform initialization to the Xavier method cut the error rate substantially: from 12.2% to 2.3% with MSE, and from 1.9% to 1.1% with CE.  The networks that come from this method don't have the same saturation problems that a uniformly generated network would, and they train a lot more quickly.
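To get a feel for the saturation problem, here is a rough sketch that measures how spread out the first layer's pre-activations are under the two schemes.  Several things here are assumptions: the inputs are uniform random numbers rather than MNIST pixels, the Xavier limit is taken to be the common $\sqrt{6/(\text{fan\_in}+\text{fan\_out})}$ form (the code of `initialize_xavier_sigmoid` isn't shown here, so it may differ), and `preactivation_std` is a hypothetical helper written in plain Python.

```python
import math
import random


def preactivation_std(fan_in, fan_out, weight_limit, num_samples=5, seed=0):
    """Std. dev. of z = w . x with weights uniform in [-weight_limit, weight_limit].

    Fresh weights are drawn for every neuron and every sample, so this
    measures the spread of z over random weights and random inputs.
    """
    rng = random.Random(seed)
    zs = []
    for _ in range(num_samples):
        x = [rng.random() for _ in range(fan_in)]  # stand-in inputs in [0, 1)
        for _ in range(fan_out):
            z = sum(rng.uniform(-weight_limit, weight_limit) * xi for xi in x)
            zs.append(z)
    mean = sum(zs) / len(zs)
    return math.sqrt(sum((z - mean) ** 2 for z in zs) / len(zs))


fan_in, fan_out = 784, 80                         # first layer of the 80+40+10 network
xavier_limit = math.sqrt(6.0 / (fan_in + fan_out))  # assumed Xavier (Glorot) uniform limit

naive_std = preactivation_std(fan_in, fan_out, 1.0)
xavier_std = preactivation_std(fan_in, fan_out, xavier_limit)
print("uniform [-1, 1]: std of z = {0:.2f}".format(naive_std))
print("Xavier:          std of z = {0:.2f}".format(xavier_std))
```

With weights in $[-1,1]$ the pre-activations land far out on the flat tails of the sigmoid, where gradients are tiny; with the smaller Xavier range they stay near the steep middle.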

Also notice the effect of having a proper cost function, which speeds training wonderfully.  This is especially striking in the case of a "bad" initialization: with the MSE cost function, the uniform network only trains to 88% accuracy in 10 epochs, but in the same number of epochs, the CE cost function gets to 98% accuracy.
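One way to see why CE copes better with a badly saturated network is to compare the error signal at a single sigmoid output neuron.  The sketch below assumes the textbook forms of the two costs (MSE as $\frac{1}{2}(a-y)^2$ and CE as $-[y\log a + (1-y)\log(1-a)]$); `cost_MSE` and `cost_CE` in `basic_nn` may use different constant factors, and the function names here are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_delta_mse(z, target):
    """dC/dz for C = (a - target)**2 / 2 with a = sigmoid(z)."""
    a = sigmoid(z)
    return (a - target) * a * (1.0 - a)   # includes the sigmoid slope a(1 - a)


def output_delta_ce(z, target):
    """dC/dz for C = -(target*log(a) + (1 - target)*log(1 - a)) with a = sigmoid(z)."""
    a = sigmoid(z)
    return a - target                     # the sigmoid slope cancels out


# A neuron that should output 1 but is increasingly saturated near 0.
for z in (0.0, -4.0, -8.0):
    print("z={0:5.1f}  MSE delta={1:.6f}  CE delta={2:.6f}".format(
        z, output_delta_mse(z, 1.0), output_delta_ce(z, 1.0)))
```

The more wrong the saturated neuron is, the smaller the MSE signal gets, while the CE signal stays close to $-1$.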

# Hyperbolic Tangent Neurons

In [7]:
learning_rate = 2
run_exp(act_tanh, cost_MSE, initialize_network_uniform, learning_rate)

Cost 0.7630340 through epoch 0; took 9.595 seconds so far.
Cost 0.6679351 through epoch 1; took 21.213 seconds so far.
Cost 0.6529174 through epoch 2; took 29.033 seconds so far.
Cost 0.6452124 through epoch 3; took 36.813 seconds so far.
Cost 0.6405911 through epoch 4; took 44.587 seconds so far.
Cost 0.6367360 through epoch 5; took 52.465 seconds so far.
Cost 0.6339814 through epoch 6; took 61.354 seconds so far.
Cost 0.6313757 through epoch 7; took 69.115 seconds so far.
Cost 0.5807940 through epoch 8; took 77.206 seconds so far.
Cost 0.5258620 through epoch 9; took 84.971 seconds so far.
After 10 epochs, got 53.110% classifications correct.


In [8]:
learning_rate = 2
run_exp(act_tanh, cost_MSE, initialize_xavier_tanh, learning_rate)

Cost 0.2674983 through epoch 0; took 7.694 seconds so far.
Cost 0.0944498 through epoch 1; took 15.433 seconds so far.
Cost 0.0700270 through epoch 2; took 23.185 seconds so far.
Cost 0.0569199 through epoch 3; took 30.914 seconds so far.
Cost 0.0487330 through epoch 4; took 38.420 seconds so far.
Cost 0.0426666 through epoch 5; took 46.061 seconds so far.
Cost 0.0374035 through epoch 6; took 53.667 seconds so far.
Cost 0.0339940 through epoch 7; took 61.245 seconds so far.
Cost 0.0302312 through epoch 8; took 68.703 seconds so far.
Cost 0.0278998 through epoch 9; took 76.249 seconds so far.
After 10 epochs, got 98.767% classifications correct.


In [9]:
learning_rate = 2
run_exp(act_tanh, cost_CE, initialize_network_uniform, learning_rate)

Cost 1.7647134 through epoch 0; took 7.964 seconds so far.
Cost 0.6239229 through epoch 1; took 17.181 seconds so far.
Cost 0.4974299 through epoch 2; took 25.349 seconds so far.
Cost 0.4334288 through epoch 3; took 33.436 seconds so far.
Cost 0.3852466 through epoch 4; took 41.419 seconds so far.
Cost 0.3550446 through epoch 5; took 49.393 seconds so far.
Cost 0.3310954 through epoch 6; took 57.355 seconds so far.
Cost 0.3080355 through epoch 7; took 65.368 seconds so far.
Cost 0.2942540 through epoch 8; took 74.429 seconds so far.
Cost 0.2790795 through epoch 9; took 82.585 seconds so far.
After 10 epochs, got 96.773% classifications correct.


In [10]:
learning_rate = 2
run_exp(act_tanh, cost_CE, initialize_xavier_tanh, learning_rate)

Cost 0.7109427 through epoch 0; took 8.208 seconds so far.
Cost 0.3222831 through epoch 1; took 17.309 seconds so far.
Cost 0.2474861 through epoch 2; took 28.022 seconds so far.
Cost 0.2040213 through epoch 3; took 36.184 seconds so far.
Cost 0.1781349 through epoch 4; took 44.307 seconds so far.
Cost 0.1581911 through epoch 5; took 53.656 seconds so far.
Cost 0.1439137 through epoch 6; took 62.637 seconds so far.
Cost 0.1302804 through epoch 7; took 74.486 seconds so far.
Cost 0.1224360 through epoch 8; took 86.801 seconds so far.
Cost 0.1109439 through epoch 9; took 94.978 seconds so far.
After 10 epochs, got 98.932% classifications correct.


There is no real significance to the change in learning rates from 4 to 2; it is perhaps due to the steeper slope of hyperbolic tangent in general, as compared to the sigmoid function.
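For reference, the "steeper slope" remark can be made precise with the standard derivative formulas (these are general facts about the two functions, not something taken from `basic_nn`):

$$\sigma'(z) = \sigma(z)\,(1-\sigma(z)) \le \tfrac{1}{4}, \qquad \tanh'(z) = 1 - \tanh^2(z) \le 1, \qquad \tanh(z) = 2\,\sigma(2z) - 1.$$

Both maxima are attained at $z = 0$, so near the origin the hyperbolic tangent is four times as steep as the sigmoid, which is at least consistent with it tolerating a somewhat smaller learning rate.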

When I saw the first experiment, I was quite disappointed: hyperbolic tangent looked like a huge under-performer.  But the second was impressive -- with proper initialization, it clearly outperformed the sigmoid network (98.8% versus 97.7%).  The second pair, using the CE cost, shows the same pattern, though the margin over sigmoid is much smaller.

The lesson, apparently, is that proper initialization is extremely important for hyperbolic tangent neurons, and doing it wrong results in bad, difficult-to-train networks which underperform compared to sigmoid neurons.  But with proper initialization, they outperform the sigmoid neurons (even when both are initialized properly).

The lesson about appropriate cost functions is the same as before, for the same reasons.  Note the lack of significance of this factor when the network is initialized properly, although this could be a coincidence.

# Rectifier Units

In [11]:
learning_rate = 1
run_exp(act_LeRU, cost_MSE, initialize_network_uniform, learning_rate)

Cost 1.0015040 through epoch 0; took 7.231 seconds so far.
Cost 0.8802469 through epoch 1; took 15.969 seconds so far.
Cost 0.8554427 through epoch 2; took 23.005 seconds so far.
Cost 0.8468417 through epoch 3; took 29.980 seconds so far.
Cost 0.8421582 through epoch 4; took 37.043 seconds so far.
Cost 0.8396950 through epoch 5; took 44.078 seconds so far.
Cost 0.8351688 through epoch 6; took 51.050 seconds so far.
Cost 0.8341459 through epoch 7; took 58.075 seconds so far.
Cost 0.8324586 through epoch 8; took 65.086 seconds so far.
Cost 0.8303623 through epoch 9; took 73.356 seconds so far.
After 10 epochs, got 27.948% classifications correct.


In [12]:
learning_rate = 1
run_exp(act_LeRU, cost_MSE, initialize_xavier_leru, learning_rate)

Cost 0.5230091 through epoch 0; took 7.494 seconds so far.
Cost 0.3698022 through epoch 1; took 15.994 seconds so far.
Cost 0.3511810 through epoch 2; took 23.044 seconds so far.
Cost 0.3409287 through epoch 3; took 30.007 seconds so far.
Cost 0.3345050 through epoch 4; took 36.955 seconds so far.
Cost 0.3298234 through epoch 5; took 43.945 seconds so far.
Cost 0.3268026 through epoch 6; took 50.944 seconds so far.
Cost 0.3241408 through epoch 7; took 58.039 seconds so far.
Cost 0.3214145 through epoch 8; took 64.969 seconds so far.
Cost 0.3194634 through epoch 9; took 71.945 seconds so far.
After 10 epochs, got 75.947% classifications correct.


In [14]:
learning_rate = 0.5
run_exp(act_LeRU, cost_CE, initialize_network_uniform, learning_rate)

Cost 32.1786928 through epoch 0; took 22.846 seconds so far.
Cost 29.4563494 through epoch 1; took 40.891 seconds so far.
Cost 29.2815434 through epoch 2; took 59.954 seconds so far.
Cost 29.1442811 through epoch 3; took 79.421 seconds so far.
Cost 29.0949277 through epoch 4; took 100.959 seconds so far.
Cost 29.0727902 through epoch 5; took 123.390 seconds so far.
Cost 28.9924801 through epoch 6; took 147.013 seconds so far.
Cost 28.9313671 through epoch 7; took 171.161 seconds so far.
Cost 28.9055826 through epoch 8; took 190.618 seconds so far.
Cost 28.9034858 through epoch 9; took 209.099 seconds so far.
After 10 epochs, got 29.070% classifications correct.


In [15]:
learning_rate = 1
run_exp(act_LeRU, cost_CE, initialize_xavier_leru, learning_rate)

Cost 0.7097867 through epoch 0; took 7.174 seconds so far.
Cost 0.2965490 through epoch 1; took 14.317 seconds so far.
Cost 0.2226534 through epoch 2; took 21.476 seconds so far.
Cost 0.1866233 through epoch 3; took 28.572 seconds so far.
Cost 0.1606866 through epoch 4; took 35.746 seconds so far.
Cost 0.1428151 through epoch 5; took 42.888 seconds so far.
Cost 0.1263069 through epoch 6; took 50.023 seconds so far.
Cost 0.1173063 through epoch 7; took 57.120 seconds so far.
Cost 0.1093971 through epoch 8; took 64.286 seconds so far.
Cost 0.1029874 through epoch 9; took 71.477 seconds so far.
After 10 epochs, got 99.082% classifications correct.


Interestingly, LeRU units need a lower learning rate than the other options, and indeed quite often have overflow problems with ordinary-looking learning rates (e.g. 2).  This is perhaps because their slopes do not shrink for large inputs the way the slopes of sigmoid or hyperbolic tangent neurons do.
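A quick numerical comparison of the three slopes illustrates the point.  The helper names are hypothetical and the formulas are the usual textbook derivatives, not code taken from `basic_nn`.

```python
import math


def d_sigmoid(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2


def d_rectifier(z):
    return 1.0 if z > 0 else 0.0


# Slopes at increasingly large positive inputs.
for z in (0.5, 2.0, 5.0, 10.0):
    print("z={0:5.1f}  sigmoid'={1:.6f}  tanh'={2:.6f}  rectifier'={3:.1f}".format(
        z, d_sigmoid(z), d_tanh(z), d_rectifier(z)))
```

For large positive inputs the sigmoid and hyperbolic tangent slopes collapse toward zero, which damps the updates; the rectifier slope stays at 1, so nothing reins in a large step.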

The final experiment slightly improved on hyperbolic tangent, but the first two experiments underperformed drastically compared to both sigmoid and hyperbolic tangent neurons.

The third experiment is left in as a cautionary tale.  Without regularization (which we will discuss soon), there is nothing stopping the coefficients from growing without bound. This is particularly troubling for rectifier units, where one can easily grow the input at an exponential rate across layers.  The `cost_CE` function is tolerant to sigmoid "overflows" to zero or one, but we still have major problems.  This initialization gives a massively over-activated network, and there's really nothing that can be done to fix it (with tools already discussed).
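The exponential growth is easy to reproduce in isolation.  The sketch below pushes one random input through a stack of rectifier layers with weights uniform in $[-1,1]$ and no biases; the layer width of 60, the depth of 4, and the helper names are arbitrary choices for illustration, not the architecture used above.  As a back-of-envelope estimate, each such layer multiplies the mean squared activation by about $\text{fan\_in}/6$.

```python
import math
import random


def rectifier_layer(x, fan_out, rng):
    """One fully connected rectifier layer, weights uniform in [-1, 1], no bias."""
    out = []
    for _ in range(fan_out):
        z = sum(rng.uniform(-1.0, 1.0) * xi for xi in x)
        out.append(max(0.0, z))
    return out


def rms(values):
    """Root-mean-square size of a list of activations."""
    return math.sqrt(sum(v * v for v in values) / len(values))


rng = random.Random(313)
x = [rng.random() for _ in range(60)]   # stand-in input in [0, 1)
sizes = [rms(x)]
for _ in range(4):
    x = rectifier_layer(x, 60, rng)
    sizes.append(rms(x))

print(["{0:.3g}".format(s) for s in sizes])
```

With 784 inputs feeding the first layer, the per-layer growth factor is far larger than in this toy, which is how the final sigmoid layer ends up hopelessly over-activated.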

# In Defense of Rectifier Units

Supposedly, rectifier units are the new hotness, so why aren't they outperforming hyperbolic tangents, which are so 2006?  It turns out they really shine with a bigger network.  So let's do another experiment, where we give them more epochs and more neurons, do them both "right," and see which one really does better.

In [16]:
def run_bigger_exp(act_fn, cost_fn, init_fn, learning_rate):
    np.random.seed(313) # for determinism

    # Step 1: pick architecture; in prose and parameters
    num_epochs = 25
    batch_size = 50
    
    neuron_sizes = [100, 100]
    
    # Step 2: initialize
    weights, biases = init_fn(n, k, neuron_sizes)
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)

            for i in range(0, len(weights)):
                weights[i] -= learning_rate * bp_grad_w[i] / len(X_mb)
                biases[i] -= learning_rate * bp_grad_b[i] / len(X_mb)

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
        print("Cost {2:0.7f} through epoch {0}; took {1:0.3f} seconds so far.".format(epoch, time.time()-t1, cost))
    
    # Step 4: evaluate
    _, _, y = forward_prop(weights, biases, acts, train_X)
    success_rate = classification_success_rate(y[-1], train_Y)
    print("After {0} epochs, got {1:0.3f}% classifications correct.".format(num_epochs, 100*success_rate))

In [17]:
learning_rate = 2
run_bigger_exp(act_sigmoid, cost_CE, initialize_xavier_sigmoid, learning_rate)

Cost 1.7071007 through epoch 0; took 13.205 seconds so far.
Cost 0.4903770 through epoch 1; took 22.904 seconds so far.
Cost 0.3847985 through epoch 2; took 34.091 seconds so far.
Cost 0.3254435 through epoch 3; took 44.274 seconds so far.
Cost 0.2833317 through epoch 4; took 53.980 seconds so far.
Cost 0.2534444 through epoch 5; took 63.681 seconds so far.
Cost 0.2309697 through epoch 6; took 73.395 seconds so far.
Cost 0.2121886 through epoch 7; took 83.023 seconds so far.
Cost 0.1955491 through epoch 8; took 92.822 seconds so far.
Cost 0.1815544 through epoch 9; took 102.526 seconds so far.
Cost 0.1696140 through epoch 10; took 112.216 seconds so far.
Cost 0.1592080 through epoch 11; took 121.927 seconds so far.
Cost 0.1490028 through epoch 12; took 131.602 seconds so far.
Cost 0.1410483 through epoch 13; took 141.554 seconds so far.
Cost 0.1337619 through epoch 14; took 154.787 seconds so far.
Cost 0.1266575 through epoch 15; took 165.621 seconds so far.
Cost 0.1204249 through epoc

In [18]:
learning_rate = 2
run_bigger_exp(act_tanh, cost_CE, initialize_xavier_tanh, learning_rate)

Cost 0.6370673 through epoch 0; took 10.116 seconds so far.
Cost 0.2970022 through epoch 1; took 22.667 seconds so far.
Cost 0.2244473 through epoch 2; took 33.066 seconds so far.
Cost 0.1866230 through epoch 3; took 43.229 seconds so far.
Cost 0.1579253 through epoch 4; took 53.394 seconds so far.
Cost 0.1389525 through epoch 5; took 63.518 seconds so far.
Cost 0.1216503 through epoch 6; took 73.638 seconds so far.
Cost 0.1100803 through epoch 7; took 86.012 seconds so far.
Cost 0.1004261 through epoch 8; took 98.798 seconds so far.
Cost 0.0907168 through epoch 9; took 109.482 seconds so far.
Cost 0.0837887 through epoch 10; took 119.599 seconds so far.
Cost 0.0755311 through epoch 11; took 129.727 seconds so far.
Cost 0.0693652 through epoch 12; took 139.949 seconds so far.
Cost 0.0668106 through epoch 13; took 150.397 seconds so far.
Cost 0.0622533 through epoch 14; took 160.975 seconds so far.
Cost 0.0575975 through epoch 15; took 171.108 seconds so far.
Cost 0.0544478 through epoc

In [19]:
learning_rate = 1
run_bigger_exp(act_LeRU, cost_CE, initialize_xavier_leru, learning_rate)

Cost 0.6274595 through epoch 0; took 10.239 seconds so far.
Cost 0.2668945 through epoch 1; took 19.652 seconds so far.
Cost 0.2023016 through epoch 2; took 29.434 seconds so far.
Cost 0.1644078 through epoch 3; took 39.184 seconds so far.
Cost 0.1418413 through epoch 4; took 49.662 seconds so far.
Cost 0.1245527 through epoch 5; took 58.860 seconds so far.
Cost 0.1097182 through epoch 6; took 68.152 seconds so far.
Cost 0.0996820 through epoch 7; took 78.036 seconds so far.
Cost 0.0903848 through epoch 8; took 87.389 seconds so far.
Cost 0.0844181 through epoch 9; took 96.635 seconds so far.
Cost 0.0767940 through epoch 10; took 109.598 seconds so far.
Cost 0.0727951 through epoch 11; took 119.256 seconds so far.
Cost 0.0668819 through epoch 12; took 128.458 seconds so far.
Cost 0.0627825 through epoch 13; took 137.684 seconds so far.
Cost 0.0587614 through epoch 14; took 146.937 seconds so far.
Cost 0.0563800 through epoch 15; took 156.208 seconds so far.
Cost 0.0527349 through epoch

Well, as a result, we have a few observations:

1. With sufficient learning rate and epochs, and proper cost function and initialization, they all work well.
2. It is slightly true that tanh and LeRU beat sigmoid, in terms of cost (noticeably) and classification accuracy (less noticeably).
3. tanh and LeRU are still quite similar in terms of performance.
4. All the classification rates are quite astoundingly good.

It is point (4) that is most worth mentioning right now.  We will return to how to make networks learn faster (through momentum, pre-training, and so on), but at least for this example, the classification accuracy is incredible.  It turns out that we are *overfitting* badly, and need to introduce measures to prevent this.

It's bad practice to examine the test set too often, since that risks overfitting to the test set itself, but there's no harm in doing it occasionally, so long as we aren't tuning our algorithms to fit it.  Let's compare accuracy on the training set and on the test set:

In [20]:
def run_test_exp(act_fn, cost_fn, init_fn, learning_rate):
    np.random.seed(313) # for determinism

    # Step 1: pick architecture; in prose and parameters
    num_epochs = 25
    batch_size = 50
    
    neuron_sizes = [100, 100]
    
    # Step 2: initialize
    weights, biases = init_fn(n, k, neuron_sizes)
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)

            for i in range(0, len(weights)):
                weights[i] -= learning_rate * bp_grad_w[i] / len(X_mb)
                biases[i] -= learning_rate * bp_grad_b[i] / len(X_mb)

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
    
    # Step 4: evaluate
    _, _, y = forward_prop(weights, biases, acts, train_X)
    success_rate = 100*classification_success_rate(y[-1], train_Y)
    print("After {1} epochs, got {0:0.3f}% classifications correct (training).".format(success_rate, num_epochs))
    
    _, _, y = forward_prop(weights, biases, acts, test_X)
    success_rate = 100*classification_success_rate(y[-1], test_Y)
    print("After {1} epochs got {0:0.3f}% classifications correct (test).".format(success_rate, num_epochs))

In [21]:
learning_rate = 2
run_test_exp(act_tanh, cost_CE, initialize_xavier_tanh, learning_rate)

After 25 epochs, got 99.723% classifications correct (training).
After 25 epochs got 97.880% classifications correct (test).


In [22]:
learning_rate = 2
run_test_exp(act_LeRU, cost_CE, initialize_xavier_leru, learning_rate)

After 25 epochs, got 99.763% classifications correct (training).
After 25 epochs got 98.200% classifications correct (test).


It's not that the classification accuracy is *bad* on the test set -- actually it's quite good.  But we won't be able to improve it by fitting the training set more closely, as we've nearly completely fit it.  We want to force our model to generalize better, without getting to look at the test set we need to generalize to.  The usual technique to accomplish this is to force the model to be *simpler*, in one sense or another, so that there isn't "space" to "memorize" the training set, and it has to actually learn.
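To put the overfitting in concrete terms, it helps to look at error rates rather than accuracies.  The short sketch below just redoes the arithmetic on the four numbers printed by the two `run_test_exp` calls above; the dictionary layout and the `error_rate` helper are my own.

```python
# Accuracies (in percent) copied from the two run_test_exp outputs above.
results = {
    "tanh": {"train": 99.723, "test": 97.880},
    "LeRU": {"train": 99.763, "test": 98.200},
}


def error_rate(accuracy_percent):
    """Convert an accuracy in percent to an error rate in percent."""
    return 100.0 - accuracy_percent


gaps = {}
for name, acc in results.items():
    train_err = error_rate(acc["train"])
    test_err = error_rate(acc["test"])
    gaps[name] = test_err / train_err   # how many times worse the test error is
    print("{0}: train error {1:.3f}%, test error {2:.3f}%, ratio {3:.1f}x".format(
        name, train_err, test_err, gaps[name]))
```

In both cases the test error is roughly seven to eight times the training error, which is the gap that the simplification techniques mentioned above are meant to close.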