In [11]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# L1 and L2 regularization

In this notebook we'll talk about regularization.  This is a first step toward dealing with *overfitting* -- the situation where the model fits the training data extremely well, but fits the test data comparatively less well.  For example, in the last experiment of the preceding notebook, the network got 99.763% accuracy on the training set, but only 98.200% accuracy on the test set.

On the one hand, this is pretty good test set accuracy for this problem (although not state-of-the-art).  On the other hand, the network had an error rate of 0.237% on the training set, and 1.800% on the test set, meaning it made more than 7.5 times as many errors (per example) on the test set as on the training set.
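The arithmetic behind that ratio is short enough to check directly. This is just a sketch using the two accuracies quoted above; the variable names are ours.

```python
# Accuracies quoted above, in percent.
train_acc = 99.763
test_acc = 98.200

# Error rate is simply 100% minus accuracy.
train_err = 100.0 - train_acc
test_err = 100.0 - test_acc

# How many times more errors (per example) on the test set?
ratio = test_err / train_err
print("train error {0:0.3f}%, test error {1:0.3f}%, ratio {2:0.2f}".format(
    train_err, test_err, ratio))
```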

Ideally the network would be equally accurate on both sets.

## Model Complexity

One explanation for overfitting is that the model is allowed to be too complicated, and somehow "memorizes" the training set.  A simpler model cannot do this: lacking enough neurons to handle lots of edge cases individually, it has to capture the underlying structure of the problem instead.

If this explanation is correct, the solution is to introduce a **penalty for complexity**.  This penalty is added to the cost function, so that gradient descent automatically accounts for it during optimization.  This way, the model will learn complexity when it actually helps, that is, when it gives significantly improved training set performance.  However, it will not learn complexity if it only helps on a small number of training examples.

## L2 Regularization / Ridge Regression

The simplest and most classical form of regularization is called *ridge regression*, or *L2 regularization*.  It introduces a cost: $$\textrm{L2Cost}=\lambda \sum_{i,j,k} w_{i,j,k}^2$$

This penalizes all weights, encouraging them to go downward to zero. It especially penalizes large weights, which could indicate over-reliance on a single parameter (which may not generalize well).  For this reason, if two parameters are accomplishing essentially the same (useful) thing, ridge regularization will tend to shrink them down until they are the same size (compare this to L1 regularization below).

The gradient of this cost is easy to compute: $$ \dfrac{\partial \textrm{L2Cost}}{\partial w}=2\lambda w$$
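As a sanity check on that formula, here is a small sketch that compares $2\lambda w$ against a central finite difference of the L2 cost. The helper names `l2_cost` and `l2_grad` are made up for this illustration and are not part of the notebook's code.

```python
import numpy as np

def l2_cost(lam, w):
    # lambda times the sum of squared weights
    return lam * np.sum(w ** 2)

def l2_grad(lam, w):
    # analytic gradient: 2 * lambda * w
    return 2 * lam * w

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))
lam = 0.1
eps = 1e-6

# Central finite difference, one weight at a time.
numeric = np.zeros_like(w)
for idx in np.ndindex(w.shape):
    w_plus = w.copy()
    w_plus[idx] += eps
    w_minus = w.copy()
    w_minus[idx] -= eps
    numeric[idx] = (l2_cost(lam, w_plus) - l2_cost(lam, w_minus)) / (2 * eps)

max_diff = np.max(np.abs(numeric - l2_grad(lam, w)))
print("largest disagreement:", max_diff)
```

The disagreement should be at the level of floating-point rounding, since the cost is quadratic in each weight.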

Note that larger $\lambda$ means more regularization.  It is a simple fact that as $\lambda$ increases, training accuracy will **decrease**.  If $\lambda=0$, then the training accuracy will continue to be excellent, as seen above.  A large $\lambda$ will absolutely ruin the model; we add in a relatively low $\lambda$ to encourage the model to better approximate the data.

There are several important points to raise about this:

1. Typically one does *not* regularize the bias terms. If you do, it makes the model tend toward the zero predictor (which is biased toward certain datasets).  If you don't, it makes the model tend toward predicting the most common class (for classification) or mean output (for regression), which is less biased.
2. For stochastic gradient descent, some authors scale the contribution of regularization by $\frac{B}{m}$, where $B$ is the size of the current batch, and $m$ is the total size of the training data.  We *do not*.
3. Some authors divide the regularization penalty by $m$, the size of the training set.  Others don't.  It is true that as the data gets larger, overfitting is less of a problem, so we need less regularization, but this doesn't necessarily need to be automatic.  In our implementation, we will *not* divide $\lambda$ by $m$, as it hurts generalization to the online case.  You can divide $\lambda$ by $m$ when you plug it into the experiment code.
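Point 1 can be illustrated outside of neural networks. The sketch below uses plain linear ridge regression on synthetic data (our own toy setup, not the notebook's network) with a deliberately enormous $\lambda$. With an unregularized bias the predictions collapse to the mean of the targets; with a regularized bias they collapse to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

lam = 1e8  # deliberately huge, to show the limiting behaviour

# (a) Bias NOT regularized: center the data, solve ridge for w, recover b.
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(3), Xc.T @ yc)
b = y_mean - X_mean @ w
pred_free_bias = X @ w + b

# (b) Bias regularized: treat it as one more weight on a column of ones.
A = np.hstack([X, np.ones((m, 1))])
theta = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ y)
pred_reg_bias = A @ theta

print("mean of y:                 ", y_mean)
print("prediction, free bias:     ", pred_free_bias[:3])
print("prediction, penalized bias:", pred_reg_bias[:3])
```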

**Note about (2)**: The "authors" who scale their regularization in this way include Yoshua Bengio and his group, and we are not following their convention because we are following their intent.  In <a href="http://arxiv.org/pdf/1206.5533v2.pdf">Bengio's paper</a>, he recommends scaling the regularization term in this way because it makes the stochastic gradient an unbiased estimate of the total gradient.

That is, he uses the *total gradient*: the gradient of the total cost, summed over all training examples in the batch (or mini-batch).  However, we use the *average gradient* (see the experiment code, where we divide by the length of the mini-batch).  There is no fundamental reason to prefer one over the other, as they differ only by a scalar factor, but we use the average.  In this case, leaving $\lambda$ un-scaled for the mini-batch makes the mini-batch gradient an unbiased estimator of the average gradient over the whole batch (that is, the full training set).

The whole point is to make the effect of regularization independent of the size of the mini-batch, and we will experimentally verify that this is the case.
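The unbiasedness claim can be checked numerically on a toy problem. The sketch below uses a least-squares cost rather than the notebook's network (an assumption made purely to keep the example self-contained): averaging the regularized mini-batch gradients over a full pass reproduces the regularized full-batch average gradient, whatever the batch size, with $\lambda$ left unscaled.

```python
import numpy as np

def avg_grad(w, X, y):
    # Average gradient of 0.5 * (x.w - y)^2 over the given examples.
    return X.T @ (X @ w - y) / len(X)

rng = np.random.default_rng(1)
m = 200
X = rng.normal(size=(m, 5))
y = rng.normal(size=m)
w = rng.normal(size=5)
lam = 0.01

# Regularized average gradient over the full training set.
full = avg_grad(w, X, y) + 2 * lam * w

gaps = {}
for B in (10, 50):
    # Same unscaled lambda in every mini-batch, regardless of B.
    batch_grads = [avg_grad(w, X[s:s + B], y[s:s + B]) + 2 * lam * w
                   for s in range(0, m, B)]
    gaps[B] = np.max(np.abs(np.mean(batch_grads, axis=0) - full))

print(gaps)
```

The gaps should be at rounding level for both batch sizes, since `m` is divisible by each `B` here.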

### The Code

In what follows, there are a lot of potential flags and so on, but the actual code is extremely short.  In practice we essentially only ever need `diff==True` and `regularize_bias==False`, which could be accomplished in about one line of code (line 9, specifically), so don't be overwhelmed by the size of the code.

In [1]:
def ridge_cost(l2_lambda, weights, biases,
               diff=False, aggregate=False,
               regularize_bias=False):
    
    L = len(weights)
    
    if diff:
        scale = 2*l2_lambda
        weight_grad = [scale*weights[i] for i in range(0, L)]
        
        if regularize_bias:
            biases_grad = [scale*biases[i] for i in range(0, L)]
            return weight_grad, biases_grad
        
        else:
            return weight_grad
    
    elif aggregate:
        cost = l2_lambda * sum([np.sum(weights[i]**2) for i in range(0, L)])
        if regularize_bias:
            cost += l2_lambda * sum([np.sum(biases[i]**2) for i in range(0, L)])
            
        return cost
    
    else:
        weight_cost = [l2_lambda * (weights[i]**2) for i in range(0, L)]
        
        if regularize_bias:
            biases_cost = [l2_lambda * (biases[i]**2) for i in range(0, L)]
            return weight_cost, biases_cost
        
        else:
            return weight_cost

Note that we place a flag about whether or not to regularize the bias; the meaning of the flag is clear, but note that it affects what is returned.  If `regularize_bias` is off, it does not return zeroes (or equivalent).
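For the common case (`diff=True`, `regularize_bias=False`), the whole function reduces to roughly one line. The helper below, `ridge_grad_weights`, is a hypothetical name used only for this sketch; it shows that the result is a plain list with one gradient array per layer, and no bias entry at all.

```python
import numpy as np

def ridge_grad_weights(l2_lambda, weights):
    # Hypothetical compact equivalent of ridge_cost(..., diff=True):
    # one gradient array per layer, same shape as that layer's weights.
    return [2 * l2_lambda * w for w in weights]

# Two toy "layers" with made-up shapes and values.
weights = [np.ones((4, 3)), np.full((3, 2), -2.0)]
grads = ridge_grad_weights(0.5, weights)

for w, g in zip(weights, grads):
    print(w.shape, g.shape, g[0, 0])
```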

### A Worked Example

That's the cost, but let's look at how to integrate it into our training process.  First, let's get hold of the data:

In [2]:
from mnist_import import get_mnist_nice

train_X, train_Y, test_X, test_Y = get_mnist_nice()

n = train_X.shape[1]
k = train_Y.shape[1]

Here is the new experiment code.  We've made several changes to the previous version, some of which are for regularization, and some of which just streamline the process.  Here are the ones of the second category:
1. All the hyperparameters are now arguments to the experiment, including network size, length of experiment, and so on.  It is still assumed that the activation functions will all be the same, except the last which is a sigmoid.
2. The experiment now returns the trained network, so we can do whatever we like with it afterward.

Also note lines 30-33 and 35-38.  Instead of just subtracting the scaled `back_prop` gradient from weights and biases, we accumulate a total cost gradient from all sources (currently just two), then subtract it all at once.  This allows us to add further costs (for example, L1 regularization) without changing the code too much.

The actual L2 regularization gradient is computed in line 27 and added to the weight gradient in line 31.  The biases receive no regularization term, consistent with the default `regularize_bias=False`; the bias update is still kept as its own accumulate-then-subtract step, so a bias penalty can easily be added later if we change our minds.

In [3]:
from basic_nn import *
import time

def run_exp(act_fn, cost_fn, init_fn, learning_rate,
           neuron_sizes, num_epochs, batch_size,
           l2_cost=0
           ):
    np.random.seed(313) # for determinism
    
    # Step 2: initialize
    weights, biases = init_fn(n, k, neuron_sizes)
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)
            l2_grad_w = ridge_cost(l2_cost, weights, biases, diff=True)

            for i in range(0, len(weights)):
                weight_grad = learning_rate * bp_grad_w[i] / len(X_mb)  # average over the mini-batch
                weight_grad += l2_grad_w[i]  # note: added as-is, not scaled by learning_rate
                
                weights[i] -= weight_grad
                
                biases_grad = learning_rate * bp_grad_b[i] / len(X_mb)
                
                biases[i] -= biases_grad

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
        print("Cost {2:0.7f} through epoch {0}; took {1:0.3f} seconds so far.".format(epoch, time.time()-t1, cost))
    
    # Step 4: evaluate
    _, _, y = forward_prop(weights, biases, acts, train_X)
    success_rate = classification_success_rate(y[-1], train_Y)
    print("After {0} epochs, got {1:0.3f}% classifications correct (train).".format(num_epochs, 100*success_rate))
    
    return weights, biases, acts

Now, let's take a successful network from before and see how it does with some regularization.

In [4]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 50

l2_cost = 0

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l2_cost=l2_cost)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.6370673 through epoch 0; took 11.733 seconds so far.
Cost 0.2970022 through epoch 1; took 22.979 seconds so far.
Cost 0.2244473 through epoch 2; took 34.271 seconds so far.
Cost 0.1866230 through epoch 3; took 45.922 seconds so far.
Cost 0.1579253 through epoch 4; took 59.424 seconds so far.
Cost 0.1389525 through epoch 5; took 73.795 seconds so far.
Cost 0.1216503 through epoch 6; took 86.303 seconds so far.
Cost 0.1100803 through epoch 7; took 100.316 seconds so far.
Cost 0.1004261 through epoch 8; took 114.539 seconds so far.
Cost 0.0907168 through epoch 9; took 126.514 seconds so far.
Cost 0.0837887 through epoch 10; took 139.498 seconds so far.
Cost 0.0755311 through epoch 11; took 153.070 seconds so far.
Cost 0.0693652 through epoch 12; took 167.383 seconds so far.
Cost 0.0668106 through epoch 13; took 181.051 seconds so far.
Cost 0.0622533 through epoch 14; took 193.807 seconds so far.
Cost 0.0575975 through epoch 15; took 208.023 seconds so far.
Cost 0.0544478 through ep

Note that the training set error is presented first, and the test set error is presented second.  Here the test error was approximately five times the training error.  Now let's try with regularization:

In [5]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 50

l2_cost = 4 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l2_cost=l2_cost)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.6463834 through epoch 0; took 11.430 seconds so far.
Cost 0.3164446 through epoch 1; took 22.774 seconds so far.
Cost 0.2505502 through epoch 2; took 34.169 seconds so far.
Cost 0.2178240 through epoch 3; took 45.439 seconds so far.
Cost 0.1969646 through epoch 4; took 58.164 seconds so far.
Cost 0.1850752 through epoch 5; took 69.497 seconds so far.
Cost 0.1724567 through epoch 6; took 80.808 seconds so far.
Cost 0.1690400 through epoch 7; took 92.178 seconds so far.
Cost 0.1604903 through epoch 8; took 103.567 seconds so far.
Cost 0.1554749 through epoch 9; took 114.930 seconds so far.
Cost 0.1533960 through epoch 10; took 126.252 seconds so far.
Cost 0.1485437 through epoch 11; took 137.569 seconds so far.
Cost 0.1458224 through epoch 12; took 148.877 seconds so far.
Cost 0.1429504 through epoch 13; took 160.209 seconds so far.
Cost 0.1393662 through epoch 14; took 171.577 seconds so far.
Cost 0.1384094 through epoch 15; took 183.488 seconds so far.
Cost 0.1376486 through epo

As we can see, the costs leveled off much higher with the regularized version than with the un-regularized version.  The classification error for the training set was significantly worse, but the error for the test set was somewhat better.  Remember that the goal is to improve test error, not training error!

## L1 Regularization / LASSO Regression

Somewhat later than ridge regression came the LASSO regression technique.  It is also called L1 regularization, because it introduces a penalty which is proportional to the L1 norm.  Specifically, it's: $$\textrm{L1Cost} = \sum_{i,j,k}\lambda\cdot|w_{i,j,k}|$$

The vertical bars indicate absolute value.  This is somewhat unfortunate for purists, as the function is not differentiable at zero, but in practice this is not a problem, since things are rarely exactly zero.  Its gradient is simply
$$\dfrac{\partial \textrm{L1Cost}}{\partial w}=\lambda\cdot \textrm{sgn}(w)$$
where $\textrm{sgn}$ is the "signum" function, which returns 1 if $w$ is positive, -1 if $w$ is negative, and 0 if $w$ is zero.  (The derivative of $|w|$ is undefined at zero; using zero there is a convention.)

As before, $\lambda$ is a nonnegative constant, where larger values lead to more regularization, and zero means no (lasso) regularization.

This is easy to implement, despite the lack of differentiability:

In [28]:
def lasso_cost(l1_lambda, weights, biases,
               diff=False, aggregate=False,
               regularize_bias=False):
    L = len(weights)
    
    if diff:
        weight_grad = [l1_lambda*np.sign(weights[i]) for i in range(0, L)]
        
        if regularize_bias:
            biases_grad = [l1_lambda*np.sign(biases[i]) for i in range(0, L)]
            return weight_grad, biases_grad
        else:
            return weight_grad
    
    elif aggregate:
        cost = l1_lambda * sum([np.sum(np.abs(weights[i])) for i in range(0, L)])
        
        if regularize_bias:
            cost += l1_lambda * sum([np.sum(np.abs(biases[i])) for i in range(0, L)])
            
        return cost
    
    else:
        weight_cost = [l1_lambda * np.abs(weights[i]) for i in range(0, L)]
        
        if regularize_bias:
            biases_cost = [l1_lambda * np.abs(biases[i]) for i in range(0, L)]
            return weight_cost, biases_cost
        else:
            return weight_cost

Note that the `np.sign` function does exactly what the signum function is supposed to.
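A quick check of that behaviour, including the zero case, using made-up weights and a made-up `l1_lambda`:

```python
import numpy as np

w = np.array([-2.5, 0.0, 3.0])
signs = np.sign(w)  # -1 for negatives, 0 at zero, +1 for positives
print(signs)

# The L1 gradient is lambda times the sign, so every nonzero weight
# is pushed toward zero with the same force, regardless of its size.
l1_lambda = 0.1
l1_grad = l1_lambda * signs
print(l1_grad)
```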

## Comparing L1 and L2 Regularization
You might wonder why we need both L2 and L1 regularization, and perhaps we don't.  In fact they do very similar things, encouraging weights to be smaller.  They work for essentially the same reasons.

However, they do somewhat different things.  If two weights accomplish essentially the same task, and the ideal is for them to have a total of 1, then without regularization, they might converge to any pair of values that adds up to 1, for example 100 and -99, which is not ideal.  There is no pressure to encourage them toward any particular value.

L2 regularization encourages both of them to be small and approximately equal, converging to perhaps 0.5 and 0.5, as this induces an L2 cost of $\frac{1}{2}\lambda$, while one being 0 and the other being one induces an L2 cost of $\lambda$.

On the other hand, L1 regularization encourages one of them to be exactly zero, and the other to be one.  In case they are literally equal, L1 regularization doesn't do this (anything where they are between 0 and 1, and add up to one, is fine).  But more usually they are simply very similar; in this case, whichever is *better* (for whatever task) will tend to one, and the other to zero.

Thus L2 regularization makes the whole model *smaller* and thus less variable, while L1 regularization makes individual weights disappear, making the model *sparser*.  This is sometimes desirable.
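The penalties for the three weight pairs discussed above can be tabulated directly. This sketch only evaluates the two penalty formulas with $\lambda = 1$; it does not train anything.

```python
import numpy as np

lam = 1.0
candidates = {
    "(0.5, 0.5)": np.array([0.5, 0.5]),
    "(1, 0)": np.array([1.0, 0.0]),
    "(100, -99)": np.array([100.0, -99.0]),
}

l2_costs = {}
l1_costs = {}
for name, w in candidates.items():
    l2_costs[name] = lam * np.sum(w ** 2)     # L2 penalty
    l1_costs[name] = lam * np.sum(np.abs(w))  # L1 penalty
    print("{0:>11}: L2 = {1:9.2f}, L1 = {2:6.2f}".format(
        name, l2_costs[name], l1_costs[name]))
```

All three pairs sum to 1, but L2 strictly prefers the even split, while L1 is indifferent between the even split and the sparse pair; both heavily penalize the large cancelling pair.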

## Tuning

Bengio gives approximately the same advice for tuning L1 as for L2 regularization, along with some statistical justification for what each of them does.  The only thing we will add to his notes and hints is that, if both L1 and L2 regularization are used, the corresponding $\lambda$ values are *different hyperparameters* and *should be tuned separately and independently*.

Let's try some experiments where we use L1 regularization instead of L2 regularization; compare these results to the above.

In [26]:
from basic_nn import *
import time

def run_exp(act_fn, cost_fn, init_fn, learning_rate,
           neuron_sizes, num_epochs, batch_size,
           l1_cost=0, l2_cost=0
           ):
    np.random.seed(313) # for determinism
    
    # Step 2: initialize
    weights, biases = init_fn(n, k, neuron_sizes)
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)
            l1_grad_w = lasso_cost(l1_cost, weights, biases, diff=True)
            l2_grad_w = ridge_cost(l2_cost, weights, biases, diff=True)

            for i in range(0, len(weights)):
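                weight_grad = learning_rate * bp_grad_w[i] / len(X_mb)  # average over the mini-batch
                weight_grad += l1_grad_w[i]  # L1 term; not scaled by learning_rate
                weight_grad += l2_grad_w[i]  # L2 term; not scaled by learning_rate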
                
                weights[i] -= weight_grad
                
                biases_grad = learning_rate * bp_grad_b[i] / len(X_mb)
                
                biases[i] -= biases_grad

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
        print("Cost {2:0.7f} through epoch {0}; took {1:0.3f} seconds so far.".format(epoch, time.time()-t1, cost))
    
    # Step 4: evaluate
    _, _, y = forward_prop(weights, biases, acts, train_X)
    success_rate = classification_success_rate(y[-1], train_Y)
    print("After {0} epochs, got {1:0.3f}% classifications correct (train).".format(num_epochs, 100*success_rate))
    
    return weights, biases, acts

In the above, the L1 cost was easy to add; it was computed in line 27, and used in line 32.

In [27]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 50

l1_cost = 4 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l1_cost=l1_cost, l2_cost=0)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.7064186 through epoch 0; took 16.202 seconds so far.
Cost 0.4129381 through epoch 1; took 30.855 seconds so far.
Cost 0.3549946 through epoch 2; took 45.276 seconds so far.
Cost 0.3315202 through epoch 3; took 58.547 seconds so far.
Cost 0.3097674 through epoch 4; took 71.738 seconds so far.
Cost 0.2986018 through epoch 5; took 85.096 seconds so far.
Cost 0.2931440 through epoch 6; took 98.298 seconds so far.
Cost 0.2831075 through epoch 7; took 111.495 seconds so far.
Cost 0.2753639 through epoch 8; took 128.019 seconds so far.
Cost 0.2702161 through epoch 9; took 146.878 seconds so far.
Cost 0.2729327 through epoch 10; took 160.047 seconds so far.
Cost 0.2620355 through epoch 11; took 173.171 seconds so far.
Cost 0.2653828 through epoch 12; took 186.249 seconds so far.
Cost 0.2577904 through epoch 13; took 199.284 seconds so far.
Cost 0.2574434 through epoch 14; took 212.303 seconds so far.
Cost 0.2571472 through epoch 15; took 225.339 seconds so far.
Cost 0.2536157 through ep

Observe that the scores (cross entropy costs) more or less stopped going down, in any significant way, after around epoch 10 or so.  Also, we've almost completely stopped overfitting, though we've perhaps thrown the baby out with the bathwater and knocked train and test accuracy down too far.

Let's try it with a less aggressive regularization parameter:

In [29]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 50

l1_cost = 2 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l1_cost=l1_cost, l2_cost=0)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.6639774 through epoch 0; took 13.905 seconds so far.
Cost 0.3479109 through epoch 1; took 29.234 seconds so far.
Cost 0.2885470 through epoch 2; took 43.017 seconds so far.
Cost 0.2623045 through epoch 3; took 56.867 seconds so far.
Cost 0.2421258 through epoch 4; took 70.717 seconds so far.
Cost 0.2304192 through epoch 5; took 84.714 seconds so far.
Cost 0.2245165 through epoch 6; took 98.525 seconds so far.
Cost 0.2119657 through epoch 7; took 112.281 seconds so far.
Cost 0.2057250 through epoch 8; took 125.808 seconds so far.
Cost 0.2020185 through epoch 9; took 139.428 seconds so far.
Cost 0.1944786 through epoch 10; took 153.176 seconds so far.
Cost 0.1896175 through epoch 11; took 166.556 seconds so far.
Cost 0.1874693 through epoch 12; took 180.019 seconds so far.
Cost 0.1860393 through epoch 13; took 193.565 seconds so far.
Cost 0.1802805 through epoch 14; took 206.948 seconds so far.
Cost 0.1800305 through epoch 15; took 220.380 seconds so far.
Cost 0.1778965 through ep

Even with the lower L1 penalty, it's still a pretty excellent reduction in overfitting.  Let's try one more:

In [30]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 50

l1_cost = 1 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l1_cost=l1_cost, l2_cost=0)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.6479678 through epoch 0; took 20.311 seconds so far.
Cost 0.3176793 through epoch 1; took 33.537 seconds so far.
Cost 0.2523066 through epoch 2; took 46.669 seconds so far.
Cost 0.2183188 through epoch 3; took 59.825 seconds so far.
Cost 0.1950166 through epoch 4; took 72.938 seconds so far.
Cost 0.1828854 through epoch 5; took 86.063 seconds so far.
Cost 0.1724679 through epoch 6; took 99.224 seconds so far.
Cost 0.1614637 through epoch 7; took 113.393 seconds so far.
Cost 0.1569031 through epoch 8; took 126.806 seconds so far.
Cost 0.1500302 through epoch 9; took 139.972 seconds so far.
Cost 0.1441631 through epoch 10; took 153.125 seconds so far.
Cost 0.1383619 through epoch 11; took 166.259 seconds so far.
Cost 0.1334704 through epoch 12; took 179.438 seconds so far.
Cost 0.1326735 through epoch 13; took 192.535 seconds so far.
Cost 0.1270855 through epoch 14; took 205.679 seconds so far.
Cost 0.1288318 through epoch 15; took 218.845 seconds so far.
Cost 0.1247921 through ep

Sadly, it isn't magical.  Lasso regression seems to do a good job at reducing overfitting, but we'll need new techniques to get both good accuracy on the training set *and* a lack of overfitting.

## Verifying Our Assumptions

Earlier, we stated that, for stochastic gradient descent, some authors (e.g. Bengio) scale the contribution of regularization by $\frac{B}{m}$, but we don't.  The purpose of this scaling (in their paper) is to make the amount of regularization invariant under the size of the batch.  We argued that this was appropriate for them (who use the total error for the gradient) but not for us (who use the mean error).  We now demonstrate that if we increase the batch size, the amount of regularization stays the same, but if we use their scaling, it doesn't.
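To make the comparison concrete, here is the arithmetic for the penalty each convention would apply as the batch size grows from 50 to 100. The value `m = 60000` is an assumption (the usual MNIST training set size; the notebook itself uses `len(train_X)`), and `ref_batch` is a name introduced only for this sketch.

```python
import math

m = 60000           # assumption: size of the training set
base_lambda = 4 / m
ref_batch = 50      # batch size at which both conventions agree

penalties = {}
for B in (50, 100):
    ours = base_lambda                      # unscaled: independent of B
    theirs = base_lambda * (B / ref_batch)  # grows in proportion to B
    penalties[B] = (ours, theirs)
    print("B = {0:3d}: ours = {1:.3e}, B/m-style scaling = {2:.3e}".format(
        B, ours, theirs))
```

Under the scaled convention, doubling the batch size doubles the penalty applied in our average-gradient setting, from `4/m` to `8/m`.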

First, compare this experiment with the same one above, with L2 penalty of `4/len(train_X)` and batch size of `50`:

In [31]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 100

l2_cost = 4 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l2_cost=l2_cost)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.7945432 through epoch 0; took 13.143 seconds so far.
Cost 0.3767478 through epoch 1; took 25.628 seconds so far.
Cost 0.2925384 through epoch 2; took 40.078 seconds so far.
Cost 0.2499553 through epoch 3; took 50.072 seconds so far.
Cost 0.2187889 through epoch 4; took 59.865 seconds so far.
Cost 0.1994401 through epoch 5; took 70.202 seconds so far.
Cost 0.1834057 through epoch 6; took 81.031 seconds so far.
Cost 0.1709964 through epoch 7; took 90.787 seconds so far.
Cost 0.1616552 through epoch 8; took 100.557 seconds so far.
Cost 0.1545272 through epoch 9; took 110.383 seconds so far.
Cost 0.1496675 through epoch 10; took 120.245 seconds so far.
Cost 0.1410364 through epoch 11; took 130.048 seconds so far.
Cost 0.1363483 through epoch 12; took 139.902 seconds so far.
Cost 0.1327514 through epoch 13; took 149.656 seconds so far.
Cost 0.1301203 through epoch 14; took 159.453 seconds so far.
Cost 0.1257846 through epoch 15; took 169.299 seconds so far.
Cost 0.1247712 through epo

Now, if we had done their scaling, it would have the net effect of doubling the L2 penalty, which would look as follows:

In [32]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 2

neuron_sizes = [100, 100]
num_epochs = 25
batch_size = 100

l2_cost = 8 / len(train_X)

weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate,
                                neuron_sizes, num_epochs, batch_size,
                                l2_cost=l2_cost)

# Get test error, too
_, _, y = forward_prop(weights, biases, acts, test_X)
success_rate = classification_success_rate(y[-1], test_Y)
print("After {0} epochs, got {1:0.3f}% classifications correct (test).".format(num_epochs, 100*success_rate))

Cost 0.8033346 through epoch 0; took 9.702 seconds so far.
Cost 0.3941819 through epoch 1; took 19.628 seconds so far.
Cost 0.3161697 through epoch 2; took 29.454 seconds so far.
Cost 0.2787638 through epoch 3; took 39.216 seconds so far.
Cost 0.2518769 through epoch 4; took 48.968 seconds so far.
Cost 0.2365621 through epoch 5; took 59.060 seconds so far.
Cost 0.2231726 through epoch 6; took 69.779 seconds so far.
Cost 0.2135124 through epoch 7; took 80.055 seconds so far.
Cost 0.2059177 through epoch 8; took 89.908 seconds so far.
Cost 0.2010070 through epoch 9; took 99.669 seconds so far.
Cost 0.1980287 through epoch 10; took 109.515 seconds so far.
Cost 0.1901233 through epoch 11; took 119.366 seconds so far.
Cost 0.1874954 through epoch 12; took 129.118 seconds so far.
Cost 0.1840133 through epoch 13; took 141.864 seconds so far.
Cost 0.1819456 through epoch 14; took 155.124 seconds so far.
Cost 0.1809950 through epoch 15; took 165.255 seconds so far.
Cost 0.1792845 through epoch 

Comparing our results, we can easily see that the original experiment lined up well with the second-to-last.  The final experiment clearly had more regularization than the original, as represented by its cross entropy scores (consistently higher).