# OVERFITTING

* The more layers, the more prone the network is to overfit.   
  The smaller a network is, the less it's able to overfit.
* **Overfitting**: perfect predictions on training data, but bad results with test data.
* **Causes**: NNS can get worse:
  * if the nn is **trained too much** -> stop   
    testing accuracy starts decreasing while training accuracy still increasing
  * if the nn **learns the noise** in the dataset instead of making decisions based only on the **true signal**   
<br/>   
  
* How to avoid overfitting? With regularization
* How do you get NNs to learn only the **true signal** (essence of inputs) and ignore the noise (non-discriminant features),    
to **capture only the general information and ignore the fine-grained details**?

# Regularization method 1: Early stopping

* stop training when NN starts getting worse   
  = don't let the nn train long enough to learn it   
  = cheapest form of regularization   

### Example

In the following example:
* the model predicts **perfectly on training** data but **poorly on test** data:
```
Iter 350  Train-Err 0.108 Train-Acc 1.0  Test-Err 0.654 Test-Acc 0.807
```
* and the test accuracy **decreases** after 100 iterations even though the training accuracy keeps increasing   
  test accuracy drops to 77.1% when training finishes (training accuracy at 100%)

#### Loading data

In [2]:
import sys
import numpy as np
from keras.datasets import mnist

# x_train, x_test:
#   uint8 arrays of grayscale image data with shapes (num_samples, 28, 28)
# y_train, y_test: 
#   uint8 arrays of digit labels (integers in range 0-9) with shapes (num_samples,)
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[0:1000], y_train[0:1000]
# x_test, y_test = x_test[0:1000], y_test[0:1000]

# IMAGES
train_images = x_train.reshape(1000, 28*28) / 255
test_images = x_test.reshape(len(x_test), 28*28) / 255

# LABELS: one_hot vectors
# label '4' = [0, 0, 0, 0, 1, 0, 0...]
train_labels = np.zeros((len(y_train), 10))
for i, l in enumerate(y_train):
    train_labels[i][l] = 1

test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

print(f'train_images: {train_images.shape}, train_labels: {train_labels.shape}')    
print(f'test_images:  {test_images.shape}, test_labels:  {test_labels.shape}')

train_images: (1000, 784), train_labels: (1000, 10)
test_images:  (10000, 784), test_labels:  (10000, 10)


#### Network parameters

In [3]:
np.random.seed(1)
relu = lambda x: (x >= 0) * x  # x if x>0, 0 otherwise
relu2deriv = lambda x: x >= 0  # 1 if input>0, 0 otherwise

alpha = 0.005
iterations = 1501
hidden_size = 40

pixels_per_image = 784
num_labels = 10
num_train_images = len(train_images)
num_test_images = len(test_images)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

#### Learning and predicting

In [6]:
for j in range(iterations):
    
    #-- Predicting on training data
    
    train_error, train_correct_cnt = 0.0, 0
    
    for i in range(num_train_images):
        
        target_pred = train_labels[i:i+1]
        
        layer_0 = train_images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        
        train_error += np.sum((target_pred - layer_2) ** 2)  # pred - target_pred
        train_err = train_error / float(num_train_images)
        
        train_correct_cnt += int( np.argmax(layer_2) == np.argmax(target_pred) )
        train_acc = train_correct_cnt / float(num_train_images)
        
        layer_2_delta = target_pred - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    
    #-- Predicting on test data
    
    if (j % 50 == 0) or (j == iterations - 1):
        
        test_error, test_error_cnt = 0.0, 0

        for k in range(num_test_images):
            
            target_pred = test_labels[k:k+1]

            layer_0 = test_images[k:k+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((target_pred - layer_2) ** 2)
            test_err = test_error / float(num_test_images)
            
            test_error_cnt += int( np.argmax(layer_2) == np.argmax(target_pred) )
            test_acc = test_error_cnt / float(num_test_images)

        print('Iter ' + str(j) + \
              '  Train-Err ' + str(train_err)[0:5] + ' Train-Acc ' + str(train_acc)[0:5] + \
              '  Test-Err ' + str(test_err)[0:5] + ' Test-Acc ' + str(test_acc)[0:5])


Iter 0  Train-Err 0.722 Train-Acc 0.537  Test-Err 0.601 Test-Acc 0.702
Iter 50  Train-Err 0.204 Train-Acc 0.966  Test-Err 0.437 Test-Acc 0.894
Iter 100  Train-Err 0.166 Train-Acc 0.984  Test-Err 0.482 Test-Acc 0.869
Iter 150  Train-Err 0.145 Train-Acc 0.991  Test-Err 0.513 Test-Acc 0.854
Iter 200  Train-Err 0.130 Train-Acc 0.998  Test-Err 0.538 Test-Acc 0.846
Iter 250  Train-Err 0.120 Train-Acc 0.999  Test-Err 0.577 Test-Acc 0.831
Iter 300  Train-Err 0.113 Train-Acc 0.999  Test-Err 0.614 Test-Acc 0.818
Iter 350  Train-Err 0.108 Train-Acc 1.0  Test-Err 0.654 Test-Acc 0.807
Iter 400  Train-Err 0.106 Train-Acc 0.998  Test-Err 0.691 Test-Acc 0.797
Iter 450  Train-Err 0.105 Train-Acc 0.998  Test-Err 0.712 Test-Acc 0.793
Iter 500  Train-Err 0.104 Train-Acc 0.999  Test-Err 0.729 Test-Acc 0.788
Iter 550  Train-Err 0.104 Train-Acc 0.999  Test-Err 0.745 Test-Acc 0.779
Iter 600  Train-Err 0.105 Train-Acc 0.998  Test-Err 0.756 Test-Acc 0.771
Iter 650  Train-Err 0.106 Train-Acc 0.998  Test-Err 0.75

## Regularization method 2: Dropout

* randomly turn off neurons during training   
  = set them to 0 (and usually also their delta during backpropagation, but not necessary)
* network trains on **random subsections** of the network   
  then sum the total to maintain its expressive power
  = go-to state-of-the-art most used regularization technique
  = simple and inexpensive
* make big network act like little one by randomly training little subsections because **little networks don't overfit**:
  * small networks don't have much expressive power
  * they have room only to capture the big, obvious, high-level features
* Difference between learning with big vs small networks   
  = difference between molding with very coarse-grained clay vs very fine-grained clay   
  -> coarse-grained can only **_average_** the shapes, ignoring fine creases and corners.
* dropout = similar to training for a marathon with weights on your legs (harder to train), but taking off the weights for the big race   
  you run quite a bit faster because you trained for something that was much more difficult

### Example

* new dropout mask for each iteration: random matrix of 0s and 1s
* `layer_1 *= dropout_mask * 2`: 
  * if we turn off half the nodes in `layer_1`, the weighted sum in `layer_2` will be cut in half
  * thus, `layer_2` would increase its sensitivity to `layer_1` (like leaning closer to a radio when volume too low)
  * but at test time, we no longer use dropout (volume back up to normal), this throws off `layer_2`'s ability to listen to `layer_1`
  * we counter this by multiplying `layer_1` by `1 / percentage of turned on nodes` (here: 1/0.5, which equals to 2)
  * this way, the volume of `layer_1` is the same between training and testing
* In the previous example, the network peaked at 89.4% test accuracy then falled down to 77.1% when training finishes   
  and it predicted perfectly on training data, but poorly on test data
  * with dropout, it peaks at 89.3% then drops to only 83.3% when training peaks   
    = doesn't overfit as badly as before
  * test accuracy keeps going up and down
  * note: dropout slows down training accuracy, which previously went straight to 100%   
    twice more iterations (1500 vs 750) and still training not finished

In [6]:
for j in range(iterations):
    
    #-- Predicting on training data
        
    train_error, train_correct_cnt = 0.0, 0
    
    for i in range(num_train_images):
        
        target_pred = train_labels[i:i+1]
        
        layer_0 = train_images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape) # new dropout mask: random matrix of 0s and 1s
        layer_1 *= dropout_mask *2                              # turn off some of layer_1's nodes
        layer_2 = np.dot(layer_1, weights_1_2)
        
        train_error += np.sum((target_pred - layer_2) ** 2)
        train_err = train_error / float(num_train_images)
        
        train_correct_cnt += int( np.argmax(layer_2) == np.argmax(target_pred) )
        train_acc = train_correct_cnt / float(num_train_images)
        
        layer_2_delta = target_pred - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask                             # turn off layer_1's deltas
        
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    
    #-- Predicting on test data
    
    if (j % 50 == 0) or (j == iterations - 1):
        
        test_error, test_error_cnt = 0.0, 0

        for k in range(num_test_images):
            
            target_pred = test_labels[k:k+1]

            layer_0 = test_images[k:k+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((target_pred - layer_2) ** 2)
            test_err = test_error / float(num_test_images)
            
            test_correct_cnt += int( np.argmax(layer_2) == np.argmax(target_pred) )
            test_acc = test_correct_cnt / float(num_test_images)

        print('Iter ' + str(j) + \
              '  Train-Err ' + str(train_err)[0:5] + ' Train-Acc ' + str(train_acc)[0:5] + \
              '  Test-Err ' + str(test_err)[0:5] + ' Test-Acc ' + str(test_acc)[0:5])
        

Iter 0  Train-Err 0.885 Train-Acc 0.289  Test-Err 0.718 Test-Acc 0.570
Iter 50  Train-Err 0.462 Train-Acc 0.742  Test-Err 0.430 Test-Acc 0.888
Iter 100  Train-Err 0.452 Train-Acc 0.769  Test-Err 0.433 Test-Acc 0.880
Iter 150  Train-Err 0.457 Train-Acc 0.783  Test-Err 0.446 Test-Acc 0.870
Iter 200  Train-Err 0.442 Train-Acc 0.796  Test-Err 0.436 Test-Acc 0.882
Iter 250  Train-Err 0.433 Train-Acc 0.789  Test-Err 0.421 Test-Acc 0.884
Iter 300  Train-Err 0.407 Train-Acc 0.804  Test-Err 0.434 Test-Acc 0.875
Iter 350  Train-Err 0.418 Train-Acc 0.79  Test-Err 0.426 Test-Acc 0.893
Iter 400  Train-Err 0.386 Train-Acc 0.835  Test-Err 0.426 Test-Acc 0.883
Iter 450  Train-Err 0.398 Train-Acc 0.823  Test-Err 0.425 Test-Acc 0.881
Iter 500  Train-Err 0.380 Train-Acc 0.829  Test-Err 0.424 Test-Acc 0.878
Iter 550  Train-Err 0.390 Train-Acc 0.819  Test-Err 0.441 Test-Acc 0.866
Iter 600  Train-Err 0.360 Train-Acc 0.84  Test-Err 0.430 Test-Acc 0.865
Iter 650  Train-Err 0.369 Train-Acc 0.847  Test-Err 0.43

## Regularization

* subset of methods to help the nn learn true signal and ignore noise
* by increasing the difficulty for a model to learn the fine-grained details
* a form of training a bunch of networks and averaging them (called **_ensemble learning_**)
* since no 2 nns learn the same (see previous chapter), no 2 nns overfit the same:   
  = each network inevitably makes different mistakes, resulting in different updates   
  = **_each network overfits to different noise_**
* all networks **start by learning broad** features (general features)   
  before learning about noise (fine-grained features   
<br/>  

If you train 100 different networks (all initialized randomly):
* they will each latch on to **_different noise_**
* but **_similar broad signal_**
* so by allowing them to vote equally, their noise would tend to **_cancel out_**, revealing only what they all learned in common: **_the signal_**

# BATCH GRADIENT DESCENT

* **Increasing speed training and the rate of convergence**   
<br/>   

* training runs much faster: `np.dot` now performs batched dot products (100 at a time)   
  CPU architectures are much faster at performing batched dot products
* alpha is 20 times larger (0.005 vs 0.001)   
  because training takes an average noisy signal (average weight change over 100 training examples), it can take **bigger steps**
* batch size: pick numbers randomly until you find a `batch_size`/`alpha` pair that works well   
<br/>   

* Note: with batched gradient descent, training accuracy has smoother trend than with stochastic GD (1 training example at a time)   
  this is an effect of taking an average weight update consistently   
  individual training example are very noisy in terms of the weight updates they generate

In [4]:
import numpy as np

np.random.seed(1)
relu = lambda x: (x >= 0) * x  # x if x>0, 0 otherwise
relu2deriv = lambda x: x >= 0  # 1 if input>0, 0 otherwise

alpha = 0.005
iterations = 2001
hidden_size = 10
batch_size = 100

pixels_per_image = 784
num_labels = 10
num_train_images = len(train_images)
num_test_images = len(test_images)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    
    train_error, train_correct_cnt = (0.0, 0)
    
    for i in range(num_train_images // batch_size):
        
        batch_start, batch_end = (i * batch_size), ((i+1) * batch_size)
                
        layer_0 = train_images[batch_start:batch_end]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1, weights_1_2)
        
        train_error += np.sum(train_labels[batch_start:batch_end] - layer_2) ** 2
        train_err = train_error / float(num_train_images)
        
        for k in range(batch_size):            
            train_correct_cnt += int( np.argmax(layer_2[k:k+1]) == \
                                      np.argmax(train_labels[batch_start+k:batch_start+k+1]) )
        train_acc = train_correct_cnt / float(num_train_images)
            
        layer_2_delta = (train_labels[batch_start:batch_end] - layer_2) / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask
            
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
        
    
    if (j % 50 == 0):
        test_error, test_correct_cnt = (0.0, 0)
        
        for m in range(num_test_images):            
            layer_0 = test_images[m:m+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)
            
            test_error += np.sum(test_labels[m:m+1] - layer_2) ** 2
            test_err = test_error / float(num_test_images)
            
            test_correct_cnt += int( np.argmax(layer_2) == np.argmax(test_labels[m:m+1]) )
            test_acc = test_correct_cnt / float(num_test_images)

        print('Iter ' + str(j) + \
              '  Train-Err ' + str(train_err)[0:5] + ' Train-Acc ' + str(train_acc)[0:5] + \
              '  Test-Err ' + str(test_err)[0:5] + ' Test-Acc ' + str(test_acc)[0:5])

Iter 0  Train-Err 82.46 Train-Acc 0.118  Test-Err 0.818 Test-Acc 0.128
Iter 50  Train-Err 16.21 Train-Acc 0.218  Test-Err 0.269 Test-Acc 0.289
Iter 100  Train-Err 8.870 Train-Acc 0.286  Test-Err 0.187 Test-Acc 0.435
Iter 150  Train-Err 10.66 Train-Acc 0.294  Test-Err 0.163 Test-Acc 0.486
Iter 200  Train-Err 10.42 Train-Acc 0.288  Test-Err 0.157 Test-Acc 0.506
Iter 250  Train-Err 11.97 Train-Acc 0.326  Test-Err 0.160 Test-Acc 0.517
Iter 300  Train-Err 11.67 Train-Acc 0.327  Test-Err 0.164 Test-Acc 0.525
Iter 350  Train-Err 14.47 Train-Acc 0.33  Test-Err 0.168 Test-Acc 0.530
Iter 400  Train-Err 13.74 Train-Acc 0.337  Test-Err 0.174 Test-Acc 0.545
Iter 450  Train-Err 13.81 Train-Acc 0.355  Test-Err 0.181 Test-Acc 0.555
Iter 500  Train-Err 14.97 Train-Acc 0.376  Test-Err 0.186 Test-Acc 0.564
Iter 550  Train-Err 15.76 Train-Acc 0.363  Test-Err 0.199 Test-Acc 0.572
Iter 600  Train-Err 16.33 Train-Acc 0.369  Test-Err 0.204 Test-Acc 0.579
Iter 650  Train-Err 15.96 Train-Acc 0.369  Test-Err 0.2