<a href="https://colab.research.google.com/github/monttj/computational-physics/blob/2021/ComPhy-10-Overtraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you are using Colab, mount your Google Drive and change to the directory that contains the necessary files (here, `computational-physics`). Skip these two cells if you are running locally.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/My Drive/computational-physics

### Overtraining

Overgeneralization is what humans do all too often. For example, you go to a foreign country and the taxi driver rips you off. You might say that all taxi drivers in that country are thieves. 

In statistics, overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

In machine learning, this is called overtraining.

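To test for it, we compare the accuracy on the training and test data sets. A training accuracy that is much higher than the test accuracy is a sign of overtraining.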
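
Before turning to the neural network, the same effect can be seen in a much smaller setting. The following is a minimal sketch, not part of the course code: it fits polynomials of two different degrees to ten noisy points and compares the error on those points with the error on unseen points. The target function, noise level, and all names (`make_data`, `mse`, `errors`) are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Return n noisy samples of sin(2*pi*x) with x drawn from [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_small, y_small = make_data(10)   # small training sample
x_new, y_new = make_data(200)      # unseen data from the same distribution

errors = {}
for degree in (3, 9):
    model = Polynomial.fit(x_small, y_small, degree)
    errors[degree] = (mse(model, x_small, y_small), mse(model, x_new, y_new))
    print("degree", degree,
          "| train MSE:", errors[degree][0],
          "| unseen MSE:", errors[degree][1])
```

The degree-9 polynomial has enough freedom to pass through all ten training points, so its training error is essentially zero. Compare the two "unseen MSE" values to judge how well each fit generalizes.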

In [None]:
import os
import sys

import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net import MultiLayerNet
from common.optimizer import SGD

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# restrict the training sample to 300 images to provoke overfitting
x_train = x_train[:300]
t_train = t_train[:300]

# weight decay setup =======================
weight_decay_lambda = 0 # no weight decay
#weight_decay_lambda = 0.1
# ====================================================

network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100], output_size=10,
                        weight_decay_lambda=weight_decay_lambda)
optimizer = SGD(lr=0.01) # learning rate = 0.01 

max_epochs = 201
train_size = x_train.shape[0]
batch_size = 100

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size / batch_size, 1)
epoch_cnt = 0

for i in range(2000):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)

    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train) # calculate accuracy for train data
        test_acc = network.accuracy(x_test, t_test) # calculate accuracy for test data
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        print("epoch:" + str(epoch_cnt) + ", train acc:" + str(train_acc) + ", test acc:" + str(test_acc))

        epoch_cnt += 1
        if epoch_cnt >= max_epochs:
            break


# make a graph==========
x = np.arange(len(train_acc_list))  # one point per recorded epoch
plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.1)
plt.legend(loc='lower right')
plt.show()

##### task 

Can you improve the model above to avoid overtraining? 
Here are some approaches to consider:
1. Increase the number of training samples.
2. Use regularization such as weight decay or dropout (see the next example for dropout).
3. Use fewer layers or nodes.

##### weight decay method

Regularization means constraining a model to make it simpler and reduce the risk of overfitting.
For the weight decay method, we add the term $ \frac{1}{2} \lambda W^2 $ to the loss function $L$ to penalize large weights. With learning rate $\eta$, the gradient-descent update becomes:

$$ W = W - \eta ( \frac{ \partial L }{ \partial W} + \lambda W) $$

$$ W = (1-\lambda \eta)W - \eta \frac{ \partial L }{ \partial W} $$

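Each update therefore shrinks the weights by the factor $(1 - \lambda \eta)$ before the usual gradient step is applied. $\lambda$ is a hyperparameter that sets the strength of the penalty.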
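
As a quick check, the two forms of the update can be compared numerically. This is a minimal sketch with a random toy weight matrix and a random stand-in for the gradient $\partial L / \partial W$; the variable names are assumptions for illustration and this is not the code inside `MultiLayerNet`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # toy weight matrix
grad_L = rng.normal(size=(3, 2))   # stand-in for dL/dW from backpropagation
lr = 0.01                          # learning rate (eta)
weight_decay_lambda = 0.1          # penalty strength (lambda)

# Form 1: add lambda * W to the gradient, then take an ordinary SGD step.
W_form1 = W - lr * (grad_L + weight_decay_lambda * W)

# Form 2: shrink the weights by (1 - lambda * eta), then apply the plain gradient step.
W_form2 = (1 - weight_decay_lambda * lr) * W - lr * grad_L

print("forms agree:", np.allclose(W_form1, W_form2))
```

Setting `weight_decay_lambda = 0` turns both forms back into plain gradient descent.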

##### dropout method

At every training step, every neuron has a probability ```p``` of being temporarily ***dropped out***.
It is entirely ignored during this training step, but it may be active during the next step.
Here ```p``` is called the dropout rate.
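
The idea can be sketched as a small layer in NumPy. This is an illustrative sketch, not the layer used by `MultiLayerNetExtend`: the class name `InvertedDropout` is hypothetical, and it uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing has to change at test time. Other implementations may instead rescale at test time.

```python
import numpy as np


class InvertedDropout:
    """Hypothetical dropout layer for illustration (inverted-dropout variant)."""

    def __init__(self, p=0.2, seed=0):
        self.p = p                              # dropout rate
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x, train_flg=True):
        if not train_flg:
            return x                            # all neurons are active at test time
        # Keep each neuron with probability (1 - p); draw a fresh mask every step.
        self.mask = self.rng.random(x.shape) >= self.p
        # Rescale the survivors so the expected activation is unchanged.
        return x * self.mask / (1.0 - self.p)

    def backward(self, dout):
        # Gradients flow only through the neurons that were kept.
        return dout * self.mask / (1.0 - self.p)


layer = InvertedDropout(p=0.2)
x = np.ones((4, 1000))
out = layer.forward(x, train_flg=True)
print("fraction dropped:", np.mean(out == 0))
print("mean activation:", out.mean())
```

Because a new mask is drawn on every call to `forward`, a neuron that is dropped in one training step can be active again in the next.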

In [None]:
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net_extend import MultiLayerNetExtend
from common.trainer import Trainer

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# restrict the training sample to 300 images to provoke overfitting
x_train = x_train[:300]
t_train = t_train[:300]

# dropout setup ========================
use_dropout = True  # set to False to train without dropout
dropout_ratio = 0.2 # probability that a neuron is dropped at each training step
# ====================================================

network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                              output_size=10, use_dropout=use_dropout, dropout_ration=dropout_ratio)
trainer = Trainer(network, x_train, t_train, x_test, t_test,
                  epochs=301, mini_batch_size=100,
                  optimizer='sgd', optimizer_param={'lr': 0.01}, verbose=True)
trainer.train()

train_acc_list, test_acc_list = trainer.train_acc_list, trainer.test_acc_list

# plot training and test accuracy per epoch ==========
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

With the dropout method, did we solve the overtraining problem?

There is no general rule for determining the hyperparameters. 
We have to try many values, sampled at random, to find the optimal one.
The following code shows how to run such a random search over the learning rate and the weight decay parameter, using a validation set to compare the trials.
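
When a hyperparameter spans several orders of magnitude, it is common to sample its exponent uniformly rather than the value itself. The sketch below, with illustrative names and the range $10^{-6}$ to $10^{-2}$ as an example, compares the two ways of sampling by counting how many draws fall in the lower half of the decades.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Log-uniform: sample the exponent uniformly, so every decade is equally likely.
lr_log = 10 ** rng.uniform(-6, -2, n)

# Plain uniform over the same interval: almost all samples land in the top decade.
lr_lin = rng.uniform(1e-6, 1e-2, n)

frac_log = float(np.mean(lr_log < 1e-4))
frac_lin = float(np.mean(lr_lin < 1e-4))
print("fraction below 1e-4 (log-uniform):", frac_log)
print("fraction below 1e-4 (uniform):", frac_lin)
```

With plain uniform sampling, values below $10^{-4}$ are hardly ever tried, so half of the decades in the range would go almost unexplored.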

In [None]:
import sys, os
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net import MultiLayerNet
from common.util import shuffle_dataset
from common.trainer import Trainer

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# to speed up, reduce size of train sample
x_train = x_train[:500]
t_train = t_train[:500]

# will use 20% as validation 
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)
x_train, t_train = shuffle_dataset(x_train, t_train)
x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]


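def __train(lr, weight_decay, epochs=50):
    """Train one network and return (validation, training) accuracy per epoch."""
    network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                            output_size=10, weight_decay_lambda=weight_decay)
    # the validation set is passed in the Trainer's test-data slot,
    # so trainer.test_acc_list holds the validation accuracy
    trainer = Trainer(network, x_train, t_train, x_val, t_val,
                      epochs=epochs, mini_batch_size=100,
                      optimizer='sgd', optimizer_param={'lr': lr}, verbose=False)
    trainer.train()

    return trainer.test_acc_list, trainer.train_acc_list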


# randomly choose hyperparameter values=====================================
optimization_trial = 100
results_val = {}
results_train = {}
for _ in range(optimization_trial):
    # set the range for weight decay parameter and learning rate==============
    weight_decay = 10 ** np.random.uniform(-8, -4)
    lr = 10 ** np.random.uniform(-6, -2)
    # ================================================

    val_acc_list, train_acc_list = __train(lr, weight_decay)
    print("val acc:" + str(val_acc_list[-1]) + " | lr:" + str(lr) + ", weight decay:" + str(weight_decay))
    key = "lr:" + str(lr) + ", weight decay:" + str(weight_decay)
    results_val[key] = val_acc_list
    results_train[key] = train_acc_list

# make plots ========================================================
print("=========== Hyper-Parameter Optimization Result ===========")
graph_draw_num = 20
col_num = 5
row_num = int(np.ceil(graph_draw_num / col_num))
i = 0

for key, val_acc_list in sorted(results_val.items(), key=lambda x:x[1][-1], reverse=True):
    print("Best-" + str(i+1) + "(val acc:" + str(val_acc_list[-1]) + ") | " + key)

    plt.subplot(row_num, col_num, i+1)
    plt.title("Best-" + str(i+1))
    plt.ylim(0.0, 1.0)
    if i % col_num: plt.yticks([])  # show y ticks only in the first column
    plt.xticks([])
    x = np.arange(len(val_acc_list))
    plt.plot(x, val_acc_list)
    plt.plot(x, results_train[key], "--")
    i += 1

    if i >= graph_draw_num:
        break

plt.show()

Can you say which one is the best model without overtraining?
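
One way to approach this question is to look at the final validation accuracy together with the gap between training and validation accuracy. The helper below is a hypothetical sketch: `summarize` is not part of the course code, and the toy dictionaries only mimic the structure of `results_val` and `results_train` (a key mapped to a list of per-epoch accuracies).

```python
def summarize(results_val, results_train):
    """Return (key, final val acc, train-minus-val gap), sorted by val acc."""
    rows = []
    for key, val_acc in results_val.items():
        final_val = val_acc[-1]
        final_train = results_train[key][-1]
        rows.append((key, final_val, final_train - final_val))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows


# Toy accuracy histories standing in for real trials (made-up numbers).
toy_val = {"a": [0.2, 0.6, 0.70], "b": [0.3, 0.7, 0.72], "c": [0.1, 0.1, 0.10]}
toy_train = {"a": [0.3, 0.7, 0.75], "b": [0.5, 0.95, 1.00], "c": [0.1, 0.1, 0.10]}

rows = summarize(toy_val, toy_train)
for key, val, gap in rows:
    print(key, "| val acc:", val, "| train - val gap:", round(gap, 2))
```

In the toy data, trial "b" has the highest validation accuracy but also the largest gap, while trial "a" is nearly as accurate with a much smaller gap. How to weigh the two is a judgment call.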