# Testing with Out-of-Sample Data 

## Overfitting

**Overfitting** is effectively just memorizing the data without any understanding of it. An overfit model will do very well predicting the data that it has already seen, but often significantly worse on unseen data. <br><br>
![](img1.png) <br><br>
- The left image shows an example of generalization. In this example, the model learned to separate red and blue data points, even if some of them will be predicted incorrectly. One reason for this might be that the data contain some “confusing” samples. Looking at the image, you can see that, for example, some of the blue dots arguably should not be there; removing them would raise the data quality and make the data easier to fit. Building a good dataset is one of the biggest challenges with neural networks.
- The right image shows a model that memorized the data, fitting it perfectly and ruining generalization.

## Training and Testing Data

Without knowing if a model overfits the training data, we cannot trust the model’s results. For this reason, it’s essential to have both *training* and *testing* data as separate sets for different purposes. <br>
Training data should only be used to train a model. The testing, or out-of-sample, data should only be used to validate a model’s performance after training (we are using the testing data during training later in this chapter for demonstration purposes only). The idea is that some data are reserved and withheld from the training data for testing the model’s performance. <br>
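
As a minimal sketch of reserving data for testing, the snippet below shuffles sample indices once and withholds a fraction of them. The helper name `holdout_split`, the `test_fraction` parameter, and the toy arrays are assumptions made for illustration; they are not part of this chapter's code.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the samples once, then reserve a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 3

X_train, y_train, X_test, y_test = holdout_split(X, y, test_fraction=0.3)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

The withheld samples are never passed to the training loop; they are only used for a forward pass after training.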


*Cautionary Notes* <br><br>
- In many cases, one can take a random sample of the available data to train with and make the remaining data the testing dataset. You still need to be very careful about information leaking through. One common area where this can be problematic is time-series data. Consider a scenario where you have data from sensors collected every second. You might have millions of observations, and randomly selecting your testing data might result in samples in your testing dataset that are only a second apart from samples in your training dataset, and thus very similar. This means overfitting can spill into your testing data, and the model can achieve good results on both the training and the testing data without having generalized well.
- Randomly allocated time-series testing data may therefore be very similar to the training data. The two datasets must differ enough to prove the model’s ability to generalize. With time-series data, a better approach is to take multiple slices of your data, entire blocks of time, and reserve those for testing.
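
The block-based approach for time-series data can be sketched roughly as follows. The function `block_split`, the number of blocks, and the choice of which blocks to withhold are all hypothetical; the sketch assumes the samples are already sorted by time.

```python
import numpy as np

def block_split(n_samples, n_blocks=10, test_blocks=(3, 7)):
    """Cut time-ordered indices into contiguous blocks and reserve whole blocks for testing."""
    blocks = np.array_split(np.arange(n_samples), n_blocks)
    test_idx = np.concatenate([blocks[i] for i in test_blocks])
    train_idx = np.concatenate(
        [block for i, block in enumerate(blocks) if i not in test_blocks]
    )
    return train_idx, test_idx

# 1000 time-ordered observations, two whole blocks of 100 withheld
train_idx, test_idx = block_split(n_samples=1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Samples at the edges of a withheld block still sit next to training samples in time, so in practice one might also discard a small margin around each block; that refinement is left out here.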


In our case, we can use our data-generating function to create new data that will serve as out-of-sample/testing data. Given what was just said about overfitting, it may look wrong to only generate more data, as the testing data could look similar to the training data. Intuition and experience are both important to spot potential issues with out-of-sample data. By looking at the image representation of the data, we can see that another set of data generated by the same function will be adequate. This is just about as safe as it gets for out-of-sample data as the classes are partially mixing at the edges (also, we’re quite literally using the “underlying function” to make more data). 


With these data, we evaluate the model’s performance by doing a forward pass and calculating 
loss and accuracy the same as before:

In [1]:
import numpy as np 
import nnfs 
from nnfs.datasets import spiral_data 
 
nnfs.init() 
 
 
# Dense layer 
class Layer_Dense: 
 
    # Layer initialization 
    def __init__(self, n_inputs, n_neurons): 
        # Initialize weights and biases 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) 
        self.biases = np.zeros((1, n_neurons)) 
 
    # Forward pass 
    def forward(self, inputs): 
        # Remember input values 
        self.inputs = inputs 
        # Calculate output values from inputs, weights and biases 
        self.output = np.dot(inputs, self.weights) + self.biases 
 
    # Backward pass 
    def backward(self, dvalues): 
        # Gradients on parameters 
        self.dweights = np.dot(self.inputs.T, dvalues) 
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True) 
        # Gradient on values 
        self.dinputs = np.dot(dvalues, self.weights.T) 
 
 
# ReLU activation 
class Activation_ReLU: 
 
    # Forward pass 
    def forward(self, inputs): 
        # Remember input values 
        self.inputs = inputs 
        # Calculate output values from inputs 
        self.output = np.maximum(0, inputs) 
 
    # Backward pass 
    def backward(self, dvalues): 
        # Since we need to modify original variable, 
        # let's make a copy of values first 
        self.dinputs = dvalues.copy() 
 
        # Zero gradient where input values were negative 
        self.dinputs[self.inputs <= 0] = 0 
 
 
# Softmax activation 
class Activation_Softmax: 
 
    # Forward pass 
    def forward(self, inputs): 
        # Remember input values 
        self.inputs = inputs 
 
        # Get unnormalized probabilities 
        exp_values = np.exp(inputs - np.max(inputs, axis=1, 
                                            keepdims=True)) 
        # Normalize them for each sample 
        probabilities = exp_values/np.sum(exp_values, axis=1, 
                                            keepdims=True) 
 
        self.output = probabilities 
 
    # Backward pass 
    def backward(self, dvalues): 
 
        # Create uninitialized array 
        self.dinputs = np.empty_like(dvalues) 
 
        # Enumerate outputs and gradients 
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)): 
            # Flatten output array 
            single_output = single_output.reshape(-1, 1) 
            # Calculate Jacobian matrix of the output 
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T) 
            # Calculate sample-wise gradient 
            # and add it to the array of sample gradients 
            self.dinputs[index] = np.dot(jacobian_matrix, 
                                         single_dvalues) 
 
 
# Common loss class 
class Loss: 
 
    # Calculates the data and regularization losses 
    # given model output and ground truth values 
    def calculate(self, output, y): 
 
        # Calculate sample losses 
        sample_losses = self.forward(output, y) 
 
        # Calculate mean loss 
        data_loss = np.mean(sample_losses) 
 
        # Return loss 
        return data_loss 
 
 
# Cross-entropy loss 
class Loss_CategoricalCrossentropy(Loss): 
 
    # Forward pass 
    def forward(self, y_pred, y_true): 
 
        # Number of samples in a batch 
        samples = len(y_pred) 
 
        # Clip data to prevent division by 0 
        # Clip both sides to not drag mean towards any value 
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7) 
 
        # Probabilities for target values - 
        # only if categorical labels 
        if len(y_true.shape) == 1: 
            correct_confidences = y_pred_clipped[ 
                range(samples), 
                y_true 
            ] 
        # Mask values - only for one-hot encoded labels 
        elif len(y_true.shape) == 2: 
            correct_confidences = np.sum( 
                y_pred_clipped * y_true, 
                axis=1 
            ) 
 
        # Losses 
        negative_log_likelihoods = -np.log(correct_confidences) 
        return negative_log_likelihoods 
 
    # Backward pass 
    def backward(self, dvalues, y_true): 
 
        # Number of samples 
        samples = len(dvalues) 
        # Number of labels in every sample 
        # We'll use the first sample to count them 
        labels = len(dvalues[0]) 
 
        # If labels are sparse, turn them into one-hot vector 
        if len(y_true.shape) == 1: 
            y_true = np.eye(labels)[y_true] 
 
        # Calculate gradient 
        self.dinputs = -y_true / dvalues 
        # Normalize gradient 
        self.dinputs = self.dinputs/samples 
 
 
# Softmax classifier - combined Softmax activation 
# and cross-entropy loss for faster backward step 
class Activation_Softmax_Loss_CategoricalCrossentropy(): 
 
    # Creates activation and loss function objects 
    def __init__(self): 
        self.activation = Activation_Softmax() 
        self.loss = Loss_CategoricalCrossentropy() 
 
    # Forward pass 
    def forward(self, inputs, y_true): 
        # Output layer's activation function 
        self.activation.forward(inputs) 
        # Set the output 
        self.output = self.activation.output 
        # Calculate and return loss value 
        return self.loss.calculate(self.output, y_true) 
    
    # Backward pass 
    def backward(self, dvalues, y_true): 
 
        # Number of samples 
        samples = len(dvalues) 
 
        # If labels are one-hot encoded, 
        # turn them into discrete values 
        if len(y_true.shape) == 2: 
            y_true = np.argmax(y_true, axis=1) 
 
        # Copy so we can safely modify 
        self.dinputs = dvalues.copy() 
        # Calculate gradient 
        self.dinputs[range(samples), y_true] -= 1 
        # Normalize gradient 
        self.dinputs = self.dinputs/samples 

#SGD + momentum Optimiser 
class Optimiser_SGD:

    # Initialize optimizer - set settings, 
    # learning rate of 1. is default for this optimizer
    def __init__(self,learning_rate=1.0,decay = 0.,momentum = 0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    """This method will update the learning rate if decay is anything other than zero"""
    # call once before any updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate*(1./(1. + self.decay*self.iterations))
    
    """Major changes are in this method wrt vanilla SGD"""
    #update parameters
    def update_params(self,layer):

        # if we use momentum
        if self.momentum:

            # If layer does not contain momentum arrays, create them filled with zeros 
            if not hasattr(layer, 'weight_momentums'): 
                layer.weight_momentums = np.zeros_like(layer.weights) 
                # If there is no momentum array for weights 
                # The array doesn't exist for biases yet either. 
                layer.bias_momentums = np.zeros_like(layer.biases)

            # Build weight updates with momentum - take previous updates multiplied by retain factor and update with 
            # current gradients
            weight_updates = self.momentum*layer.weight_momentums - self.current_learning_rate*layer.dweights
            layer.weight_momentums = weight_updates

            # build bias updates
            bias_updates = self.momentum*layer.bias_momentums - self.current_learning_rate*layer.dbiases
            layer.bias_momentums = bias_updates

        # Vanilla SGD updates (as before momentum update) 
        else:
            weight_updates = -self.current_learning_rate*layer.dweights
            bias_updates = -self.current_learning_rate*layer.dbiases

        # Update weights and biases using either 
        # vanilla or momentum updates 
        layer.weights += weight_updates
        layer.biases += bias_updates

    # call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

#AdaGrad Optimiser
class Optimiser_Adagrad:

    # Initialize optimizer - set settings, 
    # learning rate of 1. is default for this optimizer
    def __init__(self,learning_rate=1.0,decay = 0.,epsilon = 1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    
    # call once before any updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate*(1./(1. + self.decay*self.iterations))
    
    #update parameters
    def update_params(self,layer):

        # If layer does not contain cache arrays, create them filled with zeros 
        if not hasattr(layer, 'weight_cache'): 
            layer.weight_cache = np.zeros_like(layer.weights)  
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients 
        layer.weight_cache += layer.dweights**2 
        layer.bias_cache += layer.dbiases**2 

        # Vanilla SGD parameter update + normalization 
        # with square rooted cache 
        layer.weights += -self.current_learning_rate*layer.dweights/(np.sqrt(layer.weight_cache) +  self.epsilon)
        layer.biases += -self.current_learning_rate*layer.dbiases/(np.sqrt(layer.bias_cache) +  self.epsilon)

    # call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

#RMSProp Optimiser
class Optimiser_RMSProp:

    # Initialize optimizer - set settings, 
    # learning rate of 0.001 is default for this optimizer
    def __init__(self,learning_rate=0.001,decay = 0.,epsilon = 1e-7,rho = 0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho

    
    # call once before any updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate*(1./(1. + self.decay*self.iterations))
    
    #update parameters
    def update_params(self,layer):

        # If layer does not contain cache arrays, create them filled with zeros 
        if not hasattr(layer, 'weight_cache'): 
            layer.weight_cache = np.zeros_like(layer.weights)  
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients 
        layer.weight_cache = self.rho*layer.weight_cache + (1 - self.rho)*layer.dweights**2 
        layer.bias_cache = self.rho*layer.bias_cache + (1 - self.rho)*layer.dbiases**2 

        # Vanilla SGD parameter update + normalization 
        # with square rooted cache 
        layer.weights += -self.current_learning_rate*layer.dweights/(np.sqrt(layer.weight_cache) +  self.epsilon)
        layer.biases += -self.current_learning_rate*layer.dbiases/(np.sqrt(layer.bias_cache) +  self.epsilon)

    # call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

# Adam (Adaptive Moment Estimation) optimiser
class Optimiser_Adam: 
 
    # Initialize optimizer - set settings 
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, 
                 beta_1=0.9, beta_2=0.999): 
        self.learning_rate = learning_rate 
        self.current_learning_rate = learning_rate 
        self.decay = decay 
        self.iterations = 0 
        self.epsilon = epsilon 
        self.beta_1 = beta_1 
        self.beta_2 = beta_2 
 
    # Call once before any parameter updates 
    def pre_update_params(self): 
        if self.decay: 
            self.current_learning_rate = self.learning_rate*(1. / (1. + self.decay * self.iterations)) 
 
    # Update parameters 
    def update_params(self, layer): 
 
        # If layer does not contain cache arrays, 
        # create them filled with zeros 
        if not hasattr(layer, 'weight_cache'): 
            layer.weight_momentums = np.zeros_like(layer.weights) 
            layer.weight_cache = np.zeros_like(layer.weights) 
            layer.bias_momentums = np.zeros_like(layer.biases) 
            layer.bias_cache = np.zeros_like(layer.biases) 
 
        # Update momentum  with current gradients 
        layer.weight_momentums = self.beta_1*layer.weight_momentums + (1 - self.beta_1)*layer.dweights 
        layer.bias_momentums = self.beta_1*layer.bias_momentums + (1 - self.beta_1)*layer.dbiases 

        # Get corrected momentum 
        # self.iteration is 0 at first pass 
        # and we need to start with 1 here 
        weight_momentums_corrected = layer.weight_momentums/(1 - self.beta_1 ** (self.iterations + 1)) 
        bias_momentums_corrected = layer.bias_momentums/(1 - self.beta_1 ** (self.iterations + 1)) 

        # Update cache with squared current gradients 
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2 
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2 

        # Get corrected cache 
        weight_cache_corrected = layer.weight_cache/(1 - self.beta_2 ** (self.iterations + 1)) 
        bias_cache_corrected = layer.bias_cache/(1 - self.beta_2 ** (self.iterations + 1)) 
 
        # Vanilla SGD parameter update + normalization 
        # with square rooted cache 
        layer.weights += -self.current_learning_rate*weight_momentums_corrected/(np.sqrt(weight_cache_corrected) + self.epsilon) 
        layer.biases += -self.current_learning_rate*bias_momentums_corrected/(np.sqrt(bias_cache_corrected) + self.epsilon) 
 
    # Call once after any parameter updates 
    def post_update_params(self): 
        self.iterations += 1
 
# Create dataset 
X, y = spiral_data(samples=100, classes=3) 
 
# Create Dense layer with 2 input features and 64 output values 
dense1 = Layer_Dense(2, 64) 
 
# Create ReLU activation (to be used with Dense layer): 
activation1 = Activation_ReLU() 
 
# Create second Dense layer with 64 input features (as we take output 
# of previous layer here) and 3 output values (output values) 
dense2 = Layer_Dense(64, 3) 
 
# Create Softmax classifier's combined loss and activation 
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy() 
 
# Create optimizer 
#optimiser = Optimiser_SGD(decay=8e-8, momentum=0.9)

optimiser = Optimiser_Adam(learning_rate=0.05,decay=5e-7) 
# Train in loop 
for epoch in range(10001): 
 
    # Perform a forward pass of our training data through this layer 
    dense1.forward(X) 
 
    # Perform a forward pass through activation function 
    # takes the output of first dense layer here 
    activation1.forward(dense1.output) 
 
    # Perform a forward pass through second Dense layer 
    # takes outputs of activation function of first layer as inputs 
    dense2.forward(activation1.output) 
 
    # Perform a forward pass through the activation/loss function 
    # takes the output of second dense layer here and returns loss 
    loss = loss_activation.forward(dense2.output, y) 
 
    # Calculate accuracy from output of activation2 and targets 
    # calculate values along first axis 
    predictions = np.argmax(loss_activation.output, axis=1) 
    if len(y.shape) == 2: 
        y = np.argmax(y, axis=1) 
    accuracy = np.mean(predictions==y) 
 
    if not epoch % 100: 
        print(f'epoch: {epoch}, ' + 
              f'acc: {accuracy:.3f}, ' + 
              f'loss: {loss:.3f}, ' + 
              f'lr: {optimiser.current_learning_rate}') 
 
    # Backward pass 
    loss_activation.backward(loss_activation.output, y) 
    dense2.backward(loss_activation.dinputs) 
    activation1.backward(dense2.dinputs) 
    dense1.backward(activation1.dinputs) 
 
    # Update weights and biases 
    optimiser.pre_update_params() 
    optimiser.update_params(dense1) 
    optimiser.update_params(dense2) 
    optimiser.post_update_params()

# Validate the model 

# Create test dataset 
X_test, y_test = spiral_data(samples=100, classes=3) 
# Perform a forward pass of our testing data through this layer 
dense1.forward(X_test) 
# Perform a forward pass through activation function 
# takes the output of first dense layer here 
activation1.forward(dense1.output) 
# Perform a forward pass through second Dense layer 
# takes outputs of activation function of first layer as inputs 
dense2.forward(activation1.output) 
# Perform a forward pass through the activation/loss function 
# takes the output of second dense layer here and returns loss 
loss = loss_activation.forward(dense2.output, y_test) 
# Calculate accuracy from output of activation2 and targets 
# calculate values along first axis 
predictions = np.argmax(loss_activation.output, axis=1) 
if len(y_test.shape) == 2: 
    y_test = np.argmax(y_test, axis=1) 
accuracy = np.mean(predictions==y_test) 
print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0, acc: 0.360, loss: 1.099, lr: 0.05
epoch: 100, acc: 0.670, loss: 0.705, lr: 0.04999752512250644
epoch: 200, acc: 0.797, loss: 0.522, lr: 0.04999502549496326
epoch: 300, acc: 0.847, loss: 0.430, lr: 0.049992526117345455
epoch: 400, acc: 0.887, loss: 0.344, lr: 0.04999002698961558
epoch: 500, acc: 0.910, loss: 0.303, lr: 0.049987528111736124
epoch: 600, acc: 0.907, loss: 0.276, lr: 0.049985029483669646
epoch: 700, acc: 0.917, loss: 0.252, lr: 0.049982531105378675
epoch: 800, acc: 0.920, loss: 0.245, lr: 0.04998003297682575
epoch: 900, acc: 0.930, loss: 0.228, lr: 0.049977535097973466
epoch: 1000, acc: 0.940, loss: 0.217, lr: 0.049975037468784345
epoch: 1100, acc: 0.937, loss: 0.205, lr: 0.049972540089220974
epoch: 1200, acc: 0.947, loss: 0.192, lr: 0.04997004295924593
epoch: 1300, acc: 0.947, loss: 0.184, lr: 0.04996754607882181
epoch: 1400, acc: 0.943, loss: 0.183, lr: 0.049965049447911185
epoch: 1500, acc: 0.943, loss: 0.189, lr: 0.04996255306647668
epoch: 1600, acc: 0.943, lo

While 79.7% accuracy and a loss of 0.921 are not terrible, they contrast with our training data, which achieved 97% accuracy and a loss of 0.074. This is evidence of overfitting. <br><br>
![](img2.png) <br><br>
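
As a rough illustration, the comparison between training and testing metrics can be written as a small helper. The function name `generalization_gap` and the `max_acc_gap` threshold of 0.10 are assumptions made for this sketch (the threshold loosely mirrors the anecdotal 10% rule of thumb mentioned in this chapter); the metric values are the ones quoted in the text above.

```python
def generalization_gap(train_acc, train_loss, test_acc, test_loss, max_acc_gap=0.10):
    """Compare training and testing metrics and flag a large accuracy gap."""
    acc_gap = train_acc - test_acc      # positive when training accuracy is higher
    loss_gap = test_loss - train_loss   # positive when testing loss is higher
    return {
        "acc_gap": acc_gap,
        "loss_gap": loss_gap,
        # Hypothetical threshold: treat a gap above max_acc_gap as a warning sign
        "suspect_overfitting": acc_gap > max_acc_gap,
    }

report = generalization_gap(train_acc=0.97, train_loss=0.074,
                            test_acc=0.797, test_loss=0.921)
print(report)
```

With these numbers the accuracy gap is about 0.17 and the loss gap about 0.85, so the helper flags the model as likely overfitting.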

- We can recognize overfitting when testing data results begin to diverge in trend from training data results. 
- Performance on your training data will usually be better, but in our anecdotal experience, a gap of more than roughly 10% between training and testing performance is a common sign of serious overfitting. Ideally, both datasets would have identical performance. 
- Even a small difference means that the model did not correctly predict some testing samples, implying slight overfitting of the training data. In most cases, modest overfitting is not a serious problem, but it is something we hope to minimize.

- This is a classic example of overfitting: the validation loss falls, then starts rising once the model starts overfitting. 
- The model is currently tuned to achieve the best possible score on the training data, and most likely the learning rate is too high, there are too many training epochs, or the model is too big. There are other possible causes and ways to fix this.
- In general, the goal is to have the testing loss identical to the training loss, even if that means higher loss and lower accuracy on the training data. Similar performance on both datasets means that the model generalized instead of overfitting the training data.
- One general rule to follow when selecting initial model hyperparameters is to find the smallest model possible that still learns.
- Other possible ways to avoid overfitting are regularization techniques and the Dropout layer. 
- The process of trying different model settings is called **hyperparameter searching**. Initially, you can very quickly (usually within minutes) try different settings (e.g., layer sizes) to see if the models are learning something. If they are, train the models fully, or at least significantly longer, and compare results to pick the best set of hyperparameters. 
- Another possibility is to create a list of different hyperparameter sets and train the model in a loop, using one set at a time, to pick the best set at the end. 
- The reasoning behind preferring the smallest model that still learns is that the fewer neurons you have, the less chance there is that the model is memorizing the data. Fewer neurons can make it easier for a neural network to generalize (actually learn the meaning of the data) rather than memorize the data. 
- With enough neurons, it’s easier for a neural network to memorize the data. Remember that the neural network wants to decrease training loss and follows the path of least resistance to meet that objective. Our job as the programmer is to make the path to generalization the easiest path. This can often mean our job is actually to make the path to lowering loss more challenging for the model!
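
The idea of looping over a list of hyperparameter sets can be sketched as below. To keep the example short and self-contained, a polynomial fit stands in for the neural network, and the polynomial degree plays the role of model size; this substitution, the synthetic data, and all names (`make_data`, `hyperparameter_sets`, `results`) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (a stand-in for a real dataset)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)
    return x, y

x_train, y_train = make_data(30)
x_heldout, y_heldout = make_data(30)   # data withheld from training

# Each dict is one hyperparameter set; here only the model size varies
hyperparameter_sets = [{"degree": d} for d in (1, 2, 3, 5, 9)]

results = []
for params in hyperparameter_sets:
    # "Train" the stand-in model on the training data only
    coeffs = np.polyfit(x_train, y_train, deg=params["degree"])
    # Mean squared error on training and held-out data
    train_loss = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    heldout_loss = np.mean((np.polyval(coeffs, x_heldout) - y_heldout) ** 2)
    results.append((params, train_loss, heldout_loss))
    print(f"{params}: train loss {train_loss:.4f}, held-out loss {heldout_loss:.4f}")

# Pick the set with the lowest loss on data the model was not trained on
best_params, _, best_heldout_loss = min(results, key=lambda r: r[2])
print("best:", best_params)
```

Selecting by held-out loss rather than training loss is the point of the exercise: a larger model will always fit the training data at least as well, which says nothing about how well it generalizes.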