### Activation functions

An activation function is applied to the output of a neuron (or of a whole layer of neurons) and modifies that output.

Most activation functions used in hidden layers make the output non-linear:

> this allows neural networks, usually with two or more hidden layers, to map non-linear functions.

> with a non-linear activation, NNs can approximate functions that are themselves non-linear; with only linear activations, the whole network stays linear.
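To make the point about non-linearity concrete, here is a minimal sketch (the shapes, seed, and variable names are assumptions chosen for illustration) suggesting that two dense layers with no activation between them collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for illustration: 4 samples, 2 features
X = rng.normal(size=(4, 2))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

# Two stacked layers with NO activation in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as ONE linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined

collapses = np.allclose(two_layers, one_layer)
print(collapses)

# With ReLU between the layers, the network is in general no longer
# equivalent to that single linear layer
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```

The first comparison holds for any weights, since it is just matrix algebra; the second depends on whether ReLU actually zeroes out any value for the chosen inputs.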

In general, your neural network will have two types of activation functions: 

* activation functions used for the hidden neurons

* activation functions used for the output layer

In [4]:
'''ReLU Activation Function Code'''
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

'''if the number in the input is greater than 0, append it to the output list;
otherwise, append 0 to the output list.'''
outputs = []
for i in inputs:
    if i > 0:
        outputs.append(i)
    else:
        outputs.append(0)   

print(outputs)         

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


Alternatively, we can take the larger of the two values: 0 or the neuron value

In [5]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

outputs = []

'''for each number in the input, compare it to 0 and take the larger of the two
values (if i = -1, take 0; if i = 100, take 100).'''
for i in inputs:
    outputs.append(max(0, i))

print(outputs)    

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


NumPy contains an equivalent: np.maximum()

In [6]:
import numpy as np

inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = np.maximum(0, inputs)
print(outputs)

[0.  2.  0.  3.3 0.  1.1 2.2 0. ]
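Because np.maximum works element-wise, the same call should also handle a whole batch of samples at once. A small sketch, using a made-up 2D batch (the values are hypothetical):

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 neuron outputs each
batch = np.array([[1.5, -0.3, 0.0],
                  [-2.0, 4.1, -0.7]])

# The scalar 0 is broadcast against every element of the batch
relu_batch = np.maximum(0, batch)
print(relu_batch)
```

The result keeps the shape of the input, with every negative entry replaced by 0.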


Let's create a new Rectified Linear (ReLU) activation class

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import nnfs
from nnfs.datasets import spiral_data

#Creation of the Dense Layer Class
class Dense_Layer:
    def __init__(self, num_features, num_neurons):
        self.weights = np.random.randn(num_features, num_neurons)
        self.biases = np.zeros((1, num_neurons))

    def forward(self, samples):
        self.outputs = np.dot(samples, self.weights) + self.biases        

#ReLU Activation Class
class Activation_ReLU:
    #forward pass
    def forward(self, inputs):
        #Calculate the output values from the inputs
        self.outputs = np.maximum(0, inputs)

Let's now apply this activation function to the dense layer's outputs in our code 
from the last chapter.

In [27]:
#Creation of the dataset using spiral data from nnfs
X, y = spiral_data(samples=100, classes=3)

#Lets analyze this data:
print('First 5 samples in this dataset: \n', X[:5], '\n')
print('The dataset''s shape: ', X.shape)
print('Reminder that 300 indicates the number of samples, 2 being the features.')
print('This means that if you want to connect this to 3 neurons, weights shape is (2,3)')


#Creation of an instance of the Dense Layer Class, with 2 input features and 3 output values (neurons)
dense1 = Dense_Layer(2, 3)

#Creation of an instance of the ReLU Activation Class
activation1 = Activation_ReLU()

#Make a forward pass of our training data through this layer
dense1.forward(X)

#Forward pass through the activation function. Takes in output from the previous layer
activation1.forward(dense1.outputs)

#Show the outputs of the first few samples
print(activation1.outputs[:5])

First 5 samples in this dataset: 
 [[ 0.          0.        ]
 [ 0.0064161   0.00780154]
 [-0.00603296  0.01928017]
 [ 0.00582981  0.02973696]
 [ 0.00544852  0.04003499]] 

The datasets shape:  (300, 2)
Reminder that 300 indicates the number of samples, 2 being the features.
This means that if you want to connect this to 3 neurons, weights shape is (2,3)
[[0.         0.         0.        ]
 [0.         0.01061704 0.01824998]
 [0.00161633 0.         0.01487057]
 [0.         0.01684239 0.04383836]
 [0.         0.01963051 0.05570489]]


Let's see the softmax activation function in action

In [28]:
layer_outputs = [4.8, 1.21, 2.385]

The first step is to exponentiate the outputs, using Euler's number, e (2.7182818...).

> this constant is the base of the natural logarithm and describes exponential growth.

> exponentiating means raising this constant e to the power of a given parameter:

$$ y = e^{x} $$

the softmax activation function:

$$ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} $$

where z is the vector of layer outputs for a given sample, i is the index of the current output in z, j runs over all of the outputs in z, and K is the number of outputs (classes).
The numerator exponentiates the current output value and the denominator takes a sum of all the exponentiated outputs for a given sample.

In [29]:
#Values from the previous output when we described what a neural network is.
layer_outputs = [4.8, 1.21, 2.385]

#e, the exponential constant, denoted as E. This is an approximation; you can also use math.e.
E = 2.71828182846

#for each value in a vector, calculate the exponential value (E ^ output)
exp_values = []
for output in layer_outputs:
    exp_values.append(E ** output)
print('Exponentiated values: ', exp_values)    

Exponentiated values:  [121.51041751893969, 3.3534846525504487, 10.85906266492961]


we can simplify this code:

In [31]:
import math
layer_outputs = [4.8, 1.21, 2.385]

exp_values = [(math.e) ** output for output in layer_outputs]
print('Exponentiated values: ', exp_values)

Exponentiated values:  [121.51041751873483, 3.353484652549023, 10.859062664920513]


The exponential of any real number is always positive, which is what we need here because a negative probability doesn't make sense.

> it tends to 0 as the input approaches negative infinity, returns 1 for an input of 0, and grows for positive inputs

The exponential function is monotonic: the higher the input value, the higher the output, so the ordering of the values is preserved. It also adds stability to the result, because a normalized exponentiation depends on the differences between the numbers rather than on their magnitudes.

Once exponentiated, we convert these values to a probability distribution (a vector of confidences), one value for each class, which adds up to 1 across the vector.

In [36]:
#lets normalize the values
norm_base = sum(exp_values)
norm_values = []
for value in exp_values:
    norm_values.append(value / norm_base)
print('The normalized exponentiated values:\n', norm_values)  
print('Sum of the normalized values: ', sum(norm_values))  

The normalized exponentiated values:
 [0.8952826639572619, 0.024708306782099374, 0.0800090292606387]
Sum of the normalized values:  0.9999999999999999


Let's perform this same set of operations more compactly with list comprehensions, before moving on to NumPy

In [39]:
import numpy as np

layer_outputs = [4.8, 1.21, 2.385]

#for each value in a vector, calculate the exponential values:
exp_values = [math.e ** output for output in layer_outputs]
print('Exponentiated values \n', exp_values)

#lets now normalize the data to create a vector of confidences (probability)
norm_base = sum(exp_values)
norm_values = [value / norm_base for value in exp_values]
print('Normalized values: \n', norm_values)
print('Sum of these normalized values: ', sum(norm_values))

Exponentiated values 
 [121.51041751873483, 3.353484652549023, 10.859062664920513]
Normalized values: 
 [0.8952826639572619, 0.024708306782099374, 0.0800090292606387]
Sum of these normalized values:  0.9999999999999999


We can simplify even further by using np.exp() and then immediately normalizing the result with the sum. Doing it this way also lets us train in batches more efficiently

In [45]:
#get exponentiated probabilities
exp_values = np.exp(layer_outputs)

#normalize them for each of the samples
probabilities = exp_values / np.sum(exp_values, keepdims=True)
print('Probabilities: \n', probabilities, 'Shape: ', probabilities.shape )

Probabilities: 
 [0.89528266 0.02470831 0.08000903] Shape:  (3,)


In NumPy, an axis refers to a specific dimension of an array.

For example, a 2D array has 2 axes:

* row axis, axis=0, and column axis, axis=1

A 3D array has 3 axes:

* depth axis, axis=0, row axis, axis=1, and column axis, axis=2

When performing operations on arrays, the axis parameter is often used to specify along which axis the operation should be performed.

For example: with axis=0, np.sum() sums down the rows and returns one total per column; with axis=1, it sums across the columns and returns one total per row.
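As a quick illustration of the 3D case, the sketch below sums a made-up array of ones (its shape is an assumption chosen for illustration) along each axis and looks at the resulting shapes; the axis that was summed over is the one that disappears:

```python
import numpy as np

# Hypothetical 3D array: 2 "depth" slices, each with 3 rows and 4 columns
arr = np.ones((2, 3, 4))

shape_axis0 = np.sum(arr, axis=0).shape  # depth axis collapsed  -> (3, 4)
shape_axis1 = np.sum(arr, axis=1).shape  # row axis collapsed    -> (2, 4)
shape_axis2 = np.sum(arr, axis=2).shape  # column axis collapsed -> (2, 3)

print(shape_axis0, shape_axis1, shape_axis2)
```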

In [63]:
'''Lets demonstrate this axis parameter'''
import numpy as np

layer_outputs = np.array([[4.8, 1.21, 2.385],
                [8.9, -1.81, 0.2],
                [1.41, 1.051, 0.026]])

print('The shape of the array: ', layer_outputs.shape)

print('Sum without axis: ', np.sum(layer_outputs))

print('This will be identical to the above since default is None, ', np.sum(layer_outputs, axis=None))

print('Another way to think of it w/ a matrix == axis 0: columns: ')
print(np.sum(layer_outputs, axis=0))

The shape of the array:  (3, 3)
Sum without axis:  18.172
This will be identical to the above since default is None,  18.172
Another way to think of it w/ a matrix == axis 0: columns: 
[15.11   0.451  2.611]


This isn't what we want, though: we want the sum of each row (one sum per sample):

In [64]:
'''From scratch version'''
for i in layer_outputs:
    print(sum(i))

8.395
7.29
2.4869999999999997


we can do the same with NumPy:

In [67]:
print('So we can sum axis 1, but not the current shape: ', np.sum(layer_outputs, axis=1))
print("this shape is: ", np.sum(layer_outputs, axis=1).shape)

So we can sum axis 1, but not the current shape:  [8.395 7.29  2.487]
this shape is:  (3,)


We want to sum along axis 1 but keep the same number of dimensions as layer_outputs (whose shape is (3, 3)), so the sum should have shape (1, n) or (n, 1) rather than (n,).

In this case, we represent it as an (n, 1) two-dimensional array:

In [70]:
print('Sum axis = 1, but keep the same dimension as the input: \n', np.sum(layer_outputs, axis=1, keepdims=True))
print('the keepdims =  True keeps the dimensions the same as the input.')

Sum axis = 1, but keep the same dimension as the input: 
 [[8.395]
 [7.29 ]
 [2.487]]
the keepdims =  True keeps the dimensions the same as the input.


Now, if we divide the array containing a batch of outputs by this array of sums, NumPy will perform the division sample-wise.

> This means that it will divide all of the values from each of the output rows by the corresponding row from the sum array.

> since the sum in each row is a single value, it will be used for the division with every value from the corresponding output row.
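The sample-wise division described above can be sketched as follows, reusing the batch of layer outputs from the axis demonstration. The variable names `row_sums` and `column_wise` are introduced here for illustration only:

```python
import numpy as np

exp_values = np.exp(np.array([[4.8, 1.21, 2.385],
                              [8.9, -1.81, 0.2],
                              [1.41, 1.051, 0.026]]))

# keepdims=True gives shape (3, 1): one sum per row, stored as a column
row_sums = np.sum(exp_values, axis=1, keepdims=True)

# (3, 3) / (3, 1): each row is divided by its own sum
probabilities = exp_values / row_sums
print(row_sums.shape, probabilities.shape)
print(np.sum(probabilities, axis=1))  # each row sums to approximately 1

# Without keepdims the sums have shape (3,), which broadcasts across the
# columns instead: entry [i, j] is divided by the sum of row j
column_wise = exp_values / np.sum(exp_values, axis=1)
print(np.sum(column_wise, axis=1))  # rows no longer sum to 1
```

The second division runs without an error, which is why keeping the dimensions matters: the mistake would otherwise go unnoticed.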

Let's now combine everything we have gone over so far into a softmax class:

In [73]:
#softmax activation
class Activation_SoftMax:
    '''define the forward pass:
            get the exponentiated values for each of the inputs
            normalize each of the exponentiated values

        we subtract the maximum value of each input row from each element of
        that row, then exponentiate the result. This is so that the 
        exponentiated values do not become too large and cause numerical overflow
    '''
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities



We subtracted the largest of the inputs before we did the exponentiation.

Neural networks face two main challenges: dead neurons and very large numbers (exploding values). Subtracting the maximum addresses the second one.

> the exponential function used in the softmax activation function is one of the sources of exploding values.

Below is an example of this

In [75]:
import numpy as np

print(np.exp(1))
print(np.exp(10))
print(np.exp(100000))

print('This shows that the exp function tends toward 0 as the input value\napproaches negative inf, 1 when input is 0')
print(np.exp(-np.inf), np.exp(0))

2.718281828459045
22026.465794806718
inf
This shows that the exp function tends toward 0 as the input value
approaches negative inf, 1 when input is 0
0.0 1.0




Since we subtract the maximum value from the list of input values, the values being exponentiated range from some negative number up to 0: the largest number minus itself is 0, and any smaller number minus the largest is negative. The exponentiated values therefore fall between 0 and 1.

Thanks to normalization, we can subtract any value from all of the inputs, and it will not change the outputs
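One way to see this, writing c for the subtracted constant (a symbol introduced here for illustration), is that the factor it contributes appears in both the numerator and the denominator and cancels:

$$ \frac{e^{z_i - c}}{\sum_{j=1}^{K}e^{z_j - c}} = \frac{e^{-c}\,e^{z_i}}{e^{-c}\sum_{j=1}^{K}e^{z_j}} = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} = \sigma(z_i) $$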

In [76]:
softmax = Activation_SoftMax()

softmax.forward([[1,2,3]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


In [78]:
'''We can subtract the largest value in the input from each of the input
values, and it will not change the probability output, because it is
normalized!'''

softmax.forward([[-2,-1,0]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]
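To see why the subtraction matters in practice, here is a rough sketch comparing a naive softmax with the max-subtracted version on large inputs. The helper names `softmax_naive` and `softmax_stable` and the input values are assumptions made for this example:

```python
import numpy as np

def softmax_naive(inputs):
    # Hypothetical helper: exponentiates directly, with no max subtraction
    exp_values = np.exp(inputs)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(inputs):
    # Same steps as the forward pass above: subtract the row max first
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

large_inputs = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings that the naive version triggers
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(large_inputs)

stable = softmax_stable(large_inputs)

print(naive)   # exp overflows to inf, and inf / inf gives nan
print(stable)  # same probabilities as for [[1, 2, 3]]
```

The inputs differ from [[1, 2, 3]] only by a constant offset, so the stable version returns the same probabilities as before, while the naive one cannot represent the intermediate values at all.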


Now, let's add another dense layer as the output layer, setting it to contain as many inputs as the previous layer has outputs and as many outputs as our data includes classes. We can then apply the softmax function to the output of this new layer

In [81]:
#create the dataset
X, y = spiral_data(samples=100, classes=3)

#create a dense layer with two input features, and 3 output values (neurons)
dense1 = Dense_Layer(2, 3)

#create an ReLU activation function (to be used with the dense layer)
activation1 = Activation_ReLU()

'''create a second dense layer with 3 input features (the 3 neurons), 
   and 3 output values (3 succeeding neurons)'''
dense2 = Dense_Layer(3, 3)

#create a softmax activation function (to be used with the dense layer)
activation2 = Activation_SoftMax()

#make a forward pass of our training data through the first dense layer
dense1.forward(X)

#pass the output of the first dense layer to the ReLU activation function
activation1.forward(dense1.outputs)

#pass the result from the ReLU activation function into the second dense layer
dense2.forward(activation1.outputs)

#pass the result from the second dense layer into the softmax function
activation2.forward(dense2.outputs)

#print the output of the softmax activation for the first 5 samples
print(activation2.output[:5])

[[0.33333333 0.33333333 0.33333333]
 [0.33244723 0.33378834 0.33376444]
 [0.33474389 0.33177042 0.3334857 ]
 [0.34113263 0.32474825 0.33411912]
 [0.3302289  0.33492762 0.33484348]]


As you can see, the distribution of predictions is almost equal: each sample has roughly a 33% prediction for each class. This output holds the confidence scores for each of the samples.

To determine which class the model has chosen as its prediction, we perform an argmax on these outputs, which checks which of the classes in the output distribution has the highest confidence and returns its index.

Example:

> the argmax of [0.22, 0.6, 0.18] is the same as the argmax of [0.32, 0.36, 0.32]: both return index 1.
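The argmax step can be sketched with np.argmax, using the two example distributions as a small batch. The names `softmax_outputs` and `predictions` are chosen here for illustration:

```python
import numpy as np

# The two example distributions, stacked as a batch of 2 samples
softmax_outputs = np.array([[0.22, 0.60, 0.18],
                            [0.32, 0.36, 0.32]])

# axis=1 picks the index of the highest confidence within each row (sample)
predictions = np.argmax(softmax_outputs, axis=1)
print(predictions)  # [1 1]
```

Both samples get the same predicted class even though the model is far more confident about the first one, so the argmax alone says nothing about how confident a prediction is.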

### Full Code of this Chapter

In [88]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

#Create the Dense Layer Class
class Dense_Layer:
    #initialize the weights and biases for the layer
    def __init__(self, n_features, n_neurons):
        self.weights = 0.01 * np.random.randn(n_features, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    #perform the forward pass: the matrix product of inputs and weights, plus the biases.
    def forward(self, samples):
        self.outputs = np.dot(samples, self.weights) + self.biases

#Create the ReLU Activation Function
class ReLU:
    '''create the forward pass, x if x > 0, otherwise 0
       np.maximum returns input if input > 0, if less than 0, return 0'''
    def forward(self, inputs):
        self.outputs = np.maximum(0, inputs) 

#Create the SoftMax Activation Function
class SoftMax:
    '''create the forward pass, compute the exponentiated values,
       normalize them for each sample'''      
    def forward(self, inputs):
        #get the unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))

        #normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.outputs = probabilities

X, y = spiral_data(samples=100, classes=3)

dense1 = Dense_Layer(2, 3)
activation1 = ReLU()
dense2 = Dense_Layer(3, 3)
activation2 = SoftMax()

dense1.forward(X)
activation1.forward(dense1.outputs)
dense2.forward(activation1.outputs)
activation2.forward(dense2.outputs)

print(activation2.outputs[:5])




[[0.33333334 0.33333334 0.33333334]
 [0.33333316 0.3333332  0.33333364]
 [0.33333287 0.3333329  0.33333418]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
