# Chapter 4: Activation Functions

- The activation function is applied to the output of a neuron (or a layer of neurons) and modifies that output. We use activation functions because, when the activation function is nonlinear, a neural network with (usually) two or more hidden layers can map nonlinear functions.
- In general, a neural network uses two types of activation functions: one in the hidden layers and another in the output layer. Usually the same activation function is used for all hidden neurons, but it doesn't have to be.

A few examples include:
- **Step Activation Function:** This activation function was used historically in hidden layers, but it is rarely chosen nowadays.
- **Linear Activation Function:** A linear function is simply the equation of a line. With y = x, it appears as a straight line when graphed, and the output value equals the input. This activation function is usually applied to the last layer's output in a regression model, that is, a model that outputs a scalar value instead of a classification.
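
The two functions above can be sketched in a few lines of NumPy. This is only an illustration: the names `step` and `linear` are made up for this note, and the step threshold (output 1 for inputs greater than 0, otherwise 0) is an assumed convention.

```python
import numpy as np

def step(x):
    # Assumed convention: 1 where the input is greater than 0, otherwise 0
    return np.where(np.asarray(x) > 0, 1, 0)

def linear(x):
    # Identity function: the output equals the input (y = x)
    return np.asarray(x)

sample_inputs = [-2.0, 0.0, 0.5, 3.0]
step_out = step(sample_inputs)
linear_out = linear(sample_inputs)

print(step_out)
print(linear_out)
```

Note how the step output discards the magnitude of the input, while the linear output passes it through unchanged.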

**Sigmoid Activation Function:** <br>
The problem with the step function is that it is not very informative: its output is either 1 or 0, so you can't tell how close it was to activating or deactivating. When it comes time to optimize weights and biases, it's easier for the optimizer if the activation function is more granular and informative. <br>
Originally, the sigmoid activation function was used for this. <br>
![](img1.png)
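
As a minimal sketch, assuming the standard definition sigmoid(x) = 1 / (1 + e^(-x)), the function can be written as follows. The helper name `sigmoid` and the sample inputs are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    # Standard logistic sigmoid: squashes any real input into the range (0, 1)
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

sample_inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
values = sigmoid(sample_inputs)
print(values)
```

Unlike the step function, nearby inputs give nearby but distinct outputs, so the result shows how close a neuron was to "activating".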

**Advantages?** <br>
In this case, we get a value that can be mapped back to its original input: the returned value retains the information from the input, contrary to a function like the step function, where an input of 3 outputs the same value as an input of 300,000. The output of the sigmoid function, being in the range of 0 to 1, also works better with neural networks, especially compared to a range from negative to positive infinity, and it adds nonlinearity. <br>
The sigmoid function, historically used in hidden layers, was eventually replaced by the **Rectified Linear Unit** activation function (ReLU).

![](img2.png) <br>
This simple yet powerful activation function is the most widely used activation function at the 
time of writing for various reasons — mainly speed and efficiency. While the sigmoid activation 
function isn’t the most complicated, it’s still much more challenging to compute than the ReLU 
activation function. The ReLU activation function is extremely close to being a linear activation 
function while remaining nonlinear, due to that bend after 0. This simple property is, however, 
very effective.

**NOTE** <br>
In most cases, for a neural network to fit a nonlinear function, it needs to contain two or more hidden layers, and those hidden layers need to use a nonlinear activation function. <br>
**Also check out the example and functioning of ReLU from the book.**
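
One way to see why the hidden layers need a nonlinear activation: without one, stacked dense layers collapse into a single linear layer. The sketch below checks this numerically. The shapes, the random seed and all variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small batch of 5 samples with 2 features each
X = rng.normal(size=(5, 2))

# Two dense layers with a linear (identity) activation in between
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))
two_layer = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as ONE dense layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

collapses = np.allclose(two_layer, one_layer)
print(collapses)
```

Because the two results match, adding more purely linear layers cannot add any expressive power. A nonlinear function such as ReLU between the layers is what prevents this collapse.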

In [1]:
input = [0,2,-1,3.3,-2.7,1.1,2.2,-100]
output = []
for i in input:
    if i>0:
        output.append(i)
    else:
        output.append(0)
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


In [2]:
output = []
for i in input:
    output.append(max(0,i))
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


In [3]:
import numpy as np

In [4]:
output = np.maximum(0,input)
print(output)

[0.  2.  0.  3.3 0.  1.1 2.2 0. ]


In [5]:
# ReLU activation
class Activation_ReLU:

    # forward pass
    def forward(self,inputs):
        # calculate output values from inputs
        self.outputs = np.maximum(0,inputs)

In [6]:
import nnfs
nnfs.init()
import matplotlib.pyplot as plt
import nnfs.datasets


In [7]:
class Layer_Dense:
    def __init__(self,n_inputs,n_neurons):
        # Initialise weights and biases
        self.weights = 0.01*np.random.randn(n_inputs,n_neurons) # shape (n_inputs, n_neurons), so no transpose is needed in forward()
        self.biases = np.zeros((1,n_neurons))

    # Forward pass
    def forward(self,inputs):
        # Calculate output values from inputs, weights and biases 
        self.outputs = np.dot(inputs,self.weights) + self.biases

In [8]:
# create dataset
X,y = nnfs.datasets.spiral_data(samples=100,classes=3)

# create dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2,3)

# Create ReLU activation (to be used with Dense layer): 
activation1 = Activation_ReLU() 

# perform forward pass of our training data through this layer
dense1.forward(X)

# forward pass through activation func.
activation1.forward(dense1.outputs)

# output of first few samples
print(activation1.outputs[:5])


[[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


**Softmax Activation function**

If we want to use this model for classification, there are problems with using ReLU on the output layer. Its outputs are:
- unbounded
- not normalised
- exclusive

“Not normalized” means the values can be anything, so an output of [12, 99, 318] is without context, and “exclusive” means each output is independent of the others. <br>
To address this lack of context, the softmax activation on the output data can take in non-normalized, or uncalibrated, inputs and produce a normalized distribution of probabilities for our classes. The distribution returned by the softmax activation function represents confidence scores for each class and adds up to 1. The predicted class is associated with the output neuron that returned the largest confidence score.<br> <br>
Here's the function for softmax:


$$S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}$$

Both the numerator and the denominator of the softmax function contain e raised to the power of z, where z, given indices, means a singular output value: the index i means the current sample and the index j means the current output in this sample. The numerator exponentiates the current output value, and the denominator takes the sum of all of the exponentiated outputs for a given sample. <br>
Exponentiation serves multiple purposes:
1. To calculate the probabilities, we need non-negative values, and the exponential of any real number is positive. <br>
2. The exponential function is monotonic: higher input values give higher outputs, so applying it does not change the predicted class while making sure that we get non-negative values. <br>
3. It also adds stability to the result, as the normalized exponentiation depends on the differences between numbers rather than on their magnitudes.

Dividing each exponentiated value by the sum of all exponentiated values for the sample is the normalisation part of the softmax function.

In [9]:
import numpy as np 
# Values from the earlier example, when we described
# what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# for each value in the vector, calculate the exponential
exp_values = np.exp(layer_outputs)
print('exponential values:')
print(exp_values)

# normalising the values
norm_values = exp_values/np.sum(exp_values)
print('normalised exponential values:')
print(norm_values)
print('sum of normalised values:',np.sum(norm_values))

exponential values:
[121.51041752   3.35348465  10.85906266]
normalised exponential values:
[0.89528266 0.02470831 0.08000903]
sum of normalised values: 0.9999999999999999


To apply this to a batch, we need to sum the (exponentiated) outputs of a layer for each sample separately, <br>
converting each row of the layer's output array, whose length equals the number of neurons in the layer, to just one value. We need these sums as a column vector, since that lets us normalize the whole batch of samples, sample-wise, with a single calculation.<br>
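
The effect of summing per sample and keeping the result as a column vector can be sketched with `np.sum` and its `axis` and `keepdims` arguments. The batch values below are made up for illustration.

```python
import numpy as np

# A hypothetical batch: 3 samples (rows), 3 neuron outputs each (columns)
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

exp_values = np.exp(layer_outputs)

# axis=1 sums across each row, giving one value per sample
flat_sums = np.sum(exp_values, axis=1)                    # shape (3,)

# keepdims=True keeps the result as a column vector
column_sums = np.sum(exp_values, axis=1, keepdims=True)   # shape (3, 1)

# Broadcasting divides every row by its own sum
probabilities = exp_values / column_sums

print(flat_sums.shape, column_sums.shape)
print(np.sum(probabilities, axis=1))
```

With the flat shape `(3,)`, the division would broadcast along the wrong axis for this 3x3 batch and silently give wrong results, which is why the column shape `(3, 1)` matters.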
Adding all of this to a softmax class:

In [1]:
class Activation_Softmax:

    #forward pass
    def forward(self,inputs):

        # get unnormalised probabilities
        exp_values = np.exp(inputs - np.max(inputs,axis=1,keepdims=True))
        
        #normalise them for each sample
        probabilities = exp_values/np.sum(exp_values,axis = 1,keepdims=True)

        self.output = probabilities


Notice that we also subtracted the maximum input value before exponentiating. There are two main pervasive challenges with neural networks: “dead neurons” and very large numbers (referred to as “exploding” values). Dead neurons and enormous numbers can wreak havoc down the line and render a network useless over time. The exponential function used in softmax activation is one of the sources of exploding values:

In [6]:
print(np.exp(1))

2.718281828459045


In [7]:
print(np.exp(10))

22026.465794806718


In [8]:
print(np.exp(100))

2.6881171418161356e+43


In [9]:
print(np.exp(1000))

inf




Suppose we subtract the maximum value from a list of input values. The resulting values then always lie in a range from some negative value up to 0, since the largest number minus itself is 0 and any smaller number minus it is negative. The exponential of such values is greater than 0 and at most 1, so it cannot overflow. With softmax, thanks to the normalization, we can subtract any value from all of the inputs, and it will not change the output:

In [11]:
softmax = Activation_Softmax()
softmax.forward([[1,2,3]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


In [12]:
# subtracted 3 from each
softmax.forward([[-2,-1,0]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]
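
To see why the subtraction matters in practice, the sketch below compares a softmax without the subtraction against one with it, on inputs large enough to overflow `np.exp`. The function names `softmax_naive` and `softmax_stable` are made up for this comparison and are not part of the classes above.

```python
import numpy as np

def softmax_naive(inputs):
    # Exponentiate directly: overflows to inf for large inputs
    exp_values = np.exp(inputs)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(inputs):
    # Subtract the per-sample maximum first, so the largest exponent is 0
    shifted = inputs - np.max(inputs, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

large_inputs = np.array([[1000.0, 999.0, 998.0]])

# Silence the overflow / invalid-value warnings for the naive version
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large_inputs)

stable = softmax_stable(large_inputs)

print(naive)   # inf / inf yields nan for every entry
print(stable)
```

The stable version returns the same probabilities as for the inputs `[[0, -1, -2]]`, because only the differences between the inputs matter.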


Now, we can add another dense layer as the output layer, setting it to contain as many inputs as 
the previous layer has outputs and as many outputs as our data includes classes. Then we can 
apply the softmax activation to the output of this new layer: 

**Full code so far:**

In [13]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

In [14]:
nnfs.init()

In [15]:
class Layer_Dense:
    def __init__(self,n_inputs,n_neurons):
        # Initialise weights and biases
        self.weights = 0.01*np.random.randn(n_inputs,n_neurons) # shape (n_inputs, n_neurons), so no transpose is needed in forward()
        self.biases = np.zeros((1,n_neurons))

    # Forward pass
    def forward(self,inputs):
        # Calculate output values from inputs, weights and biases 
        self.outputs = np.dot(inputs,self.weights) + self.biases

In [16]:
# ReLU activation
class Activation_ReLU:

    # forward pass
    def forward(self,inputs):
        # calculate output values from inputs
        self.outputs = np.maximum(0,inputs)


In [17]:
#Softmax activation
class Activation_Softmax:

    #forward pass
    def forward(self,inputs):

        # get unnormalised probabilities
        exp_values = np.exp(inputs - np.max(inputs,axis=1,keepdims=True))
        
        #normalise them for each sample
        probabilities = exp_values/np.sum(exp_values,axis = 1,keepdims=True)

        self.output = probabilities

In [18]:
#Create dataset 
X,Y = spiral_data(samples=100,classes=3)

# create dense layer with 2 input features and 3 neurons
dense1 = Layer_Dense(2,3)

#create ReLU activation(to be used with dense layer)
activation1 = Activation_ReLU()

# create second dense layer with 3 input features (as we take the output of the previous layer)
# and 3 output values
dense2 = Layer_Dense(3,3)

#create Softmax Activation (to be used with dense layer)
activation2 = Activation_Softmax()

# make a forward pass of our training data through this layer
dense1.forward(X)

# make a forward pass through activation function
activation1.forward(dense1.outputs)

#make a forward pass through second dense layer
dense2.forward(activation1.outputs)

# make a forward pass through activation function
activation2.forward(dense2.outputs)

# output of first few samples
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.3333332  0.3333332  0.33333364]
 [0.3333329  0.33333293 0.3333342 ]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]


As you can see, the distribution of predictions is almost equal: each sample has a ~33%
(0.33) prediction for each class. This results from the random initialization of the weights (small
values drawn from a normal distribution; not every random initialization will produce this) and
the zeroed biases. These outputs are also our “confidence scores.” To determine which class the
model has chosen as its prediction, we perform an argmax on these outputs, which checks
which of the classes in the output distribution has the highest confidence and returns its index,
the predicted class index. That said, the confidence score can be as important as the class
prediction itself. For example, the argmax of [0.22, 0.6, 0.18] is the same as the argmax of
[0.32, 0.36, 0.32]. In both cases, the argmax function returns an index value of 1
(the 2nd element in Python's zero-indexed paradigm), but obviously, a 60% confidence is much
better than a 36% confidence.
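
The comparison above can be sketched with `np.argmax` and `np.max`, using the two example distributions from the text. The variable names are chosen only for illustration.

```python
import numpy as np

# The two example output distributions, stacked as a batch of 2 samples
softmax_outputs = np.array([[0.22, 0.6, 0.18],
                            [0.32, 0.36, 0.32]])

# Index of the highest confidence per sample: the predicted class
predictions = np.argmax(softmax_outputs, axis=1)

# The confidence attached to each prediction
confidences = np.max(softmax_outputs, axis=1)

print(predictions)
print(confidences)
```

Both samples get the same predicted class index, but the confidences differ, which is information the argmax alone throws away.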