
#### Activation Function

- An activation function is applied to the output of a neuron (or a layer of neurons) and modifies that output.
- We use activation functions because a nonlinear activation function lets a neural network, usually one with two or more hidden layers, map nonlinear functions.
- In general, your neural network will have two types of activation functions. 
  1. The first will be the activation function used in hidden layers
  2. The second will be used in the output layer. 
  
  Usually, the activation function used for hidden neurons is the same for all of them, but it doesn’t have to be.

#### Types of Activation Function
1. Step Activation Function
2. Sigmoid Activation Function
3. Linear Activation Function
4. Rectified Linear Activation Function (ReLU)
5. Softmax Activation Function

#### 1. The Step Activation Function
- Recall that the purpose of an activation function is to mimic a neuron “firing” or “not firing” based on its input.
- The simplest version of this is a step function.
- As a formula,
        y = 1 if output > 0 else 0, where output = sum(weights * inputs) + bias
- In a single neuron,
  - if output > 0, the neuron outputs 1
  - otherwise, it outputs 0.
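
As a minimal sketch of the formula above, the following applies a step activation to a single neuron. The function name `step` and the input, weight, and bias values are made up for illustration.

```python
import numpy as np

def step(z):
    """Return 1 where the pre-activation z is greater than 0, else 0."""
    return np.where(np.asarray(z) > 0, 1, 0)

# Hypothetical single neuron: these numbers are illustrative only
inputs = np.array([1.0, -2.0, 3.0])
weights = np.array([0.2, 0.8, -0.5])
bias = 2.0

# output = sum(weights * inputs) + bias, approximately -0.9 here
z = np.dot(weights, inputs) + bias
print("Pre-activation:", z)
print("Step output   :", step(z))   # the neuron does not fire

# The same function works element-wise on a whole layer of outputs
print(step([-1.0, 0.0, 2.0]))
```

Note that an output of exactly 0 does not fire under this definition, since the condition is a strict `> 0`.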

#### 2. The Sigmoid Activation Function
- The problem with the step function is that it’s not very informative.
  - It’s either on (1) or off (0).
  - It’s hard to tell how “close” the step function was to activating or deactivating. Maybe it was very close, or maybe it was very far.
  - This makes it less clear to the optimizer how changes to weights and biases affect the output, because very little information can be gathered from this function.
- In terms of the final output value from the network, it doesn’t matter whether the neuron was close to outputting something else.
- When it comes time to optimize weights and biases, however, it’s easier for the optimizer if the activation functions are more granular and informative.
- The original, more granular activation function used for neural networks was the Sigmoid activation function, which looks like:
        y = 1 / (1 + e^-x) # e raised to the power -x
  - This function
    - Has an output range from 0 to 1
    - Approaches 0 as the input goes to negative infinity
    - Returns 0.5 for an input of 0
    - Approaches 1 as the input goes to positive infinity
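
The behaviour listed above can be checked with a short NumPy sketch. The helper name `sigmoid` and the sample inputs are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: y = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
ys = sigmoid(xs)
print(ys)

# Unlike the step function, two inputs that would both "fire" still get
# different outputs, so we can tell how strongly each one activated.
print(sigmoid(0.1), sigmoid(3.0))
```

The output is exactly 0.5 at an input of 0, close to 0 for large negative inputs, and close to 1 for large positive inputs.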

#### 3. The Linear Activation Function
- A linear function is simply the equation of a line. 
- It appears as a straight line when graphed:
        y = x # the output value equals the input

#### 4. The Rectified Linear Activation Function (ReLU)
- ReLU stands for Rectified Linear Unit.
- The rectified linear activation function is simpler than the sigmoid. It’s quite literally y = x, clipped at 0 from the negative side. If x is less than or equal to 0, then y is 0; otherwise, y is equal to x.
        y = x if x > 0 else 0
  - This function
    - Returns x if x > 0
    - Returns 0 if x <= 0

#### 5. The Softmax Activation Function
- If the model is to be a classifier, we want an output activation function meant for classification. One of these is the Softmax activation function.
- In the case of classification, what we want to see is a prediction of which class the network thinks the input represents.
- The distribution returned by the softmax activation function represents confidence scores for each class, and these scores add up to 1.
- The predicted class is associated with the output neuron that returned the largest confidence score.

        y = e^z / sum(e^z)
  - This function
    - returns a probability for each class; the probabilities add up to 1

#### Why Use Activation Functions?
- In most cases, for a neural network to fit a nonlinear function, it needs two or more hidden layers, and those hidden layers need to use a nonlinear activation function.
- A nonlinear function, such as a sine function, cannot be represented well by a straight line.
- Most real-world problems are nonlinear. To fit a model to nonlinear data, we use an activation function like ReLU, which lets the model fit nonlinear data.
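
One way to see why the nonlinearity matters: with the linear activation y = x, two stacked layers compute exactly the same mapping as a single layer, so extra depth adds nothing. The sketch below illustrates this with small hand-picked matrices (all values are assumptions chosen for the example).

```python
import numpy as np

# Hypothetical data and parameters, chosen by hand for the illustration
x = np.array([[1.0, -1.0],
              [2.0, 0.5]])
W1 = np.array([[1.0, -1.0],
               [0.5, 2.0]])
b1 = np.array([[0.0, 1.0]])
W2 = np.array([[1.0],
               [1.0]])
b2 = np.array([[0.0]])

# Two layers with the linear activation y = x between them
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as ONE layer with merged weights and biases
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))    # True

# With ReLU between the layers, the single merged layer no longer matches
relu_layers = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(relu_layers, one_layer))   # False
```

Because the ReLU version cannot be rewritten as a single matrix multiplication plus bias, the network can represent mappings that a straight line (or plane) cannot.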

In [1]:
# ReLU Activation Code
# The ReLU in this code is a loop that checks whether the current value is greater than 0.
# If it is, we append it to the output list; if it’s not, we append 0.
# This is written more simply by taking the larger of two values: 0 or the neuron value, i.e. max(0, i).

import numpy as np

input = [-1,0,1,2,3,4,-4,-5]
output = []

for i in input:
  print("\ninput :",i)
  print("Value greater than 0 then input else 0 :",max(0,i))
  output.append(max(0,i))
  print("Output after ReLU:",output)

print("\nFinal Output after ReLU :",output)


input : -1
Value greater than 0 then input else 0 : 0
Output after ReLU: [0]

input : 0
Value greater than 0 then input else 0 : 0
Output after ReLU: [0, 0]

input : 1
Value greater than 0 then input else 0 : 1
Output after ReLU: [0, 0, 1]

input : 2
Value greater than 0 then input else 0 : 2
Output after ReLU: [0, 0, 1, 2]

input : 3
Value greater than 0 then input else 0 : 3
Output after ReLU: [0, 0, 1, 2, 3]

input : 4
Value greater than 0 then input else 0 : 4
Output after ReLU: [0, 0, 1, 2, 3, 4]

input : -4
Value greater than 0 then input else 0 : 0
Output after ReLU: [0, 0, 1, 2, 3, 4, 0]

input : -5
Value greater than 0 then input else 0 : 0
Output after ReLU: [0, 0, 1, 2, 3, 4, 0, 0]

Final Output after ReLU : [0, 0, 1, 2, 3, 4, 0, 0]


In [2]:
# Alternate Way using NumPy

input = [-1,0,1,2,3,4,-4,-5]
output = np.maximum(0,input)
print(output)

[0 0 1 2 3 4 0 0]


In [3]:
# Let's create a general ReLU activation class

# ReLU Activation
class Activation_ReLU:

  # Forward Pass
  def forward(self, inputs):
    # Use the method argument `inputs`, not the global list `input`
    self.output = np.maximum(0, inputs)

In [4]:
# Lets create a sample Training Dataset (Random Set)
import nnfs   # Will help to create random spiral dataset
from nnfs.datasets import spiral_data

nnfs.init()   # Will set random seed = 0, to make it repeatable
X, Y = spiral_data(samples = 2,classes = 3)
print("Data is in X Variable ")
print(X)    # 3 classes created with 2 samples each
print("\nTarget Variable or Classes Defined are in Y Variable") 
print(Y)

Data is in X Variable 
[[ 0.          0.        ]
 [-0.69993085 -0.7142106 ]
 [-0.         -0.        ]
 [ 0.7647814  -0.6442898 ]
 [ 0.         -0.        ]
 [-0.94481426 -0.3276065 ]]

Target Variable or Classes Defined are in Y Variable
[0 0 1 1 2 2]


In [5]:
# Let's draw a fresh sample
# 2 samples for each of 3 target classes
X, Y = spiral_data(samples = 2,classes = 3)

np.random.seed(0)
class Layer_Dense:
  
  # Random Initialization of Weight and Bias
  def __init__(self, n_inputs, n_neurons):
    print("Number of Inputs, n_inputs :",n_inputs)
    print("Number of Neurons, n_neurons :",n_neurons) 
    print("In our case n_neurons is number of outputs")
    
    self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
    print("\nRandom weights Initialized:")
    print("2 output means 2 rows and each row contains 3 input weight as we have 3 neurons")
    print(self.weights)
    
    self.biases = np.zeros((1, n_neurons))
    print("\nTwo Output means Two Biases but we are adding zero biases :")
    print(self.biases)
  
  # Calculation of Output = (input * weight) + bias
  def forward(self,inputs):
    print("\nOriginal Data before Output Calculation")
    print(inputs)
    self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
  # Forward Pass
  def forward(self,inputs):
    self.output = np.maximum(0,inputs)

# Create a Dense Layer with 2 input features (X and Y) and 3 Output Class
dense1 = Layer_Dense(2,3)

# Perform a forward pass of our training data through this layer
# As we discussed earlier in the notebook, the data is in the X variable
dense1.forward(X)

print("\nOutput after Weight and Bias Calculation")
print(dense1.output)

activation_1 = Activation_ReLU()
activation_1.forward(dense1.output)

print("\nOutput after ReLU Activation Function")
print(activation_1.output)

# Observation
# As you can see, negative values have been clipped (modified to be zero).
# That’s all there is to the rectified linear activation function used in the hidden layer.

Number of Inputs, n_inputs : 2
Number of Neurons, n_neurons : 3
In our case n_neurons is number of outputs

Random weights Initialized:
2 output means 2 rows and each row contains 3 input weight as we have 3 neurons
[[ 0.17640524  0.04001572  0.0978738 ]
 [ 0.22408931  0.1867558  -0.09772779]]

Two Output means Two Biases but we are adding zero biases :
[[0. 0. 0.]]

Original Data before Output Calculation
[[ 0.          0.        ]
 [-0.47902483 -0.87780136]
 [-0.         -0.        ]
 [ 0.97696507  0.2133992 ]
 [ 0.          0.        ]
 [-0.63560337  0.7720158 ]]

Output after Weight and Bias Calculation
[[ 0.          0.          0.        ]
 [-0.2812084  -0.18310302  0.03890161]
 [ 0.          0.          0.        ]
 [ 0.22016223  0.07894751  0.07476425]
 [ 0.          0.          0.        ]
 [ 0.06087673  0.11874431 -0.13765632]]

Output after ReLU Activation Function
[[0.         0.         0.        ]
 [0.         0.         0.03890161]
 [0.         0.         0.        ]
 [0.22016223 0.07894751 0.07476425]
 [0.         0.         0.        ]
 [0.06087673 0.11874431 0.        ]]

In [6]:
# Softmax Activation Function
# y = e^z / sum(e^z)
# The first step is to “exponentiate” the outputs.
# We do this with Euler’s number, e, which is roughly 2.71828182846 and is the base of the exponential function.

# Both the numerator and the denominator of the Softmax function contain e raised to the power of z, where z is a single output value of the layer.
# The numerator exponentiates the current output value, and the denominator sums all of the exponentiated outputs for a given sample.

layers_output = [1,2,3,4]
E = 2.71828182846

# For each value in a vector, calculate exponential value
exp_value = []
for output in layers_output:
  exp_value.append(E ** output) # ** signifies power
  print('\nInput :', output)
  print('E ** Input :',E ** output)
  print('Exponential List :',exp_value)

print("\nFinal Exponential List :",exp_value)


Input : 1
E ** Input : 2.71828182846
Exponential List : [2.71828182846]

Input : 2
E ** Input : 7.38905609893584
Exponential List : [2.71828182846, 7.38905609893584]

Input : 3
E ** Input : 20.085536923208828
Exponential List : [2.71828182846, 7.38905609893584, 20.085536923208828]

Input : 4
E ** Input : 54.59815003322094
Exponential List : [2.71828182846, 7.38905609893584, 20.085536923208828, 54.59815003322094]

Final Exponential List : [2.71828182846, 7.38905609893584, 20.085536923208828, 54.59815003322094]


#### Exponentiation serves multiple purposes. 
- To calculate the probabilities, we need non-negative values. 
- Imagine the output as [1, 2, -2]
  - Even after normalization, the last value will still be negative, since we would just divide each value by their sum.
  - A negative probability (or confidence) does not make much sense.
  - The exponential of any real number is always positive. It
    - approaches 0 as the input goes to negative infinity
    - returns 1 for an input of 0
    - increases for positive values
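
To make the point concrete, here is a small sketch comparing plain normalization with exponentiate-then-normalize on the example vector [1, 2, -2]. The variable names are illustrative.

```python
import numpy as np

outputs = np.array([1.0, 2.0, -2.0])

# Plain normalization: divide each value by the sum (1 + 2 - 2 = 1)
naive = outputs / outputs.sum()
print("Divide by sum only     :", naive)   # the last entry stays negative

# Exponentiate first, then normalize
exp_values = np.exp(outputs)
normalized = exp_values / exp_values.sum()
print("Exponentiate, then norm:", normalized)
print("Sum                    :", normalized.sum())
```

After exponentiation every entry is positive, the entries sum to 1, and the largest raw output is still the largest entry.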


In [7]:
layers_output = [1,2,-2,69]
E = 2.71828182846

# For each value in a vector, calculate exponential value
exp_value = []
for output in layers_output:
  exp_value.append(E ** output) # ** signifies power
  print('\nInput :', output)
  print('E ** Input :',E ** output)
  print('Exponential List :',exp_value)

print("\nFinal Exponential List :",exp_value)


Input : 1
E ** Input : 2.71828182846
Exponential List : [2.71828182846]

Input : 2
E ** Input : 7.38905609893584
Exponential List : [2.71828182846, 7.38905609893584]

Input : -2
E ** Input : 0.13533528323651764
Exponential List : [2.71828182846, 7.38905609893584, 0.13533528323651764]

Input : 69
E ** Input : 9.25378172581203e+29
Exponential List : [2.71828182846, 7.38905609893584, 0.13533528323651764, 9.25378172581203e+29]

Final Exponential List : [2.71828182846, 7.38905609893584, 0.13533528323651764, 9.25378172581203e+29]


- The exponential function is monotonic.
- This means that higher input values give higher outputs, so applying it does not change the predicted class while ensuring that all values are non-negative.
- It also adds stability to the result, since the normalized exponentiation depends on the differences between the numbers rather than on their magnitudes.


- Once we’ve exponentiated, we want to convert these numbers to a probability distribution: a vector of confidences, one for each class, that add up to 1.
- What that means is that we’re about to **perform a normalization** where we take a given value and divide it by the sum of all of the values.

- For our outputs, exponentiated at this stage, that’s what the equation of the **Softmax function** describes next: take a given exponentiated value and divide it by the sum of all of the exponentiated values.
- Since each output value normalizes to a fraction of the sum, all of the values are now in the range of 0 to 1 and add up to 1. They share the probability of 1 between themselves.

In [8]:
# Let’s add the sum and normalization to the code:

layers_output = [1,2,3,4]
E = 2.71828182846

# For each value in a vector, calculate exponential value
exp_value = []
for output in layers_output:
  exp_value.append(E ** output)

norm_base = sum(exp_value)
print("Raw Input :",layers_output)
print("Exponential Values :",exp_value)
print('Sum of all Exponential Values :',norm_base)

# Normalize each exponential value by dividing it by the sum
norm_values = []
for value in exp_value:
  print('\nInput :',value)
  print('Normalized Input (value/sum_values) :',value/norm_base)
  norm_values.append(value/norm_base)
  print(norm_values)

print('\nFinal Normalized Exponential Values :',norm_values)
print('Sum of Final Normalized Values :',sum(norm_values))

Raw Input : [1, 2, 3, 4]
Exponential Values : [2.71828182846, 7.38905609893584, 20.085536923208828, 54.59815003322094]
Sum of all Exponential Values : 84.7910248838256

Input : 2.71828182846
Normalized Input (value/sum_values) : 0.03205860328005693
[0.03205860328005693]

Input : 7.38905609893584
Normalized Input (value/sum_values) : 0.08714431874198689
[0.03205860328005693, 0.08714431874198689]

Input : 20.085536923208828
Normalized Input (value/sum_values) : 0.23688281808986916
[0.03205860328005693, 0.08714431874198689, 0.23688281808986916]

Input : 54.59815003322094
Normalized Input (value/sum_values) : 0.6439142598880871
[0.03205860328005693, 0.08714431874198689, 0.23688281808986916, 0.6439142598880871]

Final Normalized Exponential Values : [0.03205860328005693, 0.08714431874198689, 0.23688281808986916, 0.6439142598880871]
Sum of Final Normalized Values : 1.0


In [9]:
# We can perform the same set of operations with the use of NumPy in the following way:

layers_output = [1,2,3,4]
print("Raw Input :",layers_output)

# First Scaling by calculating exponential values
exp_values = np.exp(layers_output)
print("\nExponential Values :",exp_values)

# Now Normalize the exponential values
norm_values = exp_values/ np.sum(exp_values)
print('\nFinal Normalized Exponential Values :',norm_values)
print('Sum of Final Normalized Values :',sum(norm_values))

Raw Input : [1, 2, 3, 4]

Exponential Values : [ 2.71828183  7.3890561  20.08553692 54.59815003]

Final Normalized Exponential Values : [0.0320586  0.08714432 0.23688282 0.64391426]
Sum of Final Normalized Values : 1.0


In [10]:
inputs = np.array([[1,2],[-1,-2]])
print("Input :\n",inputs)

# To run in batches 
exp_values = np.exp(inputs)
print("\nExponential Value of Input :\n",exp_values)

# Now Normalize the exponential values
norm_values = exp_values/ np.sum(exp_values,axis=1,keepdims = True) # Axis = 1 means row wise calculation 
print("\nNormalized Value of Input :\n",norm_values)

Input :
 [[ 1  2]
 [-1 -2]]

Exponential Value of Input :
 [[2.71828183 7.3890561 ]
 [0.36787944 0.13533528]]

Normalized Value of Input :
 [[0.26894142 0.73105858]
 [0.73105858 0.26894142]]


#### Two main pervasive challenges with neural networks: 
  1. Dead neurons
  2. Very large numbers (referred to as “exploding” values). 

- The exponential function used in softmax activation is one of the sources of exploding values. 

#### What is Dead Neuron ?
- When using ReLU, you are using a piecewise function that evaluates to 0 whenever the input is less than or equal to 0.
- Because of this piecewise nature, the gradient is 0 if the input is <= 0, since the slope there is 0.
- If every training example causes a certain neuron to have a negative value (which then becomes 0 after ReLU is applied), then the neuron will never be adjusted, since no matter which training example (or which batch) is selected, the gradient on the neuron will be 0.
- Thus, the neuron is completely useless: it outputs 0 regardless of which training example comes in, and no matter how much training is done, it will always output 0 (since its weights never get changed; the gradient is always 0).
- In practice, a network with ReLU activations often has a few dead neurons, but a few dead neurons won’t cause too much of a problem. 
- Too many dead neurons, however, and the neural network loses a lot of its explanatory power.
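
The following is a minimal sketch of a dead neuron, assuming a single ReLU neuron with a large negative bias and a made-up batch of inputs. The backward-pass lines use the standard ReLU gradient (1 where the pre-activation is positive, 0 elsewhere); backpropagation itself is not covered in this notebook, so treat that part as an illustration only.

```python
import numpy as np

# Hypothetical batch of 3 samples with 2 features each
X = np.array([[0.5, 1.0],
              [1.5, -0.3],
              [0.2, 0.8]])
w = np.array([[-1.0],
              [-1.0]])        # weights of one neuron
b = np.array([[-5.0]])        # large negative bias

z = X @ w + b                 # pre-activation: negative for every sample
a = np.maximum(0, z)          # ReLU output: all zeros
print("Pre-activation:\n", z)
print("ReLU output:\n", a)

# Backward pass through ReLU: the gradient flows only where z > 0
upstream = np.ones_like(a)    # pretend dLoss/da = 1 for every sample
dz = upstream * (z > 0)
dw = X.T @ dz                 # gradient with respect to the weights
db = dz.sum(axis=0, keepdims=True)
print("Weight gradient:\n", dw)
print("Bias gradient:\n", db)
```

Since both gradients are zero for the whole batch, a gradient-based update would leave this neuron's weights and bias unchanged.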

In [11]:
# Sample of exponential outputs
print("exponential of 1 :",np.exp(1))
print("exponential of 10  :",np.exp(10))
print("exponential of 100  :",np.exp(100))
print("exponential of 1000 cause error  :",np.exp(1000))

# Inference :
# It doesn’t take a very large number, in this case a mere 1,000, to cause an overflow: NumPy returns inf (with a RuntimeWarning).
# We also know the exponential function tends toward 0 as its input approaches negative infinity, and the output is 1 when the input is 0.

exponential of 1 : 2.718281828459045
exponential of 10  : 22026.465794806718
exponential of 100  : 2.6881171418161356e+43
exponential of 1000 cause error  : inf




In [12]:
# -inf gives the output 0
print(np.exp(-np.inf), np.exp(0))

0.0 1.0


- We can use this property to prevent the exponential function from overflowing.
- Suppose we subtract the maximum value from a list of input values.
- The shifted values then always lie in a range from some negative value up to 0:
  - The largest number minus itself is 0
  - Any smaller number minus the largest is negative
  - We then exponentiate the shifted values, number - max(numbers)
  - The largest number becomes 0 and exp(0) is 1, so every exponentiated value lies between 0 and 1
  - After exponentiating each input, we normalize to get the probabilities
- With Softmax, thanks to the normalization, we can subtract any value from all of the inputs, and it will not change the output:

In [13]:
# Softmax Activation Function

class Activation_Softmax:
  def forward(self, inputs):
    print("\nInput :",inputs)
    print("\nMaximum Input :",np.max(inputs, axis=1, keepdims=True))
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    print("\nExponent Values of Each Input,logic is exponent((input - max(input)) :\n",exp_values)
    probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
    print("\nProbabiltiy of Each Input :\n",probabilities)
    self.output = probabilities

softmax = Activation_Softmax()
softmax.forward([[1,2,3]])
print("\nSoftmax Output (Same as Probabilities) :\n",softmax.output)


Input : [[1, 2, 3]]

Maximum Input : [[3]]

Exponent Values of Each Input,logic is exponent((input - max(input)) :
 [[0.13533528 0.36787944 1.        ]]

Probabiltiy of Each Input :
 [[0.09003057 0.24472847 0.66524096]]

Softmax Output (Same as Probabilities) :
 [[0.09003057 0.24472847 0.66524096]]


In [14]:
# Another example
softmax = Activation_Softmax()
softmax.forward([[-1,-2,-3]])
print("\nSoftmax Output (Same as Probabilities) :\n",softmax.output)


Input : [[-1, -2, -3]]

Maximum Input : [[-1]]

Exponent Values of Each Input,logic is exponent((input - max(input)) :
 [[1.         0.36787944 0.13533528]]

Probabiltiy of Each Input :
 [[0.66524096 0.24472847 0.09003057]]

Softmax Output (Same as Probabilities) :
 [[0.66524096 0.24472847 0.09003057]]
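
As a quick check of the two claims above (subtracting a constant does not change the softmax output, and subtracting the maximum avoids overflow), here is a sketch that compares a naive softmax with the max-subtracted version. The function names `softmax_naive` and `softmax_stable` are illustrative.

```python
import numpy as np

def softmax_naive(x):
    """Softmax without the max-subtraction trick."""
    exp_values = np.exp(x)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(x):
    """Softmax with the row maximum subtracted before exponentiating."""
    exp_values = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
# Both versions agree on small inputs
print(np.allclose(softmax_naive(small), softmax_stable(small)))
# Shifting every input by the same constant leaves the output unchanged
print(np.allclose(softmax_stable(small), softmax_stable(small - 100.0)))

large = np.array([[1000.0, 1001.0, 1002.0]])
with np.errstate(over="ignore", invalid="ignore"):
    # exp overflows to inf, and inf / inf gives nan
    print(softmax_naive(large))
# Same probabilities as for [1, 2, 3], since only the differences matter
print(softmax_stable(large))
```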


In [15]:
# Softmax Function Overall

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

# Random Samples 3 classes with 2 samples each
X, y = spiral_data(samples=2, classes=3)

dense = Layer_Dense(2, 3) # 2 inputs (the x and y coordinates) and 3 neurons (one per class): initializes a 2x3 weight matrix and 3 zero biases
dense.forward(X) # Stores the weight and bias calculation in dense.output

# Initializing Softmax Activation Function
activation = Activation_Softmax()
activation.forward(dense.output)

print(activation.output)

[[0.33333334 0.33333334 0.33333334]
 [0.33380836 0.3108572  0.3553344 ]
 [0.33333334 0.33333334 0.33333334]
 [0.34230095 0.32963628 0.32806274]
 [0.33333334 0.33333334 0.33333334]
 [0.3207316  0.3653398  0.31392863]]


In [16]:
# Merging both Activation Functions - ReLU and Softmax

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

X, y = spiral_data(samples=100, classes=3)

dense1 = Layer_Dense(2,3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33334747 0.3333181  0.33333445]
 [0.3333639  0.3332918  0.33334434]
 [0.33335617 0.33330873 0.33333513]
 [0.33335337 0.33331177 0.33333492]]


- As you can see, the predictions are almost equally distributed: each sample has a confidence of about 33% (0.33) for each class.
- This results from the small, randomly initialized weights (drawn from a normal distribution and scaled by 0.1) and zeroed biases; not every random initialization will give such an even split.
- These outputs are also our “confidence scores.”
- To determine which class the model has chosen as its prediction, we perform an argmax on these outputs, which finds the class in the output distribution with the highest confidence and returns its index: the predicted class index.
- That said, the confidence score can be as important as the class prediction itself.
- For example, the argmax of [0.22, 0.6, 0.18] is the same as the argmax of [0.32, 0.36, 0.32]. In both cases, the argmax function returns an index of 1 (the 2nd element in Python’s zero-indexed paradigm), but a 60% confidence is clearly much better than a 36% confidence.
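
The argmax step can be sketched with NumPy as follows, using the two example confidence vectors from the text as a batch of two samples. The variable names are illustrative.

```python
import numpy as np

# Each row is the softmax output (confidence scores) for one sample
confidences = np.array([[0.22, 0.60, 0.18],
                        [0.32, 0.36, 0.32]])

# Index of the highest confidence in each row: the predicted class
predictions = np.argmax(confidences, axis=1)
print(predictions)        # [1 1]

# The confidence attached to each prediction
top_confidence = np.max(confidences, axis=1)
print(top_confidence)
```

Both samples are assigned class 1, but keeping `top_confidence` alongside the prediction shows that the first prediction is far more certain than the second.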

#### Few Observations

- We’ve completed what we need for forward-passing data through our model. We used the Rectified Linear (ReLU) activation function on the hidden layer, which works on a per-neuron basis.
- We additionally used the Softmax activation function for the output layer, since it accepts non-normalized values as input and outputs a probability distribution, which we use as confidence scores for each class.
- Recall that, although neurons are interconnected, they each have their own weights and biases and are not “normalized” with each other.
- As you can see, our example model is currently random. To remedy this, we need a way to calculate how wrong the neural network’s current predictions are and begin adjusting weights and biases to decrease error over time.

**Thus, our next step is to quantify how wrong the model is through what’s defined as a loss function.**