<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/neural-networks-from-scratch/04-activation-function/activation_functions_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ReLU Activation Function from Scratch

The rectified linear activation function is simpler than the sigmoid. It’s quite literally $y=x$, clipped at $0$ from the negative side. If $x$ is less than or equal to $0$, then $y$ is $0$; otherwise, $y$ is equal to $x$.

$$
y = \begin{cases} x, & x > 0 \\ 0, & x \leq 0 \end{cases}
$$

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/neural-networks-from-scratch/04-activation-function/images/1.png?raw=1' width='600'/>

This simple yet powerful activation function is the most widely used activation function at the time of writing for various reasons — mainly speed and efficiency.

The ReLU activation function is extremely close to being a linear activation
function while remaining nonlinear, due to the bend at 0. This simple property is nonetheless very effective.
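To make the "nonlinear" claim concrete, here is a minimal sketch that checks the additivity property a linear function would satisfy, $f(a + b) = f(a) + f(b)$. The helper name `relu` and the sample values are chosen only for illustration.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: the larger of 0 and x."""
    return np.maximum(0, x)

a, b = 2.0, -3.0

# A linear function f would satisfy f(a + b) == f(a) + f(b).
lhs = relu(a + b)          # relu(-1.0) -> 0.0
rhs = relu(a) + relu(b)    # 2.0 + 0.0 -> 2.0
print(lhs, rhs)            # the two sides differ, so ReLU is not linear

# Restricted to positive inputs only, ReLU behaves like y = x.
print(relu(1.5 + 2.5) == relu(1.5) + relu(2.5))
```

Additivity fails only when the inputs straddle 0, which is exactly where the bend sits.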




## Setup

In [1]:
!pip install nnfs

Collecting nnfs
  Downloading nnfs-0.5.1-py3-none-any.whl (9.1 kB)
Installing collected packages: nnfs
Successfully installed nnfs-0.5.1


In [2]:
from nnfs.datasets import spiral_data
import numpy as np
import nnfs
import matplotlib.pyplot as plt

nnfs.init()

## ReLU Activation 

Despite the fancy-sounding name, the rectified linear activation function is straightforward to code. The version closest to its definition:

In [3]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

output = []

for i in inputs:
  if i > 0:     # if the current value is greater than 0, appending the current value
    output.append(i)
  else:         # if it’s not, appending 0
    output.append(0)

print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


This can be written more simply, as we just need to take the larger of two values: 0 or the neuron’s value.

For example:

In [4]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

output = []

for i in inputs:
    output.append(max(0, i))

print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


NumPy contains an equivalent — `np.maximum()`:

In [5]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

output = np.maximum(0, inputs)

print(output)

[0.  2.  0.  3.3 0.  1.1 2.2 0. ]


This function compares each element of the input list (or array) with 0 and returns an array of the same shape containing the larger of the two values at each position.
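As a small sketch of the "same shape" behavior, the example below applies `np.maximum()` to a 2D batch. The array `batch` is made-up data for illustration. Note that `np.maximum()` works element-wise, unlike `np.max()`, which reduces an array to its largest value.

```python
import numpy as np

# A made-up batch of 2 samples with 3 values each
batch = np.array([[ 1.0, -2.0, 3.0],
                  [-0.5,  0.0, 4.2]])

# Element-wise comparison with 0; the scalar 0 is broadcast to every position
activated = np.maximum(0, batch)

print(activated.shape)  # same shape as the input: (2, 3)
print(activated)

# np.max, by contrast, collapses the array to a single value
print(np.max(batch))
```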

We will use it in our new rectified linear activation class:

In [6]:
# ReLU activation class
class ReLU:
  # Forward pass
  def forward(self, inputs):
    # Calculate output values from input
    self.output = np.maximum(0, inputs)

    return self.output

In [7]:
relu = ReLU()
print(relu.forward(inputs))

[0.  2.  0.  3.3 0.  1.1 2.2 0. ]


Let’s apply this activation function to the dense layer’s outputs.

In [8]:
class Dense:

  def __init__(self, n_inputs, n_neurons):
    """Layer initialization: Initialize weights and biases"""
    # Note that we’re initializing weights to be (inputs, neurons), rather than (neurons, inputs)
    self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
    # Biases start at zero; a nonzero bias is sometimes used to make sure a neuron fires initially
    self.biases = np.zeros((1, n_neurons))

  def forward(self, inputs):
    # Calculate output values from inputs, weights and biases
    self.output = np.dot(inputs, self.weights) + self.biases

# ReLU activation class
class ReLU:
  # Forward pass
  def forward(self, inputs):
    # Calculate output values from input
    self.output = np.maximum(0, inputs)

In [9]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Dense(2, 3)

# Create ReLU activation (to be used with Dense layer)
relu = ReLU()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Forward pass through activation func.
# Takes in output from previous layer
relu.forward(dense1.output)

# Let's see output of the first few samples
print(f"Before ReLU:\n {dense1.output[:5]}")
print(f"After ReLU:\n {relu.output[:5]}")

Before ReLU:
 [[ 0.0000000e+00  0.0000000e+00  0.0000000e+00]
 [-1.0475188e-04  1.1395361e-04 -4.7983500e-05]
 [-2.7414842e-04  3.1729150e-04 -8.6921798e-05]
 [-4.2188365e-04  5.2666257e-04 -5.5912682e-05]
 [-5.7707680e-04  7.1401405e-04 -8.9430439e-05]]
After ReLU:
 [[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


As you can see, negative values have been clipped (modified to be zero). That’s all there is to the rectified linear activation function used in the hidden layer. 

## Softmax Activation

In our case, we’re looking to get this model to be a classifier, so we want an activation function meant for classification.

The rectified linear unit does not fit this role: it is unbounded, not normalized with other units, and exclusive.
- **Not normalized** means the values can be anything; an output of `[12, 99, 318]` has no context.
- **Exclusive** means each output is independent of the others.

To address this lack of context, the softmax activation on the output data can take in non-normalized, or uncalibrated, inputs and produce a normalized distribution of probabilities for our classes.

In the case of classification, what we want to see is a prediction of which class the network “thinks” the input represents. 

This distribution returned by the softmax activation function represents confidence scores for each class and will add up to 1.

For example, if our network has a confidence distribution for two classes: 
`[0.45, 0.55]`, the prediction is the 2nd class, but the confidence in this
prediction isn’t very high. Maybe our program would not act in this case since it’s not very confident.
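The "would not act" idea can be sketched as a simple confidence check. The cutoff `THRESHOLD = 0.8` is a hypothetical value picked for illustration, not something prescribed by the softmax function itself.

```python
import numpy as np

# Confidence distribution for two classes, as in the example above
confidences = np.array([0.45, 0.55])

# Hypothetical cutoff below which we choose not to act
THRESHOLD = 0.8

# The predicted class is the index of the largest confidence
predicted_class = int(np.argmax(confidences))
confidence = float(confidences[predicted_class])

if confidence >= THRESHOLD:
    decision = f"act on class {predicted_class}"
else:
    decision = "abstain: confidence too low"

print(predicted_class, confidence, decision)
```

Index 1 corresponds to the 2nd class, and its confidence of 0.55 falls below the assumed cutoff, so this sketch abstains.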

Here’s the function for the Softmax:

$$
S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}
$$

The first step for us is to “exponentiate” the outputs. We do this with Euler’s number, $e$, which is roughly `2.71828182846` and referred to as the “exponential growth” number.

Both the numerator and the denominator of the Softmax function contain $e$ raised to the power of $z$, where $z$, given indices, means a singular output value: the index $i$ means the current sample and the index $j$ means the current output in this sample. The index $l$ in the denominator runs over all $L$ outputs of that sample.

The numerator exponentiates the current output value and the denominator takes a sum of all of the exponentiated outputs for a given sample.

In [10]:
# Values from the previous output when we described what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# e - mathematical constant, we use E here to match a common coding style where constants are uppercased
E = 2.71828182846  # you can also use math.e

# For each value in a vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
  exp_values.append(E ** output)  # ** - power operator in Python

print("exponentiated values:")
print(exp_values)

exponentiated values:
[121.51041751893969, 3.3534846525504487, 10.85906266492961]


To calculate the probabilities, we need non-negative values. Imagine the output as `[4.8, 1.21, -2.385]`: even after normalization, the last
value will still be negative since we’ll just divide all of them by their sum.

A negative probability (or confidence) does not make much sense. An exponential value of any number is always non-negative — it returns 0 for negative infinity, 1 for the input of 0, and increases for positive values:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/neural-networks-from-scratch/04-activation-function/images/2.png?raw=1' width='600'/>

The exponential function is a monotonic function. This means that, with higher input values, outputs are also higher, so we won’t change the predicted class after applying it while making sure that we get non-negative values. It also adds stability to the result as the normalized exponentiation is more about the difference between numbers than their magnitudes.
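Both claims can be checked numerically. The sketch below uses a throwaway helper named `softmax` (defined here only for illustration) to confirm that exponentiation keeps the ordering of the values, yields only positive numbers, and that the normalized result depends on the differences between inputs rather than on their magnitudes.

```python
import numpy as np

def softmax(z):
    """Plain softmax for a single vector (no overflow protection)."""
    exp_values = np.exp(z)
    return exp_values / np.sum(exp_values)

z = np.array([4.8, 1.21, -2.385])

# Monotonic: sorting the raw values and the exponentiated values gives the same order
same_order = np.array_equal(np.argsort(z), np.argsort(np.exp(z)))
print(same_order)

# Every exponentiated value is positive, even for the negative input
print(np.exp(z))

# Same pairwise differences, very different magnitudes
p_small = softmax(np.array([1.0, 2.0, 3.0]))
p_large = softmax(np.array([101.0, 102.0, 103.0]))
print(np.allclose(p_small, p_large))
```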

Once we’ve exponentiated, we want to convert these numbers to a probability distribution (converting the values into the vector of confidences, one for each class, which add up to 1 for everything in the vector). What that means is that we’re about to perform a normalization where we take a given
value and divide it by the sum of all of the values.

Since each output value normalizes to a fraction of the sum, all of the values are now in the range of 0 to 1 and add up to 1 — they share the probability of 1 between themselves.

Let’s add the sum and normalization to the Softmax.

In [11]:
# Values from the previous output when we described what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# e - mathematical constant, we use E here to match a common coding style where constants are uppercased
E = 2.71828182846  # you can also use math.e

# For each value in a vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
  exp_values.append(E ** output)  # ** - power operator in Python

print("exponentiated values:")
print(exp_values)

# Now normalize values
norm_base = sum(exp_values)  # we sum all values
norm_values = []
for value in exp_values:
  norm_values.append(value / norm_base)

print("Normalized exponentiated values:")
print(norm_values)
print(f"Sum of normalized values:{sum(norm_values)}")

exponentiated values:
[121.51041751893969, 3.3534846525504487, 10.85906266492961]
Normalized exponentiated values:
[0.8952826639573506, 0.024708306782070668, 0.08000902926057876]
Sum of normalized values:1.0


We can perform the same set of operations with the use of NumPy in the following way:

In [12]:
# Values from the previous output when we described what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print("exponentiated values:")
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)  # divide each value by the sum of all values
print("Normalized exponentiated values:")
print(norm_values)
print(f"Sum of normalized values:{np.sum(norm_values)}")

exponentiated values:
[121.51041752   3.35348465  10.85906266]
Normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
Sum of normalized values:1.0


Notice that the results match, but with NumPy they are faster to calculate and the code is easier to read.

We can exponentiate all of the values with a single call of the `np.exp()`, then immediately normalize them with the sum. To train in batches, we need to convert this functionality to accept layer outputs in batches.

```python
# Get unnormalized probabilities
exp_values = np.exp(inputs)

# Normalize them for each sample
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
```

We should also address what `axis` and `keepdims` mean in the above. Let’s first discuss `axis`. Axis is easier to show than tell, but, in a 2D array/matrix, axis 0 refers to the rows, and axis 1 refers to the columns.

Let’s see some examples of how axis affects the sum using NumPy. First, we
will just show the default, which is None.


In [14]:
# Values from the previous output when we described what a neural network is
layer_outputs = np.array([
  [4.8, 1.21, 2.385],
  [8.9, -1.81, 0.2],
  [1.41, 1.051, 0.026]
]) 

print("Sum without axis:")
print(np.sum(layer_outputs))

print("This will be identical to the above since default is None:")
print(np.sum(layer_outputs, axis=None))

Sum without axis:
18.172
This will be identical to the above since default is None:
18.172


With no axis specified, we are just summing all of the values, even if they’re in varying dimensions.

Next, `axis=0`. This means summing along axis 0, that is, down the rows.

In the case of our 2D array, where we have only a single other dimension, the columns, the output vector will contain the sum of each column.

This means we’ll perform 4.8+8.9+1.41 and so on.

In [17]:
print("Another way to think of it w/ a matrix == axis 0: columns:")
print(np.sum(layer_outputs, axis=0))

Another way to think of it w/ a matrix == axis 0: columns:
[15.11   0.451  2.611]


This isn’t what we want, though. We want sums of the rows.

In [18]:
print("But we want to sum the rows instead, like this w/ raw py:")
for i in layer_outputs:
  print(sum(i))

But we want to sum the rows instead, like this w/ raw py:
8.395
7.29
2.4869999999999997


As you probably guessed, we’re going to sum along axis 1:

In [19]:
print("So we can sum axis 1, but note the current shape:")
print(np.sum(layer_outputs, axis=1))

So we can sum axis 1, but note the current shape:
[8.395 7.29  2.487]


As pointed out by “note the current shape,” we did get the sums that we expected, but actually, we want to simplify the outputs to a single value per sample. We’re trying to sum all the outputs from a layer for each sample in a batch, converting each row of the layer’s output array, whose length equals the number of neurons in the layer, to just one value.

We need a column vector with these values since it will let us normalize the whole batch of samples, sample-wise, with a single calculation.

In [20]:
layer_outputs.shape

(3, 3)

In [23]:
print("Sum axis 1, but keep the same dimensions as input:")
print(layer_outputs)
print(np.sum(layer_outputs, axis=1, keepdims=True))

Sum axis 1, but keep the same dimensions as input:
[[ 4.8    1.21   2.385]
 [ 8.9   -1.81   0.2  ]
 [ 1.41   1.051  0.026]]
[[8.395]
 [7.29 ]
 [2.487]]


Now, if we divide the array containing a batch of the outputs by this array, NumPy will perform this sample-wise. That means that it’ll divide all of the values from each output row by the corresponding row from the sum array. Since
this sum in each row is a single value, it’ll be used for the division with every value from the corresponding output row.
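A short sketch of why `keepdims=True` matters here. With a square batch like ours, dividing by the flat sums of shape `(3,)` does not raise an error; NumPy broadcasts them across the columns instead, which silently gives wrong results. The variable names below are chosen for illustration.

```python
import numpy as np

exp_values = np.exp(np.array([[4.8, 1.21, 2.385],
                              [8.9, -1.81, 0.2],
                              [1.41, 1.051, 0.026]]))

row_sums_flat = np.sum(exp_values, axis=1)                # shape (3,)
row_sums_col = np.sum(exp_values, axis=1, keepdims=True)  # shape (3, 1)

# Shape (3,) is treated as (1, 3): element [i, j] gets divided by the sum of row j
wrong = exp_values / row_sums_flat

# Shape (3, 1): element [i, j] gets divided by the sum of row i, as intended
right = exp_values / row_sums_col

print(row_sums_flat.shape, row_sums_col.shape)
print(wrong.sum(axis=1))  # rows do not sum to 1
print(right.sum(axis=1))  # each row sums to 1 (up to floating-point error)
```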

We can combine all of this into a softmax class, like:

In [34]:
# Softmax activation
class Softmax:

  # Forward pass
  def forward(self, inputs):
    # Get unnormalized probabilities
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    # Normalize them for each sample
    probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

    self.output = probabilities

Notice that we also subtracted the largest input of each sample before exponentiating. There are two main pervasive challenges with neural networks: “dead neurons” and very large numbers (referred to as “exploding” values). “Dead” neurons and enormous numbers can wreak havoc down the line and render a network useless over time. The exponential function used in softmax activation is one of the sources of exploding values.

Let’s see some examples of how and why this can easily happen:

In [26]:
print(np.exp(1))

2.718281828459045


In [27]:
print(np.exp(10))

22026.465794806718


In [28]:
print(np.exp(100))

2.6881171418161356e+43


In [29]:
print(np.exp(1000))

inf



It doesn’t take a very large number, in this case a mere 1,000, to cause an overflow: the result is too large for a floating-point number to hold, so NumPy returns `inf`.

We know the exponential function tends toward 0 as its input value approaches negative infinity, and the output is 1 when the input is 0.

In [31]:
print(np.exp(-np.inf), np.exp(0))

0.0 1.0


We can use this property to prevent the exponential function from overflowing.
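Here is a minimal sketch of that idea on a deliberately large, made-up input. After subtracting the row maximum, the largest value becomes 0 and all others are negative, so every exponentiated value lands between 0 and 1. The naive version is wrapped in `np.errstate` only to silence the expected overflow warnings.

```python
import numpy as np

# Made-up inputs large enough to overflow np.exp
inputs = np.array([[1000.0, 999.0, 998.0]])

# Naive version: exp overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_exp = np.exp(inputs)
    naive = naive_exp / np.sum(naive_exp, axis=1, keepdims=True)
print(naive)

# Shifted version: the largest input becomes 0, the rest become negative
shifted = inputs - np.max(inputs, axis=1, keepdims=True)
exp_values = np.exp(shifted)  # all values are in the range (0, 1]
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

print(shifted)
print(exp_values)
print(probabilities)
```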

With Softmax, thanks to the normalization, we can subtract any value from all of the inputs, and it will not change the output:

In [36]:
softmax = Softmax()
softmax.forward([[1, 2, 3]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


In [37]:
# Same inputs with the maximum (3) subtracted from each value
softmax.forward([[-2, -1, 0]])
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]


This is another useful property of the exponentiated and normalized function. 
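To see why the shift leaves the output unchanged, here is a short derivation. It reuses the symbols from the Softmax formula and introduces $c$, a name chosen here for the constant subtracted from every input of sample $i$:

$$
\frac{e^{z_{i,j} - c}}{\sum_{l=1}^{L} e^{z_{i,l} - c}} = \frac{e^{-c} \, e^{z_{i,j}}}{e^{-c} \sum_{l=1}^{L} e^{z_{i,l}}} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}} = S_{i,j}
$$

The factor $e^{-c}$ appears in both the numerator and the denominator, so it cancels. Choosing $c$ as the largest input of the sample makes every exponent at most 0, which is what keeps the exponentials from overflowing.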

There’s one more thing to mention in addition to these calculations. 

What happens if we divide the layer’s output data, `[1, 2, 3]`, for example, by 2?

In [38]:
softmax.forward([[.5, 1, 1.5]])
print(softmax.output)

[[0.18632372 0.30719589 0.50648039]]


The output confidences have changed due to the nonlinear nature of the exponentiation. This is one example of why we need to scale all of the input data to a neural network in the same way.
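To get a feel for how strongly scaling affects the result, the sketch below runs the same values through softmax at several scales. The helper `softmax` and the list of scale factors are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    """Softmax for a single vector, with the max subtracted for stability."""
    z = np.asarray(z, dtype=float)
    exp_values = np.exp(z - np.max(z))
    return exp_values / np.sum(exp_values)

base = np.array([1.0, 2.0, 3.0])

# Multiply the same inputs by different factors and compare the confidences
for scale in (0.5, 1.0, 2.0, 10.0):
    print(scale, softmax(base * scale))
```

Smaller scales pull the confidences closer together, while larger scales push almost all of the confidence onto the largest input, even though the ordering of the values never changes.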