In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Activation Functions
In this chapter, we will tackle a few of the activation functions and discuss their roles. We use
different activation functions for different cases, and understanding how they work can help you
properly pick which of them is best for your task. The activation function is applied to the output
of a neuron (or layer of neurons), which modifies outputs. We use activation functions because if
the activation function itself is nonlinear, it allows for neural networks with usually two or more
hidden layers to map nonlinear functions. We’ll be showing how this works in this chapter.
In general, your neural network will have two types of activation functions. The first will be the
activation function used in hidden layers, and the second will be used in the output layer. Usually,
the activation function used for hidden neurons will be the same for all of them, but it doesn’t
have to be.

## The Step Activation Function
Recall the purpose this activation function serves is to mimic a neuron “firing” or “not firing”
based on input information. The simplest version of this is a step function. In a single neuron, if
the `weights · inputs + bias` results in a value greater than 0, the neuron will fire and output a 1;
otherwise, it will output a 0.
![image.png](attachment:image.png)
                   Figure 1: Step function graph

This activation function has been used historically in hidden layers, but nowadays, it is rarely a
choice.
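
As a minimal sketch of the rule described above, the step activation for a single neuron could be written with NumPy as follows. The input, weight, and bias values are made up purely for illustration and do not come from the text.

```python
import numpy as np

def step(z):
    # Output 1 where the pre-activation is greater than 0, otherwise 0
    return np.where(z > 0, 1, 0)

# Hypothetical values, chosen only for illustration
inputs = np.array([1.0, -2.0, 3.0])
weights = np.array([0.2, 0.8, -0.5])
bias = 2.0

z = np.dot(weights, inputs) + bias  # weights · inputs + bias
step_output = step(z)
print(z, step_output)
```

With these particular values the weighted sum is negative, so the neuron does not fire.
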
## The Linear Activation Function

A linear function is simply the equation of a line. It will appear as a straight line when graphed,
where y=x and the output value equals the input.
![image-2.png](attachment:image-2.png)
                    Figure 2: Linear function graph

This activation function is usually applied to the last layer’s output in the case of a regression
model — a model that outputs a scalar value instead of a classification. We’ll cover regression in
chapter 17 and soon in an example in this chapter.



## The Sigmoid Activation Function
The problem with the step function is it’s not very informative. When we get to training and network optimizers, you will see that the way an optimizer works is by assessing individual impacts that weights and biases have on a network’s output. The problem with a step function is that it’s less clear to the optimizer what these impacts are because there’s very little information gathered from this function. It’s either on (1) or off (0). It’s hard to tell how “close” this step function was to activating or deactivating. Maybe it was very close, or maybe it was very far. In terms of the final output value from the network, it doesn’t matter if it was close to outputting something else.

Thus, when it comes time to optimize weights and biases, it’s easier for the optimizer if we have activation functions that are more granular and informative. The original, more granular, activation function used for neural networks was the Sigmoid activation function, which looks like:

![image.png](attachment:image.png)

Fig 3: Sigmoid function graph. 
    
The output of this function approaches 0 as the input goes to negative infinity, equals 0.5 for
an input of 0, and approaches 1 as the input goes to positive infinity. We’ll talk about this
function more in chapter 16.

As mentioned earlier, with “dead neurons,” it’s usually better to have a more granular approach
for the hidden neuron activation functions. In this case, we’re getting a value that can be
reversed to its original value; the returned value contains all the information from the input,
contrary to a function like the step function, where an input of 3 will output the same value as an
input of 300,000. The output from the Sigmoid function, being in the range of 0 to 1, also works
better with neural networks, especially compared to the range of negative to positive
infinity, and adds nonlinearity. The importance of nonlinearity will become clearer shortly
in this chapter. The Sigmoid function, historically used in hidden layers, was eventually replaced
by the Rectified Linear Units activation function (or ReLU).
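
The behavior described above can be checked with a short sketch. It assumes the standard logistic form of the sigmoid, 1 / (1 + e^(-x)); the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary sample inputs, from strongly negative to strongly positive
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sig = sigmoid(x)
print(sig)

# Unlike the step function, two different positive inputs give two different outputs
print(sigmoid(3.0), sigmoid(6.0))
```

The outputs stay strictly between 0 and 1 for these inputs, and an input of 0 maps to 0.5.
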

##  The Rectified Linear Activation Function
![image-3.png](attachment:image-3.png)
Figure 4: Graph of the ReLU activation function.

The rectified linear activation function is simpler than the sigmoid. It’s quite literally y=x, clipped
at 0 from the negative side. If x is less than or equal to 0, then y is 0; otherwise, y is equal to x.
![image-2.png](attachment:image-2.png)
This simple yet powerful activation function is the most widely used activation function at the
time of writing for various reasons — mainly speed and efficiency. While the sigmoid activation
function isn’t the most complicated, it’s still much more challenging to compute than the ReLU
activation function. The ReLU activation function is extremely close to being a linear activation
function while remaining nonlinear, due to that bend after 0. This simple property is, however,
very effective.

## Why Use Activation Functions?
Now that we understand what activation functions represent, how some of them look, and what
they return, let’s discuss why we use activation functions in the first place. In most cases, for a
neural network to fit a nonlinear function, we need it to contain two or more hidden layers, and
we need those hidden layers to use a nonlinear activation function.

First off, what’s a nonlinear function? A nonlinear function, such as a sine function, cannot be
represented well by a straight line:
![image.png](attachment:image.png)
Figure 5: Graph of y=sin(x)
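
To put a rough number on "cannot be represented well by a straight line", the sketch below fits a least-squares line to one period of the sine function using `np.polyfit` and reports the largest gap between the line and the curve. The sampling range and number of points are assumptions made for this illustration.

```python
import numpy as np

# Sample one full period of the sine function (range and density are arbitrary)
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

# Least-squares straight line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
line = slope * x + intercept

# Largest vertical gap between the best-fit line and the sine curve
max_error = np.max(np.abs(y - line))
print(slope, intercept, max_error)
```

Even the best straight line misses the curve by a large fraction of its amplitude.
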

Some problems in life are linear in nature. For example, if we know the cost of an individual
shirt and that there are no bulk discounts, then the equation to calculate the price of any
number of those shirts is a linear equation. Other problems in life are not so simple, like the
price of a home. The number of factors that come into play, such as size, location, time of year
attempting to sell, number of rooms, yard, neighborhood, and so on, makes the pricing of a home a
nonlinear equation. Many of the more interesting and hard problems of our time are nonlinear. The
main attraction for neural networks has to do with their ability to solve nonlinear problems.

First, let’s consider a situation where neurons have no activation function, which would be the
same as having an activation function of y=x. With this linear activation function in a neural
network with 2 hidden layers of 8 neurons each, the result of training this model will look like:
![image-2.png](attachment:image-2.png)
Figure 6:  Neural network with linear activation functions in hidden layers attempting to fit
y=sin(x)
![image-3.png](attachment:image-3.png)
Figure 7: ReLU activation functions in hidden layers attempting to fit y=sin(x)

## Linear Activation in the Hidden Layers

Now that you can see that this is the case, we still should consider why this is the case. To begin,
let’s revisit the linear activation function of y=x, and let’s consider this on a singular neuron level.
Given values for weights and biases, what will the output be for a neuron with a y=x activation
function? Let’s look at some examples. First, let’s try to update the first weight with a positive
value:
![image.png](attachment:image.png)
Figure 8: Example of output with a neuron using a linear activation function.

As we continue to tweak the weights, updating with a negative number this time:

![image-2.png](attachment:image-2.png)
Figure 9: Example of output with a neuron using a linear activation function, updated weight.

And updating weights and additionally a bias:

![image-3.png](attachment:image-3.png)
Figure 10: Example of output with a neuron using a linear activation function, another weight
updated.


No matter what we do with this neuron’s weights and biases, the output of this neuron will be
perfectly linear to y=x of the activation function. This linear nature will continue throughout the
entire network:
![image-4.png](attachment:image-4.png)
Figure 11: A neural network with all linear activation functions.

No matter what we do, however many layers we have, this network can only depict linear
relationships if we use linear activation functions. It should be fairly obvious that this will be the
case as each neuron in each layer acts linearly, so the entire network is a linear function as well.
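
One way to check this claim numerically is to compose several layers that all use the y=x activation and compare the result against a single weight and bias. The sketch below assumes the 2 hidden layers of 8 neurons used in this chapter plus one output neuron; the random weights and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Two hidden layers of 8 neurons and one output neuron, all with a y=x activation
W1, b1 = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=(1, 8))
W3, b3 = rng.normal(size=(8, 1)), rng.normal(size=(1, 1))

def linear_network(x):
    h1 = x @ W1 + b1    # first hidden layer, linear activation
    h2 = h1 @ W2 + b2   # second hidden layer, linear activation
    return h2 @ W3 + b3 # output layer, linear activation

# The same mapping collapsed into one weight and one bias
W = W1 @ W2 @ W3
b = (b1 @ W2 + b2) @ W3 + b3

x = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
network_out = linear_network(x)
collapsed_out = x @ W + b
print(np.allclose(network_out, collapsed_out))
```

Because the whole network reduces to `x @ W + b`, adding more linear layers cannot make its output anything other than a straight line.
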


## ReLU Activation in a Pair of Neurons

We believe it is less obvious how, with a barely nonlinear activation function, like the rectified
linear activation function, we can suddenly map nonlinear relationships and functions, so now
let’s cover that. Let’s start again with a single neuron. We’ll begin with both a weight of 0 and a
bias of 0:

![image.png](attachment:image.png)
Figure 12: Single neuron with single input (zeroed weight) and ReLU activation function.

In this case, no matter what input we pass, the output of this neuron will always be a 0, because
the weight is 0, and there’s no bias. Let’s set the weight to be 1:
![image-2.png](attachment:image-2.png)

Fig 13: Single neuron with single input and ReLU activation function, weight set to 1.0.


In [None]:
### It looks :)

Let's consider a simple example of two neurons with ReLU activation functions.

### Neuron 1:
The first neuron applies the following transformation to the input:
$$
\text{output}_1 = \max(0, \text{input} \cdot \text{weight}_1 + \text{bias}_1)
$$

### Neuron 2:
The second neuron takes the output of the first neuron as its input:
$$
\text{output}_2 = \max(0, \text{output}_1 \cdot \text{weight}_2 + \text{bias}_2)
$$

### Case 1:
If the second neuron has a weight of $1$ and a bias of $0$, it neither scales nor offsets its input. Since $\text{output}_1$ is never negative, the second neuron's output is:
$$
\text{output}_2 = \max(0, \text{output}_1) = \text{output}_1
$$

### Case 2:
If we increase the bias of the second neuron to $2$ while keeping its weight at $1$, the overall function of the pair shifts vertically:
$$
\text{output}_2 = \max(0, \text{output}_1 + 2) = \text{output}_1 + 2
$$

### Conclusion:
Because $\text{output}_1$ is never negative, the bias of the second neuron offsets the output of the pair vertically rather than moving its activation point. Combined with a negative $\text{weight}_2$, that bias determines where the pair deactivates.


What happens when we have, rather than just the one neuron, a pair of neurons? For example,
let’s pretend that we have 2 hidden layers of 1 neuron each. Thinking back to the y=x activation
function, we unsurprisingly discovered that a linear activation function produced linear results no
matter what chain of neurons we made. Let’s see what happens with the rectified linear function
for the activation. We’ll begin with the last values for the 1st neuron and a weight of 1, with a
bias of 0, for the 2nd neuron:

![image.png](attachment:image.png)
Figure 16: Pair of neurons with single inputs and ReLU activation functions.

As we can see so far, there’s no change. This is because the 2nd neuron’s bias is doing no
offsetting, and the 2nd neuron’s weight is just multiplying output by 1, so there’s no change. Let’s try to adjust the 2nd neuron’s bias now:

![image-2.png](attachment:image-2.png)
Figure 17: Pair of neurons with single inputs and ReLU activation functions.

Now we see some fairly interesting behavior. The bias of the second neuron indeed shifted the
overall function, but, rather than shifting it horizontally, it shifted the function vertically. What
then might happen if we make that 2nd neuron’s weight -2 rather than 1?
![image-3.png](attachment:image-3.png)
Figure 18: Pair of neurons with single inputs and ReLU activation functions, other negative
weight.

Something exciting has occurred! What we have here is a neuron that has both an activation and a
deactivation point. When both neurons are activated, when their “area of effect” comes into play,
they produce granular, variable output. If either neuron in the pair is inactive, the pair will
produce non-variable output:
![image-4.png](attachment:image-4.png)
Figure 19: Pair of neurons with single inputs and ReLU activation functions, area of effect.
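
The two effects described above can be reproduced with a small sketch of a neuron pair. The figures' actual parameter values are not given in the text, so every weight and bias below is a hypothetical value picked to show the behavior: first a vertical shift from the 2nd neuron's bias, then an output that only varies inside a bounded "area of effect" once the 2nd neuron's weight is negative.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def neuron_pair(x, w1, b1, w2, b2):
    # The 2nd neuron takes the 1st neuron's output as its only input
    h1 = relu(x * w1 + b1)
    return relu(h1 * w2 + b2)

x = np.array([-1.0, 0.0, 0.25, 0.5, 1.0])

# Hypothetical values: 2nd neuron with weight 1 and bias 1.
# The result is relu(x) + 1, i.e. the ReLU curve moved up by 1.
shifted = neuron_pair(x, w1=1.0, b1=0.0, w2=1.0, b2=1.0)

# Hypothetical values: negative weights on both neurons.
# The output is flat at 0, ramps up between x=0 and x=0.5, then stays flat at 1.
bounded = neuron_pair(x, w1=-1.0, b1=0.5, w2=-2.0, b2=1.0)

print(shifted)
print(bounded)
```

In the second case the output changes with the input only while both neurons are active; outside that window the pair returns a constant.
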


## ReLU Activation in the Hidden Layers
Let’s now take this concept and use it to fit to the sine wave function using 2 hidden layers of 8
neurons each, and we can hand-tune the values to fit the curve. We’ll do this by working with 1
pair of neurons at a time, which means 1 neuron from each layer individually. For simplicity, we
are also going to assume that the layers are not densely connected, and each neuron from the first
hidden layer connects to only one neuron from the second hidden layer. That’s usually not the
case with the real models, but we want this simplification for the purpose of this demo.
Additionally, this example model takes a single value as an input, the input to the sine function,
and outputs a single value like the sine function. The output layer uses the Linear activation
function, and the hidden layers will use the rectified linear activation function.
To start, we’ll set all weights to 0 and work with the first pair of neurons:

![image.png](attachment:image.png)
Figure 20: Hand-tuning a neural network starting with the first pair of neurons.
Next, we can set the weight for the hidden layer neurons and the output neuron to 1, and we can
see how this impacts the output:

![image-2.png](attachment:image-2.png)
Figure 21: Adjusting weights for the first/top pair of neurons all to 1.

In this case, we can see that the slope of the overall function is impacted. We can further increase
this slope by adjusting the weight for the first neuron of the first layer to 6.0:

![image-3.png](attachment:image-3.png)
Figure 22: Setting weight for first hidden neuron to 6.
We can now see, for example, that the initial slope of this function is what we’d like, but we have
a problem. Currently, this function never ends because this neuron pair never deactivates. We can
visually see where we’d like the deactivation to occur. It’s where the red fitment line (our current
neural network’s output) diverges initially from the green sine wave. So now, while we have the
correct slope, we need to set this spot as our deactivation point. To do that, we start by increasing
the bias for the 2nd neuron of the hidden layer pair to 0.70. Recall that this offsets the overall
function vertically:
![image-4.png](attachment:image-4.png)

Figure 23: Using the bias for the 2nd hidden neuron in the top pair to offset function vertically.

Now we can set the weight for the 2nd neuron to -1, causing a deactivation point to occur, at least
horizontally, where we want it:
![image-5.png](attachment:image-5.png)
Figure 24: Setting the weight for the 2nd neuron in the top pair to -1.
Now we’d like to flip this slope back. How might we flip the output of these two neurons? It
seems like we can take the weight of the connection to the output neuron, which is currently a 1.0,
and just flip it to a -1, and that flips the function:
![image-6.png](attachment:image-6.png)
Figure 25: Setting the weight to the output neuron to -1.

We’re certainly getting closer to making this first section fit how we want. Now, all we need to
do is offset this up a bit. For this hand-optimized example, we’re going to use the first 7 pairs of
neurons in the hidden layers to create the sine wave’s shape, then the bottom pair to offset
everything vertically. If we set the bias of the 2nd neuron in the bottom pair to 1.0 and the
weight to the output neuron as 0.7, we can vertically shift the line like so:

![image-7.png](attachment:image-7.png)
Figure 26: Using the bottom pair of neurons to offset the entire neural network function.
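
The steps so far can be sketched in code. The weights and biases for the top and bottom pairs are the ones stated in the walkthrough (6.0, 0.70, -1, -1 for the top pair; 1.0 and 0.7 for the bottom pair). Everything else is an assumption of this sketch: parameters the text does not mention are left at 0, the output neuron's bias is taken as 0, and the helper names `pair_output` and `network` are made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def pair_output(x, w1, b1, w2, b2, w_out):
    # One neuron from each hidden layer, then the weight into the output neuron
    h1 = relu(x * w1 + b1)
    h2 = relu(h1 * w2 + b2)
    return h2 * w_out

def network(x):
    # Top pair: values from the walkthrough; the unstated 1st-neuron bias is assumed 0
    top = pair_output(x, w1=6.0, b1=0.0, w2=-1.0, b2=0.70, w_out=-1.0)
    # Bottom pair: acts as a constant vertical offset; its weights stay at 0
    bottom = pair_output(x, w1=0.0, b1=0.0, w2=0.0, b2=1.0, w_out=0.7)
    # Linear output neuron (bias assumed 0) sums the contributions
    return top + bottom

x = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
y = network(x)
print(y)
```

Under these assumptions the output rises with a slope of 6 from an input of 0 until the top pair deactivates (at roughly 0.117), then stays flat at 0.7.
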

At this point, we have completed the first section with an “area of effect” being the first upward
section of the sine wave. We can start on the next section that we wish to do. We can start by
setting all weights for this 2nd pair of neurons to 1, including the output neuron:

![image-8.png](attachment:image-8.png)
Figure 27: Starting to adjust the 2nd pair of neurons (from the top) for the next segment of the overall function.

At this point, this 2nd pair of neurons’ activation is beginning too soon, which is impacting the
“area of effect” of the top pair that we already aligned. To fix this, we want this second pair to
start influencing the output where the first pair deactivates, so we want to adjust the function
horizontally. As you can recall from earlier, we adjust the first neuron’s bias in this neuron pair to
achieve this. Also, to modify the slope, we’ll set the weight coming into that first neuron for the
2nd pair, setting it to 3.5. This is the same method we used to set the slope for the first section,
which is controlled by the top pair of neurons in the hidden layer. After these adjustments:

![image-9.png](attachment:image-9.png)
Figure 28: Adjusting the weight and bias into the first neuron of the 2nd pair.
We will now use the same methodology as we did with the first pair to set the deactivation point.
We set the weight for the 2nd neuron in the hidden layer pair to -1 and the bias to 0.27.
![image-10.png](attachment:image-10.png)
Figure 29: Adjusting the bias of the 2nd neuron in the 2nd pair.
Then we can flip this section’s function, again the same way we did with the first one, by setting
the weight to the output neuron from 1.0 to -1.0:

![image-11.png](attachment:image-11.png)
Figure 30: Flipping the 2nd pair’s function segment, flipping the weight to the output neuron.

And again, just like the first pair, we will use the bottom pair to fix the vertical offset:
![image-12.png](attachment:image-12.png)
Figure 31: Using the bottom pair of neurons to adjust the network’s overall function.
We then just continue with this methodology. We’ll leave it flat for the top section, which means
we will only begin the activation for the 3rd pair of hidden layer neurons when we wish for the
slope to start going down:

![image-13.png](attachment:image-13.png)

Figure 32: Adjusting the 3rd pair of neurons for the next segment.

This process is simply repeated for each section, giving us a final result:

![image-14.png](attachment:image-14.png)
Figure 33: The completed process (see anim for all values).

We can then begin to pass data through to see how these neurons’ areas of effect come into play,
which happens only when both neurons in a pair are activated by the input:

![image-16.png](attachment:image-16.png)
Figure 34: Example of data passing through this hand-crafted model.

In this case, given an input of 0.08, we can see the only pairs activated are the top ones, as this is
their area of effect. Continuing with another example:

![image-17.png](attachment:image-17.png)
Figure 35: Example of data passing through this hand-crafted model.
![image-18.png](attachment:image-18.png)

In this case, only the fourth pair of neurons is activated. As you can see, even without any of the
other weights, we’ve used some crude properties of a pair of neurons with rectified linear
activation functions to fit this sine wave pretty well. If we enable all of the weights now and allow
a mathematical optimizer to train, we can see even better fitment:

![image-19.png](attachment:image-19.png)
Figure 36: Example of fitment after fully-connecting the neurons and using an optimizer.

It should begin to make more sense to you now how more neurons can enable more unique areas
of effect, why we need two or more hidden layers, and why we need nonlinear activation
functions to map nonlinear problems. As a further example, we can take the above network with 2
hidden layers of 8 neurons each and instead use 64 neurons per hidden layer, which improves the
fit even further:
Figure 37: Fitment with 2 hidden layers of 64 neurons each, fully connected, with optimizer.


The Rectified Linear Activation Function (ReLU) is easy to implement.

### ReLU using Python loop:
Given the inputs:

    inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

We can implement ReLU using a basic Python for loop:

    output = []
    for i in inputs:
        if i > 0:
            output.append(i)
        else:
            output.append(0)

This will result in the following output:

    [0, 2, 0, 3.3, 0, 1.1, 2.2, 0]

### Simplified ReLU using max():
We can simplify the ReLU implementation using Python's `max` function:

    output = []
    for i in inputs:
        output.append(max(0, i))

The output will be the same:

    [0, 2, 0, 3.3, 0, 1.1, 2.2, 0]

### Using NumPy's np.maximum():
In NumPy, you can use the `np.maximum()` function for ReLU:

    import numpy as np
    output = np.maximum(0, inputs)

This yields the same values, returned as a NumPy array:

    [0.  2.  0.  3.3 0.  1.1 2.2 0. ]

### ReLU Activation Class:
We can define a ReLU activation class:

    class Activation_ReLU:
        def forward(self, inputs):
            self.output = np.maximum(0, inputs)

### Example with Dense Layer:
Using the ReLU activation function with a dense layer (the code cell that follows uses random input data in place of `spiral_data`, so its printed values differ):

    # Create dataset
    X, y = spiral_data(samples=100, classes=3)
    # Create Dense layer
    dense1 = Layer_Dense(2, 3)
    # Create ReLU activation
    activation1 = Activation_ReLU()
    # Forward pass through dense layer
    dense1.forward(X)
    # Forward pass through ReLU activation
    activation1.forward(dense1.output)

Print the output of the first 5 samples:

    print(activation1.output[:5])

The result is:

    [[0., 0.,         0.],
     [0., 0.00011395, 0.],
     [0., 0.00031729, 0.],
     [0., 0.00052666, 0.],
     [0., 0.00071401, 0.]]

This shows that negative values are clipped to 0, which is the essence of the ReLU function.


In [1]:
import numpy as np

# ReLU using Python loop:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]

# Implementing ReLU using a basic Python for loop
output = []
for i in inputs:
    if i > 0:
        output.append(i)  # Keep positive values
    else:
        output.append(0)  # Replace zero and negative values with 0

print("ReLU output (using loop):", output)

# Simplified ReLU using max()
output = [max(0, i) for i in inputs]
print("ReLU output (using max()):", output)

# Using NumPy's np.maximum()
output = np.maximum(0, inputs)
print("ReLU output (using np.maximum()):", output)

# ReLU Activation class
class Activation_ReLU:
    def forward(self, inputs):
        # Compute the ReLU output
        self.output = np.maximum(0, inputs)

# Example with Dense Layer
X = np.random.randn(100, 2)  # Example input data

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()

# Forward pass through dense layer
dense1.forward(X)

# Forward pass through ReLU activation
activation1.forward(dense1.output)

# Print the output of the first 5 samples:
print("First 5 outputs after ReLU activation:\n", activation1.output[:5])


ReLU output (using loop): [0, 2, 0, 3.3, 0, 1.1, 2.2, 0]
ReLU output (using max()): [0, 2, 0, 3.3, 0, 1.1, 2.2, 0]
ReLU output (using np.maximum()): [0.  2.  0.  3.3 0.  1.1 2.2 0. ]
First 5 outputs after ReLU activation:
 [[0.         0.         0.        ]
 [0.         0.         0.        ]
 [0.         0.         0.02857015]
 [0.03213217 0.01709417 0.        ]
 [0.02036945 0.01192971 0.03625   ]]


In [None]:
### Next -------> Softmax