## Neural Network

A neural network is inspired by the structure of the human brain. Its basic building block is a neuron, which takes inputs, performs computations, and produces an output. Each neuron has an associated activation function that determines whether it should "fire" based on its inputs.  

The training of a neural network can basically be broken down into the following parts:  
#### 1. Architecture and Layers
A simple neural network consists of an input layer, a hidden layer, and an output layer. A deep neural network has more than one hidden layer, and often a larger number of input features as well.

##### Simple Neural Network
$x_{1}$, $x_{2}$ and $x_{3}$ are the input features, and $y_{1}$ and $y_{2}$ are the outputs. In other words, this architecture takes 3 input features and produces 2 outputs (for example, 2 classes in a classification model).
<img src='img/nn.png' alt='Simple Neural Network' width="200px" style="float: center" />
<br clear="left" />

#### 2. Parameters
Between each pair of adjacent layers there is a linear function whose trainable parameters are called weights and biases. Training a neural network essentially means updating these weights and biases to optimize the objective function of the specific problem.

##### Linear function with parameters
Let's assume that $x$ is an input feature from the data. The linear function multiplies $x$ by the weight $w$ and adds the bias $b$, that is, $wx + b$. In the graph below, we set $w=1/2$ and $b=1$ for easy understanding, and the straight line represents that function. This example covers only one neuron with one feature. In a neural network, many similar operations happen in parallel, so vectorization and matrix operations are used for faster and more efficient computation.  
<img src='img/linear.png' alt='Linear Function' width="250px" style="float: center" />
<br clear="left" />
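
As a minimal sketch of the linear function described above, the snippet below evaluates $wx + b$ with the values from the text ($w=1/2$, $b=1$), first for a single input and then for several inputs at once. The layer-sized weight matrix `W`, bias vector `b_vec` and input `x_vec` are illustrative values chosen here, not taken from the figure.

```python
import numpy as np

# Values taken from the text: w = 1/2, b = 1
w, b = 0.5, 1.0

def linear(x, w, b):
    """Compute z = w * x + b for a scalar or a NumPy array x."""
    return w * x + b

# One input feature, one neuron
z_single = linear(2.0, w, b)           # 0.5 * 2 + 1 = 2.0

# Many inputs at once: the same line evaluated elementwise
xs = np.array([-2.0, 0.0, 2.0, 4.0])
z_many = linear(xs, w, b)              # [0., 1., 2., 3.]

# A whole layer: 3 input features feeding 2 neurons (illustrative weights)
W = np.array([[0.5, -1.0, 0.25],
              [1.0,  0.0, -0.5]])      # shape (2, 3)
b_vec = np.array([1.0, -1.0])          # shape (2,)
x_vec = np.array([2.0, 1.0, 4.0])      # shape (3,)
z_layer = W @ x_vec + b_vec            # shape (2,), here [2., -1.]

print(z_single, z_many, z_layer)
```

The matrix form computes every neuron's weighted sum in a single operation, which is what makes vectorization attractive.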


#### 3. Activation Function
The activation function introduces non-linearity to the network. It decides whether a neuron should be active or not based on its weighted inputs. Common activation functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).

<img src='img/activation.png' alt='Activation' width="250px" style="float: center" />
<br clear="left" />

##### Deep Neural Network
Architecture image referenced from: [Industry 4.0 Interoperability, Analytics, Security, and Case Studies](https://www.researchgate.net/figure/a-Typical-Architecture-of-Deep-Learning-Neural-Network-with-One-Output-One-Input-and_fig1_355485828)
<img src='img/deepNN.png' alt='Deep Neural Network' width="600px" style="float: center" />
<br clear="left" />

#### 4. Feedforward
In the feedforward step, the network only performs a forward pass; no parameters are updated yet. First, the linear function is computed from the input features. An activation function is then applied to the output of the linear function to obtain a non-linear output. These activated outputs become the inputs of the next layer. The process is repeated through all the layers until we get the final result $y_{i}$.
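
The forward pass can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the layer sizes (3 inputs, 4 hidden units, 2 outputs), the random weights, and the use of sigmoid in every layer are choices made here, and `forward` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass through a list of (W, b) layers, sigmoid after each one."""
    a = x
    for W, b in params:
        z = W @ a + b      # linear step: weights times inputs plus bias
        a = sigmoid(z)     # non-linear step: activation of the layer
    return a

rng = np.random.default_rng(0)
# Assumed sizes: 3 input features -> 4 hidden units -> 2 outputs
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]

x = np.array([0.5, -1.2, 3.0])
y = forward(x, params)
print(y.shape)   # (2,)
```

No parameter is modified anywhere in `forward`; it only reads the weights and biases.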

#### 5. Loss Function
A loss function measures how far the neural network's output is from the expected output (label or ground truth). It quantifies the network's performance and guides the learning process.
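
To make this concrete, here is a small sketch of two commonly used loss functions, mean squared error and cross-entropy. The text does not name a specific loss, so both choices, the helper names, and the example predictions are assumptions made for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy for a one-hot target and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_prob))

y_true = np.array([1.0, 0.0])     # ground truth: class 0
good = np.array([0.9, 0.1])       # prediction close to the label
bad = np.array([0.2, 0.8])        # prediction far from the label

print("MSE:", mse(y_true, good), mse(y_true, bad))
print("CE :", cross_entropy(y_true, good), cross_entropy(y_true, bad))
```

In both cases the prediction that is closer to the label receives the smaller loss.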

#### 6. Optimizer
An optimizer is an algorithm used to adjust the parameters (weights and biases) of a neural network during training in order to minimize the loss function. Given the gradients of the loss, the optimizer determines the size and direction of each parameter update, using settings such as the learning rate and momentum.
Examples of optimizers:  

1. Stochastic Gradient Descent (SGD)
2. Adam
3. RMSProp
4. Adagrad

and so on.  
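
As a rough sketch of what an optimizer does, the code below compares a plain gradient step with a momentum step on a toy one-parameter loss. The loss $f(w) = (w-3)^2$, the learning rate, the momentum coefficient `beta`, and the helper names are all assumptions for illustration; the momentum rule shown is one common formulation, not the only one.

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd_step(w, lr):
    """Plain gradient step: move against the gradient."""
    return w - lr * grad(w)

def momentum_step(w, v, lr, beta=0.9):
    """Momentum step: accumulate past gradients in a velocity v."""
    v = beta * v + grad(w)
    return w - lr * v, v

lr = 0.1
w_sgd = 0.0
w_mom, v = 0.0, 0.0
for _ in range(300):
    w_sgd = sgd_step(w_sgd, lr)
    w_mom, v = momentum_step(w_mom, v, lr)

# Both parameters should end up close to the minimizer w = 3
print(w_sgd, w_mom)
```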

#### 7. Backpropagation
Backpropagation computes the gradient of the loss function (objective function) with respect to each of the network's parameters by applying the chain rule backward from the output layer to the input layer. An optimization algorithm such as Gradient Descent then uses these gradients to update the weights and biases so that the loss decreases.
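
The chain rule behind backpropagation can be illustrated on a single neuron. The sketch below assumes a sigmoid activation and a squared-error loss $L = \frac{1}{2}(a - y)^2$, which are choices made here and not necessarily those used in the figures; it checks the hand-derived gradients against a finite-difference approximation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, y):
    """Forward pass of one neuron followed by a squared-error loss."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def grads(w, b, x, y):
    """Backward pass: apply the chain rule step by step."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = a - y              # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)      # derivative of sigmoid w.r.t. its input
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz    # dL/dw, dL/db

w, b, x, y = 0.5, 1.0, 2.0, 0.0
dw, db = grads(w, b, x, y)

# Numerical check with central differences
eps = 1e-6
dw_num = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)
db_num = (loss_fn(w, b + eps, x, y) - loss_fn(w, b - eps, x, y)) / (2 * eps)
print(dw, dw_num)
print(db, db_num)
```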

##### Computational Graph
<img src='img/computationgraph1.png' alt='Computation Graph' width="250px" style="float: left" />
<!-- <br clear="left" /> -->

<img src='img/computationgraph2.png' alt='Computation Graph' width="400px" style="float: center" />
<br clear="left" />

##### Derivation
<img src='img/backprop1.png' alt='Back Propagation' width="300px" style="float: left" />
<br clear="left" />

<img src='img/backprop2.png' alt='Back Propagation' width="300px" style="float: left" />
<br clear="left" />

##### Derivative Examples
$f(x)=x^{2} \Rightarrow \frac{\partial}{\partial x}f(x) = 2x$

$f(x)=x^{3} \Rightarrow \frac{\partial}{\partial x}f(x) = 3x^{2}$

$f(x)=\log_{e}(x)=\ln(x) \Rightarrow \frac{\partial}{\partial x}f(x) = \frac{1}{x}$

$f(x)=e^{x} \Rightarrow \frac{\partial}{\partial x}f(x) = e^{x}$

$f(x)=2x^{2}+3x+b \Rightarrow \frac{\partial}{\partial x}f(x) = 4x+3$ (the derivative of the constant $b$ is zero)
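
These rules can be checked numerically. The sketch below compares each analytic derivative with a central-difference approximation at one point; the evaluation point `x0`, the value of the constant `b`, and the helper `numerical_derivative` are arbitrary choices made for this illustration.

```python
import math

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

b = 7.0   # arbitrary constant; it should not affect the derivative

# Pairs of (function, analytic derivative) from the examples above
cases = [
    (lambda x: x**2,              lambda x: 2 * x),
    (lambda x: x**3,              lambda x: 3 * x**2),
    (math.log,                    lambda x: 1 / x),
    (math.exp,                    math.exp),
    (lambda x: 2*x**2 + 3*x + b,  lambda x: 4 * x + 3),
]

x0 = 1.5
results = [(numerical_derivative(f, x0), df(x0)) for f, df in cases]
for approx, exact in results:
    print(f"numerical = {approx:.6f}   analytic = {exact:.6f}")
```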

#### 8. Training
The network is trained using labeled data. During training, the input data is passed through the network, the output is compared to the expected output, and the loss is calculated. Backpropagation then computes the gradients, and the optimizer adjusts the weights and biases to reduce the loss, so that we end up with weights and biases that fit our problem.

##### Gradient Descent
Gradient Descent is an optimization algorithm used in machine learning and deep learning to iteratively update the parameters of a model in order to minimize a loss function. The goal of the algorithm is to find the values of the model's parameters that result in the lowest possible value of the loss function, which in turn signifies the best fit of the model to the training data.  

1. Initialization: The algorithm starts with an initial guess for the model's parameters.  

2. Compute Loss: The current model's parameters are used to make predictions on the training data. The difference between these predictions and the actual target values (the loss) is calculated using a chosen loss function.  

3. Compute Gradient: The gradient of the loss with respect to each parameter is computed. The gradient indicates the direction and magnitude of the steepest increase in the loss function. This step involves taking partial derivatives of the loss function with respect to each parameter.  

4. Update Parameters: The parameters are updated by subtracting a fraction of the gradient from the current parameter values. This fraction is called the learning rate. The learning rate determines the step size in the parameter space and influences the speed and stability of convergence.  

5. Repeat: Steps 2 to 4 are repeated for a certain number of iterations or until the change in the loss function becomes small (convergence criterion).  
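
The five steps above can be sketched as a short loop. This example fits a line to toy data generated from $y = 0.5x + 1$; the data, the mean squared error loss, the learning rate, and the number of iterations are assumptions chosen for illustration.

```python
import numpy as np

# Toy data generated from y = 0.5 * x + 1 (no noise)
x = np.linspace(-2, 2, 20)
y = 0.5 * x + 1.0

# 1. Initialization: start from an initial guess
w, b = 0.0, 0.0
lr = 0.1                                     # learning rate

for step in range(500):                      # 5. Repeat
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)        # 2. Compute loss (MSE)
    dw = np.mean(2 * (y_pred - y) * x)       # 3. Compute gradient w.r.t. w
    db = np.mean(2 * (y_pred - y))           #    and w.r.t. b
    w -= lr * dw                             # 4. Update parameters
    b -= lr * db

print(w, b, loss)
```

With this setup the parameters approach $w = 0.5$ and $b = 1$, the values used to generate the data.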

Gradient descent algorithm reference: Deep Learning Specialization by Andrew Ng from [Coursera](https://www.coursera.org/).  

<img src='img/gradientdescent.png' alt='Gradient Descent Algorithm' width="300" style="float: left" />
<!-- <br clear="left" /> -->

<img src='img/localoptima.png' alt='Local Optima' width="175" style="float: right" />
<!-- <br clear="left" /> -->

<img src='img/gradientdescent2.png' alt='Gradient Descent' width="400" style="float: center" />
<br clear="left" />  

#### 9. Validation and Evaluation
After training, the network's performance is evaluated on validation data to check that the model is learning properly, that is, neither overfitting nor underfitting. Once satisfied, the network is tested on unseen data to measure its real-world performance. Running the trained model on new data in this way is often called inference.


## Vectorization and Matrix Operation

Before you start, please make sure that you have installed the `NumPy` library, a handy package for scientific computing in Python.  
To install NumPy, simply run `pip3 install numpy` or `pip install numpy` in your terminal, or prefix the command with an exclamation point (`!pip3 install numpy` or `!pip install numpy`) in a Jupyter cell.

#### Different types of vector/matrix operations
1. The dot product (inner scalar product)  
$\mathbf{x1} \cdot \mathbf{x2}$  
<img src="img/dot.png" alt="Dot Product" width="400px" style="float: center" />
<br clear="left" />  

2. Elementwise product (Hadamard product)  
$\mathbf{x1} \circ \mathbf{x2}$  
$\mathbf{x1} \odot \mathbf{x2}$  
<img src="img/hadamard.png" alt="Hadamard Product" width="400px" style="float: center" />
<br clear="left" />  

3. The outer product  
$\mathbf{x1} \otimes \mathbf{x2}$  
<img src="img/outer.png" alt="Outer Product" width="400px" style="float: center" />
<br clear="left" />  

#### Exercise 1

$\mathbf{A}$ = $\begin{pmatrix} 2&1 \\ 6&5  \end{pmatrix}$, 
$\mathbf{B}$ = $\begin{pmatrix} 3&4 \\ 8&7  \end{pmatrix}$  

Calculate the following and print out the results:  
1. $\mathbf{A} \cdot \mathbf{B}$.
2. $2\mathbf{A}$
3. $2\mathbf{A}^{T}$
4. $3\mathbf{A}^{T} \cdot \mathbf{B}$
5. $\mathbf{A} \odot \mathbf{B}$



In [1]:
### Your code here

import numpy as np

# Define matrices A and B
A = np.array([[2, 1],
              [6, 5]])

B = np.array([[3, 4],
              [8, 7]])

# Calculate and print the results
result1 = np.dot(A, B)
result2 = 2 * A
result3 = 2 * A.T
result4 = 3 * np.dot(A.T, B)
result5 = np.multiply(A, B)  # Element-wise multiplication

print("1. A * B:")
print(result1)

print("\n2. 2A:")
print(result2)

print("\n3. 2A^T:")
print(result3)

print("\n4. 3A^T * B:")
print(result4)

print("\n5. A ⊙ B (element-wise multiplication):")
print(result5)

1. A * B:
[[14 15]
 [58 59]]

2. 2A:
[[ 4  2]
 [12 10]]

3. 2A^T:
[[ 4 12]
 [ 2 10]]

4. 3A^T * B:
[[162 150]
 [129 117]]

5. A ⊙ B (element-wise multiplication):
[[ 6  4]
 [48 35]]


In [2]:
### Import libraries before writing your code
import time
import numpy as np
import math


x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

# np.random.seed(27)
# x1 = np.random.randint(500, 1000, 500)
# x2 = np.random.randint(0, 500, 500)
# print("x1[0]:", x1[0], "  ", "x2[0]:", x2[0])

### CLASSIC DOT PRODUCT OF VECTORS IMPLEMENTATION ###
tic = time.process_time()
dot = 0
for i in range(len(x1)):
    dot+= x1[i]*x2[i]
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("dot ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC OUTER PRODUCT IMPLEMENTATION ###
tic = time.process_time()
outer = np.zeros((len(x1),len(x2))) # we create a len(x1)*len(x2) matrix with only zeros
for i in range(len(x1)):
    for j in range(len(x2)):
        outer[i,j] = x1[i]*x2[j]
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("outer ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC ELEMENTWISE IMPLEMENTATION ###
tic = time.process_time()
mul = np.zeros(len(x1))
for i in range(len(x1)):
    mul[i] = x1[i]*x2[i]
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("elementwise multiplication ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC GENERAL DOT PRODUCT IMPLEMENTATION ###
W = np.random.rand(3,len(x1)) # Random 3*len(x1) numpy array
tic = time.process_time()
gdot = np.zeros(W.shape[0])
for i in range(W.shape[0]):
    for j in range(len(x1)):
        gdot[i] += W[i,j]*x1[j]
toc = time.process_time()
print ("gdot = " + str(gdot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("gdot ----- Computation time = " + str(1000*(toc - tic)) + "ms")

dot = 278
 ----- Computation time = 0.08290499999996648ms
outer = [[81. 18. 18. 81.  0. 81. 18. 45.  0.  0. 81. 18. 45.  0.  0.]
 [18.  4.  4. 18.  0. 18.  4. 10.  0.  0. 18.  4. 10.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [63. 14. 14. 63.  0. 63. 14. 35.  0.  0. 63. 14. 35.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [81. 18. 18. 81.  0. 81. 18. 45.  0.  0. 81. 18. 45.  0.  0.]
 [18.  4.  4. 18.  0. 18.  4. 10.  0.  0. 18.  4. 10.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0

In [3]:
x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

# np.random.seed(27)
# x1 = np.random.randint(500, 1000, 500)
# x2 = np.random.randint(0, 500, 500)
# print("x1[0]:", x1[0], "  ", "x2[0]:", x2[0])

### VECTORIZED DOT PRODUCT OF VECTORS ###
tic = time.process_time()
dot = np.dot(x1,x2)
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("dot ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED OUTER PRODUCT ###
tic = time.process_time()
outer = np.outer(x1,x2)
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("outer ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED ELEMENTWISE MULTIPLICATION ###
tic = time.process_time()
mul = np.multiply(x1,x2)
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("elementwise multiplication ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED GENERAL DOT PRODUCT ###
tic = time.process_time()
gdot = np.dot(W,x1)
toc = time.process_time()
print ("gdot = " + str(gdot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")
# print ("gdot ----- Computation time = " + str(1000*(toc - tic)) + "ms")

dot = 278
 ----- Computation time = 0.05275999999998504ms
outer = [[81 18 18 81  0 81 18 45  0  0 81 18 45  0  0]
 [18  4  4 18  0 18  4 10  0  0 18  4 10  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [63 14 14 63  0 63 14 35  0  0 63 14 35  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [81 18 18 81  0 81 18 45  0  0 81 18 45  0  0]
 [18  4  4 18  0 18  4 10  0  0 18  4 10  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]
 ----- Computation time = 0.05415100000005779ms
elementwise multiplication = [81  4 10  0  0 63 10  0  0  0 81  4 25  0  0]
 ----- Computation time = 0.028734000000030235ms
gdot = [24.46216207 20.10858413 29.05124

## Building basic functions with numpy

### Activation Functions

Referenced from [Activation functions](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#relu).

#### Sigmoid Function
The sigmoid function is also known as the logistic function. The equation can be written as: 

$$\textit{sigmoid(x)}=\frac{1}{1+e^{-\textit{x}}}$$

<img src='img/sigmoid.png' alt='Sigmoid Function' width="200px" style="float: center" />
<br clear="left" />

Before using np.exp(), you will use math.exp() to implement the sigmoid function. You will then see why np.exp() is preferable to math.exp().

#### Example

In [4]:
import math

def math_sigmoid(x):
    '''
    Compute sigmoid of x
    '''
    z = None
    # YOUR CODE HERE
    # raise NotImplementedError()
    z = 1 / (1 + math.exp(-x))
    return z

In [5]:
# test function - do not remove
print(math_sigmoid(5))

assert math_sigmoid(10) > 0.9999, "Calculate error"
assert math_sigmoid(-10) < 0.0001, "Calculate error"
assert math_sigmoid(0) == 0.5, "Calculate error"

0.9933071490757153


Actually, we rarely use the "math" library in deep learning because its functions take single real numbers as inputs, whereas in deep learning we mostly work with matrices and vectors. This is why numpy is more useful.

In [6]:
### One reason why we use "numpy" instead of "math" in Deep Learning ###
x = [1, 2, 3]
# math_sigmoid(x) # uncommenting this line raises an error, because x is a list (a vector), not a single number.

In fact, if $x = (x_{1}, x_{2},...,x_{n})$ is a row vector, then `np.exp(x)` applies the exponential function to every element of $x$. The output is thus:  

$$\text{np.exp}(x) = (e^{x_{1}},e^{x_{2}},...,e^{x_{n}})$$



In [7]:
# example of np.exp
x = np.array([1, 2, 3])
print(np.exp(x)) # result is (exp(1), exp(2), exp(3))

[ 2.71828183  7.3890561  20.08553692]


#### Exercise 2

In [8]:
# Grade cell - do not remove
# Use Numpy library to build your sigmoid function again

def Sigmoid(x):
    # output = None
    # YOUR CODE HERE
    # raise NotImplementedError()
    # return 
    return 1 / (1 + np.exp(-x))

In [9]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Sigmoid(a)
print(y_hat)

assert y_hat.shape[0] == 5, "sigmoid output is incorrect"
assert np.round(y_hat[0],4) == 0.7109, "sigmoid output is incorrect"
assert np.round(y_hat[1],4) == 0.5498, "sigmoid output is incorrect"
assert np.round(y_hat[2],4) == 0.5250, "sigmoid output is incorrect"
assert np.round(y_hat[3],4) == 0.4256, "sigmoid output is incorrect"
assert np.round(y_hat[4],4) == 0.3318, "sigmoid output is incorrect"

[0.7109495  0.549834   0.52497919 0.42555748 0.33181223]


In [10]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

a = np.array([0.9, 0.2, 0.1, -0.3, -0.7])
y_hat = sigmoid(a)
print(y_hat)

assert y_hat.shape[0] == 5, "sigmoid output is incorrect"
assert np.round(y_hat[0], 4) == 0.7109, "sigmoid output is incorrect"
assert np.round(y_hat[1], 4) == 0.5498, "sigmoid output is incorrect"
assert np.round(y_hat[2], 4) == 0.5250, "sigmoid output is incorrect"
assert np.round(y_hat[3], 4) == 0.4256, "sigmoid output is incorrect"
assert np.round(y_hat[4], 4) == 0.3318, "sigmoid output is incorrect"

[0.7109495  0.549834   0.52497919 0.42555748 0.33181223]


#### ReLU (Rectified Linear Unit)
ReLU stands for Rectified Linear Unit. The formula is deceptively simple: $\textit{max(0, x)}$. Despite its name and appearance, it is not linear and provides the same benefit as Sigmoid (i.e. the ability to learn nonlinear functions), while being cheaper to compute and often performing better in practice.  

$$\textit{ReLU(x)} = \textit{max(0, x)}$$  

<img src='img/relu.png' alt='ReLU Function' width="200px" style="float: center" />
<br clear="left" />

#### Exercise 3

In [11]:
# Grade cell - do not remove

def ReLu(x):
    # output = None
    # YOUR CODE HERE
    # raise NotImplementedError()
    # return output
    return np.maximum(0, x)

In [12]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = ReLu(a)
print(y_hat)

assert y_hat.shape[0] == 5, "ReLu output is incorrect"
assert y_hat[3] > a[3] and y_hat[3] == 0, "ReLu output is incorrect"
assert y_hat[4] > a[4] and y_hat[4] == 0, "ReLu output is incorrect"
assert y_hat[0] == a[0], "ReLu output is incorrect"
assert y_hat[1] == a[1], "ReLu output is incorrect"
assert y_hat[2] == a[2], "ReLu output is incorrect"

[0.9 0.2 0.1 0.  0. ]


#### Tanh (Hyperbolic Tangent)
Tanh squashes a real-valued number to the range (-1, 1). It is non-linear and, unlike Sigmoid, its output is zero-centered. For this reason, tanh is often preferred to sigmoid in hidden layers.  

$$\textit{Tanh(x)} = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$  

<img src='img/tanh.png' alt='Tanh Function' width="200px" style="float: center" />
<br clear="left" />

#### Exercise 4

In [13]:
# Grade cell - do not remove

def Tanh(x):
    output = np.tanh(x)
    # YOUR CODE HERE
    # raise NotImplementedError()
    return output

In [14]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Tanh(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Tanh output is incorrect"
assert np.round(y_hat[0],4) == 0.7163, "Tanh output is incorrect"
assert np.round(y_hat[1],4) == 0.1974, "Tanh output is incorrect"
assert np.round(y_hat[2],4) == 0.0997, "Tanh output is incorrect"
assert np.round(y_hat[3],4) == -0.2913, "Tanh output is incorrect"
assert np.round(y_hat[4],4) == -0.6044, "Tanh output is incorrect"

[ 0.71629787  0.19737532  0.09966799 -0.29131261 -0.60436778]


#### Softmax
The softmax function converts a vector of real-valued scores into a probability distribution over all possible target classes: each output lies between 0 and 1, and the outputs sum to 1. The calculated probabilities are then used to determine the target class for the given inputs.  

$$\text{Softmax}(z_{i}) = \frac{\exp(z_{i})}{\sum_{j}\exp(z_{j})}$$

#### Exercise 5

In [15]:
# Grade cell - do not remove

def Softmax(x):
    """
    Calculates the softmax for each row of the input x.

    The code should work for a row vector and also for matrices of shape (m,n).
    """
    # output = None
    # YOUR CODE HERE
    # raise NotImplementedError()
    
    # return output
    exp_x = np.exp(x)
    sum_exp_x = np.sum(exp_x, axis=-1, keepdims=True)
    return exp_x / sum_exp_x

In [16]:
# test function - do not remove

a = np.array([.9, 0.2, 0.1, -0.3, -0.7])

y_hat = Softmax(a)
print(y_hat)

assert y_hat.shape[0] == 5, "Softmax output is incorrect"
assert np.round(y_hat[0],4) == 0.4083, "Softmax output is incorrect"
assert np.round(y_hat[1],4) == 0.2028, "Softmax output is incorrect"
assert np.round(y_hat[2],4) == 0.1835, "Softmax output is incorrect"
assert np.round(y_hat[3],4) == 0.1230, "Softmax output is incorrect"
assert np.round(y_hat[4],4) == 0.0824, "Softmax output is incorrect"

[0.4083291  0.20277023 0.18347409 0.12298636 0.08244022]
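
The `Softmax` function above exponentiates the raw inputs, so very large values can overflow in `np.exp`. A common variant, sketched here under the name `stable_softmax` (a name chosen for this illustration), subtracts the row-wise maximum before exponentiating. Since softmax is unchanged when a constant is added to every input, the result is mathematically the same.

```python
import numpy as np

def stable_softmax(x):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)   # largest entry becomes 0
    exp_x = np.exp(shifted)                           # all values in (0, 1]
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Same input as in the test cell above
a = np.array([0.9, 0.2, 0.1, -0.3, -0.7])
print(stable_softmax(a))

# Large inputs that would overflow without the shift
big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))

# Works row by row for a matrix of shape (m, n)
m = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])
print(stable_softmax(m).sum(axis=-1))
```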
