Artificial Neural Networks (ANNs) are computational models loosely inspired by how the human brain processes information.

# Key Components of an ANN

1. Input Layer: This is where the network receives information. For example, in an image recognition task, the input could be an image.

2. Hidden Layers: These layers process the data received from the input layer. Adding hidden layers generally lets the network represent more complex patterns, because each hidden layer transforms the data into a more abstract representation.

3. Output Layer: This is where the final decision or prediction is made. For example, after processing an image, the output layer might decide whether it’s a cat or a dog.

### Other Important Terms in ANN

# 🔥 **Weights & Bias in ANN (Simple + Deep Explanation)**

## ✅ **1. What Are Weights?**

**Weights are the strength of the connections between neurons.**
They decide *how much importance* a neural network gives to an input.

### 🎯 Think like this:

* Input = a feature (like height, marks, pixel value, etc.)
* Weight = importance multiplier

### 📌 Example:

Input: $x = 5$
Weight: $w = 0.8$

Contribution to neuron:

$$w \cdot x = 0.8 \times 5 = 4$$

If the weight were larger (e.g., 5), the contribution would grow to 5 × 5 = 25 → **the input becomes more important**.

### 👉 In ANN:

Every connection has its own weight.
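To make this concrete, here is a minimal sketch with two inputs. The first pair reuses the 0.8 × 5 example; the second feature and its weight are made-up values added only for contrast.

```python
import numpy as np

# Illustrative values: (x=5, w=0.8) is the example from the text;
# the second feature and its weight are assumptions.
x = np.array([5.0, 2.0])   # inputs (features)
w = np.array([0.8, 0.1])   # one weight per connection

contributions = w * x               # each weight scales its own input
weighted_sum = contributions.sum()  # what the neuron receives before bias/activation

print(contributions)
print(weighted_sum)
```

The first input dominates the sum because its weight is larger, which is what "importance" means here.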



## ✅ **2. What Is Bias?**

**Bias is an extra parameter added to shift the output of a neuron.**

It helps the model **fit data better** by allowing the activation function to shift left/right.

### 🎯 Think like this:

Bias behaves like the **intercept (b)** in linear regression:

$$y = wx + b$$

Without bias, the line must pass through the origin (0, 0).
With bias, the model gets flexibility.

### 📌 Example:

If:

* $w = 1$
* $x = 3$
* $b = 2$

Output becomes:

$$w \cdot x + b = 1 \times 3 + 2 = 5$$

Bias lets the neuron fire even when inputs are zero.
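The effect of the bias at zero input can be checked with a small sketch. The `neuron` helper and the choice of a sigmoid activation are assumptions made for illustration.

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: weighted input plus bias, passed through a sigmoid."""
    z = w * x + b
    return 1.0 / (1.0 + np.exp(-z))

# With zero input the weight has no effect; only the bias moves the output.
out_no_bias = neuron(x=0.0, w=1.0, b=0.0)    # sigmoid(0) = 0.5
out_with_bias = neuron(x=0.0, w=1.0, b=2.0)  # sigmoid(2), roughly 0.88

# The linear example from the text: 1 * 3 + 2 = 5 before any activation.
pre_activation = 1.0 * 3.0 + 2.0

print(out_no_bias, out_with_bias, pre_activation)
```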



## 🧠 **Why Are Weight & Bias Important?**

* **Weights** learn the relationship between inputs and outputs.
* **Bias** gives the network freedom to adjust outputs better.
* Together they help the model learn complex patterns.



## 🖼️ **Diagram (Conceptual)**

```
   x ---- w ----> ( wx + b ) ---> Activation ----> Output
               ↑
              bias
```



## 🔍 **Short Summary**

| Concept    | Meaning          | Purpose                                  |
| ---------- | ---------------- | ---------------------------------------- |
| **Weight** | Multiplies input | Learns importance of each feature        |
| **Bias**   | Added constant   | Shifts activation, increases flexibility |

---

# Working of Artificial Neural Networks

1. Input Layer: Data such as an image, text or number is fed into the network through the input layer.

2. Hidden Layers: Each neuron in the hidden layers performs some calculation on the input, passing the result to the next layer. The data is transformed and abstracted at each layer.

3. Output Layer: After passing through all the layers, the network gives its final prediction like classifying an image as a cat or a dog.

4. The process of "backpropagation" is used to adjust the weights between neurons. When the network makes a mistake, the weights are updated to reduce the error and improve the next prediction.

---

# How do Artificial Neural Networks learn?

1. Artificial Neural Networks (ANNs) learn by training on a set of labeled data. For example, to teach an ANN to recognize a cat, we show it thousands of images, each labeled as cat or not cat. The network processes these images and learns to identify the features that define a cat.

2. During training, the network's prediction for each image is compared to the actual label (whether it's a cat or not). If the prediction is wrong, the network adjusts the weights of the connections between neurons using a process called backpropagation, which corrects the weights based on the difference between the predicted and actual result.

3. This process repeats until the network can recognize a cat in an image with minimal error. Once training is done, we test the network on new images it has not seen to check whether it still identifies cats correctly. Essentially, through repeated training and feedback, the network becomes better at identifying patterns and making predictions.
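The predict, compare, adjust cycle described above can be sketched on a much smaller problem than image recognition. This is a toy setup, not the cat classifier: a single linear neuron, a made-up target rule `y = 2x + 1`, and an arbitrary learning rate.

```python
import numpy as np

# Toy data (an assumption for illustration): the target rule is y = 2x + 1.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 2.0 * x_train + 1.0
x_test = np.array([5.0, 6.0])       # held out, never used for weight updates
y_test = 2.0 * x_test + 1.0

w, b = 0.0, 0.0                     # start with an untrained neuron
learning_rate = 0.05

for epoch in range(2000):
    pred = w * x_train + b          # 1. predict
    error = pred - y_train          # 2. compare with the labels
    # 3. adjust: gradient of the mean squared error w.r.t. w and b
    w -= learning_rate * 2.0 * np.mean(error * x_train)
    b -= learning_rate * 2.0 * np.mean(error)

# Evaluate on data the neuron never saw during training.
test_mse = np.mean((w * x_test + b - y_test) ** 2)
print(f"w={w:.3f}, b={b:.3f}, test MSE={test_mse:.6f}")
```

Keeping the test points out of the update loop is what makes the final error a fair check of what was learned.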

---

# Backpropagation in Neural Network

Backpropagation, short for Backward Propagation of Errors, is a key algorithm used to train neural networks by minimizing the difference between predicted and actual outputs. It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients and then iteratively updating the weights and biases. Combined with optimization techniques like gradient descent, backpropagation enables the model to reduce loss across epochs and effectively learn complex patterns from data.

![image.png](attachment:image.png)

# Working of Back Propagation Algorithm

* The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward Pass.

1. Forward Pass

In the forward pass, the input data is fed into the input layer. These inputs, combined with their respective weights, are passed to the hidden layers. For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2. A bias is added to the weighted inputs before the activation function is applied.

Each hidden layer computes the weighted sum (`a`) of its inputs, then applies an activation function such as ReLU (Rectified Linear Unit) to obtain the output (`o`). The output of the last hidden layer is passed to the output layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.

![image.png](attachment:image.png)
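A forward pass through two hidden layers (h1 and h2) with ReLU and a softmax output might look like the sketch below. The layer sizes, random weights and variable names are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    # Subtract the row max for numerical stability; the result is unchanged.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 input features, 4 units in h1 and h2, 2 classes.
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 4)), np.zeros((1, 4))
W3, b3 = rng.normal(size=(4, 2)), np.zeros((1, 2))

X = rng.normal(size=(5, 3))        # batch of 5 samples

a1 = X @ W1 + b1                   # weighted sum plus bias
o1 = relu(a1)                      # output of h1, input to h2
a2 = o1 @ W2 + b2
o2 = relu(a2)                      # output of h2
probs = softmax(o2 @ W3 + b3)      # class probabilities, one row per sample

print(probs)
```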

2. Backward Pass

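* In the backward pass the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common method for error calculation is the Mean Squared Error (MSE), which averages the squared differences over the $n$ training examples:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\text{Predicted Output}_i - \text{Actual Output}_i\right)^2$$

For a single example ($n = 1$) this is simply the squared difference between the predicted and actual output.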


* Once the error is calculated the network adjusts weights using gradients which are computed with the chain rule. These gradients indicate how much each weight and bias should be adjusted to minimize the error in the next iteration. The backward pass continues layer by layer ensuring that the network learns and improves its performance. The activation function through its derivative plays a crucial role in computing these gradients during Back Propagation.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
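As a worked instance of the chain rule, consider a single sigmoid neuron with one input. Here $a$ is the weighted sum, $o$ the output, $y$ the target and $\eta$ the learning rate; using the squared error of one example is an assumption made to keep the derivation short.

$$a = wx + b, \qquad o = \sigma(a), \qquad E = (o - y)^2$$

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial o}\cdot\frac{\partial o}{\partial a}\cdot\frac{\partial a}{\partial w} = 2(o - y)\cdot o(1 - o)\cdot x$$

$$\frac{\partial E}{\partial b} = \frac{\partial E}{\partial o}\cdot\frac{\partial o}{\partial a}\cdot\frac{\partial a}{\partial b} = 2(o - y)\cdot o(1 - o)$$

$$w \leftarrow w - \eta\,\frac{\partial E}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial E}{\partial b}$$

The middle factor $o(1 - o)$ is the derivative of the sigmoid, which is why the derivative of the activation function appears in every gradient.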

#### A lower MSE indicates that the model's predictions are closer to the actual values, signifying better accuracy.
#### Conversely, a higher MSE suggests that the model's predictions deviate further from the true value, indicating poorer performance.
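A small sketch of the MSE computation, using made-up targets and predictions, shows how the value tracks prediction quality. The function name `mean_squared_error` is defined here for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Made-up values for illustration.
y_true = [1.0, 0.0, 1.0, 1.0]
close_preds = [0.9, 0.1, 0.8, 1.0]
far_preds = [0.2, 0.9, 0.1, 0.3]

mse_close = mean_squared_error(y_true, close_preds)  # small errors give 0.015
mse_far = mean_squared_error(y_true, far_preds)      # large errors give 0.6875

print(mse_close, mse_far)
```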

*******************************************************************************************************************************************

#### How to Minimize Mean Squared Error in Model Training

To minimize Mean Squared Error during the model training, several strategies can be employed, including:

1. Feature selection: Choosing the relevant features that contribute most to reducing prediction errors.
2. Model selection: Experimenting with different algorithms and model architectures to identify the best-performing model.
3. Hyperparameter tuning: Optimizing model hyperparameters such as the learning rate, regularization strength, and network depth to improve predictive accuracy.

*******************************************************************************************************************************************

#### Root Mean Square Error

The Root Mean Squared Error (RMSE) is a variant of MSE that calculates the square root of the average squared difference between actual and predicted values. It is often preferred over MSE as it provides an interpretable measure of the error in the same units as the original data.


$$\text{RMSE} = \sqrt{\text{MSE}}$$

*******************************************************************************************************************************************

#### MSE vs RMSE

1. Mean Squared Error is often compared with other error metrics, such as the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to evaluate model performance.

2. MAE measures the average absolute difference between predicted and actual values, while RMSE measures the square root of the average squared difference. Because they square each error, MSE and RMSE penalize large errors more heavily than MAE, which makes them more sensitive to outliers.
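The sensitivity to outliers can be seen by comparing two sets of predictions with the same MAE. The data below is invented so that the arithmetic stays easy to follow.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_even = y_true + np.array([2.0, -2.0, 2.0, -2.0])   # four errors of size 2
y_outlier = y_true + np.array([0.0, 0.0, 0.0, 8.0])  # one error of size 8

# Both prediction sets have the same MAE (2.0) ...
mae_even, mae_outlier = mae(y_true, y_even), mae(y_true, y_outlier)
# ... but RMSE is 2.0 for the even errors and 4.0 when one error dominates.
rmse_even, rmse_outlier = rmse(y_true, y_even), rmse(y_true, y_outlier)

print(mae_even, mae_outlier, rmse_even, rmse_outlier)
```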

# Example of Back Propagation in Machine Learning

Let’s walk through an example of Back Propagation in machine learning. Assume the neurons use the sigmoid activation function for the forward and backward pass. The target output is 0.5 and the learning rate is 1.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![alt text](image.png)

![alt text](image-1.png)
![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Using this error value we will be backpropagating.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)

# Back Propagation Implementation in Python for XOR Problem

1. Defining Neural Network

We define a neural network with an input layer of 2 inputs, a hidden layer of 4 neurons and an output layer of 1 neuron, and use the sigmoid function as the activation function.

* `self.input_size = input_size`: stores the size of the input layer
* `self.hidden_size = hidden_size`: stores the size of the hidden layer
* `self.output_size = output_size`: stores the size of the output layer
* `self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)`: initializes weights for input to hidden layer
* `self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)`: initializes weights for hidden to output layer
* `self.bias_hidden = np.zeros((1, self.hidden_size))`: initializes bias for hidden layer
* `self.bias_output = np.zeros((1, self.output_size))`: initializes bias for output layer

```python
import numpy as np


class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Weights start as random draws from a standard normal distribution
        self.weights_input_hidden = np.random.randn(
            self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(
            self.hidden_size, self.output_size)

        # Biases start at zero
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x must already be a sigmoid output s, since sigmoid'(z) = s * (1 - s)
        return x * (1 - x)
```
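One detail worth checking is that `sigmoid_derivative` takes the already activated value, not the raw input. The sketch below uses standalone copies of the two functions (an assumption made so the block runs on its own) and compares them with a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    # Expects s = sigmoid(x), i.e. the activated output, not the raw input x.
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6

# Central finite difference as an independent estimate of d(sigmoid)/dx.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_derivative(sigmoid(x))   # correct usage
misused = sigmoid_derivative(x)             # wrong: raw input passed in

print(numeric)
print(analytic)
print(misused)
```

The first two printed arrays agree; the third does not, which is the mistake to avoid when reusing this helper.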

2. Defining Feed Forward Network

In the forward pass, inputs are passed through the network, activating the hidden and output layers using the sigmoid function.

* `self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden`: calculates activation for hidden layer
* `self.hidden_output = self.sigmoid(self.hidden_activation)`: applies activation function to hidden layer
* `self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output`: calculates activation for output layer
* `self.predicted_output = self.sigmoid(self.output_activation)`: applies activation function to output layer

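```python
def feedforward(self, X):
    # Hidden layer: weighted sum plus bias, then sigmoid
    self.hidden_activation = np.dot(
        X, self.weights_input_hidden) + self.bias_hidden
    self.hidden_output = self.sigmoid(self.hidden_activation)

    # Output layer: same pattern, fed by the hidden layer's output
    self.output_activation = np.dot(
        self.hidden_output, self.weights_hidden_output) + self.bias_output
    self.predicted_output = self.sigmoid(self.output_activation)

    return self.predicted_output


# Defined in its own cell, so attach it to the class as a method
NeuralNetwork.feedforward = feedforward
```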

3. Defining Backward Network

In the backward pass (Back Propagation), the errors between the predicted and actual outputs are computed. The gradients are calculated using the derivative of the sigmoid function, and the weights and biases are updated accordingly.

* `output_error = y - self.predicted_output`: calculates the error at the output layer
* `output_delta = output_error * self.sigmoid_derivative(self.predicted_output)`: calculates the delta for the output layer
* `hidden_error = np.dot(output_delta, self.weights_hidden_output.T)`: calculates the error at the hidden layer
* `hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)`: calculates the delta for the hidden layer
* `self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate`: updates weights between hidden and output layers
* `self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate`: updates the bias of the output layer
* `self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate`: updates weights between input and hidden layers
* `self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate`: updates the bias of the hidden layer

```python
def backward(self, X, y, learning_rate):
    # Error and delta at the output layer (error is target minus prediction)
    output_error = y - self.predicted_output
    output_delta = output_error * \
        self.sigmoid_derivative(self.predicted_output)

    # Propagate the delta back through the hidden-to-output weights
    hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
    hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)

    # Adding (not subtracting) is correct because the error is y - prediction
    self.weights_hidden_output += np.dot(self.hidden_output.T,
                                         output_delta) * learning_rate
    self.bias_output += np.sum(output_delta, axis=0,
                               keepdims=True) * learning_rate
    self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
    self.bias_hidden += np.sum(hidden_delta, axis=0,
                               keepdims=True) * learning_rate


# Defined in its own cell, so attach it to the class as a method
NeuralNetwork.backward = backward
```

4. Training Network

The network is trained for 10,000 epochs using the Back Propagation algorithm with a learning rate of 0.1, progressively reducing the error. The loss is printed every 4,000 epochs.

* `output = self.feedforward(X)`: computes the output for the current inputs
* `self.backward(X, y, learning_rate)`: updates weights and biases using Back Propagation
* `loss = np.mean(np.square(y - output))`: calculates the mean squared error (MSE) loss

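```python
def train(self, X, y, epochs, learning_rate):
    for epoch in range(epochs):
        output = self.feedforward(X)          # forward pass
        self.backward(X, y, learning_rate)    # backward pass and update
        if epoch % 4000 == 0:
            # MSE of the predictions made before this epoch's update
            loss = np.mean(np.square(y - output))
            print(f"Epoch {epoch}, Loss: {loss}")


# Defined in its own cell, so attach it to the class as a method
NeuralNetwork.train = train
```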

5. Testing Neural Network

* `X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])`: defines the input data
* `y = np.array([[0], [1], [1], [0]])`: defines the target values
* `nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)`: initializes the neural network
* `nn.train(X, y, epochs=10000, learning_rate=0.1)`: trains the network
* `output = nn.feedforward(X)`: gets the final predictions after training

```python
# XOR truth table: inputs and their target outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)

output = nn.feedforward(X)
print("Predictions after training:")
print(output)
```

![image.png](attachment:image.png)

# Challenges

While Back Propagation is useful, it does face some challenges:

1. Vanishing Gradient Problem: In deep networks the gradients can become very small during Back Propagation making it difficult for the network to learn. This is common when using activation functions like sigmoid or tanh.

2. Exploding Gradients: The gradients can also become excessively large causing the network to diverge during training.

---

# Vanishing Gradient Problem

![image.png](attachment:image.png)

# Exploding Gradients Problem
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

### Why do the Gradients Vanish or Explode

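* Activation Functions: Sigmoid and tanh saturate for large positive or negative inputs, where their derivatives approach zero (the sigmoid derivative never exceeds 0.25), so gradients shrink at each layer.
* Weight Initialization: Weights that are too small cause gradients to vanish; weights that are too large cause them to explode.
* Deep Networks: Backpropagation multiplies gradients layer by layer, so small or large factors compound across many layers and lead to instability.
* Learning Rate: A high learning rate or unscaled inputs can make gradients explode.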
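The compounding effect can be illustrated with a deliberately simplified model: one unit per layer, the same weight in every layer, and the sigmoid derivative fixed at its maximum. These simplifications, and the `gradient_scale` helper, are assumptions for illustration rather than a description of a real network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative s * (1 - s) peaks at 0.25 (at x = 0).
max_sigmoid_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))

# Simplified model: the backpropagated gradient is a product of
# (weight * activation derivative) terms, one per layer.
def gradient_scale(weight, depth, local_grad=max_sigmoid_grad):
    return (weight * local_grad) ** depth

vanishing = gradient_scale(weight=1.0, depth=20)   # 0.25**20, about 9e-13
exploding = gradient_scale(weight=8.0, depth=20)   # 2.0**20, about 1e6

print(vanishing, exploding)
```

Even in the best case for sigmoid, twenty layers with unit weights shrink the gradient by roughly twelve orders of magnitude, while large weights push it the other way.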

### Techniques to Fix Vanishing and Exploding Gradients

1. Proper Weight Initialization
Choosing the right weight initialization keeps gradients balanced during backpropagation.

* Xavier Initialization: Keeps activation variance consistent across layers to stabilize gradients.
* Kaiming Initialization: Scales weights for ReLU to preserve signal strength and prevent gradient decay.

2. Use Non Saturating Activation Functions
Sigmoid and Tanh can shrink gradients. Using ReLU or its variants helps mitigate vanishing gradients:

* ReLU: Basic rectified linear unit.
* Leaky ReLU: Allows small gradients for negative inputs.
* ELU / SELU: Smooth for negative inputs; SELU is additionally designed to keep activations self-normalizing.

3. Apply Batch Normalization
Normalizes layer inputs to have zero mean and unit variance, stabilizing gradients and accelerating convergence.

4. Gradient Clipping
Limits gradients to a maximum threshold to prevent them from exploding and destabilizing training.


![image.png](attachment:image.png)
![image-3.png](attachment:image-3.png)
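Three of the techniques above can be sketched in NumPy. The formulas follow the commonly used definitions (Xavier uniform limit `sqrt(6 / (fan_in + fan_out))`, Kaiming standard deviation `sqrt(2 / fan_in)`, clipping by L2 norm); the function names and layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Xavier/Glorot: uniform in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def kaiming_normal(fan_in, fan_out):
    """Kaiming/He: normal with standard deviation sqrt(2 / fan_in), intended for ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def leaky_relu(x, alpha=0.01):
    """Keeps a small slope (alpha) for negative inputs so the gradient is never zero."""
    return np.where(x > 0, x, alpha * x)

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

W_x = xavier_uniform(256, 128)
W_k = kaiming_normal(256, 128)
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)  # norm 50 scaled to norm 5

print(W_x.std(), W_k.std(), clipped)
```

Clipping by norm keeps the gradient's direction and only shortens it, which is why it is often preferred over clipping each component separately.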

---

# Activation Functions in ANN

1. An activation function in a neural network is a mathematical function applied to the output of a neuron. It introduces non-linearity, enabling the model to learn and represent complex data patterns. Without it, a stack of layers collapses into a single linear transformation, so even a deep neural network would behave like a simple linear model.

2. Activation functions decide whether a neuron should be activated based on the weighted sum of inputs and a bias term. They also make backpropagation possible by providing gradients for weight updates.

![image.png](attachment:image.png)

### Why Non-Linearity is Important

1. Real-world data is rarely linearly separable.
2. Non-linear functions allow neural networks to form curved decision boundaries, making them capable of handling complex patterns (e.g.,  classifying apples vs. bananas under varying colors and shapes).
3. They ensure networks can model advanced problems like image recognition, NLP and speech processing.
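The need for non-linearity can be verified numerically: two layers without an activation in between are equivalent to one linear layer. The matrix sizes and random values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# Two layers with no activation function in between ...
two_linear_layers = (X @ W1) @ W2
# ... are exactly one linear layer whose weight matrix is W1 @ W2.
one_linear_layer = X @ (W1 @ W2)

# Inserting a ReLU between the layers breaks this equivalence.
with_relu = np.maximum(0.0, X @ W1) @ W2

print(np.allclose(two_linear_layers, one_linear_layer))
print(np.allclose(with_relu, one_linear_layer))
```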

# Types of Activation Functions in Deep Learning

## 1. Linear Activation Function
![image-2.png](attachment:image-2.png)

## 2. Non Linear Activation Functions
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)

## 3. Exponential Linear Units
![image-7.png](attachment:image-7.png)
![image-8.png](attachment:image-8.png)
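For reference, here is a sketch of several widely used activation functions in NumPy. The selection and the default `alpha` values are assumptions and may not match the figures above exactly.

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output in (0, 1)

def tanh(x):
    return np.tanh(x)                        # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope for negative inputs

def elu(x, alpha=1.0):
    # expm1 on min(x, 0) avoids overflow for large positive x
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def softmax(x):
    e = np.exp(x - np.max(x))                # shift for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
for fn in (linear, sigmoid, tanh, relu, leaky_relu, elu, softmax):
    print(fn.__name__, fn(x))
```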

---

# Optimizers in ANN

In machine learning, optimizers and loss functions are two fundamental components that help improve a model’s performance.

1. A loss function evaluates a model's effectiveness by computing the difference between predicted and actual outputs. Common loss functions include log loss, hinge loss and mean squared error.
2. An optimizer improves the model by adjusting its parameters (weights and biases) to minimize the loss function value.
3. The optimizer’s role is to find the best combination of weights and biases that leads to the most accurate predictions.

## Gradient Descent

Gradient Descent is a popular optimization method for training machine learning models. It works by iteratively adjusting the model parameters in the direction that minimizes the loss function.

![image.png](attachment:image.png)

#### Key Steps in Gradient Descent

1. Initialize parameters: Randomly initialize the model parameters.
2. Compute the gradient: Calculate the gradient (derivative) of the loss function with respect to the parameters.
3. Update parameters: Adjust the parameters by moving in the opposite direction of the gradient, scaled by the learning rate.
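The three steps can be sketched on a one-parameter toy problem. The loss `(theta - 3)^2`, the starting point and the learning rate are assumptions chosen so the minimum is known in advance.

```python
# Toy loss (an assumption for illustration): L(theta) = (theta - 3)^2,
# which is minimized at theta = 3.
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0            # 1. initialize (fixed here instead of random, for reproducibility)
learning_rate = 0.1
history = [loss(theta)]

for step in range(100):
    g = grad(theta)                  # 2. compute the gradient
    theta -= learning_rate * g       # 3. step against the gradient
    history.append(loss(theta))

print(theta, history[0], history[-1])
```

With this learning rate the distance to the minimum shrinks by a constant factor of 0.8 per step; a rate above 1.0 would make the same loop diverge.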

## Variants of Gradient Descent

### 1. Stochastic Gradient Descent (SGD)
![image-2.png](attachment:image-2.png)

### 2. Mini Batch Gradient Descent
![image-3.png](attachment:image-3.png)

### 3. SGD with Momentum
![image-4.png](attachment:image-4.png)
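The variants above differ mainly in how much data feeds each update and whether past gradients are reused. A minimal sketch of mini-batch SGD with momentum on synthetic linear data follows; the data, the hyperparameters and the velocity names `v_w` and `v_b` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption): y = 3x - 1 plus a little noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X - 1.0 + rng.normal(0.0, 0.01, size=200)

w, b = 0.0, 0.0
v_w, v_b = 0.0, 0.0            # velocity terms for momentum
learning_rate, momentum, batch_size = 0.1, 0.9, 20

for epoch in range(100):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        error = w * X[idx] + b - y[idx]
        g_w = 2.0 * np.mean(error * X[idx])      # MSE gradient on this batch only
        g_b = 2.0 * np.mean(error)
        v_w = momentum * v_w - learning_rate * g_w   # blend in past gradients
        v_b = momentum * v_b - learning_rate * g_b
        w += v_w
        b += v_b

print(f"w={w:.3f}, b={b:.3f}")
```

Setting `batch_size = 1` turns this into plain stochastic gradient descent, `batch_size = len(X)` into full-batch gradient descent, and `momentum = 0.0` removes the velocity term.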

## Advanced Optimizers
### 1. AdaGrad
Adagrad (Adaptive Gradient Algorithm) is an optimization method that adjusts the learning rate for each parameter during training, unlike standard gradient descent, which uses a single fixed learning rate.

#### How Adagrad Works
The primary concept behind Adagrad is the idea of adapting the learning rate based on the historical sum of squared gradients for each parameter. Here's a step-by-step explanation of how Adagrad works:

1. Initialization: Adagrad begins by initializing the parameter values randomly, just like other optimization algorithms. Additionally, it initializes a running sum of squared gradients for each parameter which will track the gradients over time.

2. Gradient Calculation: For each training step, the gradient of the loss function with respect to the model's parameters is calculated, just like in standard gradient descent.

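3. Adaptive Learning Rate: The key difference comes next. Instead of using a fixed learning rate, Adagrad adjusts the learning rate for each parameter based on the accumulated sum of squared gradients. Parameters with a large accumulated sum take smaller steps, while rarely updated parameters take larger ones.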
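The steps above can be sketched as a single update function. The toy loss, the learning rate of 0.5 and the name `adagrad_step` are assumptions for illustration; the update rule itself divides the learning rate by the square root of the running sum of squared gradients.

```python
import numpy as np

def adagrad_step(params, grads, accum, learning_rate=0.5, eps=1e-8):
    """One Adagrad update; returns the new parameters and the new running sum."""
    accum = accum + grads ** 2                      # running sum of squared gradients
    step = learning_rate / (np.sqrt(accum) + eps)   # per-parameter learning rate
    return params - step * grads, accum

# Toy loss (an assumption): L = 10 * p0^2 + 0.1 * p1^2, so the two
# gradients differ in scale by a factor of 100.
def grad(p):
    return np.array([20.0 * p[0], 0.2 * p[1]])

params = np.array([1.0, 1.0])
accum = np.zeros(2)
first_step = None
for t in range(200):
    new_params, accum = adagrad_step(params, grad(params), accum)
    if first_step is None:
        first_step = params - new_params   # how far each parameter moved initially
    params = new_params

print(first_step, params)
```

Despite the 100x difference in gradient size, both parameters move by about 0.5 on the first step, because each gradient is divided by its own accumulated magnitude.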

#### When to Use Adagrad?
Adagrad is ideal for:

1. Problems with sparse data and features like in natural language processing or recommender systems.
2. Tasks where features have different levels of importance and frequency.
3. Training models that do not require a very fast convergence rate but benefit from a more stable optimization process.

* However, because Adagrad's accumulated sum only grows, its effective learning rate keeps shrinking during training. If a more constant learning rate is preferable, variants like RMSProp or Adam might be more appropriate.
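The difference between Adagrad's accumulator and the moving average used by RMSProp can be sketched by feeding both the same constant gradient. The `rmsprop_step` helper and the decay of 0.9 are assumptions based on the commonly quoted form of RMSProp.

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, learning_rate=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update using an exponential moving average of squared gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grads ** 2
    return params - learning_rate * grads / (np.sqrt(avg_sq) + eps), avg_sq

# Feed the same constant gradient many times and compare the scaling factors.
g = np.array([1.0])
adagrad_accum, rms_avg = np.zeros(1), np.zeros(1)
for t in range(1000):
    adagrad_accum = adagrad_accum + g ** 2     # Adagrad: the sum keeps growing
    rms_avg = 0.9 * rms_avg + 0.1 * g ** 2     # RMSProp: the average levels off

adagrad_scale = 1.0 / np.sqrt(adagrad_accum)   # 1 / sqrt(1000), about 0.03 and still shrinking
rmsprop_scale = 1.0 / np.sqrt(rms_avg)         # stays close to 1.0

# A single RMSProp update on a one-parameter example.
p, a = rmsprop_step(np.array([1.0]), np.array([2.0]), np.zeros(1))

print(adagrad_scale, rmsprop_scale, p)
```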


### Different Variants of Adagrad Optimizer
![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)
![image-7.png](attachment:image-7.png)

---

# Types of Artificial Neural Networks (ANN)
1. ANN
2. CNN
3. RNN
4. LSTM
5. LSTM RNN
6. GRU
7. Bidirectional RNN
8. Encoder and Decoder 
9. Attention Mechanism - seq2seq network
10. Transformers 