### Q1. What is the purpose of forward propagation in a neural network?

Forward propagation is a crucial step in the training and inference process of a neural network. It refers to the process of moving input data through the network's layers, from the input layer to the output layer, to generate predictions or activations. The purpose of forward propagation can be understood in the following key aspects:

![image.png](attachment:image.png)

1. ##### Generating Predictions:
   * Training Phase: In the training phase, forward propagation is used to generate predictions for the input data. The predictions are then compared to the actual target values to compute the loss (a measure of the model's performance).

   * Inference Phase: In the inference phase (or prediction phase), forward propagation is used to obtain predictions for new, unseen data.
     
     
2. ##### Computing Activations:
   * Forward propagation involves computing the weighted sum of input features at each neuron, applying an activation function, and passing the result to the next layer. This process results in the activation values for each neuron in the network.
   * The activations provide a representation of the input data as it passes through the network, capturing hierarchical features and representations in the deeper layers.
   
3. ##### Parameter Updates (Training):
   * During the training phase, forward propagation is a precursor to backward propagation (backpropagation). After obtaining predictions, the model compares them to the actual targets to compute the loss. The gradients of the loss with respect to the model's parameters are then calculated during backpropagation.

   * The calculated gradients are used to update the model's parameters (weights and biases) in the direction that minimizes the loss, enabling the network to learn and improve its performance.
   
   
4. ##### Understanding Model Behavior:
   * Forward propagation allows practitioners to observe how input data is transformed at each layer of the neural network. This is valuable for understanding the hierarchical and nonlinear representations that the model learns.

   * By inspecting intermediate layer activations, practitioners can gain insights into what features the model considers important for making predictions.
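
The points above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the ReLU hidden activation, the linear output layer, and the names `forward` and `params` are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Run a forward pass; return the output and every layer's activation."""
    activations = [x]
    a = x
    for i, (W, b) in enumerate(params):
        z = W @ a + b  # weighted sum plus bias
        # Assumption: hidden layers use ReLU, the last layer is left linear.
        a = relu(z) if i < len(params) - 1 else z
        activations.append(a)
    return a, activations

rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # 3 inputs -> 4 hidden units
    (rng.normal(size=(2, 4)), np.zeros(2)),  # 4 hidden units -> 2 outputs
]
x = np.array([0.5, -1.0, 2.0])
y_hat, activations = forward(x, params)

# Inspecting the intermediate activations layer by layer.
for i, a in enumerate(activations):
    print(f"layer {i}: shape {a.shape}")
```

The same `forward` call serves both training (its output feeds the loss) and inference (its output is the prediction); keeping the list of activations is what makes layer-by-layer inspection possible.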

### Q2. How is forward propagation implemented mathematically in a single-layer feedforward neural network?

Feedforward neural networks are artificial neural networks whose connections do not form loops: information is only passed forward, from input to output. A single-layer feedforward network (a single-layer perceptron) connects the inputs directly to the output layer, whereas a multi-layer network adds one or more hidden layers in between.

During data flow, input nodes receive data, which travels through any hidden layers and exits at the output nodes. The network has no links that send information back from the output nodes.

A feed forward neural network approximates functions in the following way:

* The goal is to approximate some function y = f*(x). For a classifier, f* assigns an input x to a category y.
* The feedforward model defines a mapping y = f(x; θ).
* Training learns the values of the parameters θ that give the closest approximation of f*.

Feed forward neural networks serve as the basis for object detection in photos, as shown in the Google Photos app.

![image.png](attachment:image.png)

The single-layer perceptron is a feedforward model that is often used for classification. It can be trained: the delta rule adjusts the weights according to the difference between the network's outputs and the intended values.

This training amounts to gradient descent on the error. Multi-layer perceptrons update their weights in a similar way, but there the process is known as backpropagation: the weights of the hidden layers are adjusted according to the error in the output values produced by the final layer.

* ##### Loss function:

The loss function of a neural network measures how far the predictions are from the targets, and therefore how the parameters need to be adjusted during learning.

For classification, the number of neurons in the output layer equals the number of classes, and the cross-entropy loss measures the difference between the predicted and actual probability distributions. The cross-entropy loss for binary classification is as follows.

![image.png](attachment:image.png)
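
As a rough sketch of the mathematics for a single layer: the output is $\hat{y} = \sigma(Xw + b)$, and the loss is the standard binary cross-entropy. The sigmoid output, the toy data, and the names `forward_single_layer` and `binary_cross_entropy` are assumptions for illustration; the loss is assumed to match the formula in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_single_layer(X, w, b):
    """X: (n_samples, n_features), w: (n_features,), b: scalar."""
    z = X @ w + b      # weighted sum for every sample
    return sigmoid(z)  # predicted probability of class 1

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0); eps is an arbitrary small constant.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
w = np.zeros(2)
b = 0.0

y_hat = forward_single_layer(X, w, b)
loss = binary_cross_entropy(y, y_hat)
print(y_hat, loss)
```

With all-zero weights and bias, every prediction is 0.5 and the loss equals $\ln 2 \approx 0.693$, which is a convenient sanity check before training.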
    

### Q3. How are activation functions used during forward propagation?

During forward propagation, pre-activation and activation take place at each hidden and output layer node of a neural network. Pre-activation is the calculation of the weighted sum of the node's inputs plus the bias. The activation function is then applied to this weighted sum, introducing the non-linearity that lets the network model non-linear relationships.

There is no definitive guide for which activation function works best on specific problems. It’s a trial and error process where one should try different functions and see which one works best on the problem at hand.
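
The two steps can be illustrated with a small NumPy sketch. The weights, bias, and input below are made-up values, and sigmoid, tanh, and ReLU are just three common choices picked for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Made-up parameters for a layer with 2 inputs and 2 nodes.
W = np.array([[1.0, -2.0],
              [0.5,  0.5]])
b = np.array([0.0, -1.0])
x = np.array([2.0, 1.0])

z = W @ x + b  # pre-activation: weighted sum plus bias

# Activation: the same pre-activation passed through different functions.
outputs = {
    "sigmoid": sigmoid(z),
    "tanh": np.tanh(z),
    "relu": relu(z),
}
for name, a in outputs.items():
    print(name, a)
```

Here `z` works out to `[0.0, 0.5]`; comparing the three outputs for the same `z` is a quick way to see how the choice of activation changes what is passed to the next layer.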

### Q4. What is the role of weights and biases in forward propagation?

* ##### Weights:

Each weight represents the strength of the connection between the two nodes it connects.

When the network receives an input at a given node in the input layer, this input is passed to nodes in the next layer via connections, and the input will be multiplied by the weight values assigned to those connections.

The weight values are first randomly initialized and then learned, updated, and optimized by the network during the training process.

For each node in a fully connected layer, a weighted sum is computed with each of the incoming weights. This weighted sum is considered the pre-activation output from the node.
    
    
![image-2.png](attachment:image-2.png)

* ##### Biases:

A bias is a constant that is added to the weighted sum of the features. It offsets the result, which lets the model shift the activation function toward the positive or negative side.

Consider a sigmoid activation function which is represented by the equation below:

![image.png](attachment:image.png)

On replacing the variable ‘x’ with the equation of line, we get the following:

![image-2.png](attachment:image-2.png)

If we vary the values of the weight ‘w’, keeping bias ‘b’=0, we will get the following graph:
![image-4.png](attachment:image-4.png)

While changing the values of ‘w’, there is no way we can shift the origin of the activation function, i.e., the sigmoid function. On changing the values of ‘w’, only the steepness of the curve will change. There is only one way to shift the origin and that is to include bias ‘b’.

On keeping the value of weight ‘w’ fixed and varying the value of bias ‘b’, we will get the graph below:
![image-5.png](attachment:image-5.png)
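
The effect described above can also be checked numerically. The sketch below assumes a single input and the sigmoid $\sigma(wx + b)$; the helper names `midpoint` and `slope_at_midpoint` are made up for this example. The curve crosses 0.5 where $wx + b = 0$, that is at $x = -b/w$, and its slope there is $w/4$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def midpoint(w, b):
    """Input x at which sigmoid(w * x + b) equals 0.5."""
    return -b / w

def slope_at_midpoint(w):
    """Derivative of sigmoid(w * x + b) with respect to x at the midpoint."""
    return w / 4.0

# Varying w with b = 0: the midpoint stays at the origin, only the slope changes.
for w in (0.5, 1.0, 2.0):
    print("w =", w, "midpoint =", midpoint(w, 0.0), "slope =", slope_at_midpoint(w))

# Varying b with w = 1: the slope is fixed, the midpoint moves left or right.
for b in (-2.0, 0.0, 2.0):
    print("b =", b, "midpoint =", midpoint(1.0, b), "slope =", slope_at_midpoint(1.0))
```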

### Q5. What is the purpose of applying a softmax function in the output layer during forward propagation?

Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would. Softmax is implemented through a neural network layer just before the output layer.

![image.png](attachment:image.png)
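
A minimal NumPy sketch of softmax, assuming the usual definition $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result; the example logits are made up.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into probabilities along the last axis."""
    # Shifting by the max avoids overflow in exp without changing the output.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])  # made-up scores for three classes
probs = softmax(logits)
print(probs, probs.sum())
```

The largest logit receives the largest probability, and the outputs sum to 1, so they can be read as a distribution over the classes.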

### Q6. What is the purpose of backward propagation in a neural network?

Backpropagation is a method for efficiently computing the gradient of the cost function of a neural network with respect to its parameters.  These partial derivatives can then be used to update the network's parameters using, e.g., gradient descent.  This may be the most common method for training neural networks.  Deriving backpropagation involves numerous clever applications of the chain rule for functions of vectors.

![image.png](attachment:image.png)

"Backpropagation is just a way of propagating the total loss back into the neural network to know how much of the loss every node is responsible for, and subsequently updating the weights in a way that minimizes the loss by giving the nodes with higher error rates lower weights, and vice versa."

### Q7. How is backward propagation mathematically calculated in a single-layer feedforward neural network?

#### Backpropagation in general :

In order to train the network using a gradient descent algorithm, we need to know the gradient of the cost/error function $C$ with respect to each of the parameters; that is, we need to know $\frac{\partial C}{\partial W^m}$ and $\frac{\partial C}{\partial b^m}$. Here $W^m$ and $b^m$ denote the weight matrix and bias vector of layer $m$, $z^m$ its weighted sum (pre-activation), $a^m$ its activation, $N^m$ its number of units, and $L$ the final layer. It will be sufficient to derive an expression for these gradients in terms of the following terms, which we can compute based on the neural network's architecture:

- $\frac{\partial C}{\partial a^L}$: The derivative of the cost function with respect to its argument, the output of the network
- $\frac{\partial a^m}{\partial z^m}$: The derivative of the nonlinearity used in layer $m$ with respect to its argument

To compute the gradient of our cost/error function $C$ with respect to $W^m_{ij}$ (a single entry in the weight matrix of layer $m$), we can first note that $C$ is a function of $a^L$, which is itself a function of the linear mix (weighted sum) variables $z^m_k$, which are themselves functions of the weight matrices $W^m$ and biases $b^m$. With this in mind, we can use the chain rule as follows:

$$\frac{\partial C}{\partial W^m_{ij}} = \sum_{k = 1}^{N^m} \frac{\partial C}{\partial z^m_k} \frac{\partial z^m_k}{\partial W^m_{ij}}$$

Note that by definition 
$$
z^m_k = \sum_{l = 1}^{N^{m - 1}} W^m_{kl} a_l^{m - 1} + b^m_k
$$
It follows that $\frac{\partial z^m_k}{\partial W^m_{ij}}$ will evaluate to zero when $i \ne k$ because $z^m_k$ does not interact with any elements in $W^m$ except for those in the $k$<sup>th</sup> row, and we are only considering the entry $W^m_{ij}$.  When $i = k$, we have

\begin{align*}
\frac{\partial z^m_i}{\partial W^m_{ij}} &= \frac{\partial}{\partial W^m_{ij}}\left(\sum_{l = 1}^{N^{m - 1}} W^m_{il} a_l^{m - 1} + b^m_i\right)\\
&= a^{m - 1}_j\\
\rightarrow \frac{\partial z^m_k}{\partial W^m_{ij}} &= \begin{cases}
0 & k \ne i\\
a^{m - 1}_j & k = i
\end{cases}
\end{align*}

The fact that $\frac{\partial z^m_k}{\partial W^m_{ij}}$ is $0$ unless $k = i$ causes the summation above to collapse, giving

$$\frac{\partial C}{\partial W^m_{ij}} = \frac{\partial C}{\partial z^m_i} a^{m - 1}_j$$

or in vector form

$$\frac{\partial C}{\partial W^m} = \frac{\partial C}{\partial z^m} a^{m - 1 \top}$$

Similarly for the bias variables $b^m$, we have

$$\frac{\partial C}{\partial b^m_i} = \sum_{k = 1}^{N^m} \frac{\partial C}{\partial z^m_k} \frac{\partial z^m_k}{\partial b^m_i}$$

As above, it follows that $\frac{\partial z^m_k}{\partial b^m_i}$ will evaluate to zero when $i \ne k$ because $z^m_k$ does not interact with any element in $b^m$ except $b^m_k$.  When $i = k$, we have

\begin{align*}
\frac{\partial z^m_i}{\partial b^m_i} &= \frac{\partial}{\partial b^m_i}\left(\sum_{l = 1}^{N^{m - 1}} W^m_{il} a_l^{m - 1} + b^m_i\right)\\
&= 1\\
\rightarrow \frac{\partial z^m_k}{\partial b^m_i} &= \begin{cases}
0 & k \ne i\\
1 & k = i
\end{cases}
\end{align*}

The summation also collapses to give

$$\frac{\partial C}{\partial b^m_i} = \frac{\partial C}{\partial z^m_i}$$

or in vector form

$$\frac{\partial C}{\partial b^m} = \frac{\partial C}{\partial z^m}$$

Now, we must compute $\frac{\partial C}{\partial z^m_k}$.  For the final layer ($m = L$), this term is straightforward to compute using the chain rule:

$$
\frac{\partial C}{\partial z^L_k} = \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_k}
$$

or, in vector form (with $\circ$ denoting the element-wise product)

$$
\frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \circ \frac{\partial a^L}{\partial z^L}
$$

The first term $\frac{\partial C}{\partial a^L}$ is just the derivative of the cost function with respect to its argument, whose form depends on the cost function chosen. Similarly, $\frac{\partial a^m}{\partial z^m}$ (for any layer $m$ including $L$) is the derivative of the layer's nonlinearity with respect to its argument and will depend on the choice of nonlinearity. For other layers, we again invoke the chain rule:


\begin{align*}
\frac{\partial C}{\partial z^m_k} &= \frac{\partial C}{\partial a^m_k} \frac{\partial a^m_k}{\partial z^m_k}\\
&= \left(\sum_{l = 1}^{N^{m + 1}}\frac{\partial C}{\partial z^{m + 1}_l}\frac{\partial z^{m + 1}_l}{\partial a^m_k}\right)\frac{\partial a^m_k}{\partial z^m_k}\\
&= \left(\sum_{l = 1}^{N^{m + 1}}\frac{\partial C}{\partial z^{m + 1}_l}\frac{\partial}{\partial a^m_k} \left(\sum_{h = 1}^{N^m} W^{m + 1}_{lh} a_h^m + b_l^{m + 1}\right)\right) \frac{\partial a^m_k}{\partial z^m_k}\\
&= \left(\sum_{l = 1}^{N^{m + 1}}\frac{\partial C}{\partial z^{m + 1}_l} W^{m + 1}_{lk}\right) \frac{\partial a^m_k}{\partial z^m_k}\\
&= \left(\sum_{l = 1}^{N^{m + 1}}W^{m + 1\top}_{kl} \frac{\partial C}{\partial z^{m + 1}_l}\right) \frac{\partial a^m_k}{\partial z^m_k}\\
\end{align*}

where the last simplification was made because by convention $\frac{\partial C}{\partial z^{m + 1}_l}$ is a column vector, allowing us to write the following vector form:

$$\frac{\partial C}{\partial z^m} = \left(W^{m + 1\top} \frac{\partial C}{\partial z^{m + 1}}\right) \circ \frac{\partial a^m}{\partial z^m}$$

Note that we now have the ingredients to efficiently compute the gradient of the cost function with respect to the network's parameters: First, we compute $\frac{\partial C}{\partial z^L}$ based on the choice of cost function and nonlinearity. Then, we recursively compute $\frac{\partial C}{\partial z^m}$ layer by layer, moving from the output toward the input, based on the term $\frac{\partial C}{\partial z^{m + 1}}$ already computed for the following layer and the nonlinearity of layer $m$ (this is called the "backward pass"). Each $\frac{\partial C}{\partial z^m}$ then gives $\frac{\partial C}{\partial W^m}$ and $\frac{\partial C}{\partial b^m}$ through the expressions derived above.
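
The recursion can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: every layer uses a sigmoid nonlinearity, the cost is the squared error $C = \frac{1}{2}\lVert a^L - y\rVert^2$, vectors are stored as columns, and the names `forward` and `backward` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations a^0 (the input) through a^L as column vectors."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def backward(activations, weights, y):
    """Gradients of C = 0.5 * ||a^L - y||^2 for every weight matrix and bias."""
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    a_L = activations[-1]
    # Final layer: dC/dz^L = dC/da^L * da^L/dz^L, with sigmoid' = a * (1 - a).
    delta = (a_L - y) * a_L * (1.0 - a_L)
    for m in reversed(range(len(weights))):
        grads_W[m] = delta @ activations[m].T  # dC/dz times previous activation^T
        grads_b[m] = delta                     # dC/db equals dC/dz
        if m > 0:
            a_prev = activations[m]
            # Backward pass: (W^T dC/dz) element-wise times sigmoid'.
            delta = (weights[m].T @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=(3, 1)), rng.normal(size=(1, 1))]
x = np.array([[0.5], [-1.0]])
y = np.array([[1.0]])

grads_W, grads_b = backward(forward(x, weights, biases), weights, y)
print([g.shape for g in grads_W], [g.shape for g in grads_b])
```

Each gradient has the same shape as the parameter it belongs to, so it can be plugged directly into a gradient descent update such as `W -= learning_rate * grad_W`.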

### Q8. Can you explain the concept of the chain rule and its application in backward propagation?

##### Review: The chain rule

The chain rule is a way to compute the derivative of a function whose variables are themselves functions of other variables.  If $C$ is a scalar-valued function of a scalar $z$ and $z$ is itself a scalar-valued function of another scalar variable $w$, then the chain rule states that
$$
\frac{\partial C}{\partial w} = \frac{\partial C}{\partial z}\frac{\partial z}{\partial w}
$$
For scalar-valued functions of more than one variable, the chain rule essentially becomes additive.  In other words, if $C$ is a scalar-valued function of $N$ variables $z_1, \ldots, z_N$, each of which is a function of some variable $w$, the chain rule states that
$$
\frac{\partial C}{\partial w} = \sum_{i = 1}^N \frac{\partial C}{\partial z_i}\frac{\partial z_i}{\partial w}
$$

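* ###### Application of chain rule in Backward Propagation:

  Backpropagation applies the chain rule in stages, moving from the loss back toward the parameters:

  * Error Derivative with Respect to the Output: the derivative of the loss with respect to the network's output, $\frac{\partial C}{\partial a^L}$.
  * Error Propagation to the Last Layer: combining it with the derivative of the output nonlinearity gives $\frac{\partial C}{\partial z^L}$.
  * Error Propagation Through the Activation Function: each earlier layer multiplies the error arriving from the layer after it by $\frac{\partial a^m}{\partial z^m}$ to obtain $\frac{\partial C}{\partial z^m}$.
  * Error Propagation to the Weights and Biases: $\frac{\partial C}{\partial z^m}$ then yields $\frac{\partial C}{\partial W^m}$ and $\frac{\partial C}{\partial b^m}$.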
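
The additive form of the chain rule can be checked numerically on a toy example. The functions below ($z_1 = \sin w$, $z_2 = w^2$, $C = z_1^2 + 3 z_2$) are made up purely for illustration; the chain-rule derivative is compared with a central finite difference.

```python
import math

def z1(w):
    return math.sin(w)

def z2(w):
    return w ** 2

def cost(w):
    return z1(w) ** 2 + 3.0 * z2(w)

def dcost_dw(w):
    """dC/dw = dC/dz1 * dz1/dw + dC/dz2 * dz2/dw."""
    dC_dz1 = 2.0 * z1(w)
    dC_dz2 = 3.0
    dz1_dw = math.cos(w)
    dz2_dw = 2.0 * w
    return dC_dz1 * dz1_dw + dC_dz2 * dz2_dw

w = 0.7
h = 1e-6
numeric = (cost(w + h) - cost(w - h)) / (2.0 * h)  # central difference
print(dcost_dw(w), numeric)
```

The two printed values should agree to several decimal places; backpropagation performs the same sum of products, only with one term per unit in the following layer.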

### Q9. What are some common challenges or issues that can occur during backward propagation, and how can they be addressed?

During backward propagation in neural network training, several challenges and issues may arise. Addressing these challenges is crucial to ensure the stability and effectiveness of the training process. Here are some common challenges and potential solutions:

#### Vanishing Gradients:
* Issue: In deep networks, gradients can become extremely small as they are propagated backward through many layers. This is known as the vanishing gradients problem.

* Solution: Use activation functions that mitigate vanishing gradients, such as the Rectified Linear Unit (ReLU) or variants like Leaky ReLU. Additionally, consider using normalization techniques like batch normalization.
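
A rough numerical illustration of why sigmoid layers shrink gradients: the sigmoid derivative is at most 0.25, and backpropagation multiplies one such factor per layer. The sketch deliberately ignores the weight matrices (a simplifying assumption) and the helper names are made up.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def activation_factor(grad_fn, depth, z):
    """Product of `depth` activation derivatives, all evaluated at the same z."""
    return grad_fn(z) ** depth

for depth in (1, 5, 10, 20):
    print(depth,
          activation_factor(sigmoid_grad, depth, z=0.0),  # best case for sigmoid
          activation_factor(relu_grad, depth, z=1.0))     # active ReLU units
```

Even in the sigmoid's best case (`z = 0`), ten layers scale the gradient by about $10^{-6}$, while active ReLU units pass it through unchanged.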


#### Exploding Gradients:
* Issue: Gradients can become extremely large during backward propagation, leading to numerical instability and making training difficult.

* Solution: Implement gradient clipping, which involves scaling gradients when they exceed a certain threshold. This prevents the explosion of gradients while maintaining the direction of the gradient.
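
One common variant is clipping by global norm, sketched below in NumPy. The function name `clip_by_global_norm`, the threshold, and the example gradients are assumptions for illustration; deep learning frameworks ship their own versions of this utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm  # same factor for every array keeps the direction
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # made-up gradients, joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Because every array is multiplied by the same factor, the clipped gradient points in the same direction as the original one; only its length changes.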


#### Choice of Activation Functions:
* Issue: Poor choices of activation functions can lead to issues like vanishing or exploding gradients.

* Solution: Choose activation functions carefully based on the characteristics of the data and the network architecture. For example, ReLU and its variants are popular choices due to their effectiveness.


#### Unstable Training:
* Issue: The training process may be unstable, with the loss oscillating or not converging.

* Solution: Adjust the learning rate. If the learning rate is too high, it can cause overshooting, and if it's too low, convergence may be slow. Experiment with different learning rates and consider using learning rate schedules.
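
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$ (gradient $2w$); the function, starting point, step count, and learning rates are all made up for illustration.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise f(w) = w**2 with a fixed learning rate; return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w**2 is 2w
    return w

for lr in (0.001, 0.1, 1.1):
    print("lr =", lr, "final w =", gradient_descent(lr))
```

Each step multiplies `w` by `(1 - 2 * lr)`: with `lr = 0.1` it shrinks steadily toward the minimum at 0, with `lr = 0.001` it barely moves in 20 steps, and with `lr = 1.1` the factor has magnitude greater than 1, so the iterates overshoot and grow.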

#### Overfitting:
* Issue: The model learns the training data too well but fails to generalize to new, unseen data.

* Solution: Introduce regularization techniques, such as L1 or L2 regularization, dropout, or early stopping. These techniques help prevent the model from fitting noise in the training data.


#### Numerical Precision Issues:
* Issue: In deep networks, numerical precision issues may arise during the computation of gradients, leading to instability.
* Solution: Use techniques such as gradient normalization, weight initialization strategies, and a careful choice of numerical precision (e.g., 32-bit floating point, or 16-bit floating point combined with loss scaling when training in mixed precision).

#### Incorrect Implementation of Backward Pass:
* Issue: Mistakes in the implementation of the backward pass can lead to incorrect updates of weights and biases.
* Solution: Double-check the implementation of the backward pass, ensuring that the chain rule is correctly applied for each layer. Compare the implementation against the mathematical derivations.
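
A common way to double-check a backward pass is a numerical gradient check, sketched below. The helper names and the test function $f(\theta) = \sum_i \theta_i^2$ (analytic gradient $2\theta$) are assumptions for illustration; in practice `f` would be the network's loss as a function of its flattened parameters.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of scalar f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

def relative_error(a, b):
    denom = max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)
    return np.linalg.norm(a - b) / denom

def f(theta):
    return np.sum(theta ** 2)

theta = np.array([1.0, -2.0, 0.5])
analytic = 2.0 * theta  # what a hand-written backward pass would return here
numeric = numerical_gradient(f, theta)
print(relative_error(analytic, numeric))
```

A very small relative error suggests the analytic gradient is right; a large one usually points to a mistake in a specific layer's derivative. The check is slow, so it is typically run once on a small network rather than during training.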


#### Slow Convergence:
* Issue: The training process may be slow to converge to a satisfactory solution.
* Solution: Experiment with different optimization algorithms, learning rates, and initialization strategies. Consider using more advanced optimizers like Adam or RMSprop.

#### Memory Usage:
* Issue: Deep networks may consume a significant amount of memory during backward propagation, especially when storing intermediate activations.
* Solution: Reduce memory use by lowering the mini-batch size, or consider techniques like gradient checkpointing, which trades extra computation for lower memory.

#### Gradient Descent Variants:
* Issue: The choice of optimization algorithm can impact training stability and speed.
* Solution: Experiment with different optimization algorithms (e.g., stochastic gradient descent, Adam, RMSprop) and adjust hyperparameters like learning rate and momentum.