## **1. Machine Learning as Function Approximation**
Machine Learning models approximate a mathematical function, $f(x)$, that maps inputs to outputs:
$$
y = f(x; \theta)
$$
Where:
- $x$: Input features.
- $\theta$: Learnable parameters (weights, biases).
- $y$: Predicted output.

The goal is to learn the optimal $\theta$ by minimizing the **loss function**, which measures the error between predictions and true outputs.

#### **Single Perceptron**
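A **single perceptron** applies an activation function $\sigma$ to a weighted sum of its inputs:
$$
y = \sigma(w \cdot x + b)
$$
Where $w$ is the weight vector and $b$ is the bias.
- **Limitation:** Its decision boundary is a single hyperplane, so it can only solve **linearly separable problems** (it cannot represent XOR, for example). This makes it unsuitable for most real-world tasks with **non-linearity and noise**.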
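
The perceptron formula can be sketched in a few lines. This is a minimal illustration that assumes a sigmoid for $\sigma$; the weights `w_and` and `b_and` are hand-picked for this example (not learned) so that the perceptron reproduces logical AND, which is linearly separable.

```python
import math

def perceptron(x, w, b):
    """Compute sigma(w . x + b), assuming a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked (hypothetical) parameters: the line 10*x1 + 10*x2 = 15 separates AND.
w_and, b_and = [10.0, 10.0], -15.0

and_outputs = {
    (x1, x2): round(perceptron([x1, x2], w_and, b_and))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(and_outputs)
```

No choice of `w` and `b` gives the XOR truth table in the same way, because no single line separates its two classes.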


## **2. Deep Neural Networks (DNNs)**
A **Deep Neural Network (DNN)** has multiple layers of neurons, enabling it to model complex, non-linear relationships in data:
- **Input Layer:** Accepts raw data.
- **Hidden Layers:** Perform transformations to learn intermediate representations.
- **Output Layer:** Produces predictions.

#### **Fully Connected Layers (Dense Layers)**
- _Each neuron in one layer is connected to every neuron in the next layer._
- The number of **weights** in a layer is:
$$
\text{Number of Weights} = (\text{Number of Inputs}) \times (\text{Number of Neurons})
$$
- **Transposing Weights:** Depending on whether the weight matrix is stored as (inputs × neurons) or (neurons × inputs), it may need to be transposed so that its dimensions line up with the input in the matrix multiplication during forward propagation.
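
As a rough sketch of the weight count and the transpose, the snippet below builds one dense layer with NumPy. The sizes (`n_inputs`, `n_neurons`, `batch_size`) and the variable names are arbitrary choices for illustration.

```python
import numpy as np

n_inputs, n_neurons, batch_size = 4, 3, 2
rng = np.random.default_rng(0)

# One weight per (input, neuron) pair, stored as (inputs x neurons).
weights = rng.normal(size=(n_inputs, n_neurons))
biases = np.zeros(n_neurons)

x = rng.normal(size=(batch_size, n_inputs))
out = x @ weights + biases  # shape: (batch_size, n_neurons)

# If the same weights are stored as (neurons x inputs) instead,
# a transpose is needed before the multiplication.
weights_alt = weights.T
out_alt = x @ weights_alt.T + biases

print(weights.size, out.shape)
```

Both layouts hold the same $4 \times 3 = 12$ weights; only the orientation differs.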

## **3. Activation Functions**
The goal is to learn the relationship between inputs and outputs in the data, and that relationship is rarely a single linear function. Non-linear activation functions are what allow a network to combine many linear pieces into a non-linear mapping.


- **Purpose:** They determine whether a neuron should activate or not by transforming the weighted sum of inputs.
- **Key Role:** Introduce **non-linearity**, allowing neural networks to model complex patterns in data. Without activation functions, a neural network would behave like a linear model regardless of the number of layers.
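
The claim that a network without activation functions behaves like a linear model can be checked numerically. In this sketch (random shapes and values chosen only for illustration), two stacked linear layers are compared with the single linear layer obtained by multiplying their weights together.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Two layers with no activation in between.
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
two_layer = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# With a ReLU between the layers, the equivalence no longer holds.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2

print(np.allclose(two_layer, one_layer))
print(np.allclose(with_relu, one_layer))
```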

### **a. Sigmoid Activation Function**

#### **Definition:**
The **sigmoid function** maps input values to a range between $0$ and $1$.  
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

#### **Properties:**
- Output: $(0, 1)$.
- Differentiable everywhere.
- Point-symmetric about $(0, 0.5)$: $\sigma(-x) = 1 - \sigma(x)$.

#### **Use Cases:**
- Historically used in the **output layer** for **binary classification** tasks.
- Models probabilities, as outputs lie between $0$ and $1$.

#### **Advantages:**
1. Maps large inputs into a small range, making it interpretable for probability-based tasks.
2. Smooth and differentiable, enabling gradient-based optimization.

#### **Drawbacks:**
1. **Vanishing Gradient Problem:**
   - Gradients diminish as $x$ becomes very large or very small.
   - Derivative:
     $$
     \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
     $$
     At extreme $x$, $\sigma(x)$ saturates near $0$ or $1$, leading to very small gradients.

2. **Not Zero-Centric:**
   - Outputs are always positive, so the gradients of all weights feeding into a given neuron in the next layer share the same sign. This forces zig-zagging updates and slows convergence.

3. **Exponential Computation:**
   - Relatively expensive due to $e^{-x}$.
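
To make the vanishing gradient concrete, here is a small sketch that evaluates the sigmoid and its derivative at a few points. The function names are illustrative, and the scalar implementation uses only the standard library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at $0.25$ for $x = 0$ and falls off quickly from there. Since backpropagation multiplies one such factor per sigmoid layer, ten layers scale the gradient by at most $0.25^{10} \approx 10^{-6}$.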



### **b. Tanh (Hyperbolic Tangent) Activation Function**

#### **Definition:**
The **Tanh** function maps inputs to a range between $-1$ and $1$.  
$$
\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

#### **Properties:**
- Output: $(-1, 1)$.
- Differentiable everywhere.
- Symmetric around the origin ($0$).

#### **Use Cases:**
- Commonly used in hidden layers of neural networks.
- Zero-centered output helps with balanced gradient updates.

#### **Advantages:**
1. **Zero-Centric Output:** Improves optimization as gradients are balanced (positive and negative).
2. **Smooth Non-Linearity:** Suitable for tasks requiring subtle activations.

#### **Drawbacks:**
1. **Vanishing Gradient Problem:** Similar to Sigmoid, gradients shrink for extreme $x$ values.
2. **Computational Cost:** Requires exponential computations, making it slower than simpler functions.
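
A short sketch of Tanh's gradient, using the identity $\text{Tanh}'(x) = 1 - \text{Tanh}^2(x)$. It also checks the relation $\text{Tanh}(x) = 2\sigma(2x) - 1$, which shows that Tanh is a rescaled, zero-centered sigmoid. The helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    print(
        f"x={x:5.1f}  tanh={math.tanh(x):+.5f}  "
        f"2*sigmoid(2x)-1={rescaled_sigmoid:+.5f}  gradient={tanh_grad(x):.5f}"
    )
```

The gradient is $1$ at the origin (four times the sigmoid's maximum) but still shrinks toward $0$ for large $|x|$.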



### **c. ReLU (Rectified Linear Unit) Activation Function**

#### **Definition:**
The **ReLU function** outputs the input if it’s positive, and $0$ otherwise.  
$$
\text{ReLU}(x) = \max(0, x)
$$

#### **Properties:**
- Output: $[0, \infty)$.
- Differentiable everywhere except at $x = 0$ (gradient is $1$ for $x > 0$ and $0$ for $x < 0$).

#### **Use Cases:**
- Most commonly used in **hidden layers** of modern neural networks.
- Preferred for deep networks due to computational simplicity.

#### **Advantages:**
1. **Efficient Computation:** No expensive operations (e.g., exponentials).
2. **Non-Saturating Gradients:** Unlike Sigmoid and Tanh, ReLU doesn’t saturate for large positive values.
3. **Sparse Activation:** Activates only a fraction of neurons, improving efficiency.

#### **Drawbacks:**
1. **Dying ReLU Problem:** A neuron whose pre-activation is negative for every input outputs $0$ and receives a gradient of $0$, so its weights stop updating and the neuron is effectively dead.
2. **Unbounded Output:** Large values can destabilize optimization.
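
A minimal sketch of ReLU and its gradient on a handful of pre-activation values. Using a gradient of $0$ at exactly $x = 0$ is a convention assumed here; the function is not differentiable at that point.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Assumed convention: gradient 0 at x == 0.
    return 1.0 if x > 0 else 0.0

pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Every negative input gives both a zero output (sparse activation) and a zero gradient. If that happens for all training inputs, the neuron never updates, which is the dying ReLU case.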



### **d. Leaky ReLU**

#### **Definition:**
A variation of ReLU that introduces a small slope for negative inputs:  
$$
\text{Leaky ReLU}(x) = 
\begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$
Where $\alpha$ is a small positive constant (e.g., $0.01$).

#### **Advantages:**
1. **Solves Dying ReLU Problem:** Negative inputs produce small but non-zero outputs, keeping neurons alive.
2. **Efficient Computation:** Similar to ReLU.

#### **Drawbacks:**
1. The choice of $\alpha$ is a hyperparameter requiring tuning.
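
The piecewise definition translates directly into code. This sketch uses the example value $\alpha = 0.01$ from above as the default; the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The gradient is alpha (not 0) for non-positive inputs.
    return 1.0 if x > 0 else alpha

for z in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={z:+.1f}  output={leaky_relu(z):+.4f}  gradient={leaky_relu_grad(z):.2f}")
```

Because the gradient for negative inputs is $\alpha$ rather than $0$, a neuron stuck in the negative region still receives a small learning signal.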



### **e. Softmax Activation Function**

#### **Definition:**
Used in **multi-class classification**, it converts logits into probabilities:  
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

#### **Properties:**
- Output: $(0, 1)$ for each class, with all probabilities summing to $1$.

#### **Use Cases:**
- **Output layer** for multi-class classification.

#### **Advantages:**
1. Outputs probabilities, making results interpretable.
2. Normalizes outputs, ensuring the total equals $1$.

#### **Drawbacks:**
1. Computationally expensive due to the exponentials and summation.
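2. **Vanishing Gradient Problem** can occur for very large logits: when one logit dominates, the output saturates near a one-hot vector and gradients become very small. Very large logits can also overflow $e^{x_i}$ unless the computation is stabilized.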
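
A sketch of softmax in plain Python. Subtracting the maximum logit before exponentiating is a common stabilization trick (an addition here, not part of the formula above); it leaves the result unchanged because softmax is invariant to adding a constant to every logit.

```python
import math

def softmax(logits):
    m = max(logits)  # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])

# Without the shift, math.exp(1000.0) would raise OverflowError.
large = softmax([1000.0, 1001.0, 1002.0])
print([round(p, 3) for p in large])
```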



### **f. Other Notable Activation Functions**
1. **Swish:**  
   $$
   \text{Swish}(x) = x \cdot \sigma(x)
   $$  
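   Combines the properties of Sigmoid and linear functions. Smooth and unbounded above, so it does not saturate for large positive inputs.

2. **GELU (Gaussian Error Linear Unit):**  
   $$
   \text{GELU}(x) = x \cdot \Phi(x)
   $$  
   Where $\Phi$ is the standard normal CDF. Smoothly combines linear and non-linear behavior. Often used in transformer architectures like **BERT**.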
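
Both functions can be sketched with the standard library. The GELU below assumes the exact form $x \cdot \Phi(x)$, with the standard normal CDF $\Phi$ written via the error function; some implementations use a faster approximation instead.

```python
import math

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # x * Phi(x), where Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  gelu={gelu(x):+.4f}")
```

Both behave like the identity for large positive inputs and approach $0$ for large negative inputs, with a small negative dip in between.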



### **Summary Table**

| **Activation**   | **Range**            | **Zero-Centric** | **Pros**                                      | **Cons**                                        |
|------------------|----------------------|------------------|-----------------------------------------------|-------------------------------------------------|
| **Sigmoid**      | $(0, 1)$             | ❌                | Probabilistic output, smooth gradients        | Vanishing gradients, not zero-centric           |
| **Tanh**         | $(-1, 1)$            | ✅                | Zero-centered, good for hidden layers         | Vanishing gradients, computationally costly     |
| **ReLU**         | $[0, \infty)$        | ❌                | Efficient, non-saturating, sparse activations  | Dying ReLU problem, unbounded output            |
| **Leaky ReLU**   | $(-\infty, \infty)$   | ✅                | Solves Dying ReLU, efficient                  | Requires tuning of $\alpha$                     |
| **Softmax**      | $(0, 1)$             | ❌                | Probability distribution for multi-class       | Computationally expensive                       |



## **4. Object-Oriented Design in Neural Networks**
An object-oriented design is usually the better choice here because it is extensible: new components can be added without changing existing, working code. A dense layer performs a linear computation, and its output is then passed to an activation function. Functionality shared by all layers goes into a base class, which every concrete layer implements.

Building neural networks using **object-oriented programming (OOP)** ensures:
- **Reusability:** Common functionality (e.g., layers, activations) is defined in base classes.
- **Extensibility:** New layers or methods can be added without modifying existing code.

Example:
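```python
import numpy as np

class BaseLayer:
    """Common interface that every layer implements."""
    def forward(self, inputs):
        raise NotImplementedError

class DenseLayer(BaseLayer):
    def __init__(self, n_inputs, n_neurons):
        # Small random weights, zero biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        return np.dot(inputs, self.weights) + self.biases
```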
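
To illustrate the extensibility point, the sketch below adds two new components on top of the same interface without touching `DenseLayer`. The names `ReLULayer` and `Sequential`, the `rng` constructor argument, and the layer sizes are hypothetical choices for this example.

```python
import numpy as np

class BaseLayer:
    def forward(self, inputs):
        raise NotImplementedError

class DenseLayer(BaseLayer):
    def __init__(self, n_inputs, n_neurons, rng):
        self.weights = 0.1 * rng.normal(size=(n_inputs, n_neurons))
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        return np.dot(inputs, self.weights) + self.biases

class ReLULayer(BaseLayer):
    """New layer type added without modifying DenseLayer."""
    def forward(self, inputs):
        return np.maximum(0, inputs)

class Sequential(BaseLayer):
    """Chains layers; relies only on the shared forward() interface."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, inputs):
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

rng = np.random.default_rng(0)
model = Sequential([DenseLayer(4, 8, rng), ReLULayer(), DenseLayer(8, 2, rng)])
batch = rng.normal(size=(3, 4))
predictions = model.forward(batch)
print(predictions.shape)  # (3, 2)
```

Because `Sequential` only calls `forward()`, any future layer that subclasses `BaseLayer` can be dropped into the list.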


## **5. Backpropagation**
**Backpropagation** computes gradients of the loss with respect to each parameter using the **chain rule**. The loss is computed at the output (the last layer), and the resulting error signal is propagated backward, from later layers to earlier ones.
- **Forward Pass:** Calculate predictions and loss.
- **Backward Pass:** Propagate errors from the output layer back to earlier layers, adjusting weights:
  $$
  w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}
  $$
Where $\eta$ is the learning rate.
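
The forward pass, backward pass, and weight update can be sketched end to end for a tiny network. Everything specific here is an assumption made for illustration: the toy regression data, the 3-4-1 architecture with a Tanh hidden layer, the mean squared error loss, and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))                  # 16 samples, 3 features
y = x @ np.array([[0.5], [-0.3], [0.2]])      # hypothetical toy target

# Tiny network: 3 inputs -> 4 tanh units -> 1 linear output
W1, b1 = 0.5 * rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = 0.5 * rng.normal(size=(4, 1)), np.zeros((1, 1))
eta = 0.1

losses = []
for step in range(200):
    # Forward pass: predictions and loss
    a1 = np.tanh(x @ W1 + b1)
    y_hat = a1 @ W2 + b2
    losses.append(float(np.mean((y_hat - y) ** 2)))

    # Backward pass: chain rule, starting at the output layer
    d_yhat = 2.0 * (y_hat - y) / len(x)       # dL/dy_hat
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_a1 = d_yhat @ W2.T                      # error sent back to the hidden layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)             # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)

    # Update: w <- w - eta * dL/dw
    W2 -= eta * dW2
    b2 -= eta * db2
    W1 -= eta * dW1
    b1 -= eta * db1

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Note that the hidden-layer gradients reuse `d_yhat`, which was already computed for the output layer. That reuse is what makes the backward pass cheap.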

### __Important Notes:__
In a **forward pass**, the data flows from the input layer to the output layer, making predictions at the end. However, if the output is wrong, we can't directly know which layers contributed to the error, especially if the network has many layers.

#### __Intuitive Explanation:__
Imagine you’re trying to solve a complex math problem, and your final answer is wrong. Even if most of your intermediate steps were correct, you can’t tell exactly where the mistake happened just by looking at the final answer. You need a way to **backtrack** and figure out where things went wrong in the middle steps.

This is where **backpropagation** comes in. After the forward pass, backpropagation works backward through the layers to determine how much each one contributed to the error. This way, every layer in the **whole network**, including the earliest ones, is adjusted in proportion to its contribution to the final error.

In short, **forward pass** is great for making predictions, but we need **backpropagation** to fix mistakes, especially in deep networks, by adjusting each layer's contribution based on the error.

### **Why Is Backpropagation Necessary?**

1. **Optimizing Parameters:**
   - Neural networks rely on **weights** and **biases** to map inputs to outputs.
   - Backpropagation calculates the **gradients** of the loss function with respect to each parameter.
   - These gradients guide the optimizer (e.g., gradient descent) to update parameters to minimize the loss:
     $$
     w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}
     $$
     Where:
     - $w$: Weight
     - $\eta$: Learning rate
     - $\frac{\partial L}{\partial w}$: Gradient of the loss with respect to the weight



2. **Handling Complex Relationships:**
   - In real-world problems, input-output relationships are often **non-linear** and complex.
   - Backpropagation works layer-by-layer, ensuring each weight is adjusted based on its specific contribution to the error, even in deep networks with multiple layers.



3. **Computational Efficiency:**
   - Backpropagation efficiently uses the **chain rule** to propagate errors from the output layer to earlier layers.
   - Without it, calculating gradients manually for each parameter in large networks would be computationally infeasible.
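
The chain rule step can be verified on a single sigmoid neuron with a squared error loss. In this sketch (all values are arbitrary), the analytic gradient is compared with a finite-difference estimate, which is the kind of per-parameter calculation backpropagation avoids.

```python
import math

def loss(w, b, x, y):
    y_hat = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return (y_hat - y) ** 2

def analytic_grad_w(w, b, x, y):
    # Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
    y_hat = 1.0 / (1.0 + math.exp(-(w * x + b)))
    dL_dyhat = 2.0 * (y_hat - y)
    dyhat_dz = y_hat * (1.0 - y_hat)
    dz_dw = x
    return dL_dyhat * dyhat_dz * dz_dw

def numeric_grad_w(w, b, x, y, h=1e-6):
    # Central difference: needs two loss evaluations for this one parameter.
    return (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2.0 * h)

w, b, x, y = 0.7, -0.2, 1.5, 1.0
print(analytic_grad_w(w, b, x, y), numeric_grad_w(w, b, x, y))
```

The finite-difference approach needs extra forward passes for every parameter, while backpropagation obtains all gradients from one forward and one backward pass.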



### **Where Is Backpropagation Necessary?**

#### **1. Supervised Learning:**
- **When:** Tasks like classification and regression.
- **Why:** To reduce the error between the predicted labels and ground truth using a loss function (e.g., Mean Squared Error, Cross-Entropy Loss).
- **Example:**
  - In image classification, backpropagation updates weights to improve the model's accuracy in identifying objects.

#### **2. Deep Neural Networks (DNNs):**
- **When:** For architectures with multiple layers (e.g., convolutional networks, recurrent networks).
- **Why:** Each layer depends on the outputs of previous layers, making backpropagation crucial to calculate gradients for every layer.

#### **3. Fine-Tuning Pretrained Models:**
- **When:** Transfer learning with pretrained models like BERT or ResNet.
- **Why:** Backpropagation fine-tunes weights of the pretrained layers for the new task.

#### **4. Reinforcement Learning (Certain Cases):**
- **When:** Training policies in neural networks with differentiable components.
- **Why:** Backpropagation helps calculate gradients for value or policy functions.



### **Why Can't We Skip Backpropagation?**

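- Without backpropagation, gradients would have to be estimated some other way (for example by perturbing one parameter at a time), which is far too slow for large networks. Otherwise, **parameter updates** would be uninformed, leading to: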
  - Poor training performance.
  - Non-convergence of the model.
- Real-world problems like language translation or image recognition require structured gradient updates to learn efficiently.

## **6. Other Gradient Descent Algorithms**

### **Momentum in Gradient Descent**:
Momentum accelerates gradient descent and helps it move through saddle points and plateaus (regions where gradients are close to zero) instead of stalling there.

- **How it Works**:
  - Momentum keeps track of past gradients to adjust the current gradient update. This means if the gradients have been consistently pointing in the same direction, the algorithm will continue in that direction, making updates faster.
  - The formula for momentum is:
    $$
    v_t = \beta v_{t-1} + (1 - \beta) \nabla L
    $$
    Here, $v_t$ is the velocity (running average of the gradients), and $\beta$ is the momentum term (usually close to 1).
  - The weight update formula is:
    $$
    w \leftarrow w - \eta v_t
    $$
    where $\eta$ is the learning rate and $v_t$ is the adjusted gradient.

- **Advantages**:
  - Helps escape saddle points by smoothing the updates and reducing oscillations.
  - Accelerates the gradient descent in the direction of the gradient, making the process faster and more stable.

- **Why use it?**:
  - It's useful when the gradient is noisy, as it smooths out fluctuations, allowing the model to converge faster and avoid getting stuck in local minima or saddle points.
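
The two momentum formulas can be sketched on a toy problem. The quadratic loss below, with one steep and one shallow direction, and the hyperparameter values are assumptions chosen only for illustration.

```python
def grad(w):
    # Gradient of the hypothetical loss L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
    return [w[0], 25.0 * w[1]]

def momentum_descent(w, eta=0.1, beta=0.9, steps=200):
    v = [0.0 for _ in w]
    for _ in range(steps):
        g = grad(w)
        # v_t = beta * v_{t-1} + (1 - beta) * grad
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        # w <- w - eta * v_t
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return w

w_final = momentum_descent([1.0, 1.0])
print(w_final)
```

Because the velocity averages recent gradients, updates along the steep direction are damped rather than bouncing back and forth, and both coordinates settle near the minimum at the origin.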



### **Adagrad**:
Adagrad adjusts the learning rate for each parameter based on the accumulated squared gradients, allowing the algorithm to adaptively scale the learning rate for each parameter.

- **How it Works**:
  - The formula for Adagrad's update is:
    $$
    w \leftarrow w - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L
    $$
    where $G_t$ is the cumulative sum of squared gradients, and $\epsilon$ is a small constant to prevent division by zero.
  
- **Advantages**:
  - Adagrad is great for dealing with **sparse data**, where some features appear infrequently (e.g., in NLP or text classification tasks). It gives larger updates to infrequent features and smaller updates to frequent ones.
  
- **Why use it?**:
  - If the dataset is sparse (some features are very infrequent), Adagrad will give larger learning rates to these sparse features, which can be useful for better generalization. It’s also ideal for problems like natural language processing (NLP) where feature sparsity is common.

- **Downside**:
  - The learning rate shrinks too quickly over time, which can lead to stagnation during the later stages of training.
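
A sketch of the Adagrad update with a synthetic gradient stream: the first parameter receives a gradient on every step, the second only on every tenth step, imitating a rare feature. The gradient values and step count are made up for illustration.

```python
import math

def adagrad_step(w, g, G, eta=0.1, eps=1e-8):
    # Accumulate squared gradients per parameter, then scale each update.
    G = [Gi + gi ** 2 for Gi, gi in zip(G, g)]
    w = [wi - eta / math.sqrt(Gi + eps) * gi for wi, gi, Gi in zip(w, g, G)]
    return w, G

w, G = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    g = [1.0, 1.0 if t % 10 == 0 else 0.0]  # dense feature vs. sparse feature
    w, G = adagrad_step(w, g, G)

effective_lr = [0.1 / math.sqrt(Gi + 1e-8) for Gi in G]
print(G, effective_lr)
```

The rarely updated parameter keeps a larger effective learning rate than the frequently updated one. Both rates can only decrease as `G` grows, which is the downside noted above.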



### **RMSProp**:
RMSProp (Root Mean Square Propagation) improves Adagrad by introducing an **exponential moving average** of squared gradients. This prevents the learning rate from decaying too rapidly as in Adagrad.

- **How it Works**:
  - RMSProp modifies Adagrad by considering the recent gradients rather than all historical gradients. The formula for RMSProp is:
    $$
    E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2
    $$
    where $g_t = \nabla L$ is the current gradient, $E[g^2]_t$ is the moving average of squared gradients, and $\beta$ is the decay factor (typically set to 0.9).
  - The weight update formula is:
    $$
    w \leftarrow w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla L
    $$
  
- **Advantages**:
  - **Prevents learning rate decay**: By using an exponentially decaying average of squared gradients, RMSProp avoids the rapid decay problem of Adagrad, making it more suitable for long-term training.
  - **Effective for non-stationary objectives**: It works well for tasks like training recurrent neural networks (RNNs) or deep networks.

- **Why use it?**:
  - It’s better than Adagrad because it keeps the learning rate from shrinking too quickly. This makes it more stable and effective for a wider range of problems, especially in non-stationary settings where the data distribution may change over time.
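
The difference from Adagrad shows up clearly with a constant gradient. In this sketch (a synthetic stream where the gradient is always $1$, with arbitrary hyperparameters), the RMSProp average levels off while an Adagrad-style sum keeps growing.

```python
import math

def rmsprop_step(w, g, avg_sq, eta=0.01, beta=0.9, eps=1e-8):
    # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
    avg_sq = beta * avg_sq + (1.0 - beta) * g * g
    w = w - eta / math.sqrt(avg_sq + eps) * g
    return w, avg_sq

w, avg_sq, adagrad_G = 0.0, 0.0, 0.0
for _ in range(1000):
    g = 1.0                       # synthetic constant gradient
    w, avg_sq = rmsprop_step(w, g, avg_sq)
    adagrad_G += g * g            # what Adagrad would accumulate instead

rmsprop_lr = 0.01 / math.sqrt(avg_sq + 1e-8)
adagrad_lr = 0.01 / math.sqrt(adagrad_G + 1e-8)
print(rmsprop_lr, adagrad_lr)
```

After 1000 steps, the RMSProp step size is still close to $\eta$, while the Adagrad-style step size has shrunk by a factor of about $\sqrt{1000}$.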



### **Comparing Adagrad, Momentum, and RMSProp**:

1. **Adagrad vs Momentum**:
   - **Adagrad** is suited for problems with sparse data, where some parameters (features) are updated much less frequently. Each parameter's effective learning rate decreases as its squared gradients accumulate, so frequently updated parameters take smaller steps.
   - **Momentum**, on the other hand, doesn't adjust the learning rate for individual parameters but rather accelerates the gradient descent process by maintaining a velocity term. It is better for smooth convergence, especially when there are oscillations or noise in the gradients.
   
   **Why Adagrad can be better than Momentum**:
   - In tasks like NLP where features are sparse, **Adagrad** is more efficient because it adapts the learning rate to each parameter individually. In contrast, **Momentum** doesn’t focus on sparse data and might not handle varying learning rates as effectively.

2. **RMSProp vs Adagrad**:
   - **RMSProp** is an improvement over **Adagrad**. The key difference is that RMSProp introduces an exponentially weighted average of the squared gradients, so the learning rate doesn't decay too fast. This makes it more effective in long-term training scenarios.
   - **Adagrad** is good for sparse data but suffers from rapidly shrinking learning rates. **RMSProp** overcomes this issue and can be more efficient in scenarios where you need a more stable learning rate over time.

   **Why RMSProp is better than Adagrad**:
   - **RMSProp** is more stable than **Adagrad**, especially for non-stationary problems. The use of a decaying average of squared gradients helps prevent the learning rate from becoming too small over time, making the optimizer more reliable in the later stages of training.

### **Conclusion**:
- Use **Momentum** when you need smoother, faster convergence and to escape saddle points.
- Use **Adagrad** if you have sparse data where each feature’s importance varies, but be aware of the rapid learning rate decay.
- Use **RMSProp** if you want the benefits of Adagrad but with a more stable and effective learning rate over time, especially in non-stationary scenarios.