# Deep Learning Fundamentals - Part 2

## **7. Activation Functions**

### **What are they?**

An **activation function** is a mathematical transformation applied to a neuron’s weighted sum of inputs (its pre-activation) before the result is passed to the next layer.

**Without them** → a neural network would just be a stack of linear operations, equivalent to a single linear model (no matter how many layers).

**With them** → networks can learn **non-linear relationships**.
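
The collapse of stacked linear layers can be checked by hand in the scalar case. The sketch below is illustrative only: the weights `w1`, `b1`, `w2`, `b2` and the test inputs are arbitrary values chosen for the example, not taken from any real model.

```python
# Two stacked linear "layers" on a scalar input (weights are arbitrary illustrative values).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

# The same mapping written as ONE linear layer: w = w2*w1, b = w2*b1 + b2.
w, b = w2 * w1, w2 * b1 + b2

def single_linear_layer(x):
    return w * x + b

def with_relu(x):
    # Inserting a non-linearity between the layers breaks the equivalence.
    return w2 * max(0.0, w1 * x + b1) + b2

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
collapsed = all(abs(two_linear_layers(x) - single_linear_layer(x)) < 1e-12 for x in xs)
differs = any(abs(with_relu(x) - single_linear_layer(x)) > 1e-12 for x in xs)
print(collapsed, differs)
```

Without the ReLU, the two layers match a single linear layer on every input; with it, they no longer do.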

---

### **Why are they needed?**

1. **Non-linearity**: Real-world data often has complex patterns that linear models can’t capture.
2. **Feature learning**: Different neurons can activate differently for different patterns.
3. **Gradient flow**: Non-saturating functions such as ReLU have a gradient of exactly 1 for positive inputs, which helps mitigate vanishing gradients in deep networks.

---

### **Common Activation Functions in DL**

| Function       | Formula                                                    | Range          | Key Points                                                                                              |
| -------------- | ---------------------------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------- |
| **Sigmoid**    | $\sigma(x) = \frac{1}{1+e^{-x}}$                           | (0, 1)         | Smooth, good for probabilities, but suffers from **vanishing gradients** for large $\lvert x \rvert$    |
| **Tanh**       | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$             | (-1, 1)        | Zero-centered, but still saturates (vanishing gradients) for large $\lvert x \rvert$                    |
| **ReLU**       | $\text{ReLU}(x) = \max(0, x)$                              | \[0, ∞)        | Cheap to compute, reduces vanishing gradients, but can cause **dead neurons** (units stuck at zero)     |
| **Leaky ReLU** | $\max(0.01x, x)$                                           | (-∞, ∞)        | Small negative slope keeps a non-zero gradient for $x < 0$, which mitigates the dead neuron problem     |
| **Softmax**    | $\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$             | (0, 1), sum=1  | Converts logits to a probability distribution in multi-class problems                                   |
| **GELU**       | $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF | ≈ \[-0.17, ∞)  | Smooth, ReLU-like curve; popular in Transformer models                                                  |

**PyTorch Example:**

```python
import torch.nn as nn

act1 = nn.ReLU()
act2 = nn.Sigmoid()
act3 = nn.Tanh()
```
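
To make the formulas in the table concrete, here is a minimal pure-Python sketch of the same functions on scalars (and on a list for softmax). The helper names and the default slope of `0.01` are choices made for this illustration; in practice the `torch.nn` modules above would be used.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def softmax(zs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), math.tanh(0.0), relu(-2.0), leaky_relu(-2.0), gelu(-0.75))
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
# The sigmoid gradient peaks at x = 0 and shrinks quickly as |x| grows.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```

The last line shows the saturation behind the vanishing-gradient remark: the sigmoid's derivative is 0.25 at zero but several orders of magnitude smaller at `x = 10`.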

---

## **8. Optimizers**

### **What are they?**

Optimizers are algorithms that adjust **model parameters (weights & biases)** to minimize the loss function during training.

---

### **Function**

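1. Gradients of the loss with respect to every parameter are computed by **backpropagation** (in PyTorch, `loss.backward()`).
2. The optimizer then updates each parameter using its gradient, the learning rate, and sometimes extra terms such as momentum (in PyTorch, `optimizer.step()`).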
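
The two steps can be sketched without any framework as plain gradient descent on a one-parameter problem. Everything here is an illustrative assumption: the loss `(w - 3)^2`, the starting point, the learning rate, and the hand-written derivative that stands in for what backpropagation would compute.

```python
# Minimize loss(w) = (w - 3)^2 with plain gradient descent.
# The target 3.0, starting point, and learning rate are illustrative choices.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Analytic derivative: d/dw (w - 3)^2 = 2 * (w - 3).
    # For a real model, backpropagation computes this automatically.
    return 2.0 * (w - 3.0)

w = 0.0
lr = 0.1
for step in range(100):
    g = grad(w)      # step 1: compute the gradient
    w = w - lr * g   # step 2: move against the gradient, scaled by the learning rate

print(w, loss(w))
```

Optimizers such as Adam or RMSprop differ only in how step 2 turns the gradient into an update.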

---

### **Common Optimizers in PyTorch**

| Optimizer                           | Characteristics                                                   | Common Use                              |
| ----------------------------------- | ----------------------------------------------------------------- | --------------------------------------- |
| **SGD** (`torch.optim.SGD`)         | Simple, stable; can use momentum                                  | Small-to-medium problems, vision models |
| **Adam** (`torch.optim.Adam`)       | Adaptive per-parameter learning rates from running averages of gradients and squared gradients | Common default for NLP, CV, tabular |
| **AdamW** (`torch.optim.AdamW`)     | Adam with decoupled weight decay (applied directly to the weights, not added to the gradient) | Transformers, large models |
| **RMSprop** (`torch.optim.RMSprop`) | Scales the learning rate by a moving average of squared gradients | RNNs, noisy problems |
| **Adagrad** (`torch.optim.Adagrad`) | Scales the LR for each parameter by its accumulated squared gradients | Sparse data (NLP) |

**PyTorch Example:**

```python
import torch

# `model` is assumed to be an existing nn.Module defined elsewhere
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```

---

## **9. Loss Functions in Deep Learning**

### **What are they?**

A **loss function** measures how far the model’s prediction is from the actual target.
Lower loss → predictions closer to the targets on the data being evaluated.

---

### **Common Loss Functions in PyTorch**

#### **A. For Regression**

| Loss                      | PyTorch Class  | Notes                        |
| ------------------------- | -------------- | ---------------------------- |
| Mean Squared Error (MSE)  | `nn.MSELoss()` | Penalizes larger errors more |
| Mean Absolute Error (MAE) | `nn.L1Loss()`  | More robust to outliers than MSE |
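
A small worked example shows why MSE reacts more strongly to outliers than MAE. This is a pure-Python sketch of the two formulas with made-up numbers, not the PyTorch implementation.

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

targets = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]           # every prediction off by 0.5
with_outlier = [1.5, 2.5, 3.5, 14.0]   # last prediction off by 10

print(mse(clean, targets), mae(clean, targets))                # 0.25 0.5
print(mse(with_outlier, targets), mae(with_outlier, targets))  # 25.1875 2.875
```

One bad prediction multiplies the MSE by about 100 but the MAE by less than 6, because squaring amplifies large errors.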

#### **B. For Binary Classification**

| Loss                 | PyTorch Class            | Notes                                            |
| -------------------- | ------------------------ | ------------------------------------------------ |
| Binary Cross-Entropy | `nn.BCELoss()`           | Use when outputs are probabilities (sigmoid)     |
| BCE with Logits      | `nn.BCEWithLogitsLoss()` | Takes raw logits; more numerically stable (combines sigmoid + BCE) |
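
The stability benefit of fusing sigmoid and BCE can be seen in a pure-Python sketch of the arithmetic. This illustrates the general idea only and is not PyTorch's actual implementation; the function names and the example logit of `40.0` are assumptions made for the demonstration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_probability(p, y):
    # Plain binary cross-entropy on a probability p in (0, 1).
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_with_logits(x, y):
    # Algebraically the same loss, written on the raw logit x:
    # max(x, 0) - x*y + log(1 + exp(-|x|)). It never evaluates log(0).
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# For moderate logits the two forms agree.
print(bce_from_probability(sigmoid(2.0), 1.0), bce_with_logits(2.0, 1.0))

# For a large logit with the wrong label, sigmoid rounds to exactly 1.0 in float64,
# so the two-step version hits log(0), while the fused version stays finite.
x, y = 40.0, 0.0
try:
    naive = bce_from_probability(sigmoid(x), y)
except ValueError:
    naive = float("inf")
print(naive, bce_with_logits(x, y))
```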

#### **C. For Multi-class Classification**

| Loss                    | PyTorch Class           | Notes                            |
| ----------------------- | ----------------------- | -------------------------------- |
| Cross-Entropy Loss      | `nn.CrossEntropyLoss()` | Takes raw logits; combines `LogSoftmax` + NLL loss |
| Negative Log Likelihood | `nn.NLLLoss()`          | Used after `LogSoftmax` output   |
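
The relationship between the two losses can be verified directly. The snippet below is a minimal check on random illustrative data (the batch size, class count, and targets are arbitrary): applying `nn.CrossEntropyLoss` to raw logits should match `nn.NLLLoss` applied to `nn.LogSoftmax` outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # batch of 4 samples, 3 classes (random illustrative data)
targets = torch.tensor([0, 2, 1, 2])  # class indices, not one-hot vectors

# Cross-entropy applied directly to raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same computation in two explicit steps.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())
print(torch.allclose(ce, nll))
```

A practical consequence: adding a softmax layer before `nn.CrossEntropyLoss` applies softmax twice and should be avoided.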

#### **D. For Other Tasks**

| Task                 | Loss                     | PyTorch Class                                         |
| -------------------- | ------------------------ | ----------------------------------------------------- |
| Image segmentation   | Dice loss, cross-entropy | Custom implementation (Dice), `nn.CrossEntropyLoss()` |
| Embedding similarity | Cosine embedding loss    | `nn.CosineEmbeddingLoss()`                            |
| Metric learning      | Triplet margin loss      | `nn.TripletMarginLoss()`                              |

---

### **Quick PyTorch Example**

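```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model: 10 input features, one raw logit per sample
model = nn.Linear(10, 1)

# Example: binary classification
loss_fn = nn.BCEWithLogitsLoss()  # applies sigmoid internally, so the model outputs raw logits
optimizer = optim.Adam(model.parameters(), lr=0.001)
```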
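
For context, here is a hedged sketch of how an activation, a loss function, and an optimizer fit together in one training loop. The toy dataset, the layer sizes, the learning rate of `0.01`, and the 200 iterations are all hypothetical choices for illustration, not recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 10 features; label is 1 when the feature sum is positive.
X = torch.randn(64, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

# Small network with a ReLU activation; the last layer outputs raw logits.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(200):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    logits = model(X)            # forward pass
    loss = loss_fn(logits, y)    # compare logits with targets
    loss.backward()              # backpropagation: fills .grad on each parameter
    optimizer.step()             # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The order `zero_grad()` → forward → `backward()` → `step()` matters: PyTorch accumulates gradients by default, so skipping `zero_grad()` would mix gradients from different iterations.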

---

✅ **Summary Table**:

| **Concept**         | **Purpose**                    | **Example in PyTorch**           |
| ------------------- | ------------------------------ | -------------------------------- |
| Activation Function | Introduce non-linearity        | `nn.ReLU()`                      |
| Optimizer           | Update weights using gradients | `optim.Adam(model.parameters())` |
| Loss Function       | Measure prediction error       | `nn.CrossEntropyLoss()`          |

---