# Deep Learning Fundamentals - Part 3

---

## **1. Deciding the Number of Neurons**

### **A. Output Layer**

The number of neurons here is **fixed by the problem type**:

| **Problem Type**               | **Output Neurons**    | **Reason**                                   |
| ------------------------------ | --------------------- | -------------------------------------------- |
| **Regression (single target)** | 1                     | Predict a continuous value                   |
| **Regression (multi-target)**  | One neuron per target | Predict multiple continuous values           |
| **Binary classification**      | 1                     | Output probability (sigmoid)                 |
| **Multiclass classification**  | `n_classes`           | One neuron per class (softmax)               |
| **Multi-label classification** | `n_labels`            | Each neuron predicts independent probability |
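As a minimal sketch of the table above, the output width can be looked up from the problem type and passed to the final `nn.Linear`. The helper name `output_width`, its problem-type strings, and the hidden size of 32 are assumptions made for illustration; they are not part of PyTorch.

```python
import torch
import torch.nn as nn

def output_width(problem_type: str, n: int = 1) -> int:
    """Hypothetical helper: return the number of output neurons.

    `n` is the number of targets, classes, or labels, depending on the problem.
    """
    if problem_type in ("regression_single", "binary"):
        return 1
    if problem_type in ("regression_multi", "multiclass", "multilabel"):
        return n
    raise ValueError(f"unknown problem type: {problem_type}")

# Example: a 10-class classifier whose last hidden layer has 32 units (assumed).
head = nn.Linear(32, output_width("multiclass", n=10))
batch = torch.randn(4, 32)   # 4 samples of hidden features
logits = head(batch)         # shape: (4, 10), one score per class
```

The activation applied to these outputs is a separate decision and depends on the problem type.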

---

### **B. Hidden Layers**

There’s no universal formula, but here are practical guidelines:

#### **Rule-of-thumb approaches**

1. **Between input and output sizes**:

   * Start with a size somewhere between `input_features` and `output_neurons`.
   * E.g., with 50 input features and 1 output, start with about 32 neurons in the first hidden layer (64 is also a common starting point, although it sits slightly above the input size).
2. **Pyramid/Tapering structure**:

   * Larger first layer → gradually smaller layers (e.g., 128 → 64 → 32 → output).
3. **Power of 2 sizes**:

   * Sizes such as 32, 64, and 128 are a common convention that can align well with vectorized hardware; they are a convenience, not a requirement.
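The tapering rule of thumb can be sketched as a small builder function. The name `build_mlp` and the specific sizes (50 inputs, 128 → 64 → 32 hidden units, 1 output) are illustrative assumptions, not a recommended configuration.

```python
import torch
import torch.nn as nn

def build_mlp(in_features: int, hidden_sizes: list, out_features: int) -> nn.Sequential:
    """Hypothetical helper: stack Linear + ReLU blocks, then a linear output layer."""
    layers = []
    prev = in_features
    for size in hidden_sizes:
        layers += [nn.Linear(prev, size), nn.ReLU()]
        prev = size
    layers.append(nn.Linear(prev, out_features))  # no activation on the output here
    return nn.Sequential(*layers)

# Pyramid structure with power-of-2 widths: 50 -> 128 -> 64 -> 32 -> 1
model = build_mlp(50, [128, 64, 32], 1)
out = model(torch.randn(8, 50))  # batch of 8 samples
print(out.shape)                 # torch.Size([8, 1])
```

Keeping the widths in a list makes it easy to try a different shape without rewriting the model.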

#### **Data complexity approach**

* **Simple tabular regression/classification** → start with 1–3 hidden layers of 16–128 neurons.
* **Image classification (MLP)** → often needs hundreds to thousands of neurons in initial layers.
* **Sequence modeling** (RNN/Transformer feedforward) → 256–2048 neurons in dense parts.

📌 **Warning**: Too many neurons → overfitting, longer training. Too few → underfitting.
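One rough way to see how width affects model size is to count trainable parameters. The sketch below compares two single-hidden-layer networks on an assumed 50-feature input; parameter count is only a proxy for capacity, not a direct measure of overfitting.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Sum the number of elements in all trainable parameter tensors."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Assumed sizes for illustration: 50 inputs, 1 output.
narrow = nn.Sequential(nn.Linear(50, 16), nn.ReLU(), nn.Linear(16, 1))
wide = nn.Sequential(nn.Linear(50, 128), nn.ReLU(), nn.Linear(128, 1))

# Linear(in, out) holds in * out weights plus out biases.
print(count_parameters(narrow))  # 833  = (50*16 + 16) + (16*1 + 1)
print(count_parameters(wide))    # 6657 = (50*128 + 128) + (128*1 + 1)
```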

---

## **2. Deciding the Number of Layers**

### **Guidelines**

1. **Shallow networks (1–2 hidden layers)**:

   * Work well for simple regression/classification on tabular data.
2. **Moderately deep networks (3–8 layers)**:

   * Suitable for moderately complex patterns (structured + unstructured data).
3. **Deep networks (>8 layers)**:

   * Used in computer vision (CNNs), NLP (Transformers), generative models.

📌 **Practical advice**:

* Start simple → increase depth only if the model underfits.
* Use **validation loss** as a guide: if adding layers reduces validation loss, it’s helping; if not, it’s adding complexity without benefit.
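The advice above can be sketched as a small experiment: train the same kind of model at several depths and compare validation loss. Everything here is an assumption chosen for illustration (synthetic `sin` data, width 32, Adam with `lr=1e-2`, 200 full-batch epochs); real projects would use their own data, a proper train/validation split, and early stopping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (made up for illustration): y = sin(3x) + noise.
X = torch.rand(400, 1) * 2 - 1
y = torch.sin(3 * X) + 0.05 * torch.randn(400, 1)
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

def make_model(n_hidden_layers: int, width: int = 32) -> nn.Sequential:
    layers, prev = [], 1
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, 1))
    return nn.Sequential(*layers)

def validation_loss(n_hidden_layers: int, epochs: int = 200) -> float:
    model = make_model(n_hidden_layers)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()

# Compare depths; keep the shallowest model whose validation loss is acceptable.
results = {depth: validation_loss(depth) for depth in (1, 2, 3)}
for depth, loss in results.items():
    print(f"{depth} hidden layer(s): validation MSE = {loss:.4f}")
```

If a deeper model does not lower the validation loss meaningfully, the shallower one is usually the better choice.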

---

## **3. Deciding the Activation Functions**

### **A. Hidden Layers**

* **Default choice**: **ReLU**

  * Fast, works well for most deep networks.
* **Leaky ReLU / Parametric ReLU**:

  * If worried about **dead ReLUs** (neurons stuck at 0 output).
* **Tanh**:

  * Useful when inputs are zero-centered and the network is small.
* **GELU**:

  * Popular in Transformer-based models.
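To make the differences concrete, the sketch below applies each activation to the same small tensor. The input values and the `negative_slope=0.01` setting are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_out = nn.ReLU()(x)                           # negatives are clipped to 0
leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)  # negatives are scaled by 0.01
tanh_out = nn.Tanh()(x)                           # squashed into (-1, 1)
gelu_out = nn.GELU()(x)                           # smooth curve, small negative outputs allowed

print(relu_out)   # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(leaky_out)  # negatives become -0.02 and -0.005 instead of 0
print(tanh_out)
print(gelu_out)
```

Because Leaky ReLU keeps a small nonzero slope for negative inputs, gradients can still flow through units that would otherwise be stuck at 0.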

---

### **B. Output Layer** (based on problem type)

| **Problem Type**                                   | **Output Activation**                        | **Reason**                                  |
| -------------------------------------------------- | -------------------------------------------- | ------------------------------------------- |
| **Regression (unbounded)**                         | None (linear activation)                     | Allows predicting any real number           |
| **Regression (bounded, e.g., 0–1)**                | Sigmoid                                      | Constrains output                           |
| **Binary classification**                          | Sigmoid                                      | Outputs probability for one class           |
| **Multiclass classification**                      | Softmax                                      | Converts logits to probability distribution |
| **Multi-label classification**                     | Sigmoid (per label)                          | Each label independent                      |
| **Sequence-to-sequence (e.g., language modeling)** | Softmax at each time step                    | Predicts token probabilities                |
| **Autoencoders (decoder output)**                  | Sigmoid (if normalized 0–1) / Tanh (-1 to 1) | Matches data scale                          |
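The difference between softmax and per-label sigmoid is easiest to see on the same logits. This is a minimal sketch with made-up logit values.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.1]])  # one sample, three raw scores

# Softmax: the three outputs compete and sum to 1 (multiclass).
softmax_probs = torch.softmax(logits, dim=1)

# Sigmoid: each output is an independent probability (multi-label),
# so the values do not need to sum to 1.
sigmoid_probs = torch.sigmoid(logits)

print(softmax_probs, softmax_probs.sum(dim=1))
print(sigmoid_probs, sigmoid_probs.sum(dim=1))
```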

---

### **Quick PyTorch Reference Table**

| **Problem**                | **Hidden Layer Activation** | **Output Activation** | **Loss Function**                                 |
| -------------------------- | --------------------------- | --------------------- | ------------------------------------------------- |
| Regression (any real)      | ReLU / Leaky ReLU           | None                  | `MSELoss()`                                       |
| Regression (0–1)           | ReLU / Leaky ReLU           | Sigmoid               | `MSELoss()` / `BCELoss()`                         |
| Binary Classification      | ReLU / Leaky ReLU           | Sigmoid               | `BCEWithLogitsLoss()` *(omit Sigmoid in forward)* |
| Multiclass Classification  | ReLU / Leaky ReLU           | Softmax               | `CrossEntropyLoss()` *(omit Softmax in forward)*  |
| Multi-label Classification | ReLU / Leaky ReLU           | Sigmoid               | `BCEWithLogitsLoss()` *(omit Sigmoid in forward)* |
| Sequence Modeling          | ReLU / GELU                 | Softmax               | `CrossEntropyLoss()` *(omit Softmax in forward)*  |
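The "omit Sigmoid/Softmax in forward" notes exist because these losses apply the activation internally. The sketch below checks this numerically on random, made-up logits and targets: `BCEWithLogitsLoss` on logits matches `BCELoss` on sigmoid outputs, and `CrossEntropyLoss` on logits matches `NLLLoss` on log-softmax outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Binary case: raw logits and 0/1 targets (illustrative values).
logits = torch.randn(5, 1)
targets = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0]])

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Multiclass case: 3 classes, integer class indices as targets.
class_logits = torch.randn(5, 3)
class_targets = torch.tensor([0, 2, 1, 1, 0])

ce = nn.CrossEntropyLoss()(class_logits, class_targets)
nll = nn.NLLLoss()(torch.log_softmax(class_logits, dim=1), class_targets)

print(loss_from_logits.item(), loss_from_probs.item())  # equal up to float error
print(ce.item(), nll.item())                            # equal up to float error
```

Applying Sigmoid or Softmax in the model and then using the logits-based loss would apply the activation twice, which silently distorts training.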

---

### **Example Architecture Choices**

**Binary classification (tabular)**

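```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(30, 64),   # 30 input features
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),    # single output neuron
    nn.Sigmoid()         # outputs a probability; pair with nn.BCELoss(), not BCEWithLogitsLoss()
)
```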
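The reference table recommends `BCEWithLogitsLoss()` with Sigmoid omitted from the forward pass. A minimal sketch of that variant follows, with one training step on random data; the batch size, learning rate, and the name `model_logits` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same architecture, but the last layer returns raw logits (no Sigmoid).
model_logits = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

loss_fn = nn.BCEWithLogitsLoss()  # applies sigmoid internally
optimizer = torch.optim.Adam(model_logits.parameters(), lr=1e-3)

# Random stand-in data: 16 samples, 30 features, float 0/1 labels.
X = torch.randn(16, 30)
y = torch.randint(0, 2, (16, 1)).float()

# One training step.
optimizer.zero_grad()
loss = loss_fn(model_logits(X), y)
loss.backward()
optimizer.step()

# At inference time, apply sigmoid explicitly to get probabilities.
model_logits.eval()
with torch.no_grad():
    probs = torch.sigmoid(model_logits(X))
```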

**Multiclass classification (10 classes)**

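```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),  # 784 input features, e.g. a flattened 28x28 image
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)    # raw logits, one per class
)  # no Softmax layer: nn.CrossEntropyLoss() applies log-softmax internally
```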
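Since the multiclass model returns logits, probabilities and predicted classes have to be derived explicitly at inference time. A minimal sketch, using a random batch as a stand-in for real inputs (the name `classifier` is an assumption):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

classifier = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # logits
)

classifier.eval()
with torch.no_grad():
    logits = classifier(torch.randn(4, 784))  # 4 random stand-in samples
    probs = torch.softmax(logits, dim=1)      # each row sums to 1
    preds = probs.argmax(dim=1)               # predicted class index per sample

print(probs.shape, preds.shape)  # torch.Size([4, 10]) torch.Size([4])
```

Taking `argmax` directly on the logits gives the same predictions, because softmax preserves the ordering of the scores.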

---