```{contents}
```


# Training & Hyperparameter Tuning

## Training

Training a feedforward neural network (FNN) means **learning optimal parameters (weights and biases)** so that the network can map input data $x$ to the correct outputs $y$.

Training follows a well-defined workflow:

---

### Initialize Parameters

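* Initialize the weights $W^{(l)}$ with small random values, not all zeros. If all weights start equal, every neuron in a layer receives the same gradient and the neurons never learn different features.
* Initialize the biases $b^{(l)} = 0$.
* Common weight initializations:

  * **Xavier (Glorot)** for sigmoid/tanh
  * **He initialization** for ReLU

Scaling the initial weights to the layer size helps avoid **vanishing/exploding gradients**.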
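
The two schemes above can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the function name `init_layer`, the choice of the uniform variant for Xavier and the normal variant for He, and the fixed seed are all assumptions made for the example.

```python
import math
import random

def init_layer(fan_in, fan_out, scheme="he", rng=None):
    """Return (W, b) for one layer; W has fan_out rows and fan_in columns."""
    if rng is None:
        rng = random.Random(0)  # fixed seed, only to make the example repeatable
    if scheme == "xavier":
        # Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
        limit = math.sqrt(6.0 / (fan_in + fan_out))
        draw = lambda: rng.uniform(-limit, limit)
    elif scheme == "he":
        # He normal: N(0, std^2) with std = sqrt(2 / fan_in)
        std = math.sqrt(2.0 / fan_in)
        draw = lambda: rng.gauss(0.0, std)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    W = [[draw() for _ in range(fan_in)] for _ in range(fan_out)]
    b = [0.0] * fan_out  # biases start at zero
    return W, b

W1, b1 = init_layer(4, 3, scheme="xavier")  # layer with 4 inputs and 3 units
```

Both schemes shrink the weight scale as the number of incoming connections grows, which keeps the size of the activations roughly constant from layer to layer.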

---

### Forward Propagation

Each layer computes:
$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$
$$
a^{(l)} = f(z^{(l)})
$$

Here $f$ is the layer's activation function and $a^{(0)} = x$ is the input. The final layer output $a^{(L)} = \hat{y}$ is the model’s **prediction**.
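
As a small sketch of the layer equations above, the following plain-Python function applies $z = Wa + b$ and $a = f(z)$ layer by layer. The helper names (`relu`, `identity`, `forward`) and the example weights are made up for illustration.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def forward(x, layers):
    """layers is a list of (W, b, f) tuples; returns the final activation."""
    a = list(x)  # the input acts as the activation of "layer 0"
    for W, b, f in layers:
        # z = W a + b, computed one row (one neuron) at a time
        z = [sum(w * v for w, v in zip(row, a)) + bias for row, bias in zip(W, b)]
        a = f(z)
    return a

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -2.0], relu),  # hidden layer: 2 inputs -> 2 units
    ([[2.0, 1.0]], [0.1], identity),                 # output layer: 2 units -> 1 output
]
y_hat = forward([2.0, 1.0], layers)
```

With these weights the hidden layer produces pre-activations 1 and -0.5, ReLU zeroes the second one, and the output neuron returns a value of about 2.1.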

---

### Compute Loss

The **loss function** measures how far predictions are from true outputs:
$$
L = \text{Loss}(y, \hat{y})
$$

Examples:

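* **Regression:** Mean Squared Error (MSE)
  $$
  L = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2
  $$
* **Classification:** Cross-Entropy Loss
  $$
  L = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})
  $$

Here $m$ is the number of training samples and $K$ is the number of classes, with the labels $y_{ik}$ one-hot encoded.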
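
The two losses above can be written directly in Python. This is a sketch that assumes one-hot labels and predicted class probabilities for the cross-entropy case; the small `eps` clamp that avoids `log(0)` is an addition for numerical safety and is not part of the formula.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over m samples."""
    m = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / m

def cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true: one-hot rows; y_pred: rows of predicted class probabilities."""
    m = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        # clamp each probability at eps so that log(0) cannot occur
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / m

regression_loss = mse([1.0, 2.0], [1.0, 4.0])
classification_loss = cross_entropy([[1, 0]], [[0.5, 0.5]])
```

For the regression example the squared errors are 0 and 4, so the mean is 2. For the classification example the true class has probability 0.5, so the loss is $-\log(0.5) = \log 2$.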

---

### Backpropagation (Gradient Computation)

Compute the partial derivatives of the loss $L$ with respect to all parameters using the **chain rule**:

$$
\frac{\partial L}{\partial W^{(l)}}, \quad \frac{\partial L}{\partial b^{(l)}}
$$

The error signal flows backward:

* Output → hidden layers → input.
* Each layer receives a gradient that reflects its parameters' contribution to the total loss.

This is known as **backpropagation**.
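
For reference, applying the chain rule to the layer equations $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and $a^{(l)} = f(z^{(l)})$ gives the standard recursion below, written for a single training example and assuming $f$ acts elementwise. The symbol $\delta^{(l)}$ (the gradient of the loss with respect to $z^{(l)}$) and the elementwise product $\odot$ are notation introduced here for the sketch.

$$
\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot f'(z^{(L)})
$$
$$
\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot f'(z^{(l)})
$$
$$
\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^{\top}, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}
$$

The second line is what "flowing backward" means in practice: each layer's error term is computed from the error term of the layer after it.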

---

### Parameter Update

Weights and biases are updated using **gradient descent**:

$$
W^{(l)} := W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}
$$
$$
b^{(l)} := b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}
$$

Where:

* $\eta$ is the **learning rate** (step size), which determines how large each parameter update is.
* Too large a value can make the loss diverge; too small a value makes training slow.
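
To make the role of $\eta$ concrete, here is a minimal sketch of the update rule on a one-parameter toy loss. The loss $L(w) = (w - 3)^2$, the function name `gradient_descent`, and the two learning rates are chosen purely for illustration.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply w := w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
grad = lambda w: 2.0 * (w - 3.0)

w_small_step = gradient_descent(grad, w0=0.0, eta=0.1)  # moves steadily toward 3
w_large_step = gradient_descent(grad, w0=0.0, eta=1.1)  # overshoots further on every step
```

With `eta=0.1` the distance to the minimum shrinks by a factor of 0.8 per step. With `eta=1.1` it grows by a factor of 1.2 per step, so the same rule diverges.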

---

### Iterate (Epochs)

Repeat:

* Forward pass
* Loss computation
* Backpropagation
* Weight update

for multiple **epochs** (full passes through training data).
Training continues until:

* Loss converges (stabilizes), or
* Validation performance stops improving.
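
The loop above can be sketched end to end on the simplest possible model, a single linear neuron $\hat{y} = wx + b$ trained with MSE. The data, the learning rate, and the convergence tolerance `tol` are illustrative assumptions; a real FNN would replace the two hand-written gradient lines with backpropagation.

```python
def train_linear(xs, ys, eta=0.05, max_epochs=5000, tol=1e-12):
    """Full-batch gradient descent for y_hat = w * x + b; returns (w, b, loss history)."""
    w, b = 0.0, 0.0
    m = len(xs)
    history = []
    for epoch in range(max_epochs):
        preds = [w * x + b for x in xs]                           # forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / m   # loss computation (MSE)
        history.append(loss)
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break                                                 # loss has converged
        dw = 2.0 / m * sum((p - y) * x for p, y, x in zip(preds, ys, xs))  # gradients
        db = 2.0 / m * sum(p - y for p, y in zip(preds, ys))
        w -= eta * dw                                             # parameter update
        b -= eta * db
    return w, b, history

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]   # data generated from y = 2x + 1
w, b, history = train_linear(xs, ys)
```

Each pass through the `for` loop is one epoch, and the learned `w` and `b` approach the values 2 and 1 used to generate the data.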

---

### Evaluate and Validate

After each epoch:

* Compute **validation loss** to monitor overfitting.
* If validation loss increases while training loss decreases → model is overfitting.

Use early stopping to halt training once the validation loss stops improving.

---

**Key Intuition**

| Step            | Purpose             |
| --------------- | ------------------- |
| Forward pass    | Predict output      |
| Loss            | Measure error       |
| Backpropagation | Compute gradients   |
| Optimization    | Adjust parameters   |
| Validation      | Detect overfitting  |

---

## Hyperparameter Tuning Strategies

Hyperparameters are **external settings** (not learned) that control model behavior and training efficiency.

---

### Common Hyperparameters in FNN

| Category           | Hyperparameter          | Description                                 |
| ------------------ | ----------------------- | ------------------------------------------- |
| **Architecture**   | Number of hidden layers | Controls model depth                        |
|                    | Neurons per layer       | Controls capacity                           |
|                    | Activation functions    | Controls non-linearity                      |
| **Training**       | Learning rate           | Step size for gradient descent              |
|                    | Batch size              | Number of samples per gradient update       |
|                    | Epochs                  | How long to train                           |
|                    | Optimizer               | Update rule (SGD, Adam, RMSProp)            |
| **Regularization** | Dropout rate            | Fraction of neurons dropped during training |
|                    | L1/L2 penalty           | Penalizes large weights                     |
| **Initialization** | Weight scheme           | He, Xavier, random uniform                  |
| **Scheduler**      | Learning rate decay     | Dynamically adjust learning rate            |

---

### Tuning Strategies

#### (a) Grid Search

* Try all combinations of hyperparameters.
* Example:

  ```text
  learning_rate = [0.01, 0.001]
  hidden_layers = [2, 3, 4]
  neurons = [64, 128, 256]
  ```
* Train a model for each combination and pick the one with the best validation accuracy.
* **Pros:** Exhaustive
* **Cons:** Computationally expensive; the small grid above already requires $2 \times 3 \times 3 = 18$ training runs.
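
A minimal sketch of grid search over the grid shown above, using `itertools.product` from the standard library. The `evaluate` callback stands in for "train a model and return its validation accuracy"; `fake_validation_accuracy` is a made-up scoring function so that the example runs without training anything.

```python
from itertools import product

grid = {
    "learning_rate": [0.01, 0.001],
    "hidden_layers": [2, 3, 4],
    "neurons": [64, 128, 256],
}

def grid_search(grid, evaluate):
    """Evaluate every combination; return (best_config, best_score, n_trials)."""
    names = list(grid)
    best_config, best_score, n_trials = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # in practice: train, then measure validation accuracy
        n_trials += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_trials

def fake_validation_accuracy(config):
    """Hypothetical stand-in that peaks at 3 layers, 128 neurons, and a small learning rate."""
    depth_penalty = 0.05 * abs(config["hidden_layers"] - 3)
    width_penalty = abs(config["neurons"] - 128) / 2560
    return 0.9 - depth_penalty - width_penalty - config["learning_rate"]

best_config, best_score, n_trials = grid_search(grid, fake_validation_accuracy)
```

The number of trials is the product of the list lengths, which is why adding one more hyperparameter or a few more values per list quickly makes the search expensive.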

---

#### (b) Random Search

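* Sample hyperparameter combinations at random from specified ranges or distributions.
* Usually faster than grid search and often as effective, because only a few hyperparameters tend to matter and random sampling tries more distinct values of each.
* **Good first step** for coarse tuning.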
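
Random search can be sketched in a few lines. The ranges below, the log-uniform sampling of the learning rate, and the helper names `sample_config` and `random_search` are assumptions for illustration; `evaluate` again stands in for training a model and returning a validation score.

```python
import random

def sample_config(rng):
    """Draw one configuration; the ranges are illustrative, not recommendations."""
    return {
        # log-uniform over roughly [1e-4, 1e-1]: every decade is equally likely
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "hidden_layers": rng.randint(2, 4),      # inclusive on both ends
        "neurons": rng.choice([64, 128, 256]),
    }

def random_search(evaluate, n_trials=10, seed=0):
    """Sample n_trials configurations and return (best_config, all_trials)."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=evaluate), trials

# Hypothetical score that prefers learning rates close to 0.001.
score = lambda config: -abs(config["learning_rate"] - 0.001)
best, trials = random_search(score, n_trials=20)
```

Unlike a grid, the budget (`n_trials`) is chosen independently of how many hyperparameters are searched, and each trial tests a fresh value of the learning rate.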

---

#### (c) Bayesian Optimization

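* Builds a probabilistic model of validation performance across the hyperparameter space.
* Chooses the next set of hyperparameters based on the results of previous trials.
* Typically needs fewer trials than grid or random search, which matters when each training run is expensive.
* Tools: **Optuna**, **Hyperopt**, **Ray Tune**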

---

#### (d) Learning Rate Scheduling

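* Start with a higher learning rate and gradually reduce it during training.
* Schedulers:

  * **Step Decay**: multiply the learning rate by a fixed factor every few epochs.
  * **Exponential Decay**: shrink the learning rate by a constant factor every epoch.
  * **ReduceLROnPlateau**: reduce the learning rate when the validation metric stops improving.
* Large early steps make fast progress; smaller later steps let the loss settle near a minimum.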
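
The three schedulers can be sketched as small functions. These are simplified illustrations with made-up default values; in particular, `reduce_on_plateau` only captures the idea behind the framework callbacks of that name and does not reproduce any library's exact behavior.

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the initial rate by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)

def exponential_decay(eta0, epoch, k=0.05):
    """Smooth decay: eta0 * exp(-k * epoch)."""
    return eta0 * math.exp(-k * epoch)

def reduce_on_plateau(eta, val_losses, factor=0.1, patience=3):
    """Shrink the rate if the last `patience` epochs did not beat the earlier best loss."""
    if len(val_losses) > patience:
        recent_best = min(val_losses[-patience:])
        earlier_best = min(val_losses[:-patience])
        if recent_best >= earlier_best:
            return eta * factor
    return eta

step_schedule = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
plateau_rate = reduce_on_plateau(0.1, [1.0, 0.8, 0.9, 0.85, 0.81])
```

Step and exponential decay depend only on the epoch number, while the plateau rule reacts to the validation curve, so it needs no schedule to be chosen in advance.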

---

#### (e) Early Stopping

* Monitor validation loss.
* Stop training when it stops improving.
* Prevents overfitting and saves computation.
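
A minimal sketch of the stopping rule, applied to a recorded list of per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate) and `min_delta` (the smallest change that counts as an improvement) are common conventions assumed here, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch                    # stop; keep the weights from best_epoch
    return len(val_losses) - 1, best_epoch                  # patience was never exhausted

val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

Here the loss is lowest at epoch 3 (counting from 0) and fails to improve for the next three epochs, so training stops at epoch 6 and the model from epoch 3 is the one to keep.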

---

#### (f) Cross-Validation

* Split the data into $k$ folds (e.g., 5-fold).
* Train on $k-1$ folds and validate on the remaining one, rotating so that each fold serves as the validation set once.
* Average performance across folds for a more stable estimate.
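
The fold bookkeeping can be sketched as an index generator. This version splits the samples in their original order for clarity; shuffling the indices first is usually advisable and is left out here. The function names and the `train_and_score` callback are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so that sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score(train_idx, val_idx) over k folds."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
```

When tuning, `cross_validate` would be called once per hyperparameter configuration, which multiplies the training cost by $k$ in exchange for a less noisy score.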

---

### Regularization Techniques

To prevent overfitting:

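* **L2 regularization (weight decay):**
  Adds a penalty term to the loss, where $\lambda$ controls the penalty strength:
  $$
  L' = L + \lambda \sum W^2
  $$
* **Dropout:** Randomly deactivates a fraction of neurons on each training step; all neurons are used at inference time.
* **Batch Normalization:** Normalizes activations between layers; this mainly stabilizes training and can have a mild regularizing effect.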
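
The first two techniques can be sketched in plain Python. The dropout function below uses the "inverted" variant, which rescales the surviving activations by $1/(1 - \text{rate})$ so that nothing needs to change at inference time; that variant is an assumption of this sketch, since the text above does not specify one. The weights and the base loss of 0.5 are invented.

```python
import random

def l2_penalty(weights, lam):
    """lam * (sum of squared weights) over a list of weight matrices."""
    return lam * sum(w * w for W in weights for row in W for w in row)

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # no dropout outside of training
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [[[1.0, 2.0], [3.0, 4.0]]]                    # one 2x2 weight matrix
regularized_loss = 0.5 + l2_penalty(weights, lam=0.1)   # L' = L + lambda * sum(W^2)
dropped = dropout([1.0] * 8, rate=0.5, rng=random.Random(0))
```

The squared weights sum to 30, so with $\lambda = 0.1$ the penalty adds 3 to the loss. With a rate of 0.5, each entry of `dropped` is either 0 or 2.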

---

### Optimizer Tuning

| Optimizer    | Description                       | Key Parameters             |
| ------------ | --------------------------------- | -------------------------- |
| **SGD**      | Basic gradient descent            | Learning rate              |
| **Momentum** | Accelerates SGD                   | Momentum factor            |
| **RMSProp**  | Adaptive learning rate per weight | Decay rate                 |
| **Adam**     | Combines RMSProp + Momentum       | $\eta, \beta_1, \beta_2$ |

**Adam** is usually a good default choice, since it tends to work reasonably well without much tuning of its parameters.
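
To show where the key parameters in the table enter, here is a sketch of the momentum and Adam update rules for a single scalar parameter. The default values follow common conventions, and the toy loss $L(w) = (w - 3)^2$ and the step count are chosen only for illustration.

```python
import math

def momentum_step(w, velocity, grad, eta=0.01, beta=0.9):
    """SGD with momentum: the velocity accumulates a decaying sum of past gradients."""
    velocity = beta * velocity - eta * grad
    return w + velocity, velocity

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (RMSProp-like average)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy loss L(w) = (w - 3)^2 with Adam.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, m, v, grad=2.0 * (w - 3.0), t=t, eta=0.01)
```

Because Adam divides by the running gradient magnitude, its first step has size close to $\eta$ regardless of how large the gradient is, which is one reason it is less sensitive to the scale of the loss than plain SGD.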

---

### Practical Workflow for Tuning

1. Start with baseline FNN (simple structure).
2. Tune learning rate → use learning rate finder.
3. Tune hidden layers & neurons → increase until overfitting.
4. Add regularization (dropout/L2).
5. Try optimizers (Adam, RMSProp).
6. Fine-tune batch size and epochs.
7. Use validation metrics for comparison.

---

### Example Workflow Summary

| Phase                     | Action                            | Purpose                    |
| ------------------------- | --------------------------------- | -------------------------- |
| **Initialization**        | Choose model structure            | Define architecture        |
| **Training**              | Forward + Backpropagation         | Fit data                   |
| **Validation**            | Monitor loss/accuracy             | Detect overfitting         |
| **Hyperparameter Search** | Adjust learning rate, depth, etc. | Improve performance        |
| **Regularization**        | Apply dropout, L2                 | Increase generalization    |
| **Final Evaluation**      | Test data                         | Assess real-world accuracy |

---

### Key Principles

* ✅ Start simple, scale up gradually
* ✅ Use validation loss as your tuning signal
* ✅ Use **learning rate warm-up & decay**
* ✅ Apply dropout and normalization in deeper models
* ✅ Automate search using Optuna or KerasTuner

---

### In One Line:

> Training an FNN means optimizing its weights through forward and backward passes, while **hyperparameter tuning** is the process of adjusting external configurations — like learning rate, depth, and regularization — to achieve the best generalization performance.
