```{contents}
```

# Hyperparameter Tuning

---

## What is Hyperparameter Tuning

**Hyperparameter tuning** means selecting the *configuration* (learning rate, number of hidden layers, activation function, etc.) that gives an artificial neural network (ANN) the best performance on a given task.

Unlike model **parameters** (weights and biases), which are learned during training,
**hyperparameters** are chosen *before training* and directly affect:

* Learning efficiency
* Model capacity
* Generalization to unseen data
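
To make the distinction concrete, here is a minimal sketch in plain Python. The toy data, the constants `LEARNING_RATE` and `EPOCHS`, and the one-feature linear model are all assumptions chosen for illustration; the point is only that the hyperparameters are fixed up front while `w` and `b` are learned.

```python
# Hyperparameters: chosen by us before training starts.
LEARNING_RATE = 0.05
EPOCHS = 500

# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

# Parameters: start at arbitrary values and are learned from the data.
w, b = 0.0, 0.0

for _ in range(EPOCHS):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= LEARNING_RATE * grad_w
    b -= LEARNING_RATE * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing `LEARNING_RATE` or `EPOCHS` changes how quickly (or whether) the loop converges, which is the kind of choice tuning is meant to settle.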

---

## Key ANN Problem Types

| Problem Type                      | Goal                              | Example Outputs                       | Common Loss Function               |
| --------------------------------- | --------------------------------- | ------------------------------------- | ---------------------------------- |
| **Classification**                | Predict discrete categories       | Spam / not spam, dog / cat            | Cross-Entropy Loss                 |
| **Regression**                    | Predict continuous values         | House price, stock value              | Mean Squared Error (MSE)           |
| **Representation / Unsupervised** | Learn structure in unlabeled data | Feature extraction, anomaly detection | Reconstruction loss (Autoencoders) |

Each type requires different **hyperparameters**, **architecture**, and **evaluation metrics**.

---

## Core Hyperparameters

### Architecture-Related

| Hyperparameter            | Meaning              | Typical Range                             |
| ------------------------- | -------------------- | ----------------------------------------- |
| **Hidden layers**         | Depth of the network | 1–10 for simple models, 10+ for deep nets |
| **Neurons per layer**     | Capacity per layer   | 16–1024                                   |
| **Activation functions**  | Non-linearity type   | ReLU, Leaky ReLU, Sigmoid, Tanh           |
| **Initialization method** | How weights start    | He normal (ReLU), Xavier (Tanh/Sigmoid)   |
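
As a rough illustration of the two initialization schemes in the table, the sketch below computes the standard deviation each one uses for normally distributed weights. The function names and layer sizes are hypothetical; in practice the framework's built-in initializers would normally be used.

```python
import math
import random
from statistics import pstdev

def he_normal_std(fan_in):
    """He normal: variance 2 / fan_in, commonly paired with ReLU."""
    return math.sqrt(2.0 / fan_in)

def xavier_normal_std(fan_in, fan_out):
    """Xavier (Glorot) normal: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_in x fan_out weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)
fan_in, fan_out = 256, 128  # illustrative layer sizes

w_he = init_weights(fan_in, fan_out, he_normal_std(fan_in), rng)
w_xavier = init_weights(fan_in, fan_out, xavier_normal_std(fan_in, fan_out), rng)

# Empirical spread of the sampled weights, for comparison with the targets.
he_spread = pstdev(v for row in w_he for v in row)
xavier_spread = pstdev(v for row in w_xavier for v in row)
print(f"He std ~ {he_spread:.4f}, Xavier std ~ {xavier_spread:.4f}")
```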

---

### Optimization-Related

| Hyperparameter        | Purpose                         | Common Values      |
| --------------------- | ------------------------------- | ------------------ |
| **Learning rate (η)** | Step size for gradient descent  | 1e-4 to 1e-1       |
| **Optimizer**         | Controls how weights update     | Adam, RMSprop, SGD |
| **Batch size**        | Number of samples per update    | 16–256             |
| **Epochs**            | Number of full passes over data | 50–1000+           |
| **Momentum / β**      | Memory in gradient updates      | 0.8–0.99 (Adam's β₂ is typically 0.999) |
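
The following sketch shows how the learning rate and momentum enter the update rule, using the classical momentum form on a one-dimensional quadratic. The function `sgd_momentum` and the objective are assumptions made for illustration; real optimizers apply the same idea per weight.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with classical momentum on a single scalar weight."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # beta controls how much of the previous update is remembered;
        # lr scales the contribution of the current gradient.
        velocity = beta * velocity - lr * grad_fn(w)
        w += velocity
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); minimum at w = 3.
def grad(w):
    return 2 * (w - 3)

w_momentum = sgd_momentum(grad, w0=0.0, beta=0.9)
w_plain = sgd_momentum(grad, w0=0.0, beta=0.0)  # beta = 0 reduces to plain gradient descent
print(w_momentum, w_plain)
```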

---

### Regularization & Generalization

| Method                  | Hyperparameter       | Description                                     |
| ----------------------- | -------------------- | ----------------------------------------------- |
| **Dropout**             | Dropout rate         | Randomly deactivate neurons (0.2–0.5)           |
| **Weight decay (L2)**   | λ (penalty strength) | Penalizes large weights (typically 1e-5–1e-2)   |
| **Early stopping**      | Patience             | Stops when validation loss stops improving      |
| **Batch normalization** | -                    | Stabilizes learning by normalizing layer inputs |
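
To show what the patience hyperparameter controls, here is a small sketch of early stopping applied to a recorded list of validation losses. The helper `early_stopping_epoch` and the loss values are hypothetical; frameworks typically provide this as a callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `patience=3`, training stops after three consecutive epochs without improvement, and the weights from `best_epoch` are the ones worth keeping.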

---

### Task-Specific

| Problem Type                   | Key Hyperparameters                                               |
| ------------------------------ | ----------------------------------------------------------------- |
| **Classification**             | Output activation (Softmax/Sigmoid), loss (Cross-Entropy), learning rate |
| **Regression**                 | Output activation (Linear), loss (MSE/MAE), normalization         |
| **Autoencoder / Unsupervised** | Latent dimension size, reconstruction loss weight                 |

---

## Hyperparameter Tuning Strategies

### Manual / Empirical Search

Try combinations based on intuition.

> ✅ Simple but inefficient.
> Used in early experiments or for small models.

---

### Grid Search

Evaluate *every* combination of predefined hyperparameters.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# 3 x 2 x 2 x 2 = 24 combinations, each cross-validated below
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'learning_rate_init': [0.001, 0.01],
    'solver': ['adam', 'sgd'],
}

# X_train, y_train are assumed to be defined: training features and labels
grid = GridSearchCV(MLPClassifier(max_iter=300), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)  # best combination by mean cross-validated score
```

✅ Exhaustive over the grid
❌ Very slow: the number of combinations multiplies with every added hyperparameter
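
The cost can be estimated before running anything. The sketch below counts the combinations in the grid from the example and the number of cross-validation fits they imply; `cv_folds` and the extra `alpha` values are assumptions added to show the growth.

```python
from itertools import product

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "tanh"],
    "learning_rate_init": [0.001, 0.01],
    "solver": ["adam", "sgd"],
}
cv_folds = 3

# Every combination is one candidate model; each is trained once per fold.
combos = list(product(*param_grid.values()))
n_fits = len(combos) * cv_folds
print(len(combos), "combinations,", n_fits, "cross-validation fits")

# Adding one more hyperparameter with 4 values multiplies the work by 4.
bigger_grid = dict(param_grid, alpha=[1e-5, 1e-4, 1e-3, 1e-2])
n_bigger = len(list(product(*bigger_grid.values())))
print(n_bigger, "combinations after adding alpha")
```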

---

### Random Search

Samples a fixed number of combinations at random from specified ranges or distributions.

✅ Often finds near-optimal results faster than Grid Search.
❌ No coverage guarantee: good regions can be missed by chance.

```python
from sklearn.model_selection import RandomizedSearchCV
```
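
A fuller sketch of how `RandomizedSearchCV` might be used is shown here. The synthetic dataset from `make_classification` stands in for `X_train` / `y_train`, and the ranges, `n_iter`, and `max_iter` values are assumptions kept small so the example runs quickly; convergence warnings from the short training runs are expected.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for real training data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "tanh"],
    # Learning rates are drawn on a log scale instead of from a fixed list.
    "learning_rate_init": loguniform(1e-4, 1e-1),
}

search = RandomizedSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    param_distributions,
    n_iter=5,         # number of sampled configurations, independent of the space size
    cv=3,
    random_state=0,   # makes the sampled configurations reproducible
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Unlike a grid, the budget here is set directly by `n_iter`, so adding another hyperparameter does not by itself increase the number of fits.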

---

### Bayesian Optimization

Fits a *probabilistic model* of performance as a function of the hyperparameters and uses it to choose the most promising candidates to evaluate next.

✅ Very efficient for expensive training (deep models)
❌ More complex to implement

Libraries: `Optuna`, `scikit-optimize`, `Hyperopt`, `Ray Tune`

Example (Optuna):

```python
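import optuna
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Layer layouts are keyed by string: Optuna expects primitive categorical choices
LAYOUTS = {"64": (64,), "128": (128,), "128-64": (128, 64)}

def objective(trial):
    model = MLPClassifier(
        hidden_layer_sizes=LAYOUTS[trial.suggest_categorical("hidden_layer_sizes", list(LAYOUTS))],
        learning_rate_init=trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        activation=trial.suggest_categorical("activation", ["relu", "tanh"]),
    )
    return cross_val_score(model, X_train, y_train, cv=3).mean()  # X_train, y_train assumed defined

study = optuna.create_study(direction="maximize")  # maximize mean CV accuracy
study.optimize(objective, n_trials=20)
print(study.best_params)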
```

---

### Evolutionary Algorithms / Genetic Search

Borrows from biological evolution: a population of configurations is refined through mutation, crossover, and selection.

✅ Good for complex non-smooth search spaces.
❌ Computationally expensive.

Used in NAS (Neural Architecture Search).
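
The loop of selection, crossover, and mutation can be sketched in a few lines. Everything here is illustrative: `fitness` is a hypothetical stand-in for validation accuracy (it peaks at `lr=0.01`, `neurons=128`), and the population size, mutation steps, and ranges are assumptions rather than recommended settings.

```python
import math
import random

def fitness(config):
    """Toy stand-in for a validation score; higher is better."""
    lr, neurons = config
    return -((math.log10(lr) + 2) ** 2) - (math.log2(neurons) - 7) ** 2

def mutate(config, rng):
    """Perturb the learning rate on a log scale and halve/keep/double the neuron count."""
    lr, neurons = config
    lr = min(1e-1, max(1e-4, lr * 10 ** rng.uniform(-0.5, 0.5)))
    neurons = min(512, max(16, int(neurons * rng.choice([0.5, 1, 2]))))
    return lr, neurons

def crossover(a, b, rng):
    """Build a child by taking each hyperparameter from one of the two parents."""
    return rng.choice([a[0], b[0]]), rng.choice([a[1], b[1]])

def evolve(generations=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [(10 ** rng.uniform(-4, -1), rng.choice([16, 32, 64, 128, 256, 512]))
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: the top half survives
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness), best_per_generation

best, history = evolve()
print("best (lr, neurons):", best)
```

In a real search each call to `fitness` would train and validate a model, which is where the computational cost comes from.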

---

### Hyperband / BOHB

* **Hyperband**: Trains many configurations for a few epochs, stops the worst performers early, and gives the survivors a larger training budget.
* **BOHB**: Combines Bayesian Optimization + Hyperband.

✅ Efficient, scalable for large deep learning setups (e.g., CNN, Transformer).
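
The early-stopping idea behind Hyperband can be illustrated with successive halving, the subroutine it runs repeatedly with different starting budgets. This is a sketch of that subroutine only, not full Hyperband or BOHB; `toy_validation_loss` is a hypothetical replacement for actually training a model, and `eta=3` is an assumed reduction factor.

```python
import math
import random

def toy_validation_loss(config, epochs):
    """Hypothetical stand-in for 'train for `epochs` epochs and return validation loss'."""
    floor = abs(math.log10(config["lr"]) + 2)  # best achievable loss for this learning rate
    return floor + 1.0 / epochs                # more epochs move the loss toward its floor

def successive_halving(configs, min_epochs=1, eta=3):
    """Keep the best 1/eta of the configurations each round, with eta times more epochs."""
    survivors, epochs, schedule = list(configs), min_epochs, []
    while len(survivors) > 1:
        schedule.append((len(survivors), epochs))
        ranked = sorted(survivors, key=lambda c: toy_validation_loss(c, epochs))
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return survivors[0], schedule

rng = random.Random(0)
configs = [{"lr": 10 ** rng.uniform(-4, -1)} for _ in range(27)]
best, schedule = successive_halving(configs)
print(best, schedule)
```

Each entry of `schedule` is a pair of (configurations still alive, epochs given to each), which shows how most of the budget ends up on a few promising candidates.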

---

## Hyperparameter Tuning per Problem Type

| Problem Type                            | Objective Metric            | Example Tuned Parameters                          |
| --------------------------------------- | --------------------------- | ------------------------------------------------- |
| **Binary / Multi-class Classification** | Accuracy, F1-score, ROC-AUC | Hidden layers, learning rate, batch size, dropout |
| **Regression**                          | MAE, RMSE, R²               | Hidden neurons, learning rate, regularization     |
| **Autoencoders (Unsupervised)**         | Reconstruction loss (MSE)   | Bottleneck size, optimizer, activation            |
| **Time-Series Forecasting (RNN/LSTM)**  | MAPE, RMSE                  | Sequence length, learning rate, dropout           |
| **Image / Vision (CNN)**                | Accuracy, Top-5 Accuracy    | Filters, kernel size, learning rate, epochs       |
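
For the regression and forecasting rows, the objective metrics are simple to compute by hand. The sketch below defines MAE, RMSE, and MAPE in plain Python on made-up values; libraries such as scikit-learn provide equivalents, and MAPE as written assumes no true value is zero.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (undefined if any true value is 0)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(f"MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}  MAPE={mape(y_true, y_pred):.2f}%")
```

Whichever metric is chosen becomes the value the tuning strategy optimizes, so it should match what matters for the task.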

---

## Example: Hyperparameter Tuning for Classification (Keras)

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

def build_model(lr=0.001, dropout=0.3, neurons=64):
    """Binary classifier over 30 input features; lr, dropout, and neurons are the tunable hyperparameters."""
    model = Sequential([
        Input(shape=(30,)),  # explicit Input layer, preferred over passing input_dim to Dense
        Dense(neurons, activation='relu'),
        Dropout(dropout),
        Dense(1, activation='sigmoid')  # single sigmoid unit for binary classification
    ])
    model.compile(optimizer=Adam(learning_rate=lr), loss='binary_crossentropy', metrics=['accuracy'])
    return model
```

You can then wrap this with `KerasTuner` or `Optuna` to auto-tune `lr`, `dropout`, and `neurons`.

---

## Practical Best Practices

1. **Standardize inputs** (e.g., `StandardScaler`) before training, fitting the scaler on the training split only.
2. Use **early stopping** to prevent overfitting.
3. Monitor **validation metrics** (not only training).
4. Combine **Random + Bayesian** approaches for efficiency: a short random search to narrow the ranges, then Bayesian optimization within them.
5. Always record results (e.g., via TensorBoard or Weights & Biases).
6. Limit the search space; an overly large one wastes compute.
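
As an illustration of the first practice, the sketch below standardizes one feature using statistics from the training split only and then applies the same transform to validation data. The helper names and values are hypothetical; `StandardScaler` does the equivalent per feature column.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Compute mean and (population) standard deviation from the training data only."""
    sigma = pstdev(train_column)
    return mean(train_column), sigma if sigma > 0 else 1.0  # guard against constant features

def transform(column, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(x - mu) / sigma for x in column]

# Illustrative values only.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
val = [11.0, 20.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
val_scaled = transform(val, mu, sigma)  # reuses training statistics, no refitting

print(train_scaled)
print(val_scaled)
```

Fitting the scaler on the validation or test data as well would leak information about those splits into training and make validation metrics look better than they are.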

---

## Example Summary Table

| Category       | Hyperparameter    | Typical Range         | Purpose                |
| -------------- | ----------------- | --------------------- | ---------------------- |
| Architecture   | Hidden layers     | 1–5                   | Model complexity       |
| Architecture   | Neurons per layer | 16–512                | Representation power   |
| Learning       | Learning rate     | 1e-4–1e-2             | Convergence control    |
| Optimization   | Batch size        | 16–256                | Gradient stability     |
| Regularization | Dropout           | 0.2–0.5               | Prevent overfitting    |
| Regularization | L2 weight decay   | 1e-5–1e-2             | Penalize large weights |
| Training       | Epochs            | 50–1000               | Learning duration      |
| Activation     | Function          | ReLU / Tanh / Sigmoid | Nonlinearity           |
| Task           | Loss function     | CE / MSE              | Learning objective     |

---

**Summary Intuition**

* **Classification:** Adjust architecture + activation to separate categories.
* **Regression:** Control learning rate and regularization for smooth predictions.
* **Unsupervised:** Balance compression vs. reconstruction accuracy.
* **Tuning Goal:** Find the configuration that minimizes validation loss while still generalizing to held-out data.

