# Deep Learning Core Concepts: Epochs, Learning Rate, Parameters, Hyperparameters, and Cost Function

---

## Epochs

An **epoch** is one complete pass through the entire training dataset by the learning algorithm. If you have 10,000 training samples, one epoch means the network has seen all 10,000 samples once. Training usually requires multiple epochs to adjust the weights properly.

- **Batch:** Subset of training data used for one update of the network.
- **Iteration:** One update of the model’s weights using a batch.

**Formula:**
- Iterations per epoch (rounded up when the batch size does not divide the dataset evenly, so the last batch is smaller):
  $$
  \text{Iterations per epoch} = \left\lceil \frac{\text{Total samples}}{\text{Batch size}} \right\rceil
  $$
- Total iterations:
  $$
  \text{Total iterations} = \text{Iterations per epoch} \times \text{Epochs}
  $$

**Example:**
- Training samples = 10,000
- Batch size = 100
- Iterations per epoch = 10,000 / 100 = 100
- Total iterations for 20 epochs = 100 × 20 = 2,000
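
The arithmetic above can be checked with a small helper. This is a minimal sketch; the function names and the `drop_last` flag are illustrative assumptions, not part of any particular library.

```python
import math

def iterations_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of weight updates in one full pass over the data."""
    if drop_last:
        # Discard the final, smaller batch if the sizes do not divide evenly.
        return num_samples // batch_size
    # Otherwise the leftover samples form one extra, smaller batch.
    return math.ceil(num_samples / batch_size)

def total_iterations(num_samples: int, batch_size: int, epochs: int, drop_last: bool = False) -> int:
    """Total number of weight updates over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size, drop_last) * epochs

print(iterations_per_epoch(10_000, 100))   # 100
print(total_iterations(10_000, 100, 20))   # 2000
print(iterations_per_epoch(10_050, 100))   # 101 (the last batch holds 50 samples)
```

When the dataset size is not a multiple of the batch size, the count depends on whether the final partial batch is kept or dropped, which is why the sketch exposes that choice.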

**Why Multiple Epochs?**
- A single pass (1 epoch) is usually not enough for the model to learn properly.
- Multiple passes allow the model to gradually adjust weights using the gradients computed by backpropagation.
- Too few epochs can lead to underfitting; too many epochs can lead to overfitting.

**Weight Update per Iteration (Gradient Descent):**

These steps run once per batch, so they repeat many times within each epoch.

1. Compute predicted output:
   $$
   \hat{y} = f(X, W)
   $$
2. Compute loss:
   $$
   L = \text{Loss}(y, \hat{y})
   $$
3. Compute gradients:
   $$
   \frac{\partial L}{\partial W}
   $$
4. Update weights:
   $$
   W := W - \eta \frac{\partial L}{\partial W}
   $$
   where $\eta$ is the learning rate.
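
The four steps above can be sketched as a mini-batch training loop. This is only an illustration under stated assumptions: the model is a linear function $\hat{y} = XW + b$, the loss is mean squared error, and the data is synthetic. The function name, default values, and data are all hypothetical.

```python
import numpy as np

def train_linear_model(X, y, lr=0.1, epochs=20, batch_size=100, seed=0):
    """Mini-batch gradient descent for y_hat = X @ W + b with MSE loss (illustrative)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W = np.zeros(n_features)
    b = 0.0
    loss = float("nan")
    for epoch in range(epochs):
        order = rng.permutation(n_samples)               # reshuffle once per epoch
        for start in range(0, n_samples, batch_size):    # one iteration per batch
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            y_hat = X_batch @ W + b                      # 1. predicted output
            error = y_hat - y_batch
            loss = np.mean(error ** 2)                   # 2. loss (MSE on this batch)
            grad_W = 2.0 * X_batch.T @ error / len(idx)  # 3. gradients of the loss
            grad_b = 2.0 * np.mean(error)
            W = W - lr * grad_W                          # 4. weight update
            b = b - lr * grad_b
    # loss is the value measured on the last batch, before its update
    return W, b, loss

# Synthetic, noise-free data generated from known weights (assumed for the demo).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.5

W, b, loss = train_linear_model(X, y)
print(W, b, loss)
```

With 1,000 samples and a batch size of 100, each epoch performs 10 updates, so 20 epochs give 200 updates in total. On this toy data the learned `W` and `b` should end up close to the values used to generate it.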

**Summary Table:**

| Term      | Meaning                                            |
|-----------|----------------------------------------------------|
| Epoch     | One full pass through the dataset                  |
| Batch     | Subset of dataset for one weight update            |
| Iteration | One update of weights (one batch)                  |
| Relation  | Total iterations = (Samples / Batch size) × Epochs |

---

## Learning Rate

The **learning rate** ($\eta$ or $\alpha$) is a hyperparameter that controls how much we adjust the model’s weights during training. It determines the step size in the direction of the negative gradient.

**Weight Update Rule:**
$$
W := W - \eta \frac{\partial L}{\partial W}
$$

- Too high: may overshoot the minimum of the loss function (unstable training).
- Too low: training is slow and may stall on plateaus or in poor local minima before reaching a good solution.
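
Both failure modes can be seen on a toy problem. The sketch below assumes the loss is $L(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name and the three learning rates are chosen purely for illustration.

```python
def gradient_descent_on_square(w0: float, lr: float, steps: int) -> float:
    """Run plain gradient descent on L(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # dL/dw for L(w) = w**2
        w = w - lr * grad       # same update rule as above
    return w

for lr in (0.01, 0.1, 1.1):
    final_w = gradient_descent_on_square(w0=5.0, lr=lr, steps=20)
    print(f"lr={lr}: w after 20 steps = {final_w:.4f}")
```

On this function each step multiplies $w$ by $(1 - 2\eta)$. A small rate shrinks $w$ slowly, a moderate rate shrinks it quickly, and any rate above 1 makes $|w|$ grow with every step, which is the overshooting described above.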

**Example:**
- Weight = 2
- Gradient = 0.5
- Learning rate = 0.1
- Update: $W_{new} = 2 - 0.1 \times 0.5 = 1.95$

**Learning Rate Strategies:**
- Fixed LR: Keep $\eta$ constant.
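- Step Decay: Reduce $\eta$ by a fixed factor every few epochs.
- Exponential Decay: $\eta_t = \eta_0 e^{-kt}$, where $\eta_0$ is the initial learning rate, $k$ is the decay rate, and $t$ is the epoch (or iteration) index.
- Adaptive Optimizers: Adam, RMSProp, etc. scale the effective step size per weight using statistics of past gradients.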
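
The two decay schedules can be written as plain functions of the epoch index. This is a minimal sketch: the function names and the default values for `drop`, `epochs_per_drop`, and `k` are assumptions for illustration, not recommended settings.

```python
import math

def step_decay(lr0: float, epoch: int, drop: float = 0.5, epochs_per_drop: int = 10) -> float:
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0: float, t: int, k: float = 0.1) -> float:
    """Exponential decay: lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

for epoch in (0, 10, 20):
    print(epoch, step_decay(0.01, epoch), exponential_decay(0.01, epoch))
```

Step decay keeps the rate constant between drops, while exponential decay lowers it smoothly at every epoch.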

**Summary:**
- Learning rate = “step size” in gradient descent.
- Too high → overshoot, unstable. Too low → slow.
- Typical starting values are around 0.001 to 0.01, though the best value depends on the model, the optimizer, and the data.

---

## Parameters vs Hyperparameters

### Parameters
- Internal variables of a model learned from the training data.
- Define the behavior of the model.
- Updated automatically during training (e.g., weights, biases).

**Example:**
- Linear Regression: $y = Wx + b$ ($W$ and $b$ are parameters)
- Neural Network: All weights and biases in each layer
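
To make "all weights and biases in each layer" concrete, the sketch below counts the parameters of a fully connected network. It assumes every layer is dense with one bias per output unit. The function name and the layer sizes are hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input first,
    e.g. [4, 8, 8, 3] means 4 inputs, two hidden layers of 8, and 3 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out   # one weight per input-output connection
        biases = n_out           # one bias per output unit
        total += weights + biases
    return total

print(count_dense_parameters([1, 1]))         # 2: W and b in y = Wx + b
print(count_dense_parameters([4, 8, 8, 3]))   # 139 = (32 + 8) + (64 + 8) + (24 + 3)
```

The layer sizes themselves are choices made before training, while the 139 numbers they imply are what training adjusts.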

### Hyperparameters
- External configurations set before training.
- Not learned from the data.
- Control how the model is structured and trained (e.g., learning rate, batch size, number of epochs, number of layers, regularization parameter $\lambda$).

**Example:**
- Learning rate = 0.001
- Epochs = 50
- Batch size = 64
- Number of hidden layers = 3

**Table Summary:**

| Feature        | Parameter                        | Hyperparameter                          |
|----------------|----------------------------------|-----------------------------------------|
| Definition     | Learned by the model from data   | Set before training; controls training  |
| Updated by     | Training algorithm               | Not learned; fixed or scheduled by user |
| Examples       | Weights, biases                  | Learning rate, batch size, layers       |
| Role           | Defines model’s prediction       | Defines how the model learns            |
| Determined by  | Data                             | User or search/tuning                   |

**Intuition:**
- Parameters = the answers the model finds
- Hyperparameters = the rules that guide how the model finds answers
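
The "search/tuning" entry in the table often means trying several hyperparameter combinations and keeping the best one. The sketch below only enumerates the candidates of a simple grid search. The `search_space` values and the `grid` helper are hypothetical, and the step that trains and evaluates each configuration is deliberately left out.

```python
from itertools import product

# Hypothetical candidate values for three hyperparameters.
search_space = {
    "learning_rate": [0.001, 0.01],
    "batch_size": [32, 64],
    "hidden_layers": [2, 3],
}

def grid(space):
    """Yield every combination of the candidate values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(grid(search_space))
print(len(configs))   # 8 combinations (2 x 2 x 2)
print(configs[0])     # {'learning_rate': 0.001, 'batch_size': 32, 'hidden_layers': 2}
```

Each configuration would then be used for a separate training run, with the parameters learned from scratch every time.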

---

## Cost Function (Loss Function)

A **Cost Function** (or Loss Function) is a mathematical function that measures how “bad” the model is at making predictions. It quantifies the difference between the model’s predictions and the actual target values. The goal of training is to minimize the cost function by adjusting the model’s parameters.

**Why is it important?**
- Guides the learning: its gradient tells the algorithm whether to increase or decrease each weight.
- Smaller cost → predictions closer to the targets; larger cost → worse predictions.

### Common Cost Functions

#### a) Mean Squared Error (MSE) — Regression
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
- $y_i$ = actual value
- $\hat{y}_i$ = predicted value
- $n$ = number of samples

#### b) Cross-Entropy Loss — Classification
$$
L = -\sum_i y_i \log(\hat{y}_i)
$$
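- $y_i$ = 1 if class $i$ is the correct class, 0 otherwise (one-hot encoding)
- $\hat{y}_i$ = predicted probability for class $i$ (Softmax output)
- The sum runs over the classes of a single sample; the cost over a dataset is the average of $L$ across samples.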
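
A minimal NumPy sketch of this loss for one sample follows. The logits and the one-hot target are made-up values, and the `eps` clipping is an added safeguard against $\log(0)$ that is not part of the formula itself.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one sample with a one-hot target."""
    clipped = np.clip(y_pred, eps, 1.0)  # keep log() away from zero
    return -np.sum(y_true * np.log(clipped))

logits = np.array([2.0, 1.0, 0.1])       # hypothetical raw model outputs
y_true = np.array([1.0, 0.0, 0.0])       # the first class is the correct one
probs = softmax(logits)

print(probs)
print(cross_entropy(y_true, probs))
```

Because the target is one-hot, only the term for the correct class survives, so the loss equals $-\log$ of the probability assigned to that class.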

**How it Works in Training:**
1. Model predicts output $\hat{y}$
2. Compute cost $L$ (compare with true $y$)
3. Use gradient descent to update weights to reduce cost:
   $$
   W := W - \eta \frac{\partial L}{\partial W}
   $$

**Intuition:**
- Cost function = “score” of the model
- Lower score → better predictions
- Its gradient determines the direction of weight updates

**Example:**
- MSE: Actual = [2, 3], Predicted = [2.5, 2.0]
  $$
  \text{MSE} = \frac{(2-2.5)^2 + (3-2)^2}{2} = \frac{0.25 + 1}{2} = 0.625
  $$
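
The same calculation can be reproduced in a few lines of NumPy. The function name is an illustrative choice rather than a reference to a specific library.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

print(mean_squared_error([2, 3], [2.5, 2.0]))  # 0.625
```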

**Summary:**
- Cost Function = error calculator
- Guides how the model learns
- Examples: MSE (regression), Cross-Entropy (classification)
