```{contents}
```

## Intuition

A **Feedforward Neural Network (FNN)** models a **function $f(x; \theta)$** that maps an input vector $x \in \mathbb{R}^n$ to an output $y \in \mathbb{R}^m$.
The parameters $\theta = \{W, b\}$ (weights and biases) are **learned** such that:
$$
y \approx f(x; \theta)
$$

It’s called “feedforward” because information flows **one way**:
$$
x \rightarrow \text{hidden layers} \rightarrow \hat{y}
$$

---

### Mathematical Structure

An FNN is composed of **layers of neurons**, where each layer performs a simple operation:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$
$$
a^{(l)} = f(z^{(l)})
$$

where:

* $l$ = layer index
* $a^{(0)} = x$ (input)
* $W^{(l)}$: weight matrix for layer $l$
* $b^{(l)}$: bias vector for layer $l$
* $f(\cdot)$: activation function (non-linear, applied element-wise)

The final layer output $a^{(L)} = \hat{y}$ is the model’s **prediction**.
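The two layer equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`dense`, `tanh_vec`, `forward`), the choice of `tanh` as the activation, and the 2 → 3 → 1 weights are all assumptions made for the example.

```python
import math

def dense(W, b, a_prev):
    """Compute z = W a_prev + b for one layer (W is a list of rows)."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def tanh_vec(z):
    """Apply the activation element-wise (tanh is an assumed choice)."""
    return [math.tanh(z_i) for z_i in z]

def forward(layers, x):
    """Run x through each (W, b) pair in order and return a^(L)."""
    a = x                      # a^(0) = x
    for W, b in layers:
        z = dense(W, b, a)     # z^(l) = W^(l) a^(l-1) + b^(l)
        a = tanh_vec(z)        # a^(l) = f(z^(l))
    return a

# Hypothetical 2 -> 3 -> 1 network with hand-picked weights.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
y_hat = forward(layers, [1.0, 2.0])   # a list holding the single output
```

Each loop iteration is one layer, so adding depth only means appending another `(W, b)` pair to `layers`.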

---

### Step-by-Step Theoretical Intuition

Let’s break the FNN into **three conceptual blocks**:

---

#### Linear Transformation (Feature Extraction)

Each neuron computes a **weighted sum**:
$$
z_i = \sum_{j} w_{ij}x_j + b_i
$$

This is a **linear transformation** (strictly an affine one, once the bias $b$ is added):
It projects the input vector into another space, stretching, rotating, or shifting it.

In matrix form:
$$
z = W x + b
$$

🔹 **Interpretation**:
Each layer re-represents the data in a new coordinate system, learning directions (features) that best explain the data’s patterns.
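To make "stretching or rotating" concrete, here is a small sketch that applies $z = Wx + b$ with hand-picked matrices. The function name `linear` and the specific matrices are assumptions chosen for illustration.

```python
def linear(W, x, b):
    """Return z = W x + b, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

rotate_90 = [[0.0, -1.0], [1.0, 0.0]]   # rotates the plane by 90 degrees
stretch_x = [[2.0, 0.0], [0.0, 1.0]]    # doubles the first coordinate
no_bias = [0.0, 0.0]

x = [1.0, 0.0]
rotated = linear(rotate_90, x, no_bias)       # [0.0, 1.0]
stretched = linear(stretch_x, x, no_bias)     # [2.0, 0.0]
shifted = linear(stretch_x, x, [0.0, 3.0])    # [2.0, 3.0], the bias translates the result
```

In a trained network the rows of $W$ are not chosen by hand; they are the learned "directions" the interpretation above refers to.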

---

#### Non-Linearity (Feature Composition)

The activation function $f(z)$ (e.g. ReLU, Sigmoid, Tanh) adds **non-linearity**:
$$
a = f(z)
$$

Without this step, stacking multiple layers collapses into one linear map:
$$
W_3(W_2(W_1x)) = W'x
$$
→ still linear.
With non-linearity, the network can learn **curved, complex functions**.

🔹 **Interpretation**:
Non-linearity allows each layer to “bend” the input space — this lets the network model complex relationships like curved boundaries or hierarchies (edges → shapes → objects).
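The collapse of stacked linear layers can be checked numerically. The sketch below uses two bias-free layers with hypothetical weights: without an activation, they equal the single matrix $W' = W_2 W_1$; with a ReLU in between, the same weights compute $|x_1 - x_2|$, which no single linear map can do.

```python
def matvec(W, x):
    """Matrix-vector product, W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(z):
    return [max(0.0, z_i) for z_i in z]

W1 = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical first-layer weights
W2 = [[1.0, 1.0]]                 # hypothetical second-layer weights
x = [2.0, 0.5]

stacked = matvec(W2, matvec(W1, x))           # two linear layers: [0.0]
collapsed = matvec(matmul(W2, W1), x)         # one layer W' = W2 W1: [0.0]
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU in between: [1.5] = |2.0 - 0.5|
```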

---

#### Layer Composition (Hierarchy of Features)

Each successive layer composes previous transformations:

$$
f(x) = f^{(L)}\big(W^{(L)} f^{(L-1)}\big(\dots f^{(2)}\big(W^{(2)} f^{(1)}(W^{(1)}x + b^{(1)}) + b^{(2)}\big) \dots\big) + b^{(L)}\big)
$$

This **composition of non-linear functions** is the source of the network's expressive power: by the **Universal Approximation Theorem**, a sufficiently large network can approximate **any continuous function** on a compact domain to arbitrary accuracy.

🔹 **Interpretation**:
Each layer learns progressively abstract representations:

* Layer 1 → basic features (edges, correlations)
* Layer 2 → combinations (shapes, phrases)
* Layer 3 → abstract patterns (objects, meaning)

---

### Learning — Backpropagation + Gradient Descent

The network learns parameters $W, b$ that minimize the **loss** between the prediction $\hat{y}$ and the true output $y$.

#### Loss Function

$$
L = \text{Loss}(y, \hat{y})
$$
Common examples:

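* Mean Squared Error (shown for one sample): $L = \frac{1}{2}(y - \hat{y})^2$, where the factor $\frac{1}{2}$ is a convention that cancels the 2 in the derivative
* Cross-Entropy: $L = -\sum_i y_i \log \hat{y}_i$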
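Both losses are short enough to write out directly. The sketch below is illustrative only; the function names and the small `eps` guard against `log(0)` are assumptions, not part of the formulas above.

```python
import math

def squared_error(y, y_hat):
    """L = 1/2 (y - y_hat)^2 for a single scalar target."""
    return 0.5 * (y - y_hat) ** 2

def cross_entropy(y, y_hat, eps=1e-12):
    """L = -sum_i y_i log(y_hat_i); eps avoids log(0) (an added safeguard)."""
    return -sum(y_i * math.log(max(p_i, eps)) for y_i, p_i in zip(y, y_hat))

se = squared_error(1.0, 0.5)                              # 0.125
ce_good = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])       # -log(0.8)
ce_bad = cross_entropy([0, 1, 0], [0.7, 0.2, 0.1])        # -log(0.2), a larger loss
```

With a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the correct class, so a confident wrong prediction is penalized heavily.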

---

#### Gradient Descent

We minimize $L$ by adjusting the weights in the direction of steepest descent, i.e. along the negative gradient:
$$
W := W - \eta \frac{\partial L}{\partial W}
$$
$$
b := b - \eta \frac{\partial L}{\partial b}
$$
where $\eta$ is the learning rate (step size).
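The update rule is easiest to see on a one-parameter problem. The sketch below minimizes the toy loss $L(w) = (w - 3)^2$, which is an assumption chosen because its minimum at $w = 3$ is known; the learning rate and step count are likewise arbitrary.

```python
def grad(w):
    """dL/dw for the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0      # initial guess
eta = 0.1    # learning rate
for _ in range(100):
    w = w - eta * grad(w)   # W := W - eta * dL/dW

# After 100 steps w is very close to the minimizer 3.0.
```

Here each step shrinks the distance to the minimum by a factor of $1 - 2\eta = 0.8$; a learning rate above 1.0 would make the same loop diverge on this loss.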

---

#### Backpropagation

Backpropagation computes gradients efficiently using the **chain rule**:

$$
\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}
$$
Each layer passes error backward, adjusting weights to reduce loss.

---

### Example Calculation

Let’s take one hidden layer for simplicity:

$$
\begin{aligned}
z^{(1)} &= W^{(1)}x + b^{(1)} \\
a^{(1)} &= f(z^{(1)}) \\
z^{(2)} &= W^{(2)}a^{(1)} + b^{(2)} \\
\hat{y} &= f(z^{(2)})
\end{aligned}
$$

Loss:
$$
L = (y - \hat{y})^2
$$

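Gradient for the output layer (taking a single output unit, so $\hat{y}$ is a scalar):
$$
\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} \, (a^{(1)})^\top, \qquad \delta^{(2)} = -2(y - \hat{y}) \, f'(z^{(2)})
$$

Gradient for the hidden layer, where $\odot$ is the element-wise product:
$$
\frac{\partial L}{\partial W^{(1)}} = \left[\left((W^{(2)})^\top \delta^{(2)}\right) \odot f'(z^{(1)})\right] x^\top
$$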

This shows how errors “flow backward” to earlier layers — the **core of backpropagation**.
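The gradients of this one-hidden-layer example can be verified against a finite-difference estimate. The sketch below assumes `tanh` for both activations, a single output unit, and hypothetical weights; none of these choices come from the text above.

```python
import math

def forward(W1, b1, W2, b2, x):
    """One hidden layer, tanh activations, scalar output."""
    z1 = [sum(w * x_j for w, x_j in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, math.tanh(z2)

def loss(W1, b1, W2, b2, x, y):
    return (y - forward(W1, b1, W2, b2, x)[1]) ** 2

def backprop(W1, b1, W2, b2, x, y):
    """Analytic gradients of L = (y - y_hat)^2 with respect to W1 and W2."""
    a1, y_hat = forward(W1, b1, W2, b2, x)
    # Output error: -2 (y - y_hat) f'(z2), using tanh'(z) = 1 - tanh(z)^2.
    delta2 = -2.0 * (y - y_hat) * (1.0 - y_hat ** 2)
    dW2 = [delta2 * a for a in a1]
    # Hidden error: send delta2 back through W2, then through f'(z1).
    delta1 = [delta2 * w * (1.0 - a ** 2) for w, a in zip(W2, a1)]
    dW1 = [[d * x_j for x_j in x] for d in delta1]
    return dW1, dW2

# Hypothetical weights and a single training pair.
W1, b1 = [[0.3, -0.5], [0.8, 0.2]], [0.1, -0.1]
W2, b2 = [0.7, -0.4], 0.05
x, y = [1.0, 2.0], 0.5

dW1, dW2 = backprop(W1, b1, W2, b2, x, y)

# Central finite difference on one entry of W1 as an independent check.
h = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += h
W1_minus[0][1] -= h
numeric = (loss(W1_plus, b1, W2, b2, x, y)
           - loss(W1_minus, b1, W2, b2, x, y)) / (2 * h)
analytic = dW1[0][1]   # should agree with `numeric` to several decimal places
```

A gradient check like this is a common way to catch sign or indexing mistakes in hand-written backpropagation before training on real data.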

---

### Theoretical Insights

#### Universal Approximation Theorem

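A neural network with **one hidden layer and sufficiently many neurons** can approximate **any continuous function** on a compact input domain to arbitrary accuracy, provided the activation $\sigma$ is a suitable non-linearity (for example a sigmoid).

Mathematically, a sum of the form
$$
\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \sigma(w_i^T x + b_i)
$$
can approximate any continuous $f(x)$ to within any desired error, given enough neurons $N$. The theorem guarantees that such weights exist; it does not say that training will find them.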

→ Neural networks are **function approximators**.
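One way to build intuition for the theorem is to construct a "bump" from two steep sigmoids; sums of such bumps can then trace out a continuous curve piece by piece. The sketch below is only an illustration of the sum $\sum_i \alpha_i \sigma(w_i x + b_i)$ for scalar $x$. The steepness `k`, the interval $[1, 2]$, and the function names are assumptions.

```python
import math

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def shallow_net(x, alphas, ws, bs):
    """Evaluate sum_i alpha_i * sigmoid(w_i * x + b_i) for scalar x."""
    return sum(a * sigmoid(w * x + b) for a, w, b in zip(alphas, ws, bs))

k = 50.0                       # steepness; larger k gives sharper edges
alphas = [1.0, -1.0]           # step up at x = 1, step down at x = 2
ws = [k, k]
bs = [-k * 1.0, -k * 2.0]

inside = shallow_net(1.5, alphas, ws, bs)    # close to 1
outside = shallow_net(3.0, alphas, ws, bs)   # close to 0
```

Two hidden neurons give one bump on $[1, 2]$; approximating a whole function this way needs many bumps, which is where the "enough neurons" condition comes from.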

---

#### Role of Depth

Deeper networks can represent some functions **exponentially more efficiently** than shallow ones (far fewer neurons for the same accuracy).
Depth → hierarchical abstraction (loosely analogous to layers in the brain's cortex).
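A well-known construction that illustrates this efficiency is the sawtooth built by composing a two-ReLU "tent" function with itself: each extra layer doubles the number of peaks while adding only two neurons. The sketch below assumes ReLU activations and inputs in $[0, 1]$; the function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    """Two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent map `depth` times (one tent per layer)."""
    for _ in range(depth):
        x = tent(x)
    return x

depth = 3
grid = [i / 2 ** depth for i in range(2 ** depth + 1)]   # 0, 1/8, ..., 1
values = [deep_sawtooth(x, depth) for x in grid]          # alternates 0, 1, 0, 1, ...
peaks = sum(1 for v in values if v == 1.0)                # 2 ** (depth - 1) peaks
```

A one-hidden-layer ReLU network on a scalar input with $n$ neurons has at most $n + 1$ linear pieces, so matching the $2^{\text{depth}}$ pieces produced here would require a number of neurons that grows exponentially with `depth`.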

---

#### Probabilistic View

Feedforward networks can also be seen as **parameterized conditional models**:
$$
p(y|x; \theta)
$$
They model the probability of output given input, making them suitable for classification and regression.
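For classification, a common way to obtain $p(y|x; \theta)$ is to pass the final-layer outputs through a softmax. Softmax is not named in the text above, so treat it here as an assumed choice; the logits in the example are made up.

```python
import math

def softmax(logits):
    """Turn raw final-layer outputs into probabilities that sum to 1."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical outputs for three classes
nll = -math.log(probs[0])          # negative log-likelihood if class 0 is correct
```

Under this view, minimizing cross-entropy against one-hot labels is the same as maximizing the log-likelihood of the observed classes.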

---

#### Geometric View

Each layer **warps input space**:

* Linear layers rotate/stretch space.
* Nonlinear activations bend it.
* The final decision boundary becomes a complex surface.

Thus, neural networks **reshape feature space** until, ideally, the classes become linearly separable in the last hidden layer.
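XOR is the classic case of this reshaping: its four points cannot be split by a single line in the input space, but they can after one ReLU layer. The sketch below uses hand-picked (not learned) weights and hypothetical function names.

```python
def relu(z):
    return max(0.0, z)

def hidden(x1, x2):
    """Hand-picked hidden layer that warps the XOR inputs."""
    return (relu(x1 + x2), relu(x1 + x2 - 1.0))

def linear_readout(h):
    """A single linear unit acting on the hidden representation."""
    return h[0] - 2.0 * h[1]

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]

hidden_points = [hidden(*p) for p in points]   # (0,0), (1,0), (1,0), (2,1)
predictions = [int(linear_readout(h) > 0.5) for h in hidden_points]
```

In the hidden space the two positive points land on the same location, and the line $h_1 - 2h_2 = 0.5$ separates them from the two negative points.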

---

### Intuition Recap

| Stage             | Mathematical Form                   | Theoretical Meaning           |
| ----------------- | ----------------------------------- | ----------------------------- |
| Weighted sum      | $z = Wx + b$                        | Linear projection             |
| Activation        | $a = f(z)$                          | Nonlinear deformation         |
| Composition       | $f^{(L)} \circ \dots \circ f^{(1)}$ | Hierarchical feature learning |
| Loss minimization | $\min L(y, \hat{y})$                | Learn function mapping        |
| Backpropagation   | Chain rule on gradients             | Efficient optimization        |

---

### Example Intuitive Analogy

Think of a **Feedforward Neural Network** this way:

* Each layer is a **filter** refining the input.
* Linear layers = weighted “mixing” of features.
* Activation = deciding which features to emphasize.
* Backprop = continuous self-correction to improve the mapping.

---

**Summary**

| Concept        | Description                               |
| -------------- | ----------------------------------------- |
| **Goal**       | Learn a function $f: X \to Y$ from data   |
| **Mechanism**  | Compose linear + nonlinear transforms     |
| **Training**   | Backpropagation with gradient descent     |
| **Power**      | Universal function approximator           |
| **Limitation** | Needs lots of data and tuning             |

---

**In One Line**

> A **Feedforward Neural Network** is a layered system that linearly transforms inputs, bends them through nonlinear activations, and learns its parameters via backpropagation to approximate a target function.

