```{contents}
```

# Intuition 

Think of a neural network as a **function approximator**.

* It learns to map **inputs** $x$ to **outputs** $y$.
* It’s inspired by the brain: neurons receive signals, process them, and send outputs to other neurons.

**Example – House Price Prediction**:

* Input: house size, number of bedrooms, zip code, neighborhood wealth
* Output: predicted house price
* A neuron can combine these inputs in some way and output an intermediate concept (e.g., “family size” or “neighborhood quality”)
* Multiple neurons in **hidden layers** can combine simple concepts to learn **complex relationships**

**Key idea:** Each neuron is like a **mini-function**. Stacking many neurons allows the network to represent highly complicated functions.

---

## Mathematical Intuition

### Neuron as a Function

A single neuron computes:

$$
z = \sum_{i=1}^{n} w_i x_i + b
$$

$$
a = f(z)
$$

Where:

* $x_i$ = input features
* $w_i$ = weights (importance of each input)
* $b$ = bias (shifts the output)
* $f$ = activation function (introduces non-linearity)
* $a$ = output of neuron

**Interpretation:**

* The neuron first forms a **weighted sum of its inputs** plus a bias, $z = \sum_i w_i x_i + b$, then **non-linearly transforms** $z$ with $f$. The non-linearity is what lets the neuron respond to more than straight-line patterns.
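
The two equations above translate almost line for line into code. The following is a minimal sketch, assuming a sigmoid activation; the names `neuron` and `sigmoid` and the input values are illustrative choices, not something fixed by the text.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Logistic activation: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(
    x: Sequence[float],
    w: Sequence[float],
    b: float,
    f: Callable[[float], float] = sigmoid,
) -> float:
    """Compute a = f(z) with z = sum_i w_i * x_i + b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(z)


# Illustrative values only.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# z = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, so a = sigmoid(0.0) = 0.5
a = neuron(x, w, b)
```

Swapping `f` for another function (for example `lambda z: max(0.0, z)` for ReLU) changes the non-linear step without touching the weighted sum.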

---

### Layers as Function Composition

A network with **two layers**:

$$
a^{(1)} = f_1(W^{(1)} x + b^{(1)})
$$

$$
y = f_2(W^{(2)} a^{(1)} + b^{(2)})
$$

Where:

* $W^{(1)}, W^{(2)}$ = weight matrices for layers 1 and 2
* $b^{(1)}, b^{(2)}$ = bias vectors
* $f_1, f_2$ = activation functions
* $a^{(1)}$ = output of first (hidden) layer
* $y$ = final output

**Interpretation:**

* Each layer transforms the input space into a new representation.
* Stacking layers allows the network to **learn hierarchical features** (e.g., pixels → edges → shapes → objects in images).
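
The two-layer equations above can be sketched as two calls to one helper. This is a minimal illustration in plain Python, assuming ReLU for $f_1$ and the identity for $f_2$; the helper name `dense` and all weight values are made up for the example.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def relu(z: float) -> float:
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def dense(W: Matrix, a_prev: Vector, b: Vector, f: Callable[[float], float]) -> Vector:
    """One layer: apply f elementwise to W @ a_prev + b."""
    return [
        f(sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical parameters: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]

x = [3.0, 1.0]
a1 = dense(W1, x, b1, relu)      # hidden representation: [2.0, 1.0]
y = dense(W2, a1, b2, identity)  # 2.0 * 2.0 + 1.0 * 1.0 + 0.5 = [5.5]
```

The output of the first call is the input of the second, which is exactly the composition $y = f_2(W^{(2)} f_1(W^{(1)} x + b^{(1)}) + b^{(2)})$.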

---

### Learning Weights (Training)

1. **Forward Propagation:** Compute the predicted output $\hat{y}$ from the inputs.
2. **Compute Loss:** Measure the error between the prediction $\hat{y}$ and the true output $y$:

   $$
   \text{Loss} = L(\hat{y}, y)
   $$

   * Example: mean squared error (MSE) for regression over $N$ samples:

     $$
     L = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2
     $$

3. **Backpropagation:** Compute the gradient of the loss with respect to each weight (and bias) using the chain rule:

   $$
   \frac{\partial L}{\partial w_i}
   $$

4. **Update Weights (Gradient Descent):**

   $$
   w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}
   $$

   * $\eta$ = learning rate (step size)
**Interpretation:**

* The network **learns by repeatedly adjusting its weights and biases** in the direction that reduces the loss, i.e. the difference between predicted and true outputs.
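
The four training steps can be sketched on the smallest possible model: a single linear neuron $\hat{y} = w x + b$ trained with MSE. The toy data, the learning rate, and the iteration count below are assumptions chosen for illustration, and the gradients are derived by hand rather than by a general backpropagation routine.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)

w, b = 0.0, 0.0  # initial parameters
eta = 0.05       # learning rate (assumed value)


def mse(w: float, b: float) -> float:
    """Mean squared error of the model y_hat = w * x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


initial_loss = mse(w, b)

for _ in range(2000):
    # 1. Forward propagation: predictions for every sample.
    preds = [w * x + b for x in xs]
    # 2. The loss is MSE; steps 3 and 4 only need its gradients.
    # 3. Gradients, from differentiating L = (1/N) * sum (y_hat - y)^2:
    #    dL/dw = (2/N) * sum (y_hat - y) * x,  dL/db = (2/N) * sum (y_hat - y)
    grad_w = 2.0 / n * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    grad_b = 2.0 / n * sum(p - y for p, y in zip(preds, ys))
    # 4. Gradient descent update.
    w -= eta * grad_w
    b -= eta * grad_b

final_loss = mse(w, b)
# w and b should end up close to the generating values 2 and 1.
```

In a multi-layer network the update rule is the same; backpropagation is what supplies `grad_w` and `grad_b` for every layer by applying the chain rule through the composition.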

---

### Non-Linearity is Key

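* Without activation functions, multiple layers collapse into a **single linear (affine) transformation**, no matter how many layers are stacked.
* Non-linear activations (ReLU, sigmoid, tanh) allow the network to approximate **any continuous function** on a bounded input region to any desired accuracy, given enough hidden units.
* **Mathematical insight:** This is the universal approximation theorem: a sufficiently large neural network with non-linear activations is a **universal function approximator**. The theorem says such a network exists; it does not say that training will find it.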
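
To see why the collapse in the first bullet happens, take the two-layer network from earlier and set both activations to the identity (an assumption made only for this derivation):

$$
\begin{aligned}
y &= W^{(2)} a^{(1)} + b^{(2)} \\
  &= W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)} \\
  &= \underbrace{W^{(2)} W^{(1)}}_{W'}\, x + \underbrace{W^{(2)} b^{(1)} + b^{(2)}}_{b'}
\end{aligned}
$$

So the two layers compute $y = W' x + b'$, which a single layer could already represent. A non-linear $f_1$ between the layers blocks the second step, because $W^{(2)} f_1(W^{(1)} x + b^{(1)})$ can no longer be folded into one matrix product.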

---

### Geometric Intuition

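* Each neuron can be seen as defining a **hyperplane** in input space: the set of points where $\sum_i w_i x_i + b = 0$.
* The activation determines **which side of the hyperplane is “activated”**. With ReLU, for example, the neuron outputs zero wherever $z \le 0$ and grows linearly on the other side.
* Combining many neurons creates complex **decision boundaries** for classification, or piecewise mapping functions for regression.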
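
A classic small case makes this concrete: XOR cannot be separated by one hyperplane, but two ReLU neurons plus a linear output can represent it. The weights below are hand-picked for illustration (not learned), and `xor_net` is a hypothetical name.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def xor_net(x1: float, x2: float) -> float:
    """Two hidden ReLU neurons, each tied to one hyperplane, plus a linear output."""
    h1 = relu(x1 + x2)        # active on the side of the line x1 + x2 = 0 where x1 + x2 > 0
    h2 = relu(x1 + x2 - 1.0)  # active on the side of the line x1 + x2 = 1 where x1 + x2 > 1
    return h1 - 2.0 * h2      # linear combination of the two hidden activations


outputs = {
    (x1, x2): xor_net(x1, x2)
    for x1 in (0.0, 1.0)
    for x2 in (0.0, 1.0)
}
# (0, 0) -> 0.0, (0, 1) -> 1.0, (1, 0) -> 1.0, (1, 1) -> 0.0
```

The two parallel lines split the plane into three bands, and the output layer assigns a different slope to each band. That is how a network built from straight-line pieces ends up with a boundary no single line can draw.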

---

### Summary of Intuition

| Aspect        | Intuition                                            | Math                                                |
| ------------- | ---------------------------------------------------- | --------------------------------------------------- |
| Neuron        | Mini-function combining inputs                       | $a = f(\sum_i w_i x_i + b)$                         |
| Layer         | Transform features into higher-level representations | $a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)})$          |
| Network       | Function approximator mapping x → y                  | Composition of layers                               |
| Learning      | Adjust weights to minimize error                     | Gradient descent/backprop                           |
| Non-linearity | Capture complex patterns                             | Activation functions (ReLU, Sigmoid)                |
| Depth         | Hierarchical feature learning                        | Multiple layers = hierarchical function composition |

---

**Key Takeaways:**

1. Neural networks are **composable functions**.
2. Each neuron performs a **linear combination + non-linear activation**.
3. Stacking neurons (layers) allows modeling **complex functions**.
4. Training adjusts weights to **fit the data** using gradient-based optimization.
5. Non-linearity is critical to avoid networks reducing to a simple linear function.