# Neural Networks (NNs)

A **neural network** is a **parameterized, differentiable function approximator** loosely inspired by biological neural systems. It is composed of layers of **artificial neurons** (also called units or nodes), each of which computes a weighted sum of its inputs plus a bias and then applies a non-linear **activation function**.

Formally, a neural network $f(\mathbf{x}; \boldsymbol{\theta})$ maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to an output vector $\mathbf{y} \in \mathbb{R}^m$, where $\boldsymbol{\theta}$ represents the learnable parameters (weights and biases). A typical feedforward neural network with $L$ layers can be defined recursively as:

$$
\mathbf{h}^{(0)} = \mathbf{x}
$$

$$
\mathbf{h}^{(l)} = \phi^{(l)}\left( \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right), \quad \text{for } l = 1, 2, \ldots, L
$$

$$
f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{h}^{(L)}
$$

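where:
- $\mathbf{W}^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix for layer $l$, with $d_0 = n$ and $d_L = m$,
- $\mathbf{b}^{(l)} \in \mathbb{R}^{d_l}$ is the bias vector for layer $l$,
- $\phi^{(l)}$ is the activation function for layer $l$; hidden layers use a non-linearity (e.g., ReLU, sigmoid, tanh), while the output layer often uses the identity (regression) or softmax (classification),
- $\mathbf{h}^{(l)} \in \mathbb{R}^{d_l}$ is the output of the $l$-th layer (for $l < L$, also called a hidden representation),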
- $\boldsymbol{\theta} = \{ \mathbf{W}^{(l)}, \mathbf{b}^{(l)} \}_{l=1}^L$ is the set of all learnable parameters.
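The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`affine`, `forward`), the example weights, and the choice of an identity activation on the output layer are all assumptions made for the example, and lists are used instead of an array library only to keep the sketch dependency-free.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def identity(v):
    """Identity activation (assumed here for the output layer)."""
    return list(v)

def affine(W, b, h):
    """Compute W h + b, where W is a list of rows and b is a list."""
    return [sum(w * a for w, a in zip(row, h)) + bias
            for row, bias in zip(W, b)]

def forward(x, layers):
    """Apply h^(l) = phi^(l)(W^(l) h^(l-1) + b^(l)) for l = 1..L; return h^(L)."""
    h = list(x)  # h^(0) = x
    for W, b, phi in layers:
        h = phi(affine(W, b, h))
    return h

# Hypothetical network with L = 2 layers: d_0 = 2, d_1 = 3, d_2 = 1.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -1.0, 0.5], relu),
    ([[1.0, 2.0, -1.0]], [0.1], identity),
]
y = forward([1.0, 2.0], layers)
```

With these example values the hidden layer evaluates to `[0.0, 0.5, 3.5]`, so `y` is approximately `[-2.4]`. Each tuple in `layers` carries one $(\mathbf{W}^{(l)}, \mathbf{b}^{(l)}, \phi^{(l)})$ triple, which mirrors how $\boldsymbol{\theta}$ is defined per layer.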

Neural networks are typically trained using **gradient-based optimization**, most commonly **stochastic gradient descent (SGD)** or its variants, to minimize a **loss function** that quantifies the error between predicted and true outputs. The gradients are computed using **backpropagation**, which applies the chain rule of calculus to efficiently compute derivatives of the loss with respect to each parameter.
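As a rough sketch of how these pieces fit together, the following trains a one-hidden-layer network with per-sample SGD, with the backpropagation step written out by hand via the chain rule. Everything specific here is an assumption chosen for illustration: the scalar input and output, the tanh hidden layer, the squared-error loss, the target function $\sin(x)$, and the hyperparameters (8 hidden units, learning rate 0.01, 200 epochs).

```python
import math
import random

def init_params(hidden, rng):
    """Randomly initialize a 1 -> hidden -> 1 network."""
    return {
        "w1": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Return the prediction and the hidden activations for input x."""
    h = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    y_hat = sum(w * a for w, a in zip(p["w2"], h)) + p["b2"]
    return y_hat, h

def sgd_step(p, x, y, lr):
    """One SGD update on the loss 0.5 * (y_hat - y)^2 for a single sample."""
    y_hat, h = forward(p, x)
    d_y = y_hat - y  # dLoss/dy_hat
    for j, h_j in enumerate(h):
        # Chain rule through the output weight and tanh (tanh' = 1 - tanh^2).
        d_z = d_y * p["w2"][j] * (1.0 - h_j * h_j)
        p["w2"][j] -= lr * d_y * h_j
        p["w1"][j] -= lr * d_z * x
        p["b1"][j] -= lr * d_z
    p["b2"] -= lr * d_y

def mean_loss(p, data):
    """Average squared-error loss over a dataset."""
    return sum(0.5 * (forward(p, x)[0] - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = [(x / 10.0, math.sin(x / 10.0)) for x in range(-20, 21)]
params = init_params(8, rng)
loss_before = mean_loss(params, data)
for epoch in range(200):
    rng.shuffle(data)  # visit samples in a fresh random order each epoch
    for x, y in data:
        sgd_step(params, x, y, lr=0.01)
loss_after = mean_loss(params, data)
```

Comparing `loss_before` and `loss_after` shows the effect of training. In practice the gradients are rarely derived by hand; automatic differentiation libraries apply the same chain-rule bookkeeping to arbitrary architectures.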

Variants of neural networks include:
- **Convolutional Neural Networks (CNNs)** for spatial data (e.g., images),
- **Recurrent Neural Networks (RNNs)** for sequential data,
- **Transformer architectures** for attention-based sequence modeling.

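Neural networks are **universal function approximators**: a feedforward network with a single hidden layer and a suitable non-polynomial activation function can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, given sufficient width. Related results extend this to Borel measurable functions in an appropriate measure-theoretic sense, and to networks of bounded width but sufficient depth. These are statements about representational capacity only; they do not guarantee that gradient-based training on finite data will find such an approximation.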
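To make the role of width concrete, the sketch below hand-constructs a one-hidden-layer ReLU network that linearly interpolates a target function, then measures how the worst-case error shrinks as hidden units are added. It illustrates one simple constructive case (a smooth scalar function on an interval), not the general theorem; the function names, the target $\sin(x)$ on $[0, \pi]$, and the widths are assumptions chosen for the example, and no training is involved.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(g, a, b, width):
    """Build a one-hidden-layer ReLU network with `width` units that
    linearly interpolates g at width + 1 equally spaced points on [a, b]."""
    step = (b - a) / width
    knots = [a + k * step for k in range(width)]  # unit k computes relu(x - knots[k])
    values = [g(a + k * step) for k in range(width + 1)]
    slopes = [(values[k + 1] - values[k]) / step for k in range(width)]
    # The output weight of unit k is the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    return values[0], knots, coeffs

def evaluate(net, x):
    """Network output: bias + sum_k coeffs[k] * relu(x - knots[k])."""
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

def max_error(g, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced evaluation grid."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(g(x) - evaluate(net, x)) for x in xs)

errors = {
    width: max_error(
        math.sin,
        build_relu_interpolant(math.sin, 0.0, math.pi, width),
        0.0,
        math.pi,
    )
    for width in (4, 16, 64)
}
```

For a twice-differentiable target, the interpolation error scales roughly with the square of the knot spacing, so quadrupling the width should cut the maximum error by about a factor of 16.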