# Position-Wise Feed-Forward Network

This section covers the **Position-Wise Feed-Forward Network**.
Modern language models tend to make two main changes to it:

* they use a different activation function, and
* they employ a gating mechanism.

We should understand three key areas:
* **standard feed-forward structures (MLP/FFN)**,
* **activation functions (ReLU/SiLU)**, and
* **Gated Linear Units (GLU)**.

## Feed-Forward Network (FFN) & Multilayer Perceptron (MLP)

In the world of machine learning, **Feedforward Neural Networks (FFN)** and **Multilayer Perceptrons (MLP)** are often used interchangeably. However, while they are closely related, they represent different levels of abstraction.

The simplest way to distinguish them: **All MLPs are FFNs, but not all FFNs are MLPs.**

### 1. Feedforward Neural Network (FFN)

An FFN is a broad category of neural networks where information moves in only one direction: forward. 

There are **no cycles or loops**, in contrast to a recurrent neural network (RNN), which feeds its own output back in.

```
data -> input nodes -> hidden layer -> ... -> hidden layer -> output
```

**Mathematical Representation**

A feedforward network can be viewed as a composition of functions. For a network with $L$ layers, the output $y$ is calculated as:

$$
y = f^{(L)}(f^{(L-1)}(...f^{(1)}(x)...))
$$

Where:
* $x$ is the input vector.
* $f^{(i)}$ represents the transformation at layer $i$.
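
The composition above can be sketched in a few lines of plain Python. The three layer functions `f1`, `f2`, and `f3` and the `compose` helper are hypothetical stand-ins chosen only for illustration; a real network would use learned transformations.

```python
from functools import reduce


def compose(*layers):
    """Return a function computing f_L(... f_2(f_1(x)) ...)."""
    def network(x):
        # Feed the value through each layer in order, strictly forward.
        return reduce(lambda value, layer: layer(value), layers, x)
    return network


# Hypothetical layer transformations (illustration only).
def f1(x):
    return [2.0 * v for v in x]       # scale every element


def f2(x):
    return [v + 1.0 for v in x]       # shift every element


def f3(x):
    return [max(0.0, v) for v in x]   # clip negatives to zero


network = compose(f1, f2, f3)
y = network([-2.0, 0.5])  # equals f3(f2(f1(x)))
print(y)
```

Because each layer only consumes the previous layer's output, there is no path by which a value can flow backwards.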

### 2. Multilayer Perceptron (MLP)


An MLP is a specific kind of FFN. To be classified as an MLP, a network must meet three criteria:

1. **Multiple Layers:** It must have at least one hidden layer (3 layers total: input, hidden, and output).
2. **Fully Connected (Dense):** Every neuron in layer $i$ must connect to every neuron in layer $i+1$.
3. **Non-linear Activations:** It must use non-linear activation functions (like ReLU or Sigmoid) to avoid collapsing into a simple linear model.

**Mathematical Representation**

For a single layer in an MLP, the output vector $h$ is calculated as:

$$
h = \sigma(W x + b)
$$

Where:
* $x \in \mathbb{R}^n$: Input vector.
* $W \in \mathbb{R}^{m \times n}$: Weight matrix.
* $b \in \mathbb{R}^m$: Bias vector.
* $\sigma$: Non-linear activation function (e.g., $ReLU(z) = \max(0, z)$).
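
As a minimal sketch of $h = \sigma(Wx + b)$, the snippet below computes one dense layer with ReLU as the activation. The function name `dense_layer` and the numeric values of `W`, `b`, and `x` are made up for illustration (here $n = 3$, $m = 2$).

```python
def relu(z):
    return max(0.0, z)


def dense_layer(W, b, x, activation=relu):
    """Compute h = activation(W x + b) for a single input vector x."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Illustrative values: W is m x n = 2 x 3, b has m = 2 entries.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
x = [1.0, 2.0, 3.0]

h = dense_layer(W, b, x)
print(h)  # one output value per row of W
```

The first row produces a negative pre-activation, so ReLU sets it to zero; the second passes through unchanged.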

### 3. Direct Comparison


| Feature | Feedforward Neural Network (FFN) | Multilayer Perceptron (MLP) |
| --- | --- | --- |
| **Scope** | A broad class of architectures. | A specific subset of FFNs. |
| **Connectivity** | Can be sparse or locally connected (e.g., CNNs). | **Must** be fully connected (Dense). |
| **Structure** | Unidirectional flow (no loops). | Unidirectional flow with at least one hidden layer. |
| **Complexity** | Varies from a single layer to billions. | Requires a hidden layer to solve non-linear problems (XOR). |

**Key Distinction: The "XOR" Problem**

Historically, a "Single-Layer Perceptron" (an FFN with no hidden layer) could only solve linearly separable problems. It could not solve the **XOR** logic gate because it couldn't draw a non-linear boundary.

Adding a hidden layer with non-linear activations turns the network into an **MLP**, to which the **Universal Approximation Theorem** applies: given enough hidden neurons, it can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy.
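
To make the XOR point concrete, here is a small sketch of a two-neuron hidden layer with ReLU that computes XOR exactly. The weights are hand-picked for illustration rather than learned, and `xor_mlp` is a hypothetical name.

```python
def relu(z):
    return max(0.0, z)


def xor_mlp(x1, x2):
    """A tiny MLP with hand-set weights (not trained) that computes XOR."""
    # Hidden layer: two ReLU neurons sharing the same input weights.
    h1 = relu(x1 + x2)          # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    # Output layer: a linear combination of the hidden activations.
    return h1 - 2.0 * h2


outputs = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, value in outputs.items():
    print(inputs, value)
```

Without the ReLU in the hidden layer, the two layers would collapse into one linear map, and no choice of weights could reproduce this truth table.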


## Activation Functions (ReLU, SiLU)

<img src="../images/SiLU_ReLU.png" width="70%">

### 1. ReLU (Rectified Linear Unit)


$$
ReLU(x) = \max(0, x)
$$

**Characteristics**

* It turns off neurons that have negative values (sets them to 0).
* ✅ Pros
    * It is cheap to compute, and because many activations are exactly 0, the layer's output is sparse.
* ❌ Cons
    * **The "Dying ReLU" Problem:** ReLU maps every negative input to zero, and the gradient there is also 0. A neuron whose input is negative for every training example receives no updates, so it can stay at 0 forever: the neuron "dies."
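
The dying-ReLU effect can be illustrated with a short sketch. The derivative at exactly $x = 0$ is undefined; returning 0 there is a convention assumed in this example, and the `pre_activations` values are made up.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU. The value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


# A hypothetical neuron whose pre-activation is negative for every sample.
pre_activations = [-3.0, -0.5, -1.2]
grads = [relu_grad(z) for z in pre_activations]

print([relu(z) for z in pre_activations])  # all outputs are zero
print(grads)                               # all gradients are zero too
```

With every gradient equal to zero, a gradient-descent update leaves this neuron's weights unchanged, which is why it may never recover.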


### 2. SiLU (Sigmoid Linear Unit)


Also known as **Swish**, SiLU is a more modern, "smooth" version of ReLU. 

$$
SiLU(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$

**Characteristics**

* Unlike ReLU, which has a sharp "elbow" at zero, SiLU is smooth everywhere.
* ✅ Pros
    * The smooth gradient tends to make gradient-based optimization easier.
    * For negative inputs, SiLU dips below zero (its minimum is about $-0.28$, near $x \approx -1.28$) and then approaches zero as $x \to -\infty$. This lets some negative information flow through, which in practice often gives better accuracy than ReLU.
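
A minimal sketch of SiLU in plain Python follows. The branch inside `sigmoid` is an implementation detail added here to avoid overflow in `math.exp` for large negative inputs; it is not part of the definition.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative, so e is in (0, 1)
    return e / (1.0 + e)


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


for x in (-10.0, -1.278, 0.0, 1.0):
    print(x, silu(x))

# Scan a grid to locate the lowest point of the curve.
grid = [i / 100.0 for i in range(-1000, 1001)]
lowest = min(silu(x) for x in grid)
print(lowest)
```

The scan shows the small negative dip: the output is slightly below zero for moderately negative inputs and shrinks back toward zero as the input becomes more negative.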

### 3. Comparison Table


| Feature | ReLU | SiLU (Swish) |
| --- | --- | --- |
| **Formula** | $\max(0, x)$ | $x \cdot \text{sigmoid}(x)$ |
| **Differentiable** | Not at $x=0$ | Yes, everywhere |
| **Computation** | Extremely fast (simple comparison) | Moderate (requires exponential) |
| **Output Range** | $[0, \infty)$ | $[\approx -0.28, \infty)$ |
| **Best For** | General MLPs, CNNs | Deep Transformers, YOLO, LLMs |


## Gated Linear Units (GLUs)

The **Gated Linear Unit (GLU)** is a sophisticated architectural component that moves away from simple "all-or-nothing" activations (like ReLU) toward a **gating mechanism**.

The original definition by Dauphin et al. (written here without bias terms) is:

$$
\text{GLU}(x, W_1, W_2) = \sigma({W_1}x) \odot ({W_2}x)
$$

To visualize what's happening, let's break it into two parallel paths and the step that combines them:

1. **The Gate $\sigma({W_1}x)$:** This path applies a sigmoid function, squashing the linear transformation into the range $(0, 1)$. It acts as a learned "filter."
2. **The Content $({W_2}x)$:** This is a standard linear transformation of the input. It carries the actual "data" or features.
3. **The Element-wise Product ($\odot$):** The gate vector multiplies the content vector. If a gate value is close to $1$, the corresponding content passes through almost unchanged; if it is close to $0$, the content is blocked.
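
The three steps can be sketched as follows. The helper names `matvec` and `glu` and the weight values are assumptions made for illustration; the weights are chosen so that one gate is almost fully open and the other almost fully closed.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(W, x):
    """Matrix-vector product W x with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def glu(x, W1, W2):
    gate = [sigmoid(z) for z in matvec(W1, x)]    # 1. the gate, in (0, 1)
    content = matvec(W2, x)                       # 2. the content
    return [g * c for g, c in zip(gate, content)] # 3. element-wise product


x = [1.0, -1.0]
W1 = [[10.0, 0.0], [0.0, 10.0]]  # gate pre-activations: [10, -10]
W2 = [[2.0, 0.0], [0.0, 3.0]]    # content: [2, -3]

out = glu(x, W1, W2)
print(out)
```

The first feature passes nearly unchanged because its gate is close to 1, while the second is almost entirely suppressed because its gate is close to 0.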

---

**🔍 Why use GLUs?**

* **Vanishing Gradient Relief:** In a standard network, gradients must pass through non-linearities (like Tanh) at every layer, which can shrink the signal. In a GLU, if the gate is "open" (near 1), the gradient flows through the $W_2 x$ path linearly, preserving its strength.
* **Dynamic Selection:** Unlike ReLU, a GLU can choose to block or pass *any* feature based on the context of the input.
* **Reduced Training Bias:** Because they have a linear path, they are easier to train in very deep stacks compared to pure Sigmoid or Tanh networks.

---

**The "SwiGLU" Evolution**

Researchers found that replacing the **Sigmoid** with a **SiLU** (Swish) activation often works better. The reason is not well understood. Shazeer, who proposed this variant in "GLU Variants Improve Transformer", wrote: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."

The **SwiGLU** variant is defined as:

$$\text{SwiGLU}(x, {W_1}, {W_2}) = \text{SiLU}({W_1}x) \odot ({W_2}x)$$

In this version, the "gate" isn't just a 0-to-1 filter; it's a smooth, non-monotonic function that allows the network to learn much more complex representations.
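
A small sketch makes the difference from the sigmoid gate visible. The helper names and the weight values are illustrative assumptions; the weights are picked so that one gate value exceeds 1 and the other is negative, neither of which a sigmoid gate could produce.

```python
import math


def silu(z):
    """SiLU(z) = z * sigmoid(z), with a numerically stable sigmoid."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def matvec(W, x):
    """Matrix-vector product W x with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu(x, W1, W2):
    gate = [silu(z) for z in matvec(W1, x)]  # not limited to (0, 1)
    content = matvec(W2, x)
    return [g * c for g, c in zip(gate, content)]


x = [1.0, -1.0]
W1 = [[2.0, 0.0], [0.0, 1.0]]  # gate pre-activations: [2, -1]
W2 = [[1.0, 0.0], [0.0, 1.0]]  # content: [1, -1]

gate_values = [silu(z) for z in matvec(W1, x)]
out = swiglu(x, W1, W2)
print(gate_values)
print(out)
```

Here the first gate amplifies its content and the second flips its sign, so the SiLU gate rescales features rather than only attenuating them.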


## Code
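
The following is a minimal sketch of a position-wise feed-forward block built on SwiGLU, written in plain Python so that it runs without dependencies. It assumes an output projection `W3` that maps the hidden vector back to the model dimension; that projection, the function names, and all weight values are illustrative assumptions and are not defined in the text above. A practical implementation would use a tensor library and learned weights.

```python
import math


def silu(z):
    """SiLU(z) = z * sigmoid(z), with a numerically stable sigmoid."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def matvec(W, x):
    """Matrix-vector product W x with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_ffn(x, W1, W2, W3):
    """FFN(x) = W3 (SiLU(W1 x) * (W2 x)); W3 is an assumed output projection."""
    hidden = [silu(g) * c for g, c in zip(matvec(W1, x), matvec(W2, x))]
    return matvec(W3, hidden)


def position_wise(sequence, W1, W2, W3):
    """Apply the same FFN, with the same weights, to every position."""
    return [swiglu_ffn(x, W1, W2, W3) for x in sequence]


# Illustrative sizes: model dimension 2, hidden dimension 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # gate projection (3 x 2)
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # content projection (3 x 2)
W3 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # output projection (2 x 3)

sequence = [[1.0, 2.0], [0.0, 0.0], [1.0, 2.0]]  # three token vectors
outputs = position_wise(sequence, W1, W2, W3)
for token, result in zip(sequence, outputs):
    print(token, result)
```

Each position is processed independently, so identical input vectors give identical outputs regardless of where they appear in the sequence. That independence is what "position-wise" refers to.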