# Lecture 3: Architectures & Hyperparameters

# Architecture for Transformer

Starting with the original transformer...

- (Masked) MultiHead Attention

- Positional Embeddings

- FFN (MLP), with ReLU activation function.

- LayerNorm

However, the original transformer was designed for machine translation, where it replaced RNN-based sequence models. Nowadays, different models differ in the details of their architecture design:

| Name | Year | LayerNorm | Parallel Layer | Pre-norm | Position embedding | Activations | Stability tricks |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Original transformer | 2017 | LayerNorm | Serial | ❌ | Sine | ReLU | |
| GPT | 2018 | LayerNorm | Serial | ❌ | Absolute | GeLU | |
| T5 (11B) | 2019 | RMSNorm | Serial | ✅ | Relative | ReLU | |
| GPT2 | 2019 | LayerNorm | Serial | ❌ | Absolute | GeLU | |
| GPT2 | 2020 | RMSNorm | Serial | ✅ | Relative | GeGLU | |
| T5 (XXL 11B) v1.1 | 2020 | RMSNorm | Serial | ✅ | Relative | GeGLU | |
| mT5 | 2020 | RMSNorm | Serial | ✅ | Relative | GeGLU | |
| GPT3 (175B) | 2020 | LayerNorm | Serial | ✅ | Absolute | GeLU | |
| GPTJ | 2021 | LayerNorm | Parallel | ✅ | RoPE | GeLU | |
| LaMDA | 2021 | | | | Relative | GeGLU | |
| Anthropic LM (not claude) | 2021 | | | | | | |
| Gopher (280B) | 2021 | RMSNorm | Serial | ✅ | Relative | ReLU | |
| GPT-NeoX | 2022 | LayerNorm | Parallel | ✅ | RoPE | GeLU | |
| BLOOM (175B) | 2022 | LayerNorm | Parallel | ✅ | ALiBi | GeLU | |
| OPT (175B) | 2022 | LayerNorm | Serial | ❌ | Absolute | ReLU | |
| PaLM (540B) | 2022 | RMSNorm | Parallel | ✅ | RoPE | SwiGLU | Z-loss |
| Chinchilla | 2022 | RMSNorm | Serial | ✅ | Relative | ReLU | |
| Mistral (7B) | 2023 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| LLaMA2 (70B) | 2023 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| LLaMA (65B) | 2023 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| GPT4 | 2023 | | | ❌ | | | |
| Olmo 2 | 2024 | RMSNorm | Serial | ❌ | RoPE | SwiGLU | Z-loss, QK-norm |
| Gemma 2 (27B) | 2024 | RMSNorm | Serial | ✅ | RoPE | GeGLU | Logit soft capping, Pre+post norm |
| Nemotron-4 (340B) | 2024 | LayerNorm | Serial | ✅ | RoPE | SqReLU | |
| Qwen 2 (72B) - same for 2.5 | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Falcon 2 11B | 2024 | LayerNorm | Parallel | ✅ | RoPE | GeLU | Z-loss |
| Phi-3 (small) - same for Phi-4 | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Llama 3 (70B) | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Reka Flash | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Command R+ | 2024 | LayerNorm | Parallel | ✅ | RoPE | SwiGLU | |
| OLMo | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Qwen (14B) | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| DeepSeek (67B) | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Yi (34B) | 2024 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |
| Mixtral of Experts | 2024 | | | ❌ | | | |
| Command A | 2025 | LayerNorm | Parallel | ✅ | Hybrid (RoPE+NoPE) | SwiGLU | |
| Gemma 3 | 2025 | RMSNorm | Serial | ✅ | RoPE | GeGLU | Pre+post norm, QK-norm |
| SmolLM2 (1.7B) | 2025 | RMSNorm | Serial | ✅ | RoPE | SwiGLU | |

## Normalization

### PreNorm vs PostNorm

![Norm](../../assets/imgs/norm.png)

Further discussion of PreNorm vs. PostNorm:

- [Discussion](https://kexue.fm/archives/9009)

**TLDR**: With the same hyperparameter settings, pre-norm is easier to train than post-norm, but its final performance does not exceed post-norm's (for pre-training, the two end up roughly the same).

For post-training fine-tuning, however, the post-norm architecture tends to transfer better.

> However, the gradients of Pre-LN at bottom layers tend to be larger than at top layers, leading to a degradation in performance compared with Post-LN.

Pre-norm, however, keeps the good parts of residual connections, which gives nicer gradient propagation and fewer loss spikes.

Maybe we can do Double Norm ;)
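To make the placement difference concrete, here is a minimal NumPy sketch of the two residual patterns. The names `post_norm_block`, `pre_norm_block`, and `toy_sublayer` are hypothetical, the sublayer is a stand-in for attention or an MLP, and the normalization omits the learnable gain and bias for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learnable gain/bias in this sketch).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm: the norm sits on the residual stream, after the addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: the norm sits inside the branch; x passes through unchanged.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))

def toy_sublayer(h):
    # Hypothetical stand-in for attention or an MLP.
    return np.tanh(h @ W)

x = rng.normal(size=(4, 8))
y_post = post_norm_block(x, toy_sublayer)
y_pre = pre_norm_block(x, toy_sublayer)
```

In the pre-norm variant the input `x` reaches the output through a pure identity path, which is one way to read the "good parts of residual connections" remark above.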

### Layer Norm & RMS Norm

#### Layer Normalization

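In layer normalization, statistics are computed along the **feature dimension ($D$)**, and each **sample ($N$)** and **sequence position ($L$)** is normalized independently.
This means that $\mu$ and $\sigma^2$ are computed for every sample $n$ and every sequence position $l$.

* **Mean $\mu_{n,l}$:**
    $$\mu_{n,l} = \frac{1}{D} \sum_{d=1}^D x_{n,l,d}$$
* **Variance $\sigma_{n,l}^2$:**
    $$\sigma_{n,l}^2 = \frac{1}{D} \sum_{d=1}^D (x_{n,l,d} - \mu_{n,l})^2$$
* **Scale and shift factors:** $\gamma$ and $\beta$ are shared across all samples and sequence positions, but they are usually vectors with the same length as the feature dimension $D$, i.e. $\gamma, \beta \in \mathbb{R}^{D}$.
    Therefore, for each $x_{n,l,d}$:
    $$y_{n,l,d} = \gamma_d \left( \frac{x_{n,l,d} - \mu_{n,l}}{\sqrt{\sigma_{n,l}^2 + \epsilon}} \right) + \beta_d$$

> In sequence models such as the Transformer, the sequence length $L$ is variable (and changes frequently), so **Layer Normalization** is typically used for normalization to make the model more robust.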
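The formulas above can be sketched in a few lines of NumPy. This is an illustrative version only: the function name `layer_norm`, the tensor sizes, and the value of `eps` are assumptions, not taken from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x has shape (N, L, D); statistics are taken over the last axis (D),
    # so every (n, l) position gets its own mean and variance.
    mu = x.mean(axis=-1, keepdims=True)                  # shape (N, L, 1)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # shape (N, L, 1)
    # gamma and beta have shape (D,) and broadcast over N and L.
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

N, L, D = 2, 3, 16  # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(N, L, D))
y = layer_norm(x, gamma=np.ones(D), beta=np.zeros(D))
```

With `gamma = 1` and `beta = 0`, each position of `y` has approximately zero mean and unit variance over its `D` features, regardless of the batch size or sequence length.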

#### RMS Norm

RMSNorm, which stands for **Root Mean Square Normalization**, is a normalization technique used in many transformer architectures. It's a simpler and more computationally efficient alternative to traditional Layer Normalization.

The core principle of RMSNorm is to **re-scale the activations of a layer based on their root mean square (RMS)** value, rather than their mean and standard deviation.

For a given input vector $x$ (the activations from a layer), the RMSNorm operation is defined as follows. As in LayerNorm, the normalization is performed over the feature dimension $D$:

$$y_{n,l,d} = \gamma_d \left( \frac{x_{n,l,d}}{\sqrt{\frac{1}{D} \sum_{i=1}^{D} x_{n,l,i}^2 + \epsilon}} \right)$$


**RMSNorm** calculates the root mean square, which is the L2 norm divided by the square root of the dimension. It is simpler than LayerNorm because it **doesn't subtract the mean** and has no shift term, so it needs fewer operations and fewer parameters. This makes RMSNorm **more efficient and faster**, especially in large-scale models: normalization accounts for few FLOPs, but its runtime is dominated by memory movement. Normalizing only the magnitude of the activations is often sufficient for stabilizing training.

- It also drops the bias term $\beta$, which saves memory and is reported to help optimization stability.
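As a rough illustration of the RMSNorm formula, the sketch below normalizes by the root mean square only. The name `rms_norm`, the sizes, and the `eps` value are assumptions made for this example.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Mean of squares over the feature dimension; no mean subtraction, no bias.
    ms = (x ** 2).mean(axis=-1, keepdims=True)
    return gamma * x / np.sqrt(ms + eps)

rng = np.random.default_rng(0)
D = 16  # illustrative feature dimension
x = rng.normal(loc=3.0, scale=2.0, size=(2, 3, D))
y = rms_norm(x, gamma=np.ones(D))

# The output has (approximately) unit RMS per position, but its mean is not
# forced to zero, unlike LayerNorm.
rms_out = np.sqrt((y ** 2).mean(axis=-1))
```

Because only the magnitude is normalized, rescaling the input (for example `10 * x`) leaves the output essentially unchanged, while shifting the input does change it.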

## Activations

### Activation Functions

Activation functions provide the non-linear components of the neural network.

### GLU (Gated Activations)

Original feed-forward network: $\text{FF}(x) = \max(0, xW_1)W_2$.

With GLU:

$$\max(0, xW_1) \to \max(0, xW_1) \otimes (xV)$$

Here $\otimes$ denotes element-wise multiplication. The first factor is the usual ReLU branch; the second passes the input $x$ through an extra matrix $V$ and acts as a gate that controls how much of each hidden unit is passed on. This gives the ReGLU feed-forward network:

$$\text{FF}_{\text{ReGLU}}(x) = (\max(0, xW_1) \otimes (xV) ) W_2$$
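The two feed-forward variants can be compared side by side in NumPy. This is a minimal sketch with bias-free layers; the function names and the sizes `d_model = 8`, `d_ff = 16` are illustrative assumptions.

```python
import numpy as np

def ffn_relu(x, W1, W2):
    # Original FFN: FF(x) = max(0, x W1) W2
    return np.maximum(0, x @ W1) @ W2

def ffn_reglu(x, W1, V, W2):
    # ReGLU FFN: the ReLU branch is multiplied element-wise by the linear gate x V.
    return (np.maximum(0, x @ W1) * (x @ V)) @ W2

d_model, d_ff = 8, 16  # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
W1 = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

y_relu = ffn_relu(x, W1, W2)
y_reglu = ffn_reglu(x, W1, V, W2)
```

Note that the gated version needs one extra weight matrix (`V`) of the same shape as `W1`.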

### SwiGLU Mathematical Formula

SwiGLU, which stands for **Swish-Gated Linear Unit**, is a variant of the Gated Linear Unit (GLU) family. It was proposed in Noam Shazeer's "GLU Variants Improve Transformer" (2020), was adopted by Google's PaLM model, and has since been used by other modern large language models, such as LLaMA, as the activation function within their feed-forward networks (FFN). It has been reported to outperform both ReLU and GeLU.

SwiGLU functions as a sub-module within the feed-forward network. Its core idea is to **replace the traditional single-path activation with a two-path structure controlled by a "gate"**.

**SwiGLU-based FFN**:
$$\text{FFN}(x) = \text{SwiGLU}(x, W_1, W_2)\, W_3$$

Here, $x$ is the input vector, and $W_1, W_2, W_3$ are learnable weight matrices.

The core operation of SwiGLU is defined as:

$$\text{SwiGLU}(x, W_1, W_2) = (\text{Swish}(xW_1)) \otimes (xW_2)$$

Where:
* $x$ is the input vector.
* $W_1$ and $W_2$ are two distinct weight matrices.
* **$\text{Swish}(\cdot)$** is the activation function defined by:
  $$\text{Swish}(z) = z \cdot \sigma(z)$$
  where $\sigma(z)$ is the Sigmoid function:
  $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
* **$\otimes$** denotes **element-wise multiplication**.

1.  **Path 1 (Activation Path)**: The input vector $x$ is multiplied by the weight matrix $W_1$ to produce a vector $z_1 = xW_1$. This vector is then passed through the Swish activation function: Swish($z_1$).
2.  **Path 2 (Gating Path)**: The input vector $x$ is multiplied by a separate weight matrix $W_2$ to produce another vector $z_2 = xW_2$.
3.  **Gating**: The result from Path 1, Swish($z_1$), is multiplied element-wise by the result from Path 2, $z_2$.

This element-wise multiplication is the key "gating" mechanism. The second path, $z_2 = xW_2$, acts as a gate, dynamically adjusting the weight of each element in the activated vector from Path 1 based on the input $x$.

* **Dynamic Information Flow**: The gating design allows the network to learn which features should be amplified or suppressed, providing a more precise way to control information flow. This adaptive gating ability helps the model better capture complex dependencies.
* **Smoother Gradients**: Unlike ReLU's hard cutoff (with a gradient of 0 or 1), the Swish function has a smooth gradient. This helps in mitigating the vanishing gradient problem in deep networks, leading to more stable optimization.
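The definitions above translate directly into a short NumPy sketch. The helper names (`sigmoid`, `swish`, `swiglu`, `ffn_swiglu`) and the matrix sizes are illustrative assumptions; the layers are bias-free, as in the formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish(z) = z * sigmoid(z)
    return z * sigmoid(z)

def swiglu(x, W1, W2):
    # Path 1: Swish(x W1); Path 2: x W2; combined by element-wise multiplication.
    return swish(x @ W1) * (x @ W2)

def ffn_swiglu(x, W1, W2, W3):
    # FFN(x) = SwiGLU(x, W1, W2) W3
    return swiglu(x, W1, W2) @ W3

d_model, d_ff = 8, 16  # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_model, d_ff))
W3 = rng.normal(size=(d_ff, d_model))

y = ffn_swiglu(x, W1, W2, W3)
```

Unlike ReLU, `swish` is smooth around zero and slightly negative for small negative inputs, while approaching the identity for large positive inputs.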


## Serial vs Parallel layers

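In the original serial block, the MLP takes the attention output as its input, so the two sublayers cannot be computed in parallel. Parallel layers feed the same normalized input to both sublayers.

- Serial: $h = x + \text{Attention}(\text{LayerNorm}(x))$, then $y = h + \text{MLP}(\text{LayerNorm}(h))$

- Parallel: $y = x + \text{MLP}(\text{LayerNorm}(x)) + \text{Attention}(\text{LayerNorm}(x))$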
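A small sketch may help show the data dependency. Here the serial variant is written in the usual two-step pre-norm residual form, $h = x + \text{Attention}(\text{LayerNorm}(x))$ followed by $y = h + \text{MLP}(\text{LayerNorm}(h))$, which is an assumption about how the serial formula is meant to be read. `toy_attention` and `toy_mlp` are hypothetical stand-ins, not real attention or MLP implementations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def serial_block(x, attention, mlp):
    # The MLP input depends on the attention output: two sequential steps.
    h = x + attention(layer_norm(x))
    return h + mlp(layer_norm(h))

def parallel_block(x, attention, mlp):
    # Both branches read the same normalized input, so the norm is computed
    # once and the two branches are independent of each other.
    n = layer_norm(x)
    return x + mlp(n) + attention(n)

rng = np.random.default_rng(0)
Wa = rng.normal(scale=0.1, size=(8, 8))
W1 = rng.normal(scale=0.1, size=(8, 32))
W2 = rng.normal(scale=0.1, size=(32, 8))

def toy_attention(h):
    # Hypothetical stand-in: a single linear map instead of real attention.
    return h @ Wa

def toy_mlp(h):
    return np.maximum(0, h @ W1) @ W2

x = rng.normal(size=(4, 8))
y_serial = serial_block(x, toy_attention, toy_mlp)
y_parallel = parallel_block(x, toy_attention, toy_mlp)
```

The two blocks generally produce different outputs; they coincide only in degenerate cases, such as when the attention branch contributes nothing.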

## Positional Embeddings

- sine embeddings

- absolute embeddings

- relative embeddings

- RoPE embeddings

**Relative positions encoding**:  $\langle f(x, i), f(y, j) \rangle = g(x, y, i - j)$

Here $\langle \cdot,\cdot \rangle$ denotes the dot product (the attention score) between two tokens, and $g$ is a function that depends on the positions only through the relative offset $i - j$.

For the original sine embeddings: $\text{Embed}(x, i) = v_x + PE_i$, where $PE_i$ is the positional embedding of position $i$.

$$\begin{aligned}
\langle \text{Embed}(x, i), \text{Embed}(y, j) \rangle &= \langle v_x + PE_i, v_y + PE_j \rangle \\
&= \langle v_x, v_y \rangle + \langle v_x, PE_j \rangle + \langle PE_i, v_y \rangle + \langle PE_i, PE_j \rangle
\end{aligned}$$

The cross-terms $\langle v_x, PE_j \rangle$ and $\langle PE_i, v_y \rangle$ depend on the absolute positions $i$ and $j$, so the score is not a function of the relative offset $i - j$ alone.
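This can be checked numerically. The sketch below assumes the sinusoidal formula from the original transformer paper (base 10000, sine and cosine interleaved), which is not spelled out in these notes; the function names and sizes are illustrative.

```python
import numpy as np

def sinusoidal_pe(pos, d_model):
    # Assumed form: PE[2k] = sin(pos / 10000^(2k/d)), PE[2k+1] = cos(same angle).
    k = np.arange(d_model // 2)
    freq = 1.0 / (10000 ** (2 * k / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freq)
    pe[1::2] = np.cos(pos * freq)
    return pe

def score(v_x, i, v_y, j):
    # <Embed(x, i), Embed(y, j)> with additive positional embeddings.
    d = v_x.shape[0]
    return np.dot(v_x + sinusoidal_pe(i, d), v_y + sinusoidal_pe(j, d))

rng = np.random.default_rng(0)
v_x, v_y = rng.normal(size=16), rng.normal(size=16)

s_a = score(v_x, 2, v_y, 5)     # offset i - j = -3
s_b = score(v_x, 12, v_y, 15)   # same offset, both positions shifted by 10

# The pure position-position term does depend only on the offset ...
pe_only_a = np.dot(sinusoidal_pe(2, 16), sinusoidal_pe(5, 16))
pe_only_b = np.dot(sinusoidal_pe(12, 16), sinusoidal_pe(15, 16))
# ... but the full scores s_a and s_b differ because of the cross-terms.
```

So only $\langle PE_i, PE_j \rangle$ is relative here; the full attention score changes when both positions are shifted by the same amount.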


### RoPE

Consider the simplest case:

For $x = (x_0, x_1) \in \mathbb{R}^2$, we can compute $x'$ with a rotation matrix:

$$R_{\theta}=\left(\begin{array}{cc}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta
\end{array}\right)$$

$$x' = R_\theta x$$

The dot product of two rotated vectors depends only on the relative rotation angle!

$$\langle R_\alpha a, R_\beta b \rangle = \langle R_{(\alpha - \beta)} a, b \rangle$$
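The identity can be verified numerically in the 2D case. In this sketch the rotation angle for a token is its position times a base angle `theta`; the helper name `rope_2d` and the value `theta = 0.1` are hypothetical choices for illustration only.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def rope_2d(x, pos, theta=0.1):
    # Rotate a 2D vector by an angle proportional to its position.
    return rotation(pos * theta) @ x

q = np.array([1.0, 2.0])
k = np.array([-0.5, 3.0])

s_near = np.dot(rope_2d(q, 7), rope_2d(k, 2))       # positions 7 and 2
s_far = np.dot(rope_2d(q, 107), rope_2d(k, 102))    # same offset, shifted by 100
s_rel = np.dot(rotation((7 - 2) * 0.1) @ q, k)      # <R_(alpha - beta) q, k>
```

All three scores agree up to floating-point error, so the score depends on the two positions only through their difference, with no absolute-position cross-terms.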


# Hyperparameter Settings