# Mixture of Experts (MoE)

A **Mixture of Experts (MoE)** is a machine learning architecture that combines multiple specialized models (called **experts**) and a **gating network** that learns to decide which experts to use for a given input.

### Key Idea

Instead of using a single monolithic model for all inputs, MoE dynamically selects **a subset of experts** to handle each input, allowing for **sparse computation** and **specialization**. Because only the selected experts run, the parameter count can grow without a proportional increase in per-input compute, which makes MoE both more **scalable** and **efficient**, especially in large-scale models like language models.


### Architecture Overview

An MoE consists of:
1. **Experts**: Independent sub-models (usually neural networks) that, over the course of training, come to specialize in different parts of the input space.
2. **Gating Network**: A learned function that, given an input, assigns **weights** (or makes a hard selection) over the experts.

Formally, for input $\mathbf{x}$:

$$
y = \sum_{i=1}^K g_i(\mathbf{x}) \cdot E_i(\mathbf{x})
$$

Where:
- $E_i(\mathbf{x})$ is the output of the $i$-th expert
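- $g_i(\mathbf{x})$ is the gating weight for the $i$-th expert (e.g., a softmax output, or zero for experts outside the top-k)
- $K$ is the total number of experts
- With sparse gating, $g_i(\mathbf{x}) = 0$ for every unselected expert, so the sum effectively runs over the selected experts only and the others need not be evaluated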
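
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the three linear experts, the gate matrix `W_gate`, and the helper names `softmax`, `matvec`, and `moe_forward` are all assumptions made for the example, and the gate is taken to be a softmax over a linear projection of the input.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moe_forward(x, experts, W_gate):
    """Compute y = sum_i g_i(x) * E_i(x) with g = softmax(W_gate x)."""
    g = softmax(matvec(W_gate, x))
    outputs = [expert(x) for expert in experts]  # dense: every expert runs
    dim = len(outputs[0])
    y = [sum(g[i] * outputs[i][d] for i in range(len(experts)))
         for d in range(dim)]
    return y, g

# Hypothetical toy setup: K = 3 linear experts on 2-dimensional inputs.
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles the input
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swaps the two coordinates
]
experts = [lambda x, W=W: matvec(W, x) for W in expert_weights]
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per expert

y, g = moe_forward([1.0, 2.0], experts, W_gate)
print("gate weights:", g)
print("output:", y)
```

Here every expert is evaluated and the gate only decides how much each one counts; a sparse variant would skip the experts whose weight is zero.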


### Types of Gating
- **Soft gating**: All experts contribute, weighted by $g_i(\mathbf{x})$ (similar to attention).
- **Hard (sparse) gating**: Only the top-k experts by gating score are selected and the rest receive zero weight (common in large-scale models for computational efficiency).
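
To make the difference concrete, the sketch below computes both kinds of gate weights from the same scores. The logits are made-up values, the function names are hypothetical, and the top-k variant shown here (softmax over the kept scores, zero elsewhere) is one common choice among several.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_gate(scores):
    """Soft gating: every expert receives a nonzero weight."""
    return softmax(scores)

def top_k_gate(scores, k):
    """Top-k gating: softmax over the k largest scores, zero for the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = softmax([scores[i] for i in top])
    weights = [0.0] * len(scores)
    for i, w in zip(top, kept):
        weights[i] = w
    return weights

scores = [2.0, 0.5, 1.0, -1.0]  # hypothetical gating logits for 4 experts
soft = soft_gate(scores)
sparse = top_k_gate(scores, k=2)
print("soft:  ", soft)
print("sparse:", sparse)
```

With soft gating all four weights are positive; with `k=2` only experts 0 and 2 keep a weight, so the other two would never need to run.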


### Benefits
- **Efficiency**: Only a small subset of experts is used per input → reduces compute.
- **Scalability**: Models can scale to billions of parameters without needing to activate all of them at once.
- **Specialization**: Experts can learn to focus on different features or tasks.
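
A rough back-of-the-envelope calculation shows where the efficiency and scalability come from. The layer sizes below are hypothetical, and the helper `moe_ffn_params` assumes each expert is a plain two-matrix feed-forward block, ignoring biases and the router.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) feed-forward parameter counts for one MoE layer.

    Assumes each expert holds two weight matrices (d_model x d_ff and
    d_ff x d_model); biases and the gating network are ignored.
    """
    per_expert = 2 * d_model * d_ff
    return num_experts * per_expert, top_k * per_expert

# Hypothetical sizes: 8 experts, 2 of them active per input.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
ratio = active / total
print(f"total={total:,} active={active:,} ratio={ratio:.2f}")
```

Under these assumptions the layer stores eight experts' worth of parameters but only touches a quarter of them for any single input.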


### Use Cases
- **Large Language Models** (e.g., Google's GShard and Switch Transformer, the latter built on T5, as well as open models such as Mixtral 8x7B and DeepSeek-V2).
- **Multitask Learning**: Different tasks can route to different experts.
- **Continual Learning**: New experts can, in principle, be added without retraining the whole model.


### Example: Top-2 Gating (used in GShard and Mixtral)
Instead of using all experts, use only the top 2 based on gating scores (Switch Transformer goes one step further and routes each token to a single expert):

1. Compute gating weights $g = \text{softmax}(W \mathbf{x})$
2. Select top 2 experts by score
3. Route input only to those 2 experts
4. Combine outputs weighted by gating scores (often renormalized so the two selected weights sum to 1)

This keeps the model efficient while still leveraging expert specialization.
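
The four steps above can be sketched as follows. This is a toy illustration under stated assumptions: the experts are simple scaling functions, `W_gate` is a made-up gate matrix, the two kept weights are renormalized to sum to 1, and the `calls` list exists only to show which experts actually ran.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(x, experts, W_gate):
    """Route x to the 2 highest-scoring experts and mix their outputs."""
    # Step 1: gating weights g = softmax(W x)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_gate]
    g = softmax(logits)
    # Step 2: indices of the top 2 experts by score
    top2 = sorted(range(len(g)), key=lambda i: g[i], reverse=True)[:2]
    # Assumption: renormalize the two kept weights so they sum to 1
    norm = sum(g[i] for i in top2)
    y = [0.0] * len(x)  # toy experts return vectors of the same size as x
    for i in top2:
        out = experts[i](x)            # Step 3: only selected experts run
        for d, v in enumerate(out):    # Step 4: weighted combination
            y[d] += (g[i] / norm) * v
    return y, top2

calls = []  # records which experts were actually evaluated

def make_expert(idx, scale):
    """Hypothetical expert that scales its input and logs the call."""
    def expert(x):
        calls.append(idx)
        return [scale * v for v in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

y, chosen = top2_moe([1.0, 3.0], experts, W_gate)
print("chosen experts:", chosen)
print("experts evaluated:", calls)
print("output:", y)
```

Only two of the four experts are ever called, which is the source of the compute savings; in a real model the routing is done per token and batched across experts.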

### Example: DeepSeek V2 MoE

<img src="images/deepseek-moe.png" alt="deepseek_v2_moe" width="500"/>

[DeepSeek V2 MoE](https://arxiv.org/pdf/2405.04434)

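- Open-source [Mixtral 8x7B](https://ollama.com/library/mixtral): 8 expert feed-forward blocks per layer, with 2 active per token. The experts share the attention layers, so the model has about 47B parameters in total (roughly 13B active per token) rather than 8 × 7B.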
- MoE was first introduced in 1991 by Jacobs et al. in "Adaptive Mixtures of Local Experts". [View](https://www.researchgate.net/publication/233806999_Adaptive_Mixtures_of_Local_Experts)