# Mixture of Experts (MoE)

Mixture of experts (MoE) aims to improve the accuracy of a function approximation by replacing a single global model with a weighted sum of local models (experts). It partitions the problem domain into several subdomains with a clustering algorithm, then trains a local expert on each subdomain.

The mixture-of-experts architecture improves upon the shared-bottom model by creating multiple expert networks and adding a gating network to weight each expert network’s output.

<p><center><img src='_images/L861465_1.png'></center></p>

Although the technique was initially described using neural network experts and gating models, it can be generalized to use models of any type. As such, it shows a strong similarity to stacked generalization and belongs to the class of ensemble learning methods referred to as meta-learning.

Two major components are needed in any MoE model: a set of experts (neural networks) and one or more trainable gates. A gate selects a combination of the experts that is specific to each input example. This allows experts to specialize in different partitions of the input space, which has the potential to improve the predictive performance and interpretability.

The clustering-based MoE method described above relies heavily on the expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). For regression, the steps are:

1. Clustering: the inputs are clustered together with their output values by estimating the parameters of their joint distribution.
2. Local expert training: a local expert (linear, quadratic, cubic, radial basis functions, or different forms of kriging) is then built on each cluster.
3. Recombination: the local experts are finally combined into a global model using the Gaussian mixture model parameters found by the EM algorithm.
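The three steps can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: the inputs are one-dimensional, the mixture is fitted on the inputs only (rather than on the joint input-output distribution mentioned in step 1), every expert is a straight line, and all function names (`fit_gmm`, `fit_linear_expert`, `fit_moe`, `predict`) as well as the toy data are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    dens = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]

def fit_gmm(xs, k=2, iters=100):
    """Step 1 (simplified): EM for a 1-D Gaussian mixture over the inputs."""
    lo, hi = min(xs), max(xs)
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]  # evenly spread start
    variances = [((hi - lo) / (2 * k)) ** 2] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft assignment of every point to every component
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted points
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
            variances[i] = (
                sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / n_i
                + 1e-6  # small floor keeps the variance strictly positive
            )
            weights[i] = n_i / len(xs)
    return weights, means, variances

def fit_linear_expert(xs, ys, w):
    """Step 2: weighted least-squares line y = a*x + b for one cluster."""
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def fit_moe(xs, ys, k=2):
    """Run steps 1 and 2: cluster, then train one expert per cluster."""
    gmm = fit_gmm(xs, k)
    resp = [responsibilities(x, *gmm) for x in xs]
    experts = [fit_linear_expert(xs, ys, [r[i] for r in resp]) for i in range(k)]
    return gmm, experts

def predict(x, gmm, experts):
    """Step 3: recombine the experts, weighted by the cluster posteriors."""
    gate = responsibilities(x, *gmm)
    return sum(g * (a * x + b) for g, (a, b) in zip(gate, experts))

# Toy data: two input clusters that follow different lines.
xs = [-3 + 0.1 * i for i in range(21)] + [1 + 0.1 * i for i in range(21)]
ys = [-x if x < 0 else 2 * x + 1 for x in xs]

gmm, experts = fit_moe(xs, ys)
left_prediction = predict(-2.0, gmm, experts)
right_prediction = predict(2.0, gmm, experts)
```

On this toy data, the prediction near each cluster is dominated by that cluster's expert, so `left_prediction` follows the line `y = -x` and `right_prediction` follows `y = 2x + 1`.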

Once the local models $\hat{y}_i$ are known, the global model is:

$\hat{y}({\bf x})=\sum_{i=1}^{K} \mathbb{P}(\kappa=i|X={\bf x}) \hat{y}_i({\bf x})$

which is the classical probabilistic expression of a mixture of experts. In this equation, $K$ is the number of Gaussian components; $\mathbb{P}(\kappa=i|X={\bf x})$, known as the gating network, is the probability that the input lies in cluster $i$ given that $X={\bf x}$; and $\hat{y}_i$ is the local expert built on cluster $i$.
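
For a Gaussian mixture, the gating probability can be written out with Bayes' rule. As a sketch, assume that component $i$ has a mixing weight $\pi_i$ and that its density over the inputs is a Gaussian with mean $\mu_i$ and covariance $\Sigma_i$ (these symbols are introduced here for illustration only):

$\mathbb{P}(\kappa=i|X={\bf x})=\frac{\pi_i\,\mathcal{N}({\bf x};\mu_i,\Sigma_i)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}({\bf x};\mu_j,\Sigma_j)}$

These weights are non-negative and sum to one over $i$, so $\hat{y}({\bf x})$ is a convex combination of the local expert predictions.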

<p><center><figure><img src='_images/L861465_2.png'><figcaption>An example of a MoE that can be used as a standalone learner or layer in a neural network.</figcaption></figure></center></p>

Each expert network is essentially its own shared-bottom network, and all experts use the same architecture. The assumption is that each expert can learn different patterns in the data and specialize in different aspects of the input.

The gating network then produces weights, conditioned on the input data, so that the task uses a weighted average of the expert networks’ outputs. The gating network’s final layer is a softmax layer (**g(x)**), whose outputs serve as the coefficients of a linear combination of the expert networks’ outputs (**y**).

The main innovation of this architecture is that the model can activate parts of the network differently for each sample. Because the gating network takes the input data as its own input and is trained jointly with the rest of the model, the model learns how to weight each expert network based on the properties of each input.
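
The forward pass of such a softmax-gated layer can be sketched in a few lines of plain Python. This is an illustrative sketch only: the experts are single linear layers for brevity, the weights are hand-picked rather than trained, and the names `softmax`, `linear`, and `moe_forward` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def moe_forward(x, experts, gate_w, gate_b):
    """Return the gated combination of expert outputs and the gate weights."""
    g = softmax(linear(gate_w, gate_b, x))            # g(x): one weight per expert
    outs = [linear(w, b, x) for w, b in experts]      # every expert sees the same x
    dim = len(outs[0])
    y = [sum(g[i] * outs[i][d] for i in range(len(experts))) for d in range(dim)]
    return y, g

# Two hand-picked linear experts: the first reads x[0], the second reads x[1].
experts = [([[1.0, 0.0]], [0.0]), ([[0.0, 1.0]], [0.0])]
# Gate that prefers expert 0 when x[0] is positive and expert 1 when it is negative.
gate_w = [[2.0, 0.0], [-2.0, 0.0]]
gate_b = [0.0, 0.0]

y_a, g_a = moe_forward([3.0, 10.0], experts, gate_w, gate_b)
y_b, g_b = moe_forward([-3.0, 10.0], experts, gate_w, gate_b)
```

The two inputs differ only in the sign of their first feature, yet the gate routes them to different experts: `g_a` puts almost all of its weight on the first expert, while `g_b` puts almost all of it on the second. In a trained model, the gate and expert parameters would be learned jointly instead of being fixed by hand.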