# Mixture Density Networks
>  This post reviews Mixture Density Networks and their applications in deep neural networks for regression and classification problems.
- toc: False
- badges: true
- comments: true
- categories: [Deep learning, Machine learning, Probabilistic Machine learning]
- image:  images/post/deep-1.jpg
- author: Anthony Faustine


## Introduction
Deep learning models are widely used for prediction problems, which involve learning a mapping from a set of input variables $\mathbf{x}=\{x_1, \ldots, x_d\}$ to a set of output variables $\mathbf{y}=\{y_1, \ldots,y_c\}$. Here $d$ is the number of input features and $c$ is the dimension of the output (target). The network is usually trained by minimizing a sum-of-squares or cross-entropy error function over a set of training data $\mathcal{D}=\{\mathbf{x}_{1:N},\mathbf{y}_{1:N}\}$, where $\mathbf{x}_{1:N}\in \mathbb{R}^{N\times d}$ and $\mathbf{y}_{1:N}\in \mathbb{R}^{N\times c}$.
This approach implicitly assumes a deterministic one-to-one mapping between the input $\mathbf{x}$ and the target $\mathbf{y}$, without any uncertainty. As a result, the output of a network trained this way approximates the conditional mean of the targets in the training data, conditioned on the input vector. For classification problems with a well-chosen target coding scheme, these conditional averages represent the posterior probabilities of class membership and can be regarded as optimal. For problems that involve predicting a continuous variable, however, the conditional average is often a poor description of the data, because it cannot represent a multimodal output distribution. This is especially problematic for one-to-many problems, in which each input can have several valid outputs.
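
To make the one-to-many issue concrete, here is a minimal sketch using made-up toy data (not from any real experiment): every input $x$ has two valid targets, $+\sqrt{x}$ and $-\sqrt{x}$. A least-squares fit settles near their average, which is zero and matches neither branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-to-many data: each x has two valid targets, +sqrt(x) and -sqrt(x).
x = rng.uniform(0.0, 1.0, size=2000)
sign = rng.choice([-1.0, 1.0], size=2000)
y = sign * np.sqrt(x)

# Least-squares straight-line fit, i.e. the sum-of-squares minimiser
# within this (deliberately simple) model class.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# The fit stays close to zero (the conditional mean) and far from the data.
print("largest |prediction|:", np.abs(y_hat).max())
print("mean distance to the observed targets:", np.abs(y - y_hat).mean())
```

The prediction is a reasonable estimate of the conditional mean, yet it is a poor prediction for almost every individual sample.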

## Mixture Density Network
A Mixture Density Network (MDN), as proposed by Bishop, is a flexible framework for modeling an arbitrary conditional probability distribution $p(\mathbf{y}|\mathbf{x})$ as a mixture of distributions. It combines a mixture model with a deep neural network (DNN): the DNN parametrizes a mixture model consisting of some predefined distributions, such that
\begin{equation}
    p(\mathbf{y}|\mathbf{x}) = \sum_{i=1}^K \pi _i \phi(\mathbf{y}|\theta _i)
\end{equation} where $K$ is the number of mixture components, $\phi$ can be any parametric distribution with parameters $\theta_i$, and $\pi_i$ is the weight (mixing coefficient) of the $i$-th component. Because the network outputs a full mixture rather than a single point estimate, an MDN can handle multimodality better than a standard discriminative neural network.

Suppose we use Gaussian components with parameters $\theta_{i}=\{ \boldsymbol{\mu}_{i},\boldsymbol{\Sigma}^2_{i}\}$, such that
\begin{equation}
     p(\mathbf{y}|\mathbf{x}) = \sum_{i=1}^K \pi_i \mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}^2_{i})
\end{equation}
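
As a quick numerical illustration of the Gaussian mixture density, the sketch below evaluates it for scalar $y$ with fixed parameters and checks that it integrates to one. The helper name `gaussian_mixture_pdf` and the parameter values are illustrative assumptions. In an MDN these parameters would come from the network rather than being fixed.

```python
import math
import torch

def gaussian_mixture_pdf(y, pi, mu, sigma):
    """Evaluate sum_i pi_i * N(y | mu_i, sigma_i^2) for scalar-valued y."""
    y = y.unsqueeze(-1)  # (n, 1), broadcasts against the (K,) parameters
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    comp = norm * torch.exp(-0.5 * ((y - mu) / sigma) ** 2)  # (n, K)
    return (pi * comp).sum(dim=-1)

# Arbitrary parameters for K = 2 components.
pi = torch.tensor([0.3, 0.7])
mu = torch.tensor([-2.0, 1.0])
sigma = torch.tensor([0.5, 1.0])

grid = torch.linspace(-10.0, 10.0, 4001)
density = gaussian_mixture_pdf(grid, pi, mu, sigma)

# Because the weights sum to one, the mixture is itself a valid density.
area = torch.trapezoid(density, grid)
print(float(area))  # close to 1
```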

The parameters $\pi_i$, $\boldsymbol{\mu}_{i}$, and $\boldsymbol{\Sigma}^2_{i}$ are all outputs of the neural network and therefore functions of the input $\mathbf{x}$.
The mixture weights $\pi_k(\mathbf{x})$ represent the relative contribution of each mixture component and can be interpreted as the probability of the $k$-th component given an observation $\mathbf{x}$. They model the probability that a data point was generated by each component, which allows the network to encode uncertainty about its prediction. If we introduce a latent variable with $K$ possible states that indicates which component generated a data point, then $\pi_k(\mathbf{x})$ is the probability distribution over these states. Specifically, the MDN passes the input vector through a DNN with an output layer of linear units to obtain the output
 $$
 \hat{\mathbf{z}} = f(\mathbf{x}, \mathbf{\theta})
 $$
 
The mixing coefficient $\pi_k(\mathbf{x})$ is modeled as the softmax transformation of the corresponding outputs $z_k^{\pi}$, which guarantees that the coefficients are positive and sum to one.

$$
\pi_k(\mathbf{x}) = \frac{\exp(z_k^{\pi})}{\sum_{j=1}^K \exp(z_j^{\pi})}
$$
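
The softmax step can be written out by hand to mirror the formula. This is a minimal sketch. The function name `mixing_coefficients` is hypothetical, and subtracting the row-wise maximum is a common numerical-stability trick that the text above does not mention.

```python
import torch

def mixing_coefficients(z_pi):
    """Softmax over the last dimension, written out explicitly."""
    # Subtracting the row-wise maximum leaves the result unchanged
    # but prevents overflow inside exp.
    shifted = z_pi - z_pi.max(dim=-1, keepdim=True).values
    weights = torch.exp(shifted)
    return weights / weights.sum(dim=-1, keepdim=True)

z_pi = torch.tensor([[2.0, -1.0, 0.5],
                     [1000.0, 999.0, 998.0]])  # second row would overflow a naive exp
pi = mixing_coefficients(z_pi)
print(pi)
print(pi.sum(dim=-1))  # every row sums to one
```

In practice `torch.softmax(z_pi, dim=-1)` gives the same result.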

The mean of the $k$-th kernel is modeled directly by the corresponding network outputs:

$$
\mu_{k}(\mathbf{x}) = z_k^{\mu}
$$

On the other hand, the variance $\sigma_k^2$ is obtained by applying an exponential activation function to the corresponding network output, which ensures that it is always positive.

$$
\sigma_k^2(\mathbf{x}) = \exp(z_k^{\sigma})
$$
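
A tiny sketch of the exponential parameterization, using made-up raw outputs: reading the network output as a log-variance means any real value maps to a valid positive variance, and the standard deviation is obtained by halving the exponent.

```python
import torch

# Unconstrained network outputs (illustrative values).
z_sigma = torch.tensor([-3.0, 0.0, 2.0])

variance = torch.exp(z_sigma)      # sigma^2 = exp(z), always positive
sigma = torch.exp(0.5 * z_sigma)   # the matching standard deviation

print(variance)
print(sigma)
```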



This can be implemented in PyTorch as follows:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as dist

class MDN(nn.Module):
    def __init__(self, in_dims=1, out_dims=2, hidden_dims=(64, 64), min_std=0.01,
                 kmix=5, activation=nn.ReLU()):
        super().__init__()
        self.in_dims = in_dims
        self.out_dims = out_dims
        self.kmix = kmix
        self.min_std = min_std
        # Shared feature extractor: one linear layer plus activation per hidden size.
        layers, prev = [], in_dims
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), activation]
            prev = h
        self._z = nn.Sequential(*layers)
        # One linear head per group of mixture parameters.
        self._pi = nn.Linear(prev, kmix)
        self._mu = nn.Linear(prev, kmix * out_dims)
        self._sigma = nn.Linear(prev, kmix * out_dims)

    def forward(self, x):
        z = self._z(x)
        # Mixing coefficients: softmax keeps them positive and summing to one.
        pi = torch.softmax(self._pi(z), dim=-1)
        # Means: unconstrained outputs, shape (batch, kmix, out_dims).
        mu = self._mu(z).reshape(-1, self.kmix, self.out_dims)
        # The head predicts the log-variance; exp keeps the variance positive.
        log_var = self._sigma(z).reshape(-1, self.kmix, self.out_dims)
        # min_std acts as a small floor on the standard deviation.
        sigma = torch.exp(0.5 * log_var) + self.min_std
        mix = dist.Categorical(probs=pi)
        comp = dist.Independent(dist.Normal(mu, sigma), 1)
        return dist.MixtureSameFamily(mix, comp)
```
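
To check the tensor shapes that `MixtureSameFamily` expects, the following sketch builds the distribution from random stand-ins for the three network heads. The sizes and values are arbitrary assumptions, and no trained model is involved.

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)
batch, kmix, out_dims = 4, 5, 2  # illustrative sizes

# Random stand-ins for the outputs of the pi, mu and sigma heads.
pi = torch.softmax(torch.randn(batch, kmix), dim=-1)
mu = torch.randn(batch, kmix, out_dims)
sigma = torch.exp(0.5 * torch.randn(batch, kmix, out_dims))

mix = dist.Categorical(probs=pi)                    # batch_shape (4,)
comp = dist.Independent(dist.Normal(mu, sigma), 1)  # batch_shape (4, 5), event_shape (2,)
gmm = dist.MixtureSameFamily(mix, comp)             # batch_shape (4,), event_shape (2,)

y = torch.randn(batch, out_dims)
print(gmm.log_prob(y).shape)  # torch.Size([4])
print(gmm.sample().shape)     # torch.Size([4, 2])
print(gmm.mean.shape)         # torch.Size([4, 2])
```

The component axis has to be the last batch dimension of the component distribution, which is why the means and standard deviations are arranged as `(batch, kmix, out_dims)`.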

## Training MDN
As a generative model, an MDN can be trained with the backpropagation algorithm under the maximum likelihood criterion. Let $\theta$ denote the vector of trainable parameters, so that the model can be written as a function of $\mathbf{x}$ parameterized by $\theta$:

$$
p(\mathbf{y}|\mathbf{x}, \mathbf{\theta})=\sum_{k=1}^K \pi_k(\mathbf{x}, \mathbf{\theta}) \mathcal{N}(\mathbf{y}; \mu_k(\mathbf{x}, \mathbf{\theta}), \sigma_k^2(\mathbf{x}, \mathbf{\theta}))
$$

Given a data set $\mathcal{D}=\{\mathbf{X},\mathbf{Y}\}$, we want to maximize the posterior
$$
p(\mathbf{\theta}|\mathcal{D}) = p(\mathbf{\theta}|\mathbf{Y},\mathbf{X})
$$ 

By Bayes' theorem, and assuming that the prior over $\mathbf{\theta}$ does not depend on $\mathbf{X}$, we have
$$
p(\mathbf{\theta}|\mathbf{Y},\mathbf{X})p(\mathbf{Y}|\mathbf{X}) = p(\mathbf{Y},\mathbf{\theta} |\mathbf{X}) = p(\mathbf{Y}|\mathbf{X},\mathbf{\theta})p(\mathbf{\theta})
$$


which leads to

$$
p(\mathbf{\theta}|\mathbf{Y},\mathbf{X}) = \frac{p(\mathbf{Y}|\mathbf{X},\mathbf{\theta})p(\mathbf{\theta})}{p(\mathbf{Y}|\mathbf{X})} \propto p(\mathbf{Y}|\mathbf{X},\mathbf{\theta})p(\mathbf{\theta})
$$
where
$$
p(\mathbf{Y}|\mathbf{X},\mathbf{\theta})=\prod_{n=1}^N p(\mathbf{y}_n|\mathbf{x}_n, \mathbf{\theta})
$$
is simply the product of the conditional densities of the individual data points, which are assumed to be independent.


To define an error function, the standard approach is the maximum likelihood method, which requires maximisation of the log-likelihood function or, equivalently, minimisation of the negative logarithm of the likelihood. Applying the same idea to the posterior above and dropping the normalizing constant, which does not depend on $\mathbf{\theta}$, the error function for the Mixture Density Network is:

$$
\begin{aligned}
E(\theta, \mathcal{D})&= -\log \left[ p(\mathbf{Y}|\mathbf{X},\mathbf{\theta})\,p(\mathbf{\theta}) \right]\\
&= -\left(\log \prod_{n=1}^N p(\mathbf{y}_n|\mathbf{x}_n, \mathbf{\theta}) + \log p(\mathbf{\theta})\right)\\
&=-\left(\sum_{n=1}^N \log \sum_{k=1}^K \pi_k(\mathbf{x}_n) \mathcal{N}(\mathbf{y}_n; \mu_k(\mathbf{x}_n), \sigma_k^2(\mathbf{x}_n)) + \log p(\mathbf{\theta})\right)
\end{aligned}
$$

If we assume a non-informative prior, $p(\mathbf{\theta})=1$, the error function simplifies to

$$
E(\theta, \mathcal{D}) = -\sum_{n=1}^N \log \sum_{k=1}^K \pi_k(\mathbf{x}_n) \mathcal{N}(\mathbf{y}_n; \mu_k(\mathbf{x}_n), \sigma_k^2(\mathbf{x}_n))
$$
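
The inner logarithm of a sum is usually evaluated with the log-sum-exp trick, because multiplying small densities directly can underflow. The sketch below is one way to write the error function by hand for diagonal Gaussian components. The function name `mdn_nll` and the random test tensors are assumptions for illustration, and the result is compared against `torch.distributions` as a sanity check.

```python
import math
import torch
import torch.distributions as dist

def mdn_nll(y, pi, mu, sigma):
    """Negative log-likelihood summed over the batch.

    Shapes: y (N, C), pi (N, K), mu and sigma (N, K, C).
    """
    y = y.unsqueeze(1)  # (N, 1, C)
    # log N(y | mu_k, sigma_k^2), summed over the C independent output dimensions.
    log_comp = (-0.5 * ((y - mu) / sigma) ** 2
                - torch.log(sigma)
                - 0.5 * math.log(2.0 * math.pi)).sum(dim=-1)     # (N, K)
    # log sum_k pi_k N_k = logsumexp_k(log pi_k + log N_k)
    log_mix = torch.logsumexp(torch.log(pi) + log_comp, dim=-1)  # (N,)
    return -log_mix.sum()

torch.manual_seed(0)
N, K, C = 8, 3, 2
pi = torch.softmax(torch.randn(N, K), dim=-1)
mu = torch.randn(N, K, C)
sigma = torch.exp(0.5 * torch.randn(N, K, C))
y = torch.randn(N, C)

manual = mdn_nll(y, pi, mu, sigma)
gmm = dist.MixtureSameFamily(dist.Categorical(probs=pi),
                             dist.Independent(dist.Normal(mu, sigma), 1))
reference = -gmm.log_prob(y).sum()
print(float(manual), float(reference))  # the two values should agree
```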

The loss function can be computed as follows (averaging over the batch instead of summing only rescales the objective):
```python
def log_nlloss(self, y, gmm):
    # Intended as a method of the MDN class: average negative
    # log-likelihood of the targets y under the predicted mixture gmm.
    logprobs = gmm.log_prob(y)
    gmm_nll = -torch.mean(logprobs)
    return gmm_nll
```
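
Putting the pieces together, a training loop might look like the sketch below. Everything in it is an assumption made for illustration: the compact `TinyMDN` class, the synthetic one-to-many toy data, the Adam optimizer, and the hyperparameters. It is not a tuned or definitive setup.

```python
import math
import torch
import torch.nn as nn
import torch.distributions as dist

torch.manual_seed(0)

class TinyMDN(nn.Module):
    """Hypothetical minimal MDN for scalar inputs and scalar targets."""
    def __init__(self, hidden=32, kmix=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, kmix)
        self.mu = nn.Linear(hidden, kmix)
        self.log_var = nn.Linear(hidden, kmix)

    def forward(self, x):
        h = self.body(x)
        mix = dist.Categorical(logits=self.pi(h))  # softmax applied internally
        comp = dist.Normal(self.mu(h), torch.exp(0.5 * self.log_var(h)))
        return dist.MixtureSameFamily(mix, comp)

# Synthetic inverse problem: several targets t can map to the same input x.
t = torch.rand(512)
x = (t + 0.3 * torch.sin(2 * math.pi * t) + 0.05 * torch.randn(512)).unsqueeze(-1)
y = t  # shape (512,), matching the scalar event shape of the mixture

model = TinyMDN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = -model(x).log_prob(y).mean()  # mean negative log-likelihood
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"NLL: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, `model(x).sample()` draws one plausible target per input, so repeated sampling can land on different branches for the same $x$.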