
# Energy Based Model Classifier
> This post reviews the energy-based model classifier presented in Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One
- toc: true 
- badges: true
- comments: true
- categories: [ML, DL, EBM]
- images: images/post/EBM.jpg

#### IDEA:
This post reviews the energy-based model classifier presented in [Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One](https://arxiv.org/abs/1912.03263). The paper proposes to reinterpret a standard discriminative classifier of $p(y|\mathbf{x})$ as an Energy-Based Model (EBM) for the joint distribution $p(x, y)$. An EBM learns to predict whether a given pair $(x, y)$ fits together. Given an input variable $\mathbf{x}$ and a target variable $y$, the level of dependency between $x$ and $y$ is defined by the energy function $E_{\theta}(x, y)$, which maps each pair to a scalar value. $E_{\theta}(x, y)$ takes low values when $y$ is compatible with $x$ and higher values when $y$ is less compatible with $x$. The energy function can be turned into a normalized joint probability distribution $p_{\theta}(x, y)$ through the Gibbs distribution:
\begin{equation}
p_{\theta}(x, y) = \frac{\exp(- E_{\theta}(x, y))}{Z(\theta)}
\end{equation}
  
where $Z(\theta) = \int_x \int_y \exp(- E_{\theta}(x, y))\, dy\, dx$ is the normalizing constant (the integral over $y$ becomes a sum when $y$ is discrete).
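To make the Gibbs distribution concrete, here is a minimal sketch on a tiny discrete domain where $Z(\theta)$ can be computed by direct summation. The energy values and the helper name `gibbs_distribution` are made up for illustration and do not come from the paper.

```python
import math

def gibbs_distribution(energies):
    """Turn a dict {(x, y): energy} into normalized probabilities."""
    # Unnormalized weights exp(-E): lower energy gives a larger weight.
    weights = {pair: math.exp(-e) for pair, e in energies.items()}
    # Z sums the weights over every (x, y) pair in this toy domain.
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}

# Hypothetical energies for two inputs and two labels.
energies = {
    ("x0", 0): 0.5, ("x0", 1): 3.0,
    ("x1", 0): 2.5, ("x1", 1): 0.2,
}
probs = gibbs_distribution(energies)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 3))
```

The pair with the lowest energy, `("x1", 1)`, receives the highest probability. On real data the sum over all inputs is intractable, which is why $Z(\theta)$ is treated as unknown in what follows.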

#### Motivation 
There is a performance gap between the strongest generative modeling approaches to downstream tasks (semi-supervised learning, imputation of missing data, and calibration of uncertainty) and the solutions hand-tailored to each specific problem. One likely reason is that state-of-the-art generative models have diverged quite heavily from state-of-the-art discriminative architectures.

The paper aims to use EBMs to help realize the potential of generative models on downstream discriminative problems.

### Approach
The main idea of the paper is to reinterpret the logits of a classifier so that they define the joint density of data points and labels, as well as the density of data points alone.
![](my_icons/JEM.png)

Consider a classifier $f_{\theta}$ with $K$ classes, which maps each data point $\mathbf{x} \in D$ to $K$ real-valued numbers known as logits. The logits parameterize a categorical distribution via the softmax function:
\begin{equation}
p_{\theta}(y|\mathbf{x})= \frac{\exp(f_{\theta}(\mathbf{x})[y])}{\sum_{y^{\prime}} \exp(f_{\theta}(\mathbf{x})[y^{\prime}])}
\end{equation}
where $f_{\theta}(\mathbf{x})[y]$ denotes the logit for class $y$. We can reinterpret the logits obtained from $f_{\theta}(\mathbf{x})$ to define $p(\mathbf{x}, y)$ and $p(\mathbf{x})$ as well. The EBM of the joint distribution of a data point $\mathbf{x}$ and a label $y$ can thus be defined as:
\begin{equation}
p_{\theta}(\mathbf{x}, y)= \frac{\exp(f_{\theta}(\mathbf{x})[y])}{Z(\theta)}
\end{equation}
where $Z(\theta)$ is the unknown normalizing constant and $E_{\theta}(\mathbf{x}, y) = -f_{\theta}(\mathbf{x})[y]$ is the negative logit of class $y$.

Marginalizing out y, we obtain an unnormalized density model for x,
\begin{equation}
p_{\theta}(\mathbf{x})= \sum_y p_{\theta}(\mathbf{x}, y) = \frac{\sum_y \exp(f_{\theta}(\mathbf{x})[y])}{Z(\theta)}
\end{equation}
The energy function of a data point $\mathbf{x}$ can thus be defined as
\begin{equation}
E_{\theta}(\mathbf{x})= -\mathrm{LogSumExp}_y\left(f_{\theta}(\mathbf{x})[y]\right)  = -\log \sum_y \exp(f_{\theta}(\mathbf{x})[y])
\end{equation}
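- The conditional distribution $p_{\theta}(y|\mathbf{x})$ can be obtained by dividing the joint by the marginal. The normalizing constant $Z(\theta)$ cancels, which recovers the standard softmax:
\begin{equation}
p_{\theta}(y| \mathbf{x})= \frac{p_{\theta}(\mathbf{x}, y)}{p_{\theta}(\mathbf{x})} = \frac{\exp(f_{\theta}(\mathbf{x})[y])}{\sum_{y^{\prime}} \exp(f_{\theta}(\mathbf{x})[y^{\prime}])}
\end{equation}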
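The reinterpretation above can be checked numerically. The sketch below computes the joint energy and the marginal energy from one vector of logits and confirms that their ratio gives back the softmax. The logit values and function names are hypothetical, chosen only for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def joint_energy(logits, y):
    """E(x, y) = -f(x)[y], the negative logit of class y."""
    return -logits[y]

def marginal_energy(logits):
    """E(x) = -LogSumExp_y f(x)[y]."""
    return -logsumexp(logits)

def softmax(logits):
    """Standard softmax over a list of logits."""
    lse = logsumexp(logits)
    return [math.exp(v - lse) for v in logits]

logits = [2.0, -1.0, 0.5]  # hypothetical f(x) for K = 3 classes

# p(y|x) = exp(-E(x, y)) / exp(-E(x)); Z(theta) cancels in the ratio.
conditional = [
    math.exp(-joint_energy(logits, y) + marginal_energy(logits))
    for y in range(len(logits))
]
print(conditional)
print(softmax(logits))
```

Both printed lists agree, so the classifier's predictions are unchanged by the energy-based view. What is new is the marginal energy, which the standard classifier never uses.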

### Training EBM

For most choices of $E_{\theta}$, it is hard to compute or even reliably estimate $Z(\theta)$, which means that estimating the normalized densities is intractable and standard maximum likelihood estimation of the parameters $\theta$ is not straightforward. Despite a long period of little development, recent work has trained large-scale EBMs on high-dimensional data, parameterized by deep neural networks, by drawing samples with **Stochastic Gradient Langevin Dynamics (SGLD)**.

First, consider the derivative of the log-likelihood for a single example $\mathbf{x}$ with respect to $\theta$:

\begin{equation}
\frac{\partial  \log p_{\theta}(\mathbf{x})} {\partial  \theta} = \mathbb{E}_{p_{\theta}(\mathbf{x}^{\prime})}\left[\frac{\partial  E_{\theta}(\mathbf{x}^{\prime})} {\partial  \theta}\right] - \frac{\partial  E_{\theta}(\mathbf{x})} {\partial  \theta}
\end{equation}
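As a sanity check on this identity, the following sketch compares it against a finite-difference derivative on a small discrete domain where the normalizing constant can be summed exactly. The one-parameter energy $E_{\theta}(x) = \theta x^2$ and the grid `XS` are assumptions made purely for illustration.

```python
import math

XS = [-2.0, -1.0, 0.0, 1.0, 2.0]  # toy discrete domain (assumption)

def energy(theta, x):
    # Hypothetical one-parameter energy: E_theta(x) = theta * x^2.
    return theta * x * x

def d_energy_d_theta(theta, x):
    # Derivative of the toy energy with respect to theta.
    return x * x

def log_p(theta, x):
    # Exact log-density: the domain is small enough to sum Z directly.
    log_z = math.log(sum(math.exp(-energy(theta, v)) for v in XS))
    return -energy(theta, x) - log_z

def grad_log_p(theta, x):
    # E_{x' ~ p_theta}[dE(x')/dtheta] - dE(x)/dtheta
    expected = sum(
        math.exp(log_p(theta, v)) * d_energy_d_theta(theta, v) for v in XS
    )
    return expected - d_energy_d_theta(theta, x)

theta, x, h = 0.7, 1.0, 1e-6
# Central finite difference of log p_theta(x) with respect to theta.
numeric = (log_p(theta + h, x) - log_p(theta - h, x)) / (2 * h)
analytic = grad_log_p(theta, x)
print(numeric, analytic)
```

The two values agree up to finite-difference error. In this toy case the expectation is computed exactly, whereas for a neural network it is only available through samples from the model.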

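The expectation is taken under the model distribution $p_{\theta}$, so it must be approximated with samples. SGLD draws them as follows:
\begin{equation}
\mathbf{x}_{i+1} = \mathbf{x}_i - \frac{\alpha}{2}\frac{\partial  E_{\theta}(\mathbf{x}_i)} {\partial  \mathbf{x}_i} + \epsilon
\end{equation}
where $\mathbf{x}_0 \sim p_0(\mathbf{x})$ and $\epsilon \sim \mathcal{N}(0, \alpha)$. Note that this gradient is taken with respect to the input $\mathbf{x}_i$, not the parameters $\theta$.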
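The SGLD update can be sketched in one dimension with a toy energy whose density is known. Here $E(x) = x^2/2$ (a standard normal up to a constant) and a uniform initial distribution $p_0$ are assumptions chosen so the result is easy to check; they are not the paper's image-space setup.

```python
import math
import random

def grad_energy(x):
    # Toy energy E(x) = x^2 / 2, so dE/dx = x (assumption for illustration).
    return x

def sgld_chain(n_steps, alpha, rng):
    """Run one SGLD chain and return every visited state."""
    x = rng.uniform(-1.0, 1.0)  # x_0 ~ p_0, assumed uniform on [-1, 1]
    samples = []
    for _ in range(n_steps):
        # epsilon ~ N(0, alpha): alpha is the variance, so std = sqrt(alpha).
        noise = rng.gauss(0.0, math.sqrt(alpha))
        # Gradient step on the energy with respect to x, plus noise.
        x = x - 0.5 * alpha * grad_energy(x) + noise
        samples.append(x)
    return samples

rng = random.Random(0)
# Discard the first 1000 steps as burn-in.
samples = sgld_chain(n_steps=20000, alpha=0.1, rng=rng)[1000:]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} var={var:.3f}")
```

The sample mean should land near 0 and the variance near 1, with a small bias from the finite step size. For a classifier, `grad_energy` would instead be the gradient of $-\log \sum_y \exp(f_{\theta}(\mathbf{x})[y])$ with respect to the input, obtained by automatic differentiation.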

Factorize the likelihood as
\begin{equation}
\log p_{\theta}(\mathbf{x}, y) = \log p_{\theta}(\mathbf{x})  + \log p_{\theta}(y|\mathbf{x})
\end{equation}
We can therefore optimize $\log p_{\theta}(y|\mathbf{x})$ using standard cross-entropy and optimize $\log p_{\theta}(\mathbf{x})$ using the log-likelihood gradient above with SGLD, where gradients are taken with respect to $\mathrm{LogSumExp}_y(f_{\theta}(\mathbf{x})[y])$.
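One way to picture the combined objective is as a scalar loss built from the logits of real data and the logits of SGLD samples. The sketch below is an illustration under stated assumptions, not the paper's implementation: the name `jem_style_loss`, the equal weighting of the two terms, and the toy logits are all hypothetical. The generative term is a surrogate whose gradient with respect to $\theta$ matches the negative log-likelihood gradient when the samples are treated as constants.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cross_entropy(logits, label):
    # -log p(y|x) for a single example.
    return logsumexp(logits) - logits[label]

def energy(logits):
    # E(x) = -LogSumExp_y f(x)[y]
    return -logsumexp(logits)

def jem_style_loss(data_logits, labels, sample_logits):
    """Cross-entropy on data plus a surrogate for -log p(x)."""
    n = len(data_logits)
    ce = sum(cross_entropy(l, y) for l, y in zip(data_logits, labels)) / n
    e_data = sum(energy(l) for l in data_logits) / n
    e_samples = sum(energy(l) for l in sample_logits) / len(sample_logits)
    # Lower the energy of real data and raise the energy of model samples.
    return ce + (e_data - e_samples)

# Hypothetical logits: one labeled data point and one SGLD sample.
loss = jem_style_loss(
    data_logits=[[1.0, 1.0]], labels=[0], sample_logits=[[0.0, 0.0]]
)
print(loss)
```

In a real training loop the logits would come from the network, the samples from an SGLD chain run on the current model, and the gradient of this scalar would be computed by an automatic differentiation framework.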


## Application

- Hybrid modeling 
- Calibration
- Out-of-distribution detection
- Robustness