### Helpful resources
1. https://christian-igel.github.io/paper/TRBMAI.pdf
2. https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf

### Probabilistic model
A restricted Boltzmann machine is a special form of a Boltzmann machine. Its nodes form a bipartite graph: one set of units is observable (visible), the other is hidden, and connections exist only between the two sets.
#### Model assumption
The joint probability distribution over visible and hidden units is modeled as
\begin{align}
p(v,h) = \frac{e^{-E(v,h)}}{Z},
\end{align}
where $Z = \sum_{v,h} e^{-E(v,h)}$ is the partition function and $E$ is the energy function
\begin{align}
E(v,h) = - \sum_{i,j} w_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j,
\end{align}
Here $v$ and $h$ denote the states of the observable and hidden units, respectively, $w_{ij}$ are the connection weights, and $a_i$ and $b_j$ are the visible and hidden biases.
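
To make the definitions concrete, the following is a minimal sketch that evaluates the energy and the joint probability by brute-force enumeration. It assumes binary units with values in $\{0, 1\}$; the function names and the toy parameter values are hypothetical and chosen only for illustration. It uses the standard library only, so it is slow but dependency-free.

```python
import itertools
import math


def energy(v, h, w, a, b):
    """E(v, h) = -sum_ij w_ij v_i h_j - sum_i a_i v_i - sum_j b_j h_j."""
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_bias = sum(b_j * h_j for b_j, h_j in zip(b, h))
    return -interaction - visible_bias - hidden_bias


def binary_states(n):
    """All 2**n binary configurations of n units, as tuples."""
    return list(itertools.product((0, 1), repeat=n))


def partition_function(w, a, b):
    """Z = sum over all (v, h) of exp(-E(v, h)); only feasible for tiny models."""
    return sum(math.exp(-energy(v, h, w, a, b))
               for v in binary_states(len(a))
               for h in binary_states(len(b)))


def joint_probability(v, h, w, a, b):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, w, a, b)) / partition_function(w, a, b)


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
w = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]  # w[i][j] couples v_i and h_j
a = [0.1, -0.2, 0.0]                         # visible biases
b = [0.3, -0.1]                              # hidden biases

# The joint probabilities over all 2**5 configurations should sum to one.
total = sum(joint_probability(v, h, w, a, b)
            for v in binary_states(3) for h in binary_states(2))
print(f"sum of p(v, h) over all states: {total:.6f}")
```

Recomputing $Z$ on every call is wasteful, but it keeps each function a direct transcription of one formula.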

### Learning a restricted Boltzmann machine
To perform maximum likelihood training, we need the marginal distribution over the observable units
\begin{align}
    p(v) = \sum_h p(v,h).
\end{align}
The derivative of the log-likelihood (for one data point) with respect to an arbitrary parameter $\theta$ is
\begin{align}
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial \theta} &= \frac{\partial}{\partial \theta} \left[ - \log Z + \log \left( \sum_h  e^{-E(v,h)} \right) \right] \\
    &= - \frac{1}{ \sum_h  e^{-E(v,h)}}  \sum_h  e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{ \sum_{v,h}  e^{-E(v,h)}}  \sum_{v,h}  e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta} \\
    &= - \underbrace{\sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta}}_{\text{negative phase}}.
\end{align}
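
Both phases depend on the model only through the derivative of the energy. These derivatives follow directly from the definition of $E$ and are written out here for reference:
\begin{align}
    \frac{\partial E(v,h)}{\partial w_{ij}} = - v_i h_j, \qquad \frac{\partial E(v,h)}{\partial a_i} = - v_i, \qquad \frac{\partial E(v,h)}{\partial b_j} = - h_j.
\end{align}
Up to sign, each phase is therefore an expectation of $v_i h_j$, $v_i$, or $h_j$: under $p(h \mid v)$ in the positive phase and under $p(v,h)$ in the negative phase.
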
#### Conditional distributions
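Because the graph is bipartite, the hidden units are conditionally independent given the visible units (and vice versa), so the conditional distributions of binary units can be calculated in closed form. With $\sigma(x) = 1 / (1 + e^{-x})$ denoting the logistic sigmoid, they are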
\begin{align}
    p(h_j=1 \mid v) = \sigma \left( \sum_{i=1}^{n_{\text{visible}}} w_{ij} v_i + b_j \right),
\end{align}
and 
\begin{align}
p(v_i=1 \mid h) = \sigma \left( \sum_{j=1}^{n_{\text{hidden}}} w_{ij} h_j + a_i \right).
\end{align}
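
The two conditionals translate almost literally into code. The sketch below assumes binary units in $\{0, 1\}$ and checks the closed form for $p(h_j = 1 \mid v)$ against the definition, obtained by enumerating all hidden states. The function names and toy parameters are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_hidden_given_visible(v, w, b):
    """[p(h_j = 1 | v) for every hidden unit j]."""
    return [sigmoid(sum(w[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def p_visible_given_hidden(h, w, a):
    """[p(v_i = 1 | h) for every visible unit i]."""
    return [sigmoid(sum(w[i][j] * h[j] for j in range(len(h))) + a[i])
            for i in range(len(a))]


def energy(v, h, w, a, b):
    """E(v, h) = -sum_ij w_ij v_i h_j - sum_i a_i v_i - sum_j b_j h_j."""
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-interaction
            - sum(a_i * v_i for a_i, v_i in zip(a, v))
            - sum(b_j * h_j for b_j, h_j in zip(b, h)))


def brute_force_p_hidden(j, v, w, a, b):
    """p(h_j = 1 | v) from the definition, enumerating all hidden states."""
    hidden_states = list(itertools.product((0, 1), repeat=len(b)))
    unnormalized = [math.exp(-energy(v, h, w, a, b)) for h in hidden_states]
    numerator = sum(u for h, u in zip(hidden_states, unnormalized)
                    if h[j] == 1)
    return numerator / sum(unnormalized)


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
w = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
v = (1, 0, 1)

closed_form = p_hidden_given_visible(v, w, b)
enumerated = [brute_force_p_hidden(j, v, w, a, b) for j in range(len(b))]
print("closed form:", closed_form)
print("enumerated: ", enumerated)
```

The closed form costs one weighted sum per hidden unit, whereas the enumeration visits all $2^{n_{\text{hidden}}}$ hidden states.
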
#### Factorization trick
The positive phase can be computed cheaply because $p(h \mid v)$ factorizes over the hidden units. For example, for $\theta = w_{ij}$, writing $h_{-j}$ for all hidden units other than $h_j$, we get
\begin{align}
    - \sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial w_{ij}} &= \sum_h \prod_{k=1}^{n_{\text{hidden}}} p(h_k \mid v) v_i h_j \\
    &= \sum_{h_j} \sum_{h_{-j}} p(h_j \mid v) p(h_{-j} \mid v) h_j v_i \\
    &= \sum_{h_j} p(h_j \mid v) h_j v_i \underbrace{\left( \sum_{h_{-j}} p(h_{-j} \mid v) \right)}_{=1} \\
    &= \sum_{h_j} p(h_j \mid v) h_j v_i = p(h_j = 1 \mid v) v_i.
\end{align}
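
The identity derived above can be checked numerically on a small model. The sketch below compares the full sum over all hidden states with the factorized result $p(h_j = 1 \mid v) v_i$. It assumes binary units in $\{0, 1\}$; the function names and toy parameters are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_hidden_given_visible(v, w, b):
    """[p(h_j = 1 | v) for every hidden unit j]."""
    return [sigmoid(sum(w[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def p_h_given_v(h, v, w, b):
    """p(h | v) as the product of the per-unit conditionals p(h_k | v)."""
    on_probs = p_hidden_given_visible(v, w, b)
    return math.prod(p if h_k == 1 else 1.0 - p
                     for h_k, p in zip(h, on_probs))


def positive_phase_full_sum(i, j, v, w, b):
    """sum_h p(h | v) v_i h_j, enumerating all 2**n_hidden hidden states."""
    return sum(p_h_given_v(h, v, w, b) * v[i] * h[j]
               for h in itertools.product((0, 1), repeat=len(b)))


def positive_phase_factorized(i, j, v, w, b):
    """p(h_j = 1 | v) v_i, the result of the factorization trick."""
    return p_hidden_given_visible(v, w, b)[j] * v[i]


# Hypothetical toy parameters: 2 visible units, 3 hidden units.
w = [[0.4, -0.7, 0.2], [-0.1, 0.3, 0.9]]
b = [0.0, 0.5, -0.4]
v = (1, 1)

for i in range(len(v)):
    for j in range(len(b)):
        full = positive_phase_full_sum(i, j, v, w, b)
        fast = positive_phase_factorized(i, j, v, w, b)
        print(f"i={i} j={j} full sum={full:.6f} factorized={fast:.6f}")
```

The full sum has $2^{n_{\text{hidden}}}$ terms per weight, while the factorized form needs a single sigmoid evaluation.
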
#### Learning algorithm 
Using the factorization trick, the derivatives of the log-likelihood (for one data point) with respect to the connection weights $w_{ij}$, visible biases $a_i$, and hidden biases $b_j$ can be computed as
\begin{align}
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial w_{ij}} &= p(h_j = 1 \mid v) v_i - \sum_v p(v) p(h_j = 1 \mid v) v_i, \\
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial a_i} &= v_i - \sum_v p(v) v_i, \\
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial b_j} &= p(h_j=1 \mid v) - \sum_v p(v) p(h_j = 1 \mid v),
\end{align}
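respectively. The sums over $v$ run over all $2^{n_{\text{visible}}}$ configurations of the visible units, so the number of summands grows exponentially with the number of visible units. For a large number of visible units these sums are intractable, hence we need to approximate the gradients.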
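
For a model small enough to enumerate, the exact gradients can still be computed, which is useful as a reference when testing an approximation. The sketch below evaluates the three gradient formulas by summing over all visible configurations and compares one entry with a central finite difference of the log-likelihood. It assumes binary units in $\{0, 1\}$; all function names and toy parameters are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def states(n):
    """All 2**n binary configurations of n units."""
    return list(itertools.product((0, 1), repeat=n))


def energy(v, h, w, a, b):
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-interaction
            - sum(a_i * v_i for a_i, v_i in zip(a, v))
            - sum(b_j * h_j for b_j, h_j in zip(b, h)))


def unnormalized_marginal(v, w, a, b):
    """sum_h exp(-E(v, h)), i.e. Z * p(v)."""
    return sum(math.exp(-energy(v, h, w, a, b)) for h in states(len(b)))


def log_likelihood(v, w, a, b):
    """log p(v) = log sum_h exp(-E(v, h)) - log Z."""
    z = sum(unnormalized_marginal(u, w, a, b) for u in states(len(a)))
    return math.log(unnormalized_marginal(v, w, a, b)) - math.log(z)


def p_hidden(v, w, b):
    """[p(h_j = 1 | v) for every hidden unit j]."""
    return [sigmoid(sum(w[i][j] * v[i] for i in range(len(v))) + b[j])
            for j in range(len(b))]


def exact_gradients(v, w, a, b):
    """Exact gradients of log p(v); enumerates all 2**n_visible states."""
    n_v, n_h = len(a), len(b)
    all_v = states(n_v)
    unnorm = [unnormalized_marginal(u, w, a, b) for u in all_v]
    z = sum(unnorm)
    p_v = [x / z for x in unnorm]                  # p(u) for every state u
    ph_model = [p_hidden(u, w, b) for u in all_v]  # p(h_j = 1 | u)
    ph_data = p_hidden(v, w, b)                    # p(h_j = 1 | data point)
    grad_w = [[ph_data[j] * v[i]
               - sum(p * ph[j] * u[i]
                     for p, ph, u in zip(p_v, ph_model, all_v))
               for j in range(n_h)] for i in range(n_v)]
    grad_a = [v[i] - sum(p * u[i] for p, u in zip(p_v, all_v))
              for i in range(n_v)]
    grad_b = [ph_data[j] - sum(p * ph[j] for p, ph in zip(p_v, ph_model))
              for j in range(n_h)]
    return grad_w, grad_a, grad_b


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
w = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
v = (1, 0, 1)

grad_w, grad_a, grad_b = exact_gradients(v, w, a, b)

# Central finite difference of log p(v) with respect to w[0][1].
eps = 1e-5
w_plus = [row[:] for row in w]
w_minus = [row[:] for row in w]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric = (log_likelihood(v, w_plus, a, b)
           - log_likelihood(v, w_minus, a, b)) / (2 * eps)
print(f"analytic: {grad_w[0][1]:.8f}  finite difference: {numeric:.8f}")
```

The loop over `states(n_v)` is the intractable part: it doubles in length with every additional visible unit.
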
#### Contrastive divergence