### Helpful resources
1. https://christian-igel.github.io/paper/TRBMAI.pdf
2. https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf

### Probabilistic model
A restricted Boltzmann machine is a special form of a Boltzmann machine. Its nodes form a bipartite graph: one set of nodes is observable (visible), the other is hidden, and connections exist only between the two sets.
#### Model assumption
The probability distribution is modeled as 
\begin{align}
p(v,h) = \frac{e^{-E(v,h)}}{Z},
\end{align}
where $E$ is the energy function 
\begin{align}
E(v,h) = - \sum_{i,j} w_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j,
\end{align}
Here, $v$ and $h$ are the observable and hidden units, respectively; $a$, $b$, and $w$ are the visible biases, hidden biases, and connection weights of the model.
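
For a very small model, the joint distribution can be evaluated by brute force, which makes the role of the partition function $Z$ concrete. The following is a minimal sketch, not part of the implementation used later; the toy sizes, the random parameters, and the helper names `energy` and `joint` are assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 2, 3          # toy sizes, chosen only for illustration
w = rng.normal(size=(n_visible, n_hidden))
a = rng.normal(size=n_visible)      # visible biases
b = rng.normal(size=n_hidden)       # hidden biases

def energy(v, h):
    # E(v, h) = - sum_ij w_ij v_i h_j - sum_i a_i v_i - sum_j b_j h_j
    return -(v @ w @ h) - a @ v - b @ h

visible_states = [np.array(s) for s in itertools.product([0, 1], repeat=n_visible)]
hidden_states = [np.array(s) for s in itertools.product([0, 1], repeat=n_hidden)]

# Partition function: sum of e^{-E} over all 2^(n_visible + n_hidden) joint states
Z = sum(np.exp(-energy(v, h)) for v in visible_states for h in hidden_states)

def joint(v, h):
    # p(v, h) = e^{-E(v, h)} / Z
    return np.exp(-energy(v, h)) / Z

# The joint probabilities sum to one (up to floating-point error)
total = sum(joint(v, h) for v in visible_states for h in hidden_states)
print(total)
```

The enumeration visits every joint state, so it is only feasible for a handful of units.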

### Learning a restricted Boltzmann machine
In order to perform maximum likelihood training, we need the marginal distribution over the observable units
\begin{align}
    p(v) = \sum_h p(v,h).
\end{align}
The derivative of the log-likelihood (for one data point) with respect to an arbitrary parameter $\theta$ is 
\begin{align}
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial \theta} &= \frac{\partial}{\partial \theta} \left[ - \log Z + \log \left( \sum_h  e^{-E(v,h)} \right) \right] \\
    &= - \frac{1}{ \sum_h  e^{-E(v,h)}}  \sum_h  e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{ \sum_{v,h}  e^{-E(v,h)}}  \sum_{v,h}  e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta} \\
    &= - \underbrace{\sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta}}_{\text{negative phase}}.
\end{align}
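For the energy function defined above, the partial derivatives that appear in both phases are simple products of unit states:
\begin{align}
    \frac{\partial E(v,h)}{\partial w_{ij}} = - v_i h_j, \qquad
    \frac{\partial E(v,h)}{\partial a_i} = - v_i, \qquad
    \frac{\partial E(v,h)}{\partial b_j} = - h_j.
\end{align}
Substituting them shows that each gradient is a difference of two expectations; for $\theta = w_{ij}$, for example, it is the expectation of $v_i h_j$ under $p(h \mid v)$ minus the expectation of $v_i h_j$ under $p(v,h)$.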
#### Conditional distributions
Because the graph is bipartite, the hidden units are conditionally independent given the visible units (and vice versa), so the conditional distributions are easy to calculate:
\begin{align}
    p(h_j=1 \mid v) = \sigma \left( \sum_{i=1}^{n_{\text{visible}}} w_{ij} v_i + b_j \right),
\end{align}
and 
\begin{align}
p(v_i=1 \mid h) = \sigma \left( \sum_{j=1}^{n_{\text{hidden}}} w_{ij} h_j + a_i \right).
\end{align}
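
The closed form for $p(h_j = 1 \mid v)$ can be checked numerically against a brute-force sum over all hidden states. This is a small sanity-check sketch under assumed toy sizes and random parameters; the names `sigmoid` and `unnormalized` are illustrative helpers.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_visible, n_hidden = 3, 2          # toy sizes, chosen only for illustration
w = rng.normal(size=(n_visible, n_hidden))
a = rng.normal(size=n_visible)      # visible biases
b = rng.normal(size=n_hidden)       # hidden biases

def unnormalized(v, h):
    # e^{-E(v, h)}; the partition function Z cancels in conditionals
    return np.exp(v @ w @ h + a @ v + b @ h)

v = np.array([1, 0, 1])
hidden_states = [np.array(s) for s in itertools.product([0, 1], repeat=n_hidden)]
norm = sum(unnormalized(v, h) for h in hidden_states)

# p(h_j = 1 | v) by summing over all hidden states with h_j = 1
brute_force = np.array([
    sum(unnormalized(v, h) for h in hidden_states if h[j] == 1) / norm
    for j in range(n_hidden)
])

# p(h_j = 1 | v) from the logistic formula
closed_form = sigmoid(w.T @ v + b)

print(np.allclose(brute_force, closed_form))
```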
#### Factorization trick
The positive phase can be calculated cheaply using the factorization trick. Writing $h_{-j}$ for all hidden units except $h_j$, we get, e.g., for $\theta = w_{ij}$
\begin{align}
    - \sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial w_{ij}} &= \sum_h \prod_{k=1}^{n_{\text{hidden}}} p(h_k \mid v) v_i h_j \\
    &= \sum_{h_j} \sum_{h_{-j}} p(h_j \mid v) p(h_{-j} \mid v) h_j v_i \\
    &= \sum_{h_j} p(h_j \mid v) h_j v_i \underbrace{\left( \sum_{h_{-j}} p(h_{-j} \mid v) \right)}_{=1} \\
    &= \sum_{h_j} p(h_j \mid v) h_j v_i = p(h_j = 1 \mid v) v_i.
\end{align}
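
The identity above can be verified numerically: the explicit sum over all $2^{n_{\text{hidden}}}$ hidden states should equal the single product $p(h_j = 1 \mid v) v_i$. The following is a sketch under assumed toy sizes and random parameters, not part of the training code.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_visible, n_hidden = 3, 4          # toy sizes, chosen only for illustration
w = rng.normal(size=(n_visible, n_hidden))
b = rng.normal(size=n_hidden)       # hidden biases
v = np.array([1.0, 0.0, 1.0])

p_h1 = sigmoid(w.T @ v + b)         # p(h_j = 1 | v) for every j

# Left-hand side: explicit sum over all hidden states
lhs = np.zeros((n_visible, n_hidden))
for state in itertools.product([0, 1], repeat=n_hidden):
    h = np.array(state, dtype=float)
    # p(h | v) factorizes into independent Bernoulli terms
    p_h = np.prod(np.where(h == 1, p_h1, 1.0 - p_h1))
    lhs += p_h * np.outer(v, h)

# Right-hand side: result of the factorization trick, entry (i, j) = p(h_j = 1 | v) v_i
rhs = np.outer(v, p_h1)

print(np.allclose(lhs, rhs))
```

The left-hand side needs $2^{n_{\text{hidden}}}$ terms, while the right-hand side needs only one matrix-vector product.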
#### Learning algorithm 
Using the factorization trick, the derivatives of the log-likelihood (for one data point) with respect to the connection weights $w_{ij}$, visible biases $a_i$, and hidden biases $b_j$ can be computed as 
\begin{align}
     \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial w_{ij}} &= p(h_j = 1 \mid v) v_i - \sum_v p(v) p(h_j = 1 \mid v) v_i, \\
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial a_i} &= v_i - \sum_v p(v) v_i, \\
    \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial b_j} &= p(h_j=1 \mid v) - \sum_v p(v) p(h_j = 1 \mid v),
\end{align}
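respectively. The sum over $v$ runs over all $2^{n_{\text{visible}}}$ configurations of the visible units, so the number of summands grows exponentially with the number of visible units. For large models the sum is intractable, hence we need to approximate the second term of each gradient.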
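
For a tiny model, the second term of the weight gradient can still be computed exactly by enumerating every visible configuration, which also shows how quickly the number of terms grows. This is a sketch under assumed toy sizes and random parameters; `exact_negative_phase_w` is a hypothetical helper, and it sums out the hidden units analytically using $\sum_h e^{-E(v,h)} = e^{\sum_i a_i v_i} \prod_j \left(1 + e^{b_j + \sum_i w_{ij} v_i}\right)$.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_negative_phase_w(w, a, b):
    """Return sum_v p(v) p(h_j = 1 | v) v_i and the number of summands."""
    n_visible = w.shape[0]
    states = np.array(list(itertools.product([0, 1], repeat=n_visible)), dtype=float)
    # log of the unnormalized marginal p(v), with the hidden units summed out
    log_unnorm = states @ a + np.sum(np.logaddexp(0.0, states @ w + b), axis=1)
    p_v = np.exp(log_unnorm - np.max(log_unnorm))
    p_v /= p_v.sum()
    p_h1 = sigmoid(states @ w + b)      # shape (2^n_visible, n_hidden)
    return states.T @ (p_v[:, None] * p_h1), len(states)

rng = np.random.default_rng(3)
n_visible, n_hidden = 3, 2              # toy sizes, chosen only for illustration
w = rng.normal(size=(n_visible, n_hidden))
a = rng.normal(size=n_visible)          # visible biases
b = rng.normal(size=n_hidden)           # hidden biases

grad, n_terms = exact_negative_phase_w(w, a, b)
print(n_terms)          # 2 ** n_visible summands
print(grad.shape)
```

Each additional visible unit doubles the number of summands, which is why this approach does not scale.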
#### Contrastive divergence
We could approximate the second terms of the three equations above using block Gibbs sampling; however, this would require running the Markov chain for a long time. It has been shown that estimates obtained after running the chain for just a few steps suffice for model training. This idea has led to the $k$-step contrastive divergence algorithm. We set $v^{(0)} = v$, where $v$ is a training example, and sample $h^{(0)}$ from $p\left(h \mid v^{(0)} \right)$. In the next step, $v^{(1)}$ is sampled from $p \left( v \mid h^{(0)} \right)$, and so on, until $v^{(k)}$ is obtained. The sample $v^{(k)}$ then replaces the expectation over $p(v)$ in the second terms.
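
Written out, replacing the expectation over $p(v)$ by the single sample $v^{(k)}$ gives the following stochastic gradient ascent updates for one training example. The learning rate $\eta$ is a symbol introduced here for illustration; it corresponds to the `learning_rate` argument in the code.
\begin{align}
    w_{ij} &\leftarrow w_{ij} + \eta \left( p\left(h_j = 1 \mid v^{(0)}\right) v_i^{(0)} - p\left(h_j = 1 \mid v^{(k)}\right) v_i^{(k)} \right), \\
    a_i &\leftarrow a_i + \eta \left( v_i^{(0)} - v_i^{(k)} \right), \\
    b_j &\leftarrow b_j + \eta \left( p\left(h_j = 1 \mid v^{(0)}\right) - p\left(h_j = 1 \mid v^{(k)}\right) \right).
\end{align}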

In [1]:
import numpy as np

In [2]:
class RestrictedBoltzmannMachine:
    def __init__(self, init_weights, init_visible_biases, init_hidden_biases):
        self.weights = init_weights
        self.visible_biases = init_visible_biases
        self.hidden_biases = init_hidden_biases
        
    @staticmethod
    def _logistic_function(x):
        return (1 / (1 + np.exp(-x)))
        
    def _energy_function(self, v, h):
        # E(v, h) = - v^T W h - a^T v - b^T h, with v and h as column vectors
        v = np.reshape(v, (-1, 1))
        h = np.reshape(h, (-1, 1))
        energy = -(v.T @ self.weights @ h
                   + self.visible_biases.T @ v
                   + self.hidden_biases.T @ h)
        
        return energy.item()
    
    def _conditional_h_given_v(self, v):
        return self._logistic_function((self.weights.T @ v) + self.hidden_biases)
    
    def _conditional_v_given_h(self, h):
        return self._logistic_function((self.weights @ h) + self.visible_biases)
    
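    @staticmethod
    def _probability_to_logit(x):
        # Despite its name, this draws binary (Bernoulli) samples: each entry
        # becomes 1.0 with probability x[i] and 0.0 otherwise.
        # A new array is returned; the input is not modified.
        u = np.random.uniform(0, 1, size=x.shape)
        
        return (x > u).astype(float)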
    
    def _sampling_k_steps(self, v, k):
        for _ in range(k):
            h = self._probability_to_logit(self._conditional_h_given_v(v))
            v = self._probability_to_logit(self._conditional_v_given_h(h))
            
        return v
    
    def block_gibbs_sampling(self, n_samples, n_burn_in=1000):
        v = np.zeros((self.visible_biases.shape[0], 1))
        
        samples = []
        for i in range(n_burn_in + n_samples):
            v = self._sampling_k_steps(v, 1)

            if i >= n_burn_in:
                samples.append(v)
                
        return samples

In [3]:
def k_step_contrastive_divergence(data, model, learning_rate=1.0, k=1):
    for v in data:
        v = v.reshape((model.visible_biases.shape[0], 1))
        
        # Negative sample v^(k): k steps of block Gibbs sampling started at v
        sampled_v = model._sampling_k_steps(v, k)
        
        # Evaluate both hidden conditionals before any parameter is updated,
        # so that all three gradients are computed with the same parameters
        positive_h = model._conditional_h_given_v(v)
        negative_h = model._conditional_h_given_v(sampled_v)
        
        model.weights += learning_rate * (v @ positive_h.T - sampled_v @ negative_h.T)
        model.visible_biases += learning_rate * (v - sampled_v)
        model.hidden_biases += learning_rate * (positive_h - negative_h)

In [4]:
import matplotlib.pyplot as plt
from itertools import product
%matplotlib inline

n_visible_units = 3
n_hidden_units = 16
n_points = 100
n_steps = 100
p = [.8, .15, .05]

weights = np.random.normal(size=(n_visible_units, n_hidden_units))
visible_biases = np.random.normal(size=(n_visible_units, 1))
hidden_biases = np.random.normal(size=(n_hidden_units, 1))

rbm = RestrictedBoltzmannMachine(weights, visible_biases, hidden_biases)

In [5]:
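# All visible configurations: [(0, 0, 0), (0, 0, 1), (0, 1, 0), ..., (1, 1, 0), (1, 1, 1)]
all_configs = list(product([0, 1], repeat=n_visible_units))

# Each visible unit i is drawn independently from a Bernoulli distribution with mean p[i]
data = np.random.binomial(n=1, p=p, size=(n_points, n_visible_units))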

data_distribution = np.zeros((len(all_configs),))

for x in data:
    for j, config in enumerate(all_configs):
        if np.sum(x == config) == n_visible_units:
            data_distribution[j] += 1 / n_points

In [6]:
for i in range(n_steps):
    k_step_contrastive_divergence(data, rbm, learning_rate=0.1, k=1)

In [7]:
n_sampling = 1000
samples = rbm.block_gibbs_sampling(n_sampling, n_burn_in=1000)

In [8]:
sample_distribution = np.zeros((len(all_configs),))

for sample in samples:
    for j, config in enumerate(all_configs):
        if np.sum(sample.T == config) == n_visible_units:
            sample_distribution[j] += 1 / n_sampling

In [9]:
for i, _ in enumerate(all_configs):
    print('Configuration:', all_configs[i], 
          'learned distribution: {0:.3f}'.format(sample_distribution[i]), 
          'data distribution: {0:.3f}'.format(data_distribution[i]))

Configuration: (0, 0, 0) learned distribution: 0.018 data distribution: 0.130
Configuration: (0, 0, 1) learned distribution: 0.001 data distribution: 0.000
Configuration: (0, 1, 0) learned distribution: 0.013 data distribution: 0.040
Configuration: (0, 1, 1) learned distribution: 0.009 data distribution: 0.010
Configuration: (1, 0, 0) learned distribution: 0.769 data distribution: 0.690
Configuration: (1, 0, 1) learned distribution: 0.008 data distribution: 0.010
Configuration: (1, 1, 0) learned distribution: 0.171 data distribution: 0.120
Configuration: (1, 1, 1) learned distribution: 0.011 data distribution: 0.000
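
One way to summarize the agreement between the two columns in a single number is the total variation distance, $\frac{1}{2} \sum_v |p(v) - q(v)|$. The sketch below applies it to the values printed above; the choice of this metric and the helper name `total_variation` are assumptions, not part of the original experiment.

```python
import numpy as np

# Values copied from the printed output above, in the order of all_configs
learned = np.array([0.018, 0.001, 0.013, 0.009, 0.769, 0.008, 0.171, 0.011])
empirical = np.array([0.130, 0.000, 0.040, 0.010, 0.690, 0.010, 0.120, 0.000])

def total_variation(p, q):
    # Half the L1 distance between two discrete distributions; lies in [0, 1]
    return 0.5 * np.sum(np.abs(p - q))

tv = total_variation(learned, empirical)
print('Total variation distance: {0:.3f}'.format(tv))
```

A value of 0 would mean the sampled and empirical distributions coincide, and 1 would mean they have disjoint support.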
