### Probabilistic model
A Boltzmann machine is a (fully observable) graphical model whose nodes are binary random variables, i.e., $x_i \in \{0, 1\}$ for all $i=1, \dots, n$. A bias $b_i$ determines how likely it is that $x_i=1$, and a weight $w_{ij}$ determines how likely it is that $x_i$ and $x_j$ are both 1.
#### Model assumption
A Boltzmann machine assumes that the joint probability distribution $p(x) = p(x_1, \dots, x_n)$ can be modeled as 
\begin{align}
p(x) = \frac{e^{H(x)}}{Z},
\end{align}
where $H$ is the happiness function defined as 
\begin{align}
H(x) = \sum_{i < j} w_{ij} x_i x_j + \sum_i b_i x_i,
\end{align}
with symmetric weights $w_{ij} = w_{ji}$, and $Z = \sum_x e^{H(x)}$ is the normalization constant.
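
For a handful of nodes, the distribution can be computed by brute force, which is a useful sanity check. The following is a minimal sketch, assuming a symmetric weight matrix `W` with zero diagonal in which each pair of nodes is counted once; the function names `happiness` and `exact_distribution` and the example parameters are illustrative choices, not part of the model definition.

```python
import itertools
import numpy as np

def happiness(x, W, b):
    """H(x) for a 0/1 vector x; W is symmetric with zero diagonal (assumption)."""
    # 0.5 * x^T W x counts every unordered pair {i, j} exactly once.
    return 0.5 * x @ W @ x + b @ x

def exact_distribution(W, b):
    """Enumerate all 2^n configurations and return them together with p(x)."""
    n = len(b)
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    H = np.array([happiness(x, W, b) for x in configs])
    unnormalized = np.exp(H - H.max())  # shifting by max(H) leaves p(x) unchanged
    return configs, unnormalized / unnormalized.sum()

# Small illustrative example with n = 3 nodes.
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.3],
              [-0.5, 0.3, 0.0]])
b = np.array([0.1, -0.2, 0.0])
configs, p = exact_distribution(W, b)
```

The enumeration has $2^n$ rows, so this approach is only practical for very small $n$.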

### Learning a Boltzmann machine
Given a dataset $\left\{X^{(k)}\right\}_{k \in [m]}$, we can learn the weights and biases of the model by the principle of maximum likelihood. In contrast to the naive Bayes model, there is no closed-form solution for the parameters; however, we can update the parameters iteratively using gradient ascent on the log-likelihood.

#### Log-likelihood function
The (average) log-likelihood function $l$ is given as 
\begin{align}
l = \left[ \frac{1}{m} \sum_{k=1}^m H\left(X^{(k)}\right) \right] - \log Z,
\end{align}
and it turns out that 
\begin{align}
    \frac{\partial l}{\partial w_{rs}} &= \frac{1}{m} \sum_{k=1}^m X^{(k)}_r X^{(k)}_s - \sum_{x} p(x) x_r x_s, \\
    \frac{\partial l}{\partial b_{r}} &= \frac{1}{m} \sum_{k=1}^m X^{(k)}_r- \sum_{x} p(x) x_r,
\end{align}
where $\sum_x$ is the sum over all $2^n$ possible configurations of the Boltzmann machine.
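
The second term in each gradient comes from differentiating $\log Z$. As a short sketch, assuming $Z = \sum_x e^{H(x)}$ and that $w_{rs}$ multiplies $x_r x_s$ exactly once in $H$, so that $\partial H(x) / \partial w_{rs} = x_r x_s$:
\begin{align}
\frac{\partial \log Z}{\partial w_{rs}} = \frac{1}{Z} \sum_{x} e^{H(x)} \frac{\partial H(x)}{\partial w_{rs}} = \sum_{x} \frac{e^{H(x)}}{Z} x_r x_s = \sum_{x} p(x) x_r x_s.
\end{align}
The first term is the average of $\partial H\left(X^{(k)}\right) / \partial w_{rs} = X^{(k)}_r X^{(k)}_s$ over the dataset. The bias gradient follows in the same way, assuming the bias enters $H$ as $b_r x_r$, so that $\partial H(x) / \partial b_r = x_r$.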

#### Learning algorithm
We first initialize the weights and biases randomly and then update them at each iteration $t$ using gradient ascent with a learning rate $\eta > 0$, i.e.,
\begin{align}
    w^{t+1}_{rs} &\gets w^{t}_{rs} + \eta \frac{\partial l}{\partial w^{t}_{rs}}, \\
    b^{t+1}_{r} &\gets b^{t}_{r} + \eta \frac{\partial l}{\partial b^{t}_{r}}.
\end{align}
Note that the sum over all configurations of the Boltzmann machine has $2^n$ terms, which makes exact gradient steps infeasible for large $n$. Hence, we approximate $\sum_{x} p(x) x_r x_s = \mathbb{E}_{\text{model}} [x_r x_s]$ and $\sum_{x} p(x) x_r = \mathbb{E}_{\text{model}} [x_r]$ using a Markov chain Monte Carlo method.
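
One update with sampled expectations might look as follows. This is a minimal sketch, assuming the data and the model samples are given as arrays whose rows are 0/1 vectors; the names `gradient_estimates` and `ascent_step` and the step size `eta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gradient_estimates(data, samples):
    """Estimate dl/dW and dl/db; model expectations are replaced by sample means."""
    data = np.asarray(data, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Entry (r, s): mean of X_r X_s over the data minus mean of x_r x_s over the samples.
    grad_W = data.T @ data / len(data) - samples.T @ samples / len(samples)
    np.fill_diagonal(grad_W, 0.0)  # assumption: no self-connections
    grad_b = data.mean(axis=0) - samples.mean(axis=0)
    return grad_W, grad_b

def ascent_step(W, b, data, samples, eta=0.1):
    """Move the parameters along the estimated gradient; eta is an assumed step size."""
    grad_W, grad_b = gradient_estimates(data, samples)
    return W + eta * grad_W, b + eta * grad_b

# Tiny illustrative example with n = 2 nodes.
data = [[1, 1], [1, 1]]
samples = [[0, 0], [1, 1]]
W_new, b_new = ascent_step(np.zeros((2, 2)), np.zeros(2), data, samples)
```

Because `grad_W` is symmetric by construction, a symmetric `W` stays symmetric after the update.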

### Sampling
For sampling, we use a Markov chain Monte Carlo method called Gibbs sampling. The reason for doing so is that Gibbs sampling only requires the conditional probability distributions
\begin{align}
\text{Pr}\left(x_i = 1 \mid x_{-i}\right),
\end{align}
which can be easily computed with our model assumption; note that $x_{-i} = \{x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n\}$. It turns out that 
\begin{align}
\text{Pr}\left(x_i = 1 \mid x_{-i}\right) = \sigma \left( \sum_{j \neq i} w_{ij} x_j + b_i \right),
\end{align}
where $\sigma$ is the logistic function.
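
A Gibbs sampler built on this conditional can be sketched as below. The sketch assumes a symmetric weight matrix `W` with zero diagonal; the function names, the `burn_in` length, and the `seed` are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(x, W, b, rng):
    """Resample every node once from Pr(x_i = 1 | x_{-i})."""
    x = x.copy()
    for i in range(len(x)):
        # The zero diagonal of W removes the j = i term from the sum.
        p_on = sigmoid(W[i] @ x + b[i])
        x[i] = 1 if rng.random() < p_on else 0
    return x

def gibbs_samples(W, b, n_samples, burn_in=100, seed=0):
    """Run one chain, discard the first burn_in sweeps, and keep one sample per sweep."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=len(b))  # random 0/1 starting configuration
    samples = []
    for t in range(burn_in + n_samples):
        x = gibbs_sweep(x, W, b, rng)
        if t >= burn_in:
            samples.append(x)
    return np.array(samples)

# Two uncoupled nodes with strong biases: the first is almost always on, the second off.
demo = gibbs_samples(np.zeros((2, 2)), np.array([50.0, -50.0]), n_samples=20)
```

Consecutive sweeps are correlated, so in practice one may want to keep only every few sweeps or run several chains; the sketch keeps every sweep for simplicity.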

import numpy as np