# L7a: Introduction to Boltzmann Machines
In this module, we will explore Boltzmann Machines, a type of stochastic recurrent neural network. Boltzmann machines come in two forms: full Boltzmann machines and restricted Boltzmann machines. The latter is a simplified version that is easier to train and is often used in practice. 

> __Learning Objectives:__
>
> By the end of this module, you should be able to:
> * __Define a Boltzmann machine:__ Describe the structure of a Boltzmann machine as a fully connected graph of binary units with symmetric weights and biases, and explain how units update stochastically via a logistic function of their energy contribution.
> * __Implement Gibbs sampling:__ Apply the Gibbs sampling algorithm to generate samples from a Boltzmann machine by iteratively updating node states according to conditional probabilities derived from the energy function.
> * __Explain the stationary distribution:__ Describe how a Boltzmann machine converges to a Boltzmann distribution over state configurations and identify methods for checking convergence to the stationary distribution.

Let's get started!

___

<div>
    <center>
      <img
        src="figs/Fig-Boltzmann-Machine-Schematic.svg"
        alt="Schematic of a Boltzmann machine"
        height="200"
        width="600"
      />
    </center>
  </div>

## Boltzmann Machines
Formally, [a Boltzmann Machine](https://en.wikipedia.org/wiki/Boltzmann_machine) $\mathcal{B}$ is a fully connected _undirected weighted graph_ defined by the tuple $\mathcal{B} = \left(\mathcal{V},\mathcal{E}, \mathbf{W},\mathbf{b}, \mathbf{s}\right)$.
* __Units__: Each unit (vertex, node, neuron) $v_{i}\in\mathcal{V}$ has a binary state (`on` or `off`) and a bias value $b_{i}\in\mathbb{R}$. The bias vector $\mathbf{b}\in\mathbb{R}^{|\mathcal{V}|}$ is the vector of bias values for all nodes in the network.
    - A machine $\mathcal{B}$ may have _visible_ nodes (the state is visible) and _hidden_ nodes (the state is not visible to us). The visible nodes represent the state of the data, while the hidden nodes capture the underlying structure of the data (latent variables).
    - The set of all nodes in the machine $\mathcal{B}$ is denoted by $\mathcal{V} \equiv \left\{v_{1},v_{2},\ldots,v_{|\mathcal{V}|}\right\}$, where $|\mathcal{V}|$ is the number of nodes in the network. We can partition the set of nodes into visible nodes $\mathcal{V}_{\text{vis}}$ and hidden nodes $\mathcal{V}_{\text{hid}}$, such that $\mathcal{V} = \mathcal{V}_{\text{vis}} \cup \mathcal{V}_{\text{hid}}$ and $\mathcal{V}_{\text{vis}} \cap \mathcal{V}_{\text{hid}} = \emptyset$.
* __Edges__: Each edge $e\in\mathcal{E}$ has a weight. The weight of the edge connecting $v_{i}\in\mathcal{V}$ and $v_{j}\in\mathcal{V}$ is denoted by $w_{ij}\in\mathbb{R}$ and determines the strength of the connection between nodes $i$ and $j$. The weight matrix $\mathbf{W}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ is symmetric with a zero diagonal, i.e., $w_{ij} = w_{ji}$ and $w_{ii} = 0$ (no self loops).
* __States__: The state of the machine $\mathcal{B}$ is represented by a binary vector $\mathbf{s}\in\{-1,1\}^{|\mathcal{V}|}$, where $s_{i}\in\{-1,1\}$ is the state of node $v_{i}$. When $s_{i} = 1$, the node is `on`, and when $s_{i} = -1$, the node is `off`. The set of all possible _state configurations_ is denoted by $\mathcal{S} \equiv \left\{\mathbf{s}^{(1)},\mathbf{s}^{(2)},\ldots,\mathbf{s}^{(N)}\right\}$, where $N = 2^{|\mathcal{V}|}$ is the number of possible state configurations for binary units.
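
To make the definition concrete, here is a minimal Python sketch of the tuple $\left(\mathbf{W},\mathbf{b},\mathbf{s}\right)$ for a small machine. The function name `make_boltzmann_machine`, the `seed` argument, and the random Gaussian parameter values are assumptions made for illustration only; they are not part of the module.

```python
import numpy as np

def make_boltzmann_machine(n_units, seed=0):
    """Return (W, b, s) for a small random machine (illustrative values only)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_units, n_units))
    W = (A + A.T) / 2.0                     # enforce symmetry: w_ij = w_ji
    np.fill_diagonal(W, 0.0)                # no self loops: w_ii = 0
    b = rng.normal(size=n_units)            # one real-valued bias per unit
    s = rng.choice([-1, 1], size=n_units)   # binary state vector in {-1, +1}
    return W, b, s

W, b, s = make_boltzmann_machine(4)
num_configurations = 2 ** len(s)            # N = 2^|V| possible state vectors
print(W.shape, b.shape, s, num_configurations)
```

Building $\mathbf{W}$ as the average of a matrix and its transpose is one simple way to guarantee symmetry; any symmetric matrix with a zero diagonal would satisfy the definition.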

Now that we have formally defined the machine $\mathcal{B}$, how can we use it, i.e., how can we draw samples from it?

### Dynamics
Suppose we let the state of the Boltzmann Machine $\mathcal{B}$ evolve over $t=1,2,\dots, T$ turns, where the state of each node at turn $t$ is binary, $s_{i}^{(t)} \in \{-1, 1\}$. During each turn, every node can update its state based on the states of the other nodes it is connected to, the weights of its connections, and its bias term. The total input to node $v_{i}$ at turn $t$, denoted $h_{i}^{(t)}$, is given by:
$$
h_{i}^{(t)} = \sum_{j\in\mathcal{V}} w_{ij}s_{j}^{(t-1)} + b_{i}\quad\forall i\in\mathcal{V}
$$
where $w_{ij}$ is the weight of the edge connecting $v_{i}$ and $v_{j}$, and $s_{j}^{(t-1)}$ is the state of node $v_{j}$ at turn $t-1$. However, unlike [classical Hopfield networks](https://en.wikipedia.org/wiki/Hopfield_network), where the update is deterministic, in a Boltzmann Machine, the state of each node is updated stochastically. The probability that node $v_{i}$ is `on` at turn $t$ is given by the logistic function:
$$
\begin{align*}
P(s_{i}^{(t)} = 1 \mid {s}_{\lnot{i}}) & = \frac{\exp\left(-\beta\,E(s_{i} = 1 \mid {s}_{\lnot{i}})\right)}{\exp\left(-\beta\,E(s_{i} = 1 \mid {s}_{\lnot{i}})\right) + \exp\left(-\beta\,E(s_{i} = -1 \mid {s}_{\lnot{i}})\right)} \\
\end{align*}
$$
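We can simplify this expression by substituting the energy of node $v_{i}$ in state $s_{i}^{(t)}$, given the states of the other nodes:
$$
\begin{align*}
E(s_{i}^{(t)} \mid {s}_{\lnot{i}}) & = -s_{i}^{(t)}h_{i}^{(t)} \\
\end{align*}
$$
Thus, $E(s_{i} = 1 \mid {s}_{\lnot{i}}) = -h_{i}^{(t)}$ and $E(s_{i} = -1 \mid {s}_{\lnot{i}}) = h_{i}^{(t)}$, which gives: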
$$
\begin{align*}
P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}}) & = \frac{\exp\left(\beta\,h_{i}^{(t)}\right)}{\exp\left(\beta\,h_{i}^{(t)}\right) + \exp\left(-\beta\,h_{i}^{(t)}\right)} \\
& = \frac{1}{1 + \exp(-2\beta\,h_{i}^{(t)})}\quad\blacksquare
\end{align*}
$$
where $P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}})$ is the probability that node $v_{i}$ is `on` at turn $t$ given the states of all other nodes. The probability that node $v_{i}$ is `off` at turn $t$ is $P(s_{i}^{(t)} = -1 \mid s_{\lnot{i}}) = 1 - P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}})$, i.e., one minus the probability that the node is `on`.
* _What is $\beta$?_ The parameter $\beta$ is the (inverse) temperature parameter that controls the amount of randomness in the system. As $\beta\rightarrow\infty$, the Boltzmann Machine becomes more deterministic; however, as $\beta\rightarrow{0}$, the Boltzmann Machine becomes more random.
* _What is $s_{\lnot{i}}$?_ This notation refers to the state of all nodes in the network _except_ for node $v_{i}$. In other words, it is the system's state excluding the state of node $v_{i}$. Notice we have not included a superscript $t$ on $s_{\lnot{i}}$. However, from the perspective of node $v_{i}$, the state of the system is always the state of the system at the previous turn, i.e., $s_{\lnot{i}} = \left\{s_{j}^{(t-1)}\right\}_{j\in\mathcal{V}\setminus\{i\}}$.
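
The update rule can be sketched in a few lines of Python. The helper names `total_input` and `prob_on`, and the small three-node example values, are hypothetical choices for illustration. The loop shows how $\beta$ moves the `on` probability between a coin flip (small $\beta$) and a nearly deterministic choice (large $\beta$).

```python
import numpy as np

def total_input(W, b, s, i):
    """h_i = sum_j w_ij s_j + b_i (w_ii = 0, so s_i itself does not contribute)."""
    return float(W[i] @ s + b[i])

def prob_on(h, beta):
    """P(s_i = +1 | other nodes) = 1 / (1 + exp(-2 * beta * h))."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * h))

# Illustrative three-node machine (values chosen arbitrarily).
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.25],
              [-0.5, 0.25, 0.0]])
b = np.array([0.1, -0.2, 0.0])
s = np.array([1, -1, 1])

h0 = total_input(W, b, s, 0)   # 1.0*(-1) + (-0.5)*(1) + 0.1
for beta in (0.01, 1.0, 100.0):
    print(f"beta = {beta:6.2f}  P(on) = {prob_on(h0, beta):.4f}")
```

Because the total input to node 0 is negative in this example, the `on` probability falls from about one half toward zero as $\beta$ grows.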

### Sampling a Boltzmann Machine (Gibbs Sampling)
To generate samples from a Boltzmann Machine, let us consider the following algorithm: 

__Initialize__ the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the Boltzmann Machine. Provide an initial state $\mathbf{s}^{(0)}$ of the network, a system (inverse) temperature $\beta$, and the number of turns $T$ to run the sampling algorithm.

For each turn $t=1,2,\dots,T$:
1. For each node $v_{i}\in\mathcal{V}$:
    - Compute the total input $h_{i}^{(t)}$ to node $v_{i}$ using the expression: $h_{i}^{(t)} = \sum_{j\in\mathcal{V}} w_{ij}s_{j}^{(t-1)} + b_{i}$.
    - Compute the probability of the _next_ state $s_{i}^{(t)} = 1$ using the logistic function $P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}}) = \left(1+\exp(-2\beta{h}_{i}^{(t)})\right)^{-1}$ for node $v_{i}$. The probability of $s_{i}^{(t)} = -1$ is given by $P(s_{i}^{(t)} = -1| s_{\lnot{i}}) = 1 - P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}})$.
    - Sample the _next_ state of node $v_{i}$ from a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) with parameter $p = P(s_{i}^{(t)} = 1 \mid s_{\lnot{i}})$.
2. Store the state vector $\mathbf{s}^{(t)}$ of the network at turn $t$, and proceed to the next turn.
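
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `gibbs_sample` and the `seed` argument are hypothetical. One assumption deserves emphasis: the sketch updates the state vector in place, so nodes later in a sweep see the already updated values of earlier nodes, which is the usual sequential form of Gibbs sampling. Reading every input from a frozen copy of $\mathbf{s}^{(t-1)}$ would instead give a synchronous update.

```python
import numpy as np

def gibbs_sample(W, b, s0, beta, T, seed=0):
    """Run T sweeps; return an array of shape (T, n) holding s^(1), ..., s^(T)."""
    rng = np.random.default_rng(seed)
    s = np.array(s0, dtype=float)           # working copy of the state vector
    n = len(s)
    samples = np.empty((T, n))
    for t in range(T):
        for i in range(n):
            h = W[i] @ s + b[i]                           # total input to node i
            p_on = 1.0 / (1.0 + np.exp(-2.0 * beta * h))  # logistic probability
            s[i] = 1.0 if rng.random() < p_on else -1.0   # Bernoulli draw
        samples[t] = s                      # store the state after the full sweep
    return samples

# Usage: no couplings, strong positive biases, low temperature (large beta).
W = np.zeros((3, 3))
b = np.array([5.0, 5.0, 5.0])
samples = gibbs_sample(W, b, s0=[-1, -1, -1], beta=10.0, T=5)
print(samples)
```

In this usage example the biases dominate and $\beta$ is large, so every node is driven to `on` regardless of the initial state.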

___

### Stationary Distribution
The Boltzmann Machine $\mathcal{B}$ is a stochastic (random) process that evolves over time.
For simplicity, let's assume all the nodes are visible nodes, i.e., $\mathcal{V} = \mathcal{V}_{\text{vis}}$ and $\mathcal{V}_{\text{hid}} = \emptyset$. The state of the network at each turn is a random variable, and the network can be in one of many possible configurations (state vectors) $\mathbf{s}\in\mathcal{S}$.

After a _sufficiently large_ number of turns $T$, the distribution of the network configurations (state vectors) $\mathbf{s}^{(1)},\mathbf{s}^{(2)},\dots$ visited by the Boltzmann Machine converges to a _stationary distribution_ over the state configurations $\mathbf{s}\in\mathcal{S}$, which can be modeled as [a Boltzmann distribution](https://en.wikipedia.org/wiki/Boltzmann_distribution) of the form:
$$
P(\mathbf{s}) = \frac{1}{Z(\mathcal{S},\beta)}\exp\left(-\beta\cdot{E(\mathbf{s})}\right)
$$
where $E(\mathbf{s})$ is the energy of state $\mathbf{s}$, $\beta$ is the (inverse) temperature of the system, and $Z(\mathcal{S},\beta)$ is the partition function. The energy of configuration $\mathbf{s}\in\mathcal{S}$ is given by:
$$
E(\mathbf{s}) = -\sum_{i\in\mathcal{V}} b_{i}s_{i} - \frac{1}{2}\sum_{i,j\in\mathcal{V}} w_{ij}s_{i}s_{j}
$$
where the first term is the energy associated with the bias terms, and the second term is the energy associated with the weights of the connections. The partition function $Z(\mathcal{S},\beta)$ is given by:
$$
Z(\mathcal{S},\beta) = \sum_{\mathbf{s}^{\prime}\in\mathcal{S}}\exp\left({-\beta\cdot{E}(\mathbf{s}^{\prime})}\right)
$$
where $\mathcal{S}$ is the set of _all possible network configurations_ of the Boltzmann Machine. 
* __Hmmm...__? The partition function $Z(\mathcal{S},\beta)$ is a normalizing constant ensuring the probabilities sum to `1`. However, for even a moderately sized system, the partition function is intractable to compute because the number of configurations grows exponentially with the number of nodes. For example, for a $28\times{28}$ image (784 nodes, all visible), the number of possible configurations (states) of the Boltzmann Machine is $2^{784}\approx 10^{236}$.
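
For a machine with only a handful of nodes, the energy, the partition function, and the exact Boltzmann probabilities can be computed by brute-force enumeration. The sketch below does this for three nodes; the function names `energy` and `exact_boltzmann` and the parameter values are assumptions for illustration. The enumeration visits all $2^{n}$ states, which is exactly why this approach stops being feasible beyond a few dozen nodes.

```python
import itertools
import numpy as np

def energy(W, b, s):
    """E(s) = -sum_i b_i s_i - (1/2) sum_ij w_ij s_i s_j."""
    return float(-(b @ s) - 0.5 * (s @ W @ s))

def exact_boltzmann(W, b, beta):
    """Enumerate all 2^n states; return (states, probabilities, Z)."""
    n = len(b)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    weights = np.exp([-beta * energy(W, b, s) for s in states])  # Boltzmann factors
    Z = weights.sum()                                            # partition function
    return states, weights / Z, Z

# Illustrative three-node machine (values chosen arbitrarily).
W = np.array([[0.0, 0.8, -0.4],
              [0.8, 0.0, 0.3],
              [-0.4, 0.3, 0.0]])
b = np.array([0.2, -0.1, 0.0])
states, probs, Z = exact_boltzmann(W, b, beta=1.0)
most_likely = states[np.argmax(probs)]   # the lowest-energy configuration
print(len(states), Z, most_likely)
```

The most probable configuration is the one with the lowest energy, since $P(\mathbf{s})$ decreases monotonically in $E(\mathbf{s})$ for $\beta > 0$.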

We can't directly compute the partition function, but we can sample from the Boltzmann machine using the Gibbs sampling algorithm described above. The samples generated by the Gibbs sampling algorithm will converge to the stationary distribution of the Boltzmann Machine as the number of turns $T\rightarrow\infty$.
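
On a machine small enough to enumerate, this convergence claim can be checked numerically. The sketch below compares the empirical frequencies of configurations visited by sequential Gibbs sweeps against the exact Boltzmann probabilities. All parameter values, the burn-in length `burn_in`, and the use of total variation distance as the comparison metric are assumptions made for this illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
W = np.array([[0.0, 0.8, -0.4],
              [0.8, 0.0, 0.3],
              [-0.4, 0.3, 0.0]])
b = np.array([0.2, -0.1, 0.0])
beta, T, burn_in = 1.0, 20000, 1000

# Exact Boltzmann probabilities by enumeration (feasible only for tiny machines).
states = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
E = np.array([-(b @ s) - 0.5 * (s @ W @ s) for s in states])
p_exact = np.exp(-beta * E) / np.exp(-beta * E).sum()

# Sequential Gibbs sweeps, counting how often each configuration is visited.
s = np.ones(3)
counts = np.zeros(len(states))
for t in range(T):
    for i in range(3):
        h = W[i] @ s + b[i]
        p_on = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
        s[i] = 1.0 if rng.random() < p_on else -1.0
    if t >= burn_in:
        # Row index of s in `states`: read (-1 -> 0, +1 -> 1) as binary digits.
        idx = int(((s + 1.0) / 2.0) @ np.array([4, 2, 1]))
        counts[idx] += 1

p_empirical = counts / counts.sum()
tv_distance = 0.5 * np.abs(p_empirical - p_exact).sum()
print(np.round(p_exact, 3), np.round(p_empirical, 3), tv_distance)
```

A small total variation distance suggests the sampler has settled near the stationary distribution; the exact value depends on the random seed and the number of sweeps.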

##### Check: Convergence to Stationary Distribution?
A practical question is how many turns we need to run the sampling algorithm before the network reaches its stationary distribution. There are several ways to check this; the straightforward approach we will take is to monitor the _energy_ of the network over time.

We'll use the following algorithm to check for convergence to a stationary distribution:

__Initialize__: Let $E_{t} = E(\mathbf{s}^{(t)})$ be the energy of the network at turn $t$, where $\mathbf{s}^{(t)}$ is the state vector. Define a window size $W$ (even), a threshold $\epsilon$, a stable window counter $M$ (initially $M = 0$), and the minimum number of stable windows $M_{\text{min}}$ required to declare that the system has converged.

1. For $t\geq{W}$, store the energy values $E_{t-W+1}, \ldots, E_{t-1}, E_{t}$. 
2. Split the energy values into two non-overlapping windows, and compute the average energy for each window: $a_{1} = \frac{2}{W}\sum_{k=t-W+1}^{t-W/2}E_{k}$ and $a_{2} = \frac{2}{W}\sum_{k=t-W/2+1}^{t}E_{k}$.
3. If $|a_{1} - a_{2}| < \epsilon$, then increase the count of stable windows $M \gets M + 1$.
4. If $M \geq M_{\text{min}}$, then the network is (approximately) in a stationary distribution; otherwise, continue collecting samples.
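
The convergence check above can be sketched in Python as follows. The function name `check_convergence` and the synthetic energy trace are hypothetical. The sketch follows the steps as written, so the counter $M$ only increases and is never reset when a window pair is unstable; whether to reset it is a design choice the steps leave open. Indices are zero-based here, unlike the one-based turns in the text.

```python
import numpy as np

def check_convergence(energies, window, eps, m_min):
    """Return the first index t at which M >= m_min, or None if never reached."""
    if window % 2 != 0:
        raise ValueError("window size W must be even")
    half = window // 2
    stable = 0                                    # counter M, starts at zero
    for t in range(window - 1, len(energies)):
        recent = energies[t - window + 1 : t + 1]  # E_{t-W+1}, ..., E_t
        a1 = np.mean(recent[:half])               # average over the older half
        a2 = np.mean(recent[half:])               # average over the newer half
        if abs(a1 - a2) < eps:
            stable += 1                           # M <- M + 1
        if stable >= m_min:
            return t
    return None

# Synthetic trace: energy decays linearly for 50 turns, then stays flat.
trace = np.concatenate([np.linspace(10.0, 0.0, 50), np.zeros(50)])
t_conv = check_convergence(trace, window=10, eps=0.1, m_min=5)
print(t_conv)
```

On this synthetic trace the check fires only after the window has moved into the flat region. A trace that keeps drifting, such as a steady linear decrease, never accumulates stable windows and the function returns `None`.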

___

## Summary
This module introduced Boltzmann machines as stochastic recurrent neural networks that define probability distributions over binary state configurations.

> __Key Takeaways:__
>
> * **Boltzmann machine structure:** A Boltzmann machine is a fully connected undirected graph where each node has a binary state and bias, and edges have symmetric weights. The network defines a joint probability distribution over all possible state configurations.
> * **Gibbs sampling for state generation:** Because the partition function is intractable to compute directly, we use Gibbs sampling to generate states from the network. Each node updates stochastically based on the states of its neighbors and the logistic function of its total input.
> * **Convergence to stationary distribution:** After sufficient iterations, the sampled states converge to the Boltzmann distribution. Convergence can be monitored by tracking energy values across sliding windows.

The contrastive divergence algorithm enables training of Boltzmann machines by approximating likelihood gradients without computing the partition function.

___