# Training the Boltzmann Machine
Boltzmann machines are trained [using the _contrastive divergence_ algorithm](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-10/L10c/docs/CD-cdmiguel-hintonpdf.pdf). The goal of training is to learn the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network so that the network approximates the probability distribution of the input data. We do this by _maximizing the log-likelihood_ of the data given the parameters, using gradient ascent.

> __Contrastive Divergence (CD)__ 
> 
> Contrastive divergence was developed by Geoffrey Hinton and colleagues at the University of Toronto, with its introduction widely recognized [in a 2002 paper describing the product of experts training problem](https://direct.mit.edu/neco/article-abstract/14/8/1771/6687/Training-Products-of-Experts-by-Minimizing?redirectedFrom=fulltext). The contrastive divergence algorithm _efficiently_ approximates the gradient of the log-likelihood expression _without_ computing the intractable partition function. It achieves this by running only a small, fixed number $k$ of sampling steps to approximate the terms in the gradient expression, then updating the model parameters based on these approximations.

The contrastive divergence algorithm even has some [theoretical guarantees regarding convergence](https://arxiv.org/abs/1603.05729)! Check out [this tutorial on the CD algorithm from Hinton](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-10/L10c/docs/Hinton-PracticalGuide-CD-2010.pdf).

Suppose we have a collection of patterns $\mathbf{X} = \left\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\ldots,\mathbf{x}^{(m)}\right\}$, where $\mathbf{x}^{(i)}\in\left\{-1,+1\right\}^{|\mathcal{V}|}$ is a binary vector of size $|\mathcal{V}|$ and $m$ is the number of patterns. We want to learn the parameters of the Boltzmann Machine $\mathcal{B}$ such that the stationary distribution of the Boltzmann Machine matches the distribution of the training patterns $\mathbf{X}$.

__Goal__: The goal of training the Boltzmann Machine is to learn the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network such that the stationary distribution of the Boltzmann Machine matches the distribution of the training patterns in the dataset $\mathbf{X}$.

__Gradient ascent__: The learning algorithm for the Boltzmann Machine is based on gradient ascent. The idea is to adjust the weights and biases of the network in the direction of the gradient of the log-likelihood of the training patterns. This will maximize the likelihood of observing the training patterns given the weights and biases of the network.
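
To make the gradient concrete, here is a sketch of its form under an assumed energy convention that this section does not state: $E(\mathbf{s}) = -\frac{1}{2}\sum_{i,j}w_{ij}s_{i}s_{j} - \sum_{i}b_{i}s_{i}$ with symmetric weights, no self-connections, and $P(\mathbf{s}) = e^{-\beta E(\mathbf{s})}/Z$. Treating $w_{ij} = w_{ji}$ as a single parameter, the average log-likelihood $\mathcal{L} = \frac{1}{m}\sum_{k=1}^{m}\log P(\mathbf{x}^{(k)})$ has the gradient:

$$
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial w_{ij}} &= \beta\left(\frac{1}{m}\sum_{k=1}^{m}x_{i}^{(k)}x_{j}^{(k)} - \mathbb{E}_{P}\left[s_{i}s_{j}\right]\right) \\
\frac{\partial\mathcal{L}}{\partial b_{i}} &= \beta\left(\frac{1}{m}\sum_{k=1}^{m}x_{i}^{(k)} - \mathbb{E}_{P}\left[s_{i}\right]\right)
\end{aligned}
$$

The first term in each difference is an average over the data; the second is an expectation under the model distribution, which comes from $\partial\log Z$ and is the part that has to be estimated by sampling. Under a different energy convention the constant factors may change, but the "data minus model" structure stays the same.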

___

## Expectation Notation
Before we introduce the training algorithm, we need to define some expectation notation:

> __Pairwise correlation__ 
> 
> The average product of states of nodes $i$ and $j$ over all training patterns:
> $$
\langle x_i x_j \rangle_{\mathbf{X}} = \frac{1}{m} \sum_{k=1}^{m} x_i^{(k)} x_j^{(k)}
$$
> where $m$ is the number of training patterns, and $x_i^{(k)}$ is the state of node $i$ in pattern $k$. This measures the correlation between nodes in the data: positive values indicate nodes tend to have the same state, negative values indicate opposite states, and values near zero indicate no correlation.

We'll also need single node averages:

> __Single-node average__: 
>
> The average state of node $i$ over all training patterns:
> $$
\langle x_i \rangle_{\mathbf{X}} = \frac{1}{m} \sum_{k=1}^{m} x_i^{(k)}
$$
> This measures the bias of node $i$ in the data: positive values indicate the node is typically "on" (+1), negative values indicate typically "off" (-1). The same averages can be computed over samples $\mathbf{S}$ generated by the Boltzmann machine, written $\langle s_i s_j \rangle_{\mathbf{S}}$ and $\langle s_i \rangle_{\mathbf{S}}$. These represent what the model "thinks" the correlations and biases should be.

The goal of training is to adjust weights and biases so that the model expectations match the data expectations.
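
The two averages defined above can be sketched in a few lines of plain Python. The function names `pairwise_correlation` and `single_node_average` and the small dataset `X` are illustrative assumptions, not part of the original notes; patterns are assumed to be lists of $\pm 1$ values.

```python
def pairwise_correlation(patterns):
    """Return the matrix C with C[i][j] = <x_i x_j>, averaged over all patterns."""
    m = len(patterns)        # number of patterns
    n = len(patterns[0])     # number of nodes
    return [[sum(p[i] * p[j] for p in patterns) / m for j in range(n)]
            for i in range(n)]

def single_node_average(patterns):
    """Return the vector mu with mu[i] = <x_i>, averaged over all patterns."""
    m = len(patterns)
    n = len(patterns[0])
    return [sum(p[i] for p in patterns) / m for i in range(n)]

# Hypothetical dataset: four patterns over three nodes with states in {-1, +1}.
X = [[1, 1, -1],
     [1, 1, 1],
     [-1, -1, 1],
     [1, 1, -1]]

C = pairwise_correlation(X)
mu = single_node_average(X)
```

In this toy dataset nodes 1 and 2 always agree, so their pairwise correlation is $+1$, while nodes 1 and 3 mostly disagree, giving a negative value. The same two functions apply unchanged to model samples $\mathbf{S}$.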

## Training Algorithm
The training algorithm for the Boltzmann Machine maximizes the log-likelihood of observing the training patterns $\mathbf{x}^{(i)}\in\mathbf{X}$ given the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network. The algorithm proceeds as follows:

__Initialize__: the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network to some initial guess, e.g., using the Hopfield network Hebbian learning rule. Set the learning rate $\eta$, inverse temperature $\beta = 1$, and number of turns $T$. Precompute the data-dependent expectations $\langle{x_{i}x_{j}}\rangle_{\mathbf{X}}$ and $\langle{x_{i}}\rangle_{\mathbf{X}}$ using every training pattern $\mathbf{x}^{(i)}\in\mathbf{X}$.

1. Simulate the Boltzmann Machine $\mathcal{B}$ until it becomes stationary (or for a fixed number of turns $T$). Then, generate a set of stationary samples $\mathbf{S} = \left\{\mathbf{s}^{(1)},\mathbf{s}^{(2)},\ldots,\mathbf{s}^{(m)}\right\}$.
2. Compute the model-dependent expectations $\langle{s_{i}s_{j}}\rangle_{\mathbf{S}}$ and $\langle{s_{i}}\rangle_{\mathbf{S}}$ using the stationary samples $\mathbf{s}^{(i)}\in\mathbf{S}$.
3. Update the weights of the network using the following update rule: $w_{ij}^{\prime} = w_{ij} + \Delta{w_{ij}}$ where $\Delta{w_{ij}} = \eta\left(\langle{x_{i}x_{j}}\rangle_{\mathbf{X}} - \langle{s_{i}s_{j}}\rangle_{\mathbf{S}}\right)$. The hyperparameter $\eta$ is the learning rate, $\langle{x_{i}x_{j}}\rangle_{\mathbf{X}}$ is the data-dependent expectation, and $\langle{s_{i}s_{j}}\rangle_{\mathbf{S}}$ is the model-dependent expectation. The update rule is applied to all weights in the network, i.e., $\forall i,j\in\mathcal{V}$.
4. Update the biases of the network using the following update rule: $b_{i}^{\prime} = b_{i} + \Delta{b_{i}}$ where $\Delta{b_{i}} = \eta\left(\langle{x_{i}}\rangle_{\mathbf{X}} - \langle{s_{i}}\rangle_{\mathbf{S}}\right)$. Here $\eta$ is the same learning rate as in step 3, $\langle{x_{i}}\rangle_{\mathbf{X}}$ is the data-dependent expectation, and $\langle{s_{i}}\rangle_{\mathbf{S}}$ is the model-dependent expectation. The update rule is applied to all biases in the network, i.e., $\forall i\in\mathcal{V}$.
5. Repeat steps 1-4 until convergence (or for a fixed number of iterations).
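
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the sampler is a sequential Gibbs sweep with $P(s_i = +1 \mid \text{rest}) = \frac{1}{2}\left(1 + \tanh(\beta h_i)\right)$ where $h_i = b_i + \sum_{j\neq i} w_{ij}s_j$ (a common convention for $\pm 1$ states that this section does not spell out), the weights start at zero rather than from a Hebbian guess, and all function names, default values, and the toy dataset are hypothetical.

```python
import math
import random

def gibbs_sweep(s, W, b, beta, rng):
    """Resample every node once, in place, from its conditional distribution."""
    n = len(s)
    for i in range(n):
        h = b[i] + sum(W[i][j] * s[j] for j in range(n) if j != i)
        p_on = 0.5 * (1.0 + math.tanh(beta * h))  # assumed conditional for +/-1 states
        s[i] = 1 if rng.random() < p_on else -1

def sample_model(W, b, beta, turns, num_samples, rng):
    """Step 1: run `turns` burn-in sweeps, then record one state per sweep."""
    s = [rng.choice((-1, 1)) for _ in range(len(b))]
    for _ in range(turns):
        gibbs_sweep(s, W, b, beta, rng)
    samples = []
    for _ in range(num_samples):
        gibbs_sweep(s, W, b, beta, rng)
        samples.append(list(s))
    return samples

def mean_products(V):
    """Matrix of pairwise averages <v_i v_j> over the rows of V."""
    m, n = len(V), len(V[0])
    return [[sum(v[i] * v[j] for v in V) / m for j in range(n)] for i in range(n)]

def mean_states(V):
    """Vector of single-node averages <v_i> over the rows of V."""
    m, n = len(V), len(V[0])
    return [sum(v[i] for v in V) / m for i in range(n)]

def train(X, eta=0.05, beta=1.0, turns=20, num_samples=200, epochs=100, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    W = [[0.0] * n for _ in range(n)]   # zero initial guess (assumption)
    b = [0.0] * n
    xx, x = mean_products(X), mean_states(X)   # data expectations, computed once
    for _ in range(epochs):
        S = sample_model(W, b, beta, turns, num_samples, rng)   # step 1
        ss, s = mean_products(S), mean_states(S)                # step 2
        for i in range(n):
            for j in range(n):
                # Step 3. For +/-1 states the diagonal update is exactly zero.
                W[i][j] += eta * (xx[i][j] - ss[i][j])
            b[i] += eta * (x[i] - s[i])                         # step 4
    return W, b

# Hypothetical toy data: nodes 1 and 2 always agree, node 3 is unrelated.
X = [[1, 1, -1], [1, 1, 1], [-1, -1, 1], [-1, -1, -1]]
W, b = train(X, epochs=50)
```

Because the data and sample correlation matrices are symmetric, the learned $\mathbf{W}$ stays symmetric, and the weight between the two nodes that always agree grows positive. Using a small value of `turns` corresponds to the truncated-sampling idea behind contrastive divergence; a larger value moves closer to sampling from the stationary distribution at higher cost.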

___

### Convergence Criteria
The algorithm terminates when one of the following convergence criteria is satisfied:

> __Parameter change threshold__
>
> The algorithm has converged when the parameter updates become negligible:
> $$
\|\Delta\mathbf{W}\|_{F} < \epsilon \quad \text{and} \quad \|\Delta\mathbf{b}\|_{2} < \epsilon
$$
> where $\epsilon > 0$ is a small tolerance, $\|\cdot\|_{F}$ denotes the Frobenius norm, and $\|\cdot\|_{2}$ denotes the Euclidean norm. This criterion checks whether the learning process has stabilized.
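
A minimal sketch of this check, assuming the updates $\Delta\mathbf{W}$ and $\Delta\mathbf{b}$ are stored as a nested list and a flat list; the function names and the default tolerance are illustrative assumptions.

```python
import math

def frobenius_norm(M):
    """Square root of the sum of squared matrix entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def euclidean_norm(v):
    """Square root of the sum of squared vector entries."""
    return math.sqrt(sum(x * x for x in v))

def updates_negligible(dW, db, eps=1e-4):
    """True when both the weight and bias updates fall below the tolerance."""
    return frobenius_norm(dW) < eps and euclidean_norm(db) < eps

# Example: a tiny update passes the check, a larger one does not.
small = updates_negligible([[0.0, 1e-6], [1e-6, 0.0]], [0.0, 0.0])
large = updates_negligible([[0.0, 0.1], [0.1, 0.0]], [0.0, 0.0])
```

Since the model expectations are estimated from a finite set of samples, the update norms fluctuate from one iteration to the next, so in practice it may help to require the condition to hold over several consecutive iterations.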

Instead of monitoring parameter changes, we can directly measure how well the model has learned the data statistics:

> __Expectation matching__
>
> The algorithm has converged when the model expectations match the data expectations:
> $$
\max_{i,j}\left|\langle x_i x_j \rangle_{\mathbf{X}} - \langle s_i s_j \rangle_{\mathbf{S}}\right| < \epsilon \quad \text{and} \quad \max_{i}\left|\langle x_i \rangle_{\mathbf{X}} - \langle s_i \rangle_{\mathbf{S}}\right| < \epsilon
$$
> where $\epsilon > 0$ is a small tolerance. This criterion directly measures whether the model has learned to reproduce the statistics of the training data.
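
The expectation-matching test can be sketched as follows, assuming the four expectations have already been computed as nested lists and flat lists; the function name, the tolerance, and the example numbers are hypothetical.

```python
def expectations_match(xx, ss, x, s, eps=0.05):
    """Compare data and model expectations entry by entry against a tolerance."""
    n = len(x)
    # Largest absolute gap between pairwise correlations.
    max_pair_gap = max(abs(xx[i][j] - ss[i][j]) for i in range(n) for j in range(n))
    # Largest absolute gap between single-node averages.
    max_node_gap = max(abs(x[i] - s[i]) for i in range(n))
    return max_pair_gap < eps and max_node_gap < eps

# Hypothetical data expectations and two sets of model expectations.
xx = [[1.0, 0.5], [0.5, 1.0]]
x = [0.0, 0.2]
close = expectations_match(xx, [[1.0, 0.48], [0.48, 1.0]], x, [0.01, 0.19])
far = expectations_match(xx, [[1.0, 0.30], [0.30, 1.0]], x, [0.01, 0.19])
```

The tolerance should be chosen with the sampling noise in mind: with a finite number of samples the model expectations carry statistical error, so a very small $\epsilon$ may never be satisfied even when the model fits well.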

In practice, computing these criteria can be expensive. A simpler approach is to use a fixed computational budget:

> __Fixed iterations__
>
> Alternatively, we can run the algorithm for a fixed number of epochs $N_{\text{epochs}}$ regardless of parameter changes. An epoch is one complete execution of steps 1-4, i.e., one full update of all weights and biases. This approach is often used in practice when computational budget is limited or when the other criteria are difficult to evaluate reliably.

___

## Summary
This notebook introduced the contrastive divergence algorithm for training Boltzmann machines by maximizing the log-likelihood of training data.

> __Key Takeaways__
>
> * **Contrastive divergence:** The CD algorithm approximates the gradient of the log-likelihood without computing the intractable partition function by using a limited number of sampling steps.
> * **Expectation matching:** Training adjusts weights and biases to minimize the difference between data expectations $\langle x_i x_j \rangle_{\mathbf{X}}$ and model expectations $\langle s_i s_j \rangle_{\mathbf{S}}$.
> * **Convergence criteria:** The algorithm terminates when parameter updates become negligible, when model expectations match data expectations, or after a fixed number of iterations.

___