# L10a: Introduction to Boltzmann Machines
In this lecture, we will introduce Boltzmann Machines, a type of stochastic neural network that can learn to represent complex distributions over binary vectors. The key concepts of this lecture are:

* __A Boltzmann Machine__ is a network of symmetrically connected, neuron-like units with binary state that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets that can be modeled as binary vectors. However, they are often slow to _converge_ and are difficult to scale to large datasets. Thus, they are not widely used in practice in the general form that we'll discuss here.

The notes for this lecture can be found here.

## Architecture
A [Boltzmann Machine](https://en.wikipedia.org/wiki/Boltzmann_machine) consists of a set of binary units (neurons, nodes, vertices, etc.) that are fully connected, potentially with self-connections.

* _Nodes in a Boltzmann machine_? Each node can be in one of two states: `on` or `off`. The units have bias terms and are connected to every other node in the network with non-negative weighted edges. The state of each unit is determined by the states of the other units and the weights of the connections. The state of a node is a random variable.

Formally, [a Boltzmann Machine](https://en.wikipedia.org/wiki/Boltzmann_machine) $\mathcal{B}$ is a fully connected _undirected weighted graph_ defined by the tuple $\mathcal{B} = \left(\mathcal{V},\mathcal{E}, \mathbf{W},\mathbf{b}, \mathbf{s}\right)$.
* __Units__: Each unit (vertex, node, neuron) $v_{i}\in\mathcal{V}$ has a binary state (`on` or `off`) and a bias value $b_{i}\in\mathbb{R}$. The bias vector $\mathbf{b}\in\mathbb{R}^{|\mathcal{V}|}$ collects the bias values of all nodes in the network.
* __Edges__: Each edge $e\in\mathcal{E}$ has a weight value. The weight of the edge connecting $v_{i}\in\mathcal{V}$ and $v_{j}\in\mathcal{V}$ is denoted by $w_{ij}\in\mathbf{W}$, where the weight matrix $\mathbf{W}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ is symmetric, i.e., $w_{ij} = w_{ji}$. The weights $w_{ij}\in\mathbb{R}_{\geq{0}}$ are non-negative, and they determine the strength of the connection between the two nodes.
* __States__: The state of the network is represented by a binary vector $\mathbf{s}\in\{-1,1\}^{|\mathcal{V}|}$, where $s_{i}\in\{-1,1\}$ is the state of node $v_{i}$. When $s_{i} = 1$, the node is `on`, and when $s_{i} = -1$, the node is `off`. The set of all possible state _configurations_ is denoted by $\mathcal{S} \equiv \left\{\mathbf{s}^{(1)},\mathbf{s}^{(2)},\ldots,\mathbf{s}^{(N)}\right\}$, where $N = 2^{|\mathcal{V}|}$ is the number of possible configurations for binary units.
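
To make the tuple $\left(\mathcal{V},\mathcal{E},\mathbf{W},\mathbf{b},\mathbf{s}\right)$ concrete, here is a minimal Python sketch that builds a small network and enumerates its configurations. The function names `make_boltzmann_machine` and `all_configurations`, the random ranges for the weights and biases, and the seed are illustrative assumptions, not part of the lecture.

```python
import itertools
import random

def make_boltzmann_machine(n_units, seed=0):
    """Build (W, b, s) for a fully connected network with n_units units.

    The sampling ranges below are arbitrary choices for illustration.
    """
    rng = random.Random(seed)
    # Symmetric, non-negative weights: fill the upper triangle (diagonal
    # included, since self-connections are allowed) and mirror it.
    W = [[0.0] * n_units for _ in range(n_units)]
    for i in range(n_units):
        for j in range(i, n_units):
            W[i][j] = W[j][i] = rng.uniform(0.0, 1.0)
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]  # real-valued biases
    s = [rng.choice((-1, 1)) for _ in range(n_units)]     # -1 is off, 1 is on
    return W, b, s

def all_configurations(n_units):
    """Enumerate the set S of all 2**n_units state vectors."""
    return [list(c) for c in itertools.product((-1, 1), repeat=n_units)]

W, b, s = make_boltzmann_machine(3)
S = all_configurations(3)
print(len(S))  # 8, i.e. 2**3 configurations for three binary units
```

Because $N$ doubles with every added unit, explicit enumeration like this is only practical for very small networks.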

## Stochastic Dynamics
Suppose we let the state of the Boltzmann Machine $\mathcal{B}$ evolve over $t=1,2,\dots,T$ turns. During each turn, each node can update its state based on the states of the other nodes, the weights of its connections and its bias term. The total input to node $v_{i}$ at time $t$ is given by:
$$
h_{i}^{(t)} = \sum_{j\in\mathcal{V}} w_{ij}s_{j}^{(t-1)} + b_{i}\quad\forall i\in\mathcal{V}
$$
where $w_{ij}$ is the weight of the edge connecting $v_{i}$ and $v_{j}$, and $s_{j}^{(t-1)}$ is the state of node $v_{j}$ at turn $t-1$. Unlike [classical Hopfield networks](https://en.wikipedia.org/wiki/Hopfield_network), where the update is deterministic, a Boltzmann Machine updates the state of each node stochastically. The probability that node $v_{i}$ is `on` at turn $t$ is given by the logistic function:
$$
P(s_{i}^{(t)} = 1|h_{i}^{(t)}) = \frac{1}{1+\exp(-\beta\cdot{h}_{i}^{(t)})}
$$
The probability that node $v_{i}$ is `off` at turn $t$ is the complement, $P(s_{i}^{(t)} = -1|h_{i}^{(t)}) = 1 - P(s_{i}^{(t)} = 1|h_{i}^{(t)})$.
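* _What is β_? The parameter $\beta$ is the (inverse) temperature parameter that controls the amount of randomness in the system. As $\beta\rightarrow\infty$, the Boltzmann Machine becomes more deterministic: a node turns `on` whenever its total input is positive and `off` whenever it is negative. As $\beta\rightarrow{0}$, the Boltzmann Machine becomes more random: each node is `on` with probability approaching $1/2$, regardless of its input.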
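
The update rule above can be sketched in a few lines of Python. This is a minimal illustration that follows the equations as written in this section, so every node samples its new state from the states of the previous turn. The helper names `total_input`, `p_on`, and `step`, along with the example weights, biases, and seed, are assumptions made for this sketch.

```python
import math
import random

def total_input(W, b, s_prev, i):
    """h_i = sum_j w_ij * s_j + b_i, using the states from the previous turn."""
    return sum(W[i][j] * s_prev[j] for j in range(len(s_prev))) + b[i]

def p_on(h, beta):
    """Logistic probability 1 / (1 + exp(-beta * h)), written to avoid overflow."""
    x = beta * h
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def step(W, b, s_prev, beta, rng):
    """One turn: every node samples a new state given the previous state vector."""
    s_next = []
    for i in range(len(s_prev)):
        h = total_input(W, b, s_prev, i)
        s_next.append(1 if rng.random() < p_on(h, beta) else -1)
    return s_next

# Hypothetical two-node network, used only for illustration.
W = [[0.0, 1.0], [1.0, 0.0]]
b = [0.5, -0.5]
rng = random.Random(42)
s = [1, -1]
for _ in range(5):
    s = step(W, b, s, beta=1.0, rng=rng)
print(s)
```

Calling `p_on(h, 0.0)` returns `0.5` for any input `h`, while a large `beta` pushes the result toward `0` or `1` depending on the sign of `h`, which matches the two limits described above.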


### Stationary Distribution
After $T\gg{1}$ turns, the distribution of the state of the Boltzmann Machine converges to a stationary distribution over the configurations $\mathbf{s}\in\mathcal{S}$ that is governed by [a Boltzmann distribution](https://en.wikipedia.org/wiki/Boltzmann_distribution) of the form:
$$
P(\mathbf{s}) = \frac{1}{Z(\mathcal{S},\beta)}\exp\left(-\beta\cdot{E(\mathbf{s})}\right)
$$
where $E(\mathbf{s})$ is the energy of the state $\mathbf{s}$, $\beta$ is the (inverse) temperature of the system, and $Z(\mathcal{S},\beta)$ is the partition function, which is a normalizing constant that ensures that the probabilities sum to `1`:
$$
Z(\mathcal{S},\beta) = \sum_{\mathbf{s}^{\prime}\in\mathcal{S}}\exp\left({-\beta\cdot{E}(\mathbf{s}^{\prime})}\right)
$$
where $\mathcal{S}$ is the set of all possible configurations of the Boltzmann Machine. 
The energy of the state (configuration) $\mathbf{s}\in\mathcal{S}$ is given by:
$$
E(\mathbf{s}) = -\sum_{i\in\mathcal{V}} b_{i}s_{i} - \frac{1}{2}\sum_{i,j\in\mathcal{V}} w_{ij}s_{i}s_{j}
$$
where the first term is the energy associated with the bias terms and the second term is the energy associated with the weights of the connections. The inverse temperature $\beta$ also shapes this distribution: as $\beta\rightarrow\infty$, the probability mass concentrates on the lowest-energy configurations, and as $\beta\rightarrow{0}$, the distribution approaches the uniform distribution over $\mathcal{S}$.
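
For a network small enough to enumerate, the energy, partition function, and Boltzmann distribution can be computed directly from the formulas above. The following Python sketch does this by brute force. The function names `energy` and `boltzmann_distribution` and the two-node example are assumptions chosen for illustration, and the approach does not scale beyond a handful of units because $N = 2^{|\mathcal{V}|}$.

```python
import itertools
import math

def energy(W, b, s):
    """E(s) = -sum_i b_i s_i - (1/2) sum_{i,j} w_ij s_i s_j."""
    n = len(s)
    bias_term = sum(b[i] * s[i] for i in range(n))
    pair_term = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    return -bias_term - 0.5 * pair_term

def boltzmann_distribution(W, b, beta):
    """Return ({configuration: probability}, Z) by enumerating all configurations."""
    configs = list(itertools.product((-1, 1), repeat=len(b)))
    weights = [math.exp(-beta * energy(W, b, s)) for s in configs]
    Z = sum(weights)  # partition function
    return {s: w / Z for s, w in zip(configs, weights)}, Z

# Hypothetical two-node network with a single positive coupling and no bias.
W = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]
P, Z = boltzmann_distribution(W, b, beta=1.0)
for s, p in P.items():
    print(s, energy(W, b, s), round(p, 4))
```

In this example the aligned configurations `(1, 1)` and `(-1, -1)` have energy `-1` and the mixed ones have energy `+1`, so the aligned configurations are the more probable ones. Setting `beta=0.0` instead gives every configuration probability `0.25`.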