# L10c: Restricted Boltzmann Machines (RBMs)

___
In this lecture, we will discuss Restricted Boltzmann Machines (RBMs), a type of generative stochastic neural network that can learn a probability distribution over its set of inputs. RBMs are a special case of Boltzmann Machines, which are undirected graphical models that can learn to represent complex distributions over their inputs.

* __Restricted Boltzmann Machines__ (RBMs) are a class of _generative_ stochastic neural networks. More specifically, given some (binary) input data $\mathbf{x}\in\left\{-1,1\right\}^{n}$, an RBM can be trained to approximate the probability distribution of this input. Moreover, once the RBM is trained, we can _sample_ from the network; in other words, we can generate new instances from the learned probability distribution.
* __Bipartite graph structure__. RBMs have [a bipartite graph structure](https://en.wikipedia.org/wiki/Bipartite_graph). The first layer is called the _visible_ layer, while the second is called the _hidden_ layer. The two layers are connected by a set of weighted edges, but there are no connections between the visible units or between the hidden units. This is what makes RBMs _restricted_ compared to general Boltzmann Machines, which can have connections between all units. In RBMs, the connections are only between the visible and hidden layers.

The source(s) for this lecture can be found here:
* [Mehlig, B. (2021). Machine Learning with Neural Networks. Chapter 4: The Boltzmann distribution](https://arxiv.org/abs/1901.05639v4)
___


## What is a Restricted Boltzmann Machine?
Before we get too deep into the weeds, let's [watch this introductory video from IBM](https://www.yout-ube.com/watch?v=L3ynnRgpZwg). The video mentions the structure of RBMs and potential applications. Can we think of any other applications for RBMs (to help us develop some intuition about how this thing works)?

### Architecture of RBMs
A [Restricted Boltzmann Machine (RBM)](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) consists of two sets (layers) of binary units (neurons, nodes, vertices, etc.) that are connected by weighted edges.
* _Visible and hidden layers_. The two sets of units are called the _visible_ layer and the _hidden_ layer. The visible layer is the input layer, while the hidden layer describes features or structures in the data. The two layers are connected by a set of weighted edges, but there are no connections between the visible units or between the hidden units. This is what makes RBMs _restricted_ compared to general Boltzmann Machines, which can have connections between all units. In RBMs, the connections are only between the visible and hidden layers.
* _Nodes in a Boltzmann machine_. Each node (visible or hidden) can be in one of two states: `on` or `off`. The state of a node is a random variable whose distribution is determined by the states of the units it is connected to and the weights of those connections.

Formally, [a restricted Boltzmann Machine (RBM)](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) $\bar{\mathcal{B}}$ is a fully connected _undirected weighted bipartite graph_ defined by the tuple $\bar{\mathcal{B}} = \left(\mathcal{V}_{v}, \mathcal{V}_{h},\mathcal{E}, \mathbf{W},\mathbf{a},\mathbf{b}, \mathbf{v},\mathbf{h}\right)$.
* __Units__: Each unit (vertex, node, neuron) in the visible layer $v_{i}\in\mathcal{V}_{v}$ or hidden layer $h_{i}\in\mathcal{V}_{h}$ has a binary state (`on` or `off`) and a bias value denoted as $a_{i}\in\mathbf{a}$ for the visible layer, and $b_{i}\in\mathbf{b}$ for the hidden layer. The bias vector $\mathbf{a}\in\mathbb{R}^{|\mathcal{V}_{v}|}$ is the vector of bias values for all visible nodes in the network, and $\mathbf{b}\in\mathbb{R}^{|\mathcal{V}_{h}|}$ is the vector of bias values for all hidden nodes in the network.
* __Edges__: There is an edge between each visible and hidden node, but no edges between nodes in a layer. Each edge $e\in\mathcal{E}$ has a weight. The weight of the edge connecting $v_{i}\in\mathcal{V}_{v}$ and $h_{j}\in\mathcal{V}_{h}$ is denoted by $w_{ij}\in\mathbf{W}$, where the weight matrix $\mathbf{W}\in\mathbb{R}^{|\mathcal{V}_{v}|\times|\mathcal{V}_{h}|}$. The weights $w_{ij}\in\mathbb{R}$ determine the strength of the connection between a visible node and a hidden node. The weight matrix for an RBM is _not_ symmetric (in general, it is not even square).
* __States__: The state of the visible (hidden) layer is represented by a binary vector $\mathbf{v}\in\{-1,1\}^{|\mathcal{V}_{v}|}$ (or $\mathbf{h}\in\{-1,1\}^{|\mathcal{V}_{h}|}$), where $v_{i}\in\{-1,1\}$ (or $h_{i}\in\{-1,1\}$) is the state of node $v_{i}$ (or $h_{i}$). When $v_{i} = 1$, the node is `on`, and when $v_{i} = -1$, the node is `off`.
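To make the tuple definition concrete, here is a minimal Python sketch of a container for an RBM. The names `RBM` and `build_rbm`, the small random initial weights, and the zero initial biases are assumptions made for illustration; they are not part of the lecture's formal definition.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RBM:
    """Hypothetical container mirroring the tuple (W, a, b, v, h)."""
    W: np.ndarray  # weights, shape (n_visible, n_hidden)
    a: np.ndarray  # visible biases, shape (n_visible,)
    b: np.ndarray  # hidden biases, shape (n_hidden,)
    v: np.ndarray  # visible state, entries in {-1, +1}
    h: np.ndarray  # hidden state, entries in {-1, +1}


def build_rbm(n_visible, n_hidden, seed=0):
    """Build an RBM with small random weights and random +/-1 states (an assumed initialization)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    v = rng.choice([-1, 1], size=n_visible)
    h = rng.choice([-1, 1], size=n_hidden)
    return RBM(W=W, a=a, b=b, v=v, h=h)


rbm = build_rbm(n_visible=4, n_hidden=3)
print(rbm.W.shape)  # one weight per (visible, hidden) pair
```

Because edges only run between layers, a single rectangular matrix `W` is enough to store every connection; there are no visible-visible or hidden-hidden blocks to keep track of.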

## Energy function and Stationary Distribution for RBMs
Like the general Boltzmann Machine (or Hopfield networks), each _configuration_ of nodes $(\mathbf{v},\mathbf{h})$ can be characterized (scored) by an energy function. The energy function of an RBM is given by:
$$
\begin{align*}
E(\mathbf{v},\mathbf{h}) &= -\sum_{i=1}^{|\mathcal{V}_{v}|} a_{i}v_{i} - \sum_{j=1}^{|\mathcal{V}_{h}|} b_{j}h_{j} - \sum_{i=1}^{|\mathcal{V}_{v}|}\sum_{j=1}^{|\mathcal{V}_{h}|} w_{ij}v_{i}h_{j}
\end{align*}
$$
The first term is the bias of the visible layer, the second term is the bias of the hidden layer, and the third term is the interaction between the visible and hidden layers. The energy function is a measure of how well the visible and hidden layers are aligned. The lower the energy, the better the alignment. Given the energy function $E(\mathbf{v},\mathbf{h})$, we can define the _joint probability distribution_ of the _configuration_ of the visible and hidden layers as:
$$
\begin{align*}
P(\mathbf{v},\mathbf{h}) = \frac{1}{Z(\beta)}e^{-\beta\cdot{E}(\mathbf{v},\mathbf{h})}
\end{align*}
$$
where $\beta$ is the _inverse temperature_ (a hyperparameter) and $Z(\cdot)$ is the _partition function_, which sums over every configuration of both layers and therefore depends only on $\beta$ (and the network parameters):
$$
\begin{align*}
Z(\beta) &= \sum_{\mathbf{v},\mathbf{h}}e^{-\beta\cdot{E}(\mathbf{v},\mathbf{h})}
\end{align*}
$$
Ok, so far this seems like a step in the wrong direction. We now have a partition function over all possible configurations of the visible and hidden layers. This is a _joint_ probability distribution, but we want to learn the _marginal_ distribution of the visible layer, e.g., what choice a consumer is going to make, what video we'll watch next, etc.
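For a very small network, the energy and the joint distribution can be computed by brute-force enumeration. The sketch below does this in Python; the function names and the example values of $\mathbf{W}$, $\mathbf{a}$, and $\mathbf{b}$ are made up for illustration.

```python
import itertools

import numpy as np


def energy(v, h, W, a, b):
    """Energy of a configuration: -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)


def joint_distribution(W, a, b, beta=1.0):
    """Enumerate every (v, h) pair and return the joint probabilities and Z."""
    n_v, n_h = W.shape
    weights = {}
    for v in itertools.product((-1, 1), repeat=n_v):
        for h in itertools.product((-1, 1), repeat=n_h):
            E = energy(np.array(v), np.array(h), W, a, b)
            weights[(v, h)] = np.exp(-beta * E)
    Z = sum(weights.values())  # partition function
    return {state: w / Z for state, w in weights.items()}, Z


# Illustrative parameters for a 2-visible, 2-hidden network
W = np.array([[0.5, -0.2], [0.1, 0.3]])
a = np.array([0.1, -0.1])
b = np.array([0.0, 0.2])

P, Z = joint_distribution(W, a, b)
print(len(P), sum(P.values()))
```

The double loop visits $2^{|\mathcal{V}_{v}|+|\mathcal{V}_{h}|}$ configurations, so this approach is only feasible for toy networks.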

### Marginal Distributions
To obtain the marginal distribution of the visible layer, we sum the joint distribution over all possible configurations of the hidden layer:
$$
\begin{align*}
P(\mathbf{v}) &= \sum_{\mathbf{h}}P(\mathbf{v},\mathbf{h}) = \frac{Z(\mathbf{v}, \beta)}{Z}
\end{align*}
$$
where $Z$ is the partition function of the joint distribution (the sum over both layers), and $Z(\mathbf{v}, \beta)$ is the partial sum over the hidden layer for a fixed visible configuration $\mathbf{v}$:
$$
\begin{align*}
Z(\mathbf{v}, \beta) &= \sum_{\mathbf{h}}e^{-\beta\cdot{E}(\mathbf{v},\mathbf{h})}
\end{align*}
$$
Alternatively, we can sum over the visible layer to get the marginal distribution of the hidden layer:
$$
\begin{align*}
P(\mathbf{h}) &= \sum_{\mathbf{v}}P(\mathbf{v},\mathbf{h}) = \frac{Z(\mathbf{h}, \beta)}{Z}
\end{align*}
$$
where $Z(\mathbf{h}, \beta)$ is the partial sum over the visible layer for a fixed hidden configuration $\mathbf{h}$:
$$
\begin{align*}
Z(\mathbf{h}, \beta) &= \sum_{\mathbf{v}}e^{-\beta\cdot{E}(\mathbf{v},\mathbf{h})}
\end{align*}
$$
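For a toy network, the marginal over the visible layer can be checked numerically by summing the Boltzmann weights over every hidden configuration and then normalizing. This is a sketch under assumed, made-up parameter values; `visible_marginal` is a hypothetical helper name.

```python
import itertools

import numpy as np


def energy(v, h, W, a, b):
    """Energy of a configuration: -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)


def visible_marginal(W, a, b, beta=1.0):
    """Return P(v) for every visible configuration by summing out h."""
    n_v, n_h = W.shape
    states_v = list(itertools.product((-1, 1), repeat=n_v))
    states_h = list(itertools.product((-1, 1), repeat=n_h))
    # Unnormalized weight of each v: sum of exp(-beta * E) over all h
    unnormalized = {
        v: sum(
            np.exp(-beta * energy(np.array(v), np.array(h), W, a, b))
            for h in states_h
        )
        for v in states_v
    }
    Z = sum(unnormalized.values())  # sum over both layers
    return {v: w / Z for v, w in unnormalized.items()}


# Illustrative parameters for a 2-visible, 3-hidden network
W = np.array([[0.5, -0.2, 0.0], [0.1, 0.3, -0.4]])
a = np.array([0.1, -0.1])
b = np.array([0.0, 0.2, -0.3])

P_v = visible_marginal(W, a, b)
for v, p in P_v.items():
    print(v, round(float(p), 4))
```

Swapping the roles of the two loops gives the marginal over the hidden layer in the same way.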

### Hmmm. Are we stuck?
In much the same way as for the general Boltzmann Machine, we cannot compute the partition function directly (except for very small networks). Instead, we need to use _sampling_ methods to approximate the behavior of the network. Let's take a look at how we might do that.

## Sampling from RBMs
Sampling from a [Restricted Boltzmann Machine (RBM)](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) can be thought of from two perspectives:
* __Fix the visible layer__. Given a visible layer $\mathbf{v}$, we can sample the hidden layer $\mathbf{h}$ from the conditional distribution $P(\mathbf{h}|\mathbf{v})$. This is done by computing the probability of each hidden unit being `on` or `off` given the visible units, and then drawing each hidden state from that probability.
* __Fix the hidden layer__. Given a hidden layer $\mathbf{h}$, we can sample the visible layer $\mathbf{v}$ from the conditional distribution $P(\mathbf{v}|\mathbf{h})$. This is done by computing the probability of each visible unit being `on` or `off` given the hidden units, and then drawing each visible state from that probability.

Let's think about this from the second perspective, i.e., we have fixed the hidden layer to some set of values.
Given the specified hidden state, suppose we let the state of the restricted Boltzmann machine $\bar{\mathcal{B}}$ evolve over $t=1,2,\dots, T$ turns. During each turn, every node in the visible layer can update its state based on the states in the hidden layer, the weights of its connections, and its bias term. The total input to node $v_{i}$ at turn $t$, denoted as $I_{i}^{(t)}$, is given by:
$$
\begin{align*}
I_{i}^{(t)} = \sum_{j\in\mathcal{V}_{h}} w_{ij}h_{j}^{(t-1)} + a_{i}\quad\forall i\in\mathcal{V}_{v}
\end{align*}
$$
where $I_{i}^{(t)}$ is the total input to node $v_{i}$ at time $t$, $w_{ij}$ is the weight of the edge connecting $v_{i}$ and $h_{j}$, and $h_{j}^{(t-1)}$ is the state of the _hidden_ node $h_{j}$ at turn $t-1$. However, unlike [classical Hopfield networks](https://en.wikipedia.org/wiki/Hopfield_network), where the update is deterministic, in a restricted Boltzmann Machine, the state of each node is updated stochastically. The probability that node $v_{i}$ is `on` at turn $t$ is given by the logistic function:
$$
\begin{align*}
P(v_{i}^{(t)} = 1|I_{i}^{(t)}) = \frac{1}{1+\exp(-\beta\cdot{I}_{i}^{(t)})}
\end{align*}
$$
where $P(v_{i}^{(t)} = 1|I_{i}^{(t)})$ is the probability that node $v_{i}$ is `on` at time $t$ given the total input $I_{i}^{(t)}$. The probability that node $v_{i}$ is `off` at time $t$ is given by $P(v_{i}^{(t)} = -1|I_{i}^{(t)}) = 1 - P(v_{i}^{(t)} = 1|I_{i}^{(t)})$.  
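As a quick numerical illustration, suppose (values made up for this example) that node $v_{i}$ is connected to two hidden nodes with weights $w_{i1} = 1.0$ and $w_{i2} = -2.0$, both hidden nodes are `on` ($h_{1} = h_{2} = 1$), the bias is $a_{i} = 0.5$, and $\beta = 1$. Then:
$$
\begin{align*}
I_{i}^{(t)} &= (1.0)(1) + (-2.0)(1) + 0.5 = -0.5\\
P(v_{i}^{(t)} = 1|I_{i}^{(t)}) &= \frac{1}{1+\exp(0.5)} \approx 0.378\\
P(v_{i}^{(t)} = -1|I_{i}^{(t)}) &\approx 1 - 0.378 = 0.622
\end{align*}
$$
A negative total input makes the `off` state more likely, but the node can still turn `on` roughly 38% of the time.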

### Algorithm
To generate samples from a restricted Boltzmann Machine $\bar{\mathcal{B}}$ (with a fixed hidden layer), consider the following algorithm: 

__Initialize__ the weights $\mathbf{W}$ and visible biases $\mathbf{a}$ of $\bar{\mathcal{B}}$. Provide a state for the hidden layer $\mathbf{h}$ of the network, an initial value for the visible nodes $\mathbf{v}^{(0)}$, and an inverse temperature $\beta$.

For each turn $t=1,2,\dots,T$:
1. For each node $v_{i}\in\mathcal{V}_{v}$:
    1. Compute the total input $I_{i}^{(t)}$ to node $v_{i}$ using $I_{i}^{(t)} = \sum_{j\in\mathcal{V}_{h}} w_{ij}h_{j}^{(t-1)} + a_{i}$.
    2. Compute the probability of the _next_ state $v_{i}^{(t)} = 1$ using the logistic function $P(v_{i}^{(t)} = 1|I_{i}^{(t)}) = \left(1+\exp(-\beta\cdot{I}_{i}^{(t)})\right)^{-1}$ for node $v_{i}$. The probability of $v_{i}^{(t)} = -1$ is given by $P(v_{i}^{(t)} = -1|I_{i}^{(t)}) = 1 - P(v_{i}^{(t)} = 1|I_{i}^{(t)})$.
    3. Sample the _next_ state of node $v_{i}$ from a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) with parameter $p = P(v_{i}^{(t)} = 1|I_{i}^{(t)})$: set $v_{i}^{(t)} = 1$ with probability $p$, and $v_{i}^{(t)} = -1$ otherwise.
2. Store the state vector $\mathbf{v}^{(t)}$ of the network at turn $t$, and proceed to the next turn.
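
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `sample_visible`, the default arguments, and the example parameter values are assumptions, and the update uses the logistic rule exactly as written in the algorithm.

```python
import numpy as np


def sample_visible(W, a, h, v0, beta=1.0, T=100, seed=0):
    """Sample visible states for a fixed hidden layer h over T turns."""
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=int)
    samples = np.empty((T, len(v)), dtype=int)
    for t in range(T):
        for i in range(len(v)):
            # Step 1: total input to visible node i from the fixed hidden layer
            I = W[i, :] @ h + a[i]
            # Step 2: probability that node i is on (logistic function)
            p_on = 1.0 / (1.0 + np.exp(-beta * I))
            # Step 3: Bernoulli draw mapped onto the {-1, +1} states
            v[i] = 1 if rng.random() < p_on else -1
        # Store the visible state vector for this turn
        samples[t] = v
    return samples


# Illustrative parameters: 3 visible nodes, 2 hidden nodes
W = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 0.0]])
a = np.array([0.5, 0.0, -0.5])
h = np.array([1, 1])
v0 = np.array([-1, -1, -1])

samples = sample_visible(W, a, h, v0, beta=1.0, T=1000)
print((samples == 1).mean(axis=0))  # empirical fraction of turns each node is on
```

Because the hidden layer is fixed and there are no visible-visible edges, each visible node is drawn independently of the others, so the inner loop could also be replaced by a single vectorized draw.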

#### Hmmm. 
* _Question_: How would this change if we fixed the visible layer and sampled over the hidden states? 

## Training RBMs
The training of a Restricted Boltzmann Machine (RBM) is done using a _contrastive divergence_ algorithm. The goal of training is to learn the weights $\mathbf{W}$ and biases $\mathbf{a}$ and $\mathbf{b}$ of the network such that the network can approximate the probability distribution of the input data. 