# L7c: Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) is a simplified version of the general Boltzmann machine that removes connections between nodes in the same layer. This restriction creates a bipartite graph structure that enables efficient training and sampling algorithms.

> __Learning Objectives__
>
> * __Define a restricted Boltzmann machine:__ Describe the bipartite graph structure of an RBM with visible and hidden layers, and explain how the absence of intra-layer connections simplifies the network.
> * __Implement feedforward and feedback passes:__ Apply the conditional independence property to compute hidden states from visible inputs and reconstruct visible states from hidden activations.
> * __Explain contrastive divergence training:__ Describe the CD-k algorithm for approximating the gradient of the log-likelihood and updating the weights and biases of the RBM.

Let's get started!

___

## RBM Structure
Recall that a general Boltzmann machine is a fully connected undirected graph where every node connects to every other node. A Restricted Boltzmann Machine simplifies this structure by organizing nodes into two layers: a visible layer and a hidden layer, with connections only between layers.

Formally, an RBM is defined by the tuple $\mathcal{R} = \left(\mathcal{V}, \mathcal{H}, \mathbf{W}, \mathbf{a}, \mathbf{b}\right)$:

* **Visible layer** $\mathcal{V}$: The set of visible units $\left\{v_{1}, v_{2}, \ldots, v_{n}\right\}$ where $n = |\mathcal{V}|$. Each visible unit has a binary state $v_{i} \in \{-1, 1\}$ and a bias $a_{i} \in \mathbb{R}$. The bias vector is $\mathbf{a} \in \mathbb{R}^{n}$.
* **Hidden layer** $\mathcal{H}$: The set of hidden units $\left\{h_{1}, h_{2}, \ldots, h_{m}\right\}$ where $m = |\mathcal{H}|$. Each hidden unit has a binary state $h_{j} \in \{-1, 1\}$ and a bias $b_{j} \in \mathbb{R}$. The bias vector is $\mathbf{b} \in \mathbb{R}^{m}$.
* **Weight matrix** $\mathbf{W} \in \mathbb{R}^{n \times m}$: The weight $w_{ij}$ represents the connection strength between visible unit $v_{i}$ and hidden unit $h_{j}$.

The restriction is that there are no connections within the visible layer (no $v_{i}$-$v_{j}$ edges) and no connections within the hidden layer (no $h_{i}$-$h_{j}$ edges). This bipartite structure is the source of the name "restricted."
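The tuple above maps onto a small container type. The following is a minimal sketch, not a reference implementation: the names `RBM` and `random_rbm`, the nested-list storage, and the initialization scale are assumptions made for illustration. The restriction shows up as an absence: the structure stores a single $n \times m$ matrix and has no slot for visible-visible or hidden-hidden weights.

```python
from dataclasses import dataclass
import random


@dataclass
class RBM:
    """Hypothetical container for the parameters (W, a, b) of an RBM."""
    W: list  # n x m nested list; W[i][j] couples visible unit i and hidden unit j
    a: list  # n visible biases
    b: list  # m hidden biases

    def __post_init__(self):
        n, m = len(self.a), len(self.b)
        # Only visible-hidden weights exist, so W must be exactly n x m.
        if len(self.W) != n or any(len(row) != m for row in self.W):
            raise ValueError("W must have shape n x m")

    @property
    def shape(self):
        """Return (n, m): the number of visible and hidden units."""
        return len(self.a), len(self.b)


def random_rbm(n, m, scale=0.1, seed=0):
    """Build an RBM with small Gaussian weights and zero biases (assumed init)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, scale) for _ in range(m)] for _ in range(n)]
    return RBM(W=W, a=[0.0] * n, b=[0.0] * m)


rbm = random_rbm(n=4, m=3)
print(rbm.shape)
```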

### Energy Function
The energy of a joint configuration $(\mathbf{v}, \mathbf{h})$ in an RBM is given by:
$$
E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{n} a_{i} v_{i} - \sum_{j=1}^{m} b_{j} h_{j} - \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} v_{i} h_{j}
$$
In matrix notation:
$$
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top}\mathbf{W}\mathbf{h}
$$
The joint probability distribution over visible and hidden states follows the Boltzmann distribution:
$$
P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\beta)}\exp\left(-\beta \cdot E(\mathbf{v}, \mathbf{h})\right)
$$
where $Z(\beta) = \sum_{\mathbf{v}^{\prime}, \mathbf{h}^{\prime}} \exp\left(-\beta \cdot E(\mathbf{v}^{\prime}, \mathbf{h}^{\prime})\right)$ is the partition function.

* _What is $\beta$?_ The parameter $\beta$ is the inverse temperature that controls the amount of randomness in the system. As $\beta\rightarrow\infty$, the RBM becomes more deterministic; as $\beta\rightarrow{0}$, the RBM becomes more random.
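For a very small RBM, the energy and the Boltzmann distribution can be evaluated by brute-force enumeration. The sketch below does this with plain Python lists; the function names and the toy parameter values are assumptions chosen for illustration, and the enumeration over $2^{n+m}$ states is only feasible for tiny models.

```python
import itertools
import math


def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for states in {-1, +1}."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    coupling = sum(W[i][j] * v[i] * h[j]
                   for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - coupling


def joint_distribution(W, a, b, beta=1.0):
    """Return {(v, h): P(v, h)} by enumerating every joint state."""
    n, m = len(a), len(b)
    states = [(v, h)
              for v in itertools.product((-1, 1), repeat=n)
              for h in itertools.product((-1, 1), repeat=m)]
    weights = [math.exp(-beta * energy(v, h, W, a, b)) for v, h in states]
    Z = sum(weights)  # partition function Z(beta)
    return {s: w / Z for s, w in zip(states, weights)}


# Hypothetical toy parameters: n = 2 visible units, m = 1 hidden unit.
W = [[0.5], [-0.3]]
a = [0.1, -0.2]
b = [0.05]

P = joint_distribution(W, a, b, beta=1.0)
most_likely = max(P, key=P.get)
print(most_likely, P[most_likely])
```

Setting `beta=0.0` in this sketch makes every state equally likely, which matches the high-randomness limit described above.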

___

## Conditional Independence
The bipartite structure of the RBM leads to a key property: given the visible layer, the hidden units are conditionally independent of each other; similarly, given the hidden layer, the visible units are conditionally independent of each other. This property enables parallel computation of all units in one layer given the other layer.

### Feedforward Pass (Visible to Hidden)
Given a visible configuration $\mathbf{v}$, the probability that hidden unit $h_{j}$ is `on` is:
$$
P(h_{j} = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp\left(-2\beta\left(\sum_{i=1}^{n} w_{ij} v_{i} + b_{j}\right)\right)}
$$
Because of conditional independence, we can compute all hidden unit probabilities simultaneously. The total input to hidden unit $h_{j}$ is:
$$
\phi_{j} = \sum_{i=1}^{n} w_{ij} v_{i} + b_{j} = \left(\mathbf{W}^{\top}\mathbf{v}\right)_{j} + b_{j}
$$
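The factor of $2$ in the exponent comes from the $\{-1, 1\}$ encoding. As a short check: once $\mathbf{v}$ is fixed, the only terms of $E(\mathbf{v}, \mathbf{h})$ that contain $h_{j}$ sum to $-h_{j}\phi_{j}$, so normalizing over the two possible values of $h_{j}$ gives:
$$
P(h_{j} = 1 \mid \mathbf{v}) = \frac{\exp\left(\beta\phi_{j}\right)}{\exp\left(\beta\phi_{j}\right) + \exp\left(-\beta\phi_{j}\right)} = \frac{1}{1 + \exp\left(-2\beta\phi_{j}\right)}
$$
Equivalently, $P(h_{j} = 1 \mid \mathbf{v}) = \frac{1}{2}\left(1 + \tanh\left(\beta\phi_{j}\right)\right)$, so the conditional mean of $h_{j}$ is $\tanh\left(\beta\phi_{j}\right)$.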

### Feedback Pass (Hidden to Visible)
Given a hidden configuration $\mathbf{h}$, the probability that visible unit $v_{i}$ is `on` is:
$$
P(v_{i} = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\left(-2\beta\left(\sum_{j=1}^{m} w_{ij} h_{j} + a_{i}\right)\right)}
$$
The total input to visible unit $v_{i}$ is:
$$
\psi_{i} = \sum_{j=1}^{m} w_{ij} h_{j} + a_{i} = \left(\mathbf{W}\mathbf{h}\right)_{i} + a_{i}
$$
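Both passes can be sketched in a few lines of Python. The function names and the toy weights below are assumptions for illustration. The sketch evaluates the logistic expression through the identity $\left(1 + e^{-2x}\right)^{-1} = \frac{1}{2}\left(1 + \tanh x\right)$, which avoids overflow for large inputs.

```python
import math


def prob_plus_one(total_input, beta=1.0):
    """P(unit = +1) for a {-1, +1} unit; equals 1 / (1 + exp(-2*beta*x))."""
    return 0.5 * (1.0 + math.tanh(beta * total_input))


def hidden_probabilities(v, W, b, beta=1.0):
    """Feedforward pass: P(h_j = 1 | v) for every hidden unit j."""
    n, m = len(v), len(b)
    return [prob_plus_one(sum(W[i][j] * v[i] for i in range(n)) + b[j], beta)
            for j in range(m)]


def visible_probabilities(h, W, a, beta=1.0):
    """Feedback pass: P(v_i = 1 | h) for every visible unit i."""
    n, m = len(a), len(h)
    return [prob_plus_one(sum(W[i][j] * h[j] for j in range(m)) + a[i], beta)
            for i in range(n)]


# Hypothetical toy model: n = 3 visible units, m = 2 hidden units.
W = [[0.5, -1.0],
     [0.25, 0.0],
     [-0.5, 1.0]]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]

p_h = hidden_probabilities([1, -1, 1], W, b)
p_v = visible_probabilities([1, -1], W, a)
print(p_h)
print(p_v)
```

Each list comprehension treats the units of a layer independently, which is the conditional independence property at work: no entry of `p_h` depends on another hidden unit.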

### Sampling Algorithm
To generate samples from an RBM, we alternate between sampling the hidden layer given the visible layer (feedforward) and sampling the visible layer given the hidden layer (feedback):

__Initialize__: Set the visible layer $\mathbf{v}^{(0)}$ to some initial state (e.g., a training example or random configuration). Set the inverse temperature $\beta$ and number of iterations $T$.

For $t = 1, 2, \ldots, T$:

1. **Feedforward pass**: For each hidden unit $h_{j} \in \mathcal{H}$:
    - Compute $\phi_{j}^{(t)} = \sum_{i=1}^{n} w_{ij} v_{i}^{(t-1)} + b_{j}$
    - Compute $P(h_{j}^{(t)} = 1 \mid \mathbf{v}^{(t-1)}) = \left(1 + \exp(-2\beta\phi_{j}^{(t)})\right)^{-1}$
    - Sample $h_{j}^{(t)}$: set $h_{j}^{(t)} = 1$ with this probability and $h_{j}^{(t)} = -1$ otherwise

2. **Feedback pass**: For each visible unit $v_{i} \in \mathcal{V}$:
    - Compute $\psi_{i}^{(t)} = \sum_{j=1}^{m} w_{ij} h_{j}^{(t)} + a_{i}$
    - Compute $P(v_{i}^{(t)} = 1 \mid \mathbf{h}^{(t)}) = \left(1 + \exp(-2\beta\psi_{i}^{(t)})\right)^{-1}$
    - Sample $v_{i}^{(t)}$: set $v_{i}^{(t)} = 1$ with this probability and $v_{i}^{(t)} = -1$ otherwise

__Output__: The sequence of visible states $\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(T)}$.
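The alternating procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `gibbs_sample` and `sample_pm1`, the fixed seed, and the toy parameters are not part of the original notes.

```python
import math
import random


def prob_plus_one(total_input, beta):
    """P(unit = +1); equals 1 / (1 + exp(-2*beta*x)), written via tanh."""
    return 0.5 * (1.0 + math.tanh(beta * total_input))


def sample_pm1(p, rng):
    """Return +1 with probability p and -1 otherwise."""
    return 1 if rng.random() < p else -1


def gibbs_sample(v0, W, a, b, beta=1.0, T=10, seed=0):
    """Run T feedforward-feedback cycles; return the visible states v(1..T)."""
    rng = random.Random(seed)
    n, m = len(a), len(b)
    v = list(v0)
    chain = []
    for _ in range(T):
        # Feedforward pass: sample every hidden unit given the visible layer.
        h = [sample_pm1(prob_plus_one(
                 sum(W[i][j] * v[i] for i in range(n)) + b[j], beta), rng)
             for j in range(m)]
        # Feedback pass: sample every visible unit given the new hidden layer.
        v = [sample_pm1(prob_plus_one(
                 sum(W[i][j] * h[j] for j in range(m)) + a[i], beta), rng)
             for i in range(n)]
        chain.append(v)
    return chain


# Hypothetical toy model: n = 3 visible units, m = 2 hidden units.
W = [[0.5, -1.0], [0.25, 0.0], [-0.5, 1.0]]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
chain = gibbs_sample([1, -1, 1], W, a, b, beta=1.0, T=5)
for state in chain:
    print(state)
```

Raising `beta` makes each unit follow the sign of its total input almost deterministically, while `beta` near zero makes every unit a fair coin flip.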

___

## Contrastive Divergence Training
Training an RBM follows the same principle as training a general Boltzmann machine: maximize the log-likelihood of the training data. The gradient of the log-likelihood with respect to the weights involves two expectations: one over the data distribution and one over the model distribution. Computing the model expectation requires sampling from the equilibrium distribution, which is computationally expensive.

Contrastive divergence (CD-k) approximates this gradient by running only $k$ steps of Gibbs sampling instead of waiting for equilibrium. The CD-1 algorithm (one step of Gibbs sampling) is commonly used in practice.

### CD-k Algorithm
Larger values of $k$ give better gradient approximations but increase the computational cost of each update. Typical learning rates are $\eta \in [0.001, 0.1]$, and the number of hidden units $m$ is often chosen to be comparable to or larger than the number of visible units $n$ to provide sufficient representational capacity.

__Initialize__: Set weights $\mathbf{W}$ and biases $\mathbf{a}$, $\mathbf{b}$ to small random values (e.g., sampled from $\mathcal{N}(0, 0.01)$). Set learning rate $\eta$, inverse temperature $\beta = 1$, and number of Gibbs steps $k$.

For each training pattern $\mathbf{x}^{(s)} \in \mathbf{X}$ (the index $s$ labels patterns; $i$ and $j$ are reserved for visible and hidden units):

1. **Positive phase** (data-driven):
    - Set $\mathbf{v}^{(0)} = \mathbf{x}^{(s)}$
    - Compute hidden probabilities: $p_{j}^{+} = P(h_{j} = 1 \mid \mathbf{v}^{(0)})$ for all $j$
    - Sample hidden states: set $h_{j}^{(0)} = 1$ with probability $p_{j}^{+}$ and $h_{j}^{(0)} = -1$ otherwise
    - Compute positive correlations: $\langle v_{i} h_{j} \rangle^{+} = v_{i}^{(0)} \cdot \left(2p_{j}^{+} - 1\right)$, where $2p_{j}^{+} - 1$ is the conditional mean of $h_{j} \in \{-1, 1\}$

2. **Negative phase** (reconstruction):
    - For $t = 1$ to $k$:
        - Sample visible: $\mathbf{v}^{(t)} \sim P(\mathbf{v} \mid \mathbf{h}^{(t-1)})$
        - Sample hidden: $\mathbf{h}^{(t)} \sim P(\mathbf{h} \mid \mathbf{v}^{(t)})$
    - Compute reconstruction probabilities: $p_{j}^{-} = P(h_{j} = 1 \mid \mathbf{v}^{(k)})$
    - Compute negative correlations: $\langle v_{i} h_{j} \rangle^{-} = v_{i}^{(k)} \cdot \left(2p_{j}^{-} - 1\right)$

3. **Update parameters**:
    - $w_{ij} \gets w_{ij} + \eta\left(\langle v_{i} h_{j} \rangle^{+} - \langle v_{i} h_{j} \rangle^{-}\right)$
    - $a_{i} \gets a_{i} + \eta\left(v_{i}^{(0)} - v_{i}^{(k)}\right)$
    - $b_{j} \gets b_{j} + 2\eta\left(p_{j}^{+} - p_{j}^{-}\right)$, the difference of the conditional means $2p_{j}^{\pm} - 1$

__Output__: Trained parameters $\mathbf{W}$, $\mathbf{a}$, $\mathbf{b}$.
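A single CD-k update for one training pattern can be sketched as below. The function names, the in-place list updates, and the toy values are assumptions for illustration. Because the units take values in $\{-1, 1\}$, this sketch uses the conditional mean $2p_{j} - 1$ of each hidden unit as the hidden statistic in the correlations; that choice is an assumption of the sketch and should be checked against the update rule in use.

```python
import math
import random


def prob_plus_one(x, beta):
    """P(unit = +1); equals 1 / (1 + exp(-2*beta*x)), written via tanh."""
    return 0.5 * (1.0 + math.tanh(beta * x))


def hidden_probs(v, W, b, beta):
    return [prob_plus_one(sum(W[i][j] * v[i] for i in range(len(v))) + b[j], beta)
            for j in range(len(b))]


def visible_probs(h, W, a, beta):
    return [prob_plus_one(sum(W[i][j] * h[j] for j in range(len(h))) + a[i], beta)
            for i in range(len(a))]


def sample_layer(probs, rng):
    """Draw each unit as +1 with its probability, else -1."""
    return [1 if rng.random() < p else -1 for p in probs]


def cd_k_update(x, W, a, b, eta=0.01, beta=1.0, k=1, rng=None):
    """Apply one CD-k update in place for pattern x; return the reconstruction."""
    rng = rng or random.Random(0)
    n, m = len(a), len(b)
    # Positive phase: clamp the data and compute hidden statistics.
    v0 = list(x)
    p_pos = hidden_probs(v0, W, b, beta)
    h = sample_layer(p_pos, rng)
    # Negative phase: k steps of alternating Gibbs sampling.
    vk = v0
    for _ in range(k):
        vk = sample_layer(visible_probs(h, W, a, beta), rng)
        h = sample_layer(hidden_probs(vk, W, b, beta), rng)
    p_neg = hidden_probs(vk, W, b, beta)
    # Conditional means of the {-1, +1} hidden units (assumed statistic).
    mean_pos = [2.0 * p - 1.0 for p in p_pos]
    mean_neg = [2.0 * p - 1.0 for p in p_neg]
    # Parameter updates: data statistics minus reconstruction statistics.
    for i in range(n):
        for j in range(m):
            W[i][j] += eta * (v0[i] * mean_pos[j] - vk[i] * mean_neg[j])
        a[i] += eta * (v0[i] - vk[i])
    for j in range(m):
        b[j] += eta * (mean_pos[j] - mean_neg[j])
    return vk


# Hypothetical toy model and pattern: n = 3 visible units, m = 2 hidden units.
W = [[0.01, -0.02], [0.0, 0.01], [-0.01, 0.02]]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
reconstruction = cd_k_update([1, -1, 1], W, a, b, eta=0.05, k=1)
print(reconstruction)
```

In a full training loop this function would be called once per pattern, typically over several passes through the data set.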

### Why CD Works
The weight update $\Delta w_{ij} = \eta\left(\langle v_{i} h_{j} \rangle^{+} - \langle v_{i} h_{j} \rangle^{-}\right)$ has an intuitive interpretation:

> __Correlation Terms__
>
> * The notation $\langle v_{i} h_{j} \rangle$ denotes the correlation between visible unit $v_{i}$ and hidden unit $h_{j}$, that is, the average value of the product $v_{i} h_{j}$.
> * The positive term $\langle v_{i} h_{j} \rangle^{+}$ measures correlations in the training data.
> * The negative term $\langle v_{i} h_{j} \rangle^{-}$ measures correlations in the model's reconstruction.
>
> If the reconstruction matches the data, these terms are equal and the weight does not change. If the correlation in the data is stronger than in the reconstruction, the weight increases to strengthen that connection; if it is weaker, the weight decreases.

___

## RBM vs General Boltzmann Machine
The restricted structure provides several computational advantages over the general Boltzmann machine:

| Property | General Boltzmann Machine | Restricted Boltzmann Machine |
|----------|---------------------------|------------------------------|
| Connections | Fully connected | Bipartite (visible-hidden only) |
| Sampling | Sequential (one node at a time) | Parallel (entire layer at once) |
| Training | Slow equilibrium sampling | CD-k approximation |
| Hidden-hidden edges | Yes | No |
| Visible-visible edges | Yes | No |

The bipartite structure means that we can compute all hidden unit activations in parallel given the visible layer, and all visible unit activations in parallel given the hidden layer. A fully connected machine with $N = |\mathcal{V}| + |\mathcal{H}|$ nodes must update its nodes one at a time, at a cost of $O(N^{2})$ per sweep, whereas a complete feedforward-feedback cycle in an RBM costs $O(|\mathcal{V}| \cdot |\mathcal{H}|)$ and reduces to two matrix-vector products.

___

## Summary
This notebook introduced restricted Boltzmann machines as bipartite energy-based models that enable efficient sampling and training through conditional independence.

> __Key Takeaways__
>
> * **Bipartite structure:** An RBM has visible and hidden layers with connections only between layers, creating conditional independence within each layer.
> * **Parallel sampling:** Given one layer, all units in the other layer can be sampled simultaneously using the feedforward and feedback update equations.
> * **CD-k training:** Contrastive divergence approximates the log-likelihood gradient by running $k$ steps of Gibbs sampling, avoiding the need to sample from the model's equilibrium distribution.

RBMs serve as building blocks for deep belief networks and provide a foundation for generative modeling with energy-based architectures.

___