# Activity: Understanding Boltzmann Machines

In this activity, we will explore some questions surrounding the training of a _small_ Boltzmann machine. In particular, we'll look at one of the key limitations of the training approach, namely requiring convergence to a stationary distribution for each training iteration.

__Why are we looking at this?__ The [Boltzmann machine](https://en.wikipedia.org/wiki/Boltzmann_machine) popularized by [Prof. G. Hinton in the mid-1980s](https://en.wikipedia.org/wiki/Geoffrey_Hinton) was (jointly) awarded the [2024 Nobel Prize in Physics](https://www.nobelprize.org/prizes/physics/2024/summary/) along with [Prof. J. Hopfield](https://en.wikipedia.org/wiki/John_Hopfield). While of little practical use (because of an issue that we will discuss today), these ideas led to the development of [the restricted Boltzmann machine](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine), which has many practical applications.

> __Learning Objectives__
>
> By the end of this lab, you should be able to:
> * __Sample a Boltzmann machine:__ Implement Gibbs sampling to generate state configurations from a Boltzmann machine and understand the role of the inverse temperature parameter $\beta$.
> * __Compute the stationary distribution:__ Calculate the exact Boltzmann distribution for a small network by enumerating all configurations and computing the partition function.
> * __Compare empirical and theoretical distributions:__ Estimate the empirical distribution from samples and compare it to the theoretical stationary distribution to verify convergence.


Let's get started!
___

## Background: What is a Boltzmann Machine?
A [Boltzmann Machine](https://en.wikipedia.org/wiki/Boltzmann_machine) consists of a set of binary units (neurons, nodes, vertices, etc.) that are fully connected, with no self-connections. Formally, [a Boltzmann Machine](https://en.wikipedia.org/wiki/Boltzmann_machine) $\mathcal{B}$ is a fully connected _undirected weighted graph_ defined by the tuple $\mathcal{B} = \left(\mathcal{V},\mathcal{E}, \mathbf{W},\mathbf{b}, \mathbf{s}\right)$.
* __Units__: Each unit (vertex, node, neuron) $v_{i}\in\mathcal{V}$ has a binary state (`on` or `off`) and a bias value 
$b_{i}\in\mathbb{R}$, where $b_{i}$ is the bias of the node $v_{i}$. The bias vector $\mathbf{b}\in\mathbb{R}^{|\mathcal{V}|}$ is the vector of bias values for all nodes in the network. 
* __Edges__: Each edge $e\in\mathcal{E}$ has a weight. The weight of the edge connecting $v_{i}\in\mathcal{V}$ and $v_{j}\in\mathcal{V}$, is denoted by $w_{ij}\in\mathbf{W}$, where the weight matrix $\mathbf{W}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ is symmetric, i.e. $w_{ij} = w_{ji}$ and $w_{ii} = 0$ (no self loops). The weights $w_{ij}\in\mathbb{R}$ determine the strength of the connection between the two nodes. 
* __States__: The state of each node is represented by a binary vector $\mathbf{s}\in\mathbb{R}^{|\mathcal{V}|}$, where $s_{i}\in\{-1,1\}$ is the state of node $v_{i}$. When $s_{i} = 1$, the node is `on`, and when $s_{i} = -1$, the node is `off`. The set of all possible state _configurations_ is denoted by $\mathcal{S} \equiv \left\{\mathbf{s}^{(1)},\mathbf{s}^{(2)},\ldots,\mathbf{s}^{(N)}\right\}$, where $N$ is the number of possible state configurations, or $N = 2^{|\mathcal{V}|}$ for binary units.

Suppose we have values for the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the Boltzmann machine. One of the key questions we can ask is: how can we generate samples from this Boltzmann machine?

### Sampling algorithm
One of the key theoretical ideas of [the Boltzmann machine](https://en.wikipedia.org/wiki/Boltzmann_machine) is that the samples generated from it are distributed according to [the Boltzmann distribution](https://en.wikipedia.org/wiki/Boltzmann_distribution). Let's test this idea. 

To generate samples from a Boltzmann Machine, let us consider the following algorithm (Gibbs sampling): 

__Initialize__ the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the Boltzmann Machine. Provide an initial state $\mathbf{s}^{(0)}$ of the network, and a system temperature $\beta$.

For each turn $t=1,2,\dots,T$:
1. For each node $v_{i}\in\mathcal{V}$:
    1. Compute the total input $h_{i}^{(t)}$ to node $v_{i}$ using $h_{i}^{(t)} = \sum_{j\in\mathcal{V}} w_{ij}s_{j}^{(t-1)} + b_{i}$.
    2. Compute the probability of the _next_ state $s_{i}^{(t)} = 1$ using the logistic function $P(s_{i}^{(t)} = 1|h_{i}^{(t)}) = \left(1+\exp(-\beta\cdot{h}_{i}^{(t)})\right)^{-1}$ for node $v_{i}$. The probability of $s_{i}^{(t)} = -1$ is given by $P(s_{i}^{(t)} = -1|h_{i}^{(t)}) = 1 - P(s_{i}^{(t)} = 1|h_{i}^{(t)})$.
    3. Sample the _next_ state of node $v_{i}$ from a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) with parameter $p = P(s_{i}^{(t)} = 1|h_{i}^{(t)})$.
2. Store the state vector $\mathbf{s}^{(t)}$ of the network at turn $t$, and proceed to the next turn.


Let's implement this algorithm, and see what happens!
___

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants.

> The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Other packages are downloaded and then loaded.

In [1]:
include("Include.jl"); # load a bunch of libs, including the ones we need to work with images

__Constants__: Set some constants that we will use later. Please look at the comments in the code for more details on each constant's permissible values, units, etc. 

In [None]:
number_of_nodes = 3; # number of nodes in the system
β = 0.1; # inverse temperature parameter for the system (big: cold, small: hot)
number_of_turns = 10000; # number of turns that we take in the simulation

### Implementation
We need a helper function to compute the energy of a given state configuration. We've implemented this in the `energy(...)` method below.

> The `energy(...)` method computes the energy of a state configuration $\mathbf{s}$ using the formula $E(\mathbf{s}) = -\frac{1}{2}\mathbf{s}^{\top}\mathbf{W}\mathbf{s} - \mathbf{b}^{\top}\mathbf{s}$. The first term captures pairwise interactions between nodes (weighted by $\mathbf{W}$), and the second term captures the contribution of each node's bias.

In [3]:
function energy(model::MySimpleBoltzmannMachineModel, s::Vector{Int})::Float64

    # initialize -
    W = model.W; # weight matrix
    b = model.b; # bias vector
    energy = -(1/2)*dot(s, W*s) - dot(b, s); # compute the energy of the state

    # return -
    return energy;
end

energy (generic function with 1 method)

___

## Task 1: Estimate the stationary distribution for a small system
In this task, we sample the dynamics of a three-state [Boltzmann machine](https://en.wikipedia.org/wiki/Boltzmann_machine) and explore a key question: can we estimate the stationary distribution of the Boltzmann machine from sample? 

We'll do this with a straightforward three-node Boltzmann machine (small enough to compute all the configurations required by the partition function) using the Gibbs sampling algorithm described above.

First, let's set up our model of the Boltzmann machine with some random parameters that we hypothetically learned from data. For now, let's save the random weights in the `W::Array{Float64,2}` matrix and the random biases in the `b::Array{Float64,1}` vector.

In [4]:
W,b = let 
    
    # initialize some random weights and biases
    W = 2*randn(number_of_nodes, number_of_nodes);
    b = randn(number_of_nodes);

    # subract the mean from the weights (no self connections)
    W = W - diagm(diag(W));

    # return -
    W, b
end;

__Model__: Next, let's build a model of the test Boltzmann machine. We'll use [the `MySimpleBoltzmannMachineModel` struct](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/boltzmann/#VLDataScienceMachineLearningPackage.MySimpleBoltzmannMachineModel) to represent the machine; we build an instance of this type [using a `build(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/factory/). The struct will have `W::Array{Float64,2}` and `b::Array{Float64,1}` fields that we set when we build an instance of the model.

In [5]:
model = build(MySimpleBoltzmannMachineModel, (
    W = W,
    b = b,
));

Next, let's sample the dynamics of the Boltzmann machine using the Gibbs sampling algorithm described above. We've implemented the Gibbs sampling approach in [the `sample(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/boltzmann/#VLDataScienceMachineLearningPackage.sample-Tuple{MySimpleBoltzmannMachineModel,%20Vector{Int64}}) which implements the simple Gibbs sampling procedure described above.
> The [`sample(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/boltzmann/#VLDataScienceMachineLearningPackage.sample-Tuple{MySimpleBoltzmannMachineModel,%20Vector{Int64}}) takes a [`MySimpleBoltzmannMachineModel` instance](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/boltzmann/#VLDataScienceMachineLearningPackage.MySimpleBoltzmannMachineModel), an initial state vector `sₒ::Array{Int,1}`,the number of turns `T::Int`, and a system (inverse) temperature `β::Float64`. The method returns an array of samples `S::Array{Int,2}` of size `N` $\times$ `T,` where `N` is the number of nodes in the Boltzmann machine and `T` is the number of turns. 

What's in the `S::Array{Int,2}` array? Each column `S[:,t]` is the state vector of the network at turn `t`, i.e., `S[:,t] == s^(t)`.

In [None]:
S, energy_state_array = let

    # initialize the system
    N = 2^number_of_nodes; # how many configurations do we have
    energy_state = zeros(N); # energy of each state
    
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse # convert integer to binary state vector
        energy_state[i + 1] = energy(model, sᵢ); # calculate the energy of each state
    end
    
    start_energy_state = rand(1:N); # Heuristic: find the state with the minimum (maximum) or random energy
    sₒ = digits(start_energy_state - 1, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse # convert to -1,1
    S = VLDataScienceMachineLearningPackage.sample(model, sₒ, T = number_of_turns, β = β); # simulate the model 

    # return the data (we don't need the turn vector)``
    S,  energy_state;
end;

What is the lowest energy state of the Boltzmann machine defined by our (random) weights `W` and biases `b`? Let's look at the entries in the `energy_state_array` array:

In [7]:
energy_state_array

8-element Vector{Float64}:
 -0.3737561122973876
 -3.5158941104528116
 -0.5808517382568235
  0.8610056589990673
  1.7961132612247024
 -3.2117093805781067
  2.724459833876427
  2.300632587484933

__Check__: Let's verify how [the `digits(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.digits) converts an integer to a binary state vector. The lowest energy state corresponds to the minimum entry in the `energy_state_array` array. Suppose this occurs at index `i_min`. We can convert this index to a binary state vector using the `digits(...)` method as follows:

In [8]:
i_min = argmin(energy_state_array) - 1; # index of the minimum energy state
sᵢ = digits(i_min, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse
println("Minimum energy state: ", sᵢ, " with energy ", energy_state_array[i_min + 1], " at index ", i_min);

Minimum energy state: [-1, -1, 1] with energy -3.5158941104528116 at index 1
[-1, -1, 1] with energy -3.5158941104528116 at index 1


The low energy state is some distribution of the form `[s₁, s₂, s₃]`, where each `sᵢ` is either `-1` or `1`. This state corresponds to the configuration of the Boltzmann machine that has the lowest energy according to the weights and biases we defined earlier. However, should we expect to see this state often when we sample from the Boltzmann machine?

> __Not necessarily!__ While low energy states are more probable according to the Boltzmann distribution, the actual frequency of observing a particular state during sampling depends on several factors, including the temperature parameter $\beta$ and the structure of the energy landscape defined by the weights and biases. 
> * __High temperature__: At higher temperatures (lower $\beta$), the system is more likely to explore higher energy states, leading to a more uniform distribution of observed states. 
> * __Low temperature__: Conversely, at lower temperatures (higher $\beta$), the system tends to favor lower energy states, increasing the likelihood of observing the minimum energy state more frequently.
> 
> What do we see when we sample from our Boltzmann machine?

When we sample, how often do we see the various possible states? To answer this question, let's look at the sample matrix `S`. Each column represents the state of the network at a particular turn. We can count how many times this low energy state appears in our samples.

In [9]:
S

3×10000 Matrix{Int64}:
  1  1  -1  -1  -1  -1  -1  1  1   1  -1  …  1   1   1  1  -1  -1  -1  -1   1
 -1  1  -1  -1   1   1  -1  1  1  -1  -1     1  -1  -1  1  -1  -1   1  -1  -1
  1  1   1  -1   1   1   1  1  1  -1   1     1   1   1  1  -1   1  -1  -1  -1

What is the frequency of observing each state of the Boltzmann machine in our samples? Let's compute this frequency for all possible states.

In [10]:
let 

    N = 2^number_of_nodes; # number of possible states
    counts = zeros(Int, N); # visit counts per state
    rows = NamedTuple[];

    for j ∈ 1:size(S, 2)
        state = Int.(S[:, j]);
        idx = sum(div.(reverse(state) .+ 1, 2) .* (2 .^ (0:(number_of_nodes - 1))));
        counts[idx + 1] += 1;
    end

    # compute the table entries -
    frequencies = counts ./ size(S, 2);
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # map index to state vector
        push!(rows, (
            state = i,
            configuration = sᵢ,
            β = β,
            energy = energy_state_array[i + 1],
            frequency = frequencies[i + 1],
        ));
    end

    # show the table -
    pretty_table(
        rows;
        backend = :text,
        table_format = TextTableFormat(borders = text_table_borders__compact),
    );
end

 ------- --------------- ----- ----------- -----------
 [1m state [0m [1m configuration [0m [1m   β [0m [1m    energy [0m [1m frequency [0m
 ------- --------------- ----- ----------- -----------
      0    [-1, -1, -1]   0.1   -0.373756      0.1417
      1     [-1, -1, 1]   0.1    -3.51589      0.1682
      2     [-1, 1, -1]   0.1   -0.580852      0.1104
      3      [-1, 1, 1]   0.1    0.861006       0.129
      4     [1, -1, -1]   0.1     1.79611      0.1125
      5      [1, -1, 1]   0.1    -3.21171      0.1375
      6      [1, 1, -1]   0.1     2.72446      0.0956
      7       [1, 1, 1]   0.1     2.30063      0.1051
 ------- --------------- ----- ----------- -----------


### Things to think about
* __Question__: In the sample code block above, we have used a particular heuristic to select the initial state for the sampling. What are alternative approaches to initialize the sampling? Consider at least two different methods and their consequences.
* __Question__: How would you expect the frequency of observing a state to change if we increased or decreased the value of $\beta$? Why?
___

## Task 2: What is the stationary distribution?
In this task, let's explore the stationary distribution of the Boltzmann machine. Since this Boltzmann machine is small (three nodes), we can compute the stationary distribution exactly. We'll compare this exact stationary distribution to the empirical stationary distribution that we estimate from the samples generated in the previous section.

### Theory
After a _sufficiently large_ number of turns, the network configurations (state vectors) $\mathbf{s}^{(1)},\mathbf{s}^{(2)},\dots,$ of the Boltzmann Machine will converge to a _stationary distribution_ over the state configurations $\mathbf{s}\in\mathcal{S}$ which can be modeled as [a Boltzmann distribution](https://en.wikipedia.org/wiki/Boltzmann_distribution) of the form:
$$
P(\mathbf{s}) = \frac{1}{Z(\mathcal{S},\beta)}\exp\left(-\beta\cdot{E(\mathbf{s})}\right)
$$
where $E(\mathbf{s})$ is the energy of state $\mathbf{s}$, the $\beta$ is the (inverse) temperature of the system, and $Z(\mathcal{S},\beta)$ is the partition function. The energy of configuration $\mathbf{s}\in\mathcal{S}$ is given by:
$$
E(\mathbf{s}) = -\sum_{i\in\mathcal{V}} b_{i}s_{i} - \frac{1}{2}\sum_{i,j\in\mathcal{V}} w_{ij}s_{i}s_{j}
$$
where the first term is the energy associated with the bias terms, and the second term is the energy associated with the weights of the connections. The partition function $Z(\mathcal{S},\beta)$ is difficult to compute in practice; however, it is given by:
$$
Z(\mathcal{S},\beta) = \sum_{\mathbf{s}^{\prime}\in\mathcal{S}}\exp\left({-\beta\cdot{E}(\mathbf{s}^{\prime})}\right)
$$
where $\mathcal{S}$ is the set of _all possible network configurations_ of the Boltzmann Machine. 

> __Note__: The partition function $Z(\mathcal{S},\beta)$ is a normalizing constant that ensures that the probabilities sum to 1. However, for even a moderately sized system, the partition function is impossible to compute because it involves summing over all possible network configurations, which grows exponentially with the number of nodes. For example, in our case, that is $2^{n}$, where $n$ is the number of nodes in the network. For our simple three-node Boltzmann machine, the partition function will sum $2^{3} = 8$ states. 

Let's enumerate these $2^3 = 8$ states using [the `digits(...)` method](https://docs.julialang.org/en/v1/base/math/#Base.ndigits) and compute the partition function. The code below converts each integer $i \in \{0,1,\ldots,7\}$ to its binary representation and then maps it to the $\{-1,1\}$ encoding.

In [11]:
Z,configurations = let

    # initialize -
    Z = Dict{Int,Float64}();
    configurations = Dict{Int,Vector{Int}}();
    N = 2^number_of_nodes; # how many configurations do we have

    # loop throught each configuration
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse # convert integer to binary state vector
        Z[i] = exp(-2*β*energy(model, sᵢ)); # calculate the partition function
        configurations[i] = sᵢ; # store the configuration
    end

    # return -
    Z,configurations
end;

### Compute the _actual_ stationary distribution

Let's compute the stationary distribution of the Boltzmann Machine using the Boltzmann distribution. We'll compute the energy of each state configuration $\mathbf{s}\in\mathcal{S}$ and then compute the probability of each state configuration using the Boltzmann distribution. 

We'll save the probabilities in the `P::Dict{Int,Array{Float64,1}}` dictionary where the key is the state configuration and the value is the probability of the state configuration.

In [12]:
P = let
    
    # initialize -
    P = Dict{Int,Float64}();
    N = 2^number_of_nodes; # how many configurations do we have

    # what is the normalizing constant
    Z̄ = sum(values(Z)); # calculate the value of the partition function

    # loop through each configuration
    for i ∈ 0:(N - 1)
        P[i] = Z[i]/Z̄; # calculate the probability of each configuration
    end

    # return -
    P
end;

__Check__: Does the _actual_ stationary Boltzmann distribution sum to `1` (use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert)? If not, then we have a problem.

In [13]:
@assert sum(values(P)) ≈ 1.0 # if this fails: we get an AssertionError, otherwise nothing happens

__Estimate the empirical stationary distribution__: Next, compute the empirical estimate of the stationary distribution by analyzing the simulation samples. If we generate enough samples, the empirical distribution should be similar to the stationary distribution. 
> __Idea__: Compute the number of times a particular configuration $\mathbf{s}\in\mathcal{S}$ occurs in the simulation sample matrix $\mathbf{S}$ for each of the configurations, and then divide by the total number of samples to get the probability of each configuration. This gives us the _empirical_ distribution of the samples.

We'll save the empirical probabilities in the `P̂::Dict{Int, Array{Float64,1}}` dictionary, where the key is the state configuration index and the value is the empirical probability of that state configuration.

In [14]:
P̂ = let
   
    # initialize -
    P̂ = Dict{Int,Float64}();
    N = 2^number_of_nodes; # how many configurations do we have
    number_of_turns = size(S,2); # how many turns do we have

    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse # count by base 2, and convert to -1,1

        counter = 0;
        for j ∈ 1:number_of_turns
            if (S[:,j] == sᵢ)
                counter += 1;
            end
        end
        P̂[i] = counter/number_of_turns;
    end
    
    P̂
end;

__Check__: Does the empirical stationary Boltzmann distribution sum to `1` (use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert))? If not, then we have a problem.

In [15]:
@assert sum(values(P̂)) ≈ 1.0 # if this fails: we get an AssertionError, otherwise nothing happens

Unhide the code block below to see how we constructed the probability table for the actual and empirical stationary distributions.

> __What do we expect?__ If we generate enough samples, the empirical distribution $\hat{P}$ should converge to the actual Boltzmann distribution $P$. The ranking of states by probability should match between the two distributions.

Do we see what we expect?

In [18]:
let
   
    # initialize -
    df = DataFrame();
    N = 2^number_of_nodes; # how many configurations do we have

    # compute the ordinal rank -
    r = [P[i] for i ∈ 0:(N - 1)] |> x -> ordinalrank(x, rev = true);
    r̂ = [P̂[i] for i ∈ 0:(N - 1)] |> x -> ordinalrank(x, rev = true);

    # main -
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1|> reverse # count by base 2, and convert to -1,1
        row_df = (
            i = i,
            configuration = sᵢ,
            energy = energy(model, sᵢ),
            β = β,
            P = P[i],
            P̂ = P̂[i],
            r = r[i+1],
            r̂ = r̂[i+1],
        )
        push!(df, row_df)
    end
    
    pretty_table(
        df;
        fit_table_in_display_horizontally = false,
        backend = :text,
        table_format = TextTableFormat(borders = text_table_borders__compact),
    );
end

 ------- --------------- ----------- --------- ----------- --------- ------- -------
 [1m     i [0m [1m configuration [0m [1m    energy [0m [1m       β [0m [1m         P [0m [1m       P̂ [0m [1m     r [0m [1m     r̂ [0m
 [90m Int64 [0m [90m Vector{Int64} [0m [90m   Float64 [0m [90m Float64 [0m [90m   Float64 [0m [90m Float64 [0m [90m Int64 [0m [90m Int64 [0m
 ------- --------------- ----------- --------- ----------- --------- ------- -------
      0    [-1, -1, -1]   -0.373756       0.1    0.121448    0.1417       4       2
      1     [-1, -1, 1]    -3.51589       0.1    0.227675    0.1682       1       1
      2     [-1, 1, -1]   -0.580852       0.1    0.126584    0.1104       3       6
      3      [-1, 1, 1]    0.861006       0.1   0.0948729     0.129       5       4
      4     [1, -1, -1]     1.79611       0.1     0.07869    0.1125       6       5
      5      [1, -1, 1]    -3.21171       0.1    0.214237    0.1375       2       3
      6      [1, 1

### Things to think about
* __Question__: What happens to the probability of the states as we change the system (inverse) temperature $\beta$, i.e., do we see different behavior for (cool) $\beta\gg{1}$ versus (hot) $\beta\ll{1}$ systems?
* __Question__: In the gradient ascent training algorithm, the new step is given by: $\Delta{w_{ij}} = \eta\left(\langle{x_{i}x_{j}}\rangle_{\mathbf{X}} - \langle{s_{i}s_{j}}\rangle_{\mathbf{S}}\right)$. What is your interpretation of this, and how would we compute this update?

___

## Summary
This lab explored sampling and stationary distributions in a small Boltzmann machine.

> __Key Takeaways__
>
> * **Boltzmann distribution:** After sufficient sampling, the state configurations of a Boltzmann machine converge to a Boltzmann distribution where lower-energy states have higher probability.
> * **Partition function limitation:** Computing the exact stationary distribution requires summing over all $2^n$ configurations, which is intractable for large networks.
> * **Training challenge:** Each training iteration requires sampling until the network reaches its stationary distribution, making training computationally expensive.

The need to reach stationarity at each training step motivates the development of restricted Boltzmann machines and contrastive divergence.
___