# L10b: Training a Boltzmann Machine

___
In this lab, we will train a _small_ Boltzmann machine on some simple datasets. 

## Tasks
Before we get started, we'll quickly review modern Hopfield networks. Then, you'll execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand, and let's get those fixed!

* __Task 1: Training a Boltzmann Machine__: We'll review the training goal and the gradient ascent algorithm that adjusts the weights $\mathbf{W}$ and biases $\mathbf{b}$ so that the stationary distribution of the machine matches the distribution of the training patterns.
* __Task 2: Compute the Stationary Distribution__: We'll build a three-node Boltzmann machine with random parameters, simulate it, and compare the _actual_ stationary (Boltzmann) distribution with the _empirical_ distribution estimated from the simulation samples.
* __Task 3: Estimate the Boltzmann machine parameters__: We'll estimate the parameters of the Boltzmann machine.

Let's get started!
___

In [1]:
include("Include.jl"); # Include the required packages and codes from Include.jl

## Task 1: Training a Boltzmann Machine
Suppose we have a collection of patterns $\mathbf{X} = \left\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\ldots,\mathbf{x}^{(m)}\right\}$, where $\mathbf{x}^{(i)}\in\mathbb{R}^{|\mathcal{V}|}$ is a binary vector of size $|\mathcal{V}|$ and $m$ is the number of patterns. We want to learn the parameters of the Boltzmann Machine $\mathcal{B}$ such that the stationary distribution of the Boltzmann Machine matches the distribution of the training patterns $\mathbf{X}$.

* __Goal__: The goal of training the Boltzmann Machine is to learn the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network such that the stationary distribution of the Boltzmann Machine matches the distribution of the training patterns in the dataset $\mathbf{X}$.
* __Gradient ascent__: The learning algorithm for the Boltzmann Machine is based on gradient ascent. The idea is to adjust the weights and biases of the network in the direction of the gradient of the log-likelihood of the training patterns. This will maximize the likelihood of observing the training patterns given the weights and biases of the network.
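
As a sketch of why this gradient has a simple form, assume the model assigns each configuration the Boltzmann probability $P(\mathbf{s}) = \exp\left(-\beta\cdot{E(\mathbf{s})}\right)/Z(\mathcal{S},\beta)$ described in Task 2, with symmetric weights $w_{ij} = w_{ji}$ and no self-connections (these are assumptions of this sketch). The average log-likelihood of the training patterns is then:
$$
\mathcal{L}(\mathbf{W},\mathbf{b}) = \frac{1}{m}\sum_{k=1}^{m}\ln P(\mathbf{x}^{(k)}) = -\beta\cdot\langle{E(\mathbf{x})}\rangle_{\mathbf{X}} - \ln Z(\mathcal{S},\beta)
$$
Differentiating with respect to a weight and a bias, and using $\partial\ln Z/\partial{w_{ij}} = \beta\langle{s_{i}s_{j}}\rangle_{P}$ and $\partial\ln Z/\partial{b_{i}} = \beta\langle{s_{i}}\rangle_{P}$, gives:
$$
\frac{\partial\mathcal{L}}{\partial{w_{ij}}} = \beta\left(\langle{x_{i}x_{j}}\rangle_{\mathbf{X}} - \langle{s_{i}s_{j}}\rangle_{P}\right)\qquad
\frac{\partial\mathcal{L}}{\partial{b_{i}}} = \beta\left(\langle{x_{i}}\rangle_{\mathbf{X}} - \langle{s_{i}}\rangle_{P}\right)
$$
where $\langle\cdot\rangle_{\mathbf{X}}$ is an average over the training patterns and $\langle\cdot\rangle_{P}$ is an average under the model distribution. With $\beta = 1$, each gradient is the difference between a data-dependent and a model-dependent expectation. In practice, the model-dependent expectation is estimated from simulation samples.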

### Training Algorithm
The training algorithm for the Boltzmann Machine maximizes the log-likelihood of observing the training patterns $\mathbf{x}^{(i)}\in\mathbf{X}$ given the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network. The algorithm is given by:

__Initialize__: the weights $\mathbf{W}$ and biases $\mathbf{b}$ of the network to some initial guess, e.g., using the Hopfield network Hebbian learning rule. Set the learning rate $\eta$, the inverse temperature $\beta = 1$, and the number of turns $T$. Precompute the data-dependent expectations $\langle{x_{i}x_{j}}\rangle_{\mathbf{X}}$ and $\langle{x_{i}}\rangle_{\mathbf{X}}$ using every training pattern $\mathbf{x}^{(i)}\in\mathbf{X}$.

1. Simulate the Boltzmann Machine $\mathcal{B}$ until it becomes stationary (or for a fixed number of turns $T$). Then, generate a set of stationary samples $\mathbf{S} = \left\{\mathbf{s}^{(1)},\mathbf{s}^{(2)},\ldots,\mathbf{s}^{(m)}\right\}$.
2. Compute the model-dependent expectations $\langle{s_{i}s_{j}}\rangle_{\mathbf{S}}$ and $\langle{s_{i}}\rangle_{\mathbf{S}}$ using the stationary samples $\mathbf{s}^{(i)}\in\mathbf{S}$.
3. Update the weights of the network using the following update rule: $w_{ij}^{\prime} = w_{ij} + \Delta{w_{ij}}$ where $\Delta{w_{ij}} = \eta\left(\langle{x_{i}x_{j}}\rangle_{\mathbf{X}} - \langle{s_{i}s_{j}}\rangle_{\mathbf{S}}\right)$. The hyperparameter $\eta$ is the learning rate, $\langle{x_{i}x_{j}}\rangle_{\mathbf{X}}$ is the data-dependent expectation, and $\langle{s_{i}s_{j}}\rangle_{\mathbf{S}}$ is the model-dependent expectation. The update rule is applied to all weights in the network, i.e., $\forall i,j\in\mathcal{V}$.
4. Update the biases of the network using the following update rule: $b_{i}^{\prime} = b_{i} + \Delta{b_{i}}$ where $\Delta{b_{i}} = \eta\left(\langle{x_{i}}\rangle_{\mathbf{X}} - \langle{s_{i}}\rangle_{\mathbf{S}}\right)$. The hyperparameter $\eta$ is the learning rate, $\langle{x_{i}}\rangle_{\mathbf{X}}$ is the data-dependent expectation, and $\langle{s_{i}}\rangle_{\mathbf{S}}$ is the model-dependent expectation. The update rule is applied to all biases in the network, i.e., $\forall i\in\mathcal{V}$.
5. Repeat steps 1-4 until convergence (or for a fixed number of iterations). 
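
The steps above can be sketched in a few lines of Julia. This is a minimal illustration under stated assumptions, not the lab's implementation: the functions `gibbs_sweep!`, `sample_model`, and `train_sketch` are hypothetical names, patterns are assumed to take values in $\{-1,1\}$ and to be stored one per column, the sampler is a plain Gibbs sampler with an assumed burn-in period, and the initial guess is all zeros rather than the Hebbian rule.

```julia
using LinearAlgebra, Random, Statistics

# One Gibbs sweep: resample each node given the others, with s ∈ {-1,1}^n
function gibbs_sweep!(s, W, b, β, rng)
    for i in eachindex(s)
        h = dot(view(W, i, :), s) + b[i]     # local field (W has a zero diagonal)
        p_up = 1 / (1 + exp(-2 * β * h))     # P(s[i] = +1 | all other nodes)
        s[i] = rand(rng) < p_up ? 1 : -1
    end
    return s
end

# Step 1: simulate the machine and keep T samples (one per sweep) after a burn-in
function sample_model(W, b; T = 1000, burn_in = 100, β = 1.0, rng = Random.default_rng())
    n = length(b)
    s = rand(rng, [-1, 1], n)                # random initial configuration
    S = zeros(Int, n, T)
    for t in 1:(burn_in + T)
        gibbs_sweep!(s, W, b, β, rng)
        t > burn_in && (S[:, t - burn_in] .= s)
    end
    return S
end

# Steps 2-5: gradient ascent on the log-likelihood; X holds one pattern per column
function train_sketch(X; η = 0.05, iterations = 200, T = 1000, rng = Random.default_rng())
    n, m = size(X)
    W = zeros(n, n); b = zeros(n)            # assumed initial guess (not the Hebbian rule)
    xx_data = (X * X') ./ m                  # data-dependent ⟨xᵢxⱼ⟩
    x_data = vec(mean(X, dims = 2))          # data-dependent ⟨xᵢ⟩
    for _ in 1:iterations
        S = sample_model(W, b; T = T, rng = rng)
        ss_model = (S * S') ./ T             # model-dependent ⟨sᵢsⱼ⟩
        s_model = vec(mean(S, dims = 2))     # model-dependent ⟨sᵢ⟩
        W .+= η .* (xx_data .- ss_model)     # weight update
        W[diagind(W)] .= 0.0                 # keep no self-connections
        b .+= η .* (x_data .- s_model)       # bias update
    end
    return W, b
end

# Toy data (made up): nodes 1 and 2 always agree, node 3 always disagrees with them
X = [1 -1 1 -1; 1 -1 1 -1; -1 1 -1 1]
W, b = train_sketch(X; iterations = 50, T = 200, rng = MersenneTwister(1))
```

On this toy data we would expect the learned `W[1,2]` to be positive and `W[1,3]` to be negative, because the update only ever pushes those weights toward the correlations seen in the data. A fixed number of iterations stands in for a convergence check here.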

## Task 2: Compute the Stationary Distribution
In this task, let's review the dynamics of the Boltzmann machine that we are trying to learn, and explore a key question: can we compute the stationary distribution of the Boltzmann machine? We'll use a very simple _three-node_ Boltzmann machine (which is small enough to enumerate all the possible configurations). 

First, let's set up our model of the Boltzmann machine with some random parameters that we will learn in the next task. We'll save the random weights in the `W::Array{Float64,2}` matrix and the random biases in the `b::Array{Float64,1}` vector.

In [2]:
W,b = let 

    number_of_nodes = 3; # number of nodes in the system
    
    # initialize some random weights and biases
    W = 10*randn(number_of_nodes, number_of_nodes);
    b = randn(number_of_nodes);

    # symmetrize the weights (the Boltzmann distribution assumes wᵢⱼ = wⱼᵢ) and zero the diagonal (no self-connections)
    W = (W + W')/2;
    W = W - diagm(diag(W));

    # return -
    W, b
end;

Next, let's build a model of the test Boltzmann machine. We'll use [the `MySimpleBoltzmannMachineModel` struct](src/Types.jl) to represent the machine, and we'll build an instance of this type [using a `build(...)` method](src/Factory.jl). The struct has `W::Array{Float64,2}` and `b::Array{Float64,1}` fields that we set when we build an instance of the model.

In [3]:
model = build(MySimpleBoltzmannMachineModel, (
    W = W,
    b = b,
));

Set some constants that we will use later.

In [14]:
number_of_nodes = 3; # number of nodes in the system
β = 0.01; # inverse temperature parameter for the system
number_of_turns = 1000; # number of turns that we take in the simulation

### Sample the test model
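We start the simulation from the lowest-energy configuration. The code below enumerates all $2^{n}$ configurations, computes the energy of each one, and picks the configuration with the minimum energy as the initial state $\mathbf{s}_{\circ}$. We then call the `simulate(...)` method to run the machine for `number_of_turns` turns at inverse temperature `β`, and store the visited configurations as the columns of the `S` matrix.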

In [15]:
S = let

    # initialize the system
    N = 2^number_of_nodes; # how many configurations do we have
    energy_state = zeros(N); # energy of each state
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # count by base 2, and convert to -1,1
        energy_state[i + 1] = energy(model, sᵢ); # calculate the energy of each state
    end
    min_energy_state = argmin(energy_state); # find the state with the minimum energy
    sₒ = digits(min_energy_state - 1, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # convert to -1,1
    S = simulate(model, sₒ, T = number_of_turns, β = β); # simulate the model 

    # return the data (we don't need the turn vector)
    S;
end;

In [16]:
S

3×1000 Matrix{Int64}:
 1  -1  1  1  -1   1  -1   1  -1  1  1  …  1  -1   1  -1  -1   1   1  -1  -1
 1  -1  1  1   1  -1   1   1  -1  1  1     1  -1  -1   1  -1  -1   1   1   1
 1   1  1  1   1   1  -1  -1   1  1  1     1  -1   1  -1   1   1  -1   1   1

### What is the stationary distribution?
After a _sufficiently large_ number of turns, the network configurations (state vectors) $\mathbf{s}^{(1)},\mathbf{s}^{(2)},\dots,$ of the Boltzmann Machine will converge to a _stationary distribution_ over the state configurations $\mathbf{s}\in\mathcal{S}$ which can be modeled as [a Boltzmann distribution](https://en.wikipedia.org/wiki/Boltzmann_distribution) of the form:
$$
P(\mathbf{s}) = \frac{1}{Z(\mathcal{S},\beta)}\exp\left(-\beta\cdot{E(\mathbf{s})}\right)
$$
where $E(\mathbf{s})$ is the energy of state $\mathbf{s}$, $\beta$ is the (inverse) temperature of the system, and $Z(\mathcal{S},\beta)$ is the partition function. The energy of configuration $\mathbf{s}\in\mathcal{S}$ is given by:
$$
E(\mathbf{s}) = -\sum_{i\in\mathcal{V}} b_{i}s_{i} - \frac{1}{2}\sum_{i,j\in\mathcal{V}} w_{ij}s_{i}s_{j}
$$
where the first term is the energy associated with the bias terms, and the second term is the energy associated with the weights of the connections. The partition function $Z(\mathcal{S},\beta)$ is difficult to compute in practice; however, it is given by:
$$
Z(\mathcal{S},\beta) = \sum_{\mathbf{s}^{\prime}\in\mathcal{S}}\exp\left({-\beta\cdot{E}(\mathbf{s}^{\prime})}\right)
$$
where $\mathcal{S}$ is the set of _all possible network configurations_ of the Boltzmann Machine. 
* __Hmmm...__? The partition function $Z(\mathcal{S},\beta)$ is a normalizing constant that ensures that the probabilities sum to 1. However, for even a moderately sized system, the partition function is intractable to compute because it sums over all possible network configurations, and the number of configurations grows exponentially with the number of nodes: a network with $n$ binary nodes has $2^{n}$ configurations. For our simple three-node Boltzmann machine, the partition function sums over only $2^{3} = 8$ states.
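
To make the energy and partition function formulas concrete, here is a small self-contained sketch that enumerates every configuration of a toy three-node machine. The helpers `energy_sketch`, `configuration`, and `boltzmann_distribution` are hypothetical names for this illustration, as are the toy weights and biases; `energy_sketch` is only assumed to mirror the energy formula above and may differ from the `energy(...)` method that the lab code calls.

```julia
using LinearAlgebra

# Hypothetical helper: E(s) = -Σᵢ bᵢsᵢ - (1/2) Σᵢⱼ wᵢⱼsᵢsⱼ
energy_sketch(W, b, s) = -dot(b, s) - 0.5 * dot(s, W * s)

# Map an index i ∈ 0:(2^n - 1) to a configuration in {-1,1}^n
# (binary digits of i, most significant bit first, with 0 → -1 and 1 → +1)
configuration(i, n) = reverse(2 .* digits(i, base = 2, pad = n) .- 1)

# Brute-force Boltzmann distribution: only feasible for small n (2^n terms)
function boltzmann_distribution(W, b, β)
    n = length(b)
    factors = [exp(-β * energy_sketch(W, b, configuration(i, n))) for i in 0:(2^n - 1)]
    Z = sum(factors)                 # partition function
    return factors ./ Z, Z           # probabilities and normalizing constant
end

# Toy parameters (made up): symmetric weights with a zero diagonal
W_toy = [0.0 1.0 -0.5; 1.0 0.0 0.25; -0.5 0.25 0.0]
b_toy = [0.1, -0.2, 0.0]
p, Z_toy = boltzmann_distribution(W_toy, b_toy, 1.0)
println("Z = ", Z_toy, ", sum of probabilities = ", sum(p))
```

The loop inside `boltzmann_distribution` has $2^{n}$ terms, so doubling the number of nodes squares the amount of work. That is why this direct approach only suits very small machines.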

In [17]:
Z,configurations = let

    # initialize -
    Z = Dict{Int,Float64}();
    configurations = Dict{Int,Vector{Int}}();
    N = 2^number_of_nodes; # how many configurations do we have

    # loop through each configuration
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # count by base 2, and convert to -1,1
        Z[i] = exp(-2*β*energy(model, sᵢ)); # Boltzmann factor of configuration i (one term of the partition function); note the extra factor of 2 relative to exp(-β⋅E) above
        configurations[i] = sᵢ; # store the configuration
    end

    # return -
    Z,configurations
end;

__Compute the _actual_ stationary distribution__: Let's compute the stationary distribution of the Boltzmann Machine using the Boltzmann distribution. We'll compute the energy of each state configuration $\mathbf{s}\in\mathcal{S}$ and then compute the probability of each state configuration using the Boltzmann distribution. 

We'll save the probabilities in the `P::Dict{Int,Float64}` dictionary where the key is the state configuration index and the value is the probability of that state configuration.

In [18]:
P = let
    
    # initialize -
    P = Dict{Int,Float64}();
    N = 2^number_of_nodes; # how many configurations do we have

    # what is the normalizing constant
    Z̄ = sum(values(Z)); # calculate the value of the partition function

    # loop through each configuration
    for i ∈ 0:(N - 1)
        P[i] = Z[i]/Z̄; # calculate the probability of each configuration
    end

    # return -
    P
end;

__Check__: Does the _actual_ stationary Boltzmann distribution sum to `1` (use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert))? If not, then we have a problem.

In [19]:
@assert sum(values(P)) ≈ 1.0 # if this fails: we get an AssertionError, otherwise nothing happens

__Estimate the _empirical_ stationary distribution__: Next, let's compute the _empirical_ estimate of the stationary distribution by analyzing the simulation samples. If we generated enough samples, then the _empirical_ distribution should be _similar_ to the stationary distribution. 
* __Idea__: Count the number of times each configuration $\mathbf{s}\in\mathcal{S}$ occurs among the columns of the simulation sample matrix $\mathbf{S}$, and then divide by the total number of samples to get the probability of each configuration. This gives us the _empirical_ distribution of the samples.

We'll save the _empirical_ probabilities in the `P̂::Dict{Int,Float64}` dictionary where the key is the state configuration index and the value is the _empirical_ probability of that state configuration.
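
The counting idea can also be sketched by mapping each sample back to its configuration index, which needs only one pass over the samples. This is an illustrative alternative and not the lab's code: `configuration_index`, `empirical_distribution`, and the toy matrix `S_toy` are hypothetical, and the sketch assumes the same encoding as the lab code, in which the first node is the most significant bit.

```julia
# Hypothetical helper: map a configuration in {-1,1}^n to its index in 0:(2^n - 1)
# (first entry is the most significant bit; -1 → 0 and +1 → 1)
function configuration_index(s::AbstractVector{<:Integer})
    index = 0
    for value in s
        index = 2 * index + (value == 1 ? 1 : 0)
    end
    return index
end

# Empirical distribution: fraction of columns of S equal to each configuration
function empirical_distribution(S::AbstractMatrix{<:Integer})
    n, T = size(S)
    counts = zeros(Int, 2^n)
    for j in 1:T
        counts[configuration_index(view(S, :, j)) + 1] += 1   # +1: Julia arrays are 1-based
    end
    return counts ./ T
end

# Toy sample matrix (made up): each column is one sampled configuration
S_toy = [1 -1 1 1; 1 -1 1 -1; 1 1 1 -1]
p_hat = empirical_distribution(S_toy)
```

Here `p_hat[i + 1]` holds the empirical probability of configuration index `i`. In the toy matrix, configuration `[1, 1, 1]` (index 7) appears in two of the four columns, so `p_hat[8]` is `0.5`.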

In [20]:
P̂ = let
   
    # initialize -
    P̂ = Dict{Int,Float64}();
    N = 2^number_of_nodes; # how many configurations do we have
    number_of_turns = size(S,2); # how many turns do we have

    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # count by base 2, and convert to -1,1

        @show (i,sᵢ)

        counter = 0;
        for j ∈ 1:number_of_turns
            if (S[:,j] == sᵢ)
                counter += 1;
            end
        end
        P̂[i] = counter/number_of_turns;
    end
    
    P̂
end;

(i, sᵢ) = (0, [-1, -1, -1])
(i, sᵢ) = (1, [-1, -1, 1])
(i, sᵢ) = (2, [-1, 1, -1])
(i, sᵢ) = (3, [-1, 1, 1])
(i, sᵢ) = (4, [1, -1, -1])
(i, sᵢ) = (5, [1, -1, 1])
(i, sᵢ) = (6, [1, 1, -1])
(i, sᵢ) = (7, [1, 1, 1])


In [21]:
S[:,2] 

3-element Vector{Int64}:
 -1
 -1
  1

__Check__: Does the empirical stationary Boltzmann distribution sum to `1` (use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert))? If not, then we have a problem.

In [22]:
@assert sum(values(P̂)) ≈ 1.0 # if this fails: we get an AssertionError, otherwise nothing happens

`Unhide` the code block below to see how we constructed the probability table for the actual and empirical stationary distributions.

In [23]:
let
   
    # initialize -
    df = DataFrame();
    N = 2^number_of_nodes; # how many configurations do we have

    # compute the ordinal rank -
    r = [P[i] for i ∈ 0:(N - 1)] |> x -> ordinalrank(x, rev = true);
    r̂ = [P̂[i] for i ∈ 0:(N - 1)] |> x -> ordinalrank(x, rev = true);

    # main -
    for i ∈ 0:(N - 1)
        sᵢ = digits(i, base = 2, pad = number_of_nodes) |> x -> 2*x .- 1 |> reverse; # count by base 2, and convert to -1,1
        row_df = (
            i = i,
            configuration = sᵢ,
            energy = energy(model, sᵢ),
            P = P[i],
            P̂ = P̂[i],
            r = r[i+1],
            r̂ = r̂[i+1],
        )
        push!(df, row_df)
    end
    
    pretty_table(df, tf = tf_simple);
end

      i   configuration      energy           P         P̂       r       r̂
  Int64   Vector{Int64}     Float64     Float64   Float64   Int64   Int64
      0    [-1, -1, -1]    -11.9083    0.152885     0.112       2       7
      1     [-1, -1, 1]   -0.116409    0.120765     0.125       6       5
      2     [-1, 1, -1]     22.9447   0.0761439     0.124       8       6
      3      [-1, 1, 1]     -6.2228    0.136452     0.138       3       1
      4     [1, -1, -1]    -5.30093     0.13396     0.111       4       8
      5      [1, -1, 1]     22.8746   0.0762507     0.131       7       2
      6      [1, 1, -1]    -4.74352    0.132474     0.131       5       3
      7       [1, 1, 1]    -17.5274    0.171069     0.128       1       4


## Task 3: Estimate the Boltzmann machine parameters
Fill me in.