In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

Energy-based models are a special class of neural networks. The simplest energy model is the Hopfield Network, first introduced by Hopfield in 1982, which is usually viewed as a form of recurrent neural network.

The Hopfield network is a fully connected neural network with binary thresholding units whose values are either 0 or 1. The units are connected in a "recurrent" way: every connection between two neurons is bidirectional.

With this setting, the energy of a Hopfield network is defined as:

$$
E = - \sum_{i} s_i b_i - \sum_{i,j} s_i s_j w_{ij}
$$

where $s_i$ is the state of unit $i$, $b_i$ denotes its bias, and $w_{ij}$ denotes the bidirectional weight connecting units $i$ and $j$.
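The energy definition can be sketched numerically. The snippet below is a minimal illustration, not a full Hopfield implementation: the network size, the random weights, and the helper names `hopfield_energy` and `async_update` are assumptions made for the example. It also assumes that the double sum counts each pair of units once and that a unit switches on when its total input is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of units (illustrative)

# Symmetric weights with zero diagonal (no self-feedback); values are arbitrary.
W = np.triu(rng.normal(size=(n, n)), k=1)
W = W + W.T
b = rng.normal(size=n)

def hopfield_energy(s, W, b):
    # 0.5 * s^T W s counts each unordered pair (i, j) exactly once
    # because W is symmetric with a zero diagonal.
    return -b @ s - 0.5 * s @ W @ s

def async_update(s, W, b, rng):
    # Visit the units one at a time in random order; a unit switches on
    # exactly when its total input b_i + sum_j w_ij s_j is positive.
    s = s.copy()
    for i in rng.permutation(len(s)):
        s[i] = 1.0 if b[i] + W[i] @ s > 0 else 0.0
    return s

s = rng.integers(0, 2, size=n).astype(float)
energies = [hopfield_energy(s, W, b)]
for _ in range(5):
    s = async_update(s, W, b, rng)
    energies.append(hopfield_energy(s, W, b))

print(energies)
```

Under these assumptions each single-unit update can only keep or lower the energy, so the printed sequence should be non-increasing.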


The Boltzmann machine and the Hopfield network share the following common features:

- Their processing units have binary values (say +1 and −1) for their states.
- All the synaptic connections between their units are symmetric.
- The units are picked at random and one at a time for updating.
- They have no self-feedback.

There are three important differences between the Boltzmann machine and the Hopfield network:

- (a) The Boltzmann machine permits the use of hidden neurons, while no such neurons exist in the Hopfield network.
- (b) The Boltzmann machine uses stochastic neurons with a probabilistic firing mechanism, whereas the standard Hopfield network uses neurons with a deterministic firing mechanism.
- (c) The Boltzmann machine may also be trained with a probabilistic form of supervision, whereas the Hopfield network operates in an unsupervised manner.

The above common features and important differences help in providing a better understanding of the following discussion on the Boltzmann machine.

The Boltzmann machine defines its energy function in the same way as the Hopfield network, except that it splits the energy function according to hidden units and visible units:

$$
E(x, h) = -b^T x - c^T h - h^T Wx - x^T Ux - h^T Vh,
$$

where \( b \) and \( c \) are the offsets associated with the input \( x \) (visible vector) and the hidden output \( h \) (hidden vector), respectively, while \( W \), \( U \), and \( V \) are the weight matrices of hidden-visible units, visible-visible units, and hidden-hidden units, respectively.

With an observed part (denoted by \( x \)) and a hidden part \( h \), the energy-based probability distribution is defined as:

$$
P(x, h) = \frac{e^{-E(x,h)}}{Z},
$$

where the normalizing factor, obtained by summing over the visible and hidden spaces,

$$
Z = \sum_{x,h} e^{-E(x,h)}
$$

is called the partition function by analogy with physical systems.


Because only \( x \) is observed, only the marginal distribution is of interest:

$$
p(x) = \frac{1}{Z} \sum_h e^{-E(x,h)}.
$$
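For a very small machine, the joint distribution, the partition function, and the marginal can all be computed by direct enumeration. The sketch below assumes binary 0/1 units, random parameters, and strictly upper-triangular \( U \) and \( V \) (so that no unit connects to itself). The sizes and helper names are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, F = 3, 2  # numbers of visible and hidden units (illustrative)

b = rng.normal(size=m)
c = rng.normal(size=F)
W = rng.normal(size=(F, m))                 # hidden-visible weights
U = np.triu(rng.normal(size=(m, m)), k=1)   # visible-visible weights
V = np.triu(rng.normal(size=(F, F)), k=1)   # hidden-hidden weights

def energy(x, h):
    # E(x, h) = -b^T x - c^T h - h^T W x - x^T U x - h^T V h
    return -(b @ x) - (c @ h) - (h @ W @ x) - (x @ U @ x) - (h @ V @ h)

def binary_states(n):
    # All 2**n binary vectors of length n
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]

xs, hs = binary_states(m), binary_states(F)

# Partition function: sum over every joint configuration
Z = sum(np.exp(-energy(x, h)) for x in xs for h in hs)

def p_joint(x, h):
    return np.exp(-energy(x, h)) / Z

def p_marginal(x):
    # Sum the hidden units out of the joint distribution
    return sum(np.exp(-energy(x, h)) for h in hs) / Z

total = sum(p_marginal(x) for x in xs)
print(Z, total)
```

The enumeration visits \( 2^{m+F} \) configurations, so it is only feasible for toy sizes. It is useful mainly as a reference when checking approximate methods.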

Substituting the partition function \( Z \) into this equation yields

$$
p(x) = \frac{\sum_h e^{-E(x,h)}}{\sum_{\bar{x},h} e^{-E(\bar{x},h)}},
$$

whose log-likelihood form is given by

$$
\log p(x) = \log \sum_h e^{-E(x,h)} - \log \sum_{\bar{x},h} e^{-E(\bar{x},h)}.
$$

Then, by letting \( \theta = (b, c, W, U, V) \) denote the parameters of the Boltzmann model, one has:

$$
\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \log \sum_h e^{-E(x,h)}}{\partial \theta} - \frac{\partial \log \sum_{\bar{x},h} e^{-E(\bar{x},h)}}{\partial \theta}.
$$

This can be rewritten as:

$$
\frac{\partial \log p(x)}{\partial \theta} = - \sum_h \frac{e^{-E(x,h)}}{\sum_h e^{-E(x,h)}} \frac{\partial E(x, h)}{\partial \theta} + \sum_{\bar{x},h} \frac{e^{-E(\bar{x},h)}}{\sum_{\bar{x},h} e^{-E(\bar{x},h)}} \frac{\partial E(\bar{x}, h)}{\partial \theta}.
$$
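
As a worked instance of this expression, take \( \theta = W \), with \( W \) having one row per hidden unit as in the term \( h^T W x \). Then \( \partial E(x, h) / \partial W = -h x^T \). The first ratio of exponentials is the conditional \( p(h \mid x) \) and the second is the joint \( p(\bar{x}, h) \), so

$$
\frac{\partial \log p(x)}{\partial W} = \sum_h p(h \mid x)\, h x^T - \sum_{\bar{x},h} p(\bar{x}, h)\, h \bar{x}^T = \mathbb{E}_{p(h \mid x)}\left[ h x^T \right] - \mathbb{E}_{p(\bar{x}, h)}\left[ h \bar{x}^T \right].
$$

The first expectation is taken with the visible units clamped to the observed \( x \), while the second is taken under the model's own joint distribution.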

Let \( s = \begin{pmatrix} x \\ h \end{pmatrix} \) denote all the units in the Boltzmann machine. Then the energy function \( E(x, h) \) can be rewritten as:

$$
E(s) = -\begin{pmatrix} b^T & c^T \end{pmatrix} \begin{pmatrix} x \\ h \end{pmatrix} - \begin{pmatrix} x^T & h^T \end{pmatrix} \begin{pmatrix} U & O \\ W & V \end{pmatrix} \begin{pmatrix} x \\ h \end{pmatrix} = -d^T s - s^T A s,
$$

where:

$$
d = \begin{pmatrix} b \\ c \end{pmatrix}, \quad A = \begin{pmatrix} U & O \\ W & V \end{pmatrix}.
$$
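
The equivalence of the split form and the stacked form of the energy can be checked numerically. The following sketch uses random parameters and a random binary state. The sizes and variable names (`E_split`, `E_stacked`) are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, F = 4, 3  # numbers of visible and hidden units (illustrative)

b, c = rng.normal(size=m), rng.normal(size=F)
W = rng.normal(size=(F, m))   # hidden-visible weights
U = rng.normal(size=(m, m))   # visible-visible weights
V = rng.normal(size=(F, F))   # hidden-hidden weights

x = rng.integers(0, 2, size=m).astype(float)
h = rng.integers(0, 2, size=F).astype(float)

# Split form: E(x, h) = -b^T x - c^T h - h^T W x - x^T U x - h^T V h
E_split = -(b @ x) - (c @ h) - (h @ W @ x) - (x @ U @ x) - (h @ V @ h)

# Stacked form: E(s) = -d^T s - s^T A s with s = (x; h)
d = np.concatenate([b, c])
A = np.block([[U, np.zeros((m, F))],
              [W, V]])
s = np.concatenate([x, h])
E_stacked = -(d @ s) - (s @ A @ s)

print(E_split, E_stacked)
```

The two printed values should agree up to floating-point rounding.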


In [1]:
import numpy as np

class BoltzmannMachine:
    def __init__(self, visible_size, hidden_size):
        self.visible_size = visible_size
        self.hidden_size = hidden_size

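        # Initialize weights and biases.
        # Only the hidden-visible weights W are modeled here; the U and V terms
        # of the general energy are omitted, so this class behaves like an RBM.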
        self.W = np.random.randn(hidden_size, visible_size)
        self.b = np.random.randn(visible_size)
        self.c = np.random.randn(hidden_size)

    def energy(self, v, h):
        """
        Compute the energy of the configuration (v, h)
        """
        term1 = -np.dot(self.b, v)
        term2 = -np.dot(self.c, h)
        term3 = -np.dot(h, np.dot(self.W, v))
        return term1 + term2 + term3

    def probability_h_given_v(self, v):
        """
        Compute the probability of hidden units given visible units
        """
        return 1 / (1 + np.exp(-np.dot(self.W, v) - self.c))

    def sample_h_given_v(self, v):
        """
        Sample hidden units given visible units
        """
        p_h_given_v = self.probability_h_given_v(v)
        return (np.random.rand(self.hidden_size) < p_h_given_v).astype(float)

    def probability_v_given_h(self, h):
        """
        Compute the probability of visible units given hidden units
        """
        return 1 / (1 + np.exp(-np.dot(self.W.T, h) - self.b))

    def sample_v_given_h(self, h):
        """
        Sample visible units given hidden units
        """
        p_v_given_h = self.probability_v_given_h(h)
        return (np.random.rand(self.visible_size) < p_v_given_h).astype(float)

    def train(self, data, learning_rate=0.1, epochs=1000):
        """
        Train the Boltzmann Machine using Contrastive Divergence
        """
        for epoch in range(epochs):
            for v0 in data:
                v0 = v0.astype(float)
                h0 = self.sample_h_given_v(v0)

                # Gibbs sampling
                vk = self.sample_v_given_h(h0)
                hk = self.sample_h_given_v(vk)

                # Update weights and biases
                self.W += learning_rate * (np.outer(h0, v0) - np.outer(hk, vk))
                self.b += learning_rate * (v0 - vk)
                self.c += learning_rate * (h0 - hk)

            if epoch % 100 == 0:
                print(f"Epoch {epoch}")

    def reconstruct(self, v):
        """
        Reconstruct the visible units from hidden units
        """
        h = self.sample_h_given_v(v)
        return self.sample_v_given_h(h)

# Example usage:
visible_size = 6  # Number of visible units
hidden_size = 3   # Number of hidden units
bm = BoltzmannMachine(visible_size, hidden_size)

# Generate some example data (6 visible units)
data = np.random.randint(0, 2, (10, visible_size))

# Train the Boltzmann Machine
bm.train(data, learning_rate=0.1, epochs=1000)

# Reconstruct a sample from the data
sample = data[0]
reconstructed_sample = bm.reconstruct(sample)

print("Original sample:     ", sample)
print("Reconstructed sample:", reconstructed_sample)


Epoch 0
Epoch 100
Epoch 200
Epoch 300
Epoch 400
Epoch 500
Epoch 600
Epoch 700
Epoch 800
Epoch 900
Original sample:      [0 0 1 0 0 1]
Reconstructed sample: [0. 0. 1. 0. 0. 1.]


# Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM) is a version of the Boltzmann machine with an added restriction: there are no connections between visible units and none between hidden units. The RBM was originally called the Harmonium when it was introduced by Smolensky in 1986 [139].

Figure 7.8 shows a comparison between the restricted Boltzmann machine and the Boltzmann machine.

The RBM is a two-layer, bipartite, undirected graphical model with a set of binary hidden units \( h \), a set of (binary or real-valued) visible units \( x \), and symmetric connections between these two layers represented by a weight matrix \( W \). The probabilistic semantics of an RBM is given by the joint distribution \( p(x, h) \), which is defined through its energy function:

$$
p(x, h) = \frac{e^{-E(x,h)}}{Z},
$$

where 

$$
Z = \sum_{\bar{x},\bar{h}} e^{-E(\bar{x},\bar{h})}
$$

is the partition function, a sum over all joint configurations of the visible and hidden units (an integral over \( \bar{x} \) when the visible units are real-valued).

Consider a training set of binary vectors which are assumed to be binary images. The training set can be modeled using a two-layer network called an RBM in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. Because states of the pixels are observed, these pixels correspond to “visible” units of the RBM, while the feature detectors correspond to “hidden” units. Without connections either between visible units or between hidden units, the matrices \( U \) and \( V \) in the Boltzmann machine energy function are both zero.

Let \( x_i \), \( i = 1, \ldots, m \) and \( h_j \), \( j = 1, \ldots, F \) be the observed visible variables and the binary values of hidden (latent) variables, respectively. Then the energy function of the Boltzmann machine reduces to that of the restricted Boltzmann machine [65]:

$$
E(x, h) = -\sum_{i=1}^m b_i x_i - \sum_{j=1}^F c_j h_j - \sum_{i,j} x_i W_{ij} h_j = -x^T Wh - b^T x - c^T h,
$$

where \( x_i \) and \( h_j \) are the binary states of visible unit \( i \) and hidden unit \( j \), \( b_i \) and \( c_j \) are their biases, and \( W_{ij} \) is the weight between them, while \( b = [b_i] \) is the visible unit bias vector and \( c = [c_j] \) is the hidden unit bias vector.

In regression problems, the visible units \( x \) are real-valued, and the energy function is defined as:

$$
E(x, h) = \frac{1}{2} \sum_{i=1}^m x_i^2 - \sum_{i=1}^m \sum_{j=1}^F x_i W_{ij} h_j - \sum_{i=1}^m b_i x_i - \sum_{j=1}^F c_j h_j = \frac{1}{2} x^T x - x^T Wh - b^T x - c^T h.
$$

Clearly, the hidden units are conditionally independent of one another given the visible layer, and vice versa. In particular, the units of a binary layer (conditioned on the other layer) are independent Bernoulli random variables, and if the visible layer is real-valued then the visible units (conditioned on the hidden layer) are Gaussian with diagonal covariance [92].

Therefore, a tractable expression for the conditional probability for RBM can be readily obtained as [8]:

$$
p(h|x) = \frac{p(x, h)}{p(x)} = \frac{\exp(b^T x + c^T h + x^T Wh)}{\sum_{\bar{h}} \exp(b^T x + c^T \bar{h} + x^T W \bar{h})} = \prod_{j=1}^F P(h_j |x),
$$

Here \( W = [w_1, \ldots, w_F] \in \mathbb{R}^{m \times F} \), so \( w_j \) denotes the \( j \)-th column of the weight matrix.
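
For the real-valued case, the Gaussian form of the visible conditional can be seen by completing the square in the energy function with the \( \frac{1}{2} x^T x \) term. Keeping only the factors that depend on \( x \):

$$
p(x|h) \propto \exp\left( -\frac{1}{2} x^T x + x^T (Wh + b) \right) \propto \exp\left( -\frac{1}{2} \left\| x - (b + Wh) \right\|^2 \right),
$$

so that \( x \,|\, h \sim \mathcal{N}(b + Wh, I) \). Each visible unit is therefore Gaussian with mean \( b_i + \sum_{j=1}^F W_{ij} h_j \) and unit variance, independently of the other visible units.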


# Activation Probabilities in RBM

By the special structure of the RBM (connections exist between layers but not within a layer), the activation states of the hidden units are conditionally independent given the states of the visible units. The activation probability of the \( j \)-th hidden unit is then given by

$$
p(h_j = 1|x) = \frac{e^{c_j + w_j^T x}}{1 + e^{c_j + w_j^T x}} = \sigma(c_j + w_j^T x),
$$

where \( \sigma(z) \) is the logistic sigmoid function \( \frac{1}{1 + \exp(-z)} \).
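
The sigmoid expression can be compared against a brute-force computation of \( p(h|x) \) for a tiny RBM. The sketch below assumes binary 0/1 units and random parameters, with \( W \in \mathbb{R}^{m \times F} \) as in the text. The names `brute` and `closed_form` are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
m, F = 4, 3  # numbers of visible and hidden units (illustrative)
W = rng.normal(size=(m, F))   # column j is w_j
b = rng.normal(size=m)
c = rng.normal(size=F)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(x, h):
    # E(x, h) = -x^T W h - b^T x - c^T h
    return -(x @ W @ h) - (b @ x) - (c @ h)

x = rng.integers(0, 2, size=m).astype(float)
hs = [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=F)]

# Brute force: p(h | x) = exp(-E(x, h)) / sum over all hidden configurations
unnorm = np.array([np.exp(-energy(x, h)) for h in hs])
p_h_given_x = unnorm / unnorm.sum()

# Marginal activation probability p(h_j = 1 | x) for each hidden unit j
brute = sum(p * h for p, h in zip(p_h_given_x, hs))

# Closed form from the factorization: sigma(c_j + w_j^T x)
closed_form = sigmoid(c + W.T @ x)

print(brute)
print(closed_form)
```

Both vectors should match, which reflects that the sum over \( 2^F \) hidden configurations factorizes into \( F \) independent two-term sums.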

Since \( x \) and \( h \) play a symmetric role in the energy function, a similar derivation gives the conditional probability of \( x \) given \( h \):

$$
p(x|h) = \prod_{i=1}^m p(x_i |h).
$$

Because the structure of the RBM is symmetric, the activation states of the visible units are likewise conditionally independent given the states of the hidden units. That is, the activation probability of the \( i \)-th visible unit is given by

$$
p(x_i = 1|h) = \sigma(b_i + \tilde{w}_i h),
$$

where \( \tilde{w}_i \) is the \( i \)-th row of \( W \in \mathbb{R}^{m \times F} \).

Consider a probability distribution over a vector \( x \) and with parameters \( W \) [15]:

$$
p(x; W) = \frac{1}{Z(W)} e^{-E(x; W)},
$$

where 

$$
Z(W) = \sum_x e^{-E(x; W)}
$$

is a normalization constant and \( E(x; W) \) is an energy function.

Maximum-likelihood (ML) learning of the parameters \( W \) given an i.i.d. sample \( X = \{x_n\}_{n=1}^N \) can be performed by gradient ascent:

$$
W^{(t+1)} = W^{(t)} + \eta \left. \frac{\partial L(W; X)}{\partial W} \right|_{W = W^{(t)}},
$$

where the learning rate \( \eta \) need not be constant, and the average log-likelihood is

$$
L(W; X) = \frac{1}{N} \sum_{n=1}^N \log p(x_n; W).
$$

This can be expressed as

$$
L(W; X) = \mathbb{E}_{P_0}[\log p(x; W)] = -\mathbb{E}_{P_0}[E(x; W)] - \log Z(W),
$$

where \( \mathbb{E}_{P_0} \) denotes an average with respect to the data distribution, i.e., \( P_0(x) = \frac{1}{N} \sum_{n=1}^N \delta(x - x_n) \).

A well-known difficulty arises in the computation of the gradient:

$$
\frac{\partial L(W; X)}{\partial W} = - \mathbb{E}_{P_0} \left[ \frac{\partial E(x; W)}{\partial W} \right] + \mathbb{E}_{P_\infty} \left[ \frac{\partial E(x; W)}{\partial W} \right],
$$

where \( \mathbb{E}_{P_\infty} \) denotes an average with respect to the model distribution \( P_\infty(x; W) = p(x; W) \). The average \( \mathbb{E}_{P_0} \) is readily computed using the sample data \( X = \{x_n\}_{n=1}^N \), but the average \( \mathbb{E}_{P_\infty} \) involves the normalization constant \( Z(W) \), which cannot generally be computed efficiently (being a sum of an exponential number of terms).
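
To make the two expectations concrete, the sketch below evaluates the gradient exactly for a tiny binary RBM and compares it with central finite differences of the average log-likelihood. It assumes that \( E(x; W) \) is the RBM free energy (hidden units summed out analytically), and it uses random toy data. All sizes and helper names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
m, F = 4, 2  # numbers of visible and hidden units (illustrative)
W = rng.normal(scale=0.5, size=(m, F))
b = rng.normal(scale=0.5, size=m)
c = rng.normal(scale=0.5, size=F)
X = rng.integers(0, 2, size=(20, m)).astype(float)  # toy data set

# Every visible configuration: 2**m rows
all_x = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(x, W):
    # E(x; W) = -b^T x - sum_j log(1 + exp(c_j + w_j^T x)), one value per row of x
    return -(x @ b) - np.logaddexp(0.0, x @ W + c).sum(axis=1)

def avg_log_likelihood(W):
    log_Z = np.log(np.exp(-free_energy(all_x, W)).sum())
    return -free_energy(X, W).mean() - log_Z

def mean_energy_grad(x, weights, W):
    # Weighted average of dE/dW = -x sigma(c + W^T x)^T over the rows of x
    return -(x * weights[:, None]).T @ sigmoid(x @ W + c)

p_data = np.full(len(X), 1.0 / len(X))     # empirical distribution P_0
p_model = np.exp(-free_energy(all_x, W))   # model distribution P_infinity
p_model /= p_model.sum()

grad = -mean_energy_grad(X, p_data, W) + mean_energy_grad(all_x, p_model, W)

# Central finite differences of the average log-likelihood
eps = 1e-6
fd = np.zeros_like(W)
for i in range(m):
    for j in range(F):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fd[i, j] = (avg_log_likelihood(Wp) - avg_log_likelihood(Wm)) / (2 * eps)

print(np.max(np.abs(grad - fd)))
```

The model term requires all \( 2^m \) visible configurations, which is exactly the cost that becomes prohibitive at realistic sizes.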


In [2]:
import numpy as np

class RBM:
    def __init__(self, visible_size, hidden_size):
        self.visible_size = visible_size
        self.hidden_size = hidden_size
        
        # Initialize weights and biases
        self.W = np.random.randn(hidden_size, visible_size) * 0.1
        self.b = np.zeros(visible_size)  # Bias for visible units
        self.c = np.zeros(hidden_size)  # Bias for hidden units

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sample_prob(self, probs):
        return (np.random.rand(*probs.shape) < probs).astype(float)

    def energy(self, v, h):
        term1 = -np.dot(self.b, v)
        term2 = -np.dot(self.c, h)
        term3 = -np.dot(h, np.dot(self.W, v))
        return term1 + term2 + term3

    def probability_h_given_v(self, v):
        return self.sigmoid(np.dot(self.W, v) + self.c)

    def sample_h_given_v(self, v):
        p_h_given_v = self.probability_h_given_v(v)
        return self.sample_prob(p_h_given_v)

    def probability_v_given_h(self, h):
        return self.sigmoid(np.dot(self.W.T, h) + self.b)

    def sample_v_given_h(self, h):
        p_v_given_h = self.probability_v_given_h(h)
        return self.sample_prob(p_v_given_h)

    def train(self, data, learning_rate=0.1, epochs=1000):
        for epoch in range(epochs):
            for v0 in data:
                v0 = v0.astype(float)
                h0 = self.sample_h_given_v(v0)

                # Gibbs sampling
                vk = self.sample_v_given_h(h0)
                hk = self.sample_h_given_v(vk)

                # Update weights and biases
                self.W += learning_rate * (np.outer(h0, v0) - np.outer(hk, vk))
                self.b += learning_rate * (v0 - vk)
                self.c += learning_rate * (h0 - hk)

            if epoch % 100 == 0:
                print(f"Epoch {epoch}")

    def reconstruct(self, v):
        h = self.sample_h_given_v(v)
        return self.sample_v_given_h(h)

# Example usage:
visible_size = 6  # Number of visible units
hidden_size = 3   # Number of hidden units
rbm = RBM(visible_size, hidden_size)

# Generate some example data (6 visible units)
data = np.random.randint(0, 2, (10, visible_size))

# Train the RBM
rbm.train(data, learning_rate=0.1, epochs=1000)

# Reconstruct a sample from the data
sample = data[0]
reconstructed_sample = rbm.reconstruct(sample)

print("Original sample:     ", sample)
print("Reconstructed sample:", reconstructed_sample)


Epoch 0
Epoch 100
Epoch 200
Epoch 300
Epoch 400
Epoch 500
Epoch 600
Epoch 700
Epoch 800
Epoch 900
Original sample:      [1 0 0 1 1 1]
Reconstructed sample: [1. 0. 0. 0. 1. 0.]


In [3]:
import numpy as np

class RBM:
    def __init__(self, num_visible, num_hidden):
        self.num_visible = num_visible
        self.num_hidden = num_hidden
        self.W = np.random.normal(0, 0.1, size=(num_visible, num_hidden))  # Weight matrix
        self.b = np.zeros(num_visible)  # Visible biases
        self.c = np.zeros(num_hidden)   # Hidden biases
    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))
    def gibbs_sampling(self, visible_data, k=1):
        # Positive phase: hidden probabilities driven by the data (requires k >= 1 below)
        h0_prob = self.sigmoid(np.dot(visible_data, self.W) + self.c)

        h0_sample = np.random.binomial(1, h0_prob)  # Sample hidden units
        h = h0_sample

        # Negative phase: k steps of alternating Gibbs sampling
        for _ in range(k):
            v_prob = self.sigmoid(np.dot(h, self.W.T) + self.b)  # Visible unit probabilities
            v_sample = np.random.binomial(1, v_prob)  # Sample visible units
            h_prob = self.sigmoid(np.dot(v_sample, self.W) + self.c)  # Hidden unit probabilities
            h = np.random.binomial(1, h_prob)  # Sample hidden units

        return h0_prob, v_sample, h_prob
    def train(self, data, learning_rate=0.1, epochs=100, batch_size=10, k=1):
        num_samples = data.shape[0]
        for epoch in range(epochs):
            np.random.shuffle(data)  # Note: shuffles the caller's array in place

            for i in range(0, num_samples, batch_size):
                v0 = data[i:i+batch_size]
                h0_prob, v_sample, hk_prob = self.gibbs_sampling(v0, k)

                # Contrastive divergence: data statistics minus reconstruction statistics
                positive_grad = np.dot(v0.T, h0_prob)
                negative_grad = np.dot(v_sample.T, hk_prob)
                # Divide by the actual batch size (the last batch may be smaller)
                self.W += learning_rate * (positive_grad - negative_grad) / v0.shape[0]
                self.b += learning_rate * np.mean(v0 - v_sample, axis=0)
                self.c += learning_rate * np.mean(h0_prob - hk_prob, axis=0)

            if epoch % 10 == 0:
                print(f"Epoch {epoch+1} complete. Free energy: {self.free_energy(data)}")
    def free_energy(self, v):
        vbias_term = np.dot(v, self.b)
        wx_b = np.dot(v, self.W) + self.c
        hidden_term = np.sum(np.log(1 + np.exp(wx_b)), axis=1)
        return -hidden_term - vbias_term
# Example usage
num_visible = 6
num_hidden = 3
rbm = RBM(num_visible, num_hidden)

# Assuming data is your training dataset, shape (num_samples, num_visible)
data = np.random.binomial(1, 0.5, size=(100, num_visible))  # Dummy data

rbm.train(data, learning_rate=0.1, epochs=100, batch_size=10, k=1)


Epoch 1 complete. Free energy: [-2.37217511 -1.82369575 -2.14785729 -1.99671449 -2.50263147 -2.18177309
 -2.44321332 -1.94103867 -2.11201328 -2.28434181 -2.06644901 -2.15552803
 -2.11267439 -2.23073405 -2.4197584  -1.94103867 -2.4197584  -2.14785729
 -2.44321332 -2.12469088 -2.51474191 -2.09957989 -2.34233973 -2.1525991
 -2.51474191 -2.04749584 -2.06644901 -2.43289915 -2.20489043 -2.25331079
 -1.89755048 -2.26671518 -2.22392148 -2.336069   -1.89755048 -2.44321332
 -1.82369575 -2.336069   -2.21187651 -2.23073405 -2.199998   -2.4529002
 -2.30215921 -2.06644901 -2.44321332 -2.51474191 -2.28434181 -2.37217511
 -2.28122929 -2.06644901 -2.23934631 -2.05381314 -2.22392148 -2.11201328
 -2.07340392 -2.25371458 -2.36145258 -2.07340392 -2.26101945 -2.31375481
 -2.51474191 -2.44321332 -2.4197584  -2.04075127 -2.00741408 -2.16906751
 -2.06644901 -2.51474191 -2.25331079 -2.04749584 -2.199998   -2.36145258
 -2.34233973 -1.89755048 -2.4529002  -2.08211914 -2.20489043 -2.1525991
 -2.09957989 -2.0074140

Epoch 81 complete. Free energy: [-2.24853498 -3.14470964 -2.95292487 -3.42655974 -3.3754568  -2.21385148
 -3.17373161 -3.17373161 -2.90374414 -3.29020792 -3.29020792 -3.48248228
 -3.20087321 -3.04049135 -2.20032075 -2.35148868 -2.44882955 -2.54035234
 -2.58548206 -3.43167431 -2.30581279 -2.98336591 -2.25789856 -3.22431538
 -2.99929723 -3.43167431 -3.15058009 -2.35148868 -2.21385148 -3.29020792
 -3.00974922 -3.23019369 -2.76098209 -2.75226712 -2.47971349 -2.59780499
 -3.18392765 -3.04049135 -2.34080007 -2.96086711 -2.54035234 -2.96086711
 -2.30581279 -3.43167431 -2.47971349 -2.54035234 -2.20032075 -2.35148868
 -3.23362687 -2.39850077 -2.27185071 -2.99929723 -3.29020792 -2.70361184
 -2.99929723 -3.48248228 -2.29348696 -2.58548206 -3.15058009 -2.07350881
 -3.42655974 -3.29020792 -3.09412047 -2.70361184 -3.18392765 -2.80933798
 -2.78595923 -2.07350881 -3.23362687 -2.99929723 -3.43167431 -2.94288928
 -3.22431538 -2.99929723 -2.27185071 -2.95292487 -2.44882955 -2.29348696
 -2.30581279 -3.240