## Computation of the gradient

From the earlier discussion, the MPF cost function is

$$ K(\theta) \approx \frac{1}{|\mathcal{D}|}\sum_{y\in \mathcal{D}}\sum_{h=1}^{n}\exp\bigg\{\delta_h (Wy)_h - \frac{1}{4} \operatorname{diag}(W)_h\bigg\}$$

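and, with $W$ symmetric so that $W_{ij}$ and $W_{ji}$ are the same parameter, the gradient of the cost function with respect to $W_{ij}$ is

$$ \frac{\partial K(\theta)}{\partial W_{ij}} \approx \frac{1}{|\mathcal{D}|}\sum_{y\in \mathcal{D}} \begin{cases}\delta_iy_jk_i+\delta_jy_ik_j & i \neq j\\ \left(\delta_iy_i-\frac{1}{4}\right)k_i & i = j\\ \end{cases}$$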

where $k_h = \exp\bigg\{\delta_h (Wy)_h - \frac{1}{4} \operatorname{diag}(W)_h\bigg\}$ and $\delta_h = \frac{1}{2} - y_h$, matching `delta = 1/2 - x` in the code below. We now work out how to compute this gradient explicitly in Python.
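
As a sketch of where the two cases come from, assume (as the two-term off-diagonal case suggests) that $W$ is symmetric and that $W_{ij}$ and $W_{ji}$ count as one parameter. For a single sample $y$ write $E_h = \delta_h (Wy)_h - \frac{1}{4}W_{hh}$, so that $k_h = \exp(E_h)$. The bracket $[\cdot]$ is 1 when its condition holds and 0 otherwise; this notation is introduced only for this sketch.

$$ \frac{\partial E_h}{\partial W_{ij}} = \begin{cases}\delta_h\left([h=i]\,y_j + [h=j]\,y_i\right) & i \neq j\\ [h=i]\left(\delta_iy_i-\frac{1}{4}\right) & i = j\\ \end{cases}$$

$$ \sum_{h=1}^{n}\frac{\partial k_h}{\partial W_{ij}} = \sum_{h=1}^{n}k_h\frac{\partial E_h}{\partial W_{ij}} = \begin{cases}\delta_iy_jk_i+\delta_jy_ik_j & i \neq j\\ \left(\delta_iy_i-\frac{1}{4}\right)k_i & i = j\\ \end{cases}$$

Averaging this over the samples in $\mathcal{D}$ gives the gradient of $K(\theta)$. If each entry of $W$ were instead treated as an independent parameter, only the $\delta_iy_jk_i$ term would survive in the off-diagonal case.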

We start by recalling some definitions:
- $s$: the samples, stored so that each row is one sample and each column corresponds to one unit of the restricted Boltzmann machine; there are $n$ units.
- $W$: the parameter matrix to be learnt, of size $n \times n$.

Let $E$ be the energy matrix whose entry for sample $y$ and unit $h$ is $\delta_h (Wy)_h - \frac{1}{4} \operatorname{diag}(W)_h$, so that $k = \exp(E)$ elementwise. We compute the $\delta_ik_i$ terms as the elementwise product $\delta * k$. Multiplying the transpose of $\delta * k$ by $s$ then gives the $\delta_iy_jk_i$ terms, summed over the samples. We call the resulting matrix $D'$:

$$ D'_{ij} = \sum_{y\in \mathcal{D}} \begin{cases}\delta_iy_jk_i & i \neq j\\ \delta_iy_ik_i & i = j\\ \end{cases}$$

We extract the diagonal of $D'$ as $C$ and add the missing $-\frac{1}{4}k_i$ term by subtracting from $C$ a quarter of the column sums of $k$ (that is, $k$ summed over the samples). To form the $\delta_iy_jk_i + \delta_jy_ik_j$ terms, we set the diagonal of $D'$ to zero, call the result $D''$, and add the transpose of $D''$ to itself. We obtain the desired gradient matrix by filling the empty diagonal of $D'' + D''^\top$ with $C$ and dividing by the number of samples.
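
The construction above can be checked against the case formula on a small example. The following is a minimal sketch: the toy sizes, the random binary samples and the names `G` and `G_loop` are assumptions made only for this illustration, not part of the notebook's data. It omits the final division by the number of samples, since that affects both versions equally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 4, 5                        # toy sizes, chosen only for illustration
s = rng.integers(0, 2, size=(num_samples, n)).astype(float)  # random binary samples
U = rng.normal(scale=1.0 / n, size=(n, n))
W = 0.5 * (U + U.T)                          # symmetric parameter matrix

delta = 0.5 - s                              # delta = 1/2 - y
E = delta * (s @ W) - 0.25 * np.diag(W)      # E[m, h] for sample m and unit h
k = np.exp(E)

# vectorised construction described in the text
D = (delta * k).T @ s                        # D'[i, j] = sum over samples of delta_i y_j k_i
C = np.diag(D) - 0.25 * k.sum(axis=0)        # (delta_i y_i - 1/4) k_i, summed over samples
np.fill_diagonal(D, 0.0)                     # D'' : D' with its diagonal zeroed
G = D + D.T                                  # off-diagonal two-term sums
np.fill_diagonal(G, C)                       # put the diagonal terms back

# the same quantity written directly from the case formula, one entry at a time
G_loop = np.zeros((n, n))
for y, d, kk in zip(s, delta, k):
    for i in range(n):
        for j in range(n):
            if i == j:
                G_loop[i, j] += (d[i] * y[i] - 0.25) * kk[i]
            else:
                G_loop[i, j] += d[i] * y[j] * kk[i] + d[j] * y[i] * kk[j]

print(np.allclose(G, G_loop))
```

The explicit loops are far slower than the matrix products, so they are useful only as a reference for small $n$.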


In [1]:
import numpy as np
import matplotlib.pyplot as plt
# import theano
# import theano.tensor as T


import os
import timeit
from datetime import datetime
from mpfntutils import load_data

from numpy.linalg import norm


%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots

In [2]:
def unravelparam(theta, units = 16):
    """
    Restores a vector of parameters into matrix form.
    """
    W = theta.reshape(units, units)
    return W


def ravelparam(W):
    """
    Ravels the parameters into a vector.
    """
    theta = W.ravel()
    return theta

In [3]:
units = 16
U = np.random.normal(loc = 0, scale = 1/units, size = (units, units))
W = 0.5 * (U + U.T)

In [4]:
samples = load_data('16-50K.npy')

In [5]:
def Kcost(x, W, temperature = 1):
    """
    Returns the cost and its gradient, computed by using the diagonals of W as the bias.
    Inputs:
    - x: samples used to train W, one sample per row.
    - W: weights between the neurons of the Boltzmann Machine (BM).
    - temperature: keep it as 1 until cost grows too big then raise temperature.
    """
    num_samples = x.shape[0]
    num_units = x.shape[1]
    delta = 1/2 - x
    diag = np.diag(W)[:, None].T
    E = delta * np.dot(x, W) - .25 * diag

    # k = exp(E / temperature); the cost is the sum of k averaged over the samples
    k = np.exp(1/temperature * E)
    cost = np.sum(k) / num_samples

    # computation of gradient

    D = np.dot((delta * k).T, x) # delta_i y_j k_i terms from the product of (delta * k)^T and the samples

    C = np.zeros((num_units,)) # initialize diagonal gradients

    np.copyto(C, np.diag(D)) # extract the delta_i y_i k_i terms for later use

    np.fill_diagonal(D, 0) # set the diagonal of D to zero

    C = C - .25 * np.sum(k, axis = 0) # evaluate (delta_i y_i - .25) k_i

    D = D + D.T # form the delta_i y_j k_i + delta_j y_i k_j terms for i not equal to j

    np.fill_diagonal(D, C) # fill the diagonal back into the gradient matrix

    # the factor 1/temperature comes from differentiating exp(E / temperature)
    return cost, D / (temperature * num_samples)

Tidied version of Kcost, which should give the correct cost and gradient:

In [None]:
def Kcost(x, W, temperature = 1):
    """
    Returns the cost and its gradient, computed by using the diagonals of W as the bias.
    Inputs:
    - x: samples used to train W, one sample per row.
    - W: weights between the neurons of the Boltzmann Machine (BM).
    - temperature: keep it as 1 until cost grows too big then raise temperature.
    """
    num_samples = x.shape[0]
    delta = 1/2 - x
    diag = np.diag(W)[:, None].T
    E = delta * np.dot(x, W) - .25 * diag

    k = np.exp(1/temperature * E)
    cost = np.sum(k) / num_samples
    D = np.dot((delta * k).T, x)
    C = np.diag(D) - .25 * np.sum(k, axis = 0)  # new array, so zeroing D below does not affect it
    np.fill_diagonal(D, 0)
    D = D + D.T
    np.fill_diagonal(D, C)

    # the factor 1/temperature comes from differentiating exp(E / temperature)
    return cost, D / (temperature * num_samples)

In [6]:
cost, Wgrad = Kcost(samples, W)

In [7]:
print (cost)
print (Wgrad.shape)

15.962510117
(16, 16)


# Need to find the problem with numgrad
- need to resolve the dimension problem in cell 21 (done)
- the diagonal entries of the gradient are correct but the off-diagonal entries do not match numgrad; see cell 16.

In [8]:
def computeNumericalGradient(J, W):
    """
    Central-difference estimate of the gradient of J, perturbing one entry of W at a time.
    """
    EPSILON = 0.0001
    shape = W.shape # remember the shape of the initial matrix

    theta = ravelparam(W) # W is a n x n matrix, ravel it to a n **2 vector
    numgrad = np.zeros(theta.shape) # numgrad will be a n **2 dimensional vector
    num_para = theta.shape[0]

    # loop over the parameters to compute the gradient for each parameter
    for i in range(num_para):
        e = np.zeros(num_para) # perturbation that touches parameter i only
        e[i] = EPSILON
        p = (theta + e).reshape(shape)
        m = (theta - e).reshape(shape)
        numgrad[i] = (J(p) - J(m)) / (2 * EPSILON)

    return numgrad.reshape(shape)

In [9]:
numgrad = computeNumericalGradient(lambda x: Kcost(samples, x)[0], W)

In [10]:
print (numgrad.shape)

(16, 16)


In [11]:
diff = norm(numgrad-Wgrad)/norm(numgrad+Wgrad)
print (diff)

0.344519137321


In [12]:
print (Wgrad[0:2,:6])
print (numgrad[0:2,:6])

[[-0.6736862  -0.58469385 -0.41706637  0.16497182  0.10624317 -0.78368   ]
 [-0.58469385 -0.61328071 -0.34259382  0.10182234  0.08588656 -0.64657365]]
[[-0.6736862  -0.26373654 -0.17178938  0.24298219  0.20236892 -0.42030855]
 [-0.32095731 -0.61328071 -0.16204418  0.19074741  0.17205835 -0.38150326]]


In [14]:
Wgrad[2,3] == Wgrad[3,2]

True

In [16]:
numgrad[2,3] - numgrad[3,2]

0.24404546683953754
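
One possible reading of these outputs, offered as a hypothesis rather than a confirmed diagnosis: `computeNumericalGradient` perturbs each entry $W_{ij}$ on its own, whereas the analytic gradient treats $W_{ij}$ and $W_{ji}$ as a single shared parameter. The printed values are consistent with this, since `numgrad[0,1] + numgrad[1,0]` is $-0.26373654 - 0.32095731 = -0.58469385$, which equals `Wgrad[0,1]`. The sketch below compares the two kinds of perturbation on synthetic data. The helper names `cost_and_grad`, `numgrad_entrywise` and `numgrad_symmetric`, and the random samples, are assumptions made for illustration only. `cost_and_grad` mirrors `Kcost` with the temperature fixed at 1.

```python
import numpy as np

def cost_and_grad(x, W):
    """Cost and analytic gradient, built as in Kcost with temperature 1."""
    num_samples = x.shape[0]
    delta = 0.5 - x
    k = np.exp(delta * (x @ W) - 0.25 * np.diag(W))
    D = (delta * k).T @ x
    C = np.diag(D) - 0.25 * k.sum(axis=0)
    np.fill_diagonal(D, 0.0)
    G = D + D.T
    np.fill_diagonal(G, C)
    return k.sum() / num_samples, G / num_samples

def numgrad_entrywise(J, W, eps=1e-4):
    """Central differences, perturbing each W[i, j] on its own."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            e = np.zeros_like(W)
            e[i, j] = eps
            g[i, j] = (J(W + e) - J(W - e)) / (2 * eps)
    return g

def numgrad_symmetric(J, W, eps=1e-4):
    """Central differences, perturbing W[i, j] and W[j, i] together."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            e = np.zeros_like(W)
            e[i, j] = eps
            e[j, i] = eps   # same entry when i == j, so the diagonal is unchanged
            g[i, j] = (J(W + e) - J(W - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4                                           # toy size, for illustration only
x = rng.integers(0, 2, size=(50, n)).astype(float)
U = rng.normal(scale=1.0 / n, size=(n, n))
W = 0.5 * (U + U.T)

J = lambda M: cost_and_grad(x, M)[0]
_, G = cost_and_grad(x, W)
g_entry = numgrad_entrywise(J, W)
g_sym = numgrad_symmetric(J, W)

rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(a + b)
print(rel(g_sym, G))     # relative difference under the symmetric perturbation
print(rel(g_entry, G))   # relative difference under the entrywise perturbation
```

If this reading is right, the gradient check should either perturb symmetric pairs of entries together or compare `numgrad + numgrad.T` (with the diagonal counted once) against `Wgrad`.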