# Implementation of a convolutional RBM using Theano

This notebook implements an efficient cRBM with various training procedures and different layers.

## Part 0: Importing theano, Biopython and all the other useful tools

In [1]:
import theano
import theano.tensor as T
import theano.tensor.nnet.conv as conv


import numpy as np
import Bio.SeqIO as sio
import Bio.motifs.matrix as mat
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio import motifs

import random
import time

ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: libcublas.so.7.0: cannot open shared object file: No such file or directory
ERROR:theano.sandbox.cuda:Failed to compile cuda_ndarray.cu: libcublas.so.7.0: cannot open shared object file: No such file or directory


## Part 1: Getting and transforming the data

In [2]:
def readFASTASequencesFromFile (filename):
    dhsSequences = []
    for dhs in sio.parse(open(filename), 'fasta', IUPAC.unambiguous_dna):
        dhsSequences.append(dhs.seq)
    return dhsSequences

def readPWMsFromFile (filename):
    matrices = []
    for mat in motifs.parse(open(filename), 'jaspar'):
        matrices.append(mat.pwm)
    return matrices

### Read in FASTA sequences
These sequences are DNase I hypersensitive sites (DHSes), where each sample represents one DHS.
All sequences have already been brought to the same length.

In [3]:
allSeqs = readFASTASequencesFromFile('/data/wgEncodeAwgDnaseUwAg10803UniPk.fa')

In [4]:
pwms = readPWMsFromFile('/data/jaspar_matrices.txt')

### Create a test set
Working with the full set of sequences is very time-consuming, so we start with a random selection of 1000 of them (drawn with replacement).

In [4]:
test_set = [allSeqs[random.randrange(0,len(allSeqs))] for i in range(1000)]
print "Length of test set: "  + str(len(test_set))

Length of test set: 1000


### Convert test set sequences to matrices we can work with
Each sequence is converted to a matrix of dimensionality (2 x 4 x N_v), where:
* the first dimension indexes the two strands
* the second dimension indexes the four letters of the DNA alphabet (A, C, G, T)
* the third dimension indexes the position in the sequence; an entry is 1 if the letter is present at that position and 0 otherwise
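
The encoding described above can be sketched without an explicit loop over positions. This is a minimal illustration, not the notebook's implementation: the names `LETTER_INDEX` and `one_hot_both_strands` are made up for this example, and the sketch assumes that both strand slots receive the same encoding (as `getMatrixFromSeq` below does), with the reverse complement handled on the kernel side.

```python
import numpy as np

# Hypothetical lookup table; it mirrors the A/C/G/T -> 0/1/2/3 mapping used in this notebook.
LETTER_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_both_strands(seq):
    """Return a (2 x 4 x N_v) one-hot matrix for a DNA string or Seq object."""
    idx = np.array([LETTER_INDEX[letter] for letter in str(seq).upper()])
    strand = np.zeros((4, len(idx)))
    strand[idx, np.arange(len(idx))] = 1   # one 1 per column, in the row of the letter
    return np.stack([strand, strand])      # same encoding in both strand slots

example = one_hot_both_strands("ACGTGGGG")
print(example.shape)   # (2, 4, 8)
print(example[0])
```

Unlike the function below, this sketch raises a `KeyError` for letters outside A, C, G and T instead of printing a message.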

In [5]:
def getIntToLetter (letter):
    if letter == 'A' or letter == 'a':
        return 0
    elif letter == 'C' or letter == 'c':
        return 1
    elif letter == 'G' or letter == 'g':
        return 2
    elif letter == 'T' or letter == 't':
        return 3
    else:
        print "ERROR. LETTER " + letter + " DOES NOT EXIST!"
        return -1

def getMatrixFromSeq (seq):
    m = len(seq.alphabet.letters)
    n = len(seq)
    result = np.zeros((2, m, n))
    for i in range(len(seq)):
        result[0,getIntToLetter(seq[i]),i] = 1
        result[1,getIntToLetter(seq[i]),i] = 1
    return result

In [6]:
start = time.time()
dataMat = np.array([getMatrixFromSeq(t) for t in test_set])
print "Conversion of sequence to matrix for the test set in (in ms): " + str((time.time()-start)*1000)

Conversion of sequence to matrix for the test set in (in ms): 283.47492218


## Part 2: Implementing the cRBM
We now have all the tools we need to actually implement the cRBM.
Some notes on constraints and behavior of our learning algorithm:

* It uses Theano to do most (if not all) of the work, so it runs much faster on a GPU.
* It expects all kernels (filters/motifs) to be of the same length. This choice was made for performance reasons, and it should not be a problem because we expect the algorithm to combine multiple motifs if necessary.
* The sequences (DHSes) are also expected to be of the same length.
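
Because all kernels share one length and all sequences share another, both can be stacked into dense 4D arrays and scored in one pass. The following NumPy sketch shows the idea with a plain sliding-window score (a cross-correlation). It is only an illustration under that assumption: `scan_motifs` is a hypothetical helper, and the Theano `conv2d` call used in the class below may treat the kernel orientation differently.

```python
import numpy as np

def scan_motifs(data, motifs):
    """Score every motif at every position.

    data:   (N_batch x 2 x 4 x N_v) one-hot sequences
    motifs: (K x 2 x 4 x motifLength) kernels
    returns (N_batch x K x 1 x N_h) with N_h = N_v - motifLength + 1
    """
    n_batch, _, _, n_v = data.shape
    n_motifs, _, _, motif_length = motifs.shape
    n_h = n_v - motif_length + 1
    scores = np.empty((n_batch, n_motifs, 1, n_h))
    for pos in range(n_h):
        window = data[:, :, :, pos:pos + motif_length]
        # sum over strands (s), letters (l) and motif positions (j)
        scores[:, :, 0, pos] = np.einsum('bslj,kslj->bk', window, motifs)
    return scores

# toy input: "ACGTACGT" one-hot encoded, identical in both strand slots
letters = "ACGTACGT"
strand = np.zeros((4, len(letters)))
strand[["ACGT".index(c) for c in letters], np.arange(len(letters))] = 1
data = np.stack([strand, strand])[np.newaxis]           # (1, 2, 4, 8)
motifs = np.stack([np.eye(4), np.eye(4)])[np.newaxis]   # (1, 2, 4, 4), looks for ACGT

scores = scan_motifs(data, motifs)
print(scores.shape)      # (1, 1, 1, 5)
print(scores[0, 0, 0])   # [8. 0. 0. 0. 8.]
```

The motif matches at positions 0 and 4, and each match scores 4 per strand slot.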

Finally, a hint on usage:

* The length of the hidden units (sequenceLength - motifLength + 1) has to be evenly divisible by the pooling factor.
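
The constraint can be checked before constructing the network. A small sketch, using the hypothetical helper names `hidden_length` and `pooling_factor_fits` (they are not part of the class below):

```python
def hidden_length(sequence_length, motif_length):
    """Length of one hidden unit after a 'valid' convolution."""
    return sequence_length - motif_length + 1

def pooling_factor_fits(sequence_length, motif_length, pooling_factor):
    """True if the hidden length splits evenly into pooling groups."""
    return hidden_length(sequence_length, motif_length) % pooling_factor == 0

print(hidden_length(150, 7))            # 144
print(pooling_factor_fits(150, 7, 3))   # True
print(pooling_factor_fits(150, 7, 5))   # False
```

For sequences of length 150 and motifs of length 7, a pooling factor of 3 works, while a pooling factor of 5 does not.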

In [175]:
"""
This class implements a cRBM for sequence analysis.
It performs efficient forward and backward passes on any given batch of DNA sequences.
It implements softmax and sigmoid functions as activation and performs probabilistic max pooling
after the convolution step.
So this class is basically a two layer network with a convolution layer and a pooling layer on top
of that.
The learning procedure uses contrastive divergence (CD) with a variable amount of steps.
"""
class ConvRBM:
    
    """
    Initialize the cRBM. The parameters here are global params that should not change
    during the execution of training or testing and characterize the network.
    
    Parameters:
    _motifLength:    Length of the motifs (position weight matrices, PWMs), i.e. the k
                     of the k-mers the network looks for.
                     The current approach only deals with one fixed motif length.
                     
    _numMotifs:      How many motifs are applied to the sequence, that is how many
                     hidden units does the network have. Each hidden unit consists
                     of a vector of size (sequenceLength-motifLength+1)
                     
    _poolingFactor:  How many units from the hidden layer are pooled together.
                     Note that the number has to evenly divide the length of
                     the hidden units, that is:
                     mod(sequenceLength-motifLength+1, poolingFactor) == 0
                     (1 = equivalent to sigmoid activation)
                     
    _alphabet:       Biopython uses alphabets for sequences to do sanity checks.
                     However, all of the code is written for DNA sequences and even
                     though in theory there should be no difference between that
                     and other alphabets, Biopython may have trouble with the convolution.
    """
    def __init__ (self, _motifLength, _numMotifs, _learningRate=0.1, _poolingFactor=1, _alphabet=IUPAC.unambiguous_dna):
        # parameters for the motifs
        self.motifLength = _motifLength
        self.numMotifs = _numMotifs
        self.alphabet = _alphabet
        self.initializeMotifs()
        
        # cRBM parameters
        self.bias = np.random.rand(self.numMotifs)
        self.c = random.random()
        self.poolingFactor = _poolingFactor
        self.learningRate = _learningRate
        
        # infrastructural parameters
        self.rng = np.random.RandomState()
        
    
    
    def initializeMotifs (self):
        x = np.random.rand(self.numMotifs, 2, 4, self.motifLength)
        for i in range(self.numMotifs):
            x[i,1] = np.flipud(np.fliplr(x[i,0]))
        self.motifs = x
        
        
    """
    Calculate the forward pass for any given set of sequences, that is calculate P(H | V).
    This method applies convolution of all filters to all sequences in the set, using theano.
    It also looks on both strands for matches and returns the hidden activation layer.
    
    Parameters:
    data:            The DNA sequences to calculate the forward pass on. The data is a 4D matrix
                     with dimensionality: (N_batch x numOfStrands(2) x numOfLetters(4) x N_v)
                     
    Return:
    The function returns the hidden activation layer as numpy matrix.
    That matrix has dimensionality N_batch x K x 1 x N_h where
    K is the number of kernels (motifs/PWMs) that are applied and
    N_h is the length of the hidden layer (convolution is of type 'valid')
    
    Note that the strand information is lost during this process because both strands are summed up in the hidden layer.
    """
    def forwardBatch (self, data):
        # symbolic 4D tensors for theano:
        # data is (N_batch x numOfStrands(2) x numOfLetters(4) x N_v),
        # kernels are (K x numOfStrands(2) x numOfLetters(4) x motifLength)
        D = T.tensor4('data')
        K = T.tensor4('kernels')
        out = conv.conv2d(D,K)
        f = theano.function([D,K], out, allow_input_downcast=True)
        
        # reshape the bias to (1 x K x 1 x 1) so that it broadcasts over batch and positions
        bMod = self.bias[np.newaxis,:,np.newaxis,np.newaxis]
        
        return f(data, self.motifs) + bMod



    """
    Calculates the backward pass on the data, that is P(V | H).
    It does so by performing convolution of the hidden layer and the motifs for each kernel using theano.
    
    Parameters:
    hidden layer:   The layer that was previously computed by the forward pass.
    
    Return:
    A matrix of dimensionality: N_batch x numOfStrands(2) x numOfLetters(4) x N_v
    The contributions of all K kernels are already summed up in the result.
    After applying softmax, it can be interpreted as P(V | H), that is the probability
    for each position to have a specific letter.
    """
    def backwardBatch (self, hiddenActivation):
        # theano convolution call
        H = T.tensor4('hidden')
        K = T.tensor4('kernels')
        K_star = K.dimshuffle(1, 0, 2, 3)[:,:,::-1,::-1]
        C = conv.conv2d(H, K_star, border_mode='full')
        out = T.sum(C, axis=1) # sum over all K
        f = theano.function([H,K], out, allow_input_downcast=True)

        res = f(hiddenActivation, self.motifs) + self.c
        
        # re-add the strand dimension (2nd axis) that was summed out during the forward pass
        res = np.tile(res[:,np.newaxis,:,:], [1,2,1,1])
        return res



    """
    Set new kernels when you don't wish to start with random ones. Can be used if prior knowledge about
    the structure of the sequences is present.
    
    Parameters:
    customKernels:  A matrix of shape (K x numOfStrands(2) x numOfLetters(4) x num-of-k-mers) containing the
                    new kernels to work with.
                    
    Return:
    Nothing.
    """
    def setCustomKernels (self, customKernels):
        if len(customKernels.shape) != 4:
            print "New motifs must be a 4D matrix with dims: (K x numOfStrands(2) x numOfLetters(4) x numOfKMers)"
            return
        
        self.numMotifs = customKernels.shape[0]
        self.motifLength = customKernels.shape[3]
        self.bias = np.random.rand(self.numMotifs)
        self.motifs = customKernels
        print "New motifs set. # Motifs: " + str(self.numMotifs) + " K-mer-Length: " + str(self.motifLength)



    """
    Calculate the gradient for a given data matrix and samples from the hidden layer.
    The gradient is computed using theano and convolution.
    
    Parameters:
    hiddenProbs:    A matrix of shape (N_batch x K x 1 x N_h) containing probabilities
                    from the prob. distribution of the hidden layer.
                    Note that this matrix is usually obtained by applying probMaxPooling()
                    to the result of forwardBatch().
                    
    data:           A matrix of shape (N_batch x numOfStrands(2) x numOfLetters(4) x N_v) containing
                    the data. Note that this matrix is usually the same as the matrix passed to forwardBatch().
                    
    Return:
    A matrix containing the gradient for all kernels/motifs/pwms, averaged over the batch.
    This matrix is of shape (K x numOfLetters(4) x motifLength)
    """
    def gradient (self, hiddenProbs, data):
        H = T.tensor4('hidden')
        S = T.tensor4('sample')
        H_reshaped = H.dimshuffle(1, 0, 2, 3)
        out = conv.conv2d(S, H_reshaped)
        f = theano.function([H,S], out, allow_input_downcast=True)

        res = f(np.tile(np.mean(hiddenProbs, axis=0), [2,1,1,1]), np.tile(data, [1,1,1,1]))
        return np.mean(res, axis=0)



    def batchTraining (self, data, epochs, batchSize, numOfCDs):
        itPerEpoch = data.shape[0] // batchSize # integer division: number of full batches
        for epoch in range(epochs):
            for batch in range(itPerEpoch):
                D_batch = data[batch*batchSize:(batch+1)*batchSize]
                H = self.probMaxPooling(self.forwardBatch(D_batch))
                G_data = self.gradient(H, D_batch) # gradient for W for the data
                bias_data = np.mean(np.sum(H, axis=3), axis=0) # gradient for bias
                c_data = np.mean(np.sum(np.sum(np.sum(D_batch, axis=3), axis=2), axis=1), axis=0)
                print "C_data: " + str(c_data)
                for cd in range(numOfCDs):
                    S = self.sampleFromMatrix(H)
                    V = self.softmax(self.backwardBatch(S))
                    S_v = self.sampleFromMatrix(V)
                    H = self.probMaxPooling(self.forwardBatch(S_v))
                
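                # NOTE: the model gradient is computed against D_batch here; the negative
                # phase of CD would normally use the reconstruction S_v instead.
                G_model = self.gradient(H, D_batch)
                # NOTE: bias_data takes np.mean over the batch, while this takes np.sum.
                bias_model = np.sum(np.sum(H, axis=3), axis=0)
                c_model = np.mean(np.sum(np.sum(np.sum(V, axis=3), axis=2), axis=1), axis=0)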
                print "C_model: " + str(c_model)
                #print "Gradient of model: " + str((G_data-G_model).shape)
                #print "For data: " + str((batch*batchSize, (batch+1)*batchSize))
                #print G_data-G_model
                
                self.motifs = self.motifs + self.learningRate * (G_data - G_model)
                self.bias = self.bias + self.learningRate * (bias_data - bias_model).sum(axis=1)
                self.c = self.c + self.learningRate * (c_data - c_model)
                
            print "Epoch done: " + str(epoch)
        
        
        
    """
    Calculates the probabilistic max pooling layer (P) from a hidden layer H.
    P is sparse because at most one unit per pooling group (of size poolingFactor) can be
    active while all the other units of the group are forced to zero. The one that becomes
    active is the one with the highest softmax activation from H.
    Parameters:
    H:              Hidden layer obtained by forwardBatch function.
    
    Return:
    A layer of the same shape as H, but sparsified.
    """
    def probMaxPooling (self, H):

        # first of all some easy definitions
        n_h = H.shape[3]
        numOfGroups = n_h // self.poolingFactor
        #print "N_H = " + str(n_h) + " numOfGroups: " + str(numOfGroups) + " poolingFactor: " + str(self.poolingFactor)

        # exponentiate it all
        exp = np.exp(H)
        
        # get the right dimensions for the matrix
        dims = (exp.shape[0], exp.shape[1], self.poolingFactor, numOfGroups)
        reshaped = exp.reshape(dims)
        
        # append ones (to have the 5th, or non-active state), get the denominator and divide by it
        withOnes = np.insert(reshaped, self.poolingFactor, np.ones(numOfGroups), 2)
        denom = np.sum(withOnes, axis=2)
        div = withOnes / denom[:,:,np.newaxis,:]
        
        # now calculate argmax & max and insert into P
        P = np.zeros(div.shape)
        idx = np.argmax(div, axis=2)
        maxes = np.max(div, axis=2)
        
        # TODO: this is not performant!!!
        for sample in range(idx.shape[0]):
            for kernel in range(idx.shape[1]):
                for seqPos in range(idx.shape[2]):
                    P[sample,kernel,idx[sample,kernel,seqPos],seqPos] = maxes[sample, kernel, seqPos]

        # cut out the 5th extra dimension
        P = P[:,:,0:reshaped.shape[2],:]
        return np.reshape(P,(H.shape[0],H.shape[1],1,-1), 'F')



    """
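    Draws a binary sample from a matrix of probabilities: each entry becomes 1 with
    the probability given in M and 0 otherwise.
    Not verified yet, so it may not be working.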
    """
    def sampleFromMatrix (self, M):
        boolean_mat = M > self.rng.random_sample(M.shape)
        return boolean_mat.astype(int)



    """
    Calculate the sigmoid activation for the hidden layer.
    """
    def sigmoid(self, _H):
        return (1.0 / (1.0 + np.exp(-_H)))



    """
    Calculates the softmax activation using numpy. This function was designed to work with the
    backward pass well.
    Using it for the probabilistic max pooling might not be recommended.
    Parameters:
    V:              The pre-activation of the visible layer, that is, the result from
                    calling backwardBatch().
                    The matrix is of shape N_batch x numOfStrands(2) x numOfLetters(4) x N_v
    Return:
    A matrix of the same shape and size, but each column for the letters (3rd dim) has been softmaxed.
    """
    def softmax (self, V):
        exp = np.exp(-V) # exponentiate it all
        # NOTE: axis=1 is the strand axis of the backwardBatch() output; the letters are on axis=2
        denominator = np.sum(exp, axis=1) # this is the sum of all the exp(-x)
        # now expand the denominator such that the division works
        denominator = denominator[:,np.newaxis,:].repeat(exp.shape[1], axis=1)
        div = exp / denominator
        return div


In [176]:
x = ConvRBM(4,2,0.1, 3)
print "Data mat shape: " + str(dataMat.shape)
x.batchTraining(dataMat, 1, 100, 10)
print "Result from training: "
print x.motifs
print x.bias
print x.c

Data mat shape: (1000, 2, 4, 150)
C_data: 300.0
C_model: 600.0
C_data: 300.0
C_model: 600.0
C_data: 300.0
C_model: 600.0
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
C_data: 300.0
C_model: nan
Epoch done: 0
Result from training: 
[[[[ 0.15  0.5  -0.01 -0.06]
   [-0.21  0.55  0.    0.35]
   [-0.    0.06  0.5   0.07]
   [-0.16 -0.3  -0.28 -0.15]]

  [[-0.26 -0.38 -0.41 -0.26]
   [-0.    0.42 -0.02 -0.09]
   [ 0.23 -0.09  0.45 -0.31]
   [-0.15 -0.1   0.4   0.06]]]


 [[[-0.35 -0.11  0.17  0.01]
   [ 0.51  0.26  0.39 -0.01]
   [ 0.58  0.14 -0.08 -0.06]
   [ 0.43  0.52 -0.41  0.42]]

  [[ 0.31 -0.52  0.41  0.32]
   [-0.13 -0.16  0.06  0.49]
   [-0.12  0.29  0.16  0.41]
   [-0.07  0.08 -0.21 -0.44]]]]
[-332.77 -351.48]
nan


In [161]:
# Construct a kernel matrix, looking for ACGT and GGGG
kernel1 = np.tile(np.eye(4), [2,1,1]) # the kernel will only look for ACGT
kernel2 = np.tile(np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,0,0]]), [2,1,1]) # this one looks for GGGG
kernel2[1,:,:] = np.array([[0,0,0,0],[1,1,1,1],[0,0,0,0],[0,0,0,0]])
kernel3 = np.tile(np.zeros((4,4)), [2,1,1])
kernel = np.array([kernel1, kernel2, kernel3])
print kernel.shape
print x.motifs.shape
# create toy sequences where we might be able to see what happens
randSeq1 = getMatrixFromSeq(Seq("ACGTGGGG", IUPAC.unambiguous_dna))
randSeq2 = getMatrixFromSeq(Seq("ACGTACGT", IUPAC.unambiguous_dna))
data = np.array([randSeq1, randSeq2])

print data.shape
print dataMat.shape
# initialize the learner, insert our predefined kernels and train
cRBM = ConvRBM (4, 3, 0.1, 1)
cRBM.setCustomKernels(kernel)
np.set_printoptions(precision=2, suppress=True)
cRBM.batchTraining(data, 10, 2, 1)
print cRBM.motifs

(3, 2, 4, 4)
(2, 2, 4, 4)
(2, 2, 4, 8)
(1000, 2, 4, 150)
New motifs set. # Motifs: 3 K-mer-Length: 4
(2, 2, 4, 8)
()


ValueError: operands could not be broadcast together with shapes (3,2,4,4) (3,4,4) 

### Verify that the algorithm is doing something meaningful
We can now apply the learning algorithm on a toy sequence to see what happens.
Important questions to ask:

* Does the hidden layer somehow make sense?
* What does the sigmoid function do with it?
* What does the *reconstruction* look like?
* Is the softmax doing the right thing?
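
One way to answer the last question is to compare against a reference softmax that normalizes over the letter axis, so that the four letter probabilities at each position sum to 1. The sketch below is an assumption-laden reference, not the class method: `softmax_over_letters` is a hypothetical name, it assumes the layout (N_batch x 2 x 4 x N_v), and it uses the convention that a larger activation means a higher probability. The second half shows that normalizing over a length-2 axis with identical entries always yields 0.5.

```python
import numpy as np

def softmax_over_letters(V):
    """Softmax over axis 2 (the four letters) of a (N_batch x 2 x 4 x N_v) array."""
    shifted = V - V.max(axis=2, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=2, keepdims=True)

rng = np.random.RandomState(0)
V = rng.randn(2, 2, 4, 8)
P = softmax_over_letters(V)
print(np.allclose(P.sum(axis=2), 1.0))   # True

# Normalizing over the strand axis when both strands hold the same values:
tiled = np.tile(V[:, :1], [1, 2, 1, 1])
e = np.exp(tiled)
print(np.allclose(e / e.sum(axis=1, keepdims=True), 0.5))   # True
```

Subtracting the maximum before exponentiating also keeps very large activations from overflowing to `inf`.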

In [162]:
# Construct a kernel matrix, looking for ACGT and GGGG
kernel1 = np.tile(np.eye(4), [2,1,1]) # the kernel will only look for ACGT
kernel2 = np.tile(np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,0,0]]), [2,1,1]) # this one looks for GGGG
kernel2[1,:,:] = np.array([[0,0,0,0],[1,1,1,1],[0,0,0,0],[0,0,0,0]])
kernel3 = np.tile(np.zeros((4,4)), [2,1,1])
print kernel3.shape
print kernel2.shape
kernel = np.array([kernel1, kernel2, kernel3])

# create toy sequences where we might be able to see what happens
randSeq1 = getMatrixFromSeq(Seq("ACGTGGGG", IUPAC.unambiguous_dna))
randSeq2 = getMatrixFromSeq(Seq("ACGTACGT", IUPAC.unambiguous_dna))
data = np.array([randSeq1, randSeq2])

# initialize the learner and insert our predefined kernels
cRBM = ConvRBM (4, 3, 0.1, 5)
cRBM.setCustomKernels(kernel)
np.set_printoptions(precision=2, suppress=True)

res = cRBM.probMaxPooling(cRBM.forwardBatch(data))
print "Result from forward pass (that is: P(H | V))"
print "Shape: -> " + str(res.shape)
print res

vis = cRBM.softmax(cRBM.backwardBatch(res))
print "Result from backward pass (that is: P(V | H))"
print "Shape: -> " + str(vis.shape)
print vis
grad = cRBM.gradient(res, data)
print "Result from gradient calculation"
print "Shape: -> " + str(grad.shape)
print grad

(2, 4, 4)
(2, 4, 4)
New motifs set. # Motifs: 3 K-mer-Length: 4
Result from forward pass (that is: P(H | V))
Shape: -> (2, 3, 1, 5)
[[[[ 0.99  0.    0.    0.    0.  ]]

  [[ 0.    0.    0.    0.    0.44]]

  [[ 0.18  0.    0.    0.    0.  ]]]


 [[[ 0.5   0.    0.    0.    0.  ]]

  [[ 0.2   0.    0.    0.    0.  ]]

  [[ 0.18  0.    0.    0.    0.  ]]]]
Result from backward pass (that is: P(V | H))
Shape: -> (2, 2, 4, 8)
[[[[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]]

  [[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]]]


 [[[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
   [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]]

  [[ 0.5  0.5  0.5  0.5  0

### Some performance tests on the test set
Now, let's test it on the whole test set. The timing obtained here should tell us a lot about how fast we can actually train the learner.
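
The triple loop in `probMaxPooling` is marked as a performance problem. A vectorized variant could look like the sketch below. It rests on assumptions that differ from the class above: `prob_max_pool` is a hypothetical function, and it pools over contiguous blocks of `pooling_factor` neighbouring positions, which is not the same grouping as the reshape used in `probMaxPooling`.

```python
import numpy as np

def prob_max_pool(H, pooling_factor):
    """Probabilistic max pooling over contiguous blocks.

    H: (N_batch x K x 1 x N_h) hidden pre-activations.
    Returns an array of the same shape that keeps, per block, only the
    probability of the most likely unit and sets the others to zero.
    """
    n_batch, n_kernels, _, n_h = H.shape
    assert n_h % pooling_factor == 0, "pooling factor must divide N_h"
    blocks = H.reshape(n_batch, n_kernels, n_h // pooling_factor, pooling_factor)

    # softmax over each block plus one extra "all off" state with activation 0;
    # the shift m keeps the exponentials from overflowing
    m = np.maximum(blocks.max(axis=3, keepdims=True), 0.0)
    e = np.exp(blocks - m)
    probs = e / (e.sum(axis=3, keepdims=True) + np.exp(-m))

    # keep only the winner of each block
    winner = probs.argmax(axis=3)[..., np.newaxis]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, winner, 1.0, axis=3)
    return (probs * mask).reshape(n_batch, n_kernels, 1, n_h)

pooled = prob_max_pool(np.zeros((1, 1, 1, 6)), 3)
print(pooled[0, 0, 0])   # 0.25 at the first unit of each block, 0 elsewhere
```

With a pooling factor of 1 every block has a single unit, and the result reduces to the sigmoid of `H`.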

In [122]:
cRBM = ConvRBM (7, 10, 0.1, 3)
# perform forward and backward pass and calculate the gradient
start = time.time()
res = cRBM.probMaxPooling(cRBM.forwardBatch(dataMat))
vis = cRBM.softmax(cRBM.backwardBatch(res))
grad = cRBM.gradient(res, dataMat)

print res[0,0,:,:]
print "Forward pass: " + str(res.shape)
print "Backward pass: " + str(vis.shape)
print "Gradient: " + str(grad.shape)
print "Time for processing (for, back, grad) of " + str(dataMat.shape[0]) + " sequences: " + str((time.time()-start))

[[ 0.61  0.    0.    0.    0.59  0.    0.    0.71  0.    0.    0.44  0.    0.
   0.89  0.    0.    0.    0.43  0.    0.    0.48  0.56  0.    0.    0.4
   0.    0.    0.    0.    0.48  0.    0.    0.48  0.    0.64  0.    0.48
   0.    0.    0.79  0.    0.    0.76  0.    0.    0.57  0.    0.    0.
   0.5   0.    0.    0.42  0.    0.    0.4   0.    0.    0.44  0.    0.62
   0.    0.    0.    0.55  0.    0.    0.72  0.    0.    0.    0.85  0.58
   0.    0.    0.65  0.    0.    0.    0.37  0.    0.38  0.    0.    0.    0.
   0.44  0.    0.    0.63  0.    0.    0.62  0.    0.    0.63  0.47  0.    0.
   0.    0.65  0.    0.48  0.    0.    0.67  0.    0.    0.74  0.    0.    0.
   0.52  0.    0.    0.67  0.    0.63  0.    0.    0.    0.    0.65  0.    0.
   0.84  0.    0.    0.66  0.43  0.    0.    0.    0.    0.52  0.49  0.    0.
   0.    0.    0.5   0.    0.54  0.  ]]
Forward pass: (1000, 10, 1, 144)
Backward pass: (1000, 2, 4, 150)
Gradient: (10, 4, 7)
Time for processing (for, back, grad) 

In [22]:
#del cRBM, test_set, allSeqs, pwms

## Part 3: Optimizing theano to do it all on the GPU
We don't want only the convolutions to run on the GPU, because then every convolution call transfers all of its data to the GPU and back again.