# Convolutional RBM (cRBM)

This notebook takes care of implementing the basic functionality for cRBMs.
Or maybe it's just for the preliminaries, that is some simple stuff before it actually comes to the Boltzmann Machine.


## Part 1: Reading the data and converting it to various forms of matrices

In [1]:
import theano
import numpy as np
import Bio.SeqIO as sio
import Bio.motifs.matrix as mat
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio import motifs

import random
import time

ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: libcublas.so.7.0: cannot open shared object file: No such file or directory
ERROR:theano.sandbox.cuda:Failed to compile cuda_ndarray.cu: libcublas.so.7.0: cannot open shared object file: No such file or directory


### Classes to read biological files (such as FASTA or JASPAR)

In [2]:
"""
This class reads sequences from fasta files.
To use it, create an instance of that object and use
the function readSequencesFromFile.
"""
class FASTAReader:
    
    def __init__(self, _path):
        self.path = _path
        
    def readSequencesFromFile (self, filename):
        dhsSequences = []
        for dhs in sio.parse(open(filename), 'fasta', IUPAC.unambiguous_dna):
            dhsSequences.append(dhs.seq)
        return dhsSequences
    
    
class JASPARReader:
    
    def __init__ (self):
        pass
    
    def readSequencesFromFile (self, filename):
        matrices = []
        for mat in motifs.parse(open(filename), 'jaspar'):
            matrices.append(mat.pwm)
        return matrices

### Read in JASPAR matrices and FASTA files

In [3]:
matReader = JASPARReader()
pwms = matReader.readSequencesFromFile('data/jaspar_matrices.txt')

In [4]:
# apply the two classes to calculate a forward pass in our algorithm
seqReader = FASTAReader('.')
allSeqs = seqReader.readSequencesFromFile('data/wgEncodeAwgDnaseUwAg10803UniPk.fa')

### Convert FASTA sequences to matrices

### Create test set

In [5]:
test_set = [allSeqs[random.randrange(0,len(allSeqs))] for i in range(1000)]
print len(test_set)

1000


In [6]:
def getIntToLetter (letter):
    if letter == 'A' or letter == 'a':
        return 0
    elif letter == 'C' or letter == 'c':
        return 1
    elif letter == 'G' or letter == 'g':
        return 2
    elif letter == 'T' or letter == 't':
        return 3
    else:
        print "ERROR. LETTER " + letter + " DOES NOT EXIST!"
        return -1

def getMatrixFromSeq (seq):
    m = len(seq.alphabet.letters)
    n = len(seq)
    result = np.zeros((2, m, n))
    revSeq = seq.reverse_complement()
    for i in range(len(seq)):
        result[0,getIntToLetter(seq[i]),i] = 1
        result[1,getIntToLetter(revSeq[i]),i] = 1
    return result


start = time.time()
dataMat = np.array([getMatrixFromSeq(t) for t in test_set])
print "Conversion of test set in (in ms): " + str((time.time()-start)*1000)

Conversion of test set in (in ms): 298.624038696


### Part 2a: Borrowing Ian Goodfellow's implementation of the probabilistic max pooling layer
This implementation is now part of the pylearn2 library which is licensed under the 3-claused BSD license.
Source code is available here: https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/expr/probabilistic_max_pooling.py

In [45]:
from theano.gof.op import get_debug_values

def max_pool(z, pool_shape, top_down=None, theano_rng=None):
    """
    Probabilistic max-pooling
    Parameters
    ----------
    z : theano 4-tensor
        a theano 4-tensor representing input from below
    pool_shape : tuple
        tuple of ints. the shape of regions to be pooled
    top_down : theano 4-tensor, optional
        a theano 4-tensor representing input from above
        if None, assumes top-down input is 0
    theano_rng : MRG_RandomStreams, optional
        Used for random numbers for sampling
    Returns
    -------
    p : theano 4-tensor
        the expected value of the pooling layer p
    h : theano 4-tensor
        the expected value of the detector layer h
    p_samples : theano 4-tensor, only returned if theano_rng is not None
        samples of the pooling layer
    h_samples : theano 4-tensor, only returned if theano_rng is not None
        samples of the detector layer
    Notes
    ------
    all 4-tensors are formatted with axes ('b', 'c', 0, 1).
    This is for maximum speed when using theano's conv2d
    to generate z and top_down, or when using it to infer conditionals of
    other layers using the return values.
    Detailed description:
    Suppose you have a variable h that lives in a Conv2DSpace h_space and
    you want to pool it down to a variable p that lives in a smaller
    Conv2DSpace p.
    This function does that, using non-overlapping pools.
    Specifically, consider one channel of h. h must have a height that is a
    multiple of pool_shape[0] and a width that is a multiple of pool_shape[1].
    A channel of h can thus be broken down into non-overlapping rectangles
    of shape pool_shape.
    Now consider one rectangular pooled region within one channel of h.
    I now use 'h' to refer just to this rectangle, and 'p' to refer to
    just the one pooling unit associated with that rectangle.
    We assume that the space that h and p live in is constrained such
    that h and p are both binary and p = max(h). To reduce the state-space
    in order to make probabilistic computations cheaper we also
    constrain sum(h) <= 1.
    Suppose h contains k different units. Suppose that the only term
    in the model's energy function involving h is -(z*h).sum()
    (elemwise multiplication) and the only term in
    the model's energy function involving p is -(top_down*p).sum().
    Then P(h[i] = 1) = softmax( [ z[1], z[2], ..., z[k], -top_down] )[i]
    and P(p = 1) = 1-softmax( [z[1], z[2], ..., z[k], -top_down])[k]
    This variation of the function assumes that z, top_down, and all
    return values use Conv2D axes ('b', 'c', 0, 1).
    This variation of the function implements the softmax using a
    theano graph of exp, maximum, sub, and div operations.
    Performance notes:
    It might be possible to make a faster implementation with different
    theano ops. rather than using set_subtensor, it might be possible
    to use the stuff in theano.sandbox.neighbours. Probably not possible,
    or at least nasty, because that code isn't written with multiple
    channels in mind, and I don't think just a reshape can fix it.
    Some work on this in galatea.cond.neighbs.py
    At some point images2neighbs' gradient was broken so check that
    it has been fixed before sinking too much time into this.
    Stabilizing the softmax is also another source of slowness.
    Here it is stabilized with several calls to maximum and sub.
    It might also be possible to stabilize it with
    T.maximum(-top_down,T.signal.downsample.max_pool(z)).
    Don't know if that would be faster or slower.
    Elsewhere in this file I implemented the softmax with a reshape
    and call to Softmax / SoftmaxWithBias.
    This is slower, even though Softmax is faster on the GPU than the
    equivalent max/sub/exp/div graph. Maybe the reshape is too expensive.
    Benchmarks show that most of the time is spent in GpuIncSubtensor
    when running on gpu. So it is mostly that which needs a faster
    implementation. One other way to implement this would be with
    a linear.Conv2D.lmul_T, where the convolution stride is equal to
    the pool width, and the thing to multiply with is the hparts stacked
    along the channel axis. Unfortunately, conv2D doesn't work right
    with stride > 2 and is pretty slow for stride 2. Conv3D is used to
    mitigate some of this, but only has CPU code.
    """

    z_name = z.name
    if z_name is None:
        z_name = 'anon_z'

    batch_size, ch, zr, zc = z.shape

    r, c = pool_shape

    zpart = []

    mx = None

    if top_down is None:
        t = 0.
    else:
        t = - top_down
        t.name = 'neg_top_down'

    for i in xrange(r):
        zpart.append([])
        for j in xrange(c):
            cur_part = z[:, :, i:zr:r, j:zc:c]
            if z_name is not None:
                cur_part.name = z_name + '[%d,%d]' % (i, j)
            zpart[i].append(cur_part)
            if mx is None:
                mx = T.maximum(t, cur_part)
                if cur_part.name is not None:
                    mx.name = 'max(-top_down,' + cur_part.name + ')'
            else:
                max_name = None
                if cur_part.name is not None:
                    mx_name = 'max(' + cur_part.name + ',' + mx.name + ')'
                mx = T.maximum(mx, cur_part)
                mx.name = mx_name
    mx.name = 'local_max(' + z_name + ')'

    pt = []

    for i in xrange(r):
        pt.append([])
        for j in xrange(c):
            z_ij = zpart[i][j]
            safe = z_ij - mx
            safe.name = 'safe_z(%s)' % z_ij.name
            cur_pt = T.exp(safe)
            cur_pt.name = 'pt(%s)' % z_ij.name
            pt[-1].append(cur_pt)

    off_pt = T.exp(t - mx)
    off_pt.name = 'p_tilde_off(%s)' % z_name
    denom = off_pt

    for i in xrange(r):
        for j in xrange(c):
            denom = denom + pt[i][j]
    denom.name = 'denom(%s)' % z_name

    off_prob = off_pt / denom
    p = 1. - off_prob
    p.name = 'p(%s)' % z_name

    hpart = []
    for i in xrange(r):
        hpart.append([pt_ij / denom for pt_ij in pt[i]])

    h = T.alloc(0., batch_size, ch, zr, zc)

    for i in xrange(r):
        for j in xrange(c):
            h.name = 'h_interm'
            h = T.set_subtensor(h[:, :, i:zr:r, j:zc:c], hpart[i][j])

    h.name = 'h(%s)' % z_name

    if theano_rng is None:
        return p, h
    else:
        events = []
        for i in xrange(r):
            for j in xrange(c):
                events.append(hpart[i][j])
        events.append(off_prob)

        events = [event.dimshuffle(0, 1, 2, 3, 'x') for event in events]

        events = tuple(events)

        stacked_events = T.concatenate(events, axis=4)

        rows = zr // pool_shape[0]
        cols = zc // pool_shape[1]
        outcomes = pool_shape[0] * pool_shape[1] + 1
        assert stacked_events.ndim == 5
        for se, bs, r, c, chv in get_debug_values(stacked_events, batch_size,
                                                  rows, cols, ch):
            assert se.shape[0] == bs
            assert se.shape[1] == r
            assert se.shape[2] == c
            assert se.shape[3] == chv
            assert se.shape[4] == outcomes
        reshaped_events = stacked_events.reshape((
            batch_size * rows * cols * ch, outcomes))

        multinomial = theano_rng.multinomial(pvals=reshaped_events,
                                             dtype=p.dtype)

        reshaped_multinomial = multinomial.reshape((batch_size, ch, rows,
                                                    cols, outcomes))

        h_sample = T.alloc(0., batch_size, ch, zr, zc)

        idx = 0
        for i in xrange(r):
            for j in xrange(c):
                h_sample = T.set_subtensor(h_sample[:, :, i:zr:r, j:zc:c],
                                           reshaped_multinomial[:, :, :, :,
                                           idx])
                idx += 1

        p_sample = 1 - reshaped_multinomial[:, :, :, :, -1]

        return p, h, p_sample, h_sample


In [102]:
import theano.tensor as T
import theano.tensor.nnet.conv as conv
from theano.tensor.shared_randomstreams import RandomStreams

def _getLetterToInt (num):
    if num == 0:
        return 'A'
    elif num == 1:
        return 'C'
    elif num == 2:
        return 'G'
    elif num == 3:
        return 'T'
    else:
        print 'ERROR: Num ' + str(num) + " not a valid char in DNA alphabet"
        return -1

def _convertPWM2Array (pwm):
    result = np.zeros((4, len(pwm['A'])))
    for letter in range(len(pwm)):
        result[letter] = pwm[_getLetterToInt(letter)]
    return result
  
def makeIt3d (seq, numOfKernels):
    x = np.tile(seq, [numOfKernels, 1, 1])
    print x.shape
    return x

# data: 4D matrix with dims = (N_batch x 2 x 4 x N_v)
# kernel: 4D matrix with dims = (K x 2 x 4 x number of k-mers)
def forwardBatch (data, kernel, bias):
    # create 4D tensor for theano (BatchSize x K x 2*numOfLetters x lenOfSeqs)
    D = T.tensor4('data')
    K = T.tensor4('kernels')
    out = conv.conv2d(D,K)
    out = out[:,:,::-1,::-1]
    f = theano.function([D,K], out, allow_input_downcast=True)

    bMod = bias[np.newaxis,:,np.newaxis,np.newaxis] # add dims to the bias until it works

    return f(data, kernel) + bMod



def backwardBatch (hiddenActivation, kernel, c):
    # theano convolution call
    H = T.tensor4('hidden')
    K = T.tensor4('kernels')
    K_star = K.dimshuffle(1, 0, 2, 3)[:,:,::-1,::-1]
    C = conv.conv2d(H, K_star, border_mode='full')
    out = T.sum(C, axis=1) # sum over all K
    res = out + c
    
    # add fourth dimension (the strands) that was lost during forward pass (max pooling)
    res = np.tile(res[:,np.newaxis,:,:], [1,2,1,1])
    return res


def gradient (_H, V_0):
    H = T.tensor4('hidden')
    S = T.tensor4('sample')
    H_reshaped = H.dimshuffle(1, 0, 2, 3)
    out = conv.conv2d(S, H_reshaped)
    f = theano.function([H,S], out, allow_input_downcast=True)
    
    return f(np.tile(np.mean(_H, axis=0), [2,1,1,1]), np.tile(V_0, [1,1,1,1]))
    
def sigmoid(x):
    return 1.0 / (1.0 + T.exp(x))

def batchTraining (data, epochs, batchSize, numOfCDs):
    itPerEpoch = data.shape[0] / batchSize
    for epoch in range(epochs):
        for batch in range(itPerEpoch):
            D_batch = data[batch*batchSize:(batch+1)*batchSize]
            H = self.probMaxPooling(self.forwardBatch(D_batch))
            G_data = self.gradient(H, D_batch) # gradient for W for the data
            bias_data = np.mean(np.sum(H, axis=3), axis=0) # gradient for bias
            c_data = np.mean(np.sum(np.sum(np.sum(D_batch, axis=3), axis=2), axis=1), axis=0)
            print "C_data: " + str(c_data)
            for cd in range(numOfCDs):
                S = self.sampleFromMatrix(H)
                V = self.softmax(self.backwardBatch(S))
                S_v = self.sampleFromMatrix(V)
                H = self.probMaxPooling(self.forwardBatch(S_v))

            G_model = self.gradient(H, D_batch)
            bias_model = np.sum(np.sum(H, axis=3), axis=0)
            c_model = np.mean(np.sum(np.sum(np.sum(V, axis=3), axis=2), axis=1), axis=0)
            print "C_model: " + str(c_model)
            #print "Gradient of model: " + str((G_data-G_model).shape)
            #print "For data: " + str((batch*batchSize, (batch+1)*batchSize))
            #print G_data-G_model

            self.motifs = self.motifs + self.learningRate * (G_data - G_model)
            self.bias = self.bias + self.learningRate * (bias_data - bias_model).sum(axis=1)
            self.c = self.c + self.learningRate * (c_data - c_model)

        print "Epoch done: " + str(epoch)
        
        
def softmax_4d(softmax_input):
    si = softmax_input.reshape((softmax_input.shape[0], softmax_input.shape[1], -1))
    shp = (si.shape[0], 1, si.shape[2])
    exp = T.exp(si - si.max(axis=1).reshape(shp))
    softmax_expression = (exp / exp.sum(axis=1).reshape(shp) ).reshape(softmax_input.shape)
    return softmax_expression


def allBatch (data, kernel, bias, c, cd, rng):
    
    # do forward convolution (get H)
    V = T.tensor4('data')
    K = T.tensor4('kernels')
    bMod = bias[np.newaxis,:,np.newaxis,np.newaxis]

    H = T.nnet.sigmoid(conv.conv2d(V, K) + bMod)
    
    # compute the derivatives of the data
    H_reshaped = np.tile(T.mean(H, axis=0), [2,1,1,1]).dimshuffle(1, 0, 2, 3)
    GData_w = conv.conv2d(H, np.tile(V, [1,1,1,1]))
    GData_bias = T.mean(T.sum(H, axis=3), axis=0)
    
    for conDiv in range(cd):
        S = rng.binomial(size=H.shape)
        
        # now we have the sample, perform backward pass
        K_star = K.dimshuffle(1, 0, 2, 3)[:,:,::-1,::-1]
        V_ = conv.conv2d(S, K_star, border_mode='full')
        V_ = 1. / (1. + T.exp(V_)) # sigmoid
        pre_V = T.sum(V_ + c, axis=1) # sum over all K
        V = softmax_4d(np.tile(V[:,np.newaxis,:,:], [1,2,1,1]))
        
        


        rng = RandomStreams(np.random.RandomState(1234).randint(2 ** 30))

In [105]:
kernel1 = np.tile(np.array([[1,0,0],[0,1,0],[0,0,1],[0,0,0]]), [2,1,1])
kernel2 = np.tile(np.array([[0,0,0],[0,0,0],[1,1,1],[0,0,0]]), [2,1,1])
#kernel1 = np.tile(np.eye(4), [2,1,1])
#kernel2 = np.tile(np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,0,0]]), [2,1,1])
kernel = np.array([kernel1, kernel2])
filter_shape = kernel.shape

randSeq1 = getMatrixFromSeq(Seq("ACGTGGGG", IUPAC.unambiguous_dna))
randSeq2 = getMatrixFromSeq(Seq("ACGTACGT", IUPAC.unambiguous_dna))
data = np.array([randSeq1, randSeq2])
print "Data: " + str(data.shape)
print data
print "Kernel: " + str(kernel.shape)
print kernel
D = T.tensor4(name='data')
K = T.tensor4(name='kernel')
C = conv.conv2d(D, K, filter_shape=[filter_shape[k] for k in [1,0,2,3]])
out = C[:,:,::-1,::-1]
b = theano.function([D,K],out, allow_input_downcast=True)
tmp = b(data,kernel)
print "Result from forward: " + str(tmp.shape)
print tmp


D = T.tensor4('data')
K = T.tensor4('kernel')
C = conv.conv2d(D, K, filter_shape=[filter_shape[k] for k in [1,0,2,3]])
out = C[:,:,::-1,::-1]
res = max_pool(z=out, pool_shape=(1,2), top_down=None, theano_rng=rng)
f = theano.function([D,K], res, allow_input_downcast=True)

P = f(data, kernel)

#print "Expectation of P -> " + str(P[0].shape)
#print P[0]
print "Expectation of H -> " + str(P[1].shape)
print P[1]
#print "Sampled from P -> " + str(P[2].shape)
#print P[2]
print "Sampled from H -> " + str(P[3].shape)
print P[3]

Data: (2, 2, 4, 8)
[[[[ 1.  0.  0.  0.  0.  0.  0.  0.]
   [ 0.  1.  0.  0.  0.  0.  0.  0.]
   [ 0.  0.  1.  0.  1.  1.  1.  1.]
   [ 0.  0.  0.  1.  0.  0.  0.  0.]]

  [[ 0.  0.  0.  0.  1.  0.  0.  0.]
   [ 1.  1.  1.  1.  0.  1.  0.  0.]
   [ 0.  0.  0.  0.  0.  0.  1.  0.]
   [ 0.  0.  0.  0.  0.  0.  0.  1.]]]


 [[[ 1.  0.  0.  0.  1.  0.  0.  0.]
   [ 0.  1.  0.  0.  0.  1.  0.  0.]
   [ 0.  0.  1.  0.  0.  0.  1.  0.]
   [ 0.  0.  0.  1.  0.  0.  0.  1.]]

  [[ 1.  0.  0.  0.  1.  0.  0.  0.]
   [ 0.  1.  0.  0.  0.  1.  0.  0.]
   [ 0.  0.  1.  0.  0.  0.  1.  0.]
   [ 0.  0.  0.  1.  0.  0.  0.  1.]]]]
Kernel: (2, 2, 4, 3)
[[[[1 0 0]
   [0 1 0]
   [0 0 1]
   [0 0 0]]

  [[1 0 0]
   [0 1 0]
   [0 0 1]
   [0 0 0]]]


 [[[0 0 0]
   [0 0 0]
   [1 1 1]
   [0 0 0]]

  [[0 0 0]
   [0 0 0]
   [1 1 1]
   [0 0 0]]]]
Result from forward: (2, 2, 1, 6)
[[[[ 4.  1.  2.  1.  4.  1.]]

  [[ 1.  1.  2.  2.  4.  4.]]]


 [[[ 6.  0.  0.  0.  6.  0.]]

  [[ 2.  2.  2.  0.  2.  2.]]]]
Expectati

### Testing incredibly fast theano convolution

In [50]:
kernel1 = np.tile(np.eye(4), [2,1,1])
kernel2 = np.tile(np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,0,0]]), [2,1,1])
kernel = np.array([kernel1, kernel2])
randSeq1 = getMatrixFromSeq(Seq("ACGTGGGG", IUPAC.unambiguous_dna))
randSeq2 = getMatrixFromSeq(Seq("ACGTACGT", IUPAC.unambiguous_dna))
data = np.array([randSeq1, randSeq2])
res = sigmoid(forwardBatch(data, kernel, np.zeros(2)))
print res.shape
print data.shape
vis = sigmoid(backwardBatch(res, kernel, 0))
grad = gradient(res, data)
print grad.shape

Shape.0
(2, 2, 4, 8)


AsTensorError: ('Cannot convert [[[[Subtensor{::, ::, ::, ::}.0]]\n\n  [[Subtensor{::, ::, ::, ::}.0]]]] to TensorType', <type 'numpy.ndarray'>)

In [12]:
kernel = np.tile(np.eye(4), [10, 2,1,1])
print kernel.shape
print dataMat.shape
start = time.time()
res = forwardBatch(dataMat, kernel)
vis = backwardBatch(res, kernel)
grad = gradient(res, dataMat)
print grad.shape
print "Time for processing (for, back, grad) of " + str(dataMat.shape[0]) + " sequences: " + str((time.time()-start))

(10, 2, 4, 4)
(1000, 2, 4, 150)
(1000, 10, 4, 4)
Time for processing (for, back, grad) of 1000 sequences: 0.197737932205


## The implementation of our convRBM so far

In [13]:
"""
This class implements a cRBM for sequence analysis.
It does perform efficient forward and backward pass of any given DNA sequence.
It implements softmax and sigmoid functions as activation and performs probabilistic max pooling
after the convolution step.
So this class is basically a two layer network with a convolution layer and a pooling layer on top
of that.
The learning procedure uses contrastive divergence (CD) with a variable amount of steps.
"""

class ConvRBM:
    
    """
    Initialize the cRBM. The parameters here are global params that should not change
    during the execution of training or testing and characterize the network.
    
    Parameters:
    _motifLength:    How long are the motifs (position weight matrices PWM). This
                     This is equivalent to ask what the number of k-mers is.
                     The current approach only deals with one fixed motif length.
                     
    _numMotifs:      How many motifs are applied to the sequence, that is how many
                     hidden units does the network have. Each hidden unit consists
                     of a vector of size (sequenceLength-motifLength+1)
                     
    _poolingFactor:  How many units from the hidden layer are pooled together.
                     Note that the number has to divide evenly to the length of
                     the hidden units, that is:
                     mod(sequenceLength-motifLength+1, poolingFactor) == 0
                     (1 = equivalent to sigmoid activation)
                     
    _alphabet:       Biopython uses alphabets for sequences to do sanity checks.
                     However, all of the code is written for DNA sequences and even
                     though in theory there should be no difference between that
                     and other alphabets, Biopython may have trouble with the convolution.
    """
    def __init__ (self, _motifLength, _numMotifs, _learningRate=0.1, _poolingFactor=1, _alphabet=IUPAC.unambiguous_dna):
        # parameters for the motifs
        self.motifLength = _motifLength
        self.numMotifs = _numMotifs
        self.motifs = []
        self.alphabet = _alphabet
        self.poolingFactor = _poolingFactor
        
        # cRBM parameters
        self.bias = np.random.rand(self.numMotifs)
        self.c = random.random()
        self.learningRate = _learningRate
        
        # infrastructural parameters
        self.rng = np.random.RandomState()
        
    """
    This function initializes the motifs (or PWMs) of the cRBM. Maybe this function will
    be removed in future versions and initialization will be performed by the
    c'tor.
    """
    def initializePWMs (self):
        # set up PWMs
        for m in range(self.numMotifs):
            self.motifs.append(self._createRandomMotif(self.motifLength, self.alphabet))
        
        
    """
    Calculate the forward pass for any given sequence, that is P(H | V).
    This method applies convolution of all filters to the sequence, using the Biopython package.
    It also looks on both strands for matches and returns the hidden activation layer.
    It performs also the probabilistic max pooling on the convoluted data.
    
    Parameters:
    seq:             The DNA sequence to calculate the forward pass on. The sequence is of
                     type seq.
                     
    Return:
    The function returns the hidden activation layer as numpy matrix.
    That matrix has dimensionality 2 * (len(seq) - motifLength + 1) x K where K is the number
    of kernels (motifs/PWMs) that are applied.
    Since the algorithm looks on both strands, the factor 2 is present.
    """
    def forwardPass (self, seq):
        
        # check that we actually have some motifs to do convolution on
        if self.motifs == []:
            print 'Error: No motifs created so far. Try executing initializePWMs before!'
            return
        if (len(seq)-self.motifLength+1) % self.poolingFactor != 0:
            print 'Dimension mismatch: cannot create pooling layer because it would not fit!'

        # perform convolution of motif and sequence (that is, apply the motif to the sequence)
        hiddenActivation = np.zeros((2*self.numMotifs, len(seq)-self.motifLength+1))
        motifCount = 0
        for motif in self.motifs:
            pssm = motif.log_odds()
            
            # apply convolution on both strands
            hiddenActivation[motifCount*2,:] = pssm.calculate(seq) + self.bias[motifCount]
            hiddenActivation[motifCount*2+1,:] = pssm.calculate(seq.reverse_complement()) + self.bias[motifCount]
            hiddenActivation[motifCount*2:motifCount*2+2,:] = self._probMaxPooling(hiddenActivation[motifCount*2:motifCount*2+2,:])
            motifCount += 1

        return hiddenActivation
    
    
    """
    Calculates the backward pass on the data, that is P(V | H).
    It does so by performing convolution of the hidden layer and the data for each letter
    in the alphabet, respecting both strands.
    The de-convolutions for all kernels are added and the final sequence is obtained
    using the softmax over all four letters.
    
    Parameters:
    hidden layer:   The layer that was previously computed by the forward pass.
    
    Returns:        The reconstructed sequence. 
    """
    def backwardPass (self, hiddenActivation):
        
        # apply convolution on all of the filters
        restoredLength = hiddenActivation.shape[1] + self.motifLength - 1
        numOfLetters = len(self.alphabet.letters)
        reConv = np.zeros((numOfLetters, restoredLength))

        #start = time.time()
        reConv = np.zeros((numOfLetters, restoredLength))
        for k in range(len(self.motifs)):
            # apply convolution on each of the channels (A, C, G, T) seperately
            matrix = self._convertPWM2Array(self.motifs[k])
            for i in range(numOfLetters):
                conv1 = np.convolve(hiddenActivation[k], matrix[i])
                conv2 = np.convolve(hiddenActivation[k+1], matrix[i])
                reConv[i,:] = reConv[i,:] + conv1 + conv2 + self.c

        #convTime = time.time()
        # perform softmax and select index of the most promising one (results in visibleActivation)
        visibleActivation = np.zeros((1, restoredLength))
        
        # calculate exp for whole matrix
        reConv = np.exp(reConv)
        
        # calculate the sum (over all four letters for whole sequence)
        sums = np.sum(reConv, 0)

        # divide by the sum
        for i in range(numOfLetters):
            reConv[i,:] = reConv[i,:] / sums[i]
            
        # and the maximum is our letter...
        visibleActivation = np.argmax(reConv, 0)
        
        #print "Done with Softmax in: " + str((time.time()-convTime)*1000)
        # convert the resulting sequence to actual letters (A, C, G, T instead of 0, 1, 2, 3)
        return self._getDNASeqFromNumericals(visibleActivation)

    
    """
    The training algorithm for cRBMs. This uses the forward and backward pass
    to collect the statistics for them.
    """
    def miniBatchTraining (self, sequences, batchSize, cd_value):
        
        # first, some vars that we need all the time
        sequenceLength = len(sequences[0])
        hiddenUnitLength = sequenceLength - self.motifLength + 1
        
        # first, compute expected value of the data
        # that means computing forward pass for all datapoints and computing the gradient
        hiddenActivation = np.zeros((2*self.numMotifs, hiddenUnitLength))
        derivatives = np.zeros((self.motifLength, self.numMotifs))
        for seq in sequences:
            hiddenActivation = self.forwardPass(seq)
            
            # calculate the gradients of the data
            # In order to do that, we have to convolve the DNA and hidden layer
            # This can be done by expressing DNA as four different convolutions (for each letter)
            # We would need to represent the DNA as matrix of 4 x lengthOfDHS
            # Example:
            # -------
            # ACGTGGGG
            # 10000000
            # 01000000
            # 00101111
            # 00010000
            # Then, we can apply 4 convolutions to the sequence
            for k in range(self.numMotifs):
                derivatives[k,:] += np.convolve( hiddenActivation[k,:])
            
        print derivatives
        # Now, sample from this distribution and collect the model's statistics
        # Do that by sampling from the data statistics, recompute the model's statistics
        # and so on until reaching the stationary distribution (in case of CD, do it once)
        hidden_model = hiddenActivation
        for it in range(cd_value):
            sample = self._sampleFromHidden(hidden_model)
            visible = self.backwardPass(sample)
            hidden_model = self.forwardPass(visible)

        print hidden_model.shape
        
        return self.learningRate * (hiddenActivation - hidden_model)
            
    
    def _probMaxPooling (self, h_k):

        # first of all some easy definitions
        l = h_k.shape[1]
        numOfGroups = l/self.poolingFactor
        P = np.zeros((2, l))

        # exponent of everything
        ex = np.exp(h_k)
        
        # reshape s.t. each group forms one row
        newDim = (numOfGroups, -1)
        reordered = np.append(ex[0].reshape(newDim), ex[1].reshape(newDim), 1)
        #print "Shape of reordered: " + str(reordered.shape)
        # calculate denominators (sum of all rows)
        denoms = np.sum(reordered, 1) + 1 # denoms for all groups (add 1 to have log. unit)
        res = np.argmax(reordered.T / denoms, 0)

        # calculate the actual values of the pooling layer P
        for group in range(numOfGroups):
            if reordered[group,res[group]] > 1: # check if really any element from P should be active
                # we don't care about strands so just set res = res/2 for the index
                idx = group * self.poolingFactor + int(res[group]/2)
                P[res[group] % 2, idx] = reordered[group,res[group]] / denoms[group]
        return P
        
    def _createRandomMotif (self, motifLength, alphabet):
        counts = {}
        for letter in alphabet.letters:
            counts[letter] = [random.randint(0,100) for x in xrange(motifLength)]
        return mat.PositionWeightMatrix(alphabet, counts)
        
    def _sampleFromHidden (self, hiddenActivation):
        boolean_mat = hiddenActivation > self.rng.random_sample(hiddenActivation.shape)
        return boolean_mat.astype(int)
        
        
    def _sigmoid (self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def _softmaxActivation (self, col, idx):
        p_all = np.sum(np.exp(col))
        p_x = np.exp(col[idx])
        return p_x / p_all

    def _getLetterToInt (self, num):
        if num == 0:
            return 'A'
        elif num == 1:
            return 'C'
        elif num == 2:
            return 'G'
        elif num == 3:
            return 'T'
        else:
            print 'ERROR: Num ' + str(num) + " not a valid char in DNA alphabet"
            return -1

    def _getDNASeqFromNumericals (self, seq):
        dna_seq = []
        for num in range(seq.shape[0]):
            dna_seq.append(self._getLetterToInt(seq[num]))
        return Seq("".join(dna_seq), alphabet=self.alphabet)

    def _convertPWM2Array (self, pwm):
        result = np.zeros((len(self.alphabet.letters), self.motifLength))
        for letter in range(len(pwm)):
            result[letter] = pwm[self._getLetterToInt(letter)]
        return result

### Test the training procedure

In [14]:
# initialize cRBM
L = ConvRBM(4,2)
L.initializePWMs()

# create test set of sequences
testSet = [allSeqs[random.randrange(0, len(allSeqs))] for x in range(100)]
# take some sequences and learn the 2 motifs
start = time.time()
data_stats = L.miniBatchTraining(testSet, 1, 1)
print "Time for minibatch training (in sec): " + str(time.time()-start)

TypeError: convolve() takes at least 2 arguments (1 given)

### Some code from someone else that did it with theano

In [15]:
def T_activation_v_OLD(self, h):
    # vbias is per visible dimension
    # The first two dimensions of the weights tensor have to be swapped here,
    # because they represent the number of input and output feature maps.
    W_shuffled = self.W.dimshuffle(1, 0, 2, 3)
    filter_shape = [self.filter_shape[k] for k in [1, 0, 2, 3]]
    return conv.conv2d(h, W_shuffled, border_mode='full', image_shape=self.hiddens_shape, filter_shape=filter_shape) + self.vbias.dimshuffle('x', 'x', 0, 'x')


def T_activation_v(self, h):
    # this version of T_activation_v swaps 'dim' and 'hmaps'dimensions,
    # because this seems to be a LOT faster.

    # It seems to be a little less precise than the original, butotherwise
    # the output is identical.

    W_shuffled = self.W.dimshuffle(2, 1, 0, 3)

    # The hmaps dimension needs to be flipped, because it waspreviously not convolved over,
    # and now it is. Note that a valid convolution with filtersize =inputsize is NOT
    # equivalent with a product - this is only the case if you FLIPthe filter.
    # Note that the dim dimension need not be flipped, because this isconvolved over
    # when computing T_activation_h (so it is already inherentlyflipped).
    W_shuffled = W_shuffled[:,:,::-1,:]

    h_shuffled = h.dimshuffle(0, 2, 1, 3)

    # this is a full convolution in the time direction and a valid onein the other,
    # hence the need to pad the input manually.
    zero_padding = T.zeros_like(h_shuffled)[:,:,:,0:self.filter_width-1]
    h_padded = T.concatenate([zero_padding, h_shuffled, zero_padding],axis=3)

    filter_shape = [self.filter_shape[k] for k in [2, 1, 0, 3]]
    image_shape = [self.hiddens_shape[k] for k in [0, 2, 1, 3]]
    image_shape[3] += 2*(self.filter_width-1)
    tmp = conv.conv2d(h_padded, W_shuffled, border_mode='valid',image_shape=image_shape, filter_shape=filter_shape)
    return tmp.dimshuffle(0, 2, 1, 3) + self.vbias.dimshuffle('x','x', 0, 'x')


### Test how to improve representation of DNA

In [16]:
def getIntToLetter (letter):
    if letter == 'A' or letter == 'a':
        return 0
    elif letter == 'C' or letter == 'c':
        return 1
    elif letter == 'G' or letter == 'g':
        return 2
    elif letter == 'T' or letter == 't':
        return 3
    else:
        print "ERROR. LETTER " + letter + " DOES NOT EXIST!"
        return -1

def getMatrixFromSeq (seq):
    numOfLetters = len(seq.alphabet.letters)
    result = np.zeros((numOfLetters, len(seq)))
    for i in range(len(seq)):
        result[getIntToLetter(seq[i]),i] = 1
    return result


start1 = time.time()
randSeq = Seq("ACGTGGGG", L.alphabet)
randSeq = allSeqs[random.randrange(0, len(allSeqs))]
randSeq = randSeq.upper()
m = getMatrixFromSeq(randSeq)
print "Small test matrix converstion took (in ms): " + str((time.time()-start1)*1000)
start = time.time()
for seq in allSeqs:
    getMatrixFromSeq(seq)
print "Conversion from DNA to matrix took: " + str(time.time()-start)
print m

Small test matrix converstion took (in ms): 0.546216964722
ERROR. LETTER N DOES NOT EXIST!
ERROR. LETTER N DOES NOT EXIST!
Conversion from DNA to matrix took: 24.7349979877
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.
   1.  0.  1.  0.  1.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  0.  0.
   0.  1.  1.  1.  1.  0.  1.  0.  0.  1.  0.  1.  0.  1.  1.  0.  1.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  1.  0.  0.  0.
   0.  1.  1.  1.  0.  0.  0.  1.  0.  0.  1.  1.  0.  0.  0.  0.  1.  1.
   1.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
   1.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.
   0.  0.  0.  1.  0.  0.]
 [ 1.  1.  1.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
   0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  1.  0.  0.  0.
   0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0

### Test with a simple test sequence to verify that forward and backward pass work well

In [17]:
learner = ConvRBM(4, 2)
learner.bias = np.zeros(2)
learner.c = 0
# create PWM (that will look for a sequence of 4 Gs)
counts = {}
for letter in learner.alphabet.letters:
    if letter != 'G':
        counts[letter] = [0 for x in xrange(learner.motifLength)]
    else:
        counts[letter] = [1 for x in xrange(learner.motifLength)]
kernel = mat.PositionWeightMatrix(learner.alphabet, counts)
learner.motifs.append(kernel)

# create second kernel that will look for ACGT
counts = {}
for letter in range(len(learner.alphabet.letters)):
    counts[learner._getLetterToInt(letter)] = [int(x == letter) for x in xrange(learner.motifLength)]

kernel = mat.PositionWeightMatrix(learner.alphabet, counts)
learner.motifs.append(kernel)

# set the cRBMs other params to 0
learner.bias = np.zeros(learner.numMotifs)
learner.c = 0

# now test forward pass on a simple sequence
testSeq = Seq("ACGTGGGG", learner.alphabet)
h = learner.forwardPass(testSeq)
print "Hidden Layer:"
print h

maxPooled = learner._probMaxPooling(h[:2])
reconstructed = learner.backwardPass(h)
print "Reconstruction:"
print reconstructed

# Should be able to completely reconstruct the sequence because we gave two kernels
# which both can be hold responsible for a portion of the sequence.

Hidden Layer:
[[ 0.          0.          0.          0.          0.99966465]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.99966465  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.99966465]]
Reconstruction:
ACGTGGGG


### Simple Speed Test

In [18]:
x = ConvRBM(4, 10, 1)
x.initializePWMs()

testSeq = allSeqs[7190]
startForward = time.time()
h = x.forwardPass(testSeq)
print "Time for forward: " + str((time.time()-startForward)*1000)
startBackward = time.time()
x.backwardPass(h)
print "Time for backward: " + str((time.time()-startBackward)*1000)

Time for forward: 4.54807281494
Time for backward: 1.15513801575


### Use the JASPAR data and see whether the sequences can be reproduced

In [19]:
l = 6
numMats = len(pwms)
avgLength = np.mean([len(x[0]) for x in pwms])
matsOfSpecificLength = [x for x in pwms if len(x[0]) == l]
avgRedLength = np.mean([len(x[0]) for x in matsOfSpecificLength])
print "Total number of JASPAR matrices: " + str(numMats)
print "Average motif length (k-mer length): " + str(avgLength)
print
print "Number of motifs with length " + str(l) + " : " + str(len(matsOfSpecificLength))
print "Verfication: " + str(avgRedLength)

cRBM = ConvRBM(11, len(matsOfSpecificLength))
cRBM.bias = np.zeros(len(matsOfSpecificLength))
cRBM.c = 0
# insert our pwms
cRBM.motifs = matsOfSpecificLength

# perform forward and backward pass
correct = []
errors = []
times = []
for i in range(1000):
    testSeq = allSeqs[random.randrange(0, len(allSeqs))]
    start = time.time()
    hiddenActivation = learner.forwardPass(testSeq)
    restored = learner.backwardPass(hiddenActivation)
    times.append(time.time()-start)
    
    # count the differences between the two sequences
    differences = 0
    for elem in range(len(string)):
        if string[elem] != testSeq[elem]:
            differences += 1

    #print "Correct: " + str(len(string)-differences)
    #print "Errors: " + str(differences)
    correct.append(len(string)-differences)
    errors.append(differences)
    
print "average correct: " + str(np.mean(correct))
print "average error: " + str(np.mean(errors))
print "var of error: " + str(np.var(errors))
print "average time for forward and backward pass (in ms): " + str(np.mean(times)*1000)

Total number of JASPAR matrices: 593
Average motif length (k-mer length): 10.7993254637

Number of motifs with length 6 : 37
Verfication: 6.0


NameError: name 'string' is not defined

### Verify that the forward and backward pass do anything meaningful

In [20]:
learner = ConvRBM(6, 1)
learner.initializePWMs()

hiddenActivation = learner.forwardPass(allSeqs[246])
restoredLength = hiddenActivation.shape[1] + learner.motifLength - 1
reConv = np.zeros((len(learner.alphabet.letters), restoredLength))
matrix = learner._convertPWM2Array(learner.motifs[0])
print matrix
for i in range(len(learner.alphabet.letters)):
    reConv[i,:] = np.convolve(hiddenActivation[0], matrix[i])
    
#print reConv

def softmaxActivation(col, idx):
    p_all = np.sum(np.exp(col))
    p_x = np.exp(col[idx])
    return p_x / p_all

visibleActivation = np.zeros((1, 150))
for i in range(150):
    visibleActivation[0,i] = np.argmax([softmaxActivation(reConv[:,i], x) for x in range(4)])
    
def convertNumericalToLetter(seq):
    dna_seq = []
    for num in range(seq.shape[1]):
        dna_seq.append(learner._getLetterToInt(seq[0,num]))
    return dna_seq

print "Restored:"
print convertNumericalToLetter(visibleActivation)[:20]
print
print "Original:"
print allSeqs[246][:20]

print "Now the real implementation:"
print "Restored:"
print learner.backwardPass(hiddenActivation)[:20]

[[ 0.16580311  0.28767123  0.36480687  0.28813559  0.03977273  0.38655462]
 [ 0.19689119  0.31506849  0.33476395  0.16610169  0.26136364  0.31092437]
 [ 0.20207254  0.32876712  0.08583691  0.29152542  0.47159091  0.11344538]
 [ 0.43523316  0.06849315  0.21459227  0.25423729  0.22727273  0.18907563]]
Restored:
['A', 'A', 'T', 'G', 'A', 'T', 'G', 'T', 'G', 'C', 'A', 'G', 'C', 'A', 'A', 'T', 'G', 'A', 'T', 'G']

Original:
TTTATCCTGCAGCTCGCCTG
Now the real implementation:
Restored:
AAAAAGAAGAAGAAAGAAAA


### Test the forward pass on the whole set of sequences

In [481]:
start = time.time()
i = 0
lengthes = []
someScores = []
for seq in allSeqs:
    convoluted = learner.forwardPass(seq)
    lengthes.append(len(seq))
    if i % 5000 == 0:
        someScores.append(convoluted[0][random.randint(0, len(convoluted))])
        print str(i) + " -> " + str(someScores[-1])
    i += 1
    
print
print
print "Number of filters: " + str(learner.numMotifs)
print "Number of DHSs: " + str(i)
print "Average Length of Sequences: " + str(np.array(lengthes).mean())
print "Execution Time: " + str(time.time()-start)

0 -> 0.0
5000 -> 0.0
10000 -> 0.0
15000 -> 0.0
20000 -> 0.0
25000 -> 0.842827207671
30000 -> 0.0
35000 -> 0.0
40000 -> 0.0
45000 -> 0.0
50000 -> 0.561771898779
55000 -> 0.0
60000 -> 0.0
65000 -> 0.0
70000 -> 0.0
75000 -> 0.0
80000 -> 0.58698077005
85000 -> 0.0
90000 -> 0.0
95000 -> 0.0
100000 -> 0.0
105000 -> 0.0
110000 -> 0.692016137551
115000 -> 0.0
120000 -> 0.0
125000 -> 0.0
130000 -> 0.0
135000 -> 0.0
140000 -> 0.0
145000 -> 0.0
150000 -> 0.0
155000 -> 0.465888195173
160000 -> 0.0
165000 -> 0.0
170000 -> 0.0


Number of filters: 1
Number of DHSs: 171275
Average Length of Sequences: 150.0
Execution Time: 62.1381750107


### Test both passes on all sequences using parallelization (just CPU)

In [11]:
from multiprocessing.pool import Pool

def calculatePassesForSeqs(seqs):
    print "Started thread with " + str(len(seqs)) + " Sequences!"
    lengthes = []
    for seq in seqs:
        hiddenActivation = learner.forwardPass(seq)
        reconstruction = learner.backwardPass(hiddenActivation)
        lengthes.append(len(seq))
    return np.mean(lengthes)


cpu_count = 4
print "AVAILABLE CPUs: " + str(cpu_count)
sizePerCPU = len(allSeqs) / cpu_count
p = Pool(processes = cpu_count)
sublists = []
for i in range(cpu_count):
    if not i == cpu_count-1:
        sublists.append(allSeqs[i*sizePerCPU:(i+1)*sizePerCPU])
    else:
        sublists.append(allSeqs[i*sizePerCPU:])
start = time.time()
result = p.map(calculatePassesForSeqs, sublists)
print result
print
print
print "Number of filters: " + str(learner.numMotifs)
print "Number of DHSs: " + str(len(allSeqs))
print "Execution Time: " + str(time.time()-start)

p.close()

AVAILABLE CPUs: 4


OSError: [Errno 12] Cannot allocate memory

In [482]:
p.close()

NameError: name 'p' is not defined

Some tests to learn how to do things with Biopython and Theano
===

### Do all DHS sequences have the same length by default?


In [51]:
fasta_seqs = sio.parse(open('../data/wgEncodeAwgDnaseUwAg10803UniPk.fa'), 'fasta')
count = 0
countNotSameLength = 0
for seq in fasta_seqs:
    if len(seq) != 150:
        print 'not length 150'
        countNotSameLength += 1
    count += 1

print 'Number of sequences: ' + str(count)
print 'Number of seqs with length != 150: ' + str(countNotSameLength)

Number of sequences: 171275
Number of seqs with length != 150: 0


### How do we generate random motif matrices (PWMs or PSSMs)

In [52]:
import Bio.NeuralNetwork.Gene.Schema as schema
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

In [145]:
alphabet = IUPAC.unambiguous_dna
generator = schema.RandomMotifGenerator(alphabet, 6, 10)
for i in range(3):
    x = generator.random_motif()
    print x

CTGCGAT
CGTGACCA
GTACACCGA


In [54]:
from Bio import motifs

In [203]:
len(alphabet.letters)

4

In [56]:
import Bio.motifs.matrix as mat
import random

In [64]:
def createRandomMotif (motifLength, alphabet):
    counts = {}
    for letter in alphabet.letters:
        counts[letter] = [random.randint(0,100) for x in xrange(motifLength)]
    return mat.PositionWeightMatrix(alphabet, counts)
#x = mat.PositionWeightMatrix(alphabet, counts)

In [207]:
x = createRandomMotif(10, alphabet)
print x

        0      1      2      3      4      5      6      7      8      9
A:   0.24   0.18   0.19   0.25   0.41   0.47   0.09   0.07   0.21   0.29
C:   0.28   0.26   0.16   0.42   0.13   0.24   0.39   0.49   0.16   0.38
G:   0.20   0.24   0.22   0.00   0.14   0.15   0.46   0.43   0.63   0.18
T:   0.29   0.32   0.43   0.32   0.32   0.14   0.06   0.02   0.00   0.14



In [224]:
y = np.zeros((len(alphabet.letters), 10))

def getLetterToInt (num):
    if num == 0:
        return 'A'
    elif num == 1:
        return 'C'
    elif num == 2:
        return 'G'
    elif num == 3:
        return 'T'
    else:
        print 'ERROR: Num ' + str(num) + " not a valid char in DNA alphabet"
        return -1


for letter in range(len(x)):
    print letter
    y[letter] = x[getLetterToInt(letter)]
    letterCount += 1

np.set_printoptions(precision=2)
y

0
1
2
3


array([[ 0.24,  0.18,  0.19,  0.25,  0.41,  0.47,  0.09,  0.07,  0.21,
         0.29],
       [ 0.28,  0.26,  0.16,  0.42,  0.13,  0.24,  0.39,  0.49,  0.16,
         0.38],
       [ 0.2 ,  0.24,  0.22,  0.  ,  0.14,  0.15,  0.46,  0.43,  0.63,
         0.18],
       [ 0.29,  0.32,  0.43,  0.32,  0.32,  0.14,  0.06,  0.02,  0.  ,
         0.14]])

### Are the elements of a PWM interpretable as probabilites?

In [144]:
# verify that we're dealing with probabilities by summing up over all letters for each position
for pos in range(x.length):
    c = 0
    for letter in alphabet.letters:
        c += x[letter][pos]
    print str(pos) + " -> " + str(c)

0 -> 1.0
1 -> 1.0
2 -> 1.0
3 -> 1.0
4 -> 1.0
5 -> 1.0
6 -> 1.0
7 -> 1.0
8 -> 1.0
9 -> 1.0
