# Visualization
In this notebook, we explore what the trained model has actually learned.
This involves the following questions:
* **Verify that we actually get better while learning**
* **Look at the motifs we learn**
* What does the hidden layer tell us about the model?

In [1]:
%matplotlib inline

# some always important imports
import sys
import os
import random
import time
import numpy as np
import cPickle

# the underlying convRBM implementation
sys.path.append(os.path.abspath('../code'))
from convRBM import CRBM
import getData as dataRead

# plotting and data handling
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split

# the biopython stuff
import Bio.SeqIO as sio
import Bio.motifs.matrix as mat
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio import motifs as mot

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.


## Read in the data and a previously trained model
This part of the notebook reads the DHS data and loads a convolutional RBM that was trained on it beforehand. Training may take a lot of time, but only once a trained model exists is it possible to visualize what the model learnt.

In [11]:
seqReader = dataRead.FASTAReader()
allSeqs = seqReader.readSequencesFromFile('../data/wgEncodeAwgDnaseUwAg10803UniPk.fa')

#data = [allSeqs[random.randrange(0,len(allSeqs))] for i in range(20000)]
data = allSeqs
train_set, test_set = train_test_split(data, test_size=0.1)
print "Training set size: " + str(len(train_set))
print "Test set size: " + str(len(test_set))

start = time.time()
trainingData = np.array([dataRead.getOneHotMatrixFromSeq(t) for t in train_set])
testingData = np.array([dataRead.getOneHotMatrixFromSeq(t) for t in test_set])
# note: the timer covers the conversion of both the training and the test set
print "Conversion of test set in (in ms): " + str((time.time()-start)*1000)

Training set size: 154147
Test set size: 17128
ERROR. LETTER N DOES NOT EXIST!
ERROR. LETTER N DOES NOT EXIST!
Conversion of test set in (in ms): 18446.9199181
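
The implementation of `dataRead.getOneHotMatrixFromSeq` is not shown in this notebook. As a rough illustration of what such a conversion does, here is a minimal sketch. The helper `one_hot_dna`, the 4 x L layout, and the choice to leave unknown letters such as `N` as all-zero columns are assumptions, not the behaviour of the real function (which, judging by the output above, reports an error for `N`).

```python
import numpy as np

# Assumed row order; it matches the A, C, G, T order used later in this notebook.
LETTER_TO_ROW = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_dna(seq):
    """Hypothetical helper: encode a DNA string as a 4 x len(seq) matrix."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, letter in enumerate(seq.upper()):
        row = LETTER_TO_ROW.get(letter)
        if row is not None:  # unknown letters (e.g. N) stay all-zero
            mat[row, pos] = 1.0
    return mat

encoded = one_hot_dna("ACGTN")
```

Each column of `encoded` holds a single 1 in the row of its letter, except for the last column, which stays empty because `N` is not part of the assumed alphabet.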


In [2]:
# read in the model
learner = CRBM(9, 20, 0.001, 2)
learner.loadModel('../../models/trainedModel_2016_01_18_15_44.pkl')

In [6]:
print learner.observers[-1].scores[0].shape

(40, 1, 4, 7)


## Write Motifs to File

### Some basic functions to get motifs from the matrices

In [22]:
def getLetterToInt (num):
    # maps a row index (0-3) to its DNA letter, despite what the name suggests
    if num == 0:
        return 'A'
    elif num == 1:
        return 'C'
    elif num == 2:
        return 'G'
    elif num == 3:
        return 'T'
    else:
        print 'ERROR: Num ' + str(num) + " not a valid char in DNA alphabet"
        return -1

def createMotifFromMatrix (matrix, alphabet=IUPAC.unambiguous_dna):
    assert matrix.shape[0] == 4
    
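    # undo the log-odds transformation to recover the foreground probabilities:
    # matrix_ij = log(foreground/background) = log(foreground) - log(background)
    psm = matrix + np.log(0.25) # 0.25 if we treat all letters as equally probable
    psm = np.exp(psm)
    # NOTE: axis=1 normalizes each letter (row) across positions; use axis=0
    # if every position (column) should sum to 1 over A, C, G and T
    psm = psm / psm.sum(axis=1, keepdims=True)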
    
    # make this matrix a valid motif
    counts = {}
    for row in range(4):
        counts[getLetterToInt(row)] = (psm[row]).tolist()
    motif = mot.Motif(alphabet=alphabet, instances=None, counts=counts)
    return motif
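
The log-odds step in `createMotifFromMatrix` can be looked at in isolation with NumPy alone. The sketch below uses a hypothetical helper, `log_odds_to_probabilities`, and normalizes over the four letters (`axis=0`) so that every position sums to 1. That differs from the `axis=1` normalization used in the function above; which of the two is intended is not stated in the notebook, so treat the per-position variant as an assumption.

```python
import numpy as np

def log_odds_to_probabilities(matrix, background=0.25):
    """Hypothetical helper: 4 x L log-odds matrix -> per-position probabilities."""
    matrix = np.asarray(matrix, dtype=np.float64)
    assert matrix.shape[0] == 4
    # log(foreground) = log(foreground / background) + log(background)
    foreground = np.exp(matrix + np.log(background))
    # axis=0 sums over the four letters, so each column (position) sums to 1
    return foreground / foreground.sum(axis=0, keepdims=True)

# toy input: log-odds of a known probability matrix against a uniform background
known = np.array([[0.7, 0.25],
                  [0.1, 0.25],
                  [0.1, 0.25],
                  [0.1, 0.25]])
log_odds = np.log(known / 0.25)
probs = log_odds_to_probabilities(log_odds)
```

With this toy input, `probs` recovers the `known` matrix, which is a quick way to check that the transformation really inverts the log-odds.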

In [23]:
# first, get the motifs into single 2D matrices within a list
motifs = []
M = learner.motifs.get_value()
for i in range(0, M.shape[0], 2): # only add positive strands...
    motifs.append(M[i,0]) # second dim is 1, so just make it 2D
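
The loop above can also be written as a single slice. The following sketch uses a stand-in array instead of `learner.motifs.get_value()`. Its shape mirrors the `(40, 1, 4, 7)` printed earlier, but whether the real motif tensor has that shape, and whether even indices always hold the positive strands, are assumptions taken from the comment in the loop.

```python
import numpy as np

# Stand-in for learner.motifs.get_value(); the shape is an assumption
M = np.arange(40 * 1 * 4 * 7, dtype=np.float32).reshape(40, 1, 4, 7)

# every second filter (assumed positive strand), with the singleton axis dropped
positive = M[::2, 0]

# a plain list of 2D (4 x 7) matrices, equivalent to what the loop builds
motif_list = list(positive)
```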
    

In [24]:
t = motifs[4]
mt = createMotifFromMatrix(t)
print mt.format('transfac')

P0      A      C      G      T
01 0.29125064611434936523 0.080493859946727752686 0.13343438506126403809 0.048717729747295379639      A
02 0.14961674809455871582 0.13750812411308288574 0.039602167904376983643 0.21968689560890197754      N
03 0.093699343502521514893 0.071027889847755432129 0.19974099099636077881 0.1710863262414932251      N
04 0.02942527085542678833 0.073732957243919372559 0.13121739029884338379 0.26657161116600036621      T
05 0.065043516457080841064 0.39700272679328918457 0.080245129764080047607 0.18321444094181060791      C
06 0.18865780532360076904 0.16982139647006988525 0.25595951080322265625 0.04600308835506439209      N
07 0.18230664730072021484 0.070413053035736083984 0.15980041027069091797 0.064719945192337036133      N
XX
//



In [25]:
def weblogo(motif, fname, file_format="png_print", version="2.8.2", **kwds): 
    from Bio._py3k import urlopen, urlencode, Request 
    frequencies = motif.format('transfac') 
    url = 'http://weblogo.threeplusone.com/create.cgi' 
    values = {'sequences': frequencies, 
                    'format': file_format.lower(), 
                    'stack_width': 'medium', 
                    'stack_per_line': '40', 
                    'alphabet': 'alphabet_dna', 
                    'ignore_lower_case': True, 
                    'unit_name': "bits", 
                    'first_index': '1', 
                    'logo_start': '1', 
                    'logo_end': str(motif.length), 
                    'composition': "comp_auto", 
                    'percentCG': '', 
                    'scale_width': True, 
                    'show_errorbars': True, 
                    'logo_title': '', 
                    'logo_label': '', 
                    'show_xaxis': True, 
                    'xaxis_label': '', 
                    'show_yaxis': True, 
                    'yaxis_label': '', 
                    'yaxis_scale': 'auto', 
                    'yaxis_tic_interval': '1.0', 
                    'show_ends': True, 
                    'show_fineprint': True, 
                    'color_scheme': 'color_auto', 
                    'symbols0': '', 
                    'symbols1': '', 
                    'symbols2': '', 
                    'symbols3': '', 
                    'symbols4': '', 
                    'color0': '', 
                    'color1': '', 
                    'color2': '', 
                    'color3': '', 
                    'color4': '', 
                    } 
    values.update(dict((k, "" if v is False else str(v)) for k, v in kwds.items()))
    data = urlencode(values).encode("utf-8")
    req = Request(url, data)
    response = urlopen(req)
    with open(fname, "wb") as f:
        im = response.read()
        f.write(im)

In [26]:
weblogo(mt, 'test.png')
#freqs = mt.format('transfac')
#print freqs
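
The logo is requested with `'unit_name': "bits"`, so the height of each stack reflects the information content of that position. As a rough guide to reading the logos, the sketch below computes this quantity for a 4 x L probability matrix whose columns sum to 1. The helper `information_content` is hypothetical, and it ignores any small-sample correction that WebLogo may apply.

```python
import numpy as np

def information_content(probs):
    """Per-position information in bits for a 4 x L probability matrix."""
    probs = np.asarray(probs, dtype=np.float64)
    # replace zeros by 1 inside the log so that 0 * log2(0) counts as 0
    safe = np.where(probs > 0, probs, 1.0)
    entropy = -(probs * np.log2(safe)).sum(axis=0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA

example = np.array([[1.0, 0.25],
                    [0.0, 0.25],
                    [0.0, 0.25],
                    [0.0, 0.25]])
bits = information_content(example)
```

A fully conserved position reaches the maximum of 2 bits, while a uniform position carries no information.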

In [27]:
for count, m in enumerate(motifs):
    motif = createMotifFromMatrix(m)
    weblogo(motif, '../../learnedMotifs/learnedMotif_'+str(count)+'.png')


## Make a video from the motifs, with one subplot per motif in every frame
For that, we first have to find out how to get the logo image directly, without writing it to disk.
Then we can use Python's multimedia capabilities for some nice plotting!
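
One possible way to keep the images in memory is to wrap the raw PNG bytes (for example the result of `response.read()` in `weblogo`) in an `io.BytesIO` object and decode them with `plt.imread`. The sketch below is only an illustration: `png_bytes_to_array` and `make_dummy_png` are hypothetical helpers, and the dummy PNGs stand in for the WebLogo responses so that the example runs without network access.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; the notebook uses %matplotlib inline instead
import matplotlib.pyplot as plt

def png_bytes_to_array(png_bytes):
    """Decode PNG bytes into an image array without touching the disk."""
    return plt.imread(io.BytesIO(png_bytes), format="png")

def make_dummy_png():
    """Stand-in for a WebLogo response: render a tiny plot to PNG bytes."""
    fig, ax = plt.subplots(figsize=(2, 1))
    ax.plot([0, 1], [0, 1])
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

# one image per motif; four dummies here
images = [png_bytes_to_array(make_dummy_png()) for _ in range(4)]

# one frame: all images arranged as subplots of a single figure
fig, axes = plt.subplots(2, 2)
for ax, img in zip(axes.ravel(), images):
    ax.imshow(img)
    ax.axis("off")

# the frame itself can be kept in memory as PNG bytes as well
frame = io.BytesIO()
fig.savefig(frame, format="png")
plt.close(fig)
```

Frames built this way, one per training snapshot, could then be handed to a tool such as `matplotlib.animation` to assemble the video.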