## Introduction

This tutorial introduces you to the Theano library. Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones involving multi-dimensional arrays (numpy.ndarray).

In machine learning, you have already been introduced to by far the most popular library, scikit-learn. So why introduce a seemingly less popular library in the same field? What sets Theano apart is speed: for problems involving large amounts of data it can attain speeds rivaling hand-crafted C implementations, and by running the same computation on a recent GPU it can be far faster than C on a CPU. The key point is that Theano lets you write a model specification rather than a model implementation: you describe the computation symbolically and Theano generates the code that runs it. This is particularly useful for deep learning, where running on a GPU substantially speeds up training.

How does Theano relate to other mathematical libraries? Theano sits somewhere between NumPy and the Python symbolic mathematics library SymPy.

To get a flavor of how to use Theano, let's have a look at a simple example. It doesn’t show off many of Theano’s features, but it illustrates concretely what Theano is.

In [29]:
import numpy as np
import theano
import theano.tensor as T
import time

In [30]:
x = T.dscalar('x')
y = T.dscalar('y')
z = x + y
f = theano.function([x, y], z)
print(np.allclose(f(15.9, 6.7), 22.6))

True


## Tutorial content

In this tutorial, we will first go through some basic Theano operations -- algebra, derivatives, conditions, and loops. Then we will apply them in an application example. Further resources are listed at the end.
<ul>
<li><a href='#dest1'>Algebra</a></li>
<li><a href='#dest2'>Derivatives in Theano</a></li>
<li><a href='#dest3'>Conditions -- ifelse and switch</a></li>
<li><a href='#dest4'>Looping using scan</a></li>
<li><a href='#dest5'>Application example: classifying MNIST digits using logistic regression</a></li>
<li><a href='#dest6'>Further resources</a></li>
<li><a href='#dest7'>Reference</a></li>
</ul>

<a id='dest1'></a>
## Algebra

### Adding two Scalars

The simple example at the beginning shows the essentials of adding two scalars. Let's go through it.

**Step 1**
Using <i style="background-color:#D5F5E3">dscalar</i>, we declare two double-precision scalar symbols. Note that x and y are instances of TensorVariable, while their 'type' field is a TensorType.

**Step 2**
Combine x and y into their sum z.

**Step 3**
Create a function taking x and y as inputs and giving z as output. The first argument to theano.function is a list of Variables that will be provided as inputs to the function. The second argument is a single Variable or a list of Variables. f may then be used like a normal Python function.

### Adding two Matrices

Adding two matrices is very similar to adding two scalars. The only change from the previous example is that you instantiate x and y using the matrix Types:

In [31]:
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = theano.function([x, y], z)
print(np.allclose(f(np.array([[1, 2], [3, 4]]), np.array([[10, 20], [30, 40]])), np.array([[11, 22], [33, 44]])))

True


<a id='dest2'></a>
## Derivatives in Theano

### Computing Gradients

To compute gradients we use the macro <i style="background-color:#D5F5E3">T.grad</i>. For instance, we can compute the derivative
\begin{equation*}
\frac{d\,s(x)}{dx}
\end{equation*}
where
\begin{equation*}
s(x) = \sum_{i,j} \frac{1}{1 + \exp(-x_{ij})}
\end{equation*}

In [32]:
x = T.dmatrix('x')
s = T.sum(1 / (1 + T.exp(-x)))
gs = T.grad(s, x)
dlogistic = theano.function([x], gs)
dlogistic([[0, 1], [-1, -2]])

array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])
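The values above can be cross-checked without Theano. The sketch below is an illustration in plain Python, not part of the notebook: it assumes the elementwise derivative of the logistic function is sigmoid(v) * (1 - sigmoid(v)) and compares it with a central finite difference. The helper names `sigmoid`, `sigmoid_grad`, and `central_difference` are made up for this example.

```python
import math


def sigmoid(v):
    """Logistic function 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_grad(v):
    """Closed-form derivative of the logistic function."""
    s = sigmoid(v)
    return s * (1.0 - s)


def central_difference(f, v, eps=1e-6):
    """Numerical derivative of a scalar function f at v."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)


# Same input as the dlogistic call above.
inputs = [[0.0, 1.0], [-1.0, -2.0]]

# Because s(x) is a sum of independent terms, each entry of the gradient
# only depends on the matching entry of x.
analytic = [[sigmoid_grad(v) for v in row] for row in inputs]
numeric = [[central_difference(sigmoid, v) for v in row] for row in inputs]

for a_row, n_row in zip(analytic, numeric):
    print(["%.8f" % a for a in a_row], ["%.8f" % n for n in n_row])
```

Both the closed form and the finite difference should agree with the array returned by dlogistic to within rounding.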

### Computing the Jacobian

In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function. Check out https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant. Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following text explains how to do it manually.

In order to manually compute the Jacobian of some function y with respect to some parameter x we need to use <i style="background-color:#D5F5E3">scan</i>. What we do is to loop over the entries in y and compute the gradient of y[i] with respect to x.

In [33]:
x = T.dvector('x')
y = x ** 2
J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y,x])
f = theano.function([x], J, updates=updates)
print(f([2, 4]))

[[ 4.  0.]
 [ 0.  8.]]
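To see what the scan loop computes, here is a rough plain-Python analogue that builds the same Jacobian by finite differences. It is only a sketch for checking the numbers: Theano differentiates symbolically rather than numerically, and the helper names `square_elements` and `numerical_jacobian` are invented for this illustration.

```python
def square_elements(x):
    """The function y = x ** 2, applied elementwise to a list."""
    return [v ** 2 for v in x]


def numerical_jacobian(f, x, eps=1e-6):
    """Approximate jac[i][j] = d f(x)[i] / d x[j] with central differences."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus = f(x_plus)
        f_minus = f(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


jacobian = numerical_jacobian(square_elements, [2.0, 4.0])
for row in jacobian:
    print(["%.4f" % entry for entry in row])
```

Row i of the result corresponds to one iteration of the scan loop, that is, the gradient of y[i] with respect to x.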


### Computing the Hessian

In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. Check out https://en.wikipedia.org/wiki/Hessian_matrix. Theano implements the theano.gradient.hessian() macro, which does all that is needed to compute the Hessian.

You can compute the Hessian manually similarly to the Jacobian. The only difference is that now, instead of computing the Jacobian of some expression y, we compute the Jacobian of T.grad(cost,x), where cost is some scalar.

In [34]:
x = T.dvector('x')
y = x ** 2
cost = y.sum()
gy = T.grad(cost, x)
H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
f = theano.function([x], H, updates=updates)
print(f([2, 4]))

[[ 2.  0.]
 [ 0.  2.]]
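This output can be verified by hand. The short derivation below writes $x_i$ for the entries of x and $\delta_{ij}$ for the Kronecker delta (notation introduced here for the check):
\begin{equation*}
\text{cost}(x) = \sum_i x_i^2
\quad\Rightarrow\quad
\frac{\partial\, \text{cost}}{\partial x_i} = 2 x_i
\quad\Rightarrow\quad
H_{ij} = \frac{\partial^2 \text{cost}}{\partial x_i \, \partial x_j} = 2\,\delta_{ij}
\end{equation*}
So the Hessian is twice the identity matrix whatever the value of x, which is why the input [2, 4] does not appear in the result.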


<a id='dest3'></a>
## Conditions -- ifelse and switch

Theano provides ifelse and switch as condition statements. 
<p style="background-color:#D5F5E3">ifelse(condition, var1, var2) -- if condition is true, return var1, otherwise return var2</p>
<p style="background-color:#D5F5E3">switch(tensor, var1, var2) -- if tensor is true, return var1, otherwise return var2</p>
<p>Whereas switch evaluates both output variables, ifelse is lazy and only evaluates one variable with respect to the condition. In other words, if condition is true, ifelse only calculates var1, while switch calculates both var1 and var2.</p>
<p>Suppose we have two scalar variables $a$, $b$, and two matrices $\mathbf{x}$, $\mathbf{y}$. Define a function:</p>
$$ 
\mathbf z = f(a, b,\mathbf{x}, \mathbf{y}) = \left\{ 
\begin{aligned}
    \mathbf x & ,\ a \le b\\
    \mathbf y & ,\ a > b
\end{aligned}
\right.
$$
Declare variables:

In [35]:
a,b = T.scalars('a', 'b')
x,y = T.matrices('x', 'y')

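Build with ifelse. Note that T.lt() is the strict "less than" comparison; "less than or equal" is T.le(), and T.gt() and T.ge() are "greater than" and "greater than or equal". The cells below use T.lt(a, b), so the case a = b selects y rather than x, and they return the mean of the selected matrix rather than the matrix itself to keep the output small: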

In [36]:
z_lazy = theano.ifelse.ifelse(T.lt(a, b), T.mean(x), T.mean(y))
f_lazyifelse = theano.function([a, b, x, y], z_lazy, mode=theano.Mode(linker='vm'))

Build with switch:

In [37]:
z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
f_switch = theano.function([a, b, x, y], z_switch, mode=theano.Mode(linker='vm'))

Test data:

In [38]:
val1 = 0.
val2 = 1.
big_mat1 = np.ones((10000, 1000), dtype=theano.config.floatX)
big_mat2 = np.ones((10000, 1000), dtype=theano.config.floatX)

Compare performance:

In [39]:
n_times = 10

# time.time() and range() work on both Python 2 and Python 3
tic = time.time()
for i in range(n_times):
    f_switch(val1, val2, big_mat1, big_mat2)
print('switch time = %f' % (time.time() - tic))

tic = time.time()
for i in range(n_times):
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
print('ifelse time = %f' % (time.time() - tic))

switch time = 0.229577
ifelse time = 0.126928


In this example, the IfElse op takes less time (about half as much) than Switch, since it computes only one of the two variables. Note that unless linker='vm' or linker='cvm' is used, ifelse will compute both variables and take the same computation time as switch.
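The difference between the two ops can be pictured with a plain-Python analogy. This is only a sketch of eager versus lazy evaluation, not how Theano implements either op; `eager_select`, `lazy_select`, and `make_counted_mean` are names made up for the illustration.

```python
def make_counted_mean(values, counter, key):
    """Return a zero-argument function that records each time it is evaluated."""
    def compute():
        counter[key] += 1
        return sum(values) / len(values)
    return compute


def eager_select(condition, branch_true, branch_false):
    """Analogue of switch: both branches are computed, then one is picked."""
    value_true = branch_true()
    value_false = branch_false()
    return value_true if condition else value_false


def lazy_select(condition, branch_true, branch_false):
    """Analogue of a lazy ifelse: only the chosen branch is computed."""
    return branch_true() if condition else branch_false()


a, b = 0.0, 1.0
x_values = [1.0] * 4
y_values = [3.0] * 4

eager_counts = {"x": 0, "y": 0}
eager_result = eager_select(a < b,
                            make_counted_mean(x_values, eager_counts, "x"),
                            make_counted_mean(y_values, eager_counts, "y"))

lazy_counts = {"x": 0, "y": 0}
lazy_result = lazy_select(a < b,
                          make_counted_mean(x_values, lazy_counts, "x"),
                          make_counted_mean(y_values, lazy_counts, "y"))

print(eager_result, eager_counts)
print(lazy_result, lazy_counts)
```

Both calls return the same value, but the eager version evaluates the mean of both inputs while the lazy version evaluates only the one it needs, which is consistent with the roughly halved running time measured above.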

<a id='dest4'></a>
## Looping using scan

### Computing A^k

Assume that A is a tensor and you want to compute A^k elementwise. The Python/NumPy code might look like:
<pre>
result = 1
for i in range(k):
    result = result * A
</pre>
<p>The equivalent Theano code is:</p>

In [40]:
k = T.iscalar("k")
A = T.vector("A")

# Symbolic description of the result
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)
final_result = result[-1]
power = theano.function(inputs=[A,k], outputs=final_result, updates=updates)

print(power(range(10),2))
print(power(range(10),4))

[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]
[  0.00000000e+00   1.00000000e+00   1.60000000e+01   8.10000000e+01
   2.56000000e+02   6.25000000e+02   1.29600000e+03   2.40100000e+03
   4.09600000e+03   6.56100000e+03]


<p>Scan returns a tuple containing our result (result) and a dictionary of updates (empty in this case). Note that result holds the value for every step, not just the last one: since A is a vector here, result is a matrix whose rows are A^1, A^2, ..., A^k. We want the last value (after k steps), so we compile a function that returns just result[-1]. At compile time, an optimization detects that only the last value of the result is used and ensures that scan does not store all the intermediate values. So do not worry if A and k are large.</p>

Listed below are three things that we need to handle in the A^k example:
<table>
<tr>
<th>for loop</th>
<th>scan</th>
</tr>
<tr>
<td>Initial value assigned to result</td>
<td>Initialization occurs in outputs_info</td>
</tr>
<tr>
<td>Accumulation of results in result</td>
<td>Accumulation happens automatically</td>
</tr>
<tr>
<td>Unchanging variable A</td>
<td>Unchanging variables are passed to scan as non_sequences</td>
</tr>
</table>
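The correspondence in the table can be made concrete with a simplified plain-Python stand-in for scan. This is a sketch for intuition only: the real theano.scan builds a symbolic loop and supports many more options, and `scan_like` and `elementwise_product` are hypothetical helpers written for this example.

```python
def scan_like(fn, initial, non_sequences, n_steps):
    """Apply fn n_steps times, feeding each output back in as the next input.

    Every intermediate result is kept, mirroring how scan accumulates outputs.
    """
    results = []
    prior = initial  # plays the role of outputs_info
    for _ in range(n_steps):
        prior = fn(prior, *non_sequences)  # non_sequences are passed unchanged
        results.append(prior)
    return results


def elementwise_product(prior_result, A):
    """One step of the loop: multiply the running result by A elementwise."""
    return [p * a for p, a in zip(prior_result, A)]


A = [float(v) for v in range(10)]
all_steps = scan_like(elementwise_product,
                      initial=[1.0] * len(A),
                      non_sequences=[A],
                      n_steps=4)
final_result = all_steps[-1]  # only the last step is A^4

print(len(all_steps))
print(final_result)
```

Keeping `all_steps` in memory is exactly the cost that Theano's compile-time optimization avoids when only the last value is used.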

<a id='dest5'></a>
## Application example: classifying MNIST digits using logistic regression

Multiclass classification is the task of predicting some discrete-valued output
$$y^{(i)} \in \{1,2,\ldots,k\}.$$
To accomplish this task, we're going to expand our notion of a hypothesis function a bit.  Instead of having a scalar-valued hypothesis function (e.g., for binary classification, where it determines the confidence level in our prediction), in multiclass classification we have a _vector-valued_ hypothesis function
$$h_\theta : \mathbb{R}^n \rightarrow \mathbb{R}^k$$
Fortunately, everything else remains much the same as in binary classification. We just need to specify the three ingredients of any machine learning algorithm:
1. The hypothesis class
2. The loss function
3. The optimization approach

### The hypothesis class
<p>Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix W and a bias vector b. Taking class K as the reference class, the probability that an input vector x is a member of class k can be written as:
\begin{eqnarray}
P(Y=k \mid x) = \frac{\text{exp}(\beta_{k0} + \beta^{T}_k x)}{1 + \sum^{K-1}_{l=1} \text{exp}(\beta_{l0} + \beta^{T}_{l} x)}
\end{eqnarray}
where $k \in \{1,\ldots,K-1 \}$, $\beta_k$ is the weight vector of class k, and $\beta_{k0}$ is its intercept.</p>
<p>Written in matrix form:
\begin{eqnarray}
P(Y=k \mid x; W,b) = \frac{\text{exp}(W_k x + b_k)}{\sum^{K}_{l=1} \text{exp}(W_l x + b_l)}
\end{eqnarray}</p>
<p>We therefore classify an image as the digit with the highest probability among the digits 0 to 9:
\begin{eqnarray}
y_{\text{pred}} = \text{argmax}_k P(Y=k \mid x; W,b)
\end{eqnarray}</p>
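The matrix form and the argmax rule can be sketched in a few lines of plain Python. The numbers below are made-up toy values for three classes and two features, and W is stored here as one row of weights per class to match the $W_k x$ notation (the Theano class further down stores it as n_in by n_out instead).

```python
import math


def class_scores(W, b, x):
    """Compute W_k x + b_k for every class k."""
    return [sum(w * v for w, v in zip(W_k, x)) + b_k for W_k, b_k in zip(W, b)]


def softmax(scores):
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (illustrative values only).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
x = [0.2, 0.4]

probs = softmax(class_scores(W, b, x))
y_pred = max(range(len(probs)), key=lambda k: probs[k])  # argmax over classes

print(probs)
print(y_pred)
```

The probabilities sum to one, and the predicted class is simply the index of the largest one.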

### The loss function

For logistic regression, the negative log-likelihood for N observations of training data, where $y_i$ is the correct label of example $x_i$, is given by:
\begin{eqnarray}
\ell (\theta = \{ W, b \} \mid \mathcal{D}) = - \sum_{i=1}^N \log (P(Y=y_i \mid x_i; \theta))
\end{eqnarray}
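As a small worked example, the negative log-likelihood adds up minus the log of the probability that the model assigns to each example's correct label. The probabilities and labels below are invented for illustration, and the helper name simply mirrors the method defined later in the tutorial.

```python
import math


def negative_log_likelihood(probabilities, labels):
    """Sum over examples of -log P(Y = y_i | x_i)."""
    return -sum(math.log(p_row[y]) for p_row, y in zip(probabilities, labels))


# One row of class probabilities per example (toy values).
probabilities = [[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]]
labels = [0, 1, 2]  # index of the correct class for each example

total_nll = negative_log_likelihood(probabilities, labels)
mean_nll = total_nll / len(labels)

print(total_nll)
print(mean_nll)
```

Dividing by the number of examples gives the mean, which is what the Theano implementation below minimizes so that the learning rate depends less on the batch size.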


### The optimization approach

<p>We will use stochastic gradient descent to minimize our loss function. In machine learning and other statistical estimation problems, objective functions often have the form:
\begin{eqnarray}
f(x) = \sum_{i=1}^n f_i (x)
\end{eqnarray}</p>
<p>That is, the objective function is a sum of functions $f_i$, each of which is typically associated with the i-th observation of the training data set.</p>
<p>Stochastic gradient descent is similar to gradient descent, except that instead of evaluating the partial derivatives $\partial f_i / \partial x_j$ of all summands $f_i$ at each step, it evaluates them only for a random subset of the summands. This greatly reduces the computational cost of each step.</p>
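A minimal sketch of the idea on a toy problem, under assumptions chosen for this illustration: each summand is f_i(x) = 0.5 * (x - t_i)^2 for made-up targets t_i, one summand is sampled per step, and the step size decays as 1 / (step + 1). None of these choices come from the MNIST code below.

```python
import random

# Toy data: the full objective sum_i f_i(x) is minimized at the mean of the targets.
targets = [1.0, 2.0, 3.0, 4.0]


def grad_f_i(x, i):
    """Gradient of the single summand f_i(x) = 0.5 * (x - targets[i]) ** 2."""
    return x - targets[i]


rng = random.Random(0)  # fixed seed so the run is repeatable
x = 0.0
for step in range(2000):
    i = rng.randrange(len(targets))       # pick one random summand
    learning_rate = 1.0 / (step + 1)      # decaying step size
    x -= learning_rate * grad_f_i(x, i)   # step along that summand's gradient only

full_minimizer = sum(targets) / len(targets)
print(x, full_minimizer)
```

Each step touches a single summand instead of all of them, yet the iterate still ends up close to the minimizer of the full sum.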
<p>The code below performs these steps:</p>
1. Create a class that encapsulates the logistic regression model. It contains the weight matrix W, the bias vector b, a symbolic expression for P(Y=k∣x;W,b), and the class prediction y_pred. The class also defines the negative log-likelihood loss symbolically and how to calculate the error rate for a particular batch.
2. Decompress the MNIST gzip file and build the model.
3. Train the model, printing iteration and error information along the way.
4. Save the trained model.
5. Load the saved model and predict labels.

In [41]:
import six.moves.cPickle as pickle
import gzip

class LogisticRegression(object):

    def __init__(self, input, n_in, n_out):
        """ 
        input (theano.tensor.TensorType): symbolic variable describing the input of the architecture (one minibatch)
        n_in (int): number of features - n
        n_out (int): number of classes - k
        """
        # Use borrow=True to avoid costly deep copying. This is similar to passing by 
        # reference in C++. However this parameter will have no effect on a GPU.
        self.W = theano.shared(
            value=np.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        self.b = theano.shared(
            value=np.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression of P(Y=k∣x;W,b)
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic expression of maximizing probability of classes
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)

        self.params = [self.W, self.b]
        self.input = input

    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution. We use the mean instead of the sum so that the 
        learning rate is less dependent on the batch size.

        y (theano.tensor.TensorType): corresponds to a vector that gives for each example the correct label
        """
        # y.shape[0]: number of records
        # T.arange(y.shape[0]): a symbolic vector containing [0,1,2,... y.shape[0]-1] 
        # T.log(self.p_y_given_x): a matrix of Log-Probabilities
        # T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]: a vector containing the log-likelihood of the
        # correct class for each training example
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

    def errors(self, y):
        """Return the error rate of the minibatch: the fraction of examples that are misclassified.
        
        y (theano.tensor.TensorType): a vector that gives for each example the correct label
        """
        # check if y has the same dimension as y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1 represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()


def load_data(dataset):
    with gzip.open(dataset, 'rb') as f:
        try:
            train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
        except TypeError:  # Python 2's pickle.load() does not accept an encoding argument
            train_set, valid_set, test_set = pickle.load(f)

            
    def shared_dataset(data_xy, borrow=True):
        """ Loads the dataset into shared variables.
        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code runs on GPU).
        Since copying data into the GPU is slow, copying a minibatch everytime
        is needed (the default behaviour if the data is not in a shared
        variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(np.asarray(data_x, dtype=theano.config.floatX), borrow=borrow)
        shared_y = theano.shared(np.asarray(data_y, dtype=theano.config.floatX), borrow=borrow)
        # When storing data on the GPU it has to be stored as floats. But during our computations
        # we need them as ints (we use labels as index, and if they are floats it doesn't make sense) 
        # therefore instead of returning ``shared_y`` we will have to cast it to int.
        return shared_x, T.cast(shared_y, 'int32')

    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)

    rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
            (test_set_x, test_set_y)]
    return rval


def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           dataset='mnist.pkl.gz',
                           batch_size=600):
    """
    Stochastic gradient descent optimization

    learning_rate (float)
    n_epochs (int): maximal iterations to run the optimizer
    dataset (string)
    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size

    #####build model######
    index = T.lscalar()
    x = T.matrix('x')
    y = T.ivector('y')

    # each MNIST image is 28*28 pixels
    classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)

    # the cost we minimize during training is the negative log likelihood
    cost = classifier.negative_log_likelihood(y)

    # compiling a Theano function that computes the mistakes of a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    # compute the gradient of cost
    g_W = T.grad(cost=cost, wrt=classifier.W)
    g_b = T.grad(cost=cost, wrt=classifier.b)

    # specify how to update the parameters
    updates = [(classifier.W, classifier.W - learning_rate * g_W),
               (classifier.b, classifier.b - learning_rate * g_b)]

    # compiling a Theano function `train_model` that returns the cost, and updates the parameter 
    # of the model based on the rules defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    #######train model########
    # early-stopping parameters
    # patience: the minimum number of minibatch iterations to run regardless of the validation results.
    # Whenever the validation error improves significantly, patience is extended so training runs longer.
    patience = 5000  
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is considered significant
    # how often to assess the classification performance on the validation set
    validation_frequency = min(n_train_batches, patience // 2)                            
    best_validation_loss = np.inf
    test_score = 0.
    start_time = time.time()
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):
            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            # If the number of iterations is a multiple of the validation frequency, the validation loss
            # is calculated and printed.
            if (iter + 1) % validation_frequency == 0:
                validation_losses = [validate_model(i) for i in range(n_valid_batches)]
                this_validation_loss = np.mean(validation_losses)

                print(
                    'epoch %i, minibatch %i/%i, validation error %f %%' %
                    (
                        epoch,
                        minibatch_index + 1,
                        n_train_batches,
                        this_validation_loss * 100.
                    )
                )

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    #improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    best_validation_loss = this_validation_loss
                    # test it on the test set
                    test_losses = [test_model(i) for i in range(n_test_batches)]
                    test_score = np.mean(test_losses)

                    print(
                        (
                            '     epoch %i, minibatch %i/%i, test error of'
                            ' best model %f %%'
                        ) %
                        (
                            epoch,
                            minibatch_index + 1,
                            n_train_batches,
                            test_score * 100.
                        )
                    )

                    # save the best model
                    with open('best_model.pkl', 'wb') as f:
                        pickle.dump(classifier, f)
            # Once the iteration count reaches the current patience, stop training altogether.
            if patience <= iter:
                done_looping = True
                break

    end_time = time.time()
    print(
        (
            'Optimization complete with best validation score of %f %%,'
            ' with test performance %f %%'
        )
        % (best_validation_loss * 100., test_score * 100.)
    )
    print('total time = %f' % (end_time - start_time))


def predict():
    """
    Load a trained model and use it to predict labels.
    """
    # load the saved model
    with open('best_model.pkl', 'rb') as f:  # pickles must be read in binary mode
        classifier = pickle.load(f)

    # compile a predictor function
    predict_model = theano.function(
        inputs=[classifier.input],
        outputs=classifier.y_pred)

    # test it on part of the test set
    dataset='mnist.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()

    predicted_values = predict_model(test_set_x[:20])
    print("Predicted values for the first 20 examples in test set:")
    print(predicted_values)
    print("Correct labels for the first 20 examples in test set:")
    print(test_set_y.eval()[:20])


sgd_optimization_mnist()
predict()

epoch 1, minibatch 83/83, validation error 12.458333 %
     epoch 1, minibatch 83/83, test error of best model 12.375000 %
epoch 2, minibatch 83/83, validation error 11.010417 %
     epoch 2, minibatch 83/83, test error of best model 10.958333 %
epoch 3, minibatch 83/83, validation error 10.312500 %
     epoch 3, minibatch 83/83, test error of best model 10.312500 %
epoch 4, minibatch 83/83, validation error 9.875000 %
     epoch 4, minibatch 83/83, test error of best model 9.833333 %
epoch 5, minibatch 83/83, validation error 9.562500 %
     epoch 5, minibatch 83/83, test error of best model 9.479167 %
epoch 6, minibatch 83/83, validation error 9.322917 %
     epoch 6, minibatch 83/83, test error of best model 9.291667 %
epoch 7, minibatch 83/83, validation error 9.187500 %
     epoch 7, minibatch 83/83, test error of best model 9.000000 %
epoch 8, minibatch 83/83, validation error 8.989583 %
     epoch 8, minibatch 83/83, test error of best model 8.958333 %
epoch 9, minibatch 83/83, 
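
The "83/83" in the log follows from the integer division used to compute n_train_batches. The sketch below reproduces that arithmetic, assuming the training split holds 50,000 examples (the usual size for this mnist.pkl.gz file, though the tutorial does not state it); `minibatch_slices` is a helper invented for the illustration.

```python
def minibatch_slices(n_examples, batch_size):
    """Return the (start, stop) index pair of every full minibatch."""
    n_batches = n_examples // batch_size  # incomplete trailing batch is dropped
    return [(index * batch_size, (index + 1) * batch_size)
            for index in range(n_batches)]


n_examples = 50000  # assumed size of the training split
batch_size = 600    # default used by sgd_optimization_mnist

slices = minibatch_slices(n_examples, batch_size)
unused_examples = n_examples - slices[-1][1]

print(len(slices))       # number of minibatches per epoch
print(slices[-1])        # index range of the last minibatch
print(unused_examples)   # examples never visited because of the floor division
```

Under that assumption, each epoch runs 83 minibatches and the last 200 training examples are never used, because they do not fill a complete batch.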

<a id='dest6'></a>
## Further resources

1. <a href="http://deeplearning.net/software/theano/tutorial/index.html#advanced">Theano advanced</a>
2. <a href="http://ufldl.stanford.edu/tutorial/supervised/LogisticRegression/">Logistic Regression</a>
3. <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">Wikipedia: Stochastic Gradient Descent</a>
4. <a href="https://www.udacity.com/course/intro-to-parallel-programming--cs344">Udacity: Intro to Parallel Programming - Using CUDA to Harness the Power of GPUs</a>
5. <a href="https://www.coursera.org/learn/machine-learning">Coursera: Machine Learning by Andrew Ng</a>

<a id='dest7'></a>
## Reference

1. https://www.quantstart.com/articles/Deep-Learning-with-Theano-Part-1-Logistic-Regression
2. http://ufldl.stanford.edu/tutorial/supervised/LogisticRegression/
3. http://deeplearning.net/software/theano
4. homework4 mnist

The dataset and best model in this tutorial are available at https://drive.google.com/drive/folders/0B5-WcG2D1SViMjRqSmpEMVZERzQ