Have a Taste of DNN on the Shoulder of Theano
====

![image](fruit.png)

## Outline
* Introduction to Theano
* Softmax regression
* Highlights of DNN
* Neural network
    * Multilayer perceptron
    * Forward propagation
    * Backward propagation
* Sparse autoencoder
    * Autoencoder
    * Sparse autoencoder
* Building deep networks for classification
    * Model structure
    * Pre-training
    * Fine-tuning
    * Experimental results

## Introduction to Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions.
* Efficient symbolic differentiation
* Efficient handling of matrices
* Tight integration with NumPy
* Dynamic C code generation
* Transparent use of GPU

#### Fast to develop and fast to run
![image](fast.png)

#### Machine learning libraries built on top of Theano:
* Pylearn2
    * great flexibility and a good choice for trying out ML ideas
* PyMC3
    * Probabilistic programming; building statistical Bayesian models
* Sklearn-theano
    * Easy-to-use deep learning tool
* Lasagne
    * Lightweight library to build neural networks

#### Models that have been built with Theano:
* Neural networks
* Convolutional Neural Networks (CNN)
* Recurrent Neural Networks (RNN)
* Long Short-Term Memory (LSTM)
* Autoencoders
* GoogLeNet
* Overfeat
...


#### Symbolic variables in Theano
* Variable (C, Java, Python, etc.)
    * A segment of physical storage in RAM
    * Operations are based on value passing between variables
* Tensor (Theano)
    * A mathematical symbol
    * No physical storage in RAM to hold its value
    * Operations are actually building connections between tensors
* Shared variable (Theano)
    * Hybrid of variable and tensor
    * Tensor with physical storage in RAM to hold its value


In [12]:
import theano
import theano.tensor as T

x = T.dvector(name='x')
def f(x):
    return x ** 2
y = f(x)

theano.printing.pydotprint(theano.function([x], y), '1.png')

The output file is available at 1.png


![image](1.png)

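_theano.function_ brings Theano variables to life: it compiles a symbolic expression into a callable function that accepts and returns numeric values.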

In [9]:
import theano
import theano.tensor as T
import numpy as np

x = T.dvector(name='x')
def f(x):
    return x ** 2
y = f(x)

pow2 = theano.function(inputs=[x], outputs=y)

a = np.array([1,2,3], dtype=theano.config.floatX)
b = pow2(a)
print "a is {a}, b is {b}".format(a=a, b=b)

a is [ 1.  2.  3.], b is [ 1.  4.  9.]


In [11]:
import theano
import theano.tensor as T
import numpy as np

x = T.dvector('x')
y = x.sum()
grad = T.grad(cost=y, wrt=[x])
grad_func = theano.function(inputs=[x], outputs=grad)

a = np.array([1,2,3], dtype=theano.config.floatX)
print grad_func(a)

[array([ 1.,  1.,  1.])]
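
The result above can be sanity-checked numerically. The following is a minimal NumPy sketch (not Theano) that estimates gradients by central differences; the helper name `numerical_grad` and the step size `eps` are illustrative assumptions, not part of Theano's API.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

a = np.array([1.0, 2.0, 3.0])
# d/dx sum(x) is 1 in every coordinate, matching the symbolic gradient above
grad_sum = numerical_grad(lambda v: v.sum(), a)
# d/dx sum(x^2) is 2x, the gradient of the earlier squaring example summed up
grad_sq = numerical_grad(lambda v: (v ** 2).sum(), a)
print(grad_sum)
print(grad_sq)
```

Symbolic differentiation gives the exact expression, whereas this numerical estimate is only approximate and needs two function evaluations per coordinate.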


## Softmax Regression

### The Model

In the softmax regression setting, we are interested in multi-class classification. Suppose we have $m$ samples in the training set $\{(x^{(1)}, y^{(1)}),...,(x^{(m)}, y^{(m)})\}$, where $y^{(i)}\in \{1,2,...,k\}$ and $x^{(i)}\in R^{n}$.

Given a test sample $x$, we want to estimate the probability that $x$ belongs to class $j$, i.e., $p(y=j|x)$, for all possible $j$.

\begin{equation}
h_{W,b}(x) = 
\left[
  \begin{array}{c}
  p(y=1|x;W,b)\\
  p(y=2|x;W,b)\\
  ...\\
  p(y=k|x;W,b)\\
  \end{array}
\right]
=\frac{1}{\sum_{j=1}^{k}{e^{w_j^Tx+b_j}}}
\left[
  \begin{array}{c}
  e^{w_1^Tx+b_1}\\
  e^{w_2^Tx+b_2}\\
  ...\\
  e^{w_k^Tx+b_k}\\
  \end{array}
\right]
\end{equation}
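
As a rough illustration of the hypothesis for a single sample, here is a minimal NumPy sketch. It assumes $W$ is stored with row $j$ equal to $w_j^T$ and $b$ as a flat vector; the function name `softmax_hypothesis` and the toy numbers are hypothetical.

```python
import numpy as np

def softmax_hypothesis(W, b, x):
    """Return the vector of class probabilities p(y=j | x; W, b).

    W: (k, n) array whose row j is w_j^T; b: (k,) array; x: (n,) array.
    """
    scores = W.dot(x) + b          # w_j^T x + b_j for every class j
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([2.0, 1.0])
p = softmax_hypothesis(W, b, x)
print(p)
```

The entries of `p` are positive and sum to one, and the class with the largest score $w_j^Tx+b_j$ receives the largest probability.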

When you implement softmax regression, it is usually convenient to represent $W$ as a $k$-by-$n$ matrix, so that
\begin{equation}
W = \left[\begin{array}{c}
w_1^T\\
...\\
w_k^T
\end{array}\right]
\end{equation}


### Defining a Loss Function
The loss of $h_{W, b}$ on the training set is
\begin{equation}
J(W, b) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}1\{y^{(i)}=j\}\log\frac{e^{w_j^Tx^{(i)}+b_j}}{\sum_{l=1}^{k}e^{w_l^Tx^{(i)}+b_l}}\right] + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=1}^{n}w_{ij}^2
\end{equation}

The second term is a weight decay term. Many different settings of $W$ and $b$ can yield the same minimal training error, and the penalty disambiguates among them.


$J(W,b)$ is a convex function, so gradient descent will not get stuck in a local optimum.
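
The loss can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: labels are 0-based (`0..k-1`, as in MNIST), samples are stored one per column, and the bias is included in the scores as in the model equation. The name `softmax_loss` is hypothetical.

```python
import numpy as np

def softmax_loss(W, b, X, y, lam):
    """Mean negative log-likelihood plus (lam / 2) * sum of squared weights.

    W: (k, n), b: (k,), X: (n, m) with one sample per column,
    y: (m,) integer labels in 0..k-1.
    """
    m = X.shape[1]
    scores = W.dot(X) + b[:, None]
    # subtracting the column maximum leaves the softmax unchanged
    # but avoids overflow in exp
    scores = scores - scores.max(axis=0, keepdims=True)
    log_p = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    data_term = -log_p[y, np.arange(m)].mean()
    decay_term = 0.5 * lam * (W ** 2).sum()
    return data_term + decay_term

# With all-zero parameters every class gets probability 1/k,
# so the loss equals log(k).
X = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7]])
y = np.array([0, 2, 1])
loss_at_zero = softmax_loss(np.zeros((3, 2)), np.zeros(3), X, y, lam=1e-4)
print(loss_at_zero)
```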

### Learning the Model

Gradient descent

$$W \leftarrow W - \alpha \frac{\partial{J(W,b)}}{\partial{W}}$$

$$b \leftarrow b - \alpha \frac{\partial{J(W,b)}}{\partial{b}}$$

where $\alpha$ is the learning rate.

$$\frac{\partial{J(W,b)}}{\partial{w_j}} = -\frac{1}{m}\sum_{i=1}^{m}\left[x^{(i)}\left(1\{y^{(i)}=j\} - p(y^{(i)}=j|x^{(i)};W,b)\right)\right] + \lambda w_j$$

$$\frac{\partial{J(W,b)}}{\partial{b_j}} = -\frac{1}{m}\sum_{i=1}^{m}\left(1\{y^{(i)}=j\} - p(y^{(i)}=j|x^{(i)};W,b)\right)$$
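
The update rules and gradients above can be tried on a toy problem. The sketch below uses plain NumPy rather than Theano; the synthetic three-class data, the learning rate `alpha = 0.5`, the iteration count, and all function names are assumptions chosen only for illustration, with 0-based labels.

```python
import numpy as np

def class_probabilities(W, b, X):
    """Column i holds p(y=j | x_i) for j = 0..k-1."""
    scores = W.dot(X) + b[:, None]
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

def loss(W, b, X, y, lam):
    m = X.shape[1]
    P = class_probabilities(W, b, X)
    return -np.log(P[y, np.arange(m)]).mean() + 0.5 * lam * (W ** 2).sum()

def gradients(W, b, X, y, lam):
    k, m = W.shape[0], X.shape[1]
    P = class_probabilities(W, b, X)
    Y = np.zeros((k, m))
    Y[y, np.arange(m)] = 1.0                  # indicator 1{y_i = j}
    diff = Y - P
    grad_W = -diff.dot(X.T) / m + lam * W     # row j is dJ/dw_j
    grad_b = -diff.sum(axis=1) / m            # entry j is dJ/db_j
    return grad_W, grad_b

rng = np.random.RandomState(0)
X = rng.randn(2, 60)
# toy label: number of positive coordinates (0, 1 or 2)
y = (X[0] > 0).astype(int) + (X[1] > 0).astype(int)
W = np.zeros((3, 2))
b = np.zeros(3)
alpha, lam = 0.5, 1e-3

initial_loss = loss(W, b, X, y, lam)
for _ in range(200):
    grad_W, grad_b = gradients(W, b, X, y, lam)
    W -= alpha * grad_W
    b -= alpha * grad_b
final_loss = loss(W, b, X, y, lam)
print(initial_loss, final_loss)
```

Starting from zero parameters the loss is $\log 3$, and it decreases as the updates are applied.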

A more advanced option is the L-BFGS algorithm, which also requires the gradient function as an input argument.


### Building the Model with Theano

#### Vectorising the model

Notations
* X: data matrix of size $n\times m$, where $n$ is the number of dimensions and $m$ is the number of instances. That is, each column stores an instance.
* W: weight matrix of size $k \times n$, where $k$ is the number of classes.
* b: bias vector of size $k\times 1$.

The posterior probability is
\begin{equation}
p = softmax(WX+b)
\end{equation}
where softmax(M) = exp(M) / exp(M).sum(axis=0), the exponential is taken element-wise, and M is an arbitrary matrix.
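
A minimal NumPy sketch of the vectorised posterior follows. It exponentiates the scores and normalises each column, and it subtracts the column maximum first, which leaves the result unchanged but avoids overflow. The name `softmax_columns` and the random test shapes are illustrative assumptions.

```python
import numpy as np

def softmax_columns(M):
    """Apply softmax independently to every column of M."""
    shifted = M - M.max(axis=0, keepdims=True)   # guards against overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.RandomState(0)
k, n, m = 3, 4, 5
W = rng.randn(k, n)
X = rng.randn(n, m)
b = np.zeros((k, 1))            # broadcast across the m columns
P = softmax_columns(W.dot(X) + b)
print(P.sum(axis=0))            # each column sums to one

# Large scores stay finite thanks to the shift
big = softmax_columns(np.array([[1000.0], [1001.0]]))
print(big)
```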

#### Utilities for manipulating the parameters

In [1]:
import numpy as np

def get_size(shape):
    """
    count the number of elements in a ndarray with shape=shape
    """
    size = 1
    for i in shape:
        size *= i
    return size
    
def pack(param_list):
    """
    Args:
        param_list: list of ndarrays
    Returns:
        theta:  vector of shape (?,), flattened params
        shapes: list of tuples, with each tuple storing the shape of each ndarray in param_list
    """
    shapes = []
    theta=None
    for p in param_list:
        size = p.size
        p2 = p.reshape((size, )) 
        if theta is None:
            theta = p2
        else:
            theta = np.hstack((theta, p2))
        shapes += [p.shape] 
    return theta, shapes
        
def unpack(theta, shapes):
    """
    Args:
        theta:  vector of shape (?,), flattened params
        shapes: list of tuples, with each tuple storing the shape of each ndarray in param_list      
    Returns:
        param_list: list of ndarrays
    """
    i = 0
    params = []
    for shape in shapes:
        size = get_size(shape)
        x = theta[i:i+size]
        params += [x.reshape(shape)]
        i += size
    return params
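
To show what the round trip looks like, here is a self-contained sketch. `pack_params` and `unpack_params` are hypothetical compact stand-ins written for this example; they are meant to mirror the behaviour of `pack` and `unpack` above, not to replace them.

```python
import numpy as np

def pack_params(param_list):
    """Flatten a list of arrays into one vector and remember their shapes."""
    shapes = [p.shape for p in param_list]
    theta = np.concatenate([p.ravel() for p in param_list])
    return theta, shapes

def unpack_params(theta, shapes):
    """Cut the flat vector back into arrays of the recorded shapes."""
    params, i = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(theta[i:i + size].reshape(shape))
        i += size
    return params

W = np.arange(6.0).reshape(2, 3)
b = np.array([[10.0], [20.0]])
theta, shapes = pack_params([W, b])
W2, b2 = unpack_params(theta, shapes)
print(theta)
print(shapes)
```

A flat parameter vector is what general-purpose optimisers such as L-BFGS expect, which is why the model parameters are flattened before optimisation and reshaped afterwards.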

#### Defining the softmax regression class

In [6]:
#coding=utf-8
import theano
import theano.tensor as T
import numpy as np
import scipy as sp
import scipy.optimize  # load the submodule explicitly so that sp.optimize is available
import gzip
import cPickle

#from param_util import pack, unpack

# for debugging
theano.config.optimizer="fast_run"
theano.config.exception_verbosity="high"

class SoftmaxRegression(object):
    def __init__(self, n_in, n_out, L2_reg_coef, max_iter=100):
        self.n_in = n_in
        self.n_out = n_out
        self.L2_reg_coef = L2_reg_coef
        self.max_iter = max_iter
        
        self.W = theano.shared(
            value=0.005 * np.random.randn(self.n_out, self.n_in),
            name='W',
            borrow=True)
        self.b = theano.shared(
            value=np.zeros((self.n_out, 1), dtype=theano.config.floatX),
            name='b',
            broadcastable=(False, True),
            borrow=True)
        self.params = [self.W, self.b]
        
        # sample matrix, each column stores a sample
        X = T.dmatrix('X')
        # label
        y = T.lvector('y')
        # predict
        p_y_given_x = self.__softmax__(T.dot(self.W, X) + self.b)
        pred = T.argmax(p_y_given_x, axis=0)
        # NLL: negative log-likelihood
        nll = -T.mean(T.log(p_y_given_x[y, T.arange(0, y.shape[0])]))
        # cost (loss)
        cost = nll + self.L2_reg_coef * (self.W**2).sum()
        # error rate
        error = 1.0*T.sum(T.neq(pred, y))/X.shape[1]
        
        self.test_model = theano.function(
            inputs=[X, y],
            outputs=error)
        self.run_model = theano.function(
            inputs=[X],
            outputs=pred)
            
        # compute gradient
        grads = T.grad(cost, self.params)
        params = [p.type() for p in self.params]
        self.grad_func = theano.function(
            inputs=params + [X, y],
            outputs=grads,
            givens=[(s, p) for s, p in zip(self.params, params)])
        self.cost_func = theano.function(
            inputs=params + [X, y],
            outputs=cost,
            givens=[(s, p) for s, p in zip(self.params, params)])   
        
    def fit(self, X, y):
        init_theta, shapes = pack([self.W.get_value(), self.b.get_value()])
        opt_theta, opt_cost, d = sp.optimize.fmin_l_bfgs_b(
            func=self.__cost_and_grad__,
            x0=init_theta,
            fprime=None,
            args=(shapes, X, y),
            maxiter=self.max_iter,
            iprint=1)
        opt_param_list = unpack(opt_theta, shapes)
        for shared_param, opt_param in zip(self.params, opt_param_list):
            shared_param.set_value(opt_param)

    def transform(self, X):
        return self.run_model(X)
    
    def evaluate(self, X, y):
        return self.test_model(X, y)
        
    def __cost_and_grad__(self, theta, *args):
        shapes = args[0]
        X = args[1]
        y = args[2]
        param_list = unpack(theta, shapes)
        cost =  self.cost_func(*(param_list + [X, y]))
        grad = self.grad_func(*(param_list + [X, y]))
        grad, _ = pack(grad)
        return cost, grad        
    
    def __softmax__(self, M):
        """
        normalise along the vertical axis
        """
        e_M = T.exp(M - M.max(axis=0, keepdims=True))
        return e_M / e_M.sum(axis=0, keepdims=True)
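
The expression `p_y_given_x[y, T.arange(0, y.shape[0])]` in the class picks, for every sample (column), the probability assigned to its true class. The same indexing pattern can be seen in plain NumPy; the small probability matrix below is made up for illustration.

```python
import numpy as np

# P[j, i] = p(y = j | x_i); every column sums to one
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
y = np.array([0, 2, 2])                      # true labels of the three samples

picked = P[y, np.arange(y.shape[0])]         # [P[0, 0], P[2, 1], P[2, 2]]
nll = -np.mean(np.log(picked))               # negative log-likelihood
pred = np.argmax(P, axis=0)                  # most probable class per column
error = 1.0 * np.sum(pred != y) / P.shape[1] # fraction of misclassified samples
print(picked)
print(pred)
print(error)
```

Here the second sample is misclassified (predicted class 1, true class 2), so the error rate is one third.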

#### Evaluating the model

##### Dataset: MNIST Dataset of Handwritten Digits

* Gray-scale images of size $28\times 28$, with pixel values in the range $[0, 1]$.
* 50,000 training samples, 10,000 validation samples and 10,000 testing samples.

![image](mnist.png)

In [7]:
if __name__ == '__main__':
    np.random.seed(0)
    
    f = gzip.open('data/mnist.pkl.gz')
    train_set, valid_set, test_set = cPickle.load(f)
    
    # train model
    train_X = train_set[0].transpose()
    train_y = train_set[1]
    valid_X = valid_set[0].transpose()
    valid_y = valid_set[1]
    
    train_X = np.hstack((train_X, valid_X))
    train_y = np.hstack((train_y, valid_y))
    
    train_X = train_X[:, 0:60001:5]
    train_y = train_y[0:60001:5]
    
    n_in = train_X.shape[0]
    n_out = np.unique(train_y).shape[0]
    print "{n} training samples of dim {d}".format(n=train_X.shape[1], d=n_in)
    
    softmax_regression = SoftmaxRegression(n_in, n_out, L2_reg_coef=1e-4, max_iter=100)
    softmax_regression.fit(train_X, train_y)
    
    # test model
    test_X = test_set[0].transpose()
    test_y = test_set[1]
    error = softmax_regression.evaluate(test_X, test_y)
    print 'error rate on test set is {e}%, accuracy is {a}%'.format(e=error*100, a=100-100*error)
    # baseline method: uniformly random guesses, scored on the validation labels
    pred_baseline = np.random.randint(
        low=np.min(train_y),
        high=np.max(train_y)+1,
        size=valid_y.shape)
    error_baseline = 1.0 * (pred_baseline != valid_y).sum() / valid_X.shape[1]
    print 'baseline: error rate on validation set is {e}%, accuracy is {a}%'.format(e=error_baseline*100, a=100-100*error_baseline)

12000 training samples of dim 784
error rate on test set is 8.5%, accuracy is 91.5%
baseline: error rate on validation set is 90.18%, accuracy is 9.82%
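
The baseline figure is what one would expect from guessing: with $k=10$ classes, a uniformly random guess is correct with probability $1/k$, so the expected error rate is $1-1/k=90\%$. A small simulation sketch, using synthetic labels rather than MNIST (the sample count and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_samples = 10, 10000

labels = rng.randint(0, n_classes, size=n_samples)    # synthetic ground truth
guesses = rng.randint(0, n_classes, size=n_samples)   # uniformly random predictions

error_rate = np.mean(guesses != labels)
expected_error = 1.0 - 1.0 / n_classes
print(error_rate, expected_error)
```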


##### Visualising the most "preferable" input

(TODO) Method of Lagrange multipliers: https://en.wikipedia.org/wiki/Lagrange_multiplier; KKT conditions, etc.: http://blog.csdn.net/xianlingmao/article/details/7919597

## Highlights of DNN

## Neural Network

## Autoencoder

### The Model

Suppose we have only a set of unlabeled training examples $\{x^{(1)},...,x^{(m)}\}$, where $x^{(i)}\in R^n$.
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs, i.e., $y^{(i)}=x^{(i)}$.

![image](autoencoder.png)

The autoencoder tries to learn an approximation to the identity function, so that $h_{W,b}(x)\approx x$.
* The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data.
* As a concrete example, suppose the inputs $x$ are the pixel intensity values from a $10\times 10$ image ($100$ pixels) so $n=100$, and there are $s_2=50$ hidden units in layer $L_2$. 
Since there are only $50$ hidden units, the network is forced to learn a compressed representation of the input.
* If the inputs were completely random, i.e., each dimension drawn independently of the others, this compression task would be very difficult.

Notations
* $m$: the number of samples
* $n_l$: the number of layers, including the input, hidden and output layers
* $s_l$: the number of units in layer $L_l$, excluding the bias unit
* $W^{(l)}$: the weight matrix connecting layer $L_l$ and layer $L_{l+1}$
* $W_{ji}^{(l)}$: the weight connecting unit $i$ in layer $L_l$ and unit $j$ in layer $L_{l+1}$
* $b^{(l)}$: the bias vector connecting the bias unit in layer $L_l$ to the units in layer $L_{l+1}$
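
Using this notation, the forward pass of a three-layer autoencoder can be sketched in NumPy. The sizes follow the example above ($n=100$, $s_2=50$). The sigmoid activation and the small random initialisation are assumptions made for this sketch, since the text does not fix them, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(W1, b1, W2, b2, x):
    """Return the hidden code a2 and the reconstruction h of input x.

    W1: (s2, n) connects layer L1 to L2; W2: (n, s2) connects L2 to L3.
    """
    a2 = sigmoid(W1.dot(x) + b1)    # compressed representation, shape (s2,)
    h = sigmoid(W2.dot(a2) + b2)    # reconstruction of x, shape (n,)
    return a2, h

rng = np.random.RandomState(0)
n, s2 = 100, 50
W1 = 0.01 * rng.randn(s2, n)
b1 = np.zeros(s2)
W2 = 0.01 * rng.randn(n, s2)
b2 = np.zeros(n)

x = rng.rand(n)                     # a fake 10x10 image with pixels in [0, 1)
a2, h = autoencoder_forward(W1, b1, W2, b2, x)
print(a2.shape, h.shape)
```

The hidden code has half as many entries as the input, which is the bottleneck that forces a compressed representation.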

### Defining a Loss Function

For a single training sample $x$, the loss function is defined as
\begin{equation}
J(W,b; x) = \frac{1}{2}\Vert h_{W,b}(x)-x\Vert^2
\end{equation}
Given a training set of $m$ samples, we define the loss function as
\begin{equation}
\begin{aligned}
J(W,b) &= 
\left[\frac{1}{m}\sum_{i=1}^{m}J(W,b; x^{(i)})\right]
+\lambda \sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2\\
&=\left[\frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{2}\Vert h_{W,b}(x^{(i)})-x^{(i)}\Vert^2\right)\right]
+\lambda \sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2
\end{aligned}
\end{equation}

The second term is a weight decay term.
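
The loss can be written compactly for a whole batch. The NumPy sketch below assumes a three-layer network ($n_l=3$), a sigmoid activation (an assumption, as the text does not specify one), and samples stored one per column; `autoencoder_loss` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_loss(W1, b1, W2, b2, X, lam):
    """Average squared reconstruction error plus weight decay.

    X: (n, m) with one sample per column; W1: (s2, n); W2: (n, s2).
    """
    m = X.shape[1]
    A2 = sigmoid(W1.dot(X) + b1[:, None])     # hidden activations, (s2, m)
    H = sigmoid(W2.dot(A2) + b2[:, None])     # reconstructions, (n, m)
    reconstruction = 0.5 * ((H - X) ** 2).sum() / m
    decay = lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
    return reconstruction + decay

# With all-zero parameters every output unit is sigmoid(0) = 0.5.
# For an all-zero input with n = 4 this gives 0.5 * 4 * 0.25 = 0.5 per sample.
n, s2, m = 4, 2, 3
X = np.zeros((n, m))
loss_at_zero = autoencoder_loss(
    np.zeros((s2, n)), np.zeros(s2), np.zeros((n, s2)), np.zeros(n), X, lam=1e-3)
print(loss_at_zero)
```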

### Sparsity Constraint

We would like to constrain the neurons to be active only for a subset of patterns.
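
One way to make this concrete is to measure how active each hidden unit is on average over the training set and to penalise units whose average activation is far from a small target. The sketch below uses a KL-divergence penalty with target `rho = 0.05`; both the penalty form and the target value are assumptions borrowed from common practice, not something this text has defined, and the function names are hypothetical.

```python
import numpy as np

def mean_activation(A2):
    """A2: (s2, m) hidden activations in (0, 1); returns one average per hidden unit."""
    return A2.mean(axis=1)

def kl_sparsity_penalty(rho_hat, rho=0.05):
    """Sum over hidden units of KL(rho || rho_hat) between Bernoulli distributions."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Two hidden units observed on two samples:
# unit 0 is rarely active, unit 1 is active almost all the time.
A2 = np.array([[0.05, 0.05],
               [0.90, 0.80]])
rho_hat = mean_activation(A2)
penalty_sparse_unit = kl_sparsity_penalty(rho_hat[:1])
penalty_dense_unit = kl_sparsity_penalty(rho_hat[1:])
print(rho_hat)
print(penalty_sparse_unit, penalty_dense_unit)
```

The unit whose average activation matches the target contributes no penalty, while the frequently active unit is penalised.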