






# <center>Deep Learning and Text Analytics</center>

References:
- General introduction
     - https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/
     - http://neuralnetworksanddeeplearning.com
     - http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
- Word vector:
     - https://code.google.com/archive/p/word2vec/
- Keras tutorial
     - https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- CNN
     - http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

## 1. Agenda
- Introduction to neural networks
- Word/Document Vectors (vector representation of words/phrases/paragraphs)
- Convolutional neural network (CNN)
- Application of CNN in text classification

## 2. Introduction to neural networks
- A neural network is a computational model inspired by the way biological neural networks in the human brain process information.
- Neural networks have been widely applied in speech recognition, computer vision and text processing

### 2.1. Single Neuron

<img src="single_neuron.png" width="60%">
$$h_{W,b}(x)=f(w_1x_1+w_2x_2+w_3x_3+b)$$
- Basic components:
    - **input** ($X$): $[x_1, x_2, x_3]$
    - **weight** ($W$): $[w_1, w_2, w_3]$
    - **bias**: $b$
    - **activation** function: $f$
- Different activation functions:
    - **Sigmoid** (logistic function): takes a real-valued input and squashes it to the range (0, 1). $$f(z)=\frac{1}{1+e^{-z}}$$, where $z=w_1x_1+w_2x_2+w_3x_3+b$
    - Tanh (hyperbolic tangent): takes a real-valued input and squashes it to the range (-1, 1). $$f(z)=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
    - ReLU (Rectified Linear Unit): $$f(z)=\max(0,z)$$
    - **Softmax** (normalized exponential function): a generalization of the logistic function. If $z=[z_1, z_2, ..., z_k]$ is a $k$-dimensional vector, then for $j=1, ..., k$ $$f(z)_{j}=\frac{e^{z_j}}{\sum_{i=1}^k{e^{z_i}}}$$
     - $f(z)_{j} \in (0,1)$
     - $\sum_{j=1}^{k} {f(z)_{j}} =1 $
     - $f(z)_{j}$ is treated as the **probability** of component $j$, so $f(z)$ is a probability distribution over $k$ different possible outcomes
     - e.g. in multi-class classification, softmax gives a probability for each class
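
The activation functions listed above can be sketched in plain Python as follows. This is only an illustration: the input, weight, and bias values are made up, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the formula above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: maps any real z into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def softmax(zs):
    """Normalized exponential over a list of k scores."""
    # Subtracting the max does not change the result but avoids overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Single neuron: z = w1*x1 + w2*x2 + w3*x3 + b (illustrative numbers only).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2
z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(sigmoid(z), tanh(z), relu(z))

# Softmax turns arbitrary scores into values that sum to 1.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```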

### 2.2 Neural Network Model
- A neural network is composed of many simple neurons, so that the output of a neuron can be the input of another
- The sample neural network model has 3 input nodes, 3 hidden units, and 1 output unit
    - input layer: the leftmost layer
    - output layer: the rightmost layer (produces the target, i.e. prediction, classification)
    - bias units: indicated by "+1" node
    - hidden layer: the middle layer of nodes
<img src="neural_network.png" width="60%"/>

- $W$, $x$, and $b$ are usually represented as arrays (i.e. vectorized)
   - $w_{ij}^{(l)}$: the weight associated with the link from unit $j$ in layer $l$ to unit $i$ in layer $l+1$
   - $W^{(1)} \in \mathbb{R}^{3\times 3}$, $W^{(2)} \in \mathbb{R}^{1\times 3}$, $b^{(1)} \in \mathbb{R}^{3\times 1}$, $b^{(2)} \in \mathbb{R}^{1\times 1}$
   - Note $W^{(l)}x$ is the matrix-vector (dot) product between $W^{(l)}$ and $x$, i.e. $W^{(l)} \cdot x$
   
- If a neural network contains more than 1 hidden layer, it's called a **deep neural network** (**deep learning**)
- Training a neural network model means finding the $W$ and $b$ that optimize some **cost function**, given training samples (X,Y), where X and Y can be multi-dimensional
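
As a rough sketch of the vectorized form, the forward pass of the 3-3-1 sample network can be written with NumPy. The random weights, the zero biases, the input values, and the choice of sigmoid for both layers are assumptions made for illustration; only the array shapes come from the notes above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only

# Shapes follow the sample network: 3 inputs, 3 hidden units, 1 output.
W1 = rng.normal(size=(3, 3))     # W^(1): layer 1 -> layer 2
b1 = np.zeros((3, 1))            # b^(1)
W2 = rng.normal(size=(1, 3))     # W^(2): layer 2 -> layer 3
b2 = np.zeros((1, 1))            # b^(2)

x = np.array([[0.5], [-1.0], [2.0]])   # one input sample as a column vector

a2 = sigmoid(W1 @ x + b1)        # hidden-layer activations, shape (3, 1)
h = sigmoid(W2 @ a2 + b2)        # network output h_{W,b}(x), shape (1, 1)
print(a2.shape, h.shape)
```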


### 2.3. Cost function
- Training set: $m$ samples denoted as $(X,Y)=\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}$
- A typical cost function: **mean_squared_error** 
  - Sum of square error for a single sample $(x, y)$: $J(W,b;x,y)=\frac{1}{2}||h_{W,b}(x)-y||^2$
  - Regularization (square of each weight, or L2): $\sum_{i, j, l}(w_{ij}^{(l)})^2$. An important mechanism to prevent overfitting
  - Cost function:
$$J(W,b)=\frac{1}{m}\sum_{i=1}^m{\left(\frac{1}{2}||h_{W,b}(x^{(i)})-y^{(i)}||^2\right)}+ \frac{\lambda}{2}\sum_{i, j, l}(w_{ij}^{(l)})^2$$, where $\lambda$ is the **regularization coefficient**
- Other popular cost functions
  - **Cross-entropy cost**
      - Let's assume a single neuron with sigmoid activation function <img src='single_neuron.png' width="30%" style="float: right;">
      - Let $\widehat y=h_{W,b}(x)$ be the prediction of the true value $y$, with $\widehat y, y \in [0,1]$. 
      - Then the cross-entropy cost is defined as: $$J=-\frac{1}{m}\sum_{i=1}^m{\left[y_i\ln{\widehat y_i}+(1-y_i)\ln{(1-\widehat y_i)}\right]}$$
      - What makes cross-entropy a good cost function
        - It's non-negative
        - if the neuron's output $\widehat y$ is close to the actual value $y$ (0 or 1) for all training inputs, then the cross-entropy will be close to zero
- For comparison between "Sum of Square error" and "Cross-entropy cost", read http://neuralnetworksanddeeplearning.com/chap3.html
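
The two cost functions above can be sketched in a few lines of Python. The sketch assumes a single scalar output per sample, and the labels, predictions, weights, and $\lambda$ value are invented for illustration.

```python
import math

def mse_cost(predictions, targets, weights, lam):
    """Mean of 1/2 * squared error plus the L2 penalty (lam / 2) * sum(w^2)."""
    m = len(targets)
    data_term = sum(0.5 * (p - t) ** 2 for p, t in zip(predictions, targets)) / m
    penalty = 0.5 * lam * sum(w ** 2 for w in weights)
    return data_term + penalty

def cross_entropy_cost(predictions, targets):
    """Binary cross-entropy averaged over m samples."""
    m = len(targets)
    total = 0.0
    for p, t in zip(predictions, targets):
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / m

y_true = [1, 0, 1, 0]
good = [0.99, 0.01, 0.98, 0.02]   # predictions close to the labels
poor = [0.40, 0.70, 0.30, 0.60]   # predictions far from the labels

print(cross_entropy_cost(good, y_true))   # small, close to zero
print(cross_entropy_cost(poor, y_true))   # noticeably larger
print(mse_cost(good, y_true, weights=[0.5, -0.3], lam=0.01))
```

Predictions of exactly 0 or 1 would make `math.log` fail, so practical implementations usually clip the predictions slightly away from those values.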

### 2.4. Gradient Descent
- An optimization algorithm that searches for the parameter values ($W, b$) that minimize a cost function $J(W,b)$.
- It is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm
  <img src='gradient_descent.png' width='80%'>
  resource: https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/
- It uses derivatives of cost function to determine the direction to move the parameter values in order to get a lower cost on the next iteration
- Procedure:
    1. initialize $W$ with random values
    2. given samples (X,Y) as inputs, calculate derivatives of the cost function with regard to every parameter $w_{ij}^{(l)}$, i.e. $\frac{\partial{J}}{\partial{w_{ij}^{(l)}}}$
    3. update parameters by $(w_{ij}^{(l)})^{'}=w_{ij}^{(l)}-\alpha*\frac{\partial{J}}{\partial{w_{ij}^{(l)}}}$, where $\alpha$ is the learning rate
    4. repeat steps 2-3 until $w_{ij}^{(l)}$ converges
- **Learning rate $\alpha$**
  - It's critical to pick the right learning rate. Big $\alpha$ or small $\alpha$?
  - $\alpha$ may need to be adapted as learning unfolds
- Challenges of Gradient Descent
  - It is expensive to compute $\frac{1}{m}\sum_{i=1}^m{(\frac{1}{2}||h_{W,b}(x_i)-y_i||^2)}$ for all samples in each round
  - It is difficult to compute $\frac{\partial{J}}{\partial{w_{ij}^{(l)}}}$ if a neural network has many layers
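
The update rule and the effect of the learning rate can be seen on a toy problem. The sketch below uses a hypothetical one-parameter cost $J(w)=(w-3)^2$ rather than a neural network, and starts from $w=0$ instead of a random value so that the runs are comparable.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeat w <- w - alpha * dJ/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 with derivative 2 * (w - 3); minimum at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)

# A small alpha converges slowly, a moderate one converges quickly,
# and an alpha that is too large overshoots further on every step.
for alpha in (0.01, 0.1, 1.1):
    w_final = gradient_descent(grad_J, w0=0.0, alpha=alpha, steps=50)
    print(alpha, w_final)
```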

### 2.5. Stochastic Gradient Descent
- Estimate of cost function using a subset of randomly chosen training samples (mini-batch) instead of the entire training set
- Procedure: 
  1. pick a randomly selected mini-batch, train with them and update $W, b$, 
  2. repeat step (1) with another randomly selected mini-batch until the training set is exhausted (i.e. complete an epoch), 
  3. start over with another epoch until $W, b$ converge
- **Hyperparameters** (parameters that control the learning of $W, b$)
    - **Batch size**: the number of samples selected for each iteration
    - **Epochs**: One epoch means one complete pass through the whole training set. Usually many epochs are needed before $W, b$ converge
    - e.g. if your sample size is 1000, and your batch size is 200, how many iterations are needed for one epoch?
    - e.g. if you set the number of epochs to 5, how many times in total do you update $W, b$?
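
One way to check the two questions above is to count the updates directly. The `minibatches` helper below is a hypothetical illustration of the procedure and does no actual training; the gradient step is left as a comment.

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices that together cover one epoch."""
    indices = list(range(n_samples))
    rng.shuffle(indices)                      # new random order every epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)                        # arbitrary seed
n_samples, batch_size, n_epochs = 1000, 200, 5

updates = 0
for epoch in range(n_epochs):
    iterations = 0
    for batch in minibatches(n_samples, batch_size, rng):
        # A real trainer would compute gradients on `batch` and update W, b here.
        iterations += 1
        updates += 1
    print("epoch", epoch + 1, "iterations:", iterations)

print("total updates:", updates)
```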


### 2.6. Backpropagation Algorithm -- The efficient way to calculate gradients (i.e. partial derivatives)

Forward Propagation             |  Backpropagation
:-------------------------:|:-------------------------:
![](forward-propagation.png)  |  ![](backpropagation.png)
input signals pass through each layer and are multiplied by the weights | the error is propagated back to each layer in proportion to the respective weights, and the weights are updated based on the attributed errors in order to reduce the error
- Algorithm:
  1. perform a feedforward pass, computing the activations for layers L2, L3, ... and so on up to the output layer
  2. for output layer $n$,<br> $\delta^{(n)} = \frac{\partial}{\partial z^{(n)}}
 J(W,b; x, y) = \frac{\partial}{\partial z^{(n)}}
 \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y - a^{(n)}) \cdot f'(z^{(n)})$
  3. for $l=n-1, n-2, n-3, ..., 2$,<br>
  $ \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \cdot f'(z^{(l)})$
  4. Compute the desired partial derivatives, which are given as:<br>
     $ \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) = a^{(l)}_j \delta_i^{(l+1)}$ <br>
$\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) = \delta_i^{(l+1)}$
- Example: 
  - $\delta^{(3)} = \frac{\partial}{\partial z^{(3)}} J(W,b; x, y) = (a^{(3)} - y) \cdot f'(z^{(3)})$

  - $ \delta^{(2)} = \left((W^{(2)})^T \delta^{(3)}\right) \cdot f'(z^{(2)})$
  - $ \frac{\partial}{\partial W_{12}^{(2)}} J(W,b; x, y) = a^{(2)}_2 \delta_1^{(3)}$
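
The steps above can be sketched with NumPy for the 3-3-1 sample network. The sketch assumes sigmoid activations in both layers (so $f'(z)=a(1-a)$), a single made-up training sample, and random weights; it then compares the backpropagated value of $\frac{\partial J}{\partial W_{12}^{(2)}}$ with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a2 = sigmoid(W1 @ x + b1)                    # hidden layer
    a3 = sigmoid(W2 @ a2 + b2)                   # output layer
    return a2, a3

def cost(params, x, y):
    _, a3 = forward(params, x)
    return 0.5 * float(np.sum((a3 - y) ** 2))

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a2, a3 = forward(params, x)                  # step 1
    # For the sigmoid, f'(z) = a * (1 - a).
    delta3 = (a3 - y) * a3 * (1 - a3)            # step 2
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)     # step 3
    # Step 4: dJ/dW^(l) = delta^(l+1) (a^(l))^T and dJ/db^(l) = delta^(l+1)
    return [delta2 @ x.T, delta2, delta3 @ a2.T, delta3]

rng = np.random.default_rng(1)                   # arbitrary seed
params = [rng.normal(size=(3, 3)), rng.normal(size=(3, 1)),
          rng.normal(size=(1, 3)), rng.normal(size=(1, 1))]
x = np.array([[0.2], [-0.4], [0.9]])
y = np.array([[1.0]])

grads = backprop(params, x, y)

# Numerical check on W^(2)_{12} (row 0, column 1 with zero-based indexing).
eps = 1e-6
W2 = params[2]
W2[0, 1] += eps
cost_plus = cost(params, x, y)
W2[0, 1] -= 2 * eps
cost_minus = cost(params, x, y)
W2[0, 1] += eps                                  # restore the original weight
numeric = (cost_plus - cost_minus) / (2 * eps)
print(grads[2][0, 1], numeric)                   # the two values should agree closely
```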


### 2.7 Hyperparameters
- Hyperparameters are parameters that control the learning of $W, b$ (our learning target)
- Summary of hyperparameters:
    - Network structure:
      - number of hidden layers
      - number of neurons of each layer
      - activation function of each layer
    - Learning rate ($\alpha$)
    - regularization coefficient ($\lambda$)
    - mini-batch size
    - epochs
- For detailed explanation, watch: https://www.coursera.org/learn/neural-networks-deep-learning/lecture/TBvb5/parameters-vs-hyperparameters

## 3. Develop your First Neural Network Model with Keras
- Keras: 
  - high-level library for neural network models
  - It wraps the efficient numerical computation libraries Theano and TensorFlow 
- Why Keras:
  - Simple to get started and keep going
  - Written in Python and highly modular; easy to expand
  - Built-in modules for some sophisticated neural network models
- Installation
  - pip install keras (or pip install keras --upgrade if you already have it) to install the latest version (2.0.8)
  - pip install theano (version 0.9.0)
  - pip install tensorflow (version 1.3.0)
  - pip install np-utils (version 0.5.3.4)
- Basic procedure
  1. Load data
  2. Define model
  3. Compile model
  4. Fit model
  5. Evaluate model

### 3.1. Basic Keras Modeling Constructs
- Sequential model:  linear stack of layers
- Layers
  - Dense: a fully connected layer, in which each neuron is connected to every neuron in the previous layer
  - Embedding
  - Convolution
  - MaxPooling
  - ...
- Cost (loss) functions
  - mean_squared_error
  - binary_crossentropy
  - categorical_crossentropy
  - ...
- Optimizer (i.e. optimization algorithm)
  - SGD (Stochastic Gradient Descent): fixed learning rate in all iterations
  - Adagrad: adapts the learning rate to the parameters, performing larger updates for infrequent, and smaller updates for frequent parameters
  - Adam (Adaptive Moment Estimation): computes adaptive learning rates for each parameter.
- Metrics
  - accuracy: a ratio of correctly predicted samples to the total samples
  - precision/recall/f1 through sklearn package
  - Example:
    - acc: (90+85)/200=87.5%
    - prec: 
    - recall:

|        | Predicted T        |   Predicted F  |
|:----------|-------------------:|---------------:|
|Actual T  |  90                | 10              |
|Actual F  |  15                | 85              |
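
The metrics can be computed directly from the four counts in the table above. The sketch treats "T" as the positive class, which is an assumption about how the table is meant to be read.

```python
# Counts taken from the confusion matrix above.
tp, fn = 90, 10   # actual T: predicted T, predicted F
fp, tn = 15, 85   # actual F: predicted T, predicted F

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)     # of the samples predicted T, how many are T
recall = tp / (tp + fn)        # of the samples that are T, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.875 precision=0.857 recall=0.900 f1=0.878
```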

### 3.2. Example
- Example: build a simple neural network model to predict diabetes using "Pima Indians onset of diabetes database" at http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
  - Columns 1-8: variables 
  - Column 9: class variable, 0 or 1
- A sequential model with 4 layers
  - each node is a tensor, a function of multidimensional arrays
    - Input (L1)
    - L2 (hidden layer, dense)
    - L3 (hidden layer, dense)
    - Output (dense)
  - the model is a tensor graph (computation graph)

  <img src='model.png' width='20%'>
  <div class="alert alert-block alert-info">Training a deep learning model is a very empirical process. You may need to tune the hyperparameters in many iterations</div>

In [1]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Exercise 3.1. Load data
import numpy as np
import pandas as pd

# Load data
data=pd.read_csv("pima-indians-diabetes.csv",header=None)
data.head()

data[8].value_counts()

X=data.values[:,0:8]
y=data.values[:,8]
X.shape

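|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |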


0    500
1    268
Name: 8, dtype: int64

(768, 8)

In [3]:
# Exercise 3.2. Create Model

# sequential model is a linear stack of layers
from keras.models import Sequential


# in a dense (fully connected) layer, each neuron is connected to
# every neuron in the previous layer
from keras.layers import Dense

# import packages for L2 regularization
from keras.regularizers import l2

# fix random seed for reproducibility
np.random.seed(7)

# set lambda (regularization coefficient)
lam=0.01

# create a sequential model
model = Sequential()

# add a dense layer with 12 neurons, 8 input variables
# and rectifier activation function (relu)
# and L2 regularization
# how many parameters in this layer?
model.add(Dense(12, input_dim=8, activation='relu',
                kernel_regularizer=l2(lam), name='L2'))

# add another hidden layer with 8 neurons
model.add(Dense(8, activation='relu',
                kernel_regularizer=l2(lam), name='L3'))

# add the output layer with sigmoid activation function
# to return probability
model.add(Dense(1, activation='sigmoid', name='Output'))

# compile the model using the binary cross-entropy cost function,
# the adam optimizer, and accuracy as the metric
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

Using Theano backend.


In [4]:
# Exercise 3.3. Check model configuration

model.summary()

# Show the model in a computation graph
# it needs pydot and graphviz
# don't worry if you don't have them installed

from keras.utils import plot_model
plot_model(model, to_file='model.png')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
L2 (Dense)                   (None, 12)                108       
_________________________________________________________________
L3 (Dense)                   (None, 8)                 104       
_________________________________________________________________
Output (Dense)               (None, 1)                 9         
Total params: 221
Trainable params: 221
Non-trainable params: 0
_________________________________________________________________


ImportError: Failed to import pydot. You must install pydot and graphviz for `pydotprint` to work.
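
The `Param #` column in the summary above can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [8, 12, 8, 1]   # input, L2, L3, Output
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))    # [108, 104, 9] 221
```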

In [5]:
# Exercise 3.4. Fit Model

# train the model with mini-batches of size 32
# for 150 epochs (how many iterations in total?)
# hold out 20% of the samples for validation;
# Keras takes them from the end of X, y before any shuffling
# shuffle=True reshuffles the training samples at each epoch
# store the fitting history in the variable "training"

training=model.fit(X, y, validation_split=0.2,
                   shuffle=True, epochs=150,
                   batch_size=32, verbose=2)

Train on 614 samples, validate on 154 samples
Epoch 1/150
 - 2s - loss: 1.9401 - acc: 0.6384 - val_loss: 1.3583 - val_acc: 0.5130
Epoch 2/150
 - 1s - loss: 1.1131 - acc: 0.5863 - val_loss: 1.0756 - val_acc: 0.5455
Epoch 3/150
 - 1s - loss: 0.9913 - acc: 0.5847 - val_loss: 1.0256 - val_acc: 0.6234
Epoch 4/150
 - 1s - loss: 0.9365 - acc: 0.5993 - val_loss: 0.9297 - val_acc: 0.5909
Epoch 5/150
 - 2s - loss: 0.8954 - acc: 0.6254 - val_loss: 0.8907 - val_acc: 0.5714
Epoch 6/150
 - 2s - loss: 0.8569 - acc: 0.6482 - val_loss: 0.8596 - val_acc: 0.5714
Epoch 7/150
 - 2s - loss: 0.8303 - acc: 0.6401 - val_loss: 0.8443 - val_acc: 0.6364
Epoch 8/150
 - 1s - loss: 0.8143 - acc: 0.6433 - val_loss: 0.8211 - val_acc: 0.6494
Epoch 9/150
 - 1s - loss: 0.8117 - acc: 0.6368 - val_loss: 0.8259 - val_acc: 0.6169
Epoch 10/150
 - 1s - loss: 0.7915 - acc: 0.6515 - val_loss: 0.7892 - val_acc: 0.6429
Epoch 11/150
 - 2s - loss: 0.7738 - acc: 0.6482 - val_loss: 0.7984 - val_acc: 0.6364
Epoch 12/150
 - 1s - loss: 0

Epoch 97/150
 - 1s - loss: 0.6139 - acc: 0.7329 - val_loss: 0.7113 - val_acc: 0.7078
Epoch 98/150
 - 1s - loss: 0.6092 - acc: 0.7296 - val_loss: 0.7164 - val_acc: 0.6948
Epoch 99/150
 - 1s - loss: 0.6139 - acc: 0.7362 - val_loss: 0.6702 - val_acc: 0.7143
Epoch 100/150
 - 1s - loss: 0.6163 - acc: 0.7362 - val_loss: 0.6683 - val_acc: 0.7078
Epoch 101/150
 - 1s - loss: 0.6012 - acc: 0.7329 - val_loss: 0.7078 - val_acc: 0.7143
Epoch 102/150
 - 1s - loss: 0.6148 - acc: 0.7427 - val_loss: 0.6946 - val_acc: 0.7013
Epoch 103/150
 - 1s - loss: 0.5990 - acc: 0.7362 - val_loss: 0.6753 - val_acc: 0.7013
Epoch 104/150
 - 1s - loss: 0.6098 - acc: 0.7410 - val_loss: 0.6858 - val_acc: 0.7208
Epoch 105/150
 - 1s - loss: 0.6167 - acc: 0.7508 - val_loss: 0.6753 - val_acc: 0.7013
Epoch 106/150
 - 1s - loss: 0.5965 - acc: 0.7492 - val_loss: 0.6644 - val_acc: 0.7078
Epoch 107/150
 - 1s - loss: 0.6067 - acc: 0.7394 - val_loss: 0.6656 - val_acc: 0.7143
Epoch 108/150
 - 1s - loss: 0.5979 - acc: 0.7492 - val_lo
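
The sample counts in the first line of the log, and the number of weight updates, can be worked out with a little arithmetic. The sketch assumes the split point is the integer part of `n_samples * (1 - validation_split)`, which is consistent with the 614/154 split reported above.

```python
import math

n_samples = 768
validation_split = 0.2
batch_size = 32
epochs = 150

# Assumption: the split point is the integer part of n_samples * (1 - split).
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# The last batch of an epoch may be smaller than batch_size.
iterations_per_epoch = math.ceil(n_train / batch_size)
total_updates = iterations_per_epoch * epochs

print(n_train, n_val, iterations_per_epoch, total_updates)
# 614 154 20 3000
```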

In [6]:
# Exercise 3.5. Get prediction and performance

from sklearn import metrics

# evaluate the model on all samples
# (note: this includes the samples used for training)
scores = model.evaluate(X, y)
print("\n%s: %.2f%%" % (model.metrics_names[1],
                        scores[1]*100))

# get prediction
predicted=model.predict(X)
print(predicted[0:5])
# reshape the 2-dimension array to 1-dimension
predicted=np.reshape(predicted, -1)

# decide whether the prediction is 1 or 0 based on the probability
predicted=np.where(predicted>0.5, 1, 0)

# calculate performance report
print(metrics.classification_report(y, predicted, \
                                    labels=[0,1]))


acc: 75.00%
[[ 0.64852411]
 [ 0.30080977]
 [ 0.94323528]
 [ 0.23109812]
 [ 0.82546616]]
             precision    recall  f1-score   support

          0       0.84      0.76      0.80       500
          1       0.62      0.74      0.67       268

avg / total       0.77      0.75      0.75       768
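
The thresholding step can be traced by hand on the five probabilities printed above. The values below are copied from that output and rounded to four decimals, and the labels are the class column of the five rows shown by `data.head()`.

```python
# First five probabilities printed by model.predict above (rounded).
probabilities = [0.6485, 0.3008, 0.9432, 0.2311, 0.8255]
# Class labels (column 8) of the first five rows shown by data.head().
actual = [1, 0, 1, 0, 1]

threshold = 0.5
labels = [1 if p > threshold else 0 for p in probabilities]
correct = sum(int(l == a) for l, a in zip(labels, actual))

print(labels)                      # [1, 0, 1, 0, 1]
print(correct / len(actual))       # 1.0
```

Lowering the threshold would trade precision for recall on class 1, and raising it would do the opposite.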

