# NLP Tutorial Comp 551 - Part 2

Sept 28th, 2017

### Agenda

- Text representation
    - one hot
    - embeddings
    - pre-trained embeddings
- Keras
- Spacy

## Vector Space Representation of Documents

A very simple approach to representing documents as numerical values is to treat each word as an atomic type and use it as a basis vector of a vector space. For example, imagine a world where only 3 words exist, “Apple”, “Banana”, and “Orange”, and every sentence or document is made of them. They become the basis of a 3-dimensional vector space:

```
Apple  ==>> [1,0,0]
Banana ==>> [0,1,0]
Orange ==>> [0,0,1]
```

This representation is called "one-hot", since each word is a vector of zeros with a single 1 at the position of that word.

A “sentence” or a “document” is then simply a linear combination of these vectors, where the coefficient along each dimension is the number of times that word appears. For example:

```
d3 = "Apple Orange Orange Apple" ==>> [2,0,2]
d4 = "Apple Banana Apple Banana" ==>> [2,2,0]
d1 = "Banana Apple Banana Banana Banana Apple" ==>> [2,4,0]
d2 = "Banana Orange Banana Banana Orange Banana" ==>> [0,4,2]
d5 = "Banana Apple Banana Banana Orange Banana" ==>> [1,4,1]
```
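The counting step above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and the fixed three-word vocabulary from the example; the helper name `count_vector` is made up for this sketch.

```python
from collections import Counter

# Fixed vocabulary; its order defines the axes of the vector space.
vocabulary = ["Apple", "Banana", "Orange"]

def count_vector(document, vocabulary):
    """Return the number of occurrences of each vocabulary word in `document`."""
    counts = Counter(document.split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]

d5 = "Banana Apple Banana Banana Orange Banana"
print(count_vector(d5, vocabulary))  # [1, 4, 1]
```

Words outside the vocabulary are simply ignored here, which is one common way to handle them.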

### Problems

1. Some words are simply very common and appear in most documents (“the”, “and”, “or”, etc.), so their counts dominate without telling documents apart.
2. A huge vocabulary makes the vectors very high-dimensional and very sparse.
3. When words are used as atomic types for the basis of the vector space, they have no semantic relations (the similarity between them is zero, since they are perpendicular to each other). However, in reality we know that words can be similar in meaning, or even almost identical synonyms.
4. No sentence structure is incorporated: word order is lost, so “Apple Orange” and “Orange Apple” get the same vector.
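To make problem 3 concrete, here is a small sketch using cosine similarity, which is one common choice of similarity measure (the text above does not prescribe a specific one). It shows that two different one-hot words always have similarity zero, while documents can still overlap through shared words.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot basis vectors for the three-word vocabulary.
apple, banana, orange = np.eye(3)

# Distinct words are orthogonal, so their similarity is exactly zero.
print(cosine_similarity(apple, orange))

# Documents d1 and d2 from the example share many "Banana" counts.
d1 = [2, 4, 0]
d2 = [0, 4, 2]
print(cosine_similarity(d1, d2))  # roughly 0.8
```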

### Word Embeddings

Here we create a "dense" representation of each word where proximity in vector space represents "similarity". 

![](https://github.com/michaelcapizzi/nlp-basics/raw/753ab4c178c6bd2cebcc8d4a5631bf6220c85479/images/king_queen_vis.png)

The basic idea is that a shallow neural network, trained on a large corpus of text, can generate a vector for each word such that semantically related words end up close to each other. Each word is represented by a distribution of weights across the elements of its vector. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

![](https://adriancolyer.files.wordpress.com/2016/04/word2vec-distributed-representation.png?w=566&zoom=2)
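As a toy illustration of "proximity represents similarity", the sketch below compares a few dense vectors with cosine similarity. The 3-dimensional vectors are hand-written for this example and are purely hypothetical; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import numpy as np

# Hypothetical dense embeddings, invented for illustration only.
embeddings = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.7, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

def similarity(word_a, word_b):
    """Cosine similarity between the embeddings of two words."""
    a, b = embeddings[word_a], embeddings[word_b]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Unlike one-hot vectors, related words can now be close to each other.
print(similarity("king", "queen"))
print(similarity("king", "banana"))
```

With these made-up numbers, "king" and "queen" come out far more similar than "king" and "banana", which is the behaviour learned embeddings aim for.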

In practice, we can either **train** the word embeddings from scratch on our dataset, or we can use **pre-trained** word embeddings. word2vec and GloVe are two of the most popular pre-trained word embeddings in use.

Refer to these resources:

- https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/


## Keras Introduction

Let's start with text classification using some of the concepts we learnt before.
We will be using the Keras library (https://keras.io/), but know that there are many others that may suit your personal needs and preferences better (Blocks/Fuel, Lasagne, MXNet, Torch, Caffe, Deeplearning4j (Java), among many others).

You may want to go to a lower level (or implement fancier models), and directly use symbolic CPU/GPU computing libraries such as Theano, Tensorflow, or Chainer.  
Note that Keras is built on top of Theano/Tensorflow, and requires one or the other as a backend.

**Dataset**: We will be using the Reuters newswire topic classification dataset.

In [1]:
import numpy as np
import keras
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras import utils as np_utils

Using Theano backend.


In [2]:
max_words = 1000
batch_size = 32
epochs = 5

Loading the data

In [46]:
# Keep only the `max_words` most frequent words; hold out 20% of the data for testing.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2)


In [47]:
len(x_train)

8982

In [48]:
len(x_test)

2246

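Now we vectorize the input sequences. We start with a simple binary encoding: each document becomes a vector of length `max_words` with a 1 at the index of every word that appears in it (the one-hot vectors of its words combined, without keeping the counts).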

In [49]:
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
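To see what the binary mode produces, here is a rough NumPy sketch of the same idea. It is an illustration of the encoding, not the Keras implementation; the function name and the toy sequences are made up for this example.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    """Mark which word indices occur in each sequence (a multi-hot encoding)."""
    matrix = np.zeros((len(sequences), num_words))
    for row, sequence in enumerate(sequences):
        for index in sequence:
            if index < num_words:  # indices outside the vocabulary are ignored
                matrix[row, index] = 1.0
    return matrix

# Two toy documents given as lists of word indices.
toy_sequences = [[1, 2, 2], [3]]
print(sequences_to_binary_matrix(toy_sequences, num_words=5))
```

Note that the repeated index 2 in the first document still yields a single 1: the binary mode records presence, not counts.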

In [50]:
x_train

array([[ 0.,  1.,  1., ...,  0.,  0.,  0.],
       [ 0.,  1.,  1., ...,  0.,  0.,  0.],
       [ 0.,  1.,  1., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  1.,  1., ...,  0.,  0.,  0.],
       [ 0.,  1.,  1., ...,  0.,  0.,  0.],
       [ 0.,  1.,  1., ...,  0.,  0.,  0.]])

Convert the class vector to a binary class matrix, as required by the categorical cross-entropy loss function.

In [51]:
num_classes = np.max(y_train) + 1

In [52]:
num_classes

46

In [53]:
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

In [54]:
y_train.shape

(8982, 46)
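The conversion from integer labels to a binary class matrix can be sketched in NumPy as follows. This is an illustrative equivalent written for this tutorial, with a made-up function name, not the Keras source.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Turn integer class labels into rows containing a single 1."""
    labels = np.asarray(labels)
    matrix = np.zeros((len(labels), num_classes))
    matrix[np.arange(len(labels)), labels] = 1.0  # one 1 per row, at the class index
    return matrix

print(one_hot_labels([0, 2, 1], num_classes=3))
```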

Now we proceed to build the model. Here we use a simple feedforward neural network with a ReLU non-linearity. With Keras you can easily build more complex architectures on your own.

In [55]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
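For intuition, the computation this architecture performs at prediction time can be written out in NumPy. The sketch below uses randomly initialised weights as stand-ins for the parameters Keras would learn, so the output probabilities are meaningless; only the shapes and the sequence of operations are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
max_words, hidden_units, num_classes = 1000, 512, 46

# Random stand-ins for the learned weights and biases of the two Dense layers.
W1 = rng.normal(scale=0.01, size=(max_words, hidden_units))
b1 = np.zeros(hidden_units)
W2 = rng.normal(scale=0.01, size=(hidden_units, num_classes))
b2 = np.zeros(num_classes)

def forward(x):
    """Dense -> ReLU -> Dense -> softmax (dropout is inactive at prediction time)."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# A fake batch of 4 binary bag-of-words vectors.
batch = rng.integers(0, 2, size=(4, max_words)).astype(float)
probabilities = forward(batch)
print(probabilities.shape)  # (4, 46)
```

Each row of `probabilities` sums to 1, which is what lets us read the softmax output as a distribution over the 46 topics.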

## Training your model

In order to train models, people usually use some variant of stochastic gradient descent. The good thing with all these libraries is that all you have to do is declare the architecture of your model, and the library takes care of automatically computing gradients for you!

### Losses

There are many loss functions one can use. The two basic ones you will need are **mean squared error** (quite general)
$$ \frac{1}{N_{ex}}\sum_{i=1}^{N_{ex}} \|f(\mathbf{x}_i)-\mathbf{y}_i\|^2_2 $$
and **categorical crossentropy** (specialized for classification):
$$ - \frac{1}{N_{ex}}\sum_{i=1}^{N_{ex}} \sum_{j=1}^{N_{classes}} y_{i,j}\log f_j(\mathbf{x}_i) $$
Here $\mathbf{y}_i$ is assumed to be a one-hot vector encoding the class, so only the term of the true class is non-zero. An implementation therefore only needs the index of the 1 ("sparse" categorical crossentropy).
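Both losses are short enough to write out directly. The following NumPy sketch follows the two formulas above term by term and also shows that the "sparse" variant gives the same value as the one-hot variant; the function names and the toy numbers are chosen for this example.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Squared L2 distance per example, averaged over the examples."""
    return float(np.mean(np.sum((predictions - targets) ** 2, axis=1)))

def categorical_crossentropy(probabilities, one_hot_targets):
    """Average negative log-probability, written with one-hot targets."""
    return float(-np.mean(np.sum(one_hot_targets * np.log(probabilities), axis=1)))

def sparse_categorical_crossentropy(probabilities, class_indices):
    """Same loss, using only the index of the true class of each example."""
    rows = np.arange(len(class_indices))
    return float(-np.mean(np.log(probabilities[rows, class_indices])))

# Two examples, three classes; true classes are 0 and 1.
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]])
one_hot = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

print(categorical_crossentropy(probabilities, one_hot))
print(sparse_categorical_crossentropy(probabilities, [0, 1]))
print(mean_squared_error(probabilities, one_hot))
```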

### Gradient descent methods

There are several GD methods you should know.

The basic one, **SGD**, takes a single example or a minibatch of examples, computes the gradient of the loss on it, and moves the weights a small step in the opposite direction.
$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$$
The learning rate is $\alpha$, typically $\alpha<1$.
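As a minimal sketch of this update rule, the snippet below applies it to a toy one-dimensional objective $\mathcal{L}(\theta) = \theta^2$, whose gradient is $2\theta$. The objective, starting point, and learning rate are assumptions made for illustration; with a real model the gradient would come from a minibatch.

```python
def sgd_step(theta, gradient, learning_rate):
    """One gradient descent update: theta <- theta - alpha * gradient."""
    return theta - learning_rate * gradient

# Toy objective L(theta) = theta**2, so the gradient is 2 * theta.
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, gradient=2 * theta, learning_rate=0.1)

print(theta)  # very close to the minimum at 0
```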

**Momentum** keeps track of the velocity of your stochastic gradient descent (a running average of past gradients), so as to keep going in the same general direction, and can considerably speed up learning.
$$\begin{align} \mu &\leftarrow \gamma \mu + (1-\gamma) \nabla_\theta \mathcal{L} \\ \theta &\leftarrow \theta - \alpha \mu
\end{align}$$
With $0<\gamma<1$, usually quite close to 1.
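The two momentum equations translate directly into code. The sketch below again uses the toy objective $\mathcal{L}(\theta) = \theta^2$ (an assumption for illustration), with $\gamma = 0.9$ and $\alpha = 0.1$ picked arbitrarily.

```python
def momentum_step(theta, mu, gradient, learning_rate=0.1, gamma=0.9):
    """Update the velocity mu, then move theta along it."""
    mu = gamma * mu + (1 - gamma) * gradient
    theta = theta - learning_rate * mu
    return theta, mu

# Toy objective L(theta) = theta**2, so the gradient is 2 * theta.
theta, mu = 5.0, 0.0
for _ in range(500):
    theta, mu = momentum_step(theta, mu, gradient=2 * theta)

print(theta)  # very close to the minimum at 0
```

Note that the velocity `mu` has to be carried from one step to the next; it is part of the optimizer state, not of the model.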

**RMSProp** (Root Mean Square Prop) is similar in spirit, but is intended to act more like an adaptive learning rate.
$$\begin{align} g &= \nabla_\theta \mathcal{L}\\
\eta &\leftarrow \gamma \eta + (1-\gamma) g^2 \\
\theta &\leftarrow \theta - \frac{\alpha g}{\sqrt{\eta + \epsilon}}
\end{align}$$
Here $\epsilon$ is a small constant that prevents division by zero, typically around $10^{-4}$.
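A minimal sketch of the RMSProp update, following the three equations above on the toy objective $\mathcal{L}(\theta) = \theta^2$. The objective and the hyperparameter values ($\alpha = 0.1$, $\gamma = 0.9$) are assumptions for illustration.

```python
import math

def rmsprop_step(theta, eta, gradient, learning_rate=0.1, gamma=0.9, epsilon=1e-4):
    """Scale the step by a running average of the squared gradient."""
    eta = gamma * eta + (1 - gamma) * gradient ** 2
    theta = theta - learning_rate * gradient / math.sqrt(eta + epsilon)
    return theta, eta

# Toy objective L(theta) = theta**2, so the gradient is 2 * theta.
theta, eta = 5.0, 0.0
for _ in range(200):
    theta, eta = rmsprop_step(theta, eta, gradient=2 * theta)

print(theta)  # ends up near the minimum at 0
```

Because the gradient is divided by its own recent magnitude, the step size depends mostly on the learning rate rather than on the scale of the gradient. With a constant learning rate the iterate tends to hover near the minimum instead of settling exactly on it, which is one reason learning rates are often decayed in practice.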

**Adam** (Adaptive Moment Estimation) is probably the most popular method right now, and is roughly a mix of RMSProp and Momentum (the bias-correction terms of the full algorithm are omitted here for brevity):
$$\begin{align} g &= \nabla_\theta \mathcal{L}\\
\mu &\leftarrow \beta_1 \mu + (1-\beta_1) g \\
\eta &\leftarrow \beta_2 \eta + (1-\beta_2) g^2 \\
\theta &\leftarrow \theta - \frac{\alpha \mu}{\sqrt{\eta + \epsilon}}
\end{align}$$
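One Adam update, written exactly as in the equations above (so without the bias correction used by the full algorithm), might look like the sketch below. The default values $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used ones and are not taken from this tutorial; the toy objective $\mathcal{L}(\theta) = \theta^2$ is again an assumption.

```python
import math

def adam_step(theta, mu, eta, gradient,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One simplified Adam update (no bias correction)."""
    mu = beta1 * mu + (1 - beta1) * gradient          # momentum-like average
    eta = beta2 * eta + (1 - beta2) * gradient ** 2   # RMSProp-like average
    theta = theta - learning_rate * mu / math.sqrt(eta + epsilon)
    return theta, mu, eta

# Toy objective L(theta) = theta**2, so the gradient is 2 * theta.
theta, mu, eta = 1.0, 0.0, 0.0
theta, mu, eta = adam_step(theta, mu, eta, gradient=2 * theta)
print(theta, mu, eta)
```

Both running averages `mu` and `eta` are optimizer state and must be kept between steps, one pair per parameter.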

### Dropout

Dropout is a regularization technique, where some proportion $p$ of the units is randomly multiplied by 0 at each new forward pass. This forces the model to be robust to noise and often to generalize better, as it cannot rely on a single unit to hold some vital information about the final prediction. It is usually only applied to intermediate layers, but can be applied to the input as well.
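A small NumPy sketch of dropout follows. It uses the "inverted" variant, which additionally rescales the surviving units by $1/(1-p)$ during training so that nothing needs to change at test time; that rescaling is an addition to the description above, and the function name is made up for this example.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Zero out a proportion p of the units at random (inverted dropout)."""
    if not training:
        return activations  # dropout is switched off at test time
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.ones((2, 8))
print(dropout(hidden, p=0.5, rng=rng))                  # a mix of 0.0 and 2.0
print(dropout(hidden, p=0.5, rng=rng, training=False))  # unchanged
```

A fresh mask is drawn on every call, which corresponds to "at each new forward pass" in the description.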

For a more complete description of the gradient descent algorithms above, see http://sebastianruder.com/optimizing-gradient-descent/index.html

We now compile the model. We use the **categorical cross-entropy** loss and the **Adam** optimizer. Adam generally works quite well for NLP tasks, but feel free to experiment on your own.

In [56]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Our model is ready to be trained!

In [57]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)

Train on 8083 samples, validate on 899 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Within just 5 epochs, we reach 90% training accuracy and 79% validation accuracy!

Now we evaluate the model on the test set.

In [58]:
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)



In [59]:
score

[0.89323368777362555, 0.79207479964381122]

Here the first element is the test loss (mean categorical cross-entropy), and the second element is the test accuracy.

## Introducing Spacy

While `nltk` is a fairly comprehensive tool for text processing and NLP tasks, in this tutorial we would like to introduce a new kid on the block, [Spacy](https://spacy.io). Written in Python and Cython, Spacy gives us a whole new way to process text and get word embeddings, and it can be plugged into any existing deep learning framework with ease.

In [19]:
import spacy
# Load a language model. Spacy comes with 3 language models (English, German, French),
# although alpha support is available for other languages.
nlp = spacy.load('en')

Get word embeddings

In [60]:
len(nlp.vocab)

742225

In [61]:
sentence = nlp(u'hey how are you?')

In [62]:
sentence

hey how are you?

In [63]:
word = sentence[0]

In [64]:
word

hey

Spacy comes with pre-trained GloVe word embeddings, available through the `vector` attribute of each token.

In [65]:
word.vector

array([ -2.84509987e-01,   3.10070008e-01,  -5.70389986e-01,
        -7.30559975e-02,  -1.73219994e-01,   3.45140010e-01,
         2.50640009e-02,  -7.44499981e-01,  -8.00030008e-02,
         1.27059996e+00,  -1.54060006e-01,  -5.62049985e-01,
        -5.52049987e-02,  -2.57739991e-01,  -3.04600000e-02,
        -1.09810002e-01,   9.67290029e-02,   5.18400013e-01,
        -9.26730037e-02,  -4.92510013e-02,   9.01570022e-02,
         9.43889990e-02,   1.80559993e-01,  -6.19909987e-02,
         7.09050000e-02,  -2.71360010e-01,  -8.50069970e-02,
        -1.11780003e-01,   5.10959983e-01,   7.31770024e-02,
        -7.37999976e-02,   4.16130006e-01,   4.70569991e-02,
         5.62300012e-02,  -3.62500012e-01,   3.00779998e-01,
        -5.41069992e-02,   1.59170002e-01,  -3.28060001e-01,
         4.12040018e-02,  -7.02250004e-02,  -1.33489996e-01,
        -2.04420000e-01,  -1.81180000e-01,   3.01580001e-02,
         2.43729994e-01,   1.51639998e-01,  -1.19800001e-01,
         3.40539992e-01,

You can use these embeddings in Keras through an `Embedding` layer:

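```
from keras.layers import Embedding

# num_words, embedding_matrix and MAX_SEQUENCE_LENGTH are assumed to be defined
# earlier (vocabulary size, pre-trained vectors, and padded input length).
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # keep the pre-trained vectors frozen
```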

Here `EMBEDDING_DIM = 300`, and `weights` holds the embedding matrix: one row per word index, filled with the corresponding vector from `nlp.vocab`.
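Building such an embedding matrix usually amounts to copying one vector per word into the row given by that word's index. The sketch below shows the pattern with hypothetical stand-ins so that it runs without Spacy or Keras: `word_index` plays the role of a tokenizer's word-to-index map, `pretrained` plays the role of the vector lookup, and the dimension is shrunk from 300 to 4.

```python
import numpy as np

EMBEDDING_DIM = 4  # 300 for the Spacy vectors; kept small for this toy example

# Hypothetical word -> index map, as a tokenizer would produce.
word_index = {"hey": 1, "how": 2, "are": 3, "you": 4}

# Hypothetical word -> vector lookup; "are" is deliberately missing.
rng = np.random.default_rng(0)
pretrained = {word: rng.normal(size=EMBEDDING_DIM) for word in ["hey", "how", "you"]}

num_words = len(word_index) + 1  # index 0 is reserved for padding
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, index in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words without a pre-trained vector stay all-zero
        embedding_matrix[index] = vector

print(embedding_matrix.shape)  # (5, 4)
```

Leaving unknown words as zero rows is only one option; initialising them randomly is another common choice.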

### Tons of features

In [41]:
## Sentence similarity

doc1 = nlp(u'hey how are you')
doc2 = nlp(u'hey how are you')
doc1.similarity(doc2)

1.0000000601858707
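By default, Spacy estimates document similarity as the cosine similarity between the averaged word vectors of the two documents. The NumPy sketch below reproduces that idea on made-up 2-dimensional word vectors; it is an illustration, not Spacy's code.

```python
import numpy as np

def document_similarity(word_vectors_a, word_vectors_b):
    """Cosine similarity between the mean word vectors of two documents."""
    mean_a = np.mean(word_vectors_a, axis=0)
    mean_b = np.mean(word_vectors_b, axis=0)
    return float(mean_a @ mean_b / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

# Hypothetical word vectors, one row per word in the document.
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b = np.array([[1.0, 0.0], [1.0, 0.0]])

print(document_similarity(doc_a, doc_a))  # identical documents: 1.0 up to rounding
print(document_similarity(doc_a, doc_b))
```

This also explains why the value above is 1.0000000601858707 rather than exactly 1: the result is computed in floating point, so tiny rounding errors are expected.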

In [42]:
## POS Tagging

[w.pos_ for w in doc1]

[u'INTJ', u'ADV', u'VERB', u'PRON']

In [44]:
## Entity recognition

doc = nlp(u'Montreal is such a cool place!')

print(doc[0].text, doc[0].ent_iob, doc[0].ent_type_)

(u'Montreal', 3, u'GPE')


... and many more!

## General Tips

### A small list of things that often go wrong
(in Deep Learning)

**The learning rate is too large**  
Symptoms: training loss oscillates, and/or doesn't go down

**The learning rate is too small**  
Symptoms: training loss goes down extremely slowly or not at all, and accuracy does not go up

**Your model is taking forever to learn**  
Symptoms: Validation and training error go down, but very slowly  
Fix: use batch normalization, and momentum-based or adaptive optimizers such as Adam and RMSProp

**Your model has too many layers/parameters**  
Symptoms: the training accuracy is almost 1, but the validation accuracy is terrible (overfitting!)   
Fix: reduce the number of layers and hidden units

**Your model has too little data**  
Symptoms: the training accuracy is almost 1, but the validation accuracy is terrible (overfitting!)   
Fix: create noised data, use Dropout, augment your data with other datasets

**Your input is not well distributed**  
Symptoms: extreme activation/loss values, and no learning  
Fix: make sure that your input is in [0,1] or has mean 0 and variance 1. 
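For the last tip, a common way to get inputs with mean 0 and variance 1 is to standardize each feature using statistics computed on the training set only. A minimal NumPy sketch, with made-up data and a made-up helper name:

```python
import numpy as np

def standardize(train, test, epsilon=1e-8):
    """Scale features to mean 0 and variance 1, using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    scale = std + epsilon  # epsilon guards against constant features
    return (train - mean) / scale, (test - mean) / scale

# Fake data: 3 features with a large offset and spread.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

train_scaled, test_scaled = standardize(train, test)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The test set is transformed with the training mean and standard deviation, so no information from the test data leaks into preprocessing.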


## References

1. [Keras text classification with Spacy and word embeddings](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py)
2. [Sentiment Analysis with Keras](https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py)
3. [Huge list of Keras examples](https://github.com/fchollet/keras/tree/master/examples)
4. [Spacy Docs](https://spacy.io/docs/)