# Part 2: Specific Neural Architectures for NLP

**_Yuriy Guts_**

_UCU NLP Summer School, 2018_

## 2.1. Motivation

By now, you have a decent arsenal of approaches for building machine learning models for NLP. However, as far as the models go, we've mainly used general-purpose classification architectures so far. Even in the case of feed-forward neural networks (MLP) there's nothing architecturally tailored for handling language data or sequential data.

Take a look at the following phrase. What would be the BoW vector for it?

<center>
_I liked the new LG phone but it was not just as good as the one from Samsung_
<img src="bow-and-word-order.png">
</center>

The feature vector can be directly used for many traditional models, but does it capture subtle language details like word order and regularities?
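
To make the question concrete, here is a minimal sketch that uses a plain `collections.Counter` and whitespace tokenization as a stand-in for a BoW vectorizer (both are simplifying assumptions, and `bag_of_words` is a hypothetical helper). Swapping the two brand names reverses the meaning of the phrase, yet the BoW representation does not change:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokenizer would do more.
    return Counter(text.lower().split())

original = 'I liked the new LG phone but it was not just as good as the one from Samsung'
swapped = 'I liked the new Samsung phone but it was not just as good as the one from LG'

bow_original = bag_of_words(original)
bow_swapped = bag_of_words(swapped)

# Opposite preferences, identical word counts.
same_representation = bow_original == bow_swapped
print('Same BoW representation:', same_representation)
```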

Today we are going to explore some neural architectures that are more specialized for handling language: 1D **convolutional neural networks** (CNN) and **recurrent neural networks** (RNN).

Note that the architectures we'll explore are primarily used as feature extractors according to the neural NLP diagram below (source: [explosion.ai](https://explosion.ai/blog/deep-learning-formula-nlp)):

<img src="embed-encode-attend-predict.png">

When thinking about CNNs, RNNs or other approaches it is useful to think about them as "Lego Bricks" that can be mixed and matched to create a desired structure and to achieve the desired behavior. This allows us to create large and elaborate network structures, with multiple layers of MLPs, CNNs and RNNs feeding into each other, and training the entire network in an end-to-end fashion.

## 2.2. Convolutional Neural Networks

#### _AKA: "Learnable N-gram Detectors"_

Often we are interested in making predictions based on **ordered** sets of items:
    
* sequence of characters in a word
* sequence of words in a sentence
* sequence of sentences in a document

A typical example is sentiment analysis, where we want to predict the emotional tone of a phrase. Individual words in a sentence can be very informative of the sentiment, but let's look at our earlier example again:

<center>
_I liked the new LG phone but it was not just as good as the one from Samsung._
</center>

It is difficult for a non-language-aware model to capture that the writer actually considered the first phone to be inferior to the other.

As you probably know, a slightly better option in this case would be to include combinations of words (n-grams) in the bag-of-words model. That is, we allocate separate dimensions for _I liked the_, _liked the new_, _the new LG_ and so on.

That way, _good_ and _not as good_ would have a different meaning for the model. However, this approach has downsides: it results in huge matrices, it does not scale to longer n-grams (typically longer than 4-5), and the extreme sparsity means that most n-gram features are observed too rarely to estimate their weights reliably. For example, "_really loved the new phone_" and "_genuinely enjoyed the new phone_" have a very similar meaning, but it's highly improbable to see both exact phrases even in a fairly large corpus.
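
The sparsity problem can be illustrated with the two phrases above. The following sketch (the `ngrams` helper and whitespace tokenization are assumptions made for illustration) counts how many n-grams the phrases share for increasing $n$:

```python
def ngrams(text, n):
    # All contiguous n-word sequences of a whitespace-tokenized text.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = 'really loved the new phone'
b = 'genuinely enjoyed the new phone'

shared_counts = {n: len(ngrams(a, n) & ngrams(b, n)) for n in range(1, 6)}
for n, count in shared_counts.items():
    print('{}-grams shared: {}'.format(n, count))
```

The overlap shrinks from three shared unigrams to no shared 4-grams or 5-grams, so a model with one dimension per n-gram cannot transfer what it learned about one phrase to the other.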

A **convolution-and-pooling network** is designed to identify indicative local predictors in a large structure, and combine them into a fixed size vector representation of the structure, capturing the local aspects that are valuable for the prediction task at hand.

### Convolution operation

The main idea is to apply a learned non-linear function to each $k$-word sliding window of the sequence.

Let's define the operator $\oplus(w_{i:i+k-1})$ to be the concatenation of vectors $w_i, w_{i+1}, ..., w_{i+k-1}$.

The concatenated vector of the $i$-th window is $x_i = \oplus(w_{i:i+k-1}) = [w_i; w_{i+1}; ...; w_{i+k-1}]$.

We then apply the convolution filter to each window vector $x_i$, resulting in _scalar_ values $p_i$:

$$
p_i = g(x_i \cdot u) \\
x_i = \oplus(w_{i:i+k-1}) \\
p_i \in \mathbb{R} \quad x_i \in \mathbb{R}^{k \cdot d_{emb}} \quad u \in \mathbb{R}^{k \cdot d_{emb}}
$$

Here $g$ is a nonlinear activation function such as $\tanh$ or $\mathrm{ReLU}$, and $u$ is a learnable weight vector.
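
As a quick sanity check of the notation, the sketch below computes $p_i$ for a single window with NumPy. The sizes, the random word vectors, and the choice of ReLU for $g$ are all made-up assumptions. It also shows that the dot product with the concatenated window equals reshaping $u$ into a $k \times d_{emb}$ filter and summing the element-wise product with the window:

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary fixed seed

k, d_emb = 3, 4                    # toy window size and embedding size
w = rng.normal(size=(6, d_emb))    # six made-up word vectors
u = rng.normal(size=k * d_emb)     # one filter

def relu(z):
    return np.maximum(z, 0)

i = 1
# x_i = [w_i; w_{i+1}; w_{i+2}], a vector of length k * d_emb
x_i = np.concatenate([w[j] for j in range(i, i + k)])
p_i = relu(x_i @ u)                # a single scalar

# Equivalent view: a (k, d_emb) filter multiplied element-wise with the window.
p_i_alt = relu(np.sum(u.reshape(k, d_emb) * w[i:i + k]))
print(np.isclose(p_i, p_i_alt))
```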

It is customary to use $\ell$ different filters $u_1, \ldots, u_{\ell}$, which can be arranged as the columns of a matrix $U$, and a bias vector $b$ is often added:

$$
p_i = g(x_i \cdot U + b) \\
p_i \in \mathbb{R}^\ell \quad x_i \in \mathbb{R}^{k \cdot d_{emb}} \quad U \in \mathbb{R}^{k \cdot d_{emb} \times \ell} \quad b \in \mathbb{R}^\ell 
$$
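
The matrix form can be checked by tracking shapes. This is a minimal sketch with made-up sizes and random values, using ReLU as $g$. Each row of `X` is one window vector $x_i$ and each row of `P` is the corresponding $p_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_emb, k, ell = 7, 5, 3, 4          # toy sizes: 7 words, 4 filters
w = rng.normal(size=(n, d_emb))        # made-up word vectors
U = rng.normal(size=(k * d_emb, ell))  # one filter per column
b = np.zeros(ell)                      # bias vector

m = n - k + 1                          # number of windows without padding
X = np.stack([w[i:i + k].reshape(-1) for i in range(m)])  # (m, k * d_emb)
P = np.maximum(X @ U + b, 0)           # (m, ell)

print(X.shape, P.shape)
```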

We can also vary the sizes of filters ($k$) to obtain features from $k$-grams of different length.

Or, graphically (source: [WildML](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)): <img src="word-cnn.png">

In [1]:
from collections import defaultdict

import numpy as np
import spacy

np.set_printoptions(formatter={'float': '{: 0.5f}'.format})

In [2]:
# Make sure to run `python -m spacy download en_core_web_md` if you've just installed spacy.
nlp = spacy.load('en_core_web_md')

In [3]:
doc = nlp('I like this movie very much!')
for token in doc:
    print('Token: {}\tFirst 3 embedding dims: {}'.format(token, token.vector[:3]))

Token: I	First 3 embedding dims: [ 0.18733  0.40595 -0.51174]
Token: like	First 3 embedding dims: [-0.18417  0.05511 -0.36953]
Token: this	First 3 embedding dims: [-0.08760  0.35502  0.06387]
Token: movie	First 3 embedding dims: [ 0.20710 -0.47656  0.15479]
Token: very	First 3 embedding dims: [-0.31342  0.37267 -0.41600]
Token: much	First 3 embedding dims: [-0.40534  0.47027 -0.06660]
Token: !	First 3 embedding dims: [-0.26554  0.33531  0.21860]


In [4]:
# np.vstack expects a sequence of arrays, so build a list rather than pass a generator.
doc_matrix = np.vstack([token.vector for token in doc])
print('Document tensor shape:', doc_matrix.shape)

Document tensor shape: (7, 300)


In [5]:
conv_sizes = [2, 3, 4]
conv_filters_per_size = 2

# Initialize each filter's weights randomly.
# We'll use 'i', 'k', 'l' names to match the notation above.

conv_filter_weights = {
    k: [
        np.random.uniform(size=(k, doc_matrix.shape[1]))
        for l in range(conv_filters_per_size)
    ]
    for k in conv_sizes
}

for k in conv_sizes:
    for l in range(conv_filters_per_size):
        print('{}-gram filter #{} shape: {}'.format(k, l, conv_filter_weights[k][l].shape))

2-gram filter #0 shape: (2, 300)
2-gram filter #1 shape: (2, 300)
3-gram filter #0 shape: (3, 300)
3-gram filter #1 shape: (3, 300)
4-gram filter #0 shape: (4, 300)
4-gram filter #1 shape: (4, 300)


In [6]:
def relu(x):
    return np.maximum(x, 0)

# Convolve the input document using different filters.
# This code isn't efficient, mainly written for clarity.
conv_filter_outputs = {}

for k in conv_sizes:
    print()
    conv_filter_outputs[k] = [list() for l in range(conv_filters_per_size)]
    
    for l in range(conv_filters_per_size):
        for i in range(len(doc_matrix) - k + 1):
            tokens = [doc[t].text for t in range(i, i + k)]
            print('{}-gram filter #{}: convolving {}'.format(k, l, ','.join(tokens)))

            convolution = relu(np.sum(conv_filter_weights[k][l] * doc_matrix[i:i + k]))
            conv_filter_outputs[k][l].append(convolution)

        conv_filter_outputs[k][l] = np.array(conv_filter_outputs[k][l])

for k in conv_sizes:
    print()
    for l in range(conv_filters_per_size):
        print('{}-gram filter #{} output: {}'.format(k, l, conv_filter_outputs[k][l]))


2-gram filter #0: convolving I,like
2-gram filter #0: convolving like,this
2-gram filter #0: convolving this,movie
2-gram filter #0: convolving movie,very
2-gram filter #0: convolving very,much
2-gram filter #0: convolving much,!
2-gram filter #1: convolving I,like
2-gram filter #1: convolving like,this
2-gram filter #1: convolving this,movie
2-gram filter #1: convolving movie,very
2-gram filter #1: convolving very,much
2-gram filter #1: convolving much,!

3-gram filter #0: convolving I,like,this
3-gram filter #0: convolving like,this,movie
3-gram filter #0: convolving this,movie,very
3-gram filter #0: convolving movie,very,much
3-gram filter #0: convolving very,much,!
3-gram filter #1: convolving I,like,this
3-gram filter #1: convolving like,this,movie
3-gram filter #1: convolving this,movie,very
3-gram filter #1: convolving movie,very,much
3-gram filter #1: convolving very,much,!

4-gram filter #0: convolving I,like,this,movie
4-gram filter #0: convolving like,this,movie,very
4-gram

### Pooling operation

After we've computed the convolutions over all possible windows in the sentence, we apply a "pooling" operation to combine the vectors resulting from different windows into a single $\ell$-dimensional vector, usually by taking the max or the average value in each of the $\ell$ dimensions over the different windows.

The intention is to focus on the features that activate the receptive unit the most, regardless of their exact location. Each filter extracts a different indicator, and the pooling operation "zooms in" on the most important ones.
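
A minimal sketch of max and average pooling, assuming a small made-up matrix `P` of convolution outputs with one row per window and one column per filter:

```python
import numpy as np

# Made-up convolution outputs: m = 4 windows (rows), ell = 3 filters (columns).
P = np.array([
    [0.1, 0.0, 2.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.9, 0.5],
    [0.0, 0.4, 0.1],
])

max_pooled = P.max(axis=0)    # strongest activation of each filter
avg_pooled = P.mean(axis=0)   # average activation of each filter
argmax = P.argmax(axis=0)     # which window activated each filter the most

print(max_pooled, avg_pooled, argmax)
```

Whatever the number of windows, the pooled result always has $\ell$ entries, which is what lets sentences of different lengths feed into the same downstream layer.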

Another variation is the $k$-max pooling operation, in which the top $k$ values in each dimension are retained instead of only the best one, while preserving the order in which they appeared in the text. This pooling operation produces a $k \times \ell$ matrix which can be later reshaped into a vector. Yet another variation is dynamic pooling where we divide the sequence into parts along the time axis and pool each part separately.  
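
One possible way to implement $k$-max pooling with NumPy is sketched below. The function name and the parameter name `top_k` are assumptions, chosen to avoid a clash with the window size $k$. The second sort on the indices is what preserves the original text order:

```python
import numpy as np

def k_max_pooling(P, top_k):
    # Row indices of the top_k largest values in each column (in any order).
    idx = np.argsort(P, axis=0)[-top_k:, :]
    # Sort the indices so the kept values appear in their original text order.
    idx = np.sort(idx, axis=0)
    return np.take_along_axis(P, idx, axis=0)

P = np.array([
    [1.0, 9.0],
    [5.0, 2.0],
    [3.0, 8.0],
    [7.0, 4.0],
])

pooled = k_max_pooling(P, top_k=2)   # [[5., 9.], [7., 8.]]
flat = pooled.reshape(-1)            # reshaped into a single vector
print(pooled)
```

In the first column the kept values are 5 and 7, in that order, because 5 occurs earlier in the sequence even though 7 is larger.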

### Hierarchy

The 1D convolution approach described so far can be thought of as an $n$-gram detector. A convolution layer with a window of size $k$ is learning to identify indicative $k$-grams in the input.

The approach can be extended into a hierarchy of convolutional layers, in which a sequence of convolution layers is applied one after the other.

<img src="cnn-hierarchical.png">
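
A hierarchy can be sketched by feeding the output of one convolution layer into the next. The `conv_layer` helper, the sizes, and the random weights below are assumptions made for illustration. With $k = 3$ and no padding, each row of the second layer depends on three rows of the first layer, which together cover five input words:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(X, U, b, k):
    # Narrow 1D convolution with ReLU: (n, d_in) -> (n - k + 1, ell).
    windows = np.stack([X[i:i + k].reshape(-1) for i in range(len(X) - k + 1)])
    return np.maximum(windows @ U + b, 0)

n, d_emb, k, ell = 7, 5, 3, 4
words = rng.normal(size=(n, d_emb))      # made-up word vectors

U1, b1 = rng.normal(size=(k * d_emb, ell)), np.zeros(ell)
U2, b2 = rng.normal(size=(k * ell, ell)), np.zeros(ell)

layer1 = conv_layer(words, U1, b1, k)    # each row sees 3 words
layer2 = conv_layer(layer1, U2, b2, k)   # each row sees 5 words
print(layer1.shape, layer2.shape)
```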

### Padding, channels, strides, dilations

A common question for CNNs: what should we do at the document boundaries? Should we allow the filter to cross the boundary and apply some sort of padding like we often do with images? The answer is that we can do either. The approaches without padding and with padding are called **narrow** and **wide** convolutions respectively. If we apply padding, we can either pad with zeros or allocate a special padding word in our dictionary with its own embedding vector.

<img src="cnn-padding.png">
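
The difference shows up in the number of windows. The sketch below assumes zero padding with $k - 1$ vectors on each side for the wide case, and uses placeholder word vectors:

```python
import numpy as np

n, d_emb, k = 7, 5, 3
words = np.ones((n, d_emb))                 # placeholder word vectors

# Narrow: only windows that fit entirely inside the sentence.
narrow_windows = n - k + 1

# Wide: pad k - 1 zero vectors on each side, so every word appears
# in every position of the filter.
padded = np.pad(words, ((k - 1, k - 1), (0, 0)), mode='constant')
wide_windows = len(padded) - k + 1          # equals n + k - 1

print(narrow_windows, wide_windows)
```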

In computer vision, a picture is represented as a collection of pixels, each representing the color intensity of a particular point. When using an RGB color scheme, each pixel is a combination of three intensity values, one for each of the Red, Green, and Blue components. These are then stored in three different matrices. Each matrix provides a different “view” of the image, and is referred to as a **channel**.

When applying a convolution to an image in computer vision, it is common to apply a different set of filters to each channel, and then combine the three resulting vectors into a single vector. Taking the different-views-of-the-data metaphor, we can have multiple channels in text processing as well. For example, one channel will be the sequence of words, while another channel is the sequence of corresponding POS tags.

Applying the convolution over the words will result in $m$ vectors $p_{1:m}^w$, and applying it over the POS-tags will result in $m$ vectors $p_{1:m}^t$. These two views can then be combined either by summation $p_i = p_i^w + p_i^t$ or by concatenation $p_i = [p_i^w; p_i^t]$.
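
In terms of shapes, the two ways of combining channels look as follows. The random arrays are stand-ins for the real convolution outputs. Note that summation requires both channels to use the same number of filters $\ell$, while concatenation does not:

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 5, 4                         # toy sizes: 5 windows, 4 filters

p_words = rng.normal(size=(m, ell))   # stand-in for the word channel output
p_tags = rng.normal(size=(m, ell))    # stand-in for the POS-tag channel output

summed = p_words + p_tags                                  # (m, ell)
concatenated = np.concatenate([p_words, p_tags], axis=1)   # (m, 2 * ell)

print(summed.shape, concatenated.shape)
```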

Similarly to computer vision, we are not forced to shift each convolutional window by only one word forward at each step. We can define a **stride** of 2, 3, or more words, so that our filters overlap less or skip certain gaps if we have prior knowledge of them.

<img src="cnn-strided.png">
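
The effect of the stride on the extracted windows can be listed directly. In this sketch `windows` is a hypothetical helper and the window size is fixed at 3:

```python
tokens = ['I', 'like', 'this', 'movie', 'very', 'much', '!']

def windows(seq, k, stride=1):
    # Start a new k-item window every `stride` positions.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

for stride in (1, 2, 3):
    print(stride, [' '.join(w) for w in windows(tokens, k=3, stride=stride)])
```

With a stride of 1 there are five overlapping windows. A stride of 3 leaves two non-overlapping windows and never covers the final token.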

In a **dilated convolution architecture** [[Yu & Koltun, 2016](https://arxiv.org/pdf/1511.07122.pdf)], each layer in the hierarchy of convolution layers has a stride of $k - 1$. This allows the effective window size to grow exponentially with the number of layers.

<img src="cnn-dilated.png">
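
The growth of the effective window can be worked out with the usual receptive-field recurrence for stacked convolutions. The sketch below assumes every layer uses the same window size $k$ and follows the stride-based description given here. It is not an implementation of the convolution itself, and `effective_window` is a hypothetical helper:

```python
def effective_window(k, stride, num_layers):
    # size: input words seen by one unit of the current layer.
    # jump: distance in input words between adjacent units of that layer.
    size, jump = 1, 1
    sizes = []
    for _ in range(num_layers):
        size += (k - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

k = 3
print(effective_window(k, stride=1, num_layers=4))      # [3, 5, 7, 9]
print(effective_window(k, stride=k - 1, num_layers=4))  # [3, 7, 15, 31]
```

With a stride of 1 the window grows linearly with depth, while a stride of $k - 1$ roughly doubles it at every layer for $k = 3$.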

### References and Further Reading

1. [Yoav Goldberg - Neural Network Models for NLP [Morgan & Claypool, 2017]](https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037).
2. [Explosion.ai - Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models](https://explosion.ai/blog/deep-learning-formula-nlp).
3. [WildML - Understanding Convolutional Neural Networks for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/).