# Paper: A Convolutional Neural Network for Modelling Sentences

## Summary

This paper addresses the sentence modelling problem: representing the semantic content of a sentence for purposes such as classification. It discusses other approaches to the problem:
- Composition-based methods: vector representations of word meaning are used to create vectors for longer phrases.
- Logical-form methods: the meaning of a sentence is represented by automatically extracted logical forms (Zettlemoyer and Collins, 2005).
- Neural sentence models based on convolution, such as Time-Delay Neural Networks (TDNN).

The paper explains how the TDNN works, along with its advantages and disadvantages. The authors then introduce their invention, the Dynamic Convolutional Neural Network (DCNN). It is a new type of network that attempts to solve the shortcomings of the TDNN while maintaining its advantages. This is done by using two operators:
- Wide convolution is a type of convolution where the input is zero-padded with $M-1$ zeros on each side, where $M$ is the size of the filter
- Dynamic $k$-max pooling keeps the $k$ largest values instead of a single maximum

These two operators allow the network to:
- discriminate whether a specific $n$-gram occurs in an input sentence.
- tell the relative position of the most relevant $n$-grams



The network is tested on the following tasks:
- Predicting the sentiment (binary and multi-class) of movie reviews
- Categorisation of questions into six question types in the TREC dataset
- Predicting the sentiment of Twitter posts using distant supervision

A convolutional filter in the DCNN can be seen as a linguistic feature detector that learns during training to recognise a specific sequence of input words. These feature detectors can be visualised.

## One-dimensional Convolution

Suppose 
- $\mathbf{m} \in \mathbb{R}^M$ is a vector representing the weights of a convolutional filter of size $M$.
- $\mathbf{s} \in \mathbb{R}^S$ is a vector representing a sentence.

The one-dimensional convolution operation is defined as follows:

$$
c_j = \mathbf{m}^T \mathbf{s}_{j-M+1:j}
$$

where $c_j$ is the $j$th element of the vector $\mathbf{c}$.

### Narrow vs Wide Convolution

The range of the index $j$ gives rise to two types of convolutions:

- **Narrow:** the value of $j$ can range from $M$ to $S$. This requires the sentence vector to be at least as long as the filter: $S \geq M$. The resulting sequence has $S-M+1$ elements.

The figure below illustrates a narrow 1D convolution between a filter of size $M=3$ and a sentence vector of size $S=9$. The convolution yields the vector $\mathbf{c} \in \mathbb{R}^{7}$.

<img src="figures/p1/1d-conv-narrow.png" width="400" />










- **Wide:** the value of $j$ can range from $1$ to $S+M-1$, so the resulting sequence has $S+M-1$ elements. Out-of-range input values $s_i$ where $i < 1$ or $i > S$ are set to zero.

<img src="figures/p1/1d-conv-wide.png" width="500" />










Notice that the result of the narrow convolution is a subsequence of the result of the wide convolution.
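
The two variants can be sketched in NumPy as follows. This is a minimal illustration of the definition $c_j = \mathbf{m}^T \mathbf{s}_{j-M+1:j}$ as written in these notes (the filter is not flipped); the function names `narrow_conv1d` and `wide_conv1d` and the sentence values are assumptions made for the example, not code from the paper.

```python
import numpy as np

def narrow_conv1d(m, s):
    """Narrow convolution: c_j = m^T s_{j-M+1:j} for j = M..S (1-indexed)."""
    M, S = len(m), len(s)
    if S < M:
        raise ValueError("narrow convolution requires S >= M")
    # 0-indexed window i covers s[i], ..., s[i+M-1]; the filter is not flipped
    return np.array([m @ s[i:i + M] for i in range(S - M + 1)])

def wide_conv1d(m, s):
    """Wide convolution: j = 1..S+M-1, out-of-range inputs treated as zero."""
    M = len(m)
    padded = np.pad(s, M - 1)  # M-1 zeros on each side
    return narrow_conv1d(m, padded)

m = np.array([0.2, 0.1, 0.4])   # illustrative filter, M = 3
s = np.arange(1.0, 10.0)        # illustrative sentence vector, S = 9

narrow = narrow_conv1d(m, s)    # S - M + 1 = 7 elements
wide = wide_conv1d(m, s)        # S + M - 1 = 11 elements

# The narrow result is the middle part of the wide result
M = len(m)
assert np.allclose(wide[M - 1:len(wide) - (M - 1)], narrow)
```

Implementing the wide variant as "pad, then run the narrow variant" makes the subsequence relationship explicit: dropping the first and last $M-1$ outputs of the wide result recovers the narrow result.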

Suppose that the element $s_i \in \mathbb{R}$ of a sentence vector $\mathbf{s}$ corresponds to the $i$th word in the sentence.

Then the trained weights $\mathbf{m}$ in a filter can be seen as a linguistic feature detector that can recognise a specific class of $n$-grams. For example, let us assume we have trained the filter $\mathbf{m} = [0.2, 0.1, 0.4]^T$ to recognise the 3-gram *turtle loves shower*. This means that when the filter is convolved with a sentence containing the phrase *turtle loves shower*, it produces a high value at that position.

Wide convolution has two major advantages over narrow convolution:
- It ensures that all weights in the filter reach the entire sentence, including the words at the margins. This is particularly significant when the filter size $M$ is relatively large, such as $M=8$ or $M=10$.
- It guarantees that the convolution of a filter $\mathbf{m}$ with the input sentence $\mathbf{s}$ always produces a valid non-empty result $\mathbf{c}$, independently of the filter size $M$ and the sentence length $S$.
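
Both advantages can be checked by counting how many filter windows touch each word. The sketch below is only an illustration; the helper `window_coverage` is a hypothetical name introduced here, and the sizes are example values.

```python
import numpy as np

def window_coverage(S, M, wide):
    """Return (number of outputs, how many windows include each of the S words)."""
    pad = M - 1 if wide else 0
    counts = np.zeros(S + 2 * pad, dtype=int)
    n_windows = max(S + 2 * pad - M + 1, 0)  # 0 if the filter does not fit
    for i in range(n_windows):
        counts[i:i + M] += 1
    return n_windows, counts[pad:pad + S]

print(window_coverage(9, 3, wide=False))  # margin words are covered less often
print(window_coverage(9, 3, wide=True))   # every word is covered M = 3 times
print(window_coverage(2, 5, wide=False))  # S < M: the narrow result is empty
print(window_coverage(2, 5, wide=True))   # wide still yields S + M - 1 = 6 outputs
```

In the wide case every word is seen exactly $M$ times, once by each filter weight, whereas in the narrow case the first and last words are seen only once.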

## Maximum Time-Delay Neural Network (Max-TDNN)

The TDNN architecture was first proposed to classify <span title="one of the set of speech sounds in any given language that serve to distinguish one word from another" style="border-bottom: 1px dotted #000;">phonemes</span> in speech signals for automatic speech recognition.


<img src="figures/p1/time-delay-nn.png" width="477" />

Source: https://electroviees.wordpress.com/2013/08/06/speech-recoginition/




Max-TDNN model:
- Each sentence is represented as a $D \times S$ matrix where column $i$ of the matrix represents the $i$th word $\mathbf{w}_i \in \mathbb{R}^{D}$ in the sentence.
- Sentence lengths may vary.
- Each filter is of size $D \times M$. We can see this as $D$ one-dimensional filters of size $M$.
- Each row of the filter is then convolved with the corresponding row of the input.
- The (narrow) convolution yields a $D \times (S-M+1)$ matrix.
- Max pooling addresses the problem of varying sentence lengths. The Max-TDNN takes the maximum of each row in the resulting matrix yielding a vector of $D$ values, $\mathbf{c}_{max}$.
- The fixed-sized vector $\mathbf{c}_{max}$ is then used as input to a fully connected layer for classification.
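
The steps above, up to the fixed-size vector $\mathbf{c}_{max}$, can be sketched as follows. This is a minimal NumPy illustration under the definitions in these notes; `max_tdnn_features` is a hypothetical name, the input values are made up, and the final fully connected layer is omitted.

```python
import numpy as np

def max_tdnn_features(sentence, filt):
    """Row-wise narrow convolution followed by max pooling over each row.

    sentence: D x S matrix (one column per word); filt: D x M matrix.
    Returns c_max, a vector of D values, whatever the sentence length S.
    """
    D, S = sentence.shape
    D_f, M = filt.shape
    if D_f != D:
        raise ValueError("filter and sentence must have the same number of rows")
    if S < M:
        raise ValueError("narrow convolution requires S >= M")
    conv = np.array([
        [filt[d] @ sentence[d, i:i + M] for i in range(S - M + 1)]
        for d in range(D)
    ])  # D x (S - M + 1)
    return conv.max(axis=1)

sentence = np.array([[1.0, 2.0, 3.0, 4.0],
                     [0.0, 1.0, 0.0, -1.0]])   # D = 2, S = 4
filt = np.array([[1.0, 1.0],
                 [1.0, -1.0]])                  # D = 2, M = 2
c_max = max_tdnn_features(sentence, filt)       # array([7., 1.])

# Sentences of different lengths map to vectors of the same size D
rng = np.random.default_rng(0)
for S in (5, 12):
    assert max_tdnn_features(rng.normal(size=(2, S)), filt).shape == (2,)
```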

<img src="figures/p1/max-tdnn-example.png" width="800" />











Advantages of Max-TDNN:
- It is sensitive to the order of the words in the sentence
- It does not depend on external language-specific features such as dependency or constituency parse trees.
- It also gives largely uniform importance to the signal coming from each of the words in the sentence, with the exception of words at the margins that are considered fewer times in the computation of the narrow convolution. 

Limitations of Max-TDNN:
- Higher-order and long-range feature detectors cannot be easily incorporated into the model. The range of the feature detectors is limited to the span $M$ of the weights. Increasing $M$ or stacking multiple convolutional layers of the narrow type makes the range of the feature detectors larger; at the same time it also exacerbates the neglect of the margins of the sentence and increases the minimum size $S$ of the input sentence required by the convolution.
- Max pooling cannot distinguish whether a relevant feature in one of the rows occurs just once or multiple times, and it forgets the order in which the features occur.
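
The second limitation is easy to reproduce. In the sketch below, the activation values are invented for illustration: three different convolution results (two detectors firing in one order, in the reverse order, and one detector firing twice) all collapse to the same pooled vector.

```python
import numpy as np

def max_pool_rows(c):
    """Max pooling as in the Max-TDNN: one maximum per row."""
    return np.asarray(c).max(axis=1)

# Row 0 = detector A, row 1 = detector B; columns are sentence positions
a_then_b = np.array([[0.9, 0.1, 0.0, 0.1],
                     [0.0, 0.1, 0.8, 0.0]])
b_then_a = a_then_b[:, ::-1]   # same activations in reversed order
a_twice = a_then_b.copy()
a_twice[0, 3] = 0.9            # detector A now fires twice

pooled = [max_pool_rows(x) for x in (a_then_b, b_then_a, a_twice)]
# All three inputs give the same pooled vector
assert all(np.array_equal(pooled[0], p) for p in pooled)
```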

## Dynamic Convolutional Neural Network (DCNN)

DCNN attempts to solve the limitations of Max-TDNN while preserving the advantages.

<img src="figures/p1/dcnn-architecture-example.png" width="500" />














### Wide Convolution

DCNN uses wide one-dimensional convolutions in every layer.

### Dynamic $k$-Max Pooling

DCNN introduces the dynamic $k$-max pooling operator:
- Given a sequence of values $\mathbf{p} \in \mathbb{R}^{P}$ with $P \geq k$, the $k$-max pooling operator returns the subsequence of the $k$ largest values in that sequence.
- The order of the values corresponds to their original order in $\mathbf{p}$.
- It is a generalisation of the max pooling operator, which only returns the largest value.
- The pooling parameter $k$ can be dynamically altered by making $k$ a function of other aspects of the network or the input. One example is 
$$
k_l = \max\left( k_{top}, \left \lceil \frac{L-l}{L}S \right \rceil \right)
$$
where 
   - $l$ is the number of the current convolutional layer for which the pooling is applied, 
   - $L$ is the total number of convolutional layers, 
   - $k_{top}$ is a fixed pooling parameter for the topmost convolutional layer.
   - $S$ is the length of an input sentence
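
A minimal sketch of both pieces, assuming a one-dimensional input sequence: `k_max_pooling` and `dynamic_k` are hypothetical names, and the values $L=3$, $k_{top}=3$, $S=18$ are example settings chosen for illustration.

```python
import math
import numpy as np

def k_max_pooling(p, k):
    """Return the k largest values of p, kept in their original order."""
    p = np.asarray(p)
    if not 1 <= k <= len(p):
        raise ValueError("k must satisfy 1 <= k <= len(p)")
    top = np.argsort(p, kind="stable")[-k:]  # indices of the k largest values
    return p[np.sort(top)]                   # sort indices to restore the order

def dynamic_k(l, L, S, k_top):
    """k_l = max(k_top, ceil((L - l) / L * S))."""
    return max(k_top, math.ceil((L - l) * S / L))

print(k_max_pooling([3, 1, 5, 2, 4], k=3))                    # [3 5 4]
print([dynamic_k(l, L=3, S=18, k_top=3) for l in (1, 2, 3)])  # [12, 6, 3]
```

With these settings the number of pooled values shrinks layer by layer, from 12 to 6 to the fixed $k_{top} = 3$ at the topmost layer, so the final representation has the same size for every sentence length.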

<img src="figures/p1/k-max-pooling-example.png" width="900" />







Why?
- Allows us to extract the $k$ most active features in a sequence.
- It preserves the order of the active features, but is insensitive to their specific positions.
- It can also discern more finely the number of times the feature is highly activated in $\mathbf{p}$ and the progression by which the high activations of the feature change across $\mathbf{p}$
- The dynamic pooling parameter $k$ allows for a smooth extraction of higher-order and longer-range features.

### Non-linear Feature Function

After $k$-max pooling is applied to the result of a convolution, a bias $\mathbf{b} \in \mathbb{R}^D$ and a nonlinear function $g(\cdot)$ are applied component-wise to the pooled matrix.

There is a single bias value for each row of the pooled matrix.
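
In NumPy terms this is a broadcast of the bias vector over the columns of the pooled matrix. The sketch below is illustrative only; `feature_function` is a hypothetical name, and the choice of `tanh` for $g$ is an assumption made for the example.

```python
import numpy as np

def feature_function(pooled, b, g=np.tanh):
    """Add b[d] to every entry of row d of the pooled matrix, then apply g.

    pooled: D x k matrix; b: vector of D bias values (one per row).
    g defaults to tanh, which is an assumption made for this sketch.
    """
    pooled = np.asarray(pooled, dtype=float)
    b = np.asarray(b, dtype=float)
    if b.shape != (pooled.shape[0],):
        raise ValueError("b needs one bias value per row of the pooled matrix")
    return g(pooled + b[:, None])  # b[:, None] broadcasts across the columns

out = feature_function(np.zeros((2, 3)), b=[0.0, 1.0])
# Row 0 stays at tanh(0) = 0; every entry in row 1 becomes tanh(1)
```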

### Computation of Feature Maps

- After applying the convolution, $k$-max pooling and the non-linear function, we get a **feature map**.
- A feature map produced in the $i$-th convolutional layer is called an $i$-th order feature map and denoted as $\mathbf{F}^{i}$.
- Each filter (a $D \times M$ matrix) produces a feature map.
- Each convolutional layer can have $n$ filters and thus produce $n$ feature maps; $\mathbf{F}^{i}_1,\mathbf{F}^{i}_2,\cdots,\mathbf{F}^{i}_n$.
- A second-order feature map $\mathbf{F}^{2}$ can be computed by convolving a filter represented by $\mathbf{m}^{2}_k$ with each feature map $\mathbf{F}^{1}_k$ in the preceding layer and summing the results:
$$
\mathbf{F}^{2} = \sum_{k=1}^{n} \mathbf{m}^{2}_k * \mathbf{F}^{1}_k
$$
where 
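  - $n$ is the number of feature maps from the preceding layer
  - $*$ is the wide convolution
  - each $\mathbf{m}^{2}_k$ is a filter matrix; across all input and output feature maps, the filter weights of a layer form an order-4 tensor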
- In general, each feature map $\mathbf{F}^{i}_j$ can be computed:
$$
\mathbf{F}^{i}_j = \sum_{k=1}^{n} \mathbf{m}^{i}_{j,k} * \mathbf{F}^{i-1}_k
$$
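
The summation over the feature maps of the preceding layer can be sketched as follows. This is a minimal NumPy illustration of the convolution step only (pooling, bias and non-linearity are left out); the names `wide_conv_rows` and `next_layer_maps` and the array layout `(n_out, n_in, D, M)` are assumptions made for the example.

```python
import numpy as np

def wide_conv_rows(filt, fmap):
    """Row-wise wide convolution of a D x M filter with a D x S map."""
    D, M = filt.shape
    S = fmap.shape[1]
    padded = np.pad(fmap, ((0, 0), (M - 1, M - 1)))  # zero-pad columns only
    return np.array([
        [filt[d] @ padded[d, i:i + M] for i in range(S + M - 1)]
        for d in range(D)
    ])  # D x (S + M - 1)

def next_layer_maps(filters, prev_maps):
    """Compute sum_k m_{j,k} * F_k for every output map j.

    filters: order-4 array of shape (n_out, n_in, D, M)
    prev_maps: array of shape (n_in, D, S)
    Returns an array of shape (n_out, D, S + M - 1).
    """
    n_out, n_in = filters.shape[:2]
    return np.stack([
        sum(wide_conv_rows(filters[j, k], prev_maps[k]) for k in range(n_in))
        for j in range(n_out)
    ])

rng = np.random.default_rng(0)
filters = rng.normal(size=(2, 3, 4, 3))   # n_out=2, n_in=3, D=4, M=3
prev_maps = rng.normal(size=(3, 4, 5))    # n_in=3, D=4, S=5
out = next_layer_maps(filters, prev_maps) # shape (2, 4, 7)
```

Storing the weights as one four-dimensional array mirrors the order-4 tensor mentioned above: one $D \times M$ filter matrix for every pair of output map $j$ and input map $k$.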

### Folding

- After a convolution and before $k$-max pooling, folding sums every two adjacent rows in a feature map component-wise.
- For a feature map of $D$ rows, folding returns a feature map of $D/2$ rows, thus halving the size of the representation.
- With a folding layer, a feature detector of the $i$-th order now depends on two rows of feature values in the lower maps of order $i - 1$.
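
Folding amounts to adding the even-indexed and odd-indexed rows. A minimal sketch, where `fold` is a hypothetical name and an even number of rows $D$ is assumed:

```python
import numpy as np

def fold(fmap):
    """Sum every two adjacent rows component-wise: D x S -> D/2 x S."""
    fmap = np.asarray(fmap)
    if fmap.shape[0] % 2 != 0:
        raise ValueError("folding assumes an even number of rows")
    return fmap[0::2] + fmap[1::2]  # rows (0,1), (2,3), ... are summed

fmap = np.arange(8).reshape(4, 2)   # D = 4 rows, 2 columns
folded = fold(fmap)                 # [[2, 4], [10, 12]]
```

Folding adds no parameters; it only mixes pairs of rows that the row-wise convolution otherwise keeps independent.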

<img src="figures/p1/folding-example.png" width="800" />







## Properties of the DCNN

- Can discriminate whether a specific $n$-gram occurs in an input sentence. 
  - The filters of the wide convolution in the first layer can learn to recognise specific $n$-grams (where $n \leq M$ and $M$ is the size of the filter).
  - The filter size $M$ in the first layer is often set to a relatively large value, e.g. $M=10$.
- Can tell the relative position of the most relevant $n$-grams
  - The $k$-max pooling operation maintains the order and relative position of the $n$-grams recognised by the convolution layer


- The convolution and pooling layers induce an internal feature graph over the input.
  - Nodes that are not selected by the pooling operation at a layer are dropped from the graph.

<img src="figures/p1/induced-feature-graph.png" width="400" />









Figure 1: Subgraph of a feature graph induced over an input sentence in a Dynamic Convolutional Neural Network. The full induced graph has multiple subgraphs of this kind with a distinct set of edges; subgraphs may merge at different layers. The left diagram emphasises the pooled nodes. The width of the convolutional filters is 3 and 2 respectively. With dynamic pooling, a filter with small width at the higher layers can relate phrases far apart in the input sentence.

## Experiments

The network was tested in three different experiments:

**1)** The prediction of the sentiment of movie reviews in the Stanford Sentiment Treebank


<img src="figures/p1/experiment-1-results.png" width="300" />


**2)** The classification of questions to one of six question types (the TREC questions dataset):


<img src="figures/p1/experiment-2-results.png" width="400" />


**3)** Prediction of the sentiment of tweets with distant supervision. The training set is labelled automatically according to the emoticons that occur in the tweets, whereas the test set consists of about 400 hand-annotated tweets.


<img src="figures/p1/experiment-3-results.png" width="200" />


### Visualising Feature Detectors

A filter in the DCNN can be seen as a linguistic feature detector that learns during training to recognise a specific sequence of input words.

In the first layer, the sequence is an $n$-gram from the input sentence.

In higher layers, sequences can be made of multiple separate important $n$-grams.

The figure below illustrates the two feature detectors (positive or negative sentiment) in the first layer of a DCNN trained on the binary sentiment task (Experiment 1).

<img src="figures/p1/top-five-7-grams-part-1.png" width="600" />








The authors found detectors for particles such as "not" that negate sentiment and "too" that potentiate sentiment.

<img src="figures/p1/top-five-7-grams-part-2.png" width="600" />








They also found detectors for multiple other notable constructs:
 - all
 - or
 - with ... that
 - as ... as

The feature detectors learn to recognise not just single $n$-grams, but patterns within $n$-grams that have syntactic, semantic or structural significance.