# Natural Language Processing (almost) from Scratch

    Ronan Collobert, Jason Weston
    2 Mar 2011

https://arxiv.org/pdf/1103.0398.pdf

## The Benchmark Tasks
1. Part-Of-Speech Tagging
    1. POS aims at labeling each word with a unique tag that indicates its syntactic role, e.g. plural noun, adverb, ...
1. Chunking
    1. Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). 
    1. Each word is assigned only one unique tag, often encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP).
1. Named Entity Recognition
    1. NER labels atomic elements in the sentence into categories such as “PERSON” or “LOCATION”.
1. Semantic Role Labeling
    1. SRL aims at giving a semantic role to a syntactic constituent of a sentence.
    1. State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

## The Networks
![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/14550305.jpg)

![2](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/24344565.jpg)

We consider a neural network $f_{\theta}(\cdot)$ with parameters θ. Any feed-forward neural network with L layers can be seen as a composition of functions $f_{\theta}^l(\cdot)$, one for each layer l:

$f_{\theta}(\cdot)=f_{\theta}^L(f_{\theta}^{L-1}(\cdots f_{\theta}^1(\cdot)\cdots))$

For a matrix A, we denote by $[A]_{i,j}$ the coefficient at row i and column j of the matrix.

We also denote $<A>_i^{d_{win}}$ the vector obtained by concatenating the $d_{win}$ column vectors around the $i^{th}$ column vector of matrix $A \in \mathbb{R}^{d_1 \times d_2}$:

$[<A>_i^{d_{win}}]^T=([A]_{1,i-d_{win}/2},\cdots,[A]_{d_1,i-d_{win}/2},\cdots,[A]_{1,i+d_{win}/2},\cdots,[A]_{d_1,i+d_{win}/2})$

As a special case, $<A>_i^1$ represents the $i^{th}$ column of matrix A. For a vector v, we denote by $[v]_i$ the scalar at index i in the vector. Finally, a sequence of elements $\{x_1, x_2, \cdots , x_T\}$ is written $[x]_1^T$. The $i^{th}$ element of the sequence is $[x]_i$.
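
The window notation can be made concrete with a few lines of NumPy. This is only an illustrative sketch: the helper name `window_concat` is hypothetical, indices are 0-based (the text uses 1-based indices), and $d_{win}$ is assumed to be odd with the whole window lying inside the matrix.

```python
import numpy as np

def window_concat(A, i, d_win):
    """Concatenate the d_win columns of A centred on column i (0-based).

    Assumes d_win is odd and that the whole window lies inside A.
    """
    half = d_win // 2
    cols = A[:, i - half : i + half + 1]        # shape (d1, d_win)
    # Column-major flattening stacks the columns one after another,
    # which matches the ordering of the concatenated vector in the text.
    return cols.flatten(order="F")

A = np.arange(12).reshape(3, 4)                 # d1 = 3, d2 = 4
v = window_concat(A, 1, 3)                      # columns 0, 1 and 2 of A
```

With `d_win = 1` the helper simply returns column `i`, which corresponds to the special case $<A>_i^1$.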

### Transforming Words into Feature Vectors
For each word w ∈ D, an internal $d_{wrd}$-dimensional feature vector representation is given by the lookup table layer $LT_W(\cdot)$:

$LT_W(w)=<W>_w^1$

- $W \in \mathbb{R}^{d_{wrd} \times |D|}$ is a matrix of parameters to be learnt
- $<W>_w^1 \in \mathbb{R}^{d_{wrd}}$ is the $w^{th}$ column of W
- $d_{wrd}$ is the word vector size

Given a sentence or any sequence of T words $[w]_1^T$ in D, the lookup table layer applies the same operation for each word in the sequence, producing the following output matrix:

$$LT_W([w]_1^T)=(<W>_{[w]_1}^1\ <W>_{[w]_2}^1\ \cdots\ <W>_{[w]_T}^1)\ (1)$$
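
In code, the lookup table layer amounts to selecting columns of the parameter matrix. The sketch below assumes words have already been mapped to integer indices; the sizes and the random initialisation are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_wrd, vocab_size = 4, 6                     # toy sizes, chosen arbitrarily
W = rng.normal(size=(d_wrd, vocab_size))     # one column per dictionary word

def lookup(W, word_ids):
    """Return the d_wrd x T matrix whose t-th column is the vector of word t."""
    return W[:, word_ids]

sentence = [2, 0, 5]                         # word indices into the dictionary
X = lookup(W, sentence)                      # shape (d_wrd, T) = (4, 3)
```

Because the layer is plain indexing, only the columns of `W` that correspond to words present in the input would receive gradient during training.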

#### Extending to Any Discrete Features
Generally speaking, we can consider a word as represented by K discrete features $w \in D_1 \times \cdots \times D_K$ , where $D_k$ is the dictionary for the $k^{th}$ feature. We associate to each feature a lookup table $LT_{W^k}(·)$, with parameters $W^k \in \mathbb{R}^{d_{wrd}^k \times |D_k|}$ where $d_{wrd}^k \in \mathbb{N}$ is a user-specified vector size. Given a word w, a feature vector of dimension $d_{wrd} =  \sum_k d_{wrd}^k$ is then obtained by concatenating all lookup table outputs:

$$
LT_{W^1,\cdots,W^K}(w) =
\left[
\begin{matrix}
LT_{W^1}(w_1) \\ 
\vdots \\ 
LT_{W^K}(w_K) \\ 
\end{matrix}
\right] =
\left[
\begin{matrix}
<W^1>_{w_1}^1 \\ 
\vdots \\ 
<W^K>_{w_K}^1 \\ 
\end{matrix}
\right]
$$

The matrix output of the lookup table layer for a sequence of words $[w]_1^T$ is then similar to (1), but where extra rows have been added for each discrete feature:

$$
LT_{W^1,\cdots,W^K}([w]_1^T)=
\left[
\begin{matrix}
<W^1>_{[w_1]_1}^1 & \cdots & <W^1>_{[w_1]_T}^1 \\
\vdots & & \vdots \\
<W^K>_{[w_K]_1}^1 & \cdots & <W^K>_{[w_K]_T}^1 \\
\end{matrix}
\right]\ (2)
$$
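
A possible sketch of the multi-feature lookup table, again in NumPy. The two features used here (a word index and a capitalisation class) and all sizes are hypothetical choices made for the example; the function name `lookup_features` is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical discrete features: word identity and capitalisation class.
dict_sizes = [6, 3]                          # |D_1|, |D_2|
dims = [4, 2]                                # d_wrd^1, d_wrd^2
tables = [rng.normal(size=(d, n)) for d, n in zip(dims, dict_sizes)]

def lookup_features(tables, feature_ids):
    """feature_ids[k] holds the T indices of feature k.

    Each table is looked up separately and the results are stacked row-wise,
    giving a matrix with sum(dims) rows and T columns.
    """
    return np.vstack([Wk[:, ids] for Wk, ids in zip(tables, feature_ids)])

feature_ids = [[2, 0, 5], [1, 0, 0]]         # T = 3 words
X = lookup_features(tables, feature_ids)     # shape (4 + 2, 3)
```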

### Extracting Higher Level Features from Word Feature Vectors
#### Window Approach
A window approach assumes the tag of a word depends mainly on its neighboring words. Given a word to tag, we consider a fixed size $k_{sz}$ (a hyper-parameter) window of words around this word. Each word in the window is first passed through the lookup table layer (1) or (2), producing a matrix of word features of fixed size $d_{wrd} \times k_{sz}$.

More formally, the word feature window given by the first network layer can be written as:

$$
f_{\theta}^1=<LT_W([w]_1^T)>_t^{d_{win}} =
\left[
\begin{matrix}
<W>_{[w]_{t-d_{win}/2}}^1 \\
\vdots \\
<W>_{[w]_{t}}^1 \\
\vdots \\
<W>_{[w]_{t+d_{win}/2}}^1 \\
\end{matrix}
\right]\ (3)
$$

`Linear Layer`: The fixed size vector $f_{\theta}^1$ can be fed to one or several standard neural network layers which perform affine transformations over their inputs:

$$f_{\theta}^l=W^lf_{\theta}^{l-1}+b^l\ (4)$$

where $W^l \in \mathbb{R}^{n_{hu}^l \times n_{hu}^{l-1}}$ and $b^l \in \mathbb{R}^{n_{hu}^l}$ are the parameters to be trained. The hyper-parameter $n_{hu}^l$ is usually called the number of hidden units of the $l^{th}$ layer.

`HardTanh Layer`: Several linear layers are often stacked, interleaved with a non-linearity function, to extract highly non-linear features. The corresponding layer l applies a HardTanh over its input vector:

$$[f_{\theta}^l]_i = HardTanh([f_{\theta}^{l-1}]_i)$$

where

$$
HardTanh(x)=\left\{\begin{matrix}
 -1,& if\ x < -1 \\ 
 x, & if\ -1 \le x \le 1 \\
 1, & if\ x > 1
\end{matrix}\right.\ (5)
$$
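
Both layer types are short to write down in NumPy. The following is a minimal sketch of equations (4) and (5); the function names are chosen here for illustration.

```python
import numpy as np

def linear(W, b, x):
    """Affine layer of equation (4)."""
    return W @ x + b

def hardtanh(x):
    """Clamp each entry to [-1, 1], as in equation (5)."""
    return np.clip(x, -1.0, 1.0)

x = np.array([-2.5, -1.0, 0.3, 1.0, 4.0])
y = hardtanh(x)                              # entries outside [-1, 1] are clamped
```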

`Scoring`: Finally, the output size of the last layer L of our network is equal to the number of possible tags for the task of interest. Each output can then be interpreted as a score of the corresponding tag (given the input of the network).

> Remark 1 (Border Effects) The feature window (3) is not well defined for words near the beginning or the end of a sentence. To circumvent this problem, we augment the sentence with a special “PADDING” word replicated $d_{win}/2$ times at the beginning and the end. This is akin to the use of “start” and “stop” symbols in sequence models.
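
Putting the pieces together, a forward pass of the window approach might look like the sketch below. It assumes one hidden layer, untrained random weights, toy hyper-parameters, and a padding index `PAD` stored as an extra column of the lookup table; all of these are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_wrd, d_win, n_hu, n_tags = 5, 3, 8, 4      # toy hyper-parameters (arbitrary)
vocab_size = 10
PAD = vocab_size                             # hypothetical index reserved for "PADDING"

W = rng.normal(size=(d_wrd, vocab_size + 1)) # lookup table, incl. the padding column
W1, b1 = rng.normal(size=(n_hu, d_wrd * d_win)), np.zeros(n_hu)
W2, b2 = rng.normal(size=(n_tags, n_hu)), np.zeros(n_tags)

def window_scores(word_ids):
    """Return an (n_tags, T) matrix of tag scores, one column per word."""
    half = d_win // 2
    padded = [PAD] * half + list(word_ids) + [PAD] * half
    X = W[:, padded]                                  # (d_wrd, T + 2 * half)
    scores = []
    for t in range(len(word_ids)):
        # Window centred on word t; column-major flattening concatenates columns.
        f1 = X[:, t : t + d_win].flatten(order="F")   # equation (3)
        f2 = np.clip(W1 @ f1 + b1, -1.0, 1.0)         # linear layer, then HardTanh
        scores.append(W2 @ f2 + b2)                   # one score per tag
    return np.stack(scores, axis=1)

S = window_scores([3, 1, 4, 1, 5])                    # shape (n_tags, T) = (4, 5)
```

Since each column depends only on the words inside its window, two positions with identical windows receive identical scores.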

#### Sentence Approach
We will see in the experimental section that a window approach performs well for most natural language processing tasks we are interested in. However, this approach fails with SRL, where the tag of a word depends on a verb (or, more correctly, predicate) chosen beforehand in the sentence. If the verb falls outside the window, one cannot expect this word to be tagged correctly. In this particular case, tagging a word requires the consideration of the whole sentence.

`Convolutional Layer`: A convolutional layer can be seen as a generalization of a window approach: given a sequence represented by columns in a matrix $f_{\theta}^{l-1}$ (in our case, the lookup table matrix (1)), a matrix-vector operation as in (4) is applied to each successive window in the sequence. Using previous notations, the $t^{th}$ output column of the $l^{th}$ layer can be computed as:

$$<f_{\theta}^l>_t^1=W^l<f_{\theta}^{l-1}>_t^{d_{win}} + b^l\ \forall t,\ (6)$$

where the weight matrix $W^l$ is the same across all windows t in the sequence. Convolutional layers extract local features around each window of the given sequence. As for standard affine layers (4), convolutional layers are often stacked to extract higher level features.
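
A straightforward (unoptimised) sketch of equation (6) in NumPy follows. The input matrix is assumed to be already padded, and the function name `conv_layer` and the toy sizes are illustrative.

```python
import numpy as np

def conv_layer(F_prev, W, b, d_win):
    """Apply the same affine map to every window of d_win consecutive columns.

    F_prev: (n_in, T + d_win - 1) matrix, assumed to be already padded.
    W:      (n_out, n_in * d_win) weights, shared across all positions t.
    Returns an (n_out, T) matrix.
    """
    n_in, n_cols = F_prev.shape
    T = n_cols - d_win + 1
    out = np.empty((W.shape[0], T))
    for t in range(T):
        window = F_prev[:, t : t + d_win].flatten(order="F")
        out[:, t] = W @ window + b
    return out

rng = np.random.default_rng(0)
F_prev = rng.normal(size=(5, 9))     # 7 words plus one padding column on each side
W = rng.normal(size=(6, 5 * 3))
b = np.zeros(6)
F = conv_layer(F_prev, W, b, d_win=3)   # shape (6, 7)
```

The number of output columns grows with the sentence length, which is why a further step is needed before fixed-size affine layers can be applied.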

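`Max Layer`: The size of the output (6) depends on the number of words in the sentence, so the local feature vectors extracted by the convolutional layers must be combined into a global feature vector of fixed size before standard affine layers can be applied. Traditional convolutional networks often apply an average or a max operation over the positions t of the sequence. The average operation does not make much sense in our case, as in general most words in the sentence do not have any influence on the semantic role of a given word to tag. Instead, we use a max approach, which forces the network to capture the most useful local features produced by the convolutional layers for the task at hand (see Figure 3). Given a matrix $f_{\theta}^{l-1}$ output by a convolutional layer l − 1, the Max layer l outputs a vector $f_{\theta}^l$: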

$$[f_{\theta}^l]_i=\max_t [f_{\theta}^{l-1}]_{i,t}\quad 1 \le i \le n_{hu}^{l-1}\ (7)$$

This fixed-size global feature vector can then be fed to standard affine network layers (4). As in the window approach, we finally produce one score per possible tag for the given task.
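
The max over time in equation (7) is a single reduction along the position axis. The small example below, with made-up numbers, illustrates that sentences of different lengths yield vectors of the same size.

```python
import numpy as np

def max_over_time(F):
    """Equation (7): for each feature (row), keep its largest value over positions t."""
    return F.max(axis=1)

# Two made-up convolution outputs with 2 features and different sentence lengths.
short = np.array([[0.1, 0.9],
                  [-2.0, -0.5]])
longer = np.array([[0.1, 0.9, 0.3, 0.2],
                   [-2.0, -0.5, -0.7, -3.0]])
v_short, v_long = max_over_time(short), max_over_time(longer)
```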

![3](http://ou8qjsj0m.bkt.clouddn.com//17-8-8/80167925.jpg)

> Remark 2 The same border effects arise in the convolution operation (6) as in the window approach (3). We again work around this problem by padding the sentences with a special word.

#### Tagging Schemes
![4](http://ou8qjsj0m.bkt.clouddn.com//17-8-8/70965302.jpg)

We have decided to use the most expressive IOBES tagging scheme for all tasks. For instance, in the CHUNK task, we describe noun phrases using four different tags. Tag “S-NP” is used to mark a noun phrase containing a single word. Otherwise tags “B-NP”, “I-NP”, and “E-NP” are used to mark the first, intermediate and last words of the noun phrase. An additional tag “O” marks words that are not members of a chunk.
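
As an illustration of the scheme, the sketch below converts chunk spans into IOBES tags. The span representation `(start, end, label)` with an exclusive end, the function name, and the example annotation are all assumptions made for this example.

```python
def iobes_tags(n_words, chunks):
    """Build IOBES tags from chunks given as (start, end, label), end exclusive."""
    tags = ["O"] * n_words                   # words outside any chunk
    for start, end, label in chunks:
        if end - start == 1:
            tags[start] = f"S-{label}"       # single-word chunk
        else:
            tags[start] = f"B-{label}"       # first word
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"       # intermediate words
            tags[end - 1] = f"E-{label}"     # last word
    return tags

words = ["The", "quick", "brown", "fox", "jumped", "over", "it"]
chunks = [(0, 4, "NP"), (4, 5, "VP"), (6, 7, "NP")]   # hypothetical annotation
tags = iobes_tags(len(words), chunks)
```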

### Training
Let θ denote all the trainable parameters of the network, which are trained using a training set T. We want to maximize the following log-likelihood with respect to θ:

$$\theta \mapsto \sum_{(x,y) \in T} \log p(y|x,\theta)\ (8)$$

where x corresponds to either a training word window or a sentence and its associated features, and y represents the corresponding tag. The probability p(·) is computed from the outputs of the neural network. 

#### Word-Level Log-Likelihood
In this approach, each word in a sentence is considered independently. Given an input example x, the network with parameters θ outputs a score $[f_{\theta}(x)]_i$, for the $i^{th}$ tag with respect to the task of interest.

This score can be interpreted as a conditional tag probability p(i | x, θ) by applying a softmax (Bridle, 1990) operation over all the tags (writing $f_\theta$ for $f_\theta(x)$ to simplify notation):

$$p(i|x,\theta)=\frac{e^{[f_\theta]_i}}{\sum_j e^{[f_\theta]_j}}\ (9)$$

Defining the log-add operation as

$$logadd_i z_i=\log(\sum_i e^{z_i})\ (10)$$

we can express the log-likelihood for one training example (x, y) as follows:

$$\log p(y|x,\theta)=[f_\theta]_y - logadd_j [f_\theta]_j\ (11)$$
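
The word-level log-likelihood can be sketched as follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick added here as an assumption; it does not change the value of the log-add. The scores are made-up numbers.

```python
import numpy as np

def logadd(z):
    """Log-add of equation (10), computed stably by factoring out the maximum."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def word_log_likelihood(scores, y):
    """Equation (11): log p(y | x, theta) from the vector of tag scores."""
    return scores[y] - logadd(scores)

scores = np.array([2.0, 0.5, -1.0])          # made-up network outputs for 3 tags
log_probs = np.array([word_log_likelihood(scores, y) for y in range(3)])
```

Exponentiating `log_probs` recovers the softmax probabilities of equation (9), which sum to one.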

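This training criterion, often referred to as `cross-entropy`, is widely used for classification problems. It might not be ideal in our case, however, where the tag of a word is often correlated with the tags of its neighbors in the sentence.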

#### Sentence-Level Log-Likelihood
We consider the matrix of scores $f_{\theta}([x]_1^T)$ output by the network. As before, we drop the input $[x]_1^T$ for notation simplification. The element $[f_\theta]_{i,t}$ of the matrix is the score output by the network with parameters θ, for the sentence $[x]_1^T$ and for the $i^{th}$ tag, at the $t^{th}$ word.
