<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/Deep_learning_methods_for_sequence_labeling_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Table of contents


1. [Introduction to sequence labeling](#1)
    1. [Part-of-Speech Tagging](#1.1)
    2. [NER](#1.2)
    3. [Text chunking](#1.3)
    4. [Other sequence labeling tasks](#1.4)
2. [Traditional approaches](#2)
3. [Deep learning based methods](#3)
    1. [Distributed representations for input](#3.1)
        1. [Word-level representations](#3.1.1)
        2. [Character-level representations](#3.1.2)
        3. [Hybrid representations](#3.1.3)
    2. [Context encoder architectures](#3.2)
        1. [Convolutional neural networks](#3.2.1)
        2. [Recurrent neural networks](#3.2.2)
        3. [Transformers](#3.2.3)
    3. [Tag decoder](#3.3)
        1. [MLP + Softmax](#3.3.1)
        2. [Conditional random fields](#3.3.2)
        3. [Recurrent neural networks](#3.3.3)
        4. [Pointer networks](#3.3.4)
5. [References](#5)





# Introduction to sequence labeling <a name="1"></a>

A variety of NLP tasks can be formulated as general **sequence labeling** problems: given a sequence of tokens and a fixed set of labels, assign one label to each token in the sequence. The label set depends on the specific task. Classical examples include *Part-Of-Speech (POS) tagging*, *Named Entity Recognition* (NER), and *text chunking*.

## POS Tagging <a name="1.1"></a>

*Part-of-speech (POS) Tagging* aims at assigning a
correct part-of-speech tag to each word based on its context and definition. Examples of POS are nouns, pronouns, adjectives, verbs, adverbs, etc. Most POS are divided into sub-categories, for example, singular noun (`NN`), plural noun (`NNS`), the base form of a verb (`VB`), the past tense of a verb (`VBD`), etc.

![](https://i.ibb.co/QF7Q9Y5/POS-tagging.png)

**Evaluation metrics**

Part-of-speech tagging systems are usually evaluated by token-level *accuracy*: the ratio of the number of correctly classified instances to the total number of instances, which can be computed using the following equation.

$$\text{ACC}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$

where $\text{TP}$, $\text{TN}$, $\text{FP}$, and $\text{FN}$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
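
For multi-class tagging, the ratio above reduces to counting the tokens whose predicted tag matches the gold tag. A minimal sketch follows; the function name and the example sentence are illustrative assumptions, not part of any particular library.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical example sentence: "The cats sat quietly"
gold = ["DT", "NNS", "VBD", "RB"]
pred = ["DT", "NN", "VBD", "RB"]
print(token_accuracy(gold, pred))  # 3 of 4 tags match -> 0.75
```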

## Named Entity Recognition <a name="1.2"></a>

A **named entity** is a word or a phrase that identifies one item from a set of other items that have similar attributes. Examples of named entities are organization, person, and location names in the general domain, and gene, protein, drug, and disease names in the biomedical domain. **Named Entity Recognition (NER)** is the process of locating named entities in text and classifying them into predefined entity categories.

![](https://www.techslang.com/wp-content/uploads/2020/06/ad11-sample.png)

NER not only acts as a standalone tool for information extraction (IE) but also plays an essential role in a variety of NLP applications such as text understanding, information retrieval, automatic text summarization, question answering, machine translation, knowledge base construction, etc. 

**Evaluation metrics**

NER essentially involves two subtasks: *Boundary detection*
and *Type identification*. A correctly recognized instance requires a system to correctly identify its boundary and type, simultaneously. More specifically, the numbers of False Positives (FP), False Negatives (FN), and True Positives (TP) are used to compute Precision, Recall, and F-score.

- False Positive (FP): an entity that is returned by a NER
system but does not appear in the ground truth.

- False Negative (FN): an entity that is not returned by a
NER system but appears in the ground truth.

- True Positive (TP): an entity that is returned by a NER
system and also appears in the ground truth.

*Precision* refers to the percentage of entities returned by the system
that are correctly recognized.

$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$

*Recall* refers to the percentage of ground-truth entities that are correctly recognized by the system.

$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

A measure that combines precision and recall is their harmonic mean, the traditional F measure or balanced *F-score*. It is high only when both precision and recall are high:

$$\text{F-score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$

In addition, the macro-averaged F-score and micro-averaged F-score both consider the performance across multiple entity types. 

- *Macro-averaged F-score* independently calculates the F-score on different entity types, then takes the average of the F-scores. 

- *Micro-averaged F-score* sums up the individual false negatives, false positives, and true positives across all entity types, then computes precision, recall, and F-score from these totals. It can be heavily affected by the quality of recognizing entities in large classes in the corpus.
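
The definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: entities are represented as hypothetical `(start, end, type)` triples, a prediction counts as correct only on an exact match of boundary and type, and undefined ratios are set to `0.0`.

```python
from collections import defaultdict


def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


def evaluate_entities(gold, predicted):
    """Exact-match evaluation: a true positive needs matching boundaries and type."""
    gold, predicted = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for entity in predicted:
        counts[entity[2]]["tp" if entity in gold else "fp"] += 1
    for entity in gold - predicted:
        counts[entity[2]]["fn"] += 1

    # Macro: average the per-type F-scores. Micro: pool the counts first.
    per_type_f = {t: prf(c["tp"], c["fp"], c["fn"])[2] for t, c in counts.items()}
    macro_f = sum(per_type_f.values()) / len(per_type_f) if per_type_f else 0.0
    totals = [sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")]
    micro_f = prf(*totals)[2]
    return per_type_f, macro_f, micro_f


# Hypothetical entities as (start_token, end_token, type) triples.
gold = [(0, 1, "PER"), (4, 4, "LOC"), (7, 8, "ORG")]
pred = [(0, 1, "PER"), (4, 4, "ORG"), (7, 8, "ORG")]
per_type_f, macro_f, micro_f = evaluate_entities(gold, pred)
print(per_type_f, macro_f, micro_f)
```

In this toy example the span `(4, 4)` has the right boundary but the wrong type, so it counts as one false positive for `ORG` and one false negative for `LOC`.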

## Text chunking <a name="1.3"></a>

*Text chunking* aims to divide the text into syntactically related, non-overlapping groups of words such as noun phrases (`NP`), verb phrases (`VP`), prepositional phrases (`PP`), etc. The task can be regarded as a sequence labeling problem that assigns specific labels to words in sentences.

![](https://i.ibb.co/M1xRHjW/text-chunking.png)

The smaller boxes show the word-level tokenization and POS tagging, while the large boxes show higher-level chunking. Chunking usually selects a subset of the tokens. Like tokenization, the pieces produced by a chunker do not overlap in the source text.

The F1-score is usually adopted as the evaluation metric of chunking.

## Other sequence labeling tasks <a name="1.4"></a>

There have been many explorations into applying the sequence labeling framework to address other NLP tasks such as dependency parsing, semantic role labeling, answer selection, document summarization, emotion detection in dialogues, etc.

# Traditional Approaches <a name="2"></a>

Traditional approaches to sequence labeling are broadly classified into three main streams.

- *Rule-based systems* rely on hand-crafted rules.

- *Unsupervised Learning Approaches*. Clustering-based sequence labeling systems extract labels from the clustered groups based on context similarity. The key idea is that lexical resources, lexical patterns, and statistics computed on a large corpus can be used to infer the labels.

- *Feature-based Supervised Learning Approaches*. Given annotated data samples, features are carefully designed to represent each training example. Machine learning algorithms are then used to learn a model to recognize similar patterns from unseen data. Common statistical machine learning techniques include Hidden Markov Models (HMM), Support Vector Machines (SVM), Maximum Entropy Models, and Conditional Random Fields (CRF).

# Deep learning based methods <a name="3"></a>

Compared to feature-based approaches, deep learning is beneficial in discovering hidden features automatically. There are three core strengths of applying deep learning techniques to sequence labeling tasks. 

- First, systems benefit from the non-linear transformation, which generates non-linear mappings from input to output. Compared with linear models, DL-based models can learn complex and intricate features from data via non-linear activation functions. 

- Second, deep learning saves significant effort on designing features. Traditional feature-based approaches require a considerable amount of engineering skill and domain expertise. DL-based models, on the other hand, are effective in automatically learning useful representations and underlying factors from raw data.

- Third, deep neural sequence models can be trained in an end-to-end paradigm, by gradient descent.


DL-based models for sequence labeling can be split into three stages (the figure below depicts these stages for a NER model).

![](https://i.ibb.co/p1TDQJR/DL-NER.png)


- *Distributed representations for input* or the *embedding module* maps words into their distributed vector representations. In this stage, we can also incorporate additional features, such as POS tags and gazetteers, that have been effective in feature-based approaches.

- The *Context encoder* extracts context features using CNN, RNN, or other networks.

- The *Tag decoder or inference module* predicts tags or labels for tokens in the input sequence. For NER, for instance, in the figure above each token is predicted with a tag indicated by `B`-(begin), `I`-(inside), `E`-(end), or `S`-(singleton) of a named entity with its type, or `O`-(outside) of named entities. The tag decoder may also be trained to detect boundaries. For NER, for example, it can detect entity boundaries first, and the detected text spans are then classified into entity types.
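
To make the tagging scheme in the last bullet concrete, the sketch below converts entity spans into per-token tags using the begin/inside/end/singleton/outside convention. The function name, the span format (inclusive token indices), and the example sentence are assumptions made for illustration.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert non-overlapping entity spans into one BIOES tag per token.

    `spans` holds (start, end, type) triples with inclusive token indices.
    """
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        if start == end:
            # A one-token entity is a singleton.
            tags[start] = f"S-{entity_type}"
            continue
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
        tags[end] = f"E-{entity_type}"
    return tags


# Hypothetical example sentence and annotation.
tokens = ["Michael", "Jeffrey", "Jordan", "was", "born", "in", "Brooklyn"]
spans = [(0, 2, "PER"), (6, 6, "LOC")]
for token, tag in zip(tokens, spans_to_bioes(len(tokens), spans)):
    print(f"{token}\t{tag}")
```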

### Distributed representations for input <a name="3.1"></a>

The embedding module maps words into their distributed
representations as the initial input of the model. A distributed representation encodes each word as a low-dimensional, real-valued dense vector in which each dimension represents a latent feature. Automatically learned from text, distributed representations *capture semantic and syntactic* properties of words. In addition to pretrained word embeddings, character-level representations, hand-crafted features, and sentence-level representations can also be part of the embedding module, supplementing features for the initial input from different perspectives.


Next, we review three types of distributed representations that have been used in sequence labeling models: word-level, character-level, and hybrid representations.

#### Word-level representation <a name="3.1.1"></a>

Some studies employ word-level representations, typically pre-trained over large corpora through unsupervised algorithms such as word2vec. When used as input, the pre-trained word embeddings can be either kept fixed or further fine-tuned during model training. Commonly used word embeddings include [Google Word2Vec](https://arxiv.org/abs/1301.3781), [Stanford GloVe](https://nlp.stanford.edu/projects/glove/) and [Facebook fastText](https://fasttext.cc/).

Collobert et al. proposed the [*SENNA*](https://arxiv.org/abs/1103.0398) architecture in 2011, which pioneered the idea of solving natural language processing tasks from the perspective of neural language models and also included a method for constructing pretrained word embeddings. Many subsequent works adopt SENNA word embeddings as the initial input of their sequence labeling models.

[Lample et al.](https://arxiv.org/abs/1603.01360) apply
[skip-n-gram](https://www.aclweb.org/anthology/D15-1161/), a variation of word2vec that accounts for word order, to pretrain their word embeddings. Strubell et al. proposed a tagging scheme based on Iterated Dilated Convolutional Neural Networks ([ID-CNNs](https://arxiv.org/abs/1702.02098)) using 100-dimensional embeddings trained by skip-n-gram.

**Contextual word embeddings**

The pretrained word embedding methods above generate a single *context-independent* vector for each word and therefore cannot model polysemy. Recently, many approaches for learning *contextual word representations* have been proposed, where bidirectional language models are trained on a large unlabeled corpus and the corresponding internal states are used to produce a word representation. The representation of each word is dependent on its context, meaning that the same word has different embeddings depending on its contextual use.

[Peters et al.](https://arxiv.org/abs/1705.00108) propose pretrained contextual embeddings from bidirectional language models with character convolutions and add them to a sequence labeling model, achieving strong performance on NER and chunking.
Peters et al. extend their method by introducing
[ELMo](https://arxiv.org/abs/1802.05365) (Embeddings from Language Models), which significantly improves the performance across a broad range of diverse NLP tasks.

Akbik et al. propose a similar method to generate contextual word embeddings ([Contextual String Embeddings](https://www.aclweb.org/anthology/C18-1139/)) by adopting a bidirectional character-aware language model. The figure below illustrates the architecture of extracting a contextual string embedding for the word “Washington” in a sentential context. From the forward language model (red), the model extracts the output hidden state after the last character in the word. From the backward language model (blue), the model extracts the output hidden state before the first character in the word. Both output hidden states are concatenated to form the final embedding of a word.

![](https://i.ibb.co/HBckC4x/contextual-string-embedding.png)

Devlin et al. propose a pretrained language representation model called [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers). It obtained new state-of-the-art results on eleven NLP tasks at the time of publication. The core idea of BERT is to pretrain deep bidirectional representations by jointly conditioning on both left and right context in all layers. Although sequence labeling tasks can be addressed by fine-tuning the existing pretrained BERT model, the output hidden states of BERT can also be taken as additional word embeddings to improve the performance of sequence labeling models.

A notebook explaining recent language model techniques for transfer learning, including ELMo and BERT, is available [here](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Transfer_learning_in_NLP.ipynb).

Some studies ([Improving NER with Gazetteers](https://arxiv.org/abs/2003.03072), [Dependency-Guided LSTM-CRF for NER](https://arxiv.org/abs/1909.10148), [Multi-Grained NER](https://arxiv.org/abs/1906.08449), [Hierarchical Contextualized Representation for NER](https://arxiv.org/abs/1911.02257), [NAS for LM and NER](https://www.aclweb.org/anthology/D19-1367/)) have achieved promising performance by combining traditional embeddings with language model embeddings for sequence labeling tasks.

In addition to such context modeling, a recent work by He et al. ([KAWR for NER](https://ojs.aaai.org/index.php/AAAI/article/view/6299)) provides a new kind of word embedding that is both context-aware and knowledge-aware, encoding prior knowledge of entities from an external knowledge base. The proposed knowledge-graph augmented word representations significantly improve the performance of NER in various domains.


#### Character-level representation <a name="3.1.2"></a>

Many studies learn character-level representations of words through neural networks and incorporate them into the embedding module. These representations exploit useful **intra-word information** such as prefixes and suffixes, and can also tackle the **out-of-vocabulary** word problem.

There are two widely-used architectures for extracting character-level representation: *CNN-based* and *RNN-based models*.

![](https://i.ibb.co/XDWHdsF/CNN-and-RNN-based-char-embeddings.png)

**CNN-based models**

Santos and Zadrozny initially propose [Character-level Representations for POS Tagging](http://proceedings.mlr.press/v32/santos14.html). They use CNNs to learn character-level representations of words for sequence labeling, an approach followed by many subsequent works. The approach applies a convolutional operation to the sequence of character embeddings and produces local features around each character. Then a fixed-size character-level embedding of the word is extracted by taking the max over all character windows. The process is depicted in the figure below.

![](https://i.ibb.co/C5pftGx/CNN-character-feature-extractor.png)
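
The convolution followed by a max over all character windows can be sketched in plain Python. This is a simplified illustration rather than the authors' implementation: the weights are random and untrained, bias terms are omitted, zero padding at the word edges is an assumption, and all names and dimensions are hypothetical.

```python
import random


def char_cnn_embedding(word, char_vectors, filters, window=3):
    """Convolve over character windows, then max-pool each filter over all windows."""
    if not word:
        raise ValueError("word must contain at least one character")
    dim = len(next(iter(char_vectors.values())))
    pad = [[0.0] * dim] * (window // 2)  # zero padding on both sides (assumption)
    chars = pad + [char_vectors[c] for c in word] + pad
    pooled = [float("-inf")] * len(filters)
    for start in range(len(chars) - window + 1):
        # Flatten the window of character embeddings into one vector.
        flat = [x for vec in chars[start:start + window] for x in vec]
        for k, weights in enumerate(filters):
            activation = sum(w * x for w, x in zip(weights, flat))
            pooled[k] = max(pooled[k], activation)
    return pooled


rng = random.Random(0)
char_dim, num_filters, window = 4, 5, 3
char_vectors = {c: [rng.uniform(-1, 1) for _ in range(char_dim)]
                for c in "abcdefghijklmnopqrstuvwxyz"}
filters = [[rng.uniform(-1, 1) for _ in range(char_dim * window)]
           for _ in range(num_filters)]

short_word = char_cnn_embedding("cat", char_vectors, filters, window)
long_word = char_cnn_embedding("categorization", char_vectors, filters, window)
print(len(short_word), len(long_word))  # both equal num_filters
```

Because the max is taken per filter across all windows, the output size depends only on the number of filters, not on the word length.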

[Li et al.](https://www.aclweb.org/anthology/D17-1282/) applied a series of convolutional and highway layers to generate character-level representations for words. The final embeddings of words are fed into a bidirectional recursive network. 

Yang et al. proposed a [Neural reranking model for NER](https://arxiv.org/abs/1707.05127), where a convolutional layer with a fixed window-size is used on top of a character embedding layer. 

Xin et al. propose [IntNet](https://arxiv.org/abs/1810.12443), a funnel-shaped wide convolutional network to learn character-level representations for sequence labeling. Unlike previous CNN-based character embedding approaches, this method designs a convolutional block that comprises several consecutive operations and uses multiple convolutional layers in which feature maps are concatenated in every other layer. This helps the network capture different levels of features and learn a better internal structure of words, achieving significant improvements over other character embedding models and obtaining state-of-the-art performance on various sequence labeling datasets.

**RNN-based models**

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two typical choices of the basic units.

Ling et al. propose a compositional character to word ([C2W](https://arxiv.org/abs/1508.02096)) model that uses bidirectional LSTMs (Bi-LSTM) to build word embeddings by taking the characters as atomic units. A forward and a backward LSTM process the character embedding sequence of a word in direct and reverse order. The representation for a word derived from its characters is obtained by combining the final states of the Bi-LSTM. The C2W model yields excellent results in language modeling and part-of-speech tagging, and many works follow it in applying Bi-LSTMs to obtain character-level representations for sequence labeling. The method is illustrated in the figure below.

![](https://i.ibb.co/VxnMbhL/lexical-composition-model.png)
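
The idea of reading the characters in both directions and combining the two final states can be sketched as follows. To keep the example short, a simple tanh recurrent cell is swapped in for the LSTM, the final states are combined by plain concatenation, and the weights are random and untrained; all names and sizes are assumptions for illustration.

```python
import math
import random


def rnn_final_state(inputs, w_in, w_rec):
    """Run a simple tanh RNN over `inputs` and return the last hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in inputs:
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(w_in[j], x))
                      + sum(w * hi for w, hi in zip(w_rec[j], hidden)))
            for j in range(len(w_rec))
        ]
    return hidden


def char_birnn_embedding(word, char_vectors, fwd, bwd):
    """Concatenate the final states of a forward and a backward pass."""
    chars = [char_vectors[c] for c in word]
    return rnn_final_state(chars, *fwd) + rnn_final_state(chars[::-1], *bwd)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
char_dim, hidden_dim = 4, 3
char_vectors = {c: [rng.uniform(-1, 1) for _ in range(char_dim)]
                for c in "abcdefghijklmnopqrstuvwxyz"}
# Each direction has its own (input weights, recurrent weights) pair.
fwd = (random_matrix(hidden_dim, char_dim, rng), random_matrix(hidden_dim, hidden_dim, rng))
bwd = (random_matrix(hidden_dim, char_dim, rng), random_matrix(hidden_dim, hidden_dim, rng))

embedding = char_birnn_embedding("tagging", char_vectors, fwd, bwd)
print(len(embedding))  # 2 * hidden_dim
```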

[Dozat et al.](https://www.aclweb.org/anthology/K18-2001/) propose an RNN-based character-level model for POS tagging in which the character embedding sequence of each word is fed into a unidirectional LSTM followed by an attention mechanism.

[Bohnet et al.](https://arxiv.org/abs/1805.08237) propose a novel sentence-level character model for learning context-sensitive character-based representations of words. This method feeds all characters of a sentence into a Bi-LSTM layer and concatenates the forward and backward output vectors of the first and last characters in the word to form its final character-level representation. This allows context information to be incorporated in the initial word embeddings before flowing into the context encoder module. Similarly, [Liu et al.](https://arxiv.org/abs/1709.04109) adopt a character-level BiLSTM that processes all characters of a sentence. However, their proposed model focuses on extracting knowledge from raw texts by leveraging a neural language model to predict the next and previous word at word boundaries, effectively extracting character-level information.

#### Hybrid Representation <a name="3.1.3"></a>

Besides word-level and character-level representations,
some studies also incorporate additional information (e.g.,
gazetteers, lexical similarity, linguistic dependency and visual features) into the final representations of words before feeding them into context encoding layers. In other words, the DL-based representation is combined with a feature-based approach. Adding additional information may lead to improvements in performance, at the price of hurting the generality of these systems.

- [Collobert et al.](https://arxiv.org/abs/1103.0398) utilize word suffix, gazetteer and capitalization features as well
as cascading features that include tags from related tasks.

- In the [BiLSTM-CRF model](https://arxiv.org/abs/1508.01991) by Huang et al., four types of features are
used for the NER task: spelling features (word prefix and suffix features, capitalization feature, etc), context features (unigram, bi-gram and tri-gram features),
word embeddings, and gazetteer features. 

- The [BiLSTM-CNN](https://arxiv.org/abs/1511.08308) model by Chiu and Nichols incorporates a bidirectional LSTM and a
character-level CNN. Besides word embeddings, the model
uses additional word-level features (capitalization, lexicons)
and character-level features (4-dimensional vector representing the type of a character: upper case, lower case,
punctuation, other).

- [Wei et al.](https://core.ac.uk/download/pdf/82898367.pdf) presented a CRF-based neural system for
recognizing and normalizing disease names. This system employs rich features in addition to word embeddings,
including words, POS tags, chunking, and word shape features (e.g., dictionary and morphological features). 

- The [ID-CNNs](https://arxiv.org/abs/1702.02098) proposed by Strubell et al. concatenated 100-dimensional embeddings with
a 5-dimensional word shape vector (e.g., all capitalized,
not capitalized, first-letter capitalized or contains a capital letter). 

- [Lin et al.](https://www.aclweb.org/anthology/W17-4421/) concatenated character-level representation, word-level representation, and syntactical word representation (i.e., POS tags, dependency roles, word positions, head positions) to form a comprehensive word representation. 

- A multi-task approach for NER was proposed by
[Aguilar et al.](https://arxiv.org/abs/1906.04135), which uses a CNN to capture orthographic features and word shapes at character level. For syntactical and contextual information at word level, e.g., POS and word embeddings, the model implements an LSTM architecture.

- [Xu et al.](https://www.aclweb.org/anthology/P17-1114/) proposed a local detection approach for NER based on fixed-size ordinally forgetting encoding (FOFE). FOFE explores both character-level and word-level representations for each fragment and its contexts.

- In the [Multi-modal NER system](https://arxiv.org/abs/1802.07862) by Moon et al., for noisy user-generated data like tweets and Snapchat captions, word embeddings, character embeddings, and visual features are merged with modality attention. 

- [BERT](https://arxiv.org/abs/1810.04805) uses masked language models to enable pre-trained deep bidirectional representations. For a given token, its input representation is constructed by summing the corresponding position, segment and token embeddings. The pre-trained language model embeddings incorporate auxiliary embeddings (e.g., position and segment embeddings), so these contextualized language-model embeddings can be seen as hybrid representations.

- [Wu et al.](https://arxiv.org/abs/1808.09075) propose a hybrid neural model which incorporates a feature auto-encoder loss component to utilize hand-crafted features (part-of-speech tags, word shapes and gazetteers). It significantly outperforms existing competitive
models on the task of NER. However, designing such features
for low-resource languages is challenging, because gazetteers
in these languages are absent. To address this problem, [Rijhwani et al.](https://arxiv.org/abs/2005.01866) propose a method of *soft gazetteers* that incorporates information from English knowledge bases through cross-lingual entity linking and creates continuous-valued gazetteer features for low-resource languages.

- [Ghaddar and Langlais](https://arxiv.org/abs/1806.03489)  propose an alternative lexical representation (called *Lexical Similarity*) indicating that robust lexical features are useful and can benefit DNN architectures. The method first
embeds words and named entity types into a joint low-dimensional vector space, which is trained from a Wikipedia corpus annotated with 120 fine-grained entity types. Then a feature vector (i.e., LS vector) for each
word is computed offline, where each dimension encodes the
similarity of the word embedding with the embedding of an
entity type. The LS vectors are finally incorporated into the embedding module of their neural NER model.
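
As a small illustration of the hand-crafted features listed above, the sketch below builds the 4-dimensional character-type vector mentioned for the BiLSTM-CNN model (upper case, lower case, punctuation, other). The exact rule for deciding what counts as punctuation is an assumption (Python's `string.punctuation`), and the function name is hypothetical.

```python
import string

CHAR_TYPES = ("upper", "lower", "punctuation", "other")


def char_type_vector(ch):
    """One-hot vector over the four character types."""
    if ch.isupper():
        kind = "upper"
    elif ch.islower():
        kind = "lower"
    elif ch in string.punctuation:
        kind = "punctuation"
    else:
        kind = "other"
    return [1 if kind == t else 0 for t in CHAR_TYPES]


for ch in "U.S. 2021":
    print(repr(ch), char_type_vector(ch))
```

In a hybrid representation, such a vector would typically be concatenated with the learned character embedding before the character-level encoder.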

Existing research has shown that global contextual information from the entire sentence is useful for modeling sequences. Some recent works, such as [GCDT](https://arxiv.org/abs/1906.02437) by Liu et al. and [Hierarchical Contextualized Representation for NER](https://arxiv.org/abs/1911.02257) by Luo et al., have introduced *sentence-level representations* into the embedding module in addition to pretrained word embeddings and character-level representations. They demonstrate that adding sentence representations in the embedding module benefits the final performance of sequence labeling tasks.

### Context Encoder Architectures <a name="3.2"></a>

The *context encoder* module extracts contextual features of each token and captures the context dependencies of the given input sequence. The learned contextual representations are passed into the inference module for label prediction.

There are three commonly used model architectures for the context encoder module: CNNs, RNNs, and Transformers.

#### Convolutional Neural Networks <a name="3.2.1"></a>

Convolutional Neural Networks (CNNs) are a popular architecture for encoding context information in sequence labeling models. Compared to RNNs, CNN-based methods are considerably faster since they can fully leverage GPU parallelism through their feed-forward structure.

An initial work in this area was proposed by [Collobert et al.](https://arxiv.org/abs/1103.0398). The method employs a simple feed-forward neural network with a fixed-size sliding window over the input sequence embedding, which can be viewed as a simplified CNN without a pooling layer. This window approach is based on the assumption that the label of a word depends mainly on its neighbors.
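
The window approach can be pictured as building, for every token, a fixed-size list of its neighbors that is then fed to the feed-forward network. The sketch below only shows the window construction; the padding token, the function name, and the example sentence are assumptions for illustration.

```python
def context_windows(tokens, window_size=5, pad_token="<PAD>"):
    """Return one fixed-size window of neighbors centered on each token."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so each window has a center")
    half = window_size // 2
    # Pad both ends so that the first and last tokens also get full windows.
    padded = [pad_token] * half + list(tokens) + [pad_token] * half
    return [padded[i:i + window_size] for i in range(len(tokens))]


tokens = ["EU", "rejects", "German", "call"]
for token, window in zip(tokens, context_windows(tokens, window_size=3)):
    print(token, window)
```

In a full model, the embeddings of the words in each window would be concatenated into one vector and passed through the network to score the tags of the center word.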

![](https://i.ibb.co/z4cpxjk/sentence-network-CNN-NER.png)

Each word in the input sequence is embedded into an $N$-dimensional vector after the stage of input representation. Then a convolutional layer is used to produce local features around each word, and the size of the output of the convolutional layers depends on the number of words in the sentence. The global feature vector is constructed by combining local feature vectors extracted by the convolutional layers. The dimension of the global feature vector is fixed, independent of the sentence length, in order to apply subsequent standard affine layers. Two approaches are widely used to extract global features: a max or an averaging operation over the position (i.e., “time” step) in the sentence. Finally, these fixed-size global features are fed into the tag decoder to compute distribution scores for all possible tags for the words in the network input.

Following Collobert’s work, Yao et al. proposed
[Bio-NER](https://gvpress.com/journals/IJHIT/vol8_no8/29.pdf) for biomedical NER. 

Zhou et al. observed that with RNNs, later words influence the final sentence representation more than earlier words. However, important words may appear anywhere in a sentence. In their proposed model, named [BLSTMRE](https://tianjun.me/static/essay_resources/RelationExtraction/Paper/Joint-Entity-and-Relation-Extraction-Based-on.pdf), a BLSTM is used to capture long-term dependencies and obtain the whole representation of an input sequence. A CNN is then used to learn a high-level representation, which is fed into a sigmoid classifier. Finally, the whole sentence representation (generated by the BLSTM) and the relation representation (generated by the sigmoid classifier) are fed into another LSTM to predict entities.

Shen et al. propose [Deep active learning for NER](https://arxiv.org/abs/1707.05928). Their tagging model extracts context representations for each word using a CNN due to its strong efficiency, which is crucial for their iterative retraining scheme. The structure has two convolutional layers with kernels of width three, and it concatenates the representation at the last convolutional layer with the input embedding to form the output.

Wang et al. propose [Gated Convolutional Neural Networks (GCNN) for NER](https://www.researchgate.net/publication/320247182_Named_Entity_Recognition_with_Gated_Convolutional_Neural_Networks), which
extend the convolutional layer with a gating mechanism.


Despite their high efficiency, CNNs have difficulty capturing long-range dependencies in sequences due to their limited receptive fields. In recent years, some CNN-based models have modified traditional CNNs to better capture global context information.

Strubell et al. propose the Iterated Dilated Convolutional Neural Networks ([ID-CNNs](https://arxiv.org/abs/1702.02098)) method for NER, which is more computationally efficient due to its capacity for handling larger context and structured prediction. The figure below shows the architecture of a dilated CNN block, where four stacked dilated convolutions of width 3 produce token representations. Experimental results show that ID-CNNs achieve 14-20x test-time speedups compared to Bi-LSTM-CRF while retaining comparable accuracy.

![](https://i.ibb.co/4Z2k0Ls/ID-CNN.png)
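
A quick way to see why dilation helps with the limited receptive field is to count how many input tokens a single output position can see. The sketch below uses the standard receptive-field formula for stacked convolutions with stride 1; the dilation schedules in the example are assumptions chosen for illustration and are not taken from the paper.

```python
def receptive_field(filter_width, dilations):
    """Input tokens seen by one output position of stacked dilated convolutions."""
    return 1 + sum((filter_width - 1) * d for d in dilations)


print(receptive_field(3, [1, 1, 1, 1]))  # 9: four plain convolutions of width 3
print(receptive_field(3, [1, 2, 4, 8]))  # 31: dilation doubled at each layer
```

With the same number of layers and parameters, doubling the dilation makes the covered context grow exponentially with depth instead of linearly.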

[Chen et al.](https://arxiv.org/abs/1907.05611) propose a gated relation network (GRN) for NER, in which a gated relation layer that models the relationship between any two words is built on top of CNNs for capturing long-range context information. It achieves significantly better performance than ID-CNN, owing to its stronger capacity to capture global context dependencies.

#### Recurrent Neural Networks <a name="3.2.2"></a>

Recurrent neural networks, together with their variants such
as the gated recurrent unit (GRU) and long short-term memory
(LSTM), have demonstrated remarkable achievements in
modeling sequential data. In particular, bidirectional RNNs
incorporate past and future contexts from both directions (forward/backward) to generate the hidden states of each token, and then concatenate them to represent the global information of the entire sequence. Thus, a token encoded by a bidirectional RNN contains evidence from the whole input sentence.

Bidirectional RNNs have therefore become the de facto standard
for composing deep context-dependent representations of
text. A typical architecture of an RNN-based context encoder is shown in the figure below.

![](https://i.ibb.co/YRF80pM/RNN-context-encoder.png)

The [BiLSTM-CRF model](https://arxiv.org/abs/1508.01991) by Huang et al. is among the first to use a Bi-LSTM architecture to generate contextual representations of every
word in a sequence labeling model, and it produced state-of-the-art accuracy on POS tagging, chunking, and NER tasks. Following this work, a body of works applied BiLSTM as the basic architecture to encode sequence context information.

[Yang et al.](https://arxiv.org/abs/1603.06270) employed deep GRUs on both character and word levels to encode morphology and context information. They further extended their model to cross-lingual and multi-task joint training by sharing the architecture and parameters.



[Rei](https://arxiv.org/abs/1704.07156) proposes a multitask learning method that equips the Bi-LSTM context encoder module with an auxiliary training objective, which learns to predict surrounding words for every word in the sentence. The language modeling objective provides consistent performance improvements on several sequence labeling benchmarks, because it motivates the model to learn more general semantic and syntactic composition patterns of the language.
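
To illustrate what "predicting surrounding words" can mean in practice, the sketch below builds the auxiliary targets for one sentence. It assumes one common setup, in which the forward direction predicts the next word and the backward direction predicts the previous word; the boundary markers, the function name, and the example sentence are hypothetical.

```python
def language_model_targets(tokens, bos="<s>", eos="</s>"):
    """For each position, return the next word (forward target) and the
    previous word (backward target), with boundary markers at the edges."""
    next_words = list(tokens[1:]) + [eos]
    previous_words = [bos] + list(tokens[:-1])
    return next_words, previous_words


tokens = ["Fischler", "proposed", "EU", "measures"]
next_words, previous_words = language_model_targets(tokens)
for token, nxt, prev in zip(tokens, next_words, previous_words):
    print(f"{token}: forward target={nxt}, backward target={prev}")
```

These targets need no extra annotation, which is why the auxiliary objective can be trained on the same sentences as the tagging objective.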

Zhang et al. propose [Multi-Order BiLSTM](https://arxiv.org/abs/1711.08231), which combines low-order and high-order LSTMs in order to learn more tag dependencies. The high-order LSTMs predict multiple tags for the current token, which
contain not only the current tag but also the previous several tags. The model keeps the scalability to high-order models with a pruning technique, and achieves the state-of-the-art result in chunking and competitive results on two NER datasets.

[Ma et al.](https://arxiv.org/abs/1709.10191) propose an LSTM-based model for jointly training sentence-level classification and sequence labeling tasks, in which a modified LSTM structure is adopted as their context encoder module. In particular, the method employs a convolutional neural network before the LSTM to extract features from both the context and previous tags of each word. Therefore, the input for the LSTM is changed to include meaningful contextual and label information.

Most of the existing LSTM-based methods use one or more stacked LSTM layers to extract context features of words. However, [Gregoric et al.](https://www.aclweb.org/anthology/P18-2012/) present a different architecture for NER, which employs multiple parallel, independent Bi-LSTM units across the same input. It promotes diversity among the LSTM units by means of an inter-model regularization term. By distributing computation across multiple smaller LSTMs, they reduce the total number of parameters and achieve significant improvements on the CoNLL 2003 NER dataset compared to previous methods.



Some studies designed LSTM-based neural networks for nested named entity recognition, in which named entities may contain other named entities inside them. [Katiyar and Cardie](https://www.aclweb.org/anthology/N18-1079/) presented a modification to the standard LSTM-based sequence labeling model to handle nested named entity recognition. [Ju et al.](https://www.aclweb.org/anthology/N18-1131/) proposed a neural model that identifies nested entities by dynamically stacking flat NER layers until no outer entities are extracted. Each flat NER layer employs a bidirectional LSTM to capture sequential context. The model merges the outputs of the LSTM layer in the current flat NER layer to construct new representations for the detected entities and then feeds them into the next flat NER layer.

Although the Bi-LSTM has been widely adopted as a context encoder, it has several limitations. One is the shallow connection between consecutive hidden states of RNNs. At each time step, a BiLSTM consumes an incoming word and constructs a new summary of the past subsequence. This process should be highly non-linear so that the hidden states can quickly adapt to variable inputs while still retaining useful summaries of the past. [*Deep transition RNNs*](https://arxiv.org/abs/1312.6026) extend conventional RNNs by increasing the transition depth between consecutive hidden states. Liu et al. introduce a [Deep transition architecture for sequence labeling](https://arxiv.org/abs/1906.02437) and achieve a significant performance improvement for text chunking and NER.

Another limitation is that processing inputs sequentially might limit the ability of an RNN to capture non-contiguous relations between tokens within a sentence. To tackle this problem, [Wei et al.](https://arxiv.org/abs/1908.09128) employ self-attention to provide complementary context information on top of the Bi-LSTM. They propose a position-aware self-attention together with a self-attentional context fusion network, which exploit the relative positional information of an input sequence to capture latent relations among tokens. The method achieves significant improvements on POS tagging, NER and chunking.

#### Transformers <a name="3.2.3"></a>

The [Transformer](https://arxiv.org/abs/1706.03762) was proposed by Vaswani et al. in 2017 and achieves excellent performance on Neural Machine Translation (NMT) tasks. The architecture is based solely on attention mechanisms to draw global dependencies between inputs, dispensing with recurrence and convolutions entirely. It employs a sequence-to-sequence structure that comprises an encoder and a decoder, but subsequent research often adopts only the encoder as a feature extractor. The encoder is composed of a stack of several identical layers, each of which includes a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is placed around each of the two sub-layers to ease the training of the network, and layer normalization is applied after the residual connection to stabilize the activations of the model.
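The structure of one encoder layer can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it uses a single attention head instead of several, omits biases, dropout and the learned gain and offset of layer normalization, and draws random weights. All names (`encoder_layer`, `params`, the dimensions) are assumptions made for the example.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)                      # scores[i][j] = Q_i . K_j
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

def encoder_layer(X, params):
    attn_out, weights = self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    # Sub-layer 1: residual connection, then layer normalization.
    H = [layer_norm([x + a for x, a in zip(xr, ar)]) for xr, ar in zip(X, attn_out)]
    # Sub-layer 2: position-wise feed-forward network ReLU(H W1) W2.
    hidden = [[max(0.0, v) for v in row] for row in matmul(H, params["W1"])]
    ffn_out = matmul(hidden, params["W2"])
    out = [layer_norm([h + f for h, f in zip(hr, fr)]) for hr, fr in zip(H, ffn_out)]
    return out, weights

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, T = 4, 8, 3                       # toy sizes
params = {
    "Wq": rand_matrix(d_model, d_model, rng),
    "Wk": rand_matrix(d_model, d_model, rng),
    "Wv": rand_matrix(d_model, d_model, rng),
    "W1": rand_matrix(d_model, d_ff, rng),
    "W2": rand_matrix(d_ff, d_model, rng),
}
X = rand_matrix(T, d_model, rng)                 # one row per token
out, weights = encoder_layer(X, params)
print(len(out), len(out[0]))
```

Every row of `weights` is a probability distribution over all positions of the sentence, which is what lets each token draw on any other token regardless of distance.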

Due to its performance, the Transformer is widely used in various NLP tasks, where it achieves excellent results. However, in sequence labeling tasks, the Transformer encoder has been reported to perform poorly. Yan et al. found that both direction and relative distance information are important for NER, but that this information is lost when the sinusoidal position embedding of the vanilla Transformer is used. They propose [TENER](https://arxiv.org/abs/1911.04474), an adapted Transformer encoder that incorporates direction- and relative-distance-aware attention as well as un-scaled attention, which can greatly boost the performance of the Transformer encoder for NER. [Star-Transformer](https://arxiv.org/abs/1902.09113) is a lightweight alternative to the Transformer proposed by Guo et al. It replaces the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. The model complexity is reduced significantly, and the model also improves over the standard Transformer on various tasks, including sequence labeling.
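One part of the observation about sinusoidal embeddings can be checked numerically. The sketch below only illustrates the direction issue, under the assumption that position embeddings are compared by a plain dot product (no learned projections); the function names and the dimension are chosen for the example. The dot product of two sinusoidal embeddings depends only on the absolute offset between the positions, so a token three steps to the left looks the same as one three steps to the right.

```python
import math

def sinusoidal_embedding(pos, d_model=16):
    """Sinusoidal position embedding built from sin/cos pairs."""
    emb = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        emb.append(math.sin(pos * freq))
        emb.append(math.cos(pos * freq))
    return emb

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

t, k = 10, 3
left = dot(sinusoidal_embedding(t), sinusoidal_embedding(t - k))
right = dot(sinusoidal_embedding(t), sinusoidal_embedding(t + k))
# Same offset from a different anchor position gives the same value too.
shifted = dot(sinusoidal_embedding(t + 20), sinusoidal_embedding(t + 20 + k))

print(abs(left - right) < 1e-9, abs(right - shifted) < 1e-9)
```

Both comparisons hold because each sin/cos pair contributes the cosine of the frequency times the offset, which is an even function of the offset.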

As mentioned previously, language model embeddings pre-trained with a Transformer (BERT) are contextualized embeddings and can replace, or be combined with, traditional embeddings such as Google Word2vec and Stanford GloVe. Moreover, these language models can be further fine-tuned with one additional output layer for a wide range of tasks, including NER and chunking. In particular, [Li et al.](https://arxiv.org/abs/1910.11476) and [Li et al.](https://arxiv.org/abs/1911.02855) framed the NER task as a machine reading comprehension (MRC) problem, which can be solved by fine-tuning the BERT model.

### Tag Decoder Architectures <a name="3.3"></a>

The inference module, or tag decoder, is the final stage in a sequence labeling model. It takes the context-dependent representations from the context encoder as input and produces a sequence of tags corresponding to the input sequence.

Figure 12 summarizes four architectures of tag decoders:
MLP + softmax layer, conditional random fields (CRFs),
recurrent neural networks, and pointer networks.

![](https://i.ibb.co/LpqmYGz/tag-decoders-sequence-labeling.png)

#### Multi-layer Perceptron + Softmax <a name="3.3.1"></a>

The softmax function has been widely used in a variety of probability-based multi-class classification methods, from multinomial logistic regression to classifiers based on neural networks. With a multi-layer perceptron + softmax layer as the tag decoder, the sequence labeling task is cast as a multi-class classification problem. The tag for each word is predicted independently from its context-dependent representation, without taking the tags of its neighbors into account. Several of the sequence labeling models introduced earlier use MLP + softmax as the tag decoder.
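Independent decoding can be sketched as follows. The scores stand in for the output of the MLP and are made up for the example, as are the tag set and the function names.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_decode(token_scores, tags):
    """Pick the most probable tag for each token, ignoring its neighbors."""
    decoded = []
    for scores in token_scores:
        probs = softmax(scores)
        best = max(range(len(tags)), key=lambda i: probs[i])
        decoded.append(tags[best])
    return decoded

tags = ["O", "B-PER", "I-PER"]
# One row of hypothetical MLP scores per token.
token_scores = [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.2, 0.1]]
predicted = softmax_decode(token_scores, tags)
print(predicted)

# Nothing in this decoder forbids an invalid sequence such as O followed by I-PER.
invalid = softmax_decode([[3.0, 0.2, 0.1], [0.2, 0.1, 1.5]], tags)
print(invalid)
```

The second call shows the price of independence: each position is decoded on its own, so the decoder can output an `I-PER` tag that does not follow a `B-PER` tag.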

#### Conditional Random Fields <a name="3.3.2"></a>

The above method of independently inferring word labels in a given sequence ignores the dependencies between labels. Typically, the correct label for each word depends on the labels chosen for nearby words. It is therefore necessary to consider the correlation between adjacent labels and to jointly decode the optimal label chain for the entire sequence. A *conditional random field* ([CRF](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)) can take context into account by modeling the prediction as a graphical model, which encodes dependencies between the predictions. CRFs have been widely used in feature-based supervised learning approaches. Many deep learning based NER models use a CRF layer as the tag decoder, e.g., on top of a bidirectional LSTM layer or on top of a CNN layer. The CRF is the most common choice of tag decoder.

Specifically, let $\boldsymbol{Z} = [\boldsymbol{z}^{(1)}, \boldsymbol{z}^{(2)},...,\boldsymbol{z}^{(T)}]$ be the output of the context encoder for the given sequence $\boldsymbol{X}=[\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)},...,\boldsymbol{x}^{(T)}]$.

The probability $P(y^{(1)}, y^{(2)},...,y^{(T)}|\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)},...,\boldsymbol{x}^{(T)})=P(\boldsymbol{y}|\boldsymbol{X})$ of generating the whole label sequence $\boldsymbol{y}=[y^{(1)}, y^{(2)},...,y^{(T)}]$ with regard to $\boldsymbol{Z}$ is

$$P(\boldsymbol{y}|\boldsymbol{X})=\frac{\prod_{j=1}^T \phi(y^{(j-1)},y^{(j)}|\boldsymbol{z}^{(j)})}{\sum_{\boldsymbol{y}^{\prime} \in \boldsymbol{Y}(\boldsymbol{Z})}\prod_{j=1}^T\phi(y^{\prime(j-1)},y^{\prime(j)}|\boldsymbol{z}^{(j)})}$$

where
- $\boldsymbol{Y}(\boldsymbol{Z})$ is the set of possible label sequences for $\boldsymbol{Z}$
- $\phi(y^{(j-1)},y^{(j)}|\boldsymbol{z}^{(j)})=\exp(\boldsymbol{W}_{y^{(j-1)},y^{(j)}}\boldsymbol{z}^{(j)}+b_{y^{(j-1)},y^{(j)}})$, where $\boldsymbol{W}_{y^{(j-1)},y^{(j)}}$ and $b_{y^{(j-1)},y^{(j)}}$ are the weight vector and bias corresponding to the label pair $(y^{(j-1)},y^{(j)})$, respectively, and $y^{(0)}$ denotes a special start label.

A further explanation can be found in this [video](https://youtu.be/GF3iSJkgPbA?list=PLjH60bdMRSckrLLfQRlfXYUkA_b7DDIkO).
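The normalizing sum over all label sequences looks exponential, but it can be computed with dynamic programming. The sketch below is a minimal illustration, not a trained model: it stores the log-potentials $\log\phi$ in a table `log_pot[j][prev][cur]`, and fills that table with made-up emission and transition scores (a common simplification of the label-pair weights, assumed here for the example). The forward algorithm computes the log of the denominator, and Viterbi decoding finds the most probable label sequence.

```python
import itertools
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_score(log_pot, labels, start):
    """Sum of log-potentials along one label sequence (log of the numerator)."""
    prev, total = start, 0.0
    for j, cur in enumerate(labels):
        total += log_pot[j][prev][cur]
        prev = cur
    return total

def log_partition(log_pot, num_labels, start):
    """Forward algorithm: log of the sum over all label sequences."""
    alpha = [log_pot[0][start][y] for y in range(num_labels)]
    for j in range(1, len(log_pot)):
        alpha = [logsumexp([alpha[p] + log_pot[j][p][y] for p in range(num_labels)])
                 for y in range(num_labels)]
    return logsumexp(alpha)

def viterbi(log_pot, num_labels, start):
    """Most probable label sequence under the CRF."""
    delta = [log_pot[0][start][y] for y in range(num_labels)]
    backptr = []
    for j in range(1, len(log_pot)):
        step, new_delta = [], []
        for y in range(num_labels):
            best_p = max(range(num_labels),
                         key=lambda p: delta[p] + log_pot[j][p][y])
            step.append(best_p)
            new_delta.append(delta[best_p] + log_pot[j][best_p][y])
        backptr.append(step)
        delta = new_delta
    path = [max(range(num_labels), key=lambda y: delta[y])]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    return path[::-1]

# Toy problem: K = 2 labels, index 2 is the start label, T = 3 tokens.
K, START = 2, 2
emit = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.4]]        # made-up per-token scores
trans = [[0.5, -0.2], [-0.3, 0.6], [0.0, 0.0]]     # last row: from the start label
log_pot = [[[trans[p][y] + emit[j][y] for y in range(K)] for p in range(K + 1)]
           for j in range(len(emit))]

log_Z = log_partition(log_pot, K, START)
best_path = viterbi(log_pot, K, START)
best_prob = math.exp(sequence_score(log_pot, best_path, START) - log_Z)

# Sanity check: enumerate all K**T sequences explicitly.
all_paths = list(itertools.product(range(K), repeat=len(emit)))
brute_log_Z = logsumexp([sequence_score(log_pot, p, START) for p in all_paths])
print(best_path, abs(log_Z - brute_log_Z) < 1e-9)
```

The brute-force enumeration is only feasible for toy inputs. The forward and Viterbi recursions visit each position once and each label pair once per position, which is what makes CRF training and decoding practical.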

CRFs, however, cannot make full use of segment-level information because the inner properties of segments cannot
be fully encoded with word-level representations.

**Semi-CRF**

Semi-Markov conditional random fields ([semi-CRFs](https://proceedings.neurips.cc/paper/2004/hash/eb06b9db06012a7a4179b8f3cb5384d3-Abstract.html)) are an extension of conventional CRFs in which labels are assigned to segments of the input sequence rather than to individual words. A semi-CRF extracts features of segments and models the transitions between them, which makes it suitable for segment-level sequence labeling tasks such as named entity recognition and phrase chunking. Compared to CRFs, the advantage of semi-CRFs is that they can make full use of segment-level information to capture the internal properties of segments, and higher-order label dependencies can be taken into account. However, since a semi-CRF jointly learns to determine the length of each segment and the corresponding label, the time complexity becomes higher. Besides, more features are required for modeling segments of different lengths, and automatically extracting meaningful segment-level features is an important issue for semi-CRFs. With advances in deep learning, some models combining neural networks and semi-CRFs for sequence labeling have been studied.
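Decoding over segments rather than tokens can be sketched with a semi-Markov variant of Viterbi. This is a toy illustration under stated assumptions: `seg_score` is a hypothetical stand-in for a learned segment scoring function, driven here by a tiny made-up gazetteer, and `max_len` bounds the segment length.

```python
def semi_markov_viterbi(T, labels, max_len, seg_score):
    """Best joint segmentation and labeling of T tokens.

    seg_score(start, end, prev_label, label) scores the segment covering
    tokens start..end-1 with `label`, following a segment labeled
    `prev_label` (None for the first segment).
    """
    neg_inf = float("-inf")
    # best[i][y]: best score for the first i tokens when the last segment has label y.
    best = [{y: neg_inf for y in labels} for _ in range(T + 1)]
    back = [{y: None for y in labels} for _ in range(T + 1)]
    for end in range(1, T + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for y in labels:
                for p in ([None] if start == 0 else labels):
                    base = 0.0 if start == 0 else best[start][p]
                    score = base + seg_score(start, end, p, y)
                    if score > best[end][y]:
                        best[end][y] = score
                        back[end][y] = (start, p)
    # Follow the back-pointers from the best final label.
    y = max(labels, key=lambda lab: best[T][lab])
    end, segments = T, []
    while end > 0:
        start, p = back[end][y]
        segments.append((start, end, y))
        end, y = start, p
    return segments[::-1]

tokens = ["Barack", "Obama", "visited", "New", "York"]
gazetteer = {("Barack", "Obama"): "PER", ("New", "York"): "LOC"}  # made-up lookup

def seg_score(start, end, prev_label, label):
    """Toy scorer; prev_label is unused here but a real model could use it."""
    span = tuple(tokens[start:end])
    if gazetteer.get(span) == label:
        return 3.0      # the whole span is a known entity of this type
    if label == "O" and end - start == 1:
        return 1.0      # a single non-entity token
    return -1.0         # anything else is discouraged

segments = semi_markov_viterbi(len(tokens), ["PER", "LOC", "O"], 3, seg_score)
print(segments)
```

The extra loop over segment lengths is where the added cost comes from: for each end position the decoder considers up to `max_len` segment lengths in addition to every pair of labels.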

Kong et al. propose Segmental Recurrent Neural Networks ([SRNNs](https://arxiv.org/abs/1511.06018)) for segment-level sequence labeling problems. SRNNs adopt a semi-CRF as the inference module and learn representations of segments through a Bi-LSTM. Exploiting the recurrent nature of RNNs, the method further designs a dynamic programming algorithm to reduce the time complexity.

A parallel work, Gated Recursive Semi-CRFs ([grSemi-CRFs](https://www.aclweb.org/anthology/P16-1134/)) proposed by Zhuo et al., employs a Gated Recursive Convolutional Neural Network ([grConv](https://arxiv.org/abs/1409.1259)) to extract segment-level features for the semi-CRF. The grConv is a variant of the recursive neural network that learns segment-level representations by constructing a pyramid-like structure and recursively combining adjacent segment vectors.

The follow-up work by [Kemos et al.](https://arxiv.org/abs/1808.04208) utilizes the same grConv architecture to extract segment features in a neural semi-CRF model for POS tagging. It takes characters as the basic input unit and, unlike existing character-level models, does not require correct token boundaries. The model relies on the semi-CRF to jointly segment (tokenize) and label characters, which makes it robust for languages with difficult or noisy tokenization.

[Sato et al.](https://www.aclweb.org/anthology/I17-2017/) design a Segment-level Neural CRF for segment-level sequence labeling tasks. The method applies a CNN to obtain segment-level representations and constructs a segment lattice to reduce the search space.


The aforementioned models only use segment-level labels for segment score calculation and model training. An extension proposed by Ye et al. demonstrates that incorporating word-level label information can be beneficial for building semi-CRFs. The proposed Hybrid Semi-CRF ([HSCRF](https://arxiv.org/abs/1805.03838)) model utilizes word-level and segment-level labels simultaneously to derive the segment scores.

#### Recurrent Neural Networks <a name="3.3.3"></a>

Some studies demonstrate that RNNs can also be adopted in the inference module for producing optimal labels. In addition to the learned representations output by the context encoder, the previously predicted labels also serve as input. The label of each word is thus generated based on both the features of the input sequence and the previously predicted labels, so that long-range label dependencies can be captured. However, unlike the globally normalized CRF model, the RNN-based inference method decodes the labels greedily from left to right. It is therefore a locally normalized model that might suffer from label bias and exposure bias problems ([reference](https://arxiv.org/pdf/1603.06042.pdf)).

[Shen et al.](https://arxiv.org/abs/1707.05928) employ an LSTM layer on top of the context encoder for label decoding. As depicted in Figure 12(c), the decoder LSTM takes the last generated label as well as the contextual representation of the current word as inputs and computes a hidden state, which is passed through a softmax function to decode the label. They reported that RNN tag decoders outperform CRFs and are faster to train when the number of entity types is large. [Zheng et al.](https://arxiv.org/abs/1706.05075) adopt a similar LSTM structure as the inference module of their sequence labeling model.
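Greedy left-to-right decoding with label feedback can be sketched as follows. This is a simplified illustration rather than the model of either paper: a plain tanh recurrent cell replaces the LSTM, the weights are random instead of trained, and all names and sizes are assumptions made for the example.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def greedy_rnn_decode(encodings, W_in, W_h, W_out, num_labels):
    """Decode one label per token, feeding each prediction into the next step."""
    hidden = [0.0] * len(W_h)
    prev_label = [0.0] * num_labels        # all zeros stands for "no previous label"
    labels = []
    for z in encodings:
        step_input = z + prev_label        # concatenate encoder output and last label
        pre = [a + b for a, b in zip(matvec(W_in, step_input), matvec(W_h, hidden))]
        hidden = [math.tanh(v) for v in pre]
        probs = softmax(matvec(W_out, hidden))   # normalized locally, at this step only
        best = max(range(num_labels), key=lambda i: probs[i])
        labels.append(best)
        prev_label = [1.0 if i == best else 0.0 for i in range(num_labels)]
    return labels

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_enc, d_hidden, num_labels, T = 4, 5, 3, 6     # toy sizes
W_in = rand_matrix(d_hidden, d_enc + num_labels, rng)
W_h = rand_matrix(d_hidden, d_hidden, rng)
W_out = rand_matrix(num_labels, d_hidden, rng)
encodings = rand_matrix(T, d_enc, rng)          # stand-in for context encoder output

decoded = greedy_rnn_decode(encodings, W_in, W_h, W_out, num_labels)
print(decoded)
```

Because each step commits to its argmax before moving on, an early mistake is fed forward and cannot be revised, which is the practical face of the locally normalized decoding discussed above.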

#### Pointer networks <a name="3.3.4"></a>

[*Pointer networks*](https://arxiv.org/abs/1506.03134) apply RNNs to learn the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence. They handle variable-length dictionaries by using a softmax probability distribution over the input positions as a "*pointer*".

[Zhai et al.](https://arxiv.org/abs/1701.04027) propose a neural sequence chunking model based on an encoder-decoder-pointer framework, which is suitable for tasks that need to assign labels to meaningful chunks in sentences, such as phrase chunking and semantic role labeling. The architecture is illustrated in Figure 12(d). The proposed model divides the original sequence labeling task into two steps: (1) segmentation, which identifies the scope of each chunk; and (2) labeling, which treats each chunk as a complete unit to label. It adopts a pointer network to perform the segmentation by determining the ending point of each chunk, and an LSTM decoder labels the chunks based on the segmentation results.
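The two-step procedure can be sketched as a loop that alternates between pointing at a chunk end and labeling the chunk. This is a toy illustration, not the authors' model: `boundary_score` and `label_chunk` are hypothetical stand-ins for the learned pointer scores and the LSTM labeler (here they simply read a made-up annotation), and `max_len` is an assumed limit on chunk length.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_chunk(tokens, boundary_score, label_chunk, max_len=4):
    """Segment a sentence by pointing at chunk ends, then label each chunk."""
    chunks, start = [], 0
    while start < len(tokens):
        candidates = list(range(start, min(start + max_len, len(tokens))))
        # The "pointer": a softmax over input positions, not over a fixed vocabulary.
        probs = softmax([boundary_score(start, end) for end in candidates])
        end = candidates[max(range(len(candidates)), key=lambda i: probs[i])]
        # Label the chunk as a single unit (inclusive end index).
        chunks.append((start, end, label_chunk(start, end)))
        start = end + 1
    return chunks

tokens = ["The", "cat", "sat", "on", "the", "mat"]
toy_tags = ["NP", "NP", "VP", "PP", "NP", "NP"]   # made-up annotation driving the toy scorers

def boundary_score(start, end):
    """Toy score: prefer the longest span whose tokens share one tag."""
    same = all(tag == toy_tags[start] for tag in toy_tags[start:end + 1])
    return float(end - start + 1) if same else -5.0

def label_chunk(start, end):
    return toy_tags[start]

chunks = pointer_chunk(tokens, boundary_score, label_chunk)
print(chunks)
```

The number of candidates changes with `start`, so the softmax is taken over a different set of positions at every step, which is the property that distinguishes a pointer from an ordinary output layer.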

[Li et al.](https://www.ijcai.org/Proceedings/2018/0579.pdf) employ a similar architecture for their text segmentation model, in which a seq2seq model equipped with a pointer network is designed to infer the segment boundaries.

# References <a name="5"></a>

- [A Survey on Deep Learning for Named Entity Recognition](https://arxiv.org/abs/1812.09449)

- [A Survey on Recent Advances in Sequence Labeling from Deep Learning Models](https://arxiv.org/abs/2011.06727) 

- [A Survey on Recent Advances in Named Entity Recognition from Deep Learning models](https://arxiv.org/abs/1910.11470)

- [Natural Language Processing Advancements By Deep Learning: A Survey](https://arxiv.org/abs/2003.01200)

- [Sequence Tagging, Syntactic and Semantic Parsing with BERT](https://arxiv.org/abs/1908.04943)

- [Parallel sequence tagging for concept recognition](https://arxiv.org/abs/2003.07424)

- [Extracting Information from Text](http://www.nltk.org/book/ch07.html)