## Recognize named entities on news data with CNN
In this tutorial, you will use a convolutional neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems. It extracts entities such as persons, organizations, and locations from text. In this task you will experiment with recognizing named entities in different documents from a dataset.

For example, we want to extract persons' and organizations' names from the text. Then for the input text:

Yan Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

B-PER I-PER O O B-ORG I-ORG

Here the B- and I- prefixes stand for the beginning and the inside of an entity, while O stands for outside, meaning no tag. Markup with this prefix scheme is called BIO markup. It is introduced to distinguish consecutive entities of the same type.
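
As a minimal sketch of how BIO markup is read back into entities, the helper below groups tagged tokens into (type, text) spans. The function name `bio_to_spans` and the sample sentence are illustrative assumptions; they are not part of the dataset or of the library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag) closes the open entity, if any.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


# Hypothetical sentence with two adjacent entities of the same type.
tokens = ['Alice', 'Smith', 'Bob', 'met', 'Acme', 'Corp']
tags = ['B-PER', 'I-PER', 'B-PER', 'O', 'B-ORG', 'I-ORG']
spans = bio_to_spans(tokens, tags)
print(spans)  # [('PER', 'Alice Smith'), ('PER', 'Bob'), ('ORG', 'Acme Corp')]
```

Without the B- prefix, the two adjacent person names in this example would merge into a single span.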

The solution will be based on neural networks, specifically on Convolutional Neural Networks (CNNs).

## Data
The data is located in the /data folder.

We will work with a corpus of texts annotated with NE tags. A typical file with NER data contains lines with token (word or punctuation symbol) and tag pairs separated by whitespace. In many cases additional information, such as POS tags, is included between the token and the tag. Different documents are separated by lines starting with the -DOCSTART- token, and different sentences are separated by an empty line. Example:

-DOCSTART- -X- -X- O

По O
защите O
прав O
граждан O
( O
определения O
от B-REF
21 I-REF
декабря I-REF
1998 I-REF
года I-REF
№ I-REF
183-O I-REF
. O

А O
также O
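
The format above can be read with a few lines of plain Python. The sketch below is only an illustration of the parsing rules just described, not the implementation used by the library; the function name `parse_conll` and the shortened sample text are assumptions.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) samples."""
    samples, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # Blank lines and document markers both end the current sentence.
            if tokens:
                samples.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        # First field is the token, last field is the tag; anything in
        # between (e.g. POS tags) is ignored in this sketch.
        tokens.append(fields[0])
        tags.append(fields[-1])
    if tokens:
        samples.append((tokens, tags))
    return samples


sample_text = """-DOCSTART- -X- -X- O

от B-REF
21 I-REF
декабря I-REF
. O

А O
также O
"""
parsed = parse_conll(sample_text)
print(parsed)
```

Each parsed sample is a tuple of a token list and a tag list of equal length.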

We start by using the Conll2003DatasetReader class, which provides functionality for reading the dataset. It returns a dictionary with the fields train, test, and valid. Each field stores a list of samples. Each sample is a tuple of tokens and tags, both of which are lists. The following example depicts the structure that should be returned by the read method:

```python
{'train': [(['с', 'определением', 'от', '24', 'декабря', '2003', 'года', '№', '156-O'], ['O', 'O', 'B-REF', 'I-REF', 'I-REF', 'I-REF', 'I-REF', 'I-REF', 'I-REF']), ...],
 'valid': [...],
 'test': [...]}
```

There are three separate parts of the dataset:

- train data for training the model;
- validation data for evaluation and hyperparameter tuning;
- test data for the final evaluation of the model.

Each of these parts is stored in a separate txt file.

We will use Conll2003DatasetReader from the library to read the data from the text files into the format described above.

In [None]:
from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader
dataset = Conll2003DatasetReader().read('/data')

## Prepare dictionaries
To train a neural network, we will use two mappings:

- {token}$\to${token id}: addresses the row in the embedding matrix for the current token;
- {tag}$\to${tag id}: used to build the one-hot ground-truth probability distribution vectors for computing the loss at the output of the network.

The SimpleVocabulary implemented in the library will be used to perform those mappings.

In [None]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

Now we need to build dictionaries for tokens and tags. Vocabularies sometimes contain special tokens, for instance an unknown-word token that is used every time we encounter an out-of-vocabulary word. In our case the only special token will be <UNK> for out-of-vocabulary words.

In [17]:
special_tokens = ['<UNK>']

token_vocab = SimpleVocabulary(special_tokens, save_path='model/token.dict')
tag_vocab = SimpleVocabulary(save_path='model/tag.dict')



Let's fit the vocabularies on the train part of the data.

In [18]:
all_tokens_by_sentences = [tokens for tokens, tags in dataset['train']]
all_tags_by_sentences = [tags for tokens, tags in dataset['train']]

token_vocab.fit(all_tokens_by_sentences)
tag_vocab.fit(all_tags_by_sentences)


Try to get the indices. Keep in mind that we are working with batches of the following structure:

[['utt0_tok0', 'utt0_tok1', ...], ['utt1_tok0', 'utt1_tok1', ...], ...]

In [19]:
token_vocab([['Конституции', 'Российскийской', 'Федерации', 'от']])

[[54, 0, 9, 3]]

In [20]:
tag_vocab([['O', 'O', 'O'], ['B-REF', 'I-REF']])

[[0, 0, 0], [2, 1]]
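
To make the lookups above less of a black box, here is a toy vocabulary in plain Python. It is a hypothetical stand-in and not the SimpleVocabulary implementation: the class name `ToyVocabulary`, the frequency-based ordering, and the fallback to index 0 are assumptions made for illustration, so the indices it produces will not necessarily match those of the library.

```python
from collections import Counter


class ToyVocabulary:
    """Hypothetical stand-in illustrating an item -> index mapping."""

    def __init__(self, special_tokens=()):
        self.special_tokens = list(special_tokens)
        self.index_to_item = []
        self.item_to_index = {}

    def fit(self, batch):
        counts = Counter(item for sample in batch for item in sample)
        # Special tokens come first, then items by descending frequency.
        self.index_to_item = self.special_tokens + [
            item for item, _ in counts.most_common()
        ]
        self.item_to_index = {
            item: index for index, item in enumerate(self.index_to_item)
        }

    def __call__(self, batch):
        # Unknown items fall back to index 0, assumed here to be '<UNK>'.
        return [
            [self.item_to_index.get(item, 0) for item in sample]
            for sample in batch
        ]


vocab = ToyVocabulary(special_tokens=['<UNK>'])
vocab.fit([['от', '21', 'от'], ['года', 'от']])
indices = vocab([['от', 'неизвестно', 'года']])
print(indices)  # [[1, 0, 3]]
```

The word that was never seen during fitting maps to index 0, which mirrors the 0 returned for the out-of-vocabulary token in the output above.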

## Dataset Iterator
Neural networks are usually trained with batches, meaning that every weight update is based on several sequences at once. The tricky part is that all sequences within a batch must have the same length, so we will pad the shorter ones with the special <UNK> token. The tag sequences must be padded in the same way. With recurrent networks it is also good practice to provide the sequence lengths so that computations on the padded parts can be skipped. The library provides ready-made batching through the DataLearningIterator class, which saves us from writing a batch generator by hand.
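
The padding step can be sketched in a few lines. This is an illustration only; the helper name `pad_batch` is an assumption, and `pad_index=0` relies on <UNK> being the first entry of the token vocabulary.

```python
def pad_batch(index_batch, pad_index=0):
    """Pad lists of indices to the length of the longest one."""
    max_len = max(len(sample) for sample in index_batch)
    padded = [
        sample + [pad_index] * (max_len - len(sample))
        for sample in index_batch
    ]
    # The original lengths tell us which positions are real tokens.
    lengths = [len(sample) for sample in index_batch]
    return padded, lengths


padded, lengths = pad_batch([[54, 9, 3], [7]])
print(padded)   # [[54, 9, 3], [7, 0, 0]]
print(lengths)  # [3, 1]
```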

An important concept in batch generation is shuffling, that is, taking samples from the dataset in random order. It is important to train on shuffled data because a large number of consecutive samples of the same class may result in poor model quality.
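
A shuffled batch generator can be sketched as follows. This is not the DataLearningIterator implementation; the function name `shuffled_batches`, the `seed` parameter, and the toy samples are assumptions made for illustration.

```python
import random


def shuffled_batches(samples, batch_size, seed=None):
    """Yield batches of samples in a random order."""
    order = list(range(len(samples)))
    # Shuffle the indices rather than the data, leaving the dataset intact.
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


# Toy (tokens, tags) samples, made up for this example.
samples = [(['a'], ['O']), (['b'], ['O']), (['c'], ['B-REF']),
           (['d'], ['I-REF']), (['e'], ['O'])]
batches = list(shuffled_batches(samples, batch_size=2, seed=0))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Every sample appears exactly once per pass; only the order changes between epochs.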

In [21]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

Create the dataset iterator from the loaded dataset:

In [22]:
data_iterator = DataLearningIterator(dataset)

Try it out:

In [24]:
next(data_iterator.gen_batches(1, shuffle=True))

((['К',
   'тому',
   'же',
   'при',
   'осуществлении',
   'в',
   'период',
   'предварительного',
   'расследования',
   'судебного',
   'контроля',
   'за',
   'законностью',
   'и',
   'обоснованностью',
   'процессуальных',
   'актов',
   'органов',
   'дознания',
   ',',
   'следователей',
   'и',
   'прокуроров',
   'судом',
   'не',
   'должны',
   'предрешаться',
   'вопросы',
   ',',
   'которые',
   'впоследствии',
   'могут',
   'стать',
   'предметом',
   'судебного',
   'разбирательства',
   'по',
   'существу',
   'уголовного',
   'дела',
   '(',
   'Постановление',
   'Конституционного',
   'Суда',
   'Российской',
   'Федерации',
   'от',
   '23',
   'марта',
   '1999',
   'года',
   '№',
   '5-П',
   ';',
   'определения',
   'Конституционного',
   'Суда',
   'Российской',
   'Федерации',
   'от',
   '27',
   'мая',
   '2010',
   'года',
   '№',
   '633-О-О',
   ',',
   'от',
   '14',
   'июля',
   '2011',
   'года',
   '№',
   '1027-О-О',
   ',',
   'от',
   '21',


## Masking
The last step in generating training data is masking. We need to produce a binary mask that is one where tokens are present and zero elsewhere. This mask will stop backpropagation through the paddings. An example of such a mask:

[[1, 1, 0, 0, 0],
 [1, 1, 1, 1, 1]]
 
For the sentences in batch:

 [['The', 'roof'],
  ['This', 'is', 'my', 'domain', '!']]

The mask length must equal the length of the longest sentence in the batch.
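
For illustration, the mask for the two sentences above can be built in plain Python. The helper `build_mask` is a hypothetical name and only sketches the idea; the tutorial itself uses the Mask class from the library.

```python
def build_mask(token_batch):
    """Return a 0/1 mask with one row per sentence."""
    max_len = max(len(tokens) for tokens in token_batch)
    return [
        [1] * len(tokens) + [0] * (max_len - len(tokens))
        for tokens in token_batch
    ]


mask = build_mask([['The', 'roof'], ['This', 'is', 'my', 'domain', '!']])
print(mask)  # [[1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]
```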

In [25]:
from deeppavlov.models.preprocessors.mask import Mask
get_mask = Mask()



In [26]:
get_mask([['Try', 'to', 'get', 'the', 'mask'], ['Check', 'paddings']])

array([[1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0.]], dtype=float32)

## Build a convolutional neural network
This is the most important part of the assignment. Here we will specify the network architecture based on TensorFlow building blocks. It's fun and easy, like building with Lego bricks! We will create a Convolutional Neural Network (CNN) that produces a probability distribution over tags for each token in a sentence. Convolutions let the network take both the left and right context of a token into account. A dense layer will be used on top to perform tag classification.

An essential part of almost every network in the NLP domain is word embeddings. We pass the text to the network as a series of tokens. Each token is represented by its index. For every token (index) we have a vector. Together these vectors form an embedding matrix. This matrix can either be pretrained using a common algorithm such as Skip-Gram or CBOW, or be initialized with random values and trained along with the other parameters of the network. In this tutorial we will follow the second alternative.

We need to build a function that takes the tensor of token indices with shape [batch_size, num_tokens] and, for each index in this matrix, retrieves the corresponding vector from the embedding matrix. This results in a new tensor with shape [batch_size, num_tokens, emb_dim].
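
The shape bookkeeping can be checked without TensorFlow. The following pure-Python sketch mimics the lookup with nested lists; the helper names `make_embedding_matrix` and `lookup` and the tiny sizes are assumptions chosen for illustration.

```python
import math
import random


def make_embedding_matrix(vocabulary_size, emb_dim, seed=0):
    """Random Gaussian matrix whose entries have variance 1 / emb_dim."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(emb_dim)
    return [
        [rng.gauss(0.0, std) for _ in range(emb_dim)]
        for _ in range(vocabulary_size)
    ]


def lookup(emb_mat, indices):
    """Replace every index in a [batch_size, num_tokens] batch by its row."""
    return [[emb_mat[i] for i in sample] for sample in indices]


emb_mat = make_embedding_matrix(vocabulary_size=5, emb_dim=4)
emb = lookup(emb_mat, [[1, 0, 3], [2, 2, 0]])
shape = (len(emb), len(emb[0]), len(emb[0][0]))
print(shape)  # (2, 3, 4)
```

Identical indices retrieve the same row, so a repeated token always gets the same vector.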

In [30]:
import tensorflow as tf
import numpy as np
def get_embeddings(indices, vocabulary_size, emb_dim):
    # Initialize the random gaussian matrix with dimensions [vocabulary_size, embedding_dimension]
    # The **VARIANCE** of the random samples must be 1 / embedding_dimension
    emb_mat = np.random.randn(vocabulary_size, emb_dim).astype(np.float32) / np.sqrt(emb_dim) # YOUR CODE HERE
    emb_mat = tf.Variable(emb_mat, name='Embeddings', trainable=True)
    emb = tf.nn.embedding_lookup(emb_mat, indices)
    return emb

Now stack a number of convolutional layers with dimensionality given in n_hidden_list:

In [31]:

def conv_net(units, n_hidden_list, cnn_filter_width, activation=tf.nn.relu):
    # Use activation(units) to apply activation to units
    for n_hidden in n_hidden_list:
        
        units = tf.layers.conv1d(units,
                                 n_hidden,
                                 cnn_filter_width,
                                 padding='same')
        units = activation(units)
    return units
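
To see what `padding='same'` means for a sequence, here is a single-channel sketch in plain Python. It is a simplification (one input channel, one filter, odd filter width only), and the name `conv1d_same` is an assumption; it is not how TensorFlow implements the layer.

```python
def conv1d_same(sequence, kernel):
    """Single-channel 1D convolution with zero 'same' padding."""
    width = len(kernel)
    assert width % 2 == 1, 'this sketch only handles odd kernel widths'
    half = width // 2
    # Zero-pad both ends so the output is as long as the input.
    padded = [0.0] * half + list(sequence) + [0.0] * half
    return [
        sum(k * padded[start + offset] for offset, k in enumerate(kernel))
        for start in range(len(sequence))
    ]


out = conv1d_same([1, 2, 3, 4], [1, 1, 1])
print(out)  # [3.0, 6.0, 9.0, 7.0]
```

Each output position mixes the token with its left and right neighbours, and the sequence length is preserved, so there is still one output per token. Stacking layers widens the context: with a filter width of 7 and two layers, each output depends on 13 input positions.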

Now define your own function that returns a scalar masked cross-entropy loss:

In [32]:
def masked_cross_entropy(logits, label_indices, number_of_tags, mask):
    ground_truth_labels = tf.one_hot(label_indices, depth=number_of_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=ground_truth_labels, logits=logits)
    loss_tensor *= mask
    loss = tf.reduce_mean(loss_tensor)
    return loss
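
Note that `tf.reduce_mean` in the function above averages over every position, padded ones included, so the loss value depends on how much padding a batch contains. A common alternative, shown here only as a sketch and not as what the tutorial's code does, is to divide by the number of real tokens. The helper names and the toy batch below are assumptions.

```python
import math


def token_cross_entropy(logits, label):
    """Softmax cross-entropy for one token from raw logits and a tag id."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[label]


def masked_losses(logits_batch, labels_batch, mask):
    """Return (mean over all positions, mean over real tokens only)."""
    total, n_positions, n_real = 0.0, 0, 0.0
    for logits_seq, labels_seq, mask_seq in zip(logits_batch, labels_batch, mask):
        for logits, label, m in zip(logits_seq, labels_seq, mask_seq):
            total += m * token_cross_entropy(logits, label)
            n_positions += 1
            n_real += m
    return total / n_positions, total / n_real


# Two sentences, three tags, uniform logits; the last position is padding.
uniform = [0.0, 0.0, 0.0]
logits_batch = [[uniform, uniform], [uniform, uniform]]
labels_batch = [[0, 1], [2, 0]]
mask = [[1.0, 1.0], [1.0, 0.0]]
mean_all, mean_real = masked_losses(logits_batch, labels_batch, mask)
print(round(mean_all, 4), round(mean_real, 4))  # 0.824 1.0986
```

With uniform logits the per-token loss is ln 3, which the second value recovers exactly, while the first is scaled down by the share of padded positions.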

Put everything into a class:

In [33]:
class NerNetwork:
    def __init__(self,
                 n_tokens,
                 n_tags,
                 token_emb_dim=100,
                 n_hidden_list=(128,),
                 cnn_filter_width=7,
                 use_batch_norm=False,
                 embeddings_dropout=False,
                 top_dropout=False,
                 **kwargs):
        # ================ Building inputs =================

        self.learning_rate_ph = tf.placeholder(tf.float32, [])
        self.dropout_keep_ph = tf.placeholder(tf.float32, [])
        self.token_ph = tf.placeholder(tf.int32, [None, None], name='token_ind_ph')
        self.mask_ph = tf.placeholder(tf.float32, [None, None], name='Mask_ph')
        self.y_ph = tf.placeholder(tf.int32, [None, None], name='y_ph')

        # ================== Building the network ==================

        # Now embed the indices of tokens using the get_embeddings function

        ######################################
        ########## YOUR CODE HERE ############
        emb = get_embeddings(self.token_ph, n_tokens, token_emb_dim)
        ######################################

        emb = tf.nn.dropout(emb, self.dropout_keep_ph, (tf.shape(emb)[0], 1, tf.shape(emb)[2]))

        # Build a multilayer CNN on top of the embeddings.
        # The number of units in each layer must match the
        # corresponding number from n_hidden_list.
        # Use ReLU activation
        ######################################
        ########## YOUR CODE HERE ############
        units = conv_net(emb, n_hidden_list, cnn_filter_width)
        ######################################
        units = tf.nn.dropout(units, self.dropout_keep_ph, (tf.shape(units)[0], 1, tf.shape(units)[2]))
        logits = tf.layers.dense(units, n_tags, activation=None)
        self.predictions = tf.argmax(logits, 2)

        # ================= Loss and train ops =================
        # Use cross-entropy loss. check the tf.nn.softmax_cross_entropy_with_logits_v2 function
        ######################################
        ########## YOUR CODE HERE ############
        self.loss = masked_cross_entropy(logits, self.y_ph, n_tags, self.mask_ph)
        ######################################

        # Create a training operation to update the network parameters.
        # We propose to use the Adam optimizer, as it works well in
        # most cases. Check tf.train to find an implementation.
        # Put the train operation into the attribute self.train_op

        ######################################
        ########## YOUR CODE HERE ############
        optimizer = tf.train.AdamOptimizer(self.learning_rate_ph)
        self.train_op = optimizer.minimize(self.loss)
        ######################################

        # ================= Initialize the session =================

        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def __call__(self, tok_batch, mask_batch):
        feed_dict = {self.token_ph: tok_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: 1.0}
        return self.sess.run(self.predictions, feed_dict)

    def train_on_batch(self, tok_batch, tag_batch, mask_batch, dropout_keep_prob, learning_rate):
        feed_dict = {self.token_ph: tok_batch,
                     self.y_ph: tag_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: dropout_keep_prob,
                     self.learning_rate_ph: learning_rate}
        self.sess.run(self.train_op, feed_dict)

Now create an instance of the NerNetwork class:

In [34]:
nernet = NerNetwork(len(token_vocab),
                    len(tag_vocab),
                    n_hidden_list=[100, 100])

Let's write the evaluation function. We need to get all predictions for the given part of the dataset and compute F1.

In [36]:
from deeppavlov.metrics.fmeasure import precision_recall_f1
# The function precision_recall_f1 takes two lists: y_true and y_predicted.
# The tag sequences of all sentences should be merged into one big list.
from deeppavlov.core.data.utils import zero_pad
# zero_pad takes a batch of lists of token indices, pads it with zeros to the
# maximal length and converts it to a numpy matrix
from itertools import chain


def eval_valid(network, batch_generator):
    total_true = []
    total_pred = []
    for x, y_true in batch_generator:

        # Prepare token indices from tokens batch
        x_inds = token_vocab(x) # YOUR CODE HERE

        # Pad the indices batch with zeros
        x_batch = zero_pad(x_inds) # YOUR CODE HERE

        # Get the mask using get_mask
        mask = get_mask(x) # YOUR CODE HERE
        
        # We call the instance of the NerNetwork because we have defined __call__ method
        y_inds = network(x_batch, mask)

        # For every sentence in the batch extract all tags up to paddings
        y_inds = [y_inds[n][:len(x[n])] for n, y in enumerate(y_inds)] # YOUR CODE HERE
        y_pred = tag_vocab(y_inds)

        # Add fresh predictions 
        total_true.extend(chain(*y_true))
        total_pred.extend(chain(*y_pred))
    res = precision_recall_f1(total_true, total_pred, print_results=True)

Set hyperparameters.

In [37]:
batch_size = 16 # YOUR HYPERPARAMETER HERE
n_epochs = 20 # YOUR HYPERPARAMETER HERE
learning_rate = 0.001 # YOUR HYPERPARAMETER HERE
dropout_keep_prob = 0.5 # YOUR HYPERPARAMETER HERE

In [39]:
for epoch in range(n_epochs):
    for x, y in data_iterator.gen_batches(batch_size, 'train'):
        # Convert tokens to indices via Vocab
        x_inds = token_vocab(x) # YOUR CODE 
        # Convert tags to indices via Vocab
        y_inds = tag_vocab(y) # YOUR CODE 
        
        # Pad every sample with zeros to the maximal length
        x_batch = zero_pad(x_inds)
        y_batch = zero_pad(y_inds)

        mask = get_mask(x)
        nernet.train_on_batch(x_batch, y_batch, mask, dropout_keep_prob, learning_rate)
    print('Evaluating the model on valid part of the dataset')
    eval_valid(nernet, data_iterator.gen_batches(batch_size, 'valid'))

Evaluating the model on valid part of the dataset


2018-11-18 23:55:09.655 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 540 phrases; correct: 521.

precision:  96.48%; recall:  99.62%; FB1:  98.02

	REF: precision:  96.48%; recall:  99.62%; F1:  98.02 540




Evaluating the model on valid part of the dataset


2018-11-18 23:55:27.949 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 537 phrases; correct: 523.

precision:  97.39%; recall:  100.00%; FB1:  98.68

	REF: precision:  97.39%; recall:  100.00%; F1:  98.68 537




Evaluating the model on valid part of the dataset


2018-11-18 23:55:46.326 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 535 phrases; correct: 523.

precision:  97.76%; recall:  100.00%; FB1:  98.87

	REF: precision:  97.76%; recall:  100.00%; F1:  98.87 535




Evaluating the model on valid part of the dataset


2018-11-18 23:56:04.227 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 537 phrases; correct: 521.

precision:  97.02%; recall:  99.62%; FB1:  98.30

	REF: precision:  97.02%; recall:  99.62%; F1:  98.30 537




Evaluating the model on valid part of the dataset


2018-11-18 23:56:22.396 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 536 phrases; correct: 522.

precision:  97.39%; recall:  99.81%; FB1:  98.58

	REF: precision:  97.39%; recall:  99.81%; F1:  98.58 536




Evaluating the model on valid part of the dataset


2018-11-18 23:56:40.484 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 538 phrases; correct: 520.

precision:  96.65%; recall:  99.43%; FB1:  98.02

	REF: precision:  96.65%; recall:  99.43%; F1:  98.02 538




Evaluating the model on valid part of the dataset


2018-11-18 23:56:58.746 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 538 phrases; correct: 517.

precision:  96.10%; recall:  98.85%; FB1:  97.46

	REF: precision:  96.10%; recall:  98.85%; F1:  97.46 538




Evaluating the model on valid part of the dataset


2018-11-18 23:57:17.536 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 535 phrases; correct: 517.

precision:  96.64%; recall:  98.85%; FB1:  97.73

	REF: precision:  96.64%; recall:  98.85%; F1:  97.73 535




Evaluating the model on valid part of the dataset


2018-11-18 23:57:37.657 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 537 phrases; correct: 521.

precision:  97.02%; recall:  99.62%; FB1:  98.30

	REF: precision:  97.02%; recall:  99.62%; F1:  98.30 537




Evaluating the model on valid part of the dataset


2018-11-18 23:57:57.863 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 536 phrases; correct: 521.

precision:  97.20%; recall:  99.62%; FB1:  98.39

	REF: precision:  97.20%; recall:  99.62%; F1:  98.39 536




Evaluating the model on valid part of the dataset


2018-11-18 23:58:17.9 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 534 phrases; correct: 522.

precision:  97.75%; recall:  99.81%; FB1:  98.77

	REF: precision:  97.75%; recall:  99.81%; F1:  98.77 534




Evaluating the model on valid part of the dataset


2018-11-18 23:58:36.562 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 534 phrases; correct: 518.

precision:  97.00%; recall:  99.04%; FB1:  98.01

	REF: precision:  97.00%; recall:  99.04%; F1:  98.01 534




Evaluating the model on valid part of the dataset


2018-11-18 23:58:54.793 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 535 phrases; correct: 522.

precision:  97.57%; recall:  99.81%; FB1:  98.68

	REF: precision:  97.57%; recall:  99.81%; F1:  98.68 535




Evaluating the model on valid part of the dataset


2018-11-18 23:59:12.746 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 536 phrases; correct: 520.

precision:  97.01%; recall:  99.43%; FB1:  98.21

	REF: precision:  97.01%; recall:  99.43%; F1:  98.21 536




Evaluating the model on valid part of the dataset


2018-11-18 23:59:31.235 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 535 phrases; correct: 518.

precision:  96.82%; recall:  99.04%; FB1:  97.92

	REF: precision:  96.82%; recall:  99.04%; F1:  97.92 535




Evaluating the model on valid part of the dataset


2018-11-18 23:59:49.526 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 537 phrases; correct: 522.

precision:  97.21%; recall:  99.81%; FB1:  98.49

	REF: precision:  97.21%; recall:  99.81%; F1:  98.49 537




Evaluating the model on valid part of the dataset


2018-11-19 00:00:07.696 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 536 phrases; correct: 523.

precision:  97.57%; recall:  100.00%; FB1:  98.77

	REF: precision:  97.57%; recall:  100.00%; F1:  98.77 536




Evaluating the model on valid part of the dataset


2018-11-19 00:00:25.856 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 535 phrases; correct: 520.

precision:  97.20%; recall:  99.43%; FB1:  98.30

	REF: precision:  97.20%; recall:  99.43%; F1:  98.30 535




Evaluating the model on valid part of the dataset


2018-11-19 00:00:44.5 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 533 phrases; correct: 516.

precision:  96.81%; recall:  98.66%; FB1:  97.73

	REF: precision:  96.81%; recall:  98.66%; F1:  97.73 533




Evaluating the model on valid part of the dataset


2018-11-19 00:01:05.788 DEBUG in 'deeppavlov.metrics.fmeasure'['fmeasure'] at line 286: processed 19512 tokens with 523 phrases; found: 536 phrases; correct: 520.

precision:  97.01%; recall:  99.43%; FB1:  98.21

	REF: precision:  97.01%; recall:  99.43%; F1:  98.21 536
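
The percentages in these logs follow directly from the phrase counts. As a worked check, the sketch below recomputes the figures of the last epoch from its counts (523 phrases in the data, 536 found, 520 correct). The function name `precision_recall_f1_from_counts` is hypothetical and is not the library's metric function.

```python
def precision_recall_f1_from_counts(n_true, n_found, n_correct):
    """Phrase-level precision, recall and F1 (in percent) from raw counts."""
    precision = 100.0 * n_correct / n_found
    recall = 100.0 * n_correct / n_true
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1_from_counts(n_true=523, n_found=536, n_correct=520)
print(f'precision: {p:.2f}%; recall: {r:.2f}%; F1: {f1:.2f}')
# precision: 97.01%; recall: 99.43%; F1: 98.21
```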




Let's run the model on our own sentence:

In [43]:
sentence = 'о разъяснении Определения Конституционного Суда Российской Федерации от 24 марта 2015 года № 720-О город Санкт-Петербург'
x = [sentence.split()]

x_inds = token_vocab(x)
x_batch = zero_pad(x_inds)
mask = get_mask(x)
y_inds = nernet(x_batch, mask)
for token,tag in zip(x[0],tag_vocab(y_inds)[0]):
    print(token + " : " + tag)

о : O
разъяснении : O
Определения : O
Конституционного : O
Суда : O
Российской : O
Федерации : O
от : B-REF
24 : I-REF
марта : I-REF
2015 : I-REF
года : I-REF
№ : I-REF
720-О : I-REF
город : O
Санкт-Петербург : O
