# Multilingual Named Entity Recognition on news data with BERT

In this tutorial, you will use a Transformer network, [BERT](https://arxiv.org/abs/1810.04805), to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing: it extracts entities such as persons, organizations, and locations from text. You will recognize named entities in news articles from the widely used CoNLL-2003 dataset. We will use a multilingual model to build a system that performs recognition in multiple languages. The system will be trained only on English data, yet it will be able to perform recognition for 100 languages.

## Task description

Suppose we want to extract the names of persons and organizations from a text. For the input text:

    Yan Goodfellow works for Google Brain

an NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* stands for outside, that is, no entity tag. Markup with this prefix scheme is called *BIO markup*. The prefixes make it possible to distinguish consecutive entities of the same type.
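
To make the role of the prefixes concrete, here is a minimal sketch that groups a BIO tag sequence back into entities. The helper name `bio_to_spans` is hypothetical and is not part of DeepPavlov; it only illustrates how the *B-* prefix separates neighbouring entities.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, entity_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # B- always starts a new entity; an I- tag of a different type does too
        if tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != current_type):
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-'):
            current_tokens.append(token)
        else:
            # 'O' closes any open entity
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = 'Yan Goodfellow works for Google Brain'.split()
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG']
print(bio_to_spans(tokens, tags))
# Two adjacent locations stay separate thanks to the second B- prefix
print(bio_to_spans(['Paris', 'London'], ['B-LOC', 'B-LOC']))
```

Without the *B-* prefix, the two adjacent locations in the second call could not be told apart from a single two-word location.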

### Load the CoNLL-2003 Named Entity Recognition corpus

We will work with a corpus of news articles annotated with named entity tags. A typical file with NER data contains lines with a token (a word or punctuation symbol) and its tag, separated by whitespace. In many cases additional information, such as POS tags, is included between the token and the tag. Different documents are separated by lines **starting** with the **-DOCSTART-** token. Different sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER

In this tutorial we will focus only on tokens and tags (the first and last elements of each line) and drop the POS and chunk information located in between.
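
The format described above is simple enough to parse by hand. The following is a minimal sketch, not the DeepPavlov reader: the function name `parse_conll` is hypothetical, and it assumes whitespace-separated columns with the token first and the NER tag last.

```python
def parse_conll(lines):
    """Turn CoNLL-style lines into a list of (tokens, tags) samples."""
    samples, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # An empty line or a document marker ends the current sentence
            if tokens:
                samples.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])   # first column: the token
        tags.append(fields[-1])    # last column: the NER tag
    if tokens:
        samples.append((tokens, tags))
    return samples


example = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sample in parse_conll(example):
    print(sample)
```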

We start by using the *Conll2003DatasetReader* class, which provides functionality for reading the dataset. It returns a dictionary with the fields *train*, *test*, and *valid*. Each field stores a list of samples. Each sample is a tuple of tokens and tags, both of which are lists. The following example shows the structure returned by the *read* method:

    {'train': [(['Mr.', 'Dwag', 'is', 'derping', 'around'], ['B-PER', 'I-PER', 'O', 'O', 'O']), ....],
     'valid': [...],
     'test': [...]}

There are three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameter tuning;
 - *test* data for final evaluation of the model.
 

Each of these parts is stored in a separate txt file.

We will use [Conll2003DatasetReader](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/dataset_readers/conll2003_reader.py) from the library to read the data from text files to the format described above.

In [None]:
from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader
dataset = Conll2003DatasetReader().read(data_path='data', dataset_name='conll2003')

You should always understand what kind of data you are dealing with. To inspect it, print a few samples by running the following cell:

In [None]:
for sample in dataset['train'][:4]:
    for token, tag in zip(*sample):
        print('%s\t%s' % (token, tag))
    print()

## BERT Model

BERT is a Transformer-based model. At the time of its release it achieved state-of-the-art results on a wide range of natural language processing tasks.

## Download BERT model

We will use the pre-trained multilingual BERT model from the original repository. The downloaded files contain a subword vocabulary for tokenization (`vocab.txt`), the BERT configuration file (`bert_config.json`), and the model checkpoint files (`bert_model.ckpt`).

In [None]:
from deeppavlov.core.data.utils import download_decompress
import os
cased_bert_base_url = 'http://files.deeppavlov.ai/deeppavlov_data/bert/multi_cased_L-12_H-768_A-12.zip'
bert_dir = 'multi_cased_L-12_H-768_A-12'
BERT_CONFIG_PATH = os.path.join('model', bert_dir, 'bert_config.json')
BERT_MODEL_PATH = os.path.join('model', bert_dir, 'bert_model.ckpt')

BERT_VOCAB_PATH = os.path.join('model', bert_dir, 'vocab.txt')
download_decompress(cased_bert_base_url, 'model')

## Preprocessing

BERT uses subword tokenization, specifically WordPiece, which is closely related to Byte Pair Encoding [(BPE)](https://arxiv.org/abs/1508.07909). This technique allows the use of a small vocabulary while largely avoiding the out-of-vocabulary (OOV) problem: words missing from the vocabulary are split into known subwords.

Let's try the BERT tokenizer.

In [None]:
from bert_dp.tokenization import FullTokenizer

bert_tokenizer = FullTokenizer(vocab_file=BERT_VOCAB_PATH, do_lower_case=False)

In [None]:
bert_tokenizer.tokenize('Gobbledegook!')
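
To see how a word ends up split into pieces, here is a minimal sketch of greedy longest-match-first subword splitting, the strategy used by WordPiece. The function `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration; the real tokenizer works with the full `vocab.txt` and also handles punctuation and text cleanup.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example
toy_vocab = {'This', 'is', 'BE', '##RT', 'Go', '##bble', '##de', '##go', '##ok', '!'}
print(wordpiece_tokenize('BERT', toy_vocab))
print(wordpiece_tokenize('Gobbledegook', toy_vocab))
```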

The CoNLL dataset consists of tokens and tags. Following the original [BERT](https://arxiv.org/abs/1810.04805) paper, we mask every subword unit except the first one of each token, so no prediction is made for the masked subwords. For the following example:

    ['This', 'is', 'BERT']
    ['O',    'O',  'B-PER']
    
and tokenization

    ['This', 'is', 'BE', '##RT']
    
the tags must be:
    
    ['O', 'O', 'B-PER', 'X']
    
where `X` stands for mask.

Moreover, BERT uses the special start and stop tokens `[CLS]` and `[SEP]`. There are no predictions for these tokens, so they must be masked too. Finally, the input to the network should be the following:

    ['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]']
    
with tags

    ['X', 'O', 'O', 'B-PER', 'X', 'X']


Now you need to implement a function that performs subword tokenization and produces subword tokens and subword tags, with masking and special tokens as in the example above.

Input example:

    ['This', 'is', 'BERT']
    ['O',    'O',  'B-PER']
    
Output example:

    ['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]']
    ['X', 'O', 'O', 'B-PER', 'X', 'X']


In [None]:
def preprocess_tokens_and_tags(tokens, tags, tokenizer):
    ######################################
    ########## YOUR CODE HERE ############
    tokens_subword = ['[CLS]']
    tags_subword = ['X']
    for token, tag in zip(tokens, tags):
        subwords = tokenizer.tokenize(token)
        tokens_subword.extend(subwords)
        tags_subword.extend([tag] + ['X'] * (len(subwords) - 1))
    tokens_subword.append('[SEP]')
    tags_subword.append('X')
    ######################################
    return tokens_subword, tags_subword
    

In [None]:
tokens = ['This', 'is', 'BERT']
tags = ['O', 'O', 'B-PER']

subword_tokens, subword_tags = preprocess_tokens_and_tags(tokens, tags, bert_tokenizer)

assert subword_tokens == ['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]']
assert subword_tags == ['X', 'O', 'O', 'B-PER', 'X', 'X']

At inference time we need a function that processes only tokens. Make a separate function for this case.

In [None]:
def preprocess_tokens(tokens, tokenizer):
    ######################################
    ########## YOUR CODE HERE ############
    tokens_subword = ['[CLS]']
    for token in tokens:
        subwords = tokenizer.tokenize(token)
        tokens_subword.extend(subwords)
    tokens_subword.append('[SEP]')
    ######################################
    return tokens_subword

In [None]:
# TEST

tokens = ['This', 'is', 'BERT']

subword_tokens = preprocess_tokens(tokens, bert_tokenizer)

assert subword_tokens == ['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]']

### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: addresses the row in the embedding matrix for the current token;
- {tag}$\to${tag id}: defines the one-hot ground truth probability distribution vectors for computing the loss at the output of the network.

The token vocabulary is already provided by the BERT tokenizer. To make a vocabulary for tags we will use the [SimpleVocabulary](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/data/simple_vocab.py). 

We already have vocabulary for subword tokens:

In [None]:
bert_tokenizer.convert_tokens_to_ids(['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]'])

But we need a tag vocabulary to convert tags to indices. Let's first collect all tags that appear after subword tokenization:

In [None]:
subword_tags_total = []
for tokens, tags in dataset['train']:
    subword_tokens, subword_tags = preprocess_tokens_and_tags(tokens, tags, bert_tokenizer)
    subword_tags_total.extend(subword_tags)

And now fit the vocabulary:

In [None]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

tag_vocab = SimpleVocabulary(unk_token='O', save_path='model/tag_vocab.txt')
tag_vocab.fit(subword_tags_total)

Check the index to token dictionary of the vocabulary:

In [None]:
print(tag_vocab._i2t)

And call it

In [None]:
# Tag vocabulary works with batches, so, the input is a list of lists
tag_vocab([['X', 'O', 'O', 'B-PER', 'X', 'X']])
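
Conceptually, a tag vocabulary is just a pair of lookup tables. The sketch below builds them with the standard library; `build_tag_vocab` is a hypothetical helper, and the index order it produces (most frequent tag first) is an assumption that may differ from what `SimpleVocabulary` assigns.

```python
from collections import Counter


def build_tag_vocab(tags):
    """Return (tag -> id, id -> tag) lookups, most frequent tags first."""
    counts = Counter(tags)
    id_to_tag = [tag for tag, _ in counts.most_common()]
    tag_to_id = {tag: idx for idx, tag in enumerate(id_to_tag)}
    return tag_to_id, id_to_tag


sample_tags = ['X', 'O', 'O', 'B-PER', 'X', 'X']
tag_to_id, id_to_tag = build_tag_vocab(sample_tags)

indices = [tag_to_id[tag] for tag in sample_tags]
print(indices)
# Mapping the indices back recovers the original tags
print([id_to_tag[i] for i in indices])
```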

Before making the final class for input data preprocessing, we need to prepare binary masks that stop gradients from propagating from the special tokens and the masked subword units. The mask value must be 0 for the special tokens `[SEP]` and `[CLS]` and for masked subword units (all subtokens starting with `##`). For all other tokens it must be 1.

In [None]:
def subword_mask_from_tokens(tokens, tokenizer):
    mask = [0]
    for tok in tokens:
        subword_tokens = tokenizer.tokenize(tok)
        mask.extend([1] + [0] * (len(subword_tokens) - 1))
    mask.append(0)
    return mask

In [None]:
subword_mask_from_tokens(['This', 'is', 'BERT'], bert_tokenizer)

In [None]:
# Test for 
# ['This', 'is', 'BERT'] -> 
# ['[CLS]', 'This', 'is', 'BE', '##RT', '[SEP]']

assert subword_mask_from_tokens(['This', 'is', 'BERT'], bert_tokenizer) == [0, 1, 1, 1, 0, 0]

Finally, we make a class that performs all pre-processing: subword tokenization, conversion to indices, and preparing mask. It also performs zero padding for all indices and masks. In this class, input mask generation is added to mask attention on the paddings.

In [None]:
from deeppavlov.core.data.utils import zero_pad


class BertNerPreprocessor:
    def __init__(self, bert_vocab_file, tag_vocab, do_lower_case=False):
        self.bert_tokenizer = FullTokenizer(vocab_file=bert_vocab_file,
                                            do_lower_case=do_lower_case)
        self.tag_vocab = tag_vocab
        
    def __call__(self, tokens_batch, tags_batch=None):
        subword_tokens_batch = []
        subword_token_indices_batch = []
        subword_output_mask_batch = []
        subword_input_mask_batch = []
        if tags_batch is not None:
            subword_tags_batch = []
            subword_tags_indices_batch = []
            for tokens, tags in zip(tokens_batch, tags_batch):
                subword_tokens, subword_tags = preprocess_tokens_and_tags(tokens, tags, self.bert_tokenizer)
                
                subword_token_indices = self.bert_tokenizer.convert_tokens_to_ids(subword_tokens)
                subword_tag_indices = self.tag_vocab(subword_tags)
                subword_output_mask = subword_mask_from_tokens(tokens, self.bert_tokenizer)
                subword_input_mask = [1] * len(subword_tokens)
                
                subword_tokens_batch.append(subword_tokens)
                subword_token_indices_batch.append(subword_token_indices)
                subword_output_mask_batch.append(subword_output_mask)   
                subword_input_mask_batch.append(subword_input_mask)
                subword_tags_batch.append(subword_tags)
                subword_tags_indices_batch.append(subword_tag_indices)
                
            return (subword_tokens_batch, 
                    zero_pad(subword_token_indices_batch),
                    zero_pad(subword_input_mask_batch),
                    zero_pad(subword_output_mask_batch),
                    subword_tags_batch,
                    zero_pad(subword_tags_indices_batch))
        else:
            for tokens in tokens_batch:
                subword_tokens = preprocess_tokens(tokens, self.bert_tokenizer)
                
                subword_token_indices = self.bert_tokenizer.convert_tokens_to_ids(subword_tokens)
                subword_output_mask = subword_mask_from_tokens(tokens, self.bert_tokenizer)
                subword_input_mask = [1] * len(subword_tokens)
                
                
                
                subword_tokens_batch.append(subword_tokens)
                subword_token_indices_batch.append(subword_token_indices)
                subword_output_mask_batch.append(subword_output_mask)   
                subword_input_mask_batch.append(subword_input_mask)
                
            return (subword_tokens_batch, 
                    zero_pad(subword_token_indices_batch),
                    zero_pad(subword_input_mask_batch),
                    zero_pad(subword_output_mask_batch))
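
The zero padding applied to indices and masks can be pictured with a few lines of plain Python. This is only a sketch of the idea, not DeepPavlov's `zero_pad` (which returns a NumPy matrix); the helper name `pad_with_zeros` and the token ids are made up.

```python
def pad_with_zeros(batch):
    """Right-pad every list in the batch with zeros to the longest length."""
    max_len = max(len(row) for row in batch)
    return [row + [0] * (max_len - len(row)) for row in batch]


token_ids = [[101, 7, 8, 102], [101, 5, 6, 9, 10, 102]]  # made-up ids
input_mask = [[1] * len(row) for row in token_ids]

print(pad_with_zeros(token_ids))
# After padding, the input mask is 1 on real subwords and 0 on paddings
print(pad_with_zeros(input_mask))
```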

In [None]:
tokens_batch = [['Wow', '!'], ['BERT', 'is', 'here']]

tags_batch = [['O', 'O'], ['B-PER', 'O', 'O']]

preprocessor = BertNerPreprocessor(BERT_VOCAB_PATH, tag_vocab)

print('Train phase:\n\n')

(subword_tokens_batch, 
 subword_token_indices_batch,
 subword_input_mask_batch,
 subword_output_mask_batch,
 subword_tags_batch,
 subword_tags_indices_batch) = preprocessor(tokens_batch, tags_batch)

print(f'subword_tokens_batch: {subword_tokens_batch}\n')
print(f'subword_token_indices_batch: {subword_token_indices_batch}\n')
print(f'subword_input_mask_batch: {subword_input_mask_batch}\n')
print(f'subword_output_mask_batch: {subword_output_mask_batch}\n')
print(f'subword_tags_batch: {subword_tags_batch}\n')
print(f'subword_tags_indices_batch: {subword_tags_indices_batch}\n')

print('\n\nInference phase:\n\n')
(subword_tokens_batch, 
 subword_token_indices_batch,
 subword_input_mask_batch,
 subword_output_mask_batch) = preprocessor(tokens_batch)

print(f'subword_tokens_batch: {subword_tokens_batch}\n')
print(f'subword_token_indices_batch: {subword_token_indices_batch}\n')
print(f'subword_input_mask_batch: {subword_input_mask_batch}\n')
print(f'subword_output_mask_batch: {subword_output_mask_batch}\n')

### Dataset Iterator

Neural networks are usually trained with batches: every weight update is based on several sequences at once. The tricky part is that all sequences within a batch need to have the same length, so shorter sequences must be padded, and their tags must be padded likewise. Here the `BertNerPreprocessor` defined above takes care of this by zero-padding indices and masks, and the input mask lets the network ignore the padded positions. For batching itself we use the ready-made *DataLearningIterator* from the library to save time.

An important concept in batch generation is shuffling, which means taking samples from the dataset in random order. It is important to train on shuffled data because a large number of consecutive samples of the same class may result in poor model quality.
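
Batching with shuffling can be sketched as follows. This is an illustration under simple assumptions, not the library iterator: the function `gen_batches` here is a standalone toy, and its `seed` parameter is added only to make the example reproducible.

```python
import random


def gen_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield (tokens_batch, tags_batch) tuples, optionally in random order."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [samples[i] for i in order[start:start + batch_size]]
        # Regroup a list of (tokens, tags) pairs into two parallel tuples
        tokens_batch, tags_batch = zip(*chunk)
        yield tokens_batch, tags_batch


samples = [(['EU', 'rejects'], ['B-ORG', 'O']),
           (['Peter', 'Blackburn'], ['B-PER', 'I-PER']),
           (['lamb', '.'], ['O', 'O'])]

for tokens_batch, tags_batch in gen_batches(samples, batch_size=2, seed=0):
    print(tokens_batch, tags_batch)
```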

In [None]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

Create the dataset iterator from the loaded dataset

In [None]:
data_iterator = DataLearningIterator(dataset)

Try it out:

In [None]:
next(data_iterator.gen_batches(2, shuffle=True))

## Building the model

Here we will specify the network architecture based on TensorFlow building blocks. It's as fun and easy as a Lego set! We will create a BERT-based model for NER that produces a probability distribution over tags for each token in a sentence. A dense layer will be used on top to perform tag classification.  

For BERT model we need a number of placeholders:
- input_ids_ph - indices of subtokens
- input_masks_ph - attention mask (to not attend to paddings)
- token_types_ph - segment id (equals 0 for all inputs since we feed single sentences); it is left out of the exploratory cell below and defined in the final class
- is_train_ph - internal to BERT, determines dropout behaviour

In [None]:
import tensorflow as tf
tf.reset_default_graph()

input_ids_ph = tf.placeholder(shape=(None, None),
                              dtype=tf.int32,
                              name='token_indices_ph')
input_masks_ph = tf.placeholder(shape=(None, None),
                                dtype=tf.int32,
                                name='token_mask_ph')

is_train_ph = tf.placeholder_with_default(False, shape=[], name='is_train_ph')


Now we will assemble BERT model:

In [None]:
from bert_dp.modeling import BertConfig, BertModel


bert_config = BertConfig.from_json_file(BERT_CONFIG_PATH)

bert = BertModel(config=bert_config,
                 is_training=is_train_ph,
                 input_ids=input_ids_ph,
                 input_mask=input_masks_ph,
                 use_one_hot_embeddings=False)

bert_layers = bert.all_encoder_layers

Now we will try to get first layer hidden states for some random input:

In [None]:
import numpy as np

# Dummy data
batch_size = 2
seq_len = 3
vocab_size = len(bert_tokenizer.vocab)
feed_dict = {input_ids_ph: np.random.randint(vocab_size, size=[batch_size, seq_len]),
             input_masks_ph: np.ones([batch_size, seq_len], np.int32)}

first_layer = bert_layers[0]

with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    
    # Get activations for the first layer
    first_layer_activations = sess.run(first_layer, feed_dict)
    print('First layer hidden states: ')
    print(first_layer_activations)
    print(f'Shape: {first_layer_activations.shape}')

You can see that the last dimension is equal to 768. To perform classification we need to project the last dimension to an `n_classes`-dimensional space. The values after projection are unnormalized log-probabilities, or logits. In most cases the projection is done with a linear (dense) layer. Have a look at `tf.layers.dense` and project the first layer to the number of tag classes, which can be determined by `len(tag_vocab)`. 

In [None]:
logits = tf.layers.dense(first_layer, len(tag_vocab))


Finally, we need a loss function to train our model. A common loss for classification tasks is cross-entropy. Why classification? Because for each token the network must decide which tag to predict. The cross-entropy has the following form:

$$ H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x) $$

It measures the dissimilarity between the ground truth distribution over the classes and the predicted distribution. In most cases the ground truth distribution is one-hot. Luckily, this loss is already [implemented](https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits) in TensorFlow.
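
With a one-hot ground truth the expectation collapses to a single term, the negative log-probability of the true class. The following worked example in plain Python shows this; the helper names and the logit values are made up for illustration and do not come from TensorFlow.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, true_index):
    """Cross-entropy against a one-hot target: -log Q(true class)."""
    return -math.log(softmax(logits)[true_index])


logits = [2.0, 0.5, -1.0]  # made-up scores for three tags
print(sparse_cross_entropy(logits, 0))  # lower loss: the true tag has the top score
print(sparse_cross_entropy(logits, 2))  # higher loss: the true tag has the lowest score
```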

In [None]:
# The logits shape is [batch_size, seq_len, number of classes]
# So indices of the right classes should have shape [batch_size, seq_len]

# Dummy indices placeholder
indices = tf.placeholder(tf.int32, [batch_size, seq_len])

loss_tensor = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=indices, logits=logits)
print(loss_tensor)

All sentences in a batch must have the same length, so we pad each sentence to the maximal length. The paddings sit at the end of the sequences, and pushing the network to predict them usually degrades quality. We therefore multiply the loss tensor by a binary mask to prevent gradient flow from the paddings.

In [None]:
mask = tf.placeholder(tf.float32, shape=[batch_size, seq_len])
loss_tensor *= mask
print(loss_tensor)

The last step is to compute the mean value of the loss tensor:

In [None]:
loss = tf.reduce_mean(loss_tensor)
print(loss)

Now define your own function that returns a scalar masked cross-entropy loss:

In [None]:
def masked_cross_entropy(logits, label_indices, mask):
    loss_tensor = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=label_indices, logits=logits)
    loss_tensor *= mask
    loss = tf.reduce_mean(loss_tensor)
    return loss
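
Note that a plain mean divides by the total number of positions, masked ones included, so the scale of the loss depends on how much of the batch is masked. A common variant, shown here only as a sketch and not what the function above does, divides by the number of unmasked positions instead. The helper name and the loss values are made up.

```python
def masked_mean_loss(token_losses, mask):
    """Average per-token losses over the positions where mask == 1."""
    total = sum(loss * m
                for row_losses, row_mask in zip(token_losses, mask)
                for loss, m in zip(row_losses, row_mask))
    n_active = sum(sum(row) for row in mask)
    return total / max(n_active, 1)  # guard against an all-zero mask


token_losses = [[0.9, 0.2, 0.4, 0.7], [0.5, 0.1, 0.3, 0.8]]  # made-up values
mask = [[0, 1, 1, 0], [0, 1, 0, 0]]

# Plain mean over all 8 positions after masking
plain_mean = sum(loss * m
                 for row_losses, row_mask in zip(token_losses, mask)
                 for loss, m in zip(row_losses, row_mask)) / 8
print(plain_mean)
# Mean over the 3 unmasked positions only
print(masked_mean_loss(token_losses, mask))
```

Both versions give gradients that point in the same direction for a given batch; they differ by a scale factor that varies from batch to batch.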

## Make the final class

Put everything into a class: placeholders, the BERT model, loss function.

In [None]:
import numpy as np
import tensorflow as tf

class NerNetwork:
    def __init__(self,
                 bert_config_path,
                 pretrained_bert_model_path,
                 preprocessor,
                 **kwargs):
        self.preprocessor = preprocessor
        n_tags = len(self.preprocessor.tag_vocab)
        
        # ================ Building inputs =================
        
        self.init_placeholders()
        
        # ================== Building the network ==================
        
        # Build the BERT model and get the units from the last layer
        
        ######################################
        ########## YOUR CODE HERE ############
        bert_config = BertConfig.from_json_file(bert_config_path)

        self.bert = BertModel(config=bert_config,
                              input_ids=self.input_ids_ph,
                              input_mask=self.input_mask_ph,
                              token_type_ids=self.token_types_ph,
                              is_training=self.is_train_ph,
                              use_one_hot_embeddings=False)

        last_layer = self.bert.all_encoder_layers[-1]
        ######################################
        
        # Add dropout to the last layer units
        
        ######################################
        ########## YOUR CODE HERE ############
        units = tf.nn.dropout(last_layer, keep_prob=self.keep_prob_ph)
        ######################################
        
        # Keep the units after dropout (reassigning last_layer here would discard dropout)
        self.units = units
        with tf.variable_scope('NER'):
            self.logits = tf.layers.dense(units, n_tags, activation=None)
            
        self.predictions = tf.argmax(self.logits, 2)
        
        # ================= Loss and train ops =================
        # Use masked cross-entropy loss and output mask
        ######################################
        ########## YOUR CODE HERE ############
        self.loss = masked_cross_entropy(self.logits, self.y_ph, self.output_mask_ph)
        ######################################

        # Create a training operation to update the network parameters.
        # We suggest the Adam optimizer, as it works well in most cases.
        # Check tf.train to find an implementation.
        # Put the train operation into the attribute self.train_op
        
        ######################################
        ########## YOUR CODE HERE ############
        optimizer = tf.train.AdamOptimizer(self.learning_rate_ph)
        self.train_op = optimizer.minimize(self.loss)
        ######################################

        # ================= Initialize the session and load the bert model =================
        
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        
        # Restore pre-trained BERT. Load only the BERT variables and skip the new
        # variables specific to the NER task (they live in the 'NER' scope)
        all_vars = tf.trainable_variables()
        bert_vars = [var for var in all_vars if not var.name.startswith('NER')]
        self.restorer = tf.train.Saver(bert_vars)
        self.restorer.restore(self.sess, pretrained_bert_model_path)
        
        self.saver = tf.train.Saver()
        
    def init_placeholders(self):
        self.input_ids_ph = tf.placeholder(shape=(None, None),
                                           dtype=tf.int32,
                                           name='token_indices_ph')
        
        self.token_types_ph = tf.placeholder_with_default(tf.zeros_like(self.input_ids_ph, dtype=tf.int32),
                                                          shape=self.input_ids_ph.shape,
                                                          name='token_types_ph')
        self.input_mask_ph = tf.placeholder(shape=(None, None),
                                            dtype=tf.float32,
                                            name='input_mask_ph')

        self.y_ph = tf.placeholder(shape=(None, None),
                                   dtype=tf.int32,
                                   name='y_ph')
        self.output_mask_ph = tf.placeholder(shape=(None, None),
                                         dtype=tf.float32,
                                         name='output_mask_ph')
        
        self.learning_rate_ph = tf.placeholder_with_default(0.0, shape=[], name='learning_rate_ph')
        self.keep_prob_ph = tf.placeholder_with_default(1.0, shape=[], name='keep_prob_ph')
        self.is_train_ph = tf.placeholder_with_default(False, shape=[], name='is_train_ph')
        
    def save(self, model_path):
        self.saver.save(self.sess, model_path)

    def load(self, model_path):
        self.saver.restore(self.sess, model_path)
        
    def __call__(self, tok_batch):
        (subword_tokens_batch, 
         subword_token_indices_batch,
         subword_input_mask_batch,
         subword_output_mask_batch) = self.preprocessor(tok_batch)
        feed_dict = {self.input_ids_ph: subword_token_indices_batch,
                     self.input_mask_ph: subword_input_mask_batch,
                     self.keep_prob_ph: 1.0}
        predictions = self.sess.run(self.predictions, feed_dict)
        return predictions, subword_output_mask_batch

    def train_on_batch(self, tok_batch, tag_batch, dropout_keep_prob, learning_rate):
        (subword_tokens_batch, 
         subword_token_indices_batch,
         subword_input_mask_batch,
         subword_output_mask_batch,
         subword_tags_batch,
         subword_tags_indices_batch) = self.preprocessor(tok_batch, tag_batch)
        feed_dict = {self.input_ids_ph: subword_token_indices_batch,
                     self.y_ph: subword_tags_indices_batch,
                     self.input_mask_ph: subword_input_mask_batch,
                     self.output_mask_ph: subword_output_mask_batch,
                     self.keep_prob_ph: dropout_keep_prob,
                     self.learning_rate_ph: learning_rate}
        
        loss, _ = self.sess.run([self.loss, self.train_op], feed_dict)
        return loss


Now create an instance of the NerNetwork class:

In [None]:
tf.reset_default_graph()

nernet = NerNetwork(BERT_CONFIG_PATH,
                    BERT_MODEL_PATH,
                    preprocessor)

Check the network `train_on_batch` and `__call__` methods

In [None]:
tokens_batch, tags_batch = next(data_iterator.gen_batches(2, shuffle=True))

print(f'Tokens batch: {tokens_batch}')
print(f'Tags batch: {tags_batch}')
print(max(len(sent) for sent in tokens_batch))

In [None]:
predictions, subword_output_mask_batch = nernet(tokens_batch)
print('Predicted tags indices:')
print(predictions, predictions.shape)
print('Output mask:')
print(subword_output_mask_batch, subword_output_mask_batch.shape)

Now we need to drop the predictions for `[CLS]`, `[SEP]`, and masked subword units, and convert the remaining tag indices to strings.

In [None]:
def predicted_tag_indices_to_tags(tag_predictions_batch,
                                  subword_output_mask_batch,
                                  tag_vocab):
    tags_batch = []
    for tags_inds, mask in zip(tag_predictions_batch, subword_output_mask_batch):
        # Gather only non masked tags
        tags_indices = [t for t, m in zip(tags_inds, mask) if m > 0]
        tags = tag_vocab(tags_indices)
        tags_batch.append(tags)
    return tags_batch
    

In [None]:
print(predicted_tag_indices_to_tags(predictions, subword_output_mask_batch, tag_vocab))

We want to check the score on the validation part of the dataset regularly during training. In most NER tasks the classes are imbalanced, so accuracy is not the best measure of performance. If 95% of the tags are 'O', then a trivial classifier that always predicts 'O' gets 95% accuracy. To tackle this issue the F1-score is used. The F1-score can be defined as:

$$ F1 =  \frac{2 P R}{P + R}$$ 

where P is precision and R is recall.
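
As a quick worked example of the formula, the sketch below computes precision, recall, and F1 from raw counts of true positives, false positives, and false negatives. The helper `f1_from_counts` and the counts are hypothetical; the library metric used later extracts such counts from the tag sequences itself.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) given entity counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 8 entities found correctly, 2 spurious predictions, 4 entities missed
print(f1_from_counts(tp=8, fp=2, fn=4))
# A classifier that always predicts 'O' finds no entities at all
print(f1_from_counts(tp=0, fp=0, fn=12))
```

The second call shows why F1 is a better fit here: the always-'O' classifier scores zero despite its high accuracy.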

Here is the function that evaluates the network given a batch generator.

In [None]:
from deeppavlov.metrics.fmeasure import precision_recall_f1
# The function precision_recall_f1 takes two lists: y_true and y_predicted.
# The tag sequences of all sentences should be merged into one big list
from deeppavlov.core.data.utils import zero_pad
# zero_pad takes a batch of lists of token indices, pads it with zeros to the
# maximal length, and converts it to a numpy matrix
from itertools import chain


def eval_valid(network, batch_generator, tag_vocab):
    total_true = []
    total_pred = []
    for tokens, tags_true in batch_generator:
        
        # We call the instance of the NerNetwork because we have defined __call__ method
        predicted_tag_inds, subword_output_mask_batch = network(tokens)

        # For every sentence in the batch keep only the tags of unmasked positions
        tags_pred = predicted_tag_indices_to_tags(predicted_tag_inds, subword_output_mask_batch, tag_vocab)

        # Accumulate ground truth and predicted tags as flat lists
        total_true.extend(chain(*tags_true))
        total_pred.extend(chain(*tags_pred))
    res = precision_recall_f1(total_true, total_pred, print_results=True)
    return res

Set the hyperparameters. You might want to start with the following recommended values:
- *batch_size*: 8;
- *n_epochs*: 10;
- starting value of *learning_rate*: 3e-5;
- *learning_rate_decay*: the square root of 2;
- *dropout_keep_prob*: 0.7 for training (typical values range from 0.3 to 0.9).

A very efficient technique for learning rate management is dropping the learning rate after convergence. It is common to use dividers of 2, 3, or 10 to drop the learning rate. Note that the cell below starts from a more conservative learning rate of 1e-5 and a keep probability of 0.9, and keeps the learning rate fixed.
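
One possible way to drop the learning rate after convergence is to divide it by the decay factor whenever the validation score stops improving. The sketch below is an assumption about how this could be wired up, not part of the tutorial code: the function `decayed_learning_rate`, its `patience` parameter, and the score values are all hypothetical.

```python
import math


def decayed_learning_rate(learning_rate, validation_scores,
                          decay=math.sqrt(2), patience=1):
    """Divide the rate by `decay` if the last `patience` scores beat no earlier score."""
    if len(validation_scores) <= patience:
        return learning_rate  # not enough history to judge convergence
    best_before = max(validation_scores[:-patience])
    recent_best = max(validation_scores[-patience:])
    if recent_best <= best_before:
        return learning_rate / decay
    return learning_rate


learning_rate = 3e-5
history = []
for f1 in [60.0, 75.0, 74.0, 80.0, 79.5]:  # made-up validation F1 scores
    history.append(f1)
    learning_rate = decayed_learning_rate(learning_rate, history)
    print(f1, learning_rate)
```

In a training loop, the returned value would be passed as `learning_rate` to the next training step.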

In [None]:
batch_size = 8
n_epochs = 10
learning_rate = 1e-5
dropout_keep_prob = 0.9

evaluate_every_n_batches = 100

Now we iterate through the dataset batch by batch and pass the data to the train op:

In [None]:
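import os

best_validation_score = 0
model_path = 'model/bert_ner/model.ckpt'
# Make sure the checkpoint directory exists before the first save
os.makedirs(os.path.dirname(model_path), exist_ok=True)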

print('Start training:')
for epoch in range(n_epochs):
    print(f'Epoch: {epoch}')
    for n, (tokens_batch, tags_batch) in enumerate(data_iterator.gen_batches(batch_size, 'train')):
        
        nernet.train_on_batch(tokens_batch,
                              tags_batch,
                              dropout_keep_prob=dropout_keep_prob,
                              learning_rate=learning_rate)
        if n % evaluate_every_n_batches == evaluate_every_n_batches - 1:
            print('Evaluating the model on the valid part of the dataset')
            scores = eval_valid(nernet, data_iterator.gen_batches(batch_size, 'valid'), tag_vocab)
            f_1_score = scores['__total__']['f1']
            if f_1_score > best_validation_score:
                # Remember the best score so that only improved models are saved
                best_validation_score = f_1_score
                print(f'New best score: {f_1_score}, saving model to {model_path}')
                nernet.save(model_path)


In [None]:
nernet.load(model_path)
eval_valid(nernet, data_iterator.gen_batches(batch_size, 'valid'), tag_vocab)

Now evaluate the model on the test part:

In [None]:
eval_valid(nernet, data_iterator.gen_batches(batch_size, 'test'), tag_vocab)

Let's run the model on our own sentence:

In [None]:
sentence = 'My name is Bert'
# Uncomment to try a Russian sentence ("His name is Bert"):
# sentence = 'Его зовут Берт'

tokens = [sentence.split()]
predicted_tag_inds, subword_output_mask_batch = nernet(tokens)
predicted_tag_indices_to_tags(predicted_tag_inds, subword_output_mask_batch, tag_vocab)