## Replica of the Senna named entity recognizer

See the original article by Collobert et al. at http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf

In [1]:
%load_ext autoreload
%autoreload 2


In [2]:

import numpy as np
import constants as const
from senna_ner import SennaNER




## Named entity extractor model with word-level loss function:

Architecture details:
    1. Two-layer fully connected neural network with 300 ReLU units in the hidden layer.
    2. Loss function: ordinary cross-entropy on the named entity label of each token.
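
The two points above can be illustrated with a minimal NumPy sketch. It is not the code inside `senna_ner`, whose internals are not shown in this notebook. The function names, the input feature size (50) and the number of tags (17) are assumptions chosen for illustration; only the 300 hidden units and the ReLU activation come from the configuration used in this notebook.

```python
import numpy as np


def two_layer_logits(features, w_hidden, b_hidden, w_out, b_out):
    """Fully connected ReLU hidden layer followed by a linear output layer."""
    hidden = np.maximum(0.0, features @ w_hidden + b_hidden)
    return hidden @ w_out + b_out


def word_level_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens.

    logits: (n_tokens, n_tags) unnormalized scores; labels: (n_tokens,) tag indices.
    """
    labels = np.asarray(labels)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(labels.shape[0]), labels].mean()


# Hypothetical sizes: only the 300 hidden units come from the notebook.
n_tokens, n_features, n_hidden, n_tags = 6, 50, 300, 17
rng = np.random.default_rng(0)
features = rng.normal(size=(n_tokens, n_features))
w_hidden = rng.normal(scale=0.1, size=(n_features, n_hidden))
w_out = rng.normal(scale=0.1, size=(n_hidden, n_tags))
logits = two_layer_logits(features, w_hidden, np.zeros(n_hidden), w_out, np.zeros(n_tags))
labels = rng.integers(0, n_tags, size=n_tokens)
print(word_level_cross_entropy(logits, labels))
```

Each token is scored independently here, so the loss never looks at neighbouring tags or at sentence boundaries.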

In [3]:
# Train a named entity recognizer using the word-level loss function:

loss_type = 'word_level'

nn_design_list = [{'layer_type': 'full_cn', 'output_shape': [300], 'activation': 'relu'},
                  {'layer_type': 'full_cn', 'output_shape': [len(const.LIST_KEYWORD_TAGS)]}]

model_details_dict = {'project_name': 'reuters',
                      'loss_type': loss_type,
                      'nn_design_list': nn_design_list,
                      'model_name': 'reuters_model'}

with SennaNER(model_details_dict) as senna_ner:
    
    word_level_model_name = senna_ner.model_name
    word_level_project_name = senna_ner.project_name
    
    senna_ner.train_model(num_training_epochs = 10)
    

Finished mapping all tokens in the input data to vectors.
Total time for data curation: 11.879


Epoch 1 of 10 completed.
------------------------------------------------
------------------------------------------------


Time elapsed during training so far    : 26.288
Time elapsed since the last checkpoint : 26.288
Time elapsed for the last epoc training: 12.72


Cumulative train loss over the last epoc: 0.079
Cumulative testa loss over the last epoc: 0.105
Cumulative testb loss over the last epoc: 0.128


Cumulative train accuracy over the last epoc: 0.977
Cumulative testa accuracy over the last epoc: 0.971
Cumulative testb accuracy over the last epoc: 0.963


Cumulative train precision over the last epoc: 0.911
Cumulative te



Epoch 9 of 10 completed.
------------------------------------------------
------------------------------------------------


Time elapsed during training so far    : 125.446
Time elapsed since the last checkpoint : 13.284
Time elapsed for the last epoc training: 13.202


Cumulative train loss over the last epoc: 0.012
Cumulative testa loss over the last epoc: 0.125
Cumulative testb loss over the last epoc: 0.185


Cumulative train accuracy over the last epoc: 0.997
Cumulative testa accuracy over the last epoc: 0.978
Cumulative testb accuracy over the last epoc: 0.969


Cumulative train precision over the last epoc: 0.988
Cumulative testa precision over the last epoc: 0.92
Cumulative testb precision over the last epoc: 0.885


Cumulative train recall over the last epoc: 0.978
Cumulative testa recall over the last epoc: 0.879
Cumulative testb recall over the last epoc: 0.845


Cumulative train f1_score over the last epoc: 0.983
Cumulative testa f1_score over the last epoc: 0.899
Cumula

In [4]:
# Inspect the computational graph with TensorBoard. The word-level loss function does not use the
# sentence breaks, so they appear disconnected from the rest of the graph:

model_details_dict = {'model_name': word_level_model_name,
                      'project_name': word_level_project_name}

with SennaNER(model_details_dict) as senna_ner:
    
    senna_ner.show_graph()
    

INFO:tensorflow:Restoring parameters from /Users/smflores/Repos/senna-replica/reuters/models/reuters_model_2019-03-25T22-03-18.642335/reuters_model_2019-03-25T22-03-18.642335


In [5]:
# Now deploy the model we just trained to recognize named entities in the following text:

text = 'George Washington was the first president of the United States. John Adams was the second, and Thomas Jefferson was the third.'

model_details_dict = {'model_name': word_level_model_name,
                      'project_name': word_level_project_name}

with SennaNER(model_details_dict) as senna_ner:
    
    print('\n')
    
    labeled_sentences_list = senna_ner.get_named_entity_labels(text)
    for labeled_tok_list in labeled_sentences_list:
        for labeled_token in labeled_tok_list:
            print(labeled_token)
        print('\n')


INFO:tensorflow:Restoring parameters from /Users/smflores/Repos/senna-replica/reuters/models/reuters_model_2019-03-25T22-03-18.642335/reuters_model_2019-03-25T22-03-18.642335


George (B-PER)
Washington (E-PER)
was (O-)
the (O-)
first (O-)
president (O-)
of (O-)
the (O-)
United (B-LOC)
States (E-LOC)
. (O-)


John (B-PER)
Adams (E-PER)
was (O-)
the (O-)
second (O-)
, (O-)
and (O-)
Thomas (B-PER)
Jefferson (E-PER)
was (O-)
the (O-)
third (O-)
. (O-)
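
The printed labels look like a begin/inside/end/single/outside tagging scheme (`B-`, `E-`, `O-` appear above; `I-` and `S-` are assumed). As a rough sketch, the per-token output can be grouped into whole entities. It assumes each labeled token is a string of the form `token (TAG)`, as printed above. The actual return type of `get_named_entity_labels` is not shown here, and `group_entities` is a hypothetical helper, not part of `senna_ner`.

```python
import re

# Matches lines such as "George (B-PER)" or ". (O-)".
TOKEN_PATTERN = re.compile(r"^(?P<token>.+) \((?P<tag>[^()]*)\)$")


def group_entities(labeled_tokens):
    """Group 'token (TAG)' strings into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for line in labeled_tokens:
        match = TOKEN_PATTERN.match(line)
        if match is None:
            continue  # skip anything that does not look like 'token (TAG)'
        token, tag = match.group("token"), match.group("tag")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":  # single-token entity (assumed tag)
            entities.append((token, entity_type))
            current_tokens, current_type = [], None
        elif prefix == "B":  # start of a multi-token entity
            current_tokens, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current_tokens.append(token)
            if prefix == "E":  # end of the entity: emit it
                entities.append((" ".join(current_tokens), entity_type))
                current_tokens, current_type = [], None
        else:  # outside tag or inconsistent sequence: drop any partial entity
            current_tokens, current_type = [], None
    return entities


sentence = ["George (B-PER)", "Washington (E-PER)", "was (O-)", "the (O-)",
            "first (O-)", "president (O-)", "of (O-)", "the (O-)",
            "United (B-LOC)", "States (E-LOC)", ". (O-)"]
print(group_entities(sentence))
# → [('George Washington', 'PER'), ('United States', 'LOC')]
```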




## Named entity extractor model with sentence-level loss function:

Architecture details:
    1. Two-layer fully connected neural network with 300 ReLU units in the hidden layer.
    2. Loss function: a modified cross-entropy loss evaluated over whole sentences, with a conditional
       random field over pairs of consecutive tags.
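
A sentence-level loss of this kind can be sketched in NumPy as the negative log-likelihood of the gold tag sequence, in the spirit of the sentence-level log-likelihood in Collobert et al. The score of a tag path is the sum of the per-token network scores and the transition scores between consecutive tags, and the normalizer sums over all paths with a forward recursion. This is an illustrative sketch, not the implementation in `senna_ner`. The names `emissions`, `transitions` and `start` are assumptions.

```python
import numpy as np


def logsumexp(values, axis):
    """Numerically stable log(sum(exp(values))) along one axis."""
    peak = values.max(axis=axis, keepdims=True)
    return np.squeeze(peak, axis=axis) + np.log(np.exp(values - peak).sum(axis=axis))


def sentence_level_loss(emissions, transitions, start, tags):
    """Negative log-likelihood of one tagged sentence.

    emissions:   (n_tokens, n_tags) network scores for each token and tag
    transitions: (n_tags, n_tags) score for moving from tag i to tag j
    start:       (n_tags,) score for starting the sentence in each tag
    tags:        gold tag indices, one per token
    """
    n_tokens = emissions.shape[0]

    # Score of the gold path: start score, emissions, and transitions along the path.
    gold_score = start[tags[0]] + emissions[0, tags[0]]
    for i in range(1, n_tokens):
        gold_score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]

    # Forward recursion: alpha[j] is the log-sum of scores of all paths ending in tag j.
    alpha = start + emissions[0]
    for i in range(1, n_tokens):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[i]
    log_partition = logsumexp(alpha, axis=0)

    return log_partition - gold_score


rng = np.random.default_rng(0)
n_tokens, n_tags = 5, 4  # hypothetical sizes for the demonstration
demo_loss = sentence_level_loss(rng.normal(size=(n_tokens, n_tags)),
                                rng.normal(size=(n_tags, n_tags)),
                                rng.normal(size=n_tags),
                                [0, 1, 1, 2, 3])
print(demo_loss)
```

Because the normalizer runs over entire tag sequences, this loss needs to know where each sentence starts and ends, unlike the word-level loss.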

In [6]:

# Train a named entity recognizer using the sentence-level loss function:

# Disclaimer: the sentence-level loss function should raise the final F1 score on the testb.txt test set
# by about 2% over the word-level loss function. That does not happen in this run, although it did in an
# earlier version of this code. This still needs investigation and debugging.

loss_type = 'sentence_level'

nn_design_list = [{'layer_type': 'full_cn', 'output_shape': [300], 'activation': 'relu'},
                  {'layer_type': 'full_cn', 'output_shape': [len(const.LIST_KEYWORD_TAGS)]}]

model_details_dict = {'project_name': 'reuters',
                      'loss_type': loss_type,
                      'nn_design_list': nn_design_list,
                      'model_name': 'reuters_model'}

with SennaNER(model_details_dict) as senna_ner:
    
    sent_level_model_name = senna_ner.model_name
    sent_level_project_name = senna_ner.project_name
    
    senna_ner.train_model(num_training_epochs = 10)
    

Finished mapping all tokens in the input data to vectors.
Total time for data curation: 13.498


Epoch 1 of 10 completed.
------------------------------------------------
------------------------------------------------


Time elapsed during training so far    : 102.907
Time elapsed since the last checkpoint : 102.907
Time elapsed for the last epoc training: 86.836


Cumulative train loss over the last epoc: 1.1
Cumulative testa loss over the last epoc: 1.597
Cumulative testb loss over the last epoc: 1.59


Cumulative train accuracy over the last epoc: 0.977
Cumulative testa accuracy over the last epoc: 0.969
Cumulative testb accuracy over the last epoc: 0.963


Cumulative train precision over the last epoc: 0.905
Cumulative testa precision over the last epoc: 0.878
Cumulative testb precision over the last epoc: 0.85


Cumulative train recall over the last epoc: 0.876
Cumulative testa recall over the last epoc: 0.847
Cumulative testb reca



Epoch 9 of 10 completed.
------------------------------------------------
------------------------------------------------


Time elapsed during training so far    : 710.397
Time elapsed since the last checkpoint : 70.785
Time elapsed for the last epoc training: 70.662


Cumulative train loss over the last epoc: 0.125
Cumulative testa loss over the last epoc: 1.369
Cumulative testb loss over the last epoc: 1.816


Cumulative train accuracy over the last epoc: 0.997
Cumulative testa accuracy over the last epoc: 0.978
Cumulative testb accuracy over the last epoc: 0.97


Cumulative train precision over the last epoc: 0.985
Cumulative testa precision over the last epoc: 0.904
Cumulative testb precision over the last epoc: 0.872


Cumulative train recall over the last epoc: 0.983
Cumulative testa recall over the last epoc: 0.894
Cumulative testb recall over the last epoc: 0.865


Cumulative train f1_score over the last epoc: 0.984
Cumulative testa f1_score over the last epoc: 0.899
Cumula

In [7]:
# Inspect the computational graph with TensorBoard. The sentence-level loss function uses the sentence
# breaks, so here they are connected to the rest of the graph:

model_details_dict = {'model_name': sent_level_model_name,
                      'project_name': sent_level_project_name}

with SennaNER(model_details_dict) as senna_ner:
    
    senna_ner.show_graph()
    

INFO:tensorflow:Restoring parameters from /Users/smflores/Repos/senna-replica/reuters/models/reuters_model_2019-03-25T22-13-14.581575/reuters_model_2019-03-25T22-13-14.581575


In [8]:
# Now deploy the model we just trained to recognize named entities in the following text:

text = 'George Washington was the first president of the United States. John Adams was the second, and Thomas Jefferson was the third.'

model_details_dict = {'model_name': sent_level_model_name,
                      'project_name': sent_level_project_name}

with SennaNER(model_details_dict) as senna_ner:
    
    print('\n')
    
    labeled_sentences_list = senna_ner.get_named_entity_labels(text)
    for labeled_tok_list in labeled_sentences_list:
        for labeled_token in labeled_tok_list:
            print(labeled_token)
        print('\n')


INFO:tensorflow:Restoring parameters from /Users/smflores/Repos/senna-replica/reuters/models/reuters_model_2019-03-25T22-13-14.581575/reuters_model_2019-03-25T22-13-14.581575


George (B-PER)
Washington (E-PER)
was (O-)
the (O-)
first (O-)
president (O-)
of (O-)
the (O-)
United (B-LOC)
States (E-LOC)
. (O-)


John (B-PER)
Adams (E-PER)
was (O-)
the (O-)
second (O-)
, (O-)
and (O-)
Thomas (B-PER)
Jefferson (E-PER)
was (O-)
the (O-)
third (O-)
. (O-)


