https://github.com/explosion/spaCy/blob/master/LICENSE

This example shows how to add a multi-task objective that is trained
alongside the entity recognizer. This is an alternative to adding features
to the model.

The multi-task idea is to train an auxiliary model to predict some attribute,
with weights shared between the auxiliary model and the main model. In this
example, we're predicting the position of the word in the document.

The model that predicts the position of the word encourages the convolutional
layers to include the position information in their representation. The
information is then available to the main model, as a feature.

The overall idea is that we might know something about what sort of features
we'd like the CNN to extract. The multi-task objectives can encourage the
extraction of this type of feature. The multi-task objective is only used
during training. We discard the auxiliary model before run-time.
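
As a rough illustration of the weight sharing described above, here is a toy scalar sketch. It is not how spaCy implements the objective: the names `w_shared`, `w_main` and `w_aux`, the squared-error loss and the learning rate are all assumptions made for this example. It only shows that the shared weight receives gradient from both tasks, and that the auxiliary head is not needed for prediction.

```python
# Toy scalar model (hypothetical, not spaCy's implementation):
# one shared weight feeds two task-specific heads.

def forward(x, w_shared, w_main, w_aux):
    h = w_shared * x  # shared representation, standing in for the CNN
    return h, w_main * h, w_aux * h


def shared_gradients(x, y_main, y_aux, w_shared, w_main, w_aux):
    """Per-task gradients of the squared-error losses w.r.t. w_shared."""
    _, main_pred, aux_pred = forward(x, w_shared, w_main, w_aux)
    grad_from_main = 2 * (main_pred - y_main) * w_main * x
    grad_from_aux = 2 * (aux_pred - y_aux) * w_aux * x
    return grad_from_main, grad_from_aux


def predict(x, w_shared, w_main):
    """Run-time prediction: only the shared weight and the main head are used."""
    return w_main * (w_shared * x)


x, y_main, y_aux = 1.0, 2.0, 1.0
w_shared, w_main, w_aux = 0.5, 1.0, 1.0

g_main, g_aux = shared_gradients(x, y_main, y_aux, w_shared, w_main, w_aux)
# Both tasks contribute to the update of the shared weight.
w_shared -= 0.1 * (g_main + g_aux)
```

After the update, `predict` uses the adjusted `w_shared` without ever touching `w_aux`, which mirrors discarding the auxiliary model before run-time.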

Predicting word position is not necessarily a useful objective in itself, but
it shows how an arbitrary per-word objective function can be used.

In [1]:
import random
import spacy
# spacy.gold is the spaCy v2 location of these helpers (the module was removed in v3)
from spacy.gold import read_json_file, GoldParse

In [2]:
random.seed(0)

In [3]:
TRAIN_DATA = list(read_json_file('../data/training-data.json'))

In [4]:
def get_position_label(i, words, tags, heads, labels, ents):
    '''Return a label indicating the position of word i in the document.

    Only i and words are used; the remaining arguments are ignored.
    '''
    if len(words) < 20:
        return 'short-doc'
    elif i == 0:
        return 'first-word'
    elif i < 10:
        return 'early-word'
    elif i < 20:
        return 'mid-word'
    elif i == len(words)-1:
        return 'last-word'
    else:
        return 'late-word'
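
To see which labels the function above produces, the following sketch applies it to two made-up documents. The helper `label_document` and the placeholder word lists are hypothetical and exist only for this illustration. The label function is repeated so the snippet runs on its own.

```python
def get_position_label(i, words, tags, heads, labels, ents):
    """Same logic as the notebook cell, repeated so this sketch is self-contained."""
    if len(words) < 20:
        return 'short-doc'
    elif i == 0:
        return 'first-word'
    elif i < 10:
        return 'early-word'
    elif i < 20:
        return 'mid-word'
    elif i == len(words) - 1:
        return 'last-word'
    else:
        return 'late-word'


def label_document(words):
    # Hypothetical helper: only i and words affect the label,
    # so the other arguments are passed as None.
    return [get_position_label(i, words, None, None, None, None)
            for i in range(len(words))]


long_doc = ['w%d' % i for i in range(25)]   # placeholder tokens
short_doc = ['w%d' % i for i in range(5)]

long_labels = label_document(long_doc)
short_labels = label_document(short_doc)
print(long_labels)
print(short_labels)
```

The branches are checked in order, so in a document of exactly 20 words the final token (index 19) is labelled 'mid-word' rather than 'last-word'.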

In [5]:
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
ner.add_multitask_objective(get_position_label)
nlp.add_pipe(ner)

In [6]:
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
for itn in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annot_brackets in TRAIN_DATA:
        annotations, _ = annot_brackets
        doc = nlp.make_doc(text)
        gold = GoldParse.from_annot_tuples(doc, annotations[0])
        nlp.update(
            [doc],  # batch of Doc objects
            [gold],  # batch of GoldParse annotations
            drop=0.2,  # dropout - make it harder to memorise data
            sgd=optimizer,  # callable to update weights
            losses=losses)
    print(losses.get('nn_labeller', 0.0), losses['ner'])

0.0 38.86346965752
0.0 25.92212019408212
0.0 29.088292832429502
0.0 21.533498413347996
0.0 17.81456380731507
0.0 17.797718213371915
0.0 12.751342002763954
0.0 14.553385668152174
0.0 7.800720825029798
0.0 7.32101260258942


In [7]:
for text, _ in TRAIN_DATA:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
    # ent_iob is an integer code: 0 = unset, 1 = inside, 2 = outside, 3 = begins an entity
    print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

Entities [('Oct. 19', 'DATE'), ('The Misanthrope', 'WORK_OF_ART'), ('Chicago', 'GPE'), ('Goodman Theatre', 'FAC'), ('Revitalized Classics Take the', 'WORK_OF_ART'), ('Leisure & Arts', 'ORG'), ('Celimene', 'PERSON'), ('Kim Cattrall', 'PERSON'), ('Christina Haag', 'PERSON')]
Tokens [('In', '', 2), ('an', '', 2), ('Oct.', 'DATE', 3), ('19', 'DATE', 1), ('review', '', 2), ('of', '', 2), ('"', '', 2), ('The', 'WORK_OF_ART', 3), ('Misanthrope', 'WORK_OF_ART', 1), ('"', '', 2), ('at', '', 2), ('Chicago', 'GPE', 3), ("'s", '', 2), ('Goodman', 'FAC', 3), ('Theatre', 'FAC', 1), ('(', '', 2), ('"', '', 2), ('Revitalized', 'WORK_OF_ART', 3), ('Classics', 'WORK_OF_ART', 1), ('Take', 'WORK_OF_ART', 1), ('the', 'WORK_OF_ART', 1), ('Stage', '', 2), ('in', '', 2), ('Windy', '', 2), ('City', '', 2), (',', '', 2), ('"', '', 2), ('Leisure', 'ORG', 3), ('&', 'ORG', 1), ('Arts', 'ORG', 1), (')', '', 2), (',', '', 2), ('the', '', 2), ('role', '', 2), ('of', '', 2), ('Celimene', 'PERSON', 3), (',', '', 2), ('