https://github.com/explosion/spaCy/blob/master/LICENSE

A simple example of training a part-of-speech tagger with a custom tag map.
To use our own tag map, this example starts off with a blank Language class
and adds the custom tags to a new tagger component. For more details, see
the documentation:
* Training: https://spacy.io/usage/training#section-tagger-parser
* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging

In [1]:
import random
from pathlib import Path
import spacy

In [2]:
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
# See here for the Universal Tag Set:
# http://universaldependencies.github.io/docs/u/pos/index.html
# You may also specify morphological features for your tags, from the universal
# scheme.
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'}
}
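
A quick sanity check on the tag map can catch typos in the coarse-grained tags before any training starts. The sketch below is plain Python and does not call spaCy. The names `UNIVERSAL_POS` and `invalid_pos_entries` are introduced here for illustration only, and the set is assumed to be the 17 tags of the Universal Dependencies v2 scheme.

```python
# Hypothetical helper, not part of spaCy: checks a tag map against the
# 17 coarse-grained tags of the Universal Dependencies (v2) POS tag set.
UNIVERSAL_POS = {
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X',
}


def invalid_pos_entries(tag_map):
    """Return the custom tags whose 'pos' value is missing or not in the set."""
    return sorted(
        tag for tag, values in tag_map.items()
        if values.get('pos') not in UNIVERSAL_POS
    )


# Same mapping as in the cell above, repeated so the sketch runs on its own.
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
}

# An empty list means every custom tag maps to a known coarse-grained tag.
print(invalid_pos_entries(TAG_MAP))
```

A misspelled value such as `'NOUNN'` would show up in the returned list under its custom tag name.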

In [3]:
# Usually you'll read this in, of course. Data formats vary. Ensure your
# strings are unicode and that the number of tags assigned matches spaCy's
# tokenization. If not, you can always add a 'words' key to the annotations
# that specifies the gold-standard tokenization, e.g.:
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]
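
The requirement that the number of tags matches the tokenization can be checked before training. The following is a minimal sketch rather than spaCy functionality: `misaligned_examples` and `SAMPLE_DATA` are hypothetical names, and whitespace splitting stands in for spaCy's tokenizer, an assumption that only holds for simple texts like the ones used here.

```python
# Hypothetical helper, not part of spaCy. Whitespace splitting is used as a
# stand-in for spaCy's tokenizer, which is an assumption made for this sketch.
def misaligned_examples(train_data):
    """Return the texts whose tag count differs from their token count."""
    bad = []
    for text, annotations in train_data:
        # Prefer the gold-standard tokenization when a 'words' key is given.
        words = annotations.get('words', text.split())
        if len(words) != len(annotations['tags']):
            bad.append(text)
    return bad


# The two training examples from above, plus the 'words' variant described
# in the comment.
SAMPLE_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']}),
    ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']}),
]

# An empty list means every example has one tag per token.
print(misaligned_examples(SAMPLE_DATA))
```

For real data, the whitespace split would need to be replaced by the tokenization the pipeline actually produces.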

Create a new model, set up the pipeline and train the tagger. To train the
tagger with a custom tag map, we start from a blank Language instance and
add each custom tag, together with its mapped values, to the tagger.

In [4]:
nlp = spacy.blank('en')

In [5]:
# create the tagger component (it is added to the pipeline after labelling)
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe('tagger')

In [6]:
# Add the tags. This needs to be done before you start training.
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)

In [7]:
nlp.add_pipe(tagger)

In [8]:
optimizer = nlp.begin_training()
for i in range(25):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)

{'tagger': 0.5731515735387802}
{'tagger': 0.5486934930086136}
{'tagger': 0.4483537822961807}
{'tagger': 0.2599456459283829}
{'tagger': 0.11532417312264442}
{'tagger': 0.030577277764678}
{'tagger': 0.0038234422099776566}
{'tagger': 0.00029430676659103483}
{'tagger': 3.885022488248069e-05}
{'tagger': 5.859134262209409e-06}
{'tagger': 1.3233649269750458e-06}
{'tagger': 4.0967667302993505e-07}
{'tagger': 1.5634972783118428e-07}
{'tagger': 7.014039837827113e-08}
{'tagger': 3.6199250708079944e-08}
{'tagger': 2.043198144008329e-08}
{'tagger': 1.296029239483687e-08}
{'tagger': 8.569453147089234e-09}
{'tagger': 6.100724370128319e-09}
{'tagger': 4.687360499744386e-09}
{'tagger': 3.618845445529928e-09}
{'tagger': 3.0014332130789967e-09}
{'tagger': 2.4642485829673433e-09}
{'tagger': 2.155601586117939e-09}
{'tagger': 1.872014654402676e-09}


In [9]:
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

Tags [('I', 'N', 'NOUN'), ('like', 'V', 'VERB'), ('blue', 'J', 'ADJ'), ('eggs', 'N', 'NOUN')]
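
The coarse-grained `pos_` values in this output should follow from looking up each predicted tag in the tag map. The sketch below checks that in plain Python, using the printed output copied from above. `pos_mismatches` and `predicted` are names introduced here for illustration, not part of spaCy.

```python
# Same mapping as defined earlier, repeated so the sketch runs on its own.
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
}

# (text, tag_, pos_) triples copied from the printed output above.
predicted = [
    ('I', 'N', 'NOUN'),
    ('like', 'V', 'VERB'),
    ('blue', 'J', 'ADJ'),
    ('eggs', 'N', 'NOUN'),
]


def pos_mismatches(triples, tag_map):
    """Return the triples whose pos value disagrees with the tag map lookup."""
    return [
        (text, tag, pos)
        for text, tag, pos in triples
        if tag_map.get(tag, {}).get('pos') != pos
    ]


# An empty list means every pos value agrees with the tag map.
print(pos_mismatches(predicted, TAG_MAP))
```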
