# 1. Create Training Data

spaCy’s rule-based Matcher is a quick way to create training data for named entity models. A list of sentences is available as the variable TEXT. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models so that we can create training data to teach a model to recognize them as 'GADGET'.

1. Write a pattern for two tokens whose lowercase forms match 'iphone' and 'x'.
2. Write a pattern for two tokens: one token whose lowercase form matches 'iphone' and an optional digit using the '?' operator.

In [2]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

In [8]:
TEXT = ['How to preorder the iPhone X',
        'iPhone X is coming',
        'Should I pay $1,000 for the iPhone X?',
        'The iPhone 8 reviews are here',
        'Your iPhone goes up to 11 today',
        'I need a new phone! Any tips?']
nlp = English()

In [10]:
matcher = Matcher(nlp.vocab)

In [11]:
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

In [12]:
# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

In [13]:
# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

### Bootstrap a set of training examples

1. Create a doc object for each text using nlp.pipe.
2. Match on the doc and create a list of matched spans.
3. Get (start character, end character, label) tuples of matched spans.
4. Format each example as a tuple of the text and a dict, mapping 'entities' to the entity tuples.
5. Append the example to TRAINING_DATA and inspect the printed data.
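
Step 3 works with character offsets rather than token indices, so a quick sanity check is to slice each text with its offsets and confirm that the slice is the intended mention. The sketch below is plain Python and needs no spaCy. The helper name `entity_texts` is hypothetical, and the sample tuples are written by hand in the (text, {'entities': [...]}) format described in step 4.

```python
def entity_texts(example):
    """Return the substring covered by each (start, end, label) tuple."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

# Hand-written examples in the (text, {"entities": [...]}) format
examples = [
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

for example in examples:
    print(entity_texts(example))
```

If a slice comes back truncated or shifted, the offsets were probably computed on a different version of the text than the one stored in the example.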

In [27]:
TRAINING_DATA = []

In [28]:
# Create a Doc object for each text in TEXT
for doc in nlp.pipe(TEXT):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

In [31]:
print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
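
In the printed data, every 'iPhone X' sentence carries two overlapping spans: pattern1 matches 'iPhone X', and pattern2 matches the bare 'iPhone' because 'X' is not a digit. A token can belong to only one entity, so it is usually worth keeping a single span per mention before training. The following is a minimal plain-Python sketch that keeps the longest span and drops anything overlapping it. `drop_overlaps` is a hypothetical helper, not part of spaCy; for Span objects, spaCy's own spacy.util.filter_spans applies a similar longest-first rule.

```python
def drop_overlaps(entities):
    """Keep the longest spans; drop any span overlapping one already kept."""
    # Longest first; ties are broken by the earlier start offset
    ordered = sorted(entities, key=lambda ent: (ent[0] - ent[1], ent[0]))
    kept = []
    for start, end, label in ordered:
        # Keep a span only if it ends before, or starts after, every kept span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Entities copied from the first training example
entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(drop_overlaps(entities))
```

Preferring the longest span is a design choice: it keeps 'iPhone X' as one GADGET mention instead of labelling only the 'iPhone' part.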


# 2. Training the Neural Model

### Setting up the Pipeline

1. Create a blank 'en' model, for example using the spacy.blank method.
2. Create a new entity recognizer using nlp.create_pipe and add it to the pipeline.
3. Add the new label 'GADGET' to the entity recognizer using the add_label method on the pipeline component.

In [33]:
import spacy

In [34]:
nlp = spacy.blank('en')

In [35]:
ner = nlp.create_pipe('ner')

In [36]:
nlp.add_pipe(ner)

In [37]:
ner.add_label('GADGET')

### Build a Training Loop

The small set of labelled examples that you’ve created previously is available as TRAINING_DATA. To see the examples, you can print them in your script.

1. Call nlp.begin_training, create a training loop for 10 iterations and shuffle the training data.
2. Create batches of training data using spacy.util.minibatch and iterate over the batches.
3. Convert the (text, annotations) tuples in the batch to lists of texts and annotations.
4. For each batch, use nlp.update to update the model with the texts and annotations.
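
Steps 2 and 3 can be illustrated without spaCy. The sketch below uses a hypothetical `simple_minibatch` helper as a stand-in for spacy.util.minibatch (it only mimics fixed-size batching) and shows how a batch of (text, annotations) tuples is split into two parallel lists. The sample data is hand-written.

```python
import random

def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items (illustrative stand-in)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

data = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

random.shuffle(data)  # reorder in place so batches differ between iterations
for batch in simple_minibatch(data, size=2):
    # zip(*batch) transposes [(text, ann), ...] into (texts, annotations)
    texts, annotations = map(list, zip(*batch))
    print(len(texts), len(annotations))
```

Keeping texts and annotations in the same order matters, since the i-th annotation dict is applied to the i-th text.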

In [40]:
import random

In [38]:
nlp.begin_training()

<thinc.neural.optimizers.Optimizer at 0x7fbf31a589b0>

In [44]:
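for itn in range(10):
    # Shuffle so each iteration sees the examples in a different order
    random.shuffle(TRAINING_DATA)
    # Reset the loss record at the start of every iteration
    losses = {}
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Split the (text, annotations) tuples into two parallel lists
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        # Update the model; the batch loss is added to `losses`
        nlp.update(texts, annotations, losses=losses)
        print(losses)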

{'ner': 12.799999833106995}
{'ner': 24.067995309829712}
{'ner': 31.91904306411743}
{'ner': 9.710038661956787}
{'ner': 13.813634246587753}
{'ner': 17.50754678249359}
{'ner': 3.1633437052369118}
{'ner': 4.610809381119907}
{'ner': 7.603855360648595}
{'ner': 0.9206280021899147}
{'ner': 2.5533733413349182}
{'ner': 3.974387467026645}
{'ner': 1.2504660709382733}
{'ner': 2.21735084753891}
{'ner': 9.722241501774988}
{'ner': 0.5169736024690792}
{'ner': 1.56973624532111}
{'ner': 1.598955237050717}
{'ner': 0.029113001650557635}
{'ner': 1.1186205591604903}
{'ner': 1.120538724588111}
{'ner': 0.00025163989943699505}
{'ner': 1.0078574750490983}
{'ner': 1.0079279419797198}
{'ner': 2.3430478067130025}
{'ner': 2.3430594575857686}
{'ner': 2.3430698325725996}
{'ner': 1.1493791369139217}
{'ner': 1.1493840977427456}
{'ner': 1.1493871128783053}
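
The `losses` dict is reset once per iteration and each nlp.update call adds its batch loss to it, so the three values printed per iteration read as running totals (six examples with a batch size of 2 give three batches). The last value in each group of three is then the total for that iteration. As a small sketch, the per-batch losses can be recovered by differencing; `per_batch` is a hypothetical helper, and the numbers are copied from the first iteration of the output above.

```python
def per_batch(running_totals):
    """Convert running totals within one iteration into per-batch losses."""
    losses, previous = [], 0.0
    for total in running_totals:
        losses.append(total - previous)
        previous = total
    return losses

# First iteration of the log above: three batches, cumulative 'ner' loss
first_iteration = [12.799999833106995, 24.067995309829712, 31.91904306411743]
print(per_batch(first_iteration))
```

Comparing the last value of each group across iterations gives a clearer picture of the trend than comparing consecutive printed lines.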


In [47]:
5/7

0.7142857142857143