### Training and updating models

#### How training works

1. Initialize the model weights randomly with nlp.begin_training.
2. Predict a few examples with the current weights by calling nlp.update.
3. Compare the predictions with the true labels.
4. Calculate how to change the weights to improve the predictions.
5. Update the weights slightly.
6. Go back to step 2.
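
The steps above can be sketched with a toy model. This is not spaCy code: it is a minimal illustration that assumes a single weight, made-up data for y = 2x, and plain gradient descent on the squared error, only to show where each step sits in the loop.

```python
import random

# Toy data for y = 2 * x; stands in for (text, annotation) pairs (made up)
examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

random.seed(0)
# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    # 2. Predict a few examples with the current weight
    batch = random.sample(examples, 2)
    predictions = [weight * x for x, _ in batch]
    # 3. Compare the predictions with the true labels
    errors = [pred - y for pred, (_, y) in zip(predictions, batch)]
    # 4. Calculate how to change the weight (gradient of the mean squared error)
    gradient = sum(2 * err * x for err, (x, _) in zip(errors, batch)) / len(batch)
    # 5. Update the weight slightly
    weight -= learning_rate * gradient
    # 6. Go back to step 2 (next loop iteration)

print(round(weight, 3))
```

In spaCy, steps 2 to 5 all happen inside a single nlp.update call.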

#### Example

In [3]:
# Semi-automate the creation of training examples with spaCy's Matcher
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
# spaCy v2 signature; in spaCy v3 this is matcher.add("GADGET", [pattern1, pattern2])
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
# Print one example per line
print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
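
Note that some examples in the output contain overlapping entities, such as (20, 28) and (20, 26), because "iPhone X" matches the first pattern while "iPhone" alone matches the second one (its digit token is optional). An entity recognizer expects non-overlapping spans, so the shorter duplicates should be removed before training. A minimal sketch follows, assuming we want to keep the longest span; `filter_overlapping` is a hypothetical helper written for this note, not part of spaCy. spaCy itself ships `spacy.util.filter_spans`, which applies a similar longest-first rule to Span objects.

```python
def filter_overlapping(entities):
    """Keep the longest entity among overlapping (start, end, label) tuples."""
    # Longest spans first; ties are broken by the earliest start
    by_length = sorted(entities, key=lambda ent: (ent[0] - ent[1], ent[0]))
    kept = []
    taken = set()  # character positions already covered by a kept entity
    for start, end, label in by_length:
        if not any(pos in taken for pos in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(filter_overlapping(entities))  # [(20, 28, 'GADGET')]
```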


#### Steps of training loop

1. Loop for a number of times.
2. Shuffle the training data.
3. Divide the data into batches.
4. Update the model for each batch.
5. Save the updated model.

#### Example

In [6]:
import random
import spacy

TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]})
    # And many more examples...
]

# nlp is assumed to be a pipeline that already has an initialized entity recognizer

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch into texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
nlp.to_disk('model/')

#### Building a training loop from scratch

In [9]:
import spacy
import random
import json

with open("gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

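# Create a blank English model and add an entity recognizer with the new label
# (spaCy v2 API; spaCy v3 uses nlp.add_pipe("ner") and nlp.initialize() instead)
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")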

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
    
        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 8.000000059604645}
{'ner': 22.496776401996613}
{'ner': 31.853319942951202}
{'ner': 7.109317302703857}
{'ner': 13.006432175636292}
{'ner': 18.542052090168}
{'ner': 3.4365335404872894}
{'ner': 6.569849184714258}
{'ner': 7.88712777395267}
{'ner': 2.3865505084686447}
{'ner': 4.474786773978849}
{'ner': 5.275617860221246}
{'ner': 4.410458291764371}
{'ner': 9.060808309703134}
{'ner': 10.982691241573775}
{'ner': 2.273387190653011}
{'ner': 3.025673861593532}
{'ner': 5.262153170981037}
{'ner': 1.228482561185956}
{'ner': 2.1085004426713567}
{'ner': 2.769125703634927}
{'ner': 0.23909635161635379}
{'ner': 0.26650693952467464}
{'ner': 1.2055368919477019}
{'ner': 0.004890342182520158}
{'ner': 0.013082372427990485}
{'ner': 2.3215074471822303}
{'ner': 2.1328314491493607}
{'ner': 2.1330513375847806}
{'ner': 2.133055827927847}
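
The output shows three lines per iteration, and within each group of three the values only grow. That pattern is consistent with nlp.update adding each batch's loss to the running total in `losses`, which is reset to `{}` only at the start of an iteration. Under that assumption, the per-batch losses can be recovered by taking differences, sketched here for the first iteration:

```python
# First iteration of the output above (three batches of size 2)
cumulative = [8.000000059604645, 22.496776401996613, 31.853319942951202]

# The first value is already a single-batch loss; the rest are differences
per_batch = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]
print([round(loss, 3) for loss in per_batch])
```

To log one number per iteration instead, `print(losses)` can be moved out of the batch loop so that it runs after all batches of the iteration have been processed.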


#### Best practices for training spaCy models

In [11]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]

In [12]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
]  
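
Because the annotations are plain character offsets, one wrong number silently labels the wrong text. A quick sanity check before training is to slice each text with its offsets and to count how often each label occurs. The sketch below uses two of the examples above; `check_examples` is a hypothetical helper written for this note, not a spaCy function.

```python
from collections import Counter

TRAINING_DATA = [
    ("PewDiePie smashes YouTube record",
     {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]}),
    ("There's also a Paris in Arkansas, lol",
     {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]}),
]


def check_examples(data):
    """Return the labelled substrings and a count of each label."""
    labelled = []
    label_counts = Counter()
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            span_text = text[start:end]
            # Flag offsets that are out of range or include surrounding whitespace
            if not span_text or span_text != span_text.strip() or end > len(text):
                raise ValueError(f"Bad offsets ({start}, {end}) in {text!r}")
            labelled.append((span_text, label))
            label_counts[label] += 1
    return labelled, label_counts


labelled, label_counts = check_examples(TRAINING_DATA)
for span_text, label in labelled:
    print(f"{label:8} {span_text}")
print(dict(label_counts))
```

Reading the printed substrings next to their labels makes inconsistent or misaligned annotations easy to spot, and the label counts show whether any label has too few examples.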