# Training and updating models

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="why-updating-the-model">Why update the model?</h1>
<ul>
<li>Better results on your specific domain</li>
<li>Learn classification schemes specifically for your problem</li>
<li>Essential for text classification</li>
<li>Very useful for named entity recognition</li>
<li>Less critical for part-of-speech tagging and dependency parsing</li>
</ul>
</section>

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="how-training-works-1">How training works (1)</h1>
<ol>
<li><strong>Initialize</strong> the model weights randomly with <code>nlp.begin_training</code></li>
<li><strong>Predict</strong> a few examples with the current weights by calling <code>nlp.update</code></li>
<li><strong>Compare</strong> prediction with true labels</li>
<li><strong>Calculate</strong> how to change weights to improve predictions</li>
<li><strong>Update</strong> weights slightly</li>
<li>Go back to 2.</li>
</ol>
</section>    
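
The steps above can be sketched without spaCy at all. The following is a minimal illustration in plain Python, assuming a made-up one-feature dataset and a tiny logistic model; the names `EXAMPLES`, `predict` and `train` are hypothetical and this is not how spaCy implements `nlp.begin_training` or `nlp.update` internally. It only shows the initialize, predict, compare, calculate and update cycle.

```python
import math
import random

# Toy data (made up for illustration): the label is 1 when the feature is positive.
EXAMPLES = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def predict(weight, bias, x):
    """Return a score between 0 and 1 using the current weights."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def train(examples, iterations=200, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # 1. Initialize the weights randomly
    weight, bias = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(iterations):
        for x, label in examples:
            # 2. Predict with the current weights
            guess = predict(weight, bias, x)
            # 3. Compare the prediction with the true label
            error = guess - label
            # 4. Calculate how to change the weights (gradient of the log loss)
            grad_weight, grad_bias = error * x, error
            # 5. Update the weights slightly
            weight -= learning_rate * grad_weight
            bias -= learning_rate * grad_bias
        # 6. Go back to step 2 for the next iteration
    return weight, bias


weight, bias = train(EXAMPLES)
print(predict(weight, bias, 2.0) > 0.5, predict(weight, bias, -2.0) < 0.5)
```

After training, positive inputs should score above 0.5 and negative inputs below it, which is what the final line checks.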

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="how-training-works-2">How training works (2)</h1>
<img src="training.png" alt="Diagram of the training process">

<ul>
<li><strong>Training data:</strong> Examples and their annotations.</li>
<li><strong>Text:</strong> The input text the model should predict a label for.</li>
<li><strong>Label:</strong> The label the model should predict.</li>
<li><strong>Gradient:</strong> How to change the weights.</li>
</ul>
</section>

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="example-training-the-entity-recognizer">Example: Training the entity recognizer</h1>
<ul>
<li>The entity recognizer tags words and phrases in context</li>
<li>Each token can only be part of one entity</li>
<li>Examples need to come with context</li>
</ul>
<pre class=" language-python"><code class=" language-python"><span class="token punctuation">(</span><span class="token string">"iPhone X is coming"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">"entities"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">8</span><span class="token punctuation">,</span> <span class="token string">"GADGET"</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">}</span><span class="token punctuation">)</span></code></pre>
<ul>
<li>Texts with no entities are also important</li>
</ul>
<pre class=" language-python"><code class=" language-python"><span class="token punctuation">(</span><span class="token string">"I need a new phone! Any tips?"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">"entities"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">}</span><span class="token punctuation">)</span></code></pre>
<ul>
<li><strong>Goal:</strong> teach the model to generalize</li>
</ul>
</section>
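
The entity annotations are character offsets into the text, so they can be checked with ordinary string slicing. The sketch below is plain Python rather than spaCy code: `entity_texts` and `has_overlap` are hypothetical helper names, and the `"BRAND"` label is invented only to demonstrate an overlapping (and therefore invalid) annotation.

```python
def entity_texts(example):
    """Return the substring and label each (start, end, label) annotation points at."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


def has_overlap(entities):
    """Return True if any two character spans share at least one character."""
    spans = sorted((start, end) for start, end, _ in entities)
    # After sorting by start offset, any overlap shows up between neighbours.
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))


example = ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})
print(entity_texts(example))                 # [('iPhone X', 'GADGET')]
print(has_overlap(example[1]["entities"]))   # False

# "BRAND" is a made-up label: the two spans cover the same characters,
# which conflicts with "each token can only be part of one entity".
print(has_overlap([(0, 8, "GADGET"), (0, 6, "BRAND")]))  # True
```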

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="the-training-data">The training data</h1>
<ul>
<li>Examples of what we want the model to predict in context</li>
<li>Update an <strong>existing model</strong>: a few hundred to a few thousand examples</li>
<li>Train a <strong>new category</strong>: a few thousand to a million examples<ul>
<li>spaCy's English models: 2 million words</li>
</ul>
</li>
<li>Usually created manually by human annotators</li>
<li>Can be semi-automated – for example, using spaCy's <code>Matcher</code>!</li>
</ul>
</section>

# Practice

<h3>Creating training data (1)</h3>
<p>1. Write a pattern for two tokens whose lowercase forms match "iphone" and "x".</p>
<p>2. Write a pattern for two tokens: one whose lowercase form matches "iphone", followed by a digit.</p>

In [1]:
from spacy.matcher import Matcher
from spacy.lang.en import English

TEXTS = ['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', "iPhone 11 vs iPhone 8: What's the difference?", 'I need a new phone! Any tips?']

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and check the result
matcher.add("GADGET", None, pattern1, pattern2)
for doc in nlp.pipe(TEXTS):
    print([doc[start:end] for match_id, start, end in matcher(doc)])

[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]


<h3>Creating training data (2)</h3>
<p>1. Create a <code>Doc</code> object for each text using <code>nlp.pipe</code>.</p>
<p>2. Match on the doc and create a list of matched spans.</p>
<p>3. Get <code>(start character, end character, label)</code> tuples of matched spans.</p>
<p>4. Format each example as a tuple of the text and a dict, mapping <code>"entities"</code> to the entity tuples.</p>
<p>5. Append the example to <code>TRAINING_DATA</code> and inspect the printed data.</p>

In [2]:
from spacy.matcher import Matcher
from spacy.lang.en import English

TEXTS = ['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', "iPhone 11 vs iPhone 8: What's the difference?", 'I need a new phone! Any tips?']

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, {"entities": entities}) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
("iPhone 11 vs iPhone 8: What's the difference?", {'entities': [(0, 9, 'GADGET'), (13, 21, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
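
Because the annotations are generated automatically, it is worth checking that every offset pair really selects the intended text. The following is a small sanity check in plain Python, using the data printed above; the helper name `annotated_spans` is hypothetical.

```python
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("Should I pay $1,000 for the iPhone X?", {"entities": [(28, 36, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("iPhone 11 vs iPhone 8: What's the difference?",
     {"entities": [(0, 9, "GADGET"), (13, 21, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def annotated_spans(training_data):
    """Slice each text with its character offsets to recover the annotated strings."""
    return [text[start:end]
            for text, annotations in training_data
            for start, end, _ in annotations["entities"]]


spans = annotated_spans(TRAINING_DATA)
print(spans)
# ['iPhone X', 'iPhone X', 'iPhone X', 'iPhone 8', 'iPhone 11', 'iPhone 8']

# Count the examples that contain no entities at all
print(sum(1 for _, annotations in TRAINING_DATA if not annotations["entities"]))  # 1
```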


# The Training Loop

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="the-steps-of-a-training-loop">The steps of a training loop</h1>
<ol>
<li><strong>Loop</strong> for a number of times.</li>
<li><strong>Shuffle</strong> the training data.</li>
<li><strong>Divide</strong> the data into batches.</li>
<li><strong>Update</strong> the model for each batch.</li>
<li><strong>Save</strong> the updated model.</li>
</ol>
</section>
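
The control flow of these steps can be sketched in plain Python before any model is involved. In the sketch below, `minibatch` is a hypothetical stand-in for `spacy.util.minibatch` with a fixed batch size, `run_training_loop` is an illustrative name, and the `update` callback plays the role of `nlp.update`; saving is left out because it depends on the model object.

```python
import random

DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def minibatch(items, size=2):
    """Yield successive lists of at most `size` items (stand-in, not spaCy's function)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_training_loop(training_data, update, iterations=10, batch_size=2, seed=0):
    rng = random.Random(seed)
    data = list(training_data)  # copy so the caller's list keeps its order
    for _ in range(iterations):                     # 1. Loop
        rng.shuffle(data)                           # 2. Shuffle
        for batch in minibatch(data, batch_size):   # 3. Divide into batches
            texts, annotations = zip(*batch)        # split the (text, annotation) pairs
            update(list(texts), list(annotations))  # 4. Update
    # 5. Saving the model would happen here, once training is done


calls = []
run_training_loop(DATA, lambda texts, annotations: calls.append(len(texts)))
print(len(calls), sum(calls))  # 20 30
```

With three examples and a batch size of 2, each iteration produces two batches, so ten iterations call `update` 20 times and process 30 examples in total.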

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="example-loop">Example loop</h1>
<pre class=" language-python"><code class=" language-python">TRAINING_DATA <span class="token operator">=</span> <span class="token punctuation">[</span>
    <span class="token punctuation">(</span><span class="token string">"How to preorder the iPhone X"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">"entities"</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">,</span> <span class="token number">28</span><span class="token punctuation">,</span> <span class="token string">"GADGET"</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token comment"># And many more examples...</span>
<span class="token punctuation">]</span></code></pre>
<pre class=" language-python"><code class=" language-python"><span class="token keyword">import</span> random
<span class="token keyword">import</span> spacy

<span class="token comment"># Loop for 10 iterations</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Shuffle the training data</span>
    random<span class="token punctuation">.</span>shuffle<span class="token punctuation">(</span>TRAINING_DATA<span class="token punctuation">)</span>
    <span class="token comment"># Create batches and iterate over them</span>
    <span class="token keyword">for</span> batch <span class="token keyword">in</span> spacy<span class="token punctuation">.</span>util<span class="token punctuation">.</span>minibatch<span class="token punctuation">(</span>TRAINING_DATA<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Split the batch into texts and annotations</span>
        texts <span class="token operator">=</span> <span class="token punctuation">[</span>text <span class="token keyword">for</span> text<span class="token punctuation">,</span> annotation <span class="token keyword">in</span> batch<span class="token punctuation">]</span>
        annotations <span class="token operator">=</span> <span class="token punctuation">[</span>annotation <span class="token keyword">for</span> text<span class="token punctuation">,</span> annotation <span class="token keyword">in</span> batch<span class="token punctuation">]</span>
        <span class="token comment"># Update the model</span>
        nlp<span class="token punctuation">.</span>update<span class="token punctuation">(</span>texts<span class="token punctuation">,</span> annotations<span class="token punctuation">)</span>

<span class="token comment"># Save the model</span>
nlp<span class="token punctuation">.</span>to_disk<span class="token punctuation">(</span>path_to_model<span class="token punctuation">)</span></code></pre>
</section>

<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h1 id="setting-up-a-new-pipeline-from-scratch">Setting up a new pipeline from scratch</h1>
<pre class=" language-python"><code class=" language-python"><span class="token keyword">import</span> random
<span class="token keyword">import</span> spacy

<span class="token comment"># Start with a blank English model</span>
nlp <span class="token operator">=</span> spacy<span class="token punctuation">.</span>blank<span class="token punctuation">(</span><span class="token string">"en"</span><span class="token punctuation">)</span>
<span class="token comment"># Create blank entity recognizer and add it to the pipeline</span>
ner <span class="token operator">=</span> nlp<span class="token punctuation">.</span>create_pipe<span class="token punctuation">(</span><span class="token string">"ner"</span><span class="token punctuation">)</span>
nlp<span class="token punctuation">.</span>add_pipe<span class="token punctuation">(</span>ner<span class="token punctuation">)</span>
<span class="token comment"># Add a new label</span>
ner<span class="token punctuation">.</span>add_label<span class="token punctuation">(</span><span class="token string">"GADGET"</span><span class="token punctuation">)</span>

<span class="token comment"># Start the training</span>
nlp<span class="token punctuation">.</span>begin_training<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># Train for 10 iterations</span>
<span class="token keyword">for</span> itn <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    random<span class="token punctuation">.</span>shuffle<span class="token punctuation">(</span>examples<span class="token punctuation">)</span>
    <span class="token comment"># Divide examples into batches</span>
    <span class="token keyword">for</span> batch <span class="token keyword">in</span> spacy<span class="token punctuation">.</span>util<span class="token punctuation">.</span>minibatch<span class="token punctuation">(</span>examples<span class="token punctuation">,</span> size<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        texts <span class="token operator">=</span> <span class="token punctuation">[</span>text <span class="token keyword">for</span> text<span class="token punctuation">,</span> annotation <span class="token keyword">in</span> batch<span class="token punctuation">]</span>
        annotations <span class="token operator">=</span> <span class="token punctuation">[</span>annotation <span class="token keyword">for</span> text<span class="token punctuation">,</span> annotation <span class="token keyword">in</span> batch<span class="token punctuation">]</span>
        <span class="token comment"># Update the model</span>
        nlp<span class="token punctuation">.</span>update<span class="token punctuation">(</span>texts<span class="token punctuation">,</span> annotations<span class="token punctuation">)</span></code></pre>
</section>

# Practice

<h3>Setting up the pipeline</h3>
<p>1. Create a blank "en" model, for example using the <code>spacy.blank</code> function.</p>
<p>2. Create a new entity recognizer using <code>nlp.create_pipe</code> and add it to the pipeline.</p>
<p>3. Add the new label "GADGET" to the entity recognizer using the <code>add_label</code> method on the pipeline component.</p>

In [3]:
import spacy

# Create a blank "en" model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label "GADGET" to the entity recognizer
ner.add_label("GADGET")

<h3>Building a training loop</h3>
<p>1. Call <code>nlp.begin_training</code>, create a training loop for 10 iterations and shuffle the training data.</p>
<p>2. Create batches of training data using <code>spacy.util.minibatch</code> and iterate over the batches.</p>
<p>3. Convert the <code>(text, annotations)</code> tuples in the batch to lists of texts and annotations.</p>
<p>4. For each batch, use <code>nlp.update</code> to update the model with the texts and annotations.</p>

In [4]:
import spacy
import random

TRAINING_DATA = [['How to preorder the iPhone X', {'entities': [[20, 28, 'GADGET']]}], ['iPhone X is coming', {'entities': [[0, 8, 'GADGET']]}], ['Should I pay $1,000 for the iPhone X?', {'entities': [[28, 36, 'GADGET']]}], ['The iPhone 8 reviews are here', {'entities': [[4, 12, 'GADGET']]}], ['Your iPhone goes up to 11 today', {'entities': [[5, 11, 'GADGET']]}], ['I need a new phone! Any tips?', {'entities': []}]]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
    print(losses)

{'ner': 33.50006282329559}
{'ner': 20.239550530910492}
{'ner': 8.076356374192983}
{'ner': 5.9776672299194615}
{'ner': 10.819391730386997}
{'ner': 6.543384524411522}
{'ner': 3.674863284635876}
{'ner': 1.8307197888055953}
{'ner': 0.8525482789530088}
{'ner': 2.1375879164635307}
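
Because `losses` is reset at the start of each iteration, every printed value covers one pass over the shuffled data. The short sketch below, in plain Python, takes the values printed above and reports where the loss went up instead of down; the exact numbers will differ from run to run, since the weights are initialized randomly and the data is reshuffled each time.

```python
# Loss values copied from the output above, one per iteration
losses = [
    33.50006282329559, 20.239550530910492, 8.076356374192983,
    5.9776672299194615, 10.819391730386997, 6.543384524411522,
    3.674863284635876, 1.8307197888055953, 0.8525482789530088,
    2.1375879164635307,
]

# Iterations (counted from 1) whose loss is higher than the previous one
increases = [i + 1 for i in range(1, len(losses)) if losses[i] > losses[i - 1]]
print(increases)               # [5, 10]

# The overall trend is still downward
print(losses[-1] < losses[0])  # True
```

Occasional increases like these are not unusual with a handful of examples and a batch size of 2; what matters is the overall downward trend.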
