Let's first check what we get out of the box.

In [None]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.matcher import Matcher

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
doc = nlp("My name is Vincent and I was born on 23rd June 1987. \
           I work at Rasa from Haarlem. I just bought a guitar \
           cost $1000 on ebay and I will get is services here for 20 euro a year.")

In [None]:
displacy.render(doc, style="ent")

In [None]:
[(e, type(e)) for e in doc.ents]

[(Vincent, spacy.tokens.span.Span),
 (23rd June 1987, spacy.tokens.span.Span),
 (Rasa, spacy.tokens.span.Span),
 (Haarlem, spacy.tokens.span.Span),
 (1000, spacy.tokens.span.Span),
 (ebay, spacy.tokens.span.Span),
 (20, spacy.tokens.span.Span)]

In [None]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1185e7780>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x11874b048>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11874b0a8>)]

## From the spaCy Documentation

![](https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg) 

In [None]:
nlp.remove_pipe('ner')

('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11874b0a8>)

In [None]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1185e7780>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x11874b048>)]

In [None]:
doc = nlp("My name is Vincent")
doc.ents

()

## Parsing a Doc 

First I'll need to prepare the data so that it fits the API. The docs can be found [here](https://spacy.io/usage/training#training-simple-style).

The docs say the data format needs to look something like this:

```
TRAIN_DATA = [
   ("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]}),
   ("Google rebrands its apps", {"entities": [(0, 6, "ORG")]})
]
```

So we need something like:

```
TRAIN_DATA = [
   ("Python is cool", {"entities": [(0, 6, "PROGLANG")]}),
   ("Me like golang", {"entities": [(8, 14, "PROGLANG")]})
]
```
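
Counting character offsets by hand is error-prone. As a minimal sketch, assuming each phrase occurs once in its text, a small helper can compute the offsets with `str.find`. The name `annotate` is hypothetical and not part of spaCy.

```python
def annotate(text, phrase, label):
    """Build one (text, {"entities": [...]}) example for a single phrase.

    Hypothetical helper, not part of spaCy. It only finds the first
    occurrence of `phrase` and does not respect token boundaries.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} does not occur in {text!r}")
    return (text, {"entities": [(start, start + len(phrase), label)]})


examples = [
    annotate("Python is cool", "Python", "PROGLANG"),
    annotate("Me like golang", "golang", "PROGLANG"),
]
print(examples)
```

The end offset is exclusive, so `text[start:end]` gives back exactly the labelled phrase.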

## Making Matches!

I've taken the patterns code we made previously and put it in a separate Python file. This keeps the notebook clean, and I can still quickly get at all these patterns.

In [None]:
from common import create_patterns
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PROG_LANG", None, *create_patterns())

In [None]:
doc = nlp("I do code with datastuff using python and golang.")

for idx, start, end in matcher(doc):
    print(doc[start:end])

python
golang


In [None]:
type(doc[start:end])

spacy.tokens.span.Span

In [None]:
def parse_train_data(doc):
    # Turn every matcher hit into a (start_char, end_char, label) triple.
    detections = [
        (doc[start:end].start_char, doc[start:end].end_char, 'PROGLANG')
        for idx, start, end in matcher(doc)
    ]
    return (doc.text, {'entities': detections})

parse_train_data(nlp("i like python, javascript and golang"))

('i like python, javascript and golang',
 {'entities': [(7, 13, 'PROGLANG'),
   (15, 25, 'PROGLANG'),
   (30, 36, 'PROGLANG')]})
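
A quick way to sanity-check the result is to slice the text with each offset pair and confirm that the expected words come back. The sketch below is plain Python and runs on the output shown above; `spans_from_example` is a hypothetical helper name.

```python
def spans_from_example(example):
    """Return the (substring, label) pairs that an example's offsets point at."""
    text, annotations = example
    spans = []
    previous_end = 0
    for start, end, label in sorted(annotations["entities"]):
        # Offsets must stay inside the text and must not overlap each other.
        if not (previous_end <= start < end <= len(text)):
            raise ValueError(f"bad or overlapping span: {(start, end, label)}")
        spans.append((text[start:end], label))
        previous_end = end
    return spans


example = (
    "i like python, javascript and golang",
    {"entities": [(7, 13, "PROGLANG"), (15, 25, "PROGLANG"), (30, 36, "PROGLANG")]},
)
print(spans_from_example(example))
```

If an offset is off by one, the sliced substring makes that visible right away, which is easier to spot than a bare pair of numbers.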

## Full Training Set

Now let's load the data we used previously.

In [None]:
df = (pd.read_csv("../data/have_label.txt", 
                  nrows=5_000, 
                  sep='\t', 
                  usecols=['Label', 'Title']))

titles = df.loc[lambda d: d['Label'] == 1]['Title']

In [None]:
TRAIN_DATA = [parse_train_data(d) for d in nlp.pipe(titles) if len(matcher(d)) == 1]
TRAIN_DATA[5:8]

[('How to set up unit testing for Visual Studio C++',
  {'entities': [(45, 48, 'PROGLANG')]}),
 ('How do you pack a visual studio c++ project for release?',
  {'entities': [(32, 35, 'PROGLANG')]}),
 ('How do you get leading wildcard full-text searches to work in SQL Server?',
  {'entities': [(62, 65, 'PROGLANG')]})]
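
Since the labels come from pattern matches, it helps to see which surface forms dominate the training set before training on it. Here is a minimal sketch in plain Python, using the three examples printed above as a stand-in for the full `TRAIN_DATA`; `summarize` and `sample` are hypothetical names.

```python
from collections import Counter


def summarize(train_data):
    """Count entities per example and how often each labelled string occurs."""
    entities_per_example = Counter(len(ann["entities"]) for _, ann in train_data)
    surface_forms = Counter(
        text[start:end].lower()
        for text, ann in train_data
        for start, end, _ in ann["entities"]
    )
    return entities_per_example, surface_forms


sample = [
    ("How to set up unit testing for Visual Studio C++",
     {"entities": [(45, 48, "PROGLANG")]}),
    ("How do you pack a visual studio c++ project for release?",
     {"entities": [(32, 35, "PROGLANG")]}),
    ("How do you get leading wildcard full-text searches to work in SQL Server?",
     {"entities": [(62, 65, "PROGLANG")]}),
]
entities_per_example, surface_forms = summarize(sample)
print(entities_per_example)
print(surface_forms.most_common())
```

A training set in which a handful of strings makes up most of the labels makes it easy for the model to memorise those strings instead of learning from context.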

## Training Loop

Again, the docs for reference are [here](https://spacy.io/usage/training#training-simple-style). We take a slightly different approach than what is listed though.

We first create a blank nlp model and then add a `ner` step to it. This is easier than loading in a big model and replacing a step. It's also faster since the loading can be slow.


In [None]:
def create_blank_nlp(train_data):
    # Start from an empty English pipeline and add a fresh NER component.
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    # Register every label that occurs in the training data.
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

Next we just run it.

In [None]:
import random 
import datetime as dt

nlp = create_blank_nlp(TRAIN_DATA)
optimizer = nlp.begin_training()  
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i} - {dt.datetime.now()}", losses)

Losses at iteration 0 - 2020-02-29 16:01:37.407358 {'ner': 210.8715040565613}
Losses at iteration 1 - 2020-02-29 16:01:57.611517 {'ner': 54.94050754958943}
Losses at iteration 2 - 2020-02-29 16:02:21.511440 {'ner': 17.64956042300965}
Losses at iteration 3 - 2020-02-29 16:02:46.515510 {'ner': 22.03818076914257}
Losses at iteration 4 - 2020-02-29 16:03:12.728737 {'ner': 32.78210120097184}
Losses at iteration 5 - 2020-02-29 16:03:37.277097 {'ner': 36.18430367359715}
Losses at iteration 6 - 2020-02-29 16:04:02.517578 {'ner': 12.249202834523112}
Losses at iteration 7 - 2020-02-29 16:04:27.960499 {'ner': 0.0001372906562084279}
Losses at iteration 8 - 2020-02-29 16:04:53.461668 {'ner': 1.167531842838531e-09}
Losses at iteration 9 - 2020-02-29 16:05:18.533350 {'ner': 6.024408435729552e-10}
Losses at iteration 10 - 2020-02-29 16:05:44.401320 {'ner': 3.3534985043498425e-08}
Losses at iteration 11 - 2020-02-29 16:06:09.925620 {'ner': 1.124431198414007e-09}
Losses at iteration 12 - 2020-02-29 16:0

## Improvements 

I'll just add a few things that make training slightly nicer: minibatches and dropout.

In [None]:
from spacy.util import minibatch, compounding

In [None]:
nlp = create_blank_nlp(TRAIN_DATA)
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)  # new order every iteration, as in the first loop
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            drop=0.1,  # dropout - make it harder to memorise data
            sgd=optimizer,
            losses=losses,
        )
    print(f"Losses at iteration {i} - {dt.datetime.now()} {losses}")

Losses at iteration 0 - 2020-02-29 16:13:35.823776 {'ner': 421.81081383064986}
Losses at iteration 1 - 2020-02-29 16:13:40.236429 {'ner': 16.171604070858784}
Losses at iteration 2 - 2020-02-29 16:13:45.031095 {'ner': 10.869232156674228}
Losses at iteration 3 - 2020-02-29 16:13:50.309758 {'ner': 5.347369765463781}
Losses at iteration 4 - 2020-02-29 16:13:54.814064 {'ner': 5.267283654703734}
Losses at iteration 5 - 2020-02-29 16:13:59.583930 {'ner': 7.034331411273773}
Losses at iteration 6 - 2020-02-29 16:14:04.977785 {'ner': 20.55244086534093}
Losses at iteration 7 - 2020-02-29 16:14:11.207178 {'ner': 16.854737952514622}
Losses at iteration 8 - 2020-02-29 16:14:16.702827 {'ner': 12.846826920458023}
Losses at iteration 9 - 2020-02-29 16:14:22.886344 {'ner': 7.316021861073125}
Losses at iteration 10 - 2020-02-29 16:14:29.519257 {'ner': 0.20566945497729483}
Losses at iteration 11 - 2020-02-29 16:14:36.143884 {'ner': 3.7788202090958585}
Losses at iteration 12 - 2020-02-29 16:14:42.415683 {'
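
To get a feel for what `minibatch` and `compounding` do, here is a rough pure-Python approximation of the idea: the batch size starts small and is multiplied by a constant factor after every batch until it reaches a cap. This is an illustrative sketch, not spaCy's actual implementation, and the names `compounding_sizes` and `batched` are made up for this example. The growth factor of 2.0 is chosen so the effect is visible quickly; the loop above uses 1.001.

```python
import itertools


def compounding_sizes(start, stop, compound):
    """Yield start, start * compound, ... without ever exceeding stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


def batched(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    for size in sizes:
        batch = list(itertools.islice(iterator, int(size)))
        if not batch:
            return
        yield batch


batches = list(batched(range(100), compounding_sizes(4.0, 32.0, 2.0)))
print([len(b) for b in batches])
```

With these settings the batch lengths come out as 4, 8, 16, 32, 32 and 8. Small batches early on give many updates while the model is still far off, and larger batches later make each pass over the data cheaper.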

In [None]:
nlp.pipeline

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x13cabd2e8>)]

In [None]:
doc = nlp("i write code in python")
displacy.render(doc, style="ent")

In [None]:
doc = nlp("i write code in python and go")
displacy.render(doc, style="ent")