We started the previous notebook with this code. Let's give it another spin.

In [1]:
import spacy
import pandas as pd
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang', 'python', 'ruby', 'objective-c']:
            if t.pos_ != 'VERB':
                return True
    return False

In [4]:
doc = nlp("i am an iOS dev and I like to code in objective-c")

In [5]:
[t for t in doc]

[i, am, an, iOS, dev, and, I, like, to, code, in, objective, -, c]

Our current approach won't work for languages whose names span more than one token: "objective-c" is split into `objective`, `-` and `c` above. So I'll need to use matchers instead. A handful of patterns are defined and applied below.

In [6]:
from spacy.matcher import Matcher

In [7]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'c'}]

obj_c_pattern2 = [{'LOWER': 'objectivec'}]

golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

python_pattern = [{'LOWER': 'python'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

In [8]:
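# spaCy v2 call signature: matcher.add(key, on_match, *patterns).
# spaCy v3 expects a list instead: matcher.add("OBJ_C_LANG", [obj_c_pattern1, obj_c_pattern2]).
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("PYTHON_LANG", None, python_pattern)
matcher.add("GO_LANG", None, golang_pattern1, golang_pattern2)
matcher.add("JS_LANG", None, js_pattern)
matcher.add("RUBY_LANG", None, ruby_pattern)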

In [9]:
doc = nlp("I am an iOS dev who codes in both python, go/golang as well as objective-c")
for match_id, start, end in matcher(doc):
    print(doc[start: end])

python
golang
objective-c


In [10]:
doc = nlp("I've done some js and ruby and go programming")
for match_id, start, end in matcher(doc):
    print(doc[start: end])

js
ruby


## Benchmarking

Our current approach seems to work, but it would be good to confirm this with data. I'll do a soft benchmark: I'll check for the occurrence of a string, like "objective", and see which of those titles my matcher does not pick up. If I'm missing something, this should give me a pretty clear picture of it.
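As a rough sketch of this soft benchmark, the helper below keeps only the titles that contain a keyword and returns the ones a matching function fails to flag. The names `find_misses` and `has_match` are hypothetical, and the regex predicate is only a stand-in so the example runs without spaCy. In this notebook, `has_match` would wrap the matcher defined earlier, for example `lambda text: len(matcher(nlp(text))) > 0`.

```python
import re
from itertools import islice


def find_misses(titles, keyword, has_match, limit=200):
    """Return titles containing `keyword` that `has_match` does not flag.

    Only the first `limit` titles containing the keyword are checked.
    """
    candidates = (t for t in titles if keyword in t.lower())
    return [t for t in islice(candidates, limit) if not has_match(t)]


# Stand-in predicate (an assumption, not the spaCy matcher): flags a title
# only when "python" appears as a standalone word.
def standalone_python(text):
    return re.search(r"\bpython\b", text.lower()) is not None


sample_titles = [
    "How do I sort a list in Python?",
    "Django on IronPython",
    "How to set the PYTHONPATH in Emacs?",
    "Parsing JSON in Ruby",
]

misses = find_misses(sample_titles, "python", standalone_python)
print(misses)
```

Passing the predicate in as an argument keeps the benchmark independent of how matching is done, so the same helper can be reused after the patterns change.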

In [11]:
import pandas as pd

df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [12]:
titles = (title for title in df['Title'] if "python" in title.lower())

In [13]:
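from itertools import islice

# islice stops quietly if fewer than 200 titles are left,
# whereas calling next() on an exhausted generator raises StopIteration.
for title in islice(titles, 200):
    doc = nlp(title)
    if len(matcher(doc)) == 0:
        print(doc)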

mod_python/MySQL error on INSERT with a lot of data: "OperationalError: (2006, 'MySQL server has gone away')"
Running subversion under apache and mod_python
What's the best way to embed IronPython inside my C# App?
How to set the PYTHONPATH in Emacs?
wxPython wxDC object from win32gui.GetDC
Need skeleton code to call Excel VBA from PythonWin
Questions for python->scheme conversion
wxPython and sharing objects between windows
Django on IronPython
IronPython Webframework
A SuggestBox for wxPython?
Intercepting Method Access on the Host Program of IronPython
Is there anything like IPython / IRB for Perl?
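
Most of these misses contain "python" only as part of a longer word (`wxPython`, `IronPython`, `PYTHONPATH`), and whether those should count as mentions of the language is a judgment call. One way to get an overview, sketched here with a hypothetical helper `offending_words`, is to count the whitespace-separated words that contain the keyword. The splitting is deliberately crude and is not how spaCy tokenizes.

```python
from collections import Counter


def offending_words(titles, keyword):
    """Count whitespace-separated words that contain `keyword` (case-insensitive)."""
    counts = Counter()
    for title in titles:
        for word in title.split():
            if keyword in word.lower():
                # Strip surrounding punctuation so "wxPython?" and "wxPython"
                # end up under the same key.
                counts[word.strip("?!.,:;\"'()").lower()] += 1
    return counts


# A few of the missed titles printed above.
missed = [
    "Running subversion under apache and mod_python",
    "wxPython wxDC object from win32gui.GetDC",
    "wxPython and sharing objects between windows",
    "Django on IronPython",
    "IronPython Webframework",
    "A SuggestBox for wxPython?",
    "Questions for python->scheme conversion",
]

counts = offending_words(missed, "python")
for word, n in counts.most_common():
    print(n, word)
```

Words that show up often in such a count are the first candidates to look at when deciding whether a pattern needs changing.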


I've used this benchmark to find some mistakes, and this is a good milestone for now. The next step is to prepare the dataset for machine learning.