## Start

Let's first load the dataset (which can be found [here](https://www.kaggle.com/stackoverflow/stacksample)) into pandas.

We'll start by grabbing a list of the question titles.

In [3]:
import pandas as pd

df = (pd.read_csv("data/Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))
titles = df['Title'].tolist()

In [2]:
def has_golang(text):
    return " go " in text

g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]

['Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?']
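To see how far plain string matching can be pushed, here is a minimal sketch using only the standard library. The `mentions_go` helper and the third sample title are illustrative assumptions; the first two titles appear in the outputs of this notebook. A word-boundary regex fixes the casing and end-of-string blind spots of `" go " in text`, but it still cannot tell the verb from the language.

```python
import re

# Match "go" or "golang" as a whole word, ignoring case.
GO_WORD = re.compile(r"\b(go|golang)\b", flags=re.IGNORECASE)

def mentions_go(text):
    """Hypothetical helper: True if the text contains go/golang as a whole word."""
    return GO_WORD.search(text) is not None

samples = [
    "Where does Console.WriteLine go in ASP.NET?",  # "go" used as a verb
    "Global Variables with GO",                     # the language, at the end of the title
    "Django forms",                                 # "go" only as part of another word
]

naive = [" go " in s for s in samples]
regex = [mentions_go(s) for s in samples]
print(naive)
print(regex)
```

The regex version picks up the second title that the substring check misses, yet both approaches still accept the first title, where "go" is a verb. That distinction needs grammatical information.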

The substring check doesn't work too well: both titles it returned use "go" as a verb, not as the name of the language. So let's try spaCy instead. You might need to install the language model first.

```
python -m spacy download en_core_web_sm
```

In [6]:
import spacy 

nlp = spacy.load("en_core_web_sm")

Let's first see how spaCy splits a sentence into tokens.

In [7]:
[t for t in nlp("My name is Vincent.")]

[My, name, is, Vincent, .]

In [8]:
doc = nlp("My name is Vincent.")

In [9]:
t = doc[0]

In [10]:
from spacy import displacy

displacy.render(doc)

In [11]:
for t in nlp("Where does Console.WriteLine go in ASP.NET?"):
    print(t, t.pos_, t.dep_)

Where ADV advmod
does VERB ROOT
Console PROPN nsubj
. PUNCT punct
WriteLine PROPN nsubj
go VERB ROOT
in ADP prep
ASP.NET PROPN pobj
? PUNCT punct
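The output above is what makes a rule-based detector possible: every token carries a part-of-speech tag. As a rough sketch, the snippet below copies the printed rows by hand into a hypothetical `Tok` tuple (it is not a spaCy object and does not run a model) and checks how the token "go" was tagged.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    """Hypothetical stand-in for a spaCy token: text, part of speech, dependency."""
    text: str
    pos: str
    dep: str

# Rows copied by hand from the printed output above.
tokens = [
    Tok("Where", "ADV", "advmod"),
    Tok("does", "VERB", "ROOT"),
    Tok("Console", "PROPN", "nsubj"),
    Tok(".", "PUNCT", "punct"),
    Tok("WriteLine", "PROPN", "nsubj"),
    Tok("go", "VERB", "ROOT"),
    Tok("in", "ADP", "prep"),
    Tok("ASP.NET", "PROPN", "pobj"),
    Tok("?", "PUNCT", "punct"),
]

# Keep only the tokens whose text is "go", then inspect their tags.
go_tokens = [t for t in tokens if t.text.lower() == "go"]
go_is_verb = all(t.pos == "VERB" for t in go_tokens)
print(go_tokens, go_is_verb)
```

Here "go" is tagged as a verb, so a rule that only accepts non-verb uses of "go" would correctly skip this title.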


# Back to Detecting Golang

Let's now apply spaCy to our original problem of detecting the `go` language. I'll also write the code in a more performant way.
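The cells below get their speed from two ideas: filter cheaply before running the expensive model, and consume results lazily. The following is a stdlib-only sketch of that pattern. The `expensive_check` function is a hypothetical stand-in for the spaCy pipeline, and the title list is a small made-up sample.

```python
from itertools import islice

calls = 0  # counts how often the expensive step actually runs

def expensive_check(title):
    """Hypothetical stand-in for running a full NLP pipeline on one title."""
    global calls
    calls += 1
    return "go" in title.lower().split()

titles = [
    "Global Variables with GO",
    "How to sort a list",
    "Django forms",
    "Removing all event handlers in one go",
    "making generic algorithms in go",
    "google maps api",
]

# Step 1: cheap substring filter, so the expensive step sees fewer titles.
candidates = (t for t in titles if "go" in t.lower())

# Step 2: lazy generator, so nothing is computed until results are requested.
matches = (t for t in candidates if expensive_check(t))

# islice stops after two hits and returns fewer items if the generator runs dry.
first_two = list(islice(matches, 2))
print(first_two, calls)
```

Only three titles reach `expensive_check` before two matches are found. Using `islice` instead of repeated `next` calls is a design choice of this sketch: it avoids a `StopIteration` error when fewer matches exist than were requested.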

In [12]:
nlp = spacy.load("en_core_web_sm")

In [13]:
df = (pd.read_csv("Questions.csv", nrows=2_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

titles = df.loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title'].tolist()

In [20]:
nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [18]:
%%time

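def has_golang(doc):
    # Flag a title only when "go"/"golang" is tagged as a noun.
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ == "NOUN":
                return True
    return False

# nlp.pipe streams the titles through spaCy in batches instead of one call per title.
g = (doc for doc in nlp.pipe(titles) if has_golang(doc))
[next(g) for i in range(30)]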

CPU times: user 9.78 s, sys: 1.59 s, total: 11.4 s
Wall time: 11.4 s


[Deploying multiple Java web apps to Glassfish in one go,
 Removing all event handlers in one go,
 How do I disable multiple listboxes in one go using jQuery?,
 multi package makefile example for go,
 SOAPUI & Groovy Scripts, executing multiple SQL statements in one go,
 What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?,
 Import large chunk of data into Google App Engine Data Store at one go,
 How many records can be loaded into Salesforce using Apex Data Loader in one go?,
 How can I run multiple inserts with NHibernate in one go?,
 GO action after a submit (go URI),
 Saving all nested form objects in one go,
 Global Variables with GO,
 Decrypt many PDFs in one go using pdftk,
 making generic algorithms in go,
 How do I allocate memory for an array in the go programming language?,
 Is message passing via channels in go guaranteed to be non-blocking?,
 What's wrong with the following go code that I receive 'all goroutines are 

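This works better: the verb usages are gone and real Go questions show up, although idioms such as "in one go" still slip through. Now let's write some pandas code that will help us with our benchmarks.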

![](images/these-2.png)

In [21]:
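df_tags = pd.read_csv("Tags.csv")
# Ids of the questions that carry the `go` tag: this is our ground truth.
go_ids = df_tags.loc[lambda d: d['Tag'] == 'go']['Id']

def has_go_token(doc):
    # Looser rule than before: accept "go"/"golang" unless it is tagged as a verb.
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != 'VERB':
                return True
    return False

# Titles of go-tagged questions, and the subset of those that the rule flags.
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]

# Titles without the go tag that still contain the substring "go" ...
non_detectable = (df
                  .loc[lambda d: ~d['Id'].isin(go_ids)]
                  .loc[lambda d: d['Title'].str.lower().str.contains("go")]
                  ['Title']
                  .tolist())

# ... narrowed down to the ones the rule flags anyway (false positives).
non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]

len(all_go_sentences), len(detectable), len(non_detectable)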

(1858, 892, 102)

Nice, these numbers are large enough to give us a meaningful benchmark.

We can calculate precision/recall-like statistics by running the code below. You can put a for loop around it if you want, but as it stands you can already tweak the `has_go_token` function and see how well each variant performs.

In [48]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=["ner"])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ != "VERB":
                return True
    return False

method = "not-verb-but-pobj"

# Hits among go-tagged titles count as correct, hits among other titles as wrong.
correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct + wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))

f"{precision},{recall},{accuracy},{model_name},{method}" # this is logged

'0.89738430583501,1.0,0.89738430583501,en_core_web_sm,not-verb-but-pobj'
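The three formulas in the cell above can be wrapped in a small helper, which makes it easier to compare variants. This is a sketch under assumptions: the `score` function and its parameter names are not from the notebook, and the counts passed in (892 correct, 102 wrong) are inferred from the list lengths and the logged precision shown earlier.

```python
import math

def score(correct, wrong, n_positive, n_negative):
    """Hypothetical helper mirroring the precision/recall/accuracy formulas above."""
    flagged = correct + wrong
    total = n_positive + n_negative
    # Guard against empty inputs so the helper never divides by zero.
    precision = correct / flagged if flagged else 0.0
    recall = correct / n_positive if n_positive else 0.0
    accuracy = (correct + n_negative - wrong) / total if total else 0.0
    return precision, recall, accuracy

# Counts inferred from the outputs above: 892 detectable titles, 102 non-detectable ones.
precision, recall, accuracy = score(correct=892, wrong=102, n_positive=892, n_negative=102)
print(f"{precision},{recall},{accuracy}")
```

With these counts the helper reproduces the logged precision and recall.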

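Note that recall is exactly 1.0 here because `detectable` was built with this same rule and model; it will only move once you change `has_go_token` or the model. Enjoy playing around with this.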
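One way to play around is to loop over several candidate rules and log a line for each. The sketch below does that on a tiny hand-labelled sample so it runs without spaCy or the dataset. Everything in it is an assumption for illustration: the documents are lists of `(lowercased text, part of speech)` pairs with made-up tags, not real model output, and the rule names are hypothetical.

```python
import math

# Hand-made stand-ins for parsed titles: lists of (lowercased text, POS tag).
positives = [  # pretend these titles carry the go tag
    [("global", "ADJ"), ("variables", "NOUN"), ("with", "ADP"), ("go", "PROPN")],
    [("generic", "ADJ"), ("algorithms", "NOUN"), ("in", "ADP"), ("go", "NOUN")],
    [("channels", "NOUN"), ("in", "ADP"), ("golang", "PROPN")],
]
negatives = [  # pretend these titles do not carry the go tag
    [("where", "ADV"), ("does", "AUX"), ("it", "PRON"), ("go", "VERB")],
    [("delete", "VERB"), ("rows", "NOUN"), ("in", "ADP"), ("one", "NUM"), ("go", "NOUN")],
]

def make_rule(accept_pos):
    """Build a detector that flags go/golang tokens whose POS passes accept_pos."""
    def rule(doc):
        return any(text in ("go", "golang") and accept_pos(pos) for text, pos in doc)
    return rule

rules = {
    "not-verb": make_rule(lambda pos: pos != "VERB"),
    "noun-only": make_rule(lambda pos: pos == "NOUN"),
}

results = {}
for name, rule in rules.items():
    correct = sum(rule(doc) for doc in positives)
    wrong = sum(rule(doc) for doc in negatives)
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / len(positives)
    results[name] = (precision, recall)
    print(f"{precision},{recall},{name}")
```

On this toy sample the stricter rule loses recall without gaining precision, which is the kind of trade-off the real benchmark lets you measure.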