# Part 2: Let's apply spaCy a bit

I'd now like to explore spaCy some more by applying it to a fun example. In this video I'll work with the Talk Python transcripts, which can be found [here](https://github.com/mikeckennedy/talk-python-transcripts).

## A bit of data cleaning

The cleaning is a bit of regex plus some `str` methods: lines that start with a timestamp are kept, and the speaker name is split off from the text.

In [30]:
import srsly 
from re import compile
from pathlib import Path 

regex = compile("([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")

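def episode_lines(path):
    """Yield one dictionary per spoken turn in a transcript file."""
    i = 0
    for line in Path(path).read_text().split("\n"):
        # Only lines that start with a timestamp hold a spoken turn.
        if regex.match(line):
            # Drop the leading 8-character timestamp and any other timestamps.
            without_time = regex.sub("", line[8:])
            # Everything before the first colon is taken to be the speaker name.
            without_name = without_time[without_time.find(":") + 1:]
            speaker = without_time[:without_time.find(":")].strip() if (":" in without_time) else ""
            i += 1
            yield {
                "text": without_name.strip().replace("\xa0", " "),
                "meta": {
                    "speaker": speaker,
                    "file": str(path),
                    "turn": i
                }
            }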

def all_episode_lines(turn_limit=None):
    # Walk the files in reverse sorted order; optionally keep only the given turn numbers.
    for path in reversed(sorted(Path("transcripts").glob("*.txt"))):
        for line in episode_lines(path):
            if not turn_limit or line['meta']['turn'] in turn_limit:
                yield line


lines = all_episode_lines()
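
To make the cleaning step a bit more concrete, here is a minimal sketch of what happens to a single line. The `parse_line` helper and the sample line are made up for illustration; the real transcript lines may look different.

```python
import re

TIMESTAMP = re.compile(r"([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")

def parse_line(line):
    """Split one transcript line into speaker and text, mirroring episode_lines."""
    if not TIMESTAMP.match(line):
        return None  # lines without a leading timestamp are skipped
    rest = TIMESTAMP.sub("", line[8:])  # drop the 8-character timestamp prefix
    if ":" in rest:
        speaker, _, text = rest.partition(":")  # split on the first colon only
    else:
        speaker, text = "", rest
    return {"speaker": speaker.strip(), "text": text.strip().replace("\xa0", " ")}

# Hypothetical line; the real transcript format may differ.
sample = "00:01:23 Host: Welcome to the show."
parsed = parse_line(sample)
print(parsed)
```

Lines that don't start with a timestamp simply return `None` here, which corresponds to the lines that `episode_lines` never yields.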

Let's first just iterate over some of the data to see spaCy in action. The results won't be perfect, but it's nice to see what you can get out of the box. 

In [31]:
lines = all_episode_lines(turn_limit=[25, 30, 31, 32])

In [32]:
next(lines)

{'text': "So one of the things you've been up to in addition to courses is writing books, Django in action, almost released. Is that right? What's the status?",
 'meta': {'speaker': '',
  'file': 'transcripts/437-htmx-for-django-developers-and-all-of-us.txt',
  'turn': 25}}

You might look at my code and wonder ... "why generators"? If you're a data scientist you might be used to thinking in dataframes ... so why does it make sense to use generators here? 

In short, it just turns out to be slightly more convenient. Text can contain many nested items that we're interested in, like entities and sentences. That nested structure already makes it somewhat inconvenient to use a flat data structure like a table, so we're more likely to use a sequence of dictionaries.

You could also use a list for that, but if you're dealing with _huge_ quantities of text ... then a lazy approach might make more sense. 

The code from before ... note how it only has to hold one file in memory at a time? Stuff like that is _really_ nice. And this is why spaCy works so well with generators.
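
A small sketch of that laziness, using made-up utterances instead of the transcripts. The `produced` list is only there to record which items the generator has actually built.

```python
import itertools as it

produced = []  # records which items the generator has actually built

def numbered_turns(texts):
    """Lazily wrap each text in a dictionary, one item at a time."""
    for turn, text in enumerate(texts, start=1):
        produced.append(turn)
        yield {"text": text, "meta": {"turn": turn}}

# Made-up utterances, standing in for transcript lines.
turns = numbered_turns(["hello", "how are you", "fine thanks", "bye"])
built_before_iterating = list(produced)  # still empty: nothing has run yet

first_two = list(it.islice(turns, 2))
print(built_before_iterating, produced)  # only the requested items get built
```

Nothing is computed until something asks for the next item, so taking a small slice of a huge dataset stays cheap.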

## Toying on Real Data 

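So let's run our `en_core_web_md` model against some examples from the transcripts, just to get a feel for what happens. The cell below assumes that `spacy` and `displacy` have been imported and that `nlp` holds the loaded `en_core_web_md` model.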

In [35]:
doc = nlp(next(lines)['text'])
displacy.render(doc, style="ent")

### Discuss 

The results are not perfect, but they're not bad when you consider that the spaCy model wasn't trained on data about Twitter and programming tools. I personally find it interesting that `Django` is often detected as a person and sometimes as a product. It's not the worst mistake if you think about it ...

So while we agree that it's not perfect, it might be fun to see how it fares on the entire dataset.

## Speed 

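If you're going to use spaCy over a bunch of examples, then you may want to use `nlp.pipe` instead of `nlp`. It processes the texts as a stream, in batches, which is usually faster than calling `nlp` on each text one by one.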

In [36]:
import itertools as it

In [37]:
%%time 
lines = all_episode_lines()

subset = it.islice(lines, 1000)
[nlp(line['text']).ents for line in subset];

CPU times: user 8.09 s, sys: 15.2 ms, total: 8.11 s
Wall time: 8.21 s


In [38]:
%%time 
lines = all_episode_lines()

subset = (ex['text'] for ex in it.islice(lines, 1000))
[doc.ents for doc in nlp.pipe(subset)];

CPU times: user 3.01 s, sys: 187 ms, total: 3.2 s
Wall time: 3.34 s


In [39]:
%%time 
lines = all_episode_lines()

subset = ((ex['text'], ex) for ex in it.islice(lines, 1000))
[doc.ents for doc, ex in nlp.pipe(subset, as_tuples=True)];

CPU times: user 3.03 s, sys: 123 ms, total: 3.15 s
Wall time: 3.41 s
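
The last cell uses `as_tuples=True`, which lets you send `(text, context)` pairs through `nlp.pipe` and get `(doc, context)` pairs back, so the metadata stays attached to each document. Here is a minimal sketch with made-up examples. It uses a blank pipeline, which is an assumption made to keep the example light: it only tokenizes and finds no entities.

```python
import spacy

# A blank pipeline only tokenizes, so no model download is needed here.
nlp = spacy.blank("en")

# Made-up examples in the same shape that episode_lines yields.
examples = [
    {"text": "I like Django.", "meta": {"turn": 1}},
    {"text": "Flask is nice too.", "meta": {"turn": 2}},
]

# Each item is a (text, context) pair; the context can be any Python object.
pairs = ((ex["text"], ex["meta"]) for ex in examples)
results = [(doc.text, meta["turn"]) for doc, meta in nlp.pipe(pairs, as_tuples=True)]
print(results)
```

This is handy when you later want to trace an entity back to the file and turn it came from.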


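Using `nlp.pipe` cut the wall time for 1000 lines from 8.21 s to about 3.4 s, and passing the metadata along with `as_tuples=True` costs next to nothing. There's another speedup available too: we can choose to load the spaCy model with only the components that we _really_ need.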

In [40]:
nlp = spacy.load("en_core_web_md", enable=["ner"])

This enables only the `ner` component in the pipeline and disables everything else that we may not need.

In [41]:
%%time 
lines = all_episode_lines()

subset = ((ex['text'], ex) for ex in it.islice(lines, 1000))
[doc.ents for doc, ex in nlp.pipe(subset, as_tuples=True)];

CPU times: user 1.2 s, sys: 175 ms, total: 1.37 s
Wall time: 1.42 s


So again we see that it's a good deal quicker: the wall time drops from 3.41 s to 1.42 s.

## Hacking on Python Packages

Let's now check if spaCy can detect Python packages by having it detect `PRODUCT` entities. It's not going to be a perfect mapping, but you might be able to imagine how spaCy might mistake a Python package for a product given how it is used in a sentence linguistically. 

In [42]:
import itertools as it 
from collections import Counter 

lines = (line['text'] for line in all_episode_lines())
n_lines = sum(1 for _ in all_episode_lines())
n_lines

84396
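
Note that the cell above calls `all_episode_lines()` twice. That is on purpose: counting the items walks through the whole generator, and a generator can only be consumed once. A minimal sketch with a made-up stand-in generator:

```python
def utterances():
    """Stand-in for all_episode_lines, yielding made-up data."""
    yield from ["hello", "how are you", "bye"]

lines = utterances()
n_lines = sum(1 for _ in lines)  # counting walks through every item
leftover = list(lines)           # the same generator is now exhausted

fresh = list(utterances())       # a new generator starts from the top
print(n_lines, leftover, fresh)
```

So if you need both a count for the progress bar and the data itself, you create the generator once for each.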

In [43]:
from tqdm import tqdm

In [44]:
%%time 

counter = Counter()
for doc in nlp.pipe(tqdm(lines, total=n_lines)):
    product_entities = [ent.text for ent in doc.ents if ent.label_ == "PRODUCT"]
    counter.update(Counter(product_entities))

  0%|          | 0/84396 [00:00<?, ?it/s]

100%|██████████| 84396/84396 [02:23<00:00, 588.42it/s] 


CPU times: user 2min 8s, sys: 12.7 s, total: 2min 21s
Wall time: 2min 23s


In [45]:
counter.most_common(30)

[('Linux', 577),
 ('Excel', 547),
 ('Windows', 517),
 ('Python', 425),
 ('Django', 307),
 ('JavaScript', 269),
 ('C++', 223),
 ('VS', 150),
 ('Docker', 141),
 ('Flask', 115),
 ('SQLAlchemy', 96),
 ('Google Play', 86),
 ('Matplotlib', 83),
 ('Emacs', 78),
 ('Pyodide', 70),
 ('Perl', 69),
 ('MATLAB', 69),
 ('VS Code', 61),
 ('Cython', 57),
 ('Apache', 47),
 ('FastAPI', 46),
 ('Celery', 45),
 ('Talkpython', 40),
 ('Pandas', 38),
 ('Twitter', 35),
 ('Reddit', 35),
 ('Fast API', 33),
 ('CS', 32),
 ('Sanic', 31),
 ('Keras', 29)]
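
The counting loop relies on `Counter.update` adding to the existing counts rather than replacing them. A minimal sketch with made-up entity texts; note that you can pass the list directly, which gives the same result as wrapping it in a `Counter` first.

```python
from collections import Counter

# Made-up entity texts per document, standing in for doc.ents.
per_doc_entities = [["Django", "Flask"], [], ["Django", "Django"]]

counter = Counter()
for entities in per_doc_entities:
    counter.update(entities)  # adds to existing counts instead of replacing them

top = counter.most_common(1)
print(counter, top)
```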

## Another Speedup 

Our approach can be faster if we spread the work over a few more cores with `n_process`.

In [46]:
%%time 
lines = (line['text'] for line in all_episode_lines())
n_lines = sum(1 for _ in all_episode_lines())

counter = Counter()
for doc in nlp.pipe(tqdm(lines, total=n_lines), n_process=8):
    new_count = Counter([ent.text for ent in doc.ents if ent.label_ == "PRODUCT"])
    counter.update(new_count)

  0%|          | 0/84396 [00:00<?, ?it/s]

 83%|████████▎ | 69632/84396 [02:46<00:43, 343.15it/s]

In [None]:
counter.most_common(30)

[('Linux', 577),
 ('Excel', 547),
 ('Windows', 517),
 ('Python', 425),
 ('Django', 307),
 ('JavaScript', 269),
 ('C++', 223),
 ('VS', 150),
 ('Docker', 141),
 ('Flask', 115),
 ('SQLAlchemy', 96),
 ('Google Play', 86),
 ('Matplotlib', 83),
 ('Emacs', 78),
 ('Pyodide', 70),
 ('Perl', 69),
 ('MATLAB', 69),
 ('VS Code', 61),
 ('Cython', 57),
 ('Apache', 47),
 ('FastAPI', 46),
 ('Celery', 45),
 ('Talkpython', 40),
 ('Pandas', 38),
 ('Twitter', 35),
 ('Reddit', 35),
 ('Fast API', 33),
 ('CS', 32),
 ('Sanic', 31),
 ('Keras', 29)]

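In our case, the speedup is not linear. Part of this can be blamed on the main loop: the `counter.update(new_count)` line runs in a single process, so it's a part of the work that can't be parallelized. On top of that, the worker processes have to send every `Doc` back to the main process, which likely adds overhead of its own.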
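
One rough way to reason about this is Amdahl's law: if only a fraction of the work can run in parallel, the serial remainder caps the speedup. The sketch below is a simplified model with hypothetical fractions, not a measurement of this notebook, and it ignores the cost of starting processes and moving data between them.

```python
def amdahl_speedup(parallel_fraction, n_process):
    """Best-case speedup when only part of the work can be parallelized."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / n_process)

# Hypothetical fractions, not measured on this notebook.
for p in (0.5, 0.9, 0.99):
    print(p, round(amdahl_speedup(p, n_process=8), 2))
```

Even when 90% of the work parallelizes perfectly, eight processes stay well short of an eightfold speedup in this model.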

## Another approach 

The approach that we just took works ... but it's a bit hacky and odds are that it's missing out on a whole bunch of packages. But we can also resort to another approach ... after all ... we have lists of popular Python packages available [to us](https://hugovk.github.io/top-pypi-packages/). The linked site is super cool: it uses the BigQuery data that has every PyPI download.

In [None]:
! wget https://hugovk.github.io/top-pypi-packages/top-pypi-packages.min.json

--2023-11-27 14:25:40--  https://hugovk.github.io/top-pypi-packages/top-pypi-packages.min.json
Resolving hugovk.github.io (hugovk.github.io)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to hugovk.github.io (hugovk.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 411246 (402K) [application/json]
Saving to: ‘top-pypi-packages.min.json.1’


2023-11-27 14:25:40 (1,14 MB/s) - ‘top-pypi-packages.min.json.1’ saved [411246/411246]



In [None]:
package_names = [ex["project"] for ex in srsly.read_json("top-pypi-packages.min.json")["rows"]]
package_names[:10]

['boto3',
 'urllib3',
 'botocore',
 'requests',
 'typing-extensions',
 'setuptools',
 'charset-normalizer',
 'certifi',
 's3transfer',
 'wheel']

If we want to detect Python packages, maybe we can just re-use this? Note that this approach won't be perfect either ... but it may cover a bunch of ground ...

## Custom Code

In [37]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "boto"}, {"LOWER": "core"}]
matcher.add("pypackage", [pattern])

doc = nlp("We use boto core a lot in our company.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

2 4 boto core


In [38]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "go", "POS": {"NOT_IN": ["VERB"]}}]
matcher.add("pypackage", [pattern])

doc = nlp("I sometimes also write some code in Go.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

7 8 Go


In [39]:
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_md")

matcher = PhraseMatcher(nlp.vocab)
terms = ["boto core", "pandas"]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("pypackage", patterns)

doc = nlp("We use boto core a lot in our company.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

2 4 boto core


We can also extend this, so that the `Doc` will actually have the right span as an entity.

In [40]:
from spacy.tokens import Span 
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_md")

matcher = PhraseMatcher(nlp.vocab)
terms = ["boto core", "pandas"]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("pypackage", patterns)

doc = nlp("We use boto core a lot in our company.")
matches = matcher(doc)
ents = []
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    ents.append(Span(doc, start, end, string_id))

doc.set_ents(ents)
displacy.render(doc, style="ent")

I wanted to show this feature because it's very flexible. You can set entities yourself with custom code all you like ... but in our case ... we could also just re-use existing spaCy components. 

## Entity Component

Besides statistical tools, spaCy also allows you to write rule-based solutions on top of its data structures.

The pipeline diagram at https://spacy.io/usage/processing-pipelines helps to explain this.

You can add an entity ruler that does token-based matching.

In [41]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "proglang", "pattern": [{"LOWER": "go", "POS": {"NOT_IN": ["VERB"]}}]}]
ruler.add_patterns(patterns)

doc = nlp("I sometimes also write some code in Go.")
displacy.render(doc, style="ent")

But phrase-based matching can also be done directly.

In [43]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "pypackage", "pattern": pkg} for pkg in package_names]
ruler.add_patterns(patterns)

doc = nlp("I used to use pandas a lot but nowadays I'm doing polars.")
displacy.render(doc, style="ent")

Neat!

In [47]:
# If we wanted to re-use this model, we could save it by calling nlp.to_disk with a folder path
nlp.to_disk

<bound method Language.to_disk of <spacy.lang.en.English object at 0x7f8545c63340>>
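
As a minimal sketch of what saving and re-using such a model could look like: the folder name below is hypothetical, and a single pattern stands in for the full package list.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "pypackage", "pattern": "pandas"}])

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "pypackage_model"  # hypothetical folder name
    nlp.to_disk(model_dir)                     # writes the config and the patterns
    reloaded = spacy.load(model_dir)           # rebuilds the pipeline from that folder

doc = reloaded("I used to use pandas a lot.")
labels = [(ent.text, ent.label_) for ent in doc.ents]
print(labels)
```

In practice you'd save to a permanent folder instead of a temporary one, so that another script or notebook can load the pipeline later.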

## Rerunning our model

Let's now apply our rule-based model to see if we capture something else.

In [44]:
[e.label_ for e in nlp("fastapi").ents]

['pypackage']

In [45]:
%%time 
lines = (line['text'] for line in all_episode_lines())
# lines = it.islice(lines, 10000)
n_lines = sum(1 for _ in all_episode_lines())

counter = Counter()
for doc in nlp.pipe(tqdm(lines, total=n_lines), n_process=8):
    new_count = Counter([ent.text for ent in doc.ents if ent.label_ == "pypackage"])
    counter.update(new_count)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84396/84396 [00:42<00:00, 1990.55it/s]


CPU times: user 49.4 s, sys: 457 ms, total: 49.9 s
Wall time: 50.3 s


In [49]:
counter.most_common(150)

[('data', 9745),
 ('sure', 4607),
 ('first', 3755),
 ('us', 3628),
 ('build', 2886),
 ('bunch', 2176),
 ('install', 1493),
 ('times', 1464),
 ('ago', 1299),
 ('control', 791),
 ('click', 731),
 ('pip', 694),
 ('future', 657),
 ('portion', 629),
 ('pick', 608),
 ('requests', 607),
 ('six', 575),
 ('email', 569),
 ('notebook', 559),
 ('path', 514),
 ('image', 507),
 ('dependencies', 501),
 ('style', 484),
 ('update', 440),
 ('moment', 439),
 ('eight', 438),
 ('databases', 412),
 ('public', 376),
 ('workflow', 373),
 ('pandas', 357),
 ('typing', 346),
 ('connect', 340),
 ('events', 316),
 ('flask', 312),
 ('packaging', 300),
 ('coverage', 294),
 ('result', 290),
 ('black', 278),
 ('pattern', 273),
 ('higher', 268),
 ('names', 258),
 ('waiting', 251),
 ('mode', 243),
 ('binary', 239),
 ('distributed', 238),
 ('progress', 234),
 ('segments', 216),
 ('rules', 215),
 ('schema', 210),
 ('tables', 204),
 ('rich', 199),
 ('asyncio', 199),
 ('constantly', 185),
 ('art', 181),
 ('conda', 178),
 ...

A few things to note:

- This approach is _much_ faster than the `en_core_web_md` model. That's because we're just doing string matching.
- This approach seems to match a bunch of Python packages. But it's not perfect either! It seems to match a _bunch_ of things that don't feel like packages. But we can check to confirm that they actually are ... 
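
A minimal sketch of such a check. The `package_names` list here is a small hypothetical subset; in the notebook it holds the full list read from `top-pypi-packages.min.json`.

```python
# Hypothetical subset; in the notebook, package_names holds the full list
# read from top-pypi-packages.min.json.
package_names = ["requests", "six", "data", "sure", "first", "pandas"]
known = set(package_names)  # a set makes each membership check cheap

suspicious = ["data", "sure", "first", "not-a-real-package"]
confirmed = {name: name in known for name in suspicious}
print(confirmed)
```

Words like `data` and `sure` only show up as matches because packages with those names are in the list, which is exactly why plain string matching picks up so many everyday words.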