In [1]:
from __future__ import annotations

# Exercise: Using spaCy

[spaCy](https://spacy.io/) is one of many excellent NLP tools widely used in industry.

spaCy's strength is its ability to use pretrained models for a variety of NLP
tasks. Users can even provide their own labeled language data to train and
fine-tune models in spaCy!

In this exercise, we'll use one of spaCy's pretrained pipelines to tokenize and
lemmatize text, inspect parts of speech, and perform named entity recognition
(NER).

## Preparation

spaCy's pretrained models are distributed separately from the library itself, so
the first step is to download the one we need.

The line below uses spaCy to download the 
[en_core_web_sm](https://spacy.io/models/en#en_core_web_sm) pipeline. This is 
ideal since it's relatively small to download (at ~12 MB) and is optimized to
use the CPU. It also has lemmatizer and NER components.

In [3]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Next we import the `spacy` module and load the trained [pipeline](https://spacy.io/usage/spacy-101#pipelines).


![](https://spacy.io/images/pipeline.svg)

Below is the standard way the pipeline is loaded.

In [4]:
import spacy

nlp = spacy.load('en_core_web_sm')
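
If the download step was skipped, `spacy.load` raises an `OSError`. The sketch below is one possible guard, assuming that falling back to a blank English pipeline (tokenizer only, so no lemmas, tags, or entities) is acceptable. `load_or_blank` is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.util import is_package


def load_or_blank(name: str = "en_core_web_sm"):
    """Load an installed pipeline, or fall back to a blank English one.

    Hypothetical helper for illustration; the blank pipeline only tokenizes.
    """
    if is_package(name):
        return spacy.load(name)
    print(f"{name!r} is not installed; run: python -m spacy download {name}")
    return spacy.blank("en")


pipeline = load_or_blank()
# Component names in processing order; an empty list means tokenizer only
print(pipeline.pipe_names)
```

Checking `pipe_names` is a quick way to confirm that the lemmatizer and NER components used later in this exercise are actually present.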

## Processing Text


### Base Tokenizer

Use the pipeline's built-in tokenizer to split the text below into a list of
tokens.

In [5]:
text = (
    'Dr. Smith graduated from the University of Washington. '
    'He later started an analytics firm called Lux, which catered to enterprise customers.'
)
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [6]:
# TODO: Get the tokens using spaCy's tokenizer (the result only has to be iterable)
tokens = nlp.tokenizer(text)

In [7]:
# Check you can iterate over the tokens
for token in tokens:
    print(token.text)

Dr.
Smith
graduated
from
the
University
of
Washington
.
He
later
started
an
analytics
firm
called
Lux
,
which
catered
to
enterprise
customers
.
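
Calling `nlp.tokenizer` directly runs only the tokenization step and returns a `Doc`; the later pipeline components are skipped. The sketch below illustrates this with `spacy.blank("en")`, an assumption made here so the example runs without a downloaded model. It should apply the same English tokenization rules, such as keeping `Dr.` as a single token.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: language-specific tokenizer, no trained components
blank_nlp = spacy.blank("en")

sentence = "Dr. Smith graduated from the University of Washington."
tokenized = blank_nlp.tokenizer(sentence)

# The tokenizer returns a Doc, which can be iterated token by token
print(isinstance(tokenized, Doc))
token_texts = [token.text for token in tokenized]
print(token_texts)
```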


### Lemmatization

In [8]:
text = (
    'The first time you see The Second Renaissance it may look boring. '
    'Look at it at least twice and definitely watch part 2. '
    'It will change your view of the Matrix. '
    'Are the human people the ones who started the war? Is AI a bad thing ?'
)
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the Matrix. Are the human people the ones who started the war? Is AI a bad thing ?


In [9]:
diffs: list[tuple[str, str]] = []
# TODO: Use spaCy to compare the tokens before and after lemmatization
doc = nlp(text)
for token in doc:
    orig = token.text
    lemma = token.lemma_
    if orig != lemma:
        diff = (orig, lemma)
        diffs.append(diff)

In [10]:
# Check the differences between your tokens
print(f'There are {len(diffs)} differences between the tokens before and after lemmatization')

for orig, lemma in diffs:
    print(orig, lemma)

There are 8 differences between the tokens before and after lemmatization
The the
The the
Look look
It it
Are be
ones one
started start
Is be
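
Half of the differences above are only changes in capitalization (`The` to `the`, `It` to `it`). As a rough sketch, the printed pairs can be filtered so that only lemmas differing beyond letter case remain. The `printed_diffs` list is copied from the output above, and `beyond_case` is a hypothetical helper name.

```python
from __future__ import annotations

# (original, lemma) pairs copied from the output above
printed_diffs = [
    ("The", "the"), ("The", "the"), ("Look", "look"), ("It", "it"),
    ("Are", "be"), ("ones", "one"), ("started", "start"), ("Is", "be"),
]


def beyond_case(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep pairs whose lemma differs from the token by more than letter case."""
    return [(orig, lemma) for orig, lemma in pairs if orig.lower() != lemma.lower()]


content_diffs = beyond_case(printed_diffs)
for orig, lemma in content_diffs:
    print(orig, lemma)
```

The same comparison could be applied inside the loop over `doc` by testing `token.text.lower() != token.lemma_.lower()`.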


## Observing Attributes

### Parts of Speech

We'll now look more closely at the tokens, specifically at the part of speech the
pipeline assigns to each one.

In [11]:
text = (
    'Dr. Smith graduated from the University of Washington. '
    'He later started an analytics firm called Lux, which catered to enterprise customers.'
)
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [13]:
# Values are spaCy Token objects, which print as their text
pos_groupings: dict[str, list[spacy.tokens.Token]] = {}
# TODO: Group the tokens by parts of speech
doc = nlp(text)
for token in doc:
    # Create the list the first time a tag is seen, then append to it
    pos_groupings.setdefault(token.pos_, []).append(token)

In [14]:
# Check your part of speech groupings
for pos_tag, tokens in pos_groupings.items():
    print(pos_tag)
    print('\t', tokens)

PROPN
	 [Dr., Smith, University, Washington, Lux]
VERB
	 [graduated, started, called, catered]
ADP
	 [from, of, to]
DET
	 [the, an]
PUNCT
	 [., ,, .]
PRON
	 [He, which]
ADV
	 [later]
NOUN
	 [analytics, firm, enterprise, customers]


### Named Entity Recognition

First, determine which entity labels the loaded model uses.

In [15]:
# TODO: List the different entity labels used by the spaCy pipeline being used
nlp.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')
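
Several of these labels are abbreviations. `spacy.explain` returns a short description for terms in spaCy's built-in glossary. In this small sketch, the label tuple is copied from the output above so that no model has to be loaded.

```python
import spacy

# Entity labels copied from the output above
labels = (
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
)

# spacy.explain returns None for terms missing from the glossary
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<12} {description}")
```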

Next, take the following text and list all the tokens that have an associated
entity label, along with the type of entity each one is according to the
pipeline.

In [28]:
text = (
    'The first time you see The Second Renaissance it may look boring. '
    'Look at it at least twice and definitely watch part 2. '
    'It will change your view of the Matrix. '
    'Are the human people the ones who started the war? Is AI a bad thing ?'
)
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the Matrix. Are the human people the ones who started the war? Is AI a bad thing ?


In [29]:
# TODO: List only the tokens that are entities along with their labels
doc = nlp(text)

# doc.ents yields entity spans (possibly several tokens each), not single tokens
for ent in doc.ents:
    print(f'Text: {ent.text}\n\t Entity: {ent.label_}')


Text: first
	 Entity: ORDINAL
Text: The Second Renaissance
	 Entity: ORG
Text: 2
	 Entity: CARDINAL
Text: the Matrix
	 Entity: WORK_OF_ART


In [30]:
print('\n', 10*'-', 'Alternative', '-'*10)
# Token-level view: each token of a multi-token entity is listed separately
for token in doc:
    if token.ent_type_:
        print(f'Text: {token.text}\n\t Entity: {token.ent_type_}')


 ---------- Alternative ----------
Text: first
	 Entity: ORDINAL
Text: The
	 Entity: ORG
Text: Second
	 Entity: ORG
Text: Renaissance
	 Entity: ORG
Text: 2
	 Entity: CARDINAL
Text: the
	 Entity: WORK_OF_ART
Text: Matrix
	 Entity: WORK_OF_ART
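
The token-level view splits multi-token entities such as `The Second Renaissance`. Each token also carries an `ent_iob_` flag (`B` begins an entity, `I` continues one, `O` is outside), which allows the pieces to be regrouped. This is a minimal sketch, assuming the entities are assigned by hand on a blank English pipeline to mimic two of the results above rather than predicted by a model. `regroup_entities` is a hypothetical helper.

```python
import spacy
from spacy.tokens import Span


def regroup_entities(doc):
    """Rebuild (entity text, label) pairs from token-level IOB flags."""
    entities = []            # finished (text, label) pairs
    pieces, label = [], ""   # tokens of the entity currently being built
    for token in doc:
        if token.ent_iob_ == "I" and pieces:
            pieces.append(token.text_with_ws)
            continue
        if pieces:  # the previous entity has ended
            entities.append(("".join(pieces).strip(), label))
            pieces, label = [], ""
        if token.ent_iob_ == "B":
            pieces, label = [token.text_with_ws], token.ent_type_
    if pieces:  # flush an entity that ends the document
        entities.append(("".join(pieces).strip(), label))
    return entities


blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("The first time you see The Second Renaissance")
# Hand-assigned entities (an assumption for this demo, not model predictions)
demo_doc.ents = [
    Span(demo_doc, 1, 2, label="ORDINAL"),
    Span(demo_doc, 5, 8, label="ORG"),
]

regrouped = regroup_entities(demo_doc)
print(regrouped)
```

In practice `doc.ents` already provides these spans; the sketch only shows how the two views relate.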
