In [1]:
from __future__ import annotations

# Exercise: Using spaCy

[spaCy](https://spacy.io/) is one of many NLP tools widely used in industry.

spaCy's strength is its ability to use pretrained models for a variety of NLP
tasks. Users can even provide their own labeled language data to train and
fine-tune models in spaCy!

In this exercise, we'll use one of spaCy's pretrained pipelines to tokenize and
lemmatize text, and to perform tasks such as part-of-speech tagging and named
entity recognition (NER).

## Preparation

spaCy's trained pipelines are distributed separately from the library itself, so
the first step is to download the one we want to use.

The line below uses spaCy to download the 
[en_core_web_sm](https://spacy.io/models/en#en_core_web_sm) pipeline. This is 
ideal since it's relatively small to download (at ~12 MB) and is optimized to
use the CPU. It also has lemmatizer and NER components.

In [2]:
# Uncomment the line below to download the pipeline (only needed once)
# ! python -m spacy download en_core_web_sm
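
Trained pipelines are installed as ordinary Python packages, so one way to avoid downloading twice is to check whether the package can be found first. The sketch below uses only the standard library; the helper name `is_model_installed` is an assumption made for illustration and is not part of spaCy.

```python
import importlib.util


def is_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package called `name` can be found."""
    # find_spec returns None when no package with this name is installed
    return importlib.util.find_spec(name) is not None


if not is_model_installed():
    print("Model not found; run: python -m spacy download en_core_web_sm")
```

If the check fails, uncommenting the download line above and running the cell once should be enough.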

Next we import the `spacy` module and load the trained [pipeline](https://spacy.io/usage/spacy-101#pipelines).


![](https://spacy.io/images/pipeline.svg)

Below is the standard way the pipeline is loaded.

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')

## Processing Text


### Base Tokenizer

Use the pipeline's built-in tokenizer to split the text below into tokens.

In [None]:
text = (
    'Dr. Smith graduated from the University of Washington. '
    'He later started an analytics firm called Lux, which catered to enterprise customers.'
)
print(text)

In [5]:
# TODO: Get the tokens using spaCy's tokenizer (the result only has to be iterable)
tokens = nlp.tokenizer(text)


In [None]:
# Check you can iterate over the tokens
for token in tokens:
    print(token.text)

### Lemmatization

In [None]:
text = (
    'The first time you see The Second Renaissance it may look boring. '
    'Look at it at least twice and definitely watch part 2. '
    'It will change your view of the Matrix. '
    'Are the human people the ones who started the war? Is AI a bad thing ?'
)
print(text)

In [None]:
# TODO: Use spaCy to compare the tokens before and after lemmatization
doc = nlp(text)
diffs = []
for token in doc:
    orig = token.text 
    lemma = token.lemma_
    if orig != lemma:
        print(orig, lemma)
        diff = (orig, lemma)
        diffs.append(diff)

print(len(diffs))


## Observing Attributes

### Parts of Speech

We'll now look more closely at the tokens, specifically at the part of speech
the pipeline assigns to each one.

In [None]:
text = (
    'Dr. Smith graduated from the University of Washington. '
    'He later started an analytics firm called Lux, which catered to enterprise customers.'
)
print(text)

In [10]:
# TODO: Group the tokens by parts of speech
pos_groupings: dict[str, list[str]] = {}
doc = nlp(text)

for token in doc:
    # Store the token's text so the values match the list[str] annotation
    pos_groupings.setdefault(token.pos_, []).append(token.text)



In [None]:
for pos_tag, tokens in pos_groupings.items():
    print(pos_tag)
    print('\t', tokens)
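
The same grouping pattern can also be written with `collections.defaultdict`, which removes the need for an explicit membership check. The sketch below runs without spaCy: the `(word, tag)` pairs are hand-written stand-ins for `(token.text, token.pos_)` and are not actual pipeline output, and the helper name `group_by_tag` is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def group_by_tag(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group words by their tag, keeping first-seen order."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for word, tag in pairs:
        # Missing keys are created automatically with an empty list
        groups[tag].append(word)
    return dict(groups)


# Hypothetical (word, POS tag) pairs, written by hand for illustration
tagged = [
    ("Dr.", "PROPN"),
    ("Smith", "PROPN"),
    ("graduated", "VERB"),
    ("from", "ADP"),
    ("the", "DET"),
]

for tag, words in group_by_tag(tagged).items():
    print(tag, words)
```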

### Named Entity Recognition

First, determine which entity labels the loaded model uses.

In [None]:
# TODO: List the different entity labels used by the spaCy pipeline being used
nlp.get_pipe("ner").labels

Next, take the following text and list all the tokens that have an associated
entity label, along with the entity type the pipeline assigns to each.

In [13]:
text = (
    'The first time you see The Second Renaissance it may look boring. '
    'Look at it at least twice and definitely watch part 2. '
    'It will change your view of the Matrix. '
    'Are the human people the ones who started the war? Is AI a bad thing ?'
)

In [None]:
# TODO: List only the tokens that are entities along with their labels
doc = nlp(text)

# doc.ents yields entity spans, each of which may cover several tokens
for ent in doc.ents:
    print(f'Text: {ent.text}\n\t Entity: {ent.label_}')

# Alternative: check each token's entity type (an empty string means no entity)
print('\n', 10*'-', 'ALTERNATIVE', '-'*10)
for token in doc:
    if token.ent_type_:
        print(f'Text: {token.text}\n\t Entity: {token.ent_type_}')
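
The two loops differ in granularity: `doc.ents` yields whole spans, while the token loop prints a multi-token entity one token at a time. spaCy also exposes `token.ent_iob_` (`'B'`, `'I'`, or `'O'`), which marks where an entity begins and continues. As a rough sketch of how token-level tags map back to spans, the code below merges hand-written `(text, iob, label)` triples. The triples and the helper `merge_entities` are assumptions for illustration, not output from the pipeline, and joining words with a single space is a simplification.

```python
from __future__ import annotations


def merge_entities(tagged: list[tuple[str, str, str]]) -> list[tuple[str, str]]:
    """Merge (text, iob, label) triples into (entity text, label) pairs."""
    entities: list[tuple[str, str]] = []
    words: list[str] = []
    label = ""
    for word, iob, ent_label in tagged:
        if iob == "I" and words:
            # Continue the entity that is currently open
            words.append(word)
            continue
        if words:
            # Any other tag closes the open entity
            entities.append((" ".join(words), label))
            words = []
        if iob == "B":
            # Start a new entity
            words, label = [word], ent_label
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written tags for illustration; a real pipeline may label these differently
example = [
    ("the", "O", ""),
    ("University", "B", "ORG"),
    ("of", "I", "ORG"),
    ("Washington", "I", "ORG"),
    (".", "O", ""),
    ("Lux", "B", "ORG"),
]
print(merge_entities(example))
```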