## Importing spaCy

In [1]:
# !conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm
# The small model above is recommended: the large one below is more accurate but less efficient
# !python -m spacy download en_core_web_lg

In [2]:
import spacy


nlp = spacy.load("en_core_web_sm")
# You can also load en_core_web_lg, which has a higher accuracy but is less efficient
# nlp = spacy.load("en_core_web_lg")




In [3]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f8850258430>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f8860afcbe0>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f8860afcca0>)]



In [4]:
# Process sentences 'Hello, world. Antonio is learning Python.' using spaCy
doc = nlp(u"Hello, world. Antonio is learning Python.")


In [5]:
for token in doc:
    print(token.text)

Hello
,
world
.
Antonio
is
learning
Python
.



In [6]:
# Get first token of the processed document
token = doc[0]
print(token)

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello
Hello, world.
Antonio is learning Python.



In [7]:
tokens = nlp("Let's go to N.Y.!")


In [8]:
for token in tokens:
    print(token.text)

Let
's
go
to
N.Y.
!



As you have seen, calling `nlp` (the object returned by `spacy.load("en_core_web_sm")`) on a sentence gives you its tokenized version. If you only want the instance of the `Tokenizer` class, you can run:

In [9]:
tokenizer = nlp.tokenizer
type(tokenizer)

spacy.tokenizer.Tokenizer


If you want to instantiate a custom one, with its own rules, prefixes, and so on:

In [10]:
from spacy.tokenizer import Tokenizer

tokenizer = Tokenizer(vocab=nlp.vocab)


The tokenizer defined above only receives the vocabulary: it has no prefix, suffix, or special-case rules, so it splits on whitespace only.
Let's test it on "Let's go to N.Y.!":

In [11]:
tokens = tokenizer("Let's go to N.Y.!")
for token in tokens:
    print(token)



Let's
go
to
N.Y.!



As you can see, it doesn't separate the trailing "!" from "N.Y.". We can add prefix and suffix rules for this:

In [12]:
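prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
# Note: the suffix regex is built from the prefix patterns here, which is what the
# outputs below reflect; nlp.Defaults.suffixes holds spaCy's default suffix patterns.
suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.prefixes)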


In [13]:
tokenizer = Tokenizer(
    vocab=nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search
)


In [14]:
tokens = tokenizer("Let's go to N.Y.!")
for token in tokens:
    print(token)

Let's
go
to
N.Y.
!
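
To get a feeling for what `prefix_search` and `suffix_search` do, here is a minimal sketch in plain Python. It is not spaCy's implementation (which also handles infixes, special cases, and more); the function `simple_tokenize` and the two tiny patterns are assumptions made up for illustration.

```python
import re

# Hypothetical, deliberately tiny patterns; spaCy's defaults are far larger.
PREFIX_RE = re.compile(r"^[$(]")     # one prefix character at the start
SUFFIX_RE = re.compile(r"[!?,)]$")   # one suffix character at the end


def simple_tokenize(text):
    """Split on whitespace, then peel prefixes and suffixes off each chunk."""
    tokens = []
    for chunk in text.split():
        suffixes = []
        while chunk:
            prefix = PREFIX_RE.search(chunk)
            if prefix:
                tokens.append(prefix.group())
                chunk = chunk[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(chunk)
            if suffix:
                suffixes.append(suffix.group())
                chunk = chunk[:suffix.start()]
                continue
            break  # nothing left to peel off
        if chunk:
            tokens.append(chunk)
        # Suffixes were collected right-to-left, so emit them reversed.
        tokens.extend(reversed(suffixes))
    return tokens


print(simple_tokenize("Let's go to N.Y.!"))
# ["Let's", 'go', 'to', 'N.Y.', '!']
```

Because "." is not in the sketch's suffix pattern, "N.Y." stays in one piece while "!" is split off, which mirrors the output above.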



You can also check the exceptions the tokenizer can handle:

In [15]:
from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS

TOKENIZER_EXCEPTIONS.values()

dict_values([[{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'m", 73: 'be', 67: 'am', 75: 'VBP'}], [{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: 'm', 73: 'be', 75: 'VBP'}], [{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'m", 73: 'be', 67: 'am'}, {65: 'a', 73: 'going to', 67: 'gonna'}], [{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: 'm', 73: 'be', 67: 'am'}, {65: 'a', 73: 'going to', 67: 'gonna'}], [{65: 'I', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'m", 73: 'be', 67: 'am', 75: 'VBP'}], [{65: 'I', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: 'm', 73: 'be', 75: 'VBP'}], [{65: 'I', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'m", 73: 'be', 67: 'am'}, {65: 'a', 73: 'going to', 67: 'gonna'}], [{65: 'I', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: 'm', 73: 'be', 67: 'am'}, {65: 'a', 73: 'going to', 67: 'gonna'}], [{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'ll", 73: 'will', 67: 'will', 75: 'MD'}], [{65: 'i', 73: '-PRON-', 67: 'i', 75: 'PRP'}, {65: "'ll", 73: 'will', 67: 'will',


In [16]:
tokens = tokenizer("This is a $STOCK.")
for token in tokens:
    print(token)

This
is
a
$
STOCK.



You can add special prefixes in the form of regular expressions:

In [17]:
custom_prefixes = nlp.Defaults.prefixes + (r"\$[a-zA-Z]+",)


In [18]:
prefix_re = spacy.util.compile_prefix_regex(custom_prefixes)




In [19]:
import re

# Use only the custom pattern as the prefix rule, replacing the regex compiled above
prefix_re = re.compile(r"\$[a-zA-Z]+")
tokenizer = Tokenizer(
    nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search
)

tokens = tokenizer("This is a $STOCK.")
for token in tokens:
    print(token)

This
is
a
$STOCK
.



You can also add special-case tokenization rules. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guides on [language data](https://spacy.io/usage/linguistic-features#language-data) and [tokenizer special cases](https://spacy.io/usage/linguistic-features#special-cases) for more details and examples.

In [20]:
from spacy.attrs import ORTH, NORM

# The ORTH values of a special case must join back into the original string
dont_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
gimme_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me", NORM: "me"}]
tokenizer.add_special_case("don't", dont_case)
tokenizer.add_special_case("gimme", gimme_case)

# Print the normalized form of each token
tokens = tokenizer("Yo! gimme five!")
for token in tokens:
    print(token.norm_)
tokens = tokenizer("You don't do that")
for token in tokens:
    print(token.norm_)


yo
!
give
me
five
!
you
do
not
do
that



When you load a model with pretrained NER (Named Entity Recognition), like `en_core_web_sm`, you can make the pipeline merge the tokens of each entity it finds into a single token. Let's check which components the `nlp` pipeline runs:


In [21]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f8850258430>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f8860afcbe0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8860afcca0>)]


There's a tagger, a dependency parser, and an entity recognizer. Let's check the entities in the following sentence:

In [22]:
doc = nlp("Apple is a $1000b company.")



In [23]:
for token in doc:
    print(token)

Apple
is
a
$
1000b
company
.



In [24]:
for ent in doc.ents:
    print(ent, ent.label_)

Apple ORG
1000b MONEY



In [25]:
doc = nlp(
    "This is Strive School. It's worthy to merge 'Strive School' as a single token instead of two"
)

for token in doc:
    print(token)

This
is
Strive
School
.
It
's
worthy
to
merge
'
Strive
School
'
as
a
single
token
instead
of
two



In [26]:
for ent in doc.ents:
    print(ent, ent.label_)

Strive School ORG
Strive School' ORG
two CARDINAL



Let's add `merge_entities` to the pipeline (it only has an effect if the pipeline includes an entity recognizer):

In [27]:
nlp.add_pipe(nlp.create_pipe("merge_entities"))


In [28]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f8850258430>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f8860afcbe0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8860afcca0>),
 ('merge_entities', <function spacy.pipeline.functions.merge_entities(doc)>)]


In [29]:
doc = nlp(
    "This is Strive School. It's worthy to merge 'Strive School' as a single token instead of two"
)

for token in doc:
    print(token)

This
is
Strive School
.
It
's
worthy
to
merge
'
Strive School'
as
a
single
token
instead
of
two



In [30]:
TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]



In [31]:
for sentence in nlp.pipe(TEXTS):
    for token in sentence:
        print(token)
    print("------------------")

Net
income
was
$9.4 million
compared
to
the prior year
of
$2.7 million
.
------------------
Revenue
exceeded
twelve billion dollars
,
with
a
loss
of
$
1b
.
------------------
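
Conceptually, merging entities means replacing each entity span with one token. A rough sketch in plain Python, assuming tokens are plain strings and entities are given as `(start, end)` token indices; the helper `merge_spans` is hypothetical and not part of spaCy, which also keeps the original whitespace and token attributes:

```python
def merge_spans(tokens, spans):
    """Join the tokens inside each (start, end) span into a single token.

    `spans` holds non-overlapping token index pairs, with `end` exclusive.
    """
    merged = []
    position = 0
    for start, end in sorted(spans):
        merged.extend(tokens[position:start])        # tokens before the span
        merged.append(" ".join(tokens[start:end]))   # the span as one token
        position = end
    merged.extend(tokens[position:])                 # tokens after the last span
    return merged


tokens = ["This", "is", "Strive", "School", "."]
print(merge_spans(tokens, [(2, 4)]))
# ['This', 'is', 'Strive School', '.']
```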



It's also possible to merge each noun chunk into a single token:

In [32]:
nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))


In [33]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f8850258430>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f8860afcbe0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8860afcca0>),
 ('merge_entities', <function spacy.pipeline.functions.merge_entities(doc)>),
 ('merge_noun_chunks',
  <function spacy.pipeline.functions.merge_noun_chunks(doc)>)]


In [34]:
doc = nlp("Hello, I'm Antonio Marsella, nice to meet you.")
for token in doc:
    print(token)

Hello
,
I
'm
Antonio Marsella
,
nice
to
meet
you
.



## Removing stop words

In general, it's convenient to remove the stop words, *i.e. very common words in a language*, because they contribute little to many NLP problems, such as semantic analysis.

In [35]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print("Number of stop words: %d" % len(spacy_stopwords))
print("First ten stop words: %s" % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['front', 'within', 'down', 'besides', 'each', 'until', 'more', 'take', 'have', 'various']



To remove them:

In [36]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

doc = nlp(text)

tokens = [token.text for token in doc if not token.is_stop]
for token in tokens:
    print(token)

determined
drop
his litigation
the monastry
,
relinguish
his claims
wood
-
cuting


fishery rihgts
.
ready
this becuase
the rights
valuable
,


indeed the vaguest idea
the wood
river
question
.
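
Conceptually, stop-word removal is just a membership test against a set. A minimal sketch in plain Python, assuming a tiny hand-picked `STOP_WORDS` set and a hypothetical helper `remove_stop_words` (neither comes from spaCy):

```python
# A tiny, hypothetical stop-word set; spaCy's English list has a few hundred entries.
STOP_WORDS = {"he", "to", "his", "with", "the", "and", "at", "was"}


def remove_stop_words(words):
    """Keep only the words that are not stop words, ignoring case."""
    return [word for word in words if word.lower() not in STOP_WORDS]


words = "He determined to drop his litigation with the monastry".split()
print(remove_stop_words(words))
# ['determined', 'drop', 'litigation', 'monastry']
```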



To add custom stop words:

In [39]:
customize_stop_words = ["computing", "filtered"]
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True


## Stemming and Lemmatization

In most natural languages, a root word can have many variants. For example, the word ‘play’ can be used as ‘playing’, ‘played’, ‘plays’, etc. You can think of similar examples (and there are plenty).

**Stemming**

Let’s first understand stemming:

Stemming is a text normalization technique that cuts off the end or the beginning of a word, using a list of common prefixes and suffixes that could be found in that word.
It is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

**Lemmatization**

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of words) and morphological analysis (word structure and grammatical relations).

A stemming algorithm works by cutting the suffix or prefix off a word. Lemmatization is a more powerful operation because it takes the morphological analysis of the word into account.

Lemmatization returns the lemma, which is the root word of all its inflected forms.

In short, stemming is a quick and dirty method of chopping words down to their root form, while lemmatization is a more careful operation that uses dictionaries built with in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.
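
The difference can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `naive_stem` strips the suffixes listed above, and `naive_lemmatize` looks words up in a tiny made-up table (`LEMMAS`) that stands in for a real lemma dictionary. Neither function comes from spaCy or NLTK.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # the suffixes mentioned above


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"playing": "play", "plays": "play", "studies": "study", "was": "be"}


def naive_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)


for word in ["playing", "plays", "studies", "was"]:
    print(word, naive_stem(word), naive_lemmatize(word))
# playing play play
# plays play play
# studies studi study
# was was be
```

The stemmer produces "studi", which is not a word, and cannot relate "was" to "be"; the dictionary-based lookup handles both, at the cost of needing the dictionary.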

In [40]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("merge_entities"))
# not using merge_noun_chunks
doc = nlp(
    u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""
)

lemma_word1 = []
for token in doc:
    if token.is_stop:
        continue
    lemma_word1.append(token.lemma_)
lemma_word1

['determine',
 'drop',
 'litigation',
 'monastry',
 ',',
 'relinguish',
 'claim',
 'wood',
 '-',
 'cut',
 '\n',
 'fishery',
 'rihgts',
 '.',
 'ready',
 'becuase',
 'right',
 'valuable',
 ',',
 '\n',
 'vague',
 'idea',
 'wood',
 'river',
 'question',
 '.']


## Removing the punctuation



In [41]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""


import string

text_no_punct = "".join([char for char in text if char not in string.punctuation])

text_no_punct

'He determined to drop his litigation with the monastry and relinguish his claims to the woodcuting and \nfishery rihgts at once He was the more ready to do this becuase the rights had become much less valuable and he had \nindeed the vaguest idea where the wood and river in question were'
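
Note that deleting punctuation glues "wood-cuting" into "woodcuting" in the output above. If you would rather keep the two parts as separate words, one possible variation is to replace punctuation with spaces. A small sketch using only the standard library (the helper name `strip_punctuation` is an assumption for illustration):

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())


print(strip_punctuation("claims to the wood-cuting and fishery rihgts at once."))
# claims to the wood cuting and fishery rihgts at once
```

Since `split()` breaks on any whitespace, this version also removes the newlines inside the text.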


In [42]:
doc = nlp(text_no_punct)
for token in doc:
    print(token)

He
determined
to
drop
his
litigation
with
the
monastry
and
relinguish
his
claims
to
the
woodcuting
and


fishery
rihgts
at
once
He
was
the
more
ready
to
do
this
becuase
the
rights
had
become
much
less
valuable
and
he
had


indeed
the
vaguest
idea
where
the
wood
and
river
in
question
were



For text extracted from dialogues or chats, it is convenient to preprocess the text so that runs of the same character get condensed into one or two, and then use a spell checker to find the correct form of each word.

One way to do that is to replace every run of a repeated character with a single one and then use a spell checker: "hhheeelllllooo hoooowww areee youuu?" becomes "helo how are you?", and the spell checker would then ideally turn it into "hello how are you?".




In [99]:
st = "hhheeeLLLLooo hoooowww areee youuu?????"
text = re.sub(r"(.)\1+", r"\1", st)
text

'heLo how are you?'
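
The "one or two" choice mentioned above can be made a parameter. A minimal sketch, assuming a hypothetical helper `squeeze_repeats` whose `keep` argument (at least 1) is the maximum run length to keep:

```python
import re


def squeeze_repeats(text, keep=2):
    """Shorten any run of the same character to at most `keep` characters.

    Assumes keep >= 1.
    """
    # A run longer than `keep` is one character followed by `keep` or more copies.
    pattern = re.compile(r"(.)\1{%d,}" % keep)
    return pattern.sub(lambda match: match.group(1) * keep, text)


st = "hhheeeLLLLooo hoooowww areee youuu?????"
print(squeeze_repeats(st))          # hheeLLoo hooww aree youu??
print(squeeze_repeats(st, keep=1))  # heLo how are you?
```

Keeping two characters preserves legitimate double letters (the "ll" in "hello"), but leaves more work for the spell checker; keeping one does the opposite.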


In [100]:
from spellchecker import SpellChecker

text = nlp(text)
spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown([token.text for token in text])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))


It didn't find any misspelled words (even though there was "heLo"). Let's try another spell checker:

https://github.com/fsondej/autocorrect

In [95]:
from autocorrect import Speller

spell = Speller()

spell(text.text)

'hero how are you?'


As you can see, it doesn't always work properly! Overall, however, it should improve your text.
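
One plausible reason for a result like "hero" is that several known words are a single edit away from "helo", and a frequency-based checker picks the most common one. The sketch below illustrates that idea in plain Python; the word counts in `WORD_COUNTS` are made up, and this is not necessarily how either library above works internally.

```python
import string

# Hypothetical word frequencies; a real checker derives these from a large corpus.
WORD_COUNTS = {"hello": 50, "hero": 80, "help": 30, "how": 300, "are": 400, "you": 500}


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def candidates(word):
    """Known words at edit distance 0 or 1, falling back to the word itself."""
    if word in WORD_COUNTS:
        return {word}
    return {w for w in edits1(word) if w in WORD_COUNTS} or {word}


def correct(word):
    """Pick the most frequent candidate."""
    return max(candidates(word), key=lambda w: WORD_COUNTS.get(w, 0))


print(sorted(candidates("helo")))  # ['hello', 'help', 'hero']
print(correct("helo"))             # hero
```

With these made-up counts, "hero" wins over "hello" simply because it is more frequent; context is never considered.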

If you want to create a separate lemmatizer instead of having it in the pipeline:

In [109]:
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

lemmatizer = nlp.vocab.morphology.lemmatizer
print(lemmatizer("studying", VERB))
print(lemmatizer("studying", NOUN))
print(lemmatizer("studying", ADJ))

['study']
['studying']
['studying']



In [117]:
nlp.vocab.lookups.tables

['lemma_lookup', 'lemma_rules', 'lemma_index', 'lemma_exc', 'lexeme_norm']


spaCy has no built-in stemming! However, lemmatization is enough for most tasks. As an alternative, you can use the [NLTK library](https://www.nltk.org).

## Named Entity Recognition

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`.


Example:



In [129]:
doc = nlp("Antonio works at Strive School.")


In [131]:
from spacy import displacy

displacy.render(doc, style="ent")


In [136]:
doc = nlp("Rome is a big city.")


In [137]:
displacy.render(doc, style="ent")


ORG stands for organization, GPE stands for Geopolitical Entity. Some other tags are: