# NLP Tasks using spaCy in Python

This notebook is based on the article from [Real Python](https://realpython.com/natural-language-processing-spacy-python/), which explores several natural language processing (NLP) tasks using the [spaCy](https://spacy.io/) library. spaCy converts unstructured text into structured objects that downstream code and machine learning models can process.


## Set up the environment

First, we need to install spaCy and download the English language model:

```bash
python -m venv venv
source ./venv/bin/activate
python -m pip install spacy
python -m spacy download en_core_web_sm
```


After installing spaCy, we can import it and load the model:


In [40]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x23981d62d80>

## The `Doc` object for processed text


A `Doc` object is a sequence of `Token` objects. A `Token` object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc.


In [41]:
doc = nlp("We're going to N.Y. this summer!")
doc

We're going to N.Y. this summer!

In [42]:
type(doc)

spacy.tokens.doc.Doc

In [43]:
[token.text for token in doc]

['We', "'re", 'going', 'to', 'N.Y.', 'this', 'summer', '!']

In practice, the text passed to `nlp` often comes from a file.


In [44]:
import pathlib

file_name = "data/jeepney-news.txt"
news_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
tokens = [token.text for token in news_doc]

tokens[:10]

['Contrary',
 'to',
 'the',
 'call',
 'of',
 'protesting',
 'jeepney',
 'drivers',
 'and',
 'some']
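A `Doc` behaves like a sequence: indexing returns a `Token` and slicing returns a `Span`. The sketch below illustrates this with the sentence used earlier. It assumes `spacy.blank("en")`, a tokenizer-only English pipeline, so that it runs without the downloaded model; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because indexing and slicing do not need a trained model.
nlp = spacy.blank("en")
doc = nlp("We're going to N.Y. this summer!")

first_token = doc[0]   # indexing returns a Token
phrase = doc[2:5]      # slicing returns a Span, a view into the Doc

print(type(first_token).__name__, repr(first_token.text))
print(type(phrase).__name__, repr(phrase.text))
print(len(doc), "tokens")
```

A `Span` keeps a reference to its parent `Doc`, so token attributes remain available on every token inside the slice.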

## Sentence detection


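Sentence detection divides a text into meaningful chunks by locating where each sentence starts and ends. This is called sentence segmentation or sentence boundary detection (SBD). In spaCy, the detected sentences are available through the `sents` property of a `Doc` object.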


In [45]:
sentences = list(news_doc.sents)

len(sentences)

21

In [46]:
for sentence in sentences[:5]:
    print(f"{sentence[:5]}...")

Contrary to the call of...
Jeepney and UV Express operators...
“The authority to operate...
“The said units are...
Unconsolidated individual operators may also...
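With `en_core_web_sm`, sentence boundaries come from the dependency parser. When only punctuation-based splitting is needed, spaCy also ships a rule-based `sentencizer` component. The following is a minimal sketch of that alternative, assuming a blank English pipeline and a made-up sample text:

```python
import spacy

# Assumption: a blank pipeline plus the built-in rule-based "sentencizer",
# which splits on sentence-final punctuation instead of using a parser.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical sample text, not taken from the news file above.
doc = nlp(
    "Jeepney operators were given a deadline. "
    "Some drivers protested! Will it be extended?"
)

sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The rule-based splitter is fast but does not use grammatical context, so the parser-based boundaries are usually the better default for running text.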


Custom delimiters can also be used to split a text into chunks.


In [47]:
from spacy.language import Language

ellipsis_text = (
    "Gus, can you, ... never mind, I forgot"
    " what I was saying. So, do you think"
    " we should ..."
)


@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    """Set sentence boundaries for ellipsis tokens"""
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc


custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

for i, sentence in enumerate(custom_ellipsis_sentences):
    print(i, sentence)

0 Gus, can you, ...
1 never mind, I forgot what I was saying.
2 So, do you think we should ...


See [Processing Pipelines: Simple stateless pipeline components - spaCy](https://spacy.io/usage/processing-pipelines#custom-components-simple) for more information.


## Tokenization

spaCy can break down a text into its basic units, called tokens. Tokens are the basic building blocks of a `Doc` object.


In [48]:
nlp = spacy.load("en_core_web_sm")

about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company."
)
about_doc = nlp(about_text)

for token in about_doc:
    # `idx` is the starting position of the token in the text
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
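Because `idx` is a character offset into the original string, each token can be mapped back to the exact slice of text it came from. The sketch below checks that relationship and also rebuilds the text from `text_with_ws`. It assumes a blank English pipeline and a shortened sample sentence.

```python
import spacy

# Assumption: tokenizer-only pipeline; offsets do not depend on a trained model.
nlp = spacy.blank("en")
doc = nlp("Gus Proto is a Python developer.")

for token in doc:
    start = token.idx
    end = token.idx + len(token)  # len(token) is the token's length in characters
    # The slice of the original text matches the token text exactly.
    assert doc.text[start:end] == token.text

# Joining every token with its trailing whitespace reproduces the input.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == doc.text)
```

This lossless round trip is what makes it possible to highlight or replace tokens in the original text by offset.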


In [49]:
print(
    f"{'Text with Whitespace':22}"
    f"{'Is Alphabetic?':18}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)

for token in about_doc:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):18}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )

Text with Whitespace  Is Alphabetic?    Is Punctuation?   Is Stop Word?
Gus                   True              False             False
Proto                 True              False             False
is                    True              False             True
a                     True              False             True
Python                True              False             False
developer             True              False             False
currently             True              False             False
working               True              False             False
for                   True              False             True
a                     True              False             True
London                True              False             False
-                     False             True              False
based                 True              False             False
Fintech               True              False             False
company               True              False             False
.                     False             True              False

Notice that the hyphen (-) is treated as an infix that links two words together, so spaCy splits "London-based" into three tokens: "London", "-", and "based". Replacing the hyphen with a character that is not a default infix, such as an underscore (\_), prevents spaCy from splitting the word.


In [50]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London_based Fintech company."
)

print([token.text for token in nlp(custom_about_text)[8:15]])

['for', 'a', 'London_based', 'Fintech', 'company', '.']


You can set a custom infix by creating a new `Tokenizer` object.


In [51]:
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load("en_core_web_sm")
prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)

custom_infixes = [r"_"]

infix_re = spacy.util.compile_infix_regex(
    list(custom_nlp.Defaults.infixes) + custom_infixes
)

custom_nlp.tokenizer = Tokenizer(
    custom_nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,
)

custom_tokenizer_about_doc = custom_nlp(custom_about_text)

print([token.text for token in custom_tokenizer_about_doc[8:15]])

['for', 'a', 'London', '_', 'based', 'Fintech', 'company']
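A `Tokenizer` constructed without a `rules` argument has no special cases, so contractions such as "We're" are no longer split by exception rules. One way to avoid this is to keep the existing tokenizer and replace only its `infix_finditer`. This is a sketch under the assumption that a blank English pipeline is sufficient; the sample sentence is made up.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: tokenizer-only pipeline; the same attribute exists on the
# tokenizer of a trained pipeline.
nlp = spacy.blank("en")

# Extend the default infix patterns with an underscore and swap in the
# compiled matcher, leaving prefixes, suffixes, and special cases untouched.
infix_re = compile_infix_regex(list(nlp.Defaults.infixes) + [r"_"])
nlp.tokenizer.infix_finditer = infix_re.finditer

tokens = [token.text for token in nlp("We're visiting a London_based company.")]
print(tokens)
```

Here the underscore is split off as its own token while "We're" is still handled by the tokenizer's built-in exceptions.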


## Stop words

Stop words are words that are filtered out because they are too common and carry too little information. spaCy holds a built-in list of some English stop words.


In [52]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [53]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

are
if
per
anyhow
some
perhaps
well
whatever
onto
several
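`STOP_WORDS` is a plain Python `set`, which is why the ten words printed above appear in no particular order and may differ between runs. A small sketch of importing it explicitly, listing it deterministically, and testing membership:

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Sets are unordered, so sort before slicing to get a reproducible listing.
first_five = sorted(STOP_WORDS)[:5]
print(first_five)

# Membership tests are the usual way to query the list directly.
print("the" in STOP_WORDS, "python" in STOP_WORDS)
```

The explicit import also avoids relying on `spacy.lang.en` having been loaded as a side effect of `spacy.load`.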


You can detect stop words by checking the `is_stop` property of a `Token` object.


In [54]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)

print([token for token in about_doc if not token.is_stop])

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]


## Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, called the lemma. It relies on vocabulary and morphological analysis and normally removes only inflectional endings, so "keeps" becomes "keep" and "organizing" becomes "organize".


In [55]:
conference_help_text = (
    "Gus is helping organize a developer"
    " conference on Applications of Natural Language"
    " Processing. He keeps organizing local Python meetups"
    " and several internal talks at his workplace."
)

conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
                  He : he
               keeps : keep
          organizing : organize
             meetups : meetup
               talks : talk


## Word Frequency

Since spaCy allows you to convert a text into a sequence of tokens, you can easily apply some statistical analysis to them. For example, you can count the frequency of words in a text.


In [56]:
from collections import Counter

words = [
    token.text
    for token in news_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

len(words)

318

In [57]:
Counter(words).most_common(10)

[('operators', 10),
 ('April', 8),
 ('30', 7),
 ('units', 6),
 ('individual', 5),
 ('consolidate', 5),
 ('lawmakers', 4),
 ('deadline', 4),
 ('new', 4),
 ('unconsolidated', 4)]
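The counts above are case-sensitive because they use `token.text`, so "Operators" at the start of a sentence and "operators" elsewhere are tallied separately. A hedged variant that counts on the lowercase form `token.lower_` instead, using a blank English pipeline and a made-up sample text:

```python
from collections import Counter

import spacy

# Assumption: tokenizer-only pipeline; is_stop, is_punct, and is_space are
# lexical attributes and do not need a trained model.
nlp = spacy.blank("en")

# Hypothetical sample text, not the news file used above.
doc = nlp(
    "Operators must consolidate. Unconsolidated operators"
    " cannot operate. OPERATORS protested."
)

counts = Counter(
    token.lower_  # normalize case before counting
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
)
print(counts.most_common(3))
```

Counting on `token.lemma_` would merge inflected forms as well, but that requires a pipeline with a lemmatizer such as `en_core_web_sm`.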

## Part-of-speech tagging

Part-of-speech (POS) tagging assigns each word its grammatical category so that its role within the sentence can be understood. Traditional English grammar distinguishes eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.


In [58]:
for token in news_doc[:5]:
    print(
        f"{token.text:>10} {token.pos_:>5} {token.tag_:>5} {spacy.explain(token.tag_)}"
    )

  Contrary   ADJ    JJ adjective (English), other noun-modifier (Chinese)
        to   ADP    IN conjunction, subordinating or preposition
       the   DET    DT determiner
      call  NOUN    NN noun, singular or mass
        of   ADP    IN conjunction, subordinating or preposition


For POS tagging, two attributes of a `Token` object are useful: `pos_` and `tag_`. The `pos_` attribute returns the coarse-grained part-of-speech tag (for example, `ADJ`). The `tag_` attribute returns the fine-grained, language-specific tag (for example, `JJ`).


In [59]:
unique_adjectives = set(
    [token.text for token in news_doc if token.pos_ == "ADJ"]
)
print(unique_adjectives)

{'Contrary', 'new', 'unconsolidated', 'valid', 'few', 'consolidated', 'contrary', 'inevitable', 'final', 'extra', 'negotiable', '-', 'same', 'total', 'circular', 'particular', 'provisional', 'individual', 'non', 'other', 'public'}
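The abbreviations returned by `pos_` and `tag_` can be looked up with `spacy.explain`, which reads from spaCy's built-in glossary and needs no loaded pipeline. A short sketch using labels that appear in the output above:

```python
import spacy

# Coarse-grained (pos_) labels followed by fine-grained (tag_) labels.
labels = ["ADJ", "ADP", "JJ", "IN", "DT", "NN"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:>4}: {description}")
```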


## Preprocessing functions

You can create a custom preprocessing function to clean up the text. For example, you can remove stop words and punctuation, then lemmatize and lowercase the remaining tokens.


In [60]:
import spacy

nlp = spacy.load("en_core_web_sm")

complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
)

complete_doc = nlp(complete_text)


def is_token_allowed(token):
    return bool(
        token
        and str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )


def preprocess_token(token):
    return token.lemma_.strip().lower()


complete_filtered_tokens = [
    preprocess_token(token) for token in complete_doc if is_token_allowed(token)
]

print(complete_filtered_tokens)

['gus', 'proto', 'python', 'developer', 'currently', 'work', 'london', 'base', 'fintech', 'company', 'interested', 'learn', 'natural', 'language', 'processing', 'developer', 'conference', 'happen', '21', 'july', '2019', 'london', 'title', 'application', 'natural', 'language', 'processing', 'helpline', 'number', 'available', '+44', '1234567891', 'gus', 'helping', 'organize', 'keep', 'organize', 'local', 'python', 'meetup', 'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk', 'introduce', 'reader', 'use', 'case', 'natural', 'language', 'processing', 'fintech', 'apart', 'work', 'passionate', 'music', 'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch', 'great', 'piano', 'academy', 'great', 'piano', 'academy', 'situate', 'mayfair', 'city', 'london', 'world', 'class', 'piano', 'instructor']


## Rule-based matching

spaCy offers a rule-matching tool called `Matcher`. It allows you to build a library of token patterns and then match those patterns against a `Doc` object to return a list of found matches.


In [61]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)


def extract_full_name(nlp_doc):
    """Yield each pair of consecutive proper nouns as a candidate full name"""
    pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]
    matcher.add("FULL_NAME", [pattern])
    matches = matcher(nlp_doc)
    for _, start, end in matches:
        span = nlp_doc[start:end]
        yield span.text


next(extract_full_name(news_doc))

'UV Express'

Let's extract phone numbers from a text.


In [62]:
def on_match(matcher, doc, i, matches):
    print("Matched!", matches)


def extract_phone_number(nlp_doc):
    matcher = Matcher(nlp.vocab)
    patterns = [
        [
            {"ORTH": "("},
            {"SHAPE": "ddd"},
            {"ORTH": ")"},
            {"SHAPE": "ddd"},
            {"ORTH": "-", "OP": "?"},
            {"SHAPE": "dddd"},
        ]
    ]
    matcher.add("PHONE_NUMBER", patterns, on_match=on_match)
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text


conference_org_text = (
    "There is a developer conference"
    " happening on 21 July 2019 in London. It is titled"
    ' "Applications of Natural Language Processing".'
    " There is a helpline number available"
    " at (123) 456-7891"
)

conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)

Matched! [(10788718092470551940, 31, 37)]


'(123) 456-7891'
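The same `SHAPE`-based approach works for other fixed formats. The following sketch matches the date in the conference sentence. It assumes a blank English pipeline, since `SHAPE` and `IS_TITLE` are lexical attributes, and the pattern name `DATE_DMY` is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; the pattern uses lexical attributes only.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two digits, a capitalized alphabetic word, then four digits: "21 July 2019".
date_pattern = [
    {"SHAPE": "dd"},
    {"IS_TITLE": True, "IS_ALPHA": True},
    {"SHAPE": "dddd"},
]
matcher.add("DATE_DMY", [date_pattern])

doc = nlp("There is a developer conference happening on 21 July 2019 in London.")
dates = [doc[start:end].text for _, start, end in matcher(doc)]
print(dates)
```

This pattern is deliberately loose: it accepts any capitalized word in the middle position, so a stricter version would list the month names explicitly.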

## Dependency parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents.

The head of a sentence has no dependency and is called the root of the sentence; it is usually the main verb. Every other word depends on a headword. The verb governs its subject (`nsubj`) and its direct object (`dobj`), and an indirect object, when present, attaches to the verb as well (labeled `dative` in the output below). In a prepositional phrase, the noun depends on the preposition rather than on the verb.


In [63]:
doc = nlp(
    "When the CEO paid himself $750,000, the company"
    " faced bankruptcy and the employee revolted."
)

print(
    f"{'Token':12}"
    f"{'POS':10}"
    f"{'Dependency'}"
)

for token in doc:
    print(
        f"{str(token):12}"
        f"{token.pos_:10}"
        f"{token.dep_}"
    )

Token       POS       Dependency
When        SCONJ     advmod
the         DET       det
CEO         NOUN      nsubj
paid        VERB      advcl
himself     PRON      dative
$           SYM       nmod
750,000     NUM       dobj
,           PUNCT     punct
the         DET       det
company     NOUN      nsubj
faced       VERB      ROOT
bankruptcy  NOUN      dobj
and         CCONJ     cc
the         DET       det
employee    NOUN      nsubj
revolted    VERB      conj
.           PUNCT     punct
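Beyond the `dep_` label, each token exposes its headword through `head` and its dependents through `children` and `subtree`. The sketch below navigates a small tree. To run without the trained parser, it builds the `Doc` by hand with `heads` and `deps` values that mirror the labels shown above for "the company faced bankruptcy". That hand annotation is an assumption of this example, not parser output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (assumption): heads are absolute token indices,
# and the root token points to itself.
words = ["The", "company", "faced", "bankruptcy"]
heads = [1, 2, 2, 2]
deps = ["det", "nsubj", "ROOT", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

root = next(token for token in doc if token.dep_ == "ROOT")
root_children = [child.text for child in root.children]
subject_subtree = [token.text for token in doc[1].subtree]

print("root:", root.text)
print("children of root:", root_children)
print("subtree of subject:", subject_subtree)
print("head of 'bankruptcy':", doc[3].head.text)
```

With a `Doc` produced by `en_core_web_sm`, the same attributes are filled in by the parser and can be used in exactly the same way.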


## Visualization

spaCy offers a built-in visualization tool called `displacy`. It can be used to visualize a dependency parse or named entities in a browser or a Jupyter notebook.


In [64]:
from spacy import displacy

displacy.render(
    doc,
    style="dep",
    options={
        "distance": 110,
        "compact": True,
        "color": "white",
        "bg": "#09a3d5",
        "font": "Inter",
    },
)
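Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed, which makes it possible to save the figure to a file. The following is a minimal sketch. It assumes a hand-annotated `Doc` so that it runs without the trained parser, and the output file name is made up.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (assumption), standing in for parser output.
doc = Doc(
    nlp.vocab,
    words=["The", "company", "faced", "bankruptcy"],
    heads=[1, 2, 2, 2],
    deps=["det", "nsubj", "ROOT", "dobj"],
)

# jupyter=False forces the function to return the SVG markup as a string.
svg = displacy.render(doc, style="dep", jupyter=False, options={"compact": True})

# Hypothetical output location in the system temp directory.
output_path = Path(tempfile.gettempdir()) / "dependency_parse.svg"
output_path.write_text(svg, encoding="utf-8")
print(f"Wrote {len(svg)} characters to {output_path}")
```

The saved file can be opened in any browser or embedded in a report.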

## Named entity recognition

Named entity recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.


In [65]:
import spacy

nlp = spacy.load("en_core_web_sm")

piano_class_text = (
    "Great Piano Academy is situated"
    " in Mayfair or the City of London and has"
    " world-class piano instructors."
)
piano_class_doc = nlp(piano_class_text)

for ent in piano_class_doc.ents:
    print(f"{ent.text:20} --> {ent.label_} ({spacy.explain(ent.label_)})")

Great Piano Academy  --> ORG (Companies, agencies, institutions, etc.)
Mayfair              --> GPE (Countries, cities, states)
the City of London   --> GPE (Countries, cities, states)


In [66]:
displacy.render(piano_class_doc, style="ent")