https://spacy.io

### Industrial-Strength Natural Language Processing

### Fast and easy integration with deep learning workflows

#### Make sure you install the package via `pip3 install spacy` and then download the English language model via `python3 -m spacy download en_core_web_sm`

Let's see some basic tasks with spaCy

In [1]:
import spacy

WEBSITE_DES = '''
spaCy is the best way to prepare text for deep learning.
'''

# Load the spaCy language model for English
# sm at the end stands for small. Some things are missing in the small
# version, e.g. word vectors. Use an lg model to get them
# en_vectors_web_lg includes over 1 million unique vectors
nlp = spacy.load('en_core_web_sm')

# Now let's create a document and see how spaCy processes it for us
document = nlp(WEBSITE_DES.strip())

#### POS tagging
Let's see the tokens and their part of speech (POS) in our text

In [2]:
for word in document:
    # spaCy ships with a POS tagger, so no extra setup is needed
    print(word.text, word.pos_, word.tag_, spacy.explain(word.tag_))

spaCy PROPN NNP noun, proper singular
is VERB VBZ verb, 3rd person singular present
the DET DT determiner
best ADJ JJS adjective, superlative
way NOUN NN noun, singular or mass
to PART TO infinitival to
prepare VERB VB verb, base form
text NOUN NN noun, singular or mass
for ADP IN conjunction, subordinating or preposition
deep ADJ JJ adjective
learning NOUN NN noun, singular or mass
. PUNCT . punctuation mark, sentence closer


Now let's see the Token type and its properties

In [3]:
# Let's see what else each token has
print(type(document[-1]))
print(document[-1].__dir__())

<class 'spacy.tokens.token.Token'>
['__repr__', '__hash__', '__str__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__len__', '__new__', 'set_extension', 'get_extension', 'has_extension', 'remove_extension', '__unicode__', '__bytes__', '__reduce__', 'check_flag', 'nbor', 'similarity', 'is_ancestor', '_', 'lex_id', 'rank', 'string', 'text', 'text_with_ws', 'prob', 'sentiment', 'lang', 'idx', 'cluster', 'orth', 'lower', 'norm', 'shape', 'prefix', 'suffix', 'lemma', 'pos', 'tag', 'dep', 'has_vector', 'vector', 'vector_norm', 'tensor', 'n_lefts', 'n_rights', 'sent', 'sent_start', 'is_sent_start', 'lefts', 'rights', 'children', 'subtree', 'left_edge', 'right_edge', 'ancestors', 'head', 'conjuncts', 'ent_type', 'ent_type_', 'ent_iob', 'ent_iob_', 'ent_id', 'ent_id_', 'ent_kb_id', 'ent_kb_id_', 'whitespace_', 'orth_', 'lower_', 'norm_', 'shape_', 'prefix_', 'suffix_', 'lang_', 'lemma_', 'pos_', 'tag_', 'dep_', 'is_oov', 'is_stop', 'is_alpha', 'is_ascii', 'is_digit', 'is_lower'

`has_vector` is a boolean flag that tells us whether the token has a
word vector. The vectors themselves come from the language model we load.

Let's look at the dependency labels

In [4]:
doc_copy = nlp(WEBSITE_DES.replace('is', 'isn\'t').strip())
for word in doc_copy:
    print(word, word.dep_)

spaCy nsubj
is ROOT
n't neg
the det
best amod
way attr
to aux
prepare relcl
text dobj
for prep
deep amod
learning pobj
. punct


How about sentences?

In [5]:
WEBSITE_DES = '''spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world.'''
doc_copy = nlp(WEBSITE_DES)
for sentence in doc_copy.sents:
    print(sentence)


spaCy excels at large-scale information extraction tasks.
It's written from the ground up in carefully memory-managed Cython.
Independent research in 2015 found spaCy to be the fastest in the world.


#### Extending sentence terminators

Now imagine you want to process a postmodern poem in which some sentences end with `---` instead of `.`

In this case we can extend spaCy's default sentence splitter

In [6]:
# First let's see if that works already
poem = nlp('Hug the life--- Feed the hope--- Never mind the rest--- Make your best--- The life is yours--- Like other alls. And a sentence end with period. And another.')
for life_hack in poem.sents:
    print(life_hack)

Hug the life--- Feed the
hope--- Never mind the rest---
Make your best---
The life is yours--- Like other alls.
And a sentence end with period.
And another.


Not very clean ---

In [7]:
from pprint import pprint


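def three_dots_sentence(document):
    # Despite the name, this looks for the '---' terminator used in the poem.
    # The token right after each '---' is marked as the start of a new sentence.
    for token in document[:-1]:
        if token.text.endswith('---'):
            document[token.i + 1].is_sent_start = True
    return document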

pprint(nlp.pipeline)
nlp.add_pipe(three_dots_sentence, before='parser')
pprint(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f3a02673b70>),
 ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f3a020411c8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f3a02041228>)]
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f3a02673b70>),
 ('three_dots_sentence', <function three_dots_sentence at 0x7f3a01debd90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f3a020411c8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f3a02041228>)]


In [8]:
# Let's see the result

# New poem object with the updated nlp
poem = nlp(poem.text)
for life_hack in poem.sents:
    print(life_hack)

Hug the life---
Feed the hope---
Never mind the rest---
Make your best---
The life is yours---
Like other alls.
And a sentence end with period.
And another.


Very simple, but does the job! However, this is not the best way to achieve our goal.

#### Drawbacks to Spacy's tokenization

In [9]:
sentence = nlp('I am a non-vegetarian student and zamanias@mcmaster.ca is my email address.')
for word in sentence:
    print(word.text)

I
am
a
non
-
vegetarian
student
and
zamanias@mcmaster.ca
is
my
email
address
.


Not intelligent enough to recognize non-vegetarian as one word, but
intelligent enough to recognize an email address as a single token.

Let's dive into entities

In [10]:
doc_copy = nlp('New York is a city in the United States. I have a $3 million apartment there.')
for entity in doc_copy.ents:
    print(' - '.join([entity.text, entity.label_,
                     str(spacy.explain(entity.label_))]))

New York - GPE - Countries, cities, states
the United States - GPE - Countries, cities, states
$3 million - MONEY - Monetary values, including unit


You may also define new entities, e.g. the name of your company, by
using the `Span` class from `spacy.tokens`
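
As a minimal sketch of defining an entity by hand, the snippet below wraps two tokens in a `Span` and appends it to `doc.ents`. The company name "Acme Analytics" is made up for illustration, and a blank English pipeline (`nlp_blank`, a name introduced here) is used so that no trained model is required. It assumes spaCy v2.1 or later, where `label` may be passed as a string.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes, so doc.ents starts out empty.
nlp_blank = spacy.blank('en')
doc = nlp_blank('Acme Analytics hired me last year.')

# Tokens 0 and 1 ("Acme Analytics") become a single ORG entity.
# "Acme Analytics" is a hypothetical company name.
company = Span(doc, 0, 2, label='ORG')
doc.ents = list(doc.ents) + [company]

for entity in doc.ents:
    print(entity.text, entity.label_)
```

With a trained model, the same assignment would add the new span alongside whatever entities the model already found, as long as the spans do not overlap.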

How about names?

In [11]:
doc_copy = nlp('Albert Einstein and Marilyn Monroe married in Royal Albert Hall last night')
for noun in doc_copy.noun_chunks:
    print(noun)

Albert Einstein
Marilyn Monroe
Royal Albert Hall


Pretty good results, hmm?

#### Time for visualization

In [12]:
from spacy import displacy
doc_copy = nlp('Python is a good programming language.')
displacy.render(doc_copy, style='dep', jupyter=True,
                options={'distance': 100, 'compact':True})

In [13]:
doc_copy = nlp('McMaster is a cool university. Hamilton is a cool city near the cooler Toronto city in the coolest country, Canada!')
displacy.render(doc_copy, style='ent', jupyter=True)


#### Redaction and Sanitization

Sometimes it is necessary to redact names and places from a report before releasing it. `Spacy` can help with that.

In [14]:
doc_redacted = []

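# An entity can span several tokens (e.g. New and York). Let's merge each
# entity into one token. The retokenizer replaces the deprecated ent.merge().
with doc_copy.retokenize() as retokenizer:
    for ent in doc_copy.ents:
        retokenizer.merge(ent)
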
for token in doc_copy:
    if token.ent_type_ in ['PERSON', 'ORG', 'GPE']:
        doc_redacted.append('[REDACTED]')
    else:
        doc_redacted.append(token.text)

' '.join(doc_redacted)

'[REDACTED] is a cool university . [REDACTED] is a cool city near the cooler [REDACTED] city in the coolest country , [REDACTED] !'

#### Let's dive into Lemmatization and Stemming in Spacy
Stemming refers to reducing a word to its root form.
For instance, walk, walking, walker, walked, etc. all
come from a common root. It's usually not a good idea to treat
them as distinct words while doing NLP.

Lemmatization refers to a very similar task, with
a slightly different approach.

In stemming, we usually chop off the ends of words **in
the hope of** achieving our goal (a heuristic), while in
lemmatization we use vocabulary analysis of the words to
return their dictionary form.

**Enough of theory, let's see them in action**

In [15]:
doc_copy = nlp('walk walked walker walking computer computing has have are You they')
for word in doc_copy:
    print(word.lemma_)

walk
walk
walker
walk
computer
computing
have
have
be
-PRON-
-PRON-


A little strange!

In what way is walking different from computing when doing lemmatization?

Also note that there is no direct way of doing *Stemming* in
Spacy, so we need to use another tool for that.
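
To make the contrast concrete without pulling in another library, here is a toy sketch of both ideas in plain Python. The suffix list and the lookup table are made up for illustration; this is not the Porter algorithm and not how spaCy's lemmatizer works internally.

```python
# Toy suffix-stripping stemmer: a heuristic, NOT the Porter algorithm.
SUFFIXES = ('ing', 'ed', 'er', 's')


def toy_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Toy lemmatizer: a hand-written lookup table standing in for a vocabulary.
TOY_LEMMAS = {'walked': 'walk', 'walking': 'walk', 'has': 'have', 'are': 'be'}


def toy_lemma(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ['walk', 'walked', 'walker', 'walking', 'computing', 'has', 'are']:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer collapses walker into walk and turns computing into the non-word comput, while the lookup handles irregular forms such as has and are that no suffix rule would catch.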

#### Similarity

Now let's play with spaCy's `similarity()` method. It compares the vector representations of two words (using cosine similarity by default) and returns a score; the higher the score, the more similar the words.

In [16]:
tokens = nlp("dog cat banana")
print(tokens[0].similarity(tokens[1]))
print(tokens[0].similarity(tokens[2]))

# nlp = spacy.load("en_core_web_lg") 

# tokens = nlp("dog cat banana afsdasdkfsd")
# print(tokens[0].similarity(tokens[1]))

0.70593494
0.47661957




spaCy prints a warning for these calls because the small model has no real word vectors. To get meaningful similarities we need to download the large version of the English language model, which is ~900 MB. Feel free to try that yourself!

Now let's see the similarity between two documents instead of tokens

In [17]:
# Example from Stackoverflow
doc1 = nlp("This was very strange argument between american and british person")
doc2 = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(doc1.similarity(doc2))

0.6350208743120859




0.63 is quite a high similarity for these two unrelated sentences. What could be the reason?

Maybe it is because we are not using the large language model. What about `stop words`, though?

In [18]:
doc1 = nlp(' '.join([str(t) for t in doc1 if not t.is_stop]))
doc2 = nlp(' '.join([str(t) for t in doc2 if not t.is_stop]))

print(doc1)
print(doc2)

print(doc1.similarity(doc2))

strange argument american british person
Japan , true English gentleman eyes , reasons liked going school .
0.759336364850448




The way `Spacy` computes the similarity between two sentences is simple: the embedding of a full sentence is just the average of the vectors of all its words. As a result, individual word vectors may cancel each other out, leaving two averaged vectors that look more similar than the sentences really are.
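
The averaging effect can be sketched in plain Python. The 2-d vectors below are invented for illustration (real models use hundreds of dimensions), and `mean_vector` and `cosine` are helper names introduced here, not spaCy functions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors for two short "sentences" of two words each.
sentence_a = [[2.0, 1.0], [-1.0, 1.0]]
sentence_b = [[1.0, 3.0], [0.0, -1.0]]

# Word-by-word, no pair from the two sentences is a close match ...
for u in sentence_a:
    for v in sentence_b:
        print('word pair:', round(cosine(u, v), 3))

# ... but the averaged sentence vectors are both [0.5, 1.0].
print('sentences:', cosine(mean_vector(sentence_a), mean_vector(sentence_b)))
```

Both averages come out as the same vector, so the sentence-level similarity is essentially 1.0 even though no individual word pair comes close to that.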

#### Stop Words

We just talked about `stop words`. These are words that are needed to form grammatically correct, well-structured sentences, but they usually don't carry much meaning on their own, especially in the context of Natural Language Processing.

Let's see some of the English stop words.

In [19]:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)

{'have', 'much', 'anywhere', 'together', 'whole', 'thereafter', 'via', 'hereafter', 'between', 'on', 'they', 'our', 'been', 'everyone', 'of', 'must', 'am', 'we', 'were', 'what', 'alone', 'again', 'name', 'give', 'below', 'toward', 'n‘t', 'whatever', 'up', 'except', 'meanwhile', 'n’t', 'which', 'for', 'quite', 'put', 'who', 'everything', 'whither', 'beside', 'once', 'by', 'least', 'become', 'both', 'call', 'hers', 'so', 'mostly', 'somehow', 'twelve', "'ll", 'to', 'formerly', 'within', 'us', 'thence', 'a', 'three', 'upon', 'more', 'ourselves', 'rather', "'s", 'further', 'serious', 'none', 'all', 'hereupon', 'made', 'everywhere', 'your', 'last', 'became', 'should', 'ten', '’m', 'into', '’ve', 'forty', 'since', 'sometime', 'whence', 'every', 'its', 'many', 'the', 'most', 'do', 're', 'same', "n't", 'is', 'thru', 'she', 'them', 'no', 'then', 'two', 'themselves', 'being', '’d', 'whereafter', 'those', 'twenty', "'m", 'very', 'another', 'one', 'other', 'latterly', 'thereupon', 'say', 'in', 'bec

#### Extending Stop Words

We often want to process informal texts, e.g. tweets and reviews. The stop words above are mostly useful for formally written English texts such as newspapers and engineering books.

We can't see *lol*, *hbu*, *lmao*... in the list above. Therefore, it's a good idea to extend the set of stop words based on the application and the problem we are working on.

In [20]:
print(nlp.vocab['lol'].is_stop, '\n')

nlp.vocab['lol'].is_stop = True
nlp.vocab['hbu'].is_stop = True
nlp.vocab['lmao'].is_stop = True

tokens = nlp('lol you\'re very funny lmao')
for token in tokens:
    print(token.text, token.is_stop)


False 

lol True
you True
're True
very True
funny False
lmao True
