In [1]:
# source https://nlpforhackers.io/complete-guide-to-spacy/
# jupyter notebook for Python3 and Spacy introduction

# the original tutorial at the URL above has more stuff

# Setting up the Python libraries:
# my spaCy install was very old; I fixed it with
#   conda install -c conda-forge spacy=2.0.11
# Only after that could the language model be downloaded and loaded as below:
#   python3 -m spacy download en
# (The 'en' shortcut is a spaCy 2.x convention; spaCy 3 expects a full model name such as en_core_web_sm.)

import spacy
nlp = spacy.load('en')

Let's start from the basics: tokenization

In [2]:
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')

"Hello"
"    "
"World"
"!"


Notice the index-preserving tokenization in action. Rather than keeping only the words, spaCy keeps the whitespace too. This helps when you need to replace words in the original text or add annotations to it. NLTK's word_tokenize returns plain strings with no offsets, so you can't tell exactly where a token sits in the original raw text. spaCy preserves this “link” between each token and its place in the raw text.

Here’s how to get the exact index of a word:  

In [3]:
for token in doc:
    print('"' + token.text + '"', token.idx)

"Hello" 0
"    " 6
"World" 10
"!" 15

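To see why the offsets matter, here is a minimal sketch that replaces a word in the raw string using only the (text, idx) pairs printed above. The pairs are hardcoded from that output, and `replace_token` is a hypothetical helper written for this example, not part of spaCy.

```python
# Raw text and the (token.text, token.idx) pairs copied from the output above.
raw = 'Hello     World!'
tokens = [("Hello", 0), ("    ", 6), ("World", 10), ("!", 15)]

def replace_token(raw, tokens, target, replacement):
    """Replace the first token equal to `target`, leaving every other character untouched."""
    for text, idx in tokens:
        if text == target:
            return raw[:idx] + replacement + raw[idx + len(text):]
    return raw  # target not found: return the text unchanged

# Every offset points back at the exact slice of the raw text.
assert all(raw[idx:idx + len(text)] == text for text, idx in tokens)

print('"' + replace_token(raw, tokens, "World", "spaCy") + '"')
```

The run of spaces between the two words survives the replacement, which is exactly what a tokenizer without offsets cannot guarantee.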

## Sentence Detection

Here’s how to achieve one of the most common NLP tasks with spaCy:

In [4]:
doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


## Part Of Speech Tagging

Each token carries its fine-grained part-of-speech tag in token.tag_:

In [5]:
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]
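The tags above come from the Penn Treebank tag set. As a reading aid, here is a small sketch that pairs each tag from the output with a short gloss. The `TAG_GLOSS` dictionary is hand-written for this example and covers only the tags shown; spaCy itself offers `spacy.explain(tag)` for the same purpose.

```python
# (token.text, token.tag_) pairs copied from the output above.
tagged = [('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'),
          ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]

# Hand-written glosses for the tags in this sentence only (an assumption of this sketch).
TAG_GLOSS = {
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'PRP': 'personal pronoun',
    'MD': 'modal verb',
    'VB': 'verb, base form',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
    '.': 'sentence-final punctuation',
}

def gloss(tagged_tokens):
    """Attach a human-readable gloss to each (text, tag) pair."""
    return [(text, tag, TAG_GLOSS.get(tag, 'unknown tag')) for text, tag in tagged_tokens]

for text, tag, meaning in gloss(tagged):
    print('{:<8}{:<5}{}'.format(text, tag, meaning))
```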


## Named Entity Recognition

Doing NER with spaCy is super easy and the pretrained model performs pretty well:

In [6]:
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
Madrid GPE


spaCy's NER also recognizes a healthy variety of entity types. You can view the full list in the spaCy documentation (Entity Types).

In [7]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG
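Once entities are extracted, a common next step is to group them by type. Here is a minimal sketch that uses the (text, label) pairs from the output above as hardcoded input; with a real `doc` you would build the pairs from `doc.ents` instead. The function name `group_by_label` is made up for this example.

```python
from collections import defaultdict

# (ent.text, ent.label_) pairs copied from the output above.
entities = [("2", "CARDINAL"), ("9 a.m.", "TIME"), ("30%", "PERCENT"),
            ("just 2 days", "DATE"), ("WSJ", "ORG")]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(label, texts)
```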


Let’s use displaCy to view a beautiful visualization of the Named Entity annotated sentence:

In [8]:
from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

## Chunking

spaCy automatically detects noun phrases as well:

In [9]:
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


Notice how the chunker also computes the root of each phrase, that is, its main word.

## Dependency Parsing

This is what makes spaCy really stand out. Let’s see the dependency parser in action:

In [10]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
 

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN


If this doesn’t help you visualize the dependency tree, displaCy comes in handy:

In [11]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
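If no notebook is at hand, the arrows printed earlier can also be rebuilt into a plain-text tree. The sketch below hardcodes the (token, dependency, head) triples from that output. `build_children` and `tree_lines` are hypothetical helpers, and matching heads by their text only works here because no word repeats in the sentence.

```python
from collections import defaultdict

# (token.text, token.dep_, token.head.text) triples copied from the parser output.
triples = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]

def build_children(triples):
    """Return (root, children), where children maps a head to its [(child, dep), ...]."""
    children = defaultdict(list)
    root = None
    for token, dep, head in triples:
        if dep == "ROOT":
            root = token  # the root token is its own head
        else:
            children[head].append((token, dep))
    return root, children

def tree_lines(node, children, dep="ROOT", depth=0):
    """Render the subtree under `node` as indented lines, one token per line."""
    lines = ["  " * depth + "{} ({})".format(node, dep)]
    for child, child_dep in children.get(node, []):
        lines.extend(tree_lines(child, children, child_dep, depth + 1))
    return lines

root, children = build_children(triples)
print("\n".join(tree_lines(root, children)))
```

With a real `doc`, matching on `token.i` and `token.head.i` instead of the text would avoid the repeated-word problem.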
 

## Word Vectors

spaCy also ships with word vectors, but they come with a larger model that we need to download first:

In [12]:
# run the following in a shell to download a larger English model:
# python3 -m spacy download en_core_web_lg

The vectors are attached to spaCy objects: Token, Lexeme (a sort of unattached token, part of the vocabulary), Span and Doc. The multi-token objects (Span and Doc) average their constituent vectors.
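To make the averaging concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values (real spaCy vectors are much longer), and `average_vectors` is a hypothetical helper that mirrors what a multi-token object does with its tokens' vectors.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Toy stand-ins for the vectors of two tokens (not real embeddings).
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

print(average_vectors(toy_token_vectors))  # [2.0, 2.0, 1.0]
```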

Explaining word vectors (aka word embeddings) is not the purpose of this tutorial. Here are a few properties word vectors have:

    If two words are similar, they appear in similar contexts
    Word vectors are computed taking into account the context (surrounding words)
    Given the two previous observations, similar words should have similar word vectors
    Using vectors we can derive relationships between words

Let’s see how we can access the embedding of a word in spaCy:

In [13]:
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab['banana'].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

There’s a really famous example of word embedding math: "man" - "woman" + "queen" = "king". It sounds too crazy to be true, so let’s test that out:

In [15]:
from scipy import spatial
 
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
 
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector
 
# We now need to find the closest vector in the vocabulary to the result of "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []
 
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
 
    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))
 
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])
 

['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']


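Surprisingly, the closest word vector in the vocabulary for “man” - “woman” + “queen” is still “Queen”, but “King” comes right after. Maybe behind every King is a Queen? (The result vector tends to stay close to its inputs, which is why analogy evaluations usually exclude the query words from the candidates.)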
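One way to read the ranking more fairly is to drop the query words and collapse case variants before looking at the top hit. Here is a minimal sketch, applied to the ten words printed above. `filter_candidates` is a hypothetical helper, and lowercasing is a deliberately crude way to merge variants.

```python
# The ten nearest words printed above, in ranked order.
ranked = ['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']
# The words that went into "man" - "woman" + "queen".
query_words = {'man', 'woman', 'queen'}

def filter_candidates(ranked_words, excluded):
    """Keep the first occurrence of each lowercased word that is not a query word."""
    seen = set()
    kept = []
    for word in ranked_words:
        key = word.lower()
        if key in excluded or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept

print(filter_candidates(ranked, query_words))  # ['king', 'kings']
```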

## Computing Similarity

Based on the word embeddings, spaCy offers a similarity interface for all of its building blocks: Token, Span, Doc and Lexeme. Here’s how to use that similarity interface:

In [16]:
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
 
print(dog.similarity(animal), dog.similarity(fruit)) 
print(banana.similarity(fruit), banana.similarity(animal))
 

0.66185343 0.2355285
0.67148364 0.2427285
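By default, `similarity` is the cosine of the angle between the two vectors. Here is a minimal sketch of that formula in plain Python, using made-up two-dimensional vectors rather than real embeddings; `cosine` is a hypothetical helper name.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy vectors: the first pair points in the same direction, the second is orthogonal.
print(cosine([1.0, 2.0], [2.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure works for comparing embeddings of different magnitudes.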


Let’s now use this technique on entire texts:

In [17]:
target = nlp("Cats are beautiful animals.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
 
print(target.similarity(doc1))  
print(target.similarity(doc2))  
print(target.similarity(doc3))  

0.8901765218466683
0.9115827883983011
0.7822955760597128
