# IST664 - Week 6 Lab: Introduction to spaCy

SpaCy is an open source natural language processing library written by Matthew Honnibal and Ines Montani. It is implemented in Python and Cython. Unlike NLTK, which was designed for teaching and research, spaCy was created from the start to support production applications: real-world activities that require natural language processing. SpaCy uses a "pipeline" metaphor in which input documents pass through a series of typical processing stages, with each stage feeding into the next one. Examples of these stages include tokenization, part of speech tagging, named entity recognition, and transformation into word vectors.
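
To make the pipeline metaphor concrete, here is a minimal sketch in plain Python. It does not use spaCy and is not how spaCy is implemented; the stage functions (`tokenize`, `tag_capitalized`, `add_lowercase`) and the `run_pipeline` helper are hypothetical names invented for this illustration. The only point is that each stage receives the output of the previous stage and adds its own annotations.

```python
# Toy illustration of a processing pipeline: each stage feeds the next.
# All function names are hypothetical; this is not spaCy's API.

def tokenize(text):
    """Stage 1: split raw text into a list of token dictionaries."""
    return [{"text": word} for word in text.split()]

def tag_capitalized(tokens):
    """Stage 2: annotate each token with a crude title-case flag."""
    for tok in tokens:
        tok["is_title"] = tok["text"].istitle()
    return tokens

def add_lowercase(tokens):
    """Stage 3: annotate each token with its lowercase form."""
    for tok in tokens:
        tok["lower"] = tok["text"].lower()
    return tokens

def run_pipeline(text, stages):
    """Pass the data through every stage in order."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

tokens = run_pipeline("Hello World", [tokenize, tag_capitalized, add_lowercase])
for tok in tokens:
    print(tok)
```

Because the stages are just an ordered list, adding, removing, or reordering a stage does not require changing the others, which is the main appeal of the pipeline design.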

Try searching for "spaCy" on Kaggle.com. At this writing there were more than 4600 projects that used spaCy. Part of the appeal is that spaCy makes it easy to get started with a project. SpaCy contains support for dozens of different languages and its integration with word- and sentence-embedding approaches provides access to the advantages of pre-trained deep learning models.

Although you have seen spaCy briefly before in previous labs, in this lab you will get a more comprehensive view of the architecture and capabilities of spaCy.

Sections of this lab:
- Basics: Getting Started
- Lemmatization
- Token Extracting / Removing / Transforming
- Sentence Segmentation
- Part of Speech Tagging
- Dependency Parsing
- Named Entity Recognition
- Word Vectors
- Sentence Similarity
- Customizing pipeline components

# Basics: Getting Started

In [1]:
# Every spaCy project begins with importing the package and
# instantiating a processing object that is initialized with a particular
# language model. In this case we will start with English:
import spacy
nlp = spacy.load("en_core_web_sm")

# That's equivalent to:
# import en_core_web_sm
# nlp = en_core_web_sm.load()

type(nlp)

spacy.lang.en.English

In [2]:
# There are lots of things this object can do. Let's use
# dir to get a list of them:

[m for m in dir(nlp) if m[0] != "_"]

['Defaults',
 'add_pipe',
 'analyze_pipes',
 'batch_size',
 'begin_training',
 'component',
 'component_names',
 'components',
 'config',
 'create_optimizer',
 'create_pipe',
 'create_pipe_from_source',
 'default_config',
 'default_error_handler',
 'disable_pipe',
 'disable_pipes',
 'disabled',
 'enable_pipe',
 'evaluate',
 'factories',
 'factory',
 'factory_names',
 'from_bytes',
 'from_config',
 'from_disk',
 'get_factory_meta',
 'get_factory_name',
 'get_pipe',
 'get_pipe_config',
 'get_pipe_meta',
 'has_factory',
 'has_pipe',
 'initialize',
 'lang',
 'make_doc',
 'max_length',
 'meta',
 'path',
 'pipe',
 'pipe_factories',
 'pipe_labels',
 'pipe_names',
 'pipeline',
 'rehearse',
 'remove_pipe',
 'rename_pipe',
 'replace_listeners',
 'replace_pipe',
 'resume_training',
 'select_pipes',
 'set_error_handler',
 'set_factory_meta',
 'to_bytes',
 'to_disk',
 'tokenizer',
 'update',
 'use_params',
 'vocab']

In [3]:
# At the most basic level, and at the beginning of most
# NLP pipelines, we tokenize a document:
doc = nlp("Hello World!") # This is the most basic way to use the instance
type(doc), len(doc) # What is the result?

(spacy.tokens.doc.Doc, 3)

In [4]:
# A spaCy "tokens-doc" behaves like a list, such that
# we can use a list comprehension to access the individual
# tokens in the document:
[token.text for token in doc]

['Hello', 'World', '!']

In [5]:
# And because it behaves like a list, we can also use
# indexing to get access to the individual tokens.
first_token = doc[0] # Index the first token
print(type(first_token)) # What is its type?
print(first_token.text) # Show the text of the token

<class 'spacy.tokens.token.Token'>
Hello


In [6]:
# In spaCy terminology, a span is any contiguous set of tokens.
# Spans are often used to break up a document into sentences. Here
# we are just using slicing to create a span with the first two
# of our three tokens.
span = doc[0:2]
[token.text for token in span]

['Hello', 'World']

In [7]:
# For this first exercise, tokenize a longer text excerpted from Wikipedia.
# Use slicing to show the first five tokens:

longtext = """A neural network is either a biological neural network or an
artificial neural network for solving artificial intelligence (AI) problems.
The connections of the biological neuron are modeled as weights. A positive
weight reflects an excitatory connection, while negative values mean
inhibitory connections."""

# 6.1: Tokenize longtext
doc = nlp(longtext)

# 6.2: Display the texts of tokens in a span consisting of the first five tokens
span = doc[0:5]
print("first five tokens", [token.text for token in span])

# 6.2a: (Challenge) Use Python slicing notation to show the last five tokens
span = doc[-5:]
print("last five tokens", [token.text for token in span])



first five tokens ['A', 'neural', 'network', 'is', 'either']
last five tokens ['mean', '\n', 'inhibitory', 'connections', '.']


In [8]:
# SpaCy uses language-specific tokenization rules to make better decisions
# about where tokens begin and end. Let's compare spaCy tokenization with the
# primitive use of split(). Remember that split() defaults to splitting on
# whitespace.
headline = "Rare Bird’s Detection Highlights Promise of ‘Environmental DNA’"

splitspacy = nlp(headline) # Use spaCy tokenization
splitspace = headline.split() # Use simple splitting on spaces

print("SpaCy tokens:")
print([t.text for t in splitspacy])
print(len(splitspacy), "tokens.")

print("\nSimple splitting:")
print([s for s in splitspace])
print(len(splitspace), "tokens.")

SpaCy tokens:
['Rare', 'Bird', '’s', 'Detection', 'Highlights', 'Promise', 'of', '‘', 'Environmental', 'DNA', '’']
11 tokens.

Simple splitting:
['Rare', 'Bird’s', 'Detection', 'Highlights', 'Promise', 'of', '‘Environmental', 'DNA’']
8 tokens.


In [9]:
# Why do you think it might be helpful to tokenize the possessive "Bird's" into
# two tokens? Add a comment that explains your reasoning. Then find or write
# a sentence that contains a hyphenated noun phrase. How does spaCy treat that?

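# Splitting off the possessive "'s" exposes the base noun ("Bird"), so it can
# be tagged, lemmatized, and matched as a noun or entity on its own.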

# 6.2b: Use spaCy to tokenize a sentence that contains a hyphenated phrase.

sentence = "That was a well-timed shot"
doc = nlp(sentence)

print([token.text for token in doc])



['That', 'was', 'a', 'well', '-', 'timed', 'shot']


# Lemmatization

Lemmatization is the process of reducing the inflected forms (and sometimes the derivationally related forms) of a word to a common base form. This base form is called a lemma. Lemmas have an advantage over simple stemming: a lemma is always a dictionary word, whereas a stem may be a truncated fragment. Lemmatizing can be a valuable data reduction technique because it collapses the various inflected forms of a word into a single root.
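
The contrast with stemming can be sketched in a few lines of plain Python. The suffix-stripping `crude_stem` function and the hand-written `LEMMA_TABLE` below are hypothetical toys invented for this illustration; they are far simpler than a real stemmer or spaCy's lemmatizer, but they show why a stem may not be a dictionary word while a lemma is.

```python
# Toy comparison of stemming and lemmatization (not spaCy code).

def crude_stem(word):
    """Strip a few common suffixes without checking that a real word remains."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps each inflected form to a dictionary headword.
# This tiny table is an assumption made purely for the example.
LEMMA_TABLE = {"ponies": "pony", "running": "run", "was": "be", "mice": "mouse"}

def toy_lemma(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["ponies", "running", "was", "mice"]:
    print(w, "-> stem:", crude_stem(w), "| lemma:", toy_lemma(w))
```

The stemmer produces fragments such as "pon" and "runn" and cannot connect "was" to "be" or "mice" to "mouse", because those relationships are not visible in the spelling alone.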

In [10]:
# Demonstrate spaCy lemmatization with a verb form
text = "am are is" # All variations on the verb "to be"

# Note that the underscore following the attribute name in
# the expression token.lemma_ provides the human readable form.
[token.lemma_ for token in nlp(text)]

['be', 'be', 'be']

In [11]:
# Look at the non-text form of the lemma:

# 6.3: use token.lemma instead of token.lemma_
[token.lemma for token in nlp(text)]


# Write a comment describing what you see. These values are
# ID numbers for spaCy's "StringStore." More information here:
# https://spacy.io/usage/spacy-101#vocab

# Each value is a large integer ID rather than a string. All three tokens
# share the lemma "be", so the same ID appears three times.

In [12]:
# Here's another example
text = "look looks looked"
doc = nlp(text)
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:look -> lemma:look
token:looks -> lemma:looks
token:looked -> lemma:look


In [13]:
# Add your own example, this time using different forms of a noun

# 6.4: Lemmatize two or more inflective forms of a noun

text = "cat cats"
doc = nlp(text)
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:cat -> lemma:cat
token:cats -> lemma:cat


# Token Extracting / Removing / Transforming

Here's an overview of many of the attributes that a token has. When creating an NLP pipeline, it is incredibly helpful not to have to write Python routines to compute these yourself.

| Attribute Name | Type | Description |
|:--------------:|:----:|:-----------:|
| lemma          | int  | Base form of the token, with no inflectional suffixes. |
| lemma_         | str  | Base form of the token, with no inflectional suffixes. |
| norm           | int  | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| norm_          | str  | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| lower          | int  | Lowercase form of the token. |
| lower_         | str  | Lowercase form of the token text. Equivalent to Token.text.lower(). |
| shape          | int  | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”. |
| shape_         | str  | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”. |
| prefix         | int  | Hash value of a length-N substring from the start of the token. Defaults to N=1. |
| prefix_        | str  | A length-N substring from the start of the token. Defaults to N=1. |
| suffix         | int  | Hash value of a length-N substring from the end of the token. Defaults to N=3. |
| suffix_        | str  | Length-N substring from the end of the token. Defaults to N=3. |
| is_alpha       | bool | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha(). |
| is_ascii       | bool | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text). |
| is_digit       | bool | Does the token consist of digits? Equivalent to token.text.isdigit(). |
| is_lower       | bool | Is the token in lowercase? Equivalent to token.text.islower(). |
| is_upper       | bool | Is the token in uppercase? Equivalent to token.text.isupper(). |
| is_title       | bool | Is the token in titlecase? Equivalent to token.text.istitle(). |
| is_punct       | bool | Is the token punctuation? |
| is_left_punct  | bool | Is the token a left punctuation mark, e.g. "("? |
| is_right_punct | bool | Is the token a right punctuation mark, e.g. ")"? |
| is_space       | bool | Does the token consist of whitespace characters? Equivalent to token.text.isspace(). |
| is_bracket     | bool | Is the token a bracket? |
| is_quote       | bool | Is the token a quotation mark? |
| is_currency    | bool | Is the token a currency symbol? |
| like_url       | bool | Does the token resemble a URL? |
| like_num       | bool | Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc. |
| like_email     | bool | Does the token resemble an email address? |
### Extracting

The attributes that spaCy makes available on the token object provide a variety of type tests. The is_ attributes allow testing for alphabetic characters, ASCII characters, digits, lowercase, uppercase, title case, a left punctuation mark, a right punctuation mark, any punctuation mark, whitespace, a bracket, a quote mark, or a currency symbol. These are all very helpful in navigating within a string of tokens. Later we will show a search capability that allows us to include these in pattern matching.

There are also three "like" attributes that show if a token looks like a web address, a numeric string, or an email address.
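
The table describes several of the is_ attributes as equivalent to Python string methods. As a rough sketch of what those tests report, the snippet below applies the string methods directly to a few plain strings instead of spaCy tokens. The `looks_like_number` helper is a hypothetical, much cruder stand-in for a "like" test: it only handles digits with commas and one decimal point, and it would not recognize number words such as "ten".

```python
def looks_like_number(text):
    """Hypothetical, simplified stand-in for a 'like_num' style test."""
    stripped = text.replace(",", "").replace(".", "", 1)
    return stripped.isdigit()

# Plain strings standing in for tokens
tokens = ["Patent", "4,839,853", "1988", "(", "LSI", "expired"]

for tok in tokens:
    print(
        f"{tok!r:12} alpha={tok.isalpha()} digit={tok.isdigit()} "
        f"upper={tok.isupper()} title={tok.istitle()} "
        f"num_like={looks_like_number(tok)}"
    )
```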

Let's run some tests on a long and complex sentence from Wikipedia:

In [14]:
text='''An information retrieval technique using latent semantic structure was
patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester,
Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum
and Lynn Streeter. In the context of its application to information retrieval,
it is sometimes called latent semantic indexing (LSI).'''

In [15]:
my_list=[] # Initialize a blank list
doc = nlp(text) # Tokenize the text string
for token in doc: # Check each token
    if token.is_punct: # Test the boolean attribute
        my_list.append(token) # Append to the list

for item in my_list: # Review each item in the list
    print(item) # Print the item

(
,
)
,
,
,
,
,
.
,
(
)
.


In [16]:
# Now do something similar but use a list comprehension
[tok for tok in doc if tok.is_left_punct]

[(, (]

In [17]:
# Add a line of code to display right punctuation

# 6.5: Use the token attribute to detect and print right punctuation
[tok for tok in doc if tok.is_right_punct]

[), )]

In [18]:
# Add a line of code to detect tokens that seem like numbers

# 6.6: Use the token attribute to detect and display numbers
[tok for tok in doc if tok.like_num]

[1988, 4,839,853]

In [19]:
# For diagnostic purposes, it may be useful to examine these attributes
# all together. Here's a code fragment that sets up a table of token
# attributes:

import pandas as pd # Use a pandas DF

# These will be our column names
cols = ("text", "lemma_","is_punct", "is_stop", "is_alpha","is_space","lower_")

rows = [] # A blank list to hold the rows

for t in doc: # Iterate through the tokens - will work for any length document
    # build the next row
    row = [t.text, t.lemma_,  t.is_punct,  t.is_stop,  t.is_alpha,  t.is_space,  t.lower_]
    rows.append(row) # Append the row to the existing rows

# Create the pandas data frame from the column names and the list of rows
attri_pdf = pd.DataFrame(rows, columns=cols)

attri_pdf # Gives a preview, but may not show all rows

     text         lemma_       is_punct  is_stop  is_alpha  is_space  lower_
0    An           an           False     True     True      False     an
1    information  information  False     False    True      False     information
2    retrieval    retrieval    False     False    True      False     retrieval
3    technique    technique    False     False    True      False     technique
4    using        use          False     True     True      False     using
...  ...          ...          ...       ...      ...       ...       ...
62   indexing     indexing     False     False    True      False     indexing
63   (            (            True      False    False     False     (
64   LSI          LSI          False     False    True      False     lsi
65   )            )            True      False    False     False     )


In previous weeks we have considered stop words and why in some cases it makes sense to remove them from the token stream. Let's examine spaCy's stop word list.

In [20]:
# Import the English stop word set directly from its module
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
len(spacy_stopwords)

326

In [21]:
list(spacy_stopwords)[:8]

['anyhow',
 'if',
 'along',
 'latterly',
 'whenever',
 'towards',
 'beyond',
 'eleven']

While it is good to know what is on the stop list, we don't usually need it as spaCy has already tagged each token with an attribute showing whether that token is a stop word. Let's process another piece of text from Wikipedia to focus on the stop words.
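
The design described here, tagging every token once and filtering on the tags later, can be sketched without spaCy. Everything in the snippet is a hypothetical toy: `TOY_STOP_WORDS` is a six-word stand-in for a real stop list, and `annotate` is an invented helper, not a spaCy function.

```python
import string

# A tiny stand-in stop list; a real list is much longer.
TOY_STOP_WORDS = {"the", "is", "a", "of", "to", "in"}

def annotate(words):
    """Tag every word once with is_stop and is_punct flags."""
    return [
        {
            "text": w,
            "is_stop": w.lower() in TOY_STOP_WORDS,
            "is_punct": all(ch in string.punctuation for ch in w),
        }
        for w in words
    ]

tokens = annotate(["The", "model", "is", "a", "mixture", "of", "topics", "."])

# Later steps only read the flags; they never consult the stop list again.
content = [t["text"] for t in tokens if not t["is_stop"] and not t["is_punct"]]
print(content)
```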

In [22]:
text = """In natural language processing, the Latent Dirichlet Allocation (LDA)
is a generative statistical model that allows sets of observations to be
explained by unobserved groups that explain why some parts of the data are
similar. For example, if observations are words collected into documents,
it posits that each document is a mixture of a small number of topics and that
each word's presence is attributable to one of the document's topics. LDA is
an example of a topic model and belongs to the machine learning field and in
a wider sense to the artificial intelligence field."""

doc = nlp(text)
type(doc), len(doc)

(spacy.tokens.doc.Doc, 113)

In [23]:
# Display the tokens that are stop words:
print([token for token in doc if token.is_stop])

[In, the, is, a, that, of, to, be, by, that, why, some, of, the, are, For, if, are, into, it, that, each, is, a, of, a, of, and, that, each, 's, is, to, one, of, the, 's, is, an, of, a, and, to, the, and, in, a, to, the]


In [24]:
# Make a list of the tokens that are not stop-words
no_stops = [token for token in doc if not token.is_stop]
type(no_stops), len(no_stops)

(list, 64)

In [25]:
# Use slicing to view the first few non-stop words.
no_stops[0:12]

[natural,
 language,
 processing,
 ,,
 Latent,
 Dirichlet,
 Allocation,
 (,
 LDA,
 ),
 ,
 generative]

In some applications we may have uses for the punctuation tokens, but it is also good to know how to remove them. Conveniently, spaCy has also tagged every token with an indicator of whether it is punctuation.

In [26]:
# Also remove punctuation tokens
no_stops_or_punct = [token for token in no_stops if not token.is_punct]
type(no_stops_or_punct[0]), len(no_stops_or_punct)

(spacy.tokens.token.Token, 56)

In [27]:
# Use slicing to view the first few non-stop, non-punct words.
no_stops_or_punct[0:10]

[natural,
 language,
 processing,
 Latent,
 Dirichlet,
 Allocation,
 LDA,
 ,
 generative,
 statistical]

In [28]:
# Another attribute on a token contains the lowercase version
# of the token. Why does this attribute end with an underscore?
lowercased = [ token.lower_ for token in no_stops_or_punct]
lowercased[0:9]

['natural',
 'language',
 'processing',
 'latent',
 'dirichlet',
 'allocation',
 'lda',
 '\n',
 'generative']

When we called nlp() on the text object and created the tokens, spaCy also automatically predicted the lemma for each token and stored it as an attribute.

In [29]:
# Make an additional list of lemma tokens
lemma_list = [token.lemma_ for token in no_stops_or_punct]
type(lemma_list[0]), len(set(lemma_list))

(str, 39)

In [30]:
# The output above suggests that some lemmas appear in the token
# list more than one time. Use Counter from the collections package
# to count instances of lemmas.
from collections import Counter

# 6.7: Instantiate a counter object with Counter(lemma_list). Assign this
# to a new variable such as wc_lemmas

wc_lemmas = Counter(lemma_list)

# 6.8: Display the frequency counts of the five most common lemmas. Hint:
# a Counter has a bound method called most_common() that takes one
# argument called "n"

most_common_lemmas = wc_lemmas.most_common(5)
print(most_common_lemmas)






[('\n', 7), ('document', 3), ('topic', 3), ('LDA', 2), ('model', 2)]


In [31]:
# 6.9: Grab a new long string from Wikipedia or another source
#      and remove stop words and punctuation, then lemmatize and
#      count the frequencies of the top five lemmas.

text = """View of the Tower of Hercules near the center of A Coruña, Galicia, north-western
          coast of Spain. The 55 metres (180 ft) high tower, an ancient Roman lighthouse, is
          the oldest (almost 1900 years) Roman lighthouse in use today and the second tallest
          lighthouse in Spain (after the Faro de Chipiona).
          The lighthouse was rehabilitated in 1791 and is a UNESCO World Heritage Site since 2009."""

doc = nlp(text)

no_stops = [token for token in doc if not token.is_stop]
no_stops_or_punct = [token for token in no_stops if not token.is_punct]
lowercased = [ token.lower_ for token in no_stops_or_punct]
lemma_list = [token.lemma_ for token in no_stops_or_punct]
wc_lemmas = Counter(lemma_list)
most_common_lemmas = wc_lemmas.most_common(5)
print(most_common_lemmas)







[('\n          ', 4), ('lighthouse', 4), ('Spain', 2), ('roman', 2), ('view', 1)]


# Sentence Segmentation

The spaCy doc object has a property called "sents" that yields the sentences in the document; each sentence records its beginning and ending position (counting by tokens). As you may have discussed in class, finding sentence boundaries is harder than it looks, because the ending punctuation in strings such as U.S. or etc. may or may not indicate a sentence boundary. There are four strategies for sentence boundary detection in spaCy: dependency parser (default), statistical segmenter, rule-based segmenter, or custom function. Let's tokenize a fragment of Wikipedia text using the default (dependency parser) and then examine the resulting sentences.
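
To see why ending punctuation is ambiguous, consider a deliberately naive rule: split after every period, exclamation mark, or question mark that is followed by whitespace. The `naive_split` function below is a hypothetical toy written for this illustration. It is not one of spaCy's four strategies, and it is shown precisely because it gets the answer wrong.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "She moved to the U.S. in 1999. Dr. Smith met her there."
pieces = naive_split(text)
for piece in pieces:
    print(repr(piece))
print(len(pieces), "pieces from a text with two sentences")
```

The abbreviations "U.S." and "Dr." each trigger a false boundary, so the two sentences come back as four pieces. Approaches that look at more of the surrounding context are meant to avoid this kind of error.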

In [32]:
text = """Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang."""
text[-26:] # Show the end of the string


' computer code, and slang.'

In [33]:
doc = nlp(text)
type(doc.sents)

generator

In Python, a generator is a special kind of iterator that creates the requested elements on demand and "on the fly." This is a helpful approach when working with large sets of data elements that would be challenging to hold in memory all at once. So doc.sents is a generator object, which means we can iterate through its elements to find what we need.
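
The on-demand behavior of a generator is easy to see in a small standalone example. The `sentence_spans` function below is a hypothetical sketch, not spaCy's implementation: it walks a list of token strings and yields (start, end) token positions one sentence at a time, using only end punctuation as the boundary signal.

```python
def sentence_spans(tokens):
    """Yield (start, end) token positions, one sentence at a time."""
    start = 0
    for i, tok in enumerate(tokens):
        if tok in {".", "!", "?"}:
            yield (start, i + 1)
            start = i + 1
    if start < len(tokens):
        yield (start, len(tokens))  # trailing text without end punctuation

tokens = ["Hello", "World", "!", "This", "is", "spaCy", "."]
gen = sentence_spans(tokens)
print(type(gen).__name__)   # a generator object; no spans computed yet

first = next(gen)           # compute only the first span, on demand
print(first, tokens[first[0]:first[1]])

rest = list(gen)            # consume whatever remains
print(rest)
```

Once a generator has been consumed it is empty, so wrap it in `list()` if you need to go through the elements more than once.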

In [34]:
for sent in doc.sents:
    print("start_pos={}, end_pos={}, text:{}".format(sent.start, sent.end, sent.text))

start_pos=0, end_pos=36, text:Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end.
start_pos=36, end_pos=67, text:Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks.
start_pos=67, end_pos=103, text:In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities.
start_pos=103, end_pos=139, text:About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.


# Part of Speech Tagging

In [35]:
doc = nlp("I was reading an article about Berkeley Avenue in Reading, which was closed due to a police investigation.")
type(doc), len(doc)

(spacy.tokens.doc.Doc, 20)

In [36]:
from tabulate import tabulate # To make a neat table

tabdata = [ (token.text, token.tag_, token.pos_, spacy.explain(token.tag_)) for token in doc]

print(tabulate(tabdata,  headers=["Token", "Token Tag", "POS", "Explanation"]))


Token          Token Tag    POS    Explanation
-------------  -----------  -----  -----------------------------------------
I              PRP          PRON   pronoun, personal
was            VBD          AUX    verb, past tense
reading        VBG          VERB   verb, gerund or present participle
an             DT           DET    determiner
article        NN           NOUN   noun, singular or mass
about          IN           ADP    conjunction, subordinating or preposition
Berkeley       NNP          PROPN  noun, proper singular
Avenue         NNP          PROPN  noun, proper singular
in             IN           ADP    conjunction, subordinating or preposition
Reading        NNP          PROPN  noun, proper singular
,              ,            PUNCT  punctuation mark, comma
which          WDT          PRON   wh-determiner
was            VBD          AUX    verb, past tense
closed         VBN          VERB   verb, past participle
due            IN           ADP    conjunction, subordi

In [37]:
# Make a list of tokens for all of the proper nouns
propnlist = [token for token in doc if token.pos_ == "PROPN"]

[ (token, token.is_ascii, token.is_title) for token in propnlist]

[(Berkeley, True, True), (Avenue, True, True), (Reading, True, True)]

In [38]:
# Find another sentence that includes a place name. Tokenize it,
# display the POS tags, and excerpt the proper noun(s).

# 6.10: Create a new text object for tokenizing.

text = """View of the Tower of Hercules near the center of A Coruña, Galicia, north-western
          coast of Spain. The 55 metres (180 ft) high tower, an ancient Roman lighthouse, is
          the oldest (almost 1900 years) Roman lighthouse in use today and the second tallest
          lighthouse in Spain (after the Faro de Chipiona).
          The lighthouse was rehabilitated in 1791 and is a UNESCO World Heritage Site since 2009."""


# 6.11: Tokenize the text object.

doc = nlp(text)

# 6.12: Display the POS tags for all tokens.

tabdata = [ (token.text, token.tag_, token.pos_, spacy.explain(token.tag_)) for token in doc]
print(tabulate(tabdata,  headers=["Token", "Token Tag", "POS", "Explanation"]))

# 6.13: Extract the proper nouns and display them.

propnlist = [token for token in doc if token.pos_ == "PROPN"]

[ token for token in propnlist]


Token          Token Tag    POS    Explanation
-------------  -----------  -----  --------------------------------------------------
View           NN           NOUN   noun, singular or mass
of             IN           ADP    conjunction, subordinating or preposition
the            DT           DET    determiner
Tower          NNP          PROPN  noun, proper singular
of             IN           ADP    conjunction, subordinating or preposition
Hercules       NNP          PROPN  noun, proper singular
near           IN           ADP    conjunction, subordinating or preposition
the            DT           DET    determiner
center         NN           NOUN   noun, singular or mass
of             IN           ADP    conjunction, subordinating or preposition
A              DT           DET    determiner
Coruña         NNP          PROPN  noun, proper singular
,              ,            PUNCT  punctuation mark, comma
Galicia        NNP          PROPN  noun, proper singular
,              ,  

[Tower,
 Hercules,
 Coruña,
 Galicia,
 Spain,
 Spain,
 Faro,
 Chipiona,
 UNESCO,
 World,
 Heritage,
 Site]

# Dependency Parsing

In [39]:
# The part of speech tags displayed above describe individual tokens. The
# dependency parse describes how the tokens relate to one another. Let's
# take a closer look at the dependency structure:
from tabulate import tabulate # To make a neat table

tabdata = [ (token.text, token.tag_, token.dep_, token.head.text, token.head.tag_) for token in doc]

print(tabulate(tabdata,  headers=["Token", "Token POS", "Dependency", "Head Token", "Head POS"]))


Token          Token POS    Dependency    Head Token     Head POS
-------------  -----------  ------------  -------------  ----------
View           NN           ROOT          View           NN
of             IN           prep          View           NN
the            DT           det           Tower          NNP
Tower          NNP          pobj          of             IN
of             IN           prep          Tower          NNP
Hercules       NNP          pobj          of             IN
near           IN           prep          View           NN
the            DT           det           center         NN
center         NN           pobj          near           IN
of             IN           prep          center         NN
A              DT           det           Galicia        NNP
Coruña         NNP          nmod          Galicia        NNP
,              ,            punct         Coruña         NNP
Galicia        NNP          pobj          of             IN
,              ,     

Take a close look at the output just above. For each token, the text of the token is shown along with its part of speech tag. Then the dependency relation is shown, followed by the head token that the relation points to. For example, the first token, "View", is labeled ROOT and is its own head: this caption-style fragment has no main verb, so the parser anchors the structure on the noun. The second token, "of", is a preposition (prep) that depends on "View", and "Tower" is the object of that preposition (pobj).

Take the time to examine each row of the output and make sure you understand the dependency relation that is being documented. And remember that spaCy's ability to diagram the relations in this way works because of the language model we originally loaded: "en_core_web_sm". Also important: The default sentence segmentation that we examined in a previous block works because spaCy's dependency parser accounts for all of the elements in a sentence, and therefore "knows" when the period character is closing a sentence.
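
One way to check your understanding of the table is to treat the head relations as a tree and walk from any token up to the root. The sketch below copies the first six rows of the table into a plain Python list. Because "of" appears twice, heads are stored as token positions rather than text; which "of" each head refers to is my reading of the table, so treat those indices as an assumption. `path_to_root` is a hypothetical helper, not a spaCy function.

```python
# (text, dependency label, position of head token)
rows = [
    ("View",     "ROOT", 0),  # the root is its own head
    ("of",       "prep", 0),
    ("the",      "det",  3),
    ("Tower",    "pobj", 1),
    ("of",       "prep", 3),
    ("Hercules", "pobj", 4),
]

def path_to_root(index, rows):
    """Follow head links from one token until reaching the root."""
    path = [rows[index][0]]
    while rows[index][2] != index:   # stop when a token is its own head
        index = rows[index][2]
        path.append(rows[index][0])
    return path

print(" -> ".join(path_to_root(5, rows)))
print(" -> ".join(path_to_root(2, rows)))
```

Every token reaches "View" eventually, which is what it means for the parse to be a single tree with one root.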

In [40]:
# Grab another sentence from the web, but this time, cut off the sentence
# before the end so that some key grammatical element is missing. Do paste
# a period on the end, though, just to see if you can confuse spaCy.

# 6.14: Cut and paste part of a sentence from the web into a text variable.

text = """The planet Venus has been used as a setting in fiction since before the 19th century.
          Its impenetrable cloud cover gave writers free rein to speculate on conditions at Venus's surface,
          which was often depicted as warmer than Earth's but habitable. Images of a lush, verdant paradise."""

# 6.15: Tokenize the sentence.

doc = nlp(text)

# 6.16: Generate a table showing the dependency relations in the sentence.

tabdata = [ (token.text, token.tag_, token.dep_, token.head.text, token.head.tag_) for token in doc]

print(tabulate(tabdata,  headers=["Token", "Token POS", "Dependency", "Head Token", "Head POS"]))

# 6.17: Add a comment to document any mistakes that spaCy made.


Token         Token POS    Dependency    Head Token    Head POS
------------  -----------  ------------  ------------  ----------
The           DT           det           Venus         NNP
planet        NN           compound      Venus         NNP
Venus         NNP          nsubjpass     used          VBN
has           VBZ          aux           used          VBN
been          VBN          auxpass       used          VBN
used          VBN          ROOT          used          VBN
as            IN           prep          used          VBN
a             DT           det           setting       NN
setting       NN           pobj          as            IN
in            IN           prep          setting       NN
fiction       NN           pobj          in            IN
since         IN           prep          used          VBN
before        IN           prep          since         IN
the           DT           det           century       NN
19th          JJ           amod          century       NN

In [41]:
# There is a graphical display module for
# spaCy that supports drawing a figure of the dependency relations.
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

Now that you can see the dependency relations as a graph, do you notice any problems with the parsing? Did spaCy make any mistakes in connecting the various elements of the sentence?
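One way to check a parse without the picture is to follow the head links in code. The sketch below does not run a trained model: it builds a tiny `Doc` by hand, copying the head and dependency values for "Venus has been used" from the table output above, and then walks the tree. It assumes spaCy version 3, where the `heads` argument takes absolute token positions; `toy_doc` is a made-up name.

```python
import spacy
from spacy.tokens import Doc

# Hand-built annotations (no trained model is involved here).
words = ["Venus", "has", "been", "used"]
heads = [3, 3, 3, 3]  # absolute index of each token's head; the root points to itself
deps = ["nsubjpass", "aux", "auxpass", "ROOT"]

vocab = spacy.blank("en").vocab
toy_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# The root is the token whose dependency label is ROOT.
root = [token for token in toy_doc if token.dep_ == "ROOT"][0]

# Every other token in this fragment hangs directly off the root.
children = [(child.text, child.dep_) for child in root.children]
print(root.text, children)
```

The same `token.head` and `token.children` attributes are available on a `Doc` produced by a trained pipeline, so a loop like this can be used to look for tokens that were attached to an unexpected head.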

# Named Entity Recognition

Whenever spaCy finds a token that looks like a proper noun, it tags it as a predicted named entity. "Predicted," because each spaCy language model has a trained classifier that makes predictions of whether or not a token might be a named entity and also what type of entity it is (e.g., an organization, a country, or something else).

The IOB tagging method is a straightforward way of notating the status of tokens. Tokens can have one of the following four statuses:

| TAG | ID | DESCRIPTION                           |
|:-----:|:----:|:---------------------------------------:|
| I   | 1  | Token is inside an entity.            |
| O   | 2  | Token is outside an entity.           |
| B   | 3  | Token begins an entity.               |
|     | 0  | No entity tag is set (missing value). |
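To make the scheme concrete, here is a minimal pure-Python sketch of how a sequence of IOB tags can be grouped back into entities. The function name `iob_to_spans` and the sample lists are hypothetical; this is an illustration of the tagging scheme, not spaCy's own implementation.

```python
def iob_to_spans(tokens, iob_tags, ent_types):
    """Group parallel lists of tokens, IOB tags, and entity types into entities."""
    spans = []
    current = None  # (list of tokens, label) for the entity being built
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == "B":
            # A new entity starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = ([token], ent_type)
        elif iob == "I" and current:
            # Continue the entity that is currently open.
            current[0].append(token)
        else:
            # "O", "", or a stray "I" with nothing open: close any open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(toks), label) for toks, label in spans]


tokens = ["bringing", "Syracuse", "University", "to", "Chicago"]
iob_tags = ["O", "B", "I", "O", "B"]
ent_types = ["", "ORG", "ORG", "", "GPE"]
entities = iob_to_spans(tokens, iob_tags, ent_types)
print(entities)
```

The key point is that "B" is what separates two entities that sit next to each other; without it, adjacent "I" tokens would run together into a single entity.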

In [42]:
text='''We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago'''
doc = nlp(text)
type(doc.ents)

tuple

In [43]:
# So the collection of entities is a tuple, which means we should be able to index it.
doc.ents[0]

Syracuse University’s

In [44]:
# We can iterate through all of the entities in the document.
for ent in doc.ents:
    print("{}, [{},{}), {}".format(ent.text, ent.start_char, ent.end_char, ent.label_))

    # Each entity is a Span. Here we convert it to a standalone Doc with
    # as_doc() and iterate through its tokens.
    # ent_iob_ is the IOB code of the token: "B" means the token begins an
    # entity, "I" means it is inside an entity, "O" means it is outside an
    # entity, and "" means no entity tag is set.
    for token in ent.as_doc():
        print("    {} {} {}".format(token, token.ent_iob_, token.ent_type_))

Syracuse University’s, [34,55), ORG
    Syracuse B ORG
    University I ORG
    ’s I ORG
150 years, [56,65), DATE
    150 B DATE
    years I DATE
Chicago, [79,86), GPE
    Chicago B GPE


- __text__ gives the Unicode text representation of the entity.
- __start_char__ denotes the character offset for the start of the entity.
- __end_char__ denotes the character offset for the end of the entity.
- __label___ gives the label of the entity.
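The character offsets form a half-open interval, which is why the output above prints them as `[start,end)`: slicing the original string with them gives back the entity text. The sketch below illustrates this with a blank pipeline and an entity that is marked by hand rather than predicted; the text, the "GPE" label assignment, and the names `nlp_blank` and `doc_manual` are assumptions made for the example.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; no trained entity recognizer
text = "We are bringing the celebration to Chicago"
doc_manual = nlp_blank(text)

# Mark "Chicago" as an entity by hand (for illustration only).
start = text.index("Chicago")
span = doc_manual.char_span(start, start + len("Chicago"), label="GPE")
doc_manual.ents = [span]

ent = doc_manual.ents[0]
print(ent.text, ent.start_char, ent.end_char, ent.label_)

# [start_char, end_char) is a half-open slice into the original string.
print(text[ent.start_char:ent.end_char] == ent.text)
```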

In [45]:
# The displacy module can also provide a graphical view of the named entities:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

In [46]:
# Now you copy and paste another sentence from the web that seems to contain
# some named entities.

# 6.18: Cut and paste part of a sentence from the web into a text variable.
text = """In the city of Paris, John and Mary visited the Louvre Museum, known for its stunning art collection.
They admired the famous Mona Lisa painting, created by Leonardo da Vinci. After their museum tour, they enjoyed delicious
croissants and coffee at a local cafe. Back in London, they took a ride on the iconic London Eye, a giant observation wheel
on the South Bank of the River Thames. The view from the top was breathtaking, with Big Ben, Buckingham Palace, and the Shard in the distance.
Their next stop was New York City, where they explored Central Park and Times Square. They caught a Broadway show and had dinner
at a trendy restaurant in the heart of Manhattan.
"""

# 6.19: Tokenize the sentence.

doc = nlp(text)

# 6.20: Generate a displacy graphic with the named entities.
displacy.render(doc, style="ent", jupyter=True)

# Optional Advanced Topic: Processing pipelines

Throughout this lab, we have been calling a function that we often referred to as nlp(). After loading a spaCy language model such as en_core_web_sm, we instantiate a pipeline object to conduct all of the steps that we will routinely want to accomplish with a document.

The spaCy pipeline can be modified to change the default components or to add new components. Here's a list of the default components from the spaCy documentation:

<center><img style="text-align:center;" src="img/pipeline.png"></center>

| NAME      | COMPONENT         | CREATES                                             | DESCRIPTION                                      |
|:-----------|:-------------------|:-----------------------------------------------------|:--------------------------------------------------|
| tokenizer | Tokenizer         | Doc                                                 | Segment text into tokens.                        |
| tagger    | Tagger            | Doc[i].tag                                          | Assign part-of-speech tags.                      |
| parser    | DependencyParser  | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels.                        |
| ner       | EntityRecognizer  | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type           | Detect and label named entities.                 |
| textcat   | TextCategorizer   | Doc.cats                                            | Assign document labels.                          |
| …         | custom components | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                  | Assign custom attributes, methods or properties. |

In [47]:
# We can also examine the pipeline for an instantiated object like this:
import spacy
nlp = spacy.load("en_core_web_sm")

nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x79a58f738c40>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x79a58d39e8c0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x79a58ef736f0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x79a58d1a7f80>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x79a58eee2bc0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x79a58d3d3a00>)]

You may notice in the list above that there is no tokenizer. In the spaCy pipeline model, it is assumed that tokenization was accomplished before pipeline processing begins. All pipeline elements receive a doc object, work on it, and return a doc object. The tokenizer has a different kind of job because it receives a raw character string and returns a doc object containing the tokens. Thus, the language object has a different slot where the tokenizer is listed:

In [54]:
nlp.tokenizer

<spacy.tokenizer.Tokenizer at 0x79a58c2d09d0>
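The division of labor can be sketched by running the two stages by hand. The example below assumes spaCy version 3 and uses a blank English pipeline with a single lightweight component (`sentencizer`) so that it does not depend on a trained model; the text and variable names are made up for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # one lightweight pipeline component

text = "Groucho Marx shot an elephant. It was in his pajamas."

# Step 1: the tokenizer turns a raw string into a Doc.
doc_steps = nlp_blank.tokenizer(text)
print(type(doc_steps).__name__)

# Step 2: each pipeline component receives the Doc and returns it.
for name, component in nlp_blank.pipeline:
    doc_steps = component(doc_steps)

# Calling nlp_blank(text) performs both steps in a single call.
doc_once = nlp_blank(text)
print([t.text for t in doc_steps] == [t.text for t in doc_once])
print(len(list(doc_steps.sents)), len(list(doc_once.sents)))
```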

The spaCy pipeline was designed to balance simplicity and computational effort. Simplicity is important for getting started quickly with a language processing task, so the default pipeline contains the components that most people need to address a realistic task. But if a component is not needed, taking it out of the pipeline can save a lot of compute time. Take a look at this example:

In [49]:
print(spacy.__version__) # Some version dependent stuff below

3.6.1


In [50]:
# Let's skip the entity recognition and the dependency parsing
nlp_simple = spacy.load("en_core_web_sm", disable=["parser","ner"])
# Note that version 3 of spaCy accepts both disable and exclude here:
# disable loads a component but leaves it switched off, while exclude
# does not load it at all. Version 3 also adds facilities for enabling
# and disabling pipeline elements on the fly.

print(nlp_simple.pipeline) # Show the pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x79a58edda980>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x79a58edda9e0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x79a58f815180>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x79a58eeb8180>)]
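As a sketch of switching a component off and on again on the fly, the example below uses `select_pipes` as a context manager. It assumes spaCy version 3 and, to stay independent of a trained model, works on a blank pipeline with only a `sentencizer`; the same call pattern should apply to components such as "parser" or "ner" in a loaded model.

```python
import spacy

nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# Temporarily switch the component off inside a with-block.
with nlp_blank.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_blank.pipe_names)
    print(names_inside)

# On leaving the block the component is active again.
names_after = list(nlp_blank.pipe_names)
print(names_after)
```

This is handy when only one part of a larger job can skip an expensive component, since the model does not have to be reloaded.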


In [51]:
# Groucho Marx is an entity, but this pipeline doesn't detect it
simple_doc = nlp_simple("Groucho Marx shot an elephant in his underpants.")

[ent for ent in simple_doc.ents]

[]

In [52]:
# And the pipeline did not do dependency parsing
[token.dep_ for token in simple_doc]

['', '', '', '', '', '', '', '', '']

# Adding Custom Pipeline Components

A component receives a Doc object and can modify it. By adding a component to the pipeline, you’ll get access to the Doc at any point during processing – instead of only being able to modify it afterwards. You can control the position of the new component in the pipeline with the last, first, before, and after arguments.

| ARGUMENT | TYPE | DESCRIPTION                                          |
|----------|------|------------------------------------------------------|
| doc      | Doc  | The Doc object processed by the previous component.  |
| RETURNS  | Doc  | The Doc object processed by this pipeline component. |

| ARGUMENT | TYPE    | DESCRIPTION                                                        |
|----------|---------|--------------------------------------------------------------------|
| last     | bool    | If set to True, component is added last in the pipeline (default). |
| first    | bool    | If set to True, component is added first in the pipeline.          |
| before   | str     | String name of component to add the new component before.          |
| after    | str     | String name of component to add the new component after.           |
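A minimal sketch of these two tables in action, assuming spaCy version 3, where a component function is registered under a string name and then added by that name. The component name `token_counter` and the custom attribute `n_tokens` are hypothetical, chosen only for this illustration, and a blank pipeline is used so that no trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute that will be readable as doc._.n_tokens.
Doc.set_extension("n_tokens", default=None, force=True)


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # Receive the Doc, modify it, and hand it on.
    doc._.n_tokens = len(doc)
    return doc  # a component must return the Doc


nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("sentencizer")
# Place the new component ahead of the sentencizer.
nlp_custom.add_pipe("token_counter", before="sentencizer")

doc_custom = nlp_custom("This is a sentence.")
print(nlp_custom.pipe_names)
print(doc_custom._.n_tokens)
```

Leaving out the positional argument would add the component last, which is the default listed in the table.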

In [53]:
import spacy
from spacy.tokens import Doc, Span, Token
import json

# Instantiate a default pipeline
nlp = spacy.load("en_core_web_sm")

# Process a sentence
doc = nlp("This is a sentence.")
print("Before stopwords_removal, this doc is: {}".format(doc))
# See the result
print("After stopwords_removal, this doc is: {}".format([token.text for token in doc if not token.is_stop]))

Before stopwords_removal, this doc is: This is a sentence.
After stopwords_removal, this doc is: ['sentence', '.']
