## CHAPTER-2
Natural Language Processing - spaCy

Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

#### Data Structures (1): Vocab, Lexemes and StringStore

Welcome back! Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.

In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

Shared vocab and string store (1)

Vocab: stores data shared across multiple documents
To save memory, spaCy encodes all strings to hash values
Strings are only stored once in the StringStore via nlp.vocab.strings
String store: lookup table in both directions

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp dot vocab dot strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.
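
To make the two-way lookup concrete, here is a minimal toy sketch in plain Python. The `ToyStringStore` class and its SHA-256-based hash are assumptions made for illustration only; spaCy's real string store uses its own 64-bit hash function and storage layout. The sketch shows why a string can always be turned into a hash, while a hash can only be turned back into a string by a store that has seen that string.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table (hypothetical, not spaCy's implementation)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Any deterministic hash works for the illustration
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        # Store the string once, keyed by its hash
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash never needs the table
            return self.hash_string(key)
        # Hash -> string only works if the string was added before
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])

# A second, empty store computes the same hash but cannot reverse it
other = ToyStringStore()
try:
    other[coffee_hash]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup failed in the empty store:", lookup_failed)
```

This is the same situation as the KeyError in the cells that follow: the hash alone carries no way back to the text, so the objects that need the text must share one vocab.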

In [2]:
import spacy

# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

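# Looking up a string only computes its hash
coffee_hash = nlp.vocab.strings['coffee']
# Raises a KeyError: "coffee" was never added to this fresh vocab
coffee_string = nlp.vocab.strings[coffee_hash]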

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'."

Hashes can't be reversed – that's why we need to provide the shared vocab

In [3]:
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'."

Shared vocab and string store (2)

To get the hash for a string, we can look it up in nlp dot vocab dot strings.

To get the string representation of a hash, we can look up the hash.

A Doc object also exposes its vocab and strings.

In [4]:
# Look up the string and hash in nlp.vocab.strings

doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [5]:
# The doc also exposes the vocab and strings

doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401


###### Lexemes: entries in the vocabulary

Lexemes are context-independent entries in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphanumeric characters.

Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

In [6]:
# A Lexeme object is an entry in the vocabulary

doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


Contains the context-independent information about a word

Word text: lexeme.text and lexeme.orth (the hash)

Lexical attributes like lexeme.is_alpha

Not context-dependent part-of-speech tags, dependencies or entity labels
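
As a small sketch of what "context-independent" means in practice, the example below checks that the token "coffee" in two different documents refers to the same vocabulary entry. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only pipeline, so no trained model is needed; the two example sentences are made up for illustration.

```python
import spacy

# Tokenizer-only English pipeline, no trained components required
nlp = spacy.blank("en")

doc1 = nlp("I love coffee")
doc2 = nlp("coffee is great")

# Look up the vocabulary entry for the string
lexeme = nlp.vocab["coffee"]

# The token in each doc carries the same hash as the lexeme
token1 = doc1[2]
token2 = doc2[0]
same_entry = token1.orth == token2.orth == lexeme.orth

print(lexeme.text, lexeme.orth, lexeme.is_alpha)
print("Same vocabulary entry in both docs:", same_entry)
```

The position of the word and its neighbours differ between the two documents, but the lexical attributes do not, which is why they can live once in the shared vocab.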

In [7]:
# String to Hash

#Part 1
# Look up the string “cat” in nlp.vocab.strings to get the hash.
# Look up the hash to get back the string.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [8]:
# Part 2
# Look up the string label “PERSON” in nlp.vocab.strings to get the hash.
# Look up the hash to get back the string.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


#### The Doc object

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the nlp object, we can import the Doc class from spacy dot tokens.

Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.

In [9]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
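
The role of the spaces list can be sketched without spaCy at all. The helper below, `join_tokens`, is a hypothetical name introduced for illustration and is not part of spaCy; it rebuilds the text by appending a space after each word whose flag is True, which is the relationship between words, spaces and the resulting text described above.

```python
def join_tokens(words, spaces):
    """Rebuild text: each word is followed by a space only if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


text = join_tokens(["Hello", "world", "!"], [True, False, False])
print(text)  # Hello world!
```

Because every token has its own flag, including the last one, the original text can be reconstructed exactly, with no guessing about where whitespace belongs.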

To create a Span manually, we can also import the class from spacy dot tokens. We can then instantiate it with the doc and the span's start and end index.

To add an entity label to the span, we provide it as the label argument. Older spaCy versions required looking up the label's hash in the string store first; the code in the next cell passes the string directly.

doc dot ents is writable, so we can add entities manually by overwriting it with a list of spans.

In [10]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

Best practices

Doc and Span are very powerful and hold references and relationships of words and sentences.

Convert result to strings as late as possible.

Use token attributes if available – for example, token.i for the token index.

Don't forget to pass in the shared vocab.

A few tips and tricks before we get started:

The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

To keep things consistent, try to use built-in token attributes wherever possible. For example, token dot i for the token index.

Also, don't forget to always pass in the shared vocab!

In [11]:
# Creating a Doc

# Let’s create some Doc objects from scratch!

# Part 1
# Import the Doc from spacy.tokens.
# Create a Doc from the words and spaces. Don’t forget to pass in the vocab!

import spacy

nlp = spacy.load("en_core_web_sm")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [12]:
# Part 2
# Import the Doc from spacy.tokens.
# Create a Doc from the words and spaces. Don’t forget to pass in the vocab!

import spacy

nlp = spacy.load("en_core_web_sm")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [13]:
# Part 3
# Import the Doc from spacy.tokens.
# Complete the words and spaces to match the desired text and create a doc.

import spacy

nlp = spacy.load("en_core_web_sm")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


Docs, Span and Entity from Scratch

In this exercise, you’ll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

Import the Doc and Span classes from spacy.tokens.

Use the Doc class directly to create a doc from the words and spaces.

Create a Span for “David Bowie” from the doc and assign it the label "PERSON".

Overwrite the doc.ents with a list of one entity, the “David Bowie” span.

In [14]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
# (this converts to strings early; the next cell uses token attributes instead)
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if there is a next token and whether it is a verb
        if index + 1 < len(pos_tags) and pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin


In [23]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if there is a next token and whether it is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


###### Word vectors and semantic similarity

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.

You'll also learn about how to use word vectors and how to take advantage of them in your NLP application.

###### Comparing semantic similarity

spaCy can compare two objects and predict similarity
Doc.similarity(), Span.similarity() and Token.similarity()
Take another object and return a similarity score (0 to 1)
Important: needs a model that has word vectors included, for example:
✅ en_core_web_md (medium model)
✅ en_core_web_lg (large model)
🚫 NOT en_core_web_sm (small model)

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a dot similarity method that takes another object and returns a floating point number between 0 and 1, indicating how similar they are.

One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.
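
By default, spaCy computes similarity as the cosine similarity of the two objects' vectors, and the vector of a Doc or Span defaults to the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The three-dimensional vectors are made-up toy values, not real word vectors, and the function names are chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # No direction to compare for an all-zero vector
        return 0.0
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean, as used for a multi-token Doc or Span."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy vectors (hypothetical values for illustration only)
great = [0.9, 0.1, 0.3]
restaurant = [0.2, 0.8, 0.4]
bar = [0.3, 0.7, 0.5]

span_vector = average_vector([great, restaurant])
print(cosine_similarity(restaurant, bar))
print(cosine_similarity(span_vector, bar))
```

Cosine similarity can in general be negative when vectors point in opposite directions, so the 0 to 1 range describes typical scores rather than a hard guarantee.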

In [24]:
import spacy

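# Load the en_core_web_md model (left commented out in this run, so nlp is
# the pipeline from an earlier cell; uncomment to get real word vectors)
#nlp = spacy.load("en_core_web_md")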

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[ 2.1561384   0.6859281  -1.8234854   0.4145496  -0.886605    5.0773377
  0.28650832  3.6156225  -2.627604    5.01052     2.6055033   5.4986916
 -0.82726336 -2.4128723  -1.5714562   0.67344356 -1.1230624   3.017315
  3.4531426   2.6312394  -2.3144596   2.0717711  -0.5736556  -0.5199362
 -0.4892068   1.4417053   1.1748309   3.291245    2.7368522   2.1909308
  2.4100504  -1.5442916  -0.81270695 -1.7967525  -2.4401696   0.96489155
 -5.071314    2.4865592  -1.1760099   1.0010973  -1.8218107   6.159581
  5.876448   -1.9877293   6.579393    1.0499439  -1.5798447  -4.1203165
 -0.17076118 -4.819325   -2.1152763  -4.640588    1.5844907  -3.2757292
  2.1921952  -0.47692332 -1.8678508   1.0092752   0.7716696  -0.37776387
  0.07058215 -0.18511617  5.209738   -3.002555   -1.8404679   4.089005
 -2.0230193   1.0394226  -1.7199193   1.0383378   0.23976706 -0.67239416
  1.3192352  -0.33726573  0.21724188 -0.5032941   0.26279616 -0.58214176
 -3.0981517  -4.9684753  -3.2268834  -4.5933228  -3.0618596  ...

###### Comparing similarities - Part-1

In this exercise, you’ll be using spaCy’s similarity methods to compare Doc, Token and Span objects and get similarity scores.

Use the doc.similarity method to compare doc1 to doc2 and print the result.

In [26]:
import spacy

# Uncomment to load a model with word vectors; this run reuses the earlier nlp
#nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.5566261254871249


###### Comparing similarities - Part-2

Use the token.similarity method to compare token1 to token2 and print the result.

In [27]:
import spacy

# Uncomment to load a model with word vectors; this run reuses the earlier nlp
#nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.3083113


###### Comparing similarities - Part-3

Create spans for “great restaurant” and “really nice bar”.
Use span.similarity to compare them and print the result.

In [28]:
import spacy

# Uncomment to load a model with word vectors; this run reuses the earlier nlp
# nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7244181


###### Combining models and rules

Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox.

In this lesson, we'll take a look at how to do it with spaCy.

###### Statistical predictions vs. rules

| | Statistical models | Rule-based systems |
| --- | --- | --- |
| Use cases | application needs to generalize based on examples | |
| Real-world examples | product names, person names, subject/object relationships | |
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | |

Statistical models are useful if your application needs to be able to generalize based on a few examples.

For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.

To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

Statistical predictions vs. rules
| | Statistical models | Rule-based systems |
| --- | --- | --- |
| Use cases | application needs to generalize based on examples | dictionary with finite number of examples |
| Real-world examples | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds |
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher, PhraseMatcher |

Rule-based approaches, on the other hand, come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.

In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.

In [29]:
# Recap: Rule-based Matching
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
# spaCy v2 signature; spaCy v3 expects matcher.add('LOVE_CATS', [pattern])
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

In the last chapter, you learned how to use spaCy's rule-based matcher to find complex patterns in your texts. Here's a quick recap.

The matcher is initialized with the shared vocabulary – usually nlp dot vocab.

Patterns are lists of dictionaries, and each dictionary describes one token and its attributes. Patterns can be added to the matcher using the matcher dot add method.

Operators let you specify how often to match a token. For example, "+" will match one or more times.

Calling the matcher on a doc object will return a list of the matches. Each match is a tuple consisting of an ID, and the start and end token index in the document.
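
The operator and the shape of the returned matches can be tried out with a minimal sketch. It assumes spaCy is installed and uses the list-based `matcher.add` signature from spaCy v3, which differs from the v2-style calls in this notebook. The pattern only relies on the lowercase token text, so the tokenizer-only `spacy.blank("en")` pipeline is enough. The pattern name "VERY_HAPPY" is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: the pattern below needs no tagger or parser
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "very" one or more times, followed by "happy"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "happy"}]
matcher.add("VERY_HAPPY", [pattern])

doc = nlp("I love cats and I'm very very happy")

matched_texts = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the string store turns it back into the name
    name = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matched_texts.append(span.text)
    print(name, start, end, span.text)
```

Because "+" can match one or more tokens, the matcher may report more than one overlapping span for the same stretch of text, so downstream code often filters for the longest match.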

In [30]:
# Adding statistical predictions
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


Here's an example of a matcher rule for "golden retriever".

If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span. We can then find out more about it. Span objects give us access to the original document and all other token attributes and linguistic features predicted by the model.

For example, we can get the span's root token. If the span consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever". We can also find the head token of the root. This is the syntactic "parent" that governs the phrase – in this case, the verb "have".

Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".

##### Efficient phrase matching (1)
PhraseMatcher: like regular expressions or keyword search – but with access to the tokens!
Takes Doc objects as patterns
More efficient and faster than the Matcher
Great for matching large word lists

The phrase matcher is another helpful tool to find sequences of words in your data.

It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

It takes Doc objects as patterns.

It's also really fast.

This makes it very useful for matching large dictionaries and word lists on large volumes of text.

In [31]:
# Efficient phrase matching (2)
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever


Here's an example.

The phrase matcher can be imported from spacy dot matcher and follows the same API as the regular matcher.

Instead of a list of dictionaries, we pass in a Doc object as the pattern.

We can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match. This lets us create a Span object for the matched tokens "Golden Retriever" to analyze it in context.

In [32]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


In [34]:
## Efficient Phrase Matching

import json
from spacy.lang.en import English

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

FileNotFoundError: [Errno 2] No such file or directory: 'exercises/countries.json'

Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
Add the phrase patterns and call the matcher on the doc.
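
Since the countries file is not available in this run, here is a minimal runnable sketch of the same steps. The three-item `COUNTRIES` list is a hypothetical stand-in for the contents of the JSON file, and the code assumes spaCy is installed and uses the list-based `matcher.add` signature from spaCy v3.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical stand-in for the list loaded from the JSON file
COUNTRIES = ["Czech Republic", "Slovakia", "Germany"]

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Initialize the PhraseMatcher with the shared vocab
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects in one batch and add them to the matcher
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Sort by start index so the results follow the order of the text
matches = sorted(matcher(doc), key=lambda match: match[1])
found = [doc[start:end].text for match_id, start, end in matches]
print(found)
```

Each match is still a token span in the document, so the surrounding tokens and any model predictions remain available for further analysis.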