## Introduction to spaCy

### The nlp object

* At the center of spaCy is the object containing the processing pipeline
* This variable is usually called "nlp"
* It contains all the different components in the pipeline
* It also includes language-specific rules used for tokenizing the text into words and punctuation

In [1]:
# Import the English language class
from spacy.lang.en import English

In [2]:
# Create the nlp object
nlp = English()

### The Doc object

* When you process a text with the nlp object, spaCy creates a Doc object
* The Doc lets you access the information about the text in a structured way

In [3]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

In [4]:
# Iterate over tokens in a doc
for token in doc:
    print(token.text)

Hello
world
!


### The Token object

<img src="images/Token_object.png"/>

* Token objects represent the tokens in a document - for example, a word or a punctuation character
* To get a token at a specific position, you can index into the Doc
* Token objects also provide various attributes that let you access more information about the tokens
* For example, the .text attribute returns the verbatim token text

In [5]:
doc = nlp("Hello world!")

In [6]:
# Index into the Doc to get a single Token
token = doc[1]

In [7]:
# Get the token text via the .text attribute
print(token.text)

world


### The Span object

<img src="images/span_object.png"/>

* A Span object is a slice of the document consisting of one or more tokens
* It's only a view of the Doc and doesn't contain any data itself

In [8]:
doc = nlp("Hello world!")

In [9]:
# A slice from the Doc is a Span object
span = doc[1:3]

In [10]:
# Get the span text via the .text attribute
print(span.text)

world!
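
To make the "view" idea concrete, the sketch below checks that a Span only records where it starts and ends in its parent Doc. It assumes the blank English pipeline used above; start, end and doc are Span attributes, and the variable names are only illustrative.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# Slice tokens 1 and 2 (the end index is exclusive)
span = doc[1:3]

# The span stores token offsets into the parent Doc, not a copy of the text
print(span.start, span.end)   # token indices of the slice boundaries
print(span.doc is doc)        # the span points back at the same Doc object
print(span[0].i)              # tokens in the span keep their Doc-level index
```

Because the span refers back to the Doc, its tokens report their position in the whole document rather than their position inside the slice.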


### Lexical Attributes

* "i" is the index of the token within the parent document
* "text" returns the token text
* is_alpha, is_punct and like_num return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number. For example, like_num is true for the token "10" and also for the word "ten".
* These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [11]:
doc = nlp("It costs $5.")

In [12]:
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']


In [13]:
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
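
The claim that like_num covers both digits and number words can be checked with a short sketch. It assumes the blank English pipeline from above; the sentence and variable names are made up for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("We have 10 apples and ten pears.")

# Collect the tokens that resemble a number, written as digits or as a word
number_like = [token.text for token in doc if token.like_num]
print(number_like)

# is_alpha is True only for tokens made up of alphabetic characters
alpha_flags = {token.text: token.is_alpha for token in doc}
print(alpha_flags["10"], alpha_flags["ten"])
```

Both "10" and "ten" count as number-like, while only "ten" is alphabetic.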


## Getting Started

### Part 1: English

* Import the English class from spacy.lang.en and create the nlp object.
* Create a doc and print its text.

In [14]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


### Part 2: German

* Import the German class from spacy.lang.de and create the nlp object.
* Create a doc and print its text.

In [15]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


### Part 3: Spanish

* Import the Spanish class from spacy.lang.es and create the nlp object
* Create a doc and print its text

In [16]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


## Documents, spans and tokens

### Step 1

* Import the English language class and create the nlp object
* Process the text and instantiate a Doc object in the variable doc
* Select the first token of the Doc and print its text

In [17]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


### Step 2

* Import the English language class and create the nlp object
* Process the text and instantiate a Doc object in the variable doc
* Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”

In [18]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## Lexical attributes

* Use the like_num token attribute to check whether a token in the doc resembles a number
* Get the token following the current token in the document. The index of the next token in the doc is token.i + 1
* Check whether the next token’s text attribute is a percent sign "%"

In [19]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
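
One caveat with the loop above: doc[token.i + 1] raises an IndexError when a number-like token is the last token in the doc. The following is a minimal sketch of a guarded version, assuming a made-up sentence that ends in a number.

```python
from spacy.lang.en import English

nlp = English()
# The text ends with a number, so there is no token after "10"
doc = nlp("The rate rose from 4% to 10")

percentages = []
for token in doc:
    # Only look ahead when a following token actually exists
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print("Percentages found:", percentages)
```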


## Statistical Models

### What are statistical models?

* Enable spaCy to predict linguistic attributes in context
    * Part-of-speech tags
    * Syntactic dependencies
    * Named entities
* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

### Model Packages

* spaCy provides a number of pre-trained model packages you can download using the "spacy download" command
* The package includes
    * Binary weights that enable spaCy to make predictions
    * Vocabulary
    * Meta information to tell spaCy which language class to use and how to configure the processing pipeline
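
The meta information and pipeline configuration can be inspected on any nlp object. The sketch below uses a blank English pipeline so that nothing has to be downloaded; that is an assumption made for the example, and a downloaded package would typically list its trained components where the blank pipeline shows none.

```python
from spacy.lang.en import English

# A blank pipeline: language class and tokenizer rules, but no trained components
nlp = English()

meta = nlp.meta
print(meta["lang"])      # language code the pipeline was created for
print(nlp.pipe_names)    # names of the pipeline components (none here)
```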

In [20]:
! python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [21]:
import spacy

In [22]:
nlp = spacy.load('en_core_web_sm')

### Predicting Part-of-speech Tags

In [23]:
import spacy

In [24]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

In [25]:
# Process a text
doc = nlp("She ate the pizza")

In [26]:
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


### Predicting Syntactic Dependencies

* In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object
* The .dep_ attribute returns the predicted dependency label
* The .head attribute returns the syntactic head token

In [27]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Dependency label scheme

<img src="images/dependency_label_scheme.png"/>

* To describe syntactic dependencies, spaCy uses a standardized label scheme

* Here's an example of some common labels:
    * The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate"
    * The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she"
    * The determiner "the", also known as an article, is attached to the noun "pizza"

### Predicting Named Entities

<img src="images/named_entities.png"/>
    
* Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country
* The doc.ents property lets you access the named entities predicted by the model
* It returns a sequence of Span objects, so we can print the entity text and the entity label using the .label_ attribute

In [28]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [29]:
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


#### Tip: Get quick definitions of most common tags and labels

* To get definitions for the most common tags and labels, you can use the spacy.explain helper function

In [30]:
spacy.explain('GPE')

'Countries, cities, states'

In [31]:
spacy.explain('NNP')

'noun, proper singular'

In [32]:
spacy.explain('dobj')

'direct object'
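
The same helper can be looped over several labels at once. The sketch below looks up a handful of tags and labels that appear in the outputs earlier in this section; the choice of labels and the column width are only illustrative.

```python
import spacy

# Labels taken from the example outputs earlier in this section
labels = ["PRON", "DET", "nsubj", "dobj", "ORG", "MONEY"]

for label in labels:
    # spacy.explain looks the label up in spaCy's built-in glossary
    print(f"{label:<8}{spacy.explain(label)}")
```

No model needs to be loaded for this, since spacy.explain only reads from the glossary.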

## Model Packages

What’s not included in a model package that you can load into spaCy?

* A meta file including the language, pipeline and license.
* Binary weights to make statistical predictions.
* The labelled data that the model was trained on.
* Strings of the model's vocabulary and their hashes.

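Answer: the labelled data that the model was trained on. A model package ships the trained weights, the vocabulary and the meta information, but not the training data.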

## Loading Models

* Use spacy.load to load the small English model 'en_core_web_sm'
* Process the text and print the document text

In [33]:
import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


## Predicting linguistic annotations

### Part 1

* Process the text with the nlp object and create a doc
* For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label)

In [34]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


### Part 2

* Process the text and create a doc object
* Iterate over the doc.ents and print the entity text and label_ attribute

In [35]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


## Predicting named entities in context

* Process the text with the nlp object
* Iterate over the entities and print the entity text and label
* Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [36]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

New iPhone EVENT
Apple ORG
Missing entity: iPhone X
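
If it is unclear which indices to slice, printing each token next to its index helps. This is a small sketch that only needs the tokenizer, so it assumes the blank English class rather than the loaded model.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# List the first few tokens with their index to see where "iPhone" and "X" sit
for token in doc[:4]:
    print(token.i, token.text)

# "iPhone" is token 1 and "X" is token 2, so the slice is doc[1:3]
iphone_x = doc[1:3]
print(iphone_x.text)
```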


## Rule-based matching

### Why not just regular expressions?

* Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
* It's also more flexible: you can search for texts but also other lexical attributes
* You can even write rules that use the model's predictions
* For example, find the word "duck" only if it's a verb, not a noun

### Match patterns

* Match patterns are lists of dictionaries
* Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values
* Match exact token texts

[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

* Match lexical attributes

[{'LOWER': 'iphone'}, {'LOWER': 'x'}]

* Match any token attributes

[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
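
The difference between the first two pattern types can be seen side by side in the sketch below. It makes two assumptions: a blank English pipeline is enough, because TEXT and LOWER need no model predictions (LEMMA and POS patterns do), and matcher.add is called in its list-of-patterns form, which newer spaCy releases expect. The example sentence and pattern names are made up.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("I sold my iPhone X and bought an IPHONE X again")

# Exact-text pattern: only matches the original casing
exact = Matcher(nlp.vocab)
exact.add("EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

# Lowercase pattern: compares the lowercased form of each token
lower = Matcher(nlp.vocab)
lower.add("LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

exact_hits = [doc[start:end].text for _, start, end in exact(doc)]
lower_hits = [doc[start:end].text for _, start, end in lower(doc)]
print(exact_hits)
print(lower_hits)
```

The exact pattern finds only the first mention, while the lowercase pattern finds both.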

### Using the matcher

* To use a pattern, we first import the Matcher from spacy.matcher
* We also load a model and create the nlp object
* The matcher is initialized with the shared vocabulary, nlp.vocab
* The matcher.add method lets you add a pattern
    * The first argument is a unique ID to identify which pattern was matched
    * The second argument is an optional callback
    * The third argument is the pattern
* To match the pattern on a text, we can call the matcher on any doc
* This will return the matches

In [37]:
import spacy

In [38]:
# Import the matcher
from spacy.matcher import Matcher

In [39]:
# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

In [40]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

In [41]:
# Add the pattern to the matcher
pattern = [{'TEXT' : 'iPhone'}, {'TEXT' : 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

In [42]:
# Process some text
doc = nlp("New iPhone X release date leaked")

In [43]:
# Call the matcher on the text
matches = matcher(doc)

* When you call the matcher on a doc, it returns a list of tuples.
* Each tuple consists of three values: the match ID, the start index and the end index of the matched span.
* This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.
* **match_id**: hash value of the pattern name
* **start**: start index of matched span
* **end**: end index of matched span

In [44]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
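
Since match_id is a hash, it is not readable on its own. The sketch below looks the hash up in the shared string store, nlp.vocab.strings, to recover the pattern name. It assumes a blank English pipeline and the list-of-patterns form of matcher.add, which differs from the (name, callback, pattern) call used in the cells above.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

for match_id, start, end in matches:
    # match_id is an integer hash; the string store maps it back to the name
    pattern_name = nlp.vocab.strings[match_id]
    print(pattern_name, start, end, doc[start:end].text)
```

This is useful once several patterns are added to the same matcher and you need to know which one produced a match.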


### Matching lexical attributes

In [45]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

In [46]:
doc = nlp("2018 FIFA World Cup: France won!")

In [47]:
matcher.add('LEX_PATTERN', None, pattern)

In [48]:
# Call the matcher on the text
matches = matcher(doc)

In [49]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


### Matching over other token attributes

In [50]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

In [51]:
doc = nlp("I loved dogs but now I love cats more.")

In [52]:
matcher.add('TOKEN_ATT_PATTERN', None, pattern)

In [53]:
# Call the matcher on the text
matches = matcher(doc)

In [54]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


### Using operators and quantifiers

* Operators and quantifiers let you define how often a token should be matched
* They can be added using the "OP" key

In [55]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

In [56]:
doc = nlp("I bought a smartphone. Now I'm buying apps.")

In [57]:
matcher.add('OPS_QUANT_PATTERN', None, pattern)

In [58]:
# Call the matcher on the text
matches = matcher(doc)

In [59]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


| Example | Description |
|---------|-------------|
|{'OP': '!'}|Negation: match 0 times|
|{'OP': '?'}|Optional: match 0 or 1 times|
|{'OP': '+'}|Match 1 or more times|
|{'OP': '*'}|Match 0 or more times|
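
As a rough illustration of the "+" operator from the table, the sketch below matches one or more repetitions of a token. It assumes a blank English pipeline and the list-of-patterns form of matcher.add; the sentence and the pattern name are invented for the example.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# One or more "very" tokens followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern])

doc = nlp("The food was very very good")
texts = [doc[start:end].text for _, start, end in matcher(doc)]

# Overlapping candidates may all be returned, so pick the longest one
longest = max(texts, key=len)
print(longest)
```

The matcher can report shorter overlapping spans as well, which is why the sketch selects the longest match explicitly instead of relying on the order of the results.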

## Using the Matcher

* Import the Matcher from spacy.matcher
* Initialize it with the nlp object’s shared vocab
* Create a pattern that matches the 'TEXT' values of two tokens: "iPhone" and "X"
* Use the matcher.add method to add the pattern to the matcher
* Call the matcher on the doc and store the result in the variable matches
* Iterate over the matches and get the matched span from the start to the end index

In [61]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT' : 'iPhone'}, {'TEXT' : 'X'}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


## Writing match patterns

### Part 1

* Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”

In [62]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": 'iOS'}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


### Part 2

* Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).

In [63]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": 'download'}, {"POS": 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


### Part 3

* Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).

In [66]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": 'ADJ'}, {"POS": 'NOUN'}, {"POS": 'NOUN', "OP": '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
