## Introduction

So far we've seen how text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific tokens and phrases that match patterns we can define ourselves. 

## Rule-based Matching

spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for; they also give you access to the tokens within the document and their relationships. This means you can easily access and analyse the surrounding tokens, merge spans into single tokens, or add entries to the named entities in `doc.ents`.

spaCy offers a **rule-matching tool** called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. 

We can match on any part of the token, including its text and annotations, and we can add multiple patterns to the same matcher.
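As a rough sketch of the last two uses mentioned above, the snippet below takes one match and (1) adds it to `doc.ents`, then (2) merges it into a single token. It assumes a blank English pipeline (`spacy.blank("en")`) so that no model download is needed, the list-style `matcher.add` signature of spaCy v3 (also accepted from v2.2.2), and a made-up entity label `TERM`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only) is enough for this sketch
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Assumption: list-style add() as in spaCy v3
matcher.add("StopWord", [[{"LOWER": "stop"}, {"LOWER": "word"}]])

doc = nlp("Each stop word can be filtered.")
match_id, start, end = matcher(doc)[0]

# (1) Register the matched span as an entity; "TERM" is a hypothetical label
doc.ents = list(doc.ents) + [Span(doc, start, end, label="TERM")]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# (2) Merge the matched span into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[start:end])
tokens_after_merge = [token.text for token in doc]
print(tokens_after_merge)
```

Both steps work on the token indices returned by the matcher, which is why a match is reported as positions rather than as plain text.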

In [5]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

## Creating a token pattern

For this example, I want to find three ways of writing **stop word**:

(a) a single token whose lowercase text is **stopword**<br>
(b) the tokens **stop** and **word** separated by a punctuation token (one whose `is_punct` flag is `True`), e.g. **stop-word**<br>
(c) the two tokens **stop** and **word** separated only by a space, e.g. **stop word**<br>
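To see why the three forms need patterns of different lengths, it can help to look at how each one is tokenised. This is a small sketch that assumes a blank English pipeline (`spacy.blank("en")`), which uses the same English tokenizer rules without loading a trained model.

```python
import spacy

# Assumption: a blank English pipeline, tokenizer only
nlp = spacy.blank("en")

tokenised = {}
for text in ["stopword", "stop-word", "stop word"]:
    tokenised[text] = [token.text for token in nlp(text)]
    print(f"{text!r:12} -> {tokenised[text]}")

# The hyphen becomes its own token and is flagged as punctuation
hyphen_is_punct = nlp("stop-word")[1].is_punct
print(hyphen_is_punct)
```

The hyphenated form yields three tokens and the spaced form two, so each needs its own pattern.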

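First we import the `Matcher` class and create a matcher that shares the pipeline's vocabulary.

In [6]:
# Import the Matcher class and bind a new matcher to the pipeline's vocabulary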
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

Then we create each pattern. There are several token attributes we can use. These are shown below.



<table>
<thead><tr><th>Attribute</th><th>Type</th><th>Description</th></tr></thead>
<tbody>
<tr><td><code>ORTH</code></td><td>unicode</td><td>The exact verbatim text of a token.</td></tr>
<tr><td><code>LOWER</code></td><td>unicode</td><td>The lowercase form of the token text.</td></tr>
<tr><td><code>LENGTH</code></td><td>int</td><td>The length of the token text.</td></tr>
<tr><td><code>IS_ALPHA</code>, <code>IS_ASCII</code>, <code>IS_DIGIT</code></td><td>bool</td><td>Token text consists of alphabetic characters, ASCII characters, digits.</td></tr>
<tr><td><code>IS_LOWER</code>, <code>IS_UPPER</code>, <code>IS_TITLE</code></td><td>bool</td><td>Token text is in lowercase, uppercase, titlecase.</td></tr>
<tr><td><code>IS_PUNCT</code>, <code>IS_SPACE</code>, <code>IS_STOP</code></td><td>bool</td><td>Token is punctuation, whitespace, stop word.</td></tr>
<tr><td><code>LIKE_NUM</code>, <code>LIKE_URL</code>, <code>LIKE_EMAIL</code></td><td>bool</td><td>Token text resembles a number, URL, email.</td></tr>
<tr><td><code>POS</code>, <code>TAG</code>, <code>DEP</code>, <code>LEMMA</code>, <code>SHAPE</code></td><td>unicode</td><td>The token’s simple and extended part-of-speech tag, dependency label, lemma, shape.</td></tr>
<tr><td><code>ENT_TYPE</code></td><td>unicode</td><td>The token’s entity label.</td></tr>
</tbody>
</table>

Here are the token patterns for the three forms of **stop word** described above, each in a singular and a plural version. Note that we don't need a pattern entry for the single space between **stop** and **word**: spaCy does not create a token for it.

It doesn't matter if the attribute names are upper or lowercase. spaCy will normalise the names internally and `{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result. Using the uppercase version is mostly a convention to make it clear that the attributes are **special** and don’t exactly map to the token attributes like `Token.lower` and `Token.lower_`.

In [7]:
# match for "stopword"
token_match1 = [{"LOWER": "stopword"}]
# match for "stopwords"
token_match2 = [{"LOWER": "stopwords"}]
# match for stop-word
token_match3 = [{"LOWER": "stop"}, {"IS_PUNCT": True}, {"LOWER": "word"}]
# match for stop-words
token_match4 = [{"LOWER": "stop"}, {"IS_PUNCT": True}, {"LOWER": "words"}]
# match for "stop word". We don't need to check for a single space as it is not tokenised
token_match5 = [{"LOWER": "stop"}, {"LOWER": "word"}]
# match for "stop words"
token_match6 = [{"LOWER": "stop"}, {"LOWER": "words"}]

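Then we call `matcher.add` to register all six patterns under the key `StopWord`. The second argument lets you pass in an optional callback function to invoke on a successful match. For now, we set it to `None`. (This is the spaCy v2 call signature. In spaCy v3 the patterns are passed as a single list, e.g. `matcher.add("StopWord", [token_match1, token_match2])`, and the callback becomes the keyword argument `on_match`.)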

In [8]:
matcher.add("StopWord", None, token_match1, token_match2, token_match3, token_match4, token_match5, token_match6)
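The callback mentioned above can be sketched as follows. This is only an illustration: the function name `record_match` and the list `seen` are made up, the pipeline is a blank English one, and the call uses the list-style `add` signature with the `on_match` keyword (spaCy v3, also accepted from v2.2.2).

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline, tokenizer only
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

seen = []  # hypothetical store for the text of every match

def record_match(matcher, doc, i, matches):
    """Called once per match; i is the index of the current match in matches."""
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)

# Assumption: list-style add() with the on_match keyword
matcher.add("StopWord", [[{"LOWER": "stop"}, {"LOWER": "word"}]], on_match=record_match)

doc = nlp("Each stop word can be filtered.")
matcher(doc)  # running the matcher triggers the callback
print(seen)
```

A callback is handy when the reaction to a match (logging, labelling, counting) should live next to the pattern rather than in the code that calls the matcher.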

## Applying the matcher to a doc object


In [11]:
# Read the sample text; the context manager closes the file for us
with open("Stop words.txt") as text_file:
    sentence = text_file.read()
doc_object = nlp(sentence)

In [12]:
print(doc_object)

Words like "a" and "the" are called stop---words.
Sometimes this can be written as stop-words or stopwords.
Each stop word can be filtered from the text to be processed.
spaCy holds a built-in list of some 305 English stop--words.


In [13]:
token_matches = matcher(doc_object)

In [14]:
# Each item is a (match_id, start, end) tuple, not a token
for match in token_matches:
    print(match)

(17470060577089942448, 11, 14)
(17470060577089942448, 22, 25)
(17470060577089942448, 26, 27)
(17470060577089942448, 30, 32)
(17470060577089942448, 54, 57)
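The first element of each tuple is the hash of the pattern key, which is stored in the vocabulary's string store. The sketch below, which assumes a blank English pipeline, shows the lookup in both directions.

```python
import spacy

# Assumption: a blank English pipeline is enough to demonstrate the string store
nlp = spacy.blank("en")

# Adding a string to the StringStore returns its hash value
key_hash = nlp.vocab.strings.add("StopWord")

# The store can be indexed by hash (returns the string) or by string (returns the hash)
key_name = nlp.vocab.strings[key_hash]
same_hash = nlp.vocab.strings["StopWord"]
print(key_hash, key_name)
```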


Let's create a function that accepts a string and displays its matches in a structured, column-aligned format.

In [15]:
def find_matches(text):
    # convert text to a doc object
    doc_object = nlp(text)
    print(doc_object)
    # find all matches within the doc object
    token_matches = matcher(doc_object)
    # Each match is a (match_id, start, end) tuple:
    # match_id is the hash of the pattern key; start and end are token indices
    for match_id, start, end in token_matches:
        string_id = nlp.vocab.strings[match_id]
        matched_span = doc_object[start:end]
        print(f"{match_id:<20} {string_id:<15} {start:3} {end:3} {matched_span.text:20}")

Now I'll pass the text from the earlier example to the function.

In [18]:
find_matches(sentence)

Words like "a" and "the" are called stop---words.
Sometimes this can be written as stop-words or stopwords.
Each stop word can be filtered from the text to be processed.
spaCy holds a built-in list of some 305 English stop--words.
17470060577089942448 StopWord         11  14 stop---words        
17470060577089942448 StopWord         22  25 stop-words          
17470060577089942448 StopWord         26  27 stopwords           
17470060577089942448 StopWord         30  32 stop word           
17470060577089942448 StopWord         54  57 stop--words         


### Setting pattern options and quantifiers

The following quantifiers can be passed to the `'OP'` key:
<table>
<tr><th>OP</th><th>Description</th></tr>
<tr><td><code>!</code></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr><td><code>?</code></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr><td><code>+</code></td><td>Require the pattern to match 1 or more times</td></tr>
<tr><td><code>*</code></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
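As a small illustration of the table, the sketch below uses the `?` quantifier so that a single pattern covers both the spaced and the hyphenated plural. It assumes a blank English pipeline and the list-style `add` signature of spaCy v3 (also accepted from v2.2.2); the example sentence is made up.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline, tokenizer only
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The punctuation token may be absent or appear once
pattern = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "words"}]
matcher.add("StopWord", [pattern])  # assumption: list-style add()

doc = nlp("Some write stop words, others write stop-words.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```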

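Passing `'OP': '*'` makes a token rule optional and repeatable: the punctuation token may now appear zero or more times, so one pattern covers both the hyphenated and the space-separated spellings.

This lets us streamline our patterns list. Note that patterns 3 and 4 now also match the space-separated forms, which makes patterns 5 and 6 redundant. They are kept here, and as a result **stop word** is reported twice in the output below: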

In [19]:
# Remove the old patterns to avoid duplicates
matcher.remove("StopWord")

# Redefine the patterns:
token_match1 = [{"LOWER": "stopword"}]
token_match2 = [{"LOWER": "stopwords"}]
token_match3 = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP":"*"}, {"LOWER": "word"}]
token_match4 = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP":"*"}, {"LOWER": "words"}]
token_match5 = [{"LOWER": "stop"}, {"LOWER": "word"}]
token_match6 = [{"LOWER": "stop"}, {"LOWER": "words"}]

# Add the new set of patterns to the matcher under the key 'StopWord':
matcher.add("StopWord", None, token_match1, token_match2, token_match3, token_match4, token_match5, token_match6)

In [20]:
with open("Stop words.txt") as text_file:
    sentence = text_file.read()
#my_text = "Words like \"a\" and \"the\" are called stop---words.\
#Sometimes this can be written as stop-words or stopwords.\
#Each stop word can be filtered from the text to be processed.\
#spaCy holds a built-in list of some 305 English stop--words."

find_matches(sentence)

Words like "a" and "the" are called stop---words.
Sometimes this can be written as stop-words or stopwords.
Each stop word can be filtered from the text to be processed.
spaCy holds a built-in list of some 305 English stop--words.
17470060577089942448 StopWord         11  14 stop---words        
17470060577089942448 StopWord         22  25 stop-words          
17470060577089942448 StopWord         26  27 stopwords           
17470060577089942448 StopWord         30  32 stop word           
17470060577089942448 StopWord         30  32 stop word           
17470060577089942448 StopWord         54  57 stop--words         


## Be careful with lemmatisation searching
If we wanted to match on the words **petrol power** and **petrol powered**, it might be tempting to look for the **lemma** of **powered** and expect it to be **power**. Then we could potentially pick that up with a **lemmatisation** match. This is not always the case though. The lemma of the adjective **powered** is still **powered**.

Let's look at an example of this problem.

First I'll create an exemplar sentence and show the lemmas from it.

In [13]:
doc_object = nlp("Petrol-powered energy runs petrol-powered cars.")

# Let's look at the lemma and part of speech of each token
for word in doc_object:
    print(word.text + "\t" + " -----> " + word.lemma_ + "\t" + word.pos_)

Petrol	 -----> petrol	PROPN
-	 -----> -	PUNCT
powered	 -----> power	VERB
energy	 -----> energy	NOUN
runs	 -----> run	VERB
petrol	 -----> petrol	NOUN
-	 -----> -	PUNCT
powered	 -----> powered	ADJ
cars	 -----> car	NOUN
.	 -----> .	PUNCT


The second **powered** is tagged as an adjective, and its lemma stays **powered** rather than reducing to **power**, so a pattern that matches on the lemma **power** will miss it. The following example will therefore not work as expected.

In [14]:
token_match1 = [{'LOWER': 'petrolpower'}]
token_match2 = [{'LOWER': 'petrol'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}]

# Add the new set of patterns to the matcher under the key 'PetrolPower':
matcher.add('PetrolPower', None, token_match1, token_match2)

In [15]:
found_matches = matcher(doc_object)
print(found_matches)

[(15516410614135709684, 0, 3)]


Only the first occurrence of **petrol-powered** is recognised. The lemma of the second occurrence is not reduced to **power**, so it is not matched.
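One possible way around this, offered here only as a sketch and not as the approach used in this notebook, is to avoid the lemma altogether and list the accepted surface forms with the `IN` operator on `LOWER`. The snippet assumes a blank English pipeline and the list-style `add` signature of spaCy v3 (also accepted from v2.2.2).

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline; no tagger or lemmatizer is needed here
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match on the lowercase text instead of the lemma
pattern = [
    {"LOWER": "petrol"},
    {"IS_PUNCT": True, "OP": "*"},
    {"LOWER": {"IN": ["power", "powered"]}},
]
matcher.add("PetrolPower", [pattern])  # assumption: list-style add()

doc = nlp("Petrol-powered energy runs petrol-powered cars.")
spans = [(start, end) for _, start, end in matcher(doc)]
print(spans)
```

The trade-off is that every surface form has to be listed by hand, whereas a lemma match generalises automatically when the lemmatizer cooperates.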

## Phrase Matcher
In token-based matching we used token patterns to perform rule-based matching.

An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in the list into a `Doc` object and add those to a `PhraseMatcher`, instead of writing token patterns by hand.

In [16]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

The example text is taken from https://en.wikipedia.org/wiki/Natural_language_processing

It is also available on Blackboard under the file name **NLP.txt**.

Before opening the text file, make sure it is in the folder that the path below points to, or adjust the path to match where you saved it relative to your Jupyter notebook file. You should be able to use tab completion to pick up the name of your file.

In [17]:
with open("../Jupyter notebook files/NLP.txt", encoding="utf8") as my_file:
    doc_object = nlp(my_file.read())

Now I would like to match on some words within the text file I've just imported. 

I've created a list of match phrases I would like to check my imported text for.

In [18]:
phrase_list = ["natural language processing", "machine learning", "supervised learning", "machine translation"]



Next I will convert each of these phrases into a structure the matcher can use: a `Doc` object. `nlp.make_doc` only runs the tokenizer, which is all `PhraseMatcher` needs here and is faster than running the full pipeline.

In [24]:
# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp.make_doc(phrase) for phrase in phrase_list]

Let's have a look at these phrase patterns.

In [26]:
# Show these phrase patterns
print(phrase_patterns)

[natural language processing, machine learning, supervised learning, machine translation]


Now I'll add these phrase patterns to the matcher under the key **NLP**.

In [27]:
# Pass each Doc object into the matcher. The asterisk unpacks the list so that
# every Doc in phrase_patterns is passed as a separate match pattern.
matcher.add("NLP", None, *phrase_patterns)

Finally I'll build a list of relevant matches and put the results into a variable called **matches**.

In [29]:
# Build a list of matches:
matches = matcher(doc_object)

Let's have a look at the contents of the found matches. Each match contains the `match_id` and the `start` and `end` token positions of the match within the document.

In [30]:
matches

[(15832915187156881108, 3, 6),
 (15832915187156881108, 85, 87),
 (15832915187156881108, 127, 129),
 (15832915187156881108, 137, 139),
 (15832915187156881108, 150, 152),
 (15832915187156881108, 160, 163),
 (15832915187156881108, 368, 371),
 (15832915187156881108, 396, 399),
 (15832915187156881108, 403, 405),
 (15832915187156881108, 471, 473),
 (15832915187156881108, 512, 515),
 (15832915187156881108, 622, 624),
 (15832915187156881108, 762, 764),
 (15832915187156881108, 807, 809),
 (15832915187156881108, 890, 892),
 (15832915187156881108, 896, 899),
 (15832915187156881108, 1037, 1040),
 (15832915187156881108, 1047, 1049),
 (15832915187156881108, 1062, 1064),
 (15832915187156881108, 1091, 1093)]

We can show each match using a loop I created earlier in this document. 

In [20]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc_object[start:end]
    print(match_id, "\t", string_id, "\t", start, "\t", end, "\t", span.text)


15832915187156881108 	 NLP 	 3 	 6 	 natural language processing
15832915187156881108 	 NLP 	 85 	 87 	 machine translation
15832915187156881108 	 NLP 	 127 	 129 	 machine translation
15832915187156881108 	 NLP 	 137 	 139 	 machine translation
15832915187156881108 	 NLP 	 150 	 152 	 machine translation
15832915187156881108 	 NLP 	 160 	 163 	 natural language processing
15832915187156881108 	 NLP 	 368 	 371 	 natural language processing
15832915187156881108 	 NLP 	 396 	 399 	 natural language processing
15832915187156881108 	 NLP 	 403 	 405 	 machine learning
15832915187156881108 	 NLP 	 471 	 473 	 machine learning
15832915187156881108 	 NLP 	 512 	 515 	 natural language processing
15832915187156881108 	 NLP 	 622 	 624 	 machine translation
15832915187156881108 	 NLP 	 762 	 764 	 supervised learning
15832915187156881108 	 NLP 	 807 	 809 	 supervised learning
15832915187156881108 	 NLP 	 890 	 892 	 machine learning
15832915187156881108 	 NLP 	 896 	 899 	 natural language pr
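By default `PhraseMatcher` compares the verbatim token text, so a capitalised occurrence such as a phrase at the start of a sentence is not found by lowercase patterns. The sketch below contrasts the default with `attr="LOWER"`. It assumes a blank English pipeline, the list-style `add` signature of spaCy v3 (also accepted from v2.2.2), and a made-up example sentence.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline, tokenizer only
nlp = spacy.blank("en")
phrases = ["machine learning", "machine translation"]
patterns = [nlp.make_doc(phrase) for phrase in phrases]

exact_matcher = PhraseMatcher(nlp.vocab)                # default: verbatim text
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercase forms
exact_matcher.add("NLP", patterns)  # assumption: list-style add()
lower_matcher.add("NLP", patterns)

doc = nlp("Machine learning and machine translation are related.")
exact = [doc[start:end].text for _, start, end in exact_matcher(doc)]
lower = [doc[start:end].text for _, start, end in lower_matcher(doc)]
print(exact)
print(lower)
```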

## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc object that is wider than the match.

For example, the first **machine translation** match covers tokens 85 and 86 (the span `85:87`). I can view its context by taking a few tokens on either side of that span.

In [21]:
# Allowing a few words either side of the match
doc_object[80:93]

three or five years, machine translation would be a solved problem.[2]

We could use the loop I created earlier to capture some text on either side of the matched phrase.

In [62]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    # Clamp the left edge so a match near the start cannot produce a negative index
    span = doc_object[max(start - 3, 0):end + 3]
    print(string_id, "\t", start, "\t", end, "\t", span.text)

NLP 	 3 	 6 	 The history of natural language processing (NLP)
NLP 	 85 	 87 	 five years, machine translation would be a
NLP 	 127 	 129 	 , funding for machine translation was dramatically reduced
NLP 	 137 	 139 	 further research in machine translation was conducted until
NLP 	 150 	 152 	 the first statistical machine translation systems were developed
NLP 	 160 	 163 	 Some notably successful natural language processing systems developed in
NLP 	 368 	 371 	 1980s, most natural language processing systems were based
NLP 	 396 	 399 	 a revolution in natural language processing with the introduction
NLP 	 403 	 405 	 the introduction of machine learning algorithms for language
NLP 	 471 	 473 	 earliest-used machine learning algorithms, such
NLP 	 512 	 515 	 Markov models to natural language processing, and increasingly
NLP 	 622 	 624 	 the field of machine translation, due especially
NLP 	 762 	 764 	 and semi-supervised learning algorithms. Such
NLP 	 807 	 809 	 more difficul
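The context window can also be wrapped in a small helper that clamps both edges to the document. This is a sketch only: `context_window` and its `width` parameter are made-up names, and the pipeline is a blank English one.

```python
import spacy

def context_window(doc, start, end, width=3):
    """Return the span from `width` tokens before start to `width` tokens after end."""
    left = max(start - width, 0)          # never go below the first token
    right = min(end + width, len(doc))    # never go past the last token
    return doc[left:right]

# Assumption: a blank English pipeline, tokenizer only
nlp = spacy.blank("en")
doc = nlp("machine translation would be a solved problem")

# The match sits at the very start, so there is nothing to show on the left
window_text = context_window(doc, 0, 2).text
print(window_text)
```

Without the clamp, `doc[start - 3:end + 3]` would become `doc[-3:5]` here, and a negative index counts from the end of the document rather than stopping at the first token.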

Another way is to use the sentence boundaries available in `doc_object.sents`, then iterate through the sentences until we reach the one that contains the match:

In [58]:
# Build a list of sentences
sentences = list(doc_object.sents)

# Each sentence is a Span with start and end token indices
# for example, here are the start and end values of the first sentence
print(sentences[0].start, sentences[0].end)

0 24


In [54]:
# Iterate over the sentence list until a sentence end value exceeds the match end value:
for sent in sentences:
    # matches[2][2] is the third element (the end index, 129) of the third match
    # sent.end is the token index at which the sentence ends
    if matches[2][2] < sent.end:
        print(sent, sent.start, sent.end, matches[2][2])
        break

However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. 93 133 129
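A shorter route to the same result may be the `sent` attribute of a span, which returns the sentence the span belongs to. The sketch below assumes spaCy v3 (where `nlp.add_pipe("sentencizer")` takes the component name as a string), a blank English pipeline with the rule-based sentencizer, and a made-up three-sentence text.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: spaCy v3; the sentencizer splits on sentence-final punctuation
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = PhraseMatcher(nlp.vocab)
matcher.add("NLP", [nlp.make_doc("machine translation")])  # assumption: list-style add()

doc = nlp("Progress was slow. Funding for machine translation was reduced. Research continued.")

# Span.sent gives the sentence that contains the matched span
match_sentences = [doc[start:end].sent.text for _, start, end in matcher(doc)]
print(match_sentences)
```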
