In [2]:
# Identify Unique Tokens

# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# The Doc object that holds the processed text is our focus here.
print(type(doc))
# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

<class 'spacy.tokens.doc.Doc'>
Tesla PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


### Pipeline

When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data. Image source: https://spacy.io/usage/spacy-101#pipelines

<img src="https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg" width="600">

We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed.

In [3]:
# Print the pipelines object

print(nlp.pipeline)
print(nlp.pipe_names)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x0000023200E864A8>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x00000232011EAE88>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x00000232011EAEE8>)]
['tagger', 'parser', 'ner']


### Tokenization

The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information.

In [4]:
# Tokenisation

doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [5]:
# Even though doc2 contains processed information about each token, it also retains the original text

print(doc2)
print(doc2[0])

Tesla isn't   looking into startups anymore.
Tesla


### Part-of-Speech Tagging (POS)

The next step after splitting the text up into tokens is to assign parts of speech. In the above example, Tesla was recognized to be a proper noun. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.
For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [13]:
# Print each token's text, POS, dependency, fine-grained tag, lemma, shape (case and character pattern),
# whether it is alphabetic, and whether it is on the stop list

for token in doc2:
    print(token.text, token.pos_, token.dep_, token.tag_, token.lemma_, token.shape_, token.is_alpha, token.is_stop)

Tesla PROPN nsubj NNP Tesla Xxxxx True False
is VERB aux VBZ be xx True True
n't ADV neg RB not x'x False True
   SPACE  _SP       False False
looking VERB ROOT VBG look xxxx True False
into ADP prep IN into xxxx True True
startups NOUN pobj NNS startup xxxx True False
anymore ADV advmod RB anymore xxxx True False
. PUNCT punct . . . False False
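The shape column in the output above can look cryptic at first. The sketch below is a pure-Python approximation of how such a shape string can be derived: letters become `X` or `x`, digits become `d`, other characters are kept, and runs longer than four identical symbols are cut off. The function name `word_shape` and the `max_run` parameter are our own choices for illustration, and spaCy's internal implementation may differ in details.

```python
def word_shape(text, max_run=4):
    """Approximate a shape string such as 'Xxxxx' for a token's text."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and whitespace are kept as they are
        # Count how many times the same symbol has repeated in a row
        run = run + 1 if sym == last else 1
        last = sym
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)


for word in ["Tesla", "n't", "looking", "10.50"]:
    print(word, "-->", word_shape(word))
```

For the tokens printed above, this reproduces `Xxxxx` for "Tesla", `x'x` for "n't" and `xxxx` for "looking".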


### Dependencies

We also looked at the syntactic dependencies assigned to each token. Tesla is identified as an nsubj or the nominal subject of the sentence.
For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing. 
A good explanation of typed dependencies can be found here

In [10]:
# You can see the full name of a tag eg: PROPN, using spacy.explain(TAG)

print(spacy.explain('PROPN'))
print(spacy.explain('nsubj'))
print(spacy.explain('ADP'))
print(spacy.explain('advmod'))

proper noun
nominal subject
adposition
adverbial modifier


### Sentences

Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [14]:
# Sentence Split

doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc4.sents:
    print(sent)

# Check whether the token at index 6 (the seventh token) starts a sentence
print(doc4[6].is_sent_start)

This is the first sentence.
This is another sentence.
This is the last sentence.
True


# A Brief Look at Each Sub-Topic

## 1. Tokenisation

The first step in creating a Doc object is to break down the incoming text into component pieces or "tokens".

In [15]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

"We're moving to L.A.!"
" | We | 're | moving | to | L.A. | ! | " | 

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`

Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.
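To make the prefix, suffix, infix and exception rules listed above more concrete, here is a toy tokenizer in plain Python. It is only a sketch: the rule sets `PREFIXES`, `SUFFIXES`, `INFIX_RE` and `EXCEPTIONS` are tiny hypothetical stand-ins that we made up for this example, and spaCy's real tokenizer uses far larger, language-specific rules.

```python
import re

# Hypothetical, heavily reduced rule sets (not spaCy's actual rules)
PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", "!", ",", ".")
INFIX_RE = re.compile(r"(-)")
EXCEPTIONS = {"We're": ["We", "'re"], "L.A.": ["L.A."], "U.S.": ["U.S."]}


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation off both ends until nothing more can be removed
        # or the remaining chunk is a known exception
        while chunk and chunk not in EXCEPTIONS:
            if chunk.startswith(PREFIXES):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk in EXCEPTIONS:
            tokens.extend(EXCEPTIONS[chunk])
        elif chunk:
            # Split on infixes, keeping the infix itself as a token
            tokens.extend(part for part in INFIX_RE.split(chunk) if part)
        tokens.extend(suffixes)
    return tokens


print(" | ".join(toy_tokenize('"We\'re moving to L.A.!"')))
```

Checking for exceptions before stripping suffixes is what keeps the final period attached to `L.A.` while the `!` and the closing quotation mark are split off.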

### Prefixes, Suffixes and Infixes

spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [17]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!, and  \
           it will cost you 10.50$")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!
,
and
            
it
will
cost
you
10.50
$


<font color=green>Note that the exclamation points, the comma, the hyphen in 'snail-mail' and the trailing '$' are assigned their own tokens, yet the email address, the website and the decimal value 10.50 are each preserved as a single token.</font>

In [18]:
# Counting Tokens
print(len(doc))

# Counting Vocab entries - Vocab objects contain a full library of items!
print(len(doc.vocab))

8
519


In [19]:
# Tokens can be retrieved by index position and slice

print(doc[2])

print(doc[2:5])

print(doc[-4:])

're
're moving to
to L.A.!"


Tokens cannot be reassigned. <br>
Although Doc objects can be considered lists of tokens, they do not support item reassignment

### Named Entities

Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [20]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')

# Check for Entities in the Tokens
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


Note how two tokens combine to form the entity Hong Kong, and three tokens combine to form the monetary entity: $6 million

In [21]:
# Check the number of named entities in a Doc object

len(doc8.ents)

3

Named Entity Recognition (NER) is an important machine learning tool applied to Natural Language Processing.
We'll do a lot more with it in an upcoming section. For more info on named entities visit https://spacy.io/usage/linguistic-features#named-entities

### Noun Chunks

Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the Doc object. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in Sheb Wooley's 1958 song, a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

We'll look at additional noun_chunks components besides .text in an upcoming section.
For more info on noun_chunks visit https://spacy.io/usage/linguistic-features#noun-chunks

In [23]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

print('-'*20)
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers
--------------------
He
a one-eyed, one-horned, flying, purple people-eater


### Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.
For more info visit https://spacy.io/usage/visualizers

In [26]:
# Dependency Parser Visualisation

from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')

# 'distance' argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

In [28]:
# Visualizing the entity recognizer

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

You can also serve the visualization to a browser using:

displacy.serve(doc, style='dep')

## 2. Stemming

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". 

Here, "boat" would be the stem for [boat, boater, boating, boats].
Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. 

In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision here. 

Instead, we'll use another popular NLP tool called nltk, which stands for Natural Language Toolkit. For more information on nltk visit https://www.nltk.org/

### Porter Stemmer

One of the most common - and effective - stemming tools is Porter's Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined. From a given set of stemming rules only one rule is applied, based on the longest suffix S1. More sophisticated phases consider the length/complexity of the word before applying a rule.
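To illustrate the "longest suffix wins" idea, the sketch below implements only Step 1a of Porter's published rules (SSES → SS, IES → I, SS → SS, S → nothing). It is not nltk's implementation and covers a small fraction of the full algorithm; the names `STEP_1A_RULES` and `porter_step_1a` are ours.

```python
# Step 1a of Porter's algorithm as (suffix, replacement) pairs
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    # Collect every rule whose suffix matches, then apply only the longest one
    matching = [(suffix, repl) for suffix, repl in STEP_1A_RULES if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for word in ["caresses", "ponies", "caress", "cats", "run"]:
    print(word + " --> " + porter_step_1a(word))
```

Note how "caress" matches both the SS and the S rule. Because the longer suffix wins, the word is left intact instead of losing its final "s".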

In [30]:
# Import the Porter stemmer from nltk
from nltk.stem.porter import PorterStemmer

# Initialize the Porter stemmer

p_stemmer = PorterStemmer()

words = ['run','runner','running','ran','runs','easily','fairly','generous','generation','generously','generate']

# Stem the Words

for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli
generous --> gener
generation --> gener
generously --> gener
generate --> gener


### Snowball Stemmer

This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since nltk uses the name SnowballStemmer, we'll use it here.

In [31]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

# Words
words = ['run','runner','running','ran','runs','easily','fairly','generous','generation','generously','generate']

# Stem
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
generous --> generous
generation --> generat
generously --> generous
generate --> generat


In [33]:
# Testing for a Sentence

phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word +' --> '+ p_stemmer.stem(word) +'<--->'+ s_stemmer.stem(word))
    
# Here the word "meeting" appears twice - once as a verb, and once as a noun, and yet the stemmer treats both equally.

I --> I<--->i
am --> am<--->am
meeting --> meet<--->meet
him --> him<--->him
tomorrow --> tomorrow<--->tomorrow
at --> at<--->at
the --> the<--->the
meeting --> meet<--->meet


## 3. Lemmatization

In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
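As a rough illustration of why the part of speech matters, here is a toy lookup lemmatizer in plain Python. The table `LEMMA_TABLE` and the function `toy_lemma` are hypothetical and hold just the examples mentioned above; spaCy's lemmatizer is considerably more involved, so treat this only as a sketch of the idea.

```python
# Hypothetical lookup table keyed on (lowercase word, part of speech)
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    # Fall back to the lowercased word when the table has no entry
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemma("meeting", "VERB"))
print(toy_lemma("meeting", "NOUN"))
```

Unlike the stemmers in the previous section, which reduce both uses of "meeting" to "meet", the lookup returns a different base form for the verb and the noun.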

In [35]:
# Performing Lemmatisation

doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

# Note that different forms of the same word (running, run, ran) share one lemma and therefore one hash value
for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 ADP 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 ADP 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


In [36]:
# Writing Function to display lemmas More Easily

def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
        
# Here we're using an f-string to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.
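The format specifiers used in `show_lemmas` are easier to follow in isolation. In the sketch below, `format_row` is a stand-alone helper of our own that takes plain values instead of a spaCy token. Strings are left-aligned by default, whereas integers are right-aligned, which is why the hash value needs the explicit `<`.

```python
def format_row(text, pos, lemma_hash, lemma):
    # {value:{width}} pads to a minimum width; '<' forces left alignment
    return f"{text:{12}} {pos:{6}} {lemma_hash:<{22}} {lemma}"


print(format_row("mice", "NOUN", 1384165645700560590, "mouse"))

# Default alignment differs between integers and strings
print(repr(f"{42:{5}}"))    # right-aligned by default
print(repr(f"{42:<{5}}"))   # left-aligned on request
print(repr(f"{'ab':{5}}"))  # strings are left-aligned by default
```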

In [37]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


Although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. 

## 4. Stop Words

Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of a little over 300 English stop words (312 in the version loaded here).
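Filtering itself is just a membership test. The sketch below uses a tiny hand-picked set, `TOY_STOP_WORDS`, that we invented for illustration rather than spaCy's default list, and a hypothetical helper `remove_stop_words` that works on a plain list of strings.

```python
# A tiny hand-picked stop list for illustration; spaCy's default set is much larger
TOY_STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Compare in lowercase so that "The" and "the" are treated alike
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["The", "cat", "is", "in", "the", "garden"]))
```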

In [39]:
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

# Get the length of Default Stopwords
len(nlp.Defaults.stop_words)

{'enough', 'why', 'besides', 'has', 'really', 'and', 'few', 'anyhow', 'whole', 'empty', 'very', 'among', 'when', 'everything', 'herein', 'please', 'beforehand', 'bottom', 'hereby', 'into', 'my', 'nevertheless', 'at', 'thereupon', 'along', 'last', 'above', 'again', 'had', 'i', 'more', 'least', 'same', 'some', 'via', 'former', "'ll", 'all', 'behind', 'if', 'this', 'seem', 'any', 'eleven', 'nowhere', 'alone', 'a', 'been', 'becoming', 'see', 'much', 'mine', 'he', 'our', 'does', 'thence', 'hereupon', 'ten', 'about', 'yours', 'whereas', 'formerly', 'someone', 'who', 'meanwhile', 'no', 'during', "'m", 'none', 'made', 'before', 'do', 'anything', 'per', 'would', 'yourselves', 'top', 'hundred', 'but', 'nobody', 'around', 'its', 'or', 'perhaps', 'toward', 'nothing', 'within', 'can', 'two', 'each', 'six', 'were', 'such', 'name', 'regarding', 'many', 'it', 'further', 'those', 'there', 'then', 'his', 'indeed', 'she', 'others', 'own', 'is', 'take', 'three', 'various', 'sometime', 'seemed', 'itself', 

312

In [40]:
# Check if a Word is a Stopword

print(nlp.vocab['myself'].is_stop)
print(nlp.vocab['mystery'].is_stop)

True
False


### Adding a stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for "by the way") should be considered a stop word.

Note: **When adding stop words, always use lowercase. Lexemes are converted to lowercase before being added to vocab.**

In [43]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

# Validate
print(len(nlp.Defaults.stop_words)) # Should Show 313
print(nlp.vocab['btw'].is_stop)

313
True


### Removal of a stop word

Alternatively, you may decide that 'beyond' should not be considered a stop word.

In [44]:
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

# Validate
print(len(nlp.Defaults.stop_words)) # Should Show 312
print(nlp.vocab['beyond'].is_stop)

312
False


## 5. Vocabulary and Matching

So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.
Here, we will identify and label specific phrases that match patterns we can define ourselves. 

### Rule-based Matching

spaCy offers a rule-matching tool called Matcher that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [45]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Matcher object pairs to the current Vocab object. We can add and remove specific named matchers to matcher as needed.

### Creating patterns

In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [46]:
# Adding a Pattern

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

matcher.add('SolarPower', None, pattern1, pattern2, pattern3)

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern3` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them into `matcher` with the name 'SolarPower', and set *callbacks* to `None`
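If it helps to see what "matching a list of token rules" means mechanically, the sketch below re-creates the three patterns with a toy matcher over a plain list of strings. Everything here is hypothetical illustration code: `token_matches` and `toy_match` are our own names, only the `LOWER` and `IS_PUNCT` keys are understood, and the sentence is tokenized by hand rather than by spaCy.

```python
import string


def token_matches(token, rule):
    # A rule is a dict such as {'LOWER': 'solar'} or {'IS_PUNCT': True}
    if "LOWER" in rule and token.lower() != rule["LOWER"]:
        return False
    is_punct = bool(token) and all(ch in string.punctuation for ch in token)
    if "IS_PUNCT" in rule and is_punct != rule["IS_PUNCT"]:
        return False
    return True


def toy_match(tokens, patterns):
    found = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            # Every rule must match the token at the same offset
            if len(window) == len(pattern) and all(
                token_matches(tok, rule) for tok, rule in zip(window, pattern)
            ):
                found.append((start, end))
    return found


pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# Hand-tokenized sample sentence (assumed token boundaries)
tokens = ['The', 'Solar', 'Power', 'industry', 'grows', 'as', 'demand', 'for',
          'solarpower', 'increases', '.', 'Solar', '-', 'power', 'cars']
print(toy_match(tokens, [pattern1, pattern2, pattern3]))
```

Each result is a `(start, end)` pair, so `tokens[start:end]` gives back the matched tokens, much like a span sliced from a Doc.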

### Applying the matcher to a Doc object

In [48]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

# Apply the Matcher
found_matches = matcher(doc)
# The matcher returns a list of tuples. Each tuple holds the match ID plus the start and end token indexes of the span doc[start:end]
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


In [50]:
# Get the String Representation of the Matches

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)
    
# The match_id is simply the hash value of the string_ID 'SolarPower'

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


### Setting pattern options and quantifiers

You can make token rules optional by passing an 'OP':'*' argument. This lets us streamline our patterns list:

In [51]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
# The above is used to find both two-word patterns, with and without the hyphen!

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower') # 'SolarPower' is the name we gave the earlier set of patterns

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

# Check for Matches

found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
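To get a feel for how `?`, `+` and `*` change what a pattern accepts, here is a small recursive toy matcher in plain Python. It is a sketch under our own assumptions: the helper names (`is_punct`, `rule_ok`, `match_ends`, `solar_pattern`) are invented, only the `LOWER` and `IS_PUNCT` keys are supported, and the `!` operator is deliberately left out.

```python
import string


def is_punct(token):
    return bool(token) and all(ch in string.punctuation for ch in token)


def rule_ok(token, rule):
    if "LOWER" in rule and token.lower() != rule["LOWER"]:
        return False
    if "IS_PUNCT" in rule and is_punct(token) != rule["IS_PUNCT"]:
        return False
    return True


def match_ends(tokens, pos, pattern):
    """Return every end index at which pattern can finish when started at pos."""
    if not pattern:
        return {pos}
    rule, rest = pattern[0], pattern[1:]
    op = rule.get("OP")
    ends = set()
    if op in ("?", "*"):
        # Zero occurrences allowed: skip this rule entirely
        ends |= match_ends(tokens, pos, rest)
    if pos < len(tokens) and rule_ok(tokens[pos], rule):
        if op in ("+", "*"):
            # Consume one token, then allow zero or more repeats of the same rule
            ends |= match_ends(tokens, pos + 1, [dict(rule, OP="*")] + rest)
        else:
            # No operator or '?': consume exactly one token
            ends |= match_ends(tokens, pos + 1, rest)
    return ends


def solar_pattern(op):
    return [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': op}, {'LOWER': 'power'}]


for op in ("?", "+", "*"):
    print(op,
          match_ends(['Solar', 'Power'], 0, solar_pattern(op)),
          match_ends(['Solar', '-', 'power'], 0, solar_pattern(op)),
          match_ends(['Solar', '-', '-', 'power'], 0, solar_pattern(op)))
```

An empty set means the pattern cannot match from position 0. With `+` the unhyphenated form is rejected, and with `?` the doubled hyphen is rejected, while `*` accepts all three.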


### Be careful with lemmas!

If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the lemma of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the adjective 'powered' is still 'powered'

In [57]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # CHANGE THIS PATTERN

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

# Matcher
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


In [58]:
# It may be better to set explicit token patterns.

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)

# Get Matches
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


#### Other token attributes

Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

#### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`

### PhraseMatcher

In the above, we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we convert each phrase in a list to a Doc object and pass those Doc objects to a PhraseMatcher instead.
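The idea of matching against a terminology list can be sketched without spaCy. In the toy version below, `toy_phrase_match` is a hypothetical helper that splits each phrase on whitespace, compares token text verbatim, and returns `(start, end)` pairs. Tokenizing the phrases once and looking them up in a set is, loosely, what makes list-based matching cheap compared to evaluating many token rules.

```python
def toy_phrase_match(tokens, phrases):
    # Pre-tokenize each phrase into a tuple of words
    wanted = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(phrase) for phrase in wanted})
    found = []
    for start in range(len(tokens)):
        for n in lengths:
            end = start + n
            # Guard against windows that would run past the end of the text
            if end <= len(tokens) and tuple(tokens[start:end]) in wanted:
                found.append((start, end))
    return found


phrase_list = ['data science', 'unsupervised learning', 'machine learning',
               'deep learning', 'analytics']
tokens = 'using data science and analytics in machine learning'.split()
print(toy_phrase_match(tokens, phrase_list))
```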

For this exercise we're going to load a text document containing several data blog articles.

In [60]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [64]:
with open('datablogs.txt', encoding='utf8') as f:
    doc3 = nlp(f.read())

In [67]:
# First, create a list of match phrases:
phrase_list = ['data science', 'unsupervised learning', 'machine learning', 'deep learning', 'analytics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('DataBloggerr', None, *phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

# (match_id, start, end)
print(matches)

[(11288900201468143886, 716, 718), (11288900201468143886, 794, 796), (11288900201468143886, 797, 799), (11288900201468143886, 1384, 1386), (11288900201468143886, 1497, 1499), (11288900201468143886, 1539, 1541), (11288900201468143886, 1576, 1578), (11288900201468143886, 1604, 1606), (11288900201468143886, 1670, 1672), (11288900201468143886, 1814, 1816), (11288900201468143886, 1861, 1863), (11288900201468143886, 2083, 2085), (11288900201468143886, 2087, 2088), (11288900201468143886, 2346, 2348), (11288900201468143886, 2597, 2599), (11288900201468143886, 3069, 3071), (11288900201468143886, 3111, 3113), (11288900201468143886, 3176, 3178), (11288900201468143886, 3226, 3228), (11288900201468143886, 3283, 3285), (11288900201468143886, 3308, 3310), (11288900201468143886, 3387, 3389), (11288900201468143886, 3430, 3432), (11288900201468143886, 3449, 3451), (11288900201468143886, 3484, 3486), (11288900201468143886, 3717, 3719), (11288900201468143886, 3847, 3849), (11288900201468143886, 5706, 5708

#### Viewing Matches

There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match.
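When widening a slice by hand, it is easy to run off the start of the document: a negative start index silently wraps around in Python. The helper below is a hypothetical convenience function of our own, `context_window`, shown on a plain list. The same clamping logic should carry over to slicing a Doc, although that is not demonstrated here.

```python
def context_window(tokens, start, end, pad=1):
    # Clamp the widened slice so it never runs past either end of the token list
    lo = max(0, start - pad)
    hi = min(len(tokens), end + pad)
    return tokens[lo:hi]


tokens = ['using', 'data', 'science', 'and', 'analytics']
print(context_window(tokens, 1, 3))          # one token of context on each side
print(context_window(tokens, 0, 1, pad=2))   # clamped at the start of the list
```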

In [73]:
print(doc3[1496:1500])
print(doc3[3716:3720])

using data science and
in machine learning have


#### Another way is to first apply the sentencizer to the Doc, then iterate through the sentences to the match point

In [76]:
# Build a list of sentences
sents = [sent for sent in doc3.sents]

# Next  we'll see that sentences contain start and end token values:
print(sents[0].start, sents[0].end)

0 9


In [79]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:

# Print the sentence that contains the first match
for sent in sents:
    if matches[0][1] < sent.end:  # matches[0] is the first match, which starts at doc3[716]
        print(sent)
        break
        
# Print the sentence that contains the tenth match
for sent in sents:
    if matches[9][1] < sent.end:  # matches[9] is the tenth match, which starts at doc3[1814]
        print(sent)
        break

Now, let’s marry all this creative thinking with data science to create a COVID-19 At-Risk Score for everyone in the country (or the world…this methodology scales nicely).  
By identifying similar opportunities, shared pain-points across several social change organizations, or an overall theme under which many organizations work, DataKind can target a common data science capacity boost with the aim of generating solutions for multiple partners and potentially sector-wide system change.


For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching