In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
from spacy.matcher import Matcher

In [3]:
matcher = Matcher(nlp.vocab)

In [4]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]
# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

In [5]:
matcher.add('SolarPower',None,pattern1,pattern2,pattern3) # None means no on_match callback

Let's break this down:

pattern1 looks for a single token whose lowercase text reads 'solarpower'

pattern2 looks for three adjacent tokens: 'solar', a middle token that can be any punctuation, then 'power'

pattern3 looks for two adjacent tokens whose lowercase text reads 'solar' and 'power', in that order
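To see why each spelling needs a different number of token dictionaries, it can help to look at how the tokenizer splits the three variants. The sketch below assumes a blank English pipeline (`spacy.blank('en')`, tokenizer only) instead of `en_core_web_sm`; `blank_nlp` and `variants` are illustrative names that do not appear elsewhere in this notebook.

```python
import spacy

# Tokenizer-only pipeline, which is enough to inspect token boundaries.
# Assumption: the loaded model uses the same English tokenizer rules.
blank_nlp = spacy.blank('en')

variants = ['SolarPower', 'Solar-power', 'Solar power']
for text in variants:
    tokens = [token.text for token in blank_nlp(text)]
    print(f'{text!r:15} -> {len(tokens)} token(s): {tokens}')
```

The hyphenated form comes out as three tokens because the hyphen is split off as its own punctuation token, which is what the `IS_PUNCT` entry is there to absorb.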

## Applying the matcher to a Doc object

In [17]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [18]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


The matcher returns a list of tuples. Each tuple holds the match ID, followed by
the start and end token indices, so the matched span is doc[start:end].

In [19]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power
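The match ID is the hash that spaCy's `StringStore` assigns to the pattern name, which is why `nlp.vocab.strings` can turn it back into 'SolarPower'. A minimal sketch of that round trip, assuming a blank English pipeline in place of the loaded model (`blank_nlp` is an illustrative name):

```python
import spacy

blank_nlp = spacy.blank('en')  # assumption: the vocab of any pipeline behaves the same way
strings = blank_nlp.vocab.strings

# Adding a string returns its integer hash; Matcher.add registers the
# pattern name in the same store.
match_id = strings.add('SolarPower')

print(match_id)                           # the integer hash
print(strings[match_id])                  # hash -> string
print(strings['SolarPower'] == match_id)  # string -> hash
```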


## Removing the Patterns

In [10]:
matcher.remove('SolarPower')
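Removing a key drops every pattern registered under it, which can be checked with `in` and `len()`. The sketch below is self-contained and assumes a blank English pipeline; `demo_matcher` is an illustrative name. The version check is there because spaCy v3 changed `Matcher.add` to take the patterns as a single list, whereas the cells in this notebook use the v2 form with `None` as the callback argument.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank('en')
demo_matcher = Matcher(blank_nlp.vocab)
pattern = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

# Pick the add() signature that fits the installed spaCy major version.
if int(spacy.__version__.split('.')[0]) >= 3:
    demo_matcher.add('SolarPower', [pattern])
else:
    demo_matcher.add('SolarPower', None, pattern)

print('SolarPower' in demo_matcher, len(demo_matcher))  # key present before removal
demo_matcher.remove('SolarPower')
print('SolarPower' in demo_matcher, len(demo_matcher))  # key gone after removal
```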

In [20]:
# Create a new set of patterns
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

In [21]:
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

In [22]:
doc2 = nlp(u'SOLAR POWER and not solar power or is it solar---power')

In [23]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 2), (8656102463236116519, 4, 6), (8656102463236116519, 9, 12)]


In [24]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc2[start:end]                   # get the matched span from doc2
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 0 2 SOLAR POWER
8656102463236116519 SolarPower 4 6 solar power
8656102463236116519 SolarPower 9 12 solar---power
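The `'OP': '*'` key in the second pattern is a quantifier: it lets the punctuation token occur zero or more times, so one pattern covers both the spaced and the punctuated spellings. Other operators include `'?'` (zero or one) and `'+'` (one or more). A self-contained sketch, assuming a blank English pipeline and a made-up sentence; `demo_matcher` and `demo_doc` are illustrative names, and the version check handles the `add` signature change between spaCy v2 and v3:

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank('en')
demo_matcher = Matcher(blank_nlp.vocab)

# The middle token may repeat zero or more times.
pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

if int(spacy.__version__.split('.')[0]) >= 3:
    demo_matcher.add('SolarPower', [pattern])
else:
    demo_matcher.add('SolarPower', None, pattern)

# Spaces around the hyphens force each one to be its own token.
demo_doc = blank_nlp('solar power and solar - - power')
matches = sorted(demo_matcher(demo_doc), key=lambda m: m[1])
matched_texts = [demo_doc[start:end].text for _, start, end in matches]
print(matched_texts)
```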


## PhraseMatcher

In [15]:
from spacy.matcher import PhraseMatcher

In [16]:
matcher = PhraseMatcher(nlp.vocab)

In [17]:
with open('reaganomics.txt', encoding='utf8') as f:
    doc3 = nlp(f.read())

In [18]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [19]:
# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

In [25]:
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [20]:
phrase_patterns

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [21]:
# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('EconMatcher', None, *phrase_patterns)

In [22]:
found_matches = matcher(doc3)

In [23]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2985, 2989)]

In [24]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2985 2989 trickle-down economics
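The phrase patterns above were built with `nlp(text)`, which runs the whole pipeline on every phrase. `PhraseMatcher` compares token text by default, so tokenizing alone is enough, and `nlp.make_doc` does just that. A sketch of the lighter variant, assuming a blank English pipeline and a made-up sentence in place of reaganomics.txt; `demo_matcher` and `demo_doc` are illustrative names, and the version check covers the `add` signature difference between spaCy v2 (used in this notebook) and v3:

```python
import spacy
from spacy.matcher import PhraseMatcher

blank_nlp = spacy.blank('en')
phrase_list = ['voodoo economics', 'free-market economics']

# make_doc only tokenizes; with a full model it skips the tagger, parser and NER.
phrase_patterns = [blank_nlp.make_doc(text) for text in phrase_list]

demo_matcher = PhraseMatcher(blank_nlp.vocab)
if int(spacy.__version__.split('.')[0]) >= 3:
    demo_matcher.add('EconMatcher', phrase_patterns)
else:
    demo_matcher.add('EconMatcher', None, *phrase_patterns)

demo_doc = blank_nlp('Critics called it voodoo economics, while advocates preferred free-market economics.')
matched_texts = []
for match_id, start, end in sorted(demo_matcher(demo_doc), key=lambda m: m[1]):
    matched_texts.append(demo_doc[start:end].text)
    print(blank_nlp.vocab.strings[match_id], start, end, demo_doc[start:end].text)
```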


In [26]:
# Widen each span by 10 tokens on either side to show the surrounding context

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    # clamp the window so it never runs past either end of the Doc
    span = doc3[max(start-10, 0):min(end+10, len(doc3))]
    print(match_id, string_id, start, end, span.text)


3680293220734633682 EconMatcher 41 45 during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo
3680293220734633682 EconMatcher 49 53 associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-
3680293220734633682 EconMatcher 54 56 economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by
3680293220734633682 EconMatcher 61 65 down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan
3680293220734633682 EconMatcher 673 677 At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-
3680293220734633682 EconMatcher 2985 2989 against institutions.[66] His policies became widely known as "trickle-down economics", due to the significant c