
In [None]:
# Import the requisite library
import spacy

In [None]:
from spacy.matcher import Matcher

### Basic Example

In [None]:
#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])

In [None]:
doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)

In [None]:
print(matches)

[(16571425990740197027, 6, 7)]


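The first element (16571425990740197027) is the match ID: the hash of the pattern name as stored in the vocabulary. 6 is the index of the first matched token and 7 is the end index (exclusive), so the match covers only token 6.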

In [None]:
print (nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
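Each match is a `(match_id, start, end)` tuple, so it can be unpacked directly in a loop. The sketch below repeats the email example under one assumption that differs from the notebook: it uses `spacy.blank("en")` (tokenizer only) instead of `en_core_web_sm`, which should be enough here because `LIKE_EMAIL` depends only on the token text. The name `results` is illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is sufficient,
# because LIKE_EMAIL is computed from the token text alone.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("This is an email address: wmattingly@aol.com")

results = []
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # hash -> pattern name
    span = doc[start:end]                # end index is exclusive
    results.append((label, start, end, span.text))

print(results)
```

Looking the ID up in `nlp.vocab.strings` gives the same name as `nlp.vocab[match_id].text` in the cell above.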


Attributes Taken by Matcher

A few of the token attributes a pattern can test:

ORTH - The exact verbatim text of a token (str)

TEXT - The exact verbatim text of a token (str); same as ORTH

LOWER - The lowercase form of the token text (str)

LENGTH - The length of the token text (int)
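To see how these attributes differ in practice, here is a small sketch on a made-up sentence. It assumes a blank English pipeline (`spacy.blank("en")`), since all three attributes are lexical; the labels `EXACT`, `ANY_CASE`, and `LONG_TOKEN` and the sentence itself are illustrative, not part of the notebook.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; ORTH, LOWER and LENGTH need no model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical labels, one per attribute being demonstrated.
matcher.add("EXACT", [[{"ORTH": "King"}]])             # case-sensitive
matcher.add("ANY_CASE", [[{"LOWER": "king"}]])         # case-insensitive
matcher.add("LONG_TOKEN", [[{"LENGTH": {">=": 10}}]])  # 10+ characters

doc = nlp("King met the KING and a king at the celebration")

# Group the matched text by pattern name.
found = {}
for match_id, start, end in matcher(doc):
    found.setdefault(nlp.vocab.strings[match_id], []).append(doc[start:end].text)

for label, texts in found.items():
    print(label, texts)
```

`ORTH` only hits the token spelled exactly `King`, while `LOWER` hits all three spellings.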

In [None]:
# Example 2: read the sample text from a local file
with open ("wiki_story.txt", "r") as f:
    text = f.read()

In [None]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 â€“ April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his famou

In [None]:
#Build upon the spaCy Small Model
nlp_story = spacy.load("en_core_web_sm")

In [None]:
matcher_story = Matcher(nlp_story.vocab)
pattern_story = [{"POS": "PROPN"}]
matcher_story.add("PROPER_NOUN", [pattern_story])

In [None]:
doc_story = nlp_story(text)
matches = matcher_story(doc_story)

In [None]:
print(len(matches))

102


In [None]:
for match_story in matches[:10]:
    print (match_story, doc_story[match_story[1]:match_story[2]])

(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 14, 15) â€
(451313080118390996, 16, 17) April


### Improving it with Multi-Word Tokens

In [None]:
matcher_story = Matcher(nlp_story.vocab)
pattern_story = [{"POS": "PROPN","OP":"+"}]
matcher_story.add("PROPER_NOUN", [pattern_story])
doc_story = nlp_story(text)
matches = matcher_story(doc_story)
print(len(matches))
for match_story in matches[:10]:
    print (match_story, doc_story[match_story[1]:match_story[2]])

175
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


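With "OP": "+", the pattern matches one or more consecutive proper nouns. The matcher returns every possible sub-span of each run (Martin, Martin Luther, Luther, and so on), which is why the count rises from 102 to 175.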
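The growth in matches is easy to reproduce on a tiny example. The sketch below assumes spaCy v3, where a `Doc` can be built by hand with `words` and `pos`, so no trained model is needed; the six-word document and its tags are written by hand for illustration.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: hand-assigned POS tags stand in for the tagger's output.
vocab = Vocab()
doc = Doc(
    vocab,
    words=["Martin", "Luther", "King", "Jr.", "was", "born"],
    pos=["PROPN", "PROPN", "PROPN", "PROPN", "AUX", "VERB"],
)

matcher = Matcher(vocab)
matcher.add("PROPER_NOUN", [[{"POS": "PROPN", "OP": "+"}]])

spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(len(spans))
for text in spans:
    print(text)
```

A run of four proper nouns yields 4 + 3 + 2 + 1 = 10 sub-spans, the same ten that appear at the top of the output above.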

### Greedy Keyword Argument

In [None]:
matcher_story = Matcher(nlp_story.vocab)
pattern_story = [{"POS": "PROPN","OP":"+"}]
matcher_story.add("PROPER_NOUN", [pattern_story],greedy="LONGEST")
doc_story = nlp_story(text)
matches = matcher_story(doc_story)
print(len(matches))
for match_story in matches[:10]:
    print (match_story, doc_story[match_story[1]:match_story[2]])

61
(451313080118390996, 84, 89) Martin Luther King Sr.
(451313080118390996, 470, 475) Martin Luther King Jr. Day
(451313080118390996, 537, 542) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 129, 133) Southern Christian Leadership Conference
(451313080118390996, 248, 252) Director J. Edgar Hoover
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 326, 329) Nobel Peace Prize
(451313080118390996, 423, 426) James Earl Ray
(451313080118390996, 464, 467) Congressional Gold Medal


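With greedy="LONGEST", only the longest of any overlapping matches is kept, so the count drops from 175 to 61. The results are no longer in document order: the output above lists longer spans first.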
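The effect of the `greedy` argument can be compared side by side on a small hand-built document. This is a sketch under the assumption of spaCy v3 (`Doc` built from `words` and `pos`); the helper `proper_noun_spans` and the seven-word sentence are illustrative.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Assumption: POS tags are assigned by hand instead of by a tagger.
doc = Doc(
    vocab,
    words=["Martin", "Luther", "King", "Jr.", "met", "Mahatma", "Gandhi"],
    pos=["PROPN", "PROPN", "PROPN", "PROPN", "VERB", "PROPN", "PROPN"],
)


def proper_noun_spans(doc, greedy=None):
    """Hypothetical helper: match runs of proper nouns with a given greedy mode."""
    matcher = Matcher(doc.vocab)
    matcher.add("PROPER_NOUN", [[{"POS": "PROPN", "OP": "+"}]], greedy=greedy)
    return [doc[start:end].text for _, start, end in matcher(doc)]


all_spans = proper_noun_spans(doc)                        # every sub-span
longest_spans = proper_noun_spans(doc, greedy="LONGEST")  # one per run

print(len(all_spans), len(longest_spans))
print(longest_spans)
```

The two runs do not overlap each other, so both survive the filter, while all of their shorter sub-spans are dropped.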

### Sorting by Appearance

In [None]:
matcher_story = Matcher(nlp_story.vocab)
pattern_story = [{"POS": "PROPN","OP":"+"}]
matcher_story.add("PROPER_NOUN", [pattern_story],greedy="LONGEST")
doc_story = nlp_story(text)
matches = matcher_story(doc_story)
matches.sort(key = lambda x: x[1])   # sort by start token
print(len(matches))
for match_story in matches[:10]:
    print (match_story, doc_story[match_story[1]:match_story[2]])

61
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 14, 15) â€
(451313080118390996, 16, 17) April
(451313080118390996, 24, 25) Baptist
(451313080118390996, 50, 51) King
(451313080118390996, 70, 72) Mahatma Gandhi
(451313080118390996, 84, 89) Martin Luther King Sr.
(451313080118390996, 90, 91) King
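Because matches are plain tuples, ordinary Python sorting applies. As a possible variation on `matches.sort(...)`, the sketch below uses `sorted()` to get a new list and leaves the original untouched; sorting on `(start, end)` also breaks ties between matches that begin at the same token. The three tuples are taken from the greedy output earlier in this notebook.

```python
# (match_id, start, end) tuples in the order the greedy matcher returned them.
matches = [
    (451313080118390996, 84, 89),  # Martin Luther King Sr.
    (451313080118390996, 0, 4),    # Martin Luther King Jr.
    (451313080118390996, 6, 9),    # Michael King Jr.
]

# sorted() returns a new list; the original order stays available.
by_position = sorted(matches, key=lambda m: (m[1], m[2]))

starts = [start for _, start, _ in by_position]
print(starts)
```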


### Adding in Sequences

In [None]:
matcher_story = Matcher(nlp_story.vocab)
pattern_story = [{"POS": "PROPN","OP":"+"}, {"POS": "VERB"}]
matcher_story.add("PROPER_NOUN", [pattern_story],greedy="LONGEST")
doc_story = nlp_story(text)
matches = matcher_story(doc_story)
matches.sort(key = lambda x: x[1])   # sort by start token
print(len(matches))
for match_story in matches[:10]:
    print (match_story, doc_story[match_story[1]:match_story[2]])

7
(451313080118390996, 50, 52) King advanced
(451313080118390996, 90, 92) King participated
(451313080118390996, 114, 116) King led
(451313080118390996, 168, 170) King helped
(451313080118390996, 248, 253) Director J. Edgar Hoover considered
(451313080118390996, 323, 325) King won
(451313080118390996, 486, 489) United States beginning


Each match is now a run of one or more proper nouns followed immediately by a verb. Only 7 spans in the text fit that sequence.
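Since the verb is always the last token of such a match, the name and the verb can be separated by slicing the span. The sketch below assumes spaCy v3 and a hand-tagged `Doc` in place of the model's output; the sentence and the name `pairs` are illustrative.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Assumption: POS tags are written by hand rather than predicted.
doc = Doc(
    vocab,
    words=["King", "advanced", "civil", "rights", "and",
           "J.", "Edgar", "Hoover", "considered", "him", "radical"],
    pos=["PROPN", "VERB", "ADJ", "NOUN", "CCONJ",
         "PROPN", "PROPN", "PROPN", "VERB", "PRON", "ADJ"],
)

matcher = Matcher(vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")

pairs = []
for _, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    span = doc[start:end]
    # Everything except the last token is the name; the last token is the verb.
    pairs.append((span[:-1].text, span[-1].text))

print(pairs)
```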

### Finding Quotes and Speakers

In [None]:
# Example3:
import json
with open ("alice.json", "r") as f:
    data = json.load(f)

In [None]:
text = data[0][2][0]
print (text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [None]:
text = text.replace("`","'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"}, 
           {'IS_ALPHA': True, "OP": "+"}, 
           {'IS_PUNCT': True, "OP": "*"}, 
           {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

2
(3232560085755078826, 47, 58) 'and what is the use of a book,'
(3232560085755078826, 60, 67) 'without pictures or conversation?'
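The quote pattern relies only on lexical attributes (`ORTH`, `IS_ALPHA`, `IS_PUNCT`), so it can be tried on a single short line without a trained model. The sketch below assumes `spacy.blank("en")` and a one-line sample in the same style as the text; the label `QUOTE` is illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; the pattern uses no POS or lemma.
nlp = spacy.blank("en")

# Normalise the opening backtick to an apostrophe, as in the cell above.
text = "`Well!' thought Alice".replace("`", "'")

pattern = [
    {"ORTH": "'"},                   # opening quote
    {"IS_ALPHA": True, "OP": "+"},   # one or more words
    {"IS_PUNCT": True, "OP": "*"},   # optional trailing punctuation
    {"ORTH": "'"},                   # closing quote
]
matcher = Matcher(nlp.vocab)
matcher.add("QUOTE", [pattern], greedy="LONGEST")

doc = nlp(text)
quotes = [doc[start:end].text for _, start, end in matcher(doc)]
print(quotes)
```

Note that `IS_ALPHA` with `+` only allows words between the quotes, so a quote with a comma or number in the middle would not match this pattern.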


### Find Speaker

In [None]:
speak_lemmas = ["think","say"]
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"}, 
           {'IS_ALPHA': True, "OP": "+"}, 
           {'IS_PUNCT': True, "OP": "*"}, 
           {'ORTH': "'"},
           {'POS':"VERB", "LEMMA": {"IN": speak_lemmas}},
           {'POS':'PROPN', "OP": "+"},
           {'ORTH': "'"}, 
           {'IS_ALPHA': True, "OP": "+"}, 
           {'IS_PUNCT': True, "OP": "*"}, 
           {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [None]:
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


Only the first paragraph produces a match. The pattern requires a quote, a speech verb, a speaker, and then a second quote, so paragraphs whose quotes follow a different layout are missed.

### Adding More Patterns

In [None]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


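With three patterns registered under one label, the matcher now also finds quotes followed by a speech verb and a speaker, such as 'Well!' thought Alice. Three paragraphs match instead of one.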
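The idea of registering several word orders under one label can be sketched without a trained pipeline. The version below is a simplification, not the notebook's method: it swaps the `POS`/`LEMMA` checks for lexical stand-ins (`LOWER` for the verb forms, `IS_TITLE` for the speaker), so it assumes the speaker is capitalised and the verb appears as "thought" or "said". The label and the two sample lines are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline with lexical stand-ins for POS/LEMMA.
nlp = spacy.blank("en")

quote = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP": "+"},
    {"IS_PUNCT": True, "OP": "*"},
    {"ORTH": "'"},
]
verb = {"LOWER": {"IN": ["thought", "said"]}}  # stand-in for LEMMA
speaker = {"IS_TITLE": True, "OP": "+"}        # stand-in for POS == PROPN

# Two word orders, both registered under a single (hypothetical) label.
quote_then_speaker = quote + [verb, speaker]
speaker_then_quote = [speaker, verb] + quote

matcher = Matcher(nlp.vocab)
matcher.add(
    "QUOTE_WITH_SPEAKER",
    [quote_then_speaker, speaker_then_quote],
    greedy="LONGEST",
)

found = []
for text in ["'Well!' thought Alice", "Alice said 'Oh dear!'"]:
    doc = nlp(text)
    for _, start, end in matcher(doc):
        found.append(doc[start:end].text)

print(found)
```

Building the patterns from shared pieces keeps the three long pattern lists above easier to read and to extend with further word orders.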