# spaCy's Matcher

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/02_02_matcher.html

In [14]:
import spacy

In [15]:
from spacy.matcher import Matcher

## Lexeme

A **lexeme** in spaCy is a context-independent entry in the vocabulary. Each lexeme is identified by a 64-bit hash of its string and stores only attributes that do not depend on context, such as the verbatim text, the lowercase form, the shape, and flags like `is_alpha` or `like_email`. Context-dependent annotations such as the part-of-speech tag, lemma, dependency label, and morphological features live on the token, not on the lexeme. Because every token with the same text shares one lexeme, the vocabulary avoids duplicating this information, which keeps memory usage low. The Matcher reports each match as a tuple `(match_id, start, end)`, where `match_id` is the hash of the name passed to `matcher.add`; looking it up in `nlp.vocab` returns that name as a string, which is why the examples below label it "Lexeme".
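
The relationship between strings, hashes, and lexemes can be sketched as follows. This is a minimal illustration rather than part of the original notebook: it assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) so that it runs without a trained model, and the sample sentence is made up.

```python
import spacy

# A blank English pipeline (tokenizer only) is assumed here so that the
# sketch runs without downloading a trained model.
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Looking up a string in the vocabulary returns its Lexeme.
lexeme = nlp.vocab["coffee"]

# The lexeme's ID is the hash of its text, stored in the StringStore.
coffee_hash = nlp.vocab.strings["coffee"]
print(lexeme.text, lexeme.orth == coffee_hash)

# The StringStore maps the hash back to the original string.
print(nlp.vocab.strings[coffee_hash])

# Context-independent flags live on the lexeme; the token refers to the same entry.
print(lexeme.is_alpha, lexeme.like_email, doc[2].orth == lexeme.orth)
```

The same lookup explains the Matcher output: the first element of a match tuple is a hash, and the vocabulary turns it back into the pattern name.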

## Basic Example

In [18]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("This is an email address: wiktorflorianwf@gmail.com")
matches = matcher(doc)

In [29]:
print(f"Lexeme: {matches[0][0]}, start token: {matches[0][1]}, end token: {matches[0][2]}")

Lexeme: 16571425990740197027, start token: 6, end token: 7


In [31]:
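# Only the match ID is a vocabulary hash; start and end are token indices into doc.
print(f"Lexeme: {nlp.vocab[matches[0][0]].text}, start token: {matches[0][1]}, end token: {matches[0][2]}, matched text: {doc[matches[0][1]:matches[0][2]]}")


Lexeme: EMAIL_ADDRESS, start token: 6, end token: 7, matched text: wiktorflorianwf@gmail.com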
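
When several patterns are registered, the match ID is what tells them apart. The sketch below is an illustration under assumptions: it uses a blank English pipeline instead of `en_core_web_sm`, the key `WEB_ADDRESS` and the sample sentence are invented, and it relies only on the lexical attributes `LIKE_EMAIL` and `LIKE_URL`, which need no trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
matcher = Matcher(nlp.vocab)

# Two patterns under two different keys (WEB_ADDRESS is a made-up name).
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])
matcher.add("WEB_ADDRESS", [[{"LIKE_URL": True}]])

doc = nlp("Write to user@example.com or visit https://example.com today")

results = []
# Each match is a (match_id, start, end) tuple; unpack it instead of indexing.
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    name = nlp.vocab.strings[match_id]  # hash -> pattern name
    span = doc[start:end]               # token indices -> Span
    results.append((name, span.text))
    print(name, start, end, span.text)
```

Unpacking the tuple keeps the three roles apart: the first element goes through the vocabulary, while the other two slice the `Doc`.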


## Attributes of the Matcher

+ **ORTH**: The exact verbatim text of the token (string), that is, its "orthographic" form as it appears in the original text, with casing and punctuation preserved. Use it to match a token's original form exactly.
+ **TEXT**: The exact verbatim text of the token (string). It is equivalent to **ORTH** and is also case-sensitive; for case-insensitive matching, use **LOWER**.
+ **LOWER**: The lowercase form of the token text (string).
+ **LENGTH**: The length of the token text (integer).
+ **IS_ALPHA**: Indicates if the token consists of alphabetic characters.
+ **IS_ASCII**: Indicates if the token consists of ASCII characters.
+ **IS_DIGIT**: Indicates if the token consists of digits.
+ **IS_LOWER**: Indicates if the token is in lowercase.
+ **IS_UPPER**: Indicates if the token is in uppercase.
+ **IS_TITLE**: Indicates if the token is in title case.
+ **IS_PUNCT**: Indicates if the token is a punctuation mark.
+ **IS_SPACE**: Indicates if the token consists of whitespace characters.
+ **IS_STOP**: Indicates if the token is a stop word.
+ **IS_SENT_START**: Indicates if the token starts a sentence.
+ **LIKE_NUM**: Indicates if the token resembles a numeric value.
+ **LIKE_URL**: Indicates if the token resembles a URL.
+ **LIKE_EMAIL**: Indicates if the token resembles an email address.
+ **SPACY**: Indicates if the token has a trailing space.
+ **POS**: The coarse-grained (universal) part-of-speech tag of the token.
+ **TAG**: The fine-grained part-of-speech tag of the token.
+ **MORPH**: The morphological features of the token.
+ **DEP**: The syntactic dependency relation of the token.
+ **LEMMA**: The base form or lemma of the token.
+ **SHAPE**: The shape or pattern of the token.
+ **ENT_TYPE**: The named entity type of the token.
+ **_**: Custom extension attributes (a dictionary of string keys and any values).
+ **OP**: The operator or quantifier that sets how many times the token pattern may match (for example `?`, `+`, `*`, or `!`).
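
A few of these attributes can be compared side by side. The following is a small sketch, not taken from the original notebook: it assumes a blank English pipeline, and the helper `matched_texts`, the key `DEMO`, and the sample sentence are all hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # lexical attributes work without a trained model
doc = nlp("Hello, world! hello world!")


def matched_texts(pattern):
    """Return the text of every span in doc that the pattern matches."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


# TEXT (like ORTH) is case-sensitive: only the lowercase token matches.
text_hits = matched_texts([{"TEXT": "hello"}])

# LOWER compares the lowercase form, so both spellings match.
lower_hits = matched_texts([{"LOWER": "hello"}])

# OP "?" makes the punctuation token optional between the two words.
optional_hits = matched_texts(
    [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
)

print(text_hits)
print(lower_hits)
print(optional_hits)
```

Attributes such as **POS**, **TAG**, **DEP**, **LEMMA**, and **ENT_TYPE** are filled in by trained pipeline components, so patterns that use them need a pipeline like `en_core_web_sm` or a `Doc` that already carries those annotations.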

## Applied Matcher

In [32]:
with open("data/wiki_mlk.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [34]:
nlp = spacy.load("en_core_web_sm")

## Grabbing all Proper Nouns

In [51]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(f"Lexeme: {match[0]}, start token: {match[1]}, end token: {match[2]}, proper noun: {doc[match[1]:match[2]]}")

102
Lexeme: 3232560085755078826, start token: 0, end token: 1, proper noun: Martin
Lexeme: 3232560085755078826, start token: 1, end token: 2, proper noun: Luther
Lexeme: 3232560085755078826, start token: 2, end token: 3, proper noun: King
Lexeme: 3232560085755078826, start token: 3, end token: 4, proper noun: Jr.
Lexeme: 3232560085755078826, start token: 6, end token: 7, proper noun: Michael
Lexeme: 3232560085755078826, start token: 7, end token: 8, proper noun: King
Lexeme: 3232560085755078826, start token: 8, end token: 9, proper noun: Jr.
Lexeme: 3232560085755078826, start token: 10, end token: 11, proper noun: January
Lexeme: 3232560085755078826, start token: 15, end token: 16, proper noun: April
Lexeme: 3232560085755078826, start token: 23, end token: 24, proper noun: Baptist


### Multi-Word Tokens

In [52]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(f"Lexeme: {match[0]}, start token: {match[1]}, end token: {match[2]}, proper noun: {doc[match[1]:match[2]]}")

175
Lexeme: 3232560085755078826, start token: 0, end token: 1, proper noun: Martin
Lexeme: 3232560085755078826, start token: 0, end token: 2, proper noun: Martin Luther
Lexeme: 3232560085755078826, start token: 1, end token: 2, proper noun: Luther
Lexeme: 3232560085755078826, start token: 0, end token: 3, proper noun: Martin Luther King
Lexeme: 3232560085755078826, start token: 1, end token: 3, proper noun: Luther King
Lexeme: 3232560085755078826, start token: 2, end token: 3, proper noun: King
Lexeme: 3232560085755078826, start token: 0, end token: 4, proper noun: Martin Luther King Jr.
Lexeme: 3232560085755078826, start token: 1, end token: 4, proper noun: Luther King Jr.
Lexeme: 3232560085755078826, start token: 2, end token: 4, proper noun: King Jr.
Lexeme: 3232560085755078826, start token: 3, end token: 4, proper noun: Jr.


### Greedy Keyword Argument

In [63]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(f"Lexeme: {match[0]:<19}, start token: {match[1]:<3}, end token: {match[2]:<3}, proper noun: {doc[match[1]:match[2]]}")

61
Lexeme: 3232560085755078826, start token: 83 , end token: 88 , proper noun: Martin Luther King Sr.
Lexeme: 3232560085755078826, start token: 469, end token: 474, proper noun: Martin Luther King Jr. Day
Lexeme: 3232560085755078826, start token: 536, end token: 541, proper noun: Martin Luther King Jr. Memorial
Lexeme: 3232560085755078826, start token: 0  , end token: 4  , proper noun: Martin Luther King Jr.
Lexeme: 3232560085755078826, start token: 128, end token: 132, proper noun: Southern Christian Leadership Conference
Lexeme: 3232560085755078826, start token: 247, end token: 251, proper noun: Director J. Edgar Hoover
Lexeme: 3232560085755078826, start token: 6  , end token: 9  , proper noun: Michael King Jr.
Lexeme: 3232560085755078826, start token: 325, end token: 328, proper noun: Nobel Peace Prize
Lexeme: 3232560085755078826, start token: 422, end token: 425, proper noun: James Earl Ray
Lexeme: 3232560085755078826, start token: 463, end token: 466, proper noun: Congressional Go
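
The drop in the number of matches comes from how `"OP": "+"` behaves: without `greedy`, a run of n matching tokens yields every contiguous sub-span, that is n(n+1)/2 matches, whereas `greedy="LONGEST"` keeps only the longest non-overlapping ones. The sketch below reproduces this on a made-up sentence. As an assumption, it substitutes `IS_TITLE` for `{"POS": "PROPN"}` so that it runs on a blank pipeline without a tagger; the key `TITLE_RUN` is hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Martin Luther King visited Atlanta Georgia yesterday")

# IS_TITLE stands in for {"POS": "PROPN"} so no trained model is needed.
pattern = [{"IS_TITLE": True, "OP": "+"}]

# Without greedy: every sub-span of each title-cased run is returned.
all_matcher = Matcher(nlp.vocab)
all_matcher.add("TITLE_RUN", [pattern])
all_spans = [doc[start:end].text for _, start, end in all_matcher(doc)]

# With greedy="LONGEST": overlapping shorter matches are filtered out.
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("TITLE_RUN", [pattern], greedy="LONGEST")
longest_spans = [doc[start:end].text for _, start, end in longest_matcher(doc)]

print(len(all_spans), all_spans)
print(len(longest_spans), longest_spans)
```

Here the three-token run contributes 6 matches and the two-token run 3, so the unfiltered matcher returns 9 spans while the greedy one returns 2.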

### Sorting

In [64]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(f"Lexeme: {match[0]:<19}, start token: {match[1]:<3}, end token: {match[2]:<3}, proper noun: {doc[match[1]:match[2]]}")

61
Lexeme: 3232560085755078826, start token: 0  , end token: 4  , proper noun: Martin Luther King Jr.
Lexeme: 3232560085755078826, start token: 6  , end token: 9  , proper noun: Michael King Jr.
Lexeme: 3232560085755078826, start token: 10 , end token: 11 , proper noun: January
Lexeme: 3232560085755078826, start token: 15 , end token: 16 , proper noun: April
Lexeme: 3232560085755078826, start token: 23 , end token: 24 , proper noun: Baptist
Lexeme: 3232560085755078826, start token: 49 , end token: 50 , proper noun: King
Lexeme: 3232560085755078826, start token: 69 , end token: 71 , proper noun: Mahatma Gandhi
Lexeme: 3232560085755078826, start token: 83 , end token: 88 , proper noun: Martin Luther King Sr.
Lexeme: 3232560085755078826, start token: 89 , end token: 90 , proper noun: King
Lexeme: 3232560085755078826, start token: 113, end token: 114, proper noun: King
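
Sorting is needed because `greedy="LONGEST"` returns matches ordered by length rather than by position. A variation on the same idea, sketched here under assumptions (spaCy v3, a blank pipeline, `IS_TITLE` in place of the POS pattern, an invented sentence and key), is to request `Span` objects with `as_spans=True` and sort those by their start token.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Atlanta hosted Martin Luther King")

matcher = Matcher(nlp.vocab)
# IS_TITLE stands in for {"POS": "PROPN"}; TITLE_RUN is a made-up key.
matcher.add("TITLE_RUN", [[{"IS_TITLE": True, "OP": "+"}]], greedy="LONGEST")

# as_spans=True returns Span objects; span.label_ holds the pattern name.
spans = matcher(doc, as_spans=True)
print([span.text for span in spans])

# Sort by start token (then end token) to restore document order.
spans.sort(key=lambda span: (span.start, span.end))
sorted_texts = [span.text for span in spans]
print(sorted_texts)
```

Working with spans avoids repeating `doc[match[1]:match[2]]` and keeps the pattern name available as `span.label_`.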


### Adding in Sequences

In [65]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(f"Lexeme: {match[0]:<19}, start token: {match[1]:<3}, end token: {match[2]:<3}, proper noun: {doc[match[1]:match[2]]}")

7
Lexeme: 3232560085755078826, start token: 49 , end token: 51 , proper noun: King advanced
Lexeme: 3232560085755078826, start token: 89 , end token: 91 , proper noun: King participated
Lexeme: 3232560085755078826, start token: 113, end token: 115, proper noun: King led
Lexeme: 3232560085755078826, start token: 167, end token: 169, proper noun: King helped
Lexeme: 3232560085755078826, start token: 247, end token: 252, proper noun: Director J. Edgar Hoover considered
Lexeme: 3232560085755078826, start token: 322, end token: 324, proper noun: King won
Lexeme: 3232560085755078826, start token: 485, end token: 488, proper noun: United States beginning
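
A pattern is a sequence: each dictionary describes one token (or, with `OP`, a run of tokens), and the dictionaries must match consecutive tokens in order. The sketch below shows the same `PROPN+ VERB` sequence on a tiny hand-built example. It assumes spaCy v3; the words and POS tags are written by hand as a stand-in for a trained tagger's output, and the key `PROPN_VERB` is hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written tokens and POS tags stand in for a trained tagger's output.
words = ["King", "advanced", "civil", "rights", ".",
         "Martin", "Luther", "King", "led", "marches", "."]
pos = ["PROPN", "VERB", "ADJ", "NOUN", "PUNCT",
       "PROPN", "PROPN", "PROPN", "VERB", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# One or more proper nouns immediately followed by a verb.
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPN_VERB", [pattern], greedy="LONGEST")

# Sort by start token to list the phrases in document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
phrases = [doc[start:end].text for _, start, end in matches]
print(phrases)
```

With `greedy="LONGEST"`, the shorter overlapping candidates "Luther King led" and "King led" are dropped in favour of the full name.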


## Quotes and Speakers

In [67]:
import json
with open("data/alice.json", "r") as f:
    data = json.load(f)

In [68]:
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [69]:
text = data[0][2][0].replace("`", "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [70]:
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches:
    print(f"Lexeme: {match[0]:<19}, start token: {match[1]:<3}, end token: {match[2]:<3}, proper noun: {doc[match[1]:match[2]]}")

2
Lexeme: 3232560085755078826, start token: 47 , end token: 58 , proper noun: 'and what is the use of a book,'
Lexeme: 3232560085755078826, start token: 60 , end token: 67 , proper noun: 'without pictures or conversation?'


### Finding Speaker

In [71]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace("`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"},
            {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {"ORTH": "'"},
            {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
matcher.add("PROPER_NOUNS", [pattern1], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches:
    print(f"Lexeme: {match[0]:<19}, start token: {match[1]:<3}, end token: {match[2]:<3}, proper noun: {doc[match[1]:match[2]]}")

1
Lexeme: 3232560085755078826, start token: 47 , end token: 67 , proper noun: 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


### Problem with this Approach

In [72]:
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print(len(matches))
    for match in matches[:10]:
        print(match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


### Adding More Patterns

In [73]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern1 = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"},
            {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {"ORTH": "'"},
            {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
pattern2 = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"},
            {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"ORTH": "'"},
            {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy="LONGEST")
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print(len(matches))
    for match in matches[:10]:
        print(match, doc[match[1]:match[2]])


1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0
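
Once a match is found, the speaker still has to be pulled out of it. One possible approach, sketched below under several assumptions, is to register each layout under its own key so the match ID records which pattern fired, and then collect the proper-noun tokens from the span. The keys `QUOTE_THEN_SPEAKER` and `SPEAKER_THEN_QUOTE` are hypothetical, the `Doc` is annotated by hand in place of `en_core_web_sm` output, and the extraction assumes that the quote itself contains no proper nouns.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated stand-in for a trained pipeline's analysis of "'Well!' thought Alice".
words = ["'", "Well", "!", "'", "thought", "Alice"]
spaces = [False, False, False, True, True, False]
pos = ["PUNCT", "INTJ", "PUNCT", "PUNCT", "VERB", "PROPN"]
lemmas = ["'", "well", "!", "'", "think", "Alice"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, pos=pos, lemmas=lemmas)

# Reusable building blocks for the patterns used in this section.
quote = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"},
         {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
verb = {"POS": "VERB", "LEMMA": {"IN": ["think", "say"]}}
speaker = {"POS": "PROPN", "OP": "+"}

matcher = Matcher(nlp.vocab)
# One key per layout (hypothetical names) records which pattern fired.
matcher.add("QUOTE_THEN_SPEAKER", [quote + [verb, speaker]], greedy="LONGEST")
matcher.add("SPEAKER_THEN_QUOTE", [[speaker, verb] + quote], greedy="LONGEST")

results = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    layout = nlp.vocab.strings[match_id]
    # Assumes the quote contains no proper nouns of its own.
    name = " ".join(tok.text for tok in span if tok.pos_ == "PROPN")
    results.append((layout, name, span.text))

print(results)
```

Building the patterns from shared pieces also removes the repetition between `pattern1`, `pattern2`, and `pattern3`.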
