# spaCy's Matcher

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/02_02_matcher.html

In [14]:
import spacy

In [15]:
from spacy.matcher import Matcher

## Lexeme

A **lexeme** in spaCy is an entry in the vocabulary: a word type without any context, as opposed to a token, which is a word occurring in a particular text. Each lexeme is identified by an integer hash of its text and stores only context-independent attributes, such as the verbatim text, the lowercase form, the shape, and flags like `is_alpha` or `like_email`. Context-dependent annotations such as the part-of-speech tag, lemma, and morphological features belong to the token, not to the lexeme. Because all tokens with the same text share one lexeme, this information is not duplicated, which keeps memory usage low. The Matcher is initialized with this shared vocabulary (`nlp.vocab`), and the names of its patterns are stored as hashes in the vocabulary's string store.
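
To make the token/lexeme distinction concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`) rather than the `en_core_web_sm` model used in the rest of these notes, since lexical attributes do not need a trained model. The example sentence is made up for illustration.

```python
import spacy

# A blank English pipeline is enough here: lexical attributes need no trained model.
nlp = spacy.blank("en")
doc = nlp("Coffee or coffee")

# Looking a string up in the vocabulary returns its Lexeme.
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha, lexeme.like_email)

# The lexeme's ID is the hash of its text, as kept in the string store.
same_hash = lexeme.orth == nlp.vocab.strings["coffee"]

# A token with the same text refers to the same vocabulary entry,
# while a differently cased token does not.
shares_entry = doc[2].orth == lexeme.orth
differs_by_case = doc[0].orth != lexeme.orth
print(same_hash, shares_entry, differs_by_case)
```

Note that the lexeme exposes no `pos_` or `lemma_`: those depend on context and are only available on tokens.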

## Basic Example

In [18]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("This is an email address: wiktorflorianwf@gmail.com")
matches = matcher(doc)

In [29]:
print(f"Match ID: {matches[0][0]}, start token: {matches[0][1]}, end token: {matches[0][2]}")

Match ID: 16571425990740197027, start token: 6, end token: 7


In [31]:
match_id, start, end = matches[0]  # hash of the pattern name, then token indices into doc
print(f"Match ID: {nlp.vocab.strings[match_id]}, matched span: {doc[start:end].text}")


Match ID: EMAIL_ADDRESS, matched span: wiktorflorianwf@gmail.com
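
Each match is a `(match_id, start, end)` tuple, where `match_id` is the hash of the pattern name and `start` and `end` are token indices into the doc. The sketch below loops over several matches and also shows the `as_spans=True` option, which assumes spaCy v3 or later. It uses a blank English pipeline and made-up addresses so that it runs without a downloaded model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # LIKE_EMAIL is a lexical attribute, so no trained model is needed
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

# The addresses below are made up for illustration.
doc = nlp("Write to alice@example.com or bob@example.org today")

# Default form: (match_id, start, end) tuples with token indices.
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)

# Span form: the pattern name is available as the span's label.
spans = matcher(doc, as_spans=True)
found = [(span.label_, span.text) for span in spans]
print(found)
```

The span form is convenient when the matched text is all that is needed, while the tuple form is useful when the token offsets themselves matter.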


## Attributes of the Matcher

+ **ORTH**: The exact verbatim text of a token (string), that is, its "orthographic" form as it appears in the original text, with the original casing preserved. Use **ORTH** when a pattern must match the token's original form exactly.
+ **TEXT**: The exact verbatim text of a token (string). In patterns it behaves like **ORTH** and is case-sensitive; for case-insensitive matching, use **LOWER** instead.
+ **LOWER**: The lowercase form of the token text (string).
+ **LENGTH**: The length of the token text (integer).
+ **IS_ALPHA**: Indicates if the token consists of alphabetic characters.
+ **IS_ASCII**: Indicates if the token consists of ASCII characters.
+ **IS_DIGIT**: Indicates if the token consists of digits.
+ **IS_LOWER**: Indicates if the token is in lowercase.
+ **IS_UPPER**: Indicates if the token is in uppercase.
+ **IS_TITLE**: Indicates if the token is in title case.
+ **IS_PUNCT**: Indicates if the token is a punctuation mark.
+ **IS_SPACE**: Indicates if the token consists of whitespace characters.
+ **IS_STOP**: Indicates if the token is a stop word.
+ **IS_SENT_START**: Indicates if the token starts a sentence.
+ **LIKE_NUM**: Indicates if the token resembles a numeric value.
+ **LIKE_URL**: Indicates if the token resembles a URL.
+ **LIKE_EMAIL**: Indicates if the token resembles an email address.
+ **SPACY**: Indicates if the token is followed by a trailing space.
+ **POS**: The coarse-grained part-of-speech tag of the token.
+ **TAG**: The fine-grained part-of-speech tag of token.
+ **MORPH**: The morphological features of the token.
+ **DEP**: The syntactic dependency relation of the token.
+ **LEMMA**: The base form or lemma of the token.
+ **SHAPE**: The shape or pattern of the token.
+ **ENT_TYPE**: The named entity type of the token.
+ **_**: Custom extension attributes (a dictionary of string keys and any values).
+ **OP**: An operator or quantifier that determines how often the token pattern is matched (for example `?`, `*`, `+`, `!`).
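
A few of the attributes above can be combined in one pattern, with one dictionary per token. The following is a minimal sketch and not part of the original notebook: the helper `matched_texts`, the pattern name `"DEMO"`, and the example sentence are assumptions made for illustration. It only uses lexical attributes, so a blank English pipeline is enough; attributes such as **POS**, **LEMMA**, or **ENT_TYPE** would require a trained pipeline like `en_core_web_sm`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # only lexical attributes are used, so no trained model is needed
doc = nlp("Hello, world! hello world")

def matched_texts(pattern):
    """Run a single pattern over doc and return the matched texts (hypothetical helper)."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO", [pattern])  # "DEMO" is an arbitrary pattern name
    return [doc[start:end].text for _, start, end in matcher(doc)]

# ORTH is case-sensitive, LOWER compares against the lowercased token text.
orth_hits = matched_texts([{"ORTH": "hello"}])
lower_hits = matched_texts([{"LOWER": "hello"}])
print(orth_hits, lower_hits)

# OP "?" makes the punctuation token optional (zero or one occurrence).
greeting = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
greeting_hits = matched_texts(greeting)
print(greeting_hits)
```

Here the **ORTH** pattern only finds the lowercase "hello", the **LOWER** pattern finds both spellings, and the optional punctuation token lets one pattern cover the greeting with and without the comma.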