# spaCy's RegEx

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/02_05_simple_regex.html

## Regular Expressions (RegEx)

RegEx is a powerful tool for performing complex string matching based on simple or intricate patterns. It allows for finding and retrieving patterns, or replacing matching patterns in a string with another pattern. RegEx was invented by Stephen Cole Kleene in the 1950s and remains widely used today, especially for tasks involving string matching in text. Many search tools support it, which enables more robust searching. Data scientists, particularly those working with text, often rely on RegEx at various stages of their workflow, including data searching, data cleaning, and implementing machine learning models. It is an essential tool for any researcher working with text-based data.

In spaCy, RegEx can be utilized in different pipes, depending on the specific task at hand. It can be leveraged to identify entities or perform pattern matching, among other applications.

## Pros of RegEx

+ Its compact syntax lets programmers write robust rules in very little space.
+ It lets the researcher capture many kinds of variation in strings.
+ It can perform remarkably quickly when compared to other methods.
+ It is universally supported.

## Cons of RegEx

+ Its syntax is quite difficult.
+ In order to achieve optimal performance, it is essential to have a domain expert collaborate with the programmer to consider all possible variations of patterns in texts.

## RegEx in Python

Python comes prepackaged with a RegEx library called **re**.

In [1]:
import re

In [3]:
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]


In the provided code snippet, we can observe a real-life example of a **RegEx** formula in action. Although it may appear complex, the syntax of RegEx is relatively straightforward. Breaking down the code step by step:
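1. The outermost pair of parentheses creates a capture group around the entire pattern, so the full date is returned as the first item of each result.
2. **(\d){1,2}** specifies that we are searching for any digit (0-9) occurring once or twice. Because the group wraps a single digit, only the last digit is captured, which is why '4' appears for "14 August".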
3. After that, we encounter a space character, indicating the expected space in the string.
4. Following the space, we have **January|February|...|December**. This part represents another component of the pattern enclosed in parentheses. The **|** character serves as an **or** operator, allowing any of the listed months to be matched.

When combined, this pattern will match any sequence consisting of one or two digits followed by a space and a month name.
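To see why the second tuple above contains '4' rather than '14', here is a minimal sketch of how a repeated capture group behaves. The shortened pattern, which matches only August, is an illustrative assumption and not part of the original notebook.

```python
import re

sample = "14 August"

# (\d){1,2} repeats a one-digit group, so only the last digit is kept.
last_digit_only = re.findall(r"(\d){1,2} (August)", sample)

# (\d{1,2}) puts the repetition inside the group, so the whole number is kept.
whole_number = re.findall(r"(\d{1,2}) (August)", sample)

print(last_digit_only)  # [('4', 'August')]
print(whole_number)     # [('14', 'August')]
```

Moving the quantifier inside the parentheses is usually the better choice when the captured value itself matters.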

In [4]:
text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)

[('14 August', '4', 'August')]


The pattern fails on "February 2", but this is not the fault of RegEx: our pattern cannot accommodate that particular variation. However, we can address it by including it as a possible variation. In RegEx, alternatives are expressed with the pipe (|), which lets us place the month-first order next to the day-first order.

In [5]:
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"
text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)

[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August', '14 August', '4', ' August', 'August', '', '', '', '')]


There are alternative ways to write the same **RegEx** formula in a more concise manner. In this case, we have chosen a slightly more verbose approach to enhance readability. However, there are options to simplify it further.
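As one possible shorter form, the sketch below builds the month alternation once and uses non-capturing groups, written as `(?:...)`. The names `MONTHS`, `month`, `day`, and `concise_pattern` are illustrative and do not come from the original notebook.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# (?:...) groups the alternatives without creating a capture group.
month = "(?:" + "|".join(MONTHS) + ")"
day = r"\d{1,2}"

# Day-first order OR month-first order.
concise_pattern = rf"{day} {month}|{month} {day}"

text = "This is a date February 2. Another date would be 14 August."
concise_matches = re.findall(concise_pattern, text)
print(concise_matches)  # ['February 2', '14 August']
```

Because this version has no capture groups, **findall** returns only the full matched strings.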

It's important to note that the **findall** output for the verbose pattern includes additional information for each match: one entry for every capture group in the pattern. To remove these unnecessary components, one approach is to use the **finditer** function instead of **findall**.

By utilizing **finditer**, we can iterate over the match objects and access only the information we are interested in, such as the full matched text, rather than a tuple of every captured group. This allows for a more streamlined representation of the desired output.

In [6]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print(iter_matches)

<callable_iterator object at 0x108677d90>


In [8]:
# looping over iterator object
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print(iter_matches)
for hit in iter_matches:
    print(hit)


<callable_iterator object at 0x1086772e0>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>


Each match object returned by **finditer** carries valuable information, including the start and end locations of the match within the input string. The **match** field shown in the printed representation is the text that corresponds to the match; in code, that text is available through the **group()** method.

By using the start and end locations, we can extract the specific text that corresponds to each match from the input string. This allows us to retrieve the relevant portions of the string based on the identified matches using RegEx.

In [9]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print(text[start:end])


February 2
14 August
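Slicing the input string is one option; a match object can also return its own text and offsets. The sketch below uses the standard `group()` and `span()` methods. The simplified pattern (a capitalized word next to one or two digits) is a stand-in chosen for brevity, not the month-based pattern used above.

```python
import re

text = "This is a date February 2. Another date would be 14 August."

# Simplified stand-in pattern: digits then a capitalized word, or the reverse.
simple_pattern = r"\d{1,2} [A-Z][a-z]+|[A-Z][a-z]+ \d{1,2}"

# group() returns the matched text, span() returns (start, end).
hits = [(hit.group(), hit.span()) for hit in re.finditer(simple_pattern, text)]
print(hits)  # [('February 2', (15, 25)), ('14 August', (49, 58))]
```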


## RegEx in spaCy

Patterns that exhibit consistent or relatively consistent structures, such as dates, times, and IP addresses, are ideal candidates for RegEx. The structured nature of these patterns allows for precise matching.

Thankfully, spaCy provides convenient ways to incorporate RegEx in three specific pipes: Matcher, PhraseMatcher, and EntityRuler. These pipes enable the use of RegEx patterns for matching specific entities or phrases within text.
However, it's important to note that one major drawback of using Matcher and PhraseMatcher is that they do not align the identified matches with the **doc.ents** attribute of the Doc object. This means that the matches found using these pipes are not directly recognized as entities by spaCy's NER system.
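The following sketch illustrates that drawback under the assumption of a blank English pipeline and spaCy v3's `Matcher.add(key, patterns)` signature. The rule name `PHONE_NUMBER` and the sample sentence are chosen only for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# A Matcher is created from the vocab and called on a Doc directly.
matcher = Matcher(nlp.vocab)
matcher.add("PHONE_NUMBER", [[{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]])

doc = nlp("This is a sample number 555-5555.")

# Each match is a (match_id, start_token, end_token) tuple.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)

# The Matcher found the span, but doc.ents was not updated.
print(doc.ents)
```

The match is found, yet `doc.ents` stays empty, which is the motivation for using the EntityRuler in the next cell.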

In [12]:
import spacy

# Sample text
text = "This is a sample number 555-5555."

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the ruler and add it
ruler = nlp.add_pipe("entity_ruler")

# List of entities and patterns
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [{"SHAPE": "ddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}],
    }
]

# Add patterns to the ruler
ruler.add_patterns(patterns)

# Create the doc object
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

555-5555 PHONE_NUMBER


If we want to use RegEx instead of linguistic features like shape to capture a specific pattern, such as "555-5555", we can define a RegEx pattern to match that specific format.

In [13]:
pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print(matches)

[('555-5555', '5', '5')]


### Trying to implement a RegEx pattern in spaCy

In [14]:
import spacy

# Sample text
text = "This is a sample number (555) 555-5555."

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the ruler and add it
ruler = nlp.add_pipe("entity_ruler")

# List of entities and patterns
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}]}
]

# Add patterns to the ruler
ruler.add_patterns(patterns)

# Create the doc object
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

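The cell above prints nothing: spaCy's EntityRuler cannot directly use RegEx to match a pattern across tokens. The **REGEX** operator is applied to one token at a time, and the tokenizer splits "555-5555" into three tokens at the hyphen, so no single token contains the full pattern. This limitation becomes apparent when dealing with patterns like phone numbers that contain special characters like hyphens.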
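One common workaround, sketched here rather than taken from the original notebook, is to run **re** over the raw text and convert the character offsets into spans with `Doc.char_span`. The label `PHONE_NUMBER` and the variable names are illustrative. `char_span` returns `None` when the offsets do not line up with token boundaries, so such hits are skipped.

```python
import re
import spacy

nlp = spacy.blank("en")
text = "This is a sample number (555) 555-5555."
doc = nlp(text)

phone_pattern = re.compile(r"\d{3}-\d{4}")

spans = []
for hit in phone_pattern.finditer(doc.text):
    # Map character offsets back onto tokens; None means no clean alignment.
    span = doc.char_span(hit.start(), hit.end(), label="PHONE_NUMBER")
    if span is not None:
        spans.append(span)

# Overwrites any existing entities; the blank pipeline has none.
doc.ents = spans

phone_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(phone_ents)
```

Note that assigning to `doc.ents` replaces the existing entities, so in a pipeline with a trained NER component the new spans would need to be merged with the old ones.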

In [15]:
import spacy

# Sample text
text = "This is a sample number 5555555."

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the ruler and add it
ruler = nlp.add_pipe("entity_ruler")

# List of entities and patterns
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){5})"}}]}
]

# Add patterns to the ruler
ruler.add_patterns(patterns)

# Create the doc object
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

5555555 PHONE_NUMBER


Without the dash and a few modifications to the RegEx, we were able to capture the phone number **5555555**, which is a single token in the spaCy Doc object. The modified pattern looks for five consecutive digits, and because the seven-digit token contains such a run without any intervening characters, the whole token is labeled. As a result, it aligns well with the capabilities of the EntityRuler, allowing us to successfully capture the desired phone number format.
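The five-digit pattern labels a seven-digit token because the match only needs to be found somewhere inside the token text. Assuming spaCy's **REGEX** operator follows the search-style behaviour of `re.search`, the difference between a loose and an anchored pattern can be sketched with plain **re**. The anchored variants are illustrative additions.

```python
import re

token_text = "5555555"

# Unanchored: succeeds because five digits occur somewhere in the token.
loose = re.search(r"\d{5}", token_text)

# Anchored to exactly five digits: fails on a seven-digit token.
strict_five = re.search(r"^\d{5}$", token_text)

# Anchored to exactly seven digits: matches the whole token.
strict_seven = re.search(r"^\d{7}$", token_text)

print(loose.group())         # 55555
print(strict_five)           # None
print(strict_seven.group())  # 5555555
```

Anchoring with `^` and `$` is a simple way to require that the entire token, and not just part of it, fits the pattern.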