# Advanced RegEx in spaCy (multi-word tokens, RegEx's finditer, and spans)

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/02_06_complex_regex.html

## Problems with Multi-Word Tokens in spaCy as Entities

We can use spaCy's Matcher to grab multi-word tokens, that is, sequences that span several tokens. The main problem with this, however, is that these multi-word tokens are not placed into **doc.ents**. This means that we cannot access them the same way we would other entities.

## Extract Multi-Word Tokens

Dealing with multi-word tokens as entities in spaCy is challenging. When using RegEx to capture multi-word tokens, it is important to define patterns that match specific sequences of tokens. In this case, the goal is to capture a multi-word token where the first word is "Paul", followed by a second word that starts with a capital letter.

The RegEx pattern used below, **Paul [A-Z]\w+**, matches the literal word "Paul", a space, and then a single capital letter followed by one or more word characters, up to the next word break. By applying this pattern, it becomes possible to identify and extract the desired multi-word tokens.

In [1]:
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

pattern = r"Paul [A-Z]\w+"

matches = re.finditer(pattern, text)

for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


We did not grab the final **Paul**, which is not followed by a last name. In this case, we are not interested in that Paul.
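
To make the matching behaviour concrete, here is a minimal sketch using only the **re** module. The second sample sentence and the **\b**-anchored variant of the pattern are assumptions added for illustration; they are not part of the original notebook.

```python
import re

text = "Paul Newman was an American actor. The name Paul is quite common."

# "Paul" + space + one capital letter + one or more word characters.
pattern = r"Paul [A-Z]\w+"
names = [m.group() for m in re.finditer(pattern, text)]
print(names)  # ['Paul Newman']  ("Paul is" fails on the lowercase "i")

# Hypothetical sentence: without \b the pattern can start inside a longer word.
tricky = "JeanPaul Belmondo met Paul Newman."
unanchored = re.findall(r"Paul [A-Z]\w+", tricky)
anchored = re.findall(r"\bPaul [A-Z]\w+", tricky)
print(unanchored)  # ['Paul Belmondo', 'Paul Newman']
print(anchored)    # ['Paul Newman']
```

Whether the word boundary is worth adding depends on the data; for the sample text in this notebook both patterns give the same result.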

## Reconstruct Spans

In [18]:
import re
import spacy
from spacy.tokens import Span

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

pattern = r"Paul [A-Z]\w+"

# Blank spaCy en pipeline
nlp = spacy.blank("en")
doc = nlp(text)

With a blank pipeline, this step is unnecessary because **doc.ents** is empty. In other situations, however, the doc will already have entities, so we need to store them in a separate list to which we will append our new ones.

In [4]:
original_ents = list(doc.ents)

Now, we will iterate over the results obtained from **re.finditer()**. We will retrieve the starting and ending character positions of each match. Additionally, we will create a temporary span that corresponds to those starting and ending characters within the **doc** object. It is important to note that tokens and characters may not always align perfectly; in that case **doc.char_span()** returns **None**, which is why we check for it. Finally, we will append the span's starting and ending token positions, together with its text, to **mwt_ents**.

In [5]:
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

In [6]:
print(mwt_ents)

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
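
The alignment caveat can be sketched as follows. The shortened sentence and the character offsets are chosen for illustration, and **alignment_mode** is an option of **Doc.char_span** that the original notebook does not use.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Characters 0-11 cover exactly the tokens "Paul" and "Newman".
aligned = doc.char_span(0, 11)
print(aligned.text, aligned.start, aligned.end)  # Paul Newman 0 2

# Characters 0-8 end in the middle of "Newman", so no span is returned.
misaligned = doc.char_span(0, 8)
print(misaligned)  # None

# alignment_mode="expand" snaps outwards to the tokens covering the range.
expanded = doc.char_span(0, 8, alignment_mode="expand")
print(expanded.text)  # Paul Newman
```

Note that the numbers stored in **mwt_ents** are token indices (**span.start**, **span.end**), not the character offsets returned by **match.span()**.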


## Inject the Spans into doc.ents

Using that data, we can iterate over each stored match and read its starting and ending token positions. It is important to note that we are utilizing the spaCy **Span** class for this purpose. This class enables us to create a span object from those token positions and assign it a customized label. We then append each **Span** object to **original_ents**.

In [7]:
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

In [8]:
print(original_ents)

[Paul Newman, Paul Hollywood]


We have to set **doc.ents** equal to **original_ents**. This operation effectively loads the spans back into the spaCy **doc.ents**.

In [9]:
doc.ents = original_ents

In [10]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON
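
As a possible shortcut, **doc.char_span()** also accepts a **label** argument, so the intermediate list of tuples and the explicit **Span(...)** call can be folded into one step. This is a minimal sketch under that assumption, not the approach taken in the video; the variable names are illustrative.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor, but Paul Hollywood is a British TV Host.")

new_ents = list(doc.ents)  # empty here, because the pipeline is blank
for match in re.finditer(r"Paul [A-Z]\w+", doc.text):
    # char_span can attach the label itself and still returns None on misalignment.
    span = doc.char_span(match.start(), match.end(), label="PERSON")
    if span is not None:
        new_ents.append(span)

doc.ents = new_ents
labeled = [(ent.text, ent.label_) for ent in doc.ents]
print(labeled)  # [('Paul Newman', 'PERSON'), ('Paul Hollywood', 'PERSON')]
```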


## Give Priority to Longer Spans

Sometimes, the situation is not as straightforward. There are cases where our custom RegEx entities may overlap with the entities identified by spaCy.

In [15]:
import re
import spacy

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


Let's consider a scenario where we create a new entity type related to cinema, and we want to label "Hollywood" as "CINEMA". In the previous text, "Hollywood" is clearly part of the name Paul Hollywood. However, let's imagine for a moment that this is not the case. If we attempt to run the same code as before, we will encounter an error.

In [16]:
mwt_ents = []
original_ents = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    print(match)
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

doc.ents = original_ents

<re.Match object; span=(44, 53), match='Hollywood'>


ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

This error indicates that one of the spans found by **finditer()** overlaps with an entity identified by the **NER** component in spaCy. Fortunately, this issue can be resolved using spaCy's **filter_spans** function, which prioritizes longer spans when spans overlap. In this case, the entity "Paul Hollywood" stays classified as PERSON rather than CINEMA because **Hollywood** alone is the shorter span.

In [17]:
from spacy.util import filter_spans
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
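
The overlap and its resolution can also be reproduced without a trained model. In this sketch the two competing spans are built by hand on a blank pipeline, which is an assumption made to keep the example self-contained; in the notebook the PERSON span comes from **en_core_web_sm**.

```python
import re
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor, but Paul Hollywood is a British TV Host.")

# Build the two competing spans by hand instead of running a trained NER model.
person_match = re.search(r"Paul Hollywood", doc.text)
cinema_match = re.search(r"Hollywood", doc.text)
person = doc.char_span(*person_match.span(), label="PERSON")
cinema = doc.char_span(*cinema_match.span(), label="CINEMA")

# Both spans contain the token "Hollywood", so assigning them together fails.
try:
    doc.ents = [person, cinema]
    overlap_error = None
except ValueError as err:
    overlap_error = err
print(overlap_error is not None)  # True

# filter_spans keeps the longer of the two overlapping spans.
kept = filter_spans([person, cinema])
doc.ents = kept
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Paul Hollywood', 'PERSON')]
```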


## Creating a Custom Pipe to Fit Inside the Pipeline

In [19]:
import re
import spacy
from spacy.tokens import Span
from spacy.language import Language

In [20]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

In [21]:
pattern = r"Paul [A-Z]\w+"

In [22]:
matches = re.finditer(pattern, text)
for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


In [28]:
nlp = spacy.blank("en")
doc = nlp(text)
original_ents = list(doc.ents)
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

doc.ents = original_ents
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


### Creating a custom component

In [31]:
@Language.component("paul_ner")
def paul_ner(doc):
    pattern = r"Paul [A-Z]\w+"
    original_ents = list(doc.ents)
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))

    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label="PERSON")
        original_ents.append(per_ent)

    doc.ents = original_ents
    return doc

In [32]:
nlp2 = spacy.blank("en")
nlp2.add_pipe("paul_ner")

<function __main__.paul_ner(doc)>

In [33]:
doc2 = nlp2(text)
print(doc2.ents)

(Paul Newman, Paul Hollywood)
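
The **@Language.component** decorator registers the function under a string name, and **add_pipe** looks the component up by that name. The sketch below illustrates this on a blank pipeline and then processes several texts with **nlp.pipe**. The component name **paul_ner_sketch** and the list of texts are hypothetical, chosen so the example does not clash with the **paul_ner** component above.

```python
import re
import spacy
from spacy.language import Language


# "paul_ner_sketch" is a hypothetical name, distinct from the notebook's "paul_ner".
@Language.component("paul_ner_sketch")
def paul_ner_sketch(doc):
    spans = list(doc.ents)
    for match in re.finditer(r"Paul [A-Z]\w+", doc.text):
        span = doc.char_span(match.start(), match.end(), label="PERSON")
        if span is not None:
            spans.append(span)
    doc.ents = spans
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("paul_ner_sketch")  # looked up by its registered string name
print(nlp.pipe_names)  # ['paul_ner_sketch']

# nlp.pipe runs every component, including ours, on each text in the batch.
texts = ["Paul Newman was an American actor.", "The name Paul is quite common."]
found = [[ent.text for ent in doc.ents] for doc in nlp.pipe(texts)]
print(found)  # [['Paul Newman'], []]
```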


### A typical problem when implementing a custom component

In [35]:
@Language.component("cinema_ner")
def cinema_ner(doc):
    pattern = r"Hollywood"
    original_ents = list(doc.ents)
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))

    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label="CINEMA")
        original_ents.append(per_ent)

    doc.ents = original_ents
    return doc

In [36]:
nlp3 = spacy.load("en_core_web_sm")
nlp3.add_pipe("cinema_ner")

<function __main__.cinema_ner(doc)>

In [38]:
doc3 = nlp3(text)

ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

To fix this error, we have to use **filter_spans** inside the component before assigning to **doc.ents**.

In [39]:
from spacy.util import filter_spans

In [44]:
@Language.component("cinema_ner")
def cinema_ner(doc):
    pattern = r"Hollywood"
    original_ents = list(doc.ents)
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))

    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label="CINEMA")
        original_ents.append(per_ent)

    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return doc

In [45]:
nlp4 = spacy.load("en_core_web_sm")
nlp4.add_pipe("cinema_ner")

<function __main__.cinema_ner(doc)>

In [48]:
doc4 = nlp4(text)
for ent in doc4.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
Paul PERSON
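
Because components run in the order they were added, a later component sees the entities set by earlier ones. The following sketch chains two RegEx components on a blank pipeline to show how **filter_spans** arbitrates between them. The helper **add_regex_ents**, the component names, and the sample sentence are all hypothetical and only serve the illustration.

```python
import re
import spacy
from spacy.language import Language
from spacy.util import filter_spans


def add_regex_ents(doc, pattern, label):
    """Hypothetical helper: merge RegEx matches into doc.ents, longest span wins."""
    spans = list(doc.ents)
    for match in re.finditer(pattern, doc.text):
        span = doc.char_span(match.start(), match.end(), label=label)
        if span is not None:
            spans.append(span)
    doc.ents = filter_spans(spans)
    return doc


@Language.component("person_sketch")
def person_sketch(doc):
    return add_regex_ents(doc, r"Paul [A-Z]\w+", "PERSON")


@Language.component("cinema_sketch")
def cinema_sketch(doc):
    return add_regex_ents(doc, r"Hollywood", "CINEMA")


nlp = spacy.blank("en")
nlp.add_pipe("person_sketch")
nlp.add_pipe("cinema_sketch")  # added last, so it sees the PERSON entities

doc = nlp("Paul Hollywood visited Hollywood.")
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)  # [('Paul Hollywood', 'PERSON'), ('Hollywood', 'CINEMA')]
```

The "Hollywood" inside "Paul Hollywood" is dropped because it overlaps a longer span, while the standalone "Hollywood" is kept as CINEMA.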
