### Multi-Word Token Entities and RegEx

##### problems
- a regex match can span several spaCy tokens (e.g. `Paul Newman`), and such multi-word matches are not placed into `doc.ents` automatically
- this means we cannot access them the same way we would other entities

##### extract multiword tokens
- goal: grab a multi-word match, here a person whose first name is Paul
- in the regex, look for the literal string `Paul` followed by a space (the match is case-sensitive)
- then require a capitalised letter (`[A-Z]`)
- then grab the rest of the second word, up to the end of the word (`\w+`)

In [1]:
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

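# "Paul", a space, one capital letter, then the rest of that word
pattern = r"Paul [A-Z]\w+"

# finditer yields one Match object per non-overlapping match
matches = re.finditer(pattern, text)

for match in matches:
    print(match)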

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>
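
The match objects above carry character offsets, which is what the next step needs. A minimal sketch of pulling out those offsets and the matched text with the standard `re` API (the variable name `hits` is only illustrative):

```python
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"

# Collect (start, end, matched_text) for every match.
# start and end are character positions in text, not token positions.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]
print(hits)
# [(0, 11, 'Paul Newman'), (39, 53, 'Paul Hollywood')]

# Slicing the text with the offsets recovers the matched string.
for start, end, matched in hits:
    assert text[start:end] == matched
```

The bare `Paul` in the last sentence is skipped because the following word (`is`) does not start with a capital letter.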


##### reconstruct spans

In [2]:
import re
import spacy
from spacy.tokens import Span

In [3]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"

In [4]:
nlp = spacy.blank("en")
doc = nlp(text)

In [5]:
original_ents = list(doc.ents)

In [6]:
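mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()  # character offsets in doc.text
    # char_span returns None when the offsets do not line up with token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        # store token indices (not character offsets) plus the matched text
        mwt_ents.append((span.start, span.end, span.text))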
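
The `if span is not None` check matters because `doc.char_span` is strict by default. The sketch below, on a shortened example sentence, shows a character range that ends in the middle of a token. It assumes a spaCy version whose `Doc.char_span` accepts the `alignment_mode` argument.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Characters 0-6 cover "Paul N", which ends inside the token "Newman".
strict = doc.char_span(0, 6)  # default alignment_mode is "strict"
expanded = doc.char_span(0, 6, alignment_mode="expand")

print(strict)         # None: the offsets do not match token boundaries
print(expanded.text)  # Paul Newman: snapped outward to whole tokens
```

Whether to skip misaligned matches or expand them depends on the pattern; the cells in this section skip them.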

##### inject the Spans into the doc.ents
- iterate over each stored match and unpack its start and end token indices
- create a `Span` object from those indices and assign it a label (here `PERSON`)

In [7]:
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

In [8]:
doc.ents = original_ents

In [9]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON
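
The reconstruct and inject steps can also be folded into one helper. This is only a sketch: `add_regex_ents` is a hypothetical name, not a spaCy function, and it passes `label` straight to `doc.char_span` instead of building a `Span` by hand.

```python
import re
import spacy

def add_regex_ents(doc, pattern, label):
    """Add every regex match in doc.text to doc.ents under the given label."""
    new_ents = list(doc.ents)  # keep whatever entities are already there
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        # char_span converts character offsets to a token-based Span
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip matches that cut through a token
            new_ents.append(span)
    doc.ents = new_ents  # raises ValueError if any spans overlap
    return doc

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
nlp = spacy.blank("en")
doc = add_regex_ents(nlp(text), r"Paul [A-Z]\w+", "PERSON")

for ent in doc.ents:
    print(ent.text, ent.label_)
```

With a blank pipeline there are no existing entities, so nothing can overlap; with a trained pipeline the assignment may fail on overlapping spans.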


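##### give priority to longer spans
- `doc.ents` cannot hold overlapping spans; assigning them raises an error
- `filter_spans` resolves overlaps by keeping the longest span, so `Paul Hollywood` wins over a shorter `Hollywood` match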

In [10]:
import re
import spacy

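text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
# this pattern matches inside the longer entity "Paul Hollywood"
pattern = r"Hollywood"

# the pretrained pipeline tags entities on its own, unlike spacy.blank("en")
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)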

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


In [12]:
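from spacy.util import filter_spans

# original_ents must hold this doc's entities plus the regex-matched spans.
# filter_spans drops overlaps, preferring the longest span.
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)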

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
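
The notes do not show the step that builds the overlapping candidates, so here is a self-contained sketch of the whole conflict. It makes two assumptions: it uses `spacy.blank("en")` with regex-derived `PERSON` spans in place of the pretrained model's entities, so no model download is needed, and the `CINEMA` label for the `Hollywood` match is purely illustrative.

```python
import re
import spacy
from spacy.util import filter_spans

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
nlp = spacy.blank("en")
doc = nlp(text)

# Candidate spans from two patterns; "Hollywood" overlaps "Paul Hollywood".
candidates = []
for pattern, label in [(r"Paul [A-Z]\w+", "PERSON"), (r"Hollywood", "CINEMA")]:
    for match in re.finditer(pattern, doc.text):
        span = doc.char_span(*match.span(), label=label)
        if span is not None:
            candidates.append(span)

# Assigning overlapping spans directly is rejected.
try:
    doc.ents = candidates
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

# filter_spans keeps the longest span wherever candidates overlap.
doc.ents = filter_spans(candidates)
result = [(ent.text, ent.label_) for ent in doc.ents]
print(overlap_rejected, result)
```

The one-token `Hollywood` span is dropped because the two-token `Paul Hollywood` span covers it, which is why the output above still shows `Paul Hollywood PERSON`.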
