- Phone numbers come in many formats, which makes them tricky to match. During tokenization, spaCy leaves sequences of digits intact and splits only on whitespace and punctuation. A match pattern therefore has to look for digit sequences of a certain length, surrounded by specific punctuation, depending on national conventions.

- Suppose you want to match numbers written like `(123) 4567 8901` or `(123) 4567-8901`.

- A pattern that covers both forms, where `"OP": "?"` makes the hyphen optional: `[{"ORTH":"("}, {"SHAPE":"ddd"}, {"ORTH":")"}, {"SHAPE":"dddd"}, {"ORTH":"-", "OP":"?"}, {"SHAPE":"dddd"}]`
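
To see why the pattern is written token by token, it can help to inspect what the tokenizer produces and which `SHAPE` each token gets. The snippet below is a minimal sketch, not part of the original notebook. It assumes a blank English pipeline (`spacy.blank("en")`), which provides only the tokenizer and needs no model download.

```python
import spacy

# Assumption: a blank English pipeline is enough, since only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("(123) 4567-8901")

# Pair each token's text with its shape ("d" stands for a digit).
token_shapes = [(t.text, t.shape_) for t in doc]
print(token_shapes)

# Shapes truncate runs of the same character type after four characters,
# so a five-digit token has the same shape as a four-digit one.
long_shape = nlp("12345")[0].shape_
print(long_shape)
```

Because of that truncation, `{"SHAPE": "dddd"}` also accepts tokens with more than four digits. If the exact length matters, one option is to add a `LENGTH` constraint to the same token dictionary, for example `{"SHAPE": "dddd", "LENGTH": 4}`.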

In [3]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
pattern = [{"ORTH":"("}, {"SHAPE":"ddd"}, {"ORTH":")"}, {"SHAPE":"dddd"}, {"ORTH":"-", "OP":"?"}, {"SHAPE":"dddd"}]

In [6]:
matcher = Matcher(nlp.vocab)
matcher.add("PhoneNumber", [pattern])

In [7]:
doc = nlp("Call me at (123) 4560-7890")

In [8]:
print([t.text for t in doc])

['Call', 'me', 'at', '(', '123', ')', '4560', '-', '7890']


In [9]:
matches = matcher(doc)
matches

[(7978097794922043545, 3, 9)]

In [10]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

(123) 4560-7890
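
The run above only exercises the hyphenated form. As a quick check that the optional hyphen behaves as intended, the sketch below runs the same pattern over both target formats and resolves the numeric `match_id` back to its string name through `nlp.vocab.strings`. It assumes a blank English pipeline instead of `en_core_web_sm`, since the matcher here relies only on token text and shape. The `texts` list and the `found` variable are illustrative names.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline suffices because the pattern uses only ORTH and SHAPE.
nlp = spacy.blank("en")

pattern = [
    {"ORTH": "("},
    {"SHAPE": "ddd"},
    {"ORTH": ")"},
    {"SHAPE": "dddd"},
    {"ORTH": "-", "OP": "?"},  # the hyphen may or may not be present
    {"SHAPE": "dddd"},
]

matcher = Matcher(nlp.vocab)
matcher.add("PhoneNumber", [pattern])

# Illustrative inputs: one per target format.
texts = ["Call me at (123) 4567 8901", "Call me at (123) 4567-8901"]

found = []
for text in texts:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        # Look up the rule name that was passed to matcher.add().
        label = nlp.vocab.strings[match_id]
        found.append((label, doc[start:end].text))

print(found)
```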
