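- Named entity recognition (NER): finds the named entities mentioned in raw text and classifies them into predefined categories
    - Example categories: person names, organizations, locations, medical codes, time expressions, quantities, percentages, monetary values, etc.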
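
To make these categories concrete, the sketch below prints spaCy's glossary description for a handful of entity labels. The label list is an assumption: it is a hand-picked subset of the OntoNotes-style labels used by the English models, not the complete set, and not every category above (medical codes, for example) has a matching label in those models.

```python
import spacy

# Hand-picked subset of entity labels (an assumption, not the full label set).
labels = ["PERSON", "ORG", "GPE", "DATE", "QUANTITY", "PERCENT", "MONEY"]

# spacy.explain() looks up a short human-readable description for a label.
label_descriptions = {label: spacy.explain(label) for label in labels}

for label, description in label_descriptions.items():
    print(f"{label}: {description}")
```

`spacy.explain` returns `None` for labels it does not know, so the same lookup can double as a quick check on custom label names.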

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [4]:
def show_ents(doc):
    # Print each entity's text, its label, and spaCy's description of that label
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + '-' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No entities found')

In [5]:
doc = nlp(u'Hi how are you?')

In [6]:
show_ents(doc)

No entities found


In [7]:
doc = nlp(u"May I go to Washington, DC next May to see the Washington Monument?")

In [8]:
show_ents(doc)

Washington, DC-GPE - Countries, cities, states
next May-DATE - Absolute or relative dates or periods
the Washington Monument-ORG - Companies, agencies, institutions, etc.


In [9]:
doc = nlp(u"Can I please have 500 dollars of Microsoft stock?")

In [10]:
show_ents(doc)

500 dollars-MONEY - Monetary values, including unit
Microsoft-ORG - Companies, agencies, institutions, etc.


In [11]:
doc = nlp(u"Tesla to build a U.K. factory for $6 million")

In [12]:
show_ents(doc)

U.K.-GPE - Countries, cities, states
$6 million-MONEY - Monetary values, including unit


### When Tesla is not recognized as a company name

In [13]:
from spacy.tokens import Span

In [15]:
ORG = doc.vocab.strings[u"ORG"]

In [16]:
ORG

381

In [17]:
# Span over token 0 up to (but not including) token 1, i.e. "Tesla"
new_ent = Span(doc,0,1,label=ORG)

In [18]:
doc.ents = list(doc.ents) + [new_ent]

In [19]:
show_ents(doc)

Tesla-ORG - Companies, agencies, institutions, etc.
U.K.-GPE - Countries, cities, states
$6 million-MONEY - Monetary values, including unit
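
The same manual fix can be sketched without looking up the label hash first. This is a minimal sketch under two assumptions: a reasonably recent spaCy version in which `Span` accepts a string label, and a blank English pipeline (`spacy.blank("en")`) standing in for `en_core_web_sm` so that no model download is needed. A blank pipeline has no NER component, so `doc.ents` starts out empty here.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no named entity recognizer (an assumption
# made to keep the sketch self-contained).
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla to build a U.K. factory for $6 million")

# Token 0 is "Tesla"; the label is passed as a plain string.
tesla_ent = Span(doc_blank, 0, 1, label="ORG")

# doc.ents is a tuple, so build a new list and assign it back.
doc_blank.ents = list(doc_blank.ents) + [tesla_ent]

for ent in doc_blank.ents:
    print(ent.text, ent.label_)
```

Assigning to `doc.ents` fails with a `ValueError` if two spans overlap, so a span added this way must not cover tokens that already belong to another entity.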


In [20]:
doc = nlp(u"Our company created a brand new vacuum cleaner. "
          u"This new vacuum-cleaner is the best in show.")

In [21]:
show_ents(doc)

No entities found


In [22]:
from spacy.matcher import PhraseMatcher

In [23]:
matcher = PhraseMatcher(nlp.vocab)

In [24]:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']

In [25]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [26]:
# spaCy v2 signature; spaCy v3 expects matcher.add('newproduct', phrase_patterns)
matcher.add('newproduct',None,*phrase_patterns)

In [27]:
found_matches = matcher(doc)

In [29]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [30]:
from spacy.tokens import Span

In [31]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [32]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [34]:
# Each match is (match_id, start, end); turn every matched token range into a PRODUCT span
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in found_matches]

In [35]:
doc.ents = list(doc.ents) + new_ents

In [36]:
show_ents(doc)

vacuum cleaner-PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner-PRODUCT - Objects, vehicles, foods, etc. (not services)
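
The matcher steps above can be bundled into one helper. This is a sketch, not part of the original notebook: `add_phrase_entities` is a hypothetical name, the list form of `matcher.add` assumes spaCy v3 (or a late v2 release), and `spacy.blank("en")` is used in place of `en_core_web_sm` so the example runs without a model. `spacy.util.filter_spans` is added to drop overlapping spans, which the original cells do not do.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


def add_phrase_entities(nlp, doc, phrases, label):
    """Hypothetical helper: tag every occurrence of `phrases` in `doc` as `label`."""
    matcher = PhraseMatcher(nlp.vocab)
    # make_doc only tokenizes, which is all a phrase pattern needs.
    matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])
    new_ents = [Span(doc, start, end, label=label) for _, start, end in matcher(doc)]
    # filter_spans keeps the longest span when candidates overlap.
    doc.ents = filter_spans(list(doc.ents) + new_ents)
    return doc


nlp_blank = spacy.blank("en")
product_doc = nlp_blank(
    "Our company created a brand new vacuum cleaner. "
    "This new vacuum-cleaner is the best in show."
)
add_phrase_entities(nlp_blank, product_doc, ["vacuum cleaner", "vacuum-cleaner"], "PRODUCT")
print([(ent.text, ent.label_) for ent in product_doc.ents])
```

Because `filter_spans` prefers longer spans, a new phrase match can replace a shorter entity that the model had already found; whether that is desirable depends on the use case.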


In [37]:
doc = nlp(u"Originally I paid $29.95 for this car toy, but now it is marked down by 10 dollars.")

In [39]:
len([ent for ent in doc.ents if ent.label_ == "MONEY"])

2
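
Counting a single label generalizes to a tally over all labels with `collections.Counter`. The sketch below is self-contained under stated assumptions: the `Doc` is built from a hand-written word list and its two MONEY spans are assigned manually, standing in for what a trained model would predict. `count_entities` is a hypothetical helper name.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc, Span


def count_entities(doc):
    """Hypothetical helper: return a Counter mapping entity label to frequency."""
    return Counter(ent.label_ for ent in doc.ents)


nlp_blank = spacy.blank("en")

# Hand-tokenized sentence; the entity spans below are assigned manually
# (an assumption standing in for model predictions).
words = ["I", "paid", "$", "29.95", "but", "now", "it", "is", "10", "dollars", "off", "."]
money_doc = Doc(nlp_blank.vocab, words=words)
money_doc.ents = [
    Span(money_doc, 2, 4, label="MONEY"),   # "$ 29.95"
    Span(money_doc, 8, 10, label="MONEY"),  # "10 dollars"
]

entity_counts = count_entities(money_doc)
print(entity_counts["MONEY"])
```

A `Counter` returns 0 for labels that never occur, so `entity_counts["ORG"]` is safe to read even when the document contains no organizations.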