# Part-3 Combining NLP Models and Custom Rules
You can combine statistical and rule-based components in a variety of ways. Rule-based components can improve the accuracy of statistical models by presetting tags, entities, or sentence boundaries for specific tokens. The statistical models usually respect these preset annotations, which sometimes improves the accuracy of other decisions. You can also run rule-based components after a statistical model to correct common errors. Finally, rule-based components can reference the attributes set by statistical models in order to implement more abstract logic.
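As a minimal sketch of the "presetting entities" idea, the snippet below adds spaCy's built-in `entity_ruler` to a pipeline and lets it assign entities from hand-written patterns. It assumes a blank English pipeline (so no trained model is needed to run it), and the two patterns are illustrative only. In a trained pipeline, the same component could be added with `before="ner"` so that the statistical recognizer sees the preset entities.

```python
import spacy

# Assumption: a blank pipeline stands in for a trained one in this sketch.
nlp = spacy.blank("en")

# The entity ruler assigns entities from rules instead of a statistical model.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact string match
    {"label": "ORG", "pattern": "Google"},
    # Token pattern: one dict per token, matched case-insensitively
    {"label": "PERSON", "pattern": [{"LOWER": "alex"}, {"LOWER": "smith"}]},
])

doc = nlp("Alex Smith works at Google")
preset_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(preset_ents)
```

Both entities here come purely from the rules, since the blank pipeline has no statistical components.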

## Expanding Named Entities
For example, the corpus spaCy's English models were trained on defines a PERSON entity as just the person name, without titles like "Mr" or "Dr". This makes sense, because it makes it easier to resolve the entity type back to a knowledge base. But what if your application needs the full names, including the titles?

- Mr. Ridvan Yigit 

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy
from spacy.language import Language

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp("Dr. Alex Smith chaired first board meeting at Google")
doc

Dr. Alex Smith chaired first board meeting at Google

In [5]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]


In [6]:
@Language.component("add_title")  # Register the component
def add_title(doc):
  new_ents = []
  for ent in doc.ents:
    # Extend a PERSON entity one token to the left when a title precedes it
    if ent.label_ == "PERSON" and ent.start != 0:
      prev_token = doc[ent.start - 1]
      if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Mrs", "Mrs."):
        ent = Span(doc, ent.start - 1, ent.end, label=ent.label_)
    # Keep every entity, expanded or not, so non-PERSON entities are not dropped
    new_ents.append(ent)
  doc.ents = new_ents
  return doc

In [7]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("add_title", after="ner")

<function __main__.add_title(doc)>

In [8]:
doc = nlp("Dr. Alex Smith chaired first board meeting at Google")

In [9]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Dr. Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]
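One way to check a title-expansion rule independently of the statistical model is to build a `Doc` by hand and set its entities explicitly. The sketch below does that under a few assumptions: `expand_titles` and `TITLES` are hypothetical names, the entity spans are hand-assigned instead of predicted, and this variant of the rule keeps every entity that is not expanded.

```python
import spacy
from spacy.tokens import Doc, Span

# Hypothetical set of titles to attach to a following PERSON entity
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Mrs", "Mrs."}

def expand_titles(doc):
    """Extend PERSON entities to include a preceding title token."""
    new_ents = []
    for ent in doc.ents:
        has_title = (
            ent.label_ == "PERSON"
            and ent.start != 0
            and doc[ent.start - 1].text in TITLES
        )
        if has_title:
            ent = Span(doc, ent.start - 1, ent.end, label=ent.label_)
        new_ents.append(ent)  # unchanged entities are kept as they are
    doc.ents = new_ents
    return doc

# Build the Doc manually; only the vocab of a blank pipeline is needed.
vocab = spacy.blank("en").vocab
words = ["Dr.", "Alex", "Smith", "met", "Mary", "Jones", "at", "Google"]
doc = Doc(vocab, words=words)
doc.ents = [
    Span(doc, 1, 3, label="PERSON"),  # Alex Smith (title precedes)
    Span(doc, 4, 6, label="PERSON"),  # Mary Jones (no title)
    Span(doc, 7, 8, label="ORG"),     # Google
]

doc = expand_titles(doc)
expanded = [(ent.text, ent.label_) for ent in doc.ents]
print(expanded)
```

Because the entities are set by hand, a failure here points at the rule itself and not at the model's predictions.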


## Using POS Tags and Dependency Parsing

In [10]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("Alex Smith was working at Google")

In [12]:
displacy.render(doc, style="dep", options={"compact":True, "distance":150})

In [13]:
@Language.component("get_person_orgs")
def get_person_orgs(doc):
  person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
  for ent in person_entities:
    head = ent.root.head
    if head.lemma_ == "work":
      preps = [token for token in head.children if token.dep_ == "prep"]
      for prep in preps:
        orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
        print({"person": ent, "orgs": orgs, "past": head.tag_ == "VBD"})
  return doc

In [14]:
#from spacy.pipeline import merge_entities
#nlp.add_pipe("merge_entities")
nlp.add_pipe("get_person_orgs")

<function __main__.get_person_orgs(doc)>

In [15]:
doc = nlp("Alex Smith was working at Google")
displacy.render(doc, style="dep", options={"compact":True, "distance":150})


{'person': Alex Smith, 'orgs': [Google], 'past': False}


### Modifying the Rule

In [16]:
@Language.component("get_person_orgs")
def get_person_orgs(doc):
  person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
  for ent in person_entities:
    head = ent.root.head
    if head.lemma_ == "work":
      # In "was working" the head is tagged VBG, so the tense is carried by the auxiliary
      aux = [token for token in head.children if token.dep_ == "aux"]
      past_aux = any(t.tag_ == "VBD" for t in aux)
      past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
      preps = [token for token in head.children if token.dep_ == "prep"]
      for prep in preps:
        orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
        print({"person": ent, "orgs": orgs, "past": past})
  return doc

In [17]:
# Reload the pipeline so that it picks up the redefined component
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("get_person_orgs")
doc = nlp("Alex Smith was working at Google")

{'person': Alex Smith, 'orgs': [Google], 'past': True}
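The tense logic can also be exercised without a trained model by constructing a `Doc` with hand-written tags, lemmas, heads, dependency labels, and entities. The following is a sketch under those assumptions: `person_org_relations` and `make_doc` are hypothetical helpers, and the annotations are written by hand to mirror what a parser would typically produce for these sentences. The function returns the relations instead of printing them.

```python
import spacy
from spacy.tokens import Doc

def person_org_relations(doc):
    """Collect person/org relations around the verb 'work', with a past-tense flag."""
    relations = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        head = ent.root.head
        if head.lemma_ != "work":
            continue
        # Past tense: either "worked" (VBD) or "was working" (VBG + past auxiliary)
        aux = [t for t in head.children if t.dep_ == "aux"]
        past_aux = any(t.tag_ == "VBD" for t in aux)
        past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
        for prep in (t for t in head.children if t.dep_ == "prep"):
            orgs = [t.text for t in prep.children if t.ent_type_ == "ORG"]
            relations.append({"person": ent.text, "orgs": orgs, "past": past})
    return relations

vocab = spacy.blank("en").vocab

def make_doc(aux_word, aux_tag):
    # Hand-annotated parse of "Alex Smith <aux> working at Google";
    # heads are absolute token indices.
    return Doc(
        vocab,
        words=["Alex", "Smith", aux_word, "working", "at", "Google"],
        tags=["NNP", "NNP", aux_tag, "VBG", "IN", "NNP"],
        lemmas=["Alex", "Smith", "be", "work", "at", "Google"],
        heads=[1, 3, 3, 3, 3, 4],
        deps=["compound", "nsubj", "aux", "ROOT", "prep", "pobj"],
        ents=["B-PERSON", "I-PERSON", "O", "O", "O", "B-ORG"],
    )

past_relations = person_org_relations(make_doc("was", "VBD"))
present_relations = person_org_relations(make_doc("is", "VBZ"))
print(past_relations)
print(present_relations)
```

With these annotations, the "was working" sentence is flagged as past while "is working" is not, which is the distinction the auxiliary check is meant to capture.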
