In [2]:
%load_ext autoreload
%autoreload 2
import re, spacy
from spacy.matcher import DependencyMatcher
from scify.nlp import show_tabs, visualise_subtrees, visualise_doc, check_for_non_trees, construct_pattern



In [3]:
nlp = spacy.load("en_core_sci_sm")

In [7]:
doc = nlp("Alzheimer's disease causes dementia, mostly in old people")
show_tabs(doc)
doc[4].nbor(), doc[4].nbor(2), [chunk for chunk in doc.noun_chunks]


token      lemma      POS    Tag    DEP     shape    is_alpha    is_stop
---------  ---------  -----  -----  ------  -------  ----------  ---------
Alzheimer  alzheimer  NOUN   NN     poss    Xxxxx    True        False
's         's         PART   POS    case    'x       False       True
disease    disease    NOUN   NN     nsubj   xxxx     True        False
causes     cause      VERB   VBZ    ROOT    xxxx     True        False
dementia   dementia   NOUN   NN     dobj    xxxx     True        False
,          ,          PUNCT  ,      punct   ,        False       False
mostly     mostly     ADV    RB     advmod  xxxx     True        True
in         in         ADP    IN     prep    xx       True        True
old        old        ADJ    JJ     amod    xxx      True        False
people     people     NOUN   NNS    pobj    xxxx     True        False


(,, mostly, [Alzheimer's disease, dementia, old people])

Overall, there are two complicated things about using the DependencyMatcher.

The NBOR_RELOP field only lets you express structural relations between nodes (such as parent, child, governor, or precedes), not relations based on dependency types. Instead, we express the structural relation through the SPEC field and the dependency label through the PATTERN field: spaCy tokens have a Token.dep_ attribute, which holds the dependency label of the arc connecting the token to its parent.

When you pass your pattern into the DependencyMatcher, it walks over the rules in order, constructing its internal representation. As it does so, it assumes the rules are ordered so that any node you reference in an NBOR_NAME field has already been defined by an earlier rule in the list. In the next section, we'll introduce a way to overcome this by making a simplifying assumption about what we want to match against.
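The ordering requirement can be checked before a pattern ever reaches the matcher. The following is a minimal sketch in plain Python, assuming the rule format used in this notebook (a list of dicts with SPEC and PATTERN keys); `find_forward_references` is a hypothetical helper, not part of spaCy or scify.

```python
def find_forward_references(pattern):
    """Return (node, referenced_name) pairs where NBOR_NAME points at a
    node that no earlier rule has defined via NODE_NAME."""
    seen = set()
    problems = []
    for rule in pattern:
        spec = rule["SPEC"]
        nbor = spec.get("NBOR_NAME")  # absent on the root rule
        if nbor is not None and nbor not in seen:
            problems.append((spec["NODE_NAME"], nbor))
        seen.add(spec["NODE_NAME"])
    return problems


# A correctly ordered pattern: the root is defined before it is referenced.
good = [
    {"SPEC": {"NODE_NAME": "causes"}, "PATTERN": {"LEMMA": "cause"}},
    {"SPEC": {"NODE_NAME": "START_ENTITY", "NBOR_RELOP": ">", "NBOR_NAME": "causes"},
     "PATTERN": {"DEP": "nsubj"}},
]
# The same rules reversed: the child references a node not yet seen.
bad = list(reversed(good))

print(find_forward_references(good))  # []
print(find_forward_references(bad))   # [('START_ENTITY', 'causes')]
```

An empty result means every NBOR_NAME refers to a node defined earlier in the list, which is the property the matcher relies on.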

In [9]:
visualise_doc(doc)

{
    "POS": "NOUN",
    "LEMMA": "dog"
}
This pattern dictionary matches any noun whose lemma is "dog".

When thinking about matching on subtrees, it makes everything easier if we make the following simplifying assumption:

**We will only create matching rules for nodes of the form Parent > Child.**

This helps resolve both of the complicated things we noticed above:

Specifying the dependency relation that connects two nodes now corresponds to setting {"DEP": "relation_type"} in the pattern dictionary of the child token.

Because every rule now references only a node's parent, adding the rules in depth-first search order guarantees that the DependencyMatcher accepts our subtrees: a node's parent will always have been defined before a child tries to reference it.
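To make the depth-first ordering concrete, here is a rough sketch that turns (parent, dependency, child) triples into rules. It is not the scify implementation: `dfs_rules` is a hypothetical name, the root is matched on its surface text via ORTH purely for illustration, and the MODIFIER edge is an invented example added to show a second level of depth.

```python
from collections import defaultdict


def dfs_rules(triples):
    """Order rules so that each parent is emitted before its children."""
    children = defaultdict(list)
    all_children = set()
    for parent, dep, child in triples:
        children[parent].append((dep, child))
        all_children.add(child)

    # The root is the only node that never appears as a child.
    roots = [p for p in children if p not in all_children]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    root = roots[0]

    rules = [{"SPEC": {"NODE_NAME": root}, "PATTERN": {"ORTH": root}}]
    seen = {root}

    def visit(parent):
        for dep, child in children.get(parent, []):
            if child in seen:
                raise ValueError("node %r has more than one parent" % child)
            seen.add(child)
            rules.append({
                "SPEC": {"NODE_NAME": child, "NBOR_RELOP": ">", "NBOR_NAME": parent},
                "PATTERN": {"DEP": dep},
            })
            visit(child)  # go deeper before moving to the next sibling

    visit(root)
    return rules


triples = [
    ["causes", "nsubj", "START_ENTITY"],
    ["causes", "dobj", "END_ENTITY"],
    ["START_ENTITY", "amod", "MODIFIER"],  # hypothetical extra edge
]
names = [rule["SPEC"]["NODE_NAME"] for rule in dfs_rules(triples)]
print(names)  # ['causes', 'START_ENTITY', 'MODIFIER', 'END_ENTITY']
```

Note that MODIFIER is emitted directly after its parent START_ENTITY and before END_ENTITY, so no rule ever references a node that has not yet been defined.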

In [10]:
examples = [
        ["causes", "nsubj", "START_ENTITY"],
        ["causes", "dobj", "END_ENTITY"]
        ]

s = check_for_non_trees(examples)
s

('causes',
 defaultdict(list,
             {'causes': [('nsubj', 'START_ENTITY'), ('dobj', 'END_ENTITY')],
              'START_ENTITY': [],
              'END_ENTITY': []}))

In [12]:
construct_pattern(examples)

[{'SPEC': {'NODE_NAME': 'causes'}, 'PATTERN': {'LEMMA': 'cause'}},
 {'SPEC': {'NODE_NAME': 'START_ENTITY',
   'NBOR_RELOP': '>',
   'NBOR_NAME': 'causes'},
  'PATTERN': {'DEP': 'nsubj', 'POS': 'NOUN'}},
 {'SPEC': {'NODE_NAME': 'END_ENTITY',
   'NBOR_RELOP': '>',
   'NBOR_NAME': 'causes'},
  'PATTERN': {'DEP': 'dobj', 'POS': 'NOUN'}}]

In [13]:
matcher = DependencyMatcher(nlp.vocab)

In [15]:
matcher.add("pattern1", None, construct_pattern(examples))

In [19]:
matches = matcher(doc)
matches

[(13439661873955722336, [[3, 2, 4]])]
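The raw match is a list of token indices, which is easier to read once paired with the node names. The sketch below assumes, as the output above suggests, that the indices come back in the same order as the rules in the pattern (here causes, START_ENTITY, END_ENTITY). `label_matches` is a hypothetical helper, and the token list is copied from the table earlier in the notebook so the example runs without spaCy.

```python
def label_matches(matches, node_names, tokens):
    """Pair each matched token index with the node name of the rule at
    the same position. Assumes indices follow the rule order."""
    labelled = []
    for _match_id, subtrees in matches:
        for token_ids in subtrees:
            labelled.append(
                {name: tokens[i] for name, i in zip(node_names, token_ids)}
            )
    return labelled


# Values copied from the cells above.
matches = [(13439661873955722336, [[3, 2, 4]])]
node_names = ["causes", "START_ENTITY", "END_ENTITY"]
tokens = ["Alzheimer", "'s", "disease", "causes", "dementia",
          ",", "mostly", "in", "old", "people"]

print(label_matches(matches, node_names, tokens))
# [{'causes': 'causes', 'START_ENTITY': 'disease', 'END_ENTITY': 'dementia'}]
```

Read this way, the match says that "disease" is the subject and "dementia" the object of "causes".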