Ontological Engineering:
NLP parsing and knowledge graph construction

In [211]:
#!pip install -U spacy
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_trf
# sm, md, lg, trf



In [212]:
import spacy
import spacy_transformers
# nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer", "morphologizer"])
nlp = spacy.load("en_core_web_trf")

nlp.add_pipe("merge_entities")     # merges named entities
nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks

# import en_core_web_trf
# nlp = en_core_web_trf.load()
print(nlp.pipe_names)
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'merge_entities', 'merge_noun_chunks']
[('This', 'PRON'), ('is', 'AUX'), ('a sentence', 'NOUN'), ('.', 'PUNCT')]


In [213]:
import spacy

# 1) Load model & text

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("merge_entities")     # merges named entities
nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks
nlp.add_pipe("merge_subtokens")  

text = (
        "Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail. "
        "Relative to its large and powerful hind limbs, the forelimbs of Tyrannosaurus were short but unusually powerful for their size, and they had two clawed digits. "
        "The most complete specimen measures 12.3–12.4 m (40–41 ft) in length, but according to most modern estimates, Tyrannosaurus could have exceeded sizes of 13 m (43 ft) in length, 3.7–4 m (12–13 ft) in hip height, and 8.8 t (8.7 long tons; 9.7 short tons) in mass."
        "Although some other theropods might have rivaled or exceeded Tyrannosaurus in size, it is still among the largest known land predators, with its estimated bite force being the largest among all terrestrial animals. "
        "By far the largest carnivore in its environment, Tyrannosaurus rex was most likely an apex predator, preying upon hadrosaurs, juvenile armored herbivores like ceratopsians and ankylosaurs, and possibly sauropods. Some experts have suggested the dinosaur was primarily a scavenger. "
        "The question of whether Tyrannosaurus was an apex predator or a pure scavenger was among the longest debates in paleontology. " 
        "Most paleontologists today accept that Tyrannosaurus was both a predator and a scavenger." 
    )

text2 = ("Born to Jedi Knight Anakin Skywalker (later Darth Vader) and Senator Padmé Amidala, Luke was raised by his Aunt Beru and Uncle Owen on the desert planet Tatooine. "
         "Unaware of his true parentage, he worked as a moisture farmer until fate intervened: he encountered the droid R2-D2, which carried Princess Leia’s plea for help, setting Luke on a path that would forever change the galaxy. "
         "Under the tutelage of Obi-Wan “Ben” Kenobi and, later, Jedi Master Yoda on Dagobah, Luke learned to harness the Force and hone his skills with a lightsaber. "
         "His rigorous training culminated in a fierce confrontation with Darth Vader aboard Cloud City, where he faced not only the dark side’s power but also the shattering revelation of his lineage.")
doc = nlp(text) 

# 2) Print named entities
print("=== Entities ===")
for ent in doc.ents:
    print(f"{ent.text:20}  {ent.label_}  [{ent.start_char}:{ent.end_char}]")

# 3) Print dependency arcs sentence by sentence
print("\n=== Dependency Parse ===")
for i, sent in enumerate(doc.sents, 1):
    print(f"\nSentence {i}: {sent.text}")
    for token in sent:
        # token.dep_: dependency label; token.head: the "governor"
        print(f"  {token.text:12} → {token.head.text:12}  ({token.dep_})")
        
print([(w.text, w.pos_) for w in doc])


=== Entities ===
12.3–12.4 m           QUANTITY  [314:325]
13 m                  QUANTITY  [431:435]
43 ft                 QUANTITY  [437:442]
3.7–4 m               QUANTITY  [455:462]
8.8 t                 QUANTITY  [493:498]
8.7 long tons         QUANTITY  [500:513]
9.7 short tons        QUANTITY  [515:529]
today                 DATE  [1182:1187]

=== Dependency Parse ===

Sentence 1: Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail.
  Like         → was           (prep)
  other tyrannosaurids → Like          (pobj)
  ,            → was           (punct)
  Tyrannosaurus → was           (nsubj)
  was          → was           (ROOT)
  a bipedal carnivore → was           (attr)
  with         → a bipedal carnivore  (prep)
  a massive skull → with          (pobj)
  balanced     → a massive skull  (acl)
  by           → balanced      (agent)
  a long, heavy tail → by            (pobj)
  .            → was           (punct)

In [214]:
from collections import Counter

from spacy.tokens import Span, Token


def most_common_by_text(seq):
    """
    Return the object in `seq` whose `.text` value appears most frequently.
    If `seq` is empty, returns None.
    If multiple objects share the top frequency, returns the first one seen.
    """
    # materialize the input so that generators can be scanned twice
    items = list(seq)
    if not items:
        return None

    # 1) count each text (use the object itself if it has no .text)
    counts = Counter(getattr(obj, "text", obj) for obj in items)

    # 2) most_common keeps first-seen order among ties
    top_text, _ = counts.most_common(1)[0]

    # 3) return the first object whose text matches
    return next(obj for obj in items if getattr(obj, "text", obj) == top_text)
        

def extract_topic(doc):
    """Extract the main topic of the document and its label."""
    # collect candidates as either Spans (entities) or Tokens (nsubj NOUN/PROPN)
    candidates = []
    # first all named‐entity spans
    for ent in doc.ents:
        candidates.append(ent)
    # then any noun/proper‐noun subjects
    for token in doc:
        if token.dep_ == "nsubj" and token.pos_ in ("NOUN", "PROPN"):
            candidates.append(token)

    if not candidates:
        return None, None
    
    print(candidates)

    # pick the most frequent
    topic_obj = most_common_by_text(candidates)

    # if it's a Span, use its label_
    if isinstance(topic_obj, Span):
        return topic_obj.text, topic_obj.label_

    # if it's a Token inside an entity, use token.ent_type_
    if isinstance(topic_obj, Token) and topic_obj.ent_type_:
        return topic_obj.text, topic_obj.ent_type_

    # otherwise fall back to the UPOS tag
    return topic_obj.text, topic_obj.pos_


topic, label = extract_topic(doc)
print("Main topic:", topic)
print("Label:", label)


[12.3–12.4 m, 13 m, 43 ft, 3.7–4 m, 8.8 t, 8.7 long tons, 9.7 short tons, today, Tyrannosaurus, the forelimbs, The most complete specimen, Tyrannosaurus, some other theropods, its estimated bite force, Tyrannosaurus rex, Some experts, the dinosaur, The question, Tyrannosaurus, Most paleontologists, Tyrannosaurus]
Main topic: Tyrannosaurus
Label: PROPN


In [215]:
def sent_subtree(tok):
    return " ".join(t.text for t in tok.subtree)

def extract_spo(sent):
    # 1) Find the main verb
    roots = [t for t in sent if t.dep_ == "ROOT"]
    if not roots:
        return None
    root = roots[0]

    # 2) Subject(s): any child with “subj” in its dep_
    subs = [child for child in root.lefts if "subj" in child.dep_]
    # 3) Object(s): expand to attr, acomp, dobj, pobj, iobj
    objs = [
        child
        for child in root.rights
        if child.dep_ in ("dobj", "pobj", "iobj", "attr", "acomp")
    ]

    if not subs or not objs:
        return None

    # 4) Build full phrases
    subj_phrase = sent_subtree(subs[0])
    obj_phrase  = sent_subtree(objs[0])

    # 5) Build the relation: include auxiliaries, negations, particles
    aux = [tok.text for tok in root.lefts if tok.dep_ in ("aux", "auxpass", "neg", "prt")]
    rel = " ".join(aux + [root.text])

    return (subj_phrase, rel, obj_phrase)


for i, sent in enumerate(doc.sents, 1):
    spo = extract_spo(sent)
    if spo:
        print(f"Triple {i}:", spo)

Triple 1: ('Tyrannosaurus', 'was', 'a bipedal carnivore with a massive skull balanced by a long, heavy tail')
Triple 2: ('the forelimbs of Tyrannosaurus', 'were', 'short but unusually powerful for their size')
Triple 3: ('The most complete specimen', 'measures', '12.3–12.4 m ( 40–41 ft ) in length')
Triple 5: ('Tyrannosaurus rex', 'was', 'an apex predator')


**Entity pair extraction**

To determine the entity pairs, we again use the dependency graph. Take an illustrative sentence such as "Leonard Simon Nimoy was born in Boston" (not part of the text parsed above). In its parse we can see that:

- Entities are found in noun phrases.
- The entity tagged as a subject, `subj` (Leonard Simon Nimoy), is the head of the triple.
- The object, `obj` (Boston), is the tail, and the verb phrase between them (was born in) is the relation.
- `subj` and `obj` may each span several tokens, linked by `dep_ == "compound"`.
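
The rules above can be tried without loading a model. The sketch below is a minimal, model-free illustration: the `(text, dep)` pairs are hand-assigned labels for the example sentence (an assumption, not actual parser output), and `pair_from_labels` is a hypothetical helper name, not a spaCy function.

```python
# Hand-labelled tokens for the illustrative sentence.
# These dependency labels are assumed for the example, not produced by a parser.
TOKENS = [
    ("Leonard", "compound"),
    ("Simon", "compound"),
    ("Nimoy", "nsubjpass"),
    ("was", "auxpass"),
    ("born", "ROOT"),
    ("in", "prep"),
    ("Boston", "pobj"),
]


def pair_from_labels(tokens):
    """Return (head, tail) using the subj / obj / compound rules."""
    head, tail = "", ""
    prefix = []  # compound modifiers waiting for their noun
    for text, dep in tokens:
        if dep == "punct":
            continue
        if dep == "compound":
            prefix.append(text)
            continue
        phrase = " ".join(prefix + [text])
        if "subj" in dep:
            head = phrase
        elif dep in ("dobj", "pobj", "iobj"):
            tail = phrase
        prefix = []  # any non-compound token ends the current noun phrase
    return head, tail


print(pair_from_labels(TOKENS))  # ('Leonard Simon Nimoy', 'Boston')
```

Note that when the pipeline already merges noun chunks, multi-token names usually arrive as a single token, so the compound rule rarely fires.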


In [216]:
def extract_entity_pairs(sent):
  head = ''
  tail = ''

  prefix = ''             # accumulates compound modifiers that precede a noun
  prev_token_dep = ''     # dependency tag of the previous non-punctuation token

  for token in sent:
    # if it's a punctuation mark, do nothing and move on to the next token
    if token.dep_ == 'punct':
      continue

    # Condition #1: subj is the head entity (the last subject in the sentence wins)
    if "subj" in token.dep_:
      head = f'{prefix} {token.text}'
      # reset so the prefix is not reused by succeeding entities
      prefix = ''

    # Condition #2: obj is the tail entity (the last object in the sentence wins)
    if token.dep_ in ("dobj", "pobj", "iobj"):
      tail = f'{prefix} {token.text}'
      prefix = ''

    # Condition #3: entities may be composed of several tokens
    if token.dep_ == "compound":
      # if the previous word was also a 'compound', extend the running prefix
      if prev_token_dep == "compound":
        prefix = f'{prefix} {token.text}'
      # if not, then this is the first token in the noun phrase
      else:
        prefix = token.text

    # remember the tag for the compound check on the next token
    prev_token_dep = token.dep_

  return [head.strip(), tail.strip()]

for i, sent in enumerate(doc.sents, 1):
  print(f'Sentence {i}: {extract_entity_pairs(sent)}')

Sentence 1: ['Tyrannosaurus', 'a long, heavy tail']
Sentence 2: ['they', 'two clawed digits']
Sentence 3: ['Tyrannosaurus', 'mass']
Sentence 4: ['its estimated bite force', 'all terrestrial animals']
Sentence 5: ['Tyrannosaurus rex', 'ceratopsians']
Sentence 6: ['the dinosaur', '']
Sentence 7: ['Tyrannosaurus', 'paleontology']
Sentence 8: ['Tyrannosaurus', '']


**Relation Extraction**

To extract the relation, we use spaCy's rule-based `Matcher` class. In our example sentences the relation is usually a verb phrase anchored at the sentence root. We can therefore write a pattern over dependency labels (the root, optionally followed by a preposition, attribute, determiner, or agent) and take the matched span as the relation.
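
To see what a pattern of one required label followed by optional labels selects, here is a small pure-Python stand-in. It is not spaCy's `Matcher`; it only mimics the "required token, then optional tokens in a fixed order" behaviour. The helper name `match_relation` is hypothetical, and the labels in `SENT1` are copied from the dependency parse printed earlier for sentence 1.

```python
# Optional labels that may follow the root, each at most once and in this order.
OPTIONAL_AFTER_ROOT = ["prep", "attr", "det", "agent"]

# First tokens of sentence 1 with the labels shown in the dependency parse output.
SENT1 = [
    ("Like", "prep"),
    ("other tyrannosaurids", "pobj"),
    (",", "punct"),
    ("Tyrannosaurus", "nsubj"),
    ("was", "ROOT"),
    ("a bipedal carnivore", "attr"),
    ("with", "prep"),
    ("a massive skull", "pobj"),
]


def match_relation(tokens):
    """Return the longest 'ROOT prep? attr? det? agent?' span as text."""
    for i, (_, dep) in enumerate(tokens):
        if dep != "ROOT":
            continue
        end = i + 1
        for wanted in OPTIONAL_AFTER_ROOT:
            # an optional slot is consumed only if the next token carries its label
            if end < len(tokens) and tokens[end][1] == wanted:
                end += 1
        return " ".join(text for text, _ in tokens[i:end])
    return ""  # no ROOT token found


print(match_relation(SENT1))  # was a bipedal carnivore
```

The span stops at "with" because a `prep` token is only accepted directly after the root, before the `attr` slot.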

In [217]:
from spacy.matcher import Matcher

def extract_relation(sent):

  # Rule-based pattern matching class
  matcher = Matcher(nlp.vocab)

  # define the pattern over dependency labels: the sentence root,
  # optionally followed by a preposition, attribute, determiner, or agent
  pattern = [{'DEP': 'ROOT'},               # verbs are often root
             {'DEP': 'prep', 'OP': "?"},
             {'DEP': 'attr', 'OP': "?"},
             {'DEP': 'det', 'OP': "?"},
             {'DEP': 'agent', 'OP': "?"}]

  matcher.add("relation", [pattern])

  matches = matcher(sent)
  if not matches:
    return ''

  # the optional tokens produce several overlapping matches; keep the longest
  _, start, end = max(matches, key=lambda m: m[2] - m[1])
  span = sent[start:end]

  return span.text

for i, sent in enumerate(doc.sents, 1):
  print(f'Sentence {i}: {extract_relation(sent)}')

Sentence 1: was a bipedal carnivore
Sentence 2: were
Sentence 3: measures
Sentence 4: is
Sentence 5: was
Sentence 6: suggested
Sentence 7: was among
Sentence 8: accept


In [218]:
for i, sent in enumerate(doc.sents, 1):
  head, tail = extract_entity_pairs(sent)
  print(f'Triple {i}: ({head}, {extract_relation(sent)}, {tail})')

Triple 1: (Tyrannosaurus, was a bipedal carnivore, a long, heavy tail)
Triple 2: (they, were, two clawed digits)
Triple 3: (Tyrannosaurus, measures, mass)
Triple 4: (its estimated bite force, is, all terrestrial animals)
Triple 5: (Tyrannosaurus rex, was, ceratopsians)
Triple 6: (the dinosaur, suggested, )
Triple 7: (Tyrannosaurus, was among, paleontology)
Triple 8: (Tyrannosaurus, accept, )
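
One possible next step toward the knowledge graph named in the title is to load the triples into a graph structure. The sketch below is an assumption about how that could look, not part of the notebook's pipeline: it uses a plain dictionary as an adjacency list and drops triples with an empty head or tail. The helper name `build_graph` is hypothetical, and the triples are a subset copied from the output above.

```python
from collections import defaultdict

# A few triples copied from the printed output; an empty string marks a missing tail.
TRIPLES = [
    ("Tyrannosaurus", "was a bipedal carnivore", "a long, heavy tail"),
    ("they", "were", "two clawed digits"),
    ("the dinosaur", "suggested", ""),
    ("Tyrannosaurus", "was among", "paleontology"),
    ("Tyrannosaurus", "accept", ""),
]


def build_graph(triples):
    """Map each head entity to a list of (relation, tail) edges."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        if not head or not tail:
            continue  # skip incomplete triples
        graph[head].append((rel, tail))
    return dict(graph)


graph = build_graph(TRIPLES)
for head, edges in graph.items():
    for rel, tail in edges:
        print(f"{head} --[{rel}]--> {tail}")
```

Several of the triples above are noisy (triple 1, for instance, pairs the subject with the last prepositional object rather than with "a bipedal carnivore"), so some filtering or a stricter extractor would likely be needed before the graph is useful.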
