Ontological Engineering:
NLP parsing and knowledge graph construction

In [263]:
#!pip install -U spacy
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_trf
# Available English model sizes: sm, md, lg, trf



In [264]:
import spacy
import spacy_transformers
# nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer", "morphologizer"])
nlp = spacy.load("en_core_web_trf")

nlp.add_pipe("merge_entities")     # merges named entities
nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks

# import en_core_web_trf
# nlp = en_core_web_trf.load()
print(nlp.pipe_names)
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'merge_entities', 'merge_noun_chunks']
[('This', 'PRON'), ('is', 'AUX'), ('a sentence', 'NOUN'), ('.', 'PUNCT')]


In [None]:
import spacy

# 1) Load model & text

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("merge_entities")     # merges named entities
nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks
nlp.add_pipe("merge_subtokens")  

text = (
        "Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail. "
        "Relative to its large and powerful hind limbs, the forelimbs of Tyrannosaurus were short but unusually powerful for their size, and they had two clawed digits. "
        "The most complete specimen measures 12.3–12.4 m (40–41 ft) in length, but according to most modern estimates, Tyrannosaurus could have exceeded sizes of 13 m (43 ft) in length, 3.7–4 m (12–13 ft) in hip height, and 8.8 t (8.7 long tons; 9.7 short tons) in mass. "
        "Although some other theropods might have rivaled or exceeded Tyrannosaurus in size, it is still among the largest known land predators, with its estimated bite force being the largest among all terrestrial animals. "
        "By far the largest carnivore in its environment, Tyrannosaurus rex was most likely an apex predator, preying upon hadrosaurs, juvenile armored herbivores like ceratopsians and ankylosaurs, and possibly sauropods. Some experts have suggested the dinosaur was primarily a scavenger. "
        "The question of whether Tyrannosaurus was an apex predator or a pure scavenger was among the longest debates in paleontology. " 
        "Most paleontologists today accept that Tyrannosaurus was both a predator and a scavenger." 
    )

text2 = ("Born to Jedi Knight Anakin Skywalker (later Darth Vader) and Senator Padmé Amidala, Luke was raised by his Aunt Beru and Uncle Owen on the desert planet Tatooine." 
         "Unaware of his true parentage, he worked as a moisture farmer until fate intervened: he encountered the droid R2-D2, which carried Princess Leia’s plea for help, setting Luke on a path that would forever change the galaxy."
         "Under the tutelage of Obi-Wan “Ben” Kenobi and, later, Jedi Master Yoda on Dagobah, Luke learned to harness the Force and hone his skills with a lightsaber." 
         "His rigorous training culminated in a fierce confrontation with Darth Vader aboard Cloud City, where he faced not only the dark side’s power but also the shattering revelation of his lineage ")

text3 = ("Archaeopteryx (from the Greek archaîos “ancient” + ptéryx “wing”; German Urvogel, “primeval bird”) is a genus of bird-like dinosaurs that lived in the Late Jurassic (~150 Mya) in what is now southern Germany. Roughly magpie-sized (up to 0.5 m long), it combined avian features (broad, flight-capable wings and advanced feather impressions) with classic theropod traits (sharp-toothed jaws, three-fingered claws, long bony tail, and a hyperextensible second toe). Because it bridges non-avian dinosaurs and modern birds, Archaeopteryx is celebrated as one of the most important transitional fossils in vertebrate evolution. First known from a single feather described in 1861, twelve more body-fossil specimens have since surfaced—almost all from the Solnhofen limestone quarries. The “London” specimen (1861) and the more complete “Berlin” specimen (1874–75) remain the best-preserved examples. These fossils not only bolstered early acceptance of Darwin’s theory of evolution but also established that feathers arose before true birds, reshaping our understanding of dinosaur–bird relationships.")

text4 = ("Quetzalcoatlus northropi is one of the largest known pterosaurs—and indeed one of the largest flying animals ever to have lived. Named for the feathered serpent god Quetzalcoatl of Aztec mythology and paleontologist John Northrop, it soared the skies of what is now western North America during the Late Cretaceous, roughly 70–66 million years ago. With an estimated wingspan of 10–11 meters (33–36 feet), Quetzalcoatlus combined ultra-light, hollow bones and an aerodynamic skull crest to remain airborne. Its long, slender beak lacked teeth, suggesting a feeding strategy focused on small vertebrates or carrion rather than fish snatching; its robust hind limbs imply a terrestrial stalking style, picking prey on shorelines or inland plains before taking flight in powerful strokes of its membranous wings. Fossils of Quetzalcoatlus were first uncovered in the 1970s within the Maastrichtian deposits of Big Bend National Park, Texas. Unlike the marine pterosaurs found closer to ancient shorelines, these specimens came from inland floodplain sediments—evidence that some giant pterosaurs thrived far from the coasts. As both apex aerial predators and efficient scavengers, Quetzalcoatlus and its kin help paint a richer picture of Late Cretaceous ecosystems, where the mastery of flight unlocked entirely new niches among Mesozoic reptiles.")

doc = nlp(text2)  # the outputs shown in this notebook were produced from text2

# 2) Print named entities
print("=== Entities ===")
for ent in doc.ents:
    print(f"{ent.text:20}  {ent.label_}  [{ent.start_char}:{ent.end_char}]")

# 3) Print dependency arcs sentence by sentence
print("\n=== Dependency Parse ===")
for i, sent in enumerate(doc.sents, 1):
    print(f"\nSentence {i}: {sent.text}")
    for token in sent:
        # token.dep_: dependency label; token.head: the "governor"
        print(f"  {token.text:12} → {token.head.text:12}  ({token.dep_})")
        
print([(w.text, w.pos_) for w in doc])


=== Entities ===
Jedi Knight Anakin Skywalker  PERSON  [8:36]
later Darth Vader     PERSON  [38:55]
Senator Padmé Amidala  PERSON  [61:82]
Luke                  PERSON  [84:88]
his Aunt Beru         PERSON  [103:116]
Uncle Owen            PERSON  [121:131]
R2-D2                 PRODUCT  [272:277]
Luke                  PERSON  [332:336]
Obi-Wan “Ben” Kenobi  PERSON  [406:426]
Jedi Master Yoda      PERSON  [439:455]
Dagobah               PRODUCT  [459:466]
Luke                  PERSON  [468:472]
Darth Vader           PERSON  [604:615]
Cloud City            LOC  [623:633]

=== Dependency Parse ===

Sentence 1: Born to Jedi Knight Anakin Skywalker (later Darth Vader) and Senator Padmé Amidala, Luke was raised by his Aunt Beru and Uncle Owen on the desert planet Tatooine.
  Born         → raised        (advcl)
  to           → Born          (prep)
  Jedi Knight Anakin Skywalker → to            (pobj)
  (            → Jedi Knight Anakin Skywalker  (punct)
  later Darth Vader → Jedi Knight An

In [266]:
from collections import Counter
import spacy
from spacy.tokens import Span, Token


def most_common_by_text(seq):
    """
    Return the object in `seq` whose `.text` value appears most frequently.
    If `seq` is empty, returns None.
    If multiple objects share the top frequency, returns the first one seen.
    """
    items = list(seq)  # materialize so a generator can be scanned twice
    if not items:
        return None

    # 1) count each text; fall back to the object itself if it has no .text
    counts = Counter(getattr(obj, "text", obj) for obj in items)

    # 2) take the most frequent text; among ties, the first one seen wins
    max_text, _ = counts.most_common(1)[0]

    # 3) return the first object whose text matches
    for obj in items:
        if getattr(obj, "text", obj) == max_text:
            return obj
    return None


def extract_topic(doc):
    """Extract the main topic of the document and its label."""
    # Candidates are tokens tagged NOUN or PROPN whose dependency label is in
    # valid_dep. Because merge_entities and merge_noun_chunks run in the
    # pipeline, a single candidate token may cover a whole phrase.
    valid_dep = ["nsubj", "nsubjpass", "pobj", "prep"]
    candidates = []
    for token in doc:
        if token.dep_ in valid_dep and token.pos_ in ("NOUN", "PROPN"):
            candidates.append(token)

    if not candidates:
        return None, None
    
    print(candidates)

    # pick the most frequent
    topic_obj = most_common_by_text(candidates)

    # if it's a Span, use its label_
    if isinstance(topic_obj, Span):
        return topic_obj.text, topic_obj.label_

    # if it's a Token inside an entity, use token.ent_type_
    if isinstance(topic_obj, Token) and topic_obj.ent_type_:
        return topic_obj.text, topic_obj.ent_type_

    # otherwise fall back to the UPOS tag
    return topic_obj.text, topic_obj.pos_


topic, label = extract_topic(doc)
print("Main topic:", topic)
print("Label:", label)


[Jedi Knight Anakin Skywalker, Luke, his Aunt Beru, the desert planet, his true parentage, a moisture farmer, fate, help, a path, the tutelage, Obi-Wan “Ben” Kenobi, Dagobah, Luke, a lightsaber, His rigorous training, a fierce confrontation, Darth Vader, Cloud City, his lineage]
Main topic: Luke
Label: PERSON


In [267]:
def sent_subtree(tok):
    return " ".join(t.text for t in tok.subtree)

def extract_spo(sent):
    # 1) Find the main verb
    roots = [t for t in sent if t.dep_ == "ROOT"]
    if not roots:
        return None
    root = roots[0]

    # 2) Subject(s): any child with “subj” in its dep_
    subs = [child for child in root.lefts if "subj" in child.dep_]
    # 3) Object(s): expand to attr, acomp, dobj, pobj, iobj
    objs = [
        child
        for child in root.rights
        if child.dep_ in ("dobj", "pobj", "iobj", "attr", "acomp")
    ]

    if not subs or not objs:
        return None

    # 4) Build full phrases
    subj_phrase = sent_subtree(subs[0])
    obj_phrase  = sent_subtree(objs[0])

    # 5) Build the relation: include auxiliaries, negations, particles
    aux = [tok.text for tok in root.lefts if tok.dep_ in ("aux", "auxpass", "neg", "prt")]
    rel = " ".join(aux + [root.text])

    return (subj_phrase, rel, obj_phrase)


for i, sent in enumerate(doc.sents, 1):
    spo = extract_spo(sent)
    if spo:
        print(f"Triple {i}:", spo)

Triple 2: ('he', 'encountered', 'the droid R2-D2 , which carried Princess Leia’s plea for help')


**Entity pair extraction**

To determine the entity pairs, we again use the dependency graph. Take a simple sentence such as "Leonard Simon Nimoy was born in Boston":

- Entities are found in noun phrases.
- The entity tagged as subj (Leonard Simon Nimoy) is the head of the triple.
- The entity tagged as obj (Boston) is the tail, and the verb phrase between them (was born in) is the relation.
- subj and obj may consist of several tokens; the leading tokens carry dep_ == "compound".
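
The rules above can be tried out without loading a model. The sketch below is only an illustration: the (text, dep) pairs are written by hand, and the labels are an assumption about what an English spaCy parser would typically assign to this sentence. `head_and_tail` is a hypothetical helper, not part of spaCy.

```python
# Hand-written (text, dep) pairs for the example sentence.
# The dependency labels are assumed, not produced by a parser here.
TOKENS = [
    ("Leonard", "compound"),
    ("Simon", "compound"),
    ("Nimoy", "nsubjpass"),
    ("was", "auxpass"),
    ("born", "ROOT"),
    ("in", "prep"),
    ("Boston", "pobj"),
]


def head_and_tail(tokens):
    """Apply the subj / obj / compound rules to (text, dep) pairs."""
    head, tail, prefix = "", "", []
    for text, dep in tokens:
        if dep == "compound":
            prefix.append(text)              # collect modifiers of the next noun
        elif "subj" in dep:
            head = " ".join(prefix + [text])  # subject becomes the head entity
            prefix = []
        elif dep in ("dobj", "pobj", "iobj"):
            tail = " ".join(prefix + [text])  # object becomes the tail entity
            prefix = []
        else:
            prefix = []                      # any other token ends a compound run
    return head, tail


print(head_and_tail(TOKENS))
```

Collecting the compound tokens in a list keeps names of any length intact, which is the part that is easy to get wrong when tracking only the previous token.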


In [268]:
def extract_entity_pairs(sent):
  head = ''
  tail = ''
  prefix = ''             # accumulates the preceding run of 'compound' tokens

  for token in sent:
    # if it's a punctuation mark, do nothing and move on to the next token
    if token.dep_ == 'punct':
      continue

    # Condition #1: subj is the head entity
    if "subj" in token.dep_:
      head = f'{prefix} {token.text}'
      prefix = ''

    # Condition #2: obj is the tail entity
    elif token.dep_ in ("dobj", "pobj", "iobj"):
      tail = f'{prefix} {token.text}'
      prefix = ''

    # Condition #3: entities may be composed of several tokens,
    # so keep every token of the compound run, not just the last two
    elif token.dep_ == "compound":
      prefix = f'{prefix} {token.text}'.strip()

    # any other token ends a compound run
    else:
      prefix = ''

  return [head.strip(), tail.strip()]

for i, sent in enumerate(doc.sents, 1):
  print(f'Sentence {i}: {extract_entity_pairs(sent)}')

Sentence 1: ['Luke', 'the desert planet']
Sentence 2: ['that', 'the galaxy']
Sentence 3: ['Luke', 'a lightsaber']
Sentence 4: ['he', 'his lineage']


**Relation Extraction**

To extract the relation, we use spaCy's rule-based Matcher class. In our example sentences, the relation is usually a verb phrase anchored at the ROOT of the dependency parse. We can therefore describe it as a pattern over dependency labels: the ROOT token, optionally followed by a preposition, an attribute, a determiner, or an agent. The span that matches the pattern gives the tokens of the relation.
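
To make the pattern idea concrete, here is a minimal stdlib-only sketch of the same matching logic: one required ROOT token followed by optional slots, each of which may consume zero or one token. It is not the spaCy Matcher itself; `match_relation` and the hand-labelled tokens are hypothetical and only illustrate how the optional slots extend the relation.

```python
# Optional dependency labels that may follow the ROOT, in pattern order.
OPTIONAL_TAGS = ["prep", "attr", "det", "agent"]


def match_relation(tokens):
    """Return the relation text for a list of (text, dep) pairs, or None."""
    for start, (_, dep) in enumerate(tokens):
        if dep != "ROOT":
            continue
        end = start + 1
        for tag in OPTIONAL_TAGS:
            # each optional slot consumes at most one token
            if end < len(tokens) and tokens[end][1] == tag:
                end += 1
        return " ".join(text for text, _ in tokens[start:end])
    return None  # no ROOT token in the input


# Hand-labelled tokens (assumed labels), with noun chunks already merged.
SENTENCE = [
    ("Luke", "nsubjpass"),
    ("was", "auxpass"),
    ("raised", "ROOT"),
    ("by", "agent"),
    ("his Aunt Beru", "pobj"),
]

print(match_relation(SENTENCE))
```

Consuming the optional slots greedily corresponds to taking the longest match for this pattern, which is what the relation extractor wants.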

In [269]:
from spacy.matcher import Matcher

# Rule-based pattern matcher, built once and reused for every sentence
matcher = Matcher(nlp.vocab)

# define the pattern according to the dependency graph tags
pattern = [{'DEP':'ROOT'},                # verbs are often root
           {'DEP':'prep','OP':"?"},
           {'DEP':'attr','OP':"?"},
           {'DEP':'det','OP':"?"},
           {'DEP':'agent','OP':"?"}]

matcher.add("relation", [pattern])

def extract_relation(sent):
  matches = matcher(sent)
  if not matches:
    return ''             # no ROOT-anchored pattern found in this sentence

  # keep the longest match rather than relying on the order of the results
  _, start, end = max(matches, key=lambda m: m[2] - m[1])
  return sent[start:end].text

for i, sent in enumerate(doc.sents, 1):
  print(f'Sentence {i}: {extract_relation(sent)}')

Sentence 1: raised by
Sentence 2: encountered
Sentence 3: learned
Sentence 4: culminated in


In [270]:
for i, sent in enumerate(doc.sents, 1):
  head, tail = extract_entity_pairs(sent)
  print(f'Triple {i}: ({head}, {extract_relation(sent)}, {tail})')

Triple 1: (Luke, raised by, the desert planet)
Triple 2: (that, encountered, the galaxy)
Triple 3: (Luke, learned, a lightsaber)
Triple 4: (he, culminated in, his lineage)
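
As a possible next step toward a knowledge graph, the triples can be grouped by head entity. The sketch below is an assumption about how one might store them, using a plain dictionary rather than a graph library; `build_graph` is a hypothetical helper. The triples are copied from the output above and are noisy, as is typical of purely rule-based extraction.

```python
from collections import defaultdict

# Triples copied verbatim from the extraction output.
TRIPLES = [
    ("Luke", "raised by", "the desert planet"),
    ("that", "encountered", "the galaxy"),
    ("Luke", "learned", "a lightsaber"),
    ("he", "culminated in", "his lineage"),
]


def build_graph(triples):
    """Map each head entity to a list of (relation, tail) edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


graph = build_graph(TRIPLES)
for head, edges in graph.items():
    for relation, tail in edges:
        print(f"{head} --[{relation}]--> {tail}")
```

Heads such as "that" and "he" show that pronouns would need to be resolved to entities before the graph is useful.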
