In [140]:
text1 = (
        "Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail. "
        "Relative to its large and powerful hind limbs, the forelimbs of Tyrannosaurus were short but unusually powerful for their size, and they had two clawed digits. "
        "The most complete specimen measures 12.3–12.4 m (40–41 ft) in length, but according to most modern estimates, Tyrannosaurus could have exceeded sizes of 13 m (43 ft) in length, 3.7–4 m (12–13 ft) in hip height, and 8.8 t (8.7 long tons; 9.7 short tons) in mass."
        "Although some other theropods might have rivaled or exceeded Tyrannosaurus in size, it is still among the largest known land predators, with its estimated bite force being the largest among all terrestrial animals. "
        "By far the largest carnivore in its environment, Tyrannosaurus rex was most likely an apex predator, preying upon hadrosaurs, juvenile armored herbivores like ceratopsians and ankylosaurs, and possibly sauropods. Some experts have suggested the dinosaur was primarily a scavenger. "
        "The question of whether Tyrannosaurus was an apex predator or a pure scavenger was among the longest debates in paleontology. " 
        "Most paleontologists today accept that Tyrannosaurus was both a predator and a scavenger." 
    )

text2 = ("Born to Jedi Knight Anakin Skywalker (later Darth Vader) and Senator Padmé Amidala, Luke was raised by his Aunt Beru and Uncle Owen on the desert planet Tatooine. "
         "Unaware of his true parentage, he worked as a moisture farmer until fate intervened: he encountered the droid R2-D2, which carried Princess Leia’s plea for help, setting Luke on a path that would forever change the galaxy. "
         "Under the tutelage of Obi-Wan “Ben” Kenobi and, later, Jedi Master Yoda on Dagobah, Luke learned to harness the Force and hone his skills with a lightsaber. "
         "His rigorous training culminated in a fierce confrontation with Darth Vader aboard Cloud City, where he faced not only the dark side’s power but also the shattering revelation of his lineage.")

text3 = ("Archaeopteryx (from the Greek archaîos “ancient” + ptéryx “wing”; German Urvogel, “primeval bird”) is a genus of bird-like dinosaurs that lived in the Late Jurassic (~150 Mya) in what is now southern Germany. Roughly magpie-sized (up to 0.5 m long), it combined avian features (broad, flight-capable wings and advanced feather impressions) with classic theropod traits (sharp-toothed jaws, three-fingered claws, long bony tail, and a hyperextensible second toe). Because it bridges non-avian dinosaurs and modern birds, Archaeopteryx is celebrated as one of the most important transitional fossils in vertebrate evolution. First known from a single feather described in 1861, twelve more body-fossil specimens have since surfaced—almost all from the Solnhofen limestone quarries. The “London” specimen (1861) and the more complete “Berlin” specimen (1874–75) remain the best-preserved examples. These fossils not only bolstered early acceptance of Darwin’s theory of evolution but also established that feathers arose before true birds, reshaping our understanding of dinosaur–bird relationships.")

text4 = ("Quetzalcoatlus northropi is one of the largest known pterosaurs—and indeed one of the largest flying animals ever to have lived. Named for the feathered serpent god Quetzalcoatl of Aztec mythology and paleontologist John Northrop, it soared the skies of what is now western North America during the Late Cretaceous, roughly 70–66 million years ago. With an estimated wingspan of 10–11 meters (33–36 feet), Quetzalcoatlus combined ultra-light, hollow bones and an aerodynamic skull crest to remain airborne. Its long, slender beak lacked teeth, suggesting a feeding strategy focused on small vertebrates or carrion rather than fish snatching; its robust hind limbs imply a terrestrial stalking style, picking prey on shorelines or inland plains before taking flight in powerful strokes of its membranous wings. Fossils of Quetzalcoatlus were first uncovered in the 1970s within the Maastrichtian deposits of Big Bend National Park, Texas. Unlike the marine pterosaurs found closer to ancient shorelines, these specimens came from inland floodplain sediments—evidence that some giant pterosaurs thrived far from the coasts. As both apex aerial predators and efficient scavengers, Quetzalcoatlus and its kin help paint a richer picture of Late Cretaceous ecosystems, where the mastery of flight unlocked entirely new niches among Mesozoic reptiles.")

text5 = ("Dakotaraptor steini was a giant dromaeosaur (“raptor”) of the Hell Creek Formation (Late Cretaceous, ~66 Ma) that specialized in hunting tyrannosaurus rex. At up to 5 m in length and standing over 1.2 m tall at the hip, Dakotaraptor combined extreme speed with formidable weaponry—most notably its sickle-shaped second toe claws, each nearly 15 cm long—to take down even the fiercest of contemporaries. Evidence from bonebeds suggests Dakotaraptor hunted in coordinated packs. Working together, they ambushed juvenile and subadult tyrannosaurus rex, the smaller size and less–hardened defenses of these younger tyrants making them prime targets. By circling and harrying their prey, the raptors could deliver precise, crippling strikes to legs and flanks before dragging a weakened tyrannosaurus rex to the ground. With a low, muscular build and extremely stiff tail for balance, Dakotaraptor could sprint out of cover in explosive bursts. Each slash of its oversized toe claw could puncture the hide of a growing tyrannosaurus rex, while its blade-like teeth delivered fatal wounds. Robust forelimbs tipped with large, recurved claws helped maintain grip on struggling prey, ensuring the pack maintained control until the kill was complete. By targeting younger tyrannosaurus rex, Dakotaraptor packs regulated tyrannosaur populations and prevented runaway dominance by the giant predators. Their ability to hunt tyrannosaurus rex underscores the dynamic, predator–rich ecosystems of Late Cretaceous North America—where even the apex carnivore could fall victim to a well-organized hunting party of raptors.")

texts = [text1, text2, text3, text4, text5]

In [141]:
import spacy

# 1) Load the model and add the token-merging components

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("merge_entities")     # merges named entities
nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks
nlp.add_pipe("merge_subtokens")    # merges tokens the parser labels "subtok"

<function spacy.pipeline.functions.merge_subtokens(doc: spacy.tokens.doc.Doc, label: str = 'subtok') -> spacy.tokens.doc.Doc>

In [142]:


doc = nlp(text1) 

# 2) Print named entities
print("=== Entities ===")
for ent in doc.ents:
    print(f"{ent.text:20}  {ent.label_}  [{ent.start_char}:{ent.end_char}]")

# 3) Print dependency arcs sentence by sentence
print("\n=== Dependency Parse ===")
for i, sent in enumerate(doc.sents, 1):
    print(f"\nSentence {i}: {sent.text}")
    for token in sent:
        # token.dep_: dependency label; token.head: the "governor"
        print(f"  {token.text:12} → {token.head.text:12}  ({token.dep_})")
        
print([(w.text, w.pos_) for w in doc])


=== Entities ===
12.3–12.4 m           QUANTITY  [314:325]
13 m                  QUANTITY  [431:435]
43 ft                 QUANTITY  [437:442]
3.7–4 m               QUANTITY  [455:462]
8.8 t                 QUANTITY  [493:498]
8.7 long tons         QUANTITY  [500:513]
9.7 short tons        QUANTITY  [515:529]
today                 DATE  [1182:1187]

=== Dependency Parse ===

Sentence 1: Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail.
  Like         → was           (prep)
  other tyrannosaurids → Like          (pobj)
  ,            → was           (punct)
  Tyrannosaurus → was           (nsubj)
  was          → was           (ROOT)
  a bipedal carnivore → was           (attr)
  with         → a bipedal carnivore  (prep)
  a massive skull → with          (pobj)
  balanced     → a massive skull  (acl)
  by           → balanced      (agent)
  a long, heavy tail → by            (pobj)
  .            → was           (punct)

In [143]:
from collections import Counter
from spacy.tokens import Span, Token


def most_common_by_text(seq):
    """
    Return the object in `seq` whose `.text` value appears most frequently.
    If `seq` is empty, returns None.
    If multiple objects share the top frequency, returns the first one seen.
    """
    items = list(seq)  # materialize so the sequence can be scanned twice
    if not items:
        return None

    # count each text; objects without .text are counted by their own value
    counts = Counter(getattr(obj, "text", obj) for obj in items)

    # most_common keeps first-seen order among equal counts
    max_text, _ = counts.most_common(1)[0]

    # return the first object whose text matches
    for obj in items:
        if getattr(obj, "text", obj) == max_text:
            return obj
    return None
        

def extract_topic(doc):
    """Extract the main topic of the document and its label."""
    # collect Tokens (NOUN/PROPN with an accepted dependency label) as topic candidates;
    # entity Spans are kept separately and only used to look up a label
    valid_dep = ["nsubj", "pobj", "nsubjpass", "punct", "prep"]  # dependency labels accepted for candidates
    candidates = []
    ent_candidates = []
    # first all named‐entity spans
    for ent in doc.ents:
        ent_candidates.append(ent)
    # then any noun/proper‐noun subjects
    for token in doc:
        if token.dep_ in valid_dep and token.pos_ in ("NOUN", "PROPN"):
            candidates.append(token)

    if not candidates:
        return None, None
    
    print(candidates)

    # pick the most frequent
    topic_obj = most_common_by_text(candidates)

    # if it's a Span, use its label_
    if isinstance(topic_obj, Span):
        return topic_obj.text, topic_obj.label_

    # if it's a Token inside an entity, use token.ent_type_
    if isinstance(topic_obj, Token) and topic_obj.ent_type_:
        return topic_obj.text, topic_obj.ent_type_

    # check if this token text is a named entity
    for ent in ent_candidates:
        if topic_obj.text == ent.text:
            return topic_obj.text, ent.label_
    # otherwise fall back to a placeholder label
    return topic_obj.text, "NaT"  # "NaT" = no entity label found for the topic


topic, label = extract_topic(doc)
print("Main topic:", topic)
print("Label:", label)


[other tyrannosaurids, Tyrannosaurus, a massive skull, a long, heavy tail, its large and powerful hind limbs, the forelimbs, Tyrannosaurus, their size, The most complete specimen, length, most modern estimates, Tyrannosaurus, 13 m, length, hip height, mass, some other theropods, size, the largest known land predators, its estimated bite force, all terrestrial animals, its environment, Tyrannosaurus rex, hadrosaurs, ceratopsians, Some experts, the dinosaur, The question, Tyrannosaurus, the longest debates, paleontology, Most paleontologists, Tyrannosaurus]
Main topic: Tyrannosaurus
Label: NaT
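
The winner can be checked by hand without loading a model. The sketch below copies the candidate texts from the printed list above into plain strings (an assumption: each merged noun chunk is one item, so "a long, heavy tail" is a single candidate) and counts them with `Counter`.

```python
from collections import Counter

# Candidate texts copied from the printed list above, one string per merged token.
candidates = [
    "other tyrannosaurids", "Tyrannosaurus", "a massive skull",
    "a long, heavy tail", "its large and powerful hind limbs",
    "the forelimbs", "Tyrannosaurus", "their size",
    "The most complete specimen", "length", "most modern estimates",
    "Tyrannosaurus", "13 m", "length", "hip height", "mass",
    "some other theropods", "size", "the largest known land predators",
    "its estimated bite force", "all terrestrial animals", "its environment",
    "Tyrannosaurus rex", "hadrosaurs", "ceratopsians", "Some experts",
    "the dinosaur", "The question", "Tyrannosaurus", "the longest debates",
    "paleontology", "Most paleontologists", "Tyrannosaurus",
]

counts = Counter(candidates)
top_text, top_count = counts.most_common(1)[0]

print(top_text, top_count)          # the most frequent candidate and its count
print(counts["length"])             # runner-up
print(counts["Tyrannosaurus rex"])  # counted separately from "Tyrannosaurus"
```

Because matching is by exact text, "Tyrannosaurus rex" does not add to the "Tyrannosaurus" count. The label is "NaT" here presumably because the model tagged only QUANTITY and DATE entities in this text, so no entity label was available for the topic.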


In [144]:

from pydantic import BaseModel, Field
from typing import Any, Optional

class TextMeta(BaseModel):
    text: str = Field(..., description="The text of the token or entity")
    token: Any = Field(..., description="The original token or entity object from spaCy")

class Triple(BaseModel):
    subject: TextMeta = Field(..., description="The subject of the triple")
    predicate: TextMeta = Field(..., description="The relation between subject and object")
    object: TextMeta = Field(..., description="The object of the triple")




def sent_subtree(tok):
    return " ".join(t.text for t in tok.subtree)

#========================================#
# ORIGINAL                               #
#========================================#

# subject-predicate-object
def extract_spo(sent) -> Optional[Triple]:
    # 1) Find the main verb
    roots = [t for t in sent if t.dep_ == "ROOT"]
    if not roots:
        return None
    root = roots[0]

    # 2) Subject(s): any child with “subj” in its dep_
    subs = [child for child in root.lefts if "subj" in child.dep_]
    # 3) Object(s): expand to attr, acomp, dobj, pobj, iobj
    objs = [
        child
        for child in root.rights
        if child.dep_ in ("dobj", "pobj", "iobj", "attr", "acomp")
    ]

    if not subs or not objs:
        return None

    # 4) Build full phrases
    subj_phrase = sent_subtree(subs[0])
    # print(f"Subj text: {subs[0].text}")  # Name of subject
    # print(f"Subj type: {subs[0].dep_}") # TODO: use this to look for connections when creating multiple nodes
    # print(f"Obj text: {objs[0].text}")    # Name of object
    # print(f"Obj type: {objs[0].dep_}")
    obj_phrase  = sent_subtree(objs[0])

    # 5) Build the relation: include auxiliaries, negations, particles
    aux = [tok.text for tok in root.lefts if tok.dep_ in ("aux", "auxpass", "neg", "prt")]
    rel = " ".join(aux + [root.text])
    
    subject = TextMeta(text=subj_phrase, token=subs[0])
    predicate = TextMeta(text=rel, token=root)  # root.head is the root itself
    object_ = TextMeta(text=obj_phrase, token=objs[0])
    
    return Triple(subject=subject, predicate=predicate, object=object_)

    #return (subj_phrase, rel, obj_phrase)


spo_list: list[Triple] = []
for i, sent in enumerate(doc.sents):
    spo = extract_spo(sent)
    if spo is not None:
        spo_list.append(spo)
    # if spo is not None:
    #     # Print a nicely–formatted JSON string of the Triple
    #     print(f"Triple {i}:")
    #     # For Pydantic v2
    #     print(spo.model_dump_json(indent=2))

for i, spo in enumerate(spo_list, start=1):
    subj_txt = spo.subject.text
    rel_txt  = spo.predicate.text
    obj_txt  = spo.object.text
    print(f"Triple {i}: ({subj_txt!r}, {rel_txt!r}, {obj_txt!r})")
    print(f"Triple {i}: ({spo.subject.token.dep_!r}, {spo.predicate.token.dep_!r}, {spo.object.token.dep_!r})")

Triple 1: ('Tyrannosaurus', 'was', 'a bipedal carnivore with a massive skull balanced by a long, heavy tail')
Triple 1: ('nsubj', 'ROOT', 'attr')
Triple 2: ('the forelimbs of Tyrannosaurus', 'were', 'short but unusually powerful for their size')
Triple 2: ('nsubj', 'ROOT', 'acomp')
Triple 3: ('The most complete specimen', 'measures', '12.3–12.4 m ( 40–41 ft ) in length')
Triple 3: ('nsubj', 'ROOT', 'dobj')
Triple 4: ('Tyrannosaurus rex', 'was', 'an apex predator')
Triple 4: ('nsubj', 'ROOT', 'attr')
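
Only four sentences of `text1` yield a triple. One likely reason is that `extract_spo` looks only at direct children of the ROOT token, while in spaCy's English dependency scheme a `pobj` attaches to its preposition rather than to the verb. The sketch below replays the same child filters on hand-built trees; `FakeToken` and `spo_from_root` are hypothetical stand-ins written for this illustration, not spaCy classes, and the trees are hand-written rather than parser output.

```python
from dataclasses import dataclass, field


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token; not part of spaCy."""
    text: str
    dep_: str
    lefts: list = field(default_factory=list)
    rights: list = field(default_factory=list)

    @property
    def subtree(self):
        # Left children, then the token itself, then right children.
        for child in self.lefts:
            yield from child.subtree
        yield self
        for child in self.rights:
            yield from child.subtree


def phrase(tok):
    """Join the texts of a token's subtree, as sent_subtree does."""
    return " ".join(t.text for t in tok.subtree)


def spo_from_root(root):
    """Apply the same child filters as extract_spo to a hand-built tree."""
    subs = [c for c in root.lefts if "subj" in c.dep_]
    objs = [c for c in root.rights
            if c.dep_ in ("dobj", "pobj", "iobj", "attr", "acomp")]
    if not subs or not objs:
        return None
    aux = [c.text for c in root.lefts
           if c.dep_ in ("aux", "auxpass", "neg", "prt")]
    return phrase(subs[0]), " ".join(aux + [root.text]), phrase(objs[0])


# Copula sentence: the complement ("attr") is a direct child of ROOT.
copula = FakeToken("was", "ROOT",
                   lefts=[FakeToken("Tyrannosaurus rex", "nsubj")],
                   rights=[FakeToken("an apex predator", "attr")])

# Prepositional sentence: "pobj" hangs off the preposition, not off ROOT.
prep_only = FakeToken("lived", "ROOT",
                      lefts=[FakeToken("Archaeopteryx", "nsubj")],
                      rights=[FakeToken("in", "prep",
                                        rights=[FakeToken("the Late Jurassic", "pobj")])])

print(spo_from_root(copula))     # a triple is found
print(spo_from_root(prep_only))  # None: no accepted object directly under ROOT
```

If prepositional complements should count as objects, one option would be to descend one level through `prep` children before giving up.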


In [145]:


class Node(BaseModel):
    """A node in the knowledge graph."""
    node_type: str = Field(..., description="Type of the node (e.g., 'PERSON', 'LOCATION')")
    title: str = Field(..., description="Title or Topic of the node")
    id: str = Field(..., description="Unique id of the node (slug derived from the title)")
    triples: Any = Field(..., description="List of triples associated with the node")
    
    def print_node(self):
        """Print the node's details."""
        print(f"Node ID: {self.id}")
        print(f"Node Type: {self.node_type}")
        print(f"Title: {self.title}")
        print("Triples:")
        for triple in self.triples:
            try:
                print(f"  - {triple.subject.text} {triple.predicate.text} {triple.object.text}")
            except AttributeError:
                # the item lacks subject/predicate/object, so it is not a Triple
                print("not a triple")
    
class Edge(BaseModel):
    """An edge in the knowledge graph."""
    source: str = Field(..., description="Source node id")
    target: str = Field(..., description="Target node id")
    relation: str = Field(..., description="Relation type between source and target nodes")
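
`Edge` is defined here but not yet populated. A minimal sketch of one way to create edges, assuming a node links to another whenever one of its triples mentions the other node's title in the object phrase. `SimpleNode`, `SimpleEdge`, and `link_nodes` are hypothetical dataclass stand-ins for the pydantic models so the example runs on the standard library alone, and the Dakotaraptor triple is hand-written for illustration, not extractor output.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleNode:
    """Stand-in for Node: triples are plain (subject, relation, object) strings."""
    id: str
    title: str
    triples: list = field(default_factory=list)


@dataclass
class SimpleEdge:
    """Stand-in for Edge."""
    source: str
    target: str
    relation: str


def link_nodes(nodes):
    """Emit an edge when a triple's object text mentions another node's title."""
    edges = []
    for node in nodes:
        for _subj, rel, obj in node.triples:
            for other in nodes:
                if other.id != node.id and other.title.lower() in obj.lower():
                    edges.append(SimpleEdge(node.id, other.id, rel))
    return edges


rex = SimpleNode("tyrannosaurus", "Tyrannosaurus",
                 [("Tyrannosaurus rex", "was", "an apex predator")])
raptor = SimpleNode("dakotaraptor", "Dakotaraptor",
                    [("Dakotaraptor steini", "was",
                      "a giant dromaeosaur that specialized in hunting tyrannosaurus rex")])

edges = link_nodes([rex, raptor])
print(edges)
```

Substring matching is crude: it cannot tell a passing mention from a real relation, and the relation text ("was") carries little meaning, so treat this only as a starting point.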

In [None]:
def create_node(triples, node_type, title):
    """
    Create a Node object from a list of SPO triples.
    """
    return Node(
        node_type=node_type,
        title=title,
        id=title.lower().replace(" ", "_"),  
        triples=triples
    )

node = create_node(spo_list, node_type=label, title=topic)

node.print_node()

Node ID: tyrannosaurus
Node Type: NaT
Title: Tyrannosaurus
Triples:
  - Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail
  - the forelimbs of Tyrannosaurus were short but unusually powerful for their size
  - The most complete specimen measures 12.3–12.4 m ( 40–41 ft ) in length
  - Tyrannosaurus rex was an apex predator
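
The id built in `create_node` only lowercases the title and replaces spaces, so titles containing hyphens, quotes, or accented letters (several appear in `text2`) would keep those characters. A minimal sketch of a stricter slug, assuming ASCII-only ids are wanted; `make_node_id` is a hypothetical helper, not part of the notebook.

```python
import re
import unicodedata


def make_node_id(title: str) -> str:
    """Lowercase ASCII slug; runs of other characters collapse to one underscore."""
    # Decompose accented letters, then drop anything that is not ASCII.
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore")
                   .decode("ascii"))
    # Replace each run of non-alphanumerics and trim underscores at the ends.
    return re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")


print(make_node_id("Tyrannosaurus"))
print(make_node_id("Tyrannosaurus rex"))
print(make_node_id("Padmé Amidala"))
print(make_node_id("Obi-Wan “Ben” Kenobi"))
```

Note that `extract_topic` can return `None` for the topic, in which case either version of the id code needs a guard before calling string methods on the title.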
