In [3]:
text = (
        "Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail. "
        "Relative to its large and powerful hind limbs, the forelimbs of Tyrannosaurus were short but unusually powerful for their size, and they had two clawed digits. "
        "The most complete specimen measures 12.3–12.4 m (40–41 ft) in length, but according to most modern estimates, Tyrannosaurus could have exceeded sizes of 13 m (43 ft) in length, 3.7–4 m (12–13 ft) in hip height, and 8.8 t (8.7 long tons; 9.7 short tons) in mass. "
        "Although some other theropods might have rivaled or exceeded Tyrannosaurus in size, it is still among the largest known land predators, with its estimated bite force being the largest among all terrestrial animals. "
        "By far the largest carnivore in its environment, Tyrannosaurus rex was most likely an apex predator, preying upon hadrosaurs, juvenile armored herbivores like ceratopsians and ankylosaurs, and possibly sauropods. Some experts have suggested the dinosaur was primarily a scavenger. "
        "The question of whether Tyrannosaurus was an apex predator or a pure scavenger was among the longest debates in paleontology. " 
        "Most paleontologists today accept that Tyrannosaurus was both a predator and a scavenger." 
    )

text2 = ("Born to Jedi Knight Anakin Skywalker (later Darth Vader) and Senator Padmé Amidala, Luke was raised by his Aunt Beru and Uncle Owen on the desert planet Tatooine. "
         "Unaware of his true parentage, he worked as a moisture farmer until fate intervened: he encountered the droid R2-D2, which carried Princess Leia’s plea for help, setting Luke on a path that would forever change the galaxy. "
         "Under the tutelage of Obi-Wan “Ben” Kenobi and, later, Jedi Master Yoda on Dagobah, Luke learned to harness the Force and hone his skills with a lightsaber. "
         "His rigorous training culminated in a fierce confrontation with Darth Vader aboard Cloud City, where he faced not only the dark side’s power but also the shattering revelation of his lineage.")

text3 = ("Archaeopteryx (from the Greek archaîos “ancient” + ptéryx “wing”; German Urvogel, “primeval bird”) is a genus of bird-like dinosaurs that lived in the Late Jurassic (~150 Mya) in what is now southern Germany. Roughly magpie-sized (up to 0.5 m long), it combined avian features (broad, flight-capable wings and advanced feather impressions) with classic theropod traits (sharp-toothed jaws, three-fingered claws, long bony tail, and a hyperextensible second toe). Because it bridges non-avian dinosaurs and modern birds, Archaeopteryx is celebrated as one of the most important transitional fossils in vertebrate evolution. First known from a single feather described in 1861, twelve more body-fossil specimens have since surfaced—almost all from the Solnhofen limestone quarries. The “London” specimen (1861) and the more complete “Berlin” specimen (1874–75) remain the best-preserved examples. These fossils not only bolstered early acceptance of Darwin’s theory of evolution but also established that feathers arose before true birds, reshaping our understanding of dinosaur–bird relationships.")

text4 = ("Quetzalcoatlus northropi is one of the largest known pterosaurs—and indeed one of the largest flying animals ever to have lived. Named for the feathered serpent god Quetzalcoatl of Aztec mythology and paleontologist John Northrop, it soared the skies of what is now western North America during the Late Cretaceous, roughly 70–66 million years ago. With an estimated wingspan of 10–11 meters (33–36 feet), Quetzalcoatlus combined ultra-light, hollow bones and an aerodynamic skull crest to remain airborne. Its long, slender beak lacked teeth, suggesting a feeding strategy focused on small vertebrates or carrion rather than fish snatching; its robust hind limbs imply a terrestrial stalking style, picking prey on shorelines or inland plains before taking flight in powerful strokes of its membranous wings. Fossils of Quetzalcoatlus were first uncovered in the 1970s within the Maastrichtian deposits of Big Bend National Park, Texas. Unlike the marine pterosaurs found closer to ancient shorelines, these specimens came from inland floodplain sediments—evidence that some giant pterosaurs thrived far from the coasts. As both apex aerial predators and efficient scavengers, Quetzalcoatlus and its kin help paint a richer picture of Late Cretaceous ecosystems, where the mastery of flight unlocked entirely new niches among Mesozoic reptiles.")


In [4]:
import spacy

# 1) Load model & text

nlp = spacy.load("en_core_web_trf")
#nlp.add_pipe("merge_entities")     # merges named entities
#nlp.add_pipe("merge_noun_chunks")  # merges base noun-chunks
#nlp.add_pipe("merge_subtokens")  


doc = nlp(text3) 

# 2) Print named entities
print("=== Entities ===")
for ent in doc.ents:
    print(f"{ent.text:20}  {ent.label_}  [{ent.start_char}:{ent.end_char}]")

# 3) Print dependency arcs sentence by sentence
print("\n=== Dependency Parse ===")
for i, sent in enumerate(doc.sents, 1):
    print(f"\nSentence {i}: {sent.text}")
    for token in sent:
        # token.dep_: dependency label; token.head: the "governor"
        print(f"  {token.text:12} → {token.head.text:12}  ({token.dep_})")
        
print([(w.text, w.pos_) for w in doc])


=== Entities ===
Greek                 NORP  [24:29]
German                NORP  [66:72]
the Late Jurassic     DATE  [147:164]
Germany               GPE  [200:207]
three                 CARDINAL  [390:395]
second                ORDINAL  [450:456]
non-avian             NORP  [482:491]
1861                  DATE  [670:674]
twelve                CARDINAL  [676:682]
Solnhofen             GPE  [750:759]
London                GPE  [785:791]
1861                  DATE  [803:807]
Berlin                GPE  [832:838]
Darwin                PERSON  [948:954]

=== Dependency Parse ===

Sentence 1: Archaeopteryx (from the Greek archaîos “ancient” + ptéryx “wing”; German Urvogel, “primeval bird”) is a genus of bird-like dinosaurs that lived in the Late Jurassic (~150 Mya) in what is now southern Germany.
  Archaeopteryx → is            (nsubj)
  (            → Archaeopteryx  (punct)
  from         → Archaeopteryx  (prep)
  the          → ancient       (det)
  Greek        → ancient       (amod)
  ar

In [5]:
from collections import Counter
from typing import Any, Iterable, Optional

import spacy
from spacy.tokens import Span, Token


def most_common_by_text(seq: Iterable[Any]) -> Optional[Any]:
    """
    Return the object in `seq` whose `.text` value appears most frequently.
    If `seq` is empty, returns None.
    If multiple objects share the top frequency, returns the first one seen.
    """
    items = list(seq)  # materialize so generators can be scanned twice
    if not items:
        return None

    # key each object by its .text (or by the object itself if it has none)
    texts = [getattr(obj, "text", obj) for obj in items]

    # most_common() keeps first-seen order among equal counts
    top_text, _ = Counter(texts).most_common(1)[0]

    # return the first object whose text matches
    return next(obj for obj, text in zip(items, texts) if text == top_text)
        

def extract_topic(doc):
    """Extract the main topic of the document and its label."""
    # Topic candidates are noun/proper-noun Tokens whose dependency label is listed below;
    # named-entity Spans are kept separately and used only to look up a label.
    valid_dep = ["nsubj", "pobj", "nsubjpass", "punct", "prep"]  # dependency labels accepted for candidates
    candidates = []
    ent_candidates = []
    # first all named‐entity spans
    for ent in doc.ents:
        ent_candidates.append(ent)
    # then any noun/proper‐noun subjects
    for token in doc:
        if token.dep_ in valid_dep and token.pos_ in ("NOUN", "PROPN"):
            candidates.append(token)

    if not candidates:
        return None, None
    
    print(candidates)

    # pick the most frequent
    topic_obj = most_common_by_text(candidates)

    # if it's a Span, use its label_
    if isinstance(topic_obj, Span):
        return topic_obj.text, topic_obj.label_

    # if it's a Token inside an entity, use token.ent_type_
    if isinstance(topic_obj, Token) and topic_obj.ent_type_:
        return topic_obj.text, topic_obj.ent_type_

    # check if this token text is a named entity
    for ent in ent_candidates:
        if topic_obj.text == ent.text:
            return topic_obj.text, ent.label_
    # otherwise no entity label is available; return the "NaT" placeholder
    return topic_obj.text, "NaT"  # "NaT" = Not a Topic


topic, label = extract_topic(doc)
print("Main topic:", topic)
print("Label:", label)


[Archaeopteryx, Urvogel, dinosaurs, Jurassic, traits, Archaeopteryx, fossils, evolution, feather, specimens, quarries, specimen, fossils, theory, evolution, feathers, birds, relationships]
Main topic: Archaeopteryx
Label: NaT


In [7]:
def sent_subtree(tok):
    return " ".join(t.text for t in tok.subtree)

#========================================#
# ORIGINAL                               #
#========================================#

# subject-predicate-object
def extract_spo(sent):
    # 1) Find the main verb
    roots = [t for t in sent if t.dep_ == "ROOT"]
    if not roots:
        return None
    root = roots[0]

    # 2) Subject(s): any child with “subj” in its dep_
    subs = [child for child in root.lefts if "subj" in child.dep_]
    # 3) Object(s): expand to attr, acomp, dobj, pobj, iobj
    objs = [
        child
        for child in root.rights
        if child.dep_ in ("dobj", "pobj", "iobj", "attr", "acomp")
    ]

    if not subs or not objs:
        return None

    # 4) Build full phrases
    subj_phrase = sent_subtree(subs[0])
    # print(f"Subj text: {subs[0].text}")  # Name of subject
    # print(f"Subj type: {subs[0].dep_}") # TODO: use this to look for connections when creating multiple nodes
    # print(f"Obj text: {objs[0].text}")    # Name of object
    # print(f"Obj type: {objs[0].dep_}")
    obj_phrase  = sent_subtree(objs[0])

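    # 5) Build the relation: include auxiliaries, negations, particles.
    #    Look at all children (not only the left ones) and keep surface order,
    #    so that "is not" and "give up" are captured as well.
    mods = [tok for tok in root.children if tok.dep_ in ("aux", "auxpass", "neg", "prt")]
    rel = " ".join(tok.text for tok in sorted(mods + [root], key=lambda t: t.i))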

    return (subj_phrase, rel, obj_phrase)


for i, sent in enumerate(doc.sents, 1):
    spo = extract_spo(sent)
    if spo:
        print(f"Triple {i}:", spo)

Triple 1: ('Archaeopteryx ( from the Greek archaîos “ ancient ” + ptéryx “ wing ” ; German Urvogel , “ primeval bird ” )', 'is', 'a genus of bird - like dinosaurs that lived in the Late Jurassic ( ~150 Mya ) in what is now southern Germany')
Triple 2: ('it', 'combined', 'avian features ( broad , flight - capable wings and advanced feather impressions )')
Triple 5: ('The “ London ” specimen ( 1861 ) and the more complete “ Berlin ” specimen ( 1874–75 )', 'remain', 'the best - preserved examples')
Triple 6: ('These fossils', 'bolstered', 'early acceptance of Darwin ’s theory of evolution')
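
The phrases in the triples above are built by joining tokens with single spaces, which is why they contain artifacts such as "bird - like" and "Darwin ’s". A minimal sketch of a clean-up step follows. `tidy_phrase` is a hypothetical helper (it is not part of spaCy or of this notebook), and its regex rules are heuristics that only cover the punctuation seen in this output.

```python
import re


def tidy_phrase(phrase: str) -> str:
    """Heuristically undo the spacing introduced by " ".join(token texts)."""
    # no space before closing punctuation or a closing curly quote
    phrase = re.sub(r"\s+([,.;:!?)\]”])", r"\1", phrase)
    # no space after an opening bracket or an opening curly quote
    phrase = re.sub(r"([(\[“])\s+", r"\1", phrase)
    # re-join hyphenated compounds: "bird - like" -> "bird-like"
    phrase = re.sub(r"(?<=\w) - (?=\w)", "-", phrase)
    # re-attach clitics that the tokenizer split off: "Darwin ’s" -> "Darwin’s"
    phrase = re.sub(r"\s+(’s\b|n’t\b)", r"\1", phrase)
    return phrase


raw_phrases = [
    "the best - preserved examples",
    "early acceptance of Darwin ’s theory of evolution",
]
for raw in raw_phrases:
    print(tidy_phrase(raw))
```

An alternative that avoids regex heuristics is to join `t.text_with_ws` over the subtree inside `sent_subtree` and strip the result, since spaCy tokens keep their original trailing whitespace.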


In [None]:
from pydantic import BaseModel, Field

class Node(BaseModel):
    """A node in the knowledge graph."""
    node_type: str = Field(..., description="Type of the node (e.g., 'PERSON', 'LOCATION')")
    title: str = Field(..., description="Title or Topic of the node")
    id: str = Field(..., description="Unique identifier of the node")
    triples: list[tuple[str, str, str]]
    
    
class Edge(BaseModel):
    """An edge in the knowledge graph."""
    source: str = Field(..., description="Source node id")
    target: str = Field(..., description="Target node id")
    relation: str = Field(..., description="Relation type between source and target nodes")
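

One way the extracted triples could populate these models is sketched here. It is an illustration under stated assumptions, not the notebook's intended design: plain dataclasses with the same field names stand in for the pydantic models so the block runs without extra dependencies, `build_graph` and the "n0", "n1", ... id scheme are hypothetical, and every node receives the placeholder type "NaT".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Dataclass stand-in for the pydantic Node model (assumption)."""
    node_type: str
    title: str
    id: str
    triples: list[tuple[str, str, str]] = field(default_factory=list)


@dataclass
class Edge:
    """Dataclass stand-in for the pydantic Edge model (assumption)."""
    source: str
    target: str
    relation: str


def build_graph(triples, default_type="NaT"):
    """Create one node per distinct phrase and one edge per triple."""
    nodes: dict[str, Node] = {}
    edges: list[Edge] = []
    for subj, rel, obj in triples:
        for phrase in (subj, obj):
            if phrase not in nodes:
                # hypothetical id scheme: running counter in insertion order
                nodes[phrase] = Node(default_type, phrase, f"n{len(nodes)}")
        # store the triple on its subject node and link subject -> object
        nodes[subj].triples.append((subj, rel, obj))
        edges.append(Edge(nodes[subj].id, nodes[obj].id, rel))
    return list(nodes.values()), edges


# two triples copied from the output of extract_spo above
triples = [
    ("The “ London ” specimen ( 1861 ) and the more complete “ Berlin ” specimen ( 1874–75 )",
     "remain", "the best - preserved examples"),
    ("These fossils", "bolstered", "early acceptance of Darwin ’s theory of evolution"),
]
nodes, edges = build_graph(triples)
for edge in edges:
    print(edge.source, f"-[{edge.relation}]->", edge.target)
```

Keying nodes by the exact phrase text means "it" and "Archaeopteryx" would become separate nodes; resolving such references to the main topic is left open here.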