[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/advanced/12_Unstructured_to_Ontology.ipynb)

# Unstructured Text to Ontology

This advanced guide shows how to extract a structured ontology from unstructured text. It covers two approaches available in Semantica:

1.  **Classical NLP Pipeline**: Using Named Entity Recognition (NER) and Relation Extraction.
2.  **Generative AI Pipeline**: Using Large Language Models (LLMs) for direct conceptual modeling.

We will compare both approaches, visualize the results, and export the chosen ontology to OWL.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/ontology/)

## Setup and Installation

Ensure you have Semantica installed with all dependencies.

In [None]:
!pip install -qU semantica

In [None]:


from semantica.utils.logging import get_logger

logger = get_logger("unstructured_guide")
print("Environment setup complete.")

---

## The Input Text

We will use a rich paragraph of text describing a technology company to test both extraction methods.

In [None]:
text_corpus = """
QuantumDynamics is a leading AI research lab founded by Dr. Elena Rostova in 2018. 
The lab is headquartered in Zurich, Switzerland, and focuses on quantum computing algorithms. 
Dr. Rostova serves as the Chief Scientist. 
The lab has released products like the Q-1 Processor and the NeuralBridge SDK. 
QuantumDynamics collaborates with major universities such as MIT and ETH Zurich.
"""

---

## Approach 1: The Classical NLP Pipeline

This approach builds the ontology from the bottom up:
1.  **Extract Entities**: Identify nouns/proper nouns (e.g., "QuantumDynamics", "Zurich").
2.  **Extract Relations**: Identify verbs connecting them (e.g., "headquartered in").
3.  **Generate Ontology**: Map these triplets to Classes and Properties.

**Pros**: Deterministic, traceable, works offline.

**Cons**: Limited to the entity and relation types that the underlying NLP model can recognize.
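The three steps above can be sketched without any library at all. The snippet below is a minimal illustration, not Semantica's implementation: `toy_entities`, `toy_triples`, and `triples_to_schema` are hypothetical names, and the labels (`ORG`, `PERSON`, `GPE`) are assumed for the example. It lifts instance-level triples to class-level properties by replacing each subject and object with its entity label.

```python
from collections import defaultdict

# Hypothetical extractor output: (text, label) pairs and
# (subject, predicate, object) triples. These values are made up.
toy_entities = [
    ("QuantumDynamics", "ORG"),
    ("Elena Rostova", "PERSON"),
    ("Zurich", "GPE"),
]
toy_triples = [
    ("QuantumDynamics", "founded_by", "Elena Rostova"),
    ("QuantumDynamics", "headquartered_in", "Zurich"),
]


def triples_to_schema(entities, triples):
    """Lift instance-level triples to class-level properties."""
    label_of = dict(entities)
    # Each distinct entity label becomes a class.
    classes = sorted(set(label_of.values()))

    # Each predicate becomes a property whose domain and range are
    # the labels of the subjects and objects it connects.
    signatures = defaultdict(set)
    for subj, pred, obj in triples:
        if subj in label_of and obj in label_of:
            signatures[pred].add((label_of[subj], label_of[obj]))

    properties = [
        {"name": pred, "domain": dom, "range": rng}
        for pred, pairs in sorted(signatures.items())
        for dom, rng in sorted(pairs)
    ]
    return {"classes": classes, "properties": properties}


schema = triples_to_schema(toy_entities, toy_triples)
print(schema["classes"])
for prop in schema["properties"]:
    print(f'{prop["domain"]} --{prop["name"]}--> {prop["range"]}')
```

Because the schema is derived only from labels the extractor produced, any entity type the model does not recognize never becomes a class, which is the main limitation of this approach.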

In [None]:
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.ontology import OntologyGenerator, OntologyOptimizer

# 1. Initialize Extractors
ner = NERExtractor()
# Named rel_extractor rather than `re` to avoid shadowing Python's standard `re` module.
rel_extractor = RelationExtractor()

# 2. Extract Entities
print("Extracting entities...")
entities = ner.extract(text_corpus)

# Note: entities are returned as Entity objects (dataclasses), not dictionaries.
# We access properties using dot notation (e.g., entity.text, entity.label).
print(f"Found {len(entities)} entities.")
for e in entities[:5]:
    print(f" - {e.text} ({e.label}) [Conf: {e.confidence}]")

# 3. Extract Relationships
print("\nExtracting relationships...")
relationships = rel_extractor.extract(text_corpus, entities)

# Note: relationships are returned as Relation objects.
print(f"Found {len(relationships)} relationships.")
for r in relationships:
    print(f" - {r.subject.text} -> {r.predicate} -> {r.object.text}")

# 4. Prepare Data for Ontology Generation
# The OntologyGenerator expects dictionaries, so we convert our objects.
# Objects that expose to_dict() are converted directly; otherwise we build the
# dictionary from their attributes, with fallbacks for alternative attribute names.
entities_data = []
for e in entities:
    if hasattr(e, 'to_dict'):
        entities_data.append(e.to_dict())
    else:
        # Manual conversion for dataclasses without to_dict
        entities_data.append({
            "id": getattr(e, "text", str(e)),
            "text": getattr(e, "text", str(e)),
            "type": getattr(e, "label", getattr(e, "type", "Unknown")),
            "confidence": getattr(e, "confidence", 1.0)
        })

relationships_data = []
for r in relationships:
    if hasattr(r, 'to_dict'):
        relationships_data.append(r.to_dict())
    else:
        # Manual conversion for dataclasses without to_dict
        # Handle nested Entity objects in subject/object fields
        subj = r.subject
        obj = r.object
        subj_text = getattr(subj, "text", str(subj))
        obj_text = getattr(obj, "text", str(obj))
        
        relationships_data.append({
            "source": subj_text,
            "target": obj_text,
            "type": getattr(r, "predicate", getattr(r, "type", "related_to")),
            "confidence": getattr(r, "confidence", 1.0)
        })

# 5. Generate Structure
generator = OntologyGenerator()
nlp_ontology = generator.generate_ontology(
    {"entities": entities_data, "relationships": relationships_data},
    name="QuantumOntologyNLP"
)

# 6. Optimize (Clean up)
optimizer = OntologyOptimizer()
nlp_ontology = optimizer.optimize_ontology(nlp_ontology, remove_redundancy=True)

print(f"\nGenerated NLP Ontology with {len(nlp_ontology['classes'])} classes and {len(nlp_ontology['properties'])} properties.")

---

## Approach 2: The Generative AI Pipeline (LLM)

This approach uses a Large Language Model to "read" the text and directly propose a schema.

**Pros**: Context-aware, can handle ambiguity, generates human-like class names.

**Cons**: Non-deterministic, requires API access.

*Note: This step requires a configured LLM provider (e.g., OpenAI).* 
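Before running the LLM cell, it can help to check whether a key is configured at all. The snippet below is a small sketch using only the standard library; `llm_provider_ready` is a hypothetical helper, and it assumes the provider reads its key from the `OPENAI_API_KEY` environment variable.

```python
import os


def llm_provider_ready(env_var: str = "OPENAI_API_KEY") -> bool:
    """Return True when the named environment variable holds a non-empty value."""
    return bool(os.environ.get(env_var, "").strip())


if llm_provider_ready():
    print("API key found; the LLM pipeline can run.")
else:
    print("OPENAI_API_KEY is not set; the LLM step will be skipped.")
```

This only confirms that a value is present. It does not verify that the key is valid or that the provider is reachable.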

In [None]:
from semantica.ontology import LLMOntologyGenerator

try:
    # Initialize LLM Generator (ensure OPENAI_API_KEY is set in env)
    llm_gen = LLMOntologyGenerator(provider="openai", model="gpt-4")
    
    print("Generating ontology with LLM...")
    llm_ontology = llm_gen.generate_ontology_from_text(
        text=text_corpus,
        name="QuantumOntologyLLM"
    )
    
    print(f"Generated LLM Ontology with {len(llm_ontology['classes'])} classes and {len(llm_ontology['properties'])} properties.")
    print("Classes detected:", [c['name'] for c in llm_ontology['classes']])
    
except Exception as e:
    print(f"Skipping LLM generation: {e}")
    llm_ontology = None

---

## Comparing Results with Visualization

Let's visualize both ontologies one after the other (the LLM ontology only if it was generated) to see the difference in structure. The NLP pipeline tends to be more literal, while the LLM tends to be more conceptual.
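Alongside the visual comparison, a quick textual diff of class names can make the differences concrete. The sketch below assumes each ontology is a dict with a `classes` list whose items carry a `name` key; `class_names`, `compare_ontologies`, and the toy dicts are hypothetical and stand in for real pipeline output.

```python
def class_names(ontology):
    """Collect lower-cased class names from an ontology dict."""
    return {c["name"].lower() for c in ontology.get("classes", [])}


def compare_ontologies(first, second):
    """Report which class names are shared and which are unique to each side."""
    names_first, names_second = class_names(first), class_names(second)
    return {
        "shared": sorted(names_first & names_second),
        "only_first": sorted(names_first - names_second),
        "only_second": sorted(names_second - names_first),
    }


# Toy stand-ins for the two pipelines' output (made-up values).
nlp_example = {"classes": [{"name": "ORG"}, {"name": "PERSON"}, {"name": "GPE"}]}
llm_example = {
    "classes": [{"name": "Organization"}, {"name": "Person"}, {"name": "Product"}]
}

diff = compare_ontologies(nlp_example, llm_example)
for key, names in diff.items():
    print(f"{key}: {names}")
```

The comparison is purely lexical, so `ORG` and `Organization` count as different classes. Aligning such near-synonyms would need a mapping step that this sketch does not attempt.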

In [None]:
from semantica.visualization import OntologyVisualizer

visualizer = OntologyVisualizer()

print("--- NLP Approach Visualization ---")
fig_nlp = visualizer.visualize_structure(nlp_ontology, output="interactive")
if fig_nlp: fig_nlp.show()

if llm_ontology:
    print("--- LLM Approach Visualization ---")
    fig_llm = visualizer.visualize_structure(llm_ontology, output="interactive")
    if fig_llm: fig_llm.show()

---

## Export to OWL

Finally, we choose the best model (or merge them using `ReuseManager`, covered in other guides) and export it.

In [None]:
from semantica.export import OWLExporter

exporter = OWLExporter()

# Prefer the LLM ontology when it was generated; otherwise fall back to the NLP ontology
target_ontology = llm_ontology if llm_ontology else nlp_ontology

output_file = "quantum_ontology.ttl"
exporter.export(target_ontology, output_file, format="turtle")
print(f"Successfully exported ontology to {output_file}")
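To help read the exported file, the sketch below shows roughly how classes and properties map to OWL statements in Turtle syntax. It is a hand-rolled illustration, not how `OWLExporter` works: `to_turtle` is a hypothetical function, the base IRI is a placeholder, and the `domain` and `range` keys on each property are assumptions about the ontology dict.

```python
def to_turtle(ontology, base_iri="http://example.org/quantum#"):
    """Render a small ontology dict as OWL statements in Turtle syntax."""
    lines = [
        f"@prefix : <{base_iri}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    # One owl:Class declaration per class.
    for cls in ontology["classes"]:
        lines.append(f":{cls['name']} a owl:Class .")
    # One owl:ObjectProperty per property, with its domain and range.
    for prop in ontology["properties"]:
        lines.append(
            f":{prop['name']} a owl:ObjectProperty ;\n"
            f"    rdfs:domain :{prop['domain']} ;\n"
            f"    rdfs:range :{prop['range']} ."
        )
    return "\n".join(lines) + "\n"


# Made-up example data; class and property names must be valid Turtle local names.
example = {
    "classes": [{"name": "Organization"}, {"name": "Person"}],
    "properties": [
        {"name": "foundedBy", "domain": "Organization", "range": "Person"}
    ],
}
turtle_text = to_turtle(example)
print(turtle_text)
```

The real exporter's output may use different prefixes and include additional metadata, so treat this only as a reading aid for the `.ttl` file.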

## Summary

You have learned to:
1.  **Extract Ontologies Programmatically**: Using `NERExtractor`, `RelationExtractor`, and `OntologyGenerator` for deterministic, data-driven modeling.
2.  **Generate Ontologies with AI**: Using `LLMOntologyGenerator` for conceptual, high-level modeling.
3.  **Visualize and Compare**: Using `OntologyVisualizer` to inspect the structural differences.
4.  **Optimize and Export**: Cleaning up the ontology with `OntologyOptimizer` and saving it in Turtle format with `OWLExporter`.