[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/01_Welcome_to_Semantica.ipynb)

# Welcome to Semantica

**Open Source Framework for Semantic Layer & Knowledge Engineering**

Semantica is a Python framework for transforming raw, messy, multi-source data into **semantic layers** and **knowledge graphs** that are ready to power GraphRAG, AI agents, multi-agent systems, and analytical applications.

This notebook is an executable introduction. It combines:

- High-level explanation of what Semantica is and why it exists
- A structured tour of the architecture and key modules
- Small, runnable code snippets that show the end-to-end flow

**Use this notebook to understand the big picture, not to learn every API in depth.**

In [None]:
!pip install -U semantica


## What Is Semantica?

Semantica is a **semantic intelligence and knowledge engineering framework**. It helps you:

- Build **knowledge graphs** from unstructured and semi-structured data
- Create a unified **semantic layer** on top of diverse data sources
- Power **GraphRAG**, AI agents, and multi-agent systems with structured knowledge
- Incorporate **temporal and quality-aware reasoning** into your applications

### Core Capabilities

- **Universal ingestion**: Files, web, feeds, databases, repositories, streams
- **Rich parsing**: PDFs, Office documents, HTML, JSON, CSV, images, code
- **Normalization**: Cleaning, language detection, entity normalization, date/number standardization
- **Semantic extraction**: Named entities, relationships, events, semantic networks
- **Knowledge graph construction**: Property graphs from entities and relations
- **Embeddings and vector search**: Text and graph embeddings, hybrid retrieval
- **Reasoning and ontology**: Rule-based inference, ontology generation and validation
- **Visualization and analytics**: Graph visualizations and quality metrics

## Who Is Semantica For?

- **AI/ML engineers** building GraphRAG systems, agents, and tools that need long-term memory
- **Data engineers** orchestrating semantic enrichment pipelines over large, heterogeneous datasets
- **Knowledge engineers and ontologists** designing and maintaining formal knowledge structures
- **Researchers and analysts** creating domain knowledge graphs from documents and data feeds
- **Product and platform teams** embedding semantic intelligence into applications and services

## Architecture Overview

Semantica is organized as three conceptual layers and multiple concrete modules.

### Layers

- **Input Layer**
  - Connects to files, web pages, APIs, databases, email, feeds, repositories, and streams
  - Normalizes these different sources into a unified internal representation

- **Semantic Layer**
  - Performs parsing, cleaning, semantic extraction, graph construction, embeddings, and reasoning
  - This is where **unstructured data becomes structured knowledge**

- **Output Layer**
  - Exposes knowledge graphs, embeddings, ontologies, and analytics
  - Integrates with vector stores, graph databases, and downstream applications
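
The three layers can be pictured as a chain of plain functions. The sketch below is purely illustrative: `input_layer`, `semantic_layer`, and `output_layer` are hypothetical names rather than Semantica APIs, and the "extraction" step is a naive capitalized-word heuristic standing in for real semantic extraction.

```python
import json
import re

def input_layer(sources):
    """Normalize heterogeneous sources into one record shape."""
    return [{"id": name, "text": text.strip()} for name, text in sources.items()]

def semantic_layer(records):
    """Turn raw text into structured items (toy heuristic: capitalized words)."""
    found = set()
    for record in records:
        found.update(re.findall(r"\b[A-Z][a-z]+\b", record["text"]))
    return {"nodes": sorted(found)}

def output_layer(knowledge):
    """Expose the structured result to downstream consumers as JSON."""
    return json.dumps(knowledge)

# Hypothetical input: one source name mapped to its raw text.
toy_sources = {"note.txt": "  Apple was founded in Cupertino.  "}
layered_result = output_layer(semantic_layer(input_layer(toy_sources)))
print(layered_result)
```

Each function only depends on the shape of the previous layer's output, which is the property that lets the concrete modules below be swapped or combined.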

## 🧩 Semantica Modules Reference

Semantica is modular by design. Here is a comprehensive guide to all available modules, grouped by functionality.

### 📥 Ingestion & Parsing
Modules that handle raw data input and structure.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`ingest`** | **Data Ingestion**<br>Connects to data sources. | • File, Web, Feed, Stream ingestion<br>• DB, Email, Repo, MCP support |
| **`parse`** | **Document Parsing**<br>Parses raw content into structures. | • PDF, HTML, JSON, CSV, Excel<br>• Image & Code parsing |

### ⚙️ Data Processing
Modules that clean, normalize, and split data.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`normalize`** | **Data Normalization**<br>Cleans and standardizes text. | • Text cleaning & Language detection<br>• Entity, Date, Number normalization |
| **`split`** | **Chunking**<br>Splits documents for RAG. | • Recursive character splitting<br>• Semantic & Token-based splitting |

### 🧠 Extraction & Enrichment
Modules that extract meaning, structure, and vectors from raw data.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`semantic_extract`** | **Information Extraction**<br>Extracts entities and relations. | • NER & Relation Extraction<br>• Event & Semantic Network detection |
| **`context`** | **Agent Memory**<br>Manages state for AI agents. | • Long-term memory & history<br>• Context graph & RAG integration |

### 🕸️ Knowledge Graph Core
Modules for building, refining, and resolving knowledge graphs.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`kg`** | **Graph Construction**<br>Builds and analyzes graphs. | • Graph Building & Analysis<br>• Validation & Entity Resolution |
| **`conflicts`** | **Conflict Resolution**<br>Resolves data contradictions. | • Source reliability scoring<br>• Truth discovery algorithms |
| **`deduplication`** | **Entity Resolution**<br>Merges duplicate entities. | • Similarity-based blocking<br>• Clustering & Canonicalization |
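
As a rough illustration of what similarity-based clustering and canonicalization involve, the sketch below groups near-identical entity names with the standard library's `difflib`. The 0.8 threshold, the `normalize_name` helper, and the choice of the longest spelling as the canonical form are assumptions made for this example, not Semantica's algorithm.

```python
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def cluster_names(names, threshold=0.8):
    """Greedy clustering: add a name to the first cluster whose head it resembles."""
    clusters = []
    for name in names:
        for cluster in clusters:
            ratio = SequenceMatcher(
                None, normalize_name(name), normalize_name(cluster[0])
            ).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([name])
    return clusters

toy_names = ["Apple Inc.", "Apple Inc", "apple inc", "Steve Jobs", "Steve Wozniak"]
name_clusters = cluster_names(toy_names)
# Canonicalization (assumed policy): keep the longest spelling in each cluster.
canonical_names = [max(cluster, key=len) for cluster in name_clusters]
print(name_clusters)
print(canonical_names)
```

Comparing every name against every cluster does not scale to large graphs, which is why blocking (comparing only within candidate groups) is normally applied first.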

### 💾 Storage & Retrieval
Modules for persisting and querying data.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`embeddings`** | **Vector Embeddings**<br>Generates semantic vectors. | • Text & Graph embeddings<br>• Multi-provider support (OpenAI, etc.) |
| **`vector_store`** | **Vector Database**<br>Stores and searches vectors. | • Similarity search & Filtering<br>• Hybrid search (Vector + Keyword) |
| **`graph_store`** | **Property Graph Store**<br>Persists graph data. | • Neo4j, FalkorDB adapters<br>• Cypher query support |
| **`triplet_store`** | **RDF Store**<br>Persists semantic triplets. | • SPARQL endpoints<br>• BlazeGraph, Jena, Virtuoso adapters |

### 🔎 Reasoning & Analysis
Modules for deriving new knowledge and evaluating quality.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`reasoning`** | **Reasoner Facade**<br>Unified interface for inference. | • Datalog/Rule-based inference<br>• Forward/Backward chaining<br>• Automated explanation generation |
| **`ontology`** | **Ontology Management**<br>Manages schema and definitions. | • Ontology generation from data<br>• Validation & Evolution |
| **`visualization`** | **Visual Analytics**<br>Visualizes graphs and metrics. | • 2D/3D Graph visualization<br>• Interactive plots & dashboards |
| **`evals`** | **Evaluation**<br>Benchmarks pipeline quality. | • RAG & Graph quality metrics<br>• Ground truth comparison |

### 🛠️ Orchestration & Utils
Modules for managing the framework and workflows.

| Module | Description | Key Capabilities |
| :--- | :--- | :--- |
| **`core`** | **Framework Core**<br>Main entry point and config. | • Lifecycle management<br>• Plugin system & Configuration |
| **`pipeline`** | **Workflow Orchestration**<br>Manages complex flows. | • DAG execution & Retries<br>• Error handling & Observability |
| **`seed`** | **Data Seeding**<br>Initializes knowledge bases. | • Taxonomy & Ontology seeding<br>• Reference data loading |
| **`export`** | **Data Export**<br>Exports data to files. | • JSON, CSV, RDF, GEXF export<br>• Report generation |
| **`utils`** | **Utilities**<br>Common helper functions. | • Logging, Async, Hashing<br>• Text processing helpers |
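
DAG execution with retries can be sketched with the standard library alone. The step names below echo the module names in the tables, but the `run_dag` helper, the step functions, and the retry policy are hypothetical and are not the `pipeline` module's API.

```python
from graphlib import TopologicalSorter

def run_dag(steps, dependencies, max_retries=2):
    """Run steps in dependency order, retrying each failing step a few times."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        for attempt in range(1, max_retries + 2):
            try:
                results[name] = steps[name](results)
                break
            except RuntimeError:
                if attempt > max_retries:
                    raise  # retries exhausted: surface the error
    return order, results

parse_calls = {"count": 0}

def flaky_parse(results):
    """Fail once, then succeed, to exercise the retry path."""
    parse_calls["count"] += 1
    if parse_calls["count"] == 1:
        raise RuntimeError("transient parse failure")
    return results["ingest"].upper()

toy_steps = {
    "ingest": lambda results: "raw text",
    "parse": flaky_parse,
    "extract": lambda results: results["parse"].split(),
}
# Map each step to the set of steps it depends on.
toy_dependencies = {"ingest": set(), "parse": {"ingest"}, "extract": {"parse"}}

run_order, run_results = run_dag(toy_steps, toy_dependencies)
print(run_order)
print(run_results["extract"])
```

Passing the accumulated `results` dictionary to each step keeps the example small; a production orchestrator would typically also record timing and error details for observability.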

## Core Concepts (High-Level)

- **Knowledge graph**
  - Nodes represent entities such as people, organizations, locations, events, or concepts
  - Edges represent relationships such as `works_for`, `located_in`, `founded_by`
  - Properties capture attributes and metadata such as timestamps, sources, and confidence

- **Entities and relationships**
  - Entities are extracted from text and data using NER
  - Relationships connect entities and are extracted using pattern-based, model-based, or LLM-based methods

- **Embeddings**
  - Numerical vectors that encode semantic meaning of text or graph structures
  - Used for semantic search, clustering, and similarity-based retrieval

- **GraphRAG**
  - Combines vector search with graph traversal
  - Uses both embeddings and graph structure to retrieve rich, context-aware information

- **Ontology**
  - A formal model of classes, relationships, and constraints in a domain
  - Used to standardize meaning, enable reasoning, and integrate heterogeneous data

- **Quality and governance**
  - Quality metrics (completeness, consistency, accuracy, coverage)
  - Conflict detection and resolution at the knowledge graph level
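
To make the knowledge graph and GraphRAG ideas concrete, the sketch below stores a tiny property graph in plain dictionaries, picks a seed node by cosine similarity against hand-made toy vectors, and then expands that seed by one hop. All vectors, property values, and helper names are illustrative assumptions; a real system would use learned embeddings and a graph store.

```python
import math

# Nodes carry properties; edges are (source, relation, target) triples.
toy_nodes = {
    "Apple Inc.": {"type": "Organization", "confidence": 0.98},
    "Steve Jobs": {"type": "Person", "confidence": 0.97},
    "Cupertino": {"type": "Location", "confidence": 0.95},
}
toy_edges = [
    ("Apple Inc.", "founded_by", "Steve Jobs"),
    ("Apple Inc.", "located_in", "Cupertino"),
]

# Toy 3-dimensional "embeddings", made up for this example.
toy_vectors = {
    "Apple Inc.": [0.9, 0.1, 0.0],
    "Steve Jobs": [0.2, 0.9, 0.1],
    "Cupertino": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def graph_rag_retrieve(query_vector):
    """Vector search picks a seed node; graph traversal adds its 1-hop facts."""
    seed = max(toy_vectors, key=lambda name: cosine(query_vector, toy_vectors[name]))
    facts = [edge for edge in toy_edges if seed in (edge[0], edge[2])]
    return seed, facts

seed_node, seed_facts = graph_rag_retrieve([1.0, 0.2, 0.0])
print(seed_node, toy_nodes[seed_node])
print(seed_facts)
```

The vector step finds where to start, and the traversal step supplies related facts that a pure similarity search over text chunks might miss.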

## Installation

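You can install Semantica from PyPI. This notebook uses the pip cell near the top so that it runs in both local Jupyter and Colab.

Equivalent shell commands (choose the base package or the `all` extra):

```bash
# Base install
pip install semantica

# Or install with the "all" extra; the quotes stop shells such as zsh from expanding the brackets
pip install "semantica[all]"
```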
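
After installing, it can be handy to confirm which version is present. The sketch below uses only the standard library; the distribution name `semantica` is taken from the pip commands above, and `installed_version` is a helper written for this example.

```python
from importlib import metadata

def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None

semantica_version = installed_version("semantica")
if semantica_version:
    print(f"semantica version: {semantica_version}")
else:
    print("semantica is not installed in this environment")
```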

## Basic Configuration

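Semantica uses configuration for API keys, embedding providers, and knowledge graph options. The cell below sets a few environment variables and writes a sample `config.yaml` to the working directory. The key values are placeholders, so replace them with your own before running anything that calls an external provider.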

In [None]:
import os
from pathlib import Path

# Placeholder values: replace them with real settings before calling a provider
os.environ["SEMANTICA_API_KEY"] = "your_openai_key"
os.environ["SEMANTICA_EMBEDDING_PROVIDER"] = "openai"
os.environ["SEMANTICA_MODEL_NAME"] = "gpt-4"

# Sample YAML configuration: API keys, embedding model, and graph backend
config_text = """api_keys:
  openai: your_key_here
  anthropic: your_key_here
embedding:
  provider: openai
  model: text-embedding-3-large
  dimensions: 3072
knowledge_graph:
  backend: networkx
  temporal: true
"""
# Write the file, then read it back to display its contents
Path("config.yaml").write_text(config_text, encoding="utf-8")
Path("config.yaml").read_text(encoding="utf-8")
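
One way to gather `SEMANTICA_`-prefixed environment variables into a single settings dictionary is sketched here. The prefix comes from the cell above; the `settings_from_env` helper is an assumption for illustration and says nothing about how Semantica itself resolves configuration.

```python
import os

def settings_from_env(environ, prefix="SEMANTICA_"):
    """Collect PREFIX_* variables into a dict with lowercase, unprefixed keys."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }

# Use an explicit mapping so the example does not depend on real secrets.
example_env = {
    "SEMANTICA_EMBEDDING_PROVIDER": "openai",
    "SEMANTICA_MODEL_NAME": "gpt-4",
    "PATH": "/usr/bin",
}
example_settings = settings_from_env(example_env)
print(example_settings)

# The same helper works on the real environment.
live_settings = settings_from_env(os.environ)
```

Avoid printing `live_settings` in shared notebooks, since it may contain an API key.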

## Setup: Create Sample Data

Before running the pipeline, create a small sample document to work with.

In [None]:
from pathlib import Path

docs_dir = Path("welcome_docs")
docs_dir.mkdir(exist_ok=True)
text_path = docs_dir / "apple.txt"
text_content = (
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak and Ronald Wayne in"
    " Cupertino, California."
)
text_path.write_text(text_content, encoding="utf-8")
print(f"Created sample document at {text_path}")

## Minimal End-to-End Pipeline

The next example shows how to explicitly use several modules in sequence. This mirrors the architecture discussed earlier:

1. Ingest a directory of documents
2. Parse them into structured documents
3. Normalize text
4. Extract entities and relationships
5. Build and analyze a knowledge graph
6. Create embeddings and store them in a vector store
7. Run a hybrid semantic search query

In [None]:
from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser
from semantica.normalize import TextNormalizer
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.kg import GraphBuilder, GraphAnalyzer
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore, HybridSearch

# 1. Ingest a directory of documents
ingestor = FileIngestor()
documents = ingestor.ingest(str(docs_dir))

# 2. Parse them into structured documents
parser = DocumentParser()
parsed_docs = parser.parse(documents)

# 3. Normalize text
normalizer = TextNormalizer()
normalized_docs = normalizer.normalize(parsed_docs)

# 4. Extract entities and relationships
ner = NERExtractor()
entities = ner.extract(normalized_docs)
rel_extractor = RelationExtractor()
relationships = rel_extractor.extract(normalized_docs, entities)

# 5. Build and analyze a knowledge graph
builder = GraphBuilder()
kg = builder.build(entities, relationships)
analyzer = GraphAnalyzer()
metrics = analyzer.analyze(kg)

# 6. Create embeddings and store them in a vector store
emb_generator = EmbeddingGenerator()
embeddings = emb_generator.generate_embeddings(documents, data_type="text")

vec_store = VectorStore()
vec_store.store(embeddings, documents, metadata={})

# 7. Run a hybrid semantic search query and show the number of results
hybrid = HybridSearch(vec_store)
search_results = hybrid.search("Apple founders", top_k=3)
len(search_results)
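
Hybrid search blends a vector-similarity score with a keyword score. The sketch below shows the idea with bag-of-words count vectors standing in for learned embeddings and a weighted sum of the two scores. The `alpha` weight, the tokenizer, and the scoring functions are assumptions for this example and do not describe the internals of `HybridSearch`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def keyword_score(query_tokens, doc_tokens):
    """Fraction of distinct query terms that appear in the document."""
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)

def hybrid_rank(query, docs, alpha=0.5, top_k=3):
    """Rank docs by alpha * vector score + (1 - alpha) * keyword score."""
    q_tokens = tokenize(query)
    scored = []
    for doc in docs:
        d_tokens = tokenize(doc)
        vector_part = count_cosine(Counter(q_tokens), Counter(d_tokens))
        keyword_part = keyword_score(q_tokens, d_tokens)
        scored.append((alpha * vector_part + (1 - alpha) * keyword_part, doc))
    return sorted(scored, reverse=True)[:top_k]

toy_docs = [
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak and Ronald Wayne.",
    "Cupertino is a city in California.",
]
ranked = hybrid_rank("Apple founders", toy_docs)
print(ranked[0][1])
```

Note that "founders" does not literally match "founded" here; learned embeddings are what let the vector side of a real hybrid search bridge that kind of gap.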

## Visualization

Semantica includes a visualization module. Here we create an interactive network graph from the knowledge graph built above.

In [None]:
from semantica.visualization import KGVisualizer

# Create a visualizer instance
viz = KGVisualizer(layout="force", color_scheme="vibrant")

# Generate an interactive network visualization
# This returns a Plotly figure object that renders in the notebook
fig = viz.visualize_network(kg, output="interactive")
fig.show()

## Ontology Generation

You can also automatically generate an ontology (a formal model of your domain) from the extracted entities and relationships.

In [None]:
from semantica.ontology import OntologyGenerator

generator = OntologyGenerator(base_uri="https://example.org/ontology/")

# Generate ontology from the extracted data
ontology = generator.generate_ontology({
    "entities": entities,
    "relationships": relationships
})

# View inferred classes
[cls["name"] for cls in ontology.get("classes", [])[:5]]
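
Conceptually, ontology generation derives classes from entity types and properties from the relations observed between them. The sketch below does this over hand-written records. The record shapes (`name`/`type` and `source`/`type`/`target`) and the `derive_ontology` helper are hypothetical and may differ from what `OntologyGenerator` expects or returns.

```python
def derive_ontology(entity_records, relation_records):
    """Infer classes from entity types and domain/range pairs from relations."""
    type_of = {entity["name"]: entity["type"] for entity in entity_records}
    class_names = sorted(set(type_of.values()))
    properties = {}
    for rel in relation_records:
        domain = type_of[rel["source"]]
        range_ = type_of[rel["target"]]
        properties.setdefault(rel["type"], set()).add((domain, range_))
    return {
        "classes": [{"name": name} for name in class_names],
        "properties": {name: sorted(pairs) for name, pairs in properties.items()},
    }

# Hand-written sample records (assumed shapes, see the note above).
toy_entities = [
    {"name": "Apple Inc.", "type": "Organization"},
    {"name": "Steve Jobs", "type": "Person"},
    {"name": "Steve Wozniak", "type": "Person"},
    {"name": "Cupertino", "type": "Location"},
]
toy_relationships = [
    {"source": "Apple Inc.", "type": "founded_by", "target": "Steve Jobs"},
    {"source": "Apple Inc.", "type": "founded_by", "target": "Steve Wozniak"},
    {"source": "Apple Inc.", "type": "located_in", "target": "Cupertino"},
]

toy_ontology = derive_ontology(toy_entities, toy_relationships)
print([cls["name"] for cls in toy_ontology["classes"]])
print(toy_ontology["properties"])
```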

## Advanced: Data Splitting and Chunking

For RAG applications, splitting documents into smaller chunks is essential. Semantica provides a `split` module for this purpose.

In [None]:
from semantica.split import TextSplitter

splitter = TextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_documents(documents)

print(f"Original documents: {len(documents)}")
print(f"Generated chunks: {len(chunks)}")
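
Fixed-size chunking with overlap works as a sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. The sketch below is a character-based stand-in written for this notebook, not the `TextSplitter` implementation, which may split on tokens or sentence boundaries instead.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return pieces

# 250 characters of repeating digits, so overlaps are easy to inspect.
sample_text = "".join(str(i % 10) for i in range(250))
sample_chunks = chunk_text(sample_text)
print([len(piece) for piece in sample_chunks])
```

The overlap means the last 20 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk when it is short enough.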

## Advanced: Reasoning and Inference

The `reasoning` module allows you to derive new facts from existing knowledge using logic rules.

In [None]:
from semantica.reasoning import Reasoner

# Simple rule: If X founded Y, then X works_for Y
rule = """
IF (?x founded ?y) THEN (?x works_for ?y)
"""

reasoner = Reasoner()
reasoner.add_rule(rule)
inferred_facts = reasoner.infer_facts(kg)

print(f"Inferred {len(inferred_facts)} new facts")
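
Forward chaining applies rules to known facts until no new facts appear. The sketch below encodes each rule as a pair of relation names over `(subject, relation, object)` triples. This representation, the `forward_chain` helper, and the extra `affiliated_with` rule are assumptions for illustration, not the `Reasoner` rule syntax or engine.

```python
def forward_chain(facts, rules):
    """Apply (if_relation -> then_relation) rules until a fixed point is reached."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for if_relation, then_relation in rules:
            for subject, relation, obj in list(known):
                new_fact = (subject, then_relation, obj)
                if relation == if_relation and new_fact not in known:
                    known.add(new_fact)
                    changed = True
    # Return only the facts that were derived, not the input facts.
    return known - set(facts)

toy_facts = {
    ("Steve Jobs", "founded", "Apple Inc."),
    ("Steve Wozniak", "founded", "Apple Inc."),
}
# First rule: IF (?x founded ?y) THEN (?x works_for ?y).
# Second rule (hypothetical) chains on the output of the first.
toy_rules = [("founded", "works_for"), ("works_for", "affiliated_with")]

toy_inferred = forward_chain(toy_facts, toy_rules)
print(sorted(toy_inferred))
```

The loop repeats until a full pass adds nothing, which is what allows the second rule to fire on facts produced by the first.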

## Advanced: Export and Persistence

You can save your knowledge graph to disk or export it to standard formats like CSV, JSON, or RDF.

In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, format="json", output_path="knowledge_graph.json")
print("Graph exported to knowledge_graph.json")
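
For reference, a graph serialized as JSON often uses a node-link layout. The sketch below writes and re-reads such a file with the standard library. The `nodes`/`links` layout and the file name `toy_graph.json` are assumptions for this example, and the file produced by `GraphExporter` may be structured differently.

```python
import json
import tempfile
from pathlib import Path

toy_graph = {
    "nodes": [
        {"id": "Apple Inc.", "type": "Organization"},
        {"id": "Steve Jobs", "type": "Person"},
    ],
    "links": [
        {"source": "Apple Inc.", "target": "Steve Jobs", "relation": "founded_by"},
    ],
}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_graph.json"
    out_path.write_text(json.dumps(toy_graph, indent=2), encoding="utf-8")
    reloaded_graph = json.loads(out_path.read_text(encoding="utf-8"))

# A lossless round trip means the reloaded data equals the original.
print(reloaded_graph == toy_graph)
```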

## Using the Core `Semantica` Class

For more complex systems, you can work directly with the `Semantica` core class and a configuration object. This gives you access to lifecycle management, plugin registration, and orchestration helpers.

In [None]:
from semantica.core import Semantica, ConfigManager

config_manager = ConfigManager()
config = config_manager.load_from_file("config.yaml")

framework = Semantica(config=config)
framework.initialize()

kb_result = framework.build_knowledge_base(
    sources=[str(docs_dir)],
    embeddings=True,
    graph=True,
)

framework.shutdown()
sorted(kb_result.keys())
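
Because `initialize()` and `shutdown()` bracket the work, it can help to guarantee that shutdown runs even when a step fails. The sketch below wraps that pattern in a context manager. `FakeFramework` is a stand-in class written for this example so the snippet runs without Semantica; only the `initialize` and `shutdown` method names are taken from the cell above.

```python
from contextlib import contextmanager

class FakeFramework:
    """Stand-in object that records lifecycle calls."""

    def __init__(self):
        self.events = []

    def initialize(self):
        self.events.append("initialize")

    def shutdown(self):
        self.events.append("shutdown")

@contextmanager
def running(framework):
    """Initialize on entry and always shut down on exit."""
    framework.initialize()
    try:
        yield framework
    finally:
        framework.shutdown()

fake = FakeFramework()
try:
    with running(fake) as fw:
        fw.events.append("work")
        raise ValueError("simulated failure during work")
except ValueError:
    pass  # the failure is expected; shutdown has already run

print(fake.events)
```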

## Where to Go Next

- Run the notebooks under `cookbook/introduction` for focused module overviews
- Explore `cookbook/use_cases` for domain-specific end-to-end workflows
- Read the **Core Concepts** documentation for deeper theory and best practices