[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/20_Triple_Store.ipynb)

# Triplet Store - Comprehensive Guide

## Overview

This notebook walks through Semantica's `triplet_store` module: storing RDF triples, running SPARQL queries, and persisting knowledge graphs across multiple backends.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/triplet_store/)

### Learning Objectives

By the end of this notebook, you will be able to:

- Register and manage triplet stores (Blazegraph, Jena, RDF4J, Virtuoso)
- Perform CRUD operations on RDF triples
- Execute SPARQL queries with optimization
- Use bulk loading for large datasets
- Work with multiple store backends
- Validate triples and track loading progress
- Choose the right backend for your use case

### What You'll Learn

| Component | Purpose | When to Use |
|-----------|---------|-------------|
| `TripletManager` | Store coordination | All triplet operations |
| `QueryEngine` | SPARQL execution | Running, caching, and optimizing queries |
| `BulkLoader` | High-volume loading | Large datasets |
| `BlazegraphAdapter` | Blazegraph backend | High performance |
| `JenaAdapter` | Jena backend | Java integration |
| `RDF4JAdapter` | RDF4J backend | Transaction support |
| `VirtuosoAdapter` | Virtuoso backend | Enterprise scale |

---

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install "semantica[all]"
```

---

## Step 1: Basic Triplet Store Operations

Let's start with the `TripletManager` for basic triplet store operations.

### What is TripletManager?

`TripletManager` is the main coordinator for triplet store operations:
- **Store Registration**: Register multiple backends
- **CRUD Operations**: Add, get, update, delete triples
- **Multi-Store**: Manage multiple stores simultaneously

In [None]:
from semantica.triplet_store import TripletManager
from semantica.semantic_extract.triple_extractor import Triple

# Create the triplet manager
manager = TripletManager()

# Register a Blazegraph store (assumes a Blazegraph server is running locally on port 9999)
store = manager.register_store(
    store_id="demo",
    store_type="blazegraph",
    endpoint="http://localhost:9999/blazegraph/sparql"
)

print(f"Registered store: {store.store_id}")
print(f"Store type: {store.store_type}")
print(f"Endpoint: {store.endpoint}")

# Create a triple
triple = Triple(
    subject="http://example.org/Alice",
    predicate="http://example.org/knows",
    object="http://example.org/Bob",
    confidence=0.95
)

# Add triple to store
result = manager.add_triple(triple, store_id="demo")
print(f"\nTriple added: {result['success']}")
print(f"Triple: {triple.subject} -> {triple.predicate} -> {triple.object}")
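
To make the data model concrete, here is a standard-library sketch that stores triples as plain tuples and matches them against a pattern in which `None` acts as a wildcard. It does not use Semantica at all: `SimpleTriple` and `match` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple

class SimpleTriple(NamedTuple):
    """Hypothetical stand-in for a subject-predicate-object statement."""
    subject: str
    predicate: str
    object: str

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None means 'any value'."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]

EX = "http://example.org/"
data = [
    SimpleTriple(EX + "Alice", EX + "knows", EX + "Bob"),
    SimpleTriple(EX + "Alice", EX + "hasAge", "30"),
    SimpleTriple(EX + "Bob", EX + "hasAge", "25"),
]

alice_facts = match(data, subject=EX + "Alice")   # everything about Alice
ages = match(data, predicate=EX + "hasAge")       # every hasAge statement
print(len(alice_facts), len(ages))
```

A real triple store answers the same kind of pattern query, but through indexes and SPARQL rather than a linear scan.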

## Step 2: Store Registration and Management

Register multiple stores and manage them.

### Supported Backends

| Backend | Best For | Performance | Features |
|---------|----------|-------------|----------|
| **Blazegraph** | Large datasets | Excellent | Full-text search; GPU acceleration via the separate Blazegraph GPU product |
| **Jena** | Java apps | Good | SHACL, inference |
| **RDF4J** | Transactions | Good | ACID, federation |
| **Virtuoso** | Enterprise | Excellent | SQL integration, clustering |

In [None]:
from semantica.triplet_store import register_store

# Register multiple stores using convenience function
blazegraph_store = register_store(
    "blazegraph_main",
    "blazegraph",
    "http://localhost:9999/blazegraph/sparql"
)

jena_store = register_store(
    "jena_backup",
    "jena",
    "http://localhost:3030/ds"
)

# List all registered stores
stores = manager.list_stores()
print(f"Registered stores: {stores}")

# Get specific store
store = manager.get_store("blazegraph_main")
print("\nStore details:")
print(f"  ID: {store.store_id}")
print(f"  Type: {store.store_type}")
print(f"  Endpoint: {store.endpoint}")

## Step 3: CRUD Operations

Perform Create, Read, Update, Delete operations on triples.

### Operations Overview

- **Create**: `add_triple()`, `add_triples()`
- **Read**: `get_triples()`
- **Update**: `update_triple()`
- **Delete**: `delete_triple()`

In [None]:
from semantica.triplet_store import add_triple, add_triples, get_triples, update_triple, delete_triple

# Create - Add single triple
triple1 = Triple(
    subject="http://example.org/Alice",
    predicate="http://example.org/hasAge",
    object="30"
)
result = add_triple(triple1, store_id="demo")
print(f"Added single triple: {result['success']}")

# Create - Add multiple triples
triples = [
    Triple("http://example.org/Alice", "http://example.org/hasCity", "New York"),
    Triple("http://example.org/Bob", "http://example.org/hasAge", "25"),
    Triple("http://example.org/Bob", "http://example.org/hasCity", "Boston")
]
result = add_triples(triples, store_id="demo")
print(f"\nAdded {result['total_triples']} triples in {result['batches']} batches")

# Read - Get triples for a subject
alice_triples = get_triples(
    subject="http://example.org/Alice",
    store_id="demo"
)
print(f"\nFound {len(alice_triples)} triples for Alice")

# Update - Change Alice's age
old_triple = Triple("http://example.org/Alice", "http://example.org/hasAge", "30")
new_triple = Triple("http://example.org/Alice", "http://example.org/hasAge", "31")
result = update_triple(old_triple, new_triple, store_id="demo")
print(f"\nUpdated triple: {result['success']}")

# Delete - Remove a triple
triple_to_delete = Triple("http://example.org/Bob", "http://example.org/hasCity", "Boston")
result = delete_triple(triple_to_delete, store_id="demo")
print(f"Deleted triple: {result['success']}")
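
On a SPARQL endpoint, CRUD calls like these generally correspond to SPARQL 1.1 Update requests. The sketch below builds `INSERT DATA` and `DELETE DATA` strings and expresses an update as a delete followed by an insert. How Semantica's adapters actually serialize triples is not shown in this notebook, so `to_term`, `statement`, `insert_data`, `delete_data`, and `update_data` are hypothetical helpers, and the rule "values starting with `http://` or `https://` are IRIs, everything else is a plain literal" is a simplifying assumption.

```python
def to_term(value: str) -> str:
    """Render a value as an IRI or a plain string literal (simplified rule)."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def statement(s: str, p: str, o: str) -> str:
    """Serialize one triple in SPARQL/Turtle-style syntax."""
    return f"{to_term(s)} {to_term(p)} {to_term(o)} ."

def insert_data(s: str, p: str, o: str) -> str:
    return f"INSERT DATA {{ {statement(s, p, o)} }}"

def delete_data(s: str, p: str, o: str) -> str:
    return f"DELETE DATA {{ {statement(s, p, o)} }}"

def update_data(old: tuple, new: tuple) -> str:
    """An update is a delete plus an insert; ';' separates the two operations."""
    return f"{delete_data(*old)} ;\n{insert_data(*new)}"

EX = "http://example.org/"
request = update_data(
    (EX + "Alice", EX + "hasAge", "30"),
    (EX + "Alice", EX + "hasAge", "31"),
)
print(request)
```

Sending both operations in one request keeps the old and new values from coexisting between two separate calls.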

## Step 4: SPARQL Query Execution

Execute SPARQL queries with the `QueryEngine`.

### Query Types

- **SELECT**: Retrieve variable bindings
- **ASK**: Boolean queries
- **CONSTRUCT**: Build RDF graphs
- **DESCRIBE**: Describe resources

In [None]:
from semantica.triplet_store import QueryEngine, BlazegraphAdapter

# Create query engine with caching
engine = QueryEngine(enable_caching=True, enable_optimization=True)

# Create adapter
adapter = BlazegraphAdapter(endpoint="http://localhost:9999/blazegraph/sparql")

# SELECT query
select_query = """
PREFIX ex: <http://example.org/>

SELECT ?person ?age ?city
WHERE {
    ?person ex:hasAge ?age .
    ?person ex:hasCity ?city .
}
ORDER BY DESC(?age)
LIMIT 10
"""

result = engine.execute_query(select_query, adapter)

print("Query Results:")
print(f"  Variables: {result.variables}")
print(f"  Results: {len(result.bindings)}")
print(f"  Execution time: {result.execution_time:.2f}s")
print(f"  Cached: {result.metadata.get('cached', False)}")

print("\nResults:")
for binding in result.bindings:
    person = binding.get('person', {}).get('value', '')
    age = binding.get('age', {}).get('value', '')
    city = binding.get('city', {}).get('value', '')
    print(f"  {person}: Age {age}, City {city}")
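
The `binding.get('person', {}).get('value', '')` pattern matches the W3C SPARQL 1.1 Query Results JSON Format, in which each binding maps a variable name to an object with `type` and `value` keys. The sketch below flattens a hand-written sample response of that shape. The sample data is illustrative, and `flatten_bindings` is a hypothetical helper rather than a Semantica function.

```python
import json

# Hand-written sample in the SPARQL 1.1 JSON results layout.
# The second row has no "city", as can happen with OPTIONAL patterns.
sample = json.loads("""
{
  "head": {"vars": ["person", "age", "city"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/Alice"},
     "age": {"type": "literal", "value": "31"},
     "city": {"type": "literal", "value": "New York"}},
    {"person": {"type": "uri", "value": "http://example.org/Bob"},
     "age": {"type": "literal", "value": "25"}}
  ]}
}
""")

def flatten_bindings(doc):
    """Turn each binding into {variable: value}, using None when unbound."""
    variables = doc["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in doc["results"]["bindings"]
    ]

rows = flatten_bindings(sample)
for row in rows:
    print(row)
```

Defaulting to `{}` before reading `value` is what keeps the loop from raising `KeyError` when a variable is unbound in a row.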

## Step 5: Query Optimization

Optimize SPARQL queries for better performance.

### Optimization Features

- **Query Planning**: Analyze execution steps
- **Cost Estimation**: Estimate query cost
- **Query Rewriting**: Optimize query structure
- **Caching**: Cache query results

In [None]:
from semantica.triplet_store import optimize_query, plan_query

# Original query
query = """
SELECT ?s ?p ?o
WHERE {
    ?s ?p ?o .
}
"""

# Optimize query (adds LIMIT if missing)
optimized = optimize_query(query, add_limit=True, default_limit=1000)
print("Optimized Query:")
print(optimized)

# Create query plan
plan = plan_query(query)
print("\nQuery Plan:")
print(f"  Original length: {len(plan.query)}")
print(f"  Optimized length: {len(plan.optimized_query)}")
print(f"  Estimated cost: {plan.estimated_cost}")
print("  Execution steps:")
for i, step in enumerate(plan.execution_steps, 1):
    print(f"    {i}. {step}")
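
The "add a LIMIT if missing" rewrite can be sketched with a naive text check. This is not how `optimize_query` is implemented (its internals are not shown here); `add_default_limit` is a hypothetical function, and a regex like this can be fooled by a `LIMIT` inside a subquery or a string literal, which a real optimizer would avoid by parsing the query.

```python
import re

_LIMIT_RE = re.compile(r"\bLIMIT\s+\d+", re.IGNORECASE)
_SELECT_RE = re.compile(r"\bSELECT\b", re.IGNORECASE)

def add_default_limit(query: str, default_limit: int = 1000) -> str:
    """Append LIMIT to a SELECT query that has none (naive text check)."""
    if not _SELECT_RE.search(query) or _LIMIT_RE.search(query):
        return query  # not a SELECT, or already bounded
    return f"{query.rstrip()}\nLIMIT {default_limit}"

unbounded = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . }"
bounded = "SELECT ?s WHERE { ?s ?p ?o } limit 5"

print(add_default_limit(unbounded))
print(add_default_limit(bounded))
```

Only `SELECT` queries are touched; an `ASK` query returns a single boolean, so a row limit would add nothing.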

## Step 6: Bulk Loading

Load large datasets efficiently with progress tracking.

### Bulk Loading Features

- **Batch Processing**: Process in configurable batches
- **Progress Tracking**: Monitor loading progress
- **Retry Mechanism**: Handle failures gracefully
- **Validation**: Validate before loading

In [None]:
from semantica.triplet_store import BulkLoader, LoadProgress

# Create bulk loader
loader = BulkLoader(
    batch_size=1000,
    max_retries=3
)

# Generate sample triples
large_dataset = [
    Triple(
        f"http://example.org/entity{i}",
        "http://example.org/hasName",
        f"Entity {i}"
    )
    for i in range(5000)
]

# Progress callback
def progress_callback(progress: LoadProgress):
    print(f"Progress: {progress.progress_percentage:.1f}% "
          f"({progress.loaded_triples}/{progress.total_triples}) "
          f"Batch {progress.current_batch}/{progress.total_batches}")

# Load triples with progress tracking
adapter = BlazegraphAdapter(endpoint="http://localhost:9999/blazegraph/sparql")
progress = loader.load_triples(
    large_dataset,
    adapter,
    progress_callback=progress_callback
)

print("\nLoading Complete:")
print(f"  Loaded: {progress.loaded_triples}/{progress.total_triples}")
print(f"  Failed: {progress.failed_triples}")
print(f"  Elapsed time: {progress.elapsed_time:.2f}s")
print(f"  Throughput: {progress.metadata.get('throughput', 0):.0f} triples/sec")
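
Batching with retries, the idea behind the loader above, can be sketched in a few lines. This is a generic illustration and not `BulkLoader`'s implementation: `batches`, `load_in_batches`, and `flaky_send` are hypothetical, and here `max_retries` counts total attempts per batch, which may differ from the library's meaning.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def load_in_batches(items, send, batch_size=1000, max_retries=3):
    """Send each batch, retrying on failure; return (loaded, failed) counts."""
    loaded = failed = 0
    for batch in batches(items, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                send(batch)
                loaded += len(batch)
                break
            except ConnectionError:
                if attempt == max_retries:
                    failed += len(batch)  # give up on this batch only
    return loaded, failed

# Demo sender that fails on its very first call and succeeds afterwards
calls = {"n": 0}

def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated outage")

loaded, failed = load_in_batches(list(range(5000)), flaky_send)
print(loaded, failed, calls["n"])
```

Because a failure is scoped to one batch, a transient error costs one retry of 1,000 items instead of restarting the whole load.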

## Step 7: Store Adapters

Work with different triplet store backends.

### Blazegraph Adapter

High-performance triplet store; GPU acceleration is available through the separate Blazegraph GPU product.

In [None]:
from semantica.triplet_store import BlazegraphAdapter

# Create Blazegraph adapter
blazegraph = BlazegraphAdapter(
    endpoint="http://localhost:9999/blazegraph/sparql",
    namespace="kb",
    timeout=30
)

# Add triples
triples = [
    Triple("http://example.org/Alice", "http://example.org/hasSkill", "Python")
]
result = blazegraph.add_triples(triples)
print(f"Blazegraph - Added: {result['success']}")

# Execute SPARQL query
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
result = blazegraph.execute_sparql(query)
print(f"Blazegraph - Found {len(result['bindings'])} results")

### Jena Adapter

Full-featured RDF framework with inference support.

In [None]:
from semantica.triplet_store import JenaAdapter

# Create Jena adapter (in-memory)
jena = JenaAdapter()

# Or connect to Fuseki endpoint
# jena = JenaAdapter(
#     endpoint="http://localhost:3030/ds",
#     dataset="default",
#     enable_inference=True
# )

# Add a class hierarchy (Dog is a subclass of Animal) and an instance (Fido is a Dog)
triples = [
    Triple(
        "http://example.org/Dog",
        "http://www.w3.org/2000/01/rdf-schema#subClassOf",
        "http://example.org/Animal"
    ),
    Triple(
        "http://example.org/Fido",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://example.org/Dog"
    )
]
result = jena.add_triples(triples)
print(f"Jena - Added: {result['success']}")

# Query for animals. Fido is returned as an Animal only when RDFS inference
# is active (see the enable_inference option in the Fuseki example above).
query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.org/>
SELECT ?animal WHERE {
    ?animal rdf:type ex:Animal .
}
"""
result = jena.query(query)
print(f"Jena - Found {len(result)} animals (with inference)")
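
Why Fido counts as an Animal can be shown without any triple store. The sketch below applies the RDFS subclass rule (if `x rdf:type C` and `C rdfs:subClassOf D`, then `x rdf:type D`) until nothing new is derived. It is a standard-library illustration of that one rule, not Jena's reasoner, and `infer_types` is a hypothetical name.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"

def infer_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf, up to a fixed point."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for s, p, o in inferred if p == SUBCLASS_OF}
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for child, parent in subclass:
                if o == child and (s, RDF_TYPE, parent) not in inferred:
                    inferred.add((s, RDF_TYPE, parent))
                    changed = True
    return inferred

asserted = {
    (EX + "Dog", SUBCLASS_OF, EX + "Animal"),
    (EX + "Fido", RDF_TYPE, EX + "Dog"),
}
closure = infer_types(asserted)
animals = sorted(s for s, p, o in closure if p == RDF_TYPE and o == EX + "Animal")
print(animals)
```

The triple "Fido is an Animal" appears only in `closure`, never in `asserted`, which is why the same SPARQL query returns different results with and without inference.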

### RDF4J Adapter

Java-based RDF framework with transaction support.

In [None]:
from semantica.triplet_store import RDF4JAdapter

# Create RDF4J adapter
rdf4j = RDF4JAdapter(
    server_url="http://localhost:8080/rdf4j-server",
    repository_id="test"
)

# Add triples with transaction
rdf4j.begin_transaction()
try:
    triple = Triple(
        "http://example.org/Alice",
        "http://example.org/hasEmail",
        "alice@example.org"
    )
    rdf4j.add_triple(
        subject=triple.subject,
        predicate=triple.predicate,
        object_literal=triple.object
    )
    rdf4j.commit_transaction()
    print("RDF4J - Transaction committed")
except Exception as e:
    rdf4j.rollback_transaction()
    print(f"RDF4J - Transaction rolled back: {e}")
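
The begin/commit/rollback pattern above can be wrapped in a context manager so that a rollback cannot be forgotten. The sketch uses a toy in-memory store to stay runnable; `StagedStore` and `transaction` are hypothetical and unrelated to `RDF4JAdapter`, and the `ex:` strings are placeholder names.

```python
from contextlib import contextmanager

class StagedStore:
    """Toy store: writes are staged and applied only on commit."""
    def __init__(self):
        self.triples = set()
        self._pending = None

    def begin(self):
        self._pending = set()

    def add(self, triple):
        self._pending.add(triple)

    def commit(self):
        self.triples |= self._pending
        self._pending = None

    def rollback(self):
        self._pending = None

@contextmanager
def transaction(store):
    """Commit if the block succeeds; roll back and re-raise otherwise."""
    store.begin()
    try:
        yield store
    except Exception:
        store.rollback()
        raise
    else:
        store.commit()

toy_store = StagedStore()
with transaction(toy_store) as tx:
    tx.add(("ex:Alice", "ex:hasEmail", "alice@example.org"))

try:
    with transaction(toy_store) as tx:
        tx.add(("ex:Bob", "ex:hasEmail", "bob@example.org"))
        raise ValueError("simulated failure")
except ValueError:
    pass  # the second transaction was rolled back

print(len(toy_store.triples))
```

Only Alice's triple survives: the failed block's write never reaches the committed set.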

### Virtuoso Adapter

Enterprise-grade RDF store with SQL integration.

In [None]:
from semantica.triplet_store import VirtuosoAdapter

# Create Virtuoso adapter (dba/dba are Virtuoso's default credentials;
# use real credentials outside a local demo)
virtuoso = VirtuosoAdapter(
    host="localhost",
    port=1111,
    user="dba",
    password="dba"
)

# Create named graph
graph_uri = "http://example.org/graph1"
virtuoso.create_graph(graph_uri)

# Add triples to named graph
triple = Triple(
    "http://example.org/Alice",
    "http://example.org/worksAt",
    "http://example.org/Company1"
)
virtuoso.add_triple(
    subject=triple.subject,
    predicate=triple.predicate,
    object=triple.object,
    graph=graph_uri
)

print(f"Virtuoso - Added triple to graph: {graph_uri}")

# Query specific graph
query = f"""
PREFIX ex: <http://example.org/>
SELECT ?person ?company
FROM <{graph_uri}>
WHERE {{
    ?person ex:worksAt ?company .
}}
"""
result = virtuoso.query(query)
print(f"Virtuoso - Found {len(result)} results in graph")
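
Named graphs turn triples into quads: each statement also records which graph it belongs to. The toy below keeps one set of triples per graph and restricts a lookup to a single graph, roughly what the `FROM <...>` clause does in the query above. `QuadStore` is hypothetical, and the `ex:` strings are placeholders. Whether a query without `FROM` sees every graph depends on the store's configuration; this sketch simply searches all of them.

```python
from collections import defaultdict

class QuadStore:
    """Toy dataset: each named graph holds its own set of triples."""
    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, s, p, o, graph):
        self.graphs[graph].add((s, p, o))

    def find(self, predicate, graph=None):
        """Match by predicate in one graph, or in all graphs if graph is None."""
        names = [graph] if graph is not None else list(self.graphs)
        return [
            t for name in names
            for t in self.graphs.get(name, ())
            if t[1] == predicate
        ]

ds = QuadStore()
ds.add("ex:Alice", "ex:worksAt", "ex:Company1", graph="http://example.org/graph1")
ds.add("ex:Bob", "ex:worksAt", "ex:Company2", graph="http://example.org/graph2")

in_graph1 = ds.find("ex:worksAt", graph="http://example.org/graph1")
everywhere = ds.find("ex:worksAt")
print(len(in_graph1), len(everywhere))
```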

## Step 8: Triple Validation

Validate triples before adding them to the store.

### Validation Checks

- **Required Fields**: Subject, predicate, object
- **Confidence Range**: 0-1 if provided
- **URI Format**: Valid URIs
- **Empty Components**: No empty values

In [None]:
from semantica.triplet_store import validate_triples

# Create triples (some invalid)
triples_to_validate = [
    Triple("http://example.org/Alice", "http://example.org/knows", "http://example.org/Bob"),  # Valid
    Triple("", "http://example.org/knows", "http://example.org/Charlie"),  # Invalid (empty subject)
    Triple("http://example.org/Dave", "", "http://example.org/Eve"),  # Invalid (empty predicate)
    Triple("http://example.org/Frank", "http://example.org/knows", "http://example.org/Grace", confidence=1.5),  # Invalid (confidence > 1)
]

# Validate triples
validation = validate_triples(triples_to_validate)

print("Validation Results:")
print(f"  Valid: {validation['valid']}")
print(f"  Valid triples: {validation['valid_triples']}/{validation['total_triples']}")
print(f"\nErrors: {len(validation['errors'])}")
for error in validation['errors']:
    print(f"  - {error}")
print(f"\nWarnings: {len(validation['warnings'])}")
for warning in validation['warnings']:
    print(f"  - {warning}")
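
The checks listed above can be sketched with the standard library. This is an approximation, not the logic inside `validate_triples`: `looks_like_http_uri` and `validate` are hypothetical, only subject and predicate are required to be URIs (an object may be a literal), and the URI test accepts only `http`/`https` forms, so it would wrongly reject schemes such as `urn:`.

```python
from urllib.parse import urlparse

def looks_like_http_uri(value: str) -> bool:
    """Simplified URI check: http(s) scheme plus a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def validate(subject, predicate, obj, confidence=None):
    """Return a list of error strings; an empty list means the triple is valid."""
    errors = []
    for name, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if not value:
            errors.append(f"{name} is empty")
    for name, value in (("subject", subject), ("predicate", predicate)):
        if value and not looks_like_http_uri(value):
            errors.append(f"{name} is not an http(s) URI")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be between 0 and 1")
    return errors

EX = "http://example.org/"
print(validate(EX + "Alice", EX + "knows", EX + "Bob"))
print(validate("", EX + "knows", EX + "Charlie"))
print(validate(EX + "Frank", EX + "knows", EX + "Grace", confidence=1.5))
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a batch at once.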

## Step 9: Multi-Store Operations

Work with multiple stores simultaneously.

### Use Cases

- **Primary/Backup**: Replicate to backup store
- **Read/Write Split**: Separate read and write stores
- **Multi-Tenant**: Different stores for different tenants

In [None]:
# Register multiple stores
manager = TripletManager()

primary = manager.register_store(
    "primary",
    "blazegraph",
    "http://localhost:9999/blazegraph/sparql"
)

backup = manager.register_store(
    "backup",
    "jena",
    "http://localhost:3030/ds"
)

# Add to primary store
triple = Triple(
    "http://example.org/Document1",
    "http://example.org/hasAuthor",
    "http://example.org/Alice"
)
manager.add_triple(triple, store_id="primary")
print("Added to primary store")

# Write the same triple to the backup store (manual replication)
manager.add_triple(triple, store_id="backup")
print("Replicated to backup store")

# List all stores
stores = manager.list_stores()
print(f"\nActive stores: {stores}")
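
When the same triple is written to several stores by hand, one write can succeed while another fails. The sketch below records the outcome per store so that a partial failure stays visible. It is a generic pattern with hypothetical names (`write_to_all`, `failing_backup`), and plain lists stand in for real stores.

```python
def write_to_all(triple, stores):
    """Try every store; return {store_id: None on success, else the error text}."""
    outcome = {}
    for store_id, add in stores.items():
        try:
            add(triple)
            outcome[store_id] = None
        except Exception as exc:  # keep going so the other stores still get the write
            outcome[store_id] = str(exc)
    return outcome

primary_data, backup_data = [], []

def failing_backup(triple):
    raise ConnectionError("backup unreachable")

ok = write_to_all(
    ("ex:Doc1", "ex:hasAuthor", "ex:Alice"),
    {"primary": primary_data.append, "backup": backup_data.append},
)
partial = write_to_all(
    ("ex:Doc2", "ex:hasAuthor", "ex:Bob"),
    {"primary": primary_data.append, "backup": failing_backup},
)
print(ok)
print(partial)
```

After the second call the primary holds two triples and the backup one, so the returned dictionary is what tells you a resync is needed.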

## Step 10: Best Practices

### Choosing the Right Backend

1. **Blazegraph**: High performance on large datasets; GPU acceleration via the separate Blazegraph GPU product
2. **Jena**: Java integration, SHACL validation, inference
3. **RDF4J**: Transaction support, ACID guarantees, federation
4. **Virtuoso**: Enterprise scale, SQL integration, clustering

### Performance Tips

- **Batch Operations**: Use `add_triples()` for multiple triples
- **Query Optimization**: Enable optimization and caching
- **Bulk Loading**: Use `BulkLoader` for large datasets
- **Validation**: Validate before loading to avoid errors

### Configuration

- **Batch Size**: 1000-10000 for bulk loading
- **Cache Size**: 1000-5000 for query caching
- **Timeout**: 30-60 seconds for queries
- **Retries**: 3-5 for bulk operations
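
To show what a bounded cache size means in practice, here is a small least-recently-used cache keyed on query text. It is a sketch, not `QueryEngine`'s cache: `QueryCache` is hypothetical, and collapsing whitespace to build the key is a simplification that would also alter string literals containing repeated spaces.

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache keyed on whitespace-normalized query text."""
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._entries = OrderedDict()

    @staticmethod
    def _key(query):
        return " ".join(query.split())

    def get(self, query):
        key = self._key(query)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, query, result):
        key = self._key(query)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used

cache = QueryCache(max_size=2)
cache.put("SELECT ?s WHERE { ?s ?p ?o }", ["row1"])
cache.put("ASK { ?s ?p ?o }", True)
hit = cache.get("SELECT  ?s\nWHERE { ?s ?p ?o }")    # same query, different spacing
cache.put("SELECT ?o WHERE { ?s ?p ?o }", ["row2"])  # evicts the ASK entry
print(hit, cache.get("ASK { ?s ?p ?o }"))
```

Cached results go stale once the underlying data changes, so a cache like this should be cleared after writes.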

## Summary

### What You've Learned

In this notebook, you've learned how to:

- Register and manage triplet stores
- Perform CRUD operations on RDF triples
- Execute and optimize SPARQL queries
- Use bulk loading for large datasets
- Work with multiple store backends
- Validate triples before operations
- Choose the right backend for your use case

### Key Takeaways

1. **Multi-Backend Support**: Choose the right backend for your needs
2. **SPARQL Power**: Full SPARQL 1.1 support with optimization
3. **Bulk Loading**: Efficient loading with progress tracking
4. **Query Optimization**: Automatic query optimization and caching
5. **Validation**: Pre-load validation prevents errors
6. **Multi-Store**: Manage multiple stores simultaneously

### Next Steps

**Further Reading**:
- [Triplet Store API Reference](https://semantica.readthedocs.io/reference/triplet_store/)
- [SPARQL 1.1 Specification](https://www.w3.org/TR/sparql11-query/)
- [Knowledge Graph Building](../use_cases/advanced_rag/01_GraphRAG_Complete.ipynb)

---

**Questions or Issues?** Check out our [GitHub repository](https://github.com/Hawksight-AI/semantica) or [documentation](https://semantica.readthedocs.io).