[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/supply_chain/01_Supply_Chain_Data_Integration.ipynb)

# Supply Chain Data Integration - Multi-Source Ingestion & Relationship Mapping

## Overview

This notebook demonstrates **supply chain data integration** using Semantica, with a focus on **multi-source ingestion**, **relationship mapping**, and **logistics tracking**. The pipeline ingests logistics and supplier data from multiple sources, builds a supply chain knowledge graph, and analyzes the resulting supplier network.

### Key Features

- **Multi-Source Ingestion**: Ingests data from multiple logistics and supplier sources (RSS, web APIs, files)
- **Relationship Mapping**: Maps supplier relationships and logistics routes using relation extraction
- **Logistics Tracking**: Tracks products, routes, locations, and warehouses
- **Supplier Network Analysis**: Analyzes supplier centrality and community clusters
- **Seed Data Integration**: Uses supplier foundation data for entity resolution
- **KG Construction**: Builds comprehensive supply chain knowledge graphs

### Learning Objectives

- Understand how to ingest data from multiple sources for supply chain integration
- Learn to map complex supplier relationships and logistics routes
- Master supplier network analysis using centrality and community detection
- Explore relationship extraction for supply chain entities
- Practice multi-source data integration and deduplication
- Analyze supplier networks and logistics connections

### Pipeline Flow

```mermaid
graph TD
    A[Multi-Source Ingestion] --> B[Seed Data Loading]
    A --> C[Document Parsing]
    B --> D[Text Processing]
    C --> D
    D --> E[Entity Extraction]
    E --> F[Relationship Extraction]
    F --> G[Deduplication]
    G --> H[KG Construction]
    H --> I[Embedding Generation]
    I --> J[Vector Store]
    H --> K[Network Analysis]
    H --> L[Community Detection]
    J --> M[GraphRAG Queries]
    K --> N[Visualization]
    L --> N
    H --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the supply chain data integration pipeline.


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


---

## Multi-Source Data Ingestion

Ingest supply chain data from multiple sources including RSS feeds, web APIs, and local files. If no feed returns any documents, the cell falls back to a small sample file written to `data/supply_chain.txt`.
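
As a library-independent illustration of what feed ingestion produces, the following minimal sketch turns RSS items into plain document dictionaries using only the standard library. The feed content, the `rss_to_documents` helper, and the dictionary keys are assumptions made for this example; they are not Semantica's `FeedIngestor` API or its document format.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real feed would be fetched over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Logistics News</title>
    <item>
      <title>Port congestion eases</title>
      <link>https://example.com/port-congestion</link>
      <description>Container dwell times fell this week.</description>
    </item>
    <item>
      <title>New warehouse opens</title>
      <link>https://example.com/new-warehouse</link>
      <description>A regional distribution hub began operations.</description>
    </item>
  </channel>
</rss>"""


def rss_to_documents(xml_text, source):
    """Turn each <item> of an RSS 2.0 feed into a plain document dict."""
    root = ET.fromstring(xml_text)
    documents = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        body = item.findtext("description", default="").strip()
        documents.append({
            "source": source,  # which feed the item came from
            "url": item.findtext("link", default="").strip(),
            "content": f"{title}. {body}",
        })
    return documents


docs = rss_to_documents(SAMPLE_RSS, source="sample-feed")
print(len(docs), docs[0]["content"])
```

Recording the source on every document is what later makes per-source conflict resolution possible.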


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from logistics RSS feeds
logistics_feeds = [
    "https://www.supplychaindive.com/rss",
    "https://www.logisticsmgmt.com/rss"
]

for feed_url in logistics_feeds:
    try:
        with redirect_stderr(StringIO()):
            feed_ingestor = FeedIngestor()
            feed_docs = feed_ingestor.ingest(feed_url, method="rss")
            documents.extend(feed_docs)
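    except Exception as exc:
        # Report the failure instead of hiding it; the sample-data fallback below still applies
        print(f"Skipping feed {feed_url}: {exc}")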

# Example: Web ingestion from transportation APIs (commented - requires API keys)
# web_ingestor = WebIngestor()
# api_docs = web_ingestor.ingest("https://api.transportation.com/data", method="api")

# Fallback: Sample supply chain data
if not documents:
    supply_data = """
    Supplier A provides Product X to Warehouse W1 located in City C1.
    Supplier B provides Product Y to Warehouse W2 located in City C2.
    Route R1 connects Warehouse W1 to Distribution Center D1.
    Route R2 connects Warehouse W2 to Distribution Center D2.
    Logistics: Product X shipped via Route R1 from W1 to D1.
    Logistics: Product Y shipped via Route R2 from W2 to D2.
    Warehouse W1 manages inventory for Product X.
    Distribution Center D1 serves Region R1.
    """
    with open("data/supply_chain.txt", "w", encoding="utf-8") as f:
        f.write(supply_data)
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/supply_chain.txt")

print(f"Ingested {len(documents)} documents")


In [None]:
from semantica.seed import SeedDataManager

seed_manager = SeedDataManager()

# Load supplier foundation seed data
supplier_foundation = {
    "suppliers": ["Supplier A", "Supplier B", "Supplier C", "Global Suppliers Inc"],
    "warehouses": ["Warehouse W1", "Warehouse W2", "Warehouse W3"],
    "locations": ["City C1", "City C2", "Region R1", "Region R2"],
    "products": ["Product X", "Product Y", "Product Z"],
    "routes": ["Route R1", "Route R2", "Route R3"]
}

seed_data = seed_manager.load_seed_data(supplier_foundation)
print(f"Loaded seed data with {len(seed_data)} entries")


---

## Document Parsing

Parse structured supply chain data from various formats including JSON, HTML, and CSV.


In [None]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

parsed_documents = []
for doc in documents:
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

Normalize supply chain data and split documents using entity-aware chunking to preserve supplier names and relationships.
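
To make the idea of entity-aware chunking concrete, here is a minimal sketch that splits text into overlapping windows and moves any boundary that would fall inside a known entity name. It assumes the entity names are already known (for instance from seed data); the `entity_aware_chunks` function is hypothetical and is not how Semantica's `TextSplitter` is implemented.

```python
def entity_aware_chunks(text, entities, chunk_size, overlap):
    """Split text into overlapping windows whose boundaries avoid known entity names."""
    # Collect the (start, end) span of every entity occurrence.
    spans = []
    for name in entities:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name)))
            start = text.find(name, start + 1)

    chunks = []
    pos = 0
    while pos < len(text):
        end = min(pos + chunk_size, len(text))
        # If the cut lands inside an entity, push it to the end of that entity.
        for s, e in spans:
            if s < end < e:
                end = e
        chunks.append(text[pos:end])
        if end == len(text):
            break
        # Step back by the overlap; if that lands inside an entity, start at its beginning.
        next_pos = end - overlap
        for s, e in spans:
            if s < next_pos < e:
                next_pos = s
        pos = max(next_pos, pos + 1)  # always make forward progress
    return chunks


text = "Supplier A provides Product X to Warehouse W1 located in City C1."
entities = ["Supplier A", "Product X", "Warehouse W1", "City C1"]
chunks = entity_aware_chunks(text, entities, chunk_size=25, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

A plain fixed-width splitter with the same settings would cut "Product X" in half at character 25; moving the boundary keeps the name intact for the extraction steps that follow.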


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
normalized_docs = []

for doc in parsed_documents:
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                normalize_numbers=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))

# Use entity-aware chunking to preserve supplier names and relationships
entity_splitter = TextSplitter(
    method="entity_aware",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_docs = []
for doc_text in normalized_docs:
    try:
        with redirect_stderr(StringIO()):
            chunks = entity_splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)

print(f"Processed {len(chunked_docs)} entity-aware chunks")


---

## Entity Extraction

Extract supply chain entities (suppliers, products, routes, locations, logistics events, and warehouses) from the first ten chunks using a Groq-hosted LLM.


In [None]:
from semantica.semantic_extract import NERExtractor
from contextlib import redirect_stderr
from io import StringIO

extractor = NERExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

entity_types = [
    "Supplier", "Product", "Route", "Location", "Logistics", "Warehouse"
]

all_entities = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting entities from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            entities = extractor.extract(
                chunk,
                entity_types=entity_types
            )
            all_entities.extend(entities)
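    except Exception as exc:
        # Surface failures (for example a missing API key) instead of silently returning no entities
        print(f"  Chunk {i} failed: {exc}")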
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_entities)} entities found)")

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract supply chain relationships, restricted to five relation types: `provides`, `located_in`, `connects`, `ships_via`, and `manages`.
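
The shape of the output is easiest to see with a rule-based stand-in. The sketch below extracts (subject, relation, object) triples from the sample sentences with regular expressions; the patterns and the `extract_triples` helper are assumptions for illustration, whereas the notebook itself delegates this step to an LLM through `RelationExtractor`.

```python
import re

# Hypothetical patterns keyed by relation type; each captures (subject, object).
ENTITY = r"((?:Supplier|Product|Warehouse|Route|City|Distribution Center) \w+)"
PATTERNS = {
    "provides": re.compile(ENTITY + r" provides " + ENTITY),
    "located_in": re.compile(ENTITY + r" located in " + ENTITY),
    "connects": re.compile(ENTITY + r" connects " + ENTITY),
    "ships_via": re.compile(ENTITY + r" shipped via " + ENTITY),
    "manages": re.compile(ENTITY + r" manages inventory for " + ENTITY),
}


def extract_triples(text):
    """Return (subject, relation, object) triples for every pattern match."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for subject, obj in pattern.findall(text):
            triples.append((subject, relation, obj))
    return triples


sample = (
    "Supplier A provides Product X to Warehouse W1 located in City C1. "
    "Route R1 connects Warehouse W1 to Distribution Center D1. "
    "Logistics: Product X shipped via Route R1 from W1 to D1. "
    "Warehouse W1 manages inventory for Product X."
)
for triple in extract_triples(sample):
    print(triple)
```

Patterns like these only work on templated text such as the fallback sample; free-form feed articles are the reason an LLM-based extractor is used in the cell below.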


In [None]:
from semantica.semantic_extract import RelationExtractor
from contextlib import redirect_stderr
from io import StringIO

relation_extractor = RelationExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

relation_types = [
    "provides", "located_in", "connects",
    "ships_via", "manages"
]

all_relationships = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting relationships from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            relationships = relation_extractor.extract(
                chunk,
                relation_types=relation_types
            )
            all_relationships.extend(relationships)
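    except Exception as exc:
        # Surface failures (for example a missing API key) instead of silently returning no relationships
        print(f"  Chunk {i} failed: {exc}")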
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


---

## Conflict Detection

Detect and resolve conflicting property values across sources. Because supply chain sources differ in credibility, conflicts are resolved with a credibility-weighted strategy that favors the more authoritative sources.
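
One plausible reading of credibility weighting is sketched below: each candidate value accumulates the credibility of the sources that report it, and the value with the highest total wins. The source names, scores, and the `resolve_by_credibility` helper are hypothetical; Semantica's `credibility_weighted` strategy may weigh sources differently.

```python
from collections import defaultdict

# Hypothetical credibility scores per source (0..1); not values taken from Semantica.
SOURCE_CREDIBILITY = {"erp_export": 0.9, "supplier_portal": 0.7, "news_feed": 0.4}


def resolve_by_credibility(claims, credibility, default_weight=0.1):
    """Pick the value whose supporting sources have the highest total credibility."""
    weight_by_value = defaultdict(float)
    for source, value in claims:
        weight_by_value[value] += credibility.get(source, default_weight)
    return max(weight_by_value, key=weight_by_value.get)


# Three sources disagree about where Warehouse W1 is located.
claims = [
    ("erp_export", "City C1"),
    ("news_feed", "City C2"),
    ("supplier_portal", "City C2"),
]
winner = resolve_by_credibility(claims, SOURCE_CREDIBILITY)
print(winner)
```

Summing weights lets two moderately credible sources outvote one highly credible source; taking the single most credible source instead is an equally reasonable design if authoritative systems should always win.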


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

# Use value conflict detection for property value disagreements
# credibility_weighted strategy prioritizes authoritative supply chain sources
conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

print(f"Detecting value conflicts in {len(all_entities)} entities...")
conflicts = conflict_detector.detect_conflicts(
    entities=all_entities,
    relationships=all_relationships,
    method="value"  # Detect property value conflicts
)

print(f"Detected {len(conflicts)} value conflicts")

if conflicts:
    print("Resolving conflicts using credibility_weighted strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="credibility_weighted"  # Supply chain sources have different credibility
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


---

## Deduplication

Deduplicate supplier entities with fuzzy name matching (similarity threshold 0.85) to ensure accurate supply chain mapping.
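
To show how seed data can anchor entity resolution, the sketch below maps each extracted name to its closest canonical seed name when the similarity clears a threshold. It uses `difflib.SequenceMatcher` from the standard library as the similarity metric, which is an assumption; the metric behind `EntityResolver(strategy="fuzzy")` may differ, and `canonicalize` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonicalize(name, canonical_names, threshold=0.85):
    """Return the best-matching canonical name, or the input if nothing is close enough."""
    best = max(canonical_names, key=lambda candidate: similarity(name, candidate))
    return best if similarity(name, best) >= threshold else name


seed_suppliers = ["Supplier A", "Supplier B", "Global Suppliers Inc"]
extracted = ["supplier a", "Global Supplers Inc", "Acme Freight"]

resolved = sorted({canonicalize(name, seed_suppliers) for name in extracted})
print(resolved)
print(similarity("Supplier A", "Supplier B"))
```

Under this metric "Supplier A" and "Supplier B" score 0.9, which is above the 0.85 threshold. Picking the single best match against a canonical list, as here, keeps them apart, whereas a pairwise merge at that threshold could collapse short names that differ by one character.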


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

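# Convert extracted entities (dicts or Entity objects) to dictionaries for EntityResolver
def entity_field(entity, *names, default=""):
    """Return the first non-empty field among `names` from a dict or an object."""
    for name in names:
        value = entity.get(name) if isinstance(entity, dict) else getattr(entity, name, None)
        if value not in (None, ""):
            return value
    return default

print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [
    {
        "name": entity_field(e, "name", "text"),
        "type": entity_field(e, "type", "label"),
        "confidence": entity_field(e, "confidence", default=1.0),
    }
    for e in all_entities
]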

# Use EntityResolver class to resolve duplicates
entity_resolver = EntityResolver(strategy="fuzzy", similarity_threshold=0.85)

print(f"Resolving duplicates in {len(entity_dicts)} entities...")
resolved_entities = entity_resolver.resolve_entities(entity_dicts)

# Convert back to Entity objects
print(f"Converting {len(resolved_entities)} resolved entities back to Entity objects...")
merged_entities = [
    Entity(text=e["name"], label=e["type"], confidence=e.get("confidence", 1.0))
    if isinstance(e, dict) else e
    for e in resolved_entities
]

all_entities = merged_entities
print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")


---

## Knowledge Graph Construction

Build a knowledge graph from supply chain entities and relationships to enable network analysis.
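
Conceptually, the graph is a set of nodes plus labeled, directed edges. The following minimal sketch builds such a structure from triples with plain dictionaries and then follows the edges downstream from a supplier. The `build_graph` and `reachable` helpers are hypothetical and say nothing about the structure `GraphBuilder` returns.

```python
from collections import defaultdict

triples = [
    ("Supplier A", "provides", "Product X"),
    ("Product X", "ships_via", "Route R1"),
    ("Route R1", "connects", "Warehouse W1"),
    ("Warehouse W1", "located_in", "City C1"),
]


def build_graph(triples):
    """Return (nodes, outgoing) where outgoing[node] lists (relation, target) pairs."""
    nodes = set()
    outgoing = defaultdict(list)
    for subject, relation, obj in triples:
        nodes.update((subject, obj))
        outgoing[subject].append((relation, obj))
    return nodes, outgoing


def reachable(outgoing, start):
    """Every node reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in outgoing.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


nodes, outgoing = build_graph(triples)
print(len(nodes), sorted(reachable(outgoing, "Supplier A")))
```

Reachability of this kind is what lets a question such as "what is affected if Supplier A is disrupted?" be answered from the graph rather than from the raw text.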


In [None]:
from semantica.kg import GraphBuilder

builder = GraphBuilder()

print("Building knowledge graph...")
kg = builder.build(
    entities=all_entities,
    relationships=all_relationships
)

print(f"Built KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for supply chain documents and store them in a vector database for semantic search.
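
Semantic search over stored embeddings reduces to ranking vectors by similarity to a query vector. The sketch below does this with cosine similarity in pure Python on toy 3-dimensional vectors; the vectors, chunk IDs, and `search` helper are invented for illustration, and a FAISS-backed store would perform the same ranking far more efficiently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=2):
    """Return the IDs of the top_k vectors most similar to the query."""
    scored = [(cosine(vec, query_vector), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]


# Toy 3-dimensional vectors standing in for 384-dimensional embeddings.
index = {
    "chunk-0": [0.9, 0.1, 0.0],
    "chunk-1": [0.0, 1.0, 0.0],
    "chunk-2": [0.7, 0.7, 0.1],
}
hits = search(index, [1.0, 0.0, 0.0])
print(hits)
```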


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore
from contextlib import redirect_stderr
from io import StringIO

embedding_gen = EmbeddingGenerator(
    model_name=EMBEDDING_MODEL,
    dimension=EMBEDDING_DIMENSION
)

# Generate embeddings for chunks
chunks_to_embed = chunked_docs[:20]  # Limit for demo
print(f"Generating embeddings for {len(chunks_to_embed)} chunks...")
embedded_chunks = []  # (chunk, embedding) pairs keep text and vector aligned if a chunk fails
for i, chunk in enumerate(chunks_to_embed, 1):
    try:
        with redirect_stderr(StringIO()):
            embedding = embedding_gen.generate(chunk)
            embedded_chunks.append((chunk, embedding))
    except Exception as exc:
        print(f"  Skipping chunk {i}: {exc}")
    if i % 5 == 0 or i == len(chunks_to_embed):
        print(f"  Processed {i}/{len(chunks_to_embed)} chunks...")

# Create vector store
vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

# Add embeddings to vector store
print(f"Storing {len(embedded_chunks)} embeddings in vector store...")
for i, (chunk, embedding) in enumerate(embedded_chunks):
    try:
        vector_store.add(
            id=str(i),
            embedding=embedding,
            metadata={"text": chunk[:100]}  # Store first 100 chars
        )
    except Exception as exc:
        print(f"  Could not store embedding {i}: {exc}")

print(f"Generated {len(embedded_chunks)} embeddings and stored in vector database")


---

## Supplier Network Analysis

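Analyze the supplier network with degree and betweenness centrality. Degree centrality counts how many direct connections a node has, while betweenness centrality measures how often a node lies on shortest paths between other nodes; together they indicate which suppliers the network depends on most.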
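
As a worked example of the simpler of the two measures, the sketch below computes normalized degree centrality (degree divided by n - 1) for a small undirected graph. The edge list is hypothetical, and `degree_centrality` here is a standalone function, not the `CentralityCalculator` method.

```python
def degree_centrality(edges):
    """Degree of each node divided by (n - 1), the maximum possible degree."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical network: three suppliers feed one warehouse, which feeds one distribution center.
edges = [
    ("Supplier A", "Warehouse W1"),
    ("Supplier B", "Warehouse W1"),
    ("Supplier C", "Warehouse W1"),
    ("Warehouse W1", "Distribution Center D1"),
]
centrality = degree_centrality(edges)
for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```

Warehouse W1 touches every other node and scores 1.0, while each supplier scores 0.25, which flags the warehouse as the single point of failure in this toy network.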


In [None]:
from semantica.kg import CentralityCalculator
from contextlib import redirect_stderr
from io import StringIO

centrality_calc = CentralityCalculator(kg)

try:
    with redirect_stderr(StringIO()):
        # Calculate degree centrality for suppliers
        degree_centrality = centrality_calc.calculate_degree_centrality()
        print(f"Calculated degree centrality for {len(degree_centrality)} nodes")
        
        # Calculate betweenness centrality
        betweenness_centrality = centrality_calc.calculate_betweenness_centrality()
        print(f"Calculated betweenness centrality for {len(betweenness_centrality)} nodes")
        
        # Find most central suppliers (entities may be dicts or Entity objects after deduplication)
        supplier_entities = [
            e for e in all_entities
            if (e.get("type") if isinstance(e, dict) else getattr(e, "label", None)) == "Supplier"
        ]
        if supplier_entities and degree_centrality:
            print(f"Analyzed centrality for {len(supplier_entities)} suppliers")
except Exception as exc:
    print(f"Supplier network analysis failed: {exc}")


---

## Supplier Community Detection

Detect supplier communities in the supply chain network with the Louvain algorithm, then look for overlapping communities, to identify groups of suppliers that are more densely connected to each other than to the rest of the network.
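
Louvain searches for a partition that maximizes modularity, so it helps to see how modularity scores a given grouping. The sketch below computes Newman modularity for a hand-picked partition of a small undirected graph. It does not implement Louvain itself, and the edges (including the bridge between the two warehouses) are hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)    # edges with both ends inside community i
    degree_sum = [0] * len(communities)  # total degree of the nodes in community i
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(
        internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Two tightly knit supplier groups joined by a single warehouse-to-warehouse link.
edges = [
    ("Supplier A", "Product X"), ("Product X", "Warehouse W1"), ("Supplier A", "Warehouse W1"),
    ("Supplier B", "Product Y"), ("Product Y", "Warehouse W2"), ("Supplier B", "Warehouse W2"),
    ("Warehouse W1", "Warehouse W2"),
]
two_groups = [
    {"Supplier A", "Product X", "Warehouse W1"},
    {"Supplier B", "Product Y", "Warehouse W2"},
]
one_group = [{node for edge in edges for node in edge}]

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 3), q_one)
```

The two-group partition scores 5/14, while putting every node in one community scores 0, which is why a modularity-maximizing method would prefer the split.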


In [None]:
from semantica.kg import CommunityDetector
from contextlib import redirect_stderr
from io import StringIO

community_detector = CommunityDetector(kg)

try:
    with redirect_stderr(StringIO()):
        # Detect communities using Louvain algorithm
        communities = community_detector.detect_communities(method="louvain")
        print(f"Detected {len(communities)} supplier communities")
        
        # Detect overlapping communities
        overlapping = community_detector.detect_overlapping_communities()
        print(f"Detected {len(overlapping)} overlapping supplier communities")
except Exception as exc:
    print(f"Supplier community detection failed: {exc}")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex supply chain questions.
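
One simple way to combine the two signals is to take the entities mentioned in the top-ranked chunks and expand them with their graph neighbors. The sketch below illustrates that idea with plain dictionaries; the `hybrid_retrieve` helper, the chunk-to-entity mapping, and the adjacency data are assumptions, and `AgentContext` may combine vector and graph results in a different way.

```python
def hybrid_retrieve(vector_hits, chunk_entities, graph, hops=1):
    """Expand entities mentioned in top-ranked chunks with their graph neighbors."""
    frontier = {e for chunk_id in vector_hits for e in chunk_entities.get(chunk_id, [])}
    context = set(frontier)
    for _ in range(hops):
        # Visit neighbors of the current frontier that are not in the context yet.
        frontier = {n for e in frontier for n in graph.get(e, ())} - context
        context |= frontier
    return context


# Hypothetical inputs: one chunk returned by vector search, plus a small adjacency map.
vector_hits = ["chunk-0"]
chunk_entities = {"chunk-0": ["Warehouse W1"]}
graph = {
    "Warehouse W1": ["Supplier A", "Route R1"],
    "Route R1": ["Distribution Center D1"],
}

one_hop = hybrid_retrieve(vector_hits, chunk_entities, graph, hops=1)
two_hops = hybrid_retrieve(vector_hits, chunk_entities, graph, hops=2)
print(sorted(one_hop))
print(sorted(two_hops))
```

The hop count trades recall for noise: one hop adds direct partners of the matched entity, while two hops also pulls in Distribution Center D1, which no retrieved chunk mentioned.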


In [None]:
from semantica.context import AgentContext
from contextlib import redirect_stderr
from io import StringIO

agent_context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=kg
)

queries = [
    "Which suppliers provide products to Warehouse W1?",
    "What routes connect warehouses to distribution centers?",
    "Where is Supplier A located?",
    "What products are shipped via Route R1?"
]

for query in queries:
    try:
        with redirect_stderr(StringIO()):
            results = agent_context.query(
                query=query,
                top_k=5
            )
            print(f"Query: {query}")
            print(f"Found {len(results.get('results', []))} relevant results")
    except Exception as exc:
        print(f"Query failed: {query} ({exc})")


---

## Visualization

Visualize the supply chain knowledge graph to explore supplier relationships, logistics routes, and network structure.


In [None]:
from semantica.visualization import KGVisualizer
from contextlib import redirect_stderr
from io import StringIO

visualizer = KGVisualizer()

try:
    with redirect_stderr(StringIO()):
        visualizer.visualize(
            kg,
            output_path="supply_chain_kg.html",
            layout="force_directed"
        )
        print("Knowledge graph visualization saved to supply_chain_kg.html")
except Exception as exc:
    print(f"Visualization failed: {exc}")


---

## Export

Export the knowledge graph in multiple formats for supply chain analysis and reporting.
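
For readers who want to see what an edge-list export can look like, the sketch below serializes triples to CSV and to a node/edge JSON document with the standard library. The column names and JSON layout are assumptions chosen for this example and may not match the files `GraphExporter` writes.

```python
import csv
import io
import json

triples = [
    ("Supplier A", "provides", "Product X"),
    ("Product X", "ships_via", "Route R1"),
]


def edges_to_csv(triples):
    """Serialize triples as a CSV edge list with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "relation", "target"])
    writer.writerows(triples)
    return buffer.getvalue()


def graph_to_json(triples):
    """Serialize triples as a JSON document with sorted nodes and labeled edges."""
    nodes = sorted({n for s, _, t in triples for n in (s, t)})
    edges = [{"source": s, "relation": r, "target": t} for s, r, t in triples]
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)


csv_text = edges_to_csv(triples)
json_text = graph_to_json(triples)
print(csv_text)
print(json_text)
```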


In [None]:
from semantica.export import GraphExporter
from contextlib import redirect_stderr
from io import StringIO

exporter = GraphExporter()

try:
    with redirect_stderr(StringIO()):
        # Export as JSON
        exporter.export(kg, format="json", output_path="supply_chain_kg.json")
        
        # Export as GraphML
        exporter.export(kg, format="graphml", output_path="supply_chain_kg.graphml")
        
        # Export as CSV (for supply chain analysis)
        exporter.export(kg, format="csv", output_path="supply_chain_kg.csv")
        
        print("Exported knowledge graph in JSON, GraphML, and CSV formats")
except Exception as exc:
    print(f"Export failed: {exc}")
