[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/supply_chain/01_Supply_Chain_Data_Integration.ipynb)

# Supply Chain Data Integration - Multi-Source Ingestion & Relationship Mapping

## Overview

This notebook demonstrates **supply chain data integration** using Semantica, with a focus on **multi-source ingestion**, **relationship mapping**, and **logistics tracking**. The pipeline ingests logistics and supplier data from multiple sources and builds a supply chain knowledge graph that supports supplier network analysis.

### Key Features

- **Multi-Source Ingestion**: Ingests data from multiple logistics and supplier sources (RSS, web APIs, files)
- **Relationship Mapping**: Maps supplier relationships and logistics routes using relation extraction
- **Logistics Tracking**: Tracks products, routes, locations, and warehouses
- **Supplier Network Analysis**: Analyzes supplier centrality and community clusters
- **Seed Data Integration**: Uses supplier foundation data for entity resolution
- **KG Construction**: Builds comprehensive supply chain knowledge graphs

### Learning Objectives

- Understand how to ingest data from multiple sources for supply chain integration
- Learn to map complex supplier relationships and logistics routes
- Master supplier network analysis using centrality and community detection
- Explore relationship extraction for supply chain entities
- Practice multi-source data integration and deduplication
- Analyze supplier networks and logistics connections

### Pipeline Flow

```mermaid
graph TD
    A[Multi-Source Ingestion] --> B[Seed Data Loading]
    A --> C[Document Parsing]
    B --> D[Text Processing]
    C --> D
    D --> E[Entity Extraction]
    E --> F[Relationship Extraction]
    F --> G[Deduplication]
    G --> H[KG Construction]
    H --> I[Embedding Generation]
    I --> J[Vector Store]
    H --> K[Network Analysis]
    H --> L[Community Detection]
    J --> M[GraphRAG Queries]
    K --> N[Visualization]
    L --> N
    H --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the supply chain data integration pipeline.
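
A minimal sketch of a safer way to obtain the key: read it from the environment and fail early if it is missing. The helper name `get_api_key`, the placeholder string, and the `DEMO_API_KEY` variable are hypothetical and exist only for illustration.

```python
import os


def get_api_key(name: str, placeholder: str = "your-api-key-here") -> str:
    """Return the environment variable `name`, or raise if it is unset or still a placeholder."""
    value = os.environ.get(name, "").strip()
    if not value or value == placeholder:
        raise RuntimeError(f"Set {name} in your environment before running the pipeline")
    return value


# Dummy variable so the sketch runs anywhere (hypothetical name, not a real credential).
os.environ["DEMO_API_KEY"] = "demo-value"
print(get_api_key("DEMO_API_KEY"))
```

Failing early keeps a missing key from surfacing later as an opaque authentication error in the GraphRAG step.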


In [None]:
import os

# Read the key from the environment; never commit a real key to a notebook.
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-groq-api-key")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


---

## Multi-Source Data Ingestion

Ingest supply chain data from logistics RSS feeds and web pages, tagging each document with the name of its source so later steps can trace where a fact came from. `FileIngestor` is imported for local files but is not used in the cell below.
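
The ingestion step works only if every source ends up in one common document shape. The sketch below illustrates that idea with plain dictionaries; `SourceDocument` and `to_document` are hypothetical names, not part of Semantica, and the content fallback (content, then description, then title) simply mirrors what the feed loop in this notebook does.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    content: str
    title: str = ""
    url: str = ""
    metadata: dict = field(default_factory=dict)


def to_document(raw: dict, source: str):
    """Map a raw feed or web record to a SourceDocument; return None if it has no text."""
    content = raw.get("content") or raw.get("description") or raw.get("title") or ""
    if not content:
        return None
    return SourceDocument(
        content=content,
        title=raw.get("title", ""),
        url=raw.get("url", ""),
        metadata={"source": source},
    )


raw_items = [
    ({"title": "Port delays ease", "description": "Congestion fell in Q3."}, "Feed A"),
    ({"content": "Supplier A opened Warehouse W1.", "url": "https://example.com/a"}, "Web B"),
    ({"title": ""}, "Feed A"),  # no usable text, so it is dropped
]
docs = [d for d in (to_document(r, s) for r, s in raw_items) if d is not None]
print(len(docs), [d.metadata["source"] for d in docs])
```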


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from logistics RSS feeds
logistics_feeds = [
    ("Supply Chain Dive", "https://www.supplychaindive.com/rss"),
    ("Logistics Management", "https://www.logisticsmgmt.com/rss"),
    ("SCMR", "https://www.scmr.com/rss"),
    ("DC Velocity", "https://www.dcvelocity.com/rss"),
    ("Inbound Logistics", "https://www.inboundlogistics.com/rss"),
    ("Supply Chain Brain", "https://www.supplychainbrain.com/rss"),
    ("MHL News", "https://www.mhlnews.com/rss")
]

feed_ingestor = FeedIngestor()
print(f"Ingesting from {len(logistics_feeds)} RSS feed sources...")
for i, (feed_name, feed_url) in enumerate(logistics_feeds, 1):
    try:
        feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                documents.append(item)
                feed_count += 1
        if feed_count > 0:
            print(f"  [{i}/{len(logistics_feeds)}] {feed_name}: {feed_count} documents")
    except Exception as e:
        print(f"  [{i}/{len(logistics_feeds)}] {feed_name}: Failed - {str(e)[:50]}")

# Web ingestion from supply chain data sources
web_sources = [
    ("Supply Chain Dive News", "https://www.supplychaindive.com/news"),
    ("Logistics Management News", "https://www.logisticsmgmt.com/news"),
    ("SCMR Articles", "https://www.scmr.com/articles"),
    ("DC Velocity Articles", "https://www.dcvelocity.com/articles")
]

# Minimal document wrapper so web pages expose the same attributes as feed items
class WebDoc:
    def __init__(self, content, title, url, source):
        self.content = content
        self.title = title
        self.url = url
        self.metadata = {'source': source}

web_ingestor = WebIngestor(respect_robots=True, delay=1.0)
print(f"\nIngesting from {len(web_sources)} web sources...")
for i, (source_name, web_url) in enumerate(web_sources, 1):
    try:
        web_content = web_ingestor.ingest_url(web_url)
        if web_content.text:
            doc = WebDoc(web_content.text, web_content.title, web_content.url, source_name)
            documents.append(doc)
            print(f"  [{i}/{len(web_sources)}] {source_name}: 1 document")
    except Exception as e:
        print(f"  [{i}/{len(web_sources)}] {source_name}: Failed - {str(e)[:50]}")

print(f"\nIngested {len(documents)} documents from multiple sources")


In [None]:
from semantica.seed import SeedDataManager

seed_manager = SeedDataManager()

# Load supplier foundation seed data
supplier_foundation = {
    "suppliers": ["Supplier A", "Supplier B", "Supplier C", "Global Suppliers Inc"],
    "warehouses": ["Warehouse W1", "Warehouse W2", "Warehouse W3"],
    "locations": ["City C1", "City C2", "Region R1", "Region R2"],
    "products": ["Product X", "Product Y", "Product Z"],
    "routes": ["Route R1", "Route R2", "Route R3"]
}

# Convert dictionary to entity records
entity_records = []
for entity_type, entity_names in supplier_foundation.items():
    for name in entity_names:
        entity_records.append({
            "id": name.replace(" ", "_").lower(),
            "text": name,
            "name": name,
            "entity_type": entity_type.rstrip("s").capitalize(),  # Remove plural and capitalize
            "type": entity_type.rstrip("s").capitalize(),
            "source": "supplier_foundation",
            "verified": True
        })

# Add entities to seed data
seed_manager.seed_data.entities = entity_records

print(f"Loaded seed data with {len(entity_records)} entities")
print(f"Entity types: {set(e['type'] for e in entity_records)}")


---

## Text Processing

Normalize supply chain data and split documents using entity-aware chunking to preserve supplier names and relationships.
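
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of plain sliding-window chunking. It is not the entity-aware method used by `TextSplitter`; it only shows how a chunk size of 1000 and an overlap of 200 characters interact. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size windows where consecutive windows share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The overlap means a supplier name that straddles a window boundary still appears whole in at least one chunk, as long as the name is shorter than the overlap.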


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
normalized_docs = []

for doc in documents:
    try:
        doc_content = doc.content if hasattr(doc, 'content') else str(doc)
        normalized = normalizer.normalize(
            doc_content,
            clean_html=True,
            normalize_entities=True,
            normalize_numbers=True,
            remove_extra_whitespace=True
        )
        normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc.content if hasattr(doc, 'content') else str(doc))

# Use entity-aware chunking to preserve supplier names and relationships
entity_splitter = TextSplitter(
    method="entity_aware",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_docs = []
for doc_text in normalized_docs:
    try:
        chunks = entity_splitter.split(doc_text)
        chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)

print(f"Processed {len(chunked_docs)} entity-aware chunks")


In [None]:
from semantica.semantic_extract import NERExtractor

# Use ML-based approach (spaCy) for entity extraction
extractor = NERExtractor(
    method="ml",
    model="en_core_web_sm"
)

entity_types = [
    "Supplier", "Product", "Route", "Location", "Logistics", "Warehouse", "Makinson"
]

all_entities = []
chunks_to_process = chunked_docs  # Process all chunks
print(f"Extracting entities from {len(chunks_to_process)} chunks using ML-based approach...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        entities = extractor.extract(
            chunk,
            entity_types=entity_types
        )
        all_entities.extend(entities)
    except Exception as e:
        print(f"  Error processing chunk {i}: {str(e)[:50]}")
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_entities)} entities found)")

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract supply chain relationships between entities, focusing on five relation types: `provides`, `located_in`, `connects`, `ships_via`, and `manages`.
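
As an illustration of what a relationship triple looks like, the sketch below uses simple trigger phrases between known entity names. This is a pattern-based stand-in, not the dependency-parsing method used by `RelationExtractor`; the `TRIGGERS` table and `extract_triples` are hypothetical.

```python
# Hypothetical trigger phrases per relation type (assumption for illustration only).
TRIGGERS = {
    "provides": ("provides", "supplies"),
    "located_in": ("located in", "based in"),
    "ships_via": ("ships via", "shipped via"),
}


def extract_triples(sentence, entity_names):
    """Return (subject, relation, object) triples for adjacent known entities in the sentence."""
    lowered = sentence.lower()
    # Locate each known entity by its first occurrence, ordered by position.
    found = sorted(
        (lowered.find(name.lower()), name)
        for name in entity_names
        if name.lower() in lowered
    )
    triples = []
    for (pos_a, a), (pos_b, b) in zip(found, found[1:]):
        between = lowered[pos_a + len(a):pos_b]
        for relation, phrases in TRIGGERS.items():
            if any(phrase in between for phrase in phrases):
                triples.append((a, relation, b))
    return triples


sentence = "Supplier A provides Product X, which ships via Route R1."
triples = extract_triples(sentence, ["Supplier A", "Product X", "Route R1"])
print(triples)
```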


In [None]:
from semantica.semantic_extract import RelationExtractor, NERExtractor

# Use ML-based approach (dependency parsing with spaCy) for relation extraction
relation_extractor = RelationExtractor(
    method="dependency",  # ML-based dependency parsing
    model="en_core_web_sm"
)

# Create NER extractor once for efficiency
ner = NERExtractor(method="ml", model="en_core_web_sm")

relation_types = [
    "provides", "located_in", "connects",
    "ships_via", "manages"
]

all_relationships = []
chunks_to_process = chunked_docs  # Process all chunks
print(f"Extracting relationships from {len(chunks_to_process)} chunks using ML-based approach...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        # Extract entities from chunk first (dependency parsing needs entities)
        chunk_entities = ner.extract(chunk)
        
        # Extract relationships using dependency parsing
        relationships = relation_extractor.extract(
            chunk,
            entities=chunk_entities,
            relation_types=relation_types
        )
        all_relationships.extend(relationships)
    except Exception as e:
        print(f"  Error processing chunk {i}: {str(e)[:50]}")
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


---

## Conflict Detection

Detect and resolve conflicts in supply chain data from multiple sources. Supply chain sources have different credibility levels.
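
One plausible reading of credibility-weighted resolution is sketched below: each conflicting claim is scored by its source's credibility multiplied by the extraction confidence, and the value with the highest total wins. The weights in `SOURCE_CREDIBILITY` and the function `resolve_by_credibility` are assumptions for illustration, not Semantica's internal logic.

```python
from collections import defaultdict

# Hypothetical credibility weights per source (0 to 1).
SOURCE_CREDIBILITY = {
    "supplier_foundation": 1.0,
    "Supply Chain Dive": 0.8,
    "web_scrape": 0.5,
}


def resolve_by_credibility(claims, default_weight=0.3):
    """claims: list of (value, source, confidence). Return the value with the highest weighted score."""
    scores = defaultdict(float)
    for value, source, confidence in claims:
        scores[value] += SOURCE_CREDIBILITY.get(source, default_weight) * confidence
    return max(scores, key=scores.get)


# Three sources disagree about where a supplier is located.
claims = [
    ("City C1", "web_scrape", 0.9),
    ("City C2", "supplier_foundation", 0.7),
    ("City C1", "unknown_blog", 0.6),
]
winner = resolve_by_credibility(claims)
print(winner)
```

Here two lower-credibility sources agree on "City C1" (0.45 + 0.18 = 0.63), but the verified seed source still wins with 0.70.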


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

detector = ConflictDetector()
resolver = ConflictResolver()

entity_dicts = [{"id": e.text, "text": e.text, "type": e.label, "confidence": e.confidence, "metadata": e.metadata} for e in all_entities]
relationship_dicts = [{"id": f"{r.subject.text}_{r.predicate}_{r.object.text}", "source_id": r.subject.text, "target_id": r.object.text, "type": r.predicate, "confidence": r.confidence, "metadata": r.metadata} for r in all_relationships] if all_relationships else []

conflicts = detector.detect_entity_conflicts(entity_dicts)
if relationship_dicts:
    conflicts.extend(detector.detect_relationship_conflicts(relationship_dicts))

print(f"Detected {len(conflicts)} conflicts")
if conflicts:
    resolver.resolve_conflicts(conflicts, strategy="credibility_weighted")


---

## Deduplication

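Deduplicate supplier entities with fuzzy name matching (similarity threshold 0.85) to ensure accurate supply chain mapping. The cell below resolves only the extracted entities; the seed records loaded earlier are not passed to the resolver.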
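
A minimal sketch of fuzzy deduplication using the standard library's `difflib.SequenceMatcher`, assuming a 0.85 similarity threshold. `EntityResolver` may use a different similarity measure; the helpers `similarity` and `deduplicate` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names (0 to 1)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def deduplicate(names, threshold=0.85):
    """Keep the first spelling of each name; drop later names that are near-duplicates."""
    canonical = []
    for name in names:
        if not any(similarity(name, kept) >= threshold for kept in canonical):
            canonical.append(name)
    return canonical


names = ["Global Suppliers Inc", "global suppliers inc.", "Warehouse W1"]
unique = deduplicate(names)
print(unique)
print(similarity("Supplier A", "Supplier B"))
```

Note that short names differing by a single character can exceed the threshold: with this measure, "Supplier A" and "Supplier B" score 0.9 and would be merged at 0.85. Placeholder-style names like these may need exact matching or a higher threshold.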


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

entity_dicts = [{"name": e.text, "type": e.label, "confidence": e.confidence} for e in all_entities]
resolved = EntityResolver(strategy="fuzzy", similarity_threshold=0.85).resolve_entities(entity_dicts)
all_entities = [Entity(text=e["name"], label=e["type"], start_char=0, end_char=0, confidence=e.get("confidence", 1.0)) for e in resolved]
print(f"Deduplicated {len(entity_dicts)} entities to {len(all_entities)} unique entities")


---

## Knowledge Graph Construction

Build a knowledge graph from supply chain entities and relationships to enable network analysis.
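
Conceptually, graph construction turns entities into nodes and relationships into labeled edges. The sketch below shows that mapping with plain dictionaries; `build_graph` and the record keys (`name`, `source`, `type`, `target`) are assumptions for illustration and are not `GraphBuilder`'s actual schema.

```python
def build_graph(entities, relationships):
    """Return nodes and labeled edges, skipping edges whose endpoints are not known entities."""
    nodes = {e["name"] for e in entities}
    edges = [
        (r["source"], r["type"], r["target"])
        for r in relationships
        if r["source"] in nodes and r["target"] in nodes
    ]
    return {"nodes": nodes, "edges": edges}


entities = [{"name": "Supplier A"}, {"name": "Product X"}, {"name": "Route R1"}]
relationships = [
    {"source": "Supplier A", "type": "provides", "target": "Product X"},
    {"source": "Product X", "type": "ships_via", "target": "Route R1"},
    {"source": "Supplier A", "type": "located_in", "target": "City C9"},  # dangling edge
]
graph = build_graph(entities, relationships)
print(len(graph["nodes"]), len(graph["edges"]))
```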


In [None]:
from semantica.kg import GraphBuilder

kg = GraphBuilder().build({"entities": all_entities, "relationships": all_relationships})
print(f"Built KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for supply chain documents and store them in a vector database for semantic search.
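
Semantic search over stored embeddings reduces to ranking vectors by similarity to a query vector. The sketch below does this with cosine similarity on toy 3-dimensional vectors in pure Python; the real pipeline uses 384-dimensional embeddings and a FAISS index, and the `cosine` and `search` helpers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, vectors, metadata, k=2):
    """Return the k (score, metadata) pairs most similar to the query."""
    scored = [(cosine(query_vec, v), m) for v, m in zip(vectors, metadata)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
metadata = [
    {"text": "Supplier A ships Product X"},
    {"text": "Warehouse W1 is in City C1"},
    {"text": "Supplier A uses Warehouse W1"},
]
top = search([1.0, 0.1, 0.0], vectors, metadata)
print([m["text"] for _, m in top])
```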


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

gen = EmbeddingGenerator(model_name=EMBEDDING_MODEL, dimension=EMBEDDING_DIMENSION)
embeddings = gen.generate_embeddings(chunked_docs, data_type="text")

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)
metadata = [{"text": chunk[:100]} for chunk in chunked_docs]
vector_store.store_vectors(vectors=embeddings, metadata=metadata)

print(f"Generated {len(embeddings)} embeddings and stored in vector database")


---

## Supplier Network Analysis

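Analyze the structure of the supplier network with centrality measures. Degree centrality counts a node's direct connections; betweenness centrality measures how often a node lies on the shortest paths between other nodes, which helps identify suppliers or warehouses that act as bottlenecks.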
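
For intuition, degree centrality can be computed by hand from an edge list. The sketch below uses the common normalization of dividing each node's degree by `n - 1`; whether `CentralityCalculator` normalizes the same way is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as (a, b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


edges = [
    ("Supplier A", "Warehouse W1"),
    ("Supplier B", "Warehouse W1"),
    ("Warehouse W1", "City C1"),
]
centrality = degree_centrality(edges)
print(max(centrality, key=centrality.get))  # Warehouse W1
```

In this toy network the warehouse touches every other node, so it scores 1.0 while each supplier scores 1/3.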


In [None]:
from semantica.kg import CentralityCalculator

calc = CentralityCalculator()
degree_centrality = calc.calculate_degree_centrality(kg)
betweenness_centrality = calc.calculate_betweenness_centrality(kg)
print(f"Degree centrality: {len(degree_centrality.get('centrality', {}))} nodes")
print(f"Betweenness centrality: {len(betweenness_centrality.get('centrality', {}))} nodes")


---

## Supplier Community Detection

Detect supplier communities in the supply chain network to identify groups of suppliers that are more densely connected to each other than to the rest of the graph.
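
Louvain is too long to reproduce here, so the sketch below uses connected components as a much cruder stand-in: it only groups nodes that are reachable from one another, whereas Louvain can also split a connected graph into denser sub-groups. The function name is hypothetical.

```python
def connected_components(edges):
    """Group nodes of an undirected graph into sets of mutually reachable nodes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


edges = [
    ("Supplier A", "Warehouse W1"),
    ("Supplier B", "Warehouse W1"),
    ("Supplier C", "Warehouse W2"),
]
groups = connected_components(edges)
print(len(groups))
```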


In [None]:
from semantica.kg import CommunityDetector

detector = CommunityDetector()
communities = detector.detect_communities(kg, algorithm="louvain")
overlapping = detector.detect_overlapping_communities(kg)
print(f"Detected {len(communities.get('communities', []))} communities and {len(overlapping.get('communities', []))} overlapping communities")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex supply chain questions.
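
The two ingredients of hybrid retrieval can be sketched separately: a bounded graph expansion from seed entities, and a weighted blend of vector and graph scores. The sketch assumes that `hybrid_alpha` weights the vector score and `1 - hybrid_alpha` weights the graph score; that interpretation, and the helpers `expand` and `hybrid_score`, are assumptions rather than documented Semantica behavior.

```python
from collections import deque


def expand(graph, seeds, max_hops=2):
    """Breadth-first expansion; return {node: hop distance} for nodes within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def hybrid_score(vector_score, graph_score, alpha=0.6):
    """Blend the two scores; alpha is the weight given to the vector score (assumed)."""
    return alpha * vector_score + (1 - alpha) * graph_score


graph = {
    "Supplier A": ["Warehouse W1"],
    "Warehouse W1": ["Supplier A", "City C1"],
    "City C1": ["Warehouse W1", "Region R1"],
    "Region R1": ["City C1"],
}
reachable = expand(graph, ["Supplier A"], max_hops=2)
print(reachable)
print(round(hybrid_score(0.9, 0.5), 2))
```

With two hops from "Supplier A", the expansion reaches "City C1" but stops before "Region R1".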


In [None]:
from semantica.context import AgentContext, ContextGraph, ContextRetriever
from semantica.llms import Groq
import os

context_graph = ContextGraph()
context_graph.build_from_entities_and_relationships(
    entities=kg.get('entities', []),
    relationships=kg.get('relationships', [])
)

retriever = ContextRetriever(vector_store=vector_store, knowledge_graph=context_graph, hybrid_alpha=0.6, max_expansion_hops=2)
context = AgentContext(vector_store=vector_store, knowledge_graph=context_graph, use_graph_expansion=True, max_expansion_hops=2)

llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

queries = [
    "Which suppliers provide products to Warehouse W1?",
    "What routes connect warehouses to distribution centers?",
    "Where is Supplier A located?",
    "What products are shipped via Route R1?"
]

for query in queries:
    result = context.query_with_reasoning(query, llm_provider=llm, max_results=10, max_hops=2)
    print(f"Query: {query}")
    print(f"Response: {result.get('response', 'No response generated')}")
    print(f"Confidence: {result.get('confidence', 0):.3f} | Sources: {result.get('num_sources', 0)} | Paths: {result.get('num_reasoning_paths', 0)}\n")


---

## Visualization

Visualize the supply chain knowledge graph to explore supplier relationships, logistics routes, and network structure.


In [None]:
from semantica.visualization import KGVisualizer

viz = KGVisualizer(layout="force")
fig = viz.visualize_network(kg, output="interactive")
if fig:
    fig.show()


---

## Export

Export the knowledge graph in multiple formats for supply chain analysis and reporting.
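
As a sanity check on any export, it helps to confirm that the file round-trips. The sketch below writes a small graph to JSON with the standard library and reads it back; the graph layout, the `export_json` helper, and the demo file name are hypothetical and independent of `GraphExporter`.

```python
import json
import os
import tempfile


def export_json(graph: dict, path: str) -> None:
    """Write the graph as indented, key-sorted JSON for stable diffs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(graph, handle, indent=2, sort_keys=True)


graph = {
    "entities": [{"name": "Supplier A", "type": "Supplier"}],
    "relationships": [
        {"source": "Supplier A", "type": "provides", "target": "Product X"}
    ],
}
path = os.path.join(tempfile.gettempdir(), "supply_chain_kg_demo.json")
export_json(graph, path)

# Read the file back to verify nothing was lost in serialization.
with open(path, encoding="utf-8") as handle:
    restored = json.load(handle)
print(restored == graph)
```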


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, format="json", output_path="supply_chain_kg.json")
exporter.export(kg, format="graphml", output_path="supply_chain_kg.graphml")
print("Exported knowledge graph in JSON and GraphML formats")
