[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/intelligence/02_Intelligence_Analysis_Orchestrator_Worker.ipynb)

# Intelligence Analysis - Multi-Source Integration & Temporal Analysis

## Overview

This notebook demonstrates **intelligence analysis with multi-source integration** using Semantica, with a focus on **parallel processing**, **conflict resolution**, and **temporal analysis**. The pipeline ingests multiple OSINT feeds, threat intelligence, and geospatial data sources in parallel, builds a temporal knowledge graph from them, and correlates intelligence across sources.

### Key Features

- **Parallel Processing**: Processes multiple intelligence sources simultaneously using stream ingestion
- **Multi-Source Integration**: Integrates OSINT feeds, threat intelligence, and geospatial data
- **Conflict Resolution**: Detects and resolves conflicts from multiple intelligence sources
- **Temporal Analysis**: Time-aware intelligence analysis with temporal graph queries
- **Multi-Source Correlation**: Correlates intelligence from multiple sources using reasoning
- **Hybrid RAG**: Combines vector search with graph traversal to answer questions that span multiple intelligence sources

### Learning Objectives

- Understand how to process multiple intelligence sources in parallel
- Learn to detect and resolve conflicts from multiple sources
- Master temporal analysis for time-aware intelligence queries
- Explore multi-source correlation and reasoning
- Practice parallel data ingestion and stream processing
- Analyze temporal patterns in intelligence data

### Pipeline Flow

```mermaid
graph TD
    A[Parallel Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal KG Construction]
    H --> I[Embedding Generation]
    I --> J[Vector Store]
    H --> K[Multi-Source Correlation]
    H --> L[Temporal Queries]
    J --> M[GraphRAG Queries]
    K --> N[Visualization]
    L --> N
    H --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the intelligence analysis pipeline, including temporal granularity for time-aware analysis.


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"  # For temporal intelligence analysis


---

## Parallel Data Ingestion

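Ingest intelligence data from multiple sources, including RSS feeds, streams, web APIs, and local files. The cell below reads the RSS feeds one at a time; the stream and web API examples are left commented out, and a small sample file is used as a fallback if no feed returns documents.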
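
To actually fetch sources concurrently, one option is a thread pool from the standard library. The sketch below is not part of Semantica: `fetch_source` is a hypothetical stand-in for a per-source ingest call, and the source names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_source(name):
    """Hypothetical stand-in for a per-source ingest call; returns a list of documents."""
    if name == "broken_feed":
        raise ConnectionError(f"{name} is unreachable")
    return [f"{name}: document {i}" for i in range(2)]


def ingest_in_parallel(sources, max_workers=4):
    """Fetch every source concurrently; collect documents and per-source errors."""
    collected, errors = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_source, source): source for source in sources}
        for future in as_completed(futures):
            source = futures[future]
            try:
                collected.extend(future.result())
            except Exception as exc:
                # Record the failure so one bad source does not hide the others
                errors[source] = str(exc)
    return collected, errors


parallel_docs, parallel_errors = ingest_in_parallel(["osint", "threat_intel", "broken_feed"])
print(f"{len(parallel_docs)} documents, {len(parallel_errors)} failed sources")
```

Threads suit this step because feed ingestion is dominated by network waits; recording errors per source keeps a single unreachable feed from aborting the whole run.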


In [None]:
from semantica.ingest import FeedIngestor, StreamIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from multiple OSINT RSS feeds (this loop processes the feeds one at a time)
osint_feeds = [
    "https://www.us-cert.gov/ncas/alerts.xml",
    "https://www.fbi.gov/feeds/news",
    "https://www.cisa.gov/news-events/cybersecurity-advisories"
]

for feed_url in osint_feeds:
    try:
        with redirect_stderr(StringIO()):
            feed_ingestor = FeedIngestor()
            feed_docs = feed_ingestor.ingest(feed_url, method="rss")
            documents.extend(feed_docs)
    except Exception as exc:
        # Report the failed feed instead of silently dropping it
        print(f"  Skipped {feed_url}: {exc}")

# Example: Stream ingestion for real-time intelligence (simulated)
# stream_ingestor = StreamIngestor()
# stream_docs = stream_ingestor.ingest("intelligence_stream", method="stream")

# Example: Web ingestion from threat intelligence APIs (commented - requires authentication)
# web_ingestor = WebIngestor()
# threat_docs = web_ingestor.ingest("https://api.threatintel.example.com/feeds", method="api")

# Fallback: Sample multi-source intelligence data
if not documents:
    osint_data = "OSINT: Public records show connection between Entity A and Location X on 2024-01-15."
    threat_data = "Threat Intel: Threat actor group Y operates in Region Z. Date: 2024-01-16."
    geo_data = "Geospatial: Activity detected at coordinates 40.7128, -74.0060 on 2024-01-17."
    intel_data = f"{osint_data}\n{threat_data}\n{geo_data}"
    with open("data/intelligence.txt", "w", encoding="utf-8") as f:
        f.write(intel_data)
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/intelligence.txt")

print(f"Ingested {len(documents)} documents from multiple sources")


In [None]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

Normalize multi-source intelligence data and split documents using sentence chunking for parallel processing.
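
As a rough illustration of what sentence chunking with a size limit and overlap does, here is a standalone sketch in plain Python. It is an assumption about the general technique, not the behavior of `TextSplitter`; the function name `sentence_chunks` and the regex-based sentence split are illustrative only.

```python
import re


def sentence_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Greedy sentence packing: fill a chunk up to chunk_size characters, then
    start the next chunk with trailing sentences worth up to chunk_overlap characters."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences until the overlap budget is used up
            carried, used = [], 0
            for previous in reversed(current):
                if used + len(previous) > chunk_overlap:
                    break
                carried.insert(0, previous)
                used += len(previous)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Alpha one. Bravo two. Charlie three. Delta four."
demo_chunks = sentence_chunks(sample, chunk_size=25, chunk_overlap=12)
print(demo_chunks)
# ['Alpha one. Bravo two.', 'Bravo two. Charlie three.', 'Delta four.']
```

The overlap repeats "Bravo two." at the start of the second chunk, so a fact that straddles a chunk boundary still appears whole in at least one chunk. A single sentence longer than `chunk_size` is kept intact rather than cut.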


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
print(f"Normalizing {len(parsed_documents)} documents...")
normalized_docs = []

for i, doc in enumerate(parsed_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))
    if i % 50 == 0 or i == len(parsed_documents):
        print(f"  Normalized {i}/{len(parsed_documents)} documents...")

# Use sentence chunking for parallel processing
sentence_splitter = TextSplitter(
    method="sentence",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Chunking {len(normalized_docs)} documents...")
chunked_docs = []
for i, doc_text in enumerate(normalized_docs, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = sentence_splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)
    if i % 50 == 0 or i == len(normalized_docs):
        print(f"  Chunked {i}/{len(normalized_docs)} documents ({len(chunked_docs)} chunks so far)")

print(f"Created {len(chunked_docs)} chunks from {len(normalized_docs)} documents")


---

## Entity Extraction

Extract intelligence entities including sources, entities, events, locations, and timeframes from multi-source data.


In [None]:
from semantica.semantic_extract import NERExtractor
from contextlib import redirect_stderr
from io import StringIO

extractor = NERExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

entity_types = [
    "Source", "Entity", "Event", "Location", "Timeframe"
]

all_entities = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting entities from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            entities = extractor.extract(
                chunk,
                entity_types=entity_types
            )
            all_entities.extend(entities)
    except Exception:
        pass
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_entities)} entities found)")

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract intelligence relationships including source attribution, temporal relationships, location connections, and correlations.


In [None]:
from semantica.semantic_extract import RelationExtractor
from contextlib import redirect_stderr
from io import StringIO

relation_extractor = RelationExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

relation_types = [
    "from_source", "occurs_at", "located_in",
    "happens_during", "correlated_with"
]

all_relationships = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting relationships from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            relationships = relation_extractor.extract(
                chunk,
                relation_types=relation_types
            )
            all_relationships.extend(relationships)
    except Exception:
        pass
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


---

## Deduplication

Deduplicate entities from multiple intelligence sources to ensure data consistency.
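
The idea behind fuzzy deduplication with a similarity threshold can be sketched with the standard library's `difflib`. This is an illustration of the technique under assumed dictionary keys (`name`, `type`, `confidence`), not a description of how `EntityResolver` works internally.

```python
from difflib import SequenceMatcher


def dedupe_entities(entities, threshold=0.85):
    """Merge entities of the same type whose names are at least `threshold` similar.
    The highest-confidence record in each group is kept."""
    merged = []
    for entity in entities:
        for i, kept in enumerate(merged):
            same_type = entity["type"] == kept["type"]
            ratio = SequenceMatcher(None, entity["name"].lower(), kept["name"].lower()).ratio()
            if same_type and ratio >= threshold:
                if entity["confidence"] > kept["confidence"]:
                    merged[i] = entity
                break
        else:
            # No sufficiently similar entity seen so far
            merged.append(entity)
    return merged


sample_entities = [
    {"name": "Threat Actor Group Y", "type": "Entity", "confidence": 0.80},
    {"name": "Threat actor group Y.", "type": "Entity", "confidence": 0.95},
    {"name": "Region Z", "type": "Location", "confidence": 0.90},
]
unique_entities = dedupe_entities(sample_entities)
print(f"{len(sample_entities)} entities reduced to {len(unique_entities)}")
```

Comparing names case-insensitively and requiring matching types avoids merging, for example, a location and an organization that happen to share a name.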


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

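# Convert extracted entities (dicts or Entity objects) to dictionaries for EntityResolver
def _field(entity, *names, default=""):
    """Return the first non-empty field, whether the entity is a dict or an object."""
    for name in names:
        value = entity.get(name) if isinstance(entity, dict) else getattr(entity, name, None)
        if value not in (None, ""):
            return value
    return default

print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [
    {
        "name": _field(e, "name", "text"),
        "type": _field(e, "type", "label"),
        "confidence": _field(e, "confidence", default=1.0),
    }
    for e in all_entities
]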

# Use EntityResolver class to resolve duplicates
entity_resolver = EntityResolver(strategy="fuzzy", similarity_threshold=0.85)

print(f"Resolving duplicates in {len(entity_dicts)} entities...")
resolved_entities = entity_resolver.resolve_entities(entity_dicts)

# Convert back to Entity objects
print(f"Converting {len(resolved_entities)} resolved entities back to Entity objects...")
merged_entities = [
    Entity(text=e["name"], label=e["type"], confidence=e.get("confidence", 1.0))
    if isinstance(e, dict) else e
    for e in resolved_entities
]

all_entities = merged_entities
print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")


---

## Conflict Detection

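Detect conflicts in intelligence data from multiple sources and resolve them with the `highest_confidence` strategy. This step is unique to this notebook and critical for multi-source integration, because independent sources often report different values for the same fact.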
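
A highest-confidence strategy can be sketched in a few lines of plain Python. The claim structure (`subject`, `attribute`, `value`, `source`, `confidence`) and the sample values, including "Region Q", are hypothetical and only illustrate the technique; they are not `ConflictDetector`'s data model.

```python
from collections import defaultdict


def find_conflicts(claims):
    """Group claims by (subject, attribute); a conflict is a group with more than one distinct value."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(claim["subject"], claim["attribute"])].append(claim)
    return {
        key: group
        for key, group in grouped.items()
        if len({claim["value"] for claim in group}) > 1
    }


def resolve_highest_confidence(conflicts):
    """Keep the single most confident claim for each conflicting key."""
    return {key: max(group, key=lambda claim: claim["confidence"]) for key, group in conflicts.items()}


sample_claims = [
    {"subject": "Group Y", "attribute": "region", "value": "Region Z", "source": "threat_intel", "confidence": 0.9},
    {"subject": "Group Y", "attribute": "region", "value": "Region Q", "source": "osint", "confidence": 0.6},
    {"subject": "Entity A", "attribute": "location", "value": "Location X", "source": "osint", "confidence": 0.7},
]
sample_conflicts = find_conflicts(sample_claims)
sample_resolved = resolve_highest_confidence(sample_conflicts)
for (subject, attribute), winner in sample_resolved.items():
    print(f"{subject}.{attribute} -> {winner['value']} (from {winner['source']})")
```

Keeping the winning claim's `source` alongside its value preserves attribution, which matters when an analyst later needs to justify why one report was preferred over another.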


In [None]:
from semantica.conflicts import ConflictDetector
from contextlib import redirect_stderr
from io import StringIO

conflict_detector = ConflictDetector()

try:
    with redirect_stderr(StringIO()):
        # Detect conflicts from multiple intelligence sources
        conflicts = conflict_detector.detect_conflicts(
            entities=all_entities,
            relationships=all_relationships
        )
        
        print(f"Detected {len(conflicts)} conflicts from multiple sources")
        
        # Resolve conflicts using highest confidence strategy
        if conflicts:
            resolved = conflict_detector.resolve_conflicts(
                conflicts,
                strategy="highest_confidence"
            )
            print(f"Resolved {len(resolved)} conflicts")
except Exception as exc:
    print(f"Conflict detection failed: {exc}")


---

## Temporal Knowledge Graph Construction

Build a temporal knowledge graph with time-aware relationships for intelligence analysis over time.
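
Before timestamps can be attached to relationships, dates have to be pulled out of the text and truncated to the chosen granularity. The following sketch shows one way to do that with the standard library, using the sample statements from the ingestion fallback; the helper names `extract_date` and `bucket_key` are illustrative, not Semantica functions.

```python
import re
from collections import defaultdict
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_date(text):
    """Return the first valid YYYY-MM-DD date in text, or None."""
    for year, month, day in ISO_DATE.findall(text):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            continue  # e.g. 2024-13-45 matches the pattern but is not a real date
    return None


def bucket_key(d, granularity="day"):
    """Truncate a date to the requested granularity."""
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}
    return d.strftime(formats[granularity])


statements = [
    "OSINT: Public records show connection between Entity A and Location X on 2024-01-15.",
    "Threat Intel: Threat actor group Y operates in Region Z. Date: 2024-01-16.",
    "Geospatial: Activity detected at coordinates 40.7128, -74.0060 on 2024-01-17.",
]

timeline = defaultdict(list)
for statement in statements:
    found = extract_date(statement)
    if found is not None:
        timeline[bucket_key(found, "day")].append(statement.split(":")[0])

print(dict(timeline))
# {'2024-01-15': ['OSINT'], '2024-01-16': ['Threat Intel'], '2024-01-17': ['Geospatial']}
```

With `"month"` granularity all three statements would fall into the single bucket `2024-01`, which is the trade-off the `TEMPORAL_GRANULARITY` setting controls: coarser buckets reveal trends, finer ones preserve ordering.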


In [None]:
from semantica.kg import GraphBuilder
from datetime import datetime
import re

builder = GraphBuilder(enable_temporal=True, temporal_granularity=TEMPORAL_GRANULARITY)

# Add temporal metadata to relationships
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
temporal_relationships = []
for rel in all_relationships:
    temporal_rel = rel.copy()
    # Use an ISO date (YYYY-MM-DD) found in the relationship if available,
    # otherwise fall back to the current time
    match = ISO_DATE.search(str(rel))
    try:
        timestamp = datetime.fromisoformat(match.group(0)) if match else datetime.now()
    except ValueError:
        timestamp = datetime.now()
    temporal_rel["timestamp"] = timestamp.isoformat()
    temporal_relationships.append(temporal_rel)

kg = builder.build(
    entities=all_entities,
    relationships=temporal_relationships
)

print(f"Built temporal KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for intelligence documents and store them in a vector database for semantic search.
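
Semantic search over stored embeddings comes down to ranking vectors by similarity to a query vector. The sketch below shows cosine-similarity top-k search in plain Python on toy 3-dimensional vectors; the vectors and document ids are made up, and a real index such as FAISS would do the same ranking far more efficiently on 384-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the k (score, doc_id) pairs most similar to the query, best first."""
    scored = [(cosine(query, vector), doc_id) for doc_id, vector in index.items()]
    return sorted(scored, reverse=True)[:k]


toy_index = {
    "osint": [1.0, 0.0, 0.0],
    "threat": [0.0, 1.0, 0.0],
    "geo": [0.7, 0.7, 0.0],
}
hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print([doc_id for _, doc_id in hits])
# ['osint', 'geo']
```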


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore
from contextlib import redirect_stderr
from io import StringIO

embedding_gen = EmbeddingGenerator(
    model_name=EMBEDDING_MODEL,
    dimension=EMBEDDING_DIMENSION
)

# Generate embeddings for chunks, keeping each chunk paired with its embedding
# so that a failed chunk cannot shift the alignment between text and vectors
embedded_chunks = []
for chunk in chunked_docs[:20]:  # Limit for demo
    try:
        with redirect_stderr(StringIO()):
            embedding = embedding_gen.generate(chunk)
            embedded_chunks.append((chunk, embedding))
    except Exception:
        pass
embeddings = [embedding for _, embedding in embedded_chunks]

# Create vector store
vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

# Add embeddings to vector store
for i, (chunk, embedding) in enumerate(embedded_chunks):
    try:
        vector_store.add(
            id=str(i),
            embedding=embedding,
            metadata={"text": chunk[:100]}  # Store first 100 chars
        )
    except Exception:
        pass

print(f"Generated {len(embeddings)} embeddings and stored in vector database")


---

## Multi-Source Correlation

Correlate intelligence from multiple sources using reasoning. This is unique to this notebook and enables cross-source intelligence analysis.
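
The multi-source rule used in the cell below ("an entity reported by two different sources is multi-source") can be expressed directly over `(subject, predicate, object)` triples. This is a plain-Python sketch of that one rule on made-up triples, not the `Reasoner`'s rule engine.

```python
from collections import defaultdict


def multi_source_entities(triples):
    """Return entities that have `from_source` edges to at least two distinct sources."""
    sources_by_entity = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "from_source":
            sources_by_entity[subject].add(obj)
    return {
        entity: sorted(sources)
        for entity, sources in sources_by_entity.items()
        if len(sources) >= 2
    }


sample_triples = [
    ("Entity A", "from_source", "OSINT"),
    ("Entity A", "from_source", "ThreatIntel"),
    ("Group Y", "from_source", "ThreatIntel"),
    ("Entity A", "located_in", "Location X"),
]
correlated = multi_source_entities(sample_triples)
print(correlated)
# {'Entity A': ['OSINT', 'ThreatIntel']}
```

Using a set per entity makes the `Source1 != Source2` condition implicit: repeated reports from the same source never count twice.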


In [None]:
from semantica.reasoning import Reasoner
from contextlib import redirect_stderr
from io import StringIO

reasoner = Reasoner(kg)

try:
    with redirect_stderr(StringIO()):
        # Add rules for multi-source correlation
        rules = [
            "IF Entity A from_source OSINT AND Entity A from_source ThreatIntel THEN Entity A is_correlated",
            "IF Event occurs_at Location X AND Event occurs_at Location Y AND Location X near Location Y THEN Event is_pattern",
            "IF Entity from_source Source1 AND Entity from_source Source2 AND Source1 != Source2 THEN Entity is_multi_source"
        ]
        
        for rule in rules:
            reasoner.add_rule(rule)
        
        # Find correlations between sources
        correlations = reasoner.find_patterns(pattern_type="correlation")
        print(f"Found {len(correlations)} multi-source correlations")
        
        # Infer new facts from multiple sources
        inferred_facts = reasoner.infer_facts()
        print(f"Inferred {len(inferred_facts)} new facts from multi-source correlation")
except Exception as exc:
    print(f"Multi-source correlation failed: {exc}")


---

## Temporal Graph Queries

Query the temporal knowledge graph to analyze intelligence data over time and identify temporal patterns.
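
A time-range query over a temporal graph is, at its core, a filter on edge timestamps with optional open bounds. The sketch below assumes edges are dictionaries carrying a `timestamp` field; the edge layout and the `edges_in_range` helper are illustrative, with `(None, None)` meaning "no restriction", mirroring the `time_range=(None, None)` argument in the cell below.

```python
from datetime import date


def edges_in_range(edges, time_range=(None, None)):
    """Return edges whose timestamp lies within [start, end]; None leaves a bound open."""
    start, end = time_range
    return [
        edge
        for edge in edges
        if (start is None or edge["timestamp"] >= start)
        and (end is None or edge["timestamp"] <= end)
    ]


sample_edges = [
    {"source": "Entity A", "relation": "located_in", "target": "Location X", "timestamp": date(2024, 1, 15)},
    {"source": "Group Y", "relation": "located_in", "target": "Region Z", "timestamp": date(2024, 1, 16)},
    {"source": "Activity", "relation": "occurs_at", "target": "40.7128, -74.0060", "timestamp": date(2024, 1, 17)},
]
recent = edges_in_range(sample_edges, (date(2024, 1, 16), None))
everything = edges_in_range(sample_edges)
print(f"{len(recent)} edges on or after 2024-01-16, {len(everything)} edges overall")
```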


In [None]:
from semantica.kg import TemporalGraphQuery
from contextlib import redirect_stderr
from io import StringIO

temporal_query = TemporalGraphQuery(kg)

try:
    with redirect_stderr(StringIO()):
        # Query temporal paths
        if all_entities:
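            # Entities may be dicts or Entity objects (text/label attributes) after deduplication
            entity_entities = [
                e for e in all_entities
                if (e.get("type") if isinstance(e, dict) else getattr(e, "label", None)) == "Entity"
            ]
            if entity_entities:
                first = entity_entities[0]
                entity_id = first.get("name", "") if isinstance(first, dict) else getattr(first, "text", "")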
                if entity_id:
                    history = temporal_query.query_temporal_paths(
                        source=entity_id,
                        time_range=(None, None)
                    )
                    print(f"Retrieved temporal history for entity: {entity_id}")
        
        # Query evolution of events over time
        evolution = temporal_query.query_evolution(
            entity_type="Event",
            time_granularity=TEMPORAL_GRANULARITY
        )
        print("Analyzed event evolution over time")
except Exception as exc:
    print(f"Temporal queries failed: {exc}")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex multi-source intelligence questions.
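
One common way to combine the two retrieval modes is to take the top vector-search hits as seeds and then expand them with their graph neighbors. The sketch below illustrates that expansion step on a toy adjacency map; the graph contents and the `hybrid_retrieve` helper are hypothetical and do not describe how `AgentContext` is implemented.

```python
def hybrid_retrieve(seed_ids, graph, hops=1):
    """Expand vector-search hits (seed_ids) with graph neighbours up to `hops` edges away.
    The result keeps seeds first, followed by neighbours in discovery order."""
    seen = list(seed_ids)
    frontier = list(seed_ids)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.append(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


# Toy adjacency map; the edges are made up for illustration
toy_graph = {
    "Entity A": ["Location X"],
    "Location X": ["Group Y"],
    "Group Y": ["Region Z"],
}
one_hop = hybrid_retrieve(["Entity A"], toy_graph, hops=1)
two_hops = hybrid_retrieve(["Entity A"], toy_graph, hops=2)
print(one_hop)   # ['Entity A', 'Location X']
print(two_hops)  # ['Entity A', 'Location X', 'Group Y']
```

The hop count trades recall for noise: each extra hop surfaces more indirectly related entities, which helps cross-source questions but can dilute the context passed to the language model.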


In [None]:
from semantica.context import AgentContext
from contextlib import redirect_stderr
from io import StringIO

agent_context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=kg
)

queries = [
    "What entities are mentioned in multiple intelligence sources?",
    "What events occurred at location X?",
    "What are the temporal patterns in the intelligence data?",
    "What correlations exist between OSINT and threat intelligence sources?"
]

for query in queries:
    try:
        with redirect_stderr(StringIO()):
            results = agent_context.query(
                query=query,
                top_k=5
            )
            print(f"Query: {query}")
            print(f"Found {len(results.get('results', []))} relevant results")
    except Exception as exc:
        print(f"Query failed: {query} ({exc})")


---

## Visualization

Visualize the intelligence knowledge graph to explore relationships, temporal patterns, and multi-source correlations.


In [None]:
from semantica.visualization import KGVisualizer
from contextlib import redirect_stderr
from io import StringIO

visualizer = KGVisualizer()

try:
    with redirect_stderr(StringIO()):
        visualizer.visualize(
            kg,
            output_path="intelligence_analysis.html",
            layout="force_directed"
        )
        print("Knowledge graph visualization saved to intelligence_analysis.html")
except Exception as exc:
    print(f"Visualization failed: {exc}")


---

## Export

Export the knowledge graph in multiple formats for intelligence reporting and further analysis.


In [None]:
from semantica.export import GraphExporter
from contextlib import redirect_stderr
from io import StringIO

exporter = GraphExporter()

try:
    with redirect_stderr(StringIO()):
        # Export as JSON
        exporter.export(kg, format="json", output_path="intelligence_analysis.json")
        
        # Export as GraphML
        exporter.export(kg, format="graphml", output_path="intelligence_analysis.graphml")
        
        # Export as CSV
        exporter.export(kg, format="csv", output_path="intelligence_analysis.csv")
        
        print("Exported knowledge graph in JSON, GraphML, and CSV formats")
except Exception as exc:
    print(f"Export failed: {exc}")
