[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/intelligence/02_Intelligence_Analysis_Orchestrator_Worker.ipynb)

# Intelligence Analysis - Multi-Source Integration & Temporal Analysis

## Overview

This notebook demonstrates **multi-source intelligence analysis** using Semantica, with a focus on **parallel processing**, **conflict resolution**, and **temporal analysis**. The pipeline ingests OSINT feeds, threat intelligence, and geospatial data sources, builds a temporal knowledge graph from them, and correlates findings across sources.

### Key Features

- **Parallel Processing**: Processes multiple intelligence sources simultaneously using stream ingestion
- **Multi-Source Integration**: Integrates OSINT feeds, threat intelligence, and geospatial data
- **Conflict Resolution**: Detects and resolves conflicts from multiple intelligence sources
- **Temporal Analysis**: Time-aware intelligence analysis with temporal graph queries
- **Multi-Source Correlation**: Correlates intelligence from multiple sources using reasoning
- **Hybrid RAG**: Combines vector search with graph traversal to answer questions that span multiple intelligence sources

### Learning Objectives

- Understand how to process multiple intelligence sources in parallel
- Learn to detect and resolve conflicts from multiple sources
- Master temporal analysis for time-aware intelligence queries
- Explore multi-source correlation and reasoning
- Practice parallel data ingestion and stream processing
- Analyze temporal patterns in intelligence data

### Pipeline Flow

```mermaid
graph TD
    A[Parallel Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal KG Construction]
    H --> I[Embedding Generation]
    I --> J[Vector Store]
    H --> K[Multi-Source Correlation]
    H --> L[Temporal Queries]
    J --> M[GraphRAG Queries]
    K --> N[Visualization]
    L --> N
    H --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the intelligence analysis pipeline, including temporal granularity for time-aware analysis.


In [None]:
import os

# Read the Groq API key from the environment; never commit a real key to a notebook.
os.environ.setdefault("GROQ_API_KEY", "your-groq-api-key")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"  # For temporal intelligence analysis


---

## Parallel Data Ingestion

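Ingest intelligence data from multiple sources: OSINT RSS feeds and public web pages from security and law-enforcement sites. Each source is fetched in its own `try` block, so a failed feed or page is reported and skipped without stopping the run, and every ingested document is tagged with its source URL.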
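
The ingestion cell in this section visits each source in turn. As a minimal sketch of how the same calls could run concurrently, the standard-library `ThreadPoolExecutor` can fan out one task per URL. `ingest_parallel` and `fake_fetch` are hypothetical names used only for illustration and are not part of Semantica; in practice `fetch` would wrap a call such as `feed_ingestor.ingest_feed`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is fetching.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing source must not abort the others.
                errors[url] = str(exc)
    return results, errors


def fake_fetch(url):
    """Hypothetical stand-in for a network call (no real I/O)."""
    if "bad" in url:
        raise ValueError("unreachable")
    return f"content from {url}"


results, errors = ingest_parallel(
    ["https://a.example/feed", "https://bad.example/feed"], fake_fetch
)
```

Threads suit this workload because fetching feeds is I/O-bound; per-source error capture mirrors the `try`/`except` used in the cell below.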


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from multiple OSINT RSS feeds
osint_feeds = [
    "https://www.us-cert.gov/ncas/alerts.xml",
    "https://www.europol.europa.eu/rss.xml",
    "https://www.treasury.gov/resource-center/sanctions/OFAC-Enforcement/Pages/rss.xml",
    "https://krebsonsecurity.com/feed/",
    "https://www.schneier.com/feed/",
    "https://www.darkreading.com/rss.xml",
    "https://threatpost.com/feed/",
    "https://www.bleepingcomputer.com/feed/",
    "https://www.securityweek.com/rss",
    "https://www.infosecurity-magazine.com/rss/news/",
    "https://www.csoonline.com/index.rss",
    "https://www.cisa.gov/news-events/cybersecurity-advisories/rss.xml",
    "https://www.ncsc.gov.uk/news/rss",
    "https://www.cyber.gov.au/news/rss"
]

feed_ingestor = FeedIngestor()
for i, feed_url in enumerate(osint_feeds, 1):
    try:
        feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if getattr(item, 'metadata', None) is None:
                    item.metadata = {}
                item.metadata['source'] = feed_url
                documents.append(item)
                feed_count += 1
        
        print(f"  [{i}/{len(osint_feeds)}] Feed: {feed_count} documents")
    except Exception as e:
        print(f"  [{i}/{len(osint_feeds)}] Feed failed: {str(e)[:50]}")

# Web ingestion from various intelligence and security sources
web_links = [
    "https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices",
    "https://www.unodc.org/unodc/en/data-and-analysis/index.html",
    "https://www.cisa.gov/news-events/cybersecurity-advisories",
    "https://www.us-cert.gov/ncas/alerts",
    "https://www.europol.europa.eu/newsroom",
    "https://www.ncsc.gov.uk/news",
    "https://www.cyber.gov.au/news",
    "https://www.fbi.gov/wanted/cyber",
    "https://www.dhs.gov/news"
]

web_ingestor = WebIngestor(respect_robots=False, delay=1.0)
for i, web_url in enumerate(web_links, 1):
    try:
        web_content = web_ingestor.ingest_url(web_url)
        if web_content and web_content.text:
            web_content.content = web_content.text
            if getattr(web_content, 'metadata', None) is None:
                web_content.metadata = {}
            web_content.metadata['source'] = web_url
            documents.append(web_content)
            print(f"  [{i}/{len(web_links)}] Web: {len(web_content.text)} characters")
    except Exception as e:
        print(f"  [{i}/{len(web_links)}] Web failed: {str(e)[:50]}")

print(f"Ingested {len(documents)} documents from multiple sources")


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            format="auto"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

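Normalize the parsed documents (HTML cleanup, entity normalization, whitespace collapsing) and split them into sentence-based chunks controlled by `CHUNK_SIZE` and `CHUNK_OVERLAP`. If normalization or splitting fails for a document, the original text is kept so that no source is dropped.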
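
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sentence-chunking sketch in plain Python. It is not Semantica's `TextSplitter` implementation: it assumes sizes are measured in characters and treats `chunk_size` as a soft limit, and `sentence_chunks` is a hypothetical helper.

```python
import re


def sentence_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Group whole sentences into chunks, repeating trailing sentences as overlap."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            overlap, used = [], 0
            for previous in reversed(current):
                if used + len(previous) > chunk_overlap:
                    break
                overlap.insert(0, previous)
                used += len(previous) + 1
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = sentence_chunks(
    "One two. Three four. Five six. Seven eight.", chunk_size=25, chunk_overlap=12
)
```

With these toy settings each chunk shares one sentence with its neighbour, which is the property that keeps entities near a chunk boundary visible to the extractor on both sides.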


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
print(f"Normalizing {len(parsed_documents)} documents...")
normalized_docs = []

for i, doc in enumerate(parsed_documents, 1):
    try:
        normalized = normalizer.normalize(
            doc if isinstance(doc, str) else str(doc),
            clean_html=True,
            normalize_entities=True,
            remove_extra_whitespace=True
        )
        normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))
    if i % 50 == 0 or i == len(parsed_documents):
        print(f"  Normalized {i}/{len(parsed_documents)} documents...")

# Use sentence chunking for parallel processing
sentence_splitter = TextSplitter(
    method="sentence",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Chunking {len(normalized_docs)} documents...")
chunked_docs = []
for i, doc_text in enumerate(normalized_docs, 1):
    try:
        chunks = sentence_splitter.split(doc_text)
        chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)
    if i % 50 == 0 or i == len(normalized_docs):
        print(f"  Chunked {i}/{len(normalized_docs)} documents ({len(chunked_docs)} chunks so far)")

print(f"Created {len(chunked_docs)} chunks from {len(normalized_docs)} documents")


---

## Entity Extraction

Extract named entities from every chunk and keep the types relevant to intelligence analysis: people, organizations, geopolitical entities, locations, events, and dates.
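
The label filter, and the later question of which entities appear in more than one source, can be sketched without any NER library. The `Mention` record and `entities_in_multiple_sources` helper are hypothetical and assume each mention carries the name of the source it came from.

```python
from collections import defaultdict
from dataclasses import dataclass

RELEVANT_TYPES = {"PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"}


@dataclass(frozen=True)
class Mention:
    text: str
    label: str
    source: str


def entities_in_multiple_sources(mentions, min_sources=2):
    """Return {(normalized_text, label): [sources]} for entities seen in several sources."""
    sources_by_entity = defaultdict(set)
    for mention in mentions:
        if mention.label in RELEVANT_TYPES:
            # Case-insensitive key so "Europol" and "europol" are merged.
            key = (mention.text.strip().lower(), mention.label)
            sources_by_entity[key].add(mention.source)
    return {
        key: sorted(sources)
        for key, sources in sources_by_entity.items()
        if len(sources) >= min_sources
    }


mentions = [
    Mention("Europol", "ORG", "feed_a"),
    Mention("europol", "ORG", "feed_b"),
    Mention("42", "CARDINAL", "feed_a"),  # dropped: label not relevant
    Mention("Oslo", "GPE", "feed_a"),     # dropped: only one source
]
shared = entities_in_multiple_sources(mentions)
```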


In [None]:
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor(method="ml", model="en_core_web_sm")

chunks_to_process = chunked_docs
entity_results = extractor.extract(chunks_to_process)

all_entities = []
relevant_types = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"]
for entities in entity_results:
    all_entities.extend([e for e in entities if e.label in relevant_types])

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract intelligence relationships including source attribution, temporal relationships, location connections, and correlations.
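
The extractor in this section is configured with a `cooccurrence` method and `max_distance=100`. As a rough illustration of the idea only, the sketch below links entities whose spans lie close together in the text. It assumes the distance is measured in characters, which the source does not confirm, and `cooccurrence_pairs` is a hypothetical helper rather than Semantica's implementation.

```python
from itertools import combinations


def cooccurrence_pairs(entities, max_distance=100, predicate="related_to"):
    """entities: iterable of (text, start_char, end_char) tuples from one chunk."""
    ordered = sorted(entities, key=lambda entity: entity[1])
    pairs = []
    for left, right in combinations(ordered, 2):
        gap = right[1] - left[2]  # characters between the two spans
        if 0 <= gap <= max_distance:
            pairs.append((left[0], predicate, right[0]))
    return pairs


spans = [("Acme Corp", 0, 9), ("Oslo", 40, 44), ("Europol", 400, 407)]
pairs = cooccurrence_pairs(spans)
```

Co-occurrence alone yields only a generic predicate; typed relations such as `targets` or `based_in` need the dependency or pattern methods.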


In [None]:
from semantica.semantic_extract import RelationExtractor

intelligence_relation_types = [
    "from_source", "occurs_at", "located_in", "happens_during", "correlated_with",
    "targets", "associated_with", "operates_in", "based_in", "related_to",
    "part_of", "works_for", "founded_by", "member_of", "collaborates_with",
    "affiliated_with", "before", "after", "during", "causes", "affects",
    "influences", "uses", "employs", "owns", "controls", "manages"
]

relation_extractor = RelationExtractor(
    method=["dependency", "pattern", "cooccurrence"],
    model="en_core_web_sm",
    relation_types=intelligence_relation_types,
    confidence_threshold=0.5,
    max_distance=100
)

chunks_to_process = chunked_docs
relation_results = relation_extractor.extract(chunks_to_process, entity_results)

all_relationships = []
seen = set()
for relationships in relation_results:
    for rel in relationships:
        key = (rel.subject.text, rel.predicate, rel.object.text)
        if key not in seen:
            seen.add(key)
            all_relationships.append(rel)

relationship_types = set(rel.predicate for rel in all_relationships)
print(f"Extracted {len(all_relationships)} relationships")
print(f"Relationship types found: {sorted(relationship_types)}")

---

## Conflict Detection

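Detect conflicting entities and relationships reported by different sources, then resolve them with the `credibility_weighted` strategy. This step is specific to this notebook and is critical for multi-source integration, because independent feeds can disagree about the same fact.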
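
One plausible reading of a credibility-weighted strategy is a weighted vote: each source backs its claimed value with its credibility score, and the value with the highest total wins. The sketch below shows that idea under stated assumptions; the credibility numbers, the default weight of 0.5, and the function name are illustrative and not taken from Semantica.

```python
from collections import defaultdict


def resolve_credibility_weighted(claims, credibility, default=0.5):
    """claims: list of (source, value). Returns (winning_value, share_of_total_weight)."""
    scores = defaultdict(float)
    for source, value in claims:
        scores[value] += credibility.get(source, default)
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / sum(scores.values())


# Three sources disagree about where an organization is based.
claims = [("feed_a", "Oslo"), ("feed_b", "Bergen"), ("feed_c", "Oslo")]
credibility = {"feed_a": 0.9, "feed_b": 0.6, "feed_c": 0.3}
value, share = resolve_credibility_weighted(claims, credibility)
```

Returning the winner's share of the total weight gives downstream steps a rough confidence for the resolved fact.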


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

entity_dicts = [
    {
        "id": getattr(e, "text", str(e)),
        "text": getattr(e, "text", str(e)),
        "label": getattr(e, "label", ""),
        "type": getattr(e, "label", ""),
        "confidence": getattr(e, "confidence", 1.0),
        "metadata": getattr(e, "metadata", {})
    }
    for e in all_entities
]

print(f"Detecting conflicts in {len(entity_dicts)} entities...")
entity_conflicts = conflict_detector.detect_entity_conflicts(entity_dicts)

relationship_dicts = [
    {
        "id": f"{getattr(rel.subject, 'text', str(rel.subject))}_{rel.predicate}_{getattr(rel.object, 'text', str(rel.object))}",
        "source_id": getattr(rel.subject, "text", str(rel.subject)),
        "target_id": getattr(rel.object, "text", str(rel.object)),
        "type": rel.predicate,
        "confidence": getattr(rel, "confidence", 1.0),
        "metadata": getattr(rel, "metadata", {})
    }
    for rel in all_relationships
]

print(f"Detecting conflicts in {len(relationship_dicts)} relationships...")
relationship_conflicts = conflict_detector.detect_relationship_conflicts(relationship_dicts)

all_conflicts = entity_conflicts + relationship_conflicts
print(f"Detected {len(all_conflicts)} conflicts from multiple sources")

if all_conflicts:
    resolved = conflict_resolver.resolve_conflicts(
        all_conflicts,
        strategy="credibility_weighted"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


---

## Temporal Knowledge Graph Construction

Build a temporal knowledge graph with time-aware relationships for intelligence analysis over time.
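
Temporal granularity decides which observations count as happening "at the same time". A minimal sketch, assuming each relationship carries an observation timestamp: truncating timestamps to the configured granularity groups edges into comparable time buckets. `bucket_by_granularity` and the format table are hypothetical, not Semantica internals.

```python
from collections import defaultdict
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen granularity.
FORMATS = {"hour": "%Y-%m-%dT%H", "day": "%Y-%m-%d", "month": "%Y-%m"}


def bucket_by_granularity(edges, granularity="day"):
    """edges: iterable of (subject, predicate, object, observed_at)."""
    buckets = defaultdict(list)
    for subject, predicate, obj, observed_at in edges:
        key = observed_at.strftime(FORMATS[granularity])
        buckets[key].append((subject, predicate, obj))
    return dict(buckets)


observations = [
    ("Acme", "targets", "Bank", datetime(2024, 5, 1, 9, 30)),
    ("Acme", "operates_in", "Oslo", datetime(2024, 5, 1, 22, 0)),
    ("Bank", "located_in", "Oslo", datetime(2024, 5, 2, 8, 0)),
]
buckets = bucket_by_granularity(observations, granularity="day")
```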


In [None]:
from semantica.kg import GraphBuilder

builder = GraphBuilder(
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY,
    merge_entities=True,
    resolve_conflicts=True
)

kg = builder.build({
    "entities": all_entities,
    "relationships": all_relationships
})

print(f"Built temporal KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for intelligence documents and store them in a vector database for semantic search.
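
Semantic search over stored vectors comes down to ranking by similarity to a query vector. The brute-force sketch below uses cosine similarity on toy 3-dimensional vectors (the notebook's model produces 384-dimensional ones); `normalize` and `top_k` are hypothetical helpers, and a real vector store would use an optimized index.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=2):
    """Return [(index, cosine_similarity)] for the k most similar vectors."""
    q = normalize(query)
    scored = []
    for index, vec in enumerate(vectors):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), index))
    scored.sort(reverse=True)
    return [(index, score) for score, index in scored[:k]]


vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
hits = top_k([1.0, 0.1, 0.0], vectors, k=2)
```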


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

embedding_gen = EmbeddingGenerator(
    model_name=EMBEDDING_MODEL,
    dimension=EMBEDDING_DIMENSION
)

# Embed only the first 20 chunks (demo subset); use chunked_docs to index everything
chunks_to_embed = chunked_docs[:20]
embeddings = embedding_gen.generate_embeddings(chunks_to_embed, data_type="text")

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)
vector_ids = vector_store.store_vectors(
    vectors=embeddings,
    metadata=[{"text": chunk[:100]} for chunk in chunks_to_embed]
)

print(f"Generated {len(embeddings)} embeddings and stored {len(vector_ids)} vectors in vector database")


---

## Temporal Graph Queries

Query the temporal knowledge graph to analyze intelligence data over time and identify temporal patterns.
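
A time-window query keeps only the edges whose validity interval overlaps the requested window; passing `None` for a bound leaves that side open, which is how the cell below calls `find_temporal_paths`. This sketch assumes edges are dicts with `valid_from` and `valid_until` fields, which is an illustrative schema rather than Semantica's.

```python
from datetime import datetime


def edges_active_between(edges, start_time=None, end_time=None):
    """Return edges whose [valid_from, valid_until] interval overlaps the window."""
    active = []
    for edge in edges:
        starts_too_late = end_time is not None and edge["valid_from"] > end_time
        ends_too_early = (
            start_time is not None
            and edge["valid_until"] is not None  # None means still valid
            and edge["valid_until"] < start_time
        )
        if not (starts_too_late or ends_too_early):
            active.append(edge)
    return active


timeline = [
    {"id": "e1", "valid_from": datetime(2024, 1, 1), "valid_until": datetime(2024, 1, 31)},
    {"id": "e2", "valid_from": datetime(2024, 2, 1), "valid_until": None},
    {"id": "e3", "valid_from": datetime(2024, 3, 15), "valid_until": datetime(2024, 3, 20)},
]
february = edges_active_between(timeline, datetime(2024, 2, 1), datetime(2024, 2, 29))
unbounded = edges_active_between(timeline)
```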


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(temporal_granularity=TEMPORAL_GRANULARITY)

entities = kg.get('entities', [])
if entities:
    first_entity = entities[0]
    entity_id = first_entity.get('id') or first_entity.get('text', '')
    if entity_id:
        paths = temporal_query.find_temporal_paths(
            graph=kg,
            source=entity_id,
            target=entity_id,
            start_time=None,
            end_time=None
        )
        print(f"Retrieved temporal paths for entity: {entity_id}")

evolution = temporal_query.analyze_evolution(
    graph=kg,
    entity_type="Event"
)
print("Analyzed event evolution over time")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex multi-source intelligence questions.
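
A common way to combine the two signals is a weighted sum, with a single parameter setting the balance between vector similarity and graph relevance. The sketch below assumes that `hybrid_alpha=0.7` means 70% vector score and 30% graph score; the source does not state how Semantica applies the parameter, and `hybrid_rank` is a hypothetical helper.

```python
def hybrid_rank(vector_scores, graph_scores, alpha=0.7):
    """Blend per-document scores: alpha * vector + (1 - alpha) * graph."""
    doc_ids = set(vector_scores) | set(graph_scores)
    combined = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * graph_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest score first; ties broken by id for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {"chunk_1": 0.9, "chunk_2": 0.4}
graph_scores = {"chunk_2": 1.0, "chunk_3": 0.8}
ranking = hybrid_rank(vector_scores, graph_scores)
```

A chunk found only through graph expansion (`chunk_3` here) still enters the ranking, which is the practical benefit of the hybrid approach over pure vector search.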


In [None]:
from semantica.context import AgentContext, ContextGraph
from semantica.llms import Groq
import os

context_graph = ContextGraph()
context_graph.build_from_entities_and_relationships(
    entities=kg.get('entities', []),
    relationships=[
        {
            **r,
            'source_id': r.get('source_id') or r.get('source'),
            'target_id': r.get('target_id') or r.get('target'),
        }
        for r in kg.get('relationships', [])
    ]
)

graph_stats = context_graph.stats()
print(f"Intelligence Context Graph: {graph_stats['node_count']} nodes, {graph_stats['edge_count']} edges")

context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=context_graph,
    hybrid_alpha=0.7,
    use_graph_expansion=True,
    max_expansion_hops=3
)

for chunk in chunked_docs[:30]:
    if chunk and chunk.strip():
        context.store(
            content=chunk,
            metadata={'source': 'intelligence_analysis'},
            extract_entities=True,
            link_entities=True
        )

llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

intelligence_queries = [
    "What entities are mentioned in multiple intelligence sources?",
    "What are the temporal patterns in the intelligence data?"
]

print("\n" + "=" * 80)
print("Intelligence Analysis - GraphRAG with Multi-Hop Reasoning")
print("=" * 80)

for query in intelligence_queries:
    print(f"\n{'='*80}")
    print(f"Intelligence Query: {query}")
    print(f"{'='*80}\n")
    
    result = context.query_with_reasoning(
        query=query,
        llm_provider=llm,
        max_results=15,
        max_hops=3,
        min_score=0.2
    )
    
    print(f"Generated Response:\n{result.get('response', 'No response available')}\n")
    
    if result.get('reasoning_path'):
        print(f"Reasoning Path:\n{result.get('reasoning_path')}\n")
    
    print(f"Confidence Score: {result.get('confidence', 0):.3f}")
    print(f"Intelligence Sources: {result.get('num_sources', 0)}")
    print(f"Reasoning Paths: {result.get('num_reasoning_paths', 0)}")
    print()


---

## Visualization

Visualize the intelligence knowledge graph to explore relationships, temporal patterns, and multi-source correlations.


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer(layout="force", color_scheme="vibrant")
visualizer.visualize_network(kg, output="interactive")


---

## Export

Export the knowledge graph in multiple formats for intelligence reporting and further analysis.
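
For a sense of what the JSON and CSV outputs might contain, the sketch below serializes a tiny graph dict with the standard library. It assumes a `{"entities": [...], "relationships": [...]}` layout with `source`, `type`, and `target` keys on each relationship; the exporters' actual file layouts may differ.

```python
import csv
import io
import json


def to_json(kg):
    """Serialize the whole graph dict as pretty-printed JSON."""
    return json.dumps(kg, indent=2, sort_keys=True)


def relationships_to_csv(kg):
    """Write one CSV row per relationship (an edge list)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source", "type", "target"], lineterminator="\n"
    )
    writer.writeheader()
    for rel in kg["relationships"]:
        writer.writerow(
            {"source": rel["source"], "type": rel["type"], "target": rel["target"]}
        )
    return buffer.getvalue()


kg_example = {
    "entities": [{"id": "Acme", "type": "ORG"}, {"id": "Oslo", "type": "GPE"}],
    "relationships": [{"source": "Acme", "type": "operates_in", "target": "Oslo"}],
}
csv_text = relationships_to_csv(kg_example)
json_text = to_json(kg_example)
```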


In [None]:
from semantica.export import GraphExporter, JSONExporter, CSVExporter

GraphExporter().export_knowledge_graph(kg, "intelligence_analysis.graphml", format="graphml")
JSONExporter().export_knowledge_graph(kg, "intelligence_analysis.json")
CSVExporter().export_knowledge_graph(kg, "intelligence_analysis.csv")

print("Exported knowledge graph in JSON, GraphML, and CSV formats")
