[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/cybersecurity/02_Threat_Intelligence_Hybrid_RAG.ipynb)

# Threat Intelligence Hybrid RAG - Vector + Graph Retrieval

## Overview

This notebook demonstrates a **hybrid RAG** pipeline for threat intelligence built with Semantica, with a focus on **hybrid search**, **vector + graph retrieval**, and **context-aware queries**. The pipeline combines vector search with a temporal knowledge graph, so queries can draw on both semantic similarity and explicit relationships between threat entities.

### Key Features

- **Hybrid RAG**: Combines vector similarity search with knowledge graph traversal
- **Vector + Graph Retrieval**: Uses both vector embeddings and graph relationships
- **Context-Aware Queries**: Enriches retrieved results with related entities and relationships from the graph
- **Temporal Knowledge Graphs**: Builds temporal KGs for threat timeline analysis
- **Multi-hop Reasoning**: Follows relationships across the graph for deeper context
- **Multiple Data Sources**: Ingests several public threat intelligence RSS feeds, with a local sample file as a fallback
- **Modular Architecture**: Uses Semantica modules directly, without the core orchestrator

### Learning Objectives

- Ingest threat intelligence data from multiple sources
- Extract threat entities (IOCs, Campaigns, Threats, Actors, TTPs, Malware)
- Build temporal threat intelligence knowledge graphs
- Generate embeddings and populate vector stores
- Perform hybrid vector + graph queries
- Analyze threat networks using graph analytics
- Store and query threat intelligence using vector stores and graph stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Temporal Queries]
    K --> L[Graph Analytics]
    L --> M[GraphRAG Queries]
    J --> M
    H --> N[Reasoning & Threat]
    M --> O[Visualization]
    N --> O
    H --> P[Graph Store]
    P --> O
    O --> Q[Export]
```


## Installation


In [1]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


Note: you may need to restart the kernel to use updated packages.




## Configuration & Setup


In [2]:
import os

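# Read the Groq API key from the environment instead of hardcoding a secret.
# "<your-groq-api-key>" is a placeholder; replace it or export GROQ_API_KEY first.
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "<your-groq-api-key>")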

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"


## Ingesting Threat Intelligence Data


In [4]:
import os
from contextlib import redirect_stderr
from io import StringIO

from semantica.ingest import FeedIngestor, FileIngestor

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Threat Intelligence RSS Feeds
    ("US-CERT Alerts", "https://www.us-cert.gov/ncas/alerts.xml"),
    ("SANS ISC", "https://isc.sans.edu/rssfeed.xml"),
    ("Krebs on Security", "https://krebsonsecurity.com/feed/"),
    ("ThreatPost", "https://threatpost.com/feed/"),
    ("BleepingComputer", "https://www.bleepingcomputer.com/feed/"),
    ("SecurityWeek", "https://www.securityweek.com/rss"),
]

feed_ingestor = FeedIngestor()
all_documents = []

print(f"Ingesting from {len(feed_sources)} feed sources...")
for i, (feed_name, feed_url) in enumerate(feed_sources, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
                feed_count += 1
        
        if feed_count > 0:
            print(f"  [{i}/{len(feed_sources)}] {feed_name}: {feed_count} documents")
    except Exception:
        continue

if not all_documents:
    threat_data = """
    IOC: IP address 192.168.1.50 associated with APT28 campaign.
    Threat actor APT28 uses TTP: Spear phishing and credential harvesting.
    Campaign Operation GhostShell targets financial institutions.
    Malware sample hash: abc123def456 linked to APT28 infrastructure.
    IOC: Domain example-malicious.com linked to APT29 operations.
    Threat actor APT29 uses TTP: Watering hole attacks and lateral movement.
    Campaign Operation SolarWinds targets technology companies.
    IOC: File hash xyz789ghi012 associated with ransomware group.
    """
    with open("data/threat_intel.txt", "w") as f:
        f.write(threat_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/threat_intel.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")


Ingesting from 6 feed sources...
  [1/6] US-CERT Alerts: 10 documents
  [2/6] SANS ISC: 10 documents
  [3/6] Krebs on Security: 10 documents
  [4/6] ThreatPost: 10 documents
Ingested 40 documents


## Parsing Threat Intelligence Documents


In [5]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

documents = parsed_documents


Parsing 40 documents...
  Parsed 40/40 documents...


## Normalizing and Chunking Threat Intelligence Data


In [6]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use entity-aware chunking to preserve threat entity boundaries for GraphRAG
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")


Normalizing 40 documents...
  Normalized 40/40 documents...
Chunking 40 documents...
  Chunked 40/40 documents (85 chunks so far)
Created 85 chunks from 40 documents
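
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of plain sliding-window chunking. It is not the entity-aware method used above and not Semantica's implementation; `chunk_text` is a hypothetical helper, and it assumes both settings are measured in characters, which the notebook does not confirm.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2500-character document with the notebook's settings (1000 / 200)
sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

Each window repeats the last 200 characters of the previous one, so an entity mention that straddles a boundary still appears whole in at least one chunk. Entity-aware splitting goes further by moving the boundaries themselves.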


## Extracting Threat Entities


In [None]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="ml",  
    model="en_core_web_sm"
)

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks using ML-based extraction...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(chunk_text)
        # Filter entities by threat intelligence types
        filtered_entities = [
            e for e in entities 
            if any(entity_type.lower() in e.label.lower() for entity_type in ["IOC", "Campaign", "Threat", "Actor", "TTP", "Malware", "ORG", "PERSON", "GPE"])
        ]
        all_entities.extend(filtered_entities)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found)")

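# Map spaCy entity types to threat intelligence types using simple heuristics.
# Note: every PERSON/ORG entity is counted as an actor, which inflates the actor count,
# and a general-purpose spaCy model emits no IOC or TTP labels.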
iocs = [e for e in all_entities if "ioc" in e.label.lower() or e.text.startswith(("http", "192", "10.", "172."))]
actors = [e for e in all_entities if e.label in ["PERSON", "ORG"] or "actor" in e.label.lower()]
campaigns = [e for e in all_entities if "campaign" in e.label.lower() or "campaign" in e.text.lower()]
ttps = [e for e in all_entities if "ttp" in e.label.lower() or "technique" in e.label.lower()]

print(f"Extracted {len(iocs)} IOCs, {len(actors)} actors, {len(campaigns)} campaigns, {len(ttps)} TTPs")


Extracting entities from 85 chunks using ML-based extraction...
  Processed 20/85 chunks (248 entities found)
  Processed 40/85 chunks (504 entities found)
  Processed 60/85 chunks (811 entities found)
  Processed 80/85 chunks (851 entities found)
  Processed 85/85 chunks (857 entities found)
Extracted 2 IOCs, 741 actors, 3 campaigns, 0 TTPs
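
The counts above (2 IOCs, 0 TTPs) suggest that a general-purpose NER model rarely labels indicators of compromise. One common complement is pattern matching. The sketch below is an illustration using only the standard library, not part of Semantica; `extract_iocs` and the patterns are assumptions, and the sample sentence is made up.

```python
import ipaddress
import re

# Candidate patterns; they are deliberately simple and will miss edge cases.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_iocs(text: str) -> dict[str, list[str]]:
    """Return de-duplicated IPv4 addresses, MD5/SHA-1/SHA-256 hashes, and CVE IDs."""
    ips = []
    for candidate in IPV4_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects octets above 255
        except ValueError:
            continue
        ips.append(candidate)
    return {
        "ipv4": sorted(set(ips)),
        "hash": sorted(set(HASH_RE.findall(text))),
        "cve": sorted({c.upper() for c in CVE_RE.findall(text)}),
    }


sample = (
    "Beacon to 192.168.1.50 and 999.1.1.1; exploit cve-2021-44228; "
    "sample md5 d41d8cd98f00b204e9800998ecf8427e."
)
print(extract_iocs(sample))
```

Matches from a function like this could be merged into `all_entities` with an `IOC` label before the mapping step, though how Semantica's `Entity` objects should be constructed for that is left open here.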


## Extracting Threat Relationships


In [9]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="dependency",  
    model="en_core_web_sm",  
    confidence_threshold=0.5,  
    max_distance=50  
)

all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks using ML-based dependency parsing...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        # Extract relationships using dependency parsing
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["associated_with", "uses", "targets", "linked_to", "part_of", "employs"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


Extracting relationships from 85 chunks using ML-based dependency parsing...
  Processed 20/85 chunks (359 relationships found)
  Processed 40/85 chunks (630 relationships found)
  Processed 60/85 chunks (830 relationships found)
  Processed 80/85 chunks (887 relationships found)
  Processed 85/85 chunks (896 relationships found)
Extracted 896 relationships


## Resolving Duplicate IOCs and Actors

**Best Approach & Methods:**

- **Multi-Factor Detection**: `DuplicateDetector` with Jaro-Winkler similarity (0.85 threshold) + property/type matching for high-precision duplicate identification
- **Keep Most Complete Merge**: `EntityMerger` with `strategy="keep_most_complete"` preserves entities with maximum information (properties, relationships, metadata)
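
For intuition about the 0.85 threshold, the following is a minimal, self-contained sketch of Jaro-Winkler similarity. It is a textbook version written for illustration and is not taken from Semantica, whose internals may differ; the function names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: shared characters within a window, penalized for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


for a, b in [("apt28", "apt-28"), ("apt28", "apt29")]:
    print(a, b, round(jaro_winkler(a, b), 3))
```

Both pairs score above 0.85 in this sketch, including `apt28` versus `apt29`, which are distinct actors. Short names that differ only in a trailing digit are a known weak spot of name similarity alone, which is one reason to combine it with property and type matching.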


In [None]:
from semantica.deduplication import DuplicateDetector, EntityMerger
from semantica.semantic_extract import Entity

# Convert Entity objects to dictionaries for deduplication module
print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [
    {
        "id": f"entity_{i}",
        "name": e.text,
        "type": e.label,
        "confidence": e.confidence,
        "metadata": e.metadata if hasattr(e, 'metadata') else {}
    }
    for i, e in enumerate(all_entities)
]

# Use DuplicateDetector with similarity threshold for duplicate detection
duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,  # Jaro-Winkler similarity threshold
    confidence_threshold=0.7  # Minimum confidence for duplicate candidates
)

print(f"Detecting duplicates in {len(entity_dicts)} entities...")
duplicate_groups = duplicate_detector.detect_duplicate_groups(entity_dicts)

print(f"Detected {len(duplicate_groups)} duplicate groups")

# Use EntityMerger to merge duplicates using keep_most_complete strategy
entity_merger = EntityMerger(preserve_provenance=True)

print("Merging duplicates using keep_most_complete strategy...")
merge_operations = entity_merger.merge_duplicates(
    entity_dicts,
    strategy="keep_most_complete",  # Preserve entity with most information
    threshold=0.85
)

# Extract merged entities from merge operations
merged_entity_dicts = []
merged_ids = set()

for op in merge_operations:
    merged_entity_dicts.append(op.merged_entity)
    # Track all source entity IDs that were merged
    for source in op.source_entities:
        merged_ids.add(source.get("id") or source.get("name"))

# Add entities that weren't merged (singletons)
for entity in entity_dicts:
    entity_id = entity.get("id") or entity.get("name")
    if entity_id not in merged_ids:
        merged_entity_dicts.append(entity)

# Convert back to Entity objects
merged_entities = [
    Entity(
        text=e.get("name", ""),
        label=e.get("type", ""),
        confidence=e.get("confidence", 1.0),
        metadata=e.get("metadata", {})
    )
    for e in merged_entity_dicts
]

print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")


Converting 857 entities to dictionaries...
Detecting duplicates in 857 entities...


## Detecting Threat Intelligence Conflicts

**Best Approach & Methods:**

- **Type Conflict Detection**: `method="type"` identifies conflicting entity classifications (e.g., IOC as both "Malware" and "Threat")
- **Highest Confidence Resolution**: `strategy="highest_confidence"` automatically resolves conflicts by prioritizing the most confident source
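
The two ideas above can be sketched in a few lines of plain Python. This is an illustration of the concept only, operating on dictionaries shaped like the `entity_dicts` built earlier; it is not how `ConflictDetector` or `ConflictResolver` are implemented, and the records are made up.

```python
from collections import defaultdict


def detect_type_conflicts(entities: list[dict]) -> dict[str, list[dict]]:
    """Group entities by lowercased name and keep groups with more than one type."""
    by_name = defaultdict(list)
    for ent in entities:
        by_name[ent["name"].lower()].append(ent)
    return {
        name: ents
        for name, ents in by_name.items()
        if len({e["type"] for e in ents}) > 1
    }


def resolve_highest_confidence(conflicts: dict[str, list[dict]]) -> dict[str, dict]:
    """For each conflicting name, keep the record with the highest confidence."""
    return {
        name: max(ents, key=lambda e: e["confidence"])
        for name, ents in conflicts.items()
    }


# Made-up records: the same domain classified two different ways
records = [
    {"name": "example-malicious.com", "type": "IOC", "confidence": 0.9},
    {"name": "Example-Malicious.com", "type": "Malware", "confidence": 0.6},
    {"name": "APT28", "type": "Actor", "confidence": 0.8},
]
conflicts = detect_type_conflicts(records)
print(resolve_highest_confidence(conflicts))
```

Highest-confidence resolution is simple but discards the losing classification. If provenance matters, keeping the rejected records alongside the winner is a reasonable variation.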


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

# Use type conflict detection for conflicting threat classifications
# highest_confidence strategy prioritizes the most confident threat intelligence source
conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

# Pass all entities and relationships for comprehensive conflict detection
print(f"Detecting type conflicts in {len(all_entities)} entities and {len(all_relationships)} relationships...")
conflicts = conflict_detector.detect_conflicts(
    entities=all_entities,  # Use all extracted entities
    relationships=all_relationships,  # Use all extracted relationships
    method="type"  # Detect conflicts in entity types/classifications
)

print(f"Detected {len(conflicts)} type conflicts")

if conflicts:
    print("Resolving conflicts using highest_confidence strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="highest_confidence"  # Prioritize most confident source for threat classification
    )
    print(f"Resolved {len(resolved)} conflicts")
    
    # Update entities and relationships with resolved conflicts if needed
    # The resolved conflicts can be used to update the knowledge graph
else:
    print("No conflicts detected")


## Building Temporal Threat Intelligence Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

print("Building knowledge graph...")
kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


## Generating Embeddings for IOCs and Threats


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

ioc_texts = [f"{ioc.text} {getattr(ioc, 'description', '')}" for ioc in iocs]
ioc_embeddings = embedding_gen.generate_embeddings(ioc_texts)

actor_texts = [f"{actor.text} {getattr(actor, 'description', '')}" for actor in actors]
actor_embeddings = embedding_gen.generate_embeddings(actor_texts)

print(f"Generated {len(ioc_embeddings)} IOC embeddings and {len(actor_embeddings)} actor embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Storing {len(ioc_embeddings)} IOC vectors and {len(actor_embeddings)} actor vectors...")
ioc_ids = vector_store.store_vectors(
    vectors=ioc_embeddings,
    metadata=[{"type": "ioc", "name": ioc.text, "label": ioc.label} for ioc in iocs]
)

actor_ids = vector_store.store_vectors(
    vectors=actor_embeddings,
    metadata=[{"type": "actor", "name": actor.text, "label": actor.label} for actor in actors]
)

print(f"Stored {len(ioc_ids)} IOC vectors and {len(actor_ids)} actor vectors")
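
As a mental model for what a vector store does, the sketch below implements brute-force cosine search over stored vectors. `TinyVectorStore` is a hypothetical stand-in written for this illustration; it mirrors only the `store_vectors` call shape used above and says nothing about how Semantica or FAISS work internally. The 3-dimensional vectors are toy data.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory store with exhaustive search (hypothetical, for illustration)."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.vectors: list[list[float]] = []
        self.metadata: list[dict] = []

    def store_vectors(self, vectors, metadata) -> list[int]:
        if len(vectors) != len(metadata):
            raise ValueError("vectors and metadata must have the same length")
        ids = []
        for vec, meta in zip(vectors, metadata):
            if len(vec) != self.dimension:
                raise ValueError(f"expected dimension {self.dimension}, got {len(vec)}")
            ids.append(len(self.vectors))
            self.vectors.append(list(vec))
            self.metadata.append(meta)
        return ids

    def search(self, query, k: int = 3):
        """Return the k most similar entries as (score, metadata) pairs."""
        scored = [(cosine(query, v), m) for v, m in zip(self.vectors, self.metadata)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore(dimension=3)
store.store_vectors(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [{"type": "actor", "name": "actor-A"}, {"type": "ioc", "name": "ioc-1"}],
)
print(store.search([0.9, 0.1, 0.0], k=1))
```

The dimension check is the part that matters in practice: `EMBEDDING_DIMENSION = 384` has to match the output size of `all-MiniLM-L6-v2`, and changing the model without updating the constant would make stored vectors incompatible with the index.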


## Temporal Graph Queries


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Campaign"},
    at_time="2024-01-01"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.detect_temporal_patterns(kg, pattern_type="sequence")

print(f"Temporal queries: {len(query_results)} campaigns at query time")
print(f"Temporal patterns detected: {len(temporal_patterns)}")
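
Conceptually, a point-in-time query keeps only the facts whose validity interval contains the query date. The sketch below shows that filter on plain dictionaries. The `valid_from` and `valid_to` field names and the sample edges are assumptions made for this illustration; the notebook does not show how Semantica stores temporal metadata.

```python
from datetime import date


def edges_at(edges: list[dict], at_time: date) -> list[dict]:
    """Return edges valid on at_time; valid_to=None means the edge is still open."""
    return [
        e for e in edges
        if e["valid_from"] <= at_time
        and (e["valid_to"] is None or at_time <= e["valid_to"])
    ]


# Made-up edges with validity intervals
edges = [
    {"source": "actor-A", "target": "campaign-X", "type": "part_of",
     "valid_from": date(2023, 6, 1), "valid_to": date(2023, 12, 31)},
    {"source": "actor-A", "target": "campaign-Y", "type": "part_of",
     "valid_from": date(2023, 11, 15), "valid_to": None},
]

snapshot = edges_at(edges, date.fromisoformat("2024-01-01"))
print([e["target"] for e in snapshot])  # ['campaign-Y']
```

With day-level granularity, as set by `TEMPORAL_GRANULARITY`, comparing `date` objects is enough. Finer granularity would call for `datetime` values and a consistent time zone.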


## Analyzing Threat Network Structure


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator, CommunityDetector

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()
community_detector = CommunityDetector()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)

communities = community_detector.detect_communities(kg, method="louvain")
connectivity = graph_analyzer.analyze_connectivity(kg)

print("Graph analytics:")
print(f"  - Communities: {len(communities)}")
print(f"  - Connected components: {len(connectivity.get('components', []))}")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Central nodes (degree): {len(degree_centrality)}")
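
The metrics printed above have short definitions, sketched here on an undirected toy graph with the standard library only. These helpers are hypothetical and independent of Semantica's `GraphAnalyzer` and `CentralityCalculator`, which may normalize or weight things differently; the edge list is made up.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adj = defaultdict(set)
    for source, target in edges:
        adj[source].add(target)
        adj[target].add(source)
    return adj


def degree_centrality(adj):
    """Degree divided by (n - 1), the largest degree a node could have."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adj.items()}


def connected_components(adj):
    """Find components with breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def density(adj):
    """Share of possible undirected edges that are present."""
    n = len(adj)
    edge_count = sum(len(neighbors) for neighbors in adj.values()) / 2
    return 0.0 if n < 2 else 2 * edge_count / (n * (n - 1))


adj = build_adjacency([
    ("actor-A", "ioc-1"), ("actor-A", "ioc-2"),
    ("actor-A", "campaign-X"), ("actor-B", "ioc-3"),
])
print(degree_centrality(adj)["actor-A"])   # 0.6
print(len(connected_components(adj)))      # 2
print(round(density(adj), 3))              # 0.267
```

In a threat graph, a node with high degree centrality is typically shared infrastructure or a prolific actor, while separate components point to clusters of activity with no observed link between them.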


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

query = "What threats are associated with APT28?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)

print(f"GraphRAG query: '{query}'")
print(f"\nRetrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    print(f"   Content: {result.get('content', '')[:200]}...")
    if result.get('related_entities'):
        print(f"   Related entities: {len(result['related_entities'])}")
    print()
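
One way to picture what `use_graph=True` and `expand_graph=True` contribute is to treat vector hits as seeds and let their scores flow to graph neighbors with a decay per hop. The sketch below is an assumption about the general technique, not a description of `AgentContext.retrieve`; `hybrid_retrieve`, the decay factor, and the toy data are all hypothetical.

```python
def hybrid_retrieve(vector_hits, graph, max_hops=1, decay=0.5, max_results=10):
    """Expand vector hits through the graph.

    vector_hits: list of (entity, similarity score) from a vector search
    graph: dict mapping entity -> list of (relation, neighbor)
    Each hop multiplies the propagated score by `decay`; an entity keeps
    the best score it receives from any path.
    """
    scores = {}
    for entity, score in vector_hits:
        scores[entity] = max(score, scores.get(entity, 0.0))

    frontier = list(scores.items())
    for _ in range(max_hops):
        next_frontier = []
        for entity, score in frontier:
            for _relation, neighbor in graph.get(entity, []):
                propagated = score * decay
                if propagated > scores.get(neighbor, 0.0):
                    scores[neighbor] = propagated
                    next_frontier.append((neighbor, propagated))
        frontier = next_frontier

    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:max_results]


# Toy data: two vector hits and a small relationship graph
hits = [("actor-A", 0.9), ("ioc-9", 0.4)]
graph = {
    "actor-A": [("uses", "ttp-1"), ("associated_with", "ioc-1")],
    "ttp-1": [("targets", "campaign-X")],
}
for entity, score in hybrid_retrieve(hits, graph, max_hops=2):
    print(f"{entity}: {score:.3f}")
```

Here `campaign-X` surfaces even though it was never a vector hit, because it sits two hops from a strong seed. That is the multi-hop behavior a hybrid query is after; the decay keeps distant nodes from outranking direct matches.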


## Reasoning and Threat Analysis


In [None]:
from semantica.reasoning import Reasoner

reasoner = Reasoner()

reasoner.add_rule("IF IOC associated_with Campaign AND Campaign uses TTP THEN IOC linked_to TTP")
reasoner.add_rule("IF Actor uses TTP AND TTP targets Campaign THEN Actor part_of Campaign")

inferred_facts = reasoner.infer_facts(kg)

threat_paths = reasoner.find_paths(
    kg,
    source_type="Actor",
    target_type="IOC",
    max_hops=3
)

print(f"Inferred {len(inferred_facts)} facts")
print(f"Found {len(threat_paths)} threat paths")
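
Both rules above share one shape: IF `a R1 b` AND `b R2 c` THEN `a R3 c`. The sketch below applies rules of that shape to a set of triples by forward chaining until no new facts appear. It is a simplified illustration that ignores the entity types named in the rules (IOC, Campaign, and so on) and is not Semantica's `Reasoner`; the triples are made up.

```python
def apply_rule(facts, first_rel, second_rel, inferred_rel):
    """IF a first_rel b AND b second_rel c THEN a inferred_rel c."""
    objects_by_subject = {}
    for subj, rel, obj in facts:
        if rel == second_rel:
            objects_by_subject.setdefault(subj, []).append(obj)

    inferred = set()
    for subj, rel, obj in facts:
        if rel == first_rel:
            for target in objects_by_subject.get(obj, []):
                fact = (subj, inferred_rel, target)
                if fact not in facts:
                    inferred.add(fact)
    return inferred


def forward_chain(triples, rules):
    """Apply all rules repeatedly; return only the newly inferred facts."""
    facts = set(triples)
    while True:
        new_facts = set()
        for rule in rules:
            new_facts |= apply_rule(facts, *rule)
        if not new_facts:
            return facts - set(triples)
        facts |= new_facts


rules = [
    ("associated_with", "uses", "linked_to"),  # IOC -> Campaign -> TTP
    ("uses", "targets", "part_of"),            # Actor -> TTP -> Campaign
]
triples = {
    ("ioc-1", "associated_with", "campaign-X"),
    ("campaign-X", "uses", "ttp-1"),
    ("actor-A", "uses", "ttp-2"),
    ("ttp-2", "targets", "campaign-Y"),
}
print(sorted(forward_chain(triples, rules)))
```

Because types are ignored, a rule can fire on any pair of matching relations, so a production version would also check that `a`, `b`, and `c` have the expected entity types before asserting the new fact.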


## Storing Threat Intelligence Graph (Optional)


In [None]:
from semantica.graph_store import GraphStore

# Optional: Store to persistent graph database
# graph_store = GraphStore(backend="neo4j", uri="bolt://localhost:7687", user="neo4j", password="password")
# graph_store.store_graph(kg)

print("Graph store step skipped (persistence code is commented out for this demo)")


## Visualizing the Threat Intelligence Knowledge Graph


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="threat_intelligence_kg.html",
    layout="spring",
    node_size=20
)

print("Visualization saved to threat_intelligence_kg.html")


## Exporting Results


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="threat_intelligence_kg.json", format="json")
exporter.export(kg, output_path="threat_intelligence_kg.graphml", format="graphml")
exporter.export(kg, output_path="threat_intelligence_iocs.csv", format="csv")

print("Exported threat intelligence knowledge graph to JSON, GraphML, and CSV formats")
