[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/cybersecurity/01_Real_Time_Anomaly_Detection.ipynb)

# Real-Time Anomaly Detection - Stream Processing & Temporal KGs

## Overview

This notebook demonstrates **real-time anomaly detection** with Semantica, focusing on **stream ingestion**, **temporal knowledge graphs**, and **pattern detection**. The pipeline streams security logs in real time, builds a temporal knowledge graph from them, and flags anomalies by matching graph patterns.

### Key Features

- **Stream Processing**: Streams and processes security logs in real time
- **Temporal Knowledge Graphs**: Builds temporal KGs to track events over time
- **Pattern Detection**: Uses graph patterns to identify anomalies
- **Automated Alerting**: Generates alerts for detected anomalies
- **Comprehensive Data Sources**: Multiple security RSS feeds, APIs, and databases
- **Modular Architecture**: Uses Semantica modules directly, without the core orchestrator

### Learning Objectives

- Ingest security data from multiple sources (RSS feeds, APIs, streams)
- Extract security entities (Logs, Events, IPs, Users, Alerts, Attacks)
- Build temporal security knowledge graphs
- Perform temporal queries and pattern detection
- Detect anomalies using graph reasoning
- Store and query security data using vector stores and graph stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Temporal Queries]
    K --> L[Pattern Detection]
    L --> M[Reasoning & Anomaly]
    J --> N[GraphRAG Queries]
    M --> N
    H --> O[Graph Store]
    N --> P[Visualization]
    O --> P
    P --> Q[Export]
```


## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup


In [None]:
import os

# Read the Groq API key from the environment; replace the placeholder locally and never commit a real key
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-groq-api-key")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
TEMPORAL_GRANULARITY = "minute"


## Ingesting Security Data from Multiple Sources


In [None]:
from semantica.ingest import FeedIngestor, StreamIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Security RSS Feeds
    ("US-CERT Alerts", "https://www.us-cert.gov/ncas/alerts.xml"),
    ("SANS ISC", "https://isc.sans.edu/rssfeed.xml"),
    ("Krebs on Security", "https://krebsonsecurity.com/feed/"),
    ("ThreatPost", "https://threatpost.com/feed/"),
    ("BleepingComputer", "https://www.bleepingcomputer.com/feed/"),
]

feed_ingestor = FeedIngestor()
all_documents = []

print(f"Ingesting from {len(feed_sources)} feed sources...")
for i, (feed_name, feed_url) in enumerate(feed_sources, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
                feed_count += 1
        
        if feed_count > 0:
            print(f"  [{i}/{len(feed_sources)}] {feed_name}: {feed_count} documents")
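    except Exception as exc:
        # Skip feeds that are unreachable or malformed, but report them instead of failing silently
        print(f"  [{i}/{len(feed_sources)}] {feed_name}: skipped ({exc})")
        continue

# Fall back to simulated log data if no feed returned documents
# (in production, stream from Kafka or a WebSocket instead)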
if not all_documents:
    security_logs = """
    2024-01-01 10:00:00 - Login attempt from IP 192.168.1.100 user admin
    2024-01-01 10:01:00 - Failed login from IP 192.168.1.100 user admin
    2024-01-01 10:02:00 - Multiple failed logins from IP 192.168.1.100 user admin
    2024-01-01 10:03:00 - Unusual activity detected from IP 192.168.1.100
    2024-01-01 10:04:00 - Alert: Potential brute force attack from IP 192.168.1.100
    2024-01-01 10:05:00 - Login attempt from IP 192.168.1.101 user test
    2024-01-01 10:06:00 - Suspicious file access from IP 192.168.1.102
    2024-01-01 10:07:00 - Multiple connection attempts from IP 192.168.1.103
    """
    with open("data/security_logs.txt", "w") as f:
        f.write(security_logs)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/security_logs.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
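
The simulated log lines above share a `timestamp - message` layout. As a minimal standard-library sketch (independent of Semantica's ingestors; `parse_log_line` is a hypothetical helper, not part of the library), each line can be split into a `datetime` and a message before any further processing:

```python
from datetime import datetime

def parse_log_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS - message' into (datetime, message), or None if malformed."""
    line = line.strip()
    if not line:
        return None
    stamp, sep, message = line.partition(" - ")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return when, message

sample = """
    2024-01-01 10:00:00 - Login attempt from IP 192.168.1.100 user admin
    2024-01-01 10:01:00 - Failed login from IP 192.168.1.100 user admin
"""
# Keep only the lines that parsed cleanly
records = [r for r in map(parse_log_line, sample.splitlines()) if r is not None]
for when, message in records:
    print(when.isoformat(), "|", message)
```

Keeping the timestamp as a real `datetime` makes later time-window comparisons straightforward.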


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

documents = parsed_documents


## Normalizing and Chunking Security Logs


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use sentence chunking for log line boundaries (structured logs)
splitter = TextSplitter(method="sentence", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")
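
To build intuition for how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a rough character-based sketch. It assumes sizes are measured in characters and ignores sentence boundaries, so it is not how the `TextSplitter` sentence method works internally; `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Slide a fixed-size window over text; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(text)
print([len(c) for c in chunks])
```

With a 500-character window and a 50-character overlap, each new chunk starts 450 characters after the previous one.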


## Extracting Security Entities


In [None]:
from semantica.semantic_extract import NERExtractor

security_patterns = {
    "IP": r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b",
    "User": r"\buser\s+([a-zA-Z0-9_\-\.]+)\b",
    "Alert": r"\b(?:alert|warning|alarm):\s*([^\n\.]+)|\b(?:alert|warning|alarm)\s+(?:detected|triggered|generated|raised)\b",
    "Event": r"\b(?:login|access|connection|request|attempt|failed|successful|suspicious|unusual)\s+(?:event|attempt|request|activity|access)\b",
    "Log": r"\b\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+-\s+([^\n]+)",
    "Attack": r"\b(?:attack|breach|intrusion|exploit|malware|virus|ransomware|phishing|brute\s+force|ddos)\b",
}

entity_extractor = NERExtractor(method="regex", patterns=security_patterns)

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["Log", "Event", "IP", "User", "Alert", "Attack"]
        )
        all_entities.extend(entities)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found)")

ips = [e for e in all_entities if e.label == "IP" or "ip" in e.label.lower()]
users = [e for e in all_entities if e.label == "User" or "user" in e.label.lower()]
alerts = [e for e in all_entities if e.label == "Alert" or "alert" in e.label.lower()]

print(f"Extracted {len(ips)} IPs, {len(users)} users, {len(alerts)} alerts")
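
To see what the `IP` and `User` patterns match on their own, the sketch below applies them to a single log line with Python's `re` module and its default flags. It illustrates the regular expressions only and makes no assumption about how `NERExtractor` applies them internally.

```python
import re

# Same expressions as the "IP" and "User" entries in security_patterns
IP_PATTERN = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)
USER_PATTERN = re.compile(r"\buser\s+([a-zA-Z0-9_\-\.]+)\b")

line = "2024-01-01 10:01:00 - Failed login from IP 192.168.1.100 user admin"

ips_found = IP_PATTERN.findall(line)      # full matches (no capturing groups)
users_found = USER_PATTERN.findall(line)  # contents of the single capturing group

print(ips_found, users_found)
```

Note that `USER_PATTERN` is case-sensitive under the default flags, so a line containing `User admin` would need `re.IGNORECASE` to match.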


## Extracting Security Relationships


In [None]:
from semantica.semantic_extract import RelationExtractor

# Filter entities to only meaningful security entities
filtered_entities = [
    e for e in all_entities 
    if e.label in ["IP", "User", "Alert", "Attack", "Event", "Log"] 
    and len(e.text) > 2
    and e.text.lower() not in ["to", "from", "should", "would", "choices", "connects"]
]

relation_extractor = RelationExtractor(
    method="cooccurrence",
    max_distance=60,
    confidence_threshold=0.6
)

# Deduplicate relationships
seen_relationships = set()
all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks using {len(filtered_entities)} filtered entities...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=filtered_entities,
            relation_types=["from", "attempts", "triggers", "detects", "associated_with", "causes"]
        )
        # Deduplicate based on subject, predicate, object
        for rel in relationships:
            rel_key = (rel.subject.text, rel.predicate, rel.object.text)
            if rel_key not in seen_relationships:
                seen_relationships.add(rel_key)
                all_relationships.append(rel)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")
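
The idea behind co-occurrence extraction is that two entities mentioned close together are probably related. The following is a simplified standard-library sketch of that idea, assuming distance is counted in characters between mention start positions. It is not Semantica's implementation, and `cooccurring_pairs` is a hypothetical helper.

```python
import re
from itertools import combinations

def cooccurring_pairs(text, entity_texts, max_distance=60):
    """Return sorted (a, b, gap) triples for distinct entities mentioned within max_distance characters."""
    mentions = []
    for entity in entity_texts:
        for match in re.finditer(re.escape(entity), text):
            mentions.append((match.start(), entity))

    pairs = {}
    for (pos_a, ent_a), (pos_b, ent_b) in combinations(sorted(mentions), 2):
        gap = abs(pos_b - pos_a)
        if ent_a != ent_b and gap <= max_distance:
            key = tuple(sorted((ent_a, ent_b)))
            pairs[key] = min(gap, pairs.get(key, gap))  # keep the closest mention pair
    return sorted((a, b, gap) for (a, b), gap in pairs.items())

text = "Failed login from IP 192.168.1.100 user admin. " + "x" * 80 + " Alert raised."
result = cooccurring_pairs(text, ["192.168.1.100", "admin", "Alert"])
print(result)
```

Here the IP and the user sit a few characters apart and are paired, while `Alert` is separated by more than 60 characters of filler and is left out.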


## Detecting Security Conflicts

- **Entity-wide conflict detection** identifies value, type, relationship, and temporal conflicts across security entities drawn from multiple sources.
- **The voting resolution strategy** selects the value that most sources agree on, which favors security event data corroborated across feeds.


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

# Convert entities to dictionaries for conflict detection
entity_dicts = [
    {
        "id": e.text,
        "text": e.text,
        "label": e.label,
        "type": e.label,
        "confidence": e.confidence if hasattr(e, 'confidence') else 1.0,
        "metadata": e.metadata if hasattr(e, 'metadata') else {}
    }
    for e in all_entities
]

# Convert relationships to dictionaries for conflict detection
relationship_dicts = [
    {
        "id": f"{r.subject.text}_{r.predicate}_{r.object.text}",
        "source_id": r.subject.text,
        "target_id": r.object.text,
        "type": r.predicate,
        "subject": r.subject.text,
        "object": r.object.text,
        "predicate": r.predicate,
        "confidence": r.confidence if hasattr(r, 'confidence') else 1.0
    }
    for r in all_relationships
]

print(f"Detecting conflicts in {len(entity_dicts)} entities and {len(relationship_dicts)} relationships...")

# Detect entity conflicts (value, type, temporal)
value_conflicts = conflict_detector.detect_value_conflicts(entity_dicts, property_name="label")
type_conflicts = conflict_detector.detect_type_conflicts(entity_dicts)
temporal_conflicts = conflict_detector.detect_temporal_conflicts(entity_dicts)
entity_conflicts = value_conflicts + type_conflicts + temporal_conflicts

# Detect relationship conflicts
relationship_conflicts = conflict_detector.detect_relationship_conflicts(relationship_dicts)

# Combine all conflicts
all_conflicts = entity_conflicts + relationship_conflicts

print(f"Detected {len(all_conflicts)} conflicts ({len(entity_conflicts)} entity, {len(relationship_conflicts)} relationship)")

if all_conflicts:
    print("Resolving conflicts using voting strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        all_conflicts,
        strategy="voting"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")
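
Majority voting itself is simple to illustrate. The sketch below is a generic standard-library version and does not reproduce `ConflictResolver`; `resolve_by_voting` and the sample labels are hypothetical.

```python
from collections import Counter

def resolve_by_voting(values):
    """Return (winner, support): the most common value and the share of votes it received."""
    if not values:
        raise ValueError("need at least one value")
    counts = Counter(values)
    # most_common keeps first-seen order among ties, so ties go to the earliest value
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(values)

# Three sources label the same entity; two of them agree
reports = ["Attack", "Attack", "Event"]
label, support = resolve_by_voting(reports)
print(label, round(support, 2))
```

The support ratio is a convenient by-product: a low value signals that the sources disagree strongly and the result may deserve manual review.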


## Building Temporal Security Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=False,
    resolve_conflicts=False,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

print("Building knowledge graph...")
kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence if hasattr(e, 'confidence') else 1.0} for e in all_entities],
    "relationships": [{"source": r.subject.text, "target": r.object.text, "type": r.predicate, "confidence": r.confidence if hasattr(r, 'confidence') else 1.0} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")
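
One way to picture a temporal granularity of `"minute"` is as a bucketing step: events whose timestamps fall in the same minute are grouped together. The sketch below illustrates that interpretation with the standard library. How `GraphBuilder` actually applies `temporal_granularity` is not shown in this notebook, so treat this as an assumption; `truncate` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime

def truncate(timestamp, granularity="minute"):
    """Round a datetime down to the start of its minute, hour, or day."""
    if granularity == "minute":
        return timestamp.replace(second=0, microsecond=0)
    if granularity == "hour":
        return timestamp.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")

events = [
    (datetime(2024, 1, 1, 10, 1, 5), "Failed login"),
    (datetime(2024, 1, 1, 10, 1, 40), "Failed login"),
    (datetime(2024, 1, 1, 10, 2, 10), "Unusual activity"),
]

# Group event descriptions by the minute they occurred in
buckets = defaultdict(list)
for when, description in events:
    buckets[truncate(when)].append(description)

for bucket, items in sorted(buckets.items()):
    print(bucket.isoformat(), len(items))
```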


## Generating Embeddings for Events and IPs


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

event_entities = [e for e in all_entities if e.label in ["Event", "Log", "Alert"]]
print(f"Generating embeddings for {len(event_entities)} events and {len(ips)} IPs...")
event_texts = [e.text for e in event_entities]
event_embeddings = embedding_gen.generate_embeddings(event_texts)

ip_texts = [ip.text for ip in ips]
ip_embeddings = embedding_gen.generate_embeddings(ip_texts)

print(f"Generated {len(event_embeddings)} event embeddings and {len(ip_embeddings)} IP embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Storing {len(event_embeddings)} event vectors and {len(ip_embeddings)} IP vectors...")
event_ids = vector_store.store_vectors(
    vectors=event_embeddings,
    metadata=[{"type": "event", "name": e.text, "label": e.label} for e in event_entities]
)

ip_ids = vector_store.store_vectors(
    vectors=ip_embeddings,
    metadata=[{"type": "ip", "name": ip.text, "label": ip.label} for ip in ips]
)

print(f"Stored {len(event_ids)} event vectors and {len(ip_ids)} IP vectors")
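
Conceptually, a vector store answers "which stored vectors are closest to this query vector?". The brute-force sketch below shows that lookup with cosine similarity on toy 3-dimensional vectors. It uses only the standard library and is not how FAISS or `VectorStore` search internally; `cosine_similarity`, `top_k`, and the sample data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, vectors, metadata, k=2):
    """Return the k (score, metadata) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, v), m) for v, m in zip(vectors, metadata)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
metadata = [
    {"type": "event", "name": "failed login"},
    {"type": "event", "name": "login attempt"},
    {"type": "ip", "name": "192.168.1.102"},
]

hits = top_k([1.0, 0.0, 0.0], vectors, metadata)
for score, meta in hits:
    print(f"{score:.3f}", meta["name"])
```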


## Temporal Graph Queries


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Alert"},
    at_time="2024-01-01 10:04:00"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.query_temporal_pattern(kg, pattern="sequence")

print(f"Temporal queries: {query_results.get('num_relationships', 0)} alerts at query time")
print(f"Temporal patterns detected: {temporal_patterns.get('num_patterns', 0)}")
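
A point-in-time query can be thought of as a filter over relationships that carry a validity interval. The sketch below assumes each relationship has hypothetical `valid_from` and `valid_to` fields (with `None` meaning "still valid"). The structure of Semantica's temporal graph may differ, so this only illustrates the concept behind `query_at_time`.

```python
from datetime import datetime

edges = [
    {"source": "192.168.1.100", "type": "triggers", "target": "brute force alert",
     "valid_from": datetime(2024, 1, 1, 10, 4), "valid_to": None},
    {"source": "192.168.1.101", "type": "attempts", "target": "login",
     "valid_from": datetime(2024, 1, 1, 10, 5), "valid_to": None},
]

def edges_at(edges, at_time):
    """Keep edges whose [valid_from, valid_to) interval contains at_time."""
    return [
        e for e in edges
        if e["valid_from"] <= at_time and (e["valid_to"] is None or at_time < e["valid_to"])
    ]

snapshot = edges_at(edges, datetime(2024, 1, 1, 10, 4))
for edge in snapshot:
    print(edge["source"], edge["type"], edge["target"])
```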


## Detecting Anomaly Patterns


In [None]:
from semantica.kg import GraphAnalyzer

graph_analyzer = GraphAnalyzer()

# Detect suspicious IPs
suspicious_ips = []
for entity in kg.get("entities", []):
    if entity.get("type") == "IP":
        related_rels = [r for r in kg.get("relationships", []) 
                        if r.get("source") == entity.get("id") or r.get("target") == entity.get("id")]
        if any("alert" in str(r.get("type", "")).lower() or "attack" in str(r.get("type", "")).lower() 
               for r in related_rels):
            suspicious_ips.append(entity)

# Find all Alert entities
alert_entities = [e for e in kg.get("entities", []) if e.get("type") == "Alert"]
alert_ids = [e.get("id") for e in alert_entities if e.get("id")]

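# Detect anomaly patterns (multiple failed logins, unusual activity)
anomaly_patterns = []
# Deduplicate IP strings (order preserved) so each IP is reported at most once
unique_ip_names = list(dict.fromkeys(ip.text for ip in ips))
for ip_name in unique_ip_names[:10]:
    ip_id = ip_name  # Use text as ID if entity ID not available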
    
    # Find paths from IP to each Alert entity
    paths_found = []
    for alert_id in alert_ids:
        path_result = graph_analyzer.connectivity_analyzer.calculate_shortest_paths(
            kg,
            source=ip_id,
            target=alert_id
        )
        # Count the path only if it exists and its distance is known and within max_hops (2)
        if path_result.get("exists") and 0 <= path_result.get("distance", -1) <= 2:
            paths_found.append(path_result)
    
    if len(paths_found) > 0:
        anomaly_patterns.append({
            'ip': ip_name,
            'alert_count': len(paths_found),
            'pattern': 'suspicious_activity'
        })

print(f"Pattern detection: {len(anomaly_patterns)} anomaly patterns found")
print(f"Suspicious IPs: {len(suspicious_ips)}")
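
The "IP within two hops of an Alert" check above boils down to a bounded breadth-first search. The sketch below shows that search on a plain list of relationship dictionaries, treating edges as undirected. It is a standard-library illustration rather than the `connectivity_analyzer` implementation; `hops_between` and the sample edges are hypothetical.

```python
from collections import deque

def hops_between(relationships, source, target, max_hops=2):
    """Return the hop count from source to target if it is <= max_hops, else None."""
    adjacency = {}
    for rel in relationships:
        adjacency.setdefault(rel["source"], set()).add(rel["target"])
        adjacency.setdefault(rel["target"], set()).add(rel["source"])

    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

rels = [
    {"source": "192.168.1.100", "target": "failed login", "type": "attempts"},
    {"source": "failed login", "target": "brute force alert", "type": "triggers"},
    {"source": "192.168.1.101", "target": "login attempt", "type": "attempts"},
]

print(hops_between(rels, "192.168.1.100", "brute force alert"))
print(hops_between(rels, "192.168.1.101", "brute force alert"))
```

Because breadth-first search visits nodes in order of distance, the first time the target is reached is guaranteed to be via a shortest path.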


## Reasoning and Anomaly Detection


In [None]:
from semantica.reasoning import Reasoner
from semantica.kg import GraphAnalyzer

reasoner = Reasoner()

reasoner.add_rule("IF IP attempts Event AND Event type failed_login AND Event count > 3 THEN IP triggers Alert")
reasoner.add_rule("IF User from IP AND IP triggers Alert THEN User associated_with Alert")

inferred_facts = reasoner.infer_facts(kg)

# Find paths from IP entities to Alert entities
graph_analyzer = GraphAnalyzer()
ip_entities = [e for e in kg.get("entities", []) if e.get("type") == "IP"]
alert_entities = [e for e in kg.get("entities", []) if e.get("type") == "Alert"]

anomaly_paths = []
for ip_entity in ip_entities:
    ip_id = ip_entity.get("id") or ip_entity.get("name", "")
    for alert_entity in alert_entities:
        alert_id = alert_entity.get("id") or alert_entity.get("name", "")
        if ip_id and alert_id:
            path_result = graph_analyzer.connectivity_analyzer.calculate_shortest_paths(
                kg,
                source=ip_id,
                target=alert_id
            )
            # Count the path only if it exists and its distance is known and within max_hops (2)
            if path_result.get("exists") and 0 <= path_result.get("distance", -1) <= 2:
                anomaly_paths.append(path_result)

print(f"Inferred {len(inferred_facts)} facts")
print(f"Found {len(anomaly_paths)} anomaly paths")
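
The first rule above ("more than three failed logins from an IP triggers an alert") can be expressed as a simple threshold count. The sketch below is a standard-library illustration of that logic, not the `Reasoner` rule engine; `infer_alerts` and the event tuples are hypothetical.

```python
from collections import Counter

def infer_alerts(events, threshold=3):
    """Return (ip, 'triggers', 'Alert') for every IP with more than threshold failed logins."""
    failed = Counter(ip for ip, kind in events if kind == "failed_login")
    return [(ip, "triggers", "Alert") for ip, count in sorted(failed.items()) if count > threshold]

events = (
    [("192.168.1.100", "failed_login")] * 4
    + [("192.168.1.101", "failed_login"), ("192.168.1.100", "login")]
)

inferred = infer_alerts(events)
print(inferred)
```

Only the IP with four failed logins crosses the threshold; the single failure from the other address does not.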


## Storing Security Knowledge Graph (Optional)


In [None]:
from semantica.graph_store import GraphStore

# Optional: Store to persistent graph database
# graph_store = GraphStore(backend="neo4j", uri="bolt://localhost:7687", user="neo4j", password="password")
# graph_store.store_graph(kg)

print("Graph store step skipped (persistence code is commented out for this demo)")


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext
from semantica.llms import Groq
import os

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

# Initialize LLM provider
llm_provider = Groq(
    model="llama-3.1-8b-instant",
    api_key=os.getenv("GROQ_API_KEY")
)

query = "What IPs are associated with security alerts?"
result = context.query_with_reasoning(
    query=query,
    llm_provider=llm_provider,
    max_results=10,
    max_hops=2
)

print(f"GraphRAG Query with Reasoning: '{query}'\n")
print("=" * 80)
print(f"\nGenerated Response:\n{result['response']}\n")
print("=" * 80)
if result.get('reasoning_path'):
    print(f"\nReasoning Path:\n{result['reasoning_path']}\n")
print(f"Confidence: {result.get('confidence', 0):.3f}")
print(f"Sources Used: {result.get('num_sources', 0)}")
print(f"Reasoning Paths Found: {result.get('num_reasoning_paths', 0)}")


## Visualizing the Temporal Security Knowledge Graph


In [None]:
from semantica.visualization import KGVisualizer

# Create visualizer with optimized configuration for better layout
visualizer = KGVisualizer(
    layout="force",  # Force-directed layout for better node distribution
    node_size=15,  # Slightly smaller nodes for better visibility
    color_scheme="vibrant"  # Use vibrant color scheme for better distinction
)

# Create interactive visualization with improved layout parameters
fig = visualizer.visualize_network(
    kg,
    output="interactive",  # Interactive visualization in notebook
    node_color_by="type",  # Color nodes by entity type
    hover_data=["type"],  # Show entity type in hover tooltip
    algorithm="kamada_kawai",  # Use Kamada-Kawai algorithm for better layout (more stable than spring)
    k=2.0,  # Optimal distance between nodes (larger = more spread out)
    iterations=100,  # More iterations for better convergence
    seed=42  # Fixed seed for reproducible layouts
)

# Update layout for better appearance
fig.update_layout(
    title="Anomaly Detection Knowledge Graph",
    width=1200,  # Wider view
    height=800,  # Taller view
    font=dict(size=12),
    hovermode="closest"
)

# Display the interactive figure
fig.show()

print("Interactive visualization displayed above")



## Exporting Results


In [None]:
from semantica.export import GraphExporter, CSVExporter

# Export to supported graph formats
graph_exporter = GraphExporter()
graph_exporter.export(kg, output_path="anomaly_detection_kg.json", format="json")
graph_exporter.export(kg, output_path="anomaly_detection_kg.graphml", format="graphml")

# Export knowledge graph to CSV using Semantica CSVExporter
csv_exporter = CSVExporter()
csv_exporter.export_knowledge_graph(kg, "anomaly_detection_kg")
# This creates: anomaly_detection_kg_entities.csv and anomaly_detection_kg_relationships.csv

# Export only Alert entities to a separate CSV file
alerts = [e for e in kg.get("entities", []) if e.get("type") == "Alert"]
if alerts:
    csv_exporter.export_entities(alerts, "anomaly_detection_alerts.csv")

print("Exported knowledge graph to JSON and GraphML formats")
print("Exported knowledge graph entities and relationships to CSV")
if alerts:
    print(f"Exported {len(alerts)} Alert entities to anomaly_detection_alerts.csv")
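
For readers who want to post-process the graph without Semantica's exporters, filtering entities by type and writing them as CSV needs only the standard library. The sketch below assumes the graph is a dictionary with an `entities` list of `id`/`type` dictionaries, mirroring how `kg` is accessed in this notebook. The sample data is hypothetical, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

sample_kg = {
    "entities": [
        {"id": "alert-1", "type": "Alert"},
        {"id": "192.168.1.100", "type": "IP"},
    ]
}

# Keep only Alert entities, as in the export cell above
alert_rows = [e for e in sample_kg.get("entities", []) if e.get("type") == "Alert"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "type"])
writer.writeheader()
writer.writerows(alert_rows)

print(buffer.getvalue())
```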
