High-performance neural memory using VSA/HDC + ANN. Fuzzy similarity search with guards, negations, wildcards for AI.

Holon: Hyperdimensional Memory for Structured Data

Authors: watministrator, Grok (xAI), & Claude (Anthropic)

License: MIT · Python 3.8+

Superposition Incantation

Reality doesn't fold itself. We make it fold.

Inspired by Carin Meier's VSA talk on hyperdimensional computing.

Holon encodes JSON structure into vectors, enabling similarity search over structured data. Unlike semantic embeddings, which capture meaning, Holon captures structure: keys, nesting, and relationships become geometry.

Quick Start

from holon import CPUStore, HolonClient

store = CPUStore(dimensions=4096)
client = HolonClient(local_store=store)

# Insert structured data
client.insert_json({"name": "Alice", "role": "developer", "skills": ["python", "ml"]})
client.insert_json({"name": "Bob", "role": "designer", "skills": ["figma", "css"]})

# Similarity search - finds structurally similar documents
results = client.search_json(probe={"role": "developer"}, limit=5)
# → Alice (high similarity), Bob (lower)

# Fuzzy matching with guards
results = client.search_json(
    probe={"skills": ["python"]},
    guard={"role": "developer"},      # Exact filter
    negations={"status": "inactive"}  # Exclude
)

# Time-aware search - "documents from around that time"
client.insert_json({"event": "deploy", "at": {"$time": 1706500000}})
results = client.search_json(probe={"at": {"$time": 1706503600}})  # ~1hr later
# → Finds events from around the same time (similarity, not range query)

# Sequence encoding for event patterns
client.insert_json({
    "session": "abc",
    "events": {"$mode": "chained", "sequence": ["login", "transfer", "logout"]}
})
results = client.search_json(probe={
    "events": {"$mode": "chained", "sequence": ["login", "transfer"]}
})
# → Finds sessions with similar event patterns
Time-Bending Lattices

Structured data encoded into geometry. Similarity becomes distance.

What Makes Holon Different

| Traditional Vector DB | Holon |
| --- | --- |
| Semantic embeddings (meaning) | Structural embeddings (shape) |
| "Find similar text" | "Find similar JSON structures" |
| Requires ML models | Pure math (no models) |
| Opaque vectors | Composable primitives |

The genuine insight: VSA encodes structure as geometry. You can difference() two configs and get the delta as a vector. You can negate() expected changes. You can amplify() security fields. These operations compose mathematically.

Installation

git clone https://github.com/watmin/holon.git
cd holon
python -m venv holon_env
source holon_env/bin/activate
pip install -e .

All scripts use ./scripts/run_with_venv.sh to ensure venv activation:

./scripts/run_with_venv.sh python examples/basic_usage.py
./scripts/run_with_venv.sh pytest tests/

Core Primitives

Everything in Holon is built from these kernel operations:

| Category | Primitives |
| --- | --- |
| Encoding | encode_data(json), encode_sequence(items, mode) |
| Continuous | encode_scalar(v, mode), encode_scalar_log(v) - linear, circular, log-scale |
| VSA Ops | bind(a,b), unbind(ab,a), bundle([vecs]), permute(v,k) |
| Learning | prototype([examples]), prototype_add(p,ex,n), cleanup(noisy,codebook) |
| Streaming | create_accumulator(), accumulate(acc,v), normalize_accumulator(acc) |
| Manipulation | difference(a,b), amplify(v,sig,str), negate(v,x), blend(a,b,α), resonance(v,ref) |
| Similarity | similarity(a,b,metric) - cosine, hamming, overlap, agreement, euclidean, manhattan |
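
A minimal sketch of the VSA ops in action, assuming bind, unbind, bundle, and similarity are exposed on the store the way prototype() and difference() are in the examples below:

from holon import CPUStore

store = CPUStore(dimensions=4096)

role = store.encoder.encode_data({"field": "role"})
filler = store.encoder.encode_data({"value": "developer"})

pair = store.bind(role, filler)           # role-filler binding
recovered = store.unbind(pair, role)      # approximately recovers the filler
store.similarity(recovered, filler)       # high

bag = store.bundle([role, filler, pair])  # superposition of all three
store.similarity(bag, role)               # each component stays detectable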

Quick Examples

# Learn a prototype from examples
dev_vecs = [store.encoder.encode_data(d) for d in developer_profiles]
dev_prototype = store.prototype(dev_vecs)

# Classify new data
new_vec = store.encoder.encode_data(new_profile)
is_developer = store.similarity(new_vec, dev_prototype) > 0.5

# Find what changed between versions
v1 = store.encoder.encode_data(config_v1)
v2 = store.encoder.encode_data(config_v2)
delta = store.difference(v1, v2)  # The change is a vector!

# "X but NOT Y" queries
all_errors = store.encoder.encode_data({"type": "error"})
known_bugs = store.encoder.encode_data({"type": "error", "known": True})
unknown_errors = store.negate(all_errors, known_bugs)

# Boost specific signals
base_query = store.encoder.encode_data({"topic": "security"})
priority_signal = store.encoder.encode_data({"severity": "critical"})
boosted = store.amplify(base_query, priority_signal, strength=2.0)

Continuous Value Encoding (Challenge 012)

Encode continuous values where similar values produce similar vectors:

# Log-scale encoding: equal ratios = equal similarity
rate_100 = store.encode_scalar_log(100)
rate_1000 = store.encode_scalar_log(1000)
rate_10000 = store.encode_scalar_log(10000)

# 100→1000 similarity ≈ 1000→10000 similarity (both 10x)
store.similarity(rate_100, rate_1000)   # ~0.94
store.similarity(rate_1000, rate_10000) # ~0.92

# Linear encoding for positions, temperatures, etc.
temp_vec = store.encode_scalar(72.5, mode="linear")

# Circular encoding for angles, hours (wraps around)
hour_vec = store.encode_scalar(23.5, mode="circular", period=24.0)
# hour 23.5 is similar to hour 0.5 (they're close on the clock)

Why it matters: Eliminates hardcoded discretization like {rate: "high"}. A rate of 100 pps is naturally similar to 150 pps and dissimilar to 100,000 pps.

Config Drift Detection (The Coolest Thing We Built)

# Encode configs as vectors
golden = store.encoder.encode_data(golden_config)
actual = store.encoder.encode_data(server_config)

# The drift is a vector
drift = store.difference(golden, actual)

# Remove expected changes
expected = store.encoder.encode_data({"version": "2.0"})
unexpected = store.negate(drift, expected, method="orthogonalize")

# Amplify security-related drift
security = store.encoder.encode_data({"tls": {}, "auth": {}})
security_drift = store.amplify(unexpected, security, 2.0)

# Find servers with similar security drift
results = client.search_by_vector(security_drift, limit=10)

This isn't possible with traditional search. The drift, the expected changes, the amplification - they're all vectors that compose.

Markers

Special $-prefixed keys control encoding behavior:

| Marker | Purpose | Example |
| --- | --- | --- |
| $time | Temporal similarity | {"created": {"$time": 1706500000}} |
| $mode | Sequence encoding | {"events": {"$mode": "ngram", "sequence": [...]}} |
| $any | Wildcard | {"role": {"$any": True}} |
| $or | Disjunction | {"$or": [{"a": 1}, {"b": 2}]} |
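
Markers compose within a single probe. A hedged sketch (marker semantics as described above; exact scoring may vary):

results = client.search_json(probe={
    "role": {"$any": True},                       # wildcard: any role matches
    "$or": [{"team": "infra"}, {"team": "sec"}],  # disjunction: either team
    "created": {"$time": 1706500000},             # temporal similarity
})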

Sequence Modes

| Mode | Use Case | Example |
| --- | --- | --- |
| positional | Ordered lists | Event sequences |
| chained | Prefix/suffix matching | Transaction chains |
| ngram | Fuzzy substring | Text search |
| bundle | Unordered sets | Tags, categories |
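
For example, order matters for positional but not for bundle - a sketch based on the table above:

client.insert_json({"run": 1, "steps": {"$mode": "positional", "sequence": ["a", "b", "c"]}})
client.insert_json({"doc": 7, "tags": {"$mode": "bundle", "sequence": ["python", "ml"]}})

# Probing tags in a different order still matches: bundle is unordered
results = client.search_json(probe={
    "tags": {"$mode": "bundle", "sequence": ["ml", "python"]}
})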

Guards (Post-Query Filtering)

results = client.search_json(
    probe={"type": "user"},
    guard={
        "age": {"$gte": 18, "$lt": 65},
        "role": {"$in": ["admin", "mod"]},
        "bio": {"$contains": "developer"},
        "$exists": {"email": True}
    },
    negations={"status": "inactive"}
)

Marker Prefix

If your data uses $ keys, configure a different prefix:

store = CPUStore(dimensions=4096, marker_prefix="@@")
# Now use @@time, @@mode, @@any, etc.

Configuration

Backends

| Backend | Speed | Best For |
| --- | --- | --- |
| cpu (default) | 3.9K/sec (1 core), 11K/sec (10 cores) | General use |
| torchhd | 300 ops/sec | Accuracy-critical (level embeddings: 200 ≈ 201 ≠ 500) |
| gpu | 40x batch | Large batch operations (1000+ items) |

store = CPUStore(backend="torchhd")  # Best accuracy for numeric fields

For parallel encoding at scale, see 011-large-scale-stress-test.py.

Dimensions

| Use Case | Dimensions | Records/GB |
| --- | --- | --- |
| Simple documents (<20 fields) | 1024 | ~817K |
| Complex + time encoding | 4096 | ~233K |
| Very complex (100+ fields) | 8192 | ~119K |

HTTP API

./scripts/run_with_venv.sh python scripts/server/holon_server.py
# Insert
curl -X POST http://localhost:8000/api/v1/items \
  -d '{"data": "{\"name\": \"Alice\"}", "data_type": "json"}'

# Search
curl -X POST http://localhost:8000/api/v1/search \
  -d '{"probe": "{\"name\": \"Alice\"}", "top_k": 5}'

# Encode vector
curl -X POST http://localhost:8000/api/v1/vectors/encode \
  -d '{"data": "{\"topic\": \"security\"}", "data_type": "json"}'

# Prototype
curl -X POST http://localhost:8000/api/v1/vectors/prototype \
  -d '{"vectors": [[1,-1,0,...], [0,1,-1,...]], "threshold": 0.5}'

See API Reference for complete documentation.

Scale Testing (Challenge 009)

We stress-tested Holon at realistic scale (14-core, 54GB RAM machine):

| Samples | Categories | Accuracy | Encode Rate | Time | Memory |
| --- | --- | --- | --- | --- | --- |
| 1M | 100 | 94.5% | 25,581/sec | 44s | 3.9 GB |
| 1M | 1,000 | 84.5% | 29,561/sec | 68s | 3.9 GB |
| 5M | 1,000 | 84.4% | 23,322/sec | 7.5 min | 19.5 GB |

Key finding: Accuracy depends on samples per category, not dimensions.

| Factor | Impact on Accuracy |
| --- | --- |
| 10 samples/cat → 100 | +13 points |
| Signal 70% → 90% | +14 points |
| 1024D → 8192D | +6 points |

At 1000 categories, 84% accuracy = 840x random baseline. The approach: encode records → average by category → find nearest prototype. No neural networks, no GPU, no training loop.
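
A minimal sketch of that pipeline (records and labels are illustrative names):

from collections import defaultdict

# Encode records, grouped by category
by_category = defaultdict(list)
for record, label in zip(records, labels):
    by_category[label].append(store.encoder.encode_data(record))

# Average each category into a prototype
prototypes = {cat: store.prototype(vecs) for cat, vecs in by_category.items()}

# Classify by nearest prototype
def classify(record):
    v = store.encoder.encode_data(record)
    return max(prototypes, key=lambda cat: store.similarity(v, prototypes[cat]))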

See Challenge 009 Learnings for details.

Network Anomaly Detection (Challenge 010)

We built a complete streaming anomaly detection system with deterministic consensus:

HTTP Request Detection

| Metric | Value |
| --- | --- |
| F1 Score | 1.000 |
| Throughput | 8,339 req/sec |
| Latency | 0.12 ms |

Detects SQL injection, XSS, path traversal, and command injection using either:

  • Rule-based: Explicit patterns (~95% of detections)
  • Headless: Pure frequency + decay on character-class bitmasks, with no attack knowledge (sketched below)
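
A sketch of the character-class bitmask idea: map each payload character to a class bit and encode the result. The class set below is illustrative, not the exact one from Challenge 010:

import string

def char_class_bitmask(payload: str) -> int:
    mask = 0
    for ch in payload:
        if ch in string.ascii_letters:
            mask |= 1   # letters
        elif ch in string.digits:
            mask |= 2   # digits
        elif ch in "'\";-":
            mask |= 4   # quote/comment characters (SQLi-adjacent)
        elif ch in "<>/":
            mask |= 8   # markup characters (XSS-adjacent)
        else:
            mask |= 16  # other symbols
    return mask

# The bitmask becomes just another structured field to encode
vec = store.encoder.encode_data({"payload_classes": char_class_bitmask("admin'--")})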

DDoS Detection (Two-Phase)

| Attack | Detected | Classified | Throughput |
| --- | --- | --- | --- |
| SYN Flood | ✅ | ✅ | 3,193/sec |
| DNS Reflection | ✅ | ✅ | 3,962/sec |
| NTP Amplification | ✅ | ✅ | 1,752/sec |
| ICMP Flood | ✅ | ✅ | 6,258/sec |

Key insight: DDoS = variance drop + mean similarity rise (homogeneous attack traffic).
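
A sketch of that signal: over a sliding window of encoded packets, homogeneous attack traffic pushes mean pairwise similarity up and its variance down. Names and thresholds below are illustrative:

import numpy as np

def window_stats(vectors):
    sims = [store.similarity(a, b)
            for i, a in enumerate(vectors)
            for b in vectors[i + 1:]]
    return np.mean(sims), np.var(sims)

mean_sim, var_sim = window_stats(recent_packet_vectors)
if mean_sim > 0.8 and var_sim < 0.01:  # illustrative thresholds
    print("possible DDoS: traffic became homogeneous")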

Distributed Consensus

All nodes with the same global_seed generate identical vectors:

import numpy as np

# Node A (Tokyo)
vm = DeterministicVectorManager(global_seed=42)
vec_a = vm.get_vector("admin'--")

# Node B (NYC)
vm = DeterministicVectorManager(global_seed=42)
vec_b = vm.get_vector("admin'--")

assert np.array_equal(vec_a, vec_b)  # ✅ Always true

No synchronization needed - mathematical agreement enables parallel stream processing.

Accumulator Primitives

New primitives for frequency-preserving learning:

# Create accumulator (float64, no thresholding)
accum = encoder.create_accumulator()

# Stream observations
for request in stream:
    vec = encoder.encode_data(request)
    accum = encoder.accumulate(accum, vec)

# Get normalized for similarity queries
prototype = encoder.normalize_accumulator(accum)

Why it works: prototype_add() thresholds each update and discards frequency information. The accumulator preserves frequencies, so the ~99% benign traffic dominates the prototype - F1 = 1.000.

See Challenge 010 Learnings for complete details.

Packet Analysis & Structural Detection (Challenge 011)

Building on Challenge 010, we discovered the critical importance of structural encoding:

The Key Insight: Use Holon's Structural Encoding

# WRONG: Naive atom bundling (loses structure)
atoms = ["proto:tcp", "dst_port:80", "flags:PA"]
vec = sum(vm.get_vector(atom) for atom in atoms)
# Result: F1 = 0.368 ❌

# RIGHT: Structural encoding (preserves structure)
structure = {"l4": {"proto": "tcp", "dst_port": 80, "flags": "PA"}}
vec = encoder.encode_data(structure)
# Result: F1 = 1.000 ✅

Why it works: Role-filler binding (key ⊛ value) preserves that {dst_port: 80} is different from {src_port: 80} - they share "80" but are bound to different roles.
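
A toy numpy illustration of why role-filler binding separates the two: the same filler bound to different role vectors yields nearly orthogonal results (bipolar vectors and multiply-as-bind assumed):

import numpy as np

rng = np.random.default_rng(0)
dim = 4096
role_dst = rng.choice([-1, 1], dim)
role_src = rng.choice([-1, 1], dim)
port_80 = rng.choice([-1, 1], dim)

dst_bound = role_dst * port_80  # bind = element-wise multiply
src_bound = role_src * port_80

# Same filler "80", different roles: cosine similarity ~0
print(dst_bound @ src_bound / dim)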

Three-Dimensional Detection

| Dimension | Purpose | F1 Score |
| --- | --- | --- |
| Transition | Attack beginning/ending | 0.936 |
| Classification | SYN flood, DNS reflection, etc. | 0.998 |
| Binary | Is this an attack? | 1.000 |

Knowledge Composition

# Three knowledge sources work together
prior_sim = 0.96    # Frozen baseline (survives attacks)
recent_sim = 0.84   # Adaptive with decay (tracks current traffic)
divergence = 0.30   # Prior/recent divergence (regime change signal)

| Phase | Prior Similarity | Divergence |
| --- | --- | --- |
| Normal | 0.96 | 0.99 (stable) |
| Attack | 0.16 | 0.30 (regime change!) |
| Recovery | 0.75 | 0.93 (returning) |
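
A sketch of how the three signals could be maintained, assuming accumulators are plain float64 arrays (as in the accumulator primitives above) so scalar decay works; the decay constant is illustrative:

# Frozen prior: built once from clean traffic, never updated
prior = encoder.normalize_accumulator(baseline_accumulator)

# Adaptive recent: decay old evidence, accumulate new
recent_accumulator = encoder.create_accumulator()
for packet in window:
    recent_accumulator = recent_accumulator * 0.99  # decay (assumed array semantics)
    recent_accumulator = encoder.accumulate(recent_accumulator, encoder.encode_data(packet))
recent = encoder.normalize_accumulator(recent_accumulator)

v = encoder.encode_data(current_packet)
prior_sim = store.similarity(v, prior)
recent_sim = store.similarity(v, recent)
divergence = store.similarity(prior, recent)  # drops when the regime changes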

Cross-Pollination: Best of Both Batches

The integrated detector combines:

| From Batch 010 | From Batch 011 |
| --- | --- |
| Port bucketing | Structural encoding |
| IP prefix levels | Prior/recent separation |
| Payload bitmask | State machine transitions |
| Rule-based detection | Sample-based signatures |
| Variance-based DDoS | Culprit identification |

Result: F1 = 1.000, Classification = 100%

Mitigation Synthesis: Closing the Loop

Using vector operations to derive actionable firewall rules:

# What makes attacks different from normal?
attack_delta = store.difference(attack_signature, baseline)

# Which features contribute most? (features maps candidate feature -> value
# pairs observed in the attack traffic)
rules = []
for feature, value in features.items():
    importance = store.similarity(store.encoder.encode_data({feature: value}), attack_delta)
    if importance > threshold:
        rules.append(f"DROP if {feature}={value}")

Generated Rules (F1 = 1.000):

# DNS reflection
iptables -A INPUT -p udp --sport 53 --dport 1024:65535 -j DROP

# SYN flood
iptables -A INPUT -p tcp --tcp-flags ALL SYN -m limit --limit 10/s -j ACCEPT

The complete pipeline: Learn → Detect → Identify → Mitigate

# Run the wrap-up demo
./scripts/run_with_venv.sh python scripts/challenges/011-batch/DEMO-batch-011-wrapup.py

See Challenge 011 Learnings for complete details.

Honest Assessment

What Holon does well:

  • Fuzzy similarity search over structured data
  • Prototype learning and classification (94.5% at 100 categories)
  • "Find similar to X" and "X but not Y" queries
  • Deep nesting (6+ levels), high field counts (100+ fields)
  • Composable vector operations that work over HTTP
  • Finding needles in haystacks (rank 1 among 500+ similar items)
  • Scales to 500k records at 11k encodes/sec
  • Streaming anomaly detection (F1=1.000 at 8k req/sec)
  • DDoS detection with attack classification (100% accuracy)
  • Distributed consensus without synchronization
  • Three-dimensional detection: transition (0.936), classification (0.998), binary (1.000)
  • Knowledge composition: prior/recent/divergence tracks regime changes
  • Mitigation synthesis: vector-derived firewall rules from attack signatures
  • Zero-hardcode detection: 100% recall, 4% FP without domain knowledge (Challenge 012)
  • Continuous value encoding: log-scale rates, circular angles with smooth similarity

What Holon cannot do:

  • Constraint satisfaction (Sudoku, SAT, graph coloring)
  • NP-hard optimization
  • Exact matching where "close enough" isn't acceptable

Brutal honesty:

  • All benchmarks use synthetic data - Real-world accuracy is unproven
  • Never compared to baselines - No TF-IDF, neural embedding, or traditional ML comparisons
  • 81.7% at 1000 categories - Still on planted signal, real data may be harder
  • HNSW index build is the bottleneck - 132s for 80k vectors (~30 min for 1M)
  • 17k/sec encoding required 10 cores - Single-threaded is ~4k/sec

Challenges & Examples

| Batch | Description | Status |
| --- | --- | --- |
| 012 | Zero-hardcode anomaly detection | ✅ 100% recall, 4% FP |
| 011 | Structural detection & cross-pollination | ✅ F1=1.000 |
| 010 | Network anomaly & DDoS detection | ✅ 100% detection |
| 009 | Deterministic training at scale (500k records) | ✅ 94.5% accuracy |
| 008 | Full primitive showcase (7 solutions) | ✅ 92-100% |
| 007 | Multi-domain demos | ✅ 7/7 |
| 006 | LLM memory augmentation | ✅ 82% token savings |
| 004 | Sudoku (VSA limitation) | ❌ Cannot solve |

# Run the primitives demo
./scripts/run_with_venv.sh python examples/primitives_demo.py

Documentation

Key Concepts

  • Binding: key × value - element-wise multiplication combines related vectors
  • Bundling: a + b + c - vector superposition aggregates multiple vectors
  • Similarity: Cosine similarity enables partial/substructure matching
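
The three concepts in a few lines of numpy, assuming bipolar (±1) vectors:

import numpy as np

rng = np.random.default_rng(1)
key, value, other = (rng.choice([-1, 1], 4096) for _ in range(3))

bound = key * value                     # binding: element-wise multiplication
bundled = np.sign(key + value + other)  # bundling: superposition, re-binarized

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cosine(bundled, key)  # ~0.5: components remain detectable in the bundle
cosine(bound, key)    # ~0.0: binding yields a vector dissimilar to its inputs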

Why "Holon"?

Named after Arthur Koestler's concept - a holon is a self-contained whole that is simultaneously part of a larger whole. Each data item in Holon is independent yet entangled through vector relationships. A document is complete on its own, but its vector connects it to every similar document in the space.

Vector Operations

From mystical runes to mathematical vectors. The power endures.


MIT Licensed
