Every common memory approach solves part of the problem.
Buffer, summary, vector, graph — each is useful, but incomplete alone.
Real memory is not a tool. It’s a stack.
| Layer | Role | Speed | Purpose |
|---|---|---|---|
| ⚡ Cache | Reflex | ~16ms | Instant responses |
| 🧠 Vector DB | Episodes | Medium | Semantic recall |
| 🕸️ Graph DB | Reasoning | Slower | Multi-hop logic |
| 🔄 Invalidation | Truth | — | Correctness over time (critical) |
```
User Query
   ↓
⚡ Semantic Cache (Redis)
   ↓ (miss)
🧠 Vector Search (Qdrant)
   ↓
🕸️ Graph Traversal (FalkorDB)
   ↓
Response + Cache Write
```
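Reading that flow top to bottom as code, a minimal sketch of the fall-through looks like this. Every helper name here is a placeholder; the layer sections below make each one concrete.

```python
def answer(query: str, user_id: str) -> str:
    cached = cache_lookup(query)            # ⚡ reflex: semantic cache
    if cached is not None:
        return cached                       # hit: nothing recomputed

    vec = embed(query)                      # placeholder embedder
    episodes = recall(vec, user_id)         # 🧠 episodes: user-scoped recall
    context = graph_expand(episodes)        # 🕸️ placeholder multi-hop expansion

    response = generate(query, context)     # placeholder LLM call
    cache_write(query, response)            # warm the cache for next time
    return response
```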
- Semantic caching (NOT string matching)
- ~16ms latency on RTX 3050
- Stores query + result + metadata (sketch below)
The fastest answer is the one you don’t compute twice.
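A minimal sketch of meaning-level caching with redis-py and sentence-transformers. The embedding model, key scheme, and 0.92 threshold are assumptions to tune, and the linear scan is demo-only; production would use a Redis vector index.

```python
import hashlib

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

r = redis.Redis()                                # local Redis, default port
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder (384-dim)
THRESHOLD = 0.92                                 # assumed similarity cutoff

def cache_write(query: str, response: str) -> None:
    vec = model.encode(query)
    key = "cache:" + hashlib.sha1(query.encode()).hexdigest()
    r.hset(key, mapping={"vec": vec.tobytes(), "response": response})

def cache_lookup(query: str) -> str | None:
    q = model.encode(query)
    q = q / np.linalg.norm(q)
    for key in r.scan_iter(match="cache:*"):     # linear scan: demo only
        v = np.frombuffer(r.hget(key, "vec"), dtype=np.float32)
        v = v / np.linalg.norm(v)
        if float(q @ v) >= THRESHOLD:            # match on meaning, not string
            return r.hget(key, "response").decode()
    return None
```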
- Stores embeddings + summaries
- Handles “what happened before”
- User-scoped filtering (security-first)
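A minimal storage sketch with qdrant-client. The collection name, vector size, and payload fields are assumptions, not a fixed schema.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

# Run once: one collection for episodic memories (384-dim vectors assumed).
client.create_collection(
    collection_name="episodes",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def remember(point_id: int, vec: list[float], summary: str, user_id: str) -> None:
    # Each episode carries its embedding, a summary for "what happened
    # before", and the owning user for scoped retrieval (see security below).
    client.upsert(
        collection_name="episodes",
        points=[PointStruct(id=point_id, vector=vec,
                            payload={"summary": summary, "user_id": user_id})],
    )
```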
- Nodes = memories
- Edges = relationships
- Enables multi-hop reasoning
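Multi-hop reasoning maps naturally onto a variable-length Cypher traversal in FalkorDB. A sketch; the graph name, node labels, and edge types here are assumptions.

```python
from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
g = db.select_graph("memories")   # assumed graph name

# Walk RELATES_TO edges 1..3 hops out from a seed memory and pull back
# the summaries of everything connected to it.
result = g.query(
    """
    MATCH (m:Memory {id: $id})-[:RELATES_TO*1..3]->(related:Memory)
    RETURN related.summary
    """,
    {"id": "mem-42"},             # hypothetical seed node
)
for row in result.result_set:
    print(row[0])
```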
- TTL = time-based expiry ❌
- Invalidation = correctness ✅
✔ Mark, don’t delete
✔ Preserve history
✔ Enable audit & rollback
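A sketch of mark-don't-delete on top of the Qdrant collection above. The payload fields (valid, invalidated_at, reason) are assumptions.

```python
from datetime import datetime, timezone
from qdrant_client.models import FieldCondition, Filter, MatchValue

def invalidate(point_id: int, reason: str) -> None:
    # Mark, don't delete: the point stays in the collection with an audit
    # trail, so history is preserved and the decision can be rolled back.
    client.set_payload(
        collection_name="episodes",
        payload={
            "valid": False,
            "invalidated_at": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
        },
        points=[point_id],
    )

# Retrieval then excludes stale facts explicitly, instead of hoping a TTL fires:
exclude_stale = Filter(must_not=[
    FieldCondition(key="valid", match=MatchValue(value=False)),
])
```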
| Mode | Description |
|---|---|
| ⚡ Direct | Cache hit |
| 🔍 Search | Vector retrieval |
| 🧠 Reason | Graph traversal |
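A toy dispatcher for the three modes might look like this. The keyword heuristic is a deliberate oversimplification (a real router could use a small classifier); cache_lookup comes from the cache sketch above.

```python
from enum import Enum

class Mode(Enum):
    DIRECT = "cache hit"
    SEARCH = "vector retrieval"
    REASON = "graph traversal"

def choose_mode(query: str) -> Mode:
    if cache_lookup(query) is not None:     # ⚡ answered before, at meaning level
        return Mode.DIRECT
    multi_hop_cues = ("why", "connected", "relate", "chain")  # toy heuristic
    if any(cue in query.lower() for cue in multi_hop_cues):
        return Mode.REASON                  # 🧠 escalate to the graph layer
    return Mode.SEARCH                      # 🔍 default: one vector lookup
```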
- User-based filtering at retrieval
- No cross-user leakage
- Permission handled early (not after)
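Scoping happens inside the Qdrant query itself, not as a post-filter. A sketch reusing the episodes collection from above (names are assumptions):

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

def recall(vec: list[float], user_id: str, k: int = 5):
    # The user filter is part of the query, so other users' memories are
    # excluded before scoring: no cross-user leakage, no late permission check.
    return client.search(
        collection_name="episodes",
        query_vector=vec,
        query_filter=Filter(must=[
            FieldCondition(key="user_id", match=MatchValue(value=user_id)),
        ]),
        limit=k,
    )
```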
- Model: Granite 3.3
- GPU: RTX 3050
- Cache Latency: ~16ms
- Speedup: ~50x vs full pipeline
- Cache at meaning level, not strings
- Memory is layered, not singular
- Invalidation is mandatory
- Routing is leverage: serve each query from the cheapest layer that can answer it
- Continuity is baseline UX
- Zep Conversational Memory
- Dual cache (global + user)
- Smart routing layer
- Event-driven invalidation
- Graph-aware reranking
Enterprise AI isn’t a chatbot.
It’s memory + routing + reasoning + correctness — working together.