# Titans-MIRAS Hybrid Memory System
A tutorial covering environment setup, the memory architecture, the hybrid engine, and a demo chat app.

## Contents
1. **Environment Setup** — GPU and 4-bit quantization verification
2. **Memory Architecture** — Neural Memory module with Surprise-driven learning
3. **Hybrid Engine** — Integration with frozen LLM (GPT-2)
4. **Demo Chat App** — Semantic memory recall with confidence scoring

---
# Part 1: Environment Setup
Verify GPU and 4-bit quantization readiness for the Titans-MIRAS hybrid memory demo.
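
Before running the cells below, it can help to confirm that the packages they import are installed. This is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the package list is inferred from the imports used in this tutorial.

```python
import importlib.util

def check_packages(names):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Packages imported by the cells in this tutorial
required = ["torch", "transformers", "bitsandbytes", "sentence_transformers"]
for name, found in check_packages(required).items():
    print(f"{name:24s} {'ok' if found else 'MISSING'}")
```

`find_spec` only locates a package without importing it, so the check is fast and does not initialize CUDA.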

In [15]:
# Check Python version and system info
import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())

try:
    import torch
    print("PyTorch available:", True)
    print("Torch version:", torch.__version__)
except Exception as e:
    print("PyTorch import failed:", e)

Python: 3.12.12 | packaged by conda-forge | (main, Oct 22 2025, 23:16:53) [GCC 14.3.0]
Platform: Linux-6.14.0-1013-nvidia-aarch64-with-glibc2.39
PyTorch available: True
Torch version: 2.10.0+cu130


In [16]:
# Verify GPU availability and details
import torch
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)
if cuda_ok:
    device = torch.device("cuda")
    idx = torch.cuda.current_device()
    print("GPU name:", torch.cuda.get_device_name(idx))
    cap = torch.cuda.get_device_capability(idx)
    print("Compute capability:", cap)
    total_mem = torch.cuda.get_device_properties(idx).total_memory
    print("Total GPU memory (GB):", round(total_mem/1024**3, 2))
    x = torch.randn(1024, 1024, device=device)
    y = torch.matmul(x, x)
    print("GPU matmul ok:", y.shape)
else:
    print("Running on CPU; 4-bit may be limited.")

CUDA available: True
GPU name: NVIDIA GB10
Compute capability: (12, 1)
Total GPU memory (GB): 119.7
GPU matmul ok: torch.Size([1024, 1024])


In [17]:
# Test 4-bit quantization support (bitsandbytes)
import warnings
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import torch

warnings.filterwarnings("ignore")
try:
    model_id = "gpt2"  # small proxy for environment validation
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                                      bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config,
    )
    print("Loaded model with 4-bit quantization:", model_id)
    inputs = tok("Hello Titans-MIRAS!", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    print("Generation:", tok.decode(out[0]))
except Exception as e:
    print("4-bit test failed (this can be expected on CPU-only environments):", e)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Loaded model with 4-bit quantization: gpt2
Generation: Hello Titans-MIRAS!

The Titans are back!



---
# Part 2: Memory Architecture (MAC)
A Neural Memory module with Surprise-driven `memorize()` update, inspired by Titans + MIRAS.

## Theory: Memory as Context (MAC)
In MAC, the memory emits a small vector that is prepended (or concatenated) to the LLM input as a soft prompt. The module learns to predict task-relevant features from recent hidden states, and surprise is the prediction error (MSE). High surprise marks an input the memory should learn strongly from; low surprise marks one it already predicts well, so its influence on the memory is small.

Key pieces:
- Input: recent hidden/context vector x
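- Output: soft prompt vector p = f(x), where f is the memory network with weights θ
- Surprise: L(x, y) = ||f(x) − y||², where y is the target vector f should predict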
- Online update: θ ← θ − η ∇θ L
- Recall: use f(x) as a soft prompt for the next LLM step
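
The update rule above can be traced by hand on a tiny linear memory. This is a framework-free sketch with plain SGD, not the `NeuralMemory` class defined next (which uses an MLP and AdamW); the names `predict`, `memorize_step`, and `w_true`, and the learning rate, are illustrative assumptions.

```python
import random

def predict(w, x):
    """Linear memory f(x) = w . x (scalar output for simplicity)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def memorize_step(w, x, y, lr=0.1):
    """One online update: return (new weights, surprise before the update)."""
    err = predict(w, x) - y
    surprise = err * err                      # L(x, y) = ||f(x) - y||^2
    grad = [2.0 * err * xi for xi in x]       # gradient of L with respect to w
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]  # theta <- theta - eta * grad
    return new_w, surprise

random.seed(0)
w_true = [0.5, -1.0, 2.0]   # mapping the memory has to discover
w = [0.0, 0.0, 0.0]
surprises = []
for _ in range(200):
    x = [random.uniform(-1.0, 1.0) for _ in range(3)]
    y = predict(w_true, x)
    w, s = memorize_step(w, x, y)
    surprises.append(s)

early = sum(surprises[:20]) / 20
late = sum(surprises[-20:]) / 20
print(f"mean surprise, first 20 steps: {early:.6f}")
print(f"mean surprise, last 20 steps:  {late:.6f}")
```

With plain SGD the gradient is proportional to the error, so a surprising input moves the weights further than a familiar one. Adaptive optimizers such as AdamW rescale the step, so that proportionality holds only loosely there.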

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMemory(nn.Module):
    """Neural Memory module for Titans-MIRAS hybrid system.
    
    Uses float32 for stable training, automatically casts float16 inputs.
    """
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, lr: float = 1e-3, device_str: str = None):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.device = torch.device(device_str) if device_str else torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Always use float32 for stable training
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, output_dim),
        )
        self.to(self.device, torch.float32)
        self.optim = torch.optim.AdamW(self.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()

    @torch.no_grad()
    def recall(self, x: torch.Tensor) -> torch.Tensor:
        """Return the memory's prediction for x without tracking gradients."""
        x = x.to(self.device, torch.float32)
        return self.net(x)

    def memorize(self, x: torch.Tensor, y: torch.Tensor) -> float:
        """Take one gradient step toward y; return the surprise (MSE) measured before the update."""
        # Cast float16 inputs from the LLM to float32 for stable training
        x = x.to(self.device, torch.float32)
        y = y.to(self.device, torch.float32)
        pred = self.net(x)
        loss = self.loss_fn(pred, y)  # surprise
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.item()

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print("NeuralMemory device:", device)

NeuralMemory device: cuda


In [19]:
# Synthetic test: learn a simple linear mapping online
import math, random
torch.manual_seed(42)
random.seed(42)

in_dim, hid_dim, out_dim = 128, 64, 128
mem = NeuralMemory(in_dim, hid_dim, out_dim, lr=1e-2, device_str=device)

# ground truth mapping (unknown to memory)
W_true = torch.randn(in_dim, out_dim, device=device) * 0.5

def sample_xy(batch=32):
    x = torch.randn(batch, in_dim, device=device)
    y = x @ W_true
    return x, y

steps = 200
log_every = 40
losses = []
for t in range(1, steps + 1):
    x, y = sample_xy(batch=64)
    loss = mem.memorize(x, y)
    losses.append(loss)
    if t % log_every == 0:
        print(f"step {t:3d}  loss {loss:.6f}")

# verify recall quality on fresh batch
x_test, y_test = sample_xy(batch=16)
p = mem.recall(x_test)
mse = F.mse_loss(p, y_test).item()
print("test mse:", round(mse, 6))

step  40  loss 18.093212
step  80  loss 11.844204
step 120  loss 9.229830
step 160  loss 7.861908
step 200  loss 6.429991
test mse: 5.756186


---
# Part 3: Hybrid Engine Integration
Wire up a frozen LLM (GPT-2) with the `NeuralMemory` module. The engine implements the read→surprise→learn→recall loop.

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings("ignore")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


In [21]:
# Load frozen LLM (GPT-2 for speed) and tokenizer
model_id = "gpt2"  # or "gpt2-medium" for better quality
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

llm = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16 if device=="cuda" else torch.float32)
llm.to(device)
llm.eval()  # frozen

hidden_dim = llm.config.n_embd
print(f"Loaded {model_id}, hidden_dim={hidden_dim}")

Loaded gpt2, hidden_dim=768


In [22]:
# Initialize memory for hybrid engine (maps hidden_dim -> hidden_dim soft prompt)
hybrid_memory = NeuralMemory(input_dim=hidden_dim, hidden_dim=256, output_dim=hidden_dim, lr=5e-4, device_str=device)
print("Hybrid memory initialized")

Hybrid memory initialized


## The Hybrid Loop
1. **Read**: Tokenize input text and run the LLM to get hidden states
2. **Surprise**: Compute prediction error (MSE) between memory's prediction and the actual hidden state
3. **Learn**: Update memory weights via backprop with the Surprise loss
4. **Recall**: Memory generates a soft prompt vector to condition the next step
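
Step 2 needs a target for the memory to predict. The demo cell below uses the current hidden state as its own target; a next-state target instead pairs each state with the one that follows it. Here is a framework-free sketch of that pairing, with plain lists standing in for hidden states (the `NextStatePairs` helper and the toy `stream` are hypothetical).

```python
class NextStatePairs:
    """Turn a stream of state vectors into (previous, current) training pairs."""

    def __init__(self):
        self.prev = None

    def push(self, current):
        """Return (previous, current), or None on the first call."""
        pair = None if self.prev is None else (self.prev, current)
        self.prev = current
        return pair

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

stream = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
pairs = NextStatePairs()
for h in stream:
    pair = pairs.push(h)
    if pair is None:
        print("first step: nothing to predict yet")
        continue
    prev_h, cur_h = pair
    # An identity predictor f(x) = x is perfect against itself,
    # but not against the state that actually comes next.
    print(f"identity vs. self: {mse(cur_h, cur_h):.1f}   identity vs. next: {mse(prev_h, cur_h):.1f}")
```

With a self-target, the identity mapping already drives the surprise to zero, so the loss says little about the content. With a next-state target, the memory has to learn how states follow one another.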

In [23]:
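def run_step_with_memory(text: str, use_memory: bool = True, verbose: bool = True):
    """
    Run one read -> surprise -> learn -> recall step.

    1. Tokenize the input
    2. Run the frozen LLM and take the final-layer hidden state of the last token
    3. If use_memory: memory.memorize(last_hidden, last_hidden) updates the memory
       and returns the surprise (MSE) loss
    4. memory.recall(last_hidden) produces a soft prompt vector
    5. Return (last_hidden, surprise_loss, soft_prompt); no text is generated,
       and the soft prompt is not fed back into the LLM here
    """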
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = llm(**inputs, output_hidden_states=True)
    
    # Extract last hidden state from final layer
    hidden_states = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_dim)
    last_hidden = hidden_states[:, -1, :]  # shape: (batch, hidden_dim)
    
    surprise_loss = 0.0
    soft_prompt = None
    
    if use_memory:
        # For simplicity, the target equals the input, so the memory learns an
        # identity (autoencoder-style) mapping of the hidden state.
        # In a real system, you'd predict the *next* hidden state from the current context.
        surprise_loss = hybrid_memory.memorize(last_hidden, last_hidden)
        soft_prompt = hybrid_memory.recall(last_hidden)
    
    if verbose:
        print(f"Text: {text[:60]}...")
        print(f"Surprise loss: {surprise_loss:.6f}")
        if soft_prompt is not None:
            print(f"Soft prompt norm: {soft_prompt.norm().item():.4f}")
    
    return last_hidden, surprise_loss, soft_prompt

# Test the step
text = "The Titans architecture enables long-term memory by"
h, loss, sp = run_step_with_memory(text, use_memory=True, verbose=True)

Text: The Titans architecture enables long-term memory by...
Surprise loss: 95.512871
Soft prompt norm: 61.4240


In [24]:
# Multi-step adaptation demo: feed varied sentences and compare surprise across steps
# (the repeated sentence should score lower than on its first pass)
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Neural networks learn patterns from data.",
    "Titans use a surprise metric to decide what to remember.",
    "Memory modules can adapt online during inference.",
    "The quick brown fox jumps over the lazy dog.",  # repeat
]

print("=== Multi-step Memory Adaptation ===")
losses = []
for i, sent in enumerate(sentences, 1):
    _, loss, _ = run_step_with_memory(sent, use_memory=True, verbose=False)
    losses.append(loss)
    print(f"Step {i}: loss={loss:.6f}  |  {sent[:50]}")

print(f"\nFirst loss: {losses[0]:.6f}, Last loss: {losses[-1]:.6f}")

=== Multi-step Memory Adaptation ===
Step 1: loss=78.840233  |  The quick brown fox jumps over the lazy dog.
Step 2: loss=137.497589  |  Neural networks learn patterns from data.
Step 3: loss=122.783745  |  Titans use a surprise metric to decide what to rem
Step 4: loss=125.055893  |  Memory modules can adapt online during inference.
Step 5: loss=72.701721  |  The quick brown fox jumps over the lazy dog.

First loss: 78.840233, Last loss: 72.701721


---
# Part 4: Demo Chat App
Memory recall with semantic embeddings and confidence scoring.

In [25]:
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings("ignore")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

# Load sentence embedding model for semantic similarity
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
print("Sentence embedder loaded")

Device: cuda
Sentence embedder loaded


In [26]:
# Episodic memory over sentence embeddings, with confidence-scored recall
class SemanticMemory(nn.Module):
    def __init__(self, embedder, device_str: str = None):
        super().__init__()
        self.embedder = embedder
        self.device = torch.device(device_str) if device_str else torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Episodic memory: stores (embedding, text) pairs
        self.memory_embeddings = []  # Semantic embeddings
        self.memory_texts = []       # Original text

    def memorize(self, text: str):
        """Store a fact in memory."""
        embedding = self.embedder.encode(text, convert_to_tensor=True, device=self.device)
        embedding = nn.functional.normalize(embedding, dim=-1)
        self.memory_embeddings.append(embedding)
        self.memory_texts.append(text)
        return len(self.memory_embeddings)

    def recall(self, query: str, top_k: int = 3):
        """Find most similar memories to the query."""
        if not self.memory_embeddings:
            return [(0.0, "No memories stored")]
        
        query_emb = self.embedder.encode(query, convert_to_tensor=True, device=self.device)
        query_emb = nn.functional.normalize(query_emb, dim=-1)
        
        # Compute similarities to all stored memories
        similarities = []
        for i, mem_emb in enumerate(self.memory_embeddings):
            sim = torch.dot(query_emb, mem_emb).item()
            similarities.append((sim, self.memory_texts[i]))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[0], reverse=True)
        return similarities[:top_k]
    
    def recall_with_confidence(self, query: str, gap_threshold: float = 0.1, min_similarity: float = 0.65):
        """
        Production-ready recall using BOTH:
        1. Relative gap detection (is there a clear winner?)
        2. Minimum similarity threshold (is the match actually relevant?)
        
        Returns (confidence, best_match, all_results)
        
        Confidence levels:
        - "high": Top match has high similarity AND is clearly better than 2nd
        - "low": Top match exists but either too low similarity OR no clear gap
        - "none": No memories stored
        """
        # Check the store directly rather than matching recall()'s sentinel text
        if not self.memory_texts:
            return "none", None, self.recall(query)
        
        results = self.recall(query, top_k=len(self.memory_texts))
        
        top_sim, top_text = results[0]
        
        if len(results) == 1:
            confidence = "high" if top_sim > min_similarity else "low"
            return confidence, top_text, results
        
        second_sim = results[1][0]
        gap = top_sim - second_sim
        
        # HIGH confidence requires BOTH:
        # 1. Clear winner (large gap from 2nd place)
        # 2. Strong absolute match (above minimum similarity)
        if gap > gap_threshold and top_sim > min_similarity:
            return "high", top_text, results
        else:
            return "low", top_text, results

semantic_memory = SemanticMemory(embedder, device_str=device)
print("Semantic memory initialized")

Semantic memory initialized


## Phase 1: Memorize Facts
Feed the memory distinct facts. The memory stores semantic embeddings for later retrieval.
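
Retrieval over the stored embeddings reduces to a dot product once every vector has unit length, since the cosine similarity of two unit vectors equals their dot product. Below is a small sketch with hand-written 3-dimensional vectors standing in for real sentence embeddings (the vectors and the names `normalize` and `top_k` are illustrative assumptions).

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, store, k=2):
    """Return the k (similarity, text) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for vec, text in store
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy "embeddings": one axis per topic
store = [
    ([1.0, 0.0, 0.0], "fact about codes"),
    ([0.0, 1.0, 0.0], "fact about colors"),
    ([0.0, 0.0, 1.0], "fact about meetings"),
]
for sim, text in top_k([0.9, 0.1, 0.0], store):
    print(f"[{sim:.4f}] {text}")
```

Normalizing once at storage time, as `SemanticMemory.memorize` does, avoids repeating that work on every query.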

In [27]:
facts = [
    "The secret code is X-8-DELTA-9.",
    "Alice's favorite color is turquoise.",
    "The meeting is scheduled for 3pm on Friday.",
]

print("=== Memorizing Facts ===")
for i, fact in enumerate(facts, 1):
    count = semantic_memory.memorize(fact)
    print(f"Fact {i}: stored  |  {fact}")

=== Memorizing Facts ===
Fact 1: stored  |  The secret code is X-8-DELTA-9.
Fact 2: stored  |  Alice's favorite color is turquoise.
Fact 3: stored  |  The meeting is scheduled for 3pm on Friday.


## Phase 2: Query and Recall from Memory
Query the memory about the stored facts; each result is labeled with a confidence level.

In [28]:
queries = [
    "What is the secret code?",
    "What is Alice's favorite color?",
    "What is the address of the meeting?",  # No address was memorized!
    "When is the meeting?",
]

print("=== Querying with Production-Ready Memory Recall ===")
print("(Uses relative gap detection + minimum similarity threshold)\n")

for q in queries:
    confidence, best_match, all_results = semantic_memory.recall_with_confidence(q, gap_threshold=0.1)
    
    # Show status based on confidence
    if confidence == "high":
        status = "✓ HIGH CONFIDENCE"
    elif confidence == "low":
        status = "⚠ LOW CONFIDENCE (no clear match)"
    else:
        status = "✗ NO MEMORIES"
    
    print(f"Q: {q}")
    print(f"   {status}")
    
    # Show top results with similarities
    for i, (sim, text) in enumerate(all_results[:3]):
        marker = "→" if i == 0 else " "
        print(f"   {marker} [{sim:.4f}] {text}")
    
    # Show gap analysis
    if len(all_results) >= 2:
        gap = all_results[0][0] - all_results[1][0]
        print(f"   Gap (1st - 2nd): {gap:.4f}")
    print()

=== Querying with Production-Ready Memory Recall ===
(Uses relative gap detection + minimum similarity threshold)

Q: What is the secret code?
   ✓ HIGH CONFIDENCE
   → [0.7169] The secret code is X-8-DELTA-9.
     [0.0857] The meeting is scheduled for 3pm on Friday.
     [0.0379] Alice's favorite color is turquoise.
   Gap (1st - 2nd): 0.6312

Q: What is Alice's favorite color?
   ✓ HIGH CONFIDENCE
   → [0.8441] Alice's favorite color is turquoise.
     [0.1104] The secret code is X-8-DELTA-9.
     [0.0404] The meeting is scheduled for 3pm on Friday.
   Gap (1st - 2nd): 0.7337

Q: What is the address of the meeting?
   ⚠ LOW CONFIDENCE (no clear match)
   → [0.6056] The meeting is scheduled for 3pm on Friday.
     [0.1861] The secret code is X-8-DELTA-9.
     [0.0095] Alice's favorite color is turquoise.
   Gap (1st - 2nd): 0.4195

Q: When is the meeting?
   ✓ HIGH CONFIDENCE
   → [0.7688] The meeting is scheduled for 3pm on Friday.
     [0.1464] The secret code is X-8-DELTA-9.
     [

---
# Production Threshold Strategies

This demo uses **Gap Detection + Minimum Similarity**, the "Gap + Min Sim" row below, which requires both a relative gap and a minimum similarity:

| Strategy | How it Works | Pros | Cons |
|----------|-------------|------|------|
| **Fixed Threshold** | `sim > 0.7` | Simple | Fragile, domain-specific |
| **Relative Gap** | `top - 2nd > 0.1` | Adaptive | Needs 2+ memories |
| **Gap + Min Sim** | Both conditions | Robust | Two parameters |
| **Softmax Confidence** | `softmax(sims)[0] > 0.7` | Probabilistic | Temperature tuning |
| **Reranker** | Cross-encoder rescores top-k | Most accurate | Slower, extra model |
| **LLM Decides** | Pass top-k to LLM | Most flexible | Higher latency/cost |
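
The "Temperature tuning" caveat in the softmax row can be made concrete. The sketch below applies a temperature-scaled softmax to the three similarities printed for the "secret code" query; the function name `softmax_confidence` and the two temperature values are assumptions chosen for illustration.

```python
import math

def softmax_confidence(sims, temperature=0.1):
    """Probability mass a temperature-scaled softmax assigns to the best match."""
    top = max(sims)
    # Subtract the max before exponentiating for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    return max(weights) / sum(weights)

sims = [0.7169, 0.0857, 0.0379]  # similarities from the "secret code" query
for t in (1.0, 0.1):
    print(f"temperature={t}: confidence={softmax_confidence(sims, t):.3f}")
```

Cosine similarities live in a narrow range, so at temperature 1.0 even a clear winner falls below a 0.7 cutoff. A lower temperature sharpens the distribution, which is why this strategy needs tuning per embedding model.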

## Key Insight
The "address" query has a **large gap** (0.42 > 0.1) but **low absolute similarity** (0.61 < 0.65) because the matching fact talks about *time*, not *address*. High confidence requires both conditions, so the query is reported as low confidence.
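
The decision rule can be isolated from the embedding model and checked against the similarities printed above. This is a sketch that mirrors the logic of `recall_with_confidence`; the standalone function `classify` is a hypothetical helper, not part of the notebook's classes.

```python
def classify(sims, gap_threshold=0.1, min_similarity=0.65):
    """Return 'high', 'low', or 'none' for similarities sorted in descending order."""
    if not sims:
        return "none"
    top = sims[0]
    # With a single memory there is no runner-up, so only the absolute test applies
    gap_ok = len(sims) == 1 or (top - sims[1]) > gap_threshold
    return "high" if gap_ok and top > min_similarity else "low"

print(classify([0.7169, 0.0857, 0.0379]))  # secret code query  -> high
print(classify([0.6056, 0.1861, 0.0095]))  # address query      -> low
print(classify([]))                        # empty memory       -> none
```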

## Next Steps
- Experiment with larger models (Mistral-7B with 4-bit quantization)
- Implement soft prompt injection into the LLM's embedding layer
- Add a cross-encoder reranker for higher accuracy
- Save/load memory checkpoints for persistent long-term memory
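
For the checkpoint item, one simple option for an episodic store like `SemanticMemory` is to persist only the stored texts and re-embed them on load. This avoids serializing tensors at the cost of recomputing embeddings. The sketch below works under that assumption; `save_texts`, `load_texts`, and the file name are hypothetical, and a trained module such as `NeuralMemory` would need its weights saved separately.

```python
import json
import os
import tempfile

def save_texts(texts, path):
    """Write the stored texts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"texts": list(texts)}, f, ensure_ascii=False, indent=2)

def load_texts(path, memorize):
    """Read texts back and pass each one to `memorize` (e.g. SemanticMemory.memorize)."""
    with open(path, encoding="utf-8") as f:
        texts = json.load(f)["texts"]
    for text in texts:
        memorize(text)
    return len(texts)

# Round trip, with a plain list standing in for the memory
path = os.path.join(tempfile.gettempdir(), "semantic_memory_demo.json")
save_texts(["The secret code is X-8-DELTA-9."], path)
restored = []
count = load_texts(path, restored.append)
print(count, restored)
os.remove(path)
```

Storing text rather than vectors also keeps the checkpoint valid if the embedding model is later swapped, since every entry is re-encoded on load.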