# 04 — Demo Chat App
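This notebook demonstrates the full Titans-MIRAS hybrid memory system: we feed the memory three distinct facts, clear the LLM context window, and then query it. The neural memory (trained purely on surprise) should retrieve the memorized facts without any of them being present in the LLM's context window. In the cells below, `NeuralMemory` is implemented as a store of sentence embeddings that is queried by cosine similarity.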

In [None]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings("ignore")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

# Load sentence embedding model for semantic similarity
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
print("Sentence embedder loaded")

In [None]:
# NeuralMemory with semantic embeddings and production-ready recall
class NeuralMemory(nn.Module):
    def __init__(self, embedder, device_str: str = None):
        super().__init__()
        self.embedder = embedder
        self.device = torch.device(device_str) if device_str else torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Episodic memory: stores (embedding, text) pairs
        self.memory_embeddings = []  # Semantic embeddings
        self.memory_texts = []       # Original text

    def memorize(self, text: str):
        """Store a fact in memory."""
        embedding = self.embedder.encode(text, convert_to_tensor=True, device=self.device)
        embedding = nn.functional.normalize(embedding, dim=-1)
        self.memory_embeddings.append(embedding)
        self.memory_texts.append(text)
        return len(self.memory_embeddings)

    def recall(self, query: str, top_k: int = 3):
        """Find most similar memories to the query."""
        if not self.memory_embeddings:
            return []  # nothing stored yet; callers treat an empty list as "no memories"
        
        query_emb = self.embedder.encode(query, convert_to_tensor=True, device=self.device)
        query_emb = nn.functional.normalize(query_emb, dim=-1)
        
        # Compute similarities to all stored memories
        similarities = []
        for i, mem_emb in enumerate(self.memory_embeddings):
            sim = torch.dot(query_emb, mem_emb).item()
            similarities.append((sim, self.memory_texts[i]))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[0], reverse=True)
        return similarities[:top_k]
    
    def recall_with_confidence(self, query: str, gap_threshold: float = 0.1, min_similarity: float = 0.65):
        """
        Production-ready recall using BOTH:
        1. Relative gap detection (is there a clear winner?)
        2. Minimum similarity threshold (is the match actually relevant?)
        
        Returns (confidence, best_match, all_results)
        
        Confidence levels:
        - "high": Top match has high similarity AND is clearly better than 2nd
        - "low": Top match exists but either too low similarity OR no clear gap
        - "none": No memories stored
        """
        results = self.recall(query, top_k=len(self.memory_texts) if self.memory_texts else 1)
        
        if not results or results[0][1] == "No memories stored":
            return "none", None, results
        
        top_sim, top_text = results[0]
        
        if len(results) == 1:
            confidence = "high" if top_sim > min_similarity else "low"
            return confidence, top_text, results
        
        second_sim = results[1][0]
        gap = top_sim - second_sim
        
        # HIGH confidence requires BOTH:
        # 1. Clear winner (large gap from 2nd place)
        # 2. Strong absolute match (above minimum similarity)
        if gap > gap_threshold and top_sim > min_similarity:
            return "high", top_text, results
        else:
            return "low", top_text, results

memory = NeuralMemory(embedder, device_str=device)
print("Memory initialized")

## Phase 1: Memorize Facts
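Feed the memory three distinct facts. Each call to `memorize` embeds the fact and appends it to the store, so the memory updates online, one fact at a time. At this stage the facts would also still be present in the LLM's context.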
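`memorize` scales every embedding to unit length before storing it, which is why the plain dot product in `recall` can be read as cosine similarity. A minimal sketch of that equivalence, using hypothetical 3-dimensional vectors in place of real sentence embeddings (the helper names `normalize`, `dot`, and `cosine` are illustrative, not part of the notebook):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors standing in for sentence embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Dot product of the unit vectors matches the cosine of the originals
unit_dot = dot(normalize(a), normalize(b))
print(abs(unit_dot - cosine(a, b)) < 1e-12)
```

Because the normalization happens once at storage time, recall only needs one dot product per stored memory.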

In [None]:
facts = [
    "The secret code is X-8-DELTA-9.",
    "Alice's favorite color is turquoise.",
    "The meeting is scheduled for 3pm on Friday.",
]

print("=== Memorizing Facts ===")
for i, fact in enumerate(facts, 1):
    count = memory.memorize(fact)  # returns the number of memories stored so far
    print(f"Fact {i}: stored ({count} in memory)  |  {fact}")

## Phase 3: Query and Recall from Memory
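Query the system about the facts. The LLM no longer has them in its context, so any answer must come from the memory module. In this notebook the memory returns the best-matching stored text along with a confidence label; injecting that result into generation as a soft prompt is left to the next steps.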
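As a rough illustration of how a recalled fact could reach a model, the sketch below uses plain-text prompt augmentation. This is a simpler stand-in for soft prompt injection, not an implementation of it. The `build_prompt` helper and its prompt template are hypothetical; the arguments follow the `(confidence, best_match, ...)` shape returned by `recall_with_confidence`.

```python
def build_prompt(question, confidence, best_match):
    """Prepend a recalled fact to the question only when recall confidence is high."""
    if confidence == "high" and best_match is not None:
        return f"Known fact: {best_match}\nQuestion: {question}\nAnswer:"
    # Low or no confidence: do not inject a possibly irrelevant memory
    return f"Question: {question}\nAnswer:"

# Hypothetical recall outcomes, hard-coded for illustration
print(build_prompt("What is the secret code?", "high",
                   "The secret code is X-8-DELTA-9."))
print(build_prompt("What is the address of the meeting?", "low",
                   "The meeting is scheduled for 3pm on Friday."))
```

Gating on the confidence label keeps a loosely related memory, such as the meeting time for an address question, out of the prompt.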

In [None]:
queries = [
    "What is the secret code?",
    "What is Alice's favorite color?",
    "What is the address of the meeting?",  # No address was memorized!
    "When is the meeting?",
]

print("=== Querying with Production-Ready Memory Recall ===")
print("(High confidence = clear gap over the 2nd match AND similarity above the minimum threshold)\n")

for q in queries:
    confidence, best_match, all_results = memory.recall_with_confidence(q, gap_threshold=0.1)
    
    # Show status based on confidence
    if confidence == "high":
        status = "✓ HIGH CONFIDENCE"
    elif confidence == "low":
        status = "⚠ LOW CONFIDENCE (no clear match)"
    else:
        status = "✗ NO MEMORIES"
    
    print(f"Q: {q}")
    print(f"   {status}")
    
    # Show top results with similarities
    for i, (sim, text) in enumerate(all_results[:3]):
        marker = "→" if i == 0 else " "
        print(f"   {marker} [{sim:.4f}] {text}")
    
    # Show gap analysis
    if len(all_results) >= 2:
        gap = all_results[0][0] - all_results[1][0]
        print(f"   Gap (1st - 2nd): {gap:.4f}")
    print()

## Production Threshold Strategies

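This demo uses **Relative Gap Detection** combined with a minimum-similarity floor: rather than relying on a fixed threshold alone, it checks whether the top match is significantly better than the second-best match (`gap_threshold=0.1`) and also requires the top similarity to exceed `min_similarity=0.65`. The table below compares this with other strategies: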

| Strategy | How it Works | Pros | Cons |
|----------|-------------|------|------|
| **Fixed Threshold** | `sim > 0.7` | Simple | Fragile, domain-specific |
| **Relative Gap** | `top - 2nd > 0.1` | Adaptive, no tuning | Needs 2+ memories |
| **Softmax Confidence** | `softmax(sims)[0] > 0.7` | Probabilistic | Temperature tuning |
| **Reranker** | Cross-encoder rescores top-k | Most accurate | Slower, extra model |
| **LLM Decides** | Pass top-k to LLM | Most flexible | Higher latency/cost |
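
The "Temperature tuning" cost of the softmax strategy can be shown with a short sketch. Cosine similarities live in a narrow range, so an untempered softmax over them tends to be flat. The scores below are hypothetical, not measured values, and the `softmax` helper is illustrative rather than part of the notebook.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over similarity scores; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities, sorted descending
sims = [0.80, 0.20, 0.10]

for t in (1.0, 0.1):
    top_prob = softmax(sims, temperature=t)[0]
    print(f"temperature={t}: top probability={top_prob:.3f}, confident={top_prob > 0.7}")
```

With these scores the top probability stays below 0.7 at temperature 1.0 and rises above it at 0.1, so the same `> 0.7` rule gives opposite verdicts depending on the temperature chosen.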

## Key Insight
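The "address" query is the negative case: no address was stored, so no fact should be returned with high confidence. When all stored facts are about equally irrelevant, the gap between the top match and the rest is small. When one fact merely shares a topic with the query (here, the meeting time), the gap alone can be misleading, which is why `recall_with_confidence` also requires `top_sim > min_similarity`. Questions with memorized answers should show both a **high gap** (clear winner) and a high top similarity.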
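
The decision rule can be traced on a few score patterns. This sketch mirrors the logic of `recall_with_confidence` on already-sorted similarity lists. The `classify` helper and all scores are hypothetical and chosen for illustration; they are not output from the embedder.

```python
def classify(sims, gap_threshold=0.1, min_similarity=0.65):
    """Apply the gap-plus-floor rule to similarity scores sorted in descending order."""
    if not sims:
        return "none"
    top = sims[0]
    if len(sims) == 1:
        return "high" if top > min_similarity else "low"
    gap = top - sims[1]
    return "high" if gap > gap_threshold and top > min_similarity else "low"

# Hypothetical scores for illustration only
cases = {
    "answer stored":            [0.85, 0.15, 0.10],  # large gap, strong match
    "all equally irrelevant":   [0.22, 0.20, 0.18],  # small gap, weak match
    "same topic, wrong detail": [0.55, 0.12, 0.08],  # large gap, but below the floor
}
for name, sims in cases.items():
    print(f"{name}: {classify(sims)}")
```

Only the first case is classified as `high`; the third shows why the gap check alone would not be enough.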

## Next Steps
- Experiment with larger models (Mistral-7B with 4-bit quantization)
- Implement soft prompt injection into the LLM's embedding layer
- Add a cross-encoder reranker for higher accuracy
- Save/load memory checkpoints for persistent long-term memory
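
For the checkpointing item, one simple option is to serialize the stored texts and their embeddings to JSON. The sketch below assumes the embeddings have already been converted to plain Python lists, for example with `[e.cpu().tolist() for e in memory.memory_embeddings]`. The `save_memory` and `load_memory` helpers, the file name, and the 2-dimensional vectors are all hypothetical.

```python
import json
import os
import tempfile

def save_memory(path, texts, embeddings):
    """Write texts and their embedding vectors (as plain lists) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"texts": texts, "embeddings": embeddings}, f)

def load_memory(path):
    """Read back the (texts, embeddings) pair written by save_memory."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["texts"], data["embeddings"]

# Hypothetical 2-dimensional embedding standing in for a real sentence embedding
texts = ["The secret code is X-8-DELTA-9."]
embeddings = [[0.6, 0.8]]

path = os.path.join(tempfile.mkdtemp(), "memory.json")
save_memory(path, texts, embeddings)
restored_texts, restored_embeddings = load_memory(path)
print(restored_texts == texts and restored_embeddings == embeddings)
```

JSON keeps the checkpoint human-readable; for large memories a binary tensor format would likely be more compact.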