[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RichmondAlake/memorizz/blob/main/examples/semantic_cache.ipynb)

In [None]:
! pip install -qU memorizz

## Semantic Cache with Memorizz and MemAgents

A **semantic cache** is an intelligent caching mechanism that stores query-response pairs and retrieves them based on meaning similarity rather than exact text matching. Unlike traditional caches that require exact key matches, semantic caching uses vector embeddings to find previously answered similar questions.

| Scenario | Traditional Cache Result | Semantic Cache Result |
|----------|-------------------------|----------------------|
| User asks: `"What is ML?"` after `"What is machine learning?"` was cached | ❌ Cache Miss - New LLM call | ✅ Cache Hit - Returns stored answer |
| User asks: `"Explain Python"` after `"Tell me about Python programming"` was cached | ❌ Cache Miss - New LLM call | ✅ Cache Hit - Returns stored answer |
| User has typo: `"Wht is AI?"` after `"What is AI?"` was cached | ❌ Cache Miss - New LLM call | ✅ Cache Hit - Returns stored answer |
| Different language: `"¿Qué es IA?"` after `"What is AI?"` was cached | ❌ Cache Miss - New LLM call | ✅ Cache Hit - Cross-language match |
| Rephrased: `"How does neural network work?"` after `"Explain neural networks"` was cached | ❌ Cache Miss - New LLM call | ✅ Cache Hit - Semantic similarity |

Whether each of these is an actual hit depends on the embedding model and the configured similarity threshold. Typos and cross-language queries in particular may fall below a strict threshold.

| Aspect | Traditional Cache | Semantic Cache |
|--------|------------------|----------------|
| **Matching Strategy** | Exact string/key matching | Vector similarity matching |
| **Query Examples** | `"What is ML?"` ≠ `"What is machine learning?"` | `"What is ML?"` ≈ `"What is machine learning?"` |
| **Cache Hit Conditions** | Key must match exactly | Similarity score > threshold (e.g., 0.78) |
| **Storage Structure** | `{key: value}` pairs | `{query, response, embedding, metadata}` |
| **Lookup Complexity** | O(1) hash lookup | O(n) vector search or O(log n) with indexing |
| **Memory Requirements** | Low (just key-value) | Higher (stores one embedding per entry, e.g. 256 to 1536 floats depending on the model and output dimension) |
| **Setup Complexity** | Simple | Requires embedding provider + vector database |
| **Intelligence Level** | Dumb (exact match only) | Smart (understands meaning and context) |
| **Language Variations** | `"color"` ≠ `"colour"` | `"color"` ≈ `"colour"` |
| **Typo Tolerance** | `"machne learning"` = Cache Miss | `"machne learning"` ≈ `"machine learning"` |
| **Paraphrasing** | `"How to learn Python?"` ≠ `"Python learning guide"` | `"How to learn Python?"` ≈ `"Python learning guide"` |
| **Cache Efficiency** | Low (many similar queries stored separately) | High (one entry serves many similar queries) |
| **Configuration** | Simple size + TTL | Threshold, scope, embedding provider, TTL, size |
| **Cost Implications** | Low storage, no computation | Higher storage, embedding computation cost |
| **Use Cases** | API responses, computed values | Q&A systems, chatbots, knowledge retrieval |
| **Invalidation Strategy** | Time-based or manual | Time-based + similarity score updates |
| **Multi-language Support** | None (each language needs separate cache) | Good (cross-language semantic matching) |
| **Context Awareness** | None | Yes (understands question intent) |
| **Scalability** | Excellent (O(1) lookup) | Good (with proper vector indexing) |
| **Cache Hit Rate** | Lower (exact matches only) | Higher (semantic matches) |
| **Implementation Examples** | Redis, Memcached, in-memory dict | Memorizz, vector databases, embedding-based systems |
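
To make the matching difference concrete, here is a minimal sketch that contrasts an exact-match dictionary lookup with a threshold-based similarity lookup. The three-dimensional vectors are hand-picked toy values, not real embeddings, and `semantic_lookup` is a hypothetical helper, not part of Memorizz.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "What is machine learning?": [0.90, 0.10, 0.05],
    "What is ML?": [0.88, 0.15, 0.05],
    "Best pizza in Naples?": [0.05, 0.10, 0.95],
}

cached_answer = "cached answer about machine learning"

# Traditional cache: the key must match character for character.
exact_cache = {"What is machine learning?": cached_answer}

# Semantic cache: each entry keeps the query embedding next to the response.
semantic_cache = [
    {
        "query": "What is machine learning?",
        "embedding": toy_embeddings["What is machine learning?"],
        "response": cached_answer,
    }
]

def semantic_lookup(query, threshold=0.85):
    """Return the most similar cached response if it meets the threshold, else None."""
    query_vec = toy_embeddings[query]
    best = max(
        semantic_cache,
        key=lambda entry: cosine_similarity(query_vec, entry["embedding"]),
    )
    score = cosine_similarity(query_vec, best["embedding"])
    return best["response"] if score >= threshold else None

print(exact_cache.get("What is ML?"))            # exact match fails
print(semantic_lookup("What is ML?"))            # similar vector passes the threshold
print(semantic_lookup("Best pizza in Naples?"))  # unrelated vector does not
```

A real system would obtain the vectors from an embedding model and would search an index rather than scanning a list.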

In [None]:
import getpass
import os

# Function to securely get and set environment variables
def set_env_securely(var_name, prompt):
    value = getpass.getpass(prompt)
    os.environ[var_name] = value

In [None]:
set_env_securely("MONGODB_URI", "Enter your MongoDB URI: ")

In [None]:
set_env_securely("OPENAI_API_KEY", "Enter your OpenAI API Key: ")

In [None]:
set_env_securely("VOYAGE_API_KEY", "Enter your VOYAGE AI API Key: ")

## Step 1: Initialize a Memory Provider

A Memory Provider is a core abstraction layer that manages the persistence, organization, and retrieval of all memory components within an agentic system. It serves as the central nervous system for memory management, providing standardized interfaces between AI agents and underlying storage technologies.


In [None]:
from memorizz.memory_provider.mongodb.provider import MongoDBConfig, MongoDBProvider

# Create a mongodb config with voyageai embeddings
mongodb_config = MongoDBConfig(
    uri=os.environ["MONGODB_URI"],
    db_name="testing_memorizz",
    embedding_provider="voyageai",
    embedding_config={
        "embedding_type": "contextualized",
        "model": "voyage-context-3",
        "output_dimension": 256,
    }
)

# Create a memory provider
memory_provider = MongoDBProvider(mongodb_config)

## Step 2: Create a MemAgent with a Semantic Cache Configuration

### Semantic Cache Configuration - The Smart Settings

| Setting | Argument | Description | Purpose | Example Values |
|---------|----------|-------------|---------|----------------|
| **Similarity Threshold** | `similarity_threshold=0.85` | Controls how "similar" two questions need to be for a cache hit (0.0 to 1.0 scale) | Prevents false positives while allowing intelligent matching of rephrased questions | `0.70` (loose), `0.78` (moderate), `0.85` (strict), `0.90` (very strict) |
| **Cache Size Limit** | `max_cache_size=100` | Maximum number of cached query-response pairs to store in memory | Manages memory usage and prevents unlimited cache growth | `50` (small), `100` (medium), `500` (large), `1000` (enterprise) |
| **Time-To-Live** | `ttl_hours=24.0` | How long cached responses remain valid before expiring (in hours) | Ensures information freshness and prevents stale responses | `1.0` (1 hour), `6.0` (6 hours), `24.0` (1 day), `168.0` (1 week) |
| **Memory Provider Sync** | `enable_memory_provider_sync=True` | Whether to store cache entries in persistent database (MongoDB) | Enables cache persistence across agent restarts and sharing between instances | `True` (persistent), `False` (in-memory only) |
| **Usage Tracking** | `enable_usage_tracking=True` | Whether to track usage statistics for cached entries | Provides analytics and enables intelligent cache eviction based on popularity | `True` (track stats), `False` (no tracking) |
| **Session Scoping** | `enable_session_scoping=False` | Whether to isolate cache entries by user session ID | Controls cache sharing across different conversation sessions | `True` (session-isolated), `False` (cross-session sharing) |
| **Cache Scope** | `scope=SemanticCacheScope.LOCAL` | Defines cache visibility boundaries between agents | Controls whether agents share cached responses or maintain separate caches | `SemanticCacheScope.LOCAL` (agent-specific), `SemanticCacheScope.GLOBAL` (shared across agents) |
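
The sketch below illustrates how a size limit and a TTL could interact in a cache like this. It is a toy model under stated assumptions: `ToyCache` and `ToyEntry` are hypothetical names, and the least-used eviction rule is only one plausible reading of "eviction based on popularity". The notebook does not show Memorizz's actual eviction policy.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEntry:
    query: str
    response: str
    created_at: float  # seconds on an arbitrary clock
    usage_count: int = 0

@dataclass
class ToyCache:
    max_cache_size: int = 100
    ttl_hours: float = 24.0
    entries: list = field(default_factory=list)

    def prune_expired(self, now):
        """Drop entries older than ttl_hours."""
        ttl_seconds = self.ttl_hours * 3600
        self.entries = [e for e in self.entries if now - e.created_at <= ttl_seconds]

    def add(self, query, response, now):
        """Insert an entry; expire stale ones first, then evict the least-used if full."""
        self.prune_expired(now)
        if len(self.entries) >= self.max_cache_size:
            least_used = min(self.entries, key=lambda e: e.usage_count)
            self.entries.remove(least_used)
        self.entries.append(ToyEntry(query, response, created_at=now))

cache = ToyCache(max_cache_size=2, ttl_hours=1.0)
cache.add("q1", "a1", now=0)
cache.add("q2", "a2", now=10)
cache.entries[1].usage_count = 3   # pretend q2 has been served three times
cache.add("q3", "a3", now=20)      # cache is full, so the least-used entry (q1) is evicted
remaining = [e.query for e in cache.entries]
print(remaining)

cache.prune_expired(now=3700)      # more than one hour after q2 and q3 were added
print(len(cache.entries))
```

Passing `now` explicitly keeps the example deterministic; a real cache would read the clock itself.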

In [None]:
from memorizz.enums import SemanticCacheScope
from memorizz.short_term_memory.semantic_cache import SemanticCacheConfig


# Create a semantic cache config
semantic_cache_config = SemanticCacheConfig(
    similarity_threshold=0.85,
    max_cache_size=100,
    ttl_hours=24.0,
    enable_memory_provider_sync=True,
    enable_usage_tracking=True,
    enable_session_scoping=False,
    scope=SemanticCacheScope.LOCAL  # This agent only sees responses cached by its own instance
)

### **Configuration Examples by Use Case**

| Use Case | Configuration Example | Rationale |
|----------|----------------------|-----------|
| **High-Accuracy System** | `similarity_threshold=0.90, max_cache_size=200, ttl_hours=6.0` | Strict matching for critical applications, shorter TTL for fresh data |
| **Cost-Optimized Chatbot** | `similarity_threshold=0.75, max_cache_size=1000, ttl_hours=48.0` | More aggressive caching to reduce LLM API calls |
| **Multi-Agent Knowledge Sharing** | `scope=SemanticCacheScope.GLOBAL, enable_session_scoping=False` | Agents learn from each other's interactions |
| **Privacy-Focused Application** | `scope=SemanticCacheScope.LOCAL, enable_session_scoping=True` | Strict isolation between agents and sessions |
| **Development/Testing** | `similarity_threshold=0.70, max_cache_size=50, ttl_hours=1.0` | Lower threshold for testing, small cache, quick expiration |

The cell below creates a `MemAgent` that pairs OpenAI's GPT-4 model with semantic caching to reduce latency and API cost.

The agent uses GPT-4 as its language model, the MongoDB memory provider from Step 1 to persist cached responses, and an instruction that tells it to answer concisely.

Semantic caching is enabled with `semantic_cache=True` and tuned through `semantic_cache_config`. When a new query is semantically similar to one already answered, the agent returns the cached response instead of calling GPT-4 again. The result is faster responses, lower cost, and consistent answers across different phrasings of the same question.

In [None]:
from memorizz.memagent import MemAgent
from memorizz.llms.openai import OpenAI
# Create agent with semantic cache enabled
local_scoped_agent = MemAgent(
    model=OpenAI(model="gpt-4"),
    memory_provider=memory_provider,
    instruction="You are a helpful assistant that answers questions concisely.",
    semantic_cache=True,  # Enable semantic cache
    semantic_cache_config=semantic_cache_config
)

Save the agent before running the cells below.

In [None]:
local_scoped_agent.save()

## Step 3: Testing Semantic Cache




### First LLM Call (Cache Miss)

This code below demonstrates the semantic cache workflow in action by testing how the agent handles a fresh query that hasn't been cached before. 

The cell calls `local_scoped_agent.run()` with "What is the capital of United Kingdom?" for the first time. No similar question has been cached yet, so this is a cache miss: the agent calls GPT-4 to generate the response, then stores the query, its vector embedding, and the response in the semantic cache for future use.

The printed output will show the actual response from GPT-4, and behind the scenes, this interaction creates a new cache entry that will enable faster responses for semantically similar questions like "What's the UK's capital?" or "Tell me the capital city of Britain" in subsequent queries.
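
The miss-then-store workflow described above is the classic cache-aside pattern, sketched below in a form that runs without API keys. Everything here is a stand-in: `fake_llm` replaces the model call, and word-overlap (Jaccard) similarity replaces embedding similarity purely to keep the example self-contained. It is not how Memorizz computes similarity.

```python
llm_calls = 0

def fake_llm(query):
    """Stand-in for a model call; counts how often it is invoked."""
    global llm_calls
    llm_calls += 1
    return f"answer to: {query}"

def jaccard(a, b):
    """Word-overlap similarity, used here only as a stand-in for embedding similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

cache = []  # list of (query, response) pairs

def run_with_cache(query, threshold=0.5):
    """Return (response, status): reuse the first similar cached answer, else call the LLM and store."""
    for cached_query, cached_response in cache:
        if jaccard(query, cached_query) >= threshold:
            return cached_response, "hit"
    response = fake_llm(query)
    cache.append((query, response))
    return response, "miss"

first = run_with_cache("What is the capital of United Kingdom?")
second = run_with_cache("What is the capital city of United Kingdom?")
third = run_with_cache("Best pizza in Naples?")
print(first[1], second[1], third[1], llm_calls)
```

Only the first and third queries reach `fake_llm`; the second reuses the stored response.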

In [None]:
# Test the real semantic cache workflow
print("🧪 Testing semantic cache with agent.run():")

# First query - will call LLM and cache response
print("\n1. First query (will call LLM):")
response1 = local_scoped_agent.run("What is the capital of United Kingdom?")
print(f"Response: {response1}")

### Second LLM Call (Cache Hit)

This code below demonstrates the power of semantic caching by asking a semantically similar question to the one previously cached, showing how the agent can intelligently recognize that "Tell me the capital city of United Kingdom" is essentially the same question as the earlier "What is the capital of United Kingdom?" despite different phrasing. 

When `local_scoped_agent.run()` processes this query, it generates a vector embedding for the new question, runs a similarity search against the cached entries, finds that the similarity score exceeds the configured threshold (0.85 in this notebook), and returns the previously cached GPT-4 response without making another API call.

This results in a much faster response time, cost savings, and demonstrates how semantic caching enables intelligent query matching based on meaning rather than exact text, allowing users to ask the same question in multiple ways while still benefiting from cached responses.
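
One way to check the latency claim is to time each call. The helper below is a small sketch; `timed_call` is a hypothetical name, and `fake_agent_run` is a stand-in so the example runs without credentials. In the notebook you would pass `local_scoped_agent.run` instead.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_agent_run(query):
    """Stand-in for an agent call; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return f"echo: {query}"

result, elapsed = timed_call(fake_agent_run, "What is the capital of United Kingdom?")
print(f"{elapsed:.3f}s -> {result}")
```

A cache hit should usually take noticeably less time than a call that reaches the model, although network and database latency add noise to any single measurement.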

In [None]:
# Similar query - should hit cache (no LLM call!)
print("\n2. Similar query (should hit cache):")
response2 = local_scoped_agent.run("Tell me the capital city of United Kingdom")
print(f"Response: {response2}")

The cell below checks whether the cache worked by comparing the two responses, then prints diagnostic information about the cache.

If `response1` (from the first query) equals `response2` (from the semantically similar second query), the agent most likely returned the stored response instead of calling the LLM again. If they differ, the second query was a cache miss and the agent generated a new response. Treat the equality check as a heuristic: a model can return identical text for two separate calls, so the cache statistics are the stronger signal.

The cell also calls `local_scoped_agent.semantic_cache_instance.get_stats()` to show the total number of cached entries and the cumulative usage count across all cache hits, which helps you monitor how effectively the cache is being used.

In [None]:
# Check if it was a cache hit
if response1 == response2:
    print("✅ CACHE HIT! Same response returned")
else:
    print("❌ Cache miss - responses differ")

# Check cache stats
stats = local_scoped_agent.semantic_cache_instance.get_stats()
print(f"\nCache stats: {stats['total_entries']} entries, {stats['total_usage_count']} uses")

## Step 4: Agent With Global Semantic Cache Access

This code below creates a `global_scoped_agent` that demonstrates semantic cache sharing across multiple AI agents by configuring the cache scope to `SemanticCacheScope.GLOBAL`. 

Unlike a locally-scoped agent that only accesses its own cached responses, this global-scoped agent can retrieve and benefit from cached responses created by any other agent in the system, effectively creating a shared knowledge pool where all agents learn from each other's interactions. The agent is built with GPT-4 as its language model, connected to a persistent memory provider, and configured with a strict similarity threshold of 0.85 to ensure high-quality cache matches, while the `enable_memory_provider_sync=True` setting ensures that this shared cache is stored persistently in the database. 

This global approach maximizes cache efficiency and cost savings across an entire multi-agent system, as any agent's interaction with a question like "What is machine learning?" would create a cache entry that all other agents can instantly access when users ask semantically similar questions like "Explain ML" or "Define machine learning."
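
The visibility rule described above can be pictured as a filter over cache entries. The sketch below is a toy model based only on this notebook's description: the entry fields, the `visible_entries` helper, and the plain strings standing in for `SemanticCacheScope` values are all hypothetical, and Memorizz's real lookup may apply additional conditions.

```python
# Hypothetical cache entries; field names are illustrative only.
entries = [
    {"agent_id": "agent_a", "query": "What is machine learning?", "response": "..."},
    {"agent_id": "agent_b", "query": "Tell me the capital city of Australia", "response": "Canberra"},
]

def visible_entries(entries, agent_id, scope):
    """'local': only the agent's own entries. 'global': entries from every agent."""
    if scope == "local":
        return [e for e in entries if e["agent_id"] == agent_id]
    if scope == "global":
        return list(entries)
    raise ValueError(f"unknown scope: {scope}")

local_view = visible_entries(entries, "agent_a", "local")
global_view = visible_entries(entries, "agent_a", "global")
print(len(local_view), len(global_view))
```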

In [None]:
global_scoped_agent = MemAgent(
    model=OpenAI(model="gpt-4"),
    memory_provider=memory_provider,
    instruction="You are a helpful assistant that answers questions concisely.",
    semantic_cache=True,  # Enable semantic cache
    semantic_cache_config=SemanticCacheConfig(
        similarity_threshold=0.85,
        max_cache_size=100,
        ttl_hours=24.0,
        enable_memory_provider_sync=True,
        enable_usage_tracking=True,
        enable_session_scoping=False,
        scope=SemanticCacheScope.GLOBAL
    )
)

In [None]:
global_scoped_agent.save()

The cell below runs "What is the capital of United Kingdom?" through the `global_scoped_agent` to demonstrate how the global semantic cache scope works in practice.

When this agent processes the query, it first searches the global cache (shared across all agents in the system) for any semantically similar questions that have been previously answered, and if another agent has already cached a response to a similar question like "What's the UK's capital?" or "Tell me London's status as capital," this agent will retrieve and return that cached response instantly without calling GPT-4. 

However, if no similar question exists in the global cache, the agent will query GPT-4 for a fresh response and then store the new query-response pair in the global cache, making it available for all other agents in the system to benefit from in future interactions, thereby contributing to the collective knowledge pool while potentially saving API costs and response time.

In [None]:
global_scoped_agent.run("What is the capital of United Kingdom?")

This is a new query: nothing similar is in the semantic cache collection during this run.

In [None]:
global_scoped_agent.run("Tell me the capital city of Australia")

The locally scoped agent can't access previously cached responses from other agents.

In [None]:
local_scoped_agent.run("What is the capital of Australia")

## Step 5: Cache Clearing Strategies

| Method | Scope | Use Case | Example Usage |
|--------|-------|----------|---------------|
| `agent.semantic_cache_instance.clear()` | **Agent's in-memory cache only** | Quick development reset, testing, debugging | Testing new cache behavior, clearing during development iterations |
| `agent.semantic_cache_instance.clear(session_id="...")` | **Specific user session** | User logout cleanup, session isolation, privacy compliance | User ends conversation, switching between different user contexts |
| `agent.semantic_cache_instance.clear(memory_id="...")` | **Specific memory context** | Context-specific cleanup, memory boundary management | Switching between different conversation topics or workflows |
| `memory_provider.clear_semantic_cache()` | **Database level (all agents globally)** | Complete system reset, production maintenance, emergency cleanup | System maintenance, migrating to new cache version, fixing corrupted cache |
| `memory_provider.clear_semantic_cache(agent_id="...")` | **All cache entries for specific agent** | Agent-specific maintenance, individual agent reset | Redeploying specific agent, fixing agent-specific cache issues |
| `memory_provider.clear_semantic_cache(memory_id="...")` | **Database entries for specific memory context** | Memory-specific database cleanup across all agents | Cleaning up specific workflow or conversation context globally |
| `memory_provider.clear_semantic_cache(agent_id="...", memory_id="...")` | **Specific agent + memory combination** | Precise targeted cleanup, surgical cache removal | Fixing specific agent-memory combination issues, targeted debugging |

### **Clearing Strategy Decision Tree**

| Scenario | Recommended Method | Rationale |
|----------|-------------------|-----------|
| **Development/Testing** | `agent.semantic_cache_instance.clear()` | Fast, local, doesn't affect other developers |
| **User Privacy/Logout** | `clear(session_id="user_session")` | Removes only user-specific cached responses |
| **Production Maintenance** | `memory_provider.clear_semantic_cache()` | Complete system cleanup, affects all agents |
| **Agent Redeployment** | `clear_semantic_cache(agent_id="agent_123")` | Removes cached responses for updated agent |
| **Emergency Cache Corruption** | `memory_provider.clear_semantic_cache()` | Nuclear option to fix system-wide issues |
| **Context Switching** | `clear(memory_id="context_456")` | Clean slate for new conversation context |
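
The filter combinations in the tables above follow a simple rule: every filter that is supplied must match, and supplying no filter clears everything. The sketch below models that rule on a plain list. `clear_entries` and the entry fields are hypothetical and only mirror the argument names shown for `memory_provider.clear_semantic_cache()`; the real method operates on the database.

```python
def clear_entries(entries, agent_id=None, memory_id=None):
    """Remove entries matching every supplied filter; with no filters, remove all.

    Returns (kept_entries, removed_count).
    """
    def matches(entry):
        if agent_id is not None and entry["agent_id"] != agent_id:
            return False
        if memory_id is not None and entry["memory_id"] != memory_id:
            return False
        return True

    kept = [e for e in entries if not matches(e)]
    return kept, len(entries) - len(kept)

# Hypothetical entries for illustration.
store = [
    {"agent_id": "agent_123", "memory_id": "context_456", "query": "q1"},
    {"agent_id": "agent_123", "memory_id": "context_789", "query": "q2"},
    {"agent_id": "agent_999", "memory_id": "context_456", "query": "q3"},
]

by_agent, removed_by_agent = clear_entries(store, agent_id="agent_123")
by_pair, removed_by_pair = clear_entries(store, agent_id="agent_123", memory_id="context_456")
everything, removed_all = clear_entries(store)
print(removed_by_agent, removed_by_pair, removed_all)
```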

Let's clear the local agent cache

In [None]:
local_scoped_agent.semantic_cache_instance.clear()

In [None]:
local_scoped_agent.semantic_cache_instance.get_stats()

Let's clear the global agent cache using the memory provider

In [None]:
memory_provider.clear_semantic_cache(agent_id=global_scoped_agent.agent_id)

In [None]:
global_scoped_agent.semantic_cache_instance.get_stats()

Let's clear all cache in the memory provider (database)

In [None]:
memory_provider.clear_semantic_cache()
