# Semantic Cache with RedisVL4J

This notebook demonstrates how to use RedisVL4J's `SemanticCache` to efficiently cache LLM responses based on semantic similarity of queries.

First, we set up the dependencies and define a small `askOpenAI` helper that sends a prompt to OpenAI and returns the text of the completion.

In [1]:
// Load Maven dependencies
%maven redis.clients:jedis:5.2.0
%maven org.slf4j:slf4j-nop:2.0.16
%maven com.fasterxml.jackson.core:jackson-databind:2.18.0
%maven com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.18.0
%maven com.github.f4b6a3:ulid-creator:5.2.3
%maven dev.langchain4j:langchain4j:0.36.2
%maven dev.langchain4j:langchain4j-open-ai:0.36.2
%maven com.microsoft.onnxruntime:onnxruntime:1.16.3
%maven com.squareup.okhttp3:okhttp:4.12.0
%maven com.google.code.gson:gson:2.10.1

// Note: the RedisVL4J JAR must be on the classpath (the Docker container loads it automatically)

// Import RedisVL4J classes
import com.redis.vl.extensions.cache.*;
import com.redis.vl.utils.vectorize.*;

// Import Redis client
import redis.clients.jedis.UnifiedJedis;
import redis.clients.jedis.HostAndPort;

// Import LangChain4J
import dev.langchain4j.model.openai.OpenAiLanguageModel;

// Import Java standard libraries
import java.util.*;
import java.time.Duration;
import java.util.function.Function;

In [2]:
// Setup connection and initialize components
UnifiedJedis jedis = new UnifiedJedis(new HostAndPort("redis-stack", 6379));

// Setup OpenAI client
String apiKey = System.getenv("OPENAI_API_KEY");
OpenAiLanguageModel languageModel = OpenAiLanguageModel.builder()
    .apiKey(apiKey)
    .modelName("gpt-3.5-turbo-instruct")
    .timeout(Duration.ofSeconds(60))
    .build();

System.out.println("OpenAI API configured");

// Create the askOpenAI function
Function<String, String> askOpenAI = (question) -> {
    try {
        return languageModel.generate(question).content().trim();
    } catch (Exception e) {
        throw new RuntimeException("Failed to call OpenAI: " + e.getMessage(), e);
    }
};

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


OpenAI API configured


In [3]:
// Test
System.out.println(askOpenAI.apply("What is the capital of France?"));

The capital of France is Paris.


## Initializing `SemanticCache`

`SemanticCache` will automatically create an index within Redis upon initialization for the semantic cache content.

We'll use the `SentenceTransformersVectorizer` which downloads and runs HuggingFace models locally using ONNX Runtime. On first use, it will download the `Xenova/all-MiniLM-L6-v2` model (~25MB ONNX) and cache it locally in `~/.cache/redisvl4j/models/`. Subsequent runs will use the cached model for fast initialization.

Note: `Xenova/all-MiniLM-L6-v2` is an ONNX-optimized version of the popular all-MiniLM-L6-v2 model. The `redis/langcache-embed-v3` model is published in SafeTensors format, which is not yet supported and would require additional conversion.

In [4]:
// Create vectorizer using SentenceTransformersVectorizer to download and run model locally
// Note: Using Xenova/all-MiniLM-L6-v2 which has ONNX support in the main directory
// The redis/langcache-embed-v3 model uses SafeTensors format which is not yet supported
BaseVectorizer vectorizer = new SentenceTransformersVectorizer("Xenova/all-MiniLM-L6-v2");

System.out.println("Initializing vectorizer with Xenova/all-MiniLM-L6-v2 model...");
System.out.println("Model dimensions: " + vectorizer.getDimensions());

// Initialize SemanticCache using Builder pattern
SemanticCache llmcache = new SemanticCache.Builder()
    .name("llmcache")                    // underlying search index name
    .redisClient(jedis)                  // redis connection
    .distanceThreshold(0.1f)             // semantic cache distance threshold  
    .vectorizer(vectorizer)              // embedding model
    .build();

System.out.println("SemanticCache initialized with index: " + llmcache.getName());

Initializing vectorizer with Xenova/all-MiniLM-L6-v2 model...
Model dimensions: 384
SemanticCache initialized with index: llmcache


In [5]:
// Print the name of the search index that backs the semantic cache
System.out.println("Cache index '" + llmcache.getName() + "' is ready for use");

Cache index 'llmcache' is ready for use


## Basic Cache Usage

In [6]:
String question = "What is the capital of France?";

In [7]:
// Check the semantic cache -- should be empty
Optional<CacheHit> response = llmcache.check(question);
if (response.isPresent()) {
    System.out.println(response.get());
} else {
    System.out.println("Empty cache");
}

Empty cache


The initial cache check comes back empty because nothing has been stored yet. Next, store the `question`, the correct response, and any arbitrary `metadata` (as a Java `Map`) in the cache.

In [8]:
// Cache the question, answer, and arbitrary metadata
Map<String, Object> metadata = new HashMap<>();
metadata.put("city", "Paris");
metadata.put("country", "france");

llmcache.store(question, "Paris", metadata);
System.out.println("Stored in cache");

Stored in cache


Now we will check the cache again with the same question and with a semantically similar question:

In [9]:
// Check the cache again
Optional<CacheHit> cacheResponse = llmcache.check(question);
if (cacheResponse.isPresent()) {
    CacheHit hit = cacheResponse.get();
    System.out.println("Found in cache:");
    System.out.println("  Prompt: " + hit.getPrompt());
    System.out.println("  Response: " + hit.getResponse());
    System.out.println("  Distance: " + hit.getDistance());
    System.out.println("  Metadata: " + hit.getMetadata());
} else {
    System.out.println("Empty cache");
}

Found in cache:
  Prompt: What is the capital of France?
  Response: Paris
  Distance: 0.0
  Metadata: {country=france, vector_distance=0, updated_at=1758775412, city=Paris, id=2fecdce0-a4f7-4349-b61f-7b4b5ec8d6c2, inserted_at=1758775412}


In [10]:
// Check for a semantically similar result
String similarQuestion = "What actually is the capital of France?";
Optional<CacheHit> similarResponse = llmcache.check(similarQuestion);
if (similarResponse.isPresent()) {
    System.out.println(similarResponse.get().getResponse());
} else {
    System.out.println("Not found in cache");
}

Paris


## Customize the Distance Threshold

For most use cases, the right semantic similarity threshold is not a fixed quantity. It depends on the choice of embedding model, the properties of the input queries, and even the business use case, so the threshold may need to change over time.

Fortunately, you can adjust the threshold at any point, as shown below:

In [11]:
// Widen the semantic distance threshold
llmcache.setDistanceThreshold(0.5f);
System.out.println("Distance threshold set to 0.5");

Distance threshold set to 0.5


In [12]:
// Try to trick the cache with a roundabout version of the question.
// It still matches because its distance slips just under the new threshold.
String trickQuestion = "What is the capital city of the country in Europe that also has a city named Nice?";
Optional<CacheHit> trickResponse = llmcache.check(trickQuestion);
if (trickResponse.isPresent()) {
    System.out.println(trickResponse.get().getResponse());
} else {
    System.out.println("Not found in cache");
}

Paris
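
To make the role of the threshold concrete, here is a minimal sketch that assumes the cache compares the cosine distance (1 minus cosine similarity) between the query embedding and a stored embedding against the threshold. The class `ThresholdSketch`, its methods, the toy 3-dimensional vectors, and the choice of `<=` for the comparison are all hypothetical illustrations, not RedisVL4J code.

```java
// Hypothetical sketch, not RedisVL4J code. Assumes cosine distance is the metric.
class ThresholdSketch {

    /** Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way. */
    static double cosineDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** A lookup counts as a hit when the distance does not exceed the threshold. */
    static boolean isHit(double[] query, double[] stored, double threshold) {
        return cosineDistance(query, stored) <= threshold;
    }

    public static void main(String[] args) {
        double[] stored = {1.0, 0.0, 0.0};
        double[] query = {0.8, 0.6, 0.0}; // cosine similarity 0.8, so distance is about 0.2

        System.out.println(isHit(query, stored, 0.1)); // false: too far for a strict threshold
        System.out.println(isHit(query, stored, 0.5)); // true: accepted by the wider threshold
    }
}
```

A wider threshold raises the hit rate but also the risk of returning an answer to a question that only looks similar, which is why the value is worth tuning per model and per use case.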


In [13]:
// Invalidate the cache completely by clearing it out
llmcache.clear();

// Should be empty now
Optional<CacheHit> clearedResponse = llmcache.check(trickQuestion);
System.out.println("Cache after clear: " + (clearedResponse.isPresent() ? "Not empty" : "Empty"));

Cache after clear: Empty


## Utilize TTL

Redis supports optional TTL (time-to-live) policies that expire individual keys at a set point in the future. This lets you focus on your data flow and business logic instead of writing cleanup tasks.

A TTL policy set on the `SemanticCache` allows you to temporarily hold onto cache entries. Below, we will set the TTL policy to 5 seconds.

In [14]:
// Create a new cache with TTL
SemanticCache ttlCache = new SemanticCache.Builder()
    .name("llmcache_ttl")
    .redisClient(jedis)
    .distanceThreshold(0.1f)
    .vectorizer(vectorizer)
    .ttl(5) // 5 seconds
    .build();

System.out.println("Created cache with 5 second TTL");

Created cache with 5 second TTL


In [15]:
ttlCache.store("This is a TTL test", "This is a TTL test response");
System.out.println("Stored entry with TTL");

Thread.sleep(6000); // Sleep for 6 seconds

Stored entry with TTL


In [16]:
// Confirm that the cache has cleared by now on its own
Optional<CacheHit> ttlResult = ttlCache.check("This is a TTL test");

System.out.println("Result after TTL expiry: " + (ttlResult.isPresent() ? "Found" : "Empty (expired)"));

Result after TTL expiry: Empty (expired)
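
The expiry behavior seen here can be illustrated with a small in-memory sketch. It assumes nothing about RedisVL4J internals: Redis expires keys on the server, whereas the hypothetical `TtlSketch` class below only mimics the observable effect, using an injectable clock so the example runs without sleeping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.LongSupplier;

// Hypothetical in-memory sketch of TTL semantics, not RedisVL4J code.
class TtlSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> expiresAtMillis = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be shown without sleeping

    TtlSketch(long ttlSeconds, LongSupplier clock) {
        this.ttlMillis = ttlSeconds * 1000;
        this.clock = clock;
    }

    /** Stores a value and records the moment at which it stops being valid. */
    void store(String key, String value) {
        values.put(key, value);
        expiresAtMillis.put(key, clock.getAsLong() + ttlMillis);
    }

    /** Returns the value only if its deadline has not passed; drops it otherwise. */
    Optional<String> check(String key) {
        Long deadline = expiresAtMillis.get(key);
        if (deadline == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() >= deadline) {
            values.remove(key);
            expiresAtMillis.remove(key);
            return Optional.empty();
        }
        return Optional.of(values.get(key));
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock, in milliseconds
        TtlSketch cache = new TtlSketch(5, () -> now[0]);
        cache.store("This is a TTL test", "This is a TTL test response");

        now[0] = 4_000L;
        System.out.println(cache.check("This is a TTL test").isPresent()); // true

        now[0] = 6_000L;
        System.out.println(cache.check("This is a TTL test").isPresent()); // false
    }
}
```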


In [17]:
// Clean up TTL cache
ttlCache.clear();

## Simple Performance Testing

Next, we measure the speedup obtained by using `SemanticCache`. We time how long it takes to produce a response with and without the cache.

In [18]:
/**
 * Helper that answers a question by checking the semantic cache first
 * and falling back to OpenAI only on a cache miss.
 * Note: a fresh answer is NOT written back to the cache here.
 */
Function<String, String> answerQuestion = (q) -> {
    Optional<CacheHit> results = llmcache.check(q);
    if (results.isPresent()) {
        // Cache hit: return the stored response without calling the LLM
        return results.get().getResponse();
    }
    // Cache miss: ask the LLM
    return askOpenAI.apply(q);
};
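
The helper above reads from the cache but leaves storing to the caller. A common alternative is the cache-aside pattern, where a miss also writes the fresh answer back. The sketch below shows that pattern under simplifying assumptions: the hypothetical `CacheAsideSketch` class uses an exact-match `HashMap` as a stand-in for a semantic cache and a plain `Function` as a stand-in for the LLM call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of cache-aside. The HashMap is an exact-match stand-in
// for a semantic cache and the Function is a stand-in for an LLM call.
class CacheAsideSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> llm;
    private int llmCalls = 0;

    CacheAsideSketch(Function<String, String> llm) {
        this.llm = llm;
    }

    /** Returns a cached answer if present; otherwise asks the LLM and stores the result. */
    String answer(String question) {
        String cached = cache.get(question);
        if (cached != null) {
            return cached;
        }
        llmCalls++;
        String fresh = llm.apply(question);
        cache.put(question, fresh); // write back so the next call is a hit
        return fresh;
    }

    int llmCalls() {
        return llmCalls;
    }

    public static void main(String[] args) {
        CacheAsideSketch qa = new CacheAsideSketch(q -> "answer to: " + q);
        qa.answer("What was the name of the first US President?");
        qa.answer("What was the name of the first US President?");
        System.out.println(qa.llmCalls()); // 1: the second call was served from the cache
    }
}
```

Writing back automatically is convenient, but it also caches whatever the model returns, so keeping the store step explicit, as this notebook does, gives more control over what ends up in the cache.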

In [19]:
long start = System.currentTimeMillis();
// Ask a question that is not cached yet, so this measures the OpenAI response time
String perfQuestion = "What was the name of the first US President?";
String answer = answerQuestion.apply(perfQuestion);
long end = System.currentTimeMillis();

double timeWithoutCache = (end - start) / 1000.0;
System.out.println("Without caching, a call to OpenAI to answer this simple question took " + timeWithoutCache + " seconds.");

// add the entry to our LLM cache
llmcache.store(perfQuestion, "George Washington");
System.out.println("Added to cache");

Without caching, a call to OpenAI to answer this simple question took 0.684 seconds.
Added to cache


In [20]:
// Measure the average latency when the answer is served from the cache
List<Double> times = new ArrayList<>();

for (int i = 0; i < 10; i++) {
    long cachedStart = System.currentTimeMillis();
    String cachedAnswer = answerQuestion.apply(perfQuestion);
    long cachedEnd = System.currentTimeMillis();
    times.add((cachedEnd - cachedStart) / 1000.0);
}

double avgTimeWithCache = times.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
double percentageSaved = ((timeWithoutCache - avgTimeWithCache) / timeWithoutCache) * 100;

System.out.println("Avg time taken with LLM cache enabled: " + avgTimeWithCache);
System.out.println("Percentage of time saved: " + String.format("%.2f%%", percentageSaved));

Avg time taken with LLM cache enabled: 0.0241
Percentage of time saved: 96.48%
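
The percentage reported above is plain arithmetic on the two timings. The sketch below reproduces it from the numbers printed in this run (0.684 s uncached, 0.0241 s cached) and also expresses the result as a speedup factor. The class `SpeedupSketch` and its method names are hypothetical, and `timeSeconds` is an optional helper that uses `System.nanoTime()`, which is intended for measuring elapsed time.

```java
import java.util.Locale;

// Hypothetical helper for the timing arithmetic; not part of RedisVL4J.
class SpeedupSketch {

    /** Share of the uncached time that the cache saves, as a percentage. */
    static double percentSaved(double uncachedSeconds, double cachedSeconds) {
        return (uncachedSeconds - cachedSeconds) / uncachedSeconds * 100.0;
    }

    /** How many times faster the cached path is. */
    static double speedupFactor(double uncachedSeconds, double cachedSeconds) {
        return uncachedSeconds / cachedSeconds;
    }

    /** Runs a task and returns the elapsed wall time in seconds. */
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        double uncached = 0.684;  // seconds, from the run above
        double cached = 0.0241;   // seconds, average from the run above

        System.out.println(String.format(Locale.ROOT, "%.2f%%", percentSaved(uncached, cached))); // 96.48%
        System.out.println(String.format(Locale.ROOT, "%.1fx", speedupFactor(uncached, cached))); // 28.4x
    }
}
```

Keep in mind that the uncached figure comes from a single request, so it will vary from run to run with network and API latency.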


In [21]:
// Check the stats of the cache
System.out.println("\nCache Statistics:");
System.out.println("Hit count: " + llmcache.getHitCount());
System.out.println("Miss count: " + llmcache.getMissCount());
System.out.println("Hit rate: " + String.format("%.2f%%", llmcache.getHitRate() * 100));


Cache Statistics:
Hit count: 13
Miss count: 3
Hit rate: 81.25%
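
These counts are consistent with a tally of every `check` call made on `llmcache` so far in this notebook, assuming the counters are not reset by `clear()`: three single hits (cells 9, 10, and 12) plus ten hits in the timing loop, and three misses (cells 7, 13, and 19). The sketch below reproduces the hit-rate arithmetic; the class `HitRateSketch` is a hypothetical illustration, not the library's implementation.

```java
import java.util.Locale;

// Hypothetical sketch of the hit-rate arithmetic; not RedisVL4J code.
class HitRateSketch {

    /** Hit rate = hits / (hits + misses), or 0 when nothing has been looked up. */
    static double hitRate(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        long hits = 1 + 1 + 1 + 10; // cells 9, 10, 12, and the 10 iterations of the timing loop
        long misses = 1 + 1 + 1;    // cells 7, 13, and the uncached call in cell 19

        System.out.println(String.format(Locale.ROOT, "%.2f%%", hitRate(hits, misses) * 100)); // 81.25%
    }
}
```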


In [22]:
// Clear the cache (but don't delete the index)
llmcache.clear();
System.out.println("Cache cleared");

Cache cleared


## Cache Access Controls, Tags & Filters

When running complex workflows with similar applications, or handling multiple users, it's important to keep data segregated. Building on RedisVL4J's support for complex and hybrid queries, we can tag and filter cache entries.

Let's store multiple users' data in our cache with similar prompts:

In [23]:
// Store entries with user metadata
Map<String, Object> userAbc = new HashMap<>();
userAbc.put("user", "abc");

Map<String, Object> userDef = new HashMap<>();
userDef.put("user", "def");

llmcache.store(
    "What is the phone number linked to my account?",
    "The number on file is 123-555-0000",
    userAbc
);

llmcache.store(
    "What's the phone number linked in my account?",
    "The number on file is 123-555-1111",
    userDef
);

System.out.println("Stored user-specific cache entries");

Stored user-specific cache entries


In [24]:
// Check the cache without a user filter: the closest entry is returned,
// regardless of which user stored it
Optional<CacheHit> phoneResponse = llmcache.check(
    "What is the phone number linked to my account?"
);

if (phoneResponse.isPresent()) {
    System.out.println("Found entry: " + phoneResponse.get().getResponse());
} else {
    System.out.println("No entry found");
}

Found entry: The number on file is 123-555-0000
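
The lookup above returns user `abc`'s number no matter who is asking, which is exactly the problem that filtering is meant to solve. The sketch below illustrates the idea of restricting candidates by a metadata field before returning a hit. It is a plain-Java illustration under stated assumptions: the `UserFilterSketch` class, its `Entry` type, and `firstForUser` are hypothetical names, the similarity search itself is omitted, and this is not the RedisVL4J filtering API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of metadata filtering; not the RedisVL4J filter API.
class UserFilterSketch {

    /** Minimal stand-in for a cached entry with arbitrary metadata. */
    static final class Entry {
        final String prompt;
        final String response;
        final Map<String, Object> metadata;

        Entry(String prompt, String response, Map<String, Object> metadata) {
            this.prompt = prompt;
            this.response = response;
            this.metadata = metadata;
        }
    }

    /** Returns the first candidate whose "user" metadata matches the requesting user. */
    static Optional<Entry> firstForUser(List<Entry> candidates, String user) {
        return candidates.stream()
            .filter(e -> user.equals(e.metadata.get("user")))
            .findFirst();
    }

    public static void main(String[] args) {
        // Candidates as a similarity search might return them, closest first
        List<Entry> candidates = List.of(
            new Entry("What is the phone number linked to my account?",
                      "The number on file is 123-555-0000", Map.of("user", "abc")),
            new Entry("What's the phone number linked in my account?",
                      "The number on file is 123-555-1111", Map.of("user", "def")));

        // User "def" only ever sees entries stored for "def"
        System.out.println(firstForUser(candidates, "def").get().response);
        // prints: The number on file is 123-555-1111
    }
}
```

Filtering on the client like this is only a way to show the idea. Applying the filter inside the Redis query avoids fetching other users' entries at all.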


In [25]:
// Final cleanup - clear cache and close connection
llmcache.clear();
jedis.close();

System.out.println("\nAll caches cleaned up and connection closed.");
System.out.println("SemanticCache demonstration complete!");


All caches cleaned up and connection closed.
SemanticCache demonstration complete!
