# Semantic Caching for LLMs

RedisVL provides a ``SemanticCache`` interface that combines Redis' built-in caching capabilities with vector search to store responses to previously answered questions. This reduces the number of requests and tokens sent to a Large Language Model (LLM) service, which lowers costs and improves application throughput by cutting the time taken to generate responses.

This notebook shows how to use Redis as a semantic cache for your applications.
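Before touching Redis, it may help to picture what a semantic cache does. The sketch below is a toy in-memory stand-in, not the RedisVL implementation: `ToySemanticCache` and its two-dimensional "embeddings" are made up for illustration, and a real cache would compare vectors produced by an embedding model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory stand-in, not the RedisVL implementation."""

    def __init__(self, distance_threshold):
        self.distance_threshold = distance_threshold
        self.entries = []  # list of (prompt_vector, response)

    def store(self, prompt_vector, response):
        self.entries.append((prompt_vector, response))

    def check(self, prompt_vector):
        """Return the closest cached response within the threshold, else None."""
        best = None
        for vector, response in self.entries:
            distance = cosine_distance(prompt_vector, vector)
            if distance <= self.distance_threshold and (best is None or distance < best[0]):
                best = (distance, response)
        return best[1] if best else None


cache = ToySemanticCache(distance_threshold=0.1)
cache.store([1.0, 0.0], "Paris")  # made-up 2-d "embedding"

near_hit = cache.check([0.99, 0.05])  # almost the same direction -> cache hit
far_miss = cache.check([0.0, 1.0])    # orthogonal direction -> cache miss
```

A prompt whose vector points in nearly the same direction as a stored one is served from the cache; anything further away than the threshold falls through to the LLM.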

First, we will import [OpenAI](https://platform.openai.com) to use their API for responding to user prompts. We will also create a simple `ask_openai` helper method to assist.

In [86]:
import os
import getpass
import time
import numpy as np

from openai import OpenAI


os.environ["TOKENIZERS_PARALLELISM"] = "False"

api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI(api_key=api_key)

def ask_openai(question: str) -> str:
    response = client.completions.create(
      model="gpt-3.5-turbo-instruct",
      prompt=question,
      max_tokens=200
    )
    return response.choices[0].text.strip()

In [87]:
# Test
print(ask_openai("What is the capital of France?"))

The capital of France is Paris.


## Initializing ``SemanticCache``

``SemanticCache`` will automatically create an index within Redis upon initialization for the semantic cache content.

In [88]:
from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",                     # underlying search index name
    redis_url="redis://localhost:6379",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)

In [152]:
key = llmcache.store(prompt="something about your grandma", response="she is a good person")
key

'llmcache:2e547e95c6eee5585c0bcf78c047feb2503979cad1787018c67df1e46b3a3584'

In [89]:
# look at the index specification created for the semantic cache lookup
!rvl index info -i llmcache



Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes     │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ llmcache     │ HASH           │ ['llmcache'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name          │ Attribute     │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├───────────────┼───────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────────────────┼────────────────┤
│ prompt        │ prom

## Basic Cache Usage

In [90]:
question = "What is the capital of France?"

In [91]:
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
    print(response)
else:
    print("Empty cache")

Empty cache


Our initial cache check should come back empty, since no entry similar to the question has been stored yet. Below, store the `question`,
the proper `response`, and any arbitrary `metadata` (as a Python dictionary) in the cache.

In [92]:
# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Paris",
    metadata={"city": "Paris", "country": "france"}
)

'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'

Now we will check the cache again with the same question and with a semantically similar question:

In [93]:
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
    print(response)
else:
    print("Empty cache")

[{'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}, 'key': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'}]


In [94]:
# Check for a semantically similar result
question = "What actually is the capital of France?"
llmcache.check(prompt=question)[0]['response']

'Paris'

In [95]:
print(ask_openai("What is the capital of Morocco?"))

The capital of Morocco is Rabat.


In [96]:
llmcache.store(
    prompt="What is the capital of Morocco?",
    response="Rabat",
    metadata={"city": "Rabat", "country": "Morocco"}
)

'llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf'

In [17]:
test_data_raw = [
    {"query": "What is the capital of Morocco?", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "What is the capital of France?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "Could you tell me the capital of Morocco?", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "Do you know what the capital of France is?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "I'd like to know the capital of Morocco", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "Can you share what France's capital is?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "Tell me Morocco's capital", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "Which city is the capital of France?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "The capital of Morocco is?", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "France - what's its capital?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "Name the capital of Morocco", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "What city serves as the capital of France?", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "I'm wondering about Morocco's capital", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"},
    {"query": "Please tell me France's capital city", "query_match": "llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545"},
    {"query": "Which place is Morocco's capital?", "query_match": "llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf"}
]

from pydantic import BaseModel

class TestData(BaseModel):
    query: str
    query_match: str | None

test_data = [TestData(**data) for data in test_data_raw]

In [20]:
llmcache.check(prompt="how do you make a hot dog?")

[]

In [21]:
def eval_accuracy(test_data) -> float:
    correct = 0
    for data in test_data:
        match = llmcache.check(data.query)
        if match and match[0]["key"] == data.query_match:
            correct += 1
    return correct / len(test_data)

In [None]:
from redisvl.query.query import BaseQuery

import numpy as np
from redisvl.redis.utils import buffer_to_array

def calc_cosine_distance(vector1: bytes | np.ndarray | list[float], vector2: bytes | np.ndarray | list[float]) -> float:
    if isinstance(vector1, bytes):
        vector1 = buffer_to_array(vector1, dtype="float32")
    if isinstance(vector2, bytes):
        vector2 = buffer_to_array(vector2, dtype="float32")

    return 1 - (np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2)))


def metrics_panel(cache, test_data):
    # Fetch every cached record together with its raw (undecoded) prompt vector
    query = BaseQuery().return_field("prompt_vector", decode_field=False)
    cached_records = cache.index.query(query)

    test_data_embeddings = cache._vectorizer.embed_many([data.query for data in test_data])

    # distances[i, j] = cosine distance between cached record i and test query j
    distances = np.empty(shape=(len(cached_records), len(test_data_embeddings)), dtype=np.float32, order='C')
    for i, record in enumerate(cached_records):
        for j, test_vector in enumerate(test_data_embeddings):
            distances[i, j] = calc_cosine_distance(record["prompt_vector"], test_vector)

    thresholds = np.linspace(0.01, 0.8, 60)
    metrics = {}

    for threshold in thresholds:
        TP, TN, FP, FN = 0, 0, 0, 0

        for j, test in enumerate(test_data):
            # Nearest cached record for test query j (column j of the matrix)
            index_of_nearest = np.argmin(distances[:, j])

            if distances[index_of_nearest, j] < threshold:
                # Query results expose the record key under "id"
                if test.query_match == cached_records[index_of_nearest]["id"]:
                    TP += 1
                else:
                    FP += 1
            else:
                if test.query_match:
                    FN += 1
                else:
                    TN += 1

        # Guard against division by zero when a threshold yields no hits
        precision = TP / (TP + FP) if TP + FP > 0 else 0
        recall = TP / (TP + FN) if TP + FN > 0 else 0
        F1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
        accuracy = (TP + TN) / len(test_data)

        metrics[threshold] = {
            "precision": precision,
            "recall": recall,
            "F1": F1,
            "accuracy": accuracy
        }

    return metrics
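The bookkeeping in the threshold sweep is easier to follow on a tiny example. The sketch below uses plain Python and made-up distances, assuming rows index cached records and columns index test queries; `confusion_counts` and `safe_ratio` are hypothetical helper names, not part of RedisVL.

```python
def confusion_counts(distances, expected, record_ids, threshold):
    """Count TP/FP/FN/TN for one threshold.

    distances[r][q] is the distance from cached record r to test query q.
    expected[q] is the record id query q should hit, or None for "no hit".
    """
    tp = fp = fn = tn = 0
    for q, wanted in enumerate(expected):
        column = [row[q] for row in distances]
        nearest = min(range(len(column)), key=column.__getitem__)
        if column[nearest] < threshold:
            if record_ids[nearest] == wanted:
                tp += 1
            else:
                fp += 1
        elif wanted is not None:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Made-up data: 2 cached records (rows) x 3 test queries (columns).
record_ids = ["rec:a", "rec:b"]
distances = [
    [0.05, 0.60, 0.70],
    [0.55, 0.20, 0.65],
]
expected = ["rec:a", "rec:b", None]  # the third query should miss

tp, fp, fn, tn = confusion_counts(distances, expected, record_ids, threshold=0.1)
precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)
```

With a threshold of 0.1 the second query is a false negative (its nearest record sits at distance 0.20), so recall drops while precision stays perfect. Widening the threshold to 0.3 recovers it.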

In [23]:
thresholds = np.arange(0.01, 0.8, 0.025)

print(len(thresholds))

scores = []

for threshold in thresholds:
    llmcache.set_threshold(threshold)
    acc = eval_accuracy(test_data)
    scores.append(acc)

best_threshold = thresholds[scores.index(max(scores))]

best_threshold, max(scores)


32


(0.16000000000000003, 1.0)

In [112]:
from redisvl.query.query import BaseQuery
query = BaseQuery().return_field("prompt_vector", decode_field=False)
res = llmcache.index.query(query)
res

[{'id': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545',
  'prompt_vector': b'\xccH\xb1<2\xe0\xb0<\xa0\x81\x0c;\xfe\xb5a=\xa0\x9a1\xbd!\xc4\xf8\xbbm/(\xbdV\x93\n\xbc\x18.\x88<\x98{\n\xbc\x0er\x10=P\x03G\xbd\xe4\xa4\x9e<\xb9\xad\x05\xbdu\xae\x85=#w7\xbb\xdak\xf3;C\xf5&\xbd\x15+y\xbc\xcf\xa3c\xbb4\xb8\xb4<\xbb\x98\x08=\xa2p\x10<qN\xae<D\x94\xdc<\xb3\xdc\x9c=\xc8\xe8Z\xbd\xa7}$<h!\xf6\xbb\xe1\xc4\xb6=\xb0\xd8\xf4<\x96\xaa\x84\xbc\x1d\xde\x1e<\xd8\xff\x8a\xbc\x14D\xc35\x9d\xb4\x87<\xcd \xe7\xbc\x8c\xf3\x8e<\xf8\xcfa<Y\x97\x0f\xbdA))\xbd\xbe\xd2\x9f:^\x88\xc5\xbav|\x00=/c\t=l]\xa3\xbc\x8e\x8a\x12=3(\xc6<\x8f\xebX\xbd\x16=!;@\xf7n<\x18\xf6\x0e\xbc\x80\xfa\x06\xbdo \xe1\xbc\xcc=S\xbd!\x86\xca<\xc2\xb3s\xbcV\'z=9\xa0\xf0\xbc\x92\xef\x88;\x15\xa9\r=|\xc2\xee\xbcC\r\xec\xbc\x95\x9d\x93\xbc\xe1m!\xbd\xdeX\x02\xbd\xa4\rA=mNU=\x9fl!<\x08\xda\x12\xbara\x93\xbb\x92\x0e\x9b\xbc%d\x15<p\tr=s7\xbd\xbbq\x11\xed\xba2hn\xbci+\xc9\xbc71G=H\x16\x8c<\xa2A\x8e=}\xeaF\xbdT\x81d=\xa5\

In [106]:
test_vector = llmcache._vectorizer.embed("how do you make a hot dog?", as_buffer=True, dtype="float32")
test_vector

b'\xe41\xd3;\x9a\x1ag\xbd\x90\x1b\xa1;P\x9d_\xbcA\x94\x9d\xbc9-R=\x98e"\xbd\xa3\x1a\x90<;\xdbH\xbc6\x04d\xbc\xd4~\xb5\xb9\xc2\x88\x87<R\x83r\xbc\xc3\xb5+;\xd1\x80\xad\xb9I#\x87<"\xae\xac\xbcN\xee\xa5<\x91\xf4m<\x1e.\xc2\xbck\nR\xbd\x87fB:\x8d\xaa\x8e<\xe5\x86\x16=#U1\xbdq\xd2!<J\xa3\x95\xbd(I\x97<\xac\x8eo\xbciO\x95\xbdG\x84\x80\xbc\xec&T<\\\x13\x92\xbc=v\xd7;_\x1c\xa25(E/<\x81\t\x13<16\x9c<^\x1cm=\xe69\xb1=V\x85\xb3\xbc\x94\x02V\xbd\xc5\xb6\x01\xbdy\xc4\x11:\xde\xe2\n\xbd\x92\x12\x81\xba?\xcaD=\x00\xfe\x9d=6\xea\xba9:\xc7\xfd\xba\xad\xcc,\xbc\x86\xe1\x06\xbesY\xa1\xbc@@\r:\xc2\xdc\xaa=\xf9\xf3\x14\xbd\xc2\xc9\x05\xbc\xdc\xf9\xfb;\xacx\x9e\xbd\xa7h\x8a\xbc+\x88\xc6<K\x14\x93\xbcTQ\xbe<\x10\xe4\\\xbb\xd0\xf8\xbe\xbd_\xc8\xa5<h\xd5\n<$\x9a\x01<\xbf\xb5\xe6<\x88H\xc0<l.\xbf\xbd\xdf=\n\xbc\xfbg\xec\xbcFB0=.\x8a\xfe:\xa0\xf2\xf5\xbcWf\x99\xbcz\xe4\xc2<\xaf\xdf\xea<KC\xda<\xd8\xac\xf5;\xd4[+\xbd1\x7f\xc5<&\xc7\xfb\xbb\t<\x9e=ki\xd8=\xbd\xd2\x04:\x04\x8d\xb0<\x14\xefG\xbc\xdc}R;\r\x14\x95:\x8

In [114]:
res[0]["prompt_vector"]

b'\xccH\xb1<2\xe0\xb0<\xa0\x81\x0c;\xfe\xb5a=\xa0\x9a1\xbd!\xc4\xf8\xbbm/(\xbdV\x93\n\xbc\x18.\x88<\x98{\n\xbc\x0er\x10=P\x03G\xbd\xe4\xa4\x9e<\xb9\xad\x05\xbdu\xae\x85=#w7\xbb\xdak\xf3;C\xf5&\xbd\x15+y\xbc\xcf\xa3c\xbb4\xb8\xb4<\xbb\x98\x08=\xa2p\x10<qN\xae<D\x94\xdc<\xb3\xdc\x9c=\xc8\xe8Z\xbd\xa7}$<h!\xf6\xbb\xe1\xc4\xb6=\xb0\xd8\xf4<\x96\xaa\x84\xbc\x1d\xde\x1e<\xd8\xff\x8a\xbc\x14D\xc35\x9d\xb4\x87<\xcd \xe7\xbc\x8c\xf3\x8e<\xf8\xcfa<Y\x97\x0f\xbdA))\xbd\xbe\xd2\x9f:^\x88\xc5\xbav|\x00=/c\t=l]\xa3\xbc\x8e\x8a\x12=3(\xc6<\x8f\xebX\xbd\x16=!;@\xf7n<\x18\xf6\x0e\xbc\x80\xfa\x06\xbdo \xe1\xbc\xcc=S\xbd!\x86\xca<\xc2\xb3s\xbcV\'z=9\xa0\xf0\xbc\x92\xef\x88;\x15\xa9\r=|\xc2\xee\xbcC\r\xec\xbc\x95\x9d\x93\xbc\xe1m!\xbd\xdeX\x02\xbd\xa4\rA=mNU=\x9fl!<\x08\xda\x12\xbara\x93\xbb\x92\x0e\x9b\xbc%d\x15<p\tr=s7\xbd\xbbq\x11\xed\xba2hn\xbci+\xc9\xbc71G=H\x16\x8c<\xa2A\x8e=}\xeaF\xbdT\x81d=\xa5\x0c[<\x1f\xce\xa0\xbcQZ\xaa<z\x8b\x97:\xa0`"=\x80\xe0\xbb\xbd\xb5\xe4\x8b<`]\xa2\xbd9-W<\xcc\x85\xce<\x1

In [145]:
query = BaseQuery().return_field("prompt_vector", decode_field=False)
cached_records = llmcache.index.query(query) # get all cached records

test_data_embeddings = llmcache._vectorizer.embed_many([data.query for data in test_data])

distances = np.empty(shape=(len(cached_records), len(test_data_embeddings)), dtype=np.float32, order='C')
for i, record in enumerate(cached_records):
    for j, test_vector in enumerate(test_data_embeddings):
        distances[i, j] = calc_cosine_distance(record["prompt_vector"], test_vector)



In [146]:
min_index = np.unravel_index(np.argmin(distances), distances.shape)
min_index

(1, 0)

In [147]:
cached_records[1]

{'id': 'llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf',
 'prompt_vector': b'p_\xee<N@\x8e\xbdtV\xd0\xbc5H\xd2<\xd4]6:\xea\xcc\xba<"\xe1:\xbd\x12\x95o\xbb\xb8\x05\n=\x05T\xd69>D\xcc<y\r<\xbd\x8f.\x9e=\xf2\xcf\x8e\xbd\xb8g9=$\x1b\x97\xbci\xdb\x1f;:\x92\x81\xbdf\xf2\x9a\xbcSZ\xa7<\xea\x8b\x04\xbc;p\xe3<\x87\xec\xb8;\x87\x85\x97;\n\xbd\xf9<Tl\r=\xa14\x9c\xbd\t[\x08\xbd\x13\x94\x04<Tj\xd1=-LU=\xc6\xb7\xb0\xbcF\xa1N<\xb2\x06\x01\xbd\xfas\xba5\xfe\xd4\xbc\xbc\x15\xee\xb4\xba /\x8b<(\xbbp=\x8c\x91\xd8\xbc\x9f\xb4\xc7;H\xa4\x8a;u+\xc1\xbc\xc9"\t=\xe4\x8b\xe8<9\x95\x03\xbd\xd1\x842=\xab\x8c*\xbd\xd5\x15\x92\xbb\xe1\x9a\xf5\xbb\x90\xe0\xc4<\x04\x99\r\xbd\xdc\xef\xe0\xbc\r\xefI:\xc5EL\xbc\xac\xaf\xbc<c\xa9\x12\xbdp\x8a\xa5=\xd1/4\xbd\xc2\xb7\x99<\xa5\x8c4=\xd6\xe1\x93\xbc\xc8\xac$\xbd\xd6<\x06\xbd\n\xcb\xf6\xbc<F\xd2\xbb\xba\xc2\x08=\x1f\xc8\x0b\xbc\xac\xbd\xa5<a\xcd\xaa;\xac\x85E\xbd\xb6\x9a~\xbcr\xe9\x9e<\xec|\x97=\x8e4\x0b;4\xc8\x1e=b\xd3\xf4;\xcb{ =\xd9\xbb\x1a<Xv\x

In [148]:
distances[:, 1]

array([1.6258106e-12, 3.2236418e-01], dtype=float32)
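A distance of roughly zero in that column means the test query is identical to a cached prompt. The raw `prompt_vector` values are byte strings of packed floats. As a rough standard-library sketch of what decoding and comparing them involves, assuming little-endian float32 components (the helper names here are made up, and the notebook itself relies on `buffer_to_array`):

```python
import math
import struct


def floats_from_buffer(buffer):
    """Decode a little-endian float32 byte string into a list of Python floats."""
    count = len(buffer) // 4  # 4 bytes per float32 component
    return list(struct.unpack(f"<{count}f", buffer))


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-d vector packed the same way as the stored buffers.
packed = struct.pack("<3f", 0.5, -0.25, 1.0)
vector = floats_from_buffer(packed)

same = cosine_distance(vector, vector)                     # ~0, up to rounding
opposite = cosine_distance(vector, [-x for x in vector])   # ~2
```

Cosine distance therefore ranges from about 0 (same direction) to about 2 (opposite direction), which is the scale the thresholds in this notebook are measured on.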

In [149]:
thresholds = np.linspace(0.01, 0.8, 60)
metrics = {}

for threshold in thresholds:
    TP, TN, FP, FN = 0, 0, 0, 0

    for i, test in enumerate(test_data):
        index_of_nearest = np.argmin(distances[:, i])
        print("nearest index: ",index_of_nearest)
        if distances[:, i][index_of_nearest] < threshold:
            print(test.query_match == cached_records[index_of_nearest]["id"])
            print(test.query_match, cached_records[index_of_nearest]["id"], "\n\n")
            if test.query_match == cached_records[index_of_nearest]["id"]:
                TP += 1
            else:
                FP += 1
        else:
            if test.query_match:
                FN += 1
            else:
                TN += 1

    # print(TP, TN, FP, FN)

    precision = TP / (TP + FP) if TP + FP > 0 else 0
    recall = TP / (TP + FN) if TP + FN > 0 else 0
    F1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    accuracy = (TP + TN) / len(test_data)

    metrics[threshold] = {
        "precision": precision,
        "recall": recall,
        "F1": F1,
        "accuracy": accuracy
    }

nearest index:  1
True
llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf 


nearest index:  0
True
llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545 llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545 


nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  1
True
llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf llmcache:68059df1094306e0352726ac962848f78c29cf4ac099619a18a631a218a1c7bf 


nearest index:  0
True
llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545 llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545 


nearest index:  1
nearest index:  0
nearest index:  1
nearest index:  

In [151]:
def get_best_threshold(metrics: dict) -> float:
    """
    Returns the threshold with the highest F1 score.
    If multiple thresholds have the same F1 score, returns the lowest threshold.
    """
    return max(metrics.items(), key=lambda x: (x[1]['F1'], -x[0]))[0]


get_best_threshold(metrics)

0.14389830508474577
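The tie-breaking rule in `get_best_threshold` comes from the tuple used as the sort key: F1 is compared first, and the negated threshold second, so among equal F1 scores the smallest threshold wins. A small sketch with made-up scores (the `pick_best_threshold` name is only for illustration):

```python
def pick_best_threshold(metrics):
    """Highest F1 wins; ties go to the lowest threshold."""
    return max(metrics.items(), key=lambda item: (item[1]["F1"], -item[0]))[0]


# Made-up scores: 0.2 and 0.3 tie on F1, so the lower threshold is chosen.
toy_metrics = {
    0.1: {"F1": 0.8},
    0.2: {"F1": 1.0},
    0.3: {"F1": 1.0},
}
best = pick_best_threshold(toy_metrics)
```

Preferring the lowest threshold among ties is the conservative choice: it keeps the cache as strict as possible without giving up any measured F1.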

## Customize the Distance Threshold

For most use cases, the right semantic similarity threshold is not a fixed quantity. Depending on the choice of embedding model,
the properties of the input query, and even the business use case, the threshold might need to change.

Fortunately, you can adjust the threshold at any point, as shown below:

In [10]:
# Widen the semantic distance threshold
llmcache.set_threshold(0.3)

In [11]:
# Try to trick the cache by asking around the point;
# this question still slips just under our new threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
llmcache.check(prompt=question)[0]['response']

'Paris'

In [12]:
# Invalidate the cache completely by clearing it out
llmcache.clear()

# should be empty now
llmcache.check(prompt=question)

[]

## Utilize TTL

Redis can optionally apply a TTL (time-to-live) policy that expires individual keys at a set point in the future.
This lets you focus on your data flow and business logic without writing cleanup tasks yourself.

A TTL policy set on the `SemanticCache` lets you hold onto cache entries only temporarily. Below, we will set the TTL policy to 5 seconds.

In [13]:
llmcache.set_ttl(5) # 5 seconds

In [14]:
llmcache.store("This is a TTL test", "This is a TTL test response")

time.sleep(6)

In [15]:
# confirm that the cache entry has expired on its own by now
result = llmcache.check("This is a TTL test")

print(result)

[]


In [16]:
# Reset the TTL to null (long lived data)
llmcache.set_ttl()
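The expiry behaviour seen above can be pictured with a toy in-memory store. This is only a sketch of the idea: Redis handles expiry on the server, and `ToyTTLStore` with its injectable `clock` is a hypothetical helper, used here so the example runs without sleeping.

```python
import time


class ToyTTLStore:
    """Hypothetical in-memory sketch of per-key expiry; Redis does this server-side."""

    def __init__(self, ttl=None, clock=time.monotonic):
        self.ttl = ttl      # seconds, or None for "never expire"
        self.clock = clock  # injectable so the example needs no sleeping
        self.data = {}      # key -> (value, expires_at or None)

    def set_ttl(self, ttl=None):
        # In this toy, the new TTL applies to entries stored from now on.
        self.ttl = ttl

    def store(self, key, value):
        expires_at = None if self.ttl is None else self.clock() + self.ttl
        self.data[key] = (value, expires_at)

    def check(self, key):
        value, expires_at = self.data.get(key, (None, None))
        if expires_at is not None and self.clock() >= expires_at:
            del self.data[key]  # lazily drop the expired entry
            return None
        return value


now = [0.0]  # fake clock value, advanced by hand
store = ToyTTLStore(ttl=5, clock=lambda: now[0])
store.store("This is a TTL test", "This is a TTL test response")

before = store.check("This is a TTL test")  # still within 5 seconds
now[0] = 6.0                                 # pretend 6 seconds have passed
after = store.check("This is a TTL test")   # expired, so nothing is returned
```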

## Simple Performance Testing

Next, we will measure the speedup obtained by using ``SemanticCache``. We will use the ``time`` module to measure the time taken to generate responses with and without ``SemanticCache``.

In [17]:
def answer_question(question: str) -> str:
    """Answer a simple question, checking the semantic cache first and
    falling back to OpenAI only on a cache miss.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer
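Note that `answer_question` reads from the cache but never writes to it; the notebook stores the answer manually in a later cell. A common variation is to store the fresh answer on a miss so that the next similar question is served from the cache. The sketch below shows that pattern under stated assumptions: `ExactMatchCache` and `fake_llm` are hypothetical stand-ins that mimic the `check`/`store` call shape used in this notebook, with exact-match lookups instead of vector search.

```python
def answer_with_cache(question, cache, ask_llm):
    """Return a cached response when available, otherwise ask the LLM and cache it."""
    results = cache.check(prompt=question)
    if results:
        return results[0]["response"]
    answer = ask_llm(question)
    cache.store(prompt=question, response=answer)
    return answer


class ExactMatchCache:
    """Hypothetical stand-in with the same check/store shape, exact-match only."""

    def __init__(self):
        self.entries = {}

    def check(self, prompt):
        return [{"response": self.entries[prompt]}] if prompt in self.entries else []

    def store(self, prompt, response):
        self.entries[prompt] = response


calls = []


def fake_llm(question):
    calls.append(question)  # record each simulated LLM call
    return "George Washington"


cache = ExactMatchCache()
first = answer_with_cache("Who was the first US President?", cache, fake_llm)
second = answer_with_cache("Who was the first US President?", cache, fake_llm)
```

The second call never reaches `fake_llm`, which is the saving the timing cells in this section measure.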

In [18]:
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()

print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")

# add the entry to our LLM cache
llmcache.store(prompt=question, response="George Washington")

Without caching, a call to openAI to answer this simple question took 0.9034533500671387 seconds.


'llmcache:67e0f6e28fe2a61c0022fd42bf734bb8ffe49d3e375fd69d692574295a20fc1a'

In [19]:
# Calculate the avg latency for caching over LLM usage
times = []

for _ in range(10):
    cached_start = time.time()
    cached_answer = answer_question(question)
    cached_end = time.time()
    times.append(cached_end-cached_start)

avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")

Avg time taken with LLM cache enabled: 0.09753389358520508
Percentage of time saved: 89.2%
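The percentage reported above is just the share of the uncached latency that caching removed. A small sketch of that arithmetic, together with a timing helper based on `time.perf_counter` (a monotonic timer that is usually better suited to short measurements than `time.time`); both function names are made up for illustration:

```python
import time


def average_seconds(func, repeats=10):
    """Average wall-clock seconds per call, measured with a monotonic timer."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


def percent_saved(baseline_seconds, cached_seconds):
    """Share of the baseline latency removed by caching, as a percentage."""
    return round((baseline_seconds - cached_seconds) / baseline_seconds * 100, 2)


# Figures copied from the run above.
saved = percent_saved(0.9034533500671387, 0.09753389358520508)
```

A single uncached call is a noisy baseline, so averaging several uncached calls as well would give a fairer comparison.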


In [20]:
# check the stats of the index
!rvl stats -i llmcache


Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key                    │ Value       │
├─────────────────────────────┼─────────────┤
│ num_docs                    │ 1           │
│ num_terms                   │ 19          │
│ max_doc_id                  │ 6           │
│ num_records                 │ 53          │
│ percent_indexed             │ 1           │
│ hash_indexing_failures      │ 0           │
│ number_of_uses              │ 45          │
│ bytes_per_record_avg        │ 45.0566     │
│ doc_table_size_mb           │ 0.000134468 │
│ inverted_sz_mb              │ 0.00227737  │
│ key_table_size_mb           │ 2.76566e-05 │
│ offset_bits_per_record_avg  │ 8           │
│ offset_vectors_sz_mb        │ 3.91006e-05 │
│ offsets_per_term_avg        │ 0.773585    │
│ records_per_doc_avg         │ 53          │
│ sortable_values_size_mb     │ 0           │
│ total_indexing_time         │ 19.454      │
│ total_inverted_index_blocks │ 21          │
│ vector_index_sz_mb 

In [21]:
# Clear the cache AND delete the underlying index
llmcache.delete()

## Cache Access Controls, Tags & Filters
When running complex workflows with similar applications, or handling multiple users, it's important to keep data segregated. Building on RedisVL's support for complex and hybrid queries, we can tag and filter cache entries using custom-defined `filterable_fields`.

Let's store multiple users' data in our cache with similar prompts and ensure we return only the correct user information:

In [22]:
private_cache = SemanticCache(
    name="private_cache",
    filterable_fields=[{"name": "user_id", "type": "tag"}]
)

private_cache.store(
    prompt="What is the phone number linked to my account?",
    response="The number on file is 123-555-0000",
    filters={"user_id": "abc"},
)

private_cache.store(
    prompt="What's the phone number linked in my account?",
    response="The number on file is 123-555-1111",
    filters={"user_id": "def"},
)

'private_cache:5de9d651f802d9cc3f62b034ced3466bf886a542ce43fe1c2b4181726665bf9c'

In [23]:
from redisvl.query.filter import Tag

# define user id filter
user_id_filter = Tag("user_id") == "abc"

response = private_cache.check(
    prompt="What is the phone number linked to my account?",
    filter_expression=user_id_filter,
    num_results=2
)

print(f"found {len(response)} entry \n{response[0]['response']}")

found 1 entry 
The number on file is 123-555-0000


In [24]:
# Cleanup
private_cache.delete()

Multiple `filterable_fields` can be defined on a cache, and complex filter expressions can be constructed to filter on these fields, as well as the default fields already present.

In [25]:

complex_cache = SemanticCache(
    name='account_data',
    filterable_fields=[
        {"name": "user_id", "type": "tag"},
        {"name": "account_type", "type": "tag"},
        {"name": "account_balance", "type": "numeric"},
        {"name": "transaction_amount", "type": "numeric"}
    ]
)
complex_cache.store(
    prompt="what is my most recent checking account transaction under $100?",
    response="Your most recent transaction was for $75",
    filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 75},
)
complex_cache.store(
    prompt="what is my most recent savings account transaction?",
    response="Your most recent deposit was for $300",
    filters={"user_id": "abc", "account_type": "savings", "transaction_amount": 300},
)
complex_cache.store(
    prompt="what is my most recent checking account transaction over $200?",
    response="Your most recent transaction was for $350",
    filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 350},
)
complex_cache.store(
    prompt="what is my checking account balance?",
    response="Your current checking account is $1850",
    filters={"user_id": "abc", "account_type": "checking"},
)

'account_data:d48ebb3a2efbdbc17930a8c7559c548a58b562b2572ef0be28f0bb4ece2382e1'

In [26]:
from redisvl.query.filter import Num

value_filter = Num("transaction_amount") > 100
account_filter = Tag("account_type") == "checking"
complex_filter = value_filter & account_filter

# check for checking account transactions over $100
complex_cache.set_threshold(0.3)
response = complex_cache.check(
    prompt="what is my most recent checking account transaction?",
    filter_expression=complex_filter,
    num_results=5
)
print(f'found {len(response)} entry')
print(response[0]["response"])

found 1 entry
Your most recent transaction was for $350
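Only one entry comes back because the filter decides which entries are eligible at all, regardless of how close their prompts are. The plain-Python sketch below mirrors the four stored entries and reads the combined filter as an ordinary predicate. It illustrates the logic only: it is not how Redis evaluates the query, and it assumes that an entry without a `transaction_amount` does not satisfy a numeric condition on that field.

```python
# Made-up records mirroring the fields stored above.
records = [
    {"account_type": "checking", "transaction_amount": 75,
     "response": "Your most recent transaction was for $75"},
    {"account_type": "savings", "transaction_amount": 300,
     "response": "Your most recent deposit was for $300"},
    {"account_type": "checking", "transaction_amount": 350,
     "response": "Your most recent transaction was for $350"},
    {"account_type": "checking",
     "response": "Your current checking account is $1850"},
]


def matches(record):
    """Plain-Python reading of: transaction_amount > 100 AND account_type == checking."""
    amount = record.get("transaction_amount")
    return amount is not None and amount > 100 and record["account_type"] == "checking"


eligible = [r["response"] for r in records if matches(r)]
```

The savings deposit passes the numeric condition but fails the tag condition, and the $75 transaction fails the numeric one, leaving only the $350 transaction.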


In [27]:
# Cleanup
complex_cache.delete()