# Semantic Caching

RedisVL provides the ``LLMCache`` interface to turn Redis, with its vector search capability, into a semantic cache that stores query results, reducing the number of requests and tokens sent to a Large Language Model (LLM) service. This lowers cost and improves performance by cutting the time taken to generate responses.

This notebook shows how to use ``LLMCache`` in your applications.
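
Before touching Redis, it may help to see the idea in miniature. The sketch below is a toy, in-memory illustration of semantic caching and not the RedisVL implementation: ``ToySemanticCache``, ``embed``, and ``cosine_similarity`` are hypothetical names, and the bag-of-words "embedding" is only a stand-in for a real embedding model. A lookup counts as a hit when the cosine similarity between the new prompt and a stored prompt reaches the threshold.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory cache used only to illustrate the concept."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def check(self, prompt: str) -> list:
        """Return every stored response whose prompt is similar enough."""
        query = embed(prompt)
        return [
            response
            for vector, response in self.entries
            if cosine_similarity(query, vector) >= self.threshold
        ]


toy_cache = ToySemanticCache(threshold=0.95)
toy_cache.store("What is the capital of France?", "Paris")
print(toy_cache.check("What is the capital of France?"))         # identical prompt
print(toy_cache.check("What really is the capital of France?"))  # paraphrase
```

The similarity scores of this toy embedding will differ from those of a real model, so the threshold values here are not comparable to the ones used with Redis later in the notebook.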

First, we import OpenAI to use its API for responding to prompts.

In [24]:
import os
import openai
openai.api_key = "sk-<YOUR KEY HERE>"

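def ask_openai(question):
    """Send the question to OpenAI and return the completion text."""
    # Note: openai.Completion is the legacy (pre-1.0) interface of the openai package.
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=question,
        max_tokens=200
    )
    # Keep only the text of the first completion, without surrounding whitespace.
    return response.choices[0].text.strip()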

In [2]:
# test it
print(ask_openai("What is the capital of France?"))

The capital of France is Paris.


## Initializing and using ``LLMCache``

``LLMCache`` automatically creates an index in Redis for the semantic cache when it is initialized. Index creation and manipulation are handled by the same ``SearchIndex`` class used in the previous tutorials. In the cell below, the cache is instantiated through the ``SemanticCache`` class.

In [3]:
from redisvl.llmcache.semantic import SemanticCache

cache = SemanticCache(
    redis_url="redis://localhost:6379",
    threshold=0.9,  # semantic similarity threshold
)

In [4]:
# check the cache
cache.check("What is the capital of France?")

[]

In [5]:
# store the question and answer
cache.store("What is the capital of France?", "Paris")

In [6]:
# check the cache again
cache.check("What is the capital of France?")

['Paris']

In [7]:
# check for a semantically similar result
cache.check("What really is the capital of France?")

[]

In [8]:
# decrease the semantic similarity threshold
cache.set_threshold(0.7)
cache.check("What really is the capital of France?")

['Paris']

In [10]:
# adversarial example (not semantically similar enough)
cache.check("What is the capital of Spain?")

[]

## Performance

Next, we will measure the speedup obtained by using ``LLMCache``. We will use the ``time`` module to measure the time taken to generate responses with and without ``LLMCache``.

In [11]:
def answer_question(question: str):
    results = cache.check(question)
    if results:
        return results[0]
    else:
        answer = ask_openai(question)
        cache.store(question, answer)
        return answer
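
The function above follows the cache-aside pattern: look in the cache first, and only call the model on a miss. As a rough sketch of the same pattern in isolation, the helper below takes the lookup, store, and generation steps as arguments so it can run without Redis or OpenAI. ``make_cached_answerer`` and the ``fake_*`` stand-ins are hypothetical names introduced for illustration; the stand-in cache matches prompts exactly rather than semantically.

```python
from typing import Callable, Dict, List


def make_cached_answerer(
    check: Callable[[str], List[str]],
    store: Callable[[str, str], None],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Return a function that consults the cache before calling the generator."""

    def answer(question: str) -> str:
        hits = check(question)
        if hits:
            return hits[0]  # cache hit: skip the expensive call
        result = generate(question)
        store(question, result)  # remember the answer for next time
        return result

    return answer


# Stand-ins so the sketch runs on its own (exact-match dict, fake model).
_exact_cache: Dict[str, str] = {}
llm_calls: List[str] = []


def fake_check(question: str) -> List[str]:
    return [_exact_cache[question]] if question in _exact_cache else []


def fake_store(question: str, response: str) -> None:
    _exact_cache[question] = response


def fake_llm(question: str) -> str:
    llm_calls.append(question)  # record each call that reaches the "model"
    return f"answer to: {question}"


cached_answer_fn = make_cached_answerer(fake_check, fake_store, fake_llm)
cached_answer_fn("What is the capital of France?")
cached_answer_fn("What is the capital of France?")
print(f"Model was called {len(llm_calls)} time(s)")
```

With the objects defined in this notebook, the equivalent wiring would be ``make_cached_answerer(cache.check, cache.store, ask_openai)``.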

In [22]:
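import time

# Time the first call. Assuming the question is not already cached,
# this is a cache miss that falls through to OpenAI.
start = time.time()
answer = answer_question("What is the capital of France?")
end = time.time()
print(f"Time taken without cache {end - start}")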

Time taken without cache 0.7418899536132812


In [23]:
cached_start = time.time()
cached_answer = answer_question("What is the capital of France?")
cached_end = time.time()
print(f"Time Taken with cache: {cached_end - cached_start}")
print(f"Percentage of time saved: {round(((end - start) - (cached_end - cached_start)) / (end - start) * 100, 2)}%")

Time Taken with cache: 0.07415914535522461
Percentage of time saved: 90.0%
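
A single timed call can be noisy, since network latency to the LLM service varies from request to request. One possible refinement, sketched below, is to repeat each measurement and compare medians using ``time.perf_counter``, which is intended for measuring short durations. The helper names ``median_runtime`` and ``percent_saved`` are hypothetical, and the sleeping lambdas merely simulate a slow and a fast call.

```python
import time
from statistics import median
from typing import Callable


def median_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


def percent_saved(baseline: float, cached: float) -> float:
    """Percentage of the baseline time saved by the cached path."""
    return round((baseline - cached) / baseline * 100, 2)


# Simulated workloads standing in for an uncached and a cached call.
slow = median_runtime(lambda: time.sleep(0.05))
fast = median_runtime(lambda: time.sleep(0.005))
print(f"Percentage of time saved: {percent_saved(slow, fast)}%")
```

When applying this to the notebook, the uncached baseline would need to time ``ask_openai`` directly, because repeated calls to ``answer_question`` populate the cache after the first miss.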


In [25]:
# remove the index and all cached items
cache.index.delete()