# Building a Semantic Cache with Redis and VertexAI Gemini Model

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/semantic-cache/semantic_caching_gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Intro
Google's Vertex AI has expanded its capabilities by introducing [Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview). This advanced technology comes with a specialized [in-console studio experience](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/quickstart), a [dedicated API](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart) and [Python SDK](https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk) designed for deploying and managing instances of Google's powerful Gemini language models.

Redis offers vector database features alongside its general-purpose data structures, including lists, hashes, JSON, and sets. That combination makes it a good fit for building Large Language Model (LLM)-based applications, and its simple architecture and low latency make it practical for production environments.

**Below we will design a semantic caching layer using Gemini (LLM) and Redis.**

## 1. Setup
Before we begin, we must install some required libraries, authenticate with Google, create a Redis database, and initialize other required components.


### Install required libraries

In [None]:
# NBVAL_SKIP
!pip install "redisvl>=0.3.0" "unstructured[pdf]"
!pip install llama-parse llama-index-readers-file
!pip install langchain langchain-google-vertexai

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [9]:
import textwrap

from IPython.display import display
from IPython.display import Markdown

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

### Connect to resources

In [None]:
import os
import nest_asyncio
from getpass import getpass


# input your Llama Cloud API Key
if "LLAMA_CLOUD_API_KEY" not in os.environ:
    os.environ["LLAMA_CLOUD_API_KEY"] = getpass("LLAMA_CLOUD_API_KEY:")

# input your GCP project ID and region for Vertex AI
if "GCP_PROJECT_ID" not in os.environ:
    PROJECT_ID = getpass("GCP_PROJECT_ID:") #'central-beach-194106'
    REGION = input("GCP_REGION:") #'us-central1'
else:
    PROJECT_ID = os.environ["GCP_PROJECT_ID"]
    REGION = os.environ.get("GCP_REGION", "us-central1")  # fall back to a default region if unset

# need this for running llama-index code in Jupyter Notebooks
nest_asyncio.apply()

In [11]:
from google.colab import auth

auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

In [12]:
from langchain_google_vertexai import ChatVertexAI

# Create LLM instance
llm = ChatVertexAI(
    model_name="gemini-pro",
    temperature=0.5,
    top_p=0.85
)

In [35]:
from redisvl.utils.vectorize import VertexAITextVectorizer

# Create vectorizer instance
vectorizer = VertexAITextVectorizer(
    model = "textembedding-gecko@003",
    api_config = {"project_id": PROJECT_ID, "location": REGION}
)

### Fetch sample dataset

For this semantic caching demonstration, we will focus on a Chevy Colorado truck brochure.


In [36]:
!mkdir -p 'data/'
!wget 'https://raw.githubusercontent.com/redis-developer/LLM-Document-Chat/main/docs/2022-chevrolet-colorado-ebrochure.pdf' -O 'data/2022-chevrolet-colorado-ebrochure.pdf'

--2024-05-21 15:29:42--  https://raw.githubusercontent.com/redis-developer/LLM-Document-Chat/main/docs/2022-chevrolet-colorado-ebrochure.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3566101 (3.4M) [application/octet-stream]
Saving to: ‘data/2022-chevrolet-colorado-ebrochure.pdf’


2024-05-21 15:29:42 (62.3 MB/s) - ‘data/2022-chevrolet-colorado-ebrochure.pdf’ saved [3566101/3566101]



### Install Redis locally
If you have a Redis db running elsewhere with [Redis Stack](https://redis.io/docs/about/about-stack/) installed, you don't need to run it on this machine. You can skip to the "Connect to Redis server" step.

In [37]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


Replace the connection params below with your own if you are connecting to an external Redis instance.

In [23]:
import os
import redis

# Redis connection params
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") #"redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379")) #12110
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  #"pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

# Create Redis client
redis_client = redis.Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD
)

# Test connection
redis_client.ping()

True

In [24]:
# Clear Redis database (optional)
redis_client.flushdb()

True

## 2. Semantic Caching Strategies


There are several strategies for semantic caching that should be considered for your RAG application depending on the data source, who the users are, the kinds of questions you anticipate, and the validity of the data sources in question.

**Common caching strategies with LLMs:**

1.   Pre-generated Cache w/ Google Gemini LLM
2.   Online caching (live)

This notebook will explore building a pre-generated cache with Google Gemini.
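
To make the idea concrete, here is a minimal sketch of what a semantic cache lookup does, assuming cosine distance and an inclusive threshold. The function names and the 3-dimensional toy vectors are hypothetical and for illustration only; the real cache in this notebook delegates this work to Redis vector search.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def lookup(query_vec, entries, distance_threshold=0.2):
    """Return the response of the closest entry, or None if it is too far away."""
    best = min(
        entries,
        key=lambda e: cosine_distance(query_vec, e["vector"]),
        default=None,
    )
    if best is not None and cosine_distance(query_vec, best["vector"]) <= distance_threshold:
        return best["response"]
    return None

# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
toy_entries = [
    {"vector": [1.0, 0.0, 0.0], "response": "answer about towing"},
    {"vector": [0.0, 1.0, 0.0], "response": "answer about fuel economy"},
]

hit = lookup([0.9, 0.1, 0.0], toy_entries)   # close to the first entry
miss = lookup([0.5, 0.5, 0.7], toy_entries)  # not close to either entry
```

A smaller threshold returns fewer but more precise hits; a larger one increases the hit rate at the risk of serving an answer to a different question.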




### Pre-generated Cache

The objective of the pre-generated cache is to load Redis with **frequently asked questions**, derived from domain experts, LLMs, or session data (collected from your application). This is the safest caching strategy because the information can be manually vetted before it is placed in the cache.

This notebook will use:
- **LlamaIndex** to parse the PDF file with high fidelity.
- **Google Gemini LLM** to extract FAQs from the extracted source material.
- **Redis** to index the FAQs for semantic caching.

*Note that many of these components could be changed out for other foundation models and parsing tools that are approved by your organization for use.*

#### Parse source pdf document

Our Chevy Colorado brochure is the source of "truth" here.



In [39]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

In [40]:
# set up LlamaParse agent to convert the PDF into markdown
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader("./data", file_extractor=file_extractor)
documents = reader.load_data()

Started parsing the file under job_id 5afa9727-102d-4c25-965a-de91941db75c
.

In [43]:
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()

nodes = parser.get_nodes_from_documents(documents)

In [48]:
for node in nodes:
  print("############", "\n", node.text)

############ 
 COLORADO 2022
---
Choose your adventure. The 2022 Colorado delivers everything you could ask for in a midsize pickup. Engine choices that are powerful and efficient, including an available GM-exclusive Duramax® 2.8L Turbo-Diesel engine that provides up to 7,700 lbs. of towing muscle. A ZR2 off-road beast with the capability to conquer tough trails. And a comfortable interior filled with convenience and technology features. So go ahead. Choose your best life in Colorado.

Colorado Crew Cab ZR2 in Sand Dune Metallic with available ZR2 Dusk Special Edition. Vehicle shown can tow up to 5,000 lbs.

1 Requires Colorado Crew Cab Short Box LT 2WD with available Trailering Package, LT Convenience Package and Safety Package. 2 Maximum trailering ratings are intended for comparison purposes only. Before you buy a vehicle or use it for trailering, carefully review the Trailering section of the Owner’s Manual. The trailering capacity of your specific vehicle may vary. The weight of p

#### Extract FAQs with Gemini

First we will define a chain to properly prompt an LLM to extract FAQs as a JSON object per node.

In [110]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List


# Define your desired data structure.
class PromptResponse(BaseModel):
    prompt: str = Field(description="User input question about information in the document.")
    response: str = Field(description="Grounded answer from the LLM pertaining to the user's question.")

class FAQs(BaseModel):
    pairs: List[PromptResponse] = Field(description="List of prompt response pairs extracted from the document")


# Set up a parser + inject instructions into the prompt template.
json_parser = JsonOutputParser(pydantic_object=FAQs)

In [87]:
prompt = PromptTemplate(
    template="""You are a document intelligence tool used to extract FAQs
    from portions of a source PDF document. Put yourself in the shoes of a potential
    reader of the provided material and anticipate what questions they might have.
    Looking at the context below, your goal is to pull out as many likely question
    (prompt) and answer (response) pairs as possible, using only what's provided.

    Other Rules:
    - Create as many FAQs as possible, even rewording the same question and answer
    a few times while remaining faithful to the truth and provided content.
    - Ignore heavy marketing or salesly concepts and focus specifically on
    factual data.
    - It's ok if your response is an empty JSON object for a particular section.

    {format_instructions}

    Document Context:\n{doc}\n""",
    input_variables=["doc"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()},
)

faq_generator_chain = prompt | llm | json_parser

Let's test this out with a sample document from Wikipedia first.

In [88]:
sample_doc = """Obi-Wan Kenobi (Ewan McGregor) is a young apprentice Jedi knight
under the tutelage of Qui-Gon Jinn (Liam Neeson) ; Anakin Skywalker (Jake Lloyd),
who will later father Luke Skywalker and become known as Darth Vader, is just
a 9-year-old boy. When the Trade Federation cuts off all routes to the planet
Naboo, Qui-Gon and Obi-Wan are assigned to settle the matter."""

In [89]:
faqs = faq_generator_chain.invoke({"doc": sample_doc})

In [90]:
faqs

{'pairs': [{'prompt': 'Who is Obi-Wan Kenobi?',
   'response': 'Obi-Wan Kenobi is a young apprentice Jedi knight under the tutelage of Qui-Gon Jinn.'},
  {'prompt': 'Who is Qui-Gon Jinn?',
   'response': 'Qui-Gon Jinn is a Jedi knight who is the mentor of Obi-Wan Kenobi.'},
  {'prompt': 'Who is Anakin Skywalker?',
   'response': 'Anakin Skywalker is a 9-year-old boy who will later father Luke Skywalker and become known as Darth Vader.'},
  {'prompt': 'What is the Trade Federation?',
   'response': 'The Trade Federation is a group that cuts off all routes to the planet Naboo.'},
  {'prompt': 'What is the mission of Qui-Gon and Obi-Wan?',
   'response': 'Qui-Gon and Obi-Wan are assigned to settle the matter of the Trade Federation cutting off all routes to the planet Naboo.'}]}

Now we can apply this same logic to nodes from our PDF document.

In [91]:
def extract_faqs(nodes):
    all_faqs = []
    for i, node in enumerate(nodes):
        print(f"Processing node {i+1} of {len(nodes)}", flush=True)
        results = faq_generator_chain.invoke({"doc": node.text})
        if results and results.get("pairs"):
            all_faqs.extend(results["pairs"])
    return all_faqs
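
LLM output does not always parse as valid JSON, and a single failure would abort a loop like the one above. The following is a sketch of a more tolerant variant, not the notebook's method: `extract_faqs_safely` and `fake_generate` are hypothetical names, and `generate` stands for any callable that maps text to a `{"pairs": [...]}` dict.

```python
def extract_faqs_safely(texts, generate):
    """Collect FAQ pairs from each text, recording failures instead of raising."""
    all_pairs, failed = [], []
    for i, text in enumerate(texts):
        try:
            result = generate(text)
        except Exception as exc:  # e.g. the model returned malformed JSON
            failed.append((i, repr(exc)))
            continue
        # Keep only results that have a non-empty "pairs" list
        if isinstance(result, dict) and result.get("pairs"):
            all_pairs.extend(result["pairs"])
    return all_pairs, failed

# Stub generator standing in for the LLM chain, for illustration only
def fake_generate(text):
    if "bad" in text:
        raise ValueError("could not parse JSON")
    return {"pairs": [{"prompt": f"What is {text}?", "response": text}]}

pairs, failed = extract_faqs_safely(["a", "bad section", "b"], fake_generate)
```

In this notebook, `generate` could be `lambda text: faq_generator_chain.invoke({"doc": text})`, with `texts` built from `node.text`. The `failed` list tells you which nodes to retry or inspect.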

In [92]:
all_faqs = extract_faqs(nodes)

Processing node 1 of 39
Processing node 2 of 39
Processing node 3 of 39
Processing node 4 of 39
Processing node 5 of 39
Processing node 6 of 39
Processing node 7 of 39
Processing node 8 of 39
Processing node 9 of 39
Processing node 10 of 39
Processing node 11 of 39
Processing node 12 of 39
Processing node 13 of 39
Processing node 14 of 39
Processing node 15 of 39
Processing node 16 of 39
Processing node 17 of 39
Processing node 18 of 39
Processing node 19 of 39
Processing node 20 of 39
Processing node 21 of 39
Processing node 22 of 39
Processing node 23 of 39
Processing node 24 of 39
Processing node 25 of 39
Processing node 26 of 39
Processing node 27 of 39
Processing node 28 of 39
Processing node 29 of 39
Processing node 30 of 39
Processing node 31 of 39
Processing node 32 of 39
Processing node 33 of 39
Processing node 34 of 39
Processing node 35 of 39
Processing node 36 of 39
Processing node 37 of 39
Processing node 38 of 39
Processing node 39 of 39


In [93]:
print("Generated", len(all_faqs), "frequently asked questions.")

Generated 329 frequently asked questions.


In [94]:
all_faqs[:10]

[{'prompt': 'What are the engine choices for the 2022 Colorado?',
  'response': 'The 2022 Colorado offers a variety of engine choices, including a GM-exclusive Duramax® 2.8L Turbo-Diesel engine that provides up to 7,700 lbs. of towing muscle.'},
 {'prompt': 'What is the maximum towing capacity of the 2022 Colorado?',
  'response': 'The maximum towing capacity of the 2022 Colorado is 7,700 lbs. This is for the Colorado Crew Cab Short Box LT 2WD with available Trailering Package, LT Convenience Package and Safety Package.'},
 {'prompt': 'What are some of the off-road features of the 2022 Colorado?',
  'response': 'The 2022 Colorado ZR2 is an off-road beast with the capability to conquer tough trails.'},
 {'prompt': 'What are some of the interior features of the 2022 Colorado?',
  'response': 'The 2022 Colorado has a comfortable interior filled with convenience and technology features.'},
 {'prompt': 'What is the availability of certain features on the 2022 Colorado?',
  'response': 'Due 

#### Index FAQs into Redis

Now we will create embeddings of each prompt and load them into Redis for our semantic cache.

In [95]:
# Embed each FAQ prompt using the Vertex AI vectorizer
embeddings = vectorizer.embed_many([pair["prompt"] for pair in all_faqs])

# Check to make sure we've created enough embeddings, 1 per FAQ record
len(embeddings) == len(all_faqs)

True

In [96]:
from redisvl.extensions.llmcache import SemanticCache

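# Reuse the Redis client created above so the cache honors your connection params
cache = SemanticCache(vectorizer=vectorizer, distance_threshold=0.2, redis_client=redis_client)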

17:41:42 redisvl.index.index INFO   Index already exists, not overwriting.


In [97]:
cache.check("testing") # cache should be empty

[]

In [98]:
for i, entry in enumerate(all_faqs):
    cache.store(prompt=entry["prompt"], response=entry["response"], vector=embeddings[i])

#### Example queries

In [99]:
cache.check("What models of chevy colorado are available?")

[{'id': 'llmcache:2c9593e34f4f4b71508cb81d3e22e4b85dfda747add6f5fef3231e61d62f4015',
  'vector_distance': '0.100723683834',
  'prompt': 'How many models of the Colorado are there?',
  'response': 'There are four models of the Colorado: WT, LT, Z71, and ZR2.',
  'prompt_vector': '<raw embedding bytes, output truncated>'

In [100]:
cache.check("How many miles per galon does the chevy colorado get?")

[{'id': 'llmcache:eec7e7c4980868783056773594de1c89568fea5c7428db1b6d98e40044193bef',
  'vector_distance': '0.112104892731',
  'prompt': 'What is the fuel economy of the Chevrolet Colorado?',
  'response': 'The fuel economy of the Chevrolet Colorado varies depending on the engine and drivetrain configuration. The most fuel-efficient model is the 2.5L DOHC 4-cylinder engine with 2WD, which has an EPA-estimated fuel economy of 20 mpg in the city and 25 mpg on the highway.',
  'prompt_vector': '<raw embedding bytes, output truncated>'
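
In an application, a cache check like the ones above usually sits in front of the LLM call. Here is a minimal sketch of that pattern, assuming only that the cache exposes `check(prompt)` and returns a list of dicts with a `response` key, as shown in the output above. `answer`, `StubCache`, and `ask_llm` are hypothetical names used for illustration.

```python
def answer(question, cache, ask_llm):
    """Return (response, source): the cached response on a hit, else the LLM's answer."""
    hits = cache.check(question)
    if hits:
        return hits[0]["response"], "cache"
    return ask_llm(question), "llm"

class StubCache:
    """Stand-in for the real cache; matches prompts exactly instead of semantically."""
    def __init__(self, entries):
        self.entries = entries

    def check(self, prompt):
        response = self.entries.get(prompt)
        if response is None:
            return []
        return [{"prompt": prompt, "response": response}]

stub = StubCache({"How many models are there?": "There are four models."})
cached = answer("How many models are there?", stub, lambda q: "LLM answer")
fresh = answer("What colors are available?", stub, lambda q: "LLM answer")
```

With the objects defined in this notebook, the call might look like `answer(question, cache, lambda q: llm.invoke(q).content)`. Because this is a pre-generated cache, the sketch does not store LLM answers on a miss; an online caching strategy would add a store step there.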

In [None]:
cache.clear()

#### Full Document Analysis with Gemini

Because Gemini is a long-context model, we can provide a much longer prompt (at a higher cost). We will attempt to use it to extract FAQs from the entire document at once.

In [119]:
prompt = PromptTemplate(
    template="""
    You are a document intelligence tool used to extract FAQs
    from a source PDF document. Put yourself in the shoes of a potential
    reader of the provided material and anticipate what questions they might have.
    Look at the document below. Your goal is to pull out as many likely question
    (prompt) and answer (response) pairs as possible, using only what's provided to
    steer your generated answers. These will be used to populate a semantic cache.

    Other Rules:
    - Create as many useful FAQs as possible while remaining faithful to the truth and provided content.
    - Generate multiple FAQs on the same topic with different phrasing or wording as
    long as the answers are correct and extracted from the document.
    - Ignore marketing, introductory, or salesly concepts and focus specifically on
    factual data that can be fact-checked later on.

    Format your response as a large JSON object. See formatting instructions below.

    {format_instructions}

    Document:\n{doc}\n""",
    input_variables=["doc"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()},
)

faq_generator_chain = prompt | llm | json_parser

In [120]:
result = faq_generator_chain.invoke({"doc": documents[0].text})

In [113]:
result

{'properties': {'pairs': [{'prompt': 'What is the maximum towing capacity of the 2022 Colorado?',
    'response': 'The 2022 Colorado has a maximum towing capacity of 7,700 lbs. with the available Duramax® 2.8L Turbo-Diesel engine.'},
   {'prompt': 'What is the fuel economy of the 2022 Colorado with the available Duramax® 2.8L Turbo-Diesel engine?',
    'response': 'The 2022 Colorado with the available Duramax® 2.8L Turbo-Diesel engine has an EPA-estimated fuel economy of 30 mpg highway.'},
   {'prompt': 'What is the difference between the WT, LT, Z71, and ZR2 models?',
    'response': 'The WT is the base model, the LT adds more features like a body-color rear bumper and power-adjustable mirrors, the Z71 is an off-road model with a higher ground clearance and more rugged suspension, and the ZR2 is the most off-road capable model with a Multimatic DSSV™ Damping System and front and rear locking differentials.'},
   {'prompt': 'What are the different engine options available for the 2022 
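
Note that the result above nests `pairs` under a `properties` key, while the per-node results earlier had `pairs` at the top level. A small helper can accept both shapes before the pairs are embedded and stored. This is a sketch based only on the two shapes seen in this notebook; `get_pairs` is a hypothetical name.

```python
def get_pairs(result):
    """Return the list of FAQ pairs from a parsed result, or [] if none is found."""
    if not isinstance(result, dict):
        return []
    # Shape 1: {"pairs": [...]}
    if isinstance(result.get("pairs"), list):
        return result["pairs"]
    # Shape 2: {"properties": {"pairs": [...]}}
    nested = result.get("properties")
    if isinstance(nested, dict) and isinstance(nested.get("pairs"), list):
        return nested["pairs"]
    return []

flat = get_pairs({"pairs": [{"prompt": "q1", "response": "a1"}]})
wrapped = get_pairs({"properties": {"pairs": [{"prompt": "q2", "response": "a2"}]}})
empty = get_pairs({})
```

Calling `get_pairs(result)` on the output above would yield a plain list of prompt and response dicts, in the same form as `all_faqs`.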

## Cleanup

In [None]:
cache.delete()