# Deploy a RAG application with vector search in Firestore: Challenge Lab
## GENAI069

link: https://partner.cloudskillsboost.google/paths/2310/course_templates/1289/labs/531796

## Objective
This lab tests your ability to develop a real-world Generative AI Q&A solution using a RAG framework. You will use Firestore as a vector database and deploy a Flask app as a user interface to query a food safety knowledge base.

This lab uses the following technologies and Google Cloud services:
 - Vertex AI
 - Vertex AI Colab Enterprise
 - Vertex AI Embeddings API
 - Gemini 2.0 Flash
 - Cloud Firestore

In this challenge lab, you will demonstrate your ability to load a text document and split it into chunks, generate embeddings for each chunk, store the text chunks and their embeddings, run a vector search that returns the documents most similar to a query, and complete a RAG framework by having Gemini generate a response from the context of those similar documents.

## Task 1. Create a Colab Enterprise Notebook
In this section, you will set up a Colab Enterprise notebook environment in the Google Cloud Console.

1. In the Google Cloud Console, navigate to Vertex AI > Colab Enterprise.

2. When prompted to enable APIs, click ENABLE.

3. Within the Colab Enterprise panel in the console, click on Create Notebook. Rename the notebook to cymbal_ingest_to_vector_database.ipynb.

4. Paste the following code into the top cell of the notebook and run the cell.

In [1]:
!pip install --quiet --upgrade google-cloud-logging google_cloud_firestore google_cloud_aiplatform langchain langchain-google-vertexai langchain_community langchain_experimental pymupdf


5. After the cell completes running, indicated by a checkmark to the left of the cell, the packages should be installed. To use them, restart the runtime.

6. Import the following packages by running the following command:

In [2]:
import vertexai
import logging
import google.cloud.logging
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

import pickle
from IPython.display import display, Markdown

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker

from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure

7. Next, initialize Vertex AI with your project-id qwiklabs-gcp-01-e6855b6ff8c2 and a location of us-central1

In [3]:
PROJECT_ID = "qwiklabs-gcp-01-e6855b6ff8c2"
LOCATION = "us-central1"
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)

8. Populate a variable named embedding_model with an instance of the langchain_google_vertexai class VertexAIEmbeddings. Pass it a parameter model_name set to the text embedding model version of text-embedding-005. You will use this LangChain class for your embedding model so that you can use a LangChain semantic chunker to chunk your dataset.

In [4]:
embedding_model = VertexAIEmbeddings(model_name="text-embedding-005")

## Task 2. Download, process and chunk data semantically
In this section, you will prepare the NYC Food Safety Manual for Retrieval-Augmented Generation (RAG). You will clean the PDF content, split it into meaningful chunks based on the semantic similarity of sentence embeddings, and generate a numerical representation (embedding) for each text chunk.

1. Download the New York City Department of Health and Mental Hygiene's Food Protection Training Manual. This document will serve as your Retrieval-Augmented Generation source content.

In [5]:
!gcloud storage cp gs://partner-genai-bucket/genai069/nyc_food_safety_manual.pdf .

Copying gs://partner-genai-bucket/genai069/nyc_food_safety_manual.pdf to file://./nyc_food_safety_manual.pdf

Average throughput: 159.9MiB/s


In [6]:
!ls -al

total 8704
drwxr-xr-x 3 root root    4096 Jun 22 17:38 .
drwxr-xr-x 1 root root    4096 Jun 22 17:26 ..
drwxr-xr-x 5 root root    4096 Jun 22 17:38 .config
-rw-r--r-- 1 root root 8898262 Jun 22 17:38 nyc_food_safety_manual.pdf


2. Use the LangChain class PyMuPDFLoader to load the contents of the PDF to a variable named data

In [7]:
# Replace with your PDF file path
file_path = "nyc_food_safety_manual.pdf"

# Instantiate the loader
loader = PyMuPDFLoader(file_path)

# Load the document
data = loader.load()

# 'data' now holds the loaded document content
#  You can access the content of each page using data[i].page_content
#  and metadata using data[i].metadata

print(data[0].page_content[:200]) # Print the first 200 chars of the first page content.
print(data[0].metadata) # Print the metadata of the first page.

The Health Code
These are regulations that were
formulated to allow the  Department
to effectively protect the health of the
population. Among the rules
embodied in the Health Code is
Article 81 which
{'producer': 'Acrobat Distiller 8.0.0 (Macintosh)', 'creator': 'QuarkXPress 8.5', 'creationdate': '2014-06-24T12:42:42-04:00', 'source': 'nyc_food_safety_manual.pdf', 'file_path': 'nyc_food_safety_manual.pdf', 'total_pages': 94, 'format': 'PDF 1.6', 'title': 'FOR BIND Food Protect Manual rev6 14_Conv-Sig', 'author': 'Hizzoner', 'subject': '', 'keywords': '', 'moddate': '2015-11-12T10:57:27-05:00', 'trapped': '', 'modDate': "D:20151112105727-05'00'", 'creationDate': "D:20140624124242-04'00'", 'page': 0}


3. The following function is provided to do some basic cleaning on artifacts found in this particular document. Create a variable called 'cleaned_pages' that is a list of strings, with each string being a page of content cleaned by this function.

In [8]:
def clean_page(page):
  return page.page_content.replace("-\n","")\
                          .replace("\n"," ")\
                          .replace("\x02","")\
                          .replace("\x03","")\
                          .replace("fo d P R O T E C T I O N  T R A I N I N G  M A N U A L","")\
                          .replace("N E W  Y O R K  C I T Y  D E P A R T M E N T  O F  H E A L T H  &  M E N T A L  H Y G I E N E","")

In [9]:
# Create a variable called 'cleaned_pages' that is a list of strings, with each string being a page of content cleaned by this function clean_page().
cleaned_pages = [clean_page(page) for page in data]

# optionally print-out the first cleaned page to verify
print(cleaned_pages[0])
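
To see what the cleaning step does, here is a minimal sketch that applies the same line-break and control-character replacements to a plain string instead of a LangChain page object. The helper name `clean_text` and the sample sentence are made up for illustration; the header-stripping replacements from `clean_page()` are left out for brevity.

```python
def clean_text(text: str) -> str:
    """Apply the line-break cleanup from clean_page() to a plain string."""
    return (
        text.replace("-\n", "")   # rejoin words hyphenated across a line break
            .replace("\n", " ")   # turn the remaining line breaks into spaces
            .replace("\x02", "")  # drop stray control characters
            .replace("\x03", "")
    )


# Hypothetical sample text, not taken from the manual.
sample = "Food must be kept at safe tempera-\ntures during\nstorage.\x02"
print(clean_text(sample))
```

The order of the replacements matters: removing `"-\n"` first rejoins hyphenated words before the remaining line breaks are turned into spaces.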



4. Use LangChain's SemanticChunker with the embedding_model you created earlier to split the first five pages of cleaned_pages into text chunks. The SemanticChunker starts a new chunk when it encounters a large distance between the embeddings of neighboring sentences. Save the strings of page content from the resulting documents into a list of strings called chunked_content. Take a look at a few of the chunks to get familiar with the content.

In [11]:
from langchain_core.documents import Document
from langchain_experimental.text_splitter import SemanticChunker

# Select the first five pages from 'cleaned_pages'
first_five_cleaned_pages = cleaned_pages[:5]

# Wrap each cleaned page string in a Document so split_documents() can be used.
# (SemanticChunker.create_documents() would also accept the raw strings directly.)
documents_to_chunk = [Document(page_content=text) for text in first_five_cleaned_pages]

# Instantiate SemanticChunker with the embedding_model
# 'breakpoint_threshold_type' selects how split points are chosen, e.g. 'percentile' or 'standard_deviation'.
# With 'percentile', a new chunk starts where the distance between sentences exceeds a given percentile.
text_splitter = SemanticChunker(
    embeddings=embedding_model,
    breakpoint_threshold_type="percentile" # You can experiment with "standard_deviation" as well
)

# Split the documents into chunks
chunked_documents = text_splitter.split_documents(documents_to_chunk)

# Save the strings of page content from the resulting documents into a list of strings called 'chunked_content'
chunked_content = [doc.page_content for doc in chunked_documents]

# Take a look at a few of the chunks to get familiar with the content.
print(f"Number of chunks created: {len(chunked_content)}\n")

print("--- First chunk ---")
print(chunked_content[0])
print("\n--- Second chunk ---")
print(chunked_content[1])
print("\n--- Third chunk ---")
print(chunked_content[2])
print("\n--- Last chunk of the first five pages ---")
# Print the last chunk if there are enough, otherwise print the last available.
if len(chunked_content) > 3:
    print(chunked_content[-1])
else:
    print("Not enough chunks to print the 'last chunk of the first five pages'.")


Number of chunks created: 15

--- First chunk ---

--- Second chunk ---
Registration is done on-line. The link is: nyc.gov/foodprotectioncourse Register for Health Academy Classes On-Line You may now register and pay online for courses offered at the Department of Health and Mental Hygiene’s Health Academy, including the Food Protection Course for restaurants. This new service allows you to avoid going to the Citywide Licensing Center to register for a course. You may also use the on-line service to pay for and request an appointment to replace your Food Protection Certificate. How does it work? Go to the registration web page, nyc.gov/healthacademy, select a course and date, pay the appropriate fee and receive confirmation. You will be asked to provide some personal information before registering. In most cases, you will be able to select from a list of course dates.

--- Third chunk ---
If you don’t see a date that is convenient, check back as new course dates are added frequently. 1
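
The percentile-based splitting idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: the function names (`cosine_distance`, `percentile`, `split_on_semantic_gaps`) and the toy two-dimensional embeddings are hypothetical, and the real SemanticChunker does additional work such as combining neighboring sentences before embedding them.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_on_semantic_gaps(sentences, embeddings, pct=95.0):
    """Start a new chunk wherever the distance to the previous
    sentence exceeds the pct-th percentile of all such distances."""
    if not sentences:
        return []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return [sentences[0]]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: the first two sentences point one way, the last two another.
toy_sentences = ["A1.", "A2.", "B1.", "B2."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(split_on_semantic_gaps(toy_sentences, toy_embeddings))
```

In the toy data, the only large jump in distance is between the second and third sentences, so that is where the split lands.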

5. Use the embedding_model to generate embeddings of the text chunks, saving them to a list called chunked_embeddings. To do so, pass your list of chunks to the VertexAIEmbeddings class's embed_documents() method.

In [12]:
# Assuming the previous code blocks have been executed and necessary variables are defined:
# - embedding_model (an instance of VertexAIEmbeddings)
# - chunked_content (list of strings, where each string is a text chunk)

# Generate embeddings for the text chunks
# The embed_documents() method expects a list of strings
chunked_embeddings = embedding_model.embed_documents(chunked_content)

# Verify the number of embeddings generated matches the number of chunks
print(f"Number of chunks: {len(chunked_content)}")
print(f"Number of embeddings generated: {len(chunked_embeddings)}")

Number of chunks: 15
Number of embeddings generated: 15


In [16]:
# Optionally: print the start of the first chunk to see its structure (each chunk is a string)
print("\nFirst chunk (first 10 characters):")
print(chunked_content[0][:10]) # Print only the first 10 characters for brevity
print(f"Length of first chunk in characters: {len(chunked_content[0])}")


First chunk (first 10 characters):
The Health
Length of first chunk in characters: 2976


In [None]:
# Optionally: print-out the first embedding to see its structure (it will be a list of floats)
print("\nFirst embedding (first few dimensions):")
print(chunked_embeddings[0][:10]) # Print only the first 10 dimensions for brevity
print(f"Dimension of embeddings: {len(chunked_embeddings[0])}")

6. You should have successfully chunked & embedded a short section of the document. To get the chunks & corresponding embeddings for the full document, run the following code:

In [14]:
!gcloud storage cp gs://partner-genai-bucket/genai069/chunked_content.pkl .
!gcloud storage cp gs://partner-genai-bucket/genai069/chunked_embeddings.pkl .

chunked_content = pickle.load(open("chunked_content.pkl", "rb"))
chunked_embeddings = pickle.load(open("chunked_embeddings.pkl", "rb"))

# Do not delete this logging statement.
client = google.cloud.logging.Client()
client.setup_logging()
log_message = f"chunked contents are: {chunked_content[0][:20]}"
logging.info(log_message)

Copying gs://partner-genai-bucket/genai069/chunked_content.pkl to file://./chunked_content.pkl
Copying gs://partner-genai-bucket/genai069/chunked_embeddings.pkl to file://./chunked_embeddings.pkl

Average throughput: 166.5MiB/s


INFO:root:chunked contents are: The Health Code Thes


## Task 3. Prepare your vector database
In this section, you will set up a Firestore database to store the processed NYC Food Safety Manual chunks and their embeddings for efficient retrieval. You'll then build a search function to find relevant information based on a user query.

1. Create a Firestore database with the default name of (default) in Native Mode and leave the other settings at their defaults. This is a manual step that you perform in the Google Cloud Console.

2. Next, in your Colab Enterprise Notebook populate a 'db' variable with a Firestore Client.

3. Use a variable called 'collection' to create a reference to a collection named **food-safety**.

In [15]:
from google.cloud import firestore

# Initialize Firestore client
db = firestore.Client(project=PROJECT_ID)
collection = db.collection("food-safety")

In [18]:
print(PROJECT_ID)

qwiklabs-gcp-01-e6855b6ff8c2


4. Using a combination of your lists 'chunked_content' and 'chunked_embeddings', add a document to your collection for each of your chunked documents. Each document can be assigned a random ID, but it should have a field called content to store the chunk text and a field called embedding to store a Firestore Vector() of the associated embedding.

In [17]:
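# Store each chunk together with its embedding.
# The embedding is wrapped in Vector() so Firestore stores it as a vector value
# that the vector index and find_nearest() can use; a plain list is stored as an array.
# Note: step 4 names the text field "content"; this notebook uses "chunk",
# which is the field that search_vector_database() reads below.
for i, (embedding, chunk) in enumerate(zip(chunked_embeddings, chunked_content)):
    doc = {
        "embedding": Vector(embedding),
        "chunk": chunk
    }
    collection.document(f"chunk_{i}").set(doc)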

5. Create a vector index for your collection using your embedding field.

**Note-1:** A find_nearest() operation cannot be executed on a collection without an index. When attempted, the system will return an error message including instructions to create the index using a gcloud command.

**Note-2:** To create the index, use the following 'gcloud' command in Cloud Shell, replacing the placeholders with your own values:

gcloud firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config field-path=vector-field,vector-config='vector-configuration' \
--database=database-id

In [19]:
!gcloud firestore indexes composite create \
    --collection-group=food-safety \
    --query-scope=COLLECTION \
    --field-config field-path=embedding,vector-config='{"dimension":"768", "flat": "{}"}' \
    --database="(default)" \
    --project=qwiklabs-gcp-01-e6855b6ff8c2

Create request issued
Created index [CICAgOjXh4EK].
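
The index is declared with a dimension of 768, so every stored embedding needs exactly that many values, and there must be one embedding per chunk. A minimal sketch of a sanity check that can be run on the two lists follows; the helper name `validate_embeddings` is hypothetical and not part of any library.

```python
def validate_embeddings(chunks, embeddings, expected_dim=768):
    """Check that chunks and embeddings line up and that every
    embedding has the dimension declared in the vector index."""
    if len(chunks) != len(embeddings):
        raise ValueError(
            f"{len(chunks)} chunks but {len(embeddings)} embeddings"
        )
    for i, embedding in enumerate(embeddings):
        if len(embedding) != expected_dim:
            raise ValueError(
                f"embedding {i} has {len(embedding)} values, "
                f"expected {expected_dim}"
            )
    return len(chunks)


# Toy data standing in for chunked_content and chunked_embeddings.
toy_chunks = ["first chunk", "second chunk"]
toy_vectors = [[0.0] * 768, [0.1] * 768]
print(validate_embeddings(toy_chunks, toy_vectors))
```

In the notebook, the same call would take `chunked_content` and `chunked_embeddings` as arguments.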


6. Complete the function below to receive a query, get its embedding, and compile a context consisting of the text from the 5 documents with the most similar embeddings. This time, use the embed_query() method of the LangChain 'VertexAIEmbeddings' embedding_model to embed the user's query.

In [21]:
def search_vector_database(query: str):

  context = ""

  # 1. Generate the embedding of the query
  query_embedding_list = embedding_model.embed_query(query)

  # Firestore expects Vector objects for vector search
  query_vector = Vector(query_embedding_list)

  # 2. Get the 5 nearest neighbors from your collection.
  # Call the get() method on the result of your call to
  # find_nearest to retrieve document snapshots.

  # The find_nearest method typically expects the vector field name, the query vector,
  # the distance measure, and the limit.
  nearest_neighbors = collection.find_nearest(
      vector_field="embedding",
      query_vector=query_vector,
      distance_measure=DistanceMeasure.COSINE, # Rank documents by cosine distance to the query embedding
      limit=5
  ).get() # .get() retrieves the actual document snapshots


  # 3. Call to_dict() on each snapshot to load its data.
  # Combine the snapshots into a single string named context
  for snapshot in nearest_neighbors:
    doc_data = snapshot.to_dict()
    if doc_data and "chunk" in doc_data:
      context += doc_data["chunk"] + "\n\n" # Add a newline for separation between chunks


  return context

7. Next, call the function with the query How should I store food? to confirm its functionality.

In [24]:
print(search_vector_database("How should I store food?"))
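
Conceptually, a nearest-neighbor search with cosine distance ranks every stored embedding by its distance to the query embedding and keeps the closest few. The sketch below does this by brute force over an in-memory list. It is only an illustration of the idea, not how Firestore executes `find_nearest()`; the function names, the placeholder texts, and the three-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_chunks(query_embedding, stored, k=5):
    """Return the texts of the k stored (text, embedding) pairs
    closest to the query embedding, nearest first."""
    ranked = sorted(
        stored,
        key=lambda pair: cosine_distance(query_embedding, pair[1]),
    )
    return [text for text, _ in ranked[:k]]


# Placeholder texts with toy embeddings.
stored = [
    ("chunk about registration", [0.0, 1.0, 0.0]),
    ("chunk about storage", [0.9, 0.1, 0.0]),
    ("chunk about handwashing", [0.5, 0.5, 0.0]),
]
query_embedding = [1.0, 0.0, 0.0]

# Join the nearest texts into one context string, as the function above does.
context = "\n\n".join(nearest_chunks(query_embedding, stored, k=2))
print(context)
```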




## Task 4. Deploy a Generative AI application to search your vector store
Now that your vector database is prepared, in this section you will work on the client application to query it and return answers generated by Gemini.
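
The generation step of RAG comes down to placing the retrieved context and the user's question into one prompt. A minimal sketch of that assembly follows, assuming a hypothetical helper name `build_rag_prompt` and prompt wording chosen only for illustration; the lab's client application may phrase its prompt differently.

```python
def build_rag_prompt(question: str, context: str) -> str:
    """Combine retrieved context and a user question into one prompt."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Placeholder context; in the notebook this would come from
# search_vector_database(question).
example_prompt = build_rag_prompt(
    "How should I store food?",
    "placeholder context retrieved from the vector database",
)
print(example_prompt)
```

The resulting string could then be passed to the `generate_content()` method of a `GenerativeModel` instance to obtain the answer.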