## Introduction

In this tutorial, we learn how to use Google Cloud AI tools to quickly bring the power of Large Language Models to enterprise systems.

This tutorial covers the following:

- What are embeddings, and what business challenges do they help solve?
- Understanding Text with Vertex AI Text Embeddings
- Find Embeddings fast with Vertex AI Vector Search
- Grounding LLM outputs with Vector Search

This tutorial is based on [the blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings), combined with sample code.


## Bringing Gen AI and LLMs to production services
Many people are now starting to think about how to bring Gen AI and LLMs to production services, and they are facing several challenges.

* "How to integrate LLMs or AI chatbots with existing IT systems, databases and business data?"
* "We have thousands of products. How can I let LLM memorize them all precisely?"
* "How to handle the hallucination issues in AI chatbots to build a reliable service?"

Here is a quick solution: **grounding** with **embeddings** and **vector search**.

What is grounding? What are embeddings and vector search? In this tutorial, we will learn these crucial concepts to build reliable Gen AI services for enterprise use. But before we dive deeper, let's try the demo below.

![Search Gif](../assets/embeddings/search.gif)

#### Exercise: Try the Stack Overflow semantic search demo (like the GIF above)

This demo is available as a [public live demo](https://ai-demos.dev/). Select "STACKOVERFLOW" and enter any coding question as a query to run a text search over 8 million questions posted on Stack Overflow. Try the semantic search with queries like 'How to shuffle rows in SQL?' or any other programming question.

In this tutorial, we are going to see how to build a similar search experience - what is involved in building solutions like this using `Vertex AI Embeddings API` and `Vector Search`.


## 1. What are Embeddings?
With the rise of LLMs, why is it becoming important for IT engineers and IT decision makers (ITDMs) to understand how they work?

In traditional IT systems, most data is organized as structured or tabular data, using simple keywords, labels, and categories in databases and search engines.

![Traditional search](../assets/embeddings/search-traditional.png)

In contrast, AI-powered services arrange data into a simple data structure known as `embeddings`.

![Embedding search](../assets/embeddings/search-embedding.png)

Once trained with specific content like text, images, or any content, AI creates a space called "embedding space", which is essentially a map of the content's meaning.

![Embedding Space 1](../assets/embeddings/embedding-space1.png)

AI can identify the location of each piece of content on the map; that location is its embedding.

![Embedding Space 2](../assets/embeddings/embedding-space2.png)

Let's take an example where a text discusses _movies, music, and actors_, with a distribution of 10%, 2%, and 30%, respectively. In this case, the AI can create an embedding with three values, 0.1, 0.02, and 0.3, in a 3-dimensional space.

![Embedding Space 3](../assets/embeddings/embedding-space3.png)

AI can put content with similar meanings closely together in the space.
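As a rough illustration of the map idea, the sketch below places three made-up texts in the 3-dimensional movies/music/actors space from the example above and compares their distances. Only the first vector comes from the text; the other two are hypothetical values chosen for illustration, not output from any model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings: (movies, music, actors).
# The first vector is the example from the text; the others are made up.
embeddings = {
    "text about actors in movies": np.array([0.10, 0.02, 0.30]),
    "another text about actors": np.array([0.12, 0.01, 0.28]),
    "text about music": np.array([0.01, 0.40, 0.02]),
}


def euclidean_distance(a, b):
    """Straight-line distance between two points in the embedding space."""
    return float(np.linalg.norm(a - b))


key = embeddings["text about actors in movies"]
distances = {name: euclidean_distance(key, vec) for name, vec in embeddings.items()}

# Texts with similar meaning end up closer to the key text.
for name, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{dist:.3f}  {name}")
```

With real embeddings the space has hundreds of dimensions, but the comparison works the same way.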

This is how Google organizes data across various services like Google Search, YouTube, Play, and many others, to provide search results and recommendations with relevant content.

Embeddings can also be used to represent different types of things in businesses, such as products, users, user activities, conversations, music & videos, signals from IoT sensors, and so on.

AI and Embeddings are now playing a crucial role in creating a new way of human-computer interaction.

![Content Embedding Overview](../assets/embeddings/content-embedding-overview.png)

AI organizes data into embeddings, which represent what the user is looking for, the meaning of contents, or many other things you have in your business. This creates a new level of user experience that is becoming the new standard.

To learn more about embeddings, [Foundational courses: Embeddings on Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture) and [Meet AI’s multitool: Vector embeddings by Dale Markowitz](https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings) are great materials.

## Vertex AI Embeddings for Text

With [Vertex AI Embeddings for Text](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings), you can easily create a text embedding with an LLM. The product is also available on [Vertex AI Model Garden](https://cloud.google.com/model-garden).

![Text Embedding Model Garden](../assets/embeddings/text-embedding-model-garden.png)

This API is designed to extract embeddings from text. It accepts up to 3,072 input tokens and outputs 768-dimensional text embeddings.

#### LLM text embedding business use cases
With the embedding API, you can apply the innovation of embeddings, combined with the LLM capability, to various text processing tasks, such as:

* **LLM-enabled Semantic Search**: text embeddings can be used to represent both the meaning and intent of a user's query and documents in the embedding space. Documents that have similar meaning to the user's query intent will be found fast with vector search technology. The model is capable of generating text embeddings that capture the subtle nuances of each sentence and paragraph in the document.

* **LLM-enabled Text Classification**: LLM text embeddings can be used for text classification with a deep understanding of different contexts without any training or fine-tuning (so-called zero-shot learning). This wasn't possible with the past language models without task-specific training.

* **LLM-enabled Recommendation**: The text embedding can be used for recommendation systems as a strong feature for training recommendation models such as the Two-Tower model. The model learns the relationship between the query and candidate embeddings, resulting in a next-gen user experience with semantic product recommendation.

LLM-enabled Clustering, Anomaly Detection, Sentiment Analysis, and more can also be handled with this LLM-level deep semantic understanding.
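To make the classification use case more concrete, here is a minimal sketch of one common approach: embed each candidate label and the input text, then pick the label with the highest dot product. The 4-dimensional vectors and the label names are hypothetical stand-ins; in practice they would come from the embeddings API.

```python
import numpy as np


def classify_by_similarity(text_embedding, label_embeddings):
    """Return the label whose embedding has the largest dot product with the text."""
    scores = {
        label: float(np.dot(text_embedding, emb))
        for label, emb in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical 4-dimensional embeddings standing in for real model output.
label_embeddings = {
    "billing": np.array([0.9, 0.1, 0.0, 0.1]),
    "shipping": np.array([0.1, 0.8, 0.2, 0.0]),
    "returns": np.array([0.0, 0.2, 0.9, 0.1]),
}
ticket_embedding = np.array([0.2, 0.7, 0.3, 0.0])  # e.g. "Where is my package?"

label, scores = classify_by_similarity(ticket_embedding, label_embeddings)
print(label, scores)
```

No classifier is trained here: adding a new category only requires one more label embedding.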

#### Sorting 8 million texts at "librarian-level" precision

Vertex AI Embeddings for Text has an embedding space with `768 dimensions`. As explained earlier, the space represents a huge map of a wide variety of texts in the world, organized by their meanings. With each input text, the model can find a location (embedding) in the map.

By visualizing the embedding space, you can actually observe how the model sorts the texts at the "librarian-level" precision.

**Exercise: Try the Nomic AI Atlas**

[Nomic AI](http://nomic.ai/) provides a platform called Atlas for storing, visualizing and interacting with embedding spaces with high scalability and in a smooth UI, and they worked with Google for visualizing the embedding space of the 8 million Stack Overflow questions. You can try exploring around the space, zooming in and out to each data point on your browser on this page, courtesy of Nomic AI.

* The embedding space represents a huge map of texts, organized by their meanings
* With each input text, the model can find a location (embedding) in the map
* Like a librarian reading through millions of texts, sorting them with millions of nano-categories

Try exploring it [here](https://atlas.nomic.ai/map/edaff028-12b5-42a0-8e8b-6430c9b8222b/bcb42818-3581-4fb5-ac30-9883d01f98ec). Zoom into a few categories, hover over individual dots, and see how the LLM is sorting similar questions close together in the space.


##### The librarian-level semantic understanding

Here are examples of librarian-level semantic understanding by the Embeddings API on Stack Overflow questions.

![StackOverflow](../assets/embeddings/stackoverflow-embedding.png)

For example, the model thinks the question _“Does moving the request line to a header frame require an app change?”_ is similar to the question _“Does an application developed on HTTP1x require modifications to run on HTTP2?”_. That is because the model knows both questions ask what changes are required to support the HTTP2 header frame.

Note that this demo didn't require any training or fine-tuning with computer programming specific datasets. This is the innovative part of the zero-shot learning capability of the LLM. It can be applied to a wide variety of industries, including finance, healthcare, retail, manufacturing, construction, media, and more, for deep semantic search on the industry-focused business documents without spending time and cost for collecting industry specific datasets and training models.


## Text Embeddings in Action

Let's try using Text Embeddings in action with actual sample code.

#### 1. Set up the Environment

Before getting started with the Vertex AI services, we need to set up the following:

* Install the Python SDK
* Environment variables
* Authentication using a Service Account
* Enable APIs
* Set IAM permissions (Vertex AI User, BigQuery User and Storage Admin)

First, install the Python SDK and the other packages used in this tutorial.

```bash
pip install --upgrade --user google-cloud-aiplatform google-cloud-storage "google-cloud-bigquery[pandas]" python-dotenv tqdm
```

Vertex AI, Cloud Storage and BigQuery APIs can be accessed in multiple ways, including the REST API and the Python SDK. In this tutorial we will use the SDK.

```python
# import libraries
import os
import vertexai
from IPython.display import Markdown, display
from google.oauth2 import service_account
from dotenv import load_dotenv
```

```python
# load the service account credentials (authentication)
json_path = "../llm-ai.json"  # replace with your own service account
credentials = service_account.Credentials.from_service_account_file(json_path)
```

```python
# start Vertex AI; PROJECT_ID is read from the environment (e.g. a .env file)
load_dotenv()
vertexai.init(
    project=os.environ["PROJECT_ID"],  # replace with your own project
    credentials=credentials,
)
```

```python
# generate a unique id for this session
from datetime import datetime

UID = datetime.now().strftime("%m%d%H%M")

print(UID)
```

```text
01301158
```


#### 2. Getting Started with Vertex AI Embeddings for Text

Now we're ready to get started with embeddings!

##### A. Data Preparation

We will be using the [Stack Overflow public dataset](https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow) hosted in the BigQuery table `bigquery-public-data.stackoverflow.posts_questions`. This is a very big dataset with 23 million rows that doesn't fit into memory, so we limit it to 1000 rows for this tutorial.

```python
# load the BQ Table into a Pandas Dataframe
import pandas as pd
from google.cloud import bigquery

QUESTIONS_SIZE = 1000

bq_client = bigquery.Client(project=os.environ["PROJECT_ID"], credentials=credentials)
QUERY_TEMPLATE = """
        SELECT distinct q.id, q.title
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
        where Score > 0 ORDER BY View_Count desc) AS q
        LIMIT {limit} ;
        """
query = QUERY_TEMPLATE.format(limit=QUESTIONS_SIZE)
query_job = bq_client.query(query)
rows = query_job.result()
df = rows.to_dataframe()

# examine the data
df.head()
```

|   | id | title |
|---|---|---|
| 0 | 73250763 | Error CS0246: The type or namespace name 'Stre... |
| 1 | 73206525 | Keycloak 19.0 behind nginx (https) admin conso... |
| 2 | 73475664 | Citing Institutional Author or Organization in... |
| 3 | 73399777 | Azure build failing due to Method not found: '... |
| 4 | 73426773 | UnboundLocalError: local variable 'raw_labels'... |


##### B. Call the API to generate embeddings

With the Stack Overflow dataset, we will use the `title` column (the question title) and generate an embedding for it with the Embeddings for Text API. The API is available under the [`vertexai`](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai) package of the SDK.

From the package, import `TextEmbeddingModel` and get a model.

```python
# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
```

In this tutorial we will use the `textembedding-gecko@001` model for getting text embeddings. Please take a look at [Supported models](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#supported_models) in the docs to see the list of supported models.

Once you get the model, you can call its [get_embeddings](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel#vertexai_language_models_TextEmbeddingModel_get_embeddings) function to get embeddings. You can pass up to 5 texts in a single call. But there is a caveat: by default, the text embeddings API has a "requests per minute" quota of `60` for **new Cloud projects** and `600` for **projects with usage history** (see [Quotas and limits](https://cloud.google.com/vertex-ai/docs/quotas#request_quotas) to check the latest quota value for `base_model:textembedding-gecko`). So, rather than calling the function directly, you may want to define a wrapper like the one below to:
* limit the rate to about one call per second (`time.sleep(1)`), which stays within the 60 requests per minute quota
* pass 5 texts each time.

```python
import time
import tqdm  # to show a progress bar

# get embeddings for a list of texts
BATCH_SIZE = 5


def get_embeddings_wrapper(texts):
    """Embed texts in batches of BATCH_SIZE, pausing between calls to respect the quota."""
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # to avoid the quota error
        result = model.get_embeddings(texts[i : i + BATCH_SIZE])
        embs.extend(e.values for e in result)
    return embs
```

The following code gets embeddings for the question titles and adds them to the `DataFrame` as a new column named `embedding`. This will take a few minutes.

```python
# get embeddings for the question titles and add them as "embedding" column
df = df.assign(embedding=get_embeddings_wrapper(list(df.title)))
df.head()
```

```text
100%|██████████| 200/200 [05:06<00:00,  1.53s/it]
```

|   | id | title | embedding |
|---|---|---|---|
| 0 | 73250763 | Error CS0246: The type or namespace name 'Stre... | [-0.02774483524262905, 0.010048817843198776, -... |
| 1 | 73206525 | Keycloak 19.0 behind nginx (https) admin conso... | [-0.0495707169175148, 0.021820401772856712, 0.... |
| 2 | 73475664 | Citing Institutional Author or Organization in... | [-0.007552702911198139, -0.008695865981280804,... |
| 3 | 73399777 | Azure build failing due to Method not found: '... | [-0.0029502741526812315, -0.025221263989806175... |
| 4 | 73426773 | UnboundLocalError: local variable 'raw_labels'... | [0.005036971066147089, 0.03410046175122261, 0.... |


##### C. Look at the embedding similarities

Let's see how these embeddings are organized in the embedding space with their meanings by quickly calculating the similarities between them and sorting them.

As embeddings are vectors, you can calculate the similarity between two embeddings by using one of the popular metrics shown below:

![Embedding Similarity](../assets/embeddings/embedding-similarity.png)

Which metric should we use? Usually it depends on how each model is trained. In case of the model `textembedding-gecko@001`, we need to use **inner product (dot product)**.
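For reference, here is a small NumPy sketch of three commonly used metrics: dot product, cosine similarity, and Euclidean distance. That these are the same metrics shown in the figure is an assumption, and the two vectors are toy values rather than real embeddings.

```python
import numpy as np


def dot_product(a, b):
    """Inner product: larger means more similar; sensitive to vector length."""
    return float(np.dot(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors: depends only on direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))


a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice the length

print(dot_product(a, b))
print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
```

For vectors of length 1, the dot product and cosine similarity give the same value.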

The following code picks one question at random and uses the numpy `np.dot` function to calculate the similarities between that question and the other questions.


```python
import random
import numpy as np

# pick one of them as a key question
# (randrange excludes len(df), so the index is always valid)
key = random.randrange(len(df))

# calc dot product between the key and other questions
embs = np.array(df.embedding.to_list())
similarities = np.dot(embs[key], embs.T)

# print similarities for the first 5 questions
similarities[:5]
```

```text
array([0.54679927, 0.499551  , 0.48490172, 0.59063316, 0.61370155])
```

Finally, sort the questions by similarity and print the list.

```python
# print the key question
print(f"Key question: {df.title[key]}\n")

# sort the questions by similarity and keep the top 20
sorted_questions = sorted(
    zip(df.title, similarities), key=lambda x: x[1], reverse=True
)[:20]

for question, similarity in sorted_questions:
    print(f"{similarity:.4f} {question}")
```

```text
Key question: Mat Select option cant get data

1.0000 Mat Select option cant get data
0.7215 Mirth: Mapper: Cannot get data to map to variables
0.6698 Why should I use the the .selectable modifier instead of .clickable for select one item in Compose LazyColumn?
0.6595 Getting Value error when using DecisionBoundaryDisplay
0.6580 Option to exclude patterns from auto import not working?
0.6562 jQuery - select function is not working on mobile
0.6560 Cannot read properties of undefined 'on'
0.6461 Unable to use contains for quick search in Nuxeo
0.6426 Error: Clickable element "Save" was not found by text|CSS|XPath
0.6393 Why do I encounter "INVALID_PARAMETER VALUE" error when opening "Models" tab in MLFlow UI?
0.6390 loadTs is not a function
0.6335 How to get data from list of map to display in Recordable list view
0.6324 How to show the information of the selected row with DataTable React.js
0.6323 Is there a way to enable/disable subsets of data taken by a formula?
0.6293 Want to retri
```

#### 3. Find embeddings fast with Vertex AI Vector Search

As we have explained above, you can find similar embeddings by calculating the distance or similarity between the embeddings.

But this isn't easy when you have millions or billions of embeddings. For example, if you have 1 million embeddings with 768 dimensions, a single brute-force query has to compute 1 million dot products, which is 1 million x 768 multiply-add operations. This would take some seconds - too slow.
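The cost of brute-force search can be sketched with NumPy as follows. The embeddings are random stand-ins, and the collection is kept much smaller than 1 million so the sketch runs quickly; the measured time depends entirely on the machine, so no expected timing is shown.

```python
import time
import numpy as np

rng = np.random.default_rng(seed=0)

# Random stand-in embeddings; far fewer than 1 million so this runs quickly.
num_embeddings, dimensions = 20_000, 768
embs = rng.standard_normal((num_embeddings, dimensions), dtype=np.float32)
query = rng.standard_normal(dimensions, dtype=np.float32)

start = time.perf_counter()
scores = embs @ query                  # one dot product per stored embedding
top_k = np.argsort(scores)[::-1][:10]  # indices of the 10 highest scores
elapsed = time.perf_counter() - start

print(f"multiply-adds per query: {num_embeddings * dimensions:,}")
print(f"brute-force search took {elapsed * 1000:.1f} ms")
```

The work grows linearly with the number of embeddings, and every query repeats it in full.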

Researchers have therefore been studying a technique called [Approximate Nearest Neighbor (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) for faster search. ANN uses "vector quantization" to separate the space into multiple partitions organized in a tree structure. This is similar to an index in a relational database that improves query performance, and it enables very fast and scalable search over billions of embeddings.

With the rise of LLMs, ANN is rapidly gaining popularity under the name Vector Search.
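The partitioning idea can be illustrated with a deliberately simplified sketch. It uses a single level of partitions with randomly chosen centroids instead of a trained quantizer or a tree, so it is not ScaNN and not how Vector Search is implemented. All names (`nearest_centroids`, `ann_search`, `partitions_to_probe`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
embs = rng.standard_normal((5_000, 64), dtype=np.float32)

# Crude quantizer: use randomly chosen embeddings as partition centroids.
num_partitions = 50
centroids = embs[rng.choice(len(embs), size=num_partitions, replace=False)]


def nearest_centroids(vectors, count):
    """Indices of the `count` closest centroids (Euclidean) for each vector."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :count]


# Index time: assign every embedding to its closest centroid.
assignments = nearest_centroids(embs, 1)[:, 0]


def ann_search(query, num_neighbors=5, partitions_to_probe=5):
    """Search only the partitions closest to the query, not all embeddings."""
    probe = nearest_centroids(query[None, :], partitions_to_probe)[0]
    candidates = np.flatnonzero(np.isin(assignments, probe))
    dists = np.linalg.norm(embs[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:num_neighbors]]


query = embs[123]
neighbors = ann_search(query)
print(neighbors)
```

Only a fraction of the embeddings is compared with the query, which is where the speed-up comes from. The result is approximate because a true neighbor in a partition that was not probed is missed.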

![ANN](../assets/embeddings/ann.png)

In 2020, Google Research published a new ANN algorithm called [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html). It is considered one of the best ANN algorithms in the industry, and it is also the most important foundation for search and recommendation in major Google services such as Google Search, YouTube and many others.

##### A. What is Vertex AI Vector Search?

Google Cloud developers can take full advantage of Google's vector search technology with [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview) (previously called Matching Engine). With this fully managed service, developers can just add the embeddings to its index and issue a search query with a key embedding for blazingly fast vector search. In the case of the Stack Overflow demo, **Vector Search can find relevant questions from 8 million embeddings in tens of milliseconds**.

![Vector Search](../assets/embeddings/vector-search.png)

With Vector Search, we don't need to spend much time and money building our own vector search service from scratch or with open source tools if our goal is high scalability, availability and maintainability for production systems.

##### B. Get Started with Vector Search

Once we have the embeddings, getting started with Vector Search is pretty easy. In this section, we will follow the steps below.

**Setting up Vector Search**

* Save the embeddings in JSON files on Cloud Storage
* Build an Index
* Create an Index Endpoint
* Deploy the Index to the endpoint

**Use Vector Search**

* Query with the endpoint

##### C. Save the embeddings in a JSON file

To load the embeddings to Vector Search, we need to save them in JSON files with JSONL format. See more information in the docs at [Input data format and structure](https://cloud.google.com/vertex-ai/docs/matching-engine/match-eng-setup/format-structure#data-file-formats).
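As a quick illustration of JSONL itself, the sketch below writes and reads one JSON object per line using only the standard library. The records are hypothetical and use 3 values instead of 768; the field names `id` and `embedding` follow the DataFrame columns used in this tutorial.

```python
import json

# Hypothetical records; real embeddings would have 768 values each.
records = [
    {"id": "1", "embedding": [0.1, 0.2, 0.3]},
    {"id": "2", "embedding": [0.4, 0.5, 0.6]},
]

# JSONL: one JSON object per line, with no enclosing list.
jsonl_string = "\n".join(json.dumps(record) for record in records)
print(jsonl_string)

# Reading it back: parse each line independently.
parsed = [json.loads(line) for line in jsonl_string.splitlines()]
```

Because each line is a complete JSON object, large files can be processed line by line.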

First, export the id and embedding columns from the DataFrame in JSONL format, and save it.

```python
# save id and embedding as a json file
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)
```

```python
# show the first few lines of the json file
pd.read_json("questions.json", lines=True).head()
```

|   | id | embedding |
|---|---|---|
| 0 | 73250763 | [-0.0277448352, 0.0100488178, -0.0102676898, 0... |
| 1 | 73206525 | [-0.0495707169, 0.0218204018, 0.00418983170000... |
| 2 | 73475664 | [-0.0075527029, -0.008695866, -0.0365824923, 0... |
| 3 | 73399777 | [-0.0029502742, -0.025221264, 0.00913337710000... |
| 4 | 73426773 | [0.0050369711, 0.0341004618, 0.0198147688, 0.0... |


##### D. Create a Cloud Storage bucket and upload the file

```python
from google.cloud import storage


# create function to create a new bucket and upload a file
def create_bucket_and_upload_file(bucket_name, source_file_name, destination_blob_name):
    """Create a new bucket in GCS and upload a file to it."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Create a new bucket (this fails if the bucket already exists)
    bucket = storage_client.create_bucket(bucket_name, location="US")  # you can change the location

    # Upload a file
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)

    print(f"File {source_file_name} uploaded to {destination_blob_name} in bucket {bucket_name}. URI: gs://{bucket_name}/{destination_blob_name}")
```

```python
def upload_file_to_gcs(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to Google Cloud Storage, creating the bucket if it doesn't exist."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Check if the bucket exists
    bucket = storage_client.bucket(bucket_name)
    if not bucket.exists():
        # Create a new bucket if it does not exist; pass the location directly
        # rather than assigning to bucket.location
        bucket = storage_client.create_bucket(bucket, location="us-central1")  # You can change the location if needed
        print(f"Bucket {bucket_name} created.")
    else:
        print(f"Bucket {bucket_name} already exists.")

    # Upload a file
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)

    print(f"File {source_file_name} uploaded to {destination_blob_name} in bucket {bucket_name}. URI: gs://{bucket_name}/{destination_blob_name}")
```

```python
# Example usage
bucket_name = "example_bukcet"  # Replace with your bucket name
source_file_name = "questions.json"  # Replace with the name of your file
destination_blob_name = "questions.json"  # The name you want for the file in the bucket

upload_file_to_gcs(bucket_name, source_file_name, destination_blob_name)
```

```text
Bucket example_bukcet created.
File questions.json uploaded to questions.json in bucket example_bukcet. URI: gs://example_bukcet/questions.json
```

The JSON file was successfully uploaded to the bucket.

![GCS](../assets/embeddings/cloud-storage.png)


##### E. Create an Index

Now we're ready to load the embeddings into Vector Search. Its APIs are available under the [`aiplatform`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform) package of the SDK.

Create a [`MatchingEngineIndex`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex) with its `create_tree_ah_index` function (_Matching Engine_ is the previous name of Vector Search).

```python
# init the aiplatform package
from google.cloud import aiplatform

aiplatform.init(
    project=os.environ["PROJECT_ID"],  # replace with your own project
    credentials=credentials,
)
```

```python
# create index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"embvs-tutorial-index-{UID}",
    contents_delta_uri=f"gs://{bucket_name}",
    dimensions=768,
    location="us-central1",
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)
```

```text
Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/840606066459/locations/us-central1/indexes/950721316258840576/operations/3394750878131945472
MatchingEngineIndex created. Resource name: projects/840606066459/locations/us-central1/indexes/950721316258840576
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/840606066459/locations/us-central1/indexes/950721316258840576')
```


Calling the `create_tree_ah_index` function starts building an Index. This takes a few minutes if the dataset is small, and about 50 minutes or more for larger datasets, depending on their size. We can check the status of the index creation on the [Vector Search Console > INDEXES](https://console.cloud.google.com/vertex-ai/matching-engine/indexes) tab.

![Vector Search Creation](../assets/embeddings/vector-search-creation.png)

**The parameters for creating index**

* `contents_delta_uri`: The URI of Cloud Storage directory where we stored the embedding JSON files
* `dimensions`: Dimension size of each embedding. In this case, it is 768 as we are using the embeddings from the Text Embeddings API.
* `approximate_neighbors_count`: how many similar items we want to retrieve in typical cases
* `distance_measure_type`: what metrics to measure distance/similarity between embeddings. In this case it's DOT_PRODUCT_DISTANCE

See the [document](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) for more details on creating Index and the parameters.

**Batch Update or Streaming Update?**

There are two types of index: `Index for Batch Update` (used in this tutorial) and `Index for Streaming Updates`. The Batch Update index is updated with a batch process, whereas the Streaming Update index can be updated in real time. The latter is better suited to use cases where you add or update embeddings frequently and it is crucial to serve the latest embeddings, such as e-commerce product search.

##### F. Create Index Endpoint and deploy the Index

To use the Index, we need to create an Index Endpoint. It works as a server instance accepting query requests for our Index.


```python
# create IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"embvs-tutorial-index-endpoint-{UID}",
    public_endpoint_enabled=True,
)
```

```text
Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336/operations/3254576339730038784
MatchingEngineIndexEndpoint created. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336')
```


This tutorial utilizes a [`Public Endpoint`](https://cloud.google.com/vertex-ai/docs/vector-search/setup/setup#choose-endpoint) and does not support [`Virtual Private Cloud (VPC)`](https://cloud.google.com/vpc/docs/private-services-access). Unless you have a specific requirement for VPC, we recommend using a Public Endpoint. Despite the term "public" in its name, it does not imply open access to the public internet. Rather, it functions like other endpoints in Vertex AI services, which are secured by default through IAM. Without explicit IAM permissions, as we have previously established, no one can access the endpoint.

With the Index Endpoint, deploy the Index by specifying a unique deployed index ID.

```python
DEPLOYED_INDEX_ID = f"embvs_tutorial_deployed_{UID}"
```

```python
# deploy the Index to the Index Endpoint
my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)
```

```text
Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336/operations/8186017931700731904
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x000001D1C96E72B0> 
resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
```

If this is the first time an Index is deployed to the Index Endpoint, it takes around 25 minutes to automatically build and start the backend for it. After the first deployment, it finishes in seconds. To see the status of the index deployment, open the [Vector Search Console > INDEX ENDPOINTS](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints) tab and click the Index Endpoint.


![Vector Search Index Endpoint](../assets/embeddings/vector-search-index-endpoint.png)

##### G. Utilities

Creating or deploying Indexes can take tens of minutes, and we may lose the connection to the Jupyter/Colab runtime in the meantime. In that case, instead of creating or deploying a new Index again, we can look up the existing ones in the Vector Search Console and continue with them.

**Get an existing Index**

To get an Index object that already exists, replace the value of `my_index_id` below with your index ID and run the cell. We can check the ID on the [Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).


```python
my_index_id = "950721316258840576"  # @param {type:"string"}
my_index = aiplatform.MatchingEngineIndex(my_index_id)
print(my_index)
```

```text
<google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex object at 0x000001D1C96E7310> 
resource name: projects/840606066459/locations/us-central1/indexes/950721316258840576
```


**Get an existing Index Endpoint**

To get an Index Endpoint object that already exists, replace the value of `my_index_endpoint_id` below with your Index Endpoint ID and run the cell. We can check the ID on the [Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints).

```python
my_index_endpoint_id = "1115384177634574336"  # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)
print(my_index_endpoint)
```

```text
<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x000001D1C978CA90> 
resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
```


##### H. Run Query

Finally, we're ready to use Vector Search. The following code creates an embedding for a test question and finds similar questions with Vector Search.

In [37]:
test_embeddings = get_embeddings_wrapper(["How to read JSON with Python?"])

100%|██████████| 1/1 [00:02<00:00,  2.58s/it]


In [38]:
# inspect the embeddings of our input
print(test_embeddings)

[[-0.009405541233718395, -0.012011843733489513, 0.03272425755858421, -0.017712930217385292, 0.024488620460033417, -0.0064089735969901085, 0.027680082246661186, 0.01308376993983984, -0.028244134038686752, 0.01817595772445202, 0.025589874014258385, 0.038305673748254776, 0.03444906696677208, -0.024641847237944603, -0.020275678485631943, 0.0040742806158959866, -0.04790573567152023, -0.04560501500964165, 0.0008088672766461968, 0.026284685358405113, -0.04548338055610657, -0.019667573273181915, 0.005759553983807564, 0.009407688863575459, -0.031886711716651917, -0.10478324443101883, 0.030070681124925613, 0.013050236739218235, -0.007181359920650721, -0.03439131751656532, -0.04584177955985069, 0.03567810729146004, -0.025517383590340614, -0.04891308397054672, 0.021599987521767616, 0.04497474059462547, 0.04183082655072212, 0.008171860128641129, 0.002165100071579218, -0.0056815375573933125, 0.013008587062358856, -0.0028206240385770798, 0.038202106952667236, 0.02001797780394554, -0.02535422705113887

In [39]:
# Test query: find the 20 nearest neighbors of the test question's embedding
import numpy as np

response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_embeddings,
    num_neighbors=20,
)

# show the result: look up each neighbor's title in the DataFrame by its ID
for neighbor in response[0]:
    question_id = np.int64(neighbor.id)  # avoid shadowing the built-in id()
    similar = df.query("id == @question_id", engine="python")
    print(f"{neighbor.distance:.4f} {similar.title.values[0]}")

0.7774 read_json instead of open and json.load
0.7162 how to sum variables extracted from .json file?
0.7132 How to convert json to nested Maps using Gson?
0.7013 How to get batch predictions with jsonl data in sagemaker?
0.6919 How do I get React to accept my Promise of data?
0.6886 Easy Way To Import A Table from MS SQL Server to Django?
0.6855 Need help in extracting values from Json for Jmeter
0.6849 Pandas ValueError: Can only compare identically-labeled Series objects from single dataframe?
0.6748 View JSONModel in redis-cli after using python redis-om to save it to the database
0.6732 How to authenticate private API token in Python to access RightSignature
0.6715 Strange behavior on json_decode in PHP code
0.6708 To Find data inside tag how to write REGEX Expression?
0.6650 how to scrape the instagram followers popup with python playwright
0.6612 In Keras, how to use Model.predict function when looping over a tensorflow Dataset?
0.6608 Parsing dictionary from VMware module
0.659

The `find_neighbors` call takes only milliseconds to fetch similar items, even with billions of items in the Index, thanks to the ScaNN algorithm. Vector Search also supports [autoscaling](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public#autoscaling), which automatically resizes the number of nodes based on the demands of the workload.
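
To make concrete what a nearest-neighbor query computes, here is a minimal sketch of an exact, brute-force search over a few made-up vectors. It assumes a dot-product similarity where a larger score means "more similar", which is consistent with the scores printed above. It is not how Vector Search is implemented: the service uses approximate search with ScaNN, and the function name, IDs and vectors below are invented for illustration.

```python
from typing import Dict, List, Tuple


def brute_force_neighbors(
    query: List[float],
    items: Dict[str, List[float]],
    num_neighbors: int,
) -> List[Tuple[str, float]]:
    """Score every item against the query and return the top matches."""
    scored = [
        (item_id, sum(q * v for q, v in zip(query, vector)))
        for item_id, vector in items.items()
    ]
    # Larger dot product means more similar, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_neighbors]


# Toy 3-dimensional "embeddings" (made up for illustration).
toy_items = {
    "q1": [1.0, 0.0, 0.0],
    "q2": [0.6, 0.8, 0.0],
    "q3": [0.0, 0.0, 1.0],
}
top = brute_force_neighbors([1.0, 0.0, 0.0], toy_items, num_neighbors=2)
print(top)
```

The brute-force version scores every item on every query, so its cost grows linearly with the size of the dataset. Avoiding that full scan is the reason to use an approximate index at scale.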

## IMPORTANT: Cleaning Up

If we are using our own Cloud project rather than a temporary project on Qwiklabs, please make sure to delete all the Indexes, Index Endpoints and Cloud Storage buckets after finishing this tutorial. Otherwise the remaining resources will **incur unexpected costs**.

If we used Workbench, we may also need to delete the Notebooks from [the console](https://console.cloud.google.com/vertex-ai/workbench).

In [40]:
# create function to delete a bucket and all its contents
def delete_bucket_and_contents(bucket_name):
    """Deletes a bucket and all its contents in Google Cloud Storage."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Get the bucket
    bucket = storage_client.bucket(bucket_name)

    # Check if the bucket exists
    if bucket.exists():
        # Delete all the contents of the bucket
        blobs = bucket.list_blobs()
        for blob in blobs:
            blob.delete()
            print(f"Blob {blob.name} deleted.")

        # Delete the bucket
        bucket.delete()
        print(f"Bucket {bucket_name} deleted.")
    else:
        print(f"Bucket {bucket_name} does not exist or is already deleted.")



In [41]:
# delete the bucket
delete_bucket_and_contents(bucket_name)

Blob questions.json deleted.
Bucket example_bukcet deleted.


In [42]:
# delete Index Endpoint
my_index_endpoint.undeploy_all()
my_index_endpoint.delete(force=True)

# delete Index
my_index.delete()

Undeploying MatchingEngineIndexEndpoint index_endpoint: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
Undeploy MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336/operations/9050709060155867136
MatchingEngineIndexEndpoint index_endpoint undeployed. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
Deleting MatchingEngineIndexEndpoint : projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
Delete MatchingEngineIndexEndpoint  backing LRO: projects/840606066459/locations/us-central1/operations/7075880623553904640
MatchingEngineIndexEndpoint deleted. . Resource name: projects/840606066459/locations/us-central1/indexEndpoints/1115384177634574336
Deleting MatchingEngineIndex : projects/840606066459/locations/us-central1/indexes/950721316258840576
Delete MatchingEngineIndex  backing LRO: projects/840606066459/l
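
If one delete call raises an exception, the remaining calls in the same cell do not run, and billable resources can be left behind. As a hedged sketch of one way to guard against that, the helper below runs every cleanup step and reports which ones failed. `run_cleanup` and the stand-in steps are hypothetical and run offline; in the notebook the steps could wrap calls such as `my_index_endpoint.undeploy_all()` or `my_index.delete()`.

```python
from typing import Callable, Dict, List, Tuple


def run_cleanup(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run every cleanup step and record which ones succeeded or failed."""
    results: Dict[str, str] = {}
    for name, step in steps:
        try:
            step()
            results[name] = "ok"
        except Exception as exc:  # keep going so later resources still get deleted
            results[name] = f"failed: {exc}"
    return results


# Stand-in steps so the sketch runs without any cloud resources.
def failing_step() -> None:
    raise RuntimeError("resource busy")


report = run_cleanup([("bucket", lambda: None), ("index", failing_step)])
print(report)
```

Any step reported as failed still needs to be retried or deleted manually from the console.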

## Summary

### Grounding LLM outputs with Vertex AI Vector Search

As we have seen, by combining the Embeddings API and Vector Search, we can use the embeddings to "ground" LLM outputs to real business data with low latency.

For example, when a user asks a question, the Embeddings API converts it to an embedding, and a query is issued to Vector Search to find similar embeddings in its index. Those embeddings represent the actual business data in the databases. Because we only retrieve the business data and do not generate any artificial text, there is no risk of hallucinations in the result.

![Summary](../assets/embeddings/summary-1.png)
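
The flow described above can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation. The `embed` and `search` callables are placeholders for an embeddings call and a vector search call (for example `get_embeddings_wrapper` and `find_neighbors` from the earlier cells), and the stand-ins below are hypothetical so that the sketch runs offline.

```python
from typing import Callable, Dict, List, Sequence


def grounded_search(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    records: Dict[str, str],
    num_neighbors: int = 3,
) -> List[str]:
    """Return stored records for the nearest neighbors of a question."""
    query_vector = embed(question)
    neighbor_ids = search(query_vector, num_neighbors)
    # Only stored records are returned; no text is generated here.
    return [records[item_id] for item_id in neighbor_ids if item_id in records]


# Hypothetical stand-ins for the embedding model, the index and the database.
records = {"1": "How to parse JSON in Python?", "2": "How to shuffle rows in SQL?"}


def fake_embed(text: str) -> List[float]:
    return [float(len(text))]


def fake_search(vector: Sequence[float], k: int) -> List[str]:
    return ["1", "2"][:k]


print(grounded_search("read json", fake_embed, fake_search, records, num_neighbors=1))
```

Every returned string comes straight from `records`, which is why this pattern cannot produce text that is absent from the business data.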

#### The difference between the questions and answers

In this tutorial, we have used the Stack Overflow dataset for a reason: because the dataset has many pairs of **questions and answers**, we can simply find questions similar to our question to find answers to it.

In many business use cases, the semantics (meaning) of questions and answers are different. There are also cases where we want to add a variety of recommended or personalized items to the results, as in product search on e-commerce sites.

In these cases, simple semantic search doesn't work well. It is closer to a recommendation system problem, where we may want to train a model (e.g. a Two-Tower model) to learn the relationship between the question embedding space and the answer embedding space. Also, many production systems add a reranking phase after the semantic search to achieve higher search quality. Please see [Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture) to learn more.
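
As a rough illustration of a reranking phase, the sketch below reorders semantic search candidates by blending their similarity score with a second signal. The linear blend, the `popularity` signal and the weight are assumptions made for this example. Production rerankers are often learned models rather than a fixed formula.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    item_id: str
    similarity: float  # score from the semantic search phase
    popularity: float  # hypothetical business signal scaled to [0, 1]


def rerank(candidates: List[Candidate], weight: float = 0.3) -> List[Candidate]:
    """Reorder candidates by a weighted blend of similarity and popularity."""

    def blended(candidate: Candidate) -> float:
        return (1.0 - weight) * candidate.similarity + weight * candidate.popularity

    return sorted(candidates, key=blended, reverse=True)


candidates = [
    Candidate("a", similarity=0.78, popularity=0.10),
    Candidate("b", similarity=0.72, popularity=0.90),
    Candidate("c", similarity=0.69, popularity=0.20),
]
reranked = rerank(candidates)
print([candidate.item_id for candidate in reranked])
```

With these made-up numbers, item `b` moves ahead of `a` because its popularity outweighs the small difference in similarity.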

#### Hybrid of semantic + keyword search
Another typical challenge in production systems is supporting keyword search combined with semantic search. For example, in e-commerce product search, we may want to let users find a product by entering its product name or model number. Because the LLM doesn't memorize those product names or model numbers, semantic search alone can't handle this kind of "usual" search functionality.

Vertex AI Search is another product we may consider for those requirements. While Vector Search provides only a simple semantic search capability, Vertex AI Search provides an integrated search solution that combines semantic search, keyword search, reranking and filtering, available as an out-of-the-box tool.
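
For readers who want to experiment with hybrid search themselves, one common way to merge a keyword result list with a semantic result list is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique with made-up product IDs. It is not a description of how Vertex AI Search combines results internally.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one using reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute a larger score.
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)


semantic_hits = ["p2", "p7", "p1"]  # made-up IDs from a semantic search
keyword_hits = ["p1", "p2", "p9"]  # made-up IDs from a keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Items that appear in both lists (`p2` and `p1` here) rise to the top, while items found by only one method are still kept in the merged result.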

#### What about Retrieval Augmented Generation (RAG)?

In this tutorial, we have looked at the simple combination of LLM embeddings and vector search. From this starting point, we may also extend the design to [Retrieval Augmented Generation (RAG)](https://www.google.com/search?q=Retrieval+Augmented+Generation+(RAG)&oq=Retrieval+Augmented+Generation+(RAG)).

RAG is a popular architecture pattern for implementing grounding with an LLM behind a text chat UI. The idea is to have the LLM text chat UI act as a frontend for document retrieval with vector search and for summarization of the results.

![Summary with RAG](../assets/embeddings/summary-rag.png)

Each of the two solutions has pros and cons.


| Feature                   | Emb + vector search | RAG            |
|---------------------------|---------------------|----------------|
| Design                    | simple              | complex        |
| UI                        | Text search UI      | Text chat UI   |
| Summarization of result   | No                  | Yes            |
| Multi-turn (Context aware)| No                  | Yes            |
| Latency                   | milliseconds        | seconds        |
| Cost                      | lower               | higher         |
| Hallucinations            | No risk             | Some risk      |

The embeddings + vector search pattern covered in this tutorial provides simple, fast and low-cost semantic search with LLM intelligence. RAG adds a context-aware text chat experience and result summarization on top of it. While RAG provides a more "Gen AI-ish" experience, it also adds a risk of hallucination and higher cost and latency for the text generation.
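
To show where the extra generation step sits, here is a minimal sketch of the RAG pattern: retrieve passages, build a prompt from them, and hand the prompt to a text generation model. The prompt wording, `retrieve` and `generate` are all assumptions for illustration. The stand-ins below do not call any real model, so the sketch runs offline.

```python
from typing import Callable, List


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Build a prompt that asks the model to answer from the passages only."""
    context = "\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve passages for the question, then let the model summarize them."""
    return generate(build_rag_prompt(question, retrieve(question)))


# Hypothetical stand-ins: a fixed retriever and a fake "LLM" that only
# reports the prompt length instead of generating text.
def fake_retrieve(question: str) -> List[str]:
    return ["read_json instead of open and json.load"]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


prompt = build_rag_prompt("How to read JSON with Python?", fake_retrieve(""))
print(prompt)
print(answer_with_rag("How to read JSON with Python?", fake_retrieve, fake_generate))
```

The retrieval half is the same vector search as before. The added `generate` call is where the extra latency, cost and hallucination risk listed in the table come from.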

To learn more about how to build a RAG solution, you may look at [Building Generative AI applications made easy with Vertex AI PaLM API and LangChain](https://cloud.google.com/blog/products/ai-machine-learning/generative-ai-applications-with-vertex-ai-palm-2-models-and-langchain).

### Resources

To learn more, please check out the following resources:

#### Documentation

[Vertex AI Embeddings for Text API documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)

[Vector Search documentation](https://cloud.google.com/vertex-ai/docs/matching-engine/overview)

#### Vector Search blog posts

[Vertex Matching Engine: Blazing fast and massively scalable nearest neighbor search](https://cloud.google.com/blog/products/ai-machine-learning/vertex-matching-engine-blazing-fast-and-massively-scalable-nearest-neighbor-search)

[Find anything blazingly fast with Google's vector search technology](https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology)

[Enabling real-time AI with Streaming Ingestion in Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/real-time-ai-with-google-cloud-vertex-ai)

[Mercari leverages Google's vector search technology to create a new marketplace](https://cloud.google.com/blog/topics/developers-practitioners/mercari-leverages-googles-vector-search-technology-create-new-marketplace)

[Recommending news articles using Vertex AI Matching Engine](https://cloud.google.com/blog/products/ai-machine-learning/recommending-articles-using-vertex-ai-matching-engine)

[What is Multimodal Search: "LLMs with vision" change businesses](https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search)



