## Vertex AI Vector Search Quickstart

### Prerequisites
This tutorial requires a Google Cloud project that is linked to a billing account. Before starting, read and run [this notebook](01%20Text%20Embeddings%20and%20Vertex%20AI%20Vector%20Search.ipynb).

### 1. Set Up the Environment

Before getting started with the Vertex AI services, we need to set up the following:

* Install Python SDK
* Environment variables
* Authentication using Service Account
* Enable APIs
* Set IAM permissions (Vertex AI User, BigQuery User and Storage Admin)

```bash
!pip install --upgrade --user google-cloud-aiplatform google-cloud-storage "google-cloud-bigquery[pandas]" python-dotenv tqdm
```

The Vertex AI, Cloud Storage and BigQuery APIs can be accessed in multiple ways, including the REST API and the Python SDK. In this tutorial, we will use the SDK.

In [1]:
# import libraries
import os
import vertexai
from IPython.display import Markdown, display
from google.oauth2 import service_account
from dotenv import load_dotenv

In [2]:
# initiate service account (authentication)
json_path = '../llm-ai.json' # replace with your own service account
credentials = service_account.Credentials.from_service_account_file(json_path)

In [3]:
# start Vertex AI
load_dotenv()
vertexai.init(project=os.environ["PROJECT_ID"], # replace with your own project
              credentials=credentials)
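`os.environ["PROJECT_ID"]` raises a bare `KeyError` when the variable is missing from both the shell and the `.env` file. The following is a minimal sketch of a friendlier check. `require_env` is a hypothetical helper introduced here for illustration, not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "export it or add it to the .env file before calling vertexai.init()."
        )
    return value
```

After `load_dotenv()`, `require_env("PROJECT_ID")` could be used wherever the cells read `os.environ["PROJECT_ID"]`.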

In [4]:
# generate a unique ID for this session
from datetime import datetime

UID = datetime.now().strftime("%m%d%H%M")

print(UID)

02012009


### 2. Prepare Dataset

In this tutorial, we will use the [TheLook](https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce) dataset, which has a products table with about 30,000 rows of synthetic product data for a fictitious e-commerce clothing site.

In [5]:
# load the BQ Table into a Pandas Dataframe
import pandas as pd
from google.cloud import bigquery

ROWS_SIZE = 30000

bq_client = bigquery.Client(project=os.environ["PROJECT_ID"], credentials=credentials)
QUERY_TEMPLATE = """
        SELECT * FROM `bigquery-public-data.thelook_ecommerce.products`
        LIMIT {limit}
        """
query = QUERY_TEMPLATE.format(limit=ROWS_SIZE)
query_job = bq_client.query(query)
rows = query_job.result()
df = rows.to_dataframe()

# examine the data
df.head()

| | id | cost | category | name | brand | retail_price | department | sku | distribution_center_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13842 | 2.51875 | Accessories | Low Profile Dyed Cotton Twill Cap - Navy W39S55D | MG | 6.25 | Women | EBD58B8A3F1D72F4206201DA62FB1204 | 1 |
| 1 | 13928 | 2.33835 | Accessories | Low Profile Dyed Cotton Twill Cap - Putty W39S55D | MG | 5.95 | Women | 2EAC42424D12436BDD6A5B8A88480CC3 | 1 |
| 2 | 14115 | 4.87956 | Accessories | Enzyme Regular Solid Army Caps-Black W35S45D | MG | 10.99 | Women | EE364229B2791D1EF9355708EFF0BA34 | 1 |
| 3 | 14157 | 4.64877 | Accessories | Enzyme Regular Solid Army Caps-Olive W35S45D (... | MG | 10.99 | Women | 00BD13095D06C20B11A2993CA419D16B | 1 |
| 4 | 14273 | 6.50793 | Accessories | Washed Canvas Ivy Cap - Black W11S64C | MG | 15.99 | Women | F531DC20FDE20B7ADF3A73F52B71D0AF | 1 |


### 3. Generate Text Embeddings and Store Them as a JSONL File

In [6]:
# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [7]:
import time
import tqdm  # to show a progress bar

# get embeddings for a list of texts
BATCH_SIZE = 5


def get_embeddings_wrapper(texts):
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # to avoid the quota error
        result = model.get_embeddings(texts[i : i + BATCH_SIZE])
        embs.extend(e.values for e in result)  # append this batch's vectors in place
    return embs
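The fixed one-second sleep above reduces quota errors but does not recover from one. A minimal sketch of the same batching loop with retries and exponential backoff follows. `embed_in_batches` and its parameters are hypothetical names, and `embed_fn` stands in for any callable that maps a batch of texts to a list of vectors.

```python
import time
from typing import Callable, List, Sequence


def embed_in_batches(
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 5,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> List[List[float]]:
    """Embed texts batch by batch, retrying a failed batch with exponential backoff."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                embeddings.extend(embed_fn(batch))
                break  # this batch succeeded, move on to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                time.sleep(base_delay * 2**attempt)
    return embeddings
```

With the model loaded above, `embed_fn` could be `lambda batch: [e.values for e in model.get_embeddings(batch)]`. Catching every `Exception` keeps the sketch short; in practice it is probably better to retry only on quota or transient errors.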

**Run this code to generate the JSONL file locally**

In [8]:
# get embeddings for the name column and add them as "embedding" column
# df = df.assign(embedding=get_embeddings_wrapper(list(df.name)))
# df.head()

In [9]:
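# save id, name and embedding as a JSONL file
# (name is kept because step 8 reads it back; IDs are cast to strings to match the lookups there)
# out = df.assign(id=df["id"].astype(str))[["id", "name", "embedding"]]
# jsonl_string = out.to_json(orient="records", lines=True)
# with open("product-embs.json", "w") as f:
#     f.write(jsonl_string)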


For the purposes of this tutorial, we will instead download a pre-built JSONL file from a public bucket provided by Google Cloud: `gs://github-repo/data/vs-quickstart/product-embs.json`
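The download function below takes the bucket name and the blob name as separate arguments, while the location above is given as a single `gs://` URI. A small sketch of splitting one into the other is shown here. `split_gcs_uri` is a hypothetical helper, not an SDK function.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/blob' into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the blob name.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name in {uri!r}")
    return bucket_name, blob_name


print(split_gcs_uri("gs://github-repo/data/vs-quickstart/product-embs.json"))
# ('github-repo', 'data/vs-quickstart/product-embs.json')
```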

**Run this code to download the JSONL file from its Cloud Storage URI**

In [10]:
from google.cloud import storage

# define a function to download a file from a GCS bucket
def download_file_from_gcs(bucket_name, blob_name, local_file_path):
    """Downloads a file from a GCS bucket to a local file."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Get the bucket and blob (file) object
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # Download the file
    blob.download_to_filename(local_file_path)
    print(f"File {blob_name} downloaded from {bucket_name} to local path {local_file_path}.")




In [11]:
local_file_name = 'product-embs.json'
local_file_path = f"./{local_file_name}"

In [14]:
# get the file
download_file_from_gcs('github-repo', 'data/vs-quickstart/product-embs.json', local_file_path)

File data/vs-quickstart/product-embs.json downloaded from github-repo to local path ./product-embs.json.
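Before uploading the file, it can be worth checking that every record has the expected shape. The sketch below assumes each line is a JSON object with an `id` and an `embedding` list, and that the vectors have 768 dimensions, matching the `dimensions=768` used when the index is created later. `check_embeddings_file` is a hypothetical helper.

```python
import json
from typing import Dict


def check_embeddings_file(path: str, expected_dim: int = 768) -> Dict[str, int]:
    """Count records in a JSONL embeddings file and how many have an unexpected shape."""
    stats = {"records": 0, "bad": 0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            stats["records"] += 1
            record = json.loads(line)
            embedding = record.get("embedding")
            has_id = "id" in record
            has_vector = isinstance(embedding, list) and len(embedding) == expected_dim
            if not (has_id and has_vector):
                stats["bad"] += 1
    return stats
```

Calling `check_embeddings_file("product-embs.json")` returns the two counts; a non-zero `bad` value suggests the file should be inspected before it is used to build an index.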


### 4. Upload the data (JSONL) into Cloud Storage

In [15]:
# upload the file to GCS
def upload_file_to_gcs(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to Google Cloud Storage, creating the bucket if it doesn't exist."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Check if the bucket exists
    bucket = storage_client.bucket(bucket_name)
    if not bucket.exists():
        # Create a new bucket if it does not exist; pass the location directly,
        # because assigning to bucket.location is deprecated and emits a warning
        bucket = storage_client.create_bucket(bucket_name, location="us-central1")  # change the location if needed
        print(f"Bucket {bucket_name} created.")
    else:
        print(f"Bucket {bucket_name} already exists.")

    # Upload a file
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)

    print(f"File {source_file_name} uploaded to {destination_blob_name} in bucket {bucket_name}. URI: gs://{bucket_name}/{destination_blob_name}")

In [16]:
# Example usage
bucket_name = "example_bukcet"  # Replace with your bucket name
source_file_name = "product-embs.json"  # Replace with the name of your file
destination_blob_name = "product-embs.json"  # The object name to use inside the bucket

upload_file_to_gcs(bucket_name, source_file_name, destination_blob_name)



Bucket example_bukcet created.
File product-embs.json uploaded to product-embs.json in bucket example_bukcet. URI: gs://example_bukcet/product-embs.json


### 5. Create Vector Search Index

In [17]:
# init the aiplatform package
from google.cloud import aiplatform

aiplatform.init(project=os.environ["PROJECT_ID"], # replace with your own project
                credentials=credentials)

In [18]:
# create Index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"vs-quickstart-index-{UID}",
    contents_delta_uri=f"gs://{bucket_name}",
    dimensions=768,
    approximate_neighbors_count=10,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/840606066459/locations/us-central1/indexes/8273574310363267072/operations/1568673170628542464
MatchingEngineIndex created. Resource name: projects/840606066459/locations/us-central1/indexes/8273574310363267072
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/840606066459/locations/us-central1/indexes/8273574310363267072')


### 6. Create Index Endpoint and deploy the Index

In [19]:
# create IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"vs-quickstart-index-endpoint-{UID}", 
    public_endpoint_enabled=True
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664/operations/2324152008119943168
MatchingEngineIndexEndpoint created. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664')


In [20]:
DEPLOYED_INDEX_ID = f"vs_quickstart_deployed_{UID}"

In [21]:
# deploy the Index to the Index Endpoint (this can take 20-30 minutes the first time you deploy)
my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664/operations/5782916521940484096
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664


<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x000001D07910D640> 
resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664

### 7. Get Index Info


**Get an existing Index**

To get an Index object that already exists, replace the ID in the following cell with your own index ID and run the cell. The ID is listed on the [Vector Search Console > INDEXES tab](https://console.cloud.google.com/vertex-ai/matching-engine/indexes).


In [22]:
my_index_id = "8273574310363267072"  # @param {type:"string"}
my_index = aiplatform.MatchingEngineIndex(my_index_id)
print(my_index)

<google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex object at 0x000001D079102370> 
resource name: projects/840606066459/locations/us-central1/indexes/8273574310363267072


**Get an existing Index Endpoint**

To get an Index Endpoint object that already exists, replace the ID in the following cell with your own Index Endpoint ID and run the cell. The ID is listed on the [Vector Search Console > INDEX ENDPOINTS tab](https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints).

In [23]:
my_index_endpoint_id = "4827476170494705664"  # @param {type:"string"}
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(my_index_endpoint_id)
print(my_index_endpoint)

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x000001D07910DA60> 
resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
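The logs in this tutorial print full resource names such as `projects/.../locations/.../indexes/...`, while the two cells above use only the trailing numeric ID. A minimal sketch of pulling the pieces out of a resource name follows. `parse_resource_name` is a hypothetical helper, not part of the SDK.

```python
from typing import Dict


def parse_resource_name(resource_name: str) -> Dict[str, str]:
    """Map each collection segment of a resource name to the ID that follows it."""
    parts = resource_name.strip("/").split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"Expected alternating collection/ID segments: {resource_name!r}")
    # Even positions are collection names, odd positions are their IDs.
    return dict(zip(parts[0::2], parts[1::2]))


name = "projects/840606066459/locations/us-central1/indexes/8273574310363267072"
print(parse_resource_name(name)["indexes"])
# 8273574310363267072
```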


### 8. Run a Query with Vector Search

Vector Search is now ready to use. The following code looks up the embedding of a test product and finds similar products with Vector Search.

**Get an embedding to run a query**

First, load the embeddings JSONL file to build dicts of product names and embeddings, keyed by product ID.

In [24]:
import json

# build dicts for product names and embs
product_names = {}
product_embs = {}
with open("product-embs.json") as f:
    for line in f:
        p = json.loads(line)
        product_id = p["id"]  # avoid shadowing the built-in id()
        product_names[product_id] = p["name"]
        product_embs[product_id] = p["embedding"]

In [26]:
# get the embedding for ID 6523 "cloudveil women's excursion short"
# you can also try with other IDs such as 12711, 18090, 19536 and 11863
query_emb = product_embs["6523"]
print(query_emb)

[-0.015140533447265625, 0.029022620990872383, 0.043999187648296356, 0.0008045680006034672, 0.02479265257716179, -0.058345310389995575, 0.010426630266010761, 0.023504989221692085, -0.03466186299920082, -0.00134370313026011, 0.007397875655442476, -0.01431096438318491, 0.024990102276206017, 0.06665688753128052, 0.023334601894021034, -0.005286165047436953, -0.06492510437965393, -0.0345313623547554, 0.060259561985731125, 0.010223621502518654, -0.09199754148721695, 0.01886577345430851, 0.03483972325921059, -0.027113549411296844, -0.03256196156144142, -0.07872982323169708, 0.037879571318626404, -0.009713241830468178, -0.03232517093420029, -0.07063174992799759, 0.0024606185033917427, -0.015956062823534012, -0.003946097567677498, 0.021167505532503128, -0.008327499032020569, 0.055032506585121155, 0.019084438681602478, 0.0015176940942183137, 0.00926684495061636, 0.06493163108825684, 0.0036904136650264263, 0.02693367190659046, 0.04891353100538254, -0.001483380445279181, -0.0366176962852478, -0.013

**Run a Query**

Pass the embedding to the `find_neighbors` function to find similar product names.

In [27]:
# run query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID, queries=[query_emb], num_neighbors=10
)

# show the results (response[0] holds the neighbors of the first and only query)
for neighbor in response[0]:
    print(f"{neighbor.distance:.2f} {product_names[neighbor.id]}")

1.00 cloudveil women's excursion short
0.82 quiksilver womens cruiser short
0.80 xcvi women's alisal short
0.80 cloudveil men's kahuna short
0.78 ibex women's gozo short
0.78 sanctuary clothing women's coquette short
0.78 sunner women's collins printed short
0.77 hurley lowrider cargo 2.5 short - women's
0.77 stitch's women's fox knee length short
0.77 sanctuary clothing women's passenger skirt
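The index returns approximate neighbors, so a brute-force search over the local embeddings is a simple way to sanity-check the results above. The sketch below assumes the index ranks by dot product, which the self-match score of 1.00 is consistent with for unit-length vectors. `exact_neighbors` is a hypothetical helper.

```python
from typing import Dict, List, Sequence, Tuple


def exact_neighbors(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    num_neighbors: int = 10,
) -> List[Tuple[str, float]]:
    """Return the IDs with the highest dot product against the query, best first."""
    scored = [
        (item_id, sum(q * v for q, v in zip(query, vector)))
        for item_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_neighbors]
```

With the dicts built earlier in this step, `exact_neighbors(query_emb, product_embs)` yields `(id, score)` pairs that can be compared against the IDs returned by `find_neighbors`. Scanning all rows in pure Python is slow, which is the cost the approximate index is meant to avoid.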


### 9. Clean Up

In [28]:
# create function to delete a bucket and all its contents
def delete_bucket_and_contents(bucket_name):
    """Deletes a bucket and all its contents in Google Cloud Storage."""
    # Initialize a client
    storage_client = storage.Client(credentials=credentials)

    # Get the bucket
    bucket = storage_client.bucket(bucket_name)

    # Check if the bucket exists
    if bucket.exists():
        # Delete all the contents of the bucket
        blobs = bucket.list_blobs()
        for blob in blobs:
            blob.delete()
            print(f"Blob {blob.name} deleted.")

        # Delete the bucket
        bucket.delete()
        print(f"Bucket {bucket_name} deleted.")
    else:
        print(f"Bucket {bucket_name} does not exist or is already deleted.")



In [29]:
# delete the bucket
delete_bucket_and_contents(bucket_name)

Blob product-embs.json deleted.
Bucket example_bukcet deleted.


In [30]:
# delete Index Endpoint
my_index_endpoint.undeploy_all()
my_index_endpoint.delete(force=True)

# delete Index
my_index.delete()

Undeploying MatchingEngineIndexEndpoint index_endpoint: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
Undeploy MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664/operations/3338024874231726080
MatchingEngineIndexEndpoint index_endpoint undeployed. Resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
Deleting MatchingEngineIndexEndpoint : projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
Delete MatchingEngineIndexEndpoint  backing LRO: projects/840606066459/locations/us-central1/operations/6792285788424896512
MatchingEngineIndexEndpoint deleted. . Resource name: projects/840606066459/locations/us-central1/indexEndpoints/4827476170494705664
Deleting MatchingEngineIndex : projects/840606066459/locations/us-central1/indexes/8273574310363267072
Delete MatchingEngineIndex  backing LRO: projects/840606066459/