Source : https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/sdk_matching_engine_create_stack_overflow_embeddings.ipynb

See the notebook "GCP-MatchinEngine-3-understanding.ipynb" for a more detailed analysis.

In [1]:
from google.cloud import aiplatform,storage,bigquery
import os

In [2]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS']= "L:\\gcp-project-0523-628d01f95284.json"

In [3]:
PROJECT_ID = "gcp-project-0523"
REGION     = 'us-central1'
BUCKET     = 'gcp-project-0523-ann-bucket'
BUCKET_URI = 'gs://gcp-project-0523-ann-bucket'

In [4]:
aiplatform.init(project        = PROJECT_ID,
                location       = REGION,
                staging_bucket = BUCKET_URI)

In [5]:
%%time
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)

NUM_ROWS = 10000

QUERY = f"""
        SELECT distinct q.id, q.title, q.body, q.tags, a.body as answers, a.score 
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` where Score>0 ORDER BY View_Count desc) AS q 
              INNER JOIN 
              (SELECT * FROM `bigquery-public-data.stackoverflow.posts_answers`  where Score>0 ORDER BY Score desc) AS a 
              ON q.id = a.parent_id 
        where q.tags like '%python%'
        LIMIT {NUM_ROWS};
        """

query_job = client.query(QUERY)
rows = query_job.result()


# Convert to a dataframe
df = rows.to_dataframe()

# Examine the data
df.head()

CPU times: total: 156 ms
Wall time: 14.9 s


Unnamed: 0,id,title,body,tags,answers,score
0,65901634,Matplotlib figure '.supxlabel' does not work,<p>I'm trying to set figure labels for my cond...,python|matplotlib|plot|attributeerror,<p>I got into the same problem when using the ...,14
1,33436221,Displaying rotatable 3D plots in IPython or Ju...,<p>(Mac OSX 10.10.5)</p>\n\n<p>I can reproduce...,macos|matplotlib|plot|jupyter-notebook|ipython,<p>Use <code>%matplotlib notebook</code> inste...,157
2,7791574,How can I print a Python file's docstring when...,<p>I have a Python script with a docstring. Wh...,python|docstring,<p>Here is an alternative that does not hardco...,14
3,12410242,python capitalize first letter only,<p>I am aware .capitalize() capitalizes the fi...,python|capitalize|letter,<p>This is similar to @Anon's answer in that i...,40
4,39402795,How to pad a string with leading zeros in Pyth...,<p>I'm trying to make <code>length = 001</code...,python|python-3.x|math|rounding,<p>Make use of the <code>zfill()</code> helper...,176


In [6]:
# Extract the question ids and question text
ids = df.id.tolist()
# Verify the length
len(ids)

10000

In [7]:
questions = df.title.tolist()
len(questions)

10000

In [72]:
df[df.id==26678457]

Unnamed: 0,id,title,body,tags,answers,score
2392,26678457,How do I install python3-gi within virtualenv?,"<p>I'm following the <a href=""http://python-gt...",python|python-3.x|virtualenv,"<p>I installed <a href=""https://pypi.python.or...",1


### Instantiate the text encoding model

Use the sentence-t5 encoder developed by Google for converting text to embeddings.
https://tfhub.dev/google/sentence-t5/st5-base/1

    The sentence-T5 family of models encode text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language processing tasks.

    Our model is built on top of T5 (i.e. the Text-To-Text Transfer Transformer). It is trained on a variety of data sources and initialized from pre-trained T5 models with different model sizes. The input is variable-length English text and the output is a 768-dimensional vector. The sentence-T5 base model employs a 12-layer transformer architecture as the T5 base model does.

In [8]:
import tensorflow as tf
import tensorflow_hub as hub
# Registers the ops.
import tensorflow_text as text  # noqa: F401

hub_url = "https://tfhub.dev/google/sentence-t5/st5-base/1"

encoder = hub.KerasLayer(hub_url)



### Defining an encoding function
Define a function that takes sentences and converts them to embeddings. Later cells reuse it.

In [9]:
from typing import List

import numpy as np
from tqdm.auto import tqdm

In [10]:
def encode_text_to_embedding(text_encoder: hub.KerasLayer,
                             sentences: List[str], 
                             batch_size: int = 500
                            ) -> np.ndarray:
    embeddings_list = []

    # Process data in chunks to prevent out-of-memory errors
    for i in tqdm(range(0, len(sentences), batch_size)):
        batch = sentences[i : i + batch_size]
        embeddings_list.append(text_encoder(tf.constant(batch)))

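    # Stack the per-batch outputs along the row axis; squeeze then drops length-1 axes,
    # so many sentences yield an (n, 768) array and a single sentence yields a 1-D vector
    return np.squeeze(np.column_stack(embeddings_list))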
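
To make the returned shape concrete, the sketch below replays the same batching and stacking steps with a stand-in encoder, so it runs without TensorFlow. `fake_encoder` and `stack_batches` are hypothetical names, and the sketch assumes the real encoder returns a one-element list holding a `(batch, dim)` array.

```python
import numpy as np

DIM = 4  # toy dimension; the real model emits 768 values per sentence


def fake_encoder(batch):
    """Hypothetical stand-in for the TF Hub layer: returns [array of shape (len(batch), DIM)]."""
    return [np.ones((len(batch), DIM), dtype=np.float32)]


def stack_batches(sentences, batch_size=500):
    """Same batching and stacking steps as encode_text_to_embedding, minus TensorFlow."""
    outputs = []
    for start in range(0, len(sentences), batch_size):
        outputs.append(fake_encoder(sentences[start : start + batch_size]))
    # Each output becomes a (1, batch, DIM) array; column_stack joins them along axis 1
    # and squeeze drops every length-1 axis.
    return np.squeeze(np.column_stack(outputs))


many = stack_batches(["q"] * 700)   # two batches: 500 + 200
single = stack_batches(["q"])       # one sentence
print(many.shape, single.shape)
```

Under that assumption a single sentence comes back as a 1-D vector rather than a `(1, dim)` matrix, which is why the query cells later wrap `test_embeddings.tolist()` in an outer list.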

### Test the encoding function

Encode a subset of data and see if the embeddings and distance metrics make sense.

According to the sentence-T5 research paper (https://arxiv.org/pdf/2108.08877.pdf), the similarity of embeddings is calculated using the dot-product.
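
As a small illustration of dot-product ranking, the sketch below scores a toy corpus against a query and keeps the top matches. The vectors are made up for the example and `top_k_by_dot_product` is a hypothetical helper. For unit-length vectors the dot product equals the cosine similarity, so a vector scores 1.0 against itself.

```python
import numpy as np


def top_k_by_dot_product(query, corpus, k):
    """Return (indices, scores) of the k rows of corpus with the largest dot product with query."""
    scores = corpus @ query
    order = np.argsort(-scores)[:k]  # negate so the largest scores come first
    return order, scores[order]


# Toy unit-length vectors standing in for sentence embeddings.
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_by_dot_product(corpus[0], corpus, k=2)
print(indices, scores)
```

Using `np.argsort` on the score vector avoids sorting Python tuples, which may matter once the corpus grows beyond a few hundred rows.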

In [12]:
# Encode 500 questions

questions = df.title.tolist()[:500]

question_embeddings = encode_text_to_embedding(text_encoder=encoder, sentences=questions )

  0%|          | 0/1 [00:00<?, ?it/s]

In [13]:
DIMENSIONS = len(question_embeddings[0])

In [14]:
DIMENSIONS

768

In [15]:
question_index = 0

print(f"Query question = {questions[question_index]}")
scores = np.dot(question_embeddings[question_index], question_embeddings.T)

# Print top 20 matches
for index, (question, score) in enumerate(
    sorted(zip(questions, scores), key=lambda x: x[1], reverse=True)[:20]
):
    print(f"\t{index}: {question}: {score}")

Query question = Matplotlib figure '.supxlabel' does not work
	0: Matplotlib figure '.supxlabel' does not work: 1.0
	1: Why Python ggplot returns name 'aes' is not defined?: 0.8768320083618164
	2: When I use matplotlib in jupyter notebook,it always raise " matplotlib is currently using a non-GUI backend" error?: 0.8750013113021851
	3: Pylint showing invalid variable name in output: 0.8627825379371643
	4: How to avoid overlapping of labels & autopct in a matplotlib pie chart?: 0.8541396260261536
	5: Keras: "RuntimeError: Failed to import pydot." after installing graphviz and pydot: 0.8539811372756958
	6: Break // in x axis of matplotlib: 0.8530298471450806
	7: 'image "pyimage2" doesn't exist'?: 0.8514245748519897
	8: Tkinter code using font module can't run from command line?: 0.8474218845367432
	9: Cannot find col function in pyspark: 0.8422090411186218
	10: Python pickle error: UnicodeDecodeError: 0.8415253758430481
	11: Python pickle error: UnicodeDecodeError: 0.8415253758430481
	12:

In [16]:
scores

array([1.        , 0.77600867, 0.75355935, 0.7701317 , 0.7532948 ,
       0.803599  , 0.7683612 , 0.73177254, 0.81928754, 0.712507  ,
       0.7486217 , 0.75650936, 0.6917933 , 0.71170557, 0.7674632 ,
       0.72664917, 0.6946776 , 0.8076164 , 0.7615867 , 0.8049627 ,
       0.735347  , 0.7117299 , 0.75184864, 0.6867894 , 0.78570294,
       0.75756395, 0.7482575 , 0.7359009 , 0.74563944, 0.80372405,
       0.71766245, 0.71312696, 0.6750647 , 0.6801456 , 0.7290168 ,
       0.7963338 , 0.7509193 , 0.74277556, 0.7329016 , 0.74573076,
       0.7330417 , 0.75013065, 0.6765859 , 0.71299046, 0.73668087,
       0.73551565, 0.8079541 , 0.75317436, 0.78902376, 0.7890237 ,
       0.71709245, 0.71832186, 0.76755947, 0.7152434 , 0.73595476,
       0.7743169 , 0.7494388 , 0.8066292 , 0.7736256 , 0.7461685 ,
       0.69805485, 0.85302985, 0.7527468 , 0.82494515, 0.74771786,
       0.70103115, 0.6528059 , 0.8088636 , 0.74008924, 0.7430892 ,
       0.78183854, 0.7922012 , 0.7851174 , 0.7579281 , 0.69561

In [17]:
question_embeddings

array([[-0.06134836, -0.03442549,  0.02700819, ...,  0.0046175 ,
        -0.02479105, -0.01712988],
       [-0.01910195, -0.02929889,  0.02678945, ...,  0.04623033,
        -0.01961115,  0.00468232],
       [-0.02352966, -0.01513609,  0.01258736, ...,  0.01607407,
        -0.04259783, -0.03183158],
       ...,
       [-0.03252523, -0.01787602,  0.00703245, ...,  0.02072232,
         0.01635229, -0.00986958],
       [-0.05336634, -0.01497216,  0.03298664, ..., -0.00926095,
        -0.02550137, -0.00674705],
       [-0.06846734, -0.03369173,  0.00343583, ...,  0.02068361,
        -0.03420502, -0.01852596]], dtype=float32)

In [22]:
len(question_embeddings)

500

#### Save the train split in JSONL format.

The data must be in JSONL format: each embedding dictionary is written as a JSON string on its own line.

For more information, see "Input data format and structure" in the docs (https://cloud.google.com/vertex-ai/docs/matching-engine/match-eng-setup/format-structure#json).
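
A minimal sketch of the record layout, using an in-memory string instead of a file. `to_jsonl` is a hypothetical helper and the vectors are toy values. The sketch writes embedding values as JSON numbers, which is an assumption about what the linked format page expects; the notebook cell that follows writes each value as a string instead.

```python
import io
import json


def to_jsonl(ids, embeddings):
    """Serialize (id, embedding) pairs as one JSON object per line."""
    lines = []
    for item_id, vector in zip(ids, embeddings):
        record = {"id": str(item_id), "embedding": [float(v) for v in vector]}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


payload = to_jsonl([65901634, 33436221], [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]])

# Read it back line by line, the way a JSONL consumer would.
records = [json.loads(line) for line in io.StringIO(payload)]
print(records[0]["id"], len(records[0]["embedding"]))
```

Reading the payload back line by line is a cheap check that every line is a complete JSON object before the file is uploaded.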

In [27]:
import json

BATCH_SIZE = 500

questions = df.title.tolist()[:5000]

embeddings_file_name='embeddings_file.json'

print('# of questions are : ', str(len(questions)))

with open(embeddings_file_name, "w") as f:
    for i in tqdm(range(0, len(questions), BATCH_SIZE)):
        id_chunk = ids[i : i + BATCH_SIZE]

        question_chunk_embeddings = encode_text_to_embedding(text_encoder=encoder,
                                                             sentences=questions[i : i + BATCH_SIZE])

        # Write one JSON line per (id, embedding) pair
        embeddings_formatted = [
            json.dumps(
                {
                    "id": str(question_id),
                    "embedding": [str(value) for value in embedding],
                }
            )
            + "\n"
            for question_id, embedding in zip(id_chunk, question_chunk_embeddings)
        ]
        f.writelines(embeddings_formatted)
        print(i)  # progress: offset of the batch just written

# of questions are :  5000

0
500
1000
1500
2000
2500
3000
3500
4000
4500

In [29]:
UNIQUE_FOLDER_NAME = "embeddings_folder_unique"
remote_folder = f"{BUCKET_URI}/{UNIQUE_FOLDER_NAME}/"
! gsutil cp {embeddings_file_name} {remote_folder}

Copying file://embeddings_file.json [Content-Type=application/json]...
/ [1 files][ 56.4 MiB/ 56.4 MiB]

Operation completed over 1 objects/56.4 MiB.


## Create Indexes

### Create ANN Index (for Production Usage)

In [31]:
DISPLAY_NAME = "stack_overflow_index"
DESCRIPTION  = "questions from stackoverflow"

In [32]:
tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(display_name                = DISPLAY_NAME,
                                                                    contents_delta_uri          = remote_folder,
                                                                    dimensions                  = DIMENSIONS,
                                                                    approximate_neighbors_count = 150,
                                                                    distance_measure_type       = "DOT_PRODUCT_DISTANCE",
                                                                    leaf_node_embedding_count   = 500,
                                                                    leaf_nodes_to_search_percent= 80,
                                                                    description                 = DESCRIPTION
                                                                   )
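
As a back-of-the-envelope reading of these settings, the sketch below estimates how many leaves the 5,000 uploaded embeddings would occupy and how many a query would scan. This is an assumption about how `leaf_node_embedding_count` and `leaf_nodes_to_search_percent` interact, not a description of the service's internals (the real tree need not fill leaves evenly). `estimate_leaf_search` is a hypothetical helper.

```python
import math


def estimate_leaf_search(num_vectors, leaf_node_embedding_count, leaf_nodes_to_search_percent):
    """Rough estimate: (leaf count, leaves scanned per query, vectors scored per query)."""
    leaves = math.ceil(num_vectors / leaf_node_embedding_count)
    searched = math.ceil(leaves * leaf_nodes_to_search_percent / 100)
    return leaves, searched, min(num_vectors, searched * leaf_node_embedding_count)


# Values taken from the index-creation cell above, with 5,000 embeddings uploaded.
print(estimate_leaf_search(5000, 500, 80))
```

If the estimate is roughly right, an 80% search fraction scans most of the data on every query, trading speed for recall on a corpus this small.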

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/473197248954/locations/us-central1/indexes/552069186352840704/operations/7286602301895081984
MatchingEngineIndex created. Resource name: projects/473197248954/locations/us-central1/indexes/552069186352840704
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/473197248954/locations/us-central1/indexes/552069186352840704')


In [33]:
INDEX_RESOURCE_NAME = tree_ah_index.resource_name
INDEX_RESOURCE_NAME

'projects/473197248954/locations/us-central1/indexes/552069186352840704'

## Deploy Indexes

#### Create Endpoint without VPC

In [34]:
DISPLAY_NAME = "stack_overflow_endpoint"
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(display_name = DISPLAY_NAME,
                                                                  description  = DESCRIPTION,
                                                                  public_endpoint_enabled = True,
                                                                  #network      = VPC_NETWORK_FULL,
                                                                 )

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920/operations/3787727603893272576
MatchingEngineIndexEndpoint created. Resource name: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920')


#### Create Endpoint with VPC

In [99]:
PROJECT_NUMBER =   473197248954
VPC_NETWORK    =   "vector-db-vpc"

VPC_NETWORK_FULL = "projects/{}/global/networks/{}".format(PROJECT_NUMBER, VPC_NETWORK)
VPC_NETWORK_FULL

'projects/473197248954/global/networks/vector-db-vpc'

In [95]:
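# NOTE: this appears to fail because the Windows shell passes the single quotes through to
# gcloud (the error below shows the name with its quotes). Unquoted {PROJECT_ID}, as used in later cells, avoids this.
!gcloud config set project 'gcp-project-0523'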

ERROR: (gcloud.config.set) The project property must be set to a valid project ID, not the project name ['gcp-project-0523']
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project


In [91]:
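# NOTE: same quoting problem as the previous cell; the unquoted --project={PROJECT_ID} form used later runs without this error.
!gcloud services enable servicenetworking.googleapis.com --project='gcp-project-0523'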

ERROR: (gcloud) The project property must be set to a valid project ID, not the project name ['gcp-project-0523']
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project


In [96]:
DISPLAY_NAME = "stack_overflow_vpc"
DESCRIPTION  = "questions from stackoverflow with vpc"

In [104]:
my_vpc_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(display_name = DISPLAY_NAME,
                                                                      description  = DESCRIPTION,
                                                                      network      = VPC_NETWORK_FULL,
                                                                     )

InvalidArgument: 400 Cannot use vpc projects/473197248954/global/networks/vector-db-vpc for project 473197248954. Error NETWORK_NOT_PEERED

### Creating a VPC

In [107]:
VPC_NETWORK = "vector-db-vpc2"  # @param {type:"string"}
PEERING_RANGE_NAME = "ann-haystack-range"
PROJECT_ID = "gcp-project-0523"

In [108]:
! gcloud compute networks create {VPC_NETWORK} --bgp-routing-mode=regional --subnet-mode=auto --project={PROJECT_ID}

NAME            SUBNET_MODE  BGP_ROUTING_MODE  IPV4_RANGE  GATEWAY_IPV4
vector-db-vpc2  AUTO         REGIONAL


Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/networks/vector-db-vpc2].

Instances on this network will not be reachable until firewall rules
are created. As an example, you can allow all internal traffic between
instances as well as SSH, RDP, and ICMP by running:

$ gcloud compute firewall-rules create <FIREWALL_NAME> --network vector-db-vpc2 --allow tcp,udp,icmp --source-ranges <IP_RANGE>
$ gcloud compute firewall-rules create <FIREWALL_NAME> --network vector-db-vpc2 --allow tcp:22,tcp:3389,icmp



In [109]:
# Add necessary firewall rules
! gcloud compute firewall-rules create {VPC_NETWORK}-allow-icmp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow icmp

! gcloud compute firewall-rules create {VPC_NETWORK}-allow-internal --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow all --source-ranges 10.128.0.0/9

! gcloud compute firewall-rules create {VPC_NETWORK}-allow-rdp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:3389

! gcloud compute firewall-rules create {VPC_NETWORK}-allow-ssh --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:22


NAME                       NETWORK         DIRECTION  PRIORITY  ALLOW  DENY  DISABLED
vector-db-vpc2-allow-icmp  vector-db-vpc2  INGRESS    65534     icmp         False


Creating firewall...
..Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/firewalls/vector-db-vpc2-allow-icmp].
done.


NAME                           NETWORK         DIRECTION  PRIORITY  ALLOW  DENY  DISABLED
vector-db-vpc2-allow-internal  vector-db-vpc2  INGRESS    65534     all          False


Creating firewall...
..Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/firewalls/vector-db-vpc2-allow-internal].
done.


NAME                      NETWORK         DIRECTION  PRIORITY  ALLOW     DENY  DISABLED
vector-db-vpc2-allow-rdp  vector-db-vpc2  INGRESS    65534     tcp:3389        False


Creating firewall...
..Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/firewalls/vector-db-vpc2-allow-rdp].
done.


NAME                      NETWORK         DIRECTION  PRIORITY  ALLOW   DENY  DISABLED
vector-db-vpc2-allow-ssh  vector-db-vpc2  INGRESS    65534     tcp:22        False


Creating firewall...
..Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/firewalls/vector-db-vpc2-allow-ssh].
done.
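
The `--source-ranges 10.128.0.0/9` value in the allow-internal rule can be sanity-checked with the standard `ipaddress` module. In the sketch below, `10.128.0.0/20` is an example auto-mode subnet and `10.25.0.0/16` is a hypothetical /16 block chosen only because the private endpoint address `10.25.0.5` appears elsewhere in this notebook; neither is read from the project.

```python
import ipaddress

internal_range = ipaddress.ip_network("10.128.0.0/9")

# An example auto-mode subnet (assumed for illustration) sits inside the allowed range.
example_subnet = ipaddress.ip_network("10.128.0.0/20")
inside = example_subnet.subnet_of(internal_range)

# An address such as 10.25.0.5 lies outside 10.128.0.0/9.
peered_address_inside = ipaddress.ip_address("10.25.0.5") in internal_range

# A /16 reservation (the prefix length used for the peering range) holds 2**16 addresses.
peering_block_size = ipaddress.ip_network("10.25.0.0/16").num_addresses

print(inside, peered_address_inside, peering_block_size)
```

The check only concerns address arithmetic; whether additional firewall rules are needed for peered services is outside what this sketch shows.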


In [110]:
# Reserve IP range
! gcloud compute addresses create {PEERING_RANGE_NAME} --global --prefix-length=16 --network={VPC_NETWORK} --purpose=VPC_PEERING --project={PROJECT_ID} --description="peering range"


Created [https://www.googleapis.com/compute/v1/projects/gcp-project-0523/global/addresses/ann-haystack-range].


In [112]:
# Set up peering with service networking
# Your account must have the "Compute Network Admin" role to run the following.
! gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network={VPC_NETWORK} --ranges={PEERING_RANGE_NAME} --project={PROJECT_ID}

Operation "operations/pssn.p24-473197248954-d112d0de-17ab-4c31-b407-bf9bb9203893" finished successfully.


In [125]:
!gcloud services enable servicenetworking.googleapis.com --project={PROJECT_ID}

### Create Endpoint on VPC

In [52]:
DISPLAY_NAME = "stack_overflow_endpoint_vpc"
my_vpc_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(display_name = DISPLAY_NAME,
                                                                      description  = DESCRIPTION,
                                                                      network      = 'projects/473197248954/global/networks/vector-db-vpc2',
                                                                     )

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312/operations/249024196686905344
MatchingEngineIndexEndpoint created. Resource name: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312')


In [53]:
my_vpc_index_endpoint

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x000001C9EFD261F0> 
resource name: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312

In [54]:
my_vpc_index_endpoint.gca_resource

name: "projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312"
display_name: "stack_overflow_endpoint_vpc"
description: "questions from stackoverflow"
etag: "AMEw9yNk1ns2cf4ffTCVXKGB9vPnqhTbk6Rx4tiSXAQ0jktTnljJ2m9n79AYUqZIqCSC"
create_time {
  seconds: 1686003188
  nanos: 59508000
}
update_time {
  seconds: 1686003188
  nanos: 663817000
}
network: "projects/473197248954/global/networks/vector-db-vpc2"

#### Deploy ANN Index on Private Endpoint (VPC)

In [55]:
DEPLOYED_INDEX_ID = "stack_overflow_index_deployed_vpc"

In [56]:
tree_ah_index

<google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex object at 0x000001C9C5F1FB50> 
resource name: projects/473197248954/locations/us-central1/indexes/552069186352840704

In [57]:
my_vpc_index_endpoint = my_vpc_index_endpoint.deploy_index(index             = tree_ah_index,
                                                           deployed_index_id = DEPLOYED_INDEX_ID)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312/operations/8776590091112939520
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312


In [120]:
my_vpc_index_endpoint.deployed_indexes

[id: "deployed_index_id_vpc_unique"
index: "projects/473197248954/locations/us-central1/indexes/6450658798301347840"
create_time {
  seconds: 1685905719
  nanos: 335164000
}
private_endpoints {
  match_grpc_address: "10.25.0.5"
}
index_sync_time {
  seconds: 1685907894
  nanos: 894871000
}
automatic_resources {
  min_replica_count: 2
  max_replica_count: 2
}
deployment_group: "default"
]

#### Create Online Queries

In [126]:
test_embeddings = encode_text_to_embedding(text_encoder = encoder,
                                           sentences    = ["How do I install tensorflow with GPU support?"]
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [127]:
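# Test query
# NOTE: a private endpoint is reachable only from inside the peered VPC; a client outside
# that network (likely the case here) cannot connect to the gRPC address 10.25.0.5.
NUM_NEIGHBOURS = 20
DEPLOYED_INDEX_ID = "deployed_index_id_vpc_unique"

response = my_vpc_index_endpoint.match(deployed_index_id = DEPLOYED_INDEX_ID,
                                       queries           = [test_embeddings.tolist()],
                                       num_neighbors     = NUM_NEIGHBOURS,
                                      )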

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.25.0.5:10000: WSA Error"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.25.0.5:10000: WSA Error {grpc_status:14, created_time:"2023-06-04T20:38:58.044130574+00:00"}"
>

In [128]:
import grpc
import match_service_pb2
import match_service_pb2_grpc

ModuleNotFoundError: No module named 'match_service_pb2'

In [129]:
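# A shell command needs a leading "!" in a notebook cell; as plain Python this line is a syntax error
./grpc_cli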

SyntaxError: invalid syntax (730042570.py, line 1)

#### Deploy ANN Index on Public Endpoint

In [35]:
DEPLOYED_INDEX_ID = "stack_overflow_index_deployed"

In [36]:
my_index_endpoint = my_index_endpoint.deploy_index(index             = tree_ah_index,
                                                   deployed_index_id = DEPLOYED_INDEX_ID)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920/operations/5517109860803543040


TimeoutError: Operation did not complete within the designated timeout of 900 seconds.

In [37]:
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920')

In [38]:
my_index_endpoint.deployed_indexes

[id: "stack_overflow_index_deployed"
index: "projects/473197248954/locations/us-central1/indexes/552069186352840704"
create_time {
  seconds: 1685998888
  nanos: 17952000
}
index_sync_time {
  seconds: 1686001183
  nanos: 567720000
}
automatic_resources {
  min_replica_count: 2
  max_replica_count: 2
}
deployment_group: "default"
]
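
The `seconds` fields in this output are Unix epoch timestamps. The sketch below converts the two values shown above and measures the gap between them. Treating that gap as a proxy for how long the deployment took is an assumption; if it holds, it may help explain why the `deploy_index` call exceeded its 900-second client timeout even though the index ended up deployed.

```python
from datetime import datetime, timezone

# Values copied from the deployed_indexes output above.
create_time = datetime.fromtimestamp(1685998888, tz=timezone.utc)
index_sync_time = datetime.fromtimestamp(1686001183, tz=timezone.utc)

elapsed = index_sync_time - create_time
print(create_time.isoformat(), index_sync_time.isoformat())
print(f"{elapsed.total_seconds():.0f} s (~{elapsed.total_seconds() / 60:.0f} min)")
```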

In [39]:
my_index_endpoint.gca_resource

name: "projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920"
display_name: "stack_overflow_endpoint"
description: "questions from stackoverflow"
deployed_indexes {
  id: "stack_overflow_index_deployed"
  index: "projects/473197248954/locations/us-central1/indexes/552069186352840704"
  create_time {
    seconds: 1685998888
    nanos: 17952000
  }
  index_sync_time {
    seconds: 1686001183
    nanos: 567720000
  }
  automatic_resources {
    min_replica_count: 2
    max_replica_count: 2
  }
  deployment_group: "default"
}
etag: "AMEw9yMtt6hKFuRfSz4EO4iUVQcZHIbukPOP0Sjf1D8YIZath9agy02x8UigbtsAqDah"
create_time {
  seconds: 1685998799
  nanos: 194929000
}
update_time {
  seconds: 1685998799
  nanos: 897639000
}
public_endpoint_domain_name: "195708010.us-central1-473197248954.vdb.vertexai.goog"

#### Create Online Queries

In [31]:
test_embeddings = encode_text_to_embedding(text_encoder = encoder,
                                           sentences    = ["How do I install tensorflow with GPU support?"]
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [32]:
test_embeddings

array([-3.46920043e-02, -6.79826736e-02,  4.70267944e-02,  7.88554735e-03,
       -3.78703587e-02, -3.27263549e-02, -4.16305335e-03,  4.51407507e-02,
       -1.70209594e-02, -4.59558144e-02,  3.49068902e-02, -3.51876095e-02,
        7.97955692e-02, -1.50319990e-02,  6.50781542e-02,  3.16455401e-02,
        1.23584569e-02, -4.27518925e-03,  2.01913230e-02,  2.80608609e-02,
        3.12974416e-02,  1.53770046e-02, -3.46761718e-02,  2.38961838e-02,
       -4.36633639e-02,  3.24505456e-02,  5.36020733e-02,  1.21157384e-02,
       -2.31851898e-02,  1.74899139e-02,  6.93742782e-02, -4.13260534e-02,
       -1.86444726e-02,  4.57508117e-02, -2.05389801e-02,  1.13655291e-02,
       -4.48801853e-02, -6.17317632e-02, -3.50356214e-02,  4.99744229e-02,
       -6.11441722e-03, -4.73947413e-02, -3.82532515e-02,  4.20878232e-02,
       -3.89112085e-02, -3.85740474e-02,  3.27653885e-02,  6.00255728e-02,
       -7.06862332e-03, -2.63761580e-02, -4.46173511e-02, -4.22799438e-02,
       -4.73105833e-02, -

In [40]:
# Test query
NUM_NEIGHBOURS = 20

response = my_index_endpoint.match(deployed_index_id = DEPLOYED_INDEX_ID,
                                   queries           = [test_embeddings.tolist()],
                                   num_neighbors     = NUM_NEIGHBOURS,
                                  )

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for :10000: UNKNOWN: unparseable host:port: ':10000'"
	debug_error_string = "UNKNOWN:DNS resolution failed for :10000: UNKNOWN: unparseable host:port: ':10000' {created_time:"2023-06-04T17:03:32.252714579+00:00", grpc_status:14}"
>
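
The empty host in `':10000'` suggests that `match()` looked for a private gRPC address, which a public endpoint does not have. Recent releases of the `google-cloud-aiplatform` SDK provide `find_neighbors()` on `MatchingEngineIndexEndpoint` for public endpoints; whether the installed version includes it is an assumption to verify. The sketch below shows the call shape with a stand-in endpoint so it runs without a deployed index. `query_public_endpoint` and `_FakeEndpoint` are hypothetical names.

```python
from typing import Any, List, Sequence


def query_public_endpoint(endpoint: Any, deployed_index_id: str,
                          embedding: Sequence[float], num_neighbors: int = 20) -> List[Any]:
    """Send one embedding to a public endpoint and return its neighbor list."""
    response = endpoint.find_neighbors(
        deployed_index_id=deployed_index_id,
        queries=[[float(v) for v in embedding]],  # one query = one list of floats
        num_neighbors=num_neighbors,
    )
    return response[0]  # neighbors for the first (only) query


class _FakeEndpoint:
    """Hypothetical stand-in so the sketch runs without a deployed index."""

    def find_neighbors(self, *, deployed_index_id, queries, num_neighbors):
        self.last_call = (deployed_index_id, queries, num_neighbors)
        return [[("id-1", 0.9), ("id-2", 0.8)][:num_neighbors]]


fake = _FakeEndpoint()
neighbors = query_public_endpoint(fake, "stack_overflow_index_deployed", [0.1, 0.2], num_neighbors=2)
print(neighbors)
```

With the real objects from this notebook, the equivalent call would be `query_public_endpoint(my_index_endpoint, DEPLOYED_INDEX_ID, test_embeddings)`, where each returned neighbor is expected to expose `id` and `distance` as used in the cells below.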

In [None]:
response

In [None]:
neighbor_ids = [neighbor.id for neighbor in response[0]]
neighbor_distances = [neighbor.distance for neighbor in response[0]]

for match_index, neighbor in enumerate(response[0]):
    titles = df[df.id.astype(str) == neighbor.id].title.tolist()

    if len(titles) > 0:
        print(
            f"{match_index}: title = '{titles[0]}', distance = {neighbor.distance:0.2f}"
        )

In [74]:
import google.cloud.aiplatform.v1beta1 as aiplatform_v1beta1

In [None]:
from google.cloud.aiplatform_v1beta1 import MatchServiceClient, FindNeighborsRequest, IndexDatapoint  # names used in the cells below

In [40]:
import google.cloud.aiplatform_v1beta1 as aiplatform_v1beta1
from google.oauth2 import service_account

In [41]:
scopes = ["https://www.googleapis.com/auth/cloud-platform"]
sa_file_path = 'L:\\gcp-project-0523-628d01f95284.json'

credentials = service_account.Credentials.from_service_account_file(sa_file_path, scopes=scopes)

In [42]:
CLIENT_OPTION = {"api_endpoint": "195708010.us-central1-473197248954.vdb.vertexai.goog" }

In [43]:
vertex_ai_client = aiplatform_v1beta1.MatchServiceClient(credentials=credentials,client_options=CLIENT_OPTION)

In [44]:
request = aiplatform_v1beta1.FindNeighborsRequest(index_endpoint    = 'projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920',
                                                  deployed_index_id = "stack_overflow_index_deployed",
                                              )

In [45]:
test_embeddings = encode_text_to_embedding(text_encoder = encoder,
                                           sentences    = ["How do I install tensorflow with GPU support?"]
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [46]:
dp1 = aiplatform_v1beta1.IndexDatapoint(feature_vector = test_embeddings,
                                        datapoint_id   ="0", 
                                       )

In [47]:
query = aiplatform_v1beta1.FindNeighborsRequest.Query(datapoint = dp1)

In [48]:
query

datapoint {
  datapoint_id: "0"
  feature_vector: -0.034692004323005676
  feature_vector: -0.06798267364501953
  feature_vector: 0.04702679440379143
  feature_vector: 0.00788554735481739
  feature_vector: -0.03787035867571831
  feature_vector: -0.03272635489702225
  feature_vector: -0.004163053352385759
  feature_vector: 0.04514075070619583
  feature_vector: -0.01702095940709114
  feature_vector: -0.04595581442117691
  feature_vector: 0.034906890243291855
  feature_vector: -0.03518760949373245
  feature_vector: 0.07979556918144226
  feature_vector: -0.015031998977065086
  feature_vector: 0.06507815420627594
  feature_vector: 0.03164554014801979
  feature_vector: 0.01235845685005188
  feature_vector: -0.004275189246982336
  feature_vector: 0.02019132301211357
  feature_vector: 0.02806086093187332
  feature_vector: 0.03129744157195091
  feature_vector: 0.015377004630863667
  feature_vector: -0.03467617183923721
  feature_vector: 0.02389618381857872
  feature_vector: -0.04366336390376091


In [50]:
type(query)

google.cloud.aiplatform_v1beta1.types.match_service.FindNeighborsRequest.Query

In [65]:
request.queries.append(query)

TypeError: Parameter to MergeFrom() must be instance of same class: expected google.cloud.aiplatform.v1beta1.FindNeighborsRequest.Query got Query.

In [51]:
pip list protobuf

Package                              Version
------------------------------------ --------------------
absl-py                              1.4.0
adal                                 1.2.7
aiofiles                             23.1.0
aiohttp                              3.8.4
aiokafka                             0.8.0
aiosignal                            1.3.1
alabaster                            0.7.12
alembic                              1.10.4
amqp                                 5.1.1
anaconda-client                      1.11.0
anaconda-navigator                   2.3.1
anaconda-project                     0.11.1
anyio                                3.6.2
apache-airflow                       2.6.1
apache-airflow-providers-celery      3.1.0
apache-airflow-providers-common-sql  1.4.0
apache-airflow-providers-ftp         3.3.1
apache-airflow-providers-http        4.3.0
apache-airflow-providers-imap        3.1.1
apache-airflow-providers-sqlite      3.3.2
apache-beam                     
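
The output above lists every installed package, so the package name passed to `pip list` did not act as a filter. A minimal sketch for checking only the versions of interest from inside the notebook, using the standard library. The choice of packages to inspect is an assumption about what is relevant to the `TypeError` above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names here are assumptions; adjust to whatever needs checking.
for name in ("protobuf", "proto-plus", "google-cloud-aiplatform"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```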

In [None]:
response = vertex_ai_client.find_neighbors(request2)

In [78]:
request2 = aiplatform_v1beta1.types.FindNeighborsRequest(
    index_endpoint    = 'projects/473197248954/locations/us-central1/indexEndpoints/4876087778581938176',
    deployed_index_id = "deployed_index_id_unique",
)

In [79]:
dp2 = aiplatform_v1beta1.IndexDatapoint(feature_vector = test_embeddings,
                                        datapoint_id   = "0",
                                       )

In [80]:
query2 = aiplatform_v1beta1.FindNeighborsRequest.Query(datapoint = dp2)

In [81]:
query2

datapoint {
  datapoint_id: "0"
  feature_vector: -0.034692004323005676
  feature_vector: -0.06798267364501953
  feature_vector: 0.04702679440379143
  feature_vector: 0.00788554735481739
  feature_vector: -0.03787035867571831
  feature_vector: -0.03272635489702225
  feature_vector: -0.004163053352385759
  feature_vector: 0.04514075070619583
  feature_vector: -0.01702095940709114
  feature_vector: -0.04595581442117691
  feature_vector: 0.034906890243291855
  feature_vector: -0.03518760949373245
  feature_vector: 0.07979556918144226
  feature_vector: -0.015031998977065086
  feature_vector: 0.06507815420627594
  feature_vector: 0.03164554014801979
  feature_vector: 0.01235845685005188
  feature_vector: -0.004275189246982336
  feature_vector: 0.02019132301211357
  feature_vector: 0.02806086093187332
  feature_vector: 0.03129744157195091
  feature_vector: 0.015377004630863667
  feature_vector: -0.03467617183923721
  feature_vector: 0.02389618381857872
  feature_vector: -0.04366336390376091


In [62]:
request2.queries.append(query2)

NameError: name 'request2' is not defined

In [73]:
my_vpc_index_endpoint.delete(force=True)

Undeploying MatchingEngineIndexEndpoint index_endpoint: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312
Undeploy MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312/operations/6821183427904012288
MatchingEngineIndexEndpoint index_endpoint undeployed. Resource name: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312
Deleting MatchingEngineIndexEndpoint : projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312
Delete MatchingEngineIndexEndpoint  backing LRO: projects/473197248954/locations/us-central1/operations/4729261400990416896
MatchingEngineIndexEndpoint deleted. . Resource name: projects/473197248954/locations/us-central1/indexEndpoints/4019277949474701312

In [74]:
my_index_endpoint.delete(force=True)

Undeploying MatchingEngineIndexEndpoint index_endpoint: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920
Undeploy MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920/operations/7848004142944485376
MatchingEngineIndexEndpoint index_endpoint undeployed. Resource name: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920
Deleting MatchingEngineIndexEndpoint : projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920
Delete MatchingEngineIndexEndpoint  backing LRO: projects/473197248954/locations/us-central1/operations/5130081767826391040
MatchingEngineIndexEndpoint deleted. . Resource name: projects/473197248954/locations/us-central1/indexEndpoints/3475468294469713920

In [75]:
tree_ah_index.delete()

Deleting MatchingEngineIndex : projects/473197248954/locations/us-central1/indexes/552069186352840704
Delete MatchingEngineIndex  backing LRO: projects/473197248954/locations/us-central1/indexes/552069186352840704/operations/745827530581213184
MatchingEngineIndex deleted. . Resource name: projects/473197248954/locations/us-central1/indexes/552069186352840704
