# Implement Hybrid Search with Vertex AI Vector Search

## GENAI115

#### Objectives
In this lab, you learn how to perform the following tasks:

 a. Install and explore the basic tools needed for hybrid search.
 
 b. Convert product information into two types of search-friendly formats:
 - Dense embeddings (for meaning-based search)
 - Sparse embeddings (for keyword-based search)
 
 c. Create a hybrid search index that supports both dense and sparse embeddings.
 
 d. Explore how combining both methods can improve search accuracy and relevance.
 
## Task 1. Prepare the environment in Vertex AI Workbench

In [1]:
! pip install --upgrade --quiet --user google-cloud-aiplatform google-cloud-storage

Restart the kernel after the installation completes so that the newly installed packages are picked up.

In [2]:
PROJECT_ID = "qwiklabs-gcp-03-9a19ca3f5800"  
LOCATION = "us-central1"

In [3]:
from datetime import datetime
UID = datetime.now().strftime("%m%d%H%M")

## Task 2. Prepare a dataset with hybrid embeddings

1. Run the following code in a cell to download the dataset and load it into a pandas DataFrame:

In [6]:
import pandas as pd

CSV_URL = "https://storage.googleapis.com/qwiklabs-gcp-03-9a19ca3f5800/google_merch_shop_items.csv"

df = pd.read_csv(CSV_URL)
df["title"]

0                          Google Sticker
1                    Google Cloud Sticker
2                       Android Black Pen
3                   Google Ombre Lime Pen
4                    For Everyone Eco Pen
                      ...                
197        Google Recycled Black Backpack
198    Google Cascades Unisex Zip Sweater
199    Google Cascades Womens Zip Sweater
200         Google Cloud Skyline Backpack
201       Google City Black Tote Backpack
Name: title, Length: 202, dtype: object

2. Run the following code in a cell to train a vectorizer (using TfidfVectorizer from scikit-learn) to generate sparse embeddings from product titles:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample Text Data
corpus = df.title.tolist()
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and Transform
vectorizer.fit_transform(corpus)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 839 stored elements and shape (202, 243)>
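
To make the fitted vectorizer less of a black box, here is a simplified, pure-Python sketch of the weighting that TfidfVectorizer applies, assuming its default settings (lowercasing, tokens of two or more characters, smoothed IDF, L2 normalization). The helper names `tokenize`, `fit_idf`, and `tfidf_sparse` and the three-title toy corpus are illustrative only; the lab itself uses scikit-learn.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(corpus):
    # Count in how many documents each term appears.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    # Assign each term a dimension index in alphabetical order.
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return vocab, idf


def tfidf_sparse(text, vocab, idf):
    # Term counts for known terms only; unknown terms are ignored.
    counts = Counter(t for t in tokenize(text) if t in vocab)
    weights = {vocab[t]: c * idf[t] for t, c in counts.items()}
    # L2-normalize so that the vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    dims = sorted(weights)
    values = [weights[d] / norm for d in dims]
    return {"values": values, "dimensions": dims}


toy_corpus = ["Google Sticker", "Google Cloud Sticker", "Android Black Pen"]
vocab, idf = fit_idf(toy_corpus)
toy_embedding = tfidf_sparse("Google Cloud Sticker", vocab, idf)
print(toy_embedding)
```

In this toy corpus, "cloud" appears in only one title, so it receives a larger weight than "google" or "sticker", which appear in two.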

3. Run the following code in a new cell to create a function that generates sparse embeddings:

In [8]:
# wrapper for sparse embedding
def get_sparse_embedding(text):
    # Transform Text into TF-IDF Sparse Vector
    tfidf_vector = vectorizer.transform([text])

    # Create Sparse Embedding for the New Text
    values = []
    dims = []
    for i, tfidf_value in enumerate(tfidf_vector.data):
        values.append(float(tfidf_value))
        dims.append(int(tfidf_vector.indices[i]))
    return {"values": values, "dimensions": dims}

4. Run the following code in a new cell to test the wrapper with a product title:

In [9]:
test_text = "Chrome Dino Pin"
get_sparse_embedding(test_text)

{'values': [0.5212913389979028, 0.5212913389979028, 0.6756557405747007],
 'dimensions': [33, 48, 157]}
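
The `values`/`dimensions` pair is a compact way of storing a vector that is zero almost everywhere. As a rough illustration of why this supports keyword matching, the sketch below computes a dot product between two such embeddings: only dimensions (vocabulary terms) that both share contribute to the score. The function name `sparse_dot` and the sample numbers are hypothetical, and the sketch is not a description of how Vector Search scores sparse embeddings internally.

```python
def sparse_dot(a, b):
    """Dot product of two sparse embeddings in {"values", "dimensions"} form."""
    # Index one embedding by dimension so shared dimensions can be looked up.
    weights_a = dict(zip(a["dimensions"], a["values"]))
    return sum(
        value * weights_a[dim]
        for dim, value in zip(b["dimensions"], b["values"])
        if dim in weights_a
    )


# Hypothetical embeddings: each dimension index stands for one vocabulary term.
query_emb = {"values": [0.6, 0.8], "dimensions": [33, 157]}
item_match = {"values": [0.5, 0.5, 0.7], "dimensions": [33, 48, 157]}
item_other = {"values": [1.0], "dimensions": [12]}

print(sparse_dot(query_emb, item_match))  # shares dimensions 33 and 157
print(sparse_dot(query_emb, item_other))  # no shared dimension, so the score is 0
```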

5. To build a hybrid index, each item should have both sparse_embedding and embedding (for dense embedding). The following code uses Google's text-embedding-005 model to generate dense text embeddings of 768 dimensions for semantic search. Run the following code in a new cell to create a wrapper for dense embeddings:

In [10]:
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

# wrapper for dense embedding
def get_dense_embedding(text):
    return model.get_embeddings([text])[0].values



6. Test the wrapper method by running the following code in a new cell:

In [11]:
test_text = "Chrome Dino Pin"
get_dense_embedding(test_text)

[-0.06114290654659271,
 0.017346370965242386,
 -0.004251249600201845,
 -0.02798495627939701,
 -0.0011111712083220482,
 0.012573054060339928,
 -0.05285409837961197,
 -0.030847828835248947,
 -0.003995438572019339,
 -0.05352348834276199,
 -0.08685804903507233,
 0.034621480852365494,
 -0.01601252891123295,
 -0.012502963654696941,
 -0.024926766753196716,
 0.03435458615422249,
 0.0061781443655490875,
 -0.07511552423238754,
 -0.024149153381586075,
 -0.0012593824649229646,
 0.00979007687419653,
 -0.07821597903966904,
 -0.020418530330061913,
 -0.01775938645005226,
 0.023217324167490005,
 0.008083238266408443,
 0.011712702922523022,
 0.020031563937664032,
 -0.013191470876336098,
 -0.019503341987729073,
 0.06235750392079353,
 0.015036839991807938,
 0.03624138981103897,
 0.032345667481422424,
 -0.014319414272904396,
 -0.02620174176990986,
 -0.06564001739025116,
 -0.04058409854769707,
 -0.009316562674939632,
 0.03755204379558563,
 0.050808507949113846,
 -0.11942362040281296,
 0.018268104642629623,
 ...]
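
Unlike a sparse embedding, every dimension of a dense embedding carries a value, and items with similar meaning end up with vectors that point in similar directions. A minimal sketch of comparing dense vectors with cosine similarity follows. The three-dimensional vectors are made-up stand-ins for the 768-dimensional model output, and cosine similarity is used here only as a common illustration; the source does not state which distance measure the index is configured with.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real model output.
pin = [0.9, 0.1, 0.0]
badge = [0.8, 0.2, 0.1]
backpack = [0.0, 0.2, 0.9]

print(cosine_similarity(pin, badge))     # vectors point in similar directions
print(cosine_similarity(pin, backpack))  # vectors point in different directions
```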


7. Run the following code in a new cell to set the bucket name and then create it:

In [12]:
BUCKET_URI = f"gs://{PROJECT_ID}-vs-hybridsearch-{UID}"
! gcloud storage buckets create -l $LOCATION --project $PROJECT_ID $BUCKET_URI

Creating gs://qwiklabs-gcp-03-9a19ca3f5800-vs-hybridsearch-07040530/...


8. Run the following gcloud command in a new cell to copy the embeddings file (a sample file of JSON objects, one per item, each with id, title, embedding, and sparse_embedding properties) to your storage bucket:

In [13]:
! gcloud storage cp gs://partner-genai-bucket/genai115/items.json  $BUCKET_URI/items.json

Copying gs://partner-genai-bucket/genai115/items.json to gs://qwiklabs-gcp-03-9a19ca3f5800-vs-hybridsearch-07040530/items.json
  Completed files 1/1 | 3.3MiB/3.3MiB                                          
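
The lab supplies a prebuilt items.json, but it can help to see how such a file might be assembled. The sketch below builds records with the four properties named in step 8 and serializes them one JSON object per line, which is the layout implied by the "JSONL file" wording in step 10. The `build_item` helper and the placeholder vectors are hypothetical; in the notebook you would call get_dense_embedding and get_sparse_embedding instead.

```python
import json


def build_item(item_id, title, dense, sparse):
    # One record per product, using the property names described in step 8.
    return {
        "id": str(item_id),
        "title": title,
        "embedding": dense,
        "sparse_embedding": sparse,
    }


titles = ["Google Sticker", "Google Cloud Sticker"]
lines = []
for i, title in enumerate(titles):
    dense = [0.0, 0.0, 0.0]  # placeholder for get_dense_embedding(title)
    sparse = {"values": [1.0], "dimensions": [i]}  # placeholder for get_sparse_embedding(title)
    lines.append(json.dumps(build_item(i, title, dense, sparse)))

# One JSON object per line.
jsonl_text = "\n".join(lines)
print(jsonl_text)
```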


9. Run the following code in a new cell to initialize the aiplatform package:

In [14]:
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

10. Run the following code in a new cell to create a hybrid index using the JSONL file in your bucket. This cell takes a minute or two to complete:

In [15]:
my_hybrid_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"vs-hybridsearch-index-{UID}",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=10,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/958287624711/locations/us-central1/indexes/3272447870447386624/operations/4041690635013455872
MatchingEngineIndex created. Resource name: projects/958287624711/locations/us-central1/indexes/3272447870447386624
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/958287624711/locations/us-central1/indexes/3272447870447386624')


11. Run the following code in a new cell to create an index endpoint:

In [16]:
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"vs-hybridsearch-index-endpoint-{UID}", public_endpoint_enabled=True
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/958287624711/locations/us-central1/indexEndpoints/6648054120835973120/operations/3330121893888917504
MatchingEngineIndexEndpoint created. Resource name: projects/958287624711/locations/us-central1/indexEndpoints/6648054120835973120
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/958287624711/locations/us-central1/indexEndpoints/6648054120835973120')


12. Run the following code in a new cell to deploy the index to the endpoint, specifying a unique deployed index ID:

In [None]:
DEPLOYED_HYBRID_INDEX_ID = f"vs_hybridsearch_deployed_{UID}"
my_index_endpoint.deploy_index(
    index=my_hybrid_index, deployed_index_id=DEPLOYED_HYBRID_INDEX_ID
)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/958287624711/locations/us-central1/indexEndpoints/6648054120835973120
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/958287624711/locations/us-central1/indexEndpoints/6648054120835973120/operations/4699216180609548288


## Task 3. Run hybrid queries

In this task, you generate both dense and sparse embeddings for the query Kids and encapsulate them in a HybridQuery object. In addition to the dense and sparse embeddings, you pass in rrf_ranking_alpha, which determines how the rankings from the semantic and token-based search results are merged. Because the dense embedding captures meaning rather than exact wording, a search for "cozy hoodie", for example, could surface products similar to those returned for "comfortable sweatshirt" even if the exact keywords aren't present.

1. Run the following code in a new cell to prepare the query for the word Kids:

In [None]:
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    HybridQuery,
)

query_text = "Kids"
query_dense_emb = get_dense_embedding(query_text)
query_sparse_emb = get_sparse_embedding(query_text)
query = HybridQuery(
    dense_embedding=query_dense_emb,
    sparse_embedding_dimensions=query_sparse_emb["dimensions"],
    sparse_embedding_values=query_sparse_emb["values"],
    rrf_ranking_alpha=0.5,
)

2. Run the following code in a new cell to run the query and print distances for each item in the response:

In [None]:
# run a hybrid query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_HYBRID_INDEX_ID,
    queries=[query],
    num_neighbors=10,
)

# print results
for idx, neighbor in enumerate(response[0]):
    title = df.title[int(neighbor.id)]
    dense_dist = neighbor.distance if neighbor.distance else 0.0
    sparse_dist = neighbor.sparse_distance if neighbor.sparse_distance else 0.0
    print(f"{title:<40}: dense_dist: {dense_dist:.3f}, sparse_dist: {sparse_dist:.3f}")
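
The `if ... else 0.0` guards in the loop above allow for a neighbor that lacks one of the two distances, presumably because only one of the two searches retrieved it. The sketch below labels each result by which distances are present. `Neighbor` here is a hypothetical stand-in for the SDK's response objects, and the sample values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by find_neighbors.
Neighbor = namedtuple("Neighbor", ["id", "distance", "sparse_distance"])


def retrieval_source(neighbor):
    # Treat a missing or zero distance as "not retrieved by that search".
    has_dense = bool(neighbor.distance)
    has_sparse = bool(neighbor.sparse_distance)
    if has_dense and has_sparse:
        return "dense+sparse"
    if has_dense:
        return "dense only"
    if has_sparse:
        return "sparse only"
    return "unknown"


# Made-up sample results.
sample = [
    Neighbor("12", 0.61, 0.83),
    Neighbor("40", 0.58, None),
    Neighbor("7", None, 0.77),
]
for n in sample:
    print(n.id, retrieval_source(n))
```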

Note: To merge the token-based and semantic search results, hybrid search uses Reciprocal Rank Fusion (RRF). For more information about RRF and how to specify the rrf_ranking_alpha parameter, refer to the next section.

#### What is reciprocal rank fusion?
RRF provides a way to merge the rankings from semantic and token-based search results into a single ranked list. In many production information retrieval or recommender systems, the retrieved results then go through a further, more precise ranking step known as reranking. By combining millisecond-level retrieval from vector search with precision reranking of the retrieved results, you can build multi-stage systems that provide higher search quality and recommendation performance.
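
As a rough illustration of the idea, the sketch below merges two ranked lists with a weighted form of reciprocal rank fusion: each item scores 1 / (k + rank) in each list, and an alpha weight balances the two lists. The function name `weighted_rrf`, the constant k = 60, and the exact weighting are assumptions for this sketch; they are not taken from the lab and may differ from what Vector Search does internally with rrf_ranking_alpha.

```python
def weighted_rrf(dense_ranking, sparse_ranking, alpha=0.5, k=60):
    """Merge two ranked lists of item IDs into one list, best first."""
    scores = {}
    # Contribution from the dense (semantic) ranking, weighted by alpha.
    for rank, item_id in enumerate(dense_ranking, start=1):
        scores[item_id] = scores.get(item_id, 0.0) + alpha / (k + rank)
    # Contribution from the sparse (token-based) ranking, weighted by 1 - alpha.
    for rank, item_id in enumerate(sparse_ranking, start=1):
        scores[item_id] = scores.get(item_id, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical item IDs ranked by each search.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["c", "a", "d"]

print(weighted_rrf(dense_ranking, sparse_ranking, alpha=0.5))
```

In this sketch, an alpha of 1.0 reproduces the dense ranking and an alpha of 0.0 reproduces the sparse ranking, while values in between reward items that rank well in both lists.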

