<a href="https://colab.research.google.com/github/rastringer/promptcraft_notebooks/blob/main/wands_products_only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Retrieval with LLMs and Embeddings

Matching customer queries to products via embeddings and Retrieval-Augmented Generation.

### Overview

This notebook demonstrates one method of using large language models to interact with data. Using the Wayfair [WANDS](https://www.aboutwayfair.com/careers/tech-blog/wayfair-releases-wands-the-largest-and-richest-publicly-available-dataset-for-e-commerce-product-search-relevance) dataset of more than 42,000 products, we will go through the following steps:

* Download the data into a pandas dataframe

* Generate embeddings for the product descriptions

* Create and deploy an index of the embeddings on Vertex AI Matching Engine, a service that enables nearest neighbor search at scale

* Prompt an LLM to retrieve relevant product suggestions from the embedded data.


In [None]:
# Install the packages
! pip3 install --upgrade google-cloud-aiplatform
! pip3 install "shapely<2.0.0"


### Colab only: Run the following cell to restart the kernel



In [12]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

Set your Google Cloud project id and region

In [1]:
PROJECT_ID = "notebooks-370010"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].


In [2]:
REGION = "us-central1"  # @param {type: "string"}

We will need a Cloud Storage bucket to store embeddings initially. Please create a bucket and add the URI below.

In [3]:
BUCKET_URI = "gs://genai-experiments"

Authenticate your Google Cloud account
Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

1. Vertex AI Workbench

Do nothing as you are already authenticated.

2. Local JupyterLab instance, uncomment and run:

In [None]:
# ! gcloud auth login

3. Colab, uncomment and run:

In [4]:
from google.colab import auth
auth.authenticate_user()
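
Since only one of the three options applies in any given environment, the Colab step can be guarded instead of being commented in and out. A minimal sketch, assuming that the presence of the `google.colab` module in `sys.modules` signals a Colab runtime; the helper name `running_in_colab` is hypothetical:

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed Colab signal)."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Import the Colab helper only when it is actually available
    from google.colab import auth

    auth.authenticate_user()
else:
    print("Not in Colab: rely on Workbench credentials or run `gcloud auth login`.")
```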

Install and initialize the SDK and language model. Google Cloud uses the `gecko` model for text embeddings.

In [5]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

In [6]:
# Load the "Vertex AI Embeddings for Text" model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

Now we're ready to prepare the data

In [33]:
import os
import pandas as pd

path = "data"

# Create the data directory if it does not already exist
if not os.path.exists(path):
  os.makedirs(path)
  print("data directory created")
else:
  print("data directory found")

data directory created


In [34]:
# download datasets
!wget -q https://raw.githubusercontent.com/wayfair/WANDS/main/dataset/label.csv
!wget -q https://raw.githubusercontent.com/wayfair/WANDS/main/dataset/product.csv
!wget -q https://raw.githubusercontent.com/wayfair/WANDS/main/dataset/query.csv

!mv *.csv data/

In [35]:
!ls data

label.csv  product.csv	query.csv


The dataset features a wealth of information. The queries (user searches) and the ratings of the responses to those queries have been particularly interesting to researchers. For this demo, however, we will focus on the product descriptions.

In [36]:
product_df = pd.read_csv("data/product.csv", sep='\t')
product_df

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0
...,...,...,...,...,...,...,...,...,...
42989,42989,malibu pressure balanced diverter fixed shower...,Shower Panels,Home Improvement / Bathroom Remodel & Bathroom...,the malibu pressure balanced diverter fixed sh...,producttype : shower panel|spraypattern : rain...,3.0,4.5,2.0
42990,42990,emmeline 5 piece breakfast dining set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,,basematerialdetails : steel| : gray wood|ofhar...,1314.0,4.5,864.0
42991,42991,maloney 3 piece pub table set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,this pub table set includes 1 counter height t...,additionaltoolsrequirednotincluded : power dri...,49.0,4.0,41.0
42992,42992,fletcher 27.5 '' wide polyester armchair,Teen Lounge Furniture|Accent Chairs,Furniture / Living Room Furniture / Chairs & S...,"bring iconic , modern style to your space in a...",legmaterialdetails : rubberwood|backheight-sea...,1746.0,4.5,1226.0


Filter the dataframe to consider `product_id`, `product_name`, `product_description`.

In [37]:
product_df = product_df.filter(["product_id", "product_name", "product_description"], axis=1)

In [38]:
product_df = product_df.rename(columns={"product_description": "product_text", "product_id": "id"})

In [39]:
product_df.dropna()

Unnamed: 0,id,product_name,product_text
0,0,solid wood platform bed,"good , deep sleep can be quite difficult to ha..."
1,1,all-clad 7 qt . slow cooker,"create delicious slow-cooked meals , from tend..."
2,2,all-clad electrics 6.5 qt . slow cooker,prepare home-cooked meals on any schedule with...
3,3,all-clad all professional tools pizza cutter,this original stainless tool was designed to c...
4,4,baldwin prestige alcott passage knob with roun...,the hardware has a rich heritage of delivering...
...,...,...,...
42988,42988,paradise pressure balanced diverter dual showe...,this complete shower system offers a soothing ...
42989,42989,malibu pressure balanced diverter fixed shower...,the malibu pressure balanced diverter fixed sh...
42991,42991,maloney 3 piece pub table set,this pub table set includes 1 counter height t...
42992,42992,fletcher 27.5 '' wide polyester armchair,"bring iconic , modern style to your space in a..."


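`dropna()` returns a new dataframe rather than modifying `product_df` in place. The filtered result shown above has almost 37,000 rows, but because it is not assigned back, `product_df` itself still contains every row, including those with missing descriptions: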

In [48]:
len(product_df)

42994

The following three cells contain functions from this [notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/sdk_matching_engine_create_stack_overflow_embeddings_vertex.ipynb) from the vertex-ai-samples repository.

`encode_texts_to_embeddings` will be used later to convert the product descriptions into embeddings.

In [9]:
from typing import List, Optional

# Define an embedding method that uses the model
def encode_texts_to_embeddings(text: List[str]) -> List[Optional[List[float]]]:
    try:
        embeddings = model.get_embeddings(text)
        return [embedding.values for embedding in embeddings]
    except Exception:
        return [None for _ in range(len(text))]

These helper functions achieve the following:

* `generate_batches` splits the product descriptions into batches of five, since the embeddings API will field up to five text instances in each request.

* `encode_text_to_embedding_batched` calls the embeddings API and handles rate limiting using `time.sleep`.

In [10]:
import functools
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Generator, List, Tuple

import numpy as np
from tqdm.auto import tqdm


# Generator function to yield batches of sentences
def generate_batches(
    text: List[str], batch_size: int
) -> Generator[List[str], None, None]:
    for i in range(0, len(text), batch_size):
        yield text[i : i + batch_size]


def encode_text_to_embedding_batched(
    text: List[str], api_calls_per_second: int = 10, batch_size: int = 5
) -> Tuple[List[bool], np.ndarray]:

    embeddings_list: List[List[float]] = []

    # Prepare the batches using a generator
    batches = generate_batches(text, batch_size)

    seconds_per_job = 1 / api_calls_per_second

    with ThreadPoolExecutor() as executor:
        futures = []
        for batch in tqdm(
            batches, total=math.ceil(len(text) / batch_size), position=0
        ):
            futures.append(
                executor.submit(functools.partial(encode_texts_to_embeddings), batch)
            )
            time.sleep(seconds_per_job)

        for future in futures:
            embeddings_list.extend(future.result())

    # One flag per input text: True if its embedding was returned
    is_successful = [embedding is not None for embedding in embeddings_list]
    embeddings_list_successful = np.squeeze(
        np.stack([embedding for embedding in embeddings_list if embedding is not None])
    )
    return is_successful, embeddings_list_successful
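
To make the failure handling concrete, here is a small self-contained sketch of the same pattern with a stand-in embedder. The `fake_embed` function and its failure rule are invented for illustration, and no API is called. A failed batch contributes one `None` per input, so the success mask stays aligned with the original texts.

```python
from typing import Iterator, List, Optional


def fake_embed(batch: List[str]) -> List[Optional[List[float]]]:
    """Hypothetical embedder: the whole batch fails if any text is empty."""
    try:
        if any(not t for t in batch):
            raise ValueError("empty text in batch")
        return [[float(len(t)), 1.0] for t in batch]
    except Exception:
        # Mirror the notebook's behaviour: one None per input text
        return [None for _ in batch]


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


texts = ["bed", "slow cooker", "", "door knob", "rug"]
embeddings: List[Optional[List[float]]] = []
for batch in batches(texts, 2):
    embeddings.extend(fake_embed(batch))

# The mask has one entry per original text, in the original order
is_successful = [e is not None for e in embeddings]
kept_texts = [t for t, ok in zip(texts, is_successful) if ok]

print(is_successful)
print(kept_texts)
```

Note that "door knob" is dropped together with the empty string: with this error handling, one bad input costs the whole batch.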

Let's encode a subset of the data and check that the similarity scores produce sensible product suggestions.

In [50]:
import math

# Encode a subset of product descriptions for validation
products = product_df.product_text.tolist()[:500]
is_successful, product_embeddings = encode_text_to_embedding_batched(text=products)

# Filter for successfully embedded descriptions
products = np.array(products)[is_successful]

  0%|          | 0/100 [00:00<?, ?it/s]

In [51]:
DIMENSIONS = len(product_embeddings[0])

print(DIMENSIONS)

768


This cell takes a description from the dataset (rather than from a user) and looks for relevant matches. The top match is likely to be the query product itself.

In [53]:
import random

product_index = random.randint(0, 99)

print(f"Product query: {products[product_index]} \n")

scores = np.dot(product_embeddings[product_index], product_embeddings.T)

# Print top 3 matches
for index, (product, score) in enumerate(
    sorted(zip(products, scores), key=lambda x: x[1], reverse=True)[:3]
):
    print(f"\t{index}: \n {product}: \n {score} \n")

Product query: a rare find for a large entry , a large walking closet , or even a sitting area for a commercial boutique entry . the item is not only functional for a unique setting but will surely grab attention with its style and beauty . an example of a style of furnishings , originating in france about 1720 , evolved from baroque styling , this product is hand-carved by some of the worlds finest carvers in solid mahogany . special attention was given to the quality of the carvings and the brilliant new gold platine finishing used . the item is upholstered with an imported cream taffeta fabric , which works so beautifully with the rich finish chosen . each piece is hand-carved , one piece at a time , by some of the world 's best carvers . the special mahogany wood comes from replanted forests . the selected wood has been kiln-dried . the item in its final form is stress and pressure tested prior to finishing to insure long-lasting stability , in most all climates , a heritage brand 
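
The ranking step above is a plain dot product between one embedding and all the others. The sketch below reproduces it on made-up three-dimensional vectors (the product names and numbers are illustrative, not taken from the dataset). The toy vectors are scaled to unit length, in which case the dot product equals cosine similarity and a vector compared against itself ranks first.

```python
import math
from typing import Dict, List, Tuple


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_matches(
    query: List[float], catalog: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every catalog vector against the query and keep the k best."""
    scored = [(name, dot(query, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy catalog: two similar products and one unrelated product
catalog = {
    "platform bed": normalize([0.9, 0.1, 0.0]),
    "bed frame": normalize([0.8, 0.2, 0.1]),
    "pizza cutter": normalize([0.0, 0.1, 0.9]),
}

results = top_matches(catalog["platform bed"], catalog, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```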

### Data formatting for building an index

We need to save the embeddings together with the `id` and `product_name` columns in JSON lines format in order to create an index on Matching Engine. For more details, see the [documentation](https://cloud.google.com/vertex-ai/docs/matching-engine/match-eng-setup/format-structure).
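
As a rough illustration of the target layout, the sketch below writes two toy records as JSON lines, one object per line. It assumes, going by the linked format documentation, that each record carries a string `id` and an `embedding` array. The vectors, the file name `toy_embeddings.json` and the helper `write_json_lines` are invented for the example.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_json_lines(records: List[Dict], path: str) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Toy records: ids are strings, embeddings are lists of floats (assumed layout)
records = [
    {"id": "0", "embedding": [0.1, 0.2, 0.3]},
    {"id": "1", "embedding": [0.4, 0.5, 0.6]},
]

out_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.json")
write_json_lines(records, out_path)

# Read the file back to confirm that every line parses on its own
with open(out_path) as f:
    lines = f.read().splitlines()
parsed = [json.loads(line) for line in lines]
print(len(parsed), parsed[0]["id"])
```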

In [54]:
import tempfile
from pathlib import Path

# Create a temporary directory to write embeddings to
embeddings_file_path = Path(tempfile.mkdtemp())

print(f"Embeddings directory: {embeddings_file_path}")

Embeddings directory: /tmp/tmpzi8n7mda


In [55]:
product_embeddings = np.array(product_embeddings)

In [26]:
!touch json_output.json

Let's take a look at the type of the embeddings. At the moment, `product_embeddings` is a NumPy array. We will need to convert it to a list of Python dictionaries, one per product, to use the embeddings as another column in a dataframe.

In [56]:
type(product_embeddings)

numpy.ndarray

In [57]:
embeddings_list = product_embeddings.tolist()
embeddings_dicts = [{'embedding': embedding} for embedding in embeddings_list]


In [58]:
embeddings_df = product_df.merge(pd.DataFrame(embeddings_dicts), left_on='id', right_index=True)


In [59]:
embeddings_df

Unnamed: 0,id,product_name,product_text,embedding
0,0,solid wood platform bed,"good , deep sleep can be quite difficult to ha...","[-0.02153787948191166, -0.01955561526119709, 0..."
1,1,all-clad 7 qt . slow cooker,"create delicious slow-cooked meals , from tend...","[-0.05451446399092674, -0.015920285135507584, ..."
2,2,all-clad electrics 6.5 qt . slow cooker,prepare home-cooked meals on any schedule with...,"[-0.030471259728074074, 0.008680124767124653, ..."
3,3,all-clad all professional tools pizza cutter,this original stainless tool was designed to c...,"[-0.04747460037469864, 0.00038485010736621916,..."
4,4,baldwin prestige alcott passage knob with roun...,the hardware has a rich heritage of delivering...,"[-0.030091606080532074, -0.010061231441795826,..."
...,...,...,...,...
495,495,monmouth free standing umbrella base,set the foundation for a shady space in your y...,"[-0.02213788963854313, -0.010204324498772621, ..."
496,496,obyrne valet stand,the jewelry valet stand offers a practical sto...,"[-0.018322506919503212, -0.05783892795443535, ..."
497,497,demir upholstered bench,,"[-0.012941082008183002, -0.0014768290566280484..."
498,498,decaro gray rug,,"[-0.012941082008183002, -0.0014768290566280484..."


Now we can convert the entire dataframe to JSON lines.

In [31]:
json_lines = embeddings_df.to_json(orient='records', lines=True)

In [32]:
json_lines



In [33]:
import json

output_file = 'merged_data.json'
with open(output_file, 'w') as file:
    for index, row in embeddings_df.iterrows():
        data = {
            'id': row['id'],
            'product_name': row['product_name'],
            'product_text': row['product_text'],
            'embedding': row['embedding']
        }
        json_line = json.dumps(data)
        file.write(json_line + '\n')

Copy the JSON lines file to Cloud Storage.

In [34]:
!gsutil cp merged_data.json gs://genai-experiments/

Copying file://merged_data.json [Content-Type=application/json]...
- [1 files][  8.3 MiB/  8.3 MiB]                                                
Operation completed over 1 objects/8.3 MiB.                                      


In [None]:
!cat json_output.json

### Creating the index in Matching Engine

*Note: this is a long-running operation which can take up to an hour.*

In [35]:
DIMENSIONS = 768
# Add a display name
DISPLAY_NAME = "wands_index"
DESCRIPTION = "products and descriptions from Wayfair"
remote_folder = BUCKET_URI

tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=remote_folder,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=5,
    description=DESCRIPTION,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/62374552305/locations/us-central1/indexes/1796997823971983360/operations/3717593055692324864
MatchingEngineIndex created. Resource name: projects/62374552305/locations/us-central1/indexes/1796997823971983360
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/62374552305/locations/us-central1/indexes/1796997823971983360')


In the results of the cell above, make note of the information under this line:

*To use this MatchingEngineIndex in another session*:

If the Colab runtime resets, you will need this line to set the `index` variable:

`index = aiplatform.MatchingEngineIndex(...)`

Use `gcloud` to list indexes

In [20]:
!gcloud ai indexes list --region="us-central1"

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-07-10T19:44:24.622019Z'
deployedIndexes:
- deployedIndexId: wands_3_deployed_index_id
  indexEndpoint: projects/62374552305/locations/us-central1/indexEndpoints/5768609745359339520
description: products and descriptions from Wayfair
displayName: wands_3
etag: AMEw9yOWMFhDdcq7Vx0UjGyZHtJ5GuAPyq5hNED_kiDAiPgm-Jm8dqQ7BKQi_TrUSGOs
indexStats:
  shardsCount: 1
  vectorsCount: '500'
indexUpdateMethod: BATCH_UPDATE
metadata:
  config:
    algorithmConfig:
      treeAhConfig:
        leafNodeEmbeddingCount: '500'
        leafNodesToSearchPercent: 5
    approximateNeighborsCount: 150
    dimensions: 768
    distanceMeasureType: DOT_PRODUCT_DISTANCE
    shardSize: SHARD_SIZE_MEDIUM
metadataSchemaUri: gs://google-cloud-aiplatform/schema/matchingengine/metadata/nearest_neighbor_search_1.0.0.yaml
name: projects/62374552305/locations/us-central1/indexes/1796997823971983360
updateTime: '2023-07-10T20:42:25.525178Z'


In [38]:
INDEX_RESOURCE_NAME = tree_ah_index.resource_name
INDEX_RESOURCE_NAME

'projects/62374552305/locations/us-central1/indexes/1796997823971983360'

### Deploy the index

In [18]:
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DISPLAY_NAME,
    description=DISPLAY_NAME,
    public_endpoint_enabled=True,
)

NameError: ignored

* Note: here is how to connect to an existing MatchingEngineIndexEndpoint (from another project, or if the Colab runtime resets).

In [23]:
# my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
#     index_endpoint_name = 'projects/62374552305/locations/us-central1/indexEndpoints/5768609745359339520',
# )

In [24]:
DEPLOYED_INDEX_ID = "wands_3_deployed_index_id"

DEPLOYED_INDEX_ID

'wands_3_deployed_index_id'

In [17]:
my_index_endpoint = my_index_endpoint.deploy_index(
    index=index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes

AttributeError: ignored

In [7]:
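# Reconnect to the existing endpoint by resource name; a plain string would have no find_neighbors method
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name='projects/62374552305/locations/us-central1/indexEndpoints/5768609745359339520')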

### Quick test query

Embedding a query should return relevant nearest neighbors.

In [11]:
test_embeddings = encode_texts_to_embeddings(text=["a midcentury modern dining table"])

In [25]:
# Test query
NUM_NEIGHBOURS = 5

response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_embeddings,
    num_neighbors=NUM_NEIGHBOURS,
)

response

[[MatchNeighbor(id='216', distance=0.7966769933700562),
  MatchNeighbor(id='350', distance=0.7943713665008545),
  MatchNeighbor(id='488', distance=0.7903410196304321),
  MatchNeighbor(id='465', distance=0.7732850909233093),
  MatchNeighbor(id='393', distance=0.7718973159790039)]]

Now let's make that information useful, by creating helper functions to take the `id`s and match them to products.

In [27]:
# Get the ids of the nearest neighbor results

def get_nn_ids(response):
  id_list = [item.id for sublist in response for item in sublist]
  id_list = [int(i) for i in id_list]  # ids come back as strings; int() is safer than eval()
  print(id_list)
  results_df = product_df[product_df['id'].isin(id_list)]
  return results_df

In [28]:
# Create embeddings from a customer chat message

def get_embeddings(input_text):
  chat_embeddings = encode_texts_to_embeddings(text=[input_text])

  return chat_embeddings

# customer_chat_embeddings = get_embeddings(customer_message)
# print(customer_chat_embeddings)

In [30]:
# Retrieve the nearest neighbor lookups for
# the embedded customer message

NUM_NEIGHBOURS = 3

def get_nn_response(chat_embeddings):
  response = my_index_endpoint.find_neighbors(
      deployed_index_id=DEPLOYED_INDEX_ID,
      queries=chat_embeddings,
      num_neighbors=NUM_NEIGHBOURS,
  )
  return response

In [98]:
# Create a dataframe of results. This will be the data we
# ask the language model to base its recommendations on

def get_nn_ids(response):
  id_list = [item.id for sublist in response for item in sublist]
  id_list = [int(i) for i in id_list]  # ids come back as strings; int() is safer than eval()
  print(id_list)
  results_df = product_df[product_df['id'].isin(id_list)]

  return results_df
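
One caveat with `isin`: it returns rows in dataframe order, so the similarity ranking from the index is lost. If the order matters for the prompt, the rows can be looked up one id at a time instead. The sketch below shows the idea on plain Python data. The `Neighbor` tuple is a stand-in for the `MatchNeighbor` objects in the response, and the catalog entries and scores are made up.

```python
from collections import namedtuple
from typing import Dict, List, Tuple

# Stand-in for the MatchNeighbor objects returned by find_neighbors
Neighbor = namedtuple("Neighbor", ["id", "distance"])

# Toy catalog keyed by integer product id (illustrative names)
catalog: Dict[int, str] = {
    216: "dining table",
    350: "extendable table",
    488: "bar stool",
}

# One query, three neighbors, already sorted by the index from best to worst
response = [[Neighbor("350", 0.79), Neighbor("216", 0.78), Neighbor("488", 0.75)]]


def ranked_products(
    response: List[List[Neighbor]], catalog: Dict[int, str]
) -> List[Tuple[int, str]]:
    """Return (id, name) pairs in the order the neighbors were returned."""
    ids = [int(n.id) for neighbors in response for n in neighbors]
    return [(i, catalog[i]) for i in ids if i in catalog]


print(ranked_products(response, catalog))
```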

### RAG using the LLM and embeddings

In [135]:
import vertexai
from vertexai.preview.language_models import ChatModel, InputOutputTextPair

chat_model = ChatModel.from_pretrained("chat-bison@001")
parameters = {
    "temperature": 0.1,
    "max_output_tokens": 1024,
    "top_p": 0.8,
    "top_k": 40
}

customer_message = """\
Interested in a persian style rug
"""

# Chain together the helper functions to get results
# from customer_message
def process_customer_message(customer_message):
  customer_chat_embeddings = get_embeddings(customer_message)
  response = get_nn_response(customer_chat_embeddings)
  results_df = get_nn_ids(response)
  return results_df

results_df = process_customer_message(customer_message)

service_context=f"""You are a customer service bot, writing in polite British English. \
    Suggest the top three relevant \
    products only from {results_df}, mentioning:
     product names and \
     brief descriptions \
    Number them and leave a line between suggestions. \
    Preface the list of products with an introductory sentence such as \
    'Here are some relevant products: ' \
    Ensure each recommendation appears only once."""


chat = chat_model.start_chat(
    context=f"""{service_context}""",
)
response = chat.send_message(customer_message, **parameters)
print(f"Response from Model: \n {response.text}")

[82, 85, 83, 84, 276]
Response from Model: 
 Here are some relevant products:

1. Oriental hand knotted wool plum area rug

This elegantly hand-woven rug is from China and features a beautiful plum color. It is perfect for adding a touch of elegance to any room.

2. Traci ombre braided cotton aqua area rug

This cheerful ombre braided rug will highlight any room. It is made of cotton and features a soft, plush feel.

3. Roswell ombre braided cotton black area rug

This ombre braided rug is perfect for adding a touch of style to any room. It is made of cotton and features a soft, plush feel.
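
Interpolating `{results_df}` into the f-string inserts the dataframe's default string form, in which pandas shortens long cells (the `...` seen in the tables earlier). An alternative is to build the context from the rows explicitly so that the model sees fuller descriptions. A minimal sketch on plain dictionaries; `build_context`, the `max_chars` cap and the sample rows are hypothetical:

```python
from typing import Dict, List


def build_context(rows: List[Dict[str, str]], max_chars: int = 300) -> str:
    """Format each row as a numbered line, capping the description length."""
    parts = []
    for n, row in enumerate(rows, start=1):
        text = row["product_text"][:max_chars]
        parts.append(f"{n}. {row['product_name']}: {text}")
    return "\n".join(parts)


# Illustrative rows in the shape of the notebook's columns
rows = [
    {"product_name": "oriental wool rug", "product_text": "hand-knotted wool rug in plum."},
    {"product_name": "braided cotton rug", "product_text": "ombre braided rug made of cotton."},
]

context = build_context(rows)
print(context)
```

With a dataframe, `results_df.to_dict(orient="records")` yields rows in this shape.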


A user may ask follow-up questions, which the LLM can answer based on the information in the dataframe.

In [43]:
response = chat.send_message("""could you tell me more about the <product>?""", **parameters)
print(f"Response from Model: {response.text}")

Response from Model: The Gabbeh geometric hand-knotted wool plum/gray area rug is a contemporary take on the traditional Gabbeh look. It is made from 100% wool and features a geometric pattern in shades of plum and gray. The rug is available in a variety of sizes to suit your needs.


### Cleaning up

To delete all the GCP resources used, uncomment and run the following cells.

In [None]:
# Force undeployment of indexes and delete endpoint
# my_index_endpoint.delete(force=True)

In [None]:
# Delete indexes
# tree_ah_index.delete()