# End-to-end example of semantic text search using Azure OpenAI Service and Azure Cache for Redis Enterprise 

Reference code:
- https://github.com/openai/openai-cookbook/blob/5b5f22812158002f19e24fcb5c9a391a6551c1e2/examples/Obtain_dataset.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Semantic_text_search_using_embeddings.ipynb
- https://github.com/RedisAI/vecsim-demo/blob/master/SemanticSearch1k.ipynb
- https://redis-py.readthedocs.io/en/stable/examples/search_vector_similarity_examples.html

Azure OpenAI Service References:
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/quickstart?pivots=programming-language-studio
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/embeddings?tabs=console

Azure Cache for Redis Enterprise References:
- https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/quickstart-create-redis-enterprise
- https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-redis-modules
- https://redis.io/docs/stack/search/
- https://redis.io/docs/stack/search/reference/vectors/
- https://www.youtube.com/watch?v=_Lrbesg4DhY

Set the following environment variables before running the notebook.

In [None]:
# For Ubuntu
# !export OPENAI_NAME=<your-openai-name>
# !export OPENAI_KEY=<your-openai-key>
# !export REDIS_NAME=<your-redis-name>
# !export REDIS_KEY=<your-redis-key>

# For Windows
# !set OPENAI_NAME=<your-openai-name>
# !set OPENAI_KEY=<your-openai-key>
# !set REDIS_NAME=<your-redis-name>
# !set REDIS_KEY=<your-redis-key>
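Commands prefixed with `!` run in a subshell, so an `export` or `set` issued that way does not reach the notebook kernel; the variables need to be set before Jupyter starts (or with the `%env` magic). A minimal sketch for checking that the four variables used in this notebook are visible to the kernel follows. The helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# The four variables this notebook reads via os.environ
REQUIRED_VARS = ("OPENAI_NAME", "OPENAI_KEY", "REDIS_NAME", "REDIS_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this before the cells below gives a readable message instead of a `KeyError` from `os.environ[...]`.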

In [1]:
import os

import numpy as np
import openai
import pandas as pd
import redis
import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity
from redis.commands.search.query import Query
from redis.commands.search.result import Result
from redis.commands.search.field import VectorField, TextField, NumericField

In [2]:
# Azure OpenAI Service parameters
openai_name = os.environ["OPENAI_NAME"]
openai_uri = f"https://{openai_name}.openai.azure.com/"

openai.api_type = "azure"
# openai.api_base = "https://<your-openai-name>.openai.azure.com/"
openai.api_base = openai_uri
openai.api_version = "2022-12-01"
# openai.api_key = "<your-openai-key>"
openai.api_key = os.environ["OPENAI_KEY"]

Use the GPT-2/GPT-3 tokenizer for V1 embedding models and the cl100k_base tokenizer for V2 models.
- https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

This example uses the text search embedding models.
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#text-search-embedding

In [3]:
# embedding model parameters
# embedding_model_for_doc = "<your-deployment-name>" 
embedding_model_for_doc = "text-search-ada-doc-001"
# embedding_model_for_query = "<your-deployment-name>" 
embedding_model_for_query = "text-search-ada-query-001"
# embedding_encoding = "cl100k_base"
embedding_encoding = "gpt2"  # the models above use the GPT-2/GPT-3 tokenizer
max_tokens = 2000  # the maximum number of input tokens is 2046
embedding_dimension = 1024  # the number of output dimensions is 1024

### Load the dataset
To save space, this example uses a pre-filtered dataset. Download it and copy it into a folder named `data` in the same directory as this notebook.
- https://github.com/openai/openai-cookbook/blob/main/examples/data/fine_food_reviews_1k.csv

The original dataset is the Amazon fine-food reviews dataset.
- https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

In [4]:
# load & inspect dataset
input_datapath = "data/fine_food_reviews_1k.csv"

df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["Combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)

df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,Combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


In [5]:
# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first keep the 2k most recent entries, assuming fewer than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["N_tokens"] = df["Combined"].apply(lambda x: len(encoding.encode(x)))
df = df[df["N_tokens"] <= max_tokens].tail(top_n)

display(df.head(2))
print(len(df))

Unnamed: 0,ProductId,UserId,Score,Summary,Text,Combined,N_tokens
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,51
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178


1000


### Get embeddings and save them for future reuse

In [6]:
%%time
# This may take a few minutes
df["Embedding"] = df["Combined"].apply(lambda x: get_embedding(x, engine=embedding_model_for_doc))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")

df.head(2)

CPU times: user 4.19 s, sys: 358 ms, total: 4.54 s
Wall time: 3min 9s


Unnamed: 0,ProductId,UserId,Score,Summary,Text,Combined,N_tokens,Embedding
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,51,"[0.04436948895454407, 0.002606603316962719, -0..."
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178,"[-0.002989507280290127, 0.010171078145503998, ..."


### Semantic text search using embeddings

In [7]:
# to load the embeddings from the CSV file created above, uncomment the following lines

# %%time
# datafile_path = "data/fine_food_reviews_with_embeddings_1k.csv"
# df = pd.read_csv(datafile_path, index_col=0)
# df["Embedding"] = df["Embedding"].apply(eval).apply(np.array)

# df.head(2)
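The CSV file stores each embedding as the string form of a Python list, which is why the commented cell above converts it back with `eval`. As a hedged alternative sketch, `ast.literal_eval` from the standard library parses the same text while accepting only literals. The helper name `parse_embedding` is hypothetical.

```python
import ast


def parse_embedding(cell):
    """Parse a stringified list of numbers, e.g. "[0.1, -0.2]", into a list of floats."""
    values = ast.literal_eval(cell)  # evaluates literals only, not arbitrary expressions
    if not isinstance(values, list):
        raise ValueError("expected a list literal")
    return [float(v) for v in values]


print(parse_embedding("[0.04436948895454407, 0.002606603316962719]"))
```

With pandas, the same function could be passed to `df["Embedding"].apply(...)` in place of `eval`.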

In [8]:
# search through the reviews for a specific product on local PC
def search_reviews(df, description, n=3, pprint=True, engine="text-search-ada-query-001"):
    embedding = get_embedding(description, engine=engine)
    df["Similarity"] = df["Embedding"].apply(lambda x: cosine_similarity(x, embedding))  # Use cosine similarity
    df["Ret_Combined"] = df["Combined"].str.replace("Title: ", "").str.replace("; Content:", ": ")
    results = (
        df.sort_values("Similarity", ascending=False)
        .head(n)
        .loc[:,["Similarity", "Ret_Combined"]]
    )
    if pprint:
        for i,r in results.iterrows():
            print("%s | %s\n" % (r["Similarity"], r["Ret_Combined"][:200]))
    return results
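To make the ranking step in `search_reviews` concrete, here is a minimal dependency-free sketch of cosine similarity and a top-n ranking over a few toy vectors. The names `cosine_sim` and `top_n` and the two-dimensional vectors are illustrative assumptions; the notebook itself uses `cosine_similarity` from `openai.embeddings_utils` on 1024-dimensional embeddings.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, docs, n=3):
    """Rank `docs` (label -> vector) by similarity to `query_vec`, highest first."""
    scored = [(cosine_sim(vec, query_vec), label) for label, vec in docs.items()]
    return sorted(scored, reverse=True)[:n]


# Toy data standing in for review embeddings
docs = {"beans": [1.0, 0.0], "pasta": [0.0, 1.0], "coffee": [1.0, 1.0]}
for sim, label in top_n([1.0, 0.1], docs, n=2):
    print(f"{sim:.3f} | {label}")
```

This is an exact (brute-force) search: every stored vector is compared with the query, which is fine for 1,000 reviews but scales linearly with the dataset size.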

In [9]:
results = search_reviews(df, "delicious beans", n=3, engine=embedding_model_for_query)

0.3770173260434461 | Best beans your money can buy:  These are, hands down, the best jelly beans on the market.  There isn't a gross one in the bunch and each of them has an intense, delicious flavor.  Though I hesitate t

0.37512639635782036 | Delicious!:  I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning

0.37370195283798296 | Jamaican Blue beans:  Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor



In [10]:
results = search_reviews(df, "whole wheat pasta", n=3, engine=embedding_model_for_query)

0.4129591672746144 | Bland and vaguely gamy tasting, skip this one:  As far as prepared dinner kits go, "Barilla Whole Grain Mezze Penne with Tomato and Basil Sauce" just did not do it for me...and this is coming from a p

0.39829441154909384 | Tasty and Quick Pasta:  Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara.  I just wish there was more of it.  If you aren't starving or on a 

0.3979758963573513 | sooo good:  tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.



In [14]:
results = search_reviews(df, "bad delivery", n=3, engine=embedding_model_for_query)

0.3692189044013717 | great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa

0.3692189044013717 | great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa

0.3692189044013717 | great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa



In [15]:
results = search_reviews(df, "spoilt", n=3, engine=embedding_model_for_query)

0.27814680872535336 | Supurb:  I was introduced to Lagavulin 16 three days ago. I'm no single malt Scotch authority, but this was straight out delicious; better than anything I've tried, including highly-touted and expensi

0.2730928788578238 | More Good Stuff:  Spitting seeds may not seem too etiquite but it sure is fun. All the available flavors are so intense it is hard to quit. It is however recommended to keep them to yourself and dispo

0.2720145903089504 | Disappointed:  The metal cover has severely disformed. And most of the cookies inside have been crushed into small pieces. Shopping experience is awful. I'll never buy it online again.



In [16]:
results = search_reviews(df, "pet food", n=3, engine=embedding_model_for_query)

0.37976877329898806 | Good food:  The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.

0.356221526222854 | Good product:  I like that this is a better product for my pets but really for the price of it I couldn't afford to buy this all the time. My cat isn't very picky usually and she ate this, we usually 

0.3532099786500319 | Perfect for Giving Medications to Our Dogs:  Every month, we give our three dogs (two Aussies and a Golden/Flat-Coat Retriever mix) pills for flea/tick/worm repellant.  In addition, one of our Aussies



Accuracy aside, queries can also be issued in languages other than English.

In [17]:
results = search_reviews(df, "おいしいお豆", n=3, engine=embedding_model_for_query)

0.3027765950820416 | Simple and Authentic:  This is a fantastic do-it-yourself poke product. Just add sesame oil and green onion for color then enjoy your authentic Hawaiian treat!

0.30150749900539286 | spicy:  It is a too spicy grocery in japan.<br /><br />If you cook for udon or something, you can use one.<br /><br />You should buy one.

0.2937023705102833 | sesamiOil:  This is a good grocery for us.<br /><br />If you cook something,you can use it.<br /><br />It is smells so good.<br /><br />You should buy it.



### Store vectors in Redis Enterprise

In [18]:
# redis_name = "<your-redis-name>"
redis_name = os.environ["REDIS_NAME"]
redis_host = f"{redis_name}.southcentralus.redisenterprise.cache.azure.net"  # Example of redis in the South Central US region
# redis_key =  "<your-redis-key>"
redis_key = os.environ["REDIS_KEY"]

In [19]:
redis_conn = redis.StrictRedis(host=redis_host, port=10000, password=redis_key, ssl=True)

Choose a Hierarchical Navigable Small World (HNSW) index for efficient Approximate Nearest Neighbor (ANN) search.
- https://redis.io/docs/stack/search/reference/vectors/
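The index created in this notebook passes only the required HNSW attributes (`TYPE`, `DIM`, `DISTANCE_METRIC`). As a hedged sketch, the optional tuning attributes listed in the RediSearch vector reference (`M`, `EF_CONSTRUCTION`, `EF_RUNTIME`) could be collected in a dictionary as shown here. The helper name `hnsw_attributes` is hypothetical, and the numeric values are illustrative rather than recommendations.

```python
def hnsw_attributes(dim, m=16, ef_construction=200, ef_runtime=10):
    """Build an attribute dictionary for an HNSW vector field.

    m, ef_construction and ef_runtime trade index size and build/query time
    against recall; the values used here are illustrative assumptions.
    """
    return {
        "TYPE": "FLOAT32",
        "DIM": dim,
        "DISTANCE_METRIC": "COSINE",
        "M": m,
        "EF_CONSTRUCTION": ef_construction,
        "EF_RUNTIME": ef_runtime,
    }


attrs = hnsw_attributes(1024)
print(attrs)
```

Such a dictionary would take the place of the third argument of `VectorField("Embedding", "HNSW", {...})`.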

In [21]:
# create index on vector field
schema = ([
    VectorField("Embedding", "HNSW", {"TYPE": "FLOAT32", "DIM": embedding_dimension, "DISTANCE_METRIC": "COSINE"}),  # RediSearch uses cosine DISTANCE
    TextField("ProductId"),
    TextField("UserId"),
    NumericField("Score"),
    TextField("Summary"),
    TextField("Text"),
    TextField("Combined"),
    NumericField("N_tokens")
])
# redis_conn.ft().dropindex()   # uncomment to drop the index before re-creating it
redis_conn.ft().create_index(schema)

b'OK'

In [23]:
%%time
# store data into redis
for i, row in df.iterrows():
    d = {
        "Embedding": np.array(row["Embedding"]).astype(np.float32).tobytes(),
        "ProductId": row["ProductId"],
        "UserId":    row["UserId"],
        "Score":     row["Score"],
        "Summary":   row["Summary"],
        "Text":      row["Text"],
        "Combined":  row["Combined"],
        "N_tokens":  row["N_tokens"]
    }
    redis_conn.hset(str(i), mapping=d)

CPU times: user 1.01 s, sys: 59.8 ms, total: 1.07 s
Wall time: 2min 32s
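The loop above stores each embedding as a raw FLOAT32 byte string, so a 1024-dimensional vector occupies 1024 × 4 = 4096 bytes in its hash field. A minimal standard-library sketch of that encoding follows. It assumes little-endian byte order, which matches what `np.array(...).astype(np.float32).tobytes()` produces on typical x86 and ARM machines; the helper names are hypothetical.

```python
import struct


def to_float32_bytes(vector):
    """Pack a sequence of floats as little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Unpack a byte string produced by to_float32_bytes back into a list."""
    count = len(blob) // 4  # 4 bytes per FLOAT32 value
    return list(struct.unpack(f"<{count}f", blob))


blob = to_float32_bytes([0.5, -0.25, 1.0])
print(len(blob))                 # 12
print(from_float32_bytes(blob))  # [0.5, -0.25, 1.0]
```

Values that are not exactly representable in 32 bits lose some precision in the round trip, which is one reason scores computed in Redis can differ slightly from the locally computed ones.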


### Semantic text search by RediSearch
Vector search query samples:
- https://redis.io/docs/stack/search/reference/vectors/#vector-search-examples

In [24]:
# search through the reviews for a specific product on Redis using RediSearch
def search_reviews_redis(query, n=3, pprint=True, engine="text-search-ada-query-001"):
    q_vec = np.array(get_embedding(query, engine=engine)).astype(np.float32).tobytes()
    
    q = (
        Query(f"*=>[KNN {n} @Embedding $vec_param AS vector_score]")
        .sort_by("vector_score")
        .paging(0, n)
        .return_fields("vector_score", "Combined")
        .dialect(2)
    )
    params_dict = {"vec_param": q_vec}
    ret_redis = redis_conn.ft().search(q, query_params = params_dict)
    
    columns = ["Similarity", "Ret_Combined"]
    ret_df = pd.DataFrame(columns=columns)
    for doc in ret_redis.docs:
        sim = 1 - float(doc.vector_score)  # converts cosine DISTANCE to cosine SIMILARITY
        com = doc.Combined[:200].replace("Title: ", "").replace("; Content:", ": ")
        append_df = pd.DataFrame(data=[[sim, com]], columns=columns)
        ret_df = pd.concat([ret_df, append_df], ignore_index=True, axis=0)
        if pprint:
            print("%s | %s\n" % (sim, com))
    return ret_df
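Two small pieces of `search_reviews_redis` can be isolated for clarity: the KNN query string and the conversion from the cosine distance that RediSearch returns to the cosine similarity used by the local search. The sketch below uses hypothetical helper names; the field name, parameter name, and score alias default to the ones used in this notebook.

```python
def knn_query_string(k, field="Embedding", param="vec_param", score_alias="vector_score"):
    """Build a pure KNN query: match all documents (*), then keep the k nearest."""
    return f"*=>[KNN {k} @{field} ${param} AS {score_alias}]"


def distance_to_similarity(distance):
    """Convert a cosine distance (as returned in the score field) to cosine similarity."""
    return 1.0 - float(distance)


print(knn_query_string(3))
print(distance_to_similarity("0.25"))  # 0.75
```

The score comes back from Redis as a string, hence the `float(...)` conversion before subtracting from 1.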

In [25]:
results_redis = search_reviews_redis("delicious beans", n=3, engine=embedding_model_for_query)

0.375126361847 | Delicious!:  I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying diff

0.37370193004600005 | Jamaican Blue beans:  Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown

0.373187303543 | Good Buy:  I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!



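Compare with the cosine similarity computed locally by the code above. The local search is exact, whereas the HNSW index is approximate, so the two top-3 lists may differ.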

In [26]:
results = search_reviews(df, "delicious beans", n=3, engine=embedding_model_for_query)

0.3770173260434461 | Best beans your money can buy:  These are, hands down, the best jelly beans on the market.  There isn't a gross one in the bunch and each of them has an intense, delicious flavor.  Though I hesitate t

0.37512639635782036 | Delicious!:  I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning

0.37370195283798296 | Jamaican Blue beans:  Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor
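One common way to quantify the gap between an exact search and an approximate one is recall: the fraction of the exact top-k results that the approximate search also returned. The sketch below applies it to the review titles printed in the two "delicious beans" outputs in this notebook; the function name `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(exact, approximate):
    """Fraction of the exact top-k items that also appear in the approximate top-k."""
    if not exact:
        raise ValueError("exact result list must not be empty")
    return len(set(exact) & set(approximate)) / len(exact)


# Titles copied from the local (exact) and RediSearch (HNSW) outputs
local_top3 = ["Best beans your money can buy", "Delicious!", "Jamaican Blue beans"]
redis_top3 = ["Delicious!", "Jamaican Blue beans", "Good Buy"]

recall = recall_at_k(local_top3, redis_top3)
print(f"recall@3 = {recall:.2f}")  # recall@3 = 0.67
```

Here two of the three exact results were also found by the Redis query. This single query is only an illustration, not a benchmark of the index.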



In [None]:
# redis_conn.flushall()  # uncomment to flush all databases
# redis_conn.keys()

