## Semantic text search using embeddings

We can search through all our product entries semantically, efficiently and at very low cost, by embedding the search query and then finding the most similar entries. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).

In [None]:
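import ast

import numpy as np
import pandas as pd

datafile_path = "villa_database_with_embeddings.csv"

df = pd.read_csv(datafile_path)
# Embeddings are stored as strings in the CSV; parse them without executing arbitrary code
df["embedding"] = df.embedding.apply(ast.literal_eval).apply(np.array)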


In [None]:
def convertDtype(array):
    return np.array(array).astype(np.float16)

In [None]:
import pyarrow.feather as feather
import pandas as pd
import numpy as np

df = feather.read_feather("villa_index_float32.feather")
# df["embedding"] = df.embedding.apply(convertDtype)


In [None]:
df.embedding.dtype

dtype('O')

Remember to use the document embedding engine for documents (in this case the product entries) and the query embedding engine for queries. Note that here we just compare the cosine similarity between the query embedding and each document embedding, and show the top `n` matches.
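
For reference, the ranking step can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: `cosine_sim` and `top_n_matches` are hypothetical helper names, and the two-dimensional vectors are invented toy data standing in for real embeddings.

```python
import heapq
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_matches(query_vec, doc_vecs, n=3):
    """Return (index, similarity) pairs for the n most similar documents."""
    scored = ((i, cosine_sim(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy 2-dimensional vectors, invented for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
best = top_n_matches(query, docs, n=2)
print(best)
```

Because cosine similarity depends only on the angle between vectors, the document pointing in nearly the same direction as the query ranks first regardless of vector length.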

In [None]:
from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
openai.api_key = "key"


# search through the catalog for entries matching a product description
def search_reviews(df, product_description, n=3, pprint=True):
    product_embedding = get_embedding(
        product_description,
        engine="text-embedding-ada-002"
    )
    # note: this adds (or overwrites) a "similarity" column on the caller's df
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
    )
    print(results)
    if pprint:
        # iterating a DataFrame yields column names, so loop over the text column;
        # assumes a "combined" column like the one in the outputs further down
        for r in results["combined"]:
            print(r[:200])
    return results


In [None]:
import joblib

kmeans_model = joblib.load("villa_clustering_model.joblib")


In [None]:
import pickle
with open("villa_clustering_model.pkl", "rb") as f:
    kmeans_model = pickle.load(f)


In [None]:
kmeans_model

AttributeError: 'KMeans' object has no attribute '__all__'

In [None]:
product_embedding = np.array(get_embedding(
        "sun warrior",
        engine="text-embedding-ada-002"
    )).reshape(1, -1)

In [None]:
product_embedding.dtype

dtype('float64')

In [None]:
cluster = kmeans_model.predict(product_embedding.astype("float32")).item()
cluster

2

In [None]:
results = search_reviews(df, "sun warrior", n=10, pprint=False)


                                           embedding  cluster  similarity
2  [-0.0028219079, -0.0105307195, 0.0041326457, -...        2    0.835098
3  [0.00081198575, -0.010347571, -0.001964573, -0...        3    0.820642
4  [-0.010866422, -0.0023932648, 0.0016940071, -0...        4    0.820241
1  [-0.01243439, -0.012545745, 0.0036076137, -0.0...        1    0.810199
0  [-0.003391798, -0.011681105, 0.0029432552, -0....        0    0.800552


In [None]:
df2 = feather.read_feather("cluster_3.feather")
results = search_reviews(df2, "sun warrior", n=10, pprint=False)


      index  cprcode                       pr_engname  \
4495  62836   159768                        CLUB SW/S   
639    9061    29928       H W SUMBUCA 75 CL. HIRAN.W   
3355  47106   216524               SACRED HILL SHIRAZ   
1908  27529   210082                XANADU DJL SHIRAZ   
3990  56214   233253                       HAKU VODKA   
2666  38169   209247                  CH.  CHEVALIERS   
4188  58886   216526                   STIMSOM MERLOT   
3973  56000   248486  BUENA VISTA THE LEGENDARY BADGE   
2714  38714   227890            SUNTORY HOROYOI WHITE   
4524  63292   244947            MCW HANWOOD CAB.SAUVG   

                                   pr_name  \
4495                             CLUB SW/S   
639             H W SUMBUCA 75 CL. HIRAN.W   
3355                    SACRED HILL SHIRAZ   
1908                     XANADU DJL SHIRAZ   
3990                    SUNTORY HAKU VODKA   
2666                        CH. CHEVALIERS   
4188                        STIMSOM MERLOT   
3973

In [None]:
results

30593                                      APRICOT APRICOT
33249                      APRICOT LATTICE APRICOT LATTICE
13215                      APRICOT LATTICE APRICOT LATTICE
10111                      APRICOT LATTICE APRICOT LATTICE
62877          APRICOT YELLOW IMPORT APRICOT YELLOW IMPORT
                               ...                        
23809                              GRAPE FRUIT GRAPE FRUIT
29929                                KIWI PUREE KIWI PUREE
52600    APPLE OURIN PRINT#36(JP) APPLE OURIN PRINT#36(JP)
18653                                  HOT FRUIT HOT FRUIT
15043                        AUS PEACH (WH) AUS PEACH (WH)
Name: combined, Length: 100, dtype: object

In [None]:
results = search_reviews(df, "whole wheat pasta", n=3)

WHOLE WHEAT CROISSANT AND OAT:  WHOLE WHEAT CROISSANT AND OAT

SANREMO LINGUINE PASTA 500 G.:  SANREMO ลิงกวินี 500 กรัม

VEGAN SANDWICH WHOLE WHEAT LOAF 360G:  วีแกนแซนด์วิชโฮลวีทโลฟ 360กรัม



We can search through these product entries easily. To speed up computation, we can use an approximate nearest-neighbor algorithm designed for faster search through embeddings.
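
One simple way to reduce the number of comparisons, in the spirit of the k-means model and per-cluster file loaded earlier, is to assign the query to its nearest centroid and rank only the documents in that cluster. The sketch below is an assumption about how such a prefilter could look, not the notebook's actual pipeline: the function names, labels, centroids, and vectors are all hypothetical toy values.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_centroid(query_vec, centroids):
    """Index of the centroid closest to the query (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((q - x) ** 2 for q, x in zip(query_vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def search_in_cluster(query_vec, doc_vecs, labels, centroids, n=3):
    """Rank only the documents whose cluster label matches the query's cluster."""
    cluster = nearest_centroid(query_vec, centroids)
    candidates = [i for i, lab in enumerate(labels) if lab == cluster]
    ranked = sorted(
        candidates,
        key=lambda i: cosine_sim(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return cluster, ranked[:n]


# Toy data, invented for illustration only.
doc_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]
centroids = [[0.95, 0.15], [0.15, 0.95]]
cluster, hits = search_in_cluster([1.0, 0.0], doc_vecs, labels, centroids, n=2)
print(cluster, hits)
```

The trade-off is that this search is approximate: a close match that happens to sit in a neighboring cluster is never scored, so results can differ from a full scan.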

As we can see, this can immediately deliver a lot of value. In the next example we quickly find the products that match a "dog food" query.

In [None]:
results = search_reviews(df, "dog food", n=3)

       cprcode                             pr_engname  \
16862   237726  ORIJEN ORIGINAL BIOLOGICALLY DOG FOOD   
65319   249574       DOGGA DOGA DRIED SALMON DOG FOOD   
63914   248446           DOGSTER PLAY MIX TUNA+CARROT   

                                            pr_name  \
16862    ORIJEN ORIGINAL BIOLOGICALLY DOG FOOD 340G   
65319  ด็อกก้า ดูก้า เนื้อปลาแซลม่อนอบแห้ง ขนมสุนัข   
63914                  DOGSTER PLAY MIX TUNA+CARROT   

                                                combined  n_tokens  \
16862  ORIJEN ORIGINAL BIOLOGICALLY DOG FOOD ORIJEN O...        25   
65319  DOGGA DOGA DRIED SALMON DOG FOOD ด็อกก้า ดูก้า...        55   
63914  DOGSTER PLAY MIX TUNA+CARROT DOGSTER PLAY MIX ...        21   

                                               embedding  similarity  
16862  [0.011956698261201382, -0.018227148801088333, ...    0.856884  
65319  [-0.009410101920366287, -0.00703136483207345, ...    0.856395  
63914  [-0.019568517804145813, -0.013602837920188904,...

In [None]:
results = search_reviews(df, "pet food", n=2)

WHISKAS POCKETS TUNA  1.2KG:  WHISKAS อาหารแมวชนิดเม็ด รสทูน่า 1.2 กก.

ROYAL CANIN MINI ADULT 800G:  ROYAL CANIN MINI ADULT 800G

