## Semantic text search using embeddings

We can search all our reviews semantically, efficiently and at low cost, by embedding the search query and then finding the reviews whose embeddings are most similar to it. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).

In [17]:
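import ast

import numpy as np
import pandas as pd

datafile_path = "villa_database_with_embeddings.csv"

df = pd.read_csv(datafile_path)
# Embeddings are stored in the CSV as stringified lists. Parse them with
# ast.literal_eval (safer than eval) and convert each one to a NumPy array.
df["embedding"] = df.embedding.apply(ast.literal_eval).apply(np.array)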


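With a single general-purpose model such as `text-embedding-ada-002`, which the search function in this notebook uses, documents and queries are embedded by the same model. If you use paired document and query engines instead, remember to embed the documents (in this case reviews) with the document engine and the queries with the query engine. Note that here we just compute the cosine similarity between the query embedding and each document embedding, and show the top `n` matches.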
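To make the ranking step concrete, here is a minimal sketch of cosine similarity and top-`n` selection using only NumPy. The helper `cosine_sim` and the three-dimensional `toy_docs` vectors are made up for illustration; they are not part of the dataset or of the `openai` library, and real embeddings have many more dimensions.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy "embeddings", chosen so the ranking is easy to check by hand.
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

# Score every document against the query, then keep the two best matches.
scores = {name: cosine_sim(vec, toy_query) for name, vec in toy_docs.items()}
top_2 = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_2)  # ['doc_a', 'doc_c']
```

`doc_a` points in the same direction as the query (similarity 1), `doc_c` is 45 degrees away (about 0.71), and `doc_b` is orthogonal (0), so the first two are returned.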

In [1]:
from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
openai.api_key = "put your key here"


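# Search the reviews for the entries most similar to a product description.
def search_reviews(df, product_description, n=3, pprint=True):
    # Embed the query text.
    product_embedding = get_embedding(
        product_description,
        engine="text-embedding-ada-002"
    )
    # Score every row against the query embedding.
    # Note: this adds or overwrites a "similarity" column on the caller's DataFrame.
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    # Keep the text of the n highest-scoring rows.
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
        .combined
    )
    if pprint:
        # Print the first 200 characters of each match.
        for r in results:
            print(r[:200])
            print()
    return results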


results = search_reviews(df, "please show me apricots", n=10)



In [20]:
results

30593                                      APRICOT APRICOT
33249                      APRICOT LATTICE APRICOT LATTICE
13215                      APRICOT LATTICE APRICOT LATTICE
10111                      APRICOT LATTICE APRICOT LATTICE
62877          APRICOT YELLOW IMPORT APRICOT YELLOW IMPORT
                               ...                        
23809                              GRAPE FRUIT GRAPE FRUIT
29929                                KIWI PUREE KIWI PUREE
52600    APPLE OURIN PRINT#36(JP) APPLE OURIN PRINT#36(JP)
18653                                  HOT FRUIT HOT FRUIT
15043                        AUS PEACH (WH) AUS PEACH (WH)
Name: combined, Length: 100, dtype: object

In [8]:
results = search_reviews(df, "whole wheat pasta", n=3)

WHOLE WHEAT CROISSANT AND OAT:  WHOLE WHEAT CROISSANT AND OAT

SANREMO LINGUINE PASTA 500 G.:  SANREMO ลิงกวินี 500 กรัม

VEGAN SANDWICH WHOLE WHEAT LOAF 360G:  วีแกนแซนด์วิชโฮลวีทโลฟ 360กรัม



We can search through these reviews easily. To speed up computation on larger datasets, we can use an algorithm designed for fast search over embeddings, such as approximate nearest-neighbor search.
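Before reaching for a dedicated search algorithm, one simple speed-up is to replace the per-row `apply` with a single matrix product. The sketch below is an exact brute-force search, not an approximate nearest-neighbor index. The function names `build_index` and `top_n_indices` are hypothetical, and the code assumes every embedding has the same length and is non-zero.

```python
import numpy as np


def build_index(embeddings):
    """Stack embeddings into one matrix and scale each row to unit length."""
    matrix = np.vstack([np.asarray(e, dtype=float) for e in embeddings])
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


def top_n_indices(index, query_embedding, n=3):
    """Return the row positions of the n most similar rows, best first."""
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    # With unit-length rows and a unit-length query, the dot product
    # equals the cosine similarity.
    sims = index @ q
    n = min(n, len(sims))
    # argpartition selects the n largest scores without sorting every row;
    # only those n candidates are then sorted.
    candidates = np.argpartition(-sims, n - 1)[:n]
    return candidates[np.argsort(-sims[candidates])]


# Toy two-dimensional vectors, made up for illustration.
toy_index = build_index([[1, 0], [0, 1], [1, 1], [-1, 0]])
print(top_n_indices(toy_index, [1, 0.1], n=2))
```

Under these assumptions, the index could be built once with `build_index(df["embedding"])` and reused across queries, with `df.iloc[...]` applied to the returned positions to recover the matching rows.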

In [4]:
results = search_reviews(df, "bad delivery", n=1)

great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa



As we can see, this can deliver value immediately. In this example, the query quickly surfaces a review that describes a delivery failure.

In [16]:
results = search_reviews(df, "canin pet food", n=3)

ROYAL CANIN MINI ADULT 800G:  ROYAL CANIN MINI ADULT 800G

ROYAL CANIN MINI INDOOR ADULT 500G:  ROYAL CANIN MINI INDOOR ADULT 500G

WHISKAS POCKETS TUNA  1.2KG:  WHISKAS อาหารแมวชนิดเม็ด รสทูน่า 1.2 กก.



In [9]:
results = search_reviews(df, "pet food", n=2)

WHISKAS POCKETS TUNA  1.2KG:  WHISKAS อาหารแมวชนิดเม็ด รสทูน่า 1.2 กก.

ROYAL CANIN MINI ADULT 800G:  ROYAL CANIN MINI ADULT 800G

