## 1. Load the dataset

This example uses the [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) dataset from Amazon, which contains 568,454 food reviews that Amazon users left up to October 2012. For illustration, we use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).

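We combine the review summary and review text into a single combined text, which the model encodes into a single embedding vector. The cells below apply the same approach to a product catalog export, where the combined text is built from the `pr_engname` and `pr_name` columns.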

To run this notebook, you will need to install: pandas, openai, tiktoken, pyarrow, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

In [1]:
# imports
import openai
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding  # available in openai<1.0 only

openai.api_key = "key"  # placeholder; replace with your own API key

In [None]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191
# pass only the text and the engine here; the encoding and max_tokens settings
# are used with tiktoken further down and are not get_embedding arguments
x = get_embedding("sun warrior", engine=embedding_model)

In [None]:
x

[-0.004845874849706888,
 0.004899438004940748,
 -0.016358766704797745,
 -0.024475134909152985,
 -0.01734180562198162,
 0.012571548111736774,
 -0.019156644120812416,
 0.009036391042172909,
 -0.010227379389107227,
 -0.026945333927869797,
 0.022861942648887634,
 0.01032190304249525,
 -0.023479493334889412,
 -0.006654413416981697,
 0.007977734319865704,
 0.002637189347296953,
 0.02520611137151718,
 -0.01204852107912302,
 0.012943338602781296,
 0.013094575144350529,
 -0.010580264963209629,
 -0.0035099510569125414,
 0.004070787224918604,
 0.00863939430564642,
 -0.020631201565265656,
 -0.0019203906413167715,
 0.012161948718130589,
 -0.019194453954696655,
 0.030373364686965942,
 -0.03102872334420681,
 0.0036170771345496178,
 -0.0078138941898942,
 -0.00607782369479537,
 -0.017820721492171288,
 0.004864779766649008,
 -0.015640392899513245,
 0.0013737330446019769,
 -0.01555217057466507,
 0.01953473687171936,
 -0.016169721260666847,
 0.0073160738684237,
 0.008273906074464321,
 0.01141836866736412,

In [2]:
input_datapath = "dynamodb_export_full.csv"  # product catalog export (CSV)
df0 = pd.read_csv(input_datapath, index_col=0)





In [4]:
from pyarrow import feather

df = feather.read_feather("villa_database_with_float32_embeddings.feather")

In [8]:
print(df0.iloc[200].pr_engname)
print(df.iloc[200].pr_engname)


MCVITIES MINI BN CHOCOLATE FLAVOUR BISCU
MCVITIES MINI BN CHOCOLATE FLAVOUR BISCU


In [9]:
print(df0.columns)

Index(['iprcode', 'oprcode', 'ordertype', 'pr_abb', 'pr_active', 'pr_cgcode',
       'pr_dpcode', 'pr_engname', 'pr_ggcode', 'pr_market', 'pr_name',
       'pr_sa_method', 'pr_sucode1', 'pr_suref3', 'prtype', 'pstype',
       'pr_country_th', 'pr_country_en', 'pr_keyword_th', 'pr_keyword_en',
       'pr_filter_th', 'pr_filter_en', 'online_category_l1_th',
       'online_category_l1_en', 'online_category_l2_th',
       'online_category_l2_en', 'online_category_l3_th',
       'online_category_l3_en', 'villa_category_l1_en', 'villa_category_l2_en',
       'villa_category_l3_en', 'villa_category_l4_en', 'content_en',
       'content_th', 'hema_brand_th', 'hema_brand_en', 'hema_sizedesc',
       'pr_brand_en', 'pr_brand_th', 'pr_online_name_en', 'pr_online_name_th',
       'hema_name_en', 'hema_name_th', 'pr_name_en', 'pr_name_th',
       'pr_barcode', 'pr_barcode2', 'sort_weight', 'master_online',
       'salemode_unit', 'ba_nprice', 'sort_villa_sku',
       'product_detail_description', '

In [57]:
df.columns

Index(['cprcode', 'pr_engname', 'pr_name', 'combined', 'n_tokens', 'embedding',
       'pr_filter_en', 'hema_brand_en', 'ba_nprice'],
      dtype='object')

In [67]:
type(df["hema_brand_en"][3])

str

In [58]:
df.to_feather("../villa_database_with_float32_embeddings.feather")
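
The feather file name indicates that the embeddings are stored as 32-bit floats. As a rough, standard-library-only illustration of the storage saving (this is not the conversion code used to produce the file, which this notebook does not show), the sketch below packs a 1,536-value vector, the output size of text-embedding-ada-002, at both precisions. The helper name `packed_size` and the placeholder values are assumptions made for the example.

```python
from array import array

EMBEDDING_DIM = 1536  # output dimension of text-embedding-ada-002


def packed_size(values, typecode):
    """Return the number of bytes needed to store `values` with the given array typecode."""
    packed = array(typecode, values)
    return packed.itemsize * len(packed)


# placeholder values, not a real embedding
vector = [0.001 * i for i in range(EMBEDDING_DIM)]

size_f64 = packed_size(vector, "d")  # 64-bit floats
size_f32 = packed_size(vector, "f")  # 32-bit floats
print(size_f64, size_f32)
```

Halving the precision halves the stored size, at the cost of rounding each value to roughly seven significant digits.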

In [45]:
# check whether cprcode 20150 is present
if (df.cprcode == 20150).any():
    print("found it")

In [None]:
# load & inspect dataset
input_datapath = "dynamodb_export_full.csv"
df = pd.read_csv(input_datapath, index_col=0)
df = df[["pr_engname", "pr_name"]].dropna()
# combine both name columns into the single text that will be embedded
df["combined"] = df.pr_engname.str.strip() + " " + df.pr_name.str.strip()
df.head(2)




cprcode,pr_engname,pr_name,combined
225407,KONJAC LINGUINI,บุกเส้นแบน ตราโมกุ,KONJAC LINGUINI บุกเส้นแบน ตราโมกุ
241101,BUMILGOCHUJANG,บูมิลโคชูจังซอสเผ้ดเกาหลี,BUMILGOCHUJANG บูมิลโคชูจังซอสเผ้ดเกาหลี


In [None]:
# subsampling to the 1k most recent rows is disabled for this dataset
# top_n = 1000
# df = df.sort_values("Time").tail(top_n * 2)  # first cut to the most recent 2k entries, assuming less than half will be filtered out
# df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# count tokens per row; the filter that omits texts too long to embed is left commented out
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
# df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)


65608

In [None]:
encoding.encode("Hello world")

[9906, 1917]
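
The length filter above is commented out, so `n_tokens` is computed but not used to drop rows. If over-long inputs do turn up, one alternative to dropping them is to truncate at the token level. The sketch below works on plain lists of token ids such as the `[9906, 1917]` output above; `truncate_tokens` is a hypothetical helper, not part of tiktoken or openai, and turning the result back into text would still need the `encoding` object.

```python
from typing import List

MAX_TOKENS = 8000  # mirrors `max_tokens` in this notebook


def truncate_tokens(tokens: List[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Return at most the first `max_tokens` token ids."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    return tokens[:max_tokens]


short = truncate_tokens([9906, 1917])  # within the budget, returned unchanged
long_ids = list(range(10_000))         # stand-in for an over-long input
clipped = truncate_tokens(long_ids)    # cut down to the first 8000 ids
print(len(short), len(clipped))
```

Truncation keeps every row in the dataset but discards the tail of long texts, so it suits cases where the start of the text carries most of the meaning.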

## 2. Get embeddings and save them for future reuse

In [None]:
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("villa_database_small_with_embeddings.csv")
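
One caveat when reusing the saved file: CSV stores each embedding as text, so a list of floats comes back as a string and has to be parsed before use. The sketch below shows that round trip with the standard library only, using a toy three-value vector rather than real model output; the row contents are assumptions made for illustration.

```python
import ast
import csv
import io

# toy row; the vector is a placeholder, not a real embedding
rows = [{"combined": "KONJAC LINGUINI", "embedding": [0.25, -0.5, 0.125]}]

# write the row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["combined", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow(row)

# read it back: the embedding column is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embedding"]

# parse the string back into a list of floats
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored)
```

With pandas, the same parsing step can be applied after `pd.read_csv` via `df["embedding"].apply(ast.literal_eval)`. A binary format such as the feather file used earlier avoids the parsing step altogether.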
