# Get embeddings from dataset 本地的数据集embeddings处理，向量化

This notebook gives an example on how to get embeddings from a large dataset.


## 1. Load the dataset  加载数据集

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of this dataset, consisting of 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding.

To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy.

In [1]:
import pandas as pd
import tiktoken

from utils.embedings_utils import get_embedding

In [2]:
embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

In [3]:
# load & inspect dataset  加载检查数据集
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna() #删除包含缺失值的所有行
#合数据集：新增一个字段combined列，由Title和Content合并而成
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


In [4]:
# subsample to 1k most recent reviews and remove samples that are too long
# 取出1000条最新的评论记录，如果超过了token阀值则移除
top_n = 1000
#按Time字段进行排序
df = df.sort_values("Time").tail(top_n * 2)  # first cut to first 2k entries, assuming less than half will be filtered out
#删除Time字段，后面不需要使用到Time
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
#将得到的编码embedding剔除超长的，放入n_tokens列
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

1000

## 2. Get embeddings and save them for future reuse

In [5]:
#新增一列embedding，存放通过lambda函数，获取到每一个联合列combined的值对应的embedding
#另存到文件fine_food_reviews_with_embeddings_1k.csv，作为知识库备用
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for text-embedding-3-small in organization org-vQCmbcqeaYFVL7mCXJnFGpTo on requests per day (RPD): Limit 200, Used 200, Requested 1. Please try again in 7m12s. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}

In [22]:
a = get_embedding("hi", model=embedding_model)