# Embeddings
https://platform.openai.com/docs/models#embeddings

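Embeddings are numerical representations of text (vectors of floating-point numbers) used to measure how related two pieces of text are.

Embeddings are useful for tasks such as search, clustering, recommendation, anomaly detection, and classification.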
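
As a rough illustration of "measuring relatedness", the sketch below computes the cosine similarity between small hand-written vectors. The three-dimensional vectors and the helper name `cosine_sim` are assumptions made for this example; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
coffee = [0.9, 0.1, 0.0]
espresso = [0.8, 0.2, 0.1]
bicycle = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_sim(coffee, espresso) > cosine_sim(coffee, bicycle))  # True
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.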

**Models and output dimensions**

| Model                         | Description                                                                  | Output dimensions |
|-------------------------------|------------------------------------------------------------------------------|-------------------|
| **text-embedding-3-large**   | Most capable embedding model for both English and non-English tasks          | 3,072             |
| **text-embedding-3-small**   | Improved performance over the second-generation ada embedding model          | 1,536             |
| **text-embedding-ada-002**   | Most capable second-generation embedding model, replacing 16 first-generation models | 1,536     |
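
To get a feel for what the output dimension means in practice, the sketch below estimates the raw storage needed per million vectors. The dimensions come from the table above. Storing each value as a 4-byte float32 is an assumption, and `storage_mib` is a hypothetical helper, not part of any library.

```python
# Output dimensions from the table above
MODEL_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def storage_mib(model, n_vectors):
    """Approximate raw size in MiB of n_vectors embeddings from the given model."""
    return MODEL_DIMS[model] * BYTES_PER_FLOAT32 * n_vectors / 2**20

for name in MODEL_DIMS:
    print(f"{name}: {storage_mib(name, 1_000_000):,.0f} MiB per million vectors")
```

Under that assumption, the large model needs twice the space of the small one for the same number of texts.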

## MTEB Leaderboard
**Massive Text Embedding Benchmark (MTEB) Leaderboard**

https://huggingface.co/spaces/mteb/leaderboard

The **MTEB Leaderboard** is a benchmark leaderboard hosted on Hugging Face for comparing the performance of text embedding models across many tasks.

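**How the MTEB Leaderboard ranks models**

The ranking is based on the average of the scores a model obtains across a range of NLP tasks (classification, clustering, retrieval, sentence similarity, and others). Each model is evaluated on multiple benchmark datasets, the results are averaged, and models are listed in descending order of that average.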

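**Evaluation details**

- **Task types and their main metrics**
  - Classification: accuracy (F1 is also reported)
  - Clustering: V-measure
  - Pair Classification: Average Precision
  - Reranking: MAP (MRR@k is also reported)
  - Retrieval: nDCG@k
  - Semantic Textual Similarity (STS): Spearman correlation
  - Summarization: Spearman correlation  
  Each task type has its own main metric, and a model is evaluated on several task types [2].

- **Average score**
  - The scores a model obtains on each task are summed and divided by the number of tasks.
  - This average is the default ranking criterion on the leaderboard.

- **Partial evaluation**
  - A model does not have to run every task; it can be evaluated on a subset and appear on the corresponding partial leaderboard. For example, a model evaluated only on clustering is listed in the clustering ranking.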
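
The averaging step described above can be sketched as follows. The model names and per-task scores are made up for illustration and are not real leaderboard data. `rank_by_mean` is a hypothetical helper, and the real leaderboard may aggregate in more detail than this plain mean.

```python
from statistics import mean

# Hypothetical per-task scores for two made-up models (not real leaderboard data)
scores = {
    "model-a": {"Classification": 70.0, "Clustering": 45.0, "Retrieval": 50.0, "STS": 80.0},
    "model-b": {"Classification": 72.0, "Clustering": 40.0, "Retrieval": 55.0, "STS": 82.0},
}

def rank_by_mean(scores):
    """Return (model, mean score) pairs sorted from highest to lowest mean."""
    averages = {
        model: mean(list(task_scores.values()))
        for model, task_scores in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

for model, avg in rank_by_mean(scores):
    print(f"{model}: {avg:.2f}")
```

Here `model-b` ranks first overall even though `model-a` wins on clustering, which is why the per-task rankings are worth checking when one task matters most.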

In [1]:
from google.colab import userdata
from openai import OpenAI
import os

# Option 1 (recommended): pass the API key explicitly
OPENAI_API_KEY = userdata.get('MY_OPENAI_API_KEY')
client = OpenAI(api_key=OPENAI_API_KEY)

# Option 2: set the OPENAI_API_KEY environment variable and let the client read it
# os.environ["OPENAI_API_KEY"] = userdata.get("MY_OPENAI_API_KEY")
# client = OpenAI()

In [2]:
text = "안녕하세요~"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[text],
)
# print(response)
print(len(response.data[0].embedding))

1536


In [13]:
def texts_to_embedding(texts, model="text-embedding-3-small"):
    # Replace newlines with spaces; stray newlines can reportedly hurt embedding quality.
    texts = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(
        model=model,
        input=texts,
    )
    return [data.embedding for data in response.data]

# texts_to_embedding(["Hello, world", "bye bye, world"])
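
`texts_to_embedding` sends every text in a single request, which is fine for a small list. For much longer lists the API's per-request limits may apply (the exact limits are not reproduced here), so one option is to split the input into batches. The sketch below is an assumption-laden illustration: `batched`, `embed_in_batches`, and the default `batch_size=100` are hypothetical, and `fake_embed` stands in for `texts_to_embedding` so the example runs without an API key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on each batch and concatenate the results in input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in for texts_to_embedding: returns a one-element "vector" holding the text length
def fake_embed(texts):
    return [[float(len(t))] for t in texts]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

In the notebook, passing `texts_to_embedding` as `embed_fn` would reuse the function defined above unchanged.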

## Food review similarity search

Dataset: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

In [7]:
!gdown 1tSQZQFYD64_mrL9CjDcn6KruZp7_smuD

Downloading...
From: https://drive.google.com/uc?id=1tSQZQFYD64_mrL9CjDcn6KruZp7_smuD
To: /content/fine_food_reviews_1k.csv
100% 439k/439k [00:00<00:00, 114MB/s]


In [10]:
# Load the data
import pandas as pd

df = pd.read_csv("fine_food_reviews_1k.csv")
df = df.drop("Unnamed: 0", axis=1)
display(df.head(3))
df.info()

|   | Time | ProductId | UserId | Score | Summary | Text | n_tokens |
|---|------|-----------|--------|-------|---------|------|----------|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | 33 |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | 26 |
| 2 | 1351123200 | B000JMBE7M | AQX1N6A51QOKG | 4 | It isn't blanc mange, but isn't bad . . . | I'm not sure that custard is really custard wi... | 242 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       1000 non-null   int64 
 1   ProductId  1000 non-null   object
 2   UserId     1000 non-null   object
 3   Score      1000 non-null   int64 
 4   Summary    1000 non-null   object
 5   Text       1000 non-null   object
 6   n_tokens   1000 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 54.8+ KB


In [11]:
# Build the Combined column from Summary + Text
df["Combined"] = df["Summary"].str.strip() + "; " + df["Text"].str.strip()
display(df.head(3))

|   | Time | ProductId | UserId | Score | Summary | Text | n_tokens | Combined |
|---|------|-----------|--------|-------|---------|------|----------|----------|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | 33 | where does one start...and stop... with a tre... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | 26 | Arrived in pieces; Not pleased at all. When I ... |
| 2 | 1351123200 | B000JMBE7M | AQX1N6A51QOKG | 4 | It isn't blanc mange, but isn't bad . . . | I'm not sure that custard is really custard wi... | 242 | It isn't blanc mange, but isn't bad . . .; I'm... |


In [14]:
# Compute an embedding for every combined review text
df["Embedding"] = texts_to_embedding(df["Combined"].tolist())
display(df.head(3))

|   | Time | ProductId | UserId | Score | Summary | Text | n_tokens | Combined | Embedding |
|---|------|-----------|--------|-------|---------|------|----------|----------|-----------|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | 33 | where does one start...and stop... with a tre... | [0.030276387929916382, -0.020651785656809807, ... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | 26 | Arrived in pieces; Not pleased at all. When I ... | [0.01129516027867794, 0.03488067910075188, -0.... |
| 2 | 1351123200 | B000JMBE7M | AQX1N6A51QOKG | 4 | It isn't blanc mange, but isn't bad . . . | I'm not sure that custard is really custard wi... | 242 | It isn't blanc mange, but isn't bad . . .; I'm... | [0.0022991024889051914, 0.004840383306145668, ... |
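
Computing embeddings costs an API call, so it can be worth saving them for later sessions. One caveat, sketched below under stated assumptions: when a column of Python lists is written to CSV, each list comes back as a string and has to be parsed again. The toy frame, its made-up vectors, and the in-memory buffer (standing in for a file on disk) are all illustrative.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for df; the vectors are made-up, not real embeddings
toy = pd.DataFrame({
    "Combined": ["good coffee; loved it", "stale chips; never again"],
    "Embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded["Embedding"][0]))  # <class 'str'>: the list came back as text

# Parse the string representation back into a Python list
loaded["Embedding"] = loaded["Embedding"].apply(ast.literal_eval)
print(loaded["Embedding"][0])  # [0.1, 0.2, 0.3]
```

A binary format that preserves list columns would avoid the parsing step; CSV is used here only because the notebook already works with a CSV file.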


In [18]:
# Build embed_df: one row per review, indexed by the combined text
# Alternative: expand each vector into one column per dimension
# embed_df = pd.DataFrame(df["Embedding"].tolist(), index=df["Combined"])
# display(embed_df.head(2))
embed_df = df[["Embedding"]].copy()  # copy so later changes do not affect df
embed_df.index = df["Combined"]
display(embed_df.head(2))

| Combined | Embedding |
|----------|-----------|
| where does one start...and stop... with a treat like this; Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone | [0.030276387929916382, -0.020651785656809807, ... |
| Arrived in pieces; Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money. | [0.01129516027867794, 0.03488067910075188, -0.... |


In [23]:
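import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Similarity search
def review_search(query, embed_df, top_n=5):
    # Embed the query; texts_to_embedding returns a list holding one vector
    query_embed = texts_to_embedding([query])

    # Stack the stored vectors into an (n_reviews, dim) matrix and score them in one call
    doc_matrix = np.array(embed_df["Embedding"].tolist())
    scores = cosine_similarity(query_embed, doc_matrix)[0]

    # assign() returns a new DataFrame, so the caller's embed_df is left unchanged
    result = embed_df.assign(cos_sim=scores)
    return result.sort_values("cos_sim", ascending=False).head(top_n)

result = review_search("Best coffee", embed_df)
display(result.head(2))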

| Combined | Embedding | cos_sim |
|----------|-----------|---------|
| super coffee; Great coffee and so easy to brew. This coffee has great aroma and is good to the last drop. I actually like all the brands. This is the way coffee should taste!! | [-0.004845583811402321, -0.039783548563718796,... | 0.616236 |
| super coffee; Great coffee and so easy to brew. This coffee has great aroma and is good to the last drop. I actually like all the brands. This is the way coffee should taste!! | [-0.004845583811402321, -0.039783548563718796,... | 0.616236 |
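
The two rows returned above are identical, which suggests the dataset contains repeated reviews. If only distinct reviews are wanted, one possible approach is to drop repeated index entries before taking the top results. The sketch below uses a toy result frame with made-up texts and scores; `top_unique` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy search result with a repeated review, mirroring the duplicate rows above
# (texts and scores are made-up)
result_toy = pd.DataFrame(
    {"cos_sim": [0.62, 0.62, 0.55, 0.40]},
    index=pd.Index(
        [
            "super coffee; Great coffee",
            "super coffee; Great coffee",
            "ok tea; Fine",
            "dog food; My dog likes it",
        ],
        name="Combined",
    ),
)

def top_unique(result, top_n=5):
    """Keep the first occurrence of each review text, then take the top_n by score."""
    unique = result[~result.index.duplicated(keep="first")]
    return unique.sort_values("cos_sim", ascending=False).head(top_n)

print(top_unique(result_toy, top_n=2))
```

Because `review_search` truncates to `top_n` before returning, deduplicating its output can yield fewer than `top_n` rows; requesting a larger `top_n` first and trimming afterwards is one way around that.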
