# Embeddings
https://platform.openai.com/docs/models#embeddings

Embeddings are numerical representations of text, used to measure how related two pieces of text are.

They are useful for tasks such as search, clustering, recommendation systems, anomaly detection, and classification.
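
To make "relatedness" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three vectors are hypothetical, hand-written 3-dimensional values chosen for illustration; real embeddings come from the API and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (not produced by any model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: related
print(cosine_similarity(cat, car))     # close to 0: unrelated
```

A score near 1 means the vectors point in almost the same direction, which is read as the texts being closely related.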

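**Models and output dimensions**

| Model name                    | Description                                                                      | Output dimensions |
|-------------------------------|----------------------------------------------------------------------------------|-------------------|
| **text-embedding-3-large**   | Most capable embedding model for both English and non-English tasks              | 3,072             |
| **text-embedding-3-small**   | Improved performance over the second-generation ada embedding model              | 1,536             |
| **text-embedding-ada-002**   | Most capable second-generation embedding model, replacing 16 first-generation models | 1,536         |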
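
The output dimension directly affects how much space stored vectors take. The sketch below estimates the raw size of a vector collection from the dimensions in the table. It assumes vectors are stored as 4-byte float32 values and ignores any index overhead; `raw_size_mib` is a hypothetical helper name, not part of any library.

```python
# Output dimensions taken from the table above.
DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def raw_size_mib(model, n_vectors):
    """Approximate raw size in MiB of n_vectors embeddings, ignoring index overhead."""
    return DIMENSIONS[model] * BYTES_PER_FLOAT32 * n_vectors / (1024 ** 2)

for model in DIMENSIONS:
    print(f"{model}: {raw_size_mib(model, 1_000):.2f} MiB per 1,000 vectors")
```

Under these assumptions, the large model needs twice the storage of the small one for the same number of texts.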

## MTEB Leaderboard
**Massive Text Embedding Benchmark (MTEB) Leaderboard**

https://huggingface.co/spaces/mteb/leaderboard

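The **MTEB Leaderboard** is a benchmark leaderboard page hosted on Hugging Face where the performance of language models and embedding models can be compared and evaluated objectively.

**How the MTEB Leaderboard ranks models**

The **MTEB Leaderboard** ranking is based on the average of the scores a model obtains across a range of natural language processing tasks (classification, clustering, retrieval, sentence similarity, and so on). In other words, a model's performance is measured on several benchmark datasets, the results are combined into an average score, and models are sorted from the highest average to the lowest.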
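
The ranking rule described above can be sketched in a few lines. The model names and scores below are made up for illustration and are not real leaderboard data; the sketch also assumes a plain arithmetic mean, so the live leaderboard may aggregate differently.

```python
# Hypothetical per-task scores for two made-up models (not real leaderboard data).
scores = {
    "model-a": {"classification": 0.75, "clustering": 0.45, "retrieval": 0.50, "sts": 0.80},
    "model-b": {"classification": 0.70, "clustering": 0.50, "retrieval": 0.55, "sts": 0.85},
}

def rank_by_mean(task_scores):
    """Average each model's task scores, then sort models by that average, best first."""
    averages = {
        model: sum(per_task.values()) / len(per_task)
        for model, per_task in task_scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_mean(scores)
for position, (model, avg) in enumerate(ranking, start=1):
    print(position, model, round(avg, 3))
```

Here `model-b` ranks first even though `model-a` wins on classification, because only the average across tasks decides the order.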

**Main evaluation methods**

- **Task types and their metrics**
  - Classification: Accuracy (F1 is also reported)
  - Clustering: V-measure
  - Pair Classification: Average Precision
  - Reranking: MRR@k, MAP
  - Retrieval: nDCG@k
  - Semantic Textual Similarity (STS): Spearman correlation
  - Summarization: Spearman correlation  
  Each task type has its own representative metric, and a model is evaluated across several tasks [2].

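- **Computing the average score**
  - The scores a model obtains on each task are summed and divided by the number of tasks to get the average score.
  - This average score is the default ranking criterion on the leaderboard.

- **Partial evaluation is possible**
  - A model does not have to run every task; it can be evaluated on specific tasks only and appear on the corresponding partial leaderboard. For example, a model evaluated only on clustering tasks can be listed in the clustering ranking.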
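
As an illustration of one metric from the list above, here is a minimal sketch of nDCG@k, the retrieval metric. It assumes linear gain and computes the ideal ordering only from the documents that were retrieved; the relevance labels are hypothetical, and the benchmark's own implementation may differ in these details.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels, in the order a model ranked the documents.
ranked_relevances = [3, 2, 0, 1]

print(ndcg_at_k(ranked_relevances, k=4))  # below 1: an irrelevant document is ranked too high
print(ndcg_at_k([3, 2, 1, 0], k=4))       # 1.0: already in the ideal order
```

The logarithmic discount means mistakes near the top of the ranking cost more than mistakes further down.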

In [2]:
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [3]:
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY)

text = '안녕하세요~'

response = client.embeddings.create(
    model='text-embedding-3-small',
    input=[text]
)
print(response)

print(response.data[0].embedding)

CreateEmbeddingResponse(data=[Embedding(embedding=[0.006884642411023378, -0.08130010217428207, -0.006427524611353874, 0.021306157112121582, 0.03779585286974907, -0.03752826899290085, -0.0452212318778038, 0.035788990557193756, -0.0462469607591629, -0.029411640018224716, -0.05409600958228111, 0.003882715478539467, 0.012230693362653255, 0.00863507017493248, -0.03732758387923241, 0.02918865531682968, -0.044217802584171295, 0.01925470121204853, -0.027672361582517624, 0.019444238394498825, 0.061454493552446365, 0.01404801569879055, -0.010703249834477901, -0.019957100972533226, 0.038598597049713135, 0.05579069256782532, 0.05681641772389412, -0.027605466544628143, 0.0445522777736187, -0.05128640681505203, 0.027114899829030037, -0.038665492087602615, -0.014427089132368565, -0.053025685250759125, -0.027360182255506516, 0.004186531528830528, 0.01599912904202938, -0.022354183718562126, -0.02515263669192791, -0.037706658244132996, -0.014505133964121342, -0.00541294552385807, -0.004041591659188271, 

In [4]:
def texts_to_embedding(texts, model='text-embedding-3-small'):
    # Replace newlines with spaces; stripping such characters can improve embedding quality.
    texts = [text.replace('\n', ' ') for text in texts]
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [data.embedding for data in response.data]

texts_to_embedding(['hello world', 'byebye world'])

[[-0.00676333112642169,
  -0.03919631987810135,
  0.034175805747509,
  0.02876211516559124,
  -0.02478501945734024,
  -0.04203926399350166,
  -0.030289441347122192,
  0.04932808503508568,
  -0.013897152617573738,
  -0.01764741726219654,
  0.015363989397883415,
  -0.027038201689720154,
  -0.020974265411496162,
  -0.027854792773723602,
  0.008619560860097408,
  0.035627517849206924,
  -0.05368323251605034,
  -0.0022720859851688147,
  0.008808586746454239,
  0.04799734428524971,
  0.03710947930812836,
  -0.009247126057744026,
  -0.008778342977166176,
  0.011402016505599022,
  0.014078617095947266,
  -0.0021624513901770115,
  -0.03756314143538475,
  0.04542659968137741,
  0.011250795796513557,
  -0.03964998200535774,
  0.02345428057014942,
  -0.050628580152988434,
  0.012044702656567097,
  -1.5505995179410093e-05,
  0.0160293597728014,
  0.006135766394436359,
  0.03196798637509346,
  0.00336087285540998,
  -0.008604438975453377,
  -0.01055518165230751,
  -0.037381675094366074,
  -0.0345084
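
`texts_to_embedding` sends every text in a single request. For larger datasets it may be safer to split the input into batches, since requests are subject to size limits (check the API reference for current values). The sketch below shows one way to do that. `embed_in_batches`, `batched`, and the default `batch_size=100` are illustrative assumptions, and `fake_embed` is a stand-in so the example runs without an API key; in the notebook you would pass `texts_to_embedding` instead.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on each batch and concatenate the results in input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in for texts_to_embedding: maps each text to a fake
# 1-dimensional "embedding" that holds the text's length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(result)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

Because results are appended batch by batch, the output order matches the input order, which matters when assigning embeddings back to DataFrame rows.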

## Food review similarity search

In [5]:
!gdown 1064bqxqSAimNboTPSi6fxpdqKbny_9tL

Downloading...
From: https://drive.google.com/uc?id=1064bqxqSAimNboTPSi6fxpdqKbny_9tL
To: /content/fine_food_reviews_1k.csv
100% 34.6M/34.6M [00:00<00:00, 101MB/s]


In [11]:
# Load the data
import pandas as pd

df = pd.read_csv('fine_food_reviews_1k.csv')
df.head()

df = df.drop(['embedding','Unnamed: 0'] , axis=1)
df.head()

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,n_tokens
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,33
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",26
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,242
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,216
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,76


In [17]:
!gdown 1tSQZQFYD64_mrL9CjDcn6KruZp7_smuD

Downloading...
From: https://drive.google.com/uc?id=1tSQZQFYD64_mrL9CjDcn6KruZp7_smuD
To: /content/fine_food_reviews_1k.csv
100% 439k/439k [00:00<00:00, 7.26MB/s]


In [19]:
# Load the data
import pandas as pd

df = pd.read_csv('fine_food_reviews_1k.csv')
df.head()

df = df.drop(['Unnamed: 0'] , axis=1)
df.head()

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,n_tokens
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,33
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",26
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,242
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,216
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,76


In [20]:
summary = [summary.replace('\n', ' ') for summary in df['Summary']]

summary

['where does one  start...and stop... with a treat like this',
 'Arrived in pieces',
 "It isn't blanc mange, but isn't bad . . .",
 "These also have SALT and it's not sea salt.",
 'Happy with the product',
 'Good Sauce',
 'Blackcat',
 'Excellent product',
 'Bulk k-Cups',
 "It's Okay",
 'FABULOUS...',
 'Exactly what I was looking for: Fast, fantastic Chai!',
 'Broken in a million pieces',
 'Deceptive description',
 'Makes me drool just thinking of them',
 "it's alright",
 'these are mini candy bars.',
 'My dogs love them!',
 'Loved these gluten free healthy bars, saved $$ ordering on Amazon',
 'Should advertise coconut as an ingredient more prominently',
 "Great bold taste-- compare to Emeril's Bold",
 'Great flavor no bite',
 "Great bold taste-- compare to Emeril's Bold",
 'Great flavor no bite',
 'Food Caused Illness',
 'Yum',
 'Saifun was too thin!',
 'Should advertise coconut as an ingredient more prominently',
 'Loved these gluten free healthy bars, saved $$ ordering on Amazon',
 '

In [22]:
summary[0]

'where does one  start...and stop... with a treat like this'

In [27]:
# Summary + Text -> Combined
df['combined'] = df['Summary'].str.strip() + '; ' + df['Text'].str.strip()
df.head()

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,n_tokens,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,33,where does one start...and stop... with a tre...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",26,Arrived in pieces; Not pleased at all. When I ...
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,242,"It isn't blanc mange, but isn't bad . . .; I'm..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,216,These also have SALT and it's not sea salt.; I...
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,76,Happy with the product; My dog was suffering w...


In [28]:
# Convert the combined text to embeddings
df['embedding'] = texts_to_embedding(df['combined'].tolist())
df.head()

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,n_tokens,combined,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,33,where does one start...and stop... with a tre...,"[0.030276387929916382, -0.020651785656809807, ..."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",26,Arrived in pieces; Not pleased at all. When I ...,"[0.01129516027867794, 0.03488067910075188, -0...."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,242,"It isn't blanc mange, but isn't bad . . .; I'm...","[0.0022991024889051914, 0.004840383306145668, ..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,216,These also have SALT and it's not sea salt.; I...,"[-0.015768831595778465, 0.013766037300229073, ..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,76,Happy with the product; My dog was suffering w...,"[7.225894660223275e-05, -0.06748031824827194, ..."


In [41]:
embedding_df = pd.DataFrame(df['embedding'].tolist(), index=df['combined'])
embedding_df.head()

embed_df = df['embedding'].to_frame()
embed_df.index = df['combined']
embed_df.head()

Unnamed: 0_level_0,embedding
combined,Unnamed: 1_level_1
where does one start...and stop... with a treat like this; Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone,"[0.030276387929916382, -0.020651785656809807, ..."
"Arrived in pieces; Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money.","[0.01129516027867794, 0.03488067910075188, -0...."
"It isn't blanc mange, but isn't bad . . .; I'm not sure that custard is really custard without eggs. But this comes close. I got it for use in a ""Vegan pancake"" recipe. We were having houseguests who were Vegan and I wanted to make some special breakfasts while they were here. One of the cooking/recipe sites had a recipe using this and there were lots of great reviews. I tried the recipe and it turned out like wallpaper paste -- yuck!<br />However, the so-called custard isn't so bad. I think it's probably just cornstarch and annatto (yellow coloring with a slight flavor). It's fun playing with it. You could dress it up with fruit. Seems to come out on the thin side when you make it as directed, so I use less milk because I like my custards to set firm. As a custard sauce it's fine. I would say it tastes something between a pudding and a custard.<br /><br />If you want a really good egg-free ""custard"" get an original recipe for ""blanc mange."" It takes a lot longer to make, but it's certainly worth the difference.","[0.0022991024889051914, 0.004840383306145668, ..."
"These also have SALT and it's not sea salt.; I like the fact that you can see what you're getting and that there are no bones or dark meat. There are 7 nice big chunks in every jar.<br /><br />These taste like tuna in a can but, because they're preserved in glass, you don't have to worry about either aluminum or BPA; BUT ... they are not just tuna and spring water.<br /><br />There is salt in there, too, and it's not healthy sea salt, it's toxic table salt.<br /><br />I am trying to contact Tonnino to confirm that. I might be wrong because the label states that the ingredients are ""tuna fish"" but the sticker on the top clarifies that it is the smaller (healthier) yellowfin, so the ""salt"" listed in the ingredients might be sea salt but, if it was, why don't they say so?<br /><br />Without confirmation, I will continue to look for a salt-free olive-oil free tuna preserved in glass.<br /><br />If you know of one, please contact me!","[-0.015768831595778465, 0.013766037300229073, ..."
Happy with the product; My dog was suffering with itchy skin. He had been eating Natural Choice brand (cheaper) since he was a puppy. I was nervous to change foods. The vet suggested to change foods sand see if the skin issues cleared up. Wellness brand did the job. My dog seems to love the food and the skin issues cleared up within a few weeks.,"[7.225894660223275e-05, -0.06748031824827194, ..."


In [53]:
# Similarity search
from sklearn.metrics.pairwise import cosine_similarity

def review_search(query, emb_df, top_n=5):
    # Embed the query with the same model used for the reviews
    query_emb = texts_to_embedding([query])

    # cosine_similarity(X, Y) expects 2-D inputs, so each vector is wrapped in a batch dimension.
    # Note: this adds (or overwrites) a 'cos_sim' column on the DataFrame that was passed in.
    emb_df['cos_sim'] = emb_df['embedding'].apply(lambda emb: cosine_similarity([emb], query_emb)[0, 0])
    top_n_texts = emb_df.sort_values('cos_sim', ascending=False).head(top_n)
    return top_n_texts

review_search('delicious fruit', embed_df)

Unnamed: 0_level_0,embedding,cos_sim
combined,Unnamed: 1_level_1,Unnamed: 2_level_1
"Delicious .; These plums are sweet and juicy, and the aroma is like perfume. And it doesn't hurt that they are good for you, too.","[0.0315956249833107, -0.001973315142095089, -0...",0.615405
"Delicious!; For anyone who says ""I don't like fruitcake"" or anyone who's never had fruitcake and wonders what all the fuss is about, try this. (As long as you're not allergic to tree nuts or any other ingredient.) It's chock-a-block with nuts and moist fruit. I will definitely be buying more.","[-0.016479600220918655, 0.0018902381416410208,...",0.602955
"These are Delicious!; Great taste, right price, fabulous snack! It is the only fruit I can get my little one to eat and I can't keep my high school son out of them either. They are great pick-me-ups on the way to my daughter's soccer practice or before my early morning run. And best of all, Amazon ships these right to my door every month. No more finding the right store who carries them. I just set up the automatic recurring shipment once and it works like a charm!","[0.028737444430589676, -0.01361179817467928, -...",0.583637
"Delicious!; Wonderful! Deep, rich, pure black raspberry syrup! Absolutely delicious on waffles, cheesecake, ice cream, yogurt, drinks, etc. Thrilled to see that there are at least some berry syrup makers who do not feel the need to ""sour"" the flavor of perfect berry products with citric acid!","[0.008850290440022945, -0.037961218506097794, ...",0.536565
"What a treat!; Ordered these as part of a presentation on Malaysia as this fruit had left a very positive impression on me during my visit. We gave these to 4th graders who were equally impressed with the exterior soft spikes and the juicy center. The fruits arrived in good condition, just slightly bruised. The kept well in the fridge in a single-double layer covered with a damp paper towel. I had also rinsed in a vinegar-water (1:10 parts) solution to be sure to inhibit any mold (like I do with berries). They kept from Saturday when they arrived to Thursday with minimal browning. I ordered 4 pounds and had plenty of fruits for the 40 kids. I think there were about 55 - 60 fruits all together. It was a fantastic treat and the kids were requesting we bring them again for the Halloween party! I'm sure they'll remember the presentation for a long time.","[-0.025338858366012573, -0.0016857946757227182...",0.48538

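The same search idea can be tried without an API key on toy data. The sketch below ranks a tiny corpus by cosine similarity and keeps the best matches with `heapq.nlargest`. The 3-dimensional vectors and the `top_n` helper are hypothetical stand-ins for real review embeddings and for `review_search`.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(query_vec, corpus, n=2):
    """Return the n (text, score) pairs with the highest cosine similarity, best first."""
    scored = ((text, cosine(vec, query_vec)) for text, vec in corpus.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Hypothetical 3-d vectors standing in for real review embeddings.
corpus = {
    "sweet juicy plums": [0.9, 0.1, 0.1],
    "box arrived broken": [0.1, 0.9, 0.1],
    "tasty berry syrup": [0.8, 0.1, 0.3],
}
query = [1.0, 0.0, 0.1]  # pretend embedding of "delicious fruit"

for text, score in top_n(query, corpus):
    print(f"{score:.3f}  {text}")
```

Using `heapq.nlargest` avoids sorting the whole corpus when only a few top results are needed.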

In [26]:
def texts_to_embedding(texts, model='text-embedding-3-small'):
    # Replace newlines with spaces; stripping such characters can improve embedding quality.
    texts = [text.replace('\n', ' ') for text in texts]
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [data.embedding for data in response.data]

len(texts_to_embedding(summary[:1]))
texts_to_embedding(summary[:1])

[[0.028258666396141052,
  -0.006451049819588661,
  0.006753820925951004,
  -0.0015633098082616925,
  -0.04078936576843262,
  -0.00971290748566389,
  0.024722294881939888,
  0.057227835059165955,
  -0.004234762862324715,
  0.0032981899566948414,
  0.014654136262834072,
  -0.018327763304114342,
  0.010681775398552418,
  -0.042694807052612305,
  0.011618348769843578,
  0.0008462461410090327,
  -0.015316196717321873,
  0.04069247841835022,
  0.05096248537302017,
  0.0636223703622818,
  0.03197266161441803,
  0.02966352552175522,
  -0.026078712195158005,
  0.03898081183433533,
  -0.0010657553793862462,
  -0.017697999253869057,
  -0.013604529201984406,
  0.03198881074786186,
  0.021864132955670357,
  -0.01599440537393093,
  0.002468596212565899,
  -0.04227496311068535,
  -0.036332570016384125,
  -0.042081188410520554,
  0.03697848320007324,
  0.01625276915729046,
  0.013701415620744228,
  0.002785497112199664,
  0.0179886594414711,
  0.00551043963059783,
  0.021185925230383873,
  -0.01471065