# OpenAI API - Embeddings

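This notebook covers **embeddings**, a technique for converting text into vectors.
Because an embedding encodes the meaning of a text as numbers, it can be used for tasks such as search, recommendation, and classification.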

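**Learning objectives:**
1. **Understand embeddings:** see how text is represented as a vector
2. **Use the OpenAI Embeddings API:** learn to call models such as `text-embedding-3-small`
3. **Implement similarity search:** compute the cosine similarity between embedding vectors to build a search feature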

# Embeddings
https://platform.openai.com/docs/models#embeddings

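An embedding is a numerical representation of text (a vector of floating-point numbers) that is used to measure how related two texts are: the closer two vectors are, the more related the texts.

Embeddings are useful for tasks such as search, clustering, recommendation systems, anomaly detection, and classification.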
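
To make "relatedness between vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_sim` and the three toy vectors are assumptions made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (invented for illustration, not real embeddings)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_sim(cat, kitten))   # close to 1: the vectors point in a similar direction
print(cosine_sim(cat, invoice))  # close to 0: the vectors are nearly orthogonal
```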

**Models and output dimensions**

| Model                         | Description                                                                      | Output dimensions |
|-------------------------------|----------------------------------------------------------------------------------|-------------------|
| **text-embedding-3-large**   | Most capable model for both English and non-English tasks                        | 3,072             |
| **text-embedding-3-small**   | Improved performance over the 2nd-generation ada embedding model                 | 1,536             |
| **text-embedding-ada-002**   | Most capable 2nd-generation embedding model, replacing 16 1st-generation models  | 1,536             |


## MTEB Leaderboard
**Massive Text Embedding Benchmark (MTEB) Leaderboard**

https://huggingface.co/spaces/mteb/leaderboard

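The **MTEB Leaderboard** is a benchmark leaderboard hosted on Hugging Face that compares the performance of text embedding models on a common set of tasks.

**How the MTEB Leaderboard ranks models**

The ranking is based on the average of the scores a model obtains across a range of natural language processing tasks (classification, clustering, retrieval, sentence similarity, and so on). In other words, each model is evaluated on multiple benchmark datasets, the results are combined into an average score, and models are sorted by that average in descending order.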

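**Main evaluation methods**

- **Task types and metrics**
  - Classification: Accuracy (F1 is also reported)
  - Clustering: V-measure
  - Pair Classification: Average Precision
  - Reranking: MRR@k, MAP
  - Retrieval: nDCG@k
  - Semantic Textual Similarity (STS): Spearman correlation
  - Summarization: Spearman correlation  
  Each task type has its own primary metric, and a model is evaluated on several task types[2].

- **Average score**
  - The scores a model obtains on each task are summed and divided by the number of tasks to give an average score.
  - This average score is the default ranking criterion on the leaderboard.

- **Partial evaluation**
  - A model does not have to run every task; it can be evaluated on specific tasks only and appear on the corresponding partial leaderboard. For example, a model evaluated only on clustering can be listed in the clustering ranking.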
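
The averaging and partial-ranking rules above can be sketched as follows. The model names and scores are hypothetical values invented for illustration, `rank_by_mean` is a made-up helper, and the live leaderboard may aggregate scores differently.

```python
from statistics import mean

# Hypothetical per-task scores, invented purely for illustration
scores = {
    "model-a": {"classification": 70.0, "clustering": 45.0, "retrieval": 50.0},
    "model-b": {"classification": 68.0, "clustering": 48.0, "retrieval": 55.0},
    "model-c": {"clustering": 47.0},  # evaluated on clustering only
}

def rank_by_mean(scores, tasks):
    """Rank the models that were evaluated on every task in `tasks` by their mean score."""
    averaged = {
        model: mean(per_task[t] for t in tasks)
        for model, per_task in scores.items()
        if all(t in per_task for t in tasks)
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Overall ranking: only models with scores on all three tasks qualify
overall = rank_by_mean(scores, ["classification", "clustering", "retrieval"])
# Partial ranking: model-c appears here even though it skipped the other tasks
clustering_only = rank_by_mean(scores, ["clustering"])

print(overall)
print(clustering_only)
```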

In [2]:
from openai import OpenAI
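from dotenv import load_dotenv  # loads environment variables from a .env file
import os                       # for reading environment variables

load_dotenv()                   # read the .env file and register its entries as environment variables
api_key = os.getenv("openai_key")  # value of openai_key from .env (None if it is not set)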

client = OpenAI(api_key=api_key)
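
Because `os.getenv` returns `None` when the variable is missing, a misnamed or absent `.env` entry only surfaces later as an authentication error. A minimal sketch of a fail-fast check is shown below; `require_env` is a hypothetical helper name, not part of `python-dotenv` or the OpenAI SDK.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value

# In this notebook the call would be: api_key = require_env("openai_key")
```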

In [3]:
# Request embeddings
response = client.embeddings.create(
    model = 'text-embedding-3-small',
    input = ["Hello world", "안녕하십니까?"]  # list of sentences to embed
)

response

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.0021344777196645737, -0.04909106716513634, 0.021025855094194412, 0.03132360428571701, -0.04531010240316391, -0.026405276730656624, -0.02897202968597412, 0.060341741889715195, -0.025713635608553886, -0.014808779582381248, 0.015446625649929047, -0.030047914013266563, -0.02039569430053234, -0.03338315337896347, 0.025821225717663765, 0.014232412911951542, -0.07002470642328262, 0.012418779544532299, 0.014824149198830128, 0.04884514957666397, 0.020749198272824287, -0.008837621659040451, -0.015123859979212284, -0.016614729538559914, 0.025959553197026253, -0.0028491723351180553, -0.024376465007662773, 0.024284247308969498, 0.001773288007825613, -0.05573080852627754, 0.02311614342033863, -0.04546380043029785, -0.00866855401545763, 0.0031392769888043404, 0.004530241712927818, 0.0017713668057695031, 0.026666563004255295, 0.010182477533817291, -0.012026850134134293, -0.011535017751157284, -0.014900998212397099, -0.023177623748779297, 0.02540624

In [4]:
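emb_vec0 = response.data[0].embedding  # embedding vector of the first sentence
emb_vec1 = response.data[1].embedding  # embedding vector of the second sentence

print(len(emb_vec0))  # dimension (length) of the first embedding vector
print(len(emb_vec1))  # dimension (length) of the second embedding vector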

1536
1536


In [6]:
# Convert a list of texts into a list of vectors via the OpenAI Embeddings API
def text_to_embedding(texts, model='text-embedding-3-small'):
    # Replace newlines with spaces (recommended in older OpenAI guidance, since newlines could hurt embedding quality)
    texts = [text.replace('\n', ' ') for text in texts]

    # API call: request embeddings for the list of input texts
    response = client.embeddings.create(model = model, input = texts)

    # Extract just the embedding vector for each text and return them as a list
    return [data.embedding for data in response.data]

# Test the function: embed two sentences
vecs = text_to_embedding(['Hello world', '안녕하세요'])

# Check the vector dimensions (text-embedding-3-small defaults to 1536)
print(len(vecs[0]))
print(len(vecs[1]))

1536
1536


## Embedding Food Reviews

https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

Once the corpus has been converted into embedding vectors, it can be searched by similarity.

In [8]:
import pandas as pd

# Load the CSV (column 0 is used as the index)
review_df = pd.read_csv('fine_food_reviews_1k.csv', index_col = 0)
review_df.info()  # column dtypes, null counts, row count, and other metadata
review_df.head()  # first 5 rows

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       1000 non-null   int64 
 1   ProductId  1000 non-null   object
 2   UserId     1000 non-null   object
 3   Score      1000 non-null   int64 
 4   Summary    1000 non-null   object
 5   Text       1000 non-null   object
dtypes: int64(2), object(4)
memory usage: 54.7+ KB


Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...


In [9]:
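review_df = review_df[['Summary', 'Text']].copy()  # keep only the two needed columns; copy() avoids SettingWithCopyWarning

# Strip surrounding whitespace from Summary and Text, then join them into one string
review_df['Content'] = 'Title: ' + review_df['Summary'].str.strip() + "; Content: " + review_df['Text'].str.strip()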
review_df.head()

Unnamed: 0,Summary,Text,Content
0,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...
2,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ...."
3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...
4,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...


In [10]:
# Convert Content to a list, generate embeddings, and store them in the embedding column
review_df['embedding'] = text_to_embedding(review_df['Content'].tolist())
review_df.head(3)

Unnamed: 0,Summary,Text,Content,embedding
0,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,"[0.036636363714933395, -0.023187169805169106, ..."
1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,"[0.011416858062148094, 0.034257568418979645, -..."
2,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....","[0.0033871119376271963, 0.012614777311682701, ..."
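
The cell above sends all 1,000 texts in a single request. For larger corpora it may be necessary to split the input, since the API limits how much one request can carry (check the current OpenAI documentation for the exact limits). Below is a minimal sketch; `embed_in_batches`, the default `batch_size` of 500, and the offline stand-in `fake_embed` are all assumptions for illustration.

```python
def embed_in_batches(texts, embed_fn, batch_size=500):
    """Call `embed_fn` on consecutive slices of `texts` and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # one call per batch
    return vectors

# Stand-in for text_to_embedding so the sketch runs offline:
# it maps each text to a 1-dimensional "vector" holding its length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(result)
```

In this notebook the call would look like `embed_in_batches(review_df['Content'].tolist(), text_to_embedding)`.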


In [11]:
embed_df = review_df['embedding'].to_frame('embedding')  # Series (embedding) -> DataFrame
embed_df.index = review_df['Content']                    # use the Content text as the index
embed_df

Content,embedding
Title: where does one start...and stop... with a treat like this; Content: Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone,"[0.036636363714933395, -0.023187169805169106, ..."
"Title: Arrived in pieces; Content: Not pleased at all. When I opened the box, most of the rings were broken in pieces. A total waste of money.","[0.011416858062148094, 0.034257568418979645, -..."
"Title: It isn't blanc mange, but isn't bad . . .; Content: I'm not sure that custard is really custard without eggs. But this comes close. I got it for use in a ""Vegan pancake"" recipe. We were having houseguests who were Vegan and I wanted to make some special breakfasts while they were here. One of the cooking/recipe sites had a recipe using this and there were lots of great reviews. I tried the recipe and it turned out like wallpaper paste -- yuck!<br />However, the so-called custard isn't so bad. I think it's probably just cornstarch and annatto (yellow coloring with a slight flavor). It's fun playing with it. You could dress it up with fruit. Seems to come out on the thin side when you make it as directed, so I use less milk because I like my custards to set firm. As a custard sauce it's fine. I would say it tastes something between a pudding and a custard.<br /><br />If you want a really good egg-free ""custard"" get an original recipe for ""blanc mange."" It takes a lot longer to make, but it's certainly worth the difference.","[0.0033871119376271963, 0.012614777311682701, ..."
"Title: These also have SALT and it's not sea salt.; Content: I like the fact that you can see what you're getting and that there are no bones or dark meat. There are 7 nice big chunks in every jar.<br /><br />These taste like tuna in a can but, because they're preserved in glass, you don't have to worry about either aluminum or BPA; BUT ... they are not just tuna and spring water.<br /><br />There is salt in there, too, and it's not healthy sea salt, it's toxic table salt.<br /><br />I am trying to contact Tonnino to confirm that. I might be wrong because the label states that the ingredients are ""tuna fish"" but the sticker on the top clarifies that it is the smaller (healthier) yellowfin, so the ""salt"" listed in the ingredients might be sea salt but, if it was, why don't they say so?<br /><br />Without confirmation, I will continue to look for a salt-free olive-oil free tuna preserved in glass.<br /><br />If you know of one, please contact me!","[-0.0028755709063261747, 0.01466672495007515, ..."
Title: Happy with the product; Content: My dog was suffering with itchy skin. He had been eating Natural Choice brand (cheaper) since he was a puppy. I was nervous to change foods. The vet suggested to change foods sand see if the skin issues cleared up. Wellness brand did the job. My dog seems to love the food and the skin issues cleared up within a few weeks.,"[0.012078606523573399, -0.056000273674726486, ..."
...,...
Title: Delicious!; Content: I have ordered these raisins multiple times. They are always great and arrive timely. I can't go back to store bought chocolate covered raisins now! Love this product.,"[0.01574314944446087, -0.03915570676326752, -0..."
Title: Good Training Treat; Content: My dog will come in from outside when I am training her and look at the cupboard waiting for her treat. When I use the clicker training method she comes because she knows she has something special.,"[-0.0230502188205719, -0.01385764591395855, 0...."
Title: Jamica Me Crazy Coffee; Content: Wolfgang Puck's Jamaica Me Crazy is that wonderful blend of island flavors in a coffee. Have loved it from the first time tasting. Great product.,"[-0.02968551591038704, -0.045753590762615204, ..."
Title: Party Peanuts; Content: Great product for the price. Mix with the Asian rice crackers for an excellent snack. Big container lasts a long time. Only lightly slighted. Peanuts are whole and large.,"[0.0010573141044005752, -0.02257317304611206, ..."


### Similarity Search
Reviews similar to a user's query can be retrieved with a vector search.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Search for the reviews most similar to the user's query, based on vector similarity
def review_vector_search(query, embed_df=embed_df, top_n=5):
    # 1. Convert the query into an embedding (a list holding one vector, i.e. shape (1, d))
    query_emb = text_to_embedding([query])

    # 2. Copy the DataFrame so the original is left unchanged
    df = embed_df.copy()

    # 3. Compute the cosine similarity between the query vector and every review embedding
    # apply() evaluates each row's 'embedding' value against query_emb
    df['cos_sim'] = df['embedding'].apply(lambda emb: cosine_similarity([emb], query_emb)[0, 0])

    # 4. Sort by similarity in descending order and keep the top N rows
    df = df.sort_values('cos_sim', ascending=False).head(top_n)

    # 5. Turn the index (Content) back into a column for readable output
    df = df.reset_index()
    df = df[['Content', 'cos_sim']]

    return df

# Show the full Content column without truncation
pd.set_option('display.max_colwidth', None)

# Search test: reviews related to 'delicious fruit'
review_vector_search('delicious fruit')

Unnamed: 0,Content,cos_sim
0,"Title: Delicious!; Content: For anyone who says ""I don't like fruitcake"" or anyone who's never had fruitcake and wonders what all the fuss is about, try this. (As long as you're not allergic to tree nuts or any other ingredient.) It's chock-a-block with nuts and moist fruit. I will definitely be buying more.",0.649392
1,"Title: Delicious .; Content: These plums are sweet and juicy, and the aroma is like perfume. And it doesn't hurt that they are good for you, too.",0.61226
2,"Title: Delicious!; Content: Wonderful! Deep, rich, pure black raspberry syrup! Absolutely delicious on waffles, cheesecake, ice cream, yogurt, drinks, etc. Thrilled to see that there are at least some berry syrup makers who do not feel the need to ""sour"" the flavor of perfect berry products with citric acid!",0.599878
3,"Title: These are Delicious!; Content: Great taste, right price, fabulous snack! It is the only fruit I can get my little one to eat and I can't keep my high school son out of them either. They are great pick-me-ups on the way to my daughter's soccer practice or before my early morning run. And best of all, Amazon ships these right to my door every month. No more finding the right store who carries them. I just set up the automatic recurring shipment once and it works like a charm!",0.591856
4,"Title: very delicious; Content: This is my very favorite of the truffles. It is a medium dark and very delicious. If I have one, it screams at me to have just one more! As long as they are made, I will be first in line to purchase.",0.50216


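Shapes in `cosine_similarity([emb], query_emb)`:  
`[emb]` = `[[0.213121531215, ..]]` => shape `(1, d)`  
`query_emb` is a list holding one vector => shape `(1, d)`  
The cosine similarity result is `[[0.955....]]` => shape `(1, 1)`  
`[0, 0]` extracts the scalar value.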
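
Calling `cosine_similarity` once per row works, but the same result can be computed in one matrix operation. The sketch below uses NumPy with toy 2-dimensional vectors; `top_n_by_cosine` is a hypothetical helper, not a function from this notebook or from scikit-learn.

```python
import numpy as np

def top_n_by_cosine(matrix, query, top_n=5):
    """Return (indices, similarities) of the rows of `matrix` most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)   # shape (n, d)
    query = np.asarray(query, dtype=float)     # shape (d,)
    # Normalize the rows and the query so that a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = matrix_unit @ query_unit            # shape (n,)
    order = np.argsort(-sims)[:top_n]          # indices sorted by descending similarity
    return order, sims[order]

# Toy 2-dimensional vectors, made up for illustration
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, sims = top_n_by_cosine(docs, [1.0, 0.1], top_n=2)
print(idx, sims)
```

With the data in this notebook, the matrix could be built with `np.array(embed_df['embedding'].tolist())` and the query vector taken as `text_to_embedding([query])[0]`.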

In [14]:
review_vector_search('best coffee') 

Unnamed: 0,Content,cos_sim
0,"Title: Best coffee ever!; Content: In my opinion this is the best coffee ever! I've been drinking coffee for 50 plus years and this is what I serve to myself and friends. However, I wish I could find this grind in a pound size, so I could make a full pot rather than just a single cup.",0.612556
1,"Title: Great Coffee; Content: I have a coffee maker that grinds my coffee beans. It's hard to find whole bean decafinated coffee. When I find it in the brand that I like, I am excited. Seattle's Best is my favorite.",0.610765
2,"Title: Better than you-know-who's coffee...; Content: So my wife is a latte freak, and nursing, so decaf is the approved type. After the Senseo left the market, I struggled and found the <a href=""http://www.amazon.com/gp/product/B0047BIWSK"">Aerobie AeroPress Coffee and Espresso Maker</a> which is like a French Press for the 21st century. After getting our recipe figured out, my wife, who's been buying Venti Decaf Latte's at $4 a pop almost daily for years now declares that Seattle's best Level 3 Decaf in her home-made Latte is the best coffee she can get. We've tried other bands, and this is her favorite, hands down!",0.605614
3,"Title: BEST cup of coffee I've ever had!; Content: I thought I'd splurge and try this coffee. It costs much more than other decaf K-Cup options. But I hoped that meant it was better coffee. It IS better coffee. I've never had a better cup of coffee than this. It is excellent when compared to any other decaf or regular coffee I've tried.<br /><br />If you like BOLD, FLAVORFUL decaf coffee try this coffee and you'll really like it.",0.603389
4,"Title: BEST cup of coffee I've ever had!; Content: I thought I'd splurge and try this coffee. It costs much more than other decaf K-Cup options. But I hoped that meant it was better coffee. It IS better coffee. I've never had a better cup of coffee than this. It is excellent when compared to any other decaf or regular coffee I've tried.<br /><br />If you like BOLD, FLAVORFUL decaf coffee try this coffee and you'll really like it.",0.603389
