## 4. get_embedding
Embedding model selection

<img style="float: right;" src="../img/logo.png" width="120"><br>

<div style="text-align: right"> <b>Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial issue : 2025.11.16 </div>
<div style="text-align: right"> Last update : 2025.11.16 </div>

Revision history  
- `2025.11.16` : Initial notebook creation

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [7]:
import os
import pandas as pd
import numpy as np
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import GoogleGenerativeAI
from langchain_pinecone import PineconeEmbeddings
from rag_pkg.utils.path import RAW_DATA_PATH, INTERMEDIATE_DATA_PATH, PROCESSED_DATA_PATH
from rag_pkg.module.preprocess import preprocess_for_rag
from rag_pkg.module.models import get_embedding
from rag_pkg.module.vector_db import load_documents, get_vector_store

### 1. Data load

In [3]:
review_data_path = RAW_DATA_PATH / "whisky_reviews.csv"

reviews = pd.read_csv(review_data_path)

### 2. Data preprocessing

In [4]:
review_processed = preprocess_for_rag(reviews, min_comments=2)
print(review_processed['document_text'].iloc[0])

위스키 이름: Springbank10-year-old
태그: Green-House

향(Nose) [점수: 94.0]: The nose is full of aromatic power. We still have a little touch of solvent.\nLeather, orange peel, star anise.\nPlum, prunes, dates, figs, clementine, passion fruit peel.\nOld dry wood, dust, old book.\nWe have aromas of old rum, Demerara, and almost a little cane sugar.

맛(Taste) [점수: 96.0]: On the palate it is surprisingly fresh and "almost" light.\nFresh apricot, pineapple, passion fruit, guava, papaya.\nIt is very tropical.\nBut the dominant remains woody, with pretty spices.\nCloves, anise, pepper, bitter chocolate, cinnamon, nutmeg. They are all there.\nWe have fresh mint, some dry aromatic herbs.\nA little barbecue charcoal, smoke.

피니쉬(Finish) [점수: 95.0]: Long finish on liquorice, camphor, smoke, ash and barbecue, light peat.\nPepper, cloves, fresh mint.\nIt's long, comforting, it feels like you're at the edge of the fireplace.
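
The printed document text contains literal `\n` sequences (a backslash followed by `n`) rather than real line breaks, which suggests the raw CSV stores them as two characters. Whether they should be cleaned before embedding is a judgment call. The following is a minimal sketch, assuming they are unwanted; `normalize_escaped_newlines` is a hypothetical helper and is not part of `rag_pkg`.

```python
def normalize_escaped_newlines(text: str, replacement: str = " ") -> str:
    """Replace literal backslash-n sequences (two characters) with `replacement`.

    Real newline characters are left untouched.
    """
    return text.replace("\\n", replacement)


# Sample taken from the shape of the review text above (shortened).
sample = "Leather, orange peel, star anise.\\nPlum, prunes, dates."
print(normalize_escaped_newlines(sample))
# prints: Leather, orange peel, star anise. Plum, prunes, dates.
```

If this cleanup is wanted, it could be applied to the review columns before `document_text` is built, so the embedding model never sees the escape sequences.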


### 3. Document load

In [5]:
documents = load_documents(
    df=review_processed,
    document_text_col="document_text"
)

In [6]:
print(len(documents))
documents[:5]

716


[Document(metadata={'whisky_name': 'Springbank10-year-old', 'link': 'https://www.whiskybase.com/whiskies/whisky/41678/springbank-10-year-old', 'tags': 'Green-House', 'nose_score': 94.0, 'taste_score': 96.0, 'finish_score': 95.0}, page_content='위스키 이름: Springbank10-year-old\n태그: Green-House\n\n향(Nose) [점수: 94.0]: The nose is full of aromatic power. We still have a little touch of solvent.\\nLeather, orange peel, star anise.\\nPlum, prunes, dates, figs, clementine, passion fruit peel.\\nOld dry wood, dust, old book.\\nWe have aromas of old rum, Demerara, and almost a little cane sugar.\n\n맛(Taste) [점수: 96.0]: On the palate it is surprisingly fresh and "almost" light.\\nFresh apricot, pineapple, passion fruit, guava, papaya.\\nIt is very tropical.\\nBut the dominant remains woody, with pretty spices.\\nCloves, anise, pepper, bitter chocolate, cinnamon, nutmeg. They are all there.\\nWe have fresh mint, some dry aromatic herbs.\\nA little barbecue charcoal, smoke.\n\n피니쉬(Finish) [점수: 95.0]:
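
The output shows that each row becomes one document, with `document_text` as `page_content` and the remaining fields as `metadata`. The implementation of `load_documents` is not shown in this notebook, so the following is only a sketch of that mapping under stated assumptions: `SimpleDocument` is a hypothetical stand-in for the LangChain `Document` class, and `rows_to_documents` is a hypothetical helper operating on plain dictionaries instead of a DataFrame.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    rows: list[dict[str, Any]], document_text_col: str
) -> list[SimpleDocument]:
    """Turn each row into a document: one column is the text, the rest is metadata."""
    documents = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != document_text_col}
        documents.append(
            SimpleDocument(page_content=row[document_text_col], metadata=metadata)
        )
    return documents


# One toy row shaped like the first record above (text shortened).
rows = [
    {
        "whisky_name": "Springbank10-year-old",
        "tags": "Green-House",
        "nose_score": 94.0,
        "document_text": "위스키 이름: Springbank10-year-old",
    }
]
docs = rows_to_documents(rows, document_text_col="document_text")
print(docs[0].metadata["whisky_name"])
# prints: Springbank10-year-old
```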

### 4. get_embedding
Embedding model selection

In [8]:
gemini_embedding = get_embedding(model="gemini")

In [9]:
# Embedding test
vector_test = gemini_embedding.embed_query("hell0, world!")
print(vector_test[:5])
print(len(vector_test))

[-0.023279504850506783, -0.004911753814667463, 9.782671440916602e-06, -0.061770886182785034, -0.0026196828112006187]
3072


In [10]:
pinecone_embedding = get_embedding(model="pinecone")

In [11]:
# Embedding test
vector_test = pinecone_embedding.embed_query("hell0, world!")
print(vector_test[:5])
print(len(vector_test))

[0.00119781494140625, -0.00315093994140625, 0.00493621826171875, -0.04364013671875, 0.04266357421875]
1024
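
The two models return vectors of different lengths (3072 for the Gemini model, 1024 for the Pinecone model), so vectors from one model cannot be compared with vectors from the other, and a vector index has to be created with the dimension of whichever model is chosen. The sketch below illustrates this with a plain-Python cosine similarity that rejects mismatched dimensions. `cosine_similarity` is a hypothetical helper written for this example, and the toy vectors stand in for real `embed_query` output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of the same dimensionality."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for two embeddings from the same model.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 0.0]
print(round(cosine_similarity(vec_a, vec_b), 4))
# prints: 0.7071
```

Passing a 3072-dimensional vector together with a 1024-dimensional one would raise `ValueError` here, which is the failure this check is meant to surface early.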
