## 5. Vector DB
Document vectorization and loading

<img style="float: right;" src="../img/logo.png" width="120"><br>

<div style="text-align: right"> <b>Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial issue: 2025.11.16 </div>
<div style="text-align: right"> Last update: 2025.11.16 </div>

Revision history  
- `2025.11.16`: Initial notebook creation

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
import os
import pandas as pd
import numpy as np
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import GoogleGenerativeAI
from langchain_pinecone import PineconeEmbeddings
from rag_pkg.utils.path import RAW_DATA_PATH, INTERMEDIATE_DATA_PATH, PROCESSED_DATA_PATH
from rag_pkg.module.preprocess import preprocess_for_rag
from rag_pkg.module.models import get_embedding
from rag_pkg.module.vector_db import load_documents, get_vector_store

### 1. Load data

In [3]:
review_data_path = RAW_DATA_PATH / "whisky_reviews.csv"

reviews = pd.read_csv(review_data_path)

### 2. Preprocess data

In [4]:
review_processed = preprocess_for_rag(reviews, min_comments=2)
print(review_processed['document_text'].iloc[0])

위스키 이름: Springbank10-year-old
태그: Green-House

향(Nose) [점수: 94.0]: The nose is full of aromatic power. We still have a little touch of solvent.\nLeather, orange peel, star anise.\nPlum, prunes, dates, figs, clementine, passion fruit peel.\nOld dry wood, dust, old book.\nWe have aromas of old rum, Demerara, and almost a little cane sugar.

맛(Taste) [점수: 96.0]: On the palate it is surprisingly fresh and "almost" light.\nFresh apricot, pineapple, passion fruit, guava, papaya.\nIt is very tropical.\nBut the dominant remains woody, with pretty spices.\nCloves, anise, pepper, bitter chocolate, cinnamon, nutmeg. They are all there.\nWe have fresh mint, some dry aromatic herbs.\nA little barbecue charcoal, smoke.

피니쉬(Finish) [점수: 95.0]: Long finish on liquorice, camphor, smoke, ash and barbecue, light peat.\nPepper, cloves, fresh mint.\nIt's long, comforting, it feels like you're at the edge of the fireplace.
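The printed document above contains the two-character sequence `\n` inside the review text rather than real line breaks, which suggests the raw reviews store escaped newlines. If that is unintended, a normalization step before embedding might look like the sketch below. `unescape_newlines` is a hypothetical helper, not part of `rag_pkg`, and whether this changes retrieval quality is untested here.

```python
def unescape_newlines(text: str) -> str:
    """Replace literal backslash-n sequences with real newline characters."""
    return text.replace("\\n", "\n")


# Illustrative fragment in the same shape as the review text above.
sample = "Leather, orange peel, star anise.\\nPlum, prunes, dates."
print(unescape_newlines(sample))
```

Applied to a DataFrame column, the same function could be passed to `Series.map` before the documents are built.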


### 3. Load documents

In [5]:
documents = load_documents(
    df=review_processed,
    document_text_col="document_text"
)

In [6]:
print(len(documents))
documents[:5]

716


[Document(metadata={'whisky_name': 'Springbank10-year-old', 'link': 'https://www.whiskybase.com/whiskies/whisky/41678/springbank-10-year-old', 'tags': 'Green-House', 'nose_score': 94.0, 'taste_score': 96.0, 'finish_score': 95.0}, page_content='위스키 이름: Springbank10-year-old\n태그: Green-House\n\n향(Nose) [점수: 94.0]: The nose is full of aromatic power. We still have a little touch of solvent.\\nLeather, orange peel, star anise.\\nPlum, prunes, dates, figs, clementine, passion fruit peel.\\nOld dry wood, dust, old book.\\nWe have aromas of old rum, Demerara, and almost a little cane sugar.\n\n맛(Taste) [점수: 96.0]: On the palate it is surprisingly fresh and "almost" light.\\nFresh apricot, pineapple, passion fruit, guava, papaya.\\nIt is very tropical.\\nBut the dominant remains woody, with pretty spices.\\nCloves, anise, pepper, bitter chocolate, cinnamon, nutmeg. They are all there.\\nWe have fresh mint, some dry aromatic herbs.\\nA little barbecue charcoal, smoke.\n\n피니쉬(Finish) [점수: 95.0]:
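The implementation of `load_documents` is not shown in this notebook. The outputs suggest that each row becomes one document whose text comes from `document_text`, with the remaining columns as metadata, and that score fields appear only on some documents, which would fit missing values being dropped. The sketch below illustrates that idea under those assumptions. It uses plain dicts instead of LangChain `Document` objects so it runs without dependencies, and `row_to_document` and `is_missing` are hypothetical names.

```python
import math
from typing import Any


def is_missing(value: Any) -> bool:
    """True for None and float NaN, the usual markers for a missing cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_to_document(row: dict[str, Any], text_col: str = "document_text") -> dict[str, Any]:
    """Split one row into page content and metadata, skipping missing values."""
    metadata = {
        key: value
        for key, value in row.items()
        if key != text_col and not is_missing(value)
    }
    return {"page_content": row[text_col], "metadata": metadata}


# Illustrative row, not taken from the dataset.
row = {
    "whisky_name": "Example Whisky",
    "tags": "Smokey",
    "nose_score": float("nan"),  # missing score, so it is left out of the metadata
    "document_text": "Example document text",
}
doc = row_to_document(row)
print(doc["metadata"])
```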

### 4. Create the vector store

In [7]:
embedding = get_embedding(model = "gemini")

In [8]:
vector_store = get_vector_store(
    documents = documents,
    embedding=embedding,
    type = "faiss",
    dimension=3072
)
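The `dimension=3072` argument has to match the length of the vectors the embedding model actually returns, otherwise the index and the embeddings disagree. A quick check before building a store could look like the sketch below. It assumes only that the embedding object exposes LangChain's `embed_query` method; `check_dimension` and `FakeEmbedding` are hypothetical names, and the fake class exists so the sketch runs offline.

```python
from typing import Protocol, Sequence


class SupportsEmbedQuery(Protocol):
    """Anything with a LangChain-style embed_query method."""

    def embed_query(self, text: str) -> Sequence[float]: ...


def check_dimension(embedding: SupportsEmbedQuery, expected: int, probe: str = "test") -> int:
    """Embed a short probe string and verify the vector length."""
    actual = len(embedding.embed_query(probe))
    if actual != expected:
        raise ValueError(f"embedding returns {actual} dimensions, index expects {expected}")
    return actual


class FakeEmbedding:
    """Offline stand-in; replace with the real `embedding` object."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 3072


print(check_dimension(FakeEmbedding(), expected=3072))
```

With the real model, `check_dimension(embedding, expected=3072)` would make one embedding request.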

In [9]:
retriever = vector_store.as_retriever()
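Called without arguments, `as_retriever()` returns a retriever that embeds the query and looks up the nearest document vectors. The sketch below shows the underlying idea with cosine similarity in plain Python. It is an illustration only: the metric each store actually uses depends on its configuration, `top_k` and `cosine_similarity` are hypothetical helpers, and the vectors are toy values that are assumed to be non-zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional vectors standing in for real embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```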

In [10]:
# Query: "I prefer heavy, meaty aromas or rough, thick smoky styles."
retriever.invoke("헤비한 육향이나 거칠고 진득한 스모크 계열을 선호해요.")

[Document(id='97c1482d-b770-4075-905a-b03a7e222e20', metadata={'whisky_name': 'Ardbeg1966 MI', 'link': 'https://www.whiskybase.com/whiskies/whisky/8672/ardbeg-1966-mi', 'tags': 'Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted'}, page_content='위스키 이름: Ardbeg1966 MI\n태그: Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted\n\n향(Nose): Shoe polish, hansaplast, dirty smoke, coal tar, soot, wood tar, sweet spices, dark wood, beeswax\n\n맛(Taste): Sweet spices, dirty smoke, wood tar, dark sugar, dark wood, chocolate, a bit of salmiak\n\n피니쉬(Finish): Sweet dirty smoke, sweet spices, dark sugar, dark wood, wood tar'),
 Document(id='e667ed18-d40d-4fde-8544-71536a7f051a', metadata={'whisky_name': 'Ardbeg1966 MI', 'link': 'https://www.whiskybase.com/whiskies/whisky/8671/ardbeg-1966-mi', 'tags': 'Oily,Chocolate,Citric,Fresh Fruit,Hay-like,Leafy,Mossy,New Wood,Smokey,Tobacco'}, 

In [11]:
# Query: "Tell me about the Bowmore 1964 Black whisky. Which flavors and aromas stand out?"
retriever.invoke("Bowmore 1964 Black 위스키에 대해 알려주세요. 어떤 맛과 향이 두드러지나요?")

[Document(id='7f6f1268-f3b3-4fe0-9b98-a2dc4a039c87', metadata={'whisky_name': 'Bowmore1964 Black', 'link': 'https://www.whiskybase.com/whiskies/whisky/3681/bowmore-1964-black', 'tags': 'New Wood,Yeasty'}, page_content='위스키 이름: Bowmore1964 Black\n태그: New Wood,Yeasty\n\n향(Nose): Great! Greasy, voluminous and yet round, only very subtle smoke from Bowmore, in the form of extinguished incense sticks, lots of camphor, saffron, cumin, freshly grated cocoa and coffee beans, dark toffee, nougat, old Armagnac, Demerara sugar, sweet old tobacco leaves , candied ginger, plum and black cherry, sweet liquorice, a little chamomile in between, blackberries, marzipan, fresh coffee beans become more intense over time and dark chocolate becomes darker\n\n맛(Taste): Lots of high-quality, freshly poured coffee, camphor, nutmeg, licorice, plus great sweetness from freshly squeezed pomegranate, bitter and also blood orange, tobacco, cumin, ginger, beautiful anise, alcohol super gently integrated and yet very

In [12]:
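# Query: "I'd like a recommendation for a whisky similar to Highland Park 1958. What should I try?"
retriever.invoke("Highland Park 1958과 비슷한 느낌의 위스키를 추천받고 싶어요. 어떤 걸 마셔보면 좋을까요?")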

[Document(id='35dea43d-2b5e-4c86-a096-b154887240d6', metadata={'whisky_name': 'Highland Park1958', 'link': 'https://www.whiskybase.com/whiskies/whisky/16477/highland-park-1958', 'tags': 'Sherried'}, page_content="위스키 이름: Highland Park1958\n태그: Sherried\n\n향(Nose): A full bed of heather in springtime bloom, wild honey fresh from the hive, blood oranges, sandalwood, blood oranges and tangerines, sandalwood and pine log smoke\n\n맛(Taste): Pine sap, fir honey, roasted chestnuts, more mango, tangerines and blood oranges, Manuka honey on a croissant, liquorice, all-spice\n\n피니쉬(Finish): Stunning, long, resinous, warming. more heather-honey, smoke, espresso, roasted hazelnuts, dark fudge, it's a session killer"),
 Document(id='072092c8-7c14-40ae-86a9-93e003630368', metadata={'whisky_name': 'Highland Park1958', 'link': 'https://www.whiskybase.com/whiskies/whisky/45818/highland-park-1958', 'tags': 'Chocolate,Citric,Dried Fruit,Fresh Fruit,Honey,Malt Extract,New Wood,Sherried,Smokey,Solvent,Vani

In [13]:
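# Query: "I'd like dark chocolate notes together with a bit of sherry character, and little smoke."
retriever.invoke("어두운 초콜릿 느낌과 약간의 셰리 풍미를 함께 느껴보고 싶어요. 스모크는 적었으면 합니다.")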

[Document(id='830e5396-255c-4953-bafc-b4e62c5cf646', metadata={'whisky_name': 'Glen Grant1957 GM', 'link': 'https://www.whiskybase.com/whiskies/whisky/210218/glen-grant-1957-gm', 'tags': 'Chocolate'}, page_content='위스키 이름: Glen Grant1957 GM\n태그: Chocolate\n\n향(Nose): It starts with plenty of fruit, which goes towards all kinds of berries at first: red berries, raspberries, and blackberries. But also sweet cherries and red apples. Then loads of dark chocolate. Rich honey and hazelnuts are in the background. Later also pine needles and a hint of furniture polish.\n\n맛(Taste): Dark chocolate and orange marmalade. Some lovely rich honey. Damped wood and tobacco, followed by Earl Grey tea and eucalyptus. Then cranberries. A touch of oak. Liquorice. A hint of cinnamon and crushed black peppercorns.\n\n피니쉬(Finish): Very long and somewhat dry, with dark chocolate, toffee and tobacco, but also thyme and cranberries and a hint of oak and black pepper.'),
 Document(id='1662ec7e-c446-4510-b2af-f52

In [14]:
vector_store = get_vector_store(
    documents = documents,
    embedding=embedding,
    type = "chroma"
)

In [15]:
retriever = vector_store.as_retriever()

In [16]:
# Query: "I prefer heavy, meaty aromas or rough, thick smoky styles."
retriever.invoke("헤비한 육향이나 거칠고 진득한 스모크 계열을 선호해요.")

[Document(metadata={'tags': 'Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted', 'link': 'https://www.whiskybase.com/whiskies/whisky/8672/ardbeg-1966-mi', 'whisky_name': 'Ardbeg1966 MI'}, page_content='위스키 이름: Ardbeg1966 MI\n태그: Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted\n\n향(Nose): Shoe polish, hansaplast, dirty smoke, coal tar, soot, wood tar, sweet spices, dark wood, beeswax\n\n맛(Taste): Sweet spices, dirty smoke, wood tar, dark sugar, dark wood, chocolate, a bit of salmiak\n\n피니쉬(Finish): Sweet dirty smoke, sweet spices, dark sugar, dark wood, wood tar'),
 Document(metadata={'tags': 'Oily,Chocolate,Citric,Fresh Fruit,Hay-like,Leafy,Mossy,New Wood,Smokey,Tobacco', 'link': 'https://www.whiskybase.com/whiskies/whisky/8671/ardbeg-1966-mi', 'whisky_name': 'Ardbeg1966 MI'}, page_content="위스키 이름: Ardbeg1966 MI\n태그: Oily,Chocolate,Citric,Fresh Fruit,Hay-like,Le

In [17]:
# Query: "Tell me about the Bowmore 1964 Black whisky. Which flavors and aromas stand out?"
retriever.invoke("Bowmore 1964 Black 위스키에 대해 알려주세요. 어떤 맛과 향이 두드러지나요?")

[Document(metadata={'link': 'https://www.whiskybase.com/whiskies/whisky/3681/bowmore-1964-black', 'tags': 'New Wood,Yeasty', 'whisky_name': 'Bowmore1964 Black'}, page_content='위스키 이름: Bowmore1964 Black\n태그: New Wood,Yeasty\n\n향(Nose): Great! Greasy, voluminous and yet round, only very subtle smoke from Bowmore, in the form of extinguished incense sticks, lots of camphor, saffron, cumin, freshly grated cocoa and coffee beans, dark toffee, nougat, old Armagnac, Demerara sugar, sweet old tobacco leaves , candied ginger, plum and black cherry, sweet liquorice, a little chamomile in between, blackberries, marzipan, fresh coffee beans become more intense over time and dark chocolate becomes darker\n\n맛(Taste): Lots of high-quality, freshly poured coffee, camphor, nutmeg, licorice, plus great sweetness from freshly squeezed pomegranate, bitter and also blood orange, tobacco, cumin, ginger, beautiful anise, alcohol super gently integrated and yet very buttery and present\n\n피니쉬(Finish): Anise,

In [18]:
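# Query: "I'd like a recommendation for a whisky similar to Highland Park 1958. What should I try?"
retriever.invoke("Highland Park 1958과 비슷한 느낌의 위스키를 추천받고 싶어요. 어떤 걸 마셔보면 좋을까요?")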

[Document(metadata={'whisky_name': 'Highland Park1958', 'tags': 'Sherried', 'link': 'https://www.whiskybase.com/whiskies/whisky/16477/highland-park-1958'}, page_content="위스키 이름: Highland Park1958\n태그: Sherried\n\n향(Nose): A full bed of heather in springtime bloom, wild honey fresh from the hive, blood oranges, sandalwood, blood oranges and tangerines, sandalwood and pine log smoke\n\n맛(Taste): Pine sap, fir honey, roasted chestnuts, more mango, tangerines and blood oranges, Manuka honey on a croissant, liquorice, all-spice\n\n피니쉬(Finish): Stunning, long, resinous, warming. more heather-honey, smoke, espresso, roasted hazelnuts, dark fudge, it's a session killer"),
 Document(metadata={'tags': 'Chocolate,Citric,Dried Fruit,Fresh Fruit,Honey,Malt Extract,New Wood,Sherried,Smokey,Solvent,Vanilla', 'link': 'https://www.whiskybase.com/whiskies/whisky/45818/highland-park-1958', 'finish_score': 92.0, 'nose_score': 89.0, 'taste_score': 90.0, 'whisky_name': 'Highland Park1958'}, page_content='위스

In [19]:
#embedding = PineconeEmbeddings(model="multilingual-e5-large")

vector_store = get_vector_store(
    documents = documents,
    embedding=embedding,
    type = "pinecone",
    dimension=3072
)

✓ Pinecone 인덱스 'whisky-reviews'에 716개 문서 추가 완료


In [20]:
retriever = vector_store.as_retriever()

In [21]:
# Query: "I prefer heavy, meaty aromas or rough, thick smoky styles."
retriever.invoke("헤비한 육향이나 거칠고 진득한 스모크 계열을 선호해요.")

[Document(id='e066b61d-a9c1-432e-98c7-fb7aacf1877a', metadata={'link': 'https://www.whiskybase.com/whiskies/whisky/8672/ardbeg-1966-mi', 'tags': 'Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted', 'whisky_name': 'Ardbeg1966 MI'}, page_content='위스키 이름: Ardbeg1966 MI\n태그: Chocolate,Citric,Dried Fruit,Sherried,Smokey,Tobacco,Hay-like,Kippery,Leathery,Medicinal,New Wood,Nutty,Toasted\n\n향(Nose): Shoe polish, hansaplast, dirty smoke, coal tar, soot, wood tar, sweet spices, dark wood, beeswax\n\n맛(Taste): Sweet spices, dirty smoke, wood tar, dark sugar, dark wood, chocolate, a bit of salmiak\n\n피니쉬(Finish): Sweet dirty smoke, sweet spices, dark sugar, dark wood, wood tar'),
 Document(id='e5498b44-83fb-42ce-a4a7-763b3926b605', metadata={'link': 'https://www.whiskybase.com/whiskies/whisky/8671/ardbeg-1966-mi', 'tags': 'Oily,Chocolate,Citric,Fresh Fruit,Hay-like,Leafy,Mossy,New Wood,Smokey,Tobacco', 'whisky_name': 'Ardbeg1966 MI'}, 

In [22]:
# Query: "Tell me about the Bowmore 1964 Black whisky. Which flavors and aromas stand out?"
retriever.invoke("Bowmore 1964 Black 위스키에 대해 알려주세요. 어떤 맛과 향이 두드러지나요?")

[Document(id='caf3f261-9554-405c-947a-eb402d39f529', metadata={'link': 'https://www.whiskybase.com/whiskies/whisky/3681/bowmore-1964-black', 'tags': 'New Wood,Yeasty', 'whisky_name': 'Bowmore1964 Black'}, page_content='위스키 이름: Bowmore1964 Black\n태그: New Wood,Yeasty\n\n향(Nose): Great! Greasy, voluminous and yet round, only very subtle smoke from Bowmore, in the form of extinguished incense sticks, lots of camphor, saffron, cumin, freshly grated cocoa and coffee beans, dark toffee, nougat, old Armagnac, Demerara sugar, sweet old tobacco leaves , candied ginger, plum and black cherry, sweet liquorice, a little chamomile in between, blackberries, marzipan, fresh coffee beans become more intense over time and dark chocolate becomes darker\n\n맛(Taste): Lots of high-quality, freshly poured coffee, camphor, nutmeg, licorice, plus great sweetness from freshly squeezed pomegranate, bitter and also blood orange, tobacco, cumin, ginger, beautiful anise, alcohol super gently integrated and yet very

In [23]:
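# Query: "I'd like a recommendation for a whisky similar to Highland Park 1958. What should I try?"
retriever.invoke("Highland Park 1958과 비슷한 느낌의 위스키를 추천받고 싶어요. 어떤 걸 마셔보면 좋을까요?")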

[]

In [24]:
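# Query: "I'd like dark chocolate notes together with a bit of sherry character, and little smoke."
retriever.invoke("어두운 초콜릿 느낌과 약간의 셰리 풍미를 함께 느껴보고 싶어요. 스모크는 적었으면 합니다.")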

[]
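The same queries are run against three stores in this notebook, and the last two Pinecone calls return empty lists where FAISS and Chroma returned documents. The notebook does not show the cause. To make such differences easier to spot, the results could be collected side by side, as in the sketch below. It assumes only that each retriever has an `invoke` method and that documents carry a `whisky_name` metadata key, as in the outputs above; `compare_retrievers`, `empty_hits`, and the fake classes are hypothetical.

```python
from typing import Any


def compare_retrievers(
    retrievers: dict[str, Any], queries: list[str]
) -> dict[str, dict[str, list[str]]]:
    """Run every query against every retriever and collect whisky names."""
    results: dict[str, dict[str, list[str]]] = {}
    for query in queries:
        results[query] = {
            name: [doc.metadata.get("whisky_name", "") for doc in retriever.invoke(query)]
            for name, retriever in retrievers.items()
        }
    return results


def empty_hits(results: dict[str, dict[str, list[str]]]) -> list[tuple[str, str]]:
    """List the (query, store) pairs that returned no documents."""
    return [
        (query, name)
        for query, per_store in results.items()
        for name, names in per_store.items()
        if not names
    ]


class FakeDoc:
    """Offline stand-in for a retrieved document."""

    def __init__(self, name: str) -> None:
        self.metadata = {"whisky_name": name}


class FakeRetriever:
    """Offline stand-in that returns canned hits per query."""

    def __init__(self, hits: dict[str, list[str]]) -> None:
        self._hits = hits

    def invoke(self, query: str) -> list[FakeDoc]:
        return [FakeDoc(name) for name in self._hits.get(query, [])]


demo = compare_retrievers(
    {"store_a": FakeRetriever({"q1": ["Whisky X"]}), "store_b": FakeRetriever({})},
    ["q1"],
)
print(empty_hits(demo))
```

In the notebook, the three retrievers would need to be kept under separate variable names instead of being reassigned to `retriever`.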