
# Embedding, Routing, and Similarity Search with ChromaDB


- **Install dependencies**: chromadb, pypdf, openai, tqdm, langchain, python-dotenv, langchain-openai
- **File loading**: extract text from the uploaded PDF / TXT files
- **Chunking**: split the text with RecursiveCharacterTextSplitter
- **Embedding**: OpenAI text-embedding-3-large
- **Vector DB**: local Chroma persistence (./chroma_db_router)
- **Routing**: one collection per document, with LLM-based query routing
- **Search**: similarity search on the query, with results shown as a table



# Install dependencies

In [1]:
%pip install -qU chromadb pypdf tqdm langchain openai python-dotenv langchain-openai pandas

Note: you may need to restart the kernel to use updated packages.





# Environment setup
- Define the embedding model
- Define the ChromaDB path
- Define the collection names

In [1]:
import os
from pathlib import Path
from tqdm import tqdm
from pypdf import PdfReader
import pandas as pd

import chromadb
from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

# Load environment (.env must contain OPENAI_API_KEY)
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is missing in .env file")

# Embedding function (OpenAI)
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-large"
)

# Chroma persistence dir
PERSIST_DIR = "./chroma_db_router"

# File mapping: each file -> one collection
FILE_MAP = {
    "ai4science": "./IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf",
    "spri_report": "./SPRi AI Brief_9월호_산업동향_0909_F.pdf",
    "finance": "./finance-keywords.txt",
    "nlp": "./nlp-keywords.txt",
}



# Document loading & text splitting
- Uses RecursiveCharacterTextSplitter

In [2]:
def load_pdf(path: str):
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        try:
            text = page.extract_text() or ""
        except Exception:  # treat pages that fail to parse as empty
            text = ""
        if text.strip():
            yield text, {"source": path, "page": i+1}

def load_txt(path: str):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    yield text, {"source": path, "page": -1}

splitter = RecursiveCharacterTextSplitter(
    chunk_size=900,
    chunk_overlap=120,
    separators=["\n\n", "\n", " ", ""]
)

def chunk_text(text: str):
    return splitter.split_text(text)
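
To make the effect of `chunk_size=900` and `chunk_overlap=120` concrete, here is a minimal sketch of overlapping fixed-size windows in plain Python. It is a simplification, not the algorithm RecursiveCharacterTextSplitter uses (that splitter also tries to break on the listed separators), and `sliding_window_chunks` is a hypothetical helper name introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 900, chunk_overlap: int = 120) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])  # [900, 900, 440]
```

With these settings each window advances 780 characters, so the last 120 characters of one chunk reappear at the start of the next.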

# Create and populate the individual collections (ai4science, spri_report, finance, nlp)

In [3]:
client = chromadb.PersistentClient(path=PERSIST_DIR)

collections = {}

for name, filepath in FILE_MAP.items():
    path = Path(filepath)
    if not path.exists():
        print(f"[WARN] Missing file: {filepath}")
        continue

    # Start from a fresh collection: drop any existing one, then recreate it
    try:
        client.delete_collection(name)
    except Exception:  # nothing to delete on the first run
        pass
    collection = client.create_collection(name, embedding_function=embedding_fn)
    collections[name] = collection

    # Load the file as (text, metadata) pairs
    if path.suffix.lower() == ".pdf":
        pairs = list(load_pdf(filepath))
    else:
        pairs = list(load_txt(filepath))

    # Split into chunks and add them to the collection
    docs, ids, metas = [], [], []
    for doc_id, (text, meta) in enumerate(pairs, start=1):
        chunks = chunk_text(text)
        for idx, ch in enumerate(chunks):
            docs.append(ch.strip())
            ids.append(f"{name}-{doc_id}-{idx}")
            # Replace None metadata values with a placeholder string
            raw_meta = {**meta, "chunk": idx + 1}
            clean_meta = {k: ("N/A" if v is None else v) for k, v in raw_meta.items()}
            metas.append(clean_meta)
    if docs:  # skip files that produced no chunks
        collection.add(ids=ids, documents=docs, metadatas=metas)

    print(f"[INFO] Collection {name} created with {collection.count()} chunks")

[INFO] Collection ai4science created with 80 chunks
[INFO] Collection spri_report created with 70 chunks
[INFO] Collection finance created with 4 chunks
[INFO] Collection nlp created with 10 chunks
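
The loop above sends every chunk of a file in a single `collection.add` call. That works for the 4 to 80 chunks seen here, but larger documents may run into request-size limits on the embedding or database side. A minimal sketch of batching the same call follows; `add_in_batches` and the default `batch_size=100` are assumptions introduced for illustration, not part of the notebook or of Chroma's API.

```python
from typing import Any, Sequence


def add_in_batches(
    collection: Any,
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Sequence[dict],
    batch_size: int = 100,  # hypothetical default; tune it to your limits
) -> int:
    """Call collection.add() on consecutive slices and return the number of items sent."""
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError("ids, documents and metadatas must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    added = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=list(ids[start:end]),
            documents=list(documents[start:end]),
            metadatas=list(metadatas[start:end]),
        )
        added += len(ids[start:end])
    return added
```

In the ingestion loop this would stand in for the direct call, for example `add_in_batches(collection, ids, docs, metas)`. An empty input results in no calls at all.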


# LLM-based routing setup

In [4]:
llm_router = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY)

ROUTING_PROMPT = """You are a router.
Decide which knowledge base best matches the user's query.
Available DBs:
- ai4science: AI4Science research report
- spri_report: industry trends report
- finance: finance keywords
- nlp: NLP keywords

Answer ONLY with one word: ai4science, spri_report, finance, or nlp.

Query: {query}
"""

def route_query(query: str) -> str:
    resp = llm_router.invoke(ROUTING_PROMPT.format(query=query))
    route = resp.content.strip().lower()
    if route not in collections:
        route = "ai4science"  # fallback
    return route
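
`route_query` accepts the model's reply only when it exactly equals a collection name after stripping and lowercasing, so a reply such as `Finance.` or `"nlp"` silently falls back to `ai4science`. One possible way to make the parsing more tolerant is sketched below. `normalize_route` is a hypothetical helper, and the rule of falling back on ambiguous replies is an assumption about the desired behavior.

```python
import re


def normalize_route(raw: str, valid: set[str], fallback: str) -> str:
    """Map a free-form model reply onto exactly one valid collection name."""
    cleaned = raw.strip().lower()
    if cleaned in valid:
        return cleaned
    # Tokenize on anything that cannot appear in the collection names used here.
    tokens = re.findall(r"[a-z0-9_]+", cleaned)
    matches = {token for token in tokens if token in valid}
    # Accept the reply only when it names a single collection.
    if len(matches) == 1:
        return matches.pop()
    return fallback


names = {"ai4science", "spri_report", "finance", "nlp"}
print(normalize_route(" Finance.\n", names, "ai4science"))  # finance
```

Inside `route_query`, the call would be `normalize_route(resp.content, set(collections), "ai4science")`.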


# Similarity search

In [5]:
def search(query: str, top_k: int = 5):

    target_db = route_query(query)
    collection = collections[target_db]

    res = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    docs = res.get("documents", [[]])[0]
    metas = res.get("metadatas", [[]])[0]
    dists = res.get("distances", [[]])[0]

    rows = []
    for i, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
        rows.append({
            "rank": i,
            "distance": dist,
            "snippet": (doc[:160] + ("..." if len(doc) > 160 else "")),
            "db": target_db,
            "source": Path(meta.get("source")).name,
            "page": meta.get("page"),
            "chunk": meta.get("chunk"),
        })
    return pd.DataFrame(rows)

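# Query: "Definition and characteristics of the S&P 500 index"
display(search("S&P 500 지수의 정의와 특징"))
# Query: "A paradigm shift in AI research"
display(search("AI 연구 패러다임 전환"))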

| | rank | distance | snippet | db | source | page | chunk |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.724383 | S&P 500\n\n정의: S&P 500은 미국 주식 시장에 상장된 500개의 대형... | finance | finance-keywords.txt | -1 | 1 |
| 1 | 2 | 0.889912 | Growth Stock\n\n정의: 성장주는 평균 이상의 높은 성장률을 보이는 기업... | finance | finance-keywords.txt | -1 | 3 |
| 2 | 3 | 0.895336 | ESG (Environmental, Social, and Governance)\n\... | finance | finance-keywords.txt | -1 | 4 |
| 3 | 4 | 0.94583 | Earnings Per Share (EPS)\n\n정의: 주당순이익(EPS)은 기업... | finance | finance-keywords.txt | -1 | 2 |


| | rank | distance | snippet | db | source | page | chunk |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.779547 | SPRi 이슈리포트 IS-208과학을 위한 AI(AI4Science) 연구의 패러다... | ai4science | IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf | 10 | 1 |
| 1 | 2 | 0.80411 | SPRi 이슈리포트 IS-208과학을 위한 AI(AI4Science) 연구의 패러다... | ai4science | IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf | 4 | 1 |
| 2 | 3 | 0.805671 | SPRi 이슈리포트 IS-208과학을 위한 AI(AI4Science) 연구의 패러다... | ai4science | IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf | 7 | 1 |
| 3 | 4 | 0.809505 | SPRi 이슈리포트 IS-208과학을 위한 AI(AI4Science) 연구의 패러다... | ai4science | IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf | 6 | 1 |
| 4 | 5 | 0.810753 | SPRi 이슈리포트 IS-208과학을 위한 AI(AI4Science) 연구의 패러다... | ai4science | IS-208 과학을 위한 AI(AI4Science) 연구의 패러다임을 바꾸다.pdf | 15 | 1 |
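
The `distance` column is easier to interpret once it is converted to a similarity score. The sketch below rests on two assumptions that the notebook does not state: the collections use Chroma's default `l2` space (squared Euclidean distance), which applies when no metric is set at creation, and the OpenAI embeddings are unit-length. Under those assumptions the squared distance d and the cosine similarity c satisfy d = 2 - 2c. `l2sq_to_cosine` is a hypothetical helper name.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2sq_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


# Check the identity on two unit vectors.
u, v = [1.0, 0.0], [0.6, 0.8]
assert math.isclose(l2sq_to_cosine(squared_l2(u, v)), cosine(u, v))

# Top hit for the S&P 500 query above.
print(round(l2sq_to_cosine(0.724383), 4))  # 0.6378
```

If a collection is created with a different distance metric, this conversion does not apply.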
