# Implementing Embeddings + RAG Locally

## Using a Local Embedding Model (BGE-m3)

In [5]:
%pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Using cached transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Using cached torch-2.7.0-cp312-none-macosx_11_0_arm64.whl.metadata (29 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.6.1-cp312-cp312-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.15.3-cp312-cp312-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting setuptools (from torch>=1.11.0->sentence-transformers)
  Using cached setuptools-80.4.0-py3-none-any.whl.metadata (6.5 kB)
Collecting networkx (from torch>=1.11.0->sentence-transformers)
  Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting safetensors>=0.4.3 (from transformers<5.0.0,>=4.41.0->sentence-transformers)
  Using 

In [1]:
from glob import glob 

for g in glob('../chap04_summary_document/data/*.pdf'):
    print(g)

../chap04_summary_document/data/생성형 AI 기반의 영농 의사결정 지원 시스템 개발과 향후계획.pdf
../chap04_summary_document/data/농업용 저수지 치수능력 증대를 위한 기존 사례 검토.pdf
../chap04_summary_document/data/인공지능을 활용한 농업기반시설물 안전점검 방안.pdf
../chap04_summary_document/data/인공지능 기법을 활용한 농촌지역의 객체 정보 추출방안.pdf
../chap04_summary_document/data/포화 불균일성을 고려한 육계사 쿨링패드 시스템 성능 평가.pdf
../chap04_summary_document/data/APEX 모델을 이용한 옥수수-가을배추 재배지의 시비 수준별 비점오염 부하량 평가.pdf
../chap04_summary_document/data/저수지 제체 월류수위 예측을 위한 Fuzzy Time Series법의 적용성 비교 평가.pdf
../chap04_summary_document/data/재난 관리용 고해상도 지형 데이터 자동 생성 방법 제안.pdf
../chap04_summary_document/data/기후변화 대응 농업시설물의 신뢰성 기반 설계.pdf


In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def read_pdf_and_split_text(pdf_path, chunk_size=1000, chunk_overlap=100):
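    """
    Read the given PDF file and split its text into chunks.
    Args:
        pdf_path (str): Path to the PDF file.
        chunk_size (int, optional): Maximum size of each text chunk, in characters. Defaults to 1000.
        chunk_overlap (int, optional): Overlap between consecutive chunks, in characters. Defaults to 100.
    Returns:
        list: The split chunks as a list of LangChain Document objects.
    """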
    print(f"PDF: {pdf_path} -----------------------------")

    pdf_loader = PyPDFLoader(pdf_path)
    data_from_pdf = pdf_loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    splits = text_splitter.split_documents(data_from_pdf)
    
    print(f"Number of splits: {len(splits)}\n")
    return splits
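
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph and line breaks, while this sketch cuts at fixed character offsets. The function name `sliding_chunks` is hypothetical and exists only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is what the overlap is for: a sentence that straddles a boundary still appears intact in at least one chunk.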


In [6]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-m3"
model_kwargs = {"device": "mps"}  # Use 'mps' on Apple Silicon (M1/M2/M3/M4) MacBooks
encode_kwargs = {"normalize_embeddings": True}

hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
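
The cell above hard-codes `"mps"`, which only works on Apple Silicon. As a hedged sketch, the device could instead be chosen at runtime. The helper name `pick_device` is hypothetical; the `torch.backends.mps.is_available()` and `torch.cuda.is_available()` checks are standard PyTorch calls, and the fallback to `"cpu"` when PyTorch cannot be imported is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return 'mps', 'cuda', or 'cpu' depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to accelerate.
        return "cpu"

    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"   # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    return "cpu"


model_kwargs = {"device": pick_device()}
print(model_kwargs)
```

The resulting `model_kwargs` dictionary has the same shape as the one passed to `HuggingFaceBgeEmbeddings` above, so it could be swapped in without other changes.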




In [8]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-5.5.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.5.0-py3-none-any.whl (303 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.5.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [9]:
import os
from langchain_chroma import Chroma

persist_directory='./chroma_store'

if os.path.exists(persist_directory):
    print("Loading existing Chroma store")
    vectorstore = Chroma(
        persist_directory=persist_directory, 
        embedding_function=hf
    )
else:
    print("Creating new Chroma store")
    
    vectorstore = None
    for g in glob('../chap04_summary_document/data/*.pdf'):
        chunks = read_pdf_and_split_text(g)
        # Store the chunks in batches of 100
        for i in range(0, len(chunks), 100):
            if vectorstore is None:
                vectorstore = Chroma.from_documents(
                    documents=chunks[i:i+100],
                    embedding=hf,
                    persist_directory=persist_directory
                )
            else:
                vectorstore.add_documents(
                    documents=chunks[i:i+100]
                )

Creating new Chroma store
PDF: ../chap04_summary_document/data/생성형 AI 기반의 영농 의사결정 지원 시스템 개발과 향후계획.pdf -----------------------------
Number of splits: 12

PDF: ../chap04_summary_document/data/농업용 저수지 치수능력 증대를 위한 기존 사례 검토.pdf -----------------------------
Number of splits: 13

PDF: ../chap04_summary_document/data/인공지능을 활용한 농업기반시설물 안전점검 방안.pdf -----------------------------
Number of splits: 13

PDF: ../chap04_summary_document/data/인공지능 기법을 활용한 농촌지역의 객체 정보 추출방안.pdf -----------------------------
Number of splits: 17

PDF: ../chap04_summary_document/data/포화 불균일성을 고려한 육계사 쿨링패드 시스템 성능 평가.pdf -----------------------------
Number of splits: 36

PDF: ../chap04_summary_document/data/APEX 모델을 이용한 옥수수-가을배추 재배지의 시비 수준별 비점오염 부하량 평가.pdf -----------------------------
Number of spli
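
The loop above slices `chunks` by index so that at most 100 documents are embedded and stored per call. The same batching pattern can be sketched as a small generator. The name `batched` and the `fake_chunks` list are hypothetical stand-ins for this illustration; no vector store is involved.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for the chunk list returned by read_pdf_and_split_text.
fake_chunks = [f"chunk-{n}" for n in range(250)]
sizes = [len(batch) for batch in batched(fake_chunks, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

In the notebook, each yielded batch would take the place of `chunks[i:i+100]`. Note that every PDF in the output above produces fewer than 100 chunks, so each file ends up in a single batch.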

In [10]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# The query is Korean for: "Explain how AI can be used in agriculture"
chunks = retriever.invoke("AI가 농업분야에서 어떻게 활용될 수 있는지 설명해줘")

for chunk in chunks:
    print(chunk.metadata)
    print(chunk.page_content)

{'creationdate': '2024-06-07T10:12:03+08:00', 'creator': 'PyPDF', 'moddate': '2024-06-07T10:19:41+08:00', 'page': 4, 'page_label': '5', 'producer': 'Adobe PDF Library 10.0.1; modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)', 'source': '../chap04_summary_document/data/생성형 AI 기반의 영농 의사결정 지원 시스템 개발과 향후계획.pdf', 'total_pages': 8}
6▶
그림 5. AI 팜두레 첫 화면 – 기본 예시 질문 결과
Rural Resources
+
특집    생성형 AI 기반의 영농 의사결정 지원 시스템 개발과 향후계획
그림 4. AI 팜두레 instruction 작성
{'creationdate': '2024-06-07T10:12:10+08:00', 'creator': 'PyPDF', 'moddate': '2024-06-07T10:19:42+08:00', 'page': 0, 'page_label': '1', 'producer': 'Adobe PDF Library 10.0.1; modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)', 'source': '../chap04_summary_document/data/인공지능 기법을 활용한 농촌지역의 객체 정보 추출방안.pdf', 'total_pages': 7}
10▶
인공지능 기법을 활용한 농촌지역의 
객체 정보 추출방안
1. 머리말
최근 4차 산업혁명 기술이 경제 화두로 부상한 이후 3D, 빅데이터, IoT, 
AI를 활용한 사업들이 성
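
Conceptually, a retriever configured with `k=5` returns the five stored chunks whose embeddings are closest to the query embedding. The sketch below shows that idea as an exact, brute-force cosine-similarity ranking over toy 2-dimensional vectors. It is an illustration only: Chroma uses its own index and distance settings, and the names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (what normalize_embeddings=True requests)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5):
    """Return (index, score) pairs for the k most similar document vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(doc_vecs):
        d = normalize(vec)
        # For unit vectors the dot product is the cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, idx))
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
print([idx for idx, _ in top_k(toy_query, toy_docs, k=2)])  # [0, 2]
```

A brute-force scan like this costs time proportional to the number of stored vectors, which is why vector stores rely on an index instead.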