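### RAG over HWP Documents

This notebook walks through the full process of converting HWP (Hangul word processor) documents to PDF 
and building a RAG (Retrieval-Augmented Generation) system on top of AI-based document analysis.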

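#### Installing the required packages
Install the `olefile` package, which is used for handling HWP files.
Note that the HWP-to-PDF conversion in this notebook works only on Windows. The conversion step also imports `win32com`, which is provided by the `pywin32` package.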


In [None]:
%pip install olefile pywin32

#### Batch-converting HWP to PDF
Batch-convert HWP files to PDF using a Windows COM object.  
This works only on Windows with the Hangul word processor installed.  

In [None]:
from pathlib import Path

import win32com.client as win32


def hwp_to_pdf(folder: str, out_dir: str):
    """Convert every .hwp file in `folder` to a PDF in `out_dir`."""
    folder = Path(folder).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    # Start the Hangul automation object and register the file-path check module
    hwp = win32.gencache.EnsureDispatch("HWPFrame.HwpObject")
    hwp.RegisterModule("FilePathCheckDLL", "FilePathCheckerModule")

    try:
        for hwp_file in folder.glob("*.hwp"):
            try:
                hwp.Open(str(hwp_file))
            except Exception as e:
                print("✖ Failed to open:", hwp_file.name, e)
                continue

            pdf_path = out_dir / f"{hwp_file.stem}.pdf"
            hwp.SaveAs(str(pdf_path), "PDF", "")
            hwp.Run("FileClose")  # close the document before opening the next one
            print("✔", pdf_path.name)
    finally:
        hwp.Quit()  # always shut down Hangul, even if a conversion fails


hwp_to_pdf("./hangul_hwp", "./hangul_hwp/pdf")

#### Processing the PDF with Docling

In [None]:
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc.base import ImageRefMode

# Input PDF path and output directory
input_doc_path = Path("/hangul_hwp/sample.pdf")
output_dir = Path("/hangul_hwp/pdf/output")

# Configure the PDF pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2.0  # image resolution scale factor
pipeline_options.generate_picture_images = True  # keep extracted pictures so they can be saved as image files

# Initialize the document converter with the PDF format options
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Run the PDF conversion
conv_res = doc_converter.convert(input_doc_path)

In [None]:
# Save the converted document as a Markdown file.
# Figures are written as separate image files and referenced from the Markdown.

# Create the output directory (including any missing parents)
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_res.input.file.stem

# Save the full Markdown file with image references
md_filename_with_refs = output_dir / f"{doc_filename}-with-image.md"
conv_res.document.save_as_markdown(md_filename_with_refs, image_mode=ImageRefMode.REFERENCED)

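#### Generating image descriptions (captions) with a VLM
Use a VLM to generate captions automatically for the extracted images, then replace the image references in the Markdown file with those text descriptions.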


In [None]:
import base64
import glob
import re

from dotenv import load_dotenv
from langchain_core.messages import HumanMessage

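load_dotenv()
# Placeholder: assign a multimodal LangChain chat model here before running this cell
# (for example a ChatOpenAI instance like the one created later in this notebook).
# A plain string has no .batch() method, so the call below fails until this is set.
llm = ""

# Collect the PNG files from the image folder
image_folder = "/hangul_hwp/pdf/output/sample-with-image-_artifacts/"
image_paths = glob.glob(f"{image_folder}*.png")

# Build the message list for batch processing
messages_batch = []
image_filenames = []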

prompt = """
You are an assistant that **describes images in Korean** with **simple, precise language**, and **summarizes the key contents of tables and charts**. Follow the instructions below exactly.

## Goals

1. Provide a clear, concise Korean description of the image’s essential content.
2. If the image includes a **table or chart**, summarize the **main findings, comparisons, and notable numbers**.
3. Prefer facts visible in the image over speculation. When unsure, acknowledge uncertainty briefly.

## Output Language & Style

* **Language:** Korean only.
* **Style:** Plain, neutral, and professional; short sentences; no flowery language.
* **Length:** 3–6 concise bullet points total. Use additional bullets only when necessary for tables/charts.

## Structure (use these sections if applicable)

* **한줄 요약:** One-sentence Korean summary of the image’s main idea.
* **핵심 내용:** 2–4 bullets describing who/what/where/when and any prominent actions or states.
* **표/차트 요약:** For tables/charts only—2–4 bullets on trends, extremes, comparisons, and key figures.
* **불확실/가정:** 0–2 bullets noting unclear parts or assumptions (keep minimal).

> If there is **no** table or chart, omit the “표/차트 요약” section.
> If everything is clear, omit “불확실/가정”.

## What to Describe (priority order)

1. **Primary subjects & context:** people/objects, setting/location, time cues (e.g., day/night), activity.
2. **Salient visual details:** text in the image (OCR), logos, labels, axes titles/units, legends, outliers, anomalies.
3. **Quantities & comparisons:** counts, relative sizes, increases/decreases, maxima/minima, dominant categories.
4. **For tables/charts specifically:**

   * **Chart type:** bar/line/pie/scatter/table, etc.
   * **Variables & units:** name axes, categories, time ranges.
   * **Top insights:** biggest/smallest, trend direction, inflection points, noteworthy gaps.
   * **Representative numbers:** include a few key values (rounded if needed); avoid exhaustive listings.

## Formatting Rules

* Use **bulleted lists**; avoid paragraphs.
* **Numbers:** Keep original units; round sensibly (e.g., 12,345 → 12.3천 if helpful). Include % or units explicitly.
* **Text from image:** Quote short snippets exactly; for long text, summarize.
* **No redundant restatement** across sections.

## Uncertainty & Limits

* If a detail is ambiguous, write: “확인 어려움(…가능성)”.
* **Do not guess** hidden or off-frame content.
* If image quality is too low: state “이미지 해상도가 낮아 세부 확인 어려움”.

## Safety & Privacy

* **No PII extraction** beyond text clearly shown in the image.
* Avoid sensitive attribute inferences (e.g., race, health, religion) unless explicitly labeled in the image text.
* Redact or generalize sensitive data if not essential to the summary.

## Prohibited

* No external knowledge beyond what is visible.
* No speculation about intent, emotions, or future outcomes unless directly evidenced.

## Examples (format only; content is illustrative)

* **한줄 요약:** 회의실에서 5명이 프레젠테이션 자료를 검토 중.
* **핵심 내용:**

  * 대형 스크린에 막대그래프와 To-Do 목록 표시.
  * 발표자는 그래프 상단을 가리킴; 참석자들은 노트북 사용.
* **표/차트 요약:**

  * 막대그래프: 2023→2025 매출 증가, 2024 대비 2025 약 **+18%**.
  * 카테고리별: A가 최대, C가 최소.
* **불확실/가정:** 슬라이드 하단의 작은 글씨는 해상도 문제로 확인 어려움.

---

**Execution Checklist (internal):**

* [ ] Describe core scene in Korean, 1-line summary first.
* [ ] Capture salient visual text and numeric highlights.
* [ ] If table/chart present, add a focused summary with 2–4 key insights and representative numbers.
* [ ] Mark uncertainties briefly; avoid speculation.
* [ ] Keep within 3–6 bullets overall; remove non-essential details.

"""


for image_path in image_paths:
    with open(image_path, "rb") as f:
        encoded_image = base64.b64encode(f.read()).decode("utf-8")

    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": prompt,
            },
            # The nested {"url": ...} form is the OpenAI-style content block
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}},
        ]
    )

    messages_batch.append([message])
    image_filenames.append(Path(image_path).name)

batch_results = llm.batch(messages_batch)

# Map each image filename to its description
image_descriptions = {}
for filename, result in zip(image_filenames, batch_results, strict=True):
    image_descriptions[filename] = result.content
    print(f"{filename}: {result.content}\n")


# Read the Markdown file
with open(md_filename_with_refs, encoding="utf-8") as f:
    markdown_content = f.read()


from pathlib import PureWindowsPath


# Replace image references with captions
def replace_image_with_caption(match):
    full_path = match.group(1)  # full path from the Markdown reference
    # PureWindowsPath splits on both "/" and "\", regardless of the host OS
    image_filename = PureWindowsPath(full_path).name
    if image_filename in image_descriptions:
        caption = image_descriptions[image_filename]
        return f"\n[Image Caption] {caption}\n"
    return match.group(0)  # keep the original reference if no description exists


# Find and replace image references (paths with backslashes or forward slashes)
image_pattern = r"!\[.*?\]\(([^)]+\.png)\)"
updated_markdown = re.sub(image_pattern, replace_image_with_caption, markdown_content)

# Save the Markdown file with captions
md_filename_with_captions = output_dir / f"{doc_filename}-with-captions.md"
with open(md_filename_with_captions, "w", encoding="utf-8") as f:
    f.write(updated_markdown)

print(f"Saved the Markdown file with captions: {md_filename_with_captions}")
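
To check the replacement logic without calling a model, the same substitution can be tried on a short string. This is a sketch with made-up inputs: `replace_refs`, the sample Markdown, and the caption text are all hypothetical.

```python
import re
from pathlib import PureWindowsPath

IMAGE_REF = re.compile(r"!\[.*?\]\(([^)]+\.png)\)")


def replace_refs(markdown: str, captions: dict[str, str]) -> str:
    """Swap each PNG image reference for its caption, if one is known."""

    def _sub(match: re.Match) -> str:
        # PureWindowsPath splits on both "/" and "\\"
        name = PureWindowsPath(match.group(1)).name
        if name in captions:
            return f"\n[Image Caption] {captions[name]}\n"
        return match.group(0)  # unknown image: leave the reference untouched

    return IMAGE_REF.sub(_sub, markdown)


sample = "Intro\n![Image](sample_artifacts/image_000000_abc.png)\nOutro ![x](other\\missing.png)"
result = replace_refs(sample, {"image_000000_abc.png": "A bar chart."})
print(result)
```

The first reference is replaced by its caption, while the second is kept as-is because no caption exists for `missing.png`.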

In [None]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    # Try to split before second-level Markdown headings first, then fall back to finer separators
    separators=["\n## ", "\n\n", "\n", " ", ""],
)

# Read the captioned Markdown file
with open(md_filename_with_captions, encoding="utf-8") as f:
    captioned_content = f.read()


document = Document(
    page_content=captioned_content,
    metadata={"source": str(md_filename_with_captions)},
)

splits = text_splitter.split_documents([document])

split_lengths = [len(split.page_content) for split in splits]
print(f"Total number of chunks: {len(splits)}")
print(f"Average chunk length: {sum(split_lengths) / len(split_lengths):.1f} characters")
print(f"Minimum chunk length: {min(split_lengths)} characters")
print(f"Maximum chunk length: {max(split_lengths)} characters")
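
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on separators first); it only illustrates how consecutive chunks share characters. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the 200-character overlap plays above at a larger scale.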

In [None]:
#### Qdrant vector database setup

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore, RetrievalMode
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

# Connect to a Qdrant server running locally
client = QdrantClient(host="localhost", port=6333)

dense_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Collection name
collection_name = "hangul_hwp_rag"

# Create the collection only if it does not exist yet, so this cell can be re-run.
# text-embedding-3-small produces 1536-dimensional vectors by default.
if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
    )

# Build the vector store outside the try block so later cells can always use `qdrant`
qdrant = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=dense_embeddings,
    retrieval_mode=RetrievalMode.DENSE,
    vector_name="dense",
)

try:
    # Add the chunks to the vector store (re-running this cell adds them again)
    qdrant.add_documents(splits)
except Exception as e:
    print(f"Error: {e}")
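
The collection is configured with `Distance.COSINE`, so vectors are compared by the angle between them rather than by their magnitude. Here is a plain-Python sketch of that score, for illustration only (Qdrant computes it internally, and the function name is hypothetical):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Two vectors pointing in the same direction score close to 1 regardless of length, and orthogonal vectors score 0.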

#### Improving retrieval quality with a reranker
> The Qwen/Qwen3-Reranker-0.6B model reranks the search results and keeps only the most relevant documents.


In [None]:
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

retriever = qdrant.as_retriever(search_kwargs={"k": 10})

# Initialize the reranker model
model = HuggingFaceCrossEncoder(model_name="Qwen/Qwen3-Reranker-0.6B")

# Keep the top 5 documents after reranking
compressor = CrossEncoderReranker(model=model, top_n=5)

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
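
The retriever above first fetches `k=10` candidates by vector similarity, then the cross-encoder rescores each (query, document) pair and keeps the best `top_n=5`. The two-stage pattern can be sketched without any model. Here `toy_score` is a hypothetical word-overlap stand-in for the cross-encoder, not how Qwen3-Reranker scores text, and `rerank` is likewise an illustrative name.

```python
from typing import Callable


def rerank(query: str, candidates: list[str], score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Rescore first-stage candidates and keep the top_n highest-scoring ones."""
    scored = [(score(query, doc), doc) for doc in candidates]
    # The sort is stable, so ties keep their first-stage order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_score(query: str, doc: str) -> float:
    """Fraction of query words that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = ["bid price evaluation rules", "office opening hours", "price of lunch"]
top = rerank("bid price evaluation", candidates, toy_score, top_n=2)
print(top)  # ['bid price evaluation rules', 'price of lunch']
```

The first stage is cheap and broad, while the second stage is slower but more precise, which is why it only sees a small candidate set.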

In [None]:
# Search query (placeholder: fill in a question before running)
query = ""

# Search with the basic retriever
basic_docs = retriever.invoke(query)
basic_docs

In [None]:
# Search with the reranker applied
compressed_docs = compression_retriever.invoke(query)
compressed_docs

In [None]:
from typing import TypedDict

from langchain_core.documents import Document
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph


# State definition
class RAGState(TypedDict):
    question: str
    documents: list[Document]
    answer: AIMessage | None  # None until the generate step has run


# Initialize the LLM
llm = ChatOpenAI(model="gpt-4.1", temperature=0)

# RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""You are a retrieval-augmented generation assistant.
Answer the user’s question **only** using the provided documents.
Be **specific**, **structured**, and **unambiguous**.

---

## Inputs

* **Documents:** `{context}`

  * May include multiple sources with titles, IDs, and possibly line numbers.
* **Question:** `{question}`

## Non-Negotiable Rules

1. **Grounding:** Use information **solely** from the provided documents. **Do not invent** facts.
2. **Language:** Write the final answer **in Korean**.
3. **Citations:** Cite **every non-obvious claim** with the **exact source** (doc title/ID and, if available, line or page numbers).
4. **Uncertainty:** If the documents are insufficient or conflicting, **state that clearly**, explain what’s missing, and **do not speculate**.
5. **Numerics & Terminology:** Preserve units, quote key numbers exactly (round only when useful and indicate that you rounded), and keep technical terms accurate.

## How to Read the Documents

* Extract: definitions, constraints, assumptions, data (tables/charts), algorithms, and conclusions.
* If a figure/table/chart is present, summarize core variables, trends, extremes, and a few representative values.
* When sources disagree, list the competing statements with citations and provide the most defensible interpretation **based on the documents**.

## Output Structure (use the headings verbatim)

* **한줄 요약:** One-sentence Korean summary of the answer.
* **핵심 답변:** 2–5 concise bullets that directly answer the question using facts from the documents.
* **근거 및 인용:** 2–6 bullets, each stating a claim with its **precise** citation(s). Use the format:

  * `【DocTitle or DocID, Lx–Ly】` if line numbers exist, otherwise `【DocTitle or DocID】`.
* **표/차트 요약 (해당 시):** If the answer relies on a table/chart, summarize key variables, trends, maxima/minima, and 1–3 representative numbers with units and citations.
* **불확실/한계:** If any gaps, ambiguity, or conflicts exist, describe them briefly and say **what additional info** would resolve them.

## Citation Rules

* Place citations **immediately after** the sentence they support.
* Use **multiple citations** if a sentence synthesizes several sources.
* Do **not** cite general knowledge; cite only what is grounded in the provided documents.

## Style Rules

* Korean only; professional, neutral tone; short, clear sentences.
* Prefer bullet points over paragraphs.
* Avoid redundant restatement.
* No external knowledge, no web results, no personal opinions.

## Failure Mode Handling

* If the documents do **not** answer the question:

  * Write “주어진 문서만으로는 질문에 완전한 답을 내리기 어렵습니다.”
  * Summarize what **is** known (with citations).
  * List the **exact** additional data needed.

---

### Response Template (fill in Korean)

**한줄 요약:** <1 sentence>

**핵심 답변:**

* <fact/step/decision 1> 【<Doc/ID[, Lx–Ly]>】
* <fact/step/decision 2> 【<Doc/ID[, Lx–Ly]>】
* <…>

**근거 및 인용:**

* <claim or data point> 【<Doc/ID[, Lx–Ly]>】
* <…>

**표/차트 요약 (해당 시):**

* <variable & trend> — <representative value(s) + unit> 【<Doc/ID[, page/line]>】
* <…>

**불확실/한계:**

* <gap/ambiguity and what would resolve it>
""")


def retrieve_documents(state: RAGState) -> dict:
    """Retrieval step: fetch reranked documents for the question."""
    documents = compression_retriever.invoke(state["question"])

    # Return only the keys this node updates; LangGraph merges them into the state
    return {"documents": documents}


def generate_answer(state: RAGState) -> dict:
    """Generation step: answer the question from the retrieved documents."""
    context = "\n\n".join(doc.page_content for doc in state["documents"])

    # format_messages() yields chat messages; format() would flatten them into one "Human: ..." string
    messages = rag_prompt.format_messages(context=context, question=state["question"])
    answer = llm.invoke(messages)

    return {"answer": answer}


# Build the LangGraph workflow
workflow = StateGraph(RAGState)

# Add nodes
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("generate", generate_answer)

# Add edges
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)

# Compile the graph
rag_app = workflow.compile()
rag_app
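
Each node returns a dictionary that is merged into the shared state before the next node runs. The toy runner below mimics that flow for a linear `retrieve → generate` graph. It is a simplified model of the idea, not LangGraph's implementation, and `run_linear`, `fake_retrieve`, and `fake_generate` are hypothetical names.

```python
from typing import Any, Callable

State = dict[str, Any]
Node = Callable[[State], State]


def run_linear(nodes: list[Node], initial: State) -> State:
    """Run nodes in order, merging each node's returned keys into the state."""
    state = dict(initial)
    for node in nodes:
        state.update(node(state))  # merge the node's partial update
    return state


def fake_retrieve(state: State) -> State:
    return {"documents": [f"doc about {state['question']}"]}


def fake_generate(state: State) -> State:
    return {"answer": f"Answer based on {len(state['documents'])} document(s)."}


final = run_linear(
    [fake_retrieve, fake_generate],
    {"question": "bid price", "documents": [], "answer": None},
)
print(final["answer"])  # Answer based on 1 document(s).
```

Because updates are merged, a node only needs to return the keys it changes; untouched keys such as `question` carry through.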

In [None]:
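# The question is kept in Korean to match the source documents: "How is the bid price evaluated?"
question = "/no_think 입찰가격 평가는 어떻게 이뤄지나요?"
initial_state = {"question": question, "documents": [], "answer": None}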

for chunk, metadata in rag_app.stream(initial_state, stream_mode="messages"):
    if chunk.content:
        print(chunk.content, end="", flush=True)