### Prerequisites

#### (1) uv add (terminal)

```bash
uv add pdfplumber sentence-transformers faiss-cpu numpy torch python-dotenv transformers accelerate
```

#### (2) Set up the .env file
```bash
HF_TOKEN=""
```

#### (3) Set up the PDF files
Place the 100 PDF files in `data/raw/files`.  
When testing performance improvements, do not index the full set; uncomment and run the cell that indexes only 5 documents instead.

In [1]:
import pdfplumber
import os
from pathlib import Path
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import json

from preprocess import pp_v4 as pp
from preprocess.pp_v4 import ALL_DATA



In [2]:
# Select the device automatically (CUDA / MPS / CPU)
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Device in use: {device}")

Device in use: mps


In [3]:
from dotenv import load_dotenv

load_dotenv()

True
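
`load_dotenv()` returning `True` only indicates that a `.env` file was found and loaded, not that `HF_TOKEN` holds a usable value. Since `meta-llama/Meta-Llama-3.1-8B-Instruct` is a gated model, it can help to fail early when the token is empty. A minimal sketch, assuming a hypothetical `require_env` helper that is not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error text; adjust to the project's conventions
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return value
```

Calling `require_env("HF_TOKEN")` right after `load_dotenv()` would surface a missing token before the multi-gigabyte model download starts.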

In [4]:
BASE_DIR = Path.cwd().parent  # /codeit-part3-team4
RAW_FOLDER = BASE_DIR / "data/raw/files"

# PDF to text
def extract_text(pdf_path: Path | str) -> str:
    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            texts.append(page.extract_text() or "")
    return "\n".join(texts)

# Get the list of PDFs in a folder
def get_pdf_paths(folder_path: Path | str) -> list[Path]:
    folder = Path(folder_path)
    pdf_paths = [p for p in folder.glob("*.pdf")]
    return sorted(pdf_paths)

In [5]:
# Chunking
def chunk(text: str, size: int = 800) -> list[str]:
    return [text[i:i+size] for i in range(0, len(text), size)]
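
The fixed-size splitter above cuts at hard character boundaries, so a sentence that straddles two chunks ends up split between them. One common mitigation is to let consecutive chunks overlap. A minimal sketch, assuming a hypothetical `chunk_with_overlap` helper and an illustrative `overlap` value, neither of which appears in the original notebook:

```python
def chunk_with_overlap(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    return [text[i:i + size] for i in range(0, len(text), step)]
```

With `overlap=0` this reduces to the original `chunk` function. Larger overlaps produce more chunks per document, which increases embedding time and index size.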

In [6]:
# Build embeddings and the index

# Korean embedding model
embed_model_name = "nlpai-lab/KoE5"
embed_model = SentenceTransformer(embed_model_name, device=device)

def build_index(chunks: list[str]):
    embs = embed_model.encode(chunks, convert_to_numpy=True, show_progress_bar=False)
    index = faiss.IndexFlatL2(embs.shape[1])
    index.add(embs.astype("float32"))
    return index, chunks

Loading weights: 100%|██████████| 391/391 [00:00<00:00, 1861.00it/s, Materializing param=pooler.dense.weight]                               
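
`faiss.IndexFlatL2` performs an exact, exhaustive search ranked by squared Euclidean distance. The idea can be illustrated without FAISS. The following is a pure-Python sketch for intuition only; `flat_l2_search` is a hypothetical name and is far slower than the real index:

```python
def flat_l2_search(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return the k nearest vectors as (squared L2 distance, position) pairs."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    return scored[:k]
```

Because every stored vector is compared against the query, the results are exact, at the cost of search time that grows linearly with the number of chunks.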


In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

print("Loading model")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True
)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id
)

Loading model


`torch_dtype` is deprecated! Use `dtype` instead!
Fetching 4 files: 100%|██████████| 4/4 [03:27<00:00, 51.79s/it] 
Loading weights: 100%|██████████| 291/291 [00:25<00:00, 11.24it/s, Materializing param=model.norm.weight]                              
Some parameters are on the meta device because they were offloaded to the disk.
Passing `generation_config` together with generation-related arguments=({'pad_token_id'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


In [8]:
# Collect the PDFs from the data folder
docs = get_pdf_paths(RAW_FOLDER)
print(f"PDFs found: {len(docs)}")
for i, doc in enumerate(docs):
    print(f"{i}: {doc.name}")

PDFs found: 100
0: (사)벤처기업협회_2024년 벤처확인종합관리시스템 기능 고도화 용역사업 .pdf
1: (사)부산국제영화제_2024년 BIFF & ACFM 온라인서비스 재개발 및 행사지원시.pdf
2: (사）한국대학스포츠협의회_KUSF 체육특기자 경기기록 관리시스템 개발.pdf
3: (재)예술경영지원센터_통합 정보시스템 구축 사전 컨설팅.pdf
4: 2025 구미 아시아육상경기선수권대회 조직위원회_2025 구미아시아육상경.pdf
5: BioIN_의료기기산업 종합정보시스템(정보관리기관) 기능개선 사업(2차).pdf
6: KOICA 전자조달_[긴급] [지문] [국제] 우즈베키스탄 열린 의정활동 상하원 .pdf
7: 경기도 안양시_호계체육관 배드민턴장 및 탁구장 예약시스템 구축 용역.pdf
8: 경기도 평택시_2024년도 평택시 버스정보시스템(BIS) 구축사업.pdf
9: 경기도사회서비스원_2024년 통합사회정보시스템 운영지원.pdf
10: 경상북도 봉화군_봉화군 재난통합관리시스템 고도화 사업(협상)(긴급).pdf
11: 경희대학교_[입찰공고] 산학협력단 정보시스

In [9]:
# Build and store an index for every document
doc_indexes = {}
for doc_path in docs:
    print(f"Processing: {doc_path.name}")
    chunks = pp.chunk_from_alldata(doc_path.name, ALL_DATA)

    # Fall back to extracting and chunking the PDF when no preprocessed data exists
    if chunks is None:
        text = pp.clean_text(extract_text(doc_path))
        chunks = pp.chunk(text)

    index, chunks_list = build_index(chunks)
    doc_indexes[doc_path] = (index, chunks_list)
print("All documents indexed")

Processing: (사)벤처기업협회_2024년 벤처확인종합관리시스템 기능 고도화 용역사업 .pdf
Processing: (사)부산국제영화제_2024년 BIFF & ACFM 온라인서비스 재개발 및 행사지원시.pdf
Processing: (사）한국대학스포츠협의회_KUSF 체육특기자 경기기록 관리시스템 개발.pdf
Processing: (재)예술경영지원센터_통합 정보시스템 구축 사전 컨설팅.pdf
Processing: 2025 구미 아시아육상경기선수권대회 조직위원회_2025 구미아시아육상경.pdf
Processing: BioIN_의료기기산업 종합정보시스템(정보관리기관) 기능개선 사업(2차).pdf
Processing: KOICA 전자조달_[긴급] [지문] [국제] 우즈베키스탄 열린 의정활동 상하원 .pdf
Processing: 경기도 안양시_호계체육관 배드민턴장 및 탁구장 예약시스템 구축 용역.pdf
Processing: 경기도 평택시_2024년도 평택시 버스정보시스템(BIS) 구축사업.pdf
Processing: 경기도사회서비스원_2024년 통합사회정보시스템 운영지원.pdf
Processing: 경상북도 봉화군_봉화군 재난통합관리시스템 고도화 사업(협상)(긴급).pdf
Processing: 경희대학교_[입찰공고] 산하

In [10]:
# # Build and store indexes for only 5 of the documents
# doc_indexes = {}
# for doc_path in docs[:5]:
#     print(f"Processing: {doc_path.name}")
#     text = extract_text(doc_path)
#     chunks = chunk(text)
#     index, chunks_list = build_index(chunks)
#     doc_indexes[doc_path] = (index, chunks_list)
# print("5 documents indexed")

In [11]:
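# Question list (key, question). The questions stay in Korean to match the source documents.
queries = [
    ("project_name", "사업(용역)명은 무엇인가?"),  # What is the project (service) name?
    ("agency", "발주 기관(수요기관)은 어디인가?"),  # Which agency is ordering the work?
    ("purpose", "사업 목적(추진 배경)은 무엇인가?"),  # What is the project's purpose (background)?
    ("budget", "총 사업 예산(사업비)은 얼마인가?"),  # What is the total project budget?
    ("contract_type", "계약 방식(일반경쟁/제한경쟁/협상에 의한 계약 등)은 무엇인가?"),  # What is the contract method?
    ("deadline", "입찰/제안서 제출 마감일시는 언제인가?"),  # When is the bid/proposal submission deadline?
    ("duration", "사업 수행 기간은 얼마나 되는가?"),  # How long is the project period?
    ("requirements_must", "필수 요구사항(기능/성능/보안 등)은 무엇인가?"),  # What are the mandatory requirements?
    ("eval_items", "평가 항목(기술/가격 등) 구성은 어떻게 되는가?"),  # How are the evaluation items composed?
    ("price_eval", "가격 평가 방식(최저가/협상 등)은 무엇인가?"),  # What is the price evaluation method?
    ("eligibility", "참가 자격 요건(면허/실적/인증/등급)은 무엇인가?"),  # What are the eligibility requirements?
]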

In [12]:
def build_query_prompt(queries):
    return "\n".join(
        f"{i}. {q}"
        for i, (_, q) in enumerate(queries, start=1)
    )
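
To make the numbering concrete, here is a small usage example. It repeats the function so that it runs on its own and takes two entries from the question list; `sample_queries` and `prompt_text` are names introduced only for this illustration:

```python
def build_query_prompt(queries):
    # Same numbering logic as the cell above
    return "\n".join(
        f"{i}. {q}"
        for i, (_, q) in enumerate(queries, start=1)
    )


sample_queries = [
    ("project_name", "사업(용역)명은 무엇인가?"),
    ("agency", "발주 기관(수요기관)은 어디인가?"),
]

prompt_text = build_query_prompt(sample_queries)
print(prompt_text)
# 1. 사업(용역)명은 무엇인가?
# 2. 발주 기관(수요기관)은 어디인가?
```

The keys are dropped here on purpose: only the question text goes into the prompt, and the keys are used later to label the answers.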

In [13]:
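# Prompt for RFP analysis. It stays in Korean because the source documents are Korean;
# "명시 없음" means "not specified".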
RFP_PROMPT = """
너는 정부·공공기관 제안요청서(RFP)를 분석하는 전문가다.
아래 컨텍스트는 하나의 정부 RFP 문서에서 추출된 내용이다.

[분석 규칙]
- 추측 금지, 문서에 명시된 내용만 사용
- 문서에 없으면 반드시 "명시 없음"
- **출력은 질문 개수와 동일한 줄 수**
- **각 줄에는 답변 텍스트만 작성**
- 번호, 질문 문장, '답:', 기호, 설명을 절대 포함하지 말 것

[질문 목록]
{questions}

[컨텍스트]
{context}

질문 순서대로 답변만 한 줄씩 출력하라.
"""

In [14]:
import re

def answer(index, embed_model, chunks, queries, top_k: int = 15) -> dict:
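    # 1. Representative query used for retrieval (Korean, to match the documents)
    search_query = "RFP 주요 사업 정보 요약"

    q_emb = embed_model.encode([search_query], convert_to_numpy=True)
    _, I = index.search(q_emb.astype("float32"), top_k)

    # FAISS pads results with -1 when the index holds fewer than top_k vectors; skip those
    context = "\n\n".join(chunks[i] for i in I[0] if i >= 0)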

    questions_text = build_query_prompt(queries)
    prompt = RFP_PROMPT.format(
        context=context,
        questions=questions_text
    )

    # 2. Call the Hugging Face generator
    # Note: with do_sample=False decoding is greedy, so temperature has no effect
    output = generator(
        prompt,
        do_sample=False,
        temperature=0.0,
        repetition_penalty=1.2,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )[0]["generated_text"]

    # 3. Strip the prompt from the generated text
    raw_answer = output[len(prompt):].strip()

    # 4. Clean up unwanted text
    cleaned_answers = []
    for line in raw_answer.splitlines():
        line = line.strip()
        if not line:
            continue

        # Remove the Unicode replacement character (U+FFFD) left by broken decoding
        line = line.replace("\ufffd", "")

        # Drop lines that merely repeat a question, e.g. "1. <question>?"
        if re.match(r"^\d+\.\s+.*\?$", line):
            continue

        # Strip a leading "답:" ("Answer:") label, question number, or arrow (->, →).
        # Two passes handle combinations such as "1. 답: ...".
        for _ in range(2):
            line = re.sub(r"^답\s*[:：]\s*", "", line)
            line = re.sub(r"^\d+\.\s*", "", line)
            line = re.sub(r"^(->|→)\s*", "", line)

        # Strip a surrounding quote character
        line = re.sub(r"^['\"]|['\"]$", "", line)

        # Normalize "없음" ("none") and empty lines to "명시 없음" ("not specified")
        if line in ["없음", ""]:
            line = "명시 없음"
        if re.fullmatch(r"['\"]?명시\s*없음['\"]?", line):
            line = "명시 없음"

        cleaned_answers.append(line)

    # 5. Map each key to its answer
    results = {}
    for (key, _), value in zip(queries, cleaned_answers):
        results[key] = value if value else "명시 없음"

    # 6. Fill in the remaining keys when the model returned too few answers
    for key, _ in queries[len(cleaned_answers):]:
        results[key] = "명시 없음"

    return results

In [15]:
# Test: single document
if docs:
    test_doc = docs[0]
    print(f"\n=== Analyzing {test_doc.name} ===")

    index, chunks = doc_indexes[test_doc]
    result = answer(index, embed_model, chunks, queries)

    for k, v in result.items():
        print(f"{k}: {v}")


=== Analyzing (사)벤처기업협회_2024년 벤처확인종합관리시스템 기능 고도화 용역사업 .pdf ===


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Passing `generation_config` together with generation-related arguments=({'pad_token_id', 'eos_token_id', 'do_sample', 'repetition_penalty', 'temperature'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


RuntimeError: MPS backend out of memory (MPS allocated: 10.61 GiB, other allocations: 14.61 GiB, max allowed: 20.13 GiB). Tried to allocate 60.91 MiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
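
The out-of-memory error above is raised during generation. The model weights account for most of the memory, but prompt length also matters, because `answer` joins up to `top_k=15` chunks into a single context. One option is to cap the context by a character budget before building the prompt. A minimal sketch, assuming a hypothetical `limit_context` helper and an illustrative `max_chars` value; whether this alone avoids the error on a given machine is not established here:

```python
def limit_context(chunks: list[str], max_chars: int = 6000, sep: str = "\n\n") -> str:
    """Join chunks in ranked order, stopping before the total exceeds max_chars."""
    selected: list[str] = []
    used = 0
    for chunk_text in chunks:
        # Count the separator only when something has already been selected
        extra = len(chunk_text) + (len(sep) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk_text)
        used += extra
    return sep.join(selected)
```

Because the chunks arrive in retrieval order, the cut removes the lowest-ranked ones first. Lowering `top_k` has a similar effect but does not account for chunks of uneven length.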

In [17]:
import gc
def clean_cache():
    gc.collect()
    if torch.backends.mps.is_available() and torch.backends.mps.is_built():
        torch.mps.empty_cache()
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()


In [18]:
import json

results = {}

clean_cache()

for doc in docs:
    clean_cache()
    print(f"\n=== Analyzing {doc.name} ===")

    if doc not in doc_indexes:
        print(f"[SKIP] No index: {doc.name}")
        continue

    index, chunks = doc_indexes[doc]

    result = answer(
        index=index,
        embed_model=embed_model,
        chunks=chunks,
        queries=queries
    )

    results[doc.name] = result

    for k, v in result.items():
        print(f"{k}: {v}")

# Save as JSON
output_path = BASE_DIR / "rfp_hf_answer_results_v2.4.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"\nAll documents analyzed. Saved to: {output_path}")


=== Analyzing (사)벤처기업협회_2024년 벤처확인종합관리시스템 기능 고도화 용역사업 .pdf ===


Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


RuntimeError: MPS backend out of memory (MPS allocated: 9.61 GiB, other allocations: 14.61 GiB, max allowed: 20.13 GiB). Tried to allocate 60.91 MiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
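
The batch loop writes the JSON file only after every document has been processed, so a failure partway through (such as the error above) leaves nothing on disk. A minimal sketch of saving after each document and resuming on the next run, assuming a hypothetical `run_with_checkpoints` helper in which `analyze` stands in for the notebook's `answer` call:

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    doc_names: list[str],
    analyze: Callable[[str], dict],
    output_path: Path,
) -> dict:
    """Analyze documents one by one, rewriting the JSON file after each success."""
    results: dict = {}
    if output_path.exists():
        # Resume from an earlier partial run
        results = json.loads(output_path.read_text(encoding="utf-8"))
    for name in doc_names:
        if name in results:
            continue  # already analyzed in a previous run
        results[name] = analyze(name)
        output_path.write_text(
            json.dumps(results, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    return results
```

Rewriting the whole file each time is simple and cheap for a hundred small records. For much larger runs, appending one JSON object per line would avoid the repeated rewrite.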