# Question Answering with Retrieval

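The task in this competition is to develop a question-answering algorithm that improves the **search capability** for central government fiscal information and makes that information more useful. <br>The goal is to make large volumes of fiscal data easy to access and use for both the general public and experts. <br><br>
The baseline uses only the evaluation dataset: it builds a vector DB for each source PDF, then runs inference through a RAG process using the langchain library and the llama-2-ko-7b model. <br>(The baseline does not include training on train_set; it only runs inference on test_set.) <br>The code in this notebook departs from that baseline in places: it loads `rtzr/ko-gemma-2-9b-it` and includes an SFT fine-tuning step.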

# Download Library

In [None]:
"""
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install transformers[torch] -U

!pip install datasets
!pip install langchain
!pip install langchain_community
!pip install PyMuPDF
!pip install sentence-transformers
!pip install faiss-gpu
!pip install trl peft wandb
"""

# Import Library

In [2]:
import os
import unicodedata

import torch
import pandas as pd
from tqdm import tqdm
import fitz  # PyMuPDF
import pickle

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
    TrainingArguments,
    TrainerCallback
)
from accelerate import Accelerator

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset, Dataset
import wandb

# LangChain imports
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate 
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from peft import LoraConfig, get_peft_model



# Vector DB

In [1]:
def process_pdf(file_path, chunk_size=800, chunk_overlap=50):
    """Extract the text of a PDF and split it into chunks."""
    # Open the PDF and concatenate the text of every page
    with fitz.open(file_path) as doc:
        text = ''.join(page.get_text() for page in doc)
    # Split the text into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunk_temp = splitter.split_text(text)
    # Wrap each chunk in a Document object
    chunks = [Document(page_content=t) for t in chunk_temp]
    return chunks


def create_vector_db(chunks, model_path="intfloat/multilingual-e5-base"):
    """Build a FAISS DB from the chunks."""
    # Configure the embedding model
    model_kwargs = {'device': 'cuda'}
    encode_kwargs = {'normalize_embeddings': True}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_path,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Build and return the FAISS DB
    db = FAISS.from_documents(chunks, embedding=embeddings)
    return db

def normalize_path(path):
    """Apply Unicode NFC normalization to a path."""
    return unicodedata.normalize('NFC', path)


def process_pdfs_from_dataframe(df, base_directory):
    """Build a dict keyed by PDF title that stores each PDF's DB and retriever."""
    pdf_databases = {}
    unique_paths = df['Source_path'].unique()
    
    for path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and resolve it against the base directory.
        # os.path.join returns an absolute second argument unchanged, and normpath
        # removes a leading "./", so no manual stripping is needed.
        normalized_path = normalize_path(path)
        full_path = os.path.normpath(os.path.join(base_directory, normalized_path))
        
        pdf_title = os.path.splitext(os.path.basename(full_path))[0]
        print(f"Processing {pdf_title}...")
        
        # Process the PDF and build its vector DB
        chunks = process_pdf(full_path)
        db = create_vector_db(chunks)
        
        # Create the retriever
        retriever = db.as_retriever(search_type="mmr", 
                                    search_kwargs={'k': 3, 'fetch_k': 8})
        
        # Store the result
        pdf_databases[pdf_title] = {
                'db': db,
                'retriever': retriever
        }
    return pdf_databases
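
As a rough illustration of what `chunk_size=800` and `chunk_overlap=50` mean, the sketch below splits a string into fixed-size character windows that share an overlap. It is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as paragraph and line breaks. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: RecursiveCharacterTextSplitter also
    tries to break on separators, which this sketch does not do.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Made-up sample: 2000 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.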


# Build the DB

In [3]:
base_directory = './' # Your Base Directory
df = pd.read_csv('./test.csv')
pdf_databases = process_pdfs_from_dataframe(df, base_directory)
pickle_file_path = os.path.join(base_directory, 'pdf_databases_e5_base.pickle')
with open(pickle_file_path, 'wb') as f:
    pickle.dump(pdf_databases, f)

Processing PDFs:   0%|          | 0/9 [00:00<?, ?it/s]

Processing 중소벤처기업부_혁신창업사업화자금(융자)...


Processing PDFs:  11%|█         | 1/9 [00:27<03:37, 27.21s/it]

Processing 보건복지부_부모급여(영아수당) 지원...


Processing PDFs:  22%|██▏       | 2/9 [00:30<01:33, 13.32s/it]

Processing 보건복지부_노인장기요양보험 사업운영...


Processing PDFs:  33%|███▎      | 3/9 [00:34<00:52,  8.76s/it]

Processing 산업통상자원부_에너지바우처...


Processing PDFs:  44%|████▍     | 4/9 [00:38<00:34,  6.83s/it]

Processing 국토교통부_행복주택출자...


Processing PDFs:  56%|█████▌    | 5/9 [00:41<00:23,  5.78s/it]

Processing 「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》...


Processing PDFs:  67%|██████▋   | 6/9 [00:45<00:15,  5.19s/it]

Processing 「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》...


Processing PDFs:  78%|███████▊  | 7/9 [00:49<00:09,  4.68s/it]

Processing 「FIS 이슈&포커스」 22-2호 《재정성과관리제도》...


Processing PDFs:  89%|████████▉ | 8/9 [00:53<00:04,  4.51s/it]

Processing 「FIS 이슈 & 포커스」(신규) 통권 제1호 《우발부채》...


Processing PDFs: 100%|██████████| 9/9 [00:57<00:00,  6.37s/it]
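
The retriever built in `process_pdfs_from_dataframe` uses `search_type="mmr"` with `k=3` and `fetch_k=8`: the 8 most similar chunks are fetched, and 3 of them are then chosen so that each pick is relevant to the query but not redundant with earlier picks. The following is a minimal pure-Python sketch of that greedy selection, not LangChain's implementation. The name `mmr_select`, the toy 2-D vectors, and `lambda_mult=0.5` (assumed here to match the library default) are for illustration only.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query_vec, candidate_vecs, k=3, lambda_mult=0.5):
    """Greedy maximal marginal relevance over unit-length vectors.

    Returns the indices of the selected candidates in selection order.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in remaining:
            relevance = dot(query_vec, candidate_vecs[i])
            # Penalize similarity to anything that was already picked
            redundancy = max(
                (dot(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = (1.0, 0.0)
candidates = [
    (0.96, 0.28),   # most relevant
    (0.96, 0.28),   # exact duplicate of the first candidate
    (0.8, -0.6),    # less relevant, but different from the first
    (0.6, 0.8),     # least relevant
]
print(mmr_select(query, candidates, k=3))                   # duplicate is picked last
print(mmr_select(query, candidates, k=3, lambda_mult=1.0))  # pure relevance ordering
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, which is why the duplicate moves back up to second place.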


In [2]:
base_directory = './' # Your Base Directory
df = pd.read_csv('./test.csv')
with open('pdf_databases_cpu.pickle', 'rb') as f:
    pdf_databases = pickle.load(f)



In [None]:
wandb.init(project="search_competition", entity="tjwjddn980117")

# MODEL Import

In [None]:
import pandas as pd
import unicodedata
from tqdm import tqdm
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM, pipeline
from datasets import Dataset, DatasetDict
from trl import SFTTrainer
from peft import LoraConfig
import wandb

# Initialize 

# Read the CSV files
train_df = pd.read_csv('stf_e5_base_train.csv')
eval_df = pd.read_csv('stf_e5_base_eval.csv')

# Formatting function: builds a (prompt, target) pair for each row
def formatting_prompts_func(example):
    input_texts = []
    target_texts = []
    for i in range(len(example)):
        input_text = f"""다음 정보를 바탕으로 질문에 답하세요:
{example['Context'][i]}
질문: {example['Question'][i]}
답변: 
"""
        target_text = example['Answer'][i]
        input_texts.append(input_text)
        target_texts.append(target_text)
    return input_texts, target_texts

# Format the train and eval data
train_inputs, train_targets = formatting_prompts_func(train_df)
eval_inputs, eval_targets = formatting_prompts_func(eval_df)

# Collect the formatted texts into DataFrames for easier handling
train_df = pd.DataFrame({'input_text': train_inputs, 'target_text': train_targets})
eval_df = pd.DataFrame({'input_text': eval_inputs, 'target_text': eval_targets})

# Print a few rows of each DataFrame
print("Train DataFrame example:")
print(train_df.head())

print("\nEval DataFrame example:")
print(eval_df.head())

In [4]:
# The mapping lives in peft.utils; importing it from the top-level peft package raises ImportError
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

# The mapping is keyed by architecture type (e.g. "gemma2"), not by Hub model ID
model_type = "gemma2"
default_target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model_type, None)

print(f"Default target modules for {model_type}: {default_target_modules}")

In [2]:
def setup_llm_SFTTrainer():
    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    # Model ID
    #model_id = "beomi/llama-2-ko-7b"
    model_id = "rtzr/ko-gemma-2-9b-it"

    # Load and configure the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.use_default_system_prompt = False

    # Load the model with the quantization config applied
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    
    # Load LoRA configuration
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        target_modules=[
        "model.embed_tokens", # OK
        #"model.layers.0.input_layernorm", # X
        #"model.layers.0.post_attention_layernorm", # X
        #"model.layers.0.pre_feedforward_layernorm", # X
        #"model.layers.0.post_feedforward_layernorm", # X
        #"model.norm" # X
        ],
        task_type="CAUSAL_LM",
    )        
    #for name, param in model.named_parameters():
    #    print(name, param.requires_grad)
    
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=5,
        logging_dir='./logs',
        logging_steps=1000,
        save_steps=1000,
        evaluation_strategy="steps"
    )

    #dataset
    train_dataset = load_dataset('csv', data_files='stf_e5_base_train.csv')['train']
    eval_dataset = load_dataset('csv', data_files='stf_e5_base_eval.csv')['train']  
    
#    def formatting_prompts_func(example):
#        output_texts = []
#         for i in range(len(example)):
#             text = template = """다음 정보를 바탕으로 질문에 답하세요. 답변은 꼭 문장으로 하세요. 주어를 꼭 적으세요. :
# {example[Context]}
# 
# 질문: {example[Question]}
# 
# 답변: {example[Answer]}
# """
#             output_texts.append(text)
#         return output_texts
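    def formatting_prompts_func(example):
        # `example` is a batch: a dict mapping each column name to a list of values,
        # so iterate over the rows of the Answer column rather than len(example)
        output_texts = []
        for answer in example['Answer']:
            text = f"{answer}"
            output_texts.append(text)
        return output_texts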
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        peft_config = peft_config,
        formatting_func=formatting_prompts_func,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,   
    )

    # Train model
    trainer.train()
    # Save the model
    # trainer.save_model("./results")
    # # Apply quantization to the fine-tuned model
    # trainer.model = AutoModelForCausalLM.from_pretrained(
    #     "./results",  # path where the fine-tuned model was saved
    #     quantization_config=bnb_config
    # )
# 
    # # Save the model (optional)
    # #model_quantized.save_pretrained("./quantized_model")

    finetuned_model = "gemma_ko_9b_ver1.01"
    # Save trained model
    trainer.model.save_pretrained(finetuned_model)
    tokenizer.save_pretrained(finetuned_model)

    text_generation_pipeline = pipeline(
        model=trainer.model,
        tokenizer=tokenizer,
        task="text-generation",
        temperature=0.2,
        return_full_text=False,
        max_new_tokens=450, 
    )

    hf = HuggingFacePipeline(pipeline=text_generation_pipeline)
    return hf
    # return text_generation_pipeline
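
A formatting function of this kind has two easy pitfalls. In TRL versions that apply the formatting function to batches, the argument is a dict mapping column names to lists, so `len(example)` counts columns rather than rows; and a plain triple-quoted string such as `"""{example[Answer]}"""` is not interpolated without an `f` prefix. The sketch below reproduces both effects with a plain dict. The batch contents and the function names are made up for illustration.

```python
# A hypothetical batch shaped like what a batched dataset.map() passes in:
# one key per column, one list entry per row.
batch = {
    "Question": ["q1", "q2", "q3", "q4"],
    "Context": ["c1", "c2", "c3", "c4"],
    "Answer": ["a1", "a2", "a3", "a4"],
}


def format_by_len_of_batch(example):
    """Pitfall version: iterates len(example) and omits the f-prefix."""
    output_texts = []
    for i in range(len(example)):       # 3 columns, not 4 rows
        text = """{example[Answer]}"""  # no f-prefix: stays a literal string
        output_texts.append(text)
    return output_texts


def format_by_rows(example):
    """Iterates over the rows of one column and interpolates each value."""
    return [f"{answer}" for answer in example["Answer"]]


print(format_by_len_of_batch(batch))
print(format_by_rows(batch))
```

The first function yields one identical placeholder string per column, so a model trained on its output never sees the actual answers.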

In [3]:
def setup_llm_SFTTrainer_with_finetuning():
    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    # Path to the fine-tuned model
    finetuned_model = "gemma_ko_9b_ver1.01"
    
    # Load the model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(finetuned_model, 
                                                 quantization_config=bnb_config,
                                                 trust_remote_code=True,
                                                 device_map="auto",)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model)

    text_generation_pipeline = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        temperature=0.2,
        return_full_text=False,
        max_new_tokens=450, 
    )

    hf = HuggingFacePipeline(pipeline=text_generation_pipeline)
    return hf
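
To see why 4-bit loading is used here, a back-of-the-envelope estimate of weight memory helps. The sketch assumes roughly 9 billion parameters (taken from the "9b" in the model name) and counts only the weights at a uniform bit width. Quantization metadata, layers kept in higher precision, activations, and the KV cache are all ignored, so real usage will be higher.

```python
def approx_weight_gib(num_params, bits_per_param):
    """Rough weight footprint in GiB for a uniform bit width."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 9e9  # assumption: "9b" in the model name, rounded

for bits in (16, 4):
    print(f"{bits:>2}-bit: ~{approx_weight_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what lets a model of this size fit on a single consumer GPU.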

In [4]:
llm = setup_llm_SFTTrainer_with_finetuning()

Loading checkpoint shards: 100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


In [3]:
llm = setup_llm_SFTTrainer()

Loading checkpoint shards: 100%|██████████| 10/10 [00:09<00:00,  1.07it/s]
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: tjwjddn15584 (tjwjddn980117). Use `wandb login --relogin` to force relogin


  0%|          | 0/10 [00:00<?, ?it/s]It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
100%|██████████| 10/10 [00:05<00:00,  1.72it/s]


{'train_runtime': 8.5773, 'train_samples_per_second': 1.749, 'train_steps_per_second': 1.166, 'train_loss': 10.00172119140625, 'epoch': 5.0}


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM'

# Inference with LangChain

In [7]:
def normalize_string(s):
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize('NFC', s)

def format_docs(docs):
    """Join the retrieved documents into a single string."""
    context = ""
    for doc in docs:
        context += doc.page_content
        context += '\n'
    return context

# List that collects the results
results = []

# Normalize the dictionary keys once so they match the normalized source names
normalized_keys = {normalize_string(k): v for k, v in pdf_databases.items()}

# Prompt for the RAG chain (kept in Korean, the language of the data)
template = """
    다음 정보를 바탕으로 질문에 답하세요:
    {context}

    질문: {question}

    답변:
    """
prompt = PromptTemplate.from_template(template)

# Process each row of the DataFrame
for _, row in tqdm(df.iterrows(), total=len(df), desc="Answering Questions"):
    # Normalize the source name
    source = normalize_string(row['Source'])
    question = row['Question']

    # Look up the retriever for this source
    retriever = normalized_keys[source]['retriever']

    # Define the RAG chain
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Generate the answer
    #print(f"Question: {question}")
    full_response = rag_chain.invoke(question)

    #print(f"Answer: {full_response}\n")

    # Store the result
    results.append({
        "Source": row['Source'],
        "Source_path": row['Source_path'],
        "Question": question,
        "Answer": full_response
    })
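
The normalization step matters because the same Korean name can be encoded either as precomposed syllables (NFC) or as decomposed jamo (NFD, which some file systems produce). The two forms look identical on screen but compare unequal, so a dictionary lookup fails. A minimal illustration using one of the source titles from this dataset, with a placeholder string standing in for the real retriever:

```python
import unicodedata

title = "국토교통부_행복주택출자"
nfc = unicodedata.normalize("NFC", title)
nfd = unicodedata.normalize("NFD", title)

print(nfc == nfd)            # False: different code point sequences
print(len(nfc), len(nfd))    # the NFD form is longer

# A dict keyed by the NFD form cannot be queried with the NFC form...
databases = {nfd: "retriever placeholder"}
print(nfc in databases)

# ...unless both sides are normalized to the same form first
normalized = {unicodedata.normalize("NFC", k): v for k, v in databases.items()}
print(nfc in normalized)
```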



In [5]:

def normalize_string(s):
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize('NFC', s)

def format_docs(docs):
    """Join the retrieved documents into a single string."""
    context = ""
    for doc in docs:
        context += doc.page_content
        context += '\n'
    return context

# List that collects the results
results = []

# Read the CSV file (contexts were retrieved in advance)
df = pd.read_csv('stf_e5_base_test.csv')

# Prompt for the chain (kept in Korean, the language of the data)
template = """다음 정보를 바탕으로 질문에 답하세요. 답변은 꼭 문장으로 하세요. 주어를 꼭 적으세요. :
    {context}

    질문: {question}

    답변:
    """
prompt = PromptTemplate.from_template(template)

# Define the chain: the context comes from the CSV, so no retriever is needed
rag_chain = ( 
    prompt
    | llm
    | StrOutputParser()
)

# Process each row of the DataFrame
for _, row in tqdm(df.iterrows(), total=len(df), desc="Answering Questions"):
    # Read the pre-retrieved context and the question
    context = row['Context']
    question = row['Question']

    # Generate the answer
    # print(f"Question: {question}")
    full_response = rag_chain.invoke({"context":context, "question":question})

    # print(f"Answer: {full_response}\n")
    
    # Store the result
    results.append({
        'Question': question,
        'Context': context,
        'Answer': full_response
    })
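
The template lines in the inference cells are indented inside the triple-quoted string, and that indentation becomes part of the prompt sent to the model. Whether this affects answer quality is not something this notebook measures, but the effect is easy to see with plain `str.format`, used here as a stand-in for `PromptTemplate`. `textwrap.dedent` is one way to strip the indentation. The context and question values are made-up placeholders.

```python
import textwrap

template = """
    다음 정보를 바탕으로 질문에 답하세요:
    {context}

    질문: {question}

    답변:
    """

# Formatting the template as written keeps the leading spaces on every line
raw_prompt = template.format(context="(retrieved text)", question="(question)")

# Dedent and strip the template first, then fill in the values
clean_prompt = textwrap.dedent(template).strip().format(
    context="(retrieved text)", question="(question)"
)

print(repr(raw_prompt))
print(repr(clean_prompt))
```

Note that `strip()` also removes the newline after the final `답변:`, which changes where generation starts; keeping or dropping it is a choice to validate against the model's outputs.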


Answering Questions:  10%|█         | 10/98 [01:12<12:28,  8.51s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Answering Questions: 100%|██████████| 98/98 [14:13<00:00,  8.71s/it]


# Submission

In [6]:
# Load the sample submission file
submit_df = pd.read_csv("./sample_submission.csv")

# Add the generated answers to the submission DataFrame
submit_df['Answer'] = [item['Answer'] for item in results]
submit_df['Answer'] = submit_df['Answer'].fillna("데이콘")     # [Caution] an empty (NaN) answer from the model can cause a scoring error

# Save the result as a CSV file
submit_df.to_csv("./gemma_9b_ver1.01_submission.csv", encoding='UTF-8-sig', index=False)
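
The `fillna` call only replaces missing values (NaN/None); an answer that is an empty or whitespace-only string would pass through unchanged. If that case needs handling too, a small helper along these lines could be applied to the answer list before it is assigned to the DataFrame. The helper name `fill_empty_answers` is hypothetical, and the placeholder is the same one used above.

```python
import math


def fill_empty_answers(answers, placeholder="데이콘"):
    """Replace None, NaN, and blank strings with a placeholder."""
    cleaned = []
    for answer in answers:
        is_nan = isinstance(answer, float) and math.isnan(answer)
        if answer is None or is_nan or not str(answer).strip():
            cleaned.append(placeholder)
        else:
            cleaned.append(answer)
    return cleaned


# Made-up answers covering each empty case
print(fill_empty_answers(["답변입니다.", "", "   ", None, float("nan")]))
```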