In [1]:
!nvidia-smi

Tue Oct 15 16:10:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
#path = '/content/drive/MyDrive/DACON/Finance/reprocessed/'
path ='/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/'
base_dir = path # Your Base Directory

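# Overview

## Question - Answering with Retrieval

The task in this competition is to develop a question-answering algorithm that improves the **search capability** for central government fiscal information and makes that information more usable. <br>The goal is to let both the general public and experts easily access and use the large volume of fiscal data. <br><br>
The baseline uses only the evaluation dataset: it builds a Vector DB for each source PDF, then runs inference through a RAG process using the langchain library and the llama-2-ko-7b model. <br>(It does not include a training step that uses train_set; it only runs inference on test_set.)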

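## Mount/Login

Mount Google Drive and log in to Hugging Face.
- The Hugging Face token must have read/write permission for the kdt3 group.

## Download Library
Downloads the required libraries.
Because of version conflicts, the session must be restarted once after installation.
<br>(If the session is fully disconnected, the install and restart have to be repeated.)

## Import Library
Once the session has been restarted, run only the imports; the steps above can be skipped.

## Vector DB
Defines the functions that split documents into chunks and index them in a DB so that relevant chunks can be found by embedding similarity.

## Create DB
Builds the document DBs with the functions defined in Vector DB.<br><br>
Running Train and Test in one go is likely to crash Colab, so it is better to run Train through Create Dataset, upload the result, restart to free RAM, and then run Test.<br> The model used to embed the documents can be passed as an argument.

## Create Dataset
Creates a HuggingFace dataset from the DB built in Create DB and the data dataframe, then uploads it.

## Fine-Tuning
Fine-tunes the model on the training dataset and uploads it to Huggingface.<br>
Fine-tuning uses LoRA with 4-bit quantization.<br>
The base model, the prompt fed to it, and the training hyperparameters can all be changed.

## Inference with Langchain
Runs inference with the model.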

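## How to run
### Basics
Mount/Login -> Download Library -> restart (first time only)
Mount/Login -> Import Library (afterwards)

### Building the dataset
Basics -> Vector DB -> Create DB -> in Create Dataset, the first cell + whichever of the Train/Valid/Test cells applies

### Training the model
Basics -> Fine-Tuning (be sure to check the upload location, dataset location, and model link)

### Inference with the trained model
Basics -> Inference with Langchain (check the model link and dataset location) -> Submission (check the output file name)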

# Mount/Login

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
ls {path}

241008_csv_checker.ipynb           eval/                       sub/          train.csv
combined_train_aug_v3.csv          gemma2_financeQA-finetune/  temp/         train_source/
combined_train_aug_v3_editted.csv  processed/                  test.csv
data/                              sample_submission.csv       test_source/


In [5]:
import os

token_path = os.path.join(base_dir,'data','token')
with open(token_path,'r') as f:
    master_token = f.readline().strip('\n')

In [6]:
from huggingface_hub import login

login(token=master_token, add_to_git_credential=True)

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Download Library

In [7]:
!apt-get install tesseract-ocr
!apt-get install poppler-utils

!pip install orjson==3.10.6

!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install transformers[torch] -U

!pip install datasets
!pip install langchain
!pip install langchain_community
!pip install langchain-teddynote
!pip install PyMuPDF
!pip install sentence-transformers
!pip install faiss-gpu
#!pip install peft
#!pip install trl
!pip install unstructured pdfminer.six
!pip install pillow-heif
#!pip install unstructured_inference
#!pip install unstructured_pytesseract
!pip install pikepdf pypdf

!pip install pymupdf4llm

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Looking in indexes: https://pypi.org/simple/


# Import Library

In [8]:
import os
import unicodedata
import torch
import pandas as pd
from tqdm.auto import tqdm
import fitz  # PyMuPDF

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig
)
from accelerate import Accelerator

## peft
#from peft import prepare_model_for_kbit_training
#from peft import PeftModel
#from peft import LoraConfig, get_peft_model
#
#
## Langchain
#from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
#from langchain.prompts import PromptTemplate
#from langchain.schema.runnable import RunnablePassthrough, RunnableParallel
#from langchain.schema.output_parser import StrOutputParser

# PDF loading / chunking
from langchain.document_loaders.parsers.pdf import PDFPlumberParser
from langchain.document_loaders.pdf import PDFPlumberLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain_teddynote.retrievers import KiwiBM25Retriever
from langchain.retrievers import EnsembleRetriever, MultiQueryRetriever

from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars

#import pdfplumber

import pymupdf4llm
import pymupdf

In [9]:
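# Release GPU memory
import gc, time

def free_cuda():
    """Run garbage collection until nothing more is collected, clearing the CUDA cache on each pass."""
    mem = 1
    while mem > 0:
        time.sleep(0.5)
        mem = gc.collect()  # number of unreachable objects found in this pass
        torch.cuda.empty_cache()
        print("freed : ", mem)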

# Vector DB

In [10]:
from operator import itemgetter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars


# Replace bullet-point glyphs with a plain hyphen
def remove_bulletpoints(text):
    cleaned_text = text
    for symbol in ['ㅇ','-','□', '※', '▸','∙','●','☞','■','','','·']:
        if symbol:  # skip empty entries: str.replace('', '-') would insert '-' between every character
            cleaned_text = cleaned_text.replace(symbol, "-")
    return cleaned_text

def replace_sign_symbol(text):
    cleaned_text = text
    cleaned_text = cleaned_text.replace('△', "-")
    return cleaned_text


# Convert circled-number symbols to plain numbers
def replace_num_symbols_with_number(text):
    cleaned_text = text
    for idx, symbol in enumerate(['①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨', '⑩', '⑪', '⑫', '⑬', '⑭', '⑮']):
        cleaned_text = cleaned_text.replace(symbol, f"{idx+1})")
    return cleaned_text
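
To make the effect of these helpers concrete, here is a small self-contained sketch. The function name `normalize_symbols` and the sample string are illustrative assumptions, and only a subset of the bullet glyphs handled above is included.

```python
# Illustrative sketch only: a compact version of the symbol clean-up idea.
CIRCLED = "①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮"

def normalize_symbols(text: str) -> str:
    # Circled numbers become "1)", "2)", ...
    for idx, symbol in enumerate(CIRCLED):
        text = text.replace(symbol, f"{idx + 1})")
    # A few bullet glyphs become a plain hyphen
    for bullet in ["□", "※", "●", "■"]:
        text = text.replace(bullet, "-")
    # The triangle used as a minus sign in fiscal tables becomes "-"
    return text.replace("△", "-")

print(normalize_symbols("□ 세입 ① 국세 △3.5"))
```

The sample line comes out as `- 세입 1) 국세 -3.5`, which is easier for both BM25 tokenization and the embedding model to handle than the original glyphs.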

In [11]:
def normalize_path(path):
    """Unicode-normalize a path to NFC"""
    return unicodedata.normalize('NFC', path)

def process_path(base_dir,file_path):
  norm_path = normalize_path(file_path)
  if not os.path.isabs(norm_path):
    return os.path.normpath(os.path.join(base_dir, norm_path))
  else : return norm_path

def subpath_list(dir_path):
  return list(map(lambda x : os.path.join(dir_path,x),os.listdir(dir_path)))

def processed_path_matcher(dir_path,file_path):
  """Map an NFC-normalized path back to the actual on-disk path found under dir_path's subdirectories."""
  path_list = list()
  for sub in subpath_list(dir_path):
    # Skip plain files such as train.csv; os.listdir would raise on them
    if os.path.isdir(sub):
      path_list.extend(subpath_list(sub))
  for real_path in path_list:
    if file_path == normalize_path(real_path) : return real_path
  return file_path
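
The reason for normalizing paths is that the same Korean file name can be stored either in composed form (NFC) or in decomposed form (NFD), which can happen for files that originate on some systems such as macOS. A minimal sketch of the mismatch, using one of the source file names; the helper `same_file_name` is a hypothetical name introduced here for illustration.

```python
import unicodedata

name_nfc = unicodedata.normalize("NFC", "재정통계해설.pdf")
name_nfd = unicodedata.normalize("NFD", name_nfc)

# The two strings look identical on screen but contain different code points:
# NFD splits every Hangul syllable into its individual jamo.
print(name_nfc == name_nfd)
print(len(name_nfc), len(name_nfd))

def same_file_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_file_name(name_nfc, name_nfd))
```

A plain `==` comparison fails here, which is why the lookup compares the normalized form but returns the path exactly as it exists on disk.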

In [12]:
from operator import itemgetter

def clean_string(text):
    text_string = clean(text, dashes=True,trailing_punctuation=True, bullets=True)
    text_string = replace_num_symbols_with_number(text_string)
    text_string = remove_bulletpoints(text_string)
    return text_string


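def clean_table(text):
    # NOTE: replace_tables_from_pdf passes a DataFrame here. DataFrame.replace with
    # plain strings matches whole cell values only, not substrings inside a cell.
    text_string = replace_num_symbols_with_number(text)
    text_string = replace_sign_symbol(text_string)
    text_string = remove_bulletpoints(text_string)
    return text_string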


# Convert the whole PDF to Markdown
def process_pdf(file_path, chunk_size=256, chunk_overlap=32):
    """Extract text from a PDF and split it into chunks"""
    # Convert the PDF to Markdown text
    doc = pymupdf4llm.to_markdown(file_path)

    headers_to_split_on = [
        ("#","Header 1"),
        ("##","Header 2"),
        ("###","Header 3"),
    ]

    md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
    md_chunks = md_splitter.split_text(doc)

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(md_chunks)

    return chunks
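
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately simplified splitter. This is only a sketch of fixed-size windows with overlap; it is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators), and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text: str, chunk_size: int = 256, chunk_overlap: int = 32) -> list[str]:
    """Fixed-size character windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(600))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-32:] == chunks[1][:32])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.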


def create_vector_db(chunks, model_path="intfloat/multilingual-e5-small"):
    """Create a FAISS DB"""
    # Configure the embedding model
    model_kwargs = {'device': 'cuda'}
    encode_kwargs = {'normalize_embeddings': True}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_path,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Build and return the FAISS DB
    db = FAISS.from_documents(chunks, embedding=embeddings)
    return db


# Ensemble retriever (BM25 + FAISS)
def process_pdfs_from_dataframe(df, base_dir, chunk_size=256, model_path = "intfloat/multilingual-e5-small"):
    """Build a dict keyed by PDF file name that stores each PDF's DB and retriever"""
    pdf_databases = {}
    unique_paths = df['Source_path'].unique()

    for file_path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        full_path = process_path(base_dir,file_path)
        full_path = processed_path_matcher(base_dir,full_path)
        pdf_title = os.path.basename(full_path)
        print(f"Processing {pdf_title}...")

        # Process the PDF and build the vector DB
        chunks = process_pdf(full_path,chunk_size)
        db = create_vector_db(chunks, model_path=model_path)

        kiwi_bm25_retriever = KiwiBM25Retriever.from_documents(chunks)
        faiss_retriever = db.as_retriever()
        # NOTE: search_type does not appear to be an EnsembleRetriever field and is
        # probably ignored here; MMR would normally be requested with
        # db.as_retriever(search_type="mmr").
        retriever = EnsembleRetriever(
            retrievers=[kiwi_bm25_retriever, faiss_retriever],
            weights=[0.5, 0.5],
            search_type="mmr",
        )

        # Store the results
        pdf_databases[pdf_title] = {
                'db': db,
                'retriever': retriever
        }
    return pdf_databases
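
As far as I know, `EnsembleRetriever` merges the result lists of its retrievers with weighted Reciprocal Rank Fusion. The sketch below illustrates that idea in plain Python; it is not the library's code, `weighted_rrf` is a hypothetical name, and the constant `c=60` is the value commonly used for RRF, taken here as an assumption.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list, with rank starting at 1."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c"]    # keyword (BM25) ranking, best first
dense_hits = ["b", "c", "d"]   # embedding (FAISS) ranking, best first
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5])
print(fused)
```

Documents found by both retrievers (`b`, `c`) rise above documents found by only one, which is the main motivation for combining a keyword retriever with a dense one.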


## Preprocessing Tables

In [13]:
!pip install gmft



In [14]:
import gmft.table_detection
import gmft
import markdown
from gmft.auto import CroppedTable, TableDetector, AutoTableFormatter, AutoFormatConfig
from gmft.pdf_bindings import PyPDFium2Document

In [26]:
def make_table(tab,doc,pnum,formatter):
  rect = gmft.common.Rect(tab.bbox)
  temp = gmft.table_detection.CroppedTable(doc.get_page(pnum),rect,0.8)
  ft = formatter.extract(temp)
  try :
    table = ft.df()
  except Exception as e:
    return None
  return table

def define_formatter():
    config = AutoFormatConfig()
    config.semantic_spanning_cells=True
    config.enable_multi_header=True
    config.total_overlap_reject_threshold = 0.3
    config.large_table_assumption = True
    formatter = AutoTableFormatter(config=config)
    return formatter

table_header = """
<head>
<style>
html {
  background-color: #dd0000
}
body {
}
table {
  border: 0.5px solid black;
  border-collapse: collapse;
  background-color: #fdfdfd;
  object-fit : fill;
  width : 100%;
}
th, td {
  border: 1px solid black;
  background-color: #fdfdfd;
}
</style>
</head>
"""

def replace_tables_from_pdf(full_path):
    """Re-render every detected table as an HTML table drawn over its original location."""
    pdf = pymupdf.open(full_path)        # editable document used for redaction and redrawing
    doc = PyPDFium2Document(full_path)   # gmft's view of the same file
    formatter = define_formatter()
    for pnum, page in enumerate(tqdm(pdf)):
        tables = page.find_tables()
        for tab in tables:
            table = make_table(tab,doc,pnum,formatter)
            if table is None : continue
            # Blank out the original table region
            page.add_redact_annot(tab.bbox)
            table_md = clean_table(table).to_markdown(index=False)
            table_body=markdown.markdown(table_md, extensions=['markdown.extensions.tables'])
            page.apply_redactions()
            if len(table) > 1 : page.draw_rect(tab.bbox,fill=(.5,.25,.55))
            # scale_low=0 places no lower limit on how far the HTML is shrunk to fit the box
            page.insert_htmlbox(tab.bbox,table_header+table_body,scale_low=0)

    return pdf

In [27]:
def recreate_pdfs_from_dataframe(df, base_dir,save_dir):
    """Rewrite each unique source PDF with its tables re-rendered and save it under save_dir"""
    unique_paths = df['Source_path'].unique()

    for path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        norm_path = normalize_path(path)
        if not os.path.isabs(norm_path):
          # os.path.normpath already drops a leading './'
          full_path = os.path.normpath(os.path.join(base_dir, norm_path))
        else : full_path = norm_path

        pdf_name = os.path.basename(full_path)
        print(f"Processing {pdf_name}...")
        save_path = os.path.join(save_dir, norm_path)
        pdf_dir = os.path.dirname(save_path)
        os.makedirs(pdf_dir, exist_ok=True)
        new_pdf = replace_tables_from_pdf(full_path)
        new_pdf.save(save_path,garbage=4,deflate=True)
    return

In [28]:
PROCESSEDDIR = os.path.join(base_dir,'processed')
os.makedirs(PROCESSEDDIR, exist_ok=True)

# os.path.join works whether or not base_dir ends with a slash
train_df = pd.read_csv(os.path.join(base_dir,'train.csv'))
recreate_pdfs_from_dataframe(train_df, base_dir,PROCESSEDDIR)
test_df = pd.read_csv(os.path.join(base_dir,'test.csv'))
recreate_pdfs_from_dataframe(test_df, base_dir,PROCESSEDDIR)

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 1-1 2024 주요 재정통계 1권.pdf...


  0%|          | 0/137 [00:00<?, ?it/s]

Filling in gap at top of table

  0%|          | 0/314 [00:00<?, ?it/s]

Processing 재정통계해설.pdf...


  0%|          | 0/164 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 국토교통부_전세임대(융자).pdf...


  0%|          | 0/4 [00:00<?, ?it/s]

Processing 고용노동부_청년일자리창출지원.pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Processing 고용노동부_내일배움카드(일반).pdf...


  0%|          | 0/4 [00:00<?, ?it/s]

Processing 보건복지부_노인일자리 및 사회활동지원.pdf...


  0%|          | 0/5 [00:00<?, ?it/s]

Processing 중소벤처기업부_창업사업화지원.pdf...


  0%|          | 0/2 [00:00<?, ?it/s]

Processing 보건복지부_생계급여.pdf...


  0%|          | 0/4 [00:00<?, ?it/s]

Processing 국토교통부_소규모주택정비사업.pdf...


  0%|          | 0/4 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 국토교통부_민간임대(융자).pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Processing 고용노동부_조기재취업수당.pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Processing 2024년도 성과계획서(총괄편).pdf...


  0%|          | 0/345 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf...


  0%|          | 0/9 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 「FIS 이슈 & 포커스」 22-3호 《재정융자사업》.pdf...


  0%|          | 0/9 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 월간 나라재정 2023년 12월호.pdf...


  0%|          | 0/68 [00:00<?, ?it/s]

Processing PDFs:   0%|          | 0/9 [00:00<?, ?it/s]

Processing 중소벤처기업부_혁신창업사업화자금(융자).pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Processing 보건복지부_부모급여(영아수당) 지원.pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 보건복지부_노인장기요양보험 사업운영.pdf...


  0%|          | 0/4 [00:00<?, ?it/s]

Processing 산업통상자원부_에너지바우처.pdf...


  0%|          | 0/11 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 국토교통부_행복주택출자.pdf...


  0%|          | 0/3 [00:00<?, ?it/s]

Processing 「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf...


  0%|          | 0/9 [00:00<?, ?it/s]

Processing 「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》.pdf...


  0%|          | 0/11 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 「FIS 이슈&포커스」 22-2호 《재정성과관리제도》.pdf...


  0%|          | 0/9 [00:00<?, ?it/s]

Filling in gap at top of table
Processing 「FIS 이슈 & 포커스」(신규) 통권 제1호 《우발부채》.pdf...


  0%|          | 0/16 [00:00<?, ?it/s]

## Split train/valid

In [14]:
base_dir = path

In [15]:
data_df = pd.read_csv(os.path.join(base_dir,'train.csv'))
test_df = pd.read_csv(os.path.join(base_dir,'test.csv'))

In [16]:
from sklearn.model_selection import train_test_split

train_df,valid_df = train_test_split(data_df,test_size=0.2,stratify=data_df.Source,random_state=801)

# Dataset Config

In [53]:
tab_ver = 'tab_v1.7'
model_dict = {
'large':"intfloat/multilingual-e5-large",
'base':"intfloat/multilingual-e5-base",
}

if tab_ver == 'tab_v0' : file_dir = base_dir
else : file_dir = os.path.join(base_dir,'processed',tab_ver)

#model_option = 'large'
model_option = 'base'
model_path = model_dict[model_option]
chunk_size = 256

In [54]:
aug_type= 'NoAug'

In [73]:
aug_type = "AugGPT"

In [78]:
aug_type= 'AugAEDA'

In [79]:
db_config = {
    'model' : model_option,
    'tab_process' : tab_ver,
    'aug' : aug_type,
    'chunck_size' : chunk_size
}

In [80]:
db_name = "{model}-ensemble-{tab_process}-{chunck_size}".format(**db_config)
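
One consequence of this naming scheme is worth spelling out: `str.format(**db_config)` silently ignores keys that the template does not mention, so `aug` has no effect on `db_name`. A small sketch, with `build_db_name` as a hypothetical helper and the config values copied from the cells above:

```python
def build_db_name(config: dict) -> str:
    # Keys not referenced by the template ('aug' here) are simply ignored
    return "{model}-ensemble-{tab_process}-{chunck_size}".format(**config)

no_aug = {"model": "base", "tab_process": "tab_v1.7", "aug": "NoAug", "chunck_size": 256}
aeda = {"model": "base", "tab_process": "tab_v1.7", "aug": "AugAEDA", "chunck_size": 256}

print(build_db_name(no_aug))
print(build_db_name(no_aug) == build_db_name(aeda))
```

Both settings map to the same DB file name, which fits the fact that augmentation changes the questions but not the indexed PDFs.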

## Apply Augmentation

In [57]:
ls {base_dir}

241008_csv_checker.ipynb           eval/                       sub/          train.csv
combined_train_aug_v3.csv          gemma2_financeQA-finetune/  temp/         train_source/
combined_train_aug_v3_editted.csv  processed/                  test.csv
data/                              sample_submission.csv       test_source/


In [58]:
if aug_type != 'NoAug':
  aug_file = 'combined_train_aug_v3_editted.csv'
  aug_path = os.path.join(base_dir,aug_file)

In [59]:
ques_dict={
    'NoAug' : 'Question',
    'AugGPT' : 'Question_aug_GPT',
    'AugAEDA' : 'AEDA_Question'
}
ans_dict = {
    'NoAug' : 'Answer',
    'AugGPT' : 'Answer',
    'AugAEDA' : 'Answer'
}

In [60]:
key_col = 'SAMPLE_ID'
info_col = ['Source', 'Source_path']
ques_col = ques_dict[aug_type]
ans_col = ans_dict[aug_type]

In [61]:
if aug_type != 'NoAug':
  aug_df = pd.read_csv(aug_path,sep='|')
  train_id = train_df[key_col].values
  cond = aug_df[key_col].isin(train_id)
  display(aug_df.info())
  display(aug_df.columns)
  col_list = [key_col]+info_col+[ques_col,ans_col]
  aug_train = aug_df.loc[cond,col_list]
  aug_train = aug_train.rename(columns = {ques_col : "Question", ans_col : "Answer"})
  train_augged= pd.concat([train_df,aug_train])
  train_augged.info()

In [62]:
free_cuda()

freed :  233
freed :  20
freed :  0


# Create DB

In [63]:
temp_path = '/content/src/'
file_dir, os.listdir(file_dir)

('/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.7',
 ['train_source', 'test_source', 'pdf_db'])

In [28]:
file_path = os.path.join(file_dir,'train_source') + ' ' + os.path.join(file_dir,'test_source')
if not os.path.exists(temp_path) : os.makedirs(temp_path)

In [50]:
!rsync -rvzh {file_path} {temp_path} --bwlimit 4096000000000000 --progress

sending incremental file list
test_source/
test_source/국토교통부_행복주택출자.pdf
         32.77K   1%    0.00kB/s    0:00:00            2.55M 100%  240.51MB/s    0:00:00 (xfr#1, to-chk=24/27)
test_source/보건복지부_노인장기요양보험 사업운영.pdf
         32.77K   1%    2.40MB/s    0:00:00            2.33M 100%   96.41MB/s    0:00:00 (xfr#2, to-chk=23/27)
test_source/보건복지부_부모급여(영아수당) 지원.pdf
         32.77K   1%    1.25MB/s    0:00:01            2.26M 100%   65.41MB/s    0:00:00 (xfr#3, to-chk=22/27)
test_source/산업통상자원부_에너지바우처.pdf
         32.77K   1%  914.29kB/s    0:00:02            2.51M 100%   54.35MB/s    0:00:00 (xfr#4, to-chk=21/27)
test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf
         32.77K   1%  695.65kB/s    0:00:03            2.42M 100%   41.98MB/s    0:00:00 (xfr#5, to-chk=20/27)
test_source/「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf
         32.77K   1%  551.72kB/s  

In [51]:
train_db = process_pdfs_from_dataframe(train_df, temp_path, chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 2024 나라살림 예산개요.pdf...
Processing /content/src/train_source/2024 나라살림 예산개요.pdf...


  embeddings = HuggingFaceEmbeddings(


modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/160k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/690 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/201 [00:00<?, ?B/s]

Processing 재정통계해설.pdf...
Processing /content/src/train_source/재정통계해설.pdf...
Processing 1-1 2024 주요 재정통계 1권.pdf...
Processing /content/src/train_source/1-1 2024 주요 재정통계 1권.pdf...
Processing 월간 나라재정 2023년 12월호.pdf...
Processing /content/src/train_source/월간 나라재정 2023년 12월호.pdf...
Processing 2024년도 성과계획서(총괄편).pdf...
Processing /content/src/train_source/2024년도 성과계획서(총괄편).pdf...
Processing 중소벤처기업부_창업사업화지원.pdf...
Processing /content/src/train_source/중소벤처기업부_창업사업화지원.pdf...
Processing 「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf...
Processing /content/src/train_source/「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf...
Processing 고용노동부_내일배움카드(일반).pdf...
Processing /content/src/train_source/고용노동부_내일배움카드(일반).pdf...
Processing 보건복지부_생계급여.pdf...
Processing /content/src/train_source/보건복지부_새

In [None]:
free_cuda()

freed :  0


In [64]:
aug_type

'NoAug'

In [65]:
test_db = process_pdfs_from_dataframe(test_df, temp_path, chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/9 [00:00<?, ?it/s]

Processing 중소벤처기업부_혁신창업사업화자금(융자).pdf...
Processing /content/src/test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf...


modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/179k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Processing 보건복지부_부모급여(영아수당) 지원.pdf...
Processing /content/src/test_source/보건복지부_부모급여(영아수당) 지원.pdf...
Processing 보건복지부_노인장기요양보험 사업운영.pdf...
Processing /content/src/test_source/보건복지부_노인장기요양보험 사업운영.pdf...
Processing 산업통상자원부_에너지바우처.pdf...
Processing /content/src/test_source/산업통상자원부_에너지바우처.pdf...
Processing 국토교통부_행복주택출자.pdf...
Processing /content/src/test_source/국토교통부_행복주택출자.pdf...
Processing 「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf...
Processing /content/src/test_source/「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf...
Processing 「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》.pdf...
Processing /content/src/test_source/「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》.pdf...
Processing 「FIS 이슈&포커스」 22-2호 《재정성과관리제도》.pdf...
Processing /content/src/test_sour

In [31]:
import pickle

def check_and_mkdir(func):
    """Decorator: create the directory given as the first positional argument if it is missing."""
    def wrapper(*args,**kwargs):
        os.makedirs(args[0], exist_ok=True)
        return func(*args,**kwargs)
    return wrapper

@check_and_mkdir
def save_pkl(save_dir,file_name,save_object):
    file_path = os.path.join(save_dir,file_name)
    with open(file_path,'wb') as f:
        pickle.dump(save_object,f)

def load_pkl(file_path):
    with open(file_path,'rb') as f:
        data = pickle.load(f)
    return data
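
A minimal round-trip sketch of the same save/load pattern, run against a temporary directory. The names `save_pickle` and `load_pickle` and the placeholder dictionary are assumptions for illustration; the real objects stored above are FAISS DBs and retrievers.

```python
import os
import pickle
import tempfile

def save_pickle(save_dir, file_name, obj):
    """Create save_dir if needed, then pickle obj into it and return the file path."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, file_name)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def load_pickle(path):
    # Only unpickle files you created yourself: pickle.load can execute arbitrary code
    with open(path, "rb") as f:
        return pickle.load(f)

placeholder = {"demo.pdf": {"db": [1, 2, 3], "retriever": "placeholder"}}
with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "pdf_db")   # does not exist yet
    path = save_pickle(target_dir, "demo_test.dat", placeholder)
    restored = load_pickle(path)

print(restored == placeholder)
```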

In [66]:
file_dir

'/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.7'

In [67]:
db_name

'base-ensemble-tab_v1.7-256'

In [68]:
#db_path = os.path.join('/content','pdf_db')
db_path = os.path.join(file_dir,'pdf_db')
#save_pkl(db_path, f'{db_name}_train.dat',train_db)
save_pkl(db_path, f'{db_name}_test.dat',test_db)

In [35]:
os.listdir(db_path)

['large-ensemble-tab_v1.7-256_train.dat',
 'large-ensemble-tab_v1.7-256_test.dat']

In [69]:
train_db_name = 'large-ensemble-tab_v1.0-256_train.dat'
test_db_name = 'large-ensemble-tab_v1.0-256_test.dat'
train_db_path = os.path.join(db_path,train_db_name)
test_db_path = os.path.join(db_path,test_db_name)

In [None]:
file_path = train_db_path + ' ' + test_db_path
temp_path = '/content/pdf_db'
if not os.path.exists(temp_path) : os.makedirs(temp_path)

In [None]:
!rsync -vzh {file_path} {temp_path} --bwlimit 4096000000000000 --progress

base-ensemble-tab_v1.0-256_test.dat
         10.10G 100%   57.04MB/s    0:02:48 (xfr#1, to-chk=1/2)
base-ensemble-tab_v1.0-256_train.dat
         17.97G 100%   60.64MB/s    0:04:42 (xfr#2, to-chk=0/2)

sent 16.54G bytes  received 54 bytes  36.64M bytes/sec
total size is 28.06G  speedup is 1.70


In [None]:
train_db = load_pkl(os.path.join(temp_path,train_db_name))
#test_db = load_pkl(os.path.join(temp_path,test_db_name))

# Create Dataset

In [36]:
def normalize_string(s):
    """Unicode-normalize a string to NFC"""
    return unicodedata.normalize('NFC', s)

def format_docs(docs):
    """Format the retrieved documents into a single string"""
    context = ""
    for doc in docs:
        context += doc.page_content
        context += '\n'
    return context

def make_dataset(df, pdf_databases):
    """Build context/question/answer lists by retrieving context for every question in df"""
    dataset = dict()
    dataset['context'] = list()
    dataset['question'] = list()
    dataset['answer'] = list()
    normalized_keys = {normalize_string(k): v for k, v in pdf_databases.items()}

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Making"):
        # Normalize the source string
        source = normalize_string(row['Source'])+'.pdf'
        question = row['Question']
        dataset['question'].append(question)
        # The test set has no Answer column, so store an empty string instead
        if 'Answer' in df.columns:
            dataset['answer'].append(row['Answer'])
        else:
            dataset['answer'].append('')

        # Look up the database by its normalized key
        retriever = normalized_keys[source]['retriever']
        context = format_docs(retriever.invoke(question))
        dataset['context'].append(context)
    return dataset
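
The shape of the resulting dataset can be seen with a toy stand-in for the retriever. Everything in this sketch (`FakeDoc`, `FakeRetriever`, the sample rows and chunks) is hypothetical; it only mimics the `invoke` / `page_content` interface that the function above relies on.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    page_content: str

class FakeRetriever:
    """Stand-in retriever: returns every stored chunk that contains a word of the question."""
    def __init__(self, chunks):
        self.chunks = chunks

    def invoke(self, question):
        words = question.split()
        return [FakeDoc(c) for c in self.chunks if any(w in c for w in words)]

pdf_databases = {"budget.pdf": {"retriever": FakeRetriever(["budget total 100", "tax revenue 40"])}}
rows = [{"Source": "budget", "Question": "What is the tax revenue?"}]

dataset = {"context": [], "question": [], "answer": []}
for row in rows:
    retriever = pdf_databases[row["Source"] + ".pdf"]["retriever"]
    docs = retriever.invoke(row["Question"])
    # One retrieved chunk per line, as in format_docs
    dataset["context"].append("".join(d.page_content + "\n" for d in docs))
    dataset["question"].append(row["Question"])
    dataset["answer"].append(row.get("Answer", ""))  # empty for unlabeled rows

print(dataset)
```

Each of the three lists has one entry per question, which is the column layout `Dataset.from_dict` expects.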


# Dataset

In [70]:
if aug_type != 'NoAug':
  train_df = train_augged

In [81]:
dataset_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-{aug}-{chunck_size}".format(**db_config)
train_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-{aug}-{chunck_size}".format(**db_config)
#fname = "gemma2_large_ensemble_markdown_256_5epoch_reprocessed_result.csv"

push_url = dataset_name
push_url

'kdt3/DACON-QA-base-ensemble-tab_v1.7-AugAEDA-256'

In [90]:
## Reference code: how to merge the parts when the dataset has to be uploaded in several pieces
from datasets import load_dataset, concatenate_datasets
from datasets import Dataset

train_dataset = load_dataset(dataset_name)['train']

train_dataset = concatenate_datasets([train_dataset, Dataset.from_dict(make_dataset(train_df.iloc[296:], train_db))])
train_dataset.push_to_hub(dataset_name, private=True, split='train')


DatasetNotFoundError: Dataset 'kdt3/DACON-QA-large-ensemble-tab_v1.7-AugAEDA-256' doesn't exist on the Hub or cannot be accessed.

## Create & upload Train data

In [39]:
from datasets import Dataset
train_dataset = make_dataset(train_df, train_db)
train_dataset = Dataset.from_dict(train_dataset)
train_dataset.push_to_hub(push_url, private=True, split='train')


NameError: name 'train_db' is not defined

## Create & upload Valid data

In [92]:
from datasets import Dataset
valid_dataset = make_dataset(valid_df, train_db)
valid_dataset = Dataset.from_dict(valid_dataset)
valid_dataset.push_to_hub(push_url, private=True, split='valid')

Making:   0%|          | 0/100 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/348 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-large-ensemble-tab_v1.7-AugAEDA-256/commit/b15ecb68e7d039be0736a43545c9b7b26a57296f', commit_message='Upload dataset', commit_description='', oid='b15ecb68e7d039be0736a43545c9b7b26a57296f', pr_url=None, pr_revision=None, pr_num=None)

## Create & upload Test data

In [82]:
from datasets import Dataset
test_dataset = make_dataset(test_df, test_db)
test_dataset = Dataset.from_dict(test_dataset)
test_dataset.push_to_hub(push_url, private=True, split='test')

Making:   0%|          | 0/98 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-base-ensemble-tab_v1.7-AugAEDA-256/commit/ffa2f965c5c918711f634b8a83f646ce37c02205', commit_message='Upload dataset', commit_description='', oid='ffa2f965c5c918711f634b8a83f646ce37c02205', pr_url=None, pr_revision=None, pr_num=None)