In [1]:
!nvidia-smi

Wed Oct  9 05:11:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
#path = '/content/drive/MyDrive/DACON/Finance/reprocessed/'
path ='/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/'
base_dir = path # Your Base Directory

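# Overview

## Question Answering with Retrieval

The task in this competition is to develop a question-answering algorithm that improves the **search capability** over central-government fiscal information and makes that information more useful. <br>The goal is to let both the general public and experts easily access and use large volumes of fiscal data. <br><br>
The baseline uses only the evaluation dataset: it builds a Vector DB for each source PDF and then runs inference through a RAG process with the langchain library and the llama-2-ko-7b model. <br>(It does not include training on train_set; it only runs inference on test_set.)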

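## Mount/Login

Mount Google Drive and log in to Hugging Face.
- The Hugging Face token must have read/write permission for the kdt3 group.

## Download Library
Download the required libraries.
Because of version conflicts, restart the session once after installing.
<br>(If the session is fully disconnected, you have to install and restart again.)

## Import Library
Once you have restarted, you can skip the steps above and run only the imports.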

## Vector DB
Defines functions that split each document into chunks and index them in a DB, so that relevant chunks can be found by embedding similarity.
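
As a rough illustration of the idea (not the notebook's actual pipeline, which relies on `RecursiveCharacterTextSplitter`, an e5 embedding model, and FAISS), the sketch below cuts text into overlapping fixed-size chunks and ranks them by cosine similarity to a query. The names `split_with_overlap`, `toy_embed`, `cosine`, and `top_k` are hypothetical, and the character-bigram "embedding" is only a stand-in for a real embedding model.

```python
import math
from collections import Counter

def split_with_overlap(text, chunk_size=256, chunk_overlap=32):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def toy_embed(text):
    """Stand-in embedding: character-bigram counts (a real model returns dense vectors)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)                      # ['abcd', 'defg', 'ghij']
print(top_k("ghij", chunks, k=1))  # ['ghij']
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a sentence split by the boundary can still be retrieved.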

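## Build the DB
Builds the document DBs with the functions defined in Vector DB.<br><br>
Processing Train and Test in one go is likely to crash Colab by exhausting RAM, so it is better to process Train, run Create Dataset through to the upload, restart to free RAM, and then process Test.<br> You can also pass the document-embedding model as an argument.

## Create Dataset
Uses the DB built in Build the DB together with the data dataframe to create a Hugging Face dataset and upload it.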

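## Fine-Tuning
Fine-tunes the model on the training dataset and uploads it to Hugging Face.<br>
Fine-tuning uses LoRA with 4-bit quantization.<br>
The base model, the prompt fed to it, and the training hyperparameters can all be changed.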
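
To give a feel for why LoRA keeps fine-tuning cheap, here is a small arithmetic sketch. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors with `rank * (d_in + d_out)` parameters in total. The layer size and rank below are illustrative assumptions, not the settings used in this notebook, and the function names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out

# Hypothetical 4096 x 4096 projection with rank 8 (illustrative numbers only)
d, r = 4096, 8
trainable = lora_trainable_params(d, d, r)
frozen = full_params(d, d)
print(f"{trainable:,} trainable vs {frozen:,} frozen ({trainable / frozen:.2%})")
# 65,536 trainable vs 16,777,216 frozen (0.39%)
```

The 4-bit quantization applies to the frozen base weights, while the small adapter matrices are the part that is trained.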

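## Inference with Langchain
Inference with the model.


## How to run
### Basics
Mount/Login -> Download Library -> restart (first time only)<br>
Mount/Login -> Import Library (afterwards)

### Building a dataset
Basics -> Vector DB -> Build the DB -> first cell of Create Dataset + whichever of the Train/Valid/Test cells applies

### Training a model
Basics -> Fine-Tuning (be sure to check the upload location, dataset location, and model link)

### Inference with a trained model
Basics -> Inference with Langchain (check the model link and dataset location) -> Submission (check the output file name)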

# Mount/Login

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
ls {path}

241008_csv_checker.ipynb           eval/                       sub/          train.csv
combined_train_aug_v3.csv          gemma2_financeQA-finetune/  temp/         train_source/
combined_train_aug_v3_editted.csv  processed/                  test.csv
data/                              sample_submission.csv       test_source/


In [5]:
import os

token_path = os.path.join(base_dir,'data','token')
with open(token_path,'r') as f:
    master_token = f.readline().strip('\n')

In [6]:
from huggingface_hub import login

login(token=master_token, add_to_git_credential=True)

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Download Library

In [7]:
!apt-get install tesseract-ocr
!apt-get install poppler-utils

!pip install orjson==3.10.6

!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install transformers[torch] -U

!pip install datasets
!pip install langchain
!pip install langchain_community
!pip install langchain-teddynote
!pip install PyMuPDF
!pip install sentence-transformers
!pip install faiss-gpu
#!pip install peft
#!pip install trl
!pip install unstructured pdfminer.six
!pip install pillow-heif
#!pip install unstructured_inference
#!pip install unstructured_pytesseract
!pip install pikepdf pypdf

!pip install pymupdf4llm

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 3s (1,744 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 123621 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-

# Import Library

In [8]:
import os
import unicodedata
import torch
import pandas as pd
from tqdm.auto import tqdm
import fitz  # PyMuPDF

from langchain.document_loaders.parsers.pdf import PDFPlumberParser

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig
)
from accelerate import Accelerator

## peft
#from peft import prepare_model_for_kbit_training
#from peft import PeftModel
#from peft import LoraConfig, get_peft_model
#
#
## Langchain
#from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
#from langchain.prompts import PromptTemplate
#from langchain.schema.runnable import RunnablePassthrough, RunnableParallel
#from langchain.schema.output_parser import StrOutputParser

# PDF loading / chunking (PDFPlumberParser is already imported above)
from langchain.document_loaders.pdf import PDFPlumberLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain_teddynote.retrievers import KiwiBM25Retriever
from langchain.retrievers import EnsembleRetriever, MultiQueryRetriever

from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars

#import pdfplumber

import pymupdf4llm
import pymupdf


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  from .kiwi_bm25 import KiwiBM25Retriever


# Vector DB

In [9]:
from operator import itemgetter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars


# Normalize bullet-point symbols to "-"
def remove_bulletpoints(text):
    cleaned_text = text
    for symbol in ['ㅇ','-','□', '※', '▸','∙','●','☞','■','','','·']:
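        # Guard: an empty entry would make str.replace insert "-" between every character
        if not symbol:
            continue
        cleaned_text = cleaned_text.replace(symbol, "-")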
    return cleaned_text

def replace_sign_symbol(text):
    cleaned_text = text
    cleaned_text = cleaned_text.replace('△', "-")
    return cleaned_text


# Convert circled-number symbols to plain numbers, e.g. ① -> 1)
def replace_num_symbols_with_number(text):
    cleaned_text = text
    for idx, symbol in enumerate(['①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨', '⑩', '⑪', '⑫', '⑬', '⑭', '⑮']):
        cleaned_text = cleaned_text.replace(symbol, f"{idx+1})")
    return cleaned_text
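
A quick way to see what these helpers do is to run the same kind of replacement on a short string. The sketch below uses a hypothetical helper, `normalize_symbols`, with a shortened symbol list. It is not the notebook's function, only an illustration of the three replacements (bullets, circled digits, and `△` as a minus sign) and of one pitfall worth checking in any symbol list: an empty string matches between every character.

```python
CIRCLED = ['①', '②', '③']

def normalize_symbols(text, bullets=('□', '※', '●', '')):
    """Hypothetical helper: map bullet glyphs to '-' and circled digits to 'n)'."""
    for symbol in bullets:
        if symbol:  # skip empty entries, which would match between every character
            text = text.replace(symbol, '-')
    for idx, symbol in enumerate(CIRCLED):
        text = text.replace(symbol, f"{idx + 1})")
    # Treat △ as a minus sign, as replace_sign_symbol does
    return text.replace('△', '-')

print(normalize_symbols('□ 세출 ① △3.2'))  # - 세출 1) -3.2
print('abc'.replace('', '-'))              # -a-b-c-
```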

In [10]:
from operator import itemgetter

def clean_string(text):
    text_string = clean(text, dashes=True,trailing_punctuation=True, bullets=True)
    text_string = replace_num_symbols_with_number(text_string)
    text_string = remove_bulletpoints(text_string)
    return text_string


def clean_table(text):
    text_string = replace_num_symbols_with_number(text)
    text_string = replace_sign_symbol(text_string)
    text_string = remove_bulletpoints(text_string)
    return text_string


# Convert the whole PDF to markdown, then chunk it
def process_pdf(file_path, chunk_size=256, chunk_overlap=32):
    """Extract the PDF text as markdown and split it into chunks."""
    # Convert the PDF to a markdown string
    doc = pymupdf4llm.to_markdown(file_path)

    headers_to_split_on = [
        ("#","Header 1"),
        ("##","Header 2"),
        ("###","Header 3"),
    ]

    md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
    md_chunks = md_splitter.split_text(doc)

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(md_chunks)

    return chunks


def create_vector_db(chunks, model_path="intfloat/multilingual-e5-small"):
    """Build a FAISS DB from the chunks."""
    # Configure the embedding model
    model_kwargs = {'device': 'cuda'}
    encode_kwargs = {'normalize_embeddings': True}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_path,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Build and return the FAISS DB
    db = FAISS.from_documents(chunks, embedding=embeddings)
    return db

def normalize_path(path):
    """Apply Unicode NFC normalization to a path."""
    return unicodedata.normalize('NFC', path)

# Ensemble (BM25 + FAISS)
def process_pdfs_from_dataframe(df, base_dir, chunk_size=256, model_path="intfloat/multilingual-e5-small"):
    """Build a DB and retriever per PDF and store them in a dict keyed by PDF file name."""
    pdf_databases = {}
    unique_paths = df['Source_path'].unique()

    for path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        norm_path = normalize_path(path)
        if not os.path.isabs(norm_path):
            # normpath collapses a leading "./"; lstrip('./') would strip any run of '.' and '/' characters
            full_path = os.path.normpath(os.path.join(base_dir, norm_path))
        else:
            full_path = norm_path

        pdf_title = os.path.basename(full_path)
        print(f"Processing {pdf_title}...")

        # Process the PDF and build the vector DB
        chunks = process_pdf(full_path, chunk_size)
        db = create_vector_db(chunks, model_path=model_path)

        kiwi_bm25_retriever = KiwiBM25Retriever.from_documents(chunks)
        faiss_retriever = db.as_retriever()
        # Note: search_type does not appear to be an EnsembleRetriever field (it takes retrievers, weights, c);
        # to request MMR from FAISS, pass search_type="mmr" to db.as_retriever() instead.
        retriever = EnsembleRetriever(
            retrievers=[kiwi_bm25_retriever, faiss_retriever],
            weights=[0.5, 0.5],
            search_type="mmr",
        )

        # Store the results
        pdf_databases[pdf_title] = {
            'db': db,
            'retriever': retriever
        }
    return pdf_databases
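
`EnsembleRetriever` combines the BM25 and FAISS result lists with weighted Reciprocal Rank Fusion (RRF). The sketch below shows that scoring idea in isolation, assuming each retriever returns a ranked list of document ids. The function name `weighted_rrf` and the sample ids are hypothetical, and `c=60` is the constant commonly used for RRF rather than a value read from this notebook.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each hit adds weight / (rank + c) to its document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["chunk_a", "chunk_b", "chunk_c"]   # hypothetical keyword ranking
faiss_hits = ["chunk_b", "chunk_d", "chunk_a"]  # hypothetical embedding ranking
print(weighted_rrf([bm25_hits, faiss_hits], weights=[0.5, 0.5]))
# ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because only ranks are used, the two retrievers' raw scores never need to be on the same scale, which is what makes it easy to mix a keyword retriever with an embedding retriever.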


## Preprocessing Tables

In [11]:
!pip install gmft

Collecting gmft
  Downloading gmft-0.3.1-py3-none-any.whl.metadata (9.9 kB)
Collecting pypdfium2>=4 (from gmft)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading gmft-0.3.1-py3-none-any.whl (53 kB)
Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Installing collected packages: pypdfium2, gmft
Successfully installed gmft-0.3.1 pypdfium2-4.30.0


In [None]:
import gmft.table_detection
import gmft
import markdown
from gmft.auto import CroppedTable, TableDetector, AutoTableFormatter, AutoFormatConfig
from gmft.pdf_bindings import PyPDFium2Document

In [None]:
def make_table(tab,doc,pnum,formatter):
  rect = gmft.common.Rect(tab.bbox)
  temp = gmft.table_detection.CroppedTable(doc.get_page(pnum),rect,0.8)
  ft = formatter.extract(temp)
  try :
    table = ft.df()
  except Exception as e:
    print(e,'\t page: ',pnum)
    table = tab.to_pandas()
  return table

def define_formatter():
    config = AutoFormatConfig()
    config.semantic_spanning_cells=True
    config.enable_multi_header=True
    config.total_overlap_reject_threshold = 0.3
    config.large_table_assumption = True
    formatter = AutoTableFormatter(config=config)
    return formatter

table_header = """
<head>
<style>
body {
  width : 100%;
  height : 100%;
}
table {
  border: 1px solid black;
  border-collapse: collapse;
  background-color: #fdfdfd;
  width : 100%;
  height : 100%;
}
th, td {
  border: 1px solid black;
  background-color: #fdfdfd;
}
</style>
</head>
"""

def replace_tables_from_pdf(full_path):
    """Redraw every detected table in the PDF as an HTML table and return the edited document."""
    pdf = pymupdf.open(full_path)
    doc = PyPDFium2Document(full_path)
    formatter = define_formatter()
    for pnum, page in enumerate(tqdm(pdf)):
        tables = page.find_tables()
        for tab in tables:
            # Extract the table as a DataFrame (gmft first, PyMuPDF as a fallback)
            table = make_table(tab, doc, pnum, formatter)
            # Redact the original table region, then draw the cleaned table in its place
            page.add_redact_annot(tab.bbox)
            table_md = clean_table(table).to_markdown(index=False)
            table_body = markdown.markdown(table_md, extensions=['markdown.extensions.tables'])
            page.apply_redactions()
            page.insert_htmlbox(tab.bbox, table_header + table_body)

    return pdf

In [None]:
def recreate_pdfs_from_dataframe(df, base_dir, save_dir):
    """Rewrite each source PDF with its tables redrawn and save it under save_dir."""
    unique_paths = df['Source_path'].unique()

    for path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        norm_path = normalize_path(path)
        if not os.path.isabs(norm_path):
            full_path = os.path.normpath(os.path.join(base_dir, norm_path))
        else:
            full_path = norm_path

        pdf_name = os.path.basename(full_path)
        print(f"Processing {pdf_name}...")
        # Mirror the source layout (e.g. train_source/...) under save_dir
        save_path = os.path.join(save_dir, norm_path)
        pdf_dir = os.path.dirname(save_path)
        os.makedirs(pdf_dir, exist_ok=True)
        new_pdf = replace_tables_from_pdf(full_path)
        new_pdf.save(save_path, garbage=4, deflate=True)

In [None]:
PROCESSEDDIR = os.path.join(base_dir,'processed')
if not os.path.exists(PROCESSEDDIR) : os.makedirs(PROCESSEDDIR)

train_df = pd.read_csv(f'{base_dir}train.csv')
recreate_pdfs_from_dataframe(train_df, base_dir,PROCESSEDDIR)
test_df = pd.read_csv(f'{base_dir}test.csv')
recreate_pdfs_from_dataframe(test_df, base_dir,PROCESSEDDIR)

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 1-1 2024 주요 재정통계 1권.pdf...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/273 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/76.8k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]

  0%|          | 0/137 [00:00<?, ?it/s]

The identified boxes have significant overlap: 59.09% of area is overlapping (Max is 30.00%) 	 page:  18
The header is not included as a row. Consider adding it back as a row.
Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.
No rows or columns detected 	 page:  21
No rows or columns detected 	 page:  21
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The header is not included as a row. Consider adding it back as a row.
The identified boxes have significant overlap: 78.39% of area is ove

  0%|          | 0/314 [00:00<?, ?it/s]

Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.
The identified boxes have significant overlap: 83.31% of area is overlapping (Max is 30.00%) 	 page:  1
Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.
Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.
Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.


## Split train/valid

In [12]:
base_dir = path

In [13]:
data_df = pd.read_csv(os.path.join(base_dir,'train.csv'))
test_df = pd.read_csv(os.path.join(base_dir,'test.csv'))

In [14]:
from sklearn.model_selection import train_test_split

train_df,valid_df = train_test_split(data_df,test_size=0.2,stratify=data_df.Source,random_state=801)

In [15]:
display(train_df.Source.value_counts())
display(valid_df.Source.value_counts())

| Source | count |
|---|---|
| 재정통계해설 | 94 |
| 2024년도 성과계획서(총괄편) | 92 |
| 2024 나라살림 예산개요 | 70 |
| 1-1 2024 주요 재정통계 1권 | 40 |
| 보건복지부_생계급여 | 14 |
| 월간 나라재정 2023년 12월호 | 13 |
| 「FIS 이슈 & 포커스」 22-3호 《재정융자사업》 | 13 |
| 「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》 | 10 |
| 고용노동부_청년일자리창출지원 | 10 |
| 중소벤처기업부_창업사업화지원 | 9 |


| Source | count |
|---|---|
| 재정통계해설 | 24 |
| 2024년도 성과계획서(총괄편) | 23 |
| 2024 나라살림 예산개요 | 18 |
| 1-1 2024 주요 재정통계 1권 | 10 |
| 「FIS 이슈 & 포커스」 22-3호 《재정융자사업》 | 4 |
| 월간 나라재정 2023년 12월호 | 4 |
| 보건복지부_생계급여 | 3 |
| 고용노동부_청년일자리창출지원 | 3 |
| 중소벤처기업부_창업사업화지원 | 2 |
| 「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》 | 2 |
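
The counts above can be used to sanity-check that `stratify=data_df.Source` kept roughly the requested 20% of each source in the validation split. A small sketch using the four largest sources from the output; `valid_share` is a hypothetical helper, not part of the notebook.

```python
# Counts copied from the value_counts() output above (four largest sources only)
train_counts = {
    "재정통계해설": 94,
    "2024년도 성과계획서(총괄편)": 92,
    "2024 나라살림 예산개요": 70,
    "1-1 2024 주요 재정통계 1권": 40,
}
valid_counts = {
    "재정통계해설": 24,
    "2024년도 성과계획서(총괄편)": 23,
    "2024 나라살림 예산개요": 18,
    "1-1 2024 주요 재정통계 1권": 10,
}

def valid_share(train_counts, valid_counts):
    """Fraction of each source's rows that ended up in the validation split."""
    return {s: valid_counts[s] / (train_counts[s] + valid_counts[s]) for s in train_counts}

for source, share in valid_share(train_counts, valid_counts).items():
    print(f"{source}: {share:.1%} held out")
```

Each share should sit close to 20%, with small deviations because the per-source row counts are not exact multiples of five.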


# Build the DB

In [16]:
tab_ver = 'tab_v1.0'
model_dict = {
'large':"intfloat/multilingual-e5-large",
'base':"intfloat/multilingual-e5-base",
}

if tab_ver == 'tab_v0' : file_dir = base_dir
else : file_dir = os.path.join(base_dir,'processed',tab_ver)

model_option = 'base'
model_path = model_dict[model_option]
chunk_size = 256

In [None]:
train_db = process_pdfs_from_dataframe(train_df, file_dir, chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 2024 나라살림 예산개요.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/2024 나라살림 예산개요.pdf...


  embeddings = HuggingFaceEmbeddings(


modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/179k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Processing 재정통계해설.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/재정통계해설.pdf...
Processing 1-1 2024 주요 재정통계 1권.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/1-1 2024 주요 재정통계 1권.pdf...
Processing 월간 나라재정 2023년 12월호.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/월간 나라재정 2023년 12월호.pdf...
Processing 2024년도 성과계획서(총괄편).pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/2024년도 성과계획서(총괄편).pdf...
Processing 중소벤처기업부_창업사업화지원.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/중소벤처기업부_창업사업화지원.pdf...
Processing 「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf...
Processing 고용노동부_내일배움카드(일반).pd

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 1-1 2024 주요 재정통계 1권.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/train_source/1-1 2024 주요 재정통계 1권.pdf...

KeyboardInterrupt: 

In [None]:
test_db = process_pdfs_from_dataframe(test_df, file_dir, chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/9 [00:00<?, ?it/s]

Processing 중소벤처기업부_혁신창업사업화자금(융자).pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf...
Processing 보건복지부_부모급여(영아수당) 지원.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/보건복지부_부모급여(영아수당) 지원.pdf...
Processing 보건복지부_노인장기요양보험 사업운영.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/보건복지부_노인장기요양보험 사업운영.pdf...
Processing 산업통상자원부_에너지바우처.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/산업통상자원부_에너지바우처.pdf...
Processing 국토교통부_행복주택출자.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/국토교통부_행복주택출자.pdf...
Processing 「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf...
Processing /content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v1.0/test_source/「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf...
Proces

In [17]:
db_config = {
    'model' : model_option,
    'tab_process' : tab_ver,
    'aug' : 'NoAug',
    'chunck_size' : chunk_size
}

In [18]:
db_name = "{model}-ensemble-{tab_process}-{chunck_size}".format(**db_config)

In [19]:
import pickle
from functools import wraps

def check_and_mkdir(func):
    """Create the directory passed as the first positional argument before calling func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        os.makedirs(args[0], exist_ok=True)
        return func(*args, **kwargs)
    return wrapper

@check_and_mkdir
def save_pkl(save_dir, file_name, save_object):
    # The decorator has already created save_dir
    file_path = os.path.join(save_dir, file_name)
    with open(file_path, 'wb') as f:
        pickle.dump(save_object, f)

def load_pkl(file_path):
    with open(file_path,'rb') as f:
        data = pickle.load(f)
    return data

In [20]:
db_path = os.path.join(file_dir,'pdf_db')
#save_pkl(db_path, f'{db_name}_train.dat',train_db)
#save_pkl(db_path, f'{db_name}_test.dat',test_db)

In [35]:
os.listdir(db_path)

['base-ensemble-tab_v1.0-256_train.pkl',
 'base-ensemble-tab_v1.0-256_train.dat',
 'base-ensemble-tab_v1.0-256_test.dat']

In [37]:
train_db_name = 'base-ensemble-tab_v1.0-256_train.dat'
test_db_name = 'base-ensemble-tab_v1.0-256_test.dat'
train_db_path = os.path.join(db_path,train_db_name)
test_db_path = os.path.join(db_path,test_db_name)

In [30]:
file_path = train_db_path + ' ' + test_db_path
temp_path = '/content/pdf_db'
if not os.path.exists(temp_path) : os.makedirs(temp_path)

In [36]:
!rsync -vzh {file_path} {temp_path} --bwlimit 4096000000000000 --progress

base-ensemble-tab_v1.0-256_test.dat
         10.10G 100%   57.04MB/s    0:02:48 (xfr#1, to-chk=1/2)
base-ensemble-tab_v1.0-256_train.dat
         17.97G 100%   60.64MB/s    0:04:42 (xfr#2, to-chk=0/2)

sent 16.54G bytes  received 54 bytes  36.64M bytes/sec
total size is 28.06G  speedup is 1.70


In [38]:
train_db = load_pkl(os.path.join(temp_path,train_db_name))
test_db = load_pkl(os.path.join(temp_path,test_db_name))

# Create Dataset

In [50]:
def normalize_string(s):
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize('NFC', s)

def format_docs(docs):
    """Concatenate the retrieved documents into a single string."""
    context = ""
    for doc in docs:
        context += doc.page_content
        context += '\n'
    return context

def make_dataset(df, pdf_databases):
    dataset = dict()
    dataset['context'] = list()
    dataset['question'] = list()
    dataset['answer'] = list()
    normalized_keys = {normalize_string(k): v for k, v in pdf_databases.items()}

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Making"):
        # Normalize the source string
        source = normalize_string(row['Source'])+'.pdf'
        question = row['Question']
        dataset['question'].append(question)
        if 'Answer' in df.columns:
            dataset['answer'].append(row['Answer'])
        else:
            dataset['answer'].append('')

        # Look up the database by its normalized key
        retriever = normalized_keys[source]['retriever']
        context = format_docs(retriever.invoke(question))
        dataset['context'].append(context)
    return dataset
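
The NFC normalization matters because Hangul file names can arrive in decomposed (NFD) form, for example from macOS file systems, and then fail to match composed dictionary keys even though they look identical on screen. A small illustration; the file name and the `databases` dict are just examples.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "재정통계해설.pdf")  # one code point per syllable
decomposed = unicodedata.normalize("NFD", composed)          # each syllable split into jamo

print(composed == decomposed)            # False, although both render the same
print(len(composed), len(decomposed))    # the NFD form is longer

databases = {composed: "retriever"}      # keys stored in NFC
print(decomposed in databases)                                  # False
print(unicodedata.normalize("NFC", decomposed) in databases)    # True
```

Normalizing both the dictionary keys and the lookup string, as `make_dataset` does, avoids a `KeyError` caused by this mismatch.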


In [51]:
dataset_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-{aug}-{chunck_size}".format(**db_config)
train_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-{aug}-{chunck_size}".format(**db_config)
#fname = "gemma2_large_ensemble_markdown_256_5epoch_reprocessed_result.csv"

push_url = dataset_name
push_url

'kdt3/DACON-QA-base-ensemble-tab_v1.0-NoAug-256'

In [52]:
## Reference code for merging splits when the dataset has to be uploaded in parts
from datasets import load_dataset, concatenate_datasets
from datasets import Dataset

train_dataset = load_dataset(dataset_name)['train']

train_dataset = concatenate_datasets([train_dataset, Dataset.from_dict(make_dataset(train_df.iloc[296:], train_db))])
train_dataset.push_to_hub(dataset_name, private=True, split='train')




KeyboardInterrupt: 

## Create & upload Train data

In [53]:
from datasets import Dataset
train_dataset = make_dataset(train_df, train_db)
train_dataset = Dataset.from_dict(train_dataset)
train_dataset.push_to_hub(push_url, private=True, split='train')


Making:   0%|          | 0/396 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-base-ensemble-tab_v1.0-NoAug-256/commit/e559a271b9b43ffa9fdc8793d8895793d0237eb3', commit_message='Upload dataset', commit_description='', oid='e559a271b9b43ffa9fdc8793d8895793d0237eb3', pr_url=None, pr_revision=None, pr_num=None)

## Create & upload Valid data

In [54]:
from datasets import Dataset
valid_dataset = make_dataset(valid_df, train_db)
valid_dataset = Dataset.from_dict(valid_dataset)
valid_dataset.push_to_hub(push_url, private=True, split='valid')

Making:   0%|          | 0/100 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/347 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-base-ensemble-tab_v1.0-NoAug-256/commit/b9df89a8aa848d9b15933e8b42b5778f54bcc6a1', commit_message='Upload dataset', commit_description='', oid='b9df89a8aa848d9b15933e8b42b5778f54bcc6a1', pr_url=None, pr_revision=None, pr_num=None)

## Create & upload Test data

In [55]:
from datasets import Dataset
test_dataset = make_dataset(test_df, test_db)
test_dataset = Dataset.from_dict(test_dataset)
test_dataset.push_to_hub(push_url, private=True, split='test')

Making:   0%|          | 0/98 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/447 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-base-ensemble-tab_v1.0-NoAug-256/commit/1bcbb4d16988579092248f56ad0f125e073c2b4d', commit_message='Upload dataset', commit_description='', oid='1bcbb4d16988579092248f56ad0f125e073c2b4d', pr_url=None, pr_revision=None, pr_num=None)