In [1]:
!nvidia-smi

Thu Nov  7 15:51:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0              45W / 350W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
#path = '/content/drive/MyDrive/DACON/Finance/reprocessed/'
path ='/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/'
base_dir = path # Your Base Directory

# Overview

## Question - Answering with Retrieval

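The task in this competition is to develop a question-answering algorithm that improves the **search capability** over central government fiscal information and makes that information more usable. <br>The goal is to let both the general public and experts easily access and use the large body of fiscal data. <br><br>
The baseline uses only the evaluation dataset: it builds a vector DB for each source PDF, then runs inference through a RAG pipeline with the langchain library and the llama-2-ko-7b model. <br>(It does not include a training step on train_set; it only runs inference on test_set.)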

## Mount/Login

Mount Google Drive and log in to Hugging Face.
- The Hugging Face token must have read/write permission for the kdt3 group.

## Download Library
Install the required libraries.
Because of version conflicts, the session must be restarted once after installation.
<br>(If the session is disconnected completely, install and restart again.)

## Import Library
After that one restart, skip the steps above and run only the imports.

## Vector DB
Defines the functions that split documents into chunks and build a DB in which relevant chunks can be found by embedding similarity.
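
The chunking idea can be pictured as a fixed-size sliding window. The sketch below is only an illustration: `split_with_overlap` is a hypothetical helper, while the notebook itself relies on LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers to cut at separators such as newlines.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=32):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Toy input: 600 digits, so the windows are easy to inspect.
sample = "".join(str(i % 10) for i in range(600))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Consecutive chunks share their boundary text, so a sentence that straddles a cut still appears whole in at least one chunk as long as it is shorter than the overlap.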

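## DB Creation
Builds the document DBs with the functions defined in Vector DB.<br><br>
Running Train and Test in one go is likely to crash Colab, so it is better to run Train through Create Dataset and upload the result, then restart to free RAM before running Test.<br> The embedding model used for the documents can be passed as an argument.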

## Create Dataset
Creates a Hugging Face dataset from the DB built in DB Creation and the data DataFrame, then uploads it.

## Fine-Tuning
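Fine-tunes the model on the training dataset and uploads it to Hugging Face.<br>
Fine-tuning uses LoRA with 4-bit quantization.<br>
The base model, the prompt fed to the model, and the training hyperparameters can all be changed.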

## Inference with Langchain
Runs inference with the model.


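## How to Run
### Basics
Mount/Login -> Download Library -> restart (first time only)
Mount/Login -> Import Library (afterwards)

### Building a dataset
Basics -> Vector DB -> DB Creation -> first cell of Create Dataset + whichever of the Train/Valid/Test cells applies

### Training the model
Basics -> Fine-Tuning (be sure to check the upload location, dataset location, and model link)

### Inference with the trained model
Basics -> Inference with Langchain (check the model link and dataset location) -> Submission (check the output file name)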

# Mount/Login

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
ls {path}

241008_csv_checker.ipynb             gemma2_financeQA-finetune/  test_source/
combined_train_aug_v3.5_editted.csv  processed/                  train.csv
combined_train_aug_v3.csv            sample_submission.csv       train_source/
combined_train_aug_v3_editted.csv    sub/                        Untitled0.ipynb
data/                                temp/
eval/                                test.csv


In [5]:
import os

token_path = os.path.join(base_dir,'data','token')
with open(token_path,'r') as f:
    master_token = f.readline().strip('\n')

In [6]:
from huggingface_hub import login

login(token=master_token, add_to_git_credential=True)

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Download Library

In [7]:
!apt-get install tesseract-ocr
!apt-get install poppler-utils

!pip install orjson==3.10.6

!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install transformers[torch] -U

!pip install datasets
!pip install langchain
!pip install langchain_community
!pip install langchain-teddynote
!pip install PyMuPDF
!pip install sentence-transformers
!pip install faiss-gpu
!pip install unstructured pdfminer.six
!pip install pillow-heif
!pip install pikepdf pypdf

!pip install pymupdf4llm

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Looking in indexes: https://pypi.org/simple/


# Import Library

In [8]:
import os
import unicodedata
import torch
import pandas as pd
from tqdm.auto import tqdm
import fitz  # PyMuPDF

from langchain.document_loaders.parsers.pdf import PDFPlumberParser

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig
)
from accelerate import Accelerator

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter

# PDF loading / chunking (PDFPlumberParser is already imported above)
from langchain.document_loaders.pdf import PDFPlumberLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain_teddynote.retrievers import KiwiBM25Retriever
from langchain.retrievers import EnsembleRetriever, MultiQueryRetriever

from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars

import pymupdf4llm
import pymupdf

In [9]:
# Release GPU memory
import gc, time

def free_cuda():
  mem = 1
  while mem > 0 :
    time.sleep(0.5)
    mem = gc.collect()
    torch.cuda.empty_cache()
    print("freed : ",mem)

# Vector DB

In [10]:
import re  # needed by erase_unicode_chr below
from operator import itemgetter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from unstructured.cleaners.core import clean_extra_whitespace, clean, clean_non_ascii_chars


# Replace bullet symbols with a plain '-'
def remove_bulletpoints(text):
    cleaned_text = text
    for symbol in ['ㅇ','-','□', '※', '▸','∙','●','☞','■','','','·']:
        if not symbol : continue  # str.replace('', ...) would insert '-' between every character
        cleaned_text = cleaned_text.replace(symbol, "-")
    return cleaned_text

def replace_sign_symbol(text):
    cleaned_text = text
    cleaned_text = cleaned_text.replace('△', "-")
    return cleaned_text


# Convert circled-number symbols to plain numbers, e.g. ① -> 1)
def replace_num_symbols_with_number(text):
    cleaned_text = text
    for idx, symbol in enumerate(['①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨', '⑩', '⑪', '⑫', '⑬', '⑭', '⑮']):
        cleaned_text = cleaned_text.replace(symbol, f"{idx+1})")
    return cleaned_text

def erase_unicode_chr(text):
  return re.sub(r'\\u[0-9a-fA-F]{4}','-',text)
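
The pattern in `erase_unicode_chr` matches escape sequences written out as literal text (a backslash, `u`, and four hex digits); real non-ASCII characters are left alone. A small check of that behavior, using the same pattern under the hypothetical name `erase_unicode_escapes`:

```python
import re

ESCAPE_PATTERN = re.compile(r'\\u[0-9a-fA-F]{4}')

def erase_unicode_escapes(text):
    """Replace literal backslash-u escape text with '-' (same pattern as erase_unicode_chr)."""
    return ESCAPE_PATTERN.sub('-', text)

literal_escape = 'cost \\uf0a7 100'   # contains the six characters \ u f 0 a 7
real_character = 'cost △ 100'          # contains an actual non-ASCII character

print(erase_unicode_escapes(literal_escape))
print(erase_unicode_escapes(real_character))
```

The sign symbol `△` therefore has to be handled separately, which is what `replace_sign_symbol` does.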

In [11]:
def normalize_path(path):
    """Normalize a path to Unicode NFC."""
    return unicodedata.normalize('NFC', path)

def process_path(base_dir,file_path):
  norm_path = normalize_path(file_path)
  if not os.path.isabs(norm_path):
    return os.path.normpath(os.path.join(base_dir, norm_path))
  else : return norm_path

def subpath_list(dir_path):
  return list(map(lambda x : os.path.join(dir_path,x),os.listdir(dir_path)))

def processed_path_matcher(dir_path,file_path):
  sub_list = subpath_list(dir_path)
  path_list = list()
  for sub in sub_list:
    path_list.extend(subpath_list(sub))
  prcssd_list =list(map(normalize_path,path_list))
  for real_path,prcssd_path in zip(path_list,prcssd_list) :
    if file_path == prcssd_path : return real_path
  else : return file_path
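
Why normalize at all: some file systems and archive tools store Hangul file names in decomposed form (NFD), so a name read from disk can differ from the same name typed in a CSV even though both look identical. A minimal illustration with the standard library only:

```python
import unicodedata

# The same visible file name in composed (NFC) and decomposed (NFD) form.
name_nfc = unicodedata.normalize('NFC', '재정통계해설.pdf')
name_nfd = unicodedata.normalize('NFD', name_nfc)

print(len(name_nfc), len(name_nfd))                          # NFD is longer: syllables become jamo
print(name_nfc == name_nfd)                                  # direct comparison fails
print(unicodedata.normalize('NFC', name_nfd) == name_nfc)    # comparison after NFC succeeds
```

This is the reason `processed_path_matcher` compares NFC-normalized paths but returns the path as it actually exists on disk.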

In [12]:
from operator import itemgetter

def clean_string(text):
    text_string = clean(text, dashes=True,trailing_punctuation=True, bullets=True)
    text_string = replace_num_symbols_with_number(text_string)
    text_string = remove_bulletpoints(text_string)
    return text_string

def clean_table(text):
    text_string = replace_num_symbols_with_number(text)
    text_string = replace_sign_symbol(text_string)
    text_string = remove_bulletpoints(text_string)
    return erase_unicode_chr(text_string)

# Process the whole document as Markdown
def process_pdf(file_path, chunk_size=256, chunk_overlap=32):
    """Extract text from a PDF and split it into chunks."""
    # Convert the PDF to Markdown
    doc = pymupdf4llm.to_markdown(file_path)

    headers_to_split_on = [
        ("#","Header 1"),
        ("##","Header 2"),
        ("###","Header 3"),
    ]

    md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
    md_chunks = md_splitter.split_text(doc)

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(md_chunks)

    return chunks


def create_vector_db(chunks, model_path="intfloat/multilingual-e5-small"):
    """Create a FAISS DB."""
    # Configure the embedding model
    model_kwargs = {'device': 'cuda'}
    encode_kwargs = {'normalize_embeddings': True}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_path,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Build and return the FAISS DB
    db = FAISS.from_documents(chunks, embedding=embeddings)
    return db
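
A note on `normalize_embeddings=True`: once every vector has unit length, the dot product equals cosine similarity, and squared Euclidean distance is `2 - 2 * cosine`, so ranking by either measure gives the same order. The sketch below shows this with hand-made toy vectors; the helper names and the numbers are illustrative assumptions, not output of the embedding model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks with the highest dot product."""
    scores = [(dot(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
chunk_vecs = [l2_normalize(v) for v in toy_chunks]
query_vec = l2_normalize([1.0, 0.2, 0.0])

ranking = top_k(query_vec, chunk_vecs, k=2)
print(ranking)
```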




In [13]:
import pickle

def check_and_mkdir(func):
    # Make sure the directory passed as the first positional argument exists.
    def wrapper(*args,**kwargs):
        os.makedirs(args[0], exist_ok=True)
        return func(*args,**kwargs)
    return wrapper

@check_and_mkdir
def save_pkl(save_dir,file_name,save_object):
    file_path = os.path.join(save_dir,file_name)
    with open(file_path,'wb') as f:
        pickle.dump(save_object,f)

def load_pkl(file_path):
    with open(file_path,'rb') as f:
        data = pickle.load(f)
    return data

## Preprocessing Tables

In [14]:
!pip install gmft



In [15]:
import gmft.table_detection
import gmft
import markdown
from gmft.auto import CroppedTable, TableDetector, AutoTableFormatter, AutoFormatConfig
from gmft.pdf_bindings import PyPDFium2Document

In [16]:
from collections import defaultdict

def make_table(tab,doc,pnum,formatter):
  rect = gmft.common.Rect(tab.bbox)
  temp = gmft.table_detection.CroppedTable(doc.get_page(pnum),rect,0.8)
  ft = formatter.extract(temp)
  try :
    table = ft.df()
  except Exception as e:
    return None
  return table

def define_formatter():
    config = AutoFormatConfig()
    config.semantic_spanning_cells=True
    config.enable_multi_header=True
    config.total_overlap_reject_threshold = 0.3
    config.large_table_assumption = True
    formatter = AutoTableFormatter(config=config)
    return formatter

def extract_tables_from_pdf(full_path,tab_word='[[TABLE_{0}]]'):
    """Replace each detected table in the PDF with a text marker and collect the tables as Markdown."""
    pdf = pymupdf.open(full_path)
    doc = PyPDFium2Document(full_path)
    formatter = define_formatter()
    tables_dict, cnt = defaultdict(list), 0
    for pnum, page in enumerate(tqdm(pdf)):
        tables = page.find_tables()
        for tab in tables:
            table = make_table(tab,doc,pnum,formatter)
            # Skip tables that could not be parsed or that have at most one row.
            if table is None or len(table) <= 1 : continue
            table_md = clean_table(table.to_markdown(index=False))
            # Erase the table region, then write the marker in its place.
            page.add_redact_annot(tab.bbox)
            page.apply_redactions()
            page.draw_rect(tab.bbox,color=(0,0,0),fill=(.99,.99,.99))
            tab_mark = tab_word.format(cnt)
            page.insert_htmlbox(tab.bbox,tab_mark,scale_low=0)
            tables_dict[pnum].append((tab_mark,table_md))
            cnt+=1

    return pdf, tables_dict

def extract_table_and_pdf(pdf_path,base_dir,save_dir):
    # Normalize the path and build an absolute path
    norm_path = normalize_path(pdf_path)
    if not os.path.isabs(norm_path):
      full_path = os.path.normpath(os.path.join(base_dir, norm_path.lstrip('./')))
    else : full_path = norm_path

    pdf_name = os.path.basename(full_path)
    print(f"Processing {pdf_name}...")
    save_path = os.path.join(save_dir, norm_path)
    pdf_dir = os.path.dirname(save_path)
    if not os.path.exists(pdf_dir) : os.makedirs(pdf_dir)
    new_pdf,tab_list = extract_tables_from_pdf(full_path,tab_word)
    new_pdf.save(save_path,garbage=4,deflate=True)
    return tab_list

def reform_pdfs_from_df(df, base_dir,save_dir,name='data'):
    """Extract the tables of every unique source PDF, save the reworked PDFs, and pickle the per-PDF table dict."""
    unique_paths = df['Source_path'].unique()
    tab_dict = dict()
    for path in tqdm(unique_paths, desc="Processing PDFs"):
      tab_dict[path]=extract_table_and_pdf(path,base_dir,save_dir)
    save_pkl(os.path.join(save_dir,'tables'),f'tab_{name}.pkl',tab_dict)
    return tab_dict

In [17]:
def convert_neg_idx(idxs,len_obj):
  # Map negative indices to their non-negative equivalents (the caller sorts the result).
  return [i + len_obj if i < 0 else i for i in idxs]

def add_escape(sent):
  idxs = list(filter(lambda x : sent[x] in ['[',']'],range(len(sent))))
  temp, idxs = list(sent), convert_neg_idx(idxs,len(sent))
  idxs = sorted(idxs)[::-1]
  for i in idxs:
    temp.insert(i,'\\')
  return ''.join(temp)

In [18]:
import re
from copy import deepcopy
from collections import defaultdict
from langchain_core.documents import Document as Doc

def get_former_idx(err_list):
  rslt = list()
  for i in err_list:
    cand = list(filter(lambda x : x not in err_list,range(i)))
    idx = max(cand) if cand else 0
    rslt.append(idx)
  return rslt

def get_latter_idx(err_list):
  rslt = list()
  for i in err_list:
    cand = list(filter(lambda x : x not in err_list,range(i,err_list[-1]+2)))
    idx = min(cand) if cand else err_list[-1]+1
    rslt.append(idx)
  return rslt

def make_table_page(content,tab_mark,table,tab_caption=None,th_len=100):
  if len(content)<len(tab_mark) : return None, None
  if tab_caption is None : tab_caption = tab_mark
  re_sep = r'[\s\|]*'  # raw string: tolerate whitespace and '|' between marker characters
  re_mark = insert_btwn_chr(tab_mark,re_sep)
  re_trgt = re.compile(re_mark)
  flag = re_trgt.search(content)
  if flag :
      # start()/end() give the span of the match; pos/endpos are only the search bounds
      front,end = flag.start(),flag.end()
      start = max(0,front-th_len)  # keep up to th_len characters of context before the marker
      new_page = content[start:front] + '\n' + table + f'\n{tab_caption}'
      page = content[:front]+tab_mark+content[end:]
  else : new_page, page = None, None
  return new_page,page

def get_insert_idx(former_idx,latter_idx,tab_page):
  rslt = list()
  for former,latter in zip(former_idx,latter_idx):
    pages = tab_page.values()
    first, last = 0,max(pages)
    i0 = tab_page[former] if former in tab_page else first
    i1 = tab_page[latter] if latter in tab_page else last
    rslt.append(int((i0+i1)/2))
  return rslt

def set_err_tab_page(new_pages,err_list,table_list,tab_page,tab_caption):
  if len(err_list) == 0 : return new_pages
  err_idx, err_tabs = zip(*err_list)
  if len(new_pages) == 0 : insert_idx = [-1 for _ in err_list]
  else :
    former_idx = get_former_idx(err_idx)
    latter_idx = get_latter_idx(err_idx)
    insert_idx = get_insert_idx(former_idx,latter_idx,tab_page)
  for page,i_tab,tab in zip(insert_idx,err_idx,err_tabs):
    content =tab +'\n'+ tab_caption.format(i_tab)
    new_pages[page].append(Doc(page_content=content))
  return new_pages

def convert_neg_num_page(page_dict,book_len):
  rslt = deepcopy(page_dict)
  for page,docs in page_dict.items():
    if page >= 0 : continue
    new_num = page+book_len
    del rslt[page]
    rslt[new_num] = docs
  return rslt

def insert_pages(doc_list,new_pages):
  new_pages = convert_neg_num_page(new_pages,len(doc_list))
  page_list = sorted(list(new_pages.keys()))[::-1]
  for page in page_list:
    if page >= len(doc_list) -1 : doc_list += new_pages[page]
    else : doc_list = doc_list[:page+1]+new_pages[page]+doc_list[page+1:]
  return doc_list

def get_table_page(num,doc_list,tab_mark,table):
    this_page = doc_list[num]['text']
    next_page = doc_list[num+1]['text'] if num+1 < len(doc_list) else ''
    both_page = this_page + next_page if next_page != '' else ''

    this_rslt,page0 = make_table_page(this_page,tab_mark,table)
    next_rslt,page1 = make_table_page(next_page,tab_mark,table)
    both_rslt,page2 = make_table_page(both_page,tab_mark,table)

    # Prefer a hit on this page, then the next page, then the two pages joined.
    if this_rslt is not None : page_content,page = this_rslt, page0
    elif next_rslt is not None : page_content,page,num = next_rslt, page1,num+1
    elif both_rslt is not None :
      page_content,page = both_rslt, page2[:len(this_page)-(len(both_page)-len(both_rslt))]
    else : page_content,page = None, this_page
    return page_content, page, num

def expand_pages(doc_list,new_pages):
  for page,tables in new_pages.items():
    content = doc_list[page]['text']+'\n'+'\n'.join(tables)
    doc_list[page]['text'] = content
  return doc_list

def insert_table_2_doc(doc_list,table_dict,tab_word='[[TABLE_{0}]]'):
  new_pages,cnt = defaultdict(list),0
  for num,table_info in table_dict.items():
    for tab_mark,table in table_info:
      table_page,page_adjst,page_num = get_table_page(num,doc_list,tab_mark,table)
      if table_page is not None:
        new_pages[num].append(table_page)
        doc_list[page_num]['text'] = page_adjst
        cnt+=1
      else : new_pages[num].append(table+'\n'+tab_mark)

  doc_list = expand_pages(doc_list,new_pages)
  return doc_list, cnt/sum(map(len,table_dict.values()))

#def insert_table_2_doc(doc_list,table_list,tab_word='[[TABLE_{0}]]'):
#  cnt = 0
#  tab_list=[Doc(page_content = '#별첨\n본문에 첨부되어 있던 표')]
#  for i,table in enumerate(table_list):
#    tab_mark = tab_word.format(i)
#    for num,doc in enumerate(doc_list):
#      table_page,page_adjst,page_num = get_table_page(num,doc_list,tab_mark,table)
#      if table_page is not None:
#        tab_list.append(Doc(page_content = table_page))
#        doc_list[page_num] = Doc(page_content=page_adjst,metadata=doc.metadata)
#        cnt+=1
#        break
#    else : tab_list.append(Doc(page_content = table))
#
#  doc_list = doc_list+tab_list
#  return doc_list, cnt/len(table_list)


#def insert_table_2_doc(doc_list,table_list,tab_word='[[TABLE_{0}]]'):
#  err_list,tab_page=[],dict()
#  new_pages = defaultdict(list)
#  for i,table in enumerate(table_list):
#    tab_mark = tab_word.format(i)
#    for num,doc in enumerate(doc_list):
#      table_page,page_adjst,page_num = get_table_page(num,doc_list,tab_mark,table)
#      if table_page is not None:
#        new_pages[page_num+1].append(Doc(page_content = table_page, metadata=doc.metadata))
#        tab_page[i] = page_num
#        tab_page[i],doc_list[page_num] = page_num, Doc(page_content=page_adjst,metadata=doc.metadata)
#        break
#    else : err_list.append([i,table])
#
#  new_pages = set_err_tab_page(new_pages,err_list,table_list,tab_page,tab_word)
#  return doc_list, 1-len(err_list)/len(table_list)

def insert_btwn_chr(sent,sep):
  c = list(add_escape(sent))
  d = c.copy()
  diff = (len(c)-len(sent))//2
  for i in range(2*diff+1,len(c)-diff*2+2)[::-1]:
    d.insert(i-1,sep)
  return ''.join(d)
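
The marker search has to tolerate whitespace and table pipes that the Markdown conversion may put between the marker's characters. The sketch below shows the same idea with a hypothetical helper, `tolerant_marker_pattern`, built on `re.escape`; it is a simplified stand-in for `add_escape` plus `insert_btwn_chr`, not a drop-in replacement.

```python
import re

def tolerant_marker_pattern(marker, sep=r'[\s|]*'):
    """Build a regex that matches marker even if whitespace or '|' sits between its characters."""
    return re.compile(sep.join(re.escape(ch) for ch in marker))

content = 'Budget overview\n| ! 표 3 ! |\nNext paragraph'
match = tolerant_marker_pattern('!표3!').search(content)

# start()/end() delimit the hit itself; pos/endpos are the bounds of the search.
before, after = content[:match.start()], content[match.end():]
print(repr(match.group(0)))
print(repr(before), repr(after))
print(match.pos, match.endpos, len(content))
```

Because every character of the marker is required in order, `!표1!` does not match inside `!표10!`.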



In [19]:
import difflib

def union_strs(str0,str1):
    output_list = difflib.ndiff(str0, str1)
    return ''.join(map(lambda x : x[-1],output_list))

def pdf_2_chunck_w_table(file_path, tables, tab_word,chunk_size=256, chunk_overlap=32):
    """Extract text from a PDF, put the tables back in, and split the result into chunks."""
    # Convert the PDF to Markdown, one entry per page
    doc = pymupdf4llm.to_markdown(file_path,page_chunks=True,table_strategy='lines')
    doc,rate = insert_table_2_doc(doc,tables,tab_word)
    doc0 = '\n'.join(map(lambda x : x['text'],doc))
#    doc1 = pymupdf4llm.to_markdown(file_path)
#    docs = union_strs(doc0,doc1)
    md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
    md_chunks = md_splitter.split_text(doc0)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(md_chunks)
    print(file_path)
    print(f'table mark detect rate : {rate:.5f}')

    return chunks,rate

def make_chunk_dict_from_df(df, base_dir, table_dict, tab_word, chunk_size=256):
    """Build a dict mapping each source path to its (chunks, detect rate) pair."""
    unique_paths = df['Source_path'].unique()
    chunk_dict = dict()

    for file_path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        full_path = process_path(base_dir,file_path)
        full_path = processed_path_matcher(base_dir,full_path)
        pdf_title = os.path.basename(full_path)
        print(f"Processing {pdf_title}...")

        # Split the PDF into chunks; tab_word must come before chunk_size
        chunk_dict[file_path]= pdf_2_chunck_w_table(full_path,table_dict[file_path],tab_word,chunk_size)
    return chunk_dict

In [20]:
# Ensemble
def process_pdfs_from_df(df, base_dir, table_dict, tab_word, chunk_size=256, model_path = "intfloat/multilingual-e5-small"):
    """Store the DB and retriever for each PDF in a dictionary keyed by PDF name."""
    pdf_databases = {}
    unique_paths = df['Source_path'].unique()
    rate_dict=dict()

    for file_path in tqdm(unique_paths, desc="Processing PDFs"):
        # Normalize the path and build an absolute path
        full_path = process_path(base_dir,file_path)
        full_path = processed_path_matcher(base_dir,full_path)
        pdf_title = os.path.basename(full_path)
        print(f"Processing {pdf_title}...")

        # Process the PDF and build the vector DB
        chunks,rate =pdf_2_chunck_w_table(full_path,table_dict[file_path],tab_word,chunk_size)
        db = create_vector_db(chunks, model_path=model_path)

        kiwi_bm25_retriever = KiwiBM25Retriever.from_documents(chunks)
        faiss_retriever = db.as_retriever()
        retriever = EnsembleRetriever(
            retrievers=[kiwi_bm25_retriever, faiss_retriever],
            weights=[0.5, 0.5],
            search_type="mmr",
        )

        # Store the results
        pdf_databases[pdf_title] = {
                'db': db,
                'retriever': retriever
        }
        rate_dict[pdf_title] = rate
    return pdf_databases, rate_dict
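
`EnsembleRetriever` merges the ranked lists returned by its retrievers; LangChain's documentation describes the merge as weighted Reciprocal Rank Fusion. The sketch below illustrates that idea in plain Python and is not the library's code: `weighted_rrf`, the constant `c=60`, and the chunk ids are assumptions chosen for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list, summed over lists."""
    scores = defaultdict(float)
    for ranked_docs, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results from a keyword retriever and a vector retriever.
bm25_ranking = ['chunk_a', 'chunk_b', 'chunk_c']
faiss_ranking = ['chunk_b', 'chunk_d', 'chunk_a']

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print(fused)
```

With equal weights, a chunk that both retrievers rank highly ends up ahead of a chunk that only one of them returns.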

### extract tables and reform pdfs

In [21]:
headers_to_split_on = [
    ("#","Header 1"),
    ("##","Header 2"),
    ("###","Header 3"),
]

tab_word = '!표{0}!'

In [22]:
train_df = pd.read_csv(f'{base_dir}train.csv')
test_df = pd.read_csv(f'{base_dir}test.csv')

In [None]:
PROCESSEDDIR = os.path.join(base_dir,'processed')
if not os.path.exists(PROCESSEDDIR) : os.makedirs(PROCESSEDDIR)

reform_pdfs_from_df(train_df, base_dir,PROCESSEDDIR,'trn')
reform_pdfs_from_df(test_df, base_dir,PROCESSEDDIR,'tst');

# Dataset Config

In [23]:
tab_ver = 'tab_v2.2'
model_dict = {
'large':"intfloat/multilingual-e5-large",
'base':"intfloat/multilingual-e5-base",
}

if tab_ver == 'tab_v0' : file_dir = base_dir
else : file_dir = os.path.join(base_dir,'processed',tab_ver)

model_option = 'large'
#model_option = 'base'
model_path = model_dict[model_option]
chunk_size = 256

In [52]:
aug_type = "AugGPT"

In [27]:
aug_type= 'AugAEDA'

In [28]:
aug_type= 'GPTOnly'

In [24]:
aug_type= 'NoAug'

In [53]:
db_config = {
    'model' : model_option,
    'tab_process' : tab_ver,
    'aug' : aug_type,
    'chunck_size' : chunk_size
}

In [54]:
db_name = "{model}-ensemble-{tab_process}-{chunck_size}".format(**db_config)

## Split train/valid

In [27]:
base_dir = path

In [28]:
data_df = pd.read_csv(os.path.join(base_dir,'train.csv'))
test_df = pd.read_csv(os.path.join(base_dir,'test.csv'))

In [29]:
from sklearn.model_selection import train_test_split

train_df,valid_df = train_test_split(data_df,test_size=0.2,stratify=data_df.Source,random_state=801)
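
Passing `stratify=data_df.Source` keeps each source document's share of questions roughly equal in the train and validation parts. The standard-library sketch below shows the idea only; `stratified_split` and the labels are hypothetical, and scikit-learn's own implementation differs in detail.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=801):
    """Return (train_idx, valid_idx) so that each label keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, valid_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_valid = round(len(idxs) * test_size)  # per-label validation quota
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

labels = ['report_a'] * 50 + ['report_b'] * 30 + ['report_c'] * 20
train_idx, valid_idx = stratified_split(labels)
print(Counter(labels[i] for i in valid_idx))
```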

## Apply Augmentation

In [30]:
ls {base_dir}

241008_csv_checker.ipynb             gemma2_financeQA-finetune/  test_source/
combined_train_aug_v3.5_editted.csv  processed/                  train.csv
combined_train_aug_v3.csv            sample_submission.csv       train_source/
combined_train_aug_v3_editted.csv    sub/                        Untitled0.ipynb
data/                                temp/
eval/                                test.csv


In [31]:
aug_file = 'combined_train_aug_v3.5_editted.csv'
aug_path = os.path.join(base_dir,aug_file)

In [32]:
ques_dict={
    'NoAug' : 'Question',
    'AugGPT' : 'Question_aug_GPT',
    'AugAEDA' : 'AEDA_Question',
    'GPTOnly' : 'Question_aug_GPT',
}
ans_dict = {
    'NoAug' : 'Answer',
    'AugGPT' : 'Answer',
    'AugAEDA' : 'Answer',
    'GPTOnly' : 'Answer',
}

In [33]:
key_col = 'SAMPLE_ID'
info_col = ['Source', 'Source_path']
ques_base = ques_dict['NoAug']
ans_base = ans_dict['NoAug']
ques_col = ques_dict[aug_type]
ans_col = ans_dict[aug_type]

In [34]:
filter_list = ['TRAIN_451', 'TRAIN_452', 'TRAIN_453', 'TRAIN_454', 'TRAIN_455', 'TRAIN_456']

In [35]:
aug_df = pd.read_csv(aug_path,sep='\t')
aug_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   SAMPLE_ID         496 non-null    object
 1   Source            496 non-null    object
 2   Source_path       496 non-null    object
 3   Question          496 non-null    object
 4   Answer            496 non-null    object
 5   Question_aug_GPT  496 non-null    object
 6   Answer_aug_GPT    496 non-null    object
 7   AEDA_Question     496 non-null    object
 8   AEDA_Answer       496 non-null    object
dtypes: object(9)
memory usage: 35.0+ KB


In [36]:
import numpy as np
train_id = train_df[key_col].values
#print(pd.Series(filter_list).isin(train_id))
cond = (aug_df[key_col].isin(train_id)) # & (~(aug_df[key_col].isin(filter_list)))
display(aug_df.columns), len(train_id), np.sum(cond)

Index(['SAMPLE_ID', 'Source', 'Source_path', 'Question', 'Answer',
       'Question_aug_GPT', 'Answer_aug_GPT', 'AEDA_Question', 'AEDA_Answer'],
      dtype='object')

(None, 396, 396)

In [37]:
col_list = [key_col]+info_col+[ques_base,ans_base]
train_adjst = aug_df.loc[cond,col_list]
train_df= train_adjst.rename(columns = {ques_base : "Question", ans_base : "Answer"})  # selected columns are the base ones

In [38]:
if aug_type != 'NoAug':
  col_list = [key_col]+info_col+[ques_col,ans_col]
  aug_train = aug_df.loc[cond,col_list]
  aug_train = aug_train.rename(columns = {ques_col : "Question", ans_col : "Answer"})
  if 'Only' in aug_type : train_augged=aug_train
  else : train_augged= pd.concat([train_df,aug_train])
  train_augged.info()

In [39]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 396 entries, 0 to 495
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   SAMPLE_ID    396 non-null    object
 1   Source       396 non-null    object
 2   Source_path  396 non-null    object
 3   Question     396 non-null    object
 4   Answer       396 non-null    object
dtypes: object(5)
memory usage: 18.6+ KB


In [40]:
valid_id = valid_df[key_col].values
cond = (aug_df[key_col].isin(valid_id)) # & (~(aug_df[key_col].isin(filter_list)))
col_list = [key_col]+info_col+[ques_base,ans_base]
valid_adjst = aug_df.loc[cond,col_list]
valid_df= valid_adjst.rename(columns = {ques_base : "Question", ans_base : "Answer"})  # selected columns are the base ones

In [41]:
free_cuda()

freed :  30
freed :  0


# DB Creation

In [42]:
temp_path = '/content/src/'
file_dir, os.listdir(file_dir)

('/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v2.2',
 ['train_source', 'tables', 'test_source'])

In [48]:
src_dirs = list(filter(lambda x : x != 'pdf_db',os.listdir(file_dir)))
file_path = ' '.join([os.path.join(file_dir,sub) for sub in src_dirs])
if not os.path.exists(temp_path) : os.makedirs(temp_path)

In [46]:
!rsync -rvzh {file_path} {temp_path} --bwlimit 4096000000000000 --progress

sending incremental file list
tables/
tables/tab_trn.pkl
          2.26M 100%  193.43MB/s    0:00:00 (xfr#1, to-chk=26/30)
tables/tab_tst.pkl
        114.96K 100%  244.06kB/s    0:00:00 (xfr#2, to-chk=25/30)
test_source/
test_source/국토교통부_행복주택출자.pdf
          1.87M 100%    1.46MB/s    0:00:01 (xfr#3, to-chk=24/30)
test_source/보건복지부_노인장기요양보험 사업운영.pdf
          1.97M 100%    4.66MB/s    0:00:00 (xfr#4, to-chk=23/30)
test_source/보건복지부_부모급여(영아수당) 지원.pdf
          1.91M 100%    1.56MB/s    0:00:01 (xfr#5, to-chk=22/30)
test_source/산업통상자원부_에너지바우처.pdf
          1.99M 100%    2.83MB/s    0:00:00 (xfr#6, to-chk=21/30)
test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf
          1.84M 100%    1.56MB/s    0:00:01 (xfr#7, to-chk=20/30)
test_source/「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf
          2.55M 100%    4.01MB/s    0:00:00 (xfr#8, to-chk=19/30)
test_source/「FIS 이슈 & 포

In [43]:
pkl_dir = os.path.join(temp_path,'tables')
tab_dict_trn = load_pkl(os.path.join(pkl_dir,'tab_trn.pkl'))
tab_dict_tst= load_pkl(os.path.join(pkl_dir,'tab_tst.pkl'))

In [44]:
tab_word

'!표{0}!'

In [50]:
train_db, detect_rate = process_pdfs_from_df(train_df, temp_path, tab_dict_trn, tab_word, chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/16 [00:00<?, ?it/s]

Processing 1-1 2024 주요 재정통계 1권.pdf...
Processing /content/src/train_source/1-1 2024 주요 재정통계 1권.pdf...
/content/src/train_source/1-1 2024 주요 재정통계 1권.pdf
table mark detect rate : 0.94444


  embeddings = HuggingFaceEmbeddings(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Processing 2024 나라살림 예산개요.pdf...
Processing /content/src/train_source/2024 나라살림 예산개요.pdf...
/content/src/train_source/2024 나라살림 예산개요.pdf
table mark detect rate : 0.66447
Processing 재정통계해설.pdf...
Processing /content/src/train_source/재정통계해설.pdf...
/content/src/train_source/재정통계해설.pdf
table mark detect rate : 0.85000
Processing 국토교통부_전세임대(융자).pdf...
Processing /content/src/train_source/국토교통부_전세임대(융자).pdf...
/content/src/train_source/국토교통부_전세임대(융자).pdf
table mark detect rate : 1.00000
Processing 고용노동부_청년일자리창출지원.pdf...
Processing /content/src/train_source/고용노동부_청년일자리창출지원.pdf...
/content/src/train_source/고용노동부_청년일자리창출지원.pdf
table mark detect rate : 1.00000
Processing 고용노동부_내일배움카드(일반).pdf...
Processing /content/src/train_source/고용노동부_내일배움카드(일반).pdf...
/content/src/train_source/고용노

In [51]:
detect_rate

{'1-1 2024 주요 재정통계 1권.pdf': 0.9444444444444444,
 '2024 나라살림 예산개요.pdf': 0.6644736842105263,
 '재정통계해설.pdf': 0.85,
 '국토교통부_전세임대(융자).pdf': 1.0,
 '고용노동부_청년일자리창출지원.pdf': 1.0,
 '고용노동부_내일배움카드(일반).pdf': 1.0,
 '보건복지부_노인일자리 및 사회활동지원.pdf': 1.0,
 '중소벤처기업부_창업사업화지원.pdf': 0.75,
 '보건복지부_생계급여.pdf': 1.0,
 '국토교통부_소규모주택정비사업.pdf': 0.7142857142857143,
 '국토교통부_민간임대(융자).pdf': 0.75,
 '고용노동부_조기재취업수당.pdf': 0.4444444444444444,
 '2024년도 성과계획서(총괄편).pdf': 0.6307692307692307,
 '「FIS 이슈 & 포커스」 23-3호 《조세지출 연계관리》.pdf': 0.7272727272727273,
 '「FIS 이슈 & 포커스」 22-3호 《재정융자사업》.pdf': 1.0,
 '월간 나라재정 2023년 12월호.pdf': 1.0}

In [53]:
free_cuda()

freed :  0


In [45]:
aug_type

'NoAug'

In [46]:
test_db, detect_rate = process_pdfs_from_df(test_df, temp_path, tab_dict_tst, tab_word[1:-1], chunk_size=chunk_size, model_path=model_path)

Processing PDFs:   0%|          | 0/9 [00:00<?, ?it/s]

Processing 중소벤처기업부_혁신창업사업화자금(융자).pdf...
Processing /content/src/test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf...
/content/src/test_source/중소벤처기업부_혁신창업사업화자금(융자).pdf
table mark detect rate : 0.75000


  embeddings = HuggingFaceEmbeddings(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Processing 보건복지부_부모급여(영아수당) 지원.pdf...
Processing /content/src/test_source/보건복지부_부모급여(영아수당) 지원.pdf...
/content/src/test_source/보건복지부_부모급여(영아수당) 지원.pdf
table mark detect rate : 1.00000
Processing 보건복지부_노인장기요양보험 사업운영.pdf...
Processing /content/src/test_source/보건복지부_노인장기요양보험 사업운영.pdf...
/content/src/test_source/보건복지부_노인장기요양보험 사업운영.pdf
table mark detect rate : 1.00000
Processing 산업통상자원부_에너지바우처.pdf...
Processing /content/src/test_source/산업통상자원부_에너지바우처.pdf...
/content/src/test_source/산업통상자원부_에너지바우처.pdf
table mark detect rate : 0.81250
Processing 국토교통부_행복주택출자.pdf...
Processing /content/src/test_source/국토교통부_행복주택출자.pdf...
/content/src/test_source/국토교통부_행복주택출자.pdf
table mark detect rate : 0.57143
Processing 「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재

In [47]:
detect_rate

{'중소벤처기업부_혁신창업사업화자금(융자).pdf': 0.75,
 '보건복지부_부모급여(영아수당) 지원.pdf': 1.0,
 '보건복지부_노인장기요양보험 사업운영.pdf': 1.0,
 '산업통상자원부_에너지바우처.pdf': 0.8125,
 '국토교통부_행복주택출자.pdf': 0.5714285714285714,
 '「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf': 0.7692307692307693,
 '「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》.pdf': 0.0,
 '「FIS 이슈&포커스」 22-2호 《재정성과관리제도》.pdf': 0.5555555555555556,
 '「FIS 이슈 & 포커스」(신규) 통권 제1호 《우발부채》.pdf': 0.8823529411764706}
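The detection rates above vary widely, and one document scores 0.0. A minimal sketch for listing the files that fall below a cutoff so they can be inspected by hand. The `low_detect` helper and the 0.6 threshold are assumptions made for illustration and are not part of the original pipeline; the dictionary values are copied from the output above.

```python
def low_detect(detect_rate, threshold=0.6):
    """Return (filename, rate) pairs whose rate is below `threshold`, lowest first."""
    flagged = [(name, rate) for name, rate in detect_rate.items() if rate < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Values copied from the test-set `detect_rate` output.
detect_rate_test = {
    '중소벤처기업부_혁신창업사업화자금(융자).pdf': 0.75,
    '보건복지부_부모급여(영아수당) 지원.pdf': 1.0,
    '보건복지부_노인장기요양보험 사업운영.pdf': 1.0,
    '산업통상자원부_에너지바우처.pdf': 0.8125,
    '국토교통부_행복주택출자.pdf': 0.5714285714285714,
    '「FIS 이슈 & 포커스」 22-4호 《중앙-지방 간 재정조정제도》.pdf': 0.7692307692307693,
    '「FIS 이슈 & 포커스」 23-2호 《핵심재정사업 성과관리》.pdf': 0.0,
    '「FIS 이슈&포커스」 22-2호 《재정성과관리제도》.pdf': 0.5555555555555556,
    '「FIS 이슈 & 포커스」(신규) 통권 제1호 《우발부채》.pdf': 0.8823529411764706,
}

# Print the documents whose table marks were detected least reliably.
for name, rate in low_detect(detect_rate_test):
    print(f"{rate:.5f}  {name}")
```

With the 0.6 cutoff, three of the nine test documents are flagged. The right threshold depends on how much table content each PDF is expected to have.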

In [57]:
file_dir

'/content/drive/MyDrive/kdt-EST-AI/project/dacon_fis/src/processed/tab_v2.2'

In [58]:
db_name

'large-ensemble-tab_v2.2-256'

In [31]:
#db_path = os.path.join('/content','pdf_db')
db_path = os.path.join(file_dir,'pdf_db')
#save_pkl(db_path, f'{db_name}_train.dat',train_db)
#save_pkl(db_path, f'{db_name}_test.dat',test_db)

In [32]:
os.listdir(db_path)

['large-ensemble-tab_v1.7-256_train.dat',
 'large-ensemble-tab_v1.7-256_test.dat',
 'base-ensemble-tab_v1.7-256_test.dat',
 'base-ensemble-tab_v1.7-256_train.dat']

In [33]:
train_db_name = 'large-ensemble-tab_v1.7-256_train.dat'
test_db_name = 'large-ensemble-tab_v1.7-256_test.dat'
train_db_path = os.path.join(db_path,train_db_name)
test_db_path = os.path.join(db_path,test_db_name)

In [34]:
file_path = ' '.join([
    # train_db_path,
    test_db_path,
])
temp_path = '/content/pdf_db'
os.makedirs(temp_path, exist_ok=True)

In [35]:
!rsync -vzh {file_path} {temp_path} --bwlimit 4096000000000000 --progress

large-ensemble-tab_v1.7-256_test.dat
         20.24G 100%   29.64MB/s    0:10:51 (xfr#1, to-chk=0/1)

sent 11.91G bytes  received 35 bytes  18.25M bytes/sec
total size is 20.24G  speedup is 1.70


In [36]:
#train_db = load_pkl(os.path.join(temp_path,train_db_name))
test_db = load_pkl(os.path.join(temp_path,test_db_name))

# Create Dataset

In [48]:
def normalize_string(s):
    """Normalize a string to Unicode NFC."""
    return unicodedata.normalize('NFC', s)

def format_docs(docs):
    """Format the retrieved documents into a single newline-separated string."""
    context = ""
    for doc in docs:
        context += doc.page_content
        context += '\n'
    return context

def make_dataset(df, pdf_databases):
    """Build context/question/answer lists by querying each row's source retriever."""
    dataset = dict()
    dataset['context'] = list()
    dataset['question'] = list()
    dataset['answer'] = list()
    # Normalize the database keys so they match the normalized source names
    normalized_keys = {normalize_string(k): v for k, v in pdf_databases.items()}

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Making"):
        # Normalize the source string
        source = normalize_string(row['Source'])+'.pdf'
        question = row['Question']
        dataset['question'].append(question)
        # The test set has no 'Answer' column, so store an empty string instead
        if 'Answer' in df.columns:
            dataset['answer'].append(row['Answer'])
        else:
            dataset['answer'].append('')

        # Look up the database with the normalized key
        retriever = normalized_keys[source]['retriever']
        context = format_docs(retriever.invoke(question))
        dataset['context'].append(context)
    return dataset
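
The key normalization in `make_dataset` matters because Korean file names can arrive in decomposed form (NFD), as is common for names created on macOS, while the `Source` column may be in composed form (NFC). The two forms look identical on screen but compare unequal. A minimal sketch of the mismatch, using one file name from the outputs above and a placeholder string in place of a real retriever:

```python
import unicodedata

name_nfc = unicodedata.normalize('NFC', '보건복지부_생계급여.pdf')
name_nfd = unicodedata.normalize('NFD', name_nfc)

# Same visible text, different code point sequences.
print(name_nfc == name_nfd)
print(len(name_nfc), len(name_nfd))

# A database keyed by the decomposed name cannot be found with the composed one.
db = {name_nfd: 'retriever placeholder'}
print(name_nfc in db)

# Normalizing the keys first, as make_dataset does, makes the lookup succeed.
normalized = {unicodedata.normalize('NFC', k): v for k, v in db.items()}
print(name_nfc in normalized)
```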


# Dataset

In [49]:
if aug_type != 'NoAug':
  train_df = train_augged

In [55]:
dataset_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-refined0{aug}-{chunck_size}".format(**db_config)
train_name = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-refined0{aug}-{chunck_size}".format(**db_config)
#fname = "gemma2_large_ensemble_markdown_256_5epoch_reprocessed_result.csv"

push_url = dataset_name
push_url

'kdt3/DACON-QA-large-ensemble-tab_v2.2-refined0AugGPT-256'
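
`db_config` is defined earlier in the notebook. As a sketch, a dictionary like the following would reproduce the name shown above. The values here are inferred from that output and are an assumption. The key spelling `chunck_size` is kept exactly as the template uses it.

```python
# Hypothetical values, inferred from the printed repository name.
db_config = {
    'model': 'large',
    'tab_process': 'tab_v2.2',
    'aug': 'AugGPT',
    'chunck_size': 256,
}

template = "kdt3/DACON-QA-{model}-ensemble-{tab_process}-refined0{aug}-{chunck_size}"
name = template.format(**db_config)
print(name)
```

Because the repository name is derived from `db_config`, changing `aug` (for example to `'NoAug'`) changes the target repository for every later `push_to_hub` call.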

In [None]:
## Reference code: how to merge the parts if the dataset has to be uploaded in several pieces
from datasets import load_dataset, concatenate_datasets
from datasets import Dataset

train_dataset = load_dataset(dataset_name)['train']

train_dataset = concatenate_datasets([train_dataset, Dataset.from_dict(make_dataset(train_df.iloc[296:], train_db))])
train_dataset.push_to_hub(dataset_name, private=True, split='train')
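
An alternative to merging on the Hub is to combine the partial `make_dataset` outputs locally before a single upload. A minimal sketch in plain Python, assuming each part is a dict of equal-length lists with the same keys; `merge_parts` is a hypothetical helper, not part of the notebook or of the `datasets` library.

```python
def merge_parts(parts):
    """Concatenate dict-of-lists parts that share the same keys, preserving order."""
    merged = {key: [] for key in parts[0]}
    for part in parts:
        if part.keys() != merged.keys():
            raise ValueError("all parts must have the same columns")
        for key, values in part.items():
            merged[key].extend(values)
    return merged


# Toy stand-ins for two partial make_dataset results.
part_a = {'context': ['c0'], 'question': ['q0'], 'answer': ['a0']}
part_b = {'context': ['c1', 'c2'], 'question': ['q1', 'q2'], 'answer': ['a1', 'a2']}

merged = merge_parts([part_a, part_b])
print(len(merged['question']))
```

The merged dict can then be passed to `Dataset.from_dict` in the same way as a single `make_dataset` result.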


## Create & Upload Train Data

In [76]:
from datasets import Dataset
train_dataset = make_dataset(train_df, train_db)
train_dataset = Dataset.from_dict(train_dataset)
train_dataset.push_to_hub(push_url, private=True, split='train')


Making:   0%|          | 0/792 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/448 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-large-ensemble-tab_v2.2-refined0NoAug-256/commit/26867e51b5a4612cadd26270773d5a99c0cdc725', commit_message='Upload dataset', commit_description='', oid='26867e51b5a4612cadd26270773d5a99c0cdc725', pr_url=None, pr_revision=None, pr_num=None)

## Create & Upload Valid Data

In [77]:
from datasets import Dataset
valid_dataset = make_dataset(valid_df, train_db)
valid_dataset = Dataset.from_dict(valid_dataset)
valid_dataset.push_to_hub(push_url, private=True, split='valid')

Making:   0%|          | 0/100 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/448 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-large-ensemble-tab_v2.2-refined0NoAug-256/commit/e147b138f86f2b41cf624565ea0da9c2fbb16bdd', commit_message='Upload dataset', commit_description='', oid='e147b138f86f2b41cf624565ea0da9c2fbb16bdd', pr_url=None, pr_revision=None, pr_num=None)

## Create & Upload Test Data

In [56]:
from datasets import Dataset
test_dataset = make_dataset(test_df, test_db)
test_dataset = Dataset.from_dict(test_dataset)
test_dataset.push_to_hub(push_url, private=True, split='test')

Making:   0%|          | 0/98 [00:00<?, ?it/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/447 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/kdt3/DACON-QA-large-ensemble-tab_v2.2-refined0AugGPT-256/commit/b43386be271e9d2df5b0284bc8ed2af141163484', commit_message='Upload dataset', commit_description='', oid='b43386be271e9d2df5b0284bc8ed2af141163484', pr_url=None, pr_revision=None, pr_num=None)