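### Large-scale datasets
- When to use one
    - When the amount of available training data is comparable to the amount used to pretrain the model
    - When the domain gap is large
- Caveats
    - Much of the data is generated (semi-)automatically, so its quality can be low.
    - Typical problems: bias, low quality, **copyright violations**, and so on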

### Example: same model, different datasets

In [76]:
from transformers import pipeline, set_seed

set_seed(42)
gpt_pipe_1 = pipeline("text-generation", model="openai-gpt")
gpt_pipe_2 = pipeline("text-generation", model="gpt2")

Device set to use cuda:0
Device set to use cuda:0


In [77]:
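print("Parameter count comparison")
# Caution: len(p) returns only the size of each tensor's first dimension,
# so these sums are a rough proxy, not true parameter counts.
# Use p.numel() instead of len(p) to get the exact totals.
(sum([len(p) for p in gpt_pipe_1.model.parameters()]),
 sum([len(p) for p in gpt_pipe_2.model.parameters()]))

Parameter count comparison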


(225310, 237137)

In [78]:
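# Compare the outputs of both models for the same prompt
def pipe_out(pipe, prompt, num_return_sequences):
    """Generate `num_return_sequences` continuations of `prompt` with `pipe`."""
    out = pipe(
        prompt,
        num_return_sequences=num_return_sequences,
        clean_up_tokenization_spaces=True,
        truncation=True,
    )
    return out

display(pipe_out(gpt_pipe_1, '\n UDA is', 3))  # GPT: trained on BookCorpus (books, including many romance novels)
print("*" * 50)
display(pipe_out(gpt_pipe_2, '\n UDA is', 3))  # GPT-2: trained on WebText (web pages linked from Reddit)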

[{'generated_text': '\n UDA is no - kill. " \n " he\'s a little young to be a ranger to a ranger. " \n " no - kill ain\'t no cowboy. i\'m only a tracker for the local rangers. " \n " no - kill'},
 {'generated_text': '\n UDA is the very first one in the line who can speak. so far, it\'s been an exceptionally hard night. " \n " does the wolf really belong to you? " \n " he does. " \n i stared at him, perplexed'},
 {'generated_text': "\n UDA isn't dead. her life is still in the water. \n he wanted to believe her. needed to believe it. \n but she 'd gotten him into this. there was no other way to prove her story, no other way to prove"}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


**************************************************


[{'generated_text': '\n UDA is an open source game engine for Linux that brings together the best technologies of the desktop. In a recent blog post, UDA developer John Daley explains that in many respects this open source effort was the perfect vehicle for a "modern'},
 {'generated_text': '\n UDA is the most expensive driver to purchase in India, where around $450 million is spent in 2013-14 (see chart ). The value of the RTE licence can be measured in USD, which gives the price. For RTE buyers'},
 {'generated_text': "\n UDA is on the horizon.\n\nAnd maybe most damning, this is just the latest in a long line of examples of UDA's failure to meet its commitments to protect the U.S. from China's cyber attacks.\n\n"}]

In [79]:
del gpt_pipe_1
del gpt_pipe_2

import torch

# Release cached CUDA memory
torch.cuda.empty_cache()

In [2]:
import torch

# Check how much GPU memory is currently allocated by tensors (in bytes)
print(torch.cuda.memory_allocated())

0


### Fetching the code dataset

In [14]:
!git clone https://huggingface.co/datasets/transformersbook/codeparrot
!cd codeparrot && git lfs pull

fatal: destination path 'codeparrot' already exists and is not an empty directory.
Downloading LFS objects: 100% (184/184), 46 GB | 58 MB/s                        

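### Working with large datasets
- The code dataset is about 46 GB compressed and roughly 200 GB once decompressed.
- Memory mapping and streaming make it usable without loading everything into RAM.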


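### Memory mapping
- Zero-copy, zero-overhead memory mapping
- The dataset is cached as a file on disk.
- Instead of loading the dataset into RAM, a pointer to the cached file is opened, and data is read from that file only when it is needed.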
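
The idea behind memory mapping can be illustrated with Python's standard `mmap` module. This is only a minimal sketch of the general technique, not how `datasets` works internally (it builds on Apache Arrow); the file name `toy_dataset.bin` and the fixed-width record layout are made up for illustration.

```python
import mmap
import os
import tempfile

# Hypothetical fixed-width records, written to a temporary file for the demo.
RECORD_SIZE = 16
records = [f"record-{i:08d}".encode().ljust(RECORD_SIZE) for i in range(1000)]

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.bin")
with open(path, "wb") as f:
    f.write(b"".join(records))


def read_record(mapped, index):
    """Return one record by slicing the mapping instead of reading the whole file."""
    start = index * RECORD_SIZE
    return mapped[start:start + RECORD_SIZE].rstrip().decode()


with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = read_record(mapped, 0)
        last = read_record(mapped, 999)
        total_bytes = len(mapped)

print(first, last, total_bytes)
```

Random access by index stays cheap because the operating system pages in only the parts of the file that are actually touched.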

In [15]:
from datasets import load_dataset, DownloadConfig

download_config = DownloadConfig(delete_extracted=True)
dataset = load_dataset("./codeparrot", split="train",
                       download_config=download_config)

Downloading data: 100%|██████████| 184/184 [00:00<00:00, 1527.57files/s]
Generating train split: 18695559 examples [59:12, 5262.95 examples/s]


In [19]:
import psutil
import os

print(f"Number of examples : {len(dataset)}")
ds_size = sum(os.stat(f["filename"]).st_size for f in dataset.cache_files)
print(f"Cache size : {ds_size / 2 ** 30:.2f} GB")
print(f"RAM used : {psutil.Process(os.getpid()).memory_info().rss >> 20} MB")

Number of examples : 18695559
Cache size : 183.59 GB
RAM used : 1463 MB


### Streaming the dataset

In [35]:
# A streamed dataset is an IterableDataset, so it cannot be indexed like streamed_dataset[0].

streamed_dataset = load_dataset('./codeparrot', split='train', streaming=True) # IterableDataset
next(iter(streamed_dataset))

{'repo_name': 'ahmedbodi/AutobahnPython',
 'path': 'examples/asyncio/websocket/echo/client_coroutines.py',
 'copies': '13',
 'size': '2044',
 'content': '###############################################################################\n##\n##  Copyright (C) 2013-2014 Tavendo GmbH\n##\n##  Licensed under the Apache License, Version 2.0 (the "License");\n##  you may not use this file except in compliance with the License.\n##  You may obtain a copy of the License at\n##\n##      http://www.apache.org/licenses/LICENSE-2.0\n##\n##  Unless required by applicable law or agreed to in writing, software\n##  distributed under the License is distributed on an "AS IS" BASIS,\n##  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n##  See the License for the specific language governing permissions and\n##  limitations under the License.\n##\n###############################################################################\n\nfrom autobahn.asyncio.websocket import WebSocketClien
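
Streaming boils down to reading examples lazily, one at a time. The sketch below imitates that behaviour with the standard library, assuming each shard is a gzipped file with one JSON object per line. The shard name and the records are made up for illustration, and this is not how `datasets` implements `streaming=True`.

```python
import gzip
import itertools
import json
import os
import tempfile

# Write a tiny gzipped JSON-lines shard so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy-shard.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(5):
        record = {"path": f"file_{i}.py", "content": f"print({i})"}
        f.write(json.dumps(record) + "\n")


def stream_examples(shard_paths):
    """Yield one example at a time without holding a whole shard in memory."""
    for shard_path in shard_paths:
        with gzip.open(shard_path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


stream = stream_examples([path])
first = next(stream)                          # comparable to next(iter(streamed_dataset))
next_two = list(itertools.islice(stream, 2))  # take a few more examples, still lazily
print(first["path"], [example["path"] for example in next_two])
```

Because the generator only moves forward, there is no random access, which is why indexing is unavailable on a streamed dataset.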

In [41]:
# Remote dataset, streamed directly from the Hub
remote_dataset = load_dataset('transformersbook/codeparrot', split='train', streaming=True)

In [50]:
for data in remote_dataset:
    display(data)  # show a single sample
    break  # stop after the first sample

{'repo_name': 'ahmedbodi/AutobahnPython',
 'path': 'examples/asyncio/websocket/echo/client_coroutines.py',
 'copies': '13',
 'size': '2044',
 'content': '###############################################################################\n##\n##  Copyright (C) 2013-2014 Tavendo GmbH\n##\n##  Licensed under the Apache License, Version 2.0 (the "License");\n##  you may not use this file except in compliance with the License.\n##  You may obtain a copy of the License at\n##\n##      http://www.apache.org/licenses/LICENSE-2.0\n##\n##  Unless required by applicable law or agreed to in writing, software\n##  distributed under the License is distributed on an "AS IS" BASIS,\n##  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n##  See the License for the specific language governing permissions and\n##  limitations under the License.\n##\n###############################################################################\n\nfrom autobahn.asyncio.websocket import WebSocketClien

In [51]:
from huggingface_hub import login
import json

with open("hf_key_token.json") as f:
    token = json.load(f)["hf_key_token"]

login(token)

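### Creating the datasets and committing them to the Hugging Face Hub

- Create the Hugging Face repositories and clone them with git
- In bash
    - Log in to Hugging Face and create the repositories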
        ```bash
        $ huggingface-cli login
        $ huggingface-cli repo create --type dataset codeparrot-train
        $ huggingface-cli repo create --type dataset codeparrot-valid
        ```

    - Clone the repositories
        ```bash
        $ git clone https://huggingface.co/datasets/tommyjin/codeparrot-valid
        $ git clone https://huggingface.co/datasets/tommyjin/codeparrot-train
        ```

- Copy every shard except the last one into the training set and commit
    ```bash
    $ cd codeparrot-train
    $ cp ../codeparrot/*.json.gz .
    $ rm ./file-000000000183.json.gz
    $ git add .
    $ git commit -m "Adding dataset files"
    $ git push
    ```
- Copy the last shard into the validation set and commit
    ```bash
    $ cd ../codeparrot-valid
    $ cp ../codeparrot/file-000000000183.json.gz .
    $ mv ./file-000000000183.json.gz ./file-000000000183_validation.json.gz
    $ git add .
    $ git commit -m "Adding dataset files"
    $ git push
    ```
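
The same split (all shards but the last for training, the last one renamed for validation) could also be scripted. The following is a hedged sketch only: `split_shards` is a hypothetical helper, the demo runs on empty placeholder files in a temporary directory, and the `git add` / `git commit` / `git push` steps are deliberately left out.

```python
import shutil
import tempfile
from pathlib import Path


def split_shards(source_dir, train_dir, valid_dir):
    """Copy all shards but the last into train_dir and the last into valid_dir."""
    shards = sorted(Path(source_dir).glob("*.json.gz"))
    *train_shards, valid_shard = shards
    for directory in (train_dir, valid_dir):
        Path(directory).mkdir(parents=True, exist_ok=True)
    for shard in train_shards:
        shutil.copy(shard, Path(train_dir) / shard.name)
    # Mark the held-out shard as validation data in its file name.
    valid_name = valid_shard.name.replace(".json.gz", "_validation.json.gz")
    shutil.copy(valid_shard, Path(valid_dir) / valid_name)
    return len(train_shards), valid_name


# Demo with four empty placeholder shards.
root = Path(tempfile.mkdtemp())
(root / "codeparrot").mkdir()
for i in range(4):
    (root / "codeparrot" / f"file-{i:012d}.json.gz").touch()

n_train, valid_name = split_shards(
    root / "codeparrot", root / "codeparrot-train", root / "codeparrot-valid"
)
print(n_train, valid_name)
```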

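### Building a tokenizer

- Overall pipeline:
    1. Normalization
    2. Pretokenization
    3. Tokenizer model
    4. Postprocessing

- Common algorithms
    - Byte Pair Encoding (BPE): start from a list of single characters and progressively create new tokens by merging the most frequent pairs, until the target vocabulary size is reached
    - Unigram: start from a large set of candidate tokens and progressively remove tokens, until the target vocabulary size is reached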
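
The BPE merge loop can be sketched in a few lines of plain Python. This is a toy version for illustration, not the implementation used by any tokenizer library: it works on whole words rather than bytes, and `train_bpe` and the sample words are made up.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(
    ["low", "low", "lower", "lowest", "new", "newer"], num_merges=3
)
print(merges)
print(corpus)
```

Each merge adds one token to the vocabulary, so the number of merges controls the final vocabulary size.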

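- Ways to measure tokenizer performance
    - Subword fertility: the average number of subwords produced per tokenized word
    - Proportion of continued words: the share of tokenized words that are split into at least two subtokens
    - Coverage metrics: the proportion of unknown words or rarely used tokens
    - **What matters most is the performance of the model that uses the tokenizer.**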
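
The first two metrics are simple to compute once each word has been split into subwords. Below is a small sketch under stated assumptions: `toy_tokenize` is a hypothetical greedy longest-match tokenizer standing in for a real one, and the vocabulary and word list are invented.

```python
def toy_tokenize(word, vocab):
    """Greedy longest-match split of one word into subwords (illustrative only)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def tokenizer_metrics(words, tokenize):
    """Return (subword fertility, proportion of continued words)."""
    splits = [tokenize(word) for word in words]
    fertility = sum(len(split) for split in splits) / len(words)
    continued = sum(len(split) >= 2 for split in splits) / len(words)
    return fertility, continued


vocab = {"token", "izer", "s", "train", "ing", "model"}
words = ["tokenizers", "training", "model", "tokens"]
fertility, continued = tokenizer_metrics(words, lambda w: toy_tokenize(w, vocab))
print(fertility, continued)
```

A fertility close to 1 and a low proportion of continued words mean that most words survive tokenization as a single token.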

In [22]:
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

bytes_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = {v:k for k, v in bytes_to_unicode_map.items()}

In [36]:
import pandas as pd

pd.DataFrame.from_dict(unicode_to_byte_map.items()).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
0,!,"""",#,$,%,&,',(,),*,+,",",-,.,/,0,1,2,3,4,5,6,7,8,9,:,;,<,=,>,?,@,A,B,C,D,E,F,G,H,...,Ĝ,ĝ,Ğ,ğ,Ġ,ġ,Ģ,ģ,Ĥ,ĥ,Ħ,ħ,Ĩ,ĩ,Ī,ī,Ĭ,ĭ,Į,į,İ,ı,Ĳ,ĳ,Ĵ,ĵ,Ķ,ķ,ĸ,Ĺ,ĺ,Ļ,ļ,Ľ,ľ,Ŀ,ŀ,Ł,ł,Ń
1,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,...,28,29,30,31,32,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,173
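
To make the table easier to read, here is an independent sketch of how such a byte-to-character mapping can be built. It is written to reproduce the pairs visible above (for example byte 32, the space, maps to `Ġ`), but it is a re-implementation for illustration, not the library's source, and `byte_to_unicode_table` is a hypothetical name.

```python
def byte_to_unicode_table():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable characters keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are moved to code points above 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = byte_to_unicode_table()
encoded = "".join(table[b] for b in "def f():\n    pass".encode("utf-8"))
print(encoded)
```

Because every byte gets a visible character, whitespace such as spaces and newlines survives as ordinary symbols that a BPE tokenizer can merge like any other character.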
