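### Large-scale datasets
- When to use one
    - The amount of available training data is comparable to the amount used to pretrain the existing model
    - The domain differs substantially from the pretrained model's domain
- Things to watch out for
    - Much of the data is generated (semi-)automatically, so its quality can be low.
    - Bias, low quality, **copyright violations**, and so on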

### Example: same model, different datasets

In [76]:
from transformers import pipeline, set_seed

set_seed(42)
gpt_pipe_1 = pipeline("text-generation", model="openai-gpt")
gpt_pipe_2 = pipeline("text-generation", model="gpt2")

Device set to use cuda:0
Device set to use cuda:0


In [77]:
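print("Parameter count comparison")
# Note: len(p) is only the size of each tensor's first dimension, so these sums
# are not true parameter counts; p.numel() would count every element.
(sum([len(p) for p in gpt_pipe_1.model.parameters()]),
 sum([len(p) for p in gpt_pipe_2.model.parameters()]))

Parameter count comparison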


(225310, 237137)
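
The two numbers above are sums of first-dimension sizes, not full parameter counts. A minimal pure-Python sketch of the difference follows; the layer shapes are hypothetical and are not the real GPT weights.

```python
import math

# Hypothetical parameter shapes (not the real GPT weights): an embedding
# matrix, a square weight matrix, and a bias vector.
shapes = [(1000, 64), (64, 64), (64,)]

# What summing len(p) measures: only the first dimension of each tensor.
sum_of_first_dims = sum(shape[0] for shape in shapes)

# What summing p.numel() measures: the product of all dimensions.
true_param_count = sum(math.prod(shape) for shape in shapes)

print(sum_of_first_dims, true_param_count)
```

With real models the same gap appears: replacing `len(p)` with `p.numel()` in the cell above would give the actual number of weights.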

In [78]:
# Compare the outputs for the same input
def pipe_out(pipe, prompt, num_return_sequences):
    out = pipe(
        prompt,
        num_return_sequences=num_return_sequences,
        clean_up_tokenization_spaces=True,
        truncation=True,
        )
    return out

display(pipe_out(gpt_pipe_1, '\n UDA is', 3)) # GPT: trained on books (largely romance novels)
print("*"*50)
display(pipe_out(gpt_pipe_2, '\n UDA is', 3)) # GPT-2: trained on web text linked from Reddit

[{'generated_text': '\n UDA is no - kill. " \n " he\'s a little young to be a ranger to a ranger. " \n " no - kill ain\'t no cowboy. i\'m only a tracker for the local rangers. " \n " no - kill'},
 {'generated_text': '\n UDA is the very first one in the line who can speak. so far, it\'s been an exceptionally hard night. " \n " does the wolf really belong to you? " \n " he does. " \n i stared at him, perplexed'},
 {'generated_text': "\n UDA isn't dead. her life is still in the water. \n he wanted to believe her. needed to believe it. \n but she 'd gotten him into this. there was no other way to prove her story, no other way to prove"}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


**************************************************


[{'generated_text': '\n UDA is an open source game engine for Linux that brings together the best technologies of the desktop. In a recent blog post, UDA developer John Daley explains that in many respects this open source effort was the perfect vehicle for a "modern'},
 {'generated_text': '\n UDA is the most expensive driver to purchase in India, where around $450 million is spent in 2013-14 (see chart ). The value of the RTE licence can be measured in USD, which gives the price. For RTE buyers'},
 {'generated_text': "\n UDA is on the horizon.\n\nAnd maybe most damning, this is just the latest in a long line of examples of UDA's failure to meet its commitments to protect the U.S. from China's cyber attacks.\n\n"}]

In [79]:
del gpt_pipe_1
del gpt_pipe_2

import torch

# Empty the CUDA cache
torch.cuda.empty_cache()

In [2]:
import torch

# Check GPU memory usage
print(torch.cuda.memory_allocated())

0


### Fetching the code dataset

In [13]:
!cd codeparrot/ && git file-000000000000.json.gz pull

git: 'file-000000000000.json.gz' is not a git command. See 'git --help'.


In [14]:
!git clone https://huggingface.co/datasets/transformersbook/codeparrot
!cd codeparrot && git lfs pull

fatal: destination path 'codeparrot' already exists and is not an empty directory.
Downloading LFS objects: 100% (184/184), 46 GB | 58 MB/s                        

### Working with large datasets
- The code dataset is about 46 GB compressed and roughly 200 GB once decompressed
- Use memory mapping and streaming


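### Memory mapping
- Zero-copy, zero-overhead memory mapping
- The dataset is cached on disk as files
- Instead of being loaded into RAM, the dataset is accessed through a pointer to the cached file, and data is read from it only when needed.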
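
The idea can be illustrated with the standard library alone. The sketch below uses Python's `mmap` module on a small temporary file; it only stands in for the much larger Arrow cache that 🤗 Datasets memory-maps, and the file name and contents are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small file to stand in for a large on-disk dataset cache.
path = os.path.join(tempfile.mkdtemp(), "cache.bin")
with open(path, "wb") as f:
    f.write(b"0123456789" * 1000)

with open(path, "rb") as f:
    # Map the file into the address space; the OS loads pages on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Slicing touches only the pages that back this byte range.
        chunk = mapped[5000:5010]
        total_size = len(mapped)

print(chunk, total_size)
```

The process can address the whole file without holding it in RAM, which is why the memory footprint stays small even when the cache is far larger than physical memory.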

In [15]:
from datasets import load_dataset, DownloadConfig

download_config = DownloadConfig(delete_extracted=True)
dataset = load_dataset("./codeparrot", split="train",
                       download_config=download_config)

Downloading data: 100%|██████████| 184/184 [00:00<00:00, 1527.57files/s]
Generating train split: 18695559 examples [59:12, 5262.95 examples/s]


In [19]:
import psutil
import os

print(f"Number of examples : {len(dataset)}")
ds_size = sum(os.stat(f["filename"]).st_size for f in dataset.cache_files)
print(f"Cache size : {ds_size/ 2 ** 30:.2f}GB")
print(f"RAM used : {psutil.Process(os.getpid()).memory_info().rss >> 20} MB")

Number of examples : 18695559
Cache size : 183.59GB
RAM used : 1463 MB


### Dataset streaming

In [35]:
# Random access such as streamed_dataset[n] is not available on a streamed dataset.

streamed_dataset = load_dataset('./codeparrot', split='train', streaming=True) # IterableDataset
next(iter(streamed_dataset))

{'repo_name': 'ahmedbodi/AutobahnPython',
 'path': 'examples/asyncio/websocket/echo/client_coroutines.py',
 'copies': '13',
 'size': '2044',
 'content': '###############################################################################\n##\n##  Copyright (C) 2013-2014 Tavendo GmbH\n##\n##  Licensed under the Apache License, Version 2.0 (the "License");\n##  you may not use this file except in compliance with the License.\n##  You may obtain a copy of the License at\n##\n##      http://www.apache.org/licenses/LICENSE-2.0\n##\n##  Unless required by applicable law or agreed to in writing, software\n##  distributed under the License is distributed on an "AS IS" BASIS,\n##  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n##  See the License for the specific language governing permissions and\n##  limitations under the License.\n##\n###############################################################################\n\nfrom autobahn.asyncio.websocket import WebSocketClien
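
What streaming does conceptually can be sketched with a plain generator. The example below assumes a gzip-compressed shard with one JSON object per line; the shard name and records are hypothetical stand-ins written to a temporary directory.

```python
import gzip
import itertools
import json
import os
import tempfile

# Create a tiny gzip-compressed JSON Lines shard (hypothetical stand-in data).
path = os.path.join(tempfile.mkdtemp(), "shard.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"path": f"file_{i}.py", "content": f"print({i})"}) + "\n")

def stream_examples(shard_path):
    """Yield one example at a time without loading the whole shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Like an IterableDataset, the generator supports iteration but not indexing.
first = next(iter(stream_examples(path)))
first_three = list(itertools.islice(stream_examples(path), 3))
print(first["path"], len(first_three))
```

Only the lines actually consumed are decompressed and parsed, so taking a handful of examples never requires reading the full file.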

In [41]:
# Remote dataset (fetched by streaming)
remote_dataset = load_dataset('transformersbook/codeparrot', split='train', streaming=True)

In [50]:
for data in remote_dataset:
    display(data)  # show a single data sample
    break  # only look at the first sample as an example

{'repo_name': 'ahmedbodi/AutobahnPython',
 'path': 'examples/asyncio/websocket/echo/client_coroutines.py',
 'copies': '13',
 'size': '2044',
 'content': '###############################################################################\n##\n##  Copyright (C) 2013-2014 Tavendo GmbH\n##\n##  Licensed under the Apache License, Version 2.0 (the "License");\n##  you may not use this file except in compliance with the License.\n##  You may obtain a copy of the License at\n##\n##      http://www.apache.org/licenses/LICENSE-2.0\n##\n##  Unless required by applicable law or agreed to in writing, software\n##  distributed under the License is distributed on an "AS IS" BASIS,\n##  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n##  See the License for the specific language governing permissions and\n##  limitations under the License.\n##\n###############################################################################\n\nfrom autobahn.asyncio.websocket import WebSocketClien

In [51]:
from huggingface_hub import login
import json

with open("hf_key_token.json") as f:
    token = json.load(f)["hf_key_token"]

login(token)

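### Creating the datasets and committing them to Hugging Face

- Create the Hugging Face repositories and `git clone` them
- In bash
    - Log in to Hugging Face and create the repositories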
        ```bash
        $ huggingface-cli login
        $ huggingface-cli repo create --type dataset codeparrot-train
        $ huggingface-cli repo create --type dataset codeparrot-valid
        ```

    - Clone them
        ```bash
        $ git clone https://huggingface.co/datasets/tommyjin/codeparrot-valid
        $ git clone https://huggingface.co/datasets/tommyjin/codeparrot-train
        ```

- Copy the training shards and commit
    ```bash
        $ cd codeparrot-train
        $ cp ../codeparrot/*.json.gz .
        $ rm ./file-000000000183.json.gz
        $ git add .
        $ git commit -m "Adding dataset files"
        $ git push
    ```
- Copy the validation shard and commit
    ```bash
        $ cd ../codeparrot-valid
        $ cp ../codeparrot/file-000000000183.json.gz .
        $ mv ./file-000000000183.json.gz ./file-000000000183_validation.json.gz
        $ git add .
        $ git commit -m "Adding dataset files"
        $ git push
    ```

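### Building a tokenizer

- Overall pipeline:
    1. Normalization
    2. Pre-tokenization
    3. Tokenizer model
    4. Post-processing

- Algorithms
    - Byte Pair Encoding: starts from a list of single characters and progressively creates new tokens by merging, until the vocabulary reaches a target size
    - Unigram: starts from a large set of candidate tokens and progressively removes tokens, until the vocabulary shrinks to a target size

- Ways to measure tokenizer performance
    - Subword fertility: the average number of subwords produced per tokenized word
    - Proportion of continued words: the proportion of tokenized words that are split into at least two subtokens
    - Coverage metrics: the proportion of unknown words or rarely used tokens
    - **What matters most is the performance of the model that uses the tokenizer.**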
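
The first two metrics are easy to compute by hand. The sketch below uses a hypothetical toy tokenizer (it simply cuts words into chunks of at most four characters) purely to show the arithmetic; a real tokenizer would be plugged in instead.

```python
def toy_tokenize(word):
    """Hypothetical subword tokenizer: split a word into chunks of at most 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]

words = "def hello world return value".split()
pieces_per_word = [len(toy_tokenize(w)) for w in words]

# Subword fertility: average number of subwords per word.
fertility = sum(pieces_per_word) / len(words)

# Proportion of continued words: share of words split into two or more subwords.
continued = sum(n >= 2 for n in pieces_per_word) / len(words)

print(fertility, continued)
```

Lower values for both metrics generally mean the vocabulary fits the corpus better, although the downstream model's performance remains the deciding measure.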

In [1]:
# Inspect the GPT-2 tokenizer's byte-to-unicode mapping
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

bytes_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = {v:k for k, v in bytes_to_unicode_map.items()}

In [2]:
import pandas as pd

pd.DataFrame.from_dict(unicode_to_byte_map.items()).T

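| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ! | " | # | $ | % | & | ' | ( | ) | * | ... | ĺ | Ļ | ļ | Ľ | ľ | Ŀ | ŀ | Ł | ł | Ń |
| 1 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | ... | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 173 |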
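
How such a table can be built is sketched below. This is a small reimplementation written for illustration, not the library code: it assumes that printable bytes keep their own character and that every other byte is shifted to a code point from 256 upward. The function name `byte_to_unicode_table` is made up for the example.

```python
def byte_to_unicode_table():
    """Sketch of a GPT-2 style byte-to-character table (a reimplementation, not the library code)."""
    # Bytes that already correspond to printable, non-whitespace characters keep their code point.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    table = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, ...) are shifted to code points >= 256.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table

table = byte_to_unicode_table()
encoded = "".join(table[b] for b in "a b\n".encode("utf-8"))
print(len(table), encoded)
```

This explains the `Ġ` and `Ċ` characters seen in tokenizer output: they are the visible stand-ins for the space and newline bytes.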


In [6]:
# GPT-2 tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer("""\
def HelloWorld():
    print("hello)
""").tokens()[:2], len(tokenizer) # vocabulary size

(['def', 'ĠHello'], 50257)

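### Training the tokenizer

- Training a BPE tokenizer is <u>**a matter of extracting statistics from the corpus**</u>
    1. Decide on the target vocabulary size
    2. Prepare an iterator over the training text
    3. Call `train_new_from_iterator()`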
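
What "extracting statistics" means can be sketched with a toy BPE trainer. This is a simplified illustration, not the algorithm as implemented in the tokenizers library: it works on characters rather than bytes, ignores pre-tokenization, and the function name `train_toy_bpe` and the word list are made up.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Toy BPE: start from single characters and repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        merged_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus.append(out)
        corpus = merged_corpus
    return merges, corpus

merges, corpus = train_toy_bpe(["print", "print", "printf", "int"], num_merges=3)
print(merges)
print(corpus)
```

Each merge adds one token to the vocabulary, so the target vocabulary size directly controls how many merge rounds are run.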

In [23]:
print(f'''Longest token:
{sorted(tokenizer.vocab.items(), key=lambda x:len(x[0]), reverse=True)[:1]}''')

Longest token:
[('ÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ', 35496)]


In [31]:
from tqdm.auto import tqdm
from datasets import load_dataset
length = 100000
dataset_name = 'tommyjin/codeparrot-train'
dataset = load_dataset(dataset_name, split="train", streaming=True)
iter_dataset = iter(dataset)

Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

In [35]:
base_vocab = list(unicode_to_byte_map.keys())
print(base_vocab) # base alphabet: the 256 byte-level characters

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '¡', '¢', '£', '¤', '¥', '¦', '§', '¨', '©', 'ª', '«', '¬', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ'

In [None]:
def batch_iterator(batch_size=10):
    for _ in tqdm(range(0, length, batch_size)):
        yield [next(iter_dataset)['content'] for _ in range(batch_size)]

new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=12500,
    initial_alphabet = base_vocab
)

In [46]:
print(new_tokenizer("""\
def HelloWorld():
    print("hello)
""").tokens())

['def', 'ĠH', 'ello', 'Wor', 'ld', '():', 'ĊĠĠĠ', 'Ġprint', '("', 'hello', ')', 'Ċ']


In [61]:
import keyword

print(f"""\
Number of keywords : {len(keyword.kwlist)}""")
for kw in keyword.kwlist:
    if kw not in new_tokenizer.vocab:
        print(f"\"{kw}\" is not in voc")

Number of keywords : 35
"await" is not in voc
"finally" is not in voc
"nonlocal" is not in voc


In [None]:
length = 200000
new_tokenizer_larger = tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=32768, # multiple of 8 => more efficient GPU computation
    initial_alphabet = base_vocab
)

In [63]:
print(f"""\
Number of keywords : {len(keyword.kwlist)}""")
for kw in keyword.kwlist:
    if kw not in new_tokenizer_larger.vocab:
        print(f"\"{kw}\" is not in voc")

Number of keywords : 35
"nonlocal" is not in voc


In [None]:
new_tokenizer_larger.push_to_hub("codeparrot")
new_tokenizer_larger.push_to_hub("codeparrot-small")

CommitInfo(commit_url='https://huggingface.co/tommyjin/tokenizer_from_codeparrot_dataset/commit/02ce052496e88c5e1f460497a5411c3d3bdd7928', commit_message='Upload tokenizer', commit_description='', oid='02ce052496e88c5e1f460497a5411c3d3bdd7928', pr_url=None, repo_url=RepoUrl('https://huggingface.co/tommyjin/tokenizer_from_codeparrot_dataset', endpoint='https://huggingface.co', repo_type='model', repo_id='tommyjin/tokenizer_from_codeparrot_dataset'), pr_revision=None, pr_num=None)

### Training objectives for code generation

1. Causal language modeling
    - The beginning of the code is given as input and the model completes the rest
        >input
        >```python
        >    def add_numbers(a,b):
        >        "add two numbers"
        >        return ____
        >```
        >decoder out
        >```python
        >    def add_numbers(a,b):
        >        "add two numbers"
        >        return a + b
        >```
2. Masked language modeling (denoising)
    - Some of the input tokens are masked or replaced, and the model restores the original
        >input
        >```python
        >    class add_numbers(a,b):
        >        "add [MASK] numbers"
        >        return a+a
        >```
        >encoder out
        >```python
        >    def add_numbers(a,b):
        >        "add two numbers"
        >        return a + b
        >```
3. Sequence-to-sequence training
    - Input and output are separate sequences; code is generated from an input such as a docstring
        >input
        >```python
        >        "add two numbers"
        >```
        >decoder out
        >```python
        >    def add_numbers(a,b):
        >        return a + b
        >```
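
For the causal objective, which is the one used in the rest of this notebook, the training signal can be sketched as next-token prediction. The token strings below are hypothetical and shown as text for readability; a real model works on token ids.

```python
# Hypothetical token sequence for `return a + b` (tokens shown as strings for readability).
tokens = ["return", "a", "+", "b", "<eos>"]

# Causal language modeling: at each position the model sees the prefix
# and is trained to predict the next token.
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in training_pairs:
    print(context, "->", target)
```

No extra labels are needed: the targets are the input sequence shifted by one position, which is why raw code files are enough for pretraining.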

### Training a model from scratch

In [3]:
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tommyjin/codeparrot")
config = AutoConfig.from_pretrained("gpt2-xl", vocab_size = len(tokenizer))
model = AutoModelForCausalLM.from_config(config)

In [2]:
print(f'params {sum([p.numel() for p in model.parameters()])/1000**2:.1f}M')

params 1529.6M


In [None]:
model.save_pretrained("model/" + "codeparrot", push_to_hub=True)

In [4]:
tokenizer = AutoTokenizer.from_pretrained("tommyjin/codeparrot")
config_small = AutoConfig.from_pretrained("gpt2", vocab_size = len(tokenizer))
model_small = AutoModelForCausalLM.from_config(config_small)
print(f'params {sum([p.numel() for p in model_small.parameters()])/1000**2:.1f}M')

params 111.0M


In [None]:
model_small.save_pretrained("model/" + "codeparrot" + "-small", push_to_hub=True)

### Building the data loader

- Number of input characters = number of sequences * sequence length * characters per token

In [5]:
from datasets import load_dataset

dataset = load_dataset("tommyjin/codeparrot-train", split='train',
                       streaming=True)

Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

In [6]:
from tqdm import tqdm

examples, total_characters, total_tokens = 500, 0, 0
for i, example in tqdm(zip(range(examples), iter(dataset)), total=examples):
    total_characters += len(example['content'])
    total_tokens += len(tokenizer(example['content']).tokens())

    # Inspect one sample
    if i == 10:
        print(f'''\
        First 10 characters : {example['content'][:10]},
        First 10 tokens     : {tokenizer(example['content']).tokens()[:10]}''')
    
character_per_token = total_characters / total_tokens
print(character_per_token)

  0%|          | 1/500 [00:01<08:20,  1.00s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (2599 > 1024). Running this sequence through the model will result in indexing errors
  5%|▌         | 27/500 [00:01<00:13, 34.78it/s]

        First 10 characters : import uni,
        First 10 tokens     : ['import', 'Ġunittest', ',', 'Ġos', ',', 'Ġerrno', 'Ċ', 'from', 'Ġctypes', 'Ġimport']


100%|██████████| 500/500 [00:03<00:00, 138.88it/s]

3.6231516195736053
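
With a ratio of roughly 3.6 characters per token, the number of characters to buffer follows directly from the formula. The worked example below uses 10 sequences of 1024 tokens; these values are chosen for illustration.

```python
seq_length = 1024        # tokens per training sequence
num_of_sequences = 10    # sequences wanted from one buffer fill
chars_per_token = 3.6    # rounded from the measured ratio

# Characters to buffer so that tokenization yields roughly num_of_sequences sequences.
input_characters = seq_length * chars_per_token * num_of_sequences
print(f"{input_characters:.0f}")
```

The estimate is approximate, since the characters-per-token ratio varies from file to file, so a buffer may yield slightly more or fewer sequences than requested.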





### Filling a buffer for streaming (up to a fixed size)

In [179]:
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):
    def __init__(self, tokenizer, dataset, seq_length=1024,
                 num_of_sequences=1024, chars_per_token = 3.6):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences

    def __iter__(self):
        iteration = iter(self.dataset)
        more_sample = True
        while more_sample:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.input_characters:
                    m = f"Buffer full: {buffer_len}>={self.input_characters:.0f}"
                    print(m)
                    break
                try:
                    m = f"Fill buffer: {buffer_len}<{self.input_characters:.0f}"
                    print(m)
                    buffer.append(next(iteration)["content"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    # Restart from the beginning once the dataset is exhausted
                    iteration = iter(self.dataset)

            all_token_ids = []
            tokenizer_inputs = self.tokenizer(buffer, truncation=False)
            for tokenizer_input in tokenizer_inputs['input_ids']:
                all_token_ids.extend(tokenizer_input + [self.concat_token_id])

            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i:i+self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)

In [180]:
shuffled_dataset = dataset.shuffle(buffer_size=100)
constant_length_dataset = ConstantLengthDataset(tokenizer, shuffled_dataset,
                                                num_of_sequences=10)

dataset_iteration = iter(constant_length_dataset)

In [181]:
lengths = [len(b) for _, b in zip(range(5), dataset_iteration)]

Fill buffer: 0<36864
Fill buffer: 1804<36864
Buffer full: 44799>=36864


In [213]:
lengths

[1024, 1024, 1024, 1024, 1024]
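
The packing step inside the class can be isolated as a small pure-Python function. The sketch below uses made-up token ids and a tiny sequence length so the result is easy to check by eye; `pack_sequences` is a name introduced here for illustration.

```python
def pack_sequences(tokenized_docs, seq_length, eos_id):
    """Concatenate documents with an EOS separator and cut fixed-length chunks."""
    all_ids = []
    for ids in tokenized_docs:
        all_ids.extend(ids + [eos_id])
    chunks = [all_ids[i:i + seq_length] for i in range(0, len(all_ids), seq_length)]
    # Drop the last chunk if it is shorter than seq_length.
    return [c for c in chunks if len(c) == seq_length]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical token ids
packed = pack_sequences(docs, seq_length=4, eos_id=0)
print(packed)
```

Packing avoids padding entirely: every position in every sequence carries a real token, at the cost of sequences that can start or end in the middle of a file.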

### Training with wandb and gradient accumulation

- Preparation for wandb
  - Log in to wandb
    ```bash
    wandb login
    ```

In [79]:
import torch
from datasets import load_dataset
import wandb

# wandb configuration
wandb.init(
    project="large-dataset-training",
    config={
        "model_name": "codeparrot-small",
        "train_batch_size": 8,
        "valid_batch_size": 8,
        "accumulation_steps": 16,
        "epochs": 3,
        "learning_rate": 5e-5,
        "max_length": 128,
        "seq_length": 1024,
        "weight_decay": 0.1,
        "num_warmup_steps": 750,
        "max_train_steps": 50,
        "save_checkpoint_steps": 200,
        "shuffle_buffer": 10,
    },
)

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = wandb.config["model_name"]
model = AutoModelForCausalLM.from_pretrained('tommyjin/'+model_name)
tokenizer = AutoTokenizer.from_pretrained('tommyjin/'+model_name)

In [20]:
# Stream the dataset
dataset = load_dataset("tommyjin/codeparrot-train", split="train", streaming=True)

Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

In [38]:
tokenizer.pad_token = tokenizer.eos_token

In [45]:
# Preprocessing function
def preprocess(batch):
    tokenized = tokenizer(
        batch["content"],
        truncation=True,
        padding="max_length",
        max_length=wandb.config["max_length"],
        return_tensors="pt",
    )
    return {
        "input_ids": tokenized["input_ids"].squeeze(0),
        "attention_mask": tokenized["attention_mask"].squeeze(0),
    }

In [31]:
# Training setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(32768, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=32768, bias=False)
)

In [25]:
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(),
                  lr=wandb.config["learning_rate"],
                  weight_decay=wandb.config["weight_decay"])

In [26]:
# Mixed Precision Training
scaler = torch.amp.GradScaler(device = 'cuda')

In [27]:
# Let wandb track the model
wandb.watch(model, log="all", log_freq=10)

In [28]:
# Training arguments
accumulation_steps = wandb.config["accumulation_steps"]
batch_size = wandb.config["train_batch_size"]
epochs = wandb.config["epochs"]
save_checkpoint_steps = wandb.config["save_checkpoint_steps"]
wandb_interval = 100

In [None]:
batch = {"input_ids": [], "attention_mask": []}
batch_step = 0  # number of mini-batches processed so far (across epochs)
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.train()
    optimizer.zero_grad()

    for example in dataset:
        processed = preprocess(example)
        batch["input_ids"].append(processed["input_ids"])
        batch["attention_mask"].append(processed["attention_mask"])

        # Keep collecting examples until the mini-batch is full
        if len(batch["input_ids"]) < batch_size:
            continue

        input_ids = torch.stack(batch["input_ids"]).to(device)
        attention_mask = torch.stack(batch["attention_mask"]).to(device)
        # Ignore padding positions in the loss (the pad token equals the eos token here)
        labels = input_ids.masked_fill(attention_mask == 0, -100)
        batch = {"input_ids": [], "attention_mask": []}  # reset the mini-batch
        batch_step += 1

        with torch.amp.autocast(device_type="cuda"):  # Mixed Precision Training
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

        # Gradient accumulation: divide the loss so the accumulated gradient is an average
        scaler.scale(loss / accumulation_steps).backward()

        # Update the parameters once accumulation_steps mini-batches have been accumulated
        if batch_step % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        # wandb logging (unscaled loss)
        if batch_step % wandb_interval == 0:
            wandb.log({"loss": loss.item(), "step": batch_step})
            print(f"Step {batch_step}, Loss: {loss.item()}")

        # Save a checkpoint of the model and tokenizer
        if batch_step % save_checkpoint_steps == 0:
            checkpoint_path = f"/mnt/e/ai_career/ai_study_Transformer/codeparrot-small/checkpoint_step_{batch_step}"
            model.save_pretrained(checkpoint_path)
            tokenizer.save_pretrained(checkpoint_path)

    # Log once per epoch after the epoch finishes
    wandb.log({"epoch": epoch + 1})
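
Why the loss is divided by the number of accumulation steps can be checked on a one-parameter model. The sketch below is plain Python with made-up data: it compares the gradient of a full batch with the gradient accumulated over two equally sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the full batch of four examples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient accumulated over two micro-batches of two examples each:
# each micro-batch gradient is divided by accumulation_steps before summing.
accumulation_steps = 2
accumulated_grad = 0.0
for start in range(0, len(xs), 2):
    micro_grad = mse_grad(w, xs[start:start + 2], ys[start:start + 2])
    accumulated_grad += micro_grad / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two values agree, which is the point of the technique: a large effective batch can be simulated with micro-batches that fit in GPU memory.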



In [87]:
model.save_pretrained("tommyjin/codeparrot-small")
tokenizer.save_pretrained("tommyjin/codeparrot-small", push_to_hub=True)

No files have been modified since last commit. Skipping to prevent empty commit.


('tommyjin/codeparrot-small/tokenizer_config.json',
 'tommyjin/codeparrot-small/special_tokens_map.json',
 'tommyjin/codeparrot-small/vocab.json',
 'tommyjin/codeparrot-small/merges.txt',
 'tommyjin/codeparrot-small/added_tokens.json',
 'tommyjin/codeparrot-small/tokenizer.json')

In [None]:
# Save the model files to wandb
wandb.save("codeparrot-small/*")

print("Training completed and model saved.")
wandb.finish()

### Testing a well-trained model from `transformersbook/codeparrot`

In [6]:
import os
from transformers import pipeline, set_seed, AutoTokenizer, AutoModelForCausalLM

In [53]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [54]:
model_ckpt = 'transformersbook/codeparrot'
model = AutoModelForCausalLM.from_pretrained(model_ckpt, cache_dir = "./cache_dir")
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, cache_dir = "./cache_dir")
generation = pipeline('text-generation',
                      model=model,
                      tokenizer=tokenizer,
                      device=0)

Device set to use cuda:0


In [229]:
predicted = generation('''# this code is most nice code ever
def lit_code(''',
temperature = 1.3,
top_k = 15,
do_sample = True,
num_beams = 1
)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
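
The effect of `temperature` and `top_k` in the call above can be sketched in pure Python. This is an illustrative reimplementation of the sampling idea, not the code path used inside `transformers`; the logits and the helper name `sample_top_k` are hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature, top_k, rng):
    """Sample a token index from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature > 1 flattens the distribution; < 1 sharpens it.
    scaled = [logits[i] / temperature for i in top]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]  # numerically stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(42)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # hypothetical next-token scores
samples = [sample_top_k(logits, temperature=1.3, top_k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

Only the two highest-scoring tokens can ever be drawn here, while the temperature controls how evenly the probability is shared between them.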


In [230]:
print(predicted[0]['generated_text'])

# this code is most nice code ever
def lit_code(code):
	if code[0:2]=='#':
		return '[[%02x,%s]]' % (chr(code[2+2)%31,code


In [231]:
print(predicted[0]['generated_text'].split("\n\n")[0])

# this code is most nice code ever
def lit_code(code):
	if code[0:2]=='#':
		return '[[%02x,%s]]' % (chr(code[2+2)%31,code
