Install the required packages other than those for TPU.

In [1]:
!pip install ratsnlp 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Mounting Google Drive
Mount the Google Drive where model checkpoints and other outputs will be stored. The mount uses your own Google account.

In [2]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


# Settings
Declare the configuration, such as the model hyperparameters and the storage locations.

In [3]:
import torch
from ratsnlp.nlpbook.generation import GenerationTrainArguments
args = GenerationTrainArguments(
    pretrained_model_name="skt/kogpt2-base-v2",
    downstream_corpus_name="nsmc",
    downstream_model_dir="/gdrive/My Drive/nlpbook/checkpoint-gener2",
    max_seq_length=32,
    batch_size=32 if torch.cuda.is_available() else 4,
    learning_rate=5e-5,
    epochs=3,
    tpu_cores=0 if torch.cuda.is_available() else 8,
    seed=7,
)

# Fixing the random seed
Fix the random seed so that training is reproducible.

In [4]:
from ratsnlp import nlpbook
nlpbook.set_seed(args)

set seed: 7
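
What `nlpbook.set_seed` does internally is not shown in this notebook. As a rough sketch of what a seed-fixing helper usually covers, the hypothetical `fix_seed` below seeds Python's `random` module and, only if they happen to be installed, NumPy and PyTorch. It is an illustration under those assumptions, not the ratsnlp implementation.

```python
import random


def fix_seed(seed: int) -> None:
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


# Seeding twice with the same value replays the same random sequence.
fix_seed(7)
first = [random.random() for _ in range(3)]
fix_seed(7)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```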


# Logger setup
Set up a logger for printing messages.

In [5]:
nlpbook.set_logger(args)

INFO:ratsnlp:Training/evaluation parameters GenerationTrainArguments(pretrained_model_name='skt/kogpt2-base-v2', downstream_task_name='sentence-generation', downstream_corpus_name='nsmc', downstream_corpus_root_dir='/content/Korpora', downstream_model_dir='/gdrive/My Drive/nlpbook/checkpoint-gener2', max_seq_length=32, save_top_k=1, monitor='min val_loss', seed=7, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=3, batch_size=32, cpu_workers=2, fp16=False, tpu_cores=0)

# Downloading the corpus
Download the corpus (NSMC) used in this exercise.

In [6]:
from Korpora import Korpora
Korpora.fetch(
    corpus_name=args.downstream_corpus_name,
    root_dir=args.downstream_corpus_root_dir,
    force_download=args.force_download,
)

[nsmc] download ratings_train.txt: 14.6MB [00:00, 88.7MB/s]                            
[nsmc] download ratings_test.txt: 4.90MB [00:00, 38.2MB/s]                            


# Preparing the tokenizer
Declare the tokenizer that performs tokenization.

In [7]:
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    args.pretrained_model_name,
    eos_token="</s>",
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


In [8]:
# Overwrite the train and test data inside the nsmc directory with our own dataset.

import pandas as pd
sample_df=pd.read_csv("/content/sample_data/dataset.csv",encoding="CP949")
sample_df

Unnamed: 0,wav_id,발화문,상황,1번 감정,1번 감정세기,2번 감정,2번 감정세기,3번 감정,3번 감정세기,4번 감정,4번감정세기,5번 감정,5번 감정세기,나이,성별
0,5e48d03cc38c123b9ec6d9b6,요번에 마지막 기회였는데 그것마저도 떨어졌어. 나 어쩌면 좋지?,sad,Sadness,1,Sadness,2,Sadness,1,Sadness,1,Fear,2,46,female
1,5f3d0a858a3c1005aa97c8b8,너무 화가 나니까 사과도 사과 같지가 않아. 진정성이 없어.,angry,disgust,1,angry,2,angry,1,angry,1,sadness,2,46,female
2,5f90c9bcd338b948c4e6a71b,그래. 신나는 음악 듣고 스트레스 풀고 싶다.,happiness,neutral,0,angry,1,sadness,1,happiness,1,neutral,0,48,female
3,5ed640d579bf120ed2b815bd,그저께 밤에 갑자기 의식을 잃어버리더라고. 그러더니 어제 끝내 눈을 감았어.,sad,Sadness,1,Sadness,2,Sadness,1,Sadness,1,Sadness,2,48,female
4,5fbb8e1344697678c497b84a,너무 무서워. 아무것도 안 보여. 어떻게 해야될지 모르겠어.,fear,neutral,0,neutral,0,neutral,0,neutral,0,neutral,0,29,male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43986,5e3554075807b852d9e0805c,나 어떻게 살지? 망했다. 망했어.,sad,Sadness,1,Sadness,1,Fear,1,Sadness,1,Sadness,1,26,male
43987,5f01f7efb140144dfcfed549,"어제 몸이 너무 아팠는데, 밤새 작업해서 기획안을 완성했어. 완성된 기획안을 상사한...",sad,Angry,1,Sadness,1,Sadness,1,Sadness,1,Sadness,2,46,female
43988,5f98ad0c9e04b149046ce1b5,내일 집에서 나올 때 우산 가지고 나오면 괜찮아. 몸도 건강하지!,neutral,neutral,0,happiness,1,sadness,1,neutral,0,neutral,0,46,female
43989,5f787822d338b948c4e68cae,저 상사 새끼를 잊어버릴 많아 노래를 부탁해.,angry,angry,2,angry,2,angry,2,angry,1,angry,1,48,female


In [9]:
from sklearn.model_selection import train_test_split

X=sample_df[['wav_id','발화문','상황']]
x_train, x_test = train_test_split(X, test_size=0.2, random_state=args.seed)

In [10]:
x_train.to_csv('/content/Korpora/nsmc/ratings_train.txt', sep = '\t', index=False)
x_test.to_csv('/content/Korpora/nsmc/ratings_test.txt', sep = '\t', index=False)

In [20]:
import csv

def convert_emotion_to_number(emotion):
    emotion_map = {
        'angry': '0',
        'anger': '0',
        'disgust': '1',
        'fear': '2',
        'happiness': '3',
        'neutral': '4',
        'sadness': '5',
        'sad':'5',
        'surprise': '6',
    }
    return emotion_map.get(emotion, '-1')  # '-1' is the fallback for emotions that have no mapping.

def convert_csv_emotion_to_number(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t')
        lines = list(reader)

    for i in range(1, len(lines)):
        _, _, emotion = lines[i]
        emotion_number = convert_emotion_to_number(emotion)
        lines[i][2] = emotion_number

    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(lines)

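# Paths of the tab-separated file to convert and of the converted output.
# Note: NsmcCorpus.get_examples() below loads ratings_{mode}.txt, not final_{mode}.txt.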
input_file = '/content/Korpora/nsmc/ratings_train.txt'
output_file = '/content/Korpora/nsmc/final_train.txt'

convert_csv_emotion_to_number(input_file, output_file)

In [21]:
# Reuse convert_csv_emotion_to_number() from the previous cell for the test split.
input_file = '/content/Korpora/nsmc/ratings_test.txt'
output_file = '/content/Korpora/nsmc/final_test.txt'

convert_csv_emotion_to_number(input_file, output_file)
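
Unmapped emotions are written as `'-1'`, and the corpus class further down rejects any label outside `'0'` to `'6'`, so it may be worth checking a converted file before training. The sketch below counts the labels in a three-column tab-separated file. `count_labels` and the tiny in-memory sample are hypothetical and serve only as an illustration.

```python
import csv
import io
from collections import Counter


def count_labels(tsv_file) -> Counter:
    """Count third-column values of a tab-separated file object, skipping the header."""
    reader = csv.reader(tsv_file, delimiter='\t')
    next(reader, None)  # skip the header row
    return Counter(row[2] for row in reader if len(row) == 3)


# Made-up sample standing in for a converted file such as final_train.txt
sample = io.StringIO(
    "wav_id\t발화문\t상황\n"
    "a1\t나 어떻게 살지?\t5\n"
    "a2\t너무 무서워.\t2\n"
    "a3\t알 수 없는 감정\t-1\n"
)
counts = count_labels(sample)
unmapped = counts.get('-1', 0)  # rows whose emotion had no mapping
print(dict(counts), unmapped)
```

With a real file, the same function could be called as `count_labels(open(path, encoding='utf-8', newline=''))`. A non-zero `unmapped` count would suggest that `emotion_map` is missing a spelling that occurs in the data.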

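# Parts of the corpus code that need to be modified

※ Change the labels as follows:

분노 (anger) = '0' /
혐오 (disgust) = '1' /
공포 (fear) = '2' /
행복 (happiness) = '3' /
중립 (neutral) = '4' /
슬픔 (sadness) = '5' /
놀람 (surprise) = '6'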
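
In the corpus code that follows, each numeric code is turned back into its Korean emotion word and prepended to the sentence, so the model learns to generate text conditioned on that word. A minimal sketch of that step using a lookup table instead of an if/elif chain is shown here. `LABEL_TO_WORD` and `build_text` are hypothetical names and not part of ratsnlp.

```python
# Numeric label -> Korean emotion word used as the text prefix
LABEL_TO_WORD = {
    '0': '분노',  # anger
    '1': '혐오',  # disgust
    '2': '공포',  # fear
    '3': '행복',  # happiness
    '4': '중립',  # neutral
    '5': '슬픔',  # sadness
    '6': '놀람',  # surprise
}


def build_text(label: str, sentence: str) -> str:
    """Prefix the sentence with the emotion word for the given label."""
    if label not in LABEL_TO_WORD:
        raise ValueError(f"Invalid sentiment label: {label}")
    return LABEL_TO_WORD[label] + " " + sentence


print(build_text('5', '코로나 때문에 재택근무 한대.'))  # 슬픔 코로나 때문에 재택근무 한대.
```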

In [22]:
import os
import csv
import time
import torch
import logging
from filelock import FileLock
from dataclasses import dataclass
from typing import List, Optional
from torch.utils.data.dataset import Dataset
from transformers import PreTrainedTokenizerFast
from ratsnlp.nlpbook.generation.arguments import GenerationTrainArguments

logger = logging.getLogger("ratsnlp")


@dataclass
class GenerationExample:
    text: str


@dataclass
class GenerationFeatures:
    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    labels: Optional[List[int]] = None


class NsmcCorpus:

    def __init__(self):
        pass

    def _read_corpus(self, input_file, quotechar='"'):
        with open(input_file, "r", encoding="utf-8") as f:
            return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

    def _create_examples(self, lines):
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            _, review_sentence, sentiment = line

            # Modify the code below as appropriate
            if sentiment == '0':
                sentiment = '분노'
            elif sentiment == '1':
                sentiment = '혐오'
            elif sentiment == '2':
                sentiment = '공포'
            elif sentiment == '3':
                sentiment = '행복'
            elif sentiment == '4':
                sentiment = '중립'
            elif sentiment == '5':
                sentiment = '슬픔'
            elif sentiment == '6':
                sentiment = '놀람'
            else:
                raise ValueError(f"Invalid sentiment label: {sentiment}")

            text = sentiment + " " + review_sentence
            examples.append(GenerationExample(text=text))
        return examples

    def get_examples(self, data_root_path, mode):
        data_fpath = os.path.join(data_root_path, f"ratings_{mode}.txt")
        logger.info(f"loading {mode} data... LOOKING AT {data_fpath}")
        return self._create_examples(self._read_corpus(data_fpath))

def _convert_examples_to_generation_features(
        examples: List[GenerationExample],
        tokenizer: PreTrainedTokenizerFast,
        args: GenerationTrainArguments,
):

    logger.info(
        "tokenize sentences, it could take a lot of time..."
    )
    start = time.time()
    batch_encoding = tokenizer(
        [example.text for example in examples],
        max_length=args.max_seq_length,
        padding="max_length",
        truncation=True,
    )
    logger.info(
        "tokenize sentences [took %.3f s]", time.time() - start
    )

    features = []
    for i in range(len(examples)):
        inputs = {k: batch_encoding[k][i] for k in batch_encoding}
        feature = GenerationFeatures(**inputs, labels=batch_encoding["input_ids"][i])
        features.append(feature)

    for i, example in enumerate(examples[:5]):
        logger.info("*** Example ***")
        logger.info("sentence: %s" % (example.text))
        logger.info("tokens: %s" % (" ".join(tokenizer.convert_ids_to_tokens(features[i].input_ids))))
        logger.info("features: %s" % features[i])

    return features


class GenerationDataset(Dataset):

    def __init__(
            self,
            args: GenerationTrainArguments,
            tokenizer: PreTrainedTokenizerFast,
            corpus,
            mode: Optional[str] = "train",
            convert_examples_to_features_fn=_convert_examples_to_generation_features,
    ):
        if corpus is not None:
            self.corpus = corpus
        else:
            raise KeyError("corpus is not valid")
        if mode not in ["train", "val", "test"]:
            raise KeyError(f"mode({mode}) is not a valid split name")
        # Load data features from cache or dataset file
        cached_features_file = os.path.join(
            args.downstream_corpus_root_dir,
            args.downstream_corpus_name,
            "cached_{}_{}_{}_{}_{}".format(
                mode,
                tokenizer.__class__.__name__,
                str(args.max_seq_length),
                args.downstream_corpus_name,
                args.downstream_task_name,
            ),
        )

        # Make sure only the first process in distributed training processes the dataset,
        # and the others will use the cache.
        lock_path = cached_features_file + ".lock"
        with FileLock(lock_path):

            if os.path.exists(cached_features_file) and not args.overwrite_cache:
                start = time.time()
                self.features = torch.load(cached_features_file)
                logger.info(
                    f"Loading features from cached file {cached_features_file} [took %.3f s]", time.time() - start
                )
            else:
                corpus_path = os.path.join(
                    args.downstream_corpus_root_dir,
                    args.downstream_corpus_name,
                )
                logger.info(f"Creating features from dataset file at {corpus_path}")
                examples = self.corpus.get_examples(corpus_path, mode)
                tokenizer.pad_token = tokenizer.eos_token
                self.features = convert_examples_to_features_fn(
                    examples,
                    tokenizer,
                    args,
                )
                start = time.time()
                logger.info(
                    "Saving features into cached file, it could take a lot of time..."
                )
                torch.save(self.features, cached_features_file)
                logger.info(
                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

    def __len__(self):
        return len(self.features)

    def __getitem__(self, i):
        return self.features[i]

    def get_labels(self):
        return self.corpus.get_labels()
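
The feature conversion above pads every sentence to `max_seq_length` with the end-of-sequence token and reuses `input_ids` as `labels`, which is the usual setup for causal language modeling. The toy sketch below mimics that on plain integer lists. The token ids and `PAD_ID` are made up, and `make_features` is a hypothetical helper, not the tokenizer's actual implementation.

```python
from typing import Dict, List

PAD_ID = 1  # made-up id standing in for the </s> token


def make_features(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Truncate or pad to max_length and copy input_ids into labels."""
    ids = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [PAD_ID] * n_pad             # pad up to max_length
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),                 # labels mirror input_ids
    }


features = make_features([11, 12, 13], max_length=6)
print(features)
```

Hugging Face language-model heads skip positions whose label is -100 when computing the loss, so replacing the padded label positions with -100 is a common variant. The code above keeps the padding ids in `labels`.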

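# Building the training data
Build the training data.

The line `from ratsnlp.nlpbook.generation import NsmcCorpus, GenerationDataset` is commented out so that the classes defined above are used instead.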
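
One thing to keep in mind with the custom files: `GenerationDataset` above stores tokenized features in a cache file whose name depends only on the split, tokenizer class, sequence length, corpus name, and task name, and it reuses that file unless `args.overwrite_cache` is set. The sketch below rebuilds that name with a hypothetical `cache_path` helper, to show that changing the contents of the ratings files does not change the cache name.

```python
import os


def cache_path(root_dir, corpus_name, mode, tokenizer_cls, max_seq_length, task_name):
    """Rebuild the cache file path the same way GenerationDataset formats it."""
    name = "cached_{}_{}_{}_{}_{}".format(
        mode, tokenizer_cls, str(max_seq_length), corpus_name, task_name
    )
    return os.path.join(root_dir, corpus_name, name)


# Values taken from the logged GenerationTrainArguments in this notebook
path = cache_path(
    "/content/Korpora", "nsmc", "train",
    "PreTrainedTokenizerFast", 32, "sentence-generation",
)
print(path)
```

If features were already cached from an earlier run on different data, passing `overwrite_cache=True` to `GenerationTrainArguments` or deleting the cached file should force them to be rebuilt.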

In [24]:
# from ratsnlp.nlpbook.generation import NsmcCorpus, GenerationDataset
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler
corpus = NsmcCorpus()
train_dataset = GenerationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="train",
)
train_dataloader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=RandomSampler(train_dataset, replacement=False),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)

INFO:ratsnlp:Creating features from dataset file at /content/Korpora/nsmc
INFO:ratsnlp:loading train data... LOOKING AT /content/Korpora/nsmc/ratings_train.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 2.358 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 슬픔 코로나 때문에 재택근무 한대.
INFO:ratsnlp:tokens: ▁슬픔 ▁코 로나 ▁때문에 ▁재 택 근무 ▁한대 . </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>

# Building the test data
Build the test data used for evaluation during training.

In [25]:
val_dataset = GenerationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="test",
)
val_dataloader = DataLoader(
    val_dataset,
    batch_size=args.batch_size,
    sampler=SequentialSampler(val_dataset),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)


INFO:ratsnlp:Creating features from dataset file at /content/Korpora/nsmc
INFO:ratsnlp:loading test data... LOOKING AT /content/Korpora/nsmc/ratings_test.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 0.307 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 혐오 안그래도 냄새 뺄려고 환기시키는 중이었어.
INFO:ratsnlp:tokens: ▁혐오 ▁안 그래 도 ▁냄새 ▁ 뺄 려고 ▁환기 시키는 ▁중 이었 어 . </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


# Initializing the model
Load the pretrained GPT2 model and initialize the sentence generation model.

In [26]:
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained(
    args.pretrained_model_name
)

# Preparing for training
Prepare the Task and the Trainer.

In [27]:
from ratsnlp.nlpbook.generation import GenerationTask
task = GenerationTask(model, args)

In [28]:
trainer = nlpbook.get_trainer(args)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True, used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


# Training
Start training with the prepared data and model. The training outputs (checkpoints) are saved to the Google Drive location configured earlier (`/gdrive/My Drive/nlpbook/checkpoint-gener2`).

In [29]:
trainer.fit(
    task,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type            | Params
------------------------------------------
0 | model | GPT2LMHeadModel | 125 M 
------------------------------------------
125 M     Trainable params
0         Non-trainable params
125 M     Total params
500.656   Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]
