## 1. Load packages

In [1]:
import tensorflow as tf
from transformers import AutoTokenizer
from transformers import TFGPT2LMHeadModel

import pandas as pd

# tqdm.tqdm_notebook is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook

2022-06-22 09:22:07.850625: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-06-22 09:22:10.789372: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-22 09:22:10.792965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-22 09:22:10.795628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-22 09:22:10.798637: I tensorflow/core/

## 2. Load KoGPT2

In [2]:
KoGPT2_sent_tokenizer = AutoTokenizer.from_pretrained('skt/kogpt2-base-v2', bos_token='</s>', eos_token='</s>', pad_token='<pad>')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
KoGPT2_sent_model = TFGPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2', from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.9.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'lm_head.weight', 'transformer.h.10.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.1.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

## 3. Load training data

In [4]:
### Load the training data
train_data = pd.read_csv("./train data/train_gpt2_sent.csv")
train_data.drop("Unnamed: 0", axis=1, inplace = True)
train_data.head(2)

,문장,감정
0,이렇게 하는 거야.” 하며 시범을 보이자 그녀의 웃음소리가 병실 가득 메아리쳤다.,행복
1,"그러나 이렇게 해서라도 아까 본 피범벅 동영상을 머릿속에서 지울 수 있다면, 그녀의...",슬픔


In [5]:
train_data = train_data[train_data.문장.map(len) < 30].reset_index(drop=True)
train_data = train_data[['감정', '문장']]
train_data.head(2)

,감정,문장
0,행복,진짜 인버뤄-브뤠이트 같아.”일단 웃었다.
1,행복,고마워.


In [6]:
train_data.감정.value_counts()

슬픔    1103
분노     409
행복     286
Name: 감정, dtype: int64

## 4. Tokenization

In [7]:
# Parameter settings
batch_size = 32

In [8]:
# Generator that loads and tokenizes the data
def get_emo_data():
    bos_token = [KoGPT2_sent_tokenizer.bos_token_id]
    eos_token = [KoGPT2_sent_tokenizer.eos_token_id]
    # The emotion label fills the <usr> slot and the sentence fills the <sys> slot.
    for emotion, sentence in zip(train_data.감정.to_list(), train_data.문장.to_list()):
        sent = KoGPT2_sent_tokenizer.encode('<usr>' + emotion + '<sys>' + sentence)
        yield bos_token + sent + eos_token
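
The generator above frames each row as a conditional-generation example: the emotion label goes in the `<usr>` slot, the sentence in the `<sys>` slot, and the whole sequence is wrapped in `</s>` markers. A minimal string-level sketch of that layout follows. `build_training_text` is a hypothetical helper used only for illustration; the notebook itself concatenates token ids, not strings.

```python
BOS = EOS = "</s>"  # the notebook passes '</s>' as both bos_token and eos_token


def build_training_text(emotion: str, sentence: str) -> str:
    """Hypothetical helper: show the text layout behind one training example."""
    return f"{BOS}<usr>{emotion}<sys>{sentence}{EOS}"


example = build_training_text("행복", "고마워.")
print(example)  # </s><usr>행복<sys>고마워.</s>
```

At inference time the same layout is cut off right after `<sys>`, so the model continues with a sentence for the requested emotion.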

In [9]:
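# Create the Dataset the model consumes (output_signature replaces the deprecated output_types argument)
dataset = tf.data.Dataset.from_generator(get_emo_data, output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))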

In [10]:
# Pad variable-length inputs so that every sequence in a batch has the same length
dataset = dataset.padded_batch(batch_size=batch_size, padded_shapes=(None,), padding_values=KoGPT2_sent_tokenizer.pad_token_id)
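
With `padded_shapes=(None,)`, each batch is padded only up to the longest sequence in that batch, not to a global maximum. The pure-Python sketch below imitates that behaviour on toy ids. `padded_batches` and `_pad` are hypothetical names, and `pad_id=0` is a placeholder rather than KoGPT2's actual pad id.

```python
from typing import Iterable, Iterator, List


def _pad(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every row to the length of the longest row in this batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


def padded_batches(
    sequences: Iterable[List[int]], batch_size: int, pad_id: int
) -> Iterator[List[List[int]]]:
    """Group sequences into batches and pad each batch independently."""
    batch: List[List[int]] = []
    for seq in sequences:
        batch.append(list(seq))
        if len(batch) == batch_size:
            yield _pad(batch, pad_id)
            batch = []
    if batch:  # final, possibly smaller batch
        yield _pad(batch, pad_id)


batches = list(padded_batches([[5, 6, 7], [8], [9, 10]], batch_size=2, pad_id=0))
print(batches)  # [[[5, 6, 7], [8, 0, 0]], [[9, 10]]]
```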

## 5. Train the model

In [11]:
# 1. Define the optimizer
adam = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)

# 2. Define the number of steps (batches) per epoch
steps = len(train_data) // batch_size + 1
print(steps)
# 3. Set the number of epochs (training passes)
EPOCHS = 5

57
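
The expression `len(train_data) // batch_size + 1` yields 57 for the 1,798 rows counted earlier, but it overcounts by one whenever the row count is an exact multiple of the batch size. Because `steps` also divides the running loss, that would slightly understate the epoch average. A small sketch of ceiling division, using the hypothetical helper name `steps_per_epoch`:

```python
def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover every example once (ceiling division)."""
    return -(-num_examples // batch_size)


print(steps_per_epoch(1798, 32))  # 57, same as the notebook
print(steps_per_epoch(64, 32))    # 2, whereas 64 // 32 + 1 would give 3
```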


In [12]:
# 4. Train the model
for epoch in range(EPOCHS):
    epoch_loss = 0

    for batch in tqdm_notebook(dataset, total=steps):
        with tf.GradientTape() as tape:
            result = KoGPT2_sent_model(batch, labels=batch)
            loss = result[0]
            batch_loss = tf.reduce_mean(loss)
            
        grads = tape.gradient(batch_loss, KoGPT2_sent_model.trainable_variables)
        adam.apply_gradients(zip(grads, KoGPT2_sent_model.trainable_variables))
        epoch_loss += batch_loss / steps

    print('[Epoch: {:>4}] cost = {:>.9}'.format(epoch + 1, epoch_loss))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for batch in tqdm_notebook(dataset, total=steps):


  0%|          | 0/57 [00:00<?, ?it/s]

[Epoch:    1] cost = 2.60400867


  0%|          | 0/57 [00:00<?, ?it/s]

[Epoch:    2] cost = 1.72206843


  0%|          | 0/57 [00:00<?, ?it/s]

[Epoch:    3] cost = 1.35414803


  0%|          | 0/57 [00:00<?, ?it/s]

[Epoch:    4] cost = 1.03839958


  0%|          | 0/57 [00:00<?, ?it/s]

[Epoch:    5] cost = 0.856860101
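
The loop passes the padded batch as both input and labels, so padding positions appear to be scored like real tokens. One possible refinement, which this notebook does not apply, is to replace padding ids in the labels with `-100`, the value the Hugging Face language-modeling losses are generally documented to ignore. The pure-Python sketch below shows only the masking step. `mask_padding` is a hypothetical helper and the pad id `3` is a placeholder, not a value taken from the tokenizer.

```python
IGNORE_INDEX = -100  # assumption: label value skipped by the loss function


def mask_padding(labels, pad_id):
    """Return a copy of `labels` with every pad id replaced by IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in labels]


# Toy batch: id 1 stands in for '</s>', id 3 for '<pad>'.
batch = [[1, 11, 12, 1, 3, 3],
         [1, 21, 22, 23, 24, 1]]
labels = mask_padding(batch, pad_id=3)
print(labels)
# [[1, 11, 12, 1, -100, -100], [1, 21, 22, 23, 24, 1]]
```

Inside the training loop the same idea could likely be expressed on tensors with something like `tf.where(batch == pad_id, -100, batch)`. This works here because `<pad>` and `</s>` are distinct tokens, so the end-of-sequence marker still contributes to the loss.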


## 6. Save the full model

In [13]:
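# save_pretrained() writes a directory (config plus weights), so the '.h5' path below is a folder name
KoGPT2_sent_model.save_pretrained('./model/Gen_sent_GPT2_model.h5')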

In [14]:
load_model = TFGPT2LMHeadModel.from_pretrained('./model/Gen_sent_GPT2_model.h5')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ./model/Gen_sent_GPT2_model.h5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [17]:
def return_answer_by_chatbot(user_text):
    """Generate one sentence conditioned on the emotion label in `user_text`."""
    sent = '<usr>' + user_text + '<sys>'
    input_ids = [KoGPT2_sent_tokenizer.bos_token_id] + KoGPT2_sent_tokenizer.encode(sent)
    input_ids = tf.convert_to_tensor([input_ids])
    # Top-k sampling makes the output differ from run to run.
    output = load_model.generate(input_ids, max_length=50, do_sample=True, top_k=20)
    sentence = KoGPT2_sent_tokenizer.decode(output[0].numpy().tolist())
    # Keep only the text after '<sys>' and drop the end-of-sequence marker.
    chatbot_response = sentence.split('<sys>', 1)[1].replace('</s>', '').strip()
    return chatbot_response
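
The last step of `return_answer_by_chatbot` cuts the decoded string at the `<sys>` marker. Indexing `[1]` after `split` raises `IndexError` if the marker is ever missing from the decoded text. A more defensive variant can be sketched with `str.partition`. `extract_response` is a hypothetical helper, and the decoded string below is made up for illustration.

```python
def extract_response(decoded: str) -> str:
    """Keep the text after the first '<sys>' marker and drop '</s>' markers."""
    _, _, tail = decoded.partition("<sys>")
    return tail.replace("</s>", "").strip()


print(extract_response("</s><usr> 슬픔<sys> 고마워.</s>"))  # 고마워.
print(repr(extract_response("no marker here")))  # ''
```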

In [18]:
return_answer_by_chatbot('슬픔')

'아즈윈은 한동안 방긋 웃지도 않았다.'
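
Because `generate` is called with `do_sample=True` and `top_k=20`, repeated calls with the same emotion can return different sentences. The sketch below illustrates one top-k sampling step in pure Python: keep the `k` highest logits, renormalise them with a softmax, and draw from that reduced distribution. `sample_top_k` and the toy logits are illustrative assumptions, not the library's implementation.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)  # seeded only to make the demo repeatable
picks = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(10)]
print(set(picks) <= {0, 3})  # True: only the two highest logits can be drawn
```

A smaller `k` keeps generations closer to the most likely continuation, while a larger `k` allows more varied but less predictable sentences.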