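# Rubric
1. Was the chatbot training data preprocessed systematically?  
   Preprocessing and augmentation were carried out appropriately, producing a training set of roughly 30,000 examples.  
2. Was the transformer-based chatbot model trained stably, without overfitting?  
   A set of hyperparameters that avoids overfitting was presented.  
3. Are there cases where the chatbot gives a plausible answer to a user's question?  
   Examples of the chatbot answering questions appropriately, including the given example sentences, were submitted.  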

In [1]:
import pandas as pd

import re
from konlpy.tag import Mecab

import random
import gensim
from gensim.models.keyedvectors import Word2VecKeyedVectors
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
from tqdm import tqdm

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

from hyperopt import fmin, tpe, hp, Trials



# Step 1. Download the data
Read ChatbotData.csv with pandas, then store the questions and the answers in the questions and answers variables, respectively.  
[songys/Chatbot_data](https://github.com/songys/Chatbot_data)

In [2]:
df = pd.read_csv('./data/ChatbotData.csv')
df.head()

|   | Q | A | label |
|---|---|---|---|
| 0 | 12시 땡! | 하루가 또 가네요. | 0 |
| 1 | 1지망 학교 떨어졌어 | 위로해 드립니다. | 0 |
| 2 | 3박4일 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 3 | 3박4일 정도 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 4 | PPL 심하네 | 눈살이 찌푸려지죠. | 0 |


In [3]:
questions = df['Q']
answers = df['A']

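# Step 2. Clean the data
Implement the preprocess_sentence() function.  
- Convert all English letters to lowercase.
- Use a regular expression to remove everything except English letters, Hangul, digits, and the main punctuation marks (?, ., !).  

Unlike the previous project, the steps omitted here are handled by the tokenizer.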

In [4]:
def preprocess_sentence(sentence):

    sentence = sentence.lower().strip()
    sentence = re.sub(r"[^0-9ㄱ-ㅣ가-힣a-zA-Z?.!]+", " ", sentence)
    sentence = sentence.strip()
    
    return sentence
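
A quick way to sanity-check the cleaning rule is to run it on a couple of strings. The sketch below repeats the function so that it runs on its own; the second sample input is made up for illustration.

```python
import re

def preprocess_sentence(sentence):
    # Lowercase, then replace every run of characters outside the allowed set with one space.
    sentence = sentence.lower().strip()
    sentence = re.sub(r"[^0-9ㄱ-ㅣ가-힣a-zA-Z?.!]+", " ", sentence)
    return sentence.strip()

samples = ["PPL 심하네", "Hello,   World!! ㅋㅋ 3박4일~"]
cleaned = [preprocess_sentence(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Punctuation that survives the filter stays attached to the neighbouring word at this stage; it is split off later by the tokenizer.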

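# Step 3. Tokenize the data
Use the Mecab class from KoNLPy.  
Implement the build_corpus() function.
- It takes the source sentences and the target sentences as input.
- It cleans the data with the preprocess_sentence() function defined above, then tokenizes it.
- Tokenization uses the tokenizer function passed in as an argument; here mecab.morphs is passed.
- Pairs in which either side has more than max_len tokens (40 by default) are dropped.
- Duplicate sentences are dropped. Rather than comparing source-target pairs, the source side and the target side are checked separately, taking care that the source-target pairing stays aligned.  

Use the function to tokenize questions and answers and store the results in que_corpus and ans_corpus, respectively.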

In [5]:
def build_corpus(src_sentences, tgt_sentences, tokenize_fn, max_len = 40):
    src_corpus = []
    tgt_corpus = []
    src_seen = set()
    tgt_seen = set()

    for src, tgt in zip(src_sentences, tgt_sentences):
        src = preprocess_sentence(src)
        tgt = preprocess_sentence(tgt)

        src_tokens = tokenize_fn(src)
        tgt_tokens = tokenize_fn(tgt)

        if len(src_tokens) <= max_len and len(tgt_tokens) <= max_len:
            src_joined = ' '.join(src_tokens)
            tgt_joined = ' '.join(tgt_tokens)
            if src_joined not in src_seen and tgt_joined not in tgt_seen:
                src_corpus.append(src_tokens)
                tgt_corpus.append(tgt_tokens)
                src_seen.add(src_joined)
                tgt_seen.add(tgt_joined)

    return src_corpus, tgt_corpus
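
The duplicate rule is stricter than it may look: a pair is dropped when its question or its answer has already been seen. The sketch below shows this on toy data; build_pairs is a condensed, illustrative variant of build_corpus, and str.split stands in for mecab.morphs (an assumption made so the example runs without KoNLPy).

```python
def build_pairs(src_sentences, tgt_sentences, tokenize_fn, max_len=40):
    """Keep a pair only if both sides are short enough and neither side was seen before."""
    kept = []
    src_seen, tgt_seen = set(), set()
    for src, tgt in zip(src_sentences, tgt_sentences):
        src_tokens, tgt_tokens = tokenize_fn(src), tokenize_fn(tgt)
        if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
            continue
        src_key, tgt_key = " ".join(src_tokens), " ".join(tgt_tokens)
        if src_key in src_seen or tgt_key in tgt_seen:
            continue  # a repeated question OR a repeated answer drops the whole pair
        kept.append((src_tokens, tgt_tokens))
        src_seen.add(src_key)
        tgt_seen.add(tgt_key)
    return kept

# Toy data; the last answer is made up for illustration.
toy_questions = ["3박4일 놀러가고 싶다", "3박4일 정도 놀러가고 싶다", "ppl 심하네", "ppl 심하네"]
toy_answers = ["여행은 언제나 좋죠.", "여행은 언제나 좋죠.", "눈살이 찌푸려지죠.", "그러게요."]
pairs = build_pairs(toy_questions, toy_answers, str.split)
print(len(pairs), "of", len(toy_questions), "pairs kept")
```

The second pair is dropped because its answer repeats, and the fourth because its question repeats. This is consistent with the "3박4일 정도 놀러가고 싶다" row from df.head() not appearing in the que_corpus printout.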

In [6]:
mecab = Mecab()

que_corpus, ans_corpus = build_corpus(questions, answers, mecab.morphs)

print(len(que_corpus), len(ans_corpus))
print()
print('que_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in que_corpus[:5]]), sep = '\n')
print()
print('ans_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in ans_corpus[:5]]), sep = '\n')

7683 7683

que_corpus: 
12 시 땡 !
1 지망 학교 떨어졌 어
3 박 4 일 놀 러 가 고 싶 다
ppl 심하 네
sd 카드 망가졌 어

ans_corpus: 
하루 가 또 가 네요 .
위로 해 드립니다 .
여행 은 언제나 좋 죠 .
눈살 이 찌푸려 지 죠 .
다시 새로 사 는 게 마음 편해요 .


# Step 4. Augmentation
Apply lexical substitution in practice.  
Download a pretrained Korean embedding model from the link below. Korean (w) is the model trained with Word2Vec and has a reasonable file size; downloading it yields the ko.bin file.  
[Kyubyong/wordvectors](https://github.com/Kyubyong/wordvectors)  
Use the downloaded model to augment the data; see the lexical_sub() function.  
Pair the augmented que_corpus with the original ans_corpus, and the original que_corpus with the augmented ans_corpus, so that the full dataset grows to roughly three times its original size.

In [7]:
# wv = gensim.models.Word2Vec.load('./data/ko.bin')

In [8]:
# wv = gensim.models.KeyedVectors.load_word2vec_format('./data/ko.bin', binary = True)

In [9]:
word_vectors = Word2VecKeyedVectors.load('./data/word2vec_ko.model')

In [10]:
def lexical_sub(tokens):
    selected_tok = random.choice(tokens)

    try:
        similar_word = word_vectors.wv.similar_by_word(selected_tok)[0][0]
    except KeyError:
        similar_word = selected_tok
        # print('not changed', 'tokens:', tokens)
    
    return [similar_word if tok == selected_tok else tok for tok in tokens]
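
To make the substitution rule concrete without loading the embedding file, here is a minimal sketch in which a small dictionary plays the role of the Word2Vec nearest-neighbour lookup. TOY_NEIGHBORS, lexical_sub_toy and the explicit rng argument are hypothetical names introduced for this example only.

```python
import random

# Hypothetical nearest-neighbour table standing in for the Word2Vec lookup.
TOY_NEIGHBORS = {"여행": "관광", "학교": "대학"}

def lexical_sub_toy(tokens, neighbors, rng):
    """Pick one token at random and swap every occurrence of it for its nearest neighbour."""
    selected = rng.choice(tokens)
    # Tokens missing from the table stay as they are (the KeyError branch in lexical_sub).
    replacement = neighbors.get(selected, selected)
    return [replacement if tok == selected else tok for tok in tokens]

rng = random.Random(0)
tokens = ["여행", "은", "언제나", "좋", "죠", "."]
results = [lexical_sub_toy(tokens, TOY_NEIGHBORS, rng) for _ in range(20)]
changed = [r for r in results if r != tokens]
print(len(changed), "of", len(results), "draws changed the sentence")
```

Whenever the randomly chosen token has no entry, the "augmented" sentence is identical to the original, so some augmented pairs are plain copies of existing ones.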

In [11]:
def augmentation_data(que_corpus, ans_corpus):
    # Start from the original pairs.
    augmented_que_corpus = list(que_corpus)
    augmented_ans_corpus = list(ans_corpus)

    # Augmented question paired with its original answer.
    # zip keeps the pairing by position, avoiding a list.index() scan per sentence.
    for question, answer in zip(que_corpus, ans_corpus):
        augmented_que_corpus.append(lexical_sub(question))
        augmented_ans_corpus.append(answer)

    # Original question paired with its augmented answer.
    for question, answer in zip(que_corpus, ans_corpus):
        augmented_que_corpus.append(question)
        augmented_ans_corpus.append(lexical_sub(answer))

    return augmented_que_corpus, augmented_ans_corpus

In [12]:
que_corpus, ans_corpus = augmentation_data(que_corpus, ans_corpus)

print(len(que_corpus), len(ans_corpus))
print()
print('que_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in que_corpus[:5]]), sep = '\n')
print()
print('ans_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in ans_corpus[:5]]), sep = '\n')

23049 23049

que_corpus: 
12 시 땡 !
1 지망 학교 떨어졌 어
3 박 4 일 놀 러 가 고 싶 다
ppl 심하 네
sd 카드 망가졌 어

ans_corpus: 
하루 가 또 가 네요 .
위로 해 드립니다 .
여행 은 언제나 좋 죠 .
눈살 이 찌푸려 지 죠 .
다시 새로 사 는 게 마음 편해요 .


In [13]:
def delete_duplicate(que_corpus, ans_corpus):
    que_tokens = []
    ans_tokens = []

    que_seen = set()
    ans_seen = set()

    for i in range(len(que_corpus)):
        que_joined = ' '.join(que_corpus[i])
        ans_joined = ' '.join(ans_corpus[i])

        if que_joined in que_seen and ans_joined in ans_seen:
            pass
        else:
            que_tokens.append(que_corpus[i])
            ans_tokens.append(ans_corpus[i])
            que_seen.add(que_joined)
            ans_seen.add(ans_joined)
    
    return que_tokens, ans_tokens
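
Note that delete_duplicate drops a pair when its question and its answer have each been seen before, not necessarily together. The sketch below contrasts that rule with exact-pair deduplication on toy strings; drop_exact_pairs is an alternative shown for comparison, not the notebook's method.

```python
def drop_if_both_seen(pairs):
    """Mirror of delete_duplicate: drop a pair when its question AND its answer were each seen before."""
    kept, q_seen, a_seen = [], set(), set()
    for q, a in pairs:
        if q in q_seen and a in a_seen:
            continue
        kept.append((q, a))
        q_seen.add(q)
        a_seen.add(a)
    return kept

def drop_exact_pairs(pairs):
    """Alternative for comparison: drop only pairs that repeat exactly."""
    kept, seen = [], set()
    for pair in pairs:
        if pair in seen:
            continue
        kept.append(pair)
        seen.add(pair)
    return kept

# The last pair is new as a pair, but its question and its answer
# have each appeared separately in earlier pairs.
toy_pairs = [("q1", "a1"), ("q2", "a2"), ("q1", "a1"), ("q1", "a2")]
both_seen_result = drop_if_both_seen(toy_pairs)
exact_result = drop_exact_pairs(toy_pairs)
print(both_seen_result)
print(exact_result)
```

Which rule is preferable depends on whether such recombined pairs are considered useful training data.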

In [14]:
que_corpus, ans_corpus = delete_duplicate(que_corpus, ans_corpus)

print(len(que_corpus), len(ans_corpus))
print()
print('que_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in que_corpus[:5]]), sep = '\n')
print()
print('ans_corpus: ',
      '\n'.join([' '.join(tokens) for tokens in ans_corpus[:5]]), sep = '\n')

22840 22840

que_corpus: 
12 시 땡 !
1 지망 학교 떨어졌 어
3 박 4 일 놀 러 가 고 싶 다
ppl 심하 네
sd 카드 망가졌 어

ans_corpus: 
하루 가 또 가 네요 .
위로 해 드립니다 .
여행 은 언제나 좋 죠 .
눈살 이 찌푸려 지 죠 .
다시 새로 사 는 게 마음 편해요 .


# Step 5. Vectorize the data
Add the \<start\> and \<end\> tokens to ans_corpus, then vectorize.  
```python
sample_data = ["12", "시", "땡", "!"]

print(["<start>"] + sample_data + ["<end>"])
```
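The chatbot training data uses the same language on the source and target sides, so sharing the embedding layer between them brings many advantages.  
- Combine ans_corpus and que_corpus to build a single vocabulary over all of the data, then vectorize to obtain enc_train and dec_train.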

In [15]:
start_token = '<start>'
end_token = '<end>'

ans_corpus = [[start_token] + ans + [end_token] for ans in ans_corpus]

In [16]:
total_corpus = que_corpus + ans_corpus

tokenizer = Tokenizer(filters = '', oov_token = '<OOV>')
tokenizer.fit_on_texts(total_corpus)

start_id = tokenizer.word_index[start_token]
end_id = tokenizer.word_index[end_token]

que_sequences = tokenizer.texts_to_sequences(que_corpus)
ans_sequences = tokenizer.texts_to_sequences(ans_corpus)

max_len = max(max(len(seq) for seq in que_sequences), max(len(seq) for seq in ans_sequences))
enc_train = pad_sequences(que_sequences, padding = 'post', maxlen = max_len)
dec_train = pad_sequences(ans_sequences, padding = 'post', maxlen = max_len)

VOCAB_SIZE = len(tokenizer.word_index) + 1


print(f"Vocabulary Size: {VOCAB_SIZE}")
print(f"Encoder Training Shape: {enc_train.shape}")
print(f"Decoder Training Shape: {dec_train.shape}")

print("Sample Encoder Sequence:", enc_train[0])
print("Sample Decoder Sequence:", dec_train[0])

Vocabulary Size: 7593
Encoder Training Shape: (22840, 41)
Decoder Training Shape: (22840, 41)
Sample Encoder Sequence: [2215  215 3416  111    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0]
Sample Decoder Sequence: [  3 279   9 149   9  43   2   4   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0]
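
The shared-vocabulary idea can be reproduced in miniature without Keras. The sketch below is a simplified stand-in for Tokenizer and pad_sequences, assuming id 0 for padding and id 1 for out-of-vocabulary tokens; build_vocab and vectorize are illustrative helper names.

```python
from collections import Counter

def build_vocab(corpus, oov_token="<OOV>"):
    """Map tokens to ids by descending frequency; 0 is reserved for padding, 1 for OOV."""
    counts = Counter(tok for sent in corpus for tok in sent)
    vocab = {oov_token: 1}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab) + 1
    return vocab

def vectorize(corpus, vocab, max_len):
    """Convert tokens to ids and right-pad with zeros; assumes no sentence exceeds max_len."""
    rows = []
    for sent in corpus:
        ids = [vocab.get(tok, vocab["<OOV>"]) for tok in sent]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

que = [["12", "시", "땡", "!"], ["ppl", "심하", "네"]]
ans = [["<start>", "하루", "가", "또", "가", "네요", ".", "<end>"],
       ["<start>", "눈살", "이", "찌푸려", "지", "죠", ".", "<end>"]]

vocab = build_vocab(que + ans)            # one vocabulary shared by both sides
max_len = max(len(s) for s in que + ans)
enc = vectorize(que, vocab, max_len)
dec = vectorize(ans, vocab, max_len)
vocab_size = len(vocab) + 1               # +1 for the padding id 0
print(vocab_size, enc[0], dec[0])
```

Because both sides draw ids from the same table, a token such as "." gets the same id whether it appears in a question or in an answer, which is what allows one embedding matrix to serve the encoder and the decoder.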


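# Step 6. Training
Reuse the Transformer defined earlier for the translation model as-is.  
Because the dataset is small, the hyperparameters need tuning to avoid overfitting.  
Train the model and generate answers for the example sentences below.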
```python
# Example sentences
1. 지루하다, 놀러가고 싶어.
2. 오늘 일찍 일어났더니 피곤하다.
3. 간만에 여자친구랑 데이트 하기로 했어.
4. 집에 있는다는 소리야.
```
---
```python
# Submission

Translations
> 1. 잠깐 쉬 어도 돼요 . <end>
> 2. 맛난 거 드세요 . <end>
> 3. 떨리 겠 죠 . <end>
> 4. 좋 아 하 면 그럴 수 있 어요 . <end>

Hyperparameters
> n_layers: 1
> d_model: 368
> n_heads: 8
> d_ff: 1024
> dropout: 0.2

Training Parameters
> Warmup Steps: 1000
> Batch Size: 64
> Epoch At: 10
```

In [17]:
# Positional encoding
def positional_encoding(pos, d_model):
    def cal_angle(position, i):
        return position / np.power(10000, (2*(i//2)) / np.float32(d_model))

    def get_posi_angle_vec(position):
        return [cal_angle(position, i) for i in range(d_model)]

    sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(pos)])

    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])

    return sinusoid_table
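
As a small check of the sinusoid table, the sketch below computes single rows with the standard library only. positional_encoding_row is an illustrative helper that follows the same angle formula as cal_angle above.

```python
import math

def positional_encoding_row(position, d_model):
    """One row of the sinusoid table: sin on even indices, cos on odd indices."""
    row = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

row0 = positional_encoding_row(0, 4)
row1 = positional_encoding_row(1, 4)
print(row0)
print(row1)
```

At position 0 every angle is zero, so the row alternates between sin(0) and cos(0). Later positions rotate each sin/cos pair at a frequency that falls as the index grows.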

In [18]:
# Mask generation
def generate_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

def generate_lookahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask

def generate_masks(src, tgt):
    enc_mask = generate_padding_mask(src)
    dec_enc_mask = generate_padding_mask(src)

    dec_lookahead_mask = generate_lookahead_mask(tgt.shape[1])
    dec_tgt_padding_mask = generate_padding_mask(tgt)
    dec_mask = tf.maximum(dec_tgt_padding_mask, dec_lookahead_mask)

    return enc_mask, dec_enc_mask, dec_mask
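
To see what the combined decoder mask looks like, here is a plain-Python sketch for a single sequence. A value of 1 means "do not attend", matching the convention used by generate_masks; the helper names are illustrative.

```python
def padding_mask(seq):
    """1 marks a padding position (id 0) that attention should ignore."""
    return [1 if tok == 0 else 0 for tok in seq]

def lookahead_mask(size):
    """1 marks a future position: row i may only attend to columns 0..i."""
    return [[1 if col > row else 0 for col in range(size)] for row in range(size)]

def decoder_mask(seq):
    """Element-wise maximum of the two masks, as in generate_masks."""
    pad = padding_mask(seq)
    look = lookahead_mask(len(seq))
    return [[max(pad[col], look[row][col]) for col in range(len(seq))]
            for row in range(len(seq))]

tgt_in = [3, 279, 9, 0]   # <start>, two tokens, one padding position
combined = decoder_mask(tgt_in)
for row in combined:
    print(row)
```

Each row is one query position. The padding column stays masked for every row, while the upper triangle hides tokens that have not been generated yet.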

In [19]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        self.depth = d_model // self.num_heads
        
        self.W_q = tf.keras.layers.Dense(d_model)
        self.W_k = tf.keras.layers.Dense(d_model)
        self.W_v = tf.keras.layers.Dense(d_model)
        
        self.linear = tf.keras.layers.Dense(d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask):
        d_k = tf.cast(K.shape[-1], tf.float32)
        QK = tf.matmul(Q, K, transpose_b=True)

        scaled_qk = QK / tf.math.sqrt(d_k)

        if mask is not None: scaled_qk += (mask * -1e9)  

        attentions = tf.nn.softmax(scaled_qk, axis=-1)
        out = tf.matmul(attentions, V)

        return out, attentions
        

    def split_heads(self, x):
        bsz = x.shape[0]
        split_x = tf.reshape(x, (bsz, -1, self.num_heads, self.depth))
        split_x = tf.transpose(split_x, perm=[0, 2, 1, 3])

        return split_x

    def combine_heads(self, x):
        bsz = x.shape[0]
        combined_x = tf.transpose(x, perm=[0, 2, 1, 3])
        combined_x = tf.reshape(combined_x, (bsz, -1, self.d_model))

        return combined_x

    
    def call(self, Q, K, V, mask):
        WQ = self.W_q(Q)
        WK = self.W_k(K)
        WV = self.W_v(V)
        
        WQ_splits = self.split_heads(WQ)
        WK_splits = self.split_heads(WK)
        WV_splits = self.split_heads(WV)
        
        out, attention_weights = self.scaled_dot_product_attention(
            WQ_splits, WK_splits, WV_splits, mask)
                        
        out = self.combine_heads(out)
        out = self.linear(out)
            
        return out, attention_weights
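
The core of the layer above is scaled dot-product attention. The sketch below reimplements it for a single head on nested lists, using only the standard library, so the effect of the mask is easy to inspect. It is a simplified illustration, not a replacement for the batched TensorFlow version.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: [n_q][d], K: [n_k][d], V: [n_k][d_v]; mask[i][j] == 1 hides key j from query i."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            # Same trick as the layer above: add a large negative number to masked scores.
            scores = [s + mask[i][j] * -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
causal = [[0, 1], [0, 0]]   # query 0 may not look at key 1

out, attn = scaled_dot_product_attention(Q, K, V, causal)
print(attn)
print(out)
```

The masked score becomes so negative that its softmax weight underflows to zero, so query 0 copies the first value vector unchanged.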

In [20]:
# Position-wise feed-forward network
class PoswiseFeedForwardNet(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ff):
        super(PoswiseFeedForwardNet, self).__init__()
        self.d_model = d_model
        self.d_ff = d_ff

        self.fc1 = tf.keras.layers.Dense(d_ff, activation='relu')
        self.fc2 = tf.keras.layers.Dense(d_model)

    def call(self, x):
        out = self.fc1(x)
        out = self.fc2(out)
            
        return out

In [21]:
# Encoder layer
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()

        self.enc_self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.do = tf.keras.layers.Dropout(dropout)
        
    def call(self, x, mask):
        '''
        Multi-Head Attention
        '''
        residual = x
        out = self.norm_1(x)
        out, enc_attn = self.enc_self_attn(out, out, out, mask)
        out = self.do(out)
        out += residual
        
        '''
        Position-Wise Feed Forward Network
        '''
        residual = out
        out = self.norm_2(out)
        out = self.ffn(out)
        out = self.do(out)
        out += residual
        
        return out, enc_attn

In [22]:
# Decoder layer
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()

        self.dec_self_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)

        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.do = tf.keras.layers.Dropout(dropout)
    
    def call(self, x, enc_out, dec_enc_mask, padding_mask):
        '''
        Masked Multi-Head Attention
        '''
        residual = x
        out = self.norm_1(x)
        out, dec_attn = self.dec_self_attn(out, out, out, padding_mask)
        out = self.do(out)
        out += residual

        '''
        Multi-Head Attention
        '''
        residual = out
        out = self.norm_2(out)
        # Mind the Q, K, V order: queries come from the decoder, keys and values from the encoder output.
        out, dec_enc_attn = self.enc_dec_attn(Q=out, K=enc_out, V=enc_out, mask=dec_enc_mask)
        out = self.do(out)
        out += residual
        
        '''
        Position-Wise Feed Forward Network
        '''
        residual = out
        out = self.norm_3(out)
        out = self.ffn(out)
        out = self.do(out)
        out += residual

        return out, dec_attn, dec_enc_attn

In [23]:
# Encoder
class Encoder(tf.keras.Model):
    def __init__(self,
                    n_layers,
                    d_model,
                    n_heads,
                    d_ff,
                    dropout):
        super(Encoder, self).__init__()
        self.n_layers = n_layers
        self.enc_layers = [EncoderLayer(d_model, n_heads, d_ff, dropout) 
                        for _ in range(n_layers)]
    
        self.do = tf.keras.layers.Dropout(dropout)
        
    def call(self, x, mask):
        out = x
    
        enc_attns = list()
        for i in range(self.n_layers):
            out, enc_attn = self.enc_layers[i](out, mask)
            enc_attns.append(enc_attn)
        
        return out, enc_attns

In [24]:
# Decoder
class Decoder(tf.keras.Model):
    def __init__(self,
                    n_layers,
                    d_model,
                    n_heads,
                    d_ff,
                    dropout):
        super(Decoder, self).__init__()
        self.n_layers = n_layers
        self.dec_layers = [DecoderLayer(d_model, n_heads, d_ff, dropout) 
                            for _ in range(n_layers)]
                            
    def call(self, x, enc_out, dec_enc_mask, padding_mask):
        out = x
    
        dec_attns = list()
        dec_enc_attns = list()
        for i in range(self.n_layers):
            out, dec_attn, dec_enc_attn = \
            self.dec_layers[i](out, enc_out, dec_enc_mask, padding_mask)

            dec_attns.append(dec_attn)
            dec_enc_attns.append(dec_enc_attn)

        return out, dec_attns, dec_enc_attns

In [25]:
class Transformer(tf.keras.Model):
    def __init__(self,
                    n_layers,
                    d_model,
                    n_heads,
                    d_ff,
                    src_vocab_size,
                    tgt_vocab_size,
                    pos_len,
                    dropout=0.2,
                    shared_fc=True,
                    shared_emb=False):
        super(Transformer, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)

        if shared_emb:
            self.enc_emb = self.dec_emb = \
            tf.keras.layers.Embedding(src_vocab_size, d_model)
        else:
            self.enc_emb = tf.keras.layers.Embedding(src_vocab_size, d_model)
            self.dec_emb = tf.keras.layers.Embedding(tgt_vocab_size, d_model)

        self.pos_encoding = positional_encoding(pos_len, d_model)
        self.do = tf.keras.layers.Dropout(dropout)

        self.encoder = Encoder(n_layers, d_model, n_heads, d_ff, dropout)
        self.decoder = Decoder(n_layers, d_model, n_heads, d_ff, dropout)

        self.fc = tf.keras.layers.Dense(tgt_vocab_size)

        self.shared_fc = shared_fc

        if shared_fc:
            self.fc.set_weights(tf.transpose(self.dec_emb.weights))

    def embedding(self, emb, x):
        seq_len = x.shape[1]

        out = emb(x)

        if self.shared_fc: out *= tf.math.sqrt(self.d_model)

        out += self.pos_encoding[np.newaxis, ...][:, :seq_len, :]
        out = self.do(out)

        return out

        
    def call(self, enc_in, dec_in, enc_mask, dec_enc_mask, dec_mask):
        enc_in = self.embedding(self.enc_emb, enc_in)
        dec_in = self.embedding(self.dec_emb, dec_in)

        enc_out, enc_attns = self.encoder(enc_in, enc_mask)
        
        dec_out, dec_attns, dec_enc_attns = \
        self.decoder(dec_in, enc_out, dec_enc_mask, dec_mask)
        
        logits = self.fc(dec_out)
        
        return logits, enc_attns, dec_attns, dec_enc_attns

In [26]:
# Create a Transformer instance with the given hyperparameters
transformer = Transformer(
    n_layers=1,
    d_model=512,
    n_heads=8,
    d_ff=1024,
    src_vocab_size=VOCAB_SIZE,
    tgt_vocab_size=VOCAB_SIZE,
    pos_len=200,
    dropout=0.3,
    shared_fc=True,
    shared_emb=True)

2024-07-16 12:28:13.037812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20628 MB memory:  -> device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:02:00.0, compute capability: 8.6


In [27]:
# Learning-rate scheduler
class LearningRateScheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(LearningRateScheduler, self).__init__()
        
        self.d_model = d_model
        self.warmup_steps = warmup_steps
    
    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = step ** -0.5
        arg2 = step * (self.warmup_steps ** -1.5)
        
        return (self.d_model ** -0.5) * tf.math.minimum(arg1, arg2)

In [28]:
# Instantiate the learning-rate schedule and the optimizer
d_model = 512
learning_rate = LearningRateScheduler(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate,
                                        beta_1=0.9,
                                        beta_2=0.98, 
                                        epsilon=1e-9)
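
It can help to evaluate the schedule by hand before training. The sketch below recomputes the same formula in plain Python. The figure of 357 steps per epoch is taken from the training progress bar, and 3 epochs from the EPOCHS setting used for the first run.

```python
def transformer_lr(step, d_model, warmup_steps):
    """Learning rate at a given optimizer step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

d_model, warmup_steps = 512, 4000
steps_per_epoch, epochs = 357, 3
total_steps = steps_per_epoch * epochs

peak_lr = transformer_lr(warmup_steps, d_model, warmup_steps)
last_lr = transformer_lr(total_steps, d_model, warmup_steps)
print(f"total steps: {total_steps}")
print(f"peak lr (reached at step {warmup_steps}): {peak_lr:.6f}")
print(f"lr at the last training step: {last_lr:.6f}")
```

With these settings the whole run finishes before step 4000, so the learning rate is still in its linear warm-up phase when training stops. A shorter warm-up, such as the 1000 steps used in the tuning section, lets the schedule reach its peak within the run.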

In [29]:
# Loss function
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)
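
The effect of the padding mask on the loss is easiest to see with numbers. In the sketch below the per-token loss values are made up; masked_mean_loss is an illustrative plain-Python counterpart of loss_function.

```python
def masked_mean_loss(per_token_loss, gold_ids, pad_id=0):
    """Average the per-token losses over non-padding positions only."""
    mask = [0.0 if tok == pad_id else 1.0 for tok in gold_ids]
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    return total / sum(mask)

gold = [279, 9, 4, 0, 0]               # three real tokens, two padding positions
losses = [1.0, 2.0, 3.0, 9.0, 9.0]     # made-up per-token cross-entropy values

masked = masked_mean_loss(losses, gold)
unmasked = sum(losses) / len(losses)
print(masked, unmasked)
```

Without the mask, the loss on padding positions would dominate the average, since most of each 41-token row is padding.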

In [30]:
# Train step
@tf.function()
def train_step(src, tgt, model, optimizer):
    tgt_in = tgt[:, :-1]  # decoder input
    gold = tgt[:, 1:]     # target compared against the decoder output (tgt shifted by one position)

    enc_mask, dec_enc_mask, dec_mask = generate_masks(src, tgt_in)

    with tf.GradientTape() as tape:
        predictions, enc_attns, dec_attns, dec_enc_attns = \
        model(src, tgt_in, enc_mask, dec_enc_mask, dec_mask)
        loss = loss_function(gold, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)    
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return loss, enc_attns, dec_attns, dec_enc_attns
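
The two slices in train_step implement teacher forcing. The sketch below applies them to the start of the sample decoder sequence printed in Step 5 (truncated to ten ids), where 3 is \<start\> and 4 is \<end\>.

```python
# <start> 하루 가 또 가 네요 . <end>, followed by two padding positions
tgt = [3, 279, 9, 149, 9, 43, 2, 4, 0, 0]

tgt_in = tgt[:-1]   # what the decoder reads at each position
gold = tgt[1:]      # what the decoder should predict at each position

for step, (seen, expected) in enumerate(zip(tgt_in, gold)):
    print(step, "input:", seen, "-> target:", expected)
```

At every position the decoder sees the tokens up to that point and is asked for the next one, so the final real prediction is \<end\>, made right after the "." token.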

In [31]:
BATCH_SIZE = 64
train_dataset = tf.data.Dataset.from_tensor_slices((enc_train, dec_train)).batch(batch_size=BATCH_SIZE)

In [32]:
# Q. Train the model using the code above.
EPOCHS = 3

for epoch in range(EPOCHS):
    total_loss = 0
    
    dataset_count = tf.data.experimental.cardinality(train_dataset).numpy()
    tqdm_bar = tqdm(total=dataset_count)
    
    for (batch, (src, tgt)) in enumerate(train_dataset):
        batch_loss, enc_attns, dec_attns, dec_enc_attns = train_step(src, tgt, transformer, optimizer)
        total_loss += batch_loss

        # Update the tqdm progress bar
        tqdm_bar.update(1)
        tqdm_bar.set_postfix(loss=total_loss.numpy() / (batch + 1))
    
    tqdm_bar.close()
    print(f'Epoch {epoch + 1} Loss {total_loss.numpy() / dataset_count}')


Epoch 1 Loss 5.841907661502101


100%|██████████| 357/357 [00:08<00:00, 44.15it/s, loss=3.72]


Epoch 2 Loss 3.724851122089461


100%|██████████| 357/357 [00:08<00:00, 44.03it/s, loss=2.83]

Epoch 3 Loss 2.8295937759869574





# Step 7. Measure performance
Check whether the model answers the given questions appropriately, and apply the calculate_bleu() function to compute BLEU scores.

In [33]:
# Response generation (greedy decoding)
def evaluate(sentence, model, tokenizer, max_len = max_len):
    sentence = preprocess_sentence(sentence)
    
    pieces = mecab.morphs(sentence)
    tokens = tokenizer.texts_to_sequences([' '.join(pieces)])

    _input = pad_sequences(tokens, maxlen=max_len, padding='post')

    ids = []
    output = tf.expand_dims([start_id], 0)
    for i in range(dec_train.shape[-1]):
        # generate_masks returns (enc_mask, dec_enc_mask, dec_mask), in that order.
        enc_mask, dec_enc_mask, dec_mask = \
        generate_masks(_input, output)

        predictions, enc_attns, dec_attns, dec_enc_attns =\
        model(_input, 
              output,
              enc_mask,
              dec_enc_mask,
              dec_mask)

        # Greedy choice: take the most likely token at the last position.
        predicted_id = \
        tf.argmax(tf.math.softmax(predictions, axis=-1)[0, -1]).numpy().item()

        if end_id == predicted_id:
            result = tokenizer.sequences_to_texts([ids])
            return pieces, ' '.join(result), enc_attns, dec_attns, dec_enc_attns

        ids.append(predicted_id)
        output = tf.concat([output, tf.expand_dims([predicted_id], 0)], axis=-1)

    # No <end> token within the length limit. sequences_to_texts expects a list
    # of sequences, so the ids are wrapped in a list here as well.
    result = tokenizer.sequences_to_texts([ids])

    return pieces, ' '.join(result), enc_attns, dec_attns, dec_enc_attns
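
The decoding loop can be isolated from the model to make its stopping rule clear. In the sketch below a lookup table plays the role of the Transformer; stub_model and its table are hypothetical stand-ins, and greedy_decode is an illustrative reduction of evaluate.

```python
def greedy_decode(next_token_fn, start_id, end_id, max_steps):
    """Feed the growing output back in until <end> is produced or max_steps is reached."""
    output = [start_id]
    for _ in range(max_steps):
        predicted_id = next_token_fn(output)   # argmax over the vocabulary in the real model
        if predicted_id == end_id:
            break
        output.append(predicted_id)
    return output[1:]   # drop <start>

START, END = 3, 4
# Hypothetical "previous token -> next token" table standing in for the model.
NEXT_TOKEN = {3: 279, 279: 9, 9: 43, 43: 2, 2: 4}

def stub_model(prefix):
    return NEXT_TOKEN[prefix[-1]]

full = greedy_decode(stub_model, START, END, max_steps=41)
cut_short = greedy_decode(stub_model, START, END, max_steps=2)
print(full)
print(cut_short)
```

The second call shows the fallback path: when \<end\> never appears within the step limit, whatever has been generated so far is returned.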

In [34]:
def translate(sentence, model, tokenizer):
    pieces, result, enc_attns, dec_attns, dec_enc_attns = \
    evaluate(sentence, model, tokenizer)
    
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

    return result

In [35]:
examples = ['지루하다, 놀러가고 싶어.',
            '오늘 일찍 일어났더니 피곤하다.',
            '간만에 여자친구랑 데이트 하기로 했어.',
            '집에 있는다는 소리야.']

candidates = []
for example in examples:
    candidates.append(translate(example, transformer, tokenizer))

Input: 지루하다, 놀러가고 싶어.
Predicted translation: 사랑 은 사랑 은 사랑 이 필요 해요 .
Input: 오늘 일찍 일어났더니 피곤하다.
Predicted translation: 마음 은 잘 되 었 나 봐요 .
Input: 간만에 여자친구랑 데이트 하기로 했어.
Predicted translation: 좋 은 선택 이 었 나 봐요 .
Input: 집에 있는다는 소리야.
Predicted translation: 좋 아 하 는 것 이 에요 .


In [36]:
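def calculate_bleu(reference, candidate, weights=[0.25, 0.25, 0.25, 0.25]):
    # Note: reference and candidate are passed as plain strings here, so NLTK
    # iterates over them character by character (spaces included) and the
    # scores are character-level. Pass reference.split() and candidate.split()
    # to score morpheme tokens instead.
    return sentence_bleu([reference],
                         candidate,
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)  # apply smoothing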

references = ['잠깐 쉬 어도 돼요 . <end>',
              '맛난 거 드세요 . <end>',
              '떨리 겠 죠 . <end>',
              '좋 아 하 면 그럴 수 있 어요 . <end>']

for reference, candidate in zip(references, candidates):
    print("BLEU-1:", calculate_bleu(reference, candidate, weights=[1, 0, 0, 0]))
    print("BLEU-2:", calculate_bleu(reference, candidate, weights=[0, 1, 0, 0]))
    print("BLEU-3:", calculate_bleu(reference, candidate, weights=[0, 0, 1, 0]))
    print("BLEU-4:", calculate_bleu(reference, candidate, weights=[0, 0, 0, 1]))

    print("BLEU-Total:", calculate_bleu(reference, candidate))
    print('\n')

BLEU-1: 0.3181818181818182
BLEU-2: 0.09523809523809525
BLEU-3: 0.05000000000000001
BLEU-4: 0.005263157894736842
BLEU-Total: 0.05314049749131566


BLEU-1: 0.35294117647058826
BLEU-2: 0.12500000000000003
BLEU-3: 0.06666666666666667
BLEU-4: 0.007142857142857146
BLEU-Total: 0.06770149544242768


BLEU-1: 0.29411764705882354
BLEU-2: 0.0625
BLEU-3: 0.006666666666666668
BLEU-4: 0.007142857142857146
BLEU-Total: 0.03058760346458022


BLEU-1: 0.42733711854819223
BLEU-2: 0.26589865154109743
BLEU-3: 0.20349386597532967
BLEU-4: 0.13148834416867455
BLEU-Total: 0.23481797190640163




# Hyperparameter tuning

In [40]:
def objective(params, epochs = 3):
    learning_rate = LearningRateScheduler(params['d_model'], warmup_steps = 1000)
    optimizer = tf.keras.optimizers.Adam(learning_rate,
                                        beta_1=0.9,
                                        beta_2=0.98, 
                                        epsilon=1e-9)

    transformer = Transformer(
        n_layers = int(params['n_layers']),
        d_model = int(params['d_model']),
        n_heads = int(params['n_heads']),
        d_ff = int(params['d_ff']),
        src_vocab_size = VOCAB_SIZE,
        tgt_vocab_size = VOCAB_SIZE,
        pos_len = 200,
        dropout = params['dropout'],
        shared_fc = True,
        shared_emb = True
    )

    # Train step, redefined inside objective so that each trial gets a fresh tf.function
    @tf.function()
    def train_step(src, tgt, model, optimizer):
        tgt_in = tgt[:, :-1]  # decoder input
        gold = tgt[:, 1:]     # target compared against the decoder output (tgt shifted by one position)

        enc_mask, dec_enc_mask, dec_mask = generate_masks(src, tgt_in)

        with tf.GradientTape() as tape:
            predictions, enc_attns, dec_attns, dec_enc_attns = \
            model(src, tgt_in, enc_mask, dec_enc_mask, dec_mask)
            loss = loss_function(gold, predictions)

        gradients = tape.gradient(loss, model.trainable_variables)    
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        return loss, enc_attns, dec_attns, dec_enc_attns

    dataset_count = tf.data.experimental.cardinality(train_dataset).numpy()

    for epoch in range(epochs):
        total_loss = 0

        for batch, (src, tgt) in enumerate(train_dataset):
            batch_loss, enc_attns, dec_attns, dec_enc_attns = train_step(src, tgt, transformer, optimizer)
            total_loss += batch_loss

        print(f'Epoch {epoch + 1} Loss {total_loss.numpy() / dataset_count}')

    return total_loss.numpy() / dataset_count
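
The objective above returns the training loss, which keeps falling even when a model overfits. One possible refinement, sketched here under the assumption that the question-answer pairs are available as a list, is to hold out a validation subset and score each trial on it instead. train_val_split and the placeholder pairs are hypothetical.

```python
import random

def train_val_split(pairs, val_ratio=0.1, seed=42):
    """Shuffle a copy of the pairs and hold out a val_ratio share for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

placeholder_pairs = [(f"q{i}", f"a{i}") for i in range(100)]
train_pairs, val_pairs = train_val_split(placeholder_pairs)
print(len(train_pairs), len(val_pairs))
```

If this approach is used, splitting before augmentation is probably safer, since an augmented copy of a validation pair in the training set would make the validation loss look better than it is.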

In [41]:
space = {
    'n_layers': hp.choice('n_layers', [1, 2, 3]),
    'd_model': hp.choice('d_model', [128, 256, 512]),
    'n_heads': hp.choice('n_heads', [4, 8]),
    'd_ff': hp.choice('d_ff', [512, 1024, 2048]),
    'dropout': hp.uniform('dropout', 0.1, 0.5)
}

trials = Trials()
model = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=5, trials=trials)
print(model)

Epoch 1 Loss 4.546649665725665                       
Epoch 2 Loss 2.4272226712950804                      
Epoch 3 Loss 1.5802085726868873                      
Epoch 1 Loss 4.713446267178747                                                 
Epoch 2 Loss 2.9203616016719187                                                
Epoch 3 Loss 2.126684878052784                                                 
Epoch 1 Loss 5.03676983488708                                                  
Epoch 2 Loss 3.1375975194765404                                                
Epoch 3 Loss 2.387592048538165                                                 
Epoch 1 Loss 4.8727582830007                                                   
Epoch 2 Loss 3.1527489413734244                                                
Epoch 3 Loss 2.4724651090905114                                                
Epoch 1 Loss 4.512456984747024                                                 
Epoch 2 Loss 2.5064167268469886       

In [42]:
best_trial = trials.best_trial
best_trial['result']

{'loss': 1.5802085726868873, 'status': 'ok'}
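
One detail to keep in mind when reading the fmin result: for parameters defined with hp.choice, the returned dictionary holds the index of the chosen option, not the option itself. hyperopt provides space_eval(space, best) for this conversion; the plain-Python sketch below shows the same mapping by hand. The best dictionary is a made-up example, not the output of this notebook.

```python
# Candidate lists copied from the search space above.
choices = {
    "n_layers": [1, 2, 3],
    "d_model": [128, 256, 512],
    "n_heads": [4, 8],
    "d_ff": [512, 1024, 2048],
}

def decode_choice_indices(best, choices):
    """Replace each hp.choice index with the value it points to; pass other entries through."""
    return {name: choices[name][value] if name in choices else value
            for name, value in best.items()}

# Hypothetical fmin result, for illustration only.
best = {"n_layers": 0, "d_model": 2, "n_heads": 1, "d_ff": 1, "dropout": 0.3}
decoded = decode_choice_indices(best, choices)
print(decoded)
```

Continuous parameters such as dropout, defined with hp.uniform, are returned as actual values and need no conversion.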


Input: 지루하다, 놀러가고 싶어.  
Predicted translation: 사랑 은 사랑 은 사랑 이 필요 해요 .  
  
Input: 오늘 일찍 일어났더니 피곤하다.  
Predicted translation: 마음 은 잘 되 었 나 봐요 .  
  
Input: 간만에 여자친구랑 데이트 하기로 했어.  
Predicted translation: 좋 은 선택 이 었 나 봐요 .  
  
Input: 집에 있는다는 소리야.  
Predicted translation: 좋 아 하 는 것 이 에요 .  
  
n_layers: 1  
d_model: 512  
n_heads: 8  
d_ff: 1024  
dropout: 0.3  
  
Warmup Steps: 4000  
Batch Size: 64  
Epoch At: 3