# Chatbot Algorithm

The model used in this chatbot is the Transformer, first introduced in the paper "Attention Is All You Need".

I copied the implementation of the model from the TensorFlow website and made some modifications.

The Transformer on the TensorFlow website was originally built for sentence-to-sentence translation. Here I use it to map a user's questions or lines of dialogue to Kurisu's answers.

Simply put, it aims to achieve question-to-answer translation.


## Libraries

In [1]:
import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import re

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import nltk

In [3]:
# %tensorflow_version 2.x
# TensorFlow 2.x is recommended.
print("tensorflow version: {}".format(tf.__version__))
print("tensorflow_datasets version: {}".format(tfds.__version__))

tensorflow version: 2.4.1
tensorflow_datasets version: 4.0.1


## 1. Load and Preprocess Training data

Caution!

<font size="4"><b>Training data (.csv or .txt) format</b></font>

<b>Correct Format</b>

user: question1

chatbot: answer1

user: question2

chatbot: answer2

<b>Wrong Format</b>

user: question1

user: question2

chatbot: answer1

chatbot: answer2

As shown above, each question must be followed by exactly one answer!

You can check the exact form in "../Data/dialogue_example.txt".
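
The alternating-speaker rule above can be checked before training. The following is a minimal sketch, assuming the speaker names have already been read into a list; `check_alternating` and the sample lists are hypothetical and not part of the original notebook.

```python
def check_alternating(speakers):
    """Return the indices where a speaker repeats the previous speaker.

    An empty result means every question is followed by exactly one answer.
    """
    return [i for i in range(1, len(speakers)) if speakers[i] == speakers[i - 1]]


good = ["user", "chatbot", "user", "chatbot"]   # correct format
bad = ["user", "user", "chatbot", "chatbot"]    # wrong format

print(check_alternating(good))  # []
print(check_alternating(bad))   # [1, 3]
```

Any index reported here points at a line that breaks the question/answer pairing used later in the notebook.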

In [4]:
# This cell is for users running on Google Colab.
# from google.colab import drive 
# drive.mount('/content/gdrive/')

In [5]:
# These paths should be modified!
# Make sure they point to where your data is and where your models should be saved.
# If you use Google Drive, "/content/gdrive/My Drive/" may be the root folder.

"""
training_data_path = '/content/gdrive/My Drive/Amadeus/Chatbot_core/Data/dialogue_example.txt'
emotion_classifier_checkpoint_path = "/content/gdrive/My Drive/Amadeus/Chatbot_core/Checkpoints/kurisu/emotion_classifier"
chatbot_checkpoint_path = "/content/gdrive/My Drive/Amadeus/Chatbot_core/Checkpoints/kurisu/chatbot"
"""

training_data_path = './Data/dialogue_example.txt'
emotion_classifier_checkpoint_path = "./Checkpoints/kurisu/emotion_classifier"
chatbot_checkpoint_path = "./Checkpoints/kurisu/chatbot"


In [6]:
# The tab-separated file should contain three columns: "name", "text", and "sentiment".
data = pd.read_csv(training_data_path, sep = '\t', engine='python')

dialogue = list(data['text'].astype("string"))
dialogue = list(map(lambda s: re.sub('[!?]', '', s.lower()), dialogue)) # remove !? from original data

sent = list(data['sentiment'].astype("int"))
sent_labels = list(map(lambda i: np.eye(3)[i], sent))

## Emotion Classifier

In [7]:
# Training data for the Emotion Classifier
# I just used dialogue as the training data, but you can modify sent_dialogue to improve the Emotion Classifier.
sent_dialogue = dialogue

# We need to know the maximum dialogue length to decide the padding size.
# Note: len(doc) counts characters, not tokens.
dial_len = [len(doc) for doc in sent_dialogue]
print(dial_len)
max_len = max(dial_len)
print(max_len)

[11, 28, 4, 48, 10, 26, 16, 29, 40, 42, 51, 22, 30, 20, 68, 36, 61, 4, 58, 31, 60, 65, 25, 9, 13, 57, 45, 30, 4, 67, 24, 73, 43, 117, 40, 40, 80, 87]
117


In [8]:
# tfds.features.text.SubwordTextEncoder is available as tfds.deprecated.text.SubwordTextEncoder in newer tensorflow_datasets releases such as the 4.0.1 used here.
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en for en in dialogue), target_vocab_size=2**13)

tokenizer_sent = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (sent for sent in sent_dialogue), target_vocab_size = 2**13)


"""
The documentation (about SubwordTextEncoder) says:

Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded
"""


In [9]:
# Preprocess and encode the given dialogues.
# Padding is added here so that all encodings have the same length.
# This function is used again later to encode the user's input!

def encode_sent(dialogue):
  result = []
  for sen in dialogue:
    re_sen = re.sub('[.,!?]', '', sen)
    temp = tokenizer_sent.encode(re_sen)

    # A sequence of exactly max_len tokens fits without any padding.
    if len(temp) <= max_len:
      temp = temp + [0] * (max_len - len(temp))
      result.append(temp)
    else:
      # Note: a skipped sentence leaves result shorter than the input list.
      print("Input length over the max length")
  return result
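
The padding step in `encode_sent` can be looked at in isolation. This is a small sketch, assuming ID 0 is reserved for padding as in the cell above; `pad_to_length` is a hypothetical helper and the token IDs are made up.

```python
def pad_to_length(token_ids, max_len, pad_id=0):
    """Right-pad a list of token IDs with pad_id up to max_len.

    Returns None when the sequence is longer than max_len, so the caller
    can decide whether to truncate or skip it.
    """
    if len(token_ids) > max_len:
        return None
    return token_ids + [pad_id] * (max_len - len(token_ids))


print(pad_to_length([5, 9, 2], 6))     # [5, 9, 2, 0, 0, 0]
print(pad_to_length([5, 9, 2], 3))     # [5, 9, 2]
print(pad_to_length([5, 9, 2, 7], 3))  # None
```

Returning `None` rather than silently dropping the sequence makes it easier to keep encoded inputs aligned with their labels.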

In [10]:
# Training data
enc_sent_dialogue = encode_sent(sent_dialogue)

BATCH_SIZE = 8
BUFFER_SIZE = 20000

# Instead of defining custom "make_batch" function, I used tf.data.Dataset to generate mini-batches.
sent_dataset = [(enc_sent_dialogue[i], sent_labels[i]) for i in range(len(sent_dialogue))]
sent_train_dataset = tf.data.Dataset.from_generator(lambda: sent_dataset, output_types= (tf.int64, tf.int64), output_shapes=([None], [None]))
sent_train_dataset = sent_train_dataset.cache()
sent_train_dataset = sent_train_dataset.shuffle(BUFFER_SIZE)
sent_train_dataset = sent_train_dataset.padded_batch(BATCH_SIZE)
sent_train_dataset = sent_train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

In [11]:
# This model is very simple, so I recommend modifying this part.
model = tf.keras.Sequential([
                             tf.keras.layers.Dense(128, activation='relu', input_shape=(max_len,)),
                             tf.keras.layers.Dropout(0.5),
                             tf.keras.layers.Dense(128, activation='relu'), # (batch_size, 128)
                             tf.keras.layers.Dropout(0.5),
                             tf.keras.layers.Dense(3, activation='softmax') # (batch_size, 3), one probability per sentiment class
])

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(sent_train_dataset, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1bdc0d53080>

In [15]:
model.save(emotion_classifier_checkpoint_path)

# You can check whether the model was saved correctly by uncommenting the code below.

"""
loaded_model = tf.keras.models.load_model(emotion_classifier_checkpoint_path)
idx = 27
test = dialogue[idx]
print(test)
test = "wait running away?"
enc_test = encode_sent([test])
answer = loaded_model.predict(enc_test)
print(np.argmax(answer, axis=1)[0] == np.argmax(sent_labels[idx]))
"""

INFO:tensorflow:Assets written to: ./Checkpoints/kurisu/emotion_classifier\assets



## Preparations for Transformer Model

In [16]:
# The model needs to know where each dialogue starts and ends.
# We therefore use tokenizer_en.vocab_size as the start token ID and tokenizer_en.vocab_size + 1 as the end token ID.
# vocab_size is the number of entries in the tokenizer's vocabulary (it depends on the training dataset),
# so these two IDs never collide with a real subword ID and mark the start and end of each encoded sequence.

def encode(dialogue):
  result = []
  for en in dialogue:
    result.append([tokenizer_en.vocab_size] + tokenizer_en.encode(en) + [tokenizer_en.vocab_size+1])
  return result
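
To make the start and end IDs concrete, here is a toy sketch that does not need a trained tokenizer. The vocabulary size and the token IDs are made-up values, and `add_boundary_tokens` is a hypothetical name.

```python
def add_boundary_tokens(token_ids, vocab_size):
    """Wrap token IDs with a start ID (vocab_size) and an end ID (vocab_size + 1)."""
    return [vocab_size] + list(token_ids) + [vocab_size + 1]


vocab_size = 100       # hypothetical tokenizer vocabulary size
encoded = [12, 47, 3]  # hypothetical output of tokenizer.encode(...)

wrapped = add_boundary_tokens(encoded, vocab_size)
print(wrapped)           # [100, 12, 47, 3, 101]

# Two extra IDs are in use, so an embedding table needs vocab_size + 2 rows.
print(max(wrapped) + 1)  # 102
```

This is why the notebook later sets `input_vocab_size` and `target_vocab_size` to `tokenizer_en.vocab_size + 2`.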

In [17]:
dialogue_encode = encode(dialogue)

input_dataset = []
target_dataset = []

# Step through the dialogue two lines at a time: even index = question, odd index = answer.
# Stopping at len - 1 avoids an IndexError when the last question has no answer.
for i in range(0, len(dialogue_encode) - 1, 2):
  # As mentioned at the beginning, each question is paired with exactly one answer.
  input_dataset.append(dialogue_encode[i])
  target_dataset.append(dialogue_encode[i+1])


In [18]:
# If the dataset has the correct format, the input and target datasets have the same length.
print(len(input_dataset))
print(len(target_dataset))

19
19
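
The question/answer pairing can also be written with slices. This is a minimal sketch on placeholder strings rather than encoded dialogue; `make_pairs` is a hypothetical helper.

```python
def make_pairs(lines):
    """Pair each even-indexed line (question) with the next line (answer).

    A trailing question without an answer is dropped.
    """
    return list(zip(lines[0::2], lines[1::2]))


lines = ["q1", "a1", "q2", "a2", "q3"]
pairs = make_pairs(lines)
print(pairs)  # [('q1', 'a1'), ('q2', 'a2')]
```

Because `zip` stops at the shorter slice, an unanswered final question cannot cause an out-of-range access.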


In [19]:
BATCH_SIZE = 4
BUFFER_SIZE = 20000

# Instead of defining custom "make_batch" function, I used tf.data.Dataset to generate mini-batches.
sum_dataset = [(input_dataset[i], target_dataset[i]) for i in range(len(input_dataset))]
train_dataset = tf.data.Dataset.from_generator(lambda: sum_dataset, output_types= (tf.int64, tf.int64), output_shapes=([None], [None]))
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

## Transformer Model

In [20]:
# Model


def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

# Positional encoding gives the model information about each token's position in the input dialogue.
# For example, take the input dialogue "your apple and my apple".
# "apple" appears twice. How can the model tell the two occurrences apart?
# We add a different position-dependent vector (built from sine and cosine) to the embedding at each position.
def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)
  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  
  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
  pos_encoding = angle_rads[np.newaxis, ...]
    
  return tf.cast(pos_encoding, dtype=tf.float32)

# The padding mask marks padded positions (token ID 0) so that attention ignores them.
# The look-ahead mask hides future target tokens, because the model must predict the next word without seeing it.
# Masked positions hold 1; scaled_dot_product_attention adds -1e9 to their logits, so softmax gives them a weight of almost zero.
def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
  
  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)

def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead) 
  but it must be broadcastable for addition.
  
  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.
    
  Returns:
    output, attention_weights
  """

  # What are q, k, v?
  # In self-attention, all three are projections of the same sequence (the same input dialogue).
  # In the decoder's second attention block, q comes from the decoder while k and v come from the encoder output.

  # Why three inputs?
  # Simply put, comparing every query with every key reveals useful relationships (such as relevance) among the words.

  # How do we get such relationships?
  # With the formula Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V.
  # softmax(QK^T / sqrt(dk)) gives the attention weights; multiplying by V gives the output.

  # The following code is just the implementation of the above formula.

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9) # push masked (padding or future) positions toward zero weight

  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model
    
    assert d_model % self.num_heads == 0
    
    self.depth = d_model // self.num_heads
    
    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)
    
    self.dense = tf.keras.layers.Dense(d_model)
        
  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth)) # The paper uses num_heads = 8, i.e. 8 attention heads. Since d_model (= 512) = num_heads (= 8) * depth, each head has depth 64.
    return tf.transpose(x, perm=[0, 2, 1, 3])
    
  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]
    
    q = self.wq(q)  # (batch_size, seq_len, d_model), e.g. (1, 60, 512)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)
    
    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth), e.g. (1, 8, 60, 64)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
    
    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    # Example: q (1, 8, 60, 64) times k transposed (1, 8, 64, 60) is a 60x64 by 64x60 product,
    # so attention_weights is (1, 8, 60, 60) and scaled_attention is (1, 8, 60, 64).
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)
  
    
    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention, 
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        
    return output, attention_weights

# This is the feed-forward part of the model. Its output has size d_model so that it can be fed to the next encoder layer.
def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])

# Note: the difference between Encoder and EncoderLayer is that an Encoder holds several EncoderLayers. Likewise a Decoder holds several DecoderLayers, which is why the classes are separate.
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    # two sublayers
    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    # Add & Norm after each sublayer
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    
  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
    
    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
    
    return out2

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)
 
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)
    
    
  def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)
    
    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)
    
    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)
    
    return out3, attn_weights_block1, attn_weights_block2

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    
    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model) # An embedding is simply a weight matrix learned during training by a dedicated layer (as in skip-gram).
    self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                            self.d_model)
    
    
    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]
  
    self.dropout = tf.keras.layers.Dropout(rate)
        
  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]
    
    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)
    
    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)
    
    return x  # (batch_size, input_seq_len, d_model)

class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    
    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
    
    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)
    
  def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}
    
    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]
    
    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)
      
      attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
      attention_weights['decoder_layer{}_block2'.format(i+1)] = block2
    
    # x.shape == (batch_size, target_seq_len, d_model)
    return x, attention_weights

class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff, 
                           input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff, 
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)
  
  def call(self, inp, tar, training, enc_padding_mask, 
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)
    
    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)
    
    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
    
    return final_output, attention_weights
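
The attention formula used in `scaled_dot_product_attention` can be tried on a tiny input. This is a sketch that swaps TensorFlow for plain NumPy; `softmax`, `attention`, and the random input are illustrative and not the notebook's own code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(d_k)) v, where mask == 1 marks blocked positions."""
    d_k = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        logits = logits + mask * -1e9  # blocked positions get ~zero weight
    weights = softmax(logits)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                  # 3 tokens, depth 4
look_ahead = 1.0 - np.tril(np.ones((3, 3)))  # 1 above the diagonal

out, weights = attention(x, x, x, look_ahead)
print(out.shape)  # (3, 4)
print(np.round(weights, 3))
```

With the look-ahead mask, the first token can only attend to itself, so its output row equals its own value vector.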

In [21]:
# Hyperparameters. The paper lists other configurations as well, which you can refer to.
# Columns: num_layers d_model dff num_heads dropout_rate
# 4 128 512 8 0.1
# 6 1024 4096 16 0.3
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

"""
num_layers = 6
d_model = 1024
dff = 4096
num_heads = 16
dropout_rate = 0.3
"""

input_vocab_size = tokenizer_en.vocab_size + 2
target_vocab_size = tokenizer_en.vocab_size + 2

In [22]:
# This also follows the formula in the paper to customize the learning rate schedule of the Adam optimizer.

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()
    
    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps
    
  def __call__(self, step):
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)
    
    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
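
The schedule computed by `CustomSchedule` is `d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)`. The following plain-Python sketch evaluates the same expression without TensorFlow; `transformer_lr` is a hypothetical name.

```python
def transformer_lr(step, d_model, warmup_steps=4000):
    """Learning rate at a given (1-based) optimizer step."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


d_model = 128
for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step, d_model))
```

The rate grows linearly until `warmup_steps` and decays afterwards. With the 19 pairs and `BATCH_SIZE = 4` used here, one epoch is 5 optimizer steps, so 100 epochs started from scratch amount to about 500 steps, which is still inside the default 4000-step warm-up.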

In [23]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, 
                                     epsilon=1e-9)

# Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.

loss_object = tf.keras.losses.SparseCategoricalCrossentropy( # For two or more classes given as integer labels; use CategoricalCrossentropy for one-hot labels. The tokenizer yields integer IDs, so the sparse version fits here.
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask # keep the loss only where real is not 0 (i.e. not padding)
  
  return tf.reduce_sum(loss_)/tf.reduce_sum(mask) # the sum of mask is the number of valid predictions, so dividing by it gives the mean

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')
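
The effect of the padding mask in `loss_function` is easier to see with numbers. This is a NumPy sketch with made-up per-token losses; `masked_mean` is a hypothetical helper that mirrors the masking and averaging steps only.

```python
import numpy as np


def masked_mean(per_token_loss, target_ids, pad_id=0):
    """Average per-token losses over non-padding positions only."""
    mask = (target_ids != pad_id).astype(per_token_loss.dtype)
    return (per_token_loss * mask).sum() / mask.sum()


target_ids = np.array([[7, 3, 0, 0],
                       [5, 0, 0, 0]])
per_token_loss = np.array([[1.0, 2.0, 9.0, 9.0],
                           [3.0, 9.0, 9.0, 9.0]])

print(masked_mean(per_token_loss, target_ids))  # 2.0
print(per_token_loss.mean())                    # 6.375
```

Without the mask, the losses at padded positions would dominate the average.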

In [24]:
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size, 
                          pe_input=input_vocab_size, 
                          pe_target=target_vocab_size,
                          rate=dropout_rate)
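
`pe_input` and `pe_target` set how many positions the positional-encoding table covers. Since the encoder and decoder slice the table with `[:, :seq_len, :]`, any value at least as large as the longest sequence works. The sketch below rebuilds the table in NumPy on a small scale; `positional_encoding_table` is a hypothetical name and the sizes are arbitrary.

```python
import numpy as np


def positional_encoding_table(num_positions, d_model):
    """Return a (num_positions, d_model) sinusoidal table."""
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    table = np.zeros_like(angles)
    table[:, 0::2] = np.sin(angles[:, 0::2])  # even columns
    table[:, 1::2] = np.cos(angles[:, 1::2])  # odd columns
    return table


table = positional_encoding_table(num_positions=50, d_model=8)
print(table.shape)  # (50, 8)
print(table[0])     # position 0: sin(0) = 0 on even columns, cos(0) = 1 on odd columns
```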

In [25]:
def create_masks(inp, tar): # inp = input, tar = target
  # Encoder padding mask
  enc_padding_mask = create_padding_mask(inp)
  
  # Used in the 2nd attention block in the decoder.
  # This padding mask is used to mask the encoder outputs.
  dec_padding_mask = create_padding_mask(inp)
  
  # Used in the 1st attention block in the decoder.
  # It is used to pad and mask future tokens in the input received by 
  # the decoder.
  look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
  dec_target_padding_mask = create_padding_mask(tar)
  combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
  
  return enc_padding_mask, combined_mask, dec_padding_mask
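
The combined mask is the element-wise maximum of the target padding mask and the look-ahead mask. Here is a NumPy sketch on one short target sequence; the function names mirror the notebook's helpers, but the implementations and the token IDs are illustrative.

```python
import numpy as np


def padding_mask(seq, pad_id=0):
    """1.0 where seq holds padding; shape (batch, 1, 1, seq_len)."""
    return (seq == pad_id).astype(np.float32)[:, np.newaxis, np.newaxis, :]


def look_ahead_mask(size):
    """1.0 above the diagonal, i.e. for positions in the future."""
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))


tar = np.array([[9, 4, 6, 0]])  # one target sequence with one padding slot
combined = np.maximum(padding_mask(tar), look_ahead_mask(tar.shape[1]))

print(combined.shape)  # (1, 1, 4, 4)
print(combined[0, 0])
```

Each row is one query position: it may attend to itself and earlier tokens, and the last column is blocked in every row because that slot is padding.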

In [26]:
ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, chatbot_checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print ('Latest checkpoint restored!!')

Latest checkpoint restored!!


In [27]:
EPOCHS = 100

# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:] # shifted tar_input (+ 1)
  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  
  with tf.GradientTape() as tape:
    predictions, _ = transformer(tf.cast(inp, dtype=tf.int64), tf.cast(tar_inp,dtype=tf.int64), 
                                 tf.cast(True, dtype=tf.bool), 
                                 enc_padding_mask, 
                                 combined_mask, 
                                 dec_padding_mask)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)    
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
  
  train_loss(loss)
  train_accuracy(tar_real, predictions)
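
`train_step` feeds the decoder the target without its last token and compares the predictions against the target without its first token. A plain-Python sketch of that shift, with made-up token IDs and a hypothetical `shift_target` helper:

```python
def shift_target(target_ids):
    """Split a target sequence into decoder input and expected output."""
    decoder_input = target_ids[:-1]   # drops the last token
    expected_output = target_ids[1:]  # drops the start token
    return decoder_input, expected_output


START, END = 100, 101  # hypothetical start/end IDs
target = [START, 12, 47, 3, END]

decoder_input, expected_output = shift_target(target)
print(decoder_input)    # [100, 12, 47, 3]
print(expected_output)  # [12, 47, 3, 101]
```

At every position the decoder sees the tokens up to that point and is asked to predict the one that follows.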

## Training part

In [28]:
for epoch in range(EPOCHS):
  start = time.time()
  
  train_loss.reset_states()
  train_accuracy.reset_states()
  
  for (batch, (inp, tar)) in enumerate(train_dataset):
    train_step(inp, tar)
    
    if batch % 50 == 0:
      print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
          epoch + 1, batch, train_loss.result(), train_accuracy.result()))
      
  if (epoch + 1) % 50 == 0:
    ckpt_save_path = ckpt_manager.save()
    print ('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))
  print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, 
                                                train_loss.result(), 
                                                train_accuracy.result()))

  print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 0.0180 Accuracy 0.7222
Epoch 1 Loss 0.0197 Accuracy 0.6936
Time taken for 1 epoch: 12.0969877243042 secs

Epoch 2 Batch 0 Loss 0.0077 Accuracy 0.5312
Epoch 2 Loss 0.0221 Accuracy 0.6319
Time taken for 1 epoch: 0.16054463386535645 secs

Epoch 3 Batch 0 Loss 0.0202 Accuracy 0.6471
Epoch 3 Loss 0.0367 Accuracy 0.6754
Time taken for 1 epoch: 0.14261889457702637 secs

Epoch 4 Batch 0 Loss 0.0200 Accuracy 0.5000
Epoch 4 Loss 0.0206 Accuracy 0.6913
Time taken for 1 epoch: 0.15059685707092285 secs

Epoch 5 Batch 0 Loss 0.0021 Accuracy 0.6731
Epoch 5 Loss 0.0573 Accuracy 0.6538
Time taken for 1 epoch: 0.14960026741027832 secs

Epoch 6 Batch 0 Loss 0.0088 Accuracy 0.6250
Epoch 6 Loss 0.0321 Accuracy 0.6220
Time taken for 1 epoch: 0.1326448917388916 secs

Epoch 7 Batch 0 Loss 0.2819 Accuracy 0.6731
Epoch 7 Loss 0.0747 Accuracy 0.6212
Time taken for 1 epoch: 0.13364481925964355 secs

Epoch 8 Batch 0 Loss 0.0243 Accuracy 0.8971
Epoch 8 Loss 0.0171 Accuracy 0.6877
Time taken for

Epoch 64 Loss 0.0246 Accuracy 0.6308
Time taken for 1 epoch: 0.14162111282348633 secs

Epoch 65 Batch 0 Loss 0.0104 Accuracy 0.6196
Epoch 65 Loss 0.0235 Accuracy 0.6926
Time taken for 1 epoch: 0.13663387298583984 secs

Epoch 66 Batch 0 Loss 0.0183 Accuracy 0.6522
Epoch 66 Loss 0.0151 Accuracy 0.6280
Time taken for 1 epoch: 0.13664555549621582 secs

Epoch 67 Batch 0 Loss 0.0210 Accuracy 0.5588
Epoch 67 Loss 0.0214 Accuracy 0.6175
Time taken for 1 epoch: 0.14760589599609375 secs

Epoch 68 Batch 0 Loss 0.0033 Accuracy 0.7031
Epoch 68 Loss 0.0131 Accuracy 0.6235
Time taken for 1 epoch: 0.1436159610748291 secs

Epoch 69 Batch 0 Loss 0.0217 Accuracy 0.7115
Epoch 69 Loss 0.0316 Accuracy 0.6280
Time taken for 1 epoch: 0.13663387298583984 secs

Epoch 70 Batch 0 Loss 0.0315 Accuracy 0.5441
Epoch 70 Loss 0.0118 Accuracy 0.6113
Time taken for 1 epoch: 0.13065147399902344 secs

Epoch 71 Batch 0 Loss 0.0143 Accuracy 0.7031
Epoch 71 Loss 0.0084 Accuracy 0.7017
Time taken for 1 epoch: 0.12865519523620

## Inference Part

In [29]:
def evaluate(inp_sentence):
  start_token = [tokenizer_en.vocab_size]
  end_token = [tokenizer_en.vocab_size + 1]
  
  # Wrap the input sentence (the user's question) with the start and end tokens.
  inp_sentence = start_token + tokenizer_en.encode(inp_sentence) + end_token
  encoder_input = tf.expand_dims(inp_sentence, 0)
  
  # The decoder generates the answer token by token, so its first input
  # is the start token.
  decoder_input = [tokenizer_en.vocab_size]
  output = tf.expand_dims(decoder_input, 0)

  # The right maximum answer length is unclear; 200 tokens is used as a cap.
  for i in range(200):
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)
  
    # predictions.shape == (batch_size, seq_len, vocab_size)
    predictions, attention_weights = transformer(encoder_input, 
                                                 output,
                                                 False,
                                                 enc_padding_mask,
                                                 combined_mask,
                                                 dec_padding_mask)
    # select the last word from the seq_len dimension
    predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)
    
    # return the result if the predicted_id is equal to the end token
    if predicted_id == tokenizer_en.vocab_size+1:
      return tf.squeeze(output, axis=0), attention_weights
    
    # concatenate the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0), attention_weights
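
`evaluate` is a greedy decoding loop: it keeps the most likely next token and stops at the end token or after a fixed number of steps. The sketch below shows only that control flow, with a toy stand-in for the transformer; `greedy_decode`, `toy_model`, and all IDs are hypothetical.

```python
def greedy_decode(next_token, start_id, end_id, max_steps=200):
    """Repeatedly append the predicted next token until end_id or max_steps."""
    output = [start_id]
    for _ in range(max_steps):
        token = next_token(output)
        if token == end_id:
            break
        output.append(token)
    return output


START, END = 100, 101       # hypothetical start/end IDs
canned_reply = [12, 47, 3]  # hypothetical token IDs of a stored answer


def toy_model(prefix):
    """Stand-in for the transformer: replays canned_reply, then emits END."""
    position = len(prefix) - 1
    return canned_reply[position] if position < len(canned_reply) else END


print(greedy_decode(toy_model, START, END))  # [100, 12, 47, 3]
```

As in the notebook, the returned sequence still starts with the start ID, which the caller filters out before decoding to text.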

In [30]:
def translate(sentence):
  result, attention_weights = evaluate(re.sub('[!?]', '', sentence.lower()))
  
  predicted_sentence = tokenizer_en.decode([i for i in  result 
                                            if i < tokenizer_en.vocab_size]) 
  predicted_sentence = predicted_sentence.split('.')
  predicted_sentence = [w.strip() for w in predicted_sentence]
  predicted_sentence = [w.capitalize() for w in predicted_sentence]
  predicted_sentence = '. '.join(predicted_sentence)

  print('Input: {}'.format(re.sub('[!?]', '', sentence.lower())))
  print('Predicted translation: {}'.format(predicted_sentence))
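
The post-processing in `translate` splits on periods, capitalizes each chunk, and joins the chunks again. This standalone sketch mirrors those steps; `prettify` is a hypothetical name.

```python
def prettify(text):
    """Capitalize each period-separated chunk, mirroring the steps in translate()."""
    chunks = [chunk.strip().capitalize() for chunk in text.split('.')]
    return '. '.join(chunks)


print(repr(prettify("about fifteen minutes ago.")))     # 'About fifteen minutes ago. '
print(repr(prettify("who are you talking to")))         # 'Who are you talking to'
print(repr(prettify("what i just wanted ask you...")))  # 'What i just wanted ask you. . . '
```

This explains the trailing space after a final period and the spaced-out ellipsis in the predicted answers.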

In [31]:
print("Okabe's line used in training: " + dialogue[0])
print("Kurisu's line used in training: " + dialogue[1])
translate("are you?")

Okabe's line used in training: who are you
Kurisu's line used in training: that's what i'd like to know
Input: are you
Predicted translation: That's what i'd like to know


In [29]:
print("Okabe's line used in training: " + dialogue[2])
print("Kurisu's line used in training: " + dialogue[3])
translate("What?")

Okabe's line used in training: what
Kurisu's line used in training: back there, you were going to tell me something.
Input: what
Predicted translation: Back there, you were going to tell me something. 


In [33]:
print("Okabe's line used in training: " + dialogue[4])
print("Kurisu's line used in training: " + dialogue[5])
translate("Back there?")

Okabe's line used in training: back there
Kurisu's line used in training: about fifteen minutes ago.
Input: back there
Predicted translation: About fifteen minutes ago. 


In [34]:
print("Okabe's line used in training: " + dialogue[6])
print("Kurisu's line used in training: " + dialogue[7])
translate("kurisu?")

Okabe's line used in training: makise... kurisu
Kurisu's line used in training: i'm surprised you know of it.
Input: kurisu
Predicted translation: Well then, let's change the format of this lecture to a discussion. 


In [None]:
print("Okabe's line used in training: " + dialogue[8])
print("Kurisu's line used in training: " + dialogue[9])
translate("You! an agent of Organization?")

Okabe's line used in training: you are you an agent of the organization
Kurisu's line used in training: organization what i just wanted ask you...
Input: you an agent of organization
Predicted translation: Organization what i just wanted ask you. . . 


In [None]:
print("Okabe's line used in training: " + dialogue[10])
print("Kurisu's line used in training: " + dialogue[11])
translate("I've been caught by an agent.")

Okabe's line used in training: it's me. i've been caught by an organization agent.
Kurisu's line used in training: who are you talking to
Input: i've been caught by an agent.
Predicted translation: Who are you talking to


In [35]:
print("Okabe's line used in training: " + dialogue[12])
print("Kurisu's line used in training: " + dialogue[13])
translate("Yeah, no trouble")

Okabe's line used in training: yeah, it should be no trouble.
Kurisu's line used in training: huh it's turned off.
Input: yeah, no trouble
Predicted translation: Huh it's turned off. 


In [36]:
print("Okabe's line used in training: " + dialogue[14])
print("Kurisu's line used in training: " + dialogue[15])
translate("If anyone except me touches it, it would be off")

Okabe's line used in training: i'll tell you a secret. if anyone but me touches it, it deactivates.
Kurisu's line used in training: i see. you were talking to yourself.
Input: if anyone except me touches it, it would be off
Predicted translation: Who are you talking to


In [37]:
print("Okabe's line used in training: " + dialogue[16])
print("Kurisu's line used in training: " + dialogue[17])
translate("genius girl, next time we will be enemies. Farewell.")

Okabe's line used in training: genius girl, when next we meet, we will be enemies. farewell.
Kurisu's line used in training: wait
Input: genius girl, next time we will be enemies. farewell.
Predicted translation: Wait


In [38]:
print("Okabe's line used in training: " + dialogue[18])
print("Kurisu's line used in training: " + dialogue[19])
translate("how i can feel her? she is here.")

Okabe's line used in training: how i can feel her. she's really here. you're not a spirit
Kurisu's line used in training: do you want me to call the cops
Input: how i can feel her she is here.
Predicted translation: Do you want me to call the cops


In [39]:
print("Okabe's line used in training: " + dialogue[20])
print("Kurisu's line used in training: " + dialogue[21])
translate("i want to know truth you were stabbed...")

Okabe's line used in training: i just want to know the truth you were stabbed right here...
Kurisu's line used in training: what do you mean by "the truth" are you stupid do you want to die
Input: i want to know truth you were stabbed...
Predicted translation: What do you mean by "the truth" are you stupid do you want to die


In [40]:
print("Okabe's line used in training: " + dialogue[22])
print("Kurisu's line used in training: " + dialogue[23])
translate("are you running away?")

Okabe's line used in training: wait are you running away
Kurisu's line used in training: you're...
Input: are you running away
Predicted translation: You're. . . 


In [None]:
print("Okabe's line used in training: " + dialogue[24])
print("Kurisu's line used in training: " + dialogue[25])
translate("time machines")

Okabe's line used in training: time machines
Kurisu's line used in training: i think time machines are nothing more than a pipe dream.
Input: time machines
Predicted translation: I think time machines are nothing more than a pipe dream. 


In [None]:
print("Okabe's line used in training: " + dialogue[32])
print("Kurisu's line used in training: " + dialogue[33])
translate("do you think one could build time machine?")

Okabe's line used in training: do you think one could build a time machine
Kurisu's line used in training: my personal belief is that time machines are not possible. however, that is not to say they are impossible altogether
Input: do you think one could build time machine
Predicted translation: My personal belief is that time machines are not possible. However, that is not to say they are impossible altogether


In [None]:
print("Okabe's line used in training: " + dialogue[36])
print("Kurisu's line used in training: " + dialogue[37])
translate("a number of time travel series exist but they are nothing but speculation")

Okabe's line used in training: there are a number of time travel theories but they are nothing but speculation.
Kurisu's line used in training: even so, we could be missing that one critical discovery that unlocks new understanding
Input: a number of time travel series exist but they are nothing but speculation
Predicted translation: Even so, we could be missing that one critical discovery that unlocks new understanding
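
The cells above compare each prediction with the training answer by eye. A rough way to summarize such checks is an exact-match rate that ignores case and whitespace, sketched below. The helpers `normalize` and `exact_match_rate` are hypothetical, and the four pairs are copied from outputs shown above (two that match, two that do not); this is an illustration on a handful of examples, not a measurement of the model.

```python
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and drop all whitespace so formatting differences are ignored."""
    return ''.join(text.lower().split())


def exact_match_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (expected, predicted) pairs that are equal after normalize()."""
    matches = sum(normalize(expected) == normalize(predicted)
                  for expected, predicted in pairs)
    return matches / len(pairs)


# (training answer, predicted answer) pairs taken from the cells above.
pairs = [
    ("that's what i'd like to know", "That's what i'd like to know"),
    ("you're...", "You're. . . "),
    ("i'm surprised you know of it.",
     "Well then, let's change the format of this lecture to a discussion. "),
    ("i see. you were talking to yourself.", "Who are you talking to"),
]

rate = exact_match_rate(pairs)
print("exact-match rate: {:.2f}".format(rate))
```

Because the test inputs are paraphrases of training questions, a high rate here mainly indicates memorization of the training dialogue rather than generalization.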
