---
# Seq2Seq

- https://medium.com/@dev.elect.iitd/neural-machine-translation-using-word-level-seq2seq-model-47538cba8cd7
- https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
- https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
- https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
- https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
- https://www.tensorflow.org/tutorials/text/nmt_with_attention

---



- **`encoder`: it processes the input sequence and returns its own internal state.** 

    - The outputs of the encoder RNN are discarded; only the state is kept.
    - This state serves as the "context", or "conditioning", for the decoder.
    
    
- **`decoder`: it is trained to predict the next characters of the target sequence, given previous characters of the target sequence.** 
    - it is trained to **turn the target sequences into the same sequences** 
    - but offset by one timestep in the future, 
    - a training process called "teacher forcing" in this context. 
    
    - `Importantly, the decoder uses as initial state the state vectors from the encoder, which is how the decoder obtains information about what it is supposed to generate.` 
    - Effectively, the decoder learns to generate targets[t+1...] given targets[...t], conditioned on the input sequence.
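
The one-step offset behind teacher forcing can be sketched in plain Python. The helper name `teacher_forcing_pair` and the token ids are illustrative assumptions, not part of the notebook; the `train_step` function later in the notebook applies the same slicing to batched tensors.

```python
def teacher_forcing_pair(target_ids):
    """Split one target sequence into (decoder input, expected output).

    The decoder sees targets[..t] and is trained to predict targets[t+1..],
    so both sequences hold the same tokens offset by one timestep.
    """
    decoder_input = target_ids[:-1]   # drops the final token
    decoder_target = target_ids[1:]   # drops the leading <start> token
    return decoder_input, decoder_target


# Hypothetical ids: 3 = <start>, 5 = <end>, the rest are ordinary words.
decoder_input, decoder_target = teacher_forcing_pair([3, 17, 42, 8, 5])
print(decoder_input)   # [3, 17, 42, 8]
print(decoder_target)  # [17, 42, 8, 5]
```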

In [1]:
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

import pickle
import re
import os
import io
import time
import math

# Load Data

In [2]:
# load
with open('desc_story_data_df.pkl', 'rb') as f:
    df_story_and_desc = pickle.load(f)

print ("before: ", df_story_and_desc.shape)
# df_story_and_desc = df_story_and_desc[:150000]

desc_story_text = df_story_and_desc[['desc', 'story']]
desc_story_text.head(15)

before:  (602325, 4)


Unnamed: 0,desc,story
0,Large tree with many outstretching branches an...,Our landmark tree in town was about to be dest...
1,A green sign is describing a historic tree and...,So we decided to take the day to go out and en...
2,A large tree with roots that look like crocodi...,"To see the final glimpse of the roots, extendi..."
3,Big old tree being photographed on a sunny day,"And its magnificent trunk, larger than life it..."
4,Huge brown tree roots rose above the ground.,One last picture of its beauty so we could cap...
5,Large tree with many outstretching branches an...,We found this tree when we were walking in a n...
6,A green sign is describing a historic tree and...,It turns out it is a popular attraction here.
7,A large tree with roots that look like crocodi...,"The tree is very unusual, with its roots exposed."
8,Big old tree being photographed on a sunny day,"The trunk was really wide, as much as 12 feet!"
9,Huge brown tree roots rose above the ground.,You can see how big these roots are - pretty a...


==> Each block of 15 rows refers to the same photo set

In [3]:
check_sent = "Large tree with many outstretching branches and leaves."
for i, row in desc_story_text[desc_story_text['desc'] == check_sent].iterrows():
    print("pair " + str(i) + ":" + row['desc'] + " ==> " + row['story'])

pair 0:Large tree with many outstretching branches and leaves. ==> Our landmark tree in town was about to be destroyed and cleared for a new mall. 
pair 5:Large tree with many outstretching branches and leaves. ==> We found this tree when we were walking in a nearby town. 
pair 10:Large tree with many outstretching branches and leaves. ==> Pictures of a tree are taken.
pair 15:Large tree with many outstretching branches and leaves. ==> They went to the botanic gardens specifically to see the large tree.
pair 20:Large tree with many outstretching branches and leaves. ==> We went to see the largest tree in the country. 


==> Each description text is paired with 5 stories

## cleaning

In [4]:
def drop_duplicate(df):
    drop_df = df.drop_duplicates("desc")
    drop_df = drop_df.drop_duplicates("story")
    drop_df = drop_df.reset_index(drop=True)
    return drop_df

In [5]:
desc_story_text = drop_duplicate(desc_story_text)
desc_story_text.shape

(39771, 2)

# Prepare Data

## preprocess text

- Add a start and end token to each sentence.
- Clean the sentences by removing special characters.

In [6]:
def preprocess_sentence(w):
    # Replace every run of non-letter characters (digits and punctuation included)
    # with a single space.
    w = re.sub('[^a-zA-Z]+', ' ', w)
    w = w.strip()
    
    # adding a start and an end token to the sentence
    w = '<start> ' + w + ' <end>'
    return w

In [7]:
clean_data = desc_story_text.copy()
clean_data['desc'] = desc_story_text['desc'].apply(lambda x: preprocess_sentence(x))
clean_data['story'] = desc_story_text['story'].apply(lambda x: preprocess_sentence(x))
clean_data.columns = ['in_desc','out_story']
clean_data

Unnamed: 0,in_desc,out_story
0,<start> Large tree with many outstretching bra...,<start> Our landmark tree in town was about to...
1,<start> A green sign is describing a historic ...,<start> So we decided to take the day to go ou...
2,<start> A large tree with roots that look like...,<start> To see the final glimpse of the roots ...
3,<start> Big old tree being photographed on a s...,<start> And its magnificent trunk larger than ...
4,<start> Huge brown tree roots rose above the g...,<start> One last picture of its beauty so we c...
...,...,...
39766,<start> AN ADVERTISEMENT ON GLASS FOR A BREWIN...,<start> A group of friends visited a brewery <...
39767,<start> A man holds the woman s hand under the...,<start> They were very excited <end>
39768,<start> A young man and older woman sitting do...,<start> They sampled many different beers <end>
39769,<start> An elderly couple dances outside of a ...,<start> After becoming a little buzzed they ev...


In [8]:
print(clean_data["in_desc"][0])
print(clean_data["out_story"][0])

<start> Large tree with many outstretching branches and leaves <end>
<start> Our landmark tree in town was about to be destroyed and cleared for a new mall <end>
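
To make the effect of the cleaning step concrete, here is a small stand-alone sketch. The helper name `clean` is hypothetical; it assumes the same non-letter substitution as `preprocess_sentence` and is applied to one of the story sentences shown earlier. Note that digits and apostrophes are removed along with punctuation.

```python
import re


def clean(text):
    # Replace every run of non-letter characters with a single space,
    # then wrap the sentence in start and end tokens.
    text = re.sub('[^a-zA-Z]+', ' ', text).strip()
    return '<start> ' + text + ' <end>'


cleaned = clean("The trunk was really wide, as much as 12 feet!")
print(cleaned)
# <start> The trunk was really wide as much as feet <end>
```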


## Indexing and Padding

- Create a word index and reverse word index (dictionaries mapping from word → id and id → word).

- Pad each sentence to a maximum length.
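
The two steps above can be sketched without Keras. This is a simplified illustration under stated assumptions: `build_word_index` and `pad` are hypothetical helpers, and ids are assigned in order of first appearance, whereas the Keras `Tokenizer` used below orders them by word frequency. Id 0 is reserved for padding in both cases.

```python
def build_word_index(sentences):
    """Map each word to an integer id starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    index_word = {idx: word for word, idx in word_index.items()}
    return word_index, index_word


def pad(sequences, padding='post'):
    """Zero-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        zeros = [0] * (max_len - len(s))
        padded.append(zeros + s if padding == 'pre' else s + zeros)
    return padded


sentences = ['<start> big old tree <end>', '<start> tree <end>']
word_index, index_word = build_word_index(sentences)
ids = [[word_index[w] for w in s.split()] for s in sentences]
print(pad(ids, 'pre'))   # [[1, 2, 3, 4, 5], [0, 0, 1, 4, 5]]
print(pad(ids, 'post'))  # [[1, 2, 3, 4, 5], [1, 4, 5, 0, 0]]
```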

In [9]:
def tokenize(input_desc, output_story, flag):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
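    # Concatenate as lists: `+` on two pandas Series joins the strings row by row,
    # which would fuse each "<end>" with the following "<start>" into one token.
    tokenizer.fit_on_texts(list(input_desc) + list(output_story))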
        
    if flag == "input":
        # pad_sequences pads every sequence to the length of the longest sentence
        input_seq = tokenizer.texts_to_sequences(input_desc)
        padding_flag = 'pre' # inputs are zero-padded at the front
        input_pad_seq = tf.keras.preprocessing.sequence.pad_sequences(input_seq, padding=padding_flag)
        
        return input_pad_seq, tokenizer
    else:
        output_seq = tokenizer.texts_to_sequences(output_story)
        padding_flag = 'post' # outputs are zero-padded at the end
        output_pad_seq = tf.keras.preprocessing.sequence.pad_sequences(output_seq, padding=padding_flag)
    
        return output_pad_seq, tokenizer

In [10]:
def get_token_data(data):
    # creating cleaned input, output pairs
    input_desc, output_story = data['in_desc'], data['out_story']
    
    tokenize_input_desc, tokenizer = tokenize(input_desc, output_story, "input")
    tokenize_output_story, tokenizer= tokenize(input_desc, output_story, "output")
    
    input_desc_train, input_desc_test, output_story_train, output_story_test = train_test_split(tokenize_input_desc, tokenize_output_story, test_size=0.2)
    
    print("input_desc_train: ", input_desc_train.shape)
    print("input_desc_test: ", input_desc_test.shape)
    print("output_story_train: ", output_story_train.shape)
    print("output_story_test: ", output_story_test.shape)
    
    return input_desc_train, input_desc_test, output_story_train, output_story_test, tokenizer

In [11]:
# Creating training and validation sets using an 80-20 split

input_desc_train, input_desc_test, output_story_train, output_story_test, tokenizer = get_token_data(clean_data)

input_desc_train:  (31816, 72)
input_desc_test:  (7955, 72)
output_story_train:  (31816, 77)
output_story_test:  (7955, 77)


In [12]:
input_desc_train

array([[   0,    0,    0, ...,  181, 8694,    5],
       [   0,    0,    0, ...,  190,  153,    5],
       [   0,    0,    0, ...,    2,  450,    5],
       ...,
       [   0,    0,    0, ...,  837,  497,    5],
       [   0,    0,    0, ...,    2,  422,    5],
       [   0,    0,    0, ...,   72, 4761,    5]], dtype=int32)

In [13]:
output_story_train

array([[   3,    1, 6412, ...,    0,    0,    0],
       [   3,   38,  451, ...,    0,    0,    0],
       [   3,  484,   17, ...,    0,    0,    0],
       ...,
       [   3,   97,  106, ...,    0,    0,    0],
       [   3, 3458,   29, ...,    0,    0,    0],
       [   3,   23,   31, ...,    0,    0,    0]], dtype=int32)

In [14]:
def get_divisor(num):
    divisors = []
    length = int(math.sqrt(num)) + 1
    for i in range(1, length):
        if num % i == 0:
            divisors.append(i)
            if i != num // i:  # avoid listing the square root twice
                divisors.append(num // i)  # floor division keeps the integer quotient

    divisors.sort()

    return divisors

get_divisor(len(input_desc_train))

[1, 2, 4, 8, 41, 82, 97, 164, 194, 328, 388, 776, 3977, 7954, 15908, 31816]
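
The divisor list appears to be used to pick a batch size that splits the training set into equally sized batches; the next cell uses 164. A quick check, using the training-set size reported above (the comparison value 128 is only an illustrative assumption):

```python
num_train = 31816   # number of training pairs reported above
batch_size = 164    # one of the divisors listed above

steps_per_epoch, leftover = divmod(num_train, batch_size)
print(steps_per_epoch, leftover)   # 194 0

# A batch size that is not a divisor leaves a smaller final batch.
print(divmod(num_train, 128))      # (248, 72)
```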

In [28]:
# Prepare training data

# prefetch(n) buffers up to n batches ahead of the training loop
train_ds = (
    tf.data.Dataset.from_tensor_slices((input_desc_train, output_story_train))
    .shuffle(len(input_desc_train))
    .batch(164)
    .prefetch(164)
)
# batch size 1 so that predictions can be printed one at a time
test_ds = tf.data.Dataset.from_tensor_slices((input_desc_test, output_story_test)).batch(1)

# Write the encoder and decoder model

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500">

- FC = Fully connected (dense) layer
- EO = Encoder output
- H = hidden state
- X = input to the decoder

---

- score = FC(tanh(FC(EO) + FC(H)))
- attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by default, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, hidden_size). Max_length is the length of our input. Since we are trying to assign a weight to each input, softmax should be applied on that axis.
- context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis as 1.
- embedding output = The input to the decoder X is passed through an embedding layer.
- merged vector = concat(embedding output, context vector)
- This merged vector is then given to the GRU
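
The score, attention-weight and context-vector steps listed above can be sketched for a single example (no batch axis) in plain Python. Everything here is an illustrative assumption: the names `W1`, `W2` and `v` stand in for the three FC layers, the weights are random, and the sizes are arbitrary. Note that the notebook code below uses `tf.keras.layers.Attention` (dot-product attention) with an LSTM rather than this additive form with a GRU.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(enc_outputs, hidden, W1, W2, v):
    """enc_outputs: list of T encoder vectors (EO); hidden: decoder state (H)."""
    proj_h = matvec(W2, hidden)                                # FC(H)
    scores = []
    for eo in enc_outputs:
        proj_eo = matvec(W1, eo)                               # FC(EO)
        act = [math.tanh(a + b) for a, b in zip(proj_eo, proj_h)]
        scores.append(sum(vi * ai for vi, ai in zip(v, act)))  # final FC -> scalar
    weights = softmax(scores)          # normalised over the time (input) axis
    dim = len(enc_outputs[0])
    context = [sum(w * eo[d] for w, eo in zip(weights, enc_outputs))
               for d in range(dim)]    # weighted sum of encoder outputs
    return weights, context


random.seed(0)
units, dim, steps = 4, 2, 3
W1 = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(units)]
W2 = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(units)]
v = [random.uniform(-1, 1) for _ in range(units)]
enc_outputs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
hidden = [random.uniform(-1, 1) for _ in range(dim)]

weights, context = additive_attention(enc_outputs, hidden, W1, W2, v)
print(len(weights), round(sum(weights), 6), len(context))  # 3 1.0 2
```

One weight is produced per input position, which is why the softmax runs over the time axis rather than the feature axis.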

In [16]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Encoder, self).__init__()
        self.emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        # return_state controls whether the layer also returns its final state,
        # i.e. the last hidden state and cell state, in addition to the output.
        # It defaults to False, so it has to be set explicitly.
        # return_sequences=True is needed because the attention mechanism uses the
        # encoder's hidden states as key and value, so every timestep must be returned.

        self.lstm = tf.keras.layers.LSTM(512, 
                                         return_sequences=True, 
                                         return_state=True, 
                                         recurrent_initializer='glorot_uniform')

    def call(self, x, training=False, mask=None):
        x = self.emb(x)
        H, h, c = self.lstm(x)
        
        return H, h, c
    
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Decoder, self).__init__()
        self.emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences decides whether the layer returns the full output sequence
        # or only the last output (False: last timestep only, True: every timestep).
        self.lstm = tf.keras.layers.LSTM(512, 
                                         return_sequences=True, 
                                         return_state=True,
                                         recurrent_initializer='glorot_uniform')
        # The attention mechanism passes the LSTM output together with the attention value to the dense layer
        self.att = tf.keras.layers.Attention()
        self.dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def call(self, inputs, training=False, mask=None):
        # x: shifted output, s0: initial hidden state of the decoder,
        # c0: initial cell state of the decoder, H: encoder hidden states (used as key and value)
        x, s0, c0, H = inputs
        x = self.emb(x)

        # initial_state is the list of state tensors passed to the first call of the cell;
        # here it is the hidden state and cell state produced by the encoder.
        # S collects the decoder hidden states of every timestep (used as the query).
        S, h, c = self.lstm(x, initial_state=[s0, c0])

        # The query for each step has to be the hidden state from one step earlier,
        # so s0 goes first. s0 has no time axis (it is 2-D: batch x units), so
        # tf.newaxis expands it to 3-D before the concatenation.
        # The last decoder hidden state in S is dropped because nothing follows it to predict.
        S_ = tf.concat([s0[:, tf.newaxis, :], S[:, :-1, :]], axis=1)

        # Apply attention.
        # The list is normally [query, value, key]; when only two tensors are given,
        # the second one is used as both key and value.
        A = self.att([S_, H])

        y = tf.concat([S, A], axis=-1)
        
        return self.dense(y), h, c
    
class Seq2seq(tf.keras.Model):
    def __init__(self, sos, eos, vocab_size, embedding_dim):
        super(Seq2seq, self).__init__()
        self.enc = Encoder(vocab_size, embedding_dim)
        self.dec = Decoder(vocab_size, embedding_dim)
        self.sos = sos
        self.eos = eos

    def call(self, inputs, training=False, mask=None):
        if training is True:
            # Training needs both the input and the target sequence.
            # The target is required because the decoder is fed the shifted output.
            x, y = inputs

            # The encoder is an LSTM, so it returns a hidden state and a cell state.
            H, h, c = self.enc(x)

            # The decoder receives the shifted output plus the encoder's hidden and cell state
            # as its initial state; the returned y covers the whole output sentence.
            y, _, _ = self.dec((y, h, c, H))
            
            return y

        else:
            x = inputs
            H, h, c = self.enc(x)

            # Feed <start> to the decoder first: convert it to a tensor
            y = tf.convert_to_tensor(self.sos)
            # and reshape it to (batch, time) = (1, 1).
            y = tf.reshape(y, (1, 1))

            # Collect at most 64 output tokens.
            seq = tf.TensorArray(tf.int32, 64)

            # When call() is traced into a graph by AutoGraph, a loop over tf.range
            # is converted into a TensorFlow loop, which runs efficiently.
            for idx in tf.range(64):
                y, h, c = self.dec([y, h, c, H])
                # At test time each prediction is fed back as the input of the next step.
                # y is the softmax output, so take the index of the highest probability,
                # cast it to tf.int32, and write it to the TensorArray at position idx.
                y = tf.cast(tf.argmax(y, axis=-1), dtype=tf.int32)
                # Reshape to (1, 1) so the value keeps the (batch, time) layout the decoder expects.
                y = tf.reshape(y, (1, 1))
                seq = seq.write(idx, y)

                if y == self.eos:
                    break
            
            # stack() gathers the values written to the TensorArray into one tensor.
            return tf.reshape(seq.stack(), (1, 64))
        
# Implement training loop
@tf.function
def train_step(model, inputs, labels, loss_object, optimizer, train_loss, train_accuracy):
    # output_labels are compared against the predictions
    # shifted_labels are fed to the decoder as input
    output_labels = labels[:, 1:]
    shifted_labels = labels[:, :-1]
    with tf.GradientTape() as tape:
        predictions = model([inputs, shifted_labels], training=True)
        loss = loss_object(output_labels, predictions)
    
    gradients = tape.gradient(loss, model.trainable_variables)

    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(output_labels, predictions)
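
The inference branch of `Seq2seq.call` is a greedy decoding loop: it starts from the start token, takes the most probable token at each step, feeds it back in, and stops at the end token or after 64 steps. A framework-free sketch of that loop follows; `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is only a stand-in for the decoder.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=64):
    """Generate token ids one at a time, feeding each prediction back in."""
    token, state, output = sos_id, None, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        output.append(token)
        if token == eos_id:
            break
    return output


def toy_step(token, state):
    # Hypothetical stand-in for the decoder: always predicts (token + 1) % 4.
    probs = [0.0] * 4
    probs[(token + 1) % 4] = 1.0
    return probs, state


decoded = greedy_decode(toy_step, sos_id=0, eos_id=3)
print(decoded)  # [1, 2, 3]
```

Greedy decoding keeps only the single best token per step; it is simple but can miss sequences that a wider search would find.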

In [17]:
model = Seq2seq(sos = tokenizer.word_index['<start>'],
                eos = tokenizer.word_index['<end>'],
                vocab_size = len(tokenizer.word_index) + 1,
                embedding_dim = 300)

loss_func = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

EPOCHS = 50

# Define the performance metrics
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

for epoch in range(EPOCHS):
    for seqs, labels in train_ds:
        train_step(model, seqs, labels, loss_func, optimizer, train_loss, train_accuracy)

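    # The metrics are never reset, so the values printed here are running
    # averages over all epochs so far, not per-epoch figures.
    template = 'Epoch {}, Loss: {}, Accuracy:{}'
    print(template.format(epoch + 1, train_loss.result(), train_accuracy.result() * 100))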

Epoch 1, Loss: 1.270072102546692, Accuracy:85.521728515625
Epoch 2, Loss: 1.0850343704223633, Accuracy:86.42794799804688
Epoch 3, Loss: 1.0010583400726318, Accuracy:86.90519714355469
Epoch 4, Loss: 0.9440162181854248, Accuracy:87.23162841796875
Epoch 5, Loss: 0.9032987952232361, Accuracy:87.46558380126953
Epoch 6, Loss: 0.8722909688949585, Accuracy:87.64449310302734
Epoch 7, Loss: 0.8468963503837585, Accuracy:87.79431915283203
Epoch 8, Loss: 0.8250592947006226, Accuracy:87.92412567138672
Epoch 9, Loss: 0.805786669254303, Accuracy:88.03858947753906
Epoch 10, Loss: 0.7882952094078064, Accuracy:88.1427230834961
Epoch 11, Loss: 0.7720071077346802, Accuracy:88.23883819580078
Epoch 12, Loss: 0.7565454244613647, Accuracy:88.33019256591797
Epoch 13, Loss: 0.7417263388633728, Accuracy:88.4198989868164
Epoch 14, Loss: 0.7274039387702942, Accuracy:88.50878143310547
Epoch 15, Loss: 0.7134785056114197, Accuracy:88.6003189086914
Epoch 16, Loss: 0.6999036073684692, Accuracy:88.69689178466797
Epoch 17

In [25]:
model.save_weights("seq2seq_model.h5")

In [19]:
# Implement algorithm test
@tf.function
def test_step(model, inputs):
    return model(inputs, training=False)

In [29]:
for idx, (test_seq, test_labels) in enumerate(test_ds):
    if idx > 20:
        break
    prediction = test_step(model, test_seq)
    test_text = tokenizer.sequences_to_texts(test_seq.numpy())
    # ground_truth
    gt_text = tokenizer.sequences_to_texts(test_labels.numpy())
    # prediction
    texts = tokenizer.sequences_to_texts(prediction.numpy())
    
    print('_')
    print('desc: ', test_text)
    print('story: ', gt_text)
    print('pred: ', texts)
    

_
desc:  ['<start> the sky during a beautiful sunset near the beach <end>']
story:  ['<start> until the sun started to go down <end>']
pred:  ['we decided to take a trip to the beach and the beach was lit up <end>']
_
desc:  ['<start> a group of people wearing masks are celebrating marde gras in a bus <end>']
story:  ['<start> halloween was fun this year my family dressed up as astrunauts <end>']
pred:  ['the city s game was getting great to a great day in the town <end>']
_
desc:  ['<start> a child stretching and smiling at the camera <end>']
story:  ['<start> today i went to a history museum with grandpa dad and sister <end>']
pred:  ['but mom and dad you see you think they were having a great time <end>']
_
desc:  ['<start> two men and a woman posing for a picture <end>']
story:  ['<start> the group decided to head out to a bar for the night <end>']
pred:  ['we even got to see a couple in town they were a extremely proud <end>']
_
desc:  ['<start> two women sharing and doing cheers 