## The encoder-decoder architecture is used when the input and output lengths differ (e.g., machine translation, text summarization).

# Seq2Seq
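### : A model that takes an input sequence and transforms it into an output sequence (encoder and decoder)
- Encoder: reads the input sequence and compresses it into a fixed-length vector. Its final state is passed to the decoder.
- Decoder: generates the output sequence from the state vector produced by the encoder. At each time step it receives the state vector and the token emitted at the previous time step (a character in this notebook), and predicts the next token.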
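
The two roles above can be sketched with a toy recurrent network in NumPy. This is only an illustration under stated assumptions: the sizes `HIDDEN` and `VOCAB`, the random weights, and the helper names `encode` and `decode` are hypothetical, and the encoder and decoder share weights here purely to keep the sketch short. The point is that the encoder state has the same size for any input length, and that the decoder feeds each prediction back in as its next input.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 4   # size of the fixed-length state (hypothetical)
VOCAB = 5    # number of distinct tokens (hypothetical)

# Randomly initialised weights; a real model would learn these.
W_in = rng.normal(size=(VOCAB, HIDDEN))
W_rec = rng.normal(size=(HIDDEN, HIDDEN))
W_out = rng.normal(size=(HIDDEN, VOCAB))

def one_hot(index):
    vec = np.zeros(VOCAB)
    vec[index] = 1.0
    return vec

def encode(token_ids):
    """Fold a sequence of any length into one state vector of size HIDDEN."""
    state = np.zeros(HIDDEN)
    for t in token_ids:
        state = np.tanh(one_hot(t) @ W_in + state @ W_rec)
    return state

def decode(state, start_id, steps):
    """Generate `steps` tokens, feeding each prediction back as the next input."""
    token, output = start_id, []
    for _ in range(steps):
        state = np.tanh(one_hot(token) @ W_in + state @ W_rec)
        token = int(np.argmax(state @ W_out))
        output.append(token)
    return output

short_state = encode([1, 2])
long_state = encode([1, 2, 3, 4, 0, 2, 1])
generated = decode(long_state, start_id=0, steps=6)
print(short_state.shape, long_state.shape, len(generated))
```

Both states have shape `(4,)` even though the inputs have 2 and 7 tokens, and the decoder can emit an output whose length is unrelated to the input length.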

In [31]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [32]:
lines = pd.read_csv('fra.txt', names=['src', 'tar', 'lic'], sep='\t')

len(lines)

232736

In [33]:
lines.head()

Unnamed: 0,src,tar,lic
0,Go.,Va !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,Go.,Marche.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
2,Go.,En route !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
3,Go.,Bouge !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
4,Hi.,Salut !,CC-BY 2.0 (France) Attribution: tatoeba.org #5...


In [34]:
del lines['lic']

lines = lines[0:60000]
lines.sample(5)

Unnamed: 0,src,tar
17898,Tom is cautious.,Tom est prudent.
7026,You're great.,T'assures.
5483,I went twice.,Je m'y suis rendue à deux reprises.
37283,Tom isn't reliable.,Tom n'est pas fiable.
2055,I got an A.,J'ai eu un A.


In [35]:
lines.tar = lines.tar.apply(lambda x : '\t ' + x + ' \n')
lines.sample(5)

Unnamed: 0,src,tar
35926,She came to see me.,\t Elle est venue me rendre visite. \n
29612,They kidnapped me.,\t Ils m'ont enlevé. \n
41842,I wasn't busy today.,\t Je n'étais pas occupé aujourd'hui. \n
38993,You're the teacher.,\t Tu es l'enseignante. \n
15500,I have no proof.,\t Je n'ai pas de preuve. \n


In [36]:
src_vocab = set()
for line in lines.src:
    for char in line:
        src_vocab.add(char)

tar_vocab = set()
for line in lines.tar:
    for char in line:
        tar_vocab.add(char)

src_vocab_size = len(src_vocab) + 1
tar_vocab_size = len(tar_vocab) + 1

print('Source character set size : ', src_vocab_size) # 80 (including the <PAD> token)
print('Target character set size : ', tar_vocab_size) # 102 (including the <PAD> token)

Source character set size :  80
Target character set size :  102


In [38]:
src_vocab = list(src_vocab)
tar_vocab = list(tar_vocab)

print(src_vocab[:10])
print(tar_vocab[:10])

['t', '€', 'W', '-', 'f', '%', '9', 'ï', '.', '7']
['t', 'W', '-', 'f', '%', '9', 'ï', '.', '7', 'û']


In [39]:
src_to_index = {word: i+1 for i, word in enumerate(src_vocab)}
tar_to_index = {word: i+1 for i, word in enumerate(tar_vocab)}

print('src_to_index:', src_to_index)
print('tar_to_index:', tar_to_index)

src_to_index: {'t': 1, '€': 2, 'W': 3, '-': 4, 'f': 5, '%': 6, '9': 7, 'ï': 8, '.': 9, '7': 10, 'X': 11, 'w': 12, 'H': 13, ' ': 14, '8': 15, '2': 16, 'p': 17, 'Z': 18, ':': 19, 'Q': 20, 'U': 21, ',': 22, 'A': 23, 'r': 24, 'T': 25, 'z': 26, 'a': 27, 'V': 28, 'k': 29, 'N': 30, 'q': 31, '3': 32, 'J': 33, 'L': 34, 'c': 35, 's': 36, 'n': 37, 'G': 38, '5': 39, 'x': 40, 'y': 41, 'E': 42, '4': 43, 'm': 44, 'j': 45, 'F': 46, 'l': 47, '"': 48, '/': 49, 'v': 50, 'e': 51, 'o': 52, '!': 53, 'Y': 54, 'B': 55, 'd': 56, 'P': 57, '0': 58, '?': 59, 'h': 60, '$': 61, 'R': 62, 'u': 63, 'g': 64, 'S': 65, 'O': 66, 'M': 67, 'K': 68, 'C': 69, "'": 70, 'D': 71, '’': 72, 'i': 73, 'I': 74, '&': 75, '1': 76, '6': 77, 'é': 78, 'b': 79}
tar_to_index: {'t': 1, 'W': 2, '-': 3, 'f': 4, '%': 5, '9': 6, 'ï': 7, '.': 8, '7': 9, 'û': 10, 'X': 11, 'w': 12, 'H': 13, ' ': 14, '8': 15, '2': 16, 'p': 17, ':': 18, 'Q': 19, 'U': 20, ',': 21, 'A': 22, 'r': 23, '\n': 24, '‽': 25, 'T': 26, 'Ê': 27, 'z': 28, 'a': 29, 'ù': 30, 'V': 3
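
One thing to keep in mind with the mapping above: it is built from `list(set(...))`, and the iteration order of a set of strings can change between Python runs, so the same character may receive a different index next time. A minimal sketch of a reproducible alternative, assuming it is acceptable to sort the characters first (the toy `sentences` list is hypothetical and stands in for `lines.src`):

```python
# Toy sentences standing in for lines.src (hypothetical data).
sentences = ["Go.", "Hi.", "Run!"]

# Collect every distinct character, then sort for a stable order.
vocab = sorted({char for sentence in sentences for char in sentence})

# Index 0 stays reserved for padding, so real characters start at 1.
char_to_index = {char: i + 1 for i, char in enumerate(vocab)}
index_to_char = {i: char for char, i in char_to_index.items()}

encoded = [[char_to_index[char] for char in sentence] for sentence in sentences]
print(vocab)    # ['!', '.', 'G', 'H', 'R', 'i', 'n', 'o', 'u']
print(encoded)  # [[3, 8, 2], [4, 6, 2], [5, 9, 7, 1]]
```

With a stable mapping, a model trained in one session can be reused in another without also having to persist the dictionaries.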

In [40]:
encoder_input = []

for line in lines.src:
    encoded_line = []
    for char in line:
        encoded_line.append(src_to_index[char])
    encoder_input.append(encoded_line)

print('source encoding : ', encoder_input[:5])

source encoding :  [[38, 52, 9], [38, 52, 9], [38, 52, 9], [38, 52, 9], [13, 73, 9]]


In [41]:
decoder_target = []

for line in lines.tar:
    # Skip the first character (the '\t' start token): the target is the
    # decoder input shifted one step to the left.
    encoded_line = [tar_to_index[char] for char in line[1:]]
    decoder_target.append(encoded_line)

print('target encoding : ', decoder_target[:5])

target encoding :  [[14, 31, 29, 14, 68, 14, 24], [14, 85, 29, 23, 39, 75, 67, 8, 14, 24], [14, 52, 42, 14, 23, 66, 80, 1, 67, 14, 68, 14, 24], [14, 70, 66, 80, 82, 67, 14, 68, 14, 24], [14, 83, 29, 63, 80, 1, 14, 68, 14, 24]]


In [42]:
max_src_len = max([len(line) for line in lines.src])
max_tar_len = max([len(line) for line in lines.tar])

print('Max source sentence length : ', max_src_len)
print('Max target sentence length : ', max_tar_len)

Max source sentence length :  22
Max target sentence length :  76


In [43]:
# Padding
encoder_input = pad_sequences(encoder_input, maxlen=max_src_len, padding='post')
decoder_target = pad_sequences(decoder_target, maxlen=max_tar_len, padding='post')

In [44]:
print(encoder_input[0], decoder_target[0])

[38 52  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0] [14 31 29 14 68 14 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0]
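
The output above shows what post-padding does: zeros are appended until every sequence reaches the maximum length, which is why index 0 was reserved when building the vocabularies. A plain-Python approximation of that behaviour follows. The helper `pad_post` is hypothetical and deliberately ignores the truncation options that `pad_sequences` offers; it assumes no sequence is longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (no truncation)."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

batch = pad_post([[38, 52, 9], [13, 73, 9, 1]], maxlen=6)
print(batch)  # [[38, 52, 9, 0, 0, 0], [13, 73, 9, 1, 0, 0]]
```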


In [45]:
encoder_input = to_categorical(encoder_input)
decoder_target = to_categorical(decoder_target)
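
One-hot encoding turns each padded index array into a dense array of shape (samples, max length, vocabulary size), which can be large. A rough back-of-the-envelope estimate using the numbers printed earlier (60,000 sentences, lengths 22 and 76, vocabulary sizes 80 and 102) is sketched below. It assumes 4 bytes per value, which is an assumption about the array dtype rather than something this notebook verifies; `one_hot_bytes` is a hypothetical helper.

```python
def one_hot_bytes(num_samples, max_len, vocab_size, bytes_per_value=4):
    """Size in bytes of a dense (num_samples, max_len, vocab_size) array."""
    return num_samples * max_len * vocab_size * bytes_per_value

encoder_bytes = one_hot_bytes(60_000, 22, 80)
decoder_bytes = one_hot_bytes(60_000, 76, 102)

print(f"encoder_input : {encoder_bytes / 1e9:.2f} GB")   # 0.42 GB
print(f"decoder_target: {decoder_bytes / 1e9:.2f} GB")   # 1.86 GB
```

The decoder input built later in the notebook has the same shape as `decoder_target`, so under the same assumption it needs roughly another 1.86 GB.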

In [46]:
decoder_input = []
for line in lines.tar:
    encoded_line = []
    for char in line:
        encoded_line.append(tar_to_index[char])

    decoder_input.append(encoded_line)

print('target encoding : ', decoder_input[:5])

target encoding :  [[53, 14, 31, 29, 14, 68, 14, 24], [53, 14, 85, 29, 23, 39, 75, 67, 8, 14, 24], [53, 14, 52, 42, 14, 23, 66, 80, 1, 67, 14, 68, 14, 24], [53, 14, 70, 66, 80, 82, 67, 14, 68, 14, 24], [53, 14, 83, 29, 63, 80, 1, 14, 68, 14, 24]]


In [47]:
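# Input (English) -> output (French): besides the encoder's output, the decoder also receives the ground-truth target as input (teacher forcing), which improves model performance.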
decoder_input = pad_sequences(decoder_input, maxlen = max_tar_len, padding = 'post')
decoder_input = to_categorical(decoder_input)
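
The relationship between `decoder_input` and `decoder_target` is easiest to see on one sentence: both come from the same wrapped string, with the target shifted one character to the left. A small sketch using the first target sentence from the dataset and the same `'\t '` / `' \n'` wrapping as the notebook (the variable names are illustrative only):

```python
sentence = "Va !"                       # a target sentence from the dataset
wrapped = "\t " + sentence + " \n"      # same wrapping as the notebook

decoder_in = list(wrapped)              # what the decoder reads
decoder_out = list(wrapped[1:])         # what it must predict (shifted by one)

# Pair each input character with the character the decoder should emit next.
for read, expect in zip(decoder_in, decoder_out):
    print(repr(read), '->', repr(expect))
```

The first pair is `'\t' -> ' '` and the last is `' ' -> '\n'`, matching the 8 input indices and 7 target indices printed for the first sentence above.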

In [48]:
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

In [49]:
# Encoder: the encoder LSTM returns its hidden state and cell state
encoder_inputs = Input(shape=(None, src_vocab_size))
encoder_lstm = LSTM(units=256, return_state=True) # return_state=True also returns the final states

_, state_h, state_c = encoder_lstm(encoder_inputs) # hidden state, cell state

encoder_states = [state_h, state_c]


In [50]:
# Decoder: receives the encoder states as its initial state and returns the final output
decoder_inputs = Input(shape=(None, tar_vocab_size))
decoder_lstm = LSTM(units=256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

decoder_dense = Dense(tar_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
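
As a sanity check on the model size, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector) and a fully connected output layer with bias; the helper names are hypothetical, and the result is worth comparing against what `model.summary()` reports.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    """One weight per input/output pair plus one bias per output."""
    return input_dim * output_dim + output_dim

encoder_lstm_params = lstm_params(input_dim=80, units=256)    # 345088
decoder_lstm_params = lstm_params(input_dim=102, units=256)   # 367616
output_layer_params = dense_params(input_dim=256, output_dim=102)  # 26214
total = encoder_lstm_params + decoder_lstm_params + output_layer_params
print(total)  # 738918
```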


In [51]:
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
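
The patience rule configured above (`monitor='val_loss'`, `mode='min'`, `patience=4`) can be illustrated with a simplified simulation: training stops once the validation loss has failed to improve for 4 epochs in a row. This is an approximation of the callback's behaviour, not its actual implementation, and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, not results from this notebook.
losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.75, 0.73, 0.90]
print(stopping_epoch(losses, patience=4))  # 7
```

Here the best loss is reached at epoch 3, and epochs 4 to 7 bring no improvement, so the run would end after epoch 7.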

In [None]:
# Train the model
history = model.fit(
    x=[encoder_input, decoder_input],
    y=decoder_target,
    batch_size=128,
    epochs=40,
    validation_split=0.1,
    callbacks=[es]  # apply the EarlyStopping callback defined above
)

Epoch 1/40



Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40

In [None]:
encoder_model = Model(inputs=encoder_inputs, outputs=encoder_states)

In [None]:
# States from the previous time step
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))

decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Use the previous time step's states as the initial state
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs) # reuse the trained softmax layer

decoder_model = Model(inputs=[decoder_inputs] + decoder_states_inputs, outputs=[decoder_outputs] + decoder_states)

In [None]:
index_to_src = dict((i, char) for char, i in src_to_index.items())
index_to_tar = dict((i, char) for char, i in tar_to_index.items())

In [None]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq) # encode the input and return the encoder states

    target_seq = np.zeros((1, 1, tar_vocab_size)) # one-hot vector for a single time step
    target_seq[0, 0, tar_to_index['\t']] = 1. # <SOS>: marks the start of the sentence

    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        # Use the previous time step's states as the initial state for this step
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :]) # pick the most probable character
        sampled_char = index_to_tar[sampled_token_index] # convert the predicted index to a character
        decoded_sentence += sampled_char # append the character

        # Stop at <EOS> (end of sentence) or once the maximum length is exceeded
        if (sampled_char == '\n' or len(decoded_sentence) > max_tar_len):
            stop_condition = True

        target_seq = np.zeros((1, 1, tar_vocab_size))
        target_seq[0, 0, sampled_token_index] = 1. # feed the current prediction back as the next input (tar_to_index[sampled_char])

        states_value = [h, c] # carry the current states over to the next time step

    return decoded_sentence
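
The control flow of `decode_sequence` is a greedy decoding loop: take the most probable character, append it, feed it back, and stop at the end marker or the length limit. The sketch below reproduces only that loop, with a hard-coded stand-in for the trained decoder so it runs without TensorFlow. `fake_decoder_step`, the toy `VOCAB`, and `MAX_LEN` are all hypothetical.

```python
VOCAB = ['\t', 'o', 'u', 'i', '\n']   # toy alphabet; '\t' = start, '\n' = end
MAX_LEN = 10

def fake_decoder_step(prev_index, state):
    """Hypothetical stand-in for decoder_model.predict.

    Returns (scores over VOCAB, new state). Here the 'state' is just a step
    counter and the scores are hard-coded to spell out "oui" and then stop.
    """
    script = [1, 2, 3, 4]  # indices of 'o', 'u', 'i', '\n'
    scores = [0.0] * len(VOCAB)
    scores[script[min(state, len(script) - 1)]] = 1.0
    return scores, state + 1

def greedy_decode(initial_state):
    index, state, decoded = VOCAB.index('\t'), initial_state, ''
    while True:
        scores, state = fake_decoder_step(index, state)
        index = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoded += VOCAB[index]
        if VOCAB[index] == '\n' or len(decoded) > MAX_LEN:
            return decoded

print(repr(greedy_decode(initial_state=0)))  # 'oui\n'
```

Greedy decoding always commits to the single best character at each step; it is simple and fast, but it cannot revise an early choice that later turns out to be poor.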

In [None]:
for seq_index in [2, 55, 123, 506, 1001]:
    input_seq = encoder_input[seq_index:seq_index+1]
    decoded_sentence = decode_sequence(input_seq)
    print(35 * "-")
    print('Input sentence : ', lines.src[seq_index])
    print('Reference sentence : ', lines.tar[seq_index][2:len(lines.tar[seq_index])-1])
    print('Translated sentence : ', decoded_sentence[1:len(decoded_sentence)-1])