# Seq2Seq2 (Sequence-to-Sequence)

### Embedding Vector

1. English: pretrained GloVe embeddings (6B tokens, 400K vocab, uncased, 100d)
2. Korean: embeddings trained from scratch


In [4]:
!gdown 1qk-14tgVHPXT5jfRUE4Ua2ji4EXwS022

Downloading...
From (original): https://drive.google.com/uc?id=1qk-14tgVHPXT5jfRUE4Ua2ji4EXwS022
From (redirected): https://drive.google.com/uc?id=1qk-14tgVHPXT5jfRUE4Ua2ji4EXwS022&confirm=t&uuid=85ae9199-0d7b-4d46-a59a-18e7969838b2
To: c:\nlp\07_seq2seq\glove.6B.100d.txt


### Preparing the training data

http://www.manythings.org/anki/

1. Encoder input data (English)
    - Prepare `encoder_input_eng`, e.g. `I love you`
2. Decoder output data (Korean)
    - Training model (teacher forcing)
        - decoder_input_kor `<sos> 난 널 사랑해`
        - decoder_output_kor `난 널 사랑해 <eos>`
    - Inference model
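
The shifted decoder pair used for teacher forcing can be sketched as follows. This is a minimal illustration on an already tokenized sentence; the helper name `make_decoder_pair` is hypothetical and does not appear in the notebook.

```python
def make_decoder_pair(tokens, sos="<sos>", eos="<eos>"):
    """Return (decoder_input, decoder_output) for teacher forcing."""
    decoder_input = [sos] + tokens    # what the decoder sees at each step
    decoder_output = tokens + [eos]   # what the decoder should predict at each step
    return decoder_input, decoder_output


dec_in, dec_out = make_decoder_pair(["난", "널", "사랑해"])
print(dec_in)
print(dec_out)
```

At every time step the target is the token that follows the current input token, so `dec_out[:-1]` equals `dec_in[1:]`.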

In [5]:
!gdown 17-luLRrDsnP_yIzIchd3aCo_WMt4znPF -O eng_kor.txt

Downloading...
From: https://drive.google.com/uc?id=17-luLRrDsnP_yIzIchd3aCo_WMt4znPF
To: c:\nlp\07_seq2seq\eng_kor.txt



In [6]:
eng_inputs = []     # English source sentences
kor_inputs = []     # Korean decoder inputs
kor_targets = []    # Korean targets (ground truth)

with open('eng_kor.txt', 'r', encoding='utf-8') as f:
    for line in f:                         # iterate over the file line by line
        eng, kor, _ = line.split('\t')     # each line is 'English\tKorean\t(other)'

        kor_input = '<sos>' + kor          # decoder input: prepend the start token
        kor_target = kor + '<eos>'         # decoder target: append the end token

        eng_inputs.append(eng)             # English sentence
        kor_inputs.append(kor_input)       # Korean input sequence
        kor_targets.append(kor_target)     # Korean target sequence

len(eng_inputs), len(kor_inputs), len(kor_targets)

(5890, 5890, 5890)

In [7]:
print(eng_inputs[2500:2505])
print(kor_inputs[2500:2505])    # includes the <sos> token
print(kor_targets[2500:2505])   # includes the <eos> token

['I speak French a little.', 'I take back what I said.', 'I tried to make friends.', 'I use this all the time.', 'I use this all the time.']
['<sos>저는 프랑스어를 조금 합니다.', '<sos>아까 한 말 취소야.', '<sos>난 친구를 만드려고 했어.', '<sos>나는 항상 이걸 쓴다.', '<sos>매번 이걸 쓴다.']
['저는 프랑스어를 조금 합니다.<eos>', '아까 한 말 취소야.<eos>', '난 친구를 만드려고 했어.<eos>', '나는 항상 이걸 쓴다.<eos>', '매번 이걸 쓴다.<eos>']


### Tokenization
- Encoder (English): English tokenizer
- Decoder (Korean): Korean tokenizer

In [8]:
VOCAB_SIZE = 10000  # maximum vocabulary size (most frequent words first)

### English tokenization

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer

eng_tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token='<OOV>')    # cap the vocabulary; unknown words map to <OOV>
eng_tokenizer.fit_on_texts(eng_inputs)                                # build the vocabulary (word_index) from the English sentences
eng_inputs_seq = eng_tokenizer.texts_to_sequences(eng_inputs)         # English sentences -> sequences of word IDs

print(eng_inputs[2500:2505])
eng_inputs_seq[2500:2505]

['I speak French a little.', 'I take back what I said.', 'I tried to make friends.', 'I use this all the time.', 'I use this all the time.']


[[2, 130, 38, 8, 268],
 [2, 111, 150, 27, 2, 82],
 [2, 300, 4, 213, 202],
 [2, 206, 12, 54, 6, 60],
 [2, 206, 12, 54, 6, 60]]

In [10]:
for seq in eng_inputs_seq[2500:2505]:
    print([eng_tokenizer.index_word[idx] for idx in seq])   # map each ID back to its word and print the token list

['i', 'speak', 'french', 'a', 'little']
['i', 'take', 'back', 'what', 'i', 'said']
['i', 'tried', 'to', 'make', 'friends']
['i', 'use', 'this', 'all', 'the', 'time']
['i', 'use', 'this', 'all', 'the', 'time']


In [11]:
# Number of words actually used (the smaller of VOCAB_SIZE and the vocabulary size)
eng_num_words = min(VOCAB_SIZE, len(eng_tokenizer.word_index))

# Token length of the longest English sequence
eng_max_len = max([len(seq) for seq in eng_inputs_seq])
print(eng_num_words, eng_max_len)

3200 101


`eng_max_len` holds the maximum sequence length, which is used as the padding length.

### Korean tokenization
okt = Okt(jvmpath=r"C:\Program Files\Java\jdk-21\bin\server\jvm.dll")

In [12]:
kor_tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token='<OOV>', filters='')  # no filters, so the special tokens <sos>/<eos> are kept
kor_tokenizer.fit_on_texts(kor_inputs + kor_targets)                            # build the vocabulary from inputs + targets

kor_inputs_seq = kor_tokenizer.texts_to_sequences(kor_inputs)                   # Korean inputs (with <sos>) -> ID sequences
kor_targets_seq = kor_tokenizer.texts_to_sequences(kor_targets)                 # Korean targets (with <eos>) -> ID sequences

In [13]:
kor_tokenizer.index_word

{1: '<OOV>',
 2: '톰은',
 3: '<sos>톰은',
 4: '나는',
 5: '<sos>나는',
 6: '톰이',
 7: '수',
 8: '난',
 9: '<sos>난',
 10: '내가',
 11: '그',
 12: '내',
 13: '있어.',
 14: '있어.<eos>',
 15: '<sos>톰이',
 16: '이',
 17: '것',
 18: '것을',
 19: '더',
 20: '그는',
 21: '할',
 22: '있다.',
 23: '있다.<eos>',
 24: '<sos>그는',
 25: '안',
 26: '좀',
 27: '너무',
 28: '네가',
 29: '<sos>그',
 30: '한',
 31: '알고',
 32: '<sos>이',
 33: '걸',
 34: '정말',
 35: '있는',
 36: '왜',
 37: '있어?',
 38: '있어?<eos>',
 39: '프랑스어를',
 40: '<sos>내가',
 41: '줄',
 42: '네',
 43: '우리는',
 44: '가장',
 45: '하고',
 46: '적',
 47: '잘',
 48: '<sos>우리는',
 49: '사람은',
 50: '톰을',
 51: '우리',
 52: '너',
 53: '집에',
 54: '있을',
 55: '건',
 56: '그걸',
 57: '것은',
 58: '<sos>내',
 59: '그렇게',
 60: '메리가',
 61: '<sos>너',
 62: '거야.',
 63: '않아.',
 64: '것이',
 65: '없어.',
 66: '날',
 67: '거야.<eos>',
 68: '<sos>네가',
 69: '하지',
 70: '아주',
 71: '적이',
 72: '같아.',
 73: '않아.<eos>',
 74: '같아.<eos>',
 75: '없어.<eos>',
 76: '그녀는',
 77: '아직도',
 78: '게',
 79: '무슨',
 80: '못',
 81: '그것을',
 82: '좋은',
 83: '얼마나',

In [14]:
# Number of words actually used (the smaller of VOCAB_SIZE and the vocabulary size)
kor_num_words = min(VOCAB_SIZE, len(kor_tokenizer.word_index))

# Token length of the longest Korean sequence
kor_max_len = max([len(seq) for seq in kor_inputs_seq])
print(kor_num_words, kor_max_len)

10000 89


### Padding
- Encoder: padding = 'pre'
- Decoder: padding = 'post'

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences       # pads sequences to a common length

# Arguments: sequences, maxlen=, padding= (English -> encoder, Korean -> decoder)
eng_inputs_padded = pad_sequences(eng_inputs_seq, maxlen=eng_max_len, padding='pre')    # English inputs: pad to max length with leading zeros
kor_inputs_padded = pad_sequences(kor_inputs_seq, maxlen=kor_max_len, padding='post')   # Korean inputs: pad to max length with trailing zeros
kor_targets_padded = pad_sequences(kor_targets_seq, maxlen=kor_max_len, padding='post') # Korean targets: pad to max length with trailing zeros

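- English (encoder input) = pre padding: the encoder's last hidden state is often what gets used, so the real words are placed at the end and the sequence does not finish on padding (0).
- Korean (decoder input/target) = post padding: generation runs left to right and training uses teacher forcing, so aligning time steps from the start is natural, and with zeros only at the end, masking the loss is straightforward.


=> If masking (e.g. `mask_zero`) is used, pre vs. post padding makes little difference.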
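
The difference between the two modes can be seen on one short sequence. The `pad` function below is a simplified pure-Python stand-in written for illustration, not the Keras implementation; it mirrors the `padding` argument and the default `truncating='pre'` behavior of `pad_sequences`.

```python
def pad(seqs, maxlen, padding="pre", value=0):
    """Pad (or truncate from the front) each sequence to exactly maxlen items."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                 # keep the last maxlen items
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


seq = [[2, 130, 38, 8, 268]]                      # 'i speak french a little'
pre = pad(seq, maxlen=7, padding="pre")
post = pad(seq, maxlen=7, padding="post")
mask = [token != 0 for token in post[0]]          # True where a real token is present

print(pre)
print(post)
print(mask)
```

With pre padding the last position always holds a real token, which matters when only the final encoder state is used. With post padding the mask is a run of `True` values followed by `False`, which is easy to apply to a per-step loss.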

### Building the model

Build and train a model with an "encoder + decoder (teacher forcing)" structure.

#### Embedding Layer

In [16]:
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for i, vects in enumerate(f):   # read the file line by line as (index i, line vects)
        print(vects)
        if i == 3:
            break

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062

, -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663 0.038867 0.35481 0.06351 -0.094189 0.15786 -0.81665 0.14172 0.21939 0.58505 -0.52158

Each line has the form `word value1 ... value100`.
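
Splitting one such line into a word and its vector can be sketched as below. The sample line is shortened to three values for illustration; a real line in `glove.6B.100d.txt` carries 100.

```python
line = "the -0.038194 -0.24487 0.72812\n"   # shortened sample line (3 values instead of 100)

word, *values = line.split()                # first field is the word, the rest are the vector
vector = [float(v) for v in values]         # strings -> floats

print(word, len(vector), vector)
```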

In [17]:
# Build an embedding matrix from pretrained embeddings (GloVe)
import numpy as np

# Build embedding_matrix from pretrained embeddings, aligned with the tokenizer's word indices
def make_embedding_matrix(num_words, embedding_dim, tokenizer, file_path):
    embedding_matrix = np.zeros((num_words + 1, embedding_dim)) # (num_words + 1, embedding dim), initialized to 0

    pretrained_embedding = {}

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            word, *vects = line.split()
            vects = np.array(vects, dtype=np.float32)           # string values -> float32 array
            pretrained_embedding[word] = vects

    for word, index in tokenizer.word_index.items():            # iterate over the tokenizer's (word, index) pairs
        if index > num_words:                                   # skip indices beyond the matrix (vocabulary larger than num_words)
            continue
        vects_ = pretrained_embedding.get(word)                 # look up the pretrained vector (None if missing)
        if vects_ is not None:
            embedding_matrix[index] = vects_                    # place the vector at the word's index

    return embedding_matrix                                     # words without a pretrained vector keep a zero vector
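
To see how the rows get filled, here is a toy sketch that uses an in-memory dictionary instead of the GloVe file. The names `fill_matrix`, `toy_word_index`, and `toy_vectors` are hypothetical, and the 2-dimensional vectors are made-up values.

```python
import numpy as np


def fill_matrix(word_index, pretrained, dim):
    """Copy pretrained vectors into the rows given by word_index; others stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim))   # row 0 is reserved for padding
    for word, index in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix


toy_word_index = {"<OOV>": 1, "i": 2, "tom": 3}     # same layout as a Keras word_index
toy_vectors = {"i": [0.1, 0.2], "tom": [0.3, 0.4]}  # made-up pretrained vectors

matrix = fill_matrix(toy_word_index, toy_vectors, dim=2)
missing = [w for w, i in toy_word_index.items() if not matrix[i].any()]

print(matrix.shape)
print(missing)
```

Row 0 (padding) and any word absent from the pretrained vocabulary, such as `<OOV>`, remain all zeros.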

In [18]:
EMBEDDING_DIM = 100

# Build the embedding matrix for the English tokenizer
en_embedding_matrix = make_embedding_matrix(
    eng_num_words,
    EMBEDDING_DIM,
    eng_tokenizer,
    'glove.6B.100d.txt'
)

en_embedding_matrix.shape

(3201, 100)

In [19]:
list(eng_tokenizer.index_word.values())

['<OOV>',
 'i',
 'tom',
 'to',
 'you',
 'the',
 'is',
 'a',
 'that',
 'do',
 'in',
 'this',
 'was',
 'have',
 'he',
 "i'm",
 'my',
 'are',
 'of',
 "don't",
 'it',
 'me',
 'be',
 'your',
 'like',
 'for',
 'what',
 'want',
 'think',
 'we',
 'know',
 'not',
 'mary',
 'his',
 'there',
 'how',
 'can',
 'french',
 'and',
 'very',
 'has',
 "it's",
 'at',
 'go',
 'with',
 'she',
 'did',
 "didn't",
 'on',
 'here',
 'why',
 'going',
 'been',
 'all',
 'no',
 'as',
 'really',
 'help',
 'please',
 'time',
 "can't",
 'they',
 'will',
 'him',
 "isn't",
 'one',
 "you're",
 'if',
 'about',
 'good',
 "that's",
 'who',
 'too',
 "doesn't",
 'up',
 'had',
 'were',
 'where',
 'would',
 'from',
 "tom's",
 'said',
 'need',
 'something',
 'when',
 'us',
 'tell',
 'never',
 'home',
 'now',
 'still',
 'more',
 'school',
 'so',
 'an',
 'should',
 'come',
 'than',
 'some',
 'sorry',
 'but',
 'ever',
 'get',
 'work',
 'out',
 "i'll",
 "i've",
 'three',
 'by',
 'boston',
 'take',
 'keep',
 'stop',
 'just',
 'doing',

In [None]:
# Put <pad> at index 0, followed by the remaining tokens in index order
eng_word_index = ['<pad>'] + list(eng_tokenizer.index_word.values())
eng_word_index

['<pad>',
 '<OOV>',
 'i',
 'tom',
 'to',
 'you',
 'the',
 'is',
 'a',
 'that',
 'do',
 'in',
 'this',
 'was',
 'have',
 'he',
 "i'm",
 'my',
 'are',
 'of',
 "don't",
 'it',
 'me',
 'be',
 'your',
 'like',
 'for',
 'what',
 'want',
 'think',
 'we',
 'know',
 'not',
 'mary',
 'his',
 'there',
 'how',
 'can',
 'french',
 'and',
 'very',
 'has',
 "it's",
 'at',
 'go',
 'with',
 'she',
 'did',
 "didn't",
 'on',
 'here',
 'why',
 'going',
 'been',
 'all',
 'no',
 'as',
 'really',
 'help',
 'please',
 'time',
 "can't",
 'they',
 'will',
 'him',
 "isn't",
 'one',
 "you're",
 'if',
 'about',
 'good',
 "that's",
 'who',
 'too',
 "doesn't",
 'up',
 'had',
 'were',
 'where',
 'would',
 'from',
 "tom's",
 'said',
 'need',
 'something',
 'when',
 'us',
 'tell',
 'never',
 'home',
 'now',
 'still',
 'more',
 'school',
 'so',
 'an',
 'should',
 'come',
 'than',
 'some',
 'sorry',
 'but',
 'ever',
 'get',
 'work',
 'out',
 "i'll",
 "i've",
 'three',
 'by',
 'boston',
 'take',
 'keep',
 'stop',
 'just',

In [22]:
import pandas as pd 

pd.DataFrame(en_embedding_matrix, index=eng_word_index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
<pad>,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,...,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
<OOV>,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,...,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
i,-0.046539,0.619660,0.56647,-0.465840,-1.189000,0.445990,0.066035,0.319100,0.146790,-0.22119,...,-0.323430,-0.431210,0.41392,0.283740,-0.709310,0.150030,-0.215400,-0.376160,-0.032502,0.806200
tom,-0.583880,-0.469400,0.16855,-1.670300,-0.116010,0.048738,-0.342010,-0.376910,-0.953080,-0.88260,...,0.264290,0.337860,0.35791,0.549010,0.006725,0.281580,0.343730,0.137040,0.089572,-0.542770
to,-0.189700,0.050024,0.19084,-0.049184,-0.089737,0.210060,-0.549520,0.098377,-0.201350,0.34241,...,-0.131340,0.058617,-0.31869,-0.614190,-0.623930,-0.415480,-0.038175,-0.398040,0.476470,-0.159830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
whom,0.341680,0.545720,-0.15683,-0.205450,-0.199470,0.389270,-0.017266,-0.244750,0.124760,-0.34392,...,0.142210,-0.556420,0.45764,-0.243960,-1.146800,0.330140,-0.900430,-0.014213,0.161420,-0.444830
intimately,-0.416930,0.131300,0.50615,0.034870,0.255630,0.409070,0.234080,-0.574340,0.293270,0.14061,...,-0.636420,0.066336,-0.44926,-1.350100,0.381640,-0.051961,-0.425590,-0.667450,0.009747,0.061855
millions,0.622040,1.063600,0.13146,-0.203880,0.555570,0.443570,-0.524210,0.040587,0.470640,-0.53788,...,-0.063191,-0.111260,0.66609,-0.972050,-0.558910,-0.837910,-0.147160,0.862920,0.302910,-0.225640
inhabit,-0.927750,0.738860,0.69884,0.138580,0.020592,0.521490,-0.541050,0.501790,0.206630,-0.51982,...,0.491800,-0.235440,0.48243,-0.714400,-0.378190,0.271660,-1.243600,-0.288030,0.065840,0.053163


#### Encoder model

In [23]:
# Build the Seq2Seq encoder model (pretrained embeddings + LSTM)