# Hands-On Practice with Keras Embedding Layers

This notebook was written while working through the examples in the article below.

* [How to Use Word Embedding Layers for Deep Learning with Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

## Preparing the Data

This data can be used for binary classification problems, such as judging whether feedback on academic work is positive or negative.

In [1]:
import numpy as np

# define documents
docs = ['Well done!',
    'Good work',
    'Great effort',
    'nice work',
    'Excellent!',
    'Weak',
    'Poor effort!',
    'not good',
    'poor work',
    'Could have done better.']

# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])

## Learning an Embedding Through Training

### Encoding Words

Encode each word as an integer. The `one_hot()` function used here hashes each word to obtain its integer code, so two different words can end up with the same value.

In [2]:
from tensorflow.keras.preprocessing.text import one_hot

# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[47, 24], [36, 19], [3, 35], [45, 19], [33], [48], [3, 35], [2, 36], [3, 19], [48, 15, 24, 12]]


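The output above shows that encoding words with `one_hot()` produces collisions: 'Great' and 'Poor' are both encoded as 3, and 'Weak' and 'Could' are both encoded as 48. The cell below encodes a few individual words the same way.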

In [3]:
print(one_hot('Well better', vocab_size))
print(one_hot('Good Great good', vocab_size))
print(one_hot('not have', vocab_size))

[47, 12]
[36, 3, 36]
[2, 15]
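
Why collisions are so likely here can be sketched without Keras. The snippet below is a rough imitation of the hashing scheme, not the library code: it assumes MD5 as a stable hash (the default for `one_hot()` is Python's built-in `hash`, whose values vary between runs), maps each word into the range 1 to `n - 1`, and estimates the chance that 14 distinct words all land in different slots. The helper names `hash_index`, `find_collisions`, and `prob_no_collision` are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict


def hash_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable MD5 hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def find_collisions(words, n):
    """Group distinct words by index; keep only groups with several words."""
    buckets = defaultdict(set)
    for w in words:
        buckets[hash_index(w, n)].add(w)
    return {i: sorted(ws) for i, ws in buckets.items() if len(ws) > 1}


def prob_no_collision(num_words, n):
    """Birthday-problem estimate, assuming uniformly distributed hashes."""
    slots = n - 1  # index 0 is never produced
    p = 1.0
    for i in range(num_words):
        p *= (slots - i) / slots
    return p


docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work',
        'Could have done better.']
# Rough tokenization: lowercase, keep runs of letters and apostrophes.
vocab = sorted({w for d in docs for w in re.findall(r"[a-z']+", d.lower())})

print(len(vocab))                   # 14 distinct words
print(find_collisions(vocab, 50))   # colliding groups under MD5, if any
print(round(prob_no_collision(len(vocab), 50), 3))
```

Under the uniform-hash assumption, 14 words fit into 49 slots without any collision only about 13% of the time, which is why a `vocab_size` of 50 still produces collisions for such a small vocabulary. A larger `vocab_size` lowers that risk.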


To make all inputs the same length, use the `pad_sequences()` function.

In [4]:
from tensorflow.keras.utils import pad_sequences

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[47 24  0  0]
 [36 19  0  0]
 [ 3 35  0  0]
 [45 19  0  0]
 [33  0  0  0]
 [48  0  0  0]
 [ 3 35  0  0]
 [ 2 36  0  0]
 [ 3 19  0  0]
 [48 15 24 12]]
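
What `pad_sequences()` does with `padding='post'` can be pictured in a few lines of plain Python. This is a simplified sketch, not the library implementation; `pad_post` is a hypothetical helper, and it assumes the Keras default of `truncating='pre'`, which drops values from the front of sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items,
    mirroring the assumed default truncating='pre'.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                     # truncate from the front
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


result = pad_post([[47, 24], [48, 15, 24, 12], [1, 2, 3, 4, 5]], maxlen=4)
for row in result:
    print(row)
# [47, 24, 0, 0]
# [48, 15, 24, 12]
# [2, 3, 4, 5]
```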


### Building the Model

The `Embedding` layer produces a 2D output for each sample and the `Dense` layer expects a 1D input, so a `Flatten` layer is used to connect the two.

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model (summary() prints the table itself)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 8)              400       

 flatten (Flatten)           (None, 32)                0         

 dense (Dense)               (None, 1)                 33        

Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
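
The shape change from `(None, 4, 8)` to `(None, 32)` can be traced by hand. The following plain-Python sketch treats the embedding as a lookup table of random vectors (a stand-in for the learned weights) and flattens one padded document. `embed` and `flatten` are illustrative names, not Keras functions.

```python
import random

vocab_size, embed_dim, max_length = 50, 8, 4

# Stand-in for the Embedding weights: one vector of length 8 per integer code.
random.seed(0)
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]


def embed(doc):
    """Replace each integer code with its row from the lookup table."""
    return [table[idx] for idx in doc]


def flatten(matrix):
    """Concatenate the rows of a 2D list into a single 1D list."""
    return [x for row in matrix for x in row]


doc = [47, 24, 0, 0]            # one padded document
embedded = embed(doc)           # 4 rows of 8 values each
flat = flatten(embedded)        # 32 values in a single row

print(len(embedded), len(embedded[0]))  # 4 8
print(len(flat))                        # 32
```

The padding code 0 is looked up like any other index, so both padded positions receive the same vector.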


### Training the Model

In [6]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 80.000001
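
One consequence of the hash collisions is worth checking. In the encoded output printed earlier, 'Great effort' and 'Poor effort!' both became `[3, 35]` but carry opposite labels, so no model can classify both correctly. The sketch below counts such conflicts to get an upper bound on training accuracy for that particular encoding. `accuracy_upper_bound` is a hypothetical helper, and the encoded values are copied from this run (they would differ on another run because the hash is not stable).

```python
from collections import Counter, defaultdict

encoded_docs = [[47, 24], [36, 19], [3, 35], [45, 19], [33],
                [48], [3, 35], [2, 36], [3, 19], [48, 15, 24, 12]]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def accuracy_upper_bound(inputs, targets):
    """Best accuracy a deterministic classifier could reach on this data."""
    by_input = defaultdict(Counter)
    for x, y in zip(inputs, targets):
        by_input[tuple(x)][y] += 1
    # For each distinct input, at most the majority label can be correct.
    correct = sum(max(c.values()) for c in by_input.values())
    return correct / len(inputs)


bound = accuracy_upper_bound(encoded_docs, labels)
print(bound)  # 0.9
```

With this encoding the ceiling is 90%, so part of the gap to 100% comes from the collision rather than from training.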


## Using Pre-Trained GloVe Embeddings

### Encoding Words

Encode each word as an integer. This time, `Tokenizer` is used for the encoding.

In [22]:
from tensorflow.keras.preprocessing.text import Tokenizer

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
print(t.word_index)
print(f'vocab_size: {vocab_size}')

# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

{'work': 1, 'done': 2, 'good': 3, 'effort': 4, 'poor': 5, 'well': 6, 'great': 7, 'nice': 8, 'excellent': 9, 'weak': 10, 'not': 11, 'could': 12, 'have': 13, 'better': 14}
vocab_size: 15
[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
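
The indices above follow from how `Tokenizer` ranks words: more frequent words get smaller indices, and index 0 is left unused so it can serve as the padding value, which is why `vocab_size` is `len(t.word_index) + 1`. The sketch below reproduces that ranking in plain Python under simplifying assumptions (lowercasing plus a rough punctuation filter rather than the exact Keras defaults). `tokenize`, `build_word_index`, and `to_sequences` are stand-ins, not Keras methods.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work',
        'Could have done better.']


def tokenize(text):
    """Lowercase and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(texts):
    """Rank words by frequency; ties keep their first-seen order."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so equally frequent words stay in insertion order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its index."""
    return [[word_index[w] for w in tokenize(t) if w in word_index]
            for t in texts]


word_index = build_word_index(docs)
sequences = to_sequences(docs, word_index)
print(word_index)
print(len(word_index) + 1)  # 15, leaving index 0 free for padding
print(sequences)
```

For these ten documents the sketch yields the same `word_index` and sequences as the `Tokenizer` output shown above.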


To make all inputs the same length, use the `pad_sequences()` function.

In [23]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]


In [15]:
from numpy import asarray

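# load only the vectors for words in our vocabulary into memory
embeddings_index = dict()

# raw string so the backslashes in the Windows path are not read as escapes
with open(r'C:\DevData\GloVe\glove.6B.50d.txt', mode='r', encoding='utf-8') as f: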
    for line in f:
        values = line.split()
        word = values[0]
        if word in t.word_index:
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

print(f'Loaded {len(embeddings_index)} word vectors.')

Loaded 14 word vectors.
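
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses that layout from an in-memory sample instead of the real file. The three 3-dimensional vectors are invented for illustration, and `load_vectors` is a hypothetical helper.

```python
import io

# Made-up sample in the "word v1 v2 ... vN" layout (not real GloVe values).
SAMPLE = """\
the 0.1 0.2 0.3
work 0.5 -0.5 0.25
good -0.1 0.4 0.0
"""


def load_vectors(lines, wanted):
    """Return {word: [floats]} for the words in `wanted` only."""
    vectors = {}
    for line in lines:
        values = line.split()
        if not values:
            continue                      # skip blank lines
        word, coefs = values[0], values[1:]
        if word in wanted:
            vectors[word] = [float(v) for v in coefs]
    return vectors


wanted = {"work", "good", "weak"}
vectors = load_vectors(io.StringIO(SAMPLE), wanted)
missing = wanted - set(vectors)

print(sorted(vectors))  # ['good', 'work']
print(missing)          # {'weak'}
```

Filtering while reading keeps only the needed vectors in memory, and the `missing` set makes it easy to see which vocabulary words have no pre-trained vector.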


In [17]:
from numpy import zeros

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 50))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
print(embedding_matrix)

[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 5.13589978e-01  1.96950004e-01 -5.19439995e-01 -8.62179995e-01
   1.54940002e-02  1.09729998e-01 -8.02929997e-01 -3.33609998e-01
  -1.61189993e-04  1.01889996e-02  4.6734...
(output truncated)
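
Row `i` of the weight matrix must hold the vector for the word whose index is `i`, and any row without a pre-trained vector stays at zero. That is why the first row printed above is all zeros: index 0 is the padding value and belongs to no word. A small sketch with made-up 3-dimensional vectors follows; `build_matrix` is an illustrative name.

```python
def build_matrix(word_index, vectors, dim):
    """Build a (len(word_index) + 1) x dim matrix; unknown rows stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = list(vec)
    return matrix


# Hypothetical toy data, not real GloVe values.
word_index = {"work": 1, "good": 2, "weak": 3}
vectors = {"work": [0.5, -0.5, 0.25], "good": [-0.1, 0.4, 0.0]}

matrix = build_matrix(word_index, vectors, dim=3)
for row in matrix:
    print(row)
# [0.0, 0.0, 0.0]      <- index 0, reserved for padding
# [0.5, -0.5, 0.25]    <- "work"
# [-0.1, 0.4, 0.0]     <- "good"
# [0.0, 0.0, 0.0]      <- "weak" has no vector in this toy data
```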

In [18]:
# define model
model = Sequential()
e = Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=max_length, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model (summary() prints the table itself)
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 4, 50)             750       

 flatten_1 (Flatten)         (None, 200)               0         

 dense_1 (Dense)             (None, 1)                 201       

Total params: 951
Trainable params: 201
Non-trainable params: 750
_________________________________________________________________
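
The parameter counts in both model summaries can be reproduced with simple arithmetic: an embedding has `vocab_size * dim` weights, and the single-unit `Dense` layer has one weight per flattened input plus a bias. With `trainable=False`, the embedding weights move to the non-trainable total. `count_params` is a hypothetical helper written for this check.

```python
def count_params(vocab_size, dim, max_length, embedding_trainable=True):
    """Parameter counts for Embedding -> Flatten -> Dense(1)."""
    embedding = vocab_size * dim       # one vector of length dim per index
    dense = max_length * dim + 1       # flattened inputs plus one bias
    total = embedding + dense
    trainable = dense + (embedding if embedding_trainable else 0)
    return {
        "embedding": embedding,
        "dense": dense,
        "total": total,
        "trainable": trainable,
        "non_trainable": total - trainable,
    }


learned = count_params(50, 8, 4)
glove = count_params(15, 50, 4, embedding_trainable=False)
print(learned)  # embedding 400, dense 33, total 433, all trainable
print(glove)    # embedding 750, dense 201, total 951, 201 trainable
```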


In [20]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=1)


Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1fb5f3e02b0>

In [21]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000
