# Text Classification with LSTM  

In this exercise we tackle sentiment analysis, a text classification task: given 50,000 reviews collected from IMDB, a well-known movie review site, we classify each review as positive or negative.  

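Keras ships a version of the IMDB dataset in which the raw corpus has already been preprocessed into an indexed corpus. Since we already learned how to process a raw corpus when implementing Word2Vec, this exercise uses the Keras dataset and focuses on the text classification step itself.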

(References)  
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/20_Natural_Language_Processing.ipynb
https://github.com/gilbutITbook/006975/blob/master/3.4-classifying-movie-reviews.ipynb
https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py

In [37]:
# %matplotlib inline
import matplotlib.pyplot as plt

import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

In [38]:
# Note: `from tf.keras.models import ...` fails because `tf` is only a local alias, not a package name.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, LSTM, Embedding
from tensorflow.keras.optimizers import Adam
# from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [39]:
tf.__version__

'1.13.1'

In [40]:
tf.keras.__version__

'2.2.4-tf'

In [41]:
# Load the dataset module from tf.keras so the notebook does not also depend on standalone Keras.
from tensorflow.keras.datasets import imdb

In [42]:
max_features = 10000

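Temporarily override `np.load` to work around a failure inside `imdb.load_data`: NumPy 1.16.3 and later default to `allow_pickle=False`, which this version of the dataset loader does not account for.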

In [43]:
# save np.load
np_load_old = np.load
# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
# call load_data with allow_pickle implicitly set to true
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)   # normally this line alone would be enough
# restore np.load for future normal usage
np.load = np_load_old
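The save, patch, restore pattern above can also be written as a context manager, so the original function is restored even if loading raises. The following is a minimal sketch under assumptions: `temporarily_set` and `fake_np` are hypothetical names introduced here for illustration, and `fake_np` merely stands in for the real `numpy` module.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the original."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield original
    finally:
        # Runs even if the body raises, so the patch never leaks out.
        setattr(obj, name, original)


def _fake_load(path, allow_pickle=False):
    """Stand-in loader that just reports the arguments it received."""
    return (path, allow_pickle)


# Hypothetical stand-in for the numpy module.
fake_np = SimpleNamespace(load=_fake_load)
original_load = fake_np.load


def patched_load(*args, **kwargs):
    """Same as the original loader, but with allow_pickle forced to True."""
    return original_load(*args, allow_pickle=True, **kwargs)


with temporarily_set(fake_np, "load", patched_load):
    inside = fake_np.load("imdb.npz")   # patched behaviour

after = fake_np.load("imdb.npz")        # original behaviour restored
```

With the real module, the same helper would wrap the `imdb.load_data(...)` call in place of the manual save and restore.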

In [44]:
print("Train-set size: ", len(x_train))
print("Test-set size:  ", len(x_test))

Train-set size:  25000
Test-set size:   25000


Combine into one data-set for some uses below.

In [45]:
total_data_text = list(x_train) + list(x_test)

In [46]:
num_tokens = [len(tokens) for tokens in total_data_text]
num_tokens = np.array(num_tokens)
num_tokens[:10]

array([218, 189, 141, 550, 147,  43, 123, 562, 233, 130])

In [47]:
np.mean(num_tokens)

234.75892

In [48]:
np.max(num_tokens)

2494

In [49]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

580

In [50]:
np.sum(num_tokens < max_tokens) / len(num_tokens)  

0.94502
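The cutoff used above is the mean length plus two standard deviations, truncated to an integer. As a small plain-Python sketch of the same rule, applied only to the first ten lengths printed earlier (the helper names `length_cutoff` and `coverage` are assumptions made for this illustration):

```python
import statistics

# First ten review lengths printed above; the notebook uses all 50,000.
lengths = [218, 189, 141, 550, 147, 43, 123, 562, 233, 130]


def length_cutoff(values, k=2):
    """Mean plus k population standard deviations, truncated to an int."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std, like np.std's default
    return int(mean + k * std)


def coverage(values, cutoff):
    """Fraction of sequences strictly shorter than the cutoff."""
    return sum(1 for v in values if v < cutoff) / len(values)


cutoff = length_cutoff(lengths)
covered = coverage(lengths, cutoff)
```

On the full dataset this rule gives 580 tokens and leaves about 94.5% of the reviews untruncated, as the outputs above show.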

Print an example from the training-set to see that the data looks correct.

In [51]:
x_train[0]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4,
 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480,
 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111,
 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22,
 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15,
 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386,
 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4,
 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33,
 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4,
 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530,
 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15,
 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21,
 134, 476, 26, 480, 5, 144, 30, 5535, 18,

In [52]:
len(x_train[0])

218
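The integers are word ranks shifted by a fixed offset. The sketch below assumes the default arguments of `imdb.load_data` (`start_char=1`, `oov_char=2`, `index_from=3`, with 0 reserved for padding). The tiny `toy_word_index` is a hypothetical stand-in for the mapping returned by `imdb.get_word_index()`, and its ranks are illustrative only.

```python
# Hypothetical word -> frequency-rank mapping (ranks start at 1).
toy_word_index = {"this": 11, "film": 19, "was": 13}

INDEX_FROM = 3  # assumed default offset applied by imdb.load_data
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}

# Invert the mapping and apply the offset: token id = rank + INDEX_FROM.
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}


def decode(token_ids):
    """Turn a list of integer tokens back into a readable string."""
    words = []
    for t in token_ids:
        if t in SPECIAL:
            words.append(SPECIAL[t])
        else:
            # Ids missing from the toy mapping are shown as unknown.
            words.append(id_to_word.get(t, "<UNK>"))
    return " ".join(words)


decoded = decode([1, 14, 22, 16, 2])
```

Under these assumptions, the frequent `2` entries in the review above are words that fell outside the `max_features` most frequent words.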

Each label in `y_train` is either 0 (negative) or 1 (positive).

In [53]:
y_train[0]

1

## Prepare Dataset for RNN

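In sequential data fed to an RNN (natural language in particular), the length differs from one data row to the next. To feed such variable-length sequences to the network as a single array, we pad the short ones and truncate the long ones to a common length of `max_tokens`.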

In [54]:
pad = 'pre'

In [55]:
x_train_pad = pad_sequences(x_train, maxlen=max_tokens, padding=pad, truncating=pad)

In [56]:
x_test_pad = pad_sequences(x_test, maxlen=max_tokens, padding=pad, truncating=pad)

In [57]:
x_train_pad.shape

(25000, 580)

In [58]:
x_test_pad.shape

(25000, 580)
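To make the effect of `padding='pre'` and `truncating='pre'` concrete, here is a plain-Python sketch of the same idea for a single sequence. It is not the Keras implementation; `pad_pre` is a hypothetical helper written for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Return a list of exactly maxlen items.

    Short sequences get `value` prepended; long sequences lose their
    leading tokens, so the end of the review is always kept.
    """
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
```

Here `short` is `[0, 0, 5, 6, 7]` and `long` is `[3, 4, 5, 6]`: with `'pre'` for both options, zeros go in front and any truncation removes the start of the review.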

For example, the second review in the training-set is the following sequence of tokens:

In [59]:
np.array(x_train[1])

array([   1,  194, 1153,  194, 8255,   78,  228,    5,    6, 1463, 4369,
       5012,  134,   26,    4,  715,    8,  118, 1634,   14,  394,   20,
         13,  119,  954,  189,  102,    5,  207,  110, 3103,   21,   14,
         69,  188,    8,   30,   23,    7,    4,  249,  126,   93,    4,
        114,    9, 2300, 1523,    5,  647,    4,  116,    9,   35, 8163,
          4,  229,    9,  340, 1322,    4,  118,    9,    4,  130, 4901,
         19,    4, 1002,    5,   89,   29,  952,   46,   37,    4,  455,
          9,   45,   43,   38, 1543, 1905,  398,    4, 1649,   26, 6853,
          5,  163,   11, 3215,    2,    4, 1153,    9,  194,  775,    7,
       8255,    2,  349, 2637,  148,  605,    2, 8003,   15,  123,  125,
         68,    2, 6853,   15,  349,  165, 4362,   98,    5,    4,  228,
          9,   43,    2, 1157,   15,  299,  120,    5,  120,  174,   11,
        220,  175,  136,   50,    9, 4373,  228, 8255,    5,    2,  656,
        245, 2350,    5,    4, 9837,  131,  152,  4

This has simply been padded to create the following sequence. Note that when this is input to the Recurrent Neural Network, then it first inputs a lot of zeros. If we had padded 'post' then it would input the integer-tokens first and then a lot of zeros. This may confuse the Recurrent Neural Network.

In [60]:
x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

## Create the Recurrent Neural Network


In [61]:
model = Sequential()

The first layer in the RNN is a so-called Embedding-layer which converts each integer-token into a vector of values. This is necessary because the integer-tokens take on values between 0 and 9999 for a vocabulary of 10000 words, and these values are arbitrary identifiers rather than quantities. The RNN cannot work directly on values in such a wide range. The embedding-layer is trained as a part of the RNN and will learn to map words with similar semantic meanings to similar embedding-vectors, as will be shown further below.

First we define the size of the embedding-vector for each integer-token. In this case we have set it to 8, so that each integer-token will be converted to a vector of length 8. The values of the embedding-vector will generally fall roughly between -1.0 and 1.0, although they may exceed these values somewhat.

The size of the embedding-vector is typically selected between 100-300, but it seems to work reasonably well with small values for Sentiment Analysis.

In [62]:
embedding_size = 8

The embedding-layer also needs to know the number of words in the vocabulary (`max_features`) and the length of the padded token-sequences (`max_tokens`). We also give this layer a name because we need to retrieve its weights further below.

In [63]:
model.add(Embedding(input_dim=max_features,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
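Conceptually, an embedding-layer is a lookup table with one row per token id. The sketch below illustrates this in plain Python with toy sizes and random, untrained values; the names `table` and `embed` are assumptions for this example and do not come from Keras.

```python
import random

vocab_size, embedding_size = 10, 4   # toy sizes; the notebook uses 10000 and 8
rng = random.Random(0)

# One row per token id. In the real layer these rows are trainable weights.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
         for _ in range(vocab_size)]


def embed(token_ids):
    """Replace each integer token by its row of the table."""
    return [table[t] for t in token_ids]


padded_review = [0, 0, 3, 7, 3]
vectors = embed(padded_review)       # shape: (5 tokens, 4 values each)
n_weights = vocab_size * embedding_size
```

The same token id always maps to the same vector, and the number of weights is simply `vocab_size * embedding_size`, which for the sizes used in the notebook is 10000 * 8 = 80000.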

We can now add the first Gated Recurrent Unit (GRU) to the network. This will have 16 outputs. Because we will add a second GRU after this one, we need to return sequences of data because the next GRU expects sequences as its input.

In [64]:
model.add(GRU(units=16, return_sequences=True))

This adds the second GRU with 8 output units. This will be followed by another GRU so it must also return sequences.

In [65]:
model.add(GRU(units=8, return_sequences=True))

This adds the third and final GRU with 4 output units. This will be followed by a dense-layer, so it should only give the final output of the GRU and not a whole sequence of outputs.

In [66]:
model.add(GRU(units=4))
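The difference between returning the whole sequence and only the final output can be illustrated with a toy recurrence. This sketch deliberately uses a single scalar `tanh` update as a simplified stand-in for a GRU; the function `toy_rnn` and its weights are hypothetical.

```python
import math


def toy_rnn(inputs, w=0.5, u=0.8, return_sequences=False):
    """Scalar recurrence h_t = tanh(w * x_t + u * h_{t-1}), starting at h = 0."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    # Either every intermediate state, or only the last one.
    return states if return_sequences else h


xs = [1.0, 0.0, -1.0, 0.5]
all_states = toy_rnn(xs, return_sequences=True)   # one state per time step
last_state = toy_rnn(xs)                          # only the final state
```

A stacked recurrent layer needs one input per time step, which is why the first two layers return sequences, while the dense-layer only needs the final state.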

Add a fully-connected / dense layer which computes a value between 0.0 and 1.0 that will be used as the classification output.

In [67]:
model.add(Dense(1, activation='sigmoid'))
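As a rough sketch of what the sigmoid output means, the function below squashes an arbitrary real value into the range 0.0 to 1.0, and a threshold of 0.5 (an assumption made here for illustration) turns it into a class label.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # avoids overflow for very negative z
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    """Map a raw score to 1 (positive) or 0 (negative)."""
    return 1 if sigmoid(z) >= threshold else 0


p_neutral = sigmoid(0.0)
label_pos = predict_label(2.0)
label_neg = predict_label(-2.0)
```

A raw score of zero corresponds to a probability of exactly 0.5; positive scores lean towards a positive review and negative scores towards a negative one.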

Use the Adam optimizer with the given learning-rate.

In [68]:
optimizer = Adam(lr=1e-5)

Compile the Keras model so it is ready for training.

In [69]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
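For intuition about the loss and metric named in `compile`, here is a plain-Python sketch of binary cross-entropy and accuracy on a handful of made-up predictions. It is an illustration of the formulas, not the Keras implementation; the clipping constant `eps` and the strict `> 0.5` threshold are assumptions.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] with probabilities clipped."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions whose thresholded label matches the target."""
    hits = sum(1 for y, p in zip(y_true, y_prob)
               if (1 if p > threshold else 0) == y)
    return hits / len(y_true)


loss = binary_crossentropy([1, 0], [0.9, 0.1])
acc = accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalized heavily; accuracy only looks at which side of the threshold each prediction falls on.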

In [70]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 580, 8)            80000     
_________________________________________________________________
gru_3 (GRU)                  (None, 580, 16)           1200      
_________________________________________________________________
gru_4 (GRU)                  (None, 580, 8)            600       
_________________________________________________________________
gru_5 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
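The parameter counts in the summary can be checked by hand. The sketch below assumes the classic GRU formulation with one bias vector per gate, which is consistent with the numbers printed above; newer Keras versions use a different default bias layout and report larger counts. The helper names are hypothetical.

```python
def embedding_params(vocab_size, output_dim):
    """One vector of length output_dim per vocabulary entry."""
    return vocab_size * output_dim


def gru_params(input_dim, units):
    """Three weight sets (update gate, reset gate, candidate state), each with
    an input kernel, a recurrent kernel, and one bias vector."""
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


counts = [
    embedding_params(10000, 8),  # layer_embedding
    gru_params(8, 16),           # first GRU
    gru_params(16, 8),           # second GRU
    gru_params(8, 4),            # third GRU
    dense_params(4, 1),          # dense output
]
total = sum(counts)
```

Under this assumption the per-layer counts are 80000, 1200, 600, 156, and 5, for a total of 81,961, so almost all of the weights sit in the embedding-layer.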


## Train the Recurrent Neural Network



In [72]:
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=20, batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/20
Epoch 2/20

KeyboardInterrupt: 
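The sample counts reported by `fit` follow directly from `validation_split=0.05`. A small sketch of the arithmetic, assuming the validation samples are taken from the end of the arrays before any shuffling (the helper `split_counts` is hypothetical):

```python
import math


def split_counts(n_samples, validation_split):
    """Number of training and validation samples for a given split fraction."""
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, n_samples - n_train


n_train, n_val = split_counts(25000, 0.05)

# Number of weight updates per epoch for the batch size used above.
steps_per_epoch = math.ceil(n_train / 64)
```

This gives 23750 training samples and 1250 validation samples, and 372 batches per epoch, the last of which is smaller than 64.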

## Performance on Test-Set

Now that the model has been trained we can calculate its classification accuracy on the test-set.

In [39]:
result = model.evaluate(x_test_pad, y_test)

  416/25000 [..............................] - ETA: 20:31 - loss: 0.5267 - acc: 0.7524

KeyboardInterrupt: 

In [40]:
print("Accuracy: {0:.2%}".format(result[1]))

NameError: name 'result' is not defined