# Text Classification with 1-d CNN

In an earlier exercise, we used an LSTM to encode text into a state vector and then used that vector as the feature for text classification, specifically sentiment analysis on the IMDB dataset.  

In this exercise, we revisit the same task with a 1-d CNN encoder. This lets us see how the way features are extracted affects the model's characteristics and its classification accuracy.

(Reference)  
https://github.com/gilbutITbook/006975/blob/master/6.4-sequence-processing-with-convnets.ipynb  
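
To make the feature-extraction view concrete, here is a minimal NumPy sketch of what one 1-d convolution filter followed by global max pooling computes on a sequence of embedding vectors. The helper names `conv1d_valid` and `global_max_pool` and the toy sizes are assumptions for illustration and are not part of the original notebook. Keras' `Conv1D` also learns the kernel values and adds a bias, which this sketch leaves out.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide `kernel` (width, channels) over `x` (steps, channels) without padding."""
    width = kernel.shape[0]
    n_out = x.shape[0] - width + 1          # 'valid' output length
    # Each output is the sum of element-wise products over one window.
    return np.array([np.sum(x[i:i + width] * kernel) for i in range(n_out)])

def global_max_pool(feature_map):
    """Keep only the strongest response, wherever it occurred in the sequence."""
    return feature_map.max()

# Toy input: 10 time steps of 8-dimensional embeddings (hypothetical values).
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
kernel = rng.normal(size=(7, 8))            # one filter of width 7

feature_map = np.maximum(conv1d_valid(x, kernel), 0.0)   # ReLU
print(feature_map.shape)                    # (4,)
print(global_max_pool(feature_map))
```

The pooled value does not record where the window matched, so this kind of encoder responds to local, n-gram-like patterns regardless of their position, whereas an LSTM consumes the tokens in order.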


In [1]:
# %matplotlib inline
import matplotlib.pyplot as plt

import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

In [16]:
# `tf` is only an alias, so `from tf.keras.models import ...` does not work;
# import from the public `tensorflow.keras` package instead.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [21]:
# Load the IMDB dataset from the same tf.keras package as the layers above.
from tensorflow.keras.datasets import imdb

# Use the same data settings as in the LSTM case.
max_features = 10000
max_tokens = 580
embedding_size = 8

# Workaround for NumPy versions where np.load defaults to allow_pickle=False,
# which breaks imdb.load_data in older Keras releases.
np_load_old = np.load
# Force allow_pickle=True, even if the caller already passes that keyword.
np.load = lambda *a, **k: np_load_old(*a, **{**k, 'allow_pickle': True})
try:
    # Normally this line alone is enough.
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
finally:
    # Restore np.load for normal use afterwards.
    np.load = np_load_old

print("Train-set size: ", len(x_train))
print("Test-set size:  ", len(x_test))

Train-set size:  25000
Test-set size:   25000


In [22]:
pad = 'pre'
x_train_pad = pad_sequences(x_train, maxlen=max_tokens, padding=pad, truncating=pad)
x_test_pad = pad_sequences(x_test, maxlen=max_tokens, padding=pad, truncating=pad)
x_train_pad.shape

(25000, 580)
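
As a rough illustration of what `padding='pre'` and `truncating='pre'` do to a single review, the pure-Python sketch below mimics that behaviour. The helper `pad_pre` is a hypothetical stand-in written for this example, not the Keras implementation, and the token ids are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic padding='pre', truncating='pre' for a single sequence."""
    seq = list(seq)[-maxlen:]                      # drop tokens from the front if too long
    return [value] * (maxlen - len(seq)) + seq     # fill the front with zeros if too short

print(pad_pre([11, 12, 13], maxlen=5))             # [0, 0, 11, 12, 13]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))    # [3, 4, 5, 6, 7]
```

With these settings, short reviews get leading zeros and long reviews keep only their last `max_tokens` tokens.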

In [23]:
x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

The dataset setup is the same as for the LSTM model.  

## Create Model with Conv1D

In [28]:
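model = Sequential()
# Map each word index to a trainable 8-dimensional embedding vector.
model.add(Embedding(input_dim=max_features,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
# 32 filters, each spanning 7 consecutive tokens.
model.add(Conv1D(32, 7, activation='relu'))
# Downsample the sequence by a factor of 5.
model.add(MaxPooling1D(5))
model.add(Conv1D(32, 7, activation='relu'))
# Keep the maximum response of each filter over the whole sequence.
model.add(GlobalMaxPooling1D())
# No activation here, so the output is an unbounded score, not a probability.
# binary_crossentropy assumes probabilities; Dense(1, activation='sigmoid') is the usual choice.
model.add(Dense(1))

model.summary()

# `learning_rate` replaces the deprecated `lr` argument.
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='binary_crossentropy',
              metrics=['acc'])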

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 580, 8)            80000     
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 574, 32)           1824      
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 114, 32)           0         
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 108, 32)           7200      
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
Total params: 89,057
Trainable params: 89,057
Non-trainable params: 0
_________________________________________________________________
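
The output shapes and parameter counts in the summary can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming stride-1 convolutions with `'valid'` padding (the Keras defaults used above). The helper names `conv1d_out_len` and `conv1d_params` are introduced only for this example.

```python
def conv1d_out_len(steps, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return steps - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

max_features, max_tokens, embedding_size = 10000, 580, 8

len1 = conv1d_out_len(max_tokens, 7)       # 574
len2 = len1 // 5                           # MaxPooling1D(5): 114
len3 = conv1d_out_len(len2, 7)             # 108

params = [
    max_features * embedding_size,         # Embedding: 80000
    conv1d_params(7, embedding_size, 32),  # first Conv1D: 1824
    conv1d_params(7, 32, 32),              # second Conv1D: 7200
    32 * 1 + 1,                            # Dense: 33
]
print(len1, len2, len3)                    # 574 114 108
print(params, sum(params))                 # [80000, 1824, 7200, 33] 89057
```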


In [29]:
layer_outputs = [layer.output for layer in model.layers]
layer_outputs

[<tf.Tensor 'layer_embedding_1/embedding_lookup/Identity_1:0' shape=(?, 580, 8) dtype=float32>,
 <tf.Tensor 'conv1d_11/Relu:0' shape=(?, 574, 32) dtype=float32>,
 <tf.Tensor 'max_pooling1d_5/Squeeze:0' shape=(?, 114, 32) dtype=float32>,
 <tf.Tensor 'conv1d_12/Relu:0' shape=(?, 108, 32) dtype=float32>,
 <tf.Tensor 'global_max_pooling1d_4/Max:0' shape=(?, 32) dtype=float32>,
 <tf.Tensor 'dense_4/BiasAdd:0' shape=(?, 1) dtype=float32>]

In [30]:
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=100, batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1f74677c940>
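
The `History` object returned by `fit` keeps the per-epoch metrics in its `.history` dictionary, which includes `val_loss` when `validation_split` is set. The sketch below shows one way to find the epoch with the lowest validation loss. The helper `best_epoch` is hypothetical, and the numbers in `toy_history` are made up for illustration; they are not results from this notebook.

```python
def best_epoch(history, key="val_loss"):
    """Return the 1-based epoch with the lowest value of `key`, and that value."""
    values = history[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Made-up numbers in the shape of `History.history`; NOT results from this notebook.
toy_history = {
    "loss":     [0.69, 0.60, 0.50, 0.42, 0.36],
    "val_loss": [0.68, 0.61, 0.55, 0.56, 0.58],
}
print(best_epoch(toy_history))          # (3, 0.55)
```

To use it on the real run, the return value of `model.fit(...)` would first need to be assigned to a variable, for example `history`, and then passed in as `best_epoch(history.history)`.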

In [31]:
result = model.evaluate(x_test_pad, y_test)



In [32]:
print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 85.06%
