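**IMDB Movie Review Sentiment Classification**

A dataset of 25,000 IMDB movie reviews, each labeled by sentiment (positive/negative). The reviews are preprocessed, and each one is encoded as a sequence of word indices (integers).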

In [1]:
%matplotlib inline
from __future__ import division, print_function

In [2]:
from keras.datasets import imdb
from keras.layers import Activation, Dense, Embedding, GRU
from keras.layers.normalization import BatchNormalization
from keras.models import Sequential
from keras.preprocessing import sequence
import numpy as np

Using TensorFlow backend.


In [3]:
np.random.seed(0)

In [4]:
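n_max_features = 5000  # vocabulary size: only word indices below 5000 are kept
n_max_len = 80         # every review is padded or truncated to 80 tokens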

Prepare the data

In [5]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=n_max_features)

Display a few samples

In [6]:
print('X:', X_train.shape, X_train[0], sep='\n')
print('y:', y_train.shape, y_train[:10], sep='\n')

X:
(25000,)
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
y:
(25000,)
[1 0 0 1 0 0 1 0 1 0]
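
The leading 1 and the repeated 2 in the sequence above are reserved values rather than ordinary words. A minimal sketch of this kind of encoding follows, assuming the loader's documented defaults (1 marks the start of a review, 2 replaces out-of-vocabulary words, and real word ranks are shifted by 3). The helper names `build_word_index` and `encode` and the toy reviews are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
from collections import Counter

# Reserved values, assumed to match the loader's defaults
START, OOV, INDEX_FROM = 1, 2, 3

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, max_features):
    """Encode a text as integers, replacing unknown or rare words with OOV."""
    ids = [START]
    for word in text.lower().split():
        rank = word_index.get(word)
        idx = rank + INDEX_FROM if rank is not None else OOV
        # Indices at or above max_features fall outside the kept vocabulary
        ids.append(idx if idx < max_features else OOV)
    return ids

toy_reviews = ["a great great movie", "a dull movie"]
toy_index = build_word_index(toy_reviews)
encoded = encode("a great unseen movie", toy_index, max_features=6)
print(encoded)
```

With `n_max_features = 5000`, the same rule explains why every index in the printed review is below 5000 and why rarer words show up as 2.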


Pad or truncate each review so that all sequences have the same length

In [7]:
X_train = sequence.pad_sequences(X_train, maxlen=n_max_len)
X_test = sequence.pad_sequences(X_test, maxlen=n_max_len)

In [8]:
print(X_train.shape)

(25000, 80)
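
To make the padding step concrete, here is a rough NumPy sketch of what `pad_sequences` does with its default settings, as far as this notebook relies on them: shorter sequences are left-padded with 0, and longer ones keep only their last `maxlen` entries. The function `pad_or_truncate` and the toy input are hypothetical stand-ins, not the library code.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # drop items from the front if too long
        out[i, -len(trunc):] = trunc   # right-align so padding sits on the left
    return out

toy = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
padded = pad_or_truncate(toy, maxlen=4)
print(padded)
```

One consequence of truncating from the front is that, for reviews longer than 80 tokens, the model only sees the final 80 tokens and the start marker is dropped.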


In [9]:
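model = Sequential()
# Map each word index to a 128-dimensional vector; dropout=0.2 regularizes the embedding during training
model.add(Embedding(n_max_features, 128, dropout=0.2))
# 64-unit GRU; dropout_W applies to the input connections, dropout_U to the recurrent connections
model.add(GRU(64, dropout_W=0.2, dropout_U=0.2))
# Single sigmoid unit: the output is the predicted probability of label 1
model.add(Dense(1))
model.add(Activation('sigmoid'))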
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [10]:
model.fit(X_train, y_train, batch_size=128, nb_epoch=5, validation_data=(X_test, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f983c6e9d30>

In [11]:
score, acc = model.evaluate(X_test, y_test, batch_size=128, verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.380257684059
Test accuracy: 0.832760000057
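
The two numbers printed above are the binary cross-entropy loss ("Test score") and the fraction of reviews classified correctly ("Test accuracy"). As a rough sketch of how such values follow from sigmoid outputs, the NumPy snippet below recomputes both on made-up data. The arrays `toy_labels` and `toy_probs` and both helper functions are hypothetical and not part of the notebook; the 0.5 decision threshold and the clipping constant are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predicted = (y_prob > threshold).astype(int)
    return float(np.mean(predicted == y_true))

toy_labels = np.array([1, 0, 0, 1])
toy_probs = np.array([0.9, 0.2, 0.6, 0.4])  # made-up sigmoid outputs

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = binary_accuracy(toy_labels, toy_probs)
print('toy loss:', toy_loss)
print('toy accuracy:', toy_acc)
```

The loss penalizes confident mistakes more heavily than hesitant ones, whereas accuracy only counts which side of the threshold each prediction lands on, so the two metrics can move somewhat independently.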
