### IMDB Dataset-Keras

(https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
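The frequency-based filter described above can be sketched in plain Python. The helper `filter_by_frequency` is hypothetical (it is not part of Keras). It assumes that indices outside the kept range are replaced by an out-of-vocabulary marker, which mirrors how `load_data` treats its `num_words` and `skip_top` arguments, with `oov_char=2` as the documented default.

```python
def filter_by_frequency(review, num_words, skip_top=0, oov_char=2):
    """Keep indices in [skip_top, num_words); replace the rest with oov_char.

    Hypothetical helper for illustration only, not a Keras function.
    """
    return [w if skip_top <= w < num_words else oov_char for w in review]


# A toy "review" of word indices (smaller index = more frequent word).
toy_review = [5, 25, 9999, 10000, 19, 20, 431]

# "Top 10,000 most common words, but eliminate the top 20 most common words".
filtered = filter_by_frequency(toy_review, num_words=10000, skip_top=20)
print(filtered)
```

With the real dataset, the same effect comes from passing the two limits directly: `imdb.load_data(num_words=10000, skip_top=20)`.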

**Step 1:** Load the dataset and required packages

In [0]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence

In [0]:
# Load the dataset, keeping only the 5000 most frequent words;
# rarer words are replaced by the out-of-vocabulary index (2 by default)
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# The labels are already 0/1 integers, which is what the single sigmoid
# output trained with binary_crossentropy expects, so no one-hot encoding

**Step 2:** Truncate and pad the input sequences so that every review is exactly 500 words long

In [0]:
max_length=500
X_train = sequence.pad_sequences(X_train, maxlen=max_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_length)
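By default `pad_sequences` works on the start of each sequence (`padding='pre'`, `truncating='pre'`, `value=0`): short reviews get zeros in front, and long reviews keep only their last `maxlen` words. The pure-Python function below is a minimal sketch of that default behaviour for a single sequence. The name `pad_or_truncate` is made up for illustration and is not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation for one sequence."""
    if len(seq) >= maxlen:
        # 'pre' truncation: drop items from the start, keep the last maxlen
        return seq[len(seq) - maxlen:]
    # 'pre' padding: fill the start with the padding value
    return [value] * (maxlen - len(seq)) + seq


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 11, 12, 13]
print(pad_or_truncate(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```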

**Step 3:** Create the model. Use an embedding layer with an embedding vector size of 32, followed by a single LSTM layer with 100 units and a one-unit sigmoid output layer. Compile the model, then train it and test its accuracy.

In [0]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, output_dim=embedding_vector_length, input_length=max_length))
model.add(LSTM(100, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 500, 32)           160000
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101
=================================================================
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
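The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are chosen for illustration.

```python
vocab_size = 5000     # top_words
embedding_dim = 32    # embedding vector size
lstm_units = 100
dense_units = 1

# One 32-dimensional vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# 4 gates, each with (input + recurrent) weights per unit, plus one bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM output plus a bias
dense_params = lstm_units * dense_units + dense_units

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The results (160000, 53200, 101, and 213301 in total) match the `Param #` column of the summary.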


In [0]:
# Train for 10 epochs; the test set doubles as the validation data here
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32, verbose=2)

# evaluate() returns [loss, accuracy]; report the accuracy as a percentage
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
 - 383s - loss: 0.5459 - accuracy: 0.7175 - val_loss: 0.6402 - val_accuracy: 0.6106
Epoch 2/10
 - 400s - loss: 0.4297 - accuracy: 0.7997 - val_loss: 0.3253 - val_accuracy: 0.8692
Epoch 3/10
 - 399s - loss: 0.3023 - accuracy: 0.8771 - val_loss: 0.3170 - val_accuracy: 0.8686
Epoch 4/10
 - 399s - loss: 0.2419 - accuracy: 0.9045 - val_loss: 0.2937 - val_accuracy: 0.8798
Epoch 5/10
 - 401s - loss: 0.2088 - accuracy: 0.9187 - val_loss: 0.3114 - val_accuracy: 0.8765
Epoch 6/10
 - 400s - loss: 0.1834 - accuracy: 0.9287 - val_loss: 0.3133 - val_accuracy: 0.8747
Epoch 7/10
 - 397s - loss: 0.1628 - accuracy: 0.9395 - val_loss: 0.3197 - val_accuracy: 0.8673
Epoch 8/10
 - 396s - loss: 0.1472 - accuracy: 0.9430 - val_loss: 0.3384 - val_accuracy: 0.8693
Epoch 9/10
 - 396s - loss: 0.1229 - accuracy: 0.9538 - val_loss: 0.4037 - val_accuracy: 0.8612
Epoch 10/10
 - 410s - loss: 0.1045 - accuracy: 0.9610 - val_loss: 0.4059 - val_accuracy: 0.8690
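One way to read the log above is to look for the epoch with the lowest validation loss. The sketch below does this on values copied by hand from the log. With a live run, the same lists should be available as `history.history['loss']` and `history.history['val_loss']`.

```python
# Per-epoch values copied from the training log above (epochs 1 to 10)
train_loss = [0.5459, 0.4297, 0.3023, 0.2419, 0.2088,
              0.1834, 0.1628, 0.1472, 0.1229, 0.1045]
val_loss = [0.6402, 0.3253, 0.3170, 0.2937, 0.3114,
            0.3133, 0.3197, 0.3384, 0.4037, 0.4059]
val_accuracy = [0.6106, 0.8692, 0.8686, 0.8798, 0.8765,
                0.8747, 0.8673, 0.8693, 0.8612, 0.8690]

# Epochs are 1-based, list indices are 0-based
best_epoch = val_loss.index(min(val_loss)) + 1
best_val_accuracy = val_accuracy[best_epoch - 1]

# Training loss falls at every epoch even after validation loss starts rising
train_loss_always_falls = all(a > b for a, b in zip(train_loss, train_loss[1:]))

print("Lowest validation loss at epoch", best_epoch)
print("Validation accuracy at that epoch:", best_val_accuracy)
print("Training loss falls every epoch:", train_loss_always_falls)
```

In this run the validation loss bottoms out at epoch 4 and then drifts upward while the training loss keeps falling, a pattern consistent with overfitting. It suggests that training for fewer epochs might do about as well on held-out data.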