In [1]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers import Embedding
from keras.preprocessing import sequence

Using TensorFlow backend.


For this analysis we’ll be using a dataset of 50,000 movie reviews taken from IMDb. The data was compiled by Andrew Maas and can be found here: [IMDb Reviews](http://ai.stanford.edu/~amaas/data/sentiment/).

The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. The curator of the data labeled reviews with ≤ 4 stars as negative and reviews with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
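
To make the labeling rule concrete, here is a minimal sketch. The helper name `label_from_rating` is made up for illustration, and the 0/1 encoding assumes the convention used later in this notebook (0 = negative, 1 = positive).

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[int]:
    """Map a 1-10 star rating to a sentiment label (hypothetical helper)."""
    if stars <= 4:
        return 0      # negative review
    if stars >= 7:
        return 1      # positive review
    return None       # 5 or 6 stars: left out of the dataset


# Show the label assigned to every possible rating.
labels = {stars: label_from_rating(stars) for stars in range(1, 11)}
print(labels)
```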

Of course, we could download it and do the pre-processing ourselves, but Keras already provides a wrapper around a pre-processed version of the dataset.

In [0]:
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [0]:
# Each review is a list of integers: every word is replaced by its index in the vocabulary.
print (X_train[0], y_train[0])

In [0]:
# Let us decode a sentence to see what it holds
d = imdb.get_word_index()
inv_map = {v: k for k, v in d.items()}
for i in X_train[0][1:]:
  if (i>=3):
    print (inv_map[i-3])
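
Why the `i-3` above? With its default settings, `imdb.load_data` reserves 0 for padding, 1 for the start of a review and 2 for out-of-vocabulary words, so every real word index is shifted by 3. Here is a small round-trip sketch on a toy vocabulary. `toy_word_index`, `encode` and `decode` are made-up names for illustration, not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(); the values are invented for this example.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

INDEX_FROM = 3               # offset applied by imdb.load_data by default
PAD, START, OOV = 0, 1, 2    # reserved indices (default settings)


def encode(text, word_index, num_words):
    """Turn a sentence into the integer format used by the dataset."""
    ids = [START]
    for word in text.lower().split():
        rank = word_index.get(word)
        if rank is None or rank + INDEX_FROM >= num_words:
            ids.append(OOV)                 # unknown or too rare
        else:
            ids.append(rank + INDEX_FROM)
    return ids


def decode(ids, word_index):
    """Turn a list of integers back into a readable string."""
    inverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    specials = {PAD: "<pad>", START: "<start>", OOV: "<oov>"}
    words = []
    for i in ids:
        if i in specials:
            words.append(specials[i])
        else:
            words.append(inverse.get(i, "<oov>"))
    return " ".join(words)


encoded = encode("The movie was waste", toy_word_index, num_words=10000)
decoded = decode(encoded, toy_word_index)
print(encoded)
print(decoded)
```

Note that "waste" is not in the toy vocabulary, so it comes back as `<oov>`.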

In [0]:
# Exercise: 80 is only a starting point. Maybe 100, maybe 1000, how do you find it? The max length of the reviews, or something shorter for faster training? Do you always need to compromise in life?
max_review_length = 80
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
# Exercise: what if we don't pad the test data?
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [0]:
print (X_train[0], len(X_train[0]))
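
What did `pad_sequences` actually do? By default it pads with zeros at the front and, for reviews longer than `maxlen`, drops tokens from the front as well. The sketch below mimics that default behaviour in plain NumPy. `pad_pre` is a made-up helper, not the Keras implementation, and the length list is invented data to show one possible way of picking `maxlen` from a percentile.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: left-pad and left-truncate."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue
        kept = seq[-maxlen:]                     # keep the last maxlen tokens
        out[row, maxlen - len(kept):] = kept     # zeros stay on the left
    return out


toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
padded = pad_pre(toy_reviews, maxlen=4)
print(padded)

# One possible heuristic: cover most reviews without padding to the longest one.
toy_lengths = [40, 60, 75, 90, 120, 150, 230, 400, 650, 1200]  # made-up values
suggested_maxlen = int(np.percentile(toy_lengths, 90))
print(suggested_maxlen)
```

The short review gets a leading zero, and the long one loses its first two tokens (including the start marker).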


![Conv1D](https://cdn-images-1.medium.com/max/1600/1*aBN2Ir7y2E-t2AbekOtEIw.png)


In [0]:
embedding_vector_length = 300
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
# There is a bad architecture choice here. Can you find what's wrong?
model.add(Convolution1D(64, 3, padding='same')) # Why is padding 'same'? https://keras.io/layers/convolutional/
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2)) # Should you increase this for more generalization?
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid')) # Can we use 2 here?
print (model.summary())
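
A hint for the padding question: the padding mode decides how long the sequence is after each convolution, and therefore how many values `Flatten` hands to the first `Dense` layer. The sketch below works this out by hand for a review length of 80 and three kernel-size-3 layers ending in 16 filters, as in the model above. `conv1d_output_length` is a made-up helper that follows the usual output-length formula for stride-1 style convolutions.

```python
import math


def conv1d_output_length(input_length, kernel_size, padding, stride=1):
    """Output length of a 1D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return math.ceil(input_length / stride)
    if padding == "valid":
        return math.ceil((input_length - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


length_same = length_valid = 80      # max_review_length used above
for _ in range(3):                   # three stacked Conv1D layers, kernel size 3
    length_same = conv1d_output_length(length_same, 3, "same")
    length_valid = conv1d_output_length(length_valid, 3, "valid")

last_filters = 16                    # filters in the last conv layer
flat_same = length_same * last_filters
flat_valid = length_valid * last_filters
print(length_same, flat_same)
print(length_valid, flat_valid)
```

With `'same'` the length stays at 80 through every layer; with `'valid'` each layer shaves off two positions.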

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
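
About the "Can we use 2 here?" question: one sigmoid unit and a two-unit softmax can express the same probabilities, so the choice mostly changes the loss you pair it with (with two softmax units you would typically switch to something like `sparse_categorical_crossentropy`). The NumPy sketch below checks the equivalence for one logit and computes binary cross-entropy by hand. All function names here are illustrative, not Keras internals.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(logits):
    shifted = np.exp(logits - np.max(logits))   # subtract max for stability
    return shifted / shifted.sum()


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example with true label y_true and predicted P(positive) = p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


z = 1.5                                          # an arbitrary logit
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([0.0, z]))[1]       # class 1 of a 2-way softmax
print(p_sigmoid, p_softmax)

loss_unsure = binary_crossentropy(1, 0.5)        # model has no idea
loss_confident = binary_crossentropy(1, 0.95)    # model is confident and right
print(loss_unsure, loss_confident)
```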

In [0]:
model.fit(X_train, y_train, epochs=2, verbose = 1, batch_size = 64)


In [0]:
loss_acc = model.evaluate(X_test, y_test, verbose=1)
print("Test data: loss = %0.6f  accuracy = %0.2f%% " % \
  (loss_acc[0], loss_acc[1]*100))

In [0]:
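# Save the trained model
import os
print("Saving model to disk \n")
os.makedirs("Models", exist_ok=True)  # make sure the target folder exists
mp = os.path.join("Models", "imdb_model.h5")
model.save(mp)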

In [0]:
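review = "the movie was waste"
print("New review: '%s'" % review)
words = review.split()
review = []
for word in words:
  # Unknown words and words outside the top_words vocabulary map to the OOV index 2
  if word not in d or d[word]+3 >= top_words:
    review.append(2)
  else:
    review.append(d[word]+3)  # same offset of 3 used by imdb.load_data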

In [0]:
review = [0]*(max_review_length-len(review)) + review

In [0]:
# Alternative to the manual padding above: sequence.pad_sequences([review], maxlen=max_review_length)
import numpy as np
review = np.array(review)
print (review.shape)
prediction = model.predict(review.reshape(1, max_review_length))
print("Prediction (0 = negative, 1 = positive) = ", end="")
print("%0.4f" % prediction[0][0])

Now let's try an LSTM-based model. **Could you change this to a BiLSTM model?**

LSTMs and their bidirectional variants are popular because the gates in their architecture learn what to keep and what to forget as they read a sequence. In earlier RNN architectures, vanishing gradients were a big problem and kept those nets from learning long-range dependencies.
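
To see what "gates" means in practice, here is a single LSTM time step written in NumPy. It is a sketch of the standard LSTM equations with random weights, not the Keras implementation. The gate ordering inside the stacked weight matrices and all variable names are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much memory to expose
    c = f * c_prev + i * g                # updated cell state (the "memory")
    h = o * np.tanh(c)                    # updated hidden state
    return h, c


rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The additive update `c = f * c_prev + i * g` is the part that helps gradients survive over many time steps.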

With a bidirectional LSTM, the network reads the original data twice: once from beginning to end and once from end to beginning.
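
The two-pass idea can be sketched without any deep learning library. Below, a toy tanh RNN (standing in for an LSTM to keep things short) reads the sequence in both directions with separate weights, and the two final states are concatenated. In Keras this role is played by the `Bidirectional` wrapper around a recurrent layer. The function names and weights here are made up for illustration.

```python
import numpy as np


def run_rnn(inputs, W, U):
    """Toy tanh RNN: return the final hidden state after reading all inputs."""
    h = np.zeros(U.shape[0])
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
    return h


def bidirectional(inputs, fwd_weights, bwd_weights):
    """Read the sequence in both directions and concatenate the final states."""
    h_fwd = run_rnn(inputs, *fwd_weights)          # beginning -> end
    h_bwd = run_rnn(inputs[::-1], *bwd_weights)    # end -> beginning
    return np.concatenate([h_fwd, h_bwd])


rng = np.random.default_rng(0)
steps, input_dim, units = 6, 3, 4
seq = rng.normal(size=(steps, input_dim))

# Each direction has its own independent weights.
fwd = (rng.normal(scale=0.5, size=(units, input_dim)),
       rng.normal(scale=0.5, size=(units, units)))
bwd = (rng.normal(scale=0.5, size=(units, input_dim)),
       rng.normal(scale=0.5, size=(units, units)))

out = bidirectional(seq, fwd, bwd)
print(out.shape)
```

The output is twice as wide as a single direction, which is why a bidirectional layer with 100 units per direction feeds 200 features to the next layer.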

In [0]:
import keras as K
model = K.models.Sequential()
model.add(K.layers.Embedding(input_dim=top_words,
  output_dim=embedding_vector_length, mask_zero=True))  # mask_zero: ignore the padding index 0
model.add(K.layers.LSTM(units=100, dropout=0.2, recurrent_dropout=0.2))  # 100 LSTM units (memory cells)
model.add(K.layers.Dense(units=1, activation='sigmoid'))
print (model.summary())

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [0]:
model.fit(X_train, y_train, epochs=2, verbose = 1, batch_size = 64)


In [0]:
loss_acc = model.evaluate(X_test, y_test, verbose=1)
print("Test data: loss = %0.6f  accuracy = %0.2f%% " % \
  (loss_acc[0], loss_acc[1]*100))