# Mini Project: Using Keras to analyze IMDB Movie Data

In this project, we will analyze a dataset from IMDB and use it to predict the sentiment of a review.

## Workspace

To open this notebook, you may clone the repo https://github.com/udacity/deep-learning.git from GitHub and open the notebook IMDB_in_Keras.ipynb in the imdb_keras folder.

## Instructions

In this lab, we will preprocess the data for you, and you'll be in charge of building and training the model in Keras.

### The dataset

This lab uses a dataset of 25,000 [IMDB](http://www.imdb.com/) reviews. Each review comes with a label: 0 for a negative review and 1 for a positive review. The goal of this lab is to create a model that predicts the sentiment of a review based on the words in it. You can find more information about this dataset on the [Keras Datasets](https://keras.io/datasets/) website.

For convenience, the input comes already preprocessed. Each review is encoded as a sequence of indexes, corresponding to the words in the review. The words are ordered by frequency, so the integer 1 corresponds to the most frequent word ("the"), the integer 2 to the second most frequent word, etc. By convention, the integer 0 corresponds to unknown words. Note that the `load_data` call used in this notebook sets `oov_char=2`, so unknown words appear as 2 in the loaded data.

Then, the sentence is turned into a vector by simply concatenating these integers. For instance, if the sentence is "To be or not to be." and the indices of the words are as follows:

* "to": 5
* "be": 8
* "or": 21
* "not": 3

Then the sentence gets encoded as the vector [5,8,21,3,5,8].
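
The encoding above can be sketched in a few lines of plain Python. The `word_index` dictionary and the `encode` helper are hypothetical names used only for illustration; the real dataset ships with its own word index.

```python
# Hypothetical word-to-index mapping, taken from the example above.
word_index = {"to": 5, "be": 8, "or": 21, "not": 3}

def encode(sentence, word_index):
    """Lowercase the sentence, drop periods, and map each word to its index."""
    words = sentence.lower().replace(".", "").split()
    return [word_index[word] for word in words]

encoded = encode("To be or not to be.", word_index)
print(encoded)  # [5, 8, 21, 3, 5, 8]
```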

In [4]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

### Loading the data

The data comes preloaded in Keras, which means we don't need to open or read any files manually. The following command loads it and splits it into training and testing sets, each consisting of the encoded reviews and their labels:

In [16]:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                     num_words=1000,
                                                     skip_top=0,
                                                     maxlen=None,
                                                     seed=113,
                                                     start_char=1,
                                                     oov_char=2,
                                                     index_from=3)

All of these arguments are documented [here](https://keras.io/datasets). In a nutshell, the most important ones are:

* `num_words`: The number of most frequent words to keep. This is useful if you don't want to consider very obscure words such as "Ultracrepidarian".
* `skip_top`: The number of top words to ignore. This is useful if you don't want to consider the most common words. For example, the word "the" would add no information to the review, so we can skip it by setting `skip_top` to 2 or higher.
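
As a rough sketch of what these two arguments do, the snippet below replaces any index outside the range from `skip_top` (inclusive) to `num_words` (exclusive) with `oov_char`. The `filter_indices` helper and the sample sequence are made up for illustration, and the logic is a simplified assumption about the behavior, not the Keras implementation itself.

```python
def filter_indices(sequence, num_words, skip_top=0, oov_char=2):
    """Keep indices with skip_top <= index < num_words; replace the rest with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in sequence]

# Made-up sequence containing two indices above the num_words cutoff.
review = [1, 14, 22, 5000, 43, 530, 973, 1200]
filtered = filter_indices(review, num_words=1000)
print(filtered)  # [1, 14, 22, 2, 43, 530, 973, 2]
```

Under this assumption, the many 2s in the reviews printed in this notebook are words that fell outside the 1000 most frequent ones.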

In [17]:
print(x_train.shape)
print(x_train[0])
print(x_test.shape)
print(x_test[0])

(25000,)
[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
(25000,)
[1, 89, 27, 2, 2, 17, 199, 132, 5, 2, 16, 2, 24, 8, 760, 4, 2, 7, 4, 22, 2, 2, 16, 2, 17, 2, 7, 2, 2, 9, 4, 2, 

In [18]:
print(y_train.shape)
print(y_train[0])

(25000,)
1


### Pre-processing the data (One-hot encoding)

We first prepare the data by one-hot encoding it into (0,1)-vectors as follows: If, for example, we have 10 words in our vocabulary, and the vector is (4,1,8), we'll turn it into the vector (1,0,0,1,0,0,0,1,0,0).
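
A minimal sketch of this encoding in plain Python follows. The `multi_hot` helper is a hypothetical name, and it counts words from 1 to match the example above. Note that `sequences_to_matrix` appears to use the index itself as the column (column 0 is index 0), so its output is shifted by one position relative to this sketch.

```python
def multi_hot(sequence, vocab_size):
    """Return a 0/1 vector whose k-th entry (counting from 1) is 1 if word k occurs."""
    vector = [0] * vocab_size
    for index in sequence:
        vector[index - 1] = 1  # shift by one because the example counts words from 1
    return vector

print(multi_hot([4, 1, 8], vocab_size=10))  # [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Repeated words and word order are lost in this representation: only the presence or absence of each word is kept.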

In [22]:
# One-hot encoding the input into vector mode, each of length 1000
tokenizer = Tokenizer(num_words=1000)
x_train_e = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test_e = tokenizer.sequences_to_matrix(x_test, mode='binary')

In [23]:
print(x_train_e.shape)
print(x_train_e[0:50])

(25000, 1000)
[[ 0.  1.  1. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  0.  0.  0.]
 ..., 
 [ 0.  1.  1. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  0.  0.  0.]]


And we'll also one-hot encode the output.

In [24]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
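
To see what this step does to the labels, here is a small plain-Python sketch of the same idea. The `to_one_hot` helper is a hypothetical stand-in written for illustration, not the Keras function.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0.0, 1.0] for two classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

print(to_one_hot([0, 1, 1], num_classes=2))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

A negative review (label 0) becomes `[1.0, 0.0]` and a positive review (label 1) becomes `[0.0, 1.0]`, which is why the label arrays above have two columns.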


### Building the model

Now it's your turn to use all you've learned! You can build a neural network using Keras, train it, and evaluate it! Make sure you also use methods such as dropout or regularization, and good Keras optimizers to do this. A good accuracy to aim for is 85%. Can your model achieve this?
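
As a reminder of what dropout does during training, here is a minimal sketch in plain Python. It assumes the common "inverted dropout" formulation, where each value is zeroed with probability `rate` and the surviving values are scaled by `1 / (1 - rate)`. The `dropout` helper is a hypothetical name for illustration.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0, 5.0]
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

The scaling keeps the expected value of each activation unchanged, so no extra correction is needed at evaluation time, when nothing is dropped.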

In [33]:
# Build the model architecture
model = Sequential()

model.add(Dense(256, activation='relu', input_dim=x_train_e.shape[1]))
# rate: float between 0 and 1. Fraction of the input units to drop.
model.add(Dropout(rate=0.2))
model.add(Dense(126, activation='relu'))
model.add(Dropout(rate=0.2))
model.add(Dense(num_classes))

# Add a softmax activation layer
model.add(Activation('softmax'))

# Compile the model using a loss function and an optimizer.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
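
To make the last layer and the loss more concrete, the following plain-Python sketch computes a softmax over two raw scores and the categorical cross-entropy against a one-hot label. The helper names are hypothetical and the code is an illustration of the formulas, not what Keras runs internally.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

probs = softmax([0.0, 0.0])
print(probs)  # [0.5, 0.5]
loss = categorical_crossentropy([0.0, 1.0], probs)
print(loss)   # equals log(2): the loss of an undecided two-class prediction
```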

### Training the model

Run the model here. Experiment with different values of `batch_size` and different numbers of epochs!
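
When choosing a batch size, it can help to estimate how many weight updates one epoch involves. The sketch below assumes one update per batch, with a smaller final batch when the sizes don't divide evenly. The `steps_per_epoch` helper is a hypothetical name.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

# 25000 is the size of the training set used in this notebook.
for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(25000, batch_size))
```

Smaller batches mean more, noisier updates per epoch, while larger batches mean fewer, smoother ones.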

In [34]:
# Run the model. Feel free to experiment with different batch sizes and number of epochs.
# model.fit(x_train_e, y_train,
#           batch_size=32,
#           epochs=10,
#           validation_data=(x_test_e, y_test),
#           verbose=2)

model.fit(x_train_e, y_train, batch_size=32, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x127204b38>

### Evaluating the model
This will give you the accuracy of the model, as evaluated on the testing set. Can you get something over 85%?

In [35]:
score = model.evaluate(x_test_e, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.84336
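
For intuition about the reported number, here is a small plain-Python sketch of how accuracy can be computed from one-hot labels and predicted probabilities. The `accuracy` helper and the toy data are made up for illustration.

```python
def accuracy(y_true, y_prob):
    """Fraction of rows where the highest-probability class matches the one-hot label."""
    correct = 0
    for target, probs in zip(y_true, y_prob):
        predicted = probs.index(max(probs))  # class with the highest predicted probability
        actual = target.index(max(target))   # position of the 1 in the one-hot label
        correct += predicted == actual
    return correct / len(y_true)

# Toy data: four reviews, the third one is misclassified.
y_true = [[0, 1], [1, 0], [0, 1], [1, 0]]
y_prob = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
print(accuracy(y_true, y_prob))  # 0.75
```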
