# Deep Learning with Keras IV (imdb, mlp)
- Part I: [Deep Learning w Keras I (Intro)](https://github.com/tm1611/Deep-Learning/blob/master/Deep%20Learning%20w%20Keras%20I%20(intro).ipynb)
- Part II: [Deep Learning w Keras II (mnist data, mlp)](https://github.com/tm1611/Deep-Learning/blob/master/Deep%20Learning%20w%20Keras%20II%20(mnist%20data%2C%20mlp).ipynb)
- Part III: [Deep Learning w Keras III (student admissions, mlp)](https://github.com/tm1611/Deep-Learning/blob/master/Deep%20Learning%20w%20Keras%20III%20(student%20admissions%2C%20mlp)%20.ipynb)
 
## 1. Introduction
Overview:  
- Data: IMDB movie reviews sentiment classification
- Methodology: Artificial neural network
 - Type: Multilayer perceptron

This notebook shows how to fit a simple neural network (multilayer perceptron) to IMDB movie review data. The outcome variable is whether a review is positive or negative. Our objective is to fit a model that classifies reviews as positive or negative based on their text content.

The model reaches roughly 85% test accuracy after 10 epochs (0.8539 in the run recorded in this notebook), leaving room for further improvement (feature engineering, hyperparameter tuning, etc.).

In [1]:
# Standard imports
import numpy as np

# keras imports 
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.optimizers import RMSprop
from keras import utils

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

# set seed
np.random.seed(42)

Using TensorFlow backend.


## 2. The data
We will import the [dataset](https://keras.io/datasets/) using the Keras API. It consists of 25,000 training and 25,000 test movie reviews from IMDB, each already encoded as a sequence of integer word indices. The outcome variable we are trying to model is whether a review is positive or negative. Words are indexed by overall frequency in the dataset, so a lower index means a more frequent word. With the default arguments of `imdb.load_data`, index 1 marks the start of a review, index 2 replaces out-of-vocabulary words, and the actual word ranks are shifted by 3.

### Preprocessing 
1. Convert the word indices in x to a matrix of dummy variables.
 - Initialize Keras' [Tokenizer](https://keras.io/preprocessing/text/) class. The reviews are already [tokenized](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization), so no raw text is parsed here.
 - Convert to a matrix using [sequences_to_matrix](https://keras.rstudio.com/reference/sequences_to_matrix.html).
2. Convert y to categorical using [to_categorical](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

Note that the data has already been preprocessed so that each word is encoded by its frequency rank in the dataset: the lower the integer, the more frequently the word occurs.

We still have to prepare the data for our model. First, we need to convert each encoded review into a fixed-length numeric vector. We use a binary bag-of-words encoding (often loosely called one-hot encoding): each of the 1000 most frequent word indices becomes a dummy variable that is 1 if the word occurs in the review and 0 otherwise, yielding a feature matrix with 1000 columns. Second, we convert y into a binary matrix representation with one column per class, which is then used as the outcome matrix to fit the model.

In [2]:
# Load data with only the 1000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000,
                                                      seed=42)

print("Shape x_train:", x_train.shape)
print("Shape x_test:", x_test.shape)
print("Shape y_train:", y_train.shape)
print("Shape y_test:", y_test.shape)

# Show characteristics of example review
print("Length of example review:", len(x_train[16]))
print("First five words in example: ",x_train[16][:5])

# y_train 
print("First five observations for y_train: ",y_train[:5])

Shape x_train: (25000,)
Shape x_test: (25000,)
Shape y_train: (25000,)
Shape y_test: (25000,)
Length of example review: 272
First five words in example:  [1, 14, 9, 24, 6]
First five observations for y_train:  [1 0 1 0 0]
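
To make the integer encoding more tangible, here is a minimal sketch of how an encoded review could be mapped back to words. It assumes the default `imdb.load_data` conventions (index 0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and word ranks shifted by 3). The `toy_word_index` dictionary and the `decode_review` helper are hypothetical stand-ins for illustration; in practice the word-to-rank mapping would come from `imdb.get_word_index()`.

```python
# Offset applied to word ranks by imdb.load_data (assumed default: index_from=3)
INDEX_FROM = 3
# Reserved indices under the assumed defaults
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}

def decode_review(encoded, word_index):
    """Map a list of integer indices back to a readable string."""
    # Invert the word -> rank mapping and apply the offset
    index_to_word = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    words = []
    for idx in encoded:
        if idx in SPECIAL_TOKENS:
            words.append(SPECIAL_TOKENS[idx])
        else:
            # Anything not in the mapping is shown as out-of-vocabulary
            words.append(index_to_word.get(idx, "<oov>"))
    return " ".join(words)

# Hypothetical word -> rank mapping (illustrative values only)
toy_word_index = {"the": 1, "a": 3, "is": 6, "this": 11}

decoded = decode_review([1, 14, 9, 2, 6], toy_word_index)
print(decoded)
```

With the real word index, the same helper could be applied to `x_train[16]` before the matrix conversion to inspect the example review.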


In [3]:
# Initialize tokenizer
tokenizer = Tokenizer(num_words = 1000)

# One-hot encode x
x_train = tokenizer.sequences_to_matrix(x_train, mode="binary")
x_test = tokenizer.sequences_to_matrix(x_test, mode="binary")

# One-hot encode y
y_train = keras.utils.to_categorical(y_train, num_classes = 2)
y_test = keras.utils.to_categorical(y_test, num_classes = 2)
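
The two encoding steps above can be illustrated on a toy example with plain NumPy. This is only a sketch of what `sequences_to_matrix(mode="binary")` and `to_categorical` are understood to do, not the Keras implementation; the helper names `binary_bag_of_words` and `one_hot_labels` are made up for this illustration.

```python
import numpy as np

def binary_bag_of_words(sequences, num_words):
    """Return a (n_reviews, num_words) matrix with 1 where an index occurs."""
    matrix = np.zeros((len(sequences), num_words), dtype="float32")
    for row, seq in enumerate(sequences):
        for idx in seq:
            # Indices outside the vocabulary are skipped
            if idx < num_words:
                matrix[row, idx] = 1.0
    return matrix

def one_hot_labels(labels, num_classes):
    """Turn integer class labels into one row per label, one column per class."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

# Two toy "reviews" over a vocabulary of 5 indices
toy_reviews = [[1, 4, 4, 2], [1, 3]]
toy_x = binary_bag_of_words(toy_reviews, num_words=5)
toy_y = one_hot_labels([1, 0], num_classes=2)

print(toy_x)
print(toy_y)
```

Note that repeated indices (the two 4s in the first toy review) still produce a single 1: the binary mode records presence, not counts.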

In [4]:
# Check x,y after processing
print("Shape of x_train:", x_train.shape)
print("Head of first text in x_train:", x_train[0][:5])
print("Shape of y_train:",y_train.shape)
print("Head of y_train:\n", y_train[:5])

Shape of x_train: (25000, 1000)
Head of first text in x_train: [0. 1. 1. 0. 1.]
Shape of y_train: (25000, 2)
Head of y_train:
 [[0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]


## 3. The Model
Now we can build a neural network and fit it to the data.

In [5]:
# time
from time import time
start_time = time()

# Model
model = Sequential()

# Input layer
model.add(Dense(128, activation = "relu", input_shape = (1000,)))
model.add(Dropout(0.5))
# Hidden layer
model.add(Dense(128, activation = "relu"))
model.add(Dropout(0.5))
# Output layer
model.add(Dense(2, activation = "sigmoid"))

# Compile
model.compile(loss="binary_crossentropy",
              optimizer = "rmsprop",
              metrics = ["accuracy"])

# Fit the model
model.fit(x_train, y_train, batch_size = 64,
          epochs = 10,
          verbose = 0,
          validation_data = (x_test, y_test),
          shuffle = False)

end_time = time()
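
As a quick check on the size of the network defined above, the number of trainable parameters can be worked out by hand: each `Dense` layer has inputs × units weights plus one bias per unit, and `Dropout` layers add none. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers in the model above
layer_sizes = [(1000, 128), (128, 128), (128, 2)]

per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total_params = sum(per_layer)

print("Parameters per layer:", per_layer)
print("Total parameters:", total_params)
```

If the arithmetic is right, the total should agree with the figure reported by `model.summary()`.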

### Results

In [6]:
# Running time
total_time = end_time - start_time
print("Total running time:", round(total_time, 4))

# Training set: evaluate returns [loss, accuracy]
train_score = model.evaluate(x_train, y_train, verbose = 0)
print("Train loss:", round(train_score[0], 4))
print("Train accuracy:", round(train_score[1], 4))

# Test set
test_score = model.evaluate(x_test, y_test, verbose = 0)
print("Test loss:", round(test_score[0], 4))
print("Test accuracy:", round(test_score[1], 4))

Total running time: 27.1824
Train loss: 0.1495
Train accuracy: 0.9575
Test loss: 0.4647
Test accuracy: 0.8539
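
One caveat when reading these accuracy numbers: the model has two sigmoid outputs trained with `binary_crossentropy`. Assuming Keras resolves `"accuracy"` to an element-wise binary accuracy for this loss (an assumption worth checking against the installed version), each output is thresholded at 0.5 separately, which is not the same as picking the most probable class per review. The NumPy sketch below contrasts the two definitions on made-up predictions; the function names are hypothetical.

```python
import numpy as np

def elementwise_binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual outputs on the correct side of the threshold."""
    return float(np.mean((y_pred > threshold) == (y_true > threshold)))

def argmax_accuracy(y_true, y_pred):
    """Share of rows where the most probable class is the true class."""
    return float(np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)))

# Made-up one-hot labels and sigmoid outputs for three reviews
toy_true = np.array([[1, 0], [0, 1], [0, 1]], dtype="float32")
toy_pred = np.array([[0.9, 0.2], [0.6, 0.7], [0.3, 0.4]], dtype="float32")

binary_acc = elementwise_binary_accuracy(toy_true, toy_pred)
class_acc = argmax_accuracy(toy_true, toy_pred)

print("Element-wise binary accuracy:", round(binary_acc, 4))
print("Argmax (per-review) accuracy:", round(class_acc, 4))
```

For a per-review figure on the real data, the second function could be applied to `model.predict(x_test)` and `y_test`.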
