## Sentiment Analysis

We use the IMDb movie review dataset to classify film reviews as either **positive** or **negative**.

In **Part 1**, we train a gradient-boosted decision tree classifier on Bag-of-Words (BoW) features.

In **Part 2**, we use an RNN with a long short-term memory (LSTM) layer to make use of word order.

In [177]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.preprocessing as pr
from sklearn.ensemble import GradientBoostingClassifier
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

### Load the data

We use movie reviews from the IMDb database. Keras has a built-in [IMDb movie reviews dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) that we can use. 

In [146]:
from keras.datasets import imdb

# Set the vocabulary size, i.e. keep only the 5000 most frequent words
vocabulary_size = 5000

# Load in training and test data
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))

Loaded dataset with 25000 training samples, 25000 test samples


In [147]:
# Inspect a sample review and its label
example_review_idx = 10

print("--- Review ---")
print(X_train[example_review_idx])
print("--- Label ---")
print(y_train[example_review_idx])

--- Review ---
[1, 785, 189, 438, 47, 110, 142, 7, 6, 2, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 2, 883, 2, 23, 8, 4, 136, 2, 2, 4, 2, 43, 1076, 21, 1407, 419, 5, 2, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 2, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 2, 34, 4, 321, 487, 5, 116, 15, 2, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 2, 1412, 8, 1172, 18, 2, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 2, 2140, 2, 648, 1430, 2, 2, 5, 27, 3000, 1432, 2, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 2, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 2, 103, 2, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448, 23, 12, 17, 225, 38, 76, 4397, 18, 183, 8, 81, 19, 12, 

Note that the reviews are already tokenized, and each word is mapped to an integer ID. Labels are binary (0 for **negative**, 1 for **positive**).

In [148]:
# Map word IDs back to words
word2id = imdb.get_word_index() 
id2word = {i: word for word, i in word2id.items()}

print("--- Review (with words) ---")
print([id2word.get(i, " ") for i in X_train[example_review_idx]])
print("--- Label ---")
print(y_train[example_review_idx])

--- Review (with words) ---
['the', 'clear', 'fact', 'entertaining', 'there', 'life', 'back', 'br', 'is', 'and', 'show', 'of', 'performance', 'stars', 'br', 'actors', 'film', 'him', 'many', 'should', 'movie', 'reasons', 'to', 'and', 'reading', 'and', 'are', 'in', 'of', 'scenes', 'and', 'and', 'of', 'and', 'out', 'compared', 'not', 'boss', 'yes', 'to', 'and', 'show', 'its', 'disappointed', 'fact', 'raw', 'to', 'it', 'justice', 'by', 'br', 'of', 'where', 'clear', 'fact', 'many', 'your', 'way', 'and', 'with', 'city', 'nice', 'are', 'is', 'along', 'wrong', 'not', 'as', 'it', 'way', 'she', 'but', 'this', 'anything', 'up', "haven't", 'been', 'by', 'who', 'of', 'choices', 'br', 'of', 'you', 'to', 'as', 'this', "i'd", 'it', 'and', 'who', 'of', 'shot', "you'll", 'to', 'love', 'for', 'and', 'of', 'you', 'it', 'is', 'sequels', 'of', 'little', 'quest', 'are', 'seen', 'watched', 'front', 'chemistry', 'to', 'simply', 'alive', 'of', 'chris', 'being', 'it', 'is', 'say', 'easy', 'and', 'cry', 'in', 'ch
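
One caveat when decoding: by default, `imdb.load_data` reserves index 0 for padding, 1 for the start of a review, and 2 for unknown words, and shifts every word index by `index_from=3`, whereas `get_word_index()` returns the unshifted ranks. A minimal sketch of offset-aware decoding follows. The tiny `toy_word_index`, the placeholder strings, and the helper names are hypothetical stand-ins, not part of the Keras API:

```python
# Hypothetical stand-in for imdb.get_word_index() (word -> frequency rank)
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

INDEX_FROM = 3  # default value of index_from in imdb.load_data
SPECIAL_TOKENS = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}  # placeholder names (assumed)


def build_id2word(word_index, index_from=INDEX_FROM):
    """Invert the word index, shifting ranks by index_from."""
    id2word = {rank + index_from: word for word, rank in word_index.items()}
    id2word.update(SPECIAL_TOKENS)
    return id2word


def decode_review(encoded, id2word):
    """Turn a list of integer IDs back into a readable string."""
    return " ".join(id2word.get(i, "<UNK>") for i in encoded)


toy_id2word = build_id2word(toy_word_index)
decoded = decode_review([1, 4, 7, 2, 8], toy_id2word)
print(decoded)
```

With the real dataset, the same shift would be applied to the dictionary returned by `imdb.get_word_index()`.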

#### Pad sequences

Reviews differ in length, so we pad (or truncate) each one to a fixed length of `max_words` tokens. This is mostly needed for Part 2, where each review is fed to the RNN word by word.

In [163]:
# Set the maximum number of words per document
max_words = 500

# Pad (or truncate) sequences in X_train and X_test to max_words tokens
X_train = sequence.pad_sequences(
    X_train, maxlen=max_words, dtype="int32", padding="pre", truncating="pre", value=0)

X_test = sequence.pad_sequences(
    X_test, maxlen=max_words, dtype="int32", padding="pre", truncating="pre", value=0)
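
To make the effect of `padding="pre"` and `truncating="pre"` concrete, here is a rough NumPy sketch of the same idea. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes `maxlen > 0`:

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # "pre" truncation drops tokens from the start
        if trimmed:
            out[row, -len(trimmed):] = trimmed  # "pre" padding leaves zeros on the left
    return out


padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```

The short review gets a leading zero, while the long one loses its first two tokens.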

### Part 1 - Classify Bag of Words using gradient-boosted decision trees


#### Extract BoW features

Note that the 'words' are encoded as integers.

In [149]:
def extract_BoW_features(words_train, words_test, vocabulary_size=5000):
    """Extract Bag-of-Words for a given set of documents"""
    
    # Fit a vectorizer to training documents and use it to transform them
    vectorizer = CountVectorizer(max_features=vocabulary_size,
            preprocessor=lambda x: x, tokenizer=lambda x: x) 
    features_train = vectorizer.fit_transform(words_train).toarray()

    # Apply the same vectorizer to transform the test documents (ignore unknown words)
    features_test = vectorizer.transform(words_test).toarray()

    vocabulary = vectorizer.vocabulary_
    
    return features_train, features_test, vocabulary

In [164]:
# Extract Bag of Words features for both training and test datasets
features_train, features_test, vocabulary = extract_BoW_features(X_train, X_test)
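
For intuition, a Bag-of-Words matrix over integer-encoded documents can be sketched with plain NumPy: each row counts how often each word ID occurs in one document. This is only an illustration under simplifying assumptions. `bow_counts` is a hypothetical helper, and its columns are indexed directly by word ID, which need not match the column order chosen by `CountVectorizer`:

```python
import numpy as np


def bow_counts(encoded_docs, vocabulary_size):
    """Count occurrences of each word ID per document (IDs >= vocabulary_size are dropped)."""
    features = np.zeros((len(encoded_docs), vocabulary_size), dtype=np.int64)
    for row, doc in enumerate(encoded_docs):
        ids = np.asarray(doc, dtype=np.int64)
        features[row] = np.bincount(ids, minlength=vocabulary_size)[:vocabulary_size]
    return features


toy_docs = [[1, 2, 2, 4], [0, 0, 3]]
toy_features = bow_counts(toy_docs, vocabulary_size=5)
print(toy_features)
```

Word order is lost at this point: only the counts per word ID remain.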

#### Normalize BoW features in training and test set

In [167]:
features_train = pr.normalize(features_train, axis=1)
features_test = pr.normalize(features_test, axis=1)
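
`normalize` with `axis=1` rescales each row (document) independently; its default norm is L2, so every non-empty document ends up with unit length. A small NumPy sketch of that operation, where `l2_normalize_rows` is a hypothetical helper for illustration:

```python
import numpy as np


def l2_normalize_rows(X):
    """Divide each row by its Euclidean norm, leaving all-zero rows unchanged."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    return X / norms


toy_normalized = l2_normalize_rows([[3, 4], [0, 0]])
print(toy_normalized)
```

This makes long and short reviews comparable, since raw counts grow with document length.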

#### Classification using a Gradient-Boosted Decision Tree classifier

Tree-based algorithms are often used on BoW features to leverage their sparsity. However, they may require some hyperparameter tuning.

In [168]:
# Simple hold-out split: a more basic alternative to cross-validation
def split_train_validation(X_train, y_train, fraction_valid=0.2):
    
    idx_split = int(len(y_train) * fraction_valid)
    X_valid = X_train[:idx_split]
    y_valid = y_train[:idx_split]
    
    X_train = X_train[idx_split:]
    y_train = y_train[idx_split:]
    
    return X_train, y_train, X_valid, y_valid
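
Taking the first 20% of the samples as validation data works when the samples are already in random order. If that is in doubt, a shuffled variant may be safer. The sketch below assumes NumPy arrays; `shuffled_split` and its `seed` parameter are hypothetical names, not part of the notebook:

```python
import numpy as np


def shuffled_split(X, y, fraction_valid=0.2, seed=0):
    """Randomly split (X, y) into training and validation parts, keeping rows aligned."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))  # random ordering of the sample indices
    n_valid = int(len(y) * fraction_valid)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]


# Toy data: row i of X_toy is [2i, 2i + 1] and its label is i % 2
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
X_tr, y_tr, X_va, y_va = shuffled_split(X_toy, y_toy)
print(X_tr.shape, X_va.shape)
```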

In [153]:
def classify_gboost(X_train, X_valid, X_test, y_train, y_valid, y_test):  

    best_model = None
    best_score_valid = 0
    
    for n_estimators in [16, 32, 64]:
        for learning_rate in [0.6, 0.8, 1]: 

            clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate, 
                                             max_depth=1, random_state=0)
            clf.fit(X_train, y_train)
            model_score = clf.score(X_valid, y_valid)
            # Keep the model with the highest validation accuracy so far
            if model_score > best_score_valid:
                best_score_valid = model_score
                best_model = clf
    return best_model
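
The search pattern used here (try every combination, keep the one with the best validation score) can be written generically. The following is a minimal sketch, not the notebook's code: `grid_search` and `toy_score` are hypothetical, and `toy_score` is a made-up function standing in for "fit a model and return its validation accuracy":

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Evaluate every parameter combination and return the best one with its score."""
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        if score > best_score:  # remember the best score, not just the latest one
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(n_estimators, learning_rate):
    # Made-up score that peaks at n_estimators=32, learning_rate=0.8
    return 1.0 - abs(n_estimators - 32) / 100 - abs(learning_rate - 0.8)


grid = {"n_estimators": [16, 32, 64], "learning_rate": [0.6, 0.8, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The key detail is updating the running best score inside the `if`; otherwise every candidate beats the initial value and the last model tried always wins.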

In [169]:
# create validation set to test model performance 
features_train, labels_train, features_valid, labels_valid = split_train_validation(features_train, y_train)

In [None]:
# find best model by optimizing hyperparameters
best_model = classify_gboost(features_train, features_valid, features_test, labels_train, labels_valid, y_test)

In [183]:
# Parameters of the best model
best_model.get_params

<bound method BaseEstimator.get_params of GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=1, loss='deviance', max_depth=1,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=64,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=0, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)>

In [175]:
# check performance on test set
print("[{}] Accuracy best model: train = {}, validation = {}, test = {}".format(
    best_model.__class__.__name__,
    best_model.score(features_train, labels_train),
    best_model.score(features_valid, labels_valid),
    best_model.score(features_test, y_test)))

[GradientBoostingClassifier] Accuracy best model: train = 0.822, validation = 0.8124, test = 0.81588


The tree-based model already performs reasonably well, and with the selected hyperparameters the train, validation, and test accuracies are close (0.822, 0.812, and 0.816), which suggests little overfitting.

In [180]:
y_pred = best_model.predict(features_test)

In [181]:
# check one example in test set
print("--- Review ---")
print([id2word.get(i, " ") for i in X_test[example_review_idx]])
print("--- Predicted Label ---")
print(y_pred[example_review_idx])
print("--- True Label ---")
print(y_test[example_review_idx])

--- Review ---
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',

### Part 2 - Using RNNs 

The main difference from the BoW approach is that we use the sequence of words in each review, rather than only looking at how often given words occur in positive and negative reviews.

We use a Long Short-Term Memory (**LSTM**) layer to exploit the order of words in the reviews, which can be crucial. Compare, for example, **"The movie was better than expected"** with **"I expected the movie to be better"**.
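
A quick way to see why word order matters: two sentences built from the same words have identical Bag-of-Words counts, even when their meanings are opposite. The sentences below are hypothetical examples chosen for illustration:

```python
from collections import Counter


def bag_of_words(sentence):
    """Count word occurrences, ignoring order and case."""
    return Counter(sentence.lower().split())


review_a = "the movie was good not bad"
review_b = "the movie was bad not good"

same_bag = bag_of_words(review_a) == bag_of_words(review_b)
print(same_bag)
```

A BoW classifier receives exactly the same input for both reviews, whereas a sequence model sees them as different.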
  

#### Create an RNN architecture

We define our network using word embeddings and an LSTM layer.

In [67]:
embedding_size = 32

model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
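
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; `lstm_units` is a name introduced here for the value 100 passed to `LSTM`:

```python
vocabulary_size = 5000
embedding_size = 32
lstm_units = 100  # name introduced here for the LSTM(100) layer size

# Embedding: one vector of size embedding_size per vocabulary entry
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding table, so the vocabulary size has a large effect on model size.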


In [68]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



#### Train the model 

In [69]:
# Specify training parameters: batch size and number of epochs
batch_size = 64  # number of samples shown to the network before each parameter update
num_epochs = 3   # number of passes over the whole training set

# Split into training and validation sets
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]  # first batch_size samples
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]  # rest for training

model.fit(X_train2, y_train2,
          validation_data=(X_valid, y_valid),
          batch_size=batch_size, epochs=num_epochs)



Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x7f8bf86416d0>
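
As a rough sanity check on the training setup, the number of parameter updates follows from the training set size and the batch size. This sketch assumes the final, smaller batch of each epoch is still used for an update:

```python
import math

n_train = 24936   # training samples reported in the log above
batch_size = 64
num_epochs = 3

# Assumption: the last, incomplete batch of each epoch also triggers an update
updates_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * num_epochs
print(updates_per_epoch, last_batch_size, total_updates)
```

A smaller batch size gives more, noisier updates per epoch; a larger one gives fewer, smoother updates.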

In [176]:
# Evaluate the model on the test set
scores = model.evaluate(X_test, y_test, verbose=0) 

print("Test accuracy:", scores[1]) # scores[0] is the loss, scores[1] the first metric (accuracy, see model.compile)

Test accuracy: 0.8710399866104126
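
The network outputs a sigmoid probability per review, so computing accuracy requires turning probabilities into labels first. A minimal NumPy sketch, assuming the usual 0.5 threshold; `binary_accuracy` here is a hypothetical helper and the probabilities are made up:

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the true label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


y_true_toy = [1, 0, 1, 1, 0]
y_prob_toy = [0.9, 0.2, 0.4, 0.7, 0.6]  # made-up sigmoid outputs
toy_accuracy = binary_accuracy(y_true_toy, y_prob_toy)
print(toy_accuracy)
```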


The RNN model outperforms the gradient-boosted trees (test accuracy of about 0.87 versus 0.82): as expected, making use of word order seems to improve the classification results.

Results could be further improved by tuning the batch size and the number of epochs.

In [182]:
# Optionally save the model so that it can be loaded later
model_file = "rnn_model.h5"  # HDF5 file
model.save(model_file)

# Load model using keras.models.load_model()
#from keras.models import load_model
#model = load_model(model_file)

Disclaimer: This notebook is inspired by a project that is part of the Natural Language Processing Udacity Nanodegree.