# Sentiment Analysis on Movie Reviews

In this notebook, we perform sentiment analysis on movie reviews.

---


## LSTM Sentiment Analysis

In [51]:
import pandas as pd
import numpy as np
import re
import os
from IPython.display import HTML

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

from tensorflow.python.keras.models import Sequential, load_model
from tensorflow.python.keras.layers import Dense, Dropout, Embedding, LSTM, SpatialDropout1D
from tensorflow.python.keras import optimizers

from multiplicative_lstm import MultiplicativeLSTM

import nltk
nltk.download("words")
nltk.download("wordnet")
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import words
from nltk.corpus import wordnet 
allEnglishWords = words.words() + [w for w in wordnet.words()]
allEnglishWords = np.unique([x.lower() for x in allEnglishWords])

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---

## Legacy Sentiment Analysis Data Import
First, we need to import the data.

In [2]:
def parse_reviews(base_path, folder):
    file_names = [x for x in os.listdir(base_path+folder) if x.endswith(".txt")]
    reviews = []
    for file_name in file_names:
        ID, rating = file_name[:-4].split("_") # Remove .txt and split filename
        
        if int(rating) > 6 or int(rating) < 4:
            label = 1 if int(rating) > 6 else 0
            with open(base_path+folder+file_name, encoding="latin1") as f:
                reviews.append({
                    "label": label,
                    "review": f.read(),
                    "file": file_name
                })
            
    return reviews
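
The labelling rule inside `parse_reviews` can be checked in isolation. The sketch below applies the same rule to a few file names. `label_from_filename` is a hypothetical helper that is not part of the notebook, and `42_5.txt` is a made-up name used only to show a skipped rating.

```python
def label_from_filename(file_name):
    """Return 1 (positive), 0 (negative) or None (skipped) for an '<id>_<rating>.txt' name."""
    review_id, rating = file_name[:-4].split("_")  # drop ".txt", then split id and rating
    rating = int(rating)
    if rating > 6:
        return 1
    if rating < 4:
        return 0
    return None  # ratings 4 to 6 are left out, exactly as in parse_reviews


for name in ["1670_8.txt", "5183_1.txt", "42_5.txt"]:
    print(name, label_from_filename(name))
```

As far as we know, the `neg` folders of this dataset only hold ratings of 4 or lower, so the practical effect of the rule is that reviews rated exactly 4 are dropped.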

In [3]:
base_path = "./aclImdb/"

trainReviews = parse_reviews(base_path, "train/pos/")
trainReviews += parse_reviews(base_path, "train/neg/")

testReviews = parse_reviews(base_path, "test/pos/")
testReviews += parse_reviews(base_path, "test/neg/")

In [4]:
train_df = pd.DataFrame(trainReviews)
test_df = pd.DataFrame(testReviews)

display(train_df.head())

| | file | label | review |
|---|---|---|---|
| 0 | 1670_8.txt | 1 | Channel 4 is a channel that allows more naught... |
| 1 | 6841_7.txt | 1 | This is probably one of Brian De Palma's best ... |
| 2 | 2584_7.txt | 1 | The film starts out very slowly, with the life... |
| 3 | 6835_10.txt | 1 | It was once suggested by Pauline Kael, never a... |
| 4 | 660_9.txt | 1 | 9/10- 30 minutes of pure holiday terror. Okay,... |


In [6]:
HTML(train_df.review.iloc[1])

---

## Data Preprocessing
The next step is data preprocessing. The following class follows the `fit`/`transform` interface of a typical scikit-learn vectorizer.

It can perform the following operations:
* Strip HTML tags and discard non-alphanumeric characters.
* Convert everything to lower case.
* Stem all words with `PorterStemmer`, then replace each stem with the most frequent training word that shares it.
* Discard non-English words (off by default; the corresponding code below is commented out, so this option currently has no effect).
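
To make the first two operations concrete, here is a minimal standalone sketch of the cleaning step, using the same regular expressions as `_doAlways` below. The function name `clean` and the sample sentence are ours, chosen for illustration.

```python
import re


def clean(text, alpha=True, lower=True):
    text = re.sub(r"<.*?>", " ", text)               # drop anything between < and >
    if alpha:
        text = re.sub(r"[^a-zA-Z0-9 ]+", " ", text)  # keep letters, digits and spaces
    if lower:
        text = text.lower()
    return text


raw = "Brian De Palma's best-known movie.<br /><br />It isn't his best!"
print(clean(raw).split())
```

Because apostrophes are replaced by spaces, contractions and possessives leave stray one-letter tokens such as `s` and `t` behind.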

In [7]:
class Preprocessor(object):
    ''' Preprocess data for NLP tasks. '''

    def __init__(self, alpha=True, lower=True, stemmer=True, english=False):
        self.alpha = alpha
        self.lower = lower
        self.stemmer = stemmer
        self.english = english
        
        self.uniqueWords = None
        self.uniqueStems = None
        
    def fit(self, texts):
        texts = self._doAlways(texts)

        allwords = pd.DataFrame({"word": np.concatenate(texts.apply(lambda x: x.split()).values)})
        self.uniqueWords = allwords.groupby(["word"]).size().rename("count").reset_index()
        self.uniqueWords = self.uniqueWords[self.uniqueWords["count"]>1].copy()  # copy before adding a column
        if self.stemmer:
            stem = PorterStemmer().stem  # create the stemmer once, not once per word
            self.uniqueWords["stem"] = self.uniqueWords.word.apply(stem).values
            # Sort so that, within each stem, the most frequent word comes first
            self.uniqueWords.sort_values(["stem", "count"], inplace=True, ascending=False)
            self.uniqueStems = self.uniqueWords.groupby("stem").first()
        
        #if self.english: self.words["english"] = np.in1d(self.words["mode"], allEnglishWords)
        print("Fitted.")
            
    def transform(self, texts):
        texts = self._doAlways(texts)
        if self.stemmer:
            allwords = np.concatenate(texts.apply(lambda x: x.split()).values)
            uniqueWords = pd.DataFrame(index=np.unique(allwords))
            uniqueWords["stem"] = pd.Series(uniqueWords.index).apply(lambda x: PorterStemmer().stem(x)).values
            uniqueWords["mode"] = uniqueWords.stem.apply(lambda x: self.uniqueStems.loc[x, "word"] if x in self.uniqueStems.index else "")
            texts = texts.apply(lambda x: " ".join([uniqueWords.loc[y, "mode"] for y in x.split()]))
        #if self.english: texts = self.words.apply(lambda x: " ".join([y for y in x.split() if self.words.loc[y,"english"]]))
        print("Transformed.")
        return(texts)

    def fit_transform(self, texts):
        # fit() and transform() both clean the texts themselves
        self.fit(texts)
        return self.transform(texts)
    
    def _doAlways(self, texts):
        # Remove parts between <>'s
        texts = texts.apply(lambda x: re.sub('<.*?>', ' ', x))
        # Keep letters and digits only.
        if self.alpha: texts = texts.apply(lambda x: re.sub('[^a-zA-Z0-9 ]+', ' ', x))
        # Set everything to lower case
        if self.lower: texts = texts.apply(lambda x: x.lower())
        return texts  

In [8]:
preprocess = Preprocessor(alpha=True, lower=True, stemmer=True)

In [9]:
%%time
trainX = preprocess.fit_transform(train_df.review).values
testX = preprocess.transform(test_df.review).values

Fitted.
Transformed.
Transformed.
CPU times: user 1min 56s, sys: 2.37 s, total: 1min 58s
Wall time: 1min 58s


In [30]:
print(trainX[1])

this is probably one of brian de palma s best known movie but it isn t his best body double the fury and carry are better movie but this movie is better than blow out and obsessed de palma is very influence by hitchcock and this movie is a take off on psycho angie dickinson is a boring housewife who is think of have an affair and after her psychiatrist played by michael caine turn down an offer dickinson meets a man in a art gallery and she wind up sleep with him after this point it s best you don t know what happened but there is a murder and nancy allen is a called girl who get a look at the killer dennis franz is the detective on the case who really doesn t trust allen and she has to find the killer herself it s a pretty good movie but isn t one of de palma s best


In [31]:
print(preprocess.uniqueWords.shape)
preprocess.uniqueWords[preprocess.uniqueWords.word.str.contains("disappoint")]

(44448, 3)


| | word | count | stem |
|---|---|---|---|
| 17562 | disappointingly | 12 | disappointingli |
| 17560 | disappointed | 770 | disappoint |
| 17561 | disappointing | 327 | disappoint |
| 17563 | disappointment | 323 | disappoint |
| 17559 | disappoint | 89 | disappoint |
| 17566 | disappoints | 27 | disappoint |
| 17565 | disappointments | 20 | disappoint |


In [63]:
print(preprocess.uniqueStems.shape)
preprocess.uniqueStems[preprocess.uniqueStems.word.str.contains("disappoint")]

(29254, 2)


| stem | word | count |
|---|---|---|
| disappoint | disappointed | 770 |
| disappointingli | disappointingly | 12 |
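
The mapping from stems back to real words can be reproduced by hand from the `uniqueWords` rows shown above. The following is a minimal sketch in plain Python rather than the pandas code used in `Preprocessor`; the names `stem_to_word` and `word_to_mode` are ours.

```python
# (word, count, stem) rows copied from the uniqueWords output above
rows = [
    ("disappointingly", 12, "disappointingli"),
    ("disappointed", 770, "disappoint"),
    ("disappointing", 327, "disappoint"),
    ("disappointment", 323, "disappoint"),
    ("disappoint", 89, "disappoint"),
    ("disappoints", 27, "disappoint"),
    ("disappointments", 20, "disappoint"),
]

# For every stem, keep the word with the highest count
best = {}
for word, count, stem in rows:
    if stem not in best or count > best[stem][1]:
        best[stem] = (word, count)
stem_to_word = {stem: word for stem, (word, _) in best.items()}

# Every word is then replaced by the representative of its stem
word_to_mode = {word: stem_to_word[stem] for word, _, stem in rows}
print(word_to_mode["disappoints"])
```

This is why the transformed review printed earlier contains real words such as `movie` rather than raw Porter stems.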


---

## Feature Engineering
Next, we take the preprocessed texts as input and compute their TF-IDF vectors ([info](http://www.tfidf.com)). We restrict the vocabulary to the 10,000 most frequent terms, so each text becomes a vector of 10,000 features.

In [32]:
# TfidfVectorizer expects stop_words as a list (or "english"), so convert the frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(["thats","weve","dont","lets","youre","im","thi","ha",
    "wa","st","ask","want","like","thank","know","susan","ryan","say","got","ought","ive","theyre"]))
tfidf = TfidfVectorizer(min_df=2, max_features=10000, stop_words=stop_words) #, ngram_range=(1,3)

In [33]:
%%time
trainX_tfidf = tfidf.fit_transform(trainX).toarray()
testX_tfidf = tfidf.transform(testX).toarray()

trainY = train_df.label
testY = test_df.label

display(trainY.head())

0    1
1    1
2    1
3    1
4    1
Name: label, dtype: int64

CPU times: user 6.6 s, sys: 1.77 s, total: 8.37 s
Wall time: 8.37 s


In [38]:
print(trainX_tfidf[1])

[0. 0. 0. ... 0. 0. 0.]
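
To show what the numbers in such a vector mean, here is a small NumPy sketch of the TF-IDF weighting on three toy documents. It assumes the smoothed idf formula and the L2 row normalisation that, to our knowledge, `TfidfVectorizer` uses by default. The documents and the name `tfidf_toy` are made up for illustration.

```python
import numpy as np

docs = ["great movie great cast", "bad movie", "great fun"]
vocab = sorted({w for d in docs for w in d.split()})

# Raw term counts, one row per document and one column per vocabulary term
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

n_docs = len(docs)
df = (tf > 0).sum(axis=0)                    # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency

tfidf_toy = tf * idf
tfidf_toy /= np.linalg.norm(tfidf_toy, axis=1, keepdims=True)  # L2-normalise each row

print(vocab)
print(tfidf_toy.round(3))
```

In the second document, the rarer word `bad` receives a higher weight than `movie`, which occurs in two of the three documents.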


---

## Feature Selection
Next, we take the 10,000-dimensional TF-IDF vectors as input and keep 2,000 dimensions: the 1,000 that correlate most positively and the 1,000 that correlate most negatively with our sentiment target. The corresponding words (see below) make sense.

In [39]:
from scipy.stats import pearsonr

In [14]:
getCorrelation = np.vectorize(lambda x: pearsonr(trainX_tfidf[:,x], trainY)[0])
correlations = getCorrelation(np.arange(trainX_tfidf.shape[1]))
print(correlations)

[-0.01780022 -0.01936224  0.00697161 ...  0.01920079  0.00850365
 -0.00796341]


In [16]:
allIndices = np.argsort(-correlations)
bestIndices = allIndices[np.concatenate([np.arange(1000), np.arange(-1000, 0)])]

In [17]:
vocabulary = np.array(tfidf.get_feature_names())
print(vocabulary[bestIndices][:10])
print(vocabulary[bestIndices][-10:])

['great' 'love' 'excellent' 'best' 'beautiful' 'perfect' 'performance'
 'favorite' 'enjoy' 'amazing']
['money' 'stupid' 'horrible' 'worse' 'boring' 'terrible' 'awful' 'waste'
 'worst' 'bad']


In [18]:
trainX_engr = trainX_tfidf[:,bestIndices]
testX_engr = testX_tfidf[:,bestIndices]

In [41]:
print(trainX_engr[0].shape)
print(trainX_engr.shape, testX_engr.shape)

(2000,)
(22304, 2000) (22365, 2000)
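
Calling `pearsonr` once per column works but is slow for 10,000 columns. As a sketch of an alternative, the correlations can be computed for all columns at once with NumPy. The helper names `column_correlations` and `select_extremes` and the random toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def column_correlations(X, y):
    """Pearson correlation of every column of X with y (no constant columns allowed)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return Xc.T @ yc / denom


def select_extremes(corr, k):
    """Indices of the k most positive followed by the k most negative correlations."""
    order = np.argsort(-corr)
    return order[np.concatenate([np.arange(k), np.arange(-k, 0)])]


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 6))
X[:, 2] += 2 * y   # column 2 rises with the label
X[:, 4] -= 2 * y   # column 4 falls with the label

corr = column_correlations(X, y)
print(select_extremes(corr, 1))
```

With `k=1000`, `select_extremes` returns the same kind of index array as `bestIndices` above.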


---
## Logistic Regression

In [42]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 20]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(trainX_engr, trainY)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)

Best cross-validation score: 0.90
Best parameters:  {'C': 10}
Best estimator:  LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [22]:
model_reg = LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

model_reg.fit(trainX_engr, trainY)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [52]:
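# Caution: cross_val_score refits clones of the model on folds of the test set; model_reg.score(testX_engr, testY) would score the model fitted above.
print(np.mean(cross_val_score(model_reg, testX_engr, testY, cv=5)))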

0.8977867203219316


---

## Model Architecture
We choose a simple dense network with six hidden layers and a single sigmoid output unit, performing binary classification.

In [55]:
DROPOUT = 0.5
ACTIVATION = "tanh"

embed_dim = 128
lstm_out = 196

model = Sequential([    
    Dense(int(trainX_engr.shape[1]/2), activation=ACTIVATION, input_dim=trainX_engr.shape[1]),
    Dropout(DROPOUT),
    Dense(int(trainX_engr.shape[1]/2), activation=ACTIVATION),  # input_dim is only needed on the first layer
    Dropout(DROPOUT),
    Dense(int(trainX_engr.shape[1]/4), activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(100, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(20, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(5, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(1, activation='sigmoid'),
])

# model = Sequential()
# model.add(Embedding(2000, 128))
# model.add(MultiplicativeLSTM(128, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(1, activation='sigmoid'))

In [56]:
# model.compile(optimizer=optimizers.Adam(0.00005), loss='binary_crossentropy', metrics=['accuracy'])
# model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1000)              2001000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              1001000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_3 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 100)               50100     
__________
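
The parameter counts in the summary can be verified by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes the 2,000 selected features as input size. The summary above is cut off, so the counts for the last three layers follow from the formula, not from the printed output.

```python
# Input size followed by the sizes of the seven Dense layers defined above
layer_sizes = [2000, 1000, 1000, 500, 100, 20, 5, 1]


def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus one bias per unit


counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)
print("total:", sum(counts))
```

The first four values match `dense_1` to `dense_4` in the summary. The `Dropout` layers add no parameters.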

---

## Model Training
We train for 30 epochs with a batch size of 1500 and hold out 33% of the training data for validation.

In [59]:
EPOCHS = 30
BATCHSIZE = 1500

In [61]:
model.fit(trainX_engr, trainY, epochs=EPOCHS, batch_size=BATCHSIZE, validation_split=0.33)

# model.fit(trainX, trainY,
#           batch_size=1500,
#           epochs=15,
#           validation_split=0.33,
#           verbose=1,
#           callbacks=[ModelCheckpoint('imdb_mlstm2.h5', monitor='val_acc',
#                                      save_best_only=True, save_weights_only=True)])

Train on 14943 samples, validate on 7361 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras._impl.keras.callbacks.History at 0x7f3f49d27828>
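
The sample counts in the log line "Train on 14943 samples, validate on 7361 samples" follow from `validation_split`. The sketch below reproduces them, assuming Keras cuts the data at `int(n * (1 - validation_split))` and holds out the tail, which is our understanding of its behaviour.

```python
import math

n_samples = 22304          # rows in trainX_engr
validation_split = 0.33
batch_size = 1500

n_train = int(n_samples * (1 - validation_split))  # the first part is used for training
n_valid = n_samples - n_train                      # the last part is held out
steps_per_epoch = math.ceil(n_train / batch_size)  # weight updates per epoch

print(n_train, n_valid, steps_per_epoch)
```

Keras takes the validation samples from the end of the arrays before any shuffling. Because `train_df` lists all positive reviews before the negative ones, the held-out tail here appears to contain only negative reviews, so shuffling the rows before calling `fit` is probably worthwhile.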

In [62]:
x = np.arange(EPOCHS)
history = model.history.history

data = [
    go.Scatter(x=x, y=history["acc"], name="Train Accuracy", marker=dict(size=5), yaxis='y2'),
    go.Scatter(x=x, y=history["val_acc"], name="Valid Accuracy", marker=dict(size=5), yaxis='y2'),
    go.Scatter(x=x, y=history["loss"], name="Train Loss", marker=dict(size=5)),
    go.Scatter(x=x, y=history["val_loss"], name="Valid Loss", marker=dict(size=5))
]
layout = go.Layout(
    title="Model Training Evolution", font=dict(family='Palatino'), xaxis=dict(title='Epoch', dtick=1),
    yaxis1=dict(title="Loss", domain=[0, 0.45]), yaxis2=dict(title="Accuracy", domain=[0.55, 1]),
)
py.iplot(go.Figure(data=data, layout=layout), show_link=False)

---

## Model Evaluation

### Accuracy & Loss
Let's first attach the predicted probabilities and labels to the original train and test dataframes. Then we can print the accuracy and loss on the training set.

In [82]:
train_prob = model.predict(trainX_engr).ravel()  # predict once and flatten (n, 1) to (n,)
train_df["prediction"] = np.round(train_prob)
train_df["probability"] = train_prob

test_prob = model.predict(testX_engr).ravel()
test_df["prediction"] = np.round(test_prob)
test_df["probability"] = test_prob

train_df

| | file | label | review | prediction | probability |
|---|---|---|---|---|---|
| 0 | 1670_8.txt | 1 | Channel 4 is a channel that allows more naught... | 1.0 | 0.990652 |
| 1 | 6841_7.txt | 1 | This is probably one of Brian De Palma's best ... | 1.0 | 0.990638 |
| 2 | 2584_7.txt | 1 | The film starts out very slowly, with the life... | 1.0 | 0.990510 |
| 3 | 6835_10.txt | 1 | It was once suggested by Pauline Kael, never a... | 1.0 | 0.989994 |
| 4 | 660_9.txt | 1 | 9/10- 30 minutes of pure holiday terror. Okay,... | 1.0 | 0.990659 |
| 5 | 1104_8.txt | 1 | The film opens with Bill Coles (Melvyn Douglas... | 1.0 | 0.990726 |
| 6 | 9653_9.txt | 1 | First love is a desperately difficult subject ... | 1.0 | 0.990698 |
| 7 | 9314_10.txt | 1 | Only saw this show a few times, but will live ... | 1.0 | 0.990722 |
| 8 | 6600_9.txt | 1 | I'm going to keep this review short and sweet.... | 1.0 | 0.990724 |
| 9 | 5284_9.txt | 1 | Father and son communicate very little. IN fac... | 1.0 | 0.990721 |


In [72]:
print(model.evaluate(trainX_engr, trainY))
print((train_df.label==train_df.prediction).mean())

[0.4016041257184859, 0.9045911047345767]
0.9045911047345767
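
`model.evaluate` returns the loss followed by the accuracy. As a sketch of what these two numbers measure, the following NumPy functions compute binary cross-entropy and accuracy from labels and probabilities. The function names and the four toy predictions are ours.

```python
import numpy as np


def binary_crossentropy(y_true, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def accuracy(y_true, p, threshold=0.5):
    return float(np.mean((p > threshold) == (y_true == 1)))


y_true = np.array([1, 1, 0, 0])
p = np.array([0.99, 0.40, 0.01, 0.70])
print(binary_crossentropy(y_true, p), accuracy(y_true, p))
```

The loss punishes confident mistakes heavily: a probability of 0.013 for a review that is actually positive contributes about 4.3 to the sum, which may explain why the loss above stays around 0.40 even at 90% accuracy.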


### Error Analysis
Error analysis shows what kinds of mistakes the model makes. It often reveals data quality issues as well.

In [76]:
trainCross = train_df.groupby(["prediction", "label"]).size().unstack()
trainCross

| prediction | label = 0 | label = 1 |
|---|---|---|
| 0.0 | 7782 | 106 |
| 1.0 | 2022 | 12394 |


In [78]:
validCross = test_df.groupby(["prediction", "label"]).size().unstack()
validCross

| prediction | label = 0 | label = 1 |
|---|---|---|
| 0.0 | 7091 | 753 |
| 1.0 | 2774 | 11747 |
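
From the four counts in the test-set table above we can derive the usual summary metrics. This is a plain arithmetic sketch; the variable names are ours.

```python
# Counts copied from the test-set table above (rows: prediction, columns: label)
tn, fn = 7091, 753      # predicted 0: true negatives, false negatives
fp, tp = 2774, 11747    # predicted 1: false positives, true positives

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of truly positive reviews that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The model errs mostly on the side of predicting positive: recall is clearly higher than precision.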


In [83]:
truepositives = test_df[(test_df.label==True)&(test_df.label==test_df.prediction)]
print(len(truepositives), "true positives.")
truepositives.sort_values("probability", ascending=False).head(3)

11747 true positives.


| | file | label | review | prediction | probability |
|---|---|---|---|---|---|
| 7357 | 6676_10.txt | 1 | A beautiful, magical, thought-provoking and he... | 1.0 | 0.990728 |
| 9602 | 4453_10.txt | 1 | Cult-director Lucio Fulci is probably most fam... | 1.0 | 0.990728 |
| 4900 | 234_7.txt | 1 | Although I'm not crazy about musicals, COVER G... | 1.0 | 0.990728 |


In [85]:
truenegatives = test_df[(test_df.label==False)&(test_df.label==test_df.prediction)]
print(len(truenegatives), "true negatives.")
truenegatives.sort_values("probability", ascending=True).head(3)

7091 true negatives.


| | file | label | review | prediction | probability |
|---|---|---|---|---|---|
| 21115 | 5183_1.txt | 0 | This movie is just downright horrible, the mov... | 0.0 | 0.013142 |
| 16708 | 4040_2.txt | 0 | This is just the same old crap that is spewed ... | 0.0 | 0.013142 |
| 17225 | 7820_1.txt | 0 | For years I hesitated watching this movie. Now... | 0.0 | 0.013142 |


In [86]:
# Labeled positive but predicted negative: these are false negatives
falsenegatives = test_df[(test_df.label==True)&(test_df.label!=test_df.prediction)]
print(len(falsenegatives), "false negatives.")
falsenegatives.sort_values("probability", ascending=True).head(3)

753 false negatives.


| | file | label | review | prediction | probability |
|---|---|---|---|---|---|
| 4633 | 5744_10.txt | 1 | Right, here we go, you have probably read in p... | 0.0 | 0.013145 |
| 4801 | 8565_9.txt | 1 | It's not Citizen Kane, but it does deliver. Cl... | 0.0 | 0.013145 |
| 5959 | 6275_9.txt | 1 | Miles O'Keeffe once again assumes the role of ... | 0.0 | 0.013153 |


In [88]:
# Labeled negative but predicted positive: these are false positives
falsepositives = test_df[(test_df.label==False)&(test_df.label!=test_df.prediction)]
print(len(falsepositives), "false positives.")
falsepositives.sort_values("probability", ascending=False).head(3)

2774 false positives.


| | file | label | review | prediction | probability |
|---|---|---|---|---|---|
| 13939 | 6221_1.txt | 0 | I love musicals, all of them, from joyous Okla... | 1.0 | 0.990726 |
| 21807 | 12130_1.txt | 0 | This movie was pure genius. John Waters is bri... | 1.0 | 0.990726 |
| 20513 | 12253_3.txt | 0 | Well, I must say, I initially found this short... | 1.0 | 0.990725 |


This is the review that the model most confidently predicted as positive although it is labeled as negative. However, we can easily recognize it as a poorly labeled sample.

In [49]:
HTML(test_df.loc[22148].review)  # the test dataframe is called test_df in this notebook

---

## Model Application

### Custom Reviews
To use this model, we would store it together with the fitted preprocessor, the TF-IDF vectorizer, and the selected feature indices, and run unseen texts through the following pipeline.

In [105]:
unseen = pd.Series("this movie is good")

In [106]:
unseen = preprocess.transform(unseen)       # Text preprocessing
unseen = tfidf.transform(unseen).toarray()  # Feature engineering
unseen = unseen[:,bestIndices]              # Feature selection
probability = model.predict(unseen)[0,0]  # Network feedforward

Transformed.


In [107]:
print(probability)
print("Positive!" if probability > 0.5 else "Negative!")

0.9903882
Positive!
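
The four pipeline steps can be wrapped in a single function so that they always run in the same order. The following is only a sketch: `predict_sentiment` and the `toy_*` stand-ins are hypothetical and exist so the example runs on its own. They are not the fitted objects from this notebook.

```python
import numpy as np


def predict_sentiment(texts, clean, vectorize, best_indices, predict, threshold=0.5):
    """Run raw texts through cleaning, vectorising, feature selection and the model."""
    features = np.asarray(vectorize(clean(texts)))[:, best_indices]
    probabilities = np.asarray(predict(features)).ravel()
    return [("Positive" if p > threshold else "Negative", float(p)) for p in probabilities]


# Toy stand-ins for the fitted preprocessor, vectorizer and network (illustration only)
toy_vocab = ["good", "bad", "movie"]


def toy_clean(texts):
    return [t.lower() for t in texts]


def toy_vectorize(texts):
    return [[t.split().count(w) for w in toy_vocab] for t in texts]


def toy_predict(X):
    return 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))  # sigmoid(#good - #bad)


result = predict_sentiment(["This movie is good", "bad bad movie"],
                           toy_clean, toy_vectorize, [0, 1], toy_predict)
print(result)
```

With the objects from this notebook, the call would presumably look like `predict_sentiment(pd.Series(["this movie is good"]), preprocess.transform, lambda t: tfidf.transform(t).toarray(), bestIndices, model.predict)`.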
