<font size="3"><center><font color='Blue'>**Sentiment Analysis of IMDB Movie Reviews**</font></center></font>

<font size="3">**Problem Statement**</font><br>Predict the sentiment of movie reviews in the IMDB dataset. <br><br>
<font size="3">**Analysis Approach**</font>
<div style="text-align: justify"> 
    Movie reviews were pre-processed to remove HTML tags and non-alphabetic characters and to convert the words to lower case. To enhance predictive performance, the reviews were de-noised by analyzing the frequency distribution of the words they contain and building a restricted vocabulary based on specific cutoffs. Reviews transformed with this restricted vocabulary were tokenized to generate a feature representation and submitted to classification models.
 </div>

1. [Read Dataset and Pre-Process Reviews](#read_dataset)
2. [Exploratory Data Analysis](#EDA)
3. [De-Noise Input Data](#denoise)
4. [Feature Extraction (Tokenization)](#tokenize) <br>
a. Term Frequency / Inverse Document Frequency (TfIdf) <br>
b. Word Vector Representation (Embedding)
5. [Models](#model)<br>
a. Logistic Regression <br>
b. Neural Network
6. [Conclusion](#conclusion)

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from utils import preprocess, sentence_to_avg, Embedding_model, predict
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

<a id='read_dataset'></a>

### <span style='color:Blue'> <span style='background :Yellow' >Reading Dataset and Pre-processing Reviews</span></span>

In [2]:
# Reading Dataset
df = pd.read_csv("IMDB Dataset.csv")
# Converting string categories to integers in a single assignment
# (chained indexing such as df['sentiment'][mask] = 1 may write to a copy)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [3]:
# Pre-Processing reviews
p = df['review'].apply(preprocess)
df['processed_review'] = p
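
The `preprocess` function is imported from `utils` and is not shown in this notebook. The following is only a minimal sketch of the steps described in the analysis approach (strip HTML tags, drop non-alphabetic characters, lower-case). The name `preprocess_sketch`, the regular expressions, and the removal of words shorter than three letters are assumptions; the last one is inferred from the processed example review, where words such as "of" and "be" are missing.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")          # anything that looks like an HTML tag
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # everything except lower-case letters and whitespace

def preprocess_sketch(text, min_len=3):
    """Hypothetical stand-in for utils.preprocess; not the author's implementation."""
    text = html.unescape(TAG_RE.sub(" ", text))  # remove tags, decode entities
    text = text.lower()
    text = text.replace("'", "")                 # "you'll" -> "youll"
    text = NON_ALPHA_RE.sub(" ", text)           # digits and punctuation become spaces
    # Assumption: very short words are dropped, as the processed example suggests
    return " ".join(word for word in text.split() if len(word) >= min_len)

example = "One of the other reviewers<br /><br />said you'll be hooked after 1 episode."
print(preprocess_sketch(example))
```

The real `preprocess` may differ, for example by using BeautifulSoup for tag removal, which the `bs4` warning filter in the imports hints at.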

In [4]:
reviews=df.processed_review
sentiments=df.sentiment

In [5]:
print(f"\033[1mExample of a processed review:\033[0m \n{reviews[0]}")

Example of a processed review: 
one the other reviewers has mentioned that after watching just episode youll hooked they are right this exactly what happened with the first thing that struck about was its brutality and unflinching scenes violence which set right from the word trust this not show for the faint hearted timid this show pulls punches with regards drugs sex violence its hardcore the classic use the word called that the nickname given the oswald maximum security state penitentary focuses mainly emerald city experimental section the prison where all the cells have glass fronts and face inwards privacy not high the agenda city home many aryans muslims gangstas latinos christians italians irish and more scuffles death stares dodgy dealings and shady agreements are never far away would say the main appeal the show due the fact that goes where other shows wouldnt dare forget pretty pictures painted for mainstream audiences forget charm forget romance doesnt mess around th

In [6]:
print("Total Reviews: {} and Labels: {}".format(len(reviews), len(sentiments)))

Total Reviews: 50000 and Labels: 50000


<a id='EDA'></a>

### <span style='color:Blue'> <span style='background :Yellow' >Exploratory Data Analysis of the reviews</span></span>

<font size="3">**Word frequencies were obtained**</font> 

In [7]:
total_counts = Counter() # Words in all reviews
positive_counts = Counter() # Words in positive reviews
negative_counts = Counter() # Words in negative reviews

In [8]:
for i, review in enumerate(reviews):    
    if sentiments[i] == 1:
        for word in review.split():
            total_counts[word] += 1
            positive_counts[word] += 1
    else:
        for word in review.split():
            total_counts[word] += 1
            negative_counts[word] += 1                  

In [9]:
print("Top 5 most common words and their occurrences \n",total_counts.most_common(5))

Top 5 most common words and their occurrences 
 [('the', 667363), ('and', 323870), ('this', 150854), ('that', 136923), ('was', 95526)]
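
The per-class counting above can also be written with `Counter.update`, which adds a whole list of words at once. This is a small sketch on made-up reviews; `toy_reviews`, `toy_labels` and `counts_by_label` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy data standing in for `reviews` and `sentiments`
toy_reviews = ["great great film", "boring film"]
toy_labels = [1, 0]

counts_by_label = {0: Counter(), 1: Counter()}
for review, label in zip(toy_reviews, toy_labels):
    # update() increments the count of every word in the list
    counts_by_label[label].update(review.split())

# Adding two Counters sums the counts word by word
total = counts_by_label[0] + counts_by_label[1]
print(total)
```

Iterating with `zip` avoids indexing `sentiments[i]` inside the loop and keeps reviews and labels aligned by position.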


<a id='denoise'></a>

### <span style='color:Blue'> <span style='background :Yellow' >Transform Dataset: De-Noise Input Reviews</span></span>

In [10]:
sorted_words = sorted(total_counts, key=total_counts.get, reverse=False)
freq_count = {word: total_counts[word] for word in sorted_words}

In [11]:
value1 = 1
value2 = 5
low_freq_words = set()
for word, cnt in freq_count.items():
    if cnt > value1 and cnt < value2:
        low_freq_words.add(word)     

In [12]:
min_count_words = ', '.join(list(low_freq_words)[0:100])

In [13]:
print(f"\033[1mLow Frequency Words:\033[0m \n{min_count_words}")

Low Frequency Words: 
mcmurray, pities, discus, ormsby, mishmashes, com/~fedor, aatish, com/name/nm, beazely, wiggs, stadling, buddist, hooten, anniyan, birnleys, gantlet, flaubert, sangre, grievous, shemi, dabbles, tessie, submariner, docudramas, auf, interjects, infraction, haruna, selbys, chaise, maeve, chopsocky, croker, raisers, saltzman, goudry, classicism, lowpoints, *because*, vrajesh, lyta, coulter, estimations, motorboat, pinpoints, blowtorch, bettger, responders, vertov, fahcking, trivializing, granddaddys, trekkers, tucks, subscribed, undoes, kharbanda, derby, deliveryman, fiorella, widened, zir, huss, nekhron, fertilization, obtrusively, bombadil, plop, cassells, antecedent, sango, straubs, topgun, countrywoman, shonda, tutankhamen, hesslings, jove, safdie, reverberations, rapacious, writers/producers/directors, psychoactive, unspecific, mcculloch, significances, mori, lumbly, vogueing, insoluble, katts, rebhorn, japoteurs, zoology, gdr, contingency, goodknight, ba

<font size="3">**Determine MIN_COUNT:**</font> <br>Inspecting 'min_count_words' for different frequency ranges showed that low-frequency words (used fewer than 100 times) are mostly misspelled, contain typographical errors or are colloquial. To minimize noise in the vocabulary, words used 100 times or fewer were eliminated from the analysis. **Based on this review of the data, MIN_COUNT = 100 was used.**
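
Inspecting several frequency ranges is easier with a small helper. The sketch below wraps the filtering loop from the cells above in a function; `words_in_count_range` and the toy counts are hypothetical names introduced for illustration.

```python
from collections import Counter

def words_in_count_range(counts, low, high, limit=100):
    """Return up to `limit` words whose count c satisfies low < c < high."""
    selected = sorted(word for word, c in counts.items() if low < c < high)
    return selected[:limit]

# Toy counts: good=3, bad=2, flim=2, flam=1, the=5
toy_counts = Counter("good good good bad bad flim flim flam the the the the the".split())

print(words_in_count_range(toy_counts, 1, 3))   # words used exactly twice
print(words_in_count_range(toy_counts, 2, 6))   # words used 3 to 5 times
```

In the notebook, the same function could be called as `words_in_count_range(total_counts, 1, 5)`, `words_in_count_range(total_counts, 50, 100)` and so on to compare ranges.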

<font size="3">**Determine POLARITY_CUTOFF:**</font>  <br><div style="text-align: justify"> Words like "amazing" are more likely to appear in positive reviews, and words like "terrible" are more frequent in negative reviews. Polarity is defined as the ratio of a word's frequency in positive reviews to its frequency in negative reviews. Restricting the vocabulary to high-polarity words, and excluding common words like "the" that appear about equally often in both classes, is expected to enhance the predictive power of the reviews. For such neutral words the raw ratio is close to 1.0 rather than 0, and the raw scale is asymmetric: positive words range from 1 upward, while negative words are squeezed between 0 and 1. Taking the log of the ratio centers the distribution at 0 and makes positive and negative polarity symmetric. **Based on this analysis, a polarity cutoff of 0.05 on the absolute log ratio was used to construct the review vocabulary.**</div>
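
A small worked example may make the log ratio concrete. The counts below are invented for illustration; the formula, including the `+ 1` in the denominator that avoids division by zero, mirrors the one used in this notebook.

```python
import math
from collections import Counter

# Made-up counts, not taken from the IMDB data
positive_counts = Counter({"the": 1000, "amazing": 80, "terrible": 10})
negative_counts = Counter({"the": 980, "amazing": 19, "terrible": 75})

def log_polarity(word):
    """Log of the positive-to-negative ratio; close to 0 for neutral words."""
    return math.log(positive_counts[word] / (negative_counts[word] + 1))

POLARITY_CUTOFF = 0.05
for word in ("the", "amazing", "terrible"):
    score = log_polarity(word)
    keep = abs(score) >= POLARITY_CUTOFF   # same test as the vocabulary filter
    print(f"{word:10s} log-ratio = {score:+.3f}  keep = {keep}")
```

With these toy counts "the" falls inside the cutoff band and is dropped, while "amazing" and "terrible" land on opposite sides of zero and are kept.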

In [14]:
pos_neg_ratios = Counter() # Ratio of words in positive / negative review
for term, cnt in list(total_counts.most_common()):
    if positive_counts[term]:
        ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = np.log(ratio) # To center the polarity data around 0

In [15]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 0.04386315777969167
Pos-to-neg ratio for 'amazing' = 1.3538590853667367
Pos-to-neg ratio for 'terrible' = -1.9099740788413335


<font size="3">**Create Review Vocabulary**</font>

In [16]:
MIN_COUNT=100
POLARITY_CUTOFF = 0.05
review_vocab = set()
for word, ratio in pos_neg_ratios.most_common():
    if total_counts[word] > MIN_COUNT:
        if((pos_neg_ratios[word] >= POLARITY_CUTOFF) or (pos_neg_ratios[word] <= -POLARITY_CUTOFF)):
               review_vocab.add(word)

In [17]:
print("Length of review vocabulary", len(review_vocab))

Length of review vocabulary 6224


In [18]:
# Checking review vocabulary
'the' in review_vocab

False

<font size="3">**Transform Dataset to words in vocabulary**</font>

In [19]:
word2index = {}
for i, word in enumerate(review_vocab):
    word2index[word] = i

In [20]:
transformed_dataset_int = []
transformed_dataset = []
for i, review in enumerate(reviews):
    indices = []
    words = []
    for word in review.split(' '):
        if word in word2index:
            indices.append(word2index[word])
            words.append(word)
    transformed_dataset_int.append(indices)
    transformed_dataset.append(words)

In [21]:
assert(len(transformed_dataset) == len(reviews))
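
The transformation can be illustrated on a toy vocabulary. The sketch below is an assumption-laden variant, not the notebook's code: it sorts the vocabulary before numbering it, because iterating over a Python `set` of strings can yield a different order from one session to the next, which matters if the indices are ever saved and reused. The names `restrict` and `encode` are hypothetical.

```python
# Toy vocabulary standing in for `review_vocab`
review_vocab = {"amazing", "terrible", "plot"}

# Sorting gives every word the same index in every session
word2index = {word: i for i, word in enumerate(sorted(review_vocab))}

def restrict(review):
    """Keep only the words of a review that are in the vocabulary."""
    return [word for word in review.split() if word in word2index]

def encode(review):
    """Map the retained words to their integer indices."""
    return [word2index[word] for word in restrict(review)]

print(restrict("the plot was amazing"))
print(encode("the plot was amazing"))
```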

In [22]:
#train dataset 
index = int(len(transformed_dataset)*0.80)
X_train = transformed_dataset[:index]
Y_train = list(df.sentiment[:index].values)
#test dataset
X_test = transformed_dataset[index:]
Y_test = list(df.sentiment[index:].values)
print("# Processed Training Reviews: {} Labels: {}".format(len(X_train), len(Y_train)))
print("# Processed Test Reviews: {} Labels: {}".format(len(X_test), len(Y_test)))

# Processed Training Reviews: 40000 Labels: 40000
# Processed Test Reviews: 10000 Labels: 10000


<a id='tokenize'></a>

### <span style='color:Blue'> <span style='background :Yellow' > Feature Extraction (Tokenization)</span></span>

<font size="3">**Create Term Frequency / Inverse Document Frequency (TF-IDF) tokenizer**</font> 

In [23]:
# TFIDF needs strings
TfIdf_X_train = []
for a in X_train:
    TfIdf_X_train.append(' '.join(a))
    
TfIdf_X_test = []
for a in X_test:
    TfIdf_X_test.append(' '.join(a))

In [24]:
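#Tfidf vectorizer
# NOTE: an integer max_df is an absolute count, so max_df=1 keeps only terms found in a
# single training review; max_df=1.0 (a proportion) would keep every term.
TfIdf=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))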

In [25]:
#transformed train reviews
X_train_TfIdf = TfIdf.fit_transform(TfIdf_X_train)
#transformed test reviews
X_test_TfIdf = TfIdf.transform(TfIdf_X_test)

In [26]:
assert(X_train_TfIdf.shape[0] == len(X_train))
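
The weighting behind TF-IDF can be illustrated without scikit-learn. The sketch below is a hand-rolled approximation for single words only (no n-grams). It assumes the smoothed IDF that scikit-learn documents for its default settings, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector; the toy documents and the names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents standing in for the transformed reviews
docs = [
    ["great", "acting", "great", "plot"],
    ["terrible", "acting"],
    ["great", "fun"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency; rarer terms get larger values."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Raw term count times IDF, scaled so the vector has unit L2 norm."""
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

print(tfidf(docs[0]))
```

A term such as "fun", which occurs in one document, receives a larger IDF than "great", which occurs in two.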

<font size="3">**Generate word vectors with Word2Vec for tokenization**</font>

In [27]:
# sg = 1: skip-gram, sg = 0: continuous bag of words (CBOW)
# `size` is the gensim 3.x argument name; gensim 4.0+ calls it `vector_size`
wv_vector_dim = 400
WVmodel = Word2Vec(sg=1, size=wv_vector_dim, window=5)

In [28]:
WVmodel.build_vocab(X_train)

In [29]:
WVmodel.train(X_train, total_examples = WVmodel.corpus_count,epochs=10)

(43739979, 52036230)

In [30]:
# Words most similar to "amazing", a word that is likely to appear in positive reviews
WVmodel.wv.most_similar("amazing")

[('wonderful', 0.4866514801979065),
 ('incredible', 0.4660269021987915),
 ('superb', 0.43346714973449707),
 ('outstanding', 0.4249274730682373),
 ('awesome', 0.4209737479686737),
 ('great', 0.3985666036605835),
 ('brilliant', 0.3958675265312195),
 ('breathtaking', 0.3937884569168091),
 ('fabulous', 0.3935829997062683),
 ('excellent', 0.38985300064086914)]

In [31]:
# Words most similar to "terrible", a word that is likely to appear in negative reviews
WVmodel.wv.most_similar("terrible")

[('awful', 0.5576237440109253),
 ('horrible', 0.514610767364502),
 ('horrendous', 0.49778109788894653),
 ('atrocious', 0.4964660406112671),
 ('bad', 0.49113407731056213),
 ('horrid', 0.47501182556152344),
 ('lousy', 0.4350171983242035),
 ('abysmal', 0.40601277351379395),
 ('appalling', 0.39373308420181274),
 ('laughable', 0.3919409513473511)]
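
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation on hypothetical 3-dimensional vectors (the notebook's vectors have 400 dimensions); `cosine_similarity` and `most_similar_sketch` are illustrative names, not gensim functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors chosen so that the two positive words point the same way
toy_vectors = {
    "amazing": [0.9, 0.1, 0.0],
    "wonderful": [0.8, 0.2, 0.1],
    "terrible": [0.0, 0.2, 0.9],
}
print(most_similar_sketch("amazing", toy_vectors))
```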

In [32]:
wv_file_name = "imdb_word_vectors_" + str(wv_vector_dim) + "dim.kv"

In [33]:
# Save word vectors
WVmodel.wv.save(wv_file_name)

<a id='model'></a>

### <span style='color:Blue'> <span style='background :Yellow' > Models </span></span>

<font size="3"><font color='Blue'><span style='background :yellow' >**Model: Logistic Regression**</span></font></font>

In [34]:
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)

In [35]:
# Fitting the model on the tfidf features of the training reviews
LR_tfidf=lr.fit(X_train_TfIdf, Y_train)
# Predicting sentiment for the test reviews from their tfidf features
LR_tfidf_predict=lr.predict(X_test_TfIdf)

In [36]:
print("\033[1mLogistic Regression TfIDF Model Accuracy\033[0m {}".format(accuracy_score(Y_test, LR_tfidf_predict)))

Logistic Regression TfIDF Model Accuracy 0.7327


<font size="3"><font color='Blue'><span style='background :yellow' >**Model: Neural Network with word embedding**</span></font></font>

In [37]:
word_to_vec_map = KeyedVectors.load(wv_file_name)

In [38]:
pred, W, b = Embedding_model(X_train, Y_train, word_to_vec_map)

Cost = 0.05760924132860845
Accuracy: 0.87
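
`Embedding_model` and `sentence_to_avg` come from `utils` and are not shown here. The names and the returned `W` and `b` suggest a classifier fitted on the average word vector of each review, but that is an assumption. The sketch below illustrates that general idea with a single logistic unit trained by gradient descent on made-up 2-dimensional vectors; every name in it is hypothetical, and it is not the author's model.

```python
import math

# Hypothetical 2-dimensional word vectors (the notebook uses 400 dimensions)
toy_vectors = {
    "amazing": [0.9, 0.1],
    "wonderful": [0.8, 0.2],
    "terrible": [0.1, 0.9],
    "awful": [0.2, 0.8],
}

def average_vector(words, vectors, dim=2):
    """Average the vectors of the known words; zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Stochastic gradient descent on the logistic loss for one output unit."""
    W = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)
            error = p - y                      # gradient of the loss w.r.t. z
            W = [w - lr * error * xi for w, xi in zip(W, x)]
            b -= lr * error
    return W, b

toy_reviews = [["amazing", "wonderful"], ["terrible", "awful"], ["amazing"], ["awful"]]
toy_labels = [1, 0, 1, 0]
features = [average_vector(r, toy_vectors) for r in toy_reviews]
W, b = train_logistic(features, toy_labels)
predictions = [int(sigmoid(sum(w * xi for w, xi in zip(W, x)) + b) > 0.5) for x in features]
print(predictions)
```

Averaging discards word order, so each review is reduced to one fixed-length vector regardless of its length.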


<a id='conclusion'></a>

### <span style='color:Blue'> <span style='background :Yellow' > Conclusion </span></span>

<div style="text-align: justify"> To develop a predictive model of movie review sentiment, reviews were de-noised with a restricted vocabulary of high-polarity words. The transformed reviews were tokenized using two feature representations: Term Frequency / Inverse Document Frequency (TfIdf) and word vectors generated with Word2Vec. Logistic Regression was used to predict sentiment for reviews tokenized with TfIdf, and a feed-forward neural network was used to predict sentiment for reviews tokenized with word vectors. <br><br>
The neural network approach with word vector feature representation resulted in greater prediction accuracy (87%, as reported by Embedding_model, which was called on the training reviews) than Logistic Regression with the TfIdf representation (73% on the test reviews). This finding is anticipated, since word vectors capture the *similarity* between words, unlike TfIdf, which only weights the *counts* of word occurrences. Hence a word vector approach is more likely to generalize to unseen test reviews that express sentiment with words similar to those seen in training. As more words are used in a review to describe the positive or negative sentiment, the strength of the association with the target sentiment is enhanced. A further improvement in predictive accuracy may be achievable with BERT-based language models, which capture the context of the words in the review. </div>