# Sentiment Analysis of IMDb Movie Reviews

In this notebook, we learn sentiment analysis, focusing on the classification of IMDb movie reviews. 
<br>In the first part, we implement very basic text preprocessing and develop a baseline model using logistic regression.
<br>Then, we try to improve the model by implementing several more advanced text preprocessing techniques commonly used to enhance NLP models.

In [1]:
import numpy as np
import pandas as pd

import nltk
nltk.download('punkt')
from nltk import word_tokenize

nltk.download('stopwords')
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package punkt to /Users/sudhir/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sudhir/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/sudhir/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Importing the dataset

The IMDb movie review data set can be found at http://ai.stanford.edu/~amaas/data/sentiment/. 

The dataset has 50k movie reviews: we use 25k for training the model and the remaining 25k for testing it. <br>Each set has 12.5k positive reviews and 12.5k negative reviews. 

On IMDb, movies are rated on a scale of 1 to 10. In the dataset, reviews with a rating <= 4 are labeled negative and those with a rating >= 7 are labeled positive. 
<br>Reviews with a rating of 5 or 6 are omitted from the dataset.

We have merged all the training and test set reviews into the files full_train.txt and full_test.txt, respectively. 
<br>In each file, the first 12.5k reviews are positive and the remaining 12.5k are negative.

We put all the reviews in two lists, one for training and the other for testing.

In [2]:
# Read one review per line; 'with' closes each file once it has been read
with open('./movie_data/full_train.txt', 'r') as f:
    reviews_train_raw = [line.strip() for line in f]

with open('./movie_data/full_test.txt', 'r') as f:
    reviews_test_raw = [line.strip() for line in f]

# Labels: the first 12,500 reviews are positive (1), the remaining 12,500 negative (0)
target = [1 if i < 12500 else 0 for i in range(25000)]

In [3]:
print(len(reviews_train_raw), len(reviews_test_raw))

25000 25000


## Basic text preprocessing & creating baseline model

**Tokenize**

In the first step, we tokenize the reviews. Let's have a look at one of the reviews.

In [4]:
reviews_train_raw[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

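In this tokenization step, we apply only basic text processing, as we aim to create a baseline model:
 - converting all words to lower case
 - keeping only tokens that consist entirely of alphabetic characters (entries such as . , : ! ? 10 w3p are removed)

Note that this filter also drops the contraction piece n't that the tokenizer splits off, so "isn't" is reduced to "is" and the negation is lost.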

In [5]:
def tokenize(corpus):
    
    review_tokenize = []
    
    for review in corpus:
        
        tokenized = ' '.join([word for word in word_tokenize(review.lower()) if word.isalpha()])
        review_tokenize.append(tokenized)
        
    return review_tokenize

In [6]:
reviews_train = tokenize(reviews_train_raw)
reviews_test = tokenize(reviews_test_raw)

In [7]:
reviews_train[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my years in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it is'
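The effect of the alphabetic filter can be seen in isolation with a small sketch. The token list below is hand-written to mimic how a Treebank-style tokenizer would split the text (an assumption made so the example needs no NLTK data), and `keep_alpha_tokens` is a hypothetical helper, not part of the notebook.

```python
def keep_alpha_tokens(tokens):
    """Lower-case the tokens and keep only purely alphabetic ones."""
    return [tok.lower() for tok in tokens if tok.isalpha()]

# Hand-written tokens for "It isn't a 10/10, w3p!" (assumed tokenizer output)
sample_tokens = ["It", "is", "n't", "a", "10/10", ",", "w3p", "!"]

filtered = keep_alpha_tokens(sample_tokens)
print(filtered)  # ['it', 'is', 'a']
```

The piece "n't" contains an apostrophe, so `str.isalpha()` rejects it along with the punctuation and the alphanumeric tokens.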

**Vectorization**

Now, in order to feed the text data to the algorithm, we have to convert it to a numerical representation. For this purpose, we use a one-hot (binary bag-of-words) encoding. 

In this representation, we create a very large matrix where 
 - the number of rows is the number of reviews and 
 - the number of columns is the number of unique words (the vocabulary) in our dataset (the corpus)
 
For every row (review), an entry is set to 1 if the word for that column occurs in the review, and to 0 otherwise. The end result is a sparse matrix: mostly 0's, with only a few 1's in each row.

In [8]:
cv = CountVectorizer(binary = True)
cv.fit(reviews_train)
X = cv.transform(reviews_train)
X_test = cv.transform(reviews_test)

In [9]:
print('Train set one-hot shape:', X.shape)
print('Test set one-hot shape:', X_test.shape)

Train set one-hot shape: (25000, 71400)
Test set one-hot shape: (25000, 71400)


In our training set of 25k reviews, we have around 71k unique words. Note that the vocabulary is built from the training set only. A word that appears in the test set but not in the training corpus is simply ignored, as it has no column in the one-hot encoding representation.
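To make the fit-on-train, transform-on-test behaviour concrete, here is a minimal pure-Python sketch of a binary bag-of-words encoder. It is an illustration under simplifying assumptions (whitespace tokenization, dense lists instead of a sparse matrix), not how CountVectorizer is implemented; `fit_vocabulary` and `binary_vector` are hypothetical names.

```python
def fit_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}

def binary_vector(document, vocabulary):
    """Return a 0/1 row: 1 where a vocabulary word occurs in the document."""
    row = [0] * len(vocabulary)
    for word in document.split():
        idx = vocabulary.get(word)
        if idx is not None:      # words unseen during fitting have no column
            row[idx] = 1
    return row

train_docs = ["great movie great cast", "boring movie"]
vocab = fit_vocabulary(train_docs)

# "plot" never occurred in the training documents, so it is ignored
test_row = binary_vector("great plot", vocab)
print(vocab)     # {'boring': 0, 'cast': 1, 'great': 2, 'movie': 3}
print(test_row)  # [0, 0, 1, 0]
```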

Let's have a look at the features.

In [10]:
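# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
features = cv.get_feature_names()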
print(features[0:100])

['aa', 'aaa', 'aaaaaaah', 'aaaaah', 'aaaahhhhhhh', 'aaaand', 'aaaarrgh', 'aaah', 'aaargh', 'aaaugh', 'aachen', 'aada', 'aadha', 'aag', 'aage', 'aaghh', 'aah', 'aahhh', 'aaip', 'aaja', 'aakash', 'aaker', 'aaliyah', 'aames', 'aamir', 'aan', 'aankh', 'aankhen', 'aap', 'aapke', 'aapkey', 'aardman', 'aardvarks', 'aargh', 'aaron', 'aarp', 'aarrrgh', 'aatish', 'aauugghh', 'aavjo', 'aaww', 'ab', 'aback', 'abahy', 'abanazer', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abandons', 'abanks', 'abas', 'abashed', 'abashidze', 'abatement', 'abating', 'abattoirs', 'abba', 'abbad', 'abbas', 'abbasi', 'abbey', 'abbie', 'abbot', 'abbots', 'abbott', 'abbreviated', 'abbu', 'abby', 'abc', 'abcd', 'abdic', 'abdicates', 'abdicating', 'abdomen', 'abdominal', 'abduct', 'abducted', 'abductee', 'abducting', 'abduction', 'abductions', 'abductor', 'abductors', 'abducts', 'abdul', 'abdullah', 'abe', 'abel', 'abercrombie', 'abernathy', 'aberrant', 'aberration', 'aberrations', 'aberystwyth', 'abets', 'abette

**Classification Model**

To create the baseline classification model, we use logistic regression because
 - linear models tend to perform well on sparse datasets
 - it is easy to interpret
 - it is fast compared to more complicated algorithms

The target variable is the same for both the training and the test set, as in both the first 12.5k reviews are positive and the remaining 12.5k are negative.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [1, 0.5, 0.25, 0.05, 0.01]:
    
    lr = LogisticRegression(C = c)       # C is the inverse of the regularization strength: smaller C = stronger regularization
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                  accuracy_score(y_train, lr.predict(X_train)), 
                                                  accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set with C = 1.00: 0.998/0.878
Accuracy on train/validation set with C = 0.50: 0.994/0.882
Accuracy on train/validation set with C = 0.25: 0.986/0.883
Accuracy on train/validation set with C = 0.05: 0.950/0.889
Accuracy on train/validation set with C = 0.01: 0.905/0.879
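The trend in the output above (training accuracy falls from 0.998 to 0.905 as C shrinks) follows from how scikit-learn defines C: it is the inverse of the regularization strength. As a sketch, with the default L2 penalty the fitted objective can be written, up to a constant scaling, as

$$
\min_{w,\,b} \;\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

Here $w$ (the weight vector), $b$ (the intercept), $x_i$ (the feature vector of review $i$), $y_i \in \{-1, +1\}$ (its label) and $n$ (the number of training reviews) are notation introduced for this illustration. A small C shrinks the weight of the data-fit term relative to the penalty $\frac{1}{2} w^\top w$, which pushes the coefficients towards zero and lowers the training accuracy.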


In [13]:
best_model = LogisticRegression(C = 0.05)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                   accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.947/0.881


So, with our simple baseline model we have 88% accuracy on the test set.

Now, as a sanity check, let's look at the five largest and five smallest coefficients to find the most discriminating words for positive and negative reviews.

In [14]:
print('Coefficients:', best_model.coef_)
print('Number of coefficients:', len(best_model.coef_[0]))

Coefficients: [[-0.01020906 -0.0125288  -0.0002121  ...  0.00025972 -0.02385085
  -0.01536539]]
Number of coefficients: 71400


In [15]:
feature_coeff = {word: coeff for word, coeff in zip(cv.get_feature_names(), best_model.coef_[0])}

print('Positive words')
for best_positive in sorted(feature_coeff.items(), key = lambda x: x[1], reverse = True)[:5]:
    print(best_positive)

print('--------------')
    
print('Negative words')    
for best_negative in sorted(feature_coeff.items(), key = lambda x: x[1])[:5]:
    print(best_negative)

Positive words
('excellent', 0.9462702638847761)
('perfect', 0.7550828420618055)
('great', 0.6830621779003093)
('amazing', 0.6364490924130619)
('superb', 0.6296931914827026)
--------------
Negative words
('worst', -1.400142312303066)
('waste', -1.1860009859358518)
('awful', -0.9809824977718387)
('poorly', -0.875555391069068)
('boring', -0.8217719508421568)


The top five words in both the positive and the negative case are clearly sentiment-bearing, as we would expect.

## Advanced text preprocessing and modeling

Now, let's try to improve the model by implementing a few advanced text preprocessing techniques.
<br>In particular, we consider these four options:

 - Removing stop words
 - Normalization: Stemming and Lemmatization
 - N-grams
 - Representations: word counts and tf-idf

**A. REMOVE STOP WORDS**

We further process the text by removing very common words (called stop words) such as 'if', 'we', 'us', 'he' and 'she'. 
<br>We can usually remove these words without changing the semantics of a text, and doing so often improves the performance of a model. 

In [16]:
print('Illustration of a review before removing the stop words')
print('Number of words:', len(reviews_train[0].split()))
print(reviews_train[0])

Illustration of a review before removing the stop words
Number of words: 137
bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my years in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it is


In [18]:
def remove_stop_words(corpus):
    
    english_stop_words = set(stopwords.words('english'))   # a set gives fast membership tests
    
    without_stop_words = []
    
    for review in corpus:
        
        cleaned = ' '.join([word for word in review.split() if word not in english_stop_words])
        without_stop_words.append(cleaned)
        
    return without_stop_words

In [19]:
reviews_train_no_stop_words = remove_stop_words(reviews_train)
reviews_test_no_stop_words = remove_stop_words(reviews_test)

cv = CountVectorizer(binary = True)
cv.fit(reviews_train_no_stop_words)
X = cv.transform(reviews_train_no_stop_words)
X_test = cv.transform(reviews_test_no_stop_words)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [1, 0.5, 0.25, 0.05, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                 accuracy_score(y_train, lr.predict(X_train)),
                                                 accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set with C = 1.00: 0.998/0.874
Accuracy on train/validation set with C = 0.50: 0.994/0.877
Accuracy on train/validation set with C = 0.25: 0.986/0.878
Accuracy on train/validation set with C = 0.05: 0.949/0.882
Accuracy on train/validation set with C = 0.01: 0.906/0.873


In [20]:
best_model = LogisticRegression(C = 0.05)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                      accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.946/0.880


In [21]:
print('Illustration of a review after removing the stop words')
print('Number of words:', len(reviews_train_no_stop_words[0].split()))
print(reviews_train_no_stop_words[0])

Illustration of a review after removing the stop words
Number of words: 69
bromwell high cartoon comedy ran time programs school life teachers years teaching profession lead believe bromwell high satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity


Stop words usually carry little signal about whether a movie review is positive or negative, and they add an unnecessary burden by increasing the dimension of the one-hot encoding representation.

In [22]:
X.shape

(25000, 71262)

Removing stop words does not seem to improve the accuracy. One reason could be that very few distinct words were removed: the number of unique words is down only from 71400 to 71262, even though the first review shrinks from 137 to 69 words.
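Another possible reason, offered here as a hypothesis rather than something measured in this notebook, is that stop word lists contain negation words. NLTK's English list includes 'not', for example. The sketch below uses a small hand-picked stop word set (`MINI_STOP_WORDS`, hypothetical) to show how two reviews with opposite sentiment can become identical after filtering.

```python
# Hypothetical mini stop word list for illustration only; the NLTK English
# list is much longer but also contains negations such as "not".
MINI_STOP_WORDS = {"the", "is", "was", "not", "a", "this"}

def drop_stop_words(text, stop_words):
    """Remove every word of `text` that is in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)

positive = drop_stop_words("this movie was good", MINI_STOP_WORDS)
negative = drop_stop_words("this movie was not good", MINI_STOP_WORDS)

print(positive)  # movie good
print(negative)  # movie good
```

A common workaround is to remove such negation words from the stop word set before filtering.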

**B. NORMALIZATION**

A common next step in text preprocessing is normalization: converting all of the different forms of a given word into one. 
<br> There are two methods for this: Stemming and Lemmatization. 

**1. Stemming**

Stemming is considered to be a brute-force approach to normalization. There are several algorithms for stemming, and in general they all use basic rules to chop off the ends of words. NLTK has several stemming algorithm implementations, and we use the Porter stemmer here.

In [23]:
print('Illustration of a review before stemming')
print('Number of words:', len(reviews_train[0].split()))
print(reviews_train[0])

Illustration of a review before stemming
Number of words: 137
bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my years in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it is


In [24]:
def stem(corpus):
    
    stemmer = PorterStemmer()
    stemmed_reviews = []
    
    for review in corpus:
        
        stemmed = ' '.join([stemmer.stem(word) for word in review.split()])
        stemmed_reviews.append(stemmed)
    
    return stemmed_reviews

In [25]:
reviews_train_stemmed = stem(reviews_train)
reviews_test_stemmed = stem(reviews_test)

cv = CountVectorizer(binary = True)
cv.fit(reviews_train_stemmed)
X = cv.transform(reviews_train_stemmed)
X_test = cv.transform(reviews_test_stemmed)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [1, 0.5, 0.25, 0.05, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set for C = %.2f: %.3f/%.3f" % (c, 
                                                 accuracy_score(y_train, lr.predict(X_train)),
                                                 accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set for C = 1.00: 0.994/0.868
Accuracy on train/validation set for C = 0.50: 0.986/0.870
Accuracy on train/validation set for C = 0.25: 0.972/0.871
Accuracy on train/validation set for C = 0.05: 0.937/0.874
Accuracy on train/validation set for C = 0.01: 0.901/0.868


In [26]:
best_model = LogisticRegression(C = 0.25)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.969/0.869


In [27]:
X.shape

(25000, 48439)

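The number of unique words has dropped drastically from 71k to 48k, yet the test accuracy is surprisingly lower: 0.869, down from 0.881. Part of the gap may come from fitting the final model with C = 0.25, although C = 0.05 scored best on the validation set.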

In [28]:
print('Illustration of a review after stemming')
print('Number of words:', len(reviews_train_stemmed[0].split()))
print(reviews_train_stemmed[0])

Illustration of a review after stemming
Number of words: 137
bromwel high is a cartoon comedi it ran at the same time as some other program about school life such as teacher my year in the teach profess lead me to believ that bromwel high satir is much closer to realiti than is teacher the scrambl to surviv financi the insight student who can see right through their pathet teacher pomp the petti of the whole situat all remind me of the school i knew and their student when i saw the episod in which a student repeatedli tri to burn down the school i immedi recal at high a classic line inspector i here to sack one of your teacher student welcom to bromwel high i expect that mani adult of my age think that bromwel high is far fetch what a piti that it is


Let's have a look at the effect of stemming

In [29]:
differences = []

for before, after in zip(reviews_train[0].split(), reviews_train_stemmed[0].split()):
    if after != before:
        differences.append([before, after])

print(len(differences))
for i in range(5):
    print(differences[i][0], differences[i][1])

36
bromwell bromwel
comedy comedi
programs program
teachers teacher
years year
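The vocabulary shrinks because several word forms collapse onto one stem. The toy suffix stripper below illustrates the idea; it is deliberately far cruder than the Porter algorithm used above, and `toy_stem` is a hypothetical helper.

```python
def toy_stem(word):
    """Strip one common suffix (a crude stand-in, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["teacher", "teachers", "teaching", "burn", "burned", "burns"]
stems = [toy_stem(w) for w in words]

vocabulary_before = set(words)
vocabulary_after = set(stems)

print(stems)  # ['teacher', 'teacher', 'teach', 'burn', 'burn', 'burn']
print(len(vocabulary_before), len(vocabulary_after))  # 6 3
```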


**2. Lemmatization**

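Lemmatization identifies the part of speech of a given word and then applies more complex rules to transform the word into its dictionary form (lemma). Note that WordNetLemmatizer.lemmatize() treats every word as a noun unless a part-of-speech tag is passed to it, so the code below, which passes none, effectively lemmatizes nouns only.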

In [30]:
def lemmatize(corpus):
    
    lemmatizer = WordNetLemmatizer()
    lemmatized_reviews = []
    
    for review in corpus:
               
        lemmatized = ' '.join([lemmatizer.lemmatize(word) for word in review.split()])
        lemmatized_reviews.append(lemmatized)
        
    return lemmatized_reviews

In [32]:
reviews_train_lemmatized = lemmatize(reviews_train)
reviews_test_lemmatized = lemmatize(reviews_test)

cv = CountVectorizer(binary = True)
cv.fit(reviews_train_lemmatized)
X = cv.transform(reviews_train_lemmatized)
X_test = cv.transform(reviews_test_lemmatized)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [1, 0.5, 0.25, 0.05, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set for C = %.2f: %.3f/%.3f" % (c, 
                                                 accuracy_score(y_train, lr.predict(X_train)),
                                                 accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set for C = 1.00: 0.997/0.871
Accuracy on train/validation set for C = 0.50: 0.993/0.872
Accuracy on train/validation set for C = 0.25: 0.983/0.877
Accuracy on train/validation set for C = 0.05: 0.946/0.876
Accuracy on train/validation set for C = 0.01: 0.904/0.867


In [34]:
best_model = LogisticRegression(C = 0.25)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.980/0.874


In [35]:
X.shape             

(25000, 63921)

The number of unique words is higher than with stemming.
<br>Stemming reduced the number of unique words from 71k to 48k, while lemmatization has reduced it to about 64k.
<br>The test accuracy (0.874) is marginally lower than that of the baseline (0.881).
<br>Lemmatization seemed to take less computing time than stemming, although no timings were recorded in this notebook.

In [36]:
print('Illustration of a review after lemmatization')
print('Number of words:', len(reviews_train_lemmatized[0].split()))
print(reviews_train_lemmatized[0])

Illustration of a review after lemmatization
Number of words: 137
bromwell high is a cartoon comedy it ran at the same time a some other program about school life such a teacher my year in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teacher the scramble to survive financially the insightful student who can see right through their pathetic teacher pomp the pettiness of the whole situation all remind me of the school i knew and their student when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i here to sack one of your teacher student welcome to bromwell high i expect that many adult of my age think that bromwell high is far fetched what a pity that it is


Now, let's have a look at the effect of lemmatization

In [37]:
differences = []

for before, after in zip(reviews_train[0].split(), reviews_train_lemmatized[0].split()):
    if after != before:
        differences.append([before, after])

print(len(differences))
for i in range(5):
    print(differences[i][0], differences[i][1])

12
as a
programs program
as a
teachers teacher
years year


**C. N-GRAMS**

Till now, we have used only single-word features in our model, which we call unigrams or 1-grams. The predictive power of our model can potentially be enhanced by adding two-word (bigram) or three-word (trigram) sequences.

For example, if we have the sequence 'didn't like the movie', single-word features won't be able to capture that the review is negative, as the word 'like' on its own points towards a positive review.
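As a rough sketch of what an n-gram range of (1, 2) produces, the hypothetical helper `word_ngrams` below lists all unigrams and bigrams of a whitespace-tokenized text. It is an illustration of the idea, not CountVectorizer's implementation.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of `text` for n_min <= n <= n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("did not like the movie")
print(features)
# ['did', 'not', 'like', 'the', 'movie',
#  'did not', 'not like', 'like the', 'the movie']
```

The bigram 'not like' gives the classifier a feature that can receive a negative weight even when the unigram 'like' has a positive one.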

In [38]:
cv = CountVectorizer(binary = True, ngram_range = (1,2))
cv.fit(reviews_train)
X = cv.transform(reviews_train)
X_test = cv.transform(reviews_test)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [10, 5, 1, 0.1, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                  accuracy_score(y_train, lr.predict(X_train)),
                                                  accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set with C = 10.00: 1.000/0.888
Accuracy on train/validation set with C = 5.00: 1.000/0.889
Accuracy on train/validation set with C = 1.00: 1.000/0.888
Accuracy on train/validation set with C = 0.10: 1.000/0.886
Accuracy on train/validation set with C = 0.01: 0.968/0.877


In [39]:
best_model = LogisticRegression(C = 1.00)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 1.000/0.897


Just including bigrams, in addition to the unigrams, has improved the test accuracy of the model from 0.881 to 0.897.

In [40]:
print(X.shape)

(25000, 1433854)


The number of features has grown by about a factor of 20 (from 71k to 1.4M). 
<br>Such a huge increase in the number of features may cause overfitting, as evident in the perfect accuracy on the training set.

**D. REPRESENTATIONS**

Till now, we considered only binary representation of a review: if a word appears in the review, its corresponding entry in the column is 1 otherwise 0. 
<br>We can further improve the representation. 

**1. Word Counts**

Instead of just representing whether a word exists in the review by 1, we can enter the number of times a particular word appears. For example, if the word 'amazing' appears three times, our entry in the word counts representation will be 3. This can enhance the predictive power of the algorithm.
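The difference between the two representations can be sketched with the standard library alone. The review text below is made up for illustration.

```python
from collections import Counter

review = "amazing cast amazing story truly amazing"   # made-up example review

# Word counts: how many times each word occurs in the review
counts = Counter(review.split())

# Binary representation: 1 for every word that occurs at least once
binary = {word: 1 for word in counts}

print(counts["amazing"], binary["amazing"])  # 3 1
print(counts["story"], binary["story"])      # 1 1
```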

In [41]:
cv = CountVectorizer(binary = False)
cv.fit(reviews_train)
X = cv.transform(reviews_train)
X_test = cv.transform(reviews_test)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [1, 0.5, 0.25, 0.05, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                  accuracy_score(y_train, lr.predict(X_train)),
                                                  accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set with C = 1.00: 0.999/0.883
Accuracy on train/validation set with C = 0.50: 0.996/0.888
Accuracy on train/validation set with C = 0.25: 0.990/0.891
Accuracy on train/validation set with C = 0.05: 0.960/0.895
Accuracy on train/validation set with C = 0.01: 0.917/0.886


In [43]:
best_model = LogisticRegression(C = 0.05)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.958/0.881


In [44]:
X.shape

(25000, 71400)

**2. TF-IDF**

Another popular way to represent each document (review) in a corpus is to use the tf-idf statistic (term frequency-inverse document frequency).
<br> It is a weighting factor for the features (words here), that we can use in place of binary or word count representations.

The tf-idf of a word in a document is given by:
 - tf-idf = term frequency * inverse document frequency
 - term frequency: number of times the word appears in the document
 - inverse document frequency: log(total number of documents/number of documents in which this particular word appears)


Thus tf-idf
 - increases if the word appears multiple times in the document
 - decreases as the word appears in more documents of the corpus
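The definition above can be computed directly, as in the following sketch on three made-up documents. Note that this is the textbook formula as stated in the list. TfidfVectorizer, with its default settings, uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its values differ from these. The function `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook tf-idf: tf * log(n_docs / doc_freq), one dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word occurs
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({word: tf * math.log(n_docs / doc_freq[word])
                       for word, tf in term_freq.items()})
    return scores

docs = ["great movie great cast", "boring movie", "great fun"]   # made-up corpus
scores = tf_idf(docs)

for word, value in sorted(scores[0].items()):
    print(word, round(value, 3))
```

In the first document, 'cast' occurs once but in no other document, so it scores higher than 'great', which occurs twice but is shared with another document.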

In [45]:
tv = TfidfVectorizer()
tv.fit(reviews_train)
X = tv.transform(reviews_train)
X_test = tv.transform(reviews_test)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for c in [10, 5, 1, 0.1, 0.01]:
    
    lr = LogisticRegression(C = c)
    lr.fit(X_train, y_train)
    print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                 accuracy_score(y_train, lr.predict(X_train)),
                                                 accuracy_score(y_val, lr.predict(X_val))))



Accuracy on train/validation set with C = 10.00: 0.989/0.892
Accuracy on train/validation set with C = 5.00: 0.975/0.893
Accuracy on train/validation set with C = 1.00: 0.929/0.888
Accuracy on train/validation set with C = 0.10: 0.864/0.846
Accuracy on train/validation set with C = 0.01: 0.796/0.794


In [47]:
best_model = LogisticRegression(C = 1)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 0.929/0.882


In [48]:
X.shape

(25000, 71400)

**Top positive and negative features**

In [50]:
feature_coeff = {word: coeff for word, coeff in zip(tv.get_feature_names(), best_model.coef_[0])}

print('Positive words')
for best_positive in sorted(feature_coeff.items(), key = lambda x: x[1], reverse = True)[:5]:
    print(best_positive)
    
print('--------------')
   
print('Negative words')    
for best_negative in sorted(feature_coeff.items(), key = lambda x: x[1])[:5]:
    print(best_negative)

Positive words
('great', 7.5839160204097285)
('excellent', 6.201453924200456)
('best', 5.067651708523199)
('perfect', 4.75066536258515)
('wonderful', 4.612627848437834)
--------------
Negative words
('worst', -9.35768375223538)
('bad', -7.911932899691316)
('waste', -6.360866543954613)
('awful', -6.225442606697068)
('boring', -5.834843690236099)


It is interesting to note that the top five discriminating words are not the same as in our baseline model, nor are they in the same order.

## Final Model

In this notebook, we chose to represent each review as a very sparse vector, with lots of zeroes. Linear classifiers typically perform well on data that is represented this way. Another algorithm that could work well is a Support Vector Machine with a linear kernel.

We discussed several options for transforming text that can improve the accuracy of an NLP model. Which combination of these techniques yields the best results will depend on the task, the data representation, and the algorithms we choose. One should always try out many different combinations to see what works. 


For now, let's combine a few of the techniques we learnt to create a final model:

In [51]:
def combined_processing(reviews_train_raw, reviews_test_raw):
    
    data_train = tokenize(reviews_train_raw)
    data_test = tokenize(reviews_test_raw)
    
    data_train = remove_stop_words(data_train)
    data_test = remove_stop_words(data_test)
    
    data_train = lemmatize(data_train)
    data_test = lemmatize(data_test)
    
    cv = CountVectorizer(binary = False, ngram_range = (1,2))
    cv.fit(data_train)
    X = cv.transform(data_train)
    X_test = cv.transform(data_test)
    
    return X, X_test

In [52]:
X, X_test = combined_processing(reviews_train_raw, reviews_test_raw)

In [53]:
def search_best_model(X, target, clist):
    
    X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)
    
    for c in clist:
        
        lr = LogisticRegression(C = c)
        lr.fit(X_train, y_train)
        print("Accuracy on train/validation set with C = %.2f: %.3f/%.3f" % (c, 
                                                      accuracy_score(y_train, lr.predict(X_train)),
                                                      accuracy_score(y_val, lr.predict(X_val))))  

In [54]:
clist = [10, 5, 1, 0.1, 0.1]   # note: 0.1 appears twice (0.01 was likely intended), so the last two output rows are identical
search_best_model(X, target, clist)



Accuracy on train/validation set with C = 10.00: 1.000/0.889
Accuracy on train/validation set with C = 5.00: 1.000/0.890
Accuracy on train/validation set with C = 1.00: 1.000/0.889
Accuracy on train/validation set with C = 0.10: 0.999/0.886
Accuracy on train/validation set with C = 0.10: 0.999/0.886


In [55]:
best_model = LogisticRegression(C = 1.0)
best_model.fit(X, target)
print("Final accuracy on train/test set: %.3f/%.3f" % (accuracy_score(target, best_model.predict(X)),
                                                       accuracy_score(target, best_model.predict(X_test))))

Final accuracy on train/test set: 1.000/0.882


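Even with regularization, the accuracy on the test set (0.882) is well below that on the training set (1.000), signalling overfitting. The combined model is also no better than the baseline (0.881) and falls short of the model that only added bigrams (0.897). The model could likely be improved by collecting more data.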