# Sentiment Analysis of IMDb Movie Reviews

The IMDB movie reviews dataset is a set of 50,000 reviews, half of which are positive and the other half negative.

The dataset is available online and can be downloaded directly from Stanford’s website.

# Importing Libraries

In [68]:
import os
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading the Data 

We now have a data folder called aclImdb. From there, we can use the load_train_test_imdb_data function to load the training and test datasets.

In [69]:
def load_train_test_imdb_data(data_dir):
    """Loads the IMDB train/test datasets from a folder path.
    Input:
    data_dir: path to the "aclImdb" folder.
    
    Returns:
    train/test datasets as pandas dataframes.
    """

    data = {}
    for split in ["train", "test"]:
        data[split] = []
        for sentiment in ["neg", "pos"]:
            score = 1 if sentiment == "pos" else 0

            path = os.path.join(data_dir, split, sentiment)
            file_names = os.listdir(path)
            for f_name in file_names:
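                # Read explicitly as UTF-8 so loading does not depend on the platform's default encoding
                with open(os.path.join(path, f_name), "r", encoding="utf-8") as f: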
                    review = f.read()
                    data[split].append([review, score])

    np.random.shuffle(data["train"])        
    data["train"] = pd.DataFrame(data["train"],
                                 columns=['text', 'sentiment'])
    print(data["train"])
    np.random.shuffle(data["test"])
    data["test"] = pd.DataFrame(data["test"],
                                columns=['text', 'sentiment'])
    print(data["test"])
    return data["train"], data["test"]


In [70]:
train_data, test_data = load_train_test_imdb_data(
    data_dir="aclImdb/")

                                                    text  sentiment
0      After the initial shock of realizing the guts ...          1
1      'The Shop Around the Corner (1940)' is a pleas...          1
2      How this movie escaped the wrath of MST3K I'll...          0
3      A romanticised and thoroughly false vision of ...          0
4      I, like many die-hard Trekkers (or Trekkies, i...          1
5      Nacho Vigalondo is very famous in Spain. He is...          0
6      In the areas where they overlap this fine movi...          1
7      Okay... it seems like so far, only the Barman ...          0
8      This movie was one of the best movies that I h...          1
9      I love this movie. It's wacky, funny, violent,...          1
10     I agree with one of the other comment writers ...          0
11     I was lucky enough to see this at a pre-screen...          1
12     If this film strikes you (as it did us and, ap...          1
13     This is going to be the most useless comm

# Pre-Processing

In [71]:
def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)    
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)    
    
    # convert text to lowercase
    text = text.strip().lower()
    
    # replace punctuation characters with spaces
    filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text

In [72]:
clean_text("<div>This is not a sentence.</div>").split()

['this', 'is', 'not', 'a', 'sentence']

# Model Building

Let’s train a sentiment analysis classifier. One thing to keep in mind is that the feature vectors produced by a bag-of-words (BOW) representation are usually very large (about 80,000 dimensions in this case). So we need simple algorithms that remain efficient with a large number of features (e.g., Naive Bayes, linear SVM, or logistic regression). As an example, let’s train a linear SVM classifier.
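To make the bag-of-words idea concrete, here is a minimal sketch in plain Python on a two-sentence toy corpus. The helper names `build_vocabulary` and `bow_vector` are hypothetical and exist only for this illustration; CountVectorizer does the same job at scale, with its own tokenization, stop-word removal, and sparse matrices.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: idx for idx, word in enumerate(words)}


def bow_vector(text, vocabulary):
    """Return the word-count vector of `text` over `vocabulary`."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words never seen during fitting are ignored
            vector[vocabulary[word]] = count
    return vector


toy_texts = ["good movie good cast", "bad movie"]
vocabulary = build_vocabulary(toy_texts)
vectors = [bow_vector(text, vocabulary) for text in toy_texts]

print(vocabulary)  # {'bad': 0, 'cast': 1, 'good': 2, 'movie': 3}
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each text becomes a vector with one entry per vocabulary word, which is why the dimensionality grows with the number of distinct words in the corpus.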

Because the IMDB dataset is balanced, we can evaluate our model using the accuracy score (i.e., the proportion of samples that were correctly classified).
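The following sketch shows why balance matters when reading an accuracy score. It computes accuracy by hand for a trivial classifier that always predicts the positive class. The function `accuracy` and the label lists are made up for illustration; the notebook itself uses accuracy_score from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


always_positive = [1] * 8                      # a classifier that learned nothing
balanced_labels = [1, 0, 1, 0, 1, 0, 1, 0]     # 50% positive, like IMDB
imbalanced_labels = [1, 1, 1, 1, 1, 1, 1, 0]   # 87.5% positive

print(accuracy(balanced_labels, always_positive))    # 0.5
print(accuracy(imbalanced_labels, always_positive))  # 0.875
```

On balanced data the trivial classifier scores only 50%, so any accuracy well above that reflects something the model actually learned. On imbalanced data the same classifier would look deceptively good.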

In [73]:
# Transform each text into a vector of word counts
vectorizer = CountVectorizer(stop_words="english",
                             preprocessor=clean_text
                             )
training_features = vectorizer.fit_transform(train_data["text"]) 
print('Training Features' + str(training_features))
test_features = vectorizer.transform(test_data["text"])
print('Testing Features' + str(test_features))
# Training
model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)
print(y_pred)
# Evaluation
acc = accuracy_score(test_data["sentiment"], y_pred)
print("Accuracy on the IMDB dataset: {:.2f}".format(acc*100))

Training Features  (0, 41711)	1
  (0, 46720)	1
  (0, 47352)	1
  (0, 51636)	1
  (0, 41463)	1
  (0, 16130)	1
  (0, 21603)	1
  (0, 47382)	1
  (0, 78531)	1
  (0, 41732)	1
  (0, 62762)	1
  (0, 23353)	1
  (0, 45974)	1
  (0, 8270)	1
  (0, 1496)	1
  (0, 63423)	1
  (0, 73151)	1
  (0, 19890)	1
  (0, 2382)	1
  (0, 38186)	1
  (0, 23637)	1
  (0, 32137)	1
  (0, 24419)	1
  (0, 63438)	1
  (0, 41757)	1
  :	:
  (24999, 77034)	1
  (24999, 65051)	1
  (24999, 46110)	1
  (24999, 55179)	1
  (24999, 57644)	1
  (24999, 9260)	1
  (24999, 25049)	1
  (24999, 21940)	1
  (24999, 71200)	1
  (24999, 67836)	1
  (24999, 38813)	1
  (24999, 71642)	1
  (24999, 27740)	1
  (24999, 10313)	1
  (24999, 41787)	1
  (24999, 40)	1
  (24999, 19530)	1
  (24999, 42150)	1
  (24999, 7551)	1
  (24999, 77214)	1
  (24999, 61811)	1
  (24999, 71195)	1
  (24999, 41463)	2
  (24999, 47382)	4
  (24999, 38186)	2
Testing Features  (0, 5188)	1
  (0, 7026)	1
  (0, 13377)	1
  (0, 14462)	1
  (0, 14486)	1
  (0, 14666)	1
  (0, 14760)	1
  (0, 15424)	1
 

As you can see, by following some very basic steps and using a simple linear model, we were able to reach 83.67% accuracy on the IMDB dataset. For comparison, a recent state-of-the-art model can reach around 95% accuracy. So this isn’t bad at all, but there is still room for improvement.

# Improving the Current Model

Putting aside anything fine-tuning related, there are some changes we can make to immediately improve the current model.

The first thing we can do is improve the vectorization step. In fact, there are some biases associated with only looking at how many times a word occurs in a text. In particular, the longer the text, the higher its features (word counts) will be.

To fix this issue, we can use Term Frequency (TF) instead of word counts, dividing the number of occurrences by the sequence length. We can also downscale these frequencies so that words that occur all the time (e.g., topic-related or stop words) have lower values. This downscaling factor is called Inverse Document Frequency (IDF) and is equal to the logarithm of the inverse of the word’s document frequency (the fraction of documents that contain the word).

Put together, these new features are called TF-IDF features.
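As a worked illustration of the definitions above, here is a minimal sketch of the textbook formula, TF = count / length and IDF = log(number of documents / number of documents containing the word), on a toy corpus. The function `tf_idf` and the example sentences are hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from these.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: tf-idf weight} dict per document (textbook formula)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "the movie was good".split(),
    "the movie was bad".split(),
    "the cast was good good good".split(),
]
weights = tf_idf(docs)

print(weights[0]["the"])    # 0.0, because "the" appears in every document
print(weights[1]["bad"])    # 0.25 * log(3): rare word, high weight
print(weights[1]["movie"])  # 0.25 * log(1.5): more common word, lower weight
```

Words that appear in every document get a weight of zero, while words that are specific to a few documents stand out.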

In practice, we can train a new linear SVM on TF-IDF features simply by replacing the CountVectorizer with a TfidfVectorizer. This results in an accuracy of 86.64%, an improvement of about 3 percentage points over using BOW features.

The second thing we can do to further improve our model is to provide it with more context. In fact, considering every word independently can lead to some errors. For instance, if the word "good" occurs in a text, we will naturally tend to say that this text is positive, even if the expression that actually occurs is "not good". These mistakes can be easily avoided with the introduction of N-grams.

An N-gram is a sequence of N successive words (e.g., "very good" is a 2-gram and "not good at all" is a 4-gram). Using N-grams, we produce richer word sequences.
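The sketch below shows how N-grams can be extracted from a tokenized sentence. The helper `extract_ngrams` is a hypothetical name used only for this illustration; its `n_min` and `n_max` arguments play the same role as the two bounds of a vectorizer's ngram_range.

```python
def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all N-grams of `tokens` for n_min <= N <= n_max, as strings."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the sequence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "this movie is not good".split()
features = extract_ngrams(tokens)

print(features)
# ['this', 'movie', 'is', 'not', 'good',
#  'this movie', 'movie is', 'is not', 'not good']
```

With unigrams alone, the only sentiment-bearing feature is "good". Adding bigrams exposes "not good" as a feature of its own, which a linear model can weight negatively.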

Including N-grams in our TF-IDF vectorizer is as simple as providing an additional parameter ngram_range=(1, N). Generally speaking, the use of bi-grams improves performance, as we provide more context to the model, while higher-order N-grams have less obvious effects.

In [74]:
# Transform each text into a vector of TF-IDF weights over unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2))
training_features = vectorizer.fit_transform(train_data["text"]) 
print('Training Features' + str(training_features))
test_features = vectorizer.transform(test_data["text"])
print('Testing Features' + str(test_features))
# Training
model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)
print(y_pred)
# Evaluation
acc = accuracy_score(test_data["sentiment"], y_pred)
print("Accuracy on the IMDB dataset: {:.2f}".format(acc*100))

Training Features  (0, 802371)	0.09025965705623717
  (0, 1421345)	0.08149705438412425
  (0, 1277615)	0.10097546672468202
  (0, 698733)	0.09595322966817982
  (0, 1058946)	0.06578975509830308
  (0, 186034)	0.209721397373683
  (0, 579942)	0.024618515726830645
  (0, 929634)	0.07770562225111864
  (0, 1413873)	0.10724381354558746
  (0, 525423)	0.08949794410676692
  (0, 720082)	0.0529985759492894
  (0, 505809)	0.0852939452746549
  (0, 847334)	0.028796607126835218
  (0, 52700)	0.08001956632123572
  (0, 427090)	0.0675906058364338
  (0, 1640547)	0.05470942639432049
  (0, 1413588)	0.09434697876495375
  (0, 23135)	0.10846173798683674
  (0, 165800)	0.10066626350827777
  (0, 1020574)	0.05614456052696763
  (0, 498267)	0.05778390344001641
  (0, 1389937)	0.05534900655597173
  (0, 929121)	0.09156229209556034
  (0, 1758559)	0.06566198870605877
  (0, 1047782)	0.02310697929984962
  :	:
  (24999, 1507994)	0.11016228441673938
  (24999, 1713780)	0.13518656645149235
  (24999, 1052839)	0.13518656645149235
  (24

Putting it all together, we achieve an even higher accuracy of 88.66%, which is another improvement of about 2 percentage points over the previous version of the model.