# Example Application: Sentiment Analysis of Movie Reviews

## Terms

### corpus
dataset

### document
each datapoin, represented as a single text, is called a document

## Download Dataset

In [2]:
# !wget -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -P data
# !tar xzf data/aclImdb_v1.tar.gz --skip-old-files -C data

In [3]:
! tree -dL 2 data/aclImdb

data/aclImdb
├── test
│   ├── neg
│   └── pos
└── train
    ├── neg
    ├── pos
    └── unsup

7 directories


The unsup folder contains unlabeled data, which we won’t use, and thereforeremove:

In [4]:
!rm -r data/aclImdb/train/unsup

There is ahelper function in scikit-learn to load files stored in such afolder structure, where each subfolder corresponds to a label, calledload_files .

In [5]:
from sklearn.datasets import load_files

# load_files returns a bunch, containing training texts and training labels
reviews_train = load_files("data/aclImdb/train/")

text_train, y_train = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]: \n {}".format(text_train[6]))

type of text_train: <class 'list'>
length of text_train: 25000
text_train[6]: 
 b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. <br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."


You can also see that the review contains some HTML line breaks. While these are unlikely to have a large impact on ourmachine learning models, it is better to clean the data and remove thisformatting before we proceed:

In [6]:
text_train = [doc.replace(b"<br />", b" " ) for doc in text_train] 

The dataset was collected such that the positive class and the negativeclass balanced, so that there are as many positive as negative strings:

In [7]:
import numpy as np

print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500]


We load the test dataset in the same manner:

In [8]:
reviews_test = load_files("data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Number of documents in test data: 25000
Samples per class (test): [12500 12500]


## Representing Text Data as a Bag of Words

In [9]:
bards_words = ["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Fitting the CountVectorizer consists of the tokenization of thetraining data and building of the vocabulary, which we can access as thevocabulary_ attribute:

In [10]:
print("Vocabulary size: {}".format (len(vect.vocabulary_)))
print("Vocabulary content: \n {}".format(vect.vocabulary_))

Vocabulary size: 13
Vocabulary content: 
 {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


To create the bag-of-words representation for the training data, we call the transform method:

In [11]:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


A sparse matrix is usedas most documents only contain a small subset of the words in thevocabulary, meaning most entries in the feature array are 0. Thinkabout how many different words might appear in a movie review comparedto all the words in the English language (which is what the vocabularymodels). Storing all those zeros would be prohibitive, and a waste ofmemory. To look at the actual content of the sparse matrix, we canconvert it to a “dense” NumPy array (that also stores all the 0entries) using the toarray method:

We can see that the word counts for each word are either 0 or 1;neither of the two strings in bards_words contains a word twice.

In [12]:
print("Dense representation of bag_of_words: \n {}".format(bag_of_words.toarray()))

Dense representation of bag_of_words: 
 [[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


## Bag-of-Words for Movie Reviews

In [13]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train: \n {}".format(repr(X_train)))

X_train: 
 <25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>


The shape of X_train , the bag-of-words representation of the trainingdata, is 25,000×74,849, indicating that the vocabulary contains 74,849entries.

In [14]:
feature_names = vect.get_feature_names ()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features: \n {}".format(feature_names[:20]))
print("Features 20010 to 20030: \n {}".format(feature_names[20010 : 20030]))
print("Every 2000th feature: \n {}".format(feature_names[::2000]))

Number of features: 74849
First 20 features: 
 ['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']
Features 20010 to 20030: 
 ['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']
Every 2000th feature: 
 ['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']


For high-dimensional, sparse data like this,linear models like LogisticRegression often work best. Let’s start byevaluating LogisticRegression using cross-validation

We obtain a mean cross-validation score of 88%, which indicatesreasonable performance for a balanced binary classification task.

In [15]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv = 5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))



Mean cross-validation accuracy: 0.88


We know that LogisticRegression has a regularization parameter, C , whichwe can tune via cross-validation:

In [16]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C' : [ 0.001 , 0.01 , 0.1 , 1 , 10]}
grid = GridSearchCV(LogisticRegression(), param_grid , cv = 5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)



Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}


One way to cut back on these is to only use tokens thatappear in at least two documents (or at least five documents, and so on). A tokenthat appears only in a single document is unlikely to appear in the testset and is therefore not helpful. We can set the minimum number of documents a token needs to appear in with the min_df parameter

By requiring at least five appearances of each token, we can bring downthe number of features to 27,271, as seen in the preceding output only about athird of the original features. Let’s look at some tokens again:

In [17]:
vect = CountVectorizer(min_df = 5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
	with 3354014 stored elements in Compressed Sparse Row format>


There are clearly many fewer numbers, and some of the more obscure wordsor misspellings seem to have vanished. Let’s see how well our modelperforms by doing a grid search again:

In [18]:
feature_names = vect.get_feature_names()
print("First 50 features: \n {}".format(feature_names[: 50]))
print("Features 20010 to 20030: \n {}".format(feature_names[20010 : 20030]))
print("Every 700th feature: \n {}".format(feature_names[:: 700]))

First 50 features: 
 ['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '100th', '101', '102', '103', '104', '105', '107', '108', '10s', '10th', '11', '110', '112', '116', '117', '11th', '12', '120', '12th', '13', '135', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '1600', '16mm', '16s', '16th']
Features 20010 to 20030: 
 ['repentance', 'repercussions', 'repertoire', 'repetition', 'repetitions', 'repetitious', 'repetitive', 'rephrase', 'replace', 'replaced', 'replacement', 'replaces', 'replacing', 'replay', 'replayable', 'replayed', 'replaying', 'replays', 'replete', 'replica']
Every 700th feature: 
 ['00', 'affections', 'appropriately', 'barbra', 'blurbs', 'butchered', 'cheese', 'commitment', 'courts', 'deconstructed', 'disgraceful', 'dvds', 'eschews', 'fell', 'freezer', 'goriest', 'hauser', 'hungary', 'insinuate', 'juggle', 'leering', 'maelstrom', 'messiah', 'music', 'occasional', 'parking', 'pleasantville', 'pronu

The best validation accuracy of the grid search is still 89%,unchanged from before. We didn’t improve our model, but having fewerfeatures to deal with speeds up processing and throwing away uselessfeatures might make the model more interpretable.

In [20]:
grid = GridSearchCV(LogisticRegression(), param_grid , cv = 5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 



Best cross-validation score: 0.89


## Stopwords

using a language-specific list of stopwords, or discardingwords that appear too frequently. scikit-learn has a built-in list ofEnglish stopwords in the feature_extraction.text module:

In [23]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword: \n {}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword: 
 ['con', 'others', 'etc', 'wherever', 'two', 'bill', 'mostly', 'un', 'five', 'never', 'except', 'into', 'few', 'he', 'twelve', 'some', 'thence', 'themselves', 'somehow', 'can', 'through', 'sometimes', 'ltd', 'also', 'here', 'my', 'mine', 'those', 'there', 'nobody', 'everyone', 'empty']


There are now 305 (27,271–26,966) fewer features in the dataset, whichmeans that most, but not all, of the stopwords appeared.

In [24]:
# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.

vect = CountVectorizer(min_df = 5, stop_words = "english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words: \n {}".format(repr(X_train)))

X_train with stop words: 
 <25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>


Let’s run thegrid search again:

In [25]:
grid = GridSearchCV(LogisticRegression(), param_grid, cv = 5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 



Best cross-validation score: 0.88


In this case, tf–idf had no impact.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline ( TfidfVectorizer ( min_df = 5 ), LogisticRegression ())
param_grid = {'logisticregression__C' : [ 0.001 , 0.01 , 0.1 , 1 , 10 ]}
grid = GridSearchCV(pipe, param_grid, cv = 5)
grid.fit(text_train, y_train)
print("Best cross-validation score:  {:.2f}".format(grid.best_score_))



Best cross-validation score:  0.89


We can also inspect which words tf–idf foundmost important

In [51]:
vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]

# transform the training dataset
X_train = vectorizer.transform(text_train)
print("X_train type", type(X_train))
print("X_train.shape", X_train.shape)
print("X_train[0,0]", X_train[0,518])

# find maximum value for each of the features over the dataset
max_value = X_train.max(axis = 0).toarray().ravel() 
print("max_value type", type(max_value))
print("max_value.shape", max_value.shape)

sorted_by_tfidf = max_value.argsort()
print("sorted_by_tfidf[:5]", sorted_by_tfidf[:5])

# get feature names
feature_names = np.array(vectorizer.get_feature_names())
print("feature_names .shape", feature_names.shape)
print("Features with lowest tfidf: \n {}".format(feature_names[sorted_by_tfidf[:20]]))
print("Features with highest tfidf: \n {}".format(feature_names[sorted_by_tfidf[20:]])) 

X_train type <class 'scipy.sparse.csr.csr_matrix'>
X_train.shape (25000, 27271)
X_train[0,0] 0.10926829914120458
max_value type <class 'numpy.ndarray'>
max_value.shape (27271,)
sorted_by_tfidf[:5] [23668 10103 11968 16943 22554]
feature_names .shape (27271,)
Features with lowest tfidf: 
 ['suplexes' 'gauche' 'hypocrites' 'oncoming' 'songwriting' 'galadriel'
 'emerald' 'mclaughlin' 'sylvain' 'oversee' 'cataclysmic' 'pressuring'
 'uphold' 'thieving' 'inconsiderate' 'ware' 'denim' 'reverting' 'booed'
 'spacious']
Features with highest tfidf: 
 ['gliding' 'orientated' 'attained' ... 'scanners' 'steve' 'pokemon']
