The dataset used in this project comes from https://ai.stanford.edu/~amaas/data/sentiment/ and is described in the following paper:


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

In [1]:
import pandas as pd
import numpy as np

import os
import sys
from tqdm import tqdm

basepath = 'aclImdb'
rnd_seed = 42

In [2]:
# labels = {'pos': 1, 'neg': 0}
# pbar = tqdm(total=50000)

# data = []
# # df = pd.DataFrame()

# for s in ('test', 'train'):
#     for l in ('pos', 'neg'):
#         path = os.path.join(basepath, s, l)
#         for file in sorted(os.listdir(path)):
#             with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
#                 text = infile.read()
#             # df = pd.concat([df, pd.DataFrame([[text, labels[l]]])], ignore_index=True)
#             data.append([text, labels[l]])
#             pbar.update(1)

# df = pd.DataFrame(data)

# df.columns = ['review', 'sentiment']

# # store as a csv

# np.random.seed(rnd_seed)
# df = df.reindex(np.random.permutation(df.index))
# df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [3]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
print(df.info())
print(df.head(3))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB
None
                                              review  sentiment
0  I was taken to this film by a friend and was s...          1
1  This trash version of `Romeo and Juliet' passe...          1
2  There is a lot to like in this film, despite i...          1


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# This allows us to create a bag of words to treat as a feature
count = CountVectorizer()

docs = np.array(['The quick brown fox',
                 'jumps over the lazy dog',
                 'The sun shines brightly',
                 'and the weather is sweet',
                 'The quick brown fox jumps over the lazy dog',
                 'One plus one is two and two is one plus one'])

print(f"The number of doc strings is: {len(docs)}")

bag_of_words = count.fit_transform(docs)

print(count.vocabulary_)
print(len(count.vocabulary_))
print(bag_of_words.toarray())
print(bag_of_words.toarray().shape)

The number of doc strings is: 6
{'the': 15, 'quick': 11, 'brown': 2, 'fox': 4, 'jumps': 6, 'over': 9, 'lazy': 7, 'dog': 3, 'sun': 13, 'shines': 12, 'brightly': 1, 'and': 0, 'weather': 17, 'is': 5, 'sweet': 14, 'one': 8, 'plus': 10, 'two': 16}
18
[[0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0]
 [0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0]
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1]
 [0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 2 0 0]
 [1 0 0 0 0 2 0 0 4 0 2 0 0 0 0 0 2 0]]
(6, 18)


There are 18 distinct words in total across the 6 documents (doc strings). Each position in a feature vector corresponds to a word (for example, index 0 is 'and' and index 15 is 'the'), and the number at that position is the count of that word in the document.

This is an example of a 1-gram (unigram) representation: each feature index corresponds to a single word. An n-gram representation encodes sequences of n consecutive words. For example, the 2-gram decomposition of 'I like cats too' is 'I like', 'like cats', 'cats too', whereas the 1-gram representation is 'I', 'like', 'cats', 'too'.

The CountVectorizer can handle this as well: it just needs to be initialized with CountVectorizer(ngram_range=(2,2)).

A useful trick to avoid overloading the extracted features with words that don't hold much discriminatory information, such as 'and' and 'the', is known as term frequency-inverse document frequency (tf-idf). This is the product of the term frequency and the inverse document frequency. We have already seen the term frequency: $tf(t, d)$ is the number of times the term $t$ appears in the document $d$ (exactly what was encoded above). For example, $tf($'plus', 'One plus one is two and two is one plus one'$) = 2$. The document frequency, denoted $df(t)$, is the number of documents $d$ that contain the term $t$. In the example above, $df('and') = 2$.

With a total of $n_d$ documents, the inverse document frequency is given by $$idf(t) = \log \frac{n_d}{1 + df(t)},$$ and the tf-idf is simply $$tf\text{-}idf(t, d) = tf(t, d)\times idf(t)$$.

The tf-idf is used to weight the words, and the weight shrinks for a word that appears in many documents. Scikit-learn has a transformer for this. It also has a smoothing option, where $$idf(t) = \log \frac{1 + n_d}{1 + df(t)}.$$ Adding one to both the numerator and the denominator avoids division by zero, and a word that appears in all documents gets an idf of 0. Scikit-learn also computes $$tf\text{-}idf(t, d) = tf(t, d)\times (idf(t) + 1),$$ in order to prevent the weight of such a word from going to 0 completely.

It is typical to normalize the raw term frequencies before computing the $tf\text{-}idf$, but scikit-learn instead normalizes the $tf\text{-}idf$ vectors after the weighting.

To understand the calculation scikit-learn performs, let's work through one entry by hand. The word 'the' appears in 5 of the 6 documents, so $df('the') = 5$ and $$tf\text{-}idf('the', d_2) = tf('the', 'jumps\, over\, the\, lazy\, dog') \times (idf('the') + 1) = 1 \times (1 + \log \frac{1 + 6}{1 + 5}) = 1 + \log \frac{7}{6}.$$ This value fills one index of one feature vector. Once the entire feature vector has been calculated this way, we normalize it.
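
The hand calculation can be checked numerically. The following is a minimal sketch, assuming the smoothed formula and the l2 normalization described above; it recomputes the weights with NumPy and compares them with the output of TfidfTransformer. The variable names (df_t, raw, manual) are illustrative and not part of any library.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['The quick brown fox',
        'jumps over the lazy dog',
        'The sun shines brightly',
        'and the weather is sweet',
        'The quick brown fox jumps over the lazy dog',
        'One plus one is two and two is one plus one']

vec = CountVectorizer()
counts_sparse = vec.fit_transform(docs)
counts = counts_sparse.toarray()

n_d = counts.shape[0]                          # number of documents
df_t = (counts > 0).sum(axis=0)                # document frequency of each term
idf = np.log((1 + n_d) / (1 + df_t)) + 1       # smoothed idf, plus one

raw = counts * idf                             # unnormalized tf-idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # l2-normalize rows

sk = TfidfTransformer(use_idf=True, norm='l2',
                      smooth_idf=True).fit_transform(counts_sparse).toarray()

the_idx = vec.vocabulary_['the']
print(raw[1, the_idx], 1 + np.log(7 / 6))      # unnormalized weight of 'the' in d_2
print(np.allclose(manual, sk))                 # manual result vs. scikit-learn
```

The first print compares the unnormalized entry with $1 + \log \frac{7}{6}$, and the second confirms that the normalized matrix agrees with the transformer.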

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2', # Normalizes with an l2 norm post-weighting
                         smooth_idf=True)

print(bag_of_words.toarray())
print(np.round(tfidf.fit_transform(bag_of_words).toarray(), 2))

[[0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0]
 [0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0]
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1]
 [0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 2 0 0]
 [1 0 0 0 0 2 0 0 4 0 2 0 0 0 0 0 2 0]]
[[0.   0.   0.54 0.   0.54 0.   0.   0.   0.   0.   0.   0.54 0.   0.
  0.   0.34 0.   0.  ]
 [0.   0.   0.   0.48 0.   0.   0.48 0.48 0.   0.48 0.   0.   0.   0.
  0.   0.3  0.   0.  ]
 [0.   0.55 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.55 0.55
  0.   0.28 0.   0.  ]
 [0.43 0.   0.   0.   0.   0.43 0.   0.   0.   0.   0.   0.   0.   0.
  0.53 0.27 0.   0.53]
 [0.   0.   0.34 0.34 0.34 0.   0.34 0.34 0.   0.34 0.   0.34 0.   0.
  0.   0.43 0.   0.  ]
 [0.16 0.   0.   0.   0.   0.31 0.   0.   0.76 0.   0.38 0.   0.   0.
  0.   0.   0.38 0.  ]]


In [6]:
print(df.loc[1, 'review'][-50:])

 to the first group, my vote is eight.<br /><br />


The text still contains HTML markup and punctuation, so it needs cleaning up.

In [7]:
import re

def preprocess(txt):
    txt = re.sub(r'<[^>]*>', '', txt) # strips HTML tags such as <br />
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt) # finds emoticons and stores them temporarily
    # Lowercases the text and replaces each run of non-word characters with a single space,
    # then appends the emoticons at the end, with their noses ('-') removed for consistency
    txt = (re.sub(r'[\W]+', ' ', txt.lower()) +
           ' '.join(emoticons).replace('-', ''))
    return txt

preprocess(df.loc[1, 'review'][-50:])

' to the first group my vote is eight '
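
The sample above contains no emoticons, so that part of the function goes unexercised. The sketch below repeats the same function (so that it runs on its own) and applies it to a made-up string containing a tag and three emoticons; the test string is an assumption for illustration and does not come from the dataset.

```python
import re

def preprocess(txt):
    txt = re.sub(r'<[^>]*>', '', txt)  # strip HTML tags
    # collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt)
    # lowercase, collapse non-word runs to a space, append nose-less emoticons
    txt = (re.sub(r'[\W]+', ' ', txt.lower()) +
           ' '.join(emoticons).replace('-', ''))
    return txt

cleaned = preprocess('</a>This :) is :( a test :-)!')
print(repr(cleaned))
```

The emoticons end up at the end of the string in their original order, and ':-)' is reduced to ':)'.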

In [8]:
# The preprocessing works on the sample, so apply it to every review:
df['review'] = df['review'].apply(preprocess)

We need to tokenize the data next, that is, split each review into an array of words. The basic tokenizer simply splits the cleaned text on whitespace using the split function (punctuation was already removed during preprocessing).

Next, consider the words themselves. As it stands, 'love', 'loveable', 'loving', etc. would all end up as separate features, despite the near-identical meanings they bring to the context. Remedying this requires a technique called stemming.

We will use the NLTK library, which contains the Porter stemmer.
Essentially, the Porter stemmer classifies every character in a given token as either a consonant (c) or vowel (v), grouping consecutive consonants as C and consecutive vowels as V. The stemmer thus represents every word token as a combination of consonant and vowel groups. Once a token is represented this way, the stemmer runs it through a list of rules that specify which ending characters to remove according to the number of vowel-consonant groups in the token. Because English itself follows general but not absolute lexical rules, the Porter stemmer algorithm’s systematic criterion for determining suffix removal can produce incorrect stems.

Stemming allows us to tokenize into a smaller and more meaningful feature space.
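
The consonant/vowel grouping can be sketched in a few lines. This is not NLTK's implementation, only a rough illustration of the idea under the usual Porter convention that 'y' counts as a vowel when it follows a consonant; the helper names cv_pattern and measure are hypothetical.

```python
import re

def cv_pattern(word):
    """Label each character as consonant ('c') or vowel ('v')."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        if ch in 'aeiou':
            is_vowel = True
        elif ch == 'y':
            # 'y' is treated as a vowel only when preceded by a consonant
            is_vowel = i > 0 and pattern[i - 1] == 'c'
        else:
            is_vowel = False
        pattern.append('v' if is_vowel else 'c')
    return ''.join(pattern)

def measure(word):
    """Count the vowel-consonant (VC) groups after collapsing runs to C and V."""
    collapsed = re.sub(r'c+', 'C', re.sub(r'v+', 'V', cv_pattern(word)))
    return collapsed.count('VC')

for w in ('tree', 'trouble', 'troubles'):
    print(w, cv_pattern(w), measure(w))
```

Here 'tree' collapses to CV (no VC group), 'trouble' to CVCV (one group) and 'troubles' to CVCVC (two groups). The stemmer's rules use this count to decide whether a suffix may be removed.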

In [9]:
from nltk.stem.porter import PorterStemmer

def basic_tokenizer(txt):
    return txt.split()

porter = PorterStemmer()

def porter_tokenizer(txt):
    return [porter.stem(word) for word in txt.split()]

example_text_block = "workers must keep working away at their work and thus they do"

print(f"Basic tokenization: {basic_tokenizer(example_text_block)}")
print(f"Unique tokens in basic: {sorted(set(basic_tokenizer(example_text_block)))}")
print(f"Porter tokenization: {porter_tokenizer(example_text_block)}")
print(f"Unique tokens in porter: {sorted(set(porter_tokenizer(example_text_block)))}")

Basic tokenization: ['workers', 'must', 'keep', 'working', 'away', 'at', 'their', 'work', 'and', 'thus', 'they', 'do']
Unique tokens in basic: ['and', 'at', 'away', 'do', 'keep', 'must', 'their', 'they', 'thus', 'work', 'workers', 'working']
Porter tokenization: ['worker', 'must', 'keep', 'work', 'away', 'at', 'their', 'work', 'and', 'thu', 'they', 'do']
Unique tokens in porter: ['and', 'at', 'away', 'do', 'keep', 'must', 'their', 'they', 'thu', 'work', 'worker']


Stemming doesn't deal with everything: there are still many common words like 'and' and 'at' that add little information to the context. In NLP, these are known as _stop words_. The NLTK library can help us remove them.

In [10]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop = stopwords.words('english')

[w for w in porter_tokenizer(example_text_block) if w not in stop]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sadit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['worker', 'must', 'keep', 'work', 'away', 'work', 'thu']

In [11]:
from sklearn.model_selection import train_test_split

X = df['review'].values
y = df['sentiment'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=rnd_seed, test_size=0.5)

In [12]:
# Now, we move on to creating a classification model
# Optimize a logistic regressor with GridSearch
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__tokenizer': [basic_tokenizer, porter_tokenizer],
    'clf__penalty': ['l2'],
    'clf__C': [1.0, 10.0]
},
{
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [stop, None],
    'vect__tokenizer': [basic_tokenizer],
    'vect__use_idf': [False],
    'vect__norm': [None],
    'clf__penalty': ['l2'],
    'clf__C': [1.0, 10.0]
}]

lr_tfidf = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(estimator=lr_tfidf,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)

In [13]:
gs_lr_tfidf.fit(X_train, y_train)

print(gs_lr_tfidf.best_params_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits




{'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function basic_tokenizer at 0x00000283E5C532E0>}
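
The figure of 8 candidates and 40 fits follows from the grid definition: each dictionary contributes the product of the lengths of its value lists (2 x 2 = 4 for each of the two dictionaries), and every candidate is fitted once per fold. A minimal sketch of that count using ParameterGrid is given below; the strings 'basic', 'porter' and 'stop' are placeholders standing in for the tokenizer functions and the stop-word list, purely so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as param_grid above, with placeholder strings for the callables
sketch_grid = [{
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__tokenizer': ['basic', 'porter'],
    'clf__penalty': ['l2'],
    'clf__C': [1.0, 10.0]
},
{
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['basic'],
    'vect__use_idf': [False],
    'vect__norm': [None],
    'clf__penalty': ['l2'],
    'clf__C': [1.0, 10.0]
}]

n_candidates = len(ParameterGrid(sketch_grid))  # parameter combinations
n_fits = n_candidates * 5                       # one fit per fold with cv=5
print(n_candidates, n_fits)
```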


In [14]:
clf = gs_lr_tfidf.best_estimator_

print(f"Training accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Test accuracy: {clf.score(X=X_test, y=y_test):.3f}")

Training accuracy: 0.988
Test accuracy: 0.899


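Having completed the training and testing, we can conclude that our classifier predicts the sentiment of a movie review with approximately 90% accuracy on the held-out test set. The gap between the training accuracy (0.988) and the test accuracy (0.899) suggests some overfitting.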