# Applying ML to sentiment analysis

- dive into a subfield of NLP called sentiment analysis (also called opinion mining)


- cleaning and preparing text data
- building feature vectors from text docs
- training ml models to classify positive/ negative movie reviews
- working with large text datasets using out-of-core learning

In [1]:
import pyprind  # progress bar package by Sebastian Raschka: shows progress and estimated time until completion
import pandas as pd; import numpy as np
import os

In [2]:
os.getcwd()

'C:\\Users\\Schiphol\\Documents\\pml_sr'

In [3]:
# C:\Users\Gebruiker\Anaconda\data\aclImdb
# C:\Users\Gebruiker\Anaconda\data\aclImdb\aclImdb_v1.tar\aclImdb  ..(oeps)
#os.chdir('C:\\Users\\Gebruiker\\Anaconda\\data\\aclImdb\\')
os.chdir('C:\\Users\\Schiphol\\Documents\\data\\idmb_movie_reviews\\aclImdb_v1.tar\\aclImdb')

In [9]:
# Change `basepath` to the directory of the unzipped movie dataset. Run once!

basepath = 'C:\\Users\\Schiphol\\Documents\\data\\idmb_movie_reviews\\aclImdb_v1.tar\\aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)  # total number of docs we are reading in
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):  # outer loop: test and train folders
    for l in ('pos', 'neg'):  # inner loop: pos and neg folders
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            # encoding='utf8' is important to avoid a UnicodeDecodeError
            with open(os.path.join(path, file), 'r', encoding='utf8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:10


In [10]:
print (df.shape)
print (df.head(1))

(50000, 2)
                                              review  sentiment
0  I went and saw this movie last night after bei...          1


In [11]:
# class labels are sorted. Let's shuffle the dataframe:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

In [13]:
# store this to csv
df.to_csv('C:\\Users\\Schiphol\\Documents\\data\\move_date.csv')

### Bag-of-words model: transforming words into feature vectors

This model represents text as numerical feature vectors:

- create a vocabulary of unique tokens from the entire set of documents
- construct a feature vector from each doc that contains the counts of how often each word occurs in that doc.
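The two steps above can be sketched in plain Python before turning to scikit-learn. This is only an illustration: the helper names `simple_tokens` and `count_vector` are made up here, and the tokenization is a simplification of what CountVectorizer does by default.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def simple_tokens(text):
    # Lowercase, turn commas into spaces, split on whitespace
    # (a simplification of CountVectorizer's default tokenization).
    return text.lower().replace(',', ' ').split()

# Step 1: vocabulary of unique tokens over all documents, sorted alphabetically
vocabulary = sorted({tok for doc in docs for tok in simple_tokens(doc)})

def count_vector(text):
    # Step 2: one count per vocabulary entry for a single document
    counts = Counter(simple_tokens(text))
    return [counts.get(tok, 0) for tok in vocabulary]

bag_rows = [count_vector(doc) for doc in docs]
print(vocabulary)
for row in bag_rows:
    print(row)
```

For these three sentences the sketch yields the same vocabulary order and counts that CountVectorizer reports in the cells that follow.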

In [21]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

# Convert a collection of text documents to a matrix of token counts
# this constructed the vocabulary of the bag-of-words model and transformed the 3 sentences to sparse feature vectors

In [11]:
# type is : scipy.sparse.csr.csr_matrix = Compressed Sparse Row format
type(bag)

scipy.sparse.csr.csr_matrix

In [24]:
# unique words are mapped to integer indices.  
print(sorted(count.vocabulary_))

['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']


In [16]:
print (bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Each index position in the feature vectors corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For instance, the first feature at index position 0 represents the count of the word 'and', which occurs only in the last document. The word 'is' at index position 1 (the second feature in the document vectors) occurs in all 3 sentences.

- Those values are also called **raw term frequencies** - the number of times a term **T** occurs in a document **D**

### Assessing word relevancy via term frequency (TF) and inverse document frequency (IDF)

In the IDF, $n_d$ is the total number of docs, and $\text{df}(d, t)$ is the number of docs $d$ that contain the term $t$. The TfidfTransformer from scikit-learn takes the raw term frequencies from CountVectorizer and transforms them into tf-idfs.
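For reference, the definitions used in the worked example later in this section can be written out as follows. Here $n_d$ stands for the total number of documents and $\text{df}(d, t)$ for the number of documents containing term $t$. This is the smoothed variant (smooth_idf=True) that the worked example assumes, not the only possible definition of tf-idf:

$$\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t, d) + 1)$$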

In [17]:
np.set_printoptions(precision=2)

In [23]:
from sklearn.feature_extraction.text import TfidfTransformer 

# Transform a count matrix to a normalized tf or tf-idf representation. 
# note that a TfidfVectorizer is Equivalent to CountVectorizer followed by TfidfTransformer !

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=False)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.41  0.    0.58  0.58  0.    0.41  0.    0.  ]
 [ 0.    0.41  0.    0.    0.    0.58  0.41  0.    0.58]
 [ 0.54  0.39  0.54  0.18  0.18  0.18  0.26  0.27  0.18]]


So, the word 'is' had the largest term frequency in the 3rd doc. However, after transforming the same feature vector into tf-idfs, we see that the word 'is' is now associated with a relatively small tf-idf (0.39) in the 3rd doc, since it also occurs in docs 1 and 2 and is thus unlikely to contain any useful information.

To make sure that we understand how TfidfTransformer works, let us walk through an example and calculate the tf-idf of the word 'is' in the 3rd document.
The word 'is' has a term frequency of 3 (tf = 3) in document 3, and the document frequency of this term is 3 since the term 'is' occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:
$$\text{idf}("is", d3) = \log \frac{1+3}{1+3} = 0$$
Now, in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [25]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. However, the values in this feature vector differ from the values that we obtained from the TfidfTransformer that we used previously. One reason is that the earlier call used smooth_idf=False, whereas this walkthrough uses the smoothed idf. The other step that we are still missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d3) = 0.45$$

As we can see, the results match the results returned by scikit-learn's TfidfTransformer (below). Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

In [26]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

array([ 3.39,  3.  ,  3.39,  1.29,  1.29,  1.29,  2.  ,  1.69,  1.29])

In [27]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([ 0.5 ,  0.45,  0.5 ,  0.19,  0.19,  0.19,  0.3 ,  0.25,  0.19])
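As a cross-check, the same numbers can be reproduced from the raw counts alone, without scikit-learn. The sketch below assumes the smoothed idf, $\log\frac{1+n_d}{1+\text{df}}$ with 1 added before multiplying by tf, as in the walkthrough above. The variable names are illustrative.

```python
import math

# Raw term counts for the three example sentences (rows = docs, columns = vocabulary)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)

# Document frequency: in how many docs does each term occur at least once?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Smoothed idf with 1 added, as in the walkthrough
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Unnormalized tf-idf of the 3rd document, then L2-normalization
raw = [tf * w for tf, w in zip(counts[2], idf)]
norm = math.sqrt(sum(v * v for v in raw))
normalized = [v / norm for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in normalized])
```

Rounded to two decimals, `raw` and `normalized` agree with the two arrays shown above.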

### Cleaning text data


In [28]:
# to_csv writes UTF-8 by default, so read the file back with the same encoding
df = pd.read_csv('C:\\Users\\Schiphol\\Documents\\data\\move_date.csv', encoding='utf-8')

In [29]:
df = df.drop(['Unnamed: 0'], axis=1)

In [30]:
df.head()

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0
3  hi for all the people who have seen this wonde...          1
4  I recently bought the DVD, forgetting just how...          0


In [31]:
# the reviews contain lots of HTML markup, punctuation and other non-letter characters, e.g.:
df.loc[24152, 'review'][-50:]

'lone is worth the price of admission).<br /><br />'

In [32]:
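import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons such as :) or ;-(
    # lowercase, replace non-word characters with spaces, then re-append the emoticons without the '-' nose
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text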

In [33]:
preprocessor(df.loc[24152, 'review'])

'definitely worth watching ten different directors each present a segment based on their favorite opera aria you don t need to be an opera lover to watch this film although of course if you hate opera you re really going to have a bad time with this not surprisingly the segments range from brilliant to only fair most of the fuss seems to be over godard s contribution whether you think he s brilliant or pretentious his segment won t change your mind some of the pieces have a clear narrative others are more a montage of connected images none of the pieces is more than 10 minutes or so if you re not happy with what s on the screen wait for the next segment and think about how much culture you re soaking up keep your eyes open for performances by buck henry beverly d angelo elizabeth hurley briget fonda tilda swinton and john hurt the buck henry segment alone is worth the price of admission '

In [34]:
# applied to full df:
df['review'] = df['review'].apply(preprocessor)
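The review shown above contains no emoticons, so it does not exercise that part of the preprocessor. The short sketch below repeats the same function on a made-up string (an illustrative input, not taken from the dataset) to show that emoticons are kept and moved to the end of the cleaned text.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                              # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)    # remember emoticons
    # lowercase, collapse non-word characters to spaces, re-append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text

cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)
```

The nose character '-' is removed so that ':-)' and ':)' end up as the same token.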

### Processing into tokens and removing stopwords

In [35]:
# one way to tokenize is to split the cleaned text into individual words at its whitespace
def tokenizer(text):
    return text.split()
tokenizer('Runners like running and thus they run')

['Runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [36]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()  # reduces words to their root form, so 'running' becomes 'run'

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [37]:
tokenizer_porter('Runners like running and thus they run')

['Runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [38]:
# remove stopwords from the texts:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Schiphol\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [39]:
# after downloading we can load the English stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [40]:
[w for w in tokenizer_porter('Runners like running and thus they run a lot') if w not in stop]

['Runner', 'like', 'run', 'thu', 'run', 'lot']

### Training a logistic regression model for document classification

First divide the df into 25K docs for training and 25K for testing:

In [17]:
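# .loc slices by label and includes the end label, so stop at 24999 to get exactly 25,000 training docs
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values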

In [18]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

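# solver='liblinear' supports both the 'l1' and the 'l2' penalty used in the grid
lr_tfidf = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(random_state=0, solver='liblinear'))])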
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [None]:
# may take up to 40 minutes !! (run once!!)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


Here we replaced CountVectorizer() and TfidfTransformer() with the **TfidfVectorizer()**, which combines both. Our param_grid consists of two parameter dictionaries. In the first, we use the TfidfVectorizer with its default settings (use_idf=True, smooth_idf=True and norm='l2') to calculate the tf-idfs. In the second, we set use_idf=False and norm=None in order to train a model based on raw term frequencies (smooth_idf is left at its default, which has no effect once use_idf is disabled).

Furthermore, for the logistic regression classifier itself, we trained models using L2 and L1 regularization via the penalty parameter and compared different regularization strengths by defining a range of values for **the inverse-regularization parameter C**.
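The number of candidates that the grid search reports follows directly from the sizes of the value lists in param_grid. A small sketch of that arithmetic, with the list lengths copied by hand from the grid above (the dictionary of lengths is just an illustration, not a scikit-learn object):

```python
from math import prod  # available in Python 3.8+

# Number of values per parameter in each of the two dictionaries of param_grid
grid_sizes = [
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 2,
     'clf__penalty': 2, 'clf__C': 3},
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 2,
     'vect__use_idf': 1, 'vect__norm': 1,
     'clf__penalty': 2, 'clf__C': 3},
]

# Each dictionary contributes the product of its list lengths
candidates = sum(prod(sizes.values()) for sizes in grid_sizes)
fits = candidates * 5  # cv=5 folds per candidate

print(candidates, fits)  # 48 240
```

Every additional value in any list multiplies the work for that dictionary, which is why the search is so expensive on 25,000 documents.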



In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

In [None]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

### Extra comment

Please note that gs_lr_tfidf.best_score_ is the average k-fold cross-validation score. I.e., if we have a GridSearchCV object with 5-fold cross-validation (like the one above), the best_score_ attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:

In [79]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


np.random.seed(0)
np.set_printoptions(precision=6)
y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

In [80]:
# random_state only applies when shuffle=True; newer scikit-learn versions raise an error if it is set here
cv5_idx = list(StratifiedKFold(n_splits=5, shuffle=False).split(X, y))

In [64]:
y

[0, 1, 0, 1, 1, 2, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]

In [81]:
cv5_idx

[(array([ 4,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
         22, 23, 24]), array([0, 1, 2, 3, 5])),
 (array([ 0,  1,  2,  3,  5,  9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21,
         22, 23, 24]), array([ 4,  6,  7,  8, 12])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 12, 13, 14, 15, 18, 19, 20, 21,
         22, 23, 24]), array([ 9, 10, 11, 16, 17])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 14, 16, 17, 21,
         22, 23, 24]), array([13, 15, 18, 19, 20])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 15, 16, 17,
         18, 19, 20]), array([14, 21, 22, 23, 24]))]

In [82]:
cross_val_score(LogisticRegression(random_state=123), X, y, cv=cv5_idx)

array([ 0.6,  0.4,  0.6,  0.2,  0.6])

By executing the code above, we created a simple data set of random integers that represent our class labels. Next, we fed the indices of 5 cross-validation folds (cv5_idx) to the cross_val_score scorer, which returned 5 accuracy scores, one for each of the 5 test folds.
Next, let us use the GridSearchCV object and feed it the same 5 cross-validation sets (via the pre-generated cv5_idx indices):

In [84]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(LogisticRegression(), {}, cv=cv5_idx, verbose=3).fit(X, y)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[CV] ................................. , score=0.600000, total=   0.0s
[CV]  ................................................................
[CV] ................................. , score=0.400000, total=   0.0s
[CV]  ................................................................
[CV] ................................. , score=0.600000, total=   0.0s
[CV]  ................................................................
[CV] ................................. , score=0.200000, total=   0.0s
[CV]  ................................................................
[CV] ................................. , score=0.600000, total=   0.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [90]:
print (gs.best_score_)
print (cross_val_score(LogisticRegression(), X, y, cv= cv5_idx).mean())

0.48
0.48
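The agreement between the two numbers is plain arithmetic on the five fold scores printed earlier. A tiny check, assuming equally sized test folds as in this example:

```python
# Accuracy per test fold, copied from the cross_val_score output above
fold_scores = [0.6, 0.4, 0.6, 0.2, 0.6]

# best_score_ of the single candidate is the mean over the folds
mean_score = sum(fold_scores) / len(fold_scores)
print(round(mean_score, 2))  # 0.48
```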


### Out-of-core learning, SGDClassifier, HashingVectorizer and working with mini-batches

The grid search above was computationally very expensive. In ch02 we learned about **stochastic gradient descent**, an optimization algorithm that updates the model's weights using one sample at a time. Here we will make use of the **partial_fit** function of the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using **small mini-batches of documents**.
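To make the "one sample at a time" idea concrete, here is a minimal sketch of a stochastic gradient step for logistic regression in plain Python. It is not what SGDClassifier does internally in every detail (there is no regularization or learning rate schedule). The toy data, the learning rate and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, y, lr=0.1):
    """One stochastic gradient step on the log-loss for a single sample."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = sigmoid(z) - y  # derivative of the log-loss with respect to z
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy data: the first feature signals class 1, the second signals class 0
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.2], 1), ([0.1, 1.0], 0)]

weights, bias = [0.0, 0.0], 0.0
for epoch in range(50):
    for x, y in samples:  # the weights change after every single sample
        weights, bias = sgd_step(weights, bias, x, y)

predictions = [predict(weights, bias, x) for x, _ in samples]
print(predictions)
```

Because each update needs only the current sample, the full data set never has to be in memory at once, which is what makes the streaming approach below possible.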

In [91]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')  # stop-word list used by the tokenizer below

basepath = 'C:\\Users\\Gebruiker\\Anaconda\\data\\aclImdb\\'

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]  # drop stop words
    return tokenized


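def stream_docs(path):
    """
    Generator function that reads the CSV file and yields one
    (text, label) tuple per document.
    """
    with open(path, 'r', encoding='utf-8') as csv:  # to_csv writes UTF-8 by default
        next(csv)  # skip header
        for line in csv:
            # each line ends with ',<label>\n', so the label is the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label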

In [92]:
os.getcwd()

'C:\\Users\\Schiphol\\Documents\\data\\idmb_movie_reviews\\aclImdb_v1.tar\\aclImdb'

In [97]:
# verify that the generator works correctly:
next(stream_docs(path='C:\\Users\\Schiphol\\Documents\\data\\move_date.csv'))

('11841,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and 

In [101]:
# get_minibatch takes the document stream from the stream_docs function and
# returns the number of docs specified by the size parameter

def get_minibatch(doc_stream, size):
    """
    Take a document stream from the stream_docs function and return
    the number of docs specified by the size parameter.
    """
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
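The interplay of the generator and the mini-batch function can be tried out without the real file. The sketch below uses an in-memory CSV via io.StringIO. The names `stream_lines` and `take_batch` and the three fake reviews are made up for this illustration, but the slicing logic mirrors the functions above.

```python
import io

def stream_lines(handle):
    next(handle)  # skip header
    for line in handle:
        # each line ends with ',<label>\n'
        yield line[:-3], int(line[-2])

def take_batch(stream, size):
    docs, labels = [], []
    try:
        for _ in range(size):
            text, label = next(stream)
            docs.append(text)
            labels.append(label)
    except StopIteration:
        # stream ran dry before the batch was full
        return None, None
    return docs, labels

fake_csv = io.StringIO('review,sentiment\n"good film",1\n"bad film",0\n"great",1\n')
stream = stream_lines(fake_csv)

first_docs, first_labels = take_batch(stream, size=2)
rest_docs, rest_labels = take_batch(stream, size=2)
print(first_docs, first_labels)
print(rest_docs, rest_labels)
```

Note the design consequence visible in the second call: when fewer than `size` documents remain, the partial batch is discarded and `None, None` is returned, so the number of streamed documents should be a multiple of the batch size if nothing is to be lost.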

Unfortunately, we cannot use CountVectorizer for out-of-core learning, since it requires holding the complete vocabulary in memory. The same applies to the TfidfVectorizer, which needs the whole training set to calculate the inverse document frequencies.

However, the **HashingVectorizer** is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm.

In [102]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier # stochastic gradient descent

In [103]:
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

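# loss='log_loss' selects logistic regression (scikit-learn >= 1.1; older releases used loss='log').
# The former n_iter=1 argument was removed from scikit-learn and is not needed with partial_fit.
clf = SGDClassifier(loss='log_loss', random_state=1)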
doc_stream = stream_docs(path='C:\\Users\\Schiphol\\Documents\\data\\move_date.csv')

We used the HashingVectorizer with our tokenizer function and set the number of features to $2^{21}$. We then initialized a logistic regression classifier by setting the loss parameter of the SGDClassifier to the logistic (log) loss. Note that choosing a large number of features in the HashingVectorizer reduces the chance of hash collisions, but it also increases the number of coefficients in our logistic regression model.
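The trade-off between the number of features and hash collisions can be illustrated with a few lines of plain Python. This sketch uses zlib.crc32 as a stand-in hash function, whereas HashingVectorizer uses MurmurHash3, and it leaves out details such as signed hashing. The function name `hashed_counts` is hypothetical.

```python
import zlib

def hashed_counts(tokens, n_features):
    """Map tokens to a fixed-size sparse count vector (dict: index -> count)."""
    vector = {}
    for tok in tokens:
        # The column index depends only on the token itself, not on a stored vocabulary
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vector[idx] = vector.get(idx, 0) + 1
    return vector

tokens = 'the sun is shining the weather is sweet'.split()

wide = hashed_counts(tokens, n_features=2**21)  # large table: collisions are unlikely
narrow = hashed_counts(tokens, n_features=4)    # tiny table: collisions are unavoidable

print(len(set(tokens)), len(wide), len(narrow))
```

With only 4 columns and 6 distinct tokens, at least two different words must share a column, so their counts are merged. A larger `n_features` makes this rare at the cost of a longer weight vector.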

**SGDClassifier**: linear classifiers (SVM, logistic regression, among others) with SGD training.
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (i.e., the learning rate). SGD allows minibatch (online/out-of-core) learning, see the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.

This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared Euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

In [104]:
# now we can start the out-of-core learning:

import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:29


We initialized the progress bar object with 45 iterations, and in the following for-loop we iterated over 45 mini-batches of 1,000 docs each. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate our model.

We will see that the accuracy is a bit lower compared to the grid-searched version, but training took less than a minute and was very memory efficient.

In [105]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


In [106]:
# finally we can use the last 5000 documents to update our model
clf = clf.partial_fit(X_test, y_test)