Adapted from *Python Machine Learning 2nd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/LICENSE.txt)

## Applying Machine Learning To Sentiment Analysis

### Obtaining the IMDb movie review dataset

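The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/). The cells below assume that the reviews have already been assembled into a single CSV file, `movie_data.csv`.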

In [1]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(5)

Unnamed: 0,review,sentiment
0,I have no clue as to what this was shot on but...,0
1,"""The Seven-Ups"" seems like a replay of ""The Fr...",1
2,"In Hollywood in the 1930's and 1940's, I think...",0
3,There are worse ways to spend an evening than ...,1
4,"""Elvira, Mistress Of The Dark"" is a sort of ""H...",1


In [2]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [3]:
df.shape

(50000, 2)

### Bag-of-words model

Transforming documents into feature vectors

By calling the `fit_transform` method on `CountVectorizer`, we can construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = vectorizer.fit_transform(docs)

Now let us print the contents of the vocabulary: 

In [5]:
print(vectorizer.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [6]:
# scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())

['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']


As we can see from executing the preceding commands, the vocabulary is stored in a Python dictionary that maps each unique word to an integer index.

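Next, let us print the feature vectors that we just created. Each index position in the feature vectors corresponds to the integer value stored for a word in the `CountVectorizer` vocabulary. For example, the feature at index position 0 holds the count of the word *and*, which occurs only in the last document.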

In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


These values in the feature vectors are also called the raw **term frequencies**, *tf(t, d)*: the number of times a term *t* occurs in a document *d*.
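To make the counting concrete, here is a minimal pure-Python sketch that rebuilds the vocabulary and the count matrix by hand. It assumes a simplified tokenizer (lowercasing, then keeping runs of two or more word characters), which approximates `CountVectorizer`'s default behavior for these sentences but is not guaranteed to match it in every case. The names `tokenize`, `unique_words`, and `rows` are illustrative only.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(doc):
    # Assumption: lowercase and keep tokens of two or more word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Assign each unique word an index in alphabetical order.
unique_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Count how often each vocabulary word occurs in each document: tf(t, d).
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))  # missing words count as 0
    rows.append([counts[word] for word in unique_words])

print(vocabulary)
for row in rows:
    print(row)
```

Each printed row should line up column by column with the array shown above, which is a quick way to confirm how the vocabulary indices map to feature positions.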

### Assessing word relevancy via term frequency-inverse document frequency

In [8]:
np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. A useful technique called term frequency-inverse document frequency (tf-idf) can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

The inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. 
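As a rough illustration of this formula, the sketch below computes the document frequency and the idf for a few terms from the three example sentences. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the helper names `df` and `idf` are hypothetical.

```python
import math

# Tokenized versions of the three example sentences (lowercased, no punctuation).
docs = [
    'the sun is shining'.split(),
    'the weather is sweet'.split(),
    'the sun is shining the weather is sweet and one and one is two'.split(),
]
n_d = len(docs)  # total number of documents

def df(term):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

def idf(term):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural logarithm assumed.
    return math.log(n_d / (1 + df(term)))

for term in ['is', 'sun', 'and']:
    print(term, df(term), round(idf(term), 3))
```

The term that appears in all three documents, *is*, receives the lowest idf, while the rarer *and* receives the highest, which is the downweighting effect described above.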

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
#print(tfidf.fit_transform(vectorizer.fit_transform(docs)).toarray())
print(tfidf.fit_transform(bag).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word *is* had the largest term frequency in the third document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word *is* is now associated with a lower tf-idf (0.45) in document 3 than the less frequent words *and* and *one* (0.5 each), since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.


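Note: The equations for the idf and tf-idf implemented in scikit-learn's `TfidfTransformer` (with `smooth_idf=True`) differ slightly from the textbook definitions given earlier. The idf is computed as:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

and 1 is added to the idf before it is multiplied by the term frequency:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d) + 1)$$

Finally, L2-normalization is applied to each document vector. For document 3:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}(\text{"is"}, d3) = 0.45$$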
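The numbers in this note can be checked by hand. The sketch below is a minimal pure-Python reproduction for document 3, assuming that scikit-learn's smoothed variant uses the natural logarithm and adds 1 to the idf before multiplying by the term frequency. It is meant as a sanity check rather than a description of the library's internals, and all variable names are illustrative.

```python
import math

# Raw term frequencies of document 3, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf_d3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# Number of the three documents that contain each term.
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_d = 3

# Smoothed idf plus 1 (assumption: natural logarithm).
idf_plus_one = [math.log((1 + n_d) / (1 + d)) + 1 for d in df]

# Unnormalized tf-idf values.
raw = [tf * weight for tf, weight in zip(tf_d3, idf_plus_one)]

# L2-normalize the vector by dividing by its Euclidean length.
length = math.sqrt(sum(value ** 2 for value in raw))
tfidf_norm = [value / length for value in raw]

print([round(value, 2) for value in raw])
print([round(value, 2) for value in tfidf_norm])
```

Under these assumptions the second list should agree with the third row printed by `TfidfTransformer`, including the value 0.45 for the word *is*.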

### Cleaning text data

We need to clean up the data by removing unwanted characters. 

In [10]:
df.loc[0, 'review']

"I have no clue as to what this was shot on but you can definitely tell that they had no budget. Bad acting, horrible cinematography, and lame plot and some decent special effects do not make a good movie. The WWF style cinemtography will make you cry...where's the tripod?! The filmakers aimed high, but sorely missed their mark."

First, we remove all HTML markup. For simplicity, we then strip all punctuation marks except for emoticon characters such as ":)", since those are useful for sentiment analysis. To do so, the emoticons are collected first, all non-word characters are replaced by spaces, the text is converted to lowercase, and the emoticons are appended to the end of the cleaned string.

In [11]:
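import re

def preprocessor(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace non-word characters with spaces, and append the
    # emoticons with the "nose" character (-) removed.
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text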

In [12]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [13]:
preprocessor(df.loc[0, 'review'])

'i have no clue as to what this was shot on but you can definitely tell that they had no budget bad acting horrible cinematography and lame plot and some decent special effects do not make a good movie the wwf style cinemtography will make you cry where s the tripod the filmakers aimed high but sorely missed their mark '
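To see what each step of `preprocessor` contributes, the sketch below applies the three regular-expression operations one at a time to the short test string used above. The intermediate variable names are illustrative and are not part of the original function.

```python
import re

text = "</a>This :) is :( a test :-)!"

# Step 1: strip anything that looks like an HTML tag.
no_html = re.sub(r'<[^>]*>', '', text)

# Step 2: collect emoticons: eyes (: ; =), an optional nose (-),
# and a mouth, which is one of ), (, D, or P.
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse every run of non-word characters into a space.
words_only = re.sub(r'[\W]+', ' ', no_html.lower())

print(repr(no_html))
print(emoticons)
print(repr(words_only))
```

Because the emoticons are appended after the cleaned text, their original position in the review is lost. That does not matter for a bag-of-words model, which ignores word order.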

Apply the preprocessor to the movie review data.

In [14]:
df['review'] = df['review'].apply(preprocessor)

In [15]:
df.head(5)

Unnamed: 0,review,sentiment
11841,daniel day lewis is christy brown a victim of ...,1
19602,a complete waste of timehalla bol is a complet...,0
45519,the pros of this film are the astonishing figh...,1
25747,this film is a good start for novices that hav...,1
42642,starring jim carrey morgan freeman jennifer an...,1


### Processing documents into tokens

One simple way to tokenize documents is to split them into individual words at whitespace characters. Another useful technique is word stemming, the process of transforming a word into its root form. The Porter stemmer, developed by Martin Porter in 1979, is one of the original stemming algorithms.

In [16]:
# Import the Porter stemmer from the Natural Language Toolkit (NLTK)
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [17]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [18]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

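We will also apply another useful technique called stop-word removal. Stop words are words such as *a* and *and* that are extremely common in all sorts of text and therefore carry little information for distinguishing between classes of documents.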

In [19]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shravani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

In [21]:
df.head()

Unnamed: 0,review,sentiment
11841,daniel day lewis is christy brown a victim of ...,1
19602,a complete waste of timehalla bol is a complet...,0
45519,the pros of this film are the astonishing figh...,1
25747,this film is a good start for novices that hav...,1
42642,starring jim carrey morgan freeman jennifer an...,1


### Training a logistic regression model for document classification

Prepare the training set and the test set. The first 40,000 reviews are used for training and the remaining 10,000 for testing.

In [22]:
X_train = df.iloc[:40000, 0].values
y_train = df.iloc[:40000, 1].values
X_test = df.iloc[40000:50000, 0].values
y_test = df.iloc[40000:50000, 1].values

Create a pipeline for training

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

lr_tfidf_pipeline = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(random_state=42, solver='lbfgs')),
])

In [24]:
lr_tfidf_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
... penalty='l2', random_state=42, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [25]:
lr_tfidf_pipeline.predict(X_test)

array([0, 0, 1, ..., 0, 1, 1])

In [26]:
print('Test Accuracy: %.3f' % lr_tfidf_pipeline.score(X_test, y_test))

Test Accuracy: 0.897


Try it out with a single example

In [27]:
label = {0: 'negative', 1: 'positive'}
example = ['The movie is so so']

print('Prediction: %s\nProbability: %.2f%%' %
      (label[lr_tfidf_pipeline.predict(example)[0]],
       np.max(lr_tfidf_pipeline.predict_proba(example)) * 100))

Prediction: negative
Probability: 56.01%


In [28]:
label = {0: 'negative', 1: 'positive'}
example = ['The movie is so good']

print('Prediction: %s\nProbability: %.2f%%' %
      (label[lr_tfidf_pipeline.predict(example)[0]],
       np.max(lr_tfidf_pipeline.predict_proba(example)) * 100))

Prediction: positive
Probability: 87.85%


Tune the hyperparameters with GridSearchCV

In [29]:
from sklearn.model_selection import GridSearchCV


param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__C': [1.0, 10.0, 100.0]},
              ]


gs_lr_tfidf = GridSearchCV(lr_tfidf_pipeline, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [30]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 32.8min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
... penalty='l2', random_state=42, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__tokenizer': [<function tokenizer at 0x7f7820a95598>, <function tokenizer_porter at 0x7f7820a95510>], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)
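The "6 candidates" and "30 fits" reported in the log follow directly from the parameter grid: every combination of one value per parameter is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below reproduces that arithmetic with the standard library only. The tokenizer functions are replaced by placeholder strings, and the names `cv_folds`, `candidates`, and `total_fits` are illustrative.

```python
from itertools import product

param_grid = {
    'vect__ngram_range': [(1, 1)],
    # Strings stand in for the tokenizer functions used in the real grid.
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__C': [1.0, 10.0, 100.0],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

Adding a second n-gram range or more values of `C` multiplies the number of candidates, so the runtime of the search grows quickly with the size of the grid.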

In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

In [None]:
clf = gs_lr_tfidf.best_estimator_

In [None]:
clf.predict(X_test)

In [None]:
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Pickle the model
Serialize the fitted model so that it can be saved to the file system and reused later.

In [None]:
import pickle

# Write the best estimator to disk; the with block closes the file automatically.
with open('test.pkl', 'wb') as file_pickle:
    pickle.dump(clf, file_pickle)
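Saving is only half of the round trip. The sketch below shows how a pickled object might be loaded again. To stay self-contained it uses a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, so the names `model`, `path`, and `restored` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable Python object works the same way.
model = {'name': 'stand-in model', 'C': 10.0}

path = os.path.join(tempfile.mkdtemp(), 'test.pkl')

# Save: the context manager closes the file even if dump raises.
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load: restore the object in the same or a later Python session.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

When the pickled object is the real pipeline, custom functions such as `tokenizer_porter` are stored by reference, so they generally need to be defined or importable in the session that loads the file. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.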