Adapted from *Python Machine Learning 2nd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/LICENSE.txt)

## Applying Machine Learning To Sentiment Analysis

### Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
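
The cells that follow assume a DataFrame `df` that already holds the reviews. As a minimal sketch of how the raw files could be collected, the helper below walks the unpacked archive. It assumes the archive was extracted to a folder named `aclImdb` with the layout `{train,test}/{pos,neg}/*.txt`; the function name `load_reviews` is hypothetical and not part of the original notebook.

```python
import os


def load_reviews(basepath='aclImdb'):
    """Collect (review_text, label) pairs from the unpacked archive.

    Assumed layout: <basepath>/{train,test}/{pos,neg}/*.txt
    """
    labels = {'pos': 1, 'neg': 0}  # 1 = positive, 0 = negative
    rows = []
    for split in ('test', 'train'):
        for sentiment in ('pos', 'neg'):
            folder = os.path.join(basepath, split, sentiment)
            for name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, name), encoding='utf-8') as f:
                    rows.append((f.read(), labels[sentiment]))
    return rows
```

The returned list could then be wrapped with `pd.DataFrame(rows, columns=['review', 'sentiment'])`, where the column name `sentiment` is an assumption; the rest of this notebook only relies on the `review` column and on the label being the second column.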

In [None]:
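import numpy as np

# `df` is assumed to already hold the reviews loaded from the downloaded archive.
# Shuffle the rows reproducibly, then store them in a single CSV file.
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')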

In [None]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(5)

In [None]:
df.shape

### Bag-of-words model

Transforming documents into feature vectors

By calling the `fit_transform` method on `CountVectorizer`, we can construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = vectorizer.fit_transform(docs)

Now let us print the contents of the vocabulary: 

In [None]:
print(vectorizer.vocabulary_)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

As we can see from executing the preceding commands, the vocabulary is stored in a Python dictionary that maps each unique word to an integer index.

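Next, let us print the feature vectors that we just created. Each index position in a feature vector corresponds to the integer value stored for a word in the `CountVectorizer` vocabulary. For example, the word *and* has index 0, so the first entry of each vector counts how often *and* occurs in that document.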

In [None]:
print(bag.toarray())

The values in the feature vectors are also called the raw **term frequencies**, *tf(t, d)*: the number of times a term *t* occurs in a document *d*.
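
To make the counting step concrete, here is a minimal sketch that rebuilds the same vocabulary and raw term frequencies with the standard library only. It assumes the default `CountVectorizer` behaviour of lowercasing the text, keeping tokens of two or more word characters, and indexing the vocabulary in alphabetical order; the helper names `tokenize` and `to_vector` are hypothetical.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']


def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# Vocabulary: each unique token gets an index in alphabetical order
unique_terms = sorted({t for d in docs for t in tokenize(d)})
vocabulary = {term: i for i, term in enumerate(unique_terms)}


def to_vector(doc):
    # Count every token, then read the counts off in vocabulary order
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


bag = [to_vector(d) for d in docs]
print(vocabulary)
for row in bag:
    print(row)
```

The last row printed is the count vector of the third sentence, in which *is* occurs three times.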

### Assessing word relevancy via term frequency-inverse document frequency

In [None]:
np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. A useful technique called term frequency-inverse document frequency (tf-idf) can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

The inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. 

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
#print(tfidf.fit_transform(vectorizer.fit_transform(docs)).toarray())
print(tfidf.fit_transform(bag).toarray())

As we saw in the previous subsection, the word *is* had the largest term frequency in the third document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, the word *is* is now associated with a relatively small tf-idf (0.45) in document 3, since it also appears in documents 1 and 2 and is thus unlikely to carry useful, discriminatory information.


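Note: The equations for the idf and tf-idf implemented in scikit-learn's `TfidfTransformer` (with `smooth_idf=True`) differ slightly from the textbook definitions above:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d) + 1)$$

With `norm='l2'`, the tf-idf vector of each document is then L2-normalized, that is, divided by its Euclidean length. For the third document:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}(\text{"is"}, d_3) = 0.45$$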
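
The numbers in this note can be checked by hand. The sketch below assumes scikit-learn's smoothed variant, in which the raw tf-idf is tf(t, d) × (idf(t, d) + 1) with the natural logarithm, followed by L2 normalization. The term counts and document frequencies are typed in manually for the three example sentences, and the variable names are hypothetical.

```python
import math

n_docs = 3
# Vocabulary order: and, is, one, shining, sun, sweet, the, two, weather
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]   # raw term frequencies in document 3
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]  # number of documents containing each term

# Smoothed idf, then tf * (idf + 1)
raw = [tf * (math.log((1 + n_docs) / (1 + d)) + 1)
       for tf, d in zip(tf_doc3, doc_freq)]

# L2 normalization: divide by the Euclidean length of the vector
length = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [v / length for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in tfidf_doc3])
```

The second entry of the normalized vector belongs to the word *is* and rounds to 0.45, matching the value reported for document 3.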

### Cleaning text data

We need to clean up the data by removing unwanted characters. 

In [None]:
df.loc[0, 'review']

We will first remove all HTML markup. For simplicity, we then remove all punctuation marks except for emoticon characters such as :), since those are useful for sentiment analysis. Finally, all non-word characters are removed, the text is converted to lowercase, and the emoticons that were found are appended to the end of the cleaned text.

In [None]:
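import re


def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text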

In [None]:
preprocessor("</a>This :) is :( a test :-)!")

In [None]:
preprocessor(df.loc[0, 'review'])

Apply the preprocessor to the movie review data.

In [None]:
df['review'] = df['review'].apply(preprocessor)

In [None]:
df.head(5)

### Processing documents into tokens

One simple way to tokenize documents is to split them into individual words at the whitespace characters. Another useful technique is word stemming, the process of transforming a word into its root form so that related words map to the same stem. The Porter stemmer, developed by Martin Porter in 1979, is one of the original stemming algorithms.

In [None]:
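# Import the Porter stemmer from the Natural Language Toolkit (NLTK)
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer(text):
    # Split on whitespace only
    return text.split()


def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its stem
    return [porter.stem(word) for word in text.split()]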

In [None]:
tokenizer('runners like running and thus they run')

In [None]:
tokenizer_porter('runners like running and thus they run')

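We will also apply another useful technique called stop-word removal. Stop words are words such as *is*, *and*, and *has* that are very common in all kinds of text and therefore carry little information for distinguishing between classes.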

In [None]:
import nltk

nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]
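
The filtering step itself does not depend on NLTK. As a minimal sketch, the snippet below applies the same idea with a tiny, hand-picked stop-word set and plain whitespace tokenization (no stemming); `toy_stop` and `remove_stop_words` are hypothetical names used only for illustration.

```python
# Hypothetical, hand-picked stop-word set used only for illustration;
# NLTK's English list is much longer.
toy_stop = {'a', 'and', 'the', 'is'}


def remove_stop_words(text, stop_words):
    """Split on whitespace and drop every token found in stop_words."""
    return [w for w in text.lower().split() if w not in stop_words]


print(remove_stop_words('a runner likes running and runs a lot', toy_stop))
```

Storing the stop words in a `set` rather than a list keeps each membership test fast, which can matter when many reviews are filtered.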

### Training a logistic regression model for document classification

Prepare the training set and the test set. Here, the first 200 reviews are used for training and the following 200 for testing.

In [None]:
X_train = df.iloc[:200, 0].values
y_train = df.iloc[:200, 1].values
X_test = df.iloc[200:400, 0].values
y_test = df.iloc[200:400, 1].values

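Create a pipeline that chains a `TfidfVectorizer`, which combines the steps of `CountVectorizer` and `TfidfTransformer`, with a logistic regression classifier.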

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

lr_tfidf_pipeline = Pipeline([('vect', tfidf),
                              ('clf', LogisticRegression(random_state=42, solver='lbfgs'))])

In [None]:
lr_tfidf_pipeline.fit(X_train, y_train)

In [None]:
lr_tfidf_pipeline.predict(X_test)

In [None]:
print('Test Accuracy: %.3f' % lr_tfidf_pipeline.score(X_test, y_test))

Try it out with a single example

In [None]:
label = {0: 'negative', 1: 'positive'}
example = ['The movie is so so']

print('Prediction: %s\nProbability: %.2f%%' %
      (label[lr_tfidf_pipeline.predict(example)[0]],
       np.max(lr_tfidf_pipeline.predict_proba(example)) * 100))

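Tune the hyperparameters with `GridSearchCV`. The prefixes `vect__` and `clf__` in the parameter grid refer to the pipeline steps of the same name.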

In [None]:
from sklearn.model_selection import GridSearchCV


param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__C': [1.0, 10.0, 100.0]},
              ]


gs_lr_tfidf = GridSearchCV(lr_tfidf_pipeline, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
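
To get a feel for how much work this search involves, the sketch below enumerates the parameter combinations with the standard library. It mirrors the grid values used above, but represents the two tokenizer functions by their names so that the snippet runs on its own; the variable names are hypothetical.

```python
from itertools import product

# Same grid values as above; the tokenizers are represented by their names here
grid = {'vect__ngram_range': [(1, 1)],
        'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
        'clf__C': [1.0, 10.0, 100.0]}

# Every combination of one value per parameter is one candidate model
keys = list(grid)
candidates = [dict(zip(keys, values)) for values in product(*grid.values())]

n_folds = 5
print(len(candidates), 'candidates,',
      len(candidates) * n_folds, 'cross-validation fits')
```

With 1 × 2 × 3 = 6 candidates and 5 folds, the search trains 30 models during cross-validation, plus one final refit of the best candidate on the full training set.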

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

In [None]:
clf = gs_lr_tfidf.best_estimator_

In [None]:
clf.predict(X_test)

In [None]:
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Pickle the model

In [None]:
import pickle

# Serialize the tuned pipeline; the with-block closes the file automatically
with open('test.pkl', 'wb') as file_pickle:
    pickle.dump(clf, file_pickle)
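
Restoring the model later works the same way in reverse. The sketch below shows a full save-and-load round trip; to stay self-contained it pickles a small dictionary as a hypothetical stand-in for the fitted pipeline and writes to a temporary folder instead of `test.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline, so the sketch runs on its own
model = {'C': 10.0, 'ngram_range': (1, 1)}

path = os.path.join(tempfile.mkdtemp(), 'test.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Later, possibly in another session: restore the object from disk
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

When loading the real pipeline, note that pickle stores functions by reference, so the tokenizer function selected by the grid search must be defined or importable under the same name before `pickle.load` is called.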