__Chapter 8 - Applying Machine Learning to Sentiment Analysis__

1. [Import](#Import)
1. [Preparing the IMDb movie review data for text processing](#Preparing-the-IMDb-movie-review-data-for-text-processing)
1. [Bag-of-words](#Bag-of-words)
1. [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)
    1. [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)
        1. [Manually calculate a word](#Manually-calculate-a-word)
1. [Cleaning text data](#Cleaning-text-data)
1. [Processing documents](#Processing-documents)
1. [Training a logistic regression model for document classification](#Training-a-logistic-regression-model-for-document-classification)
1. [Working with bigger data – online algorithms and out-of-core learning](#Working-with-bigger-data–online-algorithms-and-out-of-core-learning)
    1. [Store learned model using pickle](#Store-learned-model-using-pickle)
1. [Topic modeling with Latent Dirichlet Allocation](#Topic-modeling-with-Latent-Dirichlet-Allocation)


# Import

<a id = 'Import'></a>

In [None]:
# standard library and settings
import os
import sys
import importlib
import itertools
from io import StringIO
import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.6f}".format

# modeling extensions
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
# load_boston is omitted: it is unused here and was removed in scikit-learn 1.2
from sklearn.datasets import load_wine, load_iris, load_breast_cancer, make_blobs, make_moons
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier, IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer, HashingVectorizer
from sklearn.feature_selection import f_classif, f_regression, VarianceThreshold, SelectFromModel, SelectKBest
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression, LogisticRegression, SGDRegressor
from sklearn.metrics import precision_score, recall_score, f1_score, explained_variance_score, mean_squared_log_error, mean_absolute_error, median_absolute_error, mean_squared_error, r2_score, confusion_matrix, roc_curve, accuracy_score, roc_auc_score, homogeneity_score, completeness_score, classification_report, silhouette_samples
from sklearn.model_selection import KFold, train_test_split, GridSearchCV, StratifiedKFold, cross_val_score, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, OrdinalEncoder, LabelEncoder, OneHotEncoder, KBinsDiscretizer, QuantileTransformer, PowerTransformer, MinMaxScaler
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import sklearn.utils as utils

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

# custom extensions and settings
sys.path.append("/home/mlmachine") if "/home/mlmachine" not in sys.path else None
sys.path.append("/home/prettierplot") if "/home/prettierplot" not in sys.path else None

import mlmachine as mlm
from prettierplot.plotter import PrettierPlot
import prettierplot.style as style

# magic functions
%matplotlib inline

# Preparing the IMDb movie review data for text processing

Sentiment analysis is a subdiscipline of NLP that is concerned with analyzing the polarity of documents. One particular task seeks to classify documents based on the expressed emotions of the authors regarding a topic.

IMDb movie reviews have been gathered into a dataset consisting of 50,000 individual user critiques. Each review is labeled as positive or negative, where positive means the movie received > 6 stars and negative means the movie received < 5 stars.

<a id = 'Preparing-the-IMDb-movie-review-data-for-text-processing'></a>

In [None]:
# NOTE: Only need to run this to unpack the file
# unzip tarfile, read into dataframe, and send to .csv
import tarfile
import pyprind

# unzip tarfile
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall()

# read into dataframe
# extractall() unpacks into the current working directory, so use a relative path
basepath = "aclImdb"
labels = {"pos": 1, "neg": 0}
pbar = pyprind.ProgBar(50000)
rows = []
for s in ("test", "train"):
    for l in ("pos", "neg"):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                txt = infile.read()
            # collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=["review", "sentiment"])

# send to .csv
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv("s3://tdp-ml-datasets/misc/ImdbReviews.csv", index=False, encoding="utf-8")

# Bag-of-words

Bag-of-words is a method for representing text data as numerical feature vectors. This involves two key steps:

1. Create a vocabulary of unique tokens (for example, words) from the entire set of documents.
2. Construct a feature vector for each document that contains the counts of how often each word occurs in that specific document. These individual feature vectors are typically very sparse because a single document contains only a small subset of the overall corpus vocabulary.
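The two steps above can be sketched in plain Python. This is a minimal illustration, not how CountVectorizer is implemented; the helper names (`simple_tokens`, `build_vocabulary`, `count_vector`) and the simplified tokenization rule are assumptions made for this example.

```python
import re
from collections import Counter

# toy corpus, kept under its own name so it does not clash with later cells
toy_docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]


def simple_tokens(text):
    # simplification: lowercase the text and keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    # step 1: collect the unique tokens and give each one a column index
    unique = sorted({tok for doc in documents for tok in simple_tokens(doc)})
    return {tok: ix for ix, tok in enumerate(unique)}


def count_vector(document, vocabulary):
    # step 2: count how often each vocabulary word occurs in one document
    counts = Counter(simple_tokens(document))
    return [counts.get(tok, 0) for tok in vocabulary]


toy_vocabulary = build_vocabulary(toy_docs)
toy_vectors = [count_vector(doc, toy_vocabulary) for doc in toy_docs]
```

Each row of `toy_vectors` has one entry per vocabulary word, and most entries in the shorter documents are zero, which is the sparsity mentioned above.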

<a id = 'Bag-of-words'></a>

## Transforming words into feature vectors

A set of text documents can be transformed into a numerical representation. CountVectorizer() is a tool that creates our bag-of-words, and this data can be reviewed in several different ways.

The vocabulary (count.vocabulary_) maps each unique word in the data set to its integer column index in the feature matrix. The bag of words itself is stored as a sparse matrix, which can be converted to an array that shows the raw term frequencies:

$$
tf(t,d)
$$

where the term frequency $tf$ is the number of times term $t$ occurs in document $d$

<a id = 'Transforming-words-into-feature-vectors'></a>

In [None]:
# CountVectorizer() example
count = CountVectorizer()
docs = np.array(
    [
        "The sun is shining",
        "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two",
    ]
)
bag = count.fit_transform(docs)

In [None]:
# print vocabulary (maps each word to its column index)
print(count.vocabulary_)

In [None]:
# print raw term frequencies
print(bag.toarray())

In [None]:
# store in data frame
pd.DataFrame(bag.toarray(), columns=count.get_feature_names())

## Assessing word relevancy via term frequency-inverse document frequency

Words often occur across multiple documents in each class, and these typically don't contain useful information due to their pervasiveness. Term frequency-inverse document frequency (TF-IDF) is a technique for downweighting these frequently occurring words:

$$
\mbox{tf-idf(t,d)} = tf(t,d) \times (idf(t,d) + 1)
$$

$tf(t,d)$ is the term frequency described above, and $idf(t,d)$ is the inverse document frequency, calculated as follows:

$$
\mbox{idf(t,d)} = \log \frac{1 + n_d}{1 + \mbox{df(d,t)}}
$$

$n_d$ is the total number of documents, and $df(d,t)$ is the number of documents $d$ that contain term $t$. Taking the log ensures that low document frequencies are not given too much weight. The constant 1 added to both the numerator and the denominator corresponds to smooth_idf=True and avoids division by zero.

<a id = 'Assessing-word-relevancy-via-term-frequency-inverse-document-frequency'></a>

In [None]:
# perform TF-IDF transformation
tfidf = TfidfTransformer(
    use_idf=True, norm="l2", smooth_idf=True
)
bag = tfidf.fit_transform(count.fit_transform(docs))

pd.DataFrame(bag.toarray(), columns=count.get_feature_names())

### Manually calculate a word

The word 'is' has a term frequency of 3 in the third document $(tf = 3)$, and its document frequency is also 3 because it occurs in all three documents $(df = 3)$:
$$
\mbox{tf("is", 3)} = 3
$$

$$
\mbox{idf("is", 3)} = \log \frac{1 + 3}{1 + 3} = 0
$$

$$
\mbox{tf-idf("is", 3)} = 3 \times (0 + 1) = 3
$$

Repeating this for each word in document three gives $[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]$, which is clearly not equal to the values in the third row of the TF-IDF dataframe above. To get these values, L2-normalization needs to be applied:

$$
\mbox{tf-idf(d3)}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}
$$

$$
= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]
$$

The second value, 0.45, corresponds to the word "is".
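The manual calculation can be reproduced with a few lines of plain Python. This is a sketch that follows the formulas in this section. The term counts and document frequencies are typed in by hand from the three example sentences, and the variable names are chosen for this example only.

```python
import math

# vocabulary in alphabetical order, matching the column order used above
vocab_words = ["and", "is", "one", "shining", "sun", "sweet", "the", "two", "weather"]
# raw term counts for document three
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# number of the three documents that contain each word
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# smoothed inverse document frequency: log((1 + n_d) / (1 + df))
idf = [math.log((1 + n_docs) / (1 + d)) for d in doc_freq]

# tf-idf before normalization: tf * (idf + 1)
tfidf_raw = [tf * (i + 1) for tf, i in zip(tf_doc3, idf)]

# L2-normalization: divide every entry by the vector's Euclidean length
l2_norm = math.sqrt(sum(v ** 2 for v in tfidf_raw))
tfidf_norm = [v / l2_norm for v in tfidf_raw]

print([round(v, 2) for v in tfidf_raw])
print([round(v, 2) for v in tfidf_norm])
```

The two printed lists can be compared against the unnormalized and normalized vectors worked out by hand above.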

<a id = 'Manually-calculate-a-word'></a>

# Cleaning text data

<a id = 'Cleaning-text-data'></a>

In [None]:
# load and inspect data
df = pd.read_csv("s3://tdp-ml-datasets/misc/ImdbReviews.csv")

df.info()
display(df[:5])

In [None]:
# last 50 chars from first doc
df.loc[0, "review"][-50:]

> Remarks - Based on this snippet, it's clear that some of the reviews contain HTML markup, punctuation, and other non-letter characters. Punctuation may carry some useful information, but in this example it will all be removed for simplicity. We will, however, keep emoticons since those convey sentiment.

In [None]:
# regex parser that moves all emoticons to the end
import re


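def textPreprocessor(text):
    # strip HTML markup
    text = re.sub(r"<[^>]*>", "", text)
    # find emoticons in the original casing so that :D and :P can match
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # lowercase, replace non-word characters with spaces, then append emoticons without the nose
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text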

> Remarks - <[^>]*> removes the HTML markup. The remaining code temporarily stores all emoticons, removes all non-word characters, and converts the text to lowercase. The emoticons are then appended to the end of the text string, with the nose character '-' removed so that emoticons are represented consistently. Appending the emoticons to the end is sufficient because in this example we will be using 1-gram tokens, so the order of words in the bag of words is not important. Lowercasing is done for simplicity; in practice, capitalization of things such as proper nouns may carry some importance.

In [None]:
# test textPreprocessor
textPreprocessor("</a>This :) is :( :-( a test :-)!")

In [None]:
# apply text processor to reviews
df["review"] = df["review"].apply(textPreprocessor)

# Processing documents 

__Tokenization__

Tokenizing a document means splitting the text into individual elements by splitting the words from each other and removing white space.

<a id = 'Processing-documents'></a>

In [None]:
# tokenize sample sentence
sentence = "runners like running and thus they run"


def tokenizer(text):
    return text.split()


tokenizer(sentence)

__Stemming__

Another useful technique in the context of tokenization is word stemming, which is a process for transforming a word into its root form. There are many stemming algorithms, and the original is known as the Porter stemmer algorithm. NLTK includes a Python implementation.

In [None]:
# stemming example
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizerPorter(text):
    return [porter.stem(word) for word in text.split()]


tokenizerPorter(sentence)

__Stop-words__

Stop-words are words that are so common across all kinds of text that they carry little information for telling documents apart, such as is, and, has, and like.

In [None]:
# stop word removal
import nltk

nltk.download("stopwords")
from nltk.corpus import stopwords

stop = stopwords.words("english")
[w for w in tokenizerPorter(sentence) if w not in stop]

# Training a logistic regression model for document classification

<a id = 'Training-a-logistic-regression-model-for-document-classification'></a>

In [None]:
# manual train/test split
# .loc slicing is label-based and inclusive at both ends, so the training slice stops at 24999
X_train = df.loc[:24999, "review"].values
y_train = df.loc[:24999, "sentiment"].values
X_test = df.loc[25000:, "review"].values
y_test = df.loc[25000:, "sentiment"].values

In [None]:
# TF-IDF grid search / logistic regression pipeline
tfidf = TfidfVectorizer(
    strip_accents=None, lowercase=False, preprocessor=None
)
param_grid = [
    {
        "vect__ngram_range": [(1, 1)],
        "vect__stop_words": [stop, None],
        "vect__tokenizer": [tokenizer, tokenizerPorter],
        "logReg__penalty": ["l1", "l2"],
        "logReg__C": [1.0, 10.0, 100.0],
    },
    {
        "vect__ngram_range": [(1, 1)],
        "vect__stop_words": [stop, None],
        "vect__tokenizer": [tokenizer, tokenizerPorter],
        "vect__use_idf": [False],
        "vect__norm": [None],
        "logReg__penalty": ["l1", "l2"],
        "logReg__C": [1.0, 10.0, 100.0],
    },
]
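# the step name "logReg" must match the "logReg__" prefix used in param_grid;
# the liblinear solver supports both the l1 and l2 penalties searched above
logRegTfidf = Pipeline(
    [("vect", tfidf), ("logReg", LogisticRegression(random_state=0, solver="liblinear"))]
)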
gs = GridSearchCV(
    estimator=logRegTfidf,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    verbose=1,
    n_jobs=1,
)
gs.fit(X_train, y_train)

> Remarks - The first dictionary in param_grid uses the default values for use_idf and norm, and the second dictionary forces the model to train based on the raw term frequencies.
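To get a feel for how much work this grid search requests, the number of candidate settings can be counted by hand. The sketch below copies the option lists from param_grid, with placeholder strings standing in for the stop-word list and tokenizer functions. The names `grid_one`, `grid_two` and `n_candidates` are assumptions for this example.

```python
from itertools import product

# option lists mirror the first dictionary in param_grid (placeholders for objects)
grid_one = {
    "ngram_range": [(1, 1)],
    "stop_words": ["nltk_english_list", None],
    "tokenizer": ["tokenizer", "tokenizerPorter"],
    "penalty": ["l1", "l2"],
    "C": [1.0, 10.0, 100.0],
}
# the second dictionary adds two single-valued options
grid_two = dict(grid_one, use_idf=[False], norm=[None])


def n_candidates(grid):
    # every combination of one value per parameter is one candidate
    return len(list(product(*grid.values())))


total_candidates = n_candidates(grid_one) + n_candidates(grid_two)
n_cv_fits = total_candidates * 5  # cv=5 means five fits per candidate

print(total_candidates, n_cv_fits)
```

Every one of these fits re-tokenizes the training reviews, half of them with the Porter stemmer, which helps explain why the search is slow with n_jobs=1.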

In [None]:
# evaluate grid search CV results
print("Best parameter set: {0}".format(gs.best_params_))
print("CV accuracy: {:.3f}".format(gs.best_score_))
print("Test accuracy: {:.3f}".format(gs.score(X_test, y_test)))

# Working with bigger data – online algorithms and out-of-core learning

The example above used only 50,000 reviews, and in real-world applications the dataset can be much larger. A technique called out-of-core learning allows us to work with data sets that don't fit in memory, which can be helpful if we don't have access to advanced computing systems. sklearn's stochastic gradient descent classifier, SGDClassifier, has a partial_fit method that lets us stream subsets of documents directly from a local drive and train the model incrementally on mini-batches of training data.
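The idea behind incremental training can be illustrated with a toy learner written in plain Python. This is a hypothetical stand-in and not sklearn's SGDClassifier: `TinyOnlineLogReg`, `stream_examples` and `mini_batches` are names invented for this sketch, and the data is synthetic.

```python
import math
import random


class TinyOnlineLogReg:
    """Toy logistic regression updated one mini-batch at a time (illustration only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def _probability(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X_batch, y_batch):
        # one stochastic gradient step per example; earlier batches are never revisited
        for x, y in zip(X_batch, y_batch):
            error = self._probability(x) - y
            self.weights = [w - self.learning_rate * error * v
                            for w, v in zip(self.weights, x)]
            self.bias -= self.learning_rate * error
        return self

    def predict(self, x):
        return int(self._probability(x) >= 0.5)


def stream_examples(n_examples, seed=0):
    # generator standing in for a file that is read one record at a time
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] + x[1] > 0)


def mini_batches(stream, size):
    # group a stream into lists of at most `size` items
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


toy_model = TinyOnlineLogReg(n_features=2)
for batch in mini_batches(stream_examples(2000), size=100):
    toy_model.partial_fit([x for x, _ in batch], [y for _, y in batch])
```

Only one mini-batch is held in memory at any time, which is the property that makes the approach usable for data sets larger than RAM.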

<a id = 'Working-with-bigger-data–online-algorithms-and-out-of-core-learning'></a>

In [None]:
# clean up text and remove stop words
def text_processor(text):
    # strip HTML markup
    text = re.sub(r"<[^>]*>", "", text)
    # find emoticons in the original casing so that :D and :P can match
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    # split on whitespace and drop stop words
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


# generator function to read in and return one document at a time
def streamDocs(path):
    with open(path, "r", encoding="utf-8") as csv:
        next(csv)
        for line in csv:
            # Slicing is very specific to how docs are stored
            text, label = line[:-3], int(line[-2])
            yield text, label


# return a particular number of documents from streamDocs
def getMiniBatch(streamDocs, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(streamDocs)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

> Remarks - Neither CountVectorizer nor TfidfVectorizer can be used here because each requires holding the entire vocabulary in memory, which is not possible in a mini-batch approach. Instead, sklearn provides HashingVectorizer, which is data-independent: it maps each token to a column index with a hash function, so no vocabulary needs to be stored.
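The hashing idea can be sketched with the standard library. This is an illustration under stated assumptions, not HashingVectorizer itself: sklearn uses a different hash function, while the sketch below uses MD5 only because it is deterministic and available in hashlib. The function names are invented for this example.

```python
import hashlib


def hashed_index(token, n_features):
    # map a token to a column index without consulting any stored vocabulary
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hashed_counts(tokens, n_features):
    # sparse representation: column index -> count
    vector = {}
    for token in tokens:
        ix = hashed_index(token, n_features)
        vector[ix] = vector.get(ix, 0) + 1
    return vector


tokens_batch_one = ["great", "movie", "great", "cast"]
tokens_batch_two = ["boring", "movie"]

# the same token lands in the same column in every batch
vec_one = hashed_counts(tokens_batch_one, n_features=2 ** 21)
vec_two = hashed_counts(tokens_batch_two, n_features=2 ** 21)

# with only two columns, three distinct tokens are forced to share columns
vec_tiny = hashed_counts(tokens_batch_one, n_features=2)
```

Because the column for "movie" depends only on the token and n_features, separate mini-batches produce compatible feature vectors. The last line shows the trade-off: when n_features is small, different tokens collide in the same column.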

In [None]:
# iterate over 45 batches of 1,000 documents
vect = HashingVectorizer(
    decode_error="ignore",
    n_features=2 ** 21,
    preprocessor=None,
    tokenizer=text_processor,
)
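# SGDClassifier is not among the imports at the top of the notebook
from sklearn.linear_model import SGDClassifier

# loss="log" selects logistic regression (renamed "log_loss" in recent sklearn releases)
clf = SGDClassifier(loss="log", random_state=1, max_iter=1)
# NOTE: streamDocs uses the built-in open(), which expects a local file path
docStream = streamDocs(path="s3://tdp-ml-datasets/misc/ImdbReviews.csv")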

import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = getMiniBatch(docStream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()
    print("Iteration accuracy: {0}".format(clf.score(X_train, y_train)))
    print("\n")

> Remarks - Choosing a large number of features in HashingVectorizer reduces the chance of hash collisions, at the cost of more coefficients in the model.

In [None]:
# test model on remaining 5,000 documents
X_test, y_test = getMiniBatch(docStream, size=5000)
X_test = vect.transform(X_test)
print("Test set accuracy: {0}".format(clf.score(X_test, y_test)))

The test set accuracy of the model trained on the full batch is 0.899, and the accuracy of the model trained in mini-batches is slightly lower at 0.867. The mini-batch model trained in less than a minute, whereas the full-batch model took over six hours.

In [None]:
# update model one more time with 5,000 documents used as test set
clf = clf.partial_fit(X_test, y_test)

## Store learned model using pickle

Training a model can take a while, and we lose it when the Python interpreter closes. Since we don't want to retrain a model every time we want to use it, we can use the pickle module to save the learned model. Pickle enables us to serialize and deserialize Python objects to compact bytecode so that we can save our classifier in its current state and then reload it later, even after the interpreter has been closed. With this pickle file in hand, we can classify new samples without needing the model to learn from the training data from scratch again. This will be used in Chapter 9 to build a Flask app.

<a id = 'Store-learned-model-using-pickle'></a>

In [None]:
# use pickle to store model objects
import pickle

dest = os.path.join("movieClassifier", "pkl_objects")
if not os.path.exists(dest):
    os.makedirs(dest)
# use context managers so the files are flushed and closed after writing
with open(os.path.join(dest, "stopwords.pkl"), "wb") as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, "classifier.pkl"), "wb") as f:
    pickle.dump(clf, f, protocol=4)

> Remarks - The dump method's first argument is the object we want to pickle, the second argument is a file object opened in binary write mode ('wb'), and protocol=4 selects pickle protocol version 4. We saved both the model and the stop words so that the NLTK stop word set doesn't have to be installed on the server where we will eventually deploy the model.
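Reloading works the same way in reverse. The round trip below is a minimal sketch that uses a temporary directory and a small stand-in list (`toy_stop_words`, a name made up for this example) rather than the real classifier. The same pattern would apply to the files written above.

```python
import os
import pickle
import tempfile

# stand-in object; in the notebook this would be the stop word list or the classifier
toy_stop_words = ["is", "and", "has"]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "stopwords.pkl")

    # serialize to disk in binary write mode
    with open(pkl_path, "wb") as f:
        pickle.dump(toy_stop_words, f, protocol=4)

    # deserialize in binary read mode, as a fresh interpreter session would
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_stop_words)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded when they come from a trusted source.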

# Topic modeling with Latent Dirichlet Allocation

Topic modeling is the task of assigning topics to unlabeled documents, which can be thought of as a form of unsupervised learning. For example, assigning newspaper articles to categories when we don't know the specific themes in advance would be topic modeling.

Latent Dirichlet Allocation (LDA, unrelated to linear discriminant analysis) is a popular technique for topic modeling. It is a generative probabilistic model that attempts to find groups of words that frequently appear together across different documents. These frequently co-occurring words represent topics, under the assumption that each document is a mixture of different words. The input is a bag-of-words matrix. The output is two new matrices: a document-to-topic matrix and a word-to-topic matrix. LDA decomposes the bag-of-words input matrix so that multiplying the two output matrices together reproduces the input with the lowest possible error. The downside is that the number of topics must be defined beforehand.
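The matrix view described above can be made concrete with a toy example. The numbers below are made up for illustration and were not produced by fitting LDA. They only show how a document-to-topic matrix and a topic-to-word matrix combine into per-document word proportions.

```python
# hypothetical values chosen by hand, not fitted by any model
toy_words = ["war", "battle", "song", "dance"]

# rows: documents, columns: topics (each row sums to 1)
doc_topic = [
    [0.9, 0.1],  # document 0 is mostly topic 0
    [0.2, 0.8],  # document 1 is mostly topic 1
]

# rows: topics, columns: words (each row sums to 1)
topic_word = [
    [0.5, 0.4, 0.1, 0.0],  # topic 0 leans toward "war" and "battle"
    [0.0, 0.1, 0.4, 0.5],  # topic 1 leans toward "song" and "dance"
]


def matmul(a, b):
    # plain-Python matrix product of a (n x k) and b (k x m)
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# rows: documents, columns: words; approximates normalized word counts
doc_word = matmul(doc_topic, topic_word)
```

The product has one row per document and one column per word, the same shape as the bag-of-words input, which is why the two output matrices can be judged by how well their product reproduces it.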

<a id = 'Topic-modeling-with-Latent-Dirichlet-Allocation'></a>

In [None]:
# create bag-of-words for LDA
# use max_df parameter to exclude words that appear in >10% of the docs
# use max_features to consider only the 5,000 most frequently occurring words
count = CountVectorizer(
    stop_words="english", max_df=0.1, max_features=5000
)
X = count.fit_transform(df["review"].values)

In [None]:
# perform LDA on movie reviews
lda = LatentDirichletAllocation(
    n_components=10, random_state=123, learning_method="batch"
)
X_topics = lda.fit_transform(X)

> Remarks - With learning_method='batch', the LDA estimator does its estimation on all of the training data in each iteration, which is slower than the alternative 'online' setting. The 'online' setting is analogous to the mini-batch, out-of-core workflow above.

In [None]:
# the LDA components_ attribute stores a matrix containing the word importance for the 10 topics
lda.components_.shape

In [None]:
# print the top 5 most important words for each topic
nTopWords = 5
feature_names = count.get_feature_names()
for topicIx, topic in enumerate(lda.components_):
    print("Topic {}: ".format(topicIx + 1))
    print(" ".join([feature_names[i] for i in topic.argsort()[: -nTopWords - 1 : -1]]))

> Remarks - A few topics stand out as having a strong theme. Topic 10 seems to be about war movies, topic 8 - musicals, topic 4 - family movies.

In [None]:
# print a few reviews for a category to evaluate theme
war = X_topics[:, 9].argsort()[::-1]
for iter_ix, movie_ix in enumerate(war[:3]):
    print("\nWar movie review {0}: ".format(iter_ix + 1))
    print(df["review"][movie_ix][:300], "...")

> Remarks - The three reviews above mention things like 'civil war', 'cavalry', 'history'. These seem to be in line with the topic of war, battles and historical conflict.