# Working with bigger data – online algorithms and out-of-core learning

In [6]:
from pandas import read_csv

In [7]:
df = read_csv('movie_data.csv', encoding='utf-8')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [8]:
df.shape

(50000, 2)

# Function for cleaning, processing and stop-word removal

We define a **tokenizer** function that cleans the unprocessed text data from the **movie_data.csv** file and separates it into word tokens while removing stop words:

In [9]:
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

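def tokenizer(text):
    # strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-) =( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    # replace non-word characters with spaces, then append the emoticons
    # (minus the '-' nose); the explicit ' ' keeps the first emoticon from
    # fusing with the last word when the text ends in a word character
    text = re.sub(r'[\W]+', ' ', text.lower()) \
           + ' ' + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized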
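
A quick check on a made-up string shows what the tokenizer produces. This is only a sketch under assumptions: `demo_stop` is a hypothetical three-word stop list standing in for the NLTK list so that the snippet runs without the NLTK corpus, and `demo_tokenizer` is a local copy of the function above that puts an explicit space before the appended emoticons.

```python
import re

# Hypothetical, tiny stop-word list (assumption: stands in for NLTK's list)
demo_stop = {'this', 'is', 'a'}

def demo_tokenizer(text, stop_words=demo_stop):
    text = re.sub(r'<[^>]*>', '', text)                # remove HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower())        # drop punctuation
            + ' ' + ' '.join(emoticons).replace('-', ''))
    return [w for w in text.split() if w not in stop_words]

tokens = demo_tokenizer('</a>This :) is :( a test :-)!')
print(tokens)  # ['test', ':)', ':(', ':)']
```

The HTML tag and the stop words disappear, while the emoticons survive as separate tokens at the end of the list.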

# Functions for Streaming

We define a generator function **stream_docs** that reads in and returns one document at a time:

In [4]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
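
The slicing inside **stream_docs** relies on the layout of each line: the review text, a comma, a one-digit label, and a newline. The sketch below applies the same slices to a made-up line (the review string is hypothetical, not taken from the dataset).

```python
# Hypothetical CSV line in the layout that stream_docs expects:
# review text, a comma, a one-digit label, and a trailing newline
line = '"A short, made-up review.",1\n'

text = line[:-3]        # drop the last three characters: ',', '1', '\n'
label = int(line[-2])   # the second-to-last character is the label digit

print(repr(text))
print(label)
```

Note that the surrounding quotation marks stay part of the text, and that this approach only works as long as every review sits on a single line and the label is a single digit.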

In [5]:
# let's verify if the stream_docs function works correctly
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

We will now define a function, **get_minibatch**, that will take a document stream from the stream_docs function and return a particular number of documents specified by the size parameter.

In [6]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
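
To see how **get_minibatch** behaves at the end of a stream, the following sketch feeds it a hypothetical in-memory iterator of five made-up (text, label) pairs instead of the CSV file. The function body is repeated so that the snippet runs on its own.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stream of five documents (assumption: stands in for the file)
toy_stream = iter([('doc a', 1), ('doc b', 0), ('doc c', 1),
                   ('doc d', 0), ('doc e', 1)])

first = get_minibatch(toy_stream, size=2)    # two documents
second = get_minibatch(toy_stream, size=2)   # two more documents
third = get_minibatch(toy_stream, size=2)    # only one document is left

print(first)
print(second)
print(third)
```

The third call returns `(None, None)`: when the stream runs out before a batch is full, the partial batch is discarded rather than returned.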

We can't use **CountVectorizer** for out-of-core learning since it requires holding the complete vocabulary in memory. Also, **TfidfVectorizer** needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is **HashingVectorizer**. **HashingVectorizer** is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin Appleby (https://sites.google.com/site/murmurhash/):

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
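clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
# n_iter=1 for scikit-learn 0.18.1
# max_iter=1 for scikit-learn 0.19.1 and later (n_iter was removed in 0.21)
# loss='log_loss' for scikit-learn 1.1 and later ('log' was removed in 1.3)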
doc_stream = stream_docs(path='movie_data.csv')
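
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the hypothetical helper `hashed_counts` uses `zlib.crc32` in place of MurmurHash3 and omits the sign handling and normalization that **HashingVectorizer** applies by default.

```python
import zlib
from collections import Counter

def hashed_counts(tokens, n_features=2**21):
    """Map tokens to column indices by hashing; no vocabulary is stored."""
    counts = Counter()
    for token in tokens:
        # zlib.crc32 stands in for MurmurHash3 in this sketch
        index = zlib.crc32(token.encode('utf-8')) % n_features
        counts[index] += 1
    return counts

features = hashed_counts(['good', 'movie', 'good'])
print(sum(features.values()))   # total token count is preserved

# With a single feature column, every token collides in column 0
tiny = hashed_counts(['good', 'movie', 'good'], n_features=1)
print(tiny)
```

Because the column index depends only on the token itself, documents can be transformed one batch at a time without a fitting step. The second call shows the price: the fewer columns there are, the more tokens share one.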

We initialized **HashingVectorizer** with our tokenizer function and set the number of features to $2^{21}$. Furthermore, we initialized a logistic regression classifier by setting the loss parameter of the **SGDClassifier** to 'log'. Note that by choosing a large number of features in the **HashingVectorizer**, we reduce the chance of hash collisions, but we also increase the number of coefficients in our logistic regression model. Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

In [9]:
import pyprind
import numpy as np

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:33
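
Conceptually, each **partial_fit** call updates the existing weights with the new mini-batch instead of starting from scratch. The sketch below illustrates that idea with a hypothetical `TinyOnlineLogReg` class written in plain Python. It is an unregularized toy with a fixed learning rate and made-up data, not how **SGDClassifier** is implemented.

```python
import math

class TinyOnlineLogReg:
    """Hypothetical, unregularized logistic regression trained by SGD."""

    def __init__(self, n_features, eta=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # one pass over the mini-batch, one weight update per example
        for xi, yi in zip(X, y):
            error = yi - self.predict_proba(xi)
            self.w = [w + self.eta * error * v for w, v in zip(self.w, xi)]
            self.b += self.eta * error
        return self

model = TinyOnlineLogReg(n_features=2)
batch_X = [[1.0, 0.0], [0.0, 1.0]]   # made-up feature vectors
batch_y = [1, 0]
for _ in range(10):                  # ten mini-batches arrive one by one
    model.partial_fit(batch_X, batch_y)

print(model.predict_proba([1.0, 0.0]) > 0.5)
print(model.predict_proba([0.0, 1.0]) < 0.5)
```

Only the current mini-batch and the weight vector are in memory at any time, which is what makes the approach suitable for data that does not fit in memory.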


We made use of the **PyPrind** package in order to estimate the progress of our learning algorithm. We initialized the progress bar object with 45 iterations and, in the following for loop, we iterated over 45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:

In [10]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


As we can see, the accuracy of the model is approximately 87 percent, slightly below the accuracy that we achieved in the previous section using the grid search for hyperparameter tuning. However, out-of-core learning is very memory efficient and took less than a minute to complete. **Finally, we can use the last 5,000 documents to update our model**. **IMPORTANT (a general methodology used here by Sebastian Raschka and Vahid Mirjalili)**: After we apply a trained model to the unseen test data to verify how well it generalizes, we can use the documents (observations) in the test dataset to update the model for future predictions.

In [11]:
clf = clf.partial_fit(X_test, y_test)

As an alternative to the bag-of-words models used here, see **word2vec** from Google: https://code.google.com/archive/p/word2vec/.

# Embedding a Machine Learning Model into a Web Application
## Serializing fitted scikit-learn estimators

We create a **movieclassifier** directory where we will later store the files and data for our web application. Within this **movieclassifier** directory, we create a **pkl_objects** subdirectory to save the serialized Python objects to our local drive.

In [12]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)
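
On Python 3.2 or newer, the existence check can also be folded into the call itself via `exist_ok=True`. The following sketch shows this variant inside a temporary base directory (an assumption made here so that the example does not touch the real project folder).

```python
import os
import tempfile

# A temporary base directory keeps this sketch from touching real files
with tempfile.TemporaryDirectory() as base:
    dest = os.path.join(base, 'movieclassifier', 'pkl_objects')
    os.makedirs(dest, exist_ok=True)   # creates both directory levels
    os.makedirs(dest, exist_ok=True)   # second call is a no-op, no error
    created = os.path.isdir(dest)

print(created)
```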

Via the **dump** function of the **pickle** module, we then serialized the trained logistic regression model as well as the stop-word list from the Natural Language Toolkit (NLTK) library, so that we don't have to install the NLTK vocabulary on the server.

## Pickling Python objects

In [13]:
# context managers make sure the files are flushed and closed
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)

The **dump** function takes as its first argument the object that we want to pickle, and for the second argument we provided an open file object that the Python object will be written to. Via the **wb** argument inside the open function, we opened the file in binary mode for **pickle**, and we set **protocol=4** to choose **pickle** protocol 4, which was added in Python 3.4 and can be read by Python 3.4 or newer.
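
A round trip makes the write and read sides concrete. The sketch below pickles a hypothetical three-word list (`stop_demo`, standing in for the real stop words) into a temporary directory and loads it back with `pickle.load`.

```python
import os
import pickle
import tempfile

stop_demo = ['a', 'an', 'the']   # hypothetical stand-in for the stop words

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.pkl')
    with open(path, 'wb') as f:            # binary write mode
        pickle.dump(stop_demo, f, protocol=4)
    with open(path, 'rb') as f:            # binary read mode
        restored = pickle.load(f)

print(restored == stop_demo)
```

The file has to be opened in binary mode on both sides. As with any pickle file, only load files that come from a trusted source.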