Compared with TF-IDF vectorization and logistic regression trained by gradient descent, a hashing vectorizer combined with logistic regression trained by stochastic gradient descent has the following advantages:
1. It allows scalable, out-of-core online learning, so it can handle much larger text datasets.
2. It integrates better with the Flask app: it does not need to keep a vocabulary dictionary in memory to vectorize new text, and the classifier can be updated with new text data to improve future performance.
3. It is faster and easier to pickle.
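
The second point can be illustrated with a minimal sketch. The two sample sentences are made up for illustration and are not project data. A `HashingVectorizer` needs no `fit` step and stores no vocabulary, so `transform` can be called directly on unseen text:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical example documents, not taken from the project data.
docs = ["great movie", "terrible plot and acting"]

vect = HashingVectorizer(n_features=2**21)

# No fit() call is needed: tokens are mapped to column indices by a hash function.
X = vect.transform(docs)

print(X.shape)                       # one row per document, 2**21 columns
print(hasattr(vect, "vocabulary_"))  # False: no learned vocabulary is stored
```

Because nothing is learned from the data, the same vectorizer settings can simply be recreated wherever new text has to be transformed.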

The disadvantages of the hashing vectorizer are:
1. It is not possible to introspect which features matter most to the model, since no vocabulary dictionary is stored and the hashing cannot be inverted.
2. Collisions can occur: distinct tokens can be mapped to the same feature index. This should rarely matter if n_features is large enough (e.g. 2<sup>21</sup> here).
3. There is no IDF weighting.
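
The collision issue can be made visible with a deliberately tiny feature space. This is only a sketch: the three one-word documents and `n_features=2` are assumptions chosen for illustration, and `alternate_sign=False` with `norm=None` are set so the raw counts stay visible.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical one-token documents; n_features=2 is deliberately tiny.
docs = ["apple", "banana", "cherry"]

# alternate_sign=False and norm=None keep the raw token counts visible.
tiny = HashingVectorizer(n_features=2, alternate_sign=False, norm=None)
X = tiny.transform(docs)

# Column index assigned to each token (one nonzero entry per row).
cols = X.nonzero()[1]
print(cols)

# Three tokens but only two columns: at least two tokens must share an index.
print(len(set(cols.tolist())) < len(docs))
```

With 2<sup>21</sup> columns the same effect still exists in principle, but it becomes much less likely for any given pair of tokens.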

Therefore, I used TF-IDF vectorization and logistic regression with gradient descent for the development stage of the project (logit_GD), and switched to a hashing vectorizer with the same model (logistic regression) and similar parameters for the production stage (this notebook).

In [1]:
# Placed first so the cell also runs as a plain Python script;
# makes print a function, as in Python 3
from __future__ import print_function

import os
import pickle
import re

import numpy as np
import pandas as pd

# import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

In [2]:
stop = stopwords.words('english')

def tweetTokenizer(tweet):
    """Split a tweet on whitespace and drop English stopwords."""
    return [w for w in tweet.split() if w not in stop]
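
A quick usage sketch of the tokenizer follows. The four-word `stop` list here is a stand-in assumed for illustration, so the example runs without downloading the NLTK corpus.

```python
# Small stand-in for NLTK's English stopword list (an assumption for this sketch).
stop = ['the', 'is', 'a', 'and']

def tweetTokenizer(tweet):
    """Split a tweet on whitespace and drop stopwords."""
    return [w for w in tweet.split() if w not in stop]

print(tweetTokenizer('the movie is a masterpiece'))  # ['movie', 'masterpiece']

# Matching is case-sensitive and punctuation stays attached to the token.
print(tweetTokenizer('The movie is great!'))         # ['The', 'movie', 'great!']
```

Inside `HashingVectorizer` the case sensitivity matters less, because with the default `lowercase=True` the text is lowercased before the tokenizer is called.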

In [3]:
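def stream_docs(path):
    """Yield (text, label) pairs one line at a time, without loading the whole file."""
    with open(path, "r") as f:
        for line in f:
            # Each line is expected to look like `label,text`; no header line is skipped here
            t = line.split(',')
            text, label = t[1], int(t[0])
            yield text, label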

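def get_minibatch(doc_stream, size):
    """Collect the next `size` documents and labels from the stream.

    Returns (None, None) if the stream runs out before a full batch is read.
    """
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y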
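
The batching behaviour can be checked on a small in-memory stream. This is a minimal sketch; `toy_stream` and its five pairs are hypothetical and stand in for the CSV file.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical in-memory stream of five (text, label) pairs.
toy_stream = iter([('a', 0), ('b', 1), ('c', 0), ('d', 1), ('e', 0)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)

print(first)   # (['a', 'b'], [0, 1])
print(second)  # (['c', 'd'], [0, 1])
print(third)   # (None, None): the incomplete last batch ('e') is discarded
```

The last line shows a consequence of this design: documents left over in an incomplete final batch are never returned.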

In [4]:
# Define current working directory as work_dir
os.chdir(work_dir)
path = 'rtBinaryClean.csv'

In [5]:
# Pass over the file once to check that every row parses as `label,text`
with open(path, "r") as f:
    next(f) # skip header
    for line in f:
        fields = line.split(',')
        text, label = fields[1], int(fields[0])

In [6]:
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tweetTokenizer)

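# Logistic regression trained with SGD. Note: in newer scikit-learn releases,
# n_iter was replaced by max_iter and loss='log' was renamed to 'log_loss'.
clf = SGDClassifier(loss='log', penalty='l2', alpha=0.00004, random_state=1, n_iter=1)
doc_stream = stream_docs(path)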

In [7]:
import pyprind
pbar = pyprind.ProgBar(9)

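# partial_fit needs the full set of class labels, since a single minibatch may not contain both
classes = np.array([0, 1])
# Train on up to 9 minibatches of 1000 documents each
for _ in range(9):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()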

0%     100%
[#########] | ETA: 00:00:00
Total time elapsed: 00:00:00


In [8]:
X_test, y_test = get_minibatch(doc_stream, size=1000)

In [9]:
X_test = vect.transform(X_test)

In [10]:
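print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
# After scoring, fold the test minibatch into the model as additional training data
clf = clf.partial_fit(X_test, y_test)
# Fraction of positive labels in the test set (the accuracy of always predicting class 1)
pct = float(sum(y_test))/len(y_test)
print('Random Chance Accuracy in Test Set: %.3f' % pct)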

Test Accuracy: 0.717
Random Chance Accuracy in Test Set: 0.463
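
The value printed as "Random Chance Accuracy" is the fraction of positive labels in the test minibatch. As a possible alternative reference point, the majority-class baseline (always predicting the more frequent class) can be derived from the same number. This is a sketch using only the value printed above.

```python
pos_fraction = 0.463   # fraction of positive labels, as printed above

# Accuracy of always predicting whichever class is more frequent.
majority_baseline = max(pos_fraction, 1 - pos_fraction)
print('Majority-class baseline: %.3f' % majority_baseline)  # 0.537
```

Against either reference, the test accuracy of 0.717 is clearly above a trivial classifier.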


Pickle the stopword list and the classifier so that the Flask app can use them online:

In [32]:
cwd = os.getcwd()
dest = os.path.join(cwd, 'models', 'pkl_obj')

if not os.path.exists(dest):
    os.makedirs(dest)
with open(os.path.join(dest, 'stopwords.pkl'),'wb') as f:
    pickle.dump(stop, f)
with open(os.path.join(dest, 'classifier.pkl'),'wb') as f:
    pickle.dump(clf, f)