### We will make use of the partial_fit function of the SGDClassifier in scikit-learn to stream the documents directly from our local drive, and train a logistic regression model using small mini-batches of documents. 
1. First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file that we constructed at the beginning of this chapter and separates it into word tokens while removing stop words.
2. Next, we define a generator function stream_docs that reads in and returns one document at a time.
3. Finally, we define a function, get_minibatch, that takes a document stream from the stream_docs function and returns a particular number of documents specified by the size parameter.

In [2]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
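def tokenizer(text):
    # Strip HTML markup such as <br />.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed. Because the search runs
    # on lowercased text, ':D' and ':P' are not captured by the D|P alternatives.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    # Replace runs of non-word characters with a space, then append the
    # emoticons with their '-' noses removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) \
        + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized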
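
As a quick sanity check of the cleaning steps, the following is a minimal sketch of the same tokenizer applied to a short test string. It assumes a tiny hand-picked stop-word set in place of NLTK's English list so that it runs without downloading any corpus; the sample sentence is only an illustration.

```python
import re

# Assumption: a tiny stop-word set stands in for stopwords.words('english').
stop = {'this', 'is', 'a'}

def tokenizer(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word runs with spaces and append the emoticons without '-'.
    text = re.sub(r'[\W]+', ' ', text.lower()) \
        + ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('</a>This :) is :( a test :-)!')
print(tokens)  # ['test', ':)', ':(', ':)']
```

The markup and the stop words are dropped, while the emoticons survive at the end of the token list with the nose character removed.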

In [3]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [4]:
next(stream_docs(path='movie_data.csv'))

('"This 22 minute short, short of a precursor to the later much better ""Rock and Rule"", features two folk singer mice who are going nowhere. The female mouse, Jan, signs a deal with the devil to become a hit rock star. So it\'s up to Daniel Mouse to save her soul. Made in the late \'70\'s this has all the trappings of said decade (crap music, crap clothing and hair style, awful folk tunes) This cartoon is featured on the Second disc of the 2-Disk Collector\'s Edition of ""Rock and Rule"", it also comes with a Making of that runs almost as long as the show itself.<br /><br />My Grade: D+"',
 0)

In [5]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
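
To see how the two helpers interact, here is a small sketch that runs them on a temporary five-row file. The file contents and the header name are hypothetical and only mimic the "text,label" layout that stream_docs expects; the helper definitions are repeated so the snippet runs on its own.

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # The last three characters are ',<label>\n'.
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical toy data: one header line followed by five documents.
rows = ['review,sentiment'] + ['"doc %d",%d' % (i, i % 2) for i in range(5)]
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as f:
    f.write('\n'.join(rows) + '\n')
    tmp_path = f.name

doc_stream = stream_docs(tmp_path)
batches = [get_minibatch(doc_stream, size=2) for _ in range(3)]
os.remove(tmp_path)

for docs, labels in batches:
    print(docs, labels)
```

Note that the third call returns (None, None) even though one document was still available: when the stream runs out in the middle of a batch, the partially collected documents are discarded. With 50,000 documents and batch sizes of 1,000 and 5,000 this never happens, but it matters if the batch size does not divide the dataset evenly.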

### Unfortunately, we can't use CountVectorizer or TfidfVectorizer for out-of-core learning (the book mentions that it took 40 minutes to complete).
1. CountVectorizer requires holding the complete vocabulary in memory, and TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies (TF-IDF).
2. However, another useful vectorizer for text processing implemented in scikit-learn is HashingVectorizer. 
3. HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin Appleby.
4. It converts a collection of text documents to a matrix of token occurrences without storing a vocabulary.
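
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is an illustration only, not scikit-learn's implementation: it assumes MD5 from the standard library as a stand-in for MurmurHash3, and it leaves out details such as sign alternation and normalization. The function name hashed_counts and the sample tokens are hypothetical.

```python
import hashlib

def hashed_counts(tokens, n_features):
    """Map tokens to a fixed-length count vector via hash(token) mod n_features."""
    counts = [0] * n_features
    for tok in tokens:
        # Assumption: MD5 replaces MurmurHash3 here; any stable hash works
        # for the purpose of illustration.
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'little') % n_features
        counts[index] += 1
    return counts

tokens = ['great', 'movie', 'great', 'plot', 'bad', 'acting']

# Five distinct tokens squeezed into four columns: at least two of them
# must share a column (a hash collision).
vec = hashed_counts(tokens, n_features=4)
print(vec, sum(vec))

# A much larger table makes collisions unlikely, at the cost of more columns.
big = hashed_counts(tokens, n_features=2**21)
print(len(big), sum(big))
```

Because the column index depends only on the token itself, no vocabulary has to be built or kept in memory, which is what makes the approach suitable for streaming documents.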

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
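vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,  # number of features (columns) in the output matrices
                         tokenizer=tokenizer)
# loss='log_loss' selects logistic regression in scikit-learn >= 1.1
# (older releases spelled it loss='log' and also accepted n_iter=1).
clf = SGDClassifier(loss='log_loss', random_state=1)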
doc_stream = stream_docs(path='movie_data.csv')

### Using the preceding code, we initialized HashingVectorizer with our tokenizer function and set the number of features to 2**21. 
1. Furthermore, we reinitialized a logistic regression classifier by setting the loss parameter of the SGDClassifier to the logistic loss ('log' in older scikit-learn releases, 'log_loss' in current ones).
2. Note that by choosing a large number of features in the HashingVectorizer, we reduce the chance of causing hash collisions, but we also increase the number of coefficients in our logistic regression model. 
3. Now comes the really interesting part. Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

In [7]:
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    # Transform the documents into a document-term matrix of shape (n_samples, n_features).
    X_train = vect.transform(X_train)
    # Update the linear model with one pass of stochastic gradient descent over this mini-batch.
    clf.partial_fit(X_train, y_train, classes=classes)
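
To illustrate what each partial_fit call in the loop above does conceptually, here is a minimal pure-Python sketch of incremental logistic regression trained with stochastic gradient descent. It is not scikit-learn's implementation; the class name TinyLogisticSGD, the learning rate eta, and the toy data are assumptions made for this example.

```python
import math

class TinyLogisticSGD:
    """Hypothetical stand-in for an incremental logistic regression learner."""

    def __init__(self, n_features, eta=0.1):
        self.w = [0.0] * n_features  # one coefficient per feature
        self.b = 0.0                 # intercept
        self.eta = eta               # learning rate (assumed value)

    def _proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One pass over the mini-batch. The weights are kept between calls,
        # so every new batch continues from the previous state.
        for x, target in zip(X, y):
            error = self._proba(x) - target  # gradient of the log loss w.r.t. z
            self.w = [wi - self.eta * error * xi
                      for wi, xi in zip(self.w, x)]
            self.b -= self.eta * error
        return self

    def predict(self, X):
        return [1 if self._proba(x) >= 0.5 else 0 for x in X]

model = TinyLogisticSGD(n_features=1)
for _ in range(5):  # five mini-batches of two samples each
    model.partial_fit([[2.0], [-2.0]], [1, 0])

predictions = model.predict([[3.0], [-3.0]])
print(predictions)
```

The key point is that only the current mini-batch and the coefficient vector are in memory at any time, which is why the number of coefficients (here 1, in the notebook 2**21) is the main memory cost of the model.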

### We iterated over 45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:

In [8]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.871


### Out-of-core learning is very memory efficient and took less than a minute to complete. 