If you have more data than fits into your machine's memory, you will probably want something other than `fit`, because `fit` needs the whole training set loaded into memory at once to train the classifier. Some classifiers in Scikit-Learn support `partial_fit`, which does not require the full dataset to be in memory. You can train on the data in batches, loading only one batch at a time. Further, if you want to keep training your classifier on new data, you do not have to retrain it from scratch; you can feed the new data to the classifier by calling `partial_fit`. This gives you a way to improve the classifier as new data becomes available, without reloading the old data that it was already trained on.
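As a rough illustration of the idea above, the sketch below streams a CSV file in chunks and feeds each chunk to `partial_fit`. Everything in it is an assumption made for the example: the synthetic data, the column names, the temporary file, and the choice of `SGDClassifier`. It is not the code used in the rest of this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a small synthetic CSV so the sketch is self-contained (hypothetical data).
rng = np.random.RandomState(0)
features = rng.normal(size=(5000, 4))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)
frame = pd.DataFrame(features, columns=['f0', 'f1', 'f2', 'f3'])
frame['label'] = labels

csv_path = os.path.join(tempfile.mkdtemp(), 'synthetic.csv')
frame.to_csv(csv_path, index=False)

# partial_fit needs the full set of classes, since a chunk may not contain all of them.
classes = np.array([0, 1])
model = SGDClassifier(random_state=0)

n_chunks = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(csv_path, chunksize=1000):
    y_chunk = chunk.pop('label').to_numpy()
    model.partial_fit(chunk.to_numpy(), y_chunk, classes=classes)
    n_chunks += 1

train_accuracy = model.score(features, labels)
print(n_chunks, train_accuracy)
```

Only one chunk of 1000 rows is held in memory during training; the full arrays are kept here only to build the file and to score the model afterwards.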

My suggestion is to divide the data into batches that are as large as the machine's memory comfortably allows, because every call to the model carries some overhead. The fewer times you call `partial_fit`, the faster you train the model. In these cases, efficiency depends not just on the size of the data but also on how you divide it into batches.
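To make the batching trade-off concrete, here is a minimal sketch of a batch generator and the number of `partial_fit` calls it produces for a few batch sizes. The helper name `iter_batches`, the synthetic data, and the batch sizes are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB


def iter_batches(X, y, batch_size):
    """Yield consecutive (X, y) slices of at most batch_size rows."""
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        yield X[start:stop], y[start:stop]


# Synthetic stand-in data (hypothetical); any array-like dataset would do.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(10000, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)
classes = np.unique(y_demo)

calls_per_batch_size = {}
for batch_size in (100, 1000, 5000):
    model = BernoulliNB()
    n_calls = 0
    for X_batch, y_batch in iter_batches(X_demo, y_demo, batch_size):
        model.partial_fit(X_batch, y_batch, classes=classes)
        n_calls += 1
    calls_per_batch_size[batch_size] = n_calls

print(calls_per_batch_size)  # {100: 100, 1000: 10, 5000: 2}
```

Larger batches mean fewer calls and less per-call overhead, at the cost of more memory per batch.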

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn import linear_model
from sklearn import naive_bayes
from sklearn import preprocessing

DATA_DIR = 'data'
BANK_DIR = os.path.join(DATA_DIR, 'bank')
BANK_FULL_PATH = os.path.join(BANK_DIR, 'bank-full.csv')
EXPLANATION_PATH = os.path.join(BANK_DIR, 'bank-names.txt')

In [2]:
with open(EXPLANATION_PATH) as f:
    # read() keeps the file's own line breaks; joining readlines() with '\n' would double them
    explanation = f.read()

In [3]:
print(explanation)

Citation Request:
  This dataset is public available for research. The details are described in [Moro et al., 2011].
  Please include this citation if you plan to use this database:

  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

  Available at: [pdf] http://hdl.handle.net/1822/14838
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

1. Title: Bank Marketing

2. Sources
   Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012

3. Past Usage:

  The full dataset was described and analyzed in:

  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.
  In P. Novais et a

In [4]:
df = pd.read_csv(BANK_FULL_PATH, delimiter=';')

# Encode each categorical column as integer codes
label_encoder = preprocessing.LabelEncoder()
for field in ('job', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'marital'):
    df[field] = label_encoder.fit_transform(df[field])

# Binary target: 1 if the answer is 'yes', 0 otherwise
y = (df.pop('y') == 'yes').astype(int).to_numpy()

# Standardize every feature column to zero mean and unit variance.
# StandardScaler expects 2D input, so scale the whole frame at once.
standard_scaler = preprocessing.StandardScaler()
X = standard_scaler.fit_transform(df)
print(X.shape, y.shape)

((45211, 16), (45211,))


### Here are some of the classifiers that we could use for out-of-core learning

- `sklearn.naive_bayes.MultinomialNB`
- `sklearn.naive_bayes.BernoulliNB`
- `sklearn.linear_model.Perceptron`
- `sklearn.linear_model.SGDClassifier`
- `sklearn.linear_model.PassiveAggressiveClassifier`

In [5]:
incremental_learners = (
    naive_bayes.MultinomialNB(),
    naive_bayes.BernoulliNB(),
    linear_model.Perceptron(),
    linear_model.SGDClassifier(),
    linear_model.PassiveAggressiveClassifier(),
)

# To be on the safe side, check that every learner exposes partial_fit
for learner in incremental_learners:
    assert hasattr(learner, 'partial_fit')
print('All of the incremental learners implement partial fit')

All of the incremental learners implement partial fit


In [6]:
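# All class labels must be passed on the first partial_fit call
y_all = np.unique(y)

# Note: despite the variable name, this model is a BernoulliNB
multinomial = naive_bayes.BernoulliNB()
multinomial.partial_fit(X[:1000], y[:1000], classes=y_all)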


BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [7]:
multinomial.score(X[-1000:], y[-1000:])

0.57399999999999995

Let's train on more data and see whether that actually improves the score on the last 1000 observations in the dataset.

In [8]:
multinomial.partial_fit(X[1000:-1000], y[1000:-1000])

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [9]:
# Score again on the same held-out last 1000 observations
multinomial.score(X[-1000:], y[-1000:])

0.64300000000000002

Indeed, the score increased. Note that the last 1000 observations were never used to train the classifier.
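The same experiment can be repeated batch by batch to watch the held-out score evolve as more data is seen. The sketch below uses synthetic data rather than the bank dataset, and the batch size of 1000 and the noise level are arbitrary assumptions. An increase after every batch is not guaranteed.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (hypothetical): the label depends on a noisy first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(6000, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=6000) > 0).astype(int)

# Hold out the last 1000 rows; they are never passed to partial_fit.
X_train, y_train = X_demo[:-1000], y_demo[:-1000]
X_test, y_test = X_demo[-1000:], y_demo[-1000:]

model = BernoulliNB()
classes = np.unique(y_demo)

scores = []
for start in range(0, X_train.shape[0], 1000):
    stop = start + 1000
    model.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)
    # Record the held-out accuracy after each additional batch.
    scores.append(model.score(X_test, y_test))

print(scores)
```

Each entry in `scores` is the accuracy on the held-out rows after one more batch of 1000 training rows.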