# Dask Incremental Wrapper

Some estimators can be trained incrementally, without seeing the entire dataset at once. For these estimators, Scikit-Learn provides the partial_fit API, which lets you stream data to the estimator one batch at a time.
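As a rough illustration of that API in plain Scikit-Learn, the sketch below trains an `SGDClassifier` on four in-memory NumPy batches. The synthetic data, the batch size of 25, and the names `X_all`, `y_all`, and `est` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data, for illustration only: 100 samples, 20 features.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 20))
y_all = (X_all[:, 0] > 0).astype(int)

est = SGDClassifier(random_state=10)
all_classes = np.array([0, 1])

# Stream the data to the estimator in batches of 25 rows.
for start in range(0, X_all.shape[0], 25):
    stop = start + 25
    est.partial_fit(X_all[start:stop], y_all[start:stop], classes=all_classes)

print(est.coef_.shape)
```

Each call updates the same model in place, so the full dataset never has to be passed to the estimator in a single call.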

Normally, if you pass a Dask Array to an estimator expecting a NumPy array, the Dask Array will be converted to a single, large NumPy array. On a single machine, you’ll likely run out of RAM and crash the program. On a distributed cluster, all the workers will send their data to a single machine and crash it.

`dask_ml.wrappers.Incremental` is a meta-estimator (an estimator that takes another estimator) that bridges Scikit-Learn estimators supporting the partial_fit API, which expect NumPy arrays, and users with large Dask Arrays. You wrap the underlying estimator in Incremental, and Dask-ML sequentially passes each block of the Dask Array to the underlying estimator’s partial_fit method.

Because the blocks are fed to partial_fit one after another, training is entirely sequential, so you won’t see large training-time speedups from parallelism. In a distributed environment, you should still notice some speedup from avoiding extra IO and from the fact that models are typically much smaller than the data, and so faster to move between machines.

In [None]:
!pip install -q -U dask-ml distributed

In [1]:
from dask_ml.datasets import make_classification
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

X, y = make_classification(chunks=25)
X

| | Array | Chunk |
|---|---|---|
| Bytes | 15.62 kiB | 3.91 kiB |
| Shape | (100, 20) | (25, 20) |
| Count | 4 Tasks | 4 Chunks |
| Type | float64 | numpy.ndarray |


In [2]:
estimator = SGDClassifier(random_state=10, max_iter=100)

clf = Incremental(estimator)

clf.fit(X, y, classes=[0, 1])

Incremental(estimator=SGDClassifier(max_iter=100, random_state=10))

In this example, we make a (small) random Dask Array. It has 100 samples, broken into 4 blocks of 25 samples each. The chunking is only along the first axis (the samples). There is no chunking along the features.
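The chunk layout can be checked directly on any Dask Array. The sketch below uses a stand-in array of ones with the same shape and chunking (the name `X_demo` is an assumption for this example, not the array generated above).

```python
import dask.array as da

# Stand-in array: 100 samples, 20 features, 25 rows per chunk.
X_demo = da.ones((100, 20), chunks=(25, 20))

print(X_demo.chunks)     # chunk sizes along each axis
print(X_demo.numblocks)  # number of blocks along each axis
```

The first axis is split into four chunks of 25 rows, while the feature axis stays in a single chunk of 20 columns.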

You instantiate the underlying estimator as usual. It really is just a scikit-learn compatible estimator, and will be trained normally via its partial_fit.

Notice that we call the regular .fit method, not partial_fit for training. Dask-ML takes care of passing each block to the underlying estimator for you.
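Conceptually, the block-by-block training resembles the loop below. This is a simplified sketch under stated assumptions, not Dask-ML's actual implementation: the synthetic data and the names `X_da`, `y_da`, and `model` are made up for illustration, and each block is pulled into memory with `.compute()` before being handed to the estimator.

```python
import dask.array as da
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_np = rng.normal(size=(100, 20))
y_np = (X_np[:, 0] > 0).astype(int)

# Chunk along the samples axis only, 25 rows per block.
X_da = da.from_array(X_np, chunks=(25, 20))
y_da = da.from_array(y_np, chunks=25)

model = SGDClassifier(random_state=10)
n_blocks = X_da.numblocks[0]

# Feed one block at a time; only a single block is in memory per step.
for i in range(n_blocks):
    X_block = X_da.blocks[i].compute()
    y_block = y_da.blocks[i].compute()
    model.partial_fit(X_block, y_block, classes=np.array([0, 1]))

print(n_blocks, model.coef_.shape)
```

The loop is strictly sequential, which matches the description of training above: block `i + 1` is only processed after the model has been updated with block `i`.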

Just like `sklearn.linear_model.SGDClassifier.partial_fit()`, we need to pass the classes argument to fit. In general, any argument that is required for the underlying estimator’s partial_fit becomes required for the wrapped fit.
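The requirement comes from Scikit-Learn itself: the first call to `partial_fit` on a classifier has to be told every class label up front, because a single batch may not contain all of them. A minimal sketch, using a made-up batch (`X_batch`, `y_batch`) for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# One synthetic batch, for illustration only.
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(25, 20))
y_batch = (X_batch[:, 0] > 0).astype(int)

# Without classes, the very first partial_fit call is rejected.
try:
    SGDClassifier(random_state=10).partial_fit(X_batch, y_batch)
    first_call_failed = False
except ValueError:
    first_call_failed = True

# With classes supplied, the same call succeeds.
est_ok = SGDClassifier(random_state=10)
est_ok.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

print(first_call_failed, est_ok.classes_)
```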



In [3]:
clf.score(X, y)

0.58

In [4]:
clf.coef_

array([[  7.04159622, -53.18316726, -50.00843776,   3.46007086,
         -7.45750787,  66.23508647, -22.81178162, -43.81375957,
         -5.47634647,  16.92657508, -21.09026996,  32.11258766,
          6.26198005,  57.06532193,   6.50974555,  -6.01623547,
        -11.09834534,  12.43828616, -12.42761286, -11.9996642 ]])

In [5]:
clf.estimator_

SGDClassifier(max_iter=100, random_state=10)