Parallel Collections
---------------------

Systems like Spark and Dask include "big data" collections with a small set of high-level primitives.  These primitives cover common patterns, so we can often express computations that are more complex than a plain map but still regular in structure.

In this section we repeat the scikit-learn example using the PySpark RDD and the Dask Bag, both of which provide parallel operations on linear collections of arbitrary objects.


### Objectives

*  Use high-level collections (the PySpark RDD and the Dask Bag) to parallelize non-map patterns

### Requirements

*  scikit-learn
*  PySpark
*  Dask.bag

### Application

We train a machine learning model across many parameters with cross validation.  This is slightly more complex than a map: every parameter setting must be evaluated on every cross-validation split, so we combine a Cartesian product with a map.  We train a support vector classifier on handwritten digits, using cross validation to avoid over-fitting.

As before we start with a sequential solution.

In [None]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
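# ParameterSampler lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import ParameterSampler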
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

from cv_params_demo import load_cv_split, evaluate_one

digits = load_digits()

plt.imshow(digits.data[0].reshape(8, 8),
           interpolation='nearest', cmap='gray');

In [None]:
param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}

param_space = ParameterSampler(param_grid, 10)

In [None]:
from cv_params_demo import load_cv_split

cv_splits = [load_cv_split(i) for i in range(2)]
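
The sequential solution amounts to a nested loop: every parameter setting is evaluated on every cross-validation split. The sketch below shows that shape on toy data. The names `toy_params`, `toy_splits`, and `toy_evaluate` are hypothetical stand-ins for `param_space`, `cv_splits`, and `evaluate_one`, so it runs without training any model.

```python
# Hypothetical stand-ins for param_space and cv_splits
toy_params = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]
toy_splits = ["split-0", "split-1"]


def toy_evaluate(params, split):
    # Placeholder for evaluate_one(SVC, params, split): no model is trained here
    return (params["C"], split)


# Every parameter setting is evaluated on every split
sequential_results = []
for params in toy_params:
    for split in toy_splits:
        sequential_results.append(toy_evaluate(params, split))
```

With the notebook's real objects, the loop body would presumably call `evaluate_one(SVC, params, split)` over `param_space` and `cv_splits`, assuming the same argument order used in the PySpark cell.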

### PySpark

In [None]:
from pyspark import SparkContext
sc = SparkContext('local[4]')

In [None]:
param_rdd = sc.parallelize(param_space)
cv_rdd = sc.parallelize(cv_splits)

In [None]:
rdd = param_rdd.cartesian(cv_rdd).map(lambda ab: evaluate_one(SVC, ab[0], ab[1]))

In [None]:
%%time
results = rdd.collect()

### Dask.bag

In [None]:
import dask.bag as db

param_bag = db.from_sequence(param_space)
cv_bag = db.from_sequence(cv_splits)

# Each element of the product is a (params, cv_split) tuple, passed to map as one argument
b = param_bag.product(cv_bag).map(lambda ab: evaluate_one(SVC, ab[0], ab[1]))

In [None]:
%%time
results = b.compute()

In [None]:
%%time

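# Run the same computation on the threaded scheduler
# (the `get=` keyword was replaced by `scheduler=` in newer Dask releases)
results = b.compute(scheduler='threads')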
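
The `product` + `map` pattern can be checked on toy data before running the full search. This is a minimal sketch, assuming a reasonably recent Dask release in which `compute` accepts a `scheduler=` keyword. The bags `xs` and `ys` are made-up inputs standing in for the parameter and split bags.

```python
import dask.bag as db

# Toy inputs standing in for the parameter bag and the split bag
xs = db.from_sequence([1, 2, 3])
ys = db.from_sequence([10, 20])

# product yields one (x, y) tuple per pair; map receives that tuple as a single argument
pairs = xs.product(ys).map(lambda xy: xy[0] + xy[1])

# Nothing runs until compute(); the synchronous scheduler keeps this toy run in one process
sums = sorted(pairs.compute(scheduler="synchronous"))
```

The result is sorted because the order of elements across partitions should not be relied on.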

### Conclusion

*  Higher-level collections include functions for common patterns
*  Move data into a collection, construct a lazy computation, and trigger it at the end
*  We used PySpark (`cartesian` + `map`) and Dask.bag (`product` + `map`) to handle the nested for loop