Distributed Cross Validated Parameter Search
------------------------------------

In the previous section we parallelized cross-validated parameter search on a single machine.  In this notebook we do the same exercise, but now on a distributed cluster.  

### Requirements

This notebook should be run on the provided cluster.

### Application

We train a support vector classifier on handwritten digits across many parameter settings, using cross-validation to guard against over-fitting.  Each task pairs one parameter setting with one cross-validation split, which is slightly more complex than a plain map, so we use `submit`.
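To make the `submit` pattern concrete, here is a minimal sketch that uses only the standard library. The names `toy_score`, `toy_params`, and `toy_splits` are hypothetical stand-ins rather than part of `cv_params_demo`, and the scoring formula is made up; a real task would fit and score a model instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_score(params, split_id):
    """Hypothetical stand-in for fitting and scoring a model on one split."""
    score = -abs(params["C"] - 1.0) - 0.1 * split_id  # made-up formula
    return params, split_id, score


toy_params = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]
toy_splits = [0, 1]

with ThreadPoolExecutor(max_workers=4) as executor:
    # One submit call per (split, params) pair; each returns a future immediately.
    futures = [executor.submit(toy_score, params, split_id)
               for split_id in toy_splits
               for params in toy_params]
    # Gather results in completion order, which may differ from submission order.
    toy_results = [future.result() for future in as_completed(futures)]

best = max(toy_results, key=lambda r: r[2])
```

Because each call to `submit` schedules a single function call with its own arguments, the nested loop carries over almost unchanged from a sequential version.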

As before, we start with a sequential solution.

### Imports

In [None]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import ParameterSampler  # sklearn.grid_search was removed in scikit-learn 0.20
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

### Shared Software Environment

`cv_params_demo` is a local .py file that defines the functions we are going to use.
In the local case, we imported functions from this module.
We will run into issues if our worker machines lack the `cv_params_demo.py` file.
Distributed computing frameworks have mechanisms to solve this by sending .py files around.
To avoid dealing with this here, we execute the file inside this notebook with the `%run` magic, which defines its functions directly in the notebook's namespace:

In [None]:
%run cv_params_demo.py

### Data

In [None]:
digits = load_digits()  # Collect Data

plt.imshow(digits.data[0].reshape(8, 8),  # Example element
           interpolation='nearest', cmap='gray');

### Parameters

In [None]:
param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}

param_samples = ParameterSampler(param_grid, 10)

list(param_samples)

### Split data for cross-validation

In [None]:
from cv_params_demo import load_cv_split

cv_splits = [load_cv_split(i) for i in range(2)]
idx, (x_train, x_test, y_train, y_test) = cv_splits[0]
x_train, y_train

### Sequential cross validated parameter search

In [None]:
%%time

results = []

for split in cv_splits:
    for params in param_samples:
        result = evaluate_one(SVC, params, split)
        results.append(result)

### Plot results

Which regions of parameter space score well?  Can we tell from the results we've computed?  

Sampling more parameter combinations would give a clearer picture of where the good regions lie.

In [None]:
from cv_params_demo import plot_results

plot_results(results)

### Exercise: Distributed parallel cross validated parameter search

We can use Spark, dask.distributed, or IPython Parallel to scale our computation across multiple machines.

In [None]:
cv_splits = [load_cv_split(i) for i in range(2)]  # Increase the number 2 after parallel computation is achieved
param_samples = ParameterSampler(param_grid, 10)  # Increase the number 10 after parallel computation is achieved

### Concurrent.futures solution

We load the solution that uses `concurrent.futures`.  Then we replace the stdlib `concurrent.futures.ThreadPoolExecutor` with an API-compatible executor from either `ipyparallel` or `dask.distributed`.
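The swap is easiest when the search code receives its executor as an argument. The following is a minimal sketch of that idea, not the contents of the solution file: `run_search` and `fake_evaluate` are hypothetical names, and `fake_evaluate` merely stands in for `evaluate_one`. It assumes only that the executor provides `submit` and that the returned futures provide `result()`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(executor, evaluate, params_list, splits):
    """Submit one task per (split, params) pair and return results in submission order."""
    futures = [executor.submit(evaluate, params, split)
               for split in splits
               for params in params_list]
    return [future.result() for future in futures]


def fake_evaluate(params, split):
    """Hypothetical stand-in for evaluate_one; returns a made-up score."""
    return split, params["gamma"], params["gamma"] * (split + 1)


# Local run with the stdlib executor.  On the cluster, an executor obtained
# from ipyparallel or dask.distributed would be passed in instead; how to
# construct one depends on the installed version and is not shown here.
with ThreadPoolExecutor(max_workers=2) as local_executor:
    demo_results = run_search(local_executor, fake_evaluate,
                              [{"gamma": 0.5}, {"gamma": 2.0}], [0, 1])
```

With this structure, moving from one machine to a cluster changes only the line that creates the executor.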

In [None]:
%load solutions/cvgs-1.py

In [None]:
plot_results(results)

### Spark Solution

We load the single-machine solution using the local Spark instance `'local[4]'`.  We replace this SparkContext with a new SparkContext pointing to the cluster instead.

In [None]:
%load solutions/cvgs-2.py

In [None]:
plot_results(results)