Distributed Cross-Validated Parameter Search
--------------------------------------------

In the previous section we parallelized cross-validated parameter search on a single machine.  In this notebook we do the same exercise, but now on a distributed cluster.  

### Requirements

This notebook should be run on the provided cluster.  View the README for connection information.

### Imports

In [None]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import ParameterSampler
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

### Shared Software Environment

`cv_params_demo.py` is a local file that defines the functions we are going to use.
In the local case, we imported those functions from the module.
On a cluster, that import fails on any worker machine that lacks the `cv_params_demo.py` file.
Distributed computing frameworks provide mechanisms for sending .py files to the workers.
To sidestep this here, we execute the file inside this notebook with the `%run` magic, which defines its functions directly in the notebook's namespace:

In [None]:
%run cv_params_demo.py
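
Why the workers would need the file at all can be sketched with the standard library alone. The example function `json.dumps` is our choice for illustration, not something the notebook uses. Standard `pickle` serializes an importable function as a reference to its module and name, so the receiving process must be able to import that module. Functions defined in the notebook itself live in `__main__`, and, as far as we know, both Dask and PySpark rely on `cloudpickle`, which ships such functions by value; that would explain why `%run` avoids the problem.

```python
import json
import pickle

# Standard pickle stores an importable function as a reference
# (module name + function name), not as source code or bytecode.
payload = pickle.dumps(json.dumps)

print(b"json" in payload)   # the module name is embedded in the payload
print(b"dumps" in payload)  # so is the function name
print(len(payload))         # tiny payload: no code travels with it

# Unpickling works here only because `json` is importable in this process.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

A process that cannot import the named module fails at the `pickle.loads` step, which is the situation a worker is in when a module-level function arrives without its .py file.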

In [None]:
# Collect Data
digits = load_digits()  

# Construct parameter grid
param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}
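
To get a feel for the size of this grid, here is a small sketch in plain Python. The sizes are copied by hand from the cell above (1001 values each for `C` and `gamma`, 4 for `tol`); the names `grid_sizes`, `n_combinations`, and `step` are ours.

```python
import math

# Sizes copied from the param_grid cell: np.logspace(-10, 10, 1001) yields
# 1001 values, np.logspace(-4, -1, 4) yields 4.
grid_sizes = {"C": 1001, "gamma": 1001, "tol": 4}

# Number of distinct (C, gamma, tol) combinations in the full grid.
n_combinations = math.prod(grid_sizes.values())
print(n_combinations)  # 4008004

# Spacing of the exponents for C and gamma: 1001 points spanning -10..10.
step = (10 - (-10)) / (grid_sizes["C"] - 1)
print(step)  # 0.02
```

With roughly four million combinations, evaluating the grid exhaustively is impractical, so the cells that follow draw a small random sample from it instead.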

## Exercise: Distributed parallel cross-validated parameter search

We extend the `concurrent.futures` and Spark solutions to scale our computation across multiple machines.

In [None]:
cv_splits = [load_cv_split(i) for i in range(2)]  # Increase the number 2 once parallel computation is achieved
param_samples = ParameterSampler(param_grid, 10)  # Increase the number 10 once parallel computation is achieved

### Concurrent.futures solution

We've included the `concurrent.futures` solution from the previous notebook below.  You may want to replace the `ThreadPoolExecutor` below with a `dask.distributed.Client` object and point it to the Dask scheduler.

While it runs, you may want to use [Dask's diagnostic dashboard](../../../9002/status) to get feedback from the cluster.  We recommend placing the dashboard and your notebook side by side on your screen.

In [None]:
# %load solutions/cvgs-1.py
from concurrent.futures import ThreadPoolExecutor
e = ThreadPoolExecutor()

futures = []

parameters = list(param_samples)

for split in cv_splits:
    for params in parameters:
        future = e.submit(evaluate_one, SVC, params, split)
        futures.append(future)

results = [f.result() for f in futures]


In [None]:
plot_results(results)
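
One way to make the swap painless is to write the submission loop against the executor interface rather than a specific executor. The sketch below is an assumption-laden illustration: `run_search` and `fake_evaluate` are hypothetical helpers (the latter stands in for `evaluate_one`), and the demo inputs are placeholders rather than real parameter samples or CV splits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(executor, func, parameters, splits):
    """Submit one task per (split, params) pair and gather results in order."""
    futures = [
        executor.submit(func, params, split)
        for split in splits
        for params in parameters
    ]
    return [f.result() for f in futures]


def fake_evaluate(params, split):
    # Hypothetical stand-in for evaluate_one: returns a dummy score.
    return {"params": params, "split": split, "score": params["C"] * (split + 1)}


demo_parameters = [{"C": 1.0}, {"C": 10.0}]  # placeholder parameter samples
demo_splits = [0, 1]                         # placeholder CV split identifiers

with ThreadPoolExecutor() as pool:
    demo_results = run_search(pool, fake_evaluate, demo_parameters, demo_splits)

print(len(demo_results))                    # 4
print([r["score"] for r in demo_results])   # [1.0, 10.0, 2.0, 20.0]
```

A `dask.distributed.Client` also provides a `submit` method that returns futures with a `result` method, so it could be passed as `executor` here in place of the thread pool. The scheduler address comes from the README and is deliberately not filled in.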

### Spark Solution

Here we also provide the local Spark solution to this problem.  Point your `SparkContext` at the Spark master instead of a local one.  While this runs, you may want to use [Spark's diagnostic dashboard](../../../9070) to get feedback from the cluster.

In [None]:
from pyspark import SparkContext
sc = SparkContext(...)
sc

In [None]:
cv_rdd = sc.parallelize(cv_splits)
param_rdd = sc.parallelize(list(param_samples))

rdd = param_rdd.cartesian(cv_rdd)
results = rdd.map(lambda tup: evaluate_one(SVC, tup[0], tup[1]))

results = results.collect()

In [None]:
# If you need to terminate your Spark context
# sc.stop()

In [None]:
plot_results(results)
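
For readers less familiar with RDD operations, the `cartesian` and `map` steps above have a close local analogue in `itertools.product` and a list comprehension. This is only a single-process sketch with placeholder data (`demo_params`, `demo_splits`, and `labels` are our names); it shows the shape of the tuples, not how Spark distributes them.

```python
from itertools import product

demo_params = [{"C": 1.0}, {"C": 10.0}, {"C": 100.0}]  # placeholder parameter samples
demo_splits = ["split-0", "split-1"]                    # placeholder CV splits

# Local analogue of param_rdd.cartesian(cv_rdd): every (params, split) pair.
pairs = list(product(demo_params, demo_splits))

# Local analogue of rdd.map(lambda tup: ...): tup[0] is the params, tup[1] the split.
labels = [f"C={tup[0]['C']} on {tup[1]}" for tup in pairs]

print(len(pairs))  # 6
print(labels[0])   # C=1.0 on split-0
```

Because `param_rdd` is on the left of `cartesian`, each tuple carries the parameters first and the split second, which is why the Spark cell passes `tup[0]` and `tup[1]` to `evaluate_one` in that order. The order in which a cluster returns collected results may differ from this local ordering.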

## Concluding thoughts

1.  Scaling out lets you evaluate more parameter samples and CV splits, which gives more precise insight into sampling problems like this one.
2.  The lessons you learned in the last section carry over from your laptop to cluster computing.
3.  Visual diagnostic dashboards help you see what is happening on your cluster.