Ad-hoc computations with Futures
------------------------------------

Some computations are more complex than an embarrassingly parallel map over a linear collection.  We might call several different functions, we might iterate over multiple collections, or we might conditionally run computations based on the values of the data.

In this section we look at the asynchronous `Future` interface, which provides a simple API for ad-hoc parallelism.
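As a rough illustration of that interface, the sketch below uses Python's standard-library `concurrent.futures`. That choice is an assumption made for the example: it is one of several frameworks that expose futures, and this section does not prescribe it. The helpers `inc` and `double` are made up. `submit` returns a `Future` immediately and `result()` blocks until the value is ready, so we can decide what to run next based on earlier results.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    # Hypothetical helper, for illustration only
    return x + 1


def double(x):
    # Hypothetical helper, for illustration only
    return 2 * x


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit schedules the call and returns a Future right away
    futures = [executor.submit(inc, i) for i in range(5)]

    follow_ups = []
    for future in futures:
        value = future.result()  # Block until this result is available
        if value % 2 == 0:       # Conditionally launch more work
            follow_ups.append(executor.submit(double, value))

    doubled = [f.result() for f in follow_ups]
```

Only the even intermediate values trigger a second task, which is the kind of data-dependent control flow that a plain map does not express.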

### Objectives

*  Play with parallel computing frameworks to parallelize a machine learning workload

### Requirements

*  scikit-learn
*  A parallel computing framework of your choice


### Application

We search over many parameter settings for a support vector classifier trained on handwritten digits, using cross validation to avoid over-fitting.  Because we loop over both the data splits and the parameter settings, this is slightly more complex than a single map, so we use `submit`.
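The `submit` pattern for this kind of nested loop can be sketched on a toy problem. Everything here is a stand-in: `toy_score`, `param_list`, and `split_ids` are hypothetical names, and the use of `concurrent.futures` is an assumption rather than the framework this section requires.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_score(params, split_id):
    # Hypothetical stand-in for an expensive train-and-score step
    return split_id, params, params['a'] * params['b'] + split_id


param_list = [{'a': a, 'b': b} for a in (1, 2) for b in (10, 20)]
split_ids = [0, 1]

with ThreadPoolExecutor(max_workers=4) as executor:
    # One task per (split, params) pair
    futures = [executor.submit(toy_score, params, split_id)
               for split_id in split_ids
               for params in param_list]

    # Collect results as they finish, in whatever order that happens
    toy_results = [future.result() for future in as_completed(futures)]
```

A thread pool keeps the sketch short. For CPU-bound model training, a process pool or a distributed scheduler may be a better fit.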

As before we start with a sequential solution.

### Imports

In [None]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import ParameterSampler  # sklearn.grid_search was removed in scikit-learn 0.20
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

from cv_params_demo import load_cv_split, evaluate_one  # Functions we care about

### Data

In [None]:
digits = load_digits()  # Collect Data

plt.imshow(digits.data[0].reshape(8, 8),  # Example element
           interpolation='nearest', cmap='gray');

### Parameters

In [None]:
param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}

param_space = ParameterSampler(param_grid, 10)

list(param_space)

### Split data for cross-validation

In [None]:
from cv_params_demo import load_cv_split

cv_splits = [load_cv_split(i) for i in range(2)]
idx, (x_train, x_test, y_train, y_test) = cv_splits[0]
x_train, y_train

### Sequential cross validated parameter search

In [None]:
%%time

results = []

for split in cv_splits:
    for params in param_space:
        result = evaluate_one(SVC, params, split)
        results.append(result)

### Plot results

Which regions of parameter space score well?  Can we tell from the results we've computed?  

Searching over more parameters would help to improve the intuition we can gain here.

In [None]:
from cv_params_demo import plot_results

plot_results(results)

### Exercise: Parallel cross validated parameter search

Try using some of the techniques we've used before (or other techniques altogether) to parallelize the above computation.  

Afterwards, increase the number of sampled parameters to help improve our understanding of the plot.

In [None]:
cv_splits = [load_cv_split(i) for i in range(2)]  # Increase the number 2 once the computation runs in parallel
param_space = ParameterSampler(param_grid, 10)    # Increase the number 10 once the computation runs in parallel

In [None]:
# TODO: compute results in parallel

In [None]:
plot_results(results)