## Hyperparameter Optimization

Find the best hyperparameters (parameters that govern the modelling, such as `alpha`). How `GridSearchCV` works:

> Exhaustive search over specified parameter values for an estimator.

- build an estimator for every combination of parameter values (cuml)
- split the data into train/test folds for cross-validation (cudf)
- fit each estimator and record its score on each fold (cuml)
- return the best (highest) mean score along with the estimator that uses the best parameters
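
The steps above can be sketched by hand in a few lines of NumPy. This is a minimal illustration, not how scikit-learn or cuML implement the search: the helpers `ridge_fit`, `r2_score`, and `manual_grid_search` are hypothetical names, the data is synthetic, and the folds are assumed to be contiguous and unshuffled.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept (illustrative helper)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - X_mean @ coef

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def manual_grid_search(X, y, alphas, n_folds=5):
    """Score every alpha on every fold; return (best_alpha, mean_scores)."""
    folds = np.array_split(np.arange(len(y)), n_folds)  # contiguous, unshuffled folds
    mean_scores = {}
    for alpha in alphas:
        fold_scores = []
        for test_idx in folds:
            train_mask = np.ones(len(y), dtype=bool)
            train_mask[test_idx] = False  # hold this fold out for scoring
            coef, intercept = ridge_fit(X[train_mask], y[train_mask], alpha)
            y_pred = X[test_idx] @ coef + intercept
            fold_scores.append(r2_score(y[test_idx], y_pred))
        mean_scores[float(alpha)] = float(np.mean(fold_scores))
    best_alpha = max(mean_scores, key=mean_scores.get)  # highest mean score wins
    return best_alpha, mean_scores

# Synthetic regression data (an assumption for illustration, not the diabetes set)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

alphas = np.logspace(-3, -1, 10)
best_alpha, mean_scores = manual_grid_search(X, y, alphas)
print(best_alpha, mean_scores[best_alpha])
```

`GridSearchCV` wraps the same loop behind a `fit` call and additionally refits the winning estimator on the full data by default.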


## Gridsearch with cuML

In [1]:
import numpy as np
from cuml import Ridge as cumlRidge
import cudf
from sklearn import datasets, linear_model
# sklearn.externals.joblib is deprecated and removed in newer scikit-learn; import joblib directly
from joblib import parallel_backend
from sklearn.model_selection import train_test_split, GridSearchCV
import dask_ml.model_selection as dcv
import sklearn

In [2]:
%load_ext snakeviz

## Load Diabetes Data

In [3]:
diabetes = datasets.load_diabetes()

In [4]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [5]:
# row of data
diabetes.data[0]

array([ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
       -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613])

## Fit Data with Ridge Regression

In [6]:
# Split the data into training/testing sets
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)

In [7]:
# size of the training data in MB
X_train.nbytes/1e6

0.02824

In [8]:
fit_intercept = True
normalize = False
alpha = np.array([1.0]) 

In [9]:
ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

In [10]:
%%timeit
ridge.fit(X_train, y_train)

436 µs ± 2.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [11]:
%%timeit
cu_ridge.fit(X_train, y_train)

4.86 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Verify Output

In [12]:
np.testing.assert_allclose(cu_ridge.coef_.to_array(), ridge.coef_)
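
As a hedged aside on what this check enforces: `np.testing.assert_allclose` compares element-wise with a default relative tolerance of `rtol=1e-7`, so the GPU and CPU coefficients must agree to roughly seven significant digits. The arrays below are made-up values for illustration only.

```python
import numpy as np

reference = np.array([1.0, -2.5, 3.25])
close = reference * (1 + 1e-9)  # relative error well under the default rtol=1e-7
far = reference * (1 + 1e-3)    # relative error far above the tolerance

np.testing.assert_allclose(close, reference)  # passes silently

try:
    np.testing.assert_allclose(far, reference)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)
```

If two solvers legitimately differ by more than this, a looser `rtol` can be passed explicitly.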

## Increase Data Size

In [13]:
dup_data = np.array(np.vstack([X_train]*int(1e5)))
dup_train = np.array(np.hstack([y_train]*int(1e5)))
print(f'Duplicated data in memory: {dup_data.nbytes / 1e6} MB')

Duplicated data in memory: 2824.0 MB
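
The reported sizes follow directly from the array shapes. The sketch below reproduces them, assuming the 442-row diabetes dataset, `float64` values, and that `train_test_split` rounds the test share up (which matches the 0.02824 MB shown earlier).

```python
import math

n_samples = 442        # rows in the diabetes dataset
n_features = 10
test_size = 0.2
bytes_per_value = 8    # float64
n_copies = int(1e5)    # duplication factor used above

n_test = math.ceil(n_samples * test_size)  # 89 rows held out
n_train = n_samples - n_test               # 353 rows for training

train_bytes = n_train * n_features * bytes_per_value
train_mb = train_bytes / 1e6
dup_mb = train_bytes * n_copies / 1e6

print(n_train, train_mb, dup_mb)
```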


In [14]:
dup_ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
dup_cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

## Load Data onto GPU

In [15]:
%%time
record_data = (('fea%d'%i, dup_data[:,i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

CPU times: user 3.06 s, sys: 1.08 s, total: 4.14 s
Wall time: 4.14 s


In [None]:
%%timeit
dup_ridge.fit(dup_data, dup_train)

In [None]:
%%timeit
dup_cu_ridge.fit(gdf_data, gdf_train.train)

## Verify Output

In [16]:
dup_ridge.fit(dup_data, dup_train)
dup_cu_ridge.fit(gdf_data, gdf_train.train)
np.testing.assert_allclose(dup_cu_ridge.coef_.to_array(), dup_ridge.coef_)

In [17]:
params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")
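
To get a feel for the cost of the search that follows, the number of model fits can be counted up front. This sketch assumes `cv=5` and the default `refit=True`, so the grid costs one fit per candidate per fold plus one final refit on the full data.

```python
import numpy as np

params = {'alpha': np.logspace(-3, -1, 10)}  # 10 values from 1e-3 to 1e-1
cv = 5

# Size of the Cartesian product over all parameter lists
n_candidates = 1
for values in params.values():
    n_candidates *= len(values)

n_cv_fits = n_candidates * cv  # one fit per candidate per fold
n_total_fits = n_cv_fits + 1   # plus the final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)
```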

In [20]:
%%time
sk_grid = GridSearchCV(clf, params, cv=5, iid=False)
sk_grid.fit(dup_data, dup_train)

CPU times: user 25min 32s, sys: 20min 10s, total: 45min 43s
Wall time: 6min 1s


In [21]:
%%time
sk_cu_grid = GridSearchCV(cu_clf, params, cv=5, iid=False)
sk_cu_grid.fit(gdf_data, gdf_train.train)

CPU times: user 2min 46s, sys: 1min 32s, total: 4min 18s
Wall time: 4min 18s
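
For a rough comparison of the two runs, the wall times above can be converted to seconds. The helper `to_seconds` is a hypothetical name, and the numbers are single measurements from the cells above, so the ratio is indicative only.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time to seconds."""
    return 60 * minutes + seconds

sk_wall = to_seconds(6, 1)   # scikit-learn Ridge grid search: 6min 1s
cu_wall = to_seconds(4, 18)  # cuML Ridge grid search: 4min 18s

speedup = sk_wall / cu_wall
print(f"{speedup:.2f}x")
```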


In [19]:
%%snakeviz
sk_grid = GridSearchCV(clf, params, cv=5, iid=False)
sk_grid.fit(dup_data, dup_train)

 
*** Profile stats marshalled to file '/tmp/tmphm7a7w_i'. 
Embedding SnakeViz in the notebook...


In [18]:
%%snakeviz
sk_cu_grid = GridSearchCV(cu_clf, params, cv=5, iid=False)
sk_cu_grid.fit(gdf_data, gdf_train.train)

 
*** Profile stats marshalled to file '/tmp/tmp2f4va94n'. 
Embedding SnakeViz in the notebook...


In [6]:
sklearn.utils.safe_indexing??

In [25]:
gdf_data.iloc

## Swap Sklearn Gridsearch with DaskML Gridsearch

In [None]:
_ = client.profile(start=start, filename='dask-cuml-gridsearchcv-profile.html')

In [None]:
%%time
cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2', cv=5)
cu_grid.fit(three_dup_data, three_dup_train)

In [None]:
%%time
grid = dcv.GridSearchCV(clf, params, scoring='r2', cv=5)
grid.fit(three_dup_data, three_dup_train)

In [None]:
%%time
with parallel_backend('dask', scatter=[dup_data, dup_train]):
    cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
    cu_grid.fit(dup_data, dup_train)