## Distributed ML

Randomized search is a common technique for finding the best set of parameters for a machine learning algorithm. The process is slow, especially when the algorithm has many parameters to tune: it iterates several times, each time with a different combination of parameters, to find the best one. Each combination is trained and evaluated 3 to 5 times (the cross-validation folds) to obtain more reliable empirical evidence of the results.
Fortunately, although the process is slow, we can use a cluster to speed it up. Let's see how to do that.
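
To make the procedure concrete before moving to the cluster, here is a minimal pure-Python sketch of the idea. It is not how scikit-learn implements it: `random_search`, `toy_score` and `toy_grid` are hypothetical names introduced for illustration, and the scoring function simply stands in for "train a model and measure its accuracy".

```python
import random
from itertools import product


def random_search(param_grid, score_fn, n_iter=5, n_repeats=3, seed=0):
    """Toy randomized search: sample combinations, average repeated scores, keep the best."""
    rng = random.Random(seed)
    keys = sorted(param_grid)
    # Enumerate the full grid, then draw n_iter distinct combinations from it.
    all_combos = [
        dict(zip(keys, values))
        for values in product(*(param_grid[k] for k in keys))
    ]
    candidates = rng.sample(all_combos, min(n_iter, len(all_combos)))

    best_params, best_score = None, float("-inf")
    for params in candidates:
        # Each candidate is scored several times (the folds, in the real tool).
        scores = [score_fn(params, repeat) for repeat in range(n_repeats)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score, candidates


def toy_score(params, repeat):
    """Hypothetical stand-in for training a model and returning its accuracy."""
    penalty = 0.1 if params["solver"] == "sgd" else 0.0
    return 1.0 - abs(params["alpha"] - 0.05) - 0.01 * repeat - penalty


toy_grid = {"alpha": [0.0001, 0.05], "solver": ["sgd", "adam"]}
best_params, best_score, candidates = random_search(toy_grid, toy_score, n_iter=3)
print(best_params, round(best_score, 3))
```

Every evaluation inside the loop is independent of the others, which is why the work can be spread across the workers of a cluster.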

First, we import all the necessary libraries

In [1]:
from sklearn.datasets import make_classification
from joblib import parallel_backend  # sklearn.externals.joblib was removed in scikit-learn 0.23
from dask.distributed import Client
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV



Let's create a dummy dataset composed of 1000 examples

In [2]:
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

Connect to the cluster

In [3]:
client = Client('192.168.1.12:8786')

Define the set of possible parameters

In [4]:
parameters = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,200), (100,100,100)],
    'activation': ['tanh', 'relu','logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
    'max_iter':[50, 100,200,1000, 2000]
}
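
To see why sampling is attractive here, the short calculation below counts how many combinations this parameter set contains and compares an exhaustive search with the randomized one. It is only a sketch: `n_iter=15` and `cv=3` are the values passed to `RandomizedSearchCV` in this section, and the other variable names are chosen for illustration.

```python
from math import prod

# Same search space as the `parameters` dictionary defined in In [4]
parameters = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100, 200), (100, 100, 100)],
    'activation': ['tanh', 'relu', 'logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
    'max_iter': [50, 100, 200, 1000, 2000],
}

n_iter, cv = 15, 3

# Number of distinct combinations in the full grid: 4 * 3 * 2 * 2 * 2 * 5
n_combinations = prod(len(values) for values in parameters.values())
exhaustive_fits = n_combinations * cv   # fits needed to try every combination
randomized_fits = n_iter * cv           # fits needed by the randomized search

print(n_combinations)   # 480
print(exhaustive_fits)  # 1440
print(randomized_fits)  # 45
print('Fraction of the grid explored: {:.1%}'.format(n_iter / n_combinations))
```

The randomized search therefore explores only about 3% of the grid, trading completeness for a much shorter run.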

Define the search with 15 iterations and 3 trainings (cross-validation folds) for each iteration

In [5]:


with parallel_backend('dask'):
    random_search = RandomizedSearchCV(
                MLPClassifier(),
                param_distributions=parameters,
                n_iter=15,
                cv=3,
                n_jobs=-1,
                verbose=1
            )
    random_search.fit(X, y)
    # Report the best mean cross-validation score and the parameters that produced it
    print('Best score obtained: {0}'.format(random_search.best_score_))
    print('Parameters:')
    for param, value in random_search.best_params_.items():
        print('\t{}: {}'.format(param, value))

Fitting 3 folds for each of 15 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 3 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:   57.0s finished


Best score obtained: 0.8730047412682144
Parameters:
	solver: adam
	max_iter: 200
	learning_rate: adaptive
	hidden_layer_sizes: (100, 200)
	alpha: 0.05
	activation: tanh




### Exercise 1

Try changing the parameters of the neural network yourself. Check all the available parameters in the [MLPClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) and observe how the training on the cluster changes.

### Exercise 2

Observe how the execution speed changes when you change the cluster configuration. Stop the scheduler and the workers, then restart both and see how the execution time decreases as you change the ```--nprocs numofworkers``` parameter of the ```dask-worker``` command (recent Dask releases name this option ```--nworkers```).
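
To compare configurations, it helps to record the elapsed time of each run in the same way. The helper below is one possible sketch using only the standard library. `timed` and `timings` are hypothetical names, and the `time.sleep` call is a placeholder for the real `random_search.fit(X, y)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workload: replace the sleep with random_search.fit(X, y),
# run inside `with parallel_backend('dask'):` as in In [5].
with timed('1 worker process', timings):
    time.sleep(0.05)

for label, seconds in timings.items():
    print('{}: {:.2f}s'.format(label, seconds))
```

Repeating the measurement after each restart of the workers, with a different label each time, gives a small table of timings to compare.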