# Train and hyperparameter tune with Scikit-learn

## Prerequisites

* Install the Azure Machine Learning Python SDK and create an Azure ML Workspace

In [None]:
#check core SDK version
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize workspace

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [None]:
from azureml.core.workspace import Workspace

# if a locally-saved configuration file for the workspace is not available, use the following to load workspace
# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)

ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location,
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep = '\n')

# the default datastore can only be retrieved after the workspace object exists
datastore = ws.get_default_datastore()
print("Default datastore's name: {}".format(datastore.name))
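
For reference, `Workspace.from_config()` expects a `config.json` file holding the subscription ID, resource group, and workspace name. The following is a minimal sketch of writing such a file using only the standard library. The helper `write_workspace_config` and the placeholder values are illustrative assumptions, not part of the SDK, and the file is written to a temporary directory here rather than to your project.

```python
import json
import os
import tempfile

# Placeholder values (assumptions): replace them with your own workspace details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

def write_workspace_config(directory, config):
    """Hypothetical helper: write config.json into `directory` and return its path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "config.json")
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return path

# Write to a temporary directory so that this sketch does not touch your project files.
config_path = write_workspace_config(tempfile.mkdtemp(), workspace_config)
with open(config_path) as f:
    print(f.read())
```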

## Create AmlCompute

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

If a cluster with the given name does not already exist in the workspace, the code below creates a new `AmlCompute` cluster of `STANDARD_D2_V2` CPU VMs.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

#choose a name for your cluster
cpu_cluster_name = "cpu-cluster"

if cpu_cluster_name in ws.compute_targets:
    cpu_cluster = ws.compute_targets[cpu_cluster_name]
    if cpu_cluster and type(cpu_cluster) is AmlCompute:
        print('Found compute target. Will use {0} '.format(cpu_cluster_name))
else:
    print("creating new cluster")

    provisioning_config = AmlCompute.provisioning_configuration(vm_size = 'STANDARD_D2_V2', max_nodes = 1)

    #create the cluster
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)
    
    #can poll for a minimum number of nodes and for a specific timeout. 
    #if no min node count is provided it uses the scale settings for the cluster
    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
#use get_status() to get a detailed status for the current cluster. 
print(cpu_cluster.get_status().serialize())

## Train model on the remote compute

Now that you have your data and training script prepared, you are ready to train on your remote compute.

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [None]:
import os

project_folder = './train_sklearn'
os.makedirs(project_folder, exist_ok=True)

### Prepare training script

Now you will need to create your training script. We log the kernel and C parameters, and the highest accuracy the model achieves:

```python
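# NumPy has no `np.string`, and `np.float` was removed in NumPy 1.24,
# so the built-in str and float conversions are used here
run.log('Kernel type', str(args.kernel))
run.log('Regularization', float(args.C))

run.log('Accuracy', float(accuracy))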
```

These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.
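
The contents of `train_sklearn.py` are not reproduced in this notebook. As a rough sketch, a script that accepts the `--kernel` and `--C` arguments could parse them with `argparse` as shown below. The default values and help strings are assumptions made for illustration, not the actual script.

```python
import argparse

def parse_args(argv=None):
    """Parse the two hyperparameters that the training script receives on the command line."""
    parser = argparse.ArgumentParser(description="Parse training hyperparameters")
    # Default values are assumptions made for this sketch.
    parser.add_argument("--kernel", type=str, default="linear",
                        help="kernel type used by the model")
    parser.add_argument("--C", type=float, default=1.0,
                        help="regularization parameter")
    return parser.parse_args(argv)

# Simulate the arguments that the estimator would pass to the script.
args = parse_args(["--kernel", "rbf", "--C", "1.0"])
print(args.kernel, args.C)
```

Because `argparse` derives the attribute name from the flag, `--C` is available as `args.C`, which matches the logging calls above.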

Once your script is ready, copy the training script `train_sklearn.py` into your project directory.

In [None]:
import shutil

shutil.copy('train_sklearn.py', project_folder)

### Create an experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace.

In [None]:
from azureml.core import Experiment

experiment_name = 'train_sklearn'
experiment = Experiment(ws, name=experiment_name)

### Create a Scikit-learn estimator

In [None]:
from azureml.train.sklearn import SKLearn

script_params = {
    '--kernel': 'rbf',
    '--C': 1.0,
}

estimator = SKLearn(source_directory=project_folder, 
                    script_params=script_params,
                    compute_target=cpu_cluster,
                    entry_script='train_sklearn.py',
                    pip_packages=['joblib==0.13.2'])

The `script_params` parameter is a dictionary containing the command-line arguments to your training script `entry_script`.
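
To make the correspondence concrete, the sketch below flattens such a dictionary into the argument list that a script would see on its command line. It only illustrates the mapping and is not the SDK's internal implementation. The helper `to_argv` and the variable `example_params` are hypothetical names.

```python
def to_argv(params):
    """Hypothetical helper: flatten {flag: value} pairs into a command-line argument list."""
    argv = []
    for flag, value in params.items():
        # Command-line arguments are always strings, so values are converted.
        argv.extend([flag, str(value)])
    return argv

example_params = {'--kernel': 'rbf', '--C': 1.0}
print(to_argv(example_params))  # ['--kernel', 'rbf', '--C', '1.0']
```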

### Submit job

Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [None]:
run = experiment.submit(estimator)

## Monitor your run

Monitor the progress of the run with a Jupyter widget. The widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

In [None]:
# cancels the submitted run; skip this cell if you want the run to finish
run.cancel()

## Tune model hyperparameters

We can optimize our model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities.

### Start a hyperparameter sweep

First, we will define the hyperparameter space to sweep over. Let's tune the `kernel` and `C` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, `Accuracy`.

In [None]:
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice, loguniform

param_sampling = RandomParameterSampling( {
    "--kernel": choice('rbf', 'sigmoid'),
    "--C": loguniform(-2.0, 2.0)
    }
)   

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling, 
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=4,
                                         max_concurrent_runs=1)
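
To get a feel for the search space defined above, the following standard-library sketch mimics what random sampling over it produces. It assumes that `loguniform(low, high)` draws `exp(uniform(low, high))`, so `C` falls between roughly 0.135 and 7.39. The function `sample_configuration` is a hypothetical stand-in, not the HyperDrive implementation.

```python
import math
import random

def sample_configuration(rng):
    """Draw one hyperparameter set from a space shaped like the one above."""
    return {
        # choice(...): pick one of the listed values uniformly at random
        "--kernel": rng.choice(["rbf", "sigmoid"]),
        # loguniform(-2.0, 2.0): exponentiate a uniform draw, so log(C) is uniform
        "--C": math.exp(rng.uniform(-2.0, 2.0)),
    }

rng = random.Random(0)  # seeded so that the sketch is repeatable
samples = [sample_configuration(rng) for _ in range(4)]  # mirrors max_total_runs=4
for sample in samples:
    print(sample)
```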

In [None]:
#HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_run_config)

## Monitor HyperDrive runs

In [None]:
RunDetails(hyperdrive_run).show()

In [None]:
hyperdrive_run.wait_for_completion(show_output=True)

### Find and register best model

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
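
Conceptually, picking the best run means comparing the primary metric across the child runs. The sketch below shows that selection on plain dictionaries. The run records and metric values are made up for illustration, and `best_by_metric` is a hypothetical helper rather than an SDK function.

```python
# Hypothetical run records: the ids and metric values are invented for this sketch.
example_runs = [
    {"id": "run_0", "metrics": {"Accuracy": 0.91}},
    {"id": "run_1", "metrics": {"Accuracy": 0.94}},
    {"id": "run_2", "metrics": {}},  # a run that never logged the metric
    {"id": "run_3", "metrics": {"Accuracy": 0.89}},
]

def best_by_metric(runs, metric_name, maximize=True):
    """Return the run with the best `metric_name`, ignoring runs that never logged it."""
    scored = [r for r in runs if metric_name in r["metrics"]]
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda r: r["metrics"][metric_name])

best = best_by_metric(example_runs, "Accuracy")
print(best["id"])  # run_1
```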

List the model files uploaded during the run:

In [None]:
print(best_run.get_file_names())

Register the model file `outputs/model.joblib` from the best run as a model named `train-sklearn` under the workspace for deployment. The call below is commented out; uncomment it to perform the registration.

In [None]:
#model = best_run.register_model(model_name='train-sklearn', model_path='outputs/model.joblib')