# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [None]:
import azureml.core
from azureml.core import Workspace, Experiment, Model, ScriptRunConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice, Webservice

import os
import shutil
import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

print("SDK version:", azureml.core.VERSION)

## Initializing a Workspace

In [4]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

NameError: name 'Workspace' is not defined

## Creating a HyperDrive Experiment

In [None]:
experiment_name = 'hyperdrive-heart-failure'
experiment = Experiment(ws, experiment_name)

## Creating a Compute Cluster

In [None]:
cluster_name = "hd-cpu-cluster"

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Otherwise provision a new CPU cluster with up to 4 nodes
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)


print(compute_target.get_status().serialize())

## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

## Data Wrangling

## Data Gathering

### Overview
In this project, the Heart Failure Prediction dataset from Kaggle is used. The description of the dataset is provided below.

**The dataset has the following features:**  
* age: Age (numeric)
* anaemia: Decrease of red blood cells or hemoglobin (boolean)
* creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
* diabetes: If the patient has diabetes (boolean)
* ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)
* high_blood_pressure: If the patient has hypertension (boolean)
* platelets: Platelets in the blood (kiloplatelets/mL)
* serum_creatinine: Level of serum creatinine in the blood (mg/dL)
* serum_sodium: Level of serum sodium in the blood (mEq/L)
* sex: Woman or man (binary)
* smoking: If the patient smokes (boolean)
* time: Follow-up period (days)

**The task:**  
Develop an ML model to predict death events using 12 clinical features.

## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

The following two hyperparameters are tuned:  
 
**C:** The inverse of the regularization strength. Regularization refers to the idea that extreme parameter values should incur a complexity penalty, because fitting the training data without regard to how extreme the parameters become leads to overfitting. A high value of C tells the model to give high weight to the training data and a low weight to the complexity penalty. A low value tells the model to give more weight to the complexity penalty at the expense of fitting the training data. In short, a high C means "Trust this training data a lot", while a low C means "This data may not be fully representative of real-world data, so if it's telling you to make a parameter really large, don't listen to it".

Reference: https://stackoverflow.com/questions/67513075/what-is-c-parameter-in-sklearn-logistic-regression

**max_iter**: The maximum number of iterations the solver is allowed to run before it stops.
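
The effect of C described above can be illustrated with a minimal sketch. It does not use scikit-learn or the heart failure data: it assumes a made-up one-dimensional toy dataset and a plain gradient-descent loop (the names `X`, `Y` and `fit_weight` are hypothetical), and it minimizes an L2-penalized log-loss in which the penalty is divided by C. The only point is to show that a smaller C shrinks the learned weight toward zero.

```python
import math

# Made-up, non-separable 1-D toy data (assumption: not the heart failure dataset).
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [0, 0, 1, 0, 1, 1]


def fit_weight(C, steps=5000):
    """Return the fitted slope of a 1-D logistic regression with an L2 penalty.

    Minimizes sum(log-loss) + w**2 / (2 * C), which has the same minimizer as
    0.5 * w**2 + C * sum(log-loss). The intercept b is not penalized.
    """
    w, b = 0.0, 0.0
    # Step size 1/L, where L is an upper bound on the curvature of the objective.
    lr = 1.0 / (sum(x * x + 1.0 for x in X) / 4.0 + 1.0 / C)
    for _ in range(steps):
        grad_w, grad_b = w / C, 0.0  # gradient of the penalty term
        for x, y in zip(X, Y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w
        b -= lr * grad_b
    return w


# Fit the same data with a small, a medium and a large C.
weights = {C: fit_weight(C) for C in (0.01, 1.0, 100.0)}
for C, w in weights.items():
    print(f"C={C:>6}: w={w:.3f}")
```

With a small C the penalty dominates and the weight stays close to zero; with a large C the weight approaches the unregularized fit.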

In [None]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2,
                                        slack_factor=0.1)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling(
    {
        '--C': choice(0.01, 0.1, 1.0, 10.0, 100.0),
        '--max_iter': choice(5, 10, 20, 50, 150)
    }
)


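# Create the training folder if it does not exist and copy the training script into it
script_folder = './training'
os.makedirs(script_folder, exist_ok=True)
shutil.copy('./train.py', script_folder)

# Assumption: the training environment is built from a conda specification file.
# 'conda_dependencies.yml' is a placeholder file name, not taken from the original notebook.
env = Environment.from_conda_specification(name='hyperdrive-env',
                                           file_path='./conda_dependencies.yml')

#TODO: Create your estimator and hyperdrive config
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=env
                      )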


hyperdrive_run_config = HyperDriveConfig(hyperparameter_sampling=param_sampling,
                                         primary_metric_name='AUC_weighted',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         policy=early_termination_policy,
                                         run_config=src,
                                         max_concurrent_runs=3,
                                         max_total_runs=15)
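
To make the early termination settings more concrete, the following sketch mimics how a Bandit policy with a slack factor decides which runs to stop when the goal is to maximize the primary metric: at an evaluation interval, a run is cancelled if its metric falls below the best metric divided by `1 + slack_factor`. This is a simplified stand-alone illustration, not the HyperDrive implementation; the helper names and the run metrics are hypothetical.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Smallest metric value a run may report without being cancelled (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def runs_to_cancel(metrics_by_run, slack_factor):
    """Return the names of runs whose metric falls below the Bandit cutoff."""
    best = max(metrics_by_run.values())
    cutoff = bandit_cutoff(best, slack_factor)
    return sorted(run for run, metric in metrics_by_run.items() if metric < cutoff)


# Hypothetical AUC_weighted values reported by three runs at one evaluation interval.
interval_metrics = {"run_a": 0.88, "run_b": 0.81, "run_c": 0.79}

cancelled = runs_to_cancel(interval_metrics, slack_factor=0.1)
print("cutoff:", round(bandit_cutoff(0.88, 0.1), 3))
print("cancelled:", cancelled)
```

With `slack_factor=0.1`, a run has to stay within roughly 91% of the best run's metric to keep going, so a larger slack factor makes the policy more tolerant of lagging runs.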


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [1]:
#TODO: Submit your experiment
hdr = experiment.submit(config=hyperdrive_run_config)

# Monitor the experiment and block until all child runs have finished
RunDetails(hdr).show()
hdr.wait_for_completion(show_output=True)

NameError: name 'RunDetails' is not defined

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [None]:
best_run = hdr.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print("Best Run Id: {}".format(best_run.id), 
      "AUC Weighted: {}".format(best_run_metrics['AUC_weighted']), 
      "Best metrics: {}".format(best_run_metrics), sep = '\n')

In [None]:
#TODO: Save the best model
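# Register everything under the run's outputs folder as the model
best_run.register_model(model_name="hyperdrive_best_run.pkl", model_path='./outputs/')
print(best_run)

# Download the serialized model file from the best run
best_run.download_file(name='./outputs/hyper-model.pkl')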

In [None]:
# Cleaning up the allocated resources
#compute_target.delete()

## Model Deployment

The best-performing model was produced by AutoML, so the model trained with HyperDrive is not deployed.

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.

