# Hyperparameter Tuning using HyperDrive

In this capstone project I showcase how data science can be used as an investigation tool: a classification algorithm distinguishes normal traffic (good connections) from intrusion or attack traffic (bad connections). A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. The goal is to build an Intrusion Detection System (IDS).

### Data description
Data is collected by packet analyzers (also known as packet, network, or protocol sniffers), which intercept and log traffic in the network. The dataset used here is NSL-KDD, a refined version of the original 1999 KDD Cup dataset. The KDD Cup dataset was created for the DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Laboratory. The data was collected over nine weeks and consists of raw tcpdump traffic in a local area network (LAN) that simulates the environment of a typical United States Air Force LAN. Some network attacks were deliberately carried out during the recording period. There were 38 different types of attacks, but only 24 are available in the training set.
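
To make the normal-versus-attack framing concrete, here is a minimal sketch of how the attack labels could be collapsed into a binary target. It assumes the usual NSL-KDD file layout, in which the label and a difficulty score are the last two columns. The sample rows are synthetic and heavily truncated, and `binarize` is a hypothetical helper, not part of this project.

```python
import csv
import io

# Synthetic, truncated rows for illustration only. Real NSL-KDD records carry
# 41 features before the label. Assumed layout: features..., label, difficulty.
SAMPLE = """0,tcp,http,SF,normal,21
0,tcp,private,S0,neptune,19
0,icmp,ecr_i,SF,smurf,18
"""


def binarize(rows):
    """Return (features, target) pairs with target 0 for 'normal' and 1 for any attack."""
    out = []
    for row in rows:
        # Split off the last two columns: the label and the difficulty score.
        *features, label, _difficulty = row
        out.append((features, 0 if label == "normal" else 1))
    return out


records = binarize(csv.reader(io.StringIO(SAMPLE)))
labels = [target for _, target in records]
print(labels)
```

Every attack type, whether or not it appears in the training set, maps to the same positive class, so the classifier only has to separate good connections from bad ones.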

#### Data references

https://www.unb.ca/cic/datasets/nsl.html     
https://www.kaggle.com/hassan06/nslkdd

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the configuration before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
# Import Python standard libraries
import os


# Import Python data science libraries
import numpy as np
import pandas as pd

# Import Azure ML libraries
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

ModuleNotFoundError: No module named 'azureml.train'

## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.

In [None]:
from azureml.core import Workspace, Experiment

vrk_ids_ws = Workspace.from_config()
vrk_ids_exp = Experiment(workspace=vrk_ids_ws, name="vrk_ids_train_exp")

print('Workspace name: ' + vrk_ids_ws.name, 
      'Azure region: ' + vrk_ids_ws.location, 
      'Subscription id: ' + vrk_ids_ws.subscription_id, 
      'Resource group: ' + vrk_ids_ws.resource_group, sep = '\n')

# Start an interactive run on the experiment created above
run = vrk_ids_exp.start_logging()

### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your run. 

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=vrk_ids_ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(vrk_ids_ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

## Dataset

In [None]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - azureml-defaults

In [None]:
from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='./conda_dependencies.yml')

## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.
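
The configuration cell below uses a `BanditPolicy` with `slack_factor=0.1`. As a rough illustration of what that setting means when the primary metric is maximized, a run is cancelled once its metric falls below the best metric seen so far divided by `1 + slack_factor`. The helper names here are hypothetical, and the sketch ignores `evaluation_interval` and `delay_evaluation`.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being cancelled (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """True when the run trails the best run by more than the allowed slack."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


# Illustrative values, not results from this experiment.
best_accuracy = 0.90
print(round(bandit_cutoff(best_accuracy, 0.1), 4))
print(should_terminate(0.80, best_accuracy))
print(should_terminate(0.85, best_accuracy))
```

With a best accuracy of 0.90, the cutoff is about 0.818, so a run reporting 0.80 would be stopped and a run reporting 0.85 would continue.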

In [None]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import GridParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform,choice
from azureml.core import ScriptRunConfig
import os

# Specify parameter sampler
param_sampling_decision_tree = GridParameterSampling( {
        "--criterion": choice('gini', 'entropy'),
        "--max_depth": choice(20, None)
    }
)

# Specify a Policy
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

# Create the training folder if it does not exist yet
os.makedirs("./training", exist_ok=True)

# Create a ScriptRunConfig that runs the training script on the CPU cluster
ids_script = ScriptRunConfig(source_directory='.',
                      script='NetworkdataClassifier',
                      compute_target=cpu_cluster,
                      environment=sklearn_env)

# Create a HyperDriveConfig using the script run config, hyperparameter sampler, and policy.
ids_hyperdrive_config = HyperDriveConfig(run_config=ids_script,
                                     hyperparameter_sampling=param_sampling_decision_tree,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=12,
                                     max_concurrent_runs=4)
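
Grid sampling tries every combination of the discrete choices. The sketch below enumerates the same search space in plain Python, without calling Azure ML, to show how many distinct configurations it contains.

```python
from itertools import product

# Same discrete choices as the GridParameterSampling definition above.
search_space = {
    "--criterion": ["gini", "entropy"],
    "--max_depth": [20, None],
}

# The Cartesian product of all choices gives one dict per child-run configuration.
names = list(search_space)
grid = [dict(zip(names, combo)) for combo in product(*search_space.values())]

for config in grid:
    print(config)
print("Total combinations:", len(grid))
```

The space holds only four combinations, so `max_total_runs=12` acts as an upper bound; presumably only four child runs are launched.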

In [None]:
# Submit the HyperDrive run to the experiment.
ids_hyperdrive_run = vrk_ids_exp.submit(ids_hyperdrive_config)
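
Each child run invokes the training script with the sampled values passed as command-line arguments named after the sampler keys. The training script itself is not shown in this notebook, so the following is only a sketch of how it might parse those arguments. It assumes the unlimited-depth choice arrives as the string `"None"`; `parse_depth` and `parse_args` are hypothetical helpers.

```python
import argparse


def parse_depth(value):
    """Convert the command-line string to an int, treating 'None' as no depth limit."""
    return None if value in ("None", "") else int(value)


def parse_args(argv=None):
    """Parse the two hyperparameters sampled by HyperDrive."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--criterion", type=str, default="gini")
    parser.add_argument("--max_depth", type=parse_depth, default=None)
    return parser.parse_args(argv)


# Simulate the arguments one child run might receive.
args = parse_args(["--criterion", "entropy", "--max_depth", "20"])
print(args.criterion, args.max_depth)
```

The script would also need to log a metric named `Accuracy`, since that is the `primary_metric_name` HyperDrive uses to compare runs.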

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [None]:
# Monitor HyperDrive runs
from azureml.widgets import RunDetails
RunDetails(ids_hyperdrive_run).show()

In [None]:
# Wait for completion
ids_hyperdrive_run.wait_for_completion(show_output=True)

In [None]:
assert ids_hyperdrive_run.get_status() == "Completed"

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [None]:
best_run = ids_hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']
print("\n Best run definition parameter values:", parameter_values)
print("\n ********************************************************")
print('\n Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])

best_run

In [None]:
import joblib
# Get your best run and save the model from that run.

best_run.register_model(model_name='vrk_bank_best_hyper_model_predictor', model_path='outputs/vrk_bankmodel.joblib')

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service.