# Automated ML

In this capstone project I show how data science can serve as an investigation tool: a classification algorithm distinguishes normal traffic (good connections) from intrusion or attack traffic (bad connections). A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. The goal is to build an Intrusion Detection System (IDS).

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing AmlCompute cluster to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using AmlCompute.
7. Explore the results.
8. Test the best fitted model.

#### Import Dependencies.
In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import os

## Dataset

### Data description
The data is collected by packet analyzers (also known as packet, network, or protocol sniffers), which intercept and log traffic on the network. The dataset used here is NSL-KDD, a revised version of the KDD Cup 1999 dataset. The original 1999 KDD Cup dataset was created for the DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Laboratory. The data was collected over nine weeks and consists of raw tcpdump traffic in a local area network (LAN) that simulates the environment of a typical United States Air Force LAN. Some network attacks were deliberately carried out during the recording period. There were 38 different types of attacks, but only 24 are available in the training set.

#### Data references

https://www.unb.ca/cic/datasets/nsl.html     
https://www.kaggle.com/hassan06/nslkdd

## Initialize Workspace
Initialize a workspace object from the persisted configuration. Make sure the config file is present at `.\config.json`.
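Before calling `Workspace.from_config()`, it can help to confirm that the config file carries the three entries the SDK reads from it. The sketch below is a minimal check under that assumption; the helper name `missing_config_keys` and the placeholder values are hypothetical and only serve the illustration.

```python
import json
import tempfile
from pathlib import Path

# Keys that Workspace.from_config() reads from config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent from the given config file."""
    config = json.loads(Path(config_path).read_text())
    return [key for key in REQUIRED_KEYS if key not in config]


# Demonstration on a throwaway file with placeholder values (not real IDs).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "config.json"
    sample.write_text(json.dumps({
        "subscription_id": "<subscription-id>",
        "resource_group": "<resource-group>",
    }))
    print(missing_config_keys(sample))  # ['workspace_name']
```

In the notebook the same helper could be pointed at `config.json` in the working directory; an empty list means all three entries are present.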

In [2]:
from azureml.core import Workspace, Experiment

vrk_auto_ids_ws = Workspace.from_config()
vrk_auto_ids_exp = Experiment(workspace=vrk_auto_ids_ws, name="vrk_ids_auto_exp")

print('Workspace name: ' + vrk_auto_ids_ws.name, 
      'Azure region: ' + vrk_auto_ids_ws.location, 
      'Subscription id: ' + vrk_auto_ids_ws.subscription_id, 
      'Resource group: ' + vrk_auto_ids_ws.resource_group, sep = '\n')

auto_ids_run = vrk_auto_ids_exp.start_logging()

Workspace name: quick-starts-ws-142178
Azure region: southcentralus
Subscription id: 510b94ba-e453-4417-988b-fbdc37b55ca7
Resource group: aml-quickstarts-142178


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your run. 

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=vrk_auto_ids_ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(vrk_auto_ids_ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Download and prepare data for AutoML learning


In [5]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory

kdd_auto_webpath = [
                     'https://raw.githubusercontent.com/venkataravikumaralladi/AzureMLCapstoneProject/main/KDDTrain.csv'
                   ]

# Load the NSL-KDD training data as a tabular dataset
ids_auto_dataset = TabularDatasetFactory.from_delimited_files(path=kdd_auto_webpath)

In [7]:
# Column names for the NSL-KDD records, grouped by feature family
network_data_column_names = [
                  # basic connection features
                  'duration', 'protocol_type', 'service',
                  'flag', 'src_bytes', 'dst_bytes',
                  'land', 'wrong_fragment', 'urgent',

                  # content features
                  'hot', 'num_failed_logins', 'logged_in',
                  'num_compromised', 'root_shell', 'su_attempted',
                  'num_root', 'num_file_creations', 'num_shells',
                  'num_access_files', 'num_outbound_cmds', 'is_hot_login',
                  'is_guest_login',

                  # time-based traffic features
                  'count', 'srv_count', 'serror_rate',
                  'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
                  'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',

                  # host-based traffic features
                  'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
                  'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
                  'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                  'dst_host_srv_rerror_rate',

                  # target label and the extra trailing column (dropped later)
                  'attack_type',
                  'success_pred' ]

train_df = ids_auto_dataset.to_pandas_dataframe().dropna()
print("train df data shape ", train_df.shape)
train_df.columns = network_data_column_names
train_df.head()

train df data shape  (125972, 43)


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type,success_pred
0,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
1,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
2,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
3,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21
4,0,tcp,private,REJ,0,0,0,0,0,0,...,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,neptune,21
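The cell above assigns column names after loading, which suggests the CSV has no header row. If that is the case, a loader that promotes the first row to a header would drop one record, so the reported row count may be one lower than the number of records in the file. This is an assumption about the file, not something the output confirms. The sketch below shows the effect with the standard `csv` module on three made-up, shortened records.

```python
import csv
import io

# Three illustrative records from a header-less file (shortened to 4 columns).
raw = "0,udp,other,SF\n0,tcp,private,S0\n0,tcp,http,SF\n"
names = ["duration", "protocol_type", "service", "flag"]

# Header inferred from the file: the first record is consumed as column names.
inferred = list(csv.DictReader(io.StringIO(raw)))

# Column names supplied explicitly: every record is kept as data.
explicit = list(csv.DictReader(io.StringIO(raw), fieldnames=names))

print(len(inferred), len(explicit))  # 2 3
```

If the file really is header-less, loading it with header promotion turned off and then assigning the names would keep the first record.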


In [8]:
print("train df data shape ", train_df.shape)
# For this analysis we drop the "success_pred" column
train_df.drop('success_pred', axis=1, inplace=True)
print("Train data frame after dropping success pred is  ", train_df.shape)

train df data shape  (125972, 43)
Train data frame after dropping success pred is   (125972, 42)


In [10]:
# Separate the label column (attack_type) from the features.
train_X = train_df.drop("attack_type", axis=1)
train_Y = train_df['attack_type']
# Build a binary target: 0 for normal traffic, 1 for any attack.
train_Y = train_Y.apply(lambda x: 0 if x == 'normal' else 1)

print("Shape of train_X is ", train_X.shape)
print("Shape of train_Y is ", train_Y.shape)
train_X = train_X.join(train_Y)
print("After join train_X shape is ", train_X.shape)
train_X.head()

Shape of train_X is  (125972, 41)
Shape of train_Y is  (125972,)
After join train_X shape is  (125972, 42)


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type
0,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0
1,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1
2,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0
3,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0,tcp,private,REJ,0,0,0,0,0,0,...,19,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,1
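The binary labelling used above can be sketched without pandas, together with a quick class-share check. The labels are the five shown in the table output; the helper name `to_binary_label` is hypothetical.

```python
from collections import Counter


def to_binary_label(attack_type):
    """Map 'normal' to 0 and every attack name to 1."""
    return 0 if attack_type == "normal" else 1


# attack_type values of the five rows displayed in the notebook output.
sample_labels = ["normal", "neptune", "normal", "normal", "neptune"]

binary = [to_binary_label(label) for label in sample_labels]
counts = Counter(binary)
share = {cls: n / len(binary) for cls, n in counts.items()}

print(binary)  # [0, 1, 0, 0, 1]
print(share)   # {0: 0.6, 1: 0.4}
```

Running the same count over the full `train_Y` series (for example with `train_Y.value_counts(normalize=True)`) gives a quick view of how balanced the two classes are before training.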


In [11]:
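# Store the data frame locally before uploading it to the datastore for AutoML training
os.makedirs('training', exist_ok=True)  # to_csv does not create missing directories
train_X.to_csv('training/ids_training_data.csv', index=False)  # keep the row index out of the features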
default_datastore = vrk_auto_ids_ws.get_default_datastore()
default_datastore

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-0db7de48-b4fb-46f6-b782-fabf4f2cd6b1",
  "account_name": "mlstrg142178",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [12]:
# Upload the training data to the datastore for AutoMLConfig training
default_datastore.upload(src_dir='training', target_path='data/')

Uploading an estimated of 1 files
Uploading training/ids_training_data.csv
Uploaded training/ids_training_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_9efa16f14b75432c97efc96caaf01bd3

In [13]:
# Create a tabular dataset that points to the uploaded training data in the datastore
from azureml.core import Dataset
ids_auto_dataset = Dataset.Tabular.from_delimited_files(path=[(default_datastore, 'data/ids_training_data.csv')])

## AutoML Configuration

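The run is configured as a `classification` task with `accuracy` as the primary metric and 5-fold cross-validation (`n_cross_validations=5`). `experiment_timeout_minutes=30` caps the experiment so the lab instance does not time out, `blocked_models` excludes `XGBoostClassifier`, and training runs on the `cpu_cluster` compute target with `attack_type` as the label column.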

In [22]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
vrk_ids_automl_config = AutoMLConfig(
                   experiment_timeout_minutes=30,
                   task="classification",
                   primary_metric="accuracy",
                   compute_target = cpu_cluster,
                   training_data=ids_auto_dataset,
                   blocked_models=['XGBoostClassifier'],
                   label_column_name='attack_type',
                   n_cross_validations=5)
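The setting `n_cross_validations=5` means each candidate model is scored on five train/validation splits. The sketch below shows the basic idea with a plain, unshuffled split; it is an illustration only, and AutoML's own splitting may shuffle or stratify the rows. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds get one additional sample each.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Ten samples and five folds: every sample is validated exactly once.
for fold, (train, validation) in enumerate(k_fold_indices(10, 5)):
    print(fold, validation)
```

With the 125972 training rows and five folds, each model is fitted on roughly 80% of the rows and validated on the remaining 20%, five times over.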

In [23]:
# Submit your automl run
from azureml.pipeline.steps import AutoMLStep
#vrk_auto__exp = Experiment(workspace=vrk_auto_ids_ws, name="vrk_auto_ids_train_exp")
automl_ids_run = vrk_auto_ids_exp.submit(vrk_ids_automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on cpucluster with default configuration
Running on remote compute: cpucluster
Parent Run ID: AutoML_4e785f86-5726-410e-b27d-8c9e1b651c36

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [18]:
from azureml.widgets import RunDetails
RunDetails(automl_ids_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [19]:
automl_ids_run.wait_for_completion()

{'runId': 'AutoML_c714f046-836c-44a9-9679-ea1f9111d1da',
 'target': 'cpucluster',
 'status': 'Completed',
 'startTimeUtc': '2021-04-07T14:28:22.249635Z',
 'endTimeUtc': '2021-04-07T15:26:33.330029Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpucluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"89ab9d0f-04e0-454a-adb8-4aef94adf060\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/ids_training_data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-142178\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"510b94ba-e453-4417-988b-fbdc37b55ca7\\\\\\", \\\\\\

In [25]:
assert automl_ids_run.get_status() == "Completed"

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [26]:
# Retrieve and save your best automl model.
best_auto_run, best_fitted_model = automl_ids_run.get_output()

In [27]:
# Show the final estimator of the best fitted pipeline
best_fitted_model._final_estimator


LightGBMClassifier(boosting_type='gbdt', class_weight=None,
                   colsample_bytree=1.0, importance_type='split',
                   learning_rate=0.1, max_depth=-1, min_child_samples=20,
                   min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,
                   n_jobs=1, num_leaves=31, objective=None, random_state=None,
                   reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
                   subsample_for_bin=200000, subsample_freq=0, verbose=-10)
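The fitted pipeline is not yet written to disk in this notebook. One way to do that is to pickle it, as sketched below. This assumes the fitted pipeline is picklable in the current environment; the helper names and the file name `best_automl_model.pkl` are hypothetical, and a plain dictionary stands in for `best_fitted_model` so the sketch runs without Azure.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Pickle the model to path, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Load a model that was written by save_model."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Stand-in object; in the notebook this would be best_fitted_model.
stand_in_model = {"estimator": "LightGBMClassifier", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(stand_in_model, Path(tmp) / "outputs" / "best_automl_model.pkl")
    restored = load_model(saved_path)
    print(restored == stand_in_model)  # True
```

In the notebook, a call such as `save_model(best_fitted_model, "outputs/best_automl_model.pkl")` would store the pipeline next to the other run outputs.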

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [28]:
best_auto_run.register_model(model_name='vrk_ids_auto_mdl', model_path='./outputs')

Model(workspace=Workspace.create(name='quick-starts-ws-142178', subscription_id='510b94ba-e453-4417-988b-fbdc37b55ca7', resource_group='aml-quickstarts-142178'), name=vrk_ids_auto_mdl, id=vrk_ids_auto_mdl:1, version=1, tags={}, properties={})

TODO: In the cell below, send a request to the web service you deployed to test it.
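No web service is deployed in this notebook, so the following is only a sketch of how a test request body could be built once one exists. It assumes the scoring script accepts a JSON object of the form `{"data": [...]}`, which is the shape AutoML-generated scoring scripts commonly use; the helper name `build_request_body` is hypothetical, and a real request would need every feature column rather than the six shown.

```python
import json


def build_request_body(records):
    """Wrap a list of feature dictionaries in a {"data": [...]} envelope."""
    return json.dumps({"data": records})


# A few fields taken from one of the rows displayed earlier (not a full record).
sample_record = {
    "duration": 0,
    "protocol_type": "tcp",
    "service": "http",
    "flag": "SF",
    "src_bytes": 232,
    "dst_bytes": 8153,
}

body = build_request_body([sample_record])
print(body)
```

The resulting string could then be posted to the service's scoring URI with a `Content-Type: application/json` header, for example through `requests.post`.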

In [29]:
autoscore_file = 'inference/autoscore.py'
best_auto_run.download_file('outputs/', 'inference/autoscore.py' )

ServiceException: ServiceException:
	Code: 400
	Message: (UserError) Artifact path cannot represent a directory
	Details:

	Headers: {
	    "Date": "Wed, 07 Apr 2021 16:32:02 GMT",
	    "Content-Type": "application/json; charset=utf-8",
	    "Content-Length": "578",
	    "Connection": "keep-alive",
	    "Request-Context": "appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d",
	    "x-ms-response-type": "error",
	    "x-ms-client-request-id": "d4523384-01fa-4d67-be6f-4b9b6cd3040c",
	    "x-ms-client-session-id": "",
	    "Strict-Transport-Security": "max-age=15724800; includeSubDomains; preload",
	    "X-Content-Type-Options": "nosniff",
	    "x-request-time": "0.035"
	}
	InnerException: {
    "additional_properties": {},
    "error": {
        "additional_properties": {
            "debugInfo": null
        },
        "code": "UserError",
        "severity": null,
        "message": "Artifact path cannot represent a directory",
        "message_format": null,
        "message_parameters": null,
        "reference_code": null,
        "details_uri": null,
        "target": null,
        "details": [],
        "inner_error": null
    },
    "correlation": {
        "operation": "9a90582eb08557409b4d3c4ae0d5a424",
        "request": "a12f020dfcd0e24d"
    },
    "environment": "southcentralus",
    "location": "southcentralus",
    "time": {},
    "component_name": "artifact"
}
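The error above occurs because `'outputs/'` names a directory, while `download_file` expects the name of a single file artifact. A possible fix is to list the run's files with `best_auto_run.get_file_names()` and pick the scoring script from that list. The sketch below shows the selection step only; the file names in `listed` and the `scoring_file_*.py` pattern are assumptions about what the run contains and should be checked against the real listing.

```python
from fnmatch import fnmatch


def find_scoring_files(file_names, pattern="outputs/scoring_file_*.py"):
    """Return the artifact names that match the scoring-script pattern."""
    return sorted(name for name in file_names if fnmatch(name, pattern))


# Hypothetical listing; in the notebook use best_auto_run.get_file_names().
listed = [
    "outputs/model.pkl",
    "outputs/scoring_file_v_1_0_0.py",
    "outputs/conda_env_v_1_0_0.yml",
    "logs/azureml/azureml.log",
]

print(find_scoring_files(listed))  # ['outputs/scoring_file_v_1_0_0.py']
```

Once a matching name is found, it can be passed to `best_auto_run.download_file(name, 'inference/autoscore.py')` in place of the directory path.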

TODO: In the cell below, print the logs of the web service and delete the service