# Automated ML

In this capstone project I showcase how data science can be used as an investigation tool. Here we use a classification algorithm to distinguish between normal traffic (good connections) and intrusion or attack traffic (bad connections). A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows from a source IP address to a target IP address under a well-defined protocol. We will build an Intrusion Detection System (IDS).

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or Attach existing AmlCompute to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use AutoMLStep.
6. Train the model using AmlCompute.
7. Explore the results.
8. Test the best fitted model.

#### Import Dependencies.
In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import os

## Dataset

### Data description
Data is collected by packet analyzers (also known as packet, network, or protocol sniffers), which intercept and log traffic in the network. The dataset that we will use is the NSL-KDD dataset. The original 1999 KDD Cup dataset was created for the DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Laboratory. The data was collected over nine weeks and consists of raw tcpdump traffic in a local area network (LAN) that simulates the environment of a typical United States Air Force LAN. Some network attacks were deliberately carried out during the recording period. There were 38 different types of attacks, but only 24 are available in the training set.

#### Data references

https://www.unb.ca/cic/datasets/nsl.html     
https://www.kaggle.com/hassan06/nslkdd

## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.
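
As a rough sketch of what `Workspace.from_config()` expects, the snippet below checks that a config file carries the three keys an Azure ML `config.json` normally holds (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` and the placeholder values are assumptions made for illustration; they are not part of the SDK.

```python
import json

# Keys that an Azure ML workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent from a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only.
sample_config = json.dumps({
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
})
print(missing_config_keys(sample_config))
```

An empty list means all three keys are present; any name in the list points to an entry that should be added before calling `Workspace.from_config()`.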

In [2]:
from azureml.core import Workspace, Experiment

vrk_auto_ids_ws = Workspace.from_config()
vrk_auto_ids_exp = Experiment(workspace=vrk_auto_ids_ws, name="vrk_ids_auto_exp")

print('Workspace name: ' + vrk_auto_ids_ws.name, 
      'Azure region: ' + vrk_auto_ids_ws.location, 
      'Subscription id: ' + vrk_auto_ids_ws.subscription_id, 
      'Resource group: ' + vrk_auto_ids_ws.resource_group, sep = '\n')

auto_ids_run = vrk_auto_ids_exp.start_logging()

Workspace name: quick-starts-ws-142286
Azure region: southcentralus
Subscription id: 610d6e37-4747-4a20-80eb-3aad70a55f43
Resource group: aml-quickstarts-142286


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your run. 

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=vrk_auto_ids_ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(vrk_auto_ids_ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating...
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Download and prepare data for AutoML learning


In [4]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory

kdd_auto_webpath = [
                     'https://raw.githubusercontent.com/venkataravikumaralladi/AzureMLCapstoneProject/main/KDDTrain.csv'
                   ]

# Create the NSL-KDD intrusion detection dataset in tabular format using TabularDatasetFactory
ids_auto_dataset = TabularDatasetFactory.from_delimited_files(path=kdd_auto_webpath)

In [11]:
# Column names for the network traffic data
network_data_column_names = [ 
                  'duration', 'protocol_type', 'service',
                  'flag', 'src_bytes', 'dst_bytes',
                  'land', 'wrong_fragment', 'urgent',
    
            
                  'hot', 'num_failed_logins', 'logged_in',
                  'num_compromised', 'root_shell', 'su_attempted',
                  'num_root', 'num_file_creations', 'num_shells',
                  'num_access_files', 'num_outbound_cmds', 'is_hot_login',
                  'is_guest_login',
    
                 
                  'count', 'srv_count', 'serror_rate',
                  'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
                  'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
                 
                  'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
                  'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
                  'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                  'dst_host_srv_rerror_rate',
    
                   'attack_type',
                   'success_pred' ]

train_df = ids_auto_dataset.to_pandas_dataframe().dropna()
print("train df data shape ", train_df.shape)
train_df.columns = network_data_column_names
train_df.head()

train df data shape  (125972, 43)


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type,success_pred
0,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
1,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
2,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
3,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21
4,0,tcp,private,REJ,0,0,0,0,0,0,...,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,neptune,21


In [12]:
print("train df data shape ", train_df.shape)
# For this analysis we drop "success_pred" column
train_df.drop('success_pred', axis=1, inplace=True)
print("Train data frame after dropping success pred is  ", train_df.shape)

train df data shape  (125972, 43)
Train data frame after dropping success pred is   (125972, 42)


In [13]:
# Drop attack type in training data which is to be predicted.
train_X = train_df.drop("attack_type", axis=1)
train_Y = train_df['attack_type']
# we build binary classifier for this project
train_Y = train_Y.apply(lambda x: 0 if x == 'normal' else 1)  

print("Shape of train_X is ", train_X.shape)
print("Shape of train_Y is ", train_Y.shape)
train_X = train_X.join(train_Y)
print("After join train_X shape is ", train_X.shape)
train_X.head()

Shape of train_X is  (125972, 41)
Shape of train_Y is  (125972,)
After join train_X shape is  (125972, 42)


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type
0,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0
1,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1
2,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0
3,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0,tcp,private,REJ,0,0,0,0,0,0,...,19,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,1
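
The label mapping used above can be illustrated without pandas. This is a minimal sketch, assuming only the five `attack_type` values visible in the first rows of the data; the helper name `to_binary_label` is hypothetical. Counting the resulting labels is a simple way to look at class balance, which matters later when choosing accuracy as the primary metric.

```python
from collections import Counter


def to_binary_label(attack_type):
    """Map 'normal' to 0 and every attack name to 1."""
    return 0 if attack_type == "normal" else 1


# attack_type values of the first five rows shown above (illustration only).
sample_labels = ["normal", "neptune", "normal", "normal", "neptune"]
binary_labels = [to_binary_label(label) for label in sample_labels]

# Count how many rows fall into each class.
class_counts = Counter(binary_labels)
print(binary_labels)
print(class_counts)
```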


In [14]:
# Save the data frame locally so it can be uploaded to the datastore for AutoMLConfig training
os.makedirs("training", exist_ok=True)
train_X.to_csv('training/ids_training_data.csv', index=False)
default_datastore = vrk_auto_ids_ws.get_default_datastore()
default_datastore

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-1255c964-f867-4262-8ffe-389e6b54a7ca",
  "account_name": "mlstrg142286",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [15]:
# Upload training data to the datastore for AutoMLConfig training
default_datastore.upload(src_dir='training', target_path='data/')

Uploading an estimated of 1 files
Uploading training/ids_training_data.csv
Uploaded training/ids_training_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_805f3263c8d142fb9855f94b555d1c2e

In [16]:
# Get a dataset pointer to the intrusion detection training data in the datastore
from azureml.core import Dataset
ids_auto_dataset = Dataset.Tabular.from_delimited_files(path=[(default_datastore, ('data/ids_training_data.csv'))])

## AutoML Configuration

An Intrusion Detection System is a classification task. The `AutoMLConfig` parameters used for this job are set accordingly, as described below.

#### AutoML Settings:

`max_cores_per_iteration`: int = 1 <br>

`max_concurrent_iterations`: int = 1 <br>

`featurization` = auto (the default, so not set here). With `auto`, AutoML performs featurization automatically. <br>

`n_cross_validations` = 5. Five cross-validation folds are used. <br>

`experiment_timeout_minutes` = 30. The timeout is set to 30 minutes to fit the lab time provided. <br>

`primary_metric` = accuracy. Accuracy is used because the provided dataset is balanced, which makes it well suited to the job at hand. <br>

#### AutoML Config:

`training_data`: The tabular dataset pointing to the training data in the default datastore. <br>

`blocked_models`: In this project I blocked XGBoostClassifier because I faced issues importing the model created in the AutoML environment into the compute environment, due to a version difference in the XGBoost library.<br>

`label_column_name`: `attack_type` is the column to predict; it indicates whether traffic is normal or an attack. <br>

`compute_target`: The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.<br>
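
The settings listed above can be gathered in one dictionary and unpacked into `AutoMLConfig`. This is a sketch only: the name `automl_settings` is an assumption, and the `AutoMLConfig` call is shown as a comment because it needs the Azure ML SDK and the workspace objects created earlier in this notebook.

```python
# Settings described above, collected in one place.
automl_settings = {
    "experiment_timeout_minutes": 30,
    "primary_metric": "accuracy",
    "n_cross_validations": 5,
    "max_cores_per_iteration": 1,
    "max_concurrent_iterations": 1,
    "blocked_models": ["XGBoostClassifier"],
}

# With the Azure ML SDK and the objects from this notebook available,
# the dictionary could be unpacked like this:
# vrk_ids_automl_config = AutoMLConfig(task="classification",
#                                      compute_target=cpu_cluster,
#                                      training_data=ids_auto_dataset,
#                                      label_column_name="attack_type",
#                                      **automl_settings)

# Print the settings in a stable order for a quick review.
for name, value in sorted(automl_settings.items()):
    print(f"{name} = {value}")
```

Keeping the settings in a dictionary makes it easy to log them alongside the run or reuse them for another experiment.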

In [17]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
vrk_ids_automl_config = AutoMLConfig(
                   experiment_timeout_minutes=30,
                   task="classification",
                   primary_metric="accuracy",
                   compute_target = cpu_cluster,
                   training_data=ids_auto_dataset,
                   blocked_models=['XGBoostClassifier'],
                   label_column_name='attack_type',
                   n_cross_validations=5)

In [18]:
# Submit your automl run
from azureml.pipeline.steps import AutoMLStep
#vrk_auto__exp = Experiment(workspace=vrk_auto_ids_ws, name="vrk_auto_ids_train_exp")
automl_ids_run = vrk_auto_ids_exp.submit(vrk_ids_automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on cpucluster with default configuration
Running on remote compute: cpucluster
Parent Run ID: AutoML_52d729eb-4c12-42fb-b93b-125aa69b16fc

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Generating individually featurized CV splits.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

In the cell below, the `RunDetails` widget is used to show the different experiments.

In [21]:
from azureml.widgets import RunDetails
RunDetails(automl_ids_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [22]:
automl_ids_run.wait_for_completion()

{'runId': 'AutoML_52d729eb-4c12-42fb-b93b-125aa69b16fc',
 'target': 'cpucluster',
 'status': 'Completed',
 'startTimeUtc': '2021-04-09T10:11:14.698209Z',
 'endTimeUtc': '2021-04-09T11:05:36.754122Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpucluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"8837fa65-f734-4206-84bb-d4e47b894b5d\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/ids_training_data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-142286\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"610d6e37-4747-4a20-80eb-3aad70a55f43\\\\\\", \\\\\\

In [23]:
assert automl_ids_run.get_status() == "Completed"

## Best Model

In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [24]:
# Retrieve and save your best automl model.
best_auto_run, best_fitted_model = automl_ids_run.get_output()

import joblib
#Save the best model
if "automl_bestmdl" not in os.listdir():
    os.mkdir("./automl_bestmdl")
joblib.dump(best_fitted_model, './automl_bestmdl/best_fit_automl_ids.pkl')

print("Auto best fitted model is: ", best_fitted_model._final_estimator)

Package:azureml-automl-runtime, training version:1.25.0, current version:1.24.0
Package:azureml-core, training version:1.25.0, current version:1.24.0
Package:azureml-dataset-runtime, training version:1.25.0, current version:1.24.0
Package:azureml-defaults, training version:1.25.0, current version:1.24.0
Package:azureml-interpret, training version:1.25.0, current version:1.24.0
Package:azureml-mlflow, training version:1.25.0, current version:1.24.0
Package:azureml-pipeline-core, training version:1.25.0, current version:1.24.0
Package:azureml-telemetry, training version:1.25.0, current version:1.24.0
Package:azureml-train-automl-client, training version:1.25.0, current version:1.24.0
Package:azureml-train-automl-runtime, training version:1.25.0, current version:1.24.0


## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.


The steps for model deployment are:
1. Register the model for operationalization.
2. Prepare an entry script.
3. Prepare an inference configuration.
4. Choose a compute target.
5. Deploy the model to the compute target.
6. Test the resulting webservice.


#### Step1: Register Model: 
Register a model for operationalization.

`register_model(model_name, model_path=None, tags=None, properties=None, model_framework=None, model_framework_version=None, description=None, datasets=None, sample_input_dataset=None, sample_output_dataset=None, resource_configuration=None, **kwargs)`

The signature above lists all the input parameters. Here `model_path` points to the run's `outputs` folder, where the best model is stored.

In [27]:
ids_auto_mdl = best_auto_run.register_model(model_name='vrk_ids_auto_mdl', model_path='./outputs')

#### Step2: Prepare an entry script: 

The entry script receives data submitted to a deployed web service and passes it to the model. It then takes the response returned by the model and returns that to the client. The script is specific to your model. It must understand the data that the model expects and returns.

The two things you need to accomplish in your entry script are:

1. Loading your model (using a function called `init()`).
2. Running your model on input data (using a function called `run()`).
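
The `init()` / `run()` contract described above can be sketched without any Azure dependency. In this minimal example the class `ThresholdModel` is a hypothetical stand-in for the trained model and its rule is made up for illustration; a real entry script would load the registered model file inside `init()` instead.

```python
import json

model = None


class ThresholdModel:
    """Hypothetical stand-in for the trained model; flags rows with no bytes sent."""

    def predict(self, rows):
        return [1 if row.get("src_bytes", 0) == 0 else 0 for row in rows]


def init():
    """Load the model once when the service starts."""
    global model
    model = ThresholdModel()  # a real script would load the model file here


def run(data):
    """Parse the JSON request, score it, and return a JSON response."""
    try:
        rows = json.loads(data)["data"]
        return json.dumps({"result": model.predict(rows)})
    except Exception as exc:
        # Report the problem to the caller instead of crashing the service
        return json.dumps({"error": str(exc)})


init()
print(run(json.dumps({"data": [{"src_bytes": 0}, {"src_bytes": 232}]})))
```

The request body is a JSON string with a `data` list, and the response is a JSON string with either a `result` or an `error` key, which mirrors the shape used by the scoring script in this notebook.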

In [28]:
if "inference" not in os.listdir():
    os.mkdir("./inference")
    
autoscore_file = 'inference/autoscore.py'

# best_auto_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/autoscore.py' )

In [None]:
%%writefile inference/autoscore.py

import json
import logging
import os
import pickle
import numpy as np
import pandas as pd
import joblib

import azureml.automl.core
from azureml.automl.core.shared import logging_utilities, log_server
from azureml.telemetry import INSTRUMENTATION_KEY

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType



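try:
    log_server.enable_telemetry(INSTRUMENTATION_KEY)
    log_server.set_verbosity('INFO')
except Exception:
    # Telemetry is optional; continue without it if setup fails
    pass

# Create the logger outside the try block so it is always defined
logger = logging.getLogger('azureml.automl.core.scoring_script')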


def init():
    global model
    # AZUREML_MODEL_DIR points to the directory of the deployed model.
    # Deserialize the model file back into a scikit-learn model.
    model_base_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'outputs')
    model_path = os.path.join(model_base_path, 'model.pkl')
    path = os.path.normpath(model_path)
    path_split = path.split(os.sep)
    log_server.update_custom_dimensions({'model_name': path_split[-3], 'model_version': path_split[-2]})
    try:
        logger.info("Loading model from path.")
        model = joblib.load(model_path)
        logger.info("Loading successful.")
    except Exception as e:
        logging_utilities.log_traceback(e, logger)
        raise


def run(data):
    try:
        temp = json.loads(data)
        data_df = pd.DataFrame(temp['data'])
        print("data ", data_df)
        print("**********************")
        result = model.predict(data_df)
        print("Result is ", result)
        return json.dumps({"result": result.tolist()})
    except Exception as e:
        result = str(e)
        return json.dumps({"error": result})


#### Step3: Prepare an inference configuration: 

An inference configuration describes how to set up the web service containing your model. It's used later, when you deploy the model. Here we choose Azure Container Instances (ACI) as the compute target and deploy using the `deploy` API of the `Model` class.


In [29]:
model_name = best_auto_run.properties['model_name']
model_name

'AutoML52d729eb418'

In [37]:
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.model import Model



inference_config = InferenceConfig(entry_script=autoscore_file, environment=best_auto_run.get_environment())

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4, enable_app_insights=True)

ids_auto_websvc = Model.deploy(vrk_auto_ids_ws, 'vrk-auto-ids', [ids_auto_mdl], inference_config, deployment_config)
ids_auto_websvc.wait_for_deployment(show_output = True)

print(ids_auto_websvc.state)
print(ids_auto_websvc.scoring_uri)
print(ids_auto_websvc.swagger_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-04-09 11:56:38+00:00 Creating Container Registry if not exists.
2021-04-09 11:56:39+00:00 Registering the environment.
2021-04-09 11:56:40+00:00 Use the existing image.
2021-04-09 11:56:40+00:00 Generating deployment configuration.
2021-04-09 11:56:42+00:00 Submitting deployment to compute..
2021-04-09 11:56:45+00:00 Checking the status of deployment vrk-auto-ids..
2021-04-09 12:01:21+00:00 Checking the status of inference endpoint vrk-auto-ids.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
http://2812921d-4655-43a9-8e6f-cb44cb78786a.southcentralus.azurecontainer.io/score
http://2812921d-4655-43a9-8e6f-cb44cb78786a.southcentralus.azurecontainer.io/swagger.json


#### Send request to deployed webservice
In the cell below, send a request to the web service you deployed to test it.

In [38]:
import requests
import json

# URL for the web service
scoring_uri = 'http://2812921d-4655-43a9-8e6f-cb44cb78786a.southcentralus.azurecontainer.io/score'

# Set the content type
headers = {'Content-Type': 'application/json'}

# One record to score, so we get one result back
data = {"data":
        [{
            "duration": 0,
            "protocol_type": "tcp",
            "service": "http",
            "flag": "REJ",
            "src_bytes": 0,
            "dst_bytes": 0,
            "land": 0,
            "wrong_fragment": 0,
            "urgent": 0,
            "hot": 0,
            "num_failed_logins": 0,
            "logged_in": 0,
            "num_compromised": 0,
            "root_shell": 0,
            "su_attempted": 0,
            "num_root": 0,
            "num_file_creations": 0,
            "num_shells": 0,
            "num_access_files": 0,
            "num_outbound_cmds": 0,
            "is_hot_login": 0,
            "is_guest_login": 0,
            "count": 0,
            "srv_count": 0,
            "serror_rate": 0,
            "srv_serror_rate": 0,
            "rerror_rate": 0,
            "srv_rerror_rate": 0,
            "same_srv_rate": 0,
            "diff_srv_rate": 0,
            "srv_diff_host_rate": 0,
            "dst_host_count": 0,
            "dst_host_srv_count": 0,
            "dst_host_same_srv_rate": 0,
            "dst_host_diff_srv_rate": 0,
            "dst_host_same_src_port_rate": 0,
            "dst_host_srv_diff_host_rate": 0,
            "dst_host_serror_rate": 0,
            "dst_host_srv_serror_rate": 0,
            "dst_host_rerror_rate": 0,
            "dst_host_srv_rerror_rate": 0 }
        ]
    }
# Convert to JSON string
input_data = json.dumps(data)

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)

print("Response Code : ", resp.status_code)
print("Predicted Value : ",resp.json())

Response Code :  200
Predicted Value :  {"error": "DataException:\n\tMessage: The number of features in [fitted data](42) does not match with those in [input data](41). Please inspect your data, and make sure that features are aligned in both the Datasets.\n\tInnerException: None\n\tErrorResponse \n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"The number of features in [fitted data](42) does not match with those in [input data](41). Please inspect your data, and make sure that features are aligned in both the Datasets.\",\n        \"target\": \"X\",\n        \"inner_error\": {\n            \"code\": \"BadData\",\n            \"inner_error\": {\n                \"code\": \"InvalidDimension\",\n                \"inner_error\": {\n                    \"code\": \"DataShapeMismatch\"\n                }\n            }\n        },\n        \"reference_code\": \"c402b6c2-3870-45a7-8745-c063bd385962\"\n    }\n}"}
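
The response above reports a mismatch between the number of features the model was fitted on and the number sent in the request. One way to inspect such a mismatch is to compare the keys of a request record against the column names the model is expected to receive. A minimal sketch, assuming the expected names are available as a list; `diff_columns` and the shortened column list are hypothetical and used only for illustration.

```python
def diff_columns(record, expected_columns):
    """Return (missing, unexpected) column names for one request record."""
    missing = [name for name in expected_columns if name not in record]
    unexpected = [name for name in record if name not in expected_columns]
    return missing, unexpected


# Shortened, illustrative column list; the real model expects many more columns.
expected = ["duration", "protocol_type", "service", "flag"]
record = {"duration": 0, "protocol_type": "tcp", "service": "http"}

missing, unexpected = diff_columns(record, expected)
print("missing:", missing)
print("unexpected:", unexpected)
```

Running the same comparison against the header of the training CSV would show which column the request lacks or carries in excess.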


In [None]:
# Web Service Logs
print(ids_auto_websvc.get_logs())

In [None]:
# Delete the service
ids_auto_websvc.delete()

### Debug cells

The following cells are left in for future reference.

In [40]:
import joblib
# Save the best model
joblib.dump(best_fitted_model, 'best_fit_automm_ids.pkl')

['best_fit_automm_ids.pkl']

In [42]:
best_ids_auto_model = joblib.load('best_fit_automm_ids.pkl')

In [None]:
import pandas as pd
temp = json.loads(input_data)
data_df = pd.DataFrame(temp['data'])
print("data ", data_df)
best_ids_auto_model.predict(data_df)