# Hyperparameter Tuning using HyperDrive

In this capstone project I showcase how data science can be used as an investigation tool. Here we use a classification algorithm to distinguish between normal traffic (good connections) and intrusion or attack traffic (bad connections). A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows from a source IP address to a target IP address under a well-defined protocol. We will create an Intrusion Detection System (IDS).

### Data description
Data is collected by packet analyzers (also known as packet, network, or protocol sniffers), which intercept and log traffic in the network. The dataset that we will use is the NSL-KDD dataset. The original 1999 KDD Cup dataset was created for the DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Laboratory. The data was collected over nine weeks and consists of raw tcpdump traffic in a local area network (LAN) that simulates the environment of a typical United States Air Force LAN. Some network attacks were deliberately carried out during the recording period. There were 38 different types of attacks, but only 24 are available in the training set.

#### Data references

https://www.unb.ca/cic/datasets/nsl.html     
https://www.kaggle.com/hassan06/nslkdd

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the configuration before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` to a workspace.
3. Define data loading in a `TabularDataset` in the training script `NetworkdataClassifier.py`.
4. Create the training script `NetworkdataClassifier.py`, which is passed to the HyperDrive configuration.
5. Configure HyperDrive using `HyperDriveConfig`.
6. Train the model using HyperDrive on the compute cluster.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
# import Python standard libraries
import os


# import Python data science libraries
import numpy as np
import pandas as pd

# import azure specific libraries
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.26.0


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

In [2]:
from azureml.core import Workspace, Experiment

vrk_ids_ws = Workspace.from_config()
vrk_ids_exp = Experiment(workspace=vrk_ids_ws, name="vrk_ids_exp")

print('Workspace name: ' + vrk_ids_ws.name, 
      'Azure region: ' + vrk_ids_ws.location, 
      'Subscription id: ' + vrk_ids_ws.subscription_id, 
      'Resource group: ' + vrk_ids_ws.resource_group, sep = '\n')

run = vrk_ids_exp.start_logging()

Workspace name: quick-starts-ws-142375
Azure region: southcentralus
Subscription id: d4ad7261-832d-46b2-b093-22156001df5b
Resource group: aml-quickstarts-142375


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your run. 

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=vrk_ids_ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(vrk_ids_ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

In [4]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - azureml-defaults

Overwriting conda_dependencies.yml


In [5]:
from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='./conda_dependencies.yml')

## Hyperdrive Configuration


A `RandomForestClassifier` is used for binary classification of network traffic. Data reading, cleaning, transformation, and model training are implemented with pandas and scikit-learn in `NetworkdataClassifier.py`.

`NetworkdataClassifier.py` accepts `--criterion` as its first argument, which selects the function used to measure the quality of a split. The supported criteria are "gini" for Gini impurity and "entropy" for information gain.

`NetworkdataClassifier.py` accepts `--max_depth` as its second argument, the maximum depth of each tree. Values of `max_depth` are sampled using `choice(60, 90, 120)`. Small values correspond to shallow trees, and high values allow deeper, larger trees.
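The two command-line arguments described above could be parsed along the following lines. This is a minimal sketch, not the actual contents of `NetworkdataClassifier.py`: the function name `parse_args`, the default values, and the `model_params` dictionary are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hyperparameters that HyperDrive passes on the command line."""
    parser = argparse.ArgumentParser()
    # Split-quality criterion: "gini" or "entropy" (default is an assumption).
    parser.add_argument("--criterion", type=str, default="gini",
                        choices=["gini", "entropy"])
    # Maximum depth of each tree in the forest (default is an assumption).
    parser.add_argument("--max_depth", type=int, default=60)
    return parser.parse_args(argv)


# Simulate the command line of one HyperDrive child run.
args = parse_args(["--criterion", "entropy", "--max_depth", "90"])

# These keyword arguments would then be handed to the classifier, e.g.
# sklearn.ensemble.RandomForestClassifier(**model_params).
model_params = {"criterion": args.criterion, "max_depth": args.max_depth}
print(model_params)
```

Declaring `--max_depth` with `type=int` matters because HyperDrive passes every value as a string on the command line.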

A random forest classifier is selected because it can handle categorical and real-valued features with little to no preprocessing. In this case we have many features, and it is not clear from the data whether the classes are linearly separable, so a random forest classifier is a reasonable choice.

Grid parameter sampling is chosen to select values from the hyperparameter space. In this scenario the search space is small and every hyperparameter is discrete, so grid sampling, which performs a simple grid search over all possible value combinations, is practical.
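To make the size of the grid concrete, the combinations can be enumerated with the standard library. This is only an illustration of what a grid search covers, not how HyperDrive implements it internally; the `search_space` dictionary is a plain-Python stand-in for the `choice(...)` expressions used in this notebook.

```python
from itertools import product

# Discrete search space, mirroring the choice(...) expressions in this notebook.
search_space = {
    "--criterion": ["gini", "entropy"],
    "--max_depth": [60, 90, 120],
}

# Cartesian product of all value lists: one dict per hyperparameter combination.
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

for combo in grid:
    print(combo)
print("Number of combinations:", len(grid))
```

With two criteria and three depths the grid has only six points, which is why an exhaustive search is affordable here.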

Specifying an early termination policy automatically terminates poorly performing runs, which improves computational efficiency. The Bandit policy stops a run when its primary metric is not within the specified slack of the best run so far, which avoids wasting resources. Median stopping is an early termination policy based on running averages of the primary metric reported by the runs: it computes running averages across all training runs and terminates runs whose primary metric is worse than the median of those averages. I have chosen the Bandit policy for aggressive termination, whereas median stopping can be used when less aggressive termination is preferred.
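The two stopping rules can be sketched in a few lines of plain Python. This is a simplified illustration for a metric that is being maximized, not the Azure ML implementation. The Bandit threshold `best_metric / (1 + slack_factor)` reflects my reading of the Azure ML documentation, and the function names and accuracy values are hypothetical.

```python
from statistics import median


def bandit_should_stop(best_metric, run_metric, slack_factor):
    """Bandit rule for a maximized metric.

    A run is stopped when its metric falls below best_metric / (1 + slack_factor).
    """
    return run_metric < best_metric / (1 + slack_factor)


def median_should_stop(run_metric, running_averages):
    """Median stopping rule for a maximized metric.

    A run is stopped when its metric is worse than the median of the
    running averages reported by all runs.
    """
    return run_metric < median(running_averages)


# Hypothetical accuracies at one evaluation interval; 0.99 / 1.1 is about 0.9.
best_so_far = 0.99
print(bandit_should_stop(best_so_far, 0.95, slack_factor=0.1))  # above threshold
print(bandit_should_stop(best_so_far, 0.85, slack_factor=0.1))  # below threshold
print(median_should_stop(0.85, [0.80, 0.90, 0.95]))             # below median 0.90
```

In the actual policy, `evaluation_interval` and `delay_evaluation` additionally control how often and how early these checks are applied.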

In the run recorded below, the best parameter values are `gini` for criterion and 90 for max_depth, and the best accuracy achieved is about 0.9987.

In [6]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import GridParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform,choice
from azureml.core import ScriptRunConfig
import os

# Specify parameter sampler
param_sampling_decision_tree = GridParameterSampling( {
        "--criterion": choice('gini', 'entropy'),
        "--max_depth": choice(60, 90, 120)
    }
)

# Specify a Policy
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

# Create a ScriptRunConfig for the training script NetworkdataClassifier.py
ids_script = ScriptRunConfig(source_directory='.',
                      script='NetworkdataClassifier.py',
                      compute_target=cpu_cluster,
                      environment=sklearn_env)

# Create a HyperDriveConfig using the run configuration, hyperparameter sampler, and policy.
ids_hyperdrive_config = HyperDriveConfig(run_config=ids_script,
                                     hyperparameter_sampling=param_sampling_decision_tree,
                                     policy=early_termination_policy ,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=12,
                                     max_concurrent_runs=4)

In [7]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
ids_hyperdrive_run = vrk_ids_exp.submit(ids_hyperdrive_config)

## Run Details

In the cell below, use the `RunDetails` widget to show the different experiments.

In [8]:
# Monitor hyper drive runs
from azureml.widgets import RunDetails
RunDetails(ids_hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [9]:
# wait for completion
ids_hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_7687baef-6a88-4b85-8d75-88219e467a20
Web View: https://ml.azure.com/runs/HD_7687baef-6a88-4b85-8d75-88219e467a20?wsid=/subscriptions/d4ad7261-832d-46b2-b093-22156001df5b/resourcegroups/aml-quickstarts-142375/workspaces/quick-starts-ws-142375&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-04-11T07:01:24.627420][API][INFO]Experiment created<END>\n""<START>[2021-04-11T07:01:25.746191][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n""<START>[2021-04-11T07:01:25.444040][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n"

Execution Summary
RunId: HD_7687baef-6a88-4b85-8d75-88219e467a20
Web View: https://ml.azure.com/runs/HD_7687baef-6a88-4b85-8d75-88219e467a20?wsid=/subscriptions/d4ad7261-832d-46b2-b093-22156001df5b/resourcegroups/aml-quickstarts-142375/workspaces/quick-starts-ws-142375&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254



{'runId': 'HD_7687baef-6a88-4b85-8d75-88219e467a20',
 'target': 'cpucluster',
 'status': 'Completed',
 'startTimeUtc': '2021-04-11T07:01:24.302416Z',
 'endTimeUtc': '2021-04-11T07:14:02.519954Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': 'cea7f3ed-9c9b-4203-bd3e-e15f48ac9214',
  'score': '0.9986981233925',
  'best_child_run_id': 'HD_7687baef-6a88-4b85-8d75-88219e467a20_2',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg142375.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_7687baef-6a88-4b85-8d75-88219e467a20/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=btTZ4%2BDViwolY%2FyDqcEl5MNFhMfwA4tE%2F6ZZ5Jd6jZU%3D&st=2021-04-11T07%3A04%3A16Z&se=2021-04-11T15%3A14%3A16Z&sp=r'},
 'submittedBy': 'ODL_User 142375'}

In [12]:
assert(ids_hyperdrive_run.get_status() == "Completed")

## Best Model

In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

get_best_run_by_primary_metric: Find and return the "Run" instance that corresponds to the best performing run amongst all child runs. The best performing run is identified solely based on the primary metric parameter specified in the HyperDriveConfig. The PrimaryMetricGoal governs whether the minimum or maximum of the primary metric is used.  Only one of the runs is returned, even if several of the Runs launched by this HyperDrive run reached the same best metric.

In [13]:
best_run = ids_hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']
print("\n Best run definition parameter values:", parameter_values)
print("\n ********************************************************")
print('\n Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])

best_run


 Best run definition parameter values: {'script': 'NetworkdataClassifier.py', 'command': '', 'useAbsolutePath': False, 'arguments': ['--criterion', 'gini', '--max_depth', '90'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'cpucluster', 'dataReferences': {}, 'data': {}, 'outputData': {}, 'jobName': None, 'maxRunDurationSeconds': 2592000, 'nodeCount': 1, 'priority': None, 'credentialPassthrough': False, 'identity': None, 'environment': {'name': 'sklearn-env', 'version': 'Autosave_2021-04-11T07:01:55Z_9f0258f4', 'python': {'interpreterPath': 'python', 'userManagedDependencies': False, 'condaDependencies': {'dependencies': ['python=3.6.2', 'scikit-learn', {'pip': ['azureml-defaults']}], 'name': 'azureml_59abd4256ad8e6688a4dc7593ce35cbc'}, 'baseCondaEnvironment': None}, 'environmentVariables': {'EXAMPLE_ENV_VAR': 'EXAMPLE_VALUE'}, 'docker': {'baseImage': 'mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20210301.v1', 'platform': {'os': 'Li

Experiment,Id,Type,Status,Details Page,Docs Page
vrk_ids_exp,HD_7687baef-6a88-4b85-8d75-88219e467a20_2,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation
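The run definition printed above is long, and the hyperparameters of interest sit in its `arguments` list as alternating flags and values. A small helper can turn that list into a dictionary. This is a sketch under the assumption that every flag is followed by exactly one value; `arguments_to_dict` is a hypothetical helper, not part of the Azure ML SDK.

```python
def arguments_to_dict(arguments):
    """Pair up ['--flag', 'value', ...] into {'flag': 'value', ...}."""
    flags = arguments[0::2]   # every even position holds a flag
    values = arguments[1::2]  # every odd position holds its value
    return {flag.lstrip("-"): value for flag, value in zip(flags, values)}


# The 'arguments' entry as printed in the run definition above.
arguments = ['--criterion', 'gini', '--max_depth', '90']
best_params = arguments_to_dict(arguments)
print(best_params)
```

In the notebook, the list would come from `best_run.get_details()['runDefinition']['arguments']`. The values stay strings, exactly as they were passed on the command line.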


In [16]:
# Retrieve and save your best HyperDrive model.

import joblib
#Save the best model
if "hyperdrive_bestmdl" not in os.listdir():
    os.mkdir("./hyperdrive_bestmdl")
    
vrk_ids_mdl = best_run.register_model(model_name='vrk_ids_mdl', model_path='outputs')



## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.


The steps for model deployment are as follows:
1. Register the model for operationalization.
2. Prepare an entry script.
3. Prepare an inference configuration.
4. Choose a compute target.
5. Deploy the model to the compute target.
6. Test the resulting webservice.


#### Step1: Register Model: 
Register a model for operationalization.

register_model(model_name, model_path=None, tags=None, properties=None, model_framework=None, model_framework_version=None, description=None, datasets=None, sample_input_dataset=None, sample_output_dataset=None, resource_configuration=None, **kwargs)

All of the above are input parameters of the function. Here `model_path` points to the `outputs` folder of the best run, where the best model is stored in the file "outputs/vrk_ids_model.joblib".

In [12]:
vrk_ids_mdl = best_run.register_model(model_name='vrk_ids_mdl', model_path='outputs')

#### Step2: Prepare an entry script: 

An inference configuration describes how to set up the web-service containing your model. It's used later, when you deploy the model. The entry script receives data submitted to a deployed web service and passes it to the model. It then takes the response returned by the model and returns that to the client. The script is specific to your model. It must understand the data that the model expects and returns.

The two things you need to accomplish in your entry script are:

1. Loading your model (using a function called `init()`)
2. Running your model on input data (using a function called `run()`)

Here I have to convert the received data to the input expected by the model. For the conversion of categorical variables I use the data stored in the file "ids_feature_details.json" during training, and the scaler object created during training and stored in the file "ids_cont_scalerobj.pkl" is applied to the same data.
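The column alignment at the heart of this conversion can be illustrated without pandas for a single record. This is a dependency-free sketch of the idea, not the code used in `score.py`: `encode_record` and the shortened feature lists are hypothetical, and the scaling step is left out.

```python
def encode_record(record, symbolic_features, trained_columns):
    """One-hot encode one record and align it with the training columns."""
    encoded = {}
    for name, value in record.items():
        if name in symbolic_features:
            # Same naming scheme as pandas.get_dummies: "<feature>_<value>".
            encoded[f"{name}_{value}"] = 1
        else:
            encoded[name] = value
    # Columns absent from this record default to 0, and the output follows
    # the column order seen during training.
    return [encoded.get(column, 0) for column in trained_columns]


# Hypothetical, heavily shortened feature lists for illustration only.
symbolic_features = ["protocol_type", "flag"]
trained_columns = ["duration", "protocol_type_tcp", "protocol_type_udp",
                   "flag_REJ", "flag_SF"]

record = {"duration": 0, "protocol_type": "tcp", "flag": "REJ"}
print(encode_record(record, symbolic_features, trained_columns))
```

A single request only contains one value per categorical feature, so most of the one-hot columns learned during training have to be filled in with zeros before the model can be applied.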

In [79]:
%%writefile score.py

import os
import pandas as pd
import json
import pickle
import logging 
import joblib



def init():
    global deploy_model
    global read_dict
    global standard_scaler
    
    #Get the path where the deployed model can be found
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'outputs')
    print("Model path ", model_path)
    #load models
    deploy_model = joblib.load(model_path + '/vrk_ids_model.joblib')
    
    #load column names
    with open(model_path +'/ids_feature_details.json', 'r') as filehandle:
        read_dict = json.load(filehandle)
    
    #load scaler object which is trained with train data
    standard_scaler = pickle.load(open(model_path + '/ids_cont_scalerobj.pkl', 'rb'))
    
    
def transform_test_data(input_test_data):
    
    # in dictionary keys are network_data_column_names, continious_features, symbolic_features, and
    # trained_model_column_names
    #print("Input test data shape ", input_test_data.shape)
     
    network_data_column_names_orig = read_dict['orig_network_data_column_names']
    continious_features            = read_dict['continious_features']
    symbolic_features              = read_dict['symbolic_names']
    trained_model_column_names     = read_dict['trained_model_column_names']
    
    #print("continious_features ", continious_features)
    #print("symbolic_features ", symbolic_features)
    #print("trained_model_column_names ", trained_model_column_names)
    
    
    # for this project we don't use 'success_pred' and we are predicting the 'attack_type' so remove 'attack_type'
    # data.columns = set(network_data_column_names_orig) - set(['attack_type', 'success_pred'])
    input_test_data = pd.get_dummies(input_test_data, columns=symbolic_features)
    
    #print("Input test data shape after get dummies ", input_test_data.shape)
    
    # Get missing columns in the input test data
    missing_cols = set( trained_model_column_names ) - set( input_test_data.columns )
    # Add a missing column in test set with default value equal to 0
    for c in missing_cols:
        input_test_data[c] = 0
    # Ensure the order of column in the test set is in the same order that in train set
    input_test_data = input_test_data[trained_model_column_names]
    
    #print("Input test data shape added get dummies ", input_test_data.shape)
        
    input_test_data[continious_features] = standard_scaler.transform(input_test_data[continious_features])
    
    #print("Input test data shape after apply scalar ", input_test_data.shape)
    
    return input_test_data

def run(data):
    try:
        temp = json.loads(data)
        data = pd.DataFrame(temp['data'])
        transformed_test_data = transform_test_data(data)
        result = deploy_model.predict(transformed_test_data)
        print("Result is ", result)
        # You can return any data type, as long as it is JSON serializable.
        return result.tolist()
    except Exception as e:
        error = str(e)
        print("Error occurred ", error)
        return error

Overwriting score.py


#### Step3: Prepare an inference configuration: 

An inference configuration describes how to set up the web service containing your model. It's used later, when you deploy the model. Here we are choosing Azure Container Instances (ACI) as the compute target, and the model is deployed using the `deploy` API of the `Model` class.


In [80]:
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.model import Model



inference_config = InferenceConfig(entry_script='score.py', environment=sklearn_env)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4, enable_app_insights=True)
networkd_ids_service = Model.deploy(vrk_ids_ws, "vrk-ids-svc", [vrk_ids_mdl], inference_config, deployment_config)
networkd_ids_service.wait_for_deployment(show_output = True)

print(networkd_ids_service.state)
print(networkd_ids_service.scoring_uri)
print(networkd_ids_service.swagger_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-04-06 11:22:05+00:00 Creating Container Registry if not exists.
2021-04-06 11:22:06+00:00 Registering the environment.
2021-04-06 11:22:07+00:00 Use the existing image.
2021-04-06 11:22:07+00:00 Generating deployment configuration.
2021-04-06 11:22:08+00:00 Submitting deployment to compute.
2021-04-06 11:22:11+00:00 Checking the status of deployment vrk-ids-svc..
2021-04-06 11:24:45+00:00 Checking the status of inference endpoint vrk-ids-svc.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
http://71ce3807-22cd-482d-8e4a-63a5d13c8f0b.southcentralus.azurecontainer.io/score
http://71ce3807-22cd-482d-8e4a-63a5d13c8f0b.southcentralus.azurecontainer.io/swagger.json


In the cell below, send a request to the web service you deployed to test it.

In [81]:
import requests
import json

# URL for the web service
scoring_uri = 'http://71ce3807-22cd-482d-8e4a-63a5d13c8f0b.southcentralus.azurecontainer.io/score'

# Set the content type
headers = {'Content-Type': 'application/json'}

# One record to score, so we get one result back
data = {"data":
        [{
            "duration": 0,
            "protocol_type": "tcp",
            "service": "http",
            "flag": "REJ",
            "src_bytes": 0,
            "dst_bytes": 0,
            "land": 0,
            "wrong_fragment": 0,
            "urgent": 0,
            "hot": 0,
            "num_failed_logins": 0,
            "logged_in": 0,
            "num_compromised": 0,
            "root_shell": 0,
            "su_attempted": 0,
            "num_root": 0,
            "num_file_creations": 0,
            "num_shells": 0,
            "num_access_files": 0,
            "num_outbound_cmds": 0,
            "is_hot_login": 0,
            "is_guest_login": 0,
            "count": 0,
            "srv_count": 0,
            "serror_rate": 0,
            "srv_serror_rate": 0,
            "rerror_rate": 0,
            "srv_rerror_rate": 0,
            "same_srv_rate": 0,
            "diff_srv_rate": 0,
            "srv_diff_host_rate": 0,
            "dst_host_count": 0,
            "dst_host_srv_count": 0,
            "dst_host_same_srv_rate": 0,
            "dst_host_diff_srv_rate": 0,
            "dst_host_same_src_port_rate": 0,
            "dst_host_srv_diff_host_rate": 0,
            "dst_host_serror_rate": 0,
            "dst_host_srv_serror_rate": 0,
            "dst_host_rerror_rate": 0,
            "dst_host_srv_rerror_rate": 0 }
        ]
    }
# Convert to JSON string
input_data = json.dumps(data)

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)

print("Response Code : ", resp.status_code)
print("Predicted Value : ",resp.json())

Response Code :  200
Predicted Value :  [0]


In [None]:
# Web Service Logs
print(networkd_ids_service.get_logs())

In [None]:
# Delete the service
networkd_ids_service.delete()

### The cells below were used for debugging.
I left them for future reference.

In [76]:

import os
import pandas as pd
import json
import pickle
import logging 
import joblib

def transform_test_data(input_test_data):
    
    # in dictionary keys are network_data_column_names, continious_features, symbolic_features, and
    # trained_model_column_names
    print("Input test data shape ", input_test_data.shape)
    
    
    #load models
    deploy_model = joblib.load('./outputs/vrk_ids_model.joblib')
    
    #load column names
    with open('./outputs/ids_feature_details.json', 'r') as filehandle:
        read_dict = json.load(filehandle)
    
    #print(read_dict)
    
    #load scaler object which is trained with train data
    standard_scaler = pickle.load(open('./outputs/ids_cont_scalerobj.pkl', 'rb'))
    
    network_data_column_names_orig = read_dict['orig_network_data_column_names']
    continious_features            = read_dict['continious_features']
    symbolic_features              = read_dict['symbolic_names']
    trained_model_column_names     = read_dict['trained_model_column_names']
    
    #print("continious_features ", continious_features)
    #print("symbolic_features ", symbolic_features)
    #print("trained_model_column_names ", trained_model_column_names)
    
    
    # for this project we don't use 'success_pred' and we are predicting the 'attack_type' so remove 'attack_type'
    # data.columns = set(network_data_column_names_orig) - set(['attack_type', 'success_pred'])
    input_test_data = pd.get_dummies(input_test_data, columns=symbolic_features)
    
    print("Input test data shape after get dummies ", input_test_data.shape)
    
    # Get missing columns in the input test data
    missing_cols = set( trained_model_column_names ) - set( input_test_data.columns )
    # Add a missing column in test set with default value equal to 0
    for c in missing_cols:
        input_test_data[c] = 0
    # Ensure the order of column in the test set is in the same order that in train set
    input_test_data = input_test_data[trained_model_column_names]
    
    print("Input test data shape after get dummies ", input_test_data.shape)
        
    input_test_data[continious_features] = standard_scaler.transform(input_test_data[continious_features])
    
    print("Input test data shape after apply scalar ", input_test_data.shape)
    
    return deploy_model.predict(input_test_data)
    




In [77]:
test_data = [[0,'tcp','private','S0',0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.00,1.00,0.00,0.00,0.05,0.07,0.00,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00]]

network_data_column_names = [ 
                  'duration', 'protocol_type', 'service',
                  'flag', 'src_bytes', 'dst_bytes',
                  'land', 'wrong_fragment', 'urgent',
    
            
                  'hot', 'num_failed_logins', 'logged_in',
                  'num_compromised', 'root_shell', 'su_attempted',
                  'num_root', 'num_file_creations', 'num_shells',
                  'num_access_files', 'num_outbound_cmds', 'is_hot_login',
                  'is_guest_login',
    
                 
                  'count', 'srv_count', 'serror_rate',
                  'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
                  'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
                 
                  'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
                  'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
                  'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                  'dst_host_srv_rerror_rate'
    
                    ]
transform_test_data(pd.DataFrame(test_data, columns = network_data_column_names ))

Input test data shape  (1, 41)
Input test data shape after get dummies  (1, 41)
Input test data shape after get dummies  (1, 127)
Input test data shape after apply scalar  (1, 127)


array([1])
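A list-style record like the one used in the debugging cell can also be turned into the JSON payload that the web service expects, since `run()` in `score.py` builds a DataFrame from the `data` entry. The following is a minimal sketch: the four-column `column_names` and `row` are shortened, hypothetical stand-ins for the full `network_data_column_names` and `test_data[0]`.

```python
import json

# Hypothetical, shortened column list and record for illustration only.
column_names = ["duration", "protocol_type", "service", "flag"]
row = [0, "tcp", "private", "S0"]

# Pair each column name with its value, then wrap it in the {"data": [...]} envelope.
payload = {"data": [dict(zip(column_names, row))]}
input_data = json.dumps(payload)
print(input_data)
```

The resulting string has the same shape as the `input_data` posted to the scoring URI earlier in this notebook.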