# Automated ML

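This notebook runs the Automated ML (AutoML) experiment for the Microsoft Malware Prediction project: it trains models on a compute cluster, registers the best model, and deploys it as a web service.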

## Import Dependencies

In [1]:
from azureml.core import Workspace, Experiment, Dataset, Environment
from azureml.data.dataset_factory import TabularDatasetFactory as tdf
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.model import Model
import requests
import json

## Set Up

### Load Workspace Elements

In [2]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: quick-starts-ws-133879
Azure region: southcentralus
Subscription id: 6971f5ac-8af1-446e-8034-05acea24681f
Resource group: aml-quickstarts-133879


### Create Experiment

In [3]:
# Choose a name for the experiment and a local folder for AutoML artifacts
experiment_name = 'MS-Malware'
project_folder = './automl'

experiment = Experiment(ws, experiment_name)

### Create/Get Compute Cluster

In [4]:
cpu_cluster_name = "malware-compute"

# If the cluster already exists, reuse it; otherwise provision a new one
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Cluster {} exists. Will use this cluster.'.format(cpu_cluster_name))
except ComputeTargetException:
    # Provision a cluster that can scale up to 4 STANDARD_D2_V2 nodes
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

    # Block until provisioning has finished
    cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview

This data set was taken from the [Microsoft Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction) competition run in 2019. In the competition's words: "Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences."

The raw training data set for the competition is very large: 8,921,483 observations (rows), and 83 features (variables), taking up 4.1 GB of space! 

To complete the tasks for this assignment in a reasonable amount of time, we used a very small subset of the full train.csv data set: the first 10,000 rows, or about 0.1% of the entire training set. The subset is on our GitHub at https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv.
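The subsetting step itself is not shown in this notebook. As a minimal sketch of how the first rows of a large CSV can be copied without loading the whole file into memory, the snippet below streams the header plus a fixed number of data rows. The helper name `take_head` and the tiny in-memory file are hypothetical stand-ins, not the code used to build `train_1_10k.csv`.

```python
import csv
import io
from itertools import islice


def take_head(src, dst, n_rows):
    """Copy the header plus the first n_rows data rows from src to dst."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # header row
    written = 0
    for row in islice(reader, n_rows):  # stops early without reading the rest
        writer.writerow(row)
        written += 1
    return written


# Tiny in-memory stand-in for train.csv (made-up values, for illustration only)
src = io.StringIO("MachineIdentifier,HasDetections\na,0\nb,1\nc,1\nd,0\n")
dst = io.StringIO()
n_written = take_head(src, dst, n_rows=3)
print(n_written)
```

With pandas, `pd.read_csv(path, nrows=10000)` achieves the same result in one call.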

### Goal

The goal for this project is to predict whether a Windows machine is infected by various families of malware, based on different properties of that machine. Each row in this dataset corresponds to a machine and holds observations from telemetry data generated by Windows Defender. The `HasDetections` column indicates whether malware was detected on the machine.

In our 10,000-row data set, 4,950 machines have no malware detections and 5,050 machines do, so the two classes are nearly balanced (49.5% vs. 50.5%).
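To put later accuracy figures in context, the class counts quoted above give the accuracy of a trivial classifier that always predicts the majority class. The sketch below only does that arithmetic; the variable names are illustrative.

```python
# HasDetections value -> number of machines, taken from the counts quoted above
counts = {0: 4950, 1: 5050}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class scores this accuracy
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(shares)
print(majority_label, baseline_accuracy)
```

Any model worth deploying should beat this baseline of roughly 50.5%.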


In [5]:
# Load Data set
data_path = 'https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv'
dataset = tdf.from_delimited_files(path=data_path)
dataset.to_pandas_dataframe().head()

Unnamed: 0,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,...,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,0,7,0,,53447.0,1.0,...,36144,0,,0,0,0,0.0,0,10,0
1,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,0,7,0,,53447.0,1.0,...,57858,0,,0,0,0,0.0,0,8,0
2,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,0,7,0,,53447.0,1.0,...,52682,0,,0,0,0,0.0,0,3,0
3,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,0,7,0,,53447.0,1.0,...,20050,0,,0,0,0,0.0,0,3,1
4,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,0,7,0,,53447.0,1.0,...,19844,0,0.0,0,0,0,0.0,0,1,1


## AutoML Configuration

Below are the settings and configuration for the AutoML run. Some are self-explanatory; we chose the others for the following reasons:

  * `experiment_timeout_minutes = 30` -- Thirty minutes kept the run short enough that our Udacity-provided lab would not time out, and it proved sufficient to get through the training.
  * `max_concurrent_iterations = 5` -- Runs several iterations in parallel to speed up processing.
  * `primary_metric = 'accuracy'` -- AUC would have been a good choice too; we chose accuracy mainly so the results are simple to compare with the Hyperdrive run.
  * `task = 'classification'` -- We are classifying whether a machine is infected or not, so this is the natural choice.
  * `label_column_name = 'HasDetections'` -- This is the "target" column.
  * `enable_early_stopping = True` -- Ends the run early once the score is no longer improving.
  * `n_cross_validations = 5` -- Uses 5-fold cross-validation, which is especially important with a small data set with high cardinality.
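To make the effect of `n_cross_validations = 5` concrete, here is a minimal sketch of how a plain k-fold partition divides 10,000 rows. It assumes contiguous, unshuffled folds for simplicity; AutoML's internal splitting is not shown in this notebook and may differ, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) index lists for contiguous k-fold splits."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional row when sizes are uneven
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, valid
        start = stop


folds = list(kfold_indices(10_000, 5))
print([(len(train), len(valid)) for train, valid in folds])
```

Each model is therefore trained on 8,000 rows and validated on the remaining 2,000, five times over, and every row is used for validation exactly once.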

In [6]:
# AutoML settings: time budget, parallelism, and the metric to optimize
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": 'accuracy'
}

# AutoML configuration: task, data, target column, compute, and validation
automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    task="classification",
    training_data=dataset,
    label_column_name='HasDetections',
    enable_early_stopping=True,
    path=project_folder,
    debug_log='automl_errors.log',
    n_cross_validations=5,
    **automl_settings
)
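One detail worth checking: the cluster above was provisioned with `max_nodes=4`, while `max_concurrent_iterations` is 5. To our understanding of the `AutoMLConfig` documentation, an AmlCompute cluster runs one iteration per node, so the concurrency setting should not exceed the cluster's maximum node count. The sketch below shows one way to enforce that before building the config; `cap_concurrency` is a hypothetical helper, not part of the Azure ML SDK.

```python
def cap_concurrency(requested, max_nodes):
    """Return the requested concurrency, limited to the cluster's node count."""
    if requested < 1 or max_nodes < 1:
        raise ValueError("requested and max_nodes must both be at least 1")
    return min(requested, max_nodes)


# Values copied from the cells above; `settings` is a separate dict for illustration
cluster_max_nodes = 4
settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": "accuracy",
}
settings["max_concurrent_iterations"] = cap_concurrency(
    settings["max_concurrent_iterations"], cluster_max_nodes
)
print(settings["max_concurrent_iterations"])
```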

In [7]:
# Submit experiment
remote_run = experiment.submit(config=automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on malware-compute with default configuration
Running on remote compute: malware-compute
Parent Run ID: AutoML_88b5306c-678b-4281-a4ae-9eabc455b96f

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:    

## Run Details

The `RunDetails` widget in the cell below shows the child runs of the experiment and their metrics as the run progresses.

In [8]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model


In [9]:
best_run = remote_run.get_best_child()
best_run.get_metrics()

{'precision_score_micro': 0.6377999999999999,
 'matthews_correlation': 0.27571316445343663,
 'AUC_macro': 0.6867168194972465,
 'average_precision_score_micro': 0.6828636606654682,
 'precision_score_macro': 0.6378562048728768,
 'AUC_micro': 0.687188,
 'balanced_accuracy': 0.6378569707691047,
 'f1_score_micro': 0.6377999999999999,
 'f1_score_weighted': 0.6378016067792024,
 'f1_score_macro': 0.6377532909788866,
 'average_precision_score_weighted': 0.6792550436815201,
 'precision_score_weighted': 0.6380097722708168,
 'accuracy': 0.6377999999999999,
 'average_precision_score_macro': 0.6789531018594834,
 'recall_score_macro': 0.6378569707691047,
 'weighted_accuracy': 0.6377430931228403,
 'recall_score_micro': 0.6377999999999999,
 'log_loss': 0.6438297338167525,
 'norm_macro_recall': 0.2757139415382094,
 'AUC_weighted': 0.6867168194972465,
 'recall_score_weighted': 0.6377999999999999,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_88b5306c-678b-4281-a4ae-9eabc455b96f_38/accura
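Several of these metrics are related. As a worked check, `norm_macro_recall` is macro-averaged recall rescaled so that chance-level performance maps to 0 and perfect recall maps to 1. The sketch below assumes the chance level is 1/C for C classes (0.5 for this binary problem) and reproduces the reported value from `recall_score_macro`.

```python
# Values copied from the metrics output above
recall_score_macro = 0.6378569707691047
reported_norm_macro_recall = 0.2757139415382094

n_classes = 2
chance_recall = 1 / n_classes  # expected macro recall of random predictions

# Rescale so that chance -> 0 and perfect recall -> 1
norm_macro_recall = (recall_score_macro - chance_recall) / (1 - chance_recall)
print(norm_macro_recall)
```

A value of about 0.28 says the ensemble recovers a little over a quarter of the gap between random guessing and perfect recall.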

In [10]:
best_run.properties

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'accuracy\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'MS-Malware\',\'compute_target\':\'malware-compute\',\'subscription_id\':\'6971f5ac-8af1-446e-8034-05acea24681f\',\'region\':\'southcentralus\',\'spark_service\':None}","ensemble_run_id":"AutoML_88b5306c-678b-4281-a4ae-9eabc455b96f_38","experiment_name":"MS-Malware","workspace_name":"quick-starts-ws-133879","subscription_id":"6971f5ac-8af1-446e-8034-05acea24681f","resource_group_name":"aml-quickstarts-133879"}}]}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '38',
 '_aml_system_scenario_identification': 'Remote.Child',
 '_azureml.ComputeTargetType':

## Model Deployment

Below we register the model, create an inference config and deploy the model as a web service.

### Register Model

In [11]:
model = best_run.register_model(model_name = 'best_automl', model_path='outputs/model.pkl')
print(model.name, model.id, model.version, sep='\t')

best_automl	best_automl:1	1


### Create Inference and Deployment Config

In [12]:
azml_env = Environment.get(workspace=ws, name="AzureML-AutoML")
inference_config = InferenceConfig(entry_script='automl_scoring.py', environment=azml_env)
#deployment_config = LocalWebservice.deploy_configuration()
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, enable_app_insights=True, auth_enabled=True)
model = Model(ws, name='best_automl')
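The entry script `automl_scoring.py` referenced above is not shown in this notebook. Azure ML entry scripts define an `init()` function, called once when the container starts, and a `run()` function, called for each request. The following is a minimal sketch of that shape, not the actual script: `_StubModel` is a hypothetical placeholder for the unpickled AutoML model so that the sketch runs without Azure, and the `{"data": [...]}` payload shape is taken from the test request later in this notebook.

```python
import json


class _StubModel:
    """Hypothetical stand-in for the deserialized model; always predicts 0."""

    def predict(self, rows):
        return [0 for _ in rows]


model = None


def init():
    # A real entry script would load the registered model file here,
    # for example from the directory named by the AZUREML_MODEL_DIR variable.
    global model
    model = _StubModel()


def run(raw_data):
    # Parse the JSON request body and return JSON-serializable predictions
    try:
        rows = json.loads(raw_data)["data"]
        return list(model.predict(rows))
    except Exception as exc:
        # Report the failure to the caller instead of crashing the service
        return {"error": str(exc)}


init()
print(run('{"data": [{"IsBeta": 0}, {"IsBeta": 1}]}'))
```

Returning the error message from `run()` makes failed requests easier to diagnose alongside the service logs.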

### Deploy as Web Service

In [13]:
service = Model.deploy(workspace=ws, 
                       name = 'automl-endpoint', 
                       models = [model], 
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True
                      )
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running..............
Succeeded
ACI service creation operation finished, operation "Succeeded"


### Test Web Service

In [23]:
print("Service state: {}".format(service.state))
print("Scoring URI: {}".format( service.scoring_uri))

Service state: Healthy
Scoring URI: http://e7225a37-1614-44ae-bf46-8c35c575309d.southcentralus.azurecontainer.io/score


In [15]:
#service.get_keys()

In [24]:
# Load a small JSON payload of labelled test observations
test_data_path = 'https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/test_data.json'

r = requests.get(test_data_path)
input_data = r.text

In [25]:
input_data

'{"data":[{"MachineIdentifier":"02e125f54e5e4aefc7a42cae452fe9b2","ProductName":"win8defender","EngineVersion":"1.1.15200.1","AppVersion":"4.10.14393.1613","AvSigVersion":"1.275.46.0","IsBeta":0,"RtpStateBitfield":7,"IsSxsPassiveMode":0,"DefaultBrowsersIdentifier":null,"AVProductStatesIdentifier":53447,"AVProductsInstalled":1,"AVProductsEnabled":1,"HasTpm":1,"CountryIdentifier":50,"CityIdentifier":105713.0,"OrganizationIdentifier":null,"GeoNameIdentifier":68,"LocaleEnglishNameIdentifier":51,"Platform":"windows10","Processor":"x64","OsVer":"10.0.0.0","OsBuild":14393,"OsSuite":768,"OsPlatformSubRelease":"rs1","OsBuildLab":"14393.1770.amd64fre.rs1_release.170917-1700","SkuEdition":"Home","IsProtected":1,"AutoSampleOptIn":0,"PuaMode":null,"SMode":0.0,"IeVerIdentifier":96,"SmartScreen":"RequireAdmin","Firewall":1,"UacLuaenable":1,"Census_MDC2FormFactor":"Notebook","Census_DeviceFamily":"Windows.Desktop","Census_OEMNameIdentifier":2668,"Census_OEMModelIdentifier":171230,"Census_ProcessorCore

In [26]:
# Set the content type
headers = {'Content-Type': 'application/json'}
# Key authentication is enabled (auth_enabled=True), so send the primary key as a bearer token
headers['Authorization'] = f'Bearer {service.get_keys()[0]}'

# Make the request and display the response
resp = requests.post(service.scoring_uri, data=input_data, headers=headers)
resp.json()

[0, 1, 0, 1, 0]

In [27]:
# Compare with the actual labels in the test payload
[obs['HasDetections'] for obs in json.loads(input_data)['data']]

[0, 0, 1, 1, 0]

The service predicted 3 of the 5 labels correctly, so accuracy on this small sample is 60%.
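The same figure can be computed rather than counted by hand. This short sketch uses the two lists printed above.

```python
predicted = [0, 1, 0, 1, 0]  # response from the web service
actual = [0, 0, 1, 1, 0]     # HasDetections values in the test payload

# Count positions where the prediction matches the true label
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(f"{matches}/{len(actual)} correct, accuracy = {accuracy:.0%}")
```

Five observations are far too few to judge the model; the cross-validated accuracy reported for the best run is the more reliable estimate.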

### Web Service Logs

In [28]:
service.get_logs()

'2021-01-07T19:57:30,539017900+00:00 - iot-server/run \n2021-01-07T19:57:30,550874100+00:00 - rsyslog/run \n2021-01-07T19:57:30,560457900+00:00 - nginx/run \n2021-01-07T19:57:30,564789700+00:00 - gunicorn/run \n/usr/sbin/nginx: /azureml-envs/azureml_8eff28b157f42edcd2424a5aae6c8074/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)\n/usr/sbin/nginx: /azureml-envs/azureml_8eff28b157f42edcd2424a5aae6c8074/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)\n/usr/sbin/nginx: /azureml-envs/azureml_8eff28b157f42edcd2424a5aae6c8074/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)\n/usr/sbin/nginx: /azureml-envs/azureml_8eff28b157f42edcd2424a5aae6c8074/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)\n/usr/sbin/nginx: /azureml-envs/azureml_8eff28b157f42edcd2424a5aae6c8074/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)

### Delete Service

In [29]:
service.delete()

## Shut Down Compute

In [30]:
cpu_cluster.delete()

In [36]:
# Confirm the deletion: once the cluster is gone, this call raises ComputeTargetException
cpu_cluster.get_status()

ComputeTargetException: ComputeTargetException:
	Message: ComputeTargetNotFound: Compute Target with name malware-compute not found in provided workspace
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "ComputeTargetNotFound: Compute Target with name malware-compute not found in provided workspace"
    }
}