# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

## Import Dependencies

In [2]:
from azureml.core import Workspace, Experiment, Dataset
from azureml.data.dataset_factory import TabularDatasetFactory as tdf
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.core.model import InferenceConfig
from azureml.core.webservice import LocalWebservice
from azureml.core.model import Model
import requests

## Set Up

### Load Workspace Elements

In [3]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: quick-starts-ws-133553
Azure region: southcentralus
Subscription id: 610d6e37-4747-4a20-80eb-3aad70a55f43
Resource group: aml-quickstarts-133553


### Create Experiment

In [4]:
# choose a name for experiment
experiment_name = 'MS-Malware'
project_folder = './automl'

experiment = Experiment(ws, experiment_name)

### Create Environment

In [43]:
%%writefile automl_dependencies.yml

dependencies:
    - python=3.6.2
    - scikit-learn
    - numpy
    - xgboost
    - pandas
    - pip:
        - azureml-defaults
        - azureml-train-automl

Overwriting automl_dependencies.yml


In [44]:
from azureml.core import Environment

automl_env = Environment.from_conda_specification(name='automl-env', file_path='./automl_dependencies.yml')

## Create/Get Compute Cluster

In [7]:
cpu_cluster_name = "malware-compute"

# if cluster already exists, use it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Cluster {} exists. Will use this cluster.'.format(cpu_cluster_name))
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
    cpu_cluster.wait_for_completion(show_output=True)

Cluster malware-compute exists. Will use this cluster.


## Dataset

### Overview

This data set comes from the [Microsoft Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction) competition, run on Kaggle in 2019. The competition description reads: "Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences."

The raw training data set for the competition is very large: 8,921,483 observations (rows) and 83 features (variables), taking up 4.1 GB of space.

To complete the tasks for this assignment in a reasonable amount of time, we use a very small subset of train.csv: the first 10,000 rows, or about 0.1% of the full training set. The subset is hosted on our GitHub at https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv.
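As an illustration of taking the top rows of a very large CSV without loading it all into memory, here is a minimal sketch using only the standard library. It is not necessarily how the subset above was produced. The function name `head_csv` and any file paths are hypothetical, and the line-based copy assumes that no quoted field contains an embedded newline.

```python
import itertools


def head_csv(src_path, dst_path, n_rows=10_000):
    """Copy the header line plus the first n_rows data lines of a CSV file."""
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # islice stops after n_rows + 1 lines, so the rest of the file is never read
        dst.writelines(itertools.islice(src, n_rows + 1))


# Share of the full training set kept by a 10,000-row sample
fraction = 10_000 / 8_921_483
print(f"{fraction:.2%}")  # 0.11%
```

A call such as `head_csv("train.csv", "train_1_10k.csv")` would write the header and the first 10,000 rows, assuming a local copy of the competition file.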

### Goal

The goal for this project is to predict whether a Windows machine is infected by various families of malware, based on different properties of that machine. Each row in this dataset corresponds to one machine and contains telemetry data generated by Windows Defender. The `HasDetections` column indicates whether malware was detected on the machine.

In our 10,000-row data set, 4,950 machines have no malware detections and 5,050 do, so the two classes are nearly balanced.
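The class counts above can be checked with a few lines of standard-library Python. This is a sketch only: the helper names `label_counts` and `majority_share` are hypothetical, and it assumes a local copy of the CSV with a `HasDetections` column.

```python
import csv
from collections import Counter


def label_counts(csv_path, label_column="HasDetections"):
    """Count how often each label value appears in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))


def majority_share(counts):
    """Fraction of rows that belong to the most common class."""
    return max(counts.values()) / sum(counts.values())


# Counts reported in the text for the 10,000-row subset
reported = Counter({"0": 4950, "1": 5050})
print(majority_share(reported))  # 0.505
```

The majority share is also the accuracy of a trivial model that always predicts the most common class, which makes it a useful baseline for any reported accuracy.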


In [8]:
# Load Data set
data_path = 'https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv'
dataset = tdf.from_delimited_files(path=data_path)
dataset.to_pandas_dataframe().head()

| Unnamed: 0 | ProductName | EngineVersion | AppVersion | AvSigVersion | IsBeta | RtpStateBitfield | IsSxsPassiveMode | DefaultBrowsersIdentifier | AVProductStatesIdentifier | AVProductsInstalled | ... | Census_FirmwareVersionIdentifier | Census_IsSecureBootEnabled | Census_IsWIMBootEnabled | Census_IsVirtualDevice | Census_IsTouchEnabled | Census_IsPenCapable | Census_IsAlwaysOnAlwaysConnectedCapable | Wdft_IsGamer | Wdft_RegionIdentifier | HasDetections |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1735.0 | 0 | 7 | 0 |  | 53447.0 | 1.0 | ... | 36144 | 0 |  | 0 | 0 | 0 | 0.0 | 0 | 10 | 0 |
| 1 | win8defender | 1.1.14600.4 | 4.13.17134.1 | 1.263.48.0 | 0 | 7 | 0 |  | 53447.0 | 1.0 | ... | 57858 | 0 |  | 0 | 0 | 0 | 0.0 | 0 | 8 | 0 |
| 2 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1341.0 | 0 | 7 | 0 |  | 53447.0 | 1.0 | ... | 52682 | 0 |  | 0 | 0 | 0 | 0.0 | 0 | 3 | 0 |
| 3 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1527.0 | 0 | 7 | 0 |  | 53447.0 | 1.0 | ... | 20050 | 0 |  | 0 | 0 | 0 | 0.0 | 0 | 3 | 1 |
| 4 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1379.0 | 0 | 7 | 0 |  | 53447.0 | 1.0 | ... | 19844 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 1 | 1 |


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [9]:
# TODO: Put your automl settings here
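automl_settings = {
    "experiment_timeout_minutes": 30,
    # One iteration runs per node, so keep this at or below the cluster's max_nodes (4);
    # a higher value only leaves the extra iterations queued.
    "max_concurrent_iterations": 4,
    "primary_metric": 'accuracy'
}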

# TODO: Put your automl config here
automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    task="classification",
    training_data=dataset,
    label_column_name='HasDetections',
    enable_early_stopping=True,
    path=project_folder,
    debug_log='automl_errors.log',
    n_cross_validations=5,
    **automl_settings
)
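With `n_cross_validations=5` on 10,000 rows, each fold holds out one fifth of the data for validation. The sketch below shows the index arithmetic for plain contiguous k-fold splitting. It is an illustration of the idea, not AutoML's internal splitter, and the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_rows, n_splits):
    """Yield (train_indices, validation_indices) for contiguous k-fold splits."""
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row when n_rows is not divisible
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = range(start, stop)
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, validation
        start = stop


for fold, (train, validation) in enumerate(kfold_indices(10_000, 5)):
    print(fold, len(train), len(validation))  # each fold: 8000 train, 2000 validation
```

Every row is used for validation exactly once, so the reported metrics are averages over five models that were each trained on 8,000 rows.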

In [10]:
# Submit experiment
remote_run = experiment.submit(config=automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on malware-compute with default configuration
Running on remote compute: malware-compute
Parent Run ID: AutoML_b41a0d8e-8a07-4e82-ac68-cae8152970b4

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:    

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [11]:
RunDetails(remote_run).show()


## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [12]:
best_run = remote_run.get_best_child()
best_run.get_metrics()

{'precision_score_macro': 0.6376449033869295,
 'precision_score_micro': 0.6376000000000001,
 'accuracy': 0.6376000000000001,
 'f1_score_micro': 0.6376000000000001,
 'recall_score_micro': 0.6376000000000001,
 'average_precision_score_macro': 0.6811340727295183,
 'recall_score_macro': 0.6376637619117124,
 'AUC_macro': 0.6891508554649647,
 'f1_score_macro': 0.6375678259506241,
 'balanced_accuracy': 0.6376637619117124,
 'f1_score_weighted': 0.6376129954389803,
 'log_loss': 0.6404037431875012,
 'weighted_accuracy': 0.6375363288373189,
 'AUC_micro': 0.6895131,
 'average_precision_score_micro': 0.6850607643945976,
 'AUC_weighted': 0.6891508554649647,
 'average_precision_score_weighted': 0.6814373682936801,
 'recall_score_weighted': 0.6376000000000001,
 'matthews_correlation': 0.2753086631258564,
 'norm_macro_recall': 0.2753275238234247,
 'precision_score_weighted': 0.6377989744319355,
 'confusion_matrix': 'aml://artifactId/ExperimentRun/dcid.AutoML_b41a0d8e-8a07-4e82-ac68-cae8152970b4_38/conf
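Several of these metrics are related. As a sketch, normalized macro recall rescales macro recall so that random guessing scores 0 and a perfect model scores 1, with the chance level taken as 1/C for C classes. The function name below is hypothetical. The two input values are copied from the output above.

```python
def normalized_macro_recall(recall_macro, n_classes):
    """Rescale macro recall so chance level maps to 0 and perfect recall to 1."""
    chance = 1.0 / n_classes
    return (recall_macro - chance) / (1.0 - chance)


recall_macro = 0.6376637619117124   # 'recall_score_macro' in the output above
reported_norm = 0.2753275238234247  # 'norm_macro_recall' in the output above

computed = normalized_macro_recall(recall_macro, n_classes=2)
print(abs(computed - reported_norm) < 1e-12)  # True
```

Read this way, the best model recovers about 28% of the gap between chance and perfect recall. Its accuracy of 0.6376 compares with 0.505 for always predicting the majority class of this subset.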

In [13]:
best_run.properties

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'accuracy\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'MS-Malware\',\'compute_target\':\'malware-compute\',\'subscription_id\':\'610d6e37-4747-4a20-80eb-3aad70a55f43\',\'region\':\'southcentralus\',\'spark_service\':None}","ensemble_run_id":"AutoML_b41a0d8e-8a07-4e82-ac68-cae8152970b4_38","experiment_name":"MS-Malware","workspace_name":"quick-starts-ws-133553","subscription_id":"610d6e37-4747-4a20-80eb-3aad70a55f43","resource_group_name":"aml-quickstarts-133553"}}]}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '38',
 '_aml_system_scenario_identification': 'Remote.Child',
 '_azureml.ComputeTargetType':

## Model Deployment

Remember that you only have to deploy one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [14]:
type(best_run)

azureml.core.run.Run

In [45]:
model = best_run.register_model(model_name = 'best_automl', model_path='outputs/model.pkl')
print(model.name, model.id, model.version, sep='\t')

best_automl	best_automl:4	4


In [46]:
inference_config = InferenceConfig(entry_script='automl_scoring.py', environment=automl_env)
deployment_config = LocalWebservice.deploy_configuration()
model = Model(ws, name='best_automl')

In [47]:
service = Model.deploy(ws, 'myservice', [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)

Downloading model best_automl:4 to /tmp/azureml_031xy3hl/best_automl/4
Generating Docker build context.
2021/01/05 22:17:17 Downloading source code...
2021/01/05 22:17:18 Finished downloading source code
2021/01/05 22:17:19 Creating Docker network: acb_default_network, driver: 'bridge'
2021/01/05 22:17:19 Successfully set up Docker network: acb_default_network
2021/01/05 22:17:19 Setting up Docker configuration...
2021/01/05 22:17:20 Successfully set up Docker configuration
2021/01/05 22:17:20 Logging in to registry: c3d5fe77879a4028b9565b67ead59c84.azurecr.io
2021/01/05 22:17:21 Successfully logged into c3d5fe77879a4028b9565b67ead59c84.azurecr.io
2021/01/05 22:17:21 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2021/01/05 22:17:21 Scanning for dependencies...
2021/01/05 22:17:22 Successfully scanned dependencies
2021/01/05 22:17:22 Launching container with name: acb_step_0
Sending build context to Docker daemon  64.51kB
Step 1

Error: Container has crashed. Did your init method fail?




Container Logs:
2021-01-05T22:25:41,109053825+00:00 - gunicorn/run 
2021-01-05T22:25:41,111827995+00:00 - rsyslog/run 
2021-01-05T22:25:41,115220858+00:00 - iot-server/run 
2021-01-05T22:25:41,116158748+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_29d6691f982540a8942a770f453f7d1d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29d6691f982540a8942a770f453f7d1d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29d6691f982540a8942a770f453f7d1d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29d6691f982540a8942a770f453f7d1d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29d6691f982540a8942a770f453f7d1d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbi

WebserviceException: WebserviceException:
	Message: Error: Container has crashed. Did your init method fail?
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Error: Container has crashed. Did your init method fail?"
    }
}
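The error message asks whether `init()` failed, so it may help to recall the general shape of an entry script. The following is a minimal sketch, not the `automl_scoring.py` used above, whose contents are not shown here. It assumes that the model file is named `model.pkl` inside the directory given by the `AZUREML_MODEL_DIR` environment variable, that `joblib` and `pandas` are installed in the environment, and that requests carry a hypothetical `{"data": [...]}` payload.

```python
import json
import os

model = None  # set by init(), reused by every call to run()


def init():
    """Called once when the service container starts."""
    global model
    # Imported here so that a missing package surfaces in the init() traceback
    import joblib

    # Assumption: AZUREML_MODEL_DIR points at the registered model's folder;
    # "." is only a fallback for running the script outside a container
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = joblib.load(os.path.join(model_dir, "model.pkl"))


def run(raw_data):
    """Called once per request; returns predictions or an error message."""
    try:
        import pandas as pd

        # Assumed payload shape: {"data": [{column: value, ...}, ...]}
        records = json.loads(raw_data)["data"]
        predictions = model.predict(pd.DataFrame(records))
        return {"result": [int(p) for p in predictions]}
    except Exception as exc:
        # Returning the message keeps the service alive and makes failures visible
        return {"error": str(exc)}
```

If `init()` raises, for example because a package needed to unpickle the model is missing from the environment, the container stops during startup. Calling `service.get_logs()` returns the container logs, which normally include the traceback.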

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
print("Service state: {}".format(service.state))
print("Scoring URI: {}".format(service.scoring_uri))

In [None]:
test_data_path = 'https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/test_data.json'

r = requests.get(test_data_path)
input_data = r.json()

In [None]:
# Set the content type
headers = {'Content-Type': 'application/json'}
# A LocalWebservice deployment does not use key authentication,
# so no Authorization header is set here.

# Send the payload as a JSON body and display the response
resp = requests.post(service.scoring_uri, json=input_data, headers=headers)
print(resp.json())
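A scoring endpoint of this kind generally expects a JSON body, so it matters how the dict loaded from the test file is serialized. As a standard-library sketch, the code below contrasts JSON serialization with form encoding, which is roughly what an HTTP client sends when it is handed a dict as form data. The payload shape and column names are hypothetical, since the real `test_data.json` may differ.

```python
import json
from urllib.parse import urlencode

# Hypothetical payload shape; the real test_data.json may differ
payload = {"data": [{"IsBeta": 0, "Wdft_IsGamer": 0}]}

json_body = json.dumps(payload)  # what a JSON endpoint can parse
form_body = urlencode(payload)   # form-encoded key=value text, not JSON

print(json_body)
print(form_body)
```

Only the first string can be parsed back with `json.loads`. The second is percent-encoded form data that a JSON-based scoring script would reject.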

TODO: In the cell below, print the logs of the web service and delete the service