# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

## Import Dependencies

In [10]:
from azureml.core import Workspace, Experiment, Dataset
from azureml.data.dataset_factory import TabularDatasetFactory as tdf
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig

## Load Workspace Elements

In [2]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'MS-Malware'

experiment = Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location,
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: quick-starts-ws-133396
Azure region: southcentralus
Subscription id: 6b4af8be-9931-443e-90f6-c4c34a1f9737
Resource group: aml-quickstarts-133396


## Create Compute Cluster

## Dataset

### Overview

This data set comes from the [Microsoft Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction) competition run in 2019. The competition description reads: "Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences."

The raw training data set for the competition is very large: 8,921,483 observations (rows) and 83 features (variables), taking up 4.1 GB of space.

To complete the tasks for this assignment in a reasonable amount of time, we used a very small subset of the full `train.csv` data set: the first 10,000 rows, or about 0.1% of the entire training set. The subset is on our GitHub at https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv.
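
The subsetting step can be sketched as follows. This is a minimal illustration using only the standard library, not necessarily how the subset was actually produced; the helper name `take_top_rows` and the in-memory demo rows are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def take_top_rows(src, dst, n_rows):
    """Copy the header plus the first `n_rows` data rows from `src` to `dst`.

    `src` and `dst` are open text file objects. Returns the number of data rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))          # keep the column names
    written = 0
    for row in islice(reader, n_rows):     # stops early, never reads the whole file
        writer.writerow(row)
        written += 1
    return written


# Tiny in-memory stand-in for train.csv (hypothetical rows, two real column names).
demo_src = io.StringIO("ProductName,HasDetections\nwin8defender,0\nwin8defender,1\nwin8defender,1\n")
demo_dst = io.StringIO()
rows_written = take_top_rows(demo_src, demo_dst, n_rows=2)

# Share of the full training set that a 10,000-row subset represents.
subset_fraction = 10_000 / 8_921_483
print(rows_written, f"{subset_fraction:.3%}")
```

For the real file, the same helper would be called with `open("train.csv", newline="")` as the source, an output file opened for writing as the destination, and `n_rows=10_000`.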

### Goal

The goal of this project is to predict whether a Windows machine is infected by various families of malware, based on different properties of that machine. Each row in this data set corresponds to one machine and holds telemetry data generated by Windows Defender. The `HasDetections` column indicates whether malware was detected on the machine.

In our 10,000-row data set, 4,950 machines (49.5%) have no malware detections and 5,050 machines (50.5%) have malware detections, so the two classes are nearly balanced.
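
Class balance matters when judging a classifier by accuracy, because always predicting the most common class already scores the majority share. The sketch below works that floor out from the counts quoted above; the variable names are illustrative.

```python
from collections import Counter

# Class counts reported above for the 10,000-row subset (0 = no detection, 1 = detection).
label_counts = Counter({0: 4_950, 1: 5_050})

total = sum(label_counts.values())
class_share = {label: count / total for label, count in label_counts.items()}

# A classifier that always predicts the most common class sets the accuracy floor.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(class_share)
print(majority_label, baseline_accuracy)
```

On the loaded data, the same counts could be obtained with `dataset.to_pandas_dataframe()['HasDetections'].value_counts()`.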


In [20]:
# Load Data set
data_path = 'https://raw.githubusercontent.com/tybyers/AZMLND_projects/capstone/capstone/data/train_1_10k.csv'
dataset = tdf.from_delimited_files(path=data_path)
dataset.to_pandas_dataframe().head()

Unnamed: 0,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,...,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,0,7,0,,53447.0,1.0,...,36144,0,,0,0,0,0.0,0,10,0
1,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,0,7,0,,53447.0,1.0,...,57858,0,,0,0,0,0.0,0,8,0
2,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,0,7,0,,53447.0,1.0,...,52682,0,,0,0,0,0.0,0,3,0
3,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,0,7,0,,53447.0,1.0,...,20050,0,,0,0,0,0.0,0,3,1
4,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,0,7,0,,53447.0,1.0,...,19844,0,0.0,0,0,0,0.0,0,1,1


## AutoML Configuration

TODO: Explain why you chose the AutoML settings and configuration you used below.

In [6]:
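# AutoML settings: 30-minute time budget, 5-fold cross-validation, optimize accuracy
automl_settings = {
    "experiment_timeout_minutes": 30,
    "primary_metric": "accuracy",
    "n_cross_validations": 5,
}

# AutoML config: classification on the `HasDetections` label.
# No compute target is passed, so the experiment runs on the local machine.
automl_config = AutoMLConfig(
    task="classification",
    training_data=dataset,
    label_column_name='HasDetections',
    **automl_settings)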

In [7]:
# Submit experiment
# Note: I had to `pip install` some dependencies for this step to work.
# Open a terminal and enter:
#   pip install -r /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/automl/core/validated_linux_requirements.txt
remote_run = experiment.submit(automl_config, show_output=True)

No run_configuration provided, running on local with default configuration
Running on local machine
Parent Run ID: AutoML_4fa5adf8-bcbc-4002-a1ad-9027cfebb3b0

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

**************************************************

Cannot serialize JSON, possibly due to NaN or Inf, scrubbing to zero and retrying...


Current status: RawFeaturesExplanations. Computation of raw features completed
Current status: BestRunExplainModel. Best run model explanations completed
****************************************************************************************************


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [11]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [16]:
best_run = remote_run.get_best_child()
best_run.get_metrics()

{'weighted_accuracy': 0.6362623565966207,
 'precision_score_micro': 0.6365000000000001,
 'recall_score_micro': 0.6365000000000001,
 'average_precision_score_macro': 0.6798433473962605,
 'AUC_weighted': 0.6875430512091427,
 'f1_score_micro': 0.6365000000000001,
 'f1_score_weighted': 0.6364377584217974,
 'average_precision_score_micro': 0.6836739222547339,
 'matthews_correlation': 0.2735714075347031,
 'AUC_micro': 0.68788105,
 'recall_score_macro': 0.6367379125794927,
 'log_loss': 0.6418984166408535,
 'balanced_accuracy': 0.6367379125794927,
 'average_precision_score_weighted': 0.6801325120982373,
 'AUC_macro': 0.6875430512091426,
 'accuracy': 0.6365000000000001,
 'f1_score_macro': 0.6364568664115853,
 'recall_score_weighted': 0.6365000000000001,
 'norm_macro_recall': 0.27347582515898544,
 'precision_score_weighted': 0.6370333738013207,
 'precision_score_macro': 0.636833535843925,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_4fa5adf8-bcbc-4002-a1ad-9027cfebb3b0_30/accur
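
Two of the reported metrics are easier to read next to a baseline. The sketch below assumes the usual definition of normalized macro recall, in which macro recall is rescaled so that random guessing scores 0 and a perfect model scores 1. It checks that definition against the values above and compares accuracy with the majority-class floor of 5,050 out of 10,000 rows.

```python
import math

# Values copied from the `best_run.get_metrics()` output above.
recall_score_macro = 0.6367379125794927
reported_norm_macro_recall = 0.27347582515898544
accuracy = 0.6365000000000001

n_classes = 2                    # HasDetections is binary
random_recall = 1 / n_classes    # expected macro recall of random guessing

# Rescale so that random guessing scores 0 and a perfect model scores 1.
norm_macro_recall = (recall_score_macro - random_recall) / (1 - random_recall)

# Accuracy gain over always predicting the majority class (5,050 of 10,000 rows).
majority_baseline = 5_050 / 10_000
accuracy_gain = accuracy - majority_baseline

print(norm_macro_recall, math.isclose(norm_macro_recall, reported_norm_macro_recall))
print(round(accuracy_gain, 4))
```

Read this way, the best model beats the majority-class floor by about 13 percentage points of accuracy.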

In [22]:
best_run.properties

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'accuracy\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'MS-Malware\',\'compute_target\':\'local\',\'subscription_id\':\'6b4af8be-9931-443e-90f6-c4c34a1f9737\',\'region\':\'southcentralus\',\'spark_service\':None}","ensemble_run_id":"AutoML_4fa5adf8-bcbc-4002-a1ad-9027cfebb3b0_30","experiment_name":null,"workspace_name":"quick-starts-ws-133396","subscription_id":"6b4af8be-9931-443e-90f6-c4c34a1f9737","resource_group_name":"aml-quickstarts-133396"}}]}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '30',
 '_azureml.ComputeTargetType': 'local',
 '_aml_system_scenario_identification': 'Local.Child',
 'run_tem

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [25]:
type(best_run)

azureml.core.run.Run

In [28]:
model = best_run.register_model(model_name='best_automl', model_path='outputs/model.pkl')
print(model.name, model.id, model.version, sep='\t')

best_automl	best_automl:2	2


In [29]:
from azureml.core.model import InferenceConfig

TODO: In the cell below, send a request to the web service you deployed to test it.
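
No service is deployed in this notebook, so the following is only a sketch of what a test request might look like. It assumes the scoring script accepts a JSON object with a `data` list of row records. `SCORING_URI` is a hypothetical placeholder for the deployed service's scoring URI, and `sample_row` holds just a few columns taken from the data preview earlier in the notebook.

```python
import json
import urllib.request

# Hypothetical placeholder: the real value would come from the deployed service.
SCORING_URI = "http://localhost:8890/score"

# A few columns from the first row of the data preview; a real request
# would need every feature column the model was trained on.
sample_row = {
    "ProductName": "win8defender",
    "EngineVersion": "1.1.15100.1",
    "AppVersion": "4.18.1807.18075",
    "AvSigVersion": "1.273.1735.0",
    "IsBeta": 0,
    "RtpStateBitfield": 7,
}

# Assumed input schema: {"data": [ {column: value, ...}, ... ]}
payload = json.dumps({"data": [sample_row]})

request = urllib.request.Request(
    SCORING_URI,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out because there is no live endpoint here.
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
print(request.get_method(), len(json.loads(payload)["data"]))
```

Building the request separately from sending it makes it possible to inspect the payload before any endpoint exists.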

TODO: In the cell below, print the logs of the web service and delete the service