# Automated ML

In [1]:
import pickle
import requests

from azureml.core import Environment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.experiment import Experiment
from azureml.core.dataset import Dataset
from azureml.core.workspace import Workspace
from azureml.core.webservice.aci import AciWebservice
from azureml.core.model import InferenceConfig, Model

from azureml.train.automl import AutoMLConfig

## Dataset

### Overview
In this experiment we use the **Kaggle - Credit Card Fraud Dataset**, which can be downloaded from [here](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). The dataset is highly imbalanced: **~99.8%** of the transactions are **non-fraudulent** and only **~0.2%** are **fraudulent**. There are no null values, and the meaning of every column except **transaction** and **amount** is undisclosed, presumably for privacy reasons. Note also that all the data in this dataset has already been scaled.
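To make the imbalance concrete, the sketch below computes class fractions from a list of labels. The counts (284,315 legitimate and 492 fraudulent rows) are an assumption taken from the public Kaggle dataset card, not read from the registered dataset; their total of 284,807 matches the train and test row counts AutoML reports later in this notebook. `class_balance` is a hypothetical helper.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Assumed counts from the Kaggle dataset card: 0 = legitimate, 1 = fraud
labels = [0] * 284_315 + [1] * 492
balance = class_balance(labels)
print(f"non-fraud: {balance[0]:.2%}, fraud: {balance[1]:.2%}")
```

On the real data, the same helper could be applied to the `Class` column of `dataset.to_pandas_dataframe()`.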

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = "creditcard-experiment"
project_folder = './creditcard-pipeline-project'

experiment = Experiment(ws, experiment_name)

In [25]:
key = "creditcard-dataset"
description = "Credit Card - Dealing from Imbalance Datasets from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud"

found = False
if key in ws.datasets.keys():
    print("Found existing dataset, use it.")
    found = True
    dataset = ws.datasets[key] # already registered
    
if not found:
    example_data = "https://media.githubusercontent.com/media/satriawadhipurusa/ml-dataset-collection/master/Fraud-Detection/creditcard-fraud.csv" # uploaded to Git for download
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    dataset = dataset.register(workspace=ws, name=key, description=description)

Found existing dataset, use it.


## AutoML Configuration

The AutoML run is configured with the following settings:

* Compute Target: We use a previously created low-priority `STANDARD_D2_v3` cluster (2 CPUs, 8 GB memory, 50 GB storage) with a maximum of 4 nodes, which allows more Automated ML trials to run in parallel.
* Task: Since we are predicting fraud (0/1), this is a binary **classification** task.
* Early Stopping
  * Timeout: We set the timeout to 30 minutes instead of 60 so that we can iterate faster.
  * Primary Metric: We optimize **AUC Weighted** with an exit score of **0.98**, because this is an imbalanced dataset. An accuracy of 0.99 would be misleading: predicting 0 for every transaction reaches the same accuracy, but with poor precision/recall.
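The argument against accuracy can be checked on a toy example. The sketch below uses made-up labels (998 legitimate, 2 fraudulent) and a "model" that always predicts 0. The `roc_auc` helper is a plain binary ROC AUC written for illustration; it is not Azure's `AUC_weighted` implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were predicted as 1."""
    positives = sum(y_true)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / positives


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0] * 998 + [1] * 2      # toy labels, not the real dataset
always_zero = [0] * len(y_true)   # predicts "not fraud" for every row
constant_scores = [0.0] * len(y_true)

print(accuracy(y_true, always_zero))     # 0.998
print(recall(y_true, always_zero))       # 0.0
print(roc_auc(y_true, constant_scores))  # 0.5
```

The trivial model looks excellent on accuracy, catches no fraud at all, and scores no better than chance on AUC, which is why a ranking-based metric is the more informative target here.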

### Compute Cluster

In [3]:
amlcompute_cluster_name = "automl-cls"

try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_v3", max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count=0, timeout_in_minutes=10)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [4]:
# AutoML settings: timeout, concurrency, primary metric, and exit score
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "primary_metric": "AUC_weighted",
    "experiment_exit_score" : 0.98
}

# AutoML configuration: compute target, task, training data, and label column
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             training_data=dataset,
                             label_column_name="Class",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization="auto",
                             debug_log="automl_errors.log",
                             **automl_settings)
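Before submitting, it can help to sanity-check the settings against the cluster. The sketch below is a hypothetical helper, not part of the Azure ML SDK. It assumes that each AmlCompute node runs one AutoML iteration at a time, so `max_concurrent_iterations` should not exceed the cluster's `max_nodes`, and that an AUC exit score must lie in (0, 1].

```python
def check_automl_settings(settings, max_nodes):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # Assumption: one AutoML iteration runs per cluster node at a time
    concurrency = settings.get("max_concurrent_iterations", 1)
    if concurrency > max_nodes:
        problems.append(
            f"max_concurrent_iterations={concurrency} exceeds max_nodes={max_nodes}"
        )

    # AUC is bounded by 1, so a larger exit score could never be reached
    exit_score = settings.get("experiment_exit_score")
    if exit_score is not None and not 0.0 < exit_score <= 1.0:
        problems.append(f"experiment_exit_score={exit_score} is outside (0, 1]")

    return problems


# Same values as the notebook's settings, checked against a 4-node cluster
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "primary_metric": "AUC_weighted",
    "experiment_exit_score": 0.98,
}
print(check_automl_settings(automl_settings, max_nodes=4))  # []
```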

In [5]:
# Submit the AutoML experiment to the remote compute cluster
remote_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
creditcard-experiment,AutoML_491025b3-1f9f-436c-8c4d-9b2ce65b7b36,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

Several combinations of **models**, **preprocessors**, and **hyperparameters** were trained in this Automated ML experiment. Some of the models are:

* LightGBM
* RandomForest
* XGBoost
* ExtremeRandomTrees
* LogisticRegression
* VotingEnsemble
* StackEnsemble

The **worst** model is the combination of **PCA** and **LightGBM**, with an AUC Weighted of **0.72**, while the best one is the combination of **StandardScalerWrapper** and **LightGBM**.

This discrepancy comes from the nature of the data and of the modeling applied to it.
* Since this dataset is already very condensed in information, applying PCA to reduce it further may backfire and lower the overall metric.
* A standard scaler that rescales the various numerical values can help the learning algorithm classify fraud / non-fraud better.

The other, weaker combinations likely also stem from a preprocessor that is not suited to the data, or from an algorithm or hyperparameters that may overfit.

In [6]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [7]:
remote_run.wait_for_completion(show_output=True)

Experiment,Id,Type,Status,Details Page,Docs Page
creditcard-experiment,AutoML_491025b3-1f9f-436c-8c4d-9b2ce65b7b36,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Train-Test data split
STATUS:       DONE
DESCRIPTION:  Your input data has been split into a training dataset and a holdout test dataset for validation of the model. The test holdout dataset reflects the original distribution of your input data.
              
DETAILS:      
+------------------------------+------------------------------+------------------------------+
|Dataset                       |Row counts                    |Percentage                    |
|train                         |256326                        |89.99989466551033             |
|test                          |28481       

{'runId': 'AutoML_491025b3-1f9f-436c-8c4d-9b2ce65b7b36',
 'target': 'automl-cls',
 'status': 'Completed',
 'startTimeUtc': '2022-04-26T22:32:47.582371Z',
 'endTimeUtc': '2022-04-26T23:13:00.638221Z',
 'services': {},
   'message': 'Experiment timeout reached, hence experiment stopped. Current experiment timeout: 0 hour(s) 30 minute(s)'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'automl-cls',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"67570178-bf15-46d2-b97e-884a07cb1970\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.40.0", "azureml-training-tabular": "1.40.0", "azureml-train": "1.40.0", "azureml-train-restcl
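The `startTimeUtc` and `endTimeUtc` fields above can be turned into a wall-clock duration. The sketch below uses only the standard library; `run_duration` is a hypothetical helper, and the format string assumes the timestamps always carry microseconds and a trailing `Z`.

```python
from datetime import datetime

# Assumed timestamp layout, matching the two values in the run details above
UTC_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration(start_utc, end_utc):
    """Return the elapsed time between two UTC timestamp strings as a timedelta."""
    start = datetime.strptime(start_utc, UTC_FORMAT)
    end = datetime.strptime(end_utc, UTC_FORMAT)
    return end - start


duration = run_duration("2022-04-26T22:32:47.582371Z", "2022-04-26T23:13:00.638221Z")
print(duration)  # 0:40:13.055850
```

The run took about 40 minutes of wall-clock time, longer than the 30-minute experiment timeout. A plausible reading, which the output does not confirm, is that the parent run also covers steps outside the timed model search, such as dataset evaluation and featurization.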

## Best Model

In [8]:
!mkdir -p outputs

In [19]:
best_automl_run, best_automl_model = remote_run.get_output()
print(f"Best AutoML Run:\n\n{best_automl_run}")
print("==============")
print(f"Best AutoML Model:\n\n{best_automl_model}")

Best AutoML Run:

Run(Experiment: creditcard-experiment,
Id: AutoML_491025b3-1f9f-436c-8c4d-9b2ce65b7b36_30,
Type: azureml.scriptrun,
Status: Completed)
Best AutoML Model:

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=False, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/moun...
                 LightGBMClassifier(boosting_type='gbdt', colsample_bytree=0.1, learning_rate=0.07894947368421053, max_bin=260, max_depth=5, min_child_weight=6, min_data_in_leaf=0.003457931034482759, min_split_gain=0.21052631578947367, n_estimators=200, n_jobs=1, num_leaves=131, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=None, reg_alpha=0.7894736842105263, reg_lambda=0.3684210526

In [16]:
# Save the best model; the context manager closes the file handle after writing
print("Saving the best Model.....")
model_path = "outputs/model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(best_automl_model, f)

Saving the best Model.....
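A saved pickle is only useful if it loads back correctly. The sketch below shows the save/load round trip with a stand-in dictionary instead of the fitted pipeline; `save_model` and `load_model` are hypothetical helpers. Loading the real AutoML model would additionally require the Azure ML packages it was trained with to be importable, ideally in matching versions.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle the model to path, closing the file even if dumping fails."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled model back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the fitted pipeline: any picklable object works for the demo
stand_in_model = {"algorithm": "LightGBM", "n_estimators": 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, path)
    restored = load_model(path)

print(restored == stand_in_model)  # True
```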


## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [29]:
# Register the model
best_automl_run.register_model(model_name="credit-fraud-model", model_path="outputs/model.pkl")

Model(workspace=Workspace.create(name='mlops-demo', subscription_id='5e4d75b9-5b13-49fb-8306-ae971a3c14b1', resource_group='mlops-resource'), name=credit-fraud-model, id=credit-fraud-model:1, version=1, tags={}, properties={})

TODO: In the cell below, send a request to the web service you deployed to test it.

In [44]:
best_automl_run.download_file("outputs/conda_env_v_1_0_0.yml", "conda.yaml")
env = Environment.from_conda_specification(name="env", file_path="conda.yaml")

best_automl_run.download_file("outputs/scoring_file_v_2_0_0.py", "score.py")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)

model = Model(ws, "credit-fraud-model", version=1, run_id=best_automl_run.id)
webservice = Model.deploy(ws, "credit-fraud-model",
                          models=[model],
                          inference_config=inference_config,
                          deployment_config=deployment_config,
                          overwrite=True)
webservice.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-04-27 15:48:02+00:00 Creating Container Registry if not exists.
2022-04-27 15:48:05+00:00 Building image..
2022-04-27 16:00:06+00:00 Generating deployment configuration..
2022-04-27 16:00:09+00:00 Submitting deployment to compute..
2022-04-27 16:00:17+00:00 Checking the status of deployment credit-fraud-model..
2022-04-27 16:04:07+00:00 Checking the status of inference endpoint credit-fraud-model.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [45]:
print(f"Scoring URI:\n\n{webservice.scoring_uri}")
print("==============")
primary_key, secondary_key = webservice.get_keys()
print(f"Primary Key:\n\n{primary_key}")

Scoring URI:

http://8db1a543-7e7d-41eb-bbc4-9fa413db3f24.eastasia.azurecontainer.io/score
Primary Key:

lmMdnCf1C0VIkfhyPSouD6qqJJvITbh5


TODO: In the cell below, print the logs of the web service and delete the service

In [46]:
df = dataset.to_pandas_dataframe()
sample = df.drop(columns="Class").sample(2)

In [53]:
sample_json = sample.to_dict(orient="records")  # "record" is a deprecated alias rejected by newer pandas
data = {
    "Inputs": {
        "data": sample_json
    },
    "GlobalParameters": {
        "method": "predict"
    }
}
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {primary_key}"
}
response = requests.post(webservice.scoring_uri, json=data, headers=headers)
print(response.text)

{"Results": [0, 0]}
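The request body and the response handling can be separated from the HTTP call so that they are easy to test offline. The sketch below mirrors the payload shape used above and parses a response like the one printed; `build_payload` and `parse_results` are hypothetical helpers, and the two feature names in the example records are made up for illustration.

```python
import json


def build_payload(records, method="predict"):
    """Wrap a list of row dictionaries in the request shape used by the cell above."""
    return {
        "Inputs": {"data": records},
        "GlobalParameters": {"method": method},
    }


def parse_results(response_text):
    """Extract the predictions list from the service's JSON response text."""
    body = json.loads(response_text)
    if "Results" not in body:
        raise ValueError(f"unexpected response: {response_text!r}")
    return body["Results"]


# Made-up rows; real rows come from sample.to_dict(orient="records")
records = [{"Amount": 0.5, "V1": -1.2}, {"Amount": 1.3, "V1": 0.4}]
payload = build_payload(records)

print(len(payload["Inputs"]["data"]))        # 2
print(parse_results('{"Results": [0, 0]}'))  # [0, 0]
```

Raising on a missing `Results` key surfaces error responses (for example an authentication failure) instead of silently treating them as predictions.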


In [55]:
webservice.update(enable_app_insights=True)  # enable app insights
logs = webservice.get_logs()
for line in logs.split('\n'):
    print(line)

2022-04-27T16:03:46,252294700+00:00 - iot-server/run 
2022-04-27T16:03:46,261633800+00:00 - rsyslog/run 
2022-04-27T16:03:46,274171700+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2022-04-27T16:03:46,290205200+00:00 - nginx/run 
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-04-27T16:03:46,705018400+00:00 - iot-server/finish 1 0
2022-04-27T16:03:46,708143100+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (74)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 99
SPARK_HOME not set. Skipping PySpark Initialization.
Initializing logger
2022-04-27 16:03:48,806 | root | INFO | Starting up app insights client
logging socket was found. logging is available.
logging socket was found. logging is available.
2022-04-27 16:03:48,807 | root | INFO | Starting up request id generator
2022-04-27 16:03:48,807 | root | INFO | Star
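The notebook does not show a cleanup cell. A minimal sketch of one follows, assuming the `webservice` and `compute_target` objects created earlier expose a `delete()` method, as `Webservice` and `AmlCompute` do in the Azure ML SDK v1. `tear_down` is a hypothetical helper, and the fake resource class exists only so the example runs without an Azure workspace.

```python
def tear_down(webservice, compute_target=None):
    """Delete the web service, then optionally the compute cluster.

    Returns the names of the resources whose delete() was called.
    """
    deleted = []
    webservice.delete()
    deleted.append("webservice")
    if compute_target is not None:
        compute_target.delete()
        deleted.append("compute_target")
    return deleted


class FakeResource:
    """Stand-in for an Azure resource; records whether delete() was called."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


service, cluster = FakeResource(), FakeResource()
print(tear_down(service, cluster))  # ['webservice', 'compute_target']
```

In the notebook itself the call would be `tear_down(webservice, compute_target)`, made after the logs have been printed, since a deleted service can no longer return them.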

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
