# Automated ML

Detailed package dependencies are listed in [`env.yml`](envs/env.yml).
Run `conda env create --file envs/env.yml` in your terminal
to reproduce the conda environment used in this notebook.

In [1]:
from azureml.core import Workspace, Experiment, Environment, Datastore, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

import pandas as pd
import joblib
import os
import requests

In [2]:
# Setting up the workspace
ws = Workspace.from_config()

# Registering and building the environment (not needed in AutoML)
env = Environment.from_conda_specification(name = "az-capstone", file_path = "envs/env.yml")
env = env.register(workspace=ws)
env_build = env.build(workspace=ws)

# Setup the experiment
exp_name = 'az-capstone-automl'
exp = Experiment(ws, exp_name)

# Enable logs
run = exp.start_logging()

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
No Python version provided, defaulting to "3.6.2"


We now provision the compute cluster we need, or reuse an existing one if it is already available.

In [3]:
# Setup the compute cluster
compute_name = os.environ.get('CLUSTER_NAME', 'automl-cluster')
# Environment variables are strings when set, so cast the node counts to int
compute_min_nodes = int(os.environ.get('CLUSTER_MIN_NODES', 0))
compute_max_nodes = int(os.environ.get('CLUSTER_MAX_NODES', 4))
vm_size = os.environ.get('CLUSTER_SKU', 'STANDARD_D2_V2')

# Verify if the compute cluster exists
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        min_nodes=compute_min_nodes,
        max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    # poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

creating a new compute target...
InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Resizing', 'allocationStateTransitionTime': '2022-02-01T16:43:46.117000+00:00', 'errors': None, 'creationTime': '2022-02-01T16:43:45.650701+00:00', 'modifiedTime': '2022-02-01T16:43:49.305778+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1800S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}
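
`os.environ.get` returns a string whenever the variable is set, so node counts read from the environment need converting before they are used as numbers. Below is a minimal sketch of that conversion, assuming a hypothetical helper `env_int` that is not part of the notebook:

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to a default.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    # int() raises ValueError on non-numeric input, which surfaces typos early
    return int(raw)


# The variable names mirror the ones used in the cell above
compute_min_nodes = env_int("CLUSTER_MIN_NODES", 0)
compute_max_nodes = env_int("CLUSTER_MAX_NODES", 4)
```

With this approach the values handed to the provisioning configuration are integers whether they come from the environment or from the defaults.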


## Dataset

### Overview

The dataset we are using is the output of the [previous notebook](1-data-sourcing.ipynb), where
we dug into data sourcing and did some processing ahead of this task. The dataset
consists of financial data, including OHLCV (open, high, low, close, volume) for diverse instruments (indices,
commodities, interest rates...) and technical indicators (moving averages, RSI, standard deviation...), which we will
use to build an ML-based trading model that gives BUY, HOLD or SELL signals for Bitcoin trading.

If you want to see what the dataset looks like or
how the above-mentioned signals are generated, please refer to the "labelling the data" section of
the [data sourcing notebook](1-data-sourcing.ipynb), or to the latest printed view of the DataFrame's head and/or tail
in the same file.

The task we are trying to solve is a **classification problem**.
We predict whether the next-day Bitcoin return will fall in the top 25% of most positive returns (BUY, 1),
the 25% most negative (SELL, -1), or somewhere in between (HOLD, 0).
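
To make the labelling rule concrete, here is a minimal sketch that assigns the three labels from quartile thresholds. The function name `label_returns` and the sample values are hypothetical. The actual labels are produced in the data sourcing notebook and may be computed differently (for example, with another quantile method).

```python
import statistics


def label_returns(next_day_returns):
    """Map each next-day return to 1 (BUY), -1 (SELL) or 0 (HOLD) using quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(next_day_returns, n=4)
    labels = []
    for r in next_day_returns:
        if r >= q3:
            labels.append(1)    # top 25% of returns
        elif r <= q1:
            labels.append(-1)   # bottom 25% of returns
        else:
            labels.append(0)    # everything in between
    return labels


# Illustrative, made-up returns (not taken from the dataset)
sample_returns = [-0.04, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.05]
sample_labels = label_returns(sample_returns)
```

Because the label refers to the next day's return, each row of features for day t would be paired with the label computed from day t+1.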

Since AutoML searches over featurization and normalization procedures on its own, we will feed the joint, unaltered
data to the model. The only change we make is dropping the trailing feature and label columns that are not needed for the task.

In [7]:
# Access the data and drop unneeded columns for AutoML exercise
df = pd.read_csv('data/df.csv')
print('dataset shape: ', df.shape)
print('columns:\n', df.columns)

dataset shape:  (2690, 34)
columns:
 Index(['Date', 'shangai', 'btc', 'crude oil', 'euro', 'gold', 'silver', 'ftse',
       'spy', 'hsi', 'nasdaq', 'nikkei', 'rates', 'open', 'high', 'low', 'MA4',
       'MA50', 'MA80', 'stochRSI', 'RSI', 'btc_std_dev', 'std_dif', 'vol_btc',
       'hashrate', 'difficulty', 'transactions', 't_cost', 'y_returns',
       'y_close', 'y_c', 'y_returns_shift', 'y_c_shift', 'y_close_shift'],
      dtype='object')


In [8]:
drop_col_list = ['y_close', 'y_c', 'y_close_shift', 'y_returns_shift']
df.drop(columns=drop_col_list, inplace=True)

In [9]:
# Register the dataset
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(df, datastore, "automl_dataset", show_progress=True)
df = dataset.to_pandas_dataframe()

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/a9649a16-ed2e-4e83-94b8-e5215479535d/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## AutoML Configuration

Our AutoML run has a classification task: predicting the next day's buy, sell or hold label, stored in the column
`y_c_shift`. Our primary metric will be weighted AUC, to deal with the instability of price data.
I'm also enabling automatic featurization, so
AutoML takes care of the necessary data transformations, trying out different methods.
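
As I understand the metric, weighted AUC is the mean of the one-vs-rest AUC of each class, weighted by the number of true instances of that class. The sketch below illustrates that idea with plain Python. The function names and sample probabilities are hypothetical, and this is not the code AutoML itself runs.

```python
def binary_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(labels, proba, classes):
    """Support-weighted mean of the one-vs-rest AUC for each class.

    proba[i][k] is the predicted probability that sample i belongs to classes[k].
    """
    total = len(labels)
    result = 0.0
    for k, cls in enumerate(classes):
        y_bin = [1 if y == cls else 0 for y in labels]
        support = sum(y_bin)
        if support == 0 or support == total:
            continue  # AUC is undefined when a class has no positives or no negatives
        class_scores = [row[k] for row in proba]
        result += (support / total) * binary_auc(y_bin, class_scores)
    return result


# Made-up example: SELL (-1), HOLD (0), BUY (1) with perfectly ranked probabilities
classes = [-1, 0, 1]
labels = [1, 0, -1, 1, 0, -1]
proba = [
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
    [0.8, 0.1, 0.1],
]
score = auc_weighted(labels, proba, classes)
```

Weighting by class support keeps a rarely occurring class from dominating the score, which matters less here because the guardrails report balanced classes.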

As the timeout for this project I will use 30 minutes. The VMs used to access Azure are only available for a limited time (1h), which puts pressure
on this setting. We also need time to analyze the results afterwards, so leaving spare time on the VM is important.

In [18]:
automl_settings = {"featurization": 'auto'}

In [21]:
# Set parameters for AutoMLConfig
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
    primary_metric='AUC_weighted',
    training_data=df,
    label_column_name='y_c_shift',
    n_cross_validations=5,
    **automl_settings)

In [22]:
# Submit the experiment run
automl_run = exp.submit(automl_config, show_output=False)
automl_run.wait_for_completion(show_output=True)

2022-02-01:16:57:09,723 INFO     [modeling_bert.py:226] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2022-02-01:16:57:09,749 INFO     [modeling_xlnet.py:339] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2022-02-01:16:57:46,301 INFO     [utils.py:159] NumExpr defaulting to 4 threads.


Experiment,Id,Type,Status,Details Page,Docs Page
az-capstone-automl,AutoML_a3d7aed5-a1be-4326-869d-3f12d1d0ca08,automl,Preparing,Link to Azure Machine Learning studio,Link to Documentation


2022-02-01:17:20:31,449 INFO     [explanation_client.py:332] Using default datastore for uploads


Experiment,Id,Type,Status,Details Page,Docs Page
az-capstone-automl,AutoML_a3d7aed5-a1be-4326-869d-3f12d1d0ca08,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation




********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       DONE
DESCRIPTION:  If the missing values are expected, let the run complete. Otherwise cancel the current run and use a script to customize the handling of missing feature values that may be more appropriate based on the data type and business requirement.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization
DETAILS:      
+------------------------------+------------------------------+------------------------------+
|Column name          

{'runId': 'AutoML_a3d7aed5-a1be-4326-869d-3f12d1d0ca08',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2022-02-01T16:57:48.228223Z',
 'endTimeUtc': '2022-02-01T17:18:25.470585Z',
 'services': {},
   'message': 'No scores improved over last 20 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.37.0", "azureml-train": "1.37.0", "azureml-train-restclients-hyperdrive": "1.37.0", "azureml-train-core":

## Run Details

The RunDetails widget lets us inspect the metrics of the different experiment iterations.


In [23]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

In the cell below, we retrieve the best model from the AutoML run and display its properties.

In [41]:
# Retrieve and save best model
best_run, model = automl_run.get_output()
print(best_run, '\n')
print(model)
# Persist the fitted model itself (the run ID alone cannot be reloaded as a model)
joblib.dump(value=model, filename="./models/bitcoin-automl.joblib")

Run(Experiment: az-capstone-automl,
Id: AutoML_a3d7aed5-a1be-4326-869d-3f12d1d0ca08_31,
Type: None,
Status: Completed) 

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
)), ('svcwrapper', SVCWrapper(C=1.2067926406393288, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False))], verbose=False))], meta_learner=LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False, fit_intercept=True, intercept_scaling=1.0, l1_ratios=None, max_iter=100, multi_class='auto', n_jobs=N

['./models/bitcoin-automl.joblib']

## Model Deployment

Only one of the two trained models has to be deployed, but both models still need to be registered.
Perform the steps in the rest of this notebook only if you wish to deploy this model.

In the cell below, we register the model, create an inference config and deploy the model as a web service.

In [45]:
model_name = 'best-automl-model'
description = "AutoML model for predicting day-ahead Bitcoin price movements"
tags = None
model = automl_run.register_model(model_name=model_name, description=description, tags=tags)
print(automl_run.model_id)

best-automl-model


In [46]:
from azureml.core.model import Model

In [47]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
az-capstone-automl,AutoML_a3d7aed5-a1be-4326-869d-3f12d1d0ca08_31,,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [51]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Download the scoring script that AutoML generated for the best run
script_file_name = "inference/score.py"
best_run.download_file("outputs/scoring_file_v_1_0_0.py", script_file_name)

# Build the inference config from the downloaded script and the registered environment
inference_config = InferenceConfig(environment=env, entry_script=script_file_name)
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    tags={"area": "Trading", "type": "automl_classification"},
    description="service for Bitcoin trading signals"
)

aci_service_name = model_name.lower()
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

# aci_service.get_logs()

best-automl-model


WebserviceException: WebserviceException:
	Message: Service best-automl-model with the same name already exists, please use a different service name or delete the existing service.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Service best-automl-model with the same name already exists, please use a different service name or delete the existing service."
    }
}
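
The error above leaves two options: delete the existing service, or deploy under a different name. For the second option, a minimal sketch that derives a fresh name by appending a short random suffix could look as follows. The helper `unique_service_name` is hypothetical, and the naming rules it applies (lowercase letters, digits and dashes, starting with a letter, at most 32 characters) are an assumption about the service's constraints.

```python
import re
import uuid


def unique_service_name(base, max_length=32):
    """Derive a fresh service name from `base` by appending a short random suffix.

    Hypothetical helper; the naming rules applied here are an assumption.
    """
    suffix = uuid.uuid4().hex[:6]
    # Keep only lowercase letters, digits and dashes
    cleaned = re.sub(r"[^a-z0-9-]", "-", base.lower()).strip("-")
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "svc-" + cleaned
    # Leave room for the dash and the suffix
    cleaned = cleaned[: max_length - len(suffix) - 1].rstrip("-")
    return f"{cleaned}-{suffix}"


aci_service_name = unique_service_name("best-automl-model")
```

The resulting name can then be passed to the deployment call in place of the fixed `model_name.lower()`.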

In the cell below, we send a request to the deployed web service to test it.

In [None]:
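import json

# X_test must hold feature columns only; as an assumption, the last rows
# of df without the label column are used here as a sample
X_test = df.drop(columns=['y_c_shift']).tail(5)
X_test_json = X_test.to_json(orient="records")
data = '{"data": ' + X_test_json + "}"
headers = {"Content-Type": "application/json"}

resp = requests.post(aci_service.scoring_uri, data, headers=headers)

# The double decoding assumes the scoring script returns a JSON-encoded string
y_pred = json.loads(json.loads(resp.text))["result"]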
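
The request body and the response decoding can also be wrapped in small helpers, which avoids building JSON by string concatenation and tolerates a response that is encoded only once. This is a sketch under assumptions: the helper names, the sample feature row and the simulated response are made up, and the `{"data": ...}` / `"result"` layout is taken from the cell above rather than from the scoring script itself.

```python
import json


def build_payload(records):
    """Serialize feature rows into the {"data": [...]} body used in the cell above."""
    return json.dumps({"data": records})


def parse_result(response_text):
    """Extract the "result" list, decoding a second time if the body is a JSON string."""
    decoded = json.loads(response_text)
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded["result"]


# Made-up feature row and simulated response, for illustration only
payload = build_payload([{"btc": 38000.0, "RSI": 55.2}])
simulated_response = json.dumps(json.dumps({"result": [1]}))
predictions = parse_result(simulated_response)
```

With a DataFrame at hand, `records` could come from `X_test.to_dict(orient="records")`, and `response_text` from `resp.text`.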

### Deletion of endpoints and resources

In [None]:
# Deleting the deployed web service (ACI endpoint)
aci_service.delete()

# Deleting compute cluster
# compute_target.delete()
# print('Compute cluster deleted!')

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
