# Automated ML

Detailed package dependencies can be found in [`env.yml`](envs/env.yml).
Run `conda env create -f envs/env.yml` in your terminal
to reproduce the conda environment used in this notebook.

In [1]:
from azureml.core import Workspace, Experiment, Environment, Datastore, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

import pandas as pd
import joblib
import json
import os
import requests

ModuleNotFoundError: No module named 'azureml.train.automl'

In [None]:
# Setting up the workspace
ws = Workspace.from_config()

# Registering and building the environment (not needed in AutoML)
env = Environment.from_conda_specification(name = "az-capstone", file_path = "envs/env.yml")
env = env.register(workspace=ws)
env_build = env.build(workspace=ws)

# Setup the experiment
exp_name = 'az-capstone-automl'
exp = Experiment(ws, exp_name)

# Enable logs
run = exp.start_logging()

We now provision the compute cluster, reusing an existing one if it is already available.

In [None]:
# Setup the compute cluster
compute_name = os.environ.get('CLUSTER_NAME', 'automl-cluster')
# Environment variables are read as strings, so cast node counts to int
compute_min_nodes = int(os.environ.get('CLUSTER_MIN_NODES', 0))
compute_max_nodes = int(os.environ.get('CLUSTER_MAX_NODES', 4))
vm_size = os.environ.get('CLUSTER_SKU', 'STANDARD_D2_V2')

# Verify if the compute cluster exists
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and isinstance(compute_target, AmlCompute):
        print('Found existing compute target, reusing it: ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        min_nodes=compute_min_nodes,
        max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    # poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

## Dataset

### Overview

The dataset we use is the one produced by the [previous notebook](1-data-sourcing.ipynb), where
we dug into data sourcing and did some processing ahead of this task. The dataset
consists of financial data, including OHLCV (open, high, low, close, volume) from diverse instruments (indices,
commodities, interest rates...) and technical indicators (moving averages, RSI, standard deviation...), which we will
use to create an ML-based trading model that gives BUY, HOLD or SELL signals for Bitcoin trading.

If you want to dig deeper into what the dataset looks like or
into how the above-mentioned signals are generated, please refer to the "labelling the data" section of
the [data sourcing notebook](1-data-sourcing.ipynb) or to the latest printed view of the DataFrame's head and/or tail
provided in the same file.

The task we are trying to solve is a **classification problem**.
We want to predict whether the next-day Bitcoin return will fall in the top 25% of most positive returns (BUY, 1),
the 25% most negative (SELL, -1), or somewhere in between (HOLD, 0).
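
To make the three classes concrete, here is a minimal sketch of quartile-based labelling. It is not the labelling code from the data sourcing notebook: the function name `label_returns`, the sample returns and the use of inclusive quartiles are all assumptions made for illustration.

```python
import statistics


def label_returns(next_day_returns):
    """Label each return as BUY (1), SELL (-1) or HOLD (0) using quartiles."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(next_day_returns, n=4, method="inclusive")
    labels = []
    for r in next_day_returns:
        if r >= q3:
            labels.append(1)    # top 25% of returns: BUY
        elif r <= q1:
            labels.append(-1)   # bottom 25% of returns: SELL
        else:
            labels.append(0)    # everything in between: HOLD
    return labels


# Hypothetical next-day returns, for illustration only
returns = [-0.04, -0.02, -0.01, 0.0, 0.005, 0.01, 0.02, 0.05]
labels = label_returns(returns)
print(labels)
```

By construction, roughly half of the samples end up in the HOLD class, so the classes are not balanced.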

Since AutoML searches over featurization and normalization procedures on its own, we feed the joint, unaltered data
to the model. We only drop the features and labels that are not needed for this task.

In [None]:
# Access the data and drop unneeded columns for AutoML exercise
df = pd.read_csv('data/df.csv')
df.tail()

In [None]:
drop_col_list = ['', '']
df.drop(columns=drop_col_list, inplace=True)

In [None]:
# Register the dataset
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(df, datastore, "automl_dataset", show_progress=True)
df = dataset.to_pandas_dataframe()

## AutoML Configuration

Our AutoML run performs a classification task: predicting the next day's BUY, SELL or HOLD label, stored in the column
`y_c_shifted`. The primary metric is weighted AUC (`AUC_weighted`), chosen to cope with the instability of the price data.
I'm also enabling automatic featurization, so
AutoML takes care of the necessary data transformations and tries out different methods.
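
As an illustration of what a weighted AUC measures, the sketch below computes a one-vs-rest AUC per class and averages the results weighted by the number of true instances of each class. It is a plain-Python approximation for intuition, not AutoML's own implementation; the helper names and the toy probabilities are hypothetical.

```python
from collections import Counter


def binary_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(labels, proba, classes):
    """Average the one-vs-rest AUC of each class, weighted by class support."""
    support = Counter(labels)
    total = 0.0
    for k, cls in enumerate(classes):
        y_bin = [1 if y == cls else 0 for y in labels]
        cls_scores = [row[k] for row in proba]
        total += support[cls] * binary_auc(y_bin, cls_scores)
    return total / len(labels)


classes = [-1, 0, 1]          # SELL, HOLD, BUY
labels = [-1, 0, 0, 1]        # hypothetical true labels
proba = [                     # hypothetical predicted probabilities per class
    [0.70, 0.20, 0.10],
    [0.20, 0.60, 0.20],
    [0.10, 0.25, 0.65],
    [0.10, 0.30, 0.60],
]
auc_weighted = weighted_ovr_auc(labels, proba, classes)
print(round(auc_weighted, 4))
```

Because the weights follow class frequency, the larger HOLD class contributes more to the final score than BUY or SELL.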

I will use a 30-minute timeout for this project. The VMs used to access Azure are only available for a limited time (1h),
which puts pressure on this setting. We also need time to analyze the results afterwards, so leaving spare VM time is important.

In [None]:
automl_settings = {
    'featurization':'auto'
}

In [None]:
# Set parameters for AutoMLConfig
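automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
    primary_metric='AUC_weighted',
    compute_target=compute_target,  # run on the cluster provisioned above
    training_data=dataset,  # remote runs expect the registered TabularDataset
    label_column_name='y_c_shifted',
    n_cross_validations=5,
    **automl_settings)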
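
For intuition on `n_cross_validations=5`, here is a minimal sketch of how a dataset can be split into five train/validation folds. It only illustrates the concept: the function `kfold_indices` is hypothetical, the folds are contiguous and unshuffled, and AutoML's internal splitting strategy may differ.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


folds = list(kfold_indices(12, n_splits=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```

Each sample is used for validation exactly once, and the reported metric is typically the average over the folds.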

In [None]:
# Submit the experiment run
automl_run = exp.submit(automl_config, show_output=False)
automl_run.wait_for_completion(show_output=True)

## Run Details

The RunDetails widget lets us inspect the metrics of the different runs in the experiment.

In [None]:
RunDetails(automl_run).show()

## Best Model

In the cell below, we retrieve the best model from the AutoML experiment and display its properties.

In [None]:
# Retrieve the best run and its fitted model, then save the model locally.
aml_best_run, model = automl_run.get_output()
print(aml_best_run)
print(model)
os.makedirs("models", exist_ok=True)
joblib.dump(value=model, filename="./models/bitcoin-automl.joblib")

## Model Deployment

Only one of the two trained models needs to be deployed, but both models must still be registered.
Perform the steps in the rest of this notebook only if you wish to deploy this model.

In the cell below, we register the model, create an inference config and deploy the model as a web service.

In [None]:
model_name = aml_best_run.properties['model_name']
description = "AutoML model for predicting day-ahead Bitcoin price movements"
tags = None
model = automl_run.register_model(model_name=model_name, description=description, tags=tags)
print(automl_run.model_id)

# TODO: CHECK IF WE NEED THESE LINES
script_file_name = "inference/score.py"
aml_best_run.download_file("outputs/scoring_file_v_1_0_0.py", script_file_name)

# Check what is this inference script
inference_config = InferenceConfig(entry_script=script_file_name)
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    tags={"area": "Trading", "type": "automl_classification"},
    description="service for Bitcoin trading signals",
)

aci_service_name = model_name.lower()
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

# aci_service.get_logs()

In the cell below, we send a request to the deployed web service to test it.

In [None]:
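# Assumption: X_test is not defined earlier in this notebook; here we reuse the last
# rows of df (without the label column) as a small smoke-test sample.
X_test = df.drop(columns=['y_c_shifted']).tail()
X_test_json = X_test.to_json(orient="records")
data = '{"data": ' + X_test_json + "}"
headers = {"Content-Type": "application/json"}

resp = requests.post(aci_service.scoring_uri, data, headers=headers)

# The scoring script returns a JSON string, so the response is decoded twice.
y_pred = json.loads(json.loads(resp.text))["result"]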
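
The payload format and the double decoding can be tried offline. The sketch below builds the same `{"data": [...]}` body from plain records and decodes a simulated response. The feature names, the simulated response and its `result` values are hypothetical; no real endpoint is called.

```python
import json

# Hypothetical feature records, standing in for X_test.to_json(orient="records")
records = [
    {"open": 100.0, "close": 101.5},
    {"open": 101.5, "close": 99.8},
]
payload = json.dumps({"data": records})

# Simulated response text: assumes the scoring script returns a JSON string
# that is then JSON-encoded once more when sent back by the service.
simulated_response_text = json.dumps(json.dumps({"result": [1, -1]}))

# First loads() recovers the inner JSON string, second loads() the dictionary.
y_pred = json.loads(json.loads(simulated_response_text))["result"]
print(payload)
print(y_pred)
```

If the deployed scoring script returns a plain dictionary instead of a JSON string, a single `json.loads` would be enough.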

### Deletion of endpoints and resources

In [None]:
# Deleting the inference compute instance
# aci_service.delete()

# Deleting compute cluster
# compute_target.delete()
# print('Compute cluster deleted!')

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
