# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [2]:
import logging
import os
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory


## Dataset

### Overview
The dataset is the California housing dataset, based on the 1990 census and retrieved from Kaggle. It has 10 columns: nine features and the target, the median house value (`median_house_value`).

## Setting up Workspace

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = "\n")

quick-starts-ws-135884
aml-quickstarts-135884
southcentralus
f5091c60-1c3c-430f-8d81-d802f6bf2414


## Initialising Experiment

In [4]:
# choose a name for experiment
experiment_name = 'housing_california_automl'
project_folder = './housing_california_automl'
os.makedirs(project_folder, exist_ok = True)
experiment=Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
housing_california_automl,quick-starts-ws-135884,Link to Azure Machine Learning studio,Link to Documentation


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

## Creating or checking for existing compute cluster

In [6]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

aml_compute_name = 'computecluster1' 
try:
    # Reuse the cluster if one with this name already exists in the workspace.
    compute_target = ComputeTarget(workspace = ws, name = aml_compute_name)
    print("Existing cluster. Use it.")
except ComputeTargetException:
    # Otherwise provision a new cluster that can scale up to 4 nodes.
    compute_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", max_nodes = 4)
    compute_target = ComputeTarget.create(ws, aml_compute_name, compute_config)
# Block until the cluster is ready.
compute_target.wait_for_completion(show_output = True)

Existing cluster. Use it.

Running


## Preparing Data

In [13]:
import pandas as pd
data = pd.read_csv('housing_california.csv', header = 0)

In [15]:
data.head(5)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [16]:
from sklearn.model_selection import train_test_split

In [19]:
def prepare_data(data):
    encoded_column = pd.get_dummies(data['ocean_proximity'], prefix = 'ocp')
    data = data.join(encoded_column)
    data = data.drop("ocean_proximity", axis = 1)
    train, test = train_test_split(data, test_size = 0.2, random_state = 42)
    return train, test

train,test = prepare_data(data)
type(train), type(test), train.shape, test.shape

(pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 (16512, 14),
 (4128, 14))
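
As a rough illustration of what `prepare_data` does, the sketch below applies the same one-hot encoding and 80/20 split to a toy frame. The rows and values are made up for the example; only the column names and the `ocp` prefix come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "median_income": [8.3, 8.3, 7.3, 5.6, 3.8, 4.0, 3.7, 3.1, 2.1, 3.7],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0,
                           269700.0, 299200.0, 241400.0, 226700.0, 261100.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "NEAR BAY",
                        "INLAND", "NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

# One indicator column per category, named "<prefix>_<category>".
encoded = pd.get_dummies(toy["ocean_proximity"], prefix="ocp")

# Attach the indicator columns and drop the original text column.
prepared = toy.join(encoded).drop("ocean_proximity", axis=1)

# Hold out 20% of the rows; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(prepared, test_size=0.2, random_state=42)

print(list(prepared.columns))
print(toy_train.shape, toy_test.shape)
```

Each category adds one column and the original text column is dropped. In the notebook the 10 source columns become 14, which implies five distinct `ocean_proximity` values (10 - 1 + 5 = 14).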

In [20]:
columns = train.columns
# index = False keeps the DataFrame index out of the CSV files.
train.to_csv(path_or_buf = 'housing_train.csv', columns = columns, header = True, index = False)
test.to_csv(path_or_buf = 'housing_test.csv', columns = columns, header = True, index = False)
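
The `index = False` argument matters once the files are read back as tabular data. A minimal sketch of the difference, using an in-memory buffer and a two-row frame invented for the example (the helper `round_trip` is hypothetical, not part of the notebook):

```python
import io

import pandas as pd

# Two made-up rows; the column names follow the housing data.
frame = pd.DataFrame({"longitude": [-122.23, -122.22], "latitude": [37.88, 37.86]})


def round_trip(df, write_index):
    """Write df to CSV text and read it back, as a later reader would."""
    buffer = io.StringIO()
    df.to_csv(buffer, header=True, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer, header=0)


without_index = round_trip(frame, write_index=False)
with_index = round_trip(frame, write_index=True)

print(list(without_index.columns))
print(list(with_index.columns))
```

When the index is written, the reader sees an extra unnamed first column, which would otherwise reach the model as a spurious feature.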

In [22]:
datastore = ws.get_default_datastore()
datastore.upload_files(['housing_train.csv'])

Uploading an estimated of 1 files
Uploading housing_train.csv
Uploaded housing_train.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_workspaceblobstore

In [23]:
train = TabularDatasetFactory.from_delimited_files([(datastore, 'housing_train.csv')])
datastore.upload_files(['housing_test.csv'])

Uploading an estimated of 1 files
Uploading housing_test.csv
Uploaded housing_test.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_workspaceblobstore

In [24]:
test = TabularDatasetFactory.from_delimited_files([(datastore, 'housing_test.csv')])

## AutoML Configurations

In [25]:
azureml.train.automl.utilities.get_primary_metrics("regression")

['spearman_correlation',
 'normalized_mean_absolute_error',
 'r2_score',
 'normalized_root_mean_squared_error']

In [26]:
automl_settings = {
    "featurization" : "auto",
    "experiment_timeout_minutes" : 30,
    "enable_early_stopping": True,
    "verbosity": logging.INFO,
    "compute_target": compute_target
}

# TODO: Put your automl config here
task = "regression"
automl_config = AutoMLConfig(
    task = task,
    primary_metric = 'normalized_root_mean_squared_error',
    training_data = train,
    validation_data = test,
    label_column_name = "median_house_value",
    **automl_settings
)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [27]:
from azureml.widgets import RunDetails
automl_run = experiment.submit(automl_config, show_output = True)
RunDetails(automl_run).show()



Running on remote.
No run_configuration provided, running on computecluster1 with default configuration
Running on remote compute: computecluster1
Parent Run ID: AutoML_a86a962c-62c4-467f-8dfc-023e31606c67

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inpu



         0   MaxAbsScaler LightGBM                          0:00:49       0.0993    0.0993
         1   MaxAbsScaler XGBoostRegressor                  0:00:47       0.1157    0.0993
         2   RobustScaler LassoLars                         0:00:40       0.1444    0.0993
         3   RobustScaler DecisionTree                      0:00:53       0.1335    0.0993
         4   StandardScalerWrapper DecisionTree             0:00:45       0.1402    0.0993
         5   RobustScaler DecisionTree                      0:00:51       0.1366    0.0993
         6   StandardScalerWrapper ElasticNet               0:00:56       0.1441    0.0993
         7   MinMaxScaler DecisionTree                      0:00:55       0.1498    0.0993
         8   MinMaxScaler ElasticNet                        0:00:44       0.1473    0.0993
         9   StandardScalerWrapper DecisionTree             0:00:40       0.1441    0.0993
        10   StandardScalerWrapper DecisionTree             0:00:44       0.1380    0.0993
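
The final column of the log appears to track the best score seen so far. Since normalized RMSE is minimized, that would be a running minimum over the per-iteration scores. A small sketch under that assumption, with the scores copied from the rows above:

```python
# Per-iteration scores copied from the log (normalized RMSE, lower is better).
scores = [0.0993, 0.1157, 0.1444, 0.1335, 0.1402, 0.1366,
          0.1441, 0.1498, 0.1473, 0.1441, 0.1380]


def running_best(values):
    """Return the lowest value seen up to and including each position."""
    best = []
    current = float("inf")
    for value in values:
        current = min(current, value)
        best.append(current)
    return best


best_so_far = running_best(scores)
print(best_so_far)
```

Because the first pipeline (LightGBM) already scored 0.0993 and none of the next ten improved on it, the running minimum stays at 0.0993 for every row shown.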


In [28]:
automl_run.wait_for_completion()

{'runId': 'AutoML_a86a962c-62c4-467f-8dfc-023e31606c67',
 'target': 'computecluster1',
 'status': 'Completed',
 'startTimeUtc': '2021-01-24T11:02:22.391167Z',
 'endTimeUtc': '2021-01-24T11:40:01.951051Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'normalized_root_mean_squared_error',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'computecluster1',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"5a52167e-ac03-4210-84fc-451e7e5440b6\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"housing_train.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-135884\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"f5091c60-1c3c-430f-8d81

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [29]:
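import joblib

# get_output() returns the best child run and its fitted model.
# This rebinds automl_run to the best child run, which the cells below rely on.
automl_run, fitted_automl_model = automl_run.get_output()
print(fitted_automl_model)
joblib.dump(fitted_automl_model, 'automl_housing.pkl')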

Package:azureml-automl-runtime, training version:1.20.0, current version:1.19.0
Package:azureml-core, training version:1.20.0, current version:1.19.0
Package:azureml-dataprep, training version:2.7.2, current version:2.6.1
Package:azureml-dataprep-native, training version:27.0.0, current version:26.0.0
Package:azureml-dataprep-rslex, training version:1.5.0, current version:1.4.0
Package:azureml-dataset-runtime, training version:1.20.0, current version:1.19.0.post1
Package:azureml-defaults, training version:1.20.0, current version:1.19.0
Package:azureml-interpret, training version:1.20.0, current version:1.19.0
Package:azureml-pipeline-core, training version:1.20.0, current version:1.19.0
Package:azureml-telemetry, training version:1.20.0, current version:1.19.0
Package:azureml-train-automl-client, training version:1.20.0, current version:1.19.0
Package:azureml-train-automl-runtime, training version:1.20.0, current version:1.19.0


RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                                             logger=None,
                                                             observer=None,
                                         

['automl_housing.pkl']
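
A saved model file can be checked by loading it back and comparing predictions. The sketch below shows that round trip with a stand-in scikit-learn model and a temporary file, both chosen only for illustration; the notebook's real object is the AutoML `RegressionPipeline`, and loading it would likely need the azureml packages to be importable in the loading environment.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up data (y = 2 * x).
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]
model = LinearRegression().fit(X, y)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict the same value as the original.
original_pred = model.predict([[5.0]])[0]
restored_pred = restored.predict([[5.0]])[0]
print(original_pred, restored_pred)
```

The package lines printed above show the training versions ahead of the current ones, so a check like this seems worth running in the environment that will eventually serve the model.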

In [31]:
os.listdir(os.curdir), os.getcwd()

(['.azureml',
  '.config',
  '.ipynb_aml_checkpoints',
  '.ipynb_checkpoints',
  'automl.ipynb',
  'automl.log',
  'automl_housing.pkl',
  'azureml_automl.log',
  'housing_california.csv',
  'housing_california_automl',
  'housing_test.csv',
  'housing_train.csv'],
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/computecluster1/code/Users/odl_user_135884')

In [32]:
automl_run_metrics = automl_run.get_metrics()
automl_run_metrics

{'explained_variance': 0.8228753615353978,
 'mean_absolute_percentage_error': 18.159874417379168,
 'root_mean_squared_log_error': 0.23800051249102416,
 'normalized_median_absolute_error': 0.043801857474187994,
 'normalized_root_mean_squared_error': 0.0993347172460213,
 'normalized_mean_absolute_error': 0.06646896116376264,
 'root_mean_squared_error': 48177.53653375482,
 'median_absolute_error': 21243.988478696127,
 'r2_score': 0.8228739984995292,
 'mean_absolute_error': 32237.57910234721,
 'normalized_root_mean_squared_log_error': 0.06787289643267606,
 'spearman_correlation': 0.9138951035775631,
 'predicted_true': 'aml://artifactId/ExperimentRun/dcid.AutoML_a86a962c-62c4-467f-8dfc-023e31606c67_28/predicted_true',
 'residuals': 'aml://artifactId/ExperimentRun/dcid.AutoML_a86a962c-62c4-467f-8dfc-023e31606c67_28/residuals'}
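
The normalized metrics can be related to their raw counterparts. Assuming the normalization divides by the range of the target (maximum minus minimum), dividing each raw error by its normalized value should recover the same range. A quick check with the values reported above:

```python
# Values copied from the metrics dictionary.
rmse = 48177.53653375482
nrmse = 0.0993347172460213
mae = 32237.57910234721
nmae = 0.06646896116376264

# Under the assumed normalization, raw / normalized = target range.
implied_range_from_rmse = rmse / nrmse
implied_range_from_mae = mae / nmae

print(round(implied_range_from_rmse), round(implied_range_from_mae))
```

Both ratios come out at about 485,002, so the two normalized metrics are consistent with a single scaling constant. Read that way, an RMSE of roughly 48,178 is about 9.9% of the spread of house values.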

In [33]:
print("Best run ID:", automl_run.id)
print("Mean absolute error:", automl_run_metrics['mean_absolute_error'])

Best run ID: AutoML_a86a962c-62c4-467f-8dfc-023e31606c67_28
Mean absolute error: 32237.57910234721


## Model Deployment

Remember, you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service