## Analyze and prepare the customer ratings dataset

Use `spark.read.csv()` to load the data from the source public blob storage account and display its schema and shape.

In [66]:
url = "wasbs://files@synapsemlpublic.blob.core.windows.net/PersonalizedData.csv"
raw_data = spark.read.csv(url, header=True, inferSchema=True)
print("Schema: ")
raw_data.printSchema()

df = raw_data.toPandas()
print("Shape: ", df.shape)

Take a look at some of the items in the dataset. Notice the two-class ratings (0 vs. 1) that customers assigned to products.

The goal of this exercise is to build a machine learning classification model that predicts the rating from Cost, Size, Price, PrimaryBrandId, GenderId, MaritalStatus, LowerIncomeBound, and UpperIncomeBound. To do so, you will use Azure Machine Learning (AML) automated machine learning (AutoML).

In [67]:
display(df.iloc[:10, :])
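
Since the AutoML run configured later in this notebook optimizes `accuracy`, it can help to check how balanced the two rating classes are before training. The following is a minimal sketch, not part of the original notebook: `class_balance` is a hypothetical helper and the example labels are made up for illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class label as a dict {label: fraction}."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels is empty")
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only; in the notebook you would pass
# df["Rating"].tolist() instead.
example_labels = [0, 1, 1, 0, 1, 1, 1, 0]
balance = class_balance(example_labels)
print(balance)
```

If one class dominates, a high accuracy value on its own says little about how well the model separates the two ratings.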

Split the data into train and test sets using a ratio of 80% train to 20% test.

In [68]:
split_ratio = 0.8
seed = 42
raw_train, raw_test = raw_data.randomSplit([split_ratio, 1 - split_ratio], seed=seed)
print("Train: (rows, columns) = {}".format((raw_train.count(), len(raw_train.columns))))
print("Test: (rows, columns) = {}".format((raw_test.count(), len(raw_test.columns))))
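
The row counts printed above will usually be close to, but not exactly, 80% and 20% of the data, because `randomSplit` assigns rows at random rather than cutting at a fixed index. The sketch below illustrates that idea in plain Python with a simplified per-row random assignment. It is an assumption-laden model for intuition only, not Spark's actual implementation, and `bernoulli_split` is a hypothetical name.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(10_000))
train, test = bernoulli_split(rows, train_fraction=0.8, seed=42)
# The sizes are near 8000 and 2000, but not guaranteed to match exactly.
print(len(train), len(test))
```

Reusing the same seed yields the same split, which is why the notebook fixes `seed = 42`.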

Use the subscription ID, resource group name, AML workspace name, and AML workspace region from your environment to connect to the AML workspace. Make sure the values match the ones displayed in the Azure portal.

In [69]:
from azureml.core import Workspace

# Enter your workspace subscription ID, resource group, name, and region.
# You should be Owner or Contributor on the subscription and the resource group.
subscription_id = "153b2544-398a-45ea-a683-b41ddd681d56"
resource_group = "Synapse-WS-L400-524101"
workspace_name = "amlworkspace524101"
workspace_region = "northeurope"  # kept for reference; not passed to Workspace() below

ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)
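
Hard-coding identifiers in a notebook cell works for a lab, but the values are easy to leak or to leave stale. One alternative, sketched below, is to read them from environment variables. The variable names (`AML_SUBSCRIPTION_ID`, `AML_RESOURCE_GROUP`, `AML_WORKSPACE_NAME`) and the helper `load_workspace_settings` are hypothetical and not part of the original notebook; this also assumes you can set environment variables on your Spark pool.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = {
    "subscription_id": "AML_SUBSCRIPTION_ID",
    "resource_group": "AML_RESOURCE_GROUP",
    "workspace_name": "AML_WORKSPACE_NAME",
}


def load_workspace_settings(env=None):
    """Collect the workspace settings from env, failing fast on missing values."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}


# Possible usage in the notebook, with the same keyword arguments as above:
# ws = Workspace(**load_workspace_settings())
```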

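Persist the train and test datasets as CSV files and upload them to the AML datastore. Note that the code writes only the columns from the third one onward (`columns[2:]`), so the first two columns of the source data are left out of both files.

Then load the train dataset as an AML tabular dataset, which is the format the AutoML run expects.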

In [70]:
from azureml.core import Dataset

# Get the Azure Machine Learning default datastore
datastore = ws.get_default_datastore()

# Write the train and test sets to local CSV files, skipping the first two columns
train_pd = raw_train.toPandas()
train_pd[train_pd.columns[2:]].to_csv('train.csv', index=False)
test_pd = raw_test.toPandas()
test_pd[test_pd.columns[2:]].to_csv('test.csv', index=False)

# Upload both CSV files to the datastore
datastore.upload_files(files = ['train.csv', 'test.csv'],
                       target_path = 'train-dataset/tabular/',
                       overwrite = True,
                       show_progress = True)

# Load the uploaded train file as an Azure Machine Learning tabular dataset
ds_train = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/train.csv')])
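
Because the CSV files are written from a column slice, a quick header check before the upload can catch a missing label column or a column that should have been dropped. The sketch below uses only the standard library; `read_header` and `check_training_header` are hypothetical helpers, not part of the original notebook.

```python
import csv


def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])


def check_training_header(header, label_column, dropped_columns=()):
    """Raise ValueError if the label is missing or a dropped column is present."""
    if label_column not in header:
        raise ValueError("Label column not found: " + label_column)
    leaked = [name for name in dropped_columns if name in header]
    if leaked:
        raise ValueError("Columns that should be dropped: " + ", ".join(leaked))
    return True


# Possible usage in the notebook, right after train.csv is written:
# check_training_header(read_header("train.csv"), "Rating",
#                       dropped_columns=tuple(train_pd.columns[:2]))
```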

## Use AML AutoML to train the classification model

Configure the AutoML run to use at most 20 iterations (combinations of ML algorithms and hyperparameter values). This limit is intended to keep the total run time of the AutoML run to about 7-8 minutes; the `experiment_timeout_minutes` setting caps it at 15 minutes in any case.

The `enable_onnx_compatible_models` setting ensures the run produces an ONNX-compatible model. This makes the model available for inference directly on dedicated SQL pool tables, via the AML linked service configured in Synapse.

In [71]:
import logging

automl_settings = {
    "iterations": 20,
    "iteration_timeout_minutes": 5,
    "experiment_timeout_minutes": 15,
    "max_concurrent_iterations": 2,
    "enable_early_stopping": True,
    "enable_onnx_compatible_models": True,
    "primary_metric": 'accuracy',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 2}
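
To see how the timing settings interact, the sketch below computes a rough upper bound on the training time. It is a simplified model under stated assumptions: iterations run in full rounds of `max_concurrent_iterations`, every iteration uses its whole timeout, and setup, featurization, and early stopping are ignored. `worst_case_minutes` is a hypothetical helper, not an AML API.

```python
import math


def worst_case_minutes(settings):
    """Rough upper bound on training time under the simplified model above."""
    rounds = math.ceil(settings["iterations"] / settings["max_concurrent_iterations"])
    uncapped = rounds * settings["iteration_timeout_minutes"]
    # The experiment-level timeout caps the total regardless of the iteration count.
    return min(uncapped, settings["experiment_timeout_minutes"])


budget = worst_case_minutes({
    "iterations": 20,
    "iteration_timeout_minutes": 5,
    "experiment_timeout_minutes": 15,
    "max_concurrent_iterations": 2,
})
print(budget)  # 10 rounds x 5 minutes = 50, capped at 15
```

With these values the experiment timeout, not the iteration count, is the binding limit in the worst case; the 7-8 minute estimate assumes most iterations finish well before their individual timeout.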

Finalize the configuration of the AutoML run. Specify the task type (`classification`), the data to train on, and the compute resource to use. In this case, `spark_context = sc` specifies that the AutoML run uses the local Spark pool as the compute resource for the entire process. The AML workspace still coordinates the run, but the compute itself is local.

In [72]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automated_ml_errors.log',
                             training_data = ds_train,
                             spark_context = sc,
                             model_explainability = True,
                             label_column_name = "Rating",
                             **automl_settings)

Submit the AutoML run and wait for it to complete. The settings were chosen so that the total run time should stay within about 7-8 minutes. While the experiment is running, open Azure Machine Learning studio from the Azure portal and check the details of the AutoML run.

Once the run completes, check the list of trained models and their performance metric (`accuracy` in our case).

In [73]:
from azureml.core.experiment import Experiment

# Start an experiment in Azure Machine Learning
experiment = Experiment(ws, "aml-synapse-classification")
tags = {"Synapse": "classification"}
local_run = experiment.submit(automl_config, tags = tags)
local_run.wait_for_completion(show_output=True)

## Register the best model in the AML workspace

Retrieve the best model and its associated child run from the AutoML run. Inspect the properties of the child run.

In [None]:
# Get best model
best_run, fitted_model = local_run.get_output()
best_run.properties
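
Conceptually, the best run is the child run with the highest value of the primary metric (`accuracy` here). The sketch below shows that selection logic on a plain dictionary. `pick_best` is a hypothetical helper and the run IDs and scores are made-up placeholders, not results from this experiment.

```python
def pick_best(metrics_by_run, metric="accuracy"):
    """Return the run ID with the highest value for the given metric."""
    scored = {
        run_id: metrics[metric]
        for run_id, metrics in metrics_by_run.items()
        if metric in metrics  # skip runs that never logged the metric
    }
    if not scored:
        raise ValueError("No run reported the metric: " + metric)
    return max(scored, key=scored.get)


# Placeholder values for illustration only. In the notebook, a comparable
# dictionary could be built from the child runs (an assumption, not shown
# in the original):
# {child.id: child.get_metrics() for child in local_run.get_children()}
example_metrics = {
    "run_0": {"accuracy": 0.71},
    "run_1": {"accuracy": 0.78},
    "run_2": {},  # a run that failed before logging metrics
}
print(pick_best(example_metrics))
```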

In [81]:
# Register the ONNX version of the best model in the AML workspace
description = 'Classification model trained by AutoML running on Synapse Spark'
model_path = 'outputs/model.onnx'
model = best_run.register_model(model_name = 'aml-synapse-classifier',
                                model_path = model_path,
                                description = description,
                                model_framework = 'ONNX')
print(model.name, model.version)