# Automated ML

In [1]:
from azureml.core import Experiment, Workspace

In [2]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'capstone'

experiment = Experiment(ws, experiment_name)

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
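
The warning above suggests non-interactive authentication for unattended runs. The following is a minimal sketch of that idea, assuming the service principal credentials are available as environment variables. The variable names (`AML_TENANT_ID`, `AML_PRINCIPAL_ID`, `AML_PRINCIPAL_PASSWORD`) and both helper functions are hypothetical and not part of the original notebook.

```python
import os


def read_sp_credentials(env=os.environ):
    """Collect service principal settings from environment variables.

    The variable names are hypothetical; use whatever your scheduler or CI provides.
    """
    names = {
        "tenant_id": "AML_TENANT_ID",
        "service_principal_id": "AML_PRINCIPAL_ID",
        "service_principal_password": "AML_PRINCIPAL_PASSWORD",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


def get_workspace_unattended(env=os.environ):
    """Load the workspace from config.json without an interactive login prompt."""
    # Imported lazily so the credential helper works even without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**read_sp_credentials(env))
    return Workspace.from_config(auth=auth)
```

In an unattended job, `ws = get_workspace_unattended()` would then stand in for the plain `Workspace.from_config()` call.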


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

Zindi: DSN AI Bootcamp Qualification Hackathon [data](https://zindi.africa/hackathons/dsn-ai-bootcamp-qualification-hackathon/data)

In [3]:
from azureml.core import Dataset
from sklearn.model_selection import train_test_split
import pandas as pd

from utils import get_data
from scripts.cleaning import clean_data

In [None]:
path = get_data("data/Train.csv")
loan_dataset = pd.read_csv(path)

In [5]:
# Retrieve default datastore and upload dataset 
datastore = ws.get_default_datastore()
datastore.upload('data', target_path='data')

# Create TabularDataset & register in workspace
loan_ds = Dataset.Tabular.from_delimited_files(path=[(datastore, 'data/Train.csv')])
loan_ds = loan_ds.register(
    ws, name="loan_dataset", create_new_version=True,
    description="Dataset for Udacity Machine Learning with Azure Capstone Project"
)


Uploading an estimated of 1 files
Uploading data\Train.csv
Uploaded data\Train.csv, 1 files out of an estimated total of 1
Uploaded 1 files
    


In [6]:
clean_loan_dataset = clean_data(loan_dataset)

# Stratified split on default_status (the classes are imbalanced),
# so train and test keep the same class ratio
train, test = train_test_split(
    clean_loan_dataset,
    test_size=0.3,
    stratify=clean_loan_dataset.default_status,
    random_state=42,
)
train.head()

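| Unnamed: 0 | form_field1 | form_field2 | form_field3 | form_field4 | form_field5 | form_field6 | form_field7 | form_field8 | form_field9 | form_field10 | ... | form_field39 | form_field42 | form_field43 | form_field44 | form_field46 | form_field47 | form_field48 | form_field49 | form_field50 | default_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2800 | 3398.0 | 1.19505 | 1.7028 | 0.5238 | 0.0 | 18672.0 | 5150360.0 | 20617.0 | NaN | 5189260.0 | ... | 0.0 | 0.43043 | 4.04 | 0.683232 | 0.0 | 1 | 15.434027 | 0.739973 | NaN | 0 |
| 20577 | 3124.0 | 2.40405 | 5.1528 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | ... | 0.0 | 1.026663 | 0.0 | 0.555328 | 0.0 | 1 | NaN | 0.0 | NaN | 1 |
| 42690 | 3510.0 | 0.0238 | 0.0908 | 0.0 | 0.0 | NaN | 45740954.0 | NaN | 8866477.0 | 60476663.0 | ... | 1.0 | 0.057893 | 5.05 | 0.369424 | 0.0 | 1 | NaN | 0.0 | 0.255556 | 0 |
| 14918 | NaN | NaN | 0.1646 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | ... | NaN | 1.32 | 0.0 | NaN | NaN | 0 | NaN | 0.0 | NaN | 1 |
| 5298 | 3512.0 | 0.06575 | 0.72 | 0.0 | 0.0 | 0.0 | 1025793.0 | 35788.0 | 2226636.0 | 1761392.0 | ... | 0.0 | 0.22 | 5.05 | 0.519776 | 0.0 | 0 | 144.54699 | 1.744186 | 0.128355 | 1 |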
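
Since the split is stratified on `default_status`, both partitions should show roughly the same class ratio. Below is a small sketch for checking this. `class_proportions` is a hypothetical helper (not part of `utils` or `scripts.cleaning`), and the toy labels only stand in for the real columns.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for train.default_status and test.default_status.
train_labels = [0, 0, 0, 1] * 7
test_labels = [0, 0, 0, 1] * 3

print(class_proportions(train_labels))
print(class_proportions(test_labels))
```

In the notebook, the same helper could be applied to `train.default_status` and `test.default_status`, since it accepts any iterable of labels.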


## AutoML Configuration

TODO: Explain why you chose the AutoML settings and configuration you used below.

In [7]:
import logging

from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

In [8]:
automl_settings = {
    "featurization": "auto",
    "n_cross_validations": 4,
    "experiment_timeout_minutes": 30,
    "enable_early_stopping": True,
    "verbosity": logging.INFO,
}  # no "compute_target" is set, so the run executes on the local machine

automl_config = AutoMLConfig(
    task="classification",
    training_data=train,
    label_column_name="default_status",
    primary_metric="AUC_weighted",
    **automl_settings
)

In [9]:
remote_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_b7783977-c68e-4c7c-a9d0-75c854377da1

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing f

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [10]:
RunDetails(remote_run).show()


## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [11]:
from azureml.core.model import Model
import joblib

from utils import print_model

In [12]:
automl_run, best_automl_model = remote_run.get_output()

In [13]:
print(automl_run)

Run(Experiment: capstone,
Id: AutoML_718264c0-8f6b-4bc7-bad4-a00e4ed47f2a_25,
Type: None,
Status: Completed)


In [14]:
print_model(best_automl_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingclassifier
{'estimators': ['23', '11', '17', '3', '4', '19', '12', '14', '1', '21'],
 'weights': [0.13333333333333333,
             0.06666666666666667,
             0.13333333333333333,
             0.2,
             0.06666666666666667,
             0.06666666666666667,
             0.13333333333333333,
             0.06666666666666667,
             0.06666666666666667,
             0.06666666666666667]}

23 - maxabsscaler
{'copy': True}

23 - sgdclassifierwrapper
{'alpha': 3.8776122448979593,
 'class_weight': 'balanced',
 'eta0': 0.01,
 'fit_intercept': True,
 'l1_ratio': 0.44897959183673464,
 'learning_rate': 'invscaling',
 'loss': 'log',
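
The `prefittedsoftvotingclassifier` entry above lists ten estimators together with weights that sum to 1. In general, soft voting averages the class probabilities predicted by each estimator using such weights and picks the class with the highest averaged probability. The sketch below illustrates that general idea in plain Python. It is not the AutoML implementation, and the probabilities and weights are made up for illustration.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-estimator class probabilities.

    probabilities: one list of class probabilities per estimator.
    weights: one weight per estimator.
    """
    if len(probabilities) != len(weights):
        raise ValueError("Need exactly one weight per estimator")
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total_weight
        for c in range(n_classes)
    ]


# Made-up probabilities for one sample from three estimators: [P(class 0), P(class 1)].
estimator_probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
estimator_weights = [0.2, 0.4, 0.4]

averaged = soft_vote(estimator_probs, estimator_weights)
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, predicted_class)
```

Here the first estimator strongly favours class 0, but the two more heavily weighted estimators favour class 1, so the ensemble predicts class 1.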
 

In [15]:
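# Save the best model (create outputs/ first; joblib.dump does not create directories)
import os
os.makedirs("outputs", exist_ok=True)
joblib.dump(best_automl_model, "outputs/automl_model.joblib")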

In [17]:
model = Model.register(
    workspace=ws,
    model_path="outputs/automl_model.joblib",
    model_name="AutoML_Voting_Ensemble",
    tags={"accuracy": 0.8543},
    description="default_status prediction model"
)

Registering model AutoML_Voting_Ensemble


## Model Deployment

Remember that you only have to deploy one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.
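
The inference configuration and deployment steps named above could be sketched as follows, assuming a deployment to Azure Container Instances. The function `deploy_to_aci`, the service name `loan-default-service`, and the entry script `score.py` are hypothetical placeholders. The resource sizes are illustrative as well.

```python
def deploy_to_aci(ws, model, environment, service_name="loan-default-service",
                  entry_script="score.py"):
    """Deploy a registered model to Azure Container Instances and return the service.

    service_name and entry_script are placeholder values for illustration.
    """
    # Imported lazily so the function can be defined before the SDK is needed.
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AciWebservice

    # The entry script must define init() and run(); the environment lists its dependencies.
    inference_config = InferenceConfig(entry_script=entry_script, environment=environment)
    deployment_config = AciWebservice.deploy_configuration(
        cpu_cores=1, memory_gb=1, enable_app_insights=True
    )
    service = Model.deploy(
        workspace=ws,
        name=service_name,
        models=[model],
        inference_config=inference_config,
        deployment_config=deployment_config,
    )
    service.wait_for_deployment(show_output=True)
    return service
```

With the objects already defined in this notebook, a call might look like `service = deploy_to_aci(ws, model, automl_run.get_environment())`, assuming the environment of the best run is also suitable for inference.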

TODO: In the cell below, send a request to the web service you deployed to test it.
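
A test request to the deployed service could look roughly like the sketch below, which uses only the standard library. The `{"data": [...]}` payload layout is an assumption and has to match whatever the entry script's `run()` function parses. `build_payload` and `score` are hypothetical helper names.

```python
import json
import urllib.request


def build_payload(records):
    """Serialize a list of feature dicts into a JSON request body.

    The {"data": [...]} layout is an assumption; it must match what the
    entry script's run() function expects.
    """
    return json.dumps({"data": records}).encode("utf-8")


def score(scoring_uri, records, api_key=None):
    """POST the records to the web service and return the decoded JSON reply."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = "Bearer " + api_key
    request = urllib.request.Request(
        scoring_uri, data=build_payload(records), headers=headers
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Two feature values copied from the first row of the training preview; a real
# request would carry every feature column the model was trained on.
sample = [{"form_field1": 3398.0, "form_field2": 1.19505}]
print(build_payload(sample))
```

Once a service exists, `score(service.scoring_uri, sample)` would send the request, with `api_key` supplied only if key-based authentication is enabled on the service.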

TODO: In the cell below, print the logs of the web service and delete the service