# Automated ML

Import the dependencies needed to complete the project.

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.34.0


## Dataset

### Overview
This project uses a Telco customer churn dataset, read from an external CSV file hosted on GitHub. It contains 7043 customer records. Each record describes the customer's demographics (gender, SeniorCitizen, Partner, Dependents), account information (tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges), and subscribed services (PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies).


The task is binary classification: predict the target column 'Churn', which is True for customers who churned and False otherwise. The classes are imbalanced: 1869 of the 7043 customers (about 26.5%) churned.


Get the data. The cell below loads the dataset from the workspace if it is already registered; otherwise it reads the external CSV file and registers it.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'telco-customer-churn'
experiment=Experiment(ws, experiment_name)

# Load the dataset from the Workspace if it is already registered; otherwise create it from the file
# NOTE: update the key to match the dataset name
key = "Customer Churn"
description_text = "Customer Churn DataSet for Udacity Capstone Project"

if key in ws.datasets.keys():
    dataset = ws.datasets[key]
else:
    # Create the AML Dataset and register it in the Workspace
    example_data = 'https://raw.githubusercontent.com/srees1988/predict-churn-py/main/customer_churn_data.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7032.0 |
| mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
| std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
| min | 0.0 | 0.0 | 18.25 | 18.8 |
| 25% | 0.0 | 9.0 | 35.5 | 401.45 |
| 50% | 0.0 | 29.0 | 70.35 | 1397.475 |
| 75% | 0.0 | 55.0 | 89.85 | 3794.7375 |
| max | 1.0 | 72.0 | 118.75 | 8684.8 |


Check the first five rows:

In [3]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,True,False,1,False,No phone service,DSL,No,...,No,No,No,No,Month-to-month,True,Electronic check,29.85,29.85,False
1,5575-GNVDE,Male,0,False,False,34,True,No,DSL,Yes,...,Yes,No,No,No,One year,False,Mailed check,56.95,1889.5,False
2,3668-QPYBK,Male,0,False,False,2,True,No,DSL,Yes,...,No,No,No,No,Month-to-month,True,Mailed check,53.85,108.15,True
3,7795-CFOCW,Male,0,False,False,45,False,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,False,Bank transfer (automatic),42.3,1840.75,False
4,9237-HQITU,Female,0,False,False,2,True,No,Fiber optic,No,...,No,No,No,No,Month-to-month,True,Electronic check,70.7,151.65,True


The customerID column can be dropped: it holds a different value in every row, so it carries no predictive information.

In [5]:
df['customerID'].nunique() == df.shape[0]

True

In [6]:
df.drop('customerID', axis=1, inplace=True)
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,True,False,1,False,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,True,Electronic check,29.85,29.85,False
1,Male,0,False,False,34,True,No,DSL,Yes,No,Yes,No,No,No,One year,False,Mailed check,56.95,1889.5,False
2,Male,0,False,False,2,True,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,True,Mailed check,53.85,108.15,True
3,Male,0,False,False,45,False,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,False,Bank transfer (automatic),42.3,1840.75,False
4,Female,0,False,False,2,True,No,Fiber optic,No,No,No,No,No,No,Month-to-month,True,Electronic check,70.7,151.65,True


Check the data type of each column:

In [7]:
df.dtypes

gender               object
SeniorCitizen         int64
Partner                bool
Dependents             bool
tenure                int64
PhoneService           bool
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling       bool
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                  bool
dtype: object

Let's check the missing values:

In [8]:
# check missing values
df.isnull().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

Let's check the unique values of the float columns:

In [9]:
float_columns = df.select_dtypes(include=['float64']).columns
print(float_columns)

Index(['MonthlyCharges', 'TotalCharges'], dtype='object')


In [10]:
for column in float_columns:
    print(df[column].value_counts())
    print('\n')

20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
114.75     1
103.60     1
113.40     1
57.65      1
113.30     1
Name: MonthlyCharges, Length: 1585, dtype: int64


20.20      11
19.75       9
19.65       8
20.05       8
19.90       8
           ..
1066.15     1
249.95      1
8333.95     1
7171.70     1
1024.00     1
Name: TotalCharges, Length: 6530, dtype: int64




Let's check the unique values of the bool columns:

In [11]:
bool_columns = df.select_dtypes(include=['bool']).columns
print(bool_columns)

Index(['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn'], dtype='object')


In [12]:
for column in bool_columns:
    print(df[column].value_counts())
    print('\n')

False    3641
True     3402
Name: Partner, dtype: int64


False    4933
True     2110
Name: Dependents, dtype: int64


True     6361
False     682
Name: PhoneService, dtype: int64


True     4171
False    2872
Name: PaperlessBilling, dtype: int64


False    5174
True     1869
Name: Churn, dtype: int64




In [12]:
round(5174 / 1869, 2)

2.77

The Churn variable is imbalanced: False outnumbers True by a factor of about 2.77 (5174 vs. 1869).
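
The imbalance can be quantified directly from the Churn value counts printed above. The sketch below uses only those two counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the Churn value counts above
n_false, n_true = 5174, 1869
total = n_false + n_true

imbalance_ratio = n_false / n_true     # majority-to-minority ratio
churn_share = n_true / total           # fraction of customers who churned
majority_baseline = n_false / total    # accuracy of always predicting False

print(f"ratio: {imbalance_ratio:.2f}")
print(f"churn share: {churn_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The baseline is the number to keep in mind later: a model that always predicts False would already be right for about 73.5% of customers, so accuracy alone says little about how well churners are detected.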

Let's check the unique values of the object columns:

In [13]:
object_columns = df.select_dtypes(include=['object']).columns
print(object_columns)

Index(['gender', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaymentMethod'],
      dtype='object')


In [14]:
for column in object_columns:
    print(df[column].value_counts())
    print('\n')

Male      3555
Female    3488
Name: gender, dtype: int64


No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64


Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64


No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64


No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64


No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64


No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64


No                     2810
Yes                    2707
No internet service    1526
Name: StreamingTV, dtype: int64


No                     2785
Yes                    2732
No internet service    1526
Name: StreamingMovies, dtype: int64


Month-to-mo

## Train Test Splitting

In [15]:
from sklearn.model_selection import train_test_split
from azureml.data.dataset_factory import TabularDatasetFactory
description_text = "Train and Test splitting from Customer Churn DataSet for Udacity Capstone Project"

churn = df['Churn']

# Split into train and test sets, stratifying on Churn so both splits keep the same class proportions:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, stratify=churn, random_state=42)

# Create the local output folders if they do not exist yet
os.makedirs('train', exist_ok=True)
os.makedirs('test', exist_ok=True)

# Export data as csv
train_dataset.to_csv("./train/train_data.csv", index=False)
test_dataset.to_csv("./test/test_data.csv", index=False)

# Upload data to the datastore
datastore = ws.get_default_datastore()
datastore.upload(src_dir='./train', target_path = experiment_name)
datastore.upload(src_dir='./test', target_path = experiment_name)
print('Data uploaded to DataStore')

csv_path_train = [(datastore, experiment_name+'/train_data.csv')]
csv_path_test = [(datastore, experiment_name+'/test_data.csv')]

train_data = Dataset.Tabular.from_delimited_files(path=csv_path_train)
test_data = Dataset.Tabular.from_delimited_files(path=csv_path_test)

display(train_data.to_pandas_dataframe().head())
display(test_data.to_pandas_dataframe().head())

Uploading an estimated of 1 files
Uploading ./train/train_data.csv
Uploaded ./train/train_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading ./test/test_data.csv
Uploaded ./test/test_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Data uploaded to DataStore


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,0,False,False,35,False,No phone service,DSL,No,No,Yes,No,Yes,Yes,Month-to-month,False,Electronic check,49.2,1701.65,False
1,Male,0,True,True,15,True,No,Fiber optic,Yes,No,No,No,No,No,Month-to-month,False,Mailed check,75.1,1151.55,False
2,Male,0,True,True,13,False,No phone service,DSL,Yes,Yes,No,Yes,No,No,Two year,False,Mailed check,40.55,590.35,False
3,Female,0,True,False,26,True,No,DSL,No,Yes,Yes,No,Yes,Yes,Two year,True,Credit card (automatic),73.5,1905.7,False
4,Male,0,True,True,1,True,No,DSL,No,No,No,No,No,No,Month-to-month,False,Electronic check,44.55,44.55,False


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,0,True,True,72,True,Yes,Fiber optic,Yes,Yes,Yes,Yes,Yes,Yes,Two year,True,Credit card (automatic),114.05,8468.2,False
1,Female,1,False,False,8,True,Yes,Fiber optic,No,No,No,Yes,Yes,Yes,Month-to-month,True,Credit card (automatic),100.15,908.55,False
2,Female,0,True,True,41,True,Yes,DSL,Yes,Yes,Yes,No,Yes,No,One year,True,Credit card (automatic),78.35,3211.2,False
3,Male,0,True,False,18,True,No,Fiber optic,No,No,Yes,Yes,No,No,Month-to-month,False,Electronic check,78.2,1468.75,False
4,Female,0,True,False,72,True,Yes,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,True,Credit card (automatic),82.65,5919.35,False
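
To illustrate what stratification does, here is a minimal standard-library sketch of a stratified split. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper, and the labels are synthesized from the Churn counts shown earlier rather than read from the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def churn_share(labels, indices):
    """Fraction of the selected rows whose label is True."""
    return sum(labels[i] for i in indices) / len(indices)


# Synthetic labels with the same class counts as the Churn column
labels = [False] * 5174 + [True] * 1869
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(round(churn_share(labels, train_idx), 4), round(churn_share(labels, test_idx), 4))
```

Because each class is split separately, the churn share stays at roughly 26.5% in both parts, whereas a purely random split could drift away from it by chance.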


## Cluster Provisioning


In [16]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Create the compute cluster: Standard_D2_V2 VMs, at most 4 nodes
cluster_name = "cluster-vhcg"
# Reuse the cluster if it already exists; otherwise provision it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name = cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D2_V2', max_nodes = 4, idle_seconds_before_scaledown=120)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress.....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

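The run is set up as a classification task on the 'Churn' label. AUC_weighted is used as the primary metric because the classes are imbalanced, which makes plain accuracy easy to inflate by favoring the majority class. The experiment is capped at 60 minutes to bound compute time, featurization is left on 'auto' so that AutoML encodes the categorical columns and imputes the missing TotalCharges values, and the held-out split is passed as test_data.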

In [17]:
automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    task='classification',
    training_data=train_data,
    test_data=test_data,
    label_column_name='Churn',
    # n_cross_validations=10,
    # validation_size=0.2,
    primary_metric='AUC_weighted',
    experiment_timeout_minutes=60,
    # One iteration runs per node, and the cluster has at most 4 nodes
    max_concurrent_iterations=4,
    max_cores_per_iteration=-1,  # use all available cores on each node
    featurization='auto',
    debug_log="automl_errors.log",
)

## Create AutoML Pipeline

In [18]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create AutoMLStep

In [19]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [20]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [21]:
pipeline_run = experiment.submit(pipeline)



Created step automl_module [367671c7][e8d16878-3d67-4cd3-899f-565f915c3dc3], (This step will run and generate new outputs)
Submitted PipelineRun 1e5eb58c-af82-4e02-99a1-c5df988e69f4
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/1e5eb58c-af82-4e02-99a1-c5df988e69f4?wsid=/subscriptions/d7f39349-a66b-446e-aba6-0053c2cf1c11/resourcegroups/aml-quickstarts-165078/workspaces/quick-starts-ws-165078&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254


In [22]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [23]:
pipeline_run.wait_for_completion()

PipelineRunId: 1e5eb58c-af82-4e02-99a1-c5df988e69f4
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/1e5eb58c-af82-4e02-99a1-c5df988e69f4?wsid=/subscriptions/d7f39349-a66b-446e-aba6-0053c2cf1c11/resourcegroups/aml-quickstarts-165078/workspaces/quick-starts-ws-165078&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: c642055b-859c-4740-b5c3-5b82fa54d07b
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/c642055b-859c-4740-b5c3-5b82fa54d07b?wsid=/subscriptions/d7f39349-a66b-446e-aba6-0053c2cf1c11/resourcegroups/aml-quickstarts-165078/workspaces/quick-starts-ws-165078&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished

No scores improved over last 20 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLC

'Finished'
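
The early-stopping message in the log describes a patience rule: the search ends once the best score has not improved for a fixed number of iterations. The sketch below illustrates that idea only. It is not AutoML's actual implementation, and `early_stop_index` and the score sequence are made up for the example.

```python
def early_stop_index(scores, patience=20):
    """Return the iteration at which the search would stop, or None if it never triggers."""
    best = float("-inf")
    since_improvement = 0
    for i, score in enumerate(scores):
        if score > best:
            best = score
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= patience:
            return i
    return None


# Hypothetical primary-metric values, one per iteration
scores = [0.80, 0.83, 0.85, 0.84, 0.85, 0.82, 0.84]

print(early_stop_index(scores, patience=3))   # stops after three iterations without a new best
print(early_stop_index(scores, patience=20))  # patience never exhausted on this short run
```

Note that a score equal to the current best does not reset the counter in this sketch; whether AutoML treats ties that way is not stated in the log.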

## Examine Results

Retrieve the metrics of all child runs

In [24]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/c642055b-859c-4740-b5c3-5b82fa54d07b/metrics_data
Downloaded azureml/c642055b-859c-4740-b5c3-5b82fa54d07b/metrics_data, 1 files out of an estimated total of 1


In [25]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,c642055b-859c-4740-b5c3-5b82fa54d07b_0,c642055b-859c-4740-b5c3-5b82fa54d07b_5,c642055b-859c-4740-b5c3-5b82fa54d07b_1,c642055b-859c-4740-b5c3-5b82fa54d07b_20,c642055b-859c-4740-b5c3-5b82fa54d07b_17,c642055b-859c-4740-b5c3-5b82fa54d07b_21,c642055b-859c-4740-b5c3-5b82fa54d07b_27,c642055b-859c-4740-b5c3-5b82fa54d07b_25,c642055b-859c-4740-b5c3-5b82fa54d07b_23,c642055b-859c-4740-b5c3-5b82fa54d07b_29,...,c642055b-859c-4740-b5c3-5b82fa54d07b_10,c642055b-859c-4740-b5c3-5b82fa54d07b_13,c642055b-859c-4740-b5c3-5b82fa54d07b_19,c642055b-859c-4740-b5c3-5b82fa54d07b_16,c642055b-859c-4740-b5c3-5b82fa54d07b_14,c642055b-859c-4740-b5c3-5b82fa54d07b_18,c642055b-859c-4740-b5c3-5b82fa54d07b_42,c642055b-859c-4740-b5c3-5b82fa54d07b_47,c642055b-859c-4740-b5c3-5b82fa54d07b_11,c642055b-859c-4740-b5c3-5b82fa54d07b_48
AUC_macro,[0.838953857621382],[0.8414934901142965],[0.848425304715561],[0.846403521671539],[0.8423503188491148],[0.8377141471234718],[0.8486629840327548],[0.8490719838275784],[0.8402348244661709],[0.850954480731336],...,[0.8470955878583365],[0.8332053376851004],[0.8358196693717007],[0.846147235882054],[0.8256474326541424],[0.8416784953856767],[0.8285683924741276],[0.8530957278732277],[0.841186228442678],[0.8526808605931809]
f1_score_weighted,[0.7871985268450351],[0.789909259133715],[0.7954938901990269],[0.8004033348441016],[0.75488601090447],[0.7885545920473428],[0.7907828829073248],[0.7912873933922929],[0.7527352004007893],[0.8023688075370942],...,[0.7914976777261207],[0.7585539233393281],[0.7614292268012002],[0.7649483649472031],[0.777013418380122],[0.7904268374724143],[0.7754807332379098],[0.8003092030148776],[0.755164071724895],[0.6834624633111809]
norm_macro_recall,[0.4143429580870414],[0.4093904046544225],[0.43150047203585246],[0.4523702315414911],[0.5193260723917461],[0.40632328406867474],[0.41222728144417614],[0.412799967562977],[0.5243614804619947],[0.4512372807694878],...,[0.40072621586919516],[0.290704732966025],[0.29572801881845745],[0.5360146505345588],[0.3784489603467458],[0.41318843295243246],[0.4204217221545789],[0.4545168091834398],[0.5135935912553644],[0.13761640363881109]
average_precision_score_macro,[0.7929736319880424],[0.7912162167502309],[0.801840145693317],[0.7962236541138634],[0.7946244979064989],[0.7902999143676842],[0.8008036218967685],[0.8008606049041189],[0.7888261909950836],[0.8053362245158393],...,[0.8010323517544314],[0.7784924003178052],[0.783707961127932],[0.7955233312054073],[0.773180455410834],[0.7933484546289225],[0.773826404708989],[0.8068577575965833],[0.7897788169785386],[0.8063902467455618]
precision_score_weighted,[0.785930522402627],[0.7888228820595868],[0.794543712657206],[0.7991389524510518],[0.8030468100409901],[0.7874329828710795],[0.7905885671301162],[0.7911190408489075],[0.8050293265022134],[0.8019481959482103],...,[0.793324621475964],[0.7720419084932231],[0.7779399714035017],[0.8079878746120716],[0.7754218807929726],[0.789745446633018],[0.7757968797132504],[0.7994493557287651],[0.7999250759959695],[0.630879931900937]
AUC_micro,[0.8848453762584763],[0.8868478048422724],[0.8912738553351911],[0.8896896902005661],[0.8347418476668546],[0.8845918947149949],[0.8911039697742633],[0.8913373202331956],[0.8288493941583184],[0.8931171727413396],...,[0.890161400261534],[0.8773475396412245],[0.8786131516280848],[0.8390822229102706],[0.8753740317704429],[0.8871081857715385],[0.872642631096836],[0.8939994435125547],[0.831113526841257],[0.8802467938648771]
balanced_accuracy,[0.7071714790435207],[0.7046952023272112],[0.7157502360179263],[0.7261851157707455],[0.759663036195873],[0.7031616420343374],[0.706113640722088],[0.7063999837814885],[0.7621807402309972],[0.7256186403847439],...,[0.7003631079345976],[0.6453523664830125],[0.6478640094092287],[0.7680073252672793],[0.6892244801733729],[0.7065942164762161],[0.7102108610772895],[0.72725840459172],[0.7567967956276821],[0.5688082018194055]
average_precision_score_weighted,[0.8590503064057925],[0.8580808403874957],[0.8659209605128874],[0.8620072587120765],[0.8597515089426565],[0.8569780025494534],[0.8652313088909843],[0.8649896741477542],[0.8573204551517772],[0.8680557685975848],...,[0.8646605080061894],[0.8501683703178796],[0.8533680909137774],[0.86165496698425],[0.8442691644322108],[0.8594289866134717],[0.8460919121864029],[0.8693885042632309],[0.857907843871342],[0.8690780540126696]
precision_score_macro,[0.7395376697971479],[0.7479671297292411],[0.7526845676526924],[0.7554225032121312],[0.7091277831049135],[0.7460040449827158],[0.7505057708881623],[0.7515128778941781],[0.7090669888723474],[0.761787938748761],...,[0.7601131040986129],[0.7469827789806441],[0.757302309611735],[0.7164606954308589],[0.7290197804107432],[0.7483387205786952],[0.7137175362107119],[0.7550329870431168],[0.7068100944898074],[0.5077237273279162]
f1_score_micro,[0.794462193823216],[0.7992545260915868],[0.8031593894213701],[0.805999290024849],[0.7412140575079872],[0.7980120695775649],[0.8003194888178914],[0.8010294639687611],[0.7383741569045084],[0.8091941782037627],...,[0.80386936457224],[0.7854100106496272],[0.7889598864039757],[0.7520411785587505],[0.7868299609513668],[0.7992545260915868],[0.7763578274760383],[0.8056443024494143],[0.741569045083422],[0.7626908058217962]
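
The deserialized metrics map each child run id to a dictionary of metrics, with every value wrapped in a one-element list. A minimal sketch of ranking runs on that structure follows. It uses only four AUC_macro values copied from the table above, so the winner here is the best of this subset, not necessarily of the whole run; AUC_macro stands in for the primary metric because AUC_weighted is not visible in the truncated output.

```python
prefix = "c642055b-859c-4740-b5c3-5b82fa54d07b"

# Subset of the metrics output, in the same {run_id: {metric: [value]}} shape
metrics = {
    f"{prefix}_0": {"AUC_macro": [0.838953857621382]},
    f"{prefix}_29": {"AUC_macro": [0.850954480731336]},
    f"{prefix}_47": {"AUC_macro": [0.8530957278732277]},
    f"{prefix}_48": {"AUC_macro": [0.8526808605931809]},
}


def rank_runs(metrics, metric_name):
    """Return (run_id, value) pairs sorted from best to worst for one metric."""
    pairs = [(run_id, values[metric_name][0]) for run_id, values in metrics.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_runs(metrics, "AUC_macro")
best_run_id, best_value = ranking[0]
print(best_run_id, round(best_value, 4))
```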



Retrieve the Best Model

In [26]:

# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/c642055b-859c-4740-b5c3-5b82fa54d07b/model_data
Downloaded azureml/c642055b-859c-4740-b5c3-5b82fa54d07b/model_data, 1 files out of an estimated total of 1


In [27]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mn...
    gpu_training_param_dict={'processing_unit_type': 'cpu'}
), random_state=None))], verbose=False))], flatten_transform=None, weights=[0.2, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.13333333333333333, 0.13333333333333333, 0.06666666666666667, 0.06666666666666667]))],
                                       'verbose': False},
                             y_transformer={},
                             y_transformer_name='LabelEncoder')
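
The `weights` list in the output belongs to a soft-voting ensemble: each fitted estimator predicts class probabilities, and the ensemble takes their weighted average. The sketch below reuses the eleven weights printed above, while the per-estimator churn probabilities are hypothetical values chosen for illustration.

```python
# Ensemble weights copied from the best_model output above
weights = [0.2, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667,
           0.06666666666666667, 0.06666666666666667, 0.06666666666666667,
           0.13333333333333333, 0.13333333333333333,
           0.06666666666666667, 0.06666666666666667]

# Hypothetical P(Churn=True) from each of the eleven estimators for one customer
churn_proba = [0.9, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6, 0.6, 0.4, 0.4]

# Soft voting: weighted average of the individual probabilities
weighted = sum(w * p for w, p in zip(weights, churn_proba)) / sum(weights)
unweighted = sum(churn_proba) / len(churn_proba)

print(f"weighted vote:   {weighted:.4f} -> predicts churn: {weighted >= 0.5}")
print(f"unweighted mean: {unweighted:.4f} -> predicts churn: {unweighted >= 0.5}")
```

With these made-up probabilities the weighted vote crosses 0.5 while the plain mean does not, which shows how the more heavily weighted estimators can decide the outcome.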

In [28]:
best_model.steps

[('datatransformer',
  DataTransformer(
      task='classification',
      is_onnx_compatible=False,
      enable_feature_sweeping=True,
      enable_dnn=False,
      force_text_dnn=False,
      feature_sweeping_timeout=86400,
      featurization_config=None,
      is_cross_validation=True,
      feature_sweeping_config={}
  )),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(
      estimators=[('29', Pipeline(
          memory=None,
          steps=[('standardscalerwrapper', StandardScalerWrapper(
              copy=True,
              with_mean=False,
              with_std=False
          )), ('lightgbmclassifier', LightGBMClassifier(
              boosting_type='goss',
              colsample_bytree=0.7922222222222222,
              learning_rate=0.0842121052631579,
              max_bin=140,
              max_depth=6,
              min_child_weight=8,
              min_data_in_leaf=0.024145517241379314,
              min_split_gain=0.7368421052631579,
          

## Test the model

Load Test Data

In [29]:
dataset_test = test_data
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['Churn'])]

y_test = df_test['Churn']
X_test = df_test.drop(['Churn'], axis=1)

Testing Our Best Fitted Model

In [30]:
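from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)
accuracy = accuracy_score(y_test, ypred)
# NOTE: ypred holds hard class labels, not probabilities, so this value equals
# the balanced accuracy rather than the ranking-based AUC that AutoML optimizes
auc = roc_auc_score(y_test, ypred)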
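
AUC measures how well scores rank positives above negatives, so it is normally computed from predicted probabilities. When hard labels are passed instead, the result collapses to the balanced accuracy at that one threshold. The sketch below shows the difference on a tiny made-up example; `roc_auc` is a hypothetical rank-based helper, not scikit-learn's function.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for seven customers
y_true = [0, 0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.45, 0.6, 0.4, 0.7, 0.9]
hard_labels = [int(p >= 0.5) for p in proba]

auc_scores = roc_auc(y_true, proba)        # uses the full ranking
auc_labels = roc_auc(y_true, hard_labels)  # only sees the 0.5 threshold

print(round(auc_scores, 4), round(auc_labels, 4))
```

In the notebook, the analogous change would be to pass the positive-class probabilities to `roc_auc_score`, for example taken from `best_model.predict_proba(X_test)`, assuming the fitted pipeline exposes that method as scikit-learn classifiers do.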


In [31]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

| | Predicted False (0) | Predicted True (1) |
|---|---|---|
| Actual False (0) | 920 | 115 |
| Actual True (1) | 163 | 211 |


In [32]:
print("Accuracy score is: ", accuracy)
print("AUC score is: ", auc)

Accuracy score is:  0.8026969481902059
AUC score is:  0.7265300059417706
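
For an imbalanced target, the per-class rates behind these two scores are more informative than the scores themselves. The sketch below derives them from the four confusion-matrix counts reported above, relying on scikit-learn's convention that rows are actual classes and columns are predictions.

```python
# Confusion-matrix counts from the output above
tn, fp = 920, 115   # actual False: predicted False / predicted True
fn, tp = 163, 211   # actual True:  predicted False / predicted True

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of churners that were caught
specificity = tn / (tn + fp)     # share of non-churners correctly identified
precision = tp / (tp + fp)       # share of churn predictions that were right
balanced_accuracy = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("specificity", specificity), ("precision", precision),
                    ("balanced accuracy", balanced_accuracy)]:
    print(f"{name}: {value:.4f}")
```

Recall is about 0.56, so the model misses roughly 44% of the churners in the test set even though overall accuracy is 0.80. The balanced accuracy works out to 0.7265, the same value as the AUC score printed above, which is what `roc_auc_score` returns when it receives hard labels.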



## Publish and run from REST endpoint

In [40]:

published_pipeline = pipeline_run.publish_pipeline(
    name="Customer Churn Train", description="Training Customer Churn pipeline", version="1.0")

published_pipeline

| Name | Id | Status | Endpoint |
|---|---|---|---|
| Customer Churn Train | 9ae34728-365f-4a77-b9dc-d5b12b57e791 | Active | REST Endpoint |


In [41]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

In [42]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [43]:
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    # Re-raise with the full response details, keeping the original error as the cause
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content)) from err

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  ab0b1b46-3654-430b-8d58-f91b266fc5ab


In [44]:

from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …