# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.34.0


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

This dataset contains one row per customer of a telecommunications company, 7043 customers in total. The features describe the customer (gender, SeniorCitizen, Partner, Dependents), the account (tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges) and the subscribed services (PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies). The target variable is 'Churn', which is True when the customer churned and False otherwise.

The task is binary classification: predict Churn from the other columns. The dataset is imbalanced: 1869 of the 7043 customers (26.5%) churned. The column TotalCharges has 11 missing values.


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'telco-customer-churn'
experiment=Experiment(ws, experiment_name)

# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "Customer Churn"
description_text = "Customer Churn DataSet for Udacity Capstone Project"

if key in ws.datasets.keys(): 
        found = True
        dataset = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://raw.githubusercontent.com/srees1988/predict-churn-py/main/customer_churn_data.csv'
        dataset = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset = dataset.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0,7032.0
mean,0.162147,32.371149,64.761692,2283.300441
std,0.368612,24.559481,30.090047,2266.771362
min,0.0,0.0,18.25,18.8
25%,0.0,9.0,35.5,401.45
50%,0.0,29.0,70.35,1397.475
75%,0.0,55.0,89.85,3794.7375
max,1.0,72.0,118.75,8684.8


Check the first five rows:

In [3]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,True,False,1,False,No phone service,DSL,No,...,No,No,No,No,Month-to-month,True,Electronic check,29.85,29.85,False
1,5575-GNVDE,Male,0,False,False,34,True,No,DSL,Yes,...,Yes,No,No,No,One year,False,Mailed check,56.95,1889.5,False
2,3668-QPYBK,Male,0,False,False,2,True,No,DSL,Yes,...,No,No,No,No,Month-to-month,True,Mailed check,53.85,108.15,True
3,7795-CFOCW,Male,0,False,False,45,False,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,False,Bank transfer (automatic),42.3,1840.75,False
4,9237-HQITU,Female,0,False,False,2,True,No,Fiber optic,No,...,No,No,No,No,Month-to-month,True,Electronic check,70.7,151.65,True


The column customerID should be removed because every value in it is unique, so it identifies the row but carries no predictive information:

In [4]:
df['customerID'].nunique() == df.shape[0]

True

In [5]:
df.drop('customerID', axis=1, inplace=True)
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,True,False,1,False,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,True,Electronic check,29.85,29.85,False
1,Male,0,False,False,34,True,No,DSL,Yes,No,Yes,No,No,No,One year,False,Mailed check,56.95,1889.5,False
2,Male,0,False,False,2,True,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,True,Mailed check,53.85,108.15,True
3,Male,0,False,False,45,False,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,False,Bank transfer (automatic),42.3,1840.75,False
4,Female,0,False,False,2,True,No,Fiber optic,No,No,No,No,No,No,Month-to-month,True,Electronic check,70.7,151.65,True


Check the data types present along the columns:

In [6]:
df.dtypes

gender               object
SeniorCitizen         int64
Partner                bool
Dependents             bool
tenure                int64
PhoneService           bool
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling       bool
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                  bool
dtype: object

Let's check the missing values:

In [7]:
# check missing values
df.isnull().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
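
Only TotalCharges has gaps (11 of 7043 rows). AutoML is expected to impute missing values during featurization, but when the data is prepared by hand the two usual options are to drop those rows or to fill them. The sketch below shows both on a toy frame; the values are invented for illustration and are not rows from the real dataset, and the median fill is just one possible choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: hypothetical values, same column names as the notebook
toy = pd.DataFrame({
    "tenure": [0, 12, 34],
    "MonthlyCharges": [20.0, 50.0, 60.0],
    "TotalCharges": [np.nan, 600.0, 2040.0],
})

# Option 1: drop the rows where TotalCharges is missing
dropped = toy.dropna(subset=["TotalCharges"])

# Option 2: keep every row and fill the gap with the column median
median_charge = toy["TotalCharges"].median()
filled = toy.fillna({"TotalCharges": median_charge})

print(len(dropped), filled["TotalCharges"].tolist())
```

Dropping loses 11 rows out of 7043, which is negligible here; filling keeps the row count unchanged at the cost of an invented value.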

Let's check the unique values of the float columns:

In [8]:
float_columns = df.select_dtypes(include=['float64']).columns
print(float_columns)

Index(['MonthlyCharges', 'TotalCharges'], dtype='object')


In [9]:
for column in float_columns:
    print(df[column].value_counts())
    print('\n')

20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
114.75     1
103.60     1
113.40     1
57.65      1
113.30     1
Name: MonthlyCharges, Length: 1585, dtype: int64


20.20      11
19.75       9
19.65       8
20.05       8
19.90       8
           ..
1066.15     1
249.95      1
8333.95     1
7171.70     1
1024.00     1
Name: TotalCharges, Length: 6530, dtype: int64




Let's check the unique values of the bool columns:

In [10]:
bool_columns = df.select_dtypes(include=['bool']).columns
print(bool_columns)

Index(['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn'], dtype='object')


In [11]:
for column in bool_columns:
    print(df[column].value_counts())
    print('\n')

False    3641
True     3402
Name: Partner, dtype: int64


False    4933
True     2110
Name: Dependents, dtype: int64


True     6361
False     682
Name: PhoneService, dtype: int64


True     4171
False    2872
Name: PaperlessBilling, dtype: int64


False    5174
True     1869
Name: Churn, dtype: int64




In [12]:
round(5174 / 1869, 2)

2.77

The variable Churn is imbalanced: False outnumbers True by a factor of about 2.77.
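
Using the Churn counts printed above (5174 False, 1869 True), the imbalance can be turned into a baseline: a model that always predicts False is already right most of the time, which is why accuracy alone is a weak yardstick for this task. A minimal sketch of that arithmetic (the variable names are only illustrative):

```python
# Churn counts taken from the value_counts() output above
n_false = 5174
n_true = 1869
total = n_false + n_true

imbalance_ratio = n_false / n_true    # non-churners per churner
churn_rate = n_true / total           # share of positive examples
baseline_accuracy = n_false / total   # accuracy of always predicting False

print(f"ratio={imbalance_ratio:.2f}, "
      f"churn rate={churn_rate:.1%}, "
      f"baseline accuracy={baseline_accuracy:.1%}")
```

Any trained model should be judged against that baseline accuracy rather than against 50%.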

Let's check the unique values of the object columns:

In [13]:
object_columns = df.select_dtypes(include=['object']).columns
print(object_columns)

Index(['gender', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaymentMethod'],
      dtype='object')


In [14]:
for column in object_columns:
    print(df[column].value_counts())
    print('\n')

Male      3555
Female    3488
Name: gender, dtype: int64


No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64


Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64


No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64


No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64


No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64


No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64


No                     2810
Yes                    2707
No internet service    1526
Name: StreamingTV, dtype: int64


No                     2785
Yes                    2732
No internet service    1526
Name: StreamingMovies, dtype: int64


Month-to-mo

## Train Test Splitting

In [None]:
from azureml.core import Datastore
from azureml.data.data_reference import DataReference
from azureml.exceptions import UserErrorException

blob_datastore_name='data_storage'
account_name=os.getenv("BLOB_ACCOUNTNAME_62", "mlstrg162961") # Storage account name
container_name=os.getenv("BLOB_CONTAINER_62", "azureml") # Name of Azure blob container
# Read the storage account key from the environment; never hard-code it in the notebook
account_key=os.getenv("BLOB_ACCOUNT_KEY_62") # Storage account key

try:
    blob_datastore = Datastore.get(ws, blob_datastore_name)
    print("Found Blob Datastore with name: %s" % blob_datastore_name)
except UserErrorException:
    blob_datastore = Datastore.register_azure_blob_container(
                        workspace=ws,
                        datastore_name=blob_datastore_name,
                        account_name=account_name, # Storage account name
                        container_name=container_name, # Name of Azure blob container
                        account_key=account_key) # Storage account key

    print("Registered blob datastore with name: %s" % blob_datastore_name)

blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="blob_test_data",
    path_on_datastore="testdata")

In [34]:
from sklearn.model_selection import train_test_split
from azureml.data.dataset_factory import TabularDatasetFactory
description_text = "Train and Test splitting from Customer Churn DataSet for Udacity Capstone Project"
datastore_split_name = 'data_splitted'

churn = df['Churn']

# Split data into train and test data taking into account the variable Churn is highly skewed:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, stratify=churn, random_state=42)

# Create one local folder per split so that each upload contains only its own file
os.makedirs('train', exist_ok=True)
os.makedirs('test', exist_ok=True)

# Export data as csv inside those folders
train_dataset.to_csv("train/train_data.csv", index=False)
test_dataset.to_csv("test/test_data.csv", index=False)

# Upload data to the workspace's default datastore.
# The original call, ws.get(ws, datastore_split_name), raised
# "AttributeError: 'str' object has no attribute '_get_service_client'":
# Workspace.get looks up a workspace, not a datastore, and its second
# positional argument is interpreted as the auth object.
datastore = ws.get_default_datastore()
datastore.upload(src_dir='./train', target_path=experiment_name, overwrite=True)
datastore.upload(src_dir='./test', target_path=experiment_name, overwrite=True)
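
The split above passes stratify=churn so that the train and test sets keep the same share of churners as the full dataset. The sketch below illustrates that idea in plain Python; it is not scikit-learn's implementation, and the helper name `stratified_split` is hypothetical. The labels are synthetic but use the same class counts as the Churn column.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the label ratio preserved in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the same class counts as the Churn column
labels = [False] * 5174 + [True] * 1869
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 4), round(test_rate, 4))
```

Sampling each class separately is what keeps the two churn rates nearly identical; a plain random split would let them drift apart by chance.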

In [32]:
# The files are uploaded under the experiment_name folder, so the path must include it.
# Pointing at the bare file name fails with DatasetValidationError
# (NotFoundException: "Found no resources for the input provided").
dataset_train = Dataset.Tabular.from_delimited_files(path=[(datastore, experiment_name + "/train_data.csv")])
dataset_test = Dataset.Tabular.from_delimited_files(path=[(datastore, experiment_name + "/test_data.csv")])

## Cluster Provisioning


In [22]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# TODO: Create compute cluster
# Use vm_size = "Standard_D2_V2" in your provisioning configuration.
# max_nodes should be no greater than 4.

### YOUR CODE HERE ###

cluster_name = "cluster-vhcg"
# verify that the cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name = cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D2_V2', max_nodes = 4, idle_seconds_before_scaledown=120)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [23]:
# TODO: Put your automl config here
# training_data and test_data must be TabularDataset objects. Passing the pandas
# DataFrames train_dataset / test_dataset raises ConfigException (InvalidInputDatatype).
automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    task='classification',
    training_data=dataset_train,
    test_data=dataset_test,
    label_column_name='Churn',
    # n_cross_validations=10,
    # validation_size=0.2,
    primary_metric='AUC_weighted',
    experiment_timeout_minutes=30,
    max_concurrent_iterations=4,  # should not exceed the cluster's max_nodes (4)
    max_cores_per_iteration=-1,
)

In [24]:
# TODO: Submit your experiment
from azureml.widgets import RunDetails
run = experiment.submit(config=automl_config, show_output=True)
RunDetails(run).show()
run.wait_for_completion()
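
The configuration above sets primary_metric='AUC_weighted'. Per the Azure ML documentation, this is the mean of the per-class AUC scores, weighted by the number of true instances of each class. The sketch below computes it in plain Python for a binary problem; the helper names and the toy scores are made up for illustration, and this is not AutoML's own code.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_weighted(labels, scores):
    """One-vs-rest AUC per class, averaged with class counts as weights (binary case)."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    auc_pos = auc(labels, scores)                                    # class True vs rest
    auc_neg = auc([not y for y in labels], [1 - s for s in scores])  # class False vs rest
    return (n_pos * auc_pos + n_neg * auc_neg) / (n_pos + n_neg)

# Toy predictions: hypothetical churn probabilities for four customers
y_true = [False, False, True, True]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_score), auc_weighted(y_true, y_score))
```

In a binary problem both one-vs-rest scores coincide, so AUC_weighted equals the ordinary AUC. Unlike accuracy, it is not inflated by always predicting the majority class, which suits the imbalanced Churn target.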

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [19]:
# Retrieve and save your best automl model.

best_automl_run, fitted_automl_model = run.get_output()
print(best_automl_run)

print("Best run metrics: ")
best_automl_run.get_metrics()
fitted_automl_model

Run(Experiment: telco-customer-churn,
Id: AutoML_14e70e02-a31c-4dc2-88df-f660af838a84_34,
Type: azureml.scriptrun,
Status: Completed)
Best run metrics: 


PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mn...
), random_state=0, reg_alpha=0, reg_lambda=0.625, subsample=0.8, tree_method='auto'))], verbose=False)), ('25', Pipeline(memory=None, steps=[('truncatedsvdwrapper', TruncatedSVDWrapper(n_components=0.3068421052631579, random_state=None)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='entropy', max_depth=None, max_features=0.1, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=0.3194736842105263, min_samples_split=0.6

In [20]:
#TODO: Save the best model


#print(best_automl_run.get_file_names())

# Collect every pickle file attached to the best run and download it locally
models = [element for element in best_automl_run.get_file_names() if 'pkl' in element]
for model in models:
    best_automl_run.download_file(name=model)

## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [None]:
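# NOTE: pipeline_run is not defined earlier in this notebook; it must be a completed
# azureml.pipeline.core.PipelineRun before this cell can run.
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")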

published_pipeline

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
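import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Authenticate interactively and build the header expected by the REST endpoint
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )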


try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

TODO: In the cell below, print the logs of the web service and delete the service

In [None]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()