# Churn AutoML with batch transform and Clarify

Steve Goodman
Feb 2021


This is an MVP for building a model quickly, using a simple dataset that simulates telco churn. The other notebook in this repo, which comes from the AWS examples, is more comprehensive and better explained, but it misses a few key points:
1. We normally have a row identifier, such as a customer id, when scoring the data.
2. Explainability is important.

We are going to:
1. Download the dataset and make some small data prep changes to it.
2. Run Autopilot.
3. Evaluate the best model using the AUC metric.
4. Use Clarify to explain which features are most important.
5. Run a batch transform job on the test dataset to generate predictions.
6. Evaluate the predictions.


TODOs:
* align the user ids and figure out whether column headers are allowed in the test data or whether we have to add them later
* check whether the y column can or can't be in the test data
* clean up the directories so everything is in the same project, in subdirectories inputs/training/test, inference/output


In [None]:
Resources

Links to posts about Autopilot, Clarify

https://github.com/aws/amazon-sagemaker-examples/blob/master/autopilot/autopilot_customer_churn_high_level_with_evaluation.ipynb


https://aws.amazon.com/blogs/machine-learning/explaining-amazon-sagemaker-autopilot-models-with-shap/

https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf

https://aws.amazon.com/blogs/machine-learning/creating-high-quality-machine-learning-models-for-financial-services-using-amazon-sagemaker-autopilot/

https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/


In [2]:
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split
import boto3
import sagemaker  
from sagemaker.automl.automl import AutoML
from time import gmtime, strftime, sleep
from sagemaker import get_execution_role




In [3]:
sm = boto3.client('sagemaker')
session = sagemaker.Session()
bucket = session.default_bucket()
role = get_execution_role()


In [4]:
auto_ml_job_name = 'automl-churnv4' # Used as the job name and in the S3 prefix; job names may contain dashes but not underscores
prefix = 'sagemaker/' + auto_ml_job_name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
base_job_name = 'automl-card-churn' #redundant 
target_attribute_name = 'churn_flag'
model_name = 'automl-churn-model-' + timestamp_suffix
transform_job_name = "automl-churn-model-transform"

test_data_s3_path = f's3://{bucket}/{prefix}/test/'
s3_transform_output_path = f's3://{bucket}/{prefix}/inference-results/'



# Step 1 - Get datasets 
This is not a modelling task. It would ordinarily be done in a separate preprocessing step, e.g. to pull data from the data lake or a database.
Some key things to note:
1. There is no customer id in the dataset, so we use the telephone number (which is unique) as a proxy.
2. The training dataset needs a target column (churn flag), but this will ordinarily be absent from the scoring data, so we drop it from our 'test' dataset. The test set here is used to simulate a scoring job.
3. The training dataset should have column headers in the first row; the scoring dataset should not (this appears to be how batch transform works).
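
The two file layouts described in points 2 and 3 can be sketched with the standard library alone. The ids, column subset, and values below are invented for illustration, and the helper names are hypothetical; the notebook itself does this with pandas further down.

```python
import csv
import io

# Invented miniature dataset: an identifier, two features, and the target.
rows = [
    {"cust_id": "000-0001", "Day Mins": 265.1, "CustServ Calls": 1, "churn_flag": 0},
    {"cust_id": "000-0002", "Day Mins": 161.6, "CustServ Calls": 1, "churn_flag": 0},
    {"cust_id": "000-0003", "Day Mins": 299.4, "CustServ Calls": 2, "churn_flag": 1},
]


def to_training_csv(records, id_col, target_col):
    """Training file: header row, no identifier, target column last."""
    feature_cols = [c for c in records[0] if c not in (id_col, target_col)]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(feature_cols + [target_col])
    for r in records:
        writer.writerow([r[c] for c in feature_cols] + [r[target_col]])
    return buf.getvalue()


def to_scoring_csv(records, id_col, target_col):
    """Scoring file: no header row, identifier first, no target column."""
    feature_cols = [c for c in records[0] if c not in (id_col, target_col)]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in records:
        writer.writerow([r[id_col]] + [r[c] for c in feature_cols])
    return buf.getvalue()


training_csv = to_training_csv(rows, "cust_id", "churn_flag")
scoring_csv = to_scoring_csv(rows, "cust_id", "churn_flag")
print(training_csv)
print(scoring_csv)
```

Keeping the identifier as the first column of the scoring file makes it easy to exclude from the model input later while still carrying it through to the output.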


In [7]:
os.chdir('/root/churn')

In [8]:
#!apt install unzip
#!wget http://dataminingconsultant.com/DKD2e_data_sets.zip
#!unzip -o DKD2e_data_sets.zip


In [9]:
churn = pd.read_csv('./raw/churn.txt')

In [10]:
churn.shape

(3333, 21)

In [11]:
# There is no explicit customer id, so let's use the telephone number as a proxy
churn.rename(columns={'Phone': 'cust_id'}, inplace = True)

In [14]:
# move cust_id to the first column
cols = list(churn)
cols.insert(0, cols.pop(cols.index('cust_id')))
churn = churn.loc[:, cols]

In [16]:
# the target column is not well formed, so let's turn it into an indicator flag instead

churn['churn_flag'] = np.where(churn['Churn?']=='False.', 0, 1)
churn.drop('Churn?', axis='columns', inplace=True)
churn['churn_flag'].value_counts()


0    2850
1     483
Name: churn_flag, dtype: int64

In [17]:
churn.head()

Unnamed: 0,cust_id,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,churn_flag
0,382-4657,KS,128,415,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,371-7191,OH,107,415,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,358-1921,NJ,137,415,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,375-9999,OH,84,408,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,330-6626,OK,75,415,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [25]:
y = churn['churn_flag']
X = churn.drop('churn_flag', axis='columns')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [26]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2233, 20), (1100, 20), (2233,), (1100,))

In [6]:
train_file = 'train/train_data.csv'
test_file = 'test/test_data.csv'

train = pd.read_csv(train_file)
train.columns

Index(['cust_id', 'State', 'Account Length', 'Area Code', 'Int'l Plan',
       'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge',
       'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls',
       'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge',
       'CustServ Calls', 'churn_flag'],
      dtype='object')

In [28]:
# A couple of changes are required: drop the customer id column from the training dataset,
# and for the test dataset, put it as the first or last column

In [40]:

#for the training data we drop the cust_id
#and for the test data we keep it, but instead drop the target column
training_data = pd.DataFrame(X_train)
training_data.drop('cust_id', axis='columns',inplace=True)
training_data['churn_flag'] = list(y_train)
test_data = pd.DataFrame(X_test)

training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)


test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)


Train data uploaded to: s3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/train/train_data.csv
Test data uploaded to: s3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/test/test_data.csv


In [49]:
training_data.shape, test_data.shape

((2233, 21), (1100, 20))

# Step 2 - Training Job
We ask Autopilot to generate a limited number of models (with the defaults, the run could take several hours), then retrieve the best one based on the evaluation metric, AUC.
Note that you can also run the job through the UI from the launcher.

In [41]:
# input_data_config = [{
#       'DataSource': {
#         'S3DataSource': {
#           'S3DataType': 'S3Prefix',
#           'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
#         }
#       },
#       'TargetAttributeName': 'Class'
#     }
#   ]
automl = AutoML(role=role,
                target_attribute_name=target_attribute_name,
                base_job_name=auto_ml_job_name,
                sagemaker_session=session,
                problem_type='BinaryClassification',
                job_objective={'MetricName': 'AUC'},
                max_candidates=5)                

In [42]:
automl.fit(train_file, job_name=auto_ml_job_name, wait=False, logs=False)
describe_response = automl.describe_auto_ml_job()
print(describe_response)
job_run_status = describe_response['AutoMLJobStatus']

# Poll every 30 seconds until the job reaches a terminal state
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response['AutoMLJobStatus']
    print(job_run_status)
print('AutoML job finished with status: ' + job_run_status)

{'AutoMLJobName': 'automl-churnv4', 'AutoMLJobArn': 'arn:aws:sagemaker:eu-west-1:114936231890:automl-job/automl-churnv4', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-eu-west-1-114936231890/auto-ml-input-data/train_data.csv'}}, 'TargetAttributeName': 'churn_flag'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-114936231890/'}, 'RoleArn': 'arn:aws:iam::114936231890:role/service-role/AmazonSageMaker-ExecutionRole-20210125T152753', 'AutoMLJobObjective': {'MetricName': 'AUC'}, 'ProblemType': 'BinaryClassification', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 5}, 'SecurityConfig': {'EnableInterContainerTrafficEncryption': False}}, 'CreationTime': datetime.datetime(2021, 1, 31, 21, 19, 50, 579000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2021, 1, 31, 21, 19, 50, 579000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'Starting', 'GenerateCandidateDefini
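
The wait loop above is repeated later for the transform job, so it could be factored into a small helper. The sketch below is one possible shape, not part of the SageMaker SDK: `wait_for_job`, `max_polls`, and `sleep_fn` are hypothetical names, and the describe function is replaced by a stand-in so the block runs without AWS access.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe_fn, status_key, poll_seconds=30, max_polls=None, sleep_fn=time.sleep):
    """Call describe_fn() until response[status_key] is terminal; return the final status."""
    polls = 0
    while True:
        status = describe_fn()[status_key]
        print(status)
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job still {status} after {polls} polls")
        sleep_fn(poll_seconds)


# Stand-in for a describe call: reports InProgress twice, then Completed.
_responses = iter([{"AutoMLJobStatus": "InProgress"}] * 2 + [{"AutoMLJobStatus": "Completed"}])
final_status = wait_for_job(lambda: next(_responses), "AutoMLJobStatus", sleep_fn=lambda s: None)
```

In this notebook the call would presumably look like `wait_for_job(automl.describe_auto_ml_job, 'AutoMLJobStatus')`, and for the transform job a lambda around `sm.describe_transform_job` with the key `'TransformJobStatus'`.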

In [43]:
best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))


CandidateName: tuning-job-1-8722c107ed7946f791-003-a55c8eb6
FinalAutoMLJobObjectiveMetricName: validation:auc
FinalAutoMLJobObjectiveMetricValue: 0.9613900184631348



----
*Note*: we take the best candidate forward for production, but we could equally choose one of the other generated models. Remember the CandidateName: this is the model name we use from now on.
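
If you do want to compare the other candidates, a minimal sketch is to rank them by their objective value. The candidate names and scores below are invented; the dictionaries only mirror the shape of the `BestCandidate` entry printed above, and in practice the list would typically come from `automl.list_candidates()`.

```python
# Invented candidates, shaped like the 'BestCandidate' dictionary printed above.
candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:auc", "Value": 0.93}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:auc", "Value": 0.96}},
    {"CandidateName": "candidate-c",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:auc", "Value": 0.91}},
]


def rank_candidates(cands):
    """Sort candidates from highest to lowest objective value (for AUC, higher is better)."""
    return sorted(cands, key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"], reverse=True)


ranked = rank_candidates(candidates)
for c in ranked:
    metric = c["FinalAutoMLJobObjectiveMetric"]
    print(c["CandidateName"], metric["MetricName"], metric["Value"])

chosen_name = ranked[0]["CandidateName"]
```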

# Step 3 -  Explainability
Run SageMaker Clarify to get global SHAP values.



In [5]:
from sagemaker import clarify
clarify_processor = clarify.SageMakerClarifyProcessor(role=role,
                                                      instance_count=1,
                                                      instance_type='ml.c4.xlarge',
                                                      sagemaker_session=session)

#Pass it the best model name here....
model_config =  clarify.ModelConfig(model_name='tuning-job-1-8722c107ed7946f791-003-a55c8eb6',
                                   instance_type='ml.c5.xlarge',
                                   instance_count=1,
                                   accept_type='text/csv')

AttributeError: module 'sagemaker.clarify' has no attribute 'ModelPredictedLabelConfigExperimentConfig'

In [17]:
train_file = 'train/train_data.csv'
test_file = 'test/test_data.csv'
test_data = pd.read_csv(test_file)
training_data = pd.read_csv(train_file)
                            

# test_data.to_csv('test_features.csv', index=False, header=False)


In [11]:
test_data.to_csv('test_features.csv', index=False, header=False)  # was test_features, which is not defined in this notebook

test_data.head()

Unnamed: 0,cust_id,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
0,401-9909,VA,113,415,no,yes,23,149.0,104,25.33,235.8,67,20.04,201.8,76,9.08,9.5,5,2.57,4
1,393-9424,KY,108,415,no,yes,35,169.8,136,28.87,173.7,101,14.76,214.6,105,9.66,9.5,7,2.57,2
2,367-1398,VA,77,510,no,no,0,144.9,136,24.63,151.3,115,12.86,252.4,73,11.36,12.3,3,3.32,2
3,383-1509,AL,98,408,no,no,0,161.0,117,27.37,190.9,113,16.23,227.7,113,10.25,12.1,4,3.27,4
4,360-6024,SD,78,415,no,yes,25,197.4,73,33.56,295.7,113,25.13,211.7,73,9.53,13.2,2,3.56,0


In [37]:

shap_config = clarify.SHAPConfig(baseline=[test_data.iloc[0].values.tolist()],
                                 num_samples=15,
                                 agg_method='mean_abs')

In [35]:
test_data.dtypes

cust_id            object
State              object
Account Length      int64
Area Code           int64
Int'l Plan         object
VMail Plan         object
VMail Message       int64
Day Mins          float64
Day Calls           int64
Day Charge        float64
Eve Mins          float64
Eve Calls           int64
Eve Charge        float64
Night Mins        float64
Night Calls         int64
Night Charge      float64
Intl Mins         float64
Intl Calls          int64
Intl Charge       float64
CustServ Calls      int64
dtype: object

In [36]:
# Problem: the int64 values in the SHAP baseline are not JSON serializable.
# Workaround: cast the integer columns to float64. See
# https://stackoverflow.com/questions/50916422/python-typeerror-object-of-type-int64-is-not-json-serializable

test_data = test_data.astype({'Account Length':'float64', 'Area Code':'float64', 
                             'VMail Message':'float64', 'Day Calls':'float64',
                             'Eve Calls':'float64', 'Night Calls':'float64',
                             'Intl Calls':'float64', 'CustServ Calls':'float64'})
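
An alternative to casting whole columns is to convert each baseline value to a plain Python type just before it is serialized, which keeps integers as integers. NumPy scalars expose an `.item()` method that returns the native value. The sketch below uses a hypothetical stand-in class (`FakeInt64`) instead of NumPy so that it runs anywhere; `to_native` is also a made-up helper name.

```python
import json


class FakeInt64:
    """Stand-in for a NumPy integer scalar: not JSON serializable, but has .item()."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_native(value):
    """Return the plain Python value for NumPy-like scalars, else the value unchanged."""
    return value.item() if hasattr(value, "item") else value


row = ["401-9909", "VA", FakeInt64(113), FakeInt64(415), "no", 149.0]

# Serializing the raw row fails, as it does with real int64 values.
try:
    json.dumps([row])
    raised = False
except TypeError:
    raised = True

# Converting value by value produces a baseline that serializes cleanly.
baseline = [[to_native(v) for v in row]]
encoded = json.dumps(baseline)
print(encoded)
```

Applied to the notebook, the same list comprehension would wrap `test_data.iloc[0].values.tolist()` before it is passed as the SHAP baseline.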



In [34]:
shap_config.get_explainability_config()

{'shap': {'baseline': [['401-9909',
    'VA',
    113,
    415,
    'no',
    'yes',
    23,
    149.0,
    104,
    25.33,
    235.8,
    67,
    20.04,
    201.8,
    76,
    9.08,
    9.5,
    5,
    2.57,
    4]],
  'num_samples': 15,
  'agg_method': 'mean_abs',
  'use_logit': False,
  'save_local_shap_values': True}}

In [28]:
train_uri = f's3://{bucket}/{prefix}/train/train_data.csv'
explainability_output_path = f's3://{bucket}/{prefix}/exaplainability/'



In [29]:
train_uri

's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/train/train_data.csv'

In [40]:
explainability_output_path = f's3://{bucket}/{prefix}/clarify-explainability'
explainability_data_config = clarify.DataConfig(s3_data_input_path=train_uri,
                                s3_output_path=explainability_output_path,
                                label='churn_flag',
                                headers=training_data.columns.to_list(),
                                dataset_type='text/csv')

In [31]:
# shap_config = clarify.SHAPConfig(baseline=[test_features.iloc[0].values.tolist()],
#                                  num_samples=15,
#                                  agg_method='mean_abs')

In [None]:
# Template for an optional ExperimentConfig (placeholder values, not used yet):
# "ExperimentConfig": {
#       "ExperimentName": "string",
#       "TrialComponentDisplayName": "string",
#       "TrialName": "string"
#    },

In [41]:
clarify_processor.run_explainability(data_config=explainability_data_config,
                                     model_config=model_config,
                                     explainability_config=shap_config)


Job Name:  Clarify-Explainability-2021-02-09-15-40-02-126
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/train/train_data.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-114936231890/Clarify-Explainability-2021-02-09-15-40-02-126/input/analysis_config/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/clarify-explainability', 'LocalPath': '/opt/ml/processing/output', 'S3Uploa

In [3]:
!conda install -c conda-forge -y shap

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.2
  latest version: 4.9.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - shap


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.9.2                |   py37h89c1867_0         3.0 MB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    shap-0.37.0                |   py37h10a2094_0         510 KB  conda-forge
    slicer-0.0.7               |     pyhd8ed1ab_0          16 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.6 MB

The following NEW packages will be INSTALLED:

  python_abi         conda-forge/linux-64::python_abi-3.

In [15]:
import shap
from io import StringIO

In [18]:
training_data.columns

Index(['cust_id', 'State', 'Account Length', 'Area Code', 'Int'l Plan',
       'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge',
       'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls',
       'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge',
       'CustServ Calls', 'churn_flag'],
      dtype='object')

In [19]:
# The original attempt failed with "AssertionError: The shape of the shap_values matrix does not
# match the shape of the provided data matrix." because out.csv has a header row and one column
# per feature per label (<feature>_label0 and <feature>_label1).
output = sagemaker.s3.S3Downloader.read_file("s3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/clarify-explainability/explanations_shap/out.csv")
local_shap_values = pd.read_csv(StringIO(output), sep=',', header=None)

# promote row 0 (the column names) to a header
shap_df = local_shap_values.iloc[1:].reset_index(drop=True)
shap_df.columns = local_shap_values.iloc[0]

# keep only columns where target is 1, not 0
shap_label1 = shap_df.loc[:, shap_df.columns.str.endswith('_label1')].astype(float)

features = training_data.drop(columns='churn_flag')
shap.summary_plot(shap_values=shap_label1.to_numpy(), features=features.to_numpy(),
                  feature_names=features.columns)

In [24]:
local_shap_values.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,cust_id_label0,State_label0,Account Length_label0,Area Code_label0,Int'l Plan_label0,VMail Plan_label0,VMail Message_label0,Day Mins_label0,Day Calls_label0,Day Charge_label0,...,Eve Mins_label1,Eve Calls_label1,Eve Charge_label1,Night Mins_label1,Night Calls_label1,Night Charge_label1,Intl Mins_label1,Intl Calls_label1,Intl Charge_label1,CustServ Calls_label1
1,-0.08610271903323313,-0.026144739139116635,-0.39879154078549806,-0.30083499880355646,7.071969730197268e-16,0.3752421289682393,0.3987915407854986,0.26724346013587774,-0.16314199395770393,-0.046657497361527724,...,-0.30060422960725086,0.0010753871586025804,-0.060422960725076226,0.06404092248382845,0.30060422960725064,-0.017164454021662834,0.1555891238670701,0.0639863316296092,-1.5543122344752192e-15,0.07960557937622004
2,0.0,-0.02189378659824909,0.0,0.004346612249554738,0.0,-0.03577972162511935,0.0,0.06946638486935056,0.0,0.005107320068015199,...,0.0,-0.03684338057822025,0.0,0.06813514067265414,0.0,0.05677931755781172,0.0,-0.021911671580509696,0.0,-0.266086107678307
3,0.30305024934977165,0.07918526974349642,0.350045024660716,0.13400165478686207,0.1783410812318022,0.06542512418370894,-0.02423560924993928,0.014270462820364775,-0.10781463552879962,-0.024511163399472104,...,0.7159126750133713,0.2601034952526819,0.0077090447001180945,-0.03934757536620058,-0.2102857351818584,-0.08310748083023059,-0.16329095987091347,-0.046383135759408466,-0.8198190785714697,-0.42591944539626764
4,-0.02499999999999994,-0.0009604788385331325,-5.551115123125783e-17,0.0,1.1102230246251565e-16,5.551115123125783e-17,0.050000000000000225,0.020083975605666694,-0.1249999999999999,-0.03997107176110147,...,-0.14999999999999977,0.005332807265222163,0.30000000000000004,-0.007037025317549861,0.0,0.0,-0.6000000000000001,-0.04101655557751642,-0.3999999999999999,-0.34214822277426715
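
The table above holds local (per-row) SHAP values. With `agg_method='mean_abs'`, the global importance of a feature is the mean of the absolute local values in its column. The sketch below computes that for the positive class from an invented CSV in the same `<feature>_label0` / `<feature>_label1` layout; `global_importance` is a hypothetical helper and the numbers are made up.

```python
import csv
import io

# Invented local SHAP output: one column per feature per label, with a header row.
shap_csv = """State_label0,Day Mins_label0,State_label1,Day Mins_label1
-0.02,-0.30,0.02,0.30
0.01,0.10,-0.01,-0.10
-0.03,-0.20,0.03,0.20
"""


def global_importance(csv_text, label_suffix="_label1"):
    """Mean absolute SHAP value per feature, using only the columns for one label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cols = [c for c in reader.fieldnames if c.endswith(label_suffix)]
    totals = {c: 0.0 for c in cols}
    n = 0
    for record in reader:
        n += 1
        for c in cols:
            totals[c] += abs(float(record[c]))
    # Strip the label suffix so the keys are the original feature names.
    return {c[: -len(label_suffix)]: totals[c] / n for c in cols}


importance = global_importance(shap_csv)
for feature, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {value:.3f}")
```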


# Step 4 - Batch inference on Test Set
This is likely to be a separate scoring script in production.

You could create a batch transform job through the AWS Console UI, but a couple of its default behaviours cannot be overridden there and are undesirable for our purposes.

We want our predictions to be probability scores, not class labels, and we want to associate the scores with at least a customer id, as well as the original input data.


Tasks are:
* define the prediction columns we need (probability scores are not output by default)
* join the scores back to the input data (this is not the default behaviour)
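
Once both tasks are done, each line of the transform output should be the original input record followed by the prediction fields. A minimal sketch of reading such a file, assuming the response keys are requested in the order `predicted_label`, `probability` and that the customer id is the first input column. The sample lines are invented and `parse_scores` is a hypothetical helper.

```python
import csv
import io

# Invented transform output: cust_id, two input features, predicted_label, probability.
joined_output = """000-0001,149.0,4,0,0.284306
000-0002,161.0,4,1,0.753325
"""


def parse_scores(csv_text, id_index=0):
    """Map each customer id to (predicted_label, probability) taken from the last two fields."""
    scores = {}
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip blank lines
        scores[fields[id_index]] = (int(fields[-2]), float(fields[-1]))
    return scores


scores = parse_scores(joined_output)
print(scores)
```

Reading the prediction fields from the end of the record keeps the parser independent of how many input columns were joined in.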

In [25]:
# Here's another way of getting the best candidate model name without having to explicitly specify
# the long string of characters. Instead, we need to know the job name.
automl = AutoML.attach(auto_ml_job_name)
best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']

In [44]:
best_candidate

{'CandidateName': 'tuning-job-1-8722c107ed7946f791-003-a55c8eb6',
 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:auc',
  'Value': 0.9613900184631348},
 'ObjectiveStatus': 'Succeeded',
 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:eu-west-1:114936231890:processing-job/db-1-f90842bfdbfb4714bea3792b79b872506e785e848ff94bd5890e056256',
   'CandidateStepName': 'db-1-f90842bfdbfb4714bea3792b79b872506e785e848ff94bd5890e056256'},
  {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:eu-west-1:114936231890:training-job/automl-chu-dpp0-1-09204afc5b0d433b9a901406e079eeb21ca079d27e204',
   'CandidateStepName': 'automl-chu-dpp0-1-09204afc5b0d433b9a901406e079eeb21ca079d27e204'},
  {'CandidateStepType': 'AWS::SageMaker::TransformJob',
   'CandidateStepArn': 'arn:aws:sagemaker:eu-west-1:114936231890:transform-job/automl-chu-dpp0-rpb-1-c856e4481a244ae69d89f14a826fd97640add2ee3

In [45]:
inference_response_keys = ['predicted_label', 'probability']
model = automl.create_model(name=best_candidate_name,
                            candidate=best_candidate,
                            inference_response_keys=inference_response_keys)

experiment_config =  { 
      "ExperimentName": "string",
      "TrialComponentDisplayName": "string",
      "TrialName": "string"
   }


In [51]:
output_path = s3_transform_output_path + best_candidate['CandidateName'] +'/'
transformer=model.transformer(instance_count=1, 
                          instance_type='ml.m5.xlarge',
                          assemble_with='Line',
                          accept='text/csv', 
                          output_path=output_path)
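transformer.transform(job_name=transform_job_name+"v6", 
                      data=test_data_s3_path, 
                      split_type='Line', 
                      # "$[1:]" sends every column except the first (cust_id) to the model,
                      # which was trained without cust_id; join_source="Input" still joins
                      # the full input record, including cust_id, to the prediction
                      input_filter="$[1:]", join_source="Input",
                      output_filter="$",
                      content_type='text/csv',
                      wait=False)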

describe_response = sm.describe_transform_job(TransformJobName=transform_job_name+"v6")
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name+"v6")
    job_run_status = describe_response['TransformJobStatus']
    print(describe_response)
    sleep(30)
print('transform job completed with status : ' + job_run_status)

Using already existing model: tuning-job-1-8722c107ed7946f791-003-a55c8eb6


InProgress
{'TransformJobName': 'automl-churn-model-transformv5', 'TransformJobArn': 'arn:aws:sagemaker:eu-west-1:114936231890:transform-job/automl-churn-model-transformv5', 'TransformJobStatus': 'InProgress', 'ModelName': 'tuning-job-1-8722c107ed7946f791-003-a55c8eb6', 'TransformInput': {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/test/test_data.csv'}}, 'ContentType': 'text/csv', 'CompressionType': 'None', 'SplitType': 'Line'}, 'TransformOutput': {'S3OutputPath': 's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/inference-results/tuning-job-1-8722c107ed7946f791-003-a55c8eb6/', 'Accept': 'text/csv', 'AssembleWith': 'Line', 'KmsKeyId': ''}, 'TransformResources': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1}, 'CreationTime': datetime.datetime(2021, 1, 31, 22, 34, 15, 781000, tzinfo=tzlocal()), 'DataProcessing': {'InputFilter': '$', 'OutputFilter': '$', 'JoinSource': 'Input'}, 'Resp

In [52]:
output_path

's3://sagemaker-eu-west-1-114936231890/sagemaker/automl-churnv4/inference-results/tuning-job-1-8722c107ed7946f791-003-a55c8eb6/'

In [56]:
import io
from urllib.parse import urlparse

test_file = 'test_data.csv'

def get_csv_from_s3(s3uri, file_name):
    """Download s3uri/file_name and return its contents as text."""
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip('/')
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')

# Batch transform writes one '<input file name>.out' object per input file
pred_csv = get_csv_from_s3(transformer.output_path, '{}.out'.format(test_file))
# No header= argument, so pandas uses the first row of the output as column names
data_auc = pd.read_csv(io.StringIO(pred_csv))


In [57]:
data_auc.head()

Unnamed: 0,cust_id,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,1,0.6064140200614929
0,401-9909,VA,113,415,no,yes,23,149.0,104,25.33,...,20.04,201.8,76,9.08,9.5,5,2.57,4,0,0.284306
1,393-9424,KY,108,415,no,yes,35,169.8,136,28.87,...,14.76,214.6,105,9.66,9.5,7,2.57,2,0,0.040555
2,367-1398,VA,77,510,no,no,0,144.9,136,24.63,...,12.86,252.4,73,11.36,12.3,3,3.32,2,0,0.053853
3,383-1509,AL,98,408,no,no,0,161.0,117,27.37,...,16.23,227.7,113,10.25,12.1,4,3.27,4,1,0.753325
4,360-6024,SD,78,415,no,yes,25,197.4,73,33.56,...,25.13,211.7,73,9.53,13.2,2,3.56,0,0,0.052484


In [None]:
# Step 5 - Post processing
Typically the probability scores are bucketed into deciles for more practical use.
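
To make the idea concrete, here is a rank-based sketch of decile bucketing in plain Python. It illustrates the concept only and is not a drop-in replacement for `pd.qcut`: tied scores are split by their original order here, whereas `pd.qcut` bins on quantile edges. The probabilities are invented and `rank_deciles` is a hypothetical helper.

```python
def rank_deciles(scores, n_buckets=10):
    """Assign each score a bucket 0..n_buckets-1 by rank (0 holds the lowest scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(scores)
    return buckets


# Invented probabilities: the values 0/20 .. 19/20 in a shuffled order.
probabilities = [((7 * k) % 20) / 20 for k in range(20)]
deciles = rank_deciles(probabilities)

for p, d in zip(probabilities, deciles):
    print(f"{p:.2f} -> decile {d}")
```

With labels starting at 0, the customers most likely to churn end up in decile 9.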

In [60]:
# column 21 is the probability (20 input columns, then the predicted label, then the probability)
data_auc['decile'] = pd.qcut(data_auc.iloc[:,21], 10, labels=False)

In [61]:
data_auc.head()

Unnamed: 0,cust_id,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,1,0.6064140200614929,decile
0,401-9909,VA,113,415,no,yes,23,149.0,104,25.33,...,201.8,76,9.08,9.5,5,2.57,4,0,0.284306,7
1,393-9424,KY,108,415,no,yes,35,169.8,136,28.87,...,214.6,105,9.66,9.5,7,2.57,2,0,0.040555,0
2,367-1398,VA,77,510,no,no,0,144.9,136,24.63,...,252.4,73,11.36,12.3,3,3.32,2,0,0.053853,1
3,383-1509,AL,98,408,no,no,0,161.0,117,27.37,...,227.7,113,10.25,12.1,4,3.27,4,1,0.753325,9
4,360-6024,SD,78,415,no,yes,25,197.4,73,33.56,...,211.7,73,9.53,13.2,2,3.56,0,0,0.052484,1
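
The outline lists evaluating the predictions as the final step, but the notebook stops here. A minimal sketch of that step, assuming the held-out labels (for example `y_test`) are aligned row by row with the probability column: AUC is the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores below are invented and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented labels and probabilities: three of the four pairs are ranked correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of rows, which is fine for a test set of about a thousand records; a library implementation would be preferable for larger data.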
