## Track ML experimentation using Amazon SageMaker Experiments & DVC

### Contents 

1. [Background]
2. [Setup]
3. [Data]
4. [Experiment 1]
5. [Experiment 2]
6. [Compare the experiments]
7. [Change the artifact version]
8. [Conclusion]
9. [Reference]


## 1. Background

This notebook demonstrates how to integrate Amazon SageMaker Experiments with DVC to keep track of data and models during the ML experimentation phase. For this demo, we use the customer churn use case built in reference [1].

## 2. Setup

Let's start with the initial setup, which includes:

- Set up DVC and Git for this notebook
- Import the required Python packages
- Initiate the SageMaker session
- Get the IAM role for SageMaker
- Provide the S3 bucket name that will be used to store the data and DVC artifacts

In [44]:
# install dvc (dvc version 1.6.4 is used in this demo) 
! pip install dvc 
# set up git
! git config --global user.name "USERNAME"   # Replace USERNAME with your source control user name
! git config --global user.email "EMAIL"          # Replace EMAIL with your source control email

# install dvc git hook (Ignore error if DVC Git hooks already exist)
! dvc install

# install sagemaker-experiments if required
! pip install sagemaker-experiments

ERROR: failed to install DVC Git hooks - Hook 'post-checkout' already exists. Please refer to <https://man.dvc.org/install> for more info.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
import boto3
import re
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from sagemaker.amazon.amazon_estimator import get_image_uri
from smexperiments import experiment
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from sagemaker import analytics

In [None]:
# Initiate the sagemaker session
sess = sagemaker.Session()

# Define IAM role
role = get_execution_role()

bucket = 'BUCKETNAME'  # Replace with your bucket name
prefix = 'sagemaker'

## 3. Data 

We use a public customer churn dataset attributed to the University of California Irvine Repository of Machine Learning Datasets. Let's download it.

In [None]:
!wget http://dataminingconsultant.com/DKD2e_data_sets.zip
!unzip -o DKD2e_data_sets.zip

In [None]:
churn = pd.read_csv('./Data sets/churn.txt')
pd.set_option('display.max_columns', 500)
churn

This dataset includes 3333 records and 21 attributes. The attributes are:

- State: the US state (two-letter abbreviation) in which the customer resides
- Account Length: the number of days that this account has been active
- Area Code: area code of the corresponding customer’s number
- Phone: phone number
- Int’l Plan: whether the customer has an international calling plan
- VMail Plan: whether the customer has a voice mail feature
- VMail Message: the average number of voice mail messages per month
- Day Mins: the total number of calling minutes used during the day
- Day Calls: the total number of calls placed during the day
- Day Charge: the billed cost of daytime calls
- Eve Mins, Eve Calls, Eve Charge: the minutes used, number of calls, and billed cost for calls placed during the evening
- Night Mins, Night Calls, Night Charge: the minutes used, number of calls, and billed cost for calls placed during nighttime
- Intl Mins, Intl Calls, Intl Charge: the minutes used, number of calls, and billed cost for international calls
- CustServ Calls: the number of calls placed to Customer Service
- Churn?: whether the customer left the service (true/false). This is the target attribute, which we will use as the label for our ML model

Let's do some initial data cleaning. The detailed data exploration for this dataset can be found in [1].

In [None]:
churn = churn.drop('Phone', axis=1)
churn = churn.drop(['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge'], axis=1)
churn['Area Code'] = churn['Area Code'].astype(object)

We also need to convert our categorical features into numeric features.

In [None]:
model_data = pd.get_dummies(churn)
model_data = pd.concat([model_data['Churn?_True.'], model_data.drop(['Churn?_False.', 'Churn?_True.'], axis=1)], axis=1)
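
The column names `Churn?_True.` and `Churn?_False.` in the cell above come from one-hot encoding: each categorical column is replaced by one indicator column per distinct value, named `<column>_<value>`. The following is a minimal standard-library sketch of that idea, not the pandas implementation. The `one_hot` helper and the two sample rows are hypothetical and only illustrate the naming.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator per distinct value.

    Indicator columns are named '<column>_<value>', mirroring the naming
    seen in the notebook (e.g. 'Churn?_True.').
    """
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            new_row[f"{column}_{value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded


# Two made-up rows, for illustration only (not taken from the dataset).
sample_rows = [
    {"CustServ Calls": 1, "Churn?": "False."},
    {"CustServ Calls": 4, "Churn?": "True."},
]

encoded_rows = one_hot(sample_rows, "Churn?")

# Keep a single label column and move it to the front, as the notebook does.
label = "Churn?_True."
model_rows = [
    {label: row[label], **{k: v for k, v in row.items() if not k.startswith("Churn?_")}}
    for row in encoded_rows
]
print(model_rows)
```

The label is placed first because the CSV files are later written without a header, so the training algorithm has to identify the label by its position.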

Now, the data is ready for experimentation. Let's start the first experiment.

## 4. Experiment 1

A good way to associate a new experiment with a Git head is to create a new Git branch.

In [None]:
! git checkout -b experiment1

Let's prepare the training/validation/test datasets for experiment 1 and push them to our storage (S3).

In [None]:
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    # Shuffle row positions (not index labels) so that .iloc below stays valid
    # even when df does not have a default 0..n-1 index.
    np.random.seed(seed)
    m = len(df.index)
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    # The remaining rows (about 20% by default) form the test set.
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test


In [None]:
train_data, validation_data, test_data = train_validate_test_split(model_data)
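
To see what the default 60/20/20 split produces, here is a standard-library sketch of the same index arithmetic. `split_indices` is a hypothetical helper, not part of the notebook. It uses `random.Random` instead of NumPy, so the actual row assignment differs from `train_validate_test_split`, but the partition sizes follow the same `int(...)` truncation.

```python
import random


def split_indices(n_rows, train_percent=0.6, validate_percent=0.2, seed=None):
    """Shuffle row positions 0..n_rows-1 and cut them into three groups."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    train_end = int(train_percent * n_rows)
    validate_end = int(validate_percent * n_rows) + train_end
    return (
        positions[:train_end],
        positions[train_end:validate_end],
        positions[validate_end:],
    )


# 3333 is the number of records in the churn dataset.
train_idx, validate_idx, test_idx = split_indices(3333, seed=42)
print(len(train_idx), len(validate_idx), len(test_idx))

# A fixed seed reproduces the same split.
assert split_indices(3333, seed=42) == (train_idx, validate_idx, test_idx)
```

Because of the truncation, the test set absorbs the remainder. The notebook calls the function without a seed, so each run yields a different split; passing a seed is one option if the split itself should be repeatable.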

In [None]:
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
test_data.to_csv('test.csv', header=False, index=False)

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
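
The three upload calls follow one pattern: each split goes to `<prefix>/<split>/<split>.csv`. A sketch of that mapping is below. `upload_plan` is a hypothetical helper and performs no upload itself. It joins keys with `posixpath.join` because S3 keys use forward slashes, whereas `os.path.join` follows the local operating system's separator.

```python
import posixpath


def upload_plan(prefix, splits=("train", "validation", "test")):
    """Return (local_file, s3_key) pairs for each dataset split."""
    return [
        (f"{split}.csv", posixpath.join(prefix, split, f"{split}.csv"))
        for split in splits
    ]


plan = upload_plan("sagemaker")
for local_file, key in plan:
    print(local_file, "->", key)

# With boto3 available, each pair could then be uploaded in a loop, e.g.:
#   s3 = boto3.Session().resource("s3")
#   for local_file, key in plan:
#       s3.Bucket(bucket).Object(key).upload_file(local_file)
```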

In [None]:
! dvc status

Let's add these changes to DVC and Git.

In [None]:
! dvc add --external s3://BUCKETNAME/sagemaker/train # Replace the bucket name with your bucket name
! dvc add --external s3://BUCKETNAME/sagemaker/validation # Replace the bucket name with your bucket name
! dvc add --external s3://BUCKETNAME/sagemaker/test # Replace the bucket name with your bucket name
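
The `dvc add --external` commands hard-code the bucket name, which has to be kept in sync with the `bucket` variable by hand. One possible alternative, sketched here, builds the commands from the same `bucket` and `prefix` values. `dvc_add_commands` is a hypothetical helper; it only constructs argument lists and does not run DVC.

```python
def dvc_add_commands(bucket, prefix, folders=("train", "validation", "test")):
    """Build one `dvc add --external` argument list per S3 folder."""
    return [
        ["dvc", "add", "--external", f"s3://{bucket}/{prefix}/{folder}"]
        for folder in folders
    ]


commands = dvc_add_commands("BUCKETNAME", "sagemaker")  # placeholder bucket name
for command in commands:
    print(" ".join(command))

# To execute them from Python instead of `!` shell cells, something like
# subprocess.run(command, check=True) could be used for each entry.
```

In a notebook, IPython also expands `{bucket}` and `{prefix}` inside `!` commands, which may be the simpler route.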

In [None]:
! git add train.dvc validation.dvc test.dvc

In [None]:
! git commit -m "train/val/test datasets are ready for experiment 1"

In [None]:
! git push --set-upstream origin experiment1

Let's capture the commit ID. We will use it later to associate the data version with the experiment.

In [None]:
! git log --format="%H" -n 1   # Get commit id for data version
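
Copying the commit ID by hand into a later cell is easy to get wrong, for example by leaving the `COMMITID` placeholder in place. The sketch below reads the ID with `git rev-parse HEAD` and checks its shape before it is used. Both function names are hypothetical, and the check assumes a repository that uses 40-character SHA-1 hashes.

```python
import re
import subprocess


def current_commit_id(repo_dir="."):
    """Return the full hash of HEAD for the Git repository at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def looks_like_commit_id(value):
    """True if value is a 40-character lowercase hex string (assumes SHA-1)."""
    return re.fullmatch(r"[0-9a-f]{40}", value) is not None


# The unreplaced placeholder is rejected; a full hash is accepted.
print(looks_like_commit_id("COMMITID"))
print(looks_like_commit_id("d72fd71ac8ffedb6b5b8c0492e8dcc5ac0e8610b"))
```

In the notebook, `dataset_commit_id = current_commit_id()` could replace the manual copy, assuming the cell runs inside the Git working tree.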

Now, to track this experiment in SageMaker, we need to create an experiment and define a trial within it. For simplicity, we use a single trial here, but an experiment can contain any number of trials.

In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")

my_experiment = experiment.Experiment.create(experiment_name = "experiment1-{}".format(create_date),
                                    description = "experiment1-{}".format(create_date)
                                    )
my_trial = my_experiment.create_trial(trial_name = "trial1-{}".format(create_date))
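
The experiment and trial names share one timestamp suffix so that they can be matched later. A small sketch of that naming scheme follows; `timestamped_name` is a hypothetical helper. It passes `gmtime()` explicitly to get UTC, whereas `strftime` called without a time argument uses local time.

```python
from time import gmtime, strftime


def timestamped_name(base, when=None):
    """Return '<base>-YYYY-MM-DD-HH-MM-SS' in UTC."""
    when = gmtime() if when is None else when
    return "{}-{}".format(base, strftime("%Y-%m-%d-%H-%M-%S", when))


# Use one timestamp for both names so the trial can be traced to its experiment.
now = gmtime()
experiment_name = timestamped_name("experiment1", now)
trial_name = timestamped_name("trial1", now)
print(experiment_name, trial_name)
```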

To track extra metadata related to a trial, we can use a TrialComponent, as below. For this demo, we create a preprocessing component to track parameters (e.g. train_test_split_ratio) related to the preprocessing stage.

In [None]:
se = boto3.Session()
sm = se.client('sagemaker')

with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({
        "train_test_split_ratio": 0.6
    })
my_trial.add_trial_component(tracker.trial_component)


To track the data version in SageMaker Experiments, we add the Git commit ID (captured earlier) as a parameter of this trial.

In [None]:
dataset_commit_id = 'COMMITID' # Enter the CommitID for data version

with Tracker.create(display_name="DatasetLineage", sagemaker_boto_client=sm) as ptracker:
    ptracker.log_parameters({"dataset_commit_id": dataset_commit_id})

my_trial.add_trial_component(ptracker.trial_component)

Now, let's run the training job. To predict customer churn, we will use the XGBoost algorithm. Let's retrieve the container image.

In [None]:
container = get_image_uri(boto3.Session().region_name, 'xgboost')

We also need to specify the location of the training and validation datasets.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess
                                    #,train_use_spot_instances=True,
                                    #train_max_wait=86400,
                                    #subnets = ['subnet-0f9914042f9a20cad'],
                                    #security_group_ids = ['sg-089715a9429257862'],
                                    )
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, experiment_config={
            "ExperimentName": my_experiment.experiment_name,
            "TrialName": my_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },)

When the training job finishes, the model is pushed to S3. Let's put the model under version control using DVC and associate it with experiment 1.

In [None]:
! dvc add --external s3://BUCKETNAME/sagemaker/output # Replace the bucket name with your bucket name

In [None]:
! git add output.dvc

In [None]:
! git commit -m "push the model for experiment 1"

In [None]:
! git push 

In [None]:
! git log --format="%H" -n 1   # Get commit id for model version

In [None]:
model_commit_id = 'COMMITID' # Enter your CommitID here

with Tracker.create(display_name="ModelLineage", sagemaker_boto_client=sm) as mtracker:
    mtracker.log_parameters({"model_commit_id": model_commit_id})

my_trial.add_trial_component(mtracker.trial_component)

Our first experiment is done. We versioned the data and the model with DVC and captured all the observations in SageMaker Experiments. To retrieve these observations as a dataframe, we can use the `ExperimentAnalytics` class from the SageMaker SDK, as below.

In [None]:
trial_component_analytics = analytics.ExperimentAnalytics(experiment_name=my_experiment.experiment_name)
experiment1_obs = trial_component_analytics.dataframe()
experiment1_obs

## 5. Experiment 2

Let's create the second experiment through the same process.

In [20]:
! git checkout -b experiment2

M	.dvc/.gitignore
M	.dvc/config
M	.dvc/plots/confusion.json
M	.dvc/plots/default.json
M	.dvc/plots/scatter.json
M	.dvc/plots/smooth.json
M	.dvcignore
M	README.md
M	data.dvc
Switched to a new branch 'experiment2'

In this experiment, we drop one column from the original dataset to simulate a dataset change. Let's follow the same procedure as in experiment 1 to push the data to S3 and track the changes.

In [21]:
model_data = model_data.drop(['Account Length'], axis=1)

In [None]:
train_data, validation_data, test_data = train_validate_test_split(model_data)

In [None]:
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
test_data.to_csv('test.csv', header=False, index=False)

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')

In [None]:
! dvc add --external s3://BUCKETNAME/sagemaker/train # Replace the bucket name with your bucket name
! dvc add --external s3://BUCKETNAME/sagemaker/validation # Replace the bucket name with your bucket name
! dvc add --external s3://BUCKETNAME/sagemaker/test # Replace the bucket name with your bucket name

In [None]:
! git add train.dvc validation.dvc test.dvc

In [None]:
! git commit -m "train/val/test datasets are ready for experiment 2"

In [None]:
! git push --set-upstream origin experiment2

In [None]:
! git log --format="%H" -n 1   # Get commit id for data version

In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")

my_experiment = experiment.Experiment.create(experiment_name = "experiment2-{}".format(create_date),
                                    description = "experiment2-{}".format(create_date)
                                    )
my_trial = my_experiment.create_trial(trial_name = "trial1-{}".format(create_date))

In [None]:
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({
        "train_test_split_ratio": 0.6
    })
my_trial.add_trial_component(tracker.trial_component)

In [None]:
dataset_commit_id = 'COMMITID' # Enter the CommitID for data version

with Tracker.create(display_name="DatasetLineage", sagemaker_boto_client=sm) as ptracker:
    ptracker.log_parameters({"dataset_commit_id": dataset_commit_id})

my_trial.add_trial_component(ptracker.trial_component)

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess
                                    #,train_use_spot_instances=True,
                                    #train_max_wait=86400,
                                    #subnets = ['subnet-0f9914042f9a20cad'],
                                    #security_group_ids = ['sg-089715a9429257862'],
                                    )
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, experiment_config={
            "ExperimentName": my_experiment.experiment_name,
            "TrialName": my_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },)

In [None]:
! dvc add --external s3://BUCKETNAME/sagemaker/output # Replace the bucket name with your bucket name

In [None]:
! git add output.dvc

In [None]:
! git commit -m "push the model for experiment 2"

In [None]:
! git push

In [None]:
! git log --format="%H" -n 1   # Get commit id for model version

In [None]:
model_commit_id = '5c227f6eb6562e22a077453c76dbe94139a0b92f' # Enter your CommitID here

with Tracker.create(display_name="ModelLineage", sagemaker_boto_client=sm) as mtracker:
    mtracker.log_parameters({"model_commit_id": model_commit_id})

my_trial.add_trial_component(mtracker.trial_component)

In [None]:
trial_component_analytics = analytics.ExperimentAnalytics(experiment_name=my_experiment.experiment_name)
experiment2_obs = trial_component_analytics.dataframe()

Now, we have two experiments which are tracked and version controlled. 

## 6. Compare the experiments

In [None]:
experiment1_obs

In [None]:
experiment2_obs

Let's get the average validation error for each experiment.

In [None]:
experiment1_obs['validation:error - Avg'][1]

In [None]:
experiment2_obs['validation:error - Avg'][1]
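
Indexing with `[1]` depends on the row order of the analytics dataframe. A sketch of a comparison that selects the training row by name instead is shown below. It works on plain dictionaries such as those returned by `DataFrame.to_dict('records')`. The `DisplayName` column, the helper name, and all numeric values are assumptions for illustration, not results from this notebook.

```python
def training_metric(records, metric="validation:error - Avg", display_name="Training"):
    """Return the metric from the row whose DisplayName matches display_name."""
    for row in records:
        if row.get("DisplayName") == display_name and row.get(metric) is not None:
            return row[metric]
    raise KeyError(f"no '{display_name}' row with metric '{metric}'")


# Made-up records standing in for experiment1_obs / experiment2_obs.
experiment1_records = [
    {"DisplayName": "ModelLineage"},
    {"DisplayName": "Training", "validation:error - Avg": 0.07},
]
experiment2_records = [
    {"DisplayName": "Training", "validation:error - Avg": 0.08},
    {"DisplayName": "Preprocessing"},
]

errors = {
    "experiment1": training_metric(experiment1_records),
    "experiment2": training_metric(experiment2_records),
}
# The experiment with the lowest validation error is the preferred one.
best = min(errors, key=errors.get)
print(best, errors[best])
```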

## 7. Change the artifact version 


We can check out the experiment1 branch, and the dataset and model will also change to the versions of that experiment.

In [None]:
! git checkout experiment1

We can also use the commit id to go back to a specific version of a dataset.

In [None]:
! git checkout d72fd71ac8ffedb6b5b8c0492e8dcc5ac0e8610b train.dvc
! dvc pull

## 8. Conclusion

In this notebook, we demonstrated how to use Amazon SageMaker Experiments and DVC to keep track of ML experimentation artifacts (datasets and models).

## 9. Reference

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb