## Track ML experimentation using Amazon SageMaker & DVC

This notebook demonstrates how to integrate Amazon SageMaker Experiments with DVC to keep track of data and ML models.

### Contents 

1. Background
2. Initial Setup
3. Data
4. Experiment 1
5. Experiment 2


### Initial setup

#### Let's start with the initial setup, which includes:

- Setting up DVC and Git on this notebook
- Initiating the SageMaker session
- The IAM role ARN that will be used to give access to the data
- The S3 bucket that will be used to store the data and DVC artifacts
- Importing the required Python packages
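Before running the setup cells, it can help to confirm that the command-line tools are actually on the `PATH`. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper name introduced here for illustration and is not part of DVC or SageMaker.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical pre-flight check for this notebook's command-line dependencies.
not_found = missing_tools(["git", "dvc"])
if not_found:
    print("Install these before continuing:", ", ".join(not_found))
else:
    print("git and dvc are available")
```

If `dvc` is reported as missing, the `pip install dvc` cell that follows should make it available.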

In [44]:
# install dvc 
! pip install dvc
# set up git
! git config --system user.name "USERNAME"   # Replace USERNAME with your source control user name
! git config --system user.email "EMAIL"          # Replace EMAIL with your source control email

# install dvc git hook (Ignore error if DVC Git hooks already exist)
! dvc install

ERROR: failed to install DVC Git hooks - Hook 'post-checkout' already exists. Please refer to <https://man.dvc.org/install> for more info.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

In [None]:
import sagemaker
sess = sagemaker.Session()

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

bucket='BUCKETNAME'  # Replace with your bucket name
prefix ='data'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

## Data 

For the purpose of this demo, we consider the customer churn use case. We use the publicly available dataset attributed to the University of California Irvine Repository of Machine Learning Datasets. Let's download this dataset.

In [None]:
!wget http://dataminingconsultant.com/DKD2e_data_sets.zip
!unzip -o DKD2e_data_sets.zip

In [None]:
churn = pd.read_csv('./Data sets/churn.txt')
pd.set_option('display.max_columns', 500)
churn

This dataset includes 3333 records and 21 attributes. The attributes are:

- State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- Account Length: the number of days that this account has been active
- Area Code: the three-digit area code of the corresponding customer’s phone number
- Phone: the remaining seven-digit phone number
- Int’l Plan: whether the customer has an international calling plan: yes/no
- VMail Plan: whether the customer has a voice mail feature: yes/no
- VMail Message: presumably the average number of voice mail messages per month
- Day Mins: the total number of calling minutes used during the day
- Day Calls: the total number of calls placed during the day
- Day Charge: the billed cost of daytime calls
- Eve Mins, Eve Calls, Eve Charge: the minutes used, calls placed, and billed cost for calls placed during the evening
- Night Mins, Night Calls, Night Charge: the minutes used, calls placed, and billed cost for calls placed during nighttime
- Intl Mins, Intl Calls, Intl Charge: the minutes used, calls placed, and billed cost for international calls
- CustServ Calls: the number of calls placed to Customer Service
- Churn?: whether the customer left the service: true/false. This is the target attribute, and we will use it as the label for our ML model

Let's do some initial data cleaning. The detailed data exploration for this dataset can be found in [1].

In [None]:
churn = churn.drop('Phone', axis=1)
churn = churn.drop(['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge'], axis=1)
churn['Area Code'] = churn['Area Code'].astype(object)

We need to also convert our categorical features into numeric features.

In [None]:
model_data = pd.get_dummies(churn)
model_data = pd.concat([model_data['Churn?_True.'], model_data.drop(['Churn?_False.', 'Churn?_True.'], axis=1)], axis=1)
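To make the two steps above concrete, here is a conceptual sketch in plain Python of what one-hot encoding and label reordering do to each record. It is not how pandas implements `get_dummies`; the helper names `one_hot` and `label_first` and the two toy rows are made up for illustration. The label is moved to the front because the built-in SageMaker XGBoost algorithm expects the target in the first column of CSV input.

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with 0/1 indicator columns, one per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row["{}_{}".format(column, category)] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def label_first(row, label):
    """Return the row's values as a list with the label value in front."""
    return [row[label]] + [v for k, v in row.items() if k != label]


# Toy rows (illustrative values only, not taken from the churn dataset).
rows = [
    {"Int'l Plan": "no", "Day Mins": 265.1, "Churn?": "False."},
    {"Int'l Plan": "yes", "Day Mins": 129.1, "Churn?": "True."},
]

encoded = one_hot(one_hot(rows, "Int'l Plan"), "Churn?")

# Keep only the positive-class indicator as the label, as the cell above does.
training_rows = [
    label_first({k: v for k, v in row.items() if k != "Churn?_False."}, "Churn?_True.")
    for row in encoded
]
print(training_rows)
```

Each resulting row starts with the 0/1 churn label, followed by the numeric feature and the indicator columns for the categorical feature.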

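And now let's shuffle the data and split it into training (70%), validation (20%), and test (10%) sets. Only the training and validation sets are written to CSV here.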

In [None]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
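The cut points passed to `np.split` above are row indices, not sizes. The sketch below works out the resulting chunk sizes for the 3333 records in this dataset using plain Python; `split_sizes` is a hypothetical helper written for this illustration.

```python
def split_sizes(n_rows, cut_fractions=(0.7, 0.9)):
    """Return the sizes of the consecutive chunks produced by cutting at the given fractions."""
    cuts = [int(fraction * n_rows) for fraction in cut_fractions]
    bounds = [0] + cuts + [n_rows]
    return tuple(end - start for start, end in zip(bounds, bounds[1:]))


train_size, validation_size, test_size = split_sizes(3333)
print(train_size, validation_size, test_size)  # 2333 666 334
```

The cut at 70% ends the training set, the cut at 90% ends the validation set, and the remaining rows form the test set.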

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
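Note that `os.path.join` uses the separator of the local operating system. On a Linux notebook instance that is `/`, which matches S3 key conventions, but on Windows it would insert backslashes into the key. A small sketch of an OS-independent alternative using `posixpath` follows; `s3_key` and `s3_uri` are hypothetical helper names, not part of boto3 or the SageMaker SDK.

```python
import posixpath


def s3_key(prefix, relative_path):
    """Join a key prefix and a relative path using '/' regardless of the local OS."""
    return posixpath.join(prefix, relative_path)


def s3_uri(bucket, key):
    """Format a bucket and key as an s3:// URI."""
    return "s3://{}/{}".format(bucket, key)


print(s3_key("data", "train/train.csv"))               # data/train/train.csv
print(s3_uri("BUCKETNAME", s3_key("data", "train/")))  # s3://BUCKETNAME/data/train/
```

The same key can then be passed to `Object(...)` in the upload call and reused when building the training input location.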

## Experiment 1

A good way to associate a new experiment with a Git HEAD is to create a new Git branch for the experiment. Let's create one.

In [None]:
! git checkout -b experiment1

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

To predict customer churn, we will use the XGBoost algorithm. The first step is to specify the location of the XGBoost container.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, we need to specify the location of training and validation data.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

To track the metadata around the first experiment, we need to create an experiment in SageMaker Experiments.

In [None]:
from smexperiments import experiment
commit = {'Key': 'commit-id', 'Value': 'COMMITID'} # Replace COMMITID with the commit id that you get from previous cell
branch = {'Key': 'git-branch', 'Value': 'experiment1'}
tags = [commit, branch]
create_date = strftime("%Y-%m-%d-%H-%M-%S")

my_experiment = experiment.Experiment.create(experiment_name = "experiment1-{}".format(create_date),
                                    description = "experiment1-{}".format(create_date),
                                    tags = tags)
my_trial = my_experiment.create_trial(trial_name = "trial1-{}".format(create_date))
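The experiment and trial names above are made unique by appending a timestamp. Here is a minimal sketch of that naming scheme as a reusable helper. Two things are assumptions on my part rather than something taken from this notebook: the helper uses UTC (the cell above uses local time), and the validity check assumes names may only contain letters, digits, and hyphens with a length limit of 120 characters.

```python
import re
from time import gmtime, strftime


def timestamped_name(base, when=None):
    """Append a UTC timestamp (YYYY-MM-DD-HH-MM-SS) to `base`."""
    when = gmtime() if when is None else when
    return "{}-{}".format(base, strftime("%Y-%m-%d-%H-%M-%S", when))


def looks_like_valid_name(name, max_length=120):
    """Assumed rule: letters, digits, and hyphens only, starting with a letter or digit."""
    pattern = r"[A-Za-z0-9][A-Za-z0-9-]*"
    return len(name) <= max_length and re.fullmatch(pattern, name) is not None


# A fixed time is used here so the example is reproducible.
name = timestamped_name("experiment1", gmtime(0))
print(name)  # experiment1-1970-01-01-00-00-00
```

Calling `timestamped_name("experiment1")` without a time argument produces a name based on the current UTC time.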

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train}, experiment_config={
            "ExperimentName": my_experiment.experiment_name,
            "TrialName": my_trial.trial_name,
            "TrialComponentDisplayName": "iris-xgb-learner",
        },)

In [None]:
from sagemaker import analytics
trial_component_analytics = analytics.ExperimentAnalytics(experiment_name=my_experiment.experiment_name)
analytic_table = trial_component_analytics.dataframe()
analytic_table.columns

## Step 2: Create the first experiment

A good way to associate a new experiment with a Git HEAD is to create a new Git branch for the experiment. Let's create one.

In [4]:
! git checkout -b experiment1

M	.dvc/.gitignore
M	.dvc/config
M	.dvc/plots/confusion.json
M	.dvc/plots/default.json
M	.dvc/plots/scatter.json
M	.dvc/plots/smooth.json
M	.dvcignore
M	README.md
M	data.dvc
Switched to a new branch 'experiment1'

Because the purpose of this demo is to show how to integrate DVC with SageMaker, we do not build an ML model as part of this experiment. Instead, we focus on data versioning.

Let's import the required Python libraries.

In [48]:
import time
from time import strftime
import sagemaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from sklearn import datasets
import pandas as pd
import numpy as np
import boto3
import os

We use the popular iris dataset to build this demo.

In [6]:
iris = datasets.load_iris()
data = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

In [7]:
data.to_csv('data.csv', index=False, header=False)

Let's treat this data as the training dataset for experiment 1 and push it to our storage (S3).

In [8]:
bucket='BUCKETNAME'  # Replace with your bucket name
prefix ='data'

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'data.csv')).upload_file('data.csv')

If we check the dvc status, there is a change in the data.
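DVC notices the change by comparing a checksum recorded in `data.dvc` with the current state of the data. The sketch below illustrates that idea with `hashlib`. It is not DVC's implementation, and for S3 outputs the recorded checksum may be the object's ETag rather than a locally computed MD5 (an assumption worth checking against the DVC documentation). The function names and the toy CSV lines are made up for this example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def has_changed(recorded_hash: str, data: bytes) -> bool:
    """Return True when the data no longer matches the recorded hash."""
    return content_hash(data) != recorded_hash


# Toy CSV contents standing in for two versions of data.csv.
version_1 = b"5.1,3.5,1.4,0.2,0.0\n"
version_2 = b"5.1,3.5,1.4,0.0\n"

recorded = content_hash(version_1)
print(has_changed(recorded, version_1))  # False
print(has_changed(recorded, version_2))  # True
```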

In [9]:
! dvc status

data.dvc:                                                                       
	changed outs:
		modified:           s3://sagemaker-demo-sam/data

Let's add and commit those changes.

In [10]:
! dvc add --external s3://sagemaker-demo-sam/data

100% Add|██████████████████████████████████████████████|1/1 [00:02,  2.20s/file]

To track the changes with git, run:

	git add data.dvc

In [None]:
! git add data.dvc

In [13]:
! git commit -m "initial experiment 1"

Data and pipelines are up to date.
[experiment1 66eacfd] initial experiment 1
 1 file changed, 1 insertion(+), 1 deletion(-)


Now, if we push the changes to Git, the DVC pre-push Git hook also pushes the data changes to the DVC remote.

In [17]:
! git push --set-upstream origin experiment1

Everything is up to date.
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 340 bytes | 21.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote: 
remote: Create a pull request for 'experiment1' on GitHub by visiting:
remote:      https://github.com/ghassem1365/sagemaker-dvc-demo/pull/new/experiment1
remote: 
To github.com:ghassem1365/sagemaker-dvc-demo.git
 * [new branch]      experiment1 -> experiment1
Branch 'experiment1' set up to track remote branch 'experiment1' from 'origin'.


To relate changes in the data to a SageMaker experiment, we can use the Git commit ID. For example, we add the commit ID as a tag on the SageMaker experiment.

In [18]:
! git log --format="%H" -n 1

66eacfde4eb33565b5d4a9527cc6a8f1a63134d8


In [50]:
commit = {'Key': 'commit-id', 'Value': 'COMMITID'} # Replace COMMITID with the commit id that you get from previous cell
branch = {'Key': 'git-branch', 'Value': 'experiment1'}
tags = [commit, branch]
create_date = strftime("%Y-%m-%d-%H-%M-%S")
experiment1 = Experiment.create(experiment_name = "experiment1-{}".format(create_date),
                                    description = "experiment1-{}".format(create_date),
                                    tags = tags)
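Instead of pasting the commit ID by hand, the tags could be built from Git directly. The following is a sketch under that assumption; `git_output`, `build_git_tags`, and `current_git_tags` are hypothetical helpers, and only the `Key`/`Value` tag layout comes from the cell above.

```python
import subprocess


def git_output(*args):
    """Run a git command in the current directory and return its trimmed stdout."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


def build_git_tags(commit_id, branch):
    """Build the tag list in the Key/Value form used for the experiment above."""
    return [
        {"Key": "commit-id", "Value": commit_id},
        {"Key": "git-branch", "Value": branch},
    ]


def current_git_tags():
    """Read the current commit ID and branch name from git and turn them into tags."""
    return build_git_tags(
        git_output("rev-parse", "HEAD"),
        git_output("rev-parse", "--abbrev-ref", "HEAD"),
    )


# Example with the commit ID shown in the output above.
print(build_git_tags("66eacfde4eb33565b5d4a9527cc6a8f1a63134d8", "experiment1"))
```

Inside the repository, `tags = current_git_tags()` would replace the manual `COMMITID` substitution.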

Our first experiment is done. We versioned the data (via DVC), tracked the model (via SageMaker Experiments), and related the two via the Git commit ID.

## Step 3: Create the second experiment

Let's create the second experiment through the same process.

In [20]:
! git checkout -b experiment2

M	.dvc/.gitignore
M	.dvc/config
M	.dvc/plots/confusion.json
M	.dvc/plots/default.json
M	.dvc/plots/scatter.json
M	.dvc/plots/smooth.json
M	.dvcignore
M	README.md
M	data.dvc
Switched to a new branch 'experiment2'

In this experiment, we drop one of the columns from the original dataset to model a data change. Let's follow the same procedure as in experiment 1 to push the data to S3 and track the changes.

In [21]:
data = data.drop('petal width (cm)', axis=1)

In [22]:
data.head(3)

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | target |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.0 |


In [23]:
data.to_csv('data.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'data.csv')).upload_file('data.csv')

In [24]:
! dvc add --external s3://sagemaker-demo-sam/data

100% Add|██████████████████████████████████████████████|1/1 [00:02,  2.45s/file]

To track the changes with git, run:

	git add data.dvc

In [26]:
! git add data.dvc

In [27]:
! git commit -m "initial experiment 2"

Data and pipelines are up to date.
[experiment2 c63f862] initial experiment 2
 1 file changed, 1 insertion(+), 1 deletion(-)


In [28]:
! git push --set-upstream origin experiment2

Everything is up to date.
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 340 bytes | 18.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote: 
remote: Create a pull request for 'experiment2' on GitHub by visiting:
remote:      https://github.com/ghassem1365/sagemaker-dvc-demo/pull/new/experiment2
remote: 
To github.com:ghassem1365/sagemaker-dvc-demo.git
 * [new branch]      experiment2 -> experiment2
Branch 'experiment2' set up to track remote branch 'experiment2' from 'origin'.


Again, we use the Git commit ID to relate the data version to a specific experiment.

In [29]:
! git log --format="%H" -n 1

c63f86215292346491e20174b46e756f5abe8575


In [30]:
commit = {'Key': 'commit-id', 'Value': 'COMMITID'} # Replace COMMITID with the commit id that you get from previous cell
branch = {'Key': 'git-branch', 'Value': 'experiment2'}
tags = [commit, branch]
create_date = strftime("%Y-%m-%d-%H-%M-%S")
demo_experiment = Experiment.create(experiment_name = "experiment2-{}".format(create_date),
                                    description = "experiment2-{}".format(create_date),
                                    tags = tags)

Now we have two fully tracked experiments: the data via DVC, the model via SageMaker Experiments, and the two related to each other via the Git commit ID.

## Step 4: Move to a different data version

We can check out the experiment1 branch, and the DVC post-checkout hook will also switch the data to the version used in that experiment.

In [None]:
! git checkout experiment1

We can also use the commit id to go back to a specific data version.

In [None]:
! git checkout d72fd71ac8ffedb6b5b8c0492e8dcc5ac0e8610b data.dvc
! dvc pull
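The two commands above can also be wrapped in a small Python helper so that restoring a data version is a single call. This is a sketch, not part of DVC; `restore_commands` and `restore_data_version` are hypothetical names, and the `--` separator is added only to make explicit that `data.dvc` is a path.

```python
import subprocess


def restore_commands(revision, dvc_file="data.dvc"):
    """Return the git and dvc commands that restore `dvc_file` as of `revision`."""
    return [
        ["git", "checkout", revision, "--", dvc_file],
        ["dvc", "pull"],
    ]


def restore_data_version(revision, dvc_file="data.dvc"):
    """Run the commands in order, stopping at the first failure."""
    for command in restore_commands(revision, dvc_file):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so the sketch has no side effects.
for command in restore_commands("experiment1"):
    print(" ".join(command))
```

Inside the repository, `restore_data_version("<commit id or branch>")` would run both steps.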