## Sagemaker, DAAL Comparison with MNIST, Synthetic Datasets (0.5Mx20 and 4.8Mx38)


1. [Introduction](#Introduction)

2. [MNIST Example: 50Kx784, k=10](#MNIST-Example:-50Kx784,-k=10)
    1. [Data Prep](#Data-Prep)
    2. [Sagemaker Training Setup]
    3. [Start Sagemaker Training]
    4. [Download Sagemaker Trained Model and Compute Accuracy]
    5. [Intel DAAL Kmeans with MNIST]
    6. [Data Prep]
    7. [DAAL Training Setup]
    8. [Start DAAL Training]
    9. [Download DAAL Trained Model and Compute Accuracy]
    10. [MNIST Summary]

3. [Synthetic Dataset 1: 0.5Mx20, k=20]
    1. [Data Prep]
    2. [Sagemaker Training Setup]
    3. [Start Sagemaker Training]
    4. [Intel DAAL Kmeans with 0.5Mx20 Synthetic Dataset]
    5. [Data Prep]
    6. [DAAL Training Setup]
    7. [Start DAAL Training]
    8. [Synthetic Dataset 1 (0.5Mx20) Summary]

4. [Synthetic Dataset 2: 4.8Mx38, k=20]
    1. [Download dataset]
    2. [Data Prep]
    3. [Sagemaker Training Setup]
    4. [Start Sagemaker Training]
    5. [Intel DAAL Kmeans with 4.8Mx38 Synthetic Dataset]
    6. [Data Prep]
    7. [DAAL Training Setup]
    8. [Start DAAL Training]
    9. [Synthetic Dataset 2 (4.8Mx38) Summary]


## Introduction

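This notebook compares the KMeans implementations in Amazon SageMaker and Intel DAAL on MNIST and on two synthetic datasets. Every training job runs on a single `ml.c5.18xlarge` instance, and each section ends with a summary of the training times (plus clustering quality for MNIST).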

In [1]:
!pip install mxnet

distributed 1.21.8 requires msgpack, which is not installed.
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


## MNIST Example: 50Kx784, k=10

### Data Prep

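Next, we download the dataset and read it into memory for preprocessing prior to training. We use the MNIST dataset, which contains 70K 28 x 28 pixel images of handwritten digits. The pickled copy used here is split into 50K training, 10K validation, and 10K test images; the models are trained on the 50K training split, which gives the 50Kx784 shape in the section title. For more details, see the [MNIST database page](http://yann.lecun.com/exdb/mnist/).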

In [2]:
%%time
import pickle, gzip, urllib.request

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

CPU times: user 952 ms, sys: 284 ms, total: 1.24 s
Wall time: 3.06 s
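
The pickle unpacks into three splits, each a `(features, labels)` pair. The sketch below wraps the same load in a small helper that checks this layout. The helper name `load_splits` and the toy stand-in file are illustrative assumptions, not part of the original notebook; the stand-in uses plain lists where the real file holds NumPy arrays. Unpickling executes code from the file, so this should only be done with files from a trusted source.

```python
import gzip
import os
import pickle
import tempfile


def load_splits(path):
    """Load a gzip-compressed pickle holding (train, valid, test) pairs."""
    with gzip.open(path, "rb") as f:
        # encoding='latin1' lets Python 3 read data pickled under Python 2
        splits = pickle.load(f, encoding="latin1")
    if len(splits) != 3:
        raise ValueError("expected 3 splits, got {}".format(len(splits)))
    for name, (features, labels) in zip(("train", "valid", "test"), splits):
        if len(features) != len(labels):
            raise ValueError("{}: features and labels differ in length".format(name))
    return splits


# Hypothetical stand-in file with the same layout, so the sketch runs offline.
toy = (([[0.0, 1.0], [1.0, 0.0]], [3, 7]), ([[0.5, 0.5]], [1]), ([[0.2, 0.8]], [4]))
toy_path = os.path.join(tempfile.mkdtemp(), "toy.pkl.gz")
with gzip.open(toy_path, "wb") as f:
    pickle.dump(toy, f)

toy_train, toy_valid, toy_test = load_splits(toy_path)
print(len(toy_train[0]), len(toy_valid[0]), len(toy_test[0]))
```

With the real file, `load_splits('mnist.pkl.gz')` would return the same three pairs that the cell above unpacks into `train_set`, `valid_set`, and `test_set`.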


In [3]:
bucket_name = 'rpanchum' 
sm_mnist_data_key = 'kmeans_sm_mnist/'
sm_mnist_data_location = 's3://{}/{}'.format(bucket_name, sm_mnist_data_key)
sm_mnist_output_location = 's3://{}/{}'.format(bucket_name, sm_mnist_data_key)

print('training artifacts will be uploaded to: {}'.format(sm_mnist_output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_mnist/
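
The data keys in this notebook are not consistent about trailing slashes: `sm_mnist_data_key` ends in `/`, while the DAAL and synthetic-dataset keys do not, so later cells have to remember whether to add a separator when building the model key. A minimal sketch of a helper that normalizes this is shown below; `s3_uri` and `s3_key` are hypothetical names introduced for illustration only.

```python
def s3_key(*parts):
    """Join key fragments with single slashes, ignoring empty fragments."""
    cleaned = [p.strip("/") for p in parts]
    return "/".join(p for p in cleaned if p)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and any number of key fragments."""
    key = s3_key(*parts)
    if key:
        return "s3://{}/{}".format(bucket, key)
    return "s3://{}".format(bucket)


print(s3_uri("rpanchum", "kmeans_sm_mnist/"))
print(s3_key("kmeans_daal_mnist", "some-job-name", "/output/model.tar.gz"))
```

Because the fragments are stripped before joining, the same call works whether or not a prefix was defined with a trailing slash.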


### Sagemaker Training Setup

In [4]:
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=sm_mnist_output_location,
                    k=10,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=50,
                    data_location=sm_mnist_data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set[0])
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_mnist/KMeans-2018-09-28-19-38-15-773/.amazon.manifest', 'feature_dim': 784, 'num_records': 50000, 's3_data_type': 'ManifestFile', 'channel': 'train'})
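
Both the Sagemaker estimator (`init_method='kmeans++'`) and the DAAL container (`'initialCentroidMethod': 'plusPlusDense'`) are configured to seed the centroids with a k-means++ style method. As a rough illustration of the idea, the pure-Python sketch below implements textbook k-means++ seeding: the first center is picked uniformly at random, and each further center is picked with probability proportional to its squared distance from the nearest center chosen so far. This is not the code either library runs, and the function names are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using textbook k-means++ seeding."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # nearest[i] = squared distance from points[i] to its closest chosen center
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(nearest)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(points[rng.randrange(len(points))])
            continue
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = len(points) - 1
        for i, weight in enumerate(nearest):
            cumulative += weight
            if cumulative > threshold:
                chosen = i
                break
        centers.append(points[chosen])
        nearest = [min(old, squared_distance(p, points[chosen]))
                   for old, p in zip(nearest, points)]
    return centers


corners = [(0, 0), (0, 100), (100, 0), (100, 100)]
seeds = kmeans_pp_init(corners, k=4, seed=1)
print(sorted(seeds))
```

Points that are already centers have weight zero, so with four distinct points and `k=4` each point is selected once. Spreading the initial centers this way usually makes the subsequent Lloyd iterations less sensitive to a poor random start.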


### Start Sagemaker Training

In [5]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-09-28-19-38-26-129


2018-09-28 19:38:26 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-09-28 19:40:06 Downloading - Downloading input data
2018-09-28 19:40:12 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[09/28/2018 19:40:40 INFO 140275552188224] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000', u'half_life_time_size': u'0', u'_num_slices': u'1'}

In [6]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-09-28-19-38-26-129
Model saved at:  s3://rpanchum/kmeans_sm_mnist/kmeans-2018-09-28-19-38-26-129/output/model.tar.gz


### Download Sagemaker Trained Model and Compute Accuracy

In [7]:
import os, boto3
import mxnet as mx

sm_mnist_model_key = sm_mnist_data_key + sm_kmeans.latest_training_job.name + '/output/model.tar.gz'
print ('Downloading the model saved at: ', sm_kmeans.model_data)

boto3.resource('s3').Bucket(bucket_name).download_file(sm_mnist_model_key, 'sm_model.tar.gz')

os.system('tar -zxf sm_model.tar.gz && unzip model_algo-1')

Kmeans_model_params = mx.ndarray.load('model_algo-1')
sagemaker_centroids=Kmeans_model_params[0].asnumpy()
print('MNIST SM centroids shape: ', sagemaker_centroids.shape)




Downloading the model saved at:  s3://rpanchum/kmeans_sm_mnist/kmeans-2018-09-28-19-38-26-129/output/model.tar.gz
MNIST SM centroids shape:  (10, 784)


In [8]:
from sklearn.cluster import KMeans 
from sklearn.metrics.cluster import v_measure_score

sklearn_kmeans = KMeans(10)
sklearn_kmeans.cluster_centers_ = sagemaker_centroids

sm_mnist_train_assignments=sklearn_kmeans.predict(train_set[0])
sm_mnist_test_assignments=sklearn_kmeans.predict(test_set[0])

print("SM Accuracy on MNIST Train Set: ", str(v_measure_score(train_set[1], sm_mnist_train_assignments)))
print("SM Accuracy on MNIST Test Set: " , str(v_measure_score(test_set[1], sm_mnist_test_assignments)))

SM Accuracy on MNIST Train Set:  0.35307083305090553
SM Accuracy on MNIST Test Set:  0.3718264677316023
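
The cell above does not retrain anything: it injects the downloaded centroids into a scikit-learn `KMeans` object and uses `predict` only to assign each image to its nearest centroid. Assigning `cluster_centers_` on an unfitted estimator relies on scikit-learn internals and may not work on other releases, so the pure-Python sketch below shows the same assignment step without that dependency. The function name `assign_to_nearest` is an assumption for this example.

```python
def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean)."""
    assignments = []
    for point in points:
        distances = [
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for centroid in centroids
        ]
        # Ties go to the lowest centroid index.
        assignments.append(distances.index(min(distances)))
    return assignments


toy_centroids = [(0.0, 0.0), (10.0, 10.0)]
toy_points = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
toy_assignments = assign_to_nearest(toy_points, toy_centroids)
print(toy_assignments)
```

For the MNIST-sized arrays in this notebook a vectorized NumPy version would be much faster; the loop form is only meant to make the computation explicit.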


## Intel DAAL Kmeans with MNIST 

### Data Prep

In [9]:
import numpy as np
np.savetxt("train_data.csv", train_set[0], delimiter=",")
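
`np.savetxt` writes every value with its default format `'%.18e'`, which makes the CSV far larger than the in-memory array. A back-of-the-envelope estimate is sketched below, assuming non-negative values (so no sign character) and one delimiter or newline byte after each field; `savetxt_bytes` is a hypothetical helper, not a NumPy function.

```python
def savetxt_bytes(rows, cols, fmt="%.18e", sample=0.5):
    """Estimate CSV size for a fixed-width format and non-negative values."""
    field_width = len(fmt % sample)      # width of one formatted value
    return rows * cols * (field_width + 1)  # +1 for the delimiter or newline


mnist_csv_bytes = savetxt_bytes(50000, 784)
print(mnist_csv_bytes / 1e6, "MB")
```

Under these assumptions the 50Kx784 training split comes to roughly 980 MB on disk. Passing a shorter `fmt` to `np.savetxt` would shrink the file at the cost of precision; this estimate only holds for fixed-width formats.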

In [11]:
training_data_file = 'train_data.csv'
daal_mnist_data_key = 'kmeans_daal_mnist'
daal_mnist_output_location = 's3://{}/{}'.format(bucket_name, daal_mnist_data_key)

print ("Training artifacts will be uploaded at: " + daal_mnist_output_location)

Training artifacts will be uploaded at: s3://rpanchum/kmeans_daal_mnist


### DAAL Training Setup

In [12]:
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

daal_mnist_data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=daal_mnist_data_key)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=daal_mnist_output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 10,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 50 } 
                                      )

### Start DAAL Training

In [13]:
%%time
daal_kmeans.fit(daal_mnist_data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-09-28-19-42-47-284


2018-09-28 19:42:47 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-09-28 19:44:35 Downloading - Downloading input data...
2018-09-28 19:44:53 Training - Downloading the training image...
Training image download completed. Training in progress..
2018-09-28 19:45:44 INFO     Container setup completed, In Docker entrypoint - train...
2018-09-28 19:45:44 INFO     Default Hyperparameters loaded:
2018-09-28 19:45:44 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-09-28 19:45:44 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-09-28 19:45:44 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1
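
The log shows the container loading a set of default hyperparameters and then overlaying the values passed to the estimator. SageMaker hands hyperparameters to a training container as strings, so an overlay step typically has to cast them back to the type of the default. The sketch below illustrates one way to do that; it is not the container's actual code, and `merge_hyperparameters` is a name made up for this example. The defaults are copied from the log above.

```python
DEFAULTS = {
    "accuracyThreshold": 0.0001,
    "assignFlag": True,
    "distanceType": "euclidean",
    "gamma": 1.0,
    "initialCentroidMethod": "defaultDense",
    "maxIterations": 300,
    "method": "defaultDense",
    "nClusters": 2,
    "nRounds": 5,
    "oversamplingFactor": 0.5,
}


def merge_hyperparameters(defaults, user):
    """Overlay user values (usually strings) on defaults, casting by default type."""
    merged = dict(defaults)
    for name, raw in user.items():
        if name not in defaults:
            raise KeyError("unknown hyperparameter: {}".format(name))
        default = defaults[name]
        if isinstance(default, bool):
            # bool must be checked before int, since bool is a subclass of int
            merged[name] = str(raw).lower() == "true"
        else:
            merged[name] = type(default)(raw)
    return merged


final = merge_hyperparameters(
    DEFAULTS,
    {"nClusters": "10", "initialCentroidMethod": "plusPlusDense", "maxIterations": "50"},
)
print(final["nClusters"], final["initialCentroidMethod"], final["maxIterations"])
```

Rejecting unknown names early is a design choice for this sketch: a misspelled key such as `maxIteration` would otherwise be ignored silently and the job would run with the default of 300 iterations.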

In [14]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-09-28-19-42-47-284
Model saved at:  s3://rpanchum/kmeans_daal_mnist/daal-kmeans-sample-2018-09-28-19-42-47-284/output/model.tar.gz


### Download DAAL Trained Model and Compute Accuracy

In [15]:
import os, boto3

daal_mnist_model_key = daal_mnist_data_key + "/" + daal_kmeans.latest_training_job.name + '/output/model.tar.gz'
print ('Downloading the model saved at: ', daal_kmeans.model_data)
print(daal_mnist_model_key)
boto3.resource('s3').Bucket(bucket_name).download_file(daal_mnist_model_key, 'daal_model.tar.gz')

os.system('tar -zxf daal_model.tar.gz')
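# The file appears to hold a pickled (centroids, assignments) pair; recent NumPy releases only load pickled data when allow_pickle=True
daal_centroids, daal_assignments = np.load("daal-kmeans-model.npy", encoding='latin1', allow_pickle=True)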
print('MNIST DAAL centroids shape: ', daal_centroids.shape)

Downloading the model saved at:  s3://rpanchum/kmeans_daal_mnist/daal-kmeans-sample-2018-09-28-19-42-47-284/output/model.tar.gz
kmeans_daal_mnist/daal-kmeans-sample-2018-09-28-19-42-47-284/output/model.tar.gz
MNIST DAAL centroids shape:  (10, 784)


In [16]:
from sklearn.cluster import KMeans 
from sklearn.metrics.cluster import v_measure_score

sklearn_kmeans = KMeans(10)
sklearn_kmeans.cluster_centers_ = daal_centroids

daal_mnist_train_assignments=sklearn_kmeans.predict(train_set[0])
daal_mnist_test_assignments=sklearn_kmeans.predict(test_set[0])

print("DAAL Accuracy on MNIST Train Set: ", str(v_measure_score(train_set[1], daal_mnist_train_assignments)))
print("DAAL Accuracy on MNIST Test Set: " , str(v_measure_score(test_set[1], daal_mnist_test_assignments)))

DAAL Accuracy on MNIST Train Set:  0.4937834635054327
DAAL Accuracy on MNIST Test Set:  0.5073462004406061
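
The scores printed as "accuracy" in this notebook come from `v_measure_score`, which is a clustering metric rather than classification accuracy: it is the harmonic mean of homogeneity and completeness and does not depend on how cluster indices are numbered. The stdlib-only sketch below reimplements the standard definition to make that explicit; it is an illustration, not scikit-learn's code, and the helper names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two paired label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_classes == 0 else (
        1.0 - conditional_entropy(classes, clusters) / h_classes)
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Same grouping with swapped cluster ids still scores as a perfect match.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# A single cluster carries no information about the classes.
collapsed = v_measure([0, 0, 1, 1], [0, 0, 0, 0])
print(perfect, collapsed)
```

Because the metric ignores label numbering, no mapping from cluster index to digit is needed, which is why the cells above can compare raw cluster assignments against the MNIST labels directly.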


### MNIST Summary
k=10, maxIterations=50 

Sagemaker Training Time Only: 0.38 sec  
DAAL Training Time Only: 0.57 sec  

Sagemaker MNIST Test Accuracy (V-measure): 37.2%  
DAAL MNIST Test Accuracy (V-measure): 50.7%  

## Synthetic Dataset 1: 0.5Mx20, k=20

### Data Prep

In [17]:
!wget -O mlsd2_500000_20_20.csv https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/0.5Mx20/mlsd2_500000_20_20.csv

--2018-09-28 19:46:30--  https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/0.5Mx20/mlsd2_500000_20_20.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.248.120
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.248.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105000000 (100M) [text/csv]
Saving to: ‘mlsd2_500000_20_20.csv.3’


2018-09-28 19:46:32 (86.2 MB/s) - ‘mlsd2_500000_20_20.csv.3’ saved [105000000/105000000]



In [18]:
%%time
import pandas as pd
import numpy as np

input_file_0_5Mx20 = "mlsd2_500000_20_20.csv"
data = pd.read_csv(input_file_0_5Mx20, header=None, dtype=np.float32)
print(data.describe())

train_set = np.array(data)
print ("Training data shape: ", train_set.shape)

                  0              1              2              3   \
count  500000.000000  500000.000000  500000.000000  500000.000000   
mean       54.440392      55.849094      57.122341      57.997566   
std        17.286438      15.576500      18.243711      18.005829   
min        24.709999      20.190001      22.700001      20.540001   
25%        36.709999      43.220001      42.349998      42.750000   
50%        54.250000      53.299999      61.349998      60.150002   
75%        71.160004      70.860001      72.519997      74.239998   
max        91.169998      89.129997      89.080002      92.459999   

                  4              5              6              7   \
count  500000.000000  500000.000000  500000.000000  500000.000000   
mean       54.450413      56.562336      55.726784      56.136658   
std        17.778864      18.171492      16.956083      17.414059   
min        24.790001      22.330000      24.530001      21.500000   
25%        35.520000      40.4300

In [19]:
bucket_name = 'rpanchum' 
sm_0_5Mx20_data_key = 'kmeans_sm_0.5Mx20'
sm_0_5Mx20_data_location = 's3://{}/{}'.format(bucket_name, sm_0_5Mx20_data_key)
sm_0_5Mx20_output_location = 's3://{}/{}'.format(bucket_name, sm_0_5Mx20_data_key)

print('training artifacts will be uploaded to: {}'.format(sm_0_5Mx20_output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_0.5Mx20


### Sagemaker Training Setup

In [20]:
%%time
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=sm_0_5Mx20_output_location,
                    k=20,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=100,
                    data_location=sm_0_5Mx20_data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set)
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_0.5Mx20/KMeans-2018-09-28-19-46-41-600/.amazon.manifest', 'feature_dim': 21, 'num_records': 500000, 's3_data_type': 'ManifestFile', 'channel': 'train'})
CPU times: user 17.3 s, sys: 256 ms, total: 17.6 s
Wall time: 15.2 s


### Start Sagemaker Training

In [21]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-09-28-19-46-56-381


2018-09-28 19:46:56 Starting - Starting the training job...
Launching requested ML instances...
Preparing the instances for training......
2018-09-28 19:48:52 Downloading - Downloading input data...
2018-09-28 19:48:58 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
[09/28/2018 19:49:28 INFO 140644009961280] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000', u'half_life_time_size': 

In [22]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-09-28-19-46-56-381
Model saved at:  s3://rpanchum/kmeans_sm_0.5Mx20/kmeans-2018-09-28-19-46-56-381/output/model.tar.gz


## Intel DAAL Kmeans with 0.5Mx20 Synthetic Dataset 

### Data Prep

In [23]:
bucket_name = 'rpanchum' 
daal_0_5Mx20_data_key = 'kmeans_sm_0.5Mx20'

daal_0_5Mx20_output_location = 's3://{}/{}'.format(bucket_name, sm_0_5Mx20_data_key)

training_data_file = input_file_0_5Mx20
print('training artifacts will be uploaded to: {}'.format(daal_0_5Mx20_output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_0.5Mx20


### DAAL Training Setup

In [24]:
%%time
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

daal_0_5Mx20_data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=daal_0_5Mx20_data_key)
print ("Training data uploaded at: " + daal_0_5Mx20_data_location)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=daal_0_5Mx20_output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 20,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 100 } 
                                      )

Training data uploaded at: s3://rpanchum/kmeans_sm_0.5Mx20/mlsd2_500000_20_20.csv
CPU times: user 1.09 s, sys: 172 ms, total: 1.26 s
Wall time: 2.33 s


### Start DAAL Training

In [25]:
%%time
daal_kmeans.fit(daal_0_5Mx20_data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-09-28-19-50-10-745


2018-09-28 19:50:10 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-09-28 19:51:47 Downloading - Downloading input data
2018-09-28 19:51:55 Training - Downloading the training image...
Training image download completed. Training in progress..
2018-09-28 19:52:39 INFO     Container setup completed, In Docker entrypoint - train...
2018-09-28 19:52:39 INFO     Default Hyperparameters loaded:
2018-09-28 19:52:39 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-09-28 19:52:39 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-09-28 19:52:39 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,

In [26]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-09-28-19-50-10-745
Model saved at:  s3://rpanchum/kmeans_sm_0.5Mx20/daal-kmeans-sample-2018-09-28-19-50-10-745/output/model.tar.gz


### Synthetic Dataset 1 (0.5Mx20) Summary

k=20, maxIterations=100

Sagemaker Training Time Only: 0.69 sec  
DAAL Training Time Only: 0.20 sec  


## Synthetic Dataset 2: 4.8Mx38, k=20 

### Download dataset

In [27]:
!wget -O mlsd1_4898430_38_20.csv https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/4.8Mx38/mlsd1_4898430_38_20.csv

--2018-09-28 19:53:22--  https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/4.8Mx38/mlsd1_4898430_38_20.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.209.8
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.209.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 435733898 (416M) [text/csv]
Saving to: ‘mlsd1_4898430_38_20.csv’


2018-09-28 19:53:27 (93.3 MB/s) - ‘mlsd1_4898430_38_20.csv’ saved [435733898/435733898]



### Data Prep

In [28]:
%%time
import pandas as pd
import numpy as np

input_file_4_8Mx38 = "mlsd1_4898430_38_20.csv"
data = pd.read_csv(input_file_4_8Mx38, header=None, dtype=np.float32)
print (data.describe())

train_set = np.array(data)
print ("Training data shape: ", train_set.shape)

                 0             1             2             3             4   \
count  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06   
mean   4.833713e+01  1.843839e+03  1.091742e+03  5.716117e-06  6.487793e-04   
std    7.223127e+02  9.414288e+05  6.450107e+05  2.390827e-03  4.284953e-02   
min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
25%    0.000000e+00  4.500000e+01  0.000000e+00  0.000000e+00  0.000000e+00   
50%    0.000000e+00  5.200000e+02  0.000000e+00  0.000000e+00  0.000000e+00   
75%    0.000000e+00  1.032000e+03  0.000000e+00  0.000000e+00  0.000000e+00   
max    5.832900e+04  1.379964e+09  1.309937e+09  1.000000e+00  3.000000e+00   

                 5             6             7             8             9   \
count  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06   
mean   7.961735e-06  1.243766e-02  3.205108e-05  1.435288e-01  8.088306e-03   
std    7.215080e-03  4.688095e-01  7.299343e-03  3.

In [29]:
bucket_name = 'rpanchum' 
sm_4_8Mx38_data_key = 'kmeans_sm_4.8Mx38'
sm_4_8Mx38_data_location = 's3://{}/{}'.format(bucket_name, sm_4_8Mx38_data_key)
sm_4_8Mx38_output_location = 's3://{}/{}'.format(bucket_name, sm_4_8Mx38_data_key)

print('training artifacts will be uploaded to: {}'.format(sm_4_8Mx38_output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_4.8Mx38


### Sagemaker Training Setup

In [30]:
%%time
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=sm_4_8Mx38_output_location,
                    k=20,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=100,
                    data_location=sm_4_8Mx38_data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set)
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_4.8Mx38/KMeans-2018-09-28-19-53-51-236/.amazon.manifest', 'feature_dim': 38, 'num_records': 4898430, 's3_data_type': 'ManifestFile', 'channel': 'train'})
CPU times: user 2min 34s, sys: 3.17 s, total: 2min 37s
Wall time: 2min 50s


### Start Sagemaker Training

In [31]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-09-28-19-56-41-770


2018-09-28 19:56:42 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training......
2018-09-28 19:58:52 Downloading - Downloading input data
2018-09-28 19:59:08 Training - Downloading the training image...
2018-09-28 19:59:39 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
[09/28/2018 19:59:30 INFO 139688635033408] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_bat

In [32]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-09-28-19-56-41-770
Model saved at:  s3://rpanchum/kmeans_sm_4.8Mx38/kmeans-2018-09-28-19-56-41-770/output/model.tar.gz


## Intel DAAL Kmeans with 4.8Mx38 Synthetic Dataset 

### Data Prep

In [33]:
bucket_name = 'rpanchum' 
daal_4_8Mx38_data_key = 'kmeans_daal_4.8Mx38'

daal_4_8Mx38_output_location = 's3://{}/{}'.format(bucket_name, daal_4_8Mx38_data_key)

training_data_file = input_file_4_8Mx38
print('training artifacts will be uploaded to: {}'.format(daal_4_8Mx38_output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_daal_4.8Mx38


### DAAL Training Setup

In [34]:
%%time
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

daal_4_8Mx38_data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=daal_4_8Mx38_data_key)
print ("Training data uploaded at: " + daal_4_8Mx38_data_location)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=daal_4_8Mx38_output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 20,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 100 } 
                                      )

Training data uploaded at: s3://rpanchum/kmeans_daal_4.8Mx38/mlsd1_4898430_38_20.csv
CPU times: user 3.47 s, sys: 580 ms, total: 4.05 s
Wall time: 5.26 s


### Start DAAL Training

In [35]:
%%time
daal_kmeans.fit(daal_4_8Mx38_data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-09-28-20-00-29-976


2018-09-28 20:00:30 Starting - Starting the training job...
Launching requested ML instances.........
Preparing the instances for training......
2018-09-28 20:03:19 Downloading - Downloading input data
2018-09-28 20:03:31 Training - Downloading the training image......
Training image download completed. Training in progress.
2018-09-28 20:04:16 INFO     Container setup completed, In Docker entrypoint - train...
2018-09-28 20:04:16 INFO     Default Hyperparameters loaded:
2018-09-28 20:04:16 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-09-28 20:04:16 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-09-28 20:04:16 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamm

In [36]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-09-28-20-00-29-976
Model saved at:  s3://rpanchum/kmeans_daal_4.8Mx38/daal-kmeans-sample-2018-09-28-20-00-29-976/output/model.tar.gz


### Synthetic Dataset 2 (4.8Mx38) Summary

k=20, maxIterations=100

Sagemaker Training Time Only: 7.78 sec  
DAAL Training Time Only: 2.92 sec  
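
To compare the three summaries side by side, the short sketch below collects the reported training-only times and prints the Sagemaker-to-DAAL time ratio for each dataset (a ratio above 1 means DAAL trained faster). It uses only the numbers listed in the summaries; the variable names are illustrative.

```python
# (dataset, Sagemaker training seconds, DAAL training seconds), from the summaries
RESULTS = [
    ("MNIST 50Kx784, k=10", 0.38, 0.57),
    ("Synthetic 0.5Mx20, k=20", 0.69, 0.20),
    ("Synthetic 4.8Mx38, k=20", 7.78, 2.92),
]

ratios = {}
for name, sm_seconds, daal_seconds in RESULTS:
    ratios[name] = sm_seconds / daal_seconds
    print("{:<26} SM {:>5.2f}s  DAAL {:>5.2f}s  SM/DAAL {:.2f}x".format(
        name, sm_seconds, daal_seconds, ratios[name]))
```

These are single runs, so the ratios are indicative only; repeating each job several times would be needed before drawing firm conclusions about relative speed.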
