## Sagemaker, DAAL Comparison with MNIST, Synthetic Datasets (0.5Mx20 and 4.8Mx38)


1. [Introduction](#Introduction)  

2. [MNIST Example: 50Kx784, k=10](#MNIST-Example:-50Kx784,-k=10)  
    1. [MNIST Data Prep](#MNIST-Data-Prep)  
    2. [Sagemaker Training Setup]  
    3. [Start Sagemaker Training]  
    4. [Download Sagemaker Trained Model and Compute Accuracy]  
    5. [Intel DAAL Kmeans with MNIST]  
    6. [Data Prep]  
    7. [DAAL Training Setup]  
    8. [Start DAAL Training]  
    9. [Download DAAL Trained Model and Compute Accuracy]  
    10. [MNIST Summary]  

3. [Synthetic Dataset 1: 0.5Mx20, k=20]
    1. [Data Prep]
    2. [Sagemaker Training Setup]
    3. [Start Sagemaker Training]
    4. [Intel DAAL Kmeans with 0.5Mx20 Synthetic Dataset]
    5. [Data Prep]
    6. [DAAL Training Setup]
    7. [Start DAAL Training]
    8. [Synthetic Dataset 1 (0.5Mx20) Summary]  

4. [Synthetic Dataset 2: 4.8Mx38, k=20]
    1. [Download dataset]
    2. [Data Prep]
    3. [Sagemaker Training Setup]
    4. [Start Sagemaker Training]
    5. [Intel DAAL Kmeans with 4.8Mx38 Synthetic Dataset]
    6. [Data Prep]
    7. [DAAL Training Setup]
    8. [Start DAAL Training]
    9. [Synthetic Dataset 2 (4.8Mx38) Summary]

5. [Covdata Dataset: 0.5Mx53, k=7]
    1. [Data Prep]
    2. [Sagemaker Training Setup]
    3. [Start Sagemaker Training]
    4. [Download Sagemaker Trained Model and Compute Accuracy]
    5. [Intel DAAL Kmeans with Covdata Dataset]
    6. [Data Prep]
    7. [DAAL Training Setup]
    8. [Start DAAL Training]
    9. [Covdata (0.5Mx53) Summary]


## Introduction

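This notebook compares the KMeans implementations in Sagemaker and Intel DAAL on MNIST, two synthetic datasets, and the Covdata dataset. For each dataset we train both implementations with the same `k` and iteration limit on the same instance type and compare training time; for MNIST we also compare clustering quality against the known labels.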

In [1]:
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/71/64/49c5125befd5e0f0e17f115d55cb78080adacbead9d19f253afd0157656a/mxnet-1.3.0.post0-py2.py3-none-manylinux1_x86_64.whl (27.7MB)
    100% |████████████████████████████████| 27.8MB 1.9MB/s
Collecting numpy<1.15.0,>=1.8.2 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/e5/c4/395ebb218053ba44d64935b3729bc88241ec279915e72100c5979db10945/numpy-1.14.6-cp36-cp36m-manylinux1_x86_64.whl (13.8MB)
    100% |████████████████████████████████| 13.8MB 6.6MB/s
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
distributed 1.21.8 requires msgpack, which is not installed.
Installing collected packages: numpy, graphviz, mxnet
  Found existing installation: numpy 1.15.1
    Uninstalling numpy-1.15.1:
      Successfully uninstalled n

## MNIST Example: 50Kx784, k=10

### MNIST Data Prep

Next, we read the dataset from the existing repository into memory, for preprocessing prior to training.  In this case we'll use the MNIST dataset, which contains 70K 28 x 28 pixel images of handwritten digits.  For more details, please see [here](http://yann.lecun.com/exdb/mnist/).

In [2]:
%%time
import pickle, gzip, urllib.request

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    

CPU times: user 1.15 s, sys: 4.33 s, total: 5.48 s
Wall time: 5.76 s


In [3]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_sm_mnist'
data_location = 's3://{}/{}'.format(bucket_name, data_key)
output_location = 's3://{}/{}'.format(bucket_name, data_key)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_mnist


### Sagemaker Training Setup

In [4]:
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=output_location,
                    k=10,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=50,
                    data_location=data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set[0])
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_mnist/KMeans-2018-10-02-18-09-56-857/.amazon.manifest', 'feature_dim': 784, 'num_records': 50000, 's3_data_type': 'ManifestFile', 'channel': 'train'})


### Start Sagemaker Training

In [5]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-10-02-18-10-06-829


2018-10-02 18:10:07 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-02 18:11:42 Downloading - Downloading input data
2018-10-02 18:11:48 Training - Downloading the training image...
2018-10-02 18:12:29 Uploading - Uploading generated training model
2018-10-02 18:12:34 Completed - Training job completed

Docker entrypoint called with argument(s): train
[10/02/2018 18:12:25 INFO 139680048461632] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metri

In [6]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-10-02-18-10-06-829
Model saved at:  s3://rpanchum/kmeans_sm_mnist/kmeans-2018-10-02-18-10-06-829/output/model.tar.gz


### Download Sagemaker Trained Model and Compute Accuracy

In [9]:
import os, boto3
import mxnet as mx

model_key = data_key + "/" + sm_kmeans.latest_training_job.name + '/output/model.tar.gz'
print ('Downloading the model saved at: ', sm_kmeans.model_data)

boto3.resource('s3').Bucket(bucket_name).download_file(model_key, 'sm_model.tar.gz')

os.system('tar -zxf sm_model.tar.gz && unzip model_algo-1')

Kmeans_model_params = mx.ndarray.load('model_algo-1')
sagemaker_centroids=Kmeans_model_params[0].asnumpy()
print('MNIST SM centroids shape: ', sagemaker_centroids.shape)


Downloading the model saved at:  s3://rpanchum/kmeans_sm_mnist/kmeans-2018-10-02-18-10-06-829/output/model.tar.gz
MNIST SM centroids shape:  (10, 784)


In [10]:
from sklearn.cluster import KMeans 
from sklearn.metrics.cluster import v_measure_score

sklearn_kmeans = KMeans(10)
sklearn_kmeans.cluster_centers_ = sagemaker_centroids

sm_train_assignments=sklearn_kmeans.predict(train_set[0])
sm_test_assignments=sklearn_kmeans.predict(test_set[0])

print("SM Accuracy on MNIST Train Set: ", str(v_measure_score(train_set[1], sm_train_assignments)))
print("SM Accuracy on MNIST Test Set: " , str(v_measure_score(test_set[1], sm_test_assignments)))

SM Accuracy on MNIST Train Set:  0.35307083305090553
SM Accuracy on MNIST Test Set:  0.3718264677316024
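For intuition, the `predict` call above labels each sample with the index of its closest centroid under Euclidean distance. A minimal pure-Python sketch of that assignment step follows. `assign_to_nearest` and the toy points are hypothetical names and data used only for illustration; for data the size of MNIST a vectorized NumPy or scikit-learn routine would be the practical choice.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    assignments = []
    for point in points:
        distances = [squared_distance(point, c) for c in centroids]
        # Ties resolve to the lowest centroid index.
        assignments.append(distances.index(min(distances)))
    return assignments


# Toy data (illustrative only, not taken from MNIST).
toy_centroids = [(0.0, 0.0), (10.0, 10.0)]
toy_points = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0)]
print(assign_to_nearest(toy_points, toy_centroids))
```

Comparing squared distances avoids the square root, which does not change which centroid is nearest.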


## Intel DAAL Kmeans with MNIST 

### Data Prep

In [12]:
import numpy as np

training_data_file = 'train_data_mnist.csv'
np.savetxt(training_data_file, train_set[0], delimiter=",")

data_key = 'kmeans_daal_mnist'
output_location = 's3://{}/{}'.format(bucket_name, data_key)

print ("Training artifacts will be uploaded at: " + output_location)

Training artifacts will be uploaded at: s3://rpanchum/kmeans_daal_mnist


### DAAL Training Setup

In [13]:
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=data_key)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 10,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 50 } 
                                      )

### Start DAAL Training

In [14]:
%%time
daal_kmeans.fit(data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-10-02-18-20-34-929


2018-10-02 18:20:35 Starting - Starting the training job...
Launching requested ML instances...
Preparing the instances for training......
2018-10-02 18:22:10 Downloading - Downloading input data
2018-10-02 18:22:28 Training - Downloading the training image.....
2018-10-02 18:23:16 INFO     Container setup completed, In Docker entrypoint - train...
2018-10-02 18:23:16 INFO     Default Hyperparameters loaded:
2018-10-02 18:23:16 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-10-02 18:23:16 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-10-02 18:23:16 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': u'plusPlusDense',
 'maxIterat

In [15]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-10-02-18-20-34-929
Model saved at:  s3://rpanchum/kmeans_daal_mnist/daal-kmeans-sample-2018-10-02-18-20-34-929/output/model.tar.gz


### Download DAAL Trained Model and Compute Accuracy

In [16]:
import os, boto3

model_key = data_key + "/" + daal_kmeans.latest_training_job.name + '/output/model.tar.gz'
print ('Downloading the model saved at: ', daal_kmeans.model_data)
print(model_key)
boto3.resource('s3').Bucket(bucket_name).download_file(model_key, 'daal_model.tar.gz')

os.system('tar -zxf daal_model.tar.gz')
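# The file stores a pickled pair (centroids, assignments); newer NumPy releases
# default to allow_pickle=False, so it is requested explicitly here.
daal_centroids, daal_assignments = np.load("daal-kmeans-model.npy", encoding='latin1', allow_pickle=True)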
print('MNIST DAAL centroids shape: ', daal_centroids.shape)

Downloading the model saved at:  s3://rpanchum/kmeans_daal_mnist/daal-kmeans-sample-2018-10-02-18-20-34-929/output/model.tar.gz
kmeans_daal_mnist/daal-kmeans-sample-2018-10-02-18-20-34-929/output/model.tar.gz
MNIST DAAL centroids shape:  (10, 784)
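The object key above is assembled by concatenating `data_key`, the job name, and a fixed suffix. As an alternative sketch, the bucket and key can be derived directly from the `model_data` URI using only the standard library. `split_s3_uri` is a hypothetical helper written for this notebook, not part of boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the training output above.
model_uri = ("s3://rpanchum/kmeans_daal_mnist/"
             "daal-kmeans-sample-2018-10-02-18-20-34-929/output/model.tar.gz")
bucket, key = split_s3_uri(model_uri)
print(bucket)
print(key)
```

The resulting pair could then be passed to the same `download_file` call used above, which avoids keeping `data_key` and the job name in sync by hand.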


In [17]:
from sklearn.cluster import KMeans 
from sklearn.metrics.cluster import v_measure_score

sklearn_kmeans = KMeans(10)
sklearn_kmeans.cluster_centers_ = daal_centroids

daal_train_assignments=sklearn_kmeans.predict(train_set[0])
daal_test_assignments=sklearn_kmeans.predict(test_set[0])

print("DAAL Accuracy on MNIST Train Set: ", str(v_measure_score(train_set[1], daal_train_assignments)))
print("DAAL Accuracy on MNIST Test Set: " , str(v_measure_score(test_set[1], daal_test_assignments)))

DAAL Accuracy on MNIST Train Set:  0.4937834635054327
DAAL Accuracy on MNIST Test Set:  0.5073462004406061
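The "accuracy" values printed in this notebook come from `v_measure_score`, that is, the V-measure: the harmonic mean of homogeneity and completeness, which does not depend on how cluster ids are numbered. The sketch below recomputes it from scratch on toy labels to illustrate the definition. The function names and toy data are illustrative only, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - conditional_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy labels (illustrative only).
true_labels = [0, 0, 1, 1, 2, 2]
renamed_clusters = [2, 2, 0, 0, 1, 1]   # perfect clustering, ids permuted
merged_clusters = [0, 0, 0, 0, 1, 1]    # two true classes merged into one cluster

print(v_measure(true_labels, renamed_clusters))
print(v_measure(true_labels, merged_clusters))
```

A perfect clustering scores 1 even when its ids are permuted, while merging two classes lowers homogeneity and therefore the score. This is why the metric suits k-means output, whose cluster ids carry no meaning, but it also means the percentages in the summary are not classification accuracies.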


### MNIST Summary
k=10, maxIterations=50 

Sagemaker Training Time Only: 0.37 sec  
DAAL Training Time Only: 0.55 sec  

Sagemaker MNIST Test Accuracy (V-measure): 37.2%  
DAAL MNIST Test Accuracy (V-measure): 50.7%  

## Synthetic Dataset 1: 0.5Mx20, k=20

### Data Prep

In [18]:
!wget -O mlsd2_500000_20_20.csv https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/0.5Mx20/mlsd2_500000_20_20.csv

--2018-10-02 18:24:17--  https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/0.5Mx20/mlsd2_500000_20_20.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.201.136
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.201.136|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105000000 (100M) [text/csv]
Saving to: ‘mlsd2_500000_20_20.csv’


2018-10-02 18:24:19 (86.8 MB/s) - ‘mlsd2_500000_20_20.csv’ saved [105000000/105000000]



In [19]:
%%time
import pandas as pd
import numpy as np

input_file_0_5Mx20 = "mlsd2_500000_20_20.csv"
data = pd.read_csv(input_file_0_5Mx20, header=None, dtype=np.float32)
print(data.describe())

train_set = np.array(data)
print ("Training data shape: ", train_set.shape)

                  0              1              2              3   \
count  500000.000000  500000.000000  500000.000000  500000.000000   
mean       54.440392      55.849094      57.122341      57.997566   
std        17.286438      15.576500      18.243711      18.005829   
min        24.709999      20.190001      22.700001      20.540001   
25%        36.709999      43.220001      42.349998      42.750000   
50%        54.250000      53.299999      61.349998      60.150002   
75%        71.160004      70.860001      72.519997      74.239998   
max        91.169998      89.129997      89.080002      92.459999   

                  4              5              6              7   \
count  500000.000000  500000.000000  500000.000000  500000.000000   
mean       54.450413      56.562336      55.726784      56.136658   
std        17.778864      18.171492      16.956083      17.414059   
min        24.790001      22.330000      24.530001      21.500000   
25%        35.520000      40.4300

In [20]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_sm_0.5Mx20'
data_location = 's3://{}/{}'.format(bucket_name, data_key)
output_location = 's3://{}/{}'.format(bucket_name, data_key)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_0.5Mx20


### Sagemaker Training Setup

In [21]:
%%time
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=output_location,
                    k=20,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=100,
                    data_location=data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set)
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_0.5Mx20/KMeans-2018-10-02-18-24-21-536/.amazon.manifest', 'feature_dim': 21, 'num_records': 500000, 's3_data_type': 'ManifestFile', 'channel': 'train'})
CPU times: user 13.8 s, sys: 0 ns, total: 13.8 s
Wall time: 14.9 s


### Start Sagemaker Training

In [22]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-10-02-18-24-35-967


2018-10-02 18:24:36 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training......
2018-10-02 18:26:51 Downloading - Downloading input data
2018-10-02 18:27:02 Training - Downloading the training image...
2018-10-02 18:27:26 Uploading - Uploading generated training model
2018-10-02 18:27:32 Completed - Training job completed

Docker entrypoint called with argument(s): train
[10/02/2018 18:27:23 INFO 140003487979328] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_me

In [23]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-10-02-18-24-35-967
Model saved at:  s3://rpanchum/kmeans_sm_0.5Mx20/kmeans-2018-10-02-18-24-35-967/output/model.tar.gz


## Intel DAAL Kmeans with 0.5Mx20 Synthetic Dataset 

### Data Prep

In [24]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_sm_0.5Mx20'

output_location = 's3://{}/{}'.format(bucket_name, data_key)

training_data_file = input_file_0_5Mx20
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_0.5Mx20


### DAAL Training Setup

In [25]:
%%time
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=data_key)
print ("Training data uploaded at: " + data_location)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 20,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 100 } 
                                      )

Training data uploaded at: s3://rpanchum/kmeans_sm_0.5Mx20/mlsd2_500000_20_20.csv
CPU times: user 1.01 s, sys: 248 ms, total: 1.26 s
Wall time: 5.1 s


### Start DAAL Training

In [26]:
%%time
daal_kmeans.fit(data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-10-02-18-27-54-672


2018-10-02 18:27:54 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-02 18:29:35 Downloading - Downloading input data
2018-10-02 18:29:42 Training - Downloading the training image...
Training image download completed. Training in progress..
2018-10-02 18:30:26 INFO     Container setup completed, In Docker entrypoint - train...
2018-10-02 18:30:26 INFO     Default Hyperparameters loaded:
2018-10-02 18:30:26 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-10-02 18:30:26 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-10-02 18:30:26 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,

In [27]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-10-02-18-27-54-672
Model saved at:  s3://rpanchum/kmeans_sm_0.5Mx20/daal-kmeans-sample-2018-10-02-18-27-54-672/output/model.tar.gz


### Synthetic Dataset 1 (0.5Mx20) Summary

k=20, maxIterations=100

Sagemaker Training Time Only: 0.7 sec  
DAAL Training Time Only: 0.19 sec  


## Synthetic Dataset 2: 4.8Mx38, k=20 

### Download dataset

In [28]:
!wget -O mlsd1_4898430_38_20.csv https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/4.8Mx38/mlsd1_4898430_38_20.csv

--2018-10-02 18:31:06--  https://s3-us-west-2.amazonaws.com/rpanchum/kmeans_datasets/4.8Mx38/mlsd1_4898430_38_20.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.204.40
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.204.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 435733898 (416M) [text/csv]
Saving to: ‘mlsd1_4898430_38_20.csv’


2018-10-02 18:31:11 (89.7 MB/s) - ‘mlsd1_4898430_38_20.csv’ saved [435733898/435733898]



### Data Prep

In [29]:
%%time
import pandas as pd
import numpy as np

input_file_4_8Mx38 = "mlsd1_4898430_38_20.csv"
data = pd.read_csv(input_file_4_8Mx38, header=None, dtype=np.float32)
print (data.describe())

train_set = np.array(data)
print ("Training data shape: ", train_set.shape)

                 0             1             2             3             4   \
count  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06   
mean   4.833713e+01  1.843839e+03  1.091742e+03  5.716117e-06  6.487793e-04   
std    7.223127e+02  9.414288e+05  6.450107e+05  2.390827e-03  4.284953e-02   
min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
25%    0.000000e+00  4.500000e+01  0.000000e+00  0.000000e+00  0.000000e+00   
50%    0.000000e+00  5.200000e+02  0.000000e+00  0.000000e+00  0.000000e+00   
75%    0.000000e+00  1.032000e+03  0.000000e+00  0.000000e+00  0.000000e+00   
max    5.832900e+04  1.379964e+09  1.309937e+09  1.000000e+00  3.000000e+00   

                 5             6             7             8             9   \
count  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06  4.898430e+06   
mean   7.961735e-06  1.243766e-02  3.205108e-05  1.435288e-01  8.088306e-03   
std    7.215080e-03  4.688095e-01  7.299343e-03  3.

In [30]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_sm_4.8Mx38'
data_location = 's3://{}/{}'.format(bucket_name, data_key)
output_location = 's3://{}/{}'.format(bucket_name, data_key)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_4.8Mx38


### Sagemaker Training Setup

In [31]:
%%time
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=output_location,
                    k=20,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=100,
                    data_location=data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set)
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_4.8Mx38/KMeans-2018-10-02-18-31-28-226/.amazon.manifest', 'feature_dim': 38, 'num_records': 4898430, 's3_data_type': 'ManifestFile', 'channel': 'train'})
CPU times: user 2min 28s, sys: 1.85 s, total: 2min 30s
Wall time: 2min 42s


### Start Sagemaker Training

In [32]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-10-02-18-34-10-411


2018-10-02 18:34:10 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-02 18:35:47 Downloading - Downloading input data
2018-10-02 18:35:58 Training - Downloading the training image...
2018-10-02 18:36:32 Uploading - Uploading generated training model
2018-10-02 18:36:37 Completed - Training job completed

Docker entrypoint called with argument(s): train
[10/02/2018 18:36:23 INFO 140698647557952] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metri

In [33]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-10-02-18-34-10-411
Model saved at:  s3://rpanchum/kmeans_sm_4.8Mx38/kmeans-2018-10-02-18-34-10-411/output/model.tar.gz


## Intel DAAL Kmeans with 4.8Mx38 Synthetic Dataset 

### Data Prep

In [34]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_daal_4.8Mx38'

output_location = 's3://{}/{}'.format(bucket_name, data_key)

training_data_file = input_file_4_8Mx38
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_daal_4.8Mx38


### DAAL Training Setup

In [35]:
%%time
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=data_key)
print ("Training data uploaded at: " + data_location)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 20,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 100 } 
                                      )

Training data uploaded at: s3://rpanchum/kmeans_daal_4.8Mx38/mlsd1_4898430_38_20.csv
CPU times: user 3.5 s, sys: 688 ms, total: 4.19 s
Wall time: 6.04 s


### Start DAAL Training

In [36]:
%%time
daal_kmeans.fit(data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-10-02-18-36-58-112


2018-10-02 18:36:58 Starting - Starting the training job...
Launching requested ML instances.........
Preparing the instances for training...
2018-10-02 18:39:17 Downloading - Downloading input data
2018-10-02 18:39:24 Training - Downloading the training image......
Training image download completed. Training in progress.
2018-10-02 18:40:11 INFO     Container setup completed, In Docker entrypoint - train...
2018-10-02 18:40:11 INFO     Default Hyperparameters loaded:
2018-10-02 18:40:11 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-10-02 18:40:11 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-10-02 18:40:11 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma':

In [37]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-10-02-18-36-58-112
Model saved at:  s3://rpanchum/kmeans_daal_4.8Mx38/daal-kmeans-sample-2018-10-02-18-36-58-112/output/model.tar.gz


### Synthetic Dataset 2 (4.8Mx38) Summary

k=20, maxIterations=100

Sagemaker Training Time Only: 7.67 sec  
DAAL Training Time Only: 2.93 sec  
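To put the three timing summaries so far on a common scale, the reported training times can be turned into Sagemaker-to-DAAL ratios. This is plain arithmetic on the numbers quoted in the summaries above; the dictionary layout and the helper name `time_ratio` are illustrative choices, not part of either library.

```python
# (Sagemaker seconds, DAAL seconds), copied from the summary sections above.
training_times = {
    "MNIST (50Kx784, k=10)": (0.37, 0.55),
    "Synthetic 1 (0.5Mx20, k=20)": (0.7, 0.19),
    "Synthetic 2 (4.8Mx38, k=20)": (7.67, 2.93),
}


def time_ratio(sagemaker_sec, daal_sec):
    """Ratio > 1 means DAAL trained faster; < 1 means Sagemaker trained faster."""
    return sagemaker_sec / daal_sec


ratios = {name: time_ratio(sm, daal) for name, (sm, daal) in training_times.items()}

for name, ratio in ratios.items():
    print("{}: Sagemaker/DAAL = {:.2f}x".format(name, ratio))
```

Each ratio rests on a single run per configuration, so it is best read as indicative rather than as a benchmark result.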


## Covdata Dataset: 0.5Mx53, k=7 

### Data Prep

In [38]:
!wget -O covtype.data.gz https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
!gunzip -f covtype.data.gz

--2018-10-02 18:41:10--  https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11240707 (11M) [application/x-gzip]
Saving to: ‘covtype.data.gz’


2018-10-02 18:41:12 (9.83 MB/s) - ‘covtype.data.gz’ saved [11240707/11240707]



In [39]:
%%time
import pandas as pd
import numpy as np

input_file_covdata = "covtype.data"
data = pd.read_csv(input_file_covdata, header=None, dtype=np.float32)
print (data.describe())

X = data.iloc[:,0:53]
train_set = np.array(X.values)

y = data.iloc[:,-1]
train_labels = np.array(y.values)

print ("Training data shape: ", train_set.shape)

                  0              1              2              3   \
count  581012.000000  581012.000000  581012.000000  581012.000000   
mean     2959.195312     155.657913      14.103703     269.556305   
std       279.961426     111.905106       7.488175     212.501099   
min      1859.000000       0.000000       0.000000       0.000000   
25%      2809.000000      58.000000       9.000000     108.000000   
50%      2996.000000     127.000000      13.000000     218.000000   
75%      3163.000000     260.000000      18.000000     384.000000   
max      3858.000000     360.000000      66.000000    1397.000000   

                  4              5              6              7   \
count  581012.000000  581012.000000  581012.000000  581012.000000   
mean       46.417278    2350.075928     212.147415     223.342911   
std        58.296871    1558.983276      26.765314      19.761559   
min      -173.000000       0.000000       0.000000       0.000000   
25%         7.000000    1106.0000

In [40]:
bucket_name = 'rpanchum' 
data_key = 'kmeans_sm_covdata_0.5Mx53'
data_location = 's3://{}/{}'.format(bucket_name, data_key)
output_location = 's3://{}/{}'.format(bucket_name, data_key)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_sm_covdata_0.5Mx53


### Sagemaker Training Setup

In [41]:
%%time
from sagemaker import KMeans
from sagemaker import get_execution_role

role = get_execution_role()

sm_kmeans = KMeans(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c5.18xlarge',
                    output_path=output_location,
                    k=7,
                    center_factor=1,
                    init_method='kmeans++',
                    max_iterations=100,
                    data_location=data_location)

print ('Building RecordSet format of training data as required by SM Kmeans...')
sm_kmeans_records = sm_kmeans.record_set(train_set)
print (sm_kmeans_records)

Building RecordSet format of training data as required by SM Kmeans...
(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://rpanchum/kmeans_sm_covdata_0.5Mx53/KMeans-2018-10-02-18-41-16-283/.amazon.manifest', 'feature_dim': 53, 'num_records': 581012, 's3_data_type': 'ManifestFile', 'channel': 'train'})
CPU times: user 18.9 s, sys: 328 ms, total: 19.2 s
Wall time: 21.1 s


### Start Sagemaker Training

In [42]:
%%time
sm_kmeans.fit(sm_kmeans_records)

INFO:sagemaker:Creating training-job with name: kmeans-2018-10-02-18-41-36-963


2018-10-02 18:41:37 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-02 18:43:33 Downloading - Downloading input data...
2018-10-02 18:43:40 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[10/02/2018 18:44:17 INFO 140589129791296] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000', u'half_life_time_size': u'0', u'_num_slices': u'1

In [43]:
print("Training Job Name: ", sm_kmeans.latest_training_job.name)
print("Model saved at: ", sm_kmeans.model_data)

Training Job Name:  kmeans-2018-10-02-18-41-36-963
Model saved at:  s3://rpanchum/kmeans_sm_covdata_0.5Mx53/kmeans-2018-10-02-18-41-36-963/output/model.tar.gz


### Download Sagemaker Trained Model and Compute Accuracy

In [44]:
import os, boto3
import mxnet as mx

model_key = data_key + "/" + sm_kmeans.latest_training_job.name + '/output/model.tar.gz'
print('Downloading the model saved at: ', sm_kmeans.model_data)

boto3.resource('s3').Bucket(bucket_name).download_file(model_key, 'sm_model.tar.gz')

os.system('tar -zxf sm_model.tar.gz && unzip model_algo-1')

Kmeans_model_params = mx.ndarray.load('model_algo-1')
sagemaker_centroids=Kmeans_model_params[0].asnumpy()
print('SM centroids shape: ', sagemaker_centroids.shape)


Downloading the model saved at:  s3://rpanchum/kmeans_sm_covdata_0.5Mx53/kmeans-2018-10-02-18-41-36-963/output/model.tar.gz
SM centroids shape:  (7, 53)


In [53]:
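from sklearn.cluster import KMeans  # note: shadows the sagemaker KMeans imported earlier
from sklearn.metrics.cluster import v_measure_score

# Reuse sklearn's predict() as a nearest-centroid lookup by injecting the
# Sagemaker centroids into an unfitted estimator (no sklearn training happens).
sklearn_kmeans = KMeans(7)
sklearn_kmeans.cluster_centers_ = sagemaker_centroids

sm_train_assignments = sklearn_kmeans.predict(train_set)

# The "accuracy" reported here is the V-measure between true labels and cluster assignments.
print("SM Accuracy on COVDATA Train Set: ", str(v_measure_score(train_labels, sm_train_assignments)))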


SM Accuracy on COVDATA Train Set:  0.061139986086812426
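
Setting `cluster_centers_` on an unfitted scikit-learn estimator relies on library internals and may not behave the same across scikit-learn versions. As a hedged alternative, the assignment step can be written directly in NumPy. The sketch below is not part of the original notebook; the helper name `assign_to_nearest_centroid` and the toy arrays are illustrative assumptions. It maps each row to the index of its closest centroid by Euclidean distance, which is what the `predict` call is being used for here.

```python
import numpy as np

def assign_to_nearest_centroid(points, centroids):
    """Return, for each row of `points`, the index of the closest centroid (Euclidean)."""
    points = np.asarray(points, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape (n_points, n_centroids).
    sq_dists = (
        (points ** 2).sum(axis=1)[:, None]
        - 2.0 * points @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    return sq_dists.argmin(axis=1)

# Toy data (hypothetical): two tight groups around (0, 0) and (5, 5).
toy_points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]])
toy_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_assignments = assign_to_nearest_centroid(toy_points, toy_centroids)
print(toy_assignments)
```

In the notebook, the same helper could be called as `assign_to_nearest_centroid(train_set, sagemaker_centroids)` and its result passed to `v_measure_score` together with `train_labels`.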


## Intel DAAL Kmeans with Covdata Dataset (0.5Mx53)

### Data Prep

In [46]:
import numpy as np
training_data_file = 'train_data_covdata.csv'
np.savetxt(training_data_file, train_set, delimiter=",")

bucket_name = 'rpanchum' 
data_key = 'kmeans_daal_covdata'

output_location = 's3://{}/{}'.format(bucket_name, data_key)

print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://rpanchum/kmeans_daal_covdata


### DAAL Training Setup

In [47]:
%%time
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

data_location = sess.upload_data(training_data_file, bucket=bucket_name, key_prefix=data_key)
print("Training data uploaded at: ", data_location)

daal_kmeans_image = '927702822156.dkr.ecr.us-west-2.amazonaws.com/daal-kmeans-sample:latest'

daal_kmeans = sage.estimator.Estimator(image_name=daal_kmeans_image,
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.18xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess,
                                       hyperparameters={'nClusters': 7,
                                                        'initialCentroidMethod': 'plusPlusDense',
                                                        'maxIterations': 100 } 
                                      )

Training data uploaded at:  s3://rpanchum/kmeans_daal_covdata/train_data_covdata.csv
CPU times: user 6.22 s, sys: 1.21 s, total: 7.43 s
Wall time: 5.49 s


### Start DAAL Training

In [48]:
%%time
daal_kmeans.fit(data_location)

INFO:sagemaker:Creating training-job with name: daal-kmeans-sample-2018-10-02-18-45-16-389


2018-10-02 18:45:16 Starting - Starting the training job...
Launching requested ML instances............
Preparing the instances for training...
2018-10-02 18:48:09 Downloading - Downloading input data...
2018-10-02 18:48:22 Training - Downloading the training image...
Training image download completed. Training in progress..
2018-10-02 18:49:11 INFO     Container setup completed, In Docker entrypoint - train...
2018-10-02 18:49:11 INFO     Default Hyperparameters loaded:
2018-10-02 18:49:11 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gamma': 1.0,
 'initialCentroidMethod': 'defaultDense',
 'maxIterations': 300,
 'method': 'defaultDense',
 'nClusters': 2,
 'nRounds': 5,
 'oversamplingFactor': 0.5}
2018-10-02 18:49:11 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-10-02 18:49:11 INFO     {'accuracyThreshold': 0.0001,
 'assignFlag': True,
 'distanceType': 'euclidean',
 'gam

In [49]:
print("Training Job Name: ", daal_kmeans.latest_training_job.name)
print("Model saved at: ", daal_kmeans.model_data)

Training Job Name:  daal-kmeans-sample-2018-10-02-18-45-16-389
Model saved at:  s3://rpanchum/kmeans_daal_covdata/daal-kmeans-sample-2018-10-02-18-45-16-389/output/model.tar.gz


### Download DAAL Trained Model and Compute Accuracy

In [50]:
import os, boto3

model_key = data_key + "/" + daal_kmeans.latest_training_job.name + '/output/model.tar.gz'
print('Downloading the model saved at: ', daal_kmeans.model_data)

boto3.resource('s3').Bucket(bucket_name).download_file(model_key, 'daal_model.tar.gz')

os.system('tar -zxf daal_model.tar.gz')
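# The .npy file appears to hold a pickled (centroids, assignments) pair, so
# allow_pickle=True is needed on NumPy versions where it defaults to False.
daal_centroids, daal_assignments = np.load("daal-kmeans-model.npy", encoding='latin1', allow_pickle=True)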
print(' DAAL centroids shape: ', daal_centroids.shape)

Downloading the model saved at:  s3://rpanchum/kmeans_daal_covdata/daal-kmeans-sample-2018-10-02-18-45-16-389/output/model.tar.gz
 DAAL centroids shape:  (7, 53)


In [52]:
from sklearn.cluster import KMeans 
from sklearn.metrics.cluster import v_measure_score

sklearn_kmeans = KMeans(7)
sklearn_kmeans.cluster_centers_ = daal_centroids

daal_train_assignments=sklearn_kmeans.predict(train_set)

print("DAAL Accuracy on COVDATA Train Set: ", str(v_measure_score(train_labels, daal_train_assignments)))


DAAL Accuracy on COVDATA Train Set:  0.07395242346495647
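
The score printed above is a V-measure, not a classification accuracy: it is the harmonic mean of homogeneity (each cluster contains mostly one class) and completeness (each class falls mostly in one cluster), and it does not depend on how cluster ids are numbered. The following is a minimal pure-Python sketch of that definition, assuming the default equal weighting of the two terms. The function names are illustrative and this is not scikit-learn's implementation.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())

def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_c = entropy(true_labels)
    h_k = entropy(cluster_ids)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)

# Same grouping with swapped cluster ids: a perfect score.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
# Clusters that carry no information about the classes: a score near zero.
print(v_measure([0, 0, 1, 1], [0, 1, 0, 1]))
```

Read this way, values around 0.06 to 0.07 indicate that the k=7 clusters found by either library line up only weakly with the seven labeled classes.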


### Covdata (0.5Mx53) Summary

k=7, maxIterations=100

Sagemaker Training Time Only: 1.09 sec  
DAAL Training Time Only: 0.44 sec  

Sagemaker COVDATA Train Set Accuracy (V-measure): 6.1%  
DAAL COVDATA Train Set Accuracy (V-measure): 7.4%  
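
As a quick worked check of the timing figures in this summary, the relative speedup can be computed from the two training-only times. The variable names below are illustrative; the numbers are the ones reported above.

```python
# Training-only times reported in the summary (seconds).
sagemaker_train_sec = 1.09
daal_train_sec = 0.44

# Ratio of Sagemaker time to DAAL time on the same ml.c5.18xlarge instance type.
speedup = sagemaker_train_sec / daal_train_sec
print(f"DAAL speedup over Sagemaker: {speedup:.2f}x")
```

This covers only the training step on a single run, so it says nothing about instance launch, data download, or run-to-run variance.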