# K-Means with Intel® Data Analytics Acceleration Library in Amazon SageMaker

## Introduction

Intel® Data Analytics Acceleration Library (Intel® DAAL) is a library of building blocks optimized for Intel® architecture that cover all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. One of its algorithms is K-Means.

K-Means is among the most popular and simplest clustering methods. It is intended to partition a data set into a small number of clusters such that feature vectors within a cluster have greater similarity with one another than with feature vectors from other clusters. Each cluster is characterized by a representative point, called a centroid, and a cluster radius.

In other words, clustering reduces the analysis of the entire data set to the analysis of a small number of clusters.

There are numerous ways to define the measure of similarity and centroids. For K-Means, the centroid is defined as the mean of feature vectors within the cluster.
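
The two alternating steps behind this definition, assigning each feature vector to its nearest centroid and then recomputing each centroid as the mean of its cluster, can be sketched in NumPy. This is a minimal illustration of the classic Lloyd iteration under Euclidean distance, not the Intel® DAAL implementation; the function name `lloyd_step` and the toy data are assumptions made for this example.

```python
import numpy as np

def lloyd_step(data, centroids):
    """One Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    # Squared Euclidean distances, shape (n_points, n_clusters)
    diffs = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = (diffs ** 2).sum(axis=2)
    assignments = distances.argmin(axis=1)
    new_centroids = centroids.copy()
    for k in range(centroids.shape[0]):
        members = data[assignments == k]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[k] = members.mean(axis=0)
    return assignments, new_centroids

# Two well-separated groups of 2D points (toy data)
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = data[:2].copy()  # start from the first two points
for _ in range(10):
    assignments, centroids = lloyd_step(data, centroids)
print(assignments)
print(centroids)
```

On this toy data the loop settles on assignments `[0, 0, 1, 1]` with centroids `(0, 0.5)` and `(10, 10.5)`.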

Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

Intel® DAAL documentation for K-Means: https://software.intel.com/en-us/daal-programming-guide-k-means-clustering 

## K-Means Usage with SageMaker Estimator
First, import the SageMaker package, get the execution role, and create a session.

In [1]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

Next, specify the K-Means parameters.
#### Hyperparameters
The "nClusters" and "maxIterations" hyperparameters of the K-Means algorithm are required; all others are optional.
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Type</strong></td>
        <td><strong>Default value</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>fptype</td>
        <td>str</td>
        <td>"double"</td>
        <td>The floating-point type that the algorithm uses for intermediate computations. Can be "float" or "double"</td>
    </tr>
    <tr>
        <td>nClusters</td>
        <td>int</td>
        <td>2</td>
        <td>The number of clusters</td>
    </tr>
    <tr>
        <td>initMethod</td>
        <td>str</td>
        <td>"defaultDense"</td>
        <td>Initialization method for K-Means clustering: defaultDense uses the first nClusters points as initial centroids; randomDense uses nClusters random points as initial centroids; plusPlusDense uses the K-Means++ algorithm; parallelPlusDense uses the parallel K-Means++ algorithm</td>
    </tr>
    <tr>
        <td>oversamplingFactor</td>
        <td>float</td>
        <td>0.5</td>
        <td>The fraction of nClusters sampled in each of the nRounds rounds of parallel K-Means++: L = nClusters * oversamplingFactor points are sampled per round</td>
    </tr>
    <tr>
        <td>nRounds</td>
        <td>int</td>
        <td>5</td>
        <td>The number of rounds for parallel K-Means++. (L*nRounds) must be greater than nClusters</td>
    </tr>
    <tr>
        <td>seed</td>
        <td>int</td>
        <td>777</td>
        <td>The seed for the random number generator</td>
    </tr>
    <tr>
        <td>method</td>
        <td>str</td>
        <td>"lloydDense"</td>
        <td>Computation method for K-Means clustering</td>
    </tr>
    <tr>
        <td>maxIterations</td>
        <td>int</td>
        <td>100</td>
        <td>The maximum number of iterations</td>
    </tr>
    <tr>
        <td>accuracyThreshold</td>
        <td>float</td>
        <td>0</td>
        <td>The threshold for termination of the algorithm</td>
    </tr>
    <tr>
        <td>gamma</td>
        <td>float</td>
        <td>1</td>
        <td>The weight to be used in distance calculation for binary categorical features</td>
    </tr>
    <tr>
        <td>distanceType</td>
        <td>str</td>
        <td>"euclidean"</td>
        <td>The measure of closeness between points (observations) being clustered. The only distance type supported so far is the Euclidean distance</td>
    </tr>
    <tr>
        <td>assignFlag</td>
        <td>bool</td>
        <td>True</td>
        <td>A flag that enables computation of assignments, that is, assigning cluster indices to respective observations</td>
    </tr>
</table>
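
The constraint that ties `nClusters`, `oversamplingFactor`, and `nRounds` together in the table can be checked before launching a job. The helper below is a hypothetical sketch (the name `check_parallel_plus_params` is not part of Intel® DAAL or SageMaker), and it assumes L is used exactly as computed, without rounding.

```python
def check_parallel_plus_params(n_clusters, oversampling_factor=0.5, n_rounds=5):
    """Return (L, ok): L = nClusters * oversamplingFactor, ok = L * nRounds > nClusters."""
    points_per_round = n_clusters * oversampling_factor
    return points_per_round, points_per_round * n_rounds > n_clusters

# Table defaults with nClusters=5: L = 2.5 and 2.5 * 5 = 12.5 > 5
print(check_parallel_plus_params(5))
# Only two rounds: 2.5 * 2 = 5.0 is not greater than 5
print(check_parallel_plus_params(5, 0.5, 2))
```

With the default `oversamplingFactor` and `nRounds`, the condition holds for any positive `nClusters`, since 0.5 * 5 = 2.5 > 1.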

Example of hyperparameters dictionary:

In [3]:
kmeans_params = {
    "fptype": "float",
    "nClusters": 5,
    "initMethod": "plusPlusDense",
    "maxIterations": 1000
}

Then create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>algorithm_arn</td>
        <td>The ARN of the algorithm to use for training. It is listed on the algorithm page in your subscriptions</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Should be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the available types on the algorithm's Amazon Marketplace page</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
Full SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html

In [4]:
daal_kmeans_arn = "<algorithm-arn>"  # listed on the algorithm page in your subscriptions

daal_kmeans = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=daal_kmeans_arn,
    role=role,
    base_job_name="<base-job-name>",
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode="File",
    output_path="s3://<bucket-name>/<output-path>",
    sagemaker_session=sess,
    hyperparameters=kmeans_params
)

### Training stage
In the training stage, the K-Means algorithm consumes input data from an S3 location and computes centroids.
This container supports only .csv ("comma-separated values") files.
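
If your data lives in a NumPy array, it has to be written out as CSV before training. The sketch below is one way to do that; it assumes the container expects purely numeric rows with no header and no index column, which this document does not state. The helper name `write_training_csv` and the random `features` array are made up for illustration.

```python
import os
import tempfile

import numpy as np

def write_training_csv(data, path):
    """Write a 2D numeric array as comma-separated rows, one observation per line."""
    # Assumption: no header row and no index column
    np.savetxt(path, data, delimiter=",", fmt="%.6f")
    return path

features = np.random.random(size=(100, 10))
csv_path = write_training_csv(features, os.path.join(tempfile.mkdtemp(), "train.csv"))

# Uploading needs AWS credentials, so it is left commented out here.
# sess.upload_data(csv_path, bucket="<bucket-name>", key_prefix="<training-data-path>")
```

`Session.upload_data` returns the S3 URI of the uploaded file, which can then be passed to `fit`.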

In [5]:
daal_kmeans.fit({"training": "s3://<bucket-name>/<training-data-path>"})

INFO:sagemaker:Creating training-job with name: daal-kmeans-test-2019-02-15-15-47-15-619


2019-02-15 15:47:15 Starting - Starting the training job...
2019-02-15 15:47:17 Starting - Launching requested ML instances......
2019-02-15 15:48:17 Starting - Preparing the instances for training...
2019-02-15 15:49:09 Downloading - Downloading input data
2019-02-15 15:49:09 Training - Downloading the training image......
2019-02-15 15:50:14 Uploading - Uploading generated training model
2019-02-15 15:50:14 Completed - Training job completed

2019-02-15 15:50:04 INFO     Training stage started
2019-02-15 15:50:04 INFO     Final Hyperparameters:
2019-02-15 15:50:04 INFO     {'fptype': 'float', 'initMethod': 'plusPlusDense', 'seed': '777', 'oversamplingFactor': '0.5', 'nRounds': '5', 'kmeansMethod': 'lloydDense', 'accuracyThreshold': '0', 'gamma': '1', 'distanceType': 'euclidean', 'assignFlag': True, 'maxIterations': '1000', 'method': 'lloydDense', 'assignFlag ': 'True', 'nClusters': '5'}
2019-02-15 15:50:04 INFO     Train data shape: (30000, 10)

### Real-time prediction
In the prediction stage, the K-Means algorithm determines assignments for input data using the previously computed centroids.
First, deploy a SageMaker endpoint that consumes the data.

In [6]:
predictor = daal_kmeans.deploy(1, "ml.m4.xlarge", serializer=sagemaker.predictor.csv_serializer)

INFO:sagemaker:Creating model package with name: daal-kmeans-new-2019-02-15-15-56-26-951


..........

INFO:sagemaker:Creating model with name: daal-kmeans-new-2019-02-15-15-56-26-951-2019-02-15-15-57-12-443





INFO:sagemaker:Creating endpoint with name daal-kmeans-test-2019-02-15-15-47-15-619


------------------------------------------------------------------!

Next, pass the data to the predictor instance as a NumPy array to get the assignments.

This example passes random data, but you can use any 2D NumPy array with the same number of features as the training data.

In [8]:
import numpy as np

predict_data = np.random.random(size=(10,10))
print(predictor.predict(predict_data).decode("utf-8"))

1
0
2
3
3
4
1
2
3
0
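
The decoded response is plain text with one cluster index per line, so it usually needs to be parsed before further use. The sketch below assumes the indices are always integers, as in the output shown above; the helper name `parse_assignments` and the hard-coded `sample` string are illustrative stand-ins for `predictor.predict(predict_data).decode("utf-8")`.

```python
from collections import Counter

def parse_assignments(response_text):
    """Turn newline-separated cluster indices into a list of ints."""
    return [int(token) for token in response_text.split()]

# Sample text in the same shape as the endpoint output shown above
sample = "1\n0\n2\n3\n3\n4\n1\n2\n3\n0\n"
assignments = parse_assignments(sample)
cluster_sizes = Counter(assignments)  # number of observations per cluster
print(assignments)
print(sorted(cluster_sizes.items()))
```

For the sample above, cluster 3 receives three observations, cluster 4 one, and the remaining clusters two each.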


Don't forget to delete the endpoint when you no longer need it.

In [9]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: daal-kmeans-test-2019-02-15-15-47-15-619


### Batch transform job
If you don't need real-time prediction, you can use a batch transform job. It uses the saved model with centroids, computes the assignments once, and saves them to the specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [10]:
transformer = daal_kmeans.transformer(1, 'ml.m4.xlarge')
transformer.transform("s3://<bucket-name>/<training-data-path>", content_type='text/csv')
transformer.wait()
print(transformer.output_path)

INFO:sagemaker:Creating model package with name: daal-kmeans-new-2019-02-15-16-06-34-857


..........

INFO:sagemaker:Creating model with name: daal-kmeans-new-2019-02-15-16-06-34-857-2019-02-15-16-07-20-327





INFO:sagemaker:Creating transform job with name: daal-kmeans-test-2019-02-15-16-07-20-877


........................................!
s3://sagemaker-us-east-2-123123123123/daal-kmeans-test-2019-02-15-16-07-20-877
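
Batch transform stores the result for each input object under the output path, using the input file name with `.out` appended. The sketch below builds that URI for a single input file; the helper name `transform_output_uri` and the bucket and file names are hypothetical, and it assumes the input URI points to one file rather than a prefix.

```python
import posixpath

def transform_output_uri(output_path, input_uri):
    """Build the S3 URI of the result file for a single input object."""
    # Batch transform appends ".out" to the input file name
    return posixpath.join(output_path, posixpath.basename(input_uri) + ".out")

# Hypothetical bucket and file names, for illustration only
result_uri = transform_output_uri(
    "s3://example-bucket/daal-kmeans-transform",
    "s3://example-bucket/data/train.csv",
)
print(result_uri)
```

In the notebook, `transformer.output_path` would take the place of the first argument.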
