# k-Nearest Neighbors (kNN) Classifier with Intel® Data Analytics Acceleration Library in Amazon SageMaker

## Introduction

Intel® Data Analytics Acceleration Library (Intel® DAAL) is a library of building blocks optimized for Intel® architecture that covers all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. One of its algorithms is kNN.

k-Nearest Neighbors (kNN) classification is a non-parametric classification algorithm. The kNN model is built from the feature vectors and class labels of the training data set. The classifier infers the class of a query vector from the labels of the training feature vectors that are most similar to it. Similarity between feature vectors is determined by a distance measure (for example, Euclidean) in the multidimensional feature space.
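
To make the idea concrete, here is a minimal brute-force sketch of kNN classification in plain Python. It is not how Intel® DAAL implements the algorithm (the library uses a K-D tree, as noted in the hyperparameter table); the function name `knn_predict` and the toy data are hypothetical and only illustrate the distance-then-vote logic described above.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=1):
    """Brute-force kNN: majority vote among the k training points closest to `query`."""
    # Euclidean distance from the query to every training vector
    distances = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote; on a tie, the label seen first among the neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (illustrative values only): two clusters, two classes
train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = [0, 0, 1, 1]

print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))  # prints 0
```

With `k=3`, two of the three nearest neighbors of the query carry label 0, so the vote returns 0.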

Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

Intel® DAAL documentation for kNN: https://software.intel.com/en-us/daal-programming-guide-k-nearest-neighbors-knn-classifier

## kNN Usage with SageMaker Estimator
First, import the SageMaker package, get the execution role, and create a session.

In [1]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

Next, specify the kNN parameters.
#### Hyperparameters
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Type</strong></td>
        <td><strong>Default value</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>nClasses</td>
        <td>int</td>
        <td>2</td>
        <td>Number of classes in data</td>
    </tr>
    <tr>
        <td>fptype</td>
        <td>str</td>
        <td>"double"</td>
        <td>The floating-point type that the algorithm uses for intermediate computations. Can be "float" or "double"</td>
    </tr>
    <tr>
        <td>method</td>
        <td>str</td>
        <td>"defaultDense"</td>
        <td>The computation method used by the K-D tree based kNN classification. The only training method supported so far is the default dense method.</td>
    </tr>
    <tr>
        <td>k</td>
        <td>int</td>
        <td>1</td>
        <td>The number of neighbors</td>
    </tr>
    <tr>
        <td>dataUseInModel</td>
        <td>str</td>
        <td>"doNotUse"</td>
        <td>A parameter to enable/disable use of the input data set in the kNN model. Possible values:<br/>"doNotUse" - the algorithm does not include the input data and labels in the trained kNN model but creates a copy of the input data set<br/>"doUse" - the algorithm includes the input data and labels in the trained kNN model</td>
    </tr>
    <tr>
        <td>seed</td>
        <td>int</td>
        <td>777</td>
        <td>Seed for random number generator engine that is used internally to perform sampling needed to choose dimensions and cut-points for the K-D tree.</td>
    </tr>
</table>

Example hyperparameter dictionary:

In [3]:
knn_params = {
    "nClasses":2,
    "fptype":"double",
    "method":"defaultDense",
    "dataUseInModel":"doNotUse",
    "seed": 777,
    "k":1
}
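
Because a mistyped hyperparameter only surfaces after a training job has started, it can help to check the dictionary locally first. The sketch below is a hypothetical helper (`check_knn_params` is not part of SageMaker or Intel® DAAL); it encodes only the names, types, and allowed string values listed in the table above and makes no assumption about valid numeric ranges.

```python
# Allowed string values, taken from the hyperparameter table above
ALLOWED_VALUES = {
    "fptype": {"float", "double"},
    "method": {"defaultDense"},
    "dataUseInModel": {"doNotUse", "doUse"},
}
# Parameters the table lists with type int
INT_PARAMS = ("nClasses", "k", "seed")


def check_knn_params(params):
    """Return a list of problems found in a hyperparameter dict (hypothetical helper)."""
    problems = []
    for name, value in params.items():
        if name in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[name]:
                problems.append(f"{name}: unexpected value {value!r}")
        elif name in INT_PARAMS:
            # bool is a subclass of int, so exclude it explicitly
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{name}: expected int, got {type(value).__name__}")
        else:
            problems.append(f"{name}: unknown parameter")
    return problems


# An empty list means no problems were found
print(check_knn_params({"nClasses": 2, "fptype": "single", "k": 1}))
```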

Then create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>algorithm_arn</td>
        <td>The ARN of the algorithm to use for training. You can find it on the algorithm page in your subscriptions</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Must be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the available types on the algorithm's AWS Marketplace page</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
Full SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html

In [6]:
daal_knn_arn = "<algorithm-arn>" # you can find it on algorithm page in your subscriptions

daal_knn = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=daal_knn_arn,
    role=role,
    base_job_name="<base-job-name>",
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode="File",
    output_path="s3://<bucket-name>/<output-path>",
    sagemaker_session=sess,
    hyperparameters=knn_params
)

### Training stage
In the training stage, the kNN algorithm consumes input data from an S3 location.
This container supports only .csv (comma-separated values) files.

In [7]:
daal_knn.fit({"training": "s3://<bucket-name>/<training-data-path>"})

INFO:sagemaker:Creating training-job with name: daal-knn-sm-2019-02-24-15-31-09-721


2019-02-24 15:31:09 Starting - Starting the training job...
2019-02-24 15:31:11 Starting - Launching requested ML instances......
2019-02-24 15:32:12 Starting - Preparing the instances for training...
2019-02-24 15:33:02 Downloading - Downloading input data
2019-02-24 15:33:02 Training - Downloading the training image......
2019-02-24 15:34:09 Uploading - Uploading generated training model
2019-02-24 15:34:04 INFO     Container setup completed, In Docker entrypoint - train...
2019-02-24 15:34:04 INFO     Default Hyperparameters loaded:
2019-02-24 15:34:04 INFO
{'dataUseInModel': 'doNotUse',
 'fptype': 'double',
 'k': '1',
 'method': 'defaultDense',
 'nClasses': '2'}
2019-02-24 15:34:04 INFO     Updated with user hyperparameters, Final Hyperparameters:
2019-02-24 15:34:04 INFO
{'dataUseInModel': 'doNotUse',
 'fptype': 'double',
 'k': '1',
 'method': 'defaultDense',
 'nClasses': '2',
 'seed': '777'}

### Real-time prediction
First, deploy a SageMaker endpoint that accepts input data.

In [8]:
predictor = daal_knn.deploy(1, "ml.m4.xlarge", serializer=sagemaker.predictor.csv_serializer)

INFO:sagemaker:Creating model package with name: daal-knn-2019-02-24-15-35-14-387


..........

INFO:sagemaker:Creating model with name: daal-knn-2019-02-24-15-35-14-387-2019-02-24-15-35-59-854





INFO:sagemaker:Creating endpoint with name daal-knn-sm-2019-02-24-15-31-09-721


--------------------------------------------------------------------------!

Next, pass data to the endpoint as a NumPy array to get predictions.

This example passes random data, but you can use any 2D NumPy array.

In [11]:
import numpy as np

predict_data = np.random.random(size=(10,5))
print(predictor.predict(predict_data).decode("utf-8"))

1
1
2
4
4
3
3
1
2
4
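
The endpoint returns the predictions as text, so a small parsing step is usually needed before using them in code. The sketch below assumes the response body is one label per line, as in the output printed above; `parse_predictions` is a hypothetical helper, not part of the SageMaker SDK.

```python
def parse_predictions(raw):
    """Turn a newline-separated response body into a list of integer labels."""
    # Accept either the raw bytes from predictor.predict() or already-decoded text
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Skip blank lines; int(float(...)) also tolerates labels written as "1.0"
    return [int(float(line)) for line in text.splitlines() if line.strip()]


print(parse_predictions(b"1\n1\n2\n4\n"))  # prints [1, 1, 2, 4]
```

In the notebook, this would be called as `parse_predictions(predictor.predict(predict_data))`.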



Don't forget to delete the endpoint when you no longer need it.

In [12]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: daal-knn-sm-2019-02-24-15-31-09-721


### Batch transform job
If you don't need real-time predictions, you can use a batch transform job. It uses the saved model, computes predictions once, and saves them to a specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [14]:
transformer = daal_knn.transformer(1, 'ml.m4.xlarge')
transformer.transform("s3://<bucket-name>/<prediction-data-path>", content_type='text/csv')
transformer.wait()
print(transformer.output_path)

INFO:sagemaker:Creating model package with name: daal-knn-2019-02-24-15-53-00-053


..........

INFO:sagemaker:Creating model with name: daal-knn-2019-02-24-15-53-00-053-2019-02-24-15-53-45-504





INFO:sagemaker:Creating transform job with name: daal-knn-sm-2019-02-24-15-53-45-783


........................................!
s3://sagemaker-us-east-2-123123123123/daal-knn-sm-2019-02-24-15-53-45-783
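
To fetch the results, most S3 tooling needs the bucket name and key prefix separately rather than the full `s3://` URI printed above. The following sketch splits the URI using only the standard library; `split_s3_uri` is a hypothetical helper, and how you then list or download the objects (for example, with an S3 client of your choice) is left open.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the prefix is the path without its leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# Example with the output path printed by the transform job above
bucket, prefix = split_s3_uri(
    "s3://sagemaker-us-east-2-123123123123/daal-knn-sm-2019-02-24-15-53-45-783"
)
print(bucket)
print(prefix)
```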
