# Principal Component Analysis with Intel® Data Analytics Acceleration Library in Amazon SageMaker

## Introduction

Intel® Data Analytics Acceleration Library (Intel® DAAL) is a library of building blocks optimized for Intel® architecture. It covers all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. One of its algorithms is PCA.

Principal Component Analysis (PCA) is a method for exploratory data analysis. PCA transforms a set of observations of possibly correlated variables into a new set of uncorrelated variables, called principal components. Principal components are the directions of largest variance, that is, the directions along which the data is most spread out.

Because all principal components are orthogonal to each other, they carry no redundant information. PCA can therefore replace a group of variables with a smaller set of new variables, which makes it a powerful technique for dimensionality reduction.
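The idea can be illustrated outside of Intel® DAAL with a minimal NumPy sketch. It is not the library's implementation: it assumes synthetic data and a plain eigendecomposition of the covariance matrix, and only shows that the projected variables come out uncorrelated and ordered by variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated variables plus an independent one.
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=500), rng.normal(size=500)])

# Center the data and eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA orders components by descending variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the principal components.
scores = centered @ eigenvectors

# The covariance of the projected data is diagonal, so the new variables are uncorrelated.
scores_cov = np.cov(scores, rowvar=False)

# Share of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 3))
```

Keeping only the first few columns of `scores` is what reduces the dimension: the dropped columns are the ones that carry the least variance.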

Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

Intel® DAAL documentation for PCA: https://software.intel.com/en-us/daal-programming-guide-principal-component-analysis

## PCA Usage with SageMaker Estimator
First, import the SageMaker package, get the execution role, and create a session.

In [13]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

Second, specify the parameters of PCA.
#### Hyperparameters
All hyperparameters of the PCA algorithm are optional.
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Type</strong></td>
        <td><strong>Default value</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>fptype</td>
        <td>str</td>
        <td>"double"</td>
        <td>The floating-point type that the algorithm uses for intermediate computations. Can be "float" or "double"</td>
    </tr>
    <tr>
        <td>method</td>
        <td>str</td>
        <td>"correlationDense"</td>
        <td>Available methods for PCA computation: "correlationDense" ("defaultDense") or "svdDense"</td>
    </tr>
    <tr>
        <td>resultsToCompute</td>
        <td>str</td>
        <td>"none"</td>
        <td>Characteristics to compute: mean, variance, eigenvalue. Provide a single value to request one characteristic, or join several values with "|" to request a combination. For example, the combination of all three is "mean|variance|eigenvalue"</td>
    </tr>
    <tr>
        <td>nComponents</td>
        <td>int</td>
        <td>0</td>
        <td>Number of principal components.<br/> If it is zero, the algorithm computes the result for as many principal components as there are features.<br/>The number of components must be less than or equal to the number of features</td>
    </tr>
    <tr>
        <td>isDeterministic</td>
        <td>bool</td>
        <td>False</td>
        <td>If True, the algorithm applies the "sign flip" technique to the results</td>
    </tr>
    <tr>
        <td>transformOnTrain</td>
        <td>bool</td>
        <td>False</td>
        <td>If True, the training data is transformed during the training stage and saved in the model package</td>
    </tr>
</table>
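The reason for a "sign flip" option is that an eigenvector is only defined up to its sign, so two runs can return components that differ by a factor of -1. The sketch below shows one common convention for removing that ambiguity (make the largest-magnitude entry of each component positive). This is an assumption for illustration; the exact rule that Intel® DAAL applies is not described here, and `sign_flip` is a hypothetical helper.

```python
import numpy as np

def sign_flip(components):
    """Flip each row so that its largest-magnitude entry is positive (illustrative convention)."""
    components = np.asarray(components, dtype=float)
    rows = np.arange(components.shape[0])
    largest = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[rows, largest])
    signs[signs == 0] = 1.0  # leave all-zero rows unchanged
    return components * signs[:, np.newaxis]

# The same two components with opposite signs, as two different runs might return them.
v = np.array([[0.6, -0.8], [-0.8, -0.6]])
flipped_a = sign_flip(v)
flipped_b = sign_flip(-v)
print(np.array_equal(flipped_a, flipped_b))
```

After the flip, both inputs map to the same matrix, which is what makes the result reproducible.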

Example of hyperparameters dictionary:

In [14]:
pca_params = {
    "fptype": "float",
    "method": "svdDense",
    "resultsToCompute": "mean|eigenvalue",
    "nComponents": 4,
    "isDeterministic": True
}
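Before starting a training job, it can help to check a dictionary like this against the table above. The helper below is a hypothetical sketch, not part of SageMaker or Intel® DAAL; it only encodes the allowed values listed in the table, and `n_features` is assumed to be known from your training data.

```python
ALLOWED_FPTYPES = {"float", "double"}
ALLOWED_METHODS = {"correlationDense", "defaultDense", "svdDense"}
ALLOWED_RESULTS = {"mean", "variance", "eigenvalue"}

def check_pca_params(params, n_features):
    """Return a list of problems found in a hyperparameter dictionary (hypothetical helper)."""
    problems = []
    if params.get("fptype", "double") not in ALLOWED_FPTYPES:
        problems.append("fptype must be 'float' or 'double'")
    if params.get("method", "correlationDense") not in ALLOWED_METHODS:
        problems.append("unknown method")
    results = params.get("resultsToCompute", "none")
    if results != "none":
        # Values are combined with "|", for example "mean|eigenvalue".
        unknown = set(results.split("|")) - ALLOWED_RESULTS
        if unknown:
            problems.append("unknown resultsToCompute values: " + ", ".join(sorted(unknown)))
    n_components = params.get("nComponents", 0)
    if not 0 <= n_components <= n_features:
        problems.append("nComponents must be between 0 and the number of features")
    return problems

example_params = {
    "fptype": "float",
    "method": "svdDense",
    "resultsToCompute": "mean|eigenvalue",
    "nComponents": 4,
    "isDeterministic": True,
}
print(check_pca_params(example_params, n_features=10))
```

An empty list means that no problem was found.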

Then, create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>algorithm_arn</td>
        <td>The ARN of the algorithm to use for training. You can find it on the algorithm page in your subscriptions</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Must be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the algorithm's Amazon Marketplace page for the available types</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
Full SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html

In [9]:
daal_pca_arn = "<algorithm-arn>"  # find the ARN on the algorithm page in your subscriptions

daal_pca = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=daal_pca_arn,
    role=role,
    base_job_name="<base-job-name>",
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode="File",
    output_path="s3://<bucket-name>/<output-path>",
    sagemaker_session=sess,
    hyperparameters=pca_params
)

### Training stage
During the training stage, the PCA algorithm consumes input data from an S3 location and computes the eigenvectors and any other results requested in the "resultsToCompute" parameter.
This container supports only .csv ("comma-separated values") files.
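If your data lives in a NumPy array, it can be written as comma-separated values with `numpy.savetxt`. The sketch below is only an illustration: it uses random data, writes to an in-memory buffer instead of a file, and assumes a header-less file with one observation per row, which this section does not confirm.

```python
import io

import numpy as np

# Random stand-in for real training data: 100 observations, 10 features.
training_data = np.random.default_rng(0).random(size=(100, 10))

# Write the array as comma-separated values, one observation per row, no header (assumption).
buffer = io.StringIO()
np.savetxt(buffer, training_data, delimiter=",")
csv_text = buffer.getvalue()

# Read the text back to confirm that shape and values survive the round trip.
restored = np.loadtxt(io.StringIO(csv_text), delimiter=",")
print(restored.shape)
```

To use the result for training, write it to a local file instead of a buffer and copy it to S3, for example with `upload_data` on the session object created earlier.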

In [10]:
daal_pca.fit({"training": "s3://<bucket-name>/<training-data-path>"})

INFO:sagemaker:Creating training-job with name: daal-pca-alg-test-2018-11-30-06-50-06-484


2018-11-30 06:50:06 Starting - Starting the training job...
2018-11-30 06:50:07 Starting - Launching requested ML instances......
2018-11-30 06:51:15 Starting - Preparing the instances for training...
2018-11-30 06:52:07 Downloading - Downloading input data...
2018-11-30 06:52:30 Training - Downloading the training image...
2018-11-30 06:53:01 Uploading - Uploading generated training model
2018-11-30 06:53:01 Completed - Training job completed

2018-11-30 06:52:49 INFO     Training stage started
2018-11-30 06:52:49 INFO     Default Paramaters:
2018-11-30 06:52:49 INFO     {'fptype': 'double', 'method': 'correlationDense', 'resultsToCompute': '', 'nComponents': '0', 'isDeterministic': 'False', 'transformOnTrain': 'False'}
2018-11-30 06:52:49 INFO     Updated with user hyperparameters, uncorrect parameters changed or deleted
2018-11-30 06:52:49 INFO     Final Hyperparameters:
2018-11-30 06:52:49 INFO     {'fptype': 'float', 'method': 'def

### Real-time prediction
During the prediction stage, the PCA algorithm transforms input data using the previously computed results.
First, deploy a SageMaker endpoint that consumes the data.

In [6]:
predictor = daal_pca.deploy(1, "ml.m4.xlarge", serializer=sagemaker.predictor.csv_serializer)

INFO:sagemaker:Creating model package with name: intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-29-15-37-47-871


..........

INFO:sagemaker:Creating model with name: intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-29-15-38-33-468





INFO:sagemaker:Creating endpoint with name intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-29-15-31-41-921


-------------------------------------------------------------!

Second, pass the data to the predictor instance as a NumPy array. The transformed data is returned as space-separated values.

This example passes random data, but you can use any 2D NumPy array, with one condition specific to PCA: the training data and the data to transform must have the same number of features.

In [7]:
import numpy as np

predict_data = np.random.random(size=(10,10))
print(predictor.predict(predict_data).decode("utf-8"))

1.185592651367187500e+00 2.620933353900909424e-01 3.085311949253082275e-01
6.714667081832885742e-01 8.762556910514831543e-01 -8.037568628787994385e-02
1.111641526222229004e+00 3.375906124711036682e-02 -1.244278624653816223e-01
1.294844508171081543e+00 3.067855909466743469e-02 -9.089314937591552734e-02
8.781186342239379883e-01 2.732140570878982544e-02 -5.232766270637512207e-01
5.412490963935852051e-01 3.470270335674285889e-01 -8.399704098701477051e-02
1.048457860946655273e+00 -1.020107120275497437e-01 6.852779537439346313e-02
8.287011384963989258e-01 2.592234015464782715e-01 -1.914716511964797974e-01
1.426655769348144531e+00 1.917291432619094849e-02 -1.039648428559303284e-01
1.183746933937072754e+00 -8.203709125518798828e-02 3.905816972255706787e-01
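The space-separated text can be turned back into a NumPy array for further processing. The sketch below assumes the response has the layout shown above (one row per observation, values separated by spaces) and uses the first two rows of that output as a stand-in for a real response.

```python
import io

import numpy as np

# Stand-in for predictor.predict(...): the first two rows of the output shown above.
response = (
    b"1.185592651367187500e+00 2.620933353900909424e-01 3.085311949253082275e-01\n"
    b"6.714667081832885742e-01 8.762556910514831543e-01 -8.037568628787994385e-02\n"
)

# Whitespace is the default delimiter of loadtxt; ndmin=2 keeps a single row two-dimensional.
transformed = np.loadtxt(io.StringIO(response.decode("utf-8")), ndmin=2)
print(transformed.shape)
```

Each row of `transformed` corresponds to one input observation, and each column to one principal component.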



Don't forget to delete the endpoint when you no longer need it.

In [8]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-29-15-31-41-921


### Batch transform job
If you don't need real-time prediction, you can use a batch transform job. It uses the saved model, computes the transformed data once, and saves it to the specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [12]:
transformer = daal_pca.transformer(1, 'ml.m4.xlarge')
transformer.transform("s3://<bucket-name>/<prediction-data-path>", content_type='text/csv')
transformer.wait()
print(transformer.output_path)

INFO:sagemaker:Creating model package with name: intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-30-07-42-48-772


..........

INFO:sagemaker:Creating model with name: intel-daal-pca1542385402-d0d25e75ca6ef4-2018-11-30-07-43-34-517





INFO:sagemaker:Creating transform job with name: daal-pca-alg-test-2018-11-30-07-43-35-199


......................................!
s3://<output-path>
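To locate the result of a transform job, it helps to know that SageMaker batch transform names each output object after its input object, with `.out` appended. The helper below is a hypothetical sketch of that naming rule for a single input file; the bucket and key names are placeholders, and inputs given as a prefix with nested folders are not handled.

```python
import posixpath

def expected_output_uri(input_uri, output_path):
    """Build the S3 URI where the result for one input object is expected (hypothetical helper)."""
    file_name = posixpath.basename(input_uri)
    return output_path.rstrip("/") + "/" + file_name + ".out"

# Placeholder names, for illustration only.
result_uri = expected_output_uri(
    "s3://example-bucket/prediction/data.csv",
    "s3://example-bucket/transform-output/",
)
print(result_uri)
```

In the notebook, `transformer.output_path` would take the place of the second argument.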
