# Logistic Regression with Intel® Data Analytics Acceleration Library in Amazon SageMaker

# Introduction

Intel® Data Analytics Acceleration Library (Intel® DAAL) is a library of building blocks optimized for Intel® architecture that cover all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. One of its classification algorithms is Logistic Regression.

Logistic Regression is a method for modeling the relationship between one or more explanatory variables and a categorical variable by expressing the posterior statistical distribution of the categorical variable via linear functions on observed data. If the categorical variable is binary, that is, it takes only two values, "0" and "1", the Logistic Regression is simple; otherwise, it is multinomial.
The DAAL Logistic Regression algorithm supports L1 and L2 regularization.
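
To make the "linear functions on observed data" concrete, here is a minimal NumPy sketch of how class probabilities can be derived from linear scores in the binary and multinomial cases. It is an illustration only and not DAAL's implementation; the names `sigmoid`, `softmax`, `theta`, and `theta0` and all numeric values are assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    """Binary case: probability of class 1 for a linear score z."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Multinomial case: row-wise class probabilities from linear scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical data: 2 samples, 4 features, 3 classes (made-up values).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [6.7, 3.0, 5.2, 2.3]])
theta = np.zeros((4, 3))  # one weight column per class (illustrative values)
theta0 = np.zeros(3)      # intercept terms, one per class

probabilities = softmax(X @ theta + theta0)  # shape: (n_samples, n_classes)
labels = probabilities.argmax(axis=1)        # most probable class per sample
```

Each row of `probabilities` sums to 1, and the predicted label is simply the index of the largest probability in that row.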


Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

Intel® DAAL documentation for Logistic Regression: https://software.intel.com/en-us/daal-programming-guide-logistic-regression

* [Hyperparameters description](#1-bullet)
* [Usage of the algorithm](#2-bullet)
  * [Upload the data for training](#3-bullet)
  * [Creating Training Job using Algorithm ARN](#4-bullet)
  * [Live Inference Endpoint for Prediction stage](#5-bullet)
  * [Batch transform job](#6-bullet)

# Hyperparameters description: <a class="anchor" id="1-bullet"></a>

<table align="left">
    <tr>
        <th>Required parameters</th>
        <th>Type</th>
        <th>Default value</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>nClasses</td>
        <td>integer</td>
        <td>None</td>
        <td>Number of classes in training dataset</td>
    </tr>
</table> 



<table align="left">
    <tr>
        <th>Optional parameters</th>
        <th>Type</th>
        <th>Default value</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>penaltyL1</td>
        <td>float</td>
        <td>0</td>
        <td>Penalty coefficient for L1 regularization</td>
    </tr>
    <tr>
        <td>penaltyL2</td>
        <td>float</td>
        <td>0</td>
        <td>Penalty coefficient for L2 regularization</td>
    </tr>
    <tr>
        <td>interceptFlag</td>
        <td>bool</td>
        <td>False</td>
        <td>Flag that indicates whether to compute the intercept terms θ0j</td>
    </tr>
    <tr>
        <td>solverName</td>
        <td>str</td>
        <td>'sgd'</td>
        <td>Name of the solver used for the training stage<br>Available values: 'lbfgs', 'adagrad', 'saga', 'sgd'</td>
    </tr>
    <tr>
        <td>solverMethod</td>
        <td>str</td>
        <td>'defaultDense'</td>
        <td>Method of the solver. Available values for 'sgd':<br>
               'momentum', 'minibatch', 'defaultDense'<br>
                Available value for other solvers: 'defaultDense'</td>
    </tr>
    <tr>
        <td>solverMaxIterations</td>
        <td>integer</td>
        <td>100</td>
        <td>Max number of iterations for training stage</td>
    </tr>
    <tr>
        <td>solverAccuracyThreshold</td>
        <td>float</td>
        <td>1.0e-4</td>
        <td>Accuracy of the algorithm. The algorithm terminates when <br>
                                   this accuracy is achieved.</td>
    </tr>
    <tr>
        <td>solverBatchSize</td>
        <td>integer</td>
        <td>number of rows<br>in training dataset</td>
        <td>Number of batch indices to compute the stochastic gradient.</td>
    </tr>
    <tr>
        <td>solverLearningRate</td>
        <td>float</td>
        <td>1.0e-3</td>
        <td>Learning rate for the optimization problem<br>applicable for 'sgd', 'adagrad', 'saga' only</td>
    </tr>
    <tr>
        <td>solverStepLength</td>
        <td>float</td>
        <td>1.0e-3</td>
        <td>Step size for the optimization problem<br>applicable for 'lbfgs' only</td>
    </tr>
    <tr>
        <td>solverCorrectionPairBatchSize</td>
        <td>integer</td>
        <td>number of rows<br>in training dataset</td>
        <td>Number of batch indices to compute the Hessian approximation.<br>applicable for 'lbfgs' only</td>
    </tr>
    <tr>
        <td>solverL</td>
        <td>integer</td>
        <td>1</td>
        <td>The number of iterations between<br>calculations of the curvature estimates<br>
            applicable for 'lbfgs' only
        </td>
    </tr>
</table> 


For more details, see the Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

All parameters that start with 'solver' are listed in the DAAL documentation without the 'solver' prefix.
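
The constraints in the table above can be checked before a training job is submitted. The following is a minimal sketch of such a check. The function `validate_hyperparameters` and the constants are hypothetical helpers written for this example, not part of DAAL or the SageMaker SDK, and the lower bound of 2 for `nClasses` is an assumption.

```python
ALLOWED_SOLVERS = {"lbfgs", "adagrad", "saga", "sgd"}
SGD_METHODS = {"momentum", "minibatch", "defaultDense"}
LBFGS_ONLY = {"solverStepLength", "solverCorrectionPairBatchSize", "solverL"}


def validate_hyperparameters(hp):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # nClasses is the only required parameter (the >= 2 bound is an assumption).
    n_classes = hp.get("nClasses")
    if not isinstance(n_classes, int) or n_classes < 2:
        problems.append("nClasses is required and must be an integer >= 2")

    solver = hp.get("solverName", "sgd")
    if solver not in ALLOWED_SOLVERS:
        problems.append("unknown solverName: {!r}".format(solver))

    # Only 'sgd' offers methods other than 'defaultDense'.
    method = hp.get("solverMethod", "defaultDense")
    allowed_methods = SGD_METHODS if solver == "sgd" else {"defaultDense"}
    if method not in allowed_methods:
        problems.append("solverMethod {!r} is not available for {!r}".format(method, solver))

    # Flag parameters that the table marks as solver-specific.
    if solver != "lbfgs":
        for name in sorted(LBFGS_ONLY & set(hp)):
            problems.append("{} applies to 'lbfgs' only".format(name))
    elif "solverLearningRate" in hp:
        problems.append("solverLearningRate does not apply to 'lbfgs'")

    return problems


print(validate_hyperparameters({"nClasses": 3, "solverName": "lbfgs", "solverL": 1}))
print(validate_hyperparameters({"nClasses": 3, "solverName": "adagrad", "solverMethod": "momentum"}))
```

Returning a list of messages instead of raising on the first problem lets you report every issue at once.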

# Usage of the algorithm <a class="anchor" id="2-bullet"></a>
First, import the SageMaker Python SDK, get the execution role, and create a session.

In [1]:
import numpy as np
import pandas as pd
from sagemaker import get_execution_role
import sagemaker as sage

role = get_execution_role()
sess = sage.Session()

## Upload the data for training <a class="anchor" id="3-bullet"></a>
When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic Iris dataset.

We can use the tools provided by the SageMaker Python SDK to upload the data to an S3 bucket.

First, load the Iris data via scikit-learn and split it into training and test sets.

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=777)

Upload the training data to S3

In [31]:
import numpy as np

# The labels must be the last column of the training dataset.
reshaped_Y_train = y_train.reshape(y_train.shape[0], 1)
training_data = np.concatenate((X_train, reshaped_Y_train), axis=1)

# Save the training data as CSV.
# There must be NO comma at the end of each line in the training data.
train_data_file = 'training_data.csv'
np.savetxt(train_data_file, training_data, delimiter=',')

# S3 bucket and key prefix
bucket_name = 'daal-log-reg-test'
data_key = 'log_reg_iris_data_test'

output_location = 's3://{}/{}'.format(bucket_name, 'output')
data_location = 's3://{}/{}'.format(bucket_name, data_key)
print("Training artifacts will be uploaded at: " + output_location)
print("And data_location will be a parameter for fit method (see training stage below).")

sess.upload_data(train_data_file, bucket=bucket_name, key_prefix=data_key)

Training artifacts will be uploaded at: s3://daal-log-reg-test/output
And data_location will be a parameter for fit method (see training stage below).


's3://daal-log-reg-test/log_reg_iris_data_test/training_data.csv'
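
The CSV layout used above (features first, label in the last column, no trailing comma) can be checked locally before uploading. The sketch below writes a tiny table to a temporary file and reads it back. The arrays are made-up stand-ins for `X_train` and `y_train`, and nothing is sent to S3.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for X_train and y_train: 3 samples, 4 features.
features = np.array([[5.1, 3.5, 1.4, 0.2],
                     [7.0, 3.2, 4.7, 1.4],
                     [6.3, 3.3, 6.0, 2.5]])
labels = np.array([0, 1, 2])

# Append the labels as the last column, as the training cell does.
table = np.concatenate((features, labels.reshape(-1, 1)), axis=1)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "training_data.csv")
    np.savetxt(csv_path, table, delimiter=",")
    with open(csv_path) as csv_file:
        lines = csv_file.read().splitlines()
    loaded = np.loadtxt(csv_path, delimiter=",")

# Every line has n_features + 1 fields and does not end with a comma.
assert all(len(line.split(",")) == 5 and not line.endswith(",") for line in lines)

# Split the table back into features and integer labels.
restored_features = loaded[:, :-1]
restored_labels = loaded[:, -1].astype(int)
```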

Example of hyperparameters list:

In [32]:
hyperparameters={"nClasses": 3,
                 "penaltyL1": 0,
                 "penaltyL2": 0,
                 "interceptFlag": False,
                 "solverBatchSize":100,   #as training data has 100 samples only, but default value is 150
                 #"solverName": "lbfgs", 
                 #"solverMaxIterations": 1000,
                 #"solverAccuracyThreshold": 0.0001,
                 #"solverL": 1
                }

Next, create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>image_name</td>
        <td>The container image to use for training</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Should be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the available types on the algorithm's AWS Marketplace page</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html

## Creating Training Job using Algorithm ARN<a class="anchor" id="4-bullet"></a>
Enter the ARN of the algorithm you want to use below. It can be either an AWS Marketplace algorithm you subscribed to or an algorithm you created in your own account.
The algorithm ARN listed below belongs to Intel DAAL Logistic Regression.

In [33]:
daal_log_reg_arn = "arn:aws:sagemaker:us-east-2:057799348421:algorithm/intel-daal-logistic-regression-ce8a1f38da2f8a234e4205a356021dbf" # you can find it on algorithm page in your subscriptions
#daal_log_reg_arn = "<algorithm-arn>"
daal_log_reg = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=daal_log_reg_arn,
    base_job_name="daal-log-reg-alg-test",
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

## Run training stage<a class="anchor" id="4-bullet"></a>
In the training stage, the Logistic Regression algorithm consumes input data from the S3 location and trains the model.

In [35]:
daal_log_reg.fit({"training": data_location})

INFO:sagemaker:Creating training-job with name: daal-log-reg-alg-test-2018-11-30-10-58-20-329


2018-11-30 10:58:20 Starting - Starting the training job...
2018-11-30 10:58:25 Starting - Launching requested ML instances......
2018-11-30 10:59:21 Starting - Preparing the instances for training......
2018-11-30 11:00:40 Downloading - Downloading input data
2018-11-30 11:00:40 Training - Downloading the training image..
2018-11-30 11:00:59 INFO     Container setup completed, In Docker entrypoint - train...
2018-11-30 11:00:59 INFO     Default Hyperparameters loaded:
2018-11-30 11:00:59 INFO     {'dtype': 'float',
 'interceptFlag': True,
 'nClasses': 0,
 'penaltyL1': 0,
 'penaltyL2': 0}
2018-11-30 11:00:59 INFO     Reading training data...
2018-11-30 11:00:59 INFO     Training data with labels shape: (100, 5)
2018-11-30 11:00:59 INFO     Updated with user hyperparameters, Final Hyperparameters:
2018-11-30 11:00:59 INFO     {'dtype': 'float',
 'interceptFlag': 'False',
 'nClasses': '3',
 'penaltyL1': '0.0',
 'penaltyL2': '
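
The log above shows that the user-supplied hyperparameters reach the container as strings ('False', '3', '0.0'). The sketch below shows one way such values might be converted back to typed values. It is a hypothetical illustration and not the actual code of the DAAL container; `parse_bool`, `CASTS`, and `cast_hyperparameters` are names invented for this example.

```python
def parse_bool(text):
    """Convert 'True'/'False' text to a bool; note that bool('False') is True."""
    normalized = str(text).strip().lower()
    if normalized in ("true", "1"):
        return True
    if normalized in ("false", "0"):
        return False
    raise ValueError("not a boolean value: {!r}".format(text))


# Target type for each hyperparameter, taken from the tables in this document.
CASTS = {
    "nClasses": int,
    "penaltyL1": float,
    "penaltyL2": float,
    "interceptFlag": parse_bool,
}


def cast_hyperparameters(raw):
    """Cast string values to their declared types; unknown names stay strings."""
    return {name: CASTS.get(name, str)(value) for name, value in raw.items()}


raw = {"interceptFlag": "False", "nClasses": "3", "penaltyL1": "0.0"}
typed = cast_hyperparameters(raw)
print(typed)
```

The explicit boolean parser matters because any non-empty string, including 'False', is truthy in Python.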

## Live Inference Endpoint for Prediction stage<a class="anchor" id="5-bullet"></a>
In the prediction stage, the Logistic Regression algorithm computes the class probabilities and returns the predicted class for each input sample.
First, deploy a SageMaker endpoint that consumes the data.


In [23]:
from sagemaker.predictor import csv_serializer
predictor = daal_log_reg.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

INFO:sagemaker:Creating model package with name: intel-daal-logistic-regression-ce8a1f38-2018-11-30-10-05-13-703


..........

INFO:sagemaker:Creating model with name: intel-daal-logistic-regression-ce8a1f38-2018-11-30-10-05-59-280





INFO:sagemaker:Creating endpoint with name daal-log-reg-alg-test-2018-11-30-10-00-36-782


--------------------------------------------------------------------------!

First, define functions to handle the response from the predictor instance:

In [24]:
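def output_to_np(prediction, numberOfSamples, nClasses):
    # Parse the whitespace-separated response into a 2-D array; the first row holds the labels.
    values = np.fromstring(prediction, dtype=np.float64, sep=' ')
    if nClasses == 2:
        # Binary case: labels plus the probability of class 1.
        return values.reshape(2, numberOfSamples)
    if nClasses > 2:
        # Multinomial case: labels plus one row of probabilities per class.
        return values.reshape(nClasses + 1, numberOfSamples)

def output_to_pd(prediction, nClasses):
    # Build the column names: labels first, then the class probabilities.
    C = ['lables']
    if nClasses > 2:
        for i in range(nClasses):
            C.append('probability of class ' + str(i))
    if nClasses == 2:
        C.append('probability of class 1')
    # Transpose so that each row of the DataFrame corresponds to one sample.
    return pd.DataFrame(np.transpose(prediction), columns=C)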
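
The helpers above expect the response to hold one row of labels followed by one row of probabilities per class. The sketch below walks through that layout on a made-up response string. The layout is an assumption inferred from the reshape in `output_to_np`, the values are not real endpoint output, and `str.split` is used here instead of `np.fromstring`.

```python
import numpy as np

n_classes, n_samples = 3, 2

# Hypothetical response text (made-up values): a row of labels, then one row
# of probabilities per class, all separated by whitespace.
response_text = "2.0 0.0\n0.01 0.97\n0.04 0.02\n0.95 0.01"

values = np.array(response_text.split(), dtype=np.float64)

# Reshape to (n_classes + 1, n_samples), then transpose to one row per sample.
table = values.reshape(n_classes + 1, n_samples).T

predicted_labels = table[:, 0].astype(int)  # first column: predicted class
class_probabilities = table[:, 1:]          # remaining columns: one per class
```

After the transpose, each row reads as "label, probability of class 0, probability of class 1, probability of class 2".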

After deployment, pass the data as a NumPy array to the predictor instance to get the predicted labels and probabilities.

In [26]:
# Use the test part of the split data
payload = X_test

ground_truth = y_test
prediction = predictor.predict(payload).decode('utf-8')
print(prediction)
#np_res = output_to_np(prediction=prediction,numberOfSamples=payload.shape[0],nClasses=3)
#pd_res = output_to_pd(prediction=np_res, nClasses=3)

2.0
0.0
2.0
2.0
1.0
0.0
2.0
2.0
0.0
0.0
2.0
1.0
1.0
2.0
2.0
2.0
0.0
2.0
0.0
1.0
1.0
1.0
2.0
0.0
2.0
0.0
1.0
0.0
2.0
2.0
0.0
2.0
0.0
2.0
1.0
0.0
0.0
0.0
1.0
0.0
0.0
2.0
1.0
1.0
0.0
2.0
2.0
0.0
2.0
1.0




Print the first 5 rows of obtained prediction:

In [32]:
pd_res.head()

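<table>
    <tr>
        <th></th>
        <th>lables</th>
        <th>probability of class 0</th>
        <th>probability of class 1</th>
        <th>probability of class 2</th>
    </tr>
    <tr><td>0</td><td>2.0</td><td>9.311358e-10</td><td>0.010659</td><td>0.9893406</td></tr>
    <tr><td>1</td><td>0.0</td><td>0.9948357</td><td>0.005164</td><td>1.804341e-12</td></tr>
    <tr><td>2</td><td>2.0</td><td>2.409668e-09</td><td>0.018753</td><td>0.9812474</td></tr>
    <tr><td>3</td><td>2.0</td><td>3.761642e-09</td><td>0.003549</td><td>0.9964515</td></tr>
    <tr><td>4</td><td>1.0</td><td>2.466924e-05</td><td>0.858058</td><td>0.1419178</td></tr>
</table>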


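Score the trained model on the test set by comparing the predictions with 'y_test'. Note that v_measure_score is a clustering metric that is invariant to label permutations, so it is a proxy for classification accuracy rather than the accuracy itself.

In [35]:
from sklearn.metrics.cluster import v_measure_score

prediction_arr = np_res[0]  # the first row holds the predicted labels
ground_truth = y_test
#print(prediction_arr)
#print(ground_truth)

print("DAAL Accuracy on Iris Test Set: ", str(v_measure_score(prediction_arr, ground_truth)))

DAAL Accuracy on Iris Test Set:  1.0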
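
The cell above scores the predictions with v_measure_score, which comes from scikit-learn's clustering metrics. As a complement, the sketch below computes plain classification accuracy and per-class recall directly with NumPy. The arrays are made-up examples and not the notebook's results.

```python
import numpy as np

# Hypothetical predictions and ground truth (made-up values).
predicted = np.array([2, 0, 2, 1, 1, 0])
expected = np.array([2, 0, 1, 1, 1, 0])

# Accuracy: fraction of samples whose predicted label equals the true label.
accuracy = float(np.mean(predicted == expected))

# Per-class recall: share of each true class that was predicted correctly.
recall = {
    int(c): float(np.mean(predicted[expected == c] == c))
    for c in np.unique(expected)
}
```

Unlike V-measure, accuracy penalizes a model that separates the classes correctly but assigns them the wrong label values.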


Don't forget to delete the endpoint when you no longer need it. Otherwise it keeps running and continues to incur charges.


In [27]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: daal-log-reg-alg-test-2018-11-30-10-00-36-782


## Batch transform job<a class="anchor" id="6-bullet"></a>
If you don't need real-time predictions, you can use a batch transform job. It uses the saved model, computes the predictions for the whole input dataset once, and saves them to a specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [28]:
transformer = daal_log_reg.transformer(1, "ml.m4.xlarge")

INFO:sagemaker:Creating model package with name: intel-daal-logistic-regression-ce8a1f38-2018-11-30-10-25-22-669


..........

INFO:sagemaker:Creating model with name: intel-daal-logistic-regression-ce8a1f38-2018-11-30-10-26-08-303





In [29]:
transformer.transform("s3://daal-log-reg-test/input/data/test_data.csv", content_type="text/csv")
transformer.wait()

INFO:sagemaker:Creating transform job with name: daal-log-reg-alg-test-2018-11-30-10-26-55-276


.......................................!


In [30]:
from urllib.parse import urlparse

# Split the transformer output path (s3://bucket/prefix) into bucket and key prefix.
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
# The result is stored as '<input file name>.out'; the input data has 100 rows.
file_key = '{}/{}.out'.format(parsed_url.path[1:], "test_data.csv")

s3_client = sess.boto_session.client('s3')

# Read the result from the bucket parsed from the output path.
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
response_bytes = response['Body'].read().decode('utf-8')
print(response_bytes)
#size_data = 100
#np_res = output_to_np(prediction=response_bytes,numberOfSamples=size_data,nClasses=3)
#pd_res = output_to_pd(prediction=np_res, nClasses=3)


2.0
1.0
2.0
2.0
1.0
0.0
0.0
0.0
2.0
2.0
1.0
2.0
0.0
0.0
0.0
2.0
1.0
2.0
0.0
0.0
1.0
0.0
2.0
1.0
0.0
2.0
1.0
2.0
0.0
0.0
0.0
1.0
2.0
1.0
1.0
2.0
2.0
2.0
2.0
1.0
1.0
2.0
1.0
0.0
2.0
0.0
0.0
0.0
1.0
1.0
1.0
0.0
1.0
1.0
0.0
2.0
2.0
0.0
0.0
1.0
0.0
0.0
1.0
1.0
2.0
0.0
0.0
2.0
1.0
1.0
1.0
2.0
1.0
1.0
2.0
2.0
1.0
1.0
2.0
1.0
2.0
0.0
2.0
0.0
1.0
2.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
2.0
2.0
1.0
1.0
1.0
0.0
2.0
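
The cell above derives the S3 bucket and object key of the transform result from the transformer's output path. The same logic can be sketched as a small standalone function. `transform_output_location` is a hypothetical helper, and the bucket and prefix used below are made-up values.

```python
from urllib.parse import urlparse


def transform_output_location(output_path, input_file_name):
    """Split an s3:// output path into (bucket, key of the '.out' result object)."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket/prefix URI, got {!r}".format(output_path))
    prefix = parsed.path.strip("/")
    file_name = input_file_name + ".out"
    # Avoid a leading slash in the key when the path has no prefix.
    key = "{}/{}".format(prefix, file_name) if prefix else file_name
    return parsed.netloc, key


bucket, key = transform_output_location("s3://example-bucket/example-job/", "test_data.csv")
print(bucket, key)
```

Returning the bucket together with the key keeps the later `get_object` call from relying on a separately configured bucket name.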




In [47]:
pd_res.head()

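<table>
    <tr>
        <th></th>
        <th>lables</th>
        <th>probability of class 0</th>
        <th>probability of class 1</th>
        <th>probability of class 2</th>
    </tr>
    <tr><td>0</td><td>2.0</td><td>2.848256e-10</td><td>0.008035</td><td>0.991965</td></tr>
    <tr><td>1</td><td>1.0</td><td>7.074607e-06</td><td>0.810579</td><td>0.189414</td></tr>
    <tr><td>2</td><td>2.0</td><td>2.953847e-09</td><td>0.005116</td><td>0.994884</td></tr>
    <tr><td>3</td><td>2.0</td><td>8.515568e-09</td><td>0.014403</td><td>0.985597</td></tr>
    <tr><td>4</td><td>1.0</td><td>0.01201045</td><td>0.98755</td><td>0.000439</td></tr>
</table>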
