# Logistic Regression Example

Logistic Regression is a widely used supervised algorithm for binary and multi-class classification. In this example, a Logistic Regression model is trained on columns of a frame, used to predict the labels of observations, and then tested by comparing the predicted labels against the true labels. The model runs the [MLlib implementation](https://spark.apache.org/docs/1.5.0/mllib-linear-methods.html#logistic-regression) of Logistic Regression with enhanced features: trained model summary statistics, covariance and Hessian matrices, and the ability to specify the frequency of the train and test observations. Testing performance can be viewed via built-in binary and multi-class classification metrics. The user can also select the optimizer: [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) or [SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).


- Read more about [Logistic Regression in Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression)
- See the [sparktk Documentation](http://trustedanalytics.github.io/sparktk/) for more information about the API

In [1]:
# First, let's verify that the sparktk libraries are installed
import sparktk
print "sparktk installation path = %s" % (sparktk.__path__)

sparktk installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']


In [2]:
# Create a TkContext, the entry point to the sparktk API (frames, models)
from sparktk import TkContext
tc = TkContext()

In [3]:
# Create a new frame by uploading rows
data = [[4.9, 1.4, 0],
        [4.7, 1.3, 0],
        [4.6, 1.5, 0],
        [6.3, 4.9, 1],
        [6.1, 4.7, 1],
        [6.4, 4.3, 1],
        [6.6, 4.4, 1],
        [7.2, 6.0, 2],
        [7.2, 5.8, 2],
        [7.4, 6.1, 2],
        [7.9, 6.4, 2]]

schema = [('Sepal_Length', float),
          ('Petal_Length', float),
          ('Class', int)]
frame = tc.frame.create(data, schema)

In [4]:
# Inspect the frame, which has three columns; the output below shows the first 10 of its 11 rows
frame.inspect()

[#]  Sepal_Length  Petal_Length  Class
[0]           4.9           1.4      0
[1]           4.7           1.3      0
[2]           4.6           1.5      0
[3]           6.3           4.9      1
[4]           6.1           4.7      1
[5]           6.4           4.3      1
[6]           6.6           4.4      1
[7]           7.2           6.0      2
[8]           7.2           5.8      2
[9]           7.4           6.1      2

In [5]:
# Create a new model and train it
model = tc.models.classification.logistic_regression.train(frame, ['Sepal_Length', 'Petal_Length'],
                                                           'Class',
                                                           num_classes=3,
                                                           optimizer='LBFGS',
                                                           compute_covariance=True)

In [6]:
# View the summary statistics of the trained model
model.training_summary

coefficients      = {u'Sepal_Length_0': -63.683819066034125, u'Sepal_Length_1': -120.44216460316508, u'Petal_Length_0': 117.97982446106286, u'intercept_1': -90.48440455038572, u'intercept_0': -0.7801530640222967, u'Petal_Length_1': 206.3396486706314}
covariance_matrix = <sparktk.frame.frame.Frame object at 0x7fe841fce4d0>
degrees_freedom   = {u'Sepal_Length_0': 1.0, u'Sepal_Length_1': 1.0, u'Petal_Length_0': 1.0, u'intercept_1': 1.0, u'intercept_0': 1.0, u'Petal_Length_1': 1.0}
num_classes       = 3
num_features      = 2
p_value           = {u'Sepal_Length_0': 1.0, u'Sepal_Length_1': 1.0, u'Petal_Length_0': 0.9980954664211461, u'intercept_1': 1.0, u'intercept_0': 1.0, u'Petal_Length_1': 0.9974815878887049}
standard_errors   = {u'Sepal_Length_0': 19317645.935567528, u'Sepal_Length_1': 19317141.267795388, u'Petal_Length_0': 20706646.795911506, u'intercept_1': 28035868.81120614, u'intercept_0': 28062872.199688833, u'Petal_Length_1': 20711319.532376617}
wald_statistic    = {u'Sepal_Length_
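
The summary fields are related to one another. As a rough check using only the numbers printed above, dividing a coefficient by its standard error gives a Wald statistic, and the reported p-values appear consistent with the upper tail of a chi-squared distribution with one degree of freedom evaluated at that statistic. This relationship is inferred from the printed output, not taken from the sparktk documentation, and the helper name `chi2_upper_tail_1df` is hypothetical.

```python
import math

# Values copied from the training summary printed above.
coefficients = {'Petal_Length_0': 117.97982446106286,
                'Petal_Length_1': 206.3396486706314,
                'Sepal_Length_0': -63.683819066034125}
standard_errors = {'Petal_Length_0': 20706646.795911506,
                   'Petal_Length_1': 20711319.532376617,
                   'Sepal_Length_0': 19317645.935567528}


def chi2_upper_tail_1df(x):
    """Hypothetical helper: P(X > x) for a chi-squared variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0  # the distribution has no mass below zero
    return math.erfc(math.sqrt(x / 2.0))


p_values = {}
for name in sorted(coefficients):
    # Assumption: Wald statistic = coefficient / standard error.
    wald = coefficients[name] / standard_errors[name]
    p_values[name] = chi2_upper_tail_1df(wald)
    print("%s: wald = %.6e, p_value = %.10f" % (name, wald, p_values[name]))
```

The very large standard errors, and the resulting p-values close to 1, are what one would expect when the classes are perfectly separable, as they are in this small frame.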

In [7]:
# The covariance matrix is the inverse of the Hessian matrix for the trained model.
# The Hessian matrix contains the second-order partial derivatives of the model's log-likelihood function.
model.training_summary.covariance_matrix.inspect()

[#]  Sepal_Length_0      Petal_Length_0      intercept_0       
[0]   3.73171444492e+14  -4.00052132787e+14  -5.62650094207e+14
[1]  -4.00052131903e+14   4.28765221531e+14   6.00530188945e+14
[2]  -5.62650068572e+14   6.00530162791e+14   7.87524796096e+14
[3]   3.73161692004e+14  -4.00111467117e+14   -5.6225771278e+14
[4]   -4.0007789691e+14   4.28861979964e+14   6.00176171564e+14
[5]  -5.62443729743e+14    6.0041020171e+14    7.8676734233e+14

[#]  Sepal_Length_1      Petal_Length_1      intercept_1       
[0]   3.73161691935e+14  -4.00077897731e+14  -5.62443755249e+14
[1]   -4.0011146616e+14   4.28861979896e+14   6.00410227724e+14
[2]  -5.62257687042e+14   6.00176145315e+14   7.86767342149e+14
[3]    3.7315194676e+14  -4.00137243285e+14  -5.62051362791e+14
[4]   -4.0013724239e+14   4.28958756772e+14   6.00056186956e+14
[5]  -5.62051337182e+14   6.00056160845e+14   7.86009939999e+14
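
As a quick consistency check, the standard errors reported in the training summary can be recovered as the square roots of the diagonal of this covariance matrix. The sketch below uses only values printed above. The row order (`Sepal_Length_0`, `Petal_Length_0`, `intercept_0`, `Sepal_Length_1`, `Petal_Length_1`, `intercept_1`) is an assumption inferred from the output, not something this notebook states.

```python
import math

# Assumed row/column order of the covariance matrix.
names = ['Sepal_Length_0', 'Petal_Length_0', 'intercept_0',
         'Sepal_Length_1', 'Petal_Length_1', 'intercept_1']

# Diagonal entries copied from the inspect() output above (row i, column i).
diagonal = [3.73171444492e+14, 4.28765221531e+14, 7.87524796096e+14,
            3.7315194676e+14, 4.28958756772e+14, 7.86009939999e+14]

# Standard errors copied from model.training_summary.
reported = {'Sepal_Length_0': 19317645.935567528,
            'Petal_Length_0': 20706646.795911506,
            'intercept_0': 28062872.199688833,
            'Sepal_Length_1': 19317141.267795388,
            'Petal_Length_1': 20711319.532376617,
            'intercept_1': 28035868.81120614}

# A standard error is the square root of the corresponding variance.
derived = dict((name, math.sqrt(var)) for name, var in zip(names, diagonal))

relative_errors = {}
for name in names:
    relative_errors[name] = abs(derived[name] - reported[name]) / reported[name]
    print("%-15s derived = %.3f, reported = %.3f" % (name, derived[name], reported[name]))
```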

In [8]:
# Use the model to make predictions
model.predict(frame, ['Sepal_Length', 'Petal_Length'])

frame.inspect()

[#]  Sepal_Length  Petal_Length  Class  predicted_label
[0]           4.9           1.4      0                0
[1]           4.7           1.3      0                0
[2]           4.6           1.5      0                0
[3]           6.3           4.9      1                1
[4]           6.1           4.7      1                1
[5]           6.4           4.3      1                1
[6]           6.6           4.4      1                1
[7]           7.2           6.0      2                2
[8]           7.2           5.8      2                2
[9]           7.4           6.1      2                2
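
The predicted labels above can be reproduced by hand from the coefficients in the training summary. The sketch below assumes the pivot-class formulation used by MLlib's multinomial logistic regression: class 0 is the reference class with an implicit margin of 0, the coefficient set with suffix `_0` scores class 1, and the set with suffix `_1` scores class 2. That mapping of suffixes to classes is inferred from the output, and `predict_label` is a hypothetical helper, not part of sparktk.

```python
# Coefficients copied from model.training_summary above.
coefficients = {'Sepal_Length_0': -63.683819066034125,
                'Petal_Length_0': 117.97982446106286,
                'intercept_0': -0.7801530640222967,
                'Sepal_Length_1': -120.44216460316508,
                'Petal_Length_1': 206.3396486706314,
                'intercept_1': -90.48440455038572}


def predict_label(sepal_length, petal_length, num_classes=3):
    """Hypothetical helper: pick the class with the largest linear margin."""
    best_class, best_margin = 0, 0.0  # pivot class 0 has margin 0 by definition
    for i in range(num_classes - 1):
        margin = (coefficients['Sepal_Length_%d' % i] * sepal_length
                  + coefficients['Petal_Length_%d' % i] * petal_length
                  + coefficients['intercept_%d' % i])
        if margin > best_margin:
            # Assumption: coefficient set i scores class i + 1.
            best_class, best_margin = i + 1, margin
    return best_class


# The same rows that were uploaded to the frame.
data = [[4.9, 1.4, 0], [4.7, 1.3, 0], [4.6, 1.5, 0],
        [6.3, 4.9, 1], [6.1, 4.7, 1], [6.4, 4.3, 1], [6.6, 4.4, 1],
        [7.2, 6.0, 2], [7.2, 5.8, 2], [7.4, 6.1, 2], [7.9, 6.4, 2]]

predicted = [predict_label(row[0], row[1]) for row in data]
actual = [row[2] for row in data]
print(predicted)
```

Under these assumptions, the hand-computed labels agree with the `Class` column for all 11 rows.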

In [9]:
# Test the model
test_metrics = model.test(frame, 'Class', ['Sepal_Length', 'Petal_Length'])
test_metrics

accuracy         = 1.0
confusion_matrix =             Predicted_0.0  Predicted_1.0  Predicted_2.0
Actual_0.0              3              0              0
Actual_1.0              0              4              0
Actual_2.0              0              0              4
f_measure        = 1.0
precision        = 1.0
recall           = 1.0
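
To see how the summary metrics relate to the confusion matrix, the following sketch recomputes accuracy, precision, recall, and the F-measure from the counts printed above. This notebook does not show how sparktk aggregates per-class values into a single multi-class figure, so the support-weighted average used here is an assumption.

```python
# Confusion matrix copied from the test output: rows are actual classes,
# columns are predicted classes.
confusion = [[3, 0, 0],
             [0, 4, 0],
             [0, 0, 4]]

num_classes = len(confusion)
total = float(sum(sum(row) for row in confusion))

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy = sum(confusion[k][k] for k in range(num_classes)) / total

precision = recall = f_measure = 0.0
for k in range(num_classes):
    true_positives = float(confusion[k][k])
    predicted_k = sum(confusion[r][k] for r in range(num_classes))  # column sum
    actual_k = sum(confusion[k])                                    # row sum
    p_k = true_positives / predicted_k if predicted_k else 0.0
    r_k = true_positives / actual_k if actual_k else 0.0
    f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
    # Assumption: per-class values are averaged with weights proportional
    # to the number of actual rows in each class.
    weight = actual_k / total
    precision += weight * p_k
    recall += weight * r_k
    f_measure += weight * f_k

print("accuracy = %.1f, precision = %.1f, recall = %.1f, f_measure = %.1f"
      % (accuracy, precision, recall, f_measure))
```

Because the model was tested on the same rows it was trained on, the perfect scores describe how well the model fits this frame, not how it would perform on unseen data.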