# Logistic Regression Example

Logistic regression is a widely used supervised algorithm for binary and multi-class classification. In this example, a logistic regression model is trained on columns of a frame, used to predict labels for observations, and then tested by comparing the predicted labels against the true labels. The model runs the [MLlib implementation](https://spark.apache.org/docs/1.5.0/mllib-linear-methods.html#logistic-regression) of logistic regression with several enhancements: summary statistics for the trained model, covariance and Hessian matrices, and the ability to specify the frequency of the train and test observations. Test performance can be viewed through the built-in binary and multi-class classification metrics. The user can also select the optimizer, either [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) or [SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).


- Read more about [Logistic Regression on Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression)
- See the [sparktk documentation](http://trustedanalytics.github.io/sparktk/) for more information about the API

In [1]:
# First, let's verify that the sparktk libraries are installed
import sparktk
print "sparktk installation path = %s" % (sparktk.__path__)

sparktk installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']


In [2]:
# This notebook assumes you have already created a credentials file.
# Create the TkContext, which is used below to build frames and models
from sparktk import TkContext
tc = TkContext()

In [3]:
# Create a new frame by uploading rows
data = [[4.9, 1.4, 0],
        [4.7, 1.3, 0],
        [4.6, 1.5, 0],
        [6.3, 4.9, 1],
        [6.1, 4.7, 1],
        [6.4, 4.3, 1],
        [6.6, 4.4, 1],
        [7.2, 6.0, 2],
        [7.2, 5.8, 2],
        [7.4, 6.1, 2],
        [7.9, 6.4, 2]]

schema = [('Sepal_Length', float),
          ('Petal_Length', float),
          ('Class', int)]
frame = tc.frame.create(data, schema)

In [4]:
# Inspect the frame, which has three columns; inspect() shows the first 10 rows by default
frame.inspect()

[#]  Sepal_Length  Petal_Length  Class
[0]           4.9           1.4      0
[1]           4.7           1.3      0
[2]           4.6           1.5      0
[3]           6.3           4.9      1
[4]           6.1           4.7      1
[5]           6.4           4.3      1
[6]           6.6           4.4      1
[7]           7.2           6.0      2
[8]           7.2           5.8      2
[9]           7.4           6.1      2

In [5]:
# Create a new model and train it
model = tc.models.classification.logistic_regression.train(frame, ['Sepal_Length', 'Petal_Length'],
                                                           'Class',
                                                           num_classes=3,
                                                           optimizer='LBFGS',
                                                           compute_covariance=True)

In [6]:
model.training_summary

coefficients      = {u'Sepal_Length_0': -63.68381906591638, u'Sepal_Length_1': -120.44216460295151, u'Petal_Length_0': 117.97982446091233, u'intercept_1': -90.4844045502266, u'intercept_0': -0.7801530640313616, u'Petal_Length_1': 206.33964867030875}
covariance_matrix = <sparktk.frame.frame.Frame object at 0x7f4b8d811110>
degrees_freedom   = {u'Sepal_Length_0': 1.0, u'Sepal_Length_1': 1.0, u'Petal_Length_0': 1.0, u'intercept_1': 1.0, u'intercept_0': 1.0, u'Petal_Length_1': 1.0}
num_classes       = 3
num_features      = 2
p_value           = {u'Sepal_Length_0': 1.0, u'Sepal_Length_1': 1.0, u'Petal_Length_0': 0.9974692240947096, u'intercept_1': 1.0, u'intercept_0': 1.0, u'Petal_Length_1': 0.9966542688059293}
standard_errors   = {u'Sepal_Length_0': 16611711.27902509, u'Sepal_Length_1': 16610947.081891032, u'Petal_Length_0': 11726786.817453029, u'intercept_1': 40269087.87692691, u'intercept_0': 40287879.968022846, u'Petal_Length_1': 11734867.038286857}
wald_statistic    = {u'Sepal_Length_0'

In [7]:
# The covariance matrix is the inverse of the Hessian matrix for the trained model.
# The Hessian matrix contains the second-order partial derivatives of the model's log-likelihood function.
model.training_summary.covariance_matrix.inspect()

[#]  Sepal_Length_0      Petal_Length_0      intercept_0       
[0]   2.75948951618e+14  -2.11652277802e+14  -9.23304845943e+14
[1]  -2.11652198459e+14   1.37517529062e+14   8.98938201283e+14
[2]  -9.23305455582e+14   8.98938934397e+14   1.62311327232e+15
[3]   2.75936251796e+14  -2.11722390638e+14  -9.22846555893e+14
[4]  -2.11662305836e+14   1.37612302185e+14   8.98501135694e+14
[5]   -9.2316375869e+14   8.98903106732e+14   1.62235634858e+15

[#]  Sepal_Length_1      Petal_Length_1      intercept_1       
[0]   2.75936251572e+14  -2.11662384933e+14  -9.23163148862e+14
[1]  -2.11722311118e+14   1.37612301984e+14   8.98902373501e+14
[2]  -9.22847164822e+14   8.98501868071e+14   1.62235634773e+15
[3]   2.75923562957e+14  -2.11732516258e+14  -9.22704834924e+14
[4]  -2.11732436984e+14   1.37707104406e+14   8.98465275197e+14
[5]  -9.22705444042e+14   8.98466007692e+14   1.62159943844e+15
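
The link between the covariance matrix and the standard errors in the training summary can be checked by hand: each standard error should be the square root of the matching diagonal entry. The sketch below is plain Python with values copied from the outputs above. Pairing row i with the i-th coefficient column is an assumption based on the column order shown, and the z-ratio uses the conventional Wald definition (coefficient divided by standard error), which may not be exactly the quantity sparktk reports.

```python
import math

# Diagonal entries of the covariance matrix, copied from the inspect() output.
# Assumption: row i belongs to the i-th coefficient column, in the order shown.
covariance_diagonal = {
    'Sepal_Length_0': 2.75948951618e+14,
    'Petal_Length_0': 1.37517529062e+14,
    'intercept_0': 1.62311327232e+15,
    'Sepal_Length_1': 2.75923562957e+14,
    'Petal_Length_1': 1.37707104406e+14,
    'intercept_1': 1.62159943844e+15,
}

# Coefficients and standard errors copied from model.training_summary
coefficients = {
    'Sepal_Length_0': -63.68381906591638,
    'Petal_Length_0': 117.97982446091233,
    'intercept_0': -0.7801530640313616,
    'Sepal_Length_1': -120.44216460295151,
    'Petal_Length_1': 206.33964867030875,
    'intercept_1': -90.4844045502266,
}
reported_standard_errors = {
    'Sepal_Length_0': 16611711.27902509,
    'Petal_Length_0': 11726786.817453029,
    'intercept_0': 40287879.968022846,
    'Sepal_Length_1': 16610947.081891032,
    'Petal_Length_1': 11734867.038286857,
    'intercept_1': 40269087.87692691,
}

# Standard error = square root of the coefficient's variance (diagonal entry)
standard_errors = dict((name, math.sqrt(variance))
                       for name, variance in covariance_diagonal.items())

# Conventional Wald z-ratio (an assumption about what the summary reports)
z_ratios = dict((name, coefficients[name] / standard_errors[name])
                for name in coefficients)

for name in sorted(standard_errors):
    print("%-15s se=%.6e  z=%.3e" % (name, standard_errors[name], z_ratios[name]))
```

Standard errors near 1e7 against coefficients near 1e2 mean the individual estimates are not statistically distinguishable from zero. This pattern is commonly seen when the classes are perfectly separable, as they are in this small data set.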

In [8]:
# Use the model to make predictions
model.predict(frame, ['Sepal_Length', 'Petal_Length'])

frame.inspect()

[#]  Sepal_Length  Petal_Length  Class  predicted_label
[0]           4.9           1.4      0                0
[1]           4.7           1.3      0                0
[2]           4.6           1.5      0                0
[3]           6.3           4.9      1                1
[4]           6.1           4.7      1                1
[5]           6.4           4.3      1                1
[6]           6.6           4.4      1                1
[7]           7.2           6.0      2                2
[8]           7.2           5.8      2                2
[9]           7.4           6.1      2                2
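
The predicted labels can be reproduced from the coefficients in the training summary. The sketch below assumes the MLlib multinomial convention, in which class 0 is the pivot class with a fixed margin of 0 and the coefficient sets suffixed `_0` and `_1` give the margins for classes 1 and 2. The helper names `class_probabilities` and `predict_label` are illustrative and not part of sparktk.

```python
import math

# Coefficients copied from model.training_summary.
# Assumption: set "_0" scores class 1, set "_1" scores class 2, class 0 is the pivot.
coefficient_sets = [
    {'Sepal_Length': -63.68381906591638, 'Petal_Length': 117.97982446091233,
     'intercept': -0.7801530640313616},
    {'Sepal_Length': -120.44216460295151, 'Petal_Length': 206.33964867030875,
     'intercept': -90.4844045502266},
]

def class_probabilities(sepal_length, petal_length):
    """Softmax over the pivot margin (0.0) and one margin per coefficient set."""
    margins = [0.0]
    for c in coefficient_sets:
        margins.append(c['Sepal_Length'] * sepal_length
                       + c['Petal_Length'] * petal_length
                       + c['intercept'])
    top = max(margins)  # subtract the largest margin so exp() cannot overflow
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(sepal_length, petal_length):
    """Return the index of the most probable class."""
    probs = class_probabilities(sepal_length, petal_length)
    return probs.index(max(probs))

# The same 11 rows that were uploaded to the frame
rows = [[4.9, 1.4, 0], [4.7, 1.3, 0], [4.6, 1.5, 0],
        [6.3, 4.9, 1], [6.1, 4.7, 1], [6.4, 4.3, 1], [6.6, 4.4, 1],
        [7.2, 6.0, 2], [7.2, 5.8, 2], [7.4, 6.1, 2], [7.9, 6.4, 2]]

predicted = [predict_label(s, p) for s, p, _ in rows]
print(predicted)  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Under this assumption the hand-computed labels agree with the `Class` column for all 11 rows, including the last row that `inspect()` does not display.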

In [9]:
# Test the model
test_metrics = model.test(frame, 'Class', ['Sepal_Length', 'Petal_Length'])
test_metrics

accuracy         = 1.0
confusion_matrix =             Predicted_0.0  Predicted_1.0  Predicted_2.0
Actual_0.0              3              0              0
Actual_1.0              0              4              0
Actual_2.0              0              0              4
f_measure        = 1.0
precision        = 1.0
recall           = 1.0
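
The summary metrics follow from the confusion matrix. The sketch below recomputes accuracy and per-class precision, recall, and F-measure in plain Python from the counts shown above. How sparktk averages the per-class values into its single multi-class figures is not shown in this example, so the code stops at the per-class values.

```python
# Confusion matrix copied from the test output: rows are actual, columns are predicted
confusion = [
    [3, 0, 0],
    [0, 4, 0],
    [0, 0, 4],
]

num_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(num_classes))
accuracy = float(correct) / total

per_class = []
for k in range(num_classes):
    true_positive = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(num_classes))  # column sum
    actual_k = sum(confusion[k])                                    # row sum
    precision = float(true_positive) / predicted_k if predicted_k else 0.0
    recall = float(true_positive) / actual_k if actual_k else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    per_class.append((precision, recall, f_measure))

print("accuracy = %s" % accuracy)
for k, (p, r, f) in enumerate(per_class):
    print("class %d: precision=%s recall=%s f_measure=%s" % (k, p, r, f))
```

Every observation falls on the diagonal, so all values are 1.0. The model was tested on the same 11 rows it was trained on, so these figures describe the fit to the training data only.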