# Logistic Regression

Let's walk through an example of running a logistic regression with Python and Spark. This first example follows the one in the Spark documentation, so we will move through it quickly and then show a more realistic example. Afterwards, you will have another consulting project!

# B1: Create SparkSession

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logistic_regression_example").getOrCreate()

# B2: Load input corpus
The input file is 'sample_libsvm_data.txt', read with format 'libsvm'.

In [2]:
dir_input_path = "./../input_data/"
file_input_path = dir_input_path + 'sample_libsvm_data.txt'

In [3]:
import os

if not os.path.exists(file_input_path):
    print("File Not Found :", file_input_path)
else:
    print("Verified input file :", file_input_path)

Verified input file : ./../input_data/sample_libsvm_data.txt


In [4]:
df = spark.read.format("libsvm").load(file_input_path)
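
For readers unfamiliar with the format: each line of a 'libsvm' file holds a label followed by index:value pairs for the non-zero features. The sketch below parses one such line in plain Python. Both parse_libsvm_line and the sample line are hypothetical, made up for illustration, and this is not Spark's loader. It assumes one-based indices in the file and shifts them to zero-based, which is how the loaded feature vectors are indexed.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for token in parts[1:]:
        idx, val = token.split(":")
        indices.append(int(idx) - 1)  # one-based in the file, zero-based here
        values.append(float(val))
    return label, indices, values


# Hypothetical sample line, not taken from sample_libsvm_data.txt.
label, indices, values = parse_libsvm_line("1 3:51.0 4:159.0 10:253.0")
print(label, indices, values)
```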

# B3: Show overview of input corpus

## Schema

In [5]:
df.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



## Description

In [6]:
df.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                100|
|   mean|               0.57|
| stddev|0.49756985195624287|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+



## The column names

In [7]:
df.columns

['label', 'features']

## Sample Data

In [8]:
df.show(2)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
+-----+--------------------+
only showing top 2 rows



## Print each item in the first row

In [9]:
for item in df.head():
    print(item)

0.0
(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,22
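
The second item printed above is a sparse vector written as (size, [indices], [values]): only the non-zero entries are stored. As a rough illustration in plain Python, expanding such a triple into a dense list could look like the sketch below. The helper sparse_to_dense and the tiny vector are made up for this example.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical 6-element vector with two non-zero entries.
dense = sparse_to_dense(6, [1, 4], [51.0, 159.0])
print(dense)  # [0.0, 51.0, 0.0, 0.0, 159.0, 0.0]
```

On a pyspark.ml.linalg vector, toArray() performs the same kind of expansion.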

# B6: Splitting the Full Data into a Training Set and a Testing Set

In [10]:
train_set, test_set = df.randomSplit([0.7, 0.3])
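
Note that randomSplit assigns rows at random according to the weights, so the resulting sizes are only approximately 70/30 (the run in this notebook produced 69 and 31 rows) and change from run to run unless a seed is passed, for example df.randomSplit([0.7, 0.3], seed=42). The plain-Python sketch below imitates the idea with one independent draw per row. bernoulli_split is a hypothetical helper and a simplified imitation, not Spark's implementation.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(100))
train, test = bernoulli_split(rows, 0.7, seed=42)
print(len(train), len(test))  # sizes are close to 70/30 but not guaranteed

# The same seed reproduces the same split.
train_again, _ = bernoulli_split(rows, 0.7, seed=42)
print(train == train_again)  # True
```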

## Showing description of training set 

In [11]:
train_set.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|                69|
|   mean|0.5507246376811594|
| stddev| 0.501064510466231|
|    min|               0.0|
|    max|               1.0|
+-------+------------------+



## Showing description of testing set 

In [12]:
test_set.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                 31|
|   mean| 0.6129032258064516|
| stddev|0.49513764785419084|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+



# B7: Training & Testing Phase

## Creating Model object

In [13]:
from pyspark.ml.classification import LogisticRegression

In [14]:
df.columns

['label', 'features']

In [15]:
lg = LogisticRegression(featuresCol='features', labelCol='label')

## Fitting the model to data

In [16]:
lg_model = lg.fit(train_set)

## Printing coefficients and intercept for ML model (if needed)

In [17]:
lg_model.coefficients

SparseVector(692, {95: 0.0009, 96: 0.0002, 97: 0.0002, 98: 0.0, 99: 0.004, 100: 0.0066, 101: 0.003, 102: -0.0017, 119: 0.01, 120: 0.0033, 121: 0.0064, 122: 0.0412, 123: 0.002, 124: 0.0013, 125: 0.0009, 126: 0.0007, 127: 0.0013, 128: 0.0011, 129: 0.0, 130: 0.0001, 131: 0.002, 132: 0.0039, 133: 0.0018, 146: 0.0103, 147: 0.0035, 148: 0.0034, 149: 0.0033, 150: 0.0011, 151: -0.0002, 152: 0.001, 153: 0.0005, 154: -0.0007, 155: -0.0004, 156: -0.0008, 157: -0.0006, 158: -0.0005, 159: 0.0005, 160: 0.0012, 161: -0.0006, 162: 0.0006, 163: 0.0015, 164: 0.0089, 174: 0.0048, 175: 0.0029, 176: 0.0007, 177: -0.0002, 178: -0.0003, 179: 0.0001, 180: 0.0003, 181: 0.0, 182: -0.0002, 183: 0.0013, 184: -0.0002, 185: -0.0006, 186: -0.0001, 187: -0.0003, 188: -0.0005, 189: -0.0011, 190: 0.0005, 191: 0.0005, 192: 0.0009, 202: -0.008, 203: -0.0, 204: -0.0005, 205: -0.0005, 206: -0.0003, 207: -0.0005, 208: -0.0004, 209: -0.0009, 210: 0.0009, 211: 0.0018, 212: 0.0001, 213: -0.0015, 214: -0.0005, 215: -0.0007, 216

In [18]:
lg_model.intercept

0.3543545275512681

## B8: Evaluating Phase
### Evaluating the model based on the training set
The cells below show the predictions stored in train_result after evaluation.

In [19]:
train_result = lg_model.evaluate(train_set)

In [20]:
train_result.predictions.show(3)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[95,96,97,12...|[22.7536706326033...|[0.99999999986871...|       0.0|
|  0.0|(692,[98,99,100,1...|[25.0243378323464...|[0.99999999998644...|       0.0|
|  0.0|(692,[122,123,124...|[19.9508338032883...|[0.99999999783497...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows
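
To connect the rawPrediction and probability columns: for a binary model, the probability of a label is the logistic (sigmoid) function of its margin, where the margin is the dot product of coefficients and features plus the intercept. Assuming rawPrediction holds [-margin, margin], the first probability entry should be the sigmoid of the first raw entry. The sketch below checks this against the first row shown above. sigmoid is a helper defined here, and the raw value is copied from the truncated display, so the match holds only up to the digits shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First rawPrediction entry of the first row, as displayed (truncated).
raw_label0 = 22.7536706326033

prob_label0 = sigmoid(raw_label0)    # compare with 0.99999999986871... above
prob_label1 = sigmoid(-raw_label0)   # probability of the other label

print(prob_label0)
print(prob_label0 + prob_label1)     # the two probabilities sum to about 1
```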



### Evaluating the model based on the testing set
The cells below show the predictions stored in test_result after evaluation.

In [21]:
test_result = lg_model.evaluate(test_set)

In [22]:
test_result.predictions.show(3)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[100,101,102...|[9.91945019493665...|[0.99995079421705...|       0.0|
|  0.0|(692,[121,122,123...|[12.6076298536514...|[0.99999665362604...|       0.0|
|  0.0|(692,[122,123,148...|[15.4974458737593...|[0.99999981398640...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



### Using Evaluators
This section uses BinaryClassificationEvaluator and MulticlassClassificationEvaluator from pyspark.ml.evaluation.

In [23]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [24]:
pred_label_train_result = train_result.predictions.select("prediction", "label")

In [25]:
pred_label_train_result.show(3)

+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
+----------+-----+
only showing top 3 rows



In [26]:
pred_label_test_result = test_result.predictions.select("prediction", "label")

In [27]:
pred_label_test_result.show(3)

+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
+----------+-----+
only showing top 3 rows



#### Using BinaryClassificationEvaluator

In [28]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", 
                                          labelCol="label", 
                                          metricName='areaUnderROC')
evaluator.evaluate(pred_label_train_result)

1.0

In [29]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", 
                                          labelCol="label", 
                                          metricName='areaUnderPR')
evaluator.evaluate(pred_label_train_result)

1.0

In [30]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", 
                                          labelCol="label", 
                                          metricName='areaUnderROC')
evaluator.evaluate(pred_label_test_result)

1.0

In [31]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", 
                                          labelCol="label", 
                                          metricName='areaUnderPR')
evaluator.evaluate(pred_label_test_result)

1.0
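
One caveat about the cells above: they pass the thresholded prediction column as rawPredictionCol. That makes no difference here because the classes are separated perfectly, but in general an area under the ROC curve computed from 0/1 predictions differs from one computed from continuous scores. The plain-Python sketch below illustrates the gap with made-up labels and scores. pairwise_auc is a hypothetical helper that uses the pairwise-ranking definition of AUC (ties count as one half); it is not Spark's implementation.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up data for illustration only.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded predictions

auc_scores = pairwise_auc(labels, scores)  # 7 of 9 pairs ranked correctly
auc_hard = pairwise_auc(labels, hard)      # ties pull this down to 6 of 9
print(auc_scores, auc_hard)
```

In this notebook, selecting the rawPrediction column from train_result.predictions and passing it as rawPredictionCol would give the score-based figure.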

#### Using MulticlassClassificationEvaluator

In [32]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", 
                                              labelCol="label", 
                                              metricName="accuracy") 
evaluator.evaluate(pred_label_train_result)

1.0

In [33]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", 
                                              labelCol="label", 
                                              metricName="f1") 
evaluator.evaluate(pred_label_train_result)

1.0

In [34]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", 
                                              labelCol="label", 
                                              metricName="accuracy") 
evaluator.evaluate(pred_label_test_result)

1.0

In [35]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", 
                                              labelCol="label", 
                                              metricName="f1") 
evaluator.evaluate(pred_label_test_result)

1.0
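
For reference, the accuracy metric is the fraction of rows whose prediction equals the label, and the f1 metric is, to the best of our knowledge, the per-label F1 averaged with weights proportional to each label's frequency. A plain-Python sketch on made-up (prediction, label) pairs follows. accuracy and weighted_f1 are hypothetical helpers, not Spark APIs.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    return sum(1 for p, y in pairs if p == y) / len(pairs)


def weighted_f1(pairs):
    """Per-label F1, averaged with weights equal to each label's share of rows."""
    total = 0.0
    for c in sorted({y for _, y in pairs}):
        tp = sum(1 for p, y in pairs if p == c and y == c)
        fp = sum(1 for p, y in pairs if p == c and y != c)
        fn = sum(1 for p, y in pairs if p != c and y == c)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of rows whose true label is c
        total += f1 * support / len(pairs)
    return total


# Made-up (prediction, label) pairs: 3 rows of label 0, 5 rows of label 1.
pairs = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1), (1, 1), (1, 1)]
acc = accuracy(pairs)
f1 = weighted_f1(pairs)
print(acc)  # 0.75
print(f1)   # approximately 0.75
```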

### Showing the summary of the fitted model
#### Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example

In [36]:
trainingSummary = lg_model.summary

#### Obtain the objective per iteration

In [37]:
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

objectiveHistory:
0.6879923392546194
0.21867151065149842
0.045984010122103874
0.013452645710489423
0.002399745063759046
0.0017682298096177486
0.001041529110174008
0.000724063385127086
0.0004405180405019296
0.00025946356080858575
0.0001688205747573391
0.0001269919885232395
4.7837771871312535e-05
2.7055360853185438e-05
1.3445005807708624e-05
6.897753058917522e-06
3.4724977617023924e-06
1.754495690019691e-06
8.821350787277097e-07
4.4305007744416274e-07
2.2214532877357304e-07
1.1130810684799967e-07
5.571459389653745e-08
2.7885323077479978e-08
1.395406046356185e-08
6.989248035938718e-09
3.4701044498172808e-09
1.3696094334566806e-09
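
The objective history is the value of the training objective (the loss, plus a regularization term if one is set) at each iteration, so it should shrink as the optimizer converges. Two small checks in plain Python may help in reading it. First, the initial value above is very close to the log-loss of a constant prediction equal to the training label mean, 38/69, which is consistent with the optimizer starting from the label prior (an interpretation, not something the output states). Second, a toy gradient-descent loop on made-up data shows the same decreasing pattern. Spark uses a different optimizer, and the data, learning rate, and helper names below are assumptions for illustration.

```python
import math

# Check 1: log-loss of always predicting the training label mean.
p = 38 / 69  # equals the training mean 0.5507246376811594 reported by describe()
baseline = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(baseline)  # compare with the first objectiveHistory value


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Mean negative log-likelihood of a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        q = sigmoid(w * x + b)
        total -= y * math.log(q) + (1 - y) * math.log(1 - q)
    return total / len(xs)


# Check 2: plain gradient descent on a made-up, separable 1-D dataset.
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w, b, lr = 0.0, 0.0, 0.5
history = [log_loss(w, b, xs, ys)]
for _ in range(20):
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w, b = w - lr * grad_w, b - lr * grad_b
    history.append(log_loss(w, b, xs, ys))

for value in history[:5]:
    print(value)  # each value is smaller than the one before
```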


#### Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.

In [38]:
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

+---+-------------------+
|FPR|                TPR|
+---+-------------------+
|0.0|                0.0|
|0.0|0.02631578947368421|
|0.0|0.05263157894736842|
|0.0|0.07894736842105263|
|0.0|0.10526315789473684|
|0.0|0.13157894736842105|
|0.0|0.15789473684210525|
|0.0|0.18421052631578946|
|0.0|0.21052631578947367|
|0.0|0.23684210526315788|
|0.0| 0.2631578947368421|
|0.0| 0.2894736842105263|
|0.0| 0.3157894736842105|
|0.0|0.34210526315789475|
|0.0| 0.3684210526315789|
|0.0|0.39473684210526316|
|0.0|0.42105263157894735|
|0.0| 0.4473684210526316|
|0.0|0.47368421052631576|
|0.0|                0.5|
+---+-------------------+
only showing top 20 rows

areaUnderROC: 1.0
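
The ROC table lists (FPR, TPR) points obtained by sweeping the decision threshold from high to low. Here TPR rises in steps of about 0.0263, roughly 1/38, while FPR stays at 0.0, which is the pattern a perfectly separating model produces (38 matches the number of positive training rows implied by 69 rows with mean label 0.5507). The plain-Python sketch below builds such points for a made-up, perfectly separated dataset and integrates them with the trapezoid rule. roc_points and trapezoid_area are hypothetical helpers, and the sketch assumes all scores are distinct.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold one row at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up data: every positive scores higher than every negative.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
points = roc_points(labels, scores)
area = trapezoid_area(points)
print(points)
print(area)  # 1.0
```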


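### Showing training-set metrics from the model summary: Precision, Recall, F1

The summary exposes metrics on a per-label basis, as it would for a multiclass model, along with weighted averages across labels. Note that the cell below reads only trainingSummary, so every figure refers to the training set.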

In [39]:
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

False positive rate by label:
label 0: 0.0
label 1: 0.0
True positive rate by label:
label 0: 1.0
label 1: 1.0
Precision by label:
label 0: 1.0
label 1: 1.0
Recall by label:
label 0: 1.0
label 1: 1.0
F-measure by label:
label 0: 1.0
label 1: 1.0
Accuracy: 1.0
FPR: 0.0
TPR: 1.0
F-measure: 1.0
Precision: 1.0
Recall: 1.0
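
The per-label figures treat each label in turn as the positive class (one-vs-rest), which is why a binary model reports two values per metric. A plain-Python sketch of that bookkeeping on made-up (prediction, label) pairs follows. rates_by_label is a hypothetical helper, not a Spark API. Note that recall and true positive rate are the same quantity.

```python
def rates_by_label(pairs):
    """One-vs-rest TPR (recall), FPR, and precision for each label."""
    result = {}
    for c in sorted({y for _, y in pairs}):
        tp = sum(1 for p, y in pairs if p == c and y == c)
        fp = sum(1 for p, y in pairs if p == c and y != c)
        fn = sum(1 for p, y in pairs if p != c and y == c)
        tn = sum(1 for p, y in pairs if p != c and y != c)
        result[c] = {
            "tpr": tp / (tp + fn),
            "fpr": fp / (fp + tn),
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        }
    return result


# Made-up (prediction, label) pairs: 4 rows of label 0, 6 rows of label 1.
pairs = [(0, 0), (0, 0), (0, 0), (1, 0), (1, 1),
         (1, 1), (1, 1), (1, 1), (0, 1), (1, 1)]
rates = rates_by_label(pairs)
for label, metrics in rates.items():
    print("label %d: %s" % (label, metrics))
```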
