# Logistic Regression

Let's see an example of how to run a logistic regression with Python and Spark. This is the documentation example; we will run through it quickly and then show a more realistic example. Afterwards, you will have another consulting project!

In [3]:
import findspark
findspark.init('/opt/spark')

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregdoc').getOrCreate()

23/05/31 10:29:09 WARN Utils: Your hostname, zack-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.40.128 instead (on interface ens33)
23/05/31 10:29:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/31 10:29:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
from pyspark.ml.classification import LogisticRegression

In [6]:
# Load training data
training = spark.read.format("libsvm").load("sample_libsvm_data.txt")

lr = LogisticRegression()

# Fit the model
lrModel = lr.fit(training)

trainingSummary = lrModel.summary

23/05/31 10:29:13 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
23/05/31 10:29:19 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


In [15]:
training.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows



In [87]:
trainingSummary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[19.8534775947479...|[0.99999999761359...|       0.0|
|  1.0|(692,[158,159,160...|[-20.377398194909...|[1.41321555110962...|       1.0|
|  1.0|(692,[124,125,126...|[-27.401459284891...|[1.25804865127002...|       1.0|
|  1.0|(692,[152,153,154...|[-18.862741612668...|[6.42710509170470...|       1.0|
|  1.0|(692,[151,152,153...|[-20.483011833009...|[1.27157209200655...|       1.0|
|  0.0|(692,[129,130,131...|[19.8506078990277...|[0.99999999760673...|       0.0|
|  1.0|(692,[158,159,160...|[-20.337256674834...|[1.47109814695468...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.595579753418...|[3.08850168102550...|       1.0|
|  0.0|(692,[154,155,156...|[19.2708803215615...|[0.99999999572670...|       0.0|
|  0.0|(692,[127
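The `rawPrediction` and `probability` columns above are related through the logistic (sigmoid) function. The sketch below reproduces the first row's probability in plain Python. It assumes the binary case, where the raw vector holds a margin and its negation; the table truncates the second element, so that part is an assumption, and `sigmoid` and `binary_probabilities` are illustrative helper names, not Spark functions.

```python
import math


def sigmoid(margin: float) -> float:
    """Logistic function: maps a real-valued margin to a value in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def binary_probabilities(raw_prediction):
    """Apply the sigmoid to each element of a two-element raw score vector."""
    return [sigmoid(score) for score in raw_prediction]


# First element of rawPrediction in the first row shown above (score for class 0).
raw_class0 = 19.8534775947479

# Assumption: the second element is the negated margin (binary case).
probs = binary_probabilities([raw_class0, -raw_class0])
print(probs)
```

A large positive margin for class 0 pushes its probability very close to 1, which is why the rows labelled `0.0` show probabilities such as `0.99999999761359...`.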

In [10]:
# Split the data into roughly 70% training and 30% test rows
lr_train, lr_test = training.randomSplit([0.7, 0.3])

In [11]:
final_model = LogisticRegression()

In [12]:
# Fit on the training split only
fit_final = final_model.fit(lr_train)
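The `randomSplit([0.7, 0.3])` call above assigns rows to splits at random, so the resulting sizes are only approximately 70/30 and change from run to run unless a seed is given (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea on 100 hypothetical rows. It is a conceptual illustration under that assumption, not Spark's implementation, and `random_split` is a made-up helper name.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # one uniform draw per row
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(100))  # hypothetical row ids
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.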

In [13]:
# evaluate() returns a summary object; its .predictions attribute is a DataFrame
prediction_and_label = fit_final.evaluate(lr_test)

In [14]:
prediction_and_label.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[95,96,97,12...|[20.8237329646077...|[0.99999999909558...|       0.0|
|  0.0|(692,[123,124,125...|[39.9773814098283...|           [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[40.3977727006578...|           [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[21.5779920359707...|[0.99999999957460...|       0.0|
|  0.0|(692,[126,127,128...|[33.5957930306890...|[0.99999999999999...|       0.0|
|  0.0|(692,[126,127,128...|[17.8123382926323...|[0.99999998162616...|       0.0|
|  0.0|(692,[126,127,128...|[23.6296106829057...|[0.99999999994532...|       0.0|
|  0.0|(692,[126,127,128...|[32.8559479282913...|[0.99999999999999...|       0.0|
|  0.0|(692,[126,127,128...|[37.4520190589043...|           [1.0,0.0]|       0.0|
|  0.0|(692,[127

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [17]:
my_eval = BinaryClassificationEvaluator()

In [18]:
my_final_roc = my_eval.evaluate(prediction_and_label.predictions)

In [19]:
my_final_roc

0.9722222222222223
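The number above is the area under the ROC curve, which is the default metric of `BinaryClassificationEvaluator`. One way to read it: the chance that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The sketch below computes that quantity directly on a small made-up example. It illustrates the definition only; Spark integrates the ROC curve instead, so its result can differ slightly, and `area_under_roc` is a hypothetical helper name.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A value of 1.0 means every positive outranks every negative, and 0.5 is what random scores would give.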

## Evaluators

Evaluators will be a very important part of our pipeline when working with machine learning. Let's see some basics for logistic regression.



In [79]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [89]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
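The cell above passes the hard 0/1 `prediction` column as `rawPredictionCol`. With only two distinct score values, the ROC curve has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below works through that arithmetic on made-up labels. It is an illustration of the formula rather than Spark's code, and `auc_from_hard_predictions` is a hypothetical helper name.

```python
def auc_from_hard_predictions(labels, predictions):
    """AUC when the 'score' is a 0/1 prediction: (TPR + TNR) / 2."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")

    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    return (tpr + tnr) / 2


# Made-up example: 4 positives (3 caught), 2 negatives (1 caught).
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]
hard_auc = auc_from_hard_predictions(labels, predictions)
print(hard_auc)  # (0.75 + 0.5) / 2 = 0.625
```

Passing the `rawPrediction` column instead lets the evaluator rank rows by score, which usually gives a more informative AUC than one computed from hard labels.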

In [83]:
# For multiclass
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label',
                                             metricName='accuracy')

In [90]:
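# predictionAndLabels is not defined in the cells above; it is assumed to be a
# DataFrame with 'label' and 'prediction' columns, e.g. prediction_and_label.predictions
acc = evaluator.evaluate(predictionAndLabels)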

In [91]:
acc

1.0
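For reference, the `accuracy` metric used by the multiclass evaluator is the fraction of rows whose prediction equals the label. A minimal plain-Python sketch on made-up values, with `accuracy` as an illustrative helper name:

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up example: three of four predictions match their labels.
example_acc = accuracy([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0])
print(example_acc)  # 0.75
```

A score of 1.0, as in the output above, means every prediction matched its label on the evaluated rows.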

Okay, let's move on and see some more examples!