# Logistic Regression with PySpark

This notebook demonstrates how to train and evaluate a logistic regression model with PySpark.

* Method: [Logistic Regression](https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#logistic-regression)
* Dataset: Spark MLlib Sample LibSVM Data

## Imports

In [None]:
import findspark
findspark.init()

import numpy as np

from pyspark import SparkContext
from pyspark.sql import SQLContext

from pyspark.ml.classification import LogisticRegression

import matplotlib.pyplot as plt
%matplotlib inline

## Get Some Context

In [None]:
# Create a SparkContext and a SQLContext to use
sc = SparkContext(appName="Logistic Regression with Spark")
sqlContext = SQLContext(sc)

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/mllib/sample_libsvm_data.txt"

In [None]:
data = sqlContext.read.format("libsvm").load(DATA_FILE)

In [None]:
# View one of the records
data.take(1)

In [None]:
# Create train and test datasets (80/20 split, fixed seed for reproducibility)
train, test = data.randomSplit([0.8, 0.2], seed=42)

## Fit a Logistic Regression Model

Arguments:
* maxIter: max number of iterations
* regParam: regularization parameter
* elasticNetParam: ElasticNet mixing param
  * 1 = L1 Regularization (LASSO)
  * 0 = L2 Regularization (Ridge)
  * Between 0 and 1 = ElasticNet (L1 + L2)
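
To make the mixing parameter concrete, here is a minimal plain-Python sketch of the penalty term that `regParam` and `elasticNetParam` control, following the elastic net formula given in the Spark documentation. The function name and the example weights are made up for illustration; this is not Spark's internal code.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    alpha = elastic_net_param
    l1 = sum(abs(w) for w in weights)          # L1 norm of the coefficients
    l2_squared = sum(w * w for w in weights)   # squared L2 norm of the coefficients
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


weights = [0.5, -1.0, 0.0]  # hypothetical coefficients, for illustration only

lasso = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0)  # pure L1
ridge = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0)  # pure L2
mixed = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)  # mostly L1

print(lasso, ridge, mixed)
```

With `elasticNetParam=0.8`, as used in this notebook, most of the penalty comes from the L1 term, which tends to push some coefficients to exactly zero.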

In [None]:
lr = LogisticRegression(maxIter=10,
                        regParam=0.3,
                        elasticNetParam=0.8)

In [None]:
lr_model = lr.fit(train)

In [None]:
# Show the intercept
print("Intercept: " + str(lr_model.intercept))
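
For intuition about what the fitted intercept and coefficients do, the sketch below shows how a binary logistic regression turns a feature vector into a probability and then into a label. It is plain Python with made-up numbers, not the Spark implementation; the function names are hypothetical.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Probability of class 1: the logistic (sigmoid) function of the linear margin."""
    margin = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-margin))


def predict_label(probability, threshold=0.5):
    """Predict 1.0 when the probability exceeds the threshold, otherwise 0.0."""
    return 1.0 if probability > threshold else 0.0


# Hypothetical values, for illustration only
p = predict_probability(features=[1.0, 2.0], coefficients=[0.4, -0.1], intercept=0.3)

print(p, predict_label(p), predict_label(p, threshold=0.7))
```

Raising the threshold makes the model more reluctant to predict class 1, which is the knob tuned later in this notebook.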

## Create Predictions

In [None]:
# Create the predictions
predictions = lr_model.transform(test)
predictions.show(5)

In [None]:
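# Plot the actuals versus predictions
# Collect both columns together so each label stays paired with its prediction,
# and use new names so the predictions DataFrame is not overwritten
rows = predictions.select('label', 'prediction').collect()
actuals = [row['label'] for row in rows]
predicted = [row['prediction'] for row in rows]

fig = plt.figure(figsize=(10,5))
plt.scatter(actuals, predicted)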
plt.xlabel("Actuals")
plt.ylabel("Predictions")
plt.title("Actuals vs. Predictions")
plt.show()

## Model Evaluation

In [None]:
# Get the training summary (its metrics are computed on the training data)
metrics = lr_model.summary

### Area Under ROC

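A measure of how well the model distinguishes between the two classes in a binary classification. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation. A common rule of thumb for grading the score: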

* .90-1 = excellent (A)
* .80-.90 = good (B)
* .70-.80 = fair (C)
* .60-.70 = poor (D)
* .50-.60 = fail (F)
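
One way to read the area under the ROC curve is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on made-up data. This is an equivalent interpretation for illustration, not the way Spark computes `areaUnderROC` internally.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities, for illustration only
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

auc = area_under_roc(labels, scores)
print("Area Under ROC = %.2f" % auc)
```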

In [None]:
# Area under the ROC
print("Area Under ROC = %.2f" % metrics.areaUnderROC)

### F-Measure (F1)

A measure of a test's accuracy that combines the precision p and the recall r into a single score, their harmonic mean: F1 = 2pr / (p + r).
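
As a small worked example, the sketch below computes the F-measure from raw counts of true positives, false positives, and false negatives. The function name and the counts are made up for illustration; with `beta=1.0` it reduces to F1.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from confusion counts; beta=1.0 gives F1."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667
f1 = f_measure(true_positives=8, false_positives=2, false_negatives=4)
print("F1 = %.3f" % f1)
```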

In [None]:
# Show all F-Measure scores
metrics.fMeasureByThreshold.show()

In [None]:
# Determine the best threshold to maximize the F-Measure
f_measure = metrics.fMeasureByThreshold
max_f_measure = f_measure.groupBy().max('F-Measure').select('max(F-Measure)').head()
best_threshold = f_measure.where(f_measure['F-Measure'] == max_f_measure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
print("Best Threshold: %0.3f" % best_threshold)
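
To show what the threshold search above is doing, here is a plain-Python sketch that sweeps candidate thresholds over a handful of made-up scores and keeps the one with the highest F1. It assumes an example counts as predicted positive when its score is at or above the threshold; the function name and data are hypothetical, and this is not Spark code.

```python
def best_f1_threshold(labels, scores):
    """Try each distinct score as a threshold and return (threshold, F1) with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        # F1 written in terms of counts: 2*TP / (2*TP + FP + FN)
        f1 = 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Hypothetical labels and predicted probabilities, for illustration only
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

threshold, f1 = best_f1_threshold(labels, scores)
print("Best Threshold: %0.3f (F1 = %0.3f)" % (threshold, f1))
```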

## Use the New Threshold

In [None]:
# Create an instance of the model using our new threshold
lr2 = LogisticRegression(maxIter=10,
                         regParam=0.3,
                         elasticNetParam=0.8,
                         threshold=best_threshold)
# Train the model (fit the new estimator lr2, not the original lr)
lrm2 = lr2.fit(train)

# Create the predictions
p2 = lrm2.transform(test)

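# Plot the actuals vs. predicted
# Collect both columns together so each label stays paired with its prediction
rows2 = p2.select('label', 'prediction').collect()
a2 = [row['label'] for row in rows2]
pred2 = [row['prediction'] for row in rows2]

fig = plt.figure(figsize=(10,5))
plt.scatter(a2, pred2)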
plt.xlabel("Actuals")
plt.ylabel("Predictions")
plt.title("Actuals vs. Predictions")
plt.show()

In [None]:
# New metrics
m2 = lrm2.summary

In [None]:
# Area under the ROC
# Note: this metric does not depend on the threshold, so expect the same value as before
print("Area Under ROC = %.2f" % m2.areaUnderROC)

## Shut it Down

In [None]:
sc.stop()