# Logistic Regression (Documentation Example)

The documentation example is available here: https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

Unlike linear regression, which predicts a continuous value, logistic regression is a classification algorithm: it predicts which class an observation belongs to.

Objective: Run a logistic regression with Python and Spark. The documentation dataset is unrealistic, but it gives a basic overview of how to use logistic regression. For a more realistic exercise, move on to the advanced logistic regression exercise.

In [1]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logistic_regression_docs').getOrCreate()

# If you're getting an error with numpy, please type 'sudo pip install numpy --user' into the EC2 console.
from pyspark.ml.classification import LogisticRegression

In [2]:
# Load training data. The Spark documentation uses the LIBSVM format throughout,
# but it is probably not the format of your own dataset.
training = spark.read.format("libsvm").load("Datasets/sample_libsvm_data.txt")

# Create the logistic regression estimator. This is where you would set featuresCol,
# labelCol and predictionCol (defaults: 'features', 'label', 'prediction').
lr = LogisticRegression()

# Fit the model on the full dataset. The documentation example does not use a train/test split.
lrModel = lr.fit(training)

trainingSummary = lrModel.summary

# rawPrediction and probability are specific to classifiers such as logistic regression:
# they hold the per-class raw scores and the per-class probabilities.
# As with other models, we mainly want to compare the label (actual) to the prediction.
trainingSummary.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)


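For a binary model, the `rawPrediction` and `probability` columns in the schema above are linked by the logistic (sigmoid) function. The following is a minimal pure-Python sketch of that relationship, not Spark's implementation. The helper names `sigmoid`, `raw_to_probability` and `predict` are made up for illustration, and the sketch assumes the raw vector has the form `[-margin, margin]` and that class 1 is predicted when its probability exceeds a 0.5 threshold.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real score to a value in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def raw_to_probability(raw):
    """Turn a two-class raw score vector into class probabilities.

    Assumes raw == [-margin, margin], so only the first entry is needed.
    """
    p0 = sigmoid(raw[0])
    return [p0, 1.0 - p0]

def predict(probability, threshold=0.5):
    """Predict class 1.0 when its probability exceeds the threshold."""
    return 1.0 if probability[1] > threshold else 0.0

raw = [2.0, -2.0]                 # illustrative values, not from the dataset
prob = raw_to_probability(raw)    # class-0 probability is sigmoid(2.0)
label = predict(prob)             # class 0 wins here
```

A large positive first score, like the values near 19.85 in the documentation output, gives a class-0 probability very close to 1.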

In [3]:
# Show the predictions so the label and prediction columns can be compared row by row.
trainingSummary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[19.8534775947478...|[0.99999999761359...|       0.0|
|  1.0|(692,[158,159,160...|[-20.377398194908...|[1.41321555111056...|       1.0|
|  1.0|(692,[124,125,126...|[-27.401459284891...|[1.25804865126979...|       1.0|
|  1.0|(692,[152,153,154...|[-18.862741612668...|[6.42710509170303...|       1.0|
|  1.0|(692,[151,152,153...|[-20.483011833009...|[1.27157209200604...|       1.0|
|  0.0|(692,[129,130,131...|[19.8506078990277...|[0.99999999760673...|       0.0|
|  1.0|(692,[158,159,160...|[-20.337256674833...|[1.47109814695581...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.595579753418...|[3.08850168102631...|       1.0|
|  0.0|(692,[154,155,156...|[19.2708803215613...|[0.99999999572670...|       0.0|
|  0.0|(692,[127

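The `features` column above prints each row as a sparse vector of the form `(size, [indices...], [values...])`. As a rough illustration of how a LIBSVM text line (`label index:value index:value ...`) maps onto that shape, here is a pure-Python sketch. `parse_libsvm_line` is a hypothetical helper, not part of Spark, and the example line is invented rather than taken from `sample_libsvm_data.txt`.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val idx:val ...' into (label, (size, indices, values)).

    LIBSVM indices are one-based; they are shifted to zero-based here.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)
        values.append(float(val))
    return label, (num_features, indices, values)

# Hypothetical input line with two non-zero features out of ten.
label, features = parse_libsvm_line("1 3:0.5 7:2.0", num_features=10)
```

Only the non-zero entries are stored, which is why a 692-feature row can be printed compactly.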
## Bonus: Train/Test Split

In [4]:
# Let's split the data so we can evaluate the model.
lr_train, lr_test = training.randomSplit([0.7, 0.3])

final_model = LogisticRegression()

# Now we're fitting the model on a subset of data.
fit_final = final_model.fit(lr_train)

# And evaluating it against the test data.
predictions_and_labels = fit_final.evaluate(lr_test)

predictions_and_labels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[100,101,102...|[8.26774799904665...|[0.99974340321337...|       0.0|
|  0.0|(692,[122,123,124...|[19.5940219685167...|[0.99999999690668...|       0.0|
|  0.0|(692,[124,125,126...|[28.9430106369702...|[0.99999999999973...|       0.0|
|  0.0|(692,[124,125,126...|[21.1603318310783...|[0.99999999935407...|       0.0|
|  0.0|(692,[125,126,127...|[18.7156105338896...|[0.99999999255416...|       0.0|
|  0.0|(692,[126,127,128...|[17.7370325045155...|[0.99999998018907...|       0.0|
|  0.0|(692,[126,127,128...|[25.3330396245836...|[0.99999999999004...|       0.0|
|  0.0|(692,[129,130,131...|[15.9433351174379...|[0.99999988090391...|       0.0|
|  0.0|(692,[153,154,155...|[26.9841698807831...|[0.99999999999809...|       0.0|
|  0.0|(692,[153

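The split above is random, so the 70/30 proportions are only approximate and the rows in each set change between runs unless a seed is fixed (PySpark's `randomSplit` accepts an optional `seed` argument). The idea can be sketched in pure Python as follows. `random_split` is a hypothetical helper written for illustration and is not how Spark implements the split internally.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in (0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any floating-point rounding at the top end.
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets

rows = list(range(100))
train, test = random_split(rows, [0.7, 0.3], seed=42)
```

Each row lands in exactly one bucket, and reusing the same seed reproduces the same split.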
## Logistic Regression Evaluation Metrics

Evaluators are an important part of a machine learning pipeline. Let's look at the basics for logistic regression. The links below point to the API documentation for each evaluator.

For a binary evaluator, you can get the area under the ROC curve or the area under the precision/recall curve.

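To make the area under the ROC curve more concrete: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way in pure Python. `roc_auc` is a hypothetical helper for illustration, the scores are invented, and Spark's evaluator does not compute the metric this way internally. A value of 1.0 means every positive outranks every negative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

labels = [0.0, 0.0, 1.0, 1.0]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
mixed = roc_auc(labels, [0.1, 0.6, 0.4, 0.9])     # one of the four pairs is misordered
```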
For a multiclass evaluator, you can get accuracy, precision, recall, and similar metrics.

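As a small worked example of these metrics, the sketch below computes accuracy, precision and recall from a list of labels and predictions in pure Python. The helper names and the data are invented for illustration, and precision and recall are computed for a single positive class rather than averaged across classes.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def precision_recall(labels, predictions, positive=1.0):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels      = [1.0, 0.0, 1.0, 1.0, 0.0]
predictions = [1.0, 1.0, 1.0, 0.0, 0.0]

acc = accuracy(labels, predictions)                 # 3 of 5 rows match
prec, rec = precision_recall(labels, predictions)   # tp=2, fp=1, fn=1
```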
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator

In [5]:
# Let's import two evaluators.
# Remember, binary is for predictions like true and false (0 and 1),
# while multiclass is for problems with more than two classes.
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [6]:
# With the evaluator's default metric (areaUnderROC), the area under the curve is 1.0. A perfect fit? Is that realistic?
evaluator = BinaryClassificationEvaluator()
my_final_roc = evaluator.evaluate(predictions_and_labels.predictions)
my_final_roc

1.0

In [7]:
# The multiclass evaluator, configured here to report accuracy.
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label',
                                              metricName='accuracy')

accuracy = evaluator.evaluate(predictions_and_labels.predictions)

accuracy

1.0

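Note that these perfect scores come from the dataset: the documentation sample is small and easy to separate, so they say little about how the model would perform on real data. Move on to the next logistic regression lab to start using a more realistic dataset.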
