## Tuning Model Parameters

In this exercise, you will tune the hyperparameters of a logistic regression classification model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
# Import Spark SQL and Spark ML libraries
# (import only what is used; a wildcard import of pyspark.sql.functions
# shadows Python built-ins such as sum, min and max)
from pyspark.sql.functions import col

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.sql import SparkSession

# Create a local Spark session, or reuse the existing one if this cell is re-run
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

# Load the source data
csv = spark.read.csv('../../data/bank.csv', inferSchema=True, header=True, sep=';')

# Select features and label
data = csv.select(*csv.columns[:-1], (col("y").alias("label")))
print(data)

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")

DataFrame[age: double, job: double, marital: double, education: double, default: double, balance: double, housing: double, loan: double, contact: double, day: double, month: double, duration: double, campaign: double, pdays: double, previous: double, poutcome: double, label: double]
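
Note that `randomSplit` assigns rows at random, so the split sizes are only approximately 70/30 and, without a seed, differ from run to run. The following pure-Python sketch illustrates the idea on a range of row indices. `random_split` is a hypothetical helper written for this illustration and is not Spark's implementation; in PySpark, `randomSplit` accepts an optional `seed` argument that serves the same purpose.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=None):
    """Assign each row to one split using a uniform random draw (illustrative only)."""
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    # Normalised upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = [c / total for c in cumulative]
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # Place the row in the first split whose upper bound exceeds the draw
        for i, upper in enumerate(bounds):
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(1000), [0.7, 0.3], seed=42)
# Sizes are roughly 700 and 300; the exact counts depend on the seed
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which helps when comparing metrics across runs.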


### Define the Pipeline
Now define a pipeline that creates a feature vector and trains a classification model:

In [2]:
# Define the pipeline
assembler = VectorAssembler(inputCols = data.columns[:-1], outputCol="features")
print(assembler.getInputCols())
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']


### Tune Parameters
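You can tune parameters to find the best model for your data. A simple way to do this is to use **CrossValidator** to evaluate each combination of parameters defined with a **ParamGridBuilder**, using k-fold cross-validation on the training data, and keep the best-performing combination. (**TrainValidationSplit** is a cheaper alternative that evaluates each combination against a single held-out subset of the training data.)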

In [3]:
# Build a grid of 3 x 3 = 9 parameter combinations
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 1, 10])
             .addGrid(lr.maxIter, [100, 10, 5])
             .build())

# Evaluate every combination with 10-fold cross-validation
cv = CrossValidator(estimator=pipeline, evaluator=BinaryClassificationEvaluator(),
                    estimatorParamMaps=paramGrid, numFolds=10)

# Fit the cross-validator and time the search
import time
tic = time.time()
model = cv.fit(train)
toc = time.time()
print("Elapsed time ", toc-tic)

Elapsed time  140.95596885681152
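
The elapsed time reflects how many models the search trains. As a rough sketch in plain Python (no Spark required), the grid above can be enumerated with `itertools.product`; the variable names are illustrative. It assumes that the cross-validator fits one model per parameter combination per fold and then refits the best combination once on the full training set.

```python
from itertools import product

# Candidate values taken from the parameter grid above
reg_params = [0.1, 1, 10]
max_iters = [100, 10, 5]
num_folds = 10

# Every combination of the two parameters
combinations = [{"regParam": r, "maxIter": m}
                for r, m in product(reg_params, max_iters)]

# One fit per combination per fold, plus one final refit of the best combination
fits_during_search = len(combinations) * num_folds
total_fits = fits_during_search + 1

print(len(combinations), fits_during_search, total_fits)  # 9 90 91
```

Because the cost grows with the product of grid size and fold count, trimming either one shortens the search roughly in proportion.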


### Test the Model
Now you're ready to apply the model to the test data.

In [4]:
prediction = model.transform(test)
predicted = prediction.select("features", "prediction", "probability", "trueLabel")
predicted.show()

+--------------------+----------+--------------------+---------+
|            features|prediction|         probability|trueLabel|
+--------------------+----------+--------------------+---------+
|[0.0,1.0,0.5,0.33...|       0.0|[0.69943979576858...|      0.0|
|[0.0,1.0,0.5,0.66...|       0.0|[0.89976883533097...|      0.0|
|(16,[1,2,3,5,6,8,...|       0.0|[0.85763097057592...|      1.0|
|[0.01298701298701...|       0.0|[0.75599862273310...|      0.0|
|[0.01298701298701...|       0.0|[0.84074944427605...|      0.0|
|[0.01298701298701...|       0.0|[0.88226765277839...|      0.0|
|[0.01298701298701...|       0.0|[0.86616463410777...|      0.0|
|[0.01298701298701...|       0.0|[0.89141178039311...|      0.0|
|[0.01298701298701...|       0.0|[0.89200632582467...|      0.0|
|[0.01298701298701...|       0.0|[0.86522275872300...|      1.0|
|[0.01298701298701...|       0.0|[0.83647419508122...|      0.0|
|[0.01298701298701...|       0.0|[0.86459977098306...|      1.0|
|[0.02597402597402...|   

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

From these core measures, other evaluation metrics such as *precision* and *recall* can be calculated.

In [5]:
# Count each cell of the confusion matrix
tp = float(predicted.filter("prediction == 1.0 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND trueLabel == 1").count())

# Derive precision and recall from the counts
# (these divisions fail if the model predicts no positives, or the test set has none)
metrics = spark.createDataFrame([
    ("TP", tp),
    ("FP", fp),
    ("TN", tn),
    ("FN", fn),
    ("Precision", tp / (tp + fp)),
    ("Recall", tp / (tp + fn))], ["metric", "value"])
metrics.show()

+---------+-------------------+
|   metric|              value|
+---------+-------------------+
|       TP|              115.0|
|       FP|               65.0|
|       TN|            11942.0|
|       FN|             1434.0|
|Precision| 0.6388888888888888|
|   Recall|0.07424144609425436|
+---------+-------------------+
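
The precision and recall formulas divide by `tp + fp` and `tp + fn`, which are zero when the model predicts no positives or the test set contains none. A minimal sketch of a guarded calculation follows. `safe_ratio` and `f1_score` are hypothetical helper names, and the F1 score (the harmonic mean of precision and recall) is included as one further metric that can be derived from the same counts. The counts are those shown in the table above.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return safe_ratio(2 * precision * recall, precision + recall)

# Counts from the metrics table above
tp, fp, tn, fn = 115.0, 65.0, 11942.0, 1434.0

precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)
f1 = f1_score(precision, recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts the recall is far lower than the precision: the model rarely predicts the positive class, so it misses most of the true positives even though the positives it does predict are often correct.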

