# Day24 - MLlib with Binary Classification

## Load Data

In this example, we will read in the Adult dataset from databricks-datasets.
We'll read in the data in SQL using the CSV data source for Spark and rename the columns appropriately.

In [0]:
%fs ls databricks-datasets/adult/adult.data

path,name,size
dbfs:/databricks-datasets/adult/adult.data,adult.data,3974305


In [0]:
%fs head databricks-datasets/adult/adult.data

In [0]:
%sql DROP TABLE IF EXISTS adult

In [0]:
%sql
CREATE TABLE adult (
  age DOUBLE,
  workclass STRING,
  fnlwgt DOUBLE,
  education STRING,
  education_num DOUBLE,
  marital_status STRING,
  occupation STRING,
  relationship STRING,
  race STRING,
  sex STRING,
  capital_gain DOUBLE,
  capital_loss DOUBLE,
  hours_per_week DOUBLE,
  native_country STRING,
  income STRING)
USING CSV
OPTIONS (path "/databricks-datasets/adult/adult.data", header "true")


In [0]:
%sql SELECT * FROM adult LIMIT 10

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K
37.0,Private,280464.0,Some-college,10.0,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0.0,0.0,80.0,United-States,>50K


In [0]:
dataset = spark.table("adult")
cols = dataset.columns

In [0]:
cols

In [0]:
display(dataset)

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K
37.0,Private,280464.0,Some-college,10.0,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0.0,0.0,80.0,United-States,>50K


## Data Preparation

Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. We will use one-hot encoding (and not category indexing).

*One-Hot Encoding* - converts categories into binary vectors with at most one nonzero value (e.g. (Blue: [1, 0]), (Green: [0, 1]), (Red: [0, 0])).
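
As a rough illustration of the Blue/Green/Red example, the following pure-Python sketch mimics the two steps (index, then encode) without Spark. The helper names `string_index` and `one_hot` are hypothetical. The frequency-based ordering and the "last category becomes all zeros" behaviour are assumptions chosen to mirror what `StringIndexer` and `OneHotEncoder` do by default.

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Toy column (not from the Adult dataset)
colors = ["Blue", "Green", "Red", "Blue", "Green", "Blue"]
mapping = string_index(colors)
encoded = {c: one_hot(i, len(mapping)) for c, i in mapping.items()}
print(encoded)
```

Dropping the last category keeps the encoded columns from being perfectly collinear, which matters for linear models such as Logistic Regression.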

In this dataset, we have ordinal variables like education (Preschool - Doctorate), and also nominal variables like relationship (Wife, Husband, Own-child, etc).
For simplicity's sake, we will use One-Hot Encoding to convert all categorical variables into binary vectors.
It is possible here to improve prediction accuracy by converting each categorical column with an appropriate method.

Here, we will use a combination of `StringIndexer` and `OneHotEncoder` to convert the categorical variables.
The `OneHotEncoder` will return a `SparseVector`.

Since we will have more than 1 stage of feature transformations, we use a `Pipeline` to tie the stages together; this is similar to chaining.

The variable to predict is `income`, a binary variable with two values:
*  "<=50K"
*   ">50K"

All other variables will be used as features.

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

We will index each categorical column using the `StringIndexer`, and then convert the indexed categories into one-hot encoded variables.
The resulting output has the binary vectors appended to the end of each row.

We use the `StringIndexer` again to encode our labels to label indices.

In [0]:
categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
    
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

Use a `VectorAssembler` to combine all the feature columns into a single vector column. This goes for all types: numeric and one-hot encoded variables.

In [0]:
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
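
To make the assembling step concrete, here is a small Spark-free sketch that concatenates one-hot blocks and numeric values into one flat vector and then stores it sparsely as (size, indices, values). The helpers `assemble` and `to_sparse` and the toy row are hypothetical; this only illustrates the idea and is not how `VectorAssembler` is implemented.

```python
def assemble(parts):
    """Concatenate scalars and lists into one flat list of floats."""
    flat = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            flat.extend(float(x) for x in part)
        else:
            flat.append(float(part))
    return flat

def to_sparse(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

# Hypothetical toy row: two one-hot blocks followed by two numeric columns
sex_vec = [1.0]
race_vec = [0.0, 1.0, 0.0]
row = assemble([sex_vec, race_vec, 50.0, 0.0])
sparse_row = to_sparse(row)
print(sparse_row)
```

The `features` column displayed later in this notebook follows the same pattern: a vector size, then the nonzero positions, then the values at those positions.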

In [0]:
stages

Run the stages as a Pipeline. This puts the data through all of the feature transformations we described in a single call.

In [0]:
from pyspark.ml.classification import LogisticRegression
  
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)
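
The fit-then-transform pattern used above can be sketched in plain Python. The two toy stages (`MeanCenter`, `Scale`) and the helper functions are hypothetical and stand in for stages such as `StringIndexer`; the point is only that each stage is fitted on the output of the previous one, and the fitted stages are then applied in the same order.

```python
class MeanCenter:
    """Toy stage: learns the mean in fit(), subtracts it in transform()."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [r - self.mean for r in rows]

class Scale:
    """Toy stage: learns the largest absolute value and divides by it."""
    def fit(self, rows):
        self.scale = max(abs(r) for r in rows) or 1.0
        return self

    def transform(self, rows):
        return [r / self.scale for r in rows]

def fit_pipeline(stages, rows):
    """Fit each stage on the output of the previous fitted stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows)
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply the already fitted stages in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows

data = [1.0, 2.0, 3.0, 6.0]
models = fit_pipeline([MeanCenter(), Scale()], data)
prepped = transform_pipeline(models, data)
print(prepped)
```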

In [0]:
# Fit model to prepped data
lrModel = LogisticRegression().fit(preppedDataDF)

# ROC for training data
display(lrModel, preppedDataDF, "ROC")

False Positive Rate,True Positive Rate,Threshold
0.0,0.0,0.9995506943477948
0.0,0.0416666666666666,0.9995506943477948
0.0,0.0833333333333333,0.9989163941898824
0.0,0.125,0.9976719836399542
0.0,0.1666666666666666,0.983672176420753
0.0,0.2083333333333333,0.9683807594578272
0.0,0.25,0.8513144728541281
0.0,0.2916666666666667,0.8008515753945067
0.0099009900990099,0.2916666666666667,0.7992954842010757
0.0099009900990099,0.3333333333333333,0.7969834811921324


In [0]:
display(lrModel, preppedDataDF)

fitted values,residuals
-0.3822160448588013,-0.4055925253742398
0.0260926259001584,-0.5065227864061655
-3.7045987222113688,-0.0240189815507855
-1.3845769514748478,0.7997250728756193
-1.7678281904494306,-0.1458126226199589
1.1144516964125857,0.2470418865847634
-2.074781128022595,-0.1115722337760365
-3.457926198048185,-0.0305333603013103
-5.045791223108038,-0.0063952038761406
-1.5760860514342432,-0.1713505070176361


Feature selection tells us which columns to keep for further analysis and which to drop. Looking at the dataset, the pipeline has created two new columns: `label` (from the `StringIndexer`) and `features` (from the `VectorAssembler`).
`features` is a single vector that combines the numeric variables and the one-hot encoded categorical variables.

In [0]:
# Keep relevant columns
selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
display(dataset)

label,features,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0.0,"List(0, 100, List(1, 10, 23, 31, 43, 48, 52, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 50.0, 83311.0, 13.0, 13.0))",50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
0.0,"List(0, 100, List(0, 8, 25, 38, 44, 48, 52, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 38.0, 215646.0, 9.0, 40.0))",38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
0.0,"List(0, 100, List(0, 13, 23, 38, 43, 49, 52, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 53.0, 234721.0, 7.0, 40.0))",53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
0.0,"List(0, 100, List(0, 10, 23, 29, 47, 49, 62, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 28.0, 338409.0, 13.0, 40.0))",28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
0.0,"List(0, 100, List(0, 11, 23, 31, 47, 48, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.0, 284582.0, 14.0, 40.0))",37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
0.0,"List(0, 100, List(0, 18, 28, 34, 44, 49, 64, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 49.0, 160187.0, 5.0, 16.0))",49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
1.0,"List(0, 100, List(1, 8, 23, 31, 43, 48, 52, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 52.0, 209642.0, 9.0, 45.0))",52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
1.0,"List(0, 100, List(0, 11, 24, 29, 44, 48, 53, 94, 95, 96, 97, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 31.0, 45781.0, 14.0, 14084.0, 50.0))",31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
1.0,"List(0, 100, List(0, 10, 23, 31, 43, 48, 52, 53, 94, 95, 96, 97, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 42.0, 159449.0, 13.0, 5178.0, 40.0))",42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K
1.0,"List(0, 100, List(0, 9, 23, 31, 43, 49, 52, 53, 94, 95, 96, 99), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.0, 280464.0, 10.0, 80.0))",37.0,Private,280464.0,Some-college,10.0,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0.0,0.0,80.0,United-States,>50K


In [0]:
# Randomly split data into training and test sets; set a seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())
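
The printed counts will generally not be an exact 70/30 split. A plausible way to picture why is the sketch below, which assigns each row independently using a seeded random draw. The function `random_split` is a hypothetical stand-in, not Spark's implementation, and its results will not match Spark's for the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by one uniform draw against cumulative weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=100)
print(len(train), len(test))
```

Because every row is drawn independently, the split sizes are only approximately proportional to the weights, while fixing the seed makes the assignment repeatable.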

## Fit and Evaluate Models

We are now ready to try out some of the Binary Classification algorithms available in the Pipelines API.

Of these algorithms, the following also support multiclass classification with the Python API:
- Decision Tree Classifier

These are the general steps we will take to build our models:
- Create initial model using the training set
- Tune parameters with a `ParamGrid` and 5-fold Cross Validation
- Evaluate the best model obtained from the Cross Validation using the test set

We use the `BinaryClassificationEvaluator` to evaluate our models, which uses `areaUnderROC` as the default metric.
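
For intuition about the metric, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a toy example; `area_under_roc` is a hypothetical helper and is not how the Spark evaluator computes the value internally.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (not from the Adult dataset)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.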

## Logistic Regression

In the Pipelines API, we are now able to perform Elastic-Net Regularization with Logistic Regression, as well as other linear methods.
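
As a sketch of what Elastic-Net regularization adds to the loss, the function below mixes an L1 and an L2 penalty. The argument names mirror the `regParam` and `elasticNetParam` params, and the formula follows the form given in the Spark MLlib documentation as far as I am aware; treat it as illustrative rather than as the library's code.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

w = [0.5, -1.0, 0.0]  # hypothetical weight vector
ridge = elastic_net_penalty(w, reg_param=0.1, elastic_net_param=0.0)  # pure L2
lasso = elastic_net_penalty(w, reg_param=0.1, elastic_net_param=1.0)  # pure L1
mixed = elastic_net_penalty(w, reg_param=0.1, elastic_net_param=0.5)
print(ridge, lasso, mixed)
```

Setting `elasticNetParam` to 0.0 gives a ridge-style penalty, 1.0 gives a lasso-style penalty, and values in between blend the two.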

In [0]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

# Train model with Training Data
lrModel = lr.fit(trainingData)

In [0]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column.
predictions = lrModel.transform(testData)

In [0]:
# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. For example's sake we will choose age & occupation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

label,prediction,probability,age,occupation
0.0,1.0,"List(1, 2, List(), List(0.15558714514333483, 0.8444128548566652))",36.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6978787145962684, 0.3021212854037317))",32.0,Prof-specialty
0.0,1.0,"List(1, 2, List(), List(0.48936322618261907, 0.5106367738173809))",33.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6787721431468228, 0.32122785685317723))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6057264047792345, 0.3942735952207656))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.606358235816831, 0.3936417641831689))",50.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5967673056938737, 0.40323269430612624))",51.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5960955533001752, 0.40390444669982484))",60.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.7633298613814152, 0.2366701386185848))",34.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.9892175014341769, 0.010782498565823072))",20.0,Prof-specialty
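
The `probability` column above holds one value per class, and `prediction` follows from comparing the class-1 probability with a threshold. A minimal sketch of that relationship for binary logistic regression is shown below, assuming a default threshold of 0.5; the function `predict` and the `margin` argument (the linear score before the sigmoid) are hypothetical names.

```python
import math

def predict(margin, threshold=0.5):
    """Turn a linear score into [P(class 0), P(class 1)] and a predicted label."""
    p1 = 1.0 / (1.0 + math.exp(-margin))
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return probability, prediction

# Recover the margin implied by the first row of the output above
margin = math.log(0.8444128548566652 / 0.15558714514333483)
probability, prediction = predict(margin)
print(probability, prediction)
```

This is why the first row is predicted as 1.0 (class-1 probability of about 0.84) even though its label is 0.0.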


We can use ``BinaryClassificationEvaluator`` to evaluate our model. We set the required column names with the `rawPredictionCol` and `labelCol` params and the metric with the `metricName` param. The default metric for the ``BinaryClassificationEvaluator`` is ``areaUnderROC``.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

The evaluator currently accepts 2 kinds of metrics - areaUnderROC and areaUnderPR.
We can set it to areaUnderPR by using evaluator.setMetricName("areaUnderPR").

Now we will try tuning the model with the ``ParamGridBuilder`` and the ``CrossValidator``.

If you are unsure what params are available for tuning, you can use ``explainParams()`` to print a list of all params and their definitions.

In [0]:
print(lr.explainParams())

As we indicate 3 values for regParam, 3 values for elasticNetParam, and 3 values for maxIter,
this grid will have 3 x 3 x 3 = 27 parameter settings for CrossValidator to choose from.
We will create a 5-fold cross validator.

In [0]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
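
The size of this grid can be checked without Spark by taking the Cartesian product of the value lists. The sketch below uses plain dictionaries keyed by parameter name as a stand-in for the param maps that `ParamGridBuilder` produces.

```python
from itertools import product

# Same candidate values as in the grid above
grid_values = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}

names = list(grid_values)
param_maps = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

num_folds = 5
total_fits = len(param_maps) * num_folds  # one fit per setting per fold

print(len(param_maps))  # 27
print(total_fits)       # 135
```

With 5 folds, each of the 27 settings is trained 5 times, so the grid size directly drives the cost of cross validation.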

Run cross validation. With the parameter grid set, we can evaluate each setting across several folds instead of relying on a single split, which reduces bias in the estimate.

In [0]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
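
For readers new to k-fold cross validation, the sketch below shows how 5 folds partition a set of row indices so that every row is used for validation exactly once. The generator `k_fold_indices` is hypothetical and assigns folds deterministically for clarity, whereas `CrossValidator` assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n_rows=10, k=5))
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

Each parameter setting is scored on all 5 validation folds and the average metric is used to pick the best setting.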

In [0]:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [0]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

We can also access the model's feature weights and intercept easily.

In [0]:
print('Model Intercept: ', cvModel.bestModel.intercept)

In [0]:
weights = cvModel.bestModel.coefficients
weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
weightsDF = spark.createDataFrame(weights, ["Feature Weight"])  # use the SparkSession rather than the legacy sqlContext
display(weightsDF)

Feature Weight
-0.2783399603222051
-0.6391137625407002
-0.4411494528218453
-0.5297064790918096
-0.5294496363738568
0.0248668725356996
0.0637245573186053
-2.506061617464818
-0.5602180997865484
-0.2296194230414342


In [0]:
# View best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

label,prediction,probability,age,occupation
0.0,1.0,"List(1, 2, List(), List(0.22341326854888555, 0.7765867314511145))",36.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6532176673497101, 0.34678233265028985))",32.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5316332266435574, 0.4683667733564426))",33.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6358906170966756, 0.3641093829033244))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5978620344650845, 0.40213796553491554))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.592588255410988, 0.407411744589012))",50.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5875745325552946, 0.4124254674447054))",51.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.5956944062410583, 0.4043055937589417))",60.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.705027036804322, 0.294972963195678))",34.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.9608456606182515, 0.0391543393817484))",20.0,Prof-specialty


## Decision Trees

The Decision Trees algorithm is popular because it handles categorical
data and works out of the box with multiclass classification tasks.

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

We can extract the number of nodes in our decision tree as well as the
tree depth of our model.

In [0]:
print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

In [0]:
display(dtModel)

treeNode
"{""index"":5,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[0.0],""feature"":23,""overflow"":false}"
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":7565.5,""categories"":null,""feature"":97,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":3,""featureType"":""continuous"",""prediction"":null,""threshold"":20.5,""categories"":null,""feature"":94,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":9,""featureType"":""continuous"",""prediction"":null,""threshold"":12.5,""categories"":null,""feature"":96,""overflow"":false}"
"{""index"":7,""featureType"":""continuous"",""prediction"":null,""threshold"":3368.0,""categories"":null,""feature"":97,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":8,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"


In [0]:
# Make predictions on test data using the Transformer.transform() method.
predictions = dtModel.transform(testData)

In [0]:
predictions.printSchema()

In [0]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

label,prediction,probability,age,occupation
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",36.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",32.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",33.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",50.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",51.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",60.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",34.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6996018286388438, 0.30039817136115615))",20.0,Prof-specialty


We will evaluate our Decision Tree model with
`BinaryClassificationEvaluator`.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

Entropy and Gini impurity are the supported measures of impurity for Decision Trees. The default is ``gini``. To change it, set the param on the estimator before fitting, for example ``dt.setImpurity("entropy")``.
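
To show what the two impurity measures compute, here is a small sketch that evaluates both on the class counts of a node. The functions `gini` and `entropy` are hypothetical helpers, and the base-2 logarithm for entropy is an assumption.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; classes with zero count contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

balanced = [5, 5]   # node with an even class mix
skewed = [9, 1]     # node dominated by one class
print(gini(balanced), entropy(balanced))
print(gini(skewed), entropy(skewed))
```

Both measures are highest for an evenly mixed node and fall toward zero as a node becomes pure, so a split is preferred when it lowers the weighted impurity of the child nodes.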

In [0]:
dt.getImpurity()

Now we will try tuning the model with the ``ParamGridBuilder`` and the ``CrossValidator``.

As we indicate 4 values for maxDepth and 3 values for maxBins, this grid will have 4 x 3 = 12 parameter settings for ``CrossValidator`` to choose from. We will create a 5-fold CrossValidator.

In [0]:
# Create ParamGrid for Cross Validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6, 10])
             .addGrid(dt.maxBins, [20, 40, 80])
             .build())

In [0]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# Takes ~2 minutes

In [0]:
print("numNodes = ", cvModel.bestModel.numNodes)
print("depth = ", cvModel.bestModel.depth)

In [0]:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [0]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

In [0]:
# View Best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)

label,prediction,probability,age,occupation
0.0,0.0,"List(1, 2, List(), List(1.0, 0.0))",36.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.8337236533957846, 0.16627634660421545))",32.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6678832116788321, 0.33211678832116787))",33.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6858064516129032, 0.3141935483870968))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6858064516129032, 0.3141935483870968))",39.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6858064516129032, 0.3141935483870968))",50.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.6858064516129032, 0.3141935483870968))",51.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.9288888888888889, 0.07111111111111111))",60.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.8337236533957846, 0.16627634660421545))",34.0,Prof-specialty
0.0,0.0,"List(1, 2, List(), List(0.9782608695652174, 0.021739130434782608))",20.0,Prof-specialty


## Make Predictions
We will use the `bestModel` from the most recent cross-validation run, which here is the tuned Decision Tree,
and use it to generate predictions on new data.
In this example, we will simulate this by generating predictions on the entire dataset.

In [0]:
bestModel = cvModel.bestModel

In [0]:
# Generate predictions for entire dataset
finalPredictions = bestModel.transform(dataset)

In [0]:
# Evaluate best model
evaluator.evaluate(finalPredictions)

In this example, we will also look into predictions grouped by age and occupation.

In [0]:
finalPredictions.createOrReplaceTempView("finalPredictions")

In an operational environment, analysts may use a similar machine learning pipeline to obtain predictions on new data, organize it into a table and use it for analysis or lead targeting.

In [0]:
%sql
SELECT occupation, prediction, count(*) AS count
FROM finalPredictions
GROUP BY occupation, prediction
ORDER BY occupation


occupation,prediction,count
?,0.0,1710
?,1.0,133
Adm-clerical,1.0,330
Adm-clerical,0.0,3439
Armed-Forces,1.0,1
Armed-Forces,0.0,8
Craft-repair,0.0,3727
Craft-repair,1.0,372
Exec-managerial,0.0,2046
Exec-managerial,1.0,2020


In [0]:
%sql
SELECT age, prediction, count(*) AS count
FROM finalPredictions
GROUP BY age, prediction
ORDER BY age

age,prediction,count
17.0,0.0,395
18.0,0.0,550
19.0,0.0,711
19.0,1.0,1
20.0,0.0,753
21.0,1.0,1
21.0,0.0,719
22.0,1.0,10
22.0,0.0,755
23.0,0.0,868
