<a href="https://colab.research.google.com/github/rickycl/PySpark/blob/master/Supervised_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Step 1: Create the Spark Session Object
!pip install pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Linear RM').getOrCreate()

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=527f8cd79c3cd60b89359bbc1d80749c1a6b1a302c3e46b505cb9ae177987e08
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In supervised ML, the correct labels (outputs) are already known during the model training phase, so the error can be measured and reduced accordingly.

The model learns a mapping from the input data to the output label, picking up the signals in the training data so that it can generalize to unseen data.

Training consists of comparing the actual output with the predicted output and then adjusting the model to reduce the total error between the two.
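
The compare-and-adjust loop described above can be sketched in a few lines of plain Python. This is only an illustration on made-up data using gradient descent; it is not how Spark fits its models, and the data, learning rate, and iteration count are assumptions chosen for the example.

```python
# Toy data generated from y = 2x + 1 (hypothetical values, for illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared gap between actual and predicted outputs."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0          # start from an uninformed model
learning_rate = 0.05     # assumed step size
initial_error = mean_squared_error(w, b)

for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    # Adjust the model in the direction that reduces the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(w, b)
print(f"error before: {initial_error:.3f}, after: {final_error:.6f}")
print(f"learned w={w:.3f}, b={b:.3f}")
```

Each pass compares predictions with the known labels and nudges the parameters to shrink the gap, which is the essence of supervised training.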

# Step 2: Read the Dataset
day: last contact day of the month (numeric)

duration: last contact duration, in seconds (numeric)

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

previous: number of contacts performed before this campaign and for this client (numeric)

target_class: has the client subscribed to a term deposit? (binary: "yes": 1, "no": 0)

In [3]:
df=spark.read.csv('sample_data/bank-full1.csv', inferSchema=True, header=True)

In [4]:
print((df.count(), len(df.columns)))

(45211, 8)


In [5]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- balance: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- target_class: integer (nullable = true)



In [6]:
df.show(10)

+---+-------+---+--------+--------+-----+--------+------------+
|age|balance|day|duration|campaign|pdays|previous|target_class|
+---+-------+---+--------+--------+-----+--------+------------+
| 58|   2143|  5|     261|       1|   -1|       0|           0|
| 44|     29|  5|     151|       1|   -1|       0|           0|
| 33|      2|  5|      76|       1|   -1|       0|           0|
| 47|   1506|  5|      92|       1|   -1|       0|           0|
| 33|      1|  5|     198|       1|   -1|       0|           0|
| 35|    231|  5|     139|       1|   -1|       0|           0|
| 28|    447|  5|     217|       1|   -1|       0|           0|
| 42|      2|  5|     380|       1|   -1|       0|           0|
| 58|    121|  5|      50|       1|   -1|       0|           0|
| 43|    593|  5|      55|       1|   -1|       0|           0|
+---+-------+---+--------+--------+-----+--------+------------+
only showing top 10 rows



# Step 3: Feature Engineering
Create a single vector combining all input features by using Spark's VectorAssembler. It adds one new column whose value, for each row, is a vector holding that row's input values.
So, instead of 7 separate input columns, the model sees a single column containing a vector of 7 values.
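
Conceptually, the assembling step does something like the following for every row. This is a plain-Python sketch, not Spark code: `assemble_row` is a hypothetical helper, and the dictionary stands in for the first row of the dataset shown earlier.

```python
# Hypothetical stand-in for one DataFrame row (values from the first row of df.show)
row = {"age": 58, "balance": 2143, "day": 5, "duration": 261,
       "campaign": 1, "pdays": -1, "previous": 0, "target_class": 0}

input_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]

def assemble_row(row, input_cols):
    """Illustrative helper (not a Spark API): gather the input columns into one list of floats."""
    return [float(row[col]) for col in input_cols]

features = assemble_row(row, input_cols)
print(features)
```

The label column is deliberately left out of `input_cols`, so the vector carries only the inputs.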

In [7]:
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler

In [8]:
df.columns

['age',
 'balance',
 'day',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'target_class']

In [10]:
vec_assembler=VectorAssembler(inputCols=['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous'], outputCol='features')
features_df=vec_assembler.transform(df)
features_df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- balance: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- target_class: integer (nullable = true)
 |-- features: vector (nullable = true)



In [11]:
# The extra column (features) contains a single vector holding all of the inputs
# Rename target_class to label, the default label column name that Spark ML estimators expect
features_df = features_df.withColumnRenamed('target_class', 'label')

In [12]:
features_df.select(['features', 'label']).show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[58.0,2143.0,5.0,...|    0|
|[44.0,29.0,5.0,15...|    0|
|[33.0,2.0,5.0,76....|    0|
|[47.0,1506.0,5.0,...|    0|
|[33.0,1.0,5.0,198...|    0|
|[35.0,231.0,5.0,1...|    0|
|[28.0,447.0,5.0,2...|    0|
|[42.0,2.0,5.0,380...|    0|
|[58.0,121.0,5.0,5...|    0|
|[43.0,593.0,5.0,5...|    0|
|[41.0,270.0,5.0,2...|    0|
|[29.0,390.0,5.0,1...|    0|
|[53.0,6.0,5.0,517...|    0|
|[58.0,71.0,5.0,71...|    0|
|[57.0,162.0,5.0,1...|    0|
|[51.0,229.0,5.0,3...|    0|
|[45.0,13.0,5.0,98...|    0|
|[57.0,52.0,5.0,38...|    0|
|[60.0,60.0,5.0,21...|    0|
|[33.0,0.0,5.0,54....|    0|
+--------------------+-----+
only showing top 20 rows



In [13]:
# Step 4: Split the Dataset
train, test = features_df.randomSplit([0.75, 0.25])
print(f"Size of train Dataset : {train.count()}" )

Size of train Dataset : 33869


In [14]:
print(f"Size of test Dataset : {test.count()}" )

Size of test Dataset : 11342
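
The sizes above (33869 and 11342) are close to, but not exactly, a 75/25 split of the 45211 rows. The sketch below shows one way such a split can behave, assuming each row is assigned independently at random; `random_split` is a hypothetical plain-Python helper, not the Spark method. In PySpark, `randomSplit` also accepts a `seed` argument, which makes the split repeatable between runs.

```python
import random

def random_split(rows, train_weight, seed):
    """Illustrative helper: send each row to train with probability train_weight."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(45211))  # stand-in for the 45211 rows of the dataset
train_rows, test_rows = random_split(rows, 0.75, seed=42)
print(len(train_rows), len(test_rows))
```

Because every row is a separate random draw, the proportions are only approximately 75/25, and fixing the seed is what keeps the two subsets identical across reruns.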


In [15]:
# Step 5: Build and train the linear regression model using the features and label columns
from pyspark.ml.regression import LinearRegression
lr = LinearRegression()

In [16]:
lr_model = lr.fit(train)
predictions_df = lr_model.transform(test)
predictions_df.show()

+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
|age|balance|day|duration|campaign|pdays|previous|label|            features|          prediction|
+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
| 18|      5| 24|     143|       2|   -1|       0|    0|[18.0,5.0,24.0,14...|0.027535736957257355|
| 18|   1944| 10|     122|       3|   -1|       0|    0|[18.0,1944.0,10.0...|0.020649151558401004|
| 19|     56| 12|     246|       1|   -1|       0|    0|[19.0,56.0,12.0,2...| 0.08114475783154632|
| 19|     60| 14|     253|       1|   -1|       0|    0|[19.0,60.0,14.0,2...| 0.08468098940621466|
| 19|    179| 24|      62|       3|   -1|       0|    0|[19.0,179.0,24.0,...|-0.01371466916044...|
| 19|    245| 10|      98|       2|  110|       2|    0|[19.0,245.0,10.0,...| 0.04880677594105844|
| 19|    526|  1|     174|       1|  199|       3|    0|[19.0,526.0,1.0,1...|  0.1189909483478403|
| 19|    5

In [17]:
# Step 6: Evaluate Linear RM on Test Data
# To check the performance of the model on unseen or test data
model_predictions=lr_model.evaluate(test)
model_predictions.r2

0.179991807283855

R squared (the coefficient of determination) measures the proportion of variation in the dependent variable that can be attributed to the independent variables. In simple linear regression it equals the square of the correlation between the two variables.

It indicates how much of the variation in the dataset is explained by the regression. The higher the value, the better the performance of the model.

From the above result, the model is weak: only about 18.0% of the variation in the term deposit outcome (the dependent variable) is explained by the independent variables (i.e., age, balance, day, duration, campaign, pdays and previous).
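
As a worked example of the definition, R squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up labels and predictions; `r_squared` is an illustrative helper, not the Spark implementation.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)            # total variation
    return 1 - ss_res / ss_tot

# Hypothetical binary labels and model outputs
actual = [0, 0, 1, 0, 1]
predicted = [0.1, 0.2, 0.6, 0.0, 0.7]

print(round(r_squared(actual, predicted), 4))
# A model that always predicts the mean of the labels explains nothing
print(round(r_squared(actual, [0.4] * 5), 4))
```

A value near 0 means the model does little better than always predicting the mean label, which is why a score around 0.18 counts as weak.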

In [18]:
print(model_predictions.meanSquaredError)

0.08449346613707671
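
The tree-based models later in this notebook are scored with RMSE rather than MSE. To put the linear model on the same scale, the mean squared error printed above can be converted by taking its square root, as in this small sketch:

```python
import math

mse = 0.08449346613707671   # meanSquaredError reported by the linear model
rmse = math.sqrt(mse)       # same units as the label, comparable with other RMSE values
print(round(rmse, 4))
```

The result is roughly 0.29, which can be compared directly with the RMSE values reported for the other regressors.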


# GENERALIZED LINEAR MODEL REGRESSION
A generalized linear model is used when the target variable has an error distribution other than the normal distribution that linear regression assumes.



In [19]:
from pyspark.ml.regression import GeneralizedLinearRegression

In [20]:
glr = GeneralizedLinearRegression()
glr_model = glr.fit(train)
glr_model.coefficients

# One of the features has a negative coefficient value.

DenseVector([0.0007, 0.0, 0.0001, 0.0005, -0.0031, 0.0002, 0.0075])

In [21]:
glr_model.summary

Coefficients:
    Feature Estimate Std Error T Value P Value
(Intercept)  -0.0500    0.0074 -6.7265  0.0000
        age   0.0007    0.0002  4.7764  0.0000
    balance   0.0000    0.0000  7.2152  0.0000
        day   0.0001    0.0002  0.2781  0.7810
   duration   0.0005    0.0000 78.2733  0.0000
   campaign  -0.0031    0.0005 -5.9528  0.0000
      pdays   0.0002    0.0000 13.8317  0.0000
   previous   0.0075    0.0007 10.1601  0.0000

(Dispersion parameter for gaussian family taken to be 0.0858)
    Null deviance: 3501.5884 on 33861 degrees of freedom
Residual deviance: 2906.4149 on 33861 degrees of freedom
AIC: 12966.0259

In [22]:
# Evaluate the Model Performance on Test Data
model_predictions = glr_model.evaluate(test)
model_predictions.predictions.show()

+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
|age|balance|day|duration|campaign|pdays|previous|label|            features|          prediction|
+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
| 18|      5| 24|     143|       2|   -1|       0|    0|[18.0,5.0,24.0,14...|0.027535736957257355|
| 18|   1944| 10|     122|       3|   -1|       0|    0|[18.0,1944.0,10.0...|0.020649151558401004|
| 19|     56| 12|     246|       1|   -1|       0|    0|[19.0,56.0,12.0,2...| 0.08114475783154632|
| 19|     60| 14|     253|       1|   -1|       0|    0|[19.0,60.0,14.0,2...| 0.08468098940621466|
| 19|    179| 24|      62|       3|   -1|       0|    0|[19.0,179.0,24.0,...|-0.01371466916044...|
| 19|    245| 10|      98|       2|  110|       2|    0|[19.0,245.0,10.0,...| 0.04880677594105844|
| 19|    526|  1|     174|       1|  199|       3|    0|[19.0,526.0,1.0,1...|  0.1189909483478403|
| 19|    5

In [23]:
# The Akaike information criterion (AIC) measures the relative quality of models fitted to the same dataset.
# It is used to select among multiple models for a given dataset. A lower value indicates a better model.
# AIC balances goodness of fit against model complexity. The model with the lowest AIC is preferred over the others.
model_predictions.aic

4178.200175053163
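
For reference, the textbook definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies it to two sets of made-up residuals under a Gaussian likelihood. The helpers and the parameter count are assumptions for illustration, and Spark's own calculation may differ in detail.

```python
import math

def aic(log_likelihood, num_params):
    """Textbook AIC: penalize parameters, reward likelihood."""
    return 2 * num_params - 2 * log_likelihood

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, using the mean squared residual as the variance."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# Hypothetical residuals from two candidate models with the same number of parameters
residuals_small = [0.1, -0.2, 0.15, -0.05]
residuals_large = [0.5, -0.6, 0.4, -0.3]

aic_a = aic(gaussian_log_likelihood(residuals_small), num_params=3)
aic_b = aic(gaussian_log_likelihood(residuals_large), num_params=3)
print(aic_a < aic_b)  # the tighter fit gets the lower (better) AIC
```

With the fit held fixed, each extra parameter raises the AIC by 2, which is how the criterion discourages needlessly complex models.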

In [24]:
glr = GeneralizedLinearRegression(family='Binomial')
glr_model = glr.fit(train)
model_predictions=glr_model.evaluate(test)
model_predictions.aic

6586.788935961635

In [25]:
glr = GeneralizedLinearRegression(family='Poisson')
glr_model = glr.fit(train)
model_predictions=glr_model.evaluate(test)
model_predictions.aic

# The default GLM model with Gaussian distribution has the lowest AIC value at 4178.2002

7440.304523979164

In [26]:
glr = GeneralizedLinearRegression(family='Gamma')
glr_model = glr.fit(train)
model_predictions = glr_model.evaluate(test)
model_predictions.aic

Py4JJavaError: An error occurred while calling o326.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 50.0 failed 1 times, most recent failure: Lost task 0.0 in stage 50.0 (TID 43) (69ed30299582 executor driver): java.lang.IllegalArgumentException: requirement failed: The response variable of Gamma family should be positive, but got 0.0
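
The error message explains the failure: the Gamma family requires a strictly positive response, but the label column here contains 0. A quick check along these lines can catch the problem before fitting; `gamma_response_is_valid` is a hypothetical helper and the labels are a made-up sample of the binary target.

```python
labels = [0, 0, 1, 0, 1]  # hypothetical sample of the binary label column

def gamma_response_is_valid(labels):
    """Illustrative check (not a Spark API): Gamma needs every response to be > 0."""
    return all(y > 0 for y in labels)

print(gamma_response_is_valid(labels))
print(gamma_response_is_valid([0.5, 2.0]))
```

Since a 0/1 label always contains zeros, the Gamma family is not applicable to this dataset.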


# DECISION TREE REGRESSION
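A decision tree (DT) can be used for both regression and classification. It is quite powerful in terms of fitting the data well, but it comes with a high risk of overfitting. A DT contains multiple splits chosen by an impurity measure: entropy or the Gini index for classification, and variance for regression (the only impurity that Spark's DecisionTreeRegressor supports). The deeper the tree, the higher the chance of overfitting the data.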
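
For regression trees, a split is typically scored by how much it reduces the variance of the labels, which is the impurity measure Spark's regression trees use. The sketch below compares a good and a poor split on made-up labels; the helpers are illustrative and not part of any library.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(parent) - weighted

labels = [0, 0, 0, 1, 1, 1]  # hypothetical labels reaching one tree node

good = variance_reduction(labels, [0, 0, 0], [1, 1, 1])  # children are pure
poor = variance_reduction(labels, [0, 1, 0], [1, 0, 1])  # children stay mixed
print(good > poor)
```

The tree keeps the split with the largest reduction. Repeating this at every node is what lets a deep tree fit the training data very closely.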


In [27]:
# Build and Train DT Regressor Model
from pyspark.ml.regression import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor()
dec_tree_model = dec_tree.fit(train)
dec_tree_model.featureImportances

SparseVector(7, {0: 0.0685, 1: 0.0033, 2: 0.0304, 3: 0.7019, 4: 0.0012, 5: 0.1947})

In [28]:
# Evaluate the Model Performance on Test Data
model_predictions = dec_tree_model.transform(test)
model_predictions.show()

+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
|age|balance|day|duration|campaign|pdays|previous|label|            features|          prediction|
+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
| 18|      5| 24|     143|       2|   -1|       0|    0|[18.0,5.0,24.0,14...|  0.0635048231511254|
| 18|   1944| 10|     122|       3|   -1|       0|    0|[18.0,1944.0,10.0...|  0.0635048231511254|
| 19|     56| 12|     246|       1|   -1|       0|    0|[19.0,56.0,12.0,2...| 0.15303430079155672|
| 19|     60| 14|     253|       1|   -1|       0|    0|[19.0,60.0,14.0,2...| 0.15303430079155672|
| 19|    179| 24|      62|       3|   -1|       0|    0|[19.0,179.0,24.0,...|  0.0635048231511254|
| 19|    245| 10|      98|       2|  110|       2|    0|[19.0,245.0,10.0,...| 0.07184241019698726|
| 19|    526|  1|     174|       1|  199|       3|    0|[19.0,526.0,1.0,1...| 0.15313028764805414|
| 19|    5

In [29]:
# Evaluate the performance of the decision tree on test data
# r^2 indicates how much of the variation in the dataset is explained by the regression. The higher the value, the better the performance of the model.
# RMSE is the square root of the mean squared difference between actual and predicted values. The lower the value, the better the performance of the model.

from pyspark.ml.evaluation import RegressionEvaluator
dt_evaluator = RegressionEvaluator(metricName='r2')
dt_r2 = dt_evaluator.evaluate(model_predictions)
print(f'The r-square value of DecisionTreeRegressor is {dt_r2}')

The r-square value of DecisionTreeRegressor is 0.2337287726638555


In [30]:
dt_evaluator = RegressionEvaluator(metricName='rmse')
dt_rmse = dt_evaluator.evaluate(model_predictions)
print(f'The rmse value of DecisionTreeRegressor is {dt_rmse}')

The rmse value of DecisionTreeRegressor is 0.28099185200234106


# RANDOM FOREST REGRESSORS
Random forests (RF) are a collection of multiple individual decision trees built using different samples of the data. An RF is an ensembling technique that takes a bagging approach and can be used for regression and classification. DTs tend to overfit the data; RFs reduce this high variance by averaging the predicted values from the individual trees.
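
The bagging idea can be sketched without any tree code at all. In this toy example each "tree" is just a model that predicts the mean of its own bootstrap sample, and the ensemble averages those predictions. The labels, the seed, and the choice of 20 models are assumptions made for illustration.

```python
import random

def bootstrap_sample(values, rng):
    """Draw len(values) items with replacement."""
    return [rng.choice(values) for _ in values]

def fit_mean_model(sample):
    """Toy stand-in for a tree: always predicts the mean of its training sample."""
    mean = sum(sample) / len(sample)
    return lambda: mean

rng = random.Random(0)
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # hypothetical training labels

# Each model sees a different resampled view of the data
models = [fit_mean_model(bootstrap_sample(labels, rng)) for _ in range(20)]
individual = [model() for model in models]
ensemble = sum(individual) / len(individual)

print(f"individual predictions range from {min(individual):.2f} to {max(individual):.2f}")
print(f"ensemble prediction: {ensemble:.3f}")
```

Individual predictions swing with whichever rows each sample happened to contain, while their average is much steadier. That is the variance reduction bagging relies on.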

In [31]:
# Build and Train RF Regressor Model
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor()
rf_model = rf.fit(train)
rf_model.featureImportances

SparseVector(7, {0: 0.0759, 1: 0.016, 2: 0.0237, 3: 0.6807, 4: 0.006, 5: 0.1388, 6: 0.0588})

In [32]:
rf_model.getNumTrees

20

In [33]:
model_predictions = rf_model.transform(test)

In [34]:
model_predictions.show()

+---+-------+---+--------+--------+-----+--------+-----+--------------------+-------------------+
|age|balance|day|duration|campaign|pdays|previous|label|            features|         prediction|
+---+-------+---+--------+--------+-----+--------+-----+--------------------+-------------------+
| 18|      5| 24|     143|       2|   -1|       0|    0|[18.0,5.0,24.0,14...|0.05442794375135808|
| 18|   1944| 10|     122|       3|   -1|       0|    0|[18.0,1944.0,10.0...|0.06002948475348282|
| 19|     56| 12|     246|       1|   -1|       0|    0|[19.0,56.0,12.0,2...|0.08550211821875787|
| 19|     60| 14|     253|       1|   -1|       0|    0|[19.0,60.0,14.0,2...|0.08550211821875787|
| 19|    179| 24|      62|       3|   -1|       0|    0|[19.0,179.0,24.0,...|0.05923398029888029|
| 19|    245| 10|      98|       2|  110|       2|    0|[19.0,245.0,10.0,...|0.15429846250237636|
| 19|    526|  1|     174|       1|  199|       3|    0|[19.0,526.0,1.0,1...|0.13941208047440132|
| 19|    527|  4|   

In [35]:
# Evaluate the Model Performance on Test Data
rf_evaluator = RegressionEvaluator(metricName='r2')
rf_r2 = rf_evaluator.evaluate(model_predictions)
print(f'The r-square value of RandomForestRegressor is {rf_r2}')

The r-square value of RandomForestRegressor is 0.24921720756668175


In [36]:
rf_evaluator = RegressionEvaluator(metricName='rmse')

In [37]:
rf_rmse = rf_evaluator.evaluate(model_predictions)

In [38]:
print(f'The rmse value of RandomForestRegressor is {rf_rmse}')

The rmse value of RandomForestRegressor is 0.27813754856576156


# GRADIENT-BOOSTED TREE REGRESSOR
A gradient-boosted tree (GBT) regressor is also an ensembling technique, but it uses boosting, i.e., it combines individual weak learners in order to boost the performance of the overall model.
In bagging, the individual models are parallel in nature, meaning they can be built independently of each other.
In boosting, the individual models are built sequentially. In a gradient boosting approach, the second model focuses on the errors made by the first model and tries to reduce the overall error for those data points. Similarly, each subsequent model tries to reduce the errors made by the previous one. In this way, the overall prediction error is reduced.
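
The sequential error-correcting idea can be sketched with one-split trees (stumps) on a single feature. This is a minimal illustration under squared-error loss with made-up data and an assumed learning rate; Spark's GBTRegressor is more elaborate.

```python
def fit_stump(xs, residuals):
    """Find the threshold split whose two leaf means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical one-feature dataset
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.1, 0.0, 0.3, 0.6, 0.9, 1.0, 1.0]

learning_rate = 0.5                          # assumed shrinkage factor
predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean label
errors = [mse(ys, predictions)]

for _ in range(10):
    # Each new stump is trained on what the current ensemble still gets wrong
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    errors.append(mse(ys, predictions))

print([round(e, 4) for e in errors])
```

The printed training error never increases from one round to the next, because every stump is fitted to the residuals left by the models before it.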

In [39]:
# Build and Train the GBT Regressor Model
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor()
gbt_model=gbt.fit(train)
gbt_model.featureImportances

SparseVector(7, {0: 0.1275, 1: 0.076, 2: 0.1628, 3: 0.3579, 4: 0.0449, 5: 0.1968, 6: 0.0341})

In [40]:
model_predictions = gbt_model.transform(test)
model_predictions.show()

+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
|age|balance|day|duration|campaign|pdays|previous|label|            features|          prediction|
+---+-------+---+--------+--------+-----+--------+-----+--------------------+--------------------+
| 18|      5| 24|     143|       2|   -1|       0|    0|[18.0,5.0,24.0,14...| 0.08665512988784398|
| 18|   1944| 10|     122|       3|   -1|       0|    0|[18.0,1944.0,10.0...| 0.19009065676638015|
| 19|     56| 12|     246|       1|   -1|       0|    0|[19.0,56.0,12.0,2...| 0.15322133431935805|
| 19|     60| 14|     253|       1|   -1|       0|    0|[19.0,60.0,14.0,2...|  0.1521330255411778|
| 19|    179| 24|      62|       3|   -1|       0|    0|[19.0,179.0,24.0,...|0.025074852259006786|
| 19|    245| 10|      98|       2|  110|       2|    0|[19.0,245.0,10.0,...|  0.3264474782807645|
| 19|    526|  1|     174|       1|  199|       3|    0|[19.0,526.0,1.0,1...| 0.27506016811956324|
| 19|    5

In [41]:
# Evaluate the Model Performance on Test Data
gbt_evaluator = RegressionEvaluator(metricName='r2')
gbt_r2 = gbt_evaluator.evaluate(model_predictions)
print(f'The r-square value of GradientBoostedRegressor is {gbt_r2}')

The r-square value of GradientBoostedRegressor is 0.25712331871354677


In [42]:
gbt_evaluator = RegressionEvaluator(metricName='rmse')
gbt_rmse = gbt_evaluator.evaluate(model_predictions)
print(f'The rmse value of GradientBoostedRegressor is {gbt_rmse}')

The rmse value of GradientBoostedRegressor is 0.2766692103328004


The GBT regressor outperforms the random forest model on both metrics: its r-square is higher (0.2571 against 0.2492) and its rmse is lower (0.2767 against 0.2781).
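
To see all of the regressors side by side, the metrics printed in this notebook can be collected and ranked, as in the sketch below. The values are copied from the cell outputs, with the linear model's RMSE derived from its reported mean squared error. They come from a single random split, so a rerun would give slightly different numbers.

```python
import math

# Test-set metrics copied from the outputs in this notebook
results = {
    "LinearRegression": {"r2": 0.179991807283855, "rmse": math.sqrt(0.08449346613707671)},
    "DecisionTreeRegressor": {"r2": 0.2337287726638555, "rmse": 0.28099185200234106},
    "RandomForestRegressor": {"r2": 0.24921720756668175, "rmse": 0.27813754856576156},
    "GBTRegressor": {"r2": 0.25712331871354677, "rmse": 0.2766692103328004},
}

best_by_r2 = max(results, key=lambda name: results[name]["r2"])       # higher is better
best_by_rmse = min(results, key=lambda name: results[name]["rmse"])   # lower is better

# Print the models from lowest to highest RMSE
for name, metrics in sorted(results.items(), key=lambda item: item[1]["rmse"]):
    print(f"{name:<22} r2={metrics['r2']:.4f} rmse={metrics['rmse']:.4f}")
print(f"best by r2: {best_by_r2}, best by rmse: {best_by_rmse}")
```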