<h4>Exploring The Data</h4>
We use the same data set as in the earlier Logistic Regression in Python example. It comes from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe (Yes/No) to a term deposit. The data set can be downloaded from Kaggle.
    
Ref: https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
   

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.csv('bank.csv', header = True, inferSchema = True)
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- deposit: string (nullable = true)



In [None]:
!pip install numpy


Preparing Data for Machine Learning
Preparation consists of three steps: category indexing, one-hot encoding, and a VectorAssembler, which is a feature transformer that merges multiple columns into a single vector column.

In [3]:
# Note: OneHotEncoderEstimator is the Spark 2.3/2.4 name; it was renamed to OneHotEncoder in Spark 3.0.
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
stages = []
for categoricalCol in categoricalColumns:
    # Map each category string to a numeric index, then one-hot encode that index.
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
# Index the target column 'deposit' as the numeric 'label' column.
label_stringIdx = StringIndexer(inputCol = 'deposit', outputCol = 'label')
stages += [label_stringIdx]
numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
# Merge the encoded categorical vectors and the numeric columns into one 'features' vector.
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
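
To make the three steps concrete, here is a minimal pure-Python sketch of what category indexing, one-hot encoding, and vector assembly do to a single row. It is an illustration only, not Spark's implementation: the helper names (`fit_string_indexer`, `one_hot`, `assemble`) and the toy `marital` values are hypothetical, and Spark produces sparse vectors rather than Python lists. It assumes the Spark defaults of ordering categories by descending frequency and dropping the last category from the one-hot vector.

```python
from collections import Counter


def fit_string_indexer(values):
    """Assign index 0 to the most frequent category, 1 to the next, and so on.

    Ties are broken alphabetically here (an assumption made for determinism).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list; the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate vectors and scalars into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


# Toy column (hypothetical values, not taken from bank.csv).
marital = ["married", "single", "married", "divorced", "married", "single"]
index = fit_string_indexer(marital)

row = {"marital": "single", "age": 33, "balance": 1200}
features = assemble(
    one_hot(index[row["marital"]], len(index)),
    row["age"],
    row["balance"],
)
print(index)
print(features)
```

Dropping the last category keeps the one-hot columns from being linearly dependent, which matters for linear models such as logistic regression.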

Randomly split the data into training and test sets, and set a seed for reproducibility.

In [11]:
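from pyspark.ml import Pipeline
cols = df.columns
# Fit the preprocessing stages and apply them, so that df gains the 'label' and 'features' columns used below.
pipeline = Pipeline(stages = stages)
df = pipeline.fit(df).transform(df).select(['label', 'features'] + cols)
train, test = df.randomSplit([0.7, 0.3], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))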

Training Dataset Count: 7764
Test Dataset Count: 3398
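
The counts above add up to 11162 rows, split roughly 69.6% / 30.4% rather than exactly 70/30. That is expected when each row is assigned to a split by its own random draw, which is the idea sketched below in plain Python. The function `random_split` is a hypothetical stand-in and not Spark's implementation, so it will not reproduce the exact counts above. It only shows why the sizes are approximate and why a fixed seed makes the split repeatable.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if x < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(11162))  # stand-in row ids, same size as the data set
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=2018)
print(len(train_rows), len(test_rows))

# Same seed, same split.
assert random_split(rows, [0.7, 0.3], seed=2018) == [train_rows, test_rows]
```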


Logistic Regression Model

In [12]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10)
lrModel = lr.fit(train)

Make predictions on the test set.

In [15]:
predictions = lrModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 18|    student|  1.0|[-0.2251430600501...|       1.0|[0.44395079358481...|
| 18|    student|  1.0|[-0.5636188230682...|       1.0|[0.36271054806663...|
| 19|    student|  1.0|[-0.5684347598631...|       1.0|[0.36159807423288...|
| 19|    student|  1.0|[-3.4591930495953...|       1.0|[0.03049588242942...|
| 20|blue-collar|  1.0|[-0.9641237569114...|       1.0|[0.27605331018526...|
| 20|    student|  1.0|[-0.3057798595901...|       1.0|[0.42414516007933...|
| 20|    student|  1.0|[-3.3025718642084...|       1.0|[0.03548306451732...|
| 20|    student|  0.0|[1.46112621265899...|       0.0|[0.81170486564410...|
| 20|    student|  1.0|[-0.8211329030062...|       1.0|[0.30552322896115...|
| 20|    student|  1.0|[-0.8685589128467...|       1.0|[0.29555425071253...|
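
The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function. As a quick check, the sketch below applies the sigmoid to the first `rawPrediction` entry of the first row and compares it with the first `probability` entry. The raw value is copied from the truncated table output, so the match is only approximate, and the 0.5 decision threshold is assumed to be the default.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


raw0 = -0.2251430600501      # first entry of rawPrediction, first row (truncated in the table)
p0 = sigmoid(raw0)           # probability of class 0
p1 = 1.0 - p0                # probability of class 1
prediction = 1.0 if p1 > 0.5 else 0.0

print(round(p0, 6), prediction)
```

The first row has a class 0 probability of about 0.444, so class 1 is more likely and the predicted label is 1.0, in agreement with the table.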

In [16]:
#Evaluate our Logistic Regression model.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.8872231023652682
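
Area under ROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a toy list of labels and scores. All values here are hypothetical, and the function `area_under_roc` is only an illustration of the metric: Spark derives it from the ROC curve, so results on real data may differ slightly.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): scores are the predicted probability of class 1.
labels = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is a useful yardstick for the scores reported in this notebook.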


Decision Tree Classifier
Decision trees are widely used because they are easy to interpret, handle categorical features, extend to multi-class classification, do not require feature scaling, and can capture non-linearities and feature interactions.

In [17]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+---------------+----------+--------------------+
|age|        job|label|  rawPrediction|prediction|         probability|
+---+-----------+-----+---------------+----------+--------------------+
| 18|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 18|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 19|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 19|    student|  1.0|   [39.0,412.0]|       1.0|[0.08647450110864...|
| 20|blue-collar|  1.0| [202.0,1254.0]|       1.0|[0.13873626373626...|
| 20|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 20|    student|  1.0|   [39.0,412.0]|       1.0|[0.08647450110864...|
| 20|    student|  0.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 20|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
| 20|    student|  1.0|[3327.0,1063.0]|       0.0|[0.75785876993166...|
+---+-----------+-----+---------------+----------+--------------------+
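
For the decision tree, the `rawPrediction` values in the table behave like per-class training counts at the leaf a row falls into, and `probability` is those counts normalized to sum to one. With `maxDepth = 3` the tree has few leaves, which is why the same vectors repeat across rows. The short sketch below reproduces the first row of the table under that reading. The helper `normalize` is a hypothetical name.

```python
def normalize(counts):
    """Turn per-class counts into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


leaf_counts = [3327.0, 1063.0]                 # rawPrediction of the first row
probability = normalize(leaf_counts)
# The predicted class is the one with the highest probability.
prediction = float(probability.index(max(probability)))

print(probability, prediction)
```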

In [18]:
#Evaluate our Decision Tree model.
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.5093844673281636


In [19]:
#Random Forest Classifier
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 18|    student|  1.0|[9.65052898295371...|       1.0|[0.48252644914768...|
| 18|    student|  1.0|[8.3630953852178,...|       1.0|[0.41815476926089...|
| 19|    student|  1.0|[7.85517548687664...|       1.0|[0.39275877434383...|
| 19|    student|  1.0|[2.25149808487983...|       1.0|[0.11257490424399...|
| 20|blue-collar|  1.0|[5.79108197526126...|       1.0|[0.28955409876306...|
| 20|    student|  1.0|[8.93586153317299...|       1.0|[0.44679307665864...|
| 20|    student|  1.0|[2.07923152536544...|       1.0|[0.10396157626827...|
| 20|    student|  0.0|[15.5660186353400...|       0.0|[0.77830093176700...|
| 20|    student|  1.0|[7.70258418113884...|       1.0|[0.38512920905694...|
| 20|    student|  1.0|[7.70258418113884...|       1.0|[0.38512920905694...|
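
For the random forest, the numbers in the table are consistent with `rawPrediction` being the sum of each tree's class-probability vector, and `probability` being that sum divided by the number of trees. The sketch below shows the idea with three hypothetical trees, then checks the first row of the table assuming the default of 20 trees (the notebook does not set `numTrees`). The function names are illustrative only.

```python
def forest_raw_prediction(tree_probabilities):
    """Sum the per-tree class-probability vectors, class by class."""
    num_classes = len(tree_probabilities[0])
    return [sum(tree[k] for tree in tree_probabilities) for k in range(num_classes)]


def normalize(raw):
    """Scale a raw prediction vector so its entries sum to one."""
    total = sum(raw)
    return [r / total for r in raw]


# Three hypothetical trees, each voting with its leaf's class probabilities.
trees = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
raw = forest_raw_prediction(trees)
print(raw, normalize(raw))

# Check against the first row of the table, assuming numTrees = 20 (the default).
NUM_TREES = 20
raw0 = 9.65052898295371          # first entry of rawPrediction, first row
p0 = raw0 / NUM_TREES
print(round(p0, 6))
```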

In [20]:
#Evaluate our Random Forest Classifier.
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8814336140526394


In [21]:
#Gradient-Boosted Tree Classifier
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)


+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 18|    student|  1.0|[-0.1733144470155...|       1.0|[0.41420014349658...|
| 18|    student|  1.0|[-0.1698137109977...|       1.0|[0.41589998355429...|
| 19|    student|  1.0|[-0.3894578956498...|       1.0|[0.31455360488836...|
| 19|    student|  1.0|[-1.1260719899887...|       1.0|[0.09516468996793...|
| 20|blue-collar|  1.0|[-0.9981556330144...|       1.0|[0.11959075976740...|
| 20|    student|  1.0|[-0.3199910621568...|       1.0|[0.34525058022815...|
| 20|    student|  1.0|[-1.1120132532229...|       1.0|[0.09761355378069...|
| 20|    student|  0.0|[1.12703537577411...|       0.0|[0.90500109182949...|
| 20|    student|  1.0|[-0.6364679740189...|       1.0|[0.21875508047278...|
| 20|    student|  1.0|[-0.6364679740189...|       1.0|[0.21875508047278...|
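
Unlike logistic regression, the gradient-boosted tree output in the table is consistent with a sigmoid applied to twice the raw margin, that is `probability = 1 / (1 + exp(-2 * raw))`. The sketch below checks this against the first row. The raw value is copied from the truncated table output, so the comparison is approximate, and `gbt_probability` is a hypothetical helper name.

```python
import math


def gbt_probability(raw_margin):
    """Map a boosted-tree margin to a probability using a factor of 2 in the exponent."""
    return 1.0 / (1.0 + math.exp(-2.0 * raw_margin))


raw0 = -0.1733144470155          # first entry of rawPrediction, first row (truncated in the table)
p0 = gbt_probability(raw0)       # probability of class 0
prediction = 0.0 if p0 > 0.5 else 1.0

print(round(p0, 6), prediction)
```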

In [22]:
#Evaluate our Gradient-Boosted Tree Classifier.
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8909304733316615


In [23]:
print(gbt.explainParams())


cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 

In [None]:
exit()