<a href="https://colab.research.google.com/github/rohandawar/pyspark/blob/main/RandomForestClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I am trying to implement a Random Forest Classifier in PySpark.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 4.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285398 sha256=ce62af3700051585c900e461632a30d57ca454c9016afab1272f1e5f91f6679d
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


In [30]:
# Import the libs

# Pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Colab
from google.colab import drive

In [3]:
# Create a spark session
spark = SparkSession.builder.appName('RandomForestClassifier').getOrCreate()

In [4]:
# Mount the drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
# read the data
df = spark.read.csv('/content/drive/MyDrive/DataSets_Pyspark_GoogleColab_Primer/glassClass.csv', inferSchema=True, header=True)
df.show(5)

+-------+-----+----+----+-----+----+----+---+---+----+
|     RI|   Na|  Mg|  Al|   Si|   K|  Ca| Ba| Fe|Type|
+-------+-----+----+----+-----+----+----+---+---+----+
|1.52101|13.64|4.49| 1.1|71.78|0.06|8.75|0.0|0.0|   1|
|1.51761|13.89| 3.6|1.36|72.73|0.48|7.83|0.0|0.0|   1|
|1.51618|13.53|3.55|1.54|72.99|0.39|7.78|0.0|0.0|   1|
|1.51766|13.21|3.69|1.29|72.61|0.57|8.22|0.0|0.0|   1|
|1.51742|13.27|3.62|1.24|73.08|0.55|8.07|0.0|0.0|   1|
+-------+-----+----+----+-----+----+----+---+---+----+
only showing top 5 rows



In [14]:
# Check the class distribution
df.groupBy('Type').count().show()

+----+-----+
|Type|count|
+----+-----+
|   1|   70|
|   6|    9|
|   3|   17|
|   5|   13|
|   7|   29|
|   2|   76|
+----+-----+
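
The counts above show that the classes are far from balanced, so it helps to know what accuracy a trivial model would reach before judging the classifier. The sketch below is plain Python, not part of the original notebook. It copies the counts from the table by hand and computes each class's share plus the accuracy of always predicting the most frequent type. The variable names are my own, and the baseline refers to the full dataset rather than to the test split used later.

```python
# Class counts copied by hand from the groupBy('Type').count() output above.
class_counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

total = sum(class_counts.values())

# Share of each glass type in the full dataset.
class_share = {label: count / total for label, count in class_counts.items()}

# A model that always predicts the most frequent type reaches this accuracy.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

for label in sorted(class_share):
    print(f"Type {label}: {class_counts[label]:3d} rows ({class_share[label]:.1%})")
print(f"Total rows: {total}")
print(f"Always predicting Type {majority_label} gives accuracy {majority_baseline:.3f}")
```

Note that Type 4 does not appear in the data at all, and Types 3, 5 and 6 have fewer than 20 rows each, so per-class results for those types will be noisy.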



In [22]:
# Build the list of feature columns: every column except the target 'Type'
col_list = df.columns
print('All column names:', col_list)
col_list.remove('Type')
print('Column names after removing the target variable:', col_list)

All column names: ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type']
Column names after removing the target variable: ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']


In [24]:
# Instantiate the VectorAssembler, which packs the feature columns into a single vector column
vec_assembler = VectorAssembler(inputCols=col_list, outputCol='features')

finaldf = vec_assembler.transform(df)
finaldf.show(5)

+-------+-----+----+----+-----+----+----+---+---+----+--------------------+
|     RI|   Na|  Mg|  Al|   Si|   K|  Ca| Ba| Fe|Type|            features|
+-------+-----+----+----+-----+----+----+---+---+----+--------------------+
|1.52101|13.64|4.49| 1.1|71.78|0.06|8.75|0.0|0.0|   1|[1.52101,13.64,4....|
|1.51761|13.89| 3.6|1.36|72.73|0.48|7.83|0.0|0.0|   1|[1.51761,13.89,3....|
|1.51618|13.53|3.55|1.54|72.99|0.39|7.78|0.0|0.0|   1|[1.51618,13.53,3....|
|1.51766|13.21|3.69|1.29|72.61|0.57|8.22|0.0|0.0|   1|[1.51766,13.21,3....|
|1.51742|13.27|3.62|1.24|73.08|0.55|8.07|0.0|0.0|   1|[1.51742,13.27,3....|
+-------+-----+----+----+-----+----+----+---+---+----+--------------------+
only showing top 5 rows



In [25]:
# Train & Test Split
train_df, test_df = finaldf.randomSplit([0.7,0.3], seed=42)
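
One thing to keep in mind about `randomSplit`: the weights act as per-row sampling probabilities rather than exact quotas, so the two splits are only approximately 70% and 30% of the rows. The plain-Python sketch below imitates that idea. The helper `bernoulli_split` is hypothetical and written only for illustration. It is not Spark's implementation and will not reproduce the same rows as the call above, even with the same seed.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row to exactly one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


# 214 stand-in rows: the glass dataset has 214 rows according to the class counts.
rows = list(range(214))
train_rows, test_rows = bernoulli_split(rows, [0.7, 0.3], seed=42)
print(f"train: {len(train_rows)} rows, test: {len(test_rows)} rows")
```

Every row lands in exactly one split and the same seed gives the same assignment, but the split sizes vary from seed to seed. On a dataset this small, that variation can noticeably change the evaluation metrics.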

In [28]:
# Instantiate the model
rf = RandomForestClassifier(featuresCol='features', labelCol='Type')

# Fit the model on training data
rf_model = rf.fit(train_df)
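
A note on the label column: `Type` is passed to the classifier as-is, with values from 1 to 7 and no value 4. My understanding, which is an assumption about Spark's behaviour and not something this notebook verifies, is that when the label column carries no metadata Spark takes the number of classes to be the largest label plus one, and indexes the `rawPrediction` and `probability` vectors by label value. The plain-Python sketch below only illustrates the arithmetic that follows from that assumption.

```python
# Label values present in the data, taken from the class distribution shown earlier.
labels_in_data = [1, 2, 3, 5, 6, 7]

# ASSUMPTION: number of classes is inferred as (largest label + 1).
assumed_num_classes = max(labels_in_data) + 1

# Vector positions that no training row can ever belong to.
unused_positions = [i for i in range(assumed_num_classes) if i not in labels_in_data]

print(f"Assumed length of the probability vector: {assumed_num_classes}")
print(f"Positions with no matching label: {unused_positions}")
```

If that holds, positions 0 and 4 of each prediction vector always stay at zero. Re-indexing the labels first (for example with `StringIndexer` from `pyspark.ml.feature`) would avoid the unused slots, at the cost of having to map predictions back to the original `Type` values.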

In [29]:
# Make Predictions
pred_df = rf_model.transform(test_df)
pred_df.show(5)

+-------+-----+----+----+-----+----+----+----+----+----+--------------------+--------------------+--------------------+----------+
|     RI|   Na|  Mg|  Al|   Si|   K|  Ca|  Ba|  Fe|Type|            features|       rawPrediction|         probability|prediction|
+-------+-----+----+----+-----+----+----+----+----+----+--------------------+--------------------+--------------------+----------+
|1.51215|12.99|3.47|1.12|72.98|0.62|8.35| 0.0|0.31|   1|[1.51215,12.99,3....|[0.0,4.1341852770...|[0.0,0.2067092638...|       2.0|
|1.51409|14.25|3.09|2.08|72.28| 1.1|7.08| 0.0| 0.0|   2|[1.51409,14.25,3....|[0.0,1.3225274725...|[0.0,0.0661263736...|       7.0|
|1.51514|14.01|2.68| 3.5|69.89|1.68|5.87| 2.2| 0.0|   5|[1.51514,14.01,2....|[0.0,1.0,3.0,0.0,...|[0.0,0.05,0.15,0....|       7.0|
|1.51514|14.85| 0.0|2.42|73.72| 0.0|8.39|0.56| 0.0|   7|[1.51514,14.85,0....|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|       7.0|
|1.51567|13.29|3.45|1.21|72.74|0.56|8.57| 0.0| 0.0|   1|[1.51567,13.29,3....|[0.0,9

`MulticlassClassificationEvaluator` accepts the following values for its `metricName` parameter:

*f1 | accuracy | weightedPrecision | weightedRecall | weightedTruePositiveRate | weightedFalsePositiveRate | weightedFMeasure | truePositiveRateByLabel | falsePositiveRateByLabel | precisionByLabel | recallByLabel | fMeasureByLabel | logLoss | hammingLoss*

The `...ByLabel` metrics are computed for a single class, which is selected through the evaluator's `metricLabel` parameter.

In [39]:
# evaluator
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='Type', metricName='accuracy')
accuracy = evaluator.evaluate(pred_df)
f1 = evaluator.setMetricName('f1').evaluate(pred_df)
weightedPrecision = evaluator.setMetricName('weightedPrecision').evaluate(pred_df)
weightedRecall = evaluator.setMetricName('weightedRecall').evaluate(pred_df)
# precisionByLabel = evaluator.setMetricName('precisionByLabel').evaluate(pred_df)
# recallByLabel = evaluator.setMetricName('recallByLabel').evaluate(pred_df)
print(f"Accuracy: {accuracy}")
print(f"f1: {f1}")
print(f"weightedPrecision: {weightedPrecision}")
print(f"weightedRecall: {weightedRecall}")
# print(f"precisionByLabel: {precisionByLabel}")
# print(f"recallByLabel: {recallByLabel}")


Accuracy: 0.6779661016949152
f1: 0.6804191274506982
weightedPrecision: 0.7253771865182483
weightedRecall: 0.6779661016949152
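
In the output above, accuracy and weightedRecall are identical. That is expected: when each class's recall is weighted by that class's share of the true labels, the weighted sum reduces to the overall fraction of correct predictions. The plain-Python sketch below computes the four metrics from `(label, prediction)` pairs to make this visible. The function `weighted_metrics` and the toy pairs are made up for illustration, and the code follows the usual support-weighted definitions rather than being taken from Spark's source.

```python
from collections import Counter


def weighted_metrics(pairs):
    """Accuracy and support-weighted precision, recall and F1 from (label, prediction) pairs."""
    n = len(pairs)
    support = Counter(label for label, _ in pairs)    # true rows per class
    predicted = Counter(pred for _, pred in pairs)    # predicted rows per class
    correct = Counter(label for label, pred in pairs if label == pred)  # true positives

    w_precision = w_recall = w_f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n  # class share among the true labels
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1

    return {
        "accuracy": sum(correct.values()) / n,
        "weightedPrecision": w_precision,
        "weightedRecall": w_recall,
        "f1": w_f1,
    }


# Toy (label, prediction) pairs, invented for this example only.
toy_pairs = [(1, 1), (1, 2), (2, 2), (2, 2), (7, 7), (7, 2)]
metrics = weighted_metrics(toy_pairs)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the weights favour the large classes, these weighted scores can hide poor performance on the rare glass types. The per-label metrics are the place to look for that.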


Documentation for MulticlassClassificationEvaluator:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator.metricName