# Spark vs SciKit Learn: A Comparative Analysis

This document provides a comparative analysis of machine learning models using Spark and SciKit Learn. The focus is on building, training, and evaluating three different classifiers: Decision Tree Classifier, Naive Bayes, and Random Forest.

## Models to be Compared:

 - Decision Tree Classifier
 - Naive Bayes
 - Random Forest

## A quick summary:

- Import Libraries
- Build Spark Session
- Data Load
- Data Exploration & Preparation
- Feature Engineering
- Data Scaling
- Data Split
- Build, Train & Evaluate Model
- Comparison

Import required libraries for Spark and SciKit Learn

In [41]:
# ----- Generic Libraries -----

import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)

# ----- Pyspark Libraries -----

# Spark base libraries
import pyspark
from pyspark.sql import SparkSession

# Spark machine learning classifier libraries
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, NaiveBayes

# Spark evaluation libraries
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Spark feature transformation libraries
from pyspark.ml.feature import StandardScaler, StringIndexer

# Spark pipeline libraries
from pyspark.ml import Pipeline

# Spark DenseVector libraries
from pyspark.ml.linalg import DenseVector

# ----- SciKit Learn Libraries -----

# Data preprocessing libraries
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score

# Machine learning classifier libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Tabulating data
from tabulate import tabulate

# Garbage collection
import gc

# Spark

In [2]:
# build a Spark session
spark = (
    SparkSession.builder.appName("Spark SciKit")
    .config("spark.driver.bindAddress", "localhost")
    .getOrCreate()
)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/05 23:14:44 WARN Utils: Your hostname, Harrys-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.64 instead (on interface en0)
25/08/05 23:14:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/05 23:14:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark.sparkContext.setLogLevel("INFO")

In [4]:
spark.version

'4.0.0'

# Data Load
Load data into the two dataframes: `df_spark` for Spark and `df_sk` for SciKit Learn.


In [5]:
iris_csv = 'data/iris.csv'

In [6]:
# PySpark DataFrame
df_spark = spark.read.csv(iris_csv, header=True, inferSchema=True)
df_spark.cache() # For fast reuse

DataFrame[sepal_length: double, sepal_width: double, petal_length: double, petal_width: double, species: string]

In [7]:
# SKLearn DataFrame
df_sk = pd.read_csv(iris_csv)

# Data Exploration & Preparation
Explore the data in both Spark and SciKit Learn dataframes. This includes checking for null values, data types, and basic statistics.

In [8]:
# Total Count
print("PySpark - ", df_spark.count())
print("SciKit Learn - ", df_sk.shape)

PySpark -  150
SciKit Learn -  (150, 5)


In [9]:
# Data Type
df_spark.printSchema()  # printSchema() prints the schema itself and returns None


root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)


In [10]:
df_sk.info()  # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [11]:
# Display Records
df_spark.show(5)  # show() prints the table itself and returns None
print("SciKit Learn - \n ", df_sk.head())

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows
SciKit Learn - 
     sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [12]:
# Record per species
df_spark.groupBy("species").count().show()  # show() prints the table itself and returns None
print("SciKit Learn - \n ", df_sk.groupby('species').size())

+----------+-----+
|   species|count|
+----------+-----+
| virginica|   50|
|versicolor|   50|
|    setosa|   50|
+----------+-----+

SciKit Learn - 
  species
setosa        50
versicolor    50
virginica     50
dtype: int64


In [13]:
# Summary Statistics
df_spark.describe().show()  # show() prints the table itself and returns None
print("SciKit Learn - \n ", df_sk.describe().transpose())

25/08/05 23:14:51 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+------------------+-------------------+------------------+------------------+---------+
|summary|      sepal_length|        sepal_width|      petal_length|       petal_width|  species|
+-------+------------------+-------------------+------------------+------------------+---------+
|  count|               150|                150|               150|               150|      150|
|   mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672|     NULL|
| stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414|     NULL|
|    min|               4.3|                2.0|               1.0|               0.1|   setosa|
|    max|               7.9|                4.4|               6.9|               2.5|virginica|
+-------+------------------+-------------------+------------------+------------------+---------+

SciKit Learn - 
                count      mean       std  min  25%   50%  75%  max
sepal_length  150.0  5.843

For the models to make predictions, the species (label) column must be a numerical value.

To achieve this, apply string indexing to the `species` column.

In [14]:
# Spark
spark_string_indexer = StringIndexer(inputCol="species", outputCol="species_index")
df_spark = spark_string_indexer.fit(df_spark).transform(df_spark)

In [15]:
# SKLearn
label_encoder = LabelEncoder()
df_sk['species_index'] = label_encoder.fit_transform(df_sk['species'])
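
The two encoders choose their index order differently. Spark's `StringIndexer` defaults to ordering labels by descending frequency (with ties sorted alphabetically in Spark 3.0 and later), while `LabelEncoder` sorts the distinct classes. With 50 rows per species, both give the same mapping here. The following is a minimal pure-Python sketch of the two orderings, not either library's implementation; the helper names `frequency_desc_index` and `sorted_index` and the `imbalanced` list are hypothetical.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Index labels by descending frequency, breaking ties alphabetically."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


def sorted_index(labels):
    """Index labels in sorted order of the distinct values."""
    return {label: position for position, label in enumerate(sorted(set(labels)))}


# Balanced classes, as in the iris data: both orderings agree.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(frequency_desc_index(species))
print(sorted_index(species))

# Hypothetical imbalanced labels: the two orderings diverge.
imbalanced = ["virginica"] * 3 + ["setosa"] * 1
print(frequency_desc_index(imbalanced))
print(sorted_index(imbalanced))
```

On imbalanced data the same species can therefore receive different index values in the two libraries, which matters when comparing raw predictions across them.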

In [16]:
# Inspect the DataFrame after String Indexing
df_spark.show(5)  # show() prints the table itself and returns None
print("SciKit Learn - \n ", df_sk.head())

+------------+-----------+------------+-----------+-------+-------------+
|sepal_length|sepal_width|petal_length|petal_width|species|species_index|
+------------+-----------+------------+-----------+-------+-------------+
|         5.1|        3.5|         1.4|        0.2| setosa|          0.0|
|         4.9|        3.0|         1.4|        0.2| setosa|          0.0|
|         4.7|        3.2|         1.3|        0.2| setosa|          0.0|
|         4.6|        3.1|         1.5|        0.2| setosa|          0.0|
|         5.0|        3.6|         1.4|        0.2| setosa|          0.0|
+------------+-----------+------------+-----------+-------+-------------+
only showing top 5 rows
SciKit Learn - 
     sepal_length  sepal_width  petal_length  petal_width species  species_index
0           5.1          3.5           1.4          0.2  setosa              0
1           4.9          3.0           1.4          0.2  setosa              0
2           4.7          3.2           

# Feature Engineering

The Spark models need two columns: "label" and "features". The "label" column is the target variable, and the "features" column contains the input features as a single vector.

First create a separate dataframe `df_spark_features` that puts the label (`species_index`) first, followed by the four measurements. Then map each row to a label and a `DenseVector` of features, and build `df_spark_index` with the "label" and "features" columns.

A `DenseVector` is a local vector backed by a double array representing its entry values. It is used to represent the features in Spark MLlib.

In [17]:
# Create a df with reordered columns
df_spark_features = df_spark.select("species_index", "sepal_length", "sepal_width", "petal_length", "petal_width")

df_spark_features.show(5)

+-------------+------------+-----------+------------+-----------+
|species_index|sepal_length|sepal_width|petal_length|petal_width|
+-------------+------------+-----------+------------+-----------+
|          0.0|         5.1|        3.5|         1.4|        0.2|
|          0.0|         4.9|        3.0|         1.4|        0.2|
|          0.0|         4.7|        3.2|         1.3|        0.2|
|          0.0|         4.6|        3.1|         1.5|        0.2|
|          0.0|         5.0|        3.6|         1.4|        0.2|
+-------------+------------+-----------+------------+-----------+
only showing top 5 rows


In [18]:
# Define the features using DenseVector
input_data = df_spark_features.rdd.map(lambda row: (row[0], DenseVector(row[1:])))

In [19]:
df_spark_index = spark.createDataFrame(input_data, ["label", "features"])

                                                                                

In [20]:
df_spark_index.show(5)

+-----+-----------------+
|label|         features|
+-----+-----------------+
|  0.0|[5.1,3.5,1.4,0.2]|
|  0.0|[4.9,3.0,1.4,0.2]|
|  0.0|[4.7,3.2,1.3,0.2]|
|  0.0|[4.6,3.1,1.5,0.2]|
|  0.0|[5.0,3.6,1.4,0.2]|
+-----+-----------------+
only showing top 5 rows
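
The row mapping used above can be pictured without Spark. This is only an illustrative sketch: it uses Python's standard-library `array('d')` as a stand-in for the double array behind a `DenseVector`, and the function name `to_label_and_features` is hypothetical.

```python
from array import array


def to_label_and_features(row):
    """Split a row into its label (first value) and a dense array of doubles."""
    label = float(row[0])
    features = array("d", row[1:])  # every entry is stored, in column order
    return label, features


# First row of df_spark_features: species_index followed by the four measurements.
first_row = (0.0, 5.1, 3.5, 1.4, 0.2)
label, features = to_label_and_features(first_row)
print(label, list(features))
```

The column order chosen in `df_spark_features` matters for this reason: the first value becomes the label and everything after it becomes the feature vector.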


Feature and target selection using SKLearn

In [21]:
sk_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
target = "species_index"  # numeric label produced by LabelEncoder

X = df_sk[sk_features]
y = df_sk[target]

# Scaling the Data

In [22]:
# Spark
spark_standard_scaler = pyspark.ml.feature.StandardScaler(
    inputCol="features", outputCol="features_scaled"
)
spark_scaler = spark_standard_scaler.fit(df_spark_index)
df_spark_scaled = spark_scaler.transform(df_spark_index)

In [23]:
# SKLearn DataFrame
sk_standard_scaler = StandardScaler()
df_sk_scaled = sk_standard_scaler.fit_transform(X)
df_sk_scaled = pd.DataFrame(
    df_sk_scaled, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"]
)

In [24]:
# Inspect the Scaled Data
df_spark_scaled.show(5)  # show() prints the table itself and returns None
print(df_sk_scaled.head())

+-----+-----------------+--------------------+
|label|         features|     features_scaled|
+-----+-----------------+--------------------+
|  0.0|[5.1,3.5,1.4,0.2]|[6.15892840883878...|
|  0.0|[4.9,3.0,1.4,0.2]|[5.9174018045706,...|
|  0.0|[4.7,3.2,1.3,0.2]|[5.67587520030241...|
|  0.0|[4.6,3.1,1.5,0.2]|[5.55511189816831...|
|  0.0|[5.0,3.6,1.4,0.2]|[6.03816510670469...|
+-----+-----------------+--------------------+
only showing top 5 rows
   sepal_length  sepal_width  petal_length  petal_width
0     -0.900681     1.032057     -1.341272    -1.312977
1     -1.143017    -0.124958     -1.341272    -1.312977
2     -1.385353     0.337848     -1.398138    -1.312977
3     -1.506521     0.106445     -1.284407    -1.312977
4     -1.021849     1.263460     -1.341272    -1.312977
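
The two scaled outputs differ because the scalers are configured differently. Spark's `StandardScaler` defaults to `withMean=False` and `withStd=True`, so it only divides by the sample standard deviation, while SciKit Learn's `StandardScaler` subtracts the mean and divides by the population standard deviation. A small standard-library sketch, using the `sepal_length` mean and standard deviation reported by `describe()` earlier, reproduces the first scaled value from each library; the function names are hypothetical.

```python
import math

N_ROWS = 150
SEPAL_LENGTH_MEAN = 5.843333333333335
SEPAL_LENGTH_SAMPLE_STD = 0.8280661279778637  # n - 1 in the denominator


def scale_only(value, sample_std):
    """Divide by the sample standard deviation without centering."""
    return value / sample_std


def center_and_scale(value, mean, sample_std, n_rows):
    """Subtract the mean, then divide by the population standard deviation."""
    population_std = sample_std * math.sqrt((n_rows - 1) / n_rows)
    return (value - mean) / population_std


spark_style = scale_only(5.1, SEPAL_LENGTH_SAMPLE_STD)
sklearn_style = center_and_scale(5.1, SEPAL_LENGTH_MEAN, SEPAL_LENGTH_SAMPLE_STD, N_ROWS)
print(f"{spark_style:.6f}")    # 6.158928
print(f"{sklearn_style:.6f}")  # -0.900681
```

Setting `withMean=True` on the Spark scaler would center the features as well; the remaining difference would then be the n - 1 versus n denominator.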


In [25]:
# Dropping the Features column
df_spark_scaled = df_spark_scaled.drop("features")

In [26]:
df_spark_scaled.show(5)

+-----+--------------------+
|label|     features_scaled|
+-----+--------------------+
|  0.0|[6.15892840883878...|
|  0.0|[5.9174018045706,...|
|  0.0|[5.67587520030241...|
|  0.0|[5.55511189816831...|
|  0.0|[6.03816510670469...|
+-----+--------------------+
only showing top 5 rows


# Data Split
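Split the data into training and testing sets for both the Spark and SciKit Learn dataframes, using a 90:10 train:test split for each.
The two splits use different seeds (42 and 43) and are drawn independently, so the Spark and SciKit Learn test sets do not contain the same rows.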

In [28]:
# Spark
spark_train, spark_test = df_spark_scaled.randomSplit([0.9, 0.1], seed=42)

# SKLearn
X_train, X_test, y_train, y_test = train_test_split(
    df_sk_scaled, df_sk['species_index'], test_size=0.1, random_state=43)
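
The two split calls behave differently even with the same 90:10 ratio. `randomSplit` samples rows at random according to the weights, so its split sizes are only approximately 90:10, whereas `train_test_split` shuffles the rows and cuts off a fixed number (15 of 150 here). The following pure-Python sketch illustrates the two strategies; it is not either library's implementation, and the function names are hypothetical.

```python
import random


def bernoulli_split(rows, train_weight, seed):
    """Assign each row to train or test independently; sizes vary by seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


def fixed_size_split(rows, test_fraction, seed):
    """Shuffle the rows, then cut off an exact number of test rows."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))  # stand-ins for the 150 iris rows
approx_train, approx_test = bernoulli_split(rows, train_weight=0.9, seed=42)
exact_train, exact_test = fixed_size_split(rows, test_fraction=0.1, seed=43)
print(len(approx_train), len(approx_test))  # sizes sum to 150, test size varies
print(len(exact_train), len(exact_test))    # 135 15
```

Checking `spark_test.count()` after the split is a quick way to see how many rows the Spark test set actually received.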

In [30]:
# Inspect the Training Data
spark_train.show(5)  # show() prints the table itself and returns None
print("SKLearn - \n ", X_train.head())

+-----+--------------------+
|label|     features_scaled|
+-----+--------------------+
|  0.0|[5.19282199176603...|
|  0.0|[5.31358529390013...|
|  0.0|[5.31358529390013...|
|  0.0|[5.31358529390013...|
|  0.0|[5.43434859603422...|
+-----+--------------------+
only showing top 5 rows
SKLearn - 
       sepal_length  sepal_width  petal_length  petal_width
97       0.432165    -0.356361      0.307833     0.133226
15      -0.173674     3.114684     -1.284407    -1.050031
12      -1.264185    -0.124958     -1.341272    -1.444450
114     -0.052506    -0.587764      0.762759     1.579429
100      0.553333     0.569251      1.274550     1.710902


# Build, Train & Evaluate Model
Build, train, and evaluate the three classifiers: Decision Tree Classifier, Naive Bayes, and Random Forest.

Then compare their accuracy.

In [31]:
model = ['Decision Tree', 'Random Forest', 'Naive Bayes']
model_results = []

## Decision Tree Classifier

In [32]:
# Spark
dtc_spark = pyspark.ml.classification.DecisionTreeClassifier(labelCol="label", featuresCol="features_scaled")
dtc_spark_model = dtc_spark.fit(spark_train)
dtc_spark_predictions = dtc_spark_model.transform(spark_test)

# Evaluate the model
dtc_spark_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
dtc_spark_accuracy = dtc_spark_evaluator.evaluate(dtc_spark_predictions)

In [33]:
print(f"Decision Tree Classifier Accuracy (PySpark): {dtc_spark_accuracy:.4f}")

Decision Tree Classifier Accuracy (PySpark): 0.7778


In [34]:
# SKLearn
dtc_sk = DecisionTreeClassifier(random_state=43)
dtc_sk.fit(X_train, y_train)
dtc_sk_predictions = dtc_sk.predict(X_test)

# Evaluate the model
dtc_sk_accuracy = accuracy_score(y_test, dtc_sk_predictions)
print(f"Decision Tree Classifier Accuracy (SKLearn): {dtc_sk_accuracy:.4f}")

Decision Tree Classifier Accuracy (SKLearn): 0.8667
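
Both evaluations report accuracy, the fraction of test rows whose predicted label matches the true label. On a test set this small, each misclassified row moves the score noticeably: with the 15-row SciKit Learn test set, 13 correct predictions give the 0.8667 shown above. The sketch below uses made-up label lists (the notebook does not print the actual predictions) and a hypothetical `accuracy` helper.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the prediction equals the truth."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up labels for a 15-row test set with two mistakes.
true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2]
print(f"{accuracy(true_labels, predicted):.4f}")  # 0.8667
```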


## Random Forest Classifier

In [None]:
# Spark
rfc_spark = pyspark.ml.classification.RandomForestClassifier(labelCol="label", featuresCol="features_scaled", numTrees=100)
rfc_spark_model = rfc_spark.fit(spark_train)
rfc_spark_predictions = rfc_spark_model.transform(spark_test)

# Evaluate the model
rfc_spark_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
rfc_spark_accuracy = rfc_spark_evaluator.evaluate(rfc_spark_predictions)

print(f"Random Forest Classifier Accuracy (PySpark): {rfc_spark_accuracy:.4f}")

Random Forest Classifier Accuracy (PySpark): 0.7778


In [37]:
# SKLearn (1000 entropy-based trees, versus 100 trees with Spark's default gini impurity above)
rfc_sk = RandomForestClassifier(n_estimators=1000, criterion='entropy', random_state=None, bootstrap=True)
rfc_sk.fit(X_train, y_train)
rfc_sk_predictions = rfc_sk.predict(X_test)

# Evaluate the model
rfc_sk_accuracy = accuracy_score(y_test, rfc_sk_predictions)

In [38]:
print(f"Random Forest Classifier Accuracy (SKLearn): {rfc_sk_accuracy:.4f}")

Random Forest Classifier Accuracy (SKLearn): 0.9333


## Naive Bayes Classifier

In [39]:
# Spark
nbc_spark = pyspark.ml.classification.NaiveBayes(smoothing=1.0, modelType="gaussian", labelCol="label", featuresCol="features_scaled")
nbc_spark_model = nbc_spark.fit(spark_train)
nbc_spark_predictions = nbc_spark_model.transform(spark_test)

# Evaluate the model
nbc_spark_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
nbc_spark_accuracy = nbc_spark_evaluator.evaluate(nbc_spark_predictions)
print(f"Naive Bayes Classifier Accuracy (PySpark): {nbc_spark_accuracy:.4f}")

                                                                                

Naive Bayes Classifier Accuracy (PySpark): 0.8889


25/08/05 23:42:35 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


In [42]:
# SKLearn
nbc_sk = GaussianNB()
nbc_sk.fit(X_train, y_train)
nbc_sk_predictions = nbc_sk.predict(X_test)

# Evaluate the model
nbc_sk_accuracy = accuracy_score(y_test, nbc_sk_predictions)
print(f"Naive Bayes Classifier Accuracy (SKLearn): {nbc_sk_accuracy:.4f}")

Naive Bayes Classifier Accuracy (SKLearn): 1.0000


In [43]:
# Free up memory
gc.collect()

1376

# Model Comparison

Finally, compare the accuracy of the models built using Spark and SciKit Learn. The comparison will be based on the accuracy scores obtained from the evaluation of each model.

In [44]:
model_data = [
    ["Decision Tree Classifier", f"{dtc_spark_accuracy:.2%}", f"{dtc_sk_accuracy:.2%}"],
    ["Random Forest Classifier", f"{rfc_spark_accuracy:.2%}", f"{rfc_sk_accuracy:.2%}"],
    ["Naive Bayes Classifier", f"{nbc_spark_accuracy:.2%}", f"{nbc_sk_accuracy:.2%}"],
]

In [45]:
print(tabulate(model_data, headers=["Classifier Model", "PySpark Accuracy", "SKLearn Accuracy"]))

Classifier Model          PySpark Accuracy    SKLearn Accuracy
------------------------  ------------------  ------------------
Decision Tree Classifier  77.78%              86.67%
Random Forest Classifier  77.78%              93.33%
Naive Bayes Classifier    88.89%              100.00%
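
The gaps in the table can be computed directly from the printed accuracies. A short sketch, with the values copied from the outputs above and the differences expressed in percentage points (the variable names are illustrative):

```python
# Accuracies copied from the printed results: (PySpark, SKLearn).
reported = {
    "Decision Tree Classifier": (0.7778, 0.8667),
    "Random Forest Classifier": (0.7778, 0.9333),
    "Naive Bayes Classifier": (0.8889, 1.0000),
}

# Difference in percentage points (SKLearn minus PySpark).
gaps = {name: round((sk - spark) * 100, 2) for name, (spark, sk) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```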


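Conclusion - On this run, the SciKit Learn models scored higher than the PySpark models for all three classifiers, by roughly 9 to 16 percentage points. Each test set holds only about 10% of the 150 rows, the two libraries were evaluated on different splits, and the Random Forest models use different settings, so these gaps should be read with caution.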