# PySpark Decision Trees

In this Jupyter notebook, we train a decision tree classifier with PySpark on the forest cover type dataset (`covtype.data`). Decision trees can be used for both classification and regression, and they handle categorical as well as numerical features.


# Setting up PySpark Environment

We begin by importing what the notebook needs: `SparkSession` from `pyspark.sql` to start a session, `DoubleType` from `pyspark.sql.types` to cast the label column, and `col` from `pyspark.sql.functions` to reference columns in expressions. The top-level `pyspark` and `os` modules are imported as well, although the cells shown here do not use them directly.


In [None]:
import pyspark
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col

# Initializing Spark Session

We create a SparkSession named `spark` using the `SparkSession.builder.getOrCreate()` method. This initializes a Spark session if it doesn't exist already or retrieves an existing one. The SparkSession is essential for interacting with Spark functionality and executing Spark jobs.


In [None]:
spark = SparkSession.builder.getOrCreate()

# Reading Data Without Header

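We use `spark.read.option("inferSchema", True).option("header", False).csv("data/covtype.data")` to read the dataset `covtype.data`, treating the first row as data rather than as a header. Setting `inferSchema` to `True` makes Spark infer each column's type from the data instead of reading every column as a string. We then display summary statistics with `data_without_header.summary().show()`. Note that `summary` is a method: printing `data_without_header.summary` without calling it shows only the bound method, not the statistics.


In [None]:
data_without_header = spark.read.option("inferSchema", True).option("header", False).csv("data/covtype.data")
# summary() returns a DataFrame of statistics (count, mean, stddev, min, quartiles, max); show() prints it
data_without_header.summary().show()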

# Defining Column Names

We define a list `colnames` containing the column names for our dataset. It starts with ten numeric features (elevation, aspect, slope, the distances to hydrology, roadways, and fire points, and three hillshade values), followed by 4 wilderness-area indicators and 40 soil-type indicators. The last column, `Cover_Type`, is the target variable.


In [None]:
colnames = (
    ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
     "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_noon", "Hillshade_3pm",
     "Horizontal_Distance_To_Fire_Points"]
    + [f"Wilderness_Area_{i}" for i in range(4)]
    + [f"Soil_Type_{i}" for i in range(40)]
    + ["Cover_Type"]
)
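
Because `toDF(*colnames)` expects exactly one name per column, a quick count check can catch a mismatch before the rename. The sketch below rebuilds the same list so that it runs on its own; the expected total of 55 (54 features plus the target) is an assumption about the layout of `covtype.data`, and the variable `numeric_cols` is introduced here only for illustration.

```python
# Rebuild the column list from the cell above so this check is self-contained.
numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
colnames = (
    numeric_cols
    + [f"Wilderness_Area_{i}" for i in range(4)]
    + [f"Soil_Type_{i}" for i in range(40)]
    + ["Cover_Type"]
)

# Assumption: the file has 54 feature columns plus one target column.
assert len(colnames) == 55, "expected 10 numeric + 4 wilderness + 40 soil + 1 target"
# Duplicate names would make later column references ambiguous.
assert len(set(colnames)) == len(colnames), "column names must be unique"

print(len(colnames), colnames[-1])
```

In the notebook itself, `len(data_without_header.columns)` could be compared against `len(colnames)` for the same purpose.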

# Printing Schema of Data

We use the `printSchema()` method on the `data_without_header` DataFrame to display the schema of the dataset. This schema outlines the data types and structure of each column, providing insights into the dataset's organization and format.


In [None]:
data_without_header.printSchema()

# Converting Data and Adjusting Column Types

We convert the `data_without_header` DataFrame to a new DataFrame named `data`, using the column names defined earlier (`colnames`). We also cast the "Cover_Type" column to `DoubleType()` using the `withColumn()` method and the `col()` function from `pyspark.sql.functions`. Finally, we display the summary statistics of `data` with `data.summary().show()`; as before, `summary` must be called, since printing `data.summary` alone shows only the bound method.


In [None]:
# Rename all columns, then cast the label column to double
data = data_without_header.toDF(*colnames).withColumn("Cover_Type", col("Cover_Type").cast(DoubleType()))
data.summary().show()

# Splitting Data into Training and Testing Sets

We split the `data` DataFrame into training and testing sets using the `randomSplit()` method. The training set (`train_data`) receives roughly 90% of the rows and the testing set (`test_data`) roughly 10%; the weights are target proportions, so the actual sizes vary slightly. This gives us separate datasets for training and evaluating our machine learning models.


In [None]:
train_data,test_data = data.randomSplit([0.9,0.1])
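
Without a seed, `randomSplit()` draws a different partition on every run, which makes results harder to compare between runs. A minimal sketch of a seeded split follows. The helper name `split_train_test` and the seed value `42` are illustrative assumptions, not part of the original notebook; the helper relies only on the `randomSplit(weights, seed)` method that PySpark DataFrames provide.

```python
def split_train_test(df, train_fraction=0.9, seed=42):
    """Split a DataFrame into (train, test) using a fixed seed.

    `df` only needs a `randomSplit(weights, seed)` method, as PySpark
    DataFrames have. The seed value is arbitrary.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    train, test = df.randomSplit([train_fraction, 1.0 - train_fraction], seed=seed)
    return train, test

# In this notebook the call would be:
# train_data, test_data = split_train_test(data)
```

Fixing the seed should give the same split when the input DataFrame is read and partitioned the same way; it does not guarantee identical splits across different partitionings.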

# Creating Feature Vectors

We use the `VectorAssembler` from `pyspark.ml.feature` to assemble feature vectors for the training data. First, we define the input columns (`input_cols`) as every column except the target variable. Then, we create a `VectorAssembler` that combines those columns into a single output column, `featureVector`. Finally, we transform the training data with `vector_assembler` and display the resulting feature vectors.


In [None]:
from pyspark.ml.feature import VectorAssembler

input_cols = colnames[:-1]
vector_assembler = VectorAssembler(inputCols = input_cols,outputCol = "featureVector")
assembled_train_data = vector_assembler.transform(train_data)

assembled_train_data.select("featureVector").show(truncate=False)
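
Since most of the 44 indicator columns are zero in any given row, the vectors shown by `show()` will typically appear in sparse form, `(size,[indices],[values])`, which lists only the non-zero entries. The sketch below is plain Python, not a PySpark API; `sparse_to_dense` is a hypothetical helper and the numbers are made up. It shows how such a triple maps back to a full list of values.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense

# A made-up 6-element vector with non-zero entries at positions 0 and 4.
print(sparse_to_dense(6, [0, 4], [2596.0, 1.0]))
# → [2596.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```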

# Building Decision Tree Classifier

We instantiate a `DecisionTreeClassifier` from `pyspark.ml.classification` with specified parameters such as `seed`, `labelCol`, `featuresCol`, and `predictionCol`. Then, we train the classifier on the assembled training data (`assembled_train_data`) using the `fit()` method, resulting in the `model`. Finally, we print the debug string representation of the trained decision tree model using `model.toDebugString`.


In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(seed=1234,labelCol="Cover_Type",featuresCol="featureVector",predictionCol="prediction")
model=classifier.fit(assembled_train_data)
print(model.toDebugString)
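
Each `If (feature ... <= ...)` line in the debug string is a split chosen because it reduces label impurity. Assuming the classifier is left at its default `impurity="gini"` setting, the idea can be sketched in plain Python as follows. The functions `gini` and `split_gain` are illustrative helpers, not PySpark APIs, and the labels are toy values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_impurity = (
        (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    )
    return gini(parent) - weighted_child_impurity

parent = [1.0, 1.0, 2.0, 2.0]                       # toy labels, two classes
print(gini(parent))                                 # → 0.5
print(split_gain(parent, [1.0, 1.0], [2.0, 2.0]))   # → 0.5 (a perfect split)
```

A tree learner evaluates many candidate splits this way and keeps the one with the largest gain at each node.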

# Feature Importance Analysis

We import the Pandas library as `pd` and create a DataFrame to analyze the feature importances of the trained decision tree model (`model`). The `featureImportances` attribute of the model is converted to a NumPy array and then to a DataFrame, indexed by the input columns (`input_cols`). We sort the DataFrame by the importance of features in descending order to identify the most influential features.


In [None]:
import pandas as pd

pd.DataFrame(model.featureImportances.toArray(),index=input_cols,columns=['importance']).sort_values(by="importance",ascending=False)

# Evaluating Model Performance

We generate predictions for the assembled training data (`assembled_train_data`) with the trained decision tree model (`model`). The output includes the actual cover type (`Cover_Type`), the predicted cover type (`prediction`), and the predicted class probabilities (`probability`). We then use a `MulticlassClassificationEvaluator` to compute accuracy and F1 score (the `f1` metric is the per-class F1 averaged with class-frequency weights) and print both. Because these metrics are computed on the same data the model was trained on, they are optimistic; the held-out `test_data` from the split above gives a fairer estimate of how the model generalizes.


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(assembled_train_data)
predictions.select("Cover_Type","prediction","probability").show(10,truncate=False)

evaluator=MulticlassClassificationEvaluator(labelCol="Cover_Type",predictionCol="prediction")
acc = evaluator.setMetricName("accuracy").evaluate(predictions)
f1 = evaluator.setMetricName("f1").evaluate(predictions)

print('Accuracy: ', acc)
print('F1 Score: ', f1)
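
The metrics above describe the training data. To measure performance on unseen rows, the same steps can be applied to the held-out set. The following is a minimal sketch: `evaluate_holdout` is a hypothetical helper, not part of PySpark, and it assumes only that the assembler and model expose `transform()` and that the evaluator exposes `setMetricName()` and `evaluate()`, as the objects in this notebook do.

```python
def evaluate_holdout(model, assembler, evaluator, holdout_df, metrics=("accuracy", "f1")):
    """Assemble features, predict, and score a held-out DataFrame.

    Returns a dict that maps each metric name to its value.
    """
    # The held-out data needs the same feature vector column as the training data.
    assembled = assembler.transform(holdout_df)
    predictions = model.transform(assembled)
    return {
        name: evaluator.setMetricName(name).evaluate(predictions)
        for name in metrics
    }

# In this notebook the call would be:
# test_metrics = evaluate_holdout(model, vector_assembler, evaluator, test_data)
# print(test_metrics)
```

A large gap between the training metrics and these held-out metrics would suggest that the tree is overfitting.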