<a href="https://colab.research.google.com/github/thomas1631/Portfolio-Projects/blob/main/Challenge_Employee_Attrition_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the challenge notebook
---


In this challenge, you will work with a dataset provided by an HR manager who wants to predict which employees are at risk of leaving the company. The dataset contains four key performance indicators (KPIs) related to each employee. Your task is to use PySpark to build a machine learning model that can predict employee attrition and to identify which KPI is most strongly associated with attrition in this company.

- Please note that the dataset is already clean and ready to be modeled.
- The dataset only contains numerical features.

Installing pyspark

In [1]:
!pip install pyspark



Importing the needed modules and creating the spark session

In [2]:
!pip install plotly



In [4]:
# importing spark session
from pyspark.sql import SparkSession

# data visualization modules
import matplotlib.pyplot as plt
import plotly.express as px

# pandas module
import pandas as pd

# pyspark data preprocessing modules
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer

# pyspark data modeling and model evaluation modules
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# creating the spark session
spark = SparkSession.builder.appName("Challenge").getOrCreate()
spark

Loading the `Challenge_dataset.csv` file

In [5]:
data = spark.read.format('csv').option('header',True).option('inferSchema',True).load('Challenge_dataset.csv')
data.show(5)

+----------+------------------+-------------------+-------------------+-------------------+---------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|
|         3|1.2797294893867808| 1.6690888870054317| 1.9769417044649022| -1.797525912345404|              1|
|         4|0.2576789316661615|0.34201906896710577|0.40342208520171396|-0.3653830886145554|              1|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
only showing top 5 rows



Create the numerical feature vector using `Vector Assembler`.

Hint: The numerical input features are the KPIs.

In [6]:
# Step 1: Select only numerical columns (excluding ID and label)
numerical_columns = [col for col, typ in data.dtypes if typ in ("double", "int")]
numerical_columns.remove("EmployeeID")
numerical_columns.remove("CurrentEmployee")

print(numerical_columns)  # This will print ['KPI1', 'KPI2', 'KPI3', 'KPI4']

# Step 2: Create the feature vector from selected numerical columns
numerical_vector_assembler = VectorAssembler(
    inputCols=numerical_columns,
    outputCol='numerical_feature_vector'
)

# Step 3: Apply the transformation to combine KPI columns into a single vector
data = numerical_vector_assembler.transform(data)

# Step 4: Preview the transformed DataFrame
data.show(3)


['KPI1', 'KPI2', 'KPI3', 'KPI4']
+----------+------------------+-------------------+-------------------+-------------------+---------------+------------------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|numerical_feature_vector|
+----------+------------------+-------------------+-------------------+-------------------+---------------+------------------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|    [1.43471554934780...|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|    [0.89162457358328...|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|    [-0.8911583530982...|
+----------+------------------+-------------------+-------------------+-------------------+---------------+------------------------+
only showing top 3 rows



Apply `Standard Scaler` to the numerical feature vector

In [15]:
# Apply Standard Scaler to the numerical feature vector
scaler = StandardScaler(
    inputCol="numerical_feature_vector",
    outputCol="input_features",
    withStd=True,     # divide by standard deviation
    withMean=True     # subtract mean
)

# Fit the scaler to the data and transform it
data = scaler.fit(data).transform(data)

# Keep only the scaled features and the label column
data = data.select(['input_features', 'CurrentEmployee'])

# Show 3 sample rows
data.take(3)


[Row(input_features=DenseVector([1.0822, 0.599, 0.7812, -0.9177]), CurrentEmployee=1),
 Row(input_features=DenseVector([0.6732, 0.5893, 0.6529, -0.6856]), CurrentEmployee=1),
 Row(input_features=DenseVector([-0.6694, -0.665, -0.7101, 0.7235]), CurrentEmployee=0)]
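
As a rough illustration of what `withMean=True` and `withStd=True` do to each column, the sketch below standardizes a single toy column in plain Python. The `standardize` helper and the `toy_kpi` values are hypothetical and are not taken from `Challenge_dataset.csv`. The sketch uses the sample standard deviation (dividing by n - 1), on the assumption that this matches how Spark computes unit standard deviation.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mu) / sigma for x in values]


# Hypothetical toy column, not taken from the challenge dataset
toy_kpi = [2.0, 4.0, 6.0, 8.0]
scaled = standardize(toy_kpi)

# After scaling, the column has mean 0 and sample standard deviation 1
print([round(z, 4) for z in scaled])
print(round(mean(scaled), 4), round(stdev(scaled), 4))
```

Putting all four KPIs on the same scale keeps them directly comparable, although a decision tree's splits do not depend on the scale of its inputs.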

Split the data into train and test sets

In [16]:
# Split the dataset into training and test sets
train, test = data.randomSplit([0.7, 0.3], seed=7)

# Print sizes of train and test sets
print("Train dataset size:", train.count())
print("Test dataset size:", test.count())

# Show sample from training set
train.show()


Train dataset size: 2805
Test dataset size: 1195
+--------------------+---------------+
|      input_features|CurrentEmployee|
+--------------------+---------------+
|[-2.7489719838207...|              0|
|[-2.5469231027321...|              0|
|[-2.4273739447467...|              0|
|[-2.3661953908693...|              0|
|[-2.2726760231995...|              0|
|[-2.2489699467479...|              0|
|[-2.1945084857628...|              0|
|[-2.1559418796819...|              0|
|[-2.1425630527620...|              0|
|[-2.1243203818980...|              0|
|[-2.0648182573101...|              0|
|[-1.9918021917847...|              0|
|[-1.9586952661221...|              0|
|[-1.9551211704237...|              0|
|[-1.9528595003555...|              0|
|[-1.9240885673650...|              0|
|[-1.9111354894003...|              0|
|[-1.9093410121684...|              0|
|[-1.8996214205352...|              0|
|[-1.8787877694015...|              0|
+--------------------+---------------+
only showing top 20 rows
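
The split sizes above (2805 and 1195) are close to, but not exactly, 70% and 30% of the 4000 rows. That is consistent with each row being assigned by an independent random draw rather than by an exact cut. Below is a minimal plain-Python sketch of that idea, assuming this per-row sampling model. `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce Spark's exact counts.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_weight:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part


rows = list(range(4000))  # stand-in for the 4000 rows of the dataset
train_rows, test_rows = random_split(rows, train_weight=0.7, seed=7)

# Sizes are close to 2800 / 1200 but usually not exact
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is what `seed=7` does in the cell above.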

Train your Decision Tree model. Use `maxDepth = 3`

In [17]:
# Train your Decision Tree model with maxDepth = 3
decision_tree = DecisionTreeClassifier(
    featuresCol='input_features',
    labelCol='CurrentEmployee',
    maxDepth=3
)

dtModel = decision_tree.fit(train)


Perform the prediction on the test set and evaluate it using `BinaryClassificationEvaluator` (area under the ROC curve)

In [18]:
# Perform the prediction on the test set
predictions_test = dtModel.transform(test)

# Initialize the binary classification evaluator
evaluator = BinaryClassificationEvaluator(labelCol="CurrentEmployee")

# Evaluate the AUC (Area Under the ROC Curve)
auc_test = evaluator.evaluate(predictions_test, {evaluator.metricName: "areaUnderROC"})

# Print the AUC result
print("Area under the ROC curve - test set: ", auc_test)

Area under the ROC curve - test set:  0.8895066505170132
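
To make the metric concrete: the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data. `auc_from_scores` and the toy labels and scores are hypothetical. Spark derives the value from the ROC curve itself rather than by enumerating pairs, so this is an illustration of the meaning, not of Spark's implementation.

```python
def auc_from_scores(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)  # 5 of the 6 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.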


Apply hyperparameter tuning to find the proper `maxDepth` for your decision tree from the `candidates` list.

In [24]:
def evaluate_dt(model_params):
    """Train one decision tree per maxDepth and return (test AUCs, train AUCs)."""
    test_accuracies = []
    train_accuracies = []

    # one evaluator is enough; the metric is the area under the ROC curve
    evaluator = BinaryClassificationEvaluator(labelCol='CurrentEmployee', metricName='areaUnderROC')

    for maxD in model_params:
        # train the model with the current maxDepth
        decision_tree = DecisionTreeClassifier(featuresCol='input_features', labelCol='CurrentEmployee', maxDepth=maxD)
        dtModel = decision_tree.fit(train)

        # AUC on the test set
        predictions_test = dtModel.transform(test)
        test_accuracies.append(evaluator.evaluate(predictions_test))

        # AUC on the training set
        predictions_training = dtModel.transform(train)
        train_accuracies.append(evaluator.evaluate(predictions_training))

    return test_accuracies, train_accuracies



candidates = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# write your code here
test_acc, train_acc = evaluate_dt(candidates)
print("test accuracies:", test_acc)
print("train accuracies:", train_acc)

test accuracies: [0.8984842877329925, 0.8861056905098357, 0.8895066505170132, 0.8667681627526187, 0.8199369714913757, 0.8937599533454456, 0.8928711616535451, 0.8861098961487562, 0.8669644259022498, 0.8607989592445552, 0.8489292443307987, 0.8979375546733059, 0.8959454837045511, 0.8925585424937758, 0.9007258932777068, 0.9005548639615997, 0.8982950339815625, 0.8948982796133056, 0.8948982796133056, 0.8912365699930467]
train accuracies: [0.9031408089076441, 0.8965836235605155, 0.8919902381981341, 0.8758781808475481, 0.8373371126420418, 0.9020794671683149, 0.9018379642575691, 0.8993624323156314, 0.886360932455449, 0.8794852174797264, 0.8600841446983756, 0.9220454025472201, 0.9198174746421943, 0.9192765081221242, 0.9401024480768743, 0.9423263085644559, 0.94327427104253, 0.9425639982713476, 0.9425639982713476, 0.9411040496224927]
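
One way to read these two lists side by side is to pair each depth with its test AUC and the train minus test gap, since a widening gap is a common sign of overfitting. The sketch below does that with the values printed above, rounded to four decimals. The variable names are illustrative. Picking the depth with the highest test AUC is only one possible selection rule, and not necessarily the one this notebook follows.

```python
depths = list(range(1, 21))

# Values copied from the printed output above, rounded to four decimals
test_auc = [0.8985, 0.8861, 0.8895, 0.8668, 0.8199, 0.8938, 0.8929, 0.8861, 0.8670, 0.8608,
            0.8489, 0.8979, 0.8959, 0.8926, 0.9007, 0.9006, 0.8983, 0.8949, 0.8949, 0.8912]
train_auc = [0.9031, 0.8966, 0.8920, 0.8759, 0.8373, 0.9021, 0.9018, 0.8994, 0.8864, 0.8795,
             0.8601, 0.9220, 0.9198, 0.9193, 0.9401, 0.9423, 0.9433, 0.9426, 0.9426, 0.9411]

# (depth, test AUC, train-minus-test gap) for every candidate
summary = [(d, te, round(tr - te, 4)) for d, te, tr in zip(depths, test_auc, train_auc)]

# Candidate with the highest test AUC
best_depth, best_test, best_gap = max(summary, key=lambda row: row[1])
print("highest test AUC at depth", best_depth, "->", best_test, "gap", best_gap)

for depth, test_value, gap in summary:
    print(depth, test_value, gap)
```

The gap column shows how far training performance runs ahead of test performance at each depth, which is worth weighing alongside the raw test score.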


Use a line chart to visualize the training and testing accuracy (here measured as AUC).

Hint: To visualize your data, convert the PySpark dataframe to pandas dataframe.

In [25]:
# Create a DataFrame to store hyperparameter tuning results
tuning_dataframe = pd.DataFrame()
tuning_dataframe['MaxDepth'] = candidates
tuning_dataframe['TestAcc'] = test_acc
tuning_dataframe['TrainAcc'] = train_acc

# Visualize training and testing accuracy across different tree depths
px.line(tuning_dataframe, x="MaxDepth", y=["TestAcc", "TrainAcc"])

Train the decision tree using the proper `maxDepth` parameter.  

In [26]:
# Write your code here
decision_tree = DecisionTreeClassifier(featuresCol = 'input_features', labelCol = 'CurrentEmployee', maxDepth = 6)
dtModel = decision_tree.fit(train)

Use the `Feature Importance` to find the most important factor for employee attrition, and show it in a bar chart.

In [27]:
# Write your code here
# featureImportances holds one score per input feature, in the order given to the assembler
scores = dtModel.featureImportances.toArray()

feat_importance = pd.DataFrame(scores, columns=['Score'], index=numerical_columns)
px.bar(feat_importance, y="Score")
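
Tree-based feature importances are, broadly, built from how much each split reduces impurity, accumulated per feature and then normalized. The sketch below shows the impurity decrease for a single split using Gini impurity, assumed here because it is the default impurity for Spark's decision tree classifier. `gini`, `split_gain`, and the toy labels are hypothetical, and this is not Spark's internal code.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_gain(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children


# Hypothetical labels reaching one node, and a split that separates them perfectly
parent = [1, 1, 1, 0, 0, 0, 1, 0]
left = [1, 1, 1, 1]
right = [0, 0, 0, 0]
gain = split_gain(parent, left, right)
print(gini(parent), gain)
```

A feature that is used in splits with large gains, on nodes that many rows pass through, ends up with a high importance score.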