# Welcome to the challenge notebook
---


In this challenge, you will work with a dataset provided by an HR manager who wants to predict which employees are at risk of leaving the company. The dataset contains four key performance indicators (KPIs) related to each employee. Your task is to use PySpark to build a machine learning model that can predict employee attrition and to identify which KPI is most strongly associated with attrition in this company.

- Please note that the dataset is already clean and ready to be modeled.
- The dataset only contains numerical features.

Installing pyspark

Importing the needed modules and creating the spark session

In [None]:
# importing spark session
from pyspark.sql import SparkSession

# data visualization modules
import matplotlib.pyplot as plt
import plotly.express as px

# pandas module
import pandas as pd

# pyspark data preprocessing modules
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer

# pyspark data modeling and model evaluation modules
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# creating the spark session
spark = SparkSession.builder.appName("Challenge").getOrCreate()
spark

Loading the `Challenge_dataset.csv` file

In [None]:
data = spark.read.format('csv').option('header',True).option('inferSchema',True).load('Challenge_dataset.csv')
data.show(5)

+----------+------------------+-------------------+-------------------+-------------------+---------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|
|         3|1.2797294893867808| 1.6690888870054317| 1.9769417044649022| -1.797525912345404|              1|
|         4|0.2576789316661615|0.34201906896710577|0.40342208520171396|-0.3653830886145554|              1|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
only showing top 5 rows


Create the numerical feature vector using `VectorAssembler`.

Hint: The numerical input features are the KPIs.

In [None]:
# the KPIs are the only columns of type double; EmployeeID and CurrentEmployee are inferred as int
num_cols = [col for col, typ in data.dtypes if typ == "double"]

In [None]:
# write your code here
num_vector = VectorAssembler(inputCols=['KPI1', 'KPI2', 'KPI3', 'KPI4'], outputCol='num_vector')
data = num_vector.transform(data)
data.show(10, truncate=False)

+----------+-------------------+--------------------+-------------------+-------------------+---------------+---------------------------------------------------------------------------------+
|EmployeeID|KPI1               |KPI2                |KPI3               |KPI4               |CurrentEmployee|num_vector                                                                       |
+----------+-------------------+--------------------+-------------------+-------------------+---------------+---------------------------------------------------------------------------------+
|0         |1.4347155493478079 |0.8445778971189396  |1.2907117554310856 |-1.4201273531837943|1              |[1.4347155493478079,0.8445778971189396,1.2907117554310856,-1.4201273531837943]   |
|1         |0.8916245735832885 |0.8308158727699302  |1.0779750584283363 |-1.0598957663940176|1              |[0.8916245735832885,0.8308158727699302,1.0779750584283363,-1.0598957663940176]   |
|2         |-0.891158353098296 |-0.94696
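Conceptually, `VectorAssembler` concatenates the chosen columns, in the given order, into one vector per row. The plain-Python sketch below mimics that for the first row of the dataset shown earlier. The `assemble` helper is hypothetical and not part of PySpark; it only illustrates the idea.

```python
# Illustrative only: mimics what VectorAssembler does for a single row.
# `assemble` is a hypothetical helper, not a PySpark function.
def assemble(row, input_cols):
    """Return the values of `input_cols` from `row`, in the given order."""
    return [row[c] for c in input_cols]


# First row of Challenge_dataset.csv as printed by data.show(5)
row = {
    "EmployeeID": 0,
    "KPI1": 1.4347155493478079,
    "KPI2": 0.8445778971189396,
    "KPI3": 1.2907117554310856,
    "KPI4": -1.4201273531837943,
    "CurrentEmployee": 1,
}

# Only the KPI columns go into the feature vector; the ID and the label stay out.
num_vector = assemble(row, ["KPI1", "KPI2", "KPI3", "KPI4"])
print(num_vector)
```

Leaving `EmployeeID` and `CurrentEmployee` out of the vector matters: the first carries no signal and the second is the label itself.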

Apply `StandardScaler` to the numerical feature vector

In [None]:
# write your code here
scaler = StandardScaler(inputCol='num_vector', outputCol='num_scaled', withMean=True)
data = scaler.fit(data).transform(data)
data.show(10, truncate=False)


+----------+-------------------+--------------------+-------------------+-------------------+---------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
|EmployeeID|KPI1               |KPI2                |KPI3               |KPI4               |CurrentEmployee|num_vector                                                                       |num_scaled                                                                       |
+----------+-------------------+--------------------+-------------------+-------------------+---------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
|0         |1.4347155493478079 |0.8445778971189396  |1.2907117554310856 |-1.4201273531837943|1              |[1.4347155493478079,0.8445778971189396,1.2907117554310856,-1.42012735
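With `withMean=True` (and the default `withStd=True`), `StandardScaler` subtracts each feature's mean and divides by its sample standard deviation. The sketch below applies the same formula to one feature in plain Python. It uses only the five `KPI1` values printed by `data.show(5)`, so its numbers will differ from the notebook's `num_scaled` column, which is fitted on the full dataset. The `standard_scale` helper is hypothetical.

```python
from statistics import mean, stdev


# Hypothetical helper: standardizes one feature the way StandardScaler
# standardizes each vector component (zero mean, unit sample std).
def standard_scale(values):
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


# KPI1 values of the first five rows only (an assumption made for brevity;
# the notebook fits the scaler on every row, not on this sample)
kpi1 = [
    1.4347155493478079,
    0.8916245735832885,
    -0.891158353098296,
    1.2797294893867808,
    0.2576789316661615,
]

scaled = standard_scale(kpi1)
print(scaled)
```

After scaling, the values have mean 0 and sample standard deviation 1, up to floating-point error.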

Split the data into train and test sets

In [None]:
# write your code here
train, test = data.randomSplit([0.7, 0.3], seed=100)
print("Train dataset size:", train.count())
print("Test dataset size:",test.count())

Train dataset size: 2813
Test dataset size: 1187
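The two sizes add up to 4000 rows, but the split is 70.3% / 29.7% rather than exactly 70 / 30. That is expected: `randomSplit` assigns rows at random according to the weights instead of cutting exact counts. The sketch below is a simplified model of that behavior, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


# Simplified model of a weighted random split: every row gets an independent
# random draw and lands in the train set with probability `train_fraction`.
def random_split(rows, train_fraction, seed):
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part


rows = list(range(4000))  # 4000 = 2813 + 1187, the sizes printed above
train_rows, test_rows = random_split(rows, 0.7, seed=100)
print(len(train_rows), len(test_rows))
```

The counts land near 2800 and 1200 without matching them exactly, and the same seed always reproduces the same split.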


Train your Decision Tree model. Use `maxDepth = 3`

In [None]:
# write your code here
dt = DecisionTreeClassifier(featuresCol='num_scaled', labelCol='CurrentEmployee', maxDepth=3)
model = dt.fit(train)

Perform the prediction on the test set and evaluate it with `BinaryClassificationEvaluator`, which reports the area under the ROC curve (AUC) rather than plain accuracy.

In [None]:
# write your code here
pred_test = model.transform(test)
pred_test.select(['CurrentEmployee', 'prediction']).show()

+---------------+----------+
|CurrentEmployee|prediction|
+---------------+----------+
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              1|       1.0|
|              0|       0.0|
|              1|       1.0|
|              0|       0.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              1|       0.0|
+---------------+----------+
only showing top 20 rows
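Plain accuracy is the share of rows whose prediction equals the label. As a quick sanity check, the sketch below computes it over the 20 rows displayed above. This is only a sample of the 1187 test rows, so the figure is indicative and is not the model's test accuracy.

```python
# Label and prediction columns copied from the 20 rows shown above
labels = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
predictions = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0,
               1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

# Count rows where the predicted class matches the true class
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)
print(correct, accuracy)  # 19 0.95
```

Only the last displayed row is misclassified: a current employee predicted as a leaver.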


In [None]:
evaluator = BinaryClassificationEvaluator(labelCol="CurrentEmployee")
auc_test = evaluator.evaluate(pred_test, {evaluator.metricName: 'areaUnderROC'})
print("Area under ROC curve - test:", auc_test)

Area under ROC curve - test: 0.8901179103290812
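The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it by brute force over all positive/negative pairs. The `auc` helper and the toy labels and scores are invented for illustration; the evaluator itself works from the model's `rawPrediction` column.

```python
# Hypothetical helper: pairwise (rank-based) definition of ROC AUC.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the notebook
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the 0.89 above means the tree ranks current employees above leavers in most pairs.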


Apply hyperparameter tuning to find the best `maxDepth` for your decision tree from the `candidates` list.

In [None]:
def evaluate_dt(max_depths):
    """Fit one tree per maxDepth; return (test AUCs, train AUCs)."""
    test_accuracies = []
    train_accuracies = []
    # the same evaluator scores both splits; despite the list names, the metric is AUC
    evaluator = BinaryClassificationEvaluator(labelCol='CurrentEmployee', metricName='areaUnderROC')

    for maxD in max_depths:
        # train the model with the current maxDepth
        decision_tree = DecisionTreeClassifier(featuresCol='num_scaled', labelCol='CurrentEmployee', maxDepth=maxD)
        dtModel = decision_tree.fit(train)

        # area under the ROC curve on the test set
        test_accuracies.append(evaluator.evaluate(dtModel.transform(test)))

        # area under the ROC curve on the training set
        train_accuracies.append(evaluator.evaluate(dtModel.transform(train)))

    return test_accuracies, train_accuracies



candidates = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# write your code here
test_acc, train_acc = evaluate_dt(candidates)
print(test_acc)
print(train_acc)

[0.9053763929435552, 0.8939886179046495, 0.8901179103290812, 0.8902087853418604, 0.9030775391046538, 0.9059997387343383, 0.8879851873729172, 0.9153655447389047, 0.8993672827235242, 0.8812817352583691, 0.8601731736962275, 0.8507974282371386, 0.8380664069155884, 0.8304897027251145, 0.833573773471312, 0.8322901639158042, 0.8279196437699499, 0.8301773198686857, 0.8301773198686857, 0.8301773198686857]
[0.9033096386553178, 0.8902327443532626, 0.8857524585208334, 0.8858808566480405, 0.8931472301895337, 0.9040529829311371, 0.8925557899972805, 0.9169650827864617, 0.912998136710642, 0.9004780555468496, 0.8865739032322963, 0.8940836771606019, 0.887620297418591, 0.8821421458258983, 0.8826615516357618, 0.8804524972424733, 0.8751987896201741, 0.8765813442576212, 0.8765813442576212, 0.8765813442576212]
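The best depth can also be read off programmatically instead of by eye. The sketch below takes the two printed lists (rounded to four decimals for readability), picks the depth with the highest test AUC, and computes the train-minus-test gap per depth as a rough overfitting signal. The variable names are illustrative.

```python
candidates = list(range(1, 21))

# Values printed above, rounded to four decimals
test_auc = [0.9054, 0.8940, 0.8901, 0.8902, 0.9031, 0.9060, 0.8880, 0.9154,
            0.8994, 0.8813, 0.8602, 0.8508, 0.8381, 0.8305, 0.8336, 0.8323,
            0.8279, 0.8302, 0.8302, 0.8302]
train_auc = [0.9033, 0.8902, 0.8858, 0.8859, 0.8931, 0.9041, 0.8926, 0.9170,
             0.9130, 0.9005, 0.8866, 0.8941, 0.8876, 0.8821, 0.8827, 0.8805,
             0.8752, 0.8766, 0.8766, 0.8766]

# Depth with the highest AUC on the held-out test set
best_depth, best_test_auc = max(zip(candidates, test_auc), key=lambda pair: pair[1])
print(best_depth, best_test_auc)  # 8 0.9154

# Train-minus-test gap per depth; a growing positive gap hints at overfitting
gap_by_depth = {d: tr - te for d, tr, te in zip(candidates, train_auc, test_auc)}
widest_gap_depth = max(gap_by_depth, key=gap_by_depth.get)
print(widest_gap_depth)
```

Choosing the depth on the test set means the reported test AUC is slightly optimistic; a separate validation split or cross-validation would avoid that, at the cost of more training runs.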


Use a line chart to visualize the training and testing accuracy.

Hint: To visualize your data, convert the PySpark dataframe to pandas dataframe.

In [None]:
# write your code here
df = pd.DataFrame()
df["train_acc"] = train_acc
df["test_acc"] = test_acc
df["candidates"] = candidates

px.line(df, x="candidates", y=["train_acc", "test_acc"])

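### Insights from the line chart:

- At `maxDepth = 8`, both `train_acc` and `test_acc` peak at roughly 0.92 (0.917 on the training set, 0.915 on the test set). Note that these values are areas under the ROC curve, not classification accuracies.

- Beyond depth 8 the test score falls faster than the training score, which suggests overfitting, so the optimal maximum depth for this decision tree is 8.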

Train the decision tree using the proper `maxDepth` parameter.  

In [None]:
# write your code here
dt2 = DecisionTreeClassifier(featuresCol='num_scaled', labelCol='CurrentEmployee', maxDepth=8)
dt2model = dt2.fit(train)
pred_test2 = dt2model.transform(test)

evaluator2 = BinaryClassificationEvaluator(labelCol='CurrentEmployee')
auc_test2 = evaluator2.evaluate(pred_test2, {evaluator2.metricName: "areaUnderROC"})
auc_test2


0.9153655447389047

Use the model's feature importances (`featureImportances`) to find the KPI most strongly associated with employee attrition, and plot them as a bar chart.

In [None]:
# write your code here
input_features = ["KPI1", "KPI2", "KPI3", "KPI4"]
feature_imp = dt2model.featureImportances
scores = feature_imp.toArray()  # dense array with one importance score per input feature
df1 = pd.DataFrame(scores, columns=["scores"], index=input_features)
px.bar(df1, y='scores')
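Besides reading the bar chart, the most important KPI can be found by sorting the scores. The sketch below shows the ranking step in plain Python. The scores are placeholders invented for illustration, NOT output of this model, and `rank_features` is a hypothetical helper. In the notebook the scores would come from `dt2model.featureImportances.toArray()`.

```python
# Hypothetical helper: pair each feature name with its score and sort
# from most to least important.
def rank_features(names, scores):
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


input_features = ["KPI1", "KPI2", "KPI3", "KPI4"]

# Placeholder values only (NOT results from the trained tree)
placeholder_scores = [0.1, 0.2, 0.3, 0.4]

ranking = rank_features(input_features, placeholder_scores)
top_feature, top_score = ranking[0]
print(ranking)
```

The vector order follows the `inputCols` order given to `VectorAssembler`, which is why the names list must match it.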