<h1> Task 2 - Binary Classification using Logistic Regression   </h1>

<h6> GOAL: Build a machine learning pipeline, including a classification model, that predicts `Attrition` (Yes or No) from the features in the dataset (income, work years, education level, marital status, job role, and so on). This is the same dataset used in Lab 3 and Lab 4. </h6>

In [0]:
#  IMPORTING THE NECESSARY LIBRARIES

# To convert categorical variables to numeric
from pyspark.ml.feature import StringIndexer

# To combine the feature columns into one single column
from pyspark.ml.feature import VectorAssembler

# For logistic regression
from pyspark.ml.classification import LogisticRegression

#For building the Pipeline
from pyspark.ml import Pipeline

# For checking the accuracy
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator


# For Hyperparameter tuning
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

<h5> 2.2 Loading the dataset and displaying the schema</h5>

In [0]:
# Loading the dataset (the file has a header row; column types are inferred)
df1 = (spark.read.format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("dbfs:/FileStore/shared_uploads/clvrashmika@gmail.com/EmployeeAttrition.csv"))

In [0]:
# Printing the dataset's schema
df1.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)
 |-- Education: integer (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: integer (nullable = true)
 |-- EmployeeNumber: integer (nullable = true)
 |-- EnvironmentSatisfaction: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: integer (nullable = true)
 |-- JobInvolvement: integer (nullable = true)
 |-- JobLevel: integer (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: integer (nullable = true)
 |-- MonthlyRate: integer (nullable = true)
 |-- NumCompaniesWorked: integer (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string 

<h5> 2.3 Splitting the dataset into training and testing sets and displaying the distribution of HourlyRate and Education </h5>

In [0]:
# Splitting the dataset into train and test dataframes
trainDF, testDF = df1.randomSplit([0.8, 0.2], seed=65)
print(trainDF.cache().count()) # Cache the training data because it is accessed multiple times
print(testDF.count())

1204
266
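
`randomSplit` samples rows at random, so the 0.8/0.2 weights are targets rather than exact proportions. A quick check of the two counts printed above, in plain Python (no Spark session needed):

```python
# Row counts printed by the cell above
train_rows = 1204
test_rows = 266

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"total rows:  {total_rows}")
print(f"train share: {train_share:.3f}")
print(f"test share:  {test_share:.3f}")
```

The shares land close to, but not exactly on, 80% and 20%. Fixing `seed` makes the same split reproducible across runs.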


In [0]:
# Checking the distribution of the 'HourlyRate' field in the training dataset using summary()
display(trainDF.select('HourlyRate').summary())

summary,HourlyRate
count,1204.0
mean,65.80481727574751
stddev,20.50677831411945
min,30.0
25%,48.0
50%,66.0
75%,84.0
max,100.0


In [0]:
# Checking the distribution of the 'Education' field in the training dataset using groupBy
display(trainDF.groupBy('Education').count().sort("count", ascending = False))

Education,count
3,463
4,330
2,232
1,142
5,37


<h5> 2.4 Feature Processing </h5>

In [0]:
#  2.4.1 - Selecting 5 categorical columns from the dataset
categorical_cols = ["Department", "EducationField", "Gender", "JobRole", "MaritalStatus"]

# Converting the above columns to numeric indices using StringIndexer
stringIndexer = StringIndexer(inputCols=categorical_cols, outputCols=[i + "IndexedCol" for i in categorical_cols])

# 2.4.2 - Setting the Attrition column (Yes/No) as the label
# Converting it to a numeric value
labelToNum = StringIndexer(inputCol="Attrition", outputCol="NewAttritionCol")

# Fitting both indexers on the training data
# (the pipeline in 2.6 refits these stages, so these fitted models are for inspection only)
stringIndexerModel = stringIndexer.fit(trainDF)

labelIndexerModel = labelToNum.fit(trainDF)
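
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value receives index 0.0. The plain-Python sketch below mimics that ordering on made-up values. It is an illustration, not Spark's implementation; the helper name `frequency_index` and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to a float index, most frequent first.

    Illustrative stand-in for StringIndexer's default ordering. Ties are
    broken alphabetically here, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up label column in which "No" is the majority value
toy_attrition = ["No", "Yes", "No", "No", "Yes", "No"]
mapping = frequency_index(toy_attrition)
print(mapping)
```

Under this ordering the majority class becomes 0.0. The actual order learned from `trainDF` can be read from `labelIndexerModel.labels`.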

In [0]:
#  2.4.3 and 2.4.4
#  Combining the feature columns into a single vector column
numerical_columns = ["Age", "DailyRate", "Education", "DistanceFromHome", "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "NumCompaniesWorked", "PerformanceRating", "EnvironmentSatisfaction" ]

# Note: only the numerical columns are assembled; the "...IndexedCol" columns from 2.4.1 are not part of "features"
vector_assembler = VectorAssembler(inputCols=numerical_columns, outputCol="features")

<b> 2.5 Defining the Model </b>

In [0]:
# Defining the model for Logistic Regression
log_regression = LogisticRegression(featuresCol="features", labelCol="NewAttritionCol", regParam=1.0)
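
Binary logistic regression turns a weighted sum of the features into a probability with the logistic function, then predicts class 1 when that probability exceeds a threshold (0.5 is the default in Spark). A minimal sketch in plain Python; the weights, intercept, and input row are hypothetical values chosen only for illustration, not values learned by the model above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted class) for one row."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p1 = sigmoid(z)
    return p1, (1.0 if p1 > threshold else 0.0)

# Hypothetical two-feature model, for illustration only
weights = [0.8, -0.5]
intercept = -1.0

p1, label = predict([1.0, 2.0], weights, intercept)
print(round(p1, 4), label)
```

A large `regParam` shrinks the weights toward zero, which tends to pull every predicted probability toward the overall class rate.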

<b> 2.6  -  Building the Pipeline  </b>

In [0]:
# Defining the pipeline from the stages created above
pipeline = Pipeline(stages=[stringIndexer, labelToNum, vector_assembler, log_regression])


# Fitting the pipeline on the training data to obtain the pipeline model
pipelineModel = pipeline.fit(trainDF)


# Applying the pipeline model to the test dataset
predDF = pipelineModel.transform(testDF)

<b> 2.6 (Cont.) - Displaying the Predictions </b>

In [0]:
display(predDF.select("features", "NewAttritionCol", "prediction", "probability"))

features,NewAttritionCol,prediction,probability
"Map(vectorType -> dense, length -> 15, values -> List(18.0, 230.0, 3.0, 3.0, 54.0, 3.0, 1.0, 3.0, 1420.0, 0.0, 0.0, 0.0, 1.0, 3.0, 3.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8101243950612148, 0.1898756049387852))"
"Map(vectorType -> dense, length -> 15, values -> List(19.0, 504.0, 3.0, 10.0, 96.0, 2.0, 1.0, 2.0, 1859.0, 1.0, 1.0, 0.0, 1.0, 4.0, 1.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7910603382589526, 0.2089396617410474))"
"Map(vectorType -> dense, length -> 15, values -> List(20.0, 1097.0, 3.0, 11.0, 98.0, 2.0, 1.0, 1.0, 2600.0, 1.0, 0.0, 0.0, 1.0, 3.0, 4.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8026752384918958, 0.19732476150810419))"
"Map(vectorType -> dense, length -> 15, values -> List(21.0, 984.0, 1.0, 1.0, 70.0, 2.0, 1.0, 2.0, 2070.0, 2.0, 2.0, 2.0, 1.0, 3.0, 4.0))",0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8173454847286873, 0.1826545152713127))"
"Map(vectorType -> dense, length -> 15, values -> List(21.0, 1427.0, 1.0, 18.0, 65.0, 3.0, 1.0, 4.0, 2693.0, 1.0, 0.0, 0.0, 1.0, 3.0, 4.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8216881516476776, 0.1783118483523224))"
"Map(vectorType -> dense, length -> 15, values -> List(22.0, 534.0, 3.0, 15.0, 59.0, 3.0, 1.0, 4.0, 2871.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.0))",0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8094938330177999, 0.19050616698220013))"
"Map(vectorType -> dense, length -> 15, values -> List(22.0, 1256.0, 4.0, 3.0, 48.0, 2.0, 1.0, 4.0, 2853.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8158633289543219, 0.1841366710456781))"
"Map(vectorType -> dense, length -> 15, values -> List(22.0, 391.0, 1.0, 7.0, 75.0, 3.0, 1.0, 2.0, 2472.0, 1.0, 0.0, 0.0, 1.0, 4.0, 4.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8178307847695063, 0.18216921523049368))"
"Map(vectorType -> dense, length -> 15, values -> List(23.0, 885.0, 3.0, 4.0, 58.0, 4.0, 1.0, 1.0, 2819.0, 3.0, 2.0, 2.0, 2.0, 3.0, 1.0))",0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8171309247768448, 0.1828690752231552))"
"Map(vectorType -> dense, length -> 15, values -> List(23.0, 638.0, 3.0, 9.0, 33.0, 3.0, 1.0, 1.0, 1790.0, 1.0, 0.0, 0.0, 1.0, 3.0, 4.0))",1.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8108506929139458, 0.18914930708605415))"

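The `features` vector is positional: each value sits in the slot of the column passed to `VectorAssembler` at that index. The sketch below pairs the first row of the table above with the column names from section 2.4, in plain Python, to make the row easier to read.

```python
# Column order passed to VectorAssembler in section 2.4
numerical_columns = [
    "Age", "DailyRate", "Education", "DistanceFromHome", "HourlyRate",
    "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome",
    "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager",
    "NumCompaniesWorked", "PerformanceRating", "EnvironmentSatisfaction",
]

# Values copied from the first row of the predictions table above
first_row = [18.0, 230.0, 3.0, 3.0, 54.0, 3.0, 1.0, 3.0, 1420.0,
             0.0, 0.0, 0.0, 1.0, 3.0, 3.0]
probability = [0.8101243950612148, 0.1898756049387852]

named = dict(zip(numerical_columns, first_row))
for name, value in named.items():
    print(f"{name:>24}: {value}")

# probability[0] belongs to class 0.0 and probability[1] to class 1.0;
# the larger entry decides the prediction for this row
predicted = float(probability.index(max(probability)))
print("prediction:", predicted)
```

Every row shown above has a class-1 probability near 0.2, which is why all ten predictions are 0.0 regardless of the true label.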

<b> 2.7 - Evaluating the Model </b>

In [0]:
# Plotting the ROC curve
display(pipelineModel.stages[-1], predDF.drop("prediction", "rawPrediction", "probability"), "ROC")

False Positive Rate,True Positive Rate,Threshold
0.0,0.0,0.2089396617410474
0.0,0.0416666666666666,0.2089396617410474
0.0,0.0833333333333333,0.2088354219669564
0.0,0.125,0.1992647781616346
0.0,0.1666666666666666,0.1973247615081041
0.0119047619047619,0.1666666666666666,0.1959153600101811
0.0238095238095238,0.1666666666666666,0.1943071507338811
0.0357142857142857,0.1666666666666666,0.1937053690078395
0.0357142857142857,0.2083333333333333,0.1916581344658113
0.0357142857142857,0.25,0.1898756049387852


In [0]:
# Printing the area under the curve and the accuracy
binary_class_eval = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="NewAttritionCol")

print("Area under ROC curve: ", binary_class_eval.evaluate(predDF))

multi_class_eval = MulticlassClassificationEvaluator(metricName="accuracy", labelCol="NewAttritionCol")

print("Accuracy: ", multi_class_eval.evaluate(predDF))

Area under ROC curve:  0.7369510015987963
Accuracy:  0.8157894736842105
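
The two metrics answer different questions. Accuracy looks at the thresholded predictions, while the area under the ROC curve looks only at how well the scores rank positives above negatives. The sketch below uses the pairwise definition of AUC on made-up labels and scores. It is an illustration of the idea and an assumption of this sketch, not the procedure `BinaryClassificationEvaluator` uses internally.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Share of rows whose thresholded score matches the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up data: every score is below 0.5, so every prediction is 0
toy_labels = [1, 0, 1, 0]
toy_scores = [0.45, 0.40, 0.20, 0.15]

print("AUC:     ", pairwise_auc(toy_labels, toy_scores))
print("Accuracy:", accuracy(toy_labels, toy_scores))

# The accuracy reported above, expressed as a count of the 266 test rows
correct_rows = round(0.8157894736842105 * 266)
print("Correct test rows:", correct_rows)
```

In the toy data the model never predicts class 1, yet the ranking is still informative. The same pattern is possible here: the reported accuracy corresponds to 217 correct rows out of 266.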


<b> 2.8 - Hyperparameter Tuning </b>

In [0]:
# Using ParamGridBuilder
parameterGrid = (ParamGridBuilder()
                 .addGrid(log_regression.regParam, [0.01, 0.5, 2.0])
                 .addGrid(log_regression.elasticNetParam, [0.0, 0.5, 1.0])
                 .build())
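
The grid is the Cartesian product of the two value lists, so it holds nine parameter combinations. `elasticNetParam` mixes the L1 and L2 penalties and `regParam` scales the result. The sketch below enumerates the grid and evaluates the elastic-net penalty in plain Python; the weight vector is hypothetical and the helper name `penalty` is an assumption of this sketch.

```python
from itertools import product

reg_params = [0.01, 0.5, 2.0]
elastic_net_params = [0.0, 0.5, 1.0]

# Every (regParam, elasticNetParam) combination, as ParamGridBuilder would produce
grid = [
    {"regParam": r, "elasticNetParam": a}
    for r, a in product(reg_params, elastic_net_params)
]

def penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: regParam * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

# Hypothetical weight vector, for illustration only
toy_weights = [1.0, -2.0]

print("combinations:", len(grid))
for params in grid:
    value = penalty(toy_weights, params["regParam"], params["elasticNetParam"])
    print(params, "penalty =", round(value, 4))
```

With `elasticNetParam=0.0` the penalty is pure L2 (ridge), and with `1.0` it is pure L1 (lasso).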

In [0]:
# Using CrossValidator

# Creating a 3-fold CrossValidator
cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=parameterGrid, evaluator=binary_class_eval, numFolds=3)

# Running cross-validation to find the best model
cross_validator_model = cross_validator.fit(trainDF)
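
With 3 folds and 9 parameter combinations, the search fits the pipeline 27 times, then refits once on the full training set with the best combination. The sketch below counts the fits and shows how k-fold splitting partitions row indices. It uses contiguous folds for readability, which is an assumption of this sketch; Spark assigns rows to folds at random.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

num_folds = 3
num_param_maps = 9  # 3 regParam values x 3 elasticNetParam values

fits_during_search = num_folds * num_param_maps
total_fits = fits_during_search + 1  # plus the final refit with the best parameters

print("fits during search:", fits_during_search)
print("total fits:", total_fits)

# Ten toy rows split into three folds
folds = list(k_fold_indices(10, num_folds))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each row is used for validation exactly once, so every parameter combination is scored on data it was not trained on.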

<b> 2.9 - Making predictions and evaluating the model performance </b>

In [0]:
cvPredDF = cross_validator_model.transform(testDF)


#Evaluating the Model performance 
print("Area under ROC curve: ", binary_class_eval.evaluate(cvPredDF))
print("Accuracy: ", multi_class_eval.evaluate(cvPredDF))

Area under ROC curve:  0.7123107307439106
Accuracy:  0.8157894736842105


<b> 2.10 Use SQL Commands </b>

In [0]:
# 2.10.1 Creating a temporary view of the predictions dataset
cvPredDF.createOrReplaceTempView("finalPredictions")

<b> 2.10.2  Displaying the predictions grouped by JobRole - Bar Chart</b>

In [0]:
%sql
SELECT JobRole, prediction, COUNT(*) AS Count
FROM finalPredictions
Group By JobRole, prediction
Order By JobRole

JobRole,prediction,Count
Healthcare Representative,0.0,20
Human Resources,0.0,9
Laboratory Technician,0.0,58
Laboratory Technician,1.0,1
Manager,0.0,19
Manufacturing Director,0.0,26
Research Director,0.0,17
Research Scientist,1.0,1
Research Scientist,0.0,51
Sales Executive,0.0,51
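
The grouped rows are easier to compare once both prediction values for a role sit side by side. The plain-Python sketch below pivots only the rows displayed above; the variable names are assumptions of this sketch.

```python
from collections import defaultdict

# (JobRole, prediction, Count) rows copied from the output above
rows = [
    ("Healthcare Representative", 0.0, 20),
    ("Human Resources", 0.0, 9),
    ("Laboratory Technician", 0.0, 58),
    ("Laboratory Technician", 1.0, 1),
    ("Manager", 0.0, 19),
    ("Manufacturing Director", 0.0, 26),
    ("Research Director", 0.0, 17),
    ("Research Scientist", 1.0, 1),
    ("Research Scientist", 0.0, 51),
    ("Sales Executive", 0.0, 51),
]

# Pivot to one entry per role: {prediction: count}
by_role = defaultdict(dict)
for role, prediction, count in rows:
    by_role[role][prediction] = count

for role in sorted(by_role):
    predicted_0 = by_role[role].get(0.0, 0)
    predicted_1 = by_role[role].get(1.0, 0)
    print(f"{role:<26} predicted 0.0: {predicted_0:>3}   predicted 1.0: {predicted_1:>3}")

total_predicted_1 = sum(counts.get(1.0, 0) for counts in by_role.values())
print("rows predicted 1.0:", total_predicted_1)
```

Among the rows shown, only two are predicted as class 1.0, one Laboratory Technician and one Research Scientist.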



<b> 2.10.3  Displaying the predictions grouped by Age - Bar Chart </b>

In [0]:
%sql
SELECT Age, prediction, COUNT(*) AS Count
FROM finalPredictions
Group By Age, prediction
Order By Age


Age,prediction,Count
18,0.0,1
19,1.0,1
20,0.0,1
21,0.0,2
22,0.0,3
23,0.0,4
24,0.0,5
25,0.0,4
26,0.0,7
27,0.0,11


