# Project 3
Authors: Richard Xiao




## Report Components

### Introduction

Supervised learning is a machine learning approach in which labeled datasets are used to train algorithms to classify data or predict outcomes from input data. Once trained, such models can make predictions in real time, which supports faster decision-making and can save time and resources for organizations.

The dataset I used for this project details employee attrition from a sample of IBM employees and was obtained from Kaggle. The goal here is to classify employee attrition based on 7 indicators: gender, department, age, traveling distance, overtime, marital status and monthly income. Do certain indicator variables influence the chance of an employee leaving, and which classification models best predict employee attrition from these indicator variables?

## Splitting the Data, Metrics, and Models

For my classification models, the metrics I want to use are log loss and accuracy. The advantage of accuracy is its overall simplicity: it's easy to understand and implement. It's a great metric for balanced datasets, where the number of positive and negative cases is roughly equal. However, if the dataset isn't balanced, accuracy is no longer a useful metric, because a model can score well simply by predicting the majority class.
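
To make the imbalance problem concrete, here is a minimal sketch in plain Python. The label counts are hypothetical and are not taken from the employee dataset; the `accuracy` helper is written only for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical imbalanced labels: 90 employees stayed (0), 10 left (1).
y_true = [0] * 90 + [1] * 10

# A "model" that always predicts the majority class (nobody leaves).
always_stay = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_stay)
leavers_caught = sum(1 for t, p in zip(y_true, always_stay) if t == 1 and p == 1)

print(baseline_accuracy)  # 0.9
print(leavers_caught)     # 0
```

The baseline reaches 90% accuracy without identifying a single employee who left, which is why accuracy alone can be misleading on imbalanced data.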

Log loss is easy to optimize and is useful for comparing models on their probabilistic output. It also penalizes confident wrong predictions quite strongly, meaning that it is sensitive to the uncertainty of the model's predictions. However, log loss is sensitive to outliers, which can lead to misleading results or even overfitting. Log loss is also affected by imbalanced classes, since it can be biased towards the majority class, which means a model can score well overall while performing poorly on certain parts of a dataset.
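
As a rough illustration of how log loss treats confident mistakes, the sketch below computes binary log loss by hand. The probabilities are made up for the example, and the `log_loss` helper is an assumption of this sketch, not a PySpark function.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]

# Hypothetical model A: always right, but only mildly confident.
cautious = [0.6, 0.4, 0.6, 0.4]
# Hypothetical model B: very confident, and badly wrong on the third row.
overconfident = [0.99, 0.01, 0.01, 0.01]

print(log_loss(y_true, cautious))
print(log_loss(y_true, overconfident))
```

A single confident mistake pushes model B's log loss well above model A's, even though B is nearly perfect on the other three rows.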

There are a number of reasons why we want to split our data into training and test sets. We want to evaluate the performance of a model on unseen data. To do this, we train the model on one subset of the data, the training set, and evaluate its performance on another subset, the test set, which gives us an estimate of how well the model will generalize to new data. Comparing the training and test error helps us detect overfitting. Keeping the test and training data separate also prevents data leakage, so the performance estimate stays unbiased.

In [None]:
import os 
os.getcwd()

In [None]:
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.pandas as ps
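# MulticlassClassificationEvaluator is used for accuracy and log loss in the sections below
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator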
spark = SparkSession.builder.master('local[*]').appName('my_app').getOrCreate()

In [None]:
from pyspark.ml.feature import SQLTransformer, StringIndexer,Binarizer,VectorAssembler, PolynomialExpansion

In [None]:
employee_data = spark.read.load("employee_attrition.csv",
                            format = "csv",
                            sep = ",",
                            inferSchema = "true",
                            header = "true")

employee_data

In [None]:
select_employee = employee_data.select(['Age','Attrition','MaritalStatus','Gender','OverTime','DistanceFromHome','MonthlyIncome','Department'])
select_employee.show()

In [None]:
from pyspark.ml.regression import LinearRegression

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

## Splitting testing and training data set

I decided to use an 80/20 split between training and test data.

In [None]:
train, test = select_employee.randomSplit([0.8,0.2], seed = 1)
print(train.count(), test.count())

In [None]:
from pyspark.ml import Pipeline

## Model Testing

Here I start testing my 5 models. The 5 I've chosen are elastic net, lasso, decision tree, random forest and Naive Bayes.

## Elastic Net

Elastic net logistic regression combines the lasso (L1) and ridge (L2) penalties of regularized logistic regression. It adds a penalty term to the loss function: the L1 part performs variable selection, and the L2 part controls the magnitude of the coefficients. This improves the stability of the model and reduces overfitting.

The code below declares the indexer transformation, which converts the categorical variables into numeric variables to be used for testing.

In [None]:
attrition_indexer = StringIndexer(inputCols = ["Attrition","Department","Gender","OverTime","MaritalStatus"], outputCols = ["attrition_numeric","Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"])
attritionTrans = attrition_indexer.fit(select_employee)
attritionTrans.transform(select_employee).show(5)

The Binarizer transformation below converts the indexed columns into binary (0/1) indicator variables to be tested on.

In [None]:
binaryattritionTrans = Binarizer(threshold = 0.5, inputCols = ["attrition_numeric", "Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"], outputCols = ["attrition_indicator","department_indicator","gender_indicator","overtime_indicator","maritalstatus_indicator"])
binaryattritionTrans.transform(attritionTrans.transform(select_employee)).show(30)
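
The sketch below mimics, in plain Python, what the two transformations above do to a categorical column. It assumes StringIndexer's default ordering (most frequent label gets index 0.0, with ties broken alphabetically) and Binarizer's rule that values strictly greater than the threshold become 1.0. The department labels and counts are hypothetical, and the helper names are invented for this illustration.

```python
from collections import Counter

def frequency_index(values):
    """Map each label to an index by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def binarize(x, threshold=0.5):
    """Return 1.0 if x is strictly greater than the threshold, else 0.0."""
    return 1.0 if x > threshold else 0.0

# Hypothetical column with three categories.
departments = ["R&D"] * 5 + ["Sales"] * 3 + ["HR"] * 1

mapping = frequency_index(departments)
indicators = {label: binarize(idx) for label, idx in mapping.items()}

print(mapping)     # {'R&D': 0.0, 'Sales': 1.0, 'HR': 2.0}
print(indicators)  # {'R&D': 0.0, 'Sales': 1.0, 'HR': 1.0}
```

With a threshold of 0.5, a column with three categories collapses to "most frequent category" versus "everything else", so the resulting indicator no longer distinguishes the two less frequent categories.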

The SQL transformer selects all of the indicator variables and declares attrition as the label, or response variable. A log transformation is applied to monthly income.

In [None]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT department_indicator,overtime_indicator,maritalstatus_indicator, Age, gender_indicator,DistanceFromHome,log(MonthlyIncome) as log_monthly_income, attrition_numeric as label FROM __THIS__
                """
)

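The assembler transformation below combines the predictor columns into a single features vector. Note that log_monthly_income is selected by the SQL transformer but is not listed in the assembler's inputCols, so it does not enter the feature vector. With this, everything is ready to be put into a pipeline for model testing.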

In [None]:
assembler = VectorAssembler(inputCols = ["department_indicator", "Age","gender_indicator","overtime_indicator","maritalstatus_indicator","DistanceFromHome"], outputCol = "features", handleInvalid = 'keep')

In [None]:
from pyspark.ml.classification import LogisticRegression

For the code below, the logistic regression object is declared. The regParam option controls the strength of the regularization term in the elastic net model, that is, how much shrinkage is applied to the coefficients. Multiple values are supplied so that cross-validation can pick the best performing one. The values 0.01 and 0.1 serve as the bounds, since 0.01 is a good starting value that provides regularization without overly penalizing the coefficients. regParam shouldn't be set too low or too high, since that risks overfitting or underfitting the model, respectively.

Next, the elasticNetParam grid controls the ratio between the L1 and L2 penalties in the elastic net model. elasticNetParam takes values between 0 and 1, where 0 is pure L2 (ridge) regularization and 1 is pure L1 (lasso) regularization. Like regParam, it is a tuning hyperparameter used to find the best performing model.
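
To show how the two hyperparameters interact, here is a small sketch of the penalty term, assuming the form given in the Spark MLlib documentation: regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2). The coefficient values are hypothetical and the function name is invented for this illustration.

```python
def elastic_net_penalty(coefs, reg_param, elastic_net_param):
    """Penalty added to the loss for a given coefficient vector (assumed MLlib form)."""
    l1 = sum(abs(w) for w in coefs)   # lasso part: sum of absolute values
    l2 = sum(w * w for w in coefs)    # ridge part: sum of squares
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2)

coefs = [0.5, -1.0, 0.0, 2.0]  # hypothetical coefficients

ridge_only = elastic_net_penalty(coefs, reg_param=0.1, elastic_net_param=0.0)
mixed = elastic_net_penalty(coefs, reg_param=0.1, elastic_net_param=0.5)
lasso_only = elastic_net_penalty(coefs, reg_param=0.1, elastic_net_param=1.0)

print(ridge_only, mixed, lasso_only)
```

Raising reg_param scales the whole penalty, while elastic_net_param only shifts weight between the L1 and L2 parts.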

In [None]:
lr = LogisticRegression(featuresCol="features", labelCol="label")
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5]) \
    .build()
pipeline = Pipeline(stages = [attritionTrans, binaryattritionTrans, sqlTrans, assembler,lr])

Declare evaluator variable to dictate the metric I want to use.

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")

Cross validator variable is set.

In [None]:
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator= evaluator,
                    numFolds=5)
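
For intuition about what numFolds=5 implies, the sketch below partitions row indices into folds and counts the model fits needed for the 2 x 2 grid of regParam and elasticNetParam values. It is a simplification: it assigns rows to folds by striping, whereas Spark's CrossValidator assigns them randomly, and the function name is hypothetical.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (training_indices, validation_indices) pairs, one per fold."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation

splits = list(k_fold_indices(n_rows=10, num_folds=5))
for training, validation in splits:
    print(len(training), "training rows, validation rows:", validation)

grid_size = 2 * 2                       # two regParam values x two elasticNetParam values
total_fits = grid_size * len(splits)    # one fit per parameter combination per fold
print(total_fits)  # 20
```

Each row is used for validation exactly once, and the parameter combination with the best average validation metric is then refit on the full training set.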

The cross-validated model is fit on the training data.

In [None]:
cvModel = cv.fit(train)

The bestModel attribute refers to the best model selected by cross-validation based on the evaluation metric.

In [None]:
bestModel = cvModel.bestModel

Code below shows test data set being used.

In [None]:
predictions = bestModel.transform(test)
predictions.show(10)

Accuracy results code below

In [None]:
evaluator.evaluate(predictions)

New evaluator is declared for log loss calculation.

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="logLoss")
logLoss = evaluator.evaluate(predictions)
print(logLoss)

Note that this process is repeated 4 more times. The only difference is the type of classifier used for each model.

## Lasso Regression

Lasso is used here as a regularization technique to prevent overfitting and improve the predictive performance of the logistic regression model. It adds an L1 penalty term to the logistic regression loss function, which allows the model to select the most important predictors and eliminate extraneous variables. This reduces the variance of the model by shrinking the coefficients of the less important variables towards zero. Lasso has an advantage on datasets where overfitting is likely or where there is a large number of variables. In this case, I don't think it will make much of a difference since the dataset has relatively few predictors.
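
One way to see how an L1 penalty eliminates variables is the soft-thresholding operator, which appears in many lasso solvers. This is only an illustration of the shrinkage effect; it is not a description of how Spark fits the model, and the coefficients and threshold below are hypothetical.

```python
def soft_threshold(w, lam):
    """Shrink w towards zero by lam, setting it exactly to zero if |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

coefs = [2.0, -0.3, 0.05, -1.5]  # hypothetical unpenalized coefficients
shrunk = [soft_threshold(w, 0.5) for w in coefs]

print(shrunk)  # [1.5, 0.0, 0.0, -1.0]
```

Small coefficients are set exactly to zero, which is the variable-selection behaviour, while large ones are only shrunk.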

In [None]:
attrition_indexer = StringIndexer(inputCols = ["Attrition","Department","Gender","OverTime","MaritalStatus"], outputCols = ["attrition_numeric","Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"])
attritionTrans = attrition_indexer.fit(select_employee)
attritionTrans.transform(select_employee).show(5)

In [None]:
binaryattritionTrans = Binarizer(threshold = 0.5, inputCols = ["attrition_numeric", "Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"], outputCols = ["attrition_indicator","department_indicator","gender_indicator","overtime_indicator","maritalstatus_indicator"])
binaryattritionTrans.transform(attritionTrans.transform(select_employee)).show(30)

In [None]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT department_indicator,overtime_indicator,maritalstatus_indicator, Age, gender_indicator,DistanceFromHome,log(MonthlyIncome) as log_monthly_income, attrition_numeric as label FROM __THIS__
                """
)

In [None]:
assembler = VectorAssembler(inputCols = ["department_indicator", "Age","gender_indicator","overtime_indicator","maritalstatus_indicator","DistanceFromHome"], outputCol = "features", handleInvalid = 'keep')

In the two cells below, elasticNetParam is set to 1 to represent L1 regularization, since lasso models use L1 regularization. regParam serves the same purpose as it does in my elastic net model.

In [None]:
lasso = LogisticRegression(featuresCol="features", labelCol="label", elasticNetParam=1.0)

In [None]:
paramGrid = ParamGridBuilder() \
    .addGrid(lasso.regParam, [0.01, 1.0]) \
    .build()
pipeline = Pipeline(stages = [attritionTrans, binaryattritionTrans, sqlTrans, assembler,lasso])

In [None]:
evaluator=MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")

In [None]:
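crossval = CrossValidator(estimator=pipeline,  # cross-validate the full pipeline, not the bare classifier
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

In [None]:
cvModel = crossval.fit(train)  # fit the lasso cross-validator declared above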

In [None]:
bestModel = cvModel.bestModel

In [None]:
predictions = bestModel.transform(test)
predictions.show(10)

In [None]:
evaluator.evaluate(predictions)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(accuracy)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="logLoss")
logLoss = evaluator.evaluate(predictions)
print(logLoss)

## Random Forests

Random forest is an ensemble model that combines the predictions of multiple decision trees to improve the accuracy and robustness of the model. For each tree, it draws a random subset of the training data and of the features, constructs a decision tree from them, and repeats this process until a predefined number of trees is reached. The trees' predictions are then combined to make a final prediction. This reduces overfitting and variance and improves the generalization performance of the model. It's similar to bagging, except that it doesn't use all of the predictors, so that a single strong predictor cannot dominate every tree and skew the results.
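
The three ingredients described above (row sampling, feature sampling and combining predictions) can be sketched in plain Python. This is a simplified illustration with hypothetical helper names; it uses a hard majority vote, whereas an implementation may instead average the class probabilities from each tree.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider."""
    return rng.sample(features, k)

def majority_vote(votes):
    """Return the class predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(1)
rows = list(range(10))
features = ["Age", "DistanceFromHome", "overtime_indicator", "gender_indicator"]

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
print(sample, subset)

# Hypothetical votes from five trees for one employee.
print(majority_vote([1, 0, 0, 1, 0]))  # 0
```

Because each tree sees a different sample and different features, their individual errors are less correlated, which is what lowers the variance of the combined prediction.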

In [None]:
attrition_indexer = StringIndexer(inputCols = ["Attrition","Department","Gender","OverTime","MaritalStatus"], outputCols = ["attrition_numeric","Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"])
attritionTrans = attrition_indexer.fit(select_employee)
attritionTrans.transform(select_employee).show(5)

In [None]:
binaryattritionTrans = Binarizer(threshold = 0.5, inputCols = ["attrition_numeric", "Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"], outputCols = ["attrition_indicator","department_indicator","gender_indicator","overtime_indicator","maritalstatus_indicator"])
binaryattritionTrans.transform(attritionTrans.transform(select_employee)).show(30)

In [None]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT department_indicator,overtime_indicator,maritalstatus_indicator, Age, gender_indicator,DistanceFromHome,log(MonthlyIncome) as log_monthly_income, attrition_numeric as label FROM __THIS__
                """
)

In [None]:
assembler = VectorAssembler(inputCols = ["department_indicator", "Age","gender_indicator","overtime_indicator","maritalstatus_indicator","DistanceFromHome"], outputCol = "features", handleInvalid = 'keep')

In [None]:
from pyspark.ml.classification import RandomForestClassifier

For my random forest, I set numTrees in the RandomForestClassifier and in the ParamGridBuilder to indicate how big the forest is. maxDepth is a tuning parameter that dictates how deep each tree can grow. maxDepth is capped at 5 so that it doesn't exceed the number of indicators present in the model. The minInstancesPerNode option sets the minimum number of training instances each child node must have after a split. A higher value for minInstancesPerNode results in smaller trees with fewer splits, which can lower the variance of the model. However, a very high value can increase the bias of the model.

In [None]:
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees = 50, maxDepth = 5)

In [None]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [20, 30, 50]) \
    .addGrid(rf.maxDepth, [1, 3, 5]) \
    .addGrid(rf.minInstancesPerNode, [1, 5, 10]) \
    .build()
pipeline = Pipeline(stages = [attritionTrans, binaryattritionTrans, sqlTrans, assembler,rf])

In [None]:
# Note: BinaryClassificationEvaluator defaults to metricName="areaUnderROC", not accuracy
evaluator = BinaryClassificationEvaluator(labelCol='label')

In [None]:
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [None]:
cvModel = cv.fit(train)

In [None]:
predictions = cvModel.transform(test)
predictions.show(5)

In [None]:
evaluator.evaluate(predictions)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="logLoss")
logloss = evaluator.evaluate(predictions)
print(logloss)

## Decision Tree

Decision trees make choices based on different criteria to help classify and predict outcomes. A tree starts with a single node (or point) and then branches out into different paths. Each branch of the tree represents a decision or choice, and each leaf node represents a possible outcome. The tree is constructed by splitting the data on different criteria so as to minimize the error or maximize the information gained at each split.
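
As a sketch of how a split can be scored, the code below computes Gini impurity and the impurity decrease of a candidate split. Gini is assumed here because it is the default impurity for Spark's DecisionTreeClassifier; the labels and helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A weak split leaves both children mixed; a perfect split separates the classes.
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print(weak, perfect)  # 0.125 0.5
```

At each node the tree picks the split with the largest decrease, subject to constraints such as maxDepth and minInstancesPerNode.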

In [None]:
attrition_indexer = StringIndexer(inputCols = ["Attrition","Department","Gender","OverTime","MaritalStatus"], outputCols = ["attrition_numeric","Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"])
attritionTrans = attrition_indexer.fit(select_employee)
attritionTrans.transform(select_employee).show(5)

In [None]:
binaryattritionTrans = Binarizer(threshold = 0.5, inputCols = ["attrition_numeric", "Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"], outputCols = ["attrition_indicator","department_indicator","gender_indicator","overtime_indicator","maritalstatus_indicator"])
binaryattritionTrans.transform(attritionTrans.transform(select_employee)).show(30)

In [None]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT department_indicator,overtime_indicator,maritalstatus_indicator, Age, gender_indicator,DistanceFromHome,log(MonthlyIncome) as log_monthly_income, attrition_numeric as label FROM __THIS__
                """
)

In [None]:
assembler = VectorAssembler(inputCols = ["department_indicator", "Age","gender_indicator","overtime_indicator","maritalstatus_indicator","DistanceFromHome"], outputCol = "features", handleInvalid = 'keep')

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

Code below declares the decision tree classifier and param grid. Used the same parameters that I used for the random forest model.

In [None]:
dt = DecisionTreeClassifier(maxDepth=2, labelCol="label", leafCol="leafId")

In [None]:
param_grid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [1,3,5]) \
    .addGrid(dt.minInstancesPerNode, [1, 5, 10]) \
    .build()
pipeline = Pipeline(stages = [attritionTrans, binaryattritionTrans, sqlTrans, assembler,dt])

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='label') 

In [None]:
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

In [None]:
cvModel = cv.fit(train)

In [None]:
bestModel = cvModel.bestModel

In [None]:
predictions = bestModel.transform(test)

In [None]:
# This evaluator reports area under the ROC curve (its default metric), not accuracy
auc = evaluator.evaluate(predictions)
print(auc)

In [None]:
predictions = cvModel.transform(test)
predictions.show(100)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="logLoss")
log_loss = evaluator.evaluate(predictions)
print("Log loss: ", log_loss)

## Naive Bayes

Naive Bayes classifies data into different categories or classes based on the probabilities of the features. As the name suggests, it is based on Bayes' theorem, and it assumes that the features used for classification are independent of each other given the class. It calculates the probability of each feature given a particular class, then multiplies these probabilities together with the class prior to get a score proportional to the probability of that class given the features. One of the key advantages of Naive Bayes is its simplicity and interpretability.
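
The multiplication step can be sketched directly. All probabilities below are made up for the illustration and are not estimated from the employee data; the function name is hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Multiply the prior by each feature likelihood per class, then normalize."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

# Hypothetical class priors and per-class likelihoods of two observed features,
# e.g. P(works overtime | class) and P(single | class).
priors = {"stay": 0.8, "leave": 0.2}
likelihoods = {"stay": [0.2, 0.5], "leave": [0.6, 0.5]}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

Here the unnormalized scores are 0.08 for "stay" and 0.06 for "leave", so "stay" is predicted with probability 0.08 / 0.14, about 0.57.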

In [None]:
attrition_indexer = StringIndexer(inputCols = ["Attrition","Department","Gender","OverTime","MaritalStatus"], outputCols = ["attrition_numeric","Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"])
attritionTrans = attrition_indexer.fit(select_employee)
attritionTrans.transform(select_employee).show(5)

In [None]:
binaryattritionTrans = Binarizer(threshold = 0.5, inputCols = ["attrition_numeric", "Department_numeric","gender_numeric","overtime_numeric","maritalstatus_numeric"], outputCols = ["attrition_indicator","department_indicator","gender_indicator","overtime_indicator","maritalstatus_indicator"])
binaryattritionTrans.transform(attritionTrans.transform(select_employee)).show(30)

In [None]:
sqlTrans = SQLTransformer(
    statement = """
                SELECT department_indicator,overtime_indicator,maritalstatus_indicator, Age, gender_indicator,DistanceFromHome,log(MonthlyIncome) as log_monthly_income, attrition_numeric as label FROM __THIS__
                """
)

In [None]:
assembler = VectorAssembler(inputCols = ["department_indicator", "Age","gender_indicator","overtime_indicator","maritalstatus_indicator","DistanceFromHome"], outputCol = "features", handleInvalid = 'keep')

In [None]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

The code below declares the Naive Bayes object and sets up the parameter grid for the model. The smoothing hyperparameter controls the regularization applied to the model. Smoothing tackles the problem of zero probabilities, which occur when a feature value hasn't been seen with a given class in the training dataset. The value passed to the smoothing option is the alpha that is added to the counts to adjust the probability estimates of the model. A larger value means more smoothing, which can improve the robustness and generalization of the model. Multiple values are included so that the best one is picked during cross-validation.
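
A minimal sketch of additive (Laplace) smoothing follows, assuming the usual form (count + alpha) / (total + alpha * number_of_values). The counts are hypothetical and the function name is invented for this illustration.

```python
def smoothed_probability(count, total, num_values, alpha):
    """Additive smoothing of a conditional probability estimate."""
    return (count + alpha) / (total + alpha * num_values)

# Hypothetical: a feature value never observed among 20 rows of one class,
# for a feature with 2 possible values.
no_smoothing = smoothed_probability(count=0, total=20, num_values=2, alpha=0.0)
with_smoothing = smoothed_probability(count=0, total=20, num_values=2, alpha=1.0)

print(no_smoothing)    # 0.0
print(with_smoothing)  # 1/22, roughly 0.045
```

Without smoothing, the zero estimate would wipe out the whole product of probabilities for that class; with alpha = 1.0 the estimate stays small but nonzero.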

In [None]:
nb = NaiveBayes(featuresCol="features", labelCol="label")

In [None]:
paramGrid = ParamGridBuilder() \
    .addGrid(nb.smoothing, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]) \
    .build()
pipeline = Pipeline(stages = [attritionTrans, binaryattritionTrans, sqlTrans, assembler,nb])

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")

In [None]:
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [None]:
cvModel = cv.fit(train)

In [None]:
bestModel = cvModel.bestModel

In [None]:
predictions = bestModel.transform(test)
predictions.show(4)

In [None]:
accuracy = evaluator.evaluate(predictions)
print(accuracy)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="logLoss")
log_loss = evaluator.evaluate(predictions)
print("Log loss: ", log_loss)

## Model Observations

According to the outputs from these 5 models, the elastic net and lasso models had the highest accuracy (0.84) and the lowest log loss (0.3926). Note that both of these models had exactly the same accuracy and log loss values. The random forest model has the next highest score of about 0.76 and a log loss of about 0.40. Naive Bayes has an accuracy of about 0.71 and a log loss of 0.64. Finally, the decision tree model has a score of 0.53 and a log loss of 4.9. One caveat is that the random forest and decision tree scores were computed with BinaryClassificationEvaluator, whose default metric is area under the ROC curve rather than accuracy, so they are not directly comparable to the accuracy values of the other three models.

Based on these observations, the elastic net and lasso models are the best models since they outperformed the others. The random forest model is the next best, with a lower score, although its log loss is very close to that of the elastic net and lasso models. Naive Bayes follows, and here there is a significant drop in model quality, with a marked increase in log loss. Finally, the decision tree is by far the worst model, with a score of only 0.53 and by far the highest log loss (4.9, well above the Naive Bayes value of 0.64).

It is also worth noting that adding more indicators significantly increased accuracy and decreased log loss for the random forest, decision tree and Naive Bayes models. Adding overtime, marital status and travel distance made these models perform better, so these three indicators may be important in determining the chance of an employee leaving, at least in the context of these three models. However, these indicators did not noticeably affect the metrics for the elastic net and lasso models.

In conclusion, the elastic net and lasso models would serve as the best models for classifying employee attrition based on the seven indicator variables.