# HR ATTRITION ANALYSIS

#### A high employee turnover rate is never good for a company. After spending a significant amount of money on benefits, training, and other activity-based costs, a company cannot force its employees to stay. Recruiting new employees is the only way to keep the business going, but it is not cheap. The worst part of losing employees is not knowing the reason behind it. Without knowing the reason, recruiting more employees does not solve the problem and ends up costing even more money.

#### This analysis shows how we can analyze the attrition rate of employees in a company and present it using ggplot charts. To make this easier to follow, I have divided the analysis into four parts, each covering one step of the process.

## PART 1 - PREPARATION
### SUMMARY

Before starting the analysis, we need to make sure all the required libraries are available. For this analysis, we will be using the "caret", "ggplot2", "ggcorrplot", "pROC", and "randomForest" packages.

- "caret"        = used primarily for regression and classification; it streamlines the model training process.

- "ggplot2"      = provides customizable plots that map variables to aesthetics.

- "ggcorrplot"   = visualizes a correlation matrix using ggplot2.

- "pROC"         = tools for visualizing, smoothing, and comparing receiver operating characteristic (ROC) curves.

- "randomForest" = builds a model using the random forest method.

In [None]:
# Load the required libraries (install any missing ones first with install.packages())

shh <- suppressPackageStartupMessages # Suppress package startup messages
shh(library(caret))
shh(library(ggplot2))
shh(library(ggcorrplot))
shh(library(pROC))
shh(library(randomForest))
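
The cell above only loads the packages, so it fails if one of them is not installed. As a minimal sketch, the check below lists which of the five packages are missing before loading; the variable names are my own, and the install line is left commented out on purpose.

```r
# Packages this analysis loads
required_pkgs <- c("caret", "ggplot2", "ggcorrplot", "pROC", "randomForest")

# requireNamespace() returns FALSE instead of raising an error when a package is absent
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```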


Next, I am going to load the CSV file of our dataset from the company. I uploaded the CSV file to GitHub, an online hosting service for software development. The link generated for the file was long and untidy, so I used the "bit.ly" service to shorten it.

In [None]:
# Load the CSV file with employee data
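# stringsAsFactors = TRUE keeps text columns such as Attrition as factors
# (the default before R 4.0), which glm() and randomForest() rely on later
hr_data <- read.csv('http://bit.ly/2Mw5pyl', stringsAsFactors = TRUE)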

### DATA EXPLORATION

After loading the CSV data, I am going to explore the dataset. I need to see the list of columns, their descriptions, and the first few rows of the data. By performing this step, I can be sure that I have the correct dataset.
There are three functions for this step:
1. str = to list and describe the columns
2. head = to see the first few rows
3. summary = to get the descriptive statistics of the data

In [None]:
# Lists and describes each of the fields (columns)
str(hr_data)

In [None]:
# Explores the first few records in the dataset.
head(hr_data)

In [None]:
# Calculates descriptive statistics for each field.
summary(hr_data)
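
Alongside these three functions, it can help to look at the overall attrition rate before drawing any charts. The sketch below uses a small made-up data frame (`toy_hr`, hypothetical values, not the real dataset); the same two calls should work on `hr_data$Attrition`.

```r
# Toy stand-in for hr_data (hypothetical values, not the real dataset)
toy_hr <- data.frame(
  Attrition = c("Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Count employees in each Attrition group, then convert the counts to shares
attrition_counts <- table(toy_hr$Attrition)
attrition_rate <- prop.table(attrition_counts)

print(attrition_counts)
print(round(attrition_rate, 2))
```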

## PART 2 - GRAPH DATASET
### SUMMARY

After making sure I have the right dataset, I am going to answer some questions about the relationships between columns and present them in graphs using ggplot. These are questions that stakeholders in the company typically ask to find the reasons behind employee turnover. The graphs that I present show correlation, not necessarily causation.

### QUESTION 1 - "Does one gender tend to quit more frequently than the other?"

In [None]:
# Correlation Between Gender and the Number of Attrition

Gender <- ggplot(hr_data, aes(Gender, fill=factor(Attrition))) +
            labs(title="Correlation Between Gender and the Number of Attrition",
                 caption="Source: hr_data") +
            scale_fill_manual(values = c("darkgreen","red"), name="Attrition", labels = c("No","Yes")) +
            theme_dark() +
            theme(plot.title = element_text(face = "bold", colour = "cornflowerblue"),
                  axis.title.x = element_text(face = "bold", colour = "darkblue"),
                  axis.title.y = element_text(face = "bold", colour = "firebrick"))
# Absolute counts per gender
Gender + geom_bar() + xlab("Gender") + ylab("Number of Employees")
# Share of attrition within each gender (y axis runs from 0 to 1)
Gender + geom_bar(position = "fill") + xlab("Gender") + ylab("Proportion")

From the first chart above, it seems like the number of attritions among male employees is much higher. However, it should be noted that the overall number of male employees is much higher than the number of female employees. In reality, gender does not significantly affect attrition.
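
The difference between raw counts and within-group rates can be sketched with a small table. The data below is invented for illustration (`toy_hr` is hypothetical): more men leave in absolute terms, yet the rate is the same for both genders.

```r
# Toy data (hypothetical): 6 men and 4 women
toy_hr <- data.frame(
  Gender    = c(rep("Male", 6), rep("Female", 4)),
  Attrition = c("Yes", "Yes", "Yes", "No", "No", "No",  # 3 of 6 men left
                "Yes", "Yes", "No", "No")                # 2 of 4 women left
)

# Raw counts per gender, as in the stacked bar chart
counts <- table(toy_hr$Gender, toy_hr$Attrition)

# Row-wise proportions, as in the position = "fill" chart
rates <- prop.table(counts, margin = 1)

print(counts)
print(rates)
```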

### QUESTION 2 - "Does age appear to make a difference? Are our younger workers less committed to our company?"

In [None]:
# Correlation Between Age and the Number of Attrition
Age <- ggplot(hr_data, aes(Attrition, Age)) +
       labs(title="Correlation Between Age and the Number of Attrition",
            caption="Source: hr_data") +
       theme_dark() +
       theme(plot.title = element_text(face = "bold", colour = "cornflowerblue"),
             axis.title.x = element_text(face = "bold", colour = "darkblue"),
             axis.title.y = element_text(face = "bold", colour = "firebrick"))

# Attrition is categorical, so a boxplot compares the age distribution of each group
# (geom_smooth() needs a continuous x and cannot fit a curve here)
Age + geom_boxplot()
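
A numeric summary can back up whatever the age chart suggests. As a sketch on made-up data (`toy_hr` is hypothetical), `aggregate` computes the median age for each Attrition group; swapping in `hr_data` should give the real figures.

```r
# Toy data (hypothetical ages and outcomes)
toy_hr <- data.frame(
  Age       = c(24, 29, 31, 35, 38, 42, 45, 51),
  Attrition = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No")
)

# Median age of employees who left versus those who stayed
age_by_attrition <- aggregate(Age ~ Attrition, data = toy_hr, FUN = median)
print(age_by_attrition)
```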


### QUESTION 3 - "Is there a correlation between overtime and attrition?"

In [None]:
# Correlation Between Overtime and the Number of Attrition
OT <- ggplot(hr_data, aes(OverTime, fill=factor(Attrition))) +
            labs(title="Correlation Between Overtime and the Number of Attrition",
                 caption="Source: hr_data") +
            scale_fill_manual(values = c("navy","orange"), name="Attrition", labels = c("No","Yes")) +
            theme_dark() +
            theme(plot.title = element_text(face = "bold", colour = "cornflowerblue"),
                  axis.title.x = element_text(face = "bold", colour = "darkblue"),
                  axis.title.y = element_text(face = "bold", colour = "firebrick"))
# Absolute counts per overtime group
OT + geom_bar() + xlab("Over Time") + ylab("Number of Employees")
# Share of attrition within each overtime group (y axis runs from 0 to 1)
OT + geom_bar(position = "fill") + xlab("Over Time") + ylab("Proportion")

## PART 3 - PREPARE DATA FOR MODELING
### SUMMARY

The next thing we want to do is make sure there are no missing values in our dataset.

In [None]:
# Count the missing (NA) values across the whole dataset
sum(is.na(hr_data))
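
A single total does not say where missing values sit if there are any. The sketch below, on a made-up data frame (`toy_hr` is hypothetical), counts missing values per column and the number of fully complete rows.

```r
# Toy data with a few deliberate gaps (hypothetical values)
toy_hr <- data.frame(
  Age           = c(34, NA, 45, 29),
  MonthlyIncome = c(5000, 6200, NA, NA),
  OverTime      = c("Yes", "No", "No", "Yes")
)

# Missing values per column
na_per_column <- colSums(is.na(toy_hr))
print(na_per_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy_hr))
print(complete_rows)
```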

In [None]:
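# Correlation matrix of the selected columns (named cor_matrix so it does not mask cor())
cor_matrix <- cor(hr_data[c(3,8,12:25)])
options(repr.plot.width=7, repr.plot.height=5)
ggcorrplot(cor_matrix)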
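
`cor` only accepts numeric columns, and hard-coded positions break if the column order changes. One alternative, sketched here on a made-up data frame (`toy_hr` is hypothetical), is to pick the numeric columns programmatically.

```r
# Toy data mixing numeric and text columns (hypothetical values)
toy_hr <- data.frame(
  Age           = c(25, 32, 41, 50, 38),
  MonthlyIncome = c(2500, 4000, 6500, 9000, 5200),
  Department    = c("Sales", "R&D", "R&D", "HR", "Sales")
)

# Flag the numeric columns, then correlate only those
is_num <- vapply(toy_hr, is.numeric, logical(1))
cor_matrix <- cor(toy_hr[is_num])
print(round(cor_matrix, 2))
```

The resulting matrix can be passed to `ggcorrplot` in the same way as in the cell above.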

The is.na check shows that there are no missing values. Now, we can split our dataset into training data and test data. We are going to use a random 80/20 split: 80% of the rows for training and 20% for testing.

In [None]:
# Split data into training and test using randomized method
set.seed(3456)
trainIndex <- createDataPartition(hr_data$Attrition, p = .8, 
                                  list = FALSE, times = 1)
hr_train <- hr_data[ trainIndex,]
hr_test  <- hr_data[-trainIndex,]
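
When the outcome is categorical, `createDataPartition` samples within each class so that both parts keep a similar class balance. The sketch below imitates that idea in base R on made-up data (`toy_hr` is hypothetical, and this is not caret's implementation) and then compares the Attrition shares of the two parts.

```r
set.seed(3456)

# Toy data (hypothetical): 100 employees, 20% attrition
toy_hr <- data.frame(
  EmployeeNumber = 1:100,
  Attrition      = rep(c("No", "Yes"), times = c(80, 20))
)

# Sample 80% of the row indices within each Attrition level
rows_by_class <- split(seq_len(nrow(toy_hr)), toy_hr$Attrition)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

toy_train <- toy_hr[train_idx, ]
toy_test  <- toy_hr[-train_idx, ]

# Both parts should show the same share of "Yes"
print(prop.table(table(toy_train$Attrition)))
print(prop.table(table(toy_test$Attrition)))
```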

In [None]:
head(hr_train)

## PART 4 - BUILDING AND EVALUATING A MODEL TO PREDICT ATTRITION
### SUMMARY

It looks good! The Employee Number in our training data is random. Now, we can start building a model to predict attrition. We will begin with logistic regression, using the glm function.

We are going to start with our initial model. I include all the columns to see which ones are statistically significant. We will then choose a couple of columns that have the lowest p-values.

Logistic regression is not the only technique we can use. For that reason, I will show a comparison between the Logistic Regression and Random Forest methods.

### METHOD 1 - LOGISTIC REGRESSION

The first step is to build the model. I built one version with all the variables and compared it to a version with a limited set of variables, which will be shown in a bit.

In [None]:
# Build a model with the training data that predicts Attrition using logistic regression
# Initial Build
hr_model_lr_initial <- glm(Attrition ~ ., 
                   data=hr_train, family=binomial)
summary(hr_model_lr_initial)

We are done building our model using all variables. We can judge the importance of each variable by looking at its p-value and the asterisks on the right side of the summary. However, this is not practical for a very large dataset. Instead, we can rank the variables by importance using the varImp function.

In [None]:
varImp(hr_model_lr_initial)

After some trials, I ended up with two variables: OverTime and JobInvolvement. We will go ahead and use these two variables in our final model.

In [None]:
hr_model_lr_final <- glm(Attrition ~ OverTime +
                      JobInvolvement, 
                       data=hr_train, family=binomial)
summary(hr_model_lr_final)
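
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to explain to stakeholders. The sketch below fits the same formula on simulated data; `toy_hr` and the effect sizes in `log_odds` are assumptions of mine, not results from the real dataset.

```r
set.seed(42)
n <- 200

# Simulated predictors (hypothetical, not the real dataset)
toy_hr <- data.frame(
  OverTime       = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  JobInvolvement = sample(1:4, n, replace = TRUE)
)

# Assumed effects: overtime raises and involvement lowers the log-odds of leaving
log_odds <- -1 + 1.2 * (toy_hr$OverTime == "Yes") - 0.4 * toy_hr$JobInvolvement
toy_hr$Attrition <- factor(ifelse(runif(n) < plogis(log_odds), "Yes", "No"),
                           levels = c("No", "Yes"))

toy_model <- glm(Attrition ~ OverTime + JobInvolvement,
                 data = toy_hr, family = binomial)

# exp(coefficient) = multiplicative change in the odds of attrition
odds_ratios <- exp(coef(toy_model))
print(round(odds_ratios, 2))
```

An odds ratio above 1 means the variable is associated with higher odds of attrition, and a value below 1 with lower odds. Applying `exp(coef(hr_model_lr_final))` to the real model should give the corresponding figures.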

Everything looks good! Now, we will use both the initial and the final model to predict Attrition on the test data. We will use the predict function for this step.

In [None]:
# Predict initial "attrition" probability for employees in the test data using all variables
hr_test$Attrition_pred_lr_initial <- predict(hr_model_lr_initial, newdata=hr_test, type="response")
head(hr_test)

# Predict final "attrition" probability for employees in the test data using limited variables
hr_test$Attrition_pred_lr_final <- predict(hr_model_lr_final, newdata=hr_test, type="response")
head(hr_test)
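
With `type="response"`, `predict` returns probabilities rather than Yes/No labels. To get labels, a cutoff has to be chosen. The sketch below uses made-up probabilities and an assumed cutoff of 0.5 to build a confusion matrix; the vectors and the threshold are hypothetical.

```r
# Toy actual outcomes and predicted probabilities (hypothetical values)
actual    <- factor(c("No", "No", "Yes", "Yes", "No", "Yes"), levels = c("No", "Yes"))
pred_prob <- c(0.10, 0.62, 0.81, 0.35, 0.20, 0.55)

# Assumed cutoff: predict "Yes" when the probability is at least 0.5
threshold  <- 0.5
pred_class <- factor(ifelse(pred_prob >= threshold, "Yes", "No"),
                     levels = c("No", "Yes"))

# Rows are actual outcomes, columns are predictions
confusion <- table(Actual = actual, Predicted = pred_class)
print(confusion)

# Share of employees classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
print(round(accuracy, 2))
```

Since leavers are usually the minority class, a cutoff lower than 0.5 may be worth trying on the real predictions.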

Now, I will write the logistic regression predictions for the test data to a CSV file.

In [None]:
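# Write a new csv that includes columns with the employee's ID, actual attrition, and prediction
# (built as a separate data frame so that hr_test stays intact for the later steps)
hr_pred_lr <- data.frame(EmployeeNumber = hr_test$EmployeeNumber,
                         Attrition = hr_test$Attrition,
                         Attrition_pred_lr_final = hr_test$Attrition_pred_lr_final)
write.csv(hr_pred_lr, "hr_attrition_prediction_lr_final.csv", row.names=FALSE)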

### METHOD 2 - RANDOM FOREST

Now, I will show you how to create a model using the Random Forest method. The code below generates a model using all variables.

In [None]:
# Create a classification model using the "randomForest" function using all variables
hr_model_rf_initial <- randomForest(Attrition ~ ., data = hr_train, importance=TRUE, ntree=2000)
print(hr_model_rf_initial)   # Out-of-bag error estimate and confusion matrix
varImp(hr_model_rf_initial)

Just like the logistic regression model, I narrowed it down to a few variables.

In [None]:
# Create a classification model using the "randomForest" function using limited variables
hr_model_rf_final <- randomForest(Attrition ~ Age + MonthlyIncome + OverTime + JobLevel, data=hr_train, importance = TRUE)
print(hr_model_rf_final)     # Out-of-bag error estimate and confusion matrix
varImpPlot(hr_model_rf_final)

The next step is to generate predictions from both models: the one that uses all variables and the one that uses limited variables.

In [None]:

# Predict initial "attrition" class for employees in the test data using all variables
hr_test$Attrition_pred_rf_initial <- predict(hr_model_rf_initial, newdata=hr_test, type="class")
summary(hr_test)

# Predict final "attrition" class for employees in the test data using limited variables
hr_test$Attrition_pred_rf_final <- predict(hr_model_rf_final, newdata=hr_test, type="class")
summary(hr_test)

Last but not least for this model, I am saving the random forest predictions for the test data to a CSV file.

In [None]:
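# Write a new csv that includes columns with the employee's ID, actual attrition, and prediction
# (built as a separate data frame so that hr_test stays intact for the evaluation)
hr_pred_rf <- data.frame(EmployeeNumber = hr_test$EmployeeNumber,
                         Attrition = hr_test$Attrition,
                         Attrition_pred_rf_final = hr_test$Attrition_pred_rf_final)
write.csv(hr_pred_rf, "hr_attrition_prediction_rf_final.csv", row.names=FALSE)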

### EVALUATION

#### Evaluating Logistic Regression Model

To evaluate our models, I plot ROC curves and report the Area Under the Curve (AUC) for each. The curves compare the initial build with the final build, and a higher AUC means the model separates leavers from stayers more accurately. In my comparison, the final model comes out as the more accurate one.

Here are the curves for the Logistic Regression models:

In [None]:

# Generate ROC curves for both versions of the Logistic Regression model.
ROC_lr_initial <- roc(hr_test$Attrition, hr_test$Attrition_pred_lr_initial)
ROC_lr_final <- roc(hr_test$Attrition, hr_test$Attrition_pred_lr_final)

# Print the AUC for both versions of the model
paste("AUC for Logistic Regression Model (all variables):", round(auc(ROC_lr_initial),2), "(red)")
paste("AUC for Logistic Regression Model (limited variables):", round(auc(ROC_lr_final),2), "(blue)")

# Plot the ROC curves.
plot.roc(ROC_lr_initial, col="red")
lines.roc(ROC_lr_final, col="blue")

And here are the curves for the Random Forest models:

In [None]:
# Evaluate the random forest models.

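# Generate ROC curves for both versions of the Random Forest model.
# roc() needs a numeric predictor, so use the predicted probability of the "Yes" class
# instead of the class labels (assumes Attrition has the levels "No" and "Yes")
ROC_rf_initial <- roc(hr_test$Attrition, predict(hr_model_rf_initial, newdata=hr_test, type="prob")[, "Yes"])
ROC_rf_final <- roc(hr_test$Attrition, predict(hr_model_rf_final, newdata=hr_test, type="prob")[, "Yes"])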

# Print the AUC for both versions of the model
paste("AUC for Random Forest Model (all variables):", round(auc(ROC_rf_initial),2), "(red)")
paste("AUC for Random Forest Model (limited variables):", round(auc(ROC_rf_final),2), "(blue)")

# Plot the ROC curves.
plot.roc(ROC_rf_initial, col="red")
lines.roc(ROC_rf_final, col="blue")
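
To make the AUC less of a black box: it can be read as the probability that a randomly chosen leaver receives a higher score than a randomly chosen stayer. The sketch below computes it from ranks in base R on four made-up scores; `auc_by_rank` is a helper I wrote for illustration and is not part of pROC.

```r
# Rank-based AUC (Mann-Whitney form); labels are 1 for leavers and 0 for stayers
auc_by_rank <- function(labels, scores) {
  ranks <- rank(scores)          # tied scores receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (hypothetical scores)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

auc_value <- auc_by_rank(labels, scores)
print(auc_value)  # 0.75: three of the four leaver/stayer pairs are ordered correctly
```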

## CONCLUSION

From the models that I have built, Logistic Regression produces better ROC curves than Random Forest. This means that its attrition predictions are more dependable, and it also points more clearly to the variables that should be the main focus for reducing the attrition rate.

## RECOMMENDATION
From our predictions, employees with a lot of overtime and high job involvement tend to leave the company. Therefore, the company needs to review the workload and job descriptions of its employees.

According to an article by Anne Tergesen in The Wall Street Journal, employers are concerned about the impact on productivity and turnover. Research by Todd Baker, a senior fellow at Columbia University’s Richman Center for Business, Law and Public Policy, looked at 16 companies in the U.K. that offered payroll loans and found that borrowers had, on average, an annualized attrition rate 28% lower than the rate for all employees.

For this reason, productivity and turnover need to be a higher priority for the company in order to reduce the attrition rate.

Source: https://www.wsj.com/articles/some-companies-offer-a-new-benefit-payroll-advances-and-loans-11567416601
