# Employee Attrition:

## Predicting turnover

### Introduction-

One important topic of concern for many businesses is employee retention. Companies want to hire individuals who will stay for a long time; after all, it is usually cheaper to train and promote an existing employee than it is to hire someone new. For this reason, businesses often invest significant time and resources into predicting whether or not employees will stay. This is what we are going to do in this post.

Using data provided by IBM's Watson Analytics team, we will use machine learning techniques to predict whether or not employees will stay with a company, or leave to find new jobs. 

### Getting started-

The data that we will use can be downloaded [here](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/).

Once it has been downloaded, we must inspect the data and make sure that it is in a format that we will be able to easily feed into our machine learning algorithms. 
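As a rough illustration of what this inspection can look like, the sketch below counts missing values and flags columns that hold a single distinct value. The `toy` frame is a made-up stand-in (its column names mirror the dataset, its values do not), and `constant_cols` is just a name chosen for this example.

```python
import pandas as pd

# Toy stand-in for the real spreadsheet; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Columns with a single distinct value cannot help a model tell rows apart
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

print(missing_counts)
print(constant_cols)
```

Running the same two checks on the real `data` frame is one way to decide which columns are worth keeping before any encoding.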

In [34]:
# Load the required libraries
import random
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder

In [35]:
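# Set seed (scikit-learn estimators without a random_state draw from NumPy's
# global random state, so we seed it as well as Python's `random` module)
import numpy as np
random.seed(100)
np.random.seed(100)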

In [36]:
# Set working directory to appropriate folder 
# path = "/path/to/files"

# os.chdir(path)

In [37]:
# Load the data
data = pd.read_excel("ibm_employee_attrition.xlsx")

In [38]:
# Check for missing values
data.isnull().any()

Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL

### Option 1:

Before we can move forward with our machine learning algorithms, we will first need to pre-process our data by converting all relevant variables to data types that can be understood by our algorithms. This means that we will have to encode all of our factor variables into binary columns. 

Luckily, this can be done quite quickly with the help of pandas using the `get_dummies` function. 

In [54]:
# Drop 'Over18' (it is left out of the encoding below);
# `columns=` already names the axis, so `axis` is not needed
data = data.drop(columns = 'Over18')

In [55]:
# Inspect the data
data.dtypes

# Columns to one-hot encode
categorical_cols = ['Attrition', 'BusinessTravel', 'EducationField', 'Department', 'Gender', 'JobRole',
                    'MaritalStatus', 'OverTime', 'Education', 'EnvironmentSatisfaction', 'JobInvolvement',
                    'JobLevel', 'JobSatisfaction', 'PerformanceRating', 'RelationshipSatisfaction',
                    'StockOptionLevel', 'WorkLifeBalance']

# One-hot encoding dummy variables (each new column is prefixed with its source column name by default)
dummies = pd.get_dummies(data, columns = categorical_cols, drop_first = True)
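To see what this encoding does on a small scale, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset). With `drop_first = True`, the first category of each column, in sorted order, is dropped and becomes the implicit reference level.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "Age": [41, 49, 37],
})

# One binary column per category, minus the first category of each variable
encoded = pd.get_dummies(toy, columns=["Attrition", "Department"], drop_first=True)

print(encoded.columns.tolist())
```

Here `Attrition` collapses into a single `Attrition_Yes` column, which is why that name is used as the dependent variable in this post.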

Now that we have encoded all of our categorical variables into separate columns, our dataset can be interpreted by the algorithms we wish to use. The only remaining task is to separate the independent and dependent variables.

In [56]:
# List all of the column headers
dat_vars = dummies.columns.values.tolist()

# Select independent variables
x = [i for i in dat_vars if i not in ['Attrition_Yes']]

In [57]:
# Fill the values
x = dummies[x]
y = dummies['Attrition_Yes']

The only issue that arises from our variable encoding is that we now have an excess of columns. This can lead to the "curse of dimensionality". As a result, the output of our models could be meaningless because there is too much noise for the algorithms to produce accurate predictions. 

In order to fix this issue, we will perform a quick feature selection on our data to reduce it back to roughly the original number of variables. To do this, we will use recursive feature elimination, or RFE. This method fits a model, ranks the features by the model's coefficients (or feature importances), removes the weakest ones, and repeats the process on the remaining features until only the requested number is left.

In [58]:
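# Create a model for running the RFE
log_mod = LogisticRegression()

# Run the feature selection, keeping 34 features
# (n_features_to_select must be passed by keyword in recent scikit-learn releases)
rfe = RFE(log_mod, n_features_to_select = 34)
rfe = rfe.fit(x, y)

# rfe.support_ is a boolean mask marking the columns that were kept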
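Note that fitting the selector does not change `x` by itself, and the models in the next section are trained on the full `x`. To train on only the selected columns, the selector's mask has to be applied explicitly. The sketch below shows one way to do that on synthetic data; `X_demo`, `y_demo`, `selector`, and the choice of 4 features are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the encoded attrition features
X_arr, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Keep the 4 features that survive recursive elimination
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns; use it to subset the frame
selected = X_demo.columns[selector.support_].tolist()
X_reduced = X_demo.loc[:, selector.support_]

print(selected)
print(X_reduced.shape)
```

In this notebook, the equivalent step would be something like `x = x.loc[:, rfe.support_]` before splitting the data, which would change the accuracies reported below.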

### Machine Learning-

Before we begin, we must not forget to split up our dataset into different sets for training and testing.

In [59]:
# Split data for training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)
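When one outcome is much rarer than the other, a plain random split can leave the test set with a different share of leavers than the training set. A possible refinement, not used in this post, is a stratified split via the `stratify` argument. The sketch below uses made-up labels with 16% positives; that ratio is an assumption for the example, not a figure from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 16 positives out of 100
y_demo = np.array([1] * 16 + [0] * 84)
X_demo = np.arange(100).reshape(-1, 1)

# stratify keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=100
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

With the notebook's variables, this would amount to adding `stratify = y` to the existing `train_test_split` call.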

#### Logistic Regression

In [60]:
# Initialize the model
log_reg = LogisticRegression()

# Logistic Regression
log_reg.fit(x_train, y_train)

# Print results
print("Logistic Regression Accuracy: {:.3f}".format(accuracy_score(y_test, log_reg.predict(x_test))))

Logistic Regression Accuracy: 0.880


#### Random Forest

In [62]:
# Initialize the model
rf = RandomForestClassifier()

# Fit
rf.fit(x_train, y_train)

# Predict
print("Random Forest Accuracy: {:.3f}".format(accuracy_score(y_test, rf.predict(x_test))))

Random Forest Accuracy: 0.853


#### Support Vector Machine

In [63]:
# Initialize the model
svc = SVC()

# Fit
svc.fit(x_train, y_train)

# Predict
print("SVC Accuracy: {:.3f}".format(accuracy_score(y_test, svc.predict(x_test))))

SVC Accuracy: 0.841
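Accuracy figures like these are easier to interpret next to a naive baseline. If leavers are a minority of the rows (an assumption here, since the class balance is not shown above), a model that always predicts "stays" already scores well. The sketch below measures that baseline with scikit-learn's `DummyClassifier` on made-up labels; the 84/16 split is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 0 = stays, 1 = leaves (illustrative class balance)
y_train_demo = np.array([0] * 84 + [1] * 16)
y_test_demo = np.array([0] * 42 + [1] * 8)

# The baseline ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((100, 1))
X_test_demo = np.zeros((50, 1))

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print("Baseline Accuracy: {:.3f}".format(baseline_acc))
```

Fitting the same `DummyClassifier` on `x_train` and `y_train` would show how much of each model's accuracy comes from the class balance alone.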


### Option 2:

In the previous exercise, we used the `get_dummies` function in pandas to transform some of our data columns into binary variables. Another option is to encode each label as an integer, which keeps one column per variable. Here, we will try this approach and see if it results in an improvement in accuracy over our previous models.

In [64]:
# Re-check the data
data.head(n = 5)

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [65]:
data.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYea

In [66]:
# Names of the text (object) columns to convert
strings = data.select_dtypes(include = 'object').columns

In [67]:
# Initialize the encoder
le = LabelEncoder()

In [68]:
# Encode the factors
for i in strings:
    data[i] = le.fit_transform(data[i])
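To make the effect of the encoder concrete, the sketch below applies `LabelEncoder` to a few made-up travel labels (illustrative values, not rows from the dataset). The classes are sorted alphabetically and each label is replaced by its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
travel = pd.Series(["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"])

encoder = LabelEncoder()
codes = encoder.fit_transform(travel)

# classes_ lists the labels in the order of their integer codes
print(list(encoder.classes_))
print(codes.tolist())
```

The resulting integers imply an ordering (0 < 1 < 2) that the original labels may not have, which linear models and SVMs will treat as meaningful. Because the loop above refits the same `le` for every column, only the last column's mapping remains available in `le.classes_` afterwards.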

In [69]:
# Check the conversion
data.dtypes

Age                         int64
Attrition                   int64
BusinessTravel              int64
DailyRate                   int64
Department                  int64
DistanceFromHome            int64
Education                   int64
EducationField              int64
EmployeeCount               int64
EmployeeNumber              int64
EnvironmentSatisfaction     int64
Gender                      int64
HourlyRate                  int64
JobInvolvement              int64
JobLevel                    int64
JobRole                     int64
JobSatisfaction             int64
MaritalStatus               int64
MonthlyIncome               int64
MonthlyRate                 int64
NumCompaniesWorked          int64
OverTime                    int64
PercentSalaryHike           int64
PerformanceRating           int64
RelationshipSatisfaction    int64
StandardHours               int64
StockOptionLevel            int64
TotalWorkingYears           int64
TrainingTimesLastYear       int64
WorkLifeBalanc

In [14]:
# Redefine the x and y variables
variables = data.columns.values.tolist()

# Independent variables
x = [i for i in variables if i not in ['Attrition']]

# Fill the values
x = data[x]
y = data['Attrition']

In [70]:
# Split data for training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)

#### Logistic Regression

In [71]:
# Logistic Regression
log_reg.fit(x_train, y_train)

# Print results
print("Logistic Regression Accuracy: {:.3f}".format(accuracy_score(y_test, log_reg.predict(x_test))))

Logistic Regression Accuracy: 0.875


#### Random Forest

In [72]:
# Initialize the model
rf = RandomForestClassifier()

# Fit
rf.fit(x_train, y_train)

# Predict
print("Random Forest Accuracy: {:.3f}".format(accuracy_score(y_test, rf.predict(x_test))))

Random Forest Accuracy: 0.844


#### Support Vector Machine

In [73]:
# Initialize the model
svc = SVC()

# Fit
svc.fit(x_train, y_train)

# Predict
print("SVC Accuracy: {:.3f}".format(accuracy_score(y_test, svc.predict(x_test))))

SVC Accuracy: 0.841
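The SVC score is identical under both encodings. One possible explanation, offered here only as a hypothesis, is that the features are on very different scales (compare `MonthlyIncome` with `JobLevel`), which matters for the default RBF kernel. The sketch below shows how scaling could be combined with the classifier in a pipeline; it runs on synthetic data, and `scaled_svc` and the inflated first column are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; the first column is inflated to mimic a large-valued feature
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# The scaler is fit on the training data only, then applied before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

scaled_acc = scaled_svc.score(X_te, y_te)
print("Scaled SVC Accuracy: {:.3f}".format(scaled_acc))
```

Swapping in `x_train`, `x_test`, `y_train`, and `y_test` from the notebook would show whether scaling changes the SVC result on the attrition data.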


### Conclusion-

Well, it would appear that our use of the label encoder in scikit-learn was not really helpful. Logistic regression fell slightly from 0.880 to 0.875 and the random forest from 0.853 to 0.844, while the SVC was unchanged at 0.841. This is surprising, because we expected the option 2 models to avoid the "curse of dimensionality" and thus to be more accurate. The differences are small, however, and come from a single train/test split, so they should be read with caution.
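Since the gaps between the two options amount to a few thousandths on one split, a more stable comparison would average accuracy over several splits. The sketch below does this with `cross_val_score` on synthetic data; `X_demo`, `y_demo`, and the choice of 5 folds are assumptions for the example, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an encoded feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Accuracy on each of 5 folds; the spread hints at how noisy one split can be
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Mean accuracy: {:.3f}".format(scores.mean()))
print("Std across folds: {:.3f}".format(scores.std()))
```

Running this once on the dummy-encoded features and once on the label-encoded features would show whether the difference between the two options exceeds the fold-to-fold variation.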