Using XGBoost to predict employee attrition

In [198]:
import pandas as pd
attrition_df = pd.read_csv("../datasets/employee_attrition.csv")
attrition_df.head(5)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Let's see how many rows and columns our data has.

In [146]:
# shape is (rows, columns); the column count still includes the Attrition target
num_samples, num_columns = attrition_df.shape
print(num_samples)
print(num_columns)

1470
35


We need to map the Yes/No values of Attrition to binary 1 and 0, respectively.

In [147]:
attr_dict = {'Yes': 1, 'No': 0}
attrition_df['Attrition'] = attrition_df['Attrition'].map(attr_dict)
target = attrition_df['Attrition'].values.tolist()
data = attrition_df.drop(['Attrition'], axis=1)

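We also need to map BusinessTravel, Department, and EducationField to numerical values. Note that any category missing from a dictionary is mapped to NaN, which is likely why these three columns show up as floats in the output below.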

In [148]:
travel_dict = {'Travel_Frequently': 1, 'Travel_Rarely': 0}
data['BusinessTravel'] = data['BusinessTravel'].map(travel_dict)

dept_dict = {'Research & Development': 1, 'Sales': 2, 'Other': 3}
data['Department'] = data['Department'].map(dept_dict)

ed_field_dict = {'Life Sciences': 1, 'Medical': 2, 'Other': 3}
data['EducationField'] = data['EducationField'].map(ed_field_dict)
data.head(5)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,0.0,1102,2.0,1,2,1.0,1,1,2,...,1,80,0,8,0,1,6,4,0,5
1,49,1.0,279,1.0,8,1,1.0,1,2,3,...,4,80,1,10,3,3,10,7,1,7
2,37,0.0,1373,1.0,2,2,3.0,1,4,4,...,2,80,0,7,3,3,0,0,0,0
3,33,1.0,1392,1.0,3,4,1.0,1,5,4,...,3,80,0,8,3,3,8,7,3,0
4,27,0.0,591,1.0,2,1,2.0,1,7,1,...,4,80,1,6,3,3,2,2,2,2
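A quick way to confirm whether a mapping dictionary covers every category is to compare it against the column's unique values. The sketch below uses a made-up three-row column (the 'Non-Travel' value is a hypothetical category invented for illustration, not something read from the CSV) and also shows `pd.factorize` as one possible alternative that assigns a code to every category it encounters.

```python
import pandas as pd

# Hypothetical toy column; 'Non-Travel' is invented here and is
# deliberately absent from the mapping dictionary.
toy = pd.DataFrame(
    {'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']})
travel_dict = {'Travel_Frequently': 1, 'Travel_Rarely': 0}

mapped = toy['BusinessTravel'].map(travel_dict)

# Categories without a dictionary entry come back as NaN,
# which also forces the column to a float dtype.
unmapped = sorted(set(toy['BusinessTravel'].unique()) - set(travel_dict))
n_missing = int(mapped.isna().sum())
print(unmapped, n_missing, mapped.dtype)

# Alternative sketch: let pandas assign an integer code to each category,
# in order of first appearance.
codes, categories = pd.factorize(toy['BusinessTravel'])
print(codes.tolist(), list(categories))
```

Running the same set difference on the real columns before mapping would show which categories, if any, the dictionaries above leave out.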


Now do the same for Gender, JobRole, MaritalStatus, Over18, and OverTime, starting with a look at the unique values in each column.

In [149]:
print(data.Gender.unique())
print(data.JobRole.unique())
print(data.MaritalStatus.unique())
print(data.Over18.unique())
print(data.OverTime.unique())

['Female' 'Male']
['Sales Executive' 'Research Scientist' 'Laboratory Technician'
 'Manufacturing Director' 'Healthcare Representative' 'Manager'
 'Sales Representative' 'Research Director' 'Human Resources']
['Single' 'Married' 'Divorced']
['Y']
['Yes' 'No']


In [150]:
# Binary columns map to 0/1
gender_dict = {'Male': 1, 'Female': 0}
data['Gender'] = data['Gender'].map(gender_dict)

# JobRole and MaritalStatus are nominal, so these integer codes are arbitrary labels
job_role_dict = {'Sales Executive': 1, 'Research Scientist': 2, 'Laboratory Technician': 3,
                 'Manufacturing Director': 4, 'Healthcare Representative': 5, 'Manager': 6,
                 'Sales Representative': 7, 'Research Director': 8, 'Human Resources': 9}
data['JobRole'] = data['JobRole'].map(job_role_dict)

marital_dict = {'Single': 1, 'Married': 2, 'Divorced': 3}
data['MaritalStatus'] = data['MaritalStatus'].map(marital_dict)

# Over18 only ever takes the value 'Y', so it becomes a constant column
over_18_dict = {'Y': 1}
data['Over18'] = data['Over18'].map(over_18_dict)

overtime_dict = {'Yes': 1, 'No': 0}
data['OverTime'] = data['OverTime'].map(overtime_dict)

# Convert the DataFrame to a plain list of rows (the column names are dropped here)
data = data.values.tolist()

[0 1]
[1 2 3 4 5 6 7 8 9]
[1 2 3]
[1]
[1 0]
[[41.0, 0.0, 1102.0, 2.0, 1.0, 2.0, 1.0, 1.0, 1.0, 2.0, 0.0, 94.0, 3.0, 2.0, 1.0, 4.0, 1.0, 5993.0, 19479.0, 8.0, 1.0, 1.0, 11.0, 3.0, 1.0, 80.0, 0.0, 8.0, 0.0, 1.0, 6.0, 4.0, 0.0, 5.0], [49.0, 1.0, 279.0, 1.0, 8.0, 1.0, 1.0, 1.0, 2.0, 3.0, 1.0, 61.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5130.0, 24907.0, 1.0, 1.0, 0.0, 23.0, 4.0, 4.0, 80.0, 1.0, 10.0, 3.0, 3.0, 10.0, 7.0, 1.0, 7.0], [37.0, 0.0, 1373.0, 1.0, 2.0, 2.0, 3.0, 1.0, 4.0, 4.0, 1.0, 92.0, 2.0, 1.0, 3.0, 3.0, 1.0, 2090.0, 2396.0, 6.0, 1.0, 1.0, 15.0, 3.0, 2.0, 80.0, 0.0, 7.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0], [33.0, 1.0, 1392.0, 1.0, 3.0, 4.0, 1.0, 1.0, 5.0, 4.0, 0.0, 56.0, 3.0, 1.0, 2.0, 3.0, 2.0, 2909.0, 23159.0, 1.0, 1.0, 1.0, 11.0, 3.0, 3.0, 80.0, 0.0, 8.0, 3.0, 3.0, 8.0, 7.0, 3.0, 0.0], [27.0, 0.0, 591.0, 1.0, 2.0, 1.0, 2.0, 1.0, 7.0, 1.0, 1.0, 40.0, 3.0, 1.0, 3.0, 2.0, 2.0, 3468.0, 16632.0, 9.0, 1.0, 0.0, 12.0, 3.0, 4.0, 80.0, 1.0, 6.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0], [32.0, 1.0, 1005.0, 1.0, 2
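Integer codes such as 1, 2, 3 for MaritalStatus imply an ordering that the categories don't really have. Tree models often cope with this, but one-hot encoding is a common alternative. A minimal sketch on a made-up four-row frame, assuming `pd.get_dummies` is an acceptable substitute for the dictionaries above:

```python
import pandas as pd

# Toy data for illustration only; not read from the attrition CSV.
toy = pd.DataFrame({'MaritalStatus': ['Single', 'Married', 'Divorced', 'Married']})

# One indicator column per category, stored as 0/1 integers.
one_hot = pd.get_dummies(toy, columns=['MaritalStatus'], dtype=int)
print(one_hot.columns.tolist())
print(one_hot.values.tolist())
```

Each row has exactly one indicator set, so no artificial ordering between categories is introduced.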

In [151]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)
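The split above is purely random. When one class is much rarer than the other, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. A small sketch with made-up labels (100 rows, 20 positives), not the attrition data itself:

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of them positive (hypothetical numbers).
X = [[i] for i in range(100)]
y = [1] * 20 + [0] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# With stratify=y, both splits keep the same share of positives.
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```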

Now we'll load up XGBoost and convert our data into the DMatrix format it expects: one DMatrix for the training data and one for the test data.

In [152]:
import xgboost as xgb

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Define the hyperparameters and train the model with them.

In [196]:
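param = {
    'max_depth': 3,                # maximum depth of each tree
    'eta': 0.3,                    # learning rate
    'objective': 'binary:hinge'}   # hinge loss; predictions come back as 0 or 1
epochs = 10                        # number of boosting rounds
model = xgb.train(param, train, epochs)

predictions = model.predict(test)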

Measure the accuracy on the test data

In [197]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

0.8537414965986394
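Accuracy on its own can be hard to interpret when one class dominates, which is common for attrition data. A useful reference point is the score a model would get by always predicting the most frequent class. A minimal sketch with made-up labels; `majority_baseline_accuracy` is a helper name invented here:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical test labels: 8 stayed (0), 2 left (1).
toy_y_test = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_y_test)
print(baseline)
```

Calling the same helper on `y_test` would show how much of the accuracy above comes from the model rather than from the class balance.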

I need to learn more about the dark arts of parameter tuning. For now, it seems that 10 epochs and a max_depth of 3 work best, at least as measured by accuracy on this one test split.