# Employee attrition

In this notebook we are predicting the attrition of employees.

In [395]:
import seaborn as sns
import sklearn as sk 
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

### Importing and pre-processing the data

In [396]:
df = pd.read_csv('Employee_attrition.csv')
df = df[['Attrition', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'RelationshipSatisfaction', 'Department', 'WorkLifeBalance', 'YearsAtCompany']]  # keep only the columns selected as strong predictors

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Because the goal is to predict attrition, we first look at how the Attrition column is split between Yes and No.

In [398]:
df['Attrition'].value_counts()  # class counts, needed for the baseline accuracy

No     1233
Yes     237
Name: Attrition, dtype: int64

Most employees did not leave, so the classes are imbalanced. A model that always predicts "No" would already be right most of the time, and that share of "No" among all 1470 employees is the baseline accuracy our model has to beat:

In [399]:
baseline_accuracy = round(1233 / 1470 * 100, 2)  # share of "No" among all employees, in percent

baseline_accuracy

83.88
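
The same kind of baseline can also be obtained from scikit-learn instead of by hand. The sketch below is only an illustration: it rebuilds a label vector from the class counts shown above (1233 "No", 237 "Yes") and scores a `DummyClassifier` that always predicts the most frequent class. The names `y_counts`, `X_placeholder` and `baseline` are made up for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the value_counts() output: 0 = "No", 1 = "Yes"
y_counts = np.array([0] * 1233 + [1] * 237)

# The dummy model ignores the features, so a column of zeros is enough
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class and measure how often that is right
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
majority_accuracy = baseline.score(X_placeholder, y_counts)

print(f"{majority_accuracy:.1%}")  # 83.9%
```

Any real model should reach a higher accuracy than this before it can be considered useful.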

Before splitting the data into a training and a test set, we need to pre-process some variables, because we are working with ranking numbers, categorical (string) variables and boolean (Yes/No) variables. First, let's see what kind of data we are working with:

In [400]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Attrition                 1470 non-null   object
 1   YearsInCurrentRole        1470 non-null   int64 
 2   YearsSinceLastPromotion   1470 non-null   int64 
 3   RelationshipSatisfaction  1470 non-null   int64 
 4   Department                1470 non-null   object
 5   WorkLifeBalance           1470 non-null   int64 
 6   YearsAtCompany            1470 non-null   int64 
dtypes: int64(5), object(2)
memory usage: 80.5+ KB


Here we see that pandas reads the ranking variables RelationshipSatisfaction and WorkLifeBalance as integers. Since we want to turn them into dummy variables, we first cast them to object so that they are treated as categories:

In [401]:
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].astype(object)
df['WorkLifeBalance'] = df['WorkLifeBalance'].astype(object)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Attrition                 1470 non-null   object
 1   YearsInCurrentRole        1470 non-null   int64 
 2   YearsSinceLastPromotion   1470 non-null   int64 
 3   RelationshipSatisfaction  1470 non-null   object
 4   Department                1470 non-null   object
 5   WorkLifeBalance           1470 non-null   object
 6   YearsAtCompany            1470 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 80.5+ KB


Now we can convert the categorical variables into dummy variables. Attrition is the exception: it has to stay in a single column, because it is the target that the confusion matrix is built on, so we encode it as 1 (Yes) and 0 (No) instead.

In [402]:
df = pd.get_dummies(df, columns=['RelationshipSatisfaction', 'WorkLifeBalance', 'Department'])
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0}).astype(int)  # encode the target as 1/0
df

Unnamed: 0,Attrition,YearsInCurrentRole,YearsSinceLastPromotion,YearsAtCompany,RelationshipSatisfaction_1,RelationshipSatisfaction_2,RelationshipSatisfaction_3,RelationshipSatisfaction_4,WorkLifeBalance_1,WorkLifeBalance_2,WorkLifeBalance_3,WorkLifeBalance_4,Department_Human Resources,Department_Research & Development,Department_Sales
0,1,4,0,6,1,0,0,0,1,0,0,0,0,0,1
1,0,7,1,10,0,0,0,1,0,0,1,0,0,1,0
2,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0
3,0,7,3,8,0,0,1,0,0,0,1,0,0,1,0
4,0,2,2,2,0,0,0,1,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,0,2,0,5,0,0,1,0,0,0,1,0,0,1,0
1466,0,7,1,7,1,0,0,0,0,0,1,0,0,1,0
1467,0,2,0,6,0,1,0,0,0,0,1,0,0,1,0
1468,0,6,0,9,0,0,0,1,0,1,0,0,0,0,1
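
To make the effect of the encoding easier to see, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are invented for illustration; only the column names mirror the notebook. `dtype=int` is passed explicitly because the default dummy dtype differs between pandas versions.

```python
import pandas as pd

# Hypothetical mini dataset with one numeric, one ranking and one string column
toy = pd.DataFrame({
    "YearsAtCompany": [6, 10, 0],
    "WorkLifeBalance": [1, 3, 3],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Each listed column is replaced by one 0/1 column per distinct value
encoded = pd.get_dummies(toy, columns=["WorkLifeBalance", "Department"], dtype=int)

print(encoded.columns.tolist())
```

Columns that are not listed (here `YearsAtCompany`) are passed through unchanged, and every original row has exactly one 1 within each group of dummy columns.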


In [403]:
df.info()#check if all variables are the correct type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 15 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Attrition                          1470 non-null   int64
 1   YearsInCurrentRole                 1470 non-null   int64
 2   YearsSinceLastPromotion            1470 non-null   int64
 3   YearsAtCompany                     1470 non-null   int64
 4   RelationshipSatisfaction_1         1470 non-null   uint8
 5   RelationshipSatisfaction_2         1470 non-null   uint8
 6   RelationshipSatisfaction_3         1470 non-null   uint8
 7   RelationshipSatisfaction_4         1470 non-null   uint8
 8   WorkLifeBalance_1                  1470 non-null   uint8
 9   WorkLifeBalance_2                  1470 non-null   uint8
 10  WorkLifeBalance_3                  1470 non-null   uint8
 11  WorkLifeBalance_4                  1470 non-null   uint8
 12  Department_Human Res

Now that all columns are numeric, we can use them in the algorithm.

Let's create the X and y values and split the data into training and test data:

In [404]:
X = df.loc[:,'YearsInCurrentRole':'Department_Sales']#creating X 
y = df['Attrition']#creating Y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)#split in test data
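
With imbalanced classes, a purely random split can leave the training and test sets with slightly different shares of attrition. One common option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below illustrates it on labels rebuilt from the class counts above; `X_demo`, `y_demo` and the `*_s` names are placeholders for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts: 0 = no attrition, 1 = attrition
y_demo = np.array([0] * 1233 + [1] * 237)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the attrition share (almost) identical in both parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1, stratify=y_demo
)

print(f"overall: {y_demo.mean():.3f}")
print(f"train:   {y_train_s.mean():.3f}")
print(f"test:    {y_test_s.mean():.3f}")
```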

### Training the algorithm 

Because we are predicting a categorical variable (attrition: yes or no), this is a classification problem. We use the Random Forest algorithm for it.

Random Forest is an algorithm that builds many decision trees, each from a random sample of the cases (in this dataset: employees), and combines their predictions. It is a supervised algorithm, which means we train it on a training set with known outcomes.
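
The idea of "many trees on random samples, combined" can be sketched by hand. This is a simplified illustration on synthetic data, not what `RandomForestClassifier` does internally in every detail (scikit-learn averages the trees' class probabilities, whereas this sketch uses a plain majority vote). The data from `make_classification` and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 cases, 6 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):
    # Bootstrap sample: draw as many cases as there are, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each tree also considers only a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Combine the trees: every tree votes, the majority wins
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_cases)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("agreement with labels on the sampled data:", (forest_pred == y_demo).mean())
```

Because each tree sees a different sample, their individual errors tend to differ, and the vote smooths them out.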

Next, we fit the model and specify the number of trees:

In [408]:
model = RandomForestClassifier(n_estimators = 100, random_state=1)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=1)

### Evaluating the model 

To evaluate the model we are using a confusion matrix. 

In [406]:
y_pred = model.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_pred) #creates a "confusion matrix"
cm = pd.DataFrame(cm, index=['no attrition (actual)', 'attrition (actual)'], columns = ['no attrition (pred)', 'attrition (pred)']) #label and make df
cm

Unnamed: 0,no attrition (pred),attrition (pred)
no attrition (actual),370,31
attrition (actual),79,6


With the confusion matrix we can calculate the accuracy, recall and precision:

In [407]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       401
           1       0.16      0.07      0.10        85

    accuracy                           0.77       486
   macro avg       0.49      0.50      0.48       486
weighted avg       0.71      0.77      0.74       486



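The precision for the attrition class is poor: only 16% of the employees predicted to leave actually left (6 of 37). The recall is even lower at 7%, so the model misses 93% of the employees who actually left (79 of 85). The overall accuracy of 77% is also below what always predicting "no attrition" would achieve on this test set (401 of 486, about 83%), so the model does not yet beat the majority-class baseline.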
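
The figures in the report can be reproduced from the four cells of the confusion matrix. The short sketch below copies those cell counts and applies the standard definitions; the variable names are chosen for this example only.

```python
# Cell counts copied from the confusion matrix above
tn, fp = 370, 31  # actual "no attrition": predicted no / predicted yes
fn, tp = 79, 6    # actual "attrition":    predicted no / predicted yes
total = tn + fp + fn + tp

precision = tp / (tp + fp)           # of all predicted leavers, how many really left
recall = tp / (tp + fn)              # of all real leavers, how many were found
accuracy = (tp + tn) / total         # share of all predictions that are correct
majority_share = (tn + fp) / total   # accuracy of always predicting "no attrition"

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} majority={majority_share:.2f}")
# precision=0.16 recall=0.07 accuracy=0.77 majority=0.83
```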