# IBM - employee attrition #
__Why do employees leave IBM?__<br/>
__Can we predict employee attrition?__

by Viktor Gorchev

Data source: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Table of Contents<br/>
I. Tidying the data.<br/>
II. Data exploration.<br/>
III. Predict attrition using machine learning.<br/>
IV. Conclusion

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc
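try:
    from sklearn.cross_validation import KFold  # scikit-learn < 0.20
except ImportError:
    from sklearn.model_selection import KFold  # scikit-learn >= 0.20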

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# I. Tidying the data.

In [None]:
resource_location = '../input/WA_Fn-UseC_-HR-Employee-Attrition.csv'
data = pd.read_csv(resource_location)

In [None]:
print("Rows count: ", data.shape[0])
print("Columns count: ", data.shape[1])

__COLUMN DEFINITIONS:__<br/>

__Education__ <br/>1 'Below College'; 2 'College'; 3 'Bachelor'; 4 'Master'; 5 'Doctor'.<br/>

__EnvironmentSatisfaction__ <br/>1 'Low'; 2 'Medium'; 3 'High'; 4 'Very High'.<br/>

__JobInvolvement__ <br/>1 'Low'; 2 'Medium'; 3 'High'; 4 'Very High'.<br/>

__JobSatisfaction__ <br/>1 'Low'; 2 'Medium'; 3 'High'; 4 'Very High'.<br/>

__PerformanceRating__ <br/>1 'Low'; 2 'Good'; 3 'Excellent'; 4 'Outstanding'.<br/>

__RelationshipSatisfaction__ <br/>1 'Low'; 2 'Medium'; 3 'High'; 4 'Very High'.<br/>

__WorkLifeBalance__ <br/>1 'Bad'; 2 'Good'; 3 'Better'; 4 'Best'.<br/>

__Unfortunately, for the rest of the columns there is no definition of what each number represents!__

In [None]:
data.head(3)

In [None]:
data.tail(3)

In [None]:
print('NaN values in data:')
data.isnull().sum()

No data with NaN values was found.

In [None]:
data.dtypes

Data types correspond to the data in columns.

__Exploring the numerical and categorical data.__

In [None]:
cols = data.columns
numeric_cols = data.select_dtypes(include="number").columns
categirical_cols = cols.drop(numeric_cols.tolist())

In [None]:
separator = "; "
print("Numeric data:\n", separator.join(numeric_cols))
print("Categorical data:\n", separator.join(categirical_cols))

__Numerical data__

In [None]:
data[numeric_cols].describe()

EmployeeCount has the value 1 in every row.<br/> EmployeeNumber is the ID number of the employee.<br/> StandardHours has the value 80 in every row.<br/> These columns hold no useful information for predicting attrition, so we delete them.

In [None]:
del data["EmployeeCount"]
del data["EmployeeNumber"]
del data["StandardHours"]
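
The constant columns above were spotted by reading the `describe()` output. A minimal sketch of how the same check could be automated is shown here on a small made-up frame; the values and the `constant_cols` name are illustrative only and are not part of the notebook. An identifier column such as EmployeeNumber is not constant, so it would still need a separate check.

```python
import pandas as pd

# Toy frame standing in for the HR data; values are invented for illustration.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 33],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Drop the constant columns and keep the rest.
reduced = toy.drop(columns=constant_cols)
print(list(reduced.columns))
```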

__Categorical data__

In [None]:
for col in categirical_cols:
    print(data[col].value_counts())

In [None]:
data["Over18"].value_counts()

Every employee in the data set is over 18.<br/>The "Over18" column is therefore constant, holds no useful information, and is deleted.

In [None]:
del data["Over18"]

"Attrition", "Gender" and "OverTime" are binary variables. <br/>We convert them into 0/1 numerical data using dummy variables and delete the originals.

In [None]:
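# dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to booleans)
data = pd.concat([data, pd.get_dummies(data[["Gender", "OverTime", "Attrition"]], drop_first=True, dtype=int)], axis=1)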

del data["Gender"]
del data["OverTime"]
del data["Attrition"]
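
To make the effect of `drop_first=True` concrete, here is a small sketch on three invented rows (the `toy` frame is an assumption of this example, not the real data). Each binary column collapses into a single 0/1 column named after the category that is kept.

```python
import pandas as pd

# Three made-up employees with the same binary columns as the data set.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["Yes", "No", "No"],
    "Attrition": ["No", "Yes", "No"],
})

# drop_first=True removes the alphabetically first category of each column,
# so "Female" and "No" become the implicit 0 level.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```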

In [None]:
data.head(3)

# II. Data exploration.

In [None]:
sns.set(style="whitegrid", font_scale=1.3)
sns.countplot(x="Attrition_Yes", data=data, palette="hls")
plt.title("Attrition Counts")
plt.xlabel("Attrition (No = 0, Yes = 1)")
plt.show()

The classes are imbalanced: there are far more '0' (stayed) than '1' (left) labels.
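
With imbalanced classes it helps to know how well a trivial model would do. The sketch below uses invented labels (not the real counts) to compute the class shares and the accuracy of always predicting the majority class. The `toy_labels` and `baseline_accuracy` names are assumptions of this example.

```python
import pandas as pd

# Invented labels: 8 employees stayed (0), 2 left (1).
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Attrition_Yes")

# Share of each class in the data.
class_share = toy_labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any classifier trained on such data should be judged against this baseline rather than against 50%.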

__Correlation coefficients.__

In [None]:
sns.set(style="whitegrid", font_scale=1)
plt.figure(figsize=(16,16))
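# Correlate only the numeric (and 0/1 dummy) columns; text columns cannot be correlated
corr = data.select_dtypes(include=["number", "bool"]).corr().round(2)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, cmap="RdBu", mask=mask)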
plt.title("Correlation between features", fontdict={"fontsize":20})
plt.show()

As there are many variables, we extract and visualize only strong correlation coefficients.

In [None]:
extract_cols  = ["Age",
                 "DistanceFromHome",
                 "JobInvolvement",
                 "JobLevel", 
                 "MonthlyIncome",
                 "StockOptionLevel",
                 "TotalWorkingYears", 
                 "YearsAtCompany", 
                 "YearsInCurrentRole",
                 "YearsWithCurrManager",
                 "YearsSinceLastPromotion", 
                 "OverTime_Yes", 
                 "Attrition_Yes"]

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(10,7))
corr = data[extract_cols].corr().round(2)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, cmap="RdBu", vmin=-1, vmax=1, mask=mask)
plt.title("Correlation between important features and attrition", fontdict={"fontsize":20})
plt.show()
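
The columns above were picked by eye from the heatmap. A possible way to make the selection repeatable is to rank features by the absolute value of their correlation with the target and keep those above a cut-off. The sketch below does this on synthetic data; the column names, the random data and the `threshold` value are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),  # tied to the target
    "noise": rng.normal(size=n),                     # unrelated to the target
    "target": target,
})

# Correlation of every feature with the target, ranked by absolute value.
corr_with_target = toy.corr()["target"].drop("target")
ranked = corr_with_target.abs().sort_values(ascending=False)

threshold = 0.3  # hypothetical cut-off, chosen only for this example
selected = ranked[ranked > threshold].index.tolist()
print(ranked.round(2))
print(selected)
```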

We will plot the columns with most relevance in the decision of employees to quit their job: 
"JobLevel", "MonthlyIncome", "StockOptionLevel", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "OverTime_Yes".

In [None]:
most_relevant = ["JobLevel", 
                 "MonthlyIncome",
                 "StockOptionLevel",
                 "TotalWorkingYears", 
                 "YearsAtCompany", 
                 "YearsInCurrentRole",
                 "YearsWithCurrManager",
                 "OverTime_Yes"]

for col in most_relevant:
    # One bar chart per feature: mean value for employees who stayed vs. left
    plt.figure()
    sns.barplot(x="Attrition_Yes", y=col, data=data)
    plt.xlabel("Attrition (No = 0, Yes = 1)")
    plt.title(col + " / Attrition")
    plt.show()

__From the correlation data we can see that overtime work has the strongest association with attrition, which suggests it is the main factor for employees to quit their job.__

For predicting attrition we can use the columns with most relevance in the decision of employees to quit their job: 
"Age", "JobLevel", "MonthlyIncome", "StockOptionLevel", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "OverTime_Yes".

# III. Predict attrition using machine learning.

Creating a dataframe with only the features most relevant to attrition (feature selection).

In [None]:
ibm_data = data[["Age",
                 "JobLevel", 
                 "MonthlyIncome",
                 "StockOptionLevel",
                 "TotalWorkingYears", 
                 "YearsAtCompany", 
                 "YearsInCurrentRole",
                 "YearsWithCurrManager",
                 "OverTime_Yes", 
                 "Attrition_Yes"]]

__Preparing the data.__

In [None]:
ibm_data.dtypes

All the data is in numeric form so no conversion is required.

__Defining functions for measuring each ML model.__

In [None]:
def print_cross_validation_score(model, attributes, labels, n_folds):
    # Local import: KFold with the split() API lives in sklearn.model_selection
    # (the old sklearn.cross_validation module was removed in scikit-learn 0.20).
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=n_folds)
    scores = []
    for train, test in kf.split(attributes):
        train_predictors = attributes.iloc[train, :]
        train_target = labels.iloc[train]

        model.fit(train_predictors, train_target)

        # model.score returns accuracy for classifiers, not an error rate
        scores.append(model.score(attributes.iloc[test, :], labels.iloc[test]))

    print("Cross-Validation scores: ", scores)

    print(
        "\nCross-Validation mean score : {0:.3%}".format(np.mean(scores)),
        "(standard deviation: {0:.3%})".format(np.std(scores))
    )
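
For comparison, scikit-learn ships a helper that runs the same fit-and-score loop. The sketch below applies `cross_val_score` with an unshuffled 5-fold `KFold` to synthetic data; the data set and the decision tree are stand-ins chosen for this example, not the notebook's models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the attrition features; not the Kaggle data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
cv = KFold(n_splits=5)  # unshuffled 5-fold split

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=cv)
print("Fold scores:", scores)
print("Mean: {:.3%} (std: {:.3%})".format(scores.mean(), scores.std()))
```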

In [None]:
def print_clf_quality(labels_test, predicted):    
    accuracy = accuracy_score(labels_test, predicted)
    precision = precision_score(labels_test, predicted, average="weighted")
    recall = recall_score(labels_test, predicted, average="weighted")
    f1 = f1_score(labels_test, predicted, average="weighted")

    print("accuracy: ", accuracy)
    print("precision: ", precision)
    print("recall: ", recall)
    print("f1: ", f1)

    print("\nConfusion matrix :")
    print(confusion_matrix(labels_test, predicted))   

In [None]:
def print_roc_curve(y_score, y_test):
    # Binary problem: a single ROC curve for the positive class (attrition = 1)
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve')
    plt.legend(loc="lower right")
    plt.show()
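
The area under the ROC curve can be checked without plotting. The sketch below uses four invented scores: three of the four (positive, negative) pairs are ranked correctly, so the area is 0.75. `roc_auc_score` is used only as a cross-check and is not called elsewhere in the notebook.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Invented true labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC points, then the area under them by the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# roc_auc_score computes the same quantity in one call.
print(area, roc_auc_score(y_true, y_score))
```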

__Splitting the data for training and testing. (training 70%, testing 30%)__

In [None]:
cols = ibm_data.columns.drop("Attrition_Yes")
attributes = ibm_data[cols]
labels = ibm_data["Attrition_Yes"]

features_train, features_test, labels_train, labels_test = train_test_split(attributes, labels, train_size=0.7, stratify=labels)
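
The `stratify=labels` argument keeps the share of each class the same in both parts of the split. A minimal sketch on invented data (100 rows, 20 positives) illustrates this; the `random_state` argument is an addition of this example to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 20 of them positive (attrition = 1).
toy_labels = pd.Series([1] * 20 + [0] * 80)
toy_features = pd.DataFrame({"x": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels,
    train_size=0.7,
    stratify=toy_labels,
    random_state=0,  # fixed seed so the split is reproducible
)

# Both parts keep the 20% share of positives.
print(y_train.mean(), y_test.mean())
```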

__We have to do a classification.__<br/>
Some commonly used models for classification are <br/>
SVM "Support Vector Machines" with a linear kernel, <br/>
SVM with an RBF kernel (the "Kernel Trick"), <br/>
k-Nearest Neighbors (kNN), <br/>
AdaBoostClassifier with DecisionTreeClassifier and <br/>
AdaBoostClassifier with RandomForestClassifier<br/> 
so we are going to use them.<br/> 
We will also try one neural network - MLPClassifier().

# SVM "Support Vector Machines"

__1. Using a linear kernel with default parameters.__

In [None]:
svm = SVC(kernel = "linear")

__Performing cross validation with Kfold.__

In [None]:
n_folds = 5
print_cross_validation_score(svm, features_train, labels_train, n_folds)  

__Training the model.__

In [None]:
svm.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = svm.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = svm.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

__2. Using Grid search.__

In [None]:
svm_search = SVC(kernel = "linear")

In [None]:
#The code below is commented out because the grid search takes more than 1200 seconds and
#Kaggle kills the kernel!
#The best estimator found by the grid search has C=1000.


#params_svm = {"C": [0.001, 0.01, 0.1, 10, 100, 1000]}
#folds = 5
#search_svm = GridSearchCV(svm_search, params_svm, cv = folds)
#search_svm.fit(features_train , labels_train)

#print(search_svm.best_estimator_)

__The best C value is 1000.__
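
One possible reason for the long run time is that the features live on very different scales (for example MonthlyIncome versus JobLevel), which tends to slow SVM training, especially for large C. The sketch below shows, on synthetic data, how a grid search over C could be combined with standardization in a pipeline. `StandardScaler`, the step names `scale` and `svm`, the data and the grid are assumptions of this example and were not part of the original search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real notebook uses the attrition features.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X[:, 0] *= 10000  # mimic one feature on a much larger scale

pipeline = Pipeline([
    ("scale", StandardScaler()),    # zero mean, unit variance per feature
    ("svm", SVC(kernel="linear")),
])

# Parameters of a pipeline step are addressed as "<step>__<parameter>".
param_grid = {"svm__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into the scaling.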

In [None]:
svm_best = SVC(C = 1000, kernel = "linear")

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(svm_best, features_train, labels_train, n_folds)  

__Training the tuned model.__

In [None]:
svm_best.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = svm_best.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = svm_best.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

# SVM with RBF kernel (Kernel Trick)

__1. Using default parameters.__

In [None]:
kt = SVC(kernel = "rbf")

__Performing cross validation with Kfold.__

In [None]:
n_folds = 5
print_cross_validation_score(kt, features_train, labels_train, n_folds)  

__Training the model.__

In [None]:
kt.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = kt.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = kt.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

__2. Using Grid search.__

In [None]:
kt_search = SVC(kernel = "rbf")

In [None]:
params_kt = {"C": [0.001, 0.01, 100, 1000], "gamma": [0.00001, 10]}
folds_kt = 5

In [None]:
search_kt = GridSearchCV(kt_search, params_kt, cv = folds_kt)
search_kt.fit(features_train , labels_train)

In [None]:
print(search_kt.best_estimator_)

__The best C value is 0.001, gamma value is 0.00001__

In [None]:
kt_best = SVC(C = 0.001, gamma = 0.00001, kernel = "rbf")

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(kt_best, features_train, labels_train, n_folds) 

__Training the model.__

In [None]:
kt_best.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = kt_best.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = kt_best.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

# k-Nearest Neighbors (kNN)

__1. Using default parameters.__

In [None]:
knn = KNeighborsClassifier()

__Performing cross validation with Kfold.__

In [None]:
n_folds = 5
print_cross_validation_score(knn, features_train, labels_train, n_folds)

__Training the model.__

In [None]:
knn.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = knn.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = knn.predict_proba(features_test)

print_roc_curve(labels_score[:, 1], labels_test)

__2. Using Grid search.__

In [None]:
knn_search = KNeighborsClassifier()

In [None]:
params_knn = {"n_neighbors": [2, 6, 7, 8, 9]}
folds_knn = 5
search_knn = GridSearchCV(knn_search, params_knn, cv = folds_knn)
search_knn.fit(features_train , labels_train)

In [None]:
print(search_knn.best_estimator_)

__The best n_neighbors value is 6.__
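
The winning value does not have to be copied by hand. A fitted `GridSearchCV` object exposes it through `best_params_`, and `best_estimator_` is a model already retrained with that value. The sketch below demonstrates this on synthetic data, so the value it selects says nothing about the attrition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the Kaggle data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [2, 6, 7, 8, 9]},  # same candidate values as above
    cv=5,
)
search.fit(X, y)

print(search.best_params_)

# With the default refit=True, best_estimator_ is already retrained on all of X, y.
knn_from_search = search.best_estimator_
print(knn_from_search.score(X, y))
```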

In [None]:
knn_best = KNeighborsClassifier(n_neighbors = 6)

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(knn_best, features_train, labels_train, n_folds) 

__Training the model.__

In [None]:
knn_best.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = knn_best.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = knn_best.predict_proba(features_test)

print_roc_curve(labels_score[:, 1], labels_test)

# Neural Network - MLPClassifier

__1. Using default parameters.__

In [None]:
nn = MLPClassifier()

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(nn, features_train, labels_train, n_folds)  

__Training the model.__

In [None]:
nn.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
predicted = nn.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = nn.predict_proba(features_test)

print_roc_curve(labels_score[:, 1], labels_test)

__2. Using Grid search.__

In [None]:
nn_search = MLPClassifier()

In [None]:
params_nn = {
    "hidden_layer_sizes" : [(30, 30), (300, 300), (5, 10, 20, 5)], 
    "early_stopping" : [True, False],
    "alpha" : 10.0 ** - np.arange(1, 7),
    "max_iter" : [20000]
}
folds_nn = 5
search_nn = GridSearchCV(nn_search, params_nn, cv = folds_nn)
search_nn.fit(features_train , labels_train)

In [None]:
print(search_nn.best_estimator_)

__The best result is hidden_layer_sizes set to (300, 300) with early_stopping set to True and an alpha value of 0.1.__


In [None]:
# max_iter=20000 matches the only value used in the grid search above
nn_best = MLPClassifier(alpha=0.1, early_stopping=True, hidden_layer_sizes=(300, 300), max_iter=20000)

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(nn_best, features_train, labels_train, n_folds) 

__Training the model.__

In [None]:
nn_best.fit(features_train , labels_train)

__Printing the model quality.__

In [None]:
print("Train data score: ", nn_best.score(features_train , labels_train))
print("Test data score: ", nn_best.score(features_test, labels_test))
print()

predicted = nn_best.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = nn_best.predict_proba(features_test)

print_roc_curve(labels_score[:, 1], labels_test)

# AdaBoostClassifier with DecisionTreeClassifier

__Using DecisionTreeClassifier with max_depth 1.__

In [None]:
ada_tree = DecisionTreeClassifier(max_depth=1)

In [None]:
# Weak learner passed positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn
ada = AdaBoostClassifier(ada_tree)
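
AdaBoost with depth-1 trees (decision stumps) builds its ensemble one stump at a time, giving more weight to the samples misclassified so far. The sketch below runs it end to end on synthetic data; the data, `n_estimators=50` and the `random_state` values are assumptions of this example. The weak learner is passed positionally because its keyword was renamed from `base_estimator` to `estimator` in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Kaggle data.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

stump = DecisionTreeClassifier(max_depth=1)
# Passing the weak learner positionally to sidestep the keyword rename.
boosted = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

test_accuracy = boosted.score(X_test, y_test)
print("Number of fitted stumps:", len(boosted.estimators_))
print("Test accuracy:", test_accuracy)
```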

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(ada, features_train, labels_train, n_folds) 

__Training the model.__

In [None]:
ada.fit(features_train, labels_train) 

__Printing the model quality.__

In [None]:
print("Train data score: ", ada.score(features_train , labels_train))
print("Test data score: ", ada.score(features_test, labels_test))
print()

predicted = ada.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = ada.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

# AdaBoostClassifier with RandomForestClassifier

__Using RandomForestClassifier with default parameters.__

In [None]:
tree = RandomForestClassifier()

In [None]:
# Weak learner passed positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn
ada_rfc = AdaBoostClassifier(tree)

__Performing cross validation with Kfold__

In [None]:
n_folds = 5
print_cross_validation_score(ada_rfc, features_train, labels_train, n_folds) 

__Training the model.__

In [None]:
ada_rfc.fit(features_train, labels_train) 

__Printing the model quality.__

In [None]:
print("Train data score: ", ada_rfc.score(features_train , labels_train))
print("Test data score: ", ada_rfc.score(features_test, labels_test))
print()

predicted = ada_rfc.predict(features_test)

print_clf_quality(labels_test, predicted)

In [None]:
labels_score = ada_rfc.decision_function(features_test)

print_roc_curve(labels_score, labels_test)

# IV. Conclusion

__The main factor for employee attrition is overtime work.__

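__We can predict employee attrition with 85.42% accuracy.__ Because the classes are imbalanced, this figure should be read against the accuracy of always predicting "no attrition".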

The two best performing machine learning algorithms are:<br/>
1. SVM "Support Vector Machines" with kernel = "linear" and no tuning - __predicting accuracy 85.42% and f1 score 83.71%__;<br/>
2. AdaBoostClassifier using DecisionTreeClassifier with max depth 1 – __predicting accuracy 85.26% and f1 score 83.01%__.


In predicting attrition we use the columns with most relevance, as follows: 
"Age", "JobLevel", "MonthlyIncome", "StockOptionLevel", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "OverTime_Yes".