![](https://harver.com/wp-content/uploads/2019/02/Employee-Attrition-Turnover-1024x437.jpg)

# Introduction

The key to success in any organization is attracting and retaining top talent. As an HR analyst, one of the key tasks is to determine which factors keep employees at the company and which prompt others to leave. The data contains a set of data points on employees who are either currently working at the company or have resigned. The objective is to identify and improve these factors to prevent the loss of good people.

This notebook predicts whether or not employees will resign from their position.

Key takeaways:

* Uses CatBoost, so no manual encoding of categorical values is needed.
* Finds the most important features (one of which is how much overtime you work).
* Plots SHAP values for a few predictions to show what speaks for and what speaks against resignation.


# Data loading

We'll do some simple data loading of the train and test sets, encode the target as numerical values since that is simpler to work with, and put all the predictors into a single dataframe at the end.

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.reset_option('^display.', silent=True)

# Load data
X_train = pd.read_csv('/kaggle/input/employee-attrition/employee_attrition_train.csv')
X_test = pd.read_csv('/kaggle/input/employee-attrition/employee_attrition_test.csv')

# Make target numerical
X_train.Attrition = X_train.Attrition.apply(lambda x: 0 if x == 'No' else 1)

# Split target and predictors
y_train = X_train['Attrition']
num_train = len(X_train)
X_train.drop(['Attrition'], axis=1, inplace=True)

df = pd.concat([X_train, X_test], ignore_index=True)
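
The concatenate-then-split pattern used above can be illustrated on a tiny, made-up frame. This is only a sketch: the values and all `toy_` names are hypothetical, and `map` is shown as one possible alternative to the `apply` call, not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV files
toy_train = pd.DataFrame({"Age": [30, 41, 25], "Attrition": ["No", "Yes", "No"]})
toy_test = pd.DataFrame({"Age": [37, 52]})

# Encode the target as 0/1 and separate it from the predictors
toy_y = toy_train["Attrition"].map({"No": 0, "Yes": 1})
toy_num_train = len(toy_train)
toy_train = toy_train.drop(columns=["Attrition"])

# Stack train and test so that preprocessing is applied to both at once
toy_df = pd.concat([toy_train, toy_test], ignore_index=True)

# The first toy_num_train rows are the training rows, the rest are the test rows
toy_train_again = toy_df.iloc[:toy_num_train]
toy_test_again = toy_df.iloc[toy_num_train:]
print(toy_y.tolist(), len(toy_train_again), len(toy_test_again))
```

Remembering the number of training rows is what makes it possible to recover the original split after the shared preprocessing.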

# Short EDA

The data contains 1029 training and 441 test samples with information about an employee's current position, their background and whether or not they have resigned their position at any point. There are a few NaN values that we'll deal with next, but overall there are no real surprises in the data. We note, however, that the dataset is imbalanced, as a lot more people choose to keep their jobs than to resign.



In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Detect if data is imbalanced
print(y_train.value_counts())

# Feature encoding

Here we fill the NaN values. DailyRate and Age get group-wise medians based on the employee's gender and education (plus job level for DailyRate), while the remaining columns get a simple default value. Since we're using CatBoost, we simply save the indices of the categorical features instead of encoding them manually. CatBoost handles this by itself, and very effectively too, so we let it.

In [None]:
# Fill missing values for DailyRate with the median of each (Gender, Education, JobLevel) group
daily_rates = df.groupby(['Gender', 'Education', 'JobLevel']).DailyRate
df['DailyRate'] = daily_rates.transform(lambda x: x.fillna(x.median()))

# Fill missing values for Age with the median of each (Gender, Education) group
ages = df.groupby(['Gender', 'Education']).Age
df['Age'] = ages.transform(lambda x: x.fillna(x.median()))

# Set missing values for BusinessTravel to Non-Travel
df['BusinessTravel'] = df['BusinessTravel'].fillna('Non-Travel')

# Set missing values for DistanceFromHome to the rounded mean
df['DistanceFromHome'] = df['DistanceFromHome'].fillna(np.around(df['DistanceFromHome'].mean()))

# Set missing values for MaritalStatus to Married
df['MaritalStatus'] = df['MaritalStatus'].fillna('Married')

# Save indices of categorical features
categorical_features_indices = np.where(df.dtypes == 'object')[0]
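
To make the group-wise median imputation concrete, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical, and only a single grouping column is used to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing DailyRate per Gender group
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "DailyRate": [100.0, 300.0, np.nan, 500.0, np.nan, 700.0],
})

# transform returns a result aligned with the original index,
# so each NaN is replaced by the median of its own group
toy["DailyRate"] = toy.groupby("Gender")["DailyRate"].transform(
    lambda s: s.fillna(s.median())
)
print(toy["DailyRate"].tolist())
```

The missing value in the "F" group becomes 200 (the median of 100 and 300) and the one in the "M" group becomes 600, so each employee is filled in from comparable colleagues rather than from the overall median.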

# Train/test split

Now that we're done with preprocessing, we can make a proper train and validation split and save the test set for later evaluation. The data is imbalanced, so we record a weight for the positive class (the 1's), which is used next for classification. This is simply the ratio between the number of negative and positive samples.

In [None]:
# Split the df into train and test set
X_train = df.iloc[:num_train,:]
X_test = df.iloc[num_train:,:]

# Make a training and validation set
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.75, stratify=y_train, random_state=0)

# Imbalanced data, so set weight
pos_weight = sum(y_train.values == 0)/sum(y_train.values == 1)
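
As a small illustration of what this ratio achieves, the sketch below uses a made-up label list (all names are hypothetical). With the positive class weighted by the negative-to-positive ratio, both classes carry the same total weight.

```python
# Hypothetical labels: 8 employees stayed (0) and 2 resigned (1)
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

n_neg = sum(1 for y in toy_labels if y == 0)
n_pos = sum(1 for y in toy_labels if y == 1)

# Ratio of negatives to positives, computed the same way as pos_weight above
toy_pos_weight = n_neg / n_pos

# Each positive sample counts toy_pos_weight times, each negative once,
# so both classes end up with the same total weight
total_pos_weight = n_pos * toy_pos_weight
total_neg_weight = n_neg * 1.0
print(toy_pos_weight, total_pos_weight, total_neg_weight)
```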

# Modelling

CatBoost is an algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers, and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and many other tasks at Yandex and in other companies, including CERN, Cloudflare and Careem taxi. It is open source and can be used by anyone. More here: https://catboost.ai/

It has categorical features support and usually gives excellent results without hyperparameter optimization, so we can pretty much train right away. We set a high number of iterations so that training continues as long as we keep seeing improvements, and early stopping (the overfitting detector in CatBoost lingo) kicks in should the model start to overfit. We optimize for Logloss, not surprisingly, and use AUC as the evaluation metric.

Kaggle does not support this, but if you run this yourself, make sure to set `plot=True` so you'll get nice plots :)

In [None]:
import catboost
params = {"iterations": 1000,
          "learning_rate": 0.1,
          "scale_pos_weight": pos_weight,
          "eval_metric": "AUC",
          "custom_loss": "Accuracy",
          "loss_function": "Logloss",
          "od_type": "Iter",
          "od_wait": 30,
          "logging_level": "Verbose",
          "random_seed": 0
}

train_pool = catboost.Pool(X_train, y_train, cat_features=categorical_features_indices)
valid_pool = catboost.Pool(X_valid, y_valid, cat_features=categorical_features_indices)

model = catboost.CatBoostClassifier(**params)
model.fit(train_pool, eval_set=valid_pool, plot=False)
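
The `od_type`/`od_wait` settings amount to a patience rule: stop once the evaluation metric has not improved for `od_wait` iterations since its best value. The function below is a simplified, pure-Python illustration of that idea on a made-up AUC curve. It is not CatBoost's actual implementation, and `stop_iteration` is a hypothetical helper.

```python
def stop_iteration(metric_values, wait):
    """Return (best_iteration, stopped_at) for a higher-is-better metric."""
    best_value = float("-inf")
    best_iteration = -1
    for i, value in enumerate(metric_values):
        if value > best_value:
            best_value, best_iteration = value, i
        elif i - best_iteration >= wait:
            # No improvement for `wait` iterations since the best one: stop here
            return best_iteration, i
    # The metric kept improving often enough, so training ran to the end
    return best_iteration, len(metric_values) - 1

# Made-up validation AUC per iteration; it peaks at iteration 3
toy_auc = [0.70, 0.74, 0.78, 0.80, 0.79, 0.80, 0.78, 0.77, 0.76]
best, stopped = stop_iteration(toy_auc, wait=3)
print(best, stopped)
```

With `od_wait` set to 30 as in the parameters above, training would continue for up to 30 iterations past the best one before giving up.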

After 20 iterations, CatBoost reached 83% accuracy on the validation set, which took no more than a quarter of a second. Next we'll print a classification report, show precision/recall scores and plot the ROC curves.

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score
from catboost.utils import get_roc_curve, select_threshold

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')

y_pred = model.predict(X_valid)
print(f"Confusion Matrix:\n {confusion_matrix(y_valid, y_pred)}\n")
print(f"Classification Report:\n {classification_report(y_valid, y_pred)}\n")

# y_pred from above is reused; there is no need to predict a second time
print(f"Accuracy on validation set: {accuracy_score(y_valid, y_pred)}")
print(f"Precision on validation set: {precision_score(y_valid, y_pred)}")
print(f"Recall on validation set: {recall_score(y_valid, y_pred)}")

fpr_train, tpr_train, _ = get_roc_curve(model, train_pool)
fpr_valid, tpr_valid, _ = get_roc_curve(model, valid_pool)

plt.figure(figsize=(8,6))
plot_roc_curve(fpr_train, tpr_train, "Training ROC")
plot_roc_curve(fpr_valid, tpr_valid, "Validation ROC")
plt.legend(loc="lower right")
plt.title("ROC plot")
plt.ylabel("TPR")
plt.xlabel("FPR")
plt.show()
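
For readers who want to see how the reported scores relate to the confusion matrix, here is a short sketch with made-up counts. The numbers are hypothetical and are not the results of this model.

```python
# Hypothetical confusion matrix counts, laid out as in sklearn:
# rows are true classes, columns are predicted classes
tn, fp = 80, 10
fn, tp = 4, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted resignations that were real
recall = tp / (tp + fn)      # share of real resignations that were found
fpr = fp / (fp + tn)         # x-axis of the ROC plot at this threshold
tpr = recall                 # y-axis of the ROC plot at this threshold
print(accuracy, precision, recall, fpr)
```

In this toy case accuracy is 0.86 even though only 6 of the 10 resignations are found, which is why precision and recall are worth reporting alongside accuracy on imbalanced data.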

# Feature importance

One of the cool things about using decision tree algorithms for machine-learning is that you get a feature ranking for free when you're done training. So what we'll do next is extract the features ordered by importance from the model and show them in a nice table, and plot them as well.

In [None]:
# Get feature importances
model.get_feature_importance(train_pool, fstr_type=catboost.EFstrType.FeatureImportance, prettified=True)

In [None]:
# Plot feature importances
importances = model.get_feature_importance(train_pool, fstr_type=catboost.EFstrType.FeatureImportance)
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,12))
plt.title('Feature importance for CatBoost classifier')
plt.barh(X_train.columns[indices][::-1], importances[indices][::-1])

# Feature interactions

Feature interactions show the interaction strength for each pair of features that are used in the model. Formally, it reflects the sum of absolute differences (after being summed across all trees) between the leaves of the tree containing the interaction and the leaves of the tree not containing the interaction. In other words, it measures how much the model relies on two features jointly, which is not the same as how correlated they are. We see that there is a strong interaction between Department and JobRole in this case. Pay-related and seniority-related features also appear together, which is not surprising.

In [None]:
interactions = model.get_feature_importance(train_pool, fstr_type=catboost.EFstrType.Interaction)
feature_interaction = [[X_train.columns[interaction[0]], X_train.columns[interaction[1]], interaction[2]] for interaction in interactions]
feature_interaction_df = pd.DataFrame(feature_interaction, columns=['feature1', 'feature2', 'interaction_strength'])
feature_interaction_df.head(10)

In [None]:
pd.Series(index=zip(feature_interaction_df['feature1'], feature_interaction_df['feature2']), data=feature_interaction_df['interaction_strength'].values, name='interaction_strength').head(10)[::-1].plot(kind='barh', figsize=(12,12))

# SHAP Values

SHAP values help explain how our model makes predictions. Below we've used the training data to identify which features make you more likely to stay in your job role and which features make you more likely to resign. The features are on the left vertical axis, ranked in descending order of importance, and the SHAP value strengths are on the horizontal axis. The horizontal location of the dots shows whether the effect of that value is associated with a higher or lower prediction, and the color shows whether that variable is high (in red) or low (in blue) for that observation.

For example, a *low* value of **StockOptionLevel** has a *positive* impact on the predictions, whereas a *high* value of **YearsInCurrentRole** has a *negative* impact on the predictions. A positive impact pulls us towards classifying an employee as a 1 (they will likely resign) and a negative impact pulls us towards classifying the employee as a 0 (they will likely stay). Categorical values are shown in gray.

See more here: https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d

The next plots show the SHAP values for the training data, followed by a few individual examples from the test set.

In [None]:
import shap
shap_values = model.get_feature_importance(train_pool, fstr_type=catboost.EFstrType.ShapValues)
shap.initjs()
shap.summary_plot(shap_values[:, :-1], X_train, feature_names=X_train.columns.tolist())

In [None]:
shap.summary_plot(shap_values[:, :-1], X_train, feature_names=X_train.columns.tolist(), plot_type="bar")
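
The `[:, :-1]` slice above drops the last column, which in CatBoost's SHAP output holds the expected (base) value rather than a feature contribution. The sketch below, with a made-up row of numbers, shows how the contributions and the base value add up to a raw score in log-odds space, which the sigmoid then turns into a probability. The feature names are taken from this notebook, but the values are hypothetical.

```python
import math

# Hypothetical SHAP row in CatBoost's layout: one value per feature,
# followed by the expected (base) value in the last position
toy_feature_names = ["OverTime", "StockOptionLevel", "YearsInCurrentRole"]
toy_shap_row = [0.9, 0.4, -0.6, -1.5]

contributions = toy_shap_row[:-1]
base_value = toy_shap_row[-1]

# Contributions are additive in log-odds space
raw_score = base_value + sum(contributions)

# The sigmoid turns the log-odds into a probability of resignation
probability = 1.0 / (1.0 + math.exp(-raw_score))
print(round(raw_score, 2), round(probability, 3))
```

Positive contributions push the score towards resignation and negative ones towards staying, which is the same reading as for the red and blue arrows in the force plots.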

In [None]:
# Compute the explainer and the SHAP values once, then reuse them for every plot
explainer = shap.TreeExplainer(model)
shap_values_test = explainer.shap_values(X_test)

# Helper function to draw the SHAP force plot for test sample j
def shap_plot(j):
    return shap.force_plot(explainer.expected_value, shap_values_test[j], X_test.iloc[[j]])

shap_plot(0)

In [None]:
shap_plot(10)

In [None]:
shap_plot(45)

In [None]:
shap_plot(49)

In [None]:
shap_plot(50)

# Model predictions

Lastly, we'll make a few predictions on unseen test data. It turns out that most employees are predicted to stay in their current role, while roughly 2 out of 10 are predicted to resign.

In [None]:
y_test_preds = model.predict(X_test)
y_test_probas = model.predict_proba(X_test)

# Column 1 of predict_proba holds the probability of the positive class (resignation)
print(f"First 10 predictions on test set: {y_test_preds[:10]}")
print(f"First 10 dropout probabilities: {y_test_probas[:10, 1]}")
print(f"Number of predicted dropouts: {np.sum(y_test_preds == 1)}")
print(f"Number of predicted non-dropouts: {np.sum(y_test_preds == 0)}")
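
The relationship between predicted probabilities and predicted classes can be sketched as follows. The probabilities are made up, and the 0.4 cut-off is only an illustration of how a different threshold changes the outcome, not a recommendation.

```python
import numpy as np

# Hypothetical predict_proba-style output: columns are P(stay), P(resign)
toy_probas = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.55, 0.45],
    [0.20, 0.80],
])

# Probability of the positive class (resignation) is the second column
p_resign = toy_probas[:, 1]

# A 0.5 cut-off reproduces the usual default class decision;
# a lower cut-off flags more employees at the cost of more false alarms
labels_default = (p_resign > 0.5).astype(int)
labels_cautious = (p_resign > 0.4).astype(int)
print(labels_default, labels_cautious)
```

Lowering the threshold is one way to trade precision for recall when missing a likely resignation is more costly than a false alarm.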