# Introduction

Employee resignations happen every day. Resigning is a difficult decision because it has a huge impact on an employee's livelihood, especially if they have a family. Despite that, the number of resignations keeps rising. An analysis conducted by Compdata, the consulting practice at Salary.com, based on data from nearly 25,000 organizations of varying sizes in the United States, showed that employee quits increased from 13.5 percent in October 2017 to 14.2 percent in October 2018.

Many factors can influence resignation, such as an unbalanced division of work, long working hours, dissatisfaction with the company, salary range, and career prospects. This is where HR plays a huge role. Letting a high-performing employee leave can do more damage to the organization than is saved by replacing them with a cheaper but worse-performing employee.

In this notebook we analyze a sample HR analytics dataset to find out some facts about employee resignation, and attempt to build a classifier that predicts whether an employee with a given profile is likely to resign.

# Importing Libraries and Dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
##Import Data
df=pd.read_csv('/kaggle/input/hr-analytics/HR_comma_sep.csv')
df.head()

# Data Inspection

We check the dataset for missing values and other data-quality problems.

In [None]:
df.describe()

In [None]:
## Data inspection
# isnull() is an alias of isna(), so a single check covers both None and NaN
print('Existence of missing values: ', df.isna().values.any())
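A single True/False flag does not say where the problems are. As a minimal sketch, the check can be broken down per column and extended to duplicated rows. The `toy` frame below is made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to demonstrate the calls
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, 0.38],
    "salary": ["low", "medium", None, "low"],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicated rows:", duplicate_rows)
```

Whether duplicated rows are real damage or simply employees with identical profiles is a judgment call that depends on how the data was collected.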

In [None]:
df['left'].value_counts(normalize=True)

Here we can see that the target values are imbalanced, with only 23.8% of the data being resignees.
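The imbalance matters when judging a model: a classifier that always predicts "stayed" already reaches an accuracy equal to the majority share. The sketch below illustrates this with a made-up label series whose balance roughly mirrors the figure quoted above. It is an assumption for illustration, not the real column.

```python
import pandas as pd

# Toy labels with roughly the class balance reported above (0 = stayed, 1 = left).
# In the notebook, df['left'] would take the place of this series.
left = pd.Series([0] * 762 + [1] * 238, name="left")

class_share = left.value_counts(normalize=True)
# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model evaluated later should be compared against this baseline, and per-class precision and recall are more informative than accuracy alone.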

# Exploratory Data Analysis

In this section, we explore the dataset and calculate some metrics that illustrate which factors are associated with an employee resigning.

In [None]:
# Resignation rate (%) per department. `left` is 0/1, so its group mean is the rate.
# The result is ordered by department size to match department_list.
department_list=df['Department'].value_counts()
ratio_arr=(100*df.groupby('Department')['left'].mean()).reindex(department_list.index).to_numpy()

# Resignation rate (%) per salary level, ordered to match salary_list.
salary_list=df['salary'].value_counts()
sal_arr=(100*df.groupby('salary')['left'].mean()).reindex(salary_list.index).to_numpy()


fig,ax = plt.subplots(ncols=2,figsize=(20,5))

plt.sca(ax[0])
_rt_bar=sns.barplot(x=department_list.keys(),y=ratio_arr)
_rt_title=plt.title('Resignation Rate Per Department')
for bar in _rt_bar.patches:
    _rt_bar.annotate(format(bar.get_height(), '.2f'),  
                   (bar.get_x() + bar.get_width() / 2,  
                    bar.get_height()), ha='center', va='center', 
                   size=12, xytext=(0, 8), 
                   textcoords='offset points') 
_rt_xtick=plt.xticks(rotation=45)
_rt_ylim=plt.ylim(0,100)
_rt_ylabel=plt.ylabel('%')

plt.sca(ax[1])
_rt_bar=sns.barplot(x=salary_list.keys(),y=sal_arr)
_rt_title=plt.title('Resignation Rate Per Salary Level')
for bar in _rt_bar.patches:
    _rt_bar.annotate(format(bar.get_height(), '.2f'),  
                   (bar.get_x() + bar.get_width() / 2,  
                    bar.get_height()), ha='center', va='center', 
                   size=12, xytext=(0, 8), 
                   textcoords='offset points') 
_rt_xtick=plt.xticks(rotation=45)
_rt_ylim=plt.ylim(0,100)
_rt_ylabel=plt.ylabel('%')

In [None]:
fig,ax = plt.subplots(ncols=3,figsize=(20,5))
_box=sns.boxplot(data = df,y='satisfaction_level',x='left',showmeans=True,ax=ax[0])
_box=sns.boxplot(data = df,y='last_evaluation',x='left',showmeans=True,ax=ax[1])
_box=sns.boxplot(data = df,y='average_montly_hours',x='left',showmeans=True,ax=ax[2])
for n in range(0,3):
    ax[n].set_xticklabels(labels=['Stayed','Left'])
    ax[n].set_xlabel(None)

fig,ax = plt.subplots(ncols=2,figsize=(10,5))
_box=sns.boxplot(data = df,y='time_spend_company',x='left',showmeans=True,ax=ax[0])
_box=sns.boxplot(data = df,y='number_project',x='left',showmeans=True,ax=ax[1])
for n in range(0,2):
    ax[n].set_xticklabels(labels=['Stayed','Left'])
    ax[n].set_xlabel(None)

The figures above show that the HR department has the highest resignation rate, followed by ....
In terms of salary, employees in the low salary group are more likely to leave than those in the higher groups.

As for the numerical features, employees who handle a large number of projects, or who have low evaluation and satisfaction scores, tend to leave. The same applies to employees with high monthly working hours and a long tenure at the company.
On the other hand, promotion does not seem to be a major factor in resignation.
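The claim about promotion can be checked the same way as the department and salary rates, by comparing resignation rates between groups. The sketch below uses a hypothetical toy frame, and the column name `promotion_last_5years` is an assumption about the dataset. The toy values carry no meaning.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "promotion_last_5years": [0, 0, 0, 0, 0, 0, 1, 1],
    "left":                  [1, 0, 0, 1, 0, 0, 0, 1],
})

# `left` is 0/1, so the group mean times 100 is the resignation rate in percent
rate = toy.groupby("promotion_last_5years")["left"].mean().mul(100)
# Group sizes, to judge how reliable each rate is
group_size = toy["promotion_last_5years"].value_counts()

print(rate.round(2))
print(group_size)
```

Looking at the group sizes alongside the rates helps avoid over-reading a rate that is computed from only a few employees.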

Next, we examine the pairwise relationships between the features.

In [None]:
sns.pairplot(df, hue="left")

We see some interesting charts above. Three of them contain clustered data points:
1. Satisfaction vs Last Evaluation
2. Satisfaction vs Average Monthly Hours
3. Last Evaluation vs Average Monthly Hours

In [None]:
fig,ax=plt.subplots(ncols=3,figsize=(20,5))
sns.scatterplot(data=df,x='satisfaction_level',y='last_evaluation',hue='left', ax=ax[0])
sns.scatterplot(data=df,x='satisfaction_level',y='average_montly_hours',hue='left', ax=ax[1])
sns.scatterplot(data=df,x='last_evaluation',y='average_montly_hours',hue='left', ax=ax[2])

The clusters are very similar. In the first scatter plot, resigned employees are either high-performing and dissatisfied, low-performing and dissatisfied, or high-performing and satisfied.

Similarly, in the second chart, many resignees worked the highest number of hours, with either low or high satisfaction. There is also a cluster of resignees who worked normal to low monthly hours and had low satisfaction.

In the third chart, however, the largest resignee cluster comprises people who performed well and clocked the most hours. There is another cluster of people who performed below average and logged fewer hours than average.

In [None]:
fig=plt.figure(figsize=(10,5))
sns.scatterplot(data=df,x='satisfaction_level',y='last_evaluation',hue='left')
plt.vlines(0.675,0.75,1.0,'red')
plt.hlines(0.75,0.675,0.95,'red')
plt.vlines(0.95,0.75,1.0,'red')

The most interesting cluster is the one marked in red above. Why did these employees resign even though their performance and satisfaction are above average? Let's isolate this group and compare it with the average metrics of the entire dataset.

In [None]:
df_x=df.loc[(df["left"] == 1) & (df["last_evaluation"] > 0.7) & (df["satisfaction_level"]>0.6)]
print('Resigned Cluster Average - Overall Average')
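# numeric_only=True skips the text columns (Department, salary), which cannot be averaged
print(df_x.mean(numeric_only=True)-df.mean(numeric_only=True))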
print('-----------------------------')
print('Resigned Cluster Salary Range')
print(df_x['salary'].value_counts())

Now we can see why. Compared with the overall average, the resignees in this cluster were evaluated 0.2 higher, worked on 0.74 more projects, logged 42.57 more monthly hours, and had been employed 1.6 years longer. Most importantly, 562 of them are in the low salary range. In terms of work accidents and promotion, the gaps are negligible (<0.1).

# Logistic Regression

In this section we explore the correlation between the features and the target, and then fit a logistic regression as our first prediction model.

## Correlation

We analyze the correlation between numerical features and the target.

In [None]:
fig,ax=plt.subplots(ncols=2,figsize=(20,8))
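# Correlate the numeric columns only; Department and salary are text columns
resign_corr=df.select_dtypes(include='number').corr()
# np.bool was removed from NumPy; the built-in bool is the replacement
mask = np.triu(np.ones_like(resign_corr, dtype=bool))
cat_heatmap = sns.heatmap(resign_corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG',ax=ax[0])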
cat_heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

heatmap = sns.heatmap(resign_corr[['left']].sort_values(by='left', ascending=False),vmin=-1, vmax=1, annot=True, cmap='BrBG',ax=ax[1])
heatmap.set_title('Features Correlating with Resignation', fontdict={'fontsize':18}, pad=16);

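The heatmap above shows that employee evaluation has almost no linear correlation with resignation. This does not mean it is irrelevant: the earlier scatter plots show resignees clustering at both low and high evaluation scores, a pattern that a single correlation coefficient cannot capture.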
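To illustrate why a near-zero correlation coefficient needs careful reading, the sketch below builds purely synthetic data in which leaving depends entirely on the score, yet the Pearson correlation is close to zero. The variable names and the rule are assumptions for the demonstration, not properties of the HR dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for an evaluation score between 0 and 1
evaluation = rng.uniform(0.0, 1.0, size=10_000)
# Synthetic rule: only very low or very high scores "leave"
left = ((evaluation < 0.2) | (evaluation > 0.8)).astype(int)

# Linear (Pearson) correlation between score and outcome
pearson = np.corrcoef(evaluation, left)[0, 1]

# Resignation rate within the low, middle and high score ranges
low_rate = left[evaluation < 0.2].mean()
mid_rate = left[(evaluation >= 0.2) & (evaluation <= 0.8)].mean()
high_rate = left[evaluation > 0.8].mean()

print(f"Pearson correlation: {pearson:.3f}")
print(f"Leave rate (low / middle / high): {low_rate:.2f} / {mid_rate:.2f} / {high_rate:.2f}")
```

Because the two tails pull in opposite directions, they cancel out in the linear coefficient. Binned rates or a tree-based model can pick up this kind of relationship.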

## One Hot Encoding

We start by converting the categorical features, department and salary range, into numbers.
The cell below applies one hot encoding to both columns. Because salary range is ordered, an ordinal encoding from 1 (low) to 3 (high) is an alternative for that column.

In [None]:
df_lr=df.copy()
df_lr=pd.get_dummies(df_lr, columns = ['Department','salary'])
df_lr.head()
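An ordinal encoding of the salary range, from 1 (low) to 3 (high), could be sketched as follows. The toy frame and the `salary_order` mapping are assumptions for illustration; the department column is still one hot encoded because it has no natural order.

```python
import pandas as pd

# Hypothetical toy rows, standing in for the real dataframe
toy = pd.DataFrame({
    "Department": ["sales", "hr", "technical", "sales"],
    "salary": ["low", "medium", "high", "low"],
})

# Assumed mapping from salary label to ordinal level
salary_order = {"low": 1, "medium": 2, "high": 3}

encoded = toy.copy()
encoded["salary"] = encoded["salary"].map(salary_order)
# Department has no order, so it keeps a one hot encoding
encoded = pd.get_dummies(encoded, columns=["Department"])

print(encoded)
```

The ordinal version assumes equal spacing between the levels, which may or may not suit a linear model. One hot encoding avoids that assumption at the cost of extra columns.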

In [None]:
# Build the feature matrix X and the target vector y for the models below
from sklearn.linear_model import LogisticRegression

# dtype=float converts the one hot (boolean) columns to 0.0/1.0
X = df_lr.drop(columns='left').to_numpy(dtype=float)
y = df_lr['left'].to_numpy()

In [None]:
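from sklearn import preprocessing
# Note: this scaler is fitted on the full dataset but never applied;
# the models below are trained on the unscaled features.
scaler = preprocessing.StandardScaler().fit(X)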

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
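The scaler fitted earlier is not applied to `X` before training. One way to use scaling without leaking test-set statistics is to place it in a pipeline, so that it is fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the HR features; `X_demo`, `y_demo` and `scaled_lr` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: roughly a 76:24 class ratio)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.76, 0.24], random_state=42
)
# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler's mean and variance are learned from X_tr only, inside fit()
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
test_accuracy = scaled_lr.score(X_te, y_te)

print(f"Test accuracy: {test_accuracy:.3f}")
```

Scaling mainly helps the logistic regression converge and makes its coefficients comparable. Tree-based models such as gradient boosting are largely insensitive to it.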

In [None]:
from sklearn.metrics import roc_auc_score,roc_curve
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train.ravel())

y_pred=model.predict(X_test)
y_proba=model.predict_proba(X_test)

ns_probs = [0 for _ in range(len(y_test))]
ns_auc = roc_auc_score(y_test, ns_probs)
print("ROC AUC SCORE: ",roc_auc_score(y_test, y_proba[:, 1]))
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, y_proba[:,1])
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()  # show the labels assigned to the two curves

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


cf_matrix = confusion_matrix(y_test, y_pred)

sns.heatmap(cf_matrix, annot=True, cmap='Blues')

print(classification_report(y_test, y_pred))

Logistic regression performs poorly on class 1 ("Resigned"), which is the class we care about most.
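Since class 1 is the minority, one common option is to weight the classes inversely to their frequency through the `class_weight='balanced'` parameter of `LogisticRegression`. The sketch below compares recall on class 1 with and without weighting, using synthetic data as a stand-in. It is not a result for the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data with weakly separated classes (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.76, 0.24],
    class_sep=0.5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# 'balanced' reweights samples so both classes contribute equally to the loss
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print(f"Recall on class 1, unweighted: {recall_plain:.3f}")
print(f"Recall on class 1, balanced:   {recall_weighted:.3f}")
```

Weighting usually trades some precision on class 1 for higher recall, so the classification report should be read as a whole before preferring one variant.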

# Gradient Boosting

Next, we train a gradient boosting classifier to find out whether it produces a more accurate model.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=42)
gb_clf.fit(X_train, y_train.ravel())

y_gb=gb_clf.predict(X_test)

In [None]:
gb_matrix=confusion_matrix(y_test, y_gb)
sns.heatmap(gb_matrix, annot=True, cmap='Blues')
print(classification_report(y_test, y_gb))

The results above show that gradient boosting yields far more accurate predictions for both class 0 ("Stayed") and class 1 ("Resigned").
As such, it should be the preferred prediction model.

# Conclusion

While gradient boosting is the preferred prediction algorithm, it is prone to overfitting on this imbalanced dataset, which has a class ratio of roughly 76:24. A more accurate prediction could be made by gathering more resignee data and by using a continuous variable for the salary column instead of a categorical one (low-medium-high).
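One hedged way to probe the overfitting concern is repeated stratified cross-validation, comparing the training score with the validation score. The sketch below runs on synthetic stand-in data with the same gradient boosting settings as above; the F1 scoring choice and all `*_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption: roughly a 76:24 class ratio)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.76, 0.24], random_state=42
)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42
)
# Stratified folds keep the class ratio in every split; 5 folds x 2 repeats = 10 fits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_validate(
    clf, X_demo, y_demo, cv=cv, scoring="f1", return_train_score=True
)

train_f1 = scores["train_score"].mean()
valid_f1 = scores["test_score"].mean()
print(f"Mean train F1:      {train_f1:.3f}")
print(f"Mean validation F1: {valid_f1:.3f}")
```

A training score far above the validation score would point to overfitting. In that case a lower `learning_rate` or fewer estimators are common adjustments to try.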