# IBM Employee Attrition dataset

The purpose of this dataset is to analyze factors leading to employee attrition.

## 1. Exploratory Data Analysis

### 1.1. Imports

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
from ipywidgets import interact
import warnings

from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")
%matplotlib inline

### 1.2. Load dataset

In [None]:
df = pd.read_csv(r'../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')

Inspect some example entries in the dataset

In [None]:
df.head()

In [None]:
df.tail()

Features:

In [None]:
print(df.columns.values)

Are there any null values among the entries? There are a couple of ways to check:

In [None]:
df.info()

In [None]:
df.isnull().any()

In [None]:
df.isnull().values.any()

As we can see, the dataset contains no null values, so there is nothing to handle.

### 1.3. Feature analysis

Split the dataframe into numerical and categorical features

In [None]:
cat_df = df.select_dtypes(include = 'object')
num_types = [t for t in df.dtypes.unique() if t not in cat_df.dtypes.unique()]
num_df = df.select_dtypes(include = num_types)

First let's print basic statistics about all the numerical features (like mean, std, percentiles etc.)

In [None]:
num_df.describe()

We delete all the numerical features with 0 variance (all observations are the same) - they don't provide any useful information.

In [None]:
drop_labels = num_df.columns[num_df.std() == 0]
num_df.drop(columns = drop_labels, inplace = True)

Numerical features that take only a few distinct values (at most 5 here) can be analyzed the same way as categorical features.

In [None]:
potential_cat_df = num_df[num_df.columns[num_df.nunique() <= 5]].astype('str')
reduced_num_df = num_df.drop(columns=potential_cat_df.columns)
ext_cat_df = pd.concat([cat_df, potential_cat_df], axis=1)
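
As a minimal sketch of the same split on toy data (the helper name `split_low_cardinality`, the `max_unique` parameter and the toy values are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

def split_low_cardinality(frame, max_unique=5):
    """Return (low-cardinality columns cast to str, remaining numeric columns)."""
    low_card_cols = frame.columns[frame.nunique() <= max_unique]
    as_categories = frame[low_card_cols].astype(str)
    remaining = frame.drop(columns=low_card_cols)
    return as_categories, remaining

# Toy data made up for illustration, not taken from the dataset
toy = pd.DataFrame({
    "rating": [1, 2, 3, 4, 1, 2],                     # 4 distinct values
    "income": [2100, 3400, 5600, 7800, 9100, 4300],   # 6 distinct values
})
toy_cat, toy_num = split_low_cardinality(toy)
print(list(toy_cat.columns), list(toy_num.columns))
```

The threshold of 5 mirrors the cell above; a different cut-off may suit other datasets better.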

Let's inspect statistics about categorical features.

In [None]:
cat_df.describe()

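Just as we did above for the numerical features, we drop every categorical column that has a single unique value. Here that is the Over18 column, since every employee in the dataset is older than 18.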

In [None]:
drop_labels = cat_df.columns[cat_df.nunique() == 1]
cat_df.drop(columns = drop_labels, inplace = True)

Distribution plots for numerical features:

In [None]:
def num_dist_plot(feature):
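    # histplot with a KDE overlay replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(df[feature], kde=True)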
interact(num_dist_plot, feature=reduced_num_df.columns);

Another column that, based on its distribution plot, might not be necessary for the analysis is EmployeeNumber. This column is probably an employee ID.

In [None]:
num_df["EmployeeNumber"].nunique()

As expected for an ID number, every employee has a unique value, so this attribute carries no useful information.

In [None]:
num_df.drop(columns = "EmployeeNumber", inplace = True)

Now let's look at the correlation between the numerical features

In [None]:
corr = num_df.corr()
# Boolean mask that hides the upper triangle; the diagonal and the lower triangle stay visible
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=corr, mask=mask, square=True, annot=True, cbar=True);

Some features are predictably strongly correlated, for example total working years with income and age.

But some are unexpectedly uncorrelated:
* monthly, hourly and daily rates versus monthly income: if these rates meant pay per time period (hour, day, month), we would expect a strong correlation. It is hard to guess the meaning of these features without prior knowledge about the dataset.
* perhaps less unexpectedly, job involvement and job satisfaction are not related to salary.
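
Reading pairs off a large heatmap is error-prone, so it can help to list the strongest correlations directly. The sketch below is one way to do that; the helper name `top_correlated_pairs` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """List the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep each pair once: strictly upper triangle, diagonal excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data: "b" is almost a linear function of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = top_correlated_pairs(toy, n=1)
print(top)
```

On the notebook's data the call would be `top_correlated_pairs(num_df)`, assuming `num_df` still holds only numeric columns.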

Now we'll analyze the relation between the target variable and other attributes

In [None]:
temp_df = pd.concat([cat_df["Attrition"], reduced_num_df], axis=1)
def boxplot_numerical_target(feature):
    sns.boxplot(x="Attrition", y=feature, data=temp_df)

interact(boxplot_numerical_target, feature=reduced_num_df.columns);

Based on the plots above, most features appear to have some relation to attrition, but a few (training times last year and the monthly and hourly rates) look like they have no influence on the target variable, so we can drop them. We also drop the daily rate: employees with a low daily rate are somewhat over-represented among those who left, but since this feature is probably connected with the other two rates, that may be a coincidence. Without prior knowledge about the dataset we can't really tell what these rate features mean, which is another reason to drop them.
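
The visual judgement can be complemented with a number. One possible check, sketched here on toy data, is the point-biserial correlation between each numeric feature and a 0/1 target; the helper name `rank_by_point_biserial` and the toy columns are assumptions, and the notebook itself relies on the box plots only.

```python
import numpy as np
import pandas as pd
from scipy import stats

def rank_by_point_biserial(features, target):
    """Point-biserial correlation of each numeric column with a 0/1 target."""
    rows = []
    for col in features.columns:
        r, p = stats.pointbiserialr(target, features[col])
        rows.append({"feature": col, "r": r, "p_value": p})
    # Smallest p-value first: strongest evidence of a relation at the top
    return pd.DataFrame(rows).set_index("feature").sort_values("p_value")

# Toy data: "informative" shifts with the target, "noise" does not
rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=300), name="left")
toy = pd.DataFrame({
    "informative": target + rng.normal(scale=0.5, size=300),
    "noise": rng.normal(size=300),
})
ranking = rank_by_point_biserial(toy, target)
print(ranking)
```

For the notebook's data the target would first have to be mapped to 0/1, for example with `(cat_df["Attrition"] == "Yes").astype(int)`.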

In [None]:
num_df.drop(columns=["DailyRate", "HourlyRate", "MonthlyRate", "TrainingTimesLastYear"], inplace=True)

Count plots for categorical features with attrition and cross-tabulation of these two factors:

In [None]:
def relation_to_attrition(feature):
    grouped = ext_cat_df.groupby([feature, "Attrition"])["Attrition"].count().unstack()
    grouped.plot(kind="bar", stacked=True)
    xtab = pd.crosstab(columns=ext_cat_df.Attrition, index=ext_cat_df[feature], margins=True, normalize='index')
    table = plt.table(cellText=np.round(xtab.values, 3), rowLabels=xtab.index,
            colLabels=xtab.columns, loc='top', cellLoc='center')
    table.auto_set_column_width(range(xtab.columns.size))
    fig=plt.gcf()
    fig.set_size_inches(8,6)

interact(relation_to_attrition, feature=ext_cat_df.columns.drop("Attrition"));

Based on the figures above, we can propose a couple of hypotheses (at least the more evident ones):
1. More frequently traveling employees are more likely to experience attrition
2. Employees working overtime are more likely to ...
3. Women are less likely to ...
4. Employees that are single are more likely to ...
5. Stressed employees and those in a weaker mental condition (the ones giving the lowest scores in tests related to their situation at work, outside of work, and work-life balance) are more likely to ...
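
Hypotheses like these can be checked with a chi-square test of independence on the cross-tabulation. The sketch below uses made-up counts; the helper name `attrition_independence_test` is an assumption, and the notebook does not run this test itself.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def attrition_independence_test(frame, feature, target="Attrition"):
    """Chi-square test of independence between a categorical feature and the target."""
    table = pd.crosstab(frame[feature], frame[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof

# Toy counts invented for illustration (not the dataset's real numbers)
toy = pd.DataFrame({
    "OverTime": ["Yes"] * 40 + ["No"] * 60,
    "Attrition": ["Yes"] * 25 + ["No"] * 15 + ["Yes"] * 10 + ["No"] * 50,
})
chi2, p_value, dof = attrition_independence_test(toy, "OverTime")
print(f"chi2={chi2:.2f}, p={p_value:.2g}, dof={dof}")
```

A small p-value would suggest that attrition is not independent of the feature; on the notebook's data the call would be `attrition_independence_test(ext_cat_df, "OverTime")`.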

The remaining features will be used to train a model to predict employee attrition.

In [None]:
selected = pd.concat([cat_df, num_df], axis=1)
selected.info()

## 2. Test hypothesis

Let's check if men earn more (in similar positions).
* The null hypothesis H0: women and men earn the same on average
* The alternative hypothesis H1: women and men do not earn the same on average

To check if we can reject the null hypothesis we can use a two-tailed t-test. The SciPy function used here to calculate the test statistic returns signed values, so if the null hypothesis is rejected we can determine which group earns more.

In [None]:
hyp_df = selected[['Gender', 'JobRole', 'MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany']]
hyp_df.info()
female = hyp_df[hyp_df.Gender == 'Female']
male = hyp_df[hyp_df.Gender == 'Male']

First we need to calculate the critical value for the test statistic based on level of significance and degrees of freedom.

In [None]:
from scipy import stats
from IPython.display import display

alpha = 0.1
# named dof rather than df, so the DataFrame loaded earlier is not overwritten
dof = len(hyp_df.index) - 2
# alpha/2, because it's a two-tailed test
crit_val = np.abs(stats.t.ppf(alpha/2, dof))
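
One caveat: the tests below pass `equal_var=False`, which selects Welch's t-test, and its degrees of freedom come from the Welch-Satterthwaite formula rather than from n - 2. The following sketch, on synthetic samples, shows that with matching degrees of freedom the critical-value rule and the simpler `p < alpha` rule agree. The helper name `welch_dof` and the sample parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def welch_dof(x, y):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    vx = np.var(x, ddof=1) / len(x)
    vy = np.var(y, ddof=1) / len(y)
    return (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

# Synthetic samples, not the dataset's incomes
rng = np.random.default_rng(42)
group_a = rng.normal(loc=6500, scale=4500, size=60)
group_b = rng.normal(loc=6400, scale=4800, size=90)

alpha = 0.1
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
dof_welch = welch_dof(group_a, group_b)
crit = stats.t.ppf(1 - alpha / 2, dof_welch)

# Both decision rules give the same answer when the degrees of freedom match
reject_by_t = bool(abs(t_stat) > crit)
reject_by_p = bool(p_value < alpha)
print(reject_by_t, reject_by_p)
```

For samples as large as this dataset the two critical values are close, so the n - 2 approximation is unlikely to change a conclusion, but comparing `p` with `alpha` avoids the question entirely.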

Now let's check our null hypothesis regardless of any other factor.

In [None]:
sns.boxplot(x='Gender', y='MonthlyIncome', data=hyp_df);
t, p = stats.ttest_ind(female['MonthlyIncome'], male['MonthlyIncome'], equal_var=False)
if np.abs(t) < crit_val:
    display(f"Can't reject null hypothesis (t_val : {t}, p_val : {p})")
else:
    display(f"Hypothesis rejected (t_val : {t}, p_val : {p})")

In general, women seem to earn about the same as men, but we can also check whether wages differ for people in a particular position.

In [None]:
sns.boxplot(x="JobRole", y="MonthlyIncome", hue="Gender", data=hyp_df);
fig=plt.gcf()
fig.set_size_inches(8, 8)

def test_for_position(position):
    dof = len(hyp_df[hyp_df.JobRole == position].index) - 2
    cv = np.abs(stats.t.ppf(alpha/2, dof))
    t, p = stats.ttest_ind(female[female.JobRole == position]['MonthlyIncome'], male[male.JobRole == position]['MonthlyIncome'], equal_var=False)
    if np.abs(t) < cv:
        display(f"Can't reject null hypothesis (t_val : {t}, p_val : {p})")
    else:
        display(f"Hypothesis rejected (t_val : {t}, p_val : {p})")

interact(test_for_position, position=hyp_df["JobRole"].unique());

For every job role except one we can't reject the null hypothesis. Only for Research Directors does the test reject it at the chosen significance level (alpha = 0.1), with women earning less than men in the same position. The difference in wages is also clearly visible in the box plot above.

Let's see if this difference in salaries can be explained by other factors that have strong correlation with monthly income, specifically total working years and years worked in this company.

In [None]:
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
sns.boxplot(y="YearsAtCompany", x="Gender", data=hyp_df[hyp_df.JobRole == 'Research Director'], ax=ax[0]);
sns.boxplot(y="TotalWorkingYears", x="Gender", data=hyp_df[hyp_df.JobRole == 'Research Director'], ax=ax[1]);

hyp_df[hyp_df.JobRole == 'Research Director'].groupby('Gender')[['YearsAtCompany', 'TotalWorkingYears']].describe()

One factor that could explain the difference in wages is that men in this position generally have more years of experience.

On the other hand, women Research Directors have generally worked longer at this specific company, and companies usually reward this kind of commitment with a higher salary.

Even though monthly income is more strongly correlated with total working years (which could explain men's higher salaries), the factor that causes the difference in wages may still lie outside of this dataset.

## 3. Prepare training and test datasets

After exploratory analysis of the dataset we are left only with features that are potentially useful for predicting if the employee will experience attrition. First we have to encode the categorical features so they can be used to train a model.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

cat_mask = selected.dtypes==object
cat_cols = selected.columns[cat_mask].tolist()

selected[cat_cols] = selected[cat_cols].apply(lambda col: le.fit_transform(col))
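
Reusing one `LabelEncoder` works for encoding, but only the mapping of the last column survives. If the original labels are needed later (for example to report predictions as Yes/No), one encoder per column can be kept. This is a sketch on toy data; the helper name `encode_categoricals` is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(frame):
    """Label-encode every non-numeric column and keep one fitted encoder per column."""
    encoded = frame.copy()
    encoders = {}
    for col in encoded.select_dtypes(exclude="number").columns:
        encoders[col] = LabelEncoder()
        encoded[col] = encoders[col].fit_transform(encoded[col])
    return encoded, encoders

# Toy rows made up for illustration
toy = pd.DataFrame({
    "Attrition": ["No", "Yes", "No"],
    "OverTime": ["Yes", "No", "No"],
    "Age": [41, 49, 37],
})
encoded, encoders = encode_categoricals(toy)
restored = encoders["OverTime"].inverse_transform(encoded["OverTime"])
print(encoded)
print(list(restored))
```

Classes are sorted alphabetically, so `"No"` maps to 0 and `"Yes"` to 1.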

Now we will split the dataset into two parts: training and testing (roughly an 80-20 split, since each row is assigned at random).

In [None]:
# could be also done with sklearn.model_selection.train_test_split
mask = np.random.rand(len(selected)) < 0.8
train = selected[mask]
test = selected[~mask]

y_train = train['Attrition']
x_train = train.drop(columns='Attrition')

y_test = test['Attrition']
x_test = test.drop(columns='Attrition')
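
As the comment in the cell notes, `train_test_split` is an alternative. A sketch on toy data follows; `stratify` keeps the share of the positive class equal in both parts and `random_state` makes the split reproducible. The toy frame and the seed value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 positive and 80 negative rows, values made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MonthlyIncome": rng.integers(1000, 20000, size=100),
    "Attrition": [1] * 20 + [0] * 80,
})
X = toy.drop(columns="Attrition")
y = toy["Attrition"]

x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1234
)
print(len(x_tr), len(x_te), y_tr.mean(), y_te.mean())
```

With the notebook's data the call would take `selected.drop(columns='Attrition')` and `selected['Attrition']` as inputs.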

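The next cell oversamples the minority class in the training set with SMOTE, so that the models are trained on a balanced target. The test set is left untouched.

In [None]: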
from imblearn.over_sampling import SMOTE

oversampler=SMOTE(random_state=1234)
x_train_smote,  y_train_smote = oversampler.fit_resample(x_train,y_train)
x_train_smote = pd.DataFrame(data=x_train_smote, columns=x_train.columns)
y_train_smote = pd.Series(data=y_train_smote)

Now we will train models for predicting employee attrition. We use two implementations of gradient boosting with decision trees as weak learners: LightGBM and XGBoost. For each of them we look for good hyperparameters using grid search, which chooses the best performing configuration based on the results of cross-validation.

In [None]:
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

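params = {
        # n_estimators is accepted by both scikit-learn wrappers
        # (num_iterations is a LightGBM alias for the same setting)
        'n_estimators' : [50, 200, 500, 1000],
        'learning_rate' : [0.05, 0.1, 0.25],
        'subsample': [0.2, 0.4, 0.6, 0.8],
        # num_leaves is a LightGBM parameter
        'num_leaves': [4, 6, 10, 20, 50]
        }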

gsearch_LGBM = GridSearchCV(estimator=lgb.LGBMClassifier(), param_grid=params,
                            scoring='recall', n_jobs=-1, cv=5)

gsearch_XGB  = GridSearchCV(estimator=xgb.XGBClassifier(), param_grid=params,
                            scoring='recall', n_jobs=-1, cv=5)

%time gsearch_LGBM.fit(x_train_smote, y_train_smote);
%time gsearch_XGB.fit(x_train_smote, y_train_smote);
print("Training finished")
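
To get a feel for the cost of this search, the number of fits can be counted before running it. The sketch below uses scikit-learn's `ParameterGrid` with the same number of values per parameter as the grid above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid sizes as in the search above
grid = {
    'n_estimators': [50, 200, 500, 1000],   # number of boosting rounds
    'learning_rate': [0.05, 0.1, 0.25],
    'subsample': [0.2, 0.4, 0.6, 0.8],
    'num_leaves': [4, 6, 10, 20, 50],
}
cv_folds = 5

n_combinations = len(ParameterGrid(grid))   # 4 * 3 * 4 * 5
total_fits = n_combinations * cv_folds      # per model, excluding the final refit
print(n_combinations, total_fits)
```

That is 240 combinations and 1200 cross-validation fits for each of the two models, which explains why the cell takes a while even with `n_jobs=-1`.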

Now let's evaluate the best configuration of each model on our test set.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
import json

models = {"XGB" : gsearch_XGB, "LGBM" : gsearch_LGBM}
metrics = {}
for k, model in models.items():
    # predict() already returns class labels, so no thresholding is needed;
    # ROC AUC is computed from the predicted probability of the positive class
    labels = model.predict(x_test)
    proba = model.predict_proba(x_test)[:, 1]
    metrics[k] = {'acc' : accuracy_score(y_test, labels), 'prec' : precision_score(y_test, labels),
                  'rec' : recall_score(y_test, labels),   'roc' : roc_auc_score(y_test, proba)}

print(f'XGB params: {gsearch_XGB.best_params_}')
print(f'XGB score : {gsearch_XGB.best_score_}')
print('XGB scores: {}'.format(json.dumps(metrics['XGB'], indent=4)))
print(f'LGBM params : {gsearch_LGBM.best_params_}')
print(f'LGBM score : {gsearch_LGBM.best_score_}')
print('LGBM scores: {}'.format(json.dumps(metrics['LGBM'], indent=4)))

In this case we want a model optimized for recall, so that we catch as many cases of attrition as possible, and we care less about false positives. That way it may be possible to act on employees who are likely to leave the company before they actually do.
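
Besides choosing hyperparameters by recall, recall can also be traded against precision by lowering the decision threshold applied to the predicted probabilities. The sketch below illustrates the effect on made-up numbers; the helper name `recall_precision_at` and the probabilities are assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.3, 0.1, 0.1, 0.05])

def recall_precision_at(y, p, threshold):
    """Recall and precision when predicting positive for p >= threshold."""
    labels = (p >= threshold).astype(int)
    return recall_score(y, labels), precision_score(y, labels)

rec_default, prec_default = recall_precision_at(y_true, proba, 0.5)
rec_low, prec_low = recall_precision_at(y_true, proba, 0.3)
print(rec_default, prec_default)
print(rec_low, prec_low)
```

In this toy case lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 0.75 while precision falls from about 0.67 to 0.5. For the fitted models the probabilities would come from `predict_proba(x_test)[:, 1]`.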

The best model (XGBoost) identifies around 50% of the employees who leave the company, i.e. its recall on the test set is about 0.5.

Below we can see the feature importances of both models.

In [None]:
lgb.plot_importance(gsearch_LGBM.best_estimator_, figsize=(6, 6), title='LGBM');
ax = xgb.plot_importance(gsearch_XGB.best_estimator_, title='XGB');
fig = ax.figure
fig.set_size_inches(6, 6)

As the exploratory analysis suggested, the most important features include monthly income, distance from home, working overtime, and job and environment satisfaction. An employee's performance rating and the department they currently work in have almost no effect on the output of the model, so we could have dropped these features before training.