# Business Problem

Employees are among a company's most important resources, and a high attrition rate indicates that the company is unable to retain them. In the short term, a high attrition rate forces the company to spend a large amount of money on turnover costs. In the long term, constant turnover causes the company's performance to decline.

# Goals

To analyze the factors that lead to employee attrition and to predict it, so that the company can give appropriate treatment to employees who are likely to leave.

# Mechanism

The modeling will be implemented in 2 phases:
1. Phase 1 <br>
Since the target is imbalanced, in this phase I will create 4 different datasets (imbalanced, random undersampling, random oversampling, and SMOTE oversampling) to see which treatment works best for the imbalanced class.
2. Phase 2 <br>
In this phase, I will focus on improving the model's performance through feature engineering and feature selection.

# Import Package and Dataset

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from imblearn import under_sampling, over_sampling
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc
from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix
import shap
from shap import summary_plot

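#Use the full option names; abbreviated names can be rejected as ambiguous by pandas
pd.set_option("display.max_columns",100)
pd.set_option("display.max_colwidth",1000)
pd.set_option("display.max_rows",1000)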

In [None]:
df = pd.read_csv('../input/employee-attrition/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

In [None]:
#Change values of features Attrition and Overtime, where Yes=1 and No=0
df['Attrition'] = np.where(df['Attrition']=='Yes', 1, 0)
df['OverTime'] = np.where(df['OverTime']=='Yes', 1, 0)

In [None]:
#Categorize numerical and categorical features
nums_features1 = ['Age','DailyRate','DistanceFromHome','Education','EmployeeCount','EmployeeNumber',
                 'EnvironmentSatisfaction', 'HourlyRate','JobInvolvement','JobLevel', 'JobSatisfaction','MonthlyIncome',
                 'MonthlyRate','NumCompaniesWorked','OverTime']

nums_features2 = ['PercentSalaryHike','PerformanceRating','RelationshipSatisfaction','StandardHours','StockOptionLevel',
                 'TotalWorkingYears','TrainingTimesLastYear','WorkLifeBalance','YearsAtCompany','YearsInCurrentRole',
                 'YearsSinceLastPromotion','YearsWithCurrManager']

cats_features = ['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus','Over18']

---
# EDA

## Descriptive Statistics

In [None]:
df[nums_features1].describe()

In [None]:
df[nums_features2].describe()

In [None]:
df[cats_features].describe()

Most of the numerical features that hold amounts rather than ratings show high variation and are positively skewed. The categorical features each have only a few unique values.

## Univariate Analysis

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Attrition', data=df, palette = 'RdGy')
plt.title('Attrition Rate', fontsize=14, weight='bold')
plt.xlabel('Attrition', fontsize = 12)
plt.ylabel('Total Employee', fontsize = 12);

As we can see, the target is imbalanced
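
The imbalance visible in the plot can also be expressed as numbers. The sketch below is illustrative only: `attrition` is a made-up toy series standing in for `df['Attrition']`, and `imbalance_ratio` is a name introduced here, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df['Attrition'] (1 = left, 0 = stayed); values are illustrative only
attrition = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name='Attrition')

# Absolute counts and relative shares per class
counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True)

# Ratio of majority to minority class, a quick indicator of imbalance
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data, replacing `attrition` with `df['Attrition']` should give the share of employees who left.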

## Multivariate Analysis

### 1. Heatmap

In [None]:
plt.figure(figsize=(20, 20))
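#Correlate numeric columns only; string columns make corr() fail in recent pandas versions
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='Reds', annot=True, fmt='.2f');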

The correlations between the features and the target are weak; the highest correlation is with OverTime.
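
Reading one column off a large heatmap is error-prone, so it can help to rank the features by their correlation with the target directly. This is a minimal sketch on a made-up `toy` frame (column names follow the dataset, values do not); `corr_with_target` and `ranked` are names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the values are made up
toy = pd.DataFrame({
    'Attrition':     [1, 0, 0, 1, 0, 1, 0, 0],
    'OverTime':      [1, 0, 0, 1, 0, 1, 1, 0],
    'MonthlyIncome': [2000, 5000, 6000, 2500, 7000, 3000, 4000, 8000],
    'Department':    ['Sales', 'R&D', 'R&D', 'Sales', 'HR', 'Sales', 'R&D', 'HR'],
})

# Keep numeric columns only, then take each feature's correlation with the target
corr_with_target = (
    toy.select_dtypes(include='number')
       .corr()['Attrition']
       .drop('Attrition')
)

# Order by absolute strength so strong negative correlations are not buried
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```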

### 2. Attrition x Numerical Features

In [None]:
numerical_features = []
for column in df.columns:
    if df[column].dtype != object:
        numerical_features.append(column)
        
numerical_features.remove('Attrition')

plt.figure(figsize=(20, 40))

for i, feature in enumerate(numerical_features, 1):
    plt.subplot(9, 3, i)
    df[df["Attrition"] == 0][feature].hist(bins=35, color='blue', label='Not Attrition', alpha=0.6)
    df[df["Attrition"] == 1][feature].hist(bins=35, color='red', label='Attrition', alpha=0.6)
    plt.legend()
    plt.xlabel(feature)
    plt.ylabel('count')

There are some insights:
1. Employees with low satisfaction (indicated by EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction) tend to resign
2. Employees with low benefits (indicated by MonthlyIncome, StockOptionLevel) tend to resign
3. Young employees tend to resign
4. Employees who have worked at a high number of companies tend to resign
5. Overtime employees tend to resign

### 3. Attrition x Categorical Features

In [None]:
categorical_features = []
for column in df.columns:
    if df[column].dtype == object:
        categorical_features.append(column)

plt.figure(figsize=(20, 15))

for i, feature in enumerate(categorical_features, 1):
    plt.subplot(3, 3, i)
    df[df["Attrition"] == 0][feature].hist(bins=35, color='blue', label='Not Attrition', alpha=0.6)
    df[df["Attrition"] == 1][feature].hist(bins=35, color='red', label='Attrition', alpha=0.6)
    plt.legend()
    plt.xlabel(feature)
    plt.ylabel('count')

There are some insights:
1. Employees who travel frequently tend to resign
2. Sales employees tend to resign
3. Single employees tend to resign
4. Female employees tend to resign
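
Overlaid count histograms make it hard to compare groups of different sizes, so the observations above could be cross-checked with the attrition rate per category. The sketch below assumes a 0/1 `Attrition` column as created earlier; `attrition_rate_by` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

# Made-up sample; in the notebook this would be df
toy = pd.DataFrame({
    'Attrition':      [1, 0, 0, 1, 0, 0, 1, 0],
    'BusinessTravel': ['Travel_Frequently', 'Travel_Rarely', 'Travel_Rarely', 'Travel_Frequently',
                       'Non-Travel', 'Travel_Rarely', 'Travel_Rarely', 'Travel_Frequently'],
})

def attrition_rate_by(data, feature, target='Attrition'):
    """Return headcount and attrition rate for each category of `feature`."""
    return (
        data.groupby(feature)[target]
            .agg(employees='count', attrition_rate='mean')
            .sort_values('attrition_rate', ascending=False)
    )

rates = attrition_rate_by(toy, 'BusinessTravel')
print(rates)
```

Because the target is coded 0/1, the group mean is the share of employees in that group who left, which is comparable across groups regardless of their size.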

---
# Phase 1 
I'll train a Decision Tree on 4 datasets (imbalanced, random undersampling, random oversampling, SMOTE oversampling).

In [None]:
df_s1 = df.copy()

## Data Pre-Processing

In [None]:
#Check if there are missing values and whether the data type is appropriate
df_s1.info()

There are no missing values and all data types are appropriate

In [None]:
#Check if there is any duplicate data
df_s1.duplicated().sum()

## Feature Encoding

In [None]:
cats_onehot = ['BusinessTravel','Department', 'EducationField', 'Gender','JobRole','MaritalStatus']

#Feature encoding for categorical data using onehots
for cat in cats_onehot:
    onehots = pd.get_dummies(df_s1[cat], prefix=cat)
    df_s1 = df_s1.join(onehots)

df_s1.head()

## Feature Selection

In [None]:
#Drop categorical data and unnecessary features
df_s1 = df_s1.drop(['BusinessTravel','Department', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'StandardHours'], axis = 1)

In [None]:
df_s1.info()

There is no duplicate data (per the `duplicated().sum()` check in the pre-processing step).

## Modeling

In [None]:
#Split features and target
X = df_s1.drop(columns=['Attrition'])
y = df_s1['Attrition']
print(X.shape)
print(y.shape)

In [None]:
#Create undersampling and oversampling datasets
X_under, y_under = under_sampling.RandomUnderSampler(random_state=42).fit_resample(X, y)
X_over, y_over = over_sampling.RandomOverSampler(random_state=42).fit_resample(X, y)
X_over_smote, y_over_smote = over_sampling.SMOTE(random_state=42).fit_resample(X, y)

In [None]:
print(pd.Series(y).value_counts())
print(pd.Series(y_under).value_counts())
print(pd.Series(y_over).value_counts())
print(pd.Series(y_over_smote).value_counts())

In [None]:
#Split data training and data test

#Imbalance
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.3, random_state = 42)

#Undersampling
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_under, y_under, test_size = 0.3, random_state = 42)

#Oversampling random
X_train3, X_test3, y_train3, y_test3 = train_test_split(X_over, y_over, test_size = 0.3, random_state = 42)

#Oversampling smote
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_over_smote, y_over_smote, test_size = 0.3, random_state = 42)
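
In the cells above, resampling is applied to the full dataset before the split. With random oversampling, copies of the same row can then land in both the training and the test data, which can inflate test scores. A minimal sketch of the alternative order (split first, then resample only the training split) is shown below. It runs on made-up toy data and uses `sklearn.utils.resample` as a dependency-light stand-in for `RandomOverSampler`; all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (90 negatives, 10 positives) standing in for X and y
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({'f1': rng.normal(size=100), 'f2': rng.normal(size=100)})
y_toy = pd.Series([0] * 90 + [1] * 10, name='Attrition')

# 1) Split first, stratified so both splits keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# 2) Oversample the minority class in the training split only
train = X_tr.assign(Attrition=y_tr)
majority = train[train['Attrition'] == 0]
minority = train[train['Attrition'] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_bal = pd.concat([majority, minority_up])

X_tr_bal = train_bal.drop(columns='Attrition')
y_tr_bal = train_bal['Attrition']

# The test split is untouched, so no copy of a training row can appear in it
print(y_tr_bal.value_counts())
print(y_te.value_counts())
```

With this order the test split keeps the original class ratio, so its metrics describe the imbalanced situation the model would face in practice.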

In [None]:
def eval_classification(model, pred, proba, xtrain, ytrain, xtest, ytest):
    print("Accuracy (Test Set): %.2f" % accuracy_score(ytest, pred))
    print("Precision (Test Set): %.2f" % precision_score(ytest, pred))
    print("Recall (Test Set): %.2f" % recall_score(ytest, pred))
    print("F1-Score (Test Set): %.2f" % f1_score(ytest, pred))
    
    fpr, tpr, thresholds = roc_curve(ytest, proba, pos_label=1)
    print("AUC: %.2f" % auc(fpr, tpr))

### **Imbalance Dataset**

In [None]:
#Training
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train1,y_train1)

#Predict
y_pred = model.predict(X_test1)
y_proba = model.predict_proba(X_test1)
y_proba = y_proba[:,1]

#Eval
eval_classification(model, y_pred, y_proba, X_train1, y_train1, X_test1, y_test1)

In [None]:
#Checking accuracy of data training and data test
print('Train score: ' + str(model.score(X_train1, y_train1))) 
print('Test score: ' + str(model.score(X_test1, y_test1)))

### **Undersampling Dataset**

In [None]:
#Training
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train2,y_train2)

#Predict
y_pred = model.predict(X_test2)
y_proba = model.predict_proba(X_test2)
y_proba = y_proba[:,1]

#Eval
eval_classification(model, y_pred, y_proba, X_train2, y_train2, X_test2, y_test2)

In [None]:
#Checking accuracy of data training and data test
print('Train score: ' + str(model.score(X_train2, y_train2))) 
print('Test score: ' + str(model.score(X_test2, y_test2)))

### **Oversampling Random Dataset**

In [None]:
#Training
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train3,y_train3)

#Predict
y_pred = model.predict(X_test3)
y_proba = model.predict_proba(X_test3)
y_proba = y_proba[:,1]

#Eval
eval_classification(model, y_pred, y_proba, X_train3, y_train3, X_test3, y_test3)

In [None]:
#Checking accuracy of data training and data test
print('Train score: ' + str(model.score(X_train3, y_train3))) 
print('Test score: ' + str(model.score(X_test3, y_test3)))

### **Oversampling SMOTE Dataset**

In [None]:
#Training
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train4,y_train4)

#Predict
y_pred = model.predict(X_test4)
y_proba = model.predict_proba(X_test4)
y_proba = y_proba[:,1]

#Eval
eval_classification(model, y_pred, y_proba, X_train4, y_train4, X_test4, y_test4)

In [None]:
#Checking accuracy of data training and data test
print('Train score: ' + str(model.score(X_train4, y_train4))) 
print('Test score: ' + str(model.score(X_test4, y_test4)))

Based on the results, random oversampling shows the best performance of the four. Therefore, I'll proceed with the prediction using the random oversampling dataset.
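
The four train-and-evaluate blocks above repeat the same steps, which makes the scores hard to compare at a glance. One possible way to collect them side by side is sketched below. `compare_datasets` is a hypothetical helper, and the demo runs on synthetic data from `make_classification`, not on the attrition data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_datasets(datasets, test_size=0.3, random_state=42):
    """Train one decision tree per (X, y) pair and tabulate test-set metrics."""
    rows = {}
    for name, (X, y) in datasets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        model = DecisionTreeClassifier(random_state=random_state).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        proba = model.predict_proba(X_te)[:, 1]
        rows[name] = {
            'accuracy': accuracy_score(y_te, pred),
            'precision': precision_score(y_te, pred),
            'recall': recall_score(y_te, pred),
            'f1': f1_score(y_te, pred),
            'auc': roc_auc_score(y_te, proba),
            'train_accuracy': model.score(X_tr, y_tr),
        }
    # One row per dataset, one column per metric
    return pd.DataFrame(rows).T.round(2)

# Synthetic imbalanced data, used only to show the call
X_toy, y_toy = make_classification(n_samples=300, weights=[0.85], random_state=42)
results = compare_datasets({'toy_imbalanced': (X_toy, y_toy)})
print(results)
```

In the notebook, the dictionary would hold the four pairs, for example `{'imbalance': (X, y), 'under': (X_under, y_under), 'over': (X_over, y_over), 'smote': (X_over_smote, y_over_smote)}`.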

---
# Phase 2
Using CatBoost for random oversampling dataset

In [None]:
df_s2 = df.copy()

## Data Pre-Processing

### 1. Feature Engineering
- Grouping job role based on job level
- Grouping age generation

In [None]:
#Map each job role to a group; roles not listed fall into 'Executive'
role_groups = {
    'Sales Executive': 'Staff', 'Laboratory Technician': 'Staff', 'Human Resources': 'Staff',
    'Sales Representative': 'Middle', 'Healthcare Representative': 'Middle', 'Research Scientist': 'Middle',
}

df_s2['JobRole'] = df_s2['JobRole'].map(role_groups).fillna('Executive')
df_s2.head()

In [None]:
#Assign a generation label from age; the first matching condition wins
age = df_s2['Age']
df_s2['Generation'] = np.select(
    [age >= 55, age >= 40, age >= 23],
    ['Boomers', 'Gen X', 'Gen Y'],
    default='Gen Z'
)
df_s2.head()

### 2. Feature Selection
In addition to the unnecessary features dropped before (EmployeeCount, EmployeeNumber, Over18, StandardHours), I decided to drop the rate features (DailyRate, HourlyRate, MonthlyRate) because they represent the rate the company must pay, not what the employee receives.

In [None]:
#Drop unnecessary features
df_s2 = df_s2.drop(['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate', 'Over18', 'MonthlyRate', 'StandardHours'], axis = 1)
df_s2.head()

In [None]:
df_s2.info()

In [None]:
#Define categorical features for modelling
cat_features = ['BusinessTravel','Department','EducationField', 'Gender', 'Generation','JobRole','MaritalStatus']

## Modeling
Since I'll be using the oversampling method to train the model, I'll use 2 kinds of dataset for evaluation:
1. Data test (oversampled)
2. Data eval (imbalanced)

This is to make sure the model can also predict imbalanced data, because the data in production will most likely be imbalanced.

In [None]:
#Split features and target
X = df_s2.drop(columns=['Attrition'])
y = df_s2['Attrition']
print(X.shape)
print(y.shape)

In [None]:
#Split data training and data eval (before oversampling)
X1_train, X_eval, y1_train, y_eval = train_test_split(X, y, test_size = 0.1, random_state = 42)

In [None]:
#Oversampling data training
X_over, y_over = over_sampling.RandomOverSampler(random_state=42).fit_resample(X1_train, y1_train)

In [None]:
#Split data training and data test (after oversampling)
X2_train, X_test, y2_train, y_test = train_test_split(X_over, y_over, test_size = 0.3, random_state = 42)

### Evaluate with data eval

In [None]:
from catboost import CatBoostClassifier
clf = CatBoostClassifier(learning_rate=0.05, random_state=42, iterations=300, eval_metric='AUC')

clf.fit(X2_train, y2_train, cat_features= cat_features, plot=False, eval_set=(X_eval, y_eval), verbose=True)

y_pred = clf.predict(X_eval)
y_proba = clf.predict_proba(X_eval)
y_proba = y_proba[:,1]
eval_classification(clf, y_pred, y_proba, X2_train, y2_train, X_eval, y_eval)

In [None]:
print('Train score: ' + str(clf.score(X2_train, y2_train))) 
print('Eval score: ' + str(clf.score(X_eval, y_eval)))

The evaluation result on the imbalanced dataset is good enough, but the model shows signs of overfitting.

### Evaluate with data test

In [None]:
from catboost import CatBoostClassifier
clf = CatBoostClassifier(learning_rate=0.05, random_state=42, iterations=300, eval_metric='Accuracy')

clf.fit(X2_train, y2_train, cat_features= cat_features, plot=False, eval_set=(X_test, y_test), verbose=True)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
y_proba = y_proba[:,1]
eval_classification(clf, y_pred, y_proba, X2_train, y2_train, X_test, y_test)

In [None]:
#Checking accuracy of data training and data test
print('Train score: ' + str(clf.score(X2_train, y2_train))) 
print('Test score: ' + str(clf.score(X_test, y_test)))

In [None]:
cf = confusion_matrix(y_test, y_pred)
cf

In [None]:
group_names = ['TN','FP','FN','TP']
group_counts = ['{0:0.0f}'.format(value) for value in
                cf.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in
                     cf.flatten()/np.sum(cf)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf, annot=labels, fmt='', cmap='rocket');

The evaluation result on the oversampled dataset is excellent, with no signs of overfitting.
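
The TN, FP, FN and TP cells of the confusion matrix determine the accuracy, precision, recall and F1 values printed by `eval_classification`. The worked example below shows the arithmetic on a made-up matrix `cf_toy` in scikit-learn's `[[TN, FP], [FN, TP]]` layout; the counts are not results from this notebook.

```python
import numpy as np

# Illustrative confusion matrix in scikit-learn layout [[TN, FP], [FN, TP]]
cf_toy = np.array([[50, 10],
                   [5, 35]])
tn, fp, fn, tp = cf_toy.ravel()

accuracy = (tp + tn) / cf_toy.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)             # of predicted leavers, how many really left
recall = tp / (tp + fn)                # of actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For attrition, recall is often the figure to watch, since a missed leaver (FN) means no retention action is taken for that employee.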

### Interpretation with SHAP

In [None]:
explainer = shap.Explainer(clf)
shap_values = explainer(X)

In [None]:
shap.plots.beeswarm(shap_values)

The plot above shows the 9 features that most affect an employee's decision to resign or stay. As we can see, OverTime, StockOptionLevel, and MonthlyIncome strongly affect attrition. Based on these insights, I came up with some strategies:
1. Evaluate employees' workload: why do they work overtime? If overtime is unavoidable, the benefits for it need to be re-evaluated
2. Build an appropriate culture and create a good work environment in order to increase EnvironmentSatisfaction and JobSatisfaction
3. Grant or increase StockOptionLevel for high-value employees who are likely to leave
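
The feature ranking read from the beeswarm plot can also be produced as a table, since the plot orders features by mean absolute SHAP value by default. The sketch below uses a made-up `shap_matrix` to show the calculation; in the notebook, `shap_values.values` and `shap_values.feature_names` would presumably take the place of the toy inputs.

```python
import numpy as np
import pandas as pd

# Made-up SHAP matrix: rows are employees, columns are features
feature_names = ['OverTime', 'StockOptionLevel', 'MonthlyIncome']
shap_matrix = np.array([
    [0.8, -0.3, -0.1],
    [-0.6, 0.2, 0.4],
    [0.7, -0.4, -0.2],
])

# Global importance: average magnitude of each feature's contribution
importance = (
    pd.Series(np.abs(shap_matrix).mean(axis=0), index=feature_names)
      .sort_values(ascending=False)
)
print(importance)
```

The absolute value matters here: a feature that pushes some predictions up and others down would otherwise average out to near zero despite having a large effect.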