# IBM HR Analytics Employee Attrition Detection

## Business Problem

This exercise will deal with predicting which employees are likely to resign.

Attrition of highly skilled talent in a high-tech sector can have serious business implications for firms like IBM. There is therefore strong business interest in understanding what drives attrition and in predicting the probability that an employee will quit.

Such a model would facilitate timely HR intervention, which could possibly avert an employee from quitting.

## Data

The data comprises an employee survey that records the following details for each employee, including whether the employee has resigned or is still employed.

|Feature|Description|
|----|----|
|Age|Age of employee in numerical value|
|Attrition|Employee quitting (0=No, 1=Yes)|
|Business Travel|(1=No Travel, 2=Travel Frequently, 3=Travel Rarely)|
|Daily Rate|Numerical Value - Salary Level|
|Department|(1=HR, 2=R&D, 3=Sales)|
|Distance From Home|Numerical Value - distance from work to home|
|Education|No. of years in numerical value|
|Education Field|(1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL)|
|Employee Count|Numerical value|
|Employee Number|Employee ID in numerical value|
|Environment Satisfaction|satisfaction level with the environment in numerical value|
|Gender|(1=FEMALE, 2=MALE)|
|Hourly Rate|Hourly salary in numerical value|
|Job Involvement|Numerical value|
|Job Level|Numerical value|
|Job Role|(1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANAGING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE)|
|Job Satisfaction|Numerical value|
|Marital Status|(1=divorced, 2=married, 3=single)|
|Monthly Income|Numerical value|
|Monthly Rate|Numerical value|
|No. of Companies Worked|0-9 numerical value|
|Overtime|(1=NO, 2=YES)|
|Salary Hike %|Numerical Value|
|Over 18|(1=YES, 2=NO)|
|Performance Rating|Numerical value|
|Relationship Satisfaction|Numerical value|
|Standard Hours|Numerical value|
|Stock Option Level|Numerical value|
|Total Working Years|Numerical value|
|Training Times Last Year|Numerical value|
|Work-Life Balance|Numerical value|
|Years At Company|Numerical value|
|Years Since Last Promotion|Numerical value|
|Years with Current Manager|Numerical value|
|Hiring Source|(seek, referral, recruit.net, linkedin, jora, indeed, glassdoor, adzuna, company website)|

## Solution and Methodology

This problem is similar to predicting cancer, in the sense that the cost of a false negative (here, missing an employee who is about to quit) is far greater than the cost of a false positive. Therefore the objective of the model would be to **maximize recall score**. In other words, the model should be highly sensitive.
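To make the objective concrete, here is a minimal sketch of how recall is computed from confusion-matrix counts. The function name and the counts are made up for illustration; they are not results from this dataset.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall (sensitivity) = TP / (TP + FN)."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# Hypothetical counts: 80 leavers correctly flagged, 20 leavers missed
example_recall = recall(true_positives=80, false_negatives=20)
print(f"recall = {example_recall:.2f}")
```

Note that false positives do not enter the formula at all, which is why a recall-oriented model tolerates flagging some employees who would have stayed.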

As we will see later, the variables are not normally distributed. Further, although I have not verified it, it would be prudent not to rule out multicollinearity among the features. To avoid violating the assumptions of a linear model, it would be wise to go for a non-parametric model, say a tree-based algorithm.

Therefore, the plan of action will be:
1. Fitting Base Classifiers
    1. RandomForest
    2. Adaboost
    3. Gradient Boosting (or XGBoost)
2. Stacking $\rightarrow$ with all the above 'base' classifiers
3. Voting $\rightarrow$ with the base classifiers plus stacking classifier

For all the models, the following pipeline shall be trained:
1. Missing value imputation with column median.
2. Transforming categorical features to their Weight of Evidence: $\ln{\frac{\text{Proportion of Positive Events}}{\text{Proportion of Negative Events}}}$
3. Transforming variables towards a normal distribution using the Yeo-Johnson transformation (all the features in the dataset are of numerical dtype by now)
4. Standardizing the dataset using sklearn ```StandardScaler()```.
5. Adjusting for class imbalance using the imbalanced-learn combination algorithm, SMOTE-Tomek. As said before, the intent is to increase sensitivity by demarcating a clear decision boundary.
6. Fitting the base algorithm

The respective hyperparameters of each algorithm in the above pipeline are in turn tuned using ```GridSearchCV()``` with 10-fold cross validation.
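As an illustration of step 2, the sketch below computes the Weight of Evidence of each label of a categorical feature in plain Python. It assumes the usual definition, where the "proportion of positive events" for a label is the share of all positive cases that fall in that label (and likewise for negatives). The helper name and the toy data are hypothetical and do not come from the dataset.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Return {label: ln(share of positives in label / share of negatives in label)}."""
    positives = Counter(c for c, t in zip(categories, targets) if t == 1)
    negatives = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_pos = sum(positives.values())
    total_neg = sum(negatives.values())

    woe = {}
    for label in set(categories):
        if positives[label] == 0 or negatives[label] == 0:
            # The logarithm is undefined when a label has only one class
            raise ValueError(f"WoE undefined for {label!r}: both classes are needed")
        share_pos = positives[label] / total_pos
        share_neg = negatives[label] / total_neg
        woe[label] = math.log(share_pos / share_neg)
    return woe


# Toy data (hypothetical): overtime flag and attrition outcome for 8 employees
overtime = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
attrition = [1, 1, 0, 1, 0, 0, 0, 0]

woe = weight_of_evidence(overtime, attrition)
for label, value in sorted(woe.items()):
    print(f"{label}: {value:+.3f}")
```

A positive value means the label is over-represented among employees who resigned, a negative value means it is over-represented among those who stayed.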

### Model Interpretation

<font size=1.5>*source*:
1. [Importance of ML Interpretation](https://towardsdatascience.com/human-interpretable-machine-learning-part-1-the-need-and-importance-of-model-interpretation-2ed758f5f476)
2. [Model Interpretation Strategies](https://towardsdatascience.com/explainable-artificial-intelligence-part-2-model-interpretation-strategies-75d4afa6b739)
3. [Hands-On Machine Learning Interpretation](https://towardsdatascience.com/explainable-artificial-intelligence-part-3-hands-on-machine-learning-model-interpretation-e8ebe5afc608)
</font>

The models we will be deploying are complex, black-box models: they are non-parametric meta-classifiers built from many simple tree classifiers (random forest, adaptive boosting or gradient boosting), whose inner functioning is not easy to gauge.

To have confidence in these models, decision makers often feel the need to understand what drives them. Besides, explainability is no longer a luxury; in some settings regulations require models to be explainable. Model explainability has therefore become the new "frontier" in ML.

#### Criteria

Although model interpretation is still an evolving field, most of the techniques revolve around the following three criteria:

<u>Intrinsic/ post hoc</u>? Intrinsic implies that the model itself is interpretable, _viz_., a parametric model or a single decision tree; whereas post hoc involves trying to interpret a pre-trained model.

<u>Model-specific/ model-agnostic</u>? Certain model interpretation techniques apply only to specific models, _viz_., p-values and AIC scores pertaining to regression models. Model-agnostic tools, on the other hand, are relevant to post hoc methods and can be applied to any machine learning model.

Broadly speaking, the scope of interpretability can be either local or global. <u>Local</u> interpretation (Why did the model make a specific decision?) means being able to explain the conditional interaction between the response variable and the predictor variables w.r.t. a single example. <u>Global</u> interpretation (How does the model make predictions?) tries to explain the model based on the complete dataset.

In our analysis, we will implement post-hoc, model-agnostic techniques having both local as well as global scope. 

#### Techniques
##### Feature Importance
The degree to which a predictive model relies on a particular feature. Typically, it is measured as the increase in the model's prediction error after the feature's values are permuted.

Since we will be using SHAP values (discussed below), computing feature importance separately would be redundant.
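For reference, the permutation idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (error rate as the metric, a toy predictor that uses only the first feature). The function and variable names are hypothetical and this is not the procedure used later in this notebook.

```python
import random


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean increase in error rate after shuffling each feature column in turn."""
    rng = random.Random(seed)

    def error_rate(data):
        return sum(predict(r) != y for r, y in zip(data, labels)) / len(labels)

    baseline = error_rate(rows)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, which breaks its link with the labels
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(error_rate(permuted) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the label depends only on the first feature
data_rng = random.Random(1)
toy_rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(400)]
toy_labels = [int(r[0] > 0) for r in toy_rows]


def toy_predict(row):
    return int(row[0] > 0)


importances = permutation_importance(toy_predict, toy_rows, toy_labels)
print(importances)
```

Shuffling the first column should raise the error rate substantially, while shuffling the second column leaves it unchanged because the toy predictor never reads it.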

##### Partial Dependence Plots
*Not implemented*

##### Global Surrogate Models
*Not implemented*

##### Local Interpretable Model-Agnostic Explanations (LIME)
*Not implemented*

##### Shapley Additive Explanations
<font size=1.5>**Source**: [SHAP](https://christophm.github.io/interpretable-ml-book/shap.html)</font>

SHAP (SHapley Additive exPlanations) by Lundberg and Lee (2016) is a method to explain individual predictions. Based on the game-theoretically optimal Shapley values, the goal of SHAP is to explain the prediction of an instance $x$ by computing the contribution of each feature to the prediction.

The original Shapley values (from game theory) tell us how to fairly distribute the payout (prediction) among the players (features). In the ML context, SHAP represents the Shapley value explanation as an additive feature attribution method, a linear model.

\begin{align}
g(z') &= \phi_0+\sum^M_{j=1}\phi_j z'_j \\
\text{where}\\
g &: \text{explanation model} \\
z' & \in\{0,1\}^M\;\;\text{is a coalition vector} \\
\phi_j & \in \mathbb{R};\;\text{is a feature attribution of feature }j \\
\end{align}
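To see the additive form at work, the following sketch computes exact Shapley values for a tiny, made-up value function by enumerating all coalitions. It is a brute-force illustration only: the `shap` library uses far more efficient algorithms, and `toy_value` is a hypothetical stand-in for a model's expected prediction given a coalition of known features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size preceding feature j
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature j to this coalition
                phi[j] += weight * (value(coalition | {j}) - value(coalition))
    return phi


def toy_value(coalition):
    """Hypothetical payout: base 10, main effects for 0 and 1, plus an interaction."""
    payout = 10.0
    if 0 in coalition:
        payout += 4.0
    if 1 in coalition:
        payout += 2.0
    if 0 in coalition and 1 in coalition:
        payout += 2.0  # interaction, split equally between features 0 and 1
    return payout


phi = shapley_values(toy_value, n_features=3)
phi_0 = toy_value(frozenset())          # value of the empty coalition
full = toy_value(frozenset({0, 1, 2}))  # value with every feature present
print(phi, phi_0 + sum(phi), full)
```

With every entry of $z'$ set to 1, $g(z') = \phi_0 + \sum_j \phi_j$ recovers the value of the full coalition, and the unused third feature receives an attribution of zero.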

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
!pip install feature-engine

In [None]:
!pip install eli5

In [None]:
!pip install shap

In [None]:
import os
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
np.random.seed(0)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Revise according to your data directory
PATH = '/kaggle/input/ibm-hr-data/'
FILE = 'IBM_HR_Data.csv'

## Cleaning and processing raw data

The entire statistical modelling process begins with loading the dataset and cleaning it into the desired form.

Certain variables, _viz_., EmployeeCount, EmployeeNumber and ApplicationID, are not features. Further, the variables Over18 and StandardHours are 99.99% constant; features without variability are of no use to the model. We shall remove all such variables.

I have also defined a utility function to view each variable, its data type and a sample of its values in tabular form...

In [None]:
df = pd.read_csv(filepath_or_buffer=os.path.join(PATH, FILE))

# We are dropping the following columns since they are not features, just ID nos and so on. Plus, 'StandardHours' contains no variation
df.drop(labels=['EmployeeCount', 'EmployeeNumber', 'ApplicationID', 'Over18', 'StandardHours'], axis='columns', inplace=True)

# Overview of dtypes and Missing Values
def observe_data(df):
    '''
    Presents exploratory data summary in a crisp manner; 
    with dtype, null values, total values and feature summary columns.
    '''
    df = df.copy()
    properties = pd.Series(dtype='object')  # explicit dtype avoids the empty-Series dtype warning
    for i in df.columns.tolist():
        if pd.api.types.is_object_dtype(df[i]):
            properties[i] = df[i].unique().tolist()
        elif pd.api.types.is_numeric_dtype(df[i]):
            properties[i] = round(df[i].describe(),2).tolist()
        elif pd.api.types.is_datetime64_any_dtype(df[i]):
            properties[i] = [df[i].min().strftime(format='%Y-%m-%d'), df[i].max().strftime(format='%Y-%m-%d')]
        elif isinstance(df[i].dtype, pd.CategoricalDtype):
            properties[i] = list(df[i].unique())
    observe = pd.concat([df.dtypes, df.isnull().sum(), df.notnull().sum(), properties], axis=1)
    observe.columns = ['dtypes', 'Missing_Vals', 'Total_Vals', 'Properties']
    return observe

observe_data(df)

Although most features are stored as integers, many of them are really ordinal or nominal variables.

The following are some observations/ action items:
* Features ```'NumCompaniesWorked'``` and ```'BusinessTravel'``` in the ```categorical_features``` list are of ordinal type, whereas the rest are nominal. We will therefore segregate the elements of the ```categorical_features``` list into their respective sub-categories.
* Feature ```'NumCompaniesWorked'```, containing 10 labels, will be regrouped into three labels. This should make classification easier.
* Finally, we shall classify variables into ```'actual numerical'``` (interval/ measure scale), ```'nominal_variables'``` and ```'ordinal_variables'```.

|actual numerical features|ordinal features|nominal features|
|----|----|----|
|'DistanceFromHome'|'Education'|'TrainingTimesLastYear'|
|'PercentSalaryHike'|'EnvironmentSatisfaction'|'WorkLifeBalance'|
|'YearsAtCompany'|'JobInvolvement'|'Employee Source'|
|'YearsInCurrentRole'|'JobLevel'|'Department'|
|'YearsSinceLastPromotion'|'JobSatisfaction'|'EducationField'|
|'YearsWithCurrManager'|'PerformanceRating'|'Gender'|
|'TotalWorkingYears'|'RelationshipSatisfaction'|'JobRole'|
|'Age'|'StockOptionLevel'|'MaritalStatus'|
|'DailyRate'|'BusinessTravel'|'OverTime'|
|'HourlyRate'|'NumCompaniesWorked'||
|'MonthlyIncome'|||
|'MonthlyRate'|||

In [None]:
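# We will bin 'NumCompaniesWorked' into a small number of ordered categories
def binarize(column, bins, num_categories=(1, 2, 3)):
    # pd.cut on a list returns a Categorical whose categories are the intervals in `bins`
    x = pd.cut(x=column.tolist(), bins=bins, include_lowest=True)
    # Relabel the intervals with integer codes; recent pandas versions
    # do not allow assigning to `.categories` directly
    x = x.rename_categories(list(num_categories))
    return x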

In [None]:
# Categories are binned by number of companies worked: 0-2 -> 1 (single), 3-5 -> 2 (few), 6-9 -> 3 (many)
bins = pd.IntervalIndex.from_tuples([(-1, 2), (2, 5), (5, 9)])

# transforming 'NumCompaniesWorked' and a few more variables
df['NumCompaniesWorked'] = binarize(column=df['NumCompaniesWorked'], bins=bins).astype('O')
df['TrainingTimesLastYear'] = df['TrainingTimesLastYear'].astype('O')
df['WorkLifeBalance'] = df['WorkLifeBalance'].map({1:'Low',2:'Medium',3:'High',4:'Very High',5:'Very High'})
df['BusinessTravel'] = df['BusinessTravel'].map({'Travel_Rarely':2, 'Travel_Frequently':3, 'Non-Travel':1})

In [None]:
# List of all variables that are of 'O' type
categorical_features = df.select_dtypes(include=['object','category']).columns.tolist()
categorical_features.remove('Attrition')

# A view of categorical features
print('\033[1m' +'categorical features: ', '\033[0m',categorical_features)
print('='*100)
# List of all variables that are of 'float' or 'int' type
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
print('\033[1m' +'numerical features: ', '\033[0m',numerical_features)
print('='*100)

# These lists shall be expanded after categorical encoding with categorical_features
ordinal_features = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel','BusinessTravel', 'NumCompaniesWorked']
nominal_features = [i for i in categorical_features if i not in ordinal_features]
common = list(set(ordinal_features).intersection(set(numerical_features)))
actual_numerical = [i for i in numerical_features if i not in common]
print('\033[1m' +'ordinal features: ', '\033[0m',ordinal_features) #
print('='*100)
print('\033[1m' +'nominal features: ', '\033[0m',nominal_features) #
print('='*100)
print('\033[1m' +'actual numerical: ', '\033[0m',actual_numerical) #

In [None]:
# If we recall, all the variables with missing values were of numerical types,
# Let's view them once again to ensure the actual_numerical dtypes captures all of them
observe_data(df[actual_numerical])

In [None]:
# The missing values are found in 'Age', 'DailyRate', 'HourlyRate', 'MonthlyIncome' and 'MonthlyRate'; all are numerical type
df.loc[df.isna().any(axis=1),numerical_features]

## Preprocessing pipeline, model fitting and model interpretation

In [None]:
# Scikit-learn libraries
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier, VotingClassifier
from sklearn.metrics import recall_score, classification_report, confusion_matrix, roc_curve
from sklearn import tree
from sklearn.linear_model import LogisticRegression

# For oversampling through imbalance-learn
from imblearn.pipeline import make_pipeline, Pipeline
from imblearn.combine import SMOTETomek

# For data processing through feature-engine (module paths as in feature-engine releases before 1.0)
from feature_engine.variable_transformers import YeoJohnsonTransformer
from feature_engine.missing_data_imputers import MeanMedianImputer
from feature_engine.categorical_encoders import WoERatioCategoricalEncoder

# For visualizing trees
from graphviz import Source
from IPython.display import SVG, Image

# Model Interpretation
import eli5
import shap

In [None]:
shap.initjs()

### Preprocessing Pipeline

In [None]:
# Data preprocessing pipeline
## Train-Test split
X = df.drop(labels='Attrition', axis=1)
y = df['Attrition'].map({'Voluntary Resignation':1, 'Current employee':0})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

## Imputing numerical missing values from their respective 'mean'/ 'median'
impute = MeanMedianImputer(imputation_method='median', variables=actual_numerical)

## Transforming nominal categorical features based on their probability ratio (ordinal features are already transformed)
transform_nominal = WoERatioCategoricalEncoder(encoding_method='ratio', variables=nominal_features)

## Pre-processing pipeline
preprocessor = make_pipeline(impute, transform_nominal)

## Transforming the variable
X_train = preprocessor.fit_transform(X_train, y_train)
X_test = preprocessor.transform(X_test)

## Oversampling using SMOTE and cleaning the class boundary using Tomek links
smotetomec = SMOTETomek(random_state=0)
X_sample, y_sample = smotetomec.fit_resample(X_train, y_train)

### Random Forest Classifier

In [None]:
## Instantiating Random Forest classifier
classifier = RandomForestClassifier(max_depth=5, min_samples_leaf=100, class_weight={1:1.5}, random_state=0)

## Parameters grid
parameter_grid = {
    'criterion': ['entropy', 'gini'], 
    'min_samples_leaf': [10, 50, 80], 
    'max_depth':[2,3,4,5]} #

## Instantiating and fitting GridsearchCV to the train set
gscvrf = GridSearchCV(estimator=classifier, param_grid=parameter_grid, cv=10, scoring='f1_weighted', verbose=False)
gscvrf.fit(X_sample, y_sample)

## Predicting test outcomes
y_pred = gscvrf.predict(X_test)

## confusion matrix
cf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['current_employee', 'resigned'], columns=['current_employee', 'resigned'])[::-1].T[::-1]

In [None]:
# Model Evaluation
print(pd.Series(gscvrf.best_params_))
print('='*40)
print('recall score: %.3f' % recall_score(y_test, y_pred))
print('='*40)
print(classification_report(y_test, y_pred))
print('='*40)
print(cf)

In [None]:
# ROC Curve
y_pred_train_prob = gscvrf.predict_proba(X_train)[:,1]
y_pred_test__prob = gscvrf.predict_proba(X_test)[:,1]

fp_rate_train, tp_rate_train, thresh1 = roc_curve(y_train, y_pred_train_prob)
fp_rate_test, tp_rate_test, thresh2 = roc_curve(y_test, y_pred_test__prob)

plt.figure(figsize=(8,8))
plt.plot(fp_rate_train, tp_rate_train, label='train')
plt.plot(fp_rate_test, tp_rate_test, label='test')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve', fontweight='semibold')
plt.legend(loc='center left', bbox_to_anchor=(1.,.5), frameon=False)
plt.grid()
plt.show()

In [None]:
# Let's interpret mean weightage of each feature in the model, globally
eli5.show_weights(gscvrf.best_estimator_, feature_names=X_train.columns.tolist())

In [None]:
# We can also view the feature contributions for a single sampled observation, i.e., a local interpretation
eli5.show_prediction(estimator=gscvrf.best_estimator_, doc=X_test.sample(), feature_names=X_train.columns.tolist(), show_feature_values=True)

In [None]:
observations = shap.sample(X_test)
explainer_rf = shap.TreeExplainer(gscvrf.best_estimator_)

shap_vals_rf = explainer_rf.shap_values(observations)

In [None]:
shap.summary_plot(shap_values=shap_vals_rf, features=observations)

The collective ```force_plot``` shows the same information as above, but breaks it down for each observation. It is also possible to hold a particular feature constant and view the interaction.

In [None]:
shap.force_plot(base_value=explainer_rf.expected_value[1], shap_values=shap_vals_rf[1], features=observations, feature_names=X_test.columns.tolist())

In [None]:
joblib.dump(value=gscvrf, filename=os.path.join(PATH, 'randomforest.pkl'))

### Adaptive Boosting Classifier

In [None]:
## Instantiating Decision Tree as base classifier
base_classifier = tree.DecisionTreeClassifier(max_depth=5, min_impurity_decrease=0.001, class_weight={1:1.5})

## Instantiating Adaptive Boosting as meta classifier
meta_classifier = AdaBoostClassifier(learning_rate=0.1, random_state=0, base_estimator=base_classifier)

## Parameters grid
parameter_grid = {
    'n_estimators': [i for i in range(20,50,10)], 
    'learning_rate': [i for i in np.linspace(start=0.1, stop=0.25, num=5)]}

## Instantiating and fitting GridsearchCV to the train set
gscvab = GridSearchCV(estimator=meta_classifier, param_grid=parameter_grid, cv=10, n_jobs=-1, scoring='f1_weighted', verbose=False)
gscvab.fit(X_sample, y_sample)

## Predicting test outcomes
y_pred = gscvab.predict(X_test)

## confusion matrix
cf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['current_employee', 'resigned'], columns=['current_employee', 'resigned'])[::-1].T[::-1]

In [None]:
# Model Evaluation
print(pd.Series(gscvab.best_params_))
print('='*40)
print('recall score: %.3f' % recall_score(y_test, y_pred))
print('='*40)
print(classification_report(y_test, y_pred))
print('='*40)
print(cf)

In [None]:
# ROC Curve
y_pred_train_prob = gscvab.predict_proba(X_train)[:,1]
y_pred_test__prob = gscvab.predict_proba(X_test)[:,1]

fp_rate_train, tp_rate_train, thresh1 = roc_curve(y_train, y_pred_train_prob)
fp_rate_test, tp_rate_test, thresh2 = roc_curve(y_test, y_pred_test__prob)

plt.figure(figsize=(8,8))
plt.plot(fp_rate_train, tp_rate_train, label='train')
plt.plot(fp_rate_test, tp_rate_test, label='test')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve', fontweight='semibold')
plt.legend(loc='center left', bbox_to_anchor=(1.,.5), frameon=False)
plt.grid()
plt.show()

In [None]:
# # Let's interpret mean weightage of each feature in the model, globally
# eli5.show_weights(gscvab.best_estimator_, feature_names=X_train.columns.tolist())

# observations = X_test.sample(100)
# explainer_rf = shap.BruteForceExplainer(gscvab.best_estimator_.predict, data=observations)

# shap_vals_rf = explainer_rf.shap_values(observations)

# shap.summary_plot(shap_values=shap_vals_rf, features=X_test)

In [None]:
# Model Interpretation
pd.Series(gscvab.best_estimator_.feature_importances_, index=X_train.columns.tolist()).sort_values(ascending=False).plot(kind='bar', figsize=(10,6), title='Feature Importance');

The collective ```force_plot``` would show the same information as above, broken down for each observation; the corresponding cell is left commented out for this model.

In [None]:
# shap.force_plot(base_value=explainer_rf.expected_value[1], shap_values=shap_vals_rf[1], features=observations, feature_names=X_test.columns.tolist())

In [None]:
joblib.dump(value=gscvab, filename=os.path.join(PATH, 'adaboost.pkl'))

### Gradient Boosting Classifier

In [None]:
## Instantiating Gradient Boosting classifier
classifier = GradientBoostingClassifier(n_estimators=30, min_samples_leaf=50, min_impurity_decrease=0.02, random_state=0, max_features='auto', n_iter_no_change=3)

## Parameters grid
parameter_grid = {
    'n_estimators': [20, 30, 40],
    'max_depth': [3,4,5], 
    'learning_rate': [i for i in np.linspace(start=0.1, stop=0.5, num=5)], 
    'loss': ['deviance', 'exponential']}

## Instantiating and fitting GridsearchCV to the train set
gscvgb = GridSearchCV(estimator=classifier, param_grid=parameter_grid, cv=10, scoring='f1_weighted', verbose=False)
gscvgb.fit(X_sample, y_sample)

## Predicting test outcomes
y_pred = gscvgb.predict(X_test)

## confusion matrix
cf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['current_employee', 'resigned'], columns=['current_employee', 'resigned'])[::-1].T[::-1]

In [None]:
# Model Evaluation
print(pd.Series(gscvgb.best_params_))
print('='*40)
print('recall score: %.3f' % recall_score(y_test, y_pred))
print('='*40)
print(classification_report(y_test, y_pred))
print('='*40)
print(cf)

In [None]:
# ROC Curve
y_pred_train_prob = gscvgb.predict_proba(X_train)[:,1]
y_pred_test__prob = gscvgb.predict_proba(X_test)[:,1]

fp_rate_train, tp_rate_train, thresh1 = roc_curve(y_train, y_pred_train_prob)
fp_rate_test, tp_rate_test, thresh2 = roc_curve(y_test, y_pred_test__prob)

plt.figure(figsize=(8,8))
plt.plot(fp_rate_train, tp_rate_train, label='train')
plt.plot(fp_rate_test, tp_rate_test, label='test')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve', fontweight='semibold')
plt.legend(loc='center left', bbox_to_anchor=(1.,.5), frameon=False)
plt.grid()
plt.show()

In [None]:
# Let's interpret mean weightage of each feature in the model, globally
eli5.show_weights(gscvgb.best_estimator_, feature_names=X_train.columns.tolist())

In [None]:
# We can also view the feature contributions for a single sampled observation, i.e., a local interpretation
eli5.show_prediction(estimator=gscvgb.best_estimator_, doc=X_test.sample(), feature_names=X_train.columns.tolist(), show_feature_values=True)

In [None]:
observations = shap.sample(X_test)
explainer_gb  = shap.TreeExplainer(gscvgb.best_estimator_)

shap_vals_gb = explainer_gb.shap_values(observations)

In [None]:
shap.summary_plot(shap_values=shap_vals_gb, features=observations)

The collective ```force_plot``` shows the same information as above, but breaks it down for each observation. It is also possible to hold a particular feature constant and view the interaction.

In [None]:
shap.force_plot(base_value=explainer_gb.expected_value, shap_values=shap_vals_gb, features=observations, feature_names=X_test.columns.tolist())

In [None]:
joblib.dump(value=gscvgb, filename=os.path.join(PATH, 'gradientboost.pkl'))

### Stacking Classifier

Intuitively, the performance of random forest classifier is good, that of adaptive boosting is better and that of gradient boosting is the best. Does it imply we should simply go for gradient boosting algorithm? Let's find out...

In [None]:
## Stacking Classifier Steps
### Extracting the best estimator from each model into a list
random_forest = gscvrf.best_estimator_
adaptive_boost= gscvab.best_estimator_
gradient_boost= gscvgb.best_estimator_

classifier_list = [('random_forest',random_forest), 
                   ('adaptive_boost',adaptive_boost), 
                   ('gradient_boost',gradient_boost)]

# ### Declaring meta classifier
m_classifier = LogisticRegression()

# ### Instantiating Stacking Classifier
stack = StackingClassifier(estimators=classifier_list, final_estimator=m_classifier)
stack.fit(X_sample, y_sample)

# Predicting test outcomes
y_pred = stack.predict(X_test)

# ## confusion matrix
cf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['current_employee', 'resigned'], columns=['current_employee', 'resigned'])[::-1].T[::-1]

In [None]:
# Model Evaluation
print(classification_report(y_test, y_pred))
print('='*80)
print(cf)

In [None]:
# ROC Curve
y_pred_train_prob = stack.predict_proba(X_train)[:,1]
y_pred_test__prob = stack.predict_proba(X_test)[:,1]

fp_rate_train, tp_rate_train, thresh1 = roc_curve(y_train, y_pred_train_prob)
fp_rate_test, tp_rate_test, thresh2 = roc_curve(y_test, y_pred_test__prob)

plt.figure(figsize=(8,8))
plt.plot(fp_rate_train, tp_rate_train, label='train')
plt.plot(fp_rate_test, tp_rate_test, label='test')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve', fontweight='semibold')
plt.legend(loc='center left', bbox_to_anchor=(1.,.5), frameon=False)
plt.grid()
plt.show()

In [None]:
# Global Interpretation of the Meta Model of Stacking Classifier
eli5.show_weights(stack.final_estimator_, feature_names=['Random_Forest', 'Adaptive_Boost', 'Gradient_Boost'])

In [None]:
# How is this particular observation scored?
eli5.show_prediction(estimator=stack.final_estimator_, doc=stack.transform(X_test)[0], feature_names=['Random_Forest', 'Adaptive_Boost', 'Gradient_Boost'], show_feature_values=True)

In [None]:
explainer = shap.LinearExplainer(model=stack.final_estimator_, data=stack.transform(X_test), nsamples=100)

observations = stack.transform(X_test.sample(1000, random_state=0))
shap_values = explainer.shap_values(observations)

In [None]:
shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values, features=observations, feature_names=['Random_Forest', 'Adaptive_Boost', 'Gradient_Boost'])

In [None]:
shap.summary_plot(shap_values=shap_values, features=observations, feature_names=['Random_Forest', 'Adaptive_Boost', 'Gradient_Boost'])

It turns out that the stacking classifier performed even better. What's happening? Looking at the weights, the adaptive boosting classifier holds a higher weight than the gradient boosting classifier. Observing the summary plot above, it is apparent that a few observations in gradient boosting hold very high SHAP values (and thereby have a high impact on the model output), while most observations hold low to medium SHAP values. On the other hand, the adaptive boosting classifier has a greater proportion of high SHAP values, indicating consistency in model prediction.

### Voting Classifier

Final verdict! We will now run our final ensemble classifier. This classifier will combine all four classifiers above and apply a 'soft' voting mechanism.
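As a rough illustration of what 'soft' voting means, the sketch below averages the class probabilities of several classifiers and picks the class with the highest average. The function name and the probabilities are hypothetical; they are chosen so that soft voting and a simple majority ('hard') vote disagree.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities across classifiers and pick the best class."""
    n_classifiers = len(probabilities)
    weights = weights or [1.0] * n_classifiers
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted


# Hypothetical [P(stay), P(resign)] from four classifiers for one employee
probs = [[0.90, 0.10], [0.45, 0.55], [0.45, 0.55], [0.45, 0.55]]

averaged, soft_prediction = soft_vote(probs)

# Hard voting for comparison: each classifier casts one vote for its top class
votes = [max(range(2), key=lambda c: p[c]) for p in probs]
hard_prediction = max(set(votes), key=votes.count)

print(averaged, soft_prediction, hard_prediction)
```

Here three of the four classifiers lean slightly towards "resign", so a majority vote predicts class 1, whereas the one very confident "stay" prediction pulls the averaged probability towards class 0.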

In [None]:
# Classifier list: this time we'll also add stacking
classifier_list = [('random_forest',random_forest), 
                   ('adaptive_boost',adaptive_boost), 
                   ('gradient_boost',gradient_boost), 
                   ('stacking', stack)]

# Instantiating Voting Classifier
vote = VotingClassifier(estimators=classifier_list, voting='soft')
vote.fit(X_sample, y_sample)

## Predicting test outcomes
y_pred = vote.predict(X_test)

# ## confusion matrix
cf = pd.DataFrame(confusion_matrix(y_test, y_pred), index=['current_employee', 'resigned'], columns=['current_employee', 'resigned'])[::-1].T[::-1]

In [None]:
# Model Evaluation
print(classification_report(y_test, y_pred))
print('='*40)
print(cf)

In [None]:
# ROC Curve
y_pred_train_prob = vote.predict_proba(X_train)[:,1]
y_pred_test__prob = vote.predict_proba(X_test)[:,1]

fp_rate_train, tp_rate_train, thresh1 = roc_curve(y_train, y_pred_train_prob)
fp_rate_test, tp_rate_test, thresh2 = roc_curve(y_test, y_pred_test__prob)

plt.figure(figsize=(8,8))
plt.plot(fp_rate_train, tp_rate_train, label='train')
plt.plot(fp_rate_test, tp_rate_test, label='test')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve', fontweight='semibold')
plt.legend(loc='center left', bbox_to_anchor=(1.,.5), frameon=False)
plt.grid()
plt.show()

## Conclusion

From the above models, we observe that features like age, overtime, compensation level (daily rate, monthly rate, hourly rate, monthly income) and distance from home play a major role in prediction, whereas job role, marital status, gender, employee source etc. are of little importance in predicting the outcome.

Things to consider: 
* Certain age groups are more likely to quit than others.
* With more overtime, the chances of an employee quitting also increase. This is unsurprising: employees are already clocking the standard 80 work hours per week (a feature that was dropped), and anything beyond that might cause burnout.
* Although the distributions of the compensation variables, *viz*., HourlyRate, DailyRate and MonthlyRate, are similar for current and resigned employees (as depicted in the following box plot), the interaction of these variables with other variables might explain the outcome better (Adaptive boosting classifier).

In [None]:
compensation = pd.concat([X_train['DailyRate'], y_train], axis=1)

sns.boxplot(data=compensation, y='DailyRate', x='Attrition');