# HR Analytics

## Discovering the factors that influence employee turnover

## Model generation and comparison
***

# Table of Contents

### I. [Data Preprocessing](#preprocess)

### II. Model Training

1. [Logistic Regression (as baseline)](#logistic-reg)
2. [Ridge Regression (Classifier)](#ridge-reg)
3. [SVM](#svm)
4. [Random Forest](#rf)
5. [Gradient Boosting](#gb)

In [1]:
__author__ = "Vita Levytska"
__email__ = "levytska.vita@gmail.com"

## Load Packages

In [2]:
import pandas as pd
import numpy as np

## Read and Display Data

In [3]:
def read_data(file):
    return pd.read_csv(file)

def drop_columns(df, col_list):
    data = df.drop(col_list, axis=1)
    return data

In [4]:
df = read_data("WA_Fn-UseC_-HR-Employee-Attrition.csv")
pd.set_option('display.max_columns', None)
df.head(7)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2
5,32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,4,Male,79,3,1,Laboratory Technician,4,Single,3068,11864,0,Y,No,13,3,3,80,0,8,2,2,7,7,3,6
6,59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,3,Female,81,4,1,Laboratory Technician,1,Married,2670,9964,4,Y,Yes,20,4,1,80,3,12,3,2,1,0,0,0


## Drop Columns

In [5]:
col_list = ['EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18']
df = drop_columns(df, col_list)

## Check if dataset is balanced

In [6]:
attrition_rate = df["Attrition"].value_counts() / len(df)
attrition_rate

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

Since our dataset is imbalanced (about 16% of employees left), we will oversample the minority class in the training set using SMOTE.

Original shape: (11999, 18) (11999,)
Upsampled shape: (18284, 18) (18284,)
SMOTE sample shape: (18284, 18) (18284,)
Downsampled shape: (5714, 18) (5714,)

<a id='preprocess'></a>
## Data Preprocessing

### Encode categorical variables

Ordinal variables are already label-encoded in the data, so they need no further encoding.

The nominal (text-valued) variables still have to be dummy-encoded.

In [7]:
# split into features and target
features_df = df.drop(columns = ['Attrition'])

df['Attrition'] = df['Attrition'].map({'Yes':1, 'No':0 })
target = np.array(df['Attrition'])

# dummy encode variables
features_df = pd.get_dummies(features_df, drop_first = True)
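
As a small illustration of what `drop_first = True` does, the sketch below encodes a toy frame. The three rows are made up for the example and only borrow column names from the dataset; each nominal column with k levels ends up as k-1 indicator columns.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one numeric and two nominal columns.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Gender": ["Female", "Male", "Male"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
})

# drop_first=True removes the first level (in sorted order) of each nominal
# column, so Gender keeps one indicator and BusinessTravel keeps two.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped level acts as the reference category, which avoids perfectly collinear indicator columns in the linear models trained later.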

## Train / test split

In [8]:
from sklearn.model_selection import train_test_split
feat_train, feat_test, resp_train, resp_test = train_test_split(features_df, target, test_size=0.25)

In [9]:
len(resp_train)

1102
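
The split above is unseeded and not stratified, so the share of leavers in the test set can vary from run to run. A minimal sketch of a reproducible, stratified alternative follows; the synthetic `X` and `y` are stand-ins for the real features and target, and `stratify` and `random_state=42` are assumptions rather than settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 16% positives, similar to the attrition rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 160 + [0] * 840)

# stratify=y keeps the class proportions equal in both parts of the split;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```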

## Resampling using SMOTE

In [10]:
from imblearn.over_sampling import SMOTE 

sm = SMOTE()
feat_train_sampled, resp_train_sampled = sm.fit_resample(feat_train, resp_train)

In [11]:
len(resp_train_sampled)

1868
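
The training set grows from 1102 to 1868 rows because SMOTE adds synthetic minority-class rows until both classes are the same size. The sketch below shows the core interpolation idea in plain NumPy. It is a simplified illustration, not the `imblearn` implementation, and the helper name `smote_like` is hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among the minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random minority row and one of its k nearest minority neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies between two real minority rows, resampling is applied to the training split only and the test split keeps real observations.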

## Models to be trained

1. [Logistic Regression (as baseline)](#logistic-reg)
2. [Ridge Regression (Classifier)](#ridge-reg)
3. [SVM](#svm)
4. [Random Forest](#rf)
5. [Gradient Boosting](#gb)

## Model Performance Measure

To measure and compare the performance of our models we will use **Recall**, because we want to know how many of the employees who quit their job we labeled correctly. If we can identify the employees who are likely to leave the company, we can target them with specific programs or training to prevent their attrition. The cost of targeting an employee who was not actually going to leave is lower than the cost of finding and training a replacement for an at-risk employee we failed to identify.
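
For reference, recall is the share of actual leavers that the model catches: TP / (TP + FN). A minimal pure-Python sketch on made-up labels (the `recall` helper and the label lists are illustrative only):

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Four employees left (1); the model catches two of them.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(recall(y_true, y_pred))  # 0.5
```

In this toy case accuracy is 0.7 even though half of the leavers are missed, which is why accuracy alone can look better than the model is on an imbalanced target.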

<a id='logistic-reg'></a>
## 1. Logistic Regression

Logistic regression serves as the baseline because linear and logistic regression are commonly used by companies that do not have data scientists and fit these models in Excel. This analysis demonstrates a few other models that perform better than logistic regression.

In [12]:
from sklearn.linear_model import LogisticRegressionCV
#from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score

# fit the model
lg = LogisticRegressionCV(cv=10, random_state=42, max_iter = 10000).fit(feat_train_sampled, resp_train_sampled)

#calculate and print recall score
lg_recall = recall_score(resp_test, lg.predict(feat_test))
print("Recall score for Logistic Regression is %2.2f" % lg_recall)
# lg_acc = lg.score(feat_test, resp_test)
# lg_auc = roc_auc_score(resp_test, lg.predict(feat_test))

Recall score for Logistic Regression is 0.43


In [13]:
# print("Logistic Regression mean accuracy on the given test data and labels is %2.2f" % lg_acc)
# print ("Logistic Regression AUC is %2.2f" % lg_auc)

In [14]:
from sklearn.metrics import classification_report
print(classification_report(resp_test, lg.predict(feat_test)))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87       299
           1       0.44      0.43      0.44        69

    accuracy                           0.79       368
   macro avg       0.66      0.65      0.65       368
weighted avg       0.79      0.79      0.79       368



<a id='ridge-reg'></a>
## 2. Ridge Regression Classifier

In [15]:
from sklearn.linear_model import RidgeClassifier

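# fit the model
# note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older release
rc = RidgeClassifier(alpha=18, fit_intercept=True, normalize=True, copy_X=True, max_iter=None, tol=0.001, 
                     class_weight=None, solver='auto', random_state=42)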
rc.fit(feat_train_sampled, resp_train_sampled)

#calculate and print recall score
rc_recall = recall_score(resp_test, rc.predict(feat_test))
print("Recall score for Ridge Regression Classifier is %2.2f" % rc_recall)

Recall score for Ridge Regression Classifier is 0.46
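
On scikit-learn 1.2 and later, `RidgeClassifier` no longer accepts `normalize`. One possible substitute is to standardize the features in a pipeline, sketched below on synthetic data. This is an assumption about how one might adapt the cell, it is not numerically identical to `normalize=True`, and the recall it gives will differ from the value reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the resampled training data and the test data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Scaling happens inside the pipeline, so it is fitted on the training part only.
scaled_ridge = make_pipeline(StandardScaler(), RidgeClassifier(alpha=18, random_state=42))
scaled_ridge.fit(X_tr, y_tr)

ridge_recall = recall_score(y_te, scaled_ridge.predict(X_te))
print("Recall on the synthetic hold-out set: %2.2f" % ridge_recall)
```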


<a id='svm'></a>
## 3. SVM

The following parameters were manually tuned: C, kernel and gamma. 

In [16]:
from sklearn.svm import SVC

svm_model = SVC(C=0.01, kernel='poly', degree=2, gamma='scale', coef0=0.0, shrinking=True, probability=False,
                tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1,
                decision_function_shape='ovr', break_ties=False, random_state=42)
svm_model.fit(feat_train_sampled, resp_train_sampled)

#calculate and print recall score
svm_recall = recall_score(resp_test, svm_model.predict(feat_test))
print("Recall score for SVM is %2.2f" % svm_recall)

Recall score for SVM is 0.96
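
A recall this close to 1.0 can also arise when a model labels almost every employee as a leaver, so it may be worth reading it next to precision. The arithmetic below uses the test-set class counts from the classification report above (299 stayers, 69 leavers) and a hypothetical classifier that predicts "leaves" for everyone. It says nothing about what the SVM actually predicted.

```python
# Class counts taken from the classification report above (test split).
n_negative, n_positive = 299, 69

# Hypothetical classifier that labels every employee as a leaver.
tp, fp, fn = n_positive, n_negative, 0

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=1.00, precision=0.19
```

Calling `classification_report(resp_test, svm_model.predict(feat_test))`, as was done for logistic regression, would show where the SVM sits between these extremes.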


In [43]:
from sklearn.base import clone
from sklearn.model_selection import KFold

recall_score_list = []

kf = KFold(n_splits=10, shuffle=True, random_state=40)

for train_index, test_index in kf.split(feat_train_sampled):
    # positional indexing works whether or not the frame has a default index
    X_train, X_test = feat_train_sampled.iloc[train_index], feat_train_sampled.iloc[test_index]
    y_train, y_test = resp_train_sampled[train_index], resp_train_sampled[test_index]

    # fit a fresh copy on each fold so the model trained above is left untouched
    fold_model = clone(svm_model).fit(X_train, y_train)
    predictions = fold_model.predict(X_test)
    recall_score_list.append(recall_score(y_test, predictions))

# print(recall_score_list)
print("Mean recall score after 10-fold cross validation is %2.2f" % np.mean(recall_score_list))

Mean recall score after 10-fold cross validation is 0.96
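
The manual loop above can also be written with scikit-learn's built-in helpers. The sketch below uses `StratifiedKFold` and `cross_val_score` on synthetic data; both the data and the choice of stratified folds are assumptions for illustration. Note that cross-validating on SMOTE-resampled rows, as the loop does, lets synthetic points derived from a validation row appear in the training folds, so the resulting score may be optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=40)
scores = cross_val_score(SVC(kernel="poly", degree=2, gamma="scale"), X, y, cv=cv, scoring="recall")
print("Mean recall over %d folds: %2.2f" % (len(scores), scores.mean()))
```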


In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create the random grid (SVC hyperparameters; the candidate values are illustrative)
random_grid_svm = {'C': [0.01, 0.1, 1, 10],
                   'kernel': ['poly', 'rbf', 'sigmoid'],
                   'gamma': ['scale', 'auto']}

svm_random = RandomizedSearchCV(estimator = svm_model, param_distributions = random_grid_svm, n_iter = 20, cv = 3, 
                                verbose = 4, random_state=42)
svm_random.fit(feat_train_sampled, resp_train_sampled)

#select best model
svm_random.best_estimator_

<a id='rf'></a>
## 4. Random Forest

First, we tune hyperparameters by hand to see which settings yield better results, then use random search to select the best model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth = 1, max_features = None, min_samples_leaf = 2 ,n_estimators = 1000, oob_score = True, random_state = 42)
rf = rf.fit(feat_train_sampled, resp_train_sampled)
rf_recall = recall_score(resp_test, rf.predict(feat_test))
print("Recall score for Random Forest is %2.2f" % rf_recall)

In [None]:
# Create the random grid
random_grid_rf = {'n_estimators': [100, 250, 500, 750, 1000, 1250],
               'max_features': [None, 0.9],
               'max_depth': [1,2,3,4,5],
               'min_samples_leaf': [1,2,3]}

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid_rf, n_iter = 100, cv = 3, verbose = 4,
                               random_state=42)
rf_random.fit(feat_train_sampled, resp_train_sampled)

#select best model
rf_random.best_estimator_
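
By default `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since this analysis compares models on recall, one option is to pass `scoring="recall"`. The sketch below shows this on synthetic data with a deliberately small grid; the data and grid values are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the resampled training data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

small_grid = {"n_estimators": [10, 20],
              "max_depth": [1, 2, 3],
              "min_samples_leaf": [1, 2, 3]}

# scoring="recall" makes the search pick the candidate with the best mean recall.
search = RandomizedSearchCV(RandomForestClassifier(random_state=42), small_grid,
                            n_iter=5, cv=3, scoring="recall", random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```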

In [None]:
best_rf = RandomForestClassifier(max_depth=5, max_features=0.9, n_estimators=500,
                       oob_score=True, random_state=42)
best_rf = best_rf.fit(feat_train_sampled, resp_train_sampled)
best_rf_recall = recall_score(resp_test, best_rf.predict(feat_test))
print("Recall score for tuned Random Forest is %2.2f" % best_rf_recall)

<a id='gb'></a>
## 5. Gradient Boosting

In [45]:
import xgboost as xgb

xgb.XGBClassifier(colsample_bytree=0.5, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1,
                  n_estimators=100, subsample=0.5)

In [46]:
xgb.XGBClassifier(colsample_bytree=0.5, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1,
                  n_estimators=100, subsample=0.5)

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_method=None,
              validate_parameters=None, verbosity=None)
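
The cells above construct the classifier but do not fit or score it. A minimal sketch of the fit-and-evaluate step follows, using scikit-learn's `GradientBoostingClassifier` as a stand-in for `xgb.XGBClassifier` and synthetic data in place of the attrition features. Both substitutions are assumptions made so the example is self-contained, and the printed recall is unrelated to the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the resampled training data and the test data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Settings mirror the ones above where scikit-learn has an equivalent parameter.
gb = GradientBoostingClassifier(learning_rate=0.01, max_depth=3, n_estimators=100,
                                subsample=0.5, random_state=42)
gb.fit(X_tr, y_tr)

gb_recall = recall_score(y_te, gb.predict(X_te))
print("Recall on the synthetic hold-out set: %2.2f" % gb_recall)
```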

In [None]:
best_model.fit(X_train_2, y_train_2, early_stopping_rounds=10, eval_set=eval_set, verbose=True)

In [None]:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid_rf, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(feat_train_sampled, resp_train_sampled)

In [None]:
# Define Parameters
param_grid = {"max_depth": [2,3,10],
              "max_features" : [1.0,0.3,0.1],
              "min_samples_leaf" : [3,5,9],
              "n_estimators": [50,100,300],
              "learning_rate": [0.05,0.1,0.02,0.2]}

In [None]:
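# Perform Grid Search CV (assumes a scikit-learn GradientBoostingClassifier, whose parameter names match param_grid)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
gs_cv = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid=param_grid, cv=3, verbose=10, n_jobs=-1).fit(feat_train_sampled, resp_train_sampled)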

In [None]:
# Best hyperparmeter setting
gs_cv.best_estimator_

In [None]:
# Use our best model parameters found by GridSearchCV
# (a classifier rather than a regressor, since Attrition is a binary target;
# arguments that XGBoost does not define, such as max_features and min_samples_leaf, are dropped)
best_model = xgb.XGBClassifier(booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0,
                               learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1,
                               n_estimators=300, n_jobs=1, objective='binary:logistic', random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1)