Human Resource Analytics 


Reference: 
https://www.kaggle.com/ludobenistant/hr-analytics/data

Objective:

The objective is to analyze the data and predict which valuable employees are likely to leave next.

Fields in the dataset include:

Satisfaction Level

Last evaluation

Number of projects

Average monthly hours

Time spent at the company

Whether they have had a work accident

Whether they have had a promotion in the last 5 years

Department (stored in the column named sales)

Salary

Whether the employee has left

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

%matplotlib inline 
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

In [2]:
#Import Csv 
data=pd.read_csv("HR_comma_sep.csv")
data.head()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [3]:
#Dataset Info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [4]:
#Datatypes 
data.dtypes

satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
sales                     object
salary                    object
dtype: object

The sales and salary columns have dtype object. To fit machine learning models on this dataset, we first need to convert these categorical variables to numeric ones.
"pd.get_dummies" does this by creating one 0/1 indicator column per category.

In [5]:
numeric_data_sales = pd.get_dummies(data['sales'], prefix='sales')
del data['sales'] 

In [6]:
numeric_data = data.join(numeric_data_sales)

In [7]:
numeric_data_salary = pd.get_dummies(numeric_data['salary'], prefix='salary')
del numeric_data['salary'] 

In [8]:
data_final = numeric_data.join(numeric_data_salary)

In [9]:
data_final.head()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales_IT,sales_RandD,...,sales_hr,sales_management,sales_marketing,sales_product_mng,sales_sales,sales_support,sales_technical,salary_high,salary_low,salary_medium
0,0.38,0.53,2,157,3,0,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.8,0.86,5,262,6,0,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.11,0.88,7,272,4,0,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.72,0.87,5,223,5,0,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.37,0.52,2,159,3,0,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [10]:
data_final.dtypes

satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
sales_IT                 float64
sales_RandD              float64
sales_accounting         float64
sales_hr                 float64
sales_management         float64
sales_marketing          float64
sales_product_mng        float64
sales_sales              float64
sales_support            float64
sales_technical          float64
salary_high              float64
salary_low               float64
salary_medium            float64
dtype: object
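
The encoding cells above can also be written as a single call. The following is a minimal sketch on a made-up frame (the name `toy` and its values are illustrative and not taken from the dataset); it assumes a pandas version in which `pd.get_dummies` accepts the `columns` and `dtype` parameters.

```python
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "sales": ["sales", "IT", "hr"],
    "salary": ["low", "medium", "high"],
})

# Encode both categorical columns in one call; the original columns are dropped
# and each category becomes its own 0/1 column, prefixed with the column name.
encoded = pd.get_dummies(toy, columns=["sales", "salary"], dtype=int)

print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the indicator columns as integers rather than floats or booleans, which some versions of pandas produce by default.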

In [11]:
X = data_final.drop('left', axis=1)
y = data_final['left']

We need to find out which independent variables contribute most to attrition.
Recursive Feature Elimination (RFE) does this by repeatedly fitting the model and dropping the weakest feature until the requested number of features remains.

In [12]:
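logreg = LogisticRegression()
# Keep the 15 strongest features; recent scikit-learn versions require the keyword form
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(X, y)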

In [13]:
#True marks the independent variables that RFE kept
print(rfe.support_)

[ True  True  True False  True  True  True False  True  True  True  True
 False False False  True  True  True  True  True]


In [14]:
print(rfe.ranking_)

[1 1 1 6 1 1 1 4 1 1 1 1 2 5 3 1 1 1 1 1]
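
The boolean mask and the ranking are easier to read when paired with column names. The following is a minimal sketch on synthetic data (the frame `X_demo` and the column names `f0` to `f5` are hypothetical); the same indexing would apply to the notebook's `X` and `rfe`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y (column names f0..f5 are hypothetical).
features, target = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

# Keep the 3 features that recursive elimination ranks highest.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, target)

# support_ is a boolean mask aligned with the columns;
# ranking_ is 1 for every kept feature and larger for features dropped earlier.
selected = X_demo.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X_demo.columns).sort_values()

print(selected)
print(ranking)
```

Indexing the columns with `support_` avoids typing the selected names by hand, which is error-prone with 20 columns.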


In [15]:
X.head()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,sales_IT,sales_RandD,sales_accounting,sales_hr,sales_management,sales_marketing,sales_product_mng,sales_sales,sales_support,sales_technical,salary_high,salary_low,salary_medium
0,0.38,0.53,2,157,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.8,0.86,5,262,6,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.11,0.88,7,272,4,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.72,0.87,5,223,5,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.37,0.52,2,159,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [16]:
Data_vars=['satisfaction_level','last_evaluation','number_project','time_spend_company','Work_accident','promotion_last_5years','sales_RandD','sales_accounting','sales_hr','sales_management','sales_support','sales_technical','salary_high','salary_low','salary_medium']
X=data_final[Data_vars]


In [17]:
X.head()

,satisfaction_level,last_evaluation,number_project,time_spend_company,Work_accident,promotion_last_5years,sales_RandD,sales_accounting,sales_hr,sales_management,sales_support,sales_technical,salary_high,salary_low,salary_medium
0,0.38,0.53,2,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.8,0.86,5,6,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.11,0.88,7,4,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.72,0.87,5,5,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.37,0.52,2,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Statistical report

In [18]:
import statsmodels.api as sm

logit = sm.Logit(y,X)

In [19]:
result=logit.fit()

Optimization terminated successfully.
         Current function value: 0.431002
         Iterations 7


In [20]:
#statistical report 
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                   left   No. Observations:                14999
Model:                          Logit   Df Residuals:                    14984
Method:                           MLE   Df Model:                           14
Date:                Fri, 16 Feb 2018   Pseudo R-squ.:                  0.2147
Time:                        02:39:53   Log-Likelihood:                -6464.6
converged:                       True   LL-Null:                       -8232.3
                                        LLR p-value:                     0.000
                            coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------
satisfaction_level       -4.1018      0.097    -42.175      0.000        -4.292    -3.911
last_evaluation           1.0412      0.144      7.214      0.000         0.758     1.324
number_p
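
The header figures of the summary can be cross-checked by hand. The sketch below uses only numbers printed above and assumes that the "Current function value" is the average negative log-likelihood per observation and that the reported Pseudo R-squ. is McFadden's, defined as 1 - LL / LL-Null.

```python
# Values copied from the fit log and the summary header above.
n_obs = 14999
mean_neg_loglik = 0.431002   # "Current function value" printed by logit.fit()
ll_null = -8232.3            # LL-Null: log-likelihood of the constant-only model

# Total log-likelihood of the fitted model.
ll_model = -n_obs * mean_neg_loglik

# McFadden's pseudo R-squared compares the fitted model with the null model.
pseudo_r2 = 1 - ll_model / ll_null

print(round(ll_model, 1))    # -6464.6, matching Log-Likelihood in the summary
print(round(pseudo_r2, 4))   # 0.2147, matching Pseudo R-squ. in the summary
```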

For each independent variable, the null hypothesis is that it has no effect on the outcome (its coefficient is 0); a small p-value is evidence against that hypothesis.

The p-values for sales_accounting and sales_support are greater than 0.05, so at the 5% significance level we cannot reject the null hypothesis for these two variables.
We can therefore drop these two columns to simplify the model.
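
The z and P>|z| columns of the summary follow from the coefficient and its standard error. The sketch below is an illustration using the standard library only; the helper name `wald_test` and the second, weak predictor are hypothetical.

```python
import math

def wald_test(coef, std_err):
    """Return the z statistic and two-sided p-value for H0: coefficient = 0."""
    z = coef / std_err
    # Two-sided tail area of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Coefficient and standard error of last_evaluation as printed in the summary.
# The summary rounds std err to 3 decimals, so z differs slightly from the reported 7.214.
z, p = wald_test(1.0412, 0.144)
print(f"z = {z:.2f}, p = {p:.2g}")

# A hypothetical weak predictor: its p-value exceeds 0.05, so H0 is not rejected.
z_weak, p_weak = wald_test(0.10, 0.08)
print(f"z = {z_weak:.2f}, p = {p_weak:.2f}")
```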

Insights:
Salary plays an important role in attrition.
The RandD, HR, Management and Technical departments are statistically significant.
Satisfaction level, number of projects, time spent at the company, promotions and work accidents play a major role in attrition.

In [21]:
X=X.drop(['sales_accounting','sales_support'],axis=1)  

Splitting the data into training and test sets lets us evaluate the model on data it has not seen, which reveals overfitting.
Cross-validation is an alternative.
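
As an illustration of the cross-validation alternative, the following is a minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the notebook's X and y), using `cross_val_score` from `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the other four, giving five accuracy estimates.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(scores.mean(), scores.std())
```

The spread of the five scores gives a rough idea of how much a single train/test split could vary.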

In [22]:
#Splitting the data into training and testing sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

## Model Implementation : Logistic Regression 

In [23]:
#Initiate Classifier 
lgr=LogisticRegression()

In [24]:
#Fit the model on training set 
lgr.fit(X_train,np.ravel(y_train))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [25]:
y_pred_lgr=lgr.predict(X_test)
y_pred_lgr

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [26]:
from sklearn.metrics import accuracy_score

#Accuracy is measured on the held-out test set
a_score_lgr = accuracy_score(y_test,y_pred_lgr)
a_score_lgr=a_score_lgr*100
print("Metric function accuracy for test data :%f" %a_score_lgr)

Metric function accuracy for test data :78.933333


In [27]:
from sklearn.metrics import confusion_matrix

confusion_matrix_lgr=confusion_matrix(y_test,y_pred_lgr)
print(confusion_matrix_lgr)

[[3158  270]
 [ 678  394]]
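
Accuracy alone hides how the model treats each class. The sketch below derives a few metrics from the matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, class 0 first).

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 3158, 270   # employees who stayed
fn, tp = 678, 394    # employees who left

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_left = tp / (tp + fn)       # share of leavers the model catches
precision_left = tp / (tp + fp)    # share of "will leave" predictions that are right
baseline = (tn + fp) / total       # accuracy of always predicting "stays"

print(f"accuracy={accuracy:.4f} recall={recall_left:.4f} "
      f"precision={precision_left:.4f} baseline={baseline:.4f}")
```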


In [28]:
y_pred_lgr

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [29]:
#Probability of each employee in the test set staying (column 0) or leaving (column 1)
y_probablity_lgr =  lgr.predict_proba(X_test)
y_probablity_lgr

array([[ 0.97147317,  0.02852683],
       [ 0.89845336,  0.10154664],
       [ 0.77493933,  0.22506067],
       ..., 
       [ 0.85225064,  0.14774936],
       [ 0.97177051,  0.02822949],
       [ 0.99051128,  0.00948872]])
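
The second column of this array is the predicted probability of leaving, which can be used to rank employees. The following is a minimal sketch using only the six rows visible above; the 0.2 threshold is an arbitrary, hypothetical choice.

```python
import numpy as np

# The six visible rows of the predict_proba output:
# column 0 is P(stays), column 1 is P(leaves).
proba = np.array([
    [0.97147317, 0.02852683],
    [0.89845336, 0.10154664],
    [0.77493933, 0.22506067],
    [0.85225064, 0.14774936],
    [0.97177051, 0.02822949],
    [0.99051128, 0.00948872],
])

p_leave = proba[:, 1]

# Row positions sorted from highest to lowest probability of leaving.
ranked = np.argsort(p_leave)[::-1]

# Rows at or above a hypothetical review threshold.
threshold = 0.2
flagged = np.flatnonzero(p_leave >= threshold)

print(ranked.tolist())    # [2, 3, 1, 0, 4, 5]
print(flagged.tolist())   # [2]
```

On the real data, the positions map back to employees through `X_test.index`.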

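At 78.9%, the logistic regression model is only slightly more accurate than always predicting that an employee stays (about 76% of the test set stayed), and it identifies only 394 of the 1,072 employees who left. Let us try another prediction model as well.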

## Random Forest Classifier 

In [30]:
clf=RandomForestClassifier()

In [31]:
clf.fit(X_train,np.ravel(y_train))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [32]:
Y_pred_clf=clf.predict(X_test)
Y_pred_clf

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [33]:
#Accuracy is measured on the held-out test set
a_score_clf = accuracy_score(y_test,Y_pred_clf)
a_score_clf=a_score_clf*100

print("Metric function accuracy for test data :%f" %a_score_clf)

Metric function accuracy for test data :98.066667


In [34]:
confusion_matrix_clf = confusion_matrix(y_test,Y_pred_clf)
print(confusion_matrix_clf)

[[3403   25]
 [  62 1010]]


In [35]:
from sklearn.metrics import classification_report

print(classification_report(y_test, Y_pred_clf))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      3428
          1       0.98      0.94      0.96      1072

avg / total       0.98      0.98      0.98      4500



Random Forest has the best accuracy, 98.07% on the test set. Hence, Random Forest can be used for predicting which employees will leave next.
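
Because the forest above is created without a random_state, its accuracy can change slightly from run to run. The following is a minimal sketch on synthetic data (the frame `X_demo` and the column names `f0` to `f5` are hypothetical) showing a seeded forest and its impurity-based feature importances.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR features (column names f0..f5 are hypothetical).
features, target = make_classification(n_samples=600, n_features=6, random_state=1)
X_demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, target, test_size=0.30, random_state=42
)

# Fixing random_state makes the bootstrap samples and feature subsets repeatable.
forest_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
forest_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
same_predictions = bool((forest_a.predict(X_te) == forest_b.predict(X_te)).all())

# Impurity-based importances sum to 1 across the features.
importances = pd.Series(forest_a.feature_importances_, index=X_demo.columns)

print(same_predictions)
print(importances.sort_values(ascending=False))
```

On the HR data, the same `feature_importances_` attribute of `clf` would indicate which of the selected columns the forest relies on most.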

In [36]:
#Probability of each employee in the test set staying (column 0) or leaving (column 1)
y_probablity =  clf.predict_proba(X_test)

In [37]:
y_probablity

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])