# Employee Retention Prediction Using Keras and TensorFlow

# Table of Contents

- 1 Importing Relevant Libraries
- 2 Loading Raw Data
- 3 Preprocessing
  - 3.1 Handling Categorical Variables
- 4 Train Test Split
  - 4.1 Data Scaling
- 5 Building ANN
  - 5.1 Gradient Descent
- 6 Select and Train a Model
- 7 Apply Model on Test Set
  - 7.1 Evaluate Model on Confusion Matrix
- 8 Improving the Model
  - 8.1 Applying Cross Validation
  - 8.2 Adding Dropout Regularization
  - 8.3 Applying Grid Search

# 1 Importing Relevant Libraries

In [1]:
import pandas as pd # for data manipulating
import numpy as np  # for data conversion to np array

# 2 Loading Raw Data

In [2]:
df = pd.read_csv('HR_comma_sep.csv')

In [3]:
# first five records of dataframe
df.head(5)

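| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |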


# 3 Preprocessing

## 3.1 Handling Categorical Variables

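Note: The dummy variable trap arises when every level of a categorical variable gets its own dummy column. The columns then always sum to 1, so any one of them can be predicted exactly from the others (perfect multicollinearity). This redundancy adds no information and can make the model's weights unstable.

Therefore, drop one dummy variable per categorical feature so that a feature with N levels is encoded with N-1 dummy variables. It does not matter which one is dropped; `drop_first=True` drops the first level.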
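The redundancy behind the dummy variable trap can be checked numerically. The sketch below uses a small made-up category column (not the HR dataset) and compares the rank of a design matrix that keeps all N dummies with one that keeps N-1. The variable names are illustrative only.

```python
import numpy as np

# Toy category column with N = 3 levels (hypothetical data, not from the HR dataset)
salary = np.array(["low", "medium", "high", "low", "high", "medium"])
levels = np.unique(salary)  # sorted: ['high', 'low', 'medium']

# One-hot encode: one 0/1 column per level
dummies = (salary[:, None] == levels[None, :]).astype(float)

# Design matrix with an intercept column plus all N dummies
intercept = np.ones((len(salary), 1))
X_full = np.hstack([intercept, dummies])

# Same matrix after dropping the first dummy (N-1 dummies remain)
X_reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# With all N dummies the rank is lower than the number of columns,
# because the dummy columns add up to the intercept column.
print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

After dropping one dummy, the rank equals the number of columns, which means no column is a combination of the others.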

In [4]:
# converts categorical to numerical variables (dummies)
categorical_v = ['department','salary']
df_final = pd.get_dummies(df,columns=categorical_v,drop_first=True)
df_final.head()

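| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |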


# 4 Train Test Split

In [5]:
# split dataset into a training and a testing set
from sklearn.model_selection import train_test_split
X = df_final.drop(['left'],axis=1).values
y = df_final['left'].values
# deep learning model expects to get the data as arrays
# use numpy to convert the data to numpy arrays with the .values attribute

In [6]:
# 70% for the training set and 30% for the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((10499, 18), (4500, 18), (10499,), (4500,))

## 4.1 Data Scaling

In [7]:
# scale the dataset using StandardScaler
# this gives each feature a mean of zero and unit variance
# the scaler is fitted on the training set only and then applied to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
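As a rough illustration of what the scaling step does, the following sketch reproduces standardization by hand with NumPy. The arrays `X_tr` and `X_te` are randomly generated stand-ins (an assumption for this example), not the HR data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrices standing in for X_train and X_test
X_tr = rng.normal(loc=200.0, scale=50.0, size=(100, 3))
X_te = rng.normal(loc=200.0, scale=50.0, size=(40, 3))

# "fit": learn per-column statistics from the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))  # close to, but not exactly, zero
```

Reusing the training statistics on the test set keeps information from the test set out of the preprocessing step.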

# 5 Building ANN

In [8]:
import keras
from keras.models import Sequential # initialize ANN
from keras.layers import Dense # add layers to deep learning model

Using TensorFlow backend.


In [9]:
# create a classifier variable
classifier = Sequential()

In [10]:
# adding layers to your network
classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))


The first parameter is the number of nodes in the layer. One common heuristic is to take the average of the number of nodes in the input layer and the output layer; here (18 + 1) / 2 rounds down to 9.

The second parameter is the `kernel_initializer`, the function that initializes the weights. When the deep learning model is fitted, the weights start as small random numbers close to, but not equal to, zero.

The third parameter is the `activation` function, which determines the output of each node. Activation functions can be linear or non-linear; the non-linear ones let the network learn non-linear relationships. `relu` is used here because it is cheap to compute and is a common default for hidden layers.

The last parameter is `input_dim`, which represents the number of features in the dataset.
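To make the four parameters concrete, here is a minimal NumPy sketch of what a single dense layer with 18 inputs, 9 nodes and a `relu` activation computes. It is not the Keras implementation; the initialization range of [-0.05, 0.05] and the random input batch are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, n_hidden = 18, 9  # (18 inputs + 1 output) // 2 = 9 hidden nodes

# Small random weights near zero; the range [-0.05, 0.05] is an assumption
W = rng.uniform(-0.05, 0.05, size=(n_features, n_hidden))
b = np.zeros(n_hidden)

def relu(z):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(z, 0.0)

# Hypothetical batch of 5 already-scaled samples
x = rng.normal(size=(5, n_features))

# Forward pass of the layer: weighted sum, then activation
hidden = relu(x @ W + b)
print(hidden.shape)
```

Each of the 5 samples is mapped from 18 features to 9 non-negative activations.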

In [11]:
# add the output layer
classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))

The first parameter is the number of output nodes. Only one output is expected: whether an employee leaves the company. Therefore, specify one output node.

The second parameter is again the `kernel_initializer`. For the `activation`, use the `sigmoid` function to get the probability that an employee will leave. When dealing with more than two categories, use the `softmax` activation function, which generalizes `sigmoid` to multiple classes.
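The relationship between `sigmoid` and `softmax` can be verified with a few lines of NumPy. The sketch below defines both functions by hand (these are illustrative helpers, not Keras calls) and shows that a two-class softmax over the logits `[0, z]` gives the same probability as `sigmoid(z)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 1.5])  # example raw outputs of the last layer

# Two-class case: class 0 has logit 0, class 1 has logit z
two_class_logits = np.stack([np.zeros_like(z), z], axis=-1)

p_softmax = softmax(two_class_logits)[:, 1]  # probability of class 1
p_sigmoid = sigmoid(z)

print(p_softmax)
print(p_sigmoid)
```

Both lines print the same probabilities, which is why a single `sigmoid` node is enough for a binary problem.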

## 5.1 Gradient Descent
The aim of gradient descent is to reach the point where the error is lowest. This is done by iteratively adjusting the weights in the direction that reduces the `cost function` until it reaches a minimum. In practice, this may be a local minimum rather than the global one.

There are several optimization strategies; this model uses a popular one known as `adam`.
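The idea can be sketched with plain gradient descent on a one-dimensional cost function. This is a simplified illustration, not the `adam` algorithm, which additionally adapts the step size per weight. The cost function and learning rate below are made up for the example.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0

def grad(w):
    """Derivative of the toy cost function."""
    return 2.0 * (w - 3.0)

w = 0.0               # starting weight
learning_rate = 0.1   # step size (hypothetical value)

for step in range(100):
    # Move against the gradient, i.e. downhill on the cost surface
    w -= learning_rate * grad(w)

print(round(w, 4), round(cost(w), 4))
```

Each update moves the weight a little closer to the value where the cost is lowest.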

In [12]:
# apply a gradient descent to the neural network
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])

The first parameter `optimizer` is the gradient descent variant used to update the weights.

The second parameter `loss` is the function that gradient descent minimizes. Since this is a binary classification problem, use the `binary_crossentropy` loss function.

The last parameter `metrics` lists the metrics used to evaluate the model. In this case, the model is evaluated on its accuracy when making predictions.
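For intuition about the loss and the metric, the following sketch computes binary cross-entropy and accuracy by hand for a handful of made-up labels and predicted probabilities. The helper function is written for this example and only approximates what Keras computes internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Hypothetical labels (1 = left, 0 = stayed) and predicted probabilities
y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = binary_crossentropy(y_true, confident_right)
loss_bad = binary_crossentropy(y_true, confident_wrong)

# Accuracy: share of predictions on the right side of the 0.5 threshold
accuracy = float(np.mean((confident_right > 0.5) == y_true))

print(round(loss_good, 4), round(loss_bad, 4), accuracy)
```

The loss is small when the predicted probabilities agree with the labels and large when they confidently disagree, which is the signal the optimizer follows.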

# 6 Select and Train a Model

In [13]:
# fit classifier to the dataset
classifier.fit(X_train, y_train, batch_size = 10, epochs = 1)

Epoch 1/1


<keras.callbacks.History at 0x1faf961cb70>

The third parameter `batch_size` is the number of samples that pass through the neural network before the weights are updated.

The last parameter `epochs` is the number of times the whole dataset is passed through the neural network. More epochs take longer to run and often give better results, although too many can lead to overfitting.
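How `batch_size` and `epochs` translate into weight updates can be worked out with simple arithmetic. The sketch below uses the training set size printed earlier (10499 rows) and the values from the `fit` call.

```python
import math

n_samples = 10499  # rows in X_train, from the shapes printed earlier
batch_size = 10
epochs = 1

# One weight update per batch; the last batch may be smaller
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
last_batch = n_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, total_updates, last_batch)
```

A smaller batch size means more updates per epoch, and more epochs multiply that number again.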

# 7 Apply Model on Test Set

In [14]:
# make predictions on the test set
y_pred = classifier.predict(X_test)

In [15]:
# set the threshold as 50%
y_pred = (y_pred > 0.5)

## 7.1 Evaluate Model on Confusion Matrix

In [16]:
# evaluate the model on a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[3334,  106],
       [ 679,  381]], dtype=int64)

In [17]:
# accuracy = correct predictions (the diagonal of cm) / all predictions
print('The model accuracy is:', round((cm[0, 0] + cm[1, 1]) / cm.sum(), 4))

The model accuracy is: 0.8256
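Accuracy is only one number that can be read off a confusion matrix. As a sketch, the code below takes the matrix printed above (rows are true classes, columns are predicted classes, as `confusion_matrix` returns them) and also derives precision and recall for the "left" class.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted class
cm = np.array([[3334, 106],
               [679, 381]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of the predicted leavers, how many really left
recall = tp / (tp + fn)          # of the real leavers, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

Recall here is 381 out of 1060 actual leavers, so accuracy alone can hide how many leavers the model misses.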


# 8 Improving the Model
In `K-fold cross-validation` with K set to 10, the training data is split into 10 folds. In each iteration, the model is trained on 9 folds and tested on the remaining fold, and this repeats until every fold has served as the test fold once. The accuracy of the model is the average of these 10 accuracies.
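The splitting logic can be sketched in a few lines of NumPy. The helper `k_fold_indices` is a hypothetical function written for this illustration; it does not shuffle or stratify, so it is only an approximation of what scikit-learn does for a classifier.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal chunks
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 10499 is the size of the training set in this notebook
splits = list(k_fold_indices(n_samples=10499, k=10))
val_sizes = [len(val) for _, val in splits]

print(len(splits), val_sizes)
```

Every sample lands in exactly one validation fold, so each one is used for testing once and for training nine times.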

In [18]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

def make_classifier():
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))
    classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
    return classifier

The function that will be passed to the `KerasClassifier` wraps the same neural network design that was used earlier.

In [19]:
# pass the function to the KerasClassifier
classifier = KerasClassifier(build_fn = make_classifier, batch_size=10, epochs=1)

The first parameter `build_fn` is the function with the neural network design.

The second parameter `batch_size` is the number of samples passed through the network in each iteration.

The third parameter `epochs` is the number of epochs the network will run (it was called `nb_epoch` in Keras 1).

## 8.1 Applying Cross Validation

In [20]:
# apply CV using Scikit-learn's cross_val_score
accuracies = cross_val_score(estimator = classifier,X = X_train,y = y_train,cv = 10,n_jobs = -1)

In [21]:
# compute the mean and variance of the accuracies
mean = accuracies.mean()
variance = accuracies.var()
print ('The mean accuracy is:', round(mean,5), 'and the variance is', round(variance,5))
# a low variance means the accuracy is consistent across the folds;
# the mean shows how well the model performs on average

The mean accuracy is: 0.84275 and the variance is 0.00487


## 8.2 Adding Dropout Regularization
In neural networks, `dropout regularization` is a technique that fights overfitting by adding a `Dropout` layer to the network.

The layer has a `rate` parameter that sets the fraction of neurons deactivated at each training iteration. The neurons to deactivate are chosen at random.

Below, the rate is set to 0.1, meaning that 10% of the neurons are deactivated at each iteration during training.
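A minimal NumPy sketch of the idea follows. It implements the common "inverted dropout" variant, where the surviving activations are scaled up so that their expected value is unchanged. The `dropout` helper and the all-ones input are assumptions for this illustration, not the Keras layer itself.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out a random fraction `rate` of activations and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron stays active
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 9))  # hypothetical outputs of the 9-node hidden layer

dropped = dropout(activations, rate=0.1, rng=rng)
fraction_zeroed = float(np.mean(dropped == 0.0))

print(round(fraction_zeroed, 3), round(float(dropped.mean()), 3))
```

Roughly 10% of the values are zeroed, while the average activation stays close to 1. Dropout is only active during training; at prediction time all neurons are used.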

In [22]:
from keras.layers import Dropout
classifier = Sequential()
classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))
classifier.add(Dropout(rate = 0.1))
classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
# during training, about 10% of the neurons are deactivated at each iteration


## 8.3 Applying Grid Search
`Grid search` is a technique that tries every combination of a given set of model parameters in order to find the combination with the best accuracy.
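To see how quickly the number of training runs grows, the sketch below enumerates the combinations for the parameter grid used in this section with `itertools.product`. The count of fits assumes 2-fold cross-validation and one final refit on the full training set, which is the default behaviour of `GridSearchCV`.

```python
from itertools import product

# Same grid as the one used in this section
params = {
    "batch_size": [20, 35],
    "epochs": [2, 3],
    "optimizer": ["adam", "rmsprop"],
}
cv = 2  # number of cross-validation folds

names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

# Each combination is trained once per fold, plus one refit of the best one
n_fits = len(combinations) * cv + 1

for combo in combinations:
    print(combo)
print("total training runs:", n_fits)
```

Adding one more value to any parameter multiplies the number of combinations, so the grid is best kept small when each run trains a neural network.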

In [23]:
from sklearn.model_selection import GridSearchCV
def make_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))
    classifier.compile(optimizer= optimizer,loss = "binary_crossentropy",metrics = ["accuracy"])
    return classifier

In [24]:
classifier = KerasClassifier(build_fn = make_classifier)

In [25]:
# set a couple of parameters to experiment with
params = {
'batch_size':[20,35],
'epochs':[2,3],
'optimizer':['adam','rmsprop']
}

In [26]:
# search for the best parameters
grid_search = GridSearchCV(estimator=classifier, 
                           param_grid=params, scoring="accuracy", cv=2)
grid_search = grid_search.fit(X_train,y_train)

Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [27]:
# obtain the best parameters
best_param = grid_search.best_params_
best_accuracy = grid_search.best_score_
print (best_param, best_accuracy)

{'batch_size': 20, 'epochs': 3, 'optimizer': 'adam'} 0.8670349557100676


We have built an employee retention model that can predict whether an employee stays or leaves with a cross-validated accuracy of about `87%`.