**Suppose employees are leaving my company in droves and I want to know which employees are most likely to leave next. If I can predict which of the current employees are likely to leave, that gives me an opportunity to retain that talent in the company. To predict this, I will train a deep neural network on this data set.** Note: The data is simulated!

In [None]:
import numpy as np
import pandas as pd

# Importing the dataset
hrdata = pd.read_csv('../input/HR_comma_sep.csv')

**The data in this case is very clean and does not need much work. I would still like to rearrange the columns so that the raw data makes better sense when you look at it directly.**

In [None]:

col_names = ['satisfaction_level',
             'last_evaluation',
             'number_project',
             'average_montly_hours',
             'time_spend_company',
             'Work_accident',
             'salary',
             'promotion_last_5years',
             'sales',
             'left']
hrdata = hrdata.reindex(columns = col_names)


**This rearranges the columns so that the label "left" is on the extreme right. To avoid clusters of similar-looking rows, I will randomly shuffle all the rows.**

In [None]:
#Randomization
hrdata = hrdata.sample(frac = 1)


**We will now create two arrays: X (the first nine columns, containing all the features, e.g. satisfaction level, last evaluation, etc.) and y (the last column, which says whether an employee left or not).**

In [None]:
X = hrdata.iloc[:, 0:9].values
y = hrdata.iloc[:, 9].values

**The next step is one of the most important and deserves careful thought. Machine learning is ultimately an algorithm, and like all algorithms, if you put garbage in you will get garbage out; hence the data preprocessing. We first have to take care of the features that represent categories. In this example these are the salary level and the job description (the `salary` and `sales` columns). We will first use the LabelEncoder from scikit-learn to give each category an integer code. However, we cannot feed these integers directly into the neural network, because they imply an ordering and a magnitude that the categories do not necessarily have, and ideally we would like the network to treat the categories on an equal footing. Thus we will use the OneHotEncoder, again from scikit-learn, to binarize the categories. We will then split the data into a training set and a test set so that we can evaluate the performance of our algorithm. Finally, the features are scaled: the scaler is fitted on the training set only and then applied to both sets.**
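
**To make the difference between integer codes and one-hot columns concrete, here is a minimal sketch in plain NumPy. The toy `salary` values are hypothetical and the snippet only imitates what LabelEncoder and OneHotEncoder do; it is not how scikit-learn implements them.**

```python
import numpy as np

# Toy salary column (hypothetical values, for illustration only).
salary = np.array(['low', 'medium', 'high', 'low'])

# Step 1: integer codes. np.unique sorts the categories alphabetically,
# which is also the order LabelEncoder uses.
levels, codes = np.unique(salary, return_inverse=True)

# Step 2: one-hot encode by picking rows of an identity matrix.
# Each category gets its own 0/1 column, so no category is "larger" than another.
one_hot = np.eye(len(levels), dtype=int)[codes]

print(levels)   # ['high' 'low' 'medium']
print(codes)    # [1 2 0 1]
print(one_hot)
```

**With the integer codes alone, "medium" (2) would look twice as large as "low" (1); in the one-hot form every category is simply present or absent.**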


In [None]:
#Encode all categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X6 = LabelEncoder()
X[:, 6] = labelencoder_X6.fit_transform(X[:, 6])
labelencoder_X8 = LabelEncoder()
X[:, 8] = labelencoder_X8.fit_transform(X[:, 8])
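# The categorical_features argument was removed from OneHotEncoder in newer
# scikit-learn releases; ColumnTransformer selects the columns to encode instead.
# The one-hot columns come first, followed by the untouched numeric columns.
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), [6, 8])],
                                  remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)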

#Splitting into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaling = StandardScaler()
X_train = scaling.fit_transform(X_train)
X_test = scaling.transform(X_test)
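
**The scaler above is fitted on the training set and only applied to the test set. A rough NumPy sketch of the same idea, with made-up numbers, shows what `fit_transform` and `transform` amount to. The toy arrays are assumptions for illustration.**

```python
import numpy as np

# Toy feature matrices (hypothetical values, e.g. monthly hours and years at company).
X_train_toy = np.array([[150.0, 2.0], [200.0, 4.0], [250.0, 6.0]])
X_test_toy = np.array([[300.0, 3.0]])

# "fit": learn the per-column mean and standard deviation from the training set only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation, as StandardScaler uses

# "transform": apply the same training statistics to both sets.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```

**Reusing the training mean and standard deviation keeps information about the test set out of the preprocessing, so the test score stays an honest estimate.**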

**Now we will use Keras to implement a deep neural network and train it on the training set. For simplicity, the network has just two hidden layers of 10 units each, which should be enough for the purposes of this exercise.**

In [None]:
from keras.models import Sequential
from keras.layers import Dense
classifier = Sequential()
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = 20))
#    classifier.add(Dropout(0.1))
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
#    classifier.add(Dropout(0.1))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
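
**As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand from the layer sizes in the code above (20 inputs, two Dense layers of 10 units, one output unit). The 20 inputs are the 13 one-hot columns from the two categorical features plus the 7 numeric ones. The helper `dense_params` is a hypothetical name used only for this sketch.**

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Layer sizes taken from the model above: 20 inputs -> 10 -> 10 -> 1.
layer_sizes = [20, 10, 10, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [210, 110, 11]
print(total)      # 331
```

**These figures should match what `classifier.summary()` reports for the model.**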

**Once we have trained our neural network on the training set, it is time to see how effective it is on the test set. This will be the real test of accuracy.**

In [None]:
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
cm = confusion_matrix(y_test, y_pred)
# Accuracy, Precision and Recall 
learning_score = precision_recall_fscore_support(y_test, y_pred)
accuracy_cm = (cm[0, 0] + cm[1, 1]) / cm.sum()
print('The accuracy is : ', accuracy_cm)
print('The precision is : ', learning_score[0][1])
print('The recall is : ', learning_score[1][1])
print('The fscore is : ', learning_score[2][1])
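
**For reference, the four numbers printed above can all be derived from the confusion matrix. The sketch below uses a hypothetical matrix (not the result of this notebook) in scikit-learn's layout, where rows are the true classes, columns are the predicted classes, and class 1 means "left".**

```python
import numpy as np

# Hypothetical confusion matrix, for illustration only.
cm_toy = np.array([[90, 5],
                   [10, 45]])
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()
precision = tp / (tp + fp)   # of those predicted to leave, how many actually left
recall = tp / (tp + fn)      # of those who left, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print('accuracy :', accuracy)
print('precision:', precision)
print('recall   :', recall)
print('fscore   :', f1)
```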

**The neural network thus reaches an accuracy of nearly 95% in predicting whether an employee will leave or not. Given that the number of employees who stay is much larger than the number who leave, precision might be a better indicator of the algorithm's performance. The precision is 0.91, which is not bad. However, all these numbers will be slightly different each time the algorithm is re-run, so we need a better way to put a number on the accuracy.**
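
**One common way to get a more stable number is k-fold cross-validation: train and evaluate k times on different splits and report the mean and spread of the scores. The sketch below is an assumption about how this could be done, not a result from this notebook. `k_fold_scores` and `majority_baseline` are hypothetical helpers, the data is random toy data, and the baseline stands in for "build, fit and evaluate the network".**

```python
import numpy as np

def k_fold_scores(X, y, train_and_score, k=10, seed=0):
    """Return one score per fold; each sample is used for testing exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.array(scores)

# Hypothetical stand-in for the real model: always predict the majority
# class seen in the training fold and report the accuracy on the test fold.
def majority_baseline(X_tr, y_tr, X_te, y_te):
    majority = np.bincount(y_tr).argmax()
    return np.mean(y_te == majority)

# Toy data: 100 samples, roughly a quarter labelled 1 ("left").
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = (rng.random(100) < 0.25).astype(int)

scores = k_fold_scores(X_toy, y_toy, majority_baseline, k=10)
print('mean accuracy:', scores.mean(), 'std:', scores.std())
```

**Reporting the mean together with the standard deviation says both how good the model is on average and how much a single run can be trusted.**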