# Artificial Neural Network for the IBM attrition data.

Here I implemented a quick Keras model to work on this dataset. The idea was to see how high an accuracy I could get.

Deep learning works better with bigger datasets, and this one is really tiny. Yet I could get some good results very quickly.

When I started this exercise I was not aware that this is not a real dataset from real employees...

In [None]:
import numpy as np 
import pandas as pd 

# Load the IBM HR attrition data (Kaggle input path)
dataset = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')
# Local alternative:
#dataset = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

## Cleaning the data

Removing the following columns:

* Standard hours: It was always 80 for all employees, so this column was useless.

* Over18: Yes for all.

* Employee number: This column was not useful for what we are looking for and could have confused the ANN.

The Attrition column has been moved to the first position to make the slicing easier later on.

In [None]:
# Dropping StandardHours, Over18 and EmployeeNumber; Attrition becomes the first column
dataset = dataset[['Attrition',
                   'Age',
                   'BusinessTravel',
                   'DailyRate',
                   'Department',
                   'DistanceFromHome',
                   'Education',
                   'EducationField',
                   'EmployeeCount',
                   'EnvironmentSatisfaction',
                   'Gender',
                   'HourlyRate',
                   'JobInvolvement',
                   'JobLevel',
                   'JobRole',
                   'JobSatisfaction',
                   'MaritalStatus',
                   'MonthlyIncome',
                   'MonthlyRate',
                   'NumCompaniesWorked',
                   'OverTime',
                   'PercentSalaryHike',
                   'PerformanceRating',
                   'RelationshipSatisfaction',
                   'StockOptionLevel',
                   'TotalWorkingYears',
                   'TrainingTimesLastYear',
                   'WorkLifeBalance',
                   'YearsAtCompany',
                   'YearsInCurrentRole',
                   'YearsSinceLastPromotion',
                   'YearsWithCurrManager']]
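As a minimal sketch of how the constant columns mentioned above could be spotted before dropping them, the snippet below counts distinct values per column. The `toy` frame and its values are made up for illustration; on the real data the same check would run on `dataset`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Age': [41, 49, 37],
    'StandardHours': [80, 80, 80],
    'Over18': ['Y', 'Y', 'Y'],
    'EmployeeNumber': [1, 2, 4],
})

# A column with a single distinct value carries no information for the model
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)

# Drop the constant columns plus the identifier column
cleaned = toy.drop(columns=constant_columns + ['EmployeeNumber'])
print(list(cleaned.columns))
```

Selecting the kept columns by name, as done in the cell above, has the same effect and also fixes the column order.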

## Adding some more derived information

By playing with the data we can add some more information to help the network.

With deep learning we don't need to be experts in the field, but providing some ratios and extra information can help the network converge faster.

In [None]:
dataset['JobInvolment_On_Salary']= dataset['JobInvolvement'] / dataset['MonthlyIncome'] * 1000
dataset['MarriedAndBad_Worklife_Balance'] = np.where(dataset['MaritalStatus']=='Married', 
                                               dataset['WorkLifeBalance']-2,
                                               dataset['WorkLifeBalance']+1)
dataset['DistanceFromHome_rootedTo_JobSatisfaction'] = dataset['DistanceFromHome']**(1/dataset['JobSatisfaction'])
dataset['TotalJobSatisfaction'] = dataset['EnvironmentSatisfaction'] + dataset['JobSatisfaction'] + dataset['RelationshipSatisfaction'] 
dataset['OldLowEmployeeTendToStay'] = dataset['YearsAtCompany'] / dataset['JobLevel']
dataset['Mothers'] = np.where((dataset['Gender']=='Female') & (dataset['Age']>=36), 1,0)
dataset['Rate'] = dataset['DailyRate'] * 20 + dataset['HourlyRate'] * 8 * 20 + dataset['MonthlyRate']
dataset['RateExtended'] = dataset['Rate'] * (8 - dataset['JobSatisfaction'] - dataset['EnvironmentSatisfaction'])

## Separating the data from the labels

In [None]:
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values

## Encoding the data and the labels

Neural networks only understand numbers. We need to transform the columns containing strings into numbers.

What we will do is create categories.

ex: Gender: Female as 0 and Male as 1 (LabelEncoder assigns the codes in alphabetical order)

In [None]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
labelencoder_X_6= LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])
labelencoder_X_9= LabelEncoder()
X[:, 9] = labelencoder_X_9.fit_transform(X[:, 9])
labelencoder_X_13= LabelEncoder()
X[:, 13] = labelencoder_X_13.fit_transform(X[:, 13])
labelencoder_X_15= LabelEncoder()
X[:, 15] = labelencoder_X_15.fit_transform(X[:, 15])
labelencoder_X_19= LabelEncoder()
X[:, 19] = labelencoder_X_19.fit_transform(X[:, 19])
X = X.astype(float)
labelencoder_y= LabelEncoder()
y = labelencoder_y.fit_transform(y)
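To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a single column. The `genders` list is made-up sample data, not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

genders = ['Male', 'Female', 'Female', 'Male']  # made-up sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ lists the original labels in the order of their integer codes
print(encoder.classes_.tolist())  # ['Female', 'Male']
print(codes.tolist())             # [1, 0, 0, 1]
```

The same rule applies to the labels: 'No' comes before 'Yes' alphabetically, so an attrition of 'Yes' is encoded as 1.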

## Dummy variable and dummy trap

Here we will create dummy variables for the categorical columns we just encoded (only for those with more than 2 categories).

We are doing this because if we leave Single as 0, Married as 1 and Divorced as 2, the network would understand that divorced > married, which doesn't make any sense.

Yet we do not want to fall into the dummy variable trap, so we will remove the first column of each set of dummy variables.

Why?

Because if a single person is encoded as 1 0 0, we can still tell that the person is single after removing the first column: 0 0 (not divorced, not married --> single). This way we remove one redundant feature from each one-hot encoding.

In [None]:
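# One-hot encode each multi-category column, then drop its first dummy column (no dummy trap).
# The column indices shift after each step because the new dummy columns are placed at the front of X.
# Note: the categorical_features argument only exists in older scikit-learn releases (removed in 0.22).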
onehotencoder1 = OneHotEncoder(categorical_features = [1])
X = onehotencoder1.fit_transform(X).toarray()
X = X[:,1:]
onehotencoder3 = OneHotEncoder(categorical_features = [4])
X = onehotencoder3.fit_transform(X).toarray()
X = X[:,1:]
onehotencoder6 = OneHotEncoder(categorical_features = [8])
X = onehotencoder6.fit_transform(X).toarray()
X = X[:,1:]
onehotencoder13 = OneHotEncoder(categorical_features = [19])
X = onehotencoder13.fit_transform(X).toarray()
X = X[:,1:]
onehotencoder15 = OneHotEncoder(categorical_features = [28])
X = onehotencoder15.fit_transform(X).toarray()
X = X[:,1:]
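On recent scikit-learn releases the same idea can be sketched with `OneHotEncoder(drop='first')`, which drops the first category for us. This is an illustrative alternative on made-up data, not the code used for the results in this notebook; the `marital_status` array is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of one categorical column
marital_status = np.array([['Single'], ['Married'], ['Divorced'], ['Single']])

# drop='first' removes the first category (alphabetically 'Divorced'),
# which avoids the dummy variable trap
encoder = OneHotEncoder(drop='first')
dummies = encoder.fit_transform(marital_status).toarray()

print(encoder.categories_[0].tolist())  # categories in sorted order
print(dummies)                          # one row per sample, two columns left
```

A row of all zeros then stands for the dropped category, exactly as in the 0 0 example above.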

## Splitting the dataset into Training and Testing set

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

## Scaling the features

We want to have all the features on a similar scale.

This helps the computation, and it also keeps the network from treating some features as more important than others just because their values are larger.

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
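A small sketch of what the scaler does, on made-up numbers: the mean and standard deviation are learned on the training rows only and then reused for the test rows. The `train` and `test` arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (think age vs. monthly income)
train = np.array([[25.0, 2000.0], [35.0, 6000.0], [45.0, 10000.0]])
test = np.array([[30.0, 4000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std on the training data only
test_scaled = scaler.transform(test)        # reuse the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled)
```

Fitting on the training set only keeps information about the test set from leaking into the preprocessing.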

## Hyperparameters

To avoid overfitting such a tiny dataset we will use dropout (randomly switching "off" 10% of the neurons during training to help them become more independent).

All these parameters are the "winners" of a GridSearch I performed locally.

In [None]:
dropout = 0.1
epochs = 100
batch_size = 30
optimizer = 'adam'
k = 20  # not used below: cross_val_score is called with cv = 30
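To illustrate what a dropout rate of 0.1 means, here is a NumPy sketch of the usual "inverted dropout" scheme. It is a simplified stand-in for what a dropout layer does during training, not the Keras implementation; the `activations` array is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.1  # same value as the dropout hyperparameter above

activations = np.ones((4, 16))  # made-up layer output: 4 samples, 16 neurons

# Keep each neuron with probability 1 - rate, switch it off otherwise
keep_mask = rng.random(activations.shape) >= rate

# Scale the survivors so the expected activation stays the same
dropped = activations * keep_mask / (1.0 - rate)

print(dropped)
```

At prediction time no neurons are switched off, which is why the kept activations are scaled up during training.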

## Training the Neural Network, using a K Fold Cross Validation

To initialize the weights we will use a truncated normal distribution.

As we want a probability as output, we will use a sigmoid activation function on the output layer.

Because we are working on a classification problem with only 2 categories, we will calculate our loss with binary cross-entropy.
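For reference, the sigmoid and the binary cross-entropy can be written in a few lines of NumPy. This is a sketch of the formulas on made-up numbers; the helper names `sigmoid` and `binary_cross_entropy` are hypothetical and Keras computes the loss internally.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample losses
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

logits = np.array([2.0, -1.0, 0.0])  # made-up raw outputs of the last layer
labels = np.array([1, 0, 1])         # made-up true labels

probs = sigmoid(logits)
loss = binary_cross_entropy(labels, probs)
print(probs, loss)
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions.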

We will use K-fold cross-validation here (cv = 30 in the cell below): the training set is split into K folds, and the network is trained K times, each time holding out a different fold for validation.
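The splitting itself can be sketched with scikit-learn's `KFold` on made-up data. The toy arrays and the choice of 5 folds are assumptions for illustration only; the exact splitter picked by `cross_val_score` can differ depending on the estimator.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up samples with 2 features
folds = KFold(n_splits=5)

validation_counts = np.zeros(len(X_toy), dtype=int)
n_rounds = 0

for train_idx, val_idx in folds.split(X_toy):
    # A model would be trained on train_idx and scored on val_idx here
    validation_counts[val_idx] += 1
    n_rounds += 1
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())
```

Every sample is used for validation exactly once, and each round produces one accuracy value.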

In [None]:
# Evaluating the ANN
# Note: keras.wrappers.scikit_learn is only available in older Keras releases
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

def build_classifier():
    classifier = Sequential()
    # Hidden layer: 16 ReLU units, one input per feature column
    classifier.add(Dense(16, kernel_initializer="truncated_normal", activation = 'relu', input_shape = (X.shape[1],)))
    classifier.add(Dropout(dropout))
    # Output layer: a single sigmoid unit giving the attrition probability
    classifier.add(Dense(1, kernel_initializer="truncated_normal", activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ["accuracy"])
    return classifier

classifier = KerasClassifier(build_fn = build_classifier, batch_size = batch_size, epochs = epochs, verbose=0)
# cv = 30 trains and validates the network on 30 different folds of the training set
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 30)
# Named best_accuracy to avoid shadowing Python's built-in max()
best_accuracy = accuracies.max()

## Display of the epochs

The output of the previous cell is hidden because it printed too many lines.

If you want to see it when running the notebook, remove "verbose=0" or set it to 1.

In [None]:
print("Best accuracy: ", best_accuracy)
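The best fold is the most optimistic number. As a hedged suggestion, the mean and standard deviation over all folds give a more stable picture. The sketch below uses made-up accuracies standing in for the `accuracies` array above; these are not results from this notebook.

```python
import numpy as np

# Made-up fold accuracies, for illustration only
fold_accuracies = np.array([0.85, 0.90, 0.80, 0.95, 0.85])

print("Best fold accuracy:", fold_accuracies.max())
print("Mean accuracy:", fold_accuracies.mean())
print("Standard deviation:", fold_accuracies.std())
```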

## Done. 

Feel free to ask me anything about what was done here.

Best,

Charles