## Task:

- use Keras :)
- **Predict if employee stayed or left the company**
- build confusion matrix
- experiment with parameters to get better accuracy

Necessary data pre-processing steps:
- Data Normalization (*)
- Splitting data into training and testing sets


(\*) Read more about scikit-learn's ```MinMaxScaler``` or ```Normalizer``` here:
- [Normalization](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization)
- [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
- [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer)

----

Dataset source: https://www.kaggle.com/ludobenistant/hr-analytics

## Additional task:

Requirements: Python and maybe basic NumPy

Pretend that the dataset is too big to fit into a single matrix.

**Write a generator that will iterate over the data during model training.**

The generator should:

- accept any batch size
- generate batches for training/testing data and training/testing labels
- be usable by Keras ```model.fit_generator``` (check method **"fit_generator"** on [Keras page](https://keras.io/models/sequential/))
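A minimal sketch of such a generator is shown below. It assumes the features and labels can be sliced by row (here, NumPy arrays); `batch_generator`, `x_demo` and `y_demo` are hypothetical names, and the toy arrays stand in for the real data. For data that truly does not fit in memory, the slicing step would be replaced by reading one chunk from disk at a time.

```python
import numpy as np

def batch_generator(features, labels, batch_size):
    """Yield (features, labels) batches forever, one batch at a time."""
    n_samples = len(features)
    while True:  # Keras expects the generator to loop indefinitely
        for start in range(0, n_samples, batch_size):
            end = start + batch_size
            # The last batch of a pass may be smaller than batch_size
            yield features[start:end], labels[start:end]

# Toy stand-ins for the training data (hypothetical, not the HR dataset)
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

gen = batch_generator(x_demo, y_demo, batch_size=4)
first_pass = [next(gen) for _ in range(3)]
batch_sizes = [len(x_batch) for x_batch, _ in first_pass]
print(batch_sizes)

# With the Keras 1.x API used in this notebook, a call along the lines of
#   model.fit_generator(gen, samples_per_epoch=len(x_demo), nb_epoch=50)
# should consume the generator; argument names differ in later Keras versions.
```

The same function can be called once with the training arrays and once with the testing arrays. If the model ends in a 2-unit softmax, the labels in each batch would also need to be one-hot encoded before being yielded.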


In [1]:
import pandas as pd
import numpy as np

In [2]:
hr_dataset = pd.read_csv('datasets/HR_comma_sep.csv')
hr_dataset.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


# Data Exploration

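Let's look at some basic features.
We can select the rows where the employee left versus stayed, and compare basic statistics of the columns for each group.

E.g. `satisfaction_level` and `promotion_last_5years` unsurprisingly seem to be quite important: in the tables below, mean satisfaction is 0.44 for employees who left versus 0.67 for those who stayed, and about 0.5% of leavers were promoted in the last five years versus about 2.6% of stayers.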
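The same comparison can also be made in one step with `groupby`. The sketch below uses a few made-up rows (not the real HR data) that reuse the dataset's column names, so the numbers it prints are illustrative only; `toy_hr` and `group_means` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the HR dataset
toy_hr = pd.DataFrame({
    'satisfaction_level': [0.38, 0.11, 0.80, 0.60],
    'promotion_last_5years': [0, 0, 1, 0],
    'left': [1, 1, 0, 0],
})

# One row per group: index 0 = stayed, index 1 = left
columns = ['satisfaction_level', 'promotion_last_5years']
group_means = toy_hr.groupby('left')[columns].mean()
print(group_means)
```

Putting both groups in one table makes it easier to spot columns whose means differ.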

In [3]:
# summary statistics for employees who left
hr_dataset[hr_dataset['left']==1].describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,3571.0,3571.0,3571.0,3571.0,3571.0,3571.0,3571.0,3571.0
mean,0.440098,0.718113,3.855503,207.41921,3.876505,0.047326,1.0,0.005321
std,0.263933,0.197673,1.818165,61.202825,0.977698,0.212364,0.0,0.072759
min,0.09,0.45,2.0,126.0,2.0,0.0,1.0,0.0
25%,0.13,0.52,2.0,146.0,3.0,0.0,1.0,0.0
50%,0.41,0.79,4.0,224.0,4.0,0.0,1.0,0.0
75%,0.73,0.9,6.0,262.0,5.0,0.0,1.0,0.0
max,0.92,1.0,7.0,310.0,6.0,1.0,1.0,1.0


In [4]:
# summary statistics for employees who stayed
hr_dataset[hr_dataset['left']==0].describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,11428.0,11428.0,11428.0,11428.0,11428.0,11428.0,11428.0,11428.0
mean,0.66681,0.715473,3.786664,199.060203,3.380032,0.175009,0.0,0.026251
std,0.217104,0.162005,0.979884,45.682731,1.562348,0.379991,0.0,0.159889
min,0.12,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.54,0.58,3.0,162.0,2.0,0.0,0.0,0.0
50%,0.69,0.71,4.0,198.0,3.0,0.0,0.0,0.0
75%,0.84,0.85,4.0,238.0,4.0,0.0,0.0,0.0
max,1.0,1.0,6.0,287.0,10.0,1.0,0.0,1.0


# Pre-processing

We will normalize and clean up data for Keras models.


In [5]:
# one-hot encode sales
hr_dataset = hr_dataset.join(pd.get_dummies(hr_dataset['sales']), rsuffix='_sales').drop('sales', axis=1)

# one-hot encode salary
hr_dataset = hr_dataset.join(pd.get_dummies(hr_dataset['salary']), rsuffix='_salary').drop('salary', axis=1)

In [6]:
hr_dataset.shape

(14999, 21)

In [7]:
# shuffle the rows so that the train/test split below is random
hr_dataset = hr_dataset.reindex(np.random.permutation(hr_dataset.index))

In [8]:
# split data into features and labels
labels = hr_dataset[['left']].astype(float)
feats = hr_dataset.drop('left', axis=1).astype(float)

In [9]:
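# normalize data
# note: Normalizer rescales each row (sample) to unit L2 norm;
# it does not scale each column to a common range like MinMaxScaler does
from sklearn.preprocessing import Normalizer

n = Normalizer()
df = pd.DataFrame(n.fit_transform(feats))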
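The two scalers mentioned in the task behave quite differently, which matters when experimenting for better accuracy. Below is a NumPy-only sketch of what each one computes with its default settings (`MinMaxScaler` with range 0 to 1, `Normalizer` with the L2 norm). The toy array and variable names are hypothetical, with the two columns standing in for `satisfaction_level` and `average_montly_hours`.

```python
import numpy as np

# Hypothetical toy data: columns mimic (satisfaction_level, average_montly_hours)
toy = np.array([[0.5, 100.0],
                [1.0, 300.0]])

# MinMaxScaler-style: rescale each COLUMN to the range [0, 1]
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
minmax_scaled = (toy - col_min) / (col_max - col_min)

# Normalizer-style: divide each ROW by its own L2 norm
row_norms = np.linalg.norm(toy, axis=1, keepdims=True)
row_normalized = toy / row_norms

print(minmax_scaled)
print(row_normalized)
```

With row-wise normalization, the column with the largest raw values dominates every row, so small-valued features end up close to zero. Column-wise scaling keeps each feature on a comparable range.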

In [10]:
def split_datasets(dataset, ratio=0.8):
    """Split a 2-D array into the first `ratio` of rows (train) and the rest (test)."""
    train_size = int(len(dataset) * ratio)
    train, test = dataset[:train_size, :], dataset[train_size:, :]
    return train, test

# split train/test
x_train, x_test = split_datasets(df.values)
y_train, y_test = split_datasets(labels.values)

In [11]:
print x_train.shape, x_test.shape
print y_train.shape, y_test.shape

# check labels in train vs test
print "train left: %s " % str(y_train.sum() / len(y_train))
print "test left: %s " % str(y_test.sum() / len(y_test))

(11999, 20) (3000, 20)
(11999, 1) (3000, 1)
train left: 0.236686390533 
test left: 0.243666666667 


In [12]:
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Input, LSTM
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.optimizers import SGD, RMSprop, Adam
from keras.utils.np_utils import to_categorical


Using Theano backend.


In [13]:
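model = Sequential()
# Dense layers default to a linear activation, so there is no
# non-linearity between these two layers
model.add(Dense(20, input_dim=20))
model.add(Dense(32))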
# model.add(Dropout(0.5))
# model.add(Dense(16))
# model.add(Dropout(0.5))
# model.add(Dense(8))
# model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))


Since we are classifying employees into two categories (left or stayed):

1) We use `categorical_crossentropy` for the loss function
2) Instead of passing in `y_train` directly, we need to one-hot encode it into a [stayed, left] array. Keras has a handy function, `to_categorical`, that we can use
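To make the column order concrete, here is a minimal NumPy sketch of the encoding that `to_categorical` performs for integer class labels. `one_hot` is a hypothetical helper written for illustration, not a Keras function. Label 0 (stayed) maps to `[1, 0]` and label 1 (left) maps to `[0, 1]`.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Hypothetical stand-in for to_categorical: row i gets a 1 in column labels[i]."""
    class_ids = np.asarray(labels).astype(int).ravel()
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes)[class_ids]

# Same layout as y_train: one column holding the `left` flag as a float
encoded = one_hot([[1.0], [0.0], [0.0]])
print(encoded)
```

The first column therefore holds the "stayed" indicator and the second column the "left" indicator.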

In [14]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [15]:
to_categorical(y_train)

array([[ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])

In [16]:
# train model
model.fit(x_train, to_categorical(y_train), nb_epoch=50, batch_size=4, validation_split=0.2)

Train on 9599 samples, validate on 2400 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x111e436d0>

# Predictions

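How did we do?
Let's use the model to make predictions.
Note that the predictions have 2 columns: the probability of stayed (`left == 0`) first, then the probability of left (`left == 1`).

We one-hot encode the test labels with `to_categorical(y_test)` so that they have the same 2 columns.
Keras has a `categorical_accuracy` metric we can use to measure the fraction of correct predictions.
The final score is about 0.81, which is a so-so result for a neural network: about 24% of the test employees left, so always predicting "stayed" would already score roughly 0.76.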

In [17]:
# predict for test data
preds = model.predict(x_test)

In [18]:
preds

array([[ 0.96817136,  0.03182864],
       [ 0.87078327,  0.12921672],
       [ 0.72454727,  0.27545273],
       ..., 
       [ 0.52438545,  0.47561452],
       [ 0.82624406,  0.17375596],
       [ 0.7003119 ,  0.2996881 ]], dtype=float32)

In [19]:
to_categorical(y_test)

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])

In [20]:
from keras.metrics import categorical_crossentropy, categorical_accuracy
categorical_accuracy(to_categorical(y_test), preds).eval()

array(0.8116666674613953, dtype=float32)
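The task list also asks for a confusion matrix, which the cells above do not build. The following is a minimal NumPy sketch, assuming one-hot labels and 2-column predicted probabilities in the same layout as `to_categorical(y_test)` and `preds`. The function name and the toy arrays are hypothetical, so the printed counts are not results for the HR data.

```python
import numpy as np

def confusion_matrix_2x2(true_onehot, pred_probs):
    """Rows are true classes, columns are predicted classes (0 = stayed, 1 = left)."""
    true_cls = np.argmax(true_onehot, axis=1)
    pred_cls = np.argmax(pred_probs, axis=1)
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(true_cls, pred_cls):
        matrix[t, p] += 1
    return matrix

# Toy stand-ins for to_categorical(y_test) and preds (made-up values)
demo_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
demo_pred = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]])

cm = confusion_matrix_2x2(demo_true, demo_pred)
accuracy = np.trace(cm) / float(cm.sum())  # correct predictions lie on the diagonal
print(cm)
print(accuracy)
```

On the real data the call would be `confusion_matrix_2x2(to_categorical(y_test), preds)`. scikit-learn's `sklearn.metrics.confusion_matrix` should give the same table when passed the class indices instead of the one-hot arrays.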