Hello everyone! I'm fairly new to Kaggle, and machine learning fascinates me, so I'd love to get feedback on this kernel. Criticism is welcome! Here's my attempt.

When I first read the data using Pandas, I kept getting KeyErrors because the first column header, Age, contained an invisible character. It turned out to be a \ufeff character (a byte order mark) at the start of the file, which I had to avoid reading, so StackOverflow suggested I use the 'utf-8-sig' encoding parameter.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

#Get data
data = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv', encoding='utf-8-sig')
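
As a minimal sketch of why 'utf-8-sig' helps, the snippet below decodes a tiny made-up CSV (not the real dataset) that starts with a byte order mark, once with plain 'utf-8' and once with 'utf-8-sig'. It uses only the standard library; the sample bytes and variable names are assumptions for illustration.

```python
import csv
import io

# UTF-8 byte order mark followed by a tiny, made-up CSV (not the real dataset).
raw = b"\xef\xbb\xbfAge,Attrition\n41,Yes\n49,No\n"

# Plain 'utf-8' keeps the BOM as an invisible first character of the header.
plain_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# 'utf-8-sig' strips the BOM while decoding.
sig_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(plain_header[0]))  # '\ufeffAge'
print(repr(sig_header[0]))    # 'Age'
```

With the BOM left in, a lookup by the name 'Age' fails because the stored header is really '\ufeffAge', which would explain the KeyErrors.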

After manually inspecting the dataframe, I realized that some series provided no real information for my analysis: EmployeeCount, EmployeeNumber, Over18 and StandardHours. I did note that StandardHours was 80, which corresponds to a typical biweekly pay period (this is important information for later). I also saved the income data into separate series for later analysis.

I also noticed that a few series had numerical values which were orders of magnitude larger than others. I decided to standardize them (subtract the mean, then divide by the standard deviation) so that the large-valued features would not be favored.

In [None]:
#Drop columns irrelevant to our analysis
drop_columns = ['EmployeeCount','EmployeeNumber','Over18','StandardHours']
data = data.drop(drop_columns, axis=1)

#Keep copies of the raw income/rate columns (before normalization) for later analysis
mnth_inc = data['MonthlyIncome'].copy()
mnth_rte = data['MonthlyRate'].copy()
dly_rte = data['DailyRate'].copy()
hly_rte = data['HourlyRate'].copy()

#Normalize the large income data for HourlyRate, DailyRate, MonthlyRate and MonthlyIncome
norm_col = ['HourlyRate','DailyRate','MonthlyRate','MonthlyIncome']
for col in norm_col:
    data[col] = (data[col] - data[col].mean())/ data[col].std()
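
The effect of this normalization can be sketched with the standard library alone. The two lists below are made-up values on very different scales (they are not rows from the dataset), and `zscore` is a hypothetical helper name. `statistics.stdev` is the sample standard deviation, which is also what pandas `.std()` computes by default.

```python
import statistics

def zscore(values):
    """Standardize values: subtract the mean, divide by the sample std."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample std (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]

# Made-up values on very different scales (illustration only).
monthly_income = [2000.0, 4000.0, 6000.0, 8000.0, 10000.0]
hourly_rate = [30.0, 45.0, 60.0, 75.0, 90.0]

z_income = zscore(monthly_income)
z_hourly = zscore(hourly_rate)

# After standardizing, both features have mean ~0 and std ~1,
# so neither dominates purely because of its units.
print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_hourly])
```

This matters most for distance-based models such as KNN, where a feature measured in thousands would otherwise swamp one measured in tens.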

Next, I separated the target label and converted it to binary numerical data. I then identified the categorical series, separated them from the main data, one-hot encoded them and combined the two dataframes again.

In [None]:
#Convert categorical features to numerical features
le = LabelEncoder()
attrition = data['Attrition']
attrition = le.fit_transform(attrition)
data.drop('Attrition', axis=1, inplace=True)

cat_col = data.select_dtypes(include=['object']).columns.values
data_col = data[cat_col]
data.drop(cat_col, axis=1, inplace=True)

#One hot encode categorical data and combine dataframes
data_col = pd.get_dummies(data_col)
#as_matrix() was removed in pandas 1.0; cast to float and use .values instead
data = pd.concat([data, data_col], axis=1).astype(float).values
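
To make the encoding step concrete, here is a small sketch on a made-up three-row frame (the values and the `toy` name are illustrative, not taken from the dataset). It assumes a reasonably recent pandas in which `get_dummies` accepts `dtype`. Comparing against 'Yes' should give the same 0/1 labels as LabelEncoder here, since LabelEncoder sorts the classes and 'No' comes before 'Yes'.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research", "Sales"],
})

# Binary target: 1 for 'Yes', 0 for 'No'.
target = (toy["Attrition"] == "Yes").astype(int)

# One-hot encode the categorical column; numeric columns pass through unchanged.
features = pd.get_dummies(
    toy.drop(columns="Attrition"), columns=["Department"], dtype=int
)

print(target.tolist())
print(features.columns.tolist())
print(features.values)
```

Each category becomes its own 0/1 column, so the models never see an artificial ordering between departments.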

When I looked at the income/rate values for the employees, I noticed that there really wasn't a consistent relationship between them. One would expect DailyRate to be about 8 * HourlyRate and MonthlyRate to be about 21.67 * DailyRate (21.67 being the average number of working days per month), all of this not yet taking overtime into account. When I looked into this further, some interesting results were found.
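
The expected relationships can be written out as plain arithmetic. This is a sketch under the assumptions stated above (8-hour days, 5-day weeks, 52 weeks a year, 26 biweekly pay periods of 80 hours); the hourly rate of 50 is a hypothetical value chosen for illustration.

```python
HOURS_PER_DAY = 8
WORKDAYS_PER_YEAR = 52 * 5                     # 260 working days
workdays_per_month = WORKDAYS_PER_YEAR / 12    # roughly 21.67

hourly = 50.0                                  # hypothetical hourly rate
daily = hourly * HOURS_PER_DAY                 # expected DailyRate
monthly = daily * workdays_per_month           # expected MonthlyRate

# Pay per biweekly period, computed two ways: from the monthly figure
# (12 months spread over 26 periods) and from the hourly figure (80 hours).
per_period_from_monthly = monthly * 12 / 26
per_period_from_hourly = hourly * 80
ratio = per_period_from_monthly / per_period_from_hourly

print(round(workdays_per_month, 2))
print(daily, round(monthly, 2))
print(round(ratio, 6))
```

If the columns were consistent with each other, the three ratios DailyRate / HourlyRate, MonthlyRate / DailyRate and the pay-period ratio would cluster around 8, 21.67 and 1 respectively.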

In [None]:
#Visualize the 'questionable' income correlations
f, ((p1, p2), (p3, p4)) = plt.subplots(2,2,figsize=(12,10))
p1.hist(dly_rte / hly_rte, bins=12, edgecolor='k')
p1.set_xlabel('Hours Needed for Daily Rate')
p2.hist(mnth_rte / dly_rte, edgecolor='k')
p2.set_xlabel('Days Needed for Monthly Rate')
p3.hist((mnth_rte * 12 / 26) / (hly_rte * 80), bins=30, edgecolor='k')
p3.set_xlabel('Monthly rate normalized against 80 hour pay period')
p4.hist(mnth_inc/mnth_rte, bins=30, edgecolor='k')
p4.set_xlabel('Ratio of Monthly Income to Monthly Rate')
plt.show()

**Note: I do realize that the data is synthetic and therefore not truly representative of real data.**

Firstly, I assumed all rate/income data was gross income.

From the charts we see some interesting patterns (numbered by subplot):
1. The DailyRate / HourlyRate ratio implies employees working 30+ hours a day, some 40+.
2. The MonthlyRate / DailyRate ratio implies employees working 100+ days a month, some 150+.
3. This relationship makes sense. I'd expect to see an 'almost' normal distribution centered at 1, since the ratio (MonthlyRate x 12 months / 26 pay periods) / (HourlyRate x 80 hours per pay period) should equal 1.
4. There were employees making 2+ times their MonthlyRate as a MonthlyIncome, some 8+.
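
Beyond reading the histograms, the implausible rows can also be counted directly. A minimal sketch on made-up numbers (the `toy` frame is illustrative, not the real data), flagging rows whose implied hours per day exceed 24:

```python
import pandas as pd

# Made-up rates for illustration only.
toy = pd.DataFrame({
    "HourlyRate": [50, 40, 90, 35],
    "DailyRate": [400, 1300, 720, 1400],
})

# Hours per day that would be needed to earn DailyRate at HourlyRate.
implied_hours = toy["DailyRate"] / toy["HourlyRate"]

# More than 24 hours in a day is physically impossible.
impossible = implied_hours > 24

print(implied_hours.round(1).tolist())
print(int(impossible.sum()), "of", len(toy), "rows are impossible")
```

The same comparison could be applied to the saved `dly_rte` and `hly_rte` series to see what share of employees fall into each tail.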

I decided to train my models with only one rate/income feature included at a time (HourlyRate, then DailyRate, then MonthlyRate, then MonthlyIncome). Oddly enough, I got the best accuracy when all rate/income data was included in training. This puzzles me, but since the data is synthetic, the inherent randomness of the information could somehow all work itself out. Who knows.
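
One way to organize that experiment is to build the list of column subsets first and then score each one. The helper below is a hypothetical sketch (the name `feature_sets` and the short column list are assumptions, not the code I actually ran); it only builds the subsets and leaves the model fitting out.

```python
INCOME_COLS = ['HourlyRate', 'DailyRate', 'MonthlyRate', 'MonthlyIncome']

def feature_sets(all_columns, income_cols=INCOME_COLS):
    """Map an experiment name to the list of columns used in that run."""
    other = [c for c in all_columns if c not in income_cols]
    sets = {'all income columns': list(all_columns)}
    for keep in income_cols:
        # Keep every non-income column plus exactly one income column.
        sets['only ' + keep] = other + [keep]
    return sets

# Shortened, illustrative column list.
columns = ['Age', 'HourlyRate', 'DailyRate', 'MonthlyRate',
           'MonthlyIncome', 'JobLevel']

experiments = feature_sets(columns)
for name, cols in experiments.items():
    print(name, '->', cols)
```

Each subset could then be selected from the dataframe and passed to `cross_val_score` with the same classifier and folds, so that only the features differ between runs.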


Finally, I chose my models (KNN, RF and AdaBoost) and trained them. I also compared them against an ANN in Keras.

In [None]:
#Choose our classifiers
rfc = RandomForestClassifier(n_estimators=1000, random_state=0, max_features=.1, max_depth=15)
ada = AdaBoostClassifier(random_state=0)
knn = KNeighborsClassifier()

rfc_accuracy = cross_val_score(rfc, data, attrition, cv=5)
ada_accuracy = cross_val_score(ada, data, attrition, cv=5)
knn_accuracy = cross_val_score(knn, data, attrition, cv=5)

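#Create an ANN using Keras and Tensorflow backend
#Note: keras.wrappers.scikit_learn has been removed from recent Keras releases;
#the separate scikeras package provides a replacement KerasClassifier
from keras.wrappers.scikit_learn import KerasClassifier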
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

def ANN_Classifier():
    cl = Sequential()
    cl.add(Dense(20, activation='relu', kernel_initializer='random_normal',  input_shape=(data.shape[1],)))
    cl.add(Dropout(0.1))
    cl.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
    
    adm = Adam()
    
    cl.compile(optimizer=adm, loss='binary_crossentropy', metrics=['accuracy'])
    return cl

ann = KerasClassifier(build_fn=ANN_Classifier, batch_size=100, epochs=150, verbose=0)
ann_accuracy = cross_val_score(ann, data, attrition, cv=20)

The best accuracy across the cross-validation folds for each model was as follows:

In [None]:
print('Max RandomForest Accuracy: {:.4f}'.format(rfc_accuracy.max()))
print('Max AdaBoost Accuracy: {:.4f}'.format(ada_accuracy.max()))
print('Max KNN Accuracy: {:.4f}'.format(knn_accuracy.max()))
print('Max ANN Accuracy: {:.4f}'.format(ann_accuracy.max()))
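
Since the maximum over folds reports only the single best split, it may also be worth looking at the mean and spread. A minimal sketch using the standard library; `summarize` is a hypothetical helper and the fold scores are made-up placeholders, not results from this kernel.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation, max) of per-fold scores."""
    scores = list(scores)
    return statistics.mean(scores), statistics.stdev(scores), max(scores)

# Placeholder fold scores for illustration only (NOT results from this kernel).
example_scores = [0.80, 0.85, 0.90, 0.85, 0.80]

mean, std, best = summarize(example_scores)
print('Mean: {:.4f} +/- {:.4f} (max {:.4f})'.format(mean, std, best))
```

For the NumPy arrays returned by `cross_val_score`, the built-in `.mean()` and `.std()` methods give the same kind of summary directly (note that NumPy's `.std()` defaults to the population standard deviation).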

I welcome any questions, comments, feedback or advice. It helps me learn and improve.