A simple EDA and classification notebook.

### Import the necessary modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
plt.style.use('ggplot')

#### Load the data into a dataframe

In [None]:
df = pd.read_csv("../input/HR_comma_sep.csv", delimiter=',')
df.info()

### Basic EDA

In [None]:
df.describe()

In [None]:
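# Restrict to numeric columns; recent pandas versions raise on the string columns 'sales' and 'salary'
corr = df.select_dtypes(include='number').corr()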
corr

In [None]:
sns.heatmap(corr)

In [None]:
f, axes = plt.subplots(ncols=3,figsize=(17,6))

sns.countplot(x='salary',hue='left',ax=axes[0],data=df)
sns.countplot(x='promotion_last_5years',hue='left',ax=axes[1],data=df)
sns.countplot(x='sales',hue='left',ax=axes[2],data=df)
plt.xticks(rotation=90)
plt.xlabel('Department')

* More employees with a low salary left than employees with a medium or high salary.
* The sales department saw the highest number of employees leave the company, followed by the technical and support departments.
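
Raw counts can mislead when the groups differ in size, so it may help to look at the share of leavers per group as well. A minimal sketch on made-up toy data (the `toy` frame and the `leave_rate` name are illustrative assumptions, not part of the dataset); on the real frame the same idea would be `df.groupby('salary')['left'].mean()`.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT taken from HR_comma_sep.csv.
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     1,     0,     1,        0,        0,      0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the share of leavers per group.
leave_rate = toy.groupby("salary")["left"].mean()
print(leave_rate)
```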

In [None]:
f, axes = plt.subplots(ncols=3,figsize=(17,6))

a0 = sns.countplot(x='number_project',hue='left',ax=axes[0],data=df)
a0.set_title('Number of Projects Completed')
a1 = sns.countplot(x='Work_accident',hue='left',ax=axes[1],data=df)
a1.set_title('Work Accident (Y/N)')
a2 = sns.countplot(x='time_spend_company',hue='left',ax=axes[2],data=df)
a2.set_title('Number of Years at the Company')

* People with 2 projects left the company at an alarming rate. From 3 projects onward, the number of people leaving trends upward.
* A large portion of the employees left the company after spending 3 years there. There was a downward trend with reasonable numbers up to the sixth year.

In [None]:
sns.barplot(x='sales',y='satisfaction_level',hue='left',data=df)
plt.xlabel('Department')
plt.xticks(rotation=90)
plt.title('Satisfaction Level in Each Department')

In [None]:
left_yes = df[df['left'] == 1]
left_no = df[df['left'] == 0]
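# distplot is deprecated in seaborn >= 0.11; histplot with kde=True gives the same histogram-plus-density view
sns.histplot(left_no.satisfaction_level,kde=True,stat='density',label='0')
sns.histplot(left_yes.satisfaction_level,kde=True,stat='density',label='1')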
plt.legend(title='left',loc='best')

* Employees with lower job satisfaction left the organization across all the departments.
* A few employees with higher job satisfaction also left.

In [None]:
sns.barplot(x='sales',y='average_montly_hours',hue='left',data=df)
plt.xlabel('Department')
plt.xticks(rotation=90)
plt.title('Average hours clocked by employees in each department')

* People clocking higher hours left the organization across all the departments.

In [None]:
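# distplot is deprecated in seaborn >= 0.11; histplot with kde=True gives the same histogram-plus-density view
sns.histplot(left_no.average_montly_hours,kde=True,stat='density',label='0')
sns.histplot(left_yes.average_montly_hours,kde=True,stat='density',label='1')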
plt.legend(title='left',loc='best')

* However, there is more to it. The distribution for employees who left has two peaks: people clocking around 150 hours and people clocking around 250 hours left in higher numbers than others.
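
One way to check a two-peak pattern numerically is to bin the hours and compute the share of leavers per bin. The sketch below uses made-up toy rows, and the bin edges (100, 175, 225, 325) are arbitrary assumptions chosen for illustration; `hours_band` and `rate_by_band` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT taken from HR_comma_sep.csv.
toy = pd.DataFrame({
    "average_montly_hours": [140, 150, 155, 200, 210, 220, 250, 260, 270],
    "left":                 [1,   1,   0,   0,   0,   0,   1,   1,   0],
})

# Assumed bin edges: a low band, a middle band and a high band of monthly hours.
bins = [100, 175, 225, 325]
toy["hours_band"] = pd.cut(toy["average_montly_hours"], bins=bins)

# Share of leavers in each band; a dip in the middle band reflects the two peaks.
rate_by_band = toy.groupby("hours_band", observed=True)["left"].mean()
print(rate_by_band)
```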

In [None]:
sns.barplot(x='sales',y='last_evaluation',hue='left',data=df)
plt.xlabel('Department')
plt.xticks(rotation=90)
plt.title('Last evaluation')

* In *RandD*, people with higher scores left the company.
* This was also the case (although marginally) in the other departments except *accounting, hr* and *marketing*.

### Classification

#### Data preprocessing

In [None]:
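lst = ['sales','technical','support','IT','product_mng','marketing','RandD','accounting','hr','management']

# Map each department name to its position in the list; assigning the result back
# avoids the chained inplace replace, which newer pandas versions warn about
df['sales'] = df['sales'].map({pos: i for i, pos in enumerate(lst)})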
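
The cell above turns each department into an integer code, which gives the departments an arbitrary order. A common alternative for unordered categories is one-hot encoding. This is only a sketch on a toy column (the `toy` frame, the `dept` prefix and the `one_hot` name are assumptions for illustration); it is not what the rest of this notebook uses.

```python
import pandas as pd

# Toy column for illustration only; the real column has ten departments.
toy = pd.DataFrame({"sales": ["sales", "technical", "IT", "sales"]})

# One indicator column per department; each row has exactly one column set.
one_hot = pd.get_dummies(toy["sales"], prefix="dept")
print(one_hot)
```

Tree-based models usually cope with integer codes, while linear models and SVMs tend to be more sensitive to the implied ordering.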

In [None]:
df.salary.value_counts()

In [None]:
lst = ['low','medium','high']

# Ordinal encoding: low=0, medium=1, high=2
df['salary'] = df['salary'].map({sal: i for i, sal in enumerate(lst)})

In [None]:
df.info()

In [None]:
x = df.drop('left',axis=1)
y = df.left
# y is kept 1-D: scikit-learn classifiers expect a target of shape (n_samples,)

* Split the data into train (70%), dev (15%) and test (15%) sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
x_dev,x_test,y_dev,y_test = train_test_split(x_test,y_test,test_size=0.5)
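
The split above is random on every run and does not preserve the class ratio. A hedged variant, assuming reproducibility and balanced class ratios are wanted, passes `random_state` and `stratify` to `train_test_split`. The toy arrays and the short variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 200 rows, 25% positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 150 + [1] * 50)

# 70% train, then the remaining 30% is halved into dev and test.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)
X_dv, X_te, y_dv, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_tr), len(X_dv), len(X_te))
print(y_tr.mean(), y_dv.mean(), y_te.mean())
```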

* Scale the data.

In [None]:
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_dev = scaler.transform(x_dev)
x_test = scaler.transform(x_test)
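
The scaler is fitted on the training set only and then reused for the other sets, so no information from dev or test leaks into the preprocessing. A small sketch on toy arrays (the names `toy_scaler`, `train_part` and `new_part` are assumptions for illustration) shows the effect: the training data ends up with zero mean and unit variance, while new data is shifted and scaled by the same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
rng = np.random.default_rng(0)
train_part = rng.normal(loc=50, scale=10, size=(100, 2))
new_part = rng.normal(loc=50, scale=10, size=(20, 2))

# Learn mean and standard deviation from the training part only.
toy_scaler = StandardScaler().fit(train_part)
train_scaled = toy_scaler.transform(train_part)
new_scaled = toy_scaler.transform(new_part)  # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(new_scaled.mean(axis=0), new_scaled.std(axis=0))
```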

#### Decision Tree

In [None]:
clf = DecisionTreeClassifier(min_samples_split=3)
clf.fit(x_train,y_train)
pred = clf.predict(x_train)
print('Training Report\n {}'.format(classification_report(y_train,pred)))
print('Training accuracy: {:.3f}'.format(accuracy_score(y_train,pred)))

In [None]:
pred = clf.predict(x_dev)
print('Dev set Report\n {}'.format(classification_report(y_dev,pred)))
print('Dev set accuracy: {:.3f}'.format(accuracy_score(y_dev,pred)))
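# ROC AUC is computed from scores, so pass the positive-class probability rather than hard labels
print('AUC: {:.3f}'.format(roc_auc_score(y_dev,clf.predict_proba(x_dev)[:,1])))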

#### Random Forest

In [None]:
clf = RandomForestClassifier(n_estimators=300)
clf.fit(x_train,y_train)
pred = clf.predict(x_train)
print('Training Report\n {}'.format(classification_report(y_train,pred)))
print('Training accuracy: {:.3f}'.format(accuracy_score(y_train,pred)))

In [None]:
pred = clf.predict(x_dev)
print('Dev set Report\n {}'.format(classification_report(y_dev,pred)))
print('Dev set accuracy: {:.3f}'.format(accuracy_score(y_dev,pred)))
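# ROC AUC is computed from scores, so pass the positive-class probability rather than hard labels
print('AUC: {:.3f}'.format(roc_auc_score(y_dev,clf.predict_proba(x_dev)[:,1])))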

#### Support Vector Machine

In [None]:
clf = SVC(C=500)
clf.fit(x_train,y_train)
pred = clf.predict(x_train)
print('Training Report\n {}'.format(classification_report(y_train,pred)))
print('Training accuracy: {:.3f}'.format(accuracy_score(y_train,pred)))

In [None]:
pred = clf.predict(x_dev)
print('Dev set Report\n {}'.format(classification_report(y_dev,pred)))
print('Dev set accuracy: {:.3f}'.format(accuracy_score(y_dev,pred)))
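# SVC has no predict_proba by default; decision_function gives the scores ROC AUC needs
print('AUC: {:.3f}'.format(roc_auc_score(y_dev,clf.decision_function(x_dev))))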

* The Random Forest classifier performed best on the dev set, so it is chosen for the final evaluation.
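
Selecting a model on the dev set can also be written as a loop over candidates, which keeps the comparison in one place. The following is only a sketch on synthetic data from `make_classification`; the `candidates` and `dev_auc` names, the smaller forest size and the fixed seeds are assumptions for illustration and do not reproduce the numbers from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_dv, y_tr, y_dv = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

candidates = {
    "tree": DecisionTreeClassifier(min_samples_split=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each candidate on the training part and score it on the dev part.
dev_auc = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_dv)[:, 1]  # positive-class probability
    dev_auc[name] = roc_auc_score(y_dv, scores)

best = max(dev_auc, key=dev_auc.get)
print(dev_auc, best)
```

The test set stays untouched until one model has been picked, which is what keeps the final test score an honest estimate.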

#### Classifying the test set

In [None]:
clf = RandomForestClassifier(n_estimators=300)
clf.fit(x_train,y_train)

In [None]:
pred = clf.predict(x_test)
print('Test set Report\n {}'.format(classification_report(y_test,pred)))
print('Test set accuracy: {:.3f}'.format(accuracy_score(y_test,pred)))
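# ROC AUC is computed from scores, so pass the positive-class probability rather than hard labels
print('AUC: {:.3f}'.format(roc_auc_score(y_test,clf.predict_proba(x_test)[:,1])))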