Importing the required libraries

In [29]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import cluster, svm, neighbors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Glimpse of the data

In [10]:
data = pd.read_csv('HRData.csv')
data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low
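Checking the column dtypes is a quick way to confirm which columns hold text. The sketch below retypes a few of the rows printed above (the name `sample` is only for illustration) and lists the non-numeric columns with `select_dtypes`.

```python
import pandas as pd

# A hand-typed subset of the rows and columns printed by data.head().
sample = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11],
    'number_project': [2, 5, 7],
    'left': [1, 1, 1],
    'sales': ['sales', 'sales', 'sales'],
    'salary': ['low', 'medium', 'medium'],
})

# Keep only the columns whose dtype is not numeric.
textColumns = sample.select_dtypes(exclude='number').columns.tolist()
print(textColumns)
```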


As we can see, the 'sales' and 'salary' columns hold non-numeric data. We need to convert this text to numeric values before training. The following function assigns an integer key to each distinct text value in a column.

In [11]:
def convertTextToNumeric(data):
    # Replace every non-numeric column with integer codes (modifies data in place).
    for column in data.columns.values:
        if data[column].dtype != np.int64 and data[column].dtype != np.float64:
            # Sort the distinct values so the text-to-code mapping is the same on every run;
            # iterating over a plain set of strings can give a different order each time.
            uniqueElements = sorted(set(data[column].values.tolist()))
            convertedValues = {element: num for num, element in enumerate(uniqueElements)}

            data[column] = data[column].map(convertedValues)

    return data
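Arbitrary integer codes impose an order on categories that may not have one. As an alternative sketch (not what this notebook does), 'salary' can be mapped to an explicit low < medium < high order and 'sales' can be one-hot encoded with `pd.get_dummies`. The 'high' level and the 'hr' department below are assumed values; only 'low', 'medium' and 'sales' appear in the rows shown earlier.

```python
import pandas as pd

# Toy frame; 'hr' and 'high' are assumed category values for illustration.
toy = pd.DataFrame({
    'sales': ['sales', 'hr', 'sales', 'hr'],
    'salary': ['low', 'medium', 'high', 'low'],
})

# Ordinal feature: spell out the order instead of relying on arbitrary codes.
salaryOrder = {'low': 0, 'medium': 1, 'high': 2}
encoded = toy.copy()
encoded['salary'] = encoded['salary'].map(salaryOrder)

# Nominal feature: one indicator column per department.
encoded = pd.get_dummies(encoded, columns=['sales'])
print(encoded.columns.tolist())
```

One-hot columns avoid implying that one department is "greater" than another, which matters for distance-based models such as K-NN and SVM.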

Extracting the X and y columns from the given data.

In [16]:
data = convertTextToNumeric(data)

X = data.drop(['left'], axis=1)
y = data['left']

XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2)

X = np.array(X)
y = np.array(y)
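`train_test_split` shuffles randomly, so the accuracies reported later change from run to run. A minimal sketch, on made-up data, of fixing the split with `random_state` and keeping the class ratio the same in both parts with `stratify`. The seed 42 and the names `XDemo` and `yDemo` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 25 of them in the positive class.
XDemo = np.arange(200).reshape(100, 2)
yDemo = np.array([1] * 25 + [0] * 75)

# random_state makes the shuffle repeatable; stratify preserves the 25/75 ratio.
XTrainDemo, XTestDemo, yTrainDemo, yTestDemo = train_test_split(
    XDemo, yDemo, test_size=0.2, random_state=42, stratify=yDemo)

print(len(yTestDemo), int(yTestDemo.sum()))
```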

Let's train three different classifiers on our data set:
1. Support Vector Machine
2. K-Nearest Neighbors
3. Logistic Regression

- Training SVM

In [25]:
svmClassifier = svm.SVC()
svmClassifier.fit(XTrain, yTrain)
svmAccuracy = svmClassifier.score(XTest, yTest) * 100
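SVM with the default RBF kernel, like K-NN, works on distances between samples, so a wide-range column such as 'average_montly_hours' can dominate columns that lie between 0 and 1. A hedged sketch of standardizing the features inside a pipeline, on synthetic data from `make_classification`. The data, the factor 200 and the name `scaledSvm` are assumptions for illustration; whether scaling helps on the HR data has to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the HR features.
XDemo, yDemo = make_classification(n_samples=300, n_features=6, random_state=0)
XDemo[:, 0] *= 200  # stretch one column to mimic an hours-like scale

XTrainDemo, XTestDemo, yTrainDemo, yTestDemo = train_test_split(
    XDemo, yDemo, test_size=0.2, random_state=0)

# The scaler is fitted on the training part only, then applied before the SVM.
scaledSvm = make_pipeline(StandardScaler(), SVC())
scaledSvm.fit(XTrainDemo, yTrainDemo)
scaledSvmAccuracy = scaledSvm.score(XTestDemo, yTestDemo) * 100
print("Scaled SVM test accuracy = %.2f%%" % scaledSvmAccuracy)
```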

- Training K-NN

In [26]:
knn = neighbors.KNeighborsClassifier()
knn.fit(XTrain, yTrain)
knnAccuracy = knn.score(XTest, yTest) * 100

- Training Logistic Regression

In [33]:
logReg = LogisticRegression()
logReg.fit(XTrain, yTrain)
logRegAccuracy = logReg.score(XTest, yTest) * 100
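Each score above comes from one random 80/20 split. A sketch, again on synthetic data, of averaging accuracy over five folds with `cross_val_score` to get a steadier estimate. `max_iter=1000` is an assumed setting that gives the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y.
XDemo, yDemo = make_classification(n_samples=300, n_features=6, random_state=0)

# Five train/test rotations; each fold yields one accuracy value.
foldScores = cross_val_score(LogisticRegression(max_iter=1000), XDemo, yDemo, cv=5)
print("Mean accuracy = %.2f%% (std %.2f)" % (foldScores.mean() * 100, foldScores.std() * 100))
```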

In [40]:
Analysis = {}
Analysis['Models'] = ['SVM', 'KNN', 'Log Reg.']
# score() returns accuracy (the share of correct predictions), not error.
Analysis['Training Accuracy (%)'] = [svmClassifier.score(XTrain, yTrain) * 100, knn.score(XTrain, yTrain) * 100,
                                     logReg.score(XTrain, yTrain) * 100]
Analysis['Testing Accuracy (%)'] = [svmAccuracy, knnAccuracy, logRegAccuracy]

Analysis = pd.DataFrame(Analysis)
Analysis

Unnamed: 0,Models,Testing Accuracy (%),Training Accuracy (%)
0,SVM,95.4,96.199683
1,KNN,93.266667,95.432953
2,Log Reg.,78.933333,79.63997
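Accuracy alone can hide how well the leavers themselves are detected. A small sketch with hand-made labels (not the notebook's results) showing how `confusion_matrix` and `recall_score` separate the two kinds of mistakes.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hand-made labels: 1 = left, 0 = stayed.
yTrue = [0, 0, 0, 0, 1, 1]
yPred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(yTrue, yPred)

# Recall for class 1: the share of actual leavers that were caught.
leaverRecall = recall_score(yTrue, yPred)

print(matrix)
print(leaverRecall)
```

Here four of six predictions are correct, yet only one of the two leavers is found, which is the figure that matters when the goal is to spot who will leave.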


We need to predict which valuable employees will leave next. To do that, we take the employees who have not left yet (left = 0) and count how many of them each trained classifier predicts as leaving (left = 1).

- Prediction via SVM

In [66]:
# Employees who are still with the company (left == 0).
stayed = X[y == 0]

# Predict for all of them at once and count the ones flagged as leaving.
numOfEmployees = int((svmClassifier.predict(stayed) == 1).sum())

print("Number of Employees that are likely to leave the job = ", numOfEmployees)

Number of Employees that are likely to leave the job =  335


- Prediction via KNN

In [67]:
# Employees who are still with the company (left == 0).
stayed = X[y == 0]

# Predict for all of them at once and count the ones flagged as leaving.
numOfEmployees = int((knn.predict(stayed) == 1).sum())

print("Number of Employees that are likely to leave the job = ", numOfEmployees)

Number of Employees that are likely to leave the job =  537


- Prediction via Logistic Regression

In [68]:
# Employees who are still with the company (left == 0).
stayed = X[y == 0]

# Predict for all of them at once and count the ones flagged as leaving.
numOfEmployees = int((logReg.predict(stayed) == 1).sum())

print("Number of Employees that are likely to leave the job = ", numOfEmployees)

Number of Employees that are likely to leave the job =  836
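The counts above say how many employees are flagged, not which ones are most at risk. A sketch of ranking the current employees by predicted probability of leaving with `predict_proba`, on synthetic data. `topK` and the other names are hypothetical. Note that `svm.SVC` only offers `predict_proba` when it is created with `probability=True`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y (1 = left, 0 = stayed).
XDemo, yDemo = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(XDemo, yDemo)

# Row positions of the employees who have not left.
stayedIdx = np.where(yDemo == 0)[0]

# Column 1 of predict_proba is the probability of class 1 (leaving).
leaveProb = model.predict_proba(XDemo[stayedIdx])[:, 1]

# Highest-risk rows first.
topK = 5
order = np.argsort(leaveProb)[::-1][:topK]
for idx, prob in zip(stayedIdx[order], leaveProb[order]):
    print("row %d: P(leave) = %.3f" % (idx, prob))
```

Because the model here is scored on rows it was trained on, the probabilities are optimistic; scoring held-out rows would give a more honest ranking.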
