**Introduction: The objective of this notebook is to predict whether an employee is going to resign.**

Importing required libraries

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
from sklearn.ensemble import RandomForestRegressor
from imblearn.over_sampling import SMOTE

Define a helper function that plots the ROC curves for the analyses below

In [None]:
def ROC_GEN(Title, Labels, Output):
    # Compute the ROC curve and its area from the true labels and the predicted scores
    fpr, tpr, _ = roc_curve(Labels, Output)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    # Diagonal reference line: the expected curve of a random classifier
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(Title)
    plt.legend(loc="lower right")
    plt.show()
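
As a rough illustration of what the area in the legend measures: the AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The sketch below computes it that way in plain Python. `pairwise_auc` and the toy labels and scores are hypothetical, introduced only for this example; the notebook itself relies on `roc_curve` and `auc` from scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a perfect ranking.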

Read the data from a flat file and obtain its dimensions

In [None]:
data = pd.read_csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv")

rows = data.shape[0]
columns = data.shape[1]
print("The dataset contains {0} rows and {1} columns".format(rows, columns))


Let's take a look at the header and the type of the data columns.

In [None]:
data.head(1)

So the data is composed of both numerical and string fields. Now, let's compute some statistics on the data. The mean of each numerical field gives a sense of its typical scale.

In [None]:
data.mean(numeric_only=True)

Now, we compare some of the attributes (e.g., "Age", "Education" and "JobLevel") with each other.

In [None]:
sns.pairplot(data[["Age", "Education", "JobLevel"]])
plt.show()

Some general information can be extracted from the above comparison:

1. Older employees tend to have a higher education and job level than younger ones.
2. There is no considerable relation between education and job level.

In the next step, we cluster the numerical data with K-means and plot the clusters on the first two principal components obtained with PCA.



In [None]:
kmeans_model = KMeans(n_clusters=5, random_state=1)
# Keep only the numeric columns that have no missing values
good_columns = data.select_dtypes(include="number").dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_
# Project onto two principal components so the clusters can be plotted
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1],c=labels)
plt.show()

Then, we make a copy of the data and apply our changes to the copy so that the original data is not affected. Since we want to use the "Attrition" labels in numerical processing, we have to convert them to numbers: "1" for "Yes" and "0" for "No". We then divide the data into a train set and a test set for further processing.

In [None]:
data_copy = data.copy()
data_copy["Attrition"] = data_copy["Attrition"].replace(["Yes","No"],[1,0]);
train = data_copy.sample(frac=0.5, random_state=1)
test = data_copy.loc[~data_copy.index.isin(train.index)]
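
The split above takes a random sample of the rows as the train set and every remaining row as the test set. The following plain-Python sketch shows the same idea on row indices. `split_indices` is a hypothetical stand-in written for this example, not how pandas implements `sample`.

```python
import random


def split_indices(n_rows, train_frac, seed):
    """Return (train_idx, test_idx): a random sample of rows and its complement."""
    rng = random.Random(seed)
    all_idx = list(range(n_rows))
    train_idx = sorted(rng.sample(all_idx, round(n_rows * train_frac)))
    train_set = set(train_idx)
    # Every row that was not sampled for training goes to the test set
    test_idx = [i for i in all_idx if i not in train_set]
    return train_idx, test_idx


train_idx, test_idx = split_indices(n_rows=10, train_frac=0.5, seed=1)
print(train_idx, test_idx)
```

The two index lists never overlap and together cover every row, which is what the `~data_copy.index.isin(train.index)` filter guarantees in the cell above.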

Let's take a closer look at the data by inspecting the first 20 records.

In [37]:
data.head(20)

As can be seen, some attributes such as "EmployeeCount" and "StandardHours" do not vary across records, so we can ignore them in the later processing. In addition, since text fields cannot be used directly in numerical processing, we ignore them as well for now. The rest of the attributes are selected for the processing.
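
Constant attributes can also be found programmatically rather than by eye. The sketch below does this in plain Python on a small table. `constant_columns` and the toy values are hypothetical, made up for this example. With pandas, `data.nunique()` gives the number of distinct values per column and can be used the same way.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical.

    `table` maps a column name to its list of values (a stand-in for a DataFrame).
    """
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Toy values for illustration only
toy_table = {
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
}
constant = constant_columns(toy_table)
print(constant)
```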

In [38]:
Effective_Columns = ["Age", "DailyRate", "DistanceFromHome", "Education", "MonthlyIncome","MonthlyRate" ,"NumCompaniesWorked",
"PercentSalaryHike","PerformanceRating","RelationshipSatisfaction",
"StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany",
"YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager",
"EmployeeNumber","EnvironmentSatisfaction","HourlyRate","JobInvolvement","JobLevel","JobSatisfaction"]

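Now, we build and train a random forest for our classification problem. We fit a `RandomForestRegressor` on the 0/1 labels, so its predictions are continuous scores that can be fed to the ROC function.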

In [39]:
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[Effective_Columns], train["Attrition"])
predictions = rf.predict(test[Effective_Columns])

To visualize the results, we draw the ROC curve using the function defined earlier. First, we have to convert the test labels to a plain list.

In [40]:
# Test labels as a plain list, in the same row order as the predictions
test_label = test["Attrition"].tolist()

ROC_GEN('RF1-50%', test_label, predictions)

We got an AUC (Area Under the Curve) of 70 percent. Let's change the random forest settings (a larger `min_samples_leaf`) to see what happens.

In [41]:
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=10)
rf.fit(train[Effective_Columns], train["Attrition"])
predictions = rf.predict(test[Effective_Columns])

ROC_GEN('RF2-50%', test_label, predictions)

No improvement! We need to do better. Let's try an SVM (Support Vector Machine) with two kernels, "linear" and "RBF". First, we need some preprocessing for type casting and for preparing the train and test sets.

In [42]:
data = data_copy[Effective_Columns]
# Target labels as a plain list
dout = data_copy["Attrition"].tolist()

# Feature matrix as a list of rows.
# Note: the loop below fills only the last 9 feature columns; the other columns stay at 0.
n_features = data.shape[1]
din = [[0] * n_features for _ in range(data.shape[0])]
for i in range(data.shape[0]):
    for j in range(n_features - 9, n_features):
        din[i][j] = data[Effective_Columns[j]][i]

X_train, X_test, y_train, y_test = train_test_split(din, dout, test_size=.5,
                                                    random_state=0)

First, we try a linear SVM.

In [43]:
random_state = np.random.RandomState(0)
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('LinearSVM-50%', y_test, y_score)

So the results are similar to RF. Let's change the kernel function to "RBF".

In [44]:
classifier = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('RBFSVM-50%', y_test, y_score)
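
For reference, the RBF kernel measures the similarity of two rows as exp(-gamma * squared distance). The sketch below evaluates it in plain Python on two made-up rows, once on raw values and once after rescaling one feature. The rows, the `gamma` value and the scale of 100 are assumptions chosen for illustration. Whether feature scale explains the RBF result above has not been verified here.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two made-up rows: [hourly rate, job level]; values are illustrative only
row_a = [94.0, 2.0]
row_b = [61.0, 2.0]
raw_similarity = rbf_kernel(row_a, row_b, gamma=0.5)

# The same rows after dividing the first feature by an assumed scale of 100
scaled_a = [row_a[0] / 100.0, row_a[1]]
scaled_b = [row_b[0] / 100.0, row_b[1]]
scaled_similarity = rbf_kernel(scaled_a, scaled_b, gamma=0.5)

print(raw_similarity, scaled_similarity)
```

On the raw values the similarity is practically zero, while on the rescaled values it is close to one, so an RBF kernel can be sensitive to the units of the features.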

The performance dropped. In the next step, we increase the share of training data (from 50% to 80%) and repeat the RF and SVM experiments.

In [45]:
train = data_copy.sample(frac=0.8, random_state=1)
test = data_copy.loc[~data_copy.index.isin(train.index)]

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[Effective_Columns], train["Attrition"])
predictions = rf.predict(test[Effective_Columns])

# Test labels as a plain list, in the same row order as the predictions
test_label = test["Attrition"].tolist()

ROC_GEN('RF-80%', test_label, predictions)


X_train, X_test, y_train, y_test = train_test_split(din, dout, test_size=.2,
                                                    random_state=0)
######################################## SVM LINEAR 
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('LinearSVM-80%', y_test, y_score)


######################################## SVM RBF 
classifier = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('RBFSVM-80%', y_test, y_score)

For RF and linear SVM, the performance increased by 5-6%, but it is still poor. Let's extend the feature (attribute) vector by converting the text-based attributes to numbers and adding them to the previous feature vector.

In [46]:
Text_Baesd_columns = ["BusinessTravel","Department", "EducationField","OverTime","Gender","JobRole","MaritalStatus"]

# Work on a copy; each text category below is replaced by an integer code
data_corrected = data_copy.copy()


data_corrected["BusinessTravel"] = data_copy["BusinessTravel"].replace(["Travel_Rarely","Travel_Frequently","Non-Travel"],[0,1,2]);

data_corrected["MaritalStatus"] = data_copy["MaritalStatus"].replace(["Single","Married","Divorced"],[0,1,2]);
data_corrected["Gender"] = data_copy["Gender"].replace(["Female","Male"],[0,1]);

data_corrected["Department"] = data_copy["Department"].replace(["Sales","Research & Development","Human Resources"],[0,1,2]);

data_corrected["EducationField"] = data_copy["EducationField"].replace(["Life Sciences","Other","Medical","Marketing","Technical Degree","Human Resources"],[0,1,2,3,4,5]);

data_corrected["OverTime"] = data_copy["OverTime"].replace(["Yes","No"],[0,1]);

data_corrected["JobRole"] = data_copy["JobRole"].replace(["Sales Executive","Sales Representative","Research Scientist","Laboratory Technician","Manufacturing Director","Healthcare Representative","Manager","Research Director","Human Resources"],[0,1,2,3,4,5,6,7,8]);

Corrected_columns = Effective_Columns + Text_Baesd_columns 
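
The `replace` calls above assign each category an integer code. The plain-Python sketch below shows that mapping next to a one-hot alternative, which uses one 0/1 indicator per category and therefore implies no ordering between categories. Both helper functions are hypothetical and written for this example. The one-hot variant is not what this notebook uses.

```python
def integer_encode(values, categories):
    """Map each category to its position in `categories`."""
    code = {cat: i for i, cat in enumerate(categories)}
    return [code[v] for v in values]


def one_hot_encode(values, categories):
    """Alternative: one 0/1 indicator per category, implying no ordering."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]


marital_categories = ["Single", "Married", "Divorced"]
sample_values = ["Married", "Single", "Divorced"]

codes = integer_encode(sample_values, marital_categories)
indicators = one_hot_encode(sample_values, marital_categories)
print(codes)
print(indicators)
```

Integer codes keep the feature vector short, but they make "Divorced" (2) look numerically further from "Single" (0) than "Married" (1) is, which distance-based models such as SVMs will pick up.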

Now, with the new data, we retrain our models and test them again.

In [47]:
######################################## Classification with new columns
train = data_corrected.sample(frac=0.5, random_state=1)
test = data_corrected.loc[~data_corrected.index.isin(train.index)]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[Corrected_columns], train["Attrition"])
predictions = rf.predict(test[Corrected_columns])

# Test labels as a plain list, in the same row order as the predictions
test_label = test["Attrition"].tolist()

ROC_GEN('Corrected-Data-RF1-50%', test_label, predictions)

######################################## RANDOM FOREST2
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=10)
rf.fit(train[Corrected_columns], train["Attrition"])
predictions = rf.predict(test[Corrected_columns])

ROC_GEN('Corrected-Data-RF2-50%', test_label, predictions)

#---------------SVM PREPARE 
data = data_copy[Corrected_columns]
# Target labels as a plain list
dout = data_copy["Attrition"].tolist()

# Encoded features; note that the loop fills only the last 9 columns of din,
# and the other columns stay at 0.
data_input = data_corrected[Corrected_columns]
n_features = data.shape[1]
din = [[0] * n_features for _ in range(data.shape[0])]
for i in range(data.shape[0]):
    for j in range(n_features - 9, n_features):
        din[i][j] = data_input[Corrected_columns[j]][i]
    
    
    
X_train, X_test, y_train, y_test = train_test_split(din, dout, test_size=.5,
                                                    random_state=0)

######################################## SVM LINEAR 
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Corrected-Data-LinearSVM-50%', y_test, y_score)


######################################## SVM RBF 
classifier = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Corrected-Data-RBFSVM-50%', y_test, y_score)

######################################## TRAIN 80%

train = data_corrected.sample(frac=0.8, random_state=1)
test = data_corrected.loc[~data_corrected.index.isin(train.index)]

######################################## RANDOM FOREST1
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[Corrected_columns], train["Attrition"])
predictions = rf.predict(test[Corrected_columns])

# Test labels as a plain list, in the same row order as the predictions
test_label = test["Attrition"].tolist()

ROC_GEN('Corrected-Data-RF1-80%', test_label, predictions)


X_train, X_test, y_train, y_test = train_test_split(din, dout, test_size=.2,
                                                    random_state=0)
######################################## SVM LINEAR 
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Corrected-Data-LinearSVM-80%', y_test, y_score)


######################################## SVM RBF 
classifier = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Corrected-Data-RBFSVM-80%', y_test, y_score)

The best result so far is 78% ("Corrected-Data-RF1-50%"). In terms of the output labels, the data is not balanced, so we should balance it.

In [48]:
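# SVMSMOTE is the SVM-based SMOTE variant; in recent imbalanced-learn releases
# it takes the place of SMOTE(kind='svm'), and fit_resample the place of fit_sample.
from imblearn.over_sampling import SVMSMOTE

sm = [SVMSMOTE()]
for method in sm:
    X_res, y_res = method.fit_resample(din, dout)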
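
SMOTE balances the classes by creating synthetic minority samples, each placed on the line segment between a minority sample and one of its minority-class neighbours. The plain-Python sketch below shows only that interpolation step. `smote_like_sample` and the toy points are hypothetical, and the neighbour is given by hand rather than found with a nearest-neighbour search. The SVM variant used above also chooses its base points differently, so this is a simplification and not imbalanced-learn's implementation.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between `point` and `neighbor`."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [p + u * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(0)
# Two made-up minority-class rows with two features each
minority_a = [1.0, 10.0]
minority_b = [3.0, 20.0]
synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Because every synthetic row is derived from existing rows, resampling before the train/test split lets information from training rows reach the test set.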

Again, we perform our analysis with the balanced data.

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.2,
                                                    random_state=0)
######################################## SVM LINEAR 
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Balanced-Data-LinearSVM-80%', y_test, y_score)


######################################## SVM RBF 
classifier = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
ROC_GEN('Balanced-Data-RBFSVM-80%', y_test, y_score)

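# Resample the full encoded feature set as well. The result is not used by the
# model below, which is trained on the split created at the top of this cell.
for method in sm:
    X_res, y_res = method.fit_resample(data_corrected[Corrected_columns], data_corrected["Attrition"])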


rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

ROC_GEN('Balanced-Data-RF1-80%', y_test, predictions)

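Congratulations! We got an AUC of 94% using the balanced data and random forest. Note that SMOTE was applied before the train/test split, so the test set contains synthetic samples derived from the same data, and this figure is probably optimistic.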

I will be very glad to learn more from your comments.

Thanks for your attention!