# Employee Turnover / Employee Churn Using Machine Learning

## <p>To predict employee churn with machine learning, we use a dataset from <a href="https://www.kaggle.com/giripujar/hr-analytics">Kaggle</a>. A brief description of the data follows.</p>

## Data Description:

### The data consists of 10 variables (columns) and about 15,000 observations (rows). Each row represents an employee and holds the relevant values for each of the 10 variables. The columns are described below:

* <b>satisfaction_level (Numerical):</b> The employee's satisfaction level, between 0 and 1, where 0 is the lowest and 1 the highest.
* <b>last_evaluation (Numerical):</b> The employee's score in their most recent performance evaluation, between 0 and 1.
* <b>number_project (Numerical):</b> The number of projects the employee has worked on so far.
* <b>average_montly_hours (Numerical):</b> The average number of hours per month the employee spends at work.
* <b>time_spend_company (Numerical):</b> The number of years the employee has spent in the organization.
* <b>Work_accident (Numerical):</b> Whether the employee has had a work accident, where 1 means yes and 0 means no.
* <b>promotion_last_5years (Numerical):</b> Whether the employee was promoted in the last 5 years, where 1 means yes and 0 means no.
* <b>Department (Nominal Categorical):</b> The department in which the employee is or was working, for example sales, IT, or accounting. There are 10 departments in total.
* <b>salary (Ordinal Categorical):</b> The employee's salary band: low, medium, or high.
* <b>left (Numerical):</b> Whether the employee has left the organization, where 1 means the employee has left and 0 means the employee has stayed.

## Now we will build a basic understanding of the data by looking at each column

In [None]:
# Importing required libraries for the project

import pandas as pd # for data manipulation
import numpy as np  # for numerical calculations and statistical measures
import matplotlib.pyplot as plt # for plotting charts
import seaborn as sns # for statistical visualization charts
%matplotlib inline 
import warnings  # for controlling warnings
warnings.filterwarnings("ignore") # to suppress unwanted warnings

In [None]:
# Reading Dataset and loading into variable

employee_churn = pd.read_csv("../input/hr-analytics/HR_comma_sep.csv")
          
employee_churn.head()

In [None]:
# Structure of the data (rows, columns)
employee_churn.shape

In [None]:
# Checking data type and null values for each column
employee_churn.info()

In [None]:
employee_churn.head()

In [None]:
employee_churn.columns

### The `info()` output above shows that there are no null values in our dataset and that the data types are correct for all columns

In [None]:
# printing summary for numerical columns
employee_churn.describe()

### We can see that the average satisfaction level of employees is 0.6128, the average number of projects per employee is approximately 4, the average monthly hours worked is 201.0503, and the average time spent in the company is approximately 4 years.

In [None]:
# Printing the unique values of the categorical columns with their counts
print("Department:\n",employee_churn['Department'].value_counts().to_string())
print("\nSalary:\n",employee_churn['salary'].value_counts().to_string())

### Of the 10 departments, sales has the most employees and management the fewest. The salary column shows that most employees in the data have a low salary.


## Now we will explore the data by grouping the columns by our target column 'left' and by the categorical columns to generate some insight

In [None]:
# Printing number of churn and non-churn employees
print( "Employee Distribution:\n",employee_churn['left'].value_counts())

In [None]:
#visualizing Employee Distribution
sns.catplot(data=employee_churn,x='left',kind="count")

## The result above tells us that 3571 employees have left and 11428 employees have stayed.

In [None]:
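# Printing the mean of every numeric variable for each value of the target variable
# numeric_only=True skips the text columns (Department, salary), which cannot be averaged
print('Left Vs all numeric variable:\n\n',employee_churn.groupby('left').mean(numeric_only=True))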

## From the numbers above, we interpret that:

* People with a low satisfaction level leave the company more often, and people with a high satisfaction level stay.
* People with a lower salary leave the company more often, and people with a higher salary stay.
* People with a low promotion rate leave the company more often, and people with a high promotion rate stay.
* People who have worked more hours leave the company more often, and people who have worked fewer hours stay.
### All of these points are consistent with what we would expect in practice.
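The salary point cannot be read directly from `groupby('left').mean()`, because salary is a text column. A minimal sketch of how a churn rate per salary band could be computed is shown here on a tiny made-up frame; the values in `toy` are hypothetical and are not taken from the HR dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the HR dataset).
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     0,     0,     1,        0,        0,      0],
})

# The mean of a 0/1 column is the share of 1s, so this is the
# fraction of employees who left within each salary band.
churn_rate = toy.groupby("salary")["left"].mean()
print(churn_rate)
```

On the real data the same idea would be `employee_churn.groupby('salary')['left'].mean()`, run while the `salary` column is still present.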

In [None]:
# Printing the mean of every numeric feature for each department
print('Department Vs all numeric variable:\n\n',employee_churn.groupby('Department').mean(numeric_only=True))

In [None]:
# Printing the mean of every numeric feature for each salary band
print('Salary Vs all numeric variable:\n\n',employee_churn.groupby('salary').mean(numeric_only=True))

## Data Visualization

## Plotting histograms of all continuous features

In [None]:
employee_churn["satisfaction_level"].hist(bins=10, figsize=(5,5))
plt.xlabel("satisfaction_level")
plt.show()

In [None]:
employee_churn["last_evaluation"].hist(bins=10, figsize=(5,5))
plt.xlabel("last_evaluation")
plt.show()

In [None]:
employee_churn["average_montly_hours"].hist(bins=10, figsize=(5,5))
plt.xlabel("average_montly_hours")
plt.show()

### The plots above show the distribution of the continuous features, and we can say that our data is skewed.

## Visualizing all categorical features against the feature "left", that is, checking how each feature is distributed among people who left the company and people who stayed

In [None]:
# Plotting each feature against the target "left"

columns = ['number_project', 'time_spend_company', 'Work_accident',
           'promotion_last_5years', 'Department', 'salary']
# plt.figure() creates an empty canvas; plt.subplots() would add an extra axes behind the grid
plt.figure(figsize=(20, 30))
for i, col in enumerate(columns):
    plt.subplot(4, 2, i + 1)  # 4x2 grid; only the first 6 slots are used
    sns.countplot(x=col, data=employee_churn, hue='left')
    plt.xticks(rotation=90)
    plt.title("No. of employees by " + col)
plt.subplots_adjust(hspace=1.0)

### The plots above show a few important observations:

*   Department and salary are the two most important features affecting employee turnover. Salespeople are often said to change jobs frequently, and people in the technology-oriented departments (technical, support, IT) also tend to change jobs often. These two features can therefore play a vital role in predicting employee churn.
*   Employee engagement is another critical factor influencing whether an employee leaves. Employees with 3 to 5 projects are less likely to leave the company, while employees with fewer or more projects are more likely to leave.
*   Promotion also influences whether an employee leaves or stays: people who were promoted in the last 5 years tend to stay, and the others tend to leave.
*   Work accident is also an important factor: people who have had a work accident tend to stay in the company.

# Handling Categorical Features

###### We need to handle the categorical features because the machine learning models used here accept only numerical values and cannot take text (object) columns. So we need to prepare our data in numerical form before feeding it to an ML model.

*   The Department feature is a nominal categorical variable, so it will be handled by one-hot encoding, also known as dummy variables.
*   The salary feature is an ordinal categorical variable, so it will be handled by ordinal encoding, which is sometimes loosely called label encoding.


In [None]:
# We convert the Department feature into numerical form with one-hot encoding (dummy variables).
employee_churn = pd.get_dummies(employee_churn, columns=['Department'])
employee_churn.columns


In [None]:
employee_churn.head()

In [None]:
# We convert the salary feature to numerical data by adding a new column to the data frame and mapping salary to it.
# We use 1 for high, 2 for medium, and 3 for low.
salary_map = {'low':3,
              'medium':2,
              'high':1
              }
        
employee_churn['salary_numeric'] = employee_churn.salary.map(salary_map)

employee_churn['salary_numeric'].value_counts()

###### We also tried the OrdinalEncoder from sklearn.preprocessing, but it expects a 2D input rather than a 1D array such as a single column, so we used an explicit mapping to convert the salary feature into numeric form. The mapping is also a form of ordinal encoding.
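For comparison, here is a minimal sketch of how `OrdinalEncoder` could be applied to a single column by passing it as a one-column DataFrame. The frame `toy` is a hypothetical stand-in, and the `+ 1` shift is only there to reproduce the 1/2/3 coding of the mapping above, since the encoder itself is zero-based.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the salary column.
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Explicit category order: 'high' -> 0, 'medium' -> 1, 'low' -> 2.
encoder = OrdinalEncoder(categories=[["high", "medium", "low"]])

# Double brackets keep the input 2D (a one-column DataFrame, not a Series).
codes = encoder.fit_transform(toy[["salary"]]).ravel()

# Shift by one so the codes match the 1/2/3 mapping used in this notebook.
toy["salary_numeric"] = codes.astype(int) + 1
print(toy)
```

Passing the category order explicitly matters here: without it the encoder sorts the labels alphabetically, which would not reflect the high/medium/low ordering.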

In [None]:
# Checking the column names.
employee_churn.columns

In [None]:
#Dropping salary column as it is of no use as of now for modeling.
employee_churn = employee_churn.drop(['salary'], axis=1)
employee_churn.columns

#### Checking correlation between 'left' feature with respect to other features. 

In [None]:
data_corr = employee_churn.corr()
data_corr

###### Our findings show that certain features are negatively correlated with the 'left' feature. As a first, manual selection we eliminate those features and keep only the positively correlated ones. This is a naive heuristic, because a strongly negative correlation is just as informative to a model as a positive one.

###### Features that are positively associated with the feature 'left' are: 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Department_accounting', 'Department_hr', 'Department_sales', 'Department_support', 'Department_technical', 'salary_numeric'

Hence we will use these features to predict which employees are leaving our company.
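Rather than scanning the full correlation table by eye, the correlations with the target can be pulled out and sorted. The sketch below uses synthetic data (the columns `satisfaction` and `hours` and their effect sizes are assumptions made up for illustration) and sorts by absolute value so that strong negative correlations are not overlooked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in: higher satisfaction lowers, and longer hours raise,
# the chance of leaving. These effect sizes are invented for the example.
satisfaction = rng.uniform(0, 1, n)
hours = rng.uniform(100, 300, n)
score = -3 * (satisfaction - 0.5) + 0.01 * (hours - 200)
left = (score + rng.normal(0, 0.5, n) > 0).astype(int)

toy = pd.DataFrame({"satisfaction": satisfaction, "hours": hours, "left": left})

# Correlation of every feature with the target, strongest first by magnitude.
corr_with_left = (
    toy.corr()["left"]
    .drop("left")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_left)
```

On the notebook's data the equivalent would be `data_corr['left'].drop('left').sort_values(key=np.abs, ascending=False)`.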

In [None]:
# Creating X1 as the input dataframe and y1 as output feature for our model.
cols_corr=['last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company',
           'Department_accounting', 'Department_hr', 'Department_sales', 
           'Department_support', 'Department_technical', 'salary_numeric'] 
X1=employee_churn[cols_corr]
y1=employee_churn['left']

## Machine Learning Model Implementation
#### Implementing Logistic Regression to predict employee churn.

In [None]:
# Importing the function that splits the data into train and test sets
from sklearn.model_selection import train_test_split
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=0)

#Calling and importing the libraries to fit the Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X1_train, y1_train)

In [None]:
# Checking accuracy on our test and train datasets
from sklearn.metrics import accuracy_score
print('Logistic regression train accuracy: {:.3f}'.format(accuracy_score(y1_train, logreg.predict(X1_train))))
print('Logistic regression test accuracy: {:.3f}'.format(accuracy_score(y1_test, logreg.predict(X1_test))))

###### Our baseline accuracy was 0.771 and our model reached only 0.745, so we need to make some changes. Maybe we chose the wrong features manually.
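The notebook does not show how the baseline accuracy was obtained. One common choice, assumed here, is the accuracy of always predicting the majority class, which scikit-learn exposes as `DummyClassifier(strategy="most_frequent")`. The sketch below uses a synthetic target with roughly the same class balance; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 76% of employees stayed (0), 24% left (1).
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 760 + [1] * 240)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Always predicts the most frequent class seen in training; ignores the features.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print("Majority-class baseline accuracy: {:.3f}".format(baseline_acc))
```

Any real model should beat this number on the same test split before its accuracy is considered meaningful.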


###### To eliminate such features and select the ones that really affect our output, we can use a process called automatic feature selection.

# Automatic Feature Selection

###### We will use one of the most frequently used techniques, Recursive Feature Elimination (RFE). It works by recursively removing variables and rebuilding the model on the variables that remain. At each step it uses the fitted model's coefficients (or feature importances) to decide which variable contributes least to predicting the target attribute and drops it, until the requested number of variables is left.

###### When we checked the correlation table above, we found that only 10 features are positively associated with the feature 'left'. Hence we will ask RFE to find the 10 best features.

In [None]:
# Creating X2 as the list of input columns and y2 as the output column for RFE.
employee_churn_vars=employee_churn.columns.values.tolist()
y2=['left']
X2=[i for i in employee_churn_vars if i not in y2]

print('y variable being our output variable \n', y2)
print('X variable being our input dataframe \n', X2)

###### Here we will again take 10 of the 18 features available in X, just as we chose 10 features manually, and check whether it makes any difference to our prediction.

In [None]:
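# Importing and calling Recursive Feature Elimination and Logistic Regression
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(lr, n_features_to_select=10)
# ravel() turns the one-column target frame into the 1D array that fit() expects
rfe = rfe.fit(employee_churn[X2], employee_churn[y2].values.ravel())
print(rfe.support_)
print(rfe.ranking_)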

###### We can see that RFE chose 10 variables for us, which are marked True in the `support_` array and ranked "1" in the `ranking_` array. They are:

###### 'satisfaction_level', 'last_evaluation', 'number_project', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'Department_RandD', 'Department_hr', 'Department_management', 'salary_numeric'
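The selected names do not have to be copied by hand from the boolean output. A minimal sketch, on synthetic data from `make_classification` (the column names `f0` to `f5` are hypothetical), of reading them off the fitted selector:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with six features, three of them informative.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(6)])

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask aligned with the input columns.
selected = X_toy.columns[selector.support_].tolist()
print(selected)
```

In this notebook the same indexing would be `[c for c, keep in zip(X2, rfe.support_) if keep]`, since `X2` is a plain list of column names.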


In [None]:
# Creating X as the input dataframe and y as the output feature, using the features selected by RFE.
cols_RFE=['satisfaction_level', 'last_evaluation', 'number_project', 'time_spend_company', 
          'Work_accident', 'promotion_last_5years', 'Department_RandD', 
          'Department_hr', 'Department_management', 'salary_numeric'] 
X = employee_churn[cols_RFE]
y = employee_churn['left']

## Implementing Logistic Regression ML model on features selected by RFE to find employee churn

In [None]:
# Importing the function that splits the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#Calling and importing the libraries to fit the Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [None]:
# Checking accuracy on our test and train datasets
from sklearn.metrics import accuracy_score
print('Logistic regression Train accuracy: {:.3f}'.format(accuracy_score(y_train, logreg.predict(X_train))))
print('Logistic regression Test accuracy: {:.3f}'.format(accuracy_score(y_test, logreg.predict(X_test))))

###### Here we get an accuracy of 0.804, which is higher than our baseline accuracy and higher than the previous accuracy obtained by the same logistic regression classifier using the features selected manually from the correlation table. From here on, we will use the features selected by RFE for the other models.

## Evaluating our Logistic Regression model by Precision and Recall and building Confusion Matrix

In [None]:
# Printing Precision and Recall and f1-score for Logistic Regression model
from sklearn.metrics import classification_report
print(classification_report(y_test, logreg.predict(X_test)))

## Building Confusion Matrix for Logistic Regression model

In [None]:
# Plotting the confusion matrix (rows = true class, columns = predicted class)
logreg_y_pred = logreg.predict(X_test)
# confusion_matrix expects (y_true, y_pred); labels must be passed by keyword
logreg_cm = metrics.confusion_matrix(y_test, logreg_y_pred, labels=[1, 0])
sns.heatmap(logreg_cm, annot=True, fmt='d', xticklabels=["Left", "Stayed"], yticklabels=["Left", "Stayed"])
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.title('Logistic Regression')
plt.savefig('logistic_regression')

## Implementing Random Forest Classifier ML model on features selected by RFE to find employee churn

In [None]:
#Calling and importing the libraries to fit the Random Forest Classifier model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
# Checking accuracy on our test and train datasets
from sklearn.metrics import accuracy_score
print('Random Forest Classifier Train accuracy: {:.3f}'.format(accuracy_score(y_train, rf.predict(X_train))))
print('Random Forest Classifier Test accuracy: {:.3f}'.format(accuracy_score(y_test, rf.predict(X_test))))

## Evaluating our Random Forest Classifier model by Precision and Recall and building Confusion Matrix

In [None]:
# Printing Precision and Recall and f1-score
from sklearn.metrics import classification_report
print(classification_report(y_test, rf.predict(X_test)))

## Building Confusion Matrix for Random Forest Classifier

In [None]:
# Plotting the confusion matrix (rows = true class, columns = predicted class)
y_pred = rf.predict(X_test)
# confusion_matrix expects (y_true, y_pred); labels must be passed by keyword
forest_cm = metrics.confusion_matrix(y_test, y_pred, labels=[1, 0])
sns.heatmap(forest_cm, annot=True, fmt='d', xticklabels=["Left", "Stayed"], yticklabels=["Left", "Stayed"])
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.title('Random Forest')
plt.savefig('random_forest')

## Implementing Support Vector Machine ML model on features selected by RFE to find employee churn

In [None]:
#Calling and importing the libraries to fit the Support Vector Machine model
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)

In [None]:
# Checking accuracy on our test and train datasets
from sklearn.metrics import accuracy_score
print('Support Vector Machine Train accuracy: {:.3f}'.format(accuracy_score(y_train, svc.predict(X_train))))
print('Support Vector Machine Test accuracy: {:.3f}'.format(accuracy_score(y_test, svc.predict(X_test))))

## Evaluating our Support Vector Machine model by Precision and Recall and building Confusion Matrix

In [None]:
print(classification_report(y_test, svc.predict(X_test)))

## Building Confusion Matrix for Support Vector Machine

In [None]:
# Plotting the confusion matrix (rows = true class, columns = predicted class)
svc_y_pred = svc.predict(X_test)
# confusion_matrix expects (y_true, y_pred); labels must be passed by keyword
svc_cm = metrics.confusion_matrix(y_test, svc_y_pred, labels=[1, 0])
sns.heatmap(svc_cm, annot=True, fmt='d', xticklabels=["Left", "Stayed"], yticklabels=["Left", "Stayed"])
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.title('Support Vector Machine')
plt.savefig('support_vector_machine')


## Implementing Gradient Boosting Classifier ML model on features selected by RFE to find employee churn

In [None]:
#Calling and importing the libraries to fit the Gradient Boosting Classifier model
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

In [None]:
# Checking accuracy on our test and train datasets
from sklearn.metrics import accuracy_score
print('Gradient Boosting Classifier Train accuracy: {:.3f}'.format(accuracy_score(y_train, gb.predict(X_train))))
print('Gradient Boosting Classifier Test accuracy: {:.3f}'.format(accuracy_score(y_test, gb.predict(X_test))))

## Evaluating our Gradient Boosting Classifier model by Precision and Recall and building Confusion Matrix

In [None]:
print(classification_report(y_test, gb.predict(X_test)))

## Building Confusion Matrix for Gradient Boosting Classifier


In [None]:
# Plotting the confusion matrix (rows = true class, columns = predicted class)
gb_y_pred = gb.predict(X_test)
# confusion_matrix expects (y_true, y_pred); labels must be passed by keyword
gb_cm = metrics.confusion_matrix(y_test, gb_y_pred, labels=[1, 0])
sns.heatmap(gb_cm, annot=True, fmt='d', xticklabels=["Left", "Stayed"], yticklabels=["Left", "Stayed"])
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.title('Gradient Boosting Classifier')
plt.savefig('Gradient Boosting Classifier')

## Applying Neural Network for Classification Problem

In [None]:
#Calling and importing the libraries to fit the Sequential Neural Network model
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam,SGD

In [None]:
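# Building, compiling, and fitting the model on our training data
model = Sequential()
model.add(Dense(1, input_dim=10, activation='sigmoid'))
# `learning_rate` replaces the deprecated `lr` argument of Adam
model.compile(Adam(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])
# Cast to float32 so that boolean dummy columns are accepted by Keras
model.fit(X_train.astype('float32'), y_train)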

In [None]:
model.summary()

In [None]:
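from sklearn.metrics import accuracy_score
# predict_classes() was removed from recent Keras releases; threshold the sigmoid output at 0.5 instead
model_y_pred = (model.predict(X_test.astype('float32')) > 0.5).astype(int).ravel()
print('Sequential Neural Network Test accuracy: {:.3f}'.format(accuracy_score(y_test, model_y_pred)))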

In [None]:
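# Threshold the sigmoid output at 0.5 to get class labels (predict_classes() is no longer available)
model_y_pred = (model.predict(X_test.astype('float32')) > 0.5).astype(int).ravel()
# confusion_matrix expects (y_true, y_pred); rows = true class, columns = predicted class
model_cm = metrics.confusion_matrix(y_test, model_y_pred)
sns.heatmap(model_cm, annot=True, fmt='d', xticklabels=["Stayed", "Left"], yticklabels=["Stayed", "Left"])
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.title('Sequential Neural Network')
plt.savefig('Sequential Neural Network')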

## Model Selection

###### As stated in our project proposal, we planned to apply four classical models: Logistic Regression, Random Forest, SVM (Support Vector Machine), and Gradient Boosting Classifier. We have applied all of them. For a classification problem, accuracy alone is not enough to evaluate a model; there are other measures we need to keep in mind.

###### These are precision, recall, the confusion matrix, True Positives, False Positives, True Negatives, and False Negatives.

###### Now let us turn to our problem, employee churn, where we focus on True Positives and False Positives. True Positives are employees who leave the company and whom our ML model correctly predicts as leaving. False Positives are employees whom our model predicts will leave but who actually stay. (Employees who leave but are predicted to stay are False Negatives, which recall measures.) So, before selecting one of the four models above, we check which model has the highest precision, that is, the fewest false positives.

###### By examining all four models, we found that Random Forest wins the race with high precision and accuracy: a precision of 0.98 for employees who are going to leave and a test accuracy of 0.988.

###### Hence we will select Random Forest as the best model for the employee churn problem.
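The precision and recall definitions used in this section can be checked by hand against a confusion matrix. The sketch below uses ten made-up labels (1 = left, 0 = stayed); the values are hypothetical and only illustrate the arithmetic.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = left, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Precision: of those predicted to leave, how many really left.
precision = tp / (tp + fp)
# Recall: of those who really left, how many the model caught.
recall = tp / (tp + fn)

print("TP={} FP={} FN={} TN={}".format(tp, fp, fn, tn))
print("precision={:.2f} recall={:.2f}".format(precision, recall))
```

The hand-computed values agree with `precision_score(y_true, y_pred)` and `recall_score(y_true, y_pred)`, which is what `classification_report` prints for class 1.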

## Cross Validation
#### Cross-validation is one of the most important techniques for checking how well our model generalizes and for detecting overfitting on our dataset.

#### We will use the Random Forest Classifier, our best model, for cross-validation.

#### We use 10-fold cross-validation to train and evaluate our Random Forest model.

In [None]:
from sklearn import model_selection
# shuffle=True is required when random_state is set on KFold
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
modelCV = RandomForestClassifier()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

#### The average accuracy remains very close to the Random Forest model accuracy; hence, we can conclude that the model generalizes well.

#### We have also beaten our baseline average accuracy, which was 0.977.

## The ROC Curve

#### The ROC curve and the area under it (AUC) help us evaluate which classification model works better, and show at which threshold a model gives the best recall (true positive rate) with the lowest false positive rate.

#### We do not include SVC in the ROC plot. An ROC curve plots sensitivity (y axis) against 1 - specificity (x axis) across thresholds, so it needs continuous scores or probabilities rather than accuracy or error rate, and SVC with its default settings outputs only hard 0 or 1 labels (`predict_proba` is unavailable unless `probability=True`).
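If SVC were to be included after all, one option is to use its `decision_function`, which returns a continuous signed distance to the separating boundary and is accepted by `roc_curve`. This is a sketch on synthetic data, not part of the notebook's analysis; `X_toy`, `y_toy`, and `svc_toy` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem as a stand-in.
X_toy, y_toy = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

svc_toy = SVC().fit(X_tr, y_tr)

# Continuous scores: larger values mean "more likely class 1".
scores = svc_toy.decision_function(X_te)

svc_fpr, svc_tpr, _ = roc_curve(y_te, scores)
svc_auc = roc_auc_score(y_te, scores)
print("SVC AUC on the toy data: {:.3f}".format(svc_auc))
```

The alternative, `SVC(probability=True)`, enables `predict_proba` but fits an extra calibration step internally, so training is slower.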

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use the predicted probability of class 1 for both the AUC and the curve,
# so that the area shown in the legend matches the plotted line.
logreg_proba = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, logreg_proba)
fpr, tpr, thresholds = roc_curve(y_test, logreg_proba)

rf_proba = rf.predict_proba(X_test)[:, 1]
rf_roc_auc = roc_auc_score(y_test, rf_proba)
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf_proba)

gb_proba = gb.predict_proba(X_test)[:, 1]
gb_roc_auc = roc_auc_score(y_test, gb_proba)
gb_fpr, gb_tpr, gb_thresholds = roc_curve(y_test, gb_proba)

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
plt.plot(gb_fpr, gb_tpr, label='Gradient Boosting (area = %0.2f)' % gb_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('ROC')
plt.show()

#### The Random Forest Classifier is the best model for this specific problem, as its curve is farthest from the dotted line, which represents the ROC curve of a purely random classifier. A good classifier stays as far away from that line as possible, toward the top-left corner.

## Feature Importance for Random Forest Model

#### We compute feature importance to understand which features affect employee churn the most. This can help an HR manager understand why an employee is likely to churn, focus on retaining that employee, and improve the relevant area of the company.

In [None]:
# Applying feature importance
feature_labels = np.array(['satisfaction_level', 'last_evaluation', 'number_project', 'time_spend_company', 
                           'Work_accident', 'promotion_last_5years', 'Department_RandD', 
                           'Department_hr', 'Department_management', 'salary_numeric'])
importance = rf.feature_importances_
feature_indexes_by_importance = importance.argsort()
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))

#### The results above show the importance of each feature in ascending order, with 'promotion_last_5years' being the least important and 'satisfaction_level' the most important feature for understanding the reason for employee churn.
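The same ranking can be produced without a hand-typed label array by pairing the importances with the training columns in a pandas Series. This is a sketch on synthetic data; the column names and the forest settings (`n_estimators=50`, `random_state=0`) are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with five features, two of them informative.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Index the importances by column name and sort ascending, as in the cell above.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values()

for name, value in importances.items():
    print("{}-{:.2f}%".format(name, value * 100.0))
```

In this notebook the equivalent would be `pd.Series(rf.feature_importances_, index=X.columns).sort_values()`, which stays correct if the feature list changes.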