# HR Analytics

This repo contains the HR Analytics project from my data science portfolio. The objective is to predict employee attrition using an HR dataset from the IBM Watson Analytics Sample Data, [HR Employee Attrition & Performance](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/), which contains records for 1,470 employees.

# Problem Statement

Although some staff turnover is inevitable in any company, a high attrition rate is costly. Employee attrition is the overall turnover within a company as existing employees leave and new ones are hired. The attrition rate is usually calculated as the percentage of employees leaving the company over a specified period of time. Recruitment, hiring and training all involve financial costs and a new employee may not be immediately productive in terms of creating profit. The amount of time spent to interview and find a replacement, and the loss of productivity for several months while the new employee gets accustomed to the new role, are indirect costs to the company. These costs can significantly increase if executive-level or highest-paid employees are to be replaced. As such, the costs of replacing employees for most companies are often very significant.

An unusually high employee attrition rate is also considered indicative of problems within the company. Uncompetitive pay scales, micromanagement, ineffective human resource management (HRM) practices and unreasonable expectations can all lead to unacceptable levels of staff turnover. Understanding why and when employees are most likely to leave can lead to actions to improve employee retention as well as possibly planning new hiring in advance.
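
As a minimal sketch of the attrition-rate definition given above, the snippet below computes the rate as the number of leavers divided by the headcount. The function name `attrition_rate` is hypothetical, and dividing by total headcount is only one possible convention. The figures come from the dataset summary further down, which reports 1,233 of 1,470 employees with `Attrition` equal to `No`.

```python
def attrition_rate(leavers: int, headcount: int) -> float:
    """Return the attrition rate as a percentage of headcount."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return 100.0 * leavers / headcount


# 1,470 employees in total, 1,233 of them still active,
# which leaves 1,470 - 1,233 = 237 ex-employees.
rate = attrition_rate(leavers=1470 - 1233, headcount=1470)
print(f"Attrition rate: {rate:.1f}%")
```

A rate computed this way covers the whole dataset rather than a specific period, because the data does not include leaving dates.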

In this project, I will attempt to answer the following questions:

This is a standard supervised classification problem in which the target (label) is a binary variable: 0 for an active employee and 1 for an ex-employee. The objective is to predict employee attrition from the available information about each employee. I will also attempt to estimate the probability that an employee leaves the company, rather than only a hard 0/1 prediction.
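
To make the 0/1 target concrete, here is a small sketch of how the `Attrition` labels could be mapped to a binary variable with pandas. The toy values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the Attrition column (illustrative values only)
attrition = pd.Series(["Yes", "No", "No", "Yes", "No"], name="Attrition")

# Map the labels to the binary target: 1 = ex-employee, 0 = active employee
target = attrition.map({"No": 0, "Yes": 1})

print(target.tolist())
print("Share of positives:", target.mean())
```

The mean of a 0/1 target is the share of ex-employees, which is a quick check for class imbalance before modelling.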

# Dataset

The dataset used in this project is the IBM Watson Analytics Sample Data, HR Employee Attrition & Performance. It contains 1,470 rows, one per employee, and 35 columns. It is also available directly within Watson Analytics as Employee Performance. As stated on the IBM website, the purpose of the dataset is to:

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

# Exploratory Data Analysis

In [None]:
# import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
# read dataset
df_raw = pd.read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx", sheet_name=0)

# make a copy of the original source file
df = df_raw.copy()

print("Size of dataset is: {}".format(df.shape))
df.head()

Size of dataset is: (1470, 35)

In [None]:
    Age  Attrition  BusinessTravel     DailyRate  Department              DistanceFromHome   Education   EducationField   EmployeeCount  EmployeeNumber  ...    RelationshipSatisfaction   StandardHours   StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
0   41   Yes        Travel_Rarely      1102       Sales                   1                  2           Life Sciences    1              1               ...    1                          80              0                 8                  0                      1                6               4                   0                        5
1   49   No         Travel_Frequently  279        Research & Development  8                  1           Life Sciences    1              2               ...    4                          80              1                 10                 3                      3                10              7                   1                        7
2   37   Yes        Travel_Rarely      1373       Research & Development  2                  2           Other            1              4               ...    2                          80              0                 7                  3                      3                0               0                   0                        0
3   33   No         Travel_Frequently  1392       Research & Development  3                  4           Life Sciences    1              5               ...    3                          80              0                 8                  3                      3                8               7                   3                        0
4   27   No         Travel_Rarely      591        Research & Development  2                  1           Medical          1              7               ...    4                          80              1                 6                  3                      3                2               2                   2                        2
5 rows × 35 columns

There are 35 columns and 1,470 rows.

In [None]:
# dataset columns
df.columns

In [None]:
# columns datatypes
df.info()

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EmployeeNumber              1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvement              1470 non-null int64
JobLevel                    1470 non-null int64
JobRole                     1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome               1470 non-null int64
MonthlyRate                 1470 non-null int64
NumCompaniesWorked          1470 non-null int64
Over18                      1470 non-null object
OverTime                    1470 non-null object
PercentSalaryHike           1470 non-null int64
PerformanceRating           1470 non-null int64
RelationshipSatisfaction    1470 non-null int64
StandardHours               1470 non-null int64
StockOptionLevel            1470 non-null int64
TotalWorkingYears           1470 non-null int64
TrainingTimesLastYear       1470 non-null int64
WorkLifeBalance             1470 non-null int64
YearsAtCompany              1470 non-null int64
YearsInCurrentRole          1470 non-null int64
YearsSinceLastPromotion     1470 non-null int64
YearsWithCurrManager        1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.0+ KB

In [None]:
# groupby column data types
df.columns.groupby(df.dtypes)

In [None]:
{dtype('int64'): Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
        'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
        'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
        'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
        'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
        'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
        'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
        'YearsSinceLastPromotion', 'YearsWithCurrManager'],
       dtype='object'),
 dtype('O'): Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
        'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
       dtype='object')}

In [None]:
# number of columns
# loop over each dtype group and report how many columns it contains
for dtype, cols in df.columns.groupby(df.dtypes).items():
    print("No of columns for {} is {}.".format(dtype, len(cols)))

In [None]:
# check for missing values
df.isnull().sum()

In [None]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

In [None]:
# summary statistics
df.describe()

In [None]:
      Age           DailyRate     DistanceFromHome  Education    EmployeeCount  EmployeeNumber  EnvironmentSatisfaction  HourlyRate    JobInvolvement JobLevel     ...    RelationshipSatisfaction  StandardHours  StockOptionLevel   TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany   YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
count 1470.000000   1470.000000   1470.000000       1470.000000  1470.0         1470.000000     1470.000000              1470.000000   1470.000000    1470.000000  ...    1470.000000               1470.0         1470.000000        1470.000000        1470.000000            1470.000000      1470.000000      1470.000000         1470.000000              1470.000000
mean  36.923810     802.485714    9.192517          2.912925     1.0            1024.865306     2.721769                 65.891156     2.729932       2.063946     ...    2.712245                  80.0           0.793878           11.279592          2.799320               2.761224         7.008163         4.229252            2.187755                 4.123129
std   9.135373      403.509100    8.106864          1.024165     0.0            602.024335      1.093082                 20.329428     0.711561       1.106940     ...    1.081209                  0.0            0.852077           7.780782           1.289271               0.706476         6.126525         3.623137            3.222430                 3.568136
min   18.000000     102.000000    1.000000          1.000000     1.0            1.000000        1.000000                 30.000000     1.000000       1.000000     ...    1.000000                  80.0           0.000000           0.000000           0.000000               1.000000         0.000000         0.000000            0.000000                 0.000000
25%   30.000000     465.000000    2.000000          2.000000     1.0            491.250000      2.000000                 48.000000     2.000000       1.000000     ...    2.000000                  80.0           0.000000           6.000000           2.000000               2.000000         3.000000         2.000000            0.000000                 2.000000
50%   36.000000     802.000000    7.000000          3.000000     1.0            1020.500000     3.000000                 66.000000     3.000000       2.000000     ...    3.000000                  80.0           1.000000           10.000000          3.000000               3.000000         5.000000         3.000000            1.000000                 3.000000
75%   43.000000     1157.000000   14.000000         4.000000     1.0            1555.750000     4.000000                 83.750000     3.000000       3.000000     ...    4.000000                  80.0           1.000000           15.000000          3.000000               3.000000         9.000000         7.000000            3.000000                 7.000000
max   60.000000     1499.000000   29.000000         5.000000     1.0            2068.000000     4.000000                 100.000000    4.000000       5.000000     ...    4.000000                  80.0           3.000000           40.000000          6.000000               4.000000         40.000000        18.000000           15.000000                17.000000
8 rows × 26 columns
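
The summary shows a standard deviation of 0 for `EmployeeCount` and `StandardHours`, so each of these columns holds a single value and cannot help a classifier separate the two classes. Below is a hedged sketch of how such columns could be detected. It runs on a toy frame rather than the real data, and the helper name `constant_columns` is hypothetical.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) == 1]


# Toy frame mimicking the pattern seen in the summary statistics
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

print(constant_columns(toy))
```

On the real data the same helper could be called as `constant_columns(df)`. Whether to drop the columns it reports is a modelling decision.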

In [None]:
# summary statistics
df.describe(include="all")

In [None]:
        Age          Attrition  BusinessTravel  DailyRate    Department              DistanceFromHome  Education    EducationField  EmployeeCount  EmployeeNumber  ...  RelationshipSatisfaction  StandardHours  StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
count   1470.000000  1470       1470            1470.000000  1470                    1470.000000       1470.000000  1470            1470.0         1470.000000     ...  1470.000000               1470.0         1470.000000       1470.000000        1470.000000            1470.000000      1470.000000     1470.000000         1470.000000              1470.000000
unique  NaN          2          3               NaN          3                       NaN               NaN          6               NaN            NaN             ...  NaN                       NaN            NaN               NaN                NaN                    NaN              NaN             NaN                 NaN                      NaN
top     NaN          No         Travel_Rarely   NaN          Research & Development  NaN               NaN          Life Sciences   NaN            NaN             ...  NaN                       NaN            NaN               NaN                NaN                    NaN              NaN             NaN                 NaN                      NaN
freq    NaN          1233       1043            NaN          961                     NaN               NaN          606             NaN            NaN             ...  NaN                       NaN            NaN               NaN                NaN                    NaN              NaN             NaN                 NaN                      NaN
mean    36.923810    NaN        NaN             802.485714   NaN                     9.192517          2.912925     NaN             1.0            1024.865306     ...  2.712245                  80.0           0.793878          11.279592          2.799320               2.761224         7.008163        4.229252            2.187755                 4.123129
std     9.135373     NaN        NaN             403.509100   NaN                     8.106864          1.024165     NaN             0.0            602.024335      ...  1.081209                  0.0            0.852077          7.780782           1.289271               0.706476         6.126525        3.623137            3.222430                 3.568136
min     18.000000    NaN        NaN             102.000000   NaN                     1.000000          1.000000     NaN             1.0            1.000000        ...  1.000000                  80.0           0.000000          0.000000           0.000000               1.000000         0.000000        0.000000            0.000000                 0.000000
25%     30.000000    NaN        NaN             465.000000   NaN                     2.000000          2.000000     NaN             1.0            491.250000      ...  2.000000                  80.0           0.000000          6.000000           2.000000               2.000000         3.000000        2.000000            0.000000                 2.000000
50%     36.000000    NaN        NaN             802.000000   NaN                     7.000000          3.000000     NaN             1.0            1020.500000     ...  3.000000                  80.0           1.000000          10.000000          3.000000               3.000000         5.000000        3.000000            1.000000                 3.000000
75%     43.000000    NaN        NaN             1157.000000  NaN                     14.000000         4.000000     NaN             1.0            1555.750000     ...  4.000000                  80.0           1.000000          15.000000          3.000000               3.000000         9.000000        7.000000            3.000000                 7.000000
max     60.000000    NaN        NaN             1499.000000  NaN                     29.000000         5.000000     NaN             1.0            2068.000000     ...  4.000000                  80.0           3.000000          40.000000          6.000000               4.000000         40.000000       18.000000           15.000000                17.000000
11 rows × 35 columns

In [None]:
# plot histogram for numeric columns
df.hist(figsize=(20,20))
# plt.savefig("histograms.png", bbox_inches="tight")
plt.show()

Histograms are a quick way to visualise the distribution of each numeric variable. The variant below adds `plt.tight_layout()` so that the subplot titles and axis labels do not overlap:

In [None]:
import matplotlib.pyplot as plt

# Plot histograms for numeric columns
df.hist(figsize=(20, 20))
plt.tight_layout()  # Adjust layout to prevent overlapping
plt.show()

This generates one histogram per numeric column. Adjust the `figsize` parameter to change the overall size of the figure. To save the figure as an image file, uncomment the `plt.savefig` line in the first histogram cell and set the filename and path as needed.

# Features / Input Variables

# Age

Employee ages range from 18 to 60 years. The average age is 33.6 years for ex-employees and 37.6 years for active employees.

In [None]:
# Age
print("Age of employees is from {} to {} years old.".format(df['Age'].min(), df['Age'].max()))

Age of employees is from 18 to 60 years old.

In [None]:
# Ex-Employees
print("Ex-Employees:")
print("Average Age of Ex-Employees = {:1.1f}".format(np.mean(df.loc[df['Attrition'] == 'Yes', 'Age'])))
print("Standard Deviation = {:1.1f}".format(np.std(df.loc[df['Attrition'] == 'Yes', 'Age'])))

In [None]:
# Active Employees
print("Active Employees:")
print("Average Age of Active Employees = {:1.1f}".format(np.mean(df.loc[df['Attrition'] == 'No', 'Age'])))
print("Standard Deviation = {:1.1f}".format(np.std(df.loc[df['Attrition'] == 'No', 'Age'])))
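
The two cells above can be condensed into a single grouped summary. The sketch below uses toy values, not the real data. It passes `ddof=0` explicitly because `np.std` computes the population standard deviation by default, whereas pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Toy data standing in for df[["Attrition", "Age"]] (illustrative values only)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "Age": [28, 40, 32, 45, 35],
})

# Mean and population standard deviation of Age per attrition group
summary = toy.groupby("Attrition")["Age"].agg(
    mean="mean",
    std=lambda s: s.std(ddof=0),
)
print(summary.round(1))
```

Keeping both groups in one table makes the difference in means easier to compare at a glance.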

We can create a kernel density estimation (KDE) plot colored by the value of the target. A kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Density plots are representations of the underlying distribution.
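
To illustrate what a KDE computes, here is a minimal NumPy sketch of a Gaussian kernel density estimate: one bell curve is placed on each observation and the curves are averaged. The function name `gaussian_kde_1d`, the sample ages and the fixed bandwidth of 4.0 are all assumptions made for illustration. Plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the samples and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth


ages = [22, 25, 31, 33, 38, 41, 47, 52]  # illustrative values only
grid = np.linspace(0, 80, 801)
density = gaussian_kde_1d(ages, grid, bandwidth=4.0)

# A density should integrate to approximately 1 over its support
area = np.sum(density) * (grid[1] - grid[0])
print(f"Area under the estimated density: {area:.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely.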

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

# sns.distplot is deprecated; sns.kdeplot draws the same density curve
sns.kdeplot(target_0["Age"], label="Active Employees")
sns.kdeplot(target_1["Age"], label="Ex-Employees")
plt.legend()
plt.xlabel("Age", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Age Distribution by Attrition Status", x=0.5, y=1.05, ha="center", fontsize=14)

# plt.savefig("age_kde.png", bbox_inches="tight")
plt.show()

# Age Distribution by Attrition Status

The same comparison can be drawn as overlaid histograms with Seaborn's `histplot` function, using `hue` to split the data by attrition status:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(10, 6))

# Create the distribution plot
sns.histplot(data=df, x='Age', hue='Attrition', kde=True, bins=30, palette='muted')

# Set the title and labels
plt.title('Age Distribution by Attrition Status')
plt.xlabel('Age')
plt.ylabel('Count')

# Show the plot
plt.show()

This generates a histogram of age with attrition status differentiated by color. Adjust the `bins` parameter to change the number of bins, or the `palette` parameter to change the color scheme.

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

# sns.distplot is deprecated; sns.kdeplot draws the same density curve
sns.kdeplot(target_0["Age"], label="Active Employees")
sns.kdeplot(target_1["Age"], label="Ex-Employees")
plt.legend()
plt.xlim(df["Age"].min(), df["Age"].max())
plt.xlabel("Age", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Age Distribution in Percent by Attrition Status", x=0.5, y=1.05, ha="center", fontsize=14)

# plt.savefig("age_kde.png", bbox_inches="tight")
plt.show()

In [None]:
# plot histogram
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12,4)

target_0[['Age']].hist(bins=20, ax=axes[0])
axes[0].set_title('Active Employees')

target_1[['Age']].hist(bins=20, ax=axes[1])
axes[1].set_title('Ex-Employees')

fig.text(0.5, 0.01, 'Age', ha='center')
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical')

# plt.savefig("age_hist.png", bbox_inches="tight")
plt.show()

# Gender

The dataset shows a slightly higher attrition rate among male employees than among female employees: 17% of males and 15% of females are ex-employees.

In [None]:
# Gender of employees
df['Gender'].value_counts()

In [None]:
# Gender of employees
df['Gender'].value_counts(normalize=True)

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'Gender'].value_counts().plot.bar(title= 'Ex-Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,170)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'Gender'].value_counts().plot.bar(title= 'Active Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,850)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+15, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
fig.text(0.5, 0.01, 'Gender', ha='center')
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical')

# plt.savefig("gender_hist_1.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g1 = df.loc[df['Attrition'] == 'Yes', 'Gender'].value_counts(normalize=True).plot.bar(title= 'Ex-Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.02, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'Gender'].value_counts(normalize=True).plot.bar(title= 'Active Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.02, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
fig.text(0.5, 0.01, 'Gender', ha='center')
fig.text(0.07, 0.5, '% of Employee', va='center', rotation='vertical')

# plt.savefig("gender_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g1 = df.loc[df['Gender'] == 'Male', 'Attrition'].value_counts().reindex(["Yes", "No"]).plot.bar(title= 'Male', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,850)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+20, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Gender'] == 'Female', 'Attrition'].value_counts().reindex(["Yes", "No"]).plot.bar(title= 'Female', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,600)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+10, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
fig.text(0.5, 0.01, 'Attrition', ha='center')
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical')

# plt.savefig("gender_hist_3.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g1 = df.loc[df['Gender'] == 'Male', 'Attrition'].value_counts(normalize=True).reindex(["Yes", "No"]).plot.bar(title= 'Male', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.015, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Gender'] == 'Female', 'Attrition'].value_counts(normalize=True).reindex(["Yes", "No"]).plot.bar(title= 'Female', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.02, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
fig.text(0.5, 0.01, 'Attrition', ha='center')
fig.text(0.07, 0.5, '% of Employee', va='center', rotation='vertical')

# plt.savefig("gender_hist_4.png", bbox_inches="tight")
plt.show()
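
The charts above normalise in two different directions: the share of each gender among ex-employees, and the share of ex-employees within each gender. The second is the attrition rate per gender, and it can be read off a row-normalised cross-tabulation. Below is a minimal sketch on toy data. The values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for df[["Gender", "Attrition"]] (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female", "Female"],
    "Attrition": ["Yes", "No", "No", "No",
                  "No", "No", "Yes", "No", "No"],
})

# Row-normalised table: each row sums to 1, so the "Yes" column is the
# attrition rate within that gender
rates = pd.crosstab(toy["Gender"], toy["Attrition"], normalize="index")
print(rates.round(2))
```

On the real data the equivalent call would be `pd.crosstab(df["Gender"], df["Attrition"], normalize="index")`.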

# Marital Status

The dataset features three marital statuses: Married (673 employees), Single (470 employees) and Divorced (327 employees). Single employees show the largest proportion of leavers, at 25%.

In [None]:
# marital Status of employees
df['MaritalStatus'].value_counts()

In [None]:
Married     673
Single      470
Divorced    327
Name: MaritalStatus, dtype: int64

In [None]:
# marital Status of employees
df['MaritalStatus'].value_counts(normalize=True)

In [None]:
Married     0.457823
Single      0.319728
Divorced    0.222449
Name: MaritalStatus, dtype: float64

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'MaritalStatus'].value_counts().plot.bar(title= 'Ex-Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,150)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'MaritalStatus'].value_counts().reindex(["Single", "Married","Divorced"]).plot.bar(title= 'Active Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,700)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition by Marital Status', x=0.5, y=1.05, ha='center', fontsize=16)
fig.text(0.5, 0.001, 'Marital Status', ha='center', fontsize=14)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("marital_hist_1.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the two panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(121)
g1 = df.loc[df["Attrition"] == "Yes", "MaritalStatus"].value_counts(normalize=True).plot.bar(title= "Ex-Employees", color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,0.6)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df["Attrition"] == "No", "MaritalStatus"].value_counts(normalize=True).reindex(["Single", "Married","Divorced"]).plot.bar(title= "Active Employees", color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,0.6)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Attrition by Marital Status", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Marital Status", ha="center", fontsize=14)
fig.text(0.07, 0.5, "% of Employees", va="center", rotation="vertical", fontsize=14)

# plt.savefig("marital_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the three panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(131)
g1 = df.loc[df['MaritalStatus'] == 'Single', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Single', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,700)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+10, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(132)
g2 = df.loc[df['MaritalStatus'] == 'Married', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Married', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,700)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(133)
g3 = df.loc[df['MaritalStatus'] == 'Divorced', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Divorced', color="#1f77b4")
g3.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,700)
for p in g3.patches:
    height = p.get_height()
    g3.text(p.get_x()+p.get_width()/2., height+10, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Marital Status by Attrition Status", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Attrition", ha="center", fontsize=14)
fig.text(0.07, 0.5, "No of Employee", va="center", rotation="vertical", fontsize=14)

# plt.savefig("marital_hist_3.png", bbox_inches="tight")
plt.show()

In [None]:
# create an empty figure; the three panels are added with plt.subplot below
fig = plt.figure(figsize=(16,4))

plt.subplot(131)
g1 = df.loc[df['MaritalStatus'] == 'Single', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Single', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.02, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(132)
g2 = df.loc[df['MaritalStatus'] == 'Married', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Married', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.02, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(133)
g3 = df.loc[df['MaritalStatus'] == 'Divorced', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Divorced', color="#1f77b4")
g3.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g3.patches:
    height = p.get_height()
    g3.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Marital Status by Attrition Status", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Attrition", ha="center", fontsize=14)
fig.text(0.07, 0.5, "% of Employees", va="center", rotation="vertical", fontsize=14)

# plt.savefig("marital_hist_4.png", bbox_inches="tight")
plt.show()

# Distance from Home

Distance from home for employees to get to work varies from 1 to 29 miles. There is no discernible strong correlation between Distance from Home and Attrition Status as per the KDE plot below.

In [None]:
# Distance from Home
print("Distance from home for employees to get to work is from {} to {} miles.".format(df['DistanceFromHome'].min(), df['DistanceFromHome'].max()))

Distance from home for employees to get to work is from 1 to 29 miles.

In [None]:
# Ex-Employees
print("Ex-Employees:")
print("Average Distance From Home of Ex-Employees = {:1.2f} miles".format(np.mean(df.loc[df['Attrition'] == 'Yes', 'DistanceFromHome'])))
print("Standard Deviation = {:1.2f}".format(np.std(df.loc[df['Attrition'] == 'Yes', 'DistanceFromHome'])))

In [None]:
# Active Employees
print("Active Employees:")
print("Average Distance From Home of Active Employees = {:1.2f} miles".format(np.mean(df.loc[df['Attrition'] == 'No', 'DistanceFromHome'])))
print("Standard Deviation = {:1.2f}".format(np.std(df.loc[df['Attrition'] == 'No', 'DistanceFromHome'])))
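The two summary cells above can also be condensed into a single grouped aggregation. The following is a minimal sketch on a made-up toy frame (the `toy` values are illustrative assumptions, not rows from the dataset). It uses `ddof=0` explicitly because NumPy's `np.std` defaults to the population standard deviation, whereas pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No", "No"],
    "DistanceFromHome": [2, 10, 1, 5, 9],
})

# Mean and population standard deviation (ddof=0) per attrition group.
summary = (
    toy.groupby("Attrition")["DistanceFromHome"]
       .agg(mean="mean", std_pop=lambda s: s.std(ddof=0))
)
print(summary.round(2))
```

Applied to the real data, the same pattern would use `df` in place of `toy`, and any numeric column in place of `DistanceFromHome`.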

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

sns.distplot(target_0[["DistanceFromHome"]], hist=False, label="Active Employees")
sns.distplot(target_1[["DistanceFromHome"]], hist=False, label="Ex-Employees")
plt.legend()
plt.xlabel("Distance From Home", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Distance From Home Distribution by Attrition Status", fontsize=14)

# plt.savefig("distance_kde.png", bbox_inches="tight")
plt.show()

# Distance From Home Distribution by Attrition Status 

To visualize the distribution of "Distance From Home" by "Attrition Status," you can use Seaborn's histplot function again. Here's how you can create the chart:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(10, 6))

# Create the distribution plot
sns.histplot(data=df, x='DistanceFromHome', hue='Attrition', kde=True, bins=30, palette='muted')

# Set the title and labels
plt.title('Distance From Home Distribution by Attrition Status')
plt.xlabel('Distance From Home')
plt.ylabel('Count')

# Show the plot
plt.show()

This code will generate a histogram showing the distribution of "Distance From Home," with the Attrition status differentiated by color. Adjust the bins parameter to change the number of bins in the histogram, and modify the palette parameter to change the color scheme if needed.

A kernel density estimate can assign positive density to negative x-values even though the variable itself is never negative; this is simply a consequence of how KDEs work. See https://stats.stackexchange.com/questions/109549/negative-density-for-non-negative-variables and https://www.youtube.com/watch?v=R6_LR-f6Tt4 for details.
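To make this concrete, here is a small sketch of a hand-rolled Gaussian KDE evaluated below the smallest observation. The helper `gaussian_kde_at`, the sample values, and the bandwidth are all assumptions chosen for illustration; this is not the estimator seaborn uses internally.

```python
import numpy as np

def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate at a single point x."""
    z = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean() / bandwidth

# Made-up, strictly positive "distances" used only for illustration.
samples = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 10.0])

# The density below the smallest observation is still positive,
# because each Gaussian kernel extends over the whole real line.
density_at_minus_one = gaussian_kde_at(-1.0, samples, bandwidth=2.0)
print(density_at_minus_one > 0)
```

Setting `plt.xlim` to the observed range only hides this tail; it does not change the estimate. Seaborn's `kdeplot` also accepts `cut` and `clip` arguments that limit how far beyond the data the curve is drawn or evaluated.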

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

sns.distplot(target_0[["DistanceFromHome"]], hist=False, label="Active Employees")
sns.distplot(target_1[["DistanceFromHome"]], hist=False, label="Ex-Employees")
plt.legend()
plt.xlim(df["DistanceFromHome"].min(), df["DistanceFromHome"].max())
plt.xlabel("Distance From Home", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Distance From Home Distribution by Attrition Status", fontsize=14)

# plt.savefig("distance_kde.png", bbox_inches="tight")
plt.show()

In [None]:
# plot histogram
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12,4)

target_0[['DistanceFromHome']].hist(bins=20, ax=axes[0])
axes[0].set_title('Active Employees')

target_1[['DistanceFromHome']].hist(bins=20, ax=axes[1])
axes[1].set_title('Ex-Employees')

fig.text(0.5, 0.01, 'DistanceFromHome', ha='center', fontsize=12)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=12)

# plt.savefig("distance_hist.png", bbox_inches="tight")
plt.show()

# Department

The data features employee data from three departments: Research & Development, Sales, and Human Resources.

In [None]:
# departments
df['Department'].value_counts()

In [None]:
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

In [None]:
# departments
df['Department'].value_counts(normalize=True)

In [None]:
Research & Development    0.653741
Sales                     0.303401
Human Resources           0.042857
Name: Department, dtype: float64

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'Department'].value_counts().plot.bar(title= 'Ex-Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,160)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'Department'].value_counts().plot.bar(title= 'Active Employees', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1000)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by Department', fontsize=16)
fig.text(0.5, 0.001, 'Department', ha='center', fontsize=14)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("dept_hist_1.png", bbox_inches="tight")
plt.show()

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g1 = df.loc[df["Attrition"] == "Yes", "Department"].value_counts(normalize=True).plot.bar(title= "Ex-Employees", color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df["Attrition"] == "No", "Department"].value_counts(normalize=True).plot.bar(title= "Active Employees", color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Attrition Status by Department", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Department", ha="center", fontsize=14)
fig.text(0.07, 0.5, "% of Employee", va="center", rotation="vertical", fontsize=14)

# plt.savefig("dept_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(131)
g1 = df.loc[df['Department'] == 'Research & Development', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Research & Development', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1000)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+20, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(132)
g2 = df.loc[df['Department'] == 'Sales', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Sales', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,500)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+10, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(133)
g3 = df.loc[df['Department'] == 'Human Resources', 'Attrition'].value_counts().sort_values(ascending=True).plot.bar(title= 'Human Resources', color="#1f77b4")
g3.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,100)
for p in g3.patches:
    height = p.get_height()
    g3.text(p.get_x()+p.get_width()/2., height+2, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Attrition Status by Department", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Department", ha="center", fontsize=14)
fig.text(0.07, 0.5, "No of Employee", va="center", rotation="vertical", fontsize=14)

# plt.savefig("sdept_hist_3.png", bbox_inches="tight")
plt.show()

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(131)
g1 = df.loc[df['Department'] == 'Research & Development', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Research & Development', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(132)
g2 = df.loc[df['Department'] == 'Sales', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Sales', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(133)
g3 = df.loc[df['Department'] == 'Human Resources', 'Attrition'].value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'Human Resources', color="#1f77b4")
g3.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g3.patches:
    height = p.get_height()
    g3.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Attrition Status by Department", x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, "Employee Status", ha="center", fontsize=14)
fig.text(0.07, 0.5, "% of Employee", va="center", rotation="vertical", fontsize=14)

# plt.savefig("dept_hist_4.png", bbox_inches="tight")
plt.show()

# Job Role

Sales Representative has the highest proportion of leavers, followed by Laboratory Technician and Human Resources.

In [None]:
# Employees in the database have several roles on-file
df['JobRole'].value_counts()

In [None]:
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: JobRole, dtype: int64

In [None]:
# Employees in the database have several roles on-file
df['JobRole'].value_counts(normalize=True)

In [None]:
Sales Executive              0.221769
Research Scientist           0.198639
Laboratory Technician        0.176190
Manufacturing Director       0.098639
Healthcare Representative    0.089116
Manager                      0.069388
Sales Representative         0.056463
Research Director            0.054422
Human Resources              0.035374
Name: JobRole, dtype: float64

In [None]:
fig, axes = plt.subplots(figsize=(16,8))

plt.subplot(211)
g2 = df.loc[df['Attrition'] == 'Yes', 'JobRole'].value_counts().plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=10)
plt.ylim(0,80)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 

fig.subplots_adjust(hspace=0.4)

plt.subplot(212)
g2 = df.loc[df['Attrition'] == 'No', 'JobRole'].value_counts().plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=10)
plt.ylim(0,400)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by Job Role', x=0.5, y=0.95, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'Job Role', ha='center', fontsize=14)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("role_hist_1.png", bbox_inches="tight")
plt.show()

# Attrition Status by Job Role

To visualize the distribution of Attrition Status by Job Role, a bar plot or count plot would be suitable. Here's how you can create the chart using Seaborn:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(12, 8))

# Create the count plot
sns.countplot(data=df, x='JobRole', hue='Attrition', palette='muted')

# Set the title and labels
plt.title('Attrition Status by Job Role')
plt.xlabel('Job Role')
plt.ylabel('Count')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()

This code will generate a count plot showing the distribution of Attrition Status across different Job Roles, with Attrition status differentiated by color. Adjust the palette parameter to change the color scheme if needed.

In [None]:
fig, axes = plt.subplots(figsize=(16,8))

plt.subplot(211)
g2 = df.loc[df['Attrition'] == 'Yes', 'JobRole'].value_counts(normalize=True).plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=10)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(212)
g2 = df.loc[df['Attrition'] == 'No', 'JobRole'].value_counts(normalize=True).plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=10)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 

fig.subplots_adjust(hspace=0.4)

# set title and axis labels
plt.suptitle('Attrition Status by Job Role', x=0.5, y=0.95, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'Job Role', ha='center', fontsize=14)
fig.text(0.07, 0.5, '% of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("role_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
df_JobRole = pd.DataFrame(columns=["Job Role", "% of Leavers"])
i=0
for field in list(df['JobRole'].unique()):
    ratio = df[(df['JobRole']==field)&(df['Attrition']=="Yes")].shape[0] / df[df['JobRole']==field].shape[0]
    df_JobRole.loc[i] = (field, ratio*100)
    i += 1
    #print("In {}, the ratio of leavers is {:.2f}%".format(field, ratio*100))    
df_JR = df_JobRole.groupby(by="Job Role").sum()
g = df_JR.sort_values(by=['% of Leavers'], ascending=False).plot(kind='bar', figsize=(12,4))
plt.ylim(0,50)
plt.title("Proportion of Leavers by Job Role (%)",  fontsize=16)
plt.xticks(rotation=10)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2., height+1, "{:1.1f}".format(height), ha="center", fontsize=14) 

# plt.savefig("role_hist_3.png", bbox_inches="tight")
plt.show()
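The share of leavers per category computed by the loop above can also be obtained with a grouped mean over a boolean flag. This is a sketch on a made-up toy frame (the `toy` rows are illustrative assumptions, not dataset values); the same pattern would apply to `JobLevel` or `OverTime`.

```python
import pandas as pd

# Toy stand-in for the HR data; the rows are made up for illustration only.
toy = pd.DataFrame({
    "JobRole": ["Sales Representative", "Sales Representative",
                "Manager", "Manager", "Manager", "Manager"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No"],
})

# The mean of a boolean flag within each group is the fraction of leavers.
leaver_pct = (
    toy["Attrition"].eq("Yes")
       .groupby(toy["JobRole"])
       .mean()
       .mul(100)
       .sort_values(ascending=False)
)
print(leaver_pct)
```

Because the result is already indexed by category and sorted, it can be passed straight to `.plot(kind="bar")`.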

# Job Level

Employees have an assigned level within the organisation which varies from 1 (staff) to 5 (managerial/director). Employees with an assigned Job Level of "1" show the largest normalized proportion of Leavers.

In [None]:
# breakdown by job level
df['JobLevel'].value_counts()

In [None]:
1    543
2    534
3    218
4    106
5     69
Name: JobLevel, dtype: int64

In [None]:
# breakdown by job level
df['JobLevel'].value_counts(normalize=True)

In [None]:
1    0.369388
2    0.363265
3    0.148299
4    0.072109
5    0.046939
Name: JobLevel, dtype: float64

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'JobLevel'].value_counts().plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,170)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 

plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'JobLevel'].value_counts().reindex([1,2,3,4,5]).plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,600)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by Job Level', x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'Job Level', ha='center', fontsize=14)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("level_hist_1.png", bbox_inches="tight")
plt.show()

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'JobLevel'].value_counts(normalize=True).plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'JobLevel'].value_counts(normalize=True).plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by Job Level', x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'Job Level', ha='center', fontsize=14)
fig.text(0.07, 0.5, '% of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("level_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
df_JobLevel = pd.DataFrame(columns=["Job Level", "% of Leavers"])
i=0
for field in list(df['JobLevel'].unique()):
    ratio = df[(df['JobLevel']==field)&(df['Attrition']=="Yes")].shape[0] / df[df['JobLevel']==field].shape[0]
    df_JobLevel.loc[i] = (field, ratio*100)
    i += 1
    #print("In {}, the ratio of leavers is {:.2f}%".format(field, ratio*100))    
df_JL = df_JobLevel.groupby(by="Job Level").sum()
g = df_JL.sort_values(by=['% of Leavers'], ascending=False).plot(kind='bar', figsize=(12,4))
plt.ylim(0,50)
plt.title("Proportion of Leavers by Job Level (%)",  fontsize=16)
plt.xticks(rotation=0)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2., height+1, "{:1.1f}".format(height), ha="center", fontsize=14) 

# plt.savefig("level_hist_3.png", bbox_inches="tight")
plt.show()

# Years at the Company

Currently active employees have spent an average of 7.37 years at the company, compared with 5.13 years for ex-employees.

In [None]:
# Years at the Company
print("Years At Company for employees is from {} to {} years.".format(df['YearsAtCompany'].min(), df['YearsAtCompany'].max()))

Years At Company for employees is from 0 to 40 years.

In [None]:
# Ex-Employees
print("Ex-Employees:")
print("Average Years At Company of Ex-Employees = {:1.2f} years".format(np.mean(df.loc[df['Attrition'] == 'Yes', 'YearsAtCompany'])))
print("Standard Deviation = {:1.2f}".format(np.std(df.loc[df['Attrition'] == 'Yes', 'YearsAtCompany'])))

In [None]:
# Active Employees
print("Active Employees:")
print("Average Years At Company of Active Employees = {:1.2f} years".format(np.mean(df.loc[df['Attrition'] == 'No', 'YearsAtCompany'])))
print("Standard Deviation = {:1.2f}".format(np.std(df.loc[df['Attrition'] == 'No', 'YearsAtCompany'])))

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

sns.distplot(target_0[["YearsAtCompany"]], hist=False, label="Active Employees", color="#2ca02c")
sns.distplot(target_1[["YearsAtCompany"]], hist=False, label="Ex-Employees", color="#1f77b4")
plt.legend()
plt.xlabel("Years At Company", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Years At Company Distribution by Attrition Status", fontsize=14)

# plt.savefig("year_kde_1.png", bbox_inches="tight")
plt.show()

# Years At Company Distribution by Attrition Status

To visualize the distribution of "Years At Company" by "Attrition Status," you can use Seaborn's histplot function. Here's how you can create the chart:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(10, 6))

# Create the distribution plot
sns.histplot(data=df, x='YearsAtCompany', hue='Attrition', kde=True, bins=30, palette='muted')

# Set the title and labels
plt.title('Years At Company Distribution by Attrition Status')
plt.xlabel('Years At Company')
plt.ylabel('Count')

# Show the plot
plt.show()

This code will generate a histogram showing the distribution of "Years At Company," with the Attrition status differentiated by color. Adjust the bins parameter to change the number of bins in the histogram, and modify the palette parameter to change the color scheme if needed.

In [None]:
# plot histogram
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12,4)

target_0[['YearsAtCompany']].hist(bins=20, ax=axes[0])
axes[0].set_title('Active Employees')

target_1[['YearsAtCompany']].hist(bins=20, ax=axes[1])
axes[1].set_title('Ex-Employees')

fig.text(0.5, 0.01, 'Years At Company', ha='center', fontsize=12)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=12)

# plt.savefig("year_hist.png", bbox_inches="tight")
plt.show()

# Overtime

Some employees have overtime commitments. The data clearly show that a significantly larger proportion of employees who work overtime have left the company.

In [None]:
# breakdown by overtime
df['OverTime'].value_counts()

In [None]:
No     1054
Yes     416
Name: OverTime, dtype: int64

In [None]:
# breakdown by overtime
df['OverTime'].value_counts(normalize=True)

In [None]:
No     0.717007
Yes    0.282993
Name: OverTime, dtype: float64

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'OverTime'].value_counts().plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,170)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+3, "{:1.0f}".format(height), ha="center", fontsize=14) 

plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'OverTime'].value_counts().reindex(["Yes", "No"]).plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1200)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+12, "{:1.0f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by OverTime', x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'OverTime', ha='center', fontsize=14)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("ot_hist_1.png", bbox_inches="tight")
plt.show()

# Attrition Status by OverTime

To visualize the distribution of Attrition Status by Overtime, you can use a count plot or bar plot. Here's how you can create the chart using Seaborn:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(8, 6))

# Create the count plot
sns.countplot(data=df, x='OverTime', hue='Attrition', palette='muted')

# Set the title and labels
plt.title('Attrition Status by Overtime')
plt.xlabel('Overtime')
plt.ylabel('Count')

# Show the plot
plt.show()

This code will generate a count plot showing the distribution of Attrition Status across Overtime categories, with Attrition status differentiated by color. Adjust the palette parameter to change the color scheme if needed.

In [None]:
fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g2 = df.loc[df['Attrition'] == 'Yes', 'OverTime'].value_counts(normalize=True).plot.bar(title= 'Ex-Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df.loc[df['Attrition'] == 'No', 'OverTime'].value_counts(normalize=True).reindex(["Yes", "No"]).plot.bar(title= 'Active Employee', color="#1f77b4")
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 

# set title and axis labels
plt.suptitle('Attrition Status by OverTime', x=0.5, y=1.05, ha="center", fontsize=16)
fig.text(0.5, 0.001, 'OverTime', ha='center', fontsize=14)
fig.text(0.07, 0.5, '% of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("ot_hist_2.png", bbox_inches="tight")
plt.show()

In [None]:
df_OverTime = pd.DataFrame(columns=["OverTime", "% of Leavers"])
i=0
for field in list(df['OverTime'].unique()):
    ratio = df[(df['OverTime']==field)&(df['Attrition']=="Yes")].shape[0] / df[df['OverTime']==field].shape[0]
    df_OverTime.loc[i] = (field, ratio*100)
    i += 1
    #print("In {}, the ratio of leavers is {:.2f}%".format(field, ratio*100))    
df_OT = df_OverTime.groupby(by="OverTime").sum()
g = df_OT.sort_values(by=['% of Leavers'], ascending=False).plot(kind='bar', figsize=(12,4))
plt.ylim(0,50)
plt.title("Proportion of Leavers by OverTime (%)",  fontsize=16)
plt.xticks(rotation=0)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2., height+1, "{:1.1f}".format(height), ha="center", fontsize=14) 

# plt.savefig("ot_hist_3.png", bbox_inches="tight")
plt.show()

# Proportion of Leavers by Work Life Balance (%)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Flag leavers (Attrition == 'Yes') so that the group mean is the attrition rate
is_leaver = df['Attrition'] == 'Yes'

# Percentage of leavers within each Work Life Balance rating
proportion_leavers = is_leaver.groupby(df['WorkLifeBalance']).mean() * 100

# Plot the percentage of leavers by Work Life Balance rating
plt.figure(figsize=(10, 6))
sns.barplot(x=proportion_leavers.index, y=proportion_leavers.values, color='skyblue')
plt.title('Proportion of Leavers by Work Life Balance (%)')
plt.xlabel('Work Life Balance')
plt.ylabel('% of Leavers')
plt.show()

This code calculates the percentage of leavers within each Work Life Balance category and then creates a bar plot to visualize these percentages. Adjust the figure size, color, and other parameters as needed for your visualization preferences.

# Monthly Income

Employee Monthly Income varies from $1,009 to $19,999.

In [None]:
# Monthly Income
print("Monthly Income for employees is from ${} to ${}.".format(df['MonthlyIncome'].min(), df['MonthlyIncome'].max()))

Monthly Income for employees is from $1009 to $19999.

In [None]:
# Ex-Employees
print("Ex-Employees:")
print("Average Monthly Income of Ex-Employees = ${:1.2f}".format(np.mean(df.loc[df['Attrition'] == 'Yes', 'MonthlyIncome'])))
print("Standard Deviation = ${:1.2f}".format(np.std(df.loc[df['Attrition'] == 'Yes', 'MonthlyIncome'])))

In [None]:
# Active Employees
print("Active Employees:")
print("Average Monthly Income of Active Employees = ${:1.2f}".format(np.mean(df.loc[df['Attrition'] == 'No', 'MonthlyIncome'])))
print("Standard Deviation = ${:1.2f}".format(np.std(df.loc[df['Attrition'] == 'No', 'MonthlyIncome'])))

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

sns.distplot(target_0[["MonthlyIncome"]], hist=False, label="Active Employees", color="#2ca02c")
sns.distplot(target_1[["MonthlyIncome"]], hist=False, label="Ex-Employees", color="#1f77b4")
plt.legend()
plt.xlabel("Monthly Income", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Distribution of Monthly Income by Attrition Status", fontsize=14)

# plt.savefig("income_kde_1.png", bbox_inches="tight")
plt.show()


A kernel density estimate can assign positive density to negative x-values even though the variable itself is never negative; this is simply a consequence of how KDEs work. See https://stats.stackexchange.com/questions/109549/negative-density-for-non-negative-variables and https://www.youtube.com/watch?v=R6_LR-f6Tt4 for details.

In [None]:
# kernel density estimation (KDE) plot 

plt.subplots(figsize=(12,4))

target_0 = df.loc[df["Attrition"] == "No"]
target_1 = df.loc[df["Attrition"] == "Yes"]

sns.distplot(target_0[["MonthlyIncome"]], hist=False, label="Active Employees", color="#2ca02c")
sns.distplot(target_1[["MonthlyIncome"]], hist=False, label="Ex-Employees", color="#1f77b4")
plt.legend()
plt.xlim(df["MonthlyIncome"].min(), df["MonthlyIncome"].max())
plt.xlabel("Monthly Income", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.title("Distribution of Monthly Income by Attrition Status", fontsize=14)

# plt.savefig("income_kde_2.png", bbox_inches="tight")
plt.show()

In [None]:
# plot histogram
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12,4)

target_0[['MonthlyIncome']].hist(bins=20, ax=axes[0])
axes[0].set_title('Active Employees')

target_1[['MonthlyIncome']].hist(bins=20, ax=axes[1])
axes[1].set_title('Ex-Employees')

fig.text(0.5, 0.01, 'Monthly Income', ha='center', fontsize=12)
fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=12)

# plt.savefig("income_hist.png", bbox_inches="tight")
plt.show()

# Target Variable: Attrition

The feature “Attrition” is what this Machine Learning problem is about. We are trying to predict the value of the feature ‘Attrition’ by using other related features associated with the employee’s personal and professional history.

In the supplied dataset, the percentage of Current Employees is 83.9% and of Ex-employees is 16.1%. Hence, this is an imbalanced class problem.

Machine learning algorithms typically work best when the number of instances of each class is roughly equal. We will have to address this target feature imbalance prior to implementing our Machine Learning algorithms.
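One common way to compensate for such an imbalance is to weight each class inversely to its frequency. This section does not state which technique the project uses, so the sketch below is only an illustration of that one option. It applies the "balanced" heuristic, weight = n_samples / (n_classes * class_count), to the class counts reported by `value_counts()` in this section.

```python
# Class counts taken from the Attrition value_counts() output in this notebook.
counts = {"No": 1233, "Yes": 237}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

The minority class ("Yes") receives the larger weight, so misclassifying an ex-employee costs the model more than misclassifying an active employee. Resampling the training data is another commonly used alternative.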

In [None]:
# Attrition indicates if the employee is currently active ('No') or has left the company ('Yes')
df['Attrition'].value_counts()

In [None]:
No     1233
Yes     237
Name: Attrition, dtype: int64

In [None]:
# Attrition indicates if the employee is currently active ('No') or has left the company ('Yes')
df['Attrition'].value_counts(normalize=True)

In [None]:
No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

In [None]:
# df['Attrition'].value_counts().plot(kind='hist', xTitle='Attrition', yTitle='count', title='Attrition Distribution')

fig, axes = plt.subplots(figsize=(16,4))

plt.subplot(121)
g1 = df['Attrition'].value_counts().reindex(["Yes", "No"]).plot.bar(color="#1f77b4")
plt.xticks(rotation=0)
plt.ylabel("No of Employee")
plt.ylim(0,1500)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+20, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(122)
g2 = df['Attrition'].value_counts(normalize=True).reindex(["Yes", "No"]).plot.bar( color="#1f77b4")
plt.xticks(rotation=0)
plt.ylabel("% of Employee")
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
# set title and axis labels
plt.suptitle("Attrition Status", x=0.5, y=1.05, ha="center", fontsize=16)
# fig.text(0.5, 0.001, 'Attrition Status', ha='center', fontsize=14)
# fig.text(0.07, 0.5, 'No of Employee', va='center', rotation='vertical', fontsize=14)

# plt.savefig("att_hist_1.png", bbox_inches="tight")
plt.show()

# Correlation

Let's take a look at some of the most significant correlations. It is worth remembering that correlation coefficients only measure linear correlations.
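The caveat about linearity can be seen in a small sketch with made-up values: `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

# Made-up values for illustration: y depends on x, but not linearly.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Series.corr uses the Pearson method by default.
pearson = x.corr(y)
print(f"Pearson r = {pearson:.3f}")
```

A feature with a weak linear correlation to the target may therefore still be informative for a non-linear model.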

In [None]:
# Find correlations with the target and sort
df_corr = df.copy()
df_corr['Target'] = df_corr['Attrition'].apply(lambda x: 0 if x == 'No' else 1)
df_corr = df_corr.drop(['Attrition', 'EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18'], axis=1)
correlations = df_corr.corr()['Target'].sort_values()
print('Most Positive Correlations: \n', correlations.tail(5))
print('\nMost Negative Correlations: \n', correlations.head(5))

In [None]:
# compute the correlation matrix
corr = df_corr.corr()
corr

In [None]:
	Age	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	MonthlyIncome	...	RelationshipSatisfaction	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager	Target
Age	1.000000	0.010661	-0.001686	0.208034	0.010146	0.024287	0.029820	0.509604	-0.004892	0.497855	...	0.053535	0.037510	0.680381	-0.019621	-0.021490	0.311309	0.212901	0.216513	0.202089	-0.159205
DailyRate	0.010661	1.000000	-0.004985	-0.016806	0.018355	0.023381	0.046135	0.002966	0.030571	0.007707	...	0.007846	0.042143	0.014515	0.002453	-0.037848	-0.034055	0.009932	-0.033229	-0.026363	-0.056652
DistanceFromHome	-0.001686	-0.004985	1.000000	0.021042	-0.016075	0.031131	0.008783	0.005303	-0.003669	-0.017014	...	0.006557	0.044872	0.004628	-0.036942	-0.026556	0.009508	0.018845	0.010029	0.014406	0.077924
Education	0.208034	-0.016806	0.021042	1.000000	-0.027128	0.016775	0.042438	0.101589	-0.011296	0.094961	...	-0.009118	0.018422	0.148280	-0.025100	0.009819	0.069114	0.060236	0.054254	0.069065	-0.031373
EnvironmentSatisfaction	0.010146	0.018355	-0.016075	-0.027128	1.000000	-0.049857	-0.008278	0.001212	-0.006784	-0.006259	...	0.007665	0.003432	-0.002693	-0.019359	0.027627	0.001458	0.018007	0.016194	-0.004999	-0.103369
HourlyRate	0.024287	0.023381	0.031131	0.016775	-0.049857	1.000000	0.042861	-0.027853	-0.071335	-0.015794	...	0.001330	0.050263	-0.002334	-0.008548	-0.004607	-0.019582	-0.024106	-0.026716	-0.020123	-0.006846
JobInvolvement	0.029820	0.046135	0.008783	0.042438	-0.008278	0.042861	1.000000	-0.012630	-0.021476	-0.015271	...	0.034297	0.021523	-0.005533	-0.015338	-0.014617	-0.021355	0.008717	-0.024184	0.025976	-0.130016
JobLevel	0.509604	0.002966	0.005303	0.101589	0.001212	-0.027853	-0.012630	1.000000	-0.001944	0.950300	...	0.021642	0.013984	0.782208	-0.018191	0.037818	0.534739	0.389447	0.353885	0.375281	-0.169105
JobSatisfaction	-0.004892	0.030571	-0.003669	-0.011296	-0.006784	-0.071335	-0.021476	-0.001944	1.000000	-0.007157	...	-0.012454	0.010690	-0.020185	-0.005779	-0.019459	-0.003803	-0.002305	-0.018214	-0.027656	-0.103481
MonthlyIncome	0.497855	0.007707	-0.017014	0.094961	-0.006259	-0.015794	-0.015271	0.950300	-0.007157	1.000000	...	0.025873	0.005408	0.772893	-0.021736	0.030683	0.514285	0.363818	0.344978	0.344079	-0.159840
MonthlyRate	0.028051	-0.032182	0.027473	-0.026084	0.037600	-0.015297	-0.016322	0.039563	0.000644	0.034814	...	-0.004085	-0.034323	0.026442	0.001467	0.007963	-0.023655	-0.012815	0.001567	-0.036746	0.015170
NumCompaniesWorked	0.299635	0.038153	-0.029251	0.126317	0.012594	0.022157	0.015012	0.142501	-0.055699	0.149515	...	0.052733	0.030075	0.237639	-0.066054	-0.008366	-0.118421	-0.090754	-0.036814	-0.110319	0.043494
PercentSalaryHike	0.003634	0.022704	0.040235	-0.011111	-0.031701	-0.009062	-0.017205	-0.034730	0.020002	-0.027269	...	-0.040490	0.007528	-0.020608	-0.005221	-0.003280	-0.035991	-0.001520	-0.022154	-0.011985	-0.013478
PerformanceRating	0.001904	0.000473	0.027110	-0.024539	-0.029548	-0.002172	-0.029071	-0.021222	0.002297	-0.017120	...	-0.031351	0.003506	0.006744	-0.015579	0.002572	0.003435	0.034986	0.017896	0.022827	0.002889
RelationshipSatisfaction	0.053535	0.007846	0.006557	-0.009118	0.007665	0.001330	0.034297	0.021642	-0.012454	0.025873	...	1.000000	-0.045952	0.024054	0.002497	0.019604	0.019367	-0.015123	0.033493	-0.000867	-0.045872
StockOptionLevel	0.037510	0.042143	0.044872	0.018422	0.003432	0.050263	0.021523	0.013984	0.010690	0.005408	...	-0.045952	1.000000	0.010136	0.011274	0.004129	0.015058	0.050818	0.014352	0.024698	-0.137145
TotalWorkingYears	0.680381	0.014515	0.004628	0.148280	-0.002693	-0.002334	-0.005533	0.782208	-0.020185	0.772893	...	0.024054	0.010136	1.000000	-0.035662	0.001008	0.628133	0.460365	0.404858	0.459188	-0.171063
TrainingTimesLastYear	-0.019621	0.002453	-0.036942	-0.025100	-0.019359	-0.008548	-0.015338	-0.018191	-0.005779	-0.021736	...	0.002497	0.011274	-0.035662	1.000000	0.028072	0.003569	-0.005738	-0.002067	-0.004096	-0.059478
WorkLifeBalance	-0.021490	-0.037848	-0.026556	0.009819	0.027627	-0.004607	-0.014617	0.037818	-0.019459	0.030683	...	0.019604	0.004129	0.001008	0.028072	1.000000	0.012089	0.049856	0.008941	0.002759	-0.063939
YearsAtCompany	0.311309	-0.034055	0.009508	0.069114	0.001458	-0.019582	-0.021355	0.534739	-0.003803	0.514285	...	0.019367	0.015058	0.628133	0.003569	0.012089	1.000000	0.758754	0.618409	0.769212	-0.134392
YearsInCurrentRole	0.212901	0.009932	0.018845	0.060236	0.018007	-0.024106	0.008717	0.389447	-0.002305	0.363818	...	-0.015123	0.050818	0.460365	-0.005738	0.049856	0.758754	1.000000	0.548056	0.714365	-0.160545
YearsSinceLastPromotion	0.216513	-0.033229	0.010029	0.054254	0.016194	-0.026716	-0.024184	0.353885	-0.018214	0.344978	...	0.033493	0.014352	0.404858	-0.002067	0.008941	0.618409	0.548056	1.000000	0.510224	-0.033019
YearsWithCurrManager	0.202089	-0.026363	0.014406	0.069065	-0.004999	-0.020123	0.025976	0.375281	-0.027656	0.344079	...	-0.000867	0.024698	0.459188	-0.004096	0.002759	0.769212	0.714365	0.510224	1.000000	-0.156199
Target	-0.159205	-0.056652	0.077924	-0.031373	-0.103369	-0.006846	-0.130016	-0.169105	-0.103481	-0.159840	...	-0.045872	-0.137145	-0.171063	-0.059478	-0.063939	-0.134392	-0.160545	-0.033019	-0.156199	1.000000
24 rows × 24 columns

In [None]:
# set figure size
plt.figure(figsize=(14, 14))

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; the built-in bool is equivalent
mask[np.triu_indices_from(mask)] = True

fig = sns.heatmap(corr, vmax=1, square=True, cmap="BuPu", linewidths=.1, annot=False, mask=mask)

# fig.get_figure().savefig("corr_heatmap.png", bbox_inches="tight") 
plt.show()

# Pre-processing Pipeline

In this section, we undertake data pre-processing steps to prepare the datasets for Machine Learning algorithm implementation.

# Encoding

Most Machine Learning algorithms can only take numerical values as their predictor variables, so categorical features must be encoded as numbers. Label Encoding replaces each category with an integer, which can suggest an ordering (and hence a spurious importance) for features with a large number of unique values. To avoid this, we will use both Label Encoding and One-Hot Encoding as shown below.
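
The difference between the two encodings can be sketched in plain Python. This is an illustration only: the helper names `label_encode` and `one_hot_encode` and the toy values are assumptions, and the notebook itself relies on `pd.get_dummies` for the one-hot step.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def one_hot_encode(values, prefix):
    """Create one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# toy columns (hypothetical rows, category names as they appear in the dataset)
overtime = ["Yes", "No", "Yes", "No"]
marital_status = ["Single", "Married", "Single", "Divorced"]

overtime_encoded, overtime_codes = label_encode(overtime)
marital_dummies = one_hot_encode(marital_status, "MaritalStatus")

print(overtime_encoded, overtime_codes)
for column, indicator in marital_dummies.items():
    print(column, indicator)
```

A binary column keeps a single integer column, while a three-valued column becomes three indicator columns, so no artificial order such as Divorced < Married < Single is introduced.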

In [None]:
print(df.shape)
df.head()

In [None]:
Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	1	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	0	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	1	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
3	33	0	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	...	3	80	0	8	3	3	8	7	3	0
4	27	0	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	...	4	80	1	6	3	3	2	2	2	2
5 rows × 35 columns

In [None]:
# convert the remaining categorical variables into dummy (one-hot) variables
df = pd.get_dummies(df)

In [None]:
print(df.shape)
df.head()

(1470, 54)

In [None]:
Age	Attrition	DailyRate	DistanceFromHome	Education	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	...	JobRole_Laboratory Technician	JobRole_Manager	JobRole_Manufacturing Director	JobRole_Research Director	JobRole_Research Scientist	JobRole_Sales Executive	JobRole_Sales Representative	MaritalStatus_Divorced	MaritalStatus_Married	MaritalStatus_Single
0	41	1	1102	1	2	1	1	2	0	94	...	0	0	0	0	0	1	0	0	0	1
1	49	0	279	8	1	1	2	3	1	61	...	0	0	0	0	1	0	0	0	1	0
2	37	1	1373	2	2	1	4	4	1	92	...	1	0	0	0	0	0	0	0	0	1
3	33	0	1392	3	4	1	5	4	0	56	...	0	0	0	0	1	0	0	0	1	0
4	27	0	591	2	1	1	7	1	1	40	...	1	0	0	0	0	0	0	0	1	0
5 rows × 54 columns

# Splitting data into training and testing sets

Prior to implementing or applying any Machine Learning algorithms, we must split the dataset into training and testing sets.

In [None]:
# assign the target (already encoded as 0/1) to its own pandas Series
y = df['Attrition']

In [None]:
print(type(y))
y[:5]

In [None]:
<class 'pandas.core.series.Series'>
0    1
1    0
2    1
3    0
4    0
Name: Attrition, dtype: int64

In [None]:
# remove the target and redundant features from the dataset
X = df.drop(['Attrition', 'EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18'], axis=1)

In [None]:
print(X.shape)
X.head()

(1470, 49)

In [None]:
	Age	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	...	JobRole_Laboratory Technician	JobRole_Manager	JobRole_Manufacturing Director	JobRole_Research Director	JobRole_Research Scientist	JobRole_Sales Executive	JobRole_Sales Representative	MaritalStatus_Divorced	MaritalStatus_Married	MaritalStatus_Single
0	41	1102	1	2	2	0	94	3	2	4	...	0	0	0	0	0	1	0	0	0	1
1	49	279	8	1	3	1	61	2	2	2	...	0	0	0	0	1	0	0	0	1	0
2	37	1373	2	2	4	1	92	2	1	3	...	1	0	0	0	0	0	0	0	0	1
3	33	1392	3	4	4	0	56	3	1	3	...	0	0	0	0	1	0	0	0	1	0
4	27	591	2	1	1	1	40	3	1	2	...	1	0	0	0	0	0	0	0	1	0
5 rows × 49 columns

In [None]:
# Since we have class imbalance (i.e. more employees with turnover=0 than turnover=1),
# let's use stratify=y so that the training and test sets keep the same class ratio as the full dataset

# import library
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)  
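
What `stratify=y` achieves can be sketched without scikit-learn. The `stratified_split` helper below is a hypothetical simplification: it rounds the test size per class, so its split sizes can differ by one from those of `train_test_split`, but it shows the idea of sampling each class separately.

```python
import random

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# class counts taken from the value_counts() output above
labels = [0] * 1233 + [1] * 237
train_idx, test_idx = stratified_split(labels, test_size=0.25)

def positive_share(indices):
    return sum(labels[i] for i in indices) / len(indices)

full_share = sum(labels) / len(labels)
print(round(full_share, 4), round(positive_share(train_idx), 4), round(positive_share(test_idx), 4))
```

All three printed shares stay close to 16%, which is the property the stratified split is meant to guarantee.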

In [None]:
print("Shape of X_train dataset: ", X_train.shape)
print("Shape of y_train dataset: ", y_train.shape)
print("Shape of X_test dataset: ", X_test.shape)
print("Shape of y_test dataset: ", y_test.shape)

In [None]:
X_train.head()

In [None]:
	Age	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	...	JobRole_Laboratory Technician	JobRole_Manager	JobRole_Manufacturing Director	JobRole_Research Director	JobRole_Research Scientist	JobRole_Sales Executive	JobRole_Sales Representative	MaritalStatus_Divorced	MaritalStatus_Married	MaritalStatus_Single
619	33	586	1	3	1	1	48	4	2	1	...	0	0	0	0	0	1	0	1	0	0
779	51	1323	4	4	1	1	34	3	1	3	...	0	0	0	0	1	0	0	0	1	0
627	52	890	25	4	3	0	81	2	4	4	...	0	0	1	0	0	0	0	0	1	0
301	18	812	10	3	4	0	69	2	1	3	...	0	0	0	0	0	0	1	0	0	1
308	58	390	1	4	4	1	32	1	2	3	...	0	0	0	0	0	0	0	1	0	0
5 rows × 49 columns

In [None]:
X_test.head()

In [None]:
	Age	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	...	JobRole_Laboratory Technician	JobRole_Manager	JobRole_Manufacturing Director	JobRole_Research Director	JobRole_Research Scientist	JobRole_Sales Executive	JobRole_Sales Representative	MaritalStatus_Divorced	MaritalStatus_Married	MaritalStatus_Single
655	33	1075	3	2	4	1	57	3	1	2	...	0	0	0	0	0	0	0	1	0	0
857	44	1097	10	4	3	1	96	3	1	3	...	0	0	0	0	1	0	0	0	0	1
861	46	1402	2	3	3	0	69	3	4	1	...	0	1	0	0	0	0	0	0	1	0
1407	24	771	1	2	2	1	45	2	2	3	...	0	0	0	0	0	0	0	0	0	1
744	37	1141	11	2	1	0	61	1	2	2	...	0	0	0	0	0	0	0	0	1	0
5 rows × 49 columns

In [None]:
fig, axes = plt.subplots(figsize=(16,8))

plt.subplot(221)
g1 = y_train.value_counts().sort_values(ascending=True).plot.bar(title= 'y_train', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1100)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+20, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(222)
g2 = y_test.value_counts().sort_values(ascending=True).plot.bar(title= 'y_test', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,400)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+10, "{:1.0f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(223)
g1 = y_train.value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'y_train', color="#1f77b4")
g1.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 
    
plt.subplot(224)
g2 = y_test.value_counts(normalize=True).sort_values(ascending=True).plot.bar(title= 'y_test', color="#1f77b4")
g2.set_xticklabels(["Ex-Employee", "Active Employee"])
plt.xticks(rotation=0)
plt.ylim(0,1)
for p in g2.patches:
    height = p.get_height()
    g2.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.2f}".format(height), ha="center", fontsize=14) 

    
# set title and axis labels
plt.suptitle("Attrition Status After Splitting", x=0.5, y=0.95, ha="center", fontsize=16)
# fig.text(0.5, 0.001, "Attrition", ha="center", fontsize=14)
# fig.text(0.07, 0.5, "No of Employee", va="center", rotation="vertical", fontsize=14)

# plt.savefig("split_hist.png", bbox_inches="tight")
plt.show()

# Feature Scaling

Feature Scaling using MinMaxScaler rescales each feature to a fixed range, in this case between 0 and 1. Machine Learning algorithms often perform better when numerical input variables fall within a similar scale. The scaler is fitted on the training data only and then used to transform both the training and the test data. See https://stackoverflow.com/questions/50565937/how-to-normalize-the-train-and-test-data-using-minmaxscaler-sklearn.
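
The min-max transformation itself is simple enough to sketch by hand. The helpers `fit_min_max` and `transform_min_max` and the toy ages below are assumptions for illustration; the zero-range guard is a choice made for this sketch.

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the TRAINING values only."""
    return min(train_column), max(train_column)

def transform_min_max(column, col_min, col_max):
    """Rescale values with the training minimum and maximum."""
    span = col_max - col_min
    if span == 0:
        return [0.0 for _ in column]  # constant feature: guard against division by zero
    return [(value - col_min) / span for value in column]

# toy ages (hypothetical values)
train_age = [18, 33, 51, 52, 58]
test_age = [24, 37, 60]

age_min, age_max = fit_min_max(train_age)
train_scaled = transform_min_max(train_age, age_min, age_max)
test_scaled = transform_min_max(test_age, age_min, age_max)

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training data, a test value outside the training range (age 60 here) is mapped slightly outside [0, 1]. That is expected and avoids leaking information from the test set into the scaler.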

In [None]:
# import MinMaxScaler from sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler

# define the scaler 
scaler = MinMaxScaler(feature_range=(0, 1))

In [None]:
# fit the scaler on the TRAINING data and use the scaler to transform the training data
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
# use the scaler to transform the TEST data
X_test_scaled = scaler.transform(X_test)

In [None]:
print(X_train_scaled.shape)
X_train_scaled

(1102, 49)

In [None]:
array([[0.35714286, 0.34645669, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.78571429, 0.87401575, 0.10714286, ..., 0.        , 1.        ,
        0.        ],
       [0.80952381, 0.56406586, 0.85714286, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.52380952, 0.49319971, 0.03571429, ..., 0.        , 1.        ,
        0.        ],
       [0.45238095, 0.27201145, 0.07142857, ..., 0.        , 1.        ,
        0.        ],
       [0.4047619 , 0.40085898, 0.60714286, ..., 0.        , 1.        ,
        0.        ]])

In [None]:
print(X_test_scaled.shape)
X_test_scaled

(368, 49)

In [None]:
array([[0.35714286, 0.69649248, 0.07142857, ..., 1.        , 0.        ,
        0.        ],
       [0.61904762, 0.71224052, 0.32142857, ..., 0.        , 0.        ,
        1.        ],
       [0.66666667, 0.9305655 , 0.03571429, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.11904762, 0.31424481, 0.03571429, ..., 1.        , 0.        ,
        0.        ],
       [0.30952381, 0.84037223, 0.03571429, ..., 1.        , 0.        ,
        0.        ],
       [0.9047619 , 0.57551897, 0.17857143, ..., 0.        , 1.        ,
        0.        ]])

# Building Machine Learning Models

# Baseline Algorithms

First, we will use a range of baseline algorithms (using default or out-of-the-box hyper-parameters) before we move on to more sophisticated solutions. A total of 9 algorithms are considered in this section: Logistic Regression, Random Forest, SVM, KNN, Decision Tree Classifier, Gaussian NB, XGBoost, Gradient Boosting and AdaBoost. Then, we will pick the top 2-3 algorithms and fine-tune their hyper-parameters using GridSearchCV to achieve the best AUC score.

We will evaluate the algorithms based on two metrics:

Classification Accuracy is the number of correct predictions made as a ratio of all predictions made.
It is the most common evaluation metric for classification problems. However, it is often misused, as it is only really suitable when there are an equal number of observations in each class and all predictions and prediction errors are equally important. That is not the case in this project, so a different scoring metric may be more suitable.

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.
The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.
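
A small sketch shows why accuracy alone is misleading on this dataset. It uses the class counts reported earlier (1,233 active, 237 ex-employees); the `accuracy` and `roc_auc` helpers are plain-Python assumptions, with AUC computed as the probability that a random positive is scored above a random negative.

```python
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0] * 1233 + [1] * 237

# a "model" that predicts that nobody ever leaves
always_stay = [0] * len(y_true)
constant_scores = [0.0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_stay)
baseline_auc = roc_auc(y_true, constant_scores)
print(round(baseline_accuracy, 4), baseline_auc)
```

The trivial model reaches roughly 84% accuracy without identifying a single leaver, while its AUC is 0.5, i.e. no better than random ranking.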

In [None]:
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# sklearn modules for ML model selection
from sklearn.model_selection import train_test_split  # import 'train_test_split'
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Libraries for data modelling
from sklearn import svm, tree, linear_model, neighbors
from sklearn import naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# sklearn modules for performance metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
from sklearn.metrics import auc, roc_auc_score, roc_curve, recall_score, log_loss
from sklearn.metrics import f1_score, accuracy_score, make_scorer
from sklearn.metrics import average_precision_score

import sklearn
print(sklearn.__version__)

0.19.1

In [None]:
# selection of algorithms to consider and set performance measure
models = []
models.append(("Logistic Regression", LogisticRegression(solver="liblinear", random_state=0)))
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
models.append(("SVM", SVC(gamma="auto", random_state=0)))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Decision Tree Classifier", DecisionTreeClassifier(random_state=0)))
models.append(("Gaussian NB", GaussianNB()))
models.append(("XGBoost", XGBClassifier(random_state=0)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=0)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=0)))

In [None]:
%%time
# evaluate each model in turn and record the mean and standard deviation of its ROC AUC and accuracy scores

acc_results = []
auc_results = []
names = []

# set up a table to populate with performance results
col = ["Algorithm", "ROC AUC Mean", "ROC AUC STD", "Accuracy Mean", "Accuracy STD"]
df_results = pd.DataFrame(columns=col)
i = 0

# evaluate each model using cross-validation
for name, model in models:
    # 10-fold cross-validation on consecutive (unshuffled) folds; random_state is
    # omitted because it only has an effect when shuffle=True
    kfold = KFold(n_splits=10)

    # accuracy scoring
    cv_acc_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")

    # roc_auc scoring
    cv_auc_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring="roc_auc")

    acc_results.append(cv_acc_results)
    auc_results.append(cv_auc_results)
    names.append(name)
    df_results.loc[i] = [name,
                         round(cv_auc_results.mean()*100, 2),
                         round(cv_auc_results.std()*100, 2),
                         round(cv_acc_results.mean()*100, 2),
                         round(cv_acc_results.std()*100, 2)
                         ]
    i += 1
    

In [None]:
Wall time: 33.5 s
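
How the 10-fold splitter in the loop above carves up the 1,102 training rows can be sketched as follows. `kfold_indices` is a hypothetical helper that mimics consecutive, unshuffled folds; it is a sketch, not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(1102, 10))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

Each row is used for validation exactly once and for training in the other nine folds, which is what the mean and standard deviation columns in the results table summarise.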

In [None]:
# sort the results by Accuracy
df_results.sort_values(by=["Accuracy Mean"], ascending=False)

In [None]:
Algorithm	ROC AUC Mean	ROC AUC STD	Accuracy Mean	Accuracy STD
0	Logistic Regression	81.41	6.06	87.30	2.85
6	XGBoost	79.92	7.69	87.12	2.19
8	AdaBoost	80.43	9.26	86.48	2.19
1	Random Forest	79.10	5.65	85.85	4.26
7	Gradient Boosting	79.12	8.05	85.67	2.57
2	SVM	50.00	0.00	83.85	4.05
3	KNN	55.58	9.66	81.41	4.04
4	Decision Tree Classifier	58.03	8.14	78.41	2.55
5	Gaussian NB	76.07	7.65	77.85	5.71

In [None]:
# sort the results by ROC AUC
df_results.sort_values(by=["ROC AUC Mean"], ascending=False)

In [None]:
Algorithm	ROC AUC Mean	ROC AUC STD	Accuracy Mean	Accuracy STD
0	Logistic Regression	81.41	6.06	87.30	2.85
8	AdaBoost	80.43	9.26	86.48	2.19
6	XGBoost	79.92	7.69	87.12	2.19
7	Gradient Boosting	79.12	8.05	85.67	2.57
1	Random Forest	79.10	5.65	85.85	4.26
5	Gaussian NB	76.07	7.65	77.85	5.71
4	Decision Tree Classifier	58.03	8.14	78.41	2.55
3	KNN	55.58	9.66	81.41	4.04
2	SVM	50.00	0.00	83.85	4.05

In [None]:
fig = plt.figure(figsize=(16, 8))

ax1 = fig.add_subplot(211)
plt.boxplot(acc_results)
ax1.set_xticklabels(names)
plt.title("Accuracy Comparison")

fig.subplots_adjust(hspace=0.4)

ax2 = fig.add_subplot(212)
plt.boxplot(auc_results)
ax2.set_xticklabels(names)
plt.title("ROC AUC Comparison")

# plt.savefig("boxplots.png", bbox_inches="tight")
plt.show()

Based on our Accuracy and ROC AUC comparison analysis, Logistic Regression and XGBoost have the highest mean Accuracy scores, while Logistic Regression also has the highest mean AUC score. We will shortlist these two algorithms together with Random Forest for further analysis.

# Logistic Regression

In [None]:
kfold = KFold(n_splits=10)  # unshuffled folds; random_state only matters with shuffle=True
modelCV = LogisticRegression(solver='liblinear', random_state=0)
scoring = 'roc_auc'
results = cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print(results, "\n")
print("Average AUC score = {:1.4f}; standard deviation = {:1.4f}".format(results.mean(), results.std()))

# Fine-tuning

GridSearchCV allows us to fine-tune hyper-parameters by searching over specified parameter values for an estimator. Every combination in the grid is scored with cross-validation, and the results provide us with fine-tuned hyper-parameters using ROC AUC as the scoring metric.
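
To get a feel for the cost of a grid search, the candidate combinations can be enumerated by hand. The sketch below mirrors the Logistic Regression grid used in this section (ten values of `C` from 0.1 to 1.0 and two `class_weight` options); the enumeration with `itertools.product` is an illustration, not GridSearchCV's internal code.

```python
from itertools import product

# same grid as for Logistic Regression, written without NumPy
param_grid = {
    "C": [round(0.1 * k, 1) for k in range(1, 11)],
    "class_weight": ["balanced", None],
}

names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 10
n_fits = len(candidates) * n_folds  # one fit per candidate per fold

print(len(candidates), n_fits)
print(candidates[0], candidates[-1])
```

With the default `refit=True`, the best candidate is then fitted once more on the whole training set, so the search cost grows multiplicatively with every hyper-parameter added to the grid.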

In [None]:
%%time
param_grid = {"C": np.arange(1e-01, 1.01, 0.1), "class_weight": ["balanced", None]} # hyper-parameter list to fine-tune

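# note: iid is accepted by scikit-learn 0.19.1 (the version used here) but was removed in later releases
log_gs = GridSearchCV(LogisticRegression(solver="liblinear", random_state=0),
                      iid=True,
                      return_train_score=True,
                      param_grid=param_grid,
                      scoring="roc_auc",
                      cv=10)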

log_gs.fit(X_train, y_train)

print("="*20)
print("best estimator: " + str(log_gs.best_estimator_))
print("best params: " + str(log_gs.best_params_))
print("best score:", log_gs.best_score_)
print("="*20)
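
The `class_weight="balanced"` option in the grid reweights classes inversely to their frequency, using `n_samples / (n_classes * class_count)`. A hand computation for this training set is sketched below. The counts are an assumption derived from figures elsewhere in this document: 1,233 and 237 employees overall, minus the 309 and 59 test-set rows reported as support in the Random Forest classification report.

```python
def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# training-set class counts: 1233 - 309 active, 237 - 59 ex-employees
y_train_like = [0] * 924 + [1] * 178
weights = balanced_class_weights(y_train_like)

print({label: round(weight, 4) for label, weight in weights.items()})
```

Each misclassified ex-employee therefore costs roughly five times as much as a misclassified active employee during training, which pushes the model to pay more attention to the minority class.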

In [None]:
# make predictions
y_pred = log_gs.predict(X_test)

In [None]:
# calculate accuracy score
logit_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Logistic Regression Classifier on Test set: {:.2f}%'.format(logit_accuracy*100))

Accuracy of Logistic Regression Classifier on Test set: 87.50%

In [None]:
# Run the code to view the classification report metrics
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)

In [None]:
# calculate AUC score using test dataset
# we will only keep probabilities associated with the employee leaving
y_pred_prob = log_gs.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_pred_prob)

print("AUC on Test set: {}".format(logit_roc_auc))

AUC on Test set: 0.8144917996818606

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred)
print(cm)

# fig, ax = plt.subplots()
# ax.xaxis.set_label_position("top")

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, annot_kws={"size":16}, fmt="d", cbar=False, linewidths=0.1, cmap="Blues")
plt.title('Confusion matrix of the classifier', fontsize=14)
plt.ylabel('Actual label', fontsize=12)
plt.xlabel('Predicted label', fontsize=12)

# plt.savefig("log_cm.png", bbox_inches="tight")
plt.show()

# XGBoost Classifier

In [None]:
kfold = KFold(n_splits=10)  # unshuffled folds; random_state only matters with shuffle=True
modelCV = XGBClassifier(random_state=0)
scoring = 'roc_auc'
results = cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print(results, "\n")
print("Average AUC score = {:1.4f}; standard deviation = {:1.4f}".format(results.mean(), results.std()))

In [None]:
[0.86923077 0.82460474 0.80281008 0.71929825 0.90636704 0.81754949
 0.83333333 0.72140351 0.64136414 0.85638298] 

Average AUC score = 0.7992; standard deviation = 0.0769

# Fine-tuning

GridSearchCV allows us to fine-tune hyper-parameters by searching over specified parameter values for an estimator. The results from GridSearchCV provide us with fine-tuned hyper-parameters using ROC AUC as the scoring metric. The default parameter values for XGBClassifier are n_estimators=100, max_depth=3, learning_rate=0.1, gamma=0.
See https://xgboost.readthedocs.io/en/latest/python/python_api.html

In [None]:
%%time
xgb_classifier = XGBClassifier(random_state=0)

param_grid = {"n_estimators": [100, 150, 200],
              "max_depth": [2, 3, 4],
              "learning_rate":[0.01, 0.1],
              "gamma": [1,2,3]}

xgb_clf = GridSearchCV(xgb_classifier,
                        param_grid=param_grid,
                        scoring="roc_auc",
                        cv=10)

xgb_clf.fit(X_train, y_train)

print("="*20)
print("best estimator: " + str(xgb_clf.best_estimator_))
print("best params: " + str(xgb_clf.best_params_))
print("best score:", xgb_clf.best_score_)
print("="*20)

In [None]:
best estimator: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=2, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=150,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
best params: {'gamma': 2, 'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 150}
best score: 0.8196073552657408
====================
Wall time: 1min 51s

In [None]:
# get feature importances
importances = xgb_clf.best_estimator_.feature_importances_
df_param_coeff = pd.DataFrame(columns=['Feature', 'Coefficient'])
for i in range(len(X_train.columns)):
    feat = X_train.columns[i]
    coeff = importances[i]
    df_param_coeff.loc[i] = (feat, coeff)
df_param_coeff.sort_values(by='Coefficient', ascending=False, inplace=True)
df_param_coeff = df_param_coeff.reset_index(drop=True)
df_param_coeff.head(10)

In [None]:
Feature	Coefficient
0	MonthlyIncome	0.0750
1	StockOptionLevel	0.0700
2	OverTime	0.0700
3	DailyRate	0.0500
4	DistanceFromHome	0.0500
5	NumCompaniesWorked	0.0500
6	JobSatisfaction	0.0450
7	TrainingTimesLastYear	0.0450
8	YearsAtCompany	0.0425
9	JobInvolvement	0.0375

In [None]:
# plot feature importances
plt.figure(figsize=(16,16))
plt.title("Feature Importances", fontsize=16)
plt.xlim(0,0.11)
feat_importances = pd.Series(xgb_clf.best_estimator_.feature_importances_, index=X_train.columns)
g1 = feat_importances.sort_values().plot(kind="barh", color="#1f77b4")
for i in g1.patches:
    # get_width pulls left or right; get_y pushes up or down
    g1.text(i.get_width()+.001, i.get_y()+.0, "{:1.4f}".format(i.get_width()), fontsize=12)
    
# plt.savefig("xgb_f.png", bbox_inches="tight")

In [None]:
# make predictions
y_pred = xgb_clf.predict(X_test)

In [None]:
# calculate accuracy score
xgb_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of XGBoost Classifier on test set: {:.2f}%'.format(xgb_accuracy*100))

Accuracy of XGBoost Classifier on test set: 88.04%

In [None]:
# Run the code to view the classification report metrics
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)

In [None]:
# calculate AUC score using test dataset
# we will only keep probabilities associated with the employee leaving
y_pred_prob = xgb_clf.predict_proba(X_test)[:, 1]
xgb_roc_auc = roc_auc_score(y_test, y_pred_prob)

print("AUC on test set: {}".format(xgb_roc_auc))

AUC on test set: 0.821896769239208

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred)
print(cm)

# fig, ax = plt.subplots()
# ax.xaxis.set_label_position("top")

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, annot_kws={"size":16}, fmt="d", cbar=False, linewidths=0.1, cmap="Blues")
plt.title('Confusion matrix of XGBoost classifier', fontsize=14)
plt.ylabel('Actual label', fontsize=12)
plt.xlabel('Predicted label', fontsize=12)

# plt.savefig("scatterplots.png", bbox_inches="tight")
plt.show()

# Random Forest Classifier

In [None]:
kfold = KFold(n_splits=10)  # unshuffled folds; random_state only matters with shuffle=True
modelCV = RandomForestClassifier(n_estimators=100, random_state=0)
scoring = 'roc_auc'
results = cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print(results, "\n")
print("Average AUC score = {:1.4f}; standard deviation = {:1.4f}".format(results.mean(), results.std()))

In [None]:
%%time
rf_classifier = RandomForestClassifier(random_state=0)

param_grid = {"n_estimators": [150, 200, 250],
              "min_samples_split":[2,4,6],
              "min_samples_leaf": [1, 2],
#               "max_depth": [5, 10, 15, 20, 25],
              "class_weight": ["balanced", None]}

grid_obj = GridSearchCV(rf_classifier,
                        param_grid=param_grid,
                        scoring="roc_auc",
                        cv=10)

grid_fit = grid_obj.fit(X_train, y_train)
rf_opt = grid_fit.best_estimator_

print("="*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print("best score:", grid_obj.best_score_)
print("="*20)

In [None]:
# get feature importances
importances = grid_obj.best_estimator_.feature_importances_
df_param_coeff = pd.DataFrame(columns=['Feature', 'Coefficient'])
for i in range(len(X_train.columns)):
    feat = X_train.columns[i]
    coeff = importances[i]
    df_param_coeff.loc[i] = (feat, coeff)
df_param_coeff.sort_values(by='Coefficient', ascending=False, inplace=True)
df_param_coeff = df_param_coeff.reset_index(drop=True)
df_param_coeff.head(10)

In [None]:
Feature	Coefficient
0	MonthlyIncome	0.080944
1	Age	0.060090
2	OverTime	0.058972
3	DailyRate	0.054846
4	TotalWorkingYears	0.048805
5	MonthlyRate	0.046522
6	DistanceFromHome	0.044237
7	HourlyRate	0.042248
8	YearsAtCompany	0.040938
9	NumCompaniesWorked	0.032822

In [None]:
# plot feature importances
plt.figure(figsize=(16,16))
plt.title("Feature Importances", fontsize=16)
plt.xlim(0,0.09)
feat_importances = pd.Series(grid_obj.best_estimator_.feature_importances_, index=X_train.columns)
g1 = feat_importances.sort_values().plot(kind="barh", color="#1f77b4")
for i in g1.patches:
    # get_width pulls left or right; get_y pushes up or down
    g1.text(i.get_width()+.001, i.get_y()+.0, "{:1.4f}".format(i.get_width()), fontsize=12)
    
# plt.savefig("rf_f.png", bbox_inches="tight")
plt.show()

In [None]:
# make predictions
y_pred = grid_obj.predict(X_test)

In [None]:
# calculate accuracy score
rf_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Random Forest Classifier on Test set: {:.2f}%'.format(rf_accuracy*100))

Accuracy of Random Forest Classifier on Test set: 85.33%

In [None]:
# Run the code to view the classification report metrics
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)

In [None]:
             precision    recall  f1-score   support

          0       0.85      1.00      0.92       309
          1       0.86      0.10      0.18        59

avg / total       0.85      0.85      0.80       368

In [None]:
# calculate AUC score using test dataset
# we will only keep probabilities associated with the employee leaving
y_pred_prob = grid_obj.predict_proba(X_test)[:, 1]
rf_roc_auc = roc_auc_score(y_test, y_pred_prob)

print("AUC on Test set: {}".format(rf_roc_auc))

AUC on Test set: 0.8028084032691569

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred)
print(cm)

# fig, ax = plt.subplots()
# ax.xaxis.set_label_position("top")

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, annot_kws={"size":16}, fmt="d", cbar=False, linewidths=0.1, cmap="Blues")
plt.title('Confusion matrix of Random Forest classifier', fontsize=14)
plt.ylabel('Actual label', fontsize=12)
plt.xlabel('Predicted label', fontsize=12)

# plt.savefig("rf_cm.png", bbox_inches="tight")
plt.show()

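The confusion matrix tells us that we have 308 + 6 = 314 correct predictions and 53 + 1 = 54 incorrect predictions out of 368, in other words an accuracy of 85.33%. Note, however, that only 6 of the 59 actual leavers are identified (a recall of 0.10 for class 1), so accuracy alone overstates how well the model detects attrition.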
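The arithmetic behind these figures can be reproduced by hand. The sketch below is only an illustration: it hard-codes the four counts quoted above, and the cell layout (`[[TN, FP], [FN, TP]]`) is an assumption inferred from the classification report rather than read from the notebook output.

```python
# Counts quoted in the text; layout assumed to be [[TN, FP], [FN, TP]]
# (rows = actual label, columns = predicted label), inferred from the report.
cm = [[308, 1],
      [53, 6]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all predictions that are correct
recall_leavers = tp / (tp + fn)     # share of actual leavers that were caught
precision_leavers = tp / (tp + fp)  # share of predicted leavers who actually left

print("Total predictions: {}".format(total))
print("Accuracy: {:.2%}".format(accuracy))
print("Recall (class 1): {:.2f}".format(recall_leavers))
print("Precision (class 1): {:.2f}".format(precision_leavers))
```

The rounded values agree with the accuracy and the class 1 row of the classification report.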

# ROC Curves

The ROC (Receiver Operating Characteristic) curve is a performance measurement for classification problems: it plots the true positive rate against the false positive rate at various threshold settings. The AUC (Area Under the Curve) summarises the curve as a single measure of separability, i.e. how well the model is capable of distinguishing between the two classes. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
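To make the idea of separability concrete, here is a minimal pure-Python sketch on a made-up four-row example (the labels and scores are illustrative and not taken from the HR dataset). The helper names `rates_at_threshold` and `pairwise_auc` are hypothetical. The second one uses the rank interpretation of AUC: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting 1 for every score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return fp / neg, tp / pos


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = ex-employee, 0 = active employee
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one point on the ROC curve
for t in (0.8, 0.4, 0.35, 0.1):
    fpr, tpr = rates_at_threshold(y_true, scores, t)
    print("threshold={:.2f}  FPR={:.2f}  TPR={:.2f}".format(t, fpr, tpr))

print("AUC = {:.2f}".format(pairwise_auc(y_true, scores)))
```

In the notebook itself, `roc_curve` and `roc_auc_score` from scikit-learn do this work; the sketch only spells out what those numbers mean.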

# create ROC curves
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, log_gs.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, xgb_thresholds = roc_curve(y_test, xgb_clf.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, grid_obj.predict_proba(X_test)[:,1])
plt.figure(figsize=(12, 8))

# plot XGBoost ROC
plt.plot(xgb_fpr, xgb_tpr, label="XGBoost (AUC = {:1.4f})".format(xgb_roc_auc))
# plot Logistic Regression ROC
plt.plot(fpr, tpr, label="Logistic Regression (AUC = {:1.4f})".format(logit_roc_auc))
# plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label="Random Forest (AUC = {:1.4f})".format(rf_roc_auc))
# plot Baseline ROC
plt.plot([0,1], [0,1],label="Baseline (AUC = 0.5000)", linestyle="--")

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate", fontsize=14)
plt.ylabel("True Positive Rate", fontsize=14)
plt.title("ROC Curve", fontsize=16)
plt.legend(loc="lower right")
# plt.savefig("roc.png", bbox_inches="tight")
plt.show()

As shown above, the fine-tuned XGBoost model showed a higher AUC score compared to Logistic Regression and Random Forest Classifier.

# combine all models' results into one dataframe
data = {"Model": ["XGBoost", "Logistic Regression", "Random Forest"], 
        "Accuracy": [xgb_accuracy, logit_accuracy, rf_accuracy]}

results = pd.DataFrame(data=data)
results


plt.figure(figsize=(12, 6))

plt.title("Accuracy on Test Set", fontsize=16)
g1 = sns.barplot(x="Model", y="Accuracy", data=results)
plt.ylim(0, 1)
g1.set_xlabel("")
g1.tick_params(labelsize=14)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2., height+0.01, "{:1.4f}".format(height), ha="center", fontsize=14) 
    
# plt.savefig("acc_results.png", bbox_inches="tight")
plt.show()

# Discussion and Conclusion

Comparing the 9 different algorithms with default parameter values (before fine-tuning), Logistic Regression has the best accuracy score (0.8730) and the best ROC AUC (0.8141). After fine-tuning the parameter values, XGBoost performed the best, with the highest accuracy (0.8804) and ROC AUC (0.8219) on the test set. This is not surprising, given that XGBoost has recently been dominating Kaggle competitions. According to the author of XGBoost, both XGBoost and GBM follow the principle of gradient boosting; the difference lies in the modeling details. Specifically, XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance. The name XGBoost actually refers to the engineering goal of pushing the limit of computational resources for boosted tree algorithms, which is the reason many people use it. As a model, it might more suitably be called regularized gradient boosting.

# Risk Category, Indicators and Strategic Retention Plan

# Risk Category

As the company generates more data on its employees (on new joiners and recent leavers), the algorithm can be re-trained on the additional data and, in theory, generate more accurate predictions to identify employees at high risk of leaving, based on the predicted probability that the algorithm assigns to each observation (i.e. employee).

Employees can be assigned a "Risk Category" based on the predicted probability of leaving the company:

- Low-risk for employees with probability < 0.6
- Medium-risk for employees with probability between 0.6 and 0.8
- High-risk for employees with probability > 0.8
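The thresholds above could be applied along the following lines. This is a sketch, not the notebook's code: the function name `risk_category` and the sample probabilities are made up for illustration, and it assumes that a probability of exactly 0.6 or 0.8 falls in the medium band. In practice the probabilities would come from something like `predict_proba(X_test)[:, 1]` of the chosen model.

```python
def risk_category(prob_leaving):
    """Map a predicted probability of leaving to a risk band.

    Assumption: the boundaries 0.6 and 0.8 themselves count as Medium-risk.
    """
    if prob_leaving < 0.6:
        return "Low-risk"
    if prob_leaving <= 0.8:
        return "Medium-risk"
    return "High-risk"


# Hypothetical predicted probabilities for five employees
sample_probs = [0.05, 0.59, 0.60, 0.80, 0.93]

for p in sample_probs:
    print("{:.2f} -> {}".format(p, risk_category(p)))
```

With a pandas Series of probabilities, the same mapping could be applied with `Series.apply(risk_category)`.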

# Strategic Retention Plan

The stronger indicators of people leaving include:

- MonthlyIncome: People on higher wages are less likely to leave the company. Hence, efforts should be made to gather information on industry benchmarks in the current local market to determine whether the company is providing competitive wages.
- OverTime: People who work overtime are more likely to leave the company. Hence, efforts must be taken to scope projects appropriately upfront, with adequate support and manpower, so as to reduce the use of overtime.
- Age: Employees in the relatively young age bracket of 25–35 are more likely to leave. Hence, efforts should be made to clearly articulate the long-term vision of the company and how young employees fit in that vision, as well as to provide incentives, for instance in the form of clear paths to promotion.
- DistanceFromHome: Employees who live further from work are more likely to leave the company. Hence, efforts should be made to provide support in the form of company transportation for clusters of employees living in the same area, or in the form of a transportation allowance. Initial screening of employees based on their home location is probably not recommended, as it would be regarded as a form of discrimination so long as employees make it to work on time every day.
- TotalWorkingYears: The more experienced employees are less likely to leave. Employees who have between 5 and 8 years of experience should be identified as potentially having a higher risk of leaving.
- YearsAtCompany: Loyal employees are less likely to leave. Employees who hit their two-year anniversary should be identified as potentially having a higher risk of leaving.
- YearsWithCurrManager: A large number of leavers leave 6 months after their Current Managers. By using Line Manager details for each employee, one can determine which Managers have experienced the largest numbers of employees resigning over the past year.
- WorkLifeBalance: Employees who had "Bad" work-life balance are more likely to leave the company. Hence, efforts should be made to improve the work-life balance of employees, such as limiting the hours of overtime, offering flexible working hours, providing the option of working from home, and so on.

Several metrics can be used here to determine whether action should be taken with a Line Manager:

- Number of years the Line Manager has been in a particular position: this may indicate that the employees may need management training or be assigned a mentor (ideally an Executive) in the organisation.
- Patterns in the employees who have resigned: recurring patterns among leavers may point to a common cause, in which case action may be taken accordingly.
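As a rough illustration of how resignations could be counted per Line Manager, consider the sketch below. The IBM dataset does not include a manager identifier, so the `manager_id` field and all records here are hypothetical. Restricting the count to the past year would additionally need a leaving date, which is omitted.

```python
from collections import Counter

# Hypothetical records: (employee_id, manager_id, attrition), where attrition 1 = left
records = [
    (1, "M01", 1),
    (2, "M01", 0),
    (3, "M02", 1),
    (4, "M02", 1),
    (5, "M03", 0),
    (6, "M02", 0),
]

# Number of leavers and total team size per manager
leavers_per_manager = Counter(mgr for _, mgr, left in records if left == 1)
team_size = Counter(mgr for _, mgr, _ in records)

# Rank managers by number of resignations, and report the share of the team that left
for mgr, n_left in leavers_per_manager.most_common():
    rate = n_left / team_size[mgr]
    print("{}: {} leavers out of {} ({:.0%})".format(mgr, n_left, team_size[mgr], rate))
```

Reporting the rate alongside the raw count helps avoid singling out managers simply because they run larger teams.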

A strategic "Retention Plan" should be drawn up for each Risk Category group. In addition to the suggested steps for each feature listed above, face-to-face meetings between an HR representative and employees can be initiated for medium- and high-risk employees to discuss work conditions. A meeting with those employees' Line Managers would also make it possible to discuss the work environment within the team and whether steps can be taken to improve it.

If you have any feedback for this project, feel free to contact me via my LinkedIn or GitHub Pages.