In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline
import seaborn as sns

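# Matplotlib 3.6 renamed 'seaborn-poster' to 'seaborn-v0_8-poster'; use whichever is available
style.use('seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in style.available else 'seaborn-poster')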
style.use("fivethirtyeight")
plt.rcParams['font.family'] = 'serif'

import warnings
warnings.filterwarnings('ignore')

# Introduction

This notebook is a case study of the IBM HR employee attrition dataset. The main goal is to understand which employee demographics contribute the most to attrition and to better understand employee attrition behavior.

Please provide any suggestions or questions about this notebook in the comments below :)

In [None]:
full_data = pd.read_csv("../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
full_data.head()

This dataset is relatively clean, so little data cleaning will be required.
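
Beyond missing values, two other quick checks are duplicated rows and columns that hold a single constant value. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); the same three lines could be pointed at `full_data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for full_data; all values are made up.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 33],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "StandardHours": [80, 80, 80, 80, 80],
})

missing_per_column = toy.isnull().sum()        # NaN count per column
duplicate_rows = int(toy.duplicated().sum())   # rows that repeat an earlier row exactly
constant_columns = [c for c in toy.columns if toy[c].nunique() == 1]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("constant columns:", constant_columns)
```

A constant column carries no information for this kind of analysis, so it can be dropped without loss.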

In [None]:
full_data.isnull().sum()

In [None]:
full_data.Age.describe()

In [None]:
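# Map the numeric education codes to readable labels.
# Assigning the result back avoids a chained in-place replace, which newer pandas may silently ignore.
full_data["Education"] = full_data["Education"].replace({1 : "Below College",
                                                         2 : "College",
                                                         3 : "Bachelors",
                                                         4 : "Masters",
                                                         5 : "PhD"})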

In [None]:
for col in full_data.columns:
    print(col)

I won't need to use all the columns in this dataset. Here I will create a new dataframe to hold only the data I wish to analyze.

In [None]:
# Take an explicit copy so later edits to employee_data do not act on a view of full_data
employee_data = full_data[["Age",
                        "Attrition",
                        "BusinessTravel",
                        "Department",
                        "DistanceFromHome",
                        "Education",
                        "EducationField",
                        "OverTime",
                        "Gender",
                        "JobInvolvement",
                        "JobRole",
                        "JobSatisfaction",
                        "MonthlyIncome",
                        "PerformanceRating",
                        "YearsAtCompany",
                        "YearsInCurrentRole",
                        "YearsSinceLastPromotion"]].copy()

In [None]:
employee_data.head()

In [None]:
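names = ["JobInvolvement", "JobSatisfaction", "PerformanceRating"]

# Map the 1-4 rating codes to labels, assigning back instead of replacing in place on a column slice
for name in names:
    employee_data[name] = employee_data[name].replace({1 : "Low",
                                                       2 : "Medium",
                                                       3 : "High",
                                                       4 : "Very High"})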


In [None]:
# Print the number of distinct values in each column
for col in employee_data.columns:
    print(f"{col}: {employee_data[col].nunique()}")

The first thing I will look for is any pay disparity by gender.
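
Alongside a boxplot, a small numeric summary can make the comparison concrete. This is a minimal sketch on made-up numbers (the `toy` frame is hypothetical); on the real data the same `groupby` call could be run on `employee_data`.

```python
import pandas as pd

# Hypothetical toy data; the incomes are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "MonthlyIncome": [3000, 5000, 7000, 2500, 5000, 9000],
})

# Group size, median, and mean income per gender
income_summary = toy.groupby("Gender")["MonthlyIncome"].agg(["count", "median", "mean"])
print(income_summary)
```

The median is less sensitive than the mean to a few very high salaries, so it is worth reading both.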

In [None]:
ax = sns.boxplot(data = employee_data,
           y = "Gender",
           x = "MonthlyIncome",
           palette = "muted")

ax.set_xlabel("Monthly Income [$ USD]", labelpad = 25)
ax.set_ylabel(None)
ax.set_title("Monthly Income by Gender", pad = 35)

In [None]:
ax = sns.histplot(data = employee_data,
            x = "Age",
            hue = "Attrition",
            palette = "muted")

ax.set_title("Age of Employees", pad = 45)
ax.set_ylabel(None)
ax.set_xlabel("Age in Years", labelpad = 25)

plt.text(x = 45, y = 110, s =  "Average age is 37 years", fontweight = 'semibold')


It seems that the majority of employees who leave the company are under 40 years old.
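
A reading like this from a histogram can be checked with one line of arithmetic. The sketch below uses made-up ages (the `toy` frame is hypothetical) and a cutoff of 40 taken from the sentence above.

```python
import pandas as pd

# Hypothetical toy data; ages and outcomes are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 31, 38, 45, 52, 29, 41, 36],
    "Attrition": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

leavers = toy.loc[toy["Attrition"] == "Yes"]
# The mean of a boolean Series is the fraction of True values
share_under_40 = (leavers["Age"] < 40).mean()
print(f"{share_under_40:.0%} of leavers are under 40")
```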

In [None]:
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "No"].YearsSinceLastPromotion, label = "Still Employed", color = "steelblue")
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "Yes"].YearsSinceLastPromotion, label = "Left Company", color = "red")
ax.set_title("Years since Last Promotion", pad = 35)
ax.set_xlabel("Years")
ax.legend()

In [None]:
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "No"].YearsAtCompany, label = "Still Employed", color = "steelblue")
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "Yes"].YearsAtCompany, label = "Left Company", color = "red")
ax.set_title("Years at Company", pad = 35)
ax.set_xlabel("Years")
ax.legend()

Employees who leave the company don't seem to stay with it for long.

In [None]:
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "No"].DistanceFromHome, label = "Still Employed", color = "steelblue")
ax = sns.kdeplot(x = employee_data.loc[employee_data.Attrition == "Yes"].DistanceFromHome, label = "Left Company", color = "red")
ax.set_title("Distance From Home", pad = 35)
ax.set_xlabel("Distance", labelpad = 25)
ax.legend()

Interestingly, employees who commute a longer distance to work appear to be over-represented among those who leave. Commute distance is a plausible contributing factor to attrition.
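
Density curves show shape but not rates, so one way to test this impression is to bin the distances and compute the attrition rate per bin. This is a sketch under stated assumptions: the `toy` frame is made up, and the band edges (10, 20, 30) are arbitrary choices of mine, not something defined by the dataset.

```python
import pandas as pd

# Hypothetical toy data; distances and outcomes are made up for illustration.
toy = pd.DataFrame({
    "DistanceFromHome": [1, 3, 5, 8, 12, 15, 22, 25, 28, 29],
    "Attrition": ["No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Arbitrary band edges (an assumption): (0, 10], (10, 20], (20, 30]
toy["DistanceBand"] = pd.cut(toy["DistanceFromHome"],
                             bins=[0, 10, 20, 30],
                             labels=["1-10", "11-20", "21-30"])
toy["Left"] = toy["Attrition"].eq("Yes")

# Percentage of employees in each band who left
rate_by_band = toy.groupby("DistanceBand", observed=True)["Left"].mean().mul(100)
print(rate_by_band)
```

If the rate climbs from band to band on the real data, that supports the reading above; if it stays flat, the density plot was mostly reflecting how many people live at each distance.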

In [None]:
def plot_stats(column, title, label):
    
    fig, ax = plt.subplots(1,2, figsize = (20, 9))

    sns.countplot(ax = ax[0], 
                 data = employee_data.loc[employee_data.Attrition == "Yes"],
                 x = column,
                 order = ['Low', 'Medium', 'High', 'Very High'])

    sns.countplot(ax = ax[1], 
                 data = employee_data.loc[employee_data.Attrition == "No"],
                 x = column,
                 order = ['Low', 'Medium', 'High', 'Very High'])

    fig.suptitle(title, fontsize = 25, y = 1.05)
    ax[0].set_title("Left Company", pad = 35)
    ax[1].set_title("Still Employed", pad = 35)

    for i in [0,1]:
        ax[i].set_ylabel(None)
        ax[i].set_xlabel(label, labelpad = 20)

In [None]:
plot_stats("JobSatisfaction", "Job Satisfaction by Attrition", "Job Satisfaction")

As expected, employees who report low job satisfaction contribute to attrition. However, the largest number of departures comes from employees who report a high level of job satisfaction.
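
Raw counts can mislead here, because a satisfaction level with more employees will tend to have more leavers even if its attrition rate is lower. A minimal sketch of the difference, on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 4 "Low" employees (2 left) and 10 "High" employees (3 left).
toy = pd.DataFrame({
    "JobSatisfaction": ["Low"] * 4 + ["High"] * 10,
    "Attrition": ["Yes", "Yes", "No", "No"] + ["Yes"] * 3 + ["No"] * 7,
})

# Raw counts of leavers and stayers per satisfaction level
counts = pd.crosstab(toy["JobSatisfaction"], toy["Attrition"])
# Row-normalised: percentage of each satisfaction level that left or stayed
rates = pd.crosstab(toy["JobSatisfaction"], toy["Attrition"], normalize="index").mul(100)

print(counts)
print(rates.round(1))
```

In this toy example the "High" group has more leavers in absolute terms but a lower attrition rate than the "Low" group, which is the pattern worth checking for on `employee_data`.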

In [None]:
plot_stats("JobInvolvement", "Job Involvement by Attrition", "Job Involvement")

In [None]:
plot_stats("PerformanceRating", "Performance Rating by Attrition", "Performance Rating")

In [None]:
ax = sns.countplot(data = employee_data,
                  x = "Department",
                  hue = "Attrition",
                  palette = "muted")


ax.set_title("Attrition by Department", fontsize = 25, pad = 35)
ax.set_xlabel(None)
ax.set_ylabel(None)

def percent_attrition(column, label):
    column_df = employee_data.loc[employee_data[column] == label]
    index  = column_df.index
    att_df = column_df.loc[column_df.Attrition == "Yes"]
    att_index = att_df.index
    return (len(att_index) / len(index)) * 100
    
# np.str was removed from NumPy; the built-in str does the same job
sales_str = str(np.round(percent_attrition("Department","Sales"),2)) + '% Attrition'
plt.text(x = -0.25 , y = 375, s = sales_str, fontweight = 'semibold')

RD_str = str(np.round(percent_attrition("Department", "Research & Development"),2)) + '% Attrition'
plt.text(x = 0.75 , y = 860, s = RD_str, fontweight = 'semibold')

HR_str = str(np.round(percent_attrition("Department", "Human Resources"),2)) + '% Attrition'
plt.text(x = 1.78 , y = 100, s = HR_str, fontweight = 'semibold')

The Sales department has the largest percentage of attrition. 
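
The `percent_attrition` helper computes one label at a time. As an alternative sketch, the same percentages for every department can be computed in a single grouped expression. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; department sizes and outcomes are made up.
toy = pd.DataFrame({
    "Department": ["Sales"] * 5 + ["Research & Development"] * 10 + ["Human Resources"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No", "No"]
                 + ["Yes"] + ["No"] * 9
                 + ["Yes"] + ["No"] * 3,
})

# Fraction of "Yes" per department, as a percentage, highest first
attrition_pct = (
    toy["Attrition"].eq("Yes")
    .groupby(toy["Department"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
print(attrition_pct)
```

Swapping `"Department"` for `"Gender"` or `"JobRole"` would give the other breakdowns used in this notebook.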

In [None]:
ax = sns.countplot(data = employee_data,
             x = "Gender",
             hue = "Attrition",
             palette = "muted")

ax.set_title("Attrition by Gender", fontsize = 25, pad = 35)
ax.set_ylabel(None)

# np.str was removed from NumPy; use the built-in str
female_str = str(np.round(percent_attrition("Gender","Female"))) + "% Attrition"
plt.text(x = -0.15, y = 545, s = female_str, fontweight = "semibold")

male_str = str(np.round(percent_attrition("Gender","Male"))) + "% Attrition"
plt.text(x = .85 , y = 760, s = male_str, fontweight = "semibold")

Attrition is relatively even between genders.

In [None]:
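# Use the full department name; "Res" matches no rows and returns an empty array
employee_data.loc[employee_data.Department == "Research & Development"].JobRole.unique()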

In [None]:
fig, ax = plt.subplots(1,1, figsize = (12, 20))

sns.countplot(data = employee_data,
            y = "JobRole",
            hue = "Attrition",
            palette = "muted")

ax.set_title("Attrition by Job Role", pad = 35, fontsize = 25)
ax.set_ylabel(None)

def plot_per(column, label, x, y):
    string = str(np.round(percent_attrition(column, label),2)) + "% Attrition"
    plt.text(x = x, y = y, s = string, fontweight = "semibold")
    
plot_per("JobRole", "Sales Executive", 100, -0.19)
plot_per("JobRole", "Research Scientist", 100, .82)
plot_per("JobRole", "Laboratory Technician", 100, 1.85)

plot_per("JobRole", "Sales Representative", 55, 5.85)

We can see that the job roles with the most attrition include Sales Representative, Laboratory Technician, Sales Executive, and Research Scientist.

Let's explore the demographics of each of these job roles.

In [None]:
salesrep_df = employee_data.loc[employee_data.JobRole == "Sales Representative"]
lab_df = employee_data.loc[employee_data.JobRole == "Laboratory Technician"]
salesex_df = employee_data.loc[employee_data.JobRole == "Sales Executive"]
research_df = employee_data.loc[employee_data.JobRole == "Research Scientist"]

In [None]:
fig, ax = plt.subplots(2,2, figsize = (30,30))

sns.kdeplot(ax = ax[0,0], data = salesrep_df.loc[salesrep_df.Attrition == "Yes"],
                 x = "Age",
                label = "Left Company",
                color = "red")
sns.kdeplot(ax = ax[0,0], data = salesrep_df.loc[salesrep_df.Attrition == "No"],
                 x = "Age",
                label = "Still Employed",
                color = "steelblue")

ax[0,0].legend()
ax[0,0].set_title("Age of Sales Representatives", fontsize = 25, pad = 35)

sns.kdeplot(ax = ax[0,1], data = lab_df.loc[lab_df.Attrition == "Yes"],
                 x = "Age",
                label = "Left Company",
                color = "red")
sns.kdeplot(ax = ax[0,1], data = lab_df.loc[lab_df.Attrition == "No"],
                 x = "Age",
                label = "Still Employed",
                color = "steelblue")

ax[0,1].legend()
ax[0,1].set_title("Age of Lab Technicians", fontsize = 25, pad = 35)

sns.kdeplot(ax = ax[1,0], data = salesex_df.loc[salesex_df.Attrition == "Yes"],
                 x = "Age",
                label = "Left Company",
                color = "red")
sns.kdeplot(ax = ax[1,0], data = salesex_df.loc[salesex_df.Attrition == "No"],
                 x = "Age",
                label = "Still Employed",
                color = "steelblue")

ax[1,0].legend()
ax[1,0].set_title("Age of Sales Executives", fontsize = 25, pad = 35)

sns.kdeplot(ax = ax[1,1], data = research_df.loc[research_df.Attrition == "Yes"],
                 x = "Age",
                label = "Left Company",
                color = "red")
sns.kdeplot(ax = ax[1,1], data = research_df.loc[research_df.Attrition == "No"],
                 x = "Age",
                label = "Still Employed",
                color = "steelblue")

ax[1,1].legend()
ax[1,1].set_title("Age of Research Scientists", fontsize = 25, pad = 35)

As expected, most of the employees who leave the company are young in comparison to those who are still employed. The only exception is the Sales Executive category, where we see a large density of 50-60 year olds leaving the company.

In [None]:
print("Total number of Sales Executives that contribute to attrition: ", salesex_df.loc[salesex_df.Attrition == "Yes"].index.size)

In [None]:
def Demographics(df, title):
    
    fig, ax = plt.subplots(2,2, figsize = (20,25))
    sns.histplot(ax = ax[0,0], data = df.loc[df.Attrition == "Yes"],
                     x = "YearsSinceLastPromotion",
                     color = "steelblue")
    sns.countplot(ax = ax[0,1], data = df.loc[df.Attrition == "Yes"],
                 x = "JobSatisfaction",
                 order = ["Low", "Medium", "High", "Very High"],
                 palette = "muted")
    sns.scatterplot(ax = ax[1,0], data = df,
               y = "MonthlyIncome",
               x = "YearsAtCompany",
               hue = "Attrition",
               hue_order = ["Yes", "No"],
               palette = "muted")
    sns.countplot(ax = ax[1,1], data = df.loc[df.Attrition == "Yes"],
                     x = "OverTime",
                     order = ["Yes", "No"],
                     color = "seagreen")

    fig.suptitle(title, fontsize = 25, y = 0.95)

    def axis_titles(ax, title, x, y):
        ax.set_title(title, fontsize = 25, pad = 30)
        ax.set_xlabel(x, fontsize = 20, labelpad = 20)
        ax.set_ylabel(y, fontsize = 20, labelpad = 20)
        
    axis_titles(ax[0,0], "Years Since Last Promotion", "Years", None)
    axis_titles(ax[0,1], "Job Satisfaction", None, None)
    axis_titles(ax[1,0], "Monthly Income vs. Years at Company", "Years at Company", "Monthly Income [$USD]")
    axis_titles(ax[1,1], "Overtime", None, None)

Demographics(salesex_df, "Sales Executives Attrition Demographic Breakdown")

We can see that one possible driver of attrition among Sales Executives is low job satisfaction. Attrition doesn't seem to be correlated with low pay or long tenure. Many of the sales executives who leave the company have gone only a few years without a promotion, so that doesn't seem to be a likely reason. Interestingly, a majority of sales executives who leave the company work overtime, which could also be a reason for attrition.
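
The overtime panel only shows employees who left, so it helps to compare against those who stayed before reading overtime as a driver. A minimal sketch of that baseline comparison, on made-up data (the `toy` frame is hypothetical; on the real data it would be a role subset such as `salesex_df`):

```python
import pandas as pd

# Hypothetical toy data: 4 leavers (3 work overtime) and 8 stayers (2 work overtime).
toy = pd.DataFrame({
    "Attrition": ["Yes"] * 4 + ["No"] * 8,
    "OverTime": ["Yes", "Yes", "Yes", "No"] + ["Yes", "Yes"] + ["No"] * 6,
})

# Column-normalised: within each Attrition group, the percentage working overtime or not.
# Rows are OverTime values, columns are Attrition values.
overtime_share = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="columns").mul(100)
print(overtime_share)

leavers_on_overtime = overtime_share.loc["Yes", "Yes"]   # OverTime == Yes among Attrition == Yes
stayers_on_overtime = overtime_share.loc["Yes", "No"]    # OverTime == Yes among Attrition == No
```

Overtime only stands out as a possible factor if the share among leavers is clearly higher than the share among stayers.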

In [None]:
Demographics(salesrep_df, "Sales Representatives Attrition Demographic Breakdown")

For sales representatives, many report medium job satisfaction and have spent fewer than 6 years with the company.

In [None]:
Demographics(lab_df, "Lab Technician Attrition Demographic Breakdown")

Factors for attrition here are difficult to determine. We do see a large number of low job satisfaction ratings; however, that is expected of all employees who leave the company. Further analysis of this group may be required.

In [None]:
Demographics(research_df, "Research Scientists Attrition Demographic Breakdown")

For research scientists, it is very apparent that many who leave the company have worked overtime. We can also observe that many have gone 3-5 years without a promotion. Both of these could be factors in attrition.

# Conclusions

From my analysis, I have identified four job roles within the company that have the highest percentages of attrition. The factors that may have contributed to attrition are detailed above.

A few courses of action for this company to take include:
* Focus on employees within the Sales department who work overtime and those with low job satisfaction ratings.
* Focus on employees who live farther from work, as they might be more susceptible to attrition.
* Many sales executives who leave the company are older (~50 years); focus on this group and their role within the company.
* Many Research Scientists who leave the company work overtime; focus on those working overtime to make sure they are not burning out.