# HR Employee Attrition Analysis

By Shinin Varongchayakul

Language: R

Dataset: IBM HR Analytics Employee Attrition & Performance

From: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

## Business Questions
1. Attrition Risk by Department & Role
- We’ve been noticing an increase in employee turnover.
- Which departments and job roles have the highest attrition rates?
- Are there any patterns among those leaving (e.g., salary, work experience, satisfaction levels)?

2. Work-Life Balance & Overtime
- Employees have expressed concerns about work-life balance.
- How does overtime impact attrition?
- Are employees who work overtime more likely to leave?

3. Salary vs. Attrition: The Pay Gap Dilemma
- Do employees who earn less tend to leave more frequently?
- What’s the average monthly income of those who stay vs. those who leave?
- Are we paying our high-performing employees enough to retain them?

4. Age & Experience: Who is Most at Risk?
- Are younger employees leaving at a higher rate than older employees?
- How does total working experience influence attrition?

5. Promotion & Career Growth Opportunities
- We want to ensure that employees see long-term career growth in our company.
- How does the time since the last promotion (YearsSinceLastPromotion) relate to attrition?
- Are employees who have gone longer without a promotion more likely to leave?

6. Job Satisfaction vs. Attrition
- How does job satisfaction affect attrition rates?
- Are employees with lower satisfaction scores leaving more often?

7. Remote Work vs. Travel Frequency
- With more employees requesting remote work, does business travel influence attrition?
- Are those who travel frequently more likely to leave?

8. High Performers & Attrition
- Are we losing our top-performing employees?
- How does Performance Rating relate to attrition?

## 1. Import & Load Libraries

In [29]:
# Install
install.packages("tidyverse")
install.packages("data.table")

"package 'tidyverse' is in use and will not be installed"


In [30]:
# Load
library(tidyverse)
library(data.table)
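
The "is in use and will not be installed" message appears because the install call runs even when the package is already available. A minimal sketch of one way to check first, using base R's `requireNamespace()`; the helper name `missing_packages` is my own and not part of the notebook:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this notebook relies on
to_install <- missing_packages(c("tidyverse", "data.table"))
to_install
```

If `to_install` is not empty, it can be passed straight to `install.packages(to_install)`.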

## 2. Load the Dataset

In [31]:
# Load the dataset
hr <- fread("hr_employee_attrition_dataset.csv")
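
As an aside, `fread()` has a `stringsAsFactors` argument that converts character columns to factors while reading. This is an alternative to the manual conversion, not what this notebook does. The sketch below reads a tiny inline CSV (made-up rows, passed through `text =`) rather than the real file:

```r
library(data.table)

# Illustrative CSV content only; the real notebook reads the Kaggle file
csv_text <- "Age,Attrition,Department\n41,Yes,Sales\n49,No,Research & Development\n"

# stringsAsFactors = TRUE turns character columns into factors on load
toy <- fread(text = csv_text, stringsAsFactors = TRUE)

sapply(toy, class)  # Age stays integer; Attrition and Department become factor
```

Integer-coded categories such as Education or JobLevel would still be read as integers, so this option alone would probably not replace an explicit conversion step.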

In [32]:
# Check the result
head(hr)

Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,⋯,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
<int>,<chr>,<chr>,<int>,<chr>,<int>,<int>,<chr>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,⋯,1,80,0,8,0,1,6,4,0,5
49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,⋯,4,80,1,10,3,3,10,7,1,7
37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,⋯,2,80,0,7,3,3,0,0,0,0
33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,⋯,3,80,0,8,3,3,8,7,3,0
27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,⋯,4,80,1,6,3,3,2,2,2,2
32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,⋯,3,80,0,8,2,2,7,7,3,6


In [33]:
# Glimpse the dataset
glimpse(hr)

Rows: 1,470
Columns: 35
$ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
$ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
$ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
$ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
$ Department               <chr> "Sales", "Research & Development", "Research …
$ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
$ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
$ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
$ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13,

I notice that the categorical columns are stored as character or integer and need to be converted to factors.
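
One way to shortlist such columns rather than spotting them by eye is to flag character columns and integer columns with only a few distinct values. This is a rough sketch on a toy table with made-up rows; the `max_levels` cutoff is a hypothetical choice of mine, not something taken from the dataset:

```r
library(data.table)

# Toy table standing in for hr (illustrative values only)
toy <- data.table(
  Attrition = c("Yes", "No", "Yes", "No", "No", "No"),
  JobLevel  = c(2L, 2L, 1L, 1L, 1L, 3L),
  DailyRate = c(1102L, 279L, 1373L, 1392L, 591L, 1005L)
)

# Hypothetical cutoff: integers with this many distinct values or fewer
# are treated as candidate categories
max_levels <- 5L

# Flag character columns and low-cardinality integer columns
is_candidate <- vapply(
  toy,
  function(x) is.character(x) || (is.integer(x) && uniqueN(x) <= max_levels),
  logical(1)
)

candidate_cols <- names(toy)[is_candidate]
candidate_cols  # "Attrition" "JobLevel"
```

A heuristic like this would likely also flag ordinal survey scores that may be better left as integers, so the result is a starting point for manual review rather than a final list.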

## 3. Data Cleaning

In [34]:
# Make a copy of hr
hr_cleaned <- copy(hr)
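
The `copy()` call matters because data.table's `:=` operator modifies a table by reference. With a plain assignment, both names would point to the same table and cleaning one would also change the other. A small sketch on a toy table (made-up values) shows the difference:

```r
library(data.table)

# Toy table standing in for hr (illustrative values only)
toy <- data.table(Attrition = c("Yes", "No"), Age = c(41L, 49L))

alias <- toy        # plain assignment: both names refer to the same table
deep  <- copy(toy)  # copy(): an independent table

# := modifies in place, so the change is visible through `toy` as well
alias[, Attrition := as.factor(Attrition)]

class(toy$Attrition)   # "factor": the original changed via the alias
class(deep$Attrition)  # "character": the copy is untouched
```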

### 3.1 Data Types

In [35]:
# Create a vector of columns to be cast to factor
factor_cols <- c("Attrition", "BusinessTravel", "Department",
                 "Education", "EducationField", "Gender",
                 "JobLevel", "JobRole", "MaritalStatus",
                 "Over18", "OverTime", "StockOptionLevel")

# Convert the columns to factor
hr_cleaned[,
           (factor_cols) := lapply(.SD, as.factor),
           .SDcols = factor_cols]

In [36]:
# Check the results
hr_cleaned[, ..factor_cols] |> glimpse()

Rows: 1,470
Columns: 12
$ Attrition        <fct> Yes, No, Yes, No, No, No, No, No, No, No, No, No, No,…
$ BusinessTravel   <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely, Trav…
$ Department       <fct> Sales, Research & Development, Research & Development…
$ Education        <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2,…
$ EducationField   <fct> Life Sciences, Life Sciences, Other, Life Sciences, M…
$ Gender           <fct> Female, Male, Male, Female, Male, Male, Female, Male,…
$ JobLevel         <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1,…
$ JobRole          <fct> Sales Executive, Research Scientist, Laboratory Techn…
$ MaritalStatus    <fct> Single, Married, Single, Married, Married, Single, Ma…
$ Over18           <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,

All twelve columns have been converted to factors.
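
Beyond eyeballing the `glimpse()` output, the conversion can also be checked programmatically. The sketch below repeats the conversion on a toy table (made-up values and a shortened `factor_cols`), confirms that every listed column is a factor, and counts the levels per column, since a factor with a single level carries no information:

```r
library(data.table)

# Toy table standing in for hr_cleaned (illustrative values only)
toy <- data.table(
  Attrition = c("Yes", "No", "Yes"),
  Over18    = c("Y", "Y", "Y"),
  Age       = c(41L, 49L, 37L)
)

# Shortened column list for the example
factor_cols <- c("Attrition", "Over18")
toy[, (factor_cols) := lapply(.SD, as.factor), .SDcols = factor_cols]

# TRUE only if every listed column is now a factor
all_factors <- all(vapply(toy[, ..factor_cols], is.factor, logical(1)))

# Number of levels per factor; 1 means the column is constant
n_levels <- vapply(toy[, ..factor_cols], nlevels, integer(1))

all_factors
n_levels
```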

### 3.2 Missing Values

In [37]:
# Check if there are missing values
anyNA(hr_cleaned)

There appear to be no missing values.
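
`anyNA()` gives a single yes/no answer for the whole table. If it ever returned `TRUE`, a per-column count would show where the gaps are. A minimal sketch on a toy table with deliberately inserted `NA` values (not taken from the dataset):

```r
library(data.table)

# Toy table with deliberate gaps (illustrative values only)
toy <- data.table(
  Age       = c(41L, NA, 37L),
  Attrition = c("Yes", "No", NA),
  Gender    = c("Female", "Male", "Male")
)

# Count the NA values in each column
na_counts <- vapply(toy, function(x) sum(is.na(x)), integer(1))

# Keep only the columns that contain at least one NA
na_counts[na_counts > 0]
```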

## 4. Exploratory Data Analysis