# 1. Introduction

Employee attrition is a problem every business faces. While the primary reasons for attrition differ, some themes may be consistent across job functions. Using a dataset from IBM, I will explore how attrition differs across departments and whether a specific factor (here, work/life balance) significantly influences attrition. I'll check whether the relevant data is normally distributed and run an independent-samples t-test to assess whether the findings are statistically significant.

This research can benefit senior leadership within IBM, including HR. It may also be beneficial to other organizations as they benchmark their own employee attrition.

# 2. Hypothesis/Research

I'll first explore which department has the highest attrition rate. Then I'll look at the impact of work/life balance on attrition.


Ho: There is no significant difference in the average attrition rate between employees with favorable work/life balance and those with poor work/life balance.

Ha: There is a significant difference in the average attrition rate between employees with favorable work/life balance and those with poor work/life balance.
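
A minimal sketch of how these hypotheses could be tested, assuming the `Attrition` and `WorkLifeBalance` columns from the dataset. The function name `compare_attrition_by_wlb` and the `poor_levels` parameter are hypothetical, and which `WorkLifeBalance` levels count as "poor" is an assumption made here for illustration, not something the dataset dictates. The sketch uses Welch's variant of the independent-samples t-test (`equal_var=False`), which does not assume equal variances in the two groups.

```python
import pandas as pd
from scipy import stats


def compare_attrition_by_wlb(df, poor_levels=(1, 2)):
    """Compare mean attrition between poor and favorable work/life balance.

    poor_levels is an illustrative assumption: the WorkLifeBalance
    values treated as "poor". All other values count as "favorable".
    """
    # Encode attrition as 1 (left) / 0 (stayed) so the mean is a rate.
    attrition = df["Attrition"].map({"Yes": 1, "No": 0})
    is_poor = df["WorkLifeBalance"].isin(list(poor_levels))

    poor = attrition[is_poor]
    favorable = attrition[~is_poor]

    # Welch's t-test: independent samples, unequal variances allowed.
    result = stats.ttest_ind(poor, favorable, equal_var=False)
    return poor.mean(), favorable.mean(), result.statistic, result.pvalue
```

Ho would be rejected at the 5% level if the returned p-value falls below 0.05.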

# 3. Data

The dataset comes from Kaggle and contains 1,470 observations and 35 variables, with no missing values. Sales has the highest attrition rate (92 of 446 employees, about 20.6%), and the sample looks large enough to analyze further.

In [2]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

In [3]:
# raw string so the backslashes in the Windows path are not read as escape sequences
path = r'C:\Users\Raj.Mehta\OneDrive - Wolters Kluwer\Downloads\WA_Fn-UseC_-HR-Employee-Attrition.csv'
df = pd.read_csv(path)

In [4]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EmployeeNumber              1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvement              1470 non-null int64
JobLevel                    1470 non-null int64
JobRole                     1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome         

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [11]:
# counts for attrition across departments
df.groupby('Department')['Attrition'].value_counts()

Department              Attrition
Human Resources         No            51
                        Yes           12
Research & Development  No           828
                        Yes          133
Sales                   No           354
                        Yes           92
Name: Attrition, dtype: int64
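
The counts are easier to compare as rates. Here is a small sketch that recomputes the attrition rate per department from the counts printed above; the `counts` and `rates` names are just for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.DataFrame(
    {"No": [51, 828, 354], "Yes": [12, 133, 92]},
    index=["Human Resources", "Research & Development", "Sales"],
)

# Attrition rate = employees who left / all employees in the department.
rates = counts["Yes"] / counts.sum(axis=1)
print(rates.round(4))
```

Sales comes out highest at roughly 20.6%, followed by Human Resources at about 19.0% and Research & Development at about 13.8%.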

In [6]:
# create a new column that encodes Yes/No as 1/0 to facilitate analysis
df['Attrition_var'] = df['Attrition'].map({'Yes': 1, 'No': 0})

df.Department.unique()
df_sales = df[(df['Department'] == 'Sales')]
df_research = df[(df['Department'] == 'Research & Development')]
df_hr = df[(df['Department'] == 'Human Resources')]


In [10]:
print(stats.describe(df_sales['Attrition_var']))
print(stats.describe(df_research['Attrition_var']))
print(stats.describe(df_hr['Attrition_var']))

DescribeResult(nobs=446, minmax=(0, 1), mean=0.2062780269058296, variance=0.16409532926890713, skewness=1.451796505232823, kurtosis=0.10771309260623818)
DescribeResult(nobs=961, minmax=(0, 1), mean=0.1383975026014568, variance=0.11936784599375651, skewness=2.0943237402794512, kurtosis=2.386191929098107)
DescribeResult(nobs=63, minmax=(0, 1), mean=0.19047619047619047, variance=0.15668202764976966, skewness=1.5764815627361632, kurtosis=0.48529411764705577)
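
Because `Attrition_var` only takes the values 0 and 1, the mean in each result is the department's attrition rate, and the variance follows directly from that rate and the sample size. The sketch below checks this for Sales by rebuilding the 0/1 vector from the counts (92 "Yes", 354 "No"); the helper name `binary_sample_variance` is hypothetical.

```python
import numpy as np


def binary_sample_variance(p, n):
    """Sample variance (ddof=1) of n zero/one values with mean p."""
    return p * (1 - p) * n / (n - 1)


# Rebuild the Sales indicator from its counts: 92 leavers, 354 stayers.
sales = np.array([1] * 92 + [0] * 354)

p_sales = sales.mean()                 # attrition rate
formula_var = binary_sample_variance(p_sales, sales.size)
direct_var = sales.var(ddof=1)         # same estimator stats.describe uses by default

print(p_sales, formula_var, direct_var)
```

Both variance values agree with the 0.1641 reported for Sales above, so for a binary variable the variance adds no information beyond the rate and the count.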
