# Employee Retention Analysis

The employee retention data comes from the "IBM HR Analytics Employee Attrition & Performance" dataset on Kaggle:

https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

The features include Education, Job Involvement, Job Satisfaction, Performance Rating, and Work-Life Balance, among others.

See the data source for more detail.

## Purpose

I want to understand which features correlate with employee attrition. This information can help predict which employees may be planning to leave their position. Once they are identified, the company can talk with them to learn whether there are aspects of their job or career path that they would like to see improved. Identifying at-risk employees early helps both the company and the employees: with proper intervention, the employee can improve their job and life satisfaction, and the company can retain its top talent and minimize hiring and training costs.

## ETL

### Import Data and Load Libraries

In [2]:
import altair as alt
import pandas as pd
#from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500) 


In [3]:
df = pd.read_csv('EmployeeRetention.csv')
typeDict = {'Age':'int16',
            'Attrition':'category',
            'BusinessTravel':'category',
            'DailyRate':'int32',
            'Department':'category',
            'DistanceFromHome':'int16',
            'Education':'category',
            'EducationField':'category',
            'EmployeeCount':'int32',
            'EmployeeNumber':'int32',
            'EnvironmentSatisfaction':'category',
            'Gender':'category',
            'HourlyRate':'int32',
            'JobInvolvement':'category',
            'JobLevel':'category',
            'JobRole':'category',
            'JobSatisfaction':'category',
            'MaritalStatus':'category',
            'MonthlyIncome':'int32',
            'MonthlyRate':'int32',
            'NumCompaniesWorked':'int16',
            'Over18':'category',  # keep the raw 'Y'/'N' strings; converted to bool further down
            'OverTime':'category',
            'PercentSalaryHike':'int16',
            'PerformanceRating':'category',
            'RelationshipSatisfaction':'category',
            'StandardHours':'int16',
            'StockOptionLevel':'category',
            'TotalWorkingYears':'int16',
            'TrainingTimesLastYear':'int16',
            'WorkLifeBalance':'category',
            'YearsAtCompany':'int16',
            'YearsInCurrentRole':'int16',
            'YearsSinceLastPromotion':'int16',
            'YearsWithCurrManager':'int16'
            }
df = df.astype(typeDict)



## Map the integer codes to descriptive labels. Before training a model, re-encode these features with OrdinalEncoder
attrition = {'Yes': True,
             'No': False}

Over18 = {'Y': True,
          'N': False}

OverTime = {'Yes': True,
             'No': False}

education = {1:'High School',
             2:'College',
             3:'Bachelor',
             4:'Master',
             5:'Doctor'}

environmentSatisfaction = {1:'Low',
                           2:'Medium',
                           3:'High',
                           4:'Very High'}

jobInvolvement = {1:'Low',
                  2:'Medium',
                  3:'High',
                  4:'Very High'}

jobSatisfaction = {1:'Low',
                   2:'Medium',
                   3:'High',
                   4:'Very High'}

performanceRating = {1:'Low',
                     2:'Good',
                     3:'Excellent',
                     4:'Outstanding'}

relationshipSatisfaction = {1:'Low',
                            2:'Medium',
                            3:'High',
                            4:'Very High'}

workLifeBalance = {1:'Bad',
                   2:'Good',
                   3:'Better',
                   4:'Best'}


## Defining the order of the ordinal features
educationOrder = ['High School','College','Bachelor','Master','Doctor']
ordinalOrder = ['Low','Medium','High','Very High']
performanceOrder = ['Low','Good','Excellent','Outstanding']
workOrder = ['Bad','Good','Better','Best']
jobOrder = [1,2,3,4,5]
stockOrder = [0,1,2,3]

## Apply the mappings (each column uses its own dictionary)
df['Attrition'] = df['Attrition'].map(attrition)
df['Over18'] = df['Over18'].map(Over18)
df['OverTime'] = df['OverTime'].map(OverTime)
df['Education'] = df['Education'].map(education)
df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].map(environmentSatisfaction)
df['JobInvolvement'] = df['JobInvolvement'].map(jobInvolvement)
df['JobSatisfaction'] = df['JobSatisfaction'].map(jobSatisfaction)
df['PerformanceRating'] = df['PerformanceRating'].map(performanceRating)
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].map(relationshipSatisfaction)
df['WorkLifeBalance'] = df['WorkLifeBalance'].map(workLifeBalance)

## Cast to boolean and ordered categorical dtypes
df['Attrition'] = df['Attrition'].astype('bool')
df['Over18'] = df['Over18'].astype('bool')
df['OverTime'] = df['OverTime'].astype('bool')
df['Education'] = df['Education'].astype(pd.api.types.CategoricalDtype(categories = educationOrder, ordered = True))
df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].astype(pd.api.types.CategoricalDtype(categories = ordinalOrder, ordered = True))
df['JobInvolvement'] = df['JobInvolvement'].astype(pd.api.types.CategoricalDtype(categories = ordinalOrder, ordered = True))
df['JobSatisfaction'] = df['JobSatisfaction'].astype(pd.api.types.CategoricalDtype(categories = ordinalOrder, ordered = True))
df['PerformanceRating'] = df['PerformanceRating'].astype(pd.api.types.CategoricalDtype(categories = performanceOrder, ordered = True))
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].astype(pd.api.types.CategoricalDtype(categories = ordinalOrder, ordered = True))
df['WorkLifeBalance'] = df['WorkLifeBalance'].astype(pd.api.types.CategoricalDtype(categories = workOrder, ordered = True))
df['JobLevel'] = df['JobLevel'].astype(pd.api.types.CategoricalDtype(categories = jobOrder, ordered = True))
df['StockOptionLevel'] = df['StockOptionLevel'].astype(pd.api.types.CategoricalDtype(categories = stockOrder, ordered = True))
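
The comment above mentions re-encoding the labelled features with `OrdinalEncoder` before modelling. A minimal sketch of what that step could look like, assuming scikit-learn is installed; the `toy` frame is a made-up stand-in for two of the columns, and only the order lists come from this notebook:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for two of the labelled columns.
toy = pd.DataFrame({
    'Education': ['Bachelor', 'High School', 'Doctor', 'College'],
    'WorkLifeBalance': ['Better', 'Bad', 'Best', 'Good'],
})

# Same order lists as defined in the cell above.
educationOrder = ['High School', 'College', 'Bachelor', 'Master', 'Doctor']
workOrder = ['Bad', 'Good', 'Better', 'Best']

# One category list per column, in the same order as the columns passed to fit.
encoder = OrdinalEncoder(categories=[educationOrder, workOrder])
encoded = encoder.fit_transform(toy[['Education', 'WorkLifeBalance']])

# Each label becomes its position in the corresponding order list.
print(encoded)
```

Passing the order lists explicitly matters here: without them the encoder would sort the labels alphabetically, which would not respect the intended ranking.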


In [4]:
df.describe(include = 'all')

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470,1470,1470.0,1470,1470.0,1470,1470,1470.0,1470.0,1470,1470,1470.0,1470,1470.0,1470,1470,1470,1470.0,1470.0,1470.0,1470,1470,1470.0,1470,1470,1470.0,1470.0,1470.0,1470.0,1470,1470.0,1470.0,1470.0,1470.0
unique,,2,3,,3,,5,6,,,4,2,,4,5.0,9,4,3,,,,1,2,,2,4,,4.0,,,4,,,,
top,,False,Travel_Rarely,,Research & Development,,Bachelor,Life Sciences,,,High,Male,,High,1.0,Sales Executive,Very High,Married,,,,True,False,,Excellent,High,,0.0,,,Better,,,,
freq,,1233,1043,,961,,572,606,,,453,882,,868,543.0,326,459,673,,,,1470,1054,,1244,459,,631.0,,,893,,,,
mean,36.92381,,,802.485714,,9.192517,,,1.0,1024.865306,,,65.891156,,,,,,6502.931293,14313.103401,2.693197,,,15.209524,,,80.0,,11.279592,2.79932,,7.008163,4.229252,2.187755,4.123129
std,9.135373,,,403.5091,,8.106864,,,0.0,602.024335,,,20.329428,,,,,,4707.956783,7117.786044,2.498009,,,3.659938,,,0.0,,7.780782,1.289271,,6.126525,3.623137,3.22243,3.568136
min,18.0,,,102.0,,1.0,,,1.0,1.0,,,30.0,,,,,,1009.0,2094.0,0.0,,,11.0,,,80.0,,0.0,0.0,,0.0,0.0,0.0,0.0
25%,30.0,,,465.0,,2.0,,,1.0,491.25,,,48.0,,,,,,2911.0,8047.0,1.0,,,12.0,,,80.0,,6.0,2.0,,3.0,2.0,0.0,2.0
50%,36.0,,,802.0,,7.0,,,1.0,1020.5,,,66.0,,,,,,4919.0,14235.5,2.0,,,14.0,,,80.0,,10.0,3.0,,5.0,3.0,1.0,3.0
75%,43.0,,,1157.0,,14.0,,,1.0,1555.75,,,83.75,,,,,,8379.0,20461.5,4.0,,,18.0,,,80.0,,15.0,3.0,,9.0,7.0,3.0,7.0


In [5]:
df.value_counts('Attrition')

Attrition
False    1233
True      237
dtype: int64
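
To put the imbalance in perspective, the counts above can be turned into rates. This is a small arithmetic sketch using only the two numbers printed above; the variable names are my own:

```python
# Counts copied from the value_counts output above.
stayed, left = 1233, 237
total = stayed + left

attrition_rate = left / total        # share of employees who left
majority_baseline = stayed / total   # accuracy of always predicting "stays"
imbalance_ratio = stayed / left      # employees who stayed per employee who left

print(f"attrition rate:    {attrition_rate:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
```

A model that always predicts "no attrition" would already be right about 84% of the time, so accuracy alone is probably not a useful metric for this problem.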

#### Additional Cleaning Based on the Summary Above

**Things to note:**

- [ ] Labels are unbalanced (237 True: 1233 False)
- [ ] Is MonthlyIncome collinear with MonthlyRate?
- [x] EmployeeCount has a value of 1 for every row. Removed.
- [x] StandardHours has a value of 80 for every row. Removed.

In [6]:
del df['EmployeeCount'] #All values were equal to 1
del df['StandardHours'] #All values were equal to 80
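
The two checks from the list above (constant columns and possible collinearity) could also be done programmatically. A rough sketch on made-up data: the column names follow the dataset, but the values in `toy` are invented for illustration, so the printed correlation says nothing about the real data:

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    'EmployeeCount': [1, 1, 1, 1, 1],
    'StandardHours': [80, 80, 80, 80, 80],
    'MonthlyIncome': [2000, 3500, 5000, 7500, 9000],
    'MonthlyRate': [15000, 4000, 22000, 9000, 12000],
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# A Pearson correlation near +1 or -1 would suggest the two pay columns are collinear.
pay_corr = toy['MonthlyIncome'].corr(toy['MonthlyRate'])

print(constant_cols)
print(round(pay_corr, 3))
```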

## Exploratory Data Analysis

In [19]:
continuous_field = list(df.select_dtypes(include='number').columns)

alt.Chart(df).mark_point().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color = 'Attrition',
    opacity = alt.value(0.1)
).properties(
    width = 200,
    height = 200
).repeat(
    row = continuous_field,
    column = continuous_field[::-1]
)
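
A 15 x 15 scatter matrix is hard to read at a glance, so a numeric ranking can complement it. The sketch below correlates each numeric column with the boolean label (Pearson correlation against a 0/1 target, also called point-biserial correlation). The `toy` frame and its values are hypothetical; on the real data the same three lines would run on `df`:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    'Attrition': [True, False, False, True, False, False],
    'Age': [22, 45, 38, 25, 50, 41],
    'MonthlyIncome': [2100, 9000, 6500, 2500, 12000, 7000],
    'DistanceFromHome': [20, 3, 7, 18, 2, 9],
})

numeric = toy.select_dtypes(include='number')   # excludes the boolean label
target = toy['Attrition'].astype(int)           # True/False -> 1/0

# Correlate every numeric column with the label, strongest first.
corr = numeric.corrwith(target).sort_values(key=abs, ascending=False)
print(corr)
```

This only captures linear relationships with the label, so it is a first pass rather than a replacement for the plots.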

### Violin Plots

In [47]:
## VIOLIN PLOT
yAxis = 'Age'

alt.Chart(df).transform_density(
    yAxis,
    as_=[yAxis, 'density'],
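    extent=[float(df[yAxis].min()), float(df[yAxis].max())],  # cover the observed range; [5, 50] clipped values above 50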
    groupby=['Attrition']
).mark_area(orient='horizontal').encode(
    y=str(yAxis+":Q"),
    color='Attrition:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    column=alt.Column(
        'Attrition:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    )
).properties(
    width=100
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).interactive()

### Swarm Plot

In [68]:
yAxis = 'Age'
alt.Chart(df).mark_circle(size=3).encode(
    x=alt.X(
        'jitter:Q',
        title = None,
        axis = alt.Axis(values=[0], ticks = True, grid = False, labels = False),
        scale = alt.Scale(),
    ),
    y = alt.Y(str(yAxis+":Q")),
    color = alt.Color('Attrition:N',legend=None),
    column = alt.Column(
        'Attrition:N',
        header = alt.Header(
            labelAngle = -90,
            titleOrient='top',
            labelOrient = 'bottom',
            labelAlign = 'right',
            labelPadding = 3,
        ),
    ),
).transform_calculate(
    jitter = 'sqrt(-1000*log(random()))*cos(2*PI*random())'
).configure_facet(
    spacing = 0
).configure_view(
    stroke = None
).properties(height = 300, width = 100
).interactive()
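
The `jitter` expression above has the form of a Box-Muller transform, which turns two uniform random numbers into a normally distributed one. With `-1000` in place of the usual `-2`, the horizontal offsets should have a standard deviation of about `sqrt(500)`, roughly 22.4. A plain-Python sketch to check that claim; the `jitter` function and its `scale` parameter are my own names, not part of Altair or Vega:

```python
import math
import random
import statistics

def jitter(rng, scale=1000.0):
    """Mirror the Vega expression sqrt(-scale*log(u1))*cos(2*PI*u2)."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-scale * math.log(u1)) * math.cos(2 * math.pi * u2)

rng = random.Random(0)  # seeded for reproducibility
samples = [jitter(rng) for _ in range(20000)]

sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
expected_std = math.sqrt(1000 / 2)  # Box-Muller uses -2, so the scale factor is sqrt(1000/2)

print(f"mean ~ {sample_mean:.2f}, std ~ {sample_std:.2f}, expected std ~ {expected_std:.2f}")
```

Since the points are normally distributed around the centre of each column, most of them cluster in the middle and the spread thins out towards the edges, which is what gives the plot its swarm-like shape.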

In [42]:
continuous_field


['Age',
 'DailyRate',
 'DistanceFromHome',
 'EmployeeNumber',
 'HourlyRate',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']