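# Employee Attrition Analysis
Employee attrition and recruitment are among the key problems that companies struggle with. This analysis examines the company's workforce structure, focusing on basic demographics, income, promotion, satisfaction, performance, and work-life balance. The goal is to identify the main factors that drive attrition so that the human resources team can decide which interventions are most likely to help retain talent. The data used here is a fictional employee attrition dataset created by IBM data scientists.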

Before exploring the data, import the required libraries and load the data.

In [1]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_excel('data/HR-Employee-Attrition.xlsx')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,College,Life Sciences,1,1,...,Low,80,0,8,0,Bad,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,Below College,Life Sciences,1,2,...,Very High,80,1,10,3,Better,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,College,Other,1,4,...,Medium,80,0,7,3,Better,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Master,Life Sciences,1,5,...,High,80,0,8,3,Better,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,Below College,Medical,1,7,...,Very High,80,1,6,3,Better,2,2,2,2


With the data loaded, start with a preliminary exploration: check for missing values, look at summary statistics, and so on.

In [2]:
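# any() flags a column if at least one value is missing;
# all() would only flag columns that are entirely empty
df.isnull().any()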

Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL
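
When checking for missing values, it helps to keep in mind that `isnull().any()` and `isnull().all()` answer different questions: the first asks whether a column has at least one missing value, the second whether every value is missing. The sketch below illustrates the difference on a small made-up frame (the `toy` data and its column names are assumptions for illustration, not part of the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): column "a" has one missing value,
# column "b" is entirely missing, column "c" is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

has_any_missing = toy.isnull().any()   # True if at least one value is missing
all_missing = toy.isnull().all()       # True only if every value is missing
missing_counts = toy.isnull().sum()    # number of missing values per column

print(has_any_missing.to_dict())
print(all_missing.to_dict())
print(missing_counts.to_dict())
```

For a "no missing values" conclusion, `any()` (or a per-column `sum()` of zero) is the check that supports it.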

The output shows that no column contains missing values. Next, look at the summary statistics:

In [3]:
data = df.describe()
data

Unnamed: 0,Age,DailyRate,DistanceFromHome,EmployeeCount,EmployeeNumber,HourlyRate,JobLevel,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,1.0,1024.865306,65.891156,2.063946,6502.931293,14313.103401,2.693197,15.209524,80.0,0.793878,11.279592,2.79932,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,0.0,602.024335,20.329428,1.10694,4707.956783,7117.786044,2.498009,3.659938,0.0,0.852077,7.780782,1.289271,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,30.0,1.0,1009.0,2094.0,0.0,11.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,1.0,491.25,48.0,1.0,2911.0,8047.0,1.0,12.0,80.0,0.0,6.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,1.0,1020.5,66.0,2.0,4919.0,14235.5,2.0,14.0,80.0,1.0,10.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,1.0,1555.75,83.75,3.0,8379.0,20461.5,4.0,18.0,80.0,1.0,15.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,1.0,2068.0,100.0,5.0,19999.0,26999.0,9.0,25.0,80.0,3.0,40.0,6.0,40.0,18.0,15.0,17.0


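The average employee age is about 37 and the median is 36, so the age structure is reasonably balanced. The average monthly income is about 6,500 US dollars while the median is about 4,900, meaning that half of the employees earn less than 4,900 dollars; the median is therefore a better reflection of typical income. Average tenure at the company is 7 years against a median of 5, which suggests that turnover may be on the high side, while a group of long-serving, highly loyal employees pulls the average up. To get a more accurate picture of attrition, look at the attrition rate next.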

In [4]:
data = df.groupby(by='Attrition').size()
data

Attrition
No     1233
Yes     237
dtype: int64
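
The counts above can be turned into an overall attrition rate with one division. This is a minimal sketch that re-enters the two counts from the output by hand; the variable names `counts`, `rates`, and `attrition_rate` are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series({"No": 1233, "Yes": 237}, name="Attrition")

rates = counts / counts.sum()             # share of each class
attrition_rate = round(rates["Yes"], 4)   # fraction of employees who left
print(attrition_rate)                     # 0.1612
```

So roughly 16% of the employees in the dataset left, which is why the two classes are strongly imbalanced.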

In [5]:
from pyecharts import Pie
pie = Pie('Attrition')
attrs = data.index.tolist()
# print(attrs)
values = [data[i] for i in attrs]
# print(values)
pie.add('', attrs, values, is_label_show=True)
pie

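Before digging into how individual factors relate to attrition, first look at the basic profile of current employees, such as the JobLevel distribution, the gender ratio, and job involvement. The JobLevel distribution can be shown with a funnel chart, which gives an intuitive view of the number of employees at each level.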

In [38]:
onjob_df = df[df['Attrition'] == 'No']
data = onjob_df.groupby(by='JobLevel').size()
attrs = [i for i in reversed(data.index.tolist())]
# print(attrs)
values = [data[i] for i in attrs]
# print(values)

In [39]:
from pyecharts import Funnel
funnel = Funnel('Job Level')
funnel.add('', attrs, values, is_label_show=True, funnel_sort='none')
funnel

In [40]:
data = onjob_df.groupby(by='Gender').size()
genderAttrs = data.index.tolist()
genderValues = [data[i] for i in genderAttrs]

data = onjob_df.groupby(by='JobInvolvement').size()
jobInvolvementAttrs = data.index.tolist()
jobInvolvementValues = [data[i] for i in jobInvolvementAttrs]

data = onjob_df.groupby(by='JobSatisfaction').size()
jobSatisfactionAttrs = data.index.tolist()
jobSatisfactionValues = [data[i] for i in jobSatisfactionAttrs]
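
The three near-identical blocks above could also be written as a loop over column names. The following is only a sketch on made-up rows (the `toy` frame and the `pie_data` dictionary are assumptions for illustration); with the real data, `toy` would be replaced by `onjob_df`.

```python
import pandas as pd

# Made-up rows that only mimic the column names used above
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "JobInvolvement": ["High", "Low", "High", "Medium"],
    "JobSatisfaction": ["Low", "Low", "High", "Very High"],
})

# Map each column name to (labels, counts), ready to pass to a pie chart
pie_data = {}
for column in ["Gender", "JobInvolvement", "JobSatisfaction"]:
    sizes = toy.groupby(column).size()
    pie_data[column] = (sizes.index.tolist(), sizes.tolist())

print(pie_data["Gender"])
```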

In [41]:
from pyecharts import Style
style = Style()
pie_style = style.add(label_pos="center", is_label_show=True, label_text_color='#fff', is_legend_show=False)
multiPie = Pie('')
multiPie.add('Gender', genderAttrs, genderValues,
             center=[25, 30], radius=[0, 30], **pie_style)

multiPie.add('JobInvolvement', jobInvolvementAttrs, jobInvolvementValues,
             center=[55, 30], radius=[0, 30], **pie_style)

multiPie.add('JobSatisfaction', jobSatisfactionAttrs, jobSatisfactionValues,
             center=[85, 30], radius=[0, 30], **pie_style)
# Render the combined chart, as the other cells do
multiPie

For attrition, the rate at each JobLevel also matters, so compute it here. A liquid-fill chart is well suited to showing a single proportion, so one is drawn per level:

In [42]:
data = df.groupby(['JobLevel', 'Attrition']).size()
# data

In [43]:
show_datas = []
joblevels = df.groupby('JobLevel').size().index.tolist()
for joblevel in joblevels:
    tmp = data[joblevel]
    title = 'Level ' + str(joblevel)
    total = tmp['Yes'] + tmp['No']
    ratio = round(tmp['Yes'] / total, 4)
    show_datas.append((title, ratio))
# print(show_datas)
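
The per-level ratio computed in the loop above can also be obtained without indexing `'Yes'` and `'No'` by hand. The sketch below uses made-up rows (only the column names match the dataset) and relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset
toy = pd.DataFrame({
    "JobLevel":  [1, 1, 1, 1, 2, 2, 2, 3],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Mean of a boolean column is the fraction of True values
rate_by_level = (
    toy["Attrition"].eq("Yes")
    .groupby(toy["JobLevel"])
    .mean()
    .round(4)
)
show_datas = [(f"Level {level}", rate) for level, rate in rate_by_level.items()]
print(show_datas)
```

One design difference worth noting: a level with no leavers (level 3 in the toy data) simply gets a rate of 0 here, whereas looking up `tmp['Yes']` for such a level would raise a `KeyError`.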

In [44]:
from pyecharts import Liquid, Page

# Collect one liquid-fill chart per JobLevel on a single page
page = Page('Attrition')
for title, value in show_datas:
    liquid = Liquid(title)
    liquid.add('Attrition', [value])
    page.add(liquid)

page

The results show that JobLevel 1 has the highest attrition rate, at 26%. Next, look at how attributes such as JobLevel, Age, Department, and Education relate to attrition.

In [24]:
attrition_type_list = df.groupby(['Attrition']).size().index.tolist()
# print(attrition_type_list)
data = df.groupby(['Attrition', 'JobLevel']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('JobLevel')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, symbol=None, is_smooth=True)
    i += 0.1
line
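
The data preparation in the cell above is repeated almost verbatim for each column in the cells that follow. As a sketch only, it could be factored into a helper; `group_shares` is a hypothetical name introduced here, and the `toy` rows are made up for illustration.

```python
import pandas as pd

def group_shares(frame, column, group_col="Attrition"):
    """Return (group, categories, shares) tuples, one per value of group_col.

    shares[i] is the fraction of rows in that group that fall into
    categories[i], so the shares of each group sum to 1.
    """
    counts = frame.groupby([group_col, column]).size()
    result = []
    for group in counts.index.get_level_values(0).unique():
        sub = counts.loc[group]                      # counts for this group only
        result.append((group, sub.index.tolist(), (sub / sub.sum()).tolist()))
    return result

# Made-up rows; only the column names match the dataset
toy = pd.DataFrame({
    "Attrition": ["No", "No", "No", "No", "Yes", "Yes"],
    "JobLevel":  [1, 2, 2, 3, 1, 1],
})
shares = group_shares(toy, "JobLevel")
print(shares)
```

The returned list has the same `(title, attrs, values)` shape as `show_datas`, so it could be passed to the same plotting loop.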

In [36]:
data = df.groupby(['Attrition', 'Age']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('Age')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, is_smooth=True, is_datazoom_show=True,)
    i += 0.1
line

In [28]:
data = df.groupby(['Attrition', 'Department']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('Department')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, symbol=None, is_smooth=True)
    i += 0.1
line

In [29]:
data = df.groupby(['Attrition', 'Education']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('Education')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, symbol=None, is_smooth=True)
    i += 0.1
line

In [30]:
data = df.groupby(['Attrition', 'BusinessTravel']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('BusinessTravel')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, symbol=None, is_smooth=True)
    i += 0.1
line

In [37]:
data = df.groupby(['Attrition', 'NumCompaniesWorked']).size()
show_datas = []
for attrition_type in attrition_type_list:
    title = attrition_type
    attrs = data[attrition_type].index.tolist()
    total = sum(data[title][i] for i in attrs)
    values = [data[title][i] / total for i in attrs]
    show_datas.append((title, attrs, values))
# print(show_datas)

from pyecharts import Line
line = Line('NumCompaniesWorked')
i = 0
for title, attrs, values in show_datas:
    line.add(title, attrs, values, 
             is_fill=True, line_opacity=0.2 + i, 
             area_opacity=0.4, symbol=None, is_smooth=True)
    i += 0.1
line

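The charts above compare the share that each category makes up within its own group (employees who left versus employees who stayed), rather than raw counts. Because the sample of leavers is much smaller than the sample of stayers, raw counts cannot be compared directly; using proportions reduces the effect of this imbalance. The charts suggest that attrition is higher among employees who have worked at more than five companies and among employees who travel frequently. The influence of the remaining fields can be analyzed in the same way and is left to the reader.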
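
The share of a category within the leavers or stayers group and the attrition rate within a category are two different normalizations of the same table. The sketch below shows both with `pd.crosstab` on made-up rows (the `toy` data is an assumption for illustration; only the column names and labels mimic the dataset).

```python
import pandas as pd

# Made-up rows: 4 frequent travelers (2 left) and 16 rare travelers (2 left)
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently"] * 4 + ["Travel_Rarely"] * 16,
    "Attrition": ["Yes", "Yes", "No", "No"] + ["Yes", "Yes"] + ["No"] * 14,
})

# Share of each travel category within the "Yes" group and within the "No" group
# (each column sums to 1); this matches the comparison used in the charts
share_within_group = pd.crosstab(
    toy["BusinessTravel"], toy["Attrition"], normalize="columns"
)

# Attrition rate within each travel category (each row sums to 1)
rate_within_category = pd.crosstab(
    toy["BusinessTravel"], toy["Attrition"], normalize="index"
)

print(share_within_group)
print(rate_within_category)
```

In the toy data, frequent travelers make up 50% of leavers but only 12.5% of stayers, and their attrition rate is 50% against 12.5% for rare travelers. Both views point the same way here, but only the row-normalized table reads directly as an attrition rate.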