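# Employee Attrition Prediction
<p>Employee attrition and recruitment have long been among the key problems facing companies. An earlier analysis identified the main factors that drive attrition, helping the human resources team decide which interventions are most likely to retain talent. This time we use the fictional employee attrition dataset created by IBM data scientists to build a model that predicts which employees are likely to leave.</p>
<p>First, import the required libraries and load the data, then do some basic exploration.</p>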

In [13]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_excel('data/HR-Employee-Attrition.xlsx')
df.head()

index,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,College,Life Sciences,1,1,...,Low,80,0,8,0,Bad,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,Below College,Life Sciences,1,2,...,Very High,80,1,10,3,Better,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,College,Other,1,4,...,Medium,80,0,7,3,Better,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Master,Life Sciences,1,5,...,High,80,0,8,3,Better,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,Below College,Medical,1,7,...,Very High,80,1,6,3,Better,2,2,2,2


<p>First, check whether the data contains any missing values.</p>

In [14]:
df.isnull().any()

Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL

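<p>The output shows no missing values, so no imputation is needed.</p>
<p>The Attrition column records whether an employee has left the company, so it serves as the label for the prediction task. The remaining columns serve as features. Start by splitting the data into features and labels.</p>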
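As a complementary check, counting missing values per column shows how much would need imputing if any were present. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions for illustration, not the HR data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "Department": ["Sales", None, "Sales"],
    "DailyRate": [1102, 279, 1373],
})

# Number of missing values per column; a count of 0 means no imputation needed.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data the same two calls would be applied to `df` instead of `toy`.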

In [15]:
labes = df['Attrition']
features_ = df.drop('Attrition', axis=1)
print(labes.shape)
print(features_.shape)

(1470,)
(1470, 34)


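Some columns are categorical and stored as text, such as Department, Education, and WorkLifeBalance. Before training a machine learning model, these features need to be one-hot encoded so that every column is numeric. Columns that are already integer-coded, such as JobLevel, are left unchanged by pd.get_dummies.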

In [16]:
features = pd.get_dummies(features_)
features.shape

(1470, 75)
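To make the effect of one-hot encoding concrete, here is a minimal sketch on a three-row frame. The `toy` frame is an assumption for illustration; `dtype=int` is passed so the indicator columns print as 0/1, since recent pandas versions default to booleans.

```python
import pandas as pd

# Toy frame (illustrative values): two numeric columns and one text column.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "JobLevel": [2, 2, 1],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Only the text column is expanded, one indicator column per category.
# Numeric columns (Age, JobLevel) pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each text column with k distinct values becomes k indicator columns, which is why the feature count grows from 34 to 75 on the full dataset.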

Take a look at the feature data after one-hot encoding.

In [17]:
features.head()

index,Age,DailyRate,DistanceFromHome,EmployeeCount,EmployeeNumber,HourlyRate,JobLevel,MonthlyIncome,MonthlyRate,NumCompaniesWorked,...,PerformanceRating_Excellent,PerformanceRating_Outstanding,RelationshipSatisfaction_High,RelationshipSatisfaction_Low,RelationshipSatisfaction_Medium,RelationshipSatisfaction_Very High,WorkLifeBalance_Bad,WorkLifeBalance_Best,WorkLifeBalance_Better,WorkLifeBalance_Good
0,41,1102,1,1,1,94,2,5993,19479,8,...,1,0,0,1,0,0,1,0,0,0
1,49,279,8,1,2,61,2,5130,24907,1,...,0,1,0,0,0,1,0,0,1,0
2,37,1373,2,1,4,92,1,2090,2396,6,...,1,0,0,0,1,0,0,0,1,0
3,33,1392,3,1,5,56,1,2909,23159,1,...,1,0,1,0,0,0,0,0,1,0
4,27,591,2,1,7,40,1,3468,16632,9,...,1,0,0,0,0,1,0,0,1,0


Then look at the summary statistics of the data.

In [18]:
features.describe()

statistic,Age,DailyRate,DistanceFromHome,EmployeeCount,EmployeeNumber,HourlyRate,JobLevel,MonthlyIncome,MonthlyRate,NumCompaniesWorked,...,PerformanceRating_Excellent,PerformanceRating_Outstanding,RelationshipSatisfaction_High,RelationshipSatisfaction_Low,RelationshipSatisfaction_Medium,RelationshipSatisfaction_Very High,WorkLifeBalance_Bad,WorkLifeBalance_Best,WorkLifeBalance_Better,WorkLifeBalance_Good
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,1.0,1024.865306,65.891156,2.063946,6502.931293,14313.103401,2.693197,...,0.846259,0.153741,0.312245,0.187755,0.206122,0.293878,0.054422,0.104082,0.607483,0.234014
std,9.135373,403.5091,8.106864,0.0,602.024335,20.329428,1.10694,4707.956783,7117.786044,2.498009,...,0.360824,0.360824,0.463567,0.390649,0.404657,0.455692,0.226925,0.30547,0.488477,0.423525
min,18.0,102.0,1.0,1.0,1.0,30.0,1.0,1009.0,2094.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,1.0,491.25,48.0,1.0,2911.0,8047.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,36.0,802.0,7.0,1.0,1020.5,66.0,2.0,4919.0,14235.5,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,43.0,1157.0,14.0,1.0,1555.75,83.75,3.0,8379.0,20461.5,4.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
max,60.0,1499.0,29.0,1.0,2068.0,100.0,5.0,19999.0,26999.0,9.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


EmployeeNumber is only an identifier and generally has little bearing on whether an employee leaves, so it can be dropped.

In [19]:
features.drop(['EmployeeNumber'], axis=1, inplace=True)
features.head()

index,Age,DailyRate,DistanceFromHome,EmployeeCount,HourlyRate,JobLevel,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,...,PerformanceRating_Excellent,PerformanceRating_Outstanding,RelationshipSatisfaction_High,RelationshipSatisfaction_Low,RelationshipSatisfaction_Medium,RelationshipSatisfaction_Very High,WorkLifeBalance_Bad,WorkLifeBalance_Best,WorkLifeBalance_Better,WorkLifeBalance_Good
0,41,1102,1,1,94,2,5993,19479,8,11,...,1,0,0,1,0,0,1,0,0,0
1,49,279,8,1,61,2,5130,24907,1,23,...,0,1,0,0,0,1,0,0,1,0
2,37,1373,2,1,92,1,2090,2396,6,15,...,1,0,0,0,1,0,0,0,1,0
3,33,1392,3,1,56,1,2909,23159,1,11,...,1,0,1,0,0,0,0,0,1,0
4,27,591,2,1,40,1,3468,16632,9,12,...,1,0,0,0,0,1,0,0,1,0


Next, process the labels. They are categorical values, Yes or No, so simply convert them to 1 and 0.

In [22]:
labes = labes.apply(lambda x : 1 if x == 'Yes' else 0)
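An alternative to the lambda is an explicit dictionary mapping. The sketch below uses a made-up `toy_labels` series (an assumption for illustration). Unlike the `else 0` branch, `Series.map` turns any unexpected value into NaN instead of silently coding it as 0, which can make data problems easier to spot.

```python
import pandas as pd

# Toy label column (illustrative values, not the HR data).
toy_labels = pd.Series(["Yes", "No", "Yes", "No", "No"], name="Attrition")

# Explicit mapping: values outside the dictionary would become NaN.
encoded_labels = toy_labels.map({"Yes": 1, "No": 0})
print(encoded_labels.tolist())

# With 1/0 labels, the mean is the share of employees who left.
attrition_rate = encoded_labels.mean()
print(attrition_rate)
```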

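Candidate algorithms are compared using k-fold cross-validation, here with k=10. Feature scale can also affect an algorithm's accuracy, so the features are rescaled to a common scale.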
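To illustrate what putting features on a common scale means, the sketch below standardizes the Age and MonthlyIncome values of the first five rows shown earlier. It assumes `StandardScaler` is the intended rescaling method; the array `X` and the manual formula are added only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and MonthlyIncome for the first five employees (from the head() output).
X = np.array([
    [41, 5993],
    [49, 5130],
    [37, 2090],
    [33, 2909],
    [27, 3468],
], dtype=float)

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(X)

# The same computation written out by hand.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.round(2))
```

After scaling, both columns have mean 0 and unit variance, so a feature measured in thousands no longer dominates one measured in tens.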

In [24]:
n_splits = 10
seed = 7

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
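One way these imports might fit together is sketched below. It is not the notebook's actual continuation: the synthetic data from `make_classification` stands in for `features` and `labes`, and `LogisticRegression` is an assumed placeholder for whichever algorithms are compared.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_splits = 10
seed = 7

# Synthetic stand-in for the features and labels (assumption, not the HR data).
X, y = make_classification(n_samples=300, n_features=10, random_state=seed)

# Putting the scaler inside the pipeline means it is re-fit on the training
# folds only, so no information from the held-out fold leaks into scaling.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),  # placeholder algorithm
])

# random_state only takes effect when shuffle=True.
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

# One accuracy score per fold.
scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")
print(len(scores))
print(scores.mean(), scores.std())
```

With the real data, `X` and `y` would be replaced by `features` and `labes`. Because leavers are a minority class in attrition data, accuracy alone can be misleading, so a metric such as `scoring="roc_auc"` may be worth comparing.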