<img src="http://inostix.com/blog/wp-content/uploads/2014/01/iquit.jpg"/>


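# Employee Turnover Analysis

*"I QUIT!" These are the last words a company wants to hear from its employees. In a sense, a company is its employees: they do the work, and they shape the company's culture. A company may be successful over the long term, but if it experiences high employee turnover, something is wrong. High turnover can cost the company a great deal of money as creative and valuable employees leave.*

*A company that maintains a healthy organization and culture is a good sign of future prosperity. Recognizing and understanding the factors associated with employee turnover allows companies and individuals to limit it, and may even improve employee productivity and growth. These predictive insights give managers the opportunity to take corrective action to build and maintain a successful business.*


**My goal is to understand which factors contribute most to employee turnover and to build a model that predicts whether a given employee will leave the company.**

*The modeling and predictive analysis proceeds in the following steps:*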

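1. Obtaining the data is the first step in solving the problem. I download the dataset from Kaggle and import it into my working environment as a CSV file.

2. The next step is scrubbing (cleaning) the data. This includes imputing missing or invalid values and fixing column names.

3. Exploratory data analysis follows, to get a better understanding of what the dataset contains. Look for outliers or odd data. This is where we examine the relationship between each explanatory variable and the response variable, which we can do with a correlation matrix. Features may be created or removed through feature engineering. Plots also play an important role here, because they give a visual sense of how the variables interact and whether their relationships are linear or nonlinear. Taking the time to examine and understand the dataset suggests which type of predictive model to use.

4. Modeling the data gives us the ability to predict whether an employee will leave. Candidate model types include RF, SVM, LM, and GBM. Cross-validation is used here to check model accuracy and, where necessary, to tune the model's hyperparameters. Once predictions have been made and their uncertainty assessed, optimization comes next. The feature selection offered by RandomForest can also be used. A confusion matrix summarizes model accuracy through counts such as true positives and true negatives, and an ROC curve plots the true positive rate against the false positive rate across decision thresholds. It is also important to understand the reasoning behind choosing a model for this problem.

5. Interpreting the data is the final step. Given all the results and analysis, what conclusions can be drawn? Which factors contribute most to employee turnover? What relationships between variables were found? If the model's accuracy on the training set is much higher than on the test set, the model may be overfitting. Ways to prevent overfitting include collecting more data, choosing simpler models, cross-validation, regularization, using ensemble methods, and better parameter tuning. Give a brief overview of the important features that affect the model. How can we improve the model in the future? After prediction, there is an opportunity to learn more about the features and ask more questions. Which subsets of the data contribute most to the predictions? Which poor features are not helping the model? What makes a feature a good predictor?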
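
The overfitting check mentioned in step 5 can be sketched by comparing training and test accuracy. The snippet below is only an illustration: it uses synthetic data from `make_classification` rather than the HR dataset, and the decision trees are stand-ins chosen because an unconstrained tree overfits visibly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the HR dataset): 1000 rows, 10 features, 10% label noise
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc

# An unconstrained tree can memorize the training set; a shallow tree cannot
deep = train_test_gap(DecisionTreeClassifier(random_state=0))
shallow = train_test_gap(DecisionTreeClassifier(max_depth=3, random_state=0))
print("deep tree    (train, test, gap):", deep)
print("shallow tree (train, test, gap):", shallow)
```

A large gap between training and test accuracy is the usual warning sign; limiting model complexity (here `max_depth`) is one of the remedies listed in step 5.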



# <font color='red'>Step 1: Obtaining the Data </font>

In [None]:
# Import the necessary modules for data manipulation and visual representation
%matplotlib inline
import pandas as pd
import seaborn as sns

In [None]:
#Read the analytics csv file and store our dataset into a dataframe called "df"
df = pd.read_csv('/kaggle/input/hr-comma-sepcsv/HR_comma_sep.csv', index_col=None)
df

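# <font color='red'>Step 2: Scrubbing the Data </font>

*Cleaning data usually takes a lot of work and can be a very tedious process. This dataset from Kaggle is very clean and contains no missing values. Even so, I still have to inspect the dataset to make sure everything else is readable and that the observed values match their feature names.*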


In [None]:
# Check to see if there are any missing values in our data set
df.isnull().any()
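
This dataset has no missing values, so no imputation is needed here. For reference, the imputation mentioned in the plan could look like the sketch below. The toy frame and its gaps are made up for illustration, and filling with the median or the mode is just one common choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (the real HR dataset has none)
toy = pd.DataFrame({
    'satisfaction_level': [0.38, np.nan, 0.11, 0.72],
    'salary': ['low', 'medium', None, 'low'],
})

# Count missing values per column
print(toy.isnull().sum())

# Numeric column: fill with the median; categorical column: fill with the most frequent value
toy['satisfaction_level'] = toy['satisfaction_level'].fillna(toy['satisfaction_level'].median())
toy['salary'] = toy['salary'].fillna(toy['salary'].mode()[0])
print(toy)
```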

In [None]:
# Get a quick overview of what we are dealing with in our dataset
df.head()



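### <font color='red'> 2a. Labeling </font>
*When looking at data, I usually make sure the column names are easy to read. This process is called labeling. Labeling columns properly and consistently is a good way to understand the problem, because it shows which features you are working with and encourages potential feature engineering. As the saying goes, "garbage in, garbage out."*

### Lookup table for department and salary values:
#### Department:
*{0} sales **/** {1} accounting **/** {2} hr **/** {3} technical **/** {4} support **/** {5} management **/** {6} IT **/** {7} product_mng **/** {8} marketing **/** {9} RandD*
#### Salary:
*{0} low / {1} medium / {2} high*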

In [None]:
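# Encode department ("sales" column) and salary as integer codes; assign the result back
# instead of using a chained in-place replace, which newer pandas versions may not apply to df
departments = ['sales', 'accounting', 'hr', 'technical', 'support', 'management',
               'IT', 'product_mng', 'marketing', 'RandD']
df['sales'] = df['sales'].map({name: code for code, name in enumerate(departments)})
df['salary'] = df['salary'].map({'low': 0, 'medium': 1, 'high': 2})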
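
Integer codes give the departments an artificial order (for example, "marketing" = 8 looks larger than "hr" = 2), which distance-based and linear models may pick up on. One alternative is one-hot encoding for the department while keeping ordered codes for salary. The sketch below uses a made-up mini-sample with the dataset's original column names; the `dept` prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical mini-sample using the dataset's original column names
toy = pd.DataFrame({
    'sales': ['sales', 'IT', 'hr', 'sales'],
    'salary': ['low', 'high', 'medium', 'low'],
})

# Salary has a natural order, so integer codes are reasonable
toy['salary'] = toy['salary'].map({'low': 0, 'medium': 1, 'high': 2})

# Department has no natural order: create one 0/1 indicator column per department
encoded = pd.get_dummies(toy, columns=['sales'], prefix='dept', dtype=int)
print(encoded)
```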

In [None]:
# Rename the columns to Chinese labels for better readability
# ("sales" and "salary" were already converted to numeric codes in the previous cell, because some functions cannot work with string types)


df = df.rename(columns={'satisfaction_level': '满意度', 
                        'last_evaluation': '评价',
                        'number_project': '工程指标',
                        'average_montly_hours': '月平均工作时间',
                        'time_spend_company': '工龄',
                        'Work_accident': '工作事故',
                        'promotion_last_5years': '升值',
                        'sales' : '部门',
                        'left' : '离职',
                        'salary':'工资'
                        })


df.head()


In [None]:
df

In [None]:
# Move the response variable '离职' (turnover) to the front of the table
front = df['离职']
df.drop(labels=['离职'], axis=1,inplace = True)
df.insert(0, '离职', front)
df.head()

# <font color='red'>Step 3: Exploring the Data</font>

### <font color='red'> 3a. Statistical Overview </font>


In [None]:
# The dataset contains 10 columns and 14999 observations
df.shape

In [None]:
# Check the type of our features. 
df.dtypes

In [None]:
# Looks like about 76% of employees stayed and 24% of employees left.
# NOTE: When performing cross-validation, it's important to maintain this turnover ratio
# '离职' is 1 for employees who left, so its mean is the turnover rate
turnover_rate = df['离职'].mean()
turnover_rate
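
One way to keep this ratio in every cross-validation fold is stratified splitting. The sketch below uses scikit-learn's `StratifiedKFold` on made-up labels with the same 24% / 76% imbalance; the placeholder feature matrix and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels with roughly the dataset's imbalance: 24% left (1), 76% stayed (0)
y = np.array([1] * 240 + [0] * 760)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter for the split

# Stratification keeps the share of each class (almost) identical across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_rates)
```

The same idea applies to a single hold-out split through the `stratify` argument of `train_test_split`.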

In [None]:
# Overview of summary
# On average, employees who left had a lower satisfaction level (roughly 20 percentage points lower),
# worked about 8 hours more per month, had a lower salary, and had a lower promotion rate
turnover_Summary = df.groupby('离职')
turnover_Summary.mean()

In [None]:
# Display the statistical overview of the employees
df.describe()

### <font color='red'> 3b. Correlation Matrix & Heatmap</font>
##### Strongly correlated features:
1. (+) project_count & averageMonthlyHours & evaluation
2. (-) turnover & satisfaction_level & salary

*From the heatmap, there seems to be a strong **positive (+)** correlation between project_count, averageMonthlyHours, and evaluation, which could mean that employees who spent more hours and did more projects were evaluated highly. But the evaluation feature, when compared on its own with the response variable turnover, shows little to no relationship. What does this mean? On the **negative (-)** side, turnover is negatively correlated with satisfaction_level and salary. I'm assuming that people tend to leave a company more when they are less satisfied and paid less.*

In [None]:
#Correlation Matrix
import matplotlib.pyplot as plt
df1 = df
df1=df1.rename(columns={'满意度': 'satisfaction_level', 
                        '评价': 'evaluation',
                        '工程指标': 'project_count',
                        '月平均工作时间': 'averageMonthlyHours',
                        '工龄': 'time_spend_company',
                        '工作事故': 'Work_accident',
                        '升值': 'promotion_last_5years',
                        '部门' : 'department',
                        '离职' : 'turnover',
                        '工资':'salary'
                        })

# Pairwise correlation between all columns
corr = df1.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
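
When the question is which features relate most to turnover, it can help to pull out just the response column of the correlation matrix and rank it. The sketch below does this on synthetic data (not the HR dataset), where turnover is deliberately tied to low satisfaction and the `unrelated` column is pure noise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the HR data): turnover is made more likely when satisfaction is low
rng = np.random.default_rng(0)
satisfaction = rng.uniform(0, 1, 500)
noise = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    'satisfaction_level': satisfaction,
    'unrelated': noise,
    'turnover': (satisfaction + rng.normal(0, 0.2, 500) < 0.4).astype(int),
})

# Correlation of every feature with the response, strongest (by absolute value) first
target_corr = toy.corr()['turnover'].drop('turnover')
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Note that a correlation coefficient only captures linear association, so a feature with a U-shaped relationship to turnover can still score close to zero.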

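### <font color='red'> 3c. Turnover V.S. Department </font>
**Summary:**
*The three largest departments in the company appear to be support, technical, and sales.*
*Most departments have a similar turnover rate, but management has the lowest. This could mean that people in higher positions tend not to leave.*


*The management department has the highest proportion of high salaries and the lowest turnover rate.*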

In [None]:
#Department   V.S.   Turnover
clarity_color_table = pd.crosstab(index=df1["department"], 
                          columns=df1["turnover"])

clarity_color_table.plot(kind="bar", 
                 figsize=(5,5),
                 stacked=True)
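
Stacked counts make it hard to compare turnover rates between departments of different sizes. A row-normalized crosstab gives the share of leavers per department directly. The mini-sample below is made up for illustration; only `pd.crosstab` with `normalize='index'` is the point.

```python
import pandas as pd

# Hypothetical mini-sample: department label and whether the employee left (1) or stayed (0)
toy = pd.DataFrame({
    'department': ['sales', 'sales', 'sales', 'sales', 'management', 'management', 'hr', 'hr'],
    'turnover':   [1, 0, 0, 1, 0, 0, 1, 0],
})

# normalize='index' divides each row by its total, giving per-department shares
rates = pd.crosstab(index=toy['department'], columns=toy['turnover'], normalize='index')
print(rates)
```

Plotting `rates` with `kind="bar", stacked=True` would then show bars of equal height, split by the turnover share.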

In [None]:
#Department   V.S.   Salary
clarity_color_table = pd.crosstab(index=df1["department"], 
                          columns=df1["salary"])

clarity_color_table.plot(kind="bar", 
                 figsize=(5,5),
                 stacked=True)

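### <font color='red'> 3d. Turnover V.S. Salary </font>
**Summary:**
*This is a very interesting observation. Almost all employees who left had a **low** or **medium** salary. Hardly anyone with a **high** salary left the company.*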

In [None]:
#Salary   V.S.   Turnover
clarity_color_table = pd.crosstab(index=df1["salary"], 
                          columns=df1["turnover"])

clarity_color_table.plot(kind="bar", 
                 figsize=(5,5),
                 stacked=True)

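### <font color='red'> 3e. Turnover V.S. Project Count </font>
**Summary:** 
*This plot is also interesting. More than half of the employees with only two projects left the company. The same can be said for employees with 6 to 7 projects. Perhaps this means that employees with 2 or fewer projects were underworked or not highly valued, and so left the company? Perhaps employees with 6 or more projects were overworked, and so left the company?*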

In [None]:
#projectCount V.S. turnover
clarity_color_table = pd.crosstab(index=df1["project_count"], 
                          columns=df1["turnover"])

clarity_color_table.plot(kind="bar", 
                 figsize=(5,5),
                 stacked=True)

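### <font color='red'> 3f. Turnover V.S. Evaluation </font>
**Summary:** 
Employees who left show a bimodal distribution. Employees with either low or high evaluations tended to leave the company. For those who stayed, the sweet spot seems to be an evaluation between 0.6 and 0.8.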

In [None]:
import matplotlib.pyplot as plt
# KDE plot of the last evaluation score, split by turnover
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
fig = plt.figure(figsize=(10,4))
ax=sns.kdeplot(df1.loc[(df1['turnover'] == 0),'evaluation'] , color='b',fill=True,label='no turnover')
ax=sns.kdeplot(df1.loc[(df1['turnover'] == 1),'evaluation'] , color='r',fill=True, label='turnover')
plt.title('Last evaluation')
plt.legend()

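### <font color='red'> 3g. Turnover V.S. Average Monthly Hours </font>  
**Summary:**  
Another clearly bimodal distribution for employees who left. Employees who worked fewer hours (~150 hours or less) and employees who worked too many hours (~250 or more) left the company. This suggests that employees who left were generally either underworked or overworked.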

In [None]:
#KDEPlot: Kernel Density Estimate Plot (fill=True replaces the deprecated shade=True)
fig = plt.figure(figsize=(10,4))
ax=sns.kdeplot(df1.loc[(df1['turnover'] == 0),'averageMonthlyHours'] , color='b',fill=True, label='no turnover')
ax=sns.kdeplot(df1.loc[(df1['turnover'] == 1),'averageMonthlyHours'] , color='r',fill=True, label='turnover')
plt.title('Average monthly hours worked')
plt.legend()

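### <font color='red'> 3h. Project Count V.S. Average Monthly Hours </font>  
**Summary:**  
Employees who stayed worked about 200 hours/month on average, while employees who left worked about 250 hours/month or about 150 hours/month.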

In [None]:
#UPDATE 8/1/2017
#ProjectCount VS AverageMonthlyHours
#Looks like the average employees who stayed worked about 200hours/month. Those that had a turnover worked about 250hours/month and 150hours/month
import seaborn as sns
sns.boxplot(x="project_count", y="averageMonthlyHours", hue="turnover", data=df1)

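### <font color='red'> 3i. Project Count V.S. Evaluation </font>  
**Summary:**  
Employees who did not leave the company had an average evaluation of around 70%, regardless of project count.
There is a large skew among employees who left: their evaluations change drastically once the project count exceeds three.
Employees with two projects and a poor evaluation left, as did employees with more than 3 projects and very high evaluations.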


In [None]:
#UPDATE 8/1/2017
#ProjectCount VS Evaluation
#Looks like employees who did not leave the company had an average evaluation of around 70%, regardless of project count
#There is a large skew among employees who left, though: evaluations change drastically after 3 projects
#Employees with two projects and a poor evaluation left, as did employees with more than 3 projects and very high evaluations
import seaborn as sns
sns.boxplot(x="project_count", y="evaluation", hue="turnover", data=df1)

# <font color='red'> Step 4: Modeling the Data </font>

In [None]:
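#Train-Test split
from sklearn.model_selection import train_test_split
# Separate the response from the features without mutating df1, so the cell can be re-run safely
label = df1['turnover']
data = df1.drop(columns='turnover')
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size = 0.2, random_state = 15)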

In [None]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
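# The features are unscaled, so the default 100 iterations may not be enough for the solver to converge
lg = LogisticRegression(max_iter=1000)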
lg.fit(data_train, label_train)
lg_score_train = lg.score(data_train, label_train)
print("Training score: ",lg_score_train)
lg_score_test = lg.score(data_test, label_test)
print("Testing score: ",lg_score_test)

In [None]:
#SVM
from sklearn.svm import SVC
svm = SVC()
svm.fit(data_train, label_train)
svm_score_train = svm.score(data_train, label_train)
print("Training score: ",svm_score_train)
svm_score_test = svm.score(data_test, label_test)
print("Testing score: ",svm_score_test)
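
SVM and kNN are sensitive to feature scale, and this dataset mixes 0-1 scores with monthly hours in the hundreds. A common remedy is to standardize the features inside a pipeline. The sketch below is an illustration on synthetic data (not the HR dataset), with one feature artificially scaled up to mimic that mismatch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the HR dataset); one feature is blown up to a much
# larger scale, similar to monthly hours sitting next to 0-1 satisfaction scores
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X[:, 0] *= 200
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is fit on the training data only, then applied to the test data
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)
scaled_score = scaled_svm.score(X_test, y_test)
print("Scaled SVM test accuracy:", scaled_score)
```

Wrapping the scaler and the classifier in one pipeline also keeps test data out of the scaling statistics during cross-validation.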


In [None]:
#kNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(data_train, label_train)
knn_score_train = knn.score(data_train, label_train)
print("Training score: ",knn_score_train)
knn_score_test = knn.score(data_test, label_test)
print("Testing score: ",knn_score_test)

In [None]:
#random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(data_train, label_train)
rfc_score_train = rfc.score(data_train, label_train)
print("Training score: ",rfc_score_train)
rfc_score_test = rfc.score(data_test, label_test)
print("Testing score: ",rfc_score_test)

y_pred_proba = rfc.predict_proba(data_test)
from sklearn.metrics import roc_curve
fpr, tpr, thres = roc_curve(label_test.values, y_pred_proba[:,1])
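
The plan mentions cross-validation for tuning hyperparameters, while the model above is fit once with default settings. A minimal sketch of a cross-validated search is shown below. It runs on synthetic data, and the parameter grid, `n_estimators=30`, and the `roc_auc` scoring are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the HR dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Every parameter combination is scored with 5-fold cross-validation;
# for classifiers, an integer cv uses stratified folds
param_grid = {'max_depth': [3, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

In the notebook's setting, the search would be fit on `data_train` and `label_train` only, leaving the test split for the final check.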


In [None]:
# Inspect the feature importances
importances = rfc.feature_importances_ 
features = data_train.columns
importances_df = pd.DataFrame([features, importances], index=['Features', 'Importances']).T
importances_df.sort_values('Importances', ascending=False)

In [None]:
import matplotlib.pyplot as plt
plt.plot(fpr, tpr)
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score
score = roc_auc_score(label_test.values, y_pred_proba[:,1])
score
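
The plan also mentions a confusion matrix. The sketch below shows how accuracy, precision, and recall follow from its four counts, using made-up labels and predictions rather than the model's actual output.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = left, 0 = stayed)
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted leavers, how many actually left
recall = tp / (tp + fn)      # of the actual leavers, how many were caught
print(tn, fp, fn, tp, accuracy, precision, recall)
```

With roughly 76% of employees staying, accuracy alone can look good even for a model that misses many leavers, which is why precision and recall are worth reporting alongside it.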

# <font color='red'> Step 5: Interpreting the Data </font>

Some important questions to answer:

How are we going to use this information? Are there any important findings to highlight? Can we use this information to help filter or evaluate employees?

How was the data obtained? From what time range? Was it sampled? 

How was the response variable labeled/collected? Was it from an employer?

How did we generate our features? Which is most important?

How can we improve the model in the future? Gather more data? Gather more features?


UPDATE 8/1/2017:

- Employees who did not leave worked about 200 hr/month on average.
- Employees who left worked about 150 hr/month OR about 250 hr/month.
- This equates to about 8 hr/day for the average employee who did not leave.
- Employees who left worked about 6 hr/day or 10 hr/day, which means they were either underworked or overworked.

Employees who left:

1. had a lower satisfaction level
2. worked about 2 hours per day more or less than the average employee
3. were MOSTLY within the low/medium salary range
4. had either too few (~2) or too many (5-7) projects
5. were considered "underworked" or "overworked"
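
The hours-per-day figures above follow from dividing monthly hours by the number of working days. The notebook does not state that number; 25 working days per month is assumed below because it is the value that turns 200 hr/month into 8 hr/day.

```python
# Assumption (not stated in the notebook): 25 working days per month,
# which is the value that makes 200 hr/month equal 8 hr/day
WORKING_DAYS_PER_MONTH = 25

monthly_hours = {'stayed': 200, 'left (low)': 150, 'left (high)': 250}
daily_hours = {group: hours / WORKING_DAYS_PER_MONTH
               for group, hours in monthly_hours.items()}
print(daily_hours)
```

With fewer working days per month, all three daily figures would be higher, but the 150 / 200 / 250 pattern of underworked, typical, and overworked employees stays the same.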