# People Analytics

My goal with this notebook is to understand what factors contribute to employee turnover, based on the [HR Analytics dataset](https://www.kaggle.com/giripujar/hr-analytics). To do so, I will follow a standard data science methodology: defining the business problem, [data preparation](https://en.wikipedia.org/wiki/Data_preparation), [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) and [model building](https://towardsdatascience.com/seven-major-steps-for-building-a-data-science-model-c1761408dd17), to predict and describe the underlying factors that contribute to employee turnover.

# Business problem

Implementing workplace policies that benefit workers and help boost employee retention is not just a “nice” thing for businesses to do for their employees. Maintaining a stable workforce by reducing employee turnover, and using data science to understand its causes and consequences, also makes good business sense, as it can result in significant cost savings for employers.

In fact, according to the [Work Institute’s 2019 Retention Report](https://info.workinstitute.com/hubfs/2019%20Retention%20Report/Work%20Institute%202019%20Retention%20Report%20final-1.pdf#page=9), employee turnover costs businesses more than 600 billion dollars every year. They estimate that for each employee you lose, it will cost you up to 33\% of their annual salary to replace them.

## Goal

> **What factors contribute to employee turnover, and what are our recommendations for managers to avoid it?**

# Data Sources

This notebook is based on the [HR Analytics dataset](https://www.kaggle.com/giripujar/hr-analytics). It was uploaded by Giri Pujar in 2018 and is currently rated as a Bronze dataset. There is no indication of whether this is a fictional or an anonymized dataset. One of the secondary goals of this notebook is to find clues that suggest whether it was created using rules and random values.

## Obtain and scrub

In [None]:
# Import the necessary libraries for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("../input/hr-analytics/HR_comma_sep.csv")
df.info()

In [None]:
df.head()

In [None]:
df['Work_accident'] = df['Work_accident'] == 1
df['left'] = df['left'] == 1
df['promotion_last_5years'] = df['promotion_last_5years'] == 1
df.dtypes

In [None]:
df = df.rename(columns={'satisfaction_level': 'satisfaction', 
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'hadPromotion',
                        'Department' : 'department',
                        })
df.tail()

# Data Analysis

## Current turnover rates

In the context of human resources, turnover is the act of replacing an employee with a new employee \[[1]\]. An organization’s turnover is measured as a percentage rate, which is referred to as its turnover rate, using this formula:

$$ T = \frac{\text{left}}{\text{avg. employees}} \times 100 $$

[1]: https://en.wikipedia.org/wiki/Turnover_(employment)
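
As a quick illustration of the formula, here is a minimal sketch with made-up headcounts. The helper `turnover_rate` and all the numbers are hypothetical and do not come from the dataset; the average headcount is assumed to be the mean of the headcounts at the start and end of the period.

```python
def turnover_rate(employees_left, avg_employees):
    """Return the turnover rate as a percentage (hypothetical helper)."""
    if avg_employees <= 0:
        raise ValueError("avg_employees must be positive")
    return employees_left / avg_employees * 100

# Made-up figures, only to illustrate the arithmetic
headcount_start, headcount_end = 190, 210
avg_employees = (headcount_start + headcount_end) / 2
example_rate = turnover_rate(30, avg_employees)
print('Turnover rate: %.2f%%' % example_rate)  # prints "Turnover rate: 15.00%"
```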

So we calculate the turnover of our dataset with:

In [None]:
print('Turnover rate was: %.2f%%' % (len(df[df.left == 1]) / len(df) * 100))


If all those people left in the same year (the dataset gives no indication of when people left), this is a **very** significant turnover rate. For comparison, in 2017, a LinkedIn analysis found [an average worldwide turnover rate of 10.9% on their platform](https://business.linkedin.com/talent-solutions/blog/trends-and-research/2018/the-3-industries-with-the-highest-turnover-rates). Our dataset has a turnover rate higher than that of the top industry, Technology (Software), which averaged 13.2%.

So let's investigate further!

## Statistical overview

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
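# Keep only numeric and boolean columns: recent pandas versions raise on text columns
df.select_dtypes(include=['number', 'bool']).corrwith(df.left)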

Turnover is not strongly correlated with any single variable, but it has a medium correlation with satisfaction, which is enough to justify further analysis of this variable.
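
To make this kind of comparison easier to read, the correlations can be ranked by absolute value. The sketch below uses a small synthetic frame (`toy`, with an invented rule linking low satisfaction to leaving), not the HR dataset, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'satisfaction': rng.uniform(0, 1, n),
    'evaluation': rng.uniform(0, 1, n),
})
# Synthetic rule (an assumption): lower satisfaction raises the chance of leaving
toy['left'] = rng.uniform(0, 1, n) < (0.6 - 0.5 * toy['satisfaction'])

# Correlate every feature with the target and sort by strength, ignoring the sign
corr_with_left = (toy.drop(columns='left')
                     .corrwith(toy['left'].astype(int))
                     .sort_values(key=abs, ascending=False))
print(corr_with_left)
```

Sorting by absolute value keeps strong negative correlations, such as the one expected for satisfaction, at the top of the list.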

## Satisfaction

In [None]:
sns.boxplot(x=df.left,y=df.satisfaction)

Using the boxplot above, we can see that people who left had consistently lower satisfaction levels. We will use this information as part of our [feature set](https://en.wikipedia.org/wiki/Feature_(machine_learning)) for the modeling step.

## Average Hours, evaluation and project count
So let's move to the other attributes. We will begin with the [Pearson Correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient):

In [None]:
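# Restrict to numeric and boolean columns: recent pandas versions raise on text columns
sns.heatmap(abs(df.select_dtypes(include=['number', 'bool']).corr()))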

Project count, average monthly hours and evaluation have medium correlations among themselves, so let's see them in a scatterplot.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(ax = ax, x=df.averageMonthlyHours,y=df.evaluation, hue=df.projectCount, palette="seismic")

The image above shows two strange clusters:
 1. A black one, with evaluation centered on 0.5, average monthly hours around 140 and project count around 2
 1. A red one, with high evaluation scores (centered on 0.9), average monthly hours above 250 and project count above 6
 
Now, let's see who has left the company in this evaluation/monthly hours plot. We will replace the project count colors with the turnover information.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(ax = ax, x=df.averageMonthlyHours,y=df.evaluation, hue=df.left)

Interestingly enough, these look like the same clusters. These three attributes together (evaluation, averageMonthlyHours, projectCount) seem to be related to turnover. We will include them in our feature set.

## Years at the company

We begin by exploring the years attribute and comparing it with other features. See the data below.

In [None]:
df.yearsAtCompany.describe()

In [None]:
sns.boxplot(y=df.yearsAtCompany,x=df.left)

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 2, figsize=(18,12))
sns.boxplot(ax = ax1[0],y=df.yearsAtCompany,x=df.hadPromotion,hue=df.left)
sns.boxplot(ax = ax1[1], y=df.yearsAtCompany, x=df.projectCount, hue=df.left)
sns.boxplot(ax = ax2[0], y=df.yearsAtCompany, x=df.evaluation, orient='h', hue=df.left)
sns.boxplot(ax = ax2[1], y=df.yearsAtCompany, x=df.workAccident, hue=df.left)

Years at the company seems to be a categorical value:
  1. 3 years or less
  1. Between 3 and 5
  1. More than 5
  
When an employee has been at the company between 3 and 5 years, whether or not they received a promotion seems to be a decisive factor for turnover. Moreover, in this specific range of 3-5 years, working on exactly 3 projects seems to be an unfortunate combination. The remaining attributes do not seem to correlate with years and turnover. We will create a new categorical attribute called *jobStability* to reflect that discovery.

In [None]:
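# Bucket tenure into three categories:
# y <= 3 -> lessThan3Years, 3 < y < 5 -> averageYears, y >= 5 -> moreThan5Years
df['jobStability'] = df.yearsAtCompany.apply(lambda y: 'lessThan3Years' if y <= 3 else 'averageYears' if y < 5 else 'moreThan5Years')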
df[['jobStability','yearsAtCompany']]
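
The same bucketing can also be sketched with `pd.cut`, which makes the bin edges explicit. This is only an alternative illustration on a few toy values, assuming tenure is recorded in whole years; it is not what the notebook runs.

```python
import numpy as np
import pandas as pd

years = pd.Series([2, 3, 4, 5, 6, 10], name='yearsAtCompany')  # toy values

# Right-closed bins: (0, 3], (3, 4], (4, inf]
# For whole-year values this mirrors the lambda: <=3, exactly 4, and 5 or more
job_stability = pd.cut(
    years,
    bins=[0, 3, 4, np.inf],
    labels=['lessThan3Years', 'averageYears', 'moreThan5Years'],
)
print(pd.concat([years, job_stability.rename('jobStability')], axis=1))
```

Note that, with these edges, an employee with exactly 5 years falls into `moreThan5Years`, just as in the lambda version.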

## Effects of promotions on turnover

Intuitively, a promotion, which is an essential part of the many rewards distributed by organizations, should affect the quitting behavior of individual employees.

In [None]:
df[['hadPromotion','left']].value_counts()

In [None]:
print('Only {:.2%} of the workforce had a promotion in the last 5 years. From those that had, only {:.2%} have left'.format(len(df[df.hadPromotion]) / len(df), len(df[df.hadPromotion & df.left])/ len(df[df.hadPromotion])))
print('Remember that our average turnover is {:.2%}'.format(len(df[df.left == 1]) / len(df)))

Indeed, we can see that the number of people that left the company after receiving a promotion in the last 5 years is significantly lower. We will make this attribute part of our feature set.
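
A compact way to compare the two groups is to average the boolean `left` flag per group, since the mean of a boolean column is the share of `True` values. The sketch below uses a tiny invented frame (`toy`), so its rates are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'hadPromotion': [True, True, False, False, False, False, False, False],
    'left':         [False, False, True, False, True, False, False, True],
})

# Share of employees who left, within each promotion group
left_rate_by_promotion = toy.groupby('hadPromotion')['left'].mean()
print(left_rate_by_promotion)
```

The same pattern would work for the work accident flag in the next section by grouping on that column instead.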

## Work accident

In [None]:
df[['workAccident','left']].value_counts(normalize=True)

In [None]:
print('{:.2%} of the workforce had a work accident. From those that had, only {:.2%} have left'.format(len(df[df.workAccident]) / len(df), len(df[df.workAccident & df.left])/ len(df[df.workAccident])))

With that, we end our exploratory data analysis and start to model our classifier.

# Describing turnover

We will use a [Decision Tree](#) to explain the factors that are making people leave the company and to let us give management recommendations. We will begin by preparing our data for the machine learning technique.

## Data preparation for modeling

In this section we will create our feature set (X) and expected outputs (Y). To do so, we will also convert labels to numbers.

In [None]:
# Import necessary machine learning libraries
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [None]:
features = df[['satisfaction','averageMonthlyHours','evaluation','projectCount','yearsAtCompany','workAccident','hadPromotion']].copy()
features = pd.concat([features,pd.get_dummies(df['jobStability'])], axis=1)
features['workAccident'] = features['workAccident'].apply(lambda v: 1 if v else 0)
features['hadPromotion'] = features['hadPromotion'].apply(lambda v: 1 if v else 0)
X = features
X[0:5]

In [None]:
y = df['left'].values
y[0:5]

Now we will split the data into train, validation and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_model_train, X_validation, y_model_train, y_validation = train_test_split(X_train, y_train, test_size=0.1, random_state=0)
print ('Model train set:', X_model_train.shape,  y_model_train.shape)
print ('Validation (hyperparameters test) set:', X_validation.shape,  y_validation.shape)
print ('Test set:', X_test.shape,  y_test.shape)
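
Because the two classes are not equally frequent, one option worth considering is a stratified split, which keeps the share of leavers roughly equal in every subset. The sketch below shows the `stratify` argument of `train_test_split` on synthetic arrays (`X_toy`, `y_toy`, with a made-up class balance); it is a suggestion, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))               # synthetic features
y_toy = np.array([True] * 240 + [False] * 760)   # made-up class balance

# stratify=y_toy preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy)

print('Positive share (train): {:.2%}'.format(y_tr.mean()))
print('Positive share (test):  {:.2%}'.format(y_te.mean()))
```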

And train our decision tree (it may take a while to run)

In [None]:
def train_tree_with_validation():
    f1_scores = {}
    max_depth = range(1,20)
    impurity_decrease = [0.1,0.01,0.001,0.0001,0]
    for d in max_depth:
        for i in impurity_decrease:
            dTree = DecisionTreeClassifier(criterion="gini", max_depth = d, min_impurity_decrease=i)
            dTree.fit(X_model_train,y_model_train)
            f1_scores[(d,i)] = f1_score(y_validation, dTree.predict(X_validation))
    return max(f1_scores.keys(), key=(lambda key: f1_scores[key]))
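
For comparison, scikit-learn's `GridSearchCV` can run a similar search with cross-validation instead of a single validation set. The sketch below uses synthetic data from `make_classification` and a smaller grid to keep it fast; the data, grid values and `random_state` are assumptions for illustration, not the settings of the function above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the HR dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)

param_grid = {
    'max_depth': list(range(1, 8)),
    'min_impurity_decrease': [0.01, 0.001, 0.0],
}
search = GridSearchCV(
    DecisionTreeClassifier(criterion='gini', random_state=0),
    param_grid,
    scoring='f1',   # same metric family as the manual loop
    cv=5,           # 5-fold cross-validation on the training data
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation averages the score over several folds, which tends to make the chosen hyperparameters less dependent on one particular validation split.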

In [None]:
best_hyper_values = train_tree_with_validation()
dTree = DecisionTreeClassifier(criterion="gini", max_depth = best_hyper_values[0], min_impurity_decrease=best_hyper_values[1])
dTree.fit(X_train,y_train)
print('Trained a decision tree with max_depth = {} and min_impurity_decrease = {}'.format(best_hyper_values[0],best_hyper_values[1]))
print('F1 score on test set: {:.3}'.format(f1_score(y_test, dTree.predict(X_test), average='weighted')))

Even though this decision tree had a high F1 score, its depth (9) is too large for us to derive meaningful recommendations. If accuracy were the only thing we were aiming for, there would be better techniques to model the data, like Random Forest and XGBoost. As such, we will reduce max_depth to help us create management recommendations, even if that lowers the F1 score.

In [None]:
dTree = DecisionTreeClassifier(criterion="gini", max_depth = 4, min_impurity_decrease=0.01)
dTree.fit(X_train,y_train)
print('F1 score with max_depth=4 on the test set: {:.3}'.format(f1_score(y_test, dTree.predict(X_test), average='weighted')))

In [None]:
fig, ax = plt.subplots(figsize=(40,30))
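# Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index
tree.plot_tree(dTree, feature_names=list(features.columns), label='none', filled=True, proportion=True, impurity=False, rounded=True, max_depth=4)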
plt.show()
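
When the plot is hard to read, the split thresholds can also be printed as plain text with `sklearn.tree.export_text`. The sketch below fits a small tree on synthetic data; `toy_tree` and the feature names `f0` to `f3` are placeholders, not the notebook's model or columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with placeholder feature names
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)
toy_names = ['f0', 'f1', 'f2', 'f3']

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per split or leaf, indented by depth
rules = export_text(toy_tree, feature_names=toy_names)
print(rules)
```

Each printed line shows a feature, a threshold and, at the leaves, the predicted class, which is the information the recommendations rely on.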

## Recommendations

The decision tree above allows us to give management recommendations to avoid turnover:
* The single most important attribute is employee satisfaction. Management should focus on keeping all employees with **satisfaction above 0.465**.
* Failing that, management should pay close attention to:
     * the employees with low satisfaction who are working on only 2 projects and had an evaluation score below 0.575. They will probably leave or be fired.
     * the employees with extremely low satisfaction (0.115 or less) working on 3 or more projects. They will surely leave.
* Even among workers with high satisfaction, management should pay attention to long-tenured employees (5 or more years at the company) with an evaluation above 0.815 (top performers) who are working more than 216 hours per month.

In [None]:
reason1 = df[(df.satisfaction < 0.465)]
reason2 = df[(df.satisfaction < 0.465) & (df.projectCount == 2) & (df.evaluation <= 0.575) ]
reason3 = df[(df.satisfaction <= 0.115) &(df.projectCount >=3)]
reason4 = df[(df.satisfaction > 0.465) & (df.yearsAtCompany >= 5) & (df.evaluation > 0.815) & (df.averageMonthlyHours > 216)]

print('1. In general, keep satisfaction high. - Satisfaction < 0.465: {:.2%} left'.format(len(reason1[reason1.left])/ len(reason1)))
print('2. Unhappy, low performers working on only 2 projects leave or are fired - Satisfaction < 0.465, projectCount == 2, evaluation <= 0.575: {:.2%} left'.format(len(reason2[reason2.left])/ len(reason2)))
print('3. Very unhappy, working on 3 or more projects leave - Satisfaction <= 0.115, projectCount >= 3: {:.2%} left'.format(len(reason3[reason3.left])/ len(reason3)))
print('4. Long time overworked top-performers leave - Satisfaction > 0.465, yearsAtCompany >= 5, evaluation > 0.815, workingHours > 216: {:.2%} left'.format(len(reason4[reason4.left])/ len(reason4)))
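
The segment rates above all follow the same pattern, so they could be wrapped in a small helper that also guards against an empty segment, where the division would otherwise fail. `left_rate` is a hypothetical helper and `toy` is invented data, shown only to illustrate the idea.

```python
import pandas as pd

def left_rate(frame, mask):
    """Share of rows in frame[mask] with left == True; NaN if the segment is empty."""
    segment = frame[mask]
    if len(segment) == 0:
        return float('nan')
    return segment['left'].mean()

# Toy data, invented for illustration
toy = pd.DataFrame({
    'satisfaction': [0.10, 0.30, 0.40, 0.70, 0.90, 0.80],
    'left':         [True, True, False, False, False, True],
})

low_satisfaction_rate = left_rate(toy, toy.satisfaction < 0.465)
empty_segment_rate = left_rate(toy, toy.satisfaction > 1.0)
print('Low satisfaction: {:.2%} left'.format(low_satisfaction_rate))
```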

# Predicting turnover

We will use [Logistic Regression](#) to predict the employees that are most likely to leave and to give management a more direct recommendation.

**Under construction**

In [None]:
#To do