# DS-NYC-DAT-45 | Final Project 2: Project Design Writeup

### Problem Statement

- Determine which employees will leave the company using employee data from the Kaggle dataset.
- This is a supervised classification problem: we predict a binary value (1 = employee has left; 0 = still employed).
- Understanding which employees have left will help us identify which employees are likely to leave next. Following up with this group to learn what motivates them will help the company retain employees.
- Hypothesis: High-performing employees (evaluation score of 0.8 or higher) who have few projects (3 or fewer) have a high likelihood of leaving the company (probability > 0.5).
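
As a first, model-free look at the hypothesis, the observed attrition rate in the hypothesized group can be compared with 0.5. The sketch below is only an empirical proxy for the predicted probability. The helper name `hypothesis_attrition_rate` and the toy rows are made up for illustration and are not taken from the Kaggle file.

```python
import pandas as pd

def hypothesis_attrition_rate(df, min_eval=0.8, max_projects=3):
    """Share of leavers among high performers with few projects."""
    mask = (df["last_evaluation"] >= min_eval) & (df["number_project"] <= max_projects)
    subset = df.loc[mask, "left"]
    return subset.mean() if len(subset) else float("nan")

# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "last_evaluation": [0.85, 0.90, 0.95, 0.50, 0.82],
    "number_project": [2, 3, 6, 2, 3],
    "left": [1, 0, 1, 1, 1],
})

rate = hypothesis_attrition_rate(toy)
supports_hypothesis = rate > 0.5
```

On the real dataset, the same function could be called on the DataFrame loaded from the CSV, and the resulting rate compared with the overall attrition rate as well as with 0.5.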

### Outline of Potential Methods

- Potential models: logistic regression, decision trees and random forest
- Fit a random forest to the dataset and rank the features by their feature importance scores. Then fit decision trees to identify which feature interactions matter most. Use the important features and interactions found in these steps to improve the logistic regression model. The final step quantifies the impact of each feature and feature interaction on employee attrition and helps open the dialogue for the next step in employee engagement.
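
The three steps above could be sketched with scikit-learn as follows. This is a minimal sketch, not the project's final code: the data are synthetic, the rule that generates `y` is invented so the example has a signal to find, and the interaction column `high_eval_few_projects` is hand-picked here, whereas in the project it would come from inspecting the tree.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the HR data (assumption: not the Kaggle file).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.09, 1.0, n),
    "last_evaluation": rng.uniform(0.36, 1.0, n),
    "number_project": rng.integers(2, 8, n),
})
# Invented labeling rule, used only so the models have something to learn.
y = ((X["last_evaluation"] >= 0.8) & (X["number_project"] <= 3)).astype(int)

# Step 1: random forest feature importances.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Step 2: a shallow tree whose splits hint at feature interactions.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_rules = export_text(tree, feature_names=list(X.columns))

# Step 3: logistic regression with an explicit interaction feature.
interaction = (X["last_evaluation"] >= 0.8) & (X["number_project"] <= 3)
X_lr = X.assign(high_eval_few_projects=interaction.astype(int))
logit = LogisticRegression(max_iter=1000).fit(X_lr, y)
coefficients = pd.Series(logit.coef_[0], index=X_lr.columns)
```

Printing `tree_rules` shows the split thresholds as text, which is one way to spot candidate interactions before adding them to the logistic regression.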

### Dataset

- Data Source: Kaggle (15,000 employees, 10 variables including the dependent variable)
- Data Dictionary

Variable | Description | Type of Variable | Range
---|---|---|---
satisfaction_level | Satisfaction level of employee based on survey | Continuous | [0.09, 1]
last_evaluation | Score based on employee's last evaluation | Continuous | [0.36, 1]
number_project | Number of projects | Continuous | [2, 7]
average_montly_hours | Average monthly hours | Continuous | [96, 310]
time_spend_company | Years at company | Continuous | [2, 10]
Work_accident | Whether employee had a work accident | Categorical | {0, 1}
left | Whether employee had left | Categorical | {0, 1}
promotion_last_5years | Whether employee had a promotion in the last 5 years | Categorical | {0, 1}
sales | Department employee worked in | Categorical | 10 departments 
salary | Level of employee's salary | Categorical | {low, medium, high}

- Sample Data

```python
import pandas as pd
df = pd.read_csv('hr/HR_data.csv')
df.head()
```

Unnamed: 0 | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary
---|---|---|---|---|---|---|---|---|---|---
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low
1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low


### Project Concerns

- The salary variable would be more useful if it were continuous rather than categorical (low, medium, high). Since the variable is ordinal, would it make sense to encode it as {1, 2, 3}? Does doing so carry risks for some models (logistic regression) but not for others (decision trees, random forest)? Would it be better to use more representative values such as {3, 7, 15}, reflecting average salaries of 30K, 70K, and 150K for low, medium, and high?
- By definition, the data is not perfectly cross-sectional (unless ~3,500 of the 15,000 employees quit on the same day). The year each employee left is also not available, which may introduce some noise in the data. (For example, fewer people may move to other companies during a recession.)
- The data also does not specify whether an employee left voluntarily or involuntarily. The assumption is that the data includes employees who were laid off or fired and that they are labeled the same as those who left voluntarily.
- The dataset was submitted by a Kaggle user but came with few details about the data, such as the industry, time period, or country. This lack of detail may raise questions about the accuracy of the data.
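
For the salary encoding question in the first concern, the two usual options can be compared side by side. The sketch below uses a made-up four-row frame. The mapping `salary_order` is an assumption for illustration, and one-hot encoding is shown as an alternative that the writeup does not itself propose.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# Option A: ordinal codes. A linear model then fits a single coefficient,
# which assumes the low->medium and medium->high steps have equal effect.
salary_order = {"low": 1, "medium": 2, "high": 3}
toy["salary_ordinal"] = toy["salary"].map(salary_order)

# Option B: one-hot columns. Each level gets its own coefficient, so no
# spacing between levels is assumed. drop_first avoids a redundant column.
dummies = pd.get_dummies(toy["salary"], prefix="salary", drop_first=True, dtype=int)
encoded = pd.concat([toy, dummies], axis=1)
```

Tree-based models split on thresholds, so any order-preserving coding ({1, 2, 3} or {3, 7, 15}) produces the same partitions. The choice of spacing matters mainly for logistic regression, where one-hot columns sidestep the question at the cost of an extra coefficient.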

### Outcomes

- The main objective of the model is to correctly predict which employees will leave, so prediction accuracy will be the criterion for model selection. The lower bound of the prediction accuracy is 0.76, which is the proportion of employees who have not left the company (the accuracy obtained by always predicting "still employed"). The upper bound is set at 0.99, which is the cross-validation accuracy score of the random forest model.
- The goal is to fit a model that could explain the drivers of the outcome (e.g., logistic regression) with a prediction accuracy of 90% or better on the testing set.
- Another objective of the project is to determine which features and feature interactions are the most significant in predicting the outcome. This could involve comparing the coefficients of the logistic regression model or examining which features appear at the nodes of the decision tree.
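
The 0.76 lower bound is the accuracy of a model that always predicts the majority class. A minimal sketch of that calculation follows. The helper name `majority_baseline_accuracy` is hypothetical, and the 76/24 split is constructed to mirror the proportion stated above rather than read from the dataset.

```python
import pandas as pd

def majority_baseline_accuracy(left):
    """Accuracy of always predicting the most common class."""
    return left.value_counts(normalize=True).max()

# Constructed labels: 76 stayers and 24 leavers per 100 employees.
toy_left = pd.Series([0] * 76 + [1] * 24)

baseline = majority_baseline_accuracy(toy_left)
target_accuracy = 0.90  # goal stated for the explanatory model
gap_to_target = target_accuracy - baseline
```

Any candidate model should be judged against this baseline and not against 0.5, since the classes are imbalanced.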

#### Taking one step further
Make it a clustering problem:
1. Take the group of employees who have left and use that dataset to create segments to better understand what type of employees are leaving. (e.g., "disgruntled and low performing", "overworked", "high performers but not enough work to do", etc.)
2. Alternatively, create segments from the entire population (excluding the outcome variable) and determine which groups have the highest attrition rate, then target the employees in those groups for follow-up discussions.
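
The first option could be sketched with k-means, which is one possible algorithm and not one named in this writeup. The data below are synthetic, and the choice of three clusters and of these three features is arbitrary and for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the employees who left (not the Kaggle data).
rng = np.random.default_rng(0)
features = ["satisfaction_level", "last_evaluation", "average_montly_hours"]
leavers = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.09, 1.0, 300),
    "last_evaluation": rng.uniform(0.36, 1.0, 300),
    "average_montly_hours": rng.integers(96, 311, 300),
})

# Standardize so that monthly hours do not dominate the distance metric.
scaled = StandardScaler().fit_transform(leavers[features])

# Assign each leaver to one of three segments (k chosen arbitrarily here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
leavers["segment"] = kmeans.fit_predict(scaled)

# Average feature values per segment, used to name and interpret segments.
segment_profiles = leavers.groupby("segment")[features].mean()
```

For the second option, the same clustering would be run on all employees without the `left` column, and the attrition rate per segment could then be read from a group-by on `segment` followed by the mean of `left`.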

#### Resources
Similar projects on this topic:
1. A Study on Employee Attrition and Retention in Manufacturing Industries
http://www.bvimsr.com/documents/publication/2013V5N1/09.pdf
2. Whole Foods Market Case Study: Leadership and Employee Retention 
http://scholarsarchive.jwu.edu/mba_student/8/
3. Predictive Employee Turnover Analysis Flow Chart
http://www.inostix.com/blog/en/case-study-predictive-employee-turnover-analysis-flow-chart-hr-analytics/