# Introduction
- Intuitively we know there is a significant cost associated with employee turnover. The basic questions behind this project are: can we accurately predict employee attrition, what are the most important features that contribute to attrition, and is there a method of reducing the cost of employee turnover?
    - Structure:
        - Cleaning and EDA
        - Survival Analysis of an employee's tenure with the company
        - Machine Learning for predicting attrition
        - Developing a cost function for our models

### Main Goals
- Find useful information through Exploratory Analysis
    - Using survival analysis to speculate on important features for attrition
- Predict employee attrition
    - Target: Yes (16.12%) vs. No (83.88%)
    - Benchmark (predicting every employee as No Attrition): 83.88% accuracy
    - Metrics: Accuracy, False Negative Rate
- Preemptively combat attrition and try to minimize cost

### Findings
- We were able to find the most important features for attrition (all of them make intuitive sense)
- Predicted employee attrition with 91% accuracy
- By giving a bonus for employees that were at risk of attrition, we were able to save IBM a significant amount of money
    - The cost function used very conservative estimates (as a worst case scenario) and we were still able to show savings

# Importance of keeping quality employees
- Cost of advertising job opening, interview process, training, lost productivity, etc.
    - Studies show it can take up to 2 years for a new hire to reach the same productivity level as the old employee for certain positions
- Total cost for losing an employee can be anywhere between 6-24 months of their salary (keep this in mind when we try to reduce cost for the company)
    - e.g. an employee making \$50,000/year will cost \$25,000 - \$100,000 to replace
        - More research on the cost of replacing IBM employees will need further analysis

### Disclaimer
- It is an ARTIFICIAL dataset
    - However, we can still gain insights and I believe the process used is repeatable with a real world dataset
- I am still new to the Data Science world (this is my first contribution to Kaggle), so any feedback is much appreciated!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

from IPython.display import display
from scipy import stats

import warnings
np.random.seed(42)
warnings.filterwarnings("ignore")

In [None]:
hr_df = pd.read_csv('../input/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
hr_df.head()

# Exploratory Analysis

In [None]:
hr_df.info()

### There are no missing values and most of the features are the right data type
### Changing some numerical data to categorical
- Education: 1 = 'Below College', 2 = 'College', 3 = 'Bachelor', 4 = 'Master', 5 = 'Doctor'
- EnvironmentSatisfaction: 1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High'
- JobInvolvement: 1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High'
- JobSatisfaction: 1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High'
- PerformanceRating: 1 = 'Low', 2 = 'Good', 3 = 'Excellent', 4 = 'Outstanding'
- RelationshipSatisfaction: 1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High'
- WorkLifeBalance: 1 = 'Bad', 2 = 'Good', 3 = 'Better', 4 = 'Best'

In [None]:
# Changing numeric values to corresponding categorical values
hr_df['Education'] = hr_df['Education'].map({1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'})
hr_df['EnvironmentSatisfaction'] = hr_df['EnvironmentSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
hr_df['JobInvolvement'] = hr_df['JobInvolvement'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
hr_df['JobSatisfaction'] = hr_df['JobSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
hr_df['RelationshipSatisfaction'] = hr_df['RelationshipSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
hr_df['PerformanceRating'] = hr_df['PerformanceRating'].map({1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'})
hr_df['WorkLifeBalance'] = hr_df['WorkLifeBalance'].map({1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'})

#### We will drop EmployeeCount, StandardHours, EmployeeNumber, and Over18 because they are not informative

In [None]:
hr_df = hr_df.drop(['EmployeeCount', 'StandardHours', 'EmployeeNumber', 'Over18'], axis = 1)

### We will make data frames for both numerical and categorical data to make EDA easier

In [None]:
#making categorical and numerical data frames
hr_categorical = []
hr_numerical = []
for column in hr_df:
    if isinstance(hr_df[column].iloc[0], str):  # string values mark a categorical column
        hr_categorical.append(column)
    
    else:
        hr_numerical.append(column)
        
numerical_df = hr_df[hr_numerical]
categorical_df = hr_df[hr_categorical]

In [None]:
# histograms of the numerical data
fig = plt.figure(figsize = (15,30))
i = 0

for column in numerical_df:
    i += 1
    fig.add_subplot(6,3,i)
    plt.hist(numerical_df[column])
    plt.title(column)
plt.tight_layout()

In [None]:
hr_df['Attrition'] = hr_df['Attrition'].map({'Yes': 1, 'No': 0})

In [None]:
fig = plt.figure(figsize = (30,40))
i = 0

for col in categorical_df:
    i += 1
    fig.add_subplot(5,3,i)
    sns.countplot(x = categorical_df[col])
    plt.xticks(rotation=35, fontsize = 20)
    plt.title(col, fontsize = 20)
    
plt.tight_layout()

### We will now look at the correlations between the numerical features

In [None]:
cor = numerical_df.corr()
plt.figure(figsize = (15,15))
sns.heatmap(cor, annot = True)
plt.show()

### Since that looks clumsy, we will show a heat map with the highest correlations highlighted

In [None]:
plt.figure(figsize = (10,8))
big_cor = cor.where(abs(cor) > .6)
sns.heatmap(big_cor.replace(np.nan, 0))
plt.title('Correlation Heat Map with High Correlations Highlighted')
plt.show()

# Summary of Data
- Very clean dataset
    - No missing values and everything is the right data type
- Not a whole lot of high correlations across the board
    - Things that deal with "Years" are obviously correlated
- A lot of the numerical features are skewed right
- There are only two classes of performances of employees (Excellent vs. Outstanding)
    - We can reasonably assume that the reason behind their attrition is VOLUNTARILY leaving IBM
        - Meaning there will be a cost of losing an employee
- Class imbalance in Attrition

### Moving forward
- We will use survival analysis to speculate on important features for attrition

## Survival Analysis
- Analyzing the expected duration of time until one or more events happen
    - Typical example: Use this analysis in clinical trials to test the effect of a certain treatment on survival time
- Kaplan-Meier Estimator: $\hat S(t) = \prod _{t_{i} \leq t} \big ( 1 - \frac{d_i}{n_i} \big )$ 
    - Where $d_i$ is the number of exits at time $t_i$ and $n_i$ is the total number of individuals still at the company just before $t_i$
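
The product above can be sketched in a few lines of plain Python. This is only an illustration: the function name `kaplan_meier` and the toy tenure data are hypothetical, not part of the dataset or of the lifelines API. `durations` plays the role of YearsAtCompany and `observed` the role of Attrition (1 = left, 0 = still employed, i.e. censored).

```python
def kaplan_meier(durations, observed):
    """Return a list of (t_i, S(t_i)) pairs, one per distinct exit time."""
    # only times with at least one observed exit change the estimate
    exit_times = sorted({t for t, left in zip(durations, observed) if left})
    survival = 1.0
    curve = []
    for t in exit_times:
        # n_i: employees still at the company just before time t
        n_i = sum(1 for d in durations if d >= t)
        # d_i: employees who left exactly at time t
        d_i = sum(1 for d, left in zip(durations, observed) if d == t and left)
        survival *= 1.0 - d_i / n_i
        curve.append((t, survival))
    return curve

# hypothetical toy data: tenure in years and whether the employee left
tenure = [1, 2, 2, 3, 5, 5, 8]
left = [1, 1, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(tenure, left)
for t, s in km_curve:
    print(t, round(s, 3))
```

On this toy data the estimate drops from 6/7 after year 1 to 5/14 after year 5. Censored employees (still at the company) never count as exits, but they do stay in $n_i$ until their recorded tenure is reached.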

### We saw that there were some outliers in YearsAtCompany
- Let's explore this further

In [None]:
sns.boxplot(x = hr_df['YearsAtCompany'])
plt.title('Box Plot for Years at Company')
plt.show()

In [None]:
threshold = np.std(hr_df['YearsAtCompany']) * 3 # 3 std above mean
hr_df = hr_df[hr_df['YearsAtCompany'] < np.mean(hr_df['YearsAtCompany']) + threshold]

### 25 observations lie more than 3 standard deviations above the mean
- ~1.7% of data
- I am comfortable removing these observations

In [None]:
from lifelines import KaplanMeierFitter

### Could not import the lifelines package...
- Any help would be appreciated
- If you'd like to see this analysis, it is on my GitHub at https://github.com/eprentice30/hr_analytics/blob/master/IBM_hr_analytics/01_survival_analysis.ipynb

### Anyways, the important stuff is the likely important features for attrition
- Likely important features:
    - Overtime
    - JobSatisfaction
    - MaritalStatus
    - Age
    - MonthlyIncome
    - EnvironmentSatisfaction
    - StockOptionLevel
    - NumCompaniesWorked
- We will keep these features in mind while we try to validate the most important features for attrition

## Predicting Attrition

In [None]:
sns.countplot(x = hr_df['Attrition'])
# class shares from the full dataset (1233 No and 237 Yes out of 1470, before outlier removal)
plt.text(x = -.15, y = 800, s = '{:.2%}'.format(1233/1470.0), fontsize = 16)
plt.text(x = .85, y = 100, s = '{:.2%}'.format(237/1470.0), fontsize = 16)
plt.xticks(np.arange(2),('No', 'Yes'))
plt.show()

In [None]:
from sklearn.metrics import auc, roc_curve, roc_auc_score, classification_report, confusion_matrix, \
                        precision_recall_fscore_support

### Metrics to Consider
- Accuracy and False Negative Rate
    - False negative rate because this will be most costly for the company (predicting no attrition when there is attrition)
- Benchmark accuracy score
    - Accuracy = 83.88% (predicting every employee as staying)
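
To make the two metrics concrete, here is a small sketch that computes them from confusion-matrix counts for the benchmark "always predict No" classifier. The helper names are illustrative; the class counts (1233 No, 237 Yes) are those of the full dataset.

```python
def accuracy(tn, fp, fn, tp):
    # share of all employees classified correctly
    return (tn + tp) / (tn + fp + fn + tp)

def false_negative_rate(fn, tp):
    # share of actual leavers the model failed to flag
    return fn / (fn + tp)

# benchmark: predict "No" for all 1470 employees (1233 stayed, 237 left)
tn, fp, fn, tp = 1233, 0, 237, 0
baseline_acc = accuracy(tn, fp, fn, tp)
baseline_fnr = false_negative_rate(fn, tp)
print('accuracy = {:.2%}, false negative rate = {:.0%}'.format(baseline_acc, baseline_fnr))
```

The benchmark reaches 83.88% accuracy while missing every single leaver, which is why accuracy alone is not enough here and the false negative rate is tracked alongside it.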

## Preprocessing
### Getting Dummy Variables

In [None]:
hr_df = pd.get_dummies(hr_df, drop_first = True) # drop one level per feature to avoid multicollinearity

### Setting Up Feature and Target Matrix

In [None]:
X = hr_df.drop('Attrition', axis = 1)
y = hr_df['Attrition']

### Scaling

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.preprocessing import StandardScaler

### Train-test split and scaling data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#scale the data
scaler = StandardScaler()
# Fit_transform
X_train_scaled = scaler.fit_transform(X_train)
# transform
X_test_scaled = scaler.transform(X_test)

### Now we will fit a bunch of classification models and see which performs the best

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

## For the sake of space I will only show our top 2 performing models
- XGBoost and an Artificial Neural Network

### XGBoost

In [None]:
xgbparams = {
    'max_depth':[1,3,5],
    'learning_rate':[.1,.5,.7,.8],
    'n_estimators':[25,50,100]
}

xgb_gs = GridSearchCV(XGBClassifier(), param_grid = xgbparams, cv=5, n_jobs=-1, verbose = 1)
xgb_gs.fit(X_train_scaled, y_train)
xgb_gs.best_params_

In [None]:
print('Train acc =', xgb_gs.score(X_train_scaled, y_train))
print('Test acc = ', xgb_gs.score(X_test_scaled, y_test))

### Testing accuracy of 90%
- Below is the confusion matrix and other metrics

In [None]:
y_pred = xgb_gs.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

### Artificial Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
mlpparams = {
            'learning_rate': ["constant", "invscaling", "adaptive"],
            'hidden_layer_sizes': [(30,), (60,), (50,), (40,)],
            'alpha': [.1],
            'activation': ["logistic", "relu", "tanh"]
            }

mlp_gs = GridSearchCV(MLPClassifier(), param_grid = mlpparams, cv = 5, verbose = 1)
mlp_gs.fit(X_train_scaled, y_train)
mlp_gs.best_params_

In [None]:
print('Train acc =', mlp_gs.score(X_train_scaled, y_train))
print('Test acc =', mlp_gs.score(X_test_scaled, y_test))

### Testing accuracy of 90.6%
- Below is the confusion matrix and other metrics

In [None]:
y_pred = mlp_gs.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

### Comparing Models

| Model                              | Parameters                                                      | False Negative Rate | Train Acc | Test Acc |
|------------------------------------|-----------------------------------------------------------------|---------------------|-----------|----------|
| ANN                                | activation = logistic, alpha = 0.1, hidden_layer_sizes = (60,)  | 43%                 | 0.8957    | 0.9061   |
| XGBoost                            | learning_rate = 0.8, max_depth = 1, n_estimators = 50           | 60%                 | 0.9058    | 0.9006   |
| Logistic Regression (1st Deg poly) | C = 0.1, penalty = 'l2'                                         | 58%                 | 0.8938    | 0.8723   |
| Logistic Regression (2nd Deg poly) | C = 0.1, penalty = 'l2'                                         | 60%                 | 0.9437    | 0.8777   |
| KNN                                | n_neighbors = 5                                                 | 85%                 | 0.8675    | 0.8890   |
| Random Forest                      | max_depth = None, max_features = auto, n_estimators = 100       | 92%                 | 1.000     | 0.8696   |

### The top two models were our Artificial Neural Network and our XGBoost

### Feature Importance

In [None]:
xgb = XGBClassifier(max_depth = 1, learning_rate = .8, n_estimators = 50)
xgb.fit(X_train_scaled, y_train)

In [None]:
# pair each feature name with its importance and keep the 30 most important
zipped = list(zip(X.columns, xgb.feature_importances_))
zipped = pd.DataFrame(zipped)
zipped = zipped.sort_values(by = 1)
zipped = zipped.tail(30)

In [None]:
plt.figure(figsize=(10,10))
plt.barh(np.arange(len(zipped)), zipped[1])
plt.yticks(np.arange(len(zipped)), list(zipped[0]))
plt.ylabel('Feature', fontsize=15)
plt.xlim(xmin = .015, xmax = .14)
plt.xlabel('Importance')
plt.show()

### We see that there is a lot of overlap with the important features found with Survival Analysis and XGBoost
- We can be fairly confident in our findings of important features (especially since they all make intuitive sense)

# Ok so we know what features are important for attrition and we are fairly accurate with our predictions (91%)

# So now what?

## We can do nothing and view attrition as an overhead cost

### Confusion Matrix
|            | Predicted No | Predicted Yes |
|------------|--------------|---------------|
| Actual No  | True Negative       |    False Positive    |
| Actual Yes |    False Negative   |     True Positive    |

### Cost Matrix

|            | Predicted No | Predicted Yes |
|------------|--------------|---------------|
| Actual No  | c = 0     |    c = 0    |
| Actual Yes |    c = Salary*.5  |     c = Salary*.5    |

- Basically, if the employee is still with IBM, there is no cost involved
- But if the employee leaves we will take the CONSERVATIVE estimate that it costs half of their yearly salary to replace them
    - Assuming an average salary of \$50,000 per year (also conservative)

$$\Rightarrow C = TN_n\cdot 0 + FP_n \cdot 0 + FN_n \cdot Salary \cdot 0.5+ TP_n \cdot Salary \cdot 0.5$$
$$\Rightarrow C = FN_n \cdot Salary \cdot 0.5 + TP_n \cdot Salary \cdot 0.5= \boxed{\$ 1,375,000}$$

- So we have an overhead cost of attrition of around \$1.375 million per year (using the "testing" employees: 55 employees leaving and 307 staying, i.e. 55 × \$25,000)

## Or we could take preemptive action against attrition
- What was the most important feature for attrition?
    - INCOME (obviously)
- Let's give employees that are at risk of attrition a 10% bonus in the hopes that they will stay
- Assumptions
    - Bonus will prevent an employee from leaving a percentage ($p$) of the time
    - Cost of losing an employee = $.5 \cdot Salary = \$25,000$
    - Bonus = $0.1 \cdot Salary = \$5,000$
    
### New cost function if we implement this bonus program
|            | Predicted No | Predicted Yes |
|------------|--------------|---------------|
| Actual No  | c = 0     |    c = 5,000    |
| Actual Yes |    c = 25,000  |   p $\cdot$ 5,000 + (1 - p) $\cdot$ 25,000   |

- So if we predict no attrition and the employee stays, we incur a cost of \$0
- But if we predict attrition when there is not attrition, we take a hit of the bonus (\$5,000)
- If we don't predict attrition when there is, we lose half of the employee's salary (\$25,000)
- Now if we predict attrition when the employee was planning on leaving, there is a probability $p$ that the bonus keeps them and a probability $1 - p$ that it doesn't work, so we can come up with an expected cost this way

- Here's our new cost function
$\Rightarrow C = TN_n\cdot 0 + FP_n \cdot 5000 + FN_n \cdot 25000+ TP_n (p \cdot 5000+ (1 - p) \cdot 25000)$ 
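
As a worked example of this cost function, here is a minimal sketch. The function name `program_cost` and the second set of confusion-matrix counts are hypothetical; the salary, bonus, and loss figures are the assumptions stated above.

```python
def program_cost(tn, fp, fn, tp, p, salary=50000.0, bonus_rate=0.1, loss_rate=0.5):
    """Expected cost of the bonus program for given confusion-matrix counts."""
    bonus = bonus_rate * salary   # $5,000 paid to each flagged employee
    loss = loss_rate * salary     # $25,000 lost for each employee who leaves
    # flagged leavers: the bonus works with probability p, otherwise we pay the loss
    tp_cost = p * bonus + (1 - p) * loss
    return tn * 0 + fp * bonus + fn * loss + tp * tp_cost

# nobody is flagged: all 55 leavers in the test set are false negatives
do_nothing = program_cost(tn=307, fp=0, fn=55, tp=0, p=0.5)

# hypothetical split of the same 362 employees, bonus assumed to work half the time
with_bonus = program_cost(tn=297, fp=10, fn=25, tp=30, p=0.5)
print(do_nothing, with_bonus)
```

When nobody is flagged, the function reproduces the \$1,375,000 overhead cost. With the hypothetical counts in the second call the cost falls to \$1,125,000, which shows how catching leavers can offset bonuses paid to employees who would have stayed anyway.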

### We will now try to tune our models to minimize this cost function


In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
mlp = MLPClassifier(activation = 'logistic', hidden_layer_sizes = (60,), alpha = .1, \
                    learning_rate = 'adaptive')
mlp.fit(X_train_scaled, y_train)

In [None]:
def best_threshold(model, steps, X, y, p):
    # cost per confusion-matrix cell, assuming a $50,000 salary and a 10% bonus
    salary = 50000.0
    bonus = 5000.0
    TN_cost = 0
    TP_cost = p*bonus + (1-p)*.5*salary  # bonus works with probability p
    FP_cost = bonus
    FN_cost = .5*salary

    # train test split and scaling (same random_state as before)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model.fit(X_train_scaled, y_train)
    # predicted probability of attrition; it does not depend on the threshold, so compute it once
    proba_test = model.predict_proba(X_test_scaled)[:,1]

    cost = []
    # evenly spaced thresholds from 0 to 1 (avoids drift from repeated float addition)
    for threshold in np.linspace(0, 1, steps + 1):
        y_pred_test = (proba_test > threshold)

        cm = confusion_matrix(y_test, y_pred_test)
        TN = cm[0,0]
        TP = cm[1,1]
        FP = cm[0,1]
        FN = cm[1,0]

        total_cost = TN_cost*TN + TP_cost*TP + FP_cost*FP + FN_cost*FN
        results_dict = {
                'threshold' : threshold,
                'cost' : total_cost,
                'precision_score_test': precision_score(y_test, y_pred_test),
                'recall_score_test': recall_score(y_test, y_pred_test),
                'TN': TN,
                'FP': FP,
                'FN': FN,
                'TP': TP,
                        }
        cost.append(results_dict)

    # one row per threshold, with the cost and the test-set confusion-matrix counts
    thresh_results = pd.DataFrame(cost, columns=['cost', 'threshold', 'precision_score_test',\
                        'recall_score_test','FN', 'FP','TN','TP'])
    return thresh_results

In [None]:
fig = plt.figure(figsize = (15,90))
j = 0
probabilities = np.linspace(0,1,11)[::-1]  # bonus success probabilities from 1.0 down to 0.0

for i in probabilities:
    df1 = best_threshold(xgb,20,X,y, i)
    df2 = best_threshold(mlp,20,X,y, i)
    
    j += 1
    fig.add_subplot(12,1,j)
    plt.plot(df2['threshold'], df2['cost'], label = 'ANN')
    plt.plot(df1['threshold'], df1['cost'], label = 'XGBoost')
    plt.plot(np.linspace(0,1,9), 1375000*np.ones(9), label = 'Overhead Cost')
    plt.ylabel('Cost')
    plt.xlabel('Probability Threshold')
    plt.title('Comparing Models to Minimize Cost')
    plt.text(x = .325, y = 1500000, s = 'Probability that bonus is successful = ' + '{:.1f}'.format(i), fontsize = 14)

    m = df2['cost'].min()
    profit = 1375000 - m
    plt.text(x = .325, y = 1400000, s = 'Savings = $' + str(profit), fontsize = 14)
    plt.legend()
plt.tight_layout()


## Best case scenario (works 100% of the time)
- We can expect savings of \$590,000 (from the testing set)

## Worst Case (Works 10% of the time)
- We can expect savings of \$37,000 (from the testing set)

### So when we come up with a confident answer for how often the bonus actually works
- We can see where the cost function is minimized and say "Hey this employee has probability (p > threshold) of leaving, let's give them a bonus" and we can expect significant savings
    - For example, if the bonus program only works 50% of the time, we will use our ANN model to come up with a probability of the employee leaving and if that probability is greater than 0.45 we give them a bonus. We can then expect savings of \$260,000
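
Picking the operating threshold amounts to taking the row of the threshold sweep with the lowest cost. The sketch below assumes a sweep shaped like the output of `best_threshold`; the intermediate costs are made up for illustration (chosen to echo the 50% example) and are not model output. The names `sweep` and `gets_bonus` are hypothetical.

```python
# hypothetical sweep results as (threshold, cost) pairs, not real model output
sweep = [
    (0.00, 2360000.0),  # everyone gets a bonus
    (0.25, 1250000.0),
    (0.45, 1115000.0),
    (0.75, 1300000.0),
    (1.00, 1375000.0),  # nobody gets a bonus: the overhead cost
]
overhead_cost = 1375000.0  # cost of doing nothing, from the test set

# the cost-minimizing row gives the threshold to deploy
best_thresh, best_cost = min(sweep, key=lambda row: row[1])
savings = overhead_cost - best_cost

def gets_bonus(prob_leaving, threshold=best_thresh):
    # flag an employee when the predicted attrition probability exceeds the threshold
    return prob_leaving > threshold

print(best_thresh, savings, gets_bonus(0.6), gets_bonus(0.3))
```

With the DataFrame returned by `best_threshold`, the same selection could be written as `thresh_results.loc[thresh_results['cost'].idxmin()]`.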

### It's pretty clearly a bad idea to give everyone a bonus
- And if nobody gets the bonus, the cost function converges to the "overhead cost"
    - This also means that even if the bonus NEVER works it is not any worse than doing nothing to combat it
- But those middle grounds are the important metrics
    - Also keep in mind these are based on conservative estimates (losing an employee only costs half of their salary)

# Conclusions
- Yes, it is an artificial dataset
    - But for the most part a lot of these metrics are recorded by companies
        - Even if they are not recorded, it would not be too difficult to obtain this data
- We were able to reliably predict whether an employee will leave the company (91% accuracy vs. benchmark accuracy of 84%)
- We are confident in which attributes are most important to attrition
  - MonthlyIncome
  - Overtime
  - DailyRate
  - DistanceFromHome
  - EnvironmentSatisfaction
  - RelationshipSatisfaction
  - NumCompaniesWorked
  - StockOptionLevel
  - TotalWorkingYears
  - JobRole
- More importantly we looked at whether we should do nothing about employee turnover or try to combat it
    - We used conservative estimates for how much it will cost IBM to replace an employee
    - We used a conservative estimate for the average employee's salary at IBM (\$50,000 vs the actual median of \$58,000)
    - We looked at a wide range of possibilities of how often the bonus program will work
        - Even if it works 0% of the time there is no added cost, since the cost-minimizing threshold then gives nobody a bonus

## Further Work
- Actually validating the assumptions of our cost function
    - How often a bonus will keep an employee at IBM
    - How much it actually costs to lose an employee
- Rather than taking an "average" salary for employees
    - Change the cost function to reflect everyone's individual salary, among other metrics
- Continue tuning models to more accurately reflect an employee's probability of attrition
    - Look into model stacking
    - Look into methods of dealing with a class imbalance