# Project 1 - Employee Attrition

This version of the notebook demonstrates the "Best" model, which predicts whether each employee in a list is likely to leave or stay with the company. It skips the step-by-step experimentation documented in the "Employee Attrition.ipynb" file and goes straight to the end result.

Required Files to run this project/demonstration:
- Preprocessing_Utilities.py
- Model_Evaluation_Utilities.py

## Introduction:

The goal of this project is to develop a model that lowers the costs associated with hiring and training employees by predicting which employees might leave the company. Within any organisation, everyday spending decisions play a major role in the company's success, and one of the most important of these is the investment in people. The hiring process demands a lot of skill, patience, time and money.

The following outlines the most common hiring costs (https://toggl.com/blog/cost-of-hiring-an-employee):
1. External Hiring Teams.
2. Internal HR Teams.
3. Career Events.
4. Job board fees.
5. Background Checks.
6. Onboarding and training.
7. Careers page.
8. Salary and extras.

As can be seen from the list above, it can be very difficult to pinpoint precisely the costs associated with hiring an employee. With all these rigorous processes already set up for a given company, perhaps there are better questions to ask, such as:
1. Which employee will stay and which will leave?
2. What are the factors that lead to an employee leaving the company, and how can this be predicted?

## Breakdown of this Project:
1. Loading in the Dataset.
2. Visualise the data.
3. Dataset preparation (Data cleaning, training and testing splits)
4. Classifier models (Logistic Regression, Neural Networks, Random Forest)
5. Evaluation methodologies (Accuracy, Precision, Recall and F1-Scores)
6. Classifier Model training and its evaluation.

## Dataset:

Link: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Dataset Description (from source): Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists. 

What Is Attrition? \
Attrition in business describes a gradual but deliberate reduction in staff numbers that occurs as employees retire or resign and are not replaced. The term is also sometimes used to describe the loss of customers or clients as they mature beyond a product or company's target market without being replaced by a younger generation. (ref -> https://www.investopedia.com/terms/a/attrition.asp)

Among the dataset's columns, the following are integer-coded ratings (see the dataset source for the meaning of each level):
- Education
- EnvironmentSatisfaction
- JobInvolvement
- JobSatisfaction
- PerformanceRating
- RelationshipSatisfaction
- WorkLifeBalance

## Requirements:
- Numpy
- Pandas
- Seaborn
- Matplotlib
- scikit-learn
- os
- tensorflow (or Keras)

## Summary of all Feature Engineering Trials from the Experimental Notebook:

Feature engineering techniques that were considered and implemented:
1. Statistical-Based selection method on the existing features.
    - Pearson Correlations.
    - Hypothesis Testing. (P-values).
2. Model-based selection method on the existing features.
    - Tree-based models.
3. Principal Component Analysis to transform the features (Parametric assumption).
4. Restricted Boltzmann Machine (RBM) to create more features (non-parametric assumption).

#### Summary of results:

| Technique | Class | Accuracy (%) | F1-Score (%) | Precision (%) | Recall (%) | GridSearch (fine-tune) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline Weighted LogReg | 0 (No, stayed) | 86 | 92 | 87 | 97 | None |
|  | 1 (Yes, left) |  | 47 | 72 | 35 |  |
| Pearson Correlations | 0 (No, stayed) | 88 | 93 | 89 | 98 | YES |
|  | 1 (Yes, left) |  | 48 | 80 | 34 |  |
| Hypothesis Testing (P-values) | 0 (No, stayed) | 87 | 93 | 87 | 99 | YES |
|  | 1 (Yes, left) |  | 22 | 78 | 13 |  |
| Tree-based model selection | 0 (No, stayed) | 83 | 91 | 83 | 100 | NO |
|  | 1 (Yes, left) |  | 3 | 100 | 2 |  |
| PCA | 0 (No, stayed) | 83 | 90 | 83 | 100 | YES |
|  | 1 (Yes, left) |  | 3 | 5 | 2 |  |
| RBM | 0 (No, stayed) | 84 | 91 | 86 | 96 | YES |
|  | 1 (Yes, left) |  | 27 | 50 | 19 |  |
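
The F1-scores in the table are the harmonic mean of the precision and recall columns. The short sketch below re-derives two of the class 1 values from the rounded table entries. The helper name `f1_from_precision_recall` is illustrative only and is not part of the project utilities; small differences against the table come from rounding of the inputs.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 1 (Yes, left) rows, using the rounded table values as fractions.
baseline_f1 = f1_from_precision_recall(0.72, 0.35)
pearson_f1 = f1_from_precision_recall(0.80, 0.34)

print(f"Baseline Weighted LogReg: {baseline_f1:.3f}")
print(f"Pearson Correlations:     {pearson_f1:.3f}")
```

A high precision paired with a low recall, as in most class 1 rows, pulls the F1-score towards the lower of the two values.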

The final model implemented here was chosen based on the output results above and subsequent elimination of under-performing models.

## 1 - Setting Up the Libraries:

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline

from numpy import mean
from Preprocessing_Utilities import *
from Model_Evaluation_Utillities import *

## 2 - Load in the Dataset from Folder:

In [2]:
# Dataset Filename:
dataset_fileName = 'Human_Resources.csv'

# Set up the Working Directory:
currentDirectory = os.getcwd()
path_to_dataset = os.path.join(currentDirectory, 'Dataset', dataset_fileName)

In [3]:
# Load in the File: .csv format
employee_data_df = pd.read_csv(path_to_dataset)

In [4]:
# For the target variable: Attrition.
y_employee_target = pd.DataFrame(employee_data_df['Attrition'])

# For the Feature variables: All of the rest.
employee_data_df = employee_data_df.drop(labels='Attrition', axis=1)

In [5]:
# Encode the Target variable to 0s and 1s:
y_employee_target = y_employee_target['Attrition'].apply(lambda x: 1 if x == "Yes" else 0)

## 3 - Clean and Preprocess the Dataset in Pipeline: 

Here, the pre-processing steps are set up in a pipeline that processes and transforms the dataset so that it is ready for the following stages, such as model training and classification.

The sequence of the Pipeline:
1. Drop the unwanted columns.
2. One-hot encode the Nominal (categorical) values in the columns.
3. Encode the Ordinal (categorical) values in the columns.
4. Scale the features in the dataset.
5. Encode the Target variable to 0s and 1s.

In [6]:
# ==============================================================================
# 1. Drop the unwanted columns.
# ==============================================================================
# list the columns to drop:
columns_to_drop = ['EmployeeCount', 'Over18', 'StandardHours']

# Instantiate the custom transformer that drops the unwanted columns:
cc_drop_columns = CustomDropUnwantedColumns(col=columns_to_drop)

# ==============================================================================
# 2. One-hot encode the Nominal (categorical) values in the columns:
# ==============================================================================
# Nominal Columns:
list_categorical_columns_nominal = ['Department', 'EducationField', 'Gender', 
                                    'JobRole', 'MaritalStatus', 'OverTime']

# Instantiate the custom encoder for Nominal columns (similar to one-hot encoding):
cc_nominal_encoder = CustomCategoryEncoder_nominal(cols=list_categorical_columns_nominal)

# ==============================================================================
# 3. Encode the Ordinal (categorical) values in the column:
# ==============================================================================
# Ordinal Columns:
list_categorical_columns_ordinal = ['BusinessTravel']

# Apply the custom encoder for Ordinal columns, instantiate:
cc_ordinal_encoder = CustomCategoryEncoder_ordinal(cols=list_categorical_columns_ordinal)

# ==============================================================================
# 4. Scale the features in the dataset:
# ==============================================================================
# scaler = scale_features_dataFrame()
scaler = MinMaxScaler()

# ==============================================================================
# 5. Encode the Target variable to 0s and 1s:
# ==============================================================================
# Done Above.

### 3.1 - Define the Feature Engineering Pipeline:

The sequence of the Pipeline:
1. Statistical-Based selection method on the existing features.
    - Pearson Correlations.
    - Hypothesis Testing. (P-values).
2. Principal Component Analysis to transform the features (Parametric assumption).
3. Restricted Boltzmann Machine (RBM) to create more features (non-parametric assumption).
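
A `FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so the techniques listed above add features side by side. The sketch below shows this on random toy data; the array sizes and component counts are arbitrary choices for illustration, not the project's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import FeatureUnion

rng = np.random.default_rng(0)
X_toy = rng.random((40, 6))        # 40 rows, 6 features already in [0, 1]
y_toy = np.array([0, 1] * 20)      # binary labels

union = FeatureUnion(transformer_list=[
    ("k_best", SelectKBest(score_func=f_classif, k="all")),           # keeps all 6 columns
    ("pca", PCA(n_components=2)),                                     # adds 2 components
    ("rbm", BernoulliRBM(n_components=3, n_iter=5, random_state=0)),  # adds 3 hidden units
])

X_union = union.fit_transform(X_toy, y_toy)
print(X_union.shape)  # 6 + 2 + 3 columns
```

`BernoulliRBM` expects inputs in the range [0, 1], which is one reason the scaling step comes before the feature union.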

In [7]:
# ==============================================================================
# 1. Statistical-Based selection method:
# ==============================================================================
# For Pearson Correlations:
cscs = CustomStatsCorrSelector(response=y_employee_target, threshold=0.05)

# For Hypothesis Testing (P-values):
k_best_features = SelectKBest(score_func=f_classif)

# ==============================================================================
# 2. Principal Component Analysis:
# ==============================================================================
pca = PCA()

# ==============================================================================
# 3. Restricted Boltzmann Machine (RBM):
# ==============================================================================
rbm = BernoulliRBM()


#### Transform the dataset with the pipeline:


In [8]:
# Define the Preprocessing Pipeline Sequence: 
pipe_preprocessing = Pipeline(steps=[("drop_columns", cc_drop_columns),
                                     ("nominal_encoder", cc_nominal_encoder), 
                                     ("ordinal_encoder", cc_ordinal_encoder),
                                     ("corr", cscs), # Part of feature engineering, but required to be placed here
                                     ('scaler', scaler)]
                             )

# Define the Feature Engineering Pipeline (Feature Union); ("corr", cscs) sits in the preprocessing pipeline above:
pipe_featEng = FeatureUnion(transformer_list=[("k_best", k_best_features),
                                              ("pca", pca),
                                              ("rbm", rbm)]
                           )


# Overall Preprocessing + Feature Engineering Pipelines:
pipe_preprocessing_featureEng = Pipeline(steps=[("preprocessing", pipe_preprocessing),
                                                ("featureEngineering", pipe_featEng)])

In [9]:
# Convert to Numpy Array:
y_employee_target = y_employee_target.to_numpy()

## 4 - Set Up the Final Logistic Classifier Model:

The model here uses the best hyperparameters found by an earlier grid search.

The best settings found were as follows:

Best Accuracy: __87.687__% \
Best Parameters: {'P_featureEng__featureEngineering__k_best__k': 'all', 'P_featureEng__featureEngineering__pca__n_components': 30, 'P_featureEng__featureEngineering__rbm__n_components': 300, 'P_featureEng__featureEngineering__rbm__n_iter': 100, 'P_featureEng__preprocessing__corr__threshold': 0.02, 'classifier__C': 0.1} \
Average Time to Fit (s): 1.712 \
Average Time to Score (s): 0.006
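
The long parameter names above follow scikit-learn's nested naming convention: step names are joined with a double underscore, from the outermost pipeline down to the estimator's own argument. A minimal sketch with a cut-down pipeline (only a scaler, PCA and the classifier; the step names mirror the ones used in this notebook):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler

inner = Pipeline(steps=[
    ("preprocessing", Pipeline(steps=[("scaler", MinMaxScaler())])),
    ("featureEngineering", FeatureUnion(transformer_list=[("pca", PCA())])),
])
model = Pipeline(steps=[
    ("P_featureEng", inner),
    ("classifier", LogisticRegression(max_iter=10000)),
])

# <outer step>__<inner step>__<transformer>__<argument>
model.set_params(**{
    "P_featureEng__featureEngineering__pca__n_components": 30,
    "classifier__C": 0.1,
})

feature_union = model.named_steps["P_featureEng"].named_steps["featureEngineering"]
pca_step = feature_union.transformer_list[0][1]
print(pca_step.n_components, model.named_steps["classifier"].C)
```

Calling `model.get_params().keys()` lists every name that can be used as a key in a grid-search parameter dictionary.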

In [10]:
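# Class weights for the Logistic Regression Classifier. The balanced weights were
# computed as [0.59610706, 3.10126582]; uniform weights are set here.
# class_weight expects a {label: weight} dict (or 'balanced'), not a list.
weighting = {0: 1, 1: 1}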

# Overall Preprocessing + Feature Engineering + Classification Pipelines:
pipe_modeling = Pipeline(steps=[('P_featureEng', pipe_preprocessing_featureEng),
                                ('classifier', LogisticRegression(max_iter=10000, class_weight=weighting))])

# Define the parameters for the grid: dict
model_featEng_params = {'P_featureEng__featureEngineering__k_best__k': ['all'], 
                        'P_featureEng__featureEngineering__pca__n_components': [30], 
                        'P_featureEng__featureEngineering__rbm__n_components': [300], 
                        'P_featureEng__featureEngineering__rbm__n_iter': [100], 
                        'P_featureEng__preprocessing__corr__threshold': [0.02], 
                        'classifier__C': [0.1]}

# Run the Gridsearch Model:
grid = get_best_model_and_accuracy(model=pipe_modeling, 
                                   params=model_featEng_params, 
                                   X=employee_data_df, 
                                   y=y_employee_target.ravel())


Best Accuracy: 87.551%
Best Parameters: {'P_featureEng__featureEngineering__k_best__k': 'all', 'P_featureEng__featureEngineering__pca__n_components': 30, 'P_featureEng__featureEngineering__rbm__n_components': 300, 'P_featureEng__featureEngineering__rbm__n_iter': 100, 'P_featureEng__preprocessing__corr__threshold': 0.02, 'classifier__C': 0.1}
Average Time to Fit (s): 3.12
Average Time to Score (s): 0.01
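
The class weights quoted in the comment of cell In [10] can be reproduced with `compute_class_weight`, which for the `'balanced'` option returns `n_samples / (n_classes * count_per_class)`. The sketch below assumes a split of 1,233 employees who stayed and 237 who left; these counts are an assumption made for this example, chosen because they reproduce the quoted values.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts: 1,233 stayed (0) and 237 left (1).
y_assumed = np.array([0] * 1233 + [1] * 237)

balanced = compute_class_weight(class_weight="balanced",
                                classes=np.array([0, 1]),
                                y=y_assumed)

# Same values by hand: n_samples / (n_classes * count_per_class).
by_hand = len(y_assumed) / (2 * np.bincount(y_assumed))

# LogisticRegression takes class weights as a {label: weight} mapping.
weight_dict = dict(zip([0, 1], balanced))
print(balanced)
print(weight_dict)
```

A weight above 1 for class 1 makes each misclassified leaver count more during training, which typically trades some precision for higher recall on the minority class.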


## 5 - Output the Predictions from the model:

In [11]:
# Take a random sample of 100 employees from the dataset: instead of creating new entries to the dataset.
X_sample_employee = employee_data_df.sample(n=100, random_state=2)
X_sample_employee.head()

|  | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 721 | 50 | Travel_Rarely | 939 | Research & Development | 24 | 3 | Life Sciences | 1 | 1005 | 4 | ... | 4 | 80 | 1 | 22 | 2 | 3 | 12 | 11 | 1 | 5 |
| 843 | 26 | Travel_Rarely | 1384 | Research & Development | 3 | 4 | Medical | 1 | 1177 | 1 | ... | 2 | 80 | 1 | 8 | 2 | 3 | 8 | 7 | 0 | 7 |
| 627 | 52 | Travel_Frequently | 890 | Research & Development | 25 | 4 | Medical | 1 | 867 | 3 | ... | 3 | 80 | 0 | 31 | 3 | 3 | 9 | 8 | 0 | 0 |
| 1368 | 34 | Travel_Frequently | 735 | Research & Development | 22 | 4 | Other | 1 | 1932 | 3 | ... | 2 | 80 | 0 | 16 | 3 | 3 | 15 | 10 | 6 | 11 |
| 305 | 36 | Non-Travel | 1105 | Research & Development | 24 | 4 | Life Sciences | 1 | 419 | 2 | ... | 3 | 80 | 1 | 11 | 3 | 3 | 9 | 8 | 0 | 8 |


In [12]:
predictions_on_employees_df, employees_stay, employees_leave = predict_stay_leave(x_data=X_sample_employee, 
                                                                                  grid_model=grid, 
                                                                                  output_col_name='Attrition')
# Inspect the Results:
employees_stay

|  | Attrition |
| --- | --- |
| 721 | No |
| 843 | No |
| 627 | No |
| 1368 | No |
| 305 | No |
| ... | ... |
| 1259 | No |
| 738 | No |
| 763 | No |
| 977 | No |


### 5.1 - Output the List of Employees Predicted to Leave:

In [13]:
employees_leave

|  | Attrition |
| --- | --- |
| 798 | Yes |
| 14 | Yes |
| 1142 | Yes |
| 239 | Yes |
| 688 | Yes |
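
For readers without the utility files at hand, the sketch below shows one way predictions could be labelled and split into "stay" and "leave" tables while keeping the original row index. `split_stay_leave_sketch`, the `overtime_hours` feature and the toy model are all hypothetical and made up for this example; this is not the implementation of `predict_stay_leave`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def split_stay_leave_sketch(x_data, fitted_model, output_col_name="Attrition"):
    """Hypothetical helper: label predictions and split them by outcome."""
    predicted = fitted_model.predict(x_data)
    labels = np.where(predicted == 1, "Yes", "No")
    # Reuse the input index so each row can be traced back to an employee.
    results = pd.DataFrame({output_col_name: labels}, index=x_data.index)
    stay = results[results[output_col_name] == "No"]
    leave = results[results[output_col_name] == "Yes"]
    return results, stay, leave


# Toy data: one made-up feature that separates the two classes.
X_toy = pd.DataFrame({"overtime_hours": [0, 1, 2, 8, 9, 10]},
                     index=[11, 12, 13, 14, 15, 16])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

results, stay, leave = split_stay_leave_sketch(X_toy, toy_model)
print(leave)
```

Keeping the original index is what allows the output tables above to be matched back to rows in the sampled dataset.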
