# <font color=darkblue>Demystifying Employee Attrition with Machine Learning</font>

### By Kevin Chan Yongda

## <font color=darkblue>1. Introduction</font>

### 1.1 Problem Statement

<b>Talent Management Strategy</b> is dominating the boardroom agenda of many organisations across the globe. Increasingly, it has evolved into a core Business Strategy in which organisations strive to retain existing talent and compete to attract top talent in order to gain a competitive edge over their closest business rivals.

This is especially prevalent in organisations with businesses that are heavily-reliant on human capital for a differentiating edge, such as Financial Institutions. In such organisations, Staff Cost is a substantial driver of Total Cost, and hence Profitability. For mid-size Financial Institutions, Staff Cost represents as high as about <b>60%-70%</b> of their Total Cost. In larger Financial Institutions, Staff Cost as a percentage of Total Cost remains sizable at about <b>40%-50%</b>.

When employees (especially top talents) leave the organisation, they disrupt client experience, team morale and project delivery timelines. They also take with them the experience, training and development exposure that the organisation has previously invested in them. For the organisation, this means additional costs for the acquisition and subsequent development of replacement hires.

As such, <b>Employee Attrition</b> is a real strategic issue that organisations have to manage in order to protect the overall profitability of the business. Organisations that master the effective planning and execution of Talent Management Strategy will emerge as winners in this competitive landscape, especially amidst the challenging and peculiar macro-economic setting that we are in today.


Therefore, my analysis aims to unravel <b>key insights</b> on Employee Attrition Risks (through the Company XYZ dataset), with the following 2 objectives:

            a. Provide the Management (of Company XYZ) with the Top Factors that have a major influence on why their employees leave the organisation
            b. Build a Machine Learning model to predict and identify employees with high attrition risks for early Management (of Company XYZ) intervention

![employee-retention.jpg](attachment:employee-retention.jpg)

### 1.2 Understanding the Dataset

<b>Link:</b> [HR Analytics Case Study - Kaggle](https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study)

<b>Case Study Extract (Source: Kaggle)</b>

A large company named XYZ, employs, at any given point of time, around 4000 employees. However, every year, around 15% of its employees leave the company and need to be replaced with the talent pool available in the job market. 

<b>Datasets</b>

   There are 3 data files that I will be using for this case study in total.  
  
    
   

<b>1. general_data.csv</b>



  - <b>Age:</b> Age of the employee
  - <b>Attrition:</b> Whether the employee has left the organisation
  - <b>BusinessTravel:</b> How frequently the employee travelled for business in the last year
  - <b>Department:</b> Employee's department 
  - <b>DistanceFromHome:</b> Distance between Office and Employee's home (in km)
  - <b>Education:</b> Employee's level of education (1: 'Below College', 2: 'College', 3: 'Bachelor's Degree', 4: 'Master's Degree', 5: 'Doctorate')
  - <b>EducationField:</b> Employee's field of education
  - <b>EmployeeCount:</b> Employee count
  - <b>EmployeeID:</b> Unique Employee ID
  - <b>Gender:</b> Employee's gender
  - <b>JobLevel:</b> Employee's job level on a scale of 1 to 5
  - <b>JobRole:</b> Employee's role title
  - <b>MaritalStatus:</b> Employee's marital status
  - <b>MonthlyIncome:</b> Employee's monthly income (in Rupees per month)
  - <b>NumCompaniesWorked:</b> Total number of companies the employee has worked for
  - <b>Over18:</b> Whether the employee is above 18 years of age
  - <b>PercentSalaryHike:</b> Employee's salary hike last year (in percentage points)
  - <b>StandardHours:</b> Employee's standard working hours (duration)
  - <b>StockOptionLevel:</b> Employee's stock option level
  - <b>TotalWorkingYears:</b> Employee's total number of working years (entire life)
  - <b>TrainingTimesLastYear:</b> Number of times employee attended training last year
  - <b>YearsAtCompany:</b> Employee's total number of working years (in the company)
  - <b>YearsSinceLastPromotion:</b> Employee's number of years since last promotion
  - <b>YearsWithCurrManager:</b> Employee's number of years working under current manager


<b>2. employee_survey_data.csv</b>
 
- <b>EmployeeID:</b> Unique Employee ID
- <b>EnvironmentSatisfaction:</b> Employee's Work Environment Satisfaction Level (1: 'Low' , 2: 'Medium' , 3: 'High' , 4 : 'Very High')
- <b>JobSatisfaction:</b> Employee's Job Satisfaction Level (1: 'Low' , 2: 'Medium' , 3: 'High' , 4 : 'Very High') 
- <b>WorkLifeBalance:</b> Employee's Work Life Balance Rating Level (1: 'Low' , 2: 'Medium' , 3: 'High' , 4 : 'Very High') 


<b>3. manager_survey_data.csv</b>

- <b>EmployeeID:</b> Unique Employee ID
- <b>JobInvolvement:</b> Employee's Job Involvement Level (1: 'Low' , 2: 'Medium' , 3: 'High' , 4 : 'Very High')  
- <b>PerformanceRating:</b> Employee's performance rating last year
  
  

### 1.3 How is my approach different from the rest?

<b>1. Holistic Consideration of Classification Modelling Techniques</b>

One of the objectives of this analysis is to predict and identify employees with high attrition risks. As such, this is clearly a Classification problem with a binary/categorical predictive outcome (i.e. Attrition:Yes or Attrition:No).

Many of those who have attempted this case study applied only Logistic Regression in building their machine learning model, perhaps because the case study specifically prescribes this technique.

As part of my learning journey, I have decided to go beyond the use of Logistic Regression technique to compare and evaluate the efficacies of a few Classification techniques before recommending the machine learning model to be used. 

I will be using the following Classification Modelling Techniques:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine


<b>2. Using F1 Score as the Model Evaluation Metric</b>

My area of interest is employees with high attrition risks; employees who are deemed to have no attrition risk are not the key focus of this analysis. With an imbalanced outcome, the Accuracy Score is dominated by the majority class, so it should not be the central model evaluation metric. Many of those who have attempted this case study stopped at evaluating the Accuracy Score.

Instead, I will be using <b>F1 Score</b>, which focuses on Class 1 of the Classification outcome, as the key model evaluation metric when I am comparing across the 3 Classification techniques.
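
To make the contrast concrete, here is a minimal sketch using hypothetical counts (1,000 employees of whom 160 leave, chosen only to mirror an attrition rate of roughly 16%; they are not taken from the dataset). A model that predicts "No" for every employee still looks accurate, while its F1 Score on Class 1 collapses to zero.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical scenario: the model predicts "No" for every employee
tp, fp, fn, tn = 0, 0, 160, 840

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = f1_from_counts(tp, fp, fn)

print(f"Accuracy: {accuracy:.2f}, F1 (Class 1): {f1:.2f}")
```

The helper name `f1_from_counts` is illustrative; in the notebook itself the same quantity comes from `sklearn.metrics.f1_score`.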


<b>3. Performing Hyperparameter Tuning</b>

Rather than just leaving it to the default hyperparameters defined in the algorithm, I will also be performing hyperparameter tuning to optimise the performance of the model.
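
As a rough sketch of what such tuning can look like, the snippet below runs `GridSearchCV` with `scoring="f1"` on synthetic, imbalanced stand-in data. The data, the parameter grid and the variable names (`X_demo`, `y_demo`, `param_grid`) are assumptions made for illustration; they are not the grid or the data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (roughly 84% / 16%); not the HR dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=1
)

# Hypothetical grid: the values below are illustrative only
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=1),
    param_grid,
    scoring="f1",  # optimise F1 on Class 1 rather than accuracy
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 keeps the search aligned with the evaluation metric chosen above, so the "best" hyperparameters are not simply the ones that maximise accuracy on the majority class.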

## <font color=darkblue>2. Importing Dataset and Libraries</font>

### <b>2.1 Importing the libraries</b>

In [None]:
#Importing the usual python libraries
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

#Importing libraries to hide warnings as well as to time code execution duration
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from timeit import default_timer as timer

#Importing libraries for machine learning algorithms
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, roc_curve, auc, recall_score
from sklearn.model_selection import GridSearchCV

### <b>2.2 Importing and performing high-level checks on the 3 datasets</b>

In [None]:
empdf = pd.read_csv('../input/hr-analytics-case-study/general_data.csv', dtype={'EmployeeID': object})
empsurvey = pd.read_csv('../input/hr-analytics-case-study/employee_survey_data.csv',dtype={'EmployeeID': object})
mgrsurvey = pd.read_csv('../input/hr-analytics-case-study/manager_survey_data.csv',dtype={'EmployeeID': object})

In [None]:
empdf.head()

In [None]:
empsurvey.head()

In [None]:
empsurvey.describe()

In [None]:
mgrsurvey.head()

In [None]:
mgrsurvey.describe()

### <b>2.3 Merging the 3 dataframes into 1 dataframe based on EmployeeID key</b>

In [None]:
df = pd.merge(pd.merge(empdf, empsurvey, on = 'EmployeeID'), mgrsurvey, on = 'EmployeeID')

In [None]:
df.head()
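
An inner merge silently drops employees that are missing from one of the files, and a repeated EmployeeID would multiply rows. The sketch below shows one way to guard against both, using toy stand-ins for the three files (the values are made up) and the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the three files; values are made up
emp = pd.DataFrame({"EmployeeID": ["1", "2", "3"], "Age": [51, 31, 32]})
emp_survey = pd.DataFrame({"EmployeeID": ["1", "2", "3"], "JobSatisfaction": [4, 2, 2]})
mgr_survey = pd.DataFrame({"EmployeeID": ["1", "2", "3"], "JobInvolvement": [3, 2, 3]})

# validate="one_to_one" raises MergeError if an EmployeeID repeats on either side
merged = (
    emp.merge(emp_survey, on="EmployeeID", how="inner", validate="one_to_one")
       .merge(mgr_survey, on="EmployeeID", how="inner", validate="one_to_one")
)

# An inner join drops unmatched IDs without warning, so compare row counts
assert len(merged) == len(emp)
print(merged.shape)
```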

### <b>2.4 Checking the merged dataframe for data attributes and completeness</b>

<b>Key Observation:</b> There are some columns with blank/missing data and will require further pre-processing.

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

## <font color=darkblue>3. Data Pre-processing</font>

<b>Check if there are any duplicated employee entries in the dataframe
    </b>

In [None]:
#Total rows minus distinct IDs gives the number of duplicated records (0 means no duplicates)
dupcheck = df['EmployeeID'].count() - df['EmployeeID'].nunique()

print('No. of duplicated Employee records :')
print(dupcheck)

<b>Identify all the columns in the dataframe with null values
    </b>

In [None]:
nullcount = df.isnull().sum()
nullcheck = nullcount[nullcount >0]

if not nullcheck.empty:
    print('These are the columns that contain null values: ' + '\n')
    print(nullcheck)
else:
    print('*No more columns with null values*')

<b>Fill all null values with the median of the column. I chose the median instead of the mean because most of the identified columns hold ordinal data that encode underlying categories. Using the mean may introduce decimal values that do not correspond to any valid level in these columns.
    </b>

In [None]:
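#numeric_only=True restricts the median to numeric columns (recent pandas versions raise an error on text columns otherwise)
df.fillna(df.median(numeric_only=True), inplace=True)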
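
The difference between the two fill values can be seen on a toy frame (values are made up; the column names merely echo the dataset). The sketch passes `numeric_only=True` so that the text column is skipped when the medians are computed.

```python
import pandas as pd

# Toy frame: one ordinal column with a gap, one text column (values are made up)
toy = pd.DataFrame({
    "JobSatisfaction": [4, 4, None, 3, 1, 4],
    "Department": ["Sales", "HR", "Sales", "R&D", "R&D", "Sales"],
})

mean_value = toy["JobSatisfaction"].mean()      # falls between two survey levels
median_value = toy["JobSatisfaction"].median()  # coincides with an existing level here

# numeric_only=True skips the text column when computing medians
filled = toy.fillna(toy.median(numeric_only=True))

print(mean_value, median_value, filled["JobSatisfaction"].tolist())
```

One caveat: with an even number of observed values the median can itself fall halfway between two levels, so it is worth checking the filled values afterwards.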

<b>Check if there are still any null values
    </b>

In [None]:
nullcount = df.isnull().sum()
nullcheck = nullcount[nullcount >0]

if not nullcheck.empty:
    print('These are the columns that contain null values: ' + '\n')
    print(nullcheck)
else:
    print('*No more columns with null values*')

<b>Check if there are any columns with just 1 unique value (not useful)
    </b>

In [None]:
uniquecount = df.nunique()
uniquecheck = uniquecount[uniquecount == 1]

print('These are the columns with just 1 unique value: ' + '\n')
print(uniquecheck)

<b>Remove the columns identified above (i.e. with just 1 unique value), together with EmployeeID, which is a unique identifier and carries no predictive information
    </b>

In [None]:
df = df.drop(columns = ['EmployeeCount', 'Over18','StandardHours','EmployeeID'])

<b>Feature Engineering: Deriving new features relating to Employee Survey Scores and Manager Survey Scores
    </b>

In [None]:
df['AvgEmpScore'] = round(df[['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance']].mean(axis=1),1)
df['AvgMgrScore'] = round(df[['JobInvolvement', 'PerformanceRating']].mean(axis=1),1)
df.head()

## <font color=darkblue>4. Exploratory Data Analysis</font>

### <b>4.1 Visualising the distribution of key categorical data</b>

In [None]:
sns.set(context="paper", font_scale=1.5)

plt.figure(figsize=(20,5))

plt.subplot(1,5,1)
sns.countplot(x=df['Gender'])
plt.xticks(rotation=90)
plt.ylim((0,3000))
plt.title('Distribution by Gender')

plt.subplot(1,5,2)
sns.countplot(x=df['MaritalStatus'])
plt.xticks(rotation=90)
plt.yticks([])
plt.ylim((0,3000))
plt.ylabel('')
plt.title('Distribution by Marital Status')

plt.subplot(1,5,3)
sns.countplot(x=df['Department'])
plt.xticks(rotation=90)
plt.yticks([])
plt.ylim((0,3000))
plt.ylabel('')
plt.title('Distribution by Department')

plt.subplot(1,5,4)
sns.countplot(x=df['JobLevel'])
plt.xticks(rotation=0)
plt.yticks([])
plt.ylim((0,3000))
plt.ylabel('')
plt.title('Distribution by JobLevel')

plt.subplot(1,5,5)
sns.countplot(x=df['BusinessTravel'])
plt.xticks(rotation=90)
plt.yticks([])
plt.ylim((0,3000))
plt.ylabel('')
plt.title('Distribution by BusinessTravel')

plt.show()

  <b>Key Observations:</b> 
  
 
  a. There are more Male than Female employees.
  
  b. There are more Married employees than Single or Divorced employees.
  
  c. Most of the employees are in Research and Development Department. 
  
  d. Details of the Job Level are not given, but based on data inspection, it is probable that JobLevel increases with seniority (i.e. 5 being the most senior) assuming normal organisational hierarchy.
  
  e. Most of the employees travel for business at low frequency. Only a small proportion of employees do not travel for business at all.

### <b>4.2 Visualising the distribution of key numerical data and its impact on Attrition</b>

In [None]:
sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
sns.kdeplot(data= df['Age'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['Age'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['Age'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Age'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution by Age')

plt.subplot(1,3,2)
sns.kdeplot(data= df['DistanceFromHome'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['DistanceFromHome'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['DistanceFromHome'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Distance'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution by DistanceFromHome')

plt.subplot(1,3,3)
sns.kdeplot(data= df['AvgEmpScore'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['AvgEmpScore'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['AvgEmpScore'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median EmpScore'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution by AvgEmpScore')


plt.show()

  <b>Key Observations of Charts Above:</b> 
  
  a. Median Age of employee is 36 years old. Younger employees seem to have a higher risk of attrition.
  
  b. Median Distance from office is 7 km. Employees living further from office seem to have a higher risk of attrition.
  
  c. Median AvgEmpScore is about 2.7. Employees who scored the Employee Survey with lower scores seem to have higher risk of attrition.

In [None]:
sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
sns.kdeplot(data= df['YearsAtCompany'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['YearsAtCompany'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['YearsAtCompany'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Years'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution by YearsAtCompany')

plt.subplot(1,3,2)
sns.kdeplot(data= df['YearsSinceLastPromotion'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['YearsSinceLastPromotion'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['YearsSinceLastPromotion'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Years'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution of YearsSinceLastPromotion')

plt.subplot(1,3,3)
sns.kdeplot(data= df['YearsWithCurrManager'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['YearsWithCurrManager'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['YearsWithCurrManager'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Years'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution of YearsWithCurrManager')

plt.show()

  <b>Key Observations of Charts Above:</b> 
  
 
  a. Median YearsAtCompany is 5 years. Newer employees (who have spent fewer years in the company) seem to have a higher risk of attrition.
  
  b. Median YearsSinceLastPromotion is 1 year. Employees who were recently promoted seem to have higher risk of attrition.
  
  c. Median YearsWithCurrManager is 3 years. Employees who have spent fewer years with their current manager seem to have higher risk of attrition.

In [None]:
sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(20,5))

plt.subplot(1,2,1)
sns.kdeplot(data= df['MonthlyIncome'][df.Attrition == 'Yes'], color='red', shade=True)
sns.kdeplot(data= df['MonthlyIncome'][df.Attrition == 'No'], color='green', shade=True)
plt.axvline(df['MonthlyIncome'].median(), color='k', linestyle='dashed', linewidth=1)
plt.legend(['Attrition = Yes' , 'Attrition = No', 'Median Income'],prop={'size': 10})
plt.yticks([])
plt.title('Distribution of MonthlyIncome')

plt.show()

# plt.subplot(1,2,2)
# sns.kdeplot(x="PercentSalaryHike", y="MonthlyIncome", hue="Attrition", palette=['g','r'], data=df)
# #sns.despine()
# plt.title('MonthlyIncome vs. PercentSalaryHike')

  <b>Key Observation of Chart Above:</b> 
  
 
Median MonthlyIncome is around 50,000 Rupees. Employees who are paid lower than 80,000 Rupees seem to have a slightly higher risk of attrition. However, the observed difference is not as visually significant compared to features explored earlier.

In [None]:
sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(20,10))

plt.subplot(1,2,1)
#Bivariate KDE: x/y replace the older data/data2 arguments; the lowest density level is left unshaded by default
sns.kdeplot(x=df['MonthlyIncome'][df.Attrition == 'No'], y=df['DistanceFromHome'][df.Attrition == 'No'], cmap="Greens", fill=True)
plt.title('MonthlyIncome vs. DistanceFromHome (of Attrition = No)')

plt.subplot(1,2,2)
sns.kdeplot(x=df['MonthlyIncome'][df.Attrition == 'Yes'], y=df['DistanceFromHome'][df.Attrition == 'Yes'], cmap="Reds", fill=True)
plt.title('MonthlyIncome vs. DistanceFromHome (of Attrition = Yes)')

plt.show()

  <b>Key Observations of Charts Above:</b> 
  
 
  a. Most of the employees who lived further away from office (more than 15 km) are paid around Median MonthlyIncome of 50,000 Rupees. They also exhibit higher risk of attrition.
  
  b. Employees who are paid 3x more (i.e. more than 150,000 Rupees) than Median are also exhibiting higher risk of attrition if they live beyond 5 km of the office.
  

In [None]:
sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(20,10))

plt.subplot(1,2,1)
sns.boxplot(x="AvgEmpScore", y="MonthlyIncome", hue="Attrition", palette=['g','r'], data=df)
plt.legend(loc=1,title='Attrition')
sns.despine()
plt.title('MonthlyIncome vs. AvgEmpScore')

plt.subplot(1,2,2)
sns.boxplot(x="Education", y="MonthlyIncome", hue="Attrition", palette=['g','r'], data=df)
plt.legend(loc=1, title='Attrition')
plt.yticks([])
plt.ylabel('')
sns.despine()
plt.title('MonthlyIncome vs. Education')

plt.show()

  <b>Key Observations of Charts Above:</b> 
  
 
  a. Employees who scored 1.0 in the Employee Survey (AvgEmpScore) have higher risk of attrition regardless of MonthlyIncome.
  
  b. Employees with Education Level '5' (i.e. Doctorate) have higher risk of attrition, especially if their MonthlyIncome is low (at overall Median level or lower).
  

In [None]:
sns.set(context="paper", font_scale=1.5)

Attr = df.groupby(['Attrition']).size()

fig, (ax1) = plt.subplots(1,1,figsize=(5,5))

ax1.pie(Attr, autopct = '%.0f%%', radius= 1, startangle = 0,labels = ('Attrition: No','Attrition: Yes'),labeldistance = 1.1, colors = ('gray','r'),explode=(0,0.1))
ax1.set_title('Overall Attrition Rate')

plt.show()

  <b>Key Observation of Chart Above:</b> 
  
 Overall attrition rate of Company XYZ is 16%.
  

In [None]:
sns.set(context="paper", font_scale=1.5)

Male = df[df['Gender'] == 'Male'].groupby(['Attrition']).size()
Female = df[df['Gender'] == 'Female'].groupby(['Attrition']).size()

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,10))

ax1.pie(Male, autopct = '%.0f%%', radius= 1, startangle = 0,labels = ('Attrition: No','Attrition: Yes'),labeldistance = 1.1,colors = ('gray','r'),explode=(0,0.1))
ax1.set_title('Male Attrition Rate')


ax2.pie(Female, autopct = '%.0f%%', radius= 1, startangle = 0, labels =('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'),explode=(0,0.1))
ax2.set_title('Female Attrition Rate')

plt.show()

  <b>Key Observation of Charts Above:</b> 
  
 
Male employees have marginally higher attrition rate compared to Female employees.
  

In [None]:
sns.set(context="paper", font_scale=1.5)

Married = df[df['MaritalStatus'] == 'Married'].groupby(['Attrition']).size()
Single = df[df['MaritalStatus'] == 'Single'].groupby(['Attrition']).size()
Divorced = df[df['MaritalStatus'] == 'Divorced'].groupby(['Attrition']).size()

fig, (ax1,ax2, ax3) = plt.subplots(1,3,figsize=(15,15))

ax1.pie(Married, autopct = '%.0f%%', radius= 1, startangle = 0,labels = ('Attrition: No','Attrition: Yes'),labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax1.set_title('Married Attrition Rate')

ax2.pie(Single, autopct = '%.0f%%', radius= 1, startangle = 0, labels = ('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'),explode=(0,0.1))
ax2.set_title('Single Attrition Rate')

ax3.pie(Divorced, autopct = '%.0f%%', radius= 1, startangle = 0, labels =('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'),explode=(0,0.1))
ax3.set_title('Divorced Attrition Rate')

plt.show()

  <b>Key Observation of Charts Above:</b> 
  
 
  Attrition rate for Single employees is more than twice that of Married and Divorced employees.
  

In [None]:
sns.set(context="paper", font_scale=1.5)

RD = df[df['Department'] == 'Research & Development'].groupby(['Attrition']).size()
Sales = df[df['Department'] == 'Sales'].groupby(['Attrition']).size()
HR = df[df['Department'] == 'Human Resources'].groupby(['Attrition']).size()

fig, (ax1,ax2, ax3) = plt.subplots(1,3,figsize=(15,15))

ax1.pie(RD, autopct = '%.0f%%', radius= 1, startangle = 0,labels = ('Attrition: No','Attrition: Yes'),labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax1.set_title('R&D Dept Attrition Rate')

ax2.pie(Sales, autopct = '%.0f%%', radius= 1, startangle = 0, labels =('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax2.set_title('Sales Dept Attrition Rate')

ax3.pie(HR, autopct = '%.0f%%', radius= 1, startangle = 0, labels =('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax3.set_title('HR Dept Attrition Rate')

plt.show()

  <b>Key Observation of Charts Above:</b> 
  
 
  Attrition rate of HR Department is twice that of Sales or R&D Department.
  

In [None]:
sns.set(context="paper", font_scale=1.5)

Rare = df[df['BusinessTravel'] == 'Travel_Rarely'].groupby(['Attrition']).size()
Freq = df[df['BusinessTravel'] == 'Travel_Frequently'].groupby(['Attrition']).size()
Non = df[df['BusinessTravel'] == 'Non-Travel'].groupby(['Attrition']).size()

fig, (ax1,ax2, ax3) = plt.subplots(1,3,figsize=(15,15))

ax1.pie(Rare, autopct = '%.0f%%', radius= 1, startangle = 0,labels = ('Attrition: No','Attrition: Yes'),labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax1.set_title('Rare Traveller Attrition Rate')

ax2.pie(Freq, autopct = '%.0f%%', radius= 1, startangle = 0, labels = ('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax2.set_title('Frequent Traveller Attrition Rate')

ax3.pie(Non, autopct = '%.0f%%', radius= 1, startangle = 0, labels =('Attrition: No','Attrition: Yes'), labeldistance = 1.1,colors = ('gray','r'), explode=(0,0.1))
ax3.set_title('Non Traveller Attrition Rate')

plt.show()

  <b>Key Observation of Charts Above:</b> 
  
 
  Attrition rate of employees who are Frequent Business Travellers is significantly higher than employees who are Rare or Non Business Travellers.
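
The per-category attrition rates read off the pie charts above can also be tabulated directly with a `groupby`. The sketch below uses toy data with the notebook's column names (the values are made up); in the notebook the same expression would be applied to `df`.

```python
import pandas as pd

# Toy data with the same column names as the notebook; values are made up
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
                       "Travel_Frequently", "Non-Travel", "Travel_Rarely"],
    "Attrition": ["No", "Yes", "Yes", "Yes", "No", "No"],
})

# Share of leavers per category: the mean of a boolean column is a proportion
attrition_rate = (
    toy["Attrition"].eq("Yes")
       .groupby(toy["BusinessTravel"])
       .mean()
       .sort_values(ascending=False)
)

print(attrition_rate)
```

A single table like this makes it easier to compare categories side by side than a row of separate pie charts.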
  

In [None]:
sns.set(context="paper", font_scale=1.5)
sns.pairplot(df[['Age','MonthlyIncome','DistanceFromHome','AvgEmpScore','YearsWithCurrManager','Attrition']],hue = 'Attrition', height = 5, kind="reg")
plt.show()

  <b>Key Observation of Charts Above:</b> 
  
 
There are some features that are correlated with each other (e.g. Age vs. YearsWithCurrManager). I will be further analysing collinearity of the features using Correlation Matrix in subsequent segments below.
  

## <font color=darkblue>5. Preparing DataFrame for Machine Learning</font>

<b>Duplicating the dataframe to keep a copy of the original
    </b>

In [None]:
#maintaining the integrity of the original dataframe, transform categorical data into numerical data in new dataframe

df_num = df.copy()
df_num.head()

<b>Identifying all features that are categorical in nature
    </b>

In [None]:
catcoldf = df.select_dtypes(include='object')
catcolname = list(catcoldf.columns.values)
catcolname

<b>Transforming the categorical features into numerical features using get_dummies function in Pandas in order for these features to be used by machine learning algorithms
    </b>

In [None]:
df_num = pd.get_dummies(df_num, columns = catcolname)
df_num.head()
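
For readers less familiar with `get_dummies`, here is a small sketch on a toy frame (values are made up). It also shows `drop_first=True`, an alternative that is not used in this notebook: it removes one indicator per feature up front, since the k indicators of a k-level feature are redundant with each other.

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Age": [30, 41, 25]})

# Default behaviour: one indicator column per category
full = pd.get_dummies(toy, columns=["Gender"])

# drop_first=True keeps k-1 indicators, removing the redundant column up front
reduced = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```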

<b>Plotting a Correlation Matrix to identify feature-pairs with high correlation with each other
    </b>

In [None]:
sns.set(style="white")
corr = df_num.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  #np.bool was removed from NumPy; the built-in bool works everywhere
f, ax = plt.subplots(figsize=(30, 30))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
cm = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,annot=True, fmt=".0%", square=True, linewidths=.1, cbar_kws={"shrink": .5}, annot_kws={"size": 15})
bottom, top = cm.get_ylim()
cm.set_ylim(bottom + 0.5, top - 0.5)

plt.show()

<b>Using a for-loop to cycle through the Correlation Matrix to identify 1 leg of feature-pairs with correlation of >0.7
    </b>

In [None]:
highcorrel = set()
correlmatrix = df_num.corr()

for x in range(len(correlmatrix.columns)):
    for y in range(x):
        if abs(correlmatrix.iloc[x, y]) > 0.7:
            colname = correlmatrix.columns[y]
            highcorrel.add(colname)

highcorrel
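
The same selection rule can be written without explicit loops. The sketch below is an equivalent vectorised form on synthetic data (the columns `a`, `b`, `c` are made up): it keeps only the entries below the diagonal and flags a column whenever a later feature is correlated with it above the threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                      # unrelated noise
})

corr_abs = toy.corr().abs()

# Keep only entries below the diagonal (row x, column y with y < x), as in the loop
lower = corr_abs.where(np.tril(np.ones(corr_abs.shape, dtype=bool), k=-1))

# Flag column y when any later feature x is correlated with it above the threshold
to_drop = set(lower.columns[(lower > 0.7).any(axis=0)])

print(to_drop)
```

Note that, as in the loop, it is the earlier column of each highly correlated pair that gets flagged, so the choice of which "leg" to drop depends on column order rather than on which feature is more informative.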

<b>Dropping the features identified above to reduce multi-collinearity
    </b>

In [None]:
df_num = df_num.drop(columns=highcorrel)

<b>Re-plotting Correlation Matrix to check again after feature selection process above
    </b>

In [None]:
sns.set(style="white")
corr = df_num.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  #np.bool was removed from NumPy; the built-in bool works everywhere
f, ax = plt.subplots(figsize=(30, 30))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
cm = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,annot=True, fmt=".0%", square=True, linewidths=.1, cbar_kws={"shrink": .5}, annot_kws={"size": 15})
bottom, top = cm.get_ylim()
cm.set_ylim(bottom + 0.5, top - 0.5)

sns.set_context("paper", font_scale=1.5)

plt.show()

<b>DataFrame is now ready for Machine Learning implementations
    </b>

## <font color=darkblue>6. Implementing Machine Learning Algorithms</font>

The key objectives of this project are to: 

    a. Provide the Management (of Company XYZ) with the Top Factors that have a major influence on why their employees leave the organisation
    b. Build a Machine Learning model to predict and identify employees with high attrition risks for early Management (of Company XYZ) intervention
  
As such, this is a <b>Classification</b> problem with a binary/categorical predictive outcome (i.e. Attrition:Yes or Attrition:No). The key is to be able to identify top features that can be used to accurately predict employees who are likely to leave the company (i.e. target variable value = 1). Therefore, <b>F1 Score</b> should be used as the main metric to evaluate the performance of the model.

Model explainability, or the ability to articulate to Management of Company XYZ the top reasons that the model used for prediction, is a fundamental requirement. A 'black-box' model, even if it is extremely accurate, is an unacceptable outcome. For this reason, I choose not to apply Principal Component Analysis (PCA) as a dimensionality reduction technique as PCA would heavily transform known features into unrecognisable features.

I will be implementing and optimising the following Classification algorithms to compare and evaluate their performance, before recommending the final model.

    1. Logistic Regression
    2. Random Forest
    3. Support Vector Machine
    


### 6.1 Logistic Regression Model

### <font color=darkred>6.1.1 Baseline Hyperparameters</font>

<b>Performing Train-Test Split and Data Scaling required for Logistic Regression
    </b>

In [None]:
X = df_num.drop('Attrition_Yes', axis =1)
y = df_num['Attrition_Yes']

X_train,X_test, y_train, y_test = train_test_split(X,y, test_size = 0.30, random_state =1)

ss = preprocessing.StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
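
The cell above fits the scaler on the training rows only and reuses those statistics on the test rows, which avoids leaking test information into training. The sketch below repeats that pattern on synthetic data and adds `stratify`, which is an assumption on my part and not part of the split used in this notebook; it keeps the share of Class 1 the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a 16% positive class; not the HR dataset
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows

print(y_tr.mean(), y_te.mean())
```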

<b>Creating Model Object, Training the Model and Creating Predictions from the Model (using default model hyperparameters)
    </b>

In [None]:
#TimerStart
lrstart = timer()

lr = LogisticRegression(random_state=1)

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

<b>Model Performance Evaluation
    </b>

In [None]:
lraccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
lrroc_auc = auc(fpr, tpr)

lrf1_score = f1_score(y_test, y_pred)

lrrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', lraccuracy)
print('ROC_AUC Score: ', lrroc_auc)
print('F1 Score: ', lrf1_score)
print('Recall Score: ', lrrecall)
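
A note on the ROC_AUC figure: the cell above passes hard 0/1 predictions to `roc_curve`, which yields a curve with a single interior point. ROC curves are usually built from continuous scores instead. The tiny hand-made example below (labels and scores are hypothetical) shows that the two inputs can give different AUC values for the same model.

```python
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]          # hypothetical predicted probabilities
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard 0/1 predictions at a 0.5 cut-off

fpr_s, tpr_s, _ = roc_curve(y_true, scores)
fpr_l, tpr_l, _ = roc_curve(y_true, labels)

auc_scores = auc(fpr_s, tpr_s)  # uses the full ranking of employees
auc_labels = auc(fpr_l, tpr_l)  # only sees one threshold

print(auc_scores, auc_labels)
```

In this notebook, the continuous scores would come from something like `lr.predict_proba(X_test)[:, 1]`. The F1 and Recall figures are unaffected, since they are defined on hard predictions.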

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)
lrcm = confusion_matrix(y_test, y_pred)

ax = heatmap = sns.heatmap(lrcm, cmap="Blues", annot= True,fmt=".0f")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Confusion Matrix: Logistic Regression')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
lrtime = (timer() - lrstart)

  <b>Conclusion</b> 
  
 
While Model Accuracy is decent at 84%, F1 Score is unacceptable at 26%. We should either reject this model or optimise the hyperparameters to seek a higher F1 Score.

### <font color=darkred>6.1.2 Tuned Hyperparameters</font>

<b>Performing Train-Test Split and Data Scaling required for Logistic Regression
    </b>

In [None]:
X = df_num.drop('Attrition_Yes', axis =1)
y = df_num['Attrition_Yes']

X_train,X_test, y_train, y_test = train_test_split(X,y, test_size = 0.30, random_state =1)

ss = preprocessing.StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

<b>Creating Model Object, Training the Model and Creating Predictions from the Model (using class_weight='balanced' instead of the default)
    </b>

I will introduce 1 change to the model training by specifying class_weight. From the Exploratory Data Analysis, I know that the dataset is imbalanced, with only 16% of the records belonging to Class 1 (our subject of interest) and 84% belonging to Class 0.

Therefore, setting a 'balanced' class_weight attempts to correct the model training to shift more weight to Class 1 relative to the default hyperparameter.
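
For intuition, the weights implied by 'balanced' follow the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates it on hypothetical labels that mirror the 84% / 16% split (1,000 made-up records, not the actual training set).

```python
import numpy as np

# Hypothetical labels mirroring the 84% / 16% split
y_demo = np.array([0] * 840 + [1] * 160)

n_samples = len(y_demo)
n_classes = len(np.unique(y_demo))

# Formula documented by scikit-learn for class_weight="balanced"
weights = n_samples / (n_classes * np.bincount(y_demo))

print(dict(enumerate(weights.round(3))))
```

Under these assumed proportions, each Class 1 record counts a little over five times as much as a Class 0 record during training, which is why Recall on Class 1 tends to rise at the expense of overall accuracy.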

In [None]:
#TimerStart
lrtunedstart = timer()

lrtuned = LogisticRegression(random_state=1, class_weight="balanced")

lrtuned.fit(X_train, y_train)

y_pred = lrtuned.predict(X_test)

<b>Model Performance Evaluation
    </b>

In [None]:
lrtunedaccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
lrtunedroc_auc = auc(fpr, tpr)

lrtunedf1_score = f1_score(y_test, y_pred)

lrtunedrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', lrtunedaccuracy)
print('ROC_AUC Score: ', lrtunedroc_auc)
print('F1 Score: ', lrtunedf1_score)
print('Recall Score: ', lrtunedrecall)

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)
lrtunedcm = confusion_matrix(y_test, y_pred)

ax = heatmap = sns.heatmap(lrtunedcm, cmap="Blues", annot= True,fmt=".0f")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Confusion Matrix: Logistic Regression (*Tuned*)')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
lrtunedtime = (timer() - lrtunedstart)

  <b>Conclusion</b> 
  
 
Model Accuracy dropped from 84% to 69%, but F1 Score improved significantly from 26% to 43%, and Recall Score is now 70% versus 16% before. This is still not a good enough model: an F1 Score of 43% alongside a Recall Score of 70% implies a precision of only roughly 31%, so the model would falsely flag too many employees as leaving when in reality they stay.

<b>Retrieving and plotting the coefficients of all features in the model to identify Top Features
    </b>

In [None]:
# coef_ has shape (1, n_features) for a binary target, so take its single row
lrtunedcoef = lrtuned.coef_[0]

colname = X.columns.values

lrtunedtopfeat = pd.DataFrame( data = lrtunedcoef , index = colname, columns = ['Coefficient'])
lrtunedtopfeat = lrtunedtopfeat.sort_values(by = ['Coefficient'] ,ascending = False)
lrtunedtopfeat['Positive'] = lrtunedtopfeat['Coefficient']>0

In [None]:
sns.set(context="paper", font_scale=1.2)
plt.figure(figsize=(15,10))
sns.set(font_scale=1.2)
lrtunedtopfeat['Coefficient'].plot(kind='barh', color=lrtunedtopfeat.Positive.map({True: 'g', False: 'r'}))
green = mpatches.Patch(color='g', label = 'Positive Coefficient')
red = mpatches.Patch(color='r', label = 'Negative Coefficient')
plt.gca().invert_yaxis()
plt.legend(handles = [green, red], loc='lower right')
plt.title('Coefficients of ALL Features')
plt.show()

  <b>Key Observation of Chart Above:</b> 
  
 
There are more features with negative coefficients than with positive coefficients. Features with negative coefficients (e.g. YearsWithCurrManager) have an inverse relationship with Attrition Risk (target variable value = 1). 

Using YearsWithCurrManager as an example, this means that the fewer years the employee has spent with the current manager, the more likely the employee is to leave the organisation. This coincides with the observation made in the Exploratory Data Analysis section.
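
One way to read the size of a coefficient is to convert it into an odds ratio with exp(coefficient). Because the features were standardised, the ratio describes the change in the odds of attrition for a one-standard-deviation increase in that feature. The coefficient values and feature names below are hypothetical and chosen only to illustrate the arithmetic; they are not the fitted values from `lrtuned`.

```python
import math

# Hypothetical coefficients for illustration only (not the fitted model values).
example_coefs = {"FeatureA": -0.50, "FeatureB": 0.30}

odds_ratios = {name: math.exp(coef) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "lowers" if ratio < 1 else "raises"
    print(f"{name}: odds ratio = {ratio:.3f} "
          f"(+1 standard deviation {direction} the odds of attrition)")
```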
  

<b>Top 5 Features (absolute value of coefficients)
    </b>

In [None]:
lrtunedtopfeat['AbsCoef'] = lrtunedtopfeat['Coefficient'].abs()
lrtunedtopfeat = lrtunedtopfeat.sort_values(by = 'AbsCoef', ascending = False)[0:5]

sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(15,10))
lrtunedtopfeat['AbsCoef'].plot(kind = 'barh', color=lrtunedtopfeat.Positive.map({True: 'g', False: 'r'}))
plt.gca().invert_yaxis()
plt.legend(handles = [green, red], loc='lower right')
plt.title('(Absolute) Coefficients of Top 5 Features')
plt.show()

The <b>Top 5</b> Features are in the order of:

    1. YearsWithCurrManager
    2. YearsSinceLastPromotion
    3. BusinessTravel_Non-Travel
    4. TotalWorkingYears
    5. BusinessTravel_Travel_Rarely
  

### 6.2 Random Forest Classification

### <font color=darkred>6.2.1 Baseline Hyperparameters (with Max Depth arbitrarily set at 10)</font>

<b>Performing Train-Test Split
    </b>

In [None]:
X = df_num.drop('Attrition_Yes', axis =1)
y = df_num['Attrition_Yes']

X_train,X_test, y_train, y_test= train_test_split(X,y, test_size = 0.30, random_state=1)

<b>Creating Model Object, Training the Model and Creating Predictions from the Model (using baseline hyperparameters: max_depth = 10, class_weight = 'balanced')
    </b>

In [None]:
#TimerStart
rfstart = timer()

rf = RandomForestClassifier(n_estimators = 100, max_depth = 10, max_features = None, random_state=1, class_weight = "balanced")
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

<b>Model Performance Evaluation
    </b>

In [None]:
rfaccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
rfroc_auc = auc(fpr, tpr)

rff1_score = f1_score(y_test, y_pred)

rfrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', rfaccuracy)
print('ROC_AUC Score: ', rfroc_auc)
print('F1 Score: ', rff1_score)
print('Recall Score: ', rfrecall)

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)

rfcm = confusion_matrix(y_test, y_pred)

ax = heatmap = sns.heatmap(rfcm, cmap="Blues", annot= True,fmt=".0f") 
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

plt.title('Confusion Matrix: Random Forest')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
rftime = (timer() - rfstart)

  <b>Conclusion</b> 
  
 The model accuracy and F1 Score of the Random Forest classification model improved dramatically over the Logistic Regression model built earlier. In particular, both F1 Score and Recall Score (more relevant to our analysis) are at 86%. 
 
 With these Scores, this model is good enough for use. However, I will attempt to see if I can further increase these Scores by optimising the hyperparameters of the Random Forest algorithm.


### <font color=darkred>6.2.2 Tuned Hyperparameters</font>

<b>Optimising the number of trees used in the model (n_estimators)
    </b>

With a for-loop over 9 n_estimators values, I am seeking the optimum number of trees by running multiple Random Forest iterations and picking the n_estimators value with the highest F1 Score.

In [None]:
#TimerStart
rftunedstart = timer()

n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
trainf1 = []
testf1 = []

for i in n_estimators:
    rf = RandomForestClassifier(n_estimators=i, random_state=1, class_weight = "balanced")
    rf.fit(X_train, y_train)
    
    train_pred = rf.predict(X_train)
    
    f1train = f1_score(y_train, train_pred)
    trainf1.append(f1train)
    
    y_pred = rf.predict(X_test)
    
    f1test = f1_score(y_test, y_pred)
    testf1.append(f1test)
    

sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(15,10))
plt.plot(n_estimators, trainf1, 'b', label= 'Train F1 Score')
plt.plot(n_estimators, testf1, 'g', label= 'Test F1 Score')

plt.legend()
plt.ylabel('F1 Score')
plt.xlabel('n_estimators')
plt.show()

<b>Extracting the best n_estimators value and storing it as a variable for subsequent optimisation
    </b>

In [None]:
a = list(zip(n_estimators, testf1))
# The second column holds the test F1 Score (not accuracy), so it is named accordingly
b = pd.DataFrame(data=a, columns=('NTrees', 'F1 Score'))
bestntree = int(b.iloc[b['F1 Score'].idxmax(), 0])
print('Optimum No. of Trees: ', bestntree)
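
The loop above selects n_estimators by its F1 Score on the test set, which can make the final test score look more optimistic than it would be on unseen data. One common alternative is to select by cross-validated F1 on the training data only. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo` and `candidate_trees` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the attrition data (assumption: roughly 16% positives).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=1, stratify=y_demo)

candidate_trees = [4, 16, 32]  # shortened list to keep the sketch fast
cv_f1 = {}
for n in candidate_trees:
    model = RandomForestClassifier(n_estimators=n, class_weight="balanced",
                                   random_state=1)
    # Mean F1 over 5 folds of the training data; the test set is never touched
    cv_f1[n] = cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1").mean()

best_n = max(cv_f1, key=cv_f1.get)
print("Cross-validated F1 by n_estimators:", cv_f1)
print("Selected n_estimators:", best_n)
```

The test set is then used once, at the end, to score the model refitted with the selected value.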

<b>Optimising the max depth used in the model (max_depth)
    </b>

With a for-loop over 20 max_depth values (from 1 to 20), I am seeking the optimum max_depth by running multiple Random Forest iterations and picking the max_depth value that yields the highest F1 Score. 

The variable created earlier containing the most optimum number of trees (bestntree) is introduced into this for-loop.

In [None]:
max_depth = np.arange(1,21,1)
trainf1 = []
testf1 = []

for i in max_depth:
    rf = RandomForestClassifier(n_estimators = bestntree, max_depth=i, random_state=1,class_weight = "balanced")
    rf.fit(X_train, y_train)
    
    train_pred = rf.predict(X_train)
    
    f1train = f1_score(y_train, train_pred)
    trainf1.append(f1train)
    
    y_pred = rf.predict(X_test)
    
    f1test = f1_score(y_test, y_pred)
    testf1.append(f1test)

sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(15,10))
plt.plot(max_depth, trainf1, 'b', label= 'Train F1 Score')
plt.plot(max_depth, testf1, 'g', label= 'Test F1 Score')

plt.legend()
plt.ylabel('F1 Score')
plt.xlabel('max_depth')
plt.show()

<b>Extracting the best max_depth value and storing it as a variable for subsequent optimisation
    </b>

In [None]:
c = list(zip(max_depth, testf1))
# The second column holds the test F1 Score (not accuracy), so it is named accordingly
d = pd.DataFrame(data=c, columns=('NDepth', 'F1 Score'))
bestndepth = int(d.iloc[d['F1 Score'].idxmax(), 0])
print('Optimum Max Depth: ', bestndepth)

<b>Optimising the max features used in the model (max_features)
    </b>

With a for-loop over all the input feature counts, I am seeking the optimum max_features by running multiple Random Forest iterations and picking the max_features value that yields the highest F1 Score. 

The variables created earlier containing the optimum number of trees (bestntree) and optimum max_depth (bestndepth) are introduced into this for-loop.

In [None]:
# range() excludes its upper bound, so add 1 to also try the full feature count
max_features = list(range(1, X_train.shape[1] + 1))
trainf1 = []
testf1 = []
for i in max_features:
    rf = RandomForestClassifier(n_estimators = bestntree, max_depth=bestndepth, max_features=i, random_state=1,class_weight = "balanced")
    rf.fit(X_train, y_train)
    
    train_pred = rf.predict(X_train)
    
    f1train = f1_score(y_train, train_pred)
    trainf1.append(f1train)
    
    y_pred = rf.predict(X_test)
    
    f1test = f1_score(y_test, y_pred)
    testf1.append(f1test)

sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(15,10))
plt.plot(max_features, trainf1, 'b', label= 'Train F1 Score')
plt.plot(max_features, testf1, 'g', label= 'Test F1 Score')

plt.legend()
plt.ylabel('F1 Score')
plt.xlabel('max_features')
plt.show()

<b>Extracting the best max_features value and storing it as a variable for subsequent optimisation
    </b>

In [None]:
e = list(zip(max_features, testf1))
# The second column holds the test F1 Score (not accuracy), so it is named accordingly
f = pd.DataFrame(data=e, columns=('NFeat', 'F1 Score'))
bestnfeat = int(f.iloc[f['F1 Score'].idxmax(), 0])
print('Optimum Max Feature: ', bestnfeat)

<b>Re-creating a new and tuned Random Forest model with these identified input hyperparameters
    </b>

- n_estimators = 32 (bestntree)
- max_depth = 16 (bestndepth)
- max_features = 12 (bestnfeat)

In [None]:
X = df_num.drop('Attrition_Yes', axis =1)
y = df_num['Attrition_Yes']

X_train,X_test, y_train, y_test= train_test_split(X,y, test_size = 0.30, random_state=1)

rftuned = RandomForestClassifier(n_estimators = bestntree, max_depth = bestndepth , max_features = bestnfeat, random_state=1, class_weight="balanced")
rftuned.fit(X_train, y_train)
y_pred = rftuned.predict(X_test)

<b>Model Evaluation Scores
    </b>

In [None]:
rftunedaccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
rftunedroc_auc = auc(fpr, tpr)

rftunedf1_score = f1_score(y_test, y_pred)

rftunedrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', rftunedaccuracy)
print('ROC_AUC Score: ', rftunedroc_auc)
print('F1 Score: ', rftunedf1_score)
print('Recall: ', rftunedrecall)

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)

rftcm = confusion_matrix(y_test,y_pred)

ax = heatmap = sns.heatmap(rftcm, cmap="Blues", annot= True,fmt=".0f") 
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Confusion Matrix: Random Forest (*Tuned*)')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
rftunedtime = (timer() - rftunedstart)

  <b>Conclusion</b> 
  
 With the 3 tuned hyperparameters as inputs, the F1 Score improved from 86% to 95%! Recall Score is also considerably high at 91%. This model is definitely good enough for use.


<b>Retrieving and visualising the Feature Importance
    </b>

In [None]:
feature_importances = pd.DataFrame(rftuned.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False)

plt.figure(figsize=(15,10))
sns.set(context="paper", font_scale=1.2)
feature_importances['importance'].plot(kind='barh')
plt.title('Relative Importance of ALL Features')
plt.gca().invert_yaxis()
plt.show()

<b>Top 5 Most Important Features of Random Forest Model
    </b>

In [None]:
feature_importances = pd.DataFrame(rftuned.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False)[0:5]

sns.set(context="paper", font_scale=1.5)
plt.figure(figsize=(15,10))
feature_importances['importance'].plot(kind='barh')
plt.title('Top 5 Features')
plt.gca().invert_yaxis()
plt.show()

The <b>Top 5</b> Features are in the order of:

    1. Age
    2. TotalWorkingYears
    3. MonthlyIncome
    4. YearsWithCurrManager
    5. DistanceFromHome
  

### 6.3 Support Vector Machine

### <font color=darkred>6.3.1 Baseline Hyperparameters</font>

<b>Performing Train-Test Split and Scaling of data for SVM
    </b>

In [None]:
X = df_num.drop('Attrition_Yes', axis =1)
y = df_num['Attrition_Yes']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

ss = preprocessing.StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

<b>Creating Model Object, Training the Model and Creating Predictions from the Model (using default model hyperparameters, apart from class_weight = 'balanced')
    </b>

In [None]:
#TimerStart
svmstart = timer()

svm = SVC(random_state=1, class_weight = 'balanced')
svm.fit(X_train,y_train)
y_pred = svm.predict(X_test)

<b>Model Performance Scores
    </b>

In [None]:
svmaccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
svmroc_auc = auc(fpr, tpr)

svmf1_score = f1_score(y_test, y_pred)

svmrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', svmaccuracy)
print('ROC_AUC Score: ', svmroc_auc)
print('F1 Score: ', svmf1_score)
print('Recall: ', svmrecall)

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)

svmcm = confusion_matrix(y_test,y_pred)

ax = heatmap = sns.heatmap(svmcm, cmap="Blues", annot= True,fmt=".0f") 
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

plt.title('Confusion Matrix: SVM')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
svmtime = (timer() - svmstart)

  <b>Conclusion</b> 
  
  Model accuracy is good at 91%, but F1 Score is only at 75%. Since F1 Score is the key metric, I will proceed to tune the hyperparameter C to see if I am able to obtain a better F1 Score.
 

### <font color=darkred>6.3.2 Tuned Hyperparameters</font>

<b>Optimising C (regularisation parameter) and gamma with the use of GridSearchCV
    </b>

In [None]:
#TimerStart
svmtunedstart = timer()

param_grid = {'C': [0.1,0.5,0.6,0.7,0.8,0.9, 1 , 10 , 20],  
              'gamma': ['auto','scale']} 
  
grid = GridSearchCV(SVC(class_weight = 'balanced'), param_grid, refit = True, verbose = 1)

grid.fit(X_train,y_train)

In [None]:
print ('Best Combination of Hyperparameters: ')
grid.best_estimator_
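
By default, GridSearchCV ranks candidates with the estimator's own score method, which for SVC is accuracy. Since F1 Score is the chosen metric in this analysis, one option is to pass scoring='f1' so that the search optimises the same metric that is reported. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, `demo_grid` and `grid_f1` are assumptions for illustration, and the grid is shortened to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the attrition data (assumption: roughly 16% positives).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=1, stratify=y_demo)

# Scale using statistics from the training split only
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

demo_grid = {"C": [0.1, 1, 10], "gamma": ["auto", "scale"]}

# scoring="f1" makes the search rank candidates by cross-validated F1
grid_f1 = GridSearchCV(SVC(class_weight="balanced"), demo_grid,
                       scoring="f1", cv=5, refit=True)
grid_f1.fit(X_tr, y_tr)

print("Best hyperparameters by cross-validated F1:", grid_f1.best_params_)
print("Best cross-validated F1:", round(grid_f1.best_score_, 3))
```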

<b>Model Performance Scores
    </b>

In [None]:
y_pred = grid.best_estimator_.predict(X_test)

svmtaccuracy = accuracy_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
svmtroc_auc = auc(fpr, tpr)

svmtf1_score = f1_score(y_test, y_pred)

svmtrecall = recall_score(y_test, y_pred)

print('Model Accuracy: ', svmtaccuracy)
print('ROC_AUC Score: ', svmtroc_auc)
print('F1 Score: ', svmtf1_score)
print('Recall: ', svmtrecall)

<b>Plotting Confusion Matrix
    </b>

In [None]:
sns.set(context="paper", font_scale=1.5)

svmtcm = confusion_matrix(y_test,y_pred)

ax = heatmap = sns.heatmap(svmtcm, cmap="Blues", annot= True,fmt=".0f") 
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Confusion Matrix: SVM (*Tuned*)')
plt.ylabel('Real Outcome')
plt.xlabel('Predicted Outcome')
plt.yticks(rotation=0)
plt.show()

#TimerEnd
svmtunedtime = (timer()- svmtunedstart)

  <b>Conclusion</b> 
  
  By optimising hyperparameter C, the F1 Score improved from 75% to 91%! However, an important downside of using SVM is that feature importance cannot be retrieved directly when the RBF kernel is used. This goes against our analysis objective of preserving model explainability.
 
 Furthermore, the F1 Score is slightly lower than that of the optimised Random Forest model (seen earlier).
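
Although an RBF-kernel SVC exposes no coefficients, a model-agnostic technique such as permutation importance can still give a rough feature ranking: each feature is shuffled in turn and the resulting drop in score is measured. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; `X_demo`, `y_demo` and `svm_demo` are assumptions for illustration, and this was not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the attrition data (assumption: roughly 16% positives).
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=1, stratify=y_demo)

scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

svm_demo = SVC(kernel="rbf", class_weight="balanced", random_state=1)
svm_demo.fit(X_tr, y_tr)

# Shuffle each feature 5 times and record the mean drop in F1 on held-out data
result = permutation_importance(svm_demo, X_te, y_te, scoring="f1",
                                n_repeats=5, random_state=1)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"Feature {idx}: mean F1 drop = {result.importances_mean[idx]:.3f}")
```

Permutation importance gives the magnitude of each feature's contribution but, unlike a regression coefficient, not its direction.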

## 7. Model Evaluation and Recommendation

Thus far, I have effectively built 6 models applying 3 different Classification Machine Learning algorithms. Let me now consolidate and summarise the scores of all 6 models in a table to facilitate evaluation and selection of the final model.

As explained earlier, though I have computed 4 metrics (Accuracy, ROC_AUC, F1 and Recall) for all 6 models, <b>F1 Score</b> remains the chosen evaluation metric given the objective of the analysis.

### 7.1 Summarising scores from all 6 models

In [None]:
LogReg = ( lraccuracy, lrroc_auc , lrf1_score , lrrecall )
LogRegT = ( lrtunedaccuracy, lrtunedroc_auc , lrtunedf1_score , lrtunedrecall )
RF = ( rfaccuracy , rfroc_auc , rff1_score , rfrecall )
RFT = ( rftunedaccuracy , rftunedroc_auc , rftunedf1_score , rftunedrecall )
# Named SVM / SVMT so that the imported sklearn SVC class is not overwritten
SVM = ( svmaccuracy , svmroc_auc , svmf1_score , svmrecall )
SVMT = ( svmtaccuracy , svmtroc_auc , svmtf1_score , svmtrecall )

In [None]:
modelsummary = pd.DataFrame(data = (LogReg, LogRegT, RF, RFT, SVM, SVMT), 
                         columns = ('Accuracy', 'ROC_AUC Score', 'F1 Score','Recall Score')).mul(100).round(1).astype(str).add('%')

modelsummary.rename(index={0 : 'Logistic Regression (*Baseline*)' , 
                        1 : 'Logistic Regression (*Tuned*)' , 
                        2 : 'Random Forest (*Baseline*)' , 
                        3 : 'Random Forest (*Tuned*)' , 
                        4 : 'Support Vector Machine (*Baseline*)' , 
                        5 : 'Support Vector Machine (*Tuned*)' },
                 inplace = True)

modelsummary['Run Time'] = (lrtime, lrtunedtime, rftime, rftunedtime, svmtime, svmtunedtime)

In [None]:
modelsummary

  <b>Conclusion</b> 
  
Based on its F1 Score of <b>95.1%</b>, <b>Random Forest (Tuned)</b> model performs the best among the 6 models.

### 7.2 Proposal to Management of Company XYZ

I propose the adoption and implementation of my <b>Random Forest (Tuned)</b> model. With this model, Management would be able to predict and identify the employees with high attrition risk.

The adoption of my Random Forest (Tuned) model with a <b>95%</b> F1 Score would imply the following:

1. For every <b>100</b> employees who truly want to leave the organisation, the model will be able to identify up to <b>91</b> of them on a timely basis.
2. For every <b>100</b> employees identified by the model to have high attrition risk, <b>all 100</b> of them indeed want to leave the company.
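
The second point follows from the reported scores: since F1 = 2PR / (P + R), precision P can be recovered from F1 and recall R. The short sketch below does this with the rounded figures quoted in this report (F1 of 95.1%, Recall of 91%), so the result is approximate; the helper name `precision_from_f1_and_recall` is introduced here for illustration.

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover precision P."""
    return f1 * recall / (2 * recall - f1)

# Rounded scores of the Random Forest (Tuned) model as quoted in the text
precision = precision_from_f1_and_recall(f1=0.951, recall=0.91)
print(f"Implied precision: {precision:.1%}")
```

With these rounded inputs the implied precision is close to 100%, which is consistent with the statement that practically every employee flagged by the model does want to leave.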
    
The <b>Top 5</b> factors that the model is basing the prediction on are in the following order:

1. <b>Age:</b> The younger the employee, the more likely he/she is to leave the organisation.
2. <b>TotalWorkingYears:</b> The fewer working years the employee has accumulated over his/her career, the more likely he/she is to leave the organisation.
3. <b>MonthlyIncome:</b> The lower the monthly income, the more likely the employee is to leave the organisation.
4. <b>YearsWithCurrManager:</b> The fewer years spent with the current manager, the more likely the employee is to leave the organisation.
5. <b>DistanceFromHome:</b> The farther the employee's home is from the office, the more likely he/she is to leave the organisation.