# Supervised Machine Learning Models - Random Forest Classification - Exercise Solution

#### Exercise

In this exercise you will return to the **WA_Fn-UseC_-HR-Employee-Attrition.csv** dataset, which we used to predict employee attrition when learning logistic regression in a previous lecture. As a reminder, the data was obtained from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset and contains the following variables:

| Variable | Definition |
| --- | --- |
| Attrition | 'Yes' if the employee leaves the company, 'No' if the employee stays with the company |
| EmployeeNumber | Unique identifier for each employee |
| Age | Age in years of the employee |
| BusinessTravel | Frequency of business travel: 'Travel_Frequently', 'Travel_Rarely' or 'Non-Travel' |
| DailyRate | Daily rate of pay for the employee |
| Department | Department the employee belongs to: 'Sales', 'Research & Development' or 'Human Resources' |
| DistanceFromHome | Distance from employee's home to workplace |
| Education | Level of education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor' |
| EducationField | Field of study in which the employee obtained their highest education: 'Life Sciences', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources', 'Engineering', 'Arts', or 'Other' |
| EmployeeCount | Number of employees in the company |
| EnvironmentSatisfaction | Employee's level of satisfaction with their work environment: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| Gender | Employee's gender: 'Male' or 'Female' |
| HourlyRate | Hourly rate of pay for the employee |
| JobInvolvement | Employee's level of job involvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| JobLevel | Employee's job level: 1 'Entry Level', 2 'Intermediate Level', 3 'Managerial Level', 4 'Director Level', 5 'Executive Level' |
| JobRole | Employee's job role: 'Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources' |
| JobSatisfaction | Employee's level of job satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| MaritalStatus | Employee's marital status: 'Single', 'Married' or 'Divorced' |
| MonthlyIncome | Monthly income of the employee |
| MonthlyRate | Monthly rate of pay for the employee |
| NumCompaniesWorked | Number of companies the employee has worked for |
| Over18 | Whether the employee is over 18 years old: 'Y' or 'N' |
| OverTime | Whether the employee works overtime: 'Yes' or 'No' |
| PercentSalaryHike | Percentage increase in salary for the employee |
| PerformanceRating | Employee's performance rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding' |
| RelationshipSatisfaction | Employee's level of satisfaction with their relationships at work: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| StandardHours | Standard number of working hours for the company |
| StockOptionLevel | Employee's level of stock options: 0 'None', 1 'Low', 2 'Medium', 3 'High' |
| TotalWorkingYears | Total number of years the employee has worked |
| TrainingTimesLastYear | Number of times the employee received training last year |
| WorkLifeBalance | Employee's level of work-life balance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best' |
| YearsAtCompany | Number of years the employee has worked at the company |
| YearsInCurrentRole | Number of years the employee has been in their current role |
| YearsSinceLastPromotion | Number of years since the employee's last promotion |
| YearsWithCurrManager | Number of years the employee has been working under their current manager |

Use random forest classification to predict whether the employee will leave the company (i.e., **Attrition**) using all possible independent variables in the model. Train the model on a random sample of 75% of the observations in the dataset, and test the model on the remaining 25% of the observations in the dataset. How well did the model perform?

In [1]:
import pandas as pd
import numpy as np
import warnings
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 10000)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:,.3f}'.format

In [2]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2


In [None]:
# Read in the data
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Drop columns that carry no predictive information:
#   'EmployeeCount' has all values equal to 1,
#   'StandardHours' has all values equal to 80,
#   'EmployeeNumber' is an identifier and should have no relation with Attrition.
df.drop(columns=['EmployeeCount', 'StandardHours', 'EmployeeNumber'], inplace=True)

# Create an indicator variable equal to 1 if Attrition is 'Yes' and equal to 0 if Attrition is 'No'
df['Attrition'] = np.where(df['Attrition'] == 'Yes', 1, 0)

# Create dummy variables for categorical columns.
# 'Over18' takes a single value ('Y'), so drop_first=True removes it entirely.
df = pd.get_dummies(df, columns=['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime'], drop_first=True, dtype=int)

# Create list of independent variables (every column except the target)
indep_vars = [x for x in df.columns if x != 'Attrition']

# Create X and y DataFrames.
# A random forest needs no intercept; the constant column is harmless because it can never be used for a split.
X = df[indep_vars].assign(_const=1)
y = df[['Attrition']]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123456, stratify=y)

print(f'# Observations in X_train: {len(X_train)}')
print(f'# Observations in y_train: {len(y_train)}')
print(f'# Observations in X_test: {len(X_test)}')
print(f'# Observations in y_test: {len(y_test)}')

display(X_train.head())
display(y_train.head())

# Observations in X_train: 1102
# Observations in y_train: 1102
# Observations in X_test: 368
# Observations in y_test: 368
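
The counts above follow from how `train_test_split` handles a fractional `test_size`: with 1,470 observations, 25% is 367.5, which scikit-learn rounds up to 368 test rows, leaving 1,102 for training. Passing `stratify=y` additionally keeps the share of leavers roughly equal in both parts. The sketch below illustrates this on synthetic labels; the 16% positive rate is an assumption chosen to resemble the test set, not a value read from the data file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition labels (assumption: about 16% positives)
n_obs = 1470
y_demo = np.zeros(n_obs, dtype=int)
y_demo[: int(0.16 * n_obs)] = 1
X_demo = np.arange(n_obs).reshape(-1, 1)  # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(f'Train rows: {len(y_tr)}, test rows: {len(y_te)}')
print(f'Share of positives, train: {y_tr.mean():.3f}')
print(f'Share of positives, test : {y_te.mean():.3f}')
```

Without `stratify`, the two shares can drift apart by chance, which matters when the positive class is small.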


,Age,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Male,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes,_const
241,32,976,26,4,3,100,3,2,4,4465,12069,0,18,3,1,0,4,2,3,3,2,2,2,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1
769,26,921,1,1,1,66,2,1,3,2007,25265,1,13,3,3,2,5,5,3,5,3,1,3,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
1455,40,1322,2,4,3,52,2,1,3,2809,2725,2,14,3,4,0,8,2,3,2,2,2,2,0,1,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1
1089,37,674,13,3,1,47,3,2,4,4285,3031,1,17,3,1,0,10,2,3,10,8,3,7,0,1,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1
1007,29,337,14,1,3,84,3,3,4,7553,22930,0,12,3,1,0,9,1,3,8,7,7,7,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1


,Attrition
241,0
769,0
1455,0
1089,0
1007,1
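
`Over18` is listed in the `get_dummies` call but has no column among the training features above. That is what `drop_first=True` does to a column with a single distinct value: the only level is the reference level, so nothing remains. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: 'Over18' has one level, 'OverTime' has two
toy = pd.DataFrame({
    'Over18': ['Y', 'Y', 'Y'],
    'OverTime': ['Yes', 'No', 'Yes'],
})

# drop_first=True drops the first (alphabetical) level of each column
dummies = pd.get_dummies(toy, columns=['Over18', 'OverTime'], drop_first=True, dtype=int)
print(list(dummies.columns))  # ['OverTime_Yes']
```

The same mechanism explains why `BusinessTravel` contributes only two dummy columns: the alphabetically first level becomes the omitted reference category.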


In [4]:
# Estimate the random forest classification model and create predictions in the out-of-sample testing set
model = RandomForestClassifier(n_estimators=1000, bootstrap=True, max_features='sqrt', random_state=123456)
model.fit(X_train, y_train.values.ravel())

# Work on a copy so the new column is not written to a slice of the original DataFrame
y_test = y_test.copy()
y_test['Attrition_p'] = model.predict(X_test)
y_test.head()

,Attrition,Attrition_p
1076,0,0
392,0,0
1134,0,0
409,0,0
1350,0,0
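
For a binary target, `predict` assigns the class with the highest probability averaged over the trees, which amounts to a 0.5 cutoff on `predict_proba`. When one class is rare, a lower cutoff trades some specificity for sensitivity. The sketch below shows the mechanics on synthetic data from `make_classification`; the 0.3 cutoff, the sample size, and the class weights are arbitrary assumptions for illustration, not tuned values for the attrition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 15% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class, averaged over the trees
p_pos = rf.predict_proba(X_te)[:, 1]

def sensitivity(y_true, y_pred):
    """Share of actual positives that are predicted positive."""
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

pred_default = rf.predict(X_te)           # class with the highest averaged probability
pred_lower = (p_pos >= 0.3).astype(int)   # hypothetical lower cutoff

print(f'Sensitivity at default cutoff: {sensitivity(y_te, pred_default):.3f}')
print(f'Sensitivity at 0.3 cutoff    : {sensitivity(y_te, pred_lower):.3f}')
```

Lowering the cutoff can only add predicted positives, so sensitivity cannot fall; the cost shows up as more false positives.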


In [5]:
# Evaluate the model (confusion_matrix was imported in the first cell)
# For 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_test['Attrition'], y_test['Attrition_p']).ravel()

Accuracy = (TN + TP)/(TN + TP + FN + FP)
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)

print(f'True Negative : {TN}')
print(f'True Positive : {TP}')
print(f'False Negative: {FN}')
print(f'False Positive: {FP}')

print()

print(f'Accuracy    : {Accuracy:.3f}')
print(f'Sensitivity : {Sensitivity:.3f}')
print(f'Specificity : {Specificity:.3f}')

True Negative : 308
True Positive : 5
False Negative: 54
False Positive: 1

Accuracy    : 0.851
Sensitivity : 0.085
Specificity : 0.997
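
The accuracy of 0.851 is less impressive than it looks. Only 59 of the 368 test employees actually left, so a rule that predicts 'stays' for everyone would already be right about 84% of the time, and the forest finds only 5 of the 59 leavers. The short calculation below uses the confusion-matrix counts printed above to make that comparison; the extra metrics (precision, balanced accuracy, F1) are additions for illustration and were not part of the original solution.

```python
# Confusion-matrix counts reported in the output above
TN, FP, FN, TP = 308, 1, 54, 5
total = TN + FP + FN + TP

accuracy = (TN + TP) / total
baseline_accuracy = (TN + FP) / total   # always predict 'stays' (class 0)
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
balanced_accuracy = (sensitivity + specificity) / 2
f1 = 2 * TP / (2 * TP + FP + FN)

print(f'Accuracy          : {accuracy:.3f}')
print(f'Baseline accuracy : {baseline_accuracy:.3f}')
print(f'Precision         : {precision:.3f}')
print(f'Balanced accuracy : {balanced_accuracy:.3f}')
print(f'F1 score          : {f1:.3f}')
```

The model beats the naive baseline by about one percentage point of accuracy, and the balanced accuracy of roughly 0.54 shows that it separates the two classes only slightly better than chance.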


In [6]:
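# Regress actual attrition on the predicted label to test whether the predictions carry information about the outcome
from statsmodels.api import Logit
logit_model = Logit(y_test['Attrition'], y_test[['Attrition_p']].assign(_const=1)).fit()
print(logit_model.summary())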

Optimization terminated successfully.
         Current function value: 0.421746
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Attrition   No. Observations:                  368
Model:                          Logit   Df Residuals:                      366
Method:                           MLE   Df Model:                            1
Date:                Sat, 30 Nov 2024   Pseudo R-squ.:                 0.04194
Time:                        22:56:03   Log-Likelihood:                -155.20
converged:                       True   LL-Null:                       -162.00
Covariance Type:            nonrobust   LLR p-value:                 0.0002275
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Attrition_p     3.3506      1.105      3.031      0.002       1.184       5.517
_const         -1.7411    
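
Because the only regressor here is the 0/1 prediction, the fitted logit has a closed form: the intercept is the log-odds of attrition among employees predicted to stay, and the slope is the difference between the log-odds among those predicted to leave and that intercept. The check below reproduces the reported coefficients from the confusion-matrix counts; it is a sanity check added for illustration, not part of the original solution.

```python
import math

# Confusion-matrix counts from the evaluation step
TN, FP, FN, TP = 308, 1, 54, 5

# Log-odds of actual attrition within each predicted group
log_odds_pred_stay = math.log(FN / TN)    # predicted 0: 54 left, 308 stayed
log_odds_pred_leave = math.log(TP / FP)   # predicted 1: 5 left, 1 stayed

const = log_odds_pred_stay
coef = log_odds_pred_leave - log_odds_pred_stay

# Standard error of a log odds ratio from a 2x2 table
se_coef = math.sqrt(1 / TN + 1 / FP + 1 / FN + 1 / TP)

print(f'_const      : {const:.4f}')
print(f'Attrition_p : {coef:.4f}')
print(f'std err     : {se_coef:.3f}')
```

The printed values agree with the `coef` and `std err` entries for `Attrition_p` and with the `_const` coefficient in the summary. The slope is significant, but it rests on just six predicted leavers, so it says the few positive predictions are usually right, not that the model finds most leavers.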

In [7]:
# Feature importance (impurity-based, taken from the fitted forest), sorted from most to least important
feature_importance = pd.DataFrame({'feature': list(X_train.columns), 'importance': model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
feature_importance

,feature,importance
9,MonthlyIncome,0.07
0,Age,0.061
16,TotalWorkingYears,0.052
1,DailyRate,0.052
10,MonthlyRate,0.051
43,OverTime_Yes,0.05
5,HourlyRate,0.048
2,DistanceFromHome,0.044
19,YearsAtCompany,0.041
12,PercentSalaryHike,0.034
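
The importances come from `feature_importances_`, which scikit-learn normalizes to sum to 1 across all features. A column that never varies, such as the `_const` column included in `X`, cannot be used for a split and therefore receives an importance of exactly zero. The sketch below illustrates both points on synthetic data; the dataset, feature names, and forest size are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic features x0..x4 plus a constant column, mirroring the `_const` column above
X_arr, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'x{i}' for i in range(5)]).assign(_const=1)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, labelled by feature and sorted
importance = pd.Series(rf.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(importance)
print(f'Sum of importances: {importance.sum():.3f}')
```

Because the values are shares of a fixed total, an importance of 0.07 for `MonthlyIncome` means about 7% of the forest's total impurity reduction, not a 7% effect on attrition.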
