# Supervised Machine Learning Models - Random Forest Classification - Exercise Solution

#### Exercise

In this exercise you will return to the **WA_Fn-UseC_-HR-Employee-Attrition.csv** dataset, which we used to predict employee attrition when learning logistic regression in a previous lecture. As a reminder, the data was obtained from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset and contains the following variables:

| Variable | Definition |
| --- | --- |
| Attrition | 'Yes' if the employee leaves the company, 'No' if the employee stays with the company |
| EmployeeNumber | Unique identifier for each employee |
| Age | Age in years of the employee |
| BusinessTravel | Frequency of business travel: 'Frequently', 'Rarely' or 'Non-Travel' |
| DailyRate | Daily rate of pay for the employee |
| Department | Department the employee belongs to: 'Sales', 'Research & Development' or 'Human Resources' |
| DistanceFromHome | Distance from employee's home to workplace |
| Education | Level of education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor' |
| EducationField | Field of study in which the employee obtained their highest education: 'Life Sciences', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources', 'Engineering', 'Arts', or 'Other' |
| EmployeeCount | Number of employees in the company |
| EnvironmentSatisfaction | Employee's level of satisfaction with their work environment: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| Gender | Employee's gender: 'Male' or 'Female' |
| HourlyRate | Hourly rate of pay for the employee |
| JobInvolvement | Employee's level of job involvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| JobLevel | Employee's job level: 1 'Entry Level', 2 'Intermediate Level', 3 'Managerial Level', 4 'Director Level', 5 'Executive Level' |
| JobRole | Employee's job role: 'Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources' |
| JobSatisfaction | Employee's level of job satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| MaritalStatus | Employee's marital status: 'Single', 'Married' or 'Divorced' |
| MonthlyIncome | Monthly income of the employee |
| MonthlyRate | Monthly rate of pay for the employee |
| NumCompaniesWorked | Number of companies the employee has worked for |
| Over18 | Whether the employee is over 18 years old: 'Y' or 'N' |
| OverTime | Whether the employee works overtime: 'Yes' or 'No' |
| PercentSalaryHike | Percentage increase in salary for the employee |
| PerformanceRating | Employee's performance rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding' |
| RelationshipSatisfaction | Employee's level of satisfaction with their relationships at work: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| StandardHours | Standard number of working hours for the company |
| StockOptionLevel | Employee's level of stock options: 0 'None', 1 'Low', 2 'Medium', 3 'High' |
| TotalWorkingYears | Total number of years the employee has worked |
| TrainingTimesLastYear | Number of times the employee received training last year |
| WorkLifeBalance | Employee's level of work-life balance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best' |
| YearsAtCompany | Number of years the employee has worked at the company |
| YearsInCurrentRole | Number of years the employee has been in their current role |
| YearsSinceLastPromotion | Number of years since the employee's last promotion |
| YearsWithCurrManager | Number of years the employee has been working under their current manager |

Use random forest classification to predict whether an employee will leave the company (the **Attrition** variable), using all available independent variables in the model. Train the model on a random sample of 75% of the observations in the dataset, and test it on the remaining 25%. How well does the model perform on the testing set?

In [None]:
import pandas as pd
import numpy as np
import warnings
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 10000)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:,.3f}'.format

In [None]:
# Read in the data
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

# 'EmployeeCount' equals 1 and 'StandardHours' equals 80 for every row, so neither
# column carries any information. 'EmployeeNumber' is only an identifier and should
# have no relation with Attrition. Remove all three columns.
df = df.drop(columns=['EmployeeCount', 'StandardHours', 'EmployeeNumber'])

# Create an indicator variable equal to 1 if Attrition is 'Yes' and equal to 0 if Attrition is 'No'


# Create dummy variables for categorical columns


# Create list of independent variables
indep_vars = [x for x in df.columns if x not in ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']]
print(indep_vars)

# Create X and y DataFrames


# Split into training and testing sets
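
The preparation steps listed in this cell could be sketched as follows. This is only an illustration on a small made-up DataFrame, not the HR data: the values, the indicator name `AttritionInd`, and the choice of `random_state` are all assumptions. The dummy variables are appended next to the original text columns, which is why the text columns have to be excluded from the list of independent variables. If the indicator is stored under a new name, it has to be excluded as well, otherwise the target leaks into the features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the HR data (values are made up for illustration only)
toy = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'Age': [29, 41, 35, 23, 50, 38, 31, 45],
    'OverTime': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'MaritalStatus': ['Single', 'Married', 'Divorced', 'Single',
                      'Married', 'Married', 'Single', 'Divorced'],
})

# Indicator: 1 if Attrition is 'Yes', 0 if 'No' (the column name is an assumption)
toy['AttritionInd'] = (toy['Attrition'] == 'Yes').astype(int)

# One dummy column per category level, appended next to the original text columns
cat_cols = ['OverTime', 'MaritalStatus']
toy = pd.concat([toy, pd.get_dummies(toy[cat_cols], dtype=int)], axis=1)

# Independent variables: everything except the target, its indicator and the text columns
exclude = ['Attrition', 'AttritionInd'] + cat_cols
toy_vars = [c for c in toy.columns if c not in exclude]

# Create X and y
X = toy[toy_vars]
y = toy['AttritionInd']

# 75% training / 25% testing; random_state is fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```

With eight toy rows, the split leaves six rows for training and two for testing.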


In [None]:
# Estimate the random forest classification model and create predictions in the out-of-sample testing set
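
One way this step might look is sketched below. The data here is synthetic (generated with `make_classification`) and merely stands in for the prepared training and testing sets; the number of trees and the `random_state` values are assumptions, not settings taken from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared X and y (not the HR data)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the forest on the training set only
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Out-of-sample class predictions and predicted probabilities of class 1
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
```

`predict` returns the predicted class for each testing observation, while `predict_proba` returns the share of trees (averaged class probabilities) voting for each class, which is useful if a threshold other than 0.5 is of interest.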



In [None]:
# Evaluate the model
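
A possible way to evaluate the predictions is to start from the confusion matrix and derive a few summary measures from it. The labels below are hand-made for illustration (1 = left the company, 0 = stayed) and are not results from the HR data; the names `y_true` and `y_hat` are hypothetical.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-made actual and predicted labels, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Share of all predictions that are correct: (tn + tp) / total
accuracy = accuracy_score(y_true, y_hat)

# Of the employees predicted to leave, the share who actually left: tp / (tp + fp)
precision = precision_score(y_true, y_hat)

# Of the employees who actually left, the share the model caught: tp / (tp + fn)
recall = recall_score(y_true, y_hat)

print(f'Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}')
```

In this toy example the accuracy is 0.700 even though only one of the three leavers is identified (recall of 0.333). When leavers are a minority of the observations, accuracy alone can overstate how useful the model is, so it is worth reading precision and recall alongside it.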


In [None]:
# Feature importance
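
A fitted `RandomForestClassifier` exposes impurity-based importances through its `feature_importances_` attribute. The sketch below shows one way to pair them with the column names and sort them; it uses synthetic data and placeholder column names (`feature_0`, `feature_1`, ...), so the ranking says nothing about the HR variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (not the HR data)
X_arr, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest. They indicate which variables the model relies on, not the direction of their effect on attrition.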
