#Live Code 5 - Set 2
Name: Rahardiansyah Fatoni

Batch: RMT-027

Objective:
- Mampu memahami konsep ensemble learning dengan Decision Tree dan Random Forest.

- Mampu mempersiapkan data untuk digunakan dalam model Decision Tree dan Random Forest.

- Mampu mengimplementasikan Decision Tree dan Random Forest untuk membuat prediksi.



#Import Libraries

In [48]:
import pandas as pd
import numpy as np

#Preprocesssing
from sklearn.model_selection import train_test_split
from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

#ML Model
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

#Evaluasi
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, f1_score, accuracy_score, precision_score, recall_score

#Data Loading

In [49]:
df = pd.read_csv("https://raw.githubusercontent.com/rahardianfatoni/LC/main/employee-attrition.csv")

In [50]:
#A quick look at the dataframe using Transpose of df.head() to see all the columns, and a few values.
df.head().T

Unnamed: 0,0,1,2,3,4
Age,28,37,38,55,31
Attrition,No,Yes,No,Yes,No
BusinessTravel,Travel_Rarely,Travel_Rarely,Travel_Rarely,Travel_Rarely,Travel_Rarely
Department,Research & Development,Research & Development,Sales,Research & Development,Sales
DistanceFromHome,3,11,2,2,5
Education,3,2,2,3,4
EducationField,Medical,Medical,Marketing,Medical,Life Sciences
EmployeeID,1121,1033,1125,787,1673
Gender,Female,Female,Male,Male,Female
JobRole,Manufacturing Director,Healthcare Representative,Sales Executive,Manager,Sales Executive


In [51]:
#To see the information about columns, datatype, and null values.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age                      1470 non-null   int64 
 1   Attrition                1470 non-null   object
 2   BusinessTravel           1470 non-null   object
 3   Department               1470 non-null   object
 4   DistanceFromHome         1470 non-null   int64 
 5   Education                1470 non-null   int64 
 6   EducationField           1470 non-null   object
 7   EmployeeID               1470 non-null   int64 
 8   Gender                   1470 non-null   object
 9   JobRole                  1470 non-null   object
 10  JobSatisfaction          1470 non-null   object
 11  MaritalStatus            1470 non-null   object
 12  MonthlyIncome            1470 non-null   int64 
 13  NumCompaniesWorked       1470 non-null   int64 
 14  PercentSalaryHike        1470 non-null  

$Insight:$
- The data consists of 1470 entries, with no detectable null values.
- From here we can already conclude the feature that can be used as target which is `PerformanceRating`.

In [52]:
df['PerformanceRating'].value_counts()

3    1244
4     226
Name: PerformanceRating, dtype: int64

$Insight:$
- But now there's a problem, based on the `PerformanceRating` column, there is only 2 values, `3` and `4`. If we reference the dataset description there should be:
- `1` : Low
- `2` : Good
- `3` : Excellent
- `4` : Outstanding

Therefore we need to treat `PerformanceRating` as a numerical entries instead.

#Data Filtering (Split Feature Types)

##Numerical Columns:

In [53]:
num_columns = df.select_dtypes(include='number').columns.tolist()

In [54]:
num_columns.remove('PerformanceRating')

In [55]:
num_columns

['Age',
 'DistanceFromHome',
 'Education',
 'EmployeeID',
 'MonthlyIncome',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'TotalWorkingYears',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrentManager']

$Insight:$
- We can conclude based on the dataset's description that `Education` have ordinal values, therefore we shall treat it as such.

In [56]:
num_columns.remove('Education')

##Categorical Columns:

In [57]:
cat_columns = df.select_dtypes(include='object').columns.tolist()

In [58]:
cat_columns

['Attrition',
 'BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'WorkLifeBalance']

We should check the value_counts of each categorical column to check their cardinality.

In [59]:
for col in cat_columns:
  print(df[col].value_counts())
  print(" ")

No     1233
Yes     237
Name: Attrition, dtype: int64
 
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: BusinessTravel, dtype: int64
 
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64
 
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: EducationField, dtype: int64
 
Male      882
Female    588
Name: Gender, dtype: int64
 
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: JobRole, dtype: int64
 
Very High    459
High         442
Low          289
Medium       280
Name: JobSatisfaction, dtype: int64
 
Married     673
Single      470
Div

$Insight:$
- It seems only `EducationField` and `JobRole` have more than 5 categories, this means it's ok to continue for now.

##Ordinal Columns:

In [60]:
ord_columns = ['Education']

In [61]:
ord_columns

['Education']

#Splitting X (Features) to Y (Target)

In [62]:
X = df[cat_columns + ord_columns + num_columns]
y = df['PerformanceRating']

#Splitting Train-set and Test-set

In [63]:
# Split dataset into train-Set and test-Set

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print('Train size : ', X_train.shape)
print('Test size  : ', X_test.shape)

Train size :  (1102, 21)
Test size  :  (368, 21)


#Feature Selection

Now, we shall run a few tests to see the correlation between the features and the target variable, a good start would be by using techniques such as:
- `Kendall's` for the categorical and ordinal features
- `Pearson's` for the numerical features

We used the mentioned techniques since the target (`PerformanceRating`) is numerical.

##Categorical Features:

In [64]:
#We will use a for loop to get the Kendall Tau values of each categorical features.
for i in cat_columns:
    corr_tau, pval_k = stats.kendalltau(X_train[i], y_train)

    if pval_k > 0.05:
        print("")
        print(f"- No significant correlation between 'PerformanceRating' and {i}")
    else:
        print("")
        print(f"- Significant correlation between 'PerformanceRating' and {i}:")
        print(f"  Kendall correlation: {corr_tau:.2f}, p-value: {pval_k}")


- No significant correlation between 'PerformanceRating' and Attrition

- No significant correlation between 'PerformanceRating' and BusinessTravel

- No significant correlation between 'PerformanceRating' and Department

- No significant correlation between 'PerformanceRating' and EducationField

- No significant correlation between 'PerformanceRating' and Gender

- No significant correlation between 'PerformanceRating' and JobRole

- Significant correlation between 'PerformanceRating' and JobSatisfaction:
  Kendall correlation: 0.06, p-value: 0.04506927371956192

- No significant correlation between 'PerformanceRating' and MaritalStatus

- No significant correlation between 'PerformanceRating' and WorkLifeBalance


##Ordinal Features:

In [65]:
#We will use a for loop to get the Kendall Tau values of each categorical features.
for i in ord_columns:
    corr_tau, pval_k = stats.kendalltau(X_train[i], y_train)

    if pval_k > 0.05:
        print("")
        print(f"- No significant correlation between 'PerformanceRating' and {i}")
    else:
        print("")
        print(f"- Significant correlation between 'PerformanceRating' and {i}:")
        print(f"  Kendall correlation: {corr_tau:.2f}, p-value: {pval_k}")


- No significant correlation between 'PerformanceRating' and Education


##Numerical Features:

In [66]:
for i in num_columns:
    p, pval_r = stats.pearsonr(X_train[i], y_train)

    if pval_r > 0.05:
      print("")
      print(f"- No significant correlation between 'PerformanceRating' and {i}")
    else:
      print("")
      print(f"- Significant correlation between 'PerformanceRating' and {i}:")
      print(f"  Pearson correlation: {p:.2f}, p-value: {pval_r}")


- No significant correlation between 'PerformanceRating' and Age

- No significant correlation between 'PerformanceRating' and DistanceFromHome

- No significant correlation between 'PerformanceRating' and EmployeeID

- No significant correlation between 'PerformanceRating' and MonthlyIncome

- No significant correlation between 'PerformanceRating' and NumCompaniesWorked

- Significant correlation between 'PerformanceRating' and PercentSalaryHike:
  Pearson correlation: 0.78, p-value: 1.0228083687311068e-222

- No significant correlation between 'PerformanceRating' and TotalWorkingYears

- No significant correlation between 'PerformanceRating' and YearsAtCompany

- No significant correlation between 'PerformanceRating' and YearsInCurrentRole

- No significant correlation between 'PerformanceRating' and YearsSinceLastPromotion

- No significant correlation between 'PerformanceRating' and YearsWithCurrentManager


##$Insight:$
- Based on the correlation tests, it seems only `JobSatisfaction` and `PercentSalaryHike` that have a significant correlation to `PerformanceRating`. This seems like a fair correlation, since it can be assumed that an employee's willingness to work more efficiently would be increased or decreased based on the amount of satisfaction and increase in salary.
- We will use only those 2 features, making this a fairly small feature model but nonetheless better than using irrelevant features and overfitting the model.

In [67]:
features = ['JobSatisfaction', 'PercentSalaryHike']

In [68]:
num_columns = ['PercentSalaryHike']
cat_columns = ['JobSatisfaction']

In [69]:
X_train

Unnamed: 0,Attrition,BusinessTravel,Department,EducationField,Gender,JobRole,JobSatisfaction,MaritalStatus,WorkLifeBalance,Education,...,DistanceFromHome,EmployeeID,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,TotalWorkingYears,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrentManager
944,No,Travel_Rarely,Human Resources,Medical,Male,Manager,Very High,Single,Better,2,...,6,1550,16437,1,21,21,21,7,7,7
1402,No,Travel_Rarely,Research & Development,Other,Female,Manufacturing Director,Very High,Divorced,Better,2,...,2,1635,5770,1,19,10,10,7,3,9
1054,No,Travel_Rarely,Sales,Medical,Male,Sales Executive,Low,Married,Bad,4,...,18,1945,5561,0,16,6,5,3,0,4
1128,Yes,Travel_Frequently,Sales,Technical Degree,Female,Sales Executive,Medium,Single,Bad,3,...,13,1487,5765,5,11,7,5,3,0,0
1323,No,Travel_Rarely,Research & Development,Life Sciences,Female,Research Scientist,Very High,Married,Better,4,...,1,2052,2977,1,12,4,4,3,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,No,Travel_Frequently,Research & Development,Life Sciences,Female,Research Scientist,High,Divorced,Better,2,...,29,547,4556,2,11,19,5,4,0,2
835,No,Travel_Rarely,Research & Development,Life Sciences,Male,Research Scientist,Very High,Married,Best,3,...,1,1799,2044,1,11,5,5,3,0,3
1216,No,Travel_Rarely,Research & Development,Medical,Male,Manager,High,Married,Best,2,...,4,1256,18711,2,13,23,1,0,0,0
559,No,Travel_Frequently,Research & Development,Medical,Male,Research Scientist,Medium,Divorced,Better,4,...,9,964,3617,8,14,3,1,1,0,0


In [70]:
X_train = X_train[features]

In [71]:
X_test = X_test[features]

#Feature Preprocessing

In [72]:
X_train['JobSatisfaction'].values

array(['Very High', 'Very High', 'Low', ..., 'High', 'Medium', 'High'],
      dtype=object)

In [73]:
X_train['JobSatisfaction'] = X_train['JobSatisfaction'].replace({'Low': 1, 'Medium':2, 'High':3, 'Very High':4})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['JobSatisfaction'] = X_train['JobSatisfaction'].replace({'Low': 1, 'Medium':2, 'High':3, 'Very High':4})


In [74]:
X_test['JobSatisfaction'] = X_test['JobSatisfaction'].replace({'Low': 1, 'Medium':2, 'High':3, 'Very High':4})

##MinMaxScaler:

In [75]:
minmax_pipeline = make_pipeline(SimpleImputer(strategy='median'),
                             MinMaxScaler())
ordinal_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                 OrdinalEncoder(categories=[[1, 2, 3, 4]]))
minmaxscaler_pipeline = ColumnTransformer([
    ('minmax_pipe', minmax_pipeline, num_columns),
    ('ordinal_pipe', ordinal_pipeline, cat_columns)
],
remainder="passthrough",
)

In [76]:
minmaxscaler_pipeline

In [77]:
X_train_minmax= minmaxscaler_pipeline.fit_transform(X_train)
X_test_minmax= minmaxscaler_pipeline.transform(X_test)

##StandardScaler:

In [108]:
standard_pipeline = make_pipeline(SimpleImputer(strategy='median'),
                             StandardScaler())
ordinal_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                 OrdinalEncoder(categories=[[1, 2, 3, 4]]))
StandardScaler_pipeline = ColumnTransformer([
    ('standard_pipe', standard_pipeline, num_columns),
    ('ordinal_pipe', ordinal_pipeline, cat_columns)
],
remainder="passthrough",
)

In [110]:
StandardScaler_pipeline

In [111]:
X_train_standard= StandardScaler_pipeline.fit_transform(X_train)
X_test_standard= StandardScaler_pipeline.transform(X_test)

##RobustScaler:

In [81]:
Robust_pipeline = make_pipeline(SimpleImputer(strategy='median'),
                             RobustScaler())
Robustscaler_pipeline = ColumnTransformer([
    ('Robust_pipe', Robust_pipeline, num_columns),
],
remainder="passthrough",
)

In [82]:
Robustscaler_pipeline

In [83]:
X_train_Robust= Robustscaler_pipeline.fit_transform(X_train)
X_test_Robust= Robustscaler_pipeline.transform(X_test)

#Model Training

##Unscaled Features:

In [84]:
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

In [85]:
y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)

In [86]:
import numpy as np
from sklearn import metrics

print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_train, y_pred_train))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_train, y_pred_train))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
mape = np.mean(np.abs((y_train - y_pred_train) / np.abs(y_train)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.0
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0
Mean Absolute Percentage Error (MAPE): 0.0
Accuracy: 100.0


In [87]:
#Evaluating test set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred_test))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred_test))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
mape = np.mean(np.abs((y_test - y_pred_test) / np.abs(y_test)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.0
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0
Mean Absolute Percentage Error (MAPE): 0.0
Accuracy: 100.0


$Insight:$
- It seems we get a perfect score in the train set and test set of unscaled features, this is invalid.

##MinMax-scaled Features:

In [88]:
dt_mm = DecisionTreeRegressor()
dt_mm.fit(X_train_minmax, y_train)

In [89]:
y_pred_train = dt.predict(X_train_minmax)
y_pred_test = dt.predict(X_test_minmax)



In [90]:
#Evaluating train-set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_train, y_pred_train))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_train, y_pred_train))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
mape = np.mean(np.abs((y_train - y_pred_train) / np.abs(y_train)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.1515426497277677
Mean Squared Error (MSE): 0.1515426497277677
Root Mean Squared Error (RMSE): 0.38928479257192633
Mean Absolute Percentage Error (MAPE): 3.79
Accuracy: 96.21


In [91]:
#Evaluating test set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred_test))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred_test))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
mape = np.mean(np.abs((y_test - y_pred_test) / np.abs(y_test)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.16032608695652173
Mean Squared Error (MSE): 0.16032608695652173
Root Mean Squared Error (RMSE): 0.40040740122595353
Mean Absolute Percentage Error (MAPE): 4.01
Accuracy: 95.99


$Insight:$
- It seems we get a very good accuracy of 96 and 95 in the minmax scaled features.

##Standard-scaled Features:

In [112]:
X_train_standard

array([[ 1.58426789,  3.        ],
       [ 1.03979161,  3.        ],
       [ 0.22307717,  0.        ],
       ...,
       [-0.59363726,  2.        ],
       [-0.32139912,  1.        ],
       [-1.13811355,  2.        ]])

In [113]:
dt_standard = DecisionTreeRegressor()
dt_standard.fit(X_train_standard, y_train)

In [114]:
y_pred_train = dt.predict(X_train_standard)
y_pred_test = dt.predict(X_test_standard)



In [115]:
# Check performance model

#Evaluating train-set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_train, y_pred_train))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_train, y_pred_train))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
mape = np.mean(np.abs((y_train - y_pred_train) / np.abs(y_train)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.1515426497277677
Mean Squared Error (MSE): 0.1515426497277677
Root Mean Squared Error (RMSE): 0.38928479257192633
Mean Absolute Percentage Error (MAPE): 3.79
Accuracy: 96.21


##Robust-scaled Features:

#Preprocessing + RF

In [98]:
num_pipeline = make_pipeline(SimpleImputer(strategy='median'),
                             MinMaxScaler())
ordinal_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                 OrdinalEncoder(categories=[[1, 2, 3, 4]]))

preprocessing_pipeline = ColumnTransformer([
    ('pipe_num', num_pipeline, num_columns),
    ('ordinal_pipe', ordinal_pipeline, cat_columns)
],
remainder="passthrough",
)

clf_rf = make_pipeline(preprocessing_pipeline, RandomForestRegressor())
clf_rf.fit(X_train, np.ravel(y_train))

In [99]:
# Check performance model

y_pred_train = clf_rf.predict(X_train)
y_pred_test = clf_rf.predict(X_test)

#Model evaluation
#Evaluating train-set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_train, y_pred_train))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_train, y_pred_train))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
mape = np.mean(np.abs((y_train - y_pred_train) / np.abs(y_train)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.0
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0
Mean Absolute Percentage Error (MAPE): 0.0
Accuracy: 100.0


In [100]:
#Evaluating test set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred_test))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred_test))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
mape = np.mean(np.abs((y_test - y_pred_test) / np.abs(y_test)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2))
print('Accuracy:', round(100*(1 - mape), 2))

Mean Absolute Error (MAE): 0.0
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0
Mean Absolute Percentage Error (MAPE): 0.0
Accuracy: 100.0
