## Model Training

In [1]:
#importing necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier
from catboost import CatBoostClassifier 
import warnings
warnings.filterwarnings('ignore')

### Reading the processed data

In [2]:
data_path = 'D:/iabac/data/processed.xls' #data path
df = pd.read_excel(data_path)

In [3]:
df

Unnamed: 0,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,32,1,2,2,5,13,2,10,3,4,...,4,10,2,2,10,7,0,8,0,3
1,47,1,2,2,5,13,2,14,4,4,...,4,20,2,3,7,7,1,7,0,3
2,40,1,1,1,5,13,1,5,4,4,...,3,20,2,3,18,13,1,12,0,4
3,41,1,0,0,3,8,2,10,4,2,...,2,23,2,2,21,6,12,6,0,3
4,60,1,2,2,5,13,2,16,4,1,...,4,10,1,3,2,2,2,2,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,27,0,3,0,5,13,1,3,1,4,...,2,6,3,3,6,5,0,4,0,4
1196,37,1,1,2,1,15,2,10,2,4,...,1,4,2,3,1,0,0,0,0,3
1197,50,1,3,1,1,15,2,28,1,4,...,3,20,3,3,20,8,3,8,0,3
1198,34,0,3,2,0,1,2,9,3,4,...,2,9,3,4,8,7,7,7,0,3


The dataset has around 27 features. Training on all of them lowered performance, which is consistent with the curse of dimensionality, so the feature set was reduced.

The features were reduced using a Gradient Boosting classifier to identify those that most affect the target (done during the visualization step).
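The selection step itself is not shown in this notebook. As a rough sketch of how importance-based selection with a gradient boosting model can work, the example below uses synthetic data from `make_classification`. The feature names, the data, and the cutoff `top_k` are all assumptions for illustration, not the values used in this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical): 10 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

gb = GradientBoostingClassifier(random_state=0)
gb.fit(X, y)

# feature_importances_ is normalised to sum to 1; larger values mean the
# feature contributed more to the splits in the boosted trees.
importances = pd.Series(gb.feature_importances_, index=feature_names)

top_k = 5  # hypothetical cutoff; this notebook keeps 16 features
top_features = importances.sort_values(ascending=False).head(top_k).index.tolist()
X_reduced = X[top_features]
print(top_features)
```

The ranking depends on the model and its random seed, so it is worth checking that the selected set is stable before relying on it.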

In [4]:
#top 16 features affecting the target
predictor1 = df.loc[:,['EmpEnvironmentSatisfaction', 'EmpLastSalaryHikePercent','YearsSinceLastPromotion','EmpJobRole','ExperienceYearsInCurrentRole',
                      'EmpHourlyRate','EmpWorkLifeBalance','DistanceFromHome','BusinessTravelFrequency','ExperienceYearsAtThisCompany','YearsWithCurrManager','OverTime',
                      'NumCompaniesWorked','EmpJobInvolvement','TotalWorkExperienceInYears','TrainingTimesLastYear']]


In [5]:
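#splitting the data into predictors and target (PerformanceRating is the last column)
#note: the models below are trained on predictor1 (the selected features), not on predictor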
predictor = df.iloc[:,:-1]
target = df.iloc[:,-1]

## Scaling and SMOTE

The target classes are heavily imbalanced, which biases the model toward the majority class. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority classes to reduce this bias.
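To make the idea of synthetic samples concrete, the sketch below shows the core interpolation step: a new point is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. This is a simplified NumPy illustration, not imblearn's implementation; the data and the helper `synthetic_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def synthetic_sample(samples, index, k, rng):
    """Create one synthetic point from samples[index] and one of its k nearest neighbours."""
    base = samples[index]
    distances = np.linalg.norm(samples - base, axis=1)
    # The closest point is the base itself (distance 0, no duplicates here), so skip it.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    # Interpolate between the base point and the chosen neighbour.
    return base + gap * (neighbour - base)

new_point = synthetic_sample(minority, index=0, k=2, rng=rng)
print(new_point)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies between them in feature space.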

Most of the features are on different scales, which affects linear models such as logistic regression, so StandardScaler is used to standardize each feature to zero mean and unit variance.
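A small worked example may help show what standardization does. The numbers below are made up for illustration (two columns on different scales). The scaler learns the mean and standard deviation from the training rows only and reuses them for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on different scales (for example, age and hourly rate).
train = np.array([[25.0, 40.0], [35.0, 60.0], [45.0, 80.0]])
test = np.array([[30.0, 100.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.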

In [6]:
scale = StandardScaler()
smote = SMOTE()

In [7]:
#train test split for training the model and for predictions
x_train, x_test, y_train, y_test = train_test_split(predictor1, target, stratify = target, test_size =0.25, random_state = 41)

In [8]:
#applying SMOTE to the training split only, so the test set keeps its original class distribution
x_train, y_train = smote.fit_resample(x_train, y_train)

In [9]:
y_train.value_counts()

4    656
3    656
2    656
Name: PerformanceRating, dtype: int64

In [10]:
#fitting the scaler on the training data, then applying the same transform to the test data
sc_x_train = scale.fit_transform(x_train) 
sc_x_test = scale.transform(x_test)

## Models

Since this is a classification problem, models such as logistic regression, random forest classifier, XGBoost classifier, and CatBoost classifier are used.

A random forest (strong learner) is a collection of decision trees (weak learners). It is an ensemble technique: the results from all the individual trees are combined for the final classification, which typically performs better than a single decision tree.
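The way a forest combines its trees can be checked directly. In scikit-learn's `RandomForestClassifier`, the predicted class probabilities are the average of the per-tree probabilities. The sketch below reproduces that average by hand on synthetic data; the data and the small `n_estimators` value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data (hypothetical stand-in for the real dataset).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

# Collect the class probabilities predicted by each individual tree.
per_tree_proba = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])

# Averaging over the trees gives the forest's own probability estimate.
avg_proba = per_tree_proba.mean(axis=0)
print(np.allclose(avg_proba, forest.predict_proba(X_test)))
```

Averaging many trees, each trained on a different bootstrap sample, reduces the variance that a single deep tree would show.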

More advanced gradient boosting algorithms, XGBoost (extreme gradient boosting) and CatBoost (which often works well on classification tasks with its default settings), are also used for the predictions.



In [19]:
#initialising the different models
log_model = LogisticRegression()
rf_model = RandomForestClassifier(max_depth=15, n_estimators = 150, random_state = 42)
xg_model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
cb_model = CatBoostClassifier()
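One caveat, stated as an assumption about library versions rather than something observed in this notebook: the target here takes the values 2, 3 and 4, and some XGBoost releases expect class labels numbered from 0. If `xg_model.fit` rejects the labels, a `LabelEncoder` can map them to 0, 1, 2 and back. The arrays below are hypothetical examples.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values using the same rating labels as this dataset.
y_example = np.array([3, 2, 4, 3, 3, 2])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_example)  # maps 2 -> 0, 3 -> 1, 4 -> 2

# After predicting in the encoded space, map predictions back to the ratings.
encoded_predictions = np.array([1, 0, 2])  # hypothetical model output
original_predictions = encoder.inverse_transform(encoded_predictions)

print(y_encoded)
print(original_predictions)
```

The encoder should be fitted on the training labels and reused for the test labels so that both share the same mapping.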

In [20]:
#fitting each model; only logistic regression uses the scaled features, since tree-based models are insensitive to feature scale
log_model.fit(sc_x_train, y_train)
rf_model.fit(x_train, y_train)
xg_model.fit(x_train, y_train)
cb_model.fit(x_train, y_train)

Learning rate set to 0.081907
0:	learn: 0.9961343	total: 2.28ms	remaining: 2.28s
1:	learn: 0.9141009	total: 4.22ms	remaining: 2.11s
2:	learn: 0.8392523	total: 6.13ms	remaining: 2.04s
3:	learn: 0.7759224	total: 7.98ms	remaining: 1.99s
4:	learn: 0.7277672	total: 9.92ms	remaining: 1.97s
5:	learn: 0.6813113	total: 11.1ms	remaining: 1.83s
6:	learn: 0.6375855	total: 13ms	remaining: 1.84s
7:	learn: 0.6014400	total: 15.2ms	remaining: 1.88s
8:	learn: 0.5685875	total: 17.3ms	remaining: 1.9s
9:	learn: 0.5389676	total: 19.3ms	remaining: 1.91s
10:	learn: 0.5158306	total: 21.4ms	remaining: 1.92s
11:	learn: 0.4939998	total: 23.7ms	remaining: 1.95s
12:	learn: 0.4704560	total: 25.9ms	remaining: 1.97s
13:	learn: 0.4527177	total: 28ms	remaining: 1.97s
14:	learn: 0.4351329	total: 29.9ms	remaining: 1.96s
15:	learn: 0.4187208	total: 31.8ms	remaining: 1.96s
16:	learn: 0.4035303	total: 33.4ms	remaining: 1.93s
17:	learn: 0.3892921	total: 34.9ms	remaining: 1.9s
18:	learn: 0.3759682	total: 36.4ms	remaining: 1.88

<catboost.core.CatBoostClassifier at 0x18d6e677af0>