## Model Creation and Evaluation

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

import warnings # Used to suppress warnings
warnings.filterwarnings('ignore')

### Loading pre-processed data

In [2]:
df = pd.read_csv('employee_performance_analysis_preprocessed_data.csv')
df.head()

,Unnamed: 0,EmpDepartment,EmpJobRole,EmpEnvironmentSatisfaction,EmpLastSalaryHikePercent,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,log_YearsSinceLastPromotion,YearsWithCurrManager,PerformanceRating
0,0,5,18,4,-0.889111,2,0.868276,0.864098,-1.012816,1.202103,3
1,1,5,18,4,-0.889111,3,0.200371,0.864098,0.050319,0.902825,3
2,2,5,18,4,1.594054,3,2.649355,2.661702,0.050319,2.399219,4
3,3,2,11,2,-0.061389,2,-0.244898,0.564498,0.050319,0.603546,3
4,4,5,18,1,-0.337297,3,-0.912803,-0.633905,0.672213,-0.59357,3


In [3]:
df.drop('Unnamed: 0',axis=1,inplace=True) # Dropping the unwanted index column
df.head()

,EmpDepartment,EmpJobRole,EmpEnvironmentSatisfaction,EmpLastSalaryHikePercent,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,log_YearsSinceLastPromotion,YearsWithCurrManager,PerformanceRating
0,5,18,4,-0.889111,2,0.868276,0.864098,-1.012816,1.202103,3
1,5,18,4,-0.889111,3,0.200371,0.864098,0.050319,0.902825,3
2,5,18,4,1.594054,3,2.649355,2.661702,0.050319,2.399219,4
3,2,11,2,-0.061389,2,-0.244898,0.564498,0.050319,0.603546,3
4,5,18,1,-0.337297,3,-0.912803,-0.633905,0.672213,-0.59357,3


## Defining Independent and Dependent Features

In [4]:
X = df.drop(columns=['PerformanceRating'])
y = df.PerformanceRating

In [5]:
X

,EmpDepartment,EmpJobRole,EmpEnvironmentSatisfaction,EmpLastSalaryHikePercent,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,log_YearsSinceLastPromotion,YearsWithCurrManager
0,5,18,4,-0.889111,2,0.868276,0.864098,-1.012816,1.202103
1,5,18,4,-0.889111,3,0.200371,0.864098,0.050319,0.902825
2,5,18,4,1.594054,3,2.649355,2.661702,0.050319,2.399219
3,2,11,2,-0.061389,2,-0.244898,0.564498,0.050319,0.603546
4,5,18,1,-0.337297,3,-0.912803,-0.633905,0.672213,-0.593570
...,...,...,...,...,...,...,...,...,...
1195,5,18,4,1.318147,3,-0.022263,0.264897,-1.012816,0.004988
1196,4,12,4,0.490425,3,-1.135438,-1.233106,-1.012816,-1.192127
1197,4,12,4,-1.165018,3,3.094625,1.163699,1.113454,1.202103
1198,0,5,4,-0.337297,4,0.423006,0.864098,2.176589,0.902825


In [6]:
y

0       3
1       3
2       4
3       3
4       3
       ..
1195    4
1196    3
1197    3
1198    3
1199    2
Name: PerformanceRating, Length: 1200, dtype: int64

### BALANCING THE TARGET FEATURE
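**SMOTE** (Synthetic Minority Oversampling Technique) is an oversampling technique used to address class imbalance in a dataset. Class imbalance occurs when one class has far more instances than the others; here, rating 3 has 874 samples against 194 and 132 for ratings 2 and 4. Models trained on such data tend to be biased towards the majority class and perform poorly on the minority classes. SMOTE counters this by generating synthetic minority-class samples, interpolating between existing minority samples and their nearest neighbours.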

In [7]:
from collections import Counter
from imblearn.over_sampling import SMOTE # SMOTE (Synthetic Minority Oversampling Technique)
sm = SMOTE() # object creation
print("unbalanced data   :  ",Counter(y))
X_sm,y_sm = sm.fit_resample(X,y)
print("balanced data:    :",Counter(y_sm))

unbalanced data   :   Counter({3: 874, 2: 194, 4: 132})
balanced data:    : Counter({3: 874, 4: 874, 2: 874})


* Target feature is balanced now: each class has 874 samples
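
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the helper name `smote_like_sample` and the toy points are made up for the example.

```python
import numpy as np

def smote_like_sample(minority, k=5, rng=None):
    """Create one synthetic point by interpolating towards a near neighbour."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))                  # pick a random minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]          # skip position 0 (the sample itself)
    j = rng.choice(neighbours)                       # pick one of its k nearest neighbours
    gap = rng.random()                               # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Toy minority class: six points in two dimensions (made-up values).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, k=3, rng=rng) for _ in range(4)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new samples stay inside the region the minority class already occupies.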

### SPLIT TRAINING AND TESTING DATA

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_sm,y_sm,random_state=42,test_size=0.2) # 20% data given to testing

In [9]:
# Check shape of train and test
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((2097, 9), (525, 9), (2097,), (525,))
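
Because the split above is made after oversampling, synthetic samples can end up in the test set. A common alternative is to split first and oversample only the training part. The sketch below shows that order on made-up data; it uses random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE, and all variable names are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data with three classes (made-up values).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 3))
y_toy = np.array([3] * 88 + [2] * 19 + [4] * 13)

# 1) Split first, stratifying so the test set keeps the original class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# 2) Oversample the training part only (random oversampling, not SMOTE).
target = max(Counter(y_tr).values())
X_parts, y_parts = [], []
for label in np.unique(y_tr):
    X_c = X_tr[y_tr == label]
    X_parts.append(resample(X_c, replace=True, n_samples=target, random_state=42))
    y_parts.append(np.full(target, label))
X_bal, y_bal = np.vstack(X_parts), np.concatenate(y_parts)

print("train after oversampling:", Counter(y_bal))
print("test (untouched):        ", Counter(y_te))
```

With this order the test set contains only original samples in their original proportions, so the test scores describe performance on data the model could actually encounter.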

### MODEL CREATION, PREDICTION AND EVALUATION


#### We will use the following algorithms
* Logistic Regression
* K-Nearest Neighbor
* Support Vector Machine
* Decision Tree
* Random Forest
* XGBoost Classifier
* Artificial Neural Network [MLP Classifier]

## 1. Logistic Regression

In [10]:
# importing library
from sklearn.linear_model import LogisticRegression
# Object creation
LR = LogisticRegression()
# Fitting on the training data
LR.fit(X_train,y_train)
# Prediction on train data
LR_train_predict = LR.predict(X_train)
# Prediction on test data
LR_test_predict = LR.predict(X_test)

### Training accuracy

In [12]:
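# import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
# Note: scikit-learn metrics expect (y_true, y_pred). Accuracy is symmetric, but with
# predictions passed first, as here and in the later cells, the report's precision and
# recall columns are swapped and "support" counts predictions per class.
LR_train_accuracy = accuracy_score(LR_train_predict,y_train)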
print("Training accuracy of Logistic Regression model", LR_train_accuracy)
print("Logistic Regression Classification report: \n", classification_report(LR_train_predict,y_train))

Training accuracy of Logistic Regression model 0.8183118741058655
Logistic Regression Classification report: 
               precision    recall  f1-score   support

           2       0.86      0.84      0.85       712
           3       0.71      0.79      0.75       629
           4       0.89      0.83      0.85       756

    accuracy                           0.82      2097
   macro avg       0.82      0.82      0.82      2097
weighted avg       0.82      0.82      0.82      2097
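
scikit-learn's metric functions take the true labels first and the predictions second. The small sketch below, on made-up labels, shows that swapping the two arguments leaves accuracy unchanged but turns per-class precision into recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true_demo = [2, 2, 3, 3, 3, 4, 4, 4]   # made-up ground truth
y_pred_demo = [2, 3, 3, 3, 4, 4, 4, 2]   # made-up predictions

# Accuracy only counts matches, so argument order does not matter.
acc_ok = accuracy_score(y_true_demo, y_pred_demo)
acc_swapped = accuracy_score(y_pred_demo, y_true_demo)

# Precision with the conventional order equals recall with the swapped order.
prec_ok = precision_score(y_true_demo, y_pred_demo, average=None, labels=[2, 3, 4])
rec_swapped = recall_score(y_pred_demo, y_true_demo, average=None, labels=[2, 3, 4])

print(acc_ok == acc_swapped, np.allclose(prec_ok, rec_swapped))
```

When reading the reports in this notebook, the column labelled precision therefore corresponds to recall in the conventional sense, and vice versa.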



### Testing accuracy

In [13]:
LR_test_accuracy = accuracy_score(LR_test_predict,y_test)
print("Testing accuracy of Logistic Regression model",LR_test_accuracy)
print("Logistic Regression Classification report: \n",classification_report(LR_test_predict,y_test))

Testing accuracy of Logistic Regression model 0.7961904761904762
Logistic Regression Classification report: 
               precision    recall  f1-score   support

           2       0.85      0.81      0.83       193
           3       0.65      0.82      0.73       137
           4       0.89      0.76      0.82       195

    accuracy                           0.80       525
   macro avg       0.80      0.80      0.79       525
weighted avg       0.81      0.80      0.80       525



### Hyperparameter tuning with grid-search CV

In [14]:
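from sklearn.model_selection import GridSearchCV
# Note: the 'l1' penalty is only supported by the 'liblinear' and 'saga' solvers;
# the other 'l1' combinations in this grid fail to fit and do not contribute a score.
parameters_lr = [{'penalty':['l1','l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 
                  'solver': ['lbfgs','liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']}]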

grid_search_lr = GridSearchCV(estimator = LR,
                           param_grid = parameters_lr,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)

grid_search_lr.fit(X_train, y_train)
best_accuracy_lr = grid_search_lr.best_score_
best_parameter_lr = grid_search_lr.best_params_
print("Best Accuracy of LR: {:.2f} %".format(best_accuracy_lr*100))
print("Best Parameter of LR:", best_parameter_lr)

Best Accuracy of LR: 81.73 %
Best Parameter of LR: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
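
Not every solver supports every penalty, so a grid can be written as a list of dictionaries that only pairs compatible values. The sketch below is one possible layout, not the grid used above; it leaves out `newton-cholesky`, which is only available in newer scikit-learn versions, and uses `ParameterGrid` just to count the candidates without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# One dict per group of solvers that share the same supported penalties.
compatible_grid = [
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['lbfgs', 'newton-cg', 'sag'], 'penalty': ['l2'], 'C': C_values},
]

n_candidates = len(ParameterGrid(compatible_grid))
print(n_candidates)
```

The same list of dictionaries can be passed as `param_grid` to `GridSearchCV`, so that every candidate it evaluates is a valid solver and penalty pair.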


In [15]:
# set the best parameter 
clf =LogisticRegression(C = 1, penalty= 'l2', solver= 'liblinear')
# fit the model
clf.fit(X_train,y_train)
# Predict the x test
y_hat_clf = clf.predict(X_test)

### Testing accuracy after hyperparameter tuning

In [16]:
test_accuracy = accuracy_score(y_hat_clf ,y_test)
print("Testing accuracy of Logistic Regression model",test_accuracy)
print("Logistic Regression Classification report: \n",classification_report(y_hat_clf,y_test))

Testing accuracy of Logistic Regression model 0.7961904761904762
Logistic Regression Classification report: 
               precision    recall  f1-score   support

           2       0.88      0.80      0.84       201
           3       0.62      0.86      0.72       126
           4       0.89      0.75      0.81       198

    accuracy                           0.80       525
   macro avg       0.80      0.80      0.79       525
weighted avg       0.82      0.80      0.80       525



* After hyperparameter tuning, the test accuracy is unchanged at 79.6%. The selected parameters (C = 1, l2 penalty) match the scikit-learn defaults except for the solver.

In [17]:
confusion_matrix(y_test, y_hat_clf)

array([[161,   6,  17],
       [ 33, 108,  32],
       [  7,  12, 149]], dtype=int64)
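
The headline numbers can be recomputed directly from this confusion matrix. In `confusion_matrix(y_test, y_hat_clf)` the rows are the true classes (2, 3, 4) and the columns are the predicted classes, so the following NumPy sketch recovers accuracy, recall and precision.

```python
import numpy as np

# Confusion matrix from the cell above: rows are true classes, columns are predictions.
cm = np.array([[161,   6,  17],
               [ 33, 108,  32],
               [  7,  12, 149]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)         # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)      # per predicted class (column sums)

print(round(accuracy, 4))
print(recall.round(2), precision.round(2))
```

The recall values computed here line up with the precision column of the classification report above, because that report was called with the predictions as the first argument.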

## 2. K-Nearest Neighbor

In [18]:
from sklearn.neighbors import KNeighborsClassifier
#Create KNN Object.
knn = KNeighborsClassifier()
#Training the model.
knn.fit(X_train, y_train)
# Prediction on train data
knn_train_predict = knn.predict(X_train)
#Predict test data set.
knn_test_predict = knn.predict(X_test)

### Training accuracy

In [19]:
# import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
knn_train_accuracy = accuracy_score(knn_train_predict,y_train)
print("Training accuracy of K-Nearest Neighbor model", knn_train_accuracy)
print("K-Nearest Neighbor Classification report: \n", classification_report(knn_train_predict,y_train))

Training accuracy of K-Nearest Neighbor model 0.9275154983309489
K-Nearest Neighbor Classification report: 
               precision    recall  f1-score   support

           2       0.99      0.92      0.95       739
           3       0.80      0.99      0.88       568
           4       1.00      0.89      0.94       790

    accuracy                           0.93      2097
   macro avg       0.93      0.93      0.93      2097
weighted avg       0.94      0.93      0.93      2097



### Testing accuracy

In [20]:
knn_test_accuracy = accuracy_score(knn_test_predict,y_test)
print("Testing accuracy of K-Nearest Neighbor model",knn_test_accuracy)
print("K-Nearest Neighbor Classification report: \n",classification_report(knn_test_predict,y_test))

Testing accuracy of K-Nearest Neighbor model 0.8914285714285715
K-Nearest Neighbor Classification report: 
               precision    recall  f1-score   support

           2       0.98      0.89      0.93       202
           3       0.71      0.97      0.82       127
           4       0.98      0.84      0.91       196

    accuracy                           0.89       525
   macro avg       0.89      0.90      0.89       525
weighted avg       0.92      0.89      0.90       525



### Hyperparameter tuning with grid-search CV

In [21]:
from sklearn.model_selection import GridSearchCV
parameter={'n_neighbors': np.arange(2, 30, 1)}
knn=KNeighborsClassifier()
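# Note: 'f1' is the binary F1 scorer. For this three-class target a multiclass variant
# such as 'f1_macro' or 'f1_weighted' is needed; with 'f1' the candidates cannot be
# scored, and the search is likely to return the first grid entry.
knn_cv=GridSearchCV(knn, param_grid=parameter, cv=10, scoring='f1', verbose=1)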
knn_cv.fit(X_train, y_train)
print(knn_cv.best_params_)

Fitting 10 folds for each of 28 candidates, totalling 280 fits
{'n_neighbors': 2}
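
For a target with more than two classes, the F1 scorer needs an averaging strategy. The sketch below runs the same kind of search with `scoring='f1_macro'` on the three-class iris dataset, which is used here only as a stand-in for the employee data.

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)   # three-class toy dataset, not the employee data

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(2, 12))},
    cv=5,
    scoring='f1_macro',   # macro-averaged F1 handles more than two classes
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Printing `best_score_` next to `best_params_` is a quick check that the search actually scored the candidates: a finite value means the ranking is based on real cross-validation results.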


In [22]:
# set the best parameter 
knn_clf = KNeighborsClassifier(n_neighbors=2)
# fit the model
knn_clf.fit(X_train,y_train)
# Predict the x test
knn_hat_clf = knn_clf.predict(X_test)

### Testing accuracy after hyperparameter tuning

In [23]:
test_accuracy = accuracy_score(knn_hat_clf,y_test)
print("Testing accuracy of K-Nearest Neighbor model",test_accuracy)
print("K-Nearest Neighbor Classification report: \n",classification_report(knn_hat_clf,y_test))

Testing accuracy of K-Nearest Neighbor model 0.9123809523809524
K-Nearest Neighbor Classification report: 
               precision    recall  f1-score   support

           2       1.00      0.87      0.93       211
           3       0.79      0.94      0.86       144
           4       0.95      0.94      0.94       170

    accuracy                           0.91       525
   macro avg       0.91      0.92      0.91       525
weighted avg       0.92      0.91      0.91       525



In [26]:
confusion_matrix(y_test, knn_hat_clf)

array([[184,   0,   0],
       [ 25, 134,  14],
       [  1,  11, 156]], dtype=int64)

## 3. SVM

In [24]:
# importing library
from sklearn.svm import SVC
# Object creation
svc = SVC()
# Fitting on the training data
svc.fit(X_train,y_train)
# Prediction on train data
svc_train_predict = svc.predict(X_train)
# Prediction on test data
svc_test_predict = svc.predict(X_test)

### Training accuracy

In [25]:
# import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
svc_train_accuracy = accuracy_score(svc_train_predict,y_train)
print("Training accuracy of support vector classifier model",svc_train_accuracy)
print("support vector classifier Classification report: \n",classification_report(svc_train_predict,y_train))

Training accuracy of support vector classifier model 0.8392942298521697
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.93      0.85      0.89       753
           3       0.71      0.84      0.77       589
           4       0.89      0.83      0.86       755

    accuracy                           0.84      2097
   macro avg       0.84      0.84      0.84      2097
weighted avg       0.85      0.84      0.84      2097



### Testing accuracy

In [26]:
svc_test_accuracy = accuracy_score(svc_test_predict,y_test)
print("Testing accuracy of support vector classifier model",svc_test_accuracy)
print("support vector classifier Classification report: \n",classification_report(svc_test_predict,y_test))

Testing accuracy of support vector classifier model 0.8019047619047619
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.88      0.81      0.85       199
           3       0.64      0.87      0.73       127
           4       0.89      0.75      0.81       199

    accuracy                           0.80       525
   macro avg       0.80      0.81      0.80       525
weighted avg       0.82      0.80      0.81       525



### Hyperparameter tuning with grid-search CV

In [39]:
from sklearn.model_selection import GridSearchCV
param = {'C':[0.5,10,50,60,70,80], 'gamma':[0.1,0.001,0.0001,0.00001]}
# Object creation
svc = SVC()
# Note: scoring='f1' is the binary F1 scorer; a three-class target needs e.g. 'f1_macro'.
grid = GridSearchCV(svc, param_grid=param, cv=10,refit=True, scoring='f1', verbose=2)
grid.fit(X_train,y_train)
print(grid.best_estimator_)

Fitting 10 folds for each of 24 candidates, totalling 240 fits
[CV] END ...................................C=0.5, gamma=0.1; total time=   0.0s
...
[CV] END ..................................C=80, gamma=0.001; total time=   0.0s

In [36]:
# set the best parameter 
svm_clf =SVC(C=0.5, gamma=0.1)
# fit the model
svm_clf.fit(X_train,y_train)
# Predict the x test
svm_hat_clf = svm_clf.predict(X_test)
# Predict the x train
svm_hat_train_clf = svm_clf.predict(X_train)

### Training accuracy after hyperparameter tuning

In [37]:
train_accuracy = accuracy_score(svm_hat_train_clf,y_train)
print("Training accuracy of support vector classifier model",train_accuracy)
print("support vector classifier Classification report: \n",classification_report(svm_hat_train_clf,y_train))

Training accuracy of support vector classifier model 0.9098712446351931
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.99      0.92      0.95       739
           3       0.81      0.91      0.86       624
           4       0.93      0.90      0.91       734

    accuracy                           0.91      2097
   macro avg       0.91      0.91      0.91      2097
weighted avg       0.92      0.91      0.91      2097



### Testing accuracy after hyperparameter tuning

In [38]:
test_accuracy = accuracy_score(svm_hat_clf,y_test)
print("Testing accuracy of support vector classifier model",test_accuracy)
print("support vector classifier Classification report: \n",classification_report(svm_hat_clf,y_test))

Testing accuracy of support vector classifier model 0.8895238095238095
support vector classifier Classification report: 
               precision    recall  f1-score   support

           2       0.95      0.91      0.93       192
           3       0.79      0.89      0.84       152
           4       0.93      0.86      0.89       181

    accuracy                           0.89       525
   macro avg       0.89      0.89      0.89       525
weighted avg       0.90      0.89      0.89       525



## 4. Decision Tree

In [40]:
from sklearn.tree import DecisionTreeClassifier
# fit a model with default parameters
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
# Prediction on train data
dt_train_predict = dt.predict(X_train)
#Predict test data set.
dt_test_predict = dt.predict(X_test)

### Training accuracy

In [41]:
# import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
dt_train_accuracy = accuracy_score(dt_train_predict,y_train)
print("Training accuracy of Decision Tree Classifier model",dt_train_accuracy)
print("Decision Tree Classifier Classification report: \n",classification_report(dt_train_predict,y_train))

Training accuracy of Decision Tree Classifier model 1.0
Decision Tree Classifier Classification report: 
               precision    recall  f1-score   support

           2       1.00      1.00      1.00       690
           3       1.00      1.00      1.00       701
           4       1.00      1.00      1.00       706

    accuracy                           1.00      2097
   macro avg       1.00      1.00      1.00      2097
weighted avg       1.00      1.00      1.00      2097



### Testing accuracy

In [42]:
dt_test_accuracy = accuracy_score(dt_test_predict,y_test)
print("Testing accuracy of Decision Tree Classifier model",dt_test_accuracy)
print("Decision Tree Classifier Classification report: \n",classification_report(dt_test_predict,y_test))

Testing accuracy of Decision Tree Classifier model 0.9295238095238095
Decision Tree Classifier Classification report: 
               precision    recall  f1-score   support

           2       0.96      0.97      0.96       182
           3       0.90      0.91      0.90       172
           4       0.93      0.91      0.92       171

    accuracy                           0.93       525
   macro avg       0.93      0.93      0.93       525
weighted avg       0.93      0.93      0.93       525



### Hyperparameter tuning with grid-search CV

In [43]:
params = {'max_depth':[3,5,7,10,15],
          'min_samples_leaf':[3,5,10,15,20],
          'min_samples_split':[8,10,12,18,20,16],
          'criterion':['gini','entropy']}

dt = DecisionTreeClassifier()
# create an instance of the grid search object
# Note: scoring='f1' is the binary F1 scorer; a three-class target needs e.g. 'f1_macro'.
g1 = GridSearchCV(dt, param_grid=params, cv=5, n_jobs=-1,verbose=True, scoring='f1')

# conduct grid search over the parameter space
g1.fit(X_train,y_train)

# show best parameter configuration found for classifier
cls_params1 = g1.best_params_
cls_params1

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


{'criterion': 'gini',
 'max_depth': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 8}

In [44]:
# set the best parameter 
dt_clf = DecisionTreeClassifier(criterion= 'gini', max_depth= 3, min_samples_leaf= 3, min_samples_split= 8)
# fit the model
dt_clf.fit(X_train,y_train)
# Predict the x test
dt_hat_clf = dt_clf.predict(X_test)
# Predict the x train
dt_hat_train_clf = dt_clf.predict(X_train)

### Training accuracy after hyperparameter tuning

In [47]:
train_accuracy = accuracy_score(dt_hat_train_clf,y_train)
print("Training accuracy of Decision Tree Classifier model",train_accuracy)
print("Decision Tree Classifier Classification report: \n",classification_report(dt_hat_train_clf,y_train))

Training accuracy of Decision Tree Classifier model 0.8974725798760134
Decision Tree Classifier Classification report: 
               precision    recall  f1-score   support

           2       0.98      0.87      0.92       777
           3       0.87      0.87      0.87       702
           4       0.85      0.97      0.90       618

    accuracy                           0.90      2097
   macro avg       0.90      0.90      0.90      2097
weighted avg       0.90      0.90      0.90      2097



### Testing accuracy after hyperparameter tuning

In [46]:
test_accuracy = accuracy_score(dt_hat_clf ,y_test)
print("Testing accuracy of Decision Tree Classifier model",test_accuracy)
print("Decision Tree Classifier Classification report: \n",classification_report(dt_hat_clf,y_test))

Testing accuracy of Decision Tree Classifier model 0.8990476190476191
Decision Tree Classifier Classification report: 
               precision    recall  f1-score   support

           2       0.98      0.89      0.94       203
           3       0.87      0.86      0.86       174
           4       0.84      0.95      0.89       148

    accuracy                           0.90       525
   macro avg       0.90      0.90      0.90       525
weighted avg       0.90      0.90      0.90       525



## 5. Random Forest with GridSearchCV

In [98]:
# Import libraries and define the parameter grid
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {'criterion':['gini','entropy'],
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'min_samples_leaf': [3,5,10],
    'min_samples_split': [4,8,10,12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
# Fitting the training data
grid_search.fit(X_train,y_train)

# Get best parameter
rf_best_params = grid_search.best_params_
print(f"Best parameter: {rf_best_params}")

Fitting 3 folds for each of 384 candidates, totalling 1152 fits
Best parameter: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 110, 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 200}


In [105]:
# Create the object with the best parameters
rf_clf = RandomForestClassifier(bootstrap = True, criterion = 'entropy', max_depth = 110,  min_samples_leaf = 3, 
                                min_samples_split = 4, n_estimators = 200)
# Fitting the training data
rf_clf.fit(X_train,y_train)
# Prediction on test data
rf_clf_predict = rf_clf.predict(X_test)
#  prediction on training data
rf_clf_train_predict = rf_clf.predict(X_train)

### Training accuracy after hyperparameter tuning

In [106]:
train_accuracy = accuracy_score(rf_clf_train_predict,y_train)
print("Training accuracy of random forest",train_accuracy)
print("Classification report of training: \n",classification_report(rf_clf_train_predict,y_train))

Training accuracy of random forest 0.9785407725321889
Classification report of training: 
               precision    recall  f1-score   support

           2       0.99      0.97      0.98       705
           3       0.97      0.97      0.97       697
           4       0.98      0.99      0.99       695

    accuracy                           0.98      2097
   macro avg       0.98      0.98      0.98      2097
weighted avg       0.98      0.98      0.98      2097



### Testing accuracy after hyperparameter tuning

In [107]:
test_accuracy = accuracy_score(rf_clf_predict,y_test)
print("Testing accuracy of random forest",test_accuracy)
print("Classification report of testing: \n",classification_report(rf_clf_predict,y_test))

Testing accuracy of random forest 0.9371428571428572
Classification report of testing: 
               precision    recall  f1-score   support

           2       0.96      0.94      0.95       188
           3       0.94      0.89      0.91       182
           4       0.91      0.99      0.95       155

    accuracy                           0.94       525
   macro avg       0.94      0.94      0.94       525
weighted avg       0.94      0.94      0.94       525



In [87]:
pd.crosstab(rf_clf_predict,y_test)

Predicted (rows) / Actual PerformanceRating (columns),2,3,4
2,178,10,2
3,6,161,10
4,0,2,156


## 6. XGBoost Classifier with GridSearch CV

In [35]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
xgb_y_train = le.fit_transform(y_train)
xgb_y_test = le.transform(y_test)  # reuse the mapping fitted on y_train

Label encoding is needed because XGBoost expects the class labels to start from 0, whereas the ratings here are 2, 3 and 4. An easy way to solve that is `LabelEncoder` from the `sklearn.preprocessing` module, which maps the ratings to 0, 1 and 2.

In [36]:
xgb_y_train[0:20]      # values are encoded in 0,1,2

array([0, 0, 1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 2, 2, 0, 0, 0, 1, 2, 0],
      dtype=int64)
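
A short sketch of the encoding round trip, on made-up ratings: the encoder is fitted on the training labels, reused for the test labels, and `inverse_transform` maps encoded predictions back to the original ratings.

```python
from sklearn.preprocessing import LabelEncoder

ratings_train = [3, 3, 4, 2, 3, 4]   # made-up ratings
ratings_test = [2, 4, 3]

encoder = LabelEncoder()
encoded_train = encoder.fit_transform(ratings_train)   # learn the mapping on training labels
encoded_test = encoder.transform(ratings_test)         # reuse it, without refitting

# Map encoded class ids back to the original ratings.
decoded = encoder.inverse_transform([0, 1, 2])
print(encoded_train, encoded_test, decoded)
```

The same `inverse_transform` call can be applied to the encoded predictions of the XGBoost model when the original 2, 3, 4 ratings are needed for reporting.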

In [48]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.001],
    'subsample': [0.5, 0.7, 1]
}

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier()

# Create the GridSearchCV object
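# Note: 'f1' is the binary F1 scorer; with three classes it cannot be computed, which is
# consistent with the "Best score: nan" output below. 'f1_macro' or 'f1_weighted' would apply.
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='f1', n_jobs = -1, verbose = 2)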

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, xgb_y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best set of hyperparameters:  {'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.5}
Best score:  nan


In [49]:
# Create the object with the best parameters
xgb_clf = xgb.XGBClassifier(learning_rate= 0.1, max_depth= 3, subsample=0.5)

# Fitting the training data
xgb_clf.fit(X_train,xgb_y_train)
# Prediction on test data
xgb_clf_predict = xgb_clf.predict(X_test)
#  prediction on training data
xgb_clf_train_predict = xgb_clf.predict(X_train)

### Training accuracy after hyperparameter tuning

In [50]:
# import metrics
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
xgb_train_accuracy = accuracy_score(xgb_clf_train_predict, xgb_y_train)
print("Training accuracy of XGBoost Classifier model",xgb_train_accuracy)
print("XGBoost Classifier Classification report: \n",classification_report(xgb_clf_train_predict, xgb_y_train))

Training accuracy of XGBoost Classifier model 0.9647114926084883
XGBoost Classifier Classification report: 
               precision    recall  f1-score   support

           0       0.99      0.95      0.97       719
           1       0.95      0.96      0.95       695
           2       0.95      0.99      0.97       683

    accuracy                           0.96      2097
   macro avg       0.96      0.96      0.96      2097
weighted avg       0.97      0.96      0.96      2097



### Testing accuracy after hyperparameter tuning

In [51]:
test_accuracy = accuracy_score(xgb_clf_predict, xgb_y_test)
print("Testing accuracy of XGBoost Classifier",test_accuracy)
print("Classification report of testing: \n",classification_report(xgb_clf_predict, xgb_y_test))

Testing accuracy of XGBoost Classifier 0.9542857142857143
Classification report of testing: 
               precision    recall  f1-score   support

           0       0.99      0.95      0.97       192
           1       0.92      0.95      0.94       169
           2       0.95      0.97      0.96       164

    accuracy                           0.95       525
   macro avg       0.95      0.95      0.95       525
weighted avg       0.96      0.95      0.95       525



In [41]:
pd.crosstab(xgb_clf_predict,xgb_y_test)

Predicted (rows) / Actual encoded rating (columns),0,1,2
0,183,8,2
1,1,158,5
2,0,7,161


## 7. Artificial Neural Network

In [22]:
# Importing library and object creation
from sklearn.neural_network import MLPClassifier
model =  MLPClassifier(hidden_layer_sizes=(100,100,100),batch_size=10,learning_rate_init=0.01,max_iter=2000,random_state=10)

In [23]:
# Fitting the training data
model.fit(X_train,y_train)

In [24]:
# Predicting the class probabilities
mlp_predict_probability = model.predict_proba(X_test)
mlp_predict_probability

array([[9.03854264e-03, 9.63459230e-01, 2.75022271e-02],
       [7.57659673e-03, 9.83177086e-01, 9.24631707e-03],
       [9.91894610e-01, 8.10538307e-03, 6.54827242e-09],
       ...,
       [9.46589195e-01, 5.33750820e-02, 3.57232929e-05],
       [9.20973654e-01, 7.88125781e-02, 2.13767815e-04],
       [1.44374331e-05, 2.36111942e-03, 9.97624443e-01]])
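
Each row of this array holds one probability per class, in the order of `model.classes_`. As a small sketch, assuming that order is 2, 3, 4 for this target, taking the column with the highest probability gives the predicted rating for the rows shown above.

```python
import numpy as np

classes = np.array([2, 3, 4])   # assumed to match model.classes_ for this target
proba = np.array([
    [9.03854264e-03, 9.63459230e-01, 2.75022271e-02],
    [7.57659673e-03, 9.83177086e-01, 9.24631707e-03],
    [9.91894610e-01, 8.10538307e-03, 6.54827242e-09],
    [1.44374331e-05, 2.36111942e-03, 9.97624443e-01],
])

# The predicted class is the one with the highest probability in each row.
predicted = classes[proba.argmax(axis=1)]
print(predicted)
```

The probabilities in each row sum to 1, so the largest entry also indicates how confident the network is in that prediction.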

In [25]:
# Prediction on test data
mlp_test_predict = model.predict(X_test)

# Prediction on training data
mlp_train_predict = model.predict(X_train)

### Training accuracy

In [27]:
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,classification_report,confusion_matrix
mlp_train_accuracy = accuracy_score(mlp_train_predict,y_train)
print("Training accuracy of MLP model is:",mlp_train_accuracy*100)
print("Classification report of training:"'\n',classification_report(mlp_train_predict,y_train))

Training accuracy of MLP model is: 92.7515498330949
Classification report of training:
               precision    recall  f1-score   support

           2       1.00      0.92      0.96       748
           3       0.87      0.91      0.89       671
           4       0.92      0.95      0.93       678

    accuracy                           0.93      2097
   macro avg       0.93      0.93      0.93      2097
weighted avg       0.93      0.93      0.93      2097



### Testing accuracy

In [28]:
mlp_test_accuracy = accuracy_score(mlp_test_predict,y_test)
print("Testing accuracy of MLP model is:",mlp_test_accuracy*100)
print("Classification report of testing:"'\n',classification_report(mlp_test_predict,y_test))

Testing accuracy of MLP model is: 90.66666666666666
Classification report of testing:
               precision    recall  f1-score   support

           2       0.97      0.90      0.94       198
           3       0.82      0.92      0.87       154
           4       0.92      0.90      0.91       173

    accuracy                           0.91       525
   macro avg       0.91      0.91      0.90       525
weighted avg       0.91      0.91      0.91       525



In [29]:
pd.crosstab(mlp_test_predict,y_test)

Predicted (rows) / Actual PerformanceRating (columns),2,3,4
2,179,17,2
3,1,142,11
4,4,14,155


## Observation:
- We trained Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Decision Tree, Random Forest, XGBoost Classifier and an Artificial Neural Network (MLP Classifier), and compared their accuracy on the test set. The XGBoost Classifier gives the highest test accuracy, about 95%, followed by Random Forest at about 94%.

## Model saving

In [53]:
# saving model with the help of pickle
import pickle

with open('xgb_classifier_model.pkl','wb') as file:  # the file is closed automatically
    pickle.dump(xgb_clf,file)
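
Loading a pickled object back works the same way in reverse. The sketch below uses a placeholder dictionary and a temporary file instead of the real model, so the names `model_stub` and `model_stub.pkl` are purely illustrative.

```python
import os
import pickle
import tempfile

model_stub = {'name': 'stand-in for xgb_clf', 'classes': [0, 1, 2]}   # placeholder object

path = os.path.join(tempfile.mkdtemp(), 'model_stub.pkl')
with open(path, 'wb') as f:      # the context manager closes the file after writing
    pickle.dump(model_stub, f)

with open(path, 'rb') as f:      # read the object back from disk
    restored = pickle.load(f)

print(restored == model_stub)
```

For the real model, the loaded classifier predicts the encoded classes 0, 1 and 2, so the same label mapping used during training is needed to turn predictions back into ratings.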