# PyCaret AutoML - Employee Promotion Data

### Importing all the Required Libraries

+ Import Pandas, Matplotlib, and Plotly for data analysis and visualization
+ Import Pandas Profiling for exploratory data analysis
+ Import PyCaret and scikit-learn for machine learning modeling

In [2]:
# for AutoML modeling
import pycaret  # explicit import so that pycaret.__version__ resolves
from pycaret.classification import *

# for EDA & visualization
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from pandas_profiling import ProfileReport

In [3]:
pycaret.__version__

'2.3.10'

### A PyCaret workflow consists of the following steps, in this order:

#### EDA ➡️ Setup ➡️ Compare Models ➡️ Analyze Model ➡️ Prediction ➡️ Save Model

### Load dataset

In [4]:
df_train = pd.read_csv('emp_promo_data/emp_train.csv')
df_test = pd.read_csv('emp_promo_data/emp_test.csv')

df_train.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73,0


## 1. EDA

In [5]:
df_train.shape

(54808, 13)

In [6]:
df_test.shape

(23490, 12)

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  awards_won?           54808 non-null  int64  
 11  avg_training_score    54808 non-null  int64  
 12  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB


Data types in the dataset

In [8]:
df_train.dtypes.value_counts()

int64      7
object     5
float64    1
dtype: int64

#### Descriptive Statistics

Descriptive statistics are an important first step in understanding the data and drawing insights from it.
+ First, we compute descriptive statistics for the numerical columns.
+ For numerical columns we look at the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
+ Then we compute descriptive statistics for the categorical columns.
+ For categorical columns we look at the count, the number of unique values, the most frequent value (top), and its frequency.

Statistics for numerical columns

In [9]:
df_train.describe()

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,awards_won?,avg_training_score,is_promoted
count,54808.0,54808.0,54808.0,50684.0,54808.0,54808.0,54808.0,54808.0
mean,39195.830627,1.253011,34.803915,3.329256,5.865512,0.023172,63.38675,0.08517
std,22586.581449,0.609264,7.660169,1.259993,4.265094,0.15045,13.371559,0.279137
min,1.0,1.0,20.0,1.0,1.0,0.0,39.0,0.0
25%,19669.75,1.0,29.0,3.0,3.0,0.0,51.0,0.0
50%,39225.5,1.0,33.0,3.0,5.0,0.0,60.0,0.0
75%,58730.5,1.0,39.0,4.0,7.0,0.0,76.0,0.0
max,78298.0,10.0,60.0,5.0,37.0,1.0,99.0,1.0


Statistics for categorical columns

In [10]:
df_train.describe(include = 'object')

Unnamed: 0,department,region,education,gender,recruitment_channel
count,54808,54808,52399,54808,54808
unique,9,34,3,2,3
top,Sales & Marketing,region_2,Bachelor's,m,other
freq,16840,12343,36669,38496,30446


In [11]:
# values in Departments
df_train['department'].value_counts()

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

Statistics of the **target variable** (is_promoted)

In [12]:
df_train.is_promoted.value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64
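
The counts above show a heavily imbalanced target. As a small worked example, the sketch below turns them into a promotion rate and into the accuracy that a model would reach by always predicting "not promoted". The variable names are illustrative and are not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 50140, 1: 4668}

total = sum(class_counts.values())
promotion_rate = class_counts[1] / total      # share of promoted employees
majority_baseline = class_counts[0] / total   # accuracy of always predicting 0

print(f"rows: {total}")
print(f"promotion rate: {promotion_rate:.4f}")
print(f"majority-class accuracy: {majority_baseline:.4f}")
```

The promotion rate agrees with the mean of `is_promoted` reported by `describe()`. Because roughly 91% accuracy is available without learning anything, recall, precision, and F1 are more informative than accuracy for this dataset.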

In [13]:
px.histogram(df_train,'is_promoted', color='is_promoted')

In [14]:
px.histogram(df_train,'is_promoted',facet_col='gender', color='is_promoted')

EDA with Pandas Profiling

In [15]:
profile_df = ProfileReport(df_train)
profile_df.to_file("eda_profile_report.html")


Feature relationship

In [16]:
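# restrict to numeric columns so corr() also works on pandas versions that no longer drop text columns silently
px.imshow(df_train.select_dtypes('number').corr(), text_auto=True, title='Correlation Between the Variables in the Model', height=1000)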

## 2. Setup Experiment

In [17]:
setup(df_train, target = 'is_promoted')

Unnamed: 0,Description,Value
0,session_id,2902
1,Target,is_promoted
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(54808, 13)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,8
8,Ordinal Features,False
9,High Cardinality Features,False


(False,
 False,
 True,
 Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[],
                                       target='is_promoted', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_...
                 ('scaling', 'passthrough'), ('P_transform', 'passthrough'),
                 ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                 ('cluster_all', 'passthrough'),
                 ('dummy', Dummify(target='is_prom

## 3. Compare Models

In [18]:
available_models = models()
available_models

ID,Name,Reference,Turbo
lr,Logistic Regression,sklearn.linear_model._logistic.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors._classification.KNeighborsCl...,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree._classes.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model._stochastic_gradient.SGDC...,True
rbfsvm,SVM - Radial Kernel,sklearn.svm._classes.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process._gpc.GaussianProcessC...,False
mlp,MLP Classifier,sklearn.neural_network._multilayer_perceptron....,False
ridge,Ridge Classifier,sklearn.linear_model._ridge.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble._forest.RandomForestClassifier,True


`compare_models()` trains every estimator available in the model library and evaluates each one with cross-validation. Its output is a scoring grid of average cross-validated scores.

In [19]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.9421,0.8213,0.3374,0.9468,0.4973,0.4738,0.546,0.385
xgboost,Extreme Gradient Boosting,0.9414,0.8141,0.343,0.9109,0.4978,0.4734,0.5382,3.097
gbc,Gradient Boosting Classifier,0.9397,0.8175,0.304,0.9556,0.4606,0.4377,0.52,2.321
lda,Linear Discriminant Analysis,0.9383,0.7812,0.3295,0.8534,0.4748,0.4488,0.5071,0.383
rf,Random Forest Classifier,0.9323,0.7917,0.2318,0.8864,0.3668,0.3439,0.4327,1.43
ridge,Ridge Classifier,0.9282,0.0,0.1544,0.998,0.2667,0.2498,0.3769,0.039
ada,Ada Boost Classifier,0.9273,0.7956,0.1766,0.8435,0.2916,0.2702,0.3656,0.687
et,Extra Trees Classifier,0.9236,0.7771,0.2223,0.6451,0.3302,0.2998,0.3486,2.316
nb,Naive Bayes,0.9159,0.7026,0.0098,1.0,0.0194,0.0178,0.092,0.051
lr,Logistic Regression,0.9156,0.5956,0.0102,0.1061,0.018,0.0161,0.0273,1.491
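
Each row of the scoring grid is an average over cross-validation folds. As a rough illustration of that idea, and not PyCaret's internal implementation, the sketch below builds a similar grid with scikit-learn alone. The synthetic data, the three candidate models, and the 5-fold setting are all assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the promotion dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# A small, hypothetical model library keyed by short IDs.
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
    "dt": DecisionTreeClassifier(random_state=0),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "roc_auc", "recall", "precision", "f1"]

rows = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    # Average each metric over the folds, as a scoring grid does.
    rows[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

grid = pd.DataFrame.from_dict(rows, orient="index")
grid = grid.sort_values("accuracy", ascending=False)
print(grid.round(4))
```

Sorting by accuracy mirrors the grid shown above. On imbalanced data a different sort key, such as F1 or recall, may rank the models differently.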


In [20]:
best = compare_models(include = ['rf', 'lr', 'knn','nb','svm'])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.9323,0.7917,0.2318,0.8864,0.3668,0.3439,0.4327,1.485
nb,Naive Bayes,0.9159,0.7026,0.0098,1.0,0.0194,0.0178,0.092,0.045
lr,Logistic Regression,0.9156,0.5956,0.0102,0.1061,0.018,0.0161,0.0273,0.252
knn,K Neighbors Classifier,0.9102,0.5411,0.0126,0.1522,0.0232,0.0103,0.02,0.358
svm,SVM - Linear Kernel,0.9102,0.0,0.0138,0.1843,0.0247,0.0122,0.0256,0.673


## 4. Analyze Model

In [21]:
evaluate_model(best)


## 5. Prediction

Predict on the hold-out set

In [22]:
# with no `data` argument, predict_model scores the hold-out split created by setup()
best_test = predict_model(best)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.9329,0.7799,0.2502,0.8869,0.3903,0.3663,0.4505


In [23]:
best_test.head(10)

Unnamed: 0,employee_id,age,length_of_service,avg_training_score,department_Analytics,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,...,previous_year_rating_1.0,previous_year_rating_2.0,previous_year_rating_3.0,previous_year_rating_4.0,previous_year_rating_5.0,previous_year_rating_not_available,awards_won?_1,is_promoted,Label,Score
0,53977.0,30.0,6.0,54.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0.95
1,42842.0,38.0,3.0,50.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0,0,0.99
2,53785.0,53.0,16.0,46.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0,0,0.97
3,53126.0,29.0,3.0,86.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,0,0.62
4,25911.0,37.0,9.0,58.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0,0,0.95
5,2554.0,34.0,8.0,64.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0.97
6,51523.0,42.0,3.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.96
7,48839.0,49.0,12.0,55.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.93
8,31577.0,43.0,10.0,73.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0.97
9,69775.0,33.0,6.0,47.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0


Predict on new dataset

In [24]:
# predict model on new_data
predictions = predict_model(best, data = df_test)
predictions.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won?,avg_training_score,Label,Score
0,8724,Technology,region_26,Bachelor's,m,sourcing,1,24,,1,0,77,0,0.95
1,74430,HR,region_4,Bachelor's,f,other,1,31,3.0,5,0,51,0,0.91
2,72255,Sales & Marketing,region_13,Bachelor's,m,other,1,31,1.0,4,0,47,0,0.99
3,38562,Procurement,region_2,Bachelor's,f,other,3,31,2.0,9,0,65,0,0.9
4,64486,Finance,region_29,Bachelor's,m,sourcing,1,30,4.0,7,0,61,0,1.0


In [25]:
value_counts = predictions['Label'].value_counts()
print(value_counts)

0    22996
1      494
Name: Label, dtype: int64


In [26]:
px.histogram(predictions,'Label', color='Label')

Best Model

In [27]:
print(best)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=2902, verbose=0,
                       warm_start=False)


# Sklearn ML modeling

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

import time

In [100]:
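# drop rows with missing values, then keep four numeric features
# (named df_model so the frame is not confused with the f1 score computed later)
df_model = df_train.dropna()
X = df_model[['age', 'previous_year_rating', 'length_of_service', 'awards_won?']]
y = df_model['is_promoted']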

In [101]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)

Random Forest

In [102]:
from sklearn.ensemble import RandomForestClassifier

In [103]:
classifier=RandomForestClassifier()

start = time.time()

classifier.fit(X_train, y_train)
y_pred=classifier.predict(X_test)

end = time.time()

In [104]:
score = accuracy_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
prec = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

print("Accuracy : ", score)
print("AUC : ", auc)
print("Recall : ", recall)
print("Precision : ", prec)
print("F1 : ", f1)
print("TT: ",end - start, "sec")

cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm)
print("Confusion Matrix : ")
print(cm_df)

Accuracy :  0.9127962734621181
AUC :  0.5313287186798631
Recall :  0.0717163577759871
Precision :  0.4238095238095238
F1 :  0.12267401791867676
TT:  2.012218713760376 sec
Confusion Matrix : 
       0    1
0  13236  121
1   1152   89
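
The printed metrics can be recovered by hand from the confusion matrix, which also makes it clear what the "AUC" line measures here. The sketch below is a worked example using only the four counts shown above; the variable names are illustrative.

```python
# Confusion matrix printed above for the random forest
# (rows = actual class, columns = predicted class).
tn, fp = 13236, 121
fn, tp = 1152, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)     # true negative rate
precision = tp / (tp + fp)
f1_value = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it reduces to the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print("Accuracy :", accuracy)
print("AUC      :", auc_from_labels)
print("Recall   :", recall)
print("Precision:", precision)
print("F1       :", f1_value)
```

Because `roc_auc_score` receives `y_pred` (class labels) in this section, the AUC values here are balanced accuracies rather than ranking-based AUCs. For estimators that support it, passing `classifier.predict_proba(X_test)[:, 1]` instead would give a ranking-based AUC, which is probably closer to what the AUC column in the PyCaret grid reports.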


Logistic Regression

In [105]:
from sklearn.linear_model import LogisticRegression

In [106]:
classifier = LogisticRegression()

start = time.time()

classifier.fit(X_train, y_train)
y_pred=classifier.predict(X_test)

end = time.time()

In [107]:
score = accuracy_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
prec = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

print("Accuracy : ", score)
print("AUC : ", auc)
print("Recall : ", recall)
print("Precision : ", prec)
print("F1 : ", f1)
print("TT: ",end - start, "sec")

cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm)
print("Confusion Matrix : ")
print(cm_df)

Accuracy :  0.9149198520345253
AUC :  0.5277380835962179
Recall :  0.06124093473005641
Precision :  0.49673202614379086
F1 :  0.10903873744619799
TT:  0.13562321662902832 sec
Confusion Matrix : 
       0   1
0  13280  77
1   1165  76


Naive Bayes

In [108]:
from sklearn.naive_bayes import GaussianNB

In [109]:
classifier = GaussianNB()

start = time.time()

classifier.fit(X_train, y_train)
y_pred=classifier.predict(X_test)

end = time.time()

In [110]:
score = accuracy_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
prec = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

print("Accuracy : ", score)
print("AUC : ", auc)
print("Recall : ", recall)
print("Precision : ", prec)
print("F1 : ", f1)
print("TT: ",end - start, "sec")

cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm)
print("Confusion Matrix : ")
print(cm_df)

Accuracy :  0.9119057405123989
AUC :  0.5564247956251546
Recall :  0.12812248186946013
Precision :  0.4380165289256198
F1 :  0.1982543640897756
TT:  0.01100301742553711 sec
Confusion Matrix : 
       0    1
0  13153  204
1   1082  159


KNN

In [111]:
from sklearn.neighbors import KNeighborsClassifier

In [112]:
knn = KNeighborsClassifier(n_neighbors=3)

start = time.time()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
end = time.time()

In [113]:
score = accuracy_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
prec = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

print("Accuracy : ", score)
print("AUC : ", auc)
print("Recall : ", recall)
print("Precision : ", prec)
print("F1 : ", f1)
print("TT: ",end - start, "sec")

cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm)
print("Confusion Matrix : ")
print(cm_df)

Accuracy :  0.8905329497191397
AUC :  0.5231829538025283
Recall :  0.08058017727639001
Precision :  0.17953321364452424
F1 :  0.11123470522803115
TT:  0.529353141784668 sec
Confusion Matrix : 
       0    1
0  12900  457
1   1141  100


SVM

In [121]:
from sklearn import svm


In [122]:
classifier = svm.SVC(kernel='linear')

start = time.time()

classifier.fit(X_train, y_train)
y_pred=classifier.predict(X_test)

end = time.time()


In [123]:
score = accuracy_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
prec = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

print("Accuracy : ", score)
print("AUC : ", auc)
print("Recall : ", recall)
print("Precision : ", prec)
print("F1 : ", f1)
print("TT: ",end - start, "sec")

cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm)
print("Confusion Matrix : ")
print(cm_df)

Accuracy :  0.9149883545691191
AUC :  0.5
Recall :  0.0
Precision :  0.0
F1 :  0.0
TT:  12.308684349060059 sec
Confusion Matrix : 
       0  1
0  13357  0
1   1241  0
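
The scikit-learn results above are printed one model at a time. To compare them side by side, in the spirit of the PyCaret scoring grid, the sketch below recomputes the metrics from the five confusion matrices and ranks the models by F1. The helper `summarize` and the choice of F1 as the sort key are assumptions made for this example.

```python
# Confusion matrices copied from the outputs above,
# laid out as ((TN, FP), (FN, TP)).
matrices = {
    "Random Forest": ((13236, 121), (1152, 89)),
    "Logistic Regression": ((13280, 77), (1165, 76)),
    "Naive Bayes": ((13153, 204), (1082, 159)),
    "KNN": ((12900, 457), (1141, 100)),
    "SVM": ((13357, 0), (1241, 0)),
}


def summarize(matrix):
    """Return accuracy, recall, precision and F1 for one confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    # Report 0.0 when a ratio is undefined, matching the SVM output above.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1_value = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1_value,
    }


summary = {name: summarize(matrix) for name, matrix in matrices.items()}
ranking = sorted(summary, key=lambda name: summary[name]["f1"], reverse=True)

print(f"{'Model':<22}{'Accuracy':>10}{'Recall':>10}{'Prec.':>10}{'F1':>10}")
for name in ranking:
    s = summary[name]
    print(
        f"{name:<22}{s['accuracy']:>10.4f}{s['recall']:>10.4f}"
        f"{s['precision']:>10.4f}{s['f1']:>10.4f}"
    )
```

On these four features, Naive Bayes has the highest F1 even though the SVM and logistic regression have marginally higher accuracy. The SVM predicts no promotions at all, so its accuracy equals the share of the majority class in the test split.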
