# PyCaret AutoML - Employee Promotion Data

### Importing all the Required Libraries

+ Import Pandas, Matplotlib, and Plotly for data analysis and visualization
+ Import Pandas Profiling for exploratory data analysis
+ Import PyCaret and scikit-learn for machine learning modelling

In [104]:
# for AutoML modeling
from pycaret.classification import *

# for EDA & visualization
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from pandas_profiling import ProfileReport

### The PyCaret workflow consists of the following steps, in this order:

#### EDA ➡️ Setup ➡️ Compare Models ➡️ Analyze Model ➡️ Prediction ➡️ Save Model

### Load dataset

In [92]:
df_train = pd.read_csv('emp_promo_data/emp_train.csv')
df_test = pd.read_csv('emp_promo_data/emp_test.csv')

df_train.head()

| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 |


## 1. EDA

In [75]:
df_train.shape

(54808, 13)

In [76]:
df_test.shape

(23490, 12)

In [77]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  awards_won?           54808 non-null  int64  
 11  avg_training_score    54808 non-null  int64  
 12  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB
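
The non-null counts above imply that two columns have missing values. A minimal sketch of the arithmetic, using only the numbers printed by `df_train.info()` (the variable names here are illustrative, not part of the notebook); on the real frame, `df_train.isna().sum()` should report the same counts:

```python
# Non-null counts copied from the df_train.info() output above.
TOTAL_ROWS = 54808
non_null = {
    "education": 52399,
    "previous_year_rating": 50684,
}

# Missing count = total rows - non-null count.
missing = {col: TOTAL_ROWS - n for col, n in non_null.items()}

# Share of rows with a missing value, as a percentage of all rows.
missing_pct = {col: 100 * m / TOTAL_ROWS for col, m in missing.items()}

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]:.2f}%)")
```

Both columns are missing in well under 10% of rows, so imputation is a plausible choice over dropping rows; the setup step later reports `Missing Values: True` for the same reason.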


Data types in the dataset

In [93]:
# count columns per dtype (the top-level pd.value_counts is deprecated in recent pandas)
df_train.dtypes.value_counts()

int64      7
object     5
float64    1
dtype: int64

#### Descriptive Statistics

Descriptive statistics are one of the most important steps in understanding the data and drawing insights from it.
+ First, we check the descriptive statistics for the numerical columns.
+ For numerical columns, we look at the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
+ Then we check the descriptive statistics for the categorical columns.
+ For categorical columns, we look at the count, the number of unique values, the most frequent value (top), and its frequency (freq).

Statistics for numerical columns

In [79]:
df_train.describe()

| | employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|
| count | 54808.0 | 54808.0 | 54808.0 | 50684.0 | 54808.0 | 54808.0 | 54808.0 | 54808.0 |
| mean | 39195.830627 | 1.253011 | 34.803915 | 3.329256 | 5.865512 | 0.023172 | 63.38675 | 0.08517 |
| std | 22586.581449 | 0.609264 | 7.660169 | 1.259993 | 4.265094 | 0.15045 | 13.371559 | 0.279137 |
| min | 1.0 | 1.0 | 20.0 | 1.0 | 1.0 | 0.0 | 39.0 | 0.0 |
| 25% | 19669.75 | 1.0 | 29.0 | 3.0 | 3.0 | 0.0 | 51.0 | 0.0 |
| 50% | 39225.5 | 1.0 | 33.0 | 3.0 | 5.0 | 0.0 | 60.0 | 0.0 |
| 75% | 58730.5 | 1.0 | 39.0 | 4.0 | 7.0 | 0.0 | 76.0 | 0.0 |
| max | 78298.0 | 10.0 | 60.0 | 5.0 | 37.0 | 1.0 | 99.0 | 1.0 |


Statistics for categorical columns

In [80]:
df_train.describe(include = 'object')

| | department | region | education | gender | recruitment_channel |
|---|---|---|---|---|---|
| count | 54808 | 54808 | 52399 | 54808 | 54808 |
| unique | 9 | 34 | 3 | 2 | 3 |
| top | Sales & Marketing | region_2 | Bachelor's | m | other |
| freq | 16840 | 12343 | 36669 | 38496 | 30446 |


In [81]:
# values in Departments
df_train['department'].value_counts()

Sales & Marketing    16840
Operations           11348
Procurement           7138
Technology            7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

Statistics of the **target variable**

In [82]:
df_train.is_promoted.value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64
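
The counts above show a strongly imbalanced target. A small sketch of what that imbalance means in numbers, using only the two counts from `value_counts()` (the variable names are illustrative):

```python
# Class counts copied from df_train.is_promoted.value_counts() above.
counts = {0: 50140, 1: 4668}
total = sum(counts.values())

# Fraction of employees who were promoted (the positive class).
promoted_rate = counts[1] / total

# Accuracy of a model that always predicts the majority class (0).
majority_baseline_accuracy = max(counts.values()) / total

# Number of non-promoted employees per promoted employee.
imbalance_ratio = counts[0] / counts[1]

print(f"promoted rate:     {promoted_rate:.4f}")
print(f"majority baseline: {majority_baseline_accuracy:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
```

The promoted rate agrees with the mean of `is_promoted` in the `describe()` output (0.08517). Because always predicting "not promoted" is already right for roughly 91% of rows, accuracy alone is a weak yardstick for this dataset; recall, precision, and F1 are more informative.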

In [95]:
px.histogram(df_train,'is_promoted', color='is_promoted')

In [84]:
px.histogram(df_train,'is_promoted',facet_col='gender', color='is_promoted')

EDA with Pandas Profiling

In [107]:
profile_df = ProfileReport(df_train)
profile_df.to_file("eda_profile_report.html")

Feature relationship

In [96]:
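# correlate numeric columns only; object (string) columns cannot be correlated directly
px.imshow(df_train.select_dtypes(include='number').corr(), text_auto=True, title='Correlation Between the Variables in the Model', height=1000)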

## 2. Setup Experiment

In [86]:
setup(df_train, target = 'is_promoted')

| | Description | Value |
|---|---|---|
| 0 | session_id | 1974 |
| 1 | Target | is_promoted |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (54808, 13) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 4 |
| 7 | Categorical Features | 8 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |



## 3. Compare Models

In [87]:
best = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.9428 | 0.8109 | 0.3356 | 0.9416 | 0.4939 | 0.4708 | 0.5426 | 1.457 |
| gbc | Gradient Boosting Classifier | 0.9416 | 0.8125 | 0.3156 | 0.9548 | 0.4734 | 0.4508 | 0.5299 | 1.946 |
| lda | Linear Discriminant Analysis | 0.9389 | 0.7759 | 0.3297 | 0.8451 | 0.4732 | 0.4474 | 0.5044 | 0.46 |
| rf | Random Forest Classifier | 0.9334 | 0.7811 | 0.2263 | 0.9045 | 0.3609 | 0.3389 | 0.4324 | 1.457 |
| ridge | Ridge Classifier | 0.9288 | 0.0 | 0.1467 | 1.0 | 0.2554 | 0.2392 | 0.3681 | 0.048 |
| ada | Ada Boost Classifier | 0.928 | 0.7907 | 0.1751 | 0.8281 | 0.2885 | 0.2672 | 0.3602 | 0.484 |
| et | Extra Trees Classifier | 0.9249 | 0.7651 | 0.2248 | 0.6436 | 0.3324 | 0.3025 | 0.3505 | 2.131 |
| nb | Naive Bayes | 0.9172 | 0.6995 | 0.0081 | 1.0 | 0.0161 | 0.0148 | 0.0844 | 0.069 |
| lr | Logistic Regression | 0.9165 | 0.5613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.004 |
| dummy | Dummy Classifier | 0.9165 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.42 |
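
All models in the leaderboard score above 0.91 accuracy, including the dummy classifier, so the F1 column is the more telling one. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall for three rows. The helper `f1_from` and the `reported` dictionary are illustrative names, not PyCaret API. The results land close to, but not exactly on, the table's F1 values, presumably because the table averages per-fold scores rather than computing F1 from averaged precision and recall.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the leaderboard above.
reported = {
    "lightgbm": (0.9416, 0.3356),
    "ridge": (1.0, 0.1467),
    "dummy": (0.0, 0.0),
}

f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, score in f1.items():
    print(f"{name}: F1 ~ {score:.4f}")
```

Even the best model recovers only about a third of the promoted employees (recall 0.3356), which is why its F1 stays below 0.5 despite 0.94 accuracy.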


## 4. Analyze Model

In [None]:
evaluate_model(best)


# Sklearn ML modeling

In [106]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn import metrics