# Exercise: Predict Employee Resignation using Scikit-Learn Pipelines

In these exercises, we will predict which employees will quit their jobs based on a variety of real-world data. We will use pipelines to simplify data preprocessing, modelling and fine-tuning.

## Exercise 3: Model Fine-Tuning using Grid Search and Cross Validation

In this third exercise, you will practice fine-tuning a model using grid search and cross validation. Your tasks are:

- Use grid search to optimize the whole preprocessing and training pipeline
- Evaluate your best model

## 1. Data Analysis

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
# load the data
data = pd.read_csv("../../data/Employee.csv")

In [3]:
# separate features from labels
X = data.drop('LeaveOrNot', axis=1)
y = data['LeaveOrNot'].copy()

print('Features:', X.head())
print('Labels:', y.head())

Features:    Education  JoiningYear       City  PaymentTier   Age  Gender EverBenched  \
0  Bachelors       2017.0  Bangalore          3.0  34.0    Male          No   
1  Bachelors       2013.0       Pune          1.0  28.0  Female          No   
2  Bachelors       2014.0  New Delhi          3.0  38.0  Female          No   
3    Masters       2016.0  Bangalore          3.0  27.0    Male          No   
4    Masters       2017.0       Pune          3.0  24.0    Male         Yes   

   ExperienceInCurrentDomain  
0                          0  
1                          3  
2                          2  
3                          5  
4                          2  
Labels: 0    0
1    1
2    0
3    1
4    1
Name: LeaveOrNot, dtype: int64


## 2. Data Preprocessing using Pipelines

In [4]:
# split data into numerical and categorical features
num_features = X.select_dtypes(exclude=['object']).columns
print('Numerical features:', num_features)
cat_features = X.select_dtypes(include=['object']).columns
print('Categorical features:', cat_features)

Numerical features: Index(['JoiningYear', 'PaymentTier', 'Age', 'ExperienceInCurrentDomain'], dtype='object')
Categorical features: Index(['Education', 'City', 'Gender', 'EverBenched'], dtype='object')


In [5]:
# split data into training and test sets (best practice to split before data preprocessing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [6]:
# define pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler())
    ])

In [7]:
# show the pipeline diagram
num_pipeline

In [8]:
# define pipeline for categorical features
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
        ("ordinal_encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),    
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
    ])



In [9]:
# show the pipeline diagram
cat_pipeline

In [10]:
# combine numerical and categorical pipelines
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([
        ("num", num_pipeline, num_features),
        ("cat", cat_pipeline, cat_features),
    ])
preprocessing

In [11]:
# apply the pipeline to the data
X_train_transformed = preprocessing.fit_transform(X_train)

# convert back to pandas dataframe
X_train_transformed = pd.DataFrame(
    X_train_transformed, columns=preprocessing.get_feature_names_out(),
    index=X_train.index)

print('Features after transformation:', X_train_transformed.head())
print('Features shape after transformation:', X_train_transformed.shape)

Features after transformation:       num__JoiningYear  num__PaymentTier  num__Age  \
2850          0.166667               1.0  0.421053   
589           0.000000               1.0  0.157895   
2086          0.833333               0.5  0.368421   
445           0.000000               1.0  0.105263   
3654          0.833333               0.5  0.684211   

      num__ExperienceInCurrentDomain  cat__Education_0.0  cat__Education_1.0  \
2850                        0.000000                 0.0                 1.0   
589                         0.428571                 1.0                 0.0   
2086                        0.285714                 0.0                 1.0   
445                         0.285714                 0.0                 1.0   
3654                        0.285714                 0.0                 1.0   

      cat__Education_2.0  cat__City_0.0  cat__City_1.0  cat__City_2.0  \
2850                 0.0            0.0            1.0            0.0   
589              

## 3. Training Classifiers with Pipelines

In [12]:
# train a random forest classifier using a pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
model_forest_pipeline = make_pipeline(preprocessing, RandomForestClassifier(random_state=42))
model_forest_pipeline.fit(X_train, y_train)

In [13]:
# evaluate the model
from sklearn.metrics import accuracy_score
y_pred_forest_pipeline = model_forest_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)
print('Accuracy:', accuracy_forest_pipeline)

Accuracy: 0.8464017185821697


## 4. Fine-Tuning Models using Grid Search and Cross Validation

**TODO**: Fine-tune your classification model using grid search over the whole random forest pipeline. Use cross validation with 5 folds. Try to optimize at the same time:

- the `RandomForestClassifier`, by tuning the number of estimators (`n_estimators`) and the number of features considered at each split (`max_features`), and
- the `MinMaxScaler`, by scaling either to (0, 1) or to (-1, 1) (`feature_range`).

If you like, try to optimize other parameters as well! A dictionary of all available parameters can be accessed like this:

In [14]:
# print all parameters that can be tuned using grid search
model_forest_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'randomforestclassifier', 'columntransformer__force_int_remainder_cols', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__num', 'columntransformer__cat', 'columntransformer__num__memory', 'columntransformer__num__steps', 'columntransformer__num__verbose', 'columntransformer__num__imputer', 'columntransformer__num__scaler', 'columntransformer__num__imputer__add_indicator', 'columntransformer__num__imputer__copy', 'columntransformer__num__imputer__fill_value', 'columntransformer__num__imputer__keep_empty_features', 'columntransformer__num__imputer__missing_values', 'columntransformer__num__imputer__strategy', 'columntransformer__num__scaler__clip', 'columntransformer__num__scaler__copy', 'columntransforme
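
The keys above follow a `<step>__<substep>__<parameter>` pattern: `make_pipeline` names each step after its lowercased class name, while the steps of a `ColumnTransformer` or `Pipeline` contribute the names they were given. The following is a minimal sketch of that naming scheme. It uses a reduced stand-in pipeline (numerical branch only, columns addressed by position), which is an assumption made for illustration and not the notebook's full pipeline.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the notebook's pipeline: numerical branch only,
# applied to four columns addressed by position (assumption for this sketch).
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
preprocessing = ColumnTransformer([("num", num_pipeline, [0, 1, 2, 3])])
pipeline = make_pipeline(preprocessing, RandomForestClassifier(random_state=42))

# List only the tunable parameters of the scaler step.
scaler_keys = [key for key in pipeline.get_params() if "__scaler__" in key]
print(scaler_keys)

# The same key can be used to change the parameter directly.
pipeline.set_params(columntransformer__num__scaler__feature_range=(-1, 1))
new_range = pipeline.get_params()["columntransformer__num__scaler__feature_range"]
print(new_range)
```

The same keys are what a grid search expects as dictionary keys in its parameter grid.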

In [21]:
# TODO: YOUR CODE GOES HERE
from sklearn.model_selection import GridSearchCV

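# grid over forest size and scaler range: 2 x 2 = 4 candidates
param_grid = [{'randomforestclassifier__n_estimators': [100, 200],
               'columntransformer__num__scaler__feature_range': [(0, 1), (-1, 1)]}]

# 5-fold cross validation over the whole pipeline; verbose=3 prints the score of every fit
grid_search = GridSearchCV(model_forest_pipeline, param_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)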

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=100;, score=0.824 total time=   0.3s
[CV 2/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=100;, score=0.823 total time=   0.2s
[CV 3/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=100;, score=0.815 total time=   0.2s
[CV 4/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=100;, score=0.829 total time=   0.2s
[CV 5/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=100;, score=0.827 total time=   0.2s
[CV 1/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_estimators=200;, score=0.826 total time=   0.5s
[CV 2/5] END columntransformer__num__scaler__feature_range=(0, 1), randomforestclassifier__n_e
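
The task also asks for tuning the number of features used for a split, which in `RandomForestClassifier` is the `max_features` parameter. The sketch below shows one way the grid could be extended with it. It runs on synthetic data from `make_classification` with a reduced stand-in pipeline, and the candidate values are illustrative assumptions rather than tuned choices for the employee data.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the employee dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Reduced stand-in pipeline: numerical branch only, columns addressed by position.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
preprocessing = ColumnTransformer([("num", num_pipeline, [0, 1, 2, 3])])
pipeline = make_pipeline(preprocessing, RandomForestClassifier(random_state=42))

# Three parameters with two values each: 2 x 2 x 2 = 8 candidates.
param_grid = {
    "randomforestclassifier__n_estimators": [10, 20],
    "randomforestclassifier__max_features": ["sqrt", None],
    "columntransformer__num__scaler__feature_range": [(0, 1), (-1, 1)],
}

grid_search_demo = GridSearchCV(pipeline, param_grid, cv=5)
grid_search_demo.fit(X_demo, y_demo)

n_candidates = len(grid_search_demo.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best parameters:", grid_search_demo.best_params_)
```

In the notebook, the same idea would amount to adding a `'randomforestclassifier__max_features'` entry to `param_grid`. Note that every additional parameter multiplies the number of fits.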

**TODO**: Retrieve the best model from the grid search and calculate the final accuracy on the test data! Which optimal parameters did you find?

In [22]:
# TODO: YOUR CODE GOES HERE
grid_search.best_params_

{'columntransformer__num__scaler__feature_range': (0, 1),
 'randomforestclassifier__n_estimators': 200}

In [26]:
model_best = grid_search.best_estimator_


In [29]:
# best_estimator_ was already refit on the full training set (refit=True is the default),
# so it can be used for prediction without fitting it again
y_pred_best = model_best.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print("previous accuracy:", accuracy_forest_pipeline)
print("Best accuracy:", accuracy_best)

previous accuracy: 0.8464017185821697
Best accuracy: 0.8528464017185822
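
Accuracy alone does not show how the errors are distributed between the two classes. As a possible next step for evaluating the best model, the sketch below prints a confusion matrix and per-class precision and recall. It uses synthetic data and a plain random forest as a stand-in for the tuned pipeline, so all names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the employee dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Plain random forest standing in for the tuned pipeline.
model_demo = RandomForestClassifier(n_estimators=50, random_state=42)
model_demo.fit(X_tr, y_tr)
y_pred_demo = model_demo.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred_demo)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(y_te, y_pred_demo))
```

In the notebook, the equivalent calls would be `confusion_matrix(y_test, y_pred_best)` and `classification_report(y_test, y_pred_best)`.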
