# Exercise: Predict Employee Resignation using Scikit-Learn Pipelines

In these exercises, we will predict which employees will quit their jobs from real-world employee records (education, city, age, payment tier, and similar attributes). We will use pipelines to simplify data preprocessing, model training, and fine-tuning.

## Exercise 2: Model Training using Pipelines

The second exercise deals with training classifiers using scikit-learn pipelines. Your tasks are the following:

- Define a pipeline that includes data preprocessing and model training
- Evaluate the pipeline

## 1. Data Analysis

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
# load the data
data = pd.read_csv("../../data/Employee.csv")

In [3]:
# separate features from labels
X = data.drop('LeaveOrNot', axis=1)
y = data['LeaveOrNot'].copy()

print('Features:', X.head())
print('Labels:', y.head())

Features:    Education  JoiningYear       City  PaymentTier   Age  Gender EverBenched  \
0  Bachelors       2017.0  Bangalore          3.0  34.0    Male          No   
1  Bachelors       2013.0       Pune          1.0  28.0  Female          No   
2  Bachelors       2014.0  New Delhi          3.0  38.0  Female          No   
3    Masters       2016.0  Bangalore          3.0  27.0    Male          No   
4    Masters       2017.0       Pune          3.0  24.0    Male         Yes   

   ExperienceInCurrentDomain  
0                          0  
1                          3  
2                          2  
3                          5  
4                          2  
Labels: 0    0
1    1
2    0
3    1
4    1
Name: LeaveOrNot, dtype: int64


## 2. Data Preprocessing using Pipelines

In [4]:
# split data into numerical and categorical features
num_features = X.select_dtypes(exclude=['object']).columns
print('Numerical features:', num_features)
cat_features = X.select_dtypes(include=['object']).columns
print('Categorical features:', cat_features)

Numerical features: Index(['JoiningYear', 'PaymentTier', 'Age', 'ExperienceInCurrentDomain'], dtype='object')
Categorical features: Index(['Education', 'City', 'Gender', 'EverBenched'], dtype='object')


In [5]:
# split data into training and test sets
# (best practice: split before preprocessing, so imputers and scalers are fitted on training data only)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [6]:
# define pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler())
    ])

In [7]:
# show the pipeline diagram
num_pipeline
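
To make the behaviour of the numerical pipeline concrete, here is a minimal sketch on a made-up single column (the `toy` values and the `toy_num_pipeline` name are illustrative assumptions, not part of the employee dataset). It shows that the missing value is replaced by the training mean before scaling, and that values outside the training range can land outside [0, 1] when transformed later.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy column with one missing value
toy = np.array([[10.0], [20.0], [np.nan], [30.0]])

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 10, 20, 30
    ("scaler", MinMaxScaler()),                   # rescale using the fitted min and max
])

# fit on the toy "training" column and transform it
toy_scaled = toy_num_pipeline.fit_transform(toy)
print(toy_scaled.ravel())

# a new value above the fitted maximum is not clipped by default
toy_new = toy_num_pipeline.transform(np.array([[40.0]]))
print(toy_new.ravel())
```

The second call only uses statistics learned during `fit_transform`, which is the same behaviour the exercise relies on when the test set is transformed.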

In [8]:
# define pipeline for categorical features
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cat_pipeline = Pipeline([
        # map each category to an integer code; categories unseen during fit become -1
        ("ordinal_encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
        # fill missing codes with the most frequent code of the column
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # one-hot encode the codes; codes unseen during fit yield all-zero rows
        ("cat_encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
    ])



In [9]:
# show the pipeline diagram
cat_pipeline

In [10]:
# combine numerical and categorical pipelines
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([
        ("num", num_pipeline, num_features),
        ("cat", cat_pipeline, cat_features),
    ])
preprocessing

In [11]:
# apply the pipeline to the data
X_train_transformed = preprocessing.fit_transform(X_train)

# convert back to pandas dataframe
X_train_transformed = pd.DataFrame(
    X_train_transformed, columns=preprocessing.get_feature_names_out(),
    index=X_train.index)

print('Features after transformation:', X_train_transformed.head())
print('Features shape after transformation:', X_train_transformed.shape)

Features after transformation:       num__JoiningYear  num__PaymentTier  num__Age  \
2850          0.166667               1.0  0.421053   
589           0.000000               1.0  0.157895   
2086          0.833333               0.5  0.368421   
445           0.000000               1.0  0.105263   
3654          0.833333               0.5  0.684211   

      num__ExperienceInCurrentDomain  cat__Education_0.0  cat__Education_1.0  \
2850                        0.000000                 0.0                 1.0   
589                         0.428571                 1.0                 0.0   
2086                        0.285714                 0.0                 1.0   
445                         0.285714                 0.0                 1.0   
3654                        0.285714                 0.0                 1.0   

      cat__Education_2.0  cat__City_0.0  cat__City_1.0  cat__City_2.0  \
2850                 0.0            0.0            1.0            0.0   
589              

## 3. Training Classifiers with Pipelines

In [12]:
# instantiate a SVM classifier
from sklearn.svm import SVC
model_svm = SVC(random_state=42)

# fit the model (use the values attribute to pass a numpy array instead of a pandas dataframe)
model_svm.fit(X_train_transformed.values, y_train.values)


In [13]:
# we use accuracy as the evaluation metric
from sklearn.metrics import accuracy_score

# don't forget to apply the pipeline to the test data
X_test_transformed = preprocessing.transform(X_test)

# make predictions and evaluate the model
y_pred_svm = model_svm.predict(X_test_transformed)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print('Accuracy:', accuracy_svm)

Accuracy: 0.8066595059076263


**TODO**: Replace the code above by training a support vector machine classifier with a single pipeline that also includes all preprocessing steps!

In [14]:
# solution: chain the preprocessing and the SVM classifier in a single pipeline
from sklearn.pipeline import make_pipeline
model_pipeline = make_pipeline(preprocessing, SVC(random_state=42))
model_pipeline.fit(X_train, y_train)
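
`make_pipeline` names each step automatically after its lowercased class name, which matters later when parameters are tuned. A minimal sketch, assuming a simplified two-step pipeline (`toy_pipeline` is an illustrative name and the value `C=0.5` is arbitrary):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# simplified stand-in for the full preprocessing + SVM pipeline
toy_pipeline = make_pipeline(MinMaxScaler(), SVC(random_state=42))

# step names are derived from the class names
step_names = list(toy_pipeline.named_steps)
print(step_names)

# parameters of a step are addressed as <step name>__<parameter name>
toy_pipeline.set_params(svc__C=0.5)
print(toy_pipeline.named_steps["svc"].C)
```

The same `<step>__<parameter>` naming is what a grid search over a pipeline would expect in its parameter grid.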

**TODO**: Evaluate the pipeline's accuracy on the test data - make sure the result is the same as the accuracy without using a pipeline!

In [15]:
# solution: the pipeline applies the fitted preprocessing to the raw test data before predicting
y_pred_pipeline = model_pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
print("the accuracy_pipeline is:", accuracy_pipeline)

the accuracy_pipeline is: 0.8066595059076263
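
The two accuracies match because the pipeline performs the same two steps as the manual code: it fits the preprocessing on the training data, then applies the fitted preprocessing before predicting. The sketch below checks this on synthetic data (the `make_classification` data and a plain `MinMaxScaler` are stand-in assumptions for the employee data and the full `preprocessing` transformer).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# synthetic stand-in data (an assumption, not the employee dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# manual route: fit the scaler on training data, reuse it on test data
scaler = MinMaxScaler()
svm_manual = SVC(random_state=42).fit(scaler.fit_transform(X_tr), y_tr)
pred_manual = svm_manual.predict(scaler.transform(X_te))

# pipeline route: the same steps, handled internally
toy_model = make_pipeline(MinMaxScaler(), SVC(random_state=42)).fit(X_tr, y_tr)
pred_pipeline = toy_model.predict(X_te)

# both routes should yield identical predictions and therefore identical accuracy
same_predictions = np.array_equal(pred_manual, pred_pipeline)
same_accuracy = accuracy_score(y_te, pred_manual) == accuracy_score(y_te, pred_pipeline)
print(same_predictions, same_accuracy)
```

The practical advantage of the pipeline route is that the test-time `transform` call cannot be forgotten or accidentally replaced by `fit_transform`.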
