# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini Project Notebook: Employee Attrition Prediction

## **Note:** This notebook is part of an in-house Kaggle competition

## Problem Statement

To predict employee attrition using CatBoost and XGBoost.

## Learning Objectives

At the end of the experiment, you will be able to

* explore the employee attrition dataset
* apply CatBoost and XGBoost on the dataset
* tune the model hyperparameters to improve accuracy
* evaluate the model using suitable metrics


## Introduction

Employee attrition is the gradual reduction in employee numbers: the size of the workforce diminishes over time because employees leave faster than they are hired. It happens when employees retire, resign, or simply aren't replaced.
Although employee attrition can be company-wide, it may also be confined to specific parts of a business.

Employee attrition can happen for several reasons. These include unhappiness about employee benefits or the pay structure, a lack of employee development opportunities, and even poor conditions in the workplace.

To know more about the factors that lead to employee attrition, refer [here](https://www.betterup.com/blog/employee-attrition#:~:text=Employee%20attrition%20is%20the%20gradual,or%20simply%20aren't%20replaced).


**Gradient Boosted Decision Trees**

* Gradient boosted decision trees (GBDTs) are one of the most important machine learning models.

* GBDTs originate from AdaBoost, an algorithm that ensembles weak learners and uses the majority vote, weighted by their individual accuracy, to solve binary classification problems. The weak learners in this case are decision trees with a single split, called decision stumps.

* Some of the widely used gradient boosted decision tree libraries are XGBoost, CatBoost and LightGBM.
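
The AdaBoost idea described above (many decision stumps combined by a weighted vote) can be sketched with scikit-learn. This is only an illustration on synthetic data produced by `make_classification`, not on the attrition dataset; the sample size and the number of stumps are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification data (synthetic, not the attrition dataset).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# scikit-learn's AdaBoostClassifier uses a depth-1 decision tree (a stump)
# as its default weak learner, so no base estimator is passed here.
stump_ensemble = AdaBoostClassifier(n_estimators=50, random_state=0)
stump_ensemble.fit(X_train, y_train)

adaboost_accuracy = stump_ensemble.score(X_test, y_test)
print(f"AdaBoost (up to 50 stumps) test accuracy: {adaboost_accuracy:.3f}")
```

Each fitted weak learner is available in `stump_ensemble.estimators_`, and its vote weight in `stump_ensemble.estimator_weights_`.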

## Dataset

The dataset used for this mini-project is the [HR Employee Attrition dataset](https://data.world/aaizemberg/hr-employee-attrition). It was synthetically created by IBM data scientists and has 35 features and 1470 records.

There are numerical features such as:

* Age
* DistanceFromHome
* EmployeeNumber
* PerformanceRating

There are several categorical features such as:
* JobRole
* EducationField
* Department
* BusinessTravel

The dependent (target) feature is 'attrition', which takes the values Yes/No.

### **Kaggle Competition**

Please refer to the link for viewing the
[Kaggle Competition Document](https://drive.google.com/file/d/1c7PrbKrURFcnEB61dSoS9cBnUUVhhj-l/view?usp=drive_link) and join the Kaggle Competition using the hyperlink given in this document under '*Kaggle* Competition site'.

## Grading = 10 Points

In [1]:
!uv add ipykernel numpy matplotlib scikit-learn seaborn catboost scipy==1.12 xgboost lightgbm --upgrade


[2K[2mResolved [1m58 packages[0m [2min 576ms[0m[0m                                        [0m
[2mAudited [1m52 packages[0m [2min 0.59ms[0m[0m


In [2]:
# @title Download the data
from utility import download_if_missing

download_if_missing(
    filename="hr_employee_attrition_train.csv",
    url="https://cdn.iisc.talentsprint.com/CDS/MiniProjects/hr_employee_attrition_train.csv",
)

# !wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/hr_employee_attrition_train.csv
print("Data Downloaded Successfully!!")


Data Downloaded Successfully!!


### Install CatBoost

In [3]:
# !pip -qq install catboost
# !uv add catboost numpy --upgrade


### Import Required Packages

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier, metrics
import warnings

warnings.filterwarnings("ignore")
plt.style.use("fivethirtyeight")
pd.set_option("display.max_columns", 100)
%matplotlib inline


## Load the Dataset

**Exercise 1: Read the dataset [0.5 Mark]**

**Hint:** pd.read_csv()

In [5]:
# read the dataset
# YOUR CODE HERE

df = pd.read_csv("hr_employee_attrition_train.csv")


In [6]:
# Check the shape of dataframe.
# YOUR CODE HERE

df.shape


(1170, 35)

Because this notebook is run as a competition, there can be more than one file to read. This dataset has one file for training the model; there can be other files, such as one containing the test features and another containing the true labels.

## Data Exploration

- Check for missing values
- Check for consistent data type across a feature
- Check for outliers or inconsistencies in data columns
- Check for correlated features
- Do we have a target label imbalance?
- How are our independent variables distributed relative to our target label?
- Are there features that have strong linear or monotonic relationships? Correlation heatmaps make it easy to identify possible collinearity.

**Exercise 2: Create a `List` of numerical and categorical columns. Display a statistical description of the dataset. Remove missing values (if any) [0.5 Mark]**

**Hint:** Use `for` to iterate through each column.

In [8]:
# Check the name and the data types of columns.
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1170 entries, 0 to 1169
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   age                       1170 non-null   int64 
 1   businesstravel            1170 non-null   object
 2   dailyrate                 1170 non-null   int64 
 3   department                1170 non-null   object
 4   distancefromhome          1170 non-null   int64 
 5   education                 1170 non-null   int64 
 6   educationfield            1170 non-null   object
 7   employeecount             1170 non-null   int64 
 8   employeenumber            1170 non-null   int64 
 9   environmentsatisfaction   1170 non-null   int64 
 10  gender                    1170 non-null   object
 11  hourlyrate                1170 non-null   int64 
 12  jobinvolvement            1170 non-null   int64 
 13  joblevel                  1170 non-null   int64 
 14  jobrole                 

In [10]:
# check a random sample from the dataframe
df.sample(10)


Unnamed: 0,age,businesstravel,dailyrate,department,distancefromhome,education,educationfield,employeecount,employeenumber,environmentsatisfaction,gender,hourlyrate,jobinvolvement,joblevel,jobrole,jobsatisfaction,maritalstatus,monthlyincome,monthlyrate,numcompaniesworked,over18,overtime,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager,attrition
301,33,Travel_Frequently,515,Research & Development,1,2,Life Sciences,1,73,1,Female,98,3,3,Research Director,4,Single,13458,15146,1,Y,Yes,12,3,3,80,0,15,1,3,15,14,8,12,No
951,40,Travel_Rarely,555,Research & Development,2,3,Medical,1,521,2,Female,78,2,2,Laboratory Technician,3,Married,3448,13436,6,Y,No,22,4,2,80,1,20,3,3,1,0,0,0,No
1037,31,Travel_Rarely,196,Sales,29,4,Marketing,1,1784,1,Female,91,2,2,Sales Executive,4,Married,5468,13402,1,Y,No,14,3,1,80,2,13,3,3,12,7,5,7,No
800,24,Travel_Frequently,1287,Research & Development,7,3,Life Sciences,1,647,1,Female,55,3,1,Laboratory Technician,3,Married,2886,14168,1,Y,Yes,16,3,4,80,1,6,4,3,6,3,1,2,Yes
835,50,Travel_Rarely,264,Sales,9,3,Marketing,1,1591,3,Male,59,3,5,Manager,3,Married,19331,19519,4,Y,Yes,16,3,3,80,1,27,2,3,1,0,0,0,No
240,35,Travel_Rarely,1224,Sales,7,4,Life Sciences,1,1962,3,Female,55,3,2,Sales Executive,4,Married,5204,13586,1,Y,Yes,11,3,4,80,0,10,2,3,10,8,0,9,No
637,50,Travel_Rarely,939,Research & Development,24,3,Life Sciences,1,1005,4,Male,95,3,4,Manufacturing Director,3,Married,13973,4161,3,Y,Yes,18,3,4,80,1,22,2,3,12,11,1,5,No
302,42,Non-Travel,179,Human Resources,2,5,Medical,1,1231,4,Male,79,4,2,Human Resources,1,Married,6272,12858,7,Y,No,16,3,1,80,1,10,3,4,4,3,0,3,No
1110,49,Travel_Rarely,527,Research & Development,8,2,Other,1,944,1,Female,51,3,3,Laboratory Technician,2,Married,7403,22477,4,Y,No,11,3,3,80,1,29,3,2,26,9,1,7,No
587,59,Travel_Frequently,1225,Sales,1,1,Life Sciences,1,91,1,Female,57,2,2,Sales Executive,3,Single,5473,24668,7,Y,No,11,3,4,80,0,20,2,2,4,3,1,3,No


In [None]:
# unique values in each column
# count of unique values in each column
# separate the columns as categorical and numerical


In [12]:
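# First pass: split the columns by dtype (object vs. numeric).
categorical_columns = df.select_dtypes(include=["object"]).columns
numerical_columns = df.select_dtypes(exclude=["object"]).columns

# Manual override: integer-coded ordinal columns such as education are treated as categorical.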
categorical_columns = [
    "businesstravel",
    "department",
    "education",
    "educationfield",
    "environmentsatisfaction",
    "gender",
    "jobinvolvement",
    "joblevel",
    "jobrole",
    "jobsatisfaction",
    "maritalstatus",
    "over18",
    "overtime",
    "performancerating",
    "relationshipsatisfaction",
    "stockoptionlevel",
    "worklifebalance",
    "attrition",
]
numerical_columns = [
    "age",
    "dailyrate",
    "distancefromhome",
    "hourlyrate",
    "monthlyincome",
    "monthlyrate",
    "numcompaniesworked",
    "percentsalaryhike",
    "standardhours",
    "totalworkingyears",
    "trainingtimeslastyear",
    "yearsatcompany",
    "yearsincurrentrole",
    "yearssincelastpromotion",
    "yearswithcurrmanager",
]
# Column names in this file have no underscores (see df.info() above).
useless_columns = ["employeecount", "employeenumber"]

print(f"Number of Categorical Columns: {len(categorical_columns)}")
print(f"Number of Numerical Columns: {len(numerical_columns)}")
print(f"Number of Useless Columns: {len(useless_columns)}")
print(
    f"Total Columns: {len(categorical_columns) + len(numerical_columns) + len(useless_columns)}"
)


Number of Categorical Columns: 18
Number of Numerical Columns: 15
Number of Useless Columns: 2
Total Columns: 35


In [14]:
for col in categorical_columns:
    print(f"Unique values in {col}: {df[col].nunique()}")
    print(f"Count of unique values in {col}: {df[col].value_counts()}")


Unique values in businesstravel: 3
Count of unique values in businesstravel: businesstravel
Travel_Rarely        831
Travel_Frequently    221
Non-Travel           118
Name: count, dtype: int64
Unique values in department: 3
Count of unique values in department: department
Research & Development    767
Sales                     356
Human Resources            47
Name: count, dtype: int64
Unique values in education: 5
Count of unique values in education: education
3    473
4    305
2    223
1    129
5     40
Name: count, dtype: int64
Unique values in educationfield: 6
Count of unique values in educationfield: educationfield
Life Sciences       477
Medical             379
Marketing           126
Technical Degree    103
Other                64
Human Resources      21
Name: count, dtype: int64
Unique values in environmentsatisfaction: 4
Count of unique values in environmentsatisfaction: environmentsatisfaction
3    364
4    356
1    229
2    221
Name: count, dtype: int64
Unique values in gen

In [7]:
# YOUR CODE HERE
df.describe()


Unnamed: 0,age,dailyrate,distancefromhome,education,employeecount,employeenumber,environmentsatisfaction,hourlyrate,jobinvolvement,joblevel,jobsatisfaction,monthlyincome,monthlyrate,numcompaniesworked,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager
count,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0
mean,36.85812,797.822222,9.262393,2.917949,1.0,1021.113675,2.723932,65.501709,2.747009,2.049573,2.723077,6456.937607,14411.584615,2.750427,15.204274,3.153846,2.739316,80.0,0.78547,11.315385,2.842735,2.745299,6.925641,4.17094,2.138462,4.061538
std,9.183448,403.592661,8.150868,1.011534,0.0,600.548289,1.095847,20.310054,0.702625,1.096352,1.103799,4660.476506,7078.167743,2.521952,3.688952,0.360955,1.075225,0.0,0.847253,7.879363,1.306385,0.718864,6.232076,3.62788,3.201738,3.58635
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,0.0,11.0,3.0,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,461.0,2.0,2.0,1.0,496.25,2.0,48.0,2.0,1.0,2.0,2914.75,8387.5,1.0,12.0,3.0,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,35.0,798.0,7.0,3.0,1.0,1008.5,3.0,66.0,3.0,2.0,3.0,4903.5,14536.0,2.0,14.0,3.0,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1146.75,14.0,4.0,1.0,1548.5,4.0,83.0,3.0,3.0,4.0,8215.25,20456.25,4.0,18.0,3.0,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,2.75,7.0
max,60.0,1499.0,29.0,5.0,1.0,2065.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,9.0,25.0,4.0,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


First, we want to get a sense of our data:
- Which features have the most divergent distributions based on target class?
- Do we have a target label imbalance?
- How are our independent variables distributed relative to our target label?
- Are there features that have strong linear or monotonic relationships? Correlation heatmaps make it easy to identify possible collinearity.

### Check for outliers

**Exercise 3: Create a box plot to check for outliers [0.5 Mark]**

In [None]:
# Check for outliers
# YOUR CODE HERE


### Handling outliers

**Exercise 4: Use the 25th percentile as the lower bound and the 75th percentile as the upper bound to handle the outliers [0.5 Mark]**

In [None]:
# YOUR CODE HERE
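
One possible reading of this exercise is the common interquartile-range (Tukey) rule, where the quartiles define the fences rather than being the fences themselves. The sketch below assumes that reading; the helper name `clip_outliers_iqr`, the multiplier `k`, and the toy values are illustrative assumptions. Setting `k=0` clips directly at the 25th and 75th percentiles.

```python
import pandas as pd

# Toy column with one obvious extreme (synthetic values, not from the dataset).
toy = pd.DataFrame(
    {"monthlyincome": [2500, 3000, 3200, 4000, 4100, 5000, 5200, 6000, 19000, 100]}
)


def clip_outliers_iqr(frame, columns, k=1.5):
    """Clip each listed column to [Q1 - k*IQR, Q3 + k*IQR] and return a copy."""
    clipped = frame.copy()
    for col in columns:
        q1 = clipped[col].quantile(0.25)
        q3 = clipped[col].quantile(0.75)
        iqr = q3 - q1
        clipped[col] = clipped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return clipped


toy_clipped = clip_outliers_iqr(toy, ["monthlyincome"])
print(toy_clipped["monthlyincome"].describe())
```

Clipping keeps every row, which matters for a dataset of this size; dropping the outlying rows instead is an alternative design choice.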


In [None]:
# Recheck for outliers
# YOUR CODE HERE


### Target label imbalance

**Exercise 5: Check if there is an imbalance in target label [0.5 Mark]**

**Hint:** Use value_counts()

In [None]:
# Count of unique values in Attrition column
# YOUR CODE HERE


In [None]:
# Plot barplot to visualize balance/imbalance
# YOUR CODE HERE


If there is any imbalance in the dataset then a few techniques can be utilised (optional):
1. SMOTE
2. Cross Validation
3. Regularizing the model's parameters
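
As a rough illustration of the cross-validation option, the sketch below checks the class shares and then uses stratified folds so that each validation fold keeps a similar Yes/No ratio. The labels are synthetic, and the 16% share of "Yes" is an assumed value chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: roughly 16% "Yes" (an assumed ratio).
rng = np.random.default_rng(0)
labels = pd.Series(np.where(rng.random(1000) < 0.16, "Yes", "No"), name="attrition")

# Class shares; a strong skew towards one class indicates imbalance.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Stratified folds keep the class ratio roughly equal in every validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_yes_share = [
    (labels.iloc[val_idx] == "Yes").mean()
    for _, val_idx in skf.split(np.zeros(len(labels)), labels)
]
print([round(share, 3) for share in fold_yes_share])
```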

### Plot pairplot

**Exercise 6: Visualize the relationships between the predictor variables and the target variable using a pairplot [0.5 Mark]**

**Hint:** Use sns.pairplot

In [None]:
# Visualize a pairplot with relevant features
# YOUR CODE HERE


### Explore Correlation

- Plotting the Heatmap

**Exercise 7: Visualize the correlation among IBM employee attrition numerical features using a heatmap [0.5 Mark]**

In [None]:
# Visualize heatmap
# YOUR CODE HERE
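
Alongside the heatmap, it can help to list the most strongly correlated pairs explicitly. The sketch below does this on a small synthetic frame; the column names mirror the dataset, but the values are generated, and `yearswithcurrmanager` is deliberately built to track `yearsatcompany`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
years_at_company = rng.integers(0, 30, size=n)
toy_num = pd.DataFrame(
    {
        "yearsatcompany": years_at_company,
        # Built to follow yearsatcompany closely, so the pair shows up as correlated.
        "yearswithcurrmanager": np.clip(
            years_at_company - rng.integers(0, 4, size=n), 0, None
        ),
        "dailyrate": rng.integers(100, 1500, size=n),
    }
)

# Pearson (the default) measures linear association;
# method="spearman" would measure monotonic association instead.
corr = toy_num.corr()

# Keep only the upper triangle so each pair is listed once, then rank by |r|.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)

# In the notebook, the same matrix can be drawn with:
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```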


Comment on the observations made with the pairplot and heatmap

### Preparing the test feature space
* Remove outliers if any
* Handle the categorical feature if required
* Other processing steps can also be followed.

In [None]:
# YOUR CODE HERE


Optional:
Use `Hyperopt`, a hyperparameter tuning library, to identify the best set of parameters.

Refer to the Additional Notebook: CatBoost parameter tuning [CDS-B8 GDrive -> Module 3 -> Assignments -> July 27, 2024 -> Additional Notebook (ungraded) -> Addl_NB_Tuning_hyerparameters_using_Hyperopt]

In this notebook, data processing is done separately for each model, because different models may require data in different formats and therefore different processing steps.

If the processing steps are the same for all models, data processing can be done once.

## Apply CatBoost

CatBoost was released in 2017 by Yandex. According to Yandex's own benchmarks, it is faster at prediction, more accurate, and easier to use with categorical data than other GBDT libraries across a series of tasks. Additional capabilities of CatBoost include plotting feature interactions and object (row) importance.

[Here](https://catboost.ai/en/docs/) is the official documentation of CatBoost

### Data Processing for CatBoost

**Exercise 8: Data processing for CatBoost [1 Mark]**
* **Copy the dataframe that was created after removing the outliers**
* **Handle the categorical features if required**
* **Create target column and feature space**

**Hint:** Column containing the information on attrition will be the target column.

In [None]:
# Copy the data
# YOUR CODE HERE


In [None]:
# Target Column
# YOUR CODE HERE


In [None]:
# Feature Space
# YOUR CODE HERE
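
The three steps above might look like the following on a tiny hand-made frame. The rows are invented for illustration; in the notebook, the copy would be taken from the dataframe obtained after outlier handling. Dropping constant columns is an extra assumption, not something the exercise requires.

```python
import pandas as pd

# Tiny stand-in for the processed dataframe (invented rows).
toy_df = pd.DataFrame(
    {
        "age": [41, 29, 35, 52],
        "overtime": ["Yes", "No", "No", "Yes"],
        "department": ["Sales", "Research & Development", "Sales", "Human Resources"],
        "employeecount": [1, 1, 1, 1],
        "attrition": ["Yes", "No", "No", "Yes"],
    }
)

# Copy the data so the original frame stays untouched.
cb_df = toy_df.copy()

# Target column: map Yes/No to 1/0.
y_cb = cb_df["attrition"].map({"Yes": 1, "No": 0})

# Feature space: drop the target and any constant columns, which carry no signal.
constant_cols = [c for c in cb_df.columns if cb_df[c].nunique() == 1]
X_cb = cb_df.drop(columns=["attrition"] + constant_cols)

# CatBoost can consume string columns directly once they are declared categorical.
cat_features = [
    c for c in X_cb.columns if not pd.api.types.is_numeric_dtype(X_cb[c])
]
print(cat_features)
```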


### Model Definition

**Exercise 9: Define, train the model and display the results [2 Mark]**

**Hint:**
* Use CatBoostClassifier() to define the model with relevant parameters.
* Use `fit` to fit the data to the model. Refer [here](https://catboost.ai/en/docs/concepts/speed-up-training) to see some ways to speed up CatBoost training.
* Evaluate the model using roc_auc_score, accuracy_score, f1_score, predict methods or other relevant techniques.

In [None]:
# Create CatBoost model
# YOUR CODE HERE


In [None]:
# Model training
# YOUR CODE HERE


### Model performance

In [None]:
# Model performance on all sets
# YOUR CODE HERE


## Apply XGBoost

XGBoost is a workhorse gradient boosted decision tree algorithm. It has been around since 2014 and has become a dominant choice in Kaggle competitions and the wider data science community. XGBoost implements gradient boosting, in which new models are fit to the residuals of prior models and then added together, using a gradient descent algorithm to minimize the loss.

Read [here](https://xgboost.readthedocs.io/en/stable/parameter.html) on XGBoost parameters.

Refer [here](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) for the official documentation of XGBoost classifier.
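
The "fit new models to the residuals and add them together" idea can be sketched in a few lines. This is plain gradient boosting with squared-error loss on synthetic data, written with scikit-learn regression trees. It is not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization; the learning rate, tree depth and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped copy of the new tree to the running ensemble prediction.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}, after 50 rounds: {mse_history[-1]:.4f}")
```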

### Data Processing for XGBoost


**Exercise 10: Data Processing for XGBoost [1 Mark]**
* **Copy the dataframe after the outliers were removed.**
* **Handle the categorical features if required**
* **Create target column and feature space**

In [None]:
# Copy dataframe
# YOUR CODE HERE


**Hint:** Use pd.get_dummies

In [None]:
# Handling categorical features
# YOUR CODE HERE


In [None]:
# Concat the dummy variables to actual dataframe and remove initial categorical columns
# YOUR CODE HERE


When the dummy variables are created, the attrition column gets a new name (for example, `attrition_Yes`); rename it back to 'attrition'.

**Hint:** Use .rename

In [None]:
# Rename target column
# YOUR CODE HERE
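
The encode, concatenate and rename steps might look like this on a tiny invented frame. Using `drop_first=True` is an assumption made here so that each binary column yields a single indicator; without it, both `attrition_No` and `attrition_Yes` would be created.

```python
import pandas as pd

# Tiny stand-in for the processed dataframe (invented rows).
toy_df = pd.DataFrame(
    {
        "age": [41, 29, 35, 52],
        "overtime": ["Yes", "No", "No", "Yes"],
        "gender": ["Female", "Male", "Male", "Female"],
        "attrition": ["Yes", "No", "No", "Yes"],
    }
)

xgb_df = toy_df.copy()
cat_cols = ["overtime", "gender", "attrition"]

# drop_first=True keeps one indicator per binary column (e.g. attrition_Yes).
dummies = pd.get_dummies(xgb_df[cat_cols], drop_first=True, dtype=int)

# Concatenate the indicators and drop the original string columns.
xgb_df = pd.concat([xgb_df.drop(columns=cat_cols), dummies], axis=1)

# The encoded target is now called attrition_Yes; restore the original name.
xgb_df = xgb_df.rename(columns={"attrition_Yes": "attrition"})

X_xgb = xgb_df.drop(columns=["attrition"])
y_xgb = xgb_df["attrition"]
print(xgb_df.columns.tolist())
```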


In [None]:
# Feature Space
# YOUR CODE HERE

# Target label
# YOUR CODE HERE


### Model Definition

**Exercise 11: Define, train the model and display the results [2 Mark]**

**Hint:**
* Use XGBClassifier() to define the model with relevant parameters.
* Use `fit` to fit the data to the model.
* Evaluate the model using roc_auc_score, accuracy_score, f1_score, predict methods or other relevant techniques.

In [None]:
# Create XGBoost classifier model
# YOUR CODE HERE


In [None]:
# Model training
# YOUR CODE HERE


### Model Performance

In [None]:
# Model performance on all sets
# YOUR CODE HERE


## Apply LightGBM (Optional)

LightGBM is an open-source GBDT framework created by Microsoft as a fast and scalable alternative to XGBoost and GBM. By default, LightGBM trains a Gradient Boosted Decision Tree (GBDT), but it also supports random forests, Dropouts meet Multiple Additive Regression Trees (DART), and Gradient-based One-Side Sampling (GOSS).

To know more about LightGBM parameters, refer [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier).

### Feature Engineering for LightGBM

In [None]:
## Following the same procedure as followed in XGBoost

# Copy the dataframe
# YOUR CODE HERE

# Handling categorical features
# YOUR CODE HERE

# Concat the dummy variables to actual dataframe and remove initial categorical columns
# YOUR CODE HERE

# Rename target column
# YOUR CODE HERE

# Features Space
# YOUR CODE HERE

# Target Label
# YOUR CODE HERE


### Model Definition

**Hint:**
* Use LGBMClassifier() to define the model with relevant parameters.
* Use `fit` to fit the data to the model.
* Evaluate the model using roc_auc_score, accuracy_score, f1_score, predict methods or other relevant techniques.

In [None]:
# Create LightGBM classifier model
# YOUR CODE HERE


In [None]:
# Model training
# YOUR CODE HERE


### Model performance

In [None]:
# Model performance on all sets
# YOUR CODE HERE


## Results

**Exercise 12: Create a dataframe of XGBoost results and CatBoost results and display them [0.5 Mark]**

**Hint:** Use pd.DataFrame

In [None]:
# Create a dataframe for computed metrics for different models
# YOUR CODE HERE
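
One way to lay out such a comparison table is sketched below. The labels and predicted probabilities are made-up placeholder numbers, not real model output; in the notebook they would come from the validation split and the trained models. The 0.5 decision threshold is also an assumption.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder labels and predicted probabilities (made-up numbers).
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
probas = {
    "CatBoost": [0.1, 0.4, 0.8, 0.6, 0.2, 0.3, 0.1, 0.7],
    "XGBoost": [0.2, 0.3, 0.9, 0.4, 0.1, 0.6, 0.3, 0.2],
}

rows = []
for name, p in probas.items():
    pred = [int(v >= 0.5) for v in p]  # threshold probabilities at 0.5
    rows.append(
        {
            "model": name,
            "accuracy": accuracy_score(y_true, pred),
            "f1": f1_score(y_true, pred),
            "roc_auc": roc_auc_score(y_true, p),
        }
    )

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```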


Reference reading:
1. https://machinelearningmastery.com/xgboost-for-imbalanced-classification/

## Kaggle Prediction

Load data from Kaggle competition site

In [None]:
# From the given Kaggle competition link, load the dataset 'hr_employee_attrition_test.csv'
# YOUR CODE HERE


In [None]:
# From the dataset 'hr_employee_attrition_test.csv', drop the identifier columns ['id', 'employeenumber'] and the single-valued columns ['employeecount', 'over18']
# YOUR CODE HERE


In [None]:
# Handle categorical features
# YOUR CODE HERE


In [None]:
# Concat the dummy variables to actual dataframe and remove initial categorical columns
# YOUR CODE HERE
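
A pitfall at this step is that one-hot encoding the test file on its own can produce a different set of columns than the training features, for instance when a category never appears in the test split. The sketch below, on invented rows, aligns the test columns to the training columns with `reindex`; the frame names are hypothetical.

```python
import pandas as pd

# Invented training features and raw test rows.
train_X = pd.get_dummies(
    pd.DataFrame({"age": [30, 45, 28], "department": ["Sales", "HR", "R&D"]}),
    dtype=int,
)
test_raw = pd.DataFrame(
    {"id": [1, 2], "age": [33, 51], "department": ["Sales", "Sales"]}
)

test_X = pd.get_dummies(test_raw.drop(columns=["id"]), dtype=int)

# The test split lacks some categories seen in training; reindex adds the
# missing indicator columns as zeros and puts columns in the training order.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_X)
```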


Predictions

In [None]:
# Get the predictions using your already trained CatBoost classifier model achieved in Exercise 9
# YOUR CODE HERE


In [None]:
# Get the predictions using your already trained XGBoost classifier model achieved in Exercise 11
# YOUR CODE HERE


Get the predictions using your trained Microsoft LightGBM model (Optional)

In [None]:
# Get the predictions using your already trained Microsoft LightGBM classifier model
# achieved under the optional exercise 'Apply LightGBM (Optional)'
# YOUR CODE HERE


Save the predictions to a CSV file and submit it via the given Kaggle competition link

In [None]:
# YOUR CODE HERE
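
A sketch of writing a submission file. The column names `id` and `label`, the ids, and the output file name are hypothetical; the required format should be taken from the Kaggle competition page or its sample submission file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids and predictions standing in for real model output.
test_ids = [101, 102, 103]
predictions = [0, 1, 0]

# Column names here are assumptions; match them to the competition's sample file.
submission = pd.DataFrame({"id": test_ids, "label": predictions})

# Written to a temporary directory here; in the notebook a plain file name works too.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
print(f"Saved {len(submission)} rows to {out_path}")
```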
