<a href="https://www.kaggle.com/code/romanvelichkin/ibm-hr-classification-using-resampling?scriptVersionId=142823340" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## IBM HR Classification
#### IBM HR Analytics Employee Attrition & Performance
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset  

Predict attrition of your valuable employees.  

### Abstract:
What is meant by attrition in this problem? Some experts distinguish between turnover and attrition. Judging by the data, we will have to work with something in between - any employee departure from the company. But in general, both attrition and turnover problems would be solved the same way, except that turnover would also require data on the labor market surrounding the company.  

If you want to know the difference between turnover and attrition, check out these links:  
https://sprigghr.com/blog/hr-professionals/employee-attrition-vs-employee-turnover/  
https://www.linkedin.com/pulse/employee-turnover-vs-attrition-david-o-bryant/  

Leaving a company is often the result of long deliberation. Abrupt and spontaneous withdrawals are rather rare. Therefore, it is not entirely correct to require a deterministic prediction from the solution. The employee is still working today, but he has already made a decision to leave the company, which he will announce tomorrow. In the data, this will be noted as if the employee is working, although in fact it is rather the other way around. It would be more correct to present this task as a regression problem. 

In addition, the presented data is imbalanced - there are five times more records in the `attrition-no` class than in the `attrition-yes` class. This makes building a predictive model harder. 
This data is a nice example of the *accuracy paradox* - by simply predicting each case as `attrition-no`, we already get an accuracy of about 84%. But recall for the other class is then zero, so the high accuracy says nothing about how well leavers are detected. 

**I think that for the above reasons, this problem has no real solution.**

However, I will show you how I've tried to solve this problem:
1. Feature analysis - *How I analyzed the features and highlighted the most important ones*
2. Trying to solve problem using under- and over-sampling - *How I tried to solve the problem of data imbalance using under- and over-sampling*

## List of attributes:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

## Prepare tools

In [None]:
# import data analysis and plotting tools 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# enable drawing plots in jupyter
%matplotlib inline

# processing data
from sklearn.preprocessing import StandardScaler

# import models from scikit-learn
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# import model evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay  # successor to plot_roc_curve, which was removed in scikit-learn 1.2

## Import data

In [None]:
df = pd.read_csv("/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")
pd.set_option("display.max_columns", None)
df

Here are some features that we need to remove:
1. `EmployeeNumber` - it's some sort of employee id, which won't help us.
2. Features that contain the same value over all data: `EmployeeCount` contains 1, `Over18` contains "Y", `StandardHours` contains 80.

In [None]:
# Let's check what values are stored in these features

print("EmployeeCount contains value:", pd.unique(df.EmployeeCount))
print("Over18 contains value:", pd.unique(df.Over18))
print("StandardHours contains value:", pd.unique(df.StandardHours))
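Constant columns can also be found programmatically instead of by eye. The sketch below does this on a toy frame; the `toy` data and the helper name `constant_columns` are made up for illustration, and the real dataframe could be passed in the same way.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique == 1].index.tolist()

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
})

print(constant_columns(toy))
```

Columns found this way carry no information for a model, which is why they are dropped later.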

#### Function for data preparation

We'll use a function in the notebook so we don't have to go back to the beginning.

In [None]:
def df_preparation():
    """
    Function loads data as dataframe, shuffles it and drops features we don't need.
    Function returns final dataframe
    """
    # Get data
    df = pd.read_csv("/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")
    
    # Drop redundant features (meta information or same value over all data)
    df = df.drop(columns=["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"])
    
    # Shuffle the data, in case the authors accidentally arranged it in a certain order.
    df = df.sample(frac=1, random_state=99).reset_index(drop=True)
    
    return df

In [None]:
df = df_preparation()
df.head()

## Data exploration

In [None]:
# Look at the target (Attrition) distribution
print(df["Attrition"].value_counts())
df["Attrition"].value_counts().plot(kind="bar")
plt.xticks(rotation=0);

Data is imbalanced - there are five times more records in `attrition-no` class than in `attrition-yes` class.
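To see why accuracy is misleading here, consider a "model" that always answers `No`. The class counts below are an assumption for illustration (they should be checked against the `value_counts()` output above); the arithmetic is the point.

```python
# Class counts assumed for illustration; check them against value_counts() above
n_no, n_yes = 1233, 237

# A "model" that always predicts Attrition = No
true_negatives = n_no      # every "No" is predicted correctly
false_negatives = n_yes    # every "Yes" is missed
true_positives = 0         # no "Yes" is ever predicted

accuracy = (true_positives + true_negatives) / (n_no + n_yes)
recall_yes = true_positives / (true_positives + false_negatives)

print(f"accuracy:   {accuracy:.3f}")
print(f"recall yes: {recall_yes:.3f}")
```

The accuracy looks good while not a single leaving employee is detected, so recall and f1-score for the `Yes` class are more informative for this task.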

In [None]:
# Explore data types
df.info()

In [None]:
# Explore if there are empty values
df.isna().sum()

### No nulls

## 1. Feature analysis

First we need to deal with the imbalance between the `Attrition` classes, so that both classes are comparable on the feature plots.

In [None]:
# Get data (shuffled and without redundant features)
df = df_preparation()

In [None]:
# Create copy of our dataframe with increased amount of Attrition-Yes rows for analysis
df_analysis = []

for row in df.itertuples(index=False):
    if row.Attrition == 'Yes':
        df_analysis.extend([list(row)]*5)
    else:
        df_analysis.append(list(row))
        
df_analysis = pd.DataFrame(df_analysis, columns=df.columns)
df_analysis.head()

In [None]:
df_analysis["Attrition"].value_counts().plot(kind="bar")
plt.xticks(rotation=0);

Now that the classes have an almost equal amount of data, we can build plots that are easier to understand and analyze.
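The row-by-row loop above can also be written without explicit iteration. This is only a sketch of an equivalent approach on a toy frame (the `toy` data is made up), using `Index.repeat` to duplicate the `Yes` rows:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No"],
    "Age": [41, 49, 37, 33, 27],
})

# Repeat "Yes" rows 5 times and "No" rows once, keeping the original row order
repeats = np.where(toy["Attrition"] == "Yes", 5, 1)
toy_analysis = toy.loc[toy.index.repeat(repeats)].reset_index(drop=True)

print(toy_analysis["Attrition"].value_counts())
```

Note that this kind of duplication is only used to make the plots readable; it adds no new information about the `Yes` class.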

### Plotting all features and how they affect `Attrition`

#### Uncomment this section using `Ctrl+/` to plot all features

In [None]:
# # Draw plots for all features
# # I won't do it because there will be many, but you can uncomment it using Ctrl+/ and do it by yourself

# # Some design settings
# sns.set_context("paper")
# sns.set_style("white")

# for column in df_analysis:
    
#     # I excluded plots with lots of values that are difficult to draw, also I excluded Attrition
#     # You can exclude them from list and explore by yourself
#     if column not in ['Age', 'Attrition', 'DailyRate', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate']:      
        
#         # You can increase aspect for displot and (x, ) for crosstab figsize to make plots wider 
#         # Respectively you can increase height for displot and (, y) for crosstab figsize to make plots higher 
        
#         # bar-plot where Attrition-Yes will be drawn over Attrition-No, so difference between them will be more clear  
#         sns.displot(x=column, data=df_analysis, hue='Attrition', height=4, aspect=1, palette="colorblind");
    
#         # bar-plot where Attrition-Yes will be drawn near with Attrition-No, for additional comparison 
#         pd.crosstab(df_analysis.Attrition, [df_analysis[column]]).plot(kind="bar", figsize=(5, 5))
#         plt.xticks(rotation=0);

### Features that affect Attrition
Most features could affect `Attrition` in some way. That's why we build models - to find these connections.
But I will show only the features whose effect on `Attrition` is clearly visible on the plots.

For each feature I draw:
- a displot where `Attrition-Yes` is drawn over `Attrition-No`, so the difference between them is clearer
- a crosstab plot where `Attrition-Yes` is drawn next to `Attrition-No`, for additional comparison 

We have to look for situations where the graphs for `Attrition-Yes` and `Attrition-No` have significantly different shapes or where there is a strong deviation from a normal distribution.

In [None]:
# Some design settings
sns.set_context("paper")
sns.set_style("white")

#### EnvironmentSatisfaction

In [None]:
sns.displot(x="EnvironmentSatisfaction", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.EnvironmentSatisfaction]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees with low `EnvironmentSatisfaction` quit their job more often, while employees with high `EnvironmentSatisfaction` quit less often.

#### JobInvolvement

In [None]:
sns.displot(x="JobInvolvement", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.JobInvolvement]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees with low `JobInvolvement` quit their job more often, while employees with high `JobInvolvement` quit less often.

#### JobLevel

In [None]:
sns.displot(x="JobLevel", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.JobLevel]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees with the lowest `JobLevel` quit their job much more often.
I think it's because these employees haven't built a career yet and don't lose much when they leave the company.

#### JobRole

In [None]:
sns.displot(x="JobRole", data=df_analysis, hue='Attrition', height=4, aspect=3, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.JobRole]).plot(kind="bar", figsize=(15, 5));
plt.xticks(rotation=0);

Certain `JobRole` values affect `Attrition`:
- Healthcare Representatives, Managers, Manufacturing Directors and Research Directors quit their job less often;
- Laboratory Technicians and Sales Representatives quit their job more often.

#### JobSatisfaction

In [None]:
sns.displot(x="JobSatisfaction", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.JobSatisfaction]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees with the lowest `JobSatisfaction` quit their job more often, while employees with the highest `JobSatisfaction` quit less often.

`EnvironmentSatisfaction` and `JobSatisfaction` plots look similar. We need to check if those features are correlated.

In [None]:
# Example of how correlation should look

pd.crosstab(df_analysis.EnvironmentSatisfaction, [df_analysis.EnvironmentSatisfaction]).plot(kind="bar", figsize=(5, 5));
plt.xticks(rotation=0);

If there is a correlation, then value "1" of one feature should be concentrated at position "1" of the other feature. The same holds for the rest of the values.

In [None]:
# Check if EnvironmentSatisfaction and JobSatisfaction have correlation

pd.crosstab(df_analysis.JobSatisfaction, [df_analysis.EnvironmentSatisfaction]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

There is no correlation between `EnvironmentSatisfaction` and `JobSatisfaction`. They are independent features.
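The same check can be read off as numbers with a row-normalized crosstab. This is a sketch on toy ratings (the series `a` and `b` are made up): when two features are unrelated, every row of the table looks alike, and when they are fully dependent, all the mass sits on the diagonal.

```python
import pandas as pd

# Toy ratings, constructed so that b is spread evenly over every level of a
a = pd.Series([1, 1, 2, 2, 3, 3, 4, 4], name="a")
b = pd.Series([1, 2, 1, 2, 1, 2, 1, 2], name="b")

# Each row shows how b is distributed within one level of a
independent = pd.crosstab(a, b, normalize="index")
# A feature compared with itself is the fully dependent extreme
dependent = pd.crosstab(a, a.rename("a_copy"), normalize="index")

print(independent)
print(dependent)
```

On the real data, `pd.crosstab(df.JobSatisfaction, df.EnvironmentSatisfaction, normalize="index")` would give the table to compare against these two extremes.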

#### MaritalStatus

In [None]:
sns.displot(x="MaritalStatus", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.MaritalStatus]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Single employees quit their job more often. I think it's because married people have responsibilities, and changes in their lives take longer to plan. The situation could be similar for divorced people, because they may have kids from the marriage that they are responsible for.


#### OverTime

In [None]:
sns.displot(x="OverTime", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.OverTime]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees who work overtime quit their job more often. And vice versa - employees who don't work overtime quit less often.

#### RelationshipSatisfaction

In [None]:
sns.displot(x="RelationshipSatisfaction", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.RelationshipSatisfaction]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees who are unsatisfied with their relationships quit their job more often.

Let's find out: what if `RelationshipSatisfaction` doesn't affect `Attrition` by itself, but only because `RelationshipSatisfaction` is affected by `MaritalStatus`?

In [None]:
# Check if RelationshipSatisfaction and MaritalStatus have correlation

pd.crosstab(df_analysis.RelationshipSatisfaction, [df_analysis.MaritalStatus]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

However, it isn't. As you can see, all `MaritalStatus` values have a similar distribution across all `RelationshipSatisfaction` levels.

#### StockOptionLevel

In [None]:
sns.displot(x="StockOptionLevel", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.StockOptionLevel]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

Employees with stock options quit their job less often. Positive values of `StockOptionLevel` affect `Attrition`.
It's understandable - stock options are a popular way to motivate employees. People who hold stock in the company they work at tend to stay in that company longer.

#### YearsAtCompany

In [None]:
sns.displot(x="YearsAtCompany", data=df_analysis, hue='Attrition', height=4, aspect=3, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.YearsAtCompany]).plot(kind="bar", figsize=(15, 10));
plt.xticks(rotation=0);

Both plots are close to a normal distribution, but there is an obvious difference between the two graphs: they have different peaks.
`Attrition-Yes` has a large peak at 1 year, while the other graph has no peak there.
One year is enough time to understand whether you want to keep working in the chosen company or not. It looks like many people decided to leave the company.

#### TotalWorkingYears

In [None]:
sns.displot(x="TotalWorkingYears", data=df_analysis, hue='Attrition', height=4, aspect=3, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.TotalWorkingYears]).plot(kind="bar", figsize=(15, 10));
plt.xticks(rotation=0);

The `Attrition-Yes` graph has the same peak as in the `YearsAtCompany` plot - at 1 year. Let's explore why that is.

Actually, we know why this is so - people who have worked for 1 year in total most likely worked that year in this company, since our data comes from the company. But we will still learn how to check it.


In [None]:
# Employees who totally worked for 1 year

df_filter = df_analysis[df_analysis.Attrition == "Yes"]
df_filter = df_filter[df_filter.TotalWorkingYears == 1]

# Check how many of them worked in company for 1 year

one_year = df_filter[df_filter.YearsAtCompany == 1].YearsAtCompany.count()
rest = df_filter[df_filter.YearsAtCompany < 1].YearsAtCompany.count()

columns = ["one year", "less than one year"]

plt.bar(columns, [one_year, rest], color=["limegreen", "lightcoral"])
plt.title("Employees who totally worked for 1 year already - how long did they work in company?")
plt.xticks(rotation=0);

The 1-year peak on the `Attrition-Yes` graph for `TotalWorkingYears` comes almost completely from `YearsAtCompany`.

#### YearsInCurrentRole

In [None]:
sns.displot(x="YearsInCurrentRole", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.YearsInCurrentRole]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

It looks like there is a peak at 0 years for `Attrition-Yes` which shouldn't be there. Probably some employees were moved to a new position and didn't like it.

#### YearsWithCurrManager

In [None]:
sns.displot(x="YearsWithCurrManager", data=df_analysis, hue='Attrition', height=4, palette="colorblind");
pd.crosstab(df_analysis.Attrition, [df_analysis.YearsWithCurrManager]).plot(kind="bar", figsize=(10, 5));
plt.xticks(rotation=0);

This graph looks very similar to the previous `YearsInCurrentRole` one, with the same peak at 0 years for `Attrition-Yes`. 
I think it's because managers change their positions rarely.

## 2. Trying to solve problem using under- and over-sampling

The `Attrition` feature is imbalanced: the `No` class has five times more values than the `Yes` class. I'll try to fix it using under- and over-sampling.

### Data preparation

In [None]:
df = df_preparation()
df.head()

In [None]:
# Transform text categories into numerical
from sklearn.preprocessing import OrdinalEncoder

categorical_features = ["Attrition", 
                        "BusinessTravel", 
                        "Department", 
                        "EducationField", 
                        "Gender", 
                        "JobRole", 
                        "MaritalStatus", 
                        "OverTime"] 

encoder = OrdinalEncoder()
df[categorical_features] = encoder.fit_transform(df[categorical_features])
df
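It helps to know which number each label received. The sketch below uses a toy frame (made-up values) to show that `OrdinalEncoder` assigns codes in sorted label order, so `No` becomes 0 and `Yes` becomes 1, which the later `row.Attrition == 1` check relies on.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy columns standing in for the real categorical features
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
})

toy_encoder = OrdinalEncoder()
encoded = toy_encoder.fit_transform(toy)

# categories_ holds, per column, the sorted labels; a label's position is its code
for column, labels in zip(toy.columns, toy_encoder.categories_):
    print(column, {label: code for code, label in enumerate(labels)})
```

For nominal features such as `JobRole` the resulting numeric order is arbitrary (alphabetical), which is worth keeping in mind when reading model coefficients for them.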

### Creating train and test datasets

In [None]:
# Create train and test datasets
df_train = []
df_test = []
y = 50
n = 100

for row in df.itertuples(index=False):
    if row.Attrition == 1: # Attrition-Yes
        
        # first 50 Attrition-Yes rows we take into test dataset
        if y > 0:
            y -= 1
            df_test.append(list(row))
            
        # rest goes to train dataset
        else:
            df_train.append(list(row))
    else:
        # first 100 Attrition-No rows we take into test dataset
        if n > 0:
            n -= 1
            df_test.append(list(row))
            
        # rest goes to train dataset
        else:
            df_train.append(list(row))
            
df_train = pd.DataFrame(df_train, columns=df.columns)            
df_test = pd.DataFrame(df_test, columns=df.columns)    
df_train
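The same split can be expressed with index selection instead of a loop. This is a sketch on a toy frame with smaller sizes (the `toy` data and the sizes 2 and 3 are made up; the notebook uses 50 and 100):

```python
import pandas as pd

# Toy encoded frame standing in for df (1 = Attrition-Yes, 0 = Attrition-No)
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 1, 0, 0],
    "Age": [41, 49, 37, 33, 27, 32, 59, 30],
})

n_yes_test, n_no_test = 2, 3  # the notebook uses 50 and 100

# Index labels of the first rows of each class, in their original order
yes_idx = toy.index[toy["Attrition"] == 1][:n_yes_test]
no_idx = toy.index[toy["Attrition"] == 0][:n_no_test]
test_idx = yes_idx.union(no_idx)  # union() returns a sorted index

toy_test = toy.loc[test_idx].reset_index(drop=True)
toy_train = toy.drop(index=test_idx).reset_index(drop=True)

print(toy_test["Attrition"].value_counts())
print(toy_train["Attrition"].value_counts())
```

Because the data was shuffled in `df_preparation()`, taking the first rows of each class amounts to a random draw with a fixed class ratio in the test set.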

### Oversampling and undersampling train dataset

There are two main ways to deal with imbalanced data: reduce the number of samples in the larger class (undersampling) or increase the number of samples in the smaller class (oversampling). There are many mathematical algorithms for both methods.

I used both undersampling and oversampling to make dataset balanced.

In [None]:
# Undersample major class using EditedNearestNeighbours method
# Samples which do not agree "enough" with their neighbourhood are removed from the class

from imblearn.under_sampling import EditedNearestNeighbours 

# Resample strategy for ENN
enn = EditedNearestNeighbours(sampling_strategy='majority')

# Fit the model to generate the data.
resampled_train_X, resampled_train_y = enn.fit_resample(df_train.drop('Attrition', axis=1), 
                                                        df_train['Attrition'])

df_train_resampled = pd.concat([resampled_train_y,
                               resampled_train_X], axis=1)

print(df_train_resampled["Attrition"].value_counts())
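To give an intuition for what "not agreeing with the neighbourhood" means, here is a simplified sketch of the idea in plain NumPy. It assumes the strict rule where all `k` nearest neighbours must share the sample's class; the function `enn_sketch` and the toy points are hypothetical, and the imblearn implementation may differ in details.

```python
import numpy as np

def enn_sketch(X, y, majority_label, k=3):
    """Drop majority samples that have any of their k nearest neighbours in another class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a sample is never its own neighbour
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # True where all k neighbours carry the same label as the sample itself
    agrees = (y[neighbours] == y[:, None]).all(axis=1)
    # Minority samples are always kept; majority samples only if they agree
    keep = (y != majority_label) | agrees
    return X[keep], y[keep]

# Toy 1-D data: one majority point (10.5) sits inside the minority cluster
X_toy = [[0.0], [1.0], [2.0], [3.0], [10.5], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = enn_sketch(X_toy, y_toy, majority_label=0)
print(X_res.ravel(), y_res)
```

Only the majority point surrounded by minority samples is removed, which cleans the class boundary rather than balancing the classes by itself.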

In [None]:
# Oversampling minor class using SMOTE (Synthetic Minority Oversampling Technique)
# This method generates new samples by interpolation

from imblearn.over_sampling import SMOTE

# Resample strategy for SMOTE
sm = SMOTE(sampling_strategy='minority', random_state=99)

# Fit the model to generate the data.
resampled_train_X, resampled_train_y = sm.fit_resample(df_train_resampled.drop('Attrition', axis=1), 
                                                       df_train_resampled['Attrition'])
df_train_resampled = pd.concat([resampled_train_y,
                               resampled_train_X], axis=1)

print(df_train_resampled["Attrition"].value_counts())
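The interpolation step behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the imblearn implementation: the helper `smote_like_sample` and the two toy points are made up, and the neighbour search is left out.

```python
import numpy as np

rng = np.random.default_rng(99)

def smote_like_sample(x, neighbour, rng):
    """Return a random point on the segment between a minority sample and a neighbour."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbour - x)

x = np.array([1.0, 2.0])          # a minority sample
neighbour = np.array([3.0, 6.0])  # one of its minority-class neighbours
synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

One consequence for this notebook: the categorical features were ordinal-encoded before resampling, so interpolated samples can get values that lie between two category codes. imblearn also provides `SMOTENC` for data with categorical columns, which might be worth trying here.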

### Modelling

In [None]:
# Split train and test data into X and y
X = df_train_resampled.drop("Attrition", axis=1)
y = df_train_resampled["Attrition"]

X_test = df_test.drop("Attrition", axis=1)
y_test = df_test["Attrition"]
y_test

In [None]:
# standardize data: fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X = scaler.fit_transform(X)

X_test = scaler.transform(X_test)
X_test
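A scaler is normally fitted on the training data and then reused for the test data, so both are expressed in the same units. The toy arrays below (made up for illustration) show how the result differs when the test data gets its own scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[11.0], [12.0], [13.0]])

shared_scaler = StandardScaler().fit(train)
shared = shared_scaler.transform(test)            # test expressed in training units
separate = StandardScaler().fit_transform(test)   # test re-centred on itself

print(shared.ravel())
print(separate.ravel())
```

With a separate scaler the test values are re-centred around zero, so the model no longer sees that they are far outside the range it was trained on.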

In [None]:
# Split train data into train and evaluation sets
np.random.seed(99)

X_train, X_evaluation, y_train, y_evaluation = train_test_split(X, y)

In [None]:
# Function to fit and score models

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict of different models
    X_train: training data (no labels)
    X_test: test data (no labels)
    y_train: training labels
    y_test: test labels
    """
    np.random.seed(99)
    
    # dict for model scores
    model_scores = {}
    
    # loop through models
    for name, model in models.items():
        # fit model to the data
        model.fit(X_train, y_train)
        # evaluate model
        model_scores[name] = model.score(X_test, y_test)
        
    return model_scores

In [None]:
# LinearSVC
# KNeighborsClassifier
# RandomForestClassifier
# LogisticRegression

models = {"LinearSVC": LinearSVC(),
          "KNeighborsClassifier": KNeighborsClassifier(),
          "RandomForestClassifier": RandomForestClassifier(),
          "LogisticRegression": LogisticRegression()}

model_scores = fit_and_score(models=models,
                                 X_train=X_train,
                                 X_test=X_evaluation,
                                 y_train=y_train,
                                 y_test=y_evaluation)
model_scores

I'll use `LogisticRegression` because it requires minimum tuning. Other models could show close results.

### Evaluating LogisticRegression model

In [None]:
# LogisticRegression model with default parameters
np.random.seed(99)
lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)
lr_model.score(X_evaluation, y_evaluation)

In [None]:
lr_model.score(X_train, y_train)

In [None]:
lr_model.score(X_test, y_test)

In [None]:
y_preds = lr_model.predict(X_test)
y_preds
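The abstract argues that a hard yes/no answer is too strict for this task. Logistic regression can also return a probability for each employee, which reads more like a risk score. The sketch below uses synthetic data from `make_classification` (not the HR dataset) and a separate `demo_model`, so all names in it are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data standing in for the HR features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     weights=[0.84, 0.16], random_state=99)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 is the estimated probability of the positive class
risk = demo_model.predict_proba(X_demo)[:, 1]

# predict() corresponds to thresholding that probability at 0.5
hard_labels = (risk > 0.5).astype(int)
agreement = (hard_labels == demo_model.predict(X_demo)).mean()

print(risk[:5].round(3))
print(agreement)
```

In the notebook, `lr_model.predict_proba(X_test)[:, 1]` would give such scores, and the threshold could be lowered below 0.5 to catch more leavers at the cost of more false alarms.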

In [None]:
# Plot ROC curve and report the AUC metric
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2

RocCurveDisplay.from_estimator(lr_model, X_test, y_test);

In [None]:
# Confusion matrix
print(confusion_matrix(y_test, y_preds))

sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,
                     cbar=False,
                     fmt="g")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    
plot_conf_mat(y_test, y_preds)

In [None]:
# Classification report with precision, recall and f1-score on the test set

print(classification_report(y_test, y_preds))
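The numbers in the report follow directly from the confusion matrix. Here is the arithmetic for the positive class on an illustrative matrix; the four counts are made up and are not the notebook's results.

```python
# Illustrative confusion matrix in scikit-learn's layout (rows = true, columns = predicted);
# the numbers are made up, not the notebook's results
tn, fp = 80, 20
fn, tp = 15, 35

precision = tp / (tp + fp)   # how many predicted leavers really left
recall = tp / (tp + fn)      # how many real leavers were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```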

In [None]:
# Cross-validated accuracy
np.random.seed(99)

cv_accuracy = cross_val_score(lr_model, X, y, cv=5, scoring="accuracy")
cv_accuracy_mean = np.mean(cv_accuracy)
print(cv_accuracy_mean, cv_accuracy)

In [None]:
# Cross-validated precision
np.random.seed(99)

cv_precision = cross_val_score(lr_model, X, y, cv=5, scoring="precision")
cv_precision_mean = np.mean(cv_precision)
print(cv_precision_mean, cv_precision)

In [None]:
# Cross-validated recall
np.random.seed(99)

cv_recall = cross_val_score(lr_model, X, y, cv=5, scoring="recall")
cv_recall_mean = np.mean(cv_recall)
print(cv_recall_mean, cv_recall)

In [None]:
# Cross-validated f1-score
np.random.seed(99)

cv_f1 = cross_val_score(lr_model, X, y, cv=5, scoring="f1")
cv_f1_mean = np.mean(cv_f1)
print(cv_f1_mean, cv_f1)

In [None]:
# Visualise cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_accuracy_mean,
                           "Precision": cv_precision_mean,
                           "Recall": cv_recall_mean,
                           "F1": cv_f1_mean},
                           index=[0])
cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
                      legend=False);
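The four cross-validation cells can be collapsed into one call with `cross_validate`, and the scaler can be placed inside a pipeline so that it is refitted on each training fold. This is a sketch on synthetic data (`X_demo`, `y_demo` and `pipeline` are illustrative names), not a rerun of the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the resampled training set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=99)

# The scaler is refitted inside every fold, so validation folds never influence it
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(pipeline, X_demo, y_demo, cv=5, scoring=metrics)

for metric in metrics:
    values = scores[f"test_{metric}"]
    print(metric, values.mean().round(3), values.round(3))
```

The same idea applies to resampling: since SMOTE was run before cross-validation here, synthetic samples derived from a validation fold can end up in the training folds, which may make the cross-validated scores optimistic. Putting the resampler into an `imblearn.pipeline.Pipeline` is one way to avoid that.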

### Feature importance for Logistic Regression

In [None]:
# Check feature importance
lr_model.coef_

In [None]:
# Match coef of features to columns
# Exclude labels from 'columns' (0 column)
lr_feature_dict = dict(zip(df_train_resampled.columns[1:], list(lr_model.coef_[0])))
lr_feature_dict

In [None]:
# Visualisation of feature importance
lr_feature_df = pd.DataFrame(lr_feature_dict, index=[0])
lr_feature_df.T.plot.bar(title="Feature importance for Logistic Regression",
                         legend=False,
                         figsize=(15, 5));
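With many features the bar chart is easier to read when coefficients are ordered by magnitude. A small sketch with made-up coefficients in the same `{feature: coefficient}` shape as `lr_feature_dict`:

```python
import pandas as pd

# Made-up coefficients in the same {feature: coefficient} shape as lr_feature_dict
toy_coefs = {"Age": -0.2, "OverTime": 0.9, "Gender": 0.05, "StockOptionLevel": -0.6}

coef_series = pd.Series(toy_coefs)
# Order by magnitude; the sign still tells the direction of the effect
ranked = coef_series.reindex(coef_series.abs().sort_values(ascending=False).index)
print(ranked)
```

Since the features were standardized before fitting, the magnitudes are roughly comparable across features; a positive coefficient pushes the prediction towards `Attrition-Yes` (class 1) and a negative one towards `Attrition-No`.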

Leaving a company is often the result of long deliberation. Abrupt and spontaneous withdrawals are rather rare. Therefore, it is not entirely correct to require a deterministic prediction from the solution. The employee is still working today, but he has already made a decision to leave the company, which he will announce tomorrow. In the data, this will be noted as if the employee is working, although in fact it is rather the other way around. It would be more correct to present this task as a regression problem.

In addition, the presented data is imbalanced - there are five times more records in the `attrition-no` class than in the `attrition-yes` class. This makes building a predictive model harder. This data is a nice example of the accuracy paradox - by simply predicting each case as `attrition-no`, we already get an accuracy of about 84%. But recall for the other class is then zero, so the high accuracy says nothing about how well leavers are detected.

For the above reasons, this problem, in my opinion, does not have a real solution. 

However, resampling the data let me get a relatively high f1-score compared to the other approaches I've tried.