# IBM HR Analytics Employee Attrition

This project was developed using a fictional dataset created by IBM data scientists and aims to investigate factors that lead to employee attrition, as well as to develop a Machine Learning model capable of predicting whether employees tend to leave the company or not. 


## 1. Exploring the data

Let's start by importing the necessary libraries, loading the data, and checking the data set.

The second column of this data set, 'Attrition', will be our target variable, that is, the one we want to predict (whether an employee will leave the company or not). All other columns are characteristics of each employee in the company's database.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Loading the dataset
employee_df = pd.read_csv("/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
# Viewing the first lines

employee_df.head()

In [None]:
employee_df.info()

* The dataset has 1,470 rows and 35 columns, 26 of which are numeric and 9 categorical.
* The dataset does not have null values.
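
The counts above are read off `employee_df.info()`. As a minimal sketch, the same figures can be computed directly; the toy frame below is a made-up stand-in for `employee_df`, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Toy frame standing in for employee_df (values are illustrative only)
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "MonthlyIncome": [5993, 5130, 2090],
    "OverTime": ["Yes", "No", "Yes"],
})

n_rows, n_cols = toy_df.shape
# Numeric columns vs. everything else (object/string columns here)
n_numeric = toy_df.select_dtypes(include="number").shape[1]
n_categorical = toy_df.select_dtypes(exclude="number").shape[1]
# Total number of missing cells across the whole frame
n_missing = int(toy_df.isnull().sum().sum())

print(f"{n_rows} rows, {n_cols} columns: {n_numeric} numeric, {n_categorical} categorical")
print("missing values:", n_missing)
```

Running the same three expressions on `employee_df` should reproduce the figures listed above.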

In [None]:
# Analyzing statistical information about numerical variables
employee_df.describe()

In [None]:
# Transforming some categorical variables with YES / NO content to numeric 0/1

In [None]:
employee_df['Attrition'].value_counts()

In [None]:
employee_df['OverTime'].value_counts()

In [None]:
employee_df['Over18'].value_counts()

In [None]:
employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)


In [None]:
employee_df.head()

In [None]:
# Plotting a histogram to visualize how each feature is distributed into dataset

employee_df.hist(bins = 30, figsize = (20,20), color = 'b');

* Most employees are between roughly 27 and 40 years old
* Most employees live close to work
* Most employees have Education level 3
* Most employees have worked at the company for less than 10 years
* Several features, such as 'MonthlyIncome' and 'TotalWorkingYears', are heavy-tailed (right-skewed)

In [None]:
# It makes sense to drop 'EmployeeCount', 'StandardHours' and 'Over18' since they do not change from one employee to the other
# Let's drop 'EmployeeNumber' as well
employee_df.drop(['EmployeeCount', 'StandardHours', 'Over18', 'EmployeeNumber'], axis=1, inplace=True)
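
The claim that these columns never change can be checked rather than read off the histograms. A minimal sketch, using a made-up toy frame in place of `employee_df`: a column with a single distinct value carries no information for a model.

```python
import pandas as pd

# Toy frame (illustrative values) with three columns that never vary
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

# Columns with exactly one distinct value are constant
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique() == 1]
print(constant_cols)
```

'EmployeeNumber' would not be caught by this check: it is dropped because it is an identifier, not because it is constant.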

In [None]:
employee_df.head()
# Now we have 31 columns

In [None]:
# Let's see how many employees left the company! 
left_df = employee_df[employee_df['Attrition'] == 1]
stayed_df = employee_df[employee_df['Attrition'] == 0]

# Count the number of employees who stayed and left
# It seems that we are dealing with an imbalanced dataset 

print("Total =", len(employee_df))

print("Number of employees who left the company:", len(left_df))
print(f"Percentage of employees who left the company: {1.*len(left_df)/len(employee_df)*100.0:.2f}%") 
print("Number of employees who did not leave the company (stayed) =", len(stayed_df))
print(f"Percentage of employees who did not leave the company (stayed): {1.*len(stayed_df)/len(employee_df)*100.0:.2f}%") 


In [None]:
# Let's look at the statistics of the employees who left and those who stayed to make some comparisons

left_df.describe()

In [None]:
stayed_df.describe()


Comparing the means of the two groups, we can conclude:
* Age: employees who stayed are older on average than those who left (37.5 vs. 33.6)
* DailyRate: the daily rate of employees who stayed is higher (812 vs. 750)
* DistanceFromHome: employees who stayed live closer to work (8.9 vs. 10.6)
* EnvironmentSatisfaction and JobSatisfaction: employees who stayed are generally more satisfied with their jobs
* StockOptionLevel: employees who stayed tend to have a higher stock option level
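
Instead of reading two `describe()` tables side by side, the group means can be put in one table with `groupby`. This is a sketch on made-up rows, assuming 'Attrition' has already been converted to 0/1 as done earlier in this notebook.

```python
import pandas as pd

# Toy rows (illustrative values); Attrition: 1 = left, 0 = stayed
toy_df = pd.DataFrame({
    "Attrition": [1, 0, 1, 0, 0],
    "Age": [28, 40, 32, 45, 38],
    "DistanceFromHome": [12, 5, 10, 8, 8],
})

# One row per group, one column per feature
group_means = toy_df.groupby("Attrition")[["Age", "DistanceFromHome"]].mean()
print(group_means)
```

On the real data, `employee_df.groupby('Attrition').mean(numeric_only=True)` should give the same means as the two `describe()` calls.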


In [None]:
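# Let's look at the correlations between the numeric features
# numeric_only=True skips the remaining text columns (required in pandas >= 2.0, where corr() no longer drops them silently)
correlations = employee_df.corr(numeric_only=True)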
f, ax = plt.subplots(figsize = (20, 20))
sns.heatmap(correlations, annot = True);

Verifying the correlation between variables is extremely important to achieve a broader view of the data and how they relate to each other.

In the heatmap, lighter cells indicate a stronger positive correlation.

* Job level is strongly correlated with total working years
* Monthly income is strongly correlated with job level
* Monthly income is strongly correlated with total working years
* Age is also positively correlated with monthly income
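
A 20x20 heatmap is hard to scan, so it can help to list the feature pairs sorted by the strength of their correlation. The sketch below uses synthetic data (an income column built from working years plus noise, and an unrelated distance column); the column names are borrowed from the dataset but the values are made up.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: income depends on working years, distance is unrelated
rng = np.random.default_rng(0)
years = rng.integers(0, 30, size=200)
toy_df = pd.DataFrame({
    "TotalWorkingYears": years,
    "MonthlyIncome": 2000 + 500 * years + rng.normal(0, 1000, size=200),
    "DistanceFromHome": rng.integers(1, 30, size=200),
})

corr = toy_df.corr(numeric_only=True)

# Visit each unordered pair of columns once and sort by absolute correlation
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:.2f}")
```

Applied to `correlations`, the top of this list should contain the pairs named in the bullets above.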

In [None]:
# Let's investigate how attrition relates to specific variables such as 'Age', 'JobRole', 'MaritalStatus', 'JobInvolvement' and 'JobLevel'

plt.figure(figsize=[25, 12])
sns.countplot(x = 'Age', hue = 'Attrition', data = employee_df);

Blue bars represent employees who stayed; orange bars represent those who left.
* Attrition is concentrated among employees up to about 31 years old. Between 18 and 21, the number of employees who left is largest relative to the number who stayed.
* After 31, the number of employees who left decreases as age increases.

In [None]:
plt.figure(figsize=[20,20])

plt.subplot(411)

sns.countplot(x = 'JobRole', hue = 'Attrition', data = employee_df)
plt.title("In which position the Attrition is higher / lower?");

* A large proportion of Sales Representatives left the company, while very few Research Directors did.
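
Count plots show absolute numbers, which makes proportions hard to judge when roles differ in size. Because 'Attrition' is encoded as 0/1, its mean within a group is the attrition rate of that group. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows (illustrative values): 2 of 5 sales representatives left, no research director did
toy_df = pd.DataFrame({
    "JobRole": ["Sales Representative"] * 5 + ["Research Director"] * 5,
    "Attrition": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 column = share of ones = attrition rate per role
attrition_rate = (
    toy_df.groupby("JobRole")["Attrition"].mean().sort_values(ascending=False)
)
print((attrition_rate * 100).round(1))
```

The same expression on `employee_df` gives the exact percentage per role, and works unchanged for 'MaritalStatus', 'JobInvolvement' or 'JobLevel'.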

In [None]:
# Let's see the Monthly Income vs. Job Role

plt.figure(figsize=(10, 10))
sns.boxplot(x = 'MonthlyIncome', y = 'JobRole', data = employee_df);
plt.title("How is the distribution of wages among the different positions?");

* Sales Representative, Laboratory Technician and Research Scientist are the least paid, while Research Director and Manager are best paid.

In [None]:
sns.countplot(x = 'MaritalStatus', hue = 'Attrition', data = employee_df);
plt.title("Marital Status Vs Attrition");

* Single employees tend to leave more often than married or divorced employees

In [None]:
sns.countplot(x = 'JobInvolvement', hue = 'Attrition', data = employee_df);
plt.title("How does the level of involvement at work affect the Attrition?");

* The less employees are involved, the more they tend to leave the company

In [None]:
sns.countplot(x = 'JobLevel', hue = 'Attrition', data = employee_df)
plt.title("Job level Vs Attrition");

* Employees at lower job levels tend to leave the company more often

In [None]:
# Let's use KDE (Kernel Density Estimate) to visualize the probability density of a continuous variable.

# Investigating DistanceFromHome

plt.figure(figsize=(12,7))
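# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(left_df['DistanceFromHome'], label = 'Employees who left', fill = True, color = 'r')
sns.kdeplot(stayed_df['DistanceFromHome'], label = 'Employees who Stayed', fill = True, color = 'b')
plt.xlabel('Distance From Home');
plt.ylabel('Density');  # the y-axis of a KDE is an estimated density, not attrition
plt.legend();  # display the labels passed to kdeplot
plt.title("Does the distance from home to work impact Attrition?");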

* As the distance from home increases, employees are more likely to leave.

In [None]:
# Investigating YearsWithCurrManager

plt.figure(figsize=(12,7))
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(left_df['YearsWithCurrManager'], label = 'Employees who left', fill = True, color = 'r')
sns.kdeplot(stayed_df['YearsWithCurrManager'], label = 'Employees who Stayed', fill = True, color = 'b')
plt.xlabel('Years With Current Manager');
plt.ylabel('Density');
plt.legend();  # display the labels passed to kdeplot
plt.title("Does the time spent with the current manager influence the departure of employees?");

* The shorter the time spent with the current manager, the greater the tendency for employees to leave.

In [None]:
# Investigating TotalWorkingYears

plt.figure(figsize=(12,7))
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(left_df['TotalWorkingYears'], fill = True, label = 'Employees who left', color = 'r')
sns.kdeplot(stayed_df['TotalWorkingYears'], fill = True, label = 'Employees who Stayed', color = 'b')
plt.xlabel('Total Working Years');
plt.ylabel('Density');
plt.legend();  # display the labels passed to kdeplot
plt.title("Is there a relationship between total working years and Attrition?");

* Employees are most likely to leave within roughly their first 7 total working years (this feature counts overall career experience, not only tenure at this company). Beyond that, they tend to stay.

## 2. Performing data cleaning

The main objective of this step is to ensure that the data is correct, consistent and usable: identifying errors or corrupted values, then correcting, deleting or manually processing them as needed.

In [None]:
# Checking the types of each feature
employee_df.dtypes

In [None]:
# Separating categorical data from the rest of the dataframe to then convert it to numeric
X_cat = employee_df[['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']]
X_cat

There are several different ways to convert categorical to numeric values. In this project we will use the One Hot Encoder from the Scikit Learn library.

In [None]:
# Converting the categorical features into numbers using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
X_cat.shape

In [None]:
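# Converting into a dataframe, naming each column after its encoded category.
# String names avoid mixing int and str column names in the later concat, which recent scikit-learn versions reject.
X_cat = pd.DataFrame(X_cat, columns=onehotencoder.get_feature_names_out())
X_cat 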

In [None]:
# Separating the numerical data
X_numerical = employee_df[[
    'Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
    'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
    'MonthlyRate', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
    'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
]]
X_numerical

In [None]:
# Concatenating the categorical dataset X_cat and the numerical dataset X_numerical into a unique dataset

X_all = pd.concat([X_cat, X_numerical], axis = 1)
X_all

Let's use sklearn's MinMaxScaler to rescale each feature to the interval between 0 and 1, so that features with large ranges (such as 'MonthlyIncome') do not dominate features with small ranges in the machine learning models.
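
Min-max scaling maps each value x of a feature to (x - min) / (max - min), using that feature's own minimum and maximum. The sketch below checks the formula against `MinMaxScaler` on a small made-up array with an age-like and an income-like column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales (illustrative values)
toy = np.array([[18.0, 1000.0], [30.0, 5000.0], [60.0, 20000.0]])

# Manual min-max scaling, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(toy)
print(np.allclose(manual, scaled))
print(scaled)
```

After scaling, both columns span the same range, so distance-based models such as K-Nearest Neighbors weigh them comparably.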

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X_all)
X

In [None]:
# Separating the feature that we want to predict

y = employee_df['Attrition']
y

## 3. Creating Testing and Training datasets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_train.shape

In [None]:
X_test.shape
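
Since the classes are imbalanced, one option (not used in the split above) is to stratify on the target so that both subsets keep the same share of leavers, and to fix `random_state` so results are reproducible. A sketch on toy arrays; the 16% positive share and the seed 42 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 16 of them positive (illustrative imbalance)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 16 + [0] * 84)

# stratify keeps the class ratio in both subsets; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```

Without `stratify`, a small test set can by chance contain noticeably more or fewer leavers than the training set, which makes the reported metrics less stable.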

## 4. Building, training and evaluating different Machine Learning models

## 4.1 Logistic Regression Classifier

Logistic Regression is a Machine Learning algorithm used for classification problems. It models the probability that an observation belongs to the positive class (here, that an employee leaves) and predicts the class from that probability.
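
Concretely, for a binary problem the model computes a linear score z = w·x + b and passes it through the sigmoid function, p = 1 / (1 + exp(-z)). The sketch below verifies this against scikit-learn on synthetic data generated with `make_classification` (not the attrition data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score from the learned weights and intercept
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
# Sigmoid turns the score into a probability of the positive class
manual_proba = 1.0 / (1.0 + np.exp(-z))

sk_proba = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sk_proba))
```

`predict` then returns class 1 whenever this probability exceeds 0.5.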

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()

In [None]:
# Training the model

model.fit(X_train, y_train)

In [None]:
# Making predictions and visualizing the accuracy

LRC_pred = model.predict(X_test)


print("Accuracy: {}%".format( 100 * accuracy_score(y_test, LRC_pred)))  # signature is (y_true, y_pred)

In [None]:
# Comparing the results using Confusion Matrix

from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# Testing Set Performance

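cm = confusion_matrix(y_test, LRC_pred)  # (y_true, y_pred): rows are actual classes, columns are predictions
sns.heatmap(cm, annot=True, fmt='d');

* The test set holds 368 records (25% of the 1,470 rows). The model classifies most of them correctly and misclassifies a comparatively small number of employees.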

In [None]:
# Analyzing the KPI (Key Performance Indicator)

print(classification_report(y_test, LRC_pred))
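
The precision and recall shown in the classification report come straight from the confusion matrix. A minimal sketch with made-up labels (1 = left, 0 = stayed) that computes both by hand for the positive class and compares them with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative): 6 employees stayed, 4 left
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of those predicted to leave, how many actually left
recall = tp / (tp + fn)     # of those who actually left, how many were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For attrition, recall on class 1 is often the figure of interest: it is the share of actual leavers the model manages to flag.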

## 4.2 Random Forest Classifier

Random Forest is also widely used in classification problems. As its name implies, it consists of a large number of individual decision trees that operate as an ensemble.
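
In scikit-learn, "operating as an ensemble" means that the forest averages the class probabilities predicted by its individual trees and picks the class with the highest average. The sketch below reproduces that on synthetic data; 25 trees and the seed 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Class probabilities from each tree, stacked as (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])
# The ensemble output is the average over trees
mean_proba = tree_probas.mean(axis=0)

print(np.allclose(mean_proba, forest.predict_proba(X_toy)))
```

Averaging many trees trained on different bootstrap samples reduces the variance of a single, fully grown decision tree.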

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

In [None]:
# Training the model

model.fit(X_train, y_train)

In [None]:
# Making predictions and visualizing the accuracy

RFC_pred = model.predict(X_test)
print("Accuracy: {}%".format( 100 * accuracy_score(y_test, RFC_pred)))  # signature is (y_true, y_pred)

In [None]:
# Testing Set Performance

cm = confusion_matrix(y_test, RFC_pred)  # (y_true, y_pred): rows are actual classes, columns are predictions
sns.heatmap(cm, annot=True, fmt='d');

In [None]:
# Analyzing the KPI (Key Performance Indicator)

print(classification_report(y_test, RFC_pred))

## 4.3 K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model = KNeighborsClassifier()

In [None]:
model.fit(X_train, y_train)


In [None]:
KNNC_pred = model.predict(X_test)
print("Accuracy: {}%".format( 100 * accuracy_score(y_test, KNNC_pred)))  # signature is (y_true, y_pred)

In [None]:
# Testing Set Performance

cm = confusion_matrix(y_test, KNNC_pred)  # (y_true, y_pred): rows are actual classes, columns are predictions
sns.heatmap(cm, annot=True, fmt='d');

In [None]:
# Analyzing the KPI (Key Performance Indicator)

print(classification_report(y_test, KNNC_pred))

## 4.4 Artificial Neural Network Classifier

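In summary, a Neural Network consists of units (neurons), arranged in layers, which convert an input vector into some output. Here the input is the vector of 50 scaled features and the output is a single probability of attrition.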
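
To make the layer idea concrete, here is a forward pass written in plain NumPy: one hidden layer with ReLU followed by a sigmoid output unit. This is a sketch with random, untrained weights; the hidden size of 8 is an arbitrary toy value, much smaller than the 500 units used in the Keras model below.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and zeroes out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_features, n_hidden = 50, 8  # 50 inputs as in this notebook; 8 is a toy hidden size

# Random (untrained) weights and zero biases for the two layers
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

# Four fake employees whose features are already scaled to [0, 1)
x_batch = rng.random((4, n_features))

hidden = relu(x_batch @ W1 + b1)    # hidden layer activations, shape (4, 8)
proba = sigmoid(hidden @ W2 + b2)   # one output probability per employee, shape (4, 1)
print(proba.shape)
```

Training adjusts `W1`, `b1`, `W2` and `b2` so that these probabilities match the observed labels; Keras handles that step through `compile` and `fit`.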

In [None]:
import tensorflow as tf

In [None]:
# Creating the layers
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=500, activation='relu', input_shape=(50, )))
model.add(tf.keras.layers.Dense(units=500, activation='relu'))
model.add(tf.keras.layers.Dense(units=500, activation='relu'))
model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

In [None]:
model.summary()

In [None]:
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics = ['accuracy'])

In [None]:
# Training the model

epochs_hist = model.fit(X_train, y_train, epochs = 100, batch_size = 50)

In [None]:
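# model.predict returns probabilities with shape (n, 1); threshold at 0.5 and flatten to 1-D class labels
ANNC_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print("Accuracy: {}%".format( 100 * accuracy_score(y_test, ANNC_pred)))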

In [None]:
epochs_hist.history.keys()

In [None]:
plt.plot(epochs_hist.history['loss'])
plt.title('Model Loss Progress During Training')
plt.xlabel('Epoch')
plt.ylabel('Training Loss')
plt.legend(['Training Loss']);

In [None]:
plt.plot(epochs_hist.history['accuracy'])
plt.title('Model Accuracy Progress During Training')
plt.xlabel('Epoch')
plt.ylabel('Training Accuracy')
plt.legend(['Training Accuracy']);

In [None]:
# Testing Set Performance
cm = confusion_matrix(y_test, ANNC_pred)
sns.heatmap(cm, annot=True);

In [None]:
print(classification_report(y_test, ANNC_pred))

## 5. Model evaluation

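After testing the four models, the Logistic Regression Classifier achieved the highest test accuracy, 91.85% in the run reported here. Because `train_test_split` is called without a fixed `random_state`, the exact figures change from run to run.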
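
Accuracy should be read against the class imbalance noted in Section 1: a model that always predicts "stayed" is already right for every employee who stayed. The sketch below measures such a baseline with `DummyClassifier` on toy labels; the 84/16 split is an illustrative assumption, not a figure taken from the notebook outputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with an assumed 84/16 imbalance (0 = stayed, 1 = left)
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.zeros((100, 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, baseline_pred)
baseline_recall = recall_score(y_toy, baseline_pred)
print(f"accuracy={baseline_accuracy:.2f}, recall on leavers={baseline_recall:.2f}")
```

A useful model has to beat this baseline accuracy while also achieving a meaningful recall on class 1, which the classification reports above make visible.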

In [None]:
# Showing the results

# accuracy_score expects (y_true, y_pred)
print("Logistic Regression Classifier: {:.2f}% Accuracy".format( 100 * accuracy_score(y_test, LRC_pred)))
print("Random Forest Classifier: {:.2f}% Accuracy".format( 100 * accuracy_score(y_test, RFC_pred)))
print("K-Nearest Neighbors Classifier: {:.2f}% Accuracy".format( 100 * accuracy_score(y_test, KNNC_pred)))
print("Artificial Neural Network Classifier: {:.2f}% Accuracy".format( 100 * accuracy_score(y_test, ANNC_pred)))