# IBM Employee Attrition Analysis and Prediction

In collaboration with [Rohit Sahoo](https://www.kaggle.com/rohitsahoo)

![](https://www.clearpeaks.com/wp-content/uploads/2019/05/Advanced-analytics-Employee-Attrition-1200-630.jpg)

# Introduction
Attrition is a problem that affects all businesses, irrespective of geography, industry and company size. Employee attrition leads to significant costs for a business, including the cost of business disruption and of hiring and training new staff. As such, there is great business interest in understanding what drives staff attrition and in minimizing it. Let us therefore turn to predictive modelling and see if we can predict employee attrition on this IBM dataset.

This notebook is structured as follows:

1. **Exploratory Data Analysis:** We explore the dataset by looking at the feature distributions and at how correlated the features are with one another, using Seaborn visualisations
2. **Feature Engineering and Categorical Encoding:** We encode all our categorical features into dummy variables
3. **Implementing Machine Learning models:** We fit several baseline classifiers (Random Forest, Logistic Regression, Decision Tree, k-Nearest Neighbours and SVM), then retrain a Random Forest on SMOTE-oversampled training data

Let's Go.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Exploratory Data Analysis

In [None]:
attrition = pd.read_csv("../input/employee/train.csv")

In [None]:
attrition.head() #Top 5 Records

**Data quality checks**

To look for any null values, we can call **isnull** on the dataframe and check each column with **any** as follows:

In [None]:
attrition.isnull().any()
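`isnull().any()` only says whether a column contains a missing value, not how many. As a minimal sketch, assuming a small made-up dataframe (`demo` is hypothetical and not part of the attrition data), the per-column counts can be obtained by summing the boolean mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
demo = pd.DataFrame({
    "Age": [41, np.nan, 37],
    "Department": ["Sales", "R&D", None],
})

# isnull() gives a boolean mask; summing it counts missing values per column
null_counts = demo.isnull().sum()

# Show only the columns that actually contain missing values
print(null_counts[null_counts > 0])
```

The same two lines should work on `attrition` directly, which makes it easier to decide whether a column needs imputing or dropping.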

In [None]:
attrition.dtypes

**For further analysis, we can separate the numerical and categorical columns**

In [None]:
categorical = attrition.select_dtypes(include = 'object')
print(categorical.columns)

In [None]:
numerical = attrition.select_dtypes(include=['float64','int64'])

In [None]:
print(numerical.columns)

### Distribution of the dataset

Generally, one of the first steps in exploring the data is to get a rough idea of how each feature is distributed. To do so, I shall invoke the familiar **kdeplot** function from the Seaborn plotting library, which generates a univariate density plot as follows:

In [None]:
sns.kdeplot(attrition['Age'])
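To make concrete what a kernel density estimate is, here is a rough NumPy-only sketch of a one-dimensional Gaussian KDE with a Scott's-rule bandwidth. The function name `gaussian_kde_sketch` and the synthetic `ages` sample are assumptions for illustration; Seaborn's own estimator may differ in its details.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid):
    """Evaluate a 1-D Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule of thumb for the bandwidth in one dimension
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to one
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=500)  # synthetic stand-in for the Age column
grid = np.linspace(ages.min() - 30, ages.max() + 30, 400)
density = gaussian_kde_sketch(ages, grid)

# A density should integrate to roughly one over a wide enough grid
area = float((density * (grid[1] - grid[0])).sum())
print(round(area, 3))
```

Each observation contributes one small bell curve, and the plotted line is their average, which is why the curve is smooth even where the histogram is jagged.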

In [None]:
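# distplot is deprecated in recent Seaborn releases; histplot with kde=True is its replacement
sns.histplot(attrition['Age'], kde=True, stat='density')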

In [None]:
# Columns to plot; YearsAtCompany and YearsSinceLastPromotion were listed twice, so the duplicates are dropped
columns = ['TotalWorkingYears', 'MonthlyIncome', 'YearsAtCompany', 'DistanceFromHome',
           'YearsWithCurrManager', 'YearsSinceLastPromotion', 'PercentSalaryHike',
           'TrainingTimesLastYear']
fig, axes = plt.subplots(4, 2, figsize=(9, 9))
# One histogram with a KDE overlay per column (histplot replaces the deprecated distplot)
for col, ax in zip(columns, axes.ravel()):
    sns.histplot(attrition[col], kde=True, stat='density', ax=ax)
plt.tight_layout()
plt.show()

In [None]:
# factorplot was renamed to catplot, and its size argument to height
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'BusinessTravel')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'Department')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'EducationField')

In [None]:
# Bucket ages into groups; pd.cut intervals are right-closed by default, so age 18 falls in 'Student'
bins = [0, 18, 35, 60, np.inf]
labels = ['Student', 'Freshers/junior', 'Senior', 'Retired']
attrition['AgeGroup'] = pd.cut(attrition["Age"], bins, labels = labels)
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'AgeGroup')
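The way `pd.cut` treats the bin edges is easy to get wrong, so here is a small sketch using the same bins and labels on a handful of made-up ages (the `ages` series is hypothetical). By default each interval includes its right edge and excludes its left edge.

```python
import numpy as np
import pandas as pd

bins = [0, 18, 35, 60, np.inf]
labels = ['Student', 'Freshers/junior', 'Senior', 'Retired']

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([17, 18, 19, 35, 36, 60, 61])
groups = pd.cut(ages, bins, labels=labels)

# Pair every age with the group it was assigned to
print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

With these defaults an employee aged exactly 18 is labelled `Student` and one aged exactly 60 is labelled `Senior`. Passing `right=False` to `pd.cut` would shift the edge values into the next group instead.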

In [None]:
# factorplot was renamed to catplot, and its size argument to height
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'Gender')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'JobRole')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'Over18')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'OverTime')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'MaritalStatus')

In [None]:
sns.catplot(data = attrition, kind = 'count', aspect = 3, height = 5, x = 'Attrition')

### Correlation of Features

The next tool in a data explorer's arsenal is the correlation matrix. Plotting a correlation matrix gives a useful overview of how the features are related to one another. For a Pandas dataframe, we can conveniently call **.corr**, which by default provides the pairwise Pearson correlation of the columns in that dataframe.

In this correlation plot, I will use Seaborn's **heatmap** function to draw the lower triangle of the Pearson correlation matrix as follows:

In [None]:
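# Pearson correlation is only defined for numeric columns, so exclude object/categorical ones such as AgeGroup
cor_mat = attrition.select_dtypes(include = 'number').corr()
# Mask everything above the diagonal so that each pair of features is shown once
mask = np.triu(np.ones_like(cor_mat, dtype = bool), k = 1)
fig = plt.gcf()
fig.set_size_inches(60,12)
sns.heatmap(data = cor_mat, mask = mask, square = True, annot = True, cbar = True)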
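A large heatmap can be hard to read, so it may help to list the most strongly correlated pairs as well. The following is a sketch on synthetic data (the columns `a`, `b` and `c` are made up, with `b` constructed to track `a` closely); the same steps should carry over to the correlation matrix computed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # nearly a linear function of a
    "c": rng.normal(size=200),                     # independent noise
})

corr = demo.corr()  # pairwise Pearson correlation

# Keep only the entries strictly above the diagonal so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Pairs near the top of such a list are candidates for redundancy, for example tenure-style features that largely move together.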

In [None]:
attrition.columns

In [None]:
continuous = ['Age', 'DailyRate', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'TotalWorkingYears', 'YearsAtCompany']

In [None]:
for var in continuous:
    # Boxplot (left panel) to highlight outliers
    plt.figure(figsize = (10,5))
    plt.subplot(1,2,1)
    ax = attrition.boxplot(column = var)
    ax.set_ylabel(var)

    # Histogram (right panel) to show the shape of the distribution
    plt.subplot(1,2,2)
    ax = attrition[var].hist(bins = 20)
    ax.set_ylabel('No. of Employees')
    ax.set_xlabel(var)

    plt.show()

In [None]:
attrition['TotalWorkingYears'].describe()

In [None]:
categorical.head()

In [None]:
attrition_cat = pd.get_dummies(categorical)

In [None]:
attrition_cat.head()

In [None]:
numerical.head()

In [None]:
attrition_final = pd.concat([numerical,attrition_cat], axis=1)

In [None]:
attrition_final.head()

In [None]:
attrition_final = attrition_final.drop('Attrition', axis = 1)

In [None]:
attrition_final

In [None]:
target = attrition['Attrition']
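The cells above rely on `Attrition` being a numeric column, so that it ends up in `numerical` and can be dropped by name. If it were stored as text (for example `Yes`/`No`), `get_dummies` would turn it into `Attrition_*` columns and the target would leak into the features. A more defensive sketch, assuming a made-up toy frame (`demo`, `target_demo` and `features_demo` are hypothetical names), separates the target before encoding:

```python
import pandas as pd

# Hypothetical toy data where the target is stored as text
demo = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Set the target aside first, then one-hot encode only the remaining columns
target_demo = demo["Attrition"]
features_demo = pd.get_dummies(demo.drop(columns="Attrition"))

# No encoded column should be derived from the target
leaked = [col for col in features_demo.columns if col.startswith("Attrition")]
print(list(features_demo.columns), leaked)
```

This ordering behaves the same whether the target is numeric or text, so it does not depend on how the CSV happens to encode it.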

### Build Baseline Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Hold out 20% of the rows for testing (train_test_split is already imported above)
x_train, x_test, y_train, y_test = train_test_split(attrition_final, target, test_size = 0.2, random_state = 0)
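Attrition datasets are usually imbalanced, and a plain random split can leave the train and test sets with slightly different shares of leavers. One option, sketched here on synthetic labels (the 16% positive rate is an assumption for illustration, not a figure from this dataset), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 160 positives out of 1000 rows (an assumed rate)
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Whether this changes the results here would need checking on the real data; it mainly makes the test metrics less sensitive to the particular random seed.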

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
model = RandomForestClassifier()
model.fit(x_train,y_train)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))

In [None]:
model = LogisticRegression()
model.fit(x_train,y_train)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))

In [None]:
model = DecisionTreeClassifier()
model.fit(x_train,y_train)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))

In [None]:
model = KNeighborsClassifier()
model.fit(x_train,y_train)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))

In [None]:
model = SVC()
model.fit(x_train,y_train)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))
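The five cells above repeat the same fit, predict and score steps, and the distance-based and linear models (KNN, SVC, Logistic Regression) are fitted on unscaled features. As a sketch under stated assumptions, the loop below runs on a synthetic dataset from `make_classification` rather than the attrition data, and wraps each estimator in a pipeline with `StandardScaler`. The dictionary names and `max_iter=1000` are illustrative choices, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the attrition features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
}

scores = {}
for name, estimator in candidates.items():
    # The scaler is fitted on the training split only, inside the pipeline
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, pipeline.predict(X_te))

print(scores)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training. Tree-based models such as Random Forest are insensitive to feature scale, so they are left out of this sketch.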

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
oversampler = SMOTE(random_state = 12, sampling_strategy = 1.0)
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
smote_train, smote_target = oversampler.fit_resample(x_train,y_train)

In [None]:
smote_train.shape
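For intuition about what SMOTE adds to the training set, here is a simplified NumPy sketch of its core step: each synthetic row is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` is a hypothetical illustration and not the imbalanced-learn implementation, which handles more cases.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every sample
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate a random fraction of the way from the base to the neighbour
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])

rng = np.random.default_rng(1)
minority_demo = rng.normal(size=(20, 2))  # made-up minority-class points
synthetic = smote_sketch(minority_demo, n_new=30)
print(synthetic.shape)
```

Because every synthetic row is a blend of two existing minority rows, oversampling is applied to the training split only, as in the cell above, and the test set keeps its original class balance.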

In [None]:
model = RandomForestClassifier()
model.fit(smote_train,smote_target)
model_predictions = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test, model_predictions))
print(classification_report(y_test, model_predictions))
